Validation data fine, but all Test data interpreted as the same class (binary). #2

Closed
ajthinking opened this issue Apr 5, 2019 · 2 comments

ajthinking commented Apr 5, 2019

After validation, the distribution of predicted classes [0, 1] is similar to that of the training data.
image
But on the test data, ALL rows are predicted as class [0].
image
How can this be?
Full notebook: https://github.com/ajthinking/kaggle-santander/blob/master/tabular.ipynb

ajthinking changed the title from "Validation data fine, Test data all interpreted as the same class (binary)." to "Validation data fine, but all Test data interpreted as the same class (binary)." on Apr 5, 2019
@ajthinking

I noticed that the following code produces a more reasonable result:

for index, row in df_test.iterrows():
    print(learn.predict(row))
    if index > 1000:
        break

So this might be incorrect usage of get_preds?

@ajthinking

Index 1 of the get_preds result is actually the labels. The test data has no labels, so they default to zero. Instead, take index 0 (the probabilities) and convert them to classes.

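# Assumes the notebook's fastai v1 names (learn, DatasetType) and pandas as pd are already in scope.
# get_preds returns the probabilities at index 0 and the labels at index 1.
probs = learn.get_preds(ds_type=DatasetType.Test)[0]

def probs2class(item):
    # Predicted class = index of the largest probability in the row
    return max(range(len(item)), key=item.__getitem__)

# Build the submission frame from the predicted classes
test_df = pd.DataFrame({'ID_code': df_test['ID_code'], 'target': list(map(probs2class, probs))})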
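
To make the difference between the two indices concrete, here is a minimal sketch in plain Python. The `preds` pair is a hypothetical stand-in for what get_preds returns, with made-up probabilities and placeholder zero labels. It is not output from the actual notebook.

```python
# Hypothetical stand-in for the (probabilities, labels) pair from get_preds.
# The numbers are invented for illustration; the test set has no real labels,
# so the label slot only holds placeholder zeros.
probs = [
    [0.91, 0.09],
    [0.35, 0.65],
    [0.52, 0.48],
    [0.10, 0.90],
]
placeholder_labels = [0, 0, 0, 0]
preds = (probs, placeholder_labels)

def probs2class(row):
    """Return the index of the largest probability in one row."""
    return max(range(len(row)), key=row.__getitem__)

# Index 0: probabilities, converted to classes by taking the argmax per row.
predicted = [probs2class(row) for row in preds[0]]
print(predicted)   # [0, 1, 0, 1]

# Index 1: the placeholder labels, which is why every row looked like class 0.
print(preds[1])    # [0, 0, 0, 0]
```

If the probabilities are a PyTorch tensor, as they should be here, `probs.argmax(dim=1)` gives the same indices in a single call.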
