
[BUG]: NaN outputs for multiclass classification using MCC loss function #306

Closed
kevingreenman opened this issue Jul 1, 2022 · 4 comments · Fixed by #309
Labels
bug Something isn't working

Comments

@kevingreenman (Member) commented Jul 1, 2022

Describe the bug
For a dataset on which multiclass classification trains normally with the default `cross_entropy` loss function, training fails with the `mcc` loss function.

Example(s)
The script

python train.py --data_path debug.csv --dataset_type multiclass --save_dir debug-results --multiclass_num_classes 3

runs without error while the script

python train.py --data_path debug.csv --dataset_type multiclass --save_dir debug-results --multiclass_num_classes 3 --loss_function mcc

encounters `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').` here. The NaN values first appear during the encoding step here. The weights of `self.W_i` start out normal but turn into NaNs, and I haven't yet been able to trace why that's happening.
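As a generic PyTorch debugging sketch (my own, not chemprop code; `check_weights` is a hypothetical helper), anomaly detection plus a per-step parameter scan can pinpoint where the NaNs enter:

```python
import torch

# Make backward() raise at the exact op that first produces a NaN gradient.
# This is slow, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

def check_weights(model: torch.nn.Module) -> None:
    """Report any parameter containing NaN or Inf; call after each optimizer step."""
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            print(f"non-finite values in {name}")
```

Calling `check_weights` after each `optimizer.step()` should narrow down the first update in which `self.W_i` goes non-finite.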

The contents of debug.csv are as follows:

isomeric_smiles,PUBCHEM_ACTIVITY_OUTCOME_INT
C1CCCN(CC1)C(=O)C2=CSC3=CC=CC=C32,1
C1=CC(=C(C=C1[N+](=O)[O-])[Hg])O.C1=NC2=C(N1[C@H]3[C@H]([C@@H]([C@H](O3)CO)O)O)N=C(NC2=[SH+])N,1
C1CCN(CC1)C(=O)C2=CC=C(C=C2)COC3=CC=CC=C3Br,1
C1CCN(CC1)C(=O)CCC2=CC=C(C=C2)OC3=CC(=CC(=C3)[N+](=O)[O-])[N+](=O)[O-],1
C1CN(CC(=O)O[Hg]OC(=O)CN1CC(=O)O)CC(=O)O,1
C1=CC(=CC=C1N)[Hg]S[C-]2C3=C(NC(=N2)N)N(C=N3)[C@H]4[C@H]([C@@H]([C@H](O4)CO)O)O,1
CC1=CC(=CC=C1)N2C=NC3=C2C=CC(=C3)C(=O)N4CCCCC4,1
C1CCC(C1)NP2(=O)COC3=CC=CC=C3OC2,1
CC1CCCN(C1)C(=O)C2=CC=C(C=C2)COC3=CC=CC=C3Br,1
C1CCN(CC1)C(=O)C2=CC3=C(C=C2)OCO3,1
@kevingreenman added the `bug` label on Jul 1, 2022
@kevingreenman (Member Author) commented:

I just realized that I only have one class represented in my debugging file. If I change `debug.csv` to

isomeric_smiles,PUBCHEM_ACTIVITY_OUTCOME_INT
C1CCCN(CC1)C(=O)C2=CSC3=CC=CC=C32,2
C1=CC(=C(C=C1[N+](=O)[O-])[Hg])O.C1=NC2=C(N1[C@H]3[C@H]([C@@H]([C@H](O3)CO)O)O)N=C(NC2=[SH+])N,0
C1CCN(CC1)C(=O)C2=CC=C(C=C2)COC3=CC=CC=C3Br,1
C1CCN(CC1)C(=O)CCC2=CC=C(C=C2)OC3=CC(=CC(=C3)[N+](=O)[O-])[N+](=O)[O-],1
C1CN(CC(=O)O[Hg]OC(=O)CN1CC(=O)O)CC(=O)O,1
C1=CC(=CC=C1N)[Hg]S[C-]2C3=C(NC(=N2)N)N(C=N3)[C@H]4[C@H]([C@@H]([C@H](O4)CO)O)O,1
CC1=CC(=CC=C1)N2C=NC3=C2C=CC(=C3)C(=O)N4CCCCC4,1
C1CCC(C1)NP2(=O)COC3=CC=CC=C3OC2,1
CC1CCCN(C1)C(=O)C2=CC=C(C=C2)COC3=CC=CC=C3Br,2
C1CCN(CC1)C(=O)C2=CC3=C(C=C2)OCO3,0

the error no longer occurs, so the failure is related to class representation. But this issue also happened on my full set of 130K molecules, which has all 3 classes represented, so the cause must be more nuanced than that.

@kevingreenman (Member Author) commented:

I confirmed that in my full dataset, the train, val, and test splits are identical for the model trained with cross_entropy loss and the model trained with mcc loss, and each split has all 3 classes represented.

@kevingreenman (Member Author) commented:

The `mcc_multiclass_loss` function is returning Inf; I'm investigating why.

@kevingreenman (Member Author) commented:

Based on the multiclass definition of MCC from sklearn, if all of the true values OR all of the predicted values belong to the same class (i.e., one class's count equals the total number of samples), the denominator of the MCC is 0 and the MCC becomes Inf. So if any batch has all of its predicted values or all of its true values in a single class, the MCC loss will be Inf, backprop will turn the weights into NaNs, and the predictions will then be NaN as well. `mcc_multiclass_loss` should raise an informative error message in this case. This could be a common problem for people training on imbalanced datasets. Is there a way to modify our data loader to optionally do stratified sampling?
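To make the failure mode concrete, here is a minimal sketch (my own illustration of sklearn's confusion-matrix definition of multiclass MCC; `mcc_denominator` is a hypothetical helper, not chemprop code) showing the denominator vanishing when every true label in a batch belongs to one class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def mcc_denominator(y_true, y_pred, n_classes=3):
    # With confusion matrix C, t_k = true count of class k (row sums),
    # p_k = predicted count of class k (column sums), s = total samples:
    # MCC = (c*s - sum_k t_k*p_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    C = confusion_matrix(y_true, y_pred, labels=range(n_classes))
    t = C.sum(axis=1)
    p = C.sum(axis=0)
    s = C.sum()
    return np.sqrt(float((s**2 - (p**2).sum()) * (s**2 - (t**2).sum())))

# All true labels in class 1: t = [0, 4, 0], so s^2 - sum(t^2) = 0.
print(mcc_denominator([1, 1, 1, 1], [0, 1, 2, 1]))  # 0.0 -> MCC divides by zero
# All three classes represented: denominator is nonzero.
print(mcc_denominator([0, 1, 2, 1], [0, 1, 2, 2]))  # 10.0
```

For the stratified-sampling question, PyTorch's `torch.utils.data.WeightedRandomSampler` is one standard way to keep minority classes represented in every batch, though wiring it into our data loader would be its own change.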
