This repository has been archived by the owner on Jul 14, 2023. It is now read-only.

lower AUROCs than the author got #6

Open
yoheimatt opened this issue Sep 25, 2018 · 3 comments

@yoheimatt commented Sep 25, 2018

Thank you for sharing the code with the community.
I ran the Keras version of the code.
I was unable to get AUROCs close to the ones you reported:

| # | Pathology | AUROC |
|---:|---|---:|
| 0 | Atelectasis | 0.689804 |
| 1 | Cardiomegaly | 0.699429 |
| 2 | Effusion | 0.769636 |
| 3 | Infiltration | 0.655084 |
| 4 | Mass | 0.601279 |
| 5 | Nodule | 0.571633 |
| 6 | Pneumonia | 0.634000 |
| 7 | Pneumothorax | 0.677171 |
| 8 | Consolidation | 0.725847 |
| 9 | Edema | 0.817075 |
| 10 | Emphysema | 0.603675 |
| 11 | Fibrosis | 0.660121 |
| 12 | Pleural_Thickening | 0.650140 |
| 13 | Hernia | 0.647572 |

How many epochs do I need to run?
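For reference, per-class AUROCs like the ones in the table above are typically computed with scikit-learn's `roc_auc_score` over the test-set predictions. A minimal sketch, assuming `y_true` and `y_pred` are `(n_samples, 14)` arrays of binary labels and sigmoid outputs; the helper and variable names are illustrative, not the repo's actual code:

```python
from sklearn.metrics import roc_auc_score

# Label order of the NIH ChestX-ray14 pathologies, matching the table above.
PATHOLOGIES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]

def per_class_auroc(y_true, y_pred):
    """Per-pathology AUROC; y_true/y_pred are (n_samples, 14) arrays."""
    scores = {}
    for i, name in enumerate(PATHOLOGIES):
        try:
            scores[name] = roc_auc_score(y_true[:, i], y_pred[:, i])
        except ValueError:
            # Raised when a class has only one label value in y_true,
            # e.g. zero positive cases for that pathology.
            scores[name] = float("nan")
    return scores
```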

@georgeAccnt-GH (Contributor) commented:

First of all, thank you for trying the code. Please feel free to report any pain points; I am sure there were a few. We are also working on a streamlined version that will drop the deprecated Workbench and leverage the much more useful, recently released AML SDK.

About the classification performance issue: you should try around 200 epochs; the value used in the repo (1) is just for demo purposes. How many epochs did you use? If you are using an Azure DLVM for training, you could scale up its size to reduce training time. I think on an NC12 (2 GPUs) it will take days, at about 20 to 30 minutes per epoch.
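Not the repo's exact code, but a minimal sketch of the kind of CheXNet-style Keras setup this implies on a 2-GPU NC12; `build_model`, `train_gen`, and `val_gen` are illustrative names, and `multi_gpu_model` assumes Keras ≥ 2.0.9 with two visible GPUs:

```python
from keras.applications.densenet import DenseNet121
from keras.layers import Dense
from keras.models import Model
from keras.utils import multi_gpu_model

def build_model(num_classes=14):
    """DenseNet-121 backbone with a 14-way sigmoid head, CheXNet-style."""
    base = DenseNet121(weights="imagenet", include_top=False, pooling="avg")
    head = Dense(num_classes, activation="sigmoid")(base.output)
    return Model(inputs=base.input, outputs=head)

model = build_model()
# Replicate the model across the NC12's 2 GPUs.
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer="adam", loss="binary_crossentropy")

# Train for ~200 epochs instead of the repo's demo value of 1, e.g.:
# parallel_model.fit_generator(train_gen, validation_data=val_gen, epochs=200)
```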

@yoheimatt (Author) commented:

Thank you for your quick reply. I did find a small potential issue in https://github.com/Azure/AzureChestXRay/blob/master/AzureChestXRay_AMLWB/Code/src/azure_chestxray_utils.py:
I think there is an underscore missing in 'Pleural Thickening'. Without it, the label never matches the dataset's spelling, so the processing creates zero positive Pleural Thickening cases.

I will follow your suggestion and run 200 epochs. To be honest, I ran out of patience and stopped training at the 50th epoch after I didn't see much improvement. And you are right, it takes less than 30 minutes per epoch.
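A quick sanity check for this kind of label bug is to count positives per spelling directly from the NIH metadata CSV. A minimal sketch, assuming the standard `Data_Entry_2017.csv` with its pipe-delimited 'Finding Labels' column (the script itself is illustrative, not part of the repo):

```python
import pandas as pd

# NIH ChestX-ray14 metadata; 'Finding Labels' holds pipe-delimited
# pathology names, e.g. "Effusion|Pleural_Thickening".
df = pd.read_csv("Data_Entry_2017.csv")

for label in ("Pleural Thickening", "Pleural_Thickening"):
    n_pos = df["Finding Labels"].str.contains(label, regex=False).sum()
    print(f"{label!r}: {n_pos} positive cases")
# The space-separated spelling matches nothing; only the
# underscored spelling finds the actual positives.
```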

@Stexan commented Jan 19, 2019

Hello @georgeAccnt-GH, and thank you very much for your implementation of the study! Our team has also tried to replicate your results, and while we got better results than the original poster, we still didn't reach your AUC (you report a mean of 0.84; we get a mean of 0.81).

What happens is that around epoch 30-35 the algorithm starts overfitting, so further training becomes useless as performance on the validation/test sets just drops. We followed exactly the same steps you implemented.

Do you think the data splits have an impact and the difference might come from there? Or is there anything else you did specifically to keep the network from overfitting so fast? (We also tried random crops along with the augmentation techniques used in your implementation, but that didn't help much either.)
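One stock mitigation when validation loss turns around epoch 30-35 is to checkpoint the best weights and decay the learning rate on plateau. A minimal sketch with standard Keras callbacks (the filename and generator names are illustrative; `restore_best_weights` needs Keras ≥ 2.2.3):

```python
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

callbacks = [
    # Keep the weights from the best validation epoch, not the last one.
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
    # Halve the learning rate once validation loss stops improving.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3, verbose=1),
    # Stop early after a long stretch without improvement.
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
]

# model.fit_generator(train_gen, validation_data=val_gen,
#                     epochs=200, callbacks=callbacks)
```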
