
[Image Classification] Low Accuracy on EuroSAT Dataset #386

Closed
luisquintanilla opened this issue Nov 27, 2019 · 17 comments
Assignees
Labels
Priority:0 Work that we can't release without
Milestone

Comments

@luisquintanilla
Contributor

See issue in ML.NET Repo for more details. dotnet/machinelearning#4504.

@luisquintanilla
Contributor Author

See explanation of how to achieve good performance on this dataset using the ML.NET Image Classification API. Still need to think about how to get similar performance on AutoML / Model Builder given the characteristics of the dataset.

dotnet/machinelearning#4504 (comment)

@luisquintanilla
Contributor Author

luisquintanilla commented Dec 2, 2019

Issue Summary

The EuroSAT paper introduces a geo-referenced aerial/satellite image dataset of 27,000 images categorized into 10 classes and reports 98.57% classification accuracy using CNNs. More specifically, using ResNet50 with a 90/10 train/test split, it achieves 96.37% accuracy. Training with the ML.NET Image Classification API as well as with Model Builder reaches 99%+ accuracy during training. However, when the model is evaluated, both with and without cross-validation, accuracy drops to between 61% and 69% using only the CPU and to 59% using the GPU. See the performance comparison in the table below.

| Method | Number of Images | Cross-Validation | Training Accuracy | Evaluation Accuracy |
|---|---|---|---|---|
| API (CPU) | 20,000 (18,000 train / 2,000 test) | No | 0.9946118 | 0.698 |
| Model Builder (CPU) | 27,000 | Yes | 0.9954983 | 0.6168 |
| Model Builder (GPU) | 27,000 | Yes | N/A | 0.5949 |
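For reference, the two kinds of evaluation numbers above correspond roughly to a hold-out evaluation versus a cross-validated one. The following is a minimal sketch rather than the exact code in the linked Program.cs; the `mlContext`, `pipeline`, `trainSet`, `testSet`, and `fullData` variables and the `LabelAsKey` column name are assumptions:

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hold-out evaluation (the "API (CPU)" row): fit on the train split,
// then score and evaluate the held-out test split.
ITransformer model = pipeline.Fit(trainSet);
IDataView predictions = model.Transform(testSet);
var metrics = mlContext.MulticlassClassification.Evaluate(
    predictions, labelColumnName: "LabelAsKey");
Console.WriteLine($"MicroAccuracy: {metrics.MicroAccuracy}, MacroAccuracy: {metrics.MacroAccuracy}");

// Cross-validated evaluation (the Model Builder rows): k folds over the full dataset.
var cvResults = mlContext.MulticlassClassification.CrossValidate(
    fullData, pipeline, numberOfFolds: 5, labelColumnName: "LabelAsKey");
Console.WriteLine($"Average MicroAccuracy: {cvResults.Average(r => r.Metrics.MicroAccuracy)}");
```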

Dataset

Dataset download link

Below is the file I used to read the file paths and labels. To test, change the parent directory C:\Users\luquinta.REDMOND\Datasets\EuroSAT to wherever you've saved the labelled subdirectories. Also, the attached file has a .txt extension while the code expects .tsv, so rename it accordingly when setting the value of TRAIN_DATA_FILEPATH at the top of Program.cs.

traindata.txt
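As a side note, here is a minimal sketch of how such a paths-and-labels file can be generated from the labelled subdirectories (the root path is the one mentioned above; the output file name and the *.jpg pattern are assumptions):

```csharp
using System.IO;
using System.Linq;

// Walk each class subdirectory under the EuroSAT root and write one
// "imagePath<TAB>label" line per image, using the folder name as the label.
string root = @"C:\Users\luquinta.REDMOND\Datasets\EuroSAT"; // change to your location
var lines = Directory.GetDirectories(root)
    .SelectMany(dir => Directory.GetFiles(dir, "*.jpg")
        .Select(file => $"{file}\t{Path.GetFileName(dir)}"));
File.WriteAllLines("traindata.tsv", lines);
```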

Source code / logs

The source code is at the following repo: https://github.com/luisquintanilla/EuroSATTrainSample/blob/master/EuroSATTrainSample/Program.cs

Output logs:

ImageClassificationTrainResultsModelBuilder.txt
ImageClassificationTrainResultsAPI.txt

Potential Solutions

Disabling early stopping and using the DNNImageFeaturizer appear to yield the most impactful results (accuracy of 93%+); a code sketch follows the list below.

1. You were using early stopping (and they were not); please use the default 200 epochs and turn off early stopping. Doing this alone will get your accuracy to between 93% and 94%. Early stopping works great when you supply a validation set, but you were not doing that, so it falls back to using the train set as the validation set. While this is not ideal, it seems to work in practice for some datasets we have tested on, but definitely not all.
2. As a comparison point, the DNNImageFeaturizer approach gets a bit over 94% accuracy, taking 18 minutes on CPU for its first model (then it continues sweeping).
3. You were not correctly splitting your dataset into a 90:10 train:test split (please see the attached Program.txt, which contains your code amended to split the dataset correctly).
4. You were not shuffling the dataset prior to training as the paper does; the shuffle transform does not shuffle at the level of individual data points but in blocks.
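Putting items 1, 3, and 4 together, here is a minimal sketch of the API-side fix. The ImageClassificationTrainer.Options property names are quoted from memory of the 1.4/1.5-preview API and should be verified against the installed version; the `images` enumerable, the `imageFolder` variable, and the column names are assumptions:

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Vision;

var mlContext = new MLContext(seed: 1);

// Item 4: shuffle individual examples up front (the built-in shuffle transform
// shuffles in blocks, not at the level of single rows).
var rng = new Random(1);
IDataView fullData = mlContext.Data.LoadFromEnumerable(
    images.OrderBy(_ => rng.Next()).ToList());

// Item 3: 90/10 train/test split.
var split = mlContext.Data.TrainTestSplit(fullData, testFraction: 0.1);

// Item 1: run the default 200 epochs with early stopping turned off.
var options = new ImageClassificationTrainer.Options
{
    FeatureColumnName = "Image",
    LabelColumnName = "LabelAsKey",
    Arch = ImageClassificationTrainer.Architecture.ResnetV250,
    Epoch = 200,
    EarlyStoppingCriteria = null // null disables early stopping
};

var pipeline = mlContext.Transforms.Conversion.MapValueToKey("LabelAsKey", "Label")
    .Append(mlContext.Transforms.LoadRawImageBytes("Image", imageFolder, "ImagePath"))
    .Append(mlContext.MulticlassClassification.Trainers.ImageClassification(options))
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

ITransformer model = pipeline.Fit(split.TrainSet);
```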

@luisquintanilla
Contributor Author

@JakeRadMSFT See summary of issue above

@JakeRadMSFT
Contributor

JakeRadMSFT commented Dec 2, 2019

@codemzs I see you closed this issue on the ML.NET side. That's fine but we need some help with next steps.

This is currently blocking our documentation folks from using this dataset in the documentation as they had planned. It doesn't seem like they should have to hand-pick a dataset for the documentation, and customers will likely hit this with their own datasets too.

These are the options I can think of:

  • Update AutoML to not use early stopping
    • Negatives: adds time to training that already takes a while.
  • Update AutoML to try with and without early stopping
    • Negatives: also adds more time.
  • Have Model Builder pre-split and randomize the dataset
    • Negatives: none, but it sounds like this doesn't fully solve the problem. Also, it isn't compatible with streaming the dataset (but we don't do that yet).
  • Add the DNN Featurizer approach and try it first
    • Negatives: not a true DNN model?

Thoughts?

@JakeRadMSFT
Contributor

@justinormont Thoughts?

@codemzs
Member

codemzs commented Dec 2, 2019

@luisquintanilla Your "as a comparison point" statement seems a little misleading; the Image Classification algorithm actually gets ~97.18% accuracy (almost a point higher than the EuroSAT paper with ResNet50). Please refer to my logs: EuroSAT_90_10_split_200_epochs_shuffle.txt.

Also, the DNN Featurizer was using ResNet18, not ResNet50. I cannot stress enough the importance of keeping comparisons apples to apples: no matter how different DNN models (i.e. ResNet 18, 50, 101, etc.) fare against each other, when you compare them you need to make sure all other parameters are the same.

@JakeRadMSFT Let's set up a meeting and talk offline. We are also adding retraining of the DNN layers in the next release, which significantly boosts accuracy, but early stopping should be enabled with a validation set passed in; we can certainly modify the train-test split code to also give us a validation set so that early stopping works well. Without a validation set, early stopping uses the train set as the validation set, which misleads it into stopping early.
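For the API path, a minimal sketch of "also give us a validation set", continuing the sketch earlier in this thread (the ValidationSet option name is as I recall it from the 1.4/1.5-preview API, and the `split` and `preprocessing` variables are assumptions; the validation data needs the same Image/LabelAsKey columns as the train data):

```csharp
// Carve a validation set out of the training split so early stopping
// monitors held-out data instead of the train set itself.
var trainValSplit = mlContext.Data.TrainTestSplit(split.TrainSet, testFraction: 0.1);

// Run the validation rows through the same preprocessing (key mapping +
// raw image bytes) that the train set goes through.
IDataView validationSet = preprocessing.Fit(trainValSplit.TrainSet)
                                        .Transform(trainValSplit.TestSet);

var options = new ImageClassificationTrainer.Options
{
    FeatureColumnName = "Image",
    LabelColumnName = "LabelAsKey",
    // Early stopping stays enabled (the default) because a real validation set is supplied.
    ValidationSet = validationSet
};
```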

@JakeRadMSFT
Contributor

@codemzs I'd prefer to keep the conversation all in one place and I want to keep Luis in the loop.

What can we do? What do you recommend? I'd like to unblock documentation as soon as we can.

I'm not sure our users are too concerned with ResNet18 vs. ResNet50; they just want a model that performs well. It seems odd that a dataset with 27,000 images would perform so poorly. Should we turn off early stopping if it doesn't work consistently across datasets?

@codemzs
Member

codemzs commented Dec 3, 2019

I can recommend several remedies but would prefer an offline discussion for efficiency reasons. You may invite Luis to that meeting. Thanks!

@luisquintanilla
Contributor Author

The comment summarizing the issue and potential solutions is just a reference so the Model Builder team doesn't have to keep flipping back and forth between the original issue and this one, although they can refer to the original issue if they need more information. The table comparison is from when the original issue was posted, without taking early stopping into account, which seems to be where the performance improvements really come from. That original comparison was intended to see whether the issue was isolated to AutoML / Model Builder. As mentioned in the potential solutions, though, disabling early stopping in the API greatly improves performance.

While it's good that the Image Classification API can achieve comparable results when following the methodology described in the academic paper, it might be good to think about how similar performance can be achieved with AutoML and dependent tooling.

For documentation purposes, an "easy" solution would be to find another dataset or use case. However, that would not be beneficial, in the sense that we'd be working around the limitations rather than seeing how improvements can be made overall.

@JakeRadMSFT
Contributor

JakeRadMSFT commented Dec 3, 2019

@codemzs Okay, we can discuss in standup or after standup. I'll send a note to the standup chat.

I'm fine doing that, but it's actually less efficient. I may or may not be the developer working on the solution, and it's nice to have all the context here for the next developer who works on this.

@natke

natke commented Dec 3, 2019

We should also validate the performance with early stopping using a validation dataset.

If this gives good enough performance, then perhaps the solution would be Jake's third option, with early stopping remaining enabled.

@codemzs
Member

codemzs commented Dec 3, 2019

I have already done that. Will explain at scrum today.

@codemzs
Member

codemzs commented Dec 13, 2019

Hi Folks,

I have added functionality to the Image Classification API that automatically creates a validation set by taking 10% (modifiable) of the images from the test set when no validation set is provided and early stopping is enabled, and that also shuffles the images properly.

Below are logs from which you can see that, with early stopping, training stopped at 33 epochs, took ~9 minutes on GPU, and achieved 97.08% accuracy. Out of 27,000 images, 24,300 were used as the train set and the remainder as the test set. This change should be in the master branch by the end of this week, after which Model Builder just needs to update the NuGet version of its ML.NET dependencies.

CC: @harshithapv , @CESARDELATORRE , @JakeRadMSFT , @briacht , @natke , @luisquintanilla, @ashbhandare , @justinormont

Thanks,
Zeeshan Siddiqui

Phase: Training, Dataset used: Validation, Batch Processed Count: 243, Epoch: 31, Accuracy: 0.9502052
Phase: Training, Dataset used: Train, Batch Processed Count: 2187, Learning Rate: 0.003715743 Epoch: 32, Accuracy: 0.9697338, Cross-Entropy: 0.09318368
Phase: Training, Dataset used: Validation, Batch Processed Count: 243, Epoch: 32, Accuracy: 0.9497937
Phase: Training, Dataset used: Train, Batch Processed Count: 2187, Learning Rate: 0.003715743 Epoch: 33, Accuracy: 0.9701452, Cross-Entropy: 0.09265892
Phase: Training, Dataset used: Validation, Batch Processed Count: 243, Epoch: 33, Accuracy: 0.9497937
Saver not created because there are no variables in the graph to restore
2019-12-13 00:31:32.405873: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0001:00:00.0
2019-12-13 00:31:32.412895: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-12-13 00:31:32.420014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-12-13 00:31:32.423919: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-13 00:31:32.429937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-12-13 00:31:32.432693: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-12-13 00:31:32.437395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7466 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0001:00:00.0, compute capability: 5.2)
Restoring parameters from C:\Users\mladmin\repo\codemzs\machinelearning\bin\AnyCPU.Debug\Microsoft.ML.Samples.GPU\workspace\custom_retrained_model_based_on_resnet_v2_50_299.meta
Froze 2 variables.
Converted 2 variables to const ops.
2019-12-13 00:31:38.397836: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0001:00:00.0
2019-12-13 00:31:38.405122: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-12-13 00:31:38.413486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-12-13 00:31:38.417136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-13 00:31:38.428009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-12-13 00:31:38.432485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-12-13 00:31:38.441809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7466 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0001:00:00.0, compute capability: 5.2)
Finished training in 519412
Evaluating Model
Finished evaluation in 584099

Evaluation Metrics
Log Loss: 0.0901965446885085 | MacroAccuracy: 0.9708

@justinormont

@codemzs: Quite nice. What was your final MicroAccuracy?

@JakeRadMSFT JakeRadMSFT self-assigned this Jan 6, 2020
@JakeRadMSFT JakeRadMSFT added P1 Priority:0 Work that we can't release without labels Jan 7, 2020
@JakeRadMSFT JakeRadMSFT added this to the January 2020 milestone Jan 12, 2020
@JakeRadMSFT JakeRadMSFT removed their assignment Jan 13, 2020
@LittleLittleCloud
Contributor

@luisquintanilla Could you help test the same dataset on mlnet 0.15.0-preview to see if the accuracy improved? Thanks.

@luisquintanilla
Contributor Author

@LittleLittleCloud accuracy improved after using a version of Model Builder with ML.NET 1.5.0-preview

[screenshot of Model Builder evaluation results]

@codemzs
Member

codemzs commented Jan 14, 2020

Thanks @luisquintanilla for finding this issue, and thanks @LittleLittleCloud for integrating the latest NuGet with the fix. It seems even the training time has improved (12.43 minutes vs. 18 minutes for the DNNFeaturizer approach with just its first sweep), along with higher accuracy. Let's ship it!

@codemzs codemzs self-assigned this Jan 14, 2020
@codemzs codemzs closed this as completed Jan 14, 2020