
[Image Classification] Low Accuracy on EuroSAT Dataset #4504

luisquintanilla opened this issue Nov 27, 2019 · 6 comments



@luisquintanilla luisquintanilla commented Nov 27, 2019

System information

  • OS version/distro: Windows 10
  • .NET Version (e.g., `dotnet --info`): .NET Core 2.1
  • ML.NET Version: 1.4.0
  • Model Builder Version: 16.0.1911.1103


The EuroSAT paper introduces a geo-referenced aerial/satellite image dataset of 27,000 images categorized into 10 classes and reports 98.57% classification accuracy using CNNs; more specifically, using ResNet50 it achieves 96.37% accuracy with a 90/10 train/test split. Using the ML.NET Image Classification API as well as Model Builder, training accuracy reaches 99%+. However, when evaluating the model, both with and without cross-validation, accuracy drops to between 61-69% using only the CPU and 59% using the GPU. See the performance comparison in the table below.

| Method | Number of Images | Cross-Validation | Training Accuracy | Evaluation Accuracy |
| --- | --- | --- | --- | --- |
| API (CPU) | 20000 (18000 train, 2000 test) | No | 0.9946118 | 0.698 |
| Model Builder (CPU) | 27000 | Yes | 0.9954983 | 0.6168 |
| Model Builder (GPU) | 27000 | Yes | N/A | 0.5949 |

Source code / logs

The source code is at the following repo:

Dataset download link

Output logs:


@luisquintanilla luisquintanilla changed the title [Image Classification] Low Accuracy on EuroSat Dataset [Image Classification] Low Accuracy on EuroSAT Dataset Nov 27, 2019

@luisquintanilla luisquintanilla commented Nov 27, 2019

@CESARDELATORRE, can you please add the GPU log whenever you get a chance? Thanks.



Yeah, I tried that same image set with Model Builder and trained with the GPU.
It finished in a reasonably short time for such a volume of photos (764 secs, roughly 13 minutes); however, the final accuracy is pretty low: 59.49%.

I also attach the log info captured from the Visual Studio output, which you can download from here: !Ag33_uWyTcH5pO9VlFNTWQimJ2xBVQ?e=1yZQZ9

@codemzs codemzs closed this Nov 27, 2019
@codemzs codemzs reopened this Nov 27, 2019

@codemzs codemzs commented Nov 27, 2019

@luisquintanilla Please provide the repro code for the train/test split, since that is what is used in the paper.


@luisquintanilla luisquintanilla commented Nov 27, 2019

@codemzs here's the code doing train/test split.

Below is the file I used to read the file paths and labels. To test, change the parent directory `C:\Users\luquinta.REDMOND\Datasets\EuroSAT` to wherever you've saved the labelled subdirectories. Also, the extension on the attached file is `.txt`, while the extension in the code is `.tsv`, so make sure to change it accordingly when setting the value of `TRAIN_DATA_FILEPATH` at the top of `Program.cs`.
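For illustration only — the project itself is C#/ML.NET and the attached file is the real artifact — the idea behind that paths-and-labels file can be sketched in Python (the dataset root below is a placeholder): walk one subdirectory per class and emit tab-separated image-path/label rows.

```python
from pathlib import Path

def build_index(root):
    """Walk labelled subdirectories (one per class) and return
    tab-separated "image_path<TAB>label" rows, one per image."""
    rows = []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for image in sorted(class_dir.glob("*.jpg")):
            rows.append(f"{image}\t{class_dir.name}")
    return rows

# Usage (placeholder path): write the index out as a .tsv file.
# Path("EuroSAT.tsv").write_text("\n".join(build_index(r"C:\Datasets\EuroSAT")))
```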



@justinormont justinormont commented Nov 27, 2019

Minor: Cross-validation won't be used in Model Builder, as the dataset has 27,000 rows; standard train-validate mode will be used instead.

We should look into why the Image Classification API isn't hitting the expected numbers. Perhaps we can compare the training script from the paper to our implementation. As a comparison point, the DNNImageFeaturizer style gets a bit over 94% accuracy, taking 18 min on CPU for its first model (then it continues sweeping). Hence we can replicate similar (though somewhat lower) scores than the paper, implying the dataset is not at fault.

Here are the splits I was using:

  • EuroSAT.TRAIN.tsv -- AutoML will split this into train/validate splits; the default pseudo-random splits are fine for this dataset (running scores come from the validation split)
  • EuroSAT.TEST.tsv -- Final scores come from the test dataset when run from the generated code

@codemzs codemzs commented Nov 30, 2019

Hi @luisquintanilla ,

As I suspected, your comparison was not apples to apples with regard to the EuroSAT paper.

  1. You were using early stopping (and they were not). Please use the default 200 epochs and turn off early stopping; doing this alone will get your accuracy to 93-94%. Early stopping works great when you supply a validation set, but you were not doing that, so it falls back to using the training set as the validation set. While this is not ideal, it seems to work in practice for some of the datasets we have tested on, but definitely not all.
  2. You were not correctly splitting your dataset into a 90:10 train:test split (please see the attached Program.txt, which contains your code amended with code to split the dataset correctly).
  3. You were not shuffling the dataset prior to training as the paper does; the shuffle transform does not shuffle at the level of individual data points, it shuffles in blocks.
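The fixes in points 2 and 3 above can be sketched as follows — a minimal Python illustration of the idea (the actual fix lives in the C#/ML.NET code in the attached Program.txt, which this does not reproduce): shuffle every row individually, then hold out 10% as the test set.

```python
import random

def shuffle_and_split(rows, test_fraction=0.1, seed=42):
    """Per-row shuffle (unlike a block-level shuffle transform),
    followed by a 90/10 train/test split."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)           # shuffles individual rows
    split_at = len(rows) - int(len(rows) * test_fraction)
    return rows[:split_at], rows[split_at:]     # (train, test)
```

With the full 27,000-row EuroSAT index this yields 24,300 training rows and 2,700 test rows.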

If you do the above, you will get an accuracy of 97.18% (roughly one percentage point higher than the best accuracy reported in the EuroSAT paper using ResNet50); please see the attached log, EuroSAT_90_10_split_200_epochs_shuffle.txt.

I'm closing this issue because there is no issue with the Image Classification API; you were getting low accuracy because you were not using the API in a way that makes a fair comparison with the paper.


@codemzs codemzs closed this Nov 30, 2019