AssertionError when retraining s2099 #38

Open
qiminchen opened this issue Jan 8, 2021 · 19 comments

@qiminchen
Collaborator

qiminchen commented Jan 8, 2021

@beijbom Hi Oscar, when I tried to retrain the LR/MLP classifier using the features from the server (the ones you just exported to s3://spacer-test/coranet_1_release_debug_export1/s2099/), it raised an AssertionError. As for re-extracting the features using EfficientNetb0, I'm still working on it, as it will take ~15 hours on my laptop.

(pyspacer) Min:pyspacermaster qiminchen$ python scripts/regression/retrain_source.py train 2099 /Users/qiminchen/Downloads/pyspacer-test 10 coranet_1_release_debug_export1 LR
Downloading 11016 metadata and image/feature files...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 11016/11016 [00:00<00:00, 94177.94it/s]
Assembling data in /Users/qiminchen/Downloads/pyspacer-test/s2099/images...
Training classifier for source /Users/qiminchen/Downloads/pyspacer-test/s2099...
2021-01-08 11:44:47,468 Trainset: 3020, valset: 200 images
2021-01-08 11:44:47,469 Using 200 images per mini-batch and 16 mini-batches per epoch
2021-01-08 11:44:47,479 Trainset: 60, valset: 50, common: 50 labels
2021-01-08 11:44:47,479 Entering: loading of reference data
2021-01-08 11:44:47,615 Exiting: loading of reference data after 0.136114 seconds.
Traceback (most recent call last):
  File "scripts/regression/retrain_source.py", line 106, in <module>
    fire.Fire()
  File "/Users/qiminchen/opt/anaconda3/envs/pyspacer/lib/python3.7/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/Users/qiminchen/opt/anaconda3/envs/pyspacer/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/Users/qiminchen/opt/anaconda3/envs/pyspacer/lib/python3.7/site-packages/fire/core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "scripts/regression/retrain_source.py", line 69, in train
    do_training(source_root, train_labels, val_labels, n_epochs, clf_type)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/scripts/regression/utils.py", line 94, in do_training
    train_labels, val_labels, n_epochs, [], feature_loc, clf_type)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_classifier.py", line 50, in __call__
    clf, ref_accs = train(train_labels, feature_loc, nbr_epochs, clf_type)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_utils.py", line 62, in train
    refx, refy = load_batch_data(labels, ref_set, classes, feature_loc)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_utils.py", line 181, in load_batch_data
    x_, y_ = load_image_data(labels, imkey, classes, feature_loc)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_utils.py", line 145, in load_image_data
    assert rc_labels_set.issubset(rc_features_set)
AssertionError

It should NOT be a pyspacer issue, though, since I also tried retraining some sources from the spacer-trainingdata/beta_export bucket and they all worked fine, using either the features from the server or features re-extracted with EfficientNetb0.
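For context, the failing assertion (in load_image_data of spacer/train_utils.py) is a set-containment check between annotation locations and feature locations. A minimal sketch of the idea, not pyspacer's actual code and with made-up numbers:

# Every (row, col) that carries a label must also have a stored feature vector.
rc_labels_set = {(100, 250), (300, 400)}      # (row, col) pairs from the annotations
rc_features_set = {(99, 249), (299, 399)}     # (row, col) pairs stored with the features
assert rc_labels_set.issubset(rc_features_set)  # fails when the two sets are offset, e.g. by 1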

To reproduce this AssertionError:

  1. Clone the up-to-date pyspacer repo.
  2. Change spacer-trainingdata to spacer-test, since s2099 was exported to that bucket (see the sketch after this list): https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L24
  3. Run the command below to cache the features from the bucket and retrain the LR classifier:
    python scripts/regression/retrain_source.py train 2099 /path/to/local 10 coranet_1_release_debug_export1 LR
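For step 2, the edit is just the bucket name near the top of scripts/regression/utils.py; the constant name below is illustrative, not necessarily what the file uses:

# scripts/regression/utils.py (illustrative constant name)
# bucket = 'spacer-trainingdata'   # default bucket for the beta_export sources
bucket = 'spacer-test'             # s2099 was exported to this bucket instead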

Please let me know if you can reproduce the error.

@kriegman

kriegman commented Jan 8, 2021 via email

@qiminchen
Collaborator Author

@kriegman good question, I actually don't know the logic behind it.

does coralnet train and create your MLP classifiers? Or does it still use the older LR code classifier? Could that be the problem?
Could the old classifiers be lost because the retraining treated it as a "new classifier with new features" as if the source was toggled to new features?

For a source that already has a classifier, a new classifier is trained when more images are added, and if the accuracy of the newly trained classifier is higher than the old one's, the old one is replaced. But I'm not sure whether VGG16 or EfficientNet will be used; I guess this depends on the front-end setting.

Then here is a question: when more images are added to a source that already has a classifier, will it

  1. use VGG16 to extract the new features and retrain on the whole feature set, or
  2. use EfficientNetb0 and
    (1). retrain only on the newly extracted features (since the old features were extracted with VGG16 and the new ones with EfficientNetb0, they have different dimensions; see the sketch after this list), or
    (2). retrain the classifier on the whole feature set, which in this case would require re-extracting features for all the images?
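To illustrate why features from the two extractors cannot simply be mixed in one training set, here is a toy sketch; it is not CoralNet code, and the 4096/1280 dimensionalities are just typical values for VGG16 and EfficientNetb0 features, not confirmed in this thread:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit on 4096-d features (VGG16-like), then try to score 1280-d features (EfficientNetb0-like).
# Both dimensionalities are illustrative.
x_old = np.random.rand(20, 4096)
y_old = np.array([0, 1] * 10)
clf = LogisticRegression().fit(x_old, y_old)
clf.predict(np.random.rand(5, 1280))  # raises ValueError: the feature counts do not match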

@beijbom
Collaborator

beijbom commented Jan 9, 2021 via email

@beijbom
Collaborator

beijbom commented Jan 9, 2021 via email

@qiminchen
Collaborator Author

I suspect this line is the culprit:
https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L78
Can you remove that and try again?

@beijbom you're right, but instead of removing the line, I changed it to (ann['row'], ann['col'], ann['label']) for ann in anns, i.e. removed the -1 from both row and col, and guess what, it passed the assertion and I got the normal accuracy of around 75%, as the author claimed.
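For reference, the change at scripts/regression/utils.py line 78 looks roughly like this; the assignment target is illustrative, only the tuple expressions are quoted from the thread:

# Before: shifts rows/cols by one, which no longer matches the (row, col)
# keys stored with these exported features. ('data' is an illustrative name.)
# data = [(ann['row'] - 1, ann['col'] - 1, ann['label']) for ann in anns]

# After: keep rows/cols exactly as exported.
data = [(ann['row'], ann['col'], ann['label']) for ann in anns]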

(pyspacer) Min:pyspacermaster qiminchen$ python scripts/regression/retrain_source.py train 2099 /Users/qiminchen/Downloads/pyspacer-test 10 coranet_1_release_debug_export1 LR
Downloading 11016 metadata and image/feature files...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 11016/11016 [00:00<00:00, 96639.19it/s]
Assembling data in /Users/qiminchen/Downloads/pyspacer-test/s2099/images...
Training classifier for source /Users/qiminchen/Downloads/pyspacer-test/s2099...
2021-01-08 18:00:13,024 Trainset: 3020, valset: 200 images
2021-01-08 18:00:13,024 Using 200 images per mini-batch and 16 mini-batches per epoch
2021-01-08 18:00:13,032 Trainset: 60, valset: 48, common: 48 labels
2021-01-08 18:00:13,032 Entering: loading of reference data
2021-01-08 18:00:16,864 Exiting: loading of reference data after 3.831654 seconds.
2021-01-08 18:00:16,864 Entering: training using LR
2021-01-08 18:02:04,396 Epoch 0, acc: 0.7422
2021-01-08 18:03:47,539 Epoch 1, acc: 0.7532
2021-01-08 18:05:32,405 Epoch 2, acc: 0.7562
2021-01-08 18:07:15,441 Epoch 3, acc: 0.7582
2021-01-08 18:08:56,827 Epoch 4, acc: 0.761
2021-01-08 18:10:38,644 Epoch 5, acc: 0.7618
2021-01-08 18:12:20,371 Epoch 6, acc: 0.7624
2021-01-08 18:14:01,516 Epoch 7, acc: 0.7622
2021-01-08 18:15:42,928 Epoch 8, acc: 0.7626
2021-01-08 18:17:24,107 Epoch 9, acc: 0.763
2021-01-08 18:17:24,107 Exiting: training using LR after 1027.243072 seconds.
2021-01-08 18:17:24,107 Entering: calibration
2021-01-08 18:17:24,466 Exiting: calibration after 0.358726 seconds.
Re-trained BonaireCoralReefMonitoring_2020 (2099). Old acc: 45.9, new acc: 77.2

Oscar, can you remind me why the -1 is applied to both row and col here? Is it to be consistent with 0-indexing?

@StephenChan
Member

In case it helps, I took a pass through all the sources where a new classifier was trained since the rollout:

  • 39 (this was Oscar testing the new extractor)
  • 526, 1984, 2099, 2240, 2243, 2248 (only new classifiers present, <70% accuracy)
  • 2132, 2193, 2204, 2205, 2229, 2251, 2252 (only new classifiers present, >70% accuracy)
  • 1395, 1646, 1716, 1721, 1846, 2090, 2118, 2145, 2151, 2215, 2221, 2247 (old and new classifiers present, improved accuracy)
  • 1813, 2245 (old and new classifiers present, new did not improve accuracy enough to be accepted)

I got these source IDs with the following in manage.py shell:

import datetime
from django.utils import timezone
from vision_backend.models import Classifier
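# Distinct IDs of sources that got a new classifier after the rollout (2020-12-31 05:00 UTC)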
Classifier.objects.all().filter(create_date__gt=datetime.datetime(2020, 12, 31, 5, 0, tzinfo=timezone.utc)).values_list('source', flat=True).distinct()

@qiminchen
Collaborator Author

Thanks @StephenChan. So s2099 is weird: the author said the old classifier had 75% accuracy, but now it shows up as a source with only new classifiers.

@beijbom
Collaborator

beijbom commented Jan 9, 2021 via email

@StephenChan
Member

Is there any correlation between the ones with low accuracy and the ones where we fixed EXIF stuff?

I didn't take stats on EXIF orientations across all sources, only the ones that were annotated in certain months in 2020. That source list was published in this blog post. The only relevant source from there is source 1646, but it only had 2 images with non-default EXIF orientations, so that seems unlikely to make a big difference.

@beijbom
Collaborator

beijbom commented Jan 9, 2021 via email

@StephenChan
Member

StephenChan commented Jan 9, 2021

Hmm, for 2), if they were doing a lot of annotation work in this source recently, maybe they realized they needed to add a label or two, which would mean a labelset change, which in turn would trigger a classifier reset. We can ask if they had to do that.

@beijbom
Collaborator

beijbom commented Jan 9, 2021 via email

@qiminchen
Collaborator Author

For 1), you should be able to get the expected performance on your end as well:

  1. Clone this repo.
  2. Change (ann['row']-1, ann['col']-1, ann['label']) for ann in anns to (ann['row'], ann['col'], ann['label']) for ann in anns:
    https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L78
  3. Change spacer-trainingdata to spacer-test, since s2099 was exported to that bucket: https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L24
  4. Run the command below to cache the features from the bucket and retrain the LR classifier:
    python scripts/regression/retrain_source.py train 2099 /path/to/local 10 coranet_1_release_debug_export1 LR

@StephenChan
Member

I went ahead and inspected a DB backup from just before the rollout. Source 2099 did have 7 classifiers, the highest having 77% accuracy. The labelset had 68 labels, and now it has 69 labels. So they did change the labelset, and that must be why the classifiers got cleared. That seems to solve mystery number 2 then.

Let me know if you want any info from this DB backup which might help with figuring out the accuracy drop.

@beijbom
Collaborator

beijbom commented Jan 11, 2021

Regarding the first mystery, I tracked down some data, including the ID for the classifier (from the UI) and the batch job ID (by querying the server):
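# Django shell: find the batch job whose job_token references classifier 17524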

jobs = BatchJob.objects.filter()
for job in jobs:
    if '17524' in job.job_token:
        print(job)

classifier id: 17524
batch job id: 1787

I've uploaded the payloads for the training job and its results here: [link]. The job_msg is parsed by

https://github.com/beijbom/pyspacer/blob/master/spacer/mailman.py#L17

and defines the train job.

@qiminchen: can you dig in and 1) run a training locally based on exactly this job definition and see if you can replicate the low performance, and 2) if so, compare this job definition with the one you created when running the scripts?

(You are going to have to change the bucket_names and keys to the test bucket, e.g.:
"model_loc": {"storage_type": "s3", "key": "media/classifiers/17524.model", "bucket_name": "coralnet-production"})

I think it'd be nice to understand what happened. At the same time, I'm tempted to ask the user to switch to EfficientNet. I'm pretty sure that'd wipe out the issue, and he should be switching anyway.
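As for the bucket_name/key change mentioned above, here is a minimal sketch of that kind of edit, assuming the payload is saved locally as job_msg.json (a hypothetical file name) and that its S3 locations look like the model_loc example; the keys may also need to point at objects that actually exist in the test bucket:

import json

# Load a local copy of the exported job_msg payload (hypothetical file name).
with open('job_msg.json') as f:
    job_msg = json.load(f)

def retarget(obj, bucket='spacer-test'):
    # Recursively point every S3 storage location in the payload at the test bucket.
    if isinstance(obj, dict):
        if obj.get('storage_type') == 's3':
            obj['bucket_name'] = bucket
        for value in obj.values():
            retarget(value, bucket)
    elif isinstance(obj, list):
        for value in obj:
            retarget(value, bucket)

retarget(job_msg)

# Save the retargeted payload for the local training run.
with open('job_msg_local.json', 'w') as f:
    json.dump(job_msg, f, indent=2)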

@beijbom
Collaborator

beijbom commented Jan 11, 2021

@StephenChan @kriegman: Are you ok if we ask the user to switch to EfficientNet? We have already backed up all the (likely faulty) feature data, so we don't lose reproducibility. But this way the user is unblocked, and it's a double win since his backend will work even better than before.

@StephenChan
Member

That sounds reasonable to me.

@qiminchen
Collaborator Author

can you dig in and 1) run a training locally based on exactly this job definition and see if you can replicate the low performance 2) if you can, compare this job definition with what you created when running the scripts.

(You are going to have to change the bucket_names and keys to the test bucket. E.g:
"model_loc": {"storage_type": "s3", "key": "media/classifiers/17524.model", "bucket_name": "coralnet-production"})

@beijbom the aws_access_key_id and aws_secret_access_key you generated for me a while ago don't have permission to access the coralnet-production bucket; can you regenerate them? BTW, what is "key": "media/classifiers/17524.model" used for here if I run the training locally?

@beijbom
Collaborator

beijbom commented Jan 13, 2021 via email
