AssertionError when retraining s2099 #38

Open
qiminchen opened this issue Jan 8, 2021 · 19 comments

@qiminchen
Collaborator

qiminchen commented Jan 8, 2021

@beijbom Hi Oscar, when I tried to retrain the LR/MLP classifier using the features from the server (the ones you just exported to s3://spacer-test/coranet_1_release_debug_export1/s2099/), it raised an AssertionError. As for re-extracting the features using EfficientNetb0, I'm still working on it, as it will take ~15 hours on my laptop.

(pyspacer) Min:pyspacermaster qiminchen$ python scripts/regression/retrain_source.py train 2099 /Users/qiminchen/Downloads/pyspacer-test 10 coranet_1_release_debug_export1 LR
Downloading 11016 metadata and image/feature files...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 11016/11016 [00:00<00:00, 94177.94it/s]
Assembling data in /Users/qiminchen/Downloads/pyspacer-test/s2099/images...
Training classifier for source /Users/qiminchen/Downloads/pyspacer-test/s2099...
2021-01-08 11:44:47,468 Trainset: 3020, valset: 200 images
2021-01-08 11:44:47,469 Using 200 images per mini-batch and 16 mini-batches per epoch
2021-01-08 11:44:47,479 Trainset: 60, valset: 50, common: 50 labels
2021-01-08 11:44:47,479 Entering: loading of reference data
2021-01-08 11:44:47,615 Exiting: loading of reference data after 0.136114 seconds.
Traceback (most recent call last):
  File "scripts/regression/retrain_source.py", line 106, in <module>
    fire.Fire()
  File "/Users/qiminchen/opt/anaconda3/envs/pyspacer/lib/python3.7/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/Users/qiminchen/opt/anaconda3/envs/pyspacer/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/Users/qiminchen/opt/anaconda3/envs/pyspacer/lib/python3.7/site-packages/fire/core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "scripts/regression/retrain_source.py", line 69, in train
    do_training(source_root, train_labels, val_labels, n_epochs, clf_type)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/scripts/regression/utils.py", line 94, in do_training
    train_labels, val_labels, n_epochs, [], feature_loc, clf_type)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_classifier.py", line 50, in __call__
    clf, ref_accs = train(train_labels, feature_loc, nbr_epochs, clf_type)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_utils.py", line 62, in train
    refx, refy = load_batch_data(labels, ref_set, classes, feature_loc)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_utils.py", line 181, in load_batch_data
    x_, y_ = load_image_data(labels, imkey, classes, feature_loc)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_utils.py", line 145, in load_image_data
    assert rc_labels_set.issubset(rc_features_set)
AssertionError

It should NOT be a pyspacer issue, though, since I also tried retraining some sources from the spacer-trainingdata/beta_export bucket and they all worked fine, using either the features from the server or features re-extracted with EfficientNetb0.
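For context, the failing assertion (in load_image_data of spacer/train_utils.py) is a set-containment check between annotation locations and feature locations. A minimal sketch of the idea, not pyspacer's actual code and with made-up numbers:

# Every (row, col) that carries a label must also have a stored feature vector.
rc_labels_set = {(100, 250), (300, 400)}      # (row, col) pairs from the annotations
rc_features_set = {(99, 249), (299, 399)}     # (row, col) pairs stored with the features
assert rc_labels_set.issubset(rc_features_set)  # fails when the two sets are offset, e.g. by 1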

To reproduce this AssertionError:

  1. Clone the up-to-date pyspacer repo.
  2. Change spacer-trainingdata to spacer-test, since s2099 was exported to that bucket (see the sketch after this list): https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L24
  3. Run the command below to cache the features from the bucket and retrain the LR classifier:
    python scripts/regression/retrain_source.py train 2099 /path/to/local 10 coranet_1_release_debug_export1 LR
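For step 2, the edit is just the bucket name near the top of scripts/regression/utils.py; the constant name below is illustrative, not necessarily what the file uses:

# scripts/regression/utils.py (illustrative constant name)
# bucket = 'spacer-trainingdata'   # default bucket for the beta_export sources
bucket = 'spacer-test'             # s2099 was exported to this bucket instead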

Please let me know if you can reproduce the error.

@kriegman

kriegman commented Jan 8, 2021 via email

@qiminchen
Collaborator Author

@kriegman good question, I actually don't know the logic behind it.

does coralnet train and create your MLP classifiers? Or does it still use the older LR code classifier? Could that be the problem?
Could the old classifiers be lost because the retraining treated it as a "new classifier with new features" as if the source was toggled to new features?

For a source that already has a classifier, a new classifier is trained when more images are added, and if the accuracy of the newly trained classifier is higher than the old one's, the old one is replaced. But I'm not sure whether VGG16 or EfficientNet will be used; I guess this depends on the front-end setting.

Then here is a question: when more images are added to a source that already has a classifier, will it

  1. use VGG16 to extract the new features and retrain on the whole feature set, or
  2. use EfficientNetb0 and
    (1). retrain only on the newly extracted features (since the old features were extracted with VGG16 and the new ones with EfficientNetb0, they have different dimensions; see the sketch after this list), or
    (2). retrain the classifier on the whole feature set, which in this case would require re-extracting features for all the images?
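To illustrate why features from the two extractors cannot simply be mixed in one training set, here is a toy sketch; it is not CoralNet code, and the 4096/1280 dimensionalities are just typical values for VGG16 and EfficientNetb0 features, not confirmed in this thread:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit on 4096-d features (VGG16-like), then try to score 1280-d features (EfficientNetb0-like).
# Both dimensionalities are illustrative.
x_old = np.random.rand(20, 4096)
y_old = np.array([0, 1] * 10)
clf = LogisticRegression().fit(x_old, y_old)
clf.predict(np.random.rand(5, 1280))  # raises ValueError: the feature counts do not match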

@beijbom
Collaborator

beijbom commented Jan 9, 2021 via email

@beijbom
Collaborator

beijbom commented Jan 9, 2021 via email

@qiminchen
Collaborator Author

I suspect this line is the culprit:
https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L78
Can you remove that and try again?

@beijbom you're right, but instead of removing the line, I changed it to (ann['row'], ann['col'], ann['label']) for ann in anns, i.e. removed the -1 from both row and col, and guess what, it passed the assertion and I got the normal accuracy of around 75%, as the author claimed.
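For reference, the change at scripts/regression/utils.py line 78 looks roughly like this; the assignment target is illustrative, only the tuple expressions are quoted from the thread:

# Before: shifts rows/cols by one, which no longer matches the (row, col)
# keys stored with these exported features. ('data' is an illustrative name.)
# data = [(ann['row'] - 1, ann['col'] - 1, ann['label']) for ann in anns]

# After: keep rows/cols exactly as exported.
data = [(ann['row'], ann['col'], ann['label']) for ann in anns]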

(pyspacer) Min:pyspacermaster qiminchen$ python scripts/regression/retrain_source.py train 2099 /Users/qiminchen/Downloads/pyspacer-test 10 coranet_1_release_debug_export1 LR
Downloading 11016 metadata and image/feature files...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 11016/11016 [00:00<00:00, 96639.19it/s]
Assembling data in /Users/qiminchen/Downloads/pyspacer-test/s2099/images...
Training classifier for source /Users/qiminchen/Downloads/pyspacer-test/s2099...
2021-01-08 18:00:13,024 Trainset: 3020, valset: 200 images
2021-01-08 18:00:13,024 Using 200 images per mini-batch and 16 mini-batches per epoch
2021-01-08 18:00:13,032 Trainset: 60, valset: 48, common: 48 labels
2021-01-08 18:00:13,032 Entering: loading of reference data
2021-01-08 18:00:16,864 Exiting: loading of reference data after 3.831654 seconds.
2021-01-08 18:00:16,864 Entering: training using LR
2021-01-08 18:02:04,396 Epoch 0, acc: 0.7422
2021-01-08 18:03:47,539 Epoch 1, acc: 0.7532
2021-01-08 18:05:32,405 Epoch 2, acc: 0.7562
2021-01-08 18:07:15,441 Epoch 3, acc: 0.7582
2021-01-08 18:08:56,827 Epoch 4, acc: 0.761
2021-01-08 18:10:38,644 Epoch 5, acc: 0.7618
2021-01-08 18:12:20,371 Epoch 6, acc: 0.7624
2021-01-08 18:14:01,516 Epoch 7, acc: 0.7622
2021-01-08 18:15:42,928 Epoch 8, acc: 0.7626
2021-01-08 18:17:24,107 Epoch 9, acc: 0.763
2021-01-08 18:17:24,107 Exiting: training using LR after 1027.243072 seconds.
2021-01-08 18:17:24,107 Entering: calibration
2021-01-08 18:17:24,466 Exiting: calibration after 0.358726 seconds.
Re-trained BonaireCoralReefMonitoring_2020 (2099). Old acc: 45.9, new acc: 77.2

Oscar, can you remind me why the -1 is applied to both row and col here? Is it to be consistent with 0-indexing?

@StephenChan
Member

In case it helps, I took a pass through all the sources where a new classifier was trained since the rollout:

  • 39 (this was Oscar testing the new extractor)
  • 526, 1984, 2099, 2240, 2243, 2248 (only new classifiers present, <70% accuracy)
  • 2132, 2193, 2204, 2205, 2229, 2251, 2252 (only new classifiers present, >70% accuracy)
  • 1395, 1646, 1716, 1721, 1846, 2090, 2118, 2145, 2151, 2215, 2221, 2247 (old and new classifiers present, improved accuracy)
  • 1813, 2245 (old and new classifiers present, new did not improve accuracy enough to be accepted)

I got these source IDs with the following in manage.py shell:

import datetime
from django.utils import timezone
from vision_backend.models import Classifier
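# Distinct IDs of sources that got a new classifier after the rollout (2020-12-31 05:00 UTC)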
Classifier.objects.all().filter(create_date__gt=datetime.datetime(2020, 12, 31, 5, 0, tzinfo=timezone.utc)).values_list('source', flat=True).distinct()

@qiminchen
Collaborator Author

Thanks @StephenChan. So s2099 is weird: the author said the old classifier had 75% accuracy, but now it shows up as a source with only new classifiers.

@beijbom
Collaborator

beijbom commented Jan 9, 2021 via email

@StephenChan
Member

Is there any correlation between the ones with low accuracy and the ones where we fixed EXIF stuff?

I didn't take stats on EXIF orientations across all sources, only the ones that were annotated in certain months in 2020. That source list was published in this blog post. The only relevant source from there is source 1646, but it only had 2 images with non-default EXIF orientations, so that seems unlikely to make a big difference.

@beijbom
Collaborator

beijbom commented Jan 9, 2021 via email

@StephenChan
Member

StephenChan commented Jan 9, 2021

Hmm, for 2), if they were doing a lot of annotation work in this source recently, maybe they realized they needed to add a label or two, which would mean a labelset change, which in turn would trigger a classifier reset. We can ask if they had to do that.

@beijbom
Collaborator

beijbom commented Jan 9, 2021 via email

@qiminchen
Collaborator Author

For 1), you should be able to get the expected performance on your end as well:

  1. Clone this repo.
  2. Change (ann['row']-1, ann['col']-1, ann['label']) for ann in anns to (ann['row'], ann['col'], ann['label']) for ann in anns:
    https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L78
  3. Change spacer-trainingdata to spacer-test, since s2099 was exported to that bucket: https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L24
  4. Run the command below to cache the features from the bucket and retrain the LR classifier:
    python scripts/regression/retrain_source.py train 2099 /path/to/local 10 coranet_1_release_debug_export1 LR

@StephenChan
Member

I went ahead and inspected a DB backup from just before the rollout. Source 2099 did have 7 classifiers, the highest having 77% accuracy. The labelset had 68 labels, and now it has 69 labels. So they did change the labelset, and that must be why the classifiers got cleared. That seems to solve mystery number 2 then.

Let me know if you want any info from this DB backup which might help with figuring out the accuracy drop.

@beijbom
Collaborator

beijbom commented Jan 11, 2021

Regarding the first mystery, I tracked down some data, including the ID for the classifier (from the UI) and the batch job ID (by querying the server):
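# Django shell: find the batch job whose job_token references classifier 17524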

jobs = BatchJob.objects.filter()
for job in jobs:
    if '17524' in job.job_token:
        print(job)

classifier id: 17524
batch job id: 1787

I've uploaded the payloads for the training job and its results here: [link]. The job_msg is parsed by

https://github.com/beijbom/pyspacer/blob/master/spacer/mailman.py#L17

and defines the train job.

@qiminchen: can you dig in and 1) run a training locally based on exactly this job definition and see if you can replicate the low performance, and 2) if so, compare this job definition with the one you created when running the scripts?

(You are going to have to change the bucket_names and keys to the test bucket, e.g.:
"model_loc": {"storage_type": "s3", "key": "media/classifiers/17524.model", "bucket_name": "coralnet-production"})

I think it'd be nice to understand what happened. At the same time, I'm tempted to ask the user to switch to EfficientNet. I'm pretty sure that'd wipe out the issue, and he should be switching anyway.
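As for the bucket_name/key change mentioned above, here is a minimal sketch of that kind of edit, assuming the payload is saved locally as job_msg.json (a hypothetical file name) and that its S3 locations look like the model_loc example; the keys may also need to point at objects that actually exist in the test bucket:

import json

# Load a local copy of the exported job_msg payload (hypothetical file name).
with open('job_msg.json') as f:
    job_msg = json.load(f)

def retarget(obj, bucket='spacer-test'):
    # Recursively point every S3 storage location in the payload at the test bucket.
    if isinstance(obj, dict):
        if obj.get('storage_type') == 's3':
            obj['bucket_name'] = bucket
        for value in obj.values():
            retarget(value, bucket)
    elif isinstance(obj, list):
        for value in obj:
            retarget(value, bucket)

retarget(job_msg)

# Save the retargeted payload for the local training run.
with open('job_msg_local.json', 'w') as f:
    json.dump(job_msg, f, indent=2)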

@beijbom
Collaborator

beijbom commented Jan 11, 2021

@StephenChan @kriegman: Are you ok if we ask the user to switch to EfficientNet? We have already backed up all the (likely faulty) feature data, so we don't lose reproducibility. But this way the user is unblocked, and it's a double win since his backend will work even better than before.

@StephenChan
Member

That sounds reasonable to me.

@qiminchen
Collaborator Author

can you dig in and 1) run a training locally based on exactly this job definition and see if you can replicate the low performance 2) if you can, compare this job definition with what you created when running the scripts.

(You are going to have to change the bucket_names and keys to the test bucket. E.g:
"model_loc": {"storage_type": "s3", "key": "media/classifiers/17524.model", "bucket_name": "coralnet-production"})

@beijbom the aws_access_key_id and aws_secret_access_key you generated for me a while ago don't have permission to access the coralnet-production bucket; can you regenerate them? BTW, what is "key": "media/classifiers/17524.model" used for here if I run the training locally?

@beijbom
Collaborator

beijbom commented Jan 13, 2021 via email
