AssertionError when retraining s2099 #38
Qimin,
If a user has a source that's using the older VGG features, does CoralNet train and create your MLP classifiers, or does it still use the older LR classifier code? Could that be the problem?
Could the old classifiers be lost because the retraining treated it as a
"new classifier with new features" as if the source was toggled to new
features?
David
|
@kriegman Good question; I actually don't know the logic behind it. For a source that already has a classifier, a new classifier will be trained when more images are added, and if the accuracy of the newly trained classifier is higher than the old one's, the old one will be replaced. But I'm not sure whether VGG16 or EfficientNet will be used; I guess this depends on the front-end setting. Then here is a question: when more images are added to a source that already has a classifier, will it
|
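The accept-if-better behaviour described above can be summarized with a minimal sketch. This is purely illustrative pseudologic with made-up names (train_fn, evaluate_fn), not CoralNet's actual backend code.

```python
def maybe_promote_new_classifier(old_classifier, new_images_added, train_fn, evaluate_fn):
    """Illustrative sketch of the policy described above (assumed, not CoralNet's code):
    when more images are added, train a new classifier and keep it only if its
    validation accuracy beats the currently deployed one."""
    if not new_images_added:
        return old_classifier

    new_classifier = train_fn()            # e.g. an LR or MLP trained on the source's features
    new_acc = evaluate_fn(new_classifier)  # accuracy on the held-out validation set
    old_acc = evaluate_fn(old_classifier) if old_classifier is not None else 0.0

    # Replace the deployed classifier only when the retrained one is strictly better.
    return new_classifier if new_acc > old_acc else old_classifier
```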
Qimin:
I suspect this line is the culprit:
https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L78
Can you remove that and try again?
|
@beijbom you're right, but instead of removing the line, I changed it to
Oscar, can you remind me of |
In case it helps, I took a pass through all the sources where a new classifier was trained since the rollout:
I got these source IDs with the following in manage.py shell:
import datetime
from django.utils import timezone
from vision_backend.models import Classifier
Classifier.objects.all().filter(
    create_date__gt=datetime.datetime(2020, 12, 31, 5, 0, tzinfo=timezone.utc)
).values_list('source', flat=True).distinct()
|
thanks @StephenChan, so s2099 is weird as the author said the old classifier had 75% accuracy but now it presents as a new source |
Is there any correlation between the ones with low accuracy and the ones
where we fixed EXIF stuff?
|
I didn't take stats on EXIF orientations across all sources. Only the ones that were annotated in certain months in 2020. That source list was published in this blog post. The only relevant source from there is source 1646, but that only had 2 images with non-default EXIF orientations, so that seems unlikely to make a big difference. |
Got it. Yeah, this eludes me. Two things in particular:
1) how can Qimin get the expected performance when he retrains on the same features as are used in production?
2) where did the previous classifiers go?
|
Hmm, for 2), if they were doing a lot of annotation work in this source recently, maybe they realized they needed to add a label or two - which would involve a labelset change, which would involve a classifier reset. We can ask if they had to do that. |
Hmm. Yeah. It’s worth asking them.
|
For 1), you should be able to get the expected performance on your end as well
|
I went ahead and inspected a DB backup from just before the rollout. Source 2099 did have 7 classifiers, the highest having 77% accuracy. The labelset had 68 labels, and now it has 69 labels. So they did change the labelset, and that must be why the classifiers got cleared. That seems to solve mystery number 2, then. Let me know if you want any info from this DB backup which might help with figuring out the accuracy drop. |
Regarding the first mystery, I tracked down some data, including the IDs for the classifier (from the UI) and the batch id (by querying the server):
classifier id: 17524
I've uploaded the payloads for the training and results link here: [link]. The job_msg is parsed by https://github.com/beijbom/pyspacer/blob/master/spacer/mailman.py#L17 and defines the train job. @qiminchen: can you dig in and 1) run a training locally based on exactly this job definition and see if you can replicate the low performance, and 2) if you can, compare this job definition with what you created when running the scripts. (You are going to have to change the bucket_names and keys to the test bucket, e.g. "model_loc": {"storage_type": "s3", "key": "media/classifiers/17524.model", "bucket_name": "coralnet-production"}.) I think it'd be nice to understand what happened. At the same time, I'm tempted to ask the user to switch to EfficientNet. I'm pretty sure that'd wipe out the issue and he should be switching anyways. |
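A rough sketch of the bucket/key substitution being asked for here, assuming the job_msg is a dict of S3 storage locations shaped like the quoted model_loc example. The helper name, the export key layout, and any location other than model_loc are assumptions for illustration, not pyspacer's actual schema.

```python
import copy

def retarget_to_test_bucket(job_msg: dict,
                            test_bucket: str = "spacer-test",
                            key_prefix: str = "coranet_1_release_debug_export1/s2099/") -> dict:
    """Point every S3 location in a job definition at the test bucket.

    Assumption: each location is a dict like the model_loc example, with
    "storage_type", "key" and "bucket_name" fields.
    """
    msg = copy.deepcopy(job_msg)
    for loc in msg.values():
        if isinstance(loc, dict) and loc.get("storage_type") == "s3":
            loc["bucket_name"] = test_bucket
            # Re-root the key under the export prefix (layout is an assumption).
            loc["key"] = key_prefix + loc["key"].split("/")[-1]
    return msg

# The one location we actually know from this thread:
job_msg = {"model_loc": {"storage_type": "s3",
                         "key": "media/classifiers/17524.model",
                         "bucket_name": "coralnet-production"}}
print(retarget_to_test_bucket(job_msg)["model_loc"])
```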
@StephenChan @kriegman : Are you ok if we ask the user to switch to EfficientNet? We have already backed up all the (likely faulty) features data, so we don't lose reproducibility. But this way the user is unblocked and it's a double win since his backend will work even better than before. |
That sounds reasonable to me. |
@beijbom the aws_access_key_id and aws_secret_access_key you generated for me a while ago don't have permission to access the coralnet-production bucket, can you regenerate one? btw what is "key": "media/classifiers/17524.model" here used for if I run a training locally? |
@qiminchen <qic003@ucsd.edu>: yeah, you have to replace the bucket_name to point to the test bucket and the key field to correspond to the exported data.
|
@beijbom Hi Oscar, when I tried to retrain the LR/MLP classifier using the features from the server (the ones you just exported to s3://spacer-test/coranet_1_release_debug_export1/s2099/), it raised the AssertionError below. As for re-extracting features using EfficientNet-b0, I'm still working on it, as it will take ~15 hrs on my laptop.

(pyspacer) Min:pyspacermaster qiminchen$ python scripts/regression/retrain_source.py train 2099 /Users/qiminchen/Downloads/pyspacer-test 10 coranet_1_release_debug_export1 LR
Downloading 11016 metadata and image/feature files...
100%|██████████| 11016/11016 [00:00<00:00, 94177.94it/s]
Assembling data in /Users/qiminchen/Downloads/pyspacer-test/s2099/images...
Training classifier for source /Users/qiminchen/Downloads/pyspacer-test/s2099...
2021-01-08 11:44:47,468 Trainset: 3020, valset: 200 images
2021-01-08 11:44:47,469 Using 200 images per mini-batch and 16 mini-batches per epoch
2021-01-08 11:44:47,479 Trainset: 60, valset: 50, common: 50 labels
2021-01-08 11:44:47,479 Entering: loading of reference data
2021-01-08 11:44:47,615 Exiting: loading of reference data after 0.136114 seconds.
Traceback (most recent call last):
  File "scripts/regression/retrain_source.py", line 106, in <module>
    fire.Fire()
  File "/Users/qiminchen/opt/anaconda3/envs/pyspacer/lib/python3.7/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/Users/qiminchen/opt/anaconda3/envs/pyspacer/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/Users/qiminchen/opt/anaconda3/envs/pyspacer/lib/python3.7/site-packages/fire/core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "scripts/regression/retrain_source.py", line 69, in train
    do_training(source_root, train_labels, val_labels, n_epochs, clf_type)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/scripts/regression/utils.py", line 94, in do_training
    train_labels, val_labels, n_epochs, [], feature_loc, clf_type)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_classifier.py", line 50, in __call__
    clf, ref_accs = train(train_labels, feature_loc, nbr_epochs, clf_type)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_utils.py", line 62, in train
    refx, refy = load_batch_data(labels, ref_set, classes, feature_loc)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_utils.py", line 181, in load_batch_data
    x_, y_ = load_image_data(labels, imkey, classes, feature_loc)
  File "/Users/qiminchen/PycharmProjects/pyspacermaster/spacer/train_utils.py", line 145, in load_image_data
    assert rc_labels_set.issubset(rc_features_set)
AssertionError

This should NOT be a pyspacer issue per se, as I also tried retraining some sources from the spacer-trainingdata/beta_export bucket and they all worked fine, using both the features from the server and features re-extracted with EfficientNet-b0.

To reproduce this AssertionError:
1. Clone the up-to-date pyspacer repo.
2. Change spacer-trainingdata to spacer-test, as s2099 was exported to this bucket: https://github.com/beijbom/pyspacer/blob/8d9af6713657ca6791f14d51efec1b1fdc38894b/scripts/regression/utils.py#L24
3. Run the following to cache the features from the bucket and retrain LR:
python scripts/regression/retrain_source.py train 2099 /path/to/local 10 coranet_1_release_debug_export1 LR

Please let me know if you can reproduce the error.
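The assertion that fails in spacer/train_utils.py says that every labeled point of an image must also appear among the image's feature points. Below is a self-contained sketch of that check and of how one might list the offending points for a given image; the flat lists of (row, col) tuples are an assumption for illustration, not pyspacer's actual data layout.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (row, col) of an annotated point

def points_without_features(label_points: List[Point],
                            feature_points: List[Point]) -> List[Point]:
    """Return labeled (row, col) points that have no corresponding feature vector.

    The failing check, assert rc_labels_set.issubset(rc_features_set),
    passes exactly when this list is empty.
    """
    rc_labels_set = set(label_points)
    rc_features_set = set(feature_points)
    return sorted(rc_labels_set - rc_features_set)

# Hypothetical example: one labeled point has no extracted feature,
# so the subset check would fail for this image.
labels = [(100, 150), (200, 350)]
features = [(100, 150), (400, 500)]
print(points_without_features(labels, features))  # -> [(200, 350)]
```

Logging such a diff for each image that trips the assert would show whether the s2099 feature files are simply missing points or whether the point locations are offset, e.g. by an image-rotation/EXIF difference.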