Error in rule pangolearn after 04/09/2022 update #84

Closed
egenomics opened this issue Apr 19, 2022 · 6 comments
Labels
bug Something isn't working duplicate This issue or pull request already exists

Comments

@egenomics

egenomics commented Apr 19, 2022

Hi,
We have been using pangolin (through conda) for a while now. With the last pangolearn update our pipeline broke. We are using
pangolin: 3.1.20
pangolearn: 2022-04-09
constellations: v0.1.7
scorpio: 0.3.16
pango-designation used by pangoLEARN/Usher: v1.3
pango-designation aliases: 1.6

We get the following error:

All dependencies satisfied.
The query file is:/datos/MiSeq/MICRO/COVID/analysis/2022_04_19_R2247/consensus/consensus.R2247.fna
** Running sequence QC **
Number of sequences detected: 48
Total passing QC: 44

Data files found:
Trained model: /root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/pangoLEARN/data/decisionTree_v1.joblib
Header file: /root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/pangoLEARN/data/decisionTreeHeaders_v1.joblib
Designated hash: /root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/pangoLEARN/data/lineages.hash.csv

Job stats:
job                   count  min threads  max threads
add_failed_seqs           1            1            1
align_to_reference        1            1            1
all                       1            1            1
generate_report           1            1            1
get_constellations        1            1            1
hash_sequence_assign      1            1            1
pangolearn                1            1            1
scorpio                   1            1            1
total                     8            1            1

loading model 04/19/2022, 14:24:50
/root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 1.0.1 when using version 0.23.1. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
processing block of 44 sequences 04/19/2022, 14:24:51
[Tue Apr 19 14:24:52 2022]
Error in rule pangolearn:
jobid: 0
output: /tmp/tmpz2w2ggj4/lineage_report.pass_qc.csv

RuleException:
AttributeError in line 112 of /root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/pangolin/scripts/pangolearn.smk:
'DecisionTreeClassifier' object has no attribute 'n_features_'
File "/root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/pangolin/scripts/pangolearn.smk", line 112, in __rule_pangolearn
File "/root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/pangolin/pangolearn/pangolearn.py", line 170, in assign_lineage
File "/root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/sklearn/tree/_classes.py", line 922, in predict_proba
File "/root/miniconda3/envs/pangolin_test/lib/python3.8/site-packages/sklearn/tree/_classes.py", line 395, in _validate_X_predict
File "/root/miniconda3/envs/pangolin_test/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Exiting because a job execution failed. Look above for error message
Exiting because a job execution failed. Look above for error message

@corneliusroemer

This looks like a duplicate of: cov-lineages/pangolin#427

I've encountered this error before, too. Try to reinstall your environment.

The UserWarning in your log hints at a possible reason: you may not be using the scikit-learn version the model expects. Setting up a fresh environment with pangolin should fix this.

Let me know if reinstalling doesn't solve the problem and then share information on the exact packages and their versions installed in your environment.
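
One way to collect the version information requested above is a short script run inside the affected environment. This is only a sketch using the standard library's `importlib.metadata`; the distribution names in `PACKAGES` are assumptions and may need adjusting to match what is actually installed.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Assumed distribution names; adjust to match your environment.
PACKAGES = (
    "pangolin",
    "pangoLEARN",
    "scorpio",
    "constellations",
    "scikit-learn",
    "joblib",
    "snakemake",
)


def installed_versions(names: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running `conda list` in the same environment gives a fuller picture, including non-Python packages.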

@corneliusroemer corneliusroemer added duplicate This issue or pull request already exists bug Something isn't working labels Apr 19, 2022
@wm75

wm75 commented May 9, 2022

@aineniamh @corneliusroemer I think this issue deserves reopening.

This error arises because the most recent pangoLEARN models have apparently been built with a newer version of scikit-learn.
The bioconda recipe for pangolin 3.1.20 pins its scikit-learn dependency to 0.23.1 (https://github.com/bioconda/bioconda-recipes/blob/a574d43146db09006d462746aa1d8716c77404b4/recipes/pangolin/meta.yaml#L25), and because of internal changes in scikit-learn, models dumped with versions > 1.0 will not load with that older version (compare https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations).
Conversely, when you load a model that was dumped with a pre-1.0 version of scikit-learn using a version > 1.0, you will see a warning like this one:

UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.24.2 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations

and though I have no idea whether the model would really be compromised, that doesn't sound encouraging.
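
To spot this kind of mismatch in pipeline logs automatically, the warning text can be parsed for the two version numbers. The sketch below relies only on the wording of the UserWarning as quoted in this thread; the helper names are hypothetical, and comparing just major.minor is an assumption, not a scikit-learn compatibility guarantee.

```python
import re
from typing import NamedTuple, Optional, Tuple


class UnpickleMismatch(NamedTuple):
    estimator: str
    trained_with: str  # scikit-learn version that dumped the model
    running: str       # scikit-learn version installed locally


# Matches the wording of the warning quoted in this thread.
_WARNING_RE = re.compile(
    r"Trying to unpickle estimator (\w+) "
    r"from version (\d+(?:\.\w+)*) "
    r"when using version (\d+(?:\.\w+)*)"
)


def parse_unpickle_warning(text: str) -> Optional[UnpickleMismatch]:
    """Extract estimator name and both versions from a log line, if present."""
    match = _WARNING_RE.search(text)
    return UnpickleMismatch(*match.groups()) if match else None


def _major_minor(version: str) -> Tuple[int, int]:
    parts = version.split(".")
    minor = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else 0
    return (int(parts[0]), minor)


def model_is_newer(mismatch: UnpickleMismatch) -> bool:
    """True if the model was dumped with a newer major.minor than is installed."""
    return _major_minor(mismatch.trained_with) > _major_minor(mismatch.running)
```

For the log in the original report (model from 1.0.1, installed 0.23.1) `model_is_newer` returns `True`, which is the direction reported here as failing outright; the reverse direction only produced the warning.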

Since dumped scikit-learn models are generally not guaranteed to be reloadable with different versions, I think the bioconda approach of pinning a given pangolin release to a specific version of scikit-learn is the right thing to do, but it requires that:

  • you take a public note of the scikit-learn version used when releasing a new pangolin version
  • you stick to this scikit-learn version for building models over the lifetime of at least this pangolin release (less frequent changes are of course always simplifying matters for everyone)
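
As an illustration of the two points above, the training-time scikit-learn version could be written to a small sidecar file next to the model and checked before loading. This is a hypothetical sketch, not something pangolin or pangoLEARN does; the manifest file name, its keys, and the exact-match policy are all assumptions.

```python
import json
from pathlib import Path


def manifest_path(model_path: Path) -> Path:
    """Hypothetical sidecar file stored next to the dumped model."""
    return model_path.with_name(model_path.name + ".manifest.json")


def write_manifest(model_path: Path, sklearn_version: str) -> Path:
    """Record which scikit-learn version was used to train and dump the model."""
    manifest = {"model": model_path.name, "scikit_learn": sklearn_version}
    path = manifest_path(model_path)
    path.write_text(json.dumps(manifest, indent=2))
    return path


def check_manifest(model_path: Path, installed_version: str) -> None:
    """Raise before loading if the installed version differs from the recorded one."""
    recorded = json.loads(manifest_path(model_path).read_text())["scikit_learn"]
    if recorded != installed_version:
        raise RuntimeError(
            f"{model_path.name} was built with scikit-learn {recorded}, "
            f"but {installed_version} is installed"
        )
```

At release time one would call `write_manifest(model_path, sklearn.__version__)`; at load time, `check_manifest(model_path, sklearn.__version__)` before `joblib.load`. An exact match is the strictest policy; a looser major.minor comparison might be enough in practice.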

For 3.1.20 I'm not sure what should be done now. The fact is that models since 2022-04-09 won't work with fresh conda installs of pangolin 3.1.20, but there's no simple fix I can see. The question is whether you'd want to switch back to building future models with scikit-learn 0.24 again, as you did previously.

More importantly, however, the same logic holds for pangolin v4 and its pangoLEARN part of pangolin-data, too. Again, it would be good to have the scikit-learn version clearly stated, and most importantly not changing unnecessarily.

@wm75

wm75 commented May 9, 2022

@egenomics a solution to fix your issue (without updating to pangolin 4) is to:

  • edit conda-meta/pangolin-3.1.20-pyhdfd78af_0.json (or whatever build you have) to change the scikit-learn dependency line from 0.23.1 to >=0.23.1, then
  • run conda update -c conda-forge scikit-learn in your env

This will enable you to run recent models of pangoLEARN with your pangolin. However, you'll see the UserWarning above when trying to run with older models.
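
The manual edit of the conda-meta file described above could also be scripted. The sketch below assumes the record is a JSON object with a `depends` list of strings such as `"scikit-learn 0.23.1"`; check your own file before relying on it, and keep a backup. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import List


def relax_pin(
    depends: List[str], package: str = "scikit-learn", minimum: str = "0.23.1"
) -> List[str]:
    """Return a copy of `depends` with the pin on `package` replaced by '>=minimum'."""
    return [
        f"{package} >={minimum}" if spec.split()[:1] == [package] else spec
        for spec in depends
    ]


def patch_conda_meta(meta_file: Path) -> None:
    """Rewrite the 'depends' list of a conda-meta record in place (assumed layout)."""
    record = json.loads(meta_file.read_text())
    record["depends"] = relax_pin(record.get("depends", []))
    meta_file.write_text(json.dumps(record, indent=2))
```

After patching, the second step is unchanged: run `conda update -c conda-forge scikit-learn` in the environment.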

@aineniamh
Member

I'd like to just give a warning that when we released pangolin 4.0, I intended to maintain pangoLEARN for a couple of months before phasing it out. This was just to give a buffer zone of time for people to update to pangolin 4.0. It's been about 5 weeks, so bear in mind that this repository won't be maintained much longer!

I think this is a good point about scikit-learn versions though, as it is relevant to the random forest model too (you don't see the warnings in 4.0, but the same issue exists: people's local version of scikit-learn may differ from the one we trained on). We can specify a particular version of scikit-learn if this might be an issue, but I've never noticed the version of scikit-learn affecting the inference from the model.

@corneliusroemer

Thanks @wm75 for investigating and giving such a detailed description of what's behind the error here and in cov-lineages/pangolin#427

The happy path is to use an up-to-date pangolin with up-to-date pangoLEARN models.

If for reproducibility one needs to use an old pangoLEARN model, one should use the corresponding pangolin version that was around at the time the model was trained.

@aineniamh Do I understand you correctly that pangoLEARN as a whole will be phased out?

@aineniamh
Member

Yeah as it's no longer needed in pangolin 4.0, I'll archive the repo at some point in the not too distant future.
