Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in calculate_dirichlet_abundances.py. - ValueError: could not convert string to float #20

Open
BenjaminGuinet opened this issue Feb 2, 2024 · 1 comment

Comments

@BenjaminGuinet
Copy link

BenjaminGuinet commented Feb 2, 2024

Hello while running haystac --mode abundances on my data I got this following snakemake error :

Error in rule calculate_dirichlet_abundances:
 jobid: 0
 output: /crex/proj/snic2022-6-144/nobackup/user/Meta_project/Virus/Haystac_analysis/Scripts/Herpesvirus_sequences/L010_Herpesvirus_db_output/probabilities/L010/L010_posterior_abundance.tsv
 log: /crex/proj/snic2022-6-144/nobackup/user/Meta_project/Virus/Haystac_analysis/Scripts/Herpesvirus_sequences/L010_Herpesvirus_db_output/probabilities/L010/L010_posterior_abundance.log (check log file(s) for error message)
 conda-env: /home/user/haystac/cache/conda/3983e9d32a119d531dde75cf1fd7d0a0

RuleException:
CalledProcessError in line 169 of /home/user/.conda/envs/haystac/lib/python3.6/site-packages/haystac/workflow/rules/metagenomics.smk:
Command 'source /sw/apps/conda/latest/rackham_stage/bin/activate '/home/user/haystac/cache/conda/3983e9d32a119d531dde75cf1fd7d0a0'; set -euo pipefail;  python /crex/proj/snic2022-6-144/nobackup/user/Meta_project/Virus/Haystac_analysis/Scripts/Herpesvirus_sequences/.snakemake/scripts/tmpjizpoz5a.calculate_dirichlet_abundances.py' returned non-zero exit status 1.
File "/home/user/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 2352, in run_wrapper
File "/home/user/.conda/envs/haystac/lib/python3.6/site-packages/haystac/workflow/rules/metagenomics.smk", line 169, in __rule_calculate_dirichlet_abundances
File "/home/user/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 569, in _callback
File "/home/user/.conda/envs/haystac/lib/python3.6/concurrent/futures/thread.py", line 56, in run
File "/home/user/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
File "/home/user/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 2364, in run_wrapper
Trying to restart job 0.`

And when looking at the log file I see :

144/nobackup/user/Meta_project/Virus/Haystac_analysis/Scripts/Herpesvirus_sequences/L010_Herpesvirus_db_output/probabilities/L010/L010_posterior_abundance.log
Traceback (most recent call last):
File "/crex/proj/snic2022-6-144/nobackup/user/Meta_project/Virus/Haystac_analysis/Scripts/Herpesvirus_sequences/.snakemake/scripts/tmp5j1s7h8q.calculate_dirichlet_abu
ndances.py", line 145, in <module>
 calculate_dirichlet_abundances(
File "/crex/proj/snic2022-6-144/nobackup/user/Meta_project/Virus/Haystac_analysis/Scripts/Herpesvirus_sequences/.snakemake/scripts/tmp5j1s7h8q.calculate_dirichlet_abu
ndances.py", line 84, in calculate_dirichlet_abundances
 a = ts_tv_matrix.groupby("Taxon").sum().squeeze().astype(float)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/pandas/core/generic.py", line 6324, in astype
 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 451, in astype
 return self.apply(
        ^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 352, in apply
 applied = getattr(b, f)(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/pandas/core/internals/blocks.py", line 511, in astype
 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/pandas/core/dtypes/astype.py", line 242, in astype_array_safe
 new_values = astype_array(values, dtype, copy=copy)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/pandas/core/dtypes/astype.py", line 187, in astype_array
 values = _astype_nansafe(values, dtype, copy=copy)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/pandas/core/dtypes/astype.py", line 138, in _astype_nansafe
 return arr.astype(dtype, copy=True)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: 'ERR5024913.26762701ERR5024914.36341165ERR5024913.39738624ERR5024914.9765425ERR5024914.48021130ERR5024913.12985897ERR5024913.17020599

So I looked into the haystac code and

On line 78 in the code of calculate_dirichlet_abundances.py I found this :

if len(ts_tv_matrix.Taxon.unique()) > 1:
    a = ts_tv_matrix.groupby("Taxon").sum().squeeze().astype(float)
else:
    a = ts_tv_matrix.groupby("Taxon").sum().iloc[:, 0].astype(float)
a.loc["Grey_Matter"] = grey_matter.sum()

It looks like that it is this part that gives this error : ValueError: could not convert string to float: 'ERR5024913.26762701ERR5024914.36341165ERR5024913.39738624ERR5024914.9765425ERR5024914.48021130ERR5024913.12985897ERR5024913.17020599

Because the ts_tv_matrix at this step looks like that :

                                 Taxon              Read_ID  Dirichlet_Assignment
0     Columbid_alphaherpesvirus_1_strain_HLJ  ERR5024913.47906133                   0.0
1     Columbid_alphaherpesvirus_1_strain_HLJ  ERR5024913.16245378                   0.0
2     Columbid_alphaherpesvirus_1_strain_HLJ   ERR5024913.5227225                   0.0
3     Columbid_alphaherpesvirus_1_strain_HLJ   ERR5024913.2444664                   0.0
4     Columbid_alphaherpesvirus_1_strain_HLJ  ERR5024913.13611153                   0.0
...                                      ...                  ...                   ...
1942         Human_herpesvirus_2_strain_HG52  ERR5024914.75275187                   0.0
1943         Human_herpesvirus_2_strain_HG52  ERR5024913.36437085                   0.0
1944         Human_herpesvirus_2_strain_HG52  ERR5024914.41555325                   0.0
1945         Human_herpesvirus_2_strain_HG52  ERR5024913.40749291                   0.0
1946         Human_herpesvirus_2_strain_HG52  ERR5024913.66974822                   0.0

So it looks like my tsv files does look like has it should our did I missed something ?
Thanks for your help.

Here is the full haystac code I used : haystac analyse --mode abundances --database Herpesvirus_db --sample L010 --output L010_Herpesvirus_db_output --cores 15

@antonisdim
Copy link
Owner

Hello Benjamin,

I hope you are doing great, and apologies for not getting back to you earlier !

You are right in terms of which line is causing that ValueError to be raised, and your ts_tv_matrix also looks fine.

I have been trying to reproduce your error locally, using the the example snippet of your ts_tv_matrix . I have managed to reproduce your error only when I use a version of pandas higher than the one that is specified in the yaml file used to create the /home/user/haystac/cache/conda/3983e9d32a119d531dde75cf1fd7d0a0 conda env. In that yaml file the pandas version is specified to be 1.0.3, and it appears you might be using a version that is >= 1.4.1. Which makes sense as your python version within that env also changes to python3.11. Unfortunately I am not sure how this happened as whenever I am creating that env locally, it gets built correctly.

I think if you downgrade your pandas, or build that env with the correct dependency versions (as pinned in the envs/scipy.yaml), your analysis should work.

That being said, thank you for raising this issue ! It's a good reason to update the versions of some dependencies, which I will get round to doing soon.

Please let me know if this helps !

Best,
Antony

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants