
Genes from proteome/species not descendant of Nx.tsv are present in Nx.tsv. #602

Closed
matrs opened this issue Aug 5, 2021 · 4 comments

@matrs

matrs commented Aug 5, 2021

Hello,
I'm trying to define single-copy orthogroups from the Nx.tsv files. I'm getting results that I find confusing, so I wrote a couple of lines to check whether a given Nx.tsv contains only genes from its descendant species, which is what I expect. Take N11.tsv: looking up this node in the species tree, I see two descendant species:

['MGYG-HGUT-04532',
 'DGYMR06203__metabat2_low_PE']

Then I loop over all the Nx.tsv files and check the column MGYG-HGUT-04532 each time. I expect to find genes only in N11.tsv and in the files for its ancestors:

[Tree node 'N7' (0x7f514471e49),
 Tree node 'N3' (0x7f5147961be),
 Tree node 'N1' (0x7f51478373a),
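 Tree node 'N0' (0x7f514471e46)]

import pandas as pd

# `root` is assumed to be a pathlib.Path to the directory holding the Nx.tsv files.
nodes = [f'N{n}.tsv' for n in range(194)]
for n in nodes:
    # na_filter=False keeps empty cells as '' instead of NaN
    n_df = pd.read_csv(root.joinpath(n), sep='\t', na_filter=False)
    print(n, n_df.loc[:, 'MGYG-HGUT-04532'].unique(), sep='\n')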

Which produces:

N0.tsv
['' 'GFNMCGMP_00924, GFNMCGMP_01074, GFNMCGMP_01611'
 'GFNMCGMP_00164, GFNMCGMP_00168' ... 'GFNMCGMP_00320' 'GFNMCGMP_00321'
 'GFNMCGMP_00381, GFNMCGMP_00380']
N1.tsv
['' 'GFNMCGMP_00924, GFNMCGMP_01074, GFNMCGMP_01611'
 'GFNMCGMP_00164, GFNMCGMP_00168' ... 'GFNMCGMP_00320' 'GFNMCGMP_00321'
 'GFNMCGMP_00381, GFNMCGMP_00380']
N2.tsv
['']
N3.tsv
['' 'GFNMCGMP_00924, GFNMCGMP_01074, GFNMCGMP_01611'
 'GFNMCGMP_00164, GFNMCGMP_00168' ... 'GFNMCGMP_00320' 'GFNMCGMP_00321'
 'GFNMCGMP_00381, GFNMCGMP_00380']
N4.tsv
['']
N5.tsv
['']
N6.tsv
['']
N7.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_00320'
 'GFNMCGMP_00321' 'GFNMCGMP_00381, GFNMCGMP_00380']
N8.tsv
['']
N9.tsv
['']
N10.tsv
['']
N11.tsv
['' 'GFNMCGMP_00750, GFNMCGMP_00293' 'GFNMCGMP_00570'
 'GFNMCGMP_01197, GFNMCGMP_00667' 'GFNMCGMP_00341'
...]
N12.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
N13.tsv
['']
N14.tsv
['']
... empty lists
['']
N20.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
N29.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
...
followed by  empty lists

So N12.tsv, N20.tsv and N29.tsv show genes for MGYG-HGUT-04532, although none of these nodes is a descendant or ancestor of N11. I tried other species and nodes, and it's always the same. Maybe I'm misunderstanding how this works, and I'd appreciate any help. I'm attaching the tree file and a couple of Nx.tsv files.
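The check described above can be generalised so that it covers every species column of a node file at once, not just one. The following is a minimal sketch, not OrthoFinder code: the helper names `leaves_by_node` and `unexpected_species` are hypothetical, the Newick parser is deliberately simplistic (unquoted names, optional branch lengths), and the three leading column names in `META_COLS` are an assumption about the Nx.tsv layout. It uses only the standard library so it can be run without pandas or ete3.

```python
import csv
import io
import re

# Assumed non-species columns at the start of each Nx.tsv (an assumption).
META_COLS = ("HOG", "OG", "Gene Tree Parent Clade")


def leaves_by_node(newick):
    """Map each labelled internal node to the set of leaf names below it.

    Minimal parser: assumes unquoted names and optional ':length' suffixes.
    """
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack = []          # one set of leaf names per currently open clade
    just_closed = None  # leaves of the clade closed by the previous ')'
    result = {}
    for tok in tokens:
        if tok == "(":
            stack.append(set())
            just_closed = None
        elif tok == ")":
            just_closed = stack.pop()
            if stack:
                stack[-1] |= just_closed  # leaves also belong to the parent
        elif tok in ",;":
            just_closed = None
        else:
            name = tok.split(":")[0].strip()
            if just_closed is not None:
                # A label right after ')' names the clade just closed.
                if name:
                    result[name] = just_closed
                just_closed = None
            elif name and stack:
                stack[-1].add(name)  # a leaf inside the open clade
    return result


def unexpected_species(tsv_text, allowed):
    """Return species columns that hold genes but are not in `allowed`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    offenders = set()
    for row in reader:
        for col in reader.fieldnames:
            if col in META_COLS or col in allowed:
                continue
            if (row.get(col) or "").strip():
                offenders.add(col)
    return sorted(offenders)


# Toy data: species C has a gene in the N1 file but is not below N1.
tree = "((A:1,B:1)N1:1,(C:1,D:1)N2:1)N0;"
n1_tsv = (
    "HOG\tOG\tGene Tree Parent Clade\tA\tB\tC\tD\n"
    "N1.HOG0000000\tOG0000000\tn1\ta1\tb1, b2\tc9\t\n"
)
print(unexpected_species(n1_tsv, leaves_by_node(tree)["N1"]))
```

On real output, each Nx.tsv would be read from disk and compared against the leaf set of node Nx from SpeciesTree_rooted_node_labels.txt; an empty list for every file is the expected result.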

I'm running OrthoFinder 2.5.2.

Jose Luis

SpeciesTree_rooted_node_labels.txt

Ns.zip

@davidemms
Owner

Hi Jose Luis

That's very strange: these *.tsv files don't seem to correspond at all to the SpeciesTree_rooted_node_labels.txt file. For example:

N12.tsv contains genes from 4 species: MGYG-HGUT-04532, bin3c.184.contigs, X355_Hoffmanns_Two_toed_Sloth__metabat2_high_PE.021.contigs & GCF_001683795.1_ASM168379v1_genomic, but these species are distributed quite widely across the attached species tree.

And the same for N29.tsv.

Could you describe the steps taken in OrthoFinder to produce these? Was it a single run from the start? What commands did you use?

All the best
David

@matrs
Author

matrs commented Aug 6, 2021

Hello David, thank you very much for your prompt answer.
The previous files come from a run that builds on earlier OrthoFinder runs (I tested a few options). To help find the problem, I'm attaching another related run that has exactly the same problem but uses the "original run" directly. The original run here is Jul21, which doesn't appear to have this problem. That run was:

orthofinder -f faas -t 28 -a 8

Then, using those results I ran:

orthofinder -b Results_Jul21 -f extra_faas -M msa -y -t 28 -a 8

This created files that have the problem (I'm attaching them with the log, Jul29). This last run added 3 genomes and removed one (genome 36 in the log file). The files attached in the original post come from this run, but with a tree specified (-ft -s).

For example, when looking at the N3 node in this Jul29 tree:

tree.search_nodes(name='N3')[0].get_leaf_names()
['MGYG-HGUT-04532', 'DGYMR06203__metabat2_low_PE.047.contigs']
tree.search_nodes(name='N3')[0].get_ancestors()
[Tree node 'N1' (0x7f07b1822cd), Tree node 'N0' (0x7f07b25387f)]

Then, looking at the Nx.tsv files for MGYG-HGUT-04532, I find genes in N4, N7 and N11 too:

N4.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
N5.tsv
['']
N6.tsv
['']
N7.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
N8.tsv
['']
N9.tsv
['']
N10.tsv
['']
N11.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']

I'm attaching a few files; this Drive folder has some of the results for both runs:
https://drive.google.com/drive/folders/1CELoUvE1w87FmFNXzos1__GFHpNDN_f1?usp=sharing

I hope this helps. Let me know if any other files or information are needed.

Log_jul29.txt
SpeciesTree_rooted_node_labels_jul29.txt
Log_jul21.txt

@davidemms
Owner

Hi Jose Luis

This should now be fixed. You can regenerate the correct results by running with the 'from trees' option on the final results directory, the one with the added species: "-ft Results_Jul29/". Thanks again for reporting this.

All the best
David

@matrs
Author

matrs commented Aug 6, 2021

Hi David,
I tried the latest code and it seems to work as expected. Thanks!
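For readers who arrive here with the same goal as the original post (single-copy orthogroups per node), one possible way to select them from a regenerated Nx.tsv is sketched below. This is an illustration under stated assumptions, not OrthoFinder functionality: `single_copy_hogs` is a hypothetical helper, the row identifier is assumed to live in a column named `HOG`, and genes within a cell are assumed to be comma-separated, as in the outputs quoted in this thread. The `species` argument should be the descendant species of the node in question.

```python
import csv
import io


def single_copy_hogs(tsv_text, species):
    """Return IDs of rows in which every listed species has exactly one gene."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    keep = []
    for row in reader:
        # Count non-empty, comma-separated gene names per species cell.
        counts = [
            len([g for g in (row.get(sp) or "").split(",") if g.strip()])
            for sp in species
        ]
        if counts and all(c == 1 for c in counts):
            keep.append(row["HOG"])  # "HOG" column name is an assumption
    return keep


# Toy data: only the first row has exactly one gene in both A and B.
demo_tsv = (
    "HOG\tOG\tGene Tree Parent Clade\tA\tB\n"
    "H1\tOG1\tn0\ta1\tb1\n"
    "H2\tOG2\tn0\ta2, a3\tb2\n"
    "H3\tOG3\tn0\ta4\t\n"
)
print(single_copy_hogs(demo_tsv, ["A", "B"]))
```

Passing the species list explicitly, instead of using every column, keeps the definition tied to the node's descendants, which is the property this issue was about.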
