Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Additional output file with consensus gene name per orthogroup #362

Open
MrTomRod opened this issue Mar 24, 2020 · 36 comments

Comments

@MrTomRod
Copy link

MrTomRod commented Mar 24, 2020

Hi!

I wrote a quick-and-dirty script (orthogroup_to_gene_name.py) that takes Orthogroups.tsv and turns it into Orthogroup_BestNames.tsv. This file looks like this:

Orthogroup Best Gene Name Gene Name Occurrences
OG0000000 ABC transporter ATP-binding protein {'Fluoroquinolones export ATP-binding protein/MT2762': 2, 'Sulfate/thiosulfate import ATP-binding protein CysA': 2, ...}
OG0000001 IS30 family transposase, partial {'IS30 family transposase': 144, 'IS30 family transposase, partial': 518, 'Integrase core domain protein': 4, 'hypothetical protein': 6}
OG0000002 IS5/IS1182 family transposase, partial {'IS5 family transposase': 106, 'IS5/IS1182 family transposase, partial': 326, 'IS5 family transposase, partial': 81, ...}

I find it quite useful and since it takes no time to generate, I wanted to suggest you add it to your tool!

Best, MrTomRod

@davidemms
Copy link
Owner

Hi

This looks nice, could you talk me through this a little more? From what I can see from the script it looks like it takes the 'description' attribute for each gene by reading the fasta file using Bio.SeqIO and uses anything in there as a possible name. Do you know how general purpose it is? What sources of fasta files does it give meaningful gene names for?

I think it's a nice idea. In designing OrthoFinder everything has to be general purpose so I've avoided making assumptions about the description lines in FASTA files--I only work with the sequence data, which I know will be in the form I expect. So a consideration for me would be what proportion of users would it work for and how much support I'd be called on to give if it didn't work for someone's dataset. I think if it is general purpose enough then options would include including it in the tools directory for OrthoFinder or I could pull together a page on the orthofinder website listing external tools people have developed for OrthoFinder and list it there.

I think another step I'd probably take if I included it in the tools directory would be to remove the dependencies on biopython, fire and pandas. I love using those libraries myself but for an OrthoFinder userbase I I'd generally try to reduce the dependencies as much as possible.

Let me know your thoughts on this.

Thanks!
David

@MrTomRod
Copy link
Author

MrTomRod commented May 11, 2020

From what I can see from the script it looks like it takes the 'description' attribute for each gene by reading the fasta file using Bio.SeqIO and uses anything in there as a possible name.

Exactly! Then, it checks how frequent each name is. It's very fast and little can go wrong, since the fasta format is quite simple.

Do you know how general purpose it is?

No, but it was one of the first things I did with your output. It gives one a quick-and-dirty description of what the OGs role might be.

What sources of fasta files does it give meaningful gene names for?

I'm working with prokaryotes, and both Prokka as well as NCBI-annotated files (PGAP pipeline) give compatible output. These are the two most common ones, I think.

I think if it is general purpose enough then options would include including it in the tools directory for OrthoFinder or I could pull together a page on the orthofinder website listing external tools people have developed for OrthoFinder and list it there.

I think a list of links to tools is the best option because it makes clear that these scripts aren't your responsibility and you don't have to do be involved if the scripts are updated.

I'd generally try to reduce the dependencies as much as possible.

If you want me to, I'd be happy to remove the dependencies.

Btw: If you looked at my repo, you will have noticed that I also made a script that mimicks roary_plots.py. roary_plots.py was also not made by the roary-devs but a guy called Marco Galardini.

@davidemms
Copy link
Owner

Hi

I had a play with the script using Ensembl proteomes but I couldn't get it to work. If it could support those then I think that could increase its usefulness further. But, as an added complication, with these I advise users to run a script which extracts only the longest transcript variants for input into OrthoFinder: https://davidemms.github.io/orthofinder_tutorials/running-an-example-orthofinder-analysis.html
As a result the OrthoFinder results use only the gene name to refer to each gene, e.g. >ENSDARG00000098423.2
instead of
>ENSDARP00000136500.1 pep chromosome:GRCz11:2:31885581:31886019:-1 gene:ENSDARG00000098423.2 transcript:ENSDART00000170804.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:trgv2 description:T cell receptor gamma variable 2 [Source:ZFIN;Acc:ZDB-GENE-051115-14]
Obviously your script could be provided with these original fasta files with the full description to get the gene names but it would need to know how to take either "ENSDARG00000098423.2" or "ENSDARP00000136500.1" and find the full description in the fasta file.

The other source I use regularly myself is Phytozome, I don't know how that ranks in terms of overall popularity.

I'll try and find some time in the coming weeks to find out about and put together a list of tools people have developed for OthoFinder and will then put it up on the webpage.

All the best
David

@MrTomRod
Copy link
Author

MrTomRod commented May 15, 2020

I don't fully understand...
The input fastas only contain the gene name (ENSDARG00000098423.2)?
Could you provide me with such a file?

How would you design the solution? Add an ensembl-mode which translates the gene name to a description using their API? (Or automatically detect ensembl-ids since they follow a simple pattern.)

@davidemms
Copy link
Owner

Putting aside the primary_transcipts.py script for a minute. If I download a few proteomes from Ensembl and run OrthoFinder on them and then the orthogroup_to_gene_name.py script on the results the lines I get look like this:

Orthogroup      Best Gene Name  Gene Name Occurrences
...
...
OG0024176       pep primary_assembly:Xenopus_tropicalis_v9.1:4:46687547:46760370:1 gene:ENSXETG00000021271.1 transcript:ENSXETT00000040971.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:uncharacterized loc100145527 [Source:NCBI gene;Acc:100145527]    {'pep primary_assembly:Xenopus_tropicalis_v9.1:4:46687547:46760370:1 gene:ENSXETG00000021271.1 transcript:ENSXETT00000040971.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:uncharacterized loc100145527 [Source:NCBI gene;Acc:100145527]': 1, 'pep primary_assembly:Xenopus_tropicalis_v9.1:4:46687547:46760370:1 gene:ENSXETG00000021271.1 transcript:ENSXETT00000040959.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:uncharacterized loc100145527 [Source:NCBI gene;Acc:100145527]': 1}
OG0024177       pep primary_assembly:Xenopus_tropicalis_v9.1:8:15991645:16059640:1 gene:ENSXETG00000021409.1 transcript:ENSXETT00000049678.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:sf3b5 description:splicing factor 3b subunit 5 [Source:Xenbase;Acc:XB-GENE-5845857]      {'pep primary_assembly:Xenopus_tropicalis_v9.1:8:15991645:16059640:1 gene:ENSXETG00000021409.1 transcript:ENSXETT00000049678.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:sf3b5 description:splicing factor 3b subunit 5 [Source:Xenbase;Acc:XB-GENE-5845857]': 1, 'pep primary_assembly:Xenopus_tropicalis_v9.1:8:16024975:16059640:1 gene:ENSXETG00000021409.1 transcript:ENSXETT00000049687.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:sf3b5 description:splicing factor 3b subunit 5 [Source:Xenbase;Acc:XB-GENE-5845857]': 1}

so it's not pulling out just the gene names for Ensembl file, I think it's giving the full line? I think getting it to work with these files is the main challenge.

@davidemms
Copy link
Owner

The primary transcripts script is only a minor complication. If the user ran the primary_transcripts.py script their input to OrthoFinder would only contain the gene names (ENSDARG00000098423.2) but they would still have a folder of the original fasta files*. And so this folder would be the directory they provide to your script rather than the directory of processed fasta files provided to OrthoFinder as input. So then your script would just have to behave more or less as it currently does and extract the info from the fasta files by looking up the identifier ENSDARG00000098423.2.

*E.g Ensembl fasta file: ftp://ftp.ensembl.org/pub/release-100/fasta/danio_rerio/pep/Danio_rerio.GRCz11.pep.all.fa.gz

@ViriatoII
Copy link

ViriatoII commented Jun 17, 2020

This is awsome! The outputs of Orthofinder have been slightly updated now, do you figure you could update your scripts to accommodate that? I tried a bit but struggled..

@MrTomRod
Copy link
Author

MrTomRod commented Jun 18, 2020

This is awsome! The outputs of Orthofinder have been slightly updated now, do you figure you could update your scripts to accommodate that? I tried a bit but struggled..

Glad someone else finds it useful. If your fasta files contain descriptions, it should work. I used it recently on the output of the most recent OrthoFinder without an issue. If you have a problem, create an issue on my repo...

Also, I wrote a tiny how-to in README.md, in case you missed it.

I'll take some time today to run @davidemms tutorial and get my script working with that output.

@davidemms
Copy link
Owner

It'd be great to see it working on the tutorial data, Ensembl is widely used and I'd love to get to try it on some of my analyses!

@MrTomRod
Copy link
Author

MrTomRod commented Jun 18, 2020

I had an accident and thought I'd spend some of my free time working on non-stressful projects like this, but I just learned that this (screen time) may slow my recovery or even incur long-time consequences.

So I won't be working on it until for the immediate future, sorry!

@davidemms Btw, I've been incorporating OrthoFinder into a PhD sub-project, a tool I named OpenGenomeBrowser. If you're interested, test out an early prototype here! This is where OrthoFinder comes into play: Trees view.

@ViriatoII
Copy link

ViriatoII commented Jun 18, 2020

Safe recovery @MrTomRod ! Maybe I can adapt the scripts myself (also the roary plots, I meant). I'f I do I'll write here!

@jolespin
Copy link

@MrTomRod I'm getting an error:
KeyError: "['OG' 'Gene Tree Parent Clade'] not found in axis"

Does this ring any bells?

@MrTomRod
Copy link
Author

MrTomRod commented Aug 18, 2020

It does! I forgot to document the recent changes I made to the script. Sorry. It should work now if you follow the manual:

https://github.com/MrTomRod/orthofinder-tools/blob/master/README.md

@jolespin
Copy link

jolespin commented Aug 18, 2020

Awesome, I'm going to check it out now.

Am I using the wrong file?

In [2]: df = pd.read_csv("orthofinder_output/Orthogroups/Orthogroups.tsv", sep="\t")
df
In [3]: df.columns
Out[3]:
Index(['Orthogroup', 'GCA_002571385.1', 'GCA_003704095.1', 'GCF_000222465.1',
       'GCF_002042975.1', 'GCF_002571385.1', 'GCF_003704095.1',
       'GCF_004143615.1', 'Mussismilia_hispida'],
      dtype='object')

Also, where do the annotations come from as I'm not seeing them in the dataframe?

Thanks!

@MrTomRod
Copy link
Author

MrTomRod commented Aug 18, 2020

Yes, you have to use the new N0 file.

OrthogroupToGeneName(
    n0_tsv='orthofinder_output/Phylogenetic_Hierarchical_Orthogroups/N0.tsv',
    index_column='HOG'
).run(
    fasta_dir=PATH_TO_ORTHOFINDER_FASTAS,
    file_endings='faa',
    out='path/to/output.tsv'
)

The annotations come from the fasta files. (PATH_TO_ORTHOFINDER_FASTAS)

@jolespin
Copy link

Ok cool, I just got it to run! So basically, we provide annotations as the description? I have a de-novo genome so I'll need to add my own annotations for some of these.

btw, thank you for creating this script!

@MrTomRod
Copy link
Author

Ok cool, I just got it to run! So basically, we provide annotations as the description?

for each gene that was assigned to an orthgroup, the script takes the annotation from the protein fasta. then, it selects the most common annotation as the orthogroups annotation.

I have a de-novo genome so I'll need to add my own annotations for some of these.

keep in mind that with something like GO-terms/EC-numbers/KEGG-annotations, the logic of my script would not suffice. there, the logic should be: assign a GO-term to the orthgroup if at least 50% (or whatever) of the genes in the orthogroup have this GO-term.

btw, thank you for creating this script!

no worries, i made it quick and dirty for myself and decided to share it.

@sunnycqcn
Copy link

Hello MrTomRod,

Ok cool, I just got it to run! So basically, we provide annotations as the description?

for each gene that was assigned to an orthgroup, the script takes the annotation from the protein fasta. then, it selects the most common annotation as the orthogroups annotation.

I have a de-novo genome so I'll need to add my own annotations for some of these.

keep in mind that with something like GO-terms/EC-numbers/KEGG-annotations, the logic of my script would not suffice. there, the logic should be: assign a GO-term to the orthgroup if at least 50% (or whatever) of the genes in the orthogroup have this GO-term.

btw, thank you for creating this script!

no worries, i made it quick and dirty for myself and decided to share it.

I used you scripts for ploting. I can get the results with the following warning.
/home/AAFC-AAC/fuf/miniconda3/envs/orthofinder/lib/python3.8/site-packages/fire/core.py:672: DtypeWarning: Columns (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89) have mixed types.Specify dtype option on import or set low_memory=False.
component = fn(*varargs, **kwargs)
But the results look good.
Thanks,
Fuyou

@sunnycqcn
Copy link

Hi!

I wrote a quick-and-dirty script (orthogroup_to_gene_name.py) that takes Orthogroups.tsv and turns it into Orthogroup_BestNames.tsv. This file looks like this:

Orthogroup Best Gene Name Gene Name Occurrences
OG0000000 ABC transporter ATP-binding protein {'Fluoroquinolones export ATP-binding protein/MT2762': 2, 'Sulfate/thiosulfate import ATP-binding protein CysA': 2, ...}
OG0000001 IS30 family transposase, partial {'IS30 family transposase': 144, 'IS30 family transposase, partial': 518, 'Integrase core domain protein': 4, 'hypothetical protein': 6}
OG0000002 IS5/IS1182 family transposase, partial {'IS5 family transposase': 106, 'IS5/IS1182 family transposase, partial': 326, 'IS5 family transposase, partial': 81, ...}
I find it quite useful and since it takes no time to generate, I wanted to suggest you add it to your tool!

Best, MrTomRod

Hello MrTomRod,
I got the results from OrthoFinder, I did not find the file directory about Phylogenetic_Hierarchical_Orthogroups/N0.tsv.
I tried a few times. I still did not get this. I am a little confused.
Could you tell me how I can get this?
Thanks,
Fuyou

@MrTomRod
Copy link
Author

MrTomRod commented Aug 26, 2020

You can't find the folder Phylogenetic_Hierarchical_Orthogroups?

Are you using the latest version of Orthofinder? Because older versions don't create it.

Here are the instructions for my script.

@sunnycqcn
Copy link

You can't find the folder Phylogenetic_Hierarchical_Orthogroups?

Are you using the latest version of Orthofinder? Because older versions don't create it.

Here are the instructions for my script.

I got this question. Because I did not update my orthofinder. I just want to tell you about this.
Thanks,
Fuyou

@sunnycqcn
Copy link

You can't find the folder Phylogenetic_Hierarchical_Orthogroups?

Are you using the latest version of Orthofinder? Because older versions don't create it.

Here are the instructions for my script.

Hello MrTomRod,
I tried this script. But I get the result is as following. I can not find the functional category.
HOG Best Gene Name Gene Name Occurrences
N0.HOG0000000 LREF_000140 Counter({'LREF_000140': 1, 'L022_001102': 1})
N0.HOG0000001 LREF_000140 Counter({'LREF_000140': 1, 'L022_001102': 1})
N0.HOG0000002 LREF_000140 Counter({'LREF_000140': 1, 'L022_001102': 1})
N0.HOG0000003 LREF_000140 Counter({'LREF_000140': 1, 'L022_001102': 1})
N0.HOG0000004 L022_006069 Counter({'L022_006069': 1, 'LREF_005195': 1})
N0.HOG0000005 L022_006069 Counter({'L022_006069': 1, 'LREF_005195': 1})
N0.HOG0000006 L022_006069 Counter({'L022_006069': 1, 'LREF_005195': 1})
N0.HOG0000007 L022_006069 Counter({'L022_006069': 1, 'LREF_005195': 1})
N0.HOG0000008 L022_006069 Counter({'L022_006069': 1, 'LREF_005195': 1})
N0.HOG0000009 L022_000673 Counter({'L022_000673': 1, 'LREF_001506': 1})
N0.HOG0000010 L022_000673 Counter({'L022_000673': 1, 'LREF_001506': 1})
N0.HOG0000011 L022_000673 Counter({'L022_000673': 1, 'LREF_001506': 1})
N0.HOG0000012 L022_001622 Counter({'L022_001622': 1, 'LREF_000642': 1})
N0.HOG0000013 L022_001622 Counter({'L022_001622': 1, 'LREF_000642': 1})
N0.HOG0000014 L022_001622 Counter({'L022_001622': 1, 'LREF_000642': 1})
N0.HOG0000015 L022_005455 Counter({'L022_005455': 1, 'LREF_003145': 1})
N0.HOG0000016 L022_005455 Counter({'L022_005455': 1, 'LREF_003145': 1})
N0.HOG0000017 L022_005455 Counter({'L022_005455': 1, 'LREF_003145': 1})
N0.HOG0000018 L022_005455 Counter({'L022_005455': 1, 'LREF_003145': 1})
If I want to get the functional catergory, what can I do? I read the new tutorials, I did not find any new. My all protein sequences are de novo annotation. So I should download some sequences from database with functional catergory?
I am much appreciated for your helping.
Best,
Fuyou

@MrTomRod
Copy link
Author

Please show me the first 5 lines of one of your fasta files.

@sunnycqcn
Copy link

Please show me the first 5 lines of one of your fasta files.

ps/N0.tsv
N0.HOG0000001 OG0000000 n9 L022_001102-T1 LREF_000140-T3
N0.HOG0000002 OG0000000 n6 L022_001102-T3 LREF_000140-T6, LREF_000140-T1
N0.HOG0000003 OG0000000 n8 L022_001102-T5 LREF_000140-T2
N0.HOG0000004 OG0000001 n1 L022_006069-T1 LREF_005195-T4
N0.HOG0000005 OG0000001 n9 L022_006069-T3 LREF_005195-T1
N0.HOG0000006 OG0000001 n8 L022_006069-T5 LREF_005195-T3
Thanks,
Fuyou

@MrTomRod
Copy link
Author

no, the first 5 lines of one of your FASTA files.

@sunnycqcn
Copy link

L022_000001-T1 L022_000001
MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTSA
AVWNIAGTYIKAKSWLPGMELESDTIQYVKRISSIPIPEIIFSWVDVEWDRTFLVLKALKGRTLDQVWESLSANRRMEIA
DKVAQFCRTLALSTSEMLMTANGNAALEPFLTSLPPASEPSWKPRNLGPFSSCQLRSYLSKPTVLDDIDLFHFYHADLGP
SNVIVTDDGSIVGIIDWESAAFYPKFWLGTKPLVSVGFTVRGAEKRAWAVLLASSLERVGFSPDLDKYEAWKKAIGK
L022_000002-T1 L022_000002
MDARPVLKVFSVLRCLDSLRSSRPPLQFCPSALSPTMPSTFTVLSAGLVAASSGAFAQKVPETAQVIDQKLFNVLDVVPP
PSVFNGSTLFTWPGVTAESLTAKPFHIYDEEFYDVIGSDPTLTLIASSESDPIFHEAVTWHPPRDEVFFVQNAGAPAAGT
GLNKSSIVQKISLAEAEAVRTGKLDAVTVTVVNTSNPQVINPNGGTNYKGQIIFAGEGQGADITSALYLMNPEPPYNTTT
LLNNYFGRQFNSLNDLVVHPKNGDVYFTDTLYGFLQDFRPPPGLRNQVYRLNVETGAVTVVADDFTLPNGITINPEGDRA
YVTDTGIALGYYGRNLSSPASVYSFDVNEDGTWENRKTFAYTPAFVPDGVHTDSEGRVYTGCGDGVHVFNTSGKLIGKIY
TGMSAANFQFAGKGRMIITGQTKLFYVTLAAEGAPLT
L022_000003-T1 L022_000003
MMYRVKAVYDYSSPHEDDLDFKAGQIISVTEEEGDDWYVGEYIDDTGTKRDGLFPRNFVEKYEPEPPPRPNRASRYKTTE
QPAVQAAPPIPDTPQPQEPDQQPEPEPEPEPEPEPAPEPAPEPEPVRHEEPEPPRAQPPPLEVPAAAKTVPLPMSPMSPA
SASTRRAPELPAQAPLQQESAAKPAPASKKPPPPIAAKSNAFRDRIAAFNAPAAAPIQPFKPAGAPTNFIKKPFVAPPPS
RNAYVPPVREAPPVQTYRREEDPEIAERQAQDNDAAEKAGLVAHDAPAQAATEEESQPKVSLKERIALLQKQQQEQAIRA
AAMHKEKPRRPPVKKRTESIDAHPEDSEDTGLERVASGASKERISMDQARPPRVSHDIKSPDEHQRHREILSDANDADQS
AAGDTEDAGGESTSVEGDEERNNVQQTQPSLPIRAAAAPTQEPDVGDEQDVEPEEGEEEEEEDEMDAETRRKLELRERMA
KMSGGMGMPGMFGGIPMGGLPPPKKKKPILESKRVDGDEEPSLPQQRVAMLPMPGMPTVKSPEQEDRQVLQVEKEDETAP
PITSAHRPDEVPDVEDVQPHLTDRTPTGEQPPPVPMEKRPLAPPVPSATRPAPPAPPQVLSPGPGSESGDEMTDAPSALS
PNTPSAPFASPTKRNSYFNSDNMRQAPSIPLQAARPSSPAVSRPPPPPPPTQAPPSRHGEEPPRRLDRNEGETDYEGDYD
TDIASGATHKDALKSHAREASMDGSTTEEPSPARSPTAPHGLPPFPPHLPRAVPPPPPHQPPSRSSIDVPRGAPPLPPVR
PTTNDVQLDEDDDEYDPYRYAGPPVASPPPMPRAVPPPPQSQPPQPHNKPPPPMPPSMPPAVHRQAPESDEDDLYSAPQP
RKSHDRPRPPPPPQAAPHQPAPPPAHQHAPPPPPRQFAPPLPPQERPAPPPPPDRSTMAAPGGEPQPRMPIGRKSLDANR
ALGSRPSMDQSRPGLGADFIARDIDLGMSSHWWTQPKMLPPSLQGRKDILVDMQESTSGNTVEKLVKVLHLDYSQTTVSA
RFDLSNVSDVELEQRNDPPPPTQRPDQLEAAYEQYGRAIAKAVELKQNTVVGAGTPWSLVDELLKPLADALPPVSTRAFG
ALVYANLANASTQSFDEIRPGDIITFRNARFQGKTGPMHAKYAVDVGKPDHVGVVVEWDGSKKKVRVWEQGREHKKVKQE
SFRMGDLRSGEVRVWRVVGRSWVGW
L022_000004-T1 L022_000004
MPPKSILKNSTTTVSTTDIPPGRKPANPRHLDVAIHHANILEDRKRVEAAVLDAIIMLMDFPLTPSANAKRPSPSDAQLF
IDKLNIFQPADYDALIEERNLTDKCGYALCSNPKQKARSTARKQFIDTDHGVEIVDRKVLEVWCSSDCARRALYVKVQLK
EEPAWLRQGGFADKIELMVDNTEEHDNALPLPLKKEEDQSTNAQDDDVSAAWAAQEEAMAELALERGEQPGQFTKANPGL
VKEDIMERATSNAPPKPPTLVAQDMDNTTMAIEGHVPRIDRKRNDDDEDEEDAQDWDKHLPG
L022_000005-T1 L022_000005
MSLSRQSLSPTANLLRNSRLFSLPNPLPKPPVGESFRSGITKASDTATLPYPTQQAIATTKSSLARGDWGLKRPLPQRSY
LVQVSDPVLRVNQLDTIEHVTDFDSAADHVRTRQKWEAMGVPMMKGMTQMRDKDLSGTPPAGAFEIRDDTTSYDTELGLD
ESGLYLKALKDNVKTQADATKKAFDTLWMNKRQAQEAEWDAQGASGDGRVEKIKAFEQEQAQARKQLNKQLTRLRKSDFT
PFTPPPVDAAVHNTKRWKHDGPWLPGMSADDFLAYLNKEITKRKPEFHRYLVEFVKNEIYTTRRLAASQAGEAPPMDMAE
AEAWHARQEKQWRNITPDQIDAGIKSLRKETANNPLGSKLVTKLILPFLRLPAIKFKYTSYAEDASTRAIDQYQFDQETA
PLSTHPSAGLGYLRTRSYITNHPILGPQAQPTPVTARVIQPRTAGTSKQVYARLGVGGFVANDQYRSTDVSSGNRAGVNN
ARDVETIDIDTPGGKKLTVQPLFGSVTNDGRIHIKVKRSQGPEVQVAKGALDDRPPVREFARVERIERFASGKESGVKEL
DEFSGRAKMLSGFLEGLGEGQGKGE

Thanks
Fuyou

@MrTomRod
Copy link
Author

MrTomRod commented Aug 27, 2020

The reason is that your fastas are not functionally annotated.

My script only works if your fastas look something like this:

>L022000001-T1 L022_000001 Description Gene 1
MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTS
>L022000001-T1 L022_000001 Description Gene 2
MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTS

@sunnycqcn
Copy link

sunnycqcn commented Aug 27, 2020 via email

@harish0201
Copy link

Any such luck for Proteomes downloaded from uniprot?

@MrTomRod
Copy link
Author

MrTomRod commented Jan 13, 2021

Any such luck for Proteomes downloaded from uniprot?

This is an example header:

>tr|A0A151G869|A0A151G869_LACPN Bacteriocin immunity protein OS=Lactiplantibacillus plantarum OX=1590 GN=AYO51_00560 PE=4 SV=1

If you change these headers by removing everything after OS=, it should work fine.

Example:

header.rsplit(' OS=', maxsplit=1)[0]

Result:

'>tr|A0A151G869|A0A151G869_LACPN Bacteriocin immunity protein'

@gracielad
Copy link

Hello MrTomRod,
I tried the script orthofider_plot.py (based on roary plot), but I I'm getting an error. I think it´s relate to library fire

I run this commnad python3 orthofinder_plots.py TREE ~/Species_Tree/SpeciesTree_rooted.txt ORTHOGROUPS_TSV ~/OrthoFinder/Results_Mar25/Orthogroups/Orthogroups.tsv OUT output
Could you help me how I can get this?

Follow the error.
Traceback (most recent call last):
File "orthofinder_plots.py", line 206, in
fire.Fire(OrthofinderPlots.create_plots)
File "/home/user/.local/lib/python3.6/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/user/.local/lib/python3.6/site-packages/fire/core.py", line 471, in _Fire
target=component.name)
File "/home/user/.local/lib/python3.6/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "orthofinder_plots.py", line 62, in create_plots
assert os.path.isdir(out)
AssertionError

Thanks!
All the best

Graci

@MrTomRod
Copy link
Author

MrTomRod commented Mar 27, 2021

Hey Garci

Sorry, my script is not great as I did not invest too much time into it. It's a quick-and-dirty rewrite of Marco Galardinis script.

The error you're seeing is not related to the Fire library. You have to read Python stack traces from the bottom up. Maybe this is useful to you: Understanding the Python Traceback.

The relevant line is assert os.path.isdir(out). The problem is that the variable out does not point to an existing folder. Usage of my script is explained here.

python3 orthofinder_plots.py --tree ~/Species_Tree/SpeciesTree_rooted.txt --orthogroups_tsv ~/OrthoFinder/Results_Mar25/Orthogroups/Orthogroups.tsv --out output

Maybe all you have to do is mkdir output?

@marius44
Copy link

No errors! But I just wanted to say THANK YOU!

Just to know, how does it calculate the core, soft-core, shell, and cloud genomes?

@MrTomRod
Copy link
Author

Glad you find it useful.

I just copied the logic from Marco Galardini / roary_plots:

    CORE, SOFT, SHELL = (n_strains * f for f in [.99, .95, .15])

    core = ((og_count >= CORE) & (og_count <= n_strains)).sum()
    softcore = ((og_count >= SOFT) & (og_count < CORE)).sum()
    shell = ((og_count >= SHELL) & (og_count < SOFT)).sum()
    cloud = (og_count < SHELL).sum()

Hope that's clear enough.

@marius44
Copy link

I think it's pretty clear.

Last thing, Is there a proper way to cite your script? I found this (https://guides.libraries.uc.edu/citing/code) but I would need your data.

@MrTomRod
Copy link
Author

MrTomRod commented Aug 15, 2022

Thanks, that's very nice.

So:

Roder, T (2022) MrTomRod/orthofinder-tools computer program (Version 0.0.2). https://github.com/MrTomRod/orthofinder-tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants