Feature request: Additional output file with consensus gene name per orthogroup #362

MrTomRod · 2020-03-24T08:59:56Z

Hi!

I wrote a quick-and-dirty script (orthogroup_to_gene_name.py) that takes Orthogroups.tsv and turns it into Orthogroup_BestNames.tsv. This file looks like this:

Orthogroup	Best Gene Name	Gene Name Occurrences
OG0000000	ABC transporter ATP-binding protein	{'Fluoroquinolones export ATP-binding protein/MT2762': 2, 'Sulfate/thiosulfate import ATP-binding protein CysA': 2, ...}
OG0000001	IS30 family transposase, partial	{'IS30 family transposase': 144, 'IS30 family transposase, partial': 518, 'Integrase core domain protein': 4, 'hypothetical protein': 6}
OG0000002	IS5/IS1182 family transposase, partial	{'IS5 family transposase': 106, 'IS5/IS1182 family transposase, partial': 326, 'IS5 family transposase, partial': 81, ...}

I find it quite useful and since it takes no time to generate, I wanted to suggest you add it to your tool!

Best, MrTomRod

The text was updated successfully, but these errors were encountered:

davidemms · 2020-05-11T15:31:19Z

Hi

This looks nice, could you talk me through this a little more? From what I can see from the script it looks like it takes the 'description' attribute for each gene by reading the fasta file using Bio.SeqIO and uses anything in there as a possible name. Do you know how general purpose it is? What sources of fasta files does it give meaningful gene names for?

I think it's a nice idea. In designing OrthoFinder everything has to be general purpose so I've avoided making assumptions about the description lines in FASTA files--I only work with the sequence data, which I know will be in the form I expect. So a consideration for me would be what proportion of users would it work for and how much support I'd be called on to give if it didn't work for someone's dataset. I think if it is general purpose enough then options would include including it in the tools directory for OrthoFinder or I could pull together a page on the orthofinder website listing external tools people have developed for OrthoFinder and list it there.

I think another step I'd probably take if I included it in the tools directory would be to remove the dependencies on biopython, fire and pandas. I love using those libraries myself but for an OrthoFinder userbase I I'd generally try to reduce the dependencies as much as possible.

Let me know your thoughts on this.

Thanks!
David

MrTomRod · 2020-05-11T16:13:44Z

From what I can see from the script it looks like it takes the 'description' attribute for each gene by reading the fasta file using Bio.SeqIO and uses anything in there as a possible name.

Exactly! Then, it checks how frequent each name is. It's very fast and little can go wrong, since the fasta format is quite simple.

Do you know how general purpose it is?

No, but it was one of the first things I did with your output. It gives one a quick-and-dirty description of what the OGs role might be.

What sources of fasta files does it give meaningful gene names for?

I'm working with prokaryotes, and both Prokka as well as NCBI-annotated files (PGAP pipeline) give compatible output. These are the two most common ones, I think.

I think if it is general purpose enough then options would include including it in the tools directory for OrthoFinder or I could pull together a page on the orthofinder website listing external tools people have developed for OrthoFinder and list it there.

I think a list of links to tools is the best option because it makes clear that these scripts aren't your responsibility and you don't have to do be involved if the scripts are updated.

I'd generally try to reduce the dependencies as much as possible.

If you want me to, I'd be happy to remove the dependencies.

Btw: If you looked at my repo, you will have noticed that I also made a script that mimicks roary_plots.py. roary_plots.py was also not made by the roary-devs but a guy called Marco Galardini.

davidemms · 2020-05-15T08:25:55Z

Hi

I had a play with the script using Ensembl proteomes but I couldn't get it to work. If it could support those then I think that could increase its usefulness further. But, as an added complication, with these I advise users to run a script which extracts only the longest transcript variants for input into OrthoFinder: https://davidemms.github.io/orthofinder_tutorials/running-an-example-orthofinder-analysis.html
As a result the OrthoFinder results use only the gene name to refer to each gene, e.g. >ENSDARG00000098423.2
instead of
>ENSDARP00000136500.1 pep chromosome:GRCz11:2:31885581:31886019:-1 gene:ENSDARG00000098423.2 transcript:ENSDART00000170804.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:trgv2 description:T cell receptor gamma variable 2 [Source:ZFIN;Acc:ZDB-GENE-051115-14]
Obviously your script could be provided with these original fasta files with the full description to get the gene names but it would need to know how to take either "ENSDARG00000098423.2" or "ENSDARP00000136500.1" and find the full description in the fasta file.

The other source I use regularly myself is Phytozome, I don't know how that ranks in terms of overall popularity.

I'll try and find some time in the coming weeks to find out about and put together a list of tools people have developed for OthoFinder and will then put it up on the webpage.

All the best
David

MrTomRod · 2020-05-15T08:42:13Z

I don't fully understand...
The input fastas only contain the gene name (ENSDARG00000098423.2)?
Could you provide me with such a file?

How would you design the solution? Add an ensembl-mode which translates the gene name to a description using their API? (Or automatically detect ensembl-ids since they follow a simple pattern.)

davidemms · 2020-05-15T10:47:13Z

Putting aside the primary_transcipts.py script for a minute. If I download a few proteomes from Ensembl and run OrthoFinder on them and then the orthogroup_to_gene_name.py script on the results the lines I get look like this:

Orthogroup      Best Gene Name  Gene Name Occurrences
...
...
OG0024176       pep primary_assembly:Xenopus_tropicalis_v9.1:4:46687547:46760370:1 gene:ENSXETG00000021271.1 transcript:ENSXETT00000040971.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:uncharacterized loc100145527 [Source:NCBI gene;Acc:100145527]    {'pep primary_assembly:Xenopus_tropicalis_v9.1:4:46687547:46760370:1 gene:ENSXETG00000021271.1 transcript:ENSXETT00000040971.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:uncharacterized loc100145527 [Source:NCBI gene;Acc:100145527]': 1, 'pep primary_assembly:Xenopus_tropicalis_v9.1:4:46687547:46760370:1 gene:ENSXETG00000021271.1 transcript:ENSXETT00000040959.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:uncharacterized loc100145527 [Source:NCBI gene;Acc:100145527]': 1}
OG0024177       pep primary_assembly:Xenopus_tropicalis_v9.1:8:15991645:16059640:1 gene:ENSXETG00000021409.1 transcript:ENSXETT00000049678.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:sf3b5 description:splicing factor 3b subunit 5 [Source:Xenbase;Acc:XB-GENE-5845857]      {'pep primary_assembly:Xenopus_tropicalis_v9.1:8:15991645:16059640:1 gene:ENSXETG00000021409.1 transcript:ENSXETT00000049678.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:sf3b5 description:splicing factor 3b subunit 5 [Source:Xenbase;Acc:XB-GENE-5845857]': 1, 'pep primary_assembly:Xenopus_tropicalis_v9.1:8:16024975:16059640:1 gene:ENSXETG00000021409.1 transcript:ENSXETT00000049687.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:sf3b5 description:splicing factor 3b subunit 5 [Source:Xenbase;Acc:XB-GENE-5845857]': 1}

so it's not pulling out just the gene names for Ensembl file, I think it's giving the full line? I think getting it to work with these files is the main challenge.

davidemms · 2020-05-15T10:48:06Z

The primary transcripts script is only a minor complication. If the user ran the primary_transcripts.py script their input to OrthoFinder would only contain the gene names (ENSDARG00000098423.2) but they would still have a folder of the original fasta files*. And so this folder would be the directory they provide to your script rather than the directory of processed fasta files provided to OrthoFinder as input. So then your script would just have to behave more or less as it currently does and extract the info from the fasta files by looking up the identifier ENSDARG00000098423.2.

*E.g Ensembl fasta file: ftp://ftp.ensembl.org/pub/release-100/fasta/danio_rerio/pep/Danio_rerio.GRCz11.pep.all.fa.gz

ViriatoII · 2020-06-17T18:31:17Z

This is awsome! The outputs of Orthofinder have been slightly updated now, do you figure you could update your scripts to accommodate that? I tried a bit but struggled..

MrTomRod · 2020-06-18T09:32:36Z

This is awsome! The outputs of Orthofinder have been slightly updated now, do you figure you could update your scripts to accommodate that? I tried a bit but struggled..

Glad someone else finds it useful. If your fasta files contain descriptions, it should work. I used it recently on the output of the most recent OrthoFinder without an issue. If you have a problem, create an issue on my repo...

Also, I wrote a tiny how-to in README.md, in case you missed it.

I'll take some time today to run @davidemms tutorial and get my script working with that output.

davidemms · 2020-06-18T12:07:55Z

It'd be great to see it working on the tutorial data, Ensembl is widely used and I'd love to get to try it on some of my analyses!

MrTomRod · 2020-06-18T12:16:45Z

I had an accident and thought I'd spend some of my free time working on non-stressful projects like this, but I just learned that this (screen time) may slow my recovery or even incur long-time consequences.

So I won't be working on it until for the immediate future, sorry!

@davidemms Btw, I've been incorporating OrthoFinder into a PhD sub-project, a tool I named OpenGenomeBrowser. If you're interested, test out an early prototype here! This is where OrthoFinder comes into play: Trees view.

ViriatoII · 2020-06-18T12:23:37Z

Safe recovery @MrTomRod ! Maybe I can adapt the scripts myself (also the roary plots, I meant). I'f I do I'll write here!

jolespin · 2020-08-18T16:56:32Z

@MrTomRod I'm getting an error:
KeyError: "['OG' 'Gene Tree Parent Clade'] not found in axis"

Does this ring any bells?

MrTomRod · 2020-08-18T17:47:59Z

It does! I forgot to document the recent changes I made to the script. Sorry. It should work now if you follow the manual:

https://github.com/MrTomRod/orthofinder-tools/blob/master/README.md

jolespin · 2020-08-18T20:42:13Z

Awesome, I'm going to check it out now.

Am I using the wrong file?

In [2]: df = pd.read_csv("orthofinder_output/Orthogroups/Orthogroups.tsv", sep="\t")
df
In [3]: df.columns
Out[3]:
Index(['Orthogroup', 'GCA_002571385.1', 'GCA_003704095.1', 'GCF_000222465.1',
       'GCF_002042975.1', 'GCF_002571385.1', 'GCF_003704095.1',
       'GCF_004143615.1', 'Mussismilia_hispida'],
      dtype='object')

Also, where do the annotations come from as I'm not seeing them in the dataframe?

Thanks!

MrTomRod · 2020-08-18T20:59:04Z

Yes, you have to use the new N0 file.

OrthogroupToGeneName(
    n0_tsv='orthofinder_output/Phylogenetic_Hierarchical_Orthogroups/N0.tsv',
    index_column='HOG'
).run(
    fasta_dir=PATH_TO_ORTHOFINDER_FASTAS,
    file_endings='faa',
    out='path/to/output.tsv'
)

The annotations come from the fasta files. (PATH_TO_ORTHOFINDER_FASTAS)

jolespin · 2020-08-18T21:26:06Z

Ok cool, I just got it to run! So basically, we provide annotations as the description? I have a de-novo genome so I'll need to add my own annotations for some of these.

btw, thank you for creating this script!

MrTomRod · 2020-08-19T06:31:45Z

Ok cool, I just got it to run! So basically, we provide annotations as the description?

for each gene that was assigned to an orthgroup, the script takes the annotation from the protein fasta. then, it selects the most common annotation as the orthogroups annotation.

I have a de-novo genome so I'll need to add my own annotations for some of these.

keep in mind that with something like GO-terms/EC-numbers/KEGG-annotations, the logic of my script would not suffice. there, the logic should be: assign a GO-term to the orthgroup if at least 50% (or whatever) of the genes in the orthogroup have this GO-term.

btw, thank you for creating this script!

no worries, i made it quick and dirty for myself and decided to share it.

sunnycqcn · 2020-08-26T19:17:32Z

Hello MrTomRod,

Ok cool, I just got it to run! So basically, we provide annotations as the description?

for each gene that was assigned to an orthgroup, the script takes the annotation from the protein fasta. then, it selects the most common annotation as the orthogroups annotation.

I have a de-novo genome so I'll need to add my own annotations for some of these.

keep in mind that with something like GO-terms/EC-numbers/KEGG-annotations, the logic of my script would not suffice. there, the logic should be: assign a GO-term to the orthgroup if at least 50% (or whatever) of the genes in the orthogroup have this GO-term.

btw, thank you for creating this script!

no worries, i made it quick and dirty for myself and decided to share it.

I used you scripts for ploting. I can get the results with the following warning.
/home/AAFC-AAC/fuf/miniconda3/envs/orthofinder/lib/python3.8/site-packages/fire/core.py:672: DtypeWarning: Columns (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89) have mixed types.Specify dtype option on import or set low_memory=False.
component = fn(*varargs, **kwargs)
But the results look good.
Thanks,
Fuyou

sunnycqcn · 2020-08-26T19:29:29Z

Hi!

I wrote a quick-and-dirty script (orthogroup_to_gene_name.py) that takes Orthogroups.tsv and turns it into Orthogroup_BestNames.tsv. This file looks like this:

Orthogroup Best Gene Name Gene Name Occurrences
OG0000000 ABC transporter ATP-binding protein {'Fluoroquinolones export ATP-binding protein/MT2762': 2, 'Sulfate/thiosulfate import ATP-binding protein CysA': 2, ...}
OG0000001 IS30 family transposase, partial {'IS30 family transposase': 144, 'IS30 family transposase, partial': 518, 'Integrase core domain protein': 4, 'hypothetical protein': 6}
OG0000002 IS5/IS1182 family transposase, partial {'IS5 family transposase': 106, 'IS5/IS1182 family transposase, partial': 326, 'IS5 family transposase, partial': 81, ...}
I find it quite useful and since it takes no time to generate, I wanted to suggest you add it to your tool!

Best, MrTomRod

Hello MrTomRod,
I got the results from OrthoFinder, I did not find the file directory about Phylogenetic_Hierarchical_Orthogroups/N0.tsv.
I tried a few times. I still did not get this. I am a little confused.
Could you tell me how I can get this?
Thanks,
Fuyou

MrTomRod · 2020-08-26T19:54:36Z

You can't find the folder Phylogenetic_Hierarchical_Orthogroups?

Are you using the latest version of Orthofinder? Because older versions don't create it.

Here are the instructions for my script.

sunnycqcn · 2020-08-26T19:56:46Z

You can't find the folder Phylogenetic_Hierarchical_Orthogroups?

Are you using the latest version of Orthofinder? Because older versions don't create it.

Here are the instructions for my script.

I got this question. Because I did not update my orthofinder. I just want to tell you about this.
Thanks,
Fuyou

sunnycqcn · 2020-08-26T22:29:37Z

You can't find the folder Phylogenetic_Hierarchical_Orthogroups?

Are you using the latest version of Orthofinder? Because older versions don't create it.

Here are the instructions for my script.

Hello MrTomRod,
I tried this script. HOG Best Gene Name N0.HOG0000000 LREF_000140 N0.HOG0000001 LREF_000140 N0.HOG0000002 LREF_000140 N0.HOG0000003 LREF_000140 N0.HOG0000004 L022_006069 N0.HOG0000005 L022_006069 N0.HOG0000006 L022_006069 N0.HOG0000007 L022_006069 N0.HOG0000008 L022_006069 N0.HOG0000009 L022_000673 N0.HOG0000010 L022_000673 N0.HOG0000011 L022_000673 N0.HOG0000012 L022_001622 N0.HOG0000013 L022_001622 N0.HOG0000014 L022_001622 N0.HOG0000015 L022_005455 N0.HOG0000016 L022_005455 N0.HOG0000017 L022_005455 N0.HOG0000018 L022_005455 If I want to get the I am much appreciated Best,
Fuyou But I get the result is as following. I can not find the functional category.
Gene Name Occurrences
Counter({'LREF_000140': 1, 'L022_001102': 1})
Counter({'LREF_000140': 1, 'L022_001102': 1})
Counter({'LREF_000140': 1, 'L022_001102': 1})
Counter({'LREF_000140': 1, 'L022_001102': 1})
Counter({'L022_006069': 1, 'LREF_005195': 1})
Counter({'L022_006069': 1, 'LREF_005195': 1})
Counter({'L022_006069': 1, 'LREF_005195': 1})
Counter({'L022_006069': 1, 'LREF_005195': 1})
Counter({'L022_006069': 1, 'LREF_005195': 1})
Counter({'L022_000673': 1, 'LREF_001506': 1})
Counter({'L022_000673': 1, 'LREF_001506': 1})
Counter({'L022_000673': 1, 'LREF_001506': 1})
Counter({'L022_001622': 1, 'LREF_000642': 1})
Counter({'L022_001622': 1, 'LREF_000642': 1})
Counter({'L022_001622': 1, 'LREF_000642': 1})
Counter({'L022_005455': 1, 'LREF_003145': 1})
Counter({'L022_005455': 1, 'LREF_003145': 1})
Counter({'L022_005455': 1, 'LREF_003145': 1})
Counter({'L022_005455': 1, 'LREF_003145': 1})
functional catergory, what can I do? I read the new tutorials, I did not find any new. My all protein sequences are de novo annotation. So I should download some sequences from database with functional catergory?
for your helping.

MrTomRod · 2020-08-27T10:28:44Z

Please show me the first 5 lines of one of your fasta files.

sunnycqcn · 2020-08-27T13:54:04Z

Please show me the first 5 lines of one of your fasta files.

ps/N0.tsv
N0.HOG0000001 OG0000000 n9 L022_001102-T1 LREF_000140-T3
N0.HOG0000002 OG0000000 n6 L022_001102-T3 LREF_000140-T6, LREF_000140-T1
N0.HOG0000003 OG0000000 n8 L022_001102-T5 LREF_000140-T2
N0.HOG0000004 OG0000001 n1 L022_006069-T1 LREF_005195-T4
N0.HOG0000005 OG0000001 n9 L022_006069-T3 LREF_005195-T1
N0.HOG0000006 OG0000001 n8 L022_006069-T5 LREF_005195-T3
Thanks,
Fuyou

MrTomRod · 2020-08-27T13:57:58Z

no, the first 5 lines of one of your FASTA files.

sunnycqcn · 2020-08-27T14:04:25Z

L022_000001-T1 L022_000001
MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTSA
AVWNIAGTYIKAKSWLPGMELESDTIQYVKRISSIPIPEIIFSWVDVEWDRTFLVLKALKGRTLDQVWESLSANRRMEIA
DKVAQFCRTLALSTSEMLMTANGNAALEPFLTSLPPASEPSWKPRNLGPFSSCQLRSYLSKPTVLDDIDLFHFYHADLGP
SNVIVTDDGSIVGIIDWESAAFYPKFWLGTKPLVSVGFTVRGAEKRAWAVLLASSLERVGFSPDLDKYEAWKKAIGK
L022_000002-T1 L022_000002
MDARPVLKVFSVLRCLDSLRSSRPPLQFCPSALSPTMPSTFTVLSAGLVAASSGAFAQKVPETAQVIDQKLFNVLDVVPP
PSVFNGSTLFTWPGVTAESLTAKPFHIYDEEFYDVIGSDPTLTLIASSESDPIFHEAVTWHPPRDEVFFVQNAGAPAAGT
GLNKSSIVQKISLAEAEAVRTGKLDAVTVTVVNTSNPQVINPNGGTNYKGQIIFAGEGQGADITSALYLMNPEPPYNTTT
LLNNYFGRQFNSLNDLVVHPKNGDVYFTDTLYGFLQDFRPPPGLRNQVYRLNVETGAVTVVADDFTLPNGITINPEGDRA
YVTDTGIALGYYGRNLSSPASVYSFDVNEDGTWENRKTFAYTPAFVPDGVHTDSEGRVYTGCGDGVHVFNTSGKLIGKIY
TGMSAANFQFAGKGRMIITGQTKLFYVTLAAEGAPLT
L022_000003-T1 L022_000003
MMYRVKAVYDYSSPHEDDLDFKAGQIISVTEEEGDDWYVGEYIDDTGTKRDGLFPRNFVEKYEPEPPPRPNRASRYKTTE
QPAVQAAPPIPDTPQPQEPDQQPEPEPEPEPEPEPAPEPAPEPEPVRHEEPEPPRAQPPPLEVPAAAKTVPLPMSPMSPA
SASTRRAPELPAQAPLQQESAAKPAPASKKPPPPIAAKSNAFRDRIAAFNAPAAAPIQPFKPAGAPTNFIKKPFVAPPPS
RNAYVPPVREAPPVQTYRREEDPEIAERQAQDNDAAEKAGLVAHDAPAQAATEEESQPKVSLKERIALLQKQQQEQAIRA
AAMHKEKPRRPPVKKRTESIDAHPEDSEDTGLERVASGASKERISMDQARPPRVSHDIKSPDEHQRHREILSDANDADQS
AAGDTEDAGGESTSVEGDEERNNVQQTQPSLPIRAAAAPTQEPDVGDEQDVEPEEGEEEEEEDEMDAETRRKLELRERMA
KMSGGMGMPGMFGGIPMGGLPPPKKKKPILESKRVDGDEEPSLPQQRVAMLPMPGMPTVKSPEQEDRQVLQVEKEDETAP
PITSAHRPDEVPDVEDVQPHLTDRTPTGEQPPPVPMEKRPLAPPVPSATRPAPPAPPQVLSPGPGSESGDEMTDAPSALS
PNTPSAPFASPTKRNSYFNSDNMRQAPSIPLQAARPSSPAVSRPPPPPPPTQAPPSRHGEEPPRRLDRNEGETDYEGDYD
TDIASGATHKDALKSHAREASMDGSTTEEPSPARSPTAPHGLPPFPPHLPRAVPPPPPHQPPSRSSIDVPRGAPPLPPVR
PTTNDVQLDEDDDEYDPYRYAGPPVASPPPMPRAVPPPPQSQPPQPHNKPPPPMPPSMPPAVHRQAPESDEDDLYSAPQP
RKSHDRPRPPPPPQAAPHQPAPPPAHQHAPPPPPRQFAPPLPPQERPAPPPPPDRSTMAAPGGEPQPRMPIGRKSLDANR
ALGSRPSMDQSRPGLGADFIARDIDLGMSSHWWTQPKMLPPSLQGRKDILVDMQESTSGNTVEKLVKVLHLDYSQTTVSA
RFDLSNVSDVELEQRNDPPPPTQRPDQLEAAYEQYGRAIAKAVELKQNTVVGAGTPWSLVDELLKPLADALPPVSTRAFG
ALVYANLANASTQSFDEIRPGDIITFRNARFQGKTGPMHAKYAVDVGKPDHVGVVVEWDGSKKKVRVWEQGREHKKVKQE
SFRMGDLRSGEVRVWRVVGRSWVGW
L022_000004-T1 L022_000004
MPPKSILKNSTTTVSTTDIPPGRKPANPRHLDVAIHHANILEDRKRVEAAVLDAIIMLMDFPLTPSANAKRPSPSDAQLF
IDKLNIFQPADYDALIEERNLTDKCGYALCSNPKQKARSTARKQFIDTDHGVEIVDRKVLEVWCSSDCARRALYVKVQLK
EEPAWLRQGGFADKIELMVDNTEEHDNALPLPLKKEEDQSTNAQDDDVSAAWAAQEEAMAELALERGEQPGQFTKANPGL
VKEDIMERATSNAPPKPPTLVAQDMDNTTMAIEGHVPRIDRKRNDDDEDEEDAQDWDKHLPG
L022_000005-T1 L022_000005
MSLSRQSLSPTANLLRNSRLFSLPNPLPKPPVGESFRSGITKASDTATLPYPTQQAIATTKSSLARGDWGLKRPLPQRSY
LVQVSDPVLRVNQLDTIEHVTDFDSAADHVRTRQKWEAMGVPMMKGMTQMRDKDLSGTPPAGAFEIRDDTTSYDTELGLD
ESGLYLKALKDNVKTQADATKKAFDTLWMNKRQAQEAEWDAQGASGDGRVEKIKAFEQEQAQARKQLNKQLTRLRKSDFT
PFTPPPVDAAVHNTKRWKHDGPWLPGMSADDFLAYLNKEITKRKPEFHRYLVEFVKNEIYTTRRLAASQAGEAPPMDMAE
AEAWHARQEKQWRNITPDQIDAGIKSLRKETANNPLGSKLVTKLILPFLRLPAIKFKYTSYAEDASTRAIDQYQFDQETA
PLSTHPSAGLGYLRTRSYITNHPILGPQAQPTPVTARVIQPRTAGTSKQVYARLGVGGFVANDQYRSTDVSSGNRAGVNN
ARDVETIDIDTPGGKKLTVQPLFGSVTNDGRIHIKVKRSQGPEVQVAKGALDDRPPVREFARVERIERFASGKESGVKEL
DEFSGRAKMLSGFLEGLGEGQGKGE

Thanks
Fuyou

MrTomRod · 2020-08-27T14:14:29Z

The reason is that your fastas are not functionally annotated.

My script only works if your fastas look something like this:

>L022000001-T1 L022_000001 Description Gene 1
MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTS
>L022000001-T1 L022_000001 Description Gene 2
MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTS

sunnycqcn · 2020-08-27T15:40:05Z

I got it. Thanks, Fuyou

…

On Thu, Aug 27, 2020 at 8:14 AM Thomas Roder ***@***.***> wrote: The reason is that your fastas are not functionally annotated. My script only works if your fastas look something like this: >L022000001-T1 L022_000001 Description Gene 1 MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTS >L022000001-T1 L022_000001 Description Gene 2 MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTS — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#362 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AF3JCKCHUEJSBVI6NFZ4DNTSCZS5LANCNFSM4LSPLHOA> .

-- Fuyou Fu, Ph.D. Department of Botany and Plant Pathology Purdue University USA

harish0201 · 2021-01-12T16:08:27Z

Any such luck for Proteomes downloaded from uniprot?

MrTomRod · 2021-01-13T09:46:55Z

Any such luck for Proteomes downloaded from uniprot?

This is an example header:

>tr|A0A151G869|A0A151G869_LACPN Bacteriocin immunity protein OS=Lactiplantibacillus plantarum OX=1590 GN=AYO51_00560 PE=4 SV=1

If you change these headers by removing everything after OS=, it should work fine.

Example:

header.rsplit(' OS=', maxsplit=1)[0]

Result:

'>tr|A0A151G869|A0A151G869_LACPN Bacteriocin immunity protein'

gracielad · 2021-03-27T02:46:59Z

Hello MrTomRod,
I tried the script orthofider_plot.py (based on roary plot), but I I'm getting an error. I think it´s relate to library fire

I run this commnad python3 orthofinder_plots.py TREE ~/Species_Tree/SpeciesTree_rooted.txt ORTHOGROUPS_TSV ~/OrthoFinder/Results_Mar25/Orthogroups/Orthogroups.tsv OUT output
Could you help me how I can get this?

Follow the error.
Traceback (most recent call last):
File "orthofinder_plots.py", line 206, in
fire.Fire(OrthofinderPlots.create_plots)
File "/home/user/.local/lib/python3.6/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/user/.local/lib/python3.6/site-packages/fire/core.py", line 471, in _Fire
target=component.name)
File "/home/user/.local/lib/python3.6/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "orthofinder_plots.py", line 62, in create_plots
assert os.path.isdir(out)
AssertionError

Thanks!
All the best

Graci

MrTomRod · 2021-03-27T10:39:22Z

Hey Garci

Sorry, my script is not great as I did not invest too much time into it. It's a quick-and-dirty rewrite of Marco Galardinis script.

The error you're seeing is not related to the Fire library. You have to read Python stack traces from the bottom up. Maybe this is useful to you: Understanding the Python Traceback.

The relevant line is assert os.path.isdir(out). The problem is that the variable out does not point to an existing folder. Usage of my script is explained here.

python3 orthofinder_plots.py --tree ~/Species_Tree/SpeciesTree_rooted.txt --orthogroups_tsv ~/OrthoFinder/Results_Mar25/Orthogroups/Orthogroups.tsv --out output

Maybe all you have to do is mkdir output?

marius44 · 2022-08-15T10:37:06Z

No errors! But I just wanted to say THANK YOU!

Just to know, how does it calculate the core, soft-core, shell, and cloud genomes?

MrTomRod · 2022-08-15T10:48:33Z

Glad you find it useful.

I just copied the logic from Marco Galardini / roary_plots:

    CORE, SOFT, SHELL = (n_strains * f for f in [.99, .95, .15])

    core = ((og_count >= CORE) & (og_count <= n_strains)).sum()
    softcore = ((og_count >= SOFT) & (og_count < CORE)).sum()
    shell = ((og_count >= SHELL) & (og_count < SOFT)).sum()
    cloud = (og_count < SHELL).sum()

Hope that's clear enough.

marius44 · 2022-08-15T10:55:36Z

I think it's pretty clear.

Last thing, Is there a proper way to cite your script? I found this (https://guides.libraries.uc.edu/citing/code) but I would need your data.

MrTomRod · 2022-08-15T11:08:34Z

Thanks, that's very nice.

Author name: Thomas Roder
Date: 15. Aug 2022
Title of program: MrTomRod/orthofinder-tools
Code version: 0.0.2
Type: computer program
Web address: https://github.com/MrTomRod/orthofinder-tools

So:

Roder, T (2022) MrTomRod/orthofinder-tools computer program (Version 0.0.2). https://github.com/MrTomRod/orthofinder-tools

ViriatoII mentioned this issue Jul 27, 2020

Extra script: annotate Orthogroups (HOG) from fasta descriptions #434

Open

davidemms mentioned this issue Aug 18, 2020

[Question] What is the recommended way of annotating orthologs? #440

Open

jolespin mentioned this issue Jun 17, 2021

[Feature Request] Specify output directory #578

Open

Feature request: Additional output file with consensus gene name per orthogroup #362

Feature request: Additional output file with consensus gene name per orthogroup #362

Comments

MrTomRod commented Mar 24, 2020 • edited

davidemms commented May 11, 2020

MrTomRod commented May 11, 2020 • edited

davidemms commented May 15, 2020

MrTomRod commented May 15, 2020 • edited

davidemms commented May 15, 2020

davidemms commented May 15, 2020

ViriatoII commented Jun 17, 2020 • edited

MrTomRod commented Jun 18, 2020 • edited

davidemms commented Jun 18, 2020

MrTomRod commented Jun 18, 2020 • edited

ViriatoII commented Jun 18, 2020 • edited

jolespin commented Aug 18, 2020

MrTomRod commented Aug 18, 2020 • edited

jolespin commented Aug 18, 2020 • edited

MrTomRod commented Aug 18, 2020 • edited

jolespin commented Aug 18, 2020

MrTomRod commented Aug 19, 2020

sunnycqcn commented Aug 26, 2020

sunnycqcn commented Aug 26, 2020

MrTomRod commented Aug 26, 2020 • edited

sunnycqcn commented Aug 26, 2020

sunnycqcn commented Aug 26, 2020

MrTomRod commented Aug 27, 2020

sunnycqcn commented Aug 27, 2020

MrTomRod commented Aug 27, 2020

sunnycqcn commented Aug 27, 2020

MrTomRod commented Aug 27, 2020 • edited

sunnycqcn commented Aug 27, 2020 via email

harish0201 commented Jan 12, 2021

MrTomRod commented Jan 13, 2021 • edited

gracielad commented Mar 27, 2021

MrTomRod commented Mar 27, 2021 • edited

marius44 commented Aug 15, 2022

MrTomRod commented Aug 15, 2022

marius44 commented Aug 15, 2022

MrTomRod commented Aug 15, 2022 • edited

MrTomRod commented Mar 24, 2020 •

edited

MrTomRod commented May 11, 2020 •

edited

MrTomRod commented May 15, 2020 •

edited

ViriatoII commented Jun 17, 2020 •

edited

MrTomRod commented Jun 18, 2020 •

edited

MrTomRod commented Jun 18, 2020 •

edited

ViriatoII commented Jun 18, 2020 •

edited

MrTomRod commented Aug 18, 2020 •

edited

jolespin commented Aug 18, 2020 •

edited

MrTomRod commented Aug 18, 2020 •

edited

MrTomRod commented Aug 26, 2020 •

edited

MrTomRod commented Aug 27, 2020 •

edited

MrTomRod commented Jan 13, 2021 •

edited

MrTomRod commented Mar 27, 2021 •

edited

MrTomRod commented Aug 15, 2022 •

edited