Feature request: Additional output file with consensus gene name per orthogroup #362
Comments
Hi, this looks nice, could you talk me through this a little more? From what I can see from the script, it takes the 'description' attribute for each gene by reading the FASTA file using Bio.SeqIO and uses anything in there as a possible name. Do you know how general purpose it is? What sources of FASTA files does it give meaningful gene names for?
I think it's a nice idea. In designing OrthoFinder everything has to be general purpose, so I've avoided making assumptions about the description lines in FASTA files. I only work with the sequence data, which I know will be in the form I expect. So a consideration for me would be what proportion of users it would work for and how much support I'd be called on to give if it didn't work for someone's dataset.
I think if it is general purpose enough, then options would include putting it in the tools directory for OrthoFinder, or I could pull together a page on the OrthoFinder website listing external tools people have developed for OrthoFinder and list it there. Another step I'd probably take if I included it in the tools directory would be to remove the dependencies on biopython, fire and pandas. I love using those libraries myself, but for the OrthoFinder userbase I'd generally try to reduce the dependencies as much as possible.
Let me know your thoughts on this. Thanks! |
Exactly! Then, it checks how frequent each name is. It's very fast and little can go wrong, since the fasta format is quite simple.
No, but it was one of the first things I did with your output. It gives one a quick-and-dirty description of what the OG's role might be.
I'm working with prokaryotes, and both Prokka as well as NCBI-annotated files (PGAP pipeline) give compatible output. These are the two most common ones, I think.
I think a list of links to tools is the best option because it makes clear that these scripts aren't your responsibility and you don't have to be involved if the scripts are updated.
If you want me to, I'd be happy to remove the dependencies. Btw: If you looked at my repo, you will have noticed that I also made a script that mimics |
Hi, I had a play with the script using Ensembl proteomes but I couldn't get it to work. If it could support those, then I think that could increase its usefulness further. But, as an added complication, with these I advise users to run a script which extracts only the longest transcript variants for input into OrthoFinder: https://davidemms.github.io/orthofinder_tutorials/running-an-example-orthofinder-analysis.html The other source I use regularly myself is Phytozome; I don't know how that ranks in terms of overall popularity. I'll try to find some time in the coming weeks to put together a list of tools people have developed for OrthoFinder and will then put it up on the webpage. All the best |
I don't fully understand... How would you design the solution? Add an Ensembl mode which translates the gene name to a description using their API? (Or automatically detect Ensembl IDs, since they follow a simple pattern.) |
Putting aside the primary_transcripts.py script for a minute. If I download a few proteomes from Ensembl and run OrthoFinder on them and then the orthogroup_to_gene_name.py script on the results, the lines I get look like this:
so it's not pulling out just the gene names for Ensembl files, I think it's giving the full line? I think getting it to work with these files is the main challenge. |
The primary transcripts script is only a minor complication. If the user ran the primary_transcripts.py script their input to OrthoFinder would only contain the gene names ( *E.g Ensembl fasta file: ftp://ftp.ensembl.org/pub/release-100/fasta/danio_rerio/pep/Danio_rerio.GRCz11.pep.all.fa.gz |
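One way the Ensembl case discussed above could be handled is to pull individual `key:value` fields out of the header instead of using the whole line. The sketch below is only an illustration, not part of orthogroup_to_gene_name.py. It assumes the usual Ensembl peptide header layout, with `gene:`, `gene_symbol:` and a trailing `description:` field; the thread itself does not show a header, so the sample line and the function name `parse_ensembl_header` are made up.

```python
import re

# Assumption: Ensembl peptide headers carry space-separated key:value fields,
# with "description:" last and running to the end of the line.
FIELD = re.compile(r"\b(gene|gene_symbol):(\S+)")
SOURCE_TAG = re.compile(r"\s*\[Source:[^\]]*\]\s*$")


def parse_ensembl_header(header):
    """Extract gene ID, gene symbol and description from one header line.

    Fields that are absent from the header are simply missing from the result.
    """
    # Split off the free-text description first so it is not scanned for fields.
    front, sep, description = header.lstrip(">").partition(" description:")
    fields = dict(FIELD.findall(front))
    if sep:
        # Drop the trailing "[Source:...]" cross-reference, if present.
        fields["description"] = SOURCE_TAG.sub("", description).strip()
    return fields


# Hypothetical header written for this example (IDs are invented).
example = (
    ">ENSDARP00000000001.1 pep chromosome:GRCz11:1:100:200:1 "
    "gene:ENSDARG00000000001.1 transcript:ENSDART00000000001.1 "
    "gene_biotype:protein_coding transcript_biotype:protein_coding "
    "gene_symbol:abc1 description:example protein 1 [Source:ZFIN;Acc:ZDB-GENE-000000-1]"
)
print(parse_ensembl_header(example))
```

Headers without a `gene_symbol:` or `description:` field would yield a dictionary without those keys, so a caller would need a fallback such as the gene ID.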
This is awesome! The outputs of OrthoFinder have been slightly updated now, do you figure you could update your scripts to accommodate that? I tried a bit but struggled. |
Glad someone else finds it useful. If your fasta files contain descriptions, it should work. I used it recently on the output of the most recent OrthoFinder without an issue. If you have a problem, create an issue on my repo... Also, I wrote a tiny how-to in README.md, in case you missed it. I'll take some time today to run @davidemms's tutorial and get my script working with that output. |
It'd be great to see it working on the tutorial data, Ensembl is widely used and I'd love to get to try it on some of my analyses! |
I had an accident and thought I'd spend some of my free time working on non-stressful projects like this, but I just learned that this (screen time) may slow my recovery or even incur long-term consequences. So I won't be working on it for the immediate future, sorry! @davidemms Btw, I've been incorporating OrthoFinder into a PhD sub-project, a tool I named OpenGenomeBrowser. If you're interested, test out an early prototype here! This is where OrthoFinder comes into play: Trees view. |
Safe recovery @MrTomRod! Maybe I can adapt the scripts myself (also the roary plots, I meant). If I do, I'll write here! |
@MrTomRod I'm getting an error: Does this ring any bells? |
It does! I forgot to document the recent changes I made to the script. Sorry. It should work now if you follow the manual: https://github.com/MrTomRod/orthofinder-tools/blob/master/README.md |
Awesome, I'm going to check it out now. Am I using the wrong file?
Also, where do the annotations come from as I'm not seeing them in the dataframe? Thanks! |
Yes, you have to use the new N0 file.
The annotations come from the fasta files. ( |
Ok cool, I just got it to run! So basically, we provide annotations as the description? I have a de-novo genome so I'll need to add my own annotations for some of these. btw, thank you for creating this script! |
for each gene that was assigned to an orthogroup, the script takes the annotation from the protein fasta. then, it selects the most common annotation as the orthogroup's annotation.
keep in mind that with something like GO-terms/EC-numbers/KEGG-annotations, the logic of my script would not suffice. there, the logic should be: assign a GO-term to the orthogroup if at least 50% (or whatever) of the genes in the orthogroup have this GO-term.
no worries, i made it quick and dirty for myself and decided to share it. |
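The two rules described in this comment can be sketched as follows. This is a minimal illustration using only the standard library, not the actual code of orthogroup_to_gene_name.py; the function names and the sample data are made up.

```python
from collections import Counter


def most_common_annotation(annotations):
    """Return (annotation, fraction of genes carrying it) for one orthogroup."""
    if not annotations:
        return None, 0.0
    name, hits = Counter(annotations).most_common(1)[0]
    return name, hits / len(annotations)


def terms_above_threshold(gene_terms, min_fraction=0.5):
    """Return the terms carried by at least min_fraction of the genes.

    gene_terms holds one collection of terms (e.g. GO terms) per gene.
    """
    if not gene_terms:
        return set()
    # set() ensures a term is counted at most once per gene.
    counts = Counter(term for terms in gene_terms for term in set(terms))
    needed = min_fraction * len(gene_terms)
    return {term for term, hits in counts.items() if hits >= needed}


# Illustrative data for a single orthogroup.
names = ["DNA gyrase subunit A", "DNA gyrase subunit A", "hypothetical protein"]
print(most_common_annotation(names))

go_terms = [{"GO:0003677", "GO:0005524"}, {"GO:0003677"}, {"GO:0005524"}, {"GO:0003677"}]
print(terms_above_threshold(go_terms, min_fraction=0.75))
```

The first rule always yields exactly one name per orthogroup, while the threshold rule can yield several terms or none, which is why it suits multi-valued annotations such as GO terms better.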
Hello MrTomRod,
I used your scripts for plotting. I can get the results, with the following warning. |
|
You can't find the folder Are you using the latest version of OrthoFinder? Because older versions don't create it. Here are the instructions for my script. |
I ran into this problem because I had not updated my OrthoFinder. I just wanted to let you know. |
Hello MrTomRod, |
Please show me the first 5 lines of one of your fasta files. |
ps/N0.tsv |
no, the first 5 lines of one of your FASTA files. |
Thanks |
The reason is that your fastas are not functionally annotated. My script only works if your fastas look something like this:
>L022000001-T1 L022_000001 Description Gene 1
MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTS
>L022000001-T1 L022_000001 Description Gene 2
MFCPMDPPVEDSVVRTGENRWSLGSLYVCELVYDVPNDAMASWEADGNTYCIRKSSKDEQPSTVLGDSGSNRIHHAGTS
|
I got it.
Thanks,
Fuyou
|
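To check whether a FASTA file carries descriptions at all, the header lines can be read with the standard library alone. The sketch below is an illustration, not the script's actual code (which, per the thread, reads the files with Bio.SeqIO); `read_fasta_descriptions` and the sample records are hypothetical.

```python
def read_fasta_descriptions(lines):
    """Map sequence ID -> description for every header line.

    The ID is the first whitespace-delimited token after '>'; whatever
    follows it is the description (an empty string if nothing follows).
    """
    descriptions = {}
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].strip().split(None, 1)
        if not parts:
            continue  # a bare '>' with no ID
        descriptions[parts[0]] = parts[1] if len(parts) > 1 else ""
    return descriptions


# Illustrative records: one annotated, one without a description.
fasta_lines = [
    ">gene1 L022_000001 Description Gene 1",
    "MFCPMDPPVEDSVVRTGENRW",
    ">gene2",
    "MFCPMDPPVEDSVVRTGENRW",
]
found = read_fasta_descriptions(fasta_lines)
missing = [seq_id for seq_id, text in found.items() if not text]
print(found)
print("IDs without a description:", missing)
```

With a real file, `lines` can simply be the open file handle, for example `with open(path) as handle: read_fasta_descriptions(handle)`.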
Any such luck for Proteomes downloaded from uniprot? |
This is an example header:
If you change these headers by removing everything after Example:
Result:
|
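Header clean-up of this kind could be scripted roughly as below. The thread does not preserve the example header or the exact cut point, so this sketch rests on assumptions: it presumes the standard UniProtKB FASTA layout (`>db|Accession|EntryName ProteinName OS=... OX=... GN=... PE=... SV=...`) and cuts at the first of those tags. The function name is hypothetical.

```python
import re

# Assumption: the protein name ends where the first UniProt tag begins.
UNIPROT_TAGS = re.compile(r"\s(?:OS|OX|GN|PE|SV)=")


def uniprot_protein_name(header):
    """Return the protein name from a UniProtKB-style header line."""
    parts = header.lstrip(">").split(None, 1)
    if len(parts) < 2:
        return ""  # header holds only an identifier
    return UNIPROT_TAGS.split(parts[1], maxsplit=1)[0].strip()


# Example header in the assumed layout.
header = ">sp|P0A7G6|RECA_ECOLI Protein RecA OS=Escherichia coli (strain K12) OX=83333 GN=recA PE=1 SV=2"
print(uniprot_protein_name(header))
```

Rewriting each header as the identifier followed by this name would leave only the protein name as the description.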
Hello MrTomRod, I ran this command:
python3 orthofinder_plots.py TREE ~/Species_Tree/SpeciesTree_rooted.txt ORTHOGROUPS_TSV ~/OrthoFinder/Results_Mar25/Orthogroups/Orthogroups.tsv OUT output
The error follows. Thanks! Graci |
Hey Garci, sorry, my script is not great as I did not invest too much time into it. It's a quick-and-dirty rewrite of Marco Galardini's script. The error you're seeing is not related to the Fire library. You have to read Python stack traces from the bottom up. Maybe this is useful to you: Understanding the Python Traceback. The relevant line is python3 orthofinder_plots.py --tree ~/Species_Tree/SpeciesTree_rooted.txt --orthogroups_tsv ~/OrthoFinder/Results_Mar25/Orthogroups/Orthogroups.tsv --out output Maybe all you have to do is |
No errors! But I just wanted to say THANK YOU! Just to know, how does it calculate the core, soft-core, shell, and cloud genomes? |
Glad you find it useful. I just copied the logic from Marco Galardini / roary_plots:
CORE, SOFT, SHELL = (n_strains * f for f in [.99, .95, .15])
core = ((og_count >= CORE) & (og_count <= n_strains)).sum()
softcore = ((og_count >= SOFT) & (og_count < CORE)).sum()
shell = ((og_count >= SHELL) & (og_count < SOFT)).sum()
cloud = (og_count < SHELL).sum()
Hope that's clear enough. |
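For readers without pandas at hand, the same thresholds can be written as a plain-Python sketch. It assumes `og_count` is a sequence holding, for each orthogroup, the number of strains in which it occurs; the function name and the sample numbers are made up for illustration.

```python
def pangenome_counts(og_count, n_strains):
    """Count orthogroups per category using the 99% / 95% / 15% cut-offs."""
    core_min, soft_min, shell_min = (n_strains * f for f in (0.99, 0.95, 0.15))
    counts = {"core": 0, "softcore": 0, "shell": 0, "cloud": 0}
    for n in og_count:
        # Assumes n never exceeds n_strains, as in the original expression.
        if n >= core_min:
            counts["core"] += 1
        elif n >= soft_min:
            counts["softcore"] += 1
        elif n >= shell_min:
            counts["shell"] += 1
        else:
            counts["cloud"] += 1
    return counts


# Illustrative input: 7 orthogroups across 100 strains.
example_counts = [100, 100, 97, 50, 50, 50, 3]
print(pangenome_counts(example_counts, n_strains=100))
```

With 100 strains, an orthogroup is core if present in at least 99 of them, soft-core from 95 up to (but not including) 99, shell from 15 up to 95, and cloud below 15.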
I think it's pretty clear. One last thing: is there a proper way to cite your script? I found this (https://guides.libraries.uc.edu/citing/code) but I would need your data. |
Thanks, that's very nice.
So: Roder, T (2022) MrTomRod/orthofinder-tools computer program (Version 0.0.2). https://github.com/MrTomRod/orthofinder-tools |
Hi!
I wrote a quick-and-dirty script (orthogroup_to_gene_name.py) that takes Orthogroups.tsv and turns it into Orthogroup_BestNames.tsv. This file looks like this:
I find it quite useful and, since it takes no time to generate, I wanted to suggest you add it to your tool!
Best, MrTomRod