Annotating Orthogroups: Blasting against set of known genes #451

ViriatoII · 2020-09-04T11:24:21Z

Hey,

Many people are asking about orthogroup annotation (#373 , #411 , #440 ) so I'm sharing my scripts.

I have finished a script that picks up the resulting N0 of Orthofinder and blasts a gene of each orthogroup against a reference set of well annotated genes (this is a tblastn, so it blasts against DNA genes, not proteins). It outputs the orthogroups and the best hit names, as well as identity, length and coordinates of alignment.

The script is here: (the download will come as .txt, please change that to .py)
annotate_ogroups_vs_ref.py

You run it like this:
annotate_ogroups_vs_ref.py N0.tsv ortho_dir/ ref_genes.fasta

ortho_dir contains fastas for all species in N0. It's the input dir of the Orthofinder run.
ref_genes.fasta is a set of well annotated genes. For example all the arabidopsis genes

Also, do this first: makeblastdb -dbtype nucl -in reference_genes.fasta

We have a script that extrapolates the annotations of the genes in each orthogroup to that orthogroup annotation, in case you have many well studied proteins in your Orthofinder run

Happy annotating,
Ricardo

The text was updated successfully, but these errors were encountered:

davidemms · 2020-09-15T14:10:32Z

Hi Ricardo

Thanks, that's really great! I will put together a page on the tutorials site soon so will mention it there. It'd be great to hear from people here who are finding Ricardo's tool useful too.

All the best
David

guo-cheng · 2022-10-21T23:55:26Z

Hey,

Many people are asking about orthogroup annotation (#373 , #411 , #440 ) so I'm sharing my scripts.

I have finished a script that picks up the resulting N0 of Orthofinder and blasts a gene of each orthogroup against a reference set of well annotated genes (this is a tblastn, so it blasts against DNA genes, not proteins). It outputs the orthogroups and the best hit names, as well as identity, length and coordinates of alignment.

The script is here: (the download will come as .txt, please change that to .py) annotate_ogroups_vs_ref.py

You run it like this: annotate_ogroups_vs_ref.py N0.tsv ortho_dir/ ref_genes.fasta

ortho_dir contains fastas for all species in N0. It's the input dir of the Orthofinder run.

ref_genes.fasta is a set of well annotated genes. For example all the arabidopsis genes

Also, do this first: makeblastdb -dbtype nucl -in reference_genes.fasta

We have a script that extrapolates the annotations of the genes in each orthogroup to that orthogroup annotation, in case you have many well studied proteins in your Orthofinder run

Happy annotating, Ricardo

Hi Ricardo,

Thank you for your scripts. When i used it, i find the running speed is too low. Do you have any idea to speed up the annotation steps?

All the best,
Guo-Cheng

ViriatoII · 2022-10-25T16:25:07Z

Dear @guo-cheng;

I wrote this several years ago and wasn't such a good python programmer back then. I don't have time to rewrite this, but I think it is possible to make it better, yes. I'll say how:

This for loop is unnecessary:

for seq_record in SeqIO.parse(infasta, "fasta"):
				if seq_record.id == gene:

I would rather replace it by having a look_up dictionary prepared before any loop. Probably also have the SeqIO parser outside:

parser = SeqIO.parse(infasta, "fasta") 
lookup_dic = {seq.id : i for i,seq in enumerate(parser)}                      # dict should look like this: {"gene1": 0, "gene2":1 .......}

for row in range(0, len(df)):                 
   (......)
  # and directly extract the correct SeqIO object 
   seq_record =    parser [lookup_dic [gene]]
  (.....)

Hope this helps! :)

Kind regards,
Ricardo

guo-cheng · 2022-10-26T00:20:46Z

font{ line-height: 1.6; } ul,ol{ padding-left: 20px; list-style-position: inside; } OK, I get it!Thanks for your immediately response.Best regardsGuo-Cheng guochengli666 ***@***.***

…

---- Replied Message ---- From Ricardo ***@***.***> Date 10/26/2022 00:25 To ***@***.***> Cc ***@***.***> , ***@***.***> Subject Re: [davidemms/OrthoFinder] Annotating Orthogroups: Blasting against set of known genes (#451) Dear @guo-cheng; I wrote this several years ago and wasn't such a good python programmer back then. I don't have time to rewrite this, but I think it is possible to make it better, yes. I'll say how: This for loop is unnecessary: if seq_record.id == gene: I would rather replace it by having a look_up dictionary prepared before any loop. Probably also have the SeqIO parser outside: `parser = SeqIO.parse(infasta, "fasta") lookup = {seq.id : i for i,seq in enumerate(parser)} # dict should look like this: {"gene1": 0, "gene2":1 .......} for row in range(0, len(df)): ......... and directly extract the correct SeqIO object seq_record = parser [lookup_dic[gene]] ... ` Hope this helps! :) Kind regards, Ricardo —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Annotating Orthogroups: Blasting against set of known genes #451

Annotating Orthogroups: Blasting against set of known genes #451

ViriatoII commented Sep 4, 2020 •

edited

davidemms commented Sep 15, 2020

guo-cheng commented Oct 21, 2022

ViriatoII commented Oct 25, 2022 •

edited

guo-cheng commented Oct 26, 2022 via email

Annotating Orthogroups: Blasting against set of known genes #451

Annotating Orthogroups: Blasting against set of known genes #451

Comments

ViriatoII commented Sep 4, 2020 • edited

davidemms commented Sep 15, 2020

guo-cheng commented Oct 21, 2022

ViriatoII commented Oct 25, 2022 • edited

guo-cheng commented Oct 26, 2022 via email

ViriatoII commented Sep 4, 2020 •

edited

ViriatoII commented Oct 25, 2022 •

edited