## Preface

The purpose of this script is to match the de novo IDs from Ren et al. 2019 to Medicago truncatula RefSeq IDs (M. truncatula is a close, well-annotated relative). We will use the FASTA file provided by Ren et al. 2019 as the query and the RefSeq proteins as the subject. Because the provided FASTA file consists of nucleotide sequences and our RefSeq subject file consists of proteins, we must use blastx, which translates each nucleotide query in all six reading frames before searching the protein database. From there we will map the BLAST hits to orthogroups.
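To make the nucleotide-versus-protein point concrete, here is a minimal sketch of six-frame translation using only the standard library. It is illustrative only: blastx does this internally, and the helper names (`reverse_complement`, `translate`, `six_frames`) and the toy sequence are assumptions made for this example, not part of the analysis.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Translate a nucleotide sequence in three forward and three reverse frames."""
    frames = {}
    for strand, strand_seq in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


# toy sequence, not taken from the Ren et al. data
frames = six_frames("ATGGCCTAA")
print(frames)
```

Each of the six translations is a candidate protein that can be compared against the RefSeq protein database, which is why a nucleotide query needs blastx rather than blastp.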

The working directory is the jobs folder.

In [1]:
import pandas as pd
from Bio import SeqIO, SearchIO
import os

## FASTA Input

Here we load the Sesbania cannabina sequences from Ren et al. 2019.

In [2]:
#read in fasta files
ren = SeqIO.to_dict(SeqIO.parse("../raw_data/ren_2019/GSE99532_unigene.fasta", "fasta"))

#extract ids
ren_ids = list(ren.keys())

#extract sequences. doing this iterative search will keep both the ids and sequences in the same order for easier reference
ren_seq = []

for i in ren_ids:
    ren_seq.append(ren[i].seq)

## BLAST Alignment

BLAST the Sesbania sequences against the M. truncatula RefSeq proteins. The results are saved to disk because the search is computationally intensive. We will use the M. truncatula database created for the Garcia et al. 2017 analyses.

In [3]:
"""
No need to re-run this because it takes forever and I did it already.
"""
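#construct directory to store blast alignments (exist_ok avoids an error if the folder is already there)
os.makedirs("../processed_data/Q-ren2019_S-medicagoREFSEQ", exist_ok = True)

#write out each sesbania sequence to its own file so blastx can be run one query at a time
os.makedirs("../raw_data/ren2019_denovo_seq", exist_ok = True)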

for record in SeqIO.parse("../raw_data/ren_2019/GSE99532_unigene.fasta", "fasta"):
    SeqIO.write(record, "../raw_data/ren2019_denovo_seq/" + record.id + ".FNA", "fasta")

seq_files = os.listdir("../raw_data/ren2019_denovo_seq")

In [4]:
print(len(seq_files))
print(len(ren_ids))
print(len(set(ren_ids)))
print(seq_files[0])

290972
290972
290972
Cluster-45083.171926.FNA


In [5]:
"""
I already ran this. No need to run it again.
"""
#run blast alignments and save to disk
db = "../blast_db/medicago_truncatula"
for i in seq_files:
    cmd = "blastx -query ../raw_data/ren2019_denovo_seq/" + i + " -db " + db + " -outfmt 7 -out ../processed_data/Q-ren2019_S-medicagoREFSEQ/" + i + ".txt"
    os.system(cmd)
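As an alternative to concatenating a shell string, the same blastx call can be sketched as an argument list passed to `subprocess.run`, which avoids quoting problems and raises an error if blastx fails. The helper name `build_blastx_cmd` is hypothetical; the flags are the ones used in the cell above. The actual run is left commented out because it requires BLAST+ on the PATH.

```python
import subprocess
from pathlib import Path


def build_blastx_cmd(query, db, out_dir):
    """Build the blastx argument list for one query file (hypothetical helper)."""
    query = Path(query)
    out_path = Path(out_dir) / (query.name + ".txt")
    return [
        "blastx",
        "-query", str(query),
        "-db", str(db),
        "-outfmt", "7",  # tabular output with comment lines
        "-out", str(out_path),
    ]


cmd = build_blastx_cmd(
    "../raw_data/ren2019_denovo_seq/Cluster-45083.171926.FNA",
    "../blast_db/medicago_truncatula",
    "../processed_data/Q-ren2019_S-medicagoREFSEQ",
)
print(cmd)

# Uncomment to run; check=True raises CalledProcessError on a non-zero exit code
# subprocess.run(cmd, check=True)
```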

## Extracting BLAST Data

In [6]:
#iterate through blast files and save most relevant information
ren_gene = []
refseq_gene = []
ident_pcts = []
aln_spans = []
evalues = []

file_prefix = "../processed_data/Q-ren2019_S-medicagoREFSEQ/"
files = os.listdir(file_prefix)

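for i in files:
    try:
        blast_in = SearchIO.read(file_prefix + i, "blast-tab", comments = True)
        #top-ranked HSP of the top-ranked hit
        top_hsp = blast_in[0][0]
        row = (top_hsp.query_id, top_hsp.hit_id, top_hsp.ident_pct, top_hsp.aln_span, top_hsp.evalue)
    except Exception:
        #skip queries without a usable top hit (e.g. no hits found or an unparsable file)
        continue
    #append only after every field was read so the lists stay the same length
    ren_gene.append(row[0])
    refseq_gene.append(row[1])
    ident_pcts.append(row[2])
    aln_spans.append(row[3])
    evalues.append(row[4])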
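For reference, the fields pulled out above correspond to columns of the tabular output. The sketch below reads the first hit line by hand, assuming the default 12-column layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `top_hit` and the sample lines are made up for illustration and are not real results.

```python
def top_hit(lines):
    """Return the first hit in commented BLAST tabular output, or None if there is none.

    Assumes the default 12-column layout; hits are listed best first.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        return {
            "query_id": fields[0],
            "hit_id": fields[1],
            "ident_pct": float(fields[2]),
            "aln_span": int(fields[3]),
            "evalue": float(fields[10]),
        }
    return None


# illustrative lines with hypothetical IDs and values
sample = [
    "# Query: Cluster-1.0",
    "# 2 hits found",
    "Cluster-1.0\tXP_000000001.1\t85.5\t120\t17\t0\t1\t360\t10\t129\t1e-50\t180",
    "Cluster-1.0\tXP_000000002.1\t60.0\t100\t40\t0\t1\t300\t5\t104\t1e-10\t60.1",
]
best = top_hit(sample)
no_hit = top_hit(["# Query: Cluster-2.0", "# 0 hits found"])
print(best, no_hit)
```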

In [7]:
#convert blast info lists into a pandas dataframe
blast_df = pd.DataFrame(list(zip(ren_gene, refseq_gene, ident_pcts, aln_spans, evalues)),
                       columns = ["sesbania_gene", "refseq_gene", "percent_identity", "alignment_length", "evalue"])
blast_df.to_csv("../processed_data/20210713_sesbania_blast_results.tsv", sep = "\t")

In [8]:
blast_df = pd.read_csv("../processed_data/20210713_sesbania_blast_results.tsv", sep = "\t", index_col = 0)

#I'm choosing a relatively stringent requirement of e-value < 0.0001 as an accepted cut-off
blast_df["evalue_acceptance"] = blast_df.evalue < 0.0001

## Orthogroup Matching

In [9]:
#read in orthogroup data
ortho_meta = pd.read_csv("../20200324_genome_analyses/metadata/Orthogroups.csv", sep = "\t")

In [10]:
#create dataframe matching medicago truncatula proteins with their orthogroup
ortho_filt = ortho_meta[["Unnamed: 0", "AM_refseq_medicago_truncatula"]]
ortho_filt = ortho_filt.dropna()

#match proteins to orthogroups
orthogroup = []
protein = []

for i in range(0,len(ortho_filt.index)):
    holder = ortho_filt.iloc[i,1]
    for j in holder.split(", "):
        orthogroup.append(ortho_filt.iloc[i,0])
        protein.append(j)

refseq_ortho = pd.DataFrame(list(zip(orthogroup, protein)),
                          columns = ["orthogroup", "refseq_gene"])
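The same wide-to-long reshaping can also be sketched with `str.split` and `DataFrame.explode` (available in pandas 0.25 and later). The toy orthogroup and protein IDs below are made up; only the two column names come from the cell above.

```python
import pandas as pd

# toy stand-in for ortho_filt; IDs are hypothetical
toy = pd.DataFrame({
    "Unnamed: 0": ["OG0000001", "OG0000002"],
    "AM_refseq_medicago_truncatula": ["XP_1.1, XP_2.1", "XP_3.1"],
})

long_form = (
    toy.rename(columns={"Unnamed: 0": "orthogroup",
                        "AM_refseq_medicago_truncatula": "refseq_gene"})
       # turn each comma-separated string into a list of protein IDs
       .assign(refseq_gene=lambda d: d["refseq_gene"].str.split(", "))
       # one row per (orthogroup, protein) pair
       .explode("refseq_gene")
       .reset_index(drop=True)
)
print(long_form)
```

This yields one row per protein with its orthogroup repeated, which is the shape the nested loop builds.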

In [17]:
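#inner merge: blast hits whose refseq protein is absent from refseq_ortho are dropped
blast_df2 = pd.merge(blast_df, refseq_ortho, on = "refseq_gene")
#hits that fail the e-value cut-off are not treated as medicago matches
blast_df2.loc[~blast_df2.evalue_acceptance, "orthogroup"] = "not_in_medicago"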

#This dataframe has multiple entries for each protein because ENA marks all the different kinds of accession in different rows for the same gene
blast_df2.to_csv("../processed_data/20210713_sesbania_cannabina_blast_with_orthogroup.tsv", sep = "\t")
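Because `pd.merge` defaults to an inner join, any BLAST hit whose RefSeq protein has no orthogroup entry disappears from the merged table. A minimal sketch of how to see what is dropped, using a left join with `indicator=True` on made-up toy data (all IDs and values below are hypothetical):

```python
import pandas as pd

# toy stand-ins for blast_df and refseq_ortho
blast_toy = pd.DataFrame({
    "sesbania_gene": ["g1", "g2", "g3"],
    "refseq_gene": ["XP_1.1", "XP_2.1", "XP_9.1"],
    "evalue": [1e-50, 0.5, 1e-20],
})
ortho_toy = pd.DataFrame({
    "orthogroup": ["OG0000001", "OG0000001"],
    "refseq_gene": ["XP_1.1", "XP_2.1"],
})

# default inner join keeps only rows present in both tables
inner = pd.merge(blast_toy, ortho_toy, on="refseq_gene")

# left join keeps every blast hit; _merge records where each row came from
left = pd.merge(blast_toy, ortho_toy, on="refseq_gene", how="left", indicator=True)
unmatched = left.loc[left["_merge"] == "left_only", "sesbania_gene"].tolist()
print(len(inner), len(left), unmatched)
```

Whether the dropped genes should be kept and labeled is an analysis decision; the sketch only shows how to count them.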