Author: Dan Shea  
Date: 2019.10.31  
#### Annotation of the gtf file using `blastn` and `tblastx` results

I examined expression in 24 data sets (tissue type / condition) to derive a transcriptome; the sample types are listed in the table below.  
To do this, QC'ed reads were first aligned to a _de novo_ assembled reference of the Hitomebore rice cultivar using `hisat2`.  
The resulting `.bam` files were then passed to `stringtie` to generate a `.gtf` file for each aligned sample.  
After that, `TACO` was used to merge the `.gtf` files into a single annotation of the transcriptome.  
A FASTA-formatted file containing the CDS of each transcript was then generated from this merged GTF using `gffread` and the reference FASTA file.  
These sequences were then compared against a nucleotide BLAST database using both `blastn` and `tblastx`, keeping only the best subject hit for each query.  
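
The exact BLAST invocations are not recorded in this workbook. As a rough sketch only, the helper below builds (but does not run) a BLAST+ command whose tabular `-outfmt 6` columns match the column names used when the results are loaded later on. The function name `blast_command`, the file names `transcripts.fa` and `rice_cds_db`, and the use of `-max_target_seqs 1` are assumptions made for illustration, not the commands that were actually run.

```python
import shlex

# Output columns shared by both searches (standard BLAST+ format specifiers)
COMMON_COLS = ['qseqid', 'sseqid', 'pident', 'qlen', 'slen', 'length']
TAIL_COLS = ['qcovs', 'evalue', 'bitscore']

def blast_command(program, query_fasta, db, out_file, max_targets=1):
    """Build a BLAST+ command line as a list of arguments (hypothetical helper)."""
    # blastn reports the subject strand; tblastx reports query/subject frames
    extra = {'blastn': 'sstrand', 'tblastx': 'frames'}[program]
    outfmt = ' '.join(['6'] + COMMON_COLS + [extra] + TAIL_COLS)
    return [program, '-query', query_fasta, '-db', db,
            '-outfmt', outfmt, '-max_target_seqs', str(max_targets),
            '-out', out_file]

# Hypothetical input and database names, for illustration only
cmd = blast_command('blastn', 'transcripts.fa', 'rice_cds_db', 'run_blastn.out')
print(shlex.join(cmd))
```

Even with a target limit, a query can still produce several rows (one per HSP), which is why the results are reduced to a single best row per query later in the workbook.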

This workbook loads the BLAST results (both `blastn` and `tblastx`) and the merged GTF so that the BLAST results can be appended to the `group` field of each transcript record.  
For each search, we record the best hit and some key attributes of that hit.

| Samples |
| ------ |
| 24h_2012 |
| 24h_DDW |
| 24h_TH68 |
| 48h_2012 |
| 48h_DDW |
| 48h_TH68 |
| AM |
| C_leaf |
| EM |
| F_leaf |
| Flagleaf |
| H_leaf |
| N_embryo |
| N_leaf |
| N_leafsheath |
| N_root |
| RT |
| S_YoungPanicle |
| YL |
| YP13 |
| YP36 |
| YP610 |
| YP_Booting |
| YoungPanicle |

In [1]:
# The files we will be using for annotation
gtf_file = 'assembly.gtf'
blastn_file = 'run_blastn.out'
tblastx_file = 'run_tblastx.out'

In [2]:
# We will load them as pandas dataframes since the output is tab-delimited
import pandas as pd

In [3]:
gtf_cols = ['ref', 'source', 'method', 'start', 'stop', 'score', 'strand', 'phase', 'group']
gtf_df = pd.read_csv(gtf_file, sep='\t', header=None, names=gtf_cols)

In [4]:
gtf_df

Unnamed: 0,ref,source,method,start,stop,score,strand,phase,group
0,Chr0_RaGOO,taco,transcript,18692,19325,1000,+,.,"tss_id ""TSS2""; locus_id ""L6""; abs_frac ""0.7589..."
1,Chr0_RaGOO,taco,transcript,18692,44423,318,+,.,"tss_id ""TSS2""; locus_id ""L6""; abs_frac ""0.2410..."
2,Chr0_RaGOO,taco,exon,18692,19325,1000,+,.,"tss_id ""TSS2""; locus_id ""L6""; transcript_id ""T..."
3,Chr0_RaGOO,taco,exon,18692,18958,318,+,.,"tss_id ""TSS2""; locus_id ""L6""; transcript_id ""T..."
4,Chr0_RaGOO,taco,transcript,35924,43783,1000,+,.,"tss_id ""TSS3""; locus_id ""L6""; abs_frac ""1.0000..."
...,...,...,...,...,...,...,...,...,...
444251,chr12_RaGOO,taco,exon,27645467,27645634,1000,+,.,"tss_id ""TSS46480""; locus_id ""L30590""; transcri..."
444252,chr12_RaGOO,taco,exon,27645467,27645634,744,+,.,"tss_id ""TSS46481""; locus_id ""L30590""; transcri..."
444253,chr12_RaGOO,taco,exon,27645467,27645634,129,+,.,"tss_id ""TSS46481""; locus_id ""L30590""; transcri..."
444254,chr12_RaGOO,taco,exon,27645467,27645634,91,+,.,"tss_id ""TSS46481""; locus_id ""L30590""; transcri..."
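
The `group` column shown above packs several `key "value";` pairs into one string. As a minimal sketch of how such a string could be unpacked by key name, the helper below (the name `parse_gtf_attributes` and the example identifiers are made up for illustration) turns it into a dictionary.

```python
import re

# Matches one `key "value"` pair inside a GTF attribute string
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def parse_gtf_attributes(group):
    """Return the key/value pairs of a GTF attribute string as a dict."""
    return dict(ATTR_RE.findall(group))

# Toy attribute string; the identifiers are illustrative, not taken from the data
example = 'tss_id "TSS_X"; locus_id "L_X"; transcript_id "TU_A";'
attrs = parse_gtf_attributes(example)
print(attrs['transcript_id'])
```

Looking attributes up by name avoids depending on the order in which a tool happens to write them.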


In [5]:
blastn_cols = ['qseqid', 'sseqid', 'pident', 'qlen', 'slen', 'length', 'sstrand', 'qcovs', 'evalue', 'bitscore']
blastn_df = pd.read_csv(blastn_file, sep='\t', header=None, names=blastn_cols)

In [6]:
blastn_df

Unnamed: 0,qseqid,sseqid,pident,qlen,slen,length,sstrand,qcovs,evalue,bitscore
0,TU2,LOC_Os09g01000.1,100.000,634,297,56,minus,9,1.810000e-21,104.0
1,TU3,LOC_Os09g01000.1,100.000,757,297,56,minus,7,2.180000e-21,104.0
2,TU592,LOC_Os01g57968.1,100.000,632,366,185,plus,29,3.510000e-93,342.0
3,TU589,LOC_Os01g57968.1,100.000,829,366,185,plus,22,4.650000e-93,342.0
4,TU594,LOC_Os01g57968.1,100.000,680,366,185,plus,27,3.790000e-93,342.0
...,...,...,...,...,...,...,...,...,...,...
95486,TU76010,LOC_Os12g44040.3,99.755,841,408,408,plus,49,0.000000e+00,749.0
95487,TU76011,LOC_Os12g44040.3,99.755,984,408,408,plus,41,0.000000e+00,749.0
95488,TU76007,LOC_Os12g44040.3,99.755,1460,408,408,plus,28,0.000000e+00,749.0
95489,TU76012,LOC_Os12g44040.3,99.755,1588,408,408,plus,26,0.000000e+00,749.0


In [7]:
tblastx_cols = ['qseqid', 'sseqid', 'pident', 'qlen', 'slen', 'length', 'frames', 'qcovs', 'evalue', 'bitscore']
tblastx_df = pd.read_csv(tblastx_file, sep='\t', header=None, names=tblastx_cols)

In [8]:
tblastx_df

Unnamed: 0,qseqid,sseqid,pident,qlen,slen,length,frames,qcovs,evalue,bitscore
0,TU2,LOC_Os09g01000.1,100.00,634,297,18,3/-1,9,3.810000e-06,51.0
1,TU2,LOC_Os09g01000.1,100.00,634,297,18,1/-2,9,7.200000e-06,50.1
2,TU2,LOC_Os09g01000.1,100.00,634,297,18,-1/2,9,3.540000e-05,47.8
3,TU2,LOC_Os09g01000.1,100.00,634,297,18,-3/1,9,9.190000e-05,46.4
4,TU2,LOC_Os09g01000.1,100.00,634,297,18,2/-3,9,4.470000e-04,44.1
...,...,...,...,...,...,...,...,...,...,...
1348047,TU76009,LOC_Os11g26350.1,98.75,2118,615,80,-1/-2,26,2.570000e-76,186.0
1348048,TU76009,LOC_Os11g26350.1,100.00,2118,615,24,-3/-3,26,2.570000e-76,59.3
1348049,TU76009,LOC_Os11g26350.1,100.00,2118,615,25,-3/-1,26,2.570000e-76,56.5
1348050,TU76009,LOC_Os11g26350.1,100.00,2118,615,15,-3/-1,26,2.570000e-76,36.8
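
The `tblastx` table above contains several rows per query (one per reported frame pair and alignment), so it has to be reduced to one row per query before it can be attached to a transcript. A minimal sketch of that reduction on toy data, assuming the highest `bitscore` defines the best hit (all values below are made up):

```python
import pandas as pd

# Toy tblastx-like table: two queries with several alignment rows each
hits = pd.DataFrame({
    'qseqid':   ['TU_A', 'TU_A', 'TU_A', 'TU_B', 'TU_B'],
    'sseqid':   ['S1',   'S1',   'S2',   'S3',   'S4'],
    'bitscore': [51.0,   50.1,   47.8,   36.8,   186.0],
})

# idxmax() gives, for each query, the index label of its highest-bitscore row
best_idx = hits.groupby('qseqid')['bitscore'].idxmax()
best = hits.loc[best_idx].reset_index(drop=True)
print(best)
```

When two rows tie on `bitscore`, `idxmax()` keeps the first one it encounters, so the row order of the input decides ties.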


In [9]:
# For each row in the gtf where method == 'transcript' we will add annotations
gtf_transcripts = gtf_df.loc[gtf_df['method'] == 'transcript', :]

# For blastn_df and tblastx_df we first groupby() the 'qseqid' and then return the idxmax() of the 'bitscore'
blastn_best = blastn_df.loc[blastn_df.groupby('qseqid')['bitscore'].idxmax()]
tblastx_best = tblastx_df.loc[tblastx_df.groupby('qseqid')['bitscore'].idxmax()]

for row in gtf_transcripts.itertuples():
    # parse out the transcript_id from the 'group' column by key name,
    # so the lookup does not depend on the attribute's position in the string
    transcript_id = row.group.split('transcript_id "')[1].split('"')[0]
    
    # find the best hits for that id in the blastn results
    try:
        tmp_bestn = blastn_best.loc[blastn_best['qseqid'] == transcript_id]
        sseqidn = tmp_bestn.loc[:, 'sseqid'].iloc[0]
        bitscoren = tmp_bestn.loc[:, 'bitscore'].iloc[0]
    except IndexError:
        # no blastn hit was found for this transcript
        sseqidn = 'none'
        bitscoren = 'none'
    
    # find the best hits for that id in the tblastx results
    try:
        tmp_bestx = tblastx_best.loc[tblastx_best['qseqid'] == transcript_id]
        sseqidx = tmp_bestx.loc[:, 'sseqid'].iloc[0]
        bitscorex = tmp_bestx.loc[:, 'bitscore'].iloc[0]
    except IndexError:
        # no tblastx hit was found for this transcript
        sseqidx = 'none'
        bitscorex = 'none'
    
    # append these values to the existing 'group' field for the given transcript_id
    annotation = ' blastn_sseqid "{}"; blastn_bitscore "{}"; tblastx_sseqid "{}"; tblastx_bitscore "{}";'.format(
        sseqidn, bitscoren, sseqidx, bitscorex)
    gtf_df.at[row.Index, 'group'] = row.group + annotation
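
The loop above filters the best-hit tables once per transcript, which can be slow for a GTF with many transcript rows. As an alternative sketch (not the approach used in this workbook), the best hits can be turned into dictionaries keyed by `qseqid` so that each transcript needs only a dictionary lookup. The function name `annotate` and the toy data are assumptions for illustration.

```python
import pandas as pd

def annotate(gtf, blastn_best, tblastx_best):
    """Append best-hit attributes to the 'group' field of transcript rows."""
    out = gtf.copy()
    is_tx = out['method'] == 'transcript'
    # Pull transcript_id out of the attribute string by name
    tid = out.loc[is_tx, 'group'].str.extract(r'transcript_id "([^"]+)"', expand=False)

    # One dictionary per search: qseqid -> {'sseqid': ..., 'bitscore': ...}
    n = blastn_best.set_index('qseqid')[['sseqid', 'bitscore']].to_dict('index')
    x = tblastx_best.set_index('qseqid')[['sseqid', 'bitscore']].to_dict('index')
    missing = {'sseqid': 'none', 'bitscore': 'none'}

    template = ' blastn_sseqid "{}"; blastn_bitscore "{}"; tblastx_sseqid "{}"; tblastx_bitscore "{}";'
    suffix = [
        template.format(n.get(t, missing)['sseqid'], n.get(t, missing)['bitscore'],
                        x.get(t, missing)['sseqid'], x.get(t, missing)['bitscore'])
        for t in tid
    ]
    out.loc[is_tx, 'group'] = out.loc[is_tx, 'group'] + pd.Series(suffix, index=tid.index)
    return out

# Toy inputs (values made up): TU_A has hits, TU_B has none
gtf = pd.DataFrame({
    'method': ['transcript', 'exon', 'transcript'],
    'group': ['locus_id "L1"; transcript_id "TU_A";',
              'locus_id "L1"; transcript_id "TU_A";',
              'locus_id "L2"; transcript_id "TU_B";'],
})
blastn_best = pd.DataFrame({'qseqid': ['TU_A'], 'sseqid': ['S1'], 'bitscore': [104.0]})
tblastx_best = pd.DataFrame({'qseqid': ['TU_A'], 'sseqid': ['S1'], 'bitscore': [51.0]})

annotated = annotate(gtf, blastn_best, tblastx_best)
print(annotated['group'].tolist())
```

This sketch relies on each best-hit table holding exactly one row per `qseqid`, which is what the `groupby(...).idxmax()` step produces.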


In [10]:
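import csv

# quoting=None falls back to csv.QUOTE_MINIMAL, which would wrap the quote-containing
# 'group' field in extra quotes; csv.QUOTE_NONE writes the field unchanged
gtf_df.to_csv('Hitomebore_annotated.gtf', sep='\t', header=False, index=False, quoting=csv.QUOTE_NONE)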
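
Because the `group` field contains double quotes, the quoting mode used when writing matters. The sketch below, on a single made-up row, compares what `to_csv` emits with its default quoting and with `csv.QUOTE_NONE`.

```python
import csv
import pandas as pd

# One toy row with a quote-containing attribute field (values made up)
toy = pd.DataFrame({'ref': ['chr1'], 'group': ['transcript_id "TU_A";']})

# With no path given, to_csv returns the text it would have written
minimal = toy.to_csv(sep='\t', header=False, index=False)  # default quoting
unquoted = toy.to_csv(sep='\t', header=False, index=False, quoting=csv.QUOTE_NONE)

print(repr(minimal))
print(repr(unquoted))
```

With the default mode the attribute field is wrapped in quotes and its inner quotes are doubled, which GTF readers generally do not expect; `csv.QUOTE_NONE` leaves the field as it is.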