Dan Shea  
2017.05.04  

lncRNA analysis of RNA-seq data for _B. rapa_ subsp. _pekinensis_ var. S11  
We have a `tmap` file generated by `stringtie`, `gffcompare`, and the differential expression analysis from `cuffcompare`. Additionally, we have the `gene_exp.diff` file that tells us the loci of the transfrags. We want to divide the transfrags into types based on how they map to gene loci of the _B. rapa_ var. Chiifu-401 reference sequence, and to create FASTA files of the genomic DNA sequences of the transcripts for single-strand `blastx` analysis against the Swissprot database. This should allow us to confirm both putative novel genes and putative lncRNAs such as lincRNA (long intergenic), cis-NAT (opposite strand of a gene locus), and intronic ncRNA (within the intron of a gene locus).

This workbook looks at the intergenic transfrags that did not return a hit against Swissprot. We must update the Excel spreadsheet output by the previous analysis, creating a putative genes worksheet for the transfrags that did return a `blastx` hit and removing those transfrags from the putative intergenic worksheet.

In [1]:
# Let's import some of the packages we will require for this analysis
import pandas as pd
from Bio import SeqIO

In [2]:
# Define the input files we want to load in as pandas data frames
excel_file         = 'putative_lncrnas.xlsx'
blastx_nohits_file = 'putative_intergenic_lncRNAs.swissprot.blastx.nohits.out'
blastx_hits_file   = 'putative_unannotated_genes.tsv'

In [3]:
# Load the files as pandas data frames
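# (`sheet_name` replaces the older `sheetname` keyword, which later pandas releases no longer accept)
excel         = pd.read_excel(excel_file, sheet_name='Intergenic')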
blastx_nohits = pd.read_table(blastx_nohits_file)
blastx_hits   = pd.read_table(blastx_hits_file)

In [4]:
excel

Unnamed: 0,test_id,gene_id,gene,locus,sample_1,sample_2,status,value_1,value_2,log2(fold_change),...,qry_gene_id,qry_id,FMI,FPKM,FPKM_conf_lo,FPKM_conf_hi,cov,len,major_iso_id,ref_match_len
1,MSTRG.10000,MSTRG.10000,-,A04:15821022-15821330,s11_NV,s11_4V,OK,0.991745,2.126390,1.100370,...,MSTRG.10000,MSTRG.10000.1,100,0,0,0,0,308,MSTRG.10000.1,-
2,MSTRG.10003,MSTRG.10003,-,A04:15842844-15843588,s11_NV,s11_4V,OK,12.067700,4.587120,-1.395490,...,MSTRG.10003,MSTRG.10003.1,100,0,0,0,0,664,MSTRG.10003.1,-
4,MSTRG.10014,MSTRG.10014,-,A04:15953371-15954875,s11_NV,s11_4V,OK,3.186410,5.795740,0.863061,...,MSTRG.10014,MSTRG.10014.1,100,0,0,0,0,1209,MSTRG.10014.1,-
6,MSTRG.10022,MSTRG.10022,-,A04:16016405-16022757,s11_NV,s11_4V,OK,7.532030,6.144130,-0.293829,...,MSTRG.10022,MSTRG.10022.1,100,0,0,0,0,6330,MSTRG.10022.1,-
8,MSTRG.10040,MSTRG.10040,-,A04:16160696-16163427,s11_NV,s11_4V,OK,5.498820,9.417510,0.776223,...,MSTRG.10040,MSTRG.10040.1,100,0,0,0,0,1883,MSTRG.10040.1,-
10,MSTRG.10043,MSTRG.10043,-,A04:16181290-16181929,s11_NV,s11_4V,OK,2.565920,1.148040,-1.160300,...,MSTRG.10043,MSTRG.10043.1,100,0,0,0,0,617,MSTRG.10043.1,-
13,MSTRG.10078,MSTRG.10078,-,A04:16423670-16426549,s11_NV,s11_4V,OK,4.492290,4.700060,0.065229,...,MSTRG.10078,MSTRG.10078.1,100,0,0,0,0,2651,MSTRG.10078.1,-
15,MSTRG.10087,MSTRG.10087,-,A04:16518730-16524454,s11_NV,s11_4V,OK,6.724960,15.236800,1.179960,...,MSTRG.10087,MSTRG.10087.1,100,0,0,0,0,1201,MSTRG.10087.1,-
16,MSTRG.10116,MSTRG.10116,-,A04:16779966-16780751,s11_NV,s11_4V,OK,28.500900,41.860900,0.554594,...,MSTRG.10116,MSTRG.10116.1,100,0,0,0,0,785,MSTRG.10116.1,-
19,MSTRG.10136,MSTRG.10136,-,A04:16960916-16966294,s11_NV,s11_4V,OK,242.735000,162.915000,-0.575268,...,MSTRG.10136,MSTRG.10136.2,100,0,0,0,0,4882,MSTRG.10136.2,-


`excel` holds the previously created intergenic Excel worksheet. We will update it by removing the entries that appear in `blastx_hits`.

In [5]:
blastx_nohits

Unnamed: 0,query_id
0,MSTRG.10003
1,MSTRG.10014
2,MSTRG.10040
3,MSTRG.10043
4,MSTRG.10116
5,MSTRG.10164
6,MSTRG.10169
7,MSTRG.10256
8,MSTRG.10262
9,MSTRG.10271


`blastx_nohits` contains all entries in the `excel` data frame that did not return a `blastx` hit from the Swissprot database.

In [6]:
blastx_hits

Unnamed: 0,query_id,subject_id,%_identity,alignment_length,mismatches,gap_opens,query_start,query_end,subject_start,subject_end,evalue,bit_score
0,MSTRG.10000,sp|Q8W413|INV4_ARATH,79.49,39,8,0,118,2,25,63,5.000000e-16,77.0
1,MSTRG.10022,sp|P14381|YTX2_XENLA,23.89,900,595,28,4091,1539,3,861,6.000000e-43,177.0
2,MSTRG.10078,sp|O81908|PPR2_ARATH,70.03,684,180,9,2718,736,24,705,0.000000e+00,919.0
3,MSTRG.10087,sp|Q96247|AUX1_ARATH,63.91,169,14,3,1581,2084,1,123,2.000000e-100,196.0
4,MSTRG.10087,sp|Q96247|AUX1_ARATH,100.00,62,0,0,2353,2538,159,220,2.000000e-100,138.0
5,MSTRG.10087,sp|Q96247|AUX1_ARATH,87.80,41,5,0,2144,2266,121,161,2.000000e-100,79.0
6,MSTRG.10087,sp|Q96247|AUX1_ARATH,89.90,99,10,0,3827,4123,319,417,1.000000e-78,180.0
7,MSTRG.10087,sp|Q96247|AUX1_ARATH,94.20,69,4,0,3541,3747,254,322,1.000000e-78,138.0
8,MSTRG.10087,sp|Q96247|AUX1_ARATH,91.14,79,5,1,5210,5446,409,485,1.000000e-33,142.0
9,MSTRG.10136,sp|Q99315|YG31B_YEAST,34.54,941,555,15,3340,641,522,1442,1.000000e-143,490.0


`blastx_hits` contains the results of running `blastx` against Swissprot for the intergenic transcripts. Only entries that return a hit are listed in this data frame.
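
Since the later steps treat `blastx_hits` and `blastx_nohits` as complementary views of the intergenic worksheet, a quick check that no ID appears in both lists and that none is missing from both can be useful. The sketch below is only an illustration: `check_partition` and the `*_toy` frames are hypothetical names that do not appear in the original analysis, and the IDs are borrowed from the tables shown above.

```python
import pandas as pd

# Toy stand-ins for the three frames (hypothetical, not the real data)
excel_toy = pd.DataFrame(
    {'test_id': ['MSTRG.10000', 'MSTRG.10003', 'MSTRG.10014', 'MSTRG.10022']}
)
nohits_toy = pd.DataFrame({'query_id': ['MSTRG.10003', 'MSTRG.10014']})
# A query may be listed several times in the hits table (one row per alignment)
hits_toy = pd.DataFrame(
    {'query_id': ['MSTRG.10000', 'MSTRG.10022', 'MSTRG.10022']}
)

def check_partition(all_ids, hit_ids, nohit_ids):
    """Report IDs listed as both hit and no-hit, and IDs found in neither list."""
    all_ids, hit_ids, nohit_ids = set(all_ids), set(hit_ids), set(nohit_ids)
    return {
        'in_both': sorted(hit_ids & nohit_ids),
        'unaccounted': sorted(all_ids - hit_ids - nohit_ids),
    }

report = check_partition(
    excel_toy['test_id'], hits_toy['query_id'], nohits_toy['query_id']
)
print(report)
```

Two empty lists would indicate that the hit and no-hit IDs split the worksheet cleanly; anything else is worth inspecting before merging.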

In [7]:
# Remove the duplicate blast hits, keeping the first row for each query_id
# (this is the best hit only if the rows are ordered best-first within each query)
blastx_hits = blastx_hits.drop_duplicates('query_id')
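
`drop_duplicates('query_id')` keeps whichever row comes first for each query, so it relies on the row order of the input file. If that order cannot be guaranteed, one option is to sort explicitly before dropping duplicates. The following is a minimal sketch under that assumption, not the method used in this workbook: `hits_toy` and `best_hits` are hypothetical names, and the rows are taken from the table above and deliberately shuffled.

```python
import pandas as pd

# Toy subset of the blastx hits, shuffled so the best hit is not listed first
hits_toy = pd.DataFrame({
    'query_id':  ['MSTRG.10087', 'MSTRG.10087', 'MSTRG.10087', 'MSTRG.10136'],
    'evalue':    [1e-78, 2e-100, 2e-100, 1e-143],
    'bit_score': [180.0, 196.0, 138.0, 490.0],
})

# Sort by lowest e-value, then highest bit score, and keep the top row per query
best_hits = (
    hits_toy
    .sort_values(['query_id', 'evalue', 'bit_score'],
                 ascending=[True, True, False])
    .drop_duplicates('query_id', keep='first')
    .reset_index(drop=True)
)
print(best_hits)
```

Ranking by e-value first and bit score second is a choice made for this illustration; ranking by bit score alone would be equally defensible.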

In [8]:
# First we will merge excel with blastx_hits to create the putative unannotated genes worksheet
blastx_hits_worksheet = pd.merge(excel, blastx_hits, left_on='test_id', right_on='query_id')

In [9]:
blastx_hits_worksheet

Unnamed: 0,test_id,gene_id,gene,locus,sample_1,sample_2,status,value_1,value_2,log2(fold_change),...,%_identity,alignment_length,mismatches,gap_opens,query_start,query_end,subject_start,subject_end,evalue,bit_score
0,MSTRG.10000,MSTRG.10000,-,A04:15821022-15821330,s11_NV,s11_4V,OK,0.991745,2.126390,1.100370,...,79.49,39,8,0,118,2,25,63,5.000000e-16,77.0
1,MSTRG.10022,MSTRG.10022,-,A04:16016405-16022757,s11_NV,s11_4V,OK,7.532030,6.144130,-0.293829,...,23.89,900,595,28,4091,1539,3,861,6.000000e-43,177.0
2,MSTRG.10078,MSTRG.10078,-,A04:16423670-16426549,s11_NV,s11_4V,OK,4.492290,4.700060,0.065229,...,70.03,684,180,9,2718,736,24,705,0.000000e+00,919.0
3,MSTRG.10087,MSTRG.10087,-,A04:16518730-16524454,s11_NV,s11_4V,OK,6.724960,15.236800,1.179960,...,63.91,169,14,3,1581,2084,1,123,2.000000e-100,196.0
4,MSTRG.10136,MSTRG.10136,-,A04:16960916-16966294,s11_NV,s11_4V,OK,242.735000,162.915000,-0.575268,...,34.54,941,555,15,3340,641,522,1442,1.000000e-143,490.0
5,MSTRG.10137,MSTRG.10137,-,A04:16960916-16966294,s11_NV,s11_4V,OK,4.740240,3.838070,-0.304578,...,34.54,941,555,15,3340,641,522,1442,1.000000e-143,490.0
6,MSTRG.10197,MSTRG.10197,-,A04:17428893-17436774,s11_NV,s11_4V,OK,20.328400,16.673200,-0.285960,...,33.11,296,128,7,4888,4004,893,1119,2.000000e-69,138.0
7,MSTRG.10279,MSTRG.10279,-,A04:18050412-18052111,s11_NV,s11_4V,OK,4.590450,3.295220,-0.478261,...,52.04,269,126,3,1693,890,1060,1326,2.000000e-85,293.0
8,MSTRG.10365,MSTRG.10365,-,A04:18741523-18742449,s11_NV,s11_4V,OK,29.237000,37.350500,0.353332,...,97.44,39,1,0,618,734,46,84,3.000000e-17,79.3
9,MSTRG.10696,MSTRG.10696,-,A05:1720238-1721107,s11_NV,s11_4V,OK,87.962000,91.137500,0.051164,...,96.77,62,2,0,512,327,2,63,4.000000e-33,122.0


In [10]:
# Now we can remove entries from excel that received a hit during blastx of the Swissprot db.
intergenic = pd.merge(excel, blastx_nohits, how='inner', left_on='qry_gene_id', right_on='query_id')
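
The inner merge above keeps the rows whose ID is listed in `blastx_nohits`. An equivalent way to express "remove everything that got a hit" is an anti-join against `blastx_hits`, which also avoids adding a redundant `query_id` column. The sketch below assumes the hit and no-hit lists split the worksheet cleanly; `excel_toy`, `hits_toy`, and `intergenic_toy` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy stand-ins for the worksheet and the hits table (IDs from the tables above)
excel_toy = pd.DataFrame(
    {'qry_gene_id': ['MSTRG.10000', 'MSTRG.10003', 'MSTRG.10014', 'MSTRG.10022']}
)
hits_toy = pd.DataFrame({'query_id': ['MSTRG.10000', 'MSTRG.10022']})

# Anti-join: keep only the rows whose ID is NOT present in the hits table
has_hit = excel_toy['qry_gene_id'].isin(hits_toy['query_id'])
intergenic_toy = excel_toy[~has_hit].reset_index(drop=True)
print(intergenic_toy)
```

Comparing the result of the anti-join with the result of the inner merge is a cheap cross-check that no transfrag was lost or double-counted.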

In [11]:
intergenic

Unnamed: 0,test_id,gene_id,gene,locus,sample_1,sample_2,status,value_1,value_2,log2(fold_change),...,qry_id,FMI,FPKM,FPKM_conf_lo,FPKM_conf_hi,cov,len,major_iso_id,ref_match_len,query_id
0,MSTRG.10003,MSTRG.10003,-,A04:15842844-15843588,s11_NV,s11_4V,OK,12.067700,4.587120,-1.395490,...,MSTRG.10003.1,100,0,0,0,0,664,MSTRG.10003.1,-,MSTRG.10003
1,MSTRG.10014,MSTRG.10014,-,A04:15953371-15954875,s11_NV,s11_4V,OK,3.186410,5.795740,0.863061,...,MSTRG.10014.1,100,0,0,0,0,1209,MSTRG.10014.1,-,MSTRG.10014
2,MSTRG.10040,MSTRG.10040,-,A04:16160696-16163427,s11_NV,s11_4V,OK,5.498820,9.417510,0.776223,...,MSTRG.10040.1,100,0,0,0,0,1883,MSTRG.10040.1,-,MSTRG.10040
3,MSTRG.10043,MSTRG.10043,-,A04:16181290-16181929,s11_NV,s11_4V,OK,2.565920,1.148040,-1.160300,...,MSTRG.10043.1,100,0,0,0,0,617,MSTRG.10043.1,-,MSTRG.10043
4,MSTRG.10116,MSTRG.10116,-,A04:16779966-16780751,s11_NV,s11_4V,OK,28.500900,41.860900,0.554594,...,MSTRG.10116.1,100,0,0,0,0,785,MSTRG.10116.1,-,MSTRG.10116
5,MSTRG.10164,MSTRG.10164,-,A04:17165278-17165738,s11_NV,s11_4V,OK,5.495660,21.009700,1.934690,...,MSTRG.10164.1,100,0,0,0,0,373,MSTRG.10164.1,-,MSTRG.10164
6,MSTRG.10169,MSTRG.10169,-,A04:17247886-17248534,s11_NV,s11_4V,NOTEST,0.147567,0.137982,-0.096892,...,MSTRG.10169.1,100,0,0,0,0,648,MSTRG.10169.1,-,MSTRG.10169
7,MSTRG.10256,MSTRG.10256,-,A04:17866766-17867003,s11_NV,s11_4V,OK,1.180830,6.317920,2.419640,...,MSTRG.10256.1,100,0,0,0,0,237,MSTRG.10256.1,-,MSTRG.10256
8,MSTRG.10262,MSTRG.10262,-,A04:17946528-17947028,s11_NV,s11_4V,OK,12034.700000,12828.600000,0.092162,...,MSTRG.10262.1,100,0,0,0,0,500,MSTRG.10262.1,-,MSTRG.10262
9,MSTRG.10271,MSTRG.10271,-,A04:18016205-18016928,s11_NV,s11_4V,OK,0.896980,0.813720,-0.140544,...,MSTRG.10271.1,100,0,0,0,0,723,MSTRG.10271.1,-,MSTRG.10271


In [12]:
# Finally, write both tables to a new Excel workbook, one worksheet per table
# (the context manager saves the file on exit; `writer.save()` was removed in pandas 2.0)
with pd.ExcelWriter('putative_lincrnas.xlsx') as writer:
    intergenic.to_excel(writer, sheet_name='Intergenic')
    blastx_hits_worksheet.to_excel(writer, sheet_name='Unannotated genes')