Purpose: Extract gene lengths from the GFF file for B73 V5 so I can convert count data from Brandon Webster into TPM.
Author: Anna Pardo
Date initiated: May 11, 2023

In [1]:
# import modules
import pandas as pd

In [3]:
# read in gff file
gff = pd.read_csv("../data/Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff",sep="\t",header=None,comment="#")
gff.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,chr1,assembly,chromosome,1,308452471,.,.,.,ID=1;Name=chromosome:Zm-B73-REFERENCE-NAM-5.0:...
1,chr1,NAM,gene,34617,40204,.,+,.,ID=Zm00001eb000010;biotype=protein_coding;logi...
2,chr1,NAM,mRNA,34617,40204,.,+,.,ID=Zm00001eb000010_T001;Parent=Zm00001eb000010...
3,chr1,NAM,five_prime_UTR,34617,34721,.,+,.,Parent=Zm00001eb000010_T001
4,chr1,NAM,exon,34617,35318,.,+,.,Parent=Zm00001eb000010_T001;Name=Zm00001eb0000...
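
The integer column labels above (0 to 8) correspond to the nine standard GFF3 fields. As a minimal sketch, assuming one prefers named columns, the same `read_csv` call can take a `names` list. The `GFF_COLUMNS` list, the `gff_named` variable, and the one-line toy input are illustrative and not part of the original notebook, which keeps the integer labels throughout.

```python
import io
import pandas as pd

# Standard GFF3 field names, in column order (illustrative constant, not in the notebook)
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]

# Toy one-gene GFF held in memory so the sketch runs without the real file
toy_gff = (
    "##gff-version 3\n"
    "chr1\tNAM\tgene\t34617\t40204\t.\t+\t.\tID=Zm00001eb000010;biotype=protein_coding\n"
)

# Same arguments as the cell above, plus names= to label the columns
gff_named = pd.read_csv(io.StringIO(toy_gff), sep="\t", header=None,
                        comment="#", names=GFF_COLUMNS)

# Rows can then be filtered by name instead of by position
genes_named = gff_named[gff_named["type"] == "gene"]
```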


In [4]:
# subset to type gene only
genes = gff[gff[2]=="gene"]
genes.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
1,chr1,NAM,gene,34617,40204,.,+,.,ID=Zm00001eb000010;biotype=protein_coding;logi...
23,chr1,NAM,gene,41214,46762,.,-,.,ID=Zm00001eb000020;biotype=protein_coding;logi...
106,chr1,NAM,gene,108554,114382,.,-,.,ID=Zm00001eb000050;biotype=protein_coding;logi...
123,chr1,NAM,gene,188559,189581,.,-,.,ID=Zm00001eb000060;biotype=protein_coding;logi...
131,chr1,NAM,gene,190192,198832,.,-,.,ID=Zm00001eb000070;biotype=protein_coding;logi...


In [8]:
# extract gene ID as separate column
# (takes the value of the first attribute, which is ID=... in this file)
ids = [attr.strip().split(";")[0].split("=")[1] for attr in genes[8]]
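
The extraction above relies on `ID=` being the first attribute in column 8, which holds for the gene rows shown here. As a hedged alternative sketch, a regular expression can pull the `ID` value wherever it appears in the attribute string. The `attrs` series below is toy data, and its second entry (with `ID` listed second) is hypothetical, included only to show the difference.

```python
import pandas as pd

# Toy attribute strings; the second one is a made-up ordering for illustration
attrs = pd.Series([
    "ID=Zm00001eb000010;biotype=protein_coding",
    "biotype=protein_coding;ID=Zm00001eb000020",
])

# Capture everything after "ID=" up to the next ";" (or end of string).
# (?:^|;) ensures we match the ID key itself, not a key that merely ends in "ID".
ids_regex = attrs.str.extract(r"(?:^|;)ID=([^;]+)", expand=False)
```

With `expand=False` and a single capture group, `str.extract` returns a Series, so the result could be assigned to a column directly.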

In [9]:
# add ID as column
genes[9]=ids
genes.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  genes[9]=ids


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
1,chr1,NAM,gene,34617,40204,.,+,.,ID=Zm00001eb000010;biotype=protein_coding;logi...,Zm00001eb000010
23,chr1,NAM,gene,41214,46762,.,-,.,ID=Zm00001eb000020;biotype=protein_coding;logi...,Zm00001eb000020
106,chr1,NAM,gene,108554,114382,.,-,.,ID=Zm00001eb000050;biotype=protein_coding;logi...,Zm00001eb000050
123,chr1,NAM,gene,188559,189581,.,-,.,ID=Zm00001eb000060;biotype=protein_coding;logi...,Zm00001eb000060
131,chr1,NAM,gene,190192,198832,.,-,.,ID=Zm00001eb000070;biotype=protein_coding;logi...,Zm00001eb000070
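
The warning above appears because `genes` was created by boolean indexing on `gff`, so pandas cannot tell whether the new column is meant for the subset or for the original table. The assignment still works, as the output shows. A minimal sketch of one common way to avoid the warning, using a toy table with the same integer column labels (the `toy` and `toy_genes` names and their values are illustrative):

```python
import pandas as pd

# Toy stand-in for the GFF table: column 2 = feature type, 3 = start, 4 = end
toy = pd.DataFrame({2: ["gene", "mRNA", "gene"],
                    3: [10, 10, 50],
                    4: [20, 20, 90]})

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and leaves the original table untouched
toy_genes = toy[toy[2] == "gene"].copy()
toy_genes[9] = ["geneA", "geneB"]
```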


In [11]:
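# calculate length as end - start
# (GFF coordinates are 1-based and inclusive, so this is 1 bp less than the inclusive span)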
genes[10] = genes[4]-genes[3]
genes.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  genes[10] = genes[4]-genes[3]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
1,chr1,NAM,gene,34617,40204,.,+,.,ID=Zm00001eb000010;biotype=protein_coding;logi...,Zm00001eb000010,5587
23,chr1,NAM,gene,41214,46762,.,-,.,ID=Zm00001eb000020;biotype=protein_coding;logi...,Zm00001eb000020,5548
106,chr1,NAM,gene,108554,114382,.,-,.,ID=Zm00001eb000050;biotype=protein_coding;logi...,Zm00001eb000050,5828
123,chr1,NAM,gene,188559,189581,.,-,.,ID=Zm00001eb000060;biotype=protein_coding;logi...,Zm00001eb000060,1022
131,chr1,NAM,gene,190192,198832,.,-,.,ID=Zm00001eb000070;biotype=protein_coding;logi...,Zm00001eb000070,8640
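
The lengths above are end minus start. Because GFF start and end coordinates are 1-based and both inclusive, the number of bases a feature covers is end minus start plus one. The sketch below works this through for the first gene in the table (Zm00001eb000010); the variable names are illustrative. The 1 bp difference is small relative to typical gene lengths, so whether to apply it is a judgment call for the downstream TPM calculation.

```python
# Coordinates of the first gene in the table above
start, end = 34617, 40204

# What the notebook computes
length_as_computed = end - start

# Number of bases covered when both endpoints are counted
length_inclusive = end - start + 1
```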


In [12]:
# subset dataframe & rename columns
# (rename returns a new DataFrame, which avoids an in-place edit on a slice of genes)
df = genes[[9, 10]].rename(columns={9: "GeneID", 10: "Length"})
df.head()


Unnamed: 0,GeneID,Length
1,Zm00001eb000010,5587
23,Zm00001eb000020,5548
106,Zm00001eb000050,5828
123,Zm00001eb000060,1022
131,Zm00001eb000070,8640


In [13]:
# save dataframe
df.to_csv("../data/gene_lengths_B73V5.txt",sep="\t",header=True,index=False)
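
The saved table is intended for converting counts to TPM. As a rough sketch of that later step, assuming the usual definition of TPM (divide each count by gene length in kilobases, then scale each sample so its values sum to one million), the calculation could look like the following. The function name `counts_to_tpm`, the toy counts, and the sample names are hypothetical and not taken from the actual count data. Many workflows use exonic or effective length rather than the full gene span; that choice is outside this sketch.

```python
import pandas as pd

def counts_to_tpm(counts: pd.DataFrame, lengths: pd.Series) -> pd.DataFrame:
    """Convert a genes x samples count table to TPM.

    counts  : rows indexed by gene ID, one column per sample
    lengths : gene length in bp, indexed by gene ID
    """
    # Align lengths to the count table and convert bp to kb
    lengths_kb = lengths.reindex(counts.index) / 1000
    # Reads per kilobase for every gene in every sample
    rpk = counts.div(lengths_kb, axis=0)
    # Scale each sample (column) so its values sum to one million
    return rpk.div(rpk.sum(axis=0), axis=1) * 1e6

# Toy example with made-up counts and lengths
toy_counts = pd.DataFrame({"sample1": [10, 20], "sample2": [5, 5]},
                          index=["geneA", "geneB"])
toy_lengths = pd.Series({"geneA": 1000, "geneB": 2000})
toy_tpm = counts_to_tpm(toy_counts, toy_lengths)
```

In practice, `lengths` would come from the saved file, for example by reading it with `pd.read_csv(..., sep="\t")` and setting `GeneID` as the index.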