Global transcriptomic analysis of Cyanothece 51142 reveals robust diurnal oscillation of central metabolic processes

Jana Stöckel, Eric A. Welsh, Michelle Liberton, Rangesh Kunnavakkam, Rajeev Aurora, and Himadri B. Pakrasi

https://www.pnas.org/content/105/16/6156

In [1]:
import numpy as np
import pandas as pd
raw_df = pd.read_excel("4 - JSData/4-SD3.xls",sheet_name=0,header=1,usecols=[0,2],skipfooter=4)

In [2]:
raw_df.head()

Unnamed: 0,ORF**,ANNOTATION††
0,cce_1117,"probable proteinase inhibitor I4, serpin"
1,cce_3464,hypothetical protein
2,cce_1462,unknown protein
3,cce_2038,"carbamoyl-phosphate synthase, large subunit; carB"
4,cce_4757,bifunctional cobalbumin biosynthesis enzyme; c...


In [3]:
raw_df.iloc[-1]

 ORF**                      cce_1931
ANNOTATION††    hypothetical protein
Name: 198, dtype: object

In [4]:
raw_df.columns

Index([' ORF**', 'ANNOTATION††'], dtype='object')

In [5]:
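def func1(row):
    # Build a label: the ORF id, plus whatever follows the first ';' in the annotation (the gene name).
    annotation = row['ANNOTATION††']
    pos = annotation.find(';')
    if pos != -1:
        # The text after ';' keeps its leading space, exactly as in the spreadsheet.
        return row[' ORF**'] + " " + annotation[pos+1:]
    else:
        return row[' ORF**']
raw_df['SOURCE'] = raw_df.apply(func1,axis=1)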
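The row-wise `func1` can also be written with pandas string methods. The following is only a sketch on two rows copied from the table above; `make_source` is a hypothetical name, and unlike `func1` it strips the space that follows the semicolon, so the label contains a single space.

```python
import pandas as pd

def make_source(orf: pd.Series, annotation: pd.Series) -> pd.Series:
    """Return ORF, plus the gene name after the first ';' when there is one."""
    # partition always yields three columns: before, separator, after
    gene = annotation.str.partition(";")[2].str.strip()
    has_gene = gene.notna() & (gene != "")
    # keep the bare ORF where no gene name was found
    return orf.where(~has_gene, orf + " " + gene)

toy = pd.DataFrame({
    "ORF": ["cce_1117", "cce_2038"],
    "ANNOTATION": [
        "probable proteinase inhibitor I4, serpin",
        "carbamoyl-phosphate synthase, large subunit; carB",
    ],
})
toy["SOURCE"] = make_source(toy["ORF"], toy["ANNOTATION"])
print(toy["SOURCE"].tolist())
```

On the real data the column names would be `' ORF**'` and `'ANNOTATION††'`, as shown by `raw_df.columns`.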

In [6]:
df = raw_df.dropna()

In [7]:
len(df)

199

In [10]:
df.to_excel('4 - JSData/myFile4.xlsx',columns=['SOURCE'])

In [11]:
raw_df2 = pd.read_excel("4 - JSData/4-SD3.xls",sheet_name=1,header=1,usecols=[0,2],skipfooter=4)
raw_df2['SOURCE'] = raw_df2.apply(func1,axis=1)
df2 = raw_df2.dropna()
df2.to_excel('4 - JSData/myFile4-2.xlsx',columns=['SOURCE'])

In [12]:
raw_df3 = pd.read_excel("4 - JSData/4-SD3.xls",sheet_name=2,header=1,usecols=[0,2],skipfooter=4)
raw_df3['SOURCE'] = raw_df3.apply(func1,axis=1)
df3 = raw_df3.dropna()
df3.to_excel('4 - JSData/myFile4-3.xlsx',columns=['SOURCE'])

In [13]:
raw_df4 = pd.read_excel("4 - JSData/4-SD3.xls",sheet_name=3,header=1,usecols=[0,2],skipfooter=4)
raw_df4['SOURCE'] = raw_df4.apply(func1,axis=1)
df4 = raw_df4.dropna()
df4.to_excel('4 - JSData/myFile4-4.xlsx',columns=['SOURCE'])
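The four cells above differ only in the sheet number and the output file name, so they could be folded into one helper. This is a sketch, not the notebook's own code: `output_path` and `export_sheet` are hypothetical names, and `export_sheet` assumes the same workbook layout (`header=1`, columns 0 and 2, four footer rows) on every sheet.

```python
import pandas as pd

XLS_PATH = "4 - JSData/4-SD3.xls"

def output_path(sheet: int) -> str:
    """Reproduce the file names used above: myFile4.xlsx, myFile4-2.xlsx, ..."""
    suffix = "" if sheet == 0 else f"-{sheet + 1}"
    return f"4 - JSData/myFile4{suffix}.xlsx"

def export_sheet(sheet: int, label_func) -> pd.DataFrame:
    """Read one sheet, build the SOURCE column and write it to Excel."""
    raw = pd.read_excel(XLS_PATH, sheet_name=sheet, header=1,
                        usecols=[0, 2], skipfooter=4)
    raw["SOURCE"] = raw.apply(label_func, axis=1)
    out = raw.dropna()
    out.to_excel(output_path(sheet), columns=["SOURCE"])
    return out

# Intended use inside the notebook, where func1 is already defined:
# for sheet in range(4):
#     export_sheet(sheet, func1)
```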

# Load the file that has light/dark expression information and metabolic pathway information

Read only two columns (the ORF and its annotation) because the rest are not needed at this point.

In [134]:
df = pd.read_excel("4 - JSData/SD4.xls",usecols=[0,2],header=2,skipfooter=6).dropna(how='all')

In [135]:
df.columns = ['ORF','Function']

Two rows (index labels 100 and 101) are leftover header rows from the spreadsheet and carry no gene information, so they must be dropped.

In [140]:
df.iloc[94:96]

Unnamed: 0,ORF,Function
100,LIGHT,PATHWAY / ANNOTATION
101,ORF**,PHOTOSYNTHESIS


In [145]:
df.loc[101]

ORF                  ORF**
Function    PHOTOSYNTHESIS
Name: 101, dtype: object

In [146]:
df.drop([100,101],inplace=True)

In [150]:
df.reset_index(drop=True,inplace=True)

In [151]:
df

Unnamed: 0,ORF,Function
0,cce_0666,glucose-6-phosphate isomerase; pgi1
1,cce_5178,glucose-6-phosphate isomerase; pgi2
2,cce_0669,6-phosphofructokinase; pfkA1
3,cce_3253,6-phosphofructokinase; pfkA2
4,cce_4758,"fructose 1,6-bisphosphatase I"
...,...,...
224,cce_4627,transketolase; tktA
225,cce_0798,ribulose-phosphate 3-epimerase; rpe
226,cce_0103,ribose 5-phosphate isomerase; rpiA
227,cce_4304,triosephosphate isomerase; tpi


First, add the dark/light expression phase. Rows up to and including index 93 are genes expressed in the dark phase; everything from index 94 onward belongs to the light phase.

In [152]:
df['Expression'] = 'D'

In [153]:
df.iloc[93]

ORF                                  cce_0568
Function      nitrogen fixation protein; nifW
Expression                                  D
Name: 93, dtype: object

In [155]:
df.loc[94:,'Expression'] = 'L'
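The boundary at index 94 is hard-coded above. As a sketch of an alternative, the split point could be derived from the `LIGHT` marker row itself before that row is dropped. The toy frame below is an assumption that mimics the two header rows seen at labels 100 and 101; `labelled` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy data: two dark-phase genes, the two leftover header rows, one light-phase gene
toy = pd.DataFrame({
    "ORF": ["cce_0666", "cce_0568", "LIGHT", "ORF**", "cce_4627"],
    "Function": [
        "glucose-6-phosphate isomerase; pgi1",
        "nitrogen fixation protein; nifW",
        "PATHWAY / ANNOTATION",
        "PHOTOSYNTHESIS",
        "transketolase; tktA",
    ],
})

# Position of the marker row that opens the light-phase block
marker = toy.index[toy["ORF"] == "LIGHT"][0]

# Rows after the marker are light-phase, the rest dark-phase
toy["Expression"] = np.where(toy.index > marker, "L", "D")

# Drop the header rows by content rather than by index label
labelled = toy[~toy["ORF"].isin(["LIGHT", "ORF**"])].reset_index(drop=True)
print(labelled[["ORF", "Expression"]])
```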

Now we can define the pathway

In [156]:
df['pathway'] = 'unknown'

Rows up to and including index 25 are glycolysis genes.

In [159]:
df.loc[:25,'pathway'] = 'Glycolysis'

In [160]:
df.iloc[25]

ORF                          cce_0606
Function      phosphoglucomutase; pgm
Expression                          D
pathway                    Glycolysis
Name: 25, dtype: object

Now find the rows where ORF is null; their Function values are the pathway names. A boolean mask selects them.

In [161]:
mask = df.ORF.isnull()
df[mask]

Unnamed: 0,ORF,Function,Expression,pathway
26,,Glycogen degradation,D,unknown
34,,TCA cycle,D,unknown
49,,Oxidative Pentose Phosphate Cycle,D,unknown
59,,Amino acid biosynthesis,D,unknown
77,,Nitrogen fixation,D,unknown
94,,Photosystem II,L,unknown
120,,Photosystem I,L,unknown
133,,Cytochrome b6/f complex,L,unknown
140,,Photosynthetic electron transport,L,unknown
149,,Phycobilisome,L,unknown
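For readers new to masks, here is a minimal sketch of the idea on three toy rows taken from the table: `isnull()` returns one boolean per row, and indexing with that boolean Series keeps only the rows marked `True`. The names `header_rows` and `header_positions` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "ORF": ["cce_0666", "cce_0606", None],
    "Function": [
        "glucose-6-phosphate isomerase; pgi1",
        "phosphoglucomutase; pgm",
        "Glycogen degradation",  # pathway header row: no ORF
    ],
})

mask = toy["ORF"].isnull()                 # True where the ORF is missing
header_rows = toy[mask]                    # only the pathway header rows
header_positions = list(toy.index[mask])   # their index labels

print(mask.tolist())
print(header_positions)
```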


As we can see, each of these rows corresponds to a pathway. Each pathway name applies from its own header row up to the next header row, so we fill the pathway column one span at a time.

In [164]:
index = list(df[mask].index)

In [165]:
index

[26, 34, 49, 59, 77, 94, 120, 133, 140, 149, 164, 180, 198, 213, 223]

In [167]:
df.iloc[index[0]]['Function']

'Glycogen degradation'

In [168]:
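# Each header row labels every row from itself up to the next header.
# .loc slices include the end label, so the next header row is labelled too
# and then corrected on the following pass.
previdx = index[0]
for currentidx in index[1:]:
    df.loc[previdx:currentidx,'pathway'] = df.loc[previdx,'Function']
    previdx = currentidx
# The last pathway runs to the end of the table.
df.loc[previdx:,'pathway'] = df.loc[previdx,'Function']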
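The same span-filling can be expressed without an explicit loop by forward-filling the header names. This is a sketch of that alternative on toy data, not the method used in the notebook: the last ORF is a made-up placeholder, and the rows before the first header are assumed to be glycolysis, as stated above.

```python
import pandas as pd

toy = pd.DataFrame({
    "ORF": ["cce_0666", "cce_0606", None, "cce_0568", None, "orf_placeholder"],
    "Function": [
        "glucose-6-phosphate isomerase; pgi1",
        "phosphoglucomutase; pgm",
        "Nitrogen fixation",          # header row
        "nitrogen fixation protein; nifW",
        "Photosystem II",             # header row
        "hypothetical gene",          # placeholder row, not from the data
    ],
})

is_header = toy["ORF"].isna()

# Keep the Function text only on header rows, then carry it down to the
# rows beneath each header; rows above the first header get 'Glycolysis'.
toy["pathway"] = toy["Function"].where(is_header).ffill().fillna("Glycolysis")

# Header rows are no longer needed once the pathway column is filled
genes = toy[~is_header].reset_index(drop=True)
print(genes[["ORF", "pathway"]])
```

Because the header rows are removed with the mask, this variant does not rely on `dropna` to clean them up later.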

In [178]:
df.iloc[224]

ORF                                       cce_4627
Function                       transketolase; tktA
Expression                                       L
pathway       Reductive Pentose Phorsphate Pathway
Name: 224, dtype: object

In [179]:
df

Unnamed: 0,ORF,Function,Expression,pathway
0,cce_0666,glucose-6-phosphate isomerase; pgi1,D,Glycolysis
1,cce_5178,glucose-6-phosphate isomerase; pgi2,D,Glycolysis
2,cce_0669,6-phosphofructokinase; pfkA1,D,Glycolysis
3,cce_3253,6-phosphofructokinase; pfkA2,D,Glycolysis
4,cce_4758,"fructose 1,6-bisphosphatase I",D,Glycolysis
...,...,...,...,...
224,cce_4627,transketolase; tktA,L,Reductive Pentose Phorsphate Pathway
225,cce_0798,ribulose-phosphate 3-epimerase; rpe,L,Reductive Pentose Phorsphate Pathway
226,cce_0103,ribose 5-phosphate isomerase; rpiA,L,Reductive Pentose Phorsphate Pathway
227,cce_4304,triosephosphate isomerase; tpi,L,Reductive Pentose Phorsphate Pathway


Export the DataFrame to Excel. Dropping the rows with missing values first removes the pathway header rows, whose ORF is empty.

In [187]:
df.dropna(inplace=True)

In [189]:
df.reset_index(drop=True,inplace=True)

In [192]:
df.to_excel("4 - JSData/EPinfo.xlsx")