## Example data from the wild!! 
Brauer et al. (2008) used microarrays to test the effect of starvation and growth rate on baker’s yeast (S. cerevisiae, a popular model organism for studying molecular genomics because of its simplicity). Basically, if you give yeast plenty of nutrients (a rich medium) but sharply restrict its supply of one nutrient, you can set the growth rate to whatever level you desire (we do this with a tool called a chemostat). For example, you could limit the yeast’s supply of glucose (sugar, which the cell metabolizes to get energy and carbon), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).

“Starving” the yeast of these nutrients lets us find genes that:

- Raise or lower their activity in response to growth rate. Growth-rate dependent expression patterns can tell us a lot about cell cycle control, and how the cell responds to stress.
- Respond differently when different nutrients are being limited. These genes may be involved in the transport or metabolism of those nutrients.

Sounds pretty cool, right? So let’s get started!

You can check out the paper here: https://www.molbiolcell.org/doi/full/10.1091/mbc.e07-08-0779

### 1. Start by loading in the data as a pandas dataframe 
- data file = bcmb_bootcamp2020/day3/data/Brauer2008_DataSet1_clean.tds
- Note this is a tab-separated file, so you will need to specify the delimiter as "\t" in your load command

In [45]:
#DJH code - every string I have written starts with my initials 
import pandas as pd
origin=pd.read_csv('../data/Brauer2008_DataSet1_clean.tds', sep='\t')

In [38]:
#data_origin=pd.read_csv('/Users/timp/bcmb_bootcamp/day3/data/Brauer2008_DataSet1_clean.tds', sep='\t')

In [39]:
#x=pd.read_csv('../data/Brauer2008_DataSet1_clean.tds', sep='\t')

Each column with a name like G0.05 or N0.3 holds the gene expression values for one sample, as measured by the microarray. The column title shows the condition: G0.05, for instance, means the limiting nutrient was glucose and the growth rate was 0.05. A higher value means the gene was more expressed in that sample; a lower value means it was less expressed. In total the yeast was grown with six limiting nutrients at six growth rates each, which makes 36 samples, and therefore 36 columns, of gene expression data.
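If you want to check the "36 columns" claim rather than count by eye, something like the following might help. This is only a sketch on a toy one-row table: the values are copied from the first row of the output, the NAME entry is shortened, and the regex assumes every sample column is a capital letter followed directly by a digit.

```python
import pandas as pd

# Toy stand-in for the loaded table: annotation columns plus a few sample columns.
toy = pd.DataFrame(
    [["GENE1331X", "A_06_P5820", "SFB2", 1, -0.24, -0.13, -0.06, 0.09]],
    columns=["GID", "YORF", "NAME", "GWEIGHT", "G0.05", "G0.1", "U0.05", "U0.3"],
)

# Assumption: sample columns look like <capital letter><digit...>, e.g. "G0.05".
samples = toy.filter(regex=r"^[A-Z]\d").columns

# The first character of each label is the limiting-nutrient code.
per_nutrient = samples.str[0].value_counts()

print(len(samples))
print(per_nutrient)
```

Run against the full table, the same two lines should report 36 sample columns, six per nutrient, if the description above holds.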

In [46]:
#DJH code
origin

Unnamed: 0,GID,YORF,NAME,GWEIGHT,G0.05,G0.1,G0.15,G0.2,G0.25,G0.3,...,L0.15,L0.2,L0.25,L0.3,U0.05,U0.1,U0.15,U0.2,U0.25,U0.3
0,GENE1331X,A_06_P5820,SFB2 -- ER to Golgi transport -- molecul...,1,-0.24,-0.13,-0.21,-0.15,-0.05,-0.05,...,0.13,0.20,0.17,0.11,-0.06,-0.26,-0.05,-0.28,-0.19,0.09
1,GENE4924X,A_06_P5866,-- biological process unknown -- mol...,1,0.28,0.13,-0.40,-0.48,-0.11,0.17,...,0.02,0.04,0.03,0.01,-1.02,-0.91,-0.59,-0.61,-0.17,0.18
2,GENE4690X,A_06_P1834,QRI7 -- proteolysis and peptidolysis -- ...,1,-0.02,-0.27,-0.27,-0.02,0.24,0.25,...,-0.07,-0.05,-0.13,-0.04,-0.91,-0.94,-0.42,-0.36,-0.49,-0.47
3,GENE1177X,A_06_P4928,CFT2 -- mRNA polyadenylylation* -- RNA b...,1,-0.33,-0.41,-0.24,-0.03,-0.03,0.00,...,-0.05,0.02,0.00,0.08,-0.53,-0.51,-0.26,0.05,-0.14,-0.01
4,GENE511X,A_06_P5620,SSO2 -- vesicle fusion* -- t-SNARE activ...,1,0.05,0.02,0.40,0.34,-0.13,-0.14,...,0.00,-0.11,0.04,0.01,-0.45,-0.09,-0.13,0.02,-0.09,-0.03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5532,GENE2833X,A_06_P6094,KRE1 -- cell wall organization and bioge...,1,0.41,-0.28,0.30,0.50,-0.05,-0.08,...,0.38,0.23,0.21,0.15,0.32,0.62,0.54,0.01,0.56,0.28
5533,GENE271X,A_06_P3243,MTL1 -- cell wall organization and bioge...,1,0.50,-0.12,0.25,0.24,0.13,0.02,...,0.25,-0.02,-0.06,-0.10,,0.50,0.29,-0.14,0.47,0.27
5534,GENE1691X,A_06_P4196,KRE9 -- cell wall organization and bioge...,1,0.15,0.09,0.21,0.46,0.19,-0.02,...,0.37,0.21,0.16,-0.01,-0.68,0.63,0.41,0.09,0.48,0.43
5535,GENE1755X,A_06_P4680,UTH1 -- mitochondrion organization and b...,1,0.63,0.38,0.05,0.12,0.13,-0.01,...,-0.07,0.02,0.24,0.18,-0.89,0.19,0.03,0.04,0.13,0.19


Now that you have loaded in and looked at the data, list 2 reasons why this dataset does NOT follow the rules of tidy data (hint: review section 2.3 of Hadley Wickham's Tidy Data paper, http://vita.had.co.nz/papers/tidy-data.pdf)

ANSWER:
1. 
2. 

### 2. Make a new dataframe called df_clean that follows the tidy data rules, and print it
- (hint: the "NAME" column consists of gene name, biological function, molecular function, systematic name, and gene number. Split it into 5 separate columns with unique names. This might be helpful: https://pandas.pydata.org/pandas-docs/version/0.23.1/generated/pandas.Series.str.split.html)

In [47]:
#DJH code
origin_drop=origin.drop(columns=['GID','YORF','GWEIGHT'])
df_clean=origin_drop.melt(id_vars='NAME')
df_clean


Unnamed: 0,NAME,variable,value
0,SFB2 -- ER to Golgi transport -- molecul...,G0.05,-0.24
1,-- biological process unknown -- mol...,G0.05,0.28
2,QRI7 -- proteolysis and peptidolysis -- ...,G0.05,-0.02
3,CFT2 -- mRNA polyadenylylation* -- RNA b...,G0.05,-0.33
4,SSO2 -- vesicle fusion* -- t-SNARE activ...,G0.05,0.05
...,...,...,...
199327,KRE1 -- cell wall organization and bioge...,U0.3,0.28
199328,MTL1 -- cell wall organization and bioge...,U0.3,0.27
199329,KRE9 -- cell wall organization and bioge...,U0.3,0.43
199330,UTH1 -- mitochondrion organization and b...,U0.3,0.19
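To see what `melt` does on something small enough to read at a glance, here is a sketch on a made-up two-gene table. The gene labels and values are hypothetical, and `var_name`/`value_name` are optional arguments that the cell above does not use (it keeps the default names `variable` and `value`).

```python
import pandas as pd

# Toy wide table: one row per gene, one column per sample (values are made up).
wide = pd.DataFrame({
    "NAME": ["geneA", "geneB"],
    "G0.05": [-0.24, 0.28],
    "G0.1": [-0.13, 0.13],
    "N0.05": [0.10, -0.30],
})

# Wide to long: each (gene, sample) pair becomes its own row.
long = wide.melt(id_vars="NAME", var_name="sample", value_name="expression")
print(long)
```

Two genes times three samples gives six rows. The same arithmetic explains the row count of the real melted frame: number of genes times 36 samples.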


In [48]:
#DJH code
df_clean['NAME'].str.split(' -- ', expand=True)

Unnamed: 0,0,1,2,3,4
0,SFB2,ER to Golgi transport,molecular function unknown,YNL049C,1082129
1,,biological process unknown,molecular function unknown,YNL095C,1086222
2,QRI7,proteolysis and peptidolysis,metalloendopeptidase activity,YDL104C,1085955
3,CFT2,mRNA polyadenylylation*,RNA binding,YLR115W,1081958
4,SSO2,vesicle fusion*,t-SNARE activity,YMR183C,1081214
...,...,...,...,...,...
199327,KRE1,cell wall organization and biogenesis,structural constituent of cell wall,YNL322C,1083836
199328,MTL1,cell wall organization and biogenesis,molecular function unknown,YGR023W,1080930
199329,KRE9,cell wall organization and biogenesis*,molecular function unknown,YJL174W,1082539
199330,UTH1,mitochondrion organization and biogenesis*,molecular function unknown,YKR042W,1082610
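Row 1 of this output has an empty gene name, which suggests its NAME string begins with the separator. One way to handle that, sketched here under the assumption that the raw string really starts with " -- " (the two strings are rebuilt from the split output, not read from the file), is to strip whitespace and turn empty strings into proper missing values:

```python
import pandas as pd

names = pd.Series([
    "SFB2 -- ER to Golgi transport -- molecular function unknown -- YNL049C -- 1082129",
    # Assumed raw form of row 1: the gene name field is empty.
    " -- biological process unknown -- molecular function unknown -- YNL095C -- 1086222",
])

parts = names.str.split(" -- ", expand=True)
parts.columns = ["gene_name", "bio_fun", "mol_fun", "sys_name", "gene_no"]

# Strip stray whitespace, then replace empty strings with missing values.
parts = parts.apply(lambda col: col.str.strip()).replace("", pd.NA)

print(parts["gene_name"].isna().sum())
```

With missing values in place, `isna()` and `dropna()` treat unnamed genes the way you would expect, instead of silently carrying empty strings around.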


In [49]:
#DJH code
#Name column consists of gene name, biological functions, molecular functions, systematic names, and gene number. 
# Split into 5 separate columns with unique names.
#df_clean['NAME'].str.split(' -- ', expand=True)
df_clean[['gene_name','bio_fun','mol_fun','sys_name','gene_no']]=df_clean['NAME'].str.split(' -- ', expand=True)
df_clean

Unnamed: 0,NAME,variable,value,gene_name,bio_fun,mol_fun,sys_name,gene_no
0,SFB2 -- ER to Golgi transport -- molecul...,G0.05,-0.24,SFB2,ER to Golgi transport,molecular function unknown,YNL049C,1082129
1,-- biological process unknown -- mol...,G0.05,0.28,,biological process unknown,molecular function unknown,YNL095C,1086222
2,QRI7 -- proteolysis and peptidolysis -- ...,G0.05,-0.02,QRI7,proteolysis and peptidolysis,metalloendopeptidase activity,YDL104C,1085955
3,CFT2 -- mRNA polyadenylylation* -- RNA b...,G0.05,-0.33,CFT2,mRNA polyadenylylation*,RNA binding,YLR115W,1081958
4,SSO2 -- vesicle fusion* -- t-SNARE activ...,G0.05,0.05,SSO2,vesicle fusion*,t-SNARE activity,YMR183C,1081214
...,...,...,...,...,...,...,...,...
199327,KRE1 -- cell wall organization and bioge...,U0.3,0.28,KRE1,cell wall organization and biogenesis,structural constituent of cell wall,YNL322C,1083836
199328,MTL1 -- cell wall organization and bioge...,U0.3,0.27,MTL1,cell wall organization and biogenesis,molecular function unknown,YGR023W,1080930
199329,KRE9 -- cell wall organization and bioge...,U0.3,0.43,KRE9,cell wall organization and biogenesis*,molecular function unknown,YJL174W,1082539
199330,UTH1 -- mitochondrion organization and b...,U0.3,0.19,UTH1,mitochondrion organization and biogenesis*,molecular function unknown,YKR042W,1082610
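The melted frame still packs two variables, the limiting nutrient and the growth rate, into the single `variable` column. A possible next step is sketched below on a toy frame. The column names `nutrient` and `rate` are my own choice, not part of the exercise, and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "variable": ["G0.05", "G0.3", "N0.1", "U0.25"],
    "value": [-0.24, -0.05, 0.12, -0.19],  # made-up numbers
})

# One capture group per new column: a single letter, then the numeric rate.
toy[["nutrient", "rate"]] = toy["variable"].str.extract(r"^([A-Z])(\d*\.?\d+)$")

# The extracted rate is still text, so convert it to a number.
toy["rate"] = toy["rate"].astype(float)
print(toy)
```

Having `rate` as a real number means you can sort or plot by growth rate directly, rather than by the text label.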


In [11]:
#df_clean[['genename', 'biofunction', 'molfunction', 'sysname', 'genenum']]=df_clean['NAME'].str.split(' -- ', expand=True)

In [12]:
#df_clean

### 3. Subsetting!

Next, let's dig into the data more. Using pandas again, subset the dataframe to keep just the genes that have the string "cell cycle" as their biological process (see the note about the "NAME" column above in step 2).

In [50]:
#This command will return a df that contains only those values for which 'cell cycle' is the sole process
cell_cyle_only=df_clean[df_clean['bio_fun']=='cell cycle']
cell_cyle_only

Unnamed: 0,NAME,variable,value,gene_name,bio_fun,mol_fun,sys_name,gene_no
706,PCL2 -- cell cycle -- cyclin-dependent p...,G0.05,-1.19,PCL2,cell cycle,cyclin-dependent protein kinase regulator acti...,YDL127W,1082011
1132,PCL9 -- cell cycle -- cyclin-dependent p...,G0.05,-1.15,PCL9,cell cycle,cyclin-dependent protein kinase regulator acti...,YDL179W,1085191
2740,CLG1 -- cell cycle -- cyclin-dependent p...,G0.05,0.17,CLG1,cell cycle,cyclin-dependent protein kinase regulator acti...,YGL215W,1083039
2771,SCM4 -- cell cycle -- molecular function...,G0.05,0.17,SCM4,cell cycle,molecular function unknown,YGR049W,1085248
2948,PCL5 -- cell cycle -- cyclin-dependent p...,G0.05,2.00,PCL5,cell cycle,cyclin-dependent protein kinase regulator acti...,YHR071W,1083682
...,...,...,...,...,...,...,...,...
194927,PCL9 -- cell cycle -- cyclin-dependent p...,U0.3,0.13,PCL9,cell cycle,cyclin-dependent protein kinase regulator acti...,YDL179W,1085191
196535,CLG1 -- cell cycle -- cyclin-dependent p...,U0.3,0.03,CLG1,cell cycle,cyclin-dependent protein kinase regulator acti...,YGL215W,1083039
196566,SCM4 -- cell cycle -- molecular function...,U0.3,0.48,SCM4,cell cycle,molecular function unknown,YGR049W,1085248
196743,PCL5 -- cell cycle -- cyclin-dependent p...,U0.3,-1.00,PCL5,cell cycle,cyclin-dependent protein kinase regulator acti...,YHR071W,1083682


In [51]:
#This command will return a df that contains all mentions of the term 'cell cycle' in the bio_fun column
cell_cycle=df_clean[df_clean['bio_fun'].str.contains('cell cycle')]
cell_cycle

Unnamed: 0,NAME,variable,value,gene_name,bio_fun,mol_fun,sys_name,gene_no
108,ZPR1 -- regulation of progression throug...,G0.05,0.00,ZPR1,regulation of progression through cell cycle,protein binding,YGR211W,1082106
118,PTC2 -- G1/S transition of mitotic cell ...,G0.05,-1.04,PTC2,G1/S transition of mitotic cell cycle*,protein phosphatase type 2C activity,YER089C,1084958
434,PIN4 -- G2/M transition of mitotic cell ...,G0.05,-0.71,PIN4,G2/M transition of mitotic cell cycle*,molecular function unknown,YBL051C,1084089
445,HSL7 -- regulation of progression throug...,G0.05,-0.56,HSL7,regulation of progression through cell cycle*,protein-arginine N-methyltransferase activity,YBR133C,1082505
466,SIS2 -- G1/S transition of mitotic cell ...,G0.05,0.16,SIS2,G1/S transition of mitotic cell cycle*,phosphopantothenoylcysteine decarboxylase acti...,YKR072C,1081300
...,...,...,...,...,...,...,...,...
199116,VHS3 -- G1/S transition of mitotic cell ...,U0.3,0.10,VHS3,G1/S transition of mitotic cell cycle*,phosphopantothenoylcysteine decarboxylase acti...,YOR054C,1084698
199120,SIC1 -- G1/S transition of mitotic cell ...,U0.3,0.02,SIC1,G1/S transition of mitotic cell cycle*,protein binding*,YLR079W,1080719
199172,CDC4 -- G1/S transition of mitotic cell ...,U0.3,0.28,CDC4,G1/S transition of mitotic cell cycle*,protein binding*,YFL009W,1080982
199191,PTK2 -- G1/S transition of mitotic cell ...,U0.3,0.15,PTK2,G1/S transition of mitotic cell cycle*,protein kinase activity,YJR059W,1082178


In [15]:
#Question for the TA: I didn't run this command, is it just to confirm the categories present in the df?
#If I were downloading this data set from the internet, would I run this to find values to filter on?
#df_clean['bio_fun'].unique()

array(['ER to Golgi transport', 'biological process unknown',
       'proteolysis and peptidolysis', 'mRNA polyadenylylation*',
       'vesicle fusion*', 'riboflavin biosynthesis',
       'vacuolar acidification', 'deadenylylation-independent decapping',
       'protein retention in Golgi*',
       'negative regulation of exit from mitosis*',
       'cytokinesis, completion of separation',
       'cell wall organization and biogenesis*',
       'cytochrome c oxidase complex assembly*',
       'pantothenate biosynthesis*',
       'transcription from mitochondrial promoter',
       'invasive growth (sensu Saccharomyces)*',
       'positive regulation of transcription from RNA polymerase II promoter',
       'cation homeostasis', 'nuclear mRNA splicing, via spliceosome*',
       'regulation of glycogen biosynthesis*', 'RNA splicing*',
       'mitochondrial genome maintenance*', 'sulfate transport',
       'telomerase-independent telomere maintenance*', '',
       'response to unfolded pro

In [18]:
#cellcycle=df_clean[df_clean['biofunction']=='cell cycle']

In [52]:
#cycle2=df_clean[df_clean['biofunction'].str.contains('cell cycle')]
#cycle2

In [22]:
#List every distinct biological process label that mentions 'cell cycle'
cell_cycle['bio_fun'].unique()

array(['regulation of progression through cell cycle',
       'G1/S transition of mitotic cell cycle*',
       'G2/M transition of mitotic cell cycle*',
       'regulation of progression through cell cycle*',
       'G1/S-specific transcription in mitotic cell cycle', 'cell cycle',
       'G1/S transition of mitotic cell cycle',
       'regulation of progression through mitotic cell cycle',
       'G1-specific transcription in mitotic cell cycle', 'cell cycle*',
       'G2/M-specific transcription in mitotic cell cycle',
       'cell cycle arrest in response to pheromone',
       'G1/S-specific transcription in mitotic cell cycle*'], dtype=object)
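The list above shows why the two filters give different answers: an exact comparison keeps only the label 'cell cycle', while `str.contains` keeps every label that mentions it, including 'cell cycle*'. Here is a small sketch on a hand-picked Series (three labels from the list, one unrelated label, and a missing value added for illustration):

```python
import pandas as pd

bio_fun = pd.Series([
    "cell cycle",
    "cell cycle*",
    "G1/S transition of mitotic cell cycle*",
    "vesicle fusion*",
    None,  # a missing annotation, added to show the na= argument
])

# Exact match: only the label that is exactly "cell cycle".
exact = bio_fun == "cell cycle"

# Substring match: regex=False treats the pattern as plain text,
# na=False makes missing values count as "no match".
substring = bio_fun.str.contains("cell cycle", regex=False, na=False)

print(exact.sum(), substring.sum())
```

Without `na=False`, a missing value produces a missing entry in the mask, and using that mask to index a dataframe can raise an error.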


Next, subset the dataframe again so it only contains cell cycle genes from the glucose treatments ("G").

Hint: consider looking into str.contains: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html

In [54]:
#DJH
#define a new dataframe that keeps only the rows whose 'variable' column contains 'G' (the glucose-limited samples)
cell_cycle_gluc=cell_cyle_only[cell_cyle_only['variable'].str.contains('G')]
cell_cycle_gluc

Unnamed: 0,NAME,variable,value,gene_name,bio_fun,mol_fun,sys_name,gene_no
706,PCL2 -- cell cycle -- cyclin-dependent p...,G0.05,-1.19,PCL2,cell cycle,cyclin-dependent protein kinase regulator acti...,YDL127W,1082011
1132,PCL9 -- cell cycle -- cyclin-dependent p...,G0.05,-1.15,PCL9,cell cycle,cyclin-dependent protein kinase regulator acti...,YDL179W,1085191
2740,CLG1 -- cell cycle -- cyclin-dependent p...,G0.05,0.17,CLG1,cell cycle,cyclin-dependent protein kinase regulator acti...,YGL215W,1083039
2771,SCM4 -- cell cycle -- molecular function...,G0.05,0.17,SCM4,cell cycle,molecular function unknown,YGR049W,1085248
2948,PCL5 -- cell cycle -- cyclin-dependent p...,G0.05,2.0,PCL5,cell cycle,cyclin-dependent protein kinase regulator acti...,YHR071W,1083682
4557,PCL1 -- cell cycle -- cyclin-dependent p...,G0.05,-0.64,PCL1,cell cycle,cyclin-dependent protein kinase regulator acti...,YNL289W,1082226
6243,PCL2 -- cell cycle -- cyclin-dependent p...,G0.1,-0.52,PCL2,cell cycle,cyclin-dependent protein kinase regulator acti...,YDL127W,1082011
6669,PCL9 -- cell cycle -- cyclin-dependent p...,G0.1,-0.55,PCL9,cell cycle,cyclin-dependent protein kinase regulator acti...,YDL179W,1085191
8277,CLG1 -- cell cycle -- cyclin-dependent p...,G0.1,0.15,CLG1,cell cycle,cyclin-dependent protein kinase regulator acti...,YGL215W,1083039
8308,SCM4 -- cell cycle -- molecular function...,G0.1,-0.19,SCM4,cell cycle,molecular function unknown,YGR049W,1085248


In [55]:
#cellcycleg=cellcycle[cellcycle['variable'].str.contains('G')]
#cellcycleg

Write the subsetted dataframe out to a CSV file, then open it in Excel or Google Sheets to check that you did it right. Screenshot the result.

In [57]:
cell_cycle_gluc.to_csv('test_tidy_data.csv')
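Besides opening the file in a spreadsheet, you can read it straight back into pandas as a quick check. This sketch uses a toy two-row frame (values copied from the glucose subset shown earlier) and a temporary directory with a hypothetical file name, so it does not touch your real output file:

```python
import os
import tempfile

import pandas as pd

subset = pd.DataFrame({
    "gene_name": ["PCL2", "PCL9"],
    "variable": ["G0.05", "G0.05"],
    "value": [-1.19, -1.15],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "check.csv")  # hypothetical file name
    subset.to_csv(path, index=False)        # index=False leaves out the row-number column
    reloaded = pd.read_csv(path)

print(reloaded)
```

If you keep the default `index=True`, as the cell above does, the row numbers are written as an extra unnamed first column, which pandas labels "Unnamed: 0" when the file is read back.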

In [24]:
#cellcycleg.to_csv('test.csv')

YAY you're all finished and are now super extra awesome at tidying data!! 