## This notebook walks through a common pipeline for calculating a developmental index from a matrix of normalized expression data (e.g. RPKM, FPKM, or TPM). The input dataframe should be formatted so that every column corresponds to an individual sample/cell and every row corresponds to a gene.
---

Step (1) - Import the necessary packages and data. Clean data as necessary.

---

In [1]:
import pandas as pd
import numpy as np

import developmental_index as dvp

df = pd.read_csv('C:\\Users\\Ben\\Dropbox\\bilbo_lab_spr2020\\microglia-seq_website\\microglia-seq\\mdi_w_rpkm\\GSE99622_hanamsagar2017_tpm_unmelted_v2.csv', header = [0, 1, 2, 3, 4])
genes = df.iloc[:, df.columns.get_level_values(4) == 'gene'].values.flatten()
df.set_index(genes, inplace = True)
df.drop(df.columns[0], axis = 1, inplace = True)
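
The `header = [0, 1, 2, 3, 4]` argument tells pandas that the first five rows of the file are column labels, which produces a multi-level column index. As a minimal sketch of the same load-and-clean pattern, the toy example below uses an in-memory CSV with only two header rows; the gene names, sample names, and values are made up for illustration and are not taken from the real file.

```python
import io

import pandas as pd

# Toy CSV with two header rows (the real file above has five).
# The first column holds gene names and is labelled 'gene' in the last header row.
csv_text = (
    "age,E18,E18,P60\n"
    "gene,s1,s2,s3\n"
    "geneA,1.0,2.0,3.0\n"
    "geneB,4.0,5.0,6.0\n"
)

toy = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Pull the gene names out of the column whose last-level label is 'gene'.
genes = toy.iloc[:, toy.columns.get_level_values(1) == "gene"].values.flatten()

# Use the gene names as the row index, then drop the now-redundant gene column.
toy = toy.set_index(genes)
toy = toy.drop(toy.columns[0], axis=1)

print(toy)
```

After this, every remaining column is a sample and every row is a gene, which is the layout the rest of the notebook expects.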

---

Step (2) - Scale the expression values to between 0 and 1, so that all genes contribute to the index equally

---

In [2]:
df = dvp.scale_expression(df)
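
To make the idea of scaling concrete, here is a minimal sketch of per-gene min-max scaling in plain pandas. It is an assumption about what "scale to between 0 and 1" means here, not the implementation of `dvp.scale_expression`; the helper name `min_max_scale_rows` and the toy values are hypothetical.

```python
import pandas as pd


def min_max_scale_rows(expr: pd.DataFrame) -> pd.DataFrame:
    """Rescale each row (gene) to the range [0, 1]. Hypothetical helper."""
    row_min = expr.min(axis=1)
    row_range = expr.max(axis=1) - row_min
    scaled = expr.sub(row_min, axis=0).div(row_range, axis=0)
    # Genes with identical values in every sample give 0/0 = NaN; set them to 0.
    return scaled.fillna(0.0)


toy = pd.DataFrame(
    {"s1": [0.0, 10.0, 5.0], "s2": [50.0, 20.0, 5.0], "s3": [100.0, 40.0, 5.0]},
    index=["geneA", "geneB", "geneC"],
)

scaled = min_max_scale_rows(toy)
print(scaled)
```

Because each gene is rescaled independently, a highly expressed gene and a lowly expressed gene end up on the same 0 to 1 scale, so neither dominates an average taken across genes.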

---

Step (3) - Drop any rows (genes) that do not have detectable expression in any of the samples

---

In [3]:
df = dvp.drop_unexpressed_genes(df)
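
As a rough illustration of this filter, the sketch below keeps only rows with a value above zero in at least one sample. Treating "detectable" as "greater than zero" is an assumption; `dvp.drop_unexpressed_genes` may use a different rule or threshold. The toy data are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {"s1": [0.0, 0.0, 0.3], "s2": [0.0, 0.7, 0.0], "s3": [0.0, 1.0, 0.0]},
    index=["geneA", "geneB", "geneC"],
)

# Keep a gene if any sample has expression above zero.
expressed = toy.loc[(toy > 0).any(axis=1)]
print(expressed)
```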

---

Step (4) - Extract the columns (samples) corresponding to the 'young' and 'old' groups, so that they can be compared against one another to determine whether expression differs significantly from young to old

---

In [4]:
## How you select these columns depends on how the samples are indexed in your dataframe; adjust as needed
young = df['E18']
old = df['P60']
old = old.iloc[:, old.columns.get_level_values(1)=='SAL']
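
The selection above relies on the multi-level column index. The toy example below sketches the same pattern on a small dataframe; the four levels (age, sex, treatment, sample name) and the `OTHER` treatment label are hypothetical stand-ins. Note that indexing with `df['P60']` drops the first level, so the treatment label moves from level 2 to level 1 in the result.

```python
import numpy as np
import pandas as pd

# Hypothetical column levels: age, sex, treatment, sample name.
columns = pd.MultiIndex.from_tuples(
    [
        ("E18", "Female", "SAL", "s1"),
        ("E18", "Male", "SAL", "s2"),
        ("P60", "Female", "SAL", "s3"),
        ("P60", "Male", "OTHER", "s4"),
    ]
)
toy = pd.DataFrame(
    np.arange(8.0).reshape(2, 4), index=["geneA", "geneB"], columns=columns
)

# Selecting on the first level returns the matching columns with that level removed.
young = toy["E18"]
old = toy["P60"]

# After the selection, treatment is level 1, so filter on it there.
old = old.iloc[:, old.columns.get_level_values(1) == "SAL"]

print(young.columns.tolist())
print(old.columns.tolist())
```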

---

Step (4.5) - Use the identify_significant_genes function to identify the genes that are regulated by development. You need to pass the young and old columns defined in Step 4 above

---

In [5]:
df = dvp.identify_significant_genes(df, young, old)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
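
The message above is the text of pandas' `SettingWithCopyWarning`, which is emitted when a value is assigned on a dataframe that may be a copy of a slice of another dataframe. It is a warning rather than an error, and on its own it does not show that the result is wrong. As a general illustration (not a description of what happens inside `dvp.identify_significant_genes`), taking an explicit copy before modifying a subset makes the intent unambiguous; the toy data and the `flag` column are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"expr": [0.2, 0.0, 0.9]}, index=["geneA", "geneB", "geneC"])

# An explicit copy makes the subset independent of the original dataframe.
subset = toy[toy["expr"] > 0].copy()
subset["flag"] = True

# The original dataframe is left untouched.
print(toy.columns.tolist())
print(subset)
```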


---

Step (5) - Remove any rows (genes) that were not significantly regulated by development

---

In [6]:
df = dvp.remove_insignificant_rows(df)

---

OPTIONAL STEP - Create a sub-dataframe containing just useful regulated gene information (direction and valence)


---

In [7]:
regulated_genes = dvp.extract_regulated_genes(df)

In [8]:
regulated_genes

gene,direction,valence
Prlr,UP,9.459871
Upk1b,UP,8.016785
Mlph,UP,7.518620
Olfr558,UP,7.269765
Cox8b,UP,7.225406
...,...,...
St8sia2,DOWN,-10.758872
H19,DOWN,-10.802779
Add2,DOWN,-11.215961
Neurod6,DOWN,-11.590081


---

Step (6) - Generate an index value for each individual sample (column) in your dataset. You will also need to provide the location of the sample columns (usually df.columns[1:-4]).

---

In [9]:
index = dvp.generate_index(df, sample_cols = df.columns[1:-4])
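
To give a feel for how regulated genes can be collapsed into one number per sample, the sketch below assumes, purely for illustration, that the index is the mean scaled expression of up-regulated genes divided by the mean scaled expression of down-regulated genes. This formula is an assumption and may not match what `dvp.generate_index` computes; the helper `ratio_index`, the gene names, and the values are all hypothetical.

```python
import pandas as pd


def ratio_index(expr: pd.DataFrame, direction: pd.Series) -> pd.Series:
    """Per-sample ratio of mean UP-gene to mean DOWN-gene expression.

    Hypothetical helper; the formula is an assumption for illustration.
    """
    up_mean = expr.loc[direction == "UP"].mean(axis=0)
    down_mean = expr.loc[direction == "DOWN"].mean(axis=0)
    return up_mean / down_mean


# Toy scaled expression: rows are genes, columns are samples.
expr = pd.DataFrame(
    {"young": [0.2, 0.4, 0.8, 1.0], "old": [0.8, 1.0, 0.2, 0.4]},
    index=["up1", "up2", "down1", "down2"],
)
direction = pd.Series(["UP", "UP", "DOWN", "DOWN"], index=expr.index)

toy_index = ratio_index(expr, direction)
print(toy_index)
```

Under this assumed formula, a sample with high expression of up-regulated genes and low expression of down-regulated genes receives a larger value, so the older toy sample scores higher than the younger one.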

---

Step (7) - Scale all index values to between 0 and 1, with 1 being most mature, and 0 least mature

---

In [10]:
index = dvp.scale_index(index)

---

OPTIONAL STEP - Cleaning up output columns, checking that everything worked properly, and exporting as .csv

---

In [11]:
index.drop('level_4', axis = 1, inplace = True)
index

,level_0,level_1,level_2,level_3,0
0,E18,Female,SAL,F_E18 1,0.074341
1,E18,Male,SAL,M_E18 1,0.039307
2,E18,Female,SAL,F_E18 3,0.069697
3,E18,Male,SAL,M_E18 4,0.0
4,P14,Female,SAL,F_P14 1,0.362481
5,P14,Female,SAL,F_P14 2,0.356923
6,P14,Female,SAL,F_P14 3,0.290847
7,P14,Female,SAL,F_P14 4,0.382989
8,P14,Female,SAL,F_P14 5,0.248252
9,P14,Female,SAL,F_P14 6,0.336164


In [12]:
index.to_csv('calculated_index.csv')
regulated_genes.to_csv('regulated_genes.csv')