## Purpose: 
### Define genes that are differentially expressed between platinum-drug resistant and sensitive cells. These will be used as gene lists in the NMF of CCLE gene expression data.
<hr style="border: none; border-bottom: 3px solid #88BBEE;">

Since thousands of genes have been implicated in platinum resistance, two experiments may not yield the same set of differentially expressed genes. Here we create several gene lists to capture the heterogeneity of platinum resistance.

### **Creating 2 different gene lists:**  
**(1)** Bulk RNA seq [Olivier]: **1211 genes (pval unadjusted)**    
**(2)** The union of [Marchion et al](https://www.ncbi.nlm.nih.gov/pubmed/21849418) analysis and Olivier's bulk RNA Seq: **3443 genes**   

**Brief summary of the analysis done by the Marchion et al group:**     
Only probe sets with expression ranges >2-fold were used. For each cell line, Pearson correlation coefficients were calculated between expression data and cisplatin EC50. The cutoffs applied in this notebook are:
* **|R|** > 0.85 for the Marchion data. A p-value cutoff could be applied here as well.  
* **p (unadjusted)** < 0.01 for Olivier's data.      
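
The correlation step summarized above can be sketched on toy data. This is only an illustration, not the Marchion et al. pipeline: the expression matrix, the EC50 values, and the helper `correlated_probes` are hypothetical, and the expression range is assumed to be maximum/minimum on a linear scale.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: rows are probe sets, columns are cell lines.
expression = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],   # rises with EC50
    [5.0, 4.1, 2.9, 2.0, 1.0],   # falls with EC50
    [3.0, 3.1, 2.9, 3.0, 3.1],   # nearly flat, fails the range filter
    [1.0, 4.0, 2.0, 5.0, 1.5],   # variable but uncorrelated with EC50
])
ec50 = np.array([0.5, 1.0, 2.0, 4.0, 8.0])  # hypothetical cisplatin EC50 per cell line


def correlated_probes(expression, ec50, min_fold_range=2.0, min_abs_r=0.85):
    """Return (row index, R) for probes passing the range and |R| filters."""
    hits = []
    for i, row in enumerate(expression):
        # Expression range as max/min (assumes strictly positive values).
        if row.max() / row.min() <= min_fold_range:
            continue
        r, _p = stats.pearsonr(row, ec50)
        if abs(r) > min_abs_r:
            hits.append((i, r))
    return hits


hits = correlated_probes(expression, ec50)
print(hits)
```

Only the first two toy probes pass both filters: one correlates positively with EC50 and one negatively.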

In [1]:
import pandas as pd
import numpy as np
import math
from scipy import stats
from statsmodels.stats.multitest import multipletests

In [2]:
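# Reading in Marchion et al. data
march_dat = pd.read_csv('../data/Marchionetal2.csv', index_col=0)
march_dat.reset_index(inplace=True)
# NOTE: sort_values returns a sorted copy, which is discarded here; march_dat keeps its original order.
march_dat.sort_values('Gene_Symbol')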
march_dat.head()

Unnamed: 0,Cell_Line,Correlation_Coefficient,Gene_Symbol
0,A2008,-0.97187,ABHD11
1,A2008,-0.969153,LOC729070
2,A2008,-0.94759,OGG1
3,A2008,-0.946277,NFATC2IP
4,A2008,-0.940457,SUSD4


In [3]:
# Reading in Olivier's bulk RNA-seq data
# NOTE: Excel has converted gene symbols that resemble dates into its date format.
# The Marchion data arrived in the same state, so both datasets are left as-is here;
# that keeps them comparable, and both can be converted back at the end.
oliv_dat = pd.read_csv('../data/res.df.csv', index_col=0)
oliv_dat.reset_index(inplace=True)
# NOTE: sort_values returns a sorted copy, which is discarded here; oliv_dat keeps its original order.
oliv_dat.sort_values('symbol')
oliv_dat.head()

Unnamed: 0,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,gene_id,symbol
0,62.361911,4.594909,0.365186,12.582373,2.64e-36,5.58e-32,ENSG00000243232,PCDHAC2
1,1040.392032,2.977025,0.371799,8.007074,1.17e-15,1.24e-11,ENSG00000147100,SLC16A2
2,756.580703,1.989505,0.257404,7.729119,1.08e-14,7.62e-11,ENSG00000171444,MCC
3,21.451441,-2.763946,0.386104,-7.158546,8.15e-13,4.31e-09,ENSG00000251220,RFPL4AP3
4,64.231034,-2.585837,0.373825,-6.917246,4.61e-12,1.71e-08,ENSG00000273079,GRIN2B
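
Converting the Excel-mangled symbols back could look roughly like the sketch below. It assumes Excel rendered them as `<number>-<Mon>` (for example `1-Mar`), which is not confirmed by the data shown here. The mapping `MONTH_TO_PREFIX` and the helper `restore_symbol` are hypothetical. The `Mar` case is ambiguous (MARCH1 vs MARC1), so any result would need manual review.

```python
import re

# Assumption: Excel rendered the symbols as "<number>-<Mon>", e.g. "1-Mar".
# The prefixes are legacy gene-family names; "Mar" is ambiguous
# (MARCH1 vs MARC1), so this mapping is a guess that needs manual review.
MONTH_TO_PREFIX = {"Mar": "MARCH", "Sep": "SEPT", "Dec": "DEC"}
DATE_LIKE = re.compile(r"^(\d{1,2})-(Mar|Sep|Dec)$")


def restore_symbol(value):
    """Map an Excel date-like string back to a gene symbol, else return it unchanged."""
    match = DATE_LIKE.match(str(value))
    if match is None:
        return value
    number, month = match.groups()
    return f"{MONTH_TO_PREFIX[month]}{int(number)}"


symbols = ["1-Mar", "2-Sep", "TP53", "OGG1"]
restored = [restore_symbol(s) for s in symbols]
print(restored)
```

Applying the same function to both datasets would keep them comparable after conversion.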


In [4]:
# Cleaning both datasets: drop multi-gene entries ('///') and missing gene symbols.
# na=True marks missing symbols as matches so that they are dropped as well.
march_dat_clean = march_dat[~march_dat.Gene_Symbol.str.contains('///', na=True)]
march_dat_clean = march_dat_clean[~march_dat_clean.Gene_Symbol.str.contains('NaN', na=True)]
oliv_dat_clean = oliv_dat[~oliv_dat.symbol.str.contains('NaN', na=True)]
oliv_dat_clean = oliv_dat_clean[~oliv_dat_clean.symbol.str.contains('///', na=True)]
# Drop genes without a finite adjusted p-value.
oliv_dat_clean = oliv_dat_clean[np.isfinite(oliv_dat_clean['padj'])]
march_dat_clean.shape

(3319, 3)

In [5]:
oliv_dat_clean.shape

(20295, 8)
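
The effect of the cleaning step on missing values can be checked on a small hypothetical table. This sketch assumes multi-gene entries are marked with `///` and that missing symbols arrive as real NaN values; the column names mirror Olivier's table, but the rows are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the real gene-symbol column.
toy = pd.DataFrame({
    "symbol": ["OGG1", "ABC1 /// ABC2", np.nan, "MCC"],
    "padj": [0.001, 0.2, 0.5, np.nan],
})

# regex=False treats '///' as a literal string; na=True flags missing symbols
# so that they are dropped together with the multi-gene entries.
bad_symbol = toy["symbol"].str.contains("///", regex=False, na=True).astype(bool)

# Keep rows with a single, non-missing symbol and a finite adjusted p-value.
toy_clean = toy[~bad_symbol & np.isfinite(toy["padj"])]
print(toy_clean)
```

Passing `na=` explicitly avoids relying on how a comparison with NaN happens to evaluate.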

In [6]:
#Setting maximum p values and minimum R correlation coefficients.
max_p_value = 0.01
min_R_value = 0.85

In [7]:
#Filtering both datasets for statistical limits set in cell above
# Keep genes with |R| > min_R_value. The Marchion data is already filtered at this
# threshold, so this step is expected to leave it unchanged.
strong_cor = march_dat_clean['Correlation_Coefficient'].abs() > min_R_value
march_dat_filter = march_dat_clean[strong_cor]
march_dat_filter = march_dat_filter.drop_duplicates(subset='Gene_Symbol', keep="last")

sig_p = oliv_dat_clean['pvalue'] < max_p_value
sig_padj = oliv_dat_clean['padj'] < max_p_value

oliv_dat_filter_nadj = oliv_dat_clean[sig_p] 
oliv_dat_filter = oliv_dat_clean[sig_padj]
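
A quick check on toy values shows how a two-sided correlation cutoff behaves. The numbers are invented for illustration. Note that an OR of `r < threshold` and `r > -threshold` is satisfied by every value, whereas `|R| > threshold` keeps only the strong correlations.

```python
import pandas as pd

# Hypothetical correlation coefficients.
r = pd.Series([-0.97, -0.50, 0.10, 0.86, 0.95], name="Correlation_Coefficient")
threshold = 0.85

# An OR of these two conditions is True for every finite value, because any
# number is below +threshold or above -threshold (or both).
always_true = (r < threshold) | (r > -threshold)

# |R| > threshold keeps only strong positive or strong negative correlations.
strong = r.abs() > threshold

print(always_true.all())
print(r[strong].tolist())
```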

In [8]:
o_dat = oliv_dat_filter_nadj['symbol'].unique()
m_dat = march_dat_filter['Gene_Symbol'].unique()
# Series.append was removed in pandas 2.0; pd.concat gives the same result.
union = pd.concat([oliv_dat_filter_nadj['symbol'], march_dat_filter['Gene_Symbol']]).unique()
len(union)

3443
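
To see how much the two sources overlap, the union can be compared with the intersection. The sketch below uses invented symbols rather than the real tables, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical symbols standing in for the two filtered tables.
rna_seq = pd.Series(["MCC", "OGG1", "GRIN2B", "MCC"])
marchion = pd.Series(["OGG1", "SUSD4", "ABHD11"])

# Union of both lists, with duplicates removed and first-seen order kept.
toy_union = pd.concat([rna_seq, marchion]).unique()

# Genes found in both sources, and genes unique to the RNA-seq list.
shared = sorted(set(rna_seq) & set(marchion))
only_rna_seq = sorted(set(rna_seq) - set(marchion))

print(len(toy_union), shared, only_rna_seq)
```

The size of the union equals the sum of the two unique counts minus the number of shared genes, which is a useful sanity check on the real lists.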

In [9]:
#print('Total genes in Marchion et al. analysis with expression ranges >=2-fold (maximum/minimum): {}').format(march_dat_filter.shape[0])

print('Total genes in Marchion et al. analysis which have a |R| > 0.85 : {}'.format(m_dat.shape[0]))
print('\n')
print('Total genes in Olivier\'s RNA Seq analysis: {}'.format(oliv_dat_clean.shape[0]))
print('Total genes in Olivier\'s RNA Seq analysis which have p_value < 0.01 : {}'.format(o_dat.shape[0]))   
print('Total genes in Olivier\'s RNA Seq analysis which have adjusted p_value (padj) < 0.01 : {}'.format(oliv_dat_filter.shape[0]))

Total genes in Marchion et al. analysis which have a |R| > 0.85 : 2375


Total genes in Olivier's RNA Seq analysis: 20295
Total genes in Olivier's RNA Seq analysis which have p_value < 0.01 : 1211
Total genes in Olivier's RNA Seq analysis which have adjusted p_value (padj) < 0.01 : 191
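
The last count comes from the `padj` column, so it depends on how those adjusted p-values were computed. That method is not stated in this notebook. As a hedged illustration, the sketch below compares two common corrections on invented p-values: Bonferroni (equivalent to p < alpha / number of genes) and Benjamini-Hochberg. Both helper functions are written here for illustration and are not taken from the original analysis.

```python
import numpy as np


def bonferroni(pvals):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted


# Hypothetical unadjusted p-values.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042]
alpha = 0.05
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

print((bonf < alpha).sum(), (bh < alpha).sum())
```

On these toy values Bonferroni keeps far fewer p-values than Benjamini-Hochberg, which is why the label attached to a `padj` cutoff should name the correction actually used.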


# (1) CREATING JUST A BULK RNA SEQ GENE LIST [OLIVIER'S]
### Carboplatin

In [13]:
# Just Olivier's bulk RNA Seq
# Write the unique symbols (o_dat) so the file matches the gene count reported above.
pd.Series(o_dat).to_csv('../data/bulkRNASeq_genelist.csv')

# (2) CREATING A BULK RNA SEQ + Marchion et al GENE LIST 
### Cisplatin and Carboplatin

In [11]:
pd.Series(union).to_csv('../data/bulkRNASeq_and_Marchion_genelist.csv')