# Calculations of the Effect Size (ES) for each microarray study 

### Using Hedges' g value, an adjusted Cohen's d value

$$  {Enrichment} = \bar{X_2}-\bar{X_1}$$

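Let Group 1 be the 6 h sleep (S) cerebral cortex expression values and Group 2 be the 6 h sleep-deprivation (SD) cerebral cortex expression values.

Enrichment is therefore (SD mean - S mean). **The expression values are log-transformed, so this difference corresponds to a log ratio.**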

$$  {Pooled\ Standard\ Deviation} = \sqrt{\frac{(n_1-1)S_1^2 +(n_2-1)S_2^2}{(n_1 +n_2) -2}}  $$

$$  {Cohen's\ d\ value} = \frac{Enrichment}{Pooled\ Standard\ Deviation} $$

$$  {Degrees\ of\ Freedom\ (df)} = n_1 + n_2 - 2 $$

$$  {Correction\ Factor (J\ Factor)} = 1- \frac{3}{4df-1} $$

$$  {Hedges'\ g\ value} = Cohen's\ d \times J $$

$$  {Variance\ in\ d (V_d)} = \frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 +n_2)}  $$

$$  {Variance\ in\ g (V_g)} = J^2 \times V_d  $$

$$  {Standard\ Error\ in\ g (SE_g)} = \sqrt{V_g}  $$
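
As a worked check of the formulas above, the sketch below applies them to one gene (ENSMUSG00000028180), using the three S and three SD values printed in the `head()` outputs later in this notebook. It assumes $df = n_1 + n_2 - 2$ for the correction factor, which gives J = 0.8 for three samples per group; the notebook's own J column reads 0.85, so the g value from this sketch will not match the notebook's Hedges' g column. All variable names are illustrative.

```python
import numpy as np

# Expression values (log scale) for ENSMUSG00000028180, copied from the notebook output
s = np.array([9.167394, 9.199392, 9.020027])    # Group 1: sleep (S)
sd = np.array([9.15771, 9.044992, 9.061149])    # Group 2: sleep deprivation (SD)
n1, n2 = len(s), len(sd)

# Enrichment: Group 2 mean minus Group 1 mean
enrichment = sd.mean() - s.mean()

# Pooled standard deviation from the two sample variances (ddof=1)
pooled_sd = np.sqrt(
    ((n1 - 1) * s.var(ddof=1) + (n2 - 1) * sd.var(ddof=1)) / (n1 + n2 - 2)
)

cohens_d = enrichment / pooled_sd

# Assumption: df = n1 + n2 - 2 (not the denominator used in the notebook cell)
dof = n1 + n2 - 2
j = 1 - 3 / (4 * dof - 1)

hedges_g = cohens_d * j
var_d = (n1 + n2) / (n1 * n2) + cohens_d ** 2 / (2 * (n1 + n2))
var_g = j ** 2 * var_d
se_g = np.sqrt(var_g)

print(enrichment, pooled_sd, cohens_d, j, hedges_g, var_d, var_g, se_g)
```

The enrichment, Cohen's d and V_d from this sketch can be compared directly with the notebook's printed columns, since none of them depends on J.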

## Setup working environment and import data

In [1]:
import pandas as pd # dataframes and file I/O
import numpy as np # numerical calculations
%cd /Users/Ella1/Desktop/data sets MoEx


/Users/Ella1/Desktop/data sets MoEx


In [2]:
prefix = 'MoEx_CerCx_'   # define a prefix to add to column names (making indexing easier later)

In [3]:
# import the data file to a data frame 'df'
df=pd.read_table('DATASET-GSE33491(exon).txt', delimiter='\t',  index_col=0) #,nrows=500)  
df.shape

(28399, 30)

In [4]:
# remove probes that are known to cross-hybridise to more than one target
df = df[~df.index.str.contains('_x_|_s_')]    # important: ~ inverts the selection, keeping rows that do NOT match
df.shape

(28399, 30)
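
The shape is unchanged after this step, which is what one would expect if every index entry is an Ensembl gene ID with no `_x_` or `_s_` substring. The sketch below shows how the inverted mask behaves on a toy index; the three Affymetrix-style probe IDs are hypothetical and are not taken from this data set.

```python
import pandas as pd

# Toy frame: three hypothetical Affymetrix-style probe IDs plus one Ensembl gene ID
toy = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=["1415670_at", "1415671_s_at", "1415672_x_at", "ENSMUSG00000028180"],
)

# True where the index label contains '_x_' or '_s_'
mask = toy.index.str.contains("_x_|_s_")

# ~ inverts the mask, so only non-matching rows are kept
kept = toy[~mask]
print(list(kept.index))
```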

## Look at column names and then set up filters for grouping columns into cerebral cortex samples of the S and SD groups
It is important that we pick up only one of the two tissue types investigated in this accession.

In [5]:
df.columns

Index(['Definition', 'Symbol', 'Transcript_cluster_ids',
       'Constitutive_exons_used', 'Constitutive_IDs_used',
       'Putative microRNA binding sites', 'Select Cellular Compartments',
       'Select Protein Classes', 'Chromosome', 'Strand',
       'Genomic Gene Corrdinates', 'GO-Biological Process',
       'GO-Molecular Function', 'GO-Cellular Component', 'WikiPathways',
       'GSM828577_CerCx_S.CEL', 'GSM828578_CerCx_S.CEL',
       'GSM828579_CerCx_S.CEL', 'avg-CerCx_S', 'log_fold-CerCx_S_vs_CerCx_SD',
       'fold-CerCx_S_vs_CerCx_SD', 'rawp-CerCx_S_vs_CerCx_SD',
       'adjp-CerCx_S_vs_CerCx_SD', 'GSM828583_CerCx_SD.CEL',
       'GSM828584_CerCx_SD.CEL', 'GSM828585_CerCx_SD.CEL', 'avg-CerCx_SD',
       'ANOVA-rawp', 'ANOVA-adjp', 'largest fold'],
      dtype='object')

In [6]:
# define regular expressions for the sleep (S) and sleep deprivation (SD) filters
# the dot is escaped so that it matches a literal '.' rather than any character
s_filt = r'CerCx_S\.CEL'
sd_filt = r'CerCx_SD\.CEL'
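
`DataFrame.filter(regex=...)` keeps the columns whose names contain a match for the pattern, so the filters need to match the raw sample columns without also matching the summary columns such as `avg-CerCx_S`. A minimal sketch on a toy frame that reuses a few of the column names listed above (the values are made up):

```python
import pandas as pd

# Toy frame with a subset of the real column names; the numbers are placeholders
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0]],
    columns=[
        "GSM828577_CerCx_S.CEL",
        "avg-CerCx_S",
        "log_fold-CerCx_S_vs_CerCx_SD",
        "GSM828583_CerCx_SD.CEL",
        "avg-CerCx_SD",
    ],
)

# Escaped dot: matches the literal '.CEL' suffix only
s_cols = list(toy.filter(regex=r"CerCx_S\.CEL").columns)
sd_cols = list(toy.filter(regex=r"CerCx_SD\.CEL").columns)

# Unescaped dot: '.' matches any single character, yet selects the same columns here
s_cols_unescaped = list(toy.filter(regex="CerCx_S.CEL").columns)

print(s_cols, sd_cols)
```

The S pattern does not pick up the SD column because `CEL` must follow `CerCx_S` after exactly one character, and in the SD name that position is followed by `.CEL`, not `CEL`.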

In [7]:
df_s=df.filter(regex= s_filt)
df_s.head()

Ensembl_gene,GSM828577_CerCx_S.CEL,GSM828578_CerCx_S.CEL,GSM828579_CerCx_S.CEL
ENSMUSG00000028180,9.167394,9.199392,9.020027
ENSMUSG00000028182,4.865563,4.969232,4.631375
ENSMUSG00000028185,4.683664,4.772426,5.099074
ENSMUSG00000028184,10.504887,10.413447,10.544117
ENSMUSG00000028187,8.810018,8.791182,9.118087


In [8]:
df_sd=df.filter(regex= sd_filt)
df_sd.head()

Ensembl_gene,GSM828583_CerCx_SD.CEL,GSM828584_CerCx_SD.CEL,GSM828585_CerCx_SD.CEL
ENSMUSG00000028180,9.15771,9.044992,9.061149
ENSMUSG00000028182,4.99021,4.74879,4.93129
ENSMUSG00000028185,5.00226,4.816444,5.287148
ENSMUSG00000028184,10.323355,10.448718,10.420303
ENSMUSG00000028187,8.932962,9.024588,8.915061


## Calculations 

In [9]:
# Enrichment

df[prefix+'Enrich'] = df.filter(regex=sd_filt).mean(axis=1) - df.filter(regex=s_filt).mean(axis=1)

In [10]:
df[prefix+'Enrich'].head()

Ensembl_gene
ENSMUSG00000028180   -0.040987
ENSMUSG00000028182    0.068040
ENSMUSG00000028185    0.183563
ENSMUSG00000028184   -0.090025
ENSMUSG00000028187    0.051108
Name: MoEx_CerCx_Enrich, dtype: float64

In [11]:
# Calculating pooled StDev
Scount = df.filter(regex=s_filt).count(axis=1)     # n1: non-missing S samples per gene
SDcount = df.filter(regex=sd_filt).count(axis=1)   # n2: non-missing SD samples per gene

# (n-1) * sample variance for each group (pandas .var() uses ddof=1 by default)
SumSqS = (Scount-1) * df.filter(regex=s_filt).var(axis=1)
SumSqSD = (SDcount-1) * df.filter(regex=sd_filt).var(axis=1)

df[prefix+'poolStDev'] = np.sqrt((SumSqS+SumSqSD)/(Scount+SDcount-2))
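
The pooled standard deviation formula expects sample variances (divisor n-1). pandas `.var()` uses `ddof=1` by default, whereas NumPy's `np.var` defaults to `ddof=0`, so swapping one for the other would change the result. A small sketch using the three S values of one gene from the output above:

```python
import numpy as np
import pandas as pd

row = pd.Series([9.167394, 9.199392, 9.020027])

sample_var = row.var()                      # pandas default: ddof=1 (divide by n-1)
population_var = np.var(row.to_numpy())     # NumPy default: ddof=0 (divide by n)
n = row.count()

# Both are the same sum of squared deviations, divided by n-1 and n respectively
print(sample_var, population_var, (n - 1) * sample_var, n * population_var)
```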

In [12]:
# Calculating Cohen's d
df[prefix+'Cohens_d'] = df[prefix+'Enrich'] / df[prefix+'poolStDev']

In [13]:
#df[prefix+'poolStDev'].head()
df[prefix+'Cohens_d'] .head()

Ensembl_gene
ENSMUSG00000028180   -0.511002
ENSMUSG00000028182    0.449621
ENSMUSG00000028185    0.804721
ENSMUSG00000028184   -1.355934
ENSMUSG00000028187    0.375042
Name: MoEx_CerCx_Cohens_d, dtype: float64

In [14]:
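# Calculating J value (correction factor)
# Note: the denominator here is 4*(n1+n2-1), which gives J = 0.85 for n1 = n2 = 3;
# the formula above, with df = n1+n2-2, would give 1 - 3/(4*df - 1) = 0.80.

df[prefix+'J'] = 1-(3/(4*(Scount+SDcount-1)))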


In [15]:
# Calculating Hedges' g

df[prefix+'Hedges_g'] = df[prefix+'Cohens_d'] * df[prefix+'J']

In [16]:
#df[prefix+'J'].head()
df[prefix+'Hedges_g'] .head()

Ensembl_gene
ENSMUSG00000028180   -0.434352
ENSMUSG00000028182    0.382178
ENSMUSG00000028185    0.684013
ENSMUSG00000028184   -1.152544
ENSMUSG00000028187    0.318786
Name: MoEx_CerCx_Hedges_g, dtype: float64

In [17]:
# Calculating Var_d
Scount = df.filter(regex=s_filt).count(axis=1)
SDcount = df.filter(regex=sd_filt).count(axis=1)

Term1Top = Scount + SDcount                      # n1 + n2
Term1Bottom = Scount * SDcount                   # n1 * n2
Term2Top = np.square(df[prefix+'Cohens_d'])      # d^2
Term2Bottom = 2*(Scount + SDcount)               # 2(n1 + n2)

df[prefix+'Var_d'] = (Term1Top/Term1Bottom) + (Term2Top/Term2Bottom)

In [18]:
#check output
df[prefix+'Var_d'].head()

Ensembl_gene
ENSMUSG00000028180    0.688427
ENSMUSG00000028182    0.683513
ENSMUSG00000028185    0.720631
ENSMUSG00000028184    0.819880
ENSMUSG00000028187    0.678388
Name: MoEx_CerCx_Var_d, dtype: float64

In [19]:
df[prefix+'Var_g'] = df[prefix+'Var_d'] * np.square(df[prefix+'J'])

In [20]:
# Calculating SEg
df[prefix+'SEg'] = np.sqrt(df[prefix+'Var_g'])
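
The chain from d to SE_g can be spot-checked against the first row of the sorted table printed in the next cell (Otub2), which lists Cohen's d = 32.618793 and J = 0.85 with three samples per group. The sketch below recomputes the remaining columns from those two printed values; the variable names are illustrative.

```python
import math

# Values read from the Otub2 row of the notebook's sorted output
n1, n2 = 3, 3
cohens_d = 32.618793
j = 0.85

hedges_g = cohens_d * j
var_d = (n1 + n2) / (n1 * n2) + cohens_d ** 2 / (2 * (n1 + n2))
var_g = j ** 2 * var_d
se_g = math.sqrt(var_g)

print(hedges_g, var_d, var_g, se_g)
```

With only three samples per group, V_d is dominated by the d²/(2(n1+n2)) term whenever d is large, which is why the largest effect sizes in the table also carry the largest standard errors.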

In [21]:
df.sort_values(by= 'MoEx_CerCx_Hedges_g', ascending=False, inplace=True)
df

Ensembl_gene,Definition,Symbol,Transcript_cluster_ids,Constitutive_exons_used,Constitutive_IDs_used,Putative microRNA binding sites,Select Cellular Compartments,Select Protein Classes,Chromosome,Strand,...,ANOVA-adjp,largest fold,MoEx_CerCx_Enrich,MoEx_CerCx_poolStDev,MoEx_CerCx_Cohens_d,MoEx_CerCx_J,MoEx_CerCx_Hedges_g,MoEx_CerCx_Var_d,MoEx_CerCx_Var_g,MoEx_CerCx_SEg
ENSMUSG00000021203,"OTU domain, ubiquitin aldehyde binding 2 [Sour...",Otub2,6797538,ENSMUSE00000652580|ENSMUSE00000983636|ENSMUSE0...,5461555|4918835|5117223|5413249|4631348|518789...,"mmu-miR-103(miRanda), mmu-miR-107(miRanda), mm...",,protein_coding,chr12,+,...,0.066333,0.585614,0.585614,0.017953,32.618793,0.85,27.725974,89.332137,64.542469,8.033833
ENSMUSG00000031431,"TSC22 domain family, member 3 [Source:MGI Symb...",Tsc22d3,7019818,ENSMUSE00001216661|ENSMUSE00001240294,4553875,"mmu-miR-101a(RNAhybrid|miRanda), mmu-miR-101b(...",nucleus,protein_coding,chrX,-,...,0.066333,0.807870,0.807870,0.032995,24.484422,0.85,20.811759,50.623910,36.575775,6.047791
ENSMUSG00000034075,"zinc finger, DHHC domain containing 5 [Source:...",Zdhhc5,6888311,ENSMUSE00000235993|ENSMUSE00000236001|ENSMUSE0...,5405119|5189863|5147626|5329296|4528134|475815...,"mmu-miR-103(RNAhybrid|miRanda), mmu-miR-107(RN...",transmembrane,protein_coding,chr2,-,...,0.066333,0.158009,0.158009,0.008662,18.240873,0.85,15.504742,28.394122,20.514753,4.529321
ENSMUSG00000028341,"nuclear receptor subfamily 4, group A, member ...",Nr4a3,6913315,ENSMUSE00000178311|ENSMUSE00000727365|ENSMUSE0...,4325277|4548963|4394427|4816328|5474941|4427967,"mmu-let-7b(miRanda), mmu-let-7e(miRanda), mmu-...",,transcription regulator|receptor|protein_coding,chr4,+,...,0.066333,0.524709,0.524709,0.030304,17.314945,0.85,14.717704,25.650611,18.532567,4.304947
ENSMUSG00000022602,activity regulated cytoskeletal-associated pro...,Arc,6836691,ENSMUSE00000254617|ENSMUSE00000682358|ENSMUSE0...,4648212|5281674|4425285|5518611|5569733,"mmu-miR-106a(miRanda), mmu-miR-1192(TargetScan...",,protein_coding,chr15,-,...,0.066333,1.438506,1.438506,0.083503,17.226956,0.85,14.642912,25.397333,18.349573,4.283640
ENSMUSG00000034640,TCDD-inducible poly(ADP-ribose) polymerase [So...,Tiparp,6898076,ENSMUSE00000407721|ENSMUSE00000750918|ENSMUSE0...,4941316|4680057|4621347,"mmu-let-7c(RNAhybrid|miRanda), mmu-let-7d(RNAh...",,protein_coding,chr3,+,...,0.066333,0.761433,0.761433,0.045235,16.832893,0.85,14.307959,24.278859,17.541475,4.188254
ENSMUSG00000020387,PHD finger protein 15 [Source:MGI Symbol;Acc:M...,Phf15,6788141,ENSMUSE00000104069|ENSMUSE00000104079|ENSMUSE0...,5079503|4590303|4925300|5341221|5555943|536349...,"mmu-miR-103(pictar), mmu-miR-107(pictar), mmu-...",,protein_coding,chr11,-,...,0.070356,0.405747,0.405747,0.025318,16.025815,0.85,13.621943,22.068896,15.944777,3.993091
ENSMUSG00000028899,"TAF12 RNA polymerase II, TATA box binding prot...",Taf12,6917489,ENSMUSE00000333148|ENSMUSE00000668041|ENSMUSE0...,4954437|5156858|5097233|4619960|5389502,"mmu-let-7d*(RNAhybrid|miRanda), mmu-let-7e(RNA...",,transcription regulator|protein_coding,chr4,+,...,0.070356,0.349181,0.349181,0.022125,15.782263,0.85,13.414923,21.423318,15.478347,3.934253
ENSMUSG00000007617,homer homolog 1 (Drosophila) [Source:MGI Symbo...,Homer1,6808997,ENSMUSE00000373594|ENSMUSE00000611898|ENSMUSE0...,5379806|5265609|4981176|5454625|4906552|455219...,"mmu-let-7a(RNAhybrid|miRanda), mmu-let-7b(RNAh...",,protein_coding,chr13,+,...,0.070356,0.645561,0.645561,0.041415,15.587597,0.85,13.249457,20.914432,15.110677,3.887245
ENSMUSG00000048701,coiled-coil domain containing 6 [Source:MGI Sy...,Ccdc6,6768614,ENSMUSE00000343572|ENSMUSE00000732984|ENSMUSE0...,4685231|4487811|4876131|5264548,"mmu-miR-122(TargetScan), mmu-miR-122a(TargetSc...",,protein_coding,chr10,+,...,0.070356,0.238954,0.238954,0.015350,15.566614,0.85,13.231622,20.859956,15.071318,3.882180


In [22]:
df.columns

Index(['Definition', 'Symbol', 'Transcript_cluster_ids',
       'Constitutive_exons_used', 'Constitutive_IDs_used',
       'Putative microRNA binding sites', 'Select Cellular Compartments',
       'Select Protein Classes', 'Chromosome', 'Strand',
       'Genomic Gene Corrdinates', 'GO-Biological Process',
       'GO-Molecular Function', 'GO-Cellular Component', 'WikiPathways',
       'GSM828577_CerCx_S.CEL', 'GSM828578_CerCx_S.CEL',
       'GSM828579_CerCx_S.CEL', 'avg-CerCx_S', 'log_fold-CerCx_S_vs_CerCx_SD',
       'fold-CerCx_S_vs_CerCx_SD', 'rawp-CerCx_S_vs_CerCx_SD',
       'adjp-CerCx_S_vs_CerCx_SD', 'GSM828583_CerCx_SD.CEL',
       'GSM828584_CerCx_SD.CEL', 'GSM828585_CerCx_SD.CEL', 'avg-CerCx_SD',
       'ANOVA-rawp', 'ANOVA-adjp', 'largest fold', 'MoEx_CerCx_Enrich',
       'MoEx_CerCx_poolStDev', 'MoEx_CerCx_Cohens_d', 'MoEx_CerCx_J',
       'MoEx_CerCx_Hedges_g', 'MoEx_CerCx_Var_d', 'MoEx_CerCx_Var_g',
       'MoEx_CerCx_SEg'],
      dtype='object')

### Import key file from BioMart and index probesets to MGI gene symbols

In [23]:
dfX=pd.read_table('../FHS project/Sleep notebook Copy/BioMart_Ensmbl_index/mart_export72_MGIsymbol.txt',index_col=[0])
 
#dfX.pop('Affy mouse430 2 probeset') # remove 430V2 probeset info (not needed for 430AV2 indexing)
dfX.head(5)

Ensembl Gene ID,Description,MGI symbol
ENSMUSG00000039221,ribosomal protein L22 like 1 [Source:MGI Symbo...,Rpl22l1
ENSMUSG00000095611,predicted gene 10597 [Source:MGI Symbol;Acc:MG...,Gm10597
ENSMUSG00000061731,exostoses (multiple) 1 [Source:MGI Symbol;Acc:...,Ext1
ENSMUSG00000018599,"Smith-Magenis syndrome chromosome region, cand...",Smcr7
ENSMUSG00000094722,predicted gene 7792 [Source:MGI Symbol;Acc:MGI...,Gm7792


In [24]:
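df_Join = df.join(dfX, how='left', sort=True)   # left join on the Ensembl gene ID index
df_FINAL1 = df_Join.groupby('MGI symbol').mean(numeric_only=True)   # average numeric columns per MGI symbol
df_FINAL1[df_FINAL1.index.duplicated()]   # checking that no duplicate entries exist in the dataframe (empty output = none)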

MGI symbol,GSM828577_CerCx_S.CEL,GSM828578_CerCx_S.CEL,GSM828579_CerCx_S.CEL,avg-CerCx_S,log_fold-CerCx_S_vs_CerCx_SD,fold-CerCx_S_vs_CerCx_SD,rawp-CerCx_S_vs_CerCx_SD,adjp-CerCx_S_vs_CerCx_SD,GSM828583_CerCx_SD.CEL,GSM828584_CerCx_SD.CEL,...,ANOVA-adjp,largest fold,MoEx_CerCx_Enrich,MoEx_CerCx_poolStDev,MoEx_CerCx_Cohens_d,MoEx_CerCx_J,MoEx_CerCx_Hedges_g,MoEx_CerCx_Var_d,MoEx_CerCx_Var_g,MoEx_CerCx_SEg
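
The join-then-group step can be illustrated on a toy example. The sketch below uses made-up gene IDs, symbols and values; it assumes, as in the cell above, that both frames are indexed by Ensembl gene ID and that several IDs may map to one MGI symbol. Rows with no symbol in the key file get `NaN` after the left join and are dropped by `groupby`, and `numeric_only=True` excludes text columns from the mean.

```python
import pandas as pd

# Hypothetical effect-size table indexed by Ensembl gene ID
values = pd.DataFrame(
    {"Note": ["a", "b", "c"], "Hedges_g": [1.0, 3.0, -2.0]},
    index=pd.Index(["ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C"], name="Ensembl_gene"),
)

# Hypothetical key file: two IDs share one symbol, the third ID is absent
key = pd.DataFrame(
    {"MGI symbol": ["Gene1", "Gene1"]},
    index=pd.Index(["ENSMUSG_A", "ENSMUSG_B"], name="Ensembl Gene ID"),
)

joined = values.join(key, how="left", sort=True)                 # ENSMUSG_C gets NaN symbol
per_symbol = joined.groupby("MGI symbol").mean(numeric_only=True)

# An empty result here means every symbol appears exactly once in the index
duplicates = per_symbol[per_symbol.index.duplicated()]
print(per_symbol)
```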


### Columns from the list above can then be selected to produce files for later use. Examples are given below:
 #### df3 = average S and SD expression for the platform and the log-fold changes
 #### df4 = Hedges' g values and associated variance for meta-analysis (after indexing)

In [25]:
# df3 = df_FINAL1.loc[:,[u'avg-SD', u'avg-S', u'log_fold-S_vs_SD']]
# df3.columns =[prefix+'avg-SD', prefix+'avg-S', prefix+'log_fold-S_vs_SD']
# df3.to_csv('input_files/MoEx_CerCx_SymbolforIndexHedges.csv')

In [26]:
df4 = df_FINAL1.loc[:,[u'MoEx_CerCx_Enrich',u'MoEx_CerCx_Hedges_g', u'MoEx_CerCx_Var_g', u'MoEx_CerCx_SEg']]
df4.to_csv('../FHS project/Sleep notebook Copy/IPython_notebooks/input_files/MoEx_CerCx_SymbolforIndexHedges.csv')

In [27]:
df4.head(10)  # check final output

MGI symbol,MoEx_CerCx_Enrich,MoEx_CerCx_Hedges_g,MoEx_CerCx_Var_g,MoEx_CerCx_SEg
0610005C13Rik,0.228652,1.254299,0.612772,0.782798
0610007P14Rik,-0.038639,-0.752437,0.528847,0.727219
0610008F07Rik,0.3593,0.837987,0.540185,0.734973
0610009B14Rik,0.062553,0.344874,0.491578,0.701126
0610009B22Rik,-0.145335,-0.583642,0.510053,0.71418
0610009D07Rik,-0.045855,-0.398695,0.494913,0.703501
0610009E02Rik,-0.104917,-0.62792,0.514524,0.717303
0610009L18Rik,-0.013778,-0.071032,0.482087,0.694325
0610009O20Rik,-0.119663,-1.075786,0.57811,0.760335
0610010F05Rik,-0.207753,-3.698719,1.62171,1.273464
