# Calculations of the Effect Size (ES) for each microarray study 

### Using Hedges' g value, an adjusted Cohen's d value

$$  {Enrichment} = \bar{X_2}-\bar{X_1}$$

Let Group 1 be the Sleeping (S) Forebrain Non-Oligodendrocyte expression values and Group 2 be the 4-8 h sleep-deprived (SD) Forebrain Non-Oligodendrocyte expression values.

(SD mean - S mean) **(Values are log-transformed, so the difference corresponds to a log ratio)**

$$  {Pooled\ Standard\ Deviation} = \sqrt{\frac{(n_1-1)S_1^2 +(n_2-1)S_2^2}{(n_1 +n_2) -2}}  $$

$$  {Cohen's\ d\ value} = \frac{Enrichment}{Pooled\ Standard\ Deviation} $$

$$  {Correction\ Factor\ (J\ Factor)} = 1- \frac{3}{4df-1}, \quad df = n_1 + n_2 - 2 $$

$$  {Hedges'\ g\ value} = Cohen's\ d \times J $$

$$  {Variance\ in\ d\ (V_d)} = \frac{n_1 +n_2}{n_1 n_2} + \frac{d^2}{2(n_1 +n_2)}  $$

$$  {Variance\ in\ g\ (V_g)} = J^2 \times V_d  $$

$$  {Standard\ Error\ in\ g (SE_g)} = \sqrt{V_g}  $$
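
The chain of formulas above can be sketched for a single probe as follows. This is a minimal illustration, not the notebook's own code: the function name `hedges_g` and the returned dictionary keys are assumptions introduced here. The input values are the S and SD expression values for probe `1427138_at` shown later in this notebook. The sketch takes $df = n_1 + n_2 - 2$ in the correction factor, which differs from the denominator used in the notebook's `J` column, so the resulting g is slightly smaller than the value in the notebook's output.

```python
import numpy as np

def hedges_g(group1, group2):
    """Effect size of group2 relative to group1 (hypothetical helper)."""
    x1 = np.asarray(group1, dtype=float)
    x2 = np.asarray(group2, dtype=float)
    n1, n2 = x1.size, x2.size

    # Enrichment: difference of means (a log ratio for logged data)
    enrichment = x2.mean() - x1.mean()

    # Pooled standard deviation from the two sample variances (ddof=1)
    df = n1 + n2 - 2
    pooled_sd = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / df)

    d = enrichment / pooled_sd                            # Cohen's d
    j = 1 - 3 / (4 * df - 1)                              # correction factor, assuming df = n1 + n2 - 2
    g = d * j                                             # Hedges' g
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    var_g = j**2 * var_d
    return {"enrichment": enrichment, "d": d, "J": j, "g": g,
            "var_g": var_g, "se_g": np.sqrt(var_g)}

# Expression values for probe 1427138_at (S and SD groups) from this notebook
sleep = [7.38233, 7.62909, 6.60472, 6.93801, 6.71016, 7.13424]
sleep_dep = [7.75234, 7.39151, 7.3995, 7.57308, 7.46008, 7.14445]

result = hedges_g(sleep, sleep_dep)
print(result)
```

For this probe the enrichment is about 0.387 and Cohen's d about 1.234, in line with the values computed later in the notebook; g comes out near 1.14 under the $df = n_1 + n_2 - 2$ assumption.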

## Set up the working environment and import data

In [1]:
import pandas as pd # Dataframes and file IO
import numpy as np # numerical calculations
%cd /Users/Ella1/Desktop/data sets 430AV2


/Users/Ella1/Desktop/data sets 430AV2


In [2]:
prefix = '430AV2_ForeNonOlig_'   # define a prefix to add to column names (making indexing easier later)

In [3]:
# import the data file to a data frame 'df'
df=pd.read_table('DATASET-GSE48369.txt', delimiter='\t',  index_col=0) #,nrows=500)  
df.shape

(45101, 51)

In [4]:
# remove probes that are known to cross-hybridise to more than one target
df = df[~df.index.str.contains('_x_|_s_')]    # ~ inverts the mask, keeping probes that do NOT match
df.shape

(40569, 51)
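
As a small illustration of the inverted selector used above, the sketch below applies the same pattern to a toy index. The first two probe IDs appear in this notebook; the `_x_at` and `_s_at` IDs are hypothetical, made up only to show which suffixes are dropped.

```python
import pandas as pd

# Toy frame: two probe IDs from this notebook plus two hypothetical
# cross-hybridising IDs (the '_x_at' and '_s_at' names are invented)
toy = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=["1427138_at", "1425600_a_at", "1450000_x_at", "1450001_s_at"],
)

# Boolean mask: True where the probe ID contains '_x_' or '_s_'
cross_hyb = toy.index.str.contains("_x_|_s_")

# '~' flips the mask so only the non-matching probes are kept
kept = toy[~cross_hyb]
print(list(kept.index))
```

Note that `_a_at` probes are retained, since the pattern only targets the `_x_` and `_s_` infixes.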

## Look at column names and then set up filters for grouping columns into Non-Oligodendrocyte samples of S and SD groups
It is important that we pick up only one of the two tissue types investigated in this accession.

In [5]:
df.columns

Index(['Symbol', 'Definition', 'Ensembl_id', 'Entrez_id', 'Unigene_id',
       'GO-Process', 'GO-Function', 'GO-Component', 'Pathway_info',
       'Putative microRNA binding sites', 'Select Cellular Compartments',
       'Select Protein Classes', 'GSM1176618_ForeOlig_S.CEL',
       'GSM1176619_ForeOlig_S.CEL', 'GSM1176620_ForeOlig_S.CEL',
       'GSM1176621_ForeOlig_S.CEL', 'GSM1176622_ForeOlig_S.CEL',
       'GSM1176623_ForeOlig_S.CEL', 'avg-ForeOlig_S',
       'log_fold-ForeOlig_S_vs_ForeOlig_SD', 'fold-ForeOlig_S_vs_ForeOlig_SD',
       'rawp-ForeOlig_S_vs_ForeOlig_SD', 'adjp-ForeOlig_S_vs_ForeOlig_SD',
       'GSM1176630_ForeOlig_SD.CEL', 'GSM1176631_ForeOlig_SD.CEL',
       'GSM1176632_ForeOlig_SD.CEL', 'GSM1176633_ForeOlig_SD.CEL',
       'GSM1176634_ForeOlig_SD.CEL', 'GSM1176635_ForeOlig_SD.CEL',
       'avg-ForeOlig_SD', 'GSM1176636_ForeNonOlig_S.CEL',
       'GSM1176637_ForeNonOlig_S.CEL', 'GSM1176638_ForeNonOlig_S.CEL',
       'GSM1176639_ForeNonOlig_S.CEL', 'GSM1176640_ForeN

In [6]:
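# define regular expressions for the sleep (S) and sleep-deprivation (SD) column filters
# (the dot is escaped so that it matches a literal '.' in the column name)
s_filt = r'ForeNonOlig_S\.CEL'
sd_filt = r'ForeNonOlig_SD\.CEL'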
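
To see how such filters separate the groups, here is a minimal sketch on a toy frame whose column names are taken from the listing above. The patterns below escape the dot so that it matches a literal `.`; the variable names `toy`, `s_cols` and `sd_cols` are introduced only for this illustration.

```python
import pandas as pd

# Toy frame with one column of each kind seen in the column listing
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0]],
    columns=[
        "GSM1176618_ForeOlig_S.CEL",
        "GSM1176630_ForeOlig_SD.CEL",
        "GSM1176636_ForeNonOlig_S.CEL",
        "GSM1176648_ForeNonOlig_SD.CEL",
        "avg-ForeOlig_S",
    ],
)

# DataFrame.filter(regex=...) keeps columns where the pattern is found
# anywhere in the name
s_cols = toy.filter(regex=r"ForeNonOlig_S\.CEL").columns.tolist()
sd_cols = toy.filter(regex=r"ForeNonOlig_SD\.CEL").columns.tolist()

print(s_cols)
print(sd_cols)
```

Each pattern selects only its own Non-Oligodendrocyte group, and the Oligodendrocyte and summary columns are excluded by both.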

In [7]:
df_s=df.filter(regex= s_filt)
df_s.head()

Probesets,GSM1176636_ForeNonOlig_S.CEL,GSM1176637_ForeNonOlig_S.CEL,GSM1176638_ForeNonOlig_S.CEL,GSM1176639_ForeNonOlig_S.CEL,GSM1176640_ForeNonOlig_S.CEL,GSM1176641_ForeNonOlig_S.CEL
1427138_at,7.38233,7.62909,6.60472,6.93801,6.71016,7.13424
1425600_a_at,11.25625,11.43172,11.37437,11.339,11.4498,11.33535
1457168_at,5.08315,5.06547,5.09521,4.67219,5.19908,5.24162
1450135_at,10.20196,10.11574,10.02259,9.90802,9.89074,10.13718
1424014_at,10.2958,10.1829,10.57897,9.97481,10.30083,9.53992


In [8]:
df_sd=df.filter(regex= sd_filt)
df_sd.head()

Probesets,GSM1176648_ForeNonOlig_SD.CEL,GSM1176649_ForeNonOlig_SD.CEL,GSM1176650_ForeNonOlig_SD.CEL,GSM1176651_ForeNonOlig_SD.CEL,GSM1176652_ForeNonOlig_SD.CEL,GSM1176653_ForeNonOlig_SD.CEL
1427138_at,7.75234,7.39151,7.3995,7.57308,7.46008,7.14445
1425600_a_at,11.31269,11.24978,11.20397,11.35161,11.1929,11.18712
1457168_at,5.17386,4.98357,5.12343,5.10408,4.6996,4.98886
1450135_at,9.62592,10.15827,9.80858,9.78696,9.92484,10.11384
1424014_at,10.32778,10.75235,10.47619,10.60029,10.5254,10.46914


## Calculations 

In [9]:
# Enrichment

df[prefix+'Enrich'] = df.filter(regex=sd_filt).mean(axis=1) - df.filter(regex=s_filt).mean(axis=1)

In [10]:
df[prefix+'Enrich'].head()

Probesets
1427138_at      0.387068
1425600_a_at   -0.114737
1457168_at     -0.047220
1450135_at     -0.142970
1424014_at      0.379653
Name: 430AV2_ForeNonOlig_Enrich, dtype: float64

In [11]:
# Calculating Pooled StDev
Scount = df.filter(regex=s_filt).count(axis=1)
SDcount = df.filter(regex=sd_filt).count(axis=1)

# (n - 1) * sample variance = within-group sum of squares (despite the 'Stdev' names)
StdevS = (Scount-1) * df.filter(regex=s_filt).var(axis=1)
StdevSD = (SDcount-1) * df.filter(regex=sd_filt).var(axis=1)

df[prefix+'poolStDev'] = np.sqrt((StdevS+StdevSD)/(Scount+ SDcount-2))

In [12]:
# Calculating Cohen's d
df[prefix+'Cohens_d'] = df[prefix+'Enrich'] / df[prefix+'poolStDev']

In [13]:
#df[prefix+'poolStDev'].head()
df[prefix+'Cohens_d'] .head()

Probesets
1427138_at      1.234247
1425600_a_at   -1.644184
1457168_at     -0.252200
1450135_at     -0.838912
1424014_at      1.401680
Name: 430AV2_ForeNonOlig_Cohens_d, dtype: float64

In [14]:
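# Calculating J value (correction factor)
# Note: as written, this evaluates 1 - 3/(4*(n1+n2-1)), i.e. 0.931818 for n1 = n2 = 6; the outputs below use it.
# The formula stated at the top, 1 - 3/(4*df - 1) with df = n1+n2-2, would give 1 - 3/39 = 0.923077 instead.
df[prefix+'J'] = 1-(3/(4*(Scount+SDcount-1)))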


In [15]:
# Calculating Hedges' g

df[prefix+'Hedges_g'] = df[prefix+'Cohens_d'] * df[prefix+'J']

In [16]:
#df[prefix+'J'].head()
df[prefix+'Hedges_g'] .head()

Probesets
1427138_at      1.150094
1425600_a_at   -1.532081
1457168_at     -0.235004
1450135_at     -0.781714
1424014_at      1.306111
Name: 430AV2_ForeNonOlig_Hedges_g, dtype: float64

In [17]:
# Calculating Var_d
Scount = df.filter(regex=s_filt).count(axis=1)
SDcount = df.filter(regex=sd_filt).count(axis=1)

Ftop1 = Scount + SDcount
Ftop2 = Scount * SDcount
Fbottom1 = np.square(df[prefix+'Cohens_d']) 
Fbottom2 =  2*(Scount + SDcount)


df[prefix+'Var_d'] = (Ftop1/Ftop2) + (Fbottom1 /Fbottom2)

In [18]:
#check output
df[prefix+'Var_d'].head()

Probesets
1427138_at      0.396807
1425600_a_at    0.445973
1457168_at      0.335984
1450135_at      0.362657
1424014_at      0.415196
Name: 430AV2_ForeNonOlig_Var_d, dtype: float64
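
As a quick arithmetic check of the first row, the variance of d can be recomputed by hand from the Cohen's d value reported above for probe `1427138_at`, taking six samples per group (the six S and six SD columns shown earlier). The variable names are introduced only for this check.

```python
# Group sizes: six S and six SD Non-Oligodendrocyte samples
n_s, n_sd = 6, 6

# Cohen's d for probe 1427138_at, as reported in the output above
d = 1.234247

# V_d = (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))
var_d = (n_s + n_sd) / (n_s * n_sd) + d**2 / (2 * (n_s + n_sd))
print(round(var_d, 6))  # 0.396807
```

The first term (12/36) dominates for small effects; the second term grows with the square of d, which is why the strongly regulated probes further down have much larger variances.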

In [19]:
df[prefix+'Var_g'] = df[prefix+'Var_d'] * np.square(df[prefix+'J'])

In [20]:
# Calculating SEg
df[prefix+'SEg'] = np.sqrt(df[prefix+'Var_g'])

In [21]:
df.sort_values(by= '430AV2_ForeNonOlig_Hedges_g', ascending=False, inplace=True)
df

Probesets,Symbol,Definition,Ensembl_id,Entrez_id,Unigene_id,GO-Process,GO-Function,GO-Component,Pathway_info,Putative microRNA binding sites,...,ANOVA-adjp,largest fold,430AV2_ForeNonOlig_Enrich,430AV2_ForeNonOlig_poolStDev,430AV2_ForeNonOlig_Cohens_d,430AV2_ForeNonOlig_J,430AV2_ForeNonOlig_Hedges_g,430AV2_ForeNonOlig_Var_d,430AV2_ForeNonOlig_Var_g,430AV2_ForeNonOlig_SEg
1436387_at,C330006P03Rik,homer homolog 1 (Drosophila) [Source:MGI Symbo...,ENSMUSG00000007617,320588,,,,,,"mmu-let-7a(RNAhybrid|miRanda), mmu-let-7b(RNAh...",...,2.330827e-14,4.371350,2.162957,0.209812,10.309030,0.931818,9.606142,4.761504,4.134344,2.033309
1427682_a_at,Egr2,early growth response 2,ENSMUSG00000037868,13654,,brain segmentation // myelination // facial ne...,nucleic acid binding // transcription factor a...,intracellular // nucleus,Adipogenesis genes:WP447(WikiPathways) // Whit...,"mmu-let-7g(RNAhybrid|miRanda), mmu-miR-100(RNA...",...,4.685071e-10,4.327847,3.853687,0.484893,7.947507,0.931818,7.405631,2.965119,2.574569,1.604546
1418687_at,Arc,activity regulated cytoskeletal-associated pro...,ENSMUSG00000022602,11838,,anterior/posterior pattern formation // multic...,actin binding,postsynaptic membrane // endosome // acrosome ...,Serotonin and anxiety-related events:WP2140(Wi...,"mmu-miR-106a(miRanda), mmu-miR-1192(TargetScan...",...,1.130078e-10,2.440912,2.440912,0.309803,7.878904,0.931818,7.341706,2.919880,2.535289,1.592259
1416505_at,Nr4a1,"nuclear receptor subfamily 4, group A, member 1",ENSMUSG00000023034,15370,,"regulation of transcription, DNA-dependent // ...",transcription factor activity // protein homod...,nucleus,GenMAPP-Nuclear_Receptors // Spinal Cord Injur...,"mmu-let-7a(miRanda), mmu-let-7b(miRanda), mmu-...",...,1.110060e-10,2.449920,1.924565,0.271010,7.101457,0.931818,6.617267,2.434612,2.113938,1.453939
1436329_at,Egr3,early growth response 3,,13655,Mm.103737,transcription // peripheral nervous system dev...,nucleic acid binding // DNA binding // metal i...,intracellular // nucleus,,,...,1.196981e-12,3.137160,1.096007,0.158334,6.922131,0.931818,6.450167,2.329829,2.022956,1.422307
1421768_a_at,Homer1,homer homolog 1 (Drosophila),ENSMUSG00000007617,26556,,metabotropic glutamate receptor signaling path...,protein binding,postsynaptic membrane // synapse // cytoplasm ...,,"mmu-let-7a(RNAhybrid|miRanda), mmu-let-7b(RNAh...",...,1.622863e-12,2.669662,0.965737,0.142169,6.792858,0.931818,6.329709,2.255955,1.958812,1.399576
1421679_a_at,Cdkn1a,cyclin-dependent kinase inhibitor 1A (P21),ENSMUSG00000023067,12575,,cellular response to extracellular stimulus //...,cyclin binding // zinc ion binding // kinase a...,cyclin-dependent protein kinase holoenzyme com...,GenMAPP-Cell_Cycle_KEGG // PluriNetWork:WP1763...,"mmu-miR-105(TargetScan), mmu-miR-106a(TargetSc...",...,2.099051e-07,4.322870,1.862638,0.277925,6.701939,0.931818,6.244989,2.204833,1.914424,1.383627
1425671_at,Homer1,homer homolog 1 (Drosophila),ENSMUSG00000007617,26556,,metabotropic glutamate receptor signaling path...,protein binding,postsynaptic membrane // synapse // cytoplasm ...,,"mmu-let-7a(RNAhybrid|miRanda), mmu-let-7b(RNAh...",...,7.923588e-13,4.099978,2.139545,0.344246,6.215157,0.931818,5.791396,1.942841,1.686940,1.298822
1423100_at,Fos,FBJ osteosarcoma oncogene,ENSMUSG00000021250,14281,,cellular response to extracellular stimulus //...,transcription factor activity // DNA binding /...,transcription factor complex // nucleus,Insulin Signaling:WP65(WikiPathways) // IL-6 s...,"mmu-let-7b(RNAhybrid|miRanda), mmu-let-7e(RNAh...",...,1.385606e-09,3.951955,2.623582,0.434777,6.034318,0.931818,5.622888,1.850542,1.606798,1.267595
1437247_at,Fosl2 /// LOC634417,fos-like antigen 2 [Source:MGI Symbol;Acc:MGI:...,ENSMUSG00000029135,14284|634417,,"regulation of transcription, DNA-dependent // ...",transcription factor activity // DNA binding /...,nucleus,,"mmu-miR-101a(miRanda), mmu-miR-101b(miRanda), ...",...,2.570664e-12,3.438748,1.280650,0.213971,5.985150,0.931818,5.577072,1.825918,1.585417,1.259134


In [22]:
df.columns

Index(['Symbol', 'Definition', 'Ensembl_id', 'Entrez_id', 'Unigene_id',
       'GO-Process', 'GO-Function', 'GO-Component', 'Pathway_info',
       'Putative microRNA binding sites', 'Select Cellular Compartments',
       'Select Protein Classes', 'GSM1176618_ForeOlig_S.CEL',
       'GSM1176619_ForeOlig_S.CEL', 'GSM1176620_ForeOlig_S.CEL',
       'GSM1176621_ForeOlig_S.CEL', 'GSM1176622_ForeOlig_S.CEL',
       'GSM1176623_ForeOlig_S.CEL', 'avg-ForeOlig_S',
       'log_fold-ForeOlig_S_vs_ForeOlig_SD', 'fold-ForeOlig_S_vs_ForeOlig_SD',
       'rawp-ForeOlig_S_vs_ForeOlig_SD', 'adjp-ForeOlig_S_vs_ForeOlig_SD',
       'GSM1176630_ForeOlig_SD.CEL', 'GSM1176631_ForeOlig_SD.CEL',
       'GSM1176632_ForeOlig_SD.CEL', 'GSM1176633_ForeOlig_SD.CEL',
       'GSM1176634_ForeOlig_SD.CEL', 'GSM1176635_ForeOlig_SD.CEL',
       'avg-ForeOlig_SD', 'GSM1176636_ForeNonOlig_S.CEL',
       'GSM1176637_ForeNonOlig_S.CEL', 'GSM1176638_ForeNonOlig_S.CEL',
       'GSM1176639_ForeNonOlig_S.CEL', 'GSM1176640_ForeN

### Import key file from BioMart and index probesets to MGI gene symbols

In [23]:
dfX=pd.read_table('../FHS project/Sleep notebook Copy/BioMart_Ensmbl_index/mart_export72_430v2430Av2.txt',index_col=[3])
 
dfX.pop('Affy mouse430 2 probeset') # remove 430V2 probeset info (not needed for 430AV2 indexing)
dfX.head(5)

Affy mouse430a 2 probeset,Ensembl Gene ID,Description,MGI symbol
1417126_a_at,ENSMUSG00000039221,ribosomal protein L22 like 1 [Source:MGI Symbo...,Rpl22l1
,ENSMUSG00000095611,predicted gene 10597 [Source:MGI Symbol;Acc:MG...,Gm10597
1417730_at,ENSMUSG00000061731,exostoses (multiple) 1 [Source:MGI Symbol;Acc:...,Ext1
1417730_at,ENSMUSG00000061731,exostoses (multiple) 1 [Source:MGI Symbol;Acc:...,Ext1
,ENSMUSG00000061731,exostoses (multiple) 1 [Source:MGI Symbol;Acc:...,Ext1


In [24]:
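df_Join = df.join(dfX, how='left', sort=True)   # attach MGI symbols to each probeset via the index
# average the numeric columns across all probesets that map to the same MGI symbol
df_FINAL1 = df_Join.groupby('MGI symbol').mean(numeric_only=True)
df_FINAL1[df_FINAL1.index.duplicated()]   # checking that no duplicate entries exist (an empty result means none)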

MGI symbol,GSM1176618_ForeOlig_S.CEL,GSM1176619_ForeOlig_S.CEL,GSM1176620_ForeOlig_S.CEL,GSM1176621_ForeOlig_S.CEL,GSM1176622_ForeOlig_S.CEL,GSM1176623_ForeOlig_S.CEL,avg-ForeOlig_S,log_fold-ForeOlig_S_vs_ForeOlig_SD,fold-ForeOlig_S_vs_ForeOlig_SD,rawp-ForeOlig_S_vs_ForeOlig_SD,...,ANOVA-adjp,largest fold,430AV2_ForeNonOlig_Enrich,430AV2_ForeNonOlig_poolStDev,430AV2_ForeNonOlig_Cohens_d,430AV2_ForeNonOlig_J,430AV2_ForeNonOlig_Hedges_g,430AV2_ForeNonOlig_Var_d,430AV2_ForeNonOlig_Var_g,430AV2_ForeNonOlig_SEg
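
The join-then-average step can be illustrated on a toy example. The sketch below uses three probesets and their Hedges' g values from the sorted table above (two `Homer1` probes and one `Fos` probe); the small `key` frame stands in for the BioMart export and is an assumption made for illustration, as are the variable names.

```python
import pandas as pd

# Hedges' g for three probesets, taken from the sorted table above
effects = pd.DataFrame(
    {"Hedges_g": [6.329709, 5.791396, 5.622888]},
    index=pd.Index(["1421768_a_at", "1425671_at", "1423100_at"], name="Probesets"),
)

# Stand-in for the BioMart key file: probeset -> MGI symbol (toy mapping)
key = pd.DataFrame(
    {"MGI symbol": ["Homer1", "Homer1", "Fos"]},
    index=["1421768_a_at", "1425671_at", "1423100_at"],
)

# Left join on the probeset index, then average probes sharing a symbol
joined = effects.join(key, how="left")
per_gene = joined.groupby("MGI symbol").mean(numeric_only=True)

print(per_gene)
print(per_gene.index.duplicated().any())
```

The two `Homer1` probes collapse to a single row holding their mean, so each symbol appears once in the result. In the notebook the same plain mean is also applied to the variance and standard-error columns.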


### Columns from the list above can then be selected to produce files for later use. Examples are given below:
#### df3 = average S and SD expression for the platform and the log-fold changes
#### df4 = Hedges' g values and associated variance for meta-analysis (after indexing)

In [25]:
# df3 = df_FINAL1.loc[:,[u'avg-SD', u'avg-S', u'log_fold-S_vs_SD']]
# df3.columns =[prefix+'avg-SD', prefix+'avg-S', prefix+'log_fold-S_vs_SD']
# df3.to_csv('input_files/430AV2_SymbolExpression_forIndex.csv')

In [26]:
df4 = df_FINAL1.loc[:,[u'430AV2_ForeNonOlig_Enrich',u'430AV2_ForeNonOlig_Hedges_g', u'430AV2_ForeNonOlig_Var_g', u'430AV2_ForeNonOlig_SEg']]
df4.to_csv('../FHS project/Sleep notebook Copy/IPython_notebooks/input_files/430AV2_ForeNonOlig_SymbolforIndexHedges.csv')

In [27]:
df4.head(10)  # check final output

MGI symbol,430AV2_ForeNonOlig_Enrich,430AV2_ForeNonOlig_Hedges_g,430AV2_ForeNonOlig_Var_g,430AV2_ForeNonOlig_SEg
0610005C13Rik,-0.094337,-0.410458,0.296448,0.544471
0610008F07Rik,-0.022675,-0.106744,0.289903,0.538427
0610009B22Rik,0.086633,0.617748,0.305329,0.552566
0610009D07Rik,-0.105899,-0.603806,0.305756,0.552901
0610009O20Rik,-0.31942,-0.801172,0.316173,0.562293
0610010K14Rik,-0.085128,-0.153827,0.290414,0.538901
0610012G03Rik,0.351547,0.80836,0.32287,0.567607
0610031J06Rik,-0.045908,-0.209692,0.29126,0.539686
0610037L13Rik,-0.173502,-0.585347,0.303705,0.551094
0610040J01Rik,0.347025,0.632919,0.306119,0.553281
