# Perform GSEA using GSEApy

This notebook follows the protocol described here: https://gseapy.readthedocs.io/en/latest/gseapy_tutorial.html#use-gsea-command-or-gsea


In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina' # mac
import pandas as pd
import gseapy as gp
import numpy as np
import matplotlib.pyplot as plt

In [2]:
gp.__version__

'0.9.9'

## 1. Prepare prerank file

* Example file: https://github.com/zqfang/GSEApy/blob/master/tests/data/temp.rnk



In [3]:
gene_exp_alpha = pd.read_csv("../dat/figdata/fig2_prom_ttest_res.csv",index_col=1)
gene_exp_alpha.head()

Gene,gene_tr.idx,tr.idx,odds,padj.Bonferroni,celltype
SAMD11,SAMD11_7,7,0.540894,1.0,alpha
SAMD11,SAMD11_13,13,0.69316,1.0,alpha
SAMD11,SAMD11_14,14,0.479521,1.0,alpha
NOC2L,NOC2L_19,19,0.872666,1.0,alpha
KLHL17,KLHL17_23,23,0.827881,1.0,alpha


In [59]:
df = gene_exp_alpha.loc[gene_exp_alpha["celltype"]=="beta"].sort_values(by='odds', ascending=False)["odds"]
df=np.log2(df)
df.head()

Gene
PI4KA    inf
NELL1    inf
PRKCH    inf
KLKB1    inf
UXS1     inf
Name: odds, dtype: float64

In [119]:
rnk.shape

(21825, 2)

In [114]:
df = gene_exp_alpha.loc[gene_exp_alpha["celltype"]=="beta"].sort_values(by='odds', ascending=False)["odds"]
df=np.log2(df)

# Alternative (unused): replace +/-inf with the largest/smallest finite value
#df=df.replace(np.inf,df[~df.isin([np.inf])].max(0))
#df=df.replace(-np.inf,df[~df.isin([-np.inf])].min(0))
df.to_csv('../dat/figdata/res.genes.b.rnk',sep='\t')
rnk = pd.read_table("../dat/figdata/res.genes.b.rnk", header=None)

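# Replace +inf with the largest finite score, scaled by a small random factor to break ties
for i in rnk.index[(rnk[1]==np.inf).tolist()].tolist():
    rnk.iloc[i,1]= df[~df.isin([np.inf])].max(0)*(1+np.random.uniform()/100)

# Replace -inf with the smallest finite score, scaled the same way
for i in rnk.index[(rnk[1]==-np.inf).tolist()].tolist():
    rnk.iloc[i,1]= df[~df.isin([-np.inf])].min(0)*(1+np.random.uniform()/100)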

#rnk.set_index(0)
rnk.head()

Unnamed: 0,0,1
0,PI4KA,5.001727
1,NELL1,5.039825
2,PRKCH,5.046754
3,KLKB1,5.020901
4,UXS1,5.017432
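
The loops above replace each infinite log2 odds value with the largest (or smallest) finite value, scaled by a small random factor so that the replaced genes do not all tie. A minimal vectorized sketch of the same idea is shown here. The helper name `cap_infinite`, its `jitter` and `seed` parameters, and the toy series are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd


def cap_infinite(scores, jitter=0.01, seed=0):
    """Replace +/-inf with the finite max/min, scaled by a small random factor.

    Hypothetical helper: `jitter` and `seed` are illustrative parameters.
    """
    rng = np.random.default_rng(seed)
    values = scores.to_numpy(dtype=float).copy()
    finite = values[np.isfinite(values)]
    pos = np.isposinf(values)
    neg = np.isneginf(values)
    # Each infinite entry gets its own factor in [1, 1 + jitter) to break ties.
    values[pos] = finite.max() * (1 + rng.uniform(0, jitter, pos.sum()))
    values[neg] = finite.min() * (1 + rng.uniform(0, jitter, neg.sum()))
    return pd.Series(values, index=scores.index, name=scores.name)


# Toy log2 odds, not taken from the real data.
example = pd.Series([np.inf, 2.0, -1.5, -np.inf, np.inf],
                    index=["A", "B", "C", "D", "E"], name="odds")
capped = cap_infinite(example)
```

As in the loops, the scaling only pushes values outward when the finite maximum is positive and the finite minimum is negative; a ranking where that does not hold would need an additive offset instead.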


## 2. Define gene sets

In our case, we use the beta-cell gene sets from three published studies, saved in [gmt](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29) format.

This was done in another [notebook](./compare_with_previous_glists.ipynb).
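
For reference, a GMT file has one gene set per line: the set name, a description, and then the member genes, all separated by tabs. The sketch below writes and reads such a file. The helpers `write_gmt` and `read_gmt`, the toy set names, and the placeholder gene names are assumptions for illustration; the actual gene sets were built in the other notebook.

```python
import os
import tempfile


def write_gmt(gene_sets, path, description="NA"):
    """Write {set name: list of genes} to a GMT file (hypothetical helper)."""
    with open(path, "w") as handle:
        for name, genes in gene_sets.items():
            # GMT row: set name, description, then member genes, tab-separated.
            handle.write("\t".join([name, description, *genes]) + "\n")


def read_gmt(path):
    """Read a GMT file back into {set name: list of genes}."""
    gene_sets = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed rows
            gene_sets[fields[0]] = fields[2:]
    return gene_sets


# Toy gene sets with placeholder gene names.
toy_sets = {"toy_set_a": ["GENE1", "GENE2", "GENE3"],
            "toy_set_b": ["GENE2", "GENE4"]}
gmt_path = os.path.join(tempfile.mkdtemp(), "toy_sets.gmt")
write_gmt(toy_sets, gmt_path)
round_trip = read_gmt(gmt_path)
```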

## 3. Run GSEA

```python
gseapy.prerank(rnk='gsea_data.rnk', gene_sets='gene_sets.gmt', outdir='gseapy_out', min_size=15,
               max_size=1000, permutation_num=1000, weighted_score_type=1, ascending=False,
               figsize=(6.5,6), format='png')
```

In [120]:
# run gsea
# enrichr libraries are supported by gsea module. Just provide the name

gs_res = gp.prerank(rnk=rnk, # or data='./P53_resampling_data.txt'
                 gene_sets='../dat/glists/gsea.gmt', # enrichr library names or gmt file
                 #set permutation_type to phenotype if samples >=15
                 permutation_num=1000, # reduce number to speed up test
                 outdir=None,  # do not write output to disk
                 no_plot=True, # Skip plotting
                 #weighted_score_type=1,
                 #ascending=False,
                 min_size=10,
                 max_size=600,
                 processes=4)
                 #format='png')
gs_res.res2d

2019-02-19 11:33:59,495 Input gene rankings contains duplicated IDs, Only use the duplicated ID with highest value!


Term,es,nes,pval,fdr,geneset_size,matched_size,genes,ledge_genes
Beta sub4_xin,-0.340284,-1.74007,0.0,0.010599,390,385,PSMF1;WDR45B;HSPA9;ANXA2;ATP6V0D1;KRT8;ATP6V1H...,ARID5B;EIF2S2;ZFAND2A;ARPP19;XBP1;TMEM258;U2AF...
ST8SIA1+_dorrell,0.562926,1.639242,0.002439,0.047165,46,46,C1orf127;CORO2B;TMEM130;MYOM1;TP53I11;SULF2;SH...,C1orf127;CORO2B;TMEM130;MYOM1;TP53I11;SULF2;SH...
CD9-_dorrell,0.507622,1.515943,0.016018,0.061458,57,56,SPARCL1;CCDC81;ASB9;KCNMB2;PTPRD;SEMA6A;ITM2A;...,SPARCL1;CCDC81;ASB9;KCNMB2;PTPRD;SEMA6A;ITM2A;...
Beta sub2_xin,0.569174,1.526186,0.031888,0.084147,28,27,PCP4;PEMT;AP3B1;ID1;STMN2;RBP1;TFF3;GNAS;IAPP;...,PCP4;PEMT;AP3B1;ID1;STMN2;RBP1;TFF3;GNAS;IAPP;...
ST8SIA1-_dorrell,0.442361,1.376958,0.05618,0.118771,65,65,GPD1L;SLC39A11;TCEA3;RYK;AFAP1;FRMD4B;G6PC2;RR...,GPD1L;SLC39A11;TCEA3;RYK;AFAP1;FRMD4B;G6PC2;RR...
Beta sub3_xin,0.643224,1.385023,0.10574,0.139888,13,10,INS;ASB9;DLK1;CHGA;LAMP1;IGFBP7;CPE;CKB;TIMP1;...,INS;ASB9;DLK1;CHGA;LAMP1;IGFBP7
mature_bader,0.42909,1.268595,0.160287,0.195272,72,50,KIF6;NEB;TNS1;SYNGAP1;CX3CR1;NOSTRIN;CAV1;NPFF...,KIF6;NEB;TNS1;SYNGAP1;CX3CR1;NOSTRIN;CAV1;NPFF...
CD9+_dorrell,0.345443,1.003738,0.466667,0.612996,45,42,TSPAN33;KCNA5;COL6A2;GNAL;SEL1L3;AKAP13;TACSTD...,TSPAN33;KCNA5;COL6A2;GNAL;SEL1L3;AKAP13;TACSTD...
Beta sub1_xin,0.359062,0.839491,0.676554,0.843348,13,13,PRSS23;FXYD2;SCGB2A1;FFAR4;RBP4;PPP1R1A;SCGN;A...,PRSS23;FXYD2;SCGB2A1;FFAR4;RBP4
immature_bader,0.18559,0.637977,0.998,0.948309,579,565,NELL1;TMEM63A;PARVA;PGM1;ATP2A3;APBB3;C1orf127...,NELL1;TMEM63A;PARVA;PGM1;ATP2A3;APBB3;C1orf127...
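
The duplicated-ID warning in the log appears because the input table has several rows per gene (SAMD11 appears three times in the head shown earlier), so the ranking contains repeated gene names. A minimal sketch of collapsing the ranking to one score per gene before calling `prerank`, keeping the highest value as the warning describes, is given here. The column names `gene` and `score` and the toy values are assumptions; the `rnk` frame in this notebook uses the integer columns `0` and `1`.

```python
import pandas as pd

# Toy ranking with repeated gene names (values are illustrative only).
toy_rnk = pd.DataFrame({
    "gene": ["INS", "INS", "IAPP", "GNAS", "GNAS"],
    "score": [2.5, 1.0, 0.7, -0.3, -1.2],
})

# Sort by score, then keep the first (highest-scoring) row for each gene.
dedup = (toy_rnk.sort_values("score", ascending=False)
                .drop_duplicates(subset="gene", keep="first")
                .reset_index(drop=True))
```

Deduplicating explicitly makes the choice of representative row visible in the notebook instead of leaving it to the library default.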


In [123]:
gs_res.res2d.to_csv('../dat/figdata/GSEA_beta_res.csv')
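
To pick out the enriched gene sets from a result table like the one above, one option is to filter on the `fdr` column and sort by `nes`. The sketch below uses a toy frame with a few rows copied from the output above. The cutoff of 0.05 is an illustrative assumption, not a threshold stated in this analysis.

```python
import pandas as pd

# A few rows copied from the result table above (nes and fdr columns only).
toy_res = pd.DataFrame(
    {"nes": [-1.74007, 1.639242, 1.515943, 0.637977],
     "fdr": [0.010599, 0.047165, 0.061458, 0.948309]},
    index=pd.Index(["Beta sub4_xin", "ST8SIA1+_dorrell",
                    "CD9-_dorrell", "immature_bader"], name="Term"),
)

fdr_cutoff = 0.05  # assumed threshold, adjust as needed
significant = (toy_res[toy_res["fdr"] < fdr_cutoff]
               .sort_values("nes", ascending=False))
```

The same two lines should apply to `gs_res.res2d` directly, since it carries the `fdr` and `nes` columns shown in the output.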

### Show the results
The **gsea** module generates a heatmap for the genes in each gene set in the background.
If you need to plot the results yourself, use the code below.

In [122]:
from gseapy.plot import gseaplot, heatmap
terms = gs_res.res2d.index
# Save one enrichment plot per gene set as '<term>_b.pdf' in the working directory
for term in terms:
    gseaplot(gs_res.ranking, term=term, **gs_res.results[term], ofname=term+'_b.pdf')
