# Perform GSEA using GSEApy

This notebook follows the protocol described in the GSEApy tutorial: https://gseapy.readthedocs.io/en/latest/gseapy_tutorial.html#use-gsea-command-or-gsea


In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina' # mac
import numpy as np  # needed below for np.inf
import pandas as pd
import gseapy as gp
import matplotlib.pyplot as plt

In [2]:
gp.__version__

'0.9.9'

## 1. Prepare prerank file

* Example rank file: https://github.com/zqfang/GSEApy/blob/master/tests/data/temp.rnk



In [27]:
gene_exp_alpha = pd.read_csv("./dat/1901/res.genes_level.a.csv",index_col=0)
gene_exp_alpha.head()

Unnamed: 0,pval,odds,type1_frac,type2_frac,FDR,padj.Bonferroni,FDR.BY
SAMD11,2e-06,0.53273,0.03985,0.072289,1.3e-05,0.030243,0.000137
NOC2L,0.193454,0.872666,0.040319,0.045934,0.245128,1.0,1.0
KLHL17,0.249872,0.827881,0.013127,0.015813,0.299613,1.0,1.0
PLEKHN1,0.000105,0.43971,0.013127,0.029367,0.000419,1.0,0.004283
C1orf170,0.291157,1.227541,0.01383,0.011295,0.33806,1.0,1.0


In [46]:
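# Rank genes by odds ratio, highest first
df = gene_exp_alpha.sort_values(by='odds', ascending=False)["odds"]

# Replace infinite odds ratios with the largest finite value, then write a
# two-column, tab-separated rank file (header=False keeps the file headerless)
finite_max = df[~df.isin([np.inf])].max()
df.replace(np.inf, finite_max).to_csv('./dat/1901/res.genes.a.rnk', sep='\t', header=False)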
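
The capping step can be checked on a toy series before running it on the real table. This is a minimal sketch: the gene names and values are made up, and `cap_infinite` is a hypothetical helper, not part of pandas or GSEApy. Unlike the cell above, it also caps negative infinity, which is an added assumption.

```python
import numpy as np
import pandas as pd

def cap_infinite(scores: pd.Series) -> pd.Series:
    """Replace +inf / -inf with the largest / smallest finite value in the series."""
    finite = scores[np.isfinite(scores)]
    return scores.replace(np.inf, finite.max()).replace(-np.inf, finite.min())

# Made-up odds ratios; GENE_A stands in for a gene with an infinite odds ratio
toy = pd.Series({"GENE_A": np.inf, "GENE_B": 2.5, "GENE_C": 0.4}, name="odds")
capped = cap_infinite(toy).sort_values(ascending=False)
print(capped)
```

After capping, every score is finite, so the ranking can be written to a `.rnk` file without `inf` entries.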


## 2. Define gene sets

In our case, we use the beta gene sets from three publications, saved in [GMT](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29) format.

This was done in a separate [notebook](./compare_with_previous_glists.ipynb).
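
For reference, a GMT file holds one gene set per line: the set name, a description, then the member genes, all tab-separated. The sketch below shows that layout with hypothetical set names and an arbitrary description placeholder; `to_gmt_lines` is an illustrative helper, not the code used in the other notebook.

```python
def to_gmt_lines(gene_sets, description="NA"):
    """Format {set name: [genes]} as GMT lines: name, description, genes (tab-separated)."""
    lines = []
    for name, genes in gene_sets.items():
        lines.append("\t".join([name, description, *genes]))
    return lines

# Hypothetical toy gene sets, for illustration only
toy_sets = {
    "toy_set_1": ["INS", "IAPP", "CHGA"],
    "toy_set_2": ["TFF3", "NPY"],
}
gmt_text = "\n".join(to_gmt_lines(toy_sets)) + "\n"
print(gmt_text)
```

Writing `gmt_text` to a file with a `.gmt` extension gives a file that can be passed as `gene_sets`.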

## 3. Run GSEA

```python
gseapy.prerank(rnk='gsea_data.rnk', gene_sets='gene_sets.gmt', outdir='gseapy_out', min_size=15,
               max_size=1000, permutation_num=1000, weighted_score_type=1, ascending=False,
               figsize=(6.5,6), format='png')
```

In [42]:
# Run preranked GSEA
# Enrichr library names are also accepted for gene_sets; just pass the name

gs_res = gp.prerank(rnk='./dat/1901/res.genes.a.rnk', # or data='./P53_resampling_data.txt'
                 gene_sets='./dat/glists/gsea.gmt', # enrichr library names or gmt file
                 #set permutation_type to phenotype if samples >=15
                 permutation_num=1000, # reduce number to speed up test
                 outdir=None,  # do not write output to disk
                 no_plot=True, # Skip plotting
                 weighted_score_type=1,
                 ascending=False,
                 min_size=0,
                 max_size=4000,
                 processes=4,
                 format='png')
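
To make the `weighted_score_type` argument more concrete, here is a simplified sketch of the running-sum enrichment score from the original GSEA method, assuming `weighted_score_type` plays the role of the weight exponent. It is not GSEApy's internal implementation; `enrichment_score` and the toy genes are hypothetical, and the function assumes the ranked list contains at least one set member and one non-member.

```python
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Maximum deviation from zero of the weighted running sum over a ranked gene list."""
    ranked_genes = np.asarray(ranked_genes)
    scores = np.asarray(scores, dtype=float)
    in_set = np.isin(ranked_genes, list(gene_set))
    # Hits step up in proportion to |score|**weight; misses step down uniformly
    hit_weights = np.abs(scores) ** weight * in_set
    p_hit = np.cumsum(hit_weights) / hit_weights.sum()
    p_miss = np.cumsum(~in_set) / (~in_set).sum()
    running = p_hit - p_miss
    return running[np.argmax(np.abs(running))]

# Toy ranking, already sorted from highest to lowest score
genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es = enrichment_score(genes, scores, {"G1", "G2"})
print(es)
```

With `weight=0` every hit counts equally, which corresponds to the classic unweighted statistic; larger weights emphasise genes at the extremes of the ranking.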

In [43]:
# Access the results DataFrame through the res2d attribute
gs_res.res2d

Term,es,nes,pval,fdr,geneset_size,matched_size,genes,ledge_genes
A1_alpha,0.793763,3.526346,0.0,0.0,240,240,RAC2;RNF17;GPR37L1;PGAM2;GALNT15;HTR1F;MS4A8;O...,RAC2;RNF17;GPR37L1;PGAM2;GALNT15;HTR1F;MS4A8;O...
B1_beta,0.672698,2.728089,0.0,0.0,75,75,MS4A8;PDE6C;IGFBP1;FREM3;G6PC2;GRIK1;SLC25A34;...,MS4A8;PDE6C;IGFBP1;FREM3;G6PC2;GRIK1;SLC25A34;...
CD9-_dorrell,0.375997,1.468218,0.157042,0.252549,57,55,IGSF1;CCDC81;LRRTM4;ADAMTS5;NAV3;SEMA6A;NPY;GR...,IGSF1;CCDC81;LRRTM4;ADAMTS5;NAV3;SEMA6A;NPY;GR...
ST8SIA1+_dorrell,0.363038,1.38938,0.155419,0.2532741,46,45,TFF3;MYOM1;TP53I11;KIRREL3;ATP8B1;TMEM130;ST8S...,TFF3;MYOM1;TP53I11;KIRREL3;ATP8B1;TMEM130;ST8S...
Beta sub3_xin,0.451421,1.238652,0.217005,0.3504255,13,9,CHGA;ASB9;IGFBP7;CPE;INS;TIMP1;LAMP1;CHGB;CKB,CHGA;ASB9;IGFBP7;CPE;INS;TIMP1
Beta sub1_xin,0.525961,1.590612,0.070358,0.4666667,13,13,RBP4;SCGB2A1;PPP1R1A;FFAR4;SCGN;PRSS23;FXYD2;T...,RBP4;SCGB2A1;PPP1R1A;FFAR4;SCGN
mature_bader,0.27047,1.04651,0.354675,0.5584577,72,51,TNS1;NPFFR2;SVEP1;NUP210L;RHOH;ITGA9;GPR83;APO...,TNS1;NPFFR2;SVEP1;NUP210L;RHOH;ITGA9;GPR83;APO...
CD9+_dorrell,0.269371,1.065961,0.331617,0.5820385,45,41,KCNA5;RBP4;PPP1R1A;AK5;ZFP36;PRDM1;NPC1L1;MIA2...,KCNA5;RBP4;PPP1R1A;AK5;ZFP36;PRDM1;NPC1L1;MIA2...
Beta sub2_xin,0.27877,0.991437,0.414737,0.6252608,28,28,TFF3;NPY;IAPP;RBP1;IGFBP5;PAM;GPX3;PEMT;ID1;SE...,TFF3;NPY;IAPP;RBP1;IGFBP5;PAM;GPX3;PEMT;ID1;SE...
ST8SIA1-_dorrell,0.217337,0.874103,0.551308,0.6332223,65,64,G6PC2;SLC5A1;SCGB2A1;NEUROD1;RFX6;SLC27A2;LIN7...,G6PC2;SLC5A1;SCGB2A1;NEUROD1;RFX6;SLC27A2;LIN7...
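
A common follow-up is to keep only the gene sets below an FDR cutoff. The sketch below uses a toy DataFrame with three rows copied from the table above as a stand-in for `gs_res.res2d`; the 0.25 cutoff is an assumption (a conventional GSEA threshold), not a value taken from this analysis.

```python
import pandas as pd

# Toy stand-in for gs_res.res2d, with three rows copied from the table above
res = pd.DataFrame(
    {"nes": [3.526346, 2.728089, 1.468218], "fdr": [0.0, 0.0, 0.252549]},
    index=pd.Index(["A1_alpha", "B1_beta", "CD9-_dorrell"], name="Term"),
)

fdr_cutoff = 0.25  # assumed threshold; adjust to your own criterion
significant = res[res["fdr"] < fdr_cutoff].sort_values("nes", ascending=False)
print(significant)
```

On the real results, replacing `res` with `gs_res.res2d` should give the same kind of filtered view, provided the column is named `fdr` as in the output above.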


### Show the results
The **gsea** module generates a heatmap for the genes in each gene set in the background.
If you need to draw the plots yourself, use the code below.

In [44]:
from gseapy.plot import gseaplot
terms = gs_res.res2d.index
# Draw one enrichment plot per gene set and save it as <term>.png
for term in terms:
    gseaplot(gs_res.ranking, term=term, **gs_res.results[term], ofname=term + '.png')
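
Some term names here contain spaces or `+`, which are legal in file names but awkward in shells and URLs. If that matters, the names can be sanitised before building the output path. This is an optional sketch; `safe_filename` is a hypothetical helper, and distinct terms could in principle map to the same file name.

```python
import re

def safe_filename(term, ext=".png"):
    """Replace characters outside [A-Za-z0-9._-] with underscores and append the extension."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", term) + ext

for term in ["Beta sub3_xin", "CD9+_dorrell", "ST8SIA1-_dorrell"]:
    print(term, "->", safe_filename(term))
```

The result can be passed as `ofname` in place of `term + '.png'`.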
