# Perform GSEA using GSEApy

Following the protocol defined here: https://gseapy.readthedocs.io/en/latest/gseapy_tutorial.html#use-gsea-command-or-gsea


In [7]:
%matplotlib inline
%config InlineBackend.figure_format='retina' # mac
import pandas as pd
import gseapy as gp
import matplotlib.pyplot as plt
import numpy as np

In [2]:
gp.__version__

'0.9.9'

## 1. Prepare prerank file

* eg: https://github.com/zqfang/GSEApy/blob/master/tests/data/temp.rnk



In [5]:
gene_exp_beta = pd.read_csv("./dat/1901/res.genes_level.b.csv",index_col=0)
gene_exp_beta.head()

gene,pval,odds,type1_frac,type2_frac,FDR,padj.Bonferroni,FDR.BY
SAMD11,5.791039e-09,1.416723,0.228985,0.173295,3.364968e-07,8.8e-05,3e-06
NOC2L,5.391691e-06,1.447693,0.110243,0.078835,7.608109e-05,0.082396,0.000777
KLHL17,0.01986402,1.626148,0.014929,0.009233,0.04680986,1.0,0.478007
PLEKHN1,0.07146685,0.765752,0.015847,0.020597,0.1243914,1.0,1.0
C1orf170,0.4620219,1.035402,0.016537,0.01598,0.4789458,1.0,1.0


In [8]:
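# Rank genes by odds ratio, largest first
df = gene_exp_beta.sort_values(by='odds', ascending=False)["odds"]

# Replace inf with the largest finite odds ratio, then write a headerless two-column .rnk file
df.replace(np.inf, df[~df.isin([np.inf])].max()).to_csv('./dat/1901/res.genes.b.rnk', sep='\t', header=False)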
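The ranking step above can be checked on a small in-memory example before touching the real file. This is a minimal sketch, not the notebook's actual data: the helper name `to_rank_series` and the toy gene names and values are assumptions made for illustration. It caps `inf` at the largest finite value and writes a tab-separated, headerless two-column table, which is the layout of the example `.rnk` file linked above.

```python
import io

import numpy as np
import pandas as pd


def to_rank_series(odds: pd.Series) -> pd.Series:
    """Cap +inf at the largest finite value and sort in descending order."""
    finite_max = odds[np.isfinite(odds)].max()
    return odds.replace(np.inf, finite_max).sort_values(ascending=False)


# Toy values for illustration only (hypothetical, not taken from the real data).
toy = pd.Series({"GENE_A": 1.4, "GENE_B": np.inf, "GENE_C": 0.8})
ranked = to_rank_series(toy)

# Write to an in-memory buffer instead of disk so the output can be inspected.
buffer = io.StringIO()
ranked.to_csv(buffer, sep="\t", header=False)
rnk_text = buffer.getvalue()
print(rnk_text)
```

Passing `header=False` explicitly keeps the output the same across pandas versions, since the default for `Series.to_csv` has changed over time.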


## 2. Define gene sets

In our case, we use the beta gene sets from three published studies, saved in [gmt](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29) format.

This was done in another [notebook](./compare_with_previous_glists.ipynb).
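For reference, a GMT file holds one gene set per line: the set name, a description, and then the member genes, all separated by tabs. The sketch below shows one way such a file could be written. The helper `write_gmt`, the set names, and the gene groupings are hypothetical and are not the lists built in the other notebook.

```python
import io


def write_gmt(gene_sets, handle, description="NA"):
    """Write {set_name: [genes]} as GMT lines: name, description, genes (tab-separated)."""
    for name, genes in gene_sets.items():
        handle.write("\t".join([name, description, *genes]) + "\n")


# Hypothetical gene sets for illustration; the real lists come from the other notebook.
example_sets = {
    "set_one": ["INS", "IAPP", "DLK1"],
    "set_two": ["RBP4", "SCGN"],
}

# Write to an in-memory buffer; pass an open file handle to write to disk instead.
buffer = io.StringIO()
write_gmt(example_sets, buffer)
gmt_text = buffer.getvalue()
print(gmt_text)
```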

## 3. Run GSEA

```python
gseapy.prerank(rnk='gsea_data.rnk', gene_sets='gene_sets.gmt', outdir='gseapy_out', min_size=15,
               max_size=1000, permutation_num=1000, weighted_score_type=1, ascending=False,
               figsize=(6.5,6), format='png')
```

In [9]:
# Run preranked GSEA
# gene_sets accepts either a .gmt file path or the name of an Enrichr library

gs_res = gp.prerank(rnk='./dat/1901/res.genes.b.rnk', # or data='./P53_resampling_data.txt'
                 gene_sets='./dat/glists/gsea.gmt', # enrichr library names or gmt file
                 #set permutation_type to phenotype if samples >=15
                 permutation_num=500, # reduce number to speed up test
                 outdir=None,  # do not write output to disk
                 no_plot=True, # Skip plotting
                 weighted_score_type=1,
                 ascending=False,
                 min_size=0,
                 max_size=4000,
                 processes=4,
                 format='png')

In [10]:
# Access the results dataframe through the res2d attribute
gs_res.res2d

Term,es,nes,pval,fdr,geneset_size,matched_size,genes,ledge_genes
A1_alpha,0.592078,3.740554,0.0,0.0,240,233,EDNRA;SRPX;PKP1;PDE6C;GRIK1;MS4A8;PGAM2;LRRC3B...,EDNRA;SRPX;PKP1;PDE6C;GRIK1;MS4A8;PGAM2;LRRC3B...
B1_beta,0.827921,4.49515,0.0,0.0,75,75,BTBD17;INS-IGF2;SMAD6;PDE6C;GRIK1;MS4A8;CXCL5;...,BTBD17;INS-IGF2;SMAD6;PDE6C;GRIK1;MS4A8;CXCL5;...
mature_bader,0.296981,1.50201,0.070213,0.07849674,72,50,NEB;SYNGAP1;CX3CR1;NPFFR2;SYT1;SVEP1;SRGAP3;BA...,NEB;SYNGAP1;CX3CR1;NPFFR2;SYT1;SVEP1;SRGAP3;BA...
ST8SIA1+_dorrell,0.308675,1.515026,0.084034,0.1244633,46,46,MYOM1;TP53I11;TMEM130;ATP8B1;SMOC1;RBM43;KCNJ8...,MYOM1;TP53I11;TMEM130;ATP8B1;SMOC1;RBM43;KCNJ8...
CD9-_dorrell,0.352208,1.859225,0.035052,0.1432035,57,56,SPARCL1;CCDC81;ITM2A;COL24A1;ADAMTS5;STAB1;IGS...,SPARCL1;CCDC81;ITM2A;COL24A1;ADAMTS5;STAB1;IGS...
Beta sub2_xin,0.435515,1.850973,0.024283,0.1463858,28,27,ID1;RBP1;GPX3;RGS16;TFF3;NPY;IAPP;GNAS;FOS;PEM...,ID1;RBP1;GPX3;RGS16;TFF3;NPY;IAPP;GNAS;FOS;PEM...
CD9+_dorrell,0.250344,1.250502,0.205945,0.170935,45,42,KCNA5;COL6A2;TACSTD2;SYNM;MIA2;RBP4;PRDM1;CD74...,KCNA5;COL6A2;TACSTD2;SYNM;MIA2;RBP4;PRDM1;CD74...
Beta sub1_xin,0.378206,1.279366,0.215584,0.2089711,13,13,SCGB2A1;FFAR4;RBP4;PRSS23;FXYD2;PPP1R1A;SCGN;A...,SCGB2A1;FFAR4;RBP4;PRSS23;FXYD2;PPP1R1A;SCGN
Beta sub3_xin,0.514642,1.626509,0.044619,0.2429156,13,10,INS;DLK1;CHGA;ASB9;LAMP1;IGFBP7;CPE;CKB;TIMP1;...,INS;DLK1;CHGA;ASB9
ST8SIA1-_dorrell,0.167693,0.902521,0.568507,0.6088801,65,65,G6PC2;SCGB2A1;FRMD4B;SLC27A2;NEUROD1;TCEA3;MAR...,G6PC2;SCGB2A1;FRMD4B;SLC27A2;NEUROD1;TCEA3;MAR...
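Because `res2d` is a regular pandas dataframe, it can be filtered and sorted like any other. The sketch below rebuilds three rows of the table above (only the `nes` and `fdr` columns) and keeps the terms under an FDR cutoff. The cutoff of 0.25 is an assumption chosen for illustration, so pick a threshold that suits your analysis.

```python
import pandas as pd

# Three rows copied from the table above, restricted to the columns used here.
res2d = pd.DataFrame(
    {
        "nes": [4.49515, 1.50201, 0.902521],
        "fdr": [0.0, 0.07849674, 0.6088801],
    },
    index=pd.Index(["B1_beta", "mature_bader", "ST8SIA1-_dorrell"], name="Term"),
)

FDR_CUTOFF = 0.25  # assumed threshold, not prescribed by this notebook

# Keep terms below the cutoff, strongest normalized enrichment score first.
significant = res2d[res2d["fdr"] < FDR_CUTOFF].sort_values("nes", ascending=False)
print(significant)
```

With the real results, the same two lines can be applied to `gs_res.res2d` directly.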


### Show the results
The **gsea** module generates a heatmap of the genes in each gene set in the background.
To generate the plots yourself, use the code below.

In [11]:
from gseapy.plot import gseaplot, heatmap
terms = gs_res.res2d.index
# Draw one enrichment plot per gene set and save it as <term>_b.png
for term in terms:
    gseaplot(gs_res.ranking, term=term, **gs_res.results[term], ofname=term + '_b.png')
