# Perform GSEA using GSEAPY  

Following the protocol defined here: https://gseapy.readthedocs.io/en/latest/gseapy_tutorial.html#use-gsea-command-or-gsea


In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina' # mac
import pandas as pd
import gseapy as gp
import matplotlib.pyplot as plt
import numpy as np

In [2]:
gp.__version__

'0.9.9'

## 1. Prepare expression file
In our case, the expression file is the percentage of cells inside each pseudostate **bin**

In [3]:
gene_exp_beta = pd.read_csv("./dat/1901/res.genes_level.b.csv",index_col=0)
gene_exp_beta.head()

Unnamed: 0,pval,odds,type1_frac,type2_frac,FDR,padj.Bonferroni,FDR.BY
SAMD11,5.791039e-09,1.416723,0.228985,0.173295,3.364968e-07,8.8e-05,3e-06
NOC2L,5.391691e-06,1.447693,0.110243,0.078835,7.608109e-05,0.082396,0.000777
KLHL17,0.01986402,1.626148,0.014929,0.009233,0.04680986,1.0,0.478007
PLEKHN1,0.07146685,0.765752,0.015847,0.020597,0.1243914,1.0,1.0
C1orf170,0.4620219,1.035402,0.016537,0.01598,0.4789458,1.0,1.0


In [36]:
np.log2((gene_exp_beta[["type1_frac","type2_frac"]]+.00001).head())

Unnamed: 0,type1_frac,type2_frac
SAMD11,-2.126613,-2.528611
NOC2L,-3.181104,-3.664833
KLHL17,-6.064792,-6.75743
PLEKHN1,-5.978691,-5.60075
C1orf170,-5.917329,-5.966676


In [21]:
(gene_exp_beta[["type1_frac","type2_frac"]]).head()

Unnamed: 0,type1_frac,type2_frac
SAMD11,0.228985,0.173295
NOC2L,0.110243,0.078835
KLHL17,0.014929,0.009233
PLEKHN1,0.015847,0.020597
C1orf170,0.016537,0.01598


## 2. Phenotype file (.cls)
- The first line specifies the total number of samples and the number of phenotype classes. The third number is always 1.
- The second line specifies the phenotype class names.
- The third line specifies the class label of each sample, in the column order of the expression file from step 1.
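
The three-line layout described above can be generated from a list of per-sample labels. This is a minimal sketch, not part of GSEApy: `write_cls` is a hypothetical helper, and it writes the second line without a space after `#`, matching the file written in this notebook.

```python
import os
import tempfile


def write_cls(path, labels):
    """Write a .cls phenotype file from one class label per sample (hypothetical helper)."""
    # Unique class names, in order of first appearance
    classes = list(dict.fromkeys(labels))
    with open(path, "w") as f:
        # Line 1: number of samples, number of classes, and a constant 1
        f.write(f"{len(labels)} {len(classes)} 1\n")
        # Line 2: class names, prefixed with '#'
        f.write("#" + " ".join(classes) + "\n")
        # Line 3: the class label of each sample, in column order
        f.write(" ".join(labels) + "\n")


# Usage: two samples, one per class, written to a temporary file
cls_path = os.path.join(tempfile.mkdtemp(), "pheno_example.cls")
write_cls(cls_path, ["B1", "B2"])
with open(cls_path) as f:
    cls_text = f.read()
print(cls_text)
```

Deriving the counts from the label list keeps the first line consistent with the third when samples are added or removed.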

In [12]:
with open('./dat/1901/pheno_beta.cls',"w") as f:
    f.write("2 2 1\n")
    f.write("#B1 B2\n")
    f.write("B1 B2"+"\n")

In [13]:
phenoA, phenoB, class_vector =  gp.parser.gsea_cls_parser("./dat/1901/pheno_beta.cls")
#class_vector used to indicate group attributes for each sample
print(class_vector)

['B1', 'B2']


In [14]:
print("positively correlated: ", phenoA)

positively correlated:  B1


In [15]:
print("negatively correlated: ", phenoB)


negatively correlated:  B2


## 3. Define gene sets

In our case, we use the beta gene sets from three published studies, saved in [gmt](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29) format.

This was done in a separate [notebook](./compare_with_previous_glists.ipynb).
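
For reference, a GMT file holds one gene set per line, tab-separated: the set name, a description, then the member genes. The sketch below writes and reads that layout in plain Python. `write_gmt`, `read_gmt`, and the two example sets are hypothetical and are not the gene lists used in this analysis.

```python
import os
import tempfile


def write_gmt(path, gene_sets, description="na"):
    """Write a dict {set_name: [genes]} as a tab-delimited GMT file (hypothetical helper)."""
    with open(path, "w") as f:
        for name, genes in gene_sets.items():
            # Columns: set name, description, then one gene per column
            f.write("\t".join([name, description] + list(genes)) + "\n")


def read_gmt(path):
    """Read a GMT file back into a dict {set_name: [genes]}, dropping the description."""
    gene_sets = {}
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:  # skip blank or malformed lines
                gene_sets[fields[0]] = fields[2:]
    return gene_sets


# Usage with made-up example sets (placeholders only)
example_sets = {
    "example_set_1": ["INS", "DLK1", "CHGA"],
    "example_set_2": ["SAMD11", "NOC2L"],
}
gmt_path = os.path.join(tempfile.mkdtemp(), "example.gmt")
write_gmt(gmt_path, example_sets)
round_trip = read_gmt(gmt_path)
print(round_trip)
```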

## 4. Run GSEA
The result looks like this:

![The result interpretation](https://software.broadinstitute.org/gsea/doc/ug_images/anl-enrichment-geneset-plot-annotated.gif)

See also https://software.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html

- Normalized Enrichment Score (NES)=$\frac{\textrm{actual ES}}{\textrm{mean ES against all permutations}}$
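
As a worked illustration of the ratio above, the sketch below normalizes an enrichment score against a list of permutation scores. It assumes the convention that a positive ES is divided by the mean of the positive permutation scores and a negative ES by the absolute mean of the negative ones. `normalized_es` and the numbers are made up for illustration and do not come from this analysis or from GSEApy's internals.

```python
from statistics import mean


def normalized_es(es, null_es):
    """Divide an ES by the mean magnitude of same-signed permutation ESs (illustrative)."""
    # Keep only permutation scores with the same sign as the observed ES
    same_sign = [x for x in null_es if (x >= 0) == (es >= 0)]
    if not same_sign:
        return float("nan")  # no same-signed permutations to normalize against
    return es / abs(mean(same_sign))


# Made-up permutation scores, for illustration only
null_scores = [0.2, 0.3, -0.25, 0.1, -0.15]
nes_pos = normalized_es(0.6, null_scores)   # 0.6 divided by the mean of 0.2, 0.3, 0.1
nes_neg = normalized_es(-0.4, null_scores)  # -0.4 divided by |mean of -0.25, -0.15|
print(nes_pos, nes_neg)
```

Dividing by the absolute mean keeps the sign of the observed ES, so a negatively enriched set has a negative NES, as in the results table.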

In [42]:
# run gsea
# enrichr libraries are supported by gsea module. Just provide the name

gs_res = gp.gsea(data=np.log2(gene_exp_beta[["type1_frac","type2_frac"]]+.0000001), # or data='./P53_resampling_data.txt'
                 gene_sets='./dat/glists/gsea.gmt', # enrichr library names or gmt file
                 cls=class_vector, # phenotype label of each sample column
                 # set permutation_type to 'phenotype' if samples >= 15
                 permutation_type='gene_set',
                 permutation_num=1000, # reduce number to speed up test
                 outdir=None,  # do not write output to disk
                 no_plot=True, # Skip plotting
                 method='diff_of_classes',
                 processes=4,
                 min_size=0,
                 seed=12345,
                 max_size=4000,
                 format='png')
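
With one column per class, as here, each class mean is just that column. Assuming `diff_of_classes` ranks genes by the difference of class means, the ranking metric on log2 data would then be a per-gene log2 fold change. The sketch below shows that arithmetic in plain Python. `log2_diff` and its `pseudocount` argument are hypothetical names, not GSEApy functions, and the inputs are the SAMD11 and PLEKHN1 fractions from the table in step 1.

```python
from math import log2


def log2_diff(frac_a, frac_b, pseudocount=1e-7):
    """Difference of log2-transformed fractions, i.e. a log2 fold change (illustrative)."""
    # The pseudocount avoids log2(0) for genes absent from a group
    return log2(frac_a + pseudocount) - log2(frac_b + pseudocount)


# SAMD11: higher fraction in type1, so the score is positive
samd11_score = log2_diff(0.228985, 0.173295)
# PLEKHN1: higher fraction in type2, so the score is negative
plekhn1_score = log2_diff(0.015847, 0.020597)
print(samd11_score, plekhn1_score)
```

Under this assumption, genes with a higher fraction in the first class rank toward the top of the list and genes with a higher fraction in the second class toward the bottom.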

In [43]:
# access the results DataFrame through the res2d attribute
gs_res.res2d

Term,es,nes,pval,fdr,geneset_size,matched_size,genes,ledge_genes
A1_alpha,0.724383,3.002627,0.0,0.0,240,233,EDNRA;SRPX;PKP1;PDE6C;GRIK1;MS4A8;PGAM2;LRRC3B...,EDNRA;SRPX;PKP1;PDE6C;GRIK1;MS4A8;PGAM2;LRRC3B...
A2_alpha,-0.515989,-2.820014,0.0,0.0,3526,3522,WIPF3;TMEM130;SLIT2;TDRD5;FHOD3;ASS1;CACNA2D3;...,ZNF251;POLR3B;GET4;C17orf64;DENND5B;PIAS3;CENP...
B1_beta,0.852346,3.041127,0.0,0.0,75,75,BTBD17;INS-IGF2;SMAD6;PDE6C;GRIK1;CXCL5;MS4A8;...,BTBD17;INS-IGF2;SMAD6;PDE6C;GRIK1;CXCL5;MS4A8;...
B2_beta,-0.685345,-3.266767,0.0,0.0,420,420,SRI;UCP2;PROX1;EVI5L;ANXA11;CRY2;LSR;FBXL8;RCO...,NFYA;OSBPL10;KCNG3;PPP2R1A;NARS2;OSBPL6;ABCD3;...
Beta sub4_xin,-0.42948,-2.011811,0.0,0.004398,390,385,MSS51;MMP7;UQCRFS1;DNAJB9;ID2;BRIX1;SAP18;DNAJ...,ARID5B;ISCU;DNAJB11;ELF3;PNO1;RAB7A;ZFAND5;GHI...
immature_bader,-0.32803,-1.601343,0.0,0.029028,579,565,NELL1;G6PC2;S100A1;AP1M2;SYTL4;TMEM63A;GCK;SPO...,MPV17L2;GLOD4;C19orf70;SLC25A4;PRDX5;RAB3A;B9D...
Beta sub3_xin,0.648198,1.498635,0.072848,0.134466,13,10,INS;DLK1;CHGA;ASB9;LAMP1;IGFBP7;CPE;CKB;TIMP1;...,INS;DLK1;CHGA
CD9-_dorrell,0.375743,1.290275,0.141328,0.220644,57,56,SPARCL1;CCDC81;ITM2A;COL24A1;STAB1;ADAMTS5;IGS...,SPARCL1;CCDC81;ITM2A;COL24A1;STAB1;ADAMTS5;IGS...
mature_bader,0.350899,1.154928,0.226013,0.224146,72,50,NEB;SYNGAP1;CX3CR1;NPFFR2;SYT1;SVEP1;BASP1;KCN...,NEB;SYNGAP1;CX3CR1;NPFFR2;SYT1;SVEP1;BASP1;KCN...
ST8SIA1+_dorrell,0.365006,1.182005,0.199546,0.232716,46,46,MYOM1;TP53I11;TMEM130;ATP8B1;SMOC1;KCNJ8;PON3;...,MYOM1;TP53I11;TMEM130;ATP8B1;SMOC1;KCNJ8;PON3;...


### Show the results
The **gsea** module generates a heatmap for the genes in each gene set in the background.
If you need to plot the results yourself, use the code below.

In [46]:
from gseapy.plot import gseaplot, heatmap
terms = gs_res.res2d.index
#for term in terms:
#    gseaplot(gs_res.ranking, term=term, **gs_res.results[term])
# save one enrichment plot per gene set, named '<term>_beta.png'
for term in terms:
    gseaplot(gs_res.ranking, term=term, **gs_res.results[term], ofname=term+'_beta.png')