LSEA

LSEA (Locus Set Enrichment Analysis) is a tool for performing gene set enrichment analysis on independent loci, taking into account LD (Linkage Disequilibrium).

Getting Started

LSEA can be applied to gene set enrichment analysis of data from GWAS summary statistics files in tsv format. It is based on a simple hypergeometric test; however, it transforms genes and gene sets into independent loci and sets of independent loci, which eliminates multiple signals from genes in LD and improves the precision of the analysis.
The tool includes a precompiled universe of independent loci based on data obtained from UK Biobank (https://www.ukbiobank.ac.uk/). Data for all heritable phenotypes (partitioned heritability p-value < 0.05) were processed with PLINK to obtain independent loci for each phenotype. All files were then combined into a universe by merging intervals that overlap by more than 60%.
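The merging step can be illustrated with a minimal sketch. It is not taken from LSEA.py: the function name `merge_loci`, the `(chromosome, start, end)` tuple layout, and the choice to measure overlap relative to the shorter of the two intervals are all assumptions made for illustration, since the README does not say how the 60% is computed.

```python
def merge_loci(intervals, min_overlap=0.6):
    """Greedily merge same-chromosome intervals whose overlap exceeds min_overlap.

    Assumption: the overlap fraction is measured against the shorter interval.
    intervals is an iterable of (chromosome, start, end) tuples.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged:
            m_chrom, m_start, m_end = merged[-1]
            if chrom == m_chrom:
                overlap = min(end, m_end) - max(start, m_start)
                shorter = min(end - start, m_end - m_start)
                if shorter > 0 and overlap / shorter > min_overlap:
                    # Extend the previous locus to cover both intervals.
                    merged[-1] = (m_chrom, min(start, m_start), max(end, m_end))
                    continue
        merged.append((chrom, start, end))
    return merged


loci = [("1", 100, 200), ("1", 120, 210), ("1", 500, 600), ("2", 100, 200)]
universe = merge_loci(loci)
```

Here the first two intervals on chromosome 1 overlap by 80 bases out of 90, so they are merged, while the other two stay separate.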

Prerequisites

Installing

To install this tool, clone this repository to your PC.

~$ git clone https://github.com/LSEA

Running and using tool

First, prepare a tsv file from GWAS summary statistics with the following structure:

CHR COORDINATE RSID REF ALT PVAL
9 136058188 rs12216896 C T 2.89651e-11
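Before running the tool it can help to check that the input file has this layout. The snippet below is a small sketch for that purpose, not part of LSEA: the helper name `read_gwas_tsv` is hypothetical, and it assumes the file is tab-separated and starts with the header row shown above.

```python
import csv
import io

EXPECTED_COLUMNS = ["CHR", "COORDINATE", "RSID", "REF", "ALT", "PVAL"]


def read_gwas_tsv(handle):
    """Read a tab-separated GWAS summary file and convert the numeric fields."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for row in reader:
        row["COORDINATE"] = int(row["COORDINATE"])  # base-pair position
        row["PVAL"] = float(row["PVAL"])            # association p-value
        rows.append(row)
    return rows


# The example row from above, joined with tabs.
sample = (
    "CHR\tCOORDINATE\tRSID\tREF\tALT\tPVAL\n"
    "9\t136058188\trs12216896\tC\tT\t2.89651e-11\n"
)
rows = read_gwas_tsv(io.StringIO(sample))
```

For a real file, pass an open file object (for example `open(path, newline="")`) instead of the `io.StringIO` wrapper.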

To launch the tool, you also need to specify the paths to the PLINK and SnpEff directories.

Example usage

~$ python3 LSEA.py -af <input tsv-file> -sn <path to SnpEff> -pld <path to PLINK> -bf <bfile for PLINK> -p

This command applies the LSEA algorithm to the input file and generates a tsv file with the following structure:

gene_set p-value q-value enrich_description
BIOCARTA_INTRINSIC_PATHWAY 2.0446237642438122e-14 2.2517441515617103e-10 (17776, 11, 36, 6, 'F11;FGB;FGA;F5;FGG;KLKB1')
The first column contains the name of the gene set; the second and third contain the p-value and the corrected q-value of the hypergeometric test; the last column contains the total number of independent loci, the number of loci in the query, the number of loci in the gene set, the number of loci common to the query and the gene set and, finally, the list of genes.
Note that the gene list can be shorter than the number of common loci, because only independent loci are counted in the analysis.
-p (the --precompiled flag) indicates that the precompiled universe of independent loci based on UK Biobank data is used.
Information about the HLA locus is excluded from the analysis because of the high ambiguity of LD scores within the HLA locus.
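To make the meaning of the enrich_description column concrete, here is a minimal sketch that parses one value and recomputes a hypergeometric p-value from the four counts. The function names are hypothetical, and it assumes that the reported p-value is the upper tail, i.e. the probability of seeing at least the observed number of common loci; LSEA.py may compute it differently.

```python
import ast
from math import comb


def parse_enrichment(description):
    """Split an enrich_description string into its counts and a gene list."""
    total, n_query, n_set, n_common, genes = ast.literal_eval(description)
    return total, n_query, n_set, n_common, genes.split(";")


def hypergeom_upper_tail(total, n_query, n_set, n_common):
    """P(X >= n_common) when n_query loci are drawn from total loci,
    of which n_set belong to the gene set (assumed upper-tail test)."""
    upper = min(n_query, n_set)
    numerator = sum(
        comb(n_set, i) * comb(total - n_set, n_query - i)
        for i in range(n_common, upper + 1)
    )
    return numerator / comb(total, n_query)


description = "(17776, 11, 36, 6, 'F11;FGB;FGA;F5;FGG;KLKB1')"
total, n_query, n_set, n_common, genes = parse_enrichment(description)
p_value = hypergeom_upper_tail(total, n_query, n_set, n_common)
```

Under this assumption, the value computed for the example row should be of the same order of magnitude as the p-value shown in the table.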

Tool options:

-af <input.tsv> Input file in tsv format
-vf <input.vcf> Annotated vcf file, if it is already prepared
-pl <input.clumped> PLINK clumping result (.clumped file), if it is already prepared
--precompiled, -p Use precompiled loci
-sn <path to SnpEff directory> Path to SnpEff
-g <genome> Flag for specifying genome for SnpEff annotation
-pld <path to PLINK directory> Path to PLINK
-bf <bfile> Bfile for PLINK

Creating your own universe:

If you don't want to use the precompiled universe of independent loci, you can create your own universe from GWAS summary statistics files. Use the -cu (--create_universe) option to create a universe of independent loci from your data:

-af <input.tsv> -cu

For this function you have to prepare clumping results for your GWAS data (a .clumped file). If you have multiple files (e.g. for different phenotypes), use the -cld flag to specify the directory with the clumped files:

-cld <directory with clumped files>

Author

License

This project is free and available for everyone.
