In [1]:
%matplotlib inline
import pycistarget
pycistarget.__version__

'0.1.dev83+gc0508a0'

<a class="anchor" id="top"></a>
# PycisTarget on pycisTopic results

* [1. Deriving cistromes using topics](#1)
    * [A. Loading your region sets](#2)
    * [B. cisTarget](#3)
* [2. Deriving cistromes using DARs](#6)
    * [A. Loading your region sets](#7)
    * [B. cisTarget](#8)

**pycisTarget** is a Python module for performing motif enrichment analysis and deriving genome-wide cistromes by implementing **cisTarget** (Herrmann et al., 2012; Imrichova et al., 2015). In addition, *de novo* cistromes can be derived (via **Homer** (Heinz et al., 2010)), and pycisTarget includes a novel approach to derive differentially enriched motifs and cistromes between one or more groups of regions, named **Differentially Enriched Motifs (DEM)**.

In this tutorial we will show how to obtain cistromes from topics and DARs using cisTarget. For more information on how to use DEM and Homer, take a look at the ChIP-seq tutorial.

<a class="anchor" id="1"></a>
## 1. Deriving cistromes using topics with cisTarget

<a class="anchor" id="2"></a>
### A. Loading your region sets

**pycisTarget** takes as input a dictionary with the region set names as keys and the regions (as PyRanges objects) as values. We will start by loading the binarized topics (see pycisTopic - Single sample workflow tutorial).

In [2]:
# Load the binarized topics (a dictionary with one region table per topic)
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/topic_binarization/binarized_topic_region.pkl', 'rb') as infile:
    binarized_topic_region = pickle.load(infile)

In [3]:
import pyranges as pr
from pycistarget.utils import *
region_sets = {key: pr.PyRanges(region_names_to_coordinates(binarized_topic_region[key].index.tolist())) for key in binarized_topic_region.keys()}
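To make the conversion above more concrete, here is a minimal sketch of how region names can be turned into coordinates. It assumes the region names follow a `chrom:start-end` layout; the helper `parse_region_name` and the toy dictionaries are hypothetical and are not part of pycisTarget, which provides `region_names_to_coordinates` for this purpose.

```python
import re

# Assumed region-name layout: "chrom:start-end" (e.g. "chr1:1000-1500").
REGION_PATTERN = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region_name(name):
    """Split a 'chrom:start-end' string into a (chrom, start, end) tuple."""
    match = REGION_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected region name: {name!r}")
    return match["chrom"], int(match["start"]), int(match["end"])


# Hypothetical region names per region set, mimicking the dictionary layout
# (region set name as key, regions as values).
toy_region_names = {
    "Topic1": ["chr1:1000-1500", "chr2:200-700"],
    "Topic2": ["chrX:50-550"],
}

toy_region_sets = {
    key: [parse_region_name(name) for name in names]
    for key, names in toy_region_names.items()
}

for key, regions in toy_region_sets.items():
    print(key, regions)
```

In the real workflow the parsed coordinates are wrapped in a PyRanges object, as in the cell above; plain tuples are used here only to keep the sketch free of dependencies.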

<a class="anchor" id="3"></a>
### B. Create cisTarget database

To run **cisTarget** you need to provide a **ranking database**: a feather file containing a dataframe with motifs as rows, genomic regions as columns, and the rank of each region for each motif [based on the cis-regulatory module (CRM) score (Frith et al., 2003)] as values. We provide these databases for human (hg38, hg19), mouse (mm10, mm9) and fly (dm3, dm6) at https://resources.aertslab.org/cistarget/.
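As an illustration of what such a ranking table contains, the sketch below converts a toy score table (motifs as rows, regions as columns) into rankings, with rank 0 assigned to the highest-scoring region for each motif. The scores, region names and the helper `scores_to_rankings` are hypothetical, and ties are ignored for simplicity; this is not the code used to build the official databases.

```python
# Toy score table: one row of region scores per motif (hypothetical values).
regions = ["chr1:100-600", "chr1:900-1400", "chr2:50-550", "chr3:10-510"]
scores = {
    "motif_A": [0.2, 3.1, 1.7, 0.9],
    "motif_B": [2.5, 0.1, 0.4, 2.6],
}


def scores_to_rankings(values):
    """Return 0-based ranks, where rank 0 is the highest-scoring region."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


# Same layout as the score table: motifs as rows, regions as columns.
rankings = {motif: scores_to_rankings(vals) for motif, vals in scores.items()}

for motif, ranks in rankings.items():
    print(motif, dict(zip(regions, ranks)))
```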

In addition, **if you want to use other regions or genomes to build your databases**, we provide a step-by-step tutorial and scripts at https://github.com/aertslab/create_cisTarget_databases. Below you can find the basic steps to do so:

In [None]:
#### Get fasta sequences
module load BEDTools
bedtools getfasta -fi /staging/leuven/stg_00002/lcb/resources/human/hg38/hg38.fa -bed /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/consensus_regions.bed > /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/consensus_regions.fa
#### Activate environment
conda_initialize /staging/leuven/stg_00002/lcb/ghuls/software/miniconda3/
conda activate create_cistarget_databases
#### Set ${create_cistarget_databases_dir} to your local clone of https://github.com/aertslab/create_cisTarget_databases
create_cistarget_databases_dir='/staging/leuven/stg_00002/lcb/ghuls/software/create_cisTarget_databases'
#### Score the motifs in 10 chunks; we will use the non-redundant db here
for current_part in {1..10} ; do
     python3.8 ${create_cistarget_databases_dir}/create_cistarget_motif_databases.py \
         -f /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/consensus_regions.fa \
         -M /staging/leuven/stg_00002/lcb/cbravo/motif_clustering/RNA_harmony_snn_res_5_clusters/motif_collection_combined_motifs_stamp_and_singlets/singletons/ \
         -m /staging/leuven/stg_00002/lcb/cbravo/motif_clustering/RNA_harmony_snn_res_5_clusters/motif_collection_combined_motifs_stamp_and_singlets/motifs.txt \
         -p ${current_part} 10 \
         -o /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/dbs/human_brain \
         -t 20
done
#### Merge scores
${create_cistarget_databases_dir}/combine_partial_regions_or_genes_vs_motifs_or_tracks_cistarget_dbs.py -i  /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/dbs/human_brain -o /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/dbs/
#### Remove chunks
rm /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/dbs/human_brain*part*
#### Create rankings
${create_cistarget_databases_dir}/convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py -i /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/dbs/human_brain.motifs_vs_regions.scores.feather -s 555

The most relevant parameters for running cisTarget are:
- **ctx_db**: Path to the cisTarget database to use, or a preloaded cisTargetDatabase object (loaded with the same region sets that will be analyzed).
- **region_sets**: The input sets of regions.
- **specie**: Species to which the region coordinates and database belong. To annotate motifs to TFs using cisTarget annotations, possible values are 'mus_musculus', 'homo_sapiens' or 'drosophila_melanogaster'. For any other value, motifs will not be annotated to a TF unless a custom annotation is provided.
- **fraction_overlap**: Minimum overlap fraction (in any direction) required to map input regions to regions in the database. Default: 0.4.
- **auc_threshold**: Threshold to calculate the AUC. We recommend 0.005 (default) for human and mouse, and 0.01 for fly.
- **nes_threshold**: NES threshold above which a motif is considered significant. Default: 3.0.
- **rank_threshold**: Fraction of regions to use as the maximum rank for the region enrichment recovery curve. By default, we use 5% of the total number of regions in the database (0.05).
- **annotation**: Annotations to use to form the cistromes. Here we will use the direct and orthology annotations as an example. Default: ['Direct_annot', 'Motif_similarity_annot', 'Orthology_annot', 'Motif_similarity_and_Orthology_annot']
- **n_cpu**: Number of CPUs to use during calculations.
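To give some intuition for how `auc_threshold` and `nes_threshold` interact, the following is a simplified sketch of a recovery-curve AUC and the normalized enrichment score (NES) derived from it. It is not the pycisTarget implementation: the function names, the toy ranks, the toy thresholds and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def recovery_auc(set_ranks, rank_cutoff):
    """Area under the recovery curve up to rank_cutoff, scaled to at most 1.

    set_ranks: 0-based ranks (within the whole database) of the regions in
    the region set, for a single motif.
    """
    hits = [r for r in set_ranks if r < rank_cutoff]
    # A hit at rank r raises the curve by one unit for the remaining
    # (rank_cutoff - r) positions.
    area = sum(rank_cutoff - r for r in hits)
    return area / (rank_cutoff * len(set_ranks))


def normalized_enrichment_scores(aucs):
    """NES per motif: (AUC - mean of all AUCs) / standard deviation of AUCs."""
    values = list(aucs.values())
    mu, sigma = mean(values), pstdev(values)
    return {motif: (auc - mu) / sigma for motif, auc in aucs.items()}


# Toy setting: 100 database regions and a deliberately large toy threshold.
n_db_regions = 100
toy_auc_threshold = 0.1
rank_cutoff = int(round(toy_auc_threshold * n_db_regions))

# Hypothetical ranks of the four regions of one region set, per motif.
toy_set_ranks = {
    "motif_A": [0, 1, 3, 40],
    "motif_B": [8, 25, 60, 90],
    "motif_C": [12, 30, 55, 70],
    "motif_D": [9, 20, 45, 80],
}

aucs = {m: recovery_auc(r, rank_cutoff) for m, r in toy_set_ranks.items()}
nes = normalized_enrichment_scores(aucs)

# With only four motifs the NES cannot reach 3.0, so a toy threshold is used.
toy_nes_threshold = 1.0
significant = [m for m, score in nes.items() if score >= toy_nes_threshold]
print(aucs)
print(nes)
print(significant)
```

The point of the sketch is the dependency chain: the AUC only looks at the top of each motif's ranking, and the NES expresses how far a motif's AUC lies from the AUC distribution of all motifs in the database.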

In [4]:
# Load cistarget functions
from pycistarget.motif_enrichment_cistarget import *

In [5]:
# Preload db, you can also just provide the path to the db. Preloading the database is useful if you want to test different parameters. 
# This will take some time depending on the size of your database
db = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/dbs_v2/human_brain.regions_vs_motifs.rankings.feather'
ctx_db = cisTargetDatabase(db, region_sets)   
# Remove dbcorr motifs
ctx_db.db_rankings = ctx_db.db_rankings[~ctx_db.db_rankings.index.str.contains("dbcorr")]

2021-08-12 12:27:28,368 cisTarget    INFO     Reading cisTarget database
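Because the database is loaded together with `region_sets`, the input regions have to be matched to the regions in the database, which is where `fraction_overlap` comes in. The sketch below shows one possible reading of "minimum overlap fraction (in any direction)": the overlap length is divided by the length of each of the two regions, and the pair is kept if either fraction reaches the threshold. The helper names and toy coordinates are hypothetical, and the brute-force comparison is for illustration only.

```python
def overlap_fraction(a, b):
    """Largest overlap fraction between two (chrom, start, end) intervals.

    The overlap length is divided by the length of each interval and the
    larger of the two fractions is returned.
    """
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return 0.0
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    return max(overlap / (end_a - start_a), overlap / (end_b - start_b))


def map_to_database(query_regions, db_regions, fraction_overlap=0.4):
    """Pair each query region with the database regions it overlaps enough."""
    return {
        q: [d for d in db_regions if overlap_fraction(q, d) >= fraction_overlap]
        for q in query_regions
    }


# Hypothetical database regions and input regions.
toy_db_regions = [("chr1", 100, 600), ("chr1", 900, 1400)]
toy_query_regions = [
    ("chr1", 350, 850),   # overlaps half of the first database region
    ("chr1", 550, 1050),  # overlaps both database regions, but too little
    ("chr1", 950, 1050),  # short region fully inside the second one
    ("chr2", 100, 600),   # different chromosome
]

mapping = map_to_database(toy_query_regions, toy_db_regions)
for query, matches in mapping.items():
    print(query, "->", matches)
```

On real data an interval join (for example with PyRanges) would replace the nested comparison used here.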


In [6]:
# Run
cistarget_dict = run_cistarget(ctx_db = ctx_db,
                               region_sets = region_sets,
                               specie = 'homo_sapiens',
                               annotation_version = 'v9nr_clust',
                               path_to_motif_annotations = '/staging/leuven/stg_00002/lcb/cbravo/motif_clustering/RNA_harmony_snn_res_5_clusters/motif_collection_combined_motifs_stamp_and_singlets/hs_annotation.tsv',
                               auc_threshold = 0.005,
                               nes_threshold = 3.0,
                               rank_threshold = 0.05,
                               annotation = ['Direct_annot', 'Orthology_annot'],
                               n_cpu = 1,
                               _temp_dir='/scratch/leuven/313/vsc31305/ray_spill')

2021-08-12 12:40:36,427 cisTarget    INFO     Running cisTarget for Topic1 which has 12394 regions
2021-08-12 12:40:45,803 cisTarget    INFO     Annotating motifs for Topic1
2021-08-12 12:40:47,593 cisTarget    INFO     Getting cistromes for Topic1
2021-08-12 12:40:47,807 cisTarget    INFO     Running cisTarget for Topic2 which has 8915 regions
2021-08-12 12:40:53,352 cisTarget    INFO     Annotating motifs for Topic2
2021-08-12 12:40:54,780 cisTarget    INFO     Getting cistromes for Topic2
2021-08-12 12:40:54,970 cisTarget    INFO     Running cisTarget for Topic3 which has 3593 regions
2021-08-12 12:40:59,393 cisTarget    INFO     Annotating motifs for Topic3
2021-08-12 12:41:00,740 cisTarget    INFO     Getting cistromes for Topic3
2021-08-12 12:41:00,926 cisTarget    INFO     Running cisTarget for Topic4 which has 6046 regions
2021-08-12 12:41:05,573 cisTarget    INFO     Annotating motifs for Topic4
2021-08-12 12:41:06,857 cisTarget    INFO     Getting cistromes for Topic4
2021-08

In [7]:
# Save
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/topics/topic_cistarget_dict.pkl', 'wb') as f:
  pickle.dump(cistarget_dict, f)

We can load the results for exploration. 

In [8]:
# Load
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/topics/topic_cistarget_dict.pkl', 'rb') as infile:
    cistarget_dict = pickle.load(infile)

To visualize motif enrichment results, we can use the `cistarget_results()` function:

In [57]:
cistarget_results(cistarget_dict, name='Topic21')

| Motif | Logo | Region_set | Direct_annot | Orthology_annot | NES | AUC | Rank_at_max | Motif_hits |
|---|---|---|---|---|---|---|---|---|
| cisbp__M0416 | | Topic21 | | EGR2 | 7.788373 | 0.006863 | 21780.0 | 1177 |
| jaspar__MA0491.1 | | Topic21 | JUND | | 7.322481 | 0.006596 | 21692.0 | 1247 |
| cisbp__M4526 | | Topic21 | SMARCC1 | | 7.137253 | 0.006489 | 21786.0 | 1174 |
| taipale_cyt_meth__JUN_NATGACTCATN_FL_meth | | Topic21 | JUN | | 6.721687 | 0.006251 | 21787.0 | 1211 |
| cisbp__M6230 | | Topic21 | FOSL2 | | 6.64767 | 0.006208 | 21672.0 | 1193 |
| hocomoco__FOSL2_MOUSE.H11MO.0.A | | Topic21 | | FOSL2 | 6.621156 | 0.006193 | 21743.0 | 1197 |
| taipale_tf_pairs__BACH1_ATGACTCAT_HT | | Topic21 | BACH1 | | 6.60747 | 0.006185 | 21790.0 | 1224 |
| metacluster_35.5 | | Topic21 | MEF2A, SMARCC1, JUNB, FOSB, FOS, ATF3, SMARCA4, MEF2C, FOSL2, JUN, STAT3, GATA2, MYC, FOSL1, BACH2, JUND, NFE2, RCOR1, BACH1, JDP2 | FOSL1, JUNB, JUND, FOSB, FOS, FOSL2, JUN, JDP2, BATF | 6.499758 | 0.006124 | 21776.0 | 1279 |
| jaspar__MA0303.1 | | Topic21 | | | 6.415368 | 0.006075 | 21783.0 | 1209 |
| flyfactorsurvey__kay_Jra_SANGER_5_FBgn0001291 | | Topic21 | | JUN, JUND, JUNB | 6.281511 | 0.005998 | 21764.0 | 1213 |


This table can also be easily exported to an HTML file:

In [58]:
for x in range(1, len(cistarget_dict)+1):
    out_file = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/topics/Topic'+ str(x) + '.html'
    # Use a context manager so each output file is closed after writing
    with open(out_file, 'w') as f:
        cistarget_dict['Topic'+ str(x)].motif_enrichment.to_html(f, escape=False, col_space=80)

We can also access the cistromes directly. Cistromes with the '\_extended' suffix include regions that contain a motif annotated to the TF by a non-direct annotation (in this case, orthology).

In [59]:
cistarget_dict['Topic21'].cistromes['Region_set'].keys()

dict_keys(['ARID3A_(1009r)', 'HEY1_(89r)', 'FOS_(2206r)', 'MITF_(89r)', 'ARNT2_(89r)', 'EGR3_(2957r)', 'JUN_(2662r)', 'CLOCK_(89r)', 'MAFF_(1009r)', 'CREB3L1_(89r)', 'CREB3_(89r)', 'FOSL1_(2261r)', 'SREBF1_(372r)', 'NPAS2_(89r)', 'MTA3_(1009r)', 'RFXAP_(616r)', 'HEY2_(89r)', 'MEF2A_(1279r)', 'STAT5A_(1016r)', 'IRF4_(1009r)', 'ENO1_(89r)', 'TRIM28_(1941r)', 'ID1_(89r)', 'BCL11A_(1009r)', 'BATF_(1914r)', 'BHLHE40_(89r)', 'RFX3_(616r)', 'EGR4_(1882r)', 'ZBTB7B_(372r)', 'FOSL2_(2102r)', 'STAT3_(2023r)', 'NFE2L2_(1009r)', 'HES6_(89r)', 'OLIG1_(89r)', 'GATA2_(1279r)', 'MYC_(1358r)', 'ATF6B_(89r)', 'BACH2_(1661r)', 'KLF11_(991r)', 'BACH1_(2508r)', 'TFEB_(89r)', 'MNT_(89r)', 'NFIC_(1221r)', 'ATF3_(1661r)', 'MAX_(1123r)', 'SMARCB1_(1009r)', 'EGR1_(2291r)', 'TFE3_(89r)', 'TPPP_(372r)', 'RFX1_(877r)', 'MYCN_(89r)', 'ARNTL_(89r)', 'CEBPB_(1680r)', 'EP300_(1941r)', 'NFE2L1_(1009r)', 'MLXIPL_(89r)', 'RELA_(1009r)', 'GCM2_(699r)', 'NFE2L3_(1009r)', 'NFE2_(2023r)', 'MAF_(1009r)', 'SETDB1_(959r)', 'SMA
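The cistrome keys printed above appear to combine the TF name with the number of regions in the cistrome (for example `FOS_(2206r)`). Assuming this layout, and assuming that extended cistromes carry an `_extended` suffix before the region count, the keys can be parsed to rank cistromes by size. The helper `parse_cistrome_key` and the `FOSL2_extended_(2300r)` key are hypothetical and only serve the illustration.

```python
import re

# Assumed key layout, inferred from the printed keys:
# "<TF>_(<n>r)" or "<TF>_extended_(<n>r)".
CISTROME_KEY = re.compile(
    r"^(?P<tf>.+?)(?P<extended>_extended)?_\((?P<n>\d+)r\)$"
)


def parse_cistrome_key(key):
    """Return (TF name, is_extended, number of regions) for a cistrome key."""
    match = CISTROME_KEY.match(key)
    if match is None:
        raise ValueError(f"Unexpected cistrome key: {key!r}")
    return match["tf"], match["extended"] is not None, int(match["n"])


# The first three keys are taken from the output above; the last one is a
# made-up example of an extended cistrome.
example_keys = [
    "FOS_(2206r)",
    "JUN_(2662r)",
    "HEY1_(89r)",
    "FOSL2_extended_(2300r)",
]

parsed = [parse_cistrome_key(key) for key in example_keys]

# Sort cistromes from largest to smallest.
by_size = sorted(parsed, key=lambda item: item[2], reverse=True)
for tf, is_extended, n_regions in by_size:
    label = "extended" if is_extended else "direct"
    print(f"{tf}\t{label}\t{n_regions}")
```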

[[Back to top]](#top)

<a class="anchor" id="6"></a>
## 2. Deriving cistromes using DARs

<a class="anchor" id="7"></a>
### A. Loading your region sets

**pycisTarget** takes as input a dictionary with the region set names as keys and the regions (as PyRanges objects) as values. We will start by loading the DARs (see pycisTopic - Single sample workflow tutorial).

In [68]:
# Load the DARs (a dictionary with one region table per group)
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DARs/DARs.pkl', 'rb') as infile:
    DARs_dict = pickle.load(infile)

In [69]:
import pyranges as pr
from pycistarget.utils import *
region_sets = {key: pr.PyRanges(region_names_to_coordinates(DARs_dict[key].index.tolist())) for key in DARs_dict.keys()}

<a class="anchor" id="8"></a>
### B. cisTarget

We run cisTarget with the same settings as before, using the previously created database.

In [70]:
# Load cistarget functions
from pycistarget.motif_enrichment_cistarget import *

In [71]:
# Preload db, you can also just provide the path to the db. Preloading the database is useful if you want to test different parameters. 
# This will take some time depending on the size of your database
db = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/dbs/human_brain.regions_vs_motifs.rankings.feather'
ctx_db = cisTargetDatabase(db, region_sets)   
# Remove dbcorr motifs
ctx_db.db_rankings = ctx_db.db_rankings[~ctx_db.db_rankings.index.str.contains("dbcorr")]   

2021-08-12 16:51:03,626 cisTarget    INFO     Reading cisTarget database


In [72]:
# Run
cistarget_dict = run_cistarget(ctx_db = ctx_db,
                               region_sets = region_sets,
                               specie = 'homo_sapiens',
                               annotation_version = 'v9nr_clust',
                               path_to_motif_annotations = '/staging/leuven/stg_00002/lcb/cbravo/motif_clustering/RNA_harmony_snn_res_5_clusters/motif_collection_combined_motifs_stamp_and_singlets/hs_annotation.tsv',
                               auc_threshold = 0.005,
                               nes_threshold = 3.0,
                               rank_threshold = 0.05,
                               annotation = ['Direct_annot', 'Orthology_annot'],
                               n_cpu = 1,
                               _temp_dir='/scratch/leuven/313/vsc31305/ray_spill')

2021-08-12 17:25:05,575 cisTarget    INFO     Running cisTarget for AST which has 32163 regions
2021-08-12 17:25:24,243 cisTarget    INFO     Annotating motifs for AST
2021-08-12 17:25:28,992 cisTarget    INFO     Getting cistromes for AST
2021-08-12 17:25:29,750 cisTarget    INFO     Running cisTarget for BG which has 32515 regions
2021-08-12 17:25:49,379 cisTarget    INFO     Annotating motifs for BG
2021-08-12 17:25:54,103 cisTarget    INFO     Getting cistromes for BG
2021-08-12 17:25:54,859 cisTarget    INFO     Running cisTarget for COP which has 23169 regions
2021-08-12 17:26:06,990 cisTarget    INFO     Annotating motifs for COP
2021-08-12 17:26:09,445 cisTarget    INFO     Getting cistromes for COP
2021-08-12 17:26:09,715 cisTarget    INFO     Running cisTarget for ENDO which has 15280 regions
2021-08-12 17:26:18,714 cisTarget    INFO     Annotating motifs for ENDO
2021-08-12 17:26:20,518 cisTarget    INFO     Getting cistromes for ENDO
2021-08-12 17:26:20,807 cisTarget    INF

In [73]:
# Save
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/DARs/DARs_cistarget_dict.pkl', 'wb') as f:
  pickle.dump(cistarget_dict, f)

We can load the results for exploration. 

In [74]:
# Load
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/DARs/DARs_cistarget_dict.pkl', 'rb') as infile:
    cistarget_dict = pickle.load(infile)

To visualize motif enrichment results, we can use the `cistarget_results()` function:

In [96]:
cistarget_results(cistarget_dict, name='COP')

| Motif | Logo | Region_set | Direct_annot | Orthology_annot | NES | AUC | Rank_at_max | Motif_hits |
|---|---|---|---|---|---|---|---|---|
| metacluster_46.3 | | COP | SOX10, SOX8 | | 18.620411 | 0.014769 | 21274.0 | 3051 |
| hocomoco__SOX9_HUMAN.H11MO.0.B | | COP | SOX9 | | 16.817214 | 0.013573 | 21643.0 | 3073 |
| metacluster_46.1 | | COP | SOX11, SOX3, SRY, SOX17, SOX13, SOX1, SOX9, SOX15, SOX18, SOX21, SOX2, SOX30, SOX6, SOX8, SOX10, SOX5, SOX7, SOX4, HBP1 | SOX14, SOX21, SOX17 | 16.407962 | 0.013302 | 21771.0 | 3283 |
| metacluster_2.93 | | COP | SOX9, SOX18, SOX4, SOX17 | | 14.233799 | 0.01186 | 21777.0 | 2999 |
| transfac_pro__M08972 | | COP | SOX17 | | 13.963742 | 0.01168 | 21789.0 | 2779 |
| factorbook__SOX2 | | COP | SOX2 | | 13.838165 | 0.011597 | 21708.0 | 2624 |
| homer__CCATTGTTNY_Sox6 | | COP | | SOX6 | 13.176417 | 0.011158 | 21781.0 | 2932 |
| cisbp__M5208 | | COP | | SOX18, SOX7, SOX17 | 12.302875 | 0.010579 | 21673.0 | 2650 |
| transfac_pro__M01308 | | COP | SOX4 | | 12.171386 | 0.010492 | 21786.0 | 2669 |
| cisbp__M1904 | | COP | SOX9 | | 11.924563 | 0.010328 | 21790.0 | 2628 |


In [94]:
for key in cistarget_dict.keys():
    out_file = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/DARs/cistarget/'+ str(key) + '.html'
    # Use a context manager so each output file is closed after writing
    with open(out_file, 'w') as f:
        cistarget_dict[key].motif_enrichment.to_html(f, escape=False, col_space=80)