forked from ykai16/diHMM

Improved implementation of diHMM in C++


gcyuan/diHMM-cpp

 
 


diHMM (C++/Python version)

Note:

This is the updated version of the diHMM model (Marco et al. 2017, Nat Commun). The original package was written in MATLAB and can be accessed at gcyuan/diHMM. This updated version improves computational efficiency in two major ways: (1) the code is implemented in C++ with a Python wrapper to increase computational speed; (2) an ensemble approach aggregates information from multiple models, each trained on a different sample. The details of these changes are described in Kai et al. 2020, bioRxiv.

Overview

diHMM stands for Hierarchical Hidden Markov Model. diHMM is a computational method for finding chromatin states at multiple scales. The model takes as input a multidimensional set of histone modifications for several cell types and classifies the genome into a preselected number of nucleosome-level and domain-level hidden states.

The diHMM model was originally developed by Eugenio Marco, with assistance from Wouter Meuleman, Jialiang Huang, Luca Pinello, Manolis Kellis and Guo-Cheng Yuan. The method was originally implemented in MATLAB. The code and sample data are available at (https://github.com/gcyuan/diHMM).

In this updated version, computational efficiency is significantly improved by implementing the method in C++. Additional improvement is achieved by using an ensemble clustering approach. The development of this newer version was led by Stephanos Tsoucas and Yan Kai with assistance from Shengbao Suo, Xuan Cao, and Guo-Cheng Yuan.

References: Marco E*, Meuleman W*, Huang J*, Glass K, Pinello L, Wang J, Kellis M†, Yuan GC†. Multi-scale chromatin state annotation using a hierarchical hidden Markov model. Nature Commun. 2017 Apr 7;8:15011. (https://www.nature.com/articles/ncomms15011).

Kai Y, Tsoucas S, Suo S, Yuan GC. Multi-scale annotations of chromatin states in 127 human cell-types. bioRxiv. (https://www.biorxiv.org/content/10.1101/2020.12.22.424078v1).

Annotation of 127 human cell types

We applied diHMM v1.0 to generate multi-scale chromatin state annotations for the 127 human reference epigenomes in the Roadmap and ENCODE consortia. Detailed information on the 127 epigenomes can be found at Roadmap Epigenomics Consortium and Kundaje et al., or at this site.

A first impression of diHMM states

We generated the chromatin state maps at the nucleosome level (200 bp resolution) and the domain level (4 kb resolution). A snapshot of diHMM states across the 127 epigenomes is shown below.

example
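The two resolutions above imply a simple nesting of bins. The following is a minimal sketch of that arithmetic, assuming fixed-width bins that start at position 0 of each chromosome; the function and constant names are hypothetical and are not part of the diHMM code.

```python
# Resolutions stated in the text above.
NUCLEOSOME_BIN_BP = 200   # nucleosome-level bin width in base pairs
DOMAIN_BIN_BP = 4000      # domain-level bin width in base pairs (4 kb)

# Number of nucleosome-level bins spanned by one domain-level bin.
BINS_PER_DOMAIN = DOMAIN_BIN_BP // NUCLEOSOME_BIN_BP


def bin_indices(position_bp):
    """Return (nucleosome_bin, domain_bin) for a 0-based genomic position.

    Assumes bins are fixed-width windows aligned to the chromosome start
    (an assumption made for illustration).
    """
    return position_bp // NUCLEOSOME_BIN_BP, position_bp // DOMAIN_BIN_BP


print(BINS_PER_DOMAIN)      # 20
print(bin_indices(10250))   # (51, 2)
```

Under these assumptions, each 4 kb domain-level bin covers 20 consecutive 200 bp nucleosome-level bins.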

Accessing the multi-scale chromatin state maps in the 127 epigenomes

Our multi-scale annotations can be freely downloaded from the table below. After unzipping, the maps can be loaded directly into genome browsers (e.g., IGV) for visualization. Please note that our state annotations are based on the hg19 reference genome.
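Beyond browser visualization, a downloaded map can also be summarized programmatically. The sketch below assumes the unzipped maps are BED-like, tab-delimited files with the state label in the fourth column; that layout, the function name, and the state labels `N1` and `N2` are assumptions for illustration and should be checked against the actual files.

```python
from collections import defaultdict


def state_coverage(bed_lines):
    """Sum the number of covered base pairs per state label.

    Expects BED-like lines: chrom, start, end, state (tab-delimited).
    This column layout is an assumption, not a documented format.
    """
    coverage = defaultdict(int)
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common browser/track header lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split("\t")
        start, end, state = int(fields[1]), int(fields[2]), fields[3]
        coverage[state] += end - start
    return dict(coverage)


# Hypothetical records at 200 bp resolution; real state names may differ.
example = [
    "chr1\t0\t200\tN1",
    "chr1\t200\t600\tN2",
    "chr1\t600\t800\tN1",
]
print(state_coverage(example))
```

For a real file, the same function can be called on an open file handle, since it only iterates over lines.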

To see full meta information about each reference epigenome, please visit here.

| Epigenome ID (EID) | Nucleosome | Domain | Group | Standardized epigenome name | Anatomy |
|---|---|---|---|---|---|
| E017 | download | download | IMR90 | IMR90 fetal lung fibroblasts Cell Line | LUNG |
| E002 | download | download | ESC | ES-WA7 Cells | ESC |
| E008 | download | download | ESC | H9 Cells | ESC |
| E001 | download | download | ESC | ES-I3 Cells | ESC |
| E015 | download | download | ESC | HUES6 Cells | ESC |
| E014 | download | download | ESC | HUES48 Cells | ESC |
| E016 | download | download | ESC | HUES64 Cells | ESC |
| E003 | download | download | ESC | H1 Cells | ESC |
| E024 | download | download | ESC | ES-UCSF4 Cells | ESC |
| E020 | download | download | iPSC | iPS-20b Cells | IPSC |
| E019 | download | download | iPSC | iPS-18 Cells | IPSC |
| E018 | download | download | iPSC | iPS-15b Cells | IPSC |
| E021 | download | download | iPSC | iPS DF 6.9 Cells | IPSC |
| E022 | download | download | iPSC | iPS DF 19.11 Cells | IPSC |
| E007 | download | download | ES-deriv | H1 Derived Neuronal Progenitor Cultured Cells | ESC_DERIVED |
| E009 | download | download | ES-deriv | H9 Derived Neuronal Progenitor Cultured Cells | ESC_DERIVED |
| E010 | download | download | ES-deriv | H9 Derived Neuron Cultured Cells | ESC_DERIVED |
| E013 | download | download | ES-deriv | hESC Derived CD56+ Mesoderm Cultured Cells | ESC_DERIVED |
| E012 | download | download | ES-deriv | hESC Derived CD56+ Ectoderm Cultured Cells | ESC_DERIVED |
| E011 | download | download | ES-deriv | hESC Derived CD184+ Endoderm Cultured Cells | ESC_DERIVED |
| E004 | download | download | ES-deriv | H1 BMP4 Derived Mesendoderm Cultured Cells | ESC_DERIVED |
| E005 | download | download | ES-deriv | H1 BMP4 Derived Trophoblast Cultured Cells | ESC_DERIVED |
| E006 | download | download | ES-deriv | H1 Derived Mesenchymal Stem Cells | ESC_DERIVED |
| E062 | download | download | Blood & T-cell | Primary mononuclear cells from peripheral blood | BLOOD |
| E034 | download | download | Blood & T-cell | Primary T cells from peripheral blood | BLOOD |
| E045 | download | download | Blood & T-cell | Primary T cells effector/memory enriched from peripheral blood | BLOOD |
| E033 | download | download | Blood & T-cell | Primary T cells from cord blood | BLOOD |
| E044 | download | download | Blood & T-cell | Primary T regulatory cells from peripheral blood | BLOOD |
| E043 | download | download | Blood & T-cell | Primary T helper cells from peripheral blood | BLOOD |
| E039 | download | download | Blood & T-cell | Primary T helper naive cells from peripheral blood | BLOOD |
| E041 | download | download | Blood & T-cell | Primary T helper cells PMA-I stimulated | BLOOD |
| E042 | download | download | Blood & T-cell | Primary T helper 17 cells PMA-I stimulated | BLOOD |
| E040 | download | download | Blood & T-cell | Primary T helper memory cells from peripheral blood 1 | BLOOD |
| E037 | download | download | Blood & T-cell | Primary T helper memory cells from peripheral blood 2 | BLOOD |
| E048 | download | download | Blood & T-cell | Primary T CD8+ memory cells from peripheral blood | BLOOD |
| E038 | download | download | Blood & T-cell | Primary T helper naive cells from peripheral blood | BLOOD |
| E047 | download | download | Blood & T-cell | Primary T CD8+ naive cells from peripheral blood | BLOOD |
| E029 | download | download | HSC & B-cell | Primary monocytes from peripheral blood | BLOOD |
| E031 | download | download | HSC & B-cell | Primary B cells from cord blood | BLOOD |
| E035 | download | download | HSC & B-cell | Primary hematopoietic stem cells | BLOOD |
| E051 | download | download | HSC & B-cell | Primary hematopoietic stem cells G-CSF-mobilized Male | BLOOD |
| E050 | download | download | HSC & B-cell | Primary hematopoietic stem cells G-CSF-mobilized Female | BLOOD |
| E036 | download | download | HSC & B-cell | Primary hematopoietic stem cells short term culture | BLOOD |
| E032 | download | download | HSC & B-cell | Primary B cells from peripheral blood | BLOOD |
| E046 | download | download | HSC & B-cell | Primary Natural Killer cells from peripheral blood | BLOOD |
| E030 | download | download | HSC & B-cell | Primary neutrophils from peripheral blood | BLOOD |
| E026 | download | download | Mesench | Bone Marrow Derived Cultured Mesenchymal Stem Cells | STROMAL_CONNECTIVE |
| E049 | download | download | Mesench | Mesenchymal Stem Cell Derived Chondrocyte Cultured Cells | STROMAL_CONNECTIVE |
| E025 | download | download | Mesench | Adipose Derived Mesenchymal Stem Cell Cultured Cells | FAT |
| E023 | download | download | Mesench | Mesenchymal Stem Cell Derived Adipocyte Cultured Cells | FAT |
| E052 | download | download | Myosat | Muscle Satellite Cultured Cells | MUSCLE |
| E055 | download | download | Epithelial | Foreskin Fibroblast Primary Cells skin01 | SKIN |
| E056 | download | download | Epithelial | Foreskin Fibroblast Primary Cells skin02 | SKIN |
| E059 | download | download | Epithelial | Foreskin Melanocyte Primary Cells skin01 | SKIN |
| E061 | download | download | Epithelial | Foreskin Melanocyte Primary Cells skin03 | SKIN |
| E057 | download | download | Epithelial | Foreskin Keratinocyte Primary Cells skin02 | SKIN |
| E058 | download | download | Epithelial | Foreskin Keratinocyte Primary Cells skin03 | SKIN |
| E028 | download | download | Epithelial | Breast variant Human Mammary Epithelial Cells (vHMEC) | BREAST |
| E027 | download | download | Epithelial | Breast Myoepithelial Primary Cells | BREAST |
| E054 | download | download | Neurosph | Ganglion Eminence derived primary cultured neurospheres | BRAIN |
| E053 | download | download | Neurosph | Cortex derived primary cultured neurospheres | BRAIN |
| E112 | download | download | Thymus | Thymus | THYMUS |
| E093 | download | download | Thymus | Fetal Thymus | THYMUS |
| E071 | download | download | Brain | Brain Hippocampus Middle | BRAIN |
| E074 | download | download | Brain | Brain Substantia Nigra | BRAIN |
| E068 | download | download | Brain | Brain Anterior Caudate | BRAIN |
| E069 | download | download | Brain | Brain Cingulate Gyrus | BRAIN |
| E072 | download | download | Brain | Brain Inferior Temporal Lobe | BRAIN |
| E067 | download | download | Brain | Brain Angular Gyrus | BRAIN |
| E073 | download | download | Brain | Brain_Dorsolateral_Prefrontal_Cortex | BRAIN |
| E070 | download | download | Brain | Brain Germinal Matrix | BRAIN |
| E082 | download | download | Brain | Fetal Brain Female | BRAIN |
| E081 | download | download | Brain | Fetal Brain Male | BRAIN |
| E063 | download | download | Adipose | Adipose Nuclei | FAT |
| E100 | download | download | Muscle | Psoas Muscle | MUSCLE |
| E108 | download | download | Muscle | Skeletal Muscle Female | MUSCLE |
| E107 | download | download | Muscle | Skeletal Muscle Male | MUSCLE |
| E089 | download | download | Muscle | Fetal Muscle Trunk | MUSCLE |
| E090 | download | download | Muscle | Fetal Muscle Leg | MUSCLE_LEG |
| E083 | download | download | Heart | Fetal Heart | HEART |
| E104 | download | download | Heart | Right Atrium | HEART |
| E095 | download | download | Heart | Left Ventricle | HEART |
| E105 | download | download | Heart | Right Ventricle | HEART |
| E065 | download | download | Heart | Aorta | VASCULAR |
| E078 | download | download | Sm. Muscle | Duodenum Smooth Muscle | GI_DUODENUM |
| E076 | download | download | Sm. Muscle | Colon Smooth Muscle | GI_COLON |
| E103 | download | download | Sm. Muscle | Rectal Smooth Muscle | GI_RECTUM |
| E111 | download | download | Sm. Muscle | Stomach Smooth Muscle | GI_STOMACH |
| E092 | download | download | Digestive | Fetal Stomach | GI_STOMACH |
| E085 | download | download | Digestive | Fetal Intestine Small | GI_INTESTINE |
| E084 | download | download | Digestive | Fetal Intestine Large | GI_INTESTINE |
| E109 | download | download | Digestive | Small Intestine | GI_INTESTINE |
| E106 | download | download | Digestive | Sigmoid Colon | GI_COLON |
| E075 | download | download | Digestive | Colonic Mucosa | GI_COLON |
| E101 | download | download | Digestive | Rectal Mucosa Donor 29 | GI_RECTUM |
| E102 | download | download | Digestive | Rectal Mucosa Donor 31 | GI_RECTUM |
| E110 | download | download | Digestive | Stomach Mucosa | GI_STOMACH |
| E077 | download | download | Digestive | Duodenum Mucosa | GI_DUODENUM |
| E079 | download | download | Digestive | Esophagus | GI_ESOPHAGUS |
| E094 | download | download | Digestive | Gastric | GI_STOMACH |
| E099 | download | download | Other | Placenta Amnion | PLACENTA |
| E086 | download | download | Other | Fetal Kidney | KIDNEY |
| E088 | download | download | Other | Fetal Lung | LUNG |
| E097 | download | download | Other | Ovary | OVARY |
| E087 | download | download | Other | Pancreatic Islets | PANCREAS |
| E080 | download | download | Other | Fetal Adrenal Gland | ADRENAL |
| E091 | download | download | Other | Placenta | PLACENTA |
| E066 | download | download | Other | Liver | LIVER |
| E098 | download | download | Other | Pancreas | PANCREAS |
| E096 | download | download | Other | Lung | LUNG |
| E113 | download | download | Other | Spleen | SPLEEN |
| E114 | download | download | ENCODE2012 | A549 EtOH 0.02pct Lung Carcinoma Cell Line | LUNG |
| E115 | download | download | ENCODE2012 | Dnd41 TCell Leukemia Cell Line | BLOOD |
| E116 | download | download | ENCODE2012 | GM12878 Lymphoblastoid Cells | BLOOD |
| E117 | download | download | ENCODE2012 | HeLa-S3 Cervical Carcinoma Cell Line | CERVIX |
| E118 | download | download | ENCODE2012 | HepG2 Hepatocellular Carcinoma Cell Line | LIVER |
| E119 | download | download | ENCODE2012 | HMEC Mammary Epithelial Primary Cells | BREAST |
| E120 | download | download | ENCODE2012 | HSMM Skeletal Muscle Myoblasts Cells | MUSCLE |
| E121 | download | download | ENCODE2012 | HSMM cell derived Skeletal Muscle Myotubes Cells | MUSCLE |
| E122 | download | download | ENCODE2012 | HUVEC Umbilical Vein Endothelial Primary Cells | VASCULAR |
| E123 | download | download | ENCODE2012 | K562 Leukemia Cells | BLOOD |
| E124 | download | download | ENCODE2012 | Monocytes-CD14+ RO01746 Primary Cells | BLOOD |
| E125 | download | download | ENCODE2012 | NH-A Astrocytes Primary Cells | BRAIN |
| E126 | download | download | ENCODE2012 | NHDF-Ad Adult Dermal Fibroblast Primary Cells | SKIN |
| E127 | download | download | ENCODE2012 | NHEK-Epidermal Keratinocyte Primary Cells | SKIN |
| E128 | download | download | ENCODE2012 | NHLF Lung Fibroblast Primary Cells | LUNG |
| E129 | download | download | ENCODE2012 | Osteoblast Primary Cells | BONE |

These maps can be freely downloaded from here.

Installation

  1. Create a Conda environment (Python version: 2.7)
conda create -y -n dihmm python=2.7
conda activate dihmm
  2. Download diHMM-cpp
git clone https://github.com/gcyuan/diHMM-cpp.git
  3. Build: go into the build directory and run cmake, then make
cd diHMM-cpp/build
cmake ..
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /sc/arion/projects/YuanLab/gcproj/xuan/anaconda3/envs/dihmm/bin/x86_64-conda-linux-gnu-cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /sc/arion/projects/YuanLab/gcproj/xuan/anaconda3/envs/dihmm/bin/x86_64-conda-linux-gnu-c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Armadillo: /sc/arion/projects/YuanLab/gcproj/xuan/anaconda3/envs/dihmm/include (found version "11.2.0")
-- Found PythonLibs: /sc/arion/projects/YuanLab/gcproj/xuan/anaconda3/envs/dihmm/lib/libpython2.7.so (found suitable version "2.7.18", minimum required is "2.7")
-- Found Boost: /sc/arion/projects/YuanLab/gcproj/xuan/anaconda3/envs/dihmm/lib/cmake/Boost-1.72.0/BoostConfig.cmake (found version "1.72.0") found components: python numpy filesystem
-- Looking for sgemm_
-- Looking for sgemm_ - found
-- Found BLAS: /sc/arion/projects/YuanLab/gcproj/xuan/anaconda3/envs/dihmm/lib/libopenblas.so
-- Configuring done
-- Generating done
-- Build files have been written to: /sc/arion/projects/YuanLab/gcproj/xuan/dihmm-cpp/build
make
Consolidate compiler generated dependencies of target dihmm
[ 14%] Building CXX object CMakeFiles/dihmm.dir/Model.cpp.o
[ 28%] Building CXX object CMakeFiles/dihmm.dir/Emissions.cpp.o
[ 42%] Building CXX object CMakeFiles/dihmm.dir/Forward_Backward.cpp.o
[ 57%] Linking CXX shared library libdihmm.so
[ 71%] Built target dihmm
Consolidate compiler generated dependencies of target dihmm_ext
[ 85%] Building CXX object CMakeFiles/dihmm_ext.dir/dihmm_ext.cpp.o
[100%] Linking CXX shared library dihmm_ext.so
[100%] Built target dihmm_ext
  4. Install the dependencies in your environment

bedtools, wigToBigWig, fetchChromSizes, and bigWigToBedGraph are required. After downloading the UCSC binaries, make them executable (chmod +x) and place them in a directory on your PATH.

conda install -c bioconda bedtools
wget https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64.v369/wigToBigWig
wget https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64.v369/fetchChromSizes
wget https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64.v369/bigWigToBedGraph

  5. Set the Python path
export PYTHONPATH=${your_dihmm_dir}/diHMM-cpp/build

You can then open a Python shell in the same directory and run:

>>> import dihmm_ext
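
If the import fails, it can help to confirm that the built library is actually on `PYTHONPATH`. The following is a minimal sketch, assuming the extension is built as `dihmm_ext.so` (the name shown in the make output above); the helper name `find_extension` is hypothetical and not part of diHMM-cpp.

```python
import os


def find_extension(module_file="dihmm_ext.so", pythonpath=None):
    """Return the first PYTHONPATH directory containing module_file, or None.

    module_file defaults to the library name printed by `make`; the
    pythonpath argument exists only to make the check easy to test.
    """
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")
    for directory in pythonpath.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, module_file)):
            return directory
    return None


location = find_extension()
if location is None:
    print("dihmm_ext.so not found on PYTHONPATH; check the export step")
else:
    print("dihmm_ext.so found in " + location)
```

This only checks that the file exists; a successful `import dihmm_ext` additionally requires that the Python version and linked libraries match those used at build time.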

Training a model

A diHMM model can be trained with the script Train_diHMM.py, after making the necessary changes to the input data path and other parameters.

Training also applies the diHMM model for chromatin state annotation. Annotation is done by the script annotation.py, which is used in Train_diHMM.py.

python dihmm-cpp/Train_diHMM.py -h
usage: Train_diHMM.py [-h] -i INPUT_DIR --clusters CLUSTERS --chroms CHROMS
                         -o OUT_DIR [--n_bin_states N_BIN_STATES]
                         [--n_domain_states N_DOMAIN_STATES]
                         [--domain_size DOMAIN_SIZE] [--tolerance TOLERANCE]
                         [--max_iter MAX_ITER] [--bin_res BIN_RES]

Train diHMM runner.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_DIR          The input binarized files dir. File name:
                        X1_chr1_binary.txt.
  --clusters CLUSTERS   Clusters/cell_types names used to train model.
                        Example: X1,X2 .
  --chroms CHROMS       chrs used to train model. Example: chr1,chr2 .
  -o OUT_DIR            Output dir.
  --n_bin_states N_BIN_STATES
                        Number of bin states. Default=2.
  --n_domain_states N_DOMAIN_STATES
                        Number of domain states. Default=4.
  --domain_size DOMAIN_SIZE
                        Number of domain size. Default=8.
  --tolerance TOLERANCE
                        Number of bin states. Default=1e-6.
  --max_iter MAX_ITER   Max iter number. Default=500.
  --bin_res BIN_RES     bin length used to generate binarized files.
                        Default=500.
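
The options above can be assembled into a command programmatically. The sketch below is hedged: it assumes input files follow the `<cluster>_<chrom>_binary.txt` pattern inferred from the single example `X1_chr1_binary.txt` in the help text, and the helper names are hypothetical. Only flags listed in the help output are used.

```python
def expected_input_files(clusters, chroms):
    """List the binarized file names expected in INPUT_DIR.

    The naming pattern is inferred from the example X1_chr1_binary.txt
    in the help text and is an assumption.
    """
    return ["{0}_{1}_binary.txt".format(cluster, chrom)
            for cluster in clusters for chrom in chroms]


def build_train_command(input_dir, out_dir, clusters, chroms, **options):
    """Build the argument list for Train_diHMM.py.

    Extra keyword arguments map to the optional flags from the help
    output, e.g. n_bin_states=2 becomes --n_bin_states 2.
    """
    cmd = ["python", "dihmm-cpp/Train_diHMM.py",
           "-i", input_dir,
           "--clusters", ",".join(clusters),
           "--chroms", ",".join(chroms),
           "-o", out_dir]
    for name in sorted(options):
        cmd += ["--" + name, str(options[name])]
    return cmd


# Hypothetical directories; the numeric values are the documented defaults.
cmd = build_train_command("binarized", "results", ["X1", "X2"], ["chr1", "chr2"],
                          n_bin_states=2, n_domain_states=4)
print(" ".join(cmd))
print(expected_input_files(["X1", "X2"], ["chr1", "chr2"]))
```

The resulting list can be passed to `subprocess.call` once the expected input files are confirmed to exist.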

Tutorial

Here is the tutorial for applying diHMM-cpp to H3K4me3 data in hESC H1 cells.

Visualizing the results in a genome browser

example
