# Motif analysis

This tutorial describes how to visualize and analyze motifs learnt by DeepCpG, i.e. the filters of the first convolutional layer of the DNA module.

We will first compute the activations (occurrence frequencies) of motifs in sequence windows, and then extract and align the sequence fragments that maximally activate each motif. Finally, the resulting alignments are visualized with `weblogo` and compared to known motifs with `Tomtom`.
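
To make the notion of a motif activation concrete, here is a minimal sketch of scanning a single convolutional filter over a one-hot encoded sequence window. It is not DeepCpG's implementation: the ReLU activation, the function names, and the toy filter weights are all assumptions chosen for illustration.

```python
# Sketch: scan one convolutional filter over a one-hot encoded DNA window.
# The filter weights are made up for illustration and are NOT taken from DeepCpG.

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values; unknown bases become all zeros."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def filter_activations(seq, weights, bias=0.0):
    """Return the ReLU activation of the filter at every valid offset of seq."""
    encoded = one_hot(seq)
    flen = len(weights)
    activations = []
    for start in range(len(encoded) - flen + 1):
        total = bias
        for offset in range(flen):
            for channel in range(4):
                total += weights[offset][channel] * encoded[start + offset][channel]
        activations.append(max(0.0, total))  # ReLU (assumed activation function)
    return activations


# Hypothetical length-3 filter that prefers the trinucleotide "CGC".
toy_filter = [
    [-1.0, 1.0, -1.0, -1.0],  # position 0 prefers C
    [-1.0, -1.0, 1.0, -1.0],  # position 1 prefers G
    [-1.0, 1.0, -1.0, -1.0],  # position 2 prefers C
]

acts = filter_activations("ATCGCGT", toy_filter)
best = max(range(len(acts)), key=acts.__getitem__)
print(acts)  # [0.0, 0.0, 3.0, 0.0, 1.0]
print("strongest activation at offset", best)
```

The fragment starting at the offset with the strongest activation (`CGC` in this toy case) is the kind of sequence fragment that is later collected and aligned for each motif.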

## Requirements 

[Tomtom](http://web.mit.edu/meme_v4.9.0/doc/tomtom.html) is required for comparing motifs and is part of the [MEME-Suite](http://meme-suite.org/), which can be downloaded [here](http://meme-suite.org/doc/download.html).

[WebLogo3](http://weblogo.threeplusone.com/) is required for visualizing motifs and can be installed with `pip`:

```bash
pip install weblogo
```

## Variables

We first initialize some variables that will be used throughout the tutorial.

In [11]:
data_dir="../data"
motifs_dir="./motifs"
mkdir -p $motifs_dir



## Downloading the DeepCpG model

We will analyze the filters of the DeepCpG DNA module that was trained on serum cells from *Smallwood et al. (2014)* and is described in the DeepCpG publication. This model can be downloaded with `dcpg_download.py`:

In [2]:
cmd="dcpg_download.py
  Smallwood2014_serum_dna
  --out_dir $motifs_dir/model
  "
eval $cmd

INFO (2017-01-31 21:24:32,210): Downloading model ...
INFO (2017-01-31 21:24:32,211): Model URL: http://www.ebi.ac.uk/~angermue/deepcpg/alias/1754b5bbc21a8257663acc52e657f69c
--2017-01-31 21:24:32--  http://www.ebi.ac.uk/~angermue/deepcpg/alias/1754b5bbc21a8257663acc52e657f69c
Resolving www.ebi.ac.uk (www.ebi.ac.uk)... 193.62.193.80
Connecting to www.ebi.ac.uk (www.ebi.ac.uk)|193.62.193.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 124853155 (119M) [text/plain]
Saving to: ‘./motifs/model/model.zip’


2017-01-31 21:24:41 (13.2 MB/s) - ‘./motifs/model/model.zip’ saved [124853155/124853155]

Archive:  ./motifs/model/model.zip
  inflating: ./motifs/model/model.h5  
  inflating: ./motifs/model/model.json  
  inflating: ./motifs/model/model_weights.h5  
  inflating: ./motifs/model/model_weights_train.h5  
  inflating: ./motifs/model/model_weights_val.h5  
INFO (2017-01-31 21:24:43,610): Done!


## Creating DeepCpG data files

Next, we use `dcpg_data.py` to extract sequence windows for computing filter activations. `--nb_sample` specifies the number of sequence windows, which can be reduced to decrease compute costs:

In [3]:
dcpg_data="./data"
cmd="dcpg_data.py
    --cpg_profiles $data_dir/cpg/*.tsv
    --dna_files $data_dir/dna/mm10
    --out_dir $dcpg_data
    --dna_wlen 1001
    --nb_sample 25000
    "
eval $cmd

INFO (2017-01-31 21:24:48,992): Reading single-cell profiles ...
INFO (2017-01-31 21:24:49,772): 25000 samples
INFO (2017-01-31 21:24:49,774): --------------------------------------------------------------------------------
INFO (2017-01-31 21:24:49,774): Chromosome 1 ...
INFO (2017-01-31 21:24:49,855): 25000 / 25000 (100.0%) sites matched minimum coverage filter
INFO (2017-01-31 21:24:54,166): Chunk 	1 / 1
INFO (2017-01-31 21:24:54,209): Extracting DNA sequence windows ...
INFO (2017-01-31 21:24:59,138): Done!
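
The `--dna_wlen 1001` option above asks for windows of 1001 bp. As a rough illustration of what a centered sequence window is, the following sketch extracts an odd-length window around a given position. The function name, the 0-based indexing, and the padding with `N` at chromosome ends are assumptions for illustration; `dcpg_data.py` may handle positions and boundaries differently.

```python
# Sketch: extract an odd-length window centered on a position of interest.
# Padding with "N" at the chromosome ends is an assumption, not DeepCpG behavior.


def extract_window(chromosome, center, wlen, pad="N"):
    """Return the window of odd length wlen centered on 0-based index center."""
    if wlen % 2 == 0:
        raise ValueError("wlen must be odd so that the window has a center")
    half = wlen // 2
    start = center - half
    end = center + half + 1
    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(chromosome))
    core = chromosome[max(0, start):min(len(chromosome), end)]
    return left_pad + core + right_pad


toy_chromosome = "TTACGGATCGAA"
window = extract_window(toy_chromosome, center=3, wlen=5)
print(window)  # TACGG, with the C of the CpG site in the middle
```

With `wlen=1001`, each window would contain 500 bp on either side of the central position.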


## Computing filter activations 

Now we use `dcpg_filter_act.py` to compute the activations of filters in the sequence windows. The `--store_inputs` option stores the input sequence windows in the output file.

In [5]:
cmd="dcpg_filter_act.py
    $dcpg_data/*.h5
    --model_files ${motifs_dir}/model
    --out_file ${motifs_dir}/activations.h5
    --store_inputs
    "
eval $cmd

Using TensorFlow backend.
INFO (2017-01-31 21:25:09,298): Loading model ...
INFO (2017-01-31 21:25:09,298): Using model files ./motifs/model/model.json ./motifs/model/model_weights.h5
INFO (2017-01-31 21:25:10,070): Using activation layer "dna/activation_1"
INFO (2017-01-31 21:25:10,070): Using weight layer "dna/convolution1d_1"
INFO (2017-01-31 21:25:10,071): Reading data ...
INFO (2017-01-31 21:25:10,083): Computing activations
INFO (2017-01-31 21:25:10,099):   128/25000 (0.5%)
INFO (2017-01-31 21:26:46,539):  2688/25000 (10.8%)
INFO (2017-01-31 21:28:21,317):  5248/25000 (21.0%)
INFO (2017-01-31 21:29:56,056):  7808/25000 (31.2%)
INFO (2017-01-31 21:31:37,357): 10368/25000 (41.5%)
INFO (2017-01-31 21:33:13,737): 12928/25000 (51.7%)
INFO (2017-01-31 21:34:49,494): 15488/25000 (62.0%)
INFO (2017-01-31 21:36:29,782): 18048/25000 (72.2%)
INFO (2017-01-31 21:38:09,162): 20608/25000 (82.4%)
INFO (2017-01-31 21:39:50,073): 23168/25000 (92.7%)
INFO (2017-01-31 21:41:05,723)

## Visualizing and analyzing motifs

`dcpg_filter_motifs.py` can be used to visualize and cluster motifs, compare them to known motifs, and compute motif statistics. We will compare motifs to those in the CIS-BP database, and plot motif heatmaps, motif activation densities, and the first two principal components of motif activations via the `--plot_heat`, `--plot_dens`, and `--plot_pca` options, respectively:

In [6]:
cmd="dcpg_filter_motifs.py
    $motifs_dir/activations.h5
    --out_dir $motifs_dir
    --motif_db $data_dir/motif_databases/CIS-BP/Mus_musculus.meme
    --plot_heat
    --plot_dens
    --plot_pca
    --out_format png
    "
eval $cmd

INFO (2017-01-31 21:41:24,271): Reading data
Filters: 128
Filter len: 11
Samples: 25000
INFO (2017-01-31 21:41:25,563): Performing PCA on activations using 1000 samples
INFO (2017-01-31 21:41:34,631): Analyzing filters
INFO (2017-01-31 21:41:34,631): -----------------
INFO (2017-01-31 21:41:34,631): Filter 0
INFO (2017-01-31 21:41:44,331): Plotting filter densities
INFO (2017-01-31 21:41:46,521): Plotting filter heatmap
INFO (2017-01-31 21:41:46,779): Extracting activating kmers
INFO (2017-01-31 21:41:47,295): Plotting sequence logo
INFO (2017-01-31 21:41:51,035): Computing PWM
INFO (2017-01-31 21:41:51,041): Filter 1
INFO (2017-01-31 21:41:59,256): Plotting filter densities
INFO (2017-01-31 21:42:01,703): Plotting filter heatmap
INFO (2017-01-31 21:42:02,132): Extracting activating kmers
INFO (2017-01-31 21:42:02,527): Plotting sequence logo
INFO (2017-01-31 21:42:04,273): Computing PWM
INFO (2017-01-31 21:42:04,278): Filter 2
INFO (2017-01-31 21:42:12,724): Plotti
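
The log lines "Extracting activating kmers" and "Computing PWM" can be pictured with the following sketch: fragments whose activation exceeds a threshold are collected, turned into a position frequency matrix, and summarized by their information content. The threshold rule, the function names, and the uniform background are assumptions for illustration and may differ from what `dcpg_filter_motifs.py` actually does.

```python
# Sketch: from activating k-mers to a position frequency matrix and its
# information content. Threshold and uniform background are assumptions.
import math

ALPHABET = "ACGT"


def activating_kmers(seq, activations, filter_len, threshold):
    """Collect fragments of seq whose filter activation exceeds threshold."""
    return [seq[i:i + filter_len]
            for i, act in enumerate(activations) if act > threshold]


def position_frequencies(kmers, pseudocount=0.0):
    """Column-wise base frequencies of equally long, already aligned k-mers."""
    matrix = []
    for pos in range(len(kmers[0])):
        counts = [pseudocount + sum(1 for kmer in kmers if kmer[pos] == base)
                  for base in ALPHABET]
        total = sum(counts)
        matrix.append([count / total for count in counts])
    return matrix


def information_content(matrix):
    """Total information content in bits, relative to a uniform background."""
    total = 0.0
    for column in matrix:
        entropy = -sum(p * math.log2(p) for p in column if p > 0)
        total += 2.0 - entropy  # at most 2 bits per position for DNA
    return total


# Toy activations of a length-3 filter along a short sequence.
kmers = activating_kmers("ATCGCGT", [0.0, 0.0, 3.0, 0.0, 1.0], 3, threshold=0.5)
print(kmers)  # ['CGC', 'CGT']

# Toy alignment of four fragments that activated the same filter.
toy_kmers = ["CGCA", "CGCT", "CGCA", "CGTA"]
pfm = position_frequencies(toy_kmers)
ic = information_content(pfm)
print(round(ic, 3))
```

Because every fragment has the length of the filter, the fragments are aligned by construction. A fully conserved position contributes 2 bits, so a filter of length 11 can reach at most 22 bits.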

We can now look at the output motifs directory. For each DeepCpG motif, `report_top.csv` lists the most similar motif in the CIS-BP database.

In [13]:
cat $motifs_dir/report_top.csv | head -n 25

idx	motif	act_mean	act_std	ic	nb_site	target id	p-value	e-value	q-value	overlap	query consensus	target consensus	orientation	protein	url
121	GCCCCCTAGTG	0.000	0.007	13.388	3164	M6125_1.02	0.000	0.000	0.000	11.000	GCCATCTGGTG	AGCGCCCCCTGGTGG	-	Ctcf	http://cisbp.ccbr.utoronto.ca/TFreport.php?searchTF=T049038_1.02
117	TATGGGCGGGG	0.001	0.012	10.746	13353	M0392_1.02	0.000	0.000	0.000	9.000	AGGGGGCGGGG	GGGGGCGGG	+	Sp1	http://cisbp.ccbr.utoronto.ca/TFreport.php?searchTF=T049025_1.02
30	CACCAGGTGGC	0.001	0.013	11.527	6690	M6125_1.02	0.000	0.000	0.000	11.000	CACCAGGTGGC	CCACCAGGGGGCGCT	+	Ctcf	http://cisbp.ccbr.utoronto.ca/TFreport.php?searchTF=T049038_1.02
79	CGCGACGTGGC	0.002	0.014	8.419	24699	M5593_1.02	0.000	0.000	0.001	11.000	GGGGGCGTGGC	GGGGGCGTGGC	+	Klf16	http://cisbp.ccbr.utoronto.ca/TFreport.php?searchTF=T049178_1.02
46	AGATATAATAA	0.020	0.056	10.281	24508	M1582_1.02	0.000	0.001	0.002	11.000	AAATATAATAA	ATATATAATAA	+	Hmg20b	http://cisbp.ccbr.utoronto.ca/TFreport.php?searchTF=T1445
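
The report can also be filtered programmatically. The sketch below assumes that the file is tab-separated, as the output above suggests, and uses the column names from its header; the function name and the q-value cutoff are arbitrary choices for illustration. The sample rows are copied from the output above and reduced to a few columns.

```python
# Sketch: list motifs whose best database match passes a q-value cutoff.
# Assumes a tab-separated report with the column names shown in the header.
import csv
import io

# Two rows copied from the report above, reduced to a few columns.
sample = (
    "idx\tmotif\tic\tq-value\tprotein\n"
    "121\tGCCCCCTAGTG\t13.388\t0.000\tCtcf\n"
    "46\tAGATATAATAA\t10.281\t0.002\tHmg20b\n"
)


def significant_matches(handle, max_q=0.001):
    """Return (idx, protein) pairs for rows with q-value <= max_q."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        if not row["q-value"]:  # skip rows without a database match
            continue
        if float(row["q-value"]) <= max_q:
            hits.append((int(row["idx"]), row["protein"]))
    return hits


hits = significant_matches(io.StringIO(sample))
print(hits)  # [(121, 'Ctcf')]
```

To read the real report instead of the inline sample, pass an open file handle, e.g. `open("./motifs/report_top.csv", newline="")`.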

Sequence logos are stored in the `logos/` directory. Let's have a look at the top-ranked motif 121 (Ctcf):
<img src="./motifs/logos/121.png" width="500px"/>

The corresponding weight matrix, stored in `heat/`, looks as follows:
<img src="./motifs/heat/121.png" width="600px"/>

Densities of motif activations are stored in `dens/`. For the selected motif, the density looks as follows:
<img src="./motifs/dens/121.png" width="600px"/>

Lastly, `plot_wmean.png` shows the first two principal components of the weighted-mean motif activations in sequence windows. Motifs with similar activation patterns are close to each other:
<img src="./motifs/pca_mean.png" width="1000px"/>
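
Before a PCA can be computed, the per-position activations of a motif in a window have to be reduced to a single value. The sketch below contrasts a plain mean, a weighted mean, and the maximum. The triangular weights that peak at the window center are an assumption for illustration; the source does not state which weighting DeepCpG uses.

```python
# Sketch: reduce per-position activations of one motif in one window to a
# single number. The center-peaked triangular weights are an assumption.


def linear_weights(length):
    """Triangular weights that peak at the window center and sum to one."""
    center = (length - 1) / 2
    raw = [center + 1 - abs(pos - center) for pos in range(length)]
    total = sum(raw)
    return [value / total for value in raw]


def summarize(activations):
    """Return the mean, weighted-mean, and maximum activation of a window."""
    weights = linear_weights(len(activations))
    return {
        "mean": sum(activations) / len(activations),
        "wmean": sum(w * a for w, a in zip(weights, activations)),
        "max": max(activations),
    }


summary = summarize([0.0, 0.0, 3.0, 0.0, 1.0])
print(summary)
```

In this toy window the strongest activation sits at the center, so the weighted mean (10/9) exceeds the plain mean (0.8). Applying such a summary to every motif and window yields one vector per motif, which is the kind of input a PCA needs.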