# DeepSEA Demo
This notebook demonstrates how convolutional neural networks can be used to address genomic problems.
It also shows how Kipoi can be used to obtain trained models, so that you do not have to train
your own.

Our example method for this demonstration will be [DeepSEA](https://www.nature.com/articles/nmeth.3547), which predicts the effects of noncoding genomic variants on epigenetic features.

In [1]:
%%bash
# Preliminaries from https://towardsdatascience.com/conda-google-colab-75f7c867a522
# used to install conda for package management
MINICONDA_INSTALLER_SCRIPT=Miniconda3-4.5.4-Linux-x86_64.sh
MINICONDA_PREFIX=/usr/local
wget https://repo.continuum.io/miniconda/$MINICONDA_INSTALLER_SCRIPT
chmod +x $MINICONDA_INSTALLER_SCRIPT
./$MINICONDA_INSTALLER_SCRIPT -b -f -p $MINICONDA_PREFIX

PREFIX=/usr/local
installing: python-3.6.5-hc3d631a_2 ...
installing: ca-certificates-2018.03.07-0 ...
installing: conda-env-2.6.0-h36134e3_1 ...
installing: libgcc-ng-7.2.0-hdf63c60_3 ...
installing: libstdcxx-ng-7.2.0-hdf63c60_3 ...
installing: libffi-3.2.1-hd88cf55_4 ...
installing: ncurses-6.1-hf484d3e_0 ...
installing: openssl-1.0.2o-h20670df_0 ...
installing: tk-8.6.7-hc745277_3 ...
installing: xz-5.2.4-h14c3975_4 ...
installing: yaml-0.1.7-had09818_2 ...
installing: zlib-1.2.11-ha838bed_2 ...
installing: libedit-3.1.20170329-h6b74fdf_2 ...
installing: readline-7.0-ha6073c6_4 ...
installing: sqlite-3.23.1-he433501_0 ...
installing: asn1crypto-0.24.0-py36_0 ...
installing: certifi-2018.4.16-py36_0 ...
installing: chardet-3.0.4-py36h0f667ec_1 ...
installing: idna-2.6-py36h82fb2a8_1 ...
installing: pycosat-0.6.3-py36h0a5515d_0 ...
installing: pycparser-2.18-py36hf9f622e_1 ...
installing: pysocks-1.6.8-py36_0 ...
installing: ruamel_yaml-0.15.37-py36h14c3975_2 ...
installing: six-1.11

--2020-10-19 17:49:33--  https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
Resolving repo.continuum.io (repo.continuum.io)... 104.18.201.79, 104.18.200.79, 2606:4700::6812:c84f, ...
Connecting to repo.continuum.io (repo.continuum.io)|104.18.201.79|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://repo.anaconda.com/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh [following]
--2020-10-19 17:49:34--  https://repo.anaconda.com/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.130.3, 104.16.131.3, 2606:4700::6810:8303, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.130.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58468498 (56M) [application/x-sh]
Saving to: ‘Miniconda3-4.5.4-Linux-x86_64.sh’


In [2]:
%%bash
conda install --channel defaults conda python=3.6 --yes
conda update --channel defaults --all --yes

Solving environment: ...working... done

## Package Plan ##

  environment location: /usr/local

  added / updated specs: 
    - conda
    - python=3.6


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    requests-2.24.0            |             py_0          54 KB
    libstdcxx-ng-9.1.0         |       hdf63c60_0         4.0 MB
    cffi-1.14.3                |   py36he30daa8_0         224 KB
    xz-5.2.5                   |       h7b6447c_0         438 KB
    urllib3-1.25.10            |             py_0          93 KB
    wheel-0.35.1               |             py_0          36 KB
    python-3.6.12              |       hcff3b4d_2        34.0 MB
    openssl-1.1.1h             |       h7b6447c_0         3.8 MB
    pyopenssl-19.1.0           |             py_1          47 KB
    tk-8.6.10                  |       hbc83047_0         3.2 MB
    _libgcc_mutex-0.1          |             main   


In [3]:
import sys
sys.path.append("/usr/local/lib/python3.6/site-packages")

In [4]:
!conda install -c conda-forge -y  mamba
!mamba install -c bioconda -c conda-forge -y  pybedtools pyfaidx kipoi kipoiseq pyyaml


Streaming output truncated to the last 5000 lines.
singledispatch           [] (00m:00s) Waiting...
related                  [] (00m:00s) Waiting...
tinydb                   [] (00m:00s) Waiting...
whichcraft               [] (00m:00s) Waiting...
click                    [] (00m:00s) Waiting...
cookiecutter             [] (00m:00s) Waiting...
importlib_metadata       [] (00m:00s) Waiting...
pybedtools               [] (00m:00s)     10 MB /     12 MB ( 25.97 MB/s)
gffutils                 [] (00m:00s) 
future                   [] (00m:00s) Decompressing...
kipoi-utils              [] (00m:00s) Waiting...
libcblas                 [] (00m:00s) Waiting...
libgfortran-ng           [] (00m:00s) Waiting...
liblapack                [] (00m:00s) Waiting...
markupsafe               [] (00m:00s) Waiting...
packaging                [] (00m:00s) Waiting...
poyo                     [] (00m:00s) Waiting...
h5py                     [] (00m:00s) Waiting...
pyparsing                [] (00m

In [5]:
# Import the necessary libraries
import kipoi
import torch

## Download the model and example data from Kipoi


In [6]:
# Download trained model from Kipoi
model = kipoi.get_model('DeepSEA/predict')

0.00B [00:00, ?B/s]

Downloading https://zenodo.org/record/1466993/files/deepsea_predict.pth?download=1 to /root/.kipoi/models/DeepSEA/predict/downloaded/model_files/weights/89e640bf6bdbe1ff165f484d9796efc7


100%|█████████▉| 211M/211M [00:17<00:00, 11.7MB/s]

In [7]:
# Download example dataloader kwargs
dl_kwargs = model.default_dataloader.download_example('example')
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
it = dl.batch_iter(batch_size=10)
# predict for a batch
batch = next(it)


0.00B [00:00, ?B/s]
8.19kB [00:00, 49.6kB/s]

0.00B [00:00, ?B/s]

Downloading https://raw.githubusercontent.com/kipoi/kipoiseq/master/tests/data/intervals_51bp.tsv to /content/example/intervals_file
Downloading https://raw.githubusercontent.com/kipoi/kipoiseq/master/tests/data/hg38_chr22_32000000_32300000.fa to /content/example/fasta_file



303kB [00:00, 1.24MB/s]                            


## Data Tour
Always start by looking at your data.

In [8]:
print(batch)

{'inputs': array([[[[1., 0., 0., ..., 0., 0., 0.]],

        [[0., 0., 0., ..., 1., 1., 0.]],

        [[0., 1., 0., ..., 0., 0., 0.]],

        [[0., 0., 1., ..., 0., 0., 1.]]],


       [[[0., 0., 1., ..., 0., 1., 0.]],

        [[0., 0., 0., ..., 0., 0., 1.]],

        [[1., 1., 0., ..., 0., 0., 0.]],

        [[0., 0., 0., ..., 1., 0., 0.]]],


       [[[0., 0., 0., ..., 0., 1., 0.]],

        [[0., 0., 1., ..., 0., 0., 1.]],

        [[1., 0., 0., ..., 1., 0., 0.]],

        [[0., 1., 0., ..., 0., 0., 0.]]],


       ...,


       [[[1., 1., 1., ..., 1., 0., 0.]],

        [[0., 0., 0., ..., 0., 0., 0.]],

        [[0., 0., 0., ..., 0., 0., 0.]],

        [[0., 0., 0., ..., 0., 1., 1.]]],


       [[[0., 1., 1., ..., 1., 1., 1.]],

        [[0., 0., 0., ..., 0., 0., 0.]],

        [[0., 0., 0., ..., 0., 0., 0.]],

        [[1., 0., 0., ..., 0., 0., 0.]]],


       [[[1., 1., 1., ..., 1., 1., 0.]],

        [[0., 0., 0., ..., 0., 0., 1.]],

        [[0., 0., 0., ..., 0., 0., 0.]],


In [9]:
print(batch['inputs'].shape)

(10, 4, 1, 1000)
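
The shape reads as 10 sequences in the batch, 4 nucleotide channels, a singleton dimension, and 1000 bases per sequence. As a rough illustration of how a DNA string could end up in that layout, here is a minimal sketch. `one_hot_encode` is a hypothetical helper written for this notebook, not part of Kipoi, and both the A, C, G, T channel order and the all-zero encoding of unknown bases are assumptions.

```python
import numpy as np


def one_hot_encode(sequence, alphabet="ACGT"):
    """Hypothetical helper (not a Kipoi function): DNA string -> (4, 1, L) array."""
    lookup = {base: row for row, base in enumerate(alphabet)}
    encoded = np.zeros((len(alphabet), 1, len(sequence)), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        row = lookup.get(base)
        if row is not None:  # assumption: unknown bases such as N stay all-zero
            encoded[row, 0, position] = 1.0
    return encoded


# Stacking per-sequence arrays adds the leading batch dimension
toy_batch = np.stack([one_hot_encode("ACGT"), one_hot_encode("TTGA")])
print(toy_batch.shape)  # (2, 4, 1, 4)
```

With ten sequences of 1000 bases each, the same stacking would give the `(10, 4, 1, 1000)` shape printed above.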


In [10]:
# Take the first sequence in the batch and drop the singleton dimension
example_sequence_one_hot = batch['inputs'][0,:,:,:].squeeze()
print(example_sequence_one_hot.shape)

(4, 1000)


In [11]:
# Convert one-hot into integer labels
example_sequence_indices = example_sequence_one_hot.argmax(axis=0)
print(example_sequence_indices)

[0 2 3 1 3 2 1 0 1 0 2 0 0 1 0 3 3 3 3 2 0 2 0 3 3 3 2 3 3 3 3 1 3 1 0 2 3
 0 2 3 3 1 1 3 3 0 1 0 0 3 3 1 1 1 0 3 0 3 3 2 3 0 2 2 3 0 0 3 2 0 1 0 2 3
 3 3 3 0 0 3 2 1 1 0 1 3 2 0 0 0 3 0 0 0 0 0 1 0 3 0 0 0 2 3 2 0 1 3 2 1 1
 3 0 0 2 0 2 3 1 0 1 0 3 2 2 3 3 3 0 3 2 2 1 0 2 0 0 1 3 0 2 2 0 1 3 1 0 0
 2 1 1 1 0 2 3 3 1 3 3 1 3 2 0 0 0 3 1 1 0 0 0 3 1 1 0 2 2 1 1 0 1 3 3 3 1
 0 1 1 1 0 3 2 1 1 1 3 1 2 3 2 2 3 2 1 1 1 0 2 1 1 1 3 2 3 2 1 1 3 1 0 3 0
 0 1 1 1 2 2 2 3 2 1 3 2 0 2 2 3 2 2 0 2 1 0 2 1 3 1 1 1 0 3 1 1 0 2 2 3 1
 1 1 0 0 2 0 3 0 2 2 1 3 1 1 1 0 1 3 2 1 3 1 1 0 2 0 1 2 3 1 1 1 1 0 3 2 2
 0 3 3 2 1 1 1 1 1 0 2 1 0 0 0 2 2 0 1 1 3 1 1 3 2 1 1 0 2 1 0 2 1 3 1 1 2
 2 2 0 0 2 2 0 2 1 3 1 3 2 1 0 2 0 2 0 2 1 1 1 3 3 3 2 2 0 0 2 1 1 0 2 0 2
 1 0 2 0 0 0 2 2 2 0 2 1 0 2 2 1 0 1 1 3 2 1 3 2 2 0 1 0 0 1 3 2 2 2 1 1 3
 1 1 0 3 1 1 0 0 2 3 1 0 2 0 3 3 1 1 3 3 1 1 0 2 2 0 1 0 0 0 2 1 3 2 1 3 1
 3 1 1 3 3 1 0 1 1 1 0 2 0 0 1 0 1 1 0 1 3 2 3 1 0 1 1 3 1 1 3 1 0 1 0 2 0
 2 2 0 2 0 0 0 1 0 3 1 3 

In [12]:
# Kipoi models usually order the channels as A, C, G, T, but this is not guaranteed, so check before decoding
decoder = {0: 'A',
           1: 'C',
           2: 'G',
           3: 'T'
          }

# Decode the DNA sequence
decoded = []
for index in example_sequence_indices:
    decoded.append(decoder[index])
    
print(''.join(decoded))

AGTCTGCACAGAACATTTTGAGATTTGTTTTCTCAGTAGTTCCTTACAATTCCCATATTGTAGGTAATGACAGTTTTAATGCCACTGAAATAAAAACATAAAGTGACTGCCTAAGAGTCACATGGTTTATGGCAGAACTAGGACTCAAGCCCAGTTCTTCTGAAATCCAAATCCAGGCCACTTTCACCCATGCCCTCGTGGTGCCCAGCCCTGTGCCTCATAACCCGGGTGCTGAGGTGGAGCAGCTCCCATCCAGGTCCCAAGATAGGCTCCCACTGCTCCAGACGTCCCCATGGATTGCCCCCAGCAAAGGACCTCCTGCCAGCAGCTCCGGGAAGGAGCTCTGCAGAGAGCCCTTTGGAAGCCAGAGCAGAAAGGGAGCAGGCACCTGCTGGACAACTGGGCCTCCATCCAAGTCAGATTCCTTCCAGGACAAAGCTGCTCTCCTTCACCCAGAACACCACTGTCACCTCCTCACAGAGGAGAAACATCTTTGTTCTTCCATCTCAAAAGAGCTGGCTTTGCTGATATGACAGGCCCCAAAGAGCAAGTCAGCCTCATCAGCAGTTTTTCCTCCTCCCTCCTCCGCATTCTTCCTGGTGCGTCATCTTCCAAGGTGACACATACATTGTGGCTTTGGCAGGACTCCTGCCTGTTGGGACTCAGGAAGTTCACTTTGTCCTCCTAAGTCTCTATGTTGACACGCCCTTGCCTGTAAACACAAGAATTGAGAGGGGATATGATGATTCCAGAGATAGGAAATTGATCTCTAACCAAATTTCACATCTTAAGAAGGCCTGTGACTCTGGGACCACGGGTACCATGTTGAGAAGGGTTCCACCCAGTGGTCATGAGCACAGACCTTGTTCTCAGACCTGATTCCTCCAGGCAGGTTATTTGACATTTATGAACCTCAGTGTTCTCTGAAATGGGGATCATCCCCTGACTTCTGAGGGCAGTTAAATGAGATCAAGCATGTAAAGCTCTTAGCACCAAGCCT
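
The loop above can also be written as a single NumPy lookup. The sketch below is one possible version, assuming the same A, C, G, T channel order as the decoder above; `decode_one_hot` is a hypothetical helper name. Because `argmax` returns 0 for an all-zero column, the sketch marks any column that is not exactly one-hot as `N` rather than silently decoding it as `A`.

```python
import numpy as np

ALPHABET = np.array(list("ACGT"))  # assumed channel order


def decode_one_hot(one_hot):
    """Hypothetical helper: (4, L) one-hot array -> DNA string."""
    one_hot = np.asarray(one_hot)
    bases = ALPHABET[one_hot.argmax(axis=0)]
    # A column is valid only if it holds a single 1 and zeros elsewhere
    is_one_hot = (one_hot.sum(axis=0) == 1) & (one_hot.max(axis=0) == 1)
    return "".join(np.where(is_one_hot, bases, "N"))


toy = np.array([[1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 0]], dtype=np.float32)
print(decode_one_hot(toy))  # ACGN
```

Calling `decode_one_hot(example_sequence_one_hot)` on the array from the earlier cell should reproduce the string printed above, provided every column is a valid one-hot vector.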

### What should the outputs look like?

In [14]:
# This file comes from the resources folder of the DeepSEA standalone package
# that can be found here: http://deepsea.princeton.edu/media/code/deepsea.v0.94c.tar.gz
!wget https://raw.githubusercontent.com/ben-heil/dl_workshop/main/notebooks/predictor.names

--2020-10-19 17:52:28--  https://raw.githubusercontent.com/ben-heil/dl_workshop/main/notebooks/predictor.names
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17297 (17K) [text/plain]
Saving to: ‘predictor.names’


2020-10-19 17:52:28 (1.33 MB/s) - ‘predictor.names’ saved [17297/17297]



In [15]:
with open('predictor.names') as feature_file:
    feature_names = [name.strip() for name in feature_file]

print(len(feature_names))
print(feature_names[:5])

919
['8988T|DNase|None', 'AoSMC|DNase|None', 'Chorion|DNase|None', 'CLL|DNase|None', 'Fibrobl|DNase|None']
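
Each name appears to pack three fields separated by `|`: a cell type, an assay or measured factor, and a treatment. A minimal sketch of splitting them is shown below. `parse_feature_name` is a hypothetical helper, the field labels are my reading of the format rather than something the file states, and the five names are copied from the output above.

```python
from collections import Counter

# First five names from the output above; the full list has 919 entries
example_names = [
    "8988T|DNase|None",
    "AoSMC|DNase|None",
    "Chorion|DNase|None",
    "CLL|DNase|None",
    "Fibrobl|DNase|None",
]


def parse_feature_name(name):
    """Hypothetical helper: split 'cell|assay|treatment' into labelled fields."""
    # Assumption: every name has exactly three '|'-separated fields
    cell_type, assay, treatment = name.split("|", 2)
    return {"cell_type": cell_type, "assay": assay, "treatment": treatment}


parsed = [parse_feature_name(name) for name in example_names]
assay_counts = Counter(entry["assay"] for entry in parsed)
print(assay_counts)  # Counter({'DNase': 5})
```

Running the same count over the full `feature_names` list would show how the 919 outputs are distributed across assays.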


## Making predictions with the data

In [16]:
pred = model.pipeline.predict(dl_kwargs, batch_size=10)


0it [00:00, ?it/s]
1it [00:00,  1.61it/s]


In [17]:
pred

array([[0.08160985, 0.06867626, 0.10076762, ..., 0.09493407, 0.02133884,
        0.01201438],
       [0.06698142, 0.01062412, 0.02694611, ..., 0.15490845, 0.04822356,
        0.00770111],
       [0.04445538, 0.00539725, 0.01840791, ..., 0.14994638, 0.3529725 ,
        0.02272797],
       ...,
       [0.00081359, 0.00404314, 0.00176917, ..., 0.02167308, 0.11304276,
        0.02273136],
       [0.00079004, 0.00249485, 0.00252153, ..., 0.05105   , 0.04735951,
        0.01576868],
       [0.00076793, 0.00237438, 0.00250099, ..., 0.05096852, 0.04723714,
        0.01515259]], dtype=float32)

In [18]:
print(pred.shape)

(10, 919)


In [19]:
# Get the first prediction from the batch
single_pred = torch.Tensor(pred[0,:])
print(single_pred.shape)

torch.Size([919])


In [20]:
values, indices = torch.topk(single_pred, 5)

for value, index in zip(values, indices):
    feature = feature_names[index]
    print('Feature {} has a probability of {}'.format(feature, value))

Feature NHEK|H3K4me1|None has a probability of 0.6889464259147644
Feature NH-A|H3K4me1|None has a probability of 0.6628297567367554
Feature Osteoblasts|H3K4me1|None has a probability of 0.6460651159286499
Feature NH-A|H3K4me2|None has a probability of 0.541836142539978
Feature NHEK|H3K4me2|None has a probability of 0.49794310331344604
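
The same ranking can be done on the NumPy array directly, without converting to a torch tensor. The sketch below is one way to do it; `top_k_features` is a hypothetical helper, and the scores and names are made-up toy values, not model output.

```python
import numpy as np


def top_k_features(scores, names, k=5):
    """Hypothetical helper: return the k highest-scoring (name, score) pairs."""
    scores = np.asarray(scores)
    order = np.argsort(scores)[::-1][:k]  # indices by descending score
    return [(names[i], float(scores[i])) for i in order]


# Toy values for illustration only
toy_scores = np.array([0.10, 0.69, 0.05, 0.66, 0.54], dtype=np.float32)
toy_names = ["feature_a", "feature_b", "feature_c", "feature_d", "feature_e"]

for name, score in top_k_features(toy_scores, toy_names, k=3):
    print("Feature {} has a probability of {:.3f}".format(name, score))
```

In the notebook, `top_k_features(pred[0, :], feature_names)` should select the same five features as the `torch.topk` cell above.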


### Sequence 1 Info
H3K4 methylation is associated with active transcription; H3K4me1 in particular is characteristic of enhancers.

NHEK = Normal human epidermal keratinocytes  
NH-A = Normal human astrocytes

In [21]:
# Get a single prediction from the batch
single_pred = torch.Tensor(pred[5,:])
print(single_pred.shape)

torch.Size([919])


In [22]:
values, indices = torch.topk(single_pred, 5)

for value, index in zip(values, indices):
    feature = feature_names[index]
    print('Feature {} has a probability of {}'.format(feature, value))

Feature U2OS|SETDB1|None has a probability of 0.36347267031669617
Feature Monocytes-CD14+RO01746|H3K9me3|None has a probability of 0.19080045819282532
Feature K562|KAP1|None has a probability of 0.14699679613113403
Feature H1-hESC|H3K9me3|None has a probability of 0.12245297431945801
Feature HEK293|KAP1|None has a probability of 0.105678990483284


### Sequence 2 Info
SETDB1 is an H3K9 methyltransferase

H3K9 methylation indicates transcriptional repression  
KAP1 is a ubiquitous protein [likely involved in chromatin organization](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3143589/)

The cell types are largely cell lines or embryonic cells.