# Train Deepred-Mt

## Install Deepred-Mt

In [12]:
!pip install -U "deepredmt @ git+https://github.com/aedera/deepredmt.git" > /dev/null

  Running command git clone -q https://github.com/aedera/deepredmt.git /tmp/pip-install-otgctr2p/deepredmt


## Environment

In [13]:
import tensorflow as tf
import deepredmt

## Prepare training dataset

As the training dataset, we will use a subset of the data originally used to train Deepred-Mt.

In [14]:
!wget https://raw.githubusercontent.com/aedera/deepredmt/main/data/training-data.tsv.gz

--2021-04-22 19:47:05--  https://raw.githubusercontent.com/aedera/deepredmt/main/data/training-data.tsv.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3026591 (2.9M) [application/octet-stream]
Saving to: ‘training-data.tsv.gz’


2021-04-22 19:47:06 (20.1 MB/s) - ‘training-data.tsv.gz’ saved [3026591/3026591]



The following commands decompress the downloaded data and convert it into the format expected by the training procedure.

In [15]:
!gzip -d training-data.tsv.gz
!head -10 training-data.tsv | column -t

Allium_cepa!atp1!1002  atp1!1020  3  GCTTTCCCTGGGGATGTTTT  C  TATTTACATTCCCGTCTCTT  TTC  0.0011  0  0.00
Allium_cepa!atp1!1009  atp1!1027  1  CTGGGGATGTTTTCTATTTA  C  ATTCCCGTCTCTTAGAAAGA  CAT  0.0002  0  0.00
Allium_cepa!atp1!1013  atp1!1031  2  GGATGTTTTCTATTTACATT  C  CCGTCTCTTAGAAAGAGCCG  TCC  0.0002  0  0.00
Allium_cepa!atp1!1014  atp1!1032  3  GATGTTTTCTATTTACATTC  C  CGTCTCTTAGAAAGAGCCGC  TCC  0.0010  0  0.00
Allium_cepa!atp1!1015  atp1!1033  1  ATGTTTTCTATTTACATTCC  C  GTCTCTTAGAAAGAGCCGCT  CGT  0.0000  0  0.00
Allium_cepa!atp1!1018  atp1!1036  1  TTTTCTATTTACATTCCCGT  C  TCTTAGAAAGAGCCGCTAAA  CTC  0.0002  0  0.00
Allium_cepa!atp1!1020  atp1!1038  3  TTCTATTTACATTCCCGTCT  C  TTAGAAAGAGCCGCTAAACG  CTC  0.0018  0  0.00
Allium_cepa!atp1!1031  atp1!1049  2  TTCCCGTCTCTTAGAAAGAG  C  CGCTAAACGATCGGACCAGA  GCC  0.0000  0  0.00
Allium_cepa!atp1!1032  atp1!1050  3  TCCCGTCTCTTAGAAAGAGC  C  GCTAAACGATCGGACCAGAC  GCC  0.0000  0  0.00
Allium_cepa!atp1!1034  atp1!1052  2  CCGTCTCTTAGAAAGAGC

In [16]:
!cut -f4,5,6,8,9 training-data.tsv | \
  awk '{{printf "%s%s%s\t%s\t%s\n", $$1, $$2, $$3, $$5, $$4}}' > trainset.tmp

# nucleotide window, label (0/1), and editing extent
!head -10 trainset.tmp | column -t

GCTTTCCCTGGGGATGTTTTCTATTTACATTCCCGTCTCTT  0  0.0011
CTGGGGATGTTTTCTATTTACATTCCCGTCTCTTAGAAAGA  0  0.0002
GGATGTTTTCTATTTACATTCCCGTCTCTTAGAAAGAGCCG  0  0.0002
GATGTTTTCTATTTACATTCCCGTCTCTTAGAAAGAGCCGC  0  0.0010
ATGTTTTCTATTTACATTCCCGTCTCTTAGAAAGAGCCGCT  0  0.0000
TTTTCTATTTACATTCCCGTCTCTTAGAAAGAGCCGCTAAA  0  0.0002
TTCTATTTACATTCCCGTCTCTTAGAAAGAGCCGCTAAACG  0  0.0018
TTCCCGTCTCTTAGAAAGAGCCGCTAAACGATCGGACCAGA  0  0.0000
TCCCGTCTCTTAGAAAGAGCCGCTAAACGATCGGACCAGAC  0  0.0000
CCGTCTCTTAGAAAGAGCCGCTAAACGATCGGACCAGACAG  0  0.0000
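
For readers who prefer to stay in Python, the cut/awk step above can be sketched as a small function. This is only an illustration, not part of Deepred-Mt: the name `to_training_row` is hypothetical, and the sketch assumes the input is tab-separated with the column order shown in the head output above (upstream region, target nucleotide, and downstream region in columns 4 to 6, editing extent in column 8, label in column 9).

```python
def to_training_row(line):
    """Convert one row of training-data.tsv into 'window<TAB>label<TAB>extent'.

    Hypothetical helper mirroring the cut/awk command; not a Deepred-Mt function.
    """
    fields = line.rstrip("\n").split("\t")
    upstream, target, downstream = fields[3], fields[4], fields[5]
    extent, label = fields[7], fields[8]
    # Join the 20-nt flanks around the target nucleotide into a 41-nt window
    window = upstream + target + downstream
    return f"{window}\t{label}\t{extent}"


# First row of the head output shown earlier, rebuilt with tabs
example = "\t".join([
    "Allium_cepa!atp1!1002", "atp1!1020", "3",
    "GCTTTCCCTGGGGATGTTTT", "C", "TATTTACATTCCCGTCTCTT",
    "TTC", "0.0011", "0", "0.00",
])
row = to_training_row(example)
print(row)
```

The printed row should match the first line of trainset.tmp shown above.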


Take a random sample of 1,000 data points.

In [17]:
!shuf -n 1000 trainset.tmp > trainset.tsv
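
The shuf command draws a different sample on every run. If a repeatable subset is wanted, one option is to sample in Python with a fixed seed. The sketch below is an assumption-laden illustration: `sample_lines` and the seed value are hypothetical and not part of Deepred-Mt, and the toy list stands in for the lines of trainset.tmp.

```python
import random


def sample_lines(lines, k, seed=0):
    """Return up to k lines drawn without replacement, reproducibly.

    Hypothetical helper; the seed value is an arbitrary choice.
    """
    rng = random.Random(seed)  # private generator, leaves the global state untouched
    return rng.sample(lines, min(k, len(lines)))


# Toy stand-in for the contents of trainset.tmp
toy_lines = [f"window_{i}\t0\t0.0000\n" for i in range(5000)]
subset = sample_lines(toy_lines, 1000)
print(len(subset))

# With the real file, the same helper could be used as:
#   with open("trainset.tmp") as fh:
#       subset = sample_lines(fh.readlines(), 1000)
#   with open("trainset.tsv", "w") as fh:
#       fh.writelines(subset)
```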

## Training

In [18]:
deepredmt.fit("trainset.tsv", batch_size=32, epochs=10)

Model: "encoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
batch_normalization_16 (Batc (None, 41, 4)             16        
_________________________________________________________________
conv1d_0_0 (Conv1D)          (None, 41, 16)            208       
_________________________________________________________________
batch_normalization_17 (Batc (None, 41, 16)            64        
_________________________________________________________________
activation_14 (Activation)   (None, 41, 16)            0         
_________________________________________________________________
conv1d_0_1 (Conv1D)          (None, 41, 16)            784       
_________________________________________________________________
batch_normalization_18 (Batc (None, 41, 16)            64        
_________________________________________________________________
activation_15 (Activation)   (None, 41, 16)            0   

True
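
The first layer in the summary above has output shape (None, 41, 4), which is consistent with each 41-nucleotide window being represented as a 41 x 4 matrix, one channel per nucleotide. The sketch below shows what such a one-hot encoding could look like. It is an assumption for illustration only: the function `one_hot`, the channel order A, C, G, T, and the handling of unknown symbols are hypothetical, and Deepred-Mt's internal encoding may differ.

```python
import numpy as np

ALPHABET = "ACGT"  # hypothetical channel order, not taken from Deepred-Mt


def one_hot(window):
    """Encode a nucleotide window as a (len(window), 4) float32 matrix."""
    index = {nt: i for i, nt in enumerate(ALPHABET)}
    encoded = np.zeros((len(window), len(ALPHABET)), dtype=np.float32)
    for pos, nt in enumerate(window):
        if nt in index:  # symbols outside ALPHABET are left as all-zero rows
            encoded[pos, index[nt]] = 1.0
    return encoded


# First window of trainset.tmp shown earlier
x = one_hot("GCTTTCCCTGGGGATGTTTTCTATTTACATTCCCGTCTCTT")
print(x.shape)
```

A batch of such matrices stacked along a leading axis would have the (None, 41, 4) shape reported in the summary.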

The trained model is saved in TensorFlow format, with a .tf extension, in the models/deepredmt folder.

In [19]:
!ls -d ./models/deepredmt/*.tf

./models/deepredmt/210422-1947.tf


We can load this model and use it to make predictions.

In [20]:
path_to_model = !ls -d ./models/deepredmt/*.tf
model = tf.keras.models.load_model(path_to_model[0])

In [21]:
y_pred = deepredmt.predict('trainset.tsv', path_to_model[0])
y_pred[0:20] # see the predicted scores for the first 20 windows

array([[0.94932175],
       [0.987972  ],
       [0.43396324],
       [0.72903997],
       [0.5498872 ],
       [0.55503607],
       [0.6339734 ],
       [0.7841495 ],
       [0.06540111],
       [0.52731943],
       [0.5952312 ],
       [0.8876858 ],
       [0.24046293],
       [0.02224666],
       [0.5677028 ],
       [0.1244027 ],
       [0.98225266],
       [0.39220166],
       [0.08878222],
       [0.9660412 ]], dtype=float32)
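
The scores above are values between 0 and 1, one per window. To turn them into binary calls, a cutoff can be applied. The sketch below uses the first five scores from the output above; the cutoff of 0.5 is an assumption chosen for illustration, not a value recommended by Deepred-Mt.

```python
import numpy as np

# First five predicted scores copied from the output above
scores = np.array(
    [[0.94932175], [0.987972], [0.43396324], [0.72903997], [0.5498872]],
    dtype=np.float32,
)

threshold = 0.5  # assumed cutoff; adjust to the desired precision/recall trade-off

# Flatten the (n, 1) array to (n,) and mark scores at or above the cutoff as 1
labels = (scores.ravel() >= threshold).astype(int)
print(labels)  # [1 1 0 1 1]
```

With the notebook's own variable, the same expression could be applied to `y_pred` in place of `scores`.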