<a href="https://colab.research.google.com/github/anvelezec/hans_on/blob/master/AMPs_data_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q tfds-nightly

In [2]:
import tensorflow as tf
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Database consolidation
Many antimicrobial peptide (AMP) databases are available on the web. Two useful link collections that point to the databases maintained by universities, research groups, and laboratories are [[1]](http://crdd.osdd.net/raghava/satpdb/links.php) and [[2]](http://www.uwm.edu.pl/biochemia/index.php/en/biopep/32-bioactive-peptide-databases).
 
Some databases let you download their AMPs directly in FASTA, CSV, or TXT format. In other cases, web scraping is needed to collect the peptides and their properties.
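
When a database offers a FASTA download, the sequences have to be extracted before they can be written one per line, as this notebook expects. The following is a minimal plain-Python sketch of that step. The function name `read_fasta` and the record headers are hypothetical and only serve as illustration; a real download may need extra handling (for example, comments or non-standard residues).

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # A sequence may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical FASTA content; the headers are made up for illustration.
fasta_text = """>toy_amp_1
KLWKLWLKWLL
>toy_amp_2
IKELLPHLSG
IIDSVANAIK
"""

records = read_fasta(fasta_text)
sequences = [seq for _, seq in records]
print(sequences)
```

The resulting `sequences` list can be written to a text file, one peptide per line, in the same way as the toy dataset.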
 
This notebook focuses on building the data pipeline, so it uses a toy dataset of 5 AMPs.

In [3]:
amps_toy = ["MPKTRRRPRRSQRKRPPTPWPYGRKKRRQRRR",
            "KLWKLWLKWLL",
            "GINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYE",
            "IKELLPHLSGIIDSVANAIK",
            "FLPLIGKVLSSIL"]

with open("AMPs_toy.txt", "w") as file:
  for amp in amps_toy:
    file.write(amp + "\n")

# 1. Load the dataset with tf.data.TextLineDataset
A real AMP dataset may not fit in memory, so we use tf.data.TextLineDataset [[3]](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset#top_of_page) to read the file line by line and feed batches of AMPs into a model. This keeps the pipeline usable as the dataset grows.
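
To make the streaming idea concrete, here is a plain-Python sketch that reads a text file lazily and yields small batches of lines. It is only an analogy for what a line-based dataset does, not TensorFlow's implementation; `iter_batches` and the demo file are hypothetical names introduced for this example.

```python
import itertools
import os
import tempfile


def iter_batches(path, batch_size):
    """Yield lists of up to batch_size lines, reading the file lazily."""
    with open(path) as handle:
        lines = (line.rstrip("\n") for line in handle)
        while True:
            # Pull at most batch_size lines; nothing else is held in memory.
            batch = list(itertools.islice(lines, batch_size))
            if not batch:
                return
            yield batch


# Hypothetical demo file holding three of the toy peptides.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_peptides.txt")
with open(demo_path, "w") as handle:
    handle.write("KLWKLWLKWLL\nIKELLPHLSGIIDSVANAIK\nFLPLIGKVLSSIL\n")

batches = list(iter_batches(demo_path, batch_size=2))
print(batches)
```

Only one batch is materialized at a time, so memory use depends on the batch size rather than on the file size.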

In [4]:
# Read the file written above; a relative path also works outside Colab
tf_peptides = tf.data.TextLineDataset("AMPs_toy.txt")

In [5]:
# Each element is a scalar tf.Tensor of dtype string
for peptide in tf_peptides:
  print(peptide)

tf.Tensor(b'MPKTRRRPRRSQRKRPPTPWPYGRKKRRQRRR', shape=(), dtype=string)
tf.Tensor(b'KLWKLWLKWLL', shape=(), dtype=string)
tf.Tensor(b'GINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYE', shape=(), dtype=string)
tf.Tensor(b'IKELLPHLSGIIDSVANAIK', shape=(), dtype=string)
tf.Tensor(b'FLPLIGKVLSSIL', shape=(), dtype=string)


# 2. Create the vocabulary and instantiate TokenTextEncoder as the encoder

In [6]:
# Collect every distinct amino-acid letter that appears in the dataset
vocab_list = set()
for peptide in tf_peptides:
  residues = list(peptide.numpy().decode("utf-8"))
  vocab_list.update(residues)

print(vocab_list)

{'M', 'W', 'Y', 'I', 'K', 'H', 'G', 'E', 'R', 'T', 'D', 'P', 'Q', 'F', 'S', 'L', 'A', 'N', 'V'}
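
Note that the iteration order of a Python set of strings can change between interpreter runs, so the integer IDs shown in the outputs of this notebook may differ when it is re-executed. A minimal plain-Python sketch of a reproducible alternative sorts the vocabulary before assigning IDs. The names `sorted_vocab`, `char_to_id`, `encode`, and `decode` are hypothetical, and starting the IDs at 1 is an assumption made here to mirror the 1-based IDs seen in the outputs.

```python
peptides = ["KLWKLWLKWLL", "FLPLIGKVLSSIL"]

# Sorting makes the token order, and therefore every ID, reproducible.
sorted_vocab = sorted(set("".join(peptides)))
char_to_id = {aa: idx for idx, aa in enumerate(sorted_vocab, start=1)}
id_to_char = {idx: aa for aa, idx in char_to_id.items()}


def encode(peptide):
    """Map each residue letter to its integer ID."""
    return [char_to_id[aa] for aa in peptide]


def decode(ids):
    """Map integer IDs back to the peptide string."""
    return "".join(id_to_char[i] for i in ids)


encoded = encode("KLWKLWLKWLL")
print(sorted_vocab)
print(encoded)
print(decode(encoded))
```

Passing a sorted list instead of a set to the encoder would be one way to get the same stability in the pipeline itself.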


In [7]:
encoder = tfds.deprecated.text.TokenTextEncoder(vocab_list=vocab_list, decode_token_separator="")

# 3. Create a function to map a peptide into a list of amino-acid token IDs

In [8]:
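def encoder_fn(peptide):
  # Runs eagerly: `peptide` is a scalar string tensor, so .numpy() returns bytes
  peptide = peptide.numpy()
  # Separate the residues with spaces so that each letter becomes one token
  encoded_peptide = encoder.encode(" ".join(peptide.decode("utf-8")))
  return [encoded_peptide]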

In [9]:
for i in tf_peptides.take(2):
  print(i)
  encoded_peptide = encoder_fn(i)
  print(encoded_peptide)

tf.Tensor(b'MPKTRRRPRRSQRKRPPTPWPYGRKKRRQRRR', shape=(), dtype=string)
[[1, 12, 5, 10, 9, 9, 9, 12, 9, 9, 15, 13, 9, 5, 9, 12, 12, 10, 12, 2, 12, 3, 7, 9, 5, 5, 9, 9, 13, 9, 9, 9]]
tf.Tensor(b'KLWKLWLKWLL', shape=(), dtype=string)
[[5, 16, 2, 5, 16, 2, 16, 5, 2, 16, 16]]


We want to apply this function to each element of the dataset with Dataset.map, but Dataset.map runs in graph mode:

  * Graph tensors do not have a value.
  * In graph mode you can only use TensorFlow ops and functions.

Because encoder_fn calls .numpy(), it cannot be mapped directly. Wrapping it in tf.py_function solves this: tf.py_function passes regular tensors (with a value and a .numpy() method to access it) to the wrapped Python function.

In [10]:
def encode_map_fn(peptide):
  # Wrap the eager encoder so that it can be called from graph mode
  encoded_peptide = tf.py_function(encoder_fn,
                                   inp=[peptide],
                                   Tout=tf.int64)

  # tf.py_function doesn't set the shape of the returned tensor, and
  # `tf.data.Dataset` works best if all components have a shape set,
  # so set the shape manually:
  encoded_peptide.set_shape([None])
  return encoded_peptide

# 4. Map the encoding function over the dataset elements

In [11]:
tf_peptides_encoded = tf_peptides.map(encode_map_fn)

# 5. Create a function to build the windowed dataset
For more details and examples that explain the window concept, see [[4]](https://colab.research.google.com/github/ageron/handson-ml2/blob/master/16_nlp_with_rnns_and_attention.ipynb#scrollTo=hhijmvoGyVVF).

In [12]:
def window_map(x):
  """
  input:
   x: tf.Tensor holding one encoded peptide
   window_size: int (global)
  output: tf.data.Dataset of windows, each a tf.Tensor of length window_size
  """
  # Convert the tensor into a tf.data.Dataset of single residues so that we
  # can use the window and batch methods
  x = tf.data.Dataset.from_tensor_slices(x)

  # Create windows of size window_size that slide one position at a time.
  # Each window is itself a nested dataset, so we flatten it back into a
  # single tensor by batching it with a batch size equal to window_size
  x = x.window(window_size, shift=1, drop_remainder=True)
  x = x.flat_map(lambda window: window.batch(window_size))
  return x

The window_map function is applied to the encoded dataset tf_peptides_encoded through flat_map. We use a window size of 10, which means that the features and labels will each have a length of 9. The windows are then grouped into batches of 5 for inspection.
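
The windowing logic can be illustrated without TensorFlow. The sketch below, with the hypothetical helper `sliding_windows`, returns every full window of a sequence and drops incomplete ones, which is the behavior requested by `drop_remainder=True`. A sequence of length n therefore produces n - window_size + 1 windows when the shift is 1.

```python
def sliding_windows(sequence, window_size, shift=1):
    """Return every full window of length window_size, moving by shift."""
    last_start = len(sequence) - window_size
    # A negative last_start gives an empty range, so short sequences
    # produce no windows at all.
    return [sequence[start:start + window_size]
            for start in range(0, last_start + 1, shift)]


peptide = "KLWKLWLKWLL"  # 11 residues, taken from the toy dataset
windows = sliding_windows(peptide, window_size=10)
print(windows)
```

Peptides shorter than the window size contribute no windows, which is worth keeping in mind when choosing window_size for a dataset with short sequences.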


In [13]:
window_size = 10
b_flat = tf_peptides_encoded.flat_map(window_map).batch(5)

In [14]:
for i in b_flat.take(3):
  print(i)

tf.Tensor(
[[ 1 12  5 10  9  9  9 12  9  9]
 [12  5 10  9  9  9 12  9  9 15]
 [ 5 10  9  9  9 12  9  9 15 13]
 [10  9  9  9 12  9  9 15 13  9]
 [ 9  9  9 12  9  9 15 13  9  5]], shape=(5, 10), dtype=int64)
tf.Tensor(
[[ 9  9 12  9  9 15 13  9  5  9]
 [ 9 12  9  9 15 13  9  5  9 12]
 [12  9  9 15 13  9  5  9 12 12]
 [ 9  9 15 13  9  5  9 12 12 10]
 [ 9 15 13  9  5  9 12 12 10 12]], shape=(5, 10), dtype=int64)
tf.Tensor(
[[15 13  9  5  9 12 12 10 12  2]
 [13  9  5  9 12 12 10 12  2 12]
 [ 9  5  9 12 12 10 12  2 12  3]
 [ 5  9 12 12 10 12  2 12  3  7]
 [ 9 12 12 10 12  2 12  3  7  9]], shape=(5, 10), dtype=int64)


# 6. Create datasets with features and labels
With features and labels in place, the data pipeline is ready to train an AMP generation model.

In [15]:
def label_feature(x):
  feature = x[:-1]
  label = x[1:]

  return feature, label
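
The split used here makes each label the feature sequence shifted by one position, so at every position the target is the next residue. A small plain-Python sketch on the first window printed earlier shows the alignment; `split_feature_label` is a hypothetical stand-in that applies the same slicing to a Python list.

```python
def split_feature_label(window):
    """Split a window into inputs and next-token targets."""
    feature = window[:-1]  # everything except the last element
    label = window[1:]     # everything except the first element
    return feature, label


window = [1, 12, 5, 10, 9, 9, 9, 12, 9, 9]  # first window of the first peptide
feature, label = split_feature_label(window)

# Each pair is (current token, token the model should predict next).
pairs = list(zip(feature, label))
print(pairs)
```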

In [16]:
b_flat_ds = tf_peptides_encoded.flat_map(window_map).map(label_feature)

In [17]:
for i in b_flat_ds.take(3):
  print(i)

(<tf.Tensor: shape=(9,), dtype=int64, numpy=array([ 1, 12,  5, 10,  9,  9,  9, 12,  9])>, <tf.Tensor: shape=(9,), dtype=int64, numpy=array([12,  5, 10,  9,  9,  9, 12,  9,  9])>)
(<tf.Tensor: shape=(9,), dtype=int64, numpy=array([12,  5, 10,  9,  9,  9, 12,  9,  9])>, <tf.Tensor: shape=(9,), dtype=int64, numpy=array([ 5, 10,  9,  9,  9, 12,  9,  9, 15])>)
(<tf.Tensor: shape=(9,), dtype=int64, numpy=array([ 5, 10,  9,  9,  9, 12,  9,  9, 15])>, <tf.Tensor: shape=(9,), dtype=int64, numpy=array([10,  9,  9,  9, 12,  9,  9, 15, 13])>)


# Adding repeat and shuffle to our data pipeline
Now that we have confidence in the pipeline's quality, we can add the shuffle, repeat, and batch functions to it.

In [18]:
feature_label_ds = tf_peptides_encoded.flat_map(window_map).map(label_feature).shuffle(3).repeat(10).batch(3)
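
A shuffle buffer of size 3 only mixes elements locally: an element can move a few positions away from where it started, but not further. The plain-Python sketch below approximates how a fixed-size shuffle buffer behaves. It is an illustration of the idea, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical name.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in the order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
print(shuffled)
```

With a buffer of size 3, the element emitted at position k always comes from the first k + 3 inputs. For a real dataset, a buffer at least as large as the number of windows gives a full shuffle.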

In [19]:
for i in feature_label_ds.take(4):
  print(i)

(<tf.Tensor: shape=(3, 9), dtype=int64, numpy=
array([[ 5, 10,  9,  9,  9, 12,  9,  9, 15],
       [ 1, 12,  5, 10,  9,  9,  9, 12,  9],
       [ 9,  9,  9, 12,  9,  9, 15, 13,  9]])>, <tf.Tensor: shape=(3, 9), dtype=int64, numpy=
array([[10,  9,  9,  9, 12,  9,  9, 15, 13],
       [12,  5, 10,  9,  9,  9, 12,  9,  9],
       [ 9,  9, 12,  9,  9, 15, 13,  9,  5]])>)
(<tf.Tensor: shape=(3, 9), dtype=int64, numpy=
array([[ 9,  9, 12,  9,  9, 15, 13,  9,  5],
       [12,  5, 10,  9,  9,  9, 12,  9,  9],
       [ 9, 12,  9,  9, 15, 13,  9,  5,  9]])>, <tf.Tensor: shape=(3, 9), dtype=int64, numpy=
array([[ 9, 12,  9,  9, 15, 13,  9,  5,  9],
       [ 5, 10,  9,  9,  9, 12,  9,  9, 15],
       [12,  9,  9, 15, 13,  9,  5,  9, 12]])>)
(<tf.Tensor: shape=(3, 9), dtype=int64, numpy=
array([[ 9,  9, 15, 13,  9,  5,  9, 12, 12],
       [ 9, 15, 13,  9,  5,  9, 12, 12, 10],
       [10,  9,  9,  9, 12,  9,  9, 15, 13]])>, <tf.Tensor: shape=(3, 9), dtype=int64, numpy=
array([[ 9, 15, 13,  9,  5,  9,

# References
[1] http://crdd.osdd.net/raghava/satpdb/links.php

[2] http://www.uwm.edu.pl/biochemia/index.php/en/biopep/32-bioactive-peptide-databases

[3] https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset#top_of_page

[4] https://colab.research.google.com/github/ageron/handson-ml2/blob/master/16_nlp_with_rnns_and_attention.ipynb#scrollTo=hhijmvoGyVVF