# Using TensorFlow to train a neural network classifier on a dataset of enhancers in Drosophila

In [None]:
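# Install the required packages
# (tensorflow-macos is the macOS build of TensorFlow; on other platforms install tensorflow==2.12.0 instead)
%pip install genomic-benchmarks
%pip install tensorflow-macos==2.12.0
%pip install tensorflow-addons
# Upgrading typing-extensions works around a TF installation issue
%pip install typing-extensions --upgrade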

## Data download

In [1]:
# We will download the Drosophila enhancers dataset from
# https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/tree/main, described in:
# Grešová, Katarína, et al. "Genomic benchmarks: a collection of datasets for
# genomic sequence classification." BMC Genomic Data 24.1 (2023): 25.

# Import the required libraries
from pathlib import Path
import tensorflow as tf
import tensorflow_addons as tfa 
import numpy as np

from genomic_benchmarks.loc2seq import download_dataset
from genomic_benchmarks.data_check import is_downloaded, info
from genomic_benchmarks.models.tf import vectorize_layer
from genomic_benchmarks.models.tf import get_basic_cnn_model_v0 as get_model

if not is_downloaded('drosophila_enhancers_stark'):
    download_dataset('drosophila_enhancers_stark')


TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 

 The versions of TensorFlow you are currently using is 2.12.0 and is not supported. 
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
Downloading...
From (original): https://drive.google.com/uc?id=1D8u3m09CNIv8e4-5rOu5wKuwcLl1eejs
From (redirected): htt

In [2]:
# Print a summary of the dataset (classes, sequence lengths, train/test split)
info('drosophila_enhancers_stark', 0)

Dataset `drosophila_enhancers_stark` has 2 classes: negative, positive.

The length of genomic intervals ranges from 236 to 3237, with average 2118.1238067688746 and median 2142.0.

Totally 6914 sequences have been found, 5184 for training and 1730 for testing.


| class | train | test |
|---|---|---|
| negative | 2592 | 865 |
| positive | 2592 | 865 |


## Creation of a TensorFlow dataset

We will create a TF Dataset to train the model. The downloaded dataset already has one subdirectory per class under `train/` and `test/`, so we can call the ```tf.keras.preprocessing.text_dataset_from_directory``` function directly, as follows.

In [3]:
BATCH_SIZE = 64
SEQ_PATH = Path.home() / '.genomic_benchmarks' / 'drosophila_enhancers_stark'
# Class names are taken from the subdirectories of train/ (negative, positive);
# sorting makes the label order deterministic across runs
CLASSES = sorted(x.stem for x in (SEQ_PATH / 'train').iterdir() if x.is_dir())
NUM_CLASSES = len(CLASSES)
# Create the TF dataset
train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'train',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

Found 5184 files belonging to 2 classes.
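
The loader relies on a simple convention: one subdirectory per class, one text file per example, and the label is the index of the subdirectory in `class_names`. The following is a minimal sketch of that convention using only the standard library and a throwaway directory with made-up sequences. The helper `load_labeled_texts` and the toy file names are hypothetical and are not part of TensorFlow or `genomic_benchmarks`.

```python
import tempfile
from pathlib import Path


def load_labeled_texts(root, class_names):
    """Return (text, label) pairs; label is the index of the class subdirectory."""
    pairs = []
    for label, name in enumerate(class_names):
        for path in sorted((root / name).iterdir()):
            pairs.append((path.read_text(), label))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Toy stand-in for a train/ directory with two class subdirectories
    toy = {"negative": ["ACGT", "TTGA"], "positive": ["GGCA"]}
    for name, seqs in toy.items():
        (root / name).mkdir()
        for i, seq in enumerate(seqs):
            (root / name / f"{i}.txt").write_text(seq)

    # Sorting the names fixes which class gets label 0 and which gets label 1
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    pairs = load_labeled_texts(root, class_names)

print(class_names)
print(pairs)
```

With sorted names, `negative` maps to label 0 and `positive` to label 1, which is the orientation a binary F1 score usually assumes.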


In [4]:
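# One-hot encode the labels only for multi-class datasets;
# with two classes the labels stay as scalar 0/1 values
if NUM_CLASSES > 2:
    train_dset = train_dset.map(lambda x, y: (x, tf.one_hot(y, depth=NUM_CLASSES)))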
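
For this dataset `NUM_CLASSES` is 2, so the branch above is skipped and each label remains a single 0 or 1. For reference, here is a small plain-Python sketch of what one-hot encoding would produce in a hypothetical three-class case. The `one_hot` helper is illustrative only and stands in for `tf.one_hot`.

```python
def one_hot(label, depth):
    """Return a list of length `depth` with 1.0 at position `label` and 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(depth)]


# Hypothetical integer labels for a three-class problem
labels = [0, 2, 1]
encoded = [one_hot(y, depth=3) for y in labels]
print(encoded)
```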

## Text vectorization

We will use the TF ```TextVectorization``` layer, configured to split each sequence into single characters, to convert the strings to integer tensors.

In [5]:
vectorize_layer.adapt(train_dset.map(lambda x, y: x))
VOCAB_SIZE = len(vectorize_layer.get_vocabulary())
vectorize_layer.get_vocabulary()

2025-01-20 12:13:12.463995: W tensorflow/tsl/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz


['', '[UNK]', 't', 'a', 'c', 'g']
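
The vocabulary above reserves index 0 for padding (`''`) and index 1 for unknown tokens (`'[UNK]'`), followed by the four nucleotides. As a rough illustration of character-level vectorization, the sketch below looks up each character in that vocabulary. Lower-casing the input is an assumption inferred from the lower-case vocabulary, and the `encode` helper is hypothetical rather than the layer's actual implementation.

```python
VOCAB = ['', '[UNK]', 't', 'a', 'c', 'g']  # vocabulary printed above
TOKEN_TO_INDEX = {token: index for index, token in enumerate(VOCAB)}


def encode(sequence):
    """Lower-case the sequence, split it into characters, and map each to its index."""
    unk = TOKEN_TO_INDEX['[UNK]']
    return [TOKEN_TO_INDEX.get(ch, unk) for ch in sequence.lower()]


# 'N' is not in the vocabulary, so it maps to the [UNK] index
print(encode("ACGTN"))
```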

In [6]:
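def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    # Subtract 2 so that the nucleotide tokens start at index 0
    # (the padding token '' becomes -2 and '[UNK]' becomes -1)
    return vectorize_layer(text) - 2, label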

train_ds = train_dset.map(vectorize_text)
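
Subtracting 2 from the vectorized output shifts the indices so that the four nucleotides occupy 0 to 3, while padding and unknown tokens become negative. The sketch below shows the shift on the full vocabulary. It also shows what happens if a later layer one-hot encodes the indices the way `tf.one_hot` does, where out-of-range indices yield all-zero vectors. Whether the model does this internally is an assumption, and `one_hot_or_zero` is a hypothetical helper.

```python
# Indices produced for '', '[UNK]', 't', 'a', 'c', 'g'
vectorized = [0, 1, 2, 3, 4, 5]
shifted = [i - 2 for i in vectorized]
print(shifted)


def one_hot_or_zero(index, depth=4):
    """Mimic tf.one_hot: indices outside [0, depth) give an all-zero vector."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


# Padding (-2) carries no signal; 't' (0) activates the first channel
print(one_hot_or_zero(-2))
print(one_hot_or_zero(0))
```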

## Model training

We will use the simple convolutional neural network (`get_basic_cnn_model_v0`) that ships with the `genomic_benchmarks` package.

In [7]:
model = get_model(NUM_CLASSES, VOCAB_SIZE)

In [22]:
# Model compilation
model.compile(
    # Loss function to minimize
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    # Optimizer
    optimizer='adam',
    # List of metrics to monitor; a threshold of 0.0 on logits
    # corresponds to a probability of 0.5
    metrics=[
        tf.metrics.BinaryAccuracy(threshold=0.0),
        tfa.metrics.F1Score(name='f1', num_classes=1),
    ])
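
Because the loss is built with `from_logits=True`, the model is expected to output raw scores rather than probabilities, which is why the accuracy threshold is 0.0 instead of 0.5. The plain-Python sketch below illustrates both points for a single example: a logit above 0 is the same decision as a probability above 0.5, and cross-entropy computed from the logit equals cross-entropy computed from the sigmoid probability. The helper names are illustrative and this is not the Keras implementation.

```python
import math


def sigmoid(z):
    """Map a raw logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(y_true, logit):
    """Numerically stable binary cross-entropy computed directly from a logit."""
    return max(logit, 0.0) - logit * y_true + math.log1p(math.exp(-abs(logit)))


def bce_from_prob(y_true, p):
    """Textbook binary cross-entropy computed from a probability."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


logit = 1.5
print(sigmoid(0.0))  # a logit of 0 sits exactly on the decision boundary
print(bce_from_logit(1, logit), bce_from_prob(1, sigmoid(logit)))
print((logit > 0.0) == (sigmoid(logit) > 0.5))
```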

In [23]:
# Train the model
EPOCHS = 10
# fit() iterates over the batched dataset, updating the weights after each batch,
# and repeats the full pass once per epoch.
history = model.fit(
    train_ds,
    epochs=EPOCHS)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
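
As a quick sanity check on the training schedule, the number of batches per epoch follows from the training set size and the batch size reported earlier. This is simple arithmetic on the figures already shown in this notebook.

```python
import math

num_train = 5184   # training sequences reported by the dataset loader
batch_size = 64    # BATCH_SIZE used when building the dataset
epochs = 10        # EPOCHS passed to fit()

# The last batch may be partial, hence the ceiling
steps_per_epoch = math.ceil(num_train / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```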


The F1 score is the harmonic mean of precision and recall, so it is high only when both are high. A perfect model would have a score of 1. The binary accuracy is the fraction of predictions that match the binary labels.
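
To make the definition of F1 concrete, here is a short sketch that computes precision, recall, and F1 from confusion-matrix counts and compares F1 with the plain arithmetic mean. The counts are made up for illustration and do not come from this model.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 90 true positives, 10 false positives, 60 false negatives
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=60)
arithmetic_mean = (precision + recall) / 2
print(precision, recall, f1, arithmetic_mean)
```

The harmonic mean is pulled toward the smaller of the two values, so it penalizes a model that trades recall for precision (or the reverse) more than an arithmetic mean would.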

## Evaluation on the test set

Finally, we can do the same pre-processing for the test set and evaluate the F1 score of our model.

In [24]:
test_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'test',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

if NUM_CLASSES > 2:
    test_dset = test_dset.map(lambda x, y: (x, tf.one_hot(y, depth=NUM_CLASSES)))
test_ds = test_dset.map(vectorize_text)

Found 1730 files belonging to 2 classes.


In [25]:
model.evaluate(test_ds)



[0.756486713886261, 0.5289017558097839, array([0.6666667], dtype=float32)]
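
The three values follow the order used in `compile`: loss, binary accuracy, and F1. When reading them, it helps to compare against a trivial baseline. The sketch below computes the accuracy and F1 of a classifier that labels every test sequence as positive, using the balanced test split reported earlier (865 negative, 865 positive). The helper names are illustrative.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1) from two equal-length lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Balanced test split from the dataset summary
y_true = [0] * 865 + [1] * 865
always_positive = [1] * len(y_true)

baseline_f1 = f1_score(y_true, always_positive)
baseline_acc = accuracy(y_true, always_positive)
print(round(baseline_f1, 4), baseline_acc)
```

The baseline F1 of 2/3 coincides with the F1 value printed above, and the reported accuracy is only slightly above 0.5. This suggests, but does not prove, that the F1 metric as configured may be treating every prediction as positive. Recomputing F1 from logits thresholded at 0 would be one way to check.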