# CleanLearning using Keras and a HuggingFace Model for IMDB Reviews (Text Classification) formatted as a TensorFlow Dataset

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cleanlab/examples/blob/master/huggingface_keras_imdb/huggingface_keras_imdb.ipynb) 

This example demonstrates the use of a cleanlab-compatible Keras classifier (built on a pretrained BERT model from HuggingFace) to find issues in the IMDB Reviews dataset and train an improved classifier model using `cleanlab`'s CleanLearning.

Please install the dependencies specified in this [requirements.txt](https://github.com/cleanlab/examples/blob/master/huggingface_keras_imdb/requirements.txt) file before running the notebook. 

In [1]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from transformers import AutoTokenizer, TFAutoModel
from transformers import logging
from sklearn.metrics import accuracy_score
import os

from cleanlab.models.keras import KerasWrapperModel
from cleanlab.classification import CleanLearning

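logging.set_verbosity(40)  # 40 == ERROR: only show error messages from transformers
os.environ['TOKENIZERS_PARALLELISM'] = 'false'  # disable parallelism in the tokenizers library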

## Importing and pre-processing the IMDB Reviews Dataset

Here we load the IMDB Reviews dataset (using its `test` split as validation data) and view a sample of our training and validation data:

In [2]:
train = tfds.load('imdb_reviews', split='train', shuffle_files=True)
val = tfds.load('imdb_reviews', split='test', shuffle_files=True)

In [None]:
# using only the first 1000 datapoints to reduce execution time

train_df = tfds.as_dataframe(train)[:1000].copy()
train_df['text'] = train_df['text'].apply(lambda x: x.decode('utf-8'))

val_df = tfds.as_dataframe(val)[:1000].copy()
val_df['text'] = val_df['text'].apply(lambda x: x.decode('utf-8'))

In [4]:
train_df.head()

| | label | text |
|---|---|---|
| 0 | 0 | This was an absolutely terrible movie. Don't b... |
| 1 | 0 | I have been known to fall asleep during films,... |
| 2 | 0 | Mann photographs the Alberta Rocky Mountains i... |
| 3 | 1 | This is the kind of film for a snowy Sunday af... |
| 4 | 1 | As others have mentioned, all the women that g... |


In [5]:
val_df.head()

| | label | text |
|---|---|---|
| 0 | 1 | There are films that make careers. For George ... |
| 1 | 1 | A blackly comic tale of a down-trodden priest,... |
| 2 | 0 | Scary Movie 1-4, Epic Movie, Date Movie, Meet ... |
| 3 | 0 | Poor Shirley MacLaine tries hard to lend some ... |
| 4 | 1 | As a former Erasmus student I enjoyed this fil... |


Then, we use the tokenizer of a pretrained BERT model to tokenize the text, padding or truncating every review to 50 tokens.

In [6]:
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

train_input = tokenizer(
    train_df["text"].to_list(),
    padding="max_length",
    truncation=True,
    max_length=50,
    return_tensors="tf",
)

val_input = tokenizer(
    val_df["text"].to_list(),
    padding="max_length",
    truncation=True,
    max_length=50,
    return_tensors="tf",
)
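
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch. The `toy_encode` helper and its whitespace vocabulary are hypothetical and are not how the BERT tokenizer splits text (the real tokenizer uses subword units and also adds special tokens such as `[CLS]` and `[SEP]`). The sketch only illustrates that every sequence comes out with exactly `max_length` ids, plus an attention mask marking real tokens (1) versus padding (0).

```python
def toy_encode(text, vocab, max_length, pad_id=0):
    """Hypothetical stand-in for a tokenizer: split on whitespace, map tokens
    to integer ids, then truncate or pad to exactly max_length."""
    # Assign a new id (starting at 1) to each unseen token; 0 is reserved for padding.
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in text.lower().split()]
    ids = ids[:max_length]                          # mimics truncation=True
    n_real = len(ids)
    ids = ids + [pad_id] * (max_length - n_real)    # mimics padding="max_length"
    attention_mask = [1] * n_real + [0] * (max_length - n_real)
    return {"input_ids": ids, "attention_mask": attention_mask}


vocab = {}
short_review = toy_encode("a great film", vocab, max_length=5)
long_review = toy_encode("this was an absolutely terrible movie", vocab, max_length=5)

print(short_review)  # padded up to 5 ids
print(long_review)   # cut down to 5 ids
```

With `max_length=50` as in the cell above, any review longer than 50 tokens loses its tail, which trades some accuracy for faster execution.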

## Define Keras Model and inputs 

Here we specify the function used to build the Keras Model, which will be passed as an argument to the `KerasWrapperModel` class.

In [7]:
def build_model(model_name:str, max_len:int, n_classes:int):
    # define input ids, token type ids and attention mask as inputs to NN
    input_ids = tf.keras.layers.Input(
        shape=(max_len,), dtype='int32', name='input_ids')
    
    token_type_ids = tf.keras.layers.Input(
        shape=(max_len,), dtype='int32', name='token_type_ids')

    attention_mask = tf.keras.layers.Input(
        shape=(max_len,), dtype='int32', name='attention_mask')

    # get bert main layer and add it to the NN, passing in inputs
    bert_layer = TFAutoModel.from_pretrained(model_name)
    layer = bert_layer(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)[1]
    output_layer = tf.keras.layers.Dense(n_classes, activation='sigmoid')(layer)

    # model instance
    model = tf.keras.Model(inputs=[input_ids, token_type_ids, attention_mask], outputs=output_layer)
    model.summary()
    return model
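
The output layer above applies a sigmoid to each of the `n_classes` units independently, so the scores in a row are not guaranteed to sum to 1. A softmax is the usual alternative when a probability distribution over mutually exclusive classes is wanted. The NumPy sketch below compares the two on made-up logits (the values are illustrative and not taken from the model).

```python
import numpy as np

# Made-up logits for two examples and two classes (illustrative only).
logits = np.array([[2.0, -1.0],
                   [0.5, 0.5]])

# Sigmoid squashes each unit independently.
sigmoid = 1.0 / (1.0 + np.exp(-logits))

# Softmax normalizes each row into a probability distribution
# (subtracting the row max first for numerical stability).
shifted = logits - logits.max(axis=1, keepdims=True)
softmax = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print("sigmoid row sums:", sigmoid.sum(axis=1))
print("softmax row sums:", softmax.sum(axis=1))
```

Both activations rank the classes within a row identically, so the predicted class is the same; only the calibration of the scores differs.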

Format the tokenized inputs and labels as batched TensorFlow datasets to pass into the Keras classifier.

In [8]:
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_input), np.array(train_df['label']))).batch(64)
val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_input), np.array(val_df['label']))).batch(64)

train_labels = np.array(train_df['label']) # to pass into cl.fit as y input
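
Because `train_labels` is passed separately from `train_dataset`, the i-th label should line up with the i-th example in the dataset. The sketch below uses plain NumPy arrays as a stand-in for the `tf.data` pipeline (an assumption made to keep it dependency-light; `batch_array` is a hypothetical helper) to show how 1000 examples split into batches of 64 and that concatenating the unshuffled batches recovers the labels in their original order.

```python
import numpy as np


def batch_array(arr, batch_size):
    """Split an array into consecutive batches; the last batch may be smaller."""
    return [arr[i:i + batch_size] for i in range(0, len(arr), batch_size)]


labels = np.arange(1000) % 2           # made-up binary labels for 1000 examples
batches = batch_array(labels, 64)

print(len(batches))                    # number of batches
print(len(batches[-1]))                # size of the final, partial batch
print(np.array_equal(np.concatenate(batches), labels))  # order is preserved
```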

Define the Keras model using `KerasWrapperModel`, which is compatible with `CleanLearning`.

In [9]:
model = KerasWrapperModel(
    model=build_model,
    model_kwargs={
        "model_name": MODEL_NAME,
        "max_len": 50,
        "n_classes": 2,
    },
    compile_kwargs= {
      "optimizer":tf.keras.optimizers.Adam(2e-5),
      "loss":tf.keras.losses.SparseCategoricalCrossentropy(),
      "metrics":["accuracy"],
    },
)

# stop training once validation accuracy has not improved for 3 epochs
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy', mode='max', verbose=1, patience=3, restore_best_weights=True)

## Use CleanLearning to find label issues and train improved classifier

Lastly, we train the model using CleanLearning and view its performance:

In [None]:
num_folds = 3  # increase this to 5 or 10 if you're willing to wait longer to more effectively find label issues 

cl = CleanLearning(clf=model, cv_n_folds=num_folds, verbose=True)

cl.fit(
    train_dataset,
    train_labels,
    clf_kwargs={
        "validation_data": val_dataset,
        "epochs": 10, # consider increasing this value to get better performance 
        "shuffle": True,
        "callbacks": [early_stopping],
        "verbose": True,
    },
)       
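
Conceptually, `CleanLearning` first obtains out-of-sample predicted probabilities through cross-validation, uses them to flag examples whose given label looks wrong, and then retrains on the remaining data. The NumPy sketch below illustrates only the flagging step in a simplified form: each class gets a threshold equal to the average predicted probability of that class among examples labeled with it, and an example is flagged when it confidently belongs to a class other than its given label. The function name `flag_label_issues` and the toy probabilities are hypothetical, and cleanlab's actual implementation includes further steps not shown here.

```python
import numpy as np


def flag_label_issues(labels, pred_probs):
    """Simplified confident-learning-style flagging (hypothetical helper)."""
    n_classes = pred_probs.shape[1]
    # Per-class threshold: mean predicted probability of class k
    # over the examples whose given label is k.
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )
    # Which classes does each example clear the threshold for?
    above = pred_probs >= thresholds
    # Among those classes, take the most probable one; -1 if there is none.
    masked = np.where(above, pred_probs, -1.0)
    confident_class = np.where(above.any(axis=1), masked.argmax(axis=1), -1)
    # Flag examples that confidently belong to a class other than their label.
    return (confident_class != -1) & (confident_class != labels)


labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],   # labeled 0, but the model strongly prefers class 1
    [0.2, 0.8],
    [0.3, 0.7],
    [0.7, 0.3],   # labeled 1, but the model prefers class 0
])

print(flag_label_issues(labels, pred_probs))
```

In this toy input the third and sixth examples are flagged. More cross-validation folds give each fold's model more training data, which is why increasing `num_folds` can find label issues more effectively at the cost of runtime.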

In [11]:
predictions = cl.predict(val_dataset)
print('Accuracy on val data: ', accuracy_score(val_df['label'], predictions))

Accuracy on val data:  0.761
