<h1>Text Classification with TensorFlow and Hugging Face Transformers</h1>
<h3>This notebook fine-tunes a pretrained Hugging Face model to classify a text string into one of a set of labels. Fine-tuning requires sample text with correct labels. The notebook is generic, so you can feed in any data as long as it has two columns: text and label.</h3>

<h2>Parameters for this notebook</h2>
<h3>
Model Name: bert-base-multilingual-uncased<br>
Model Source: Hugging Face<br>
Trainable Parameters: approximately 167M<br>
</h3>


In [None]:
%pip install \
    tensorflow==2.13.0 \
    transformers==4.34.0 \
    datasets==2.14.5 \
    evaluate==0.4.1 \
    scikit-learn==1.3.0 --quiet

In [None]:
from transformers import AutoTokenizer
from transformers import TFAutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from transformers import create_optimizer
from transformers.keras_callbacks import KerasMetricCallback
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from datasets import Dataset
import tensorflow as tf
import matplotlib.pyplot as plt
import evaluate
import numpy as np
import pandas as pd
import json, random
from datetime import datetime


In [None]:
# set seeds
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

set_seed(42)

<h3>This notebook was executed on a single T4 GPU. By default, TensorFlow allocates nearly all GPU memory up front when the model is initialized. set_memory_growth is used here to allocate memory on demand instead, which is useful when running multiple notebooks on the same GPU.</h3>

In [None]:
gpus = tf.config.list_physical_devices('GPU')
print(gpus)
# enable on-demand allocation on every visible GPU; the loop is a no-op when no GPU is present
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)


<h3>Parameters for preparing data. I trained this model in multiple iterations, sometimes duplicating the data to maintain label balance.</h3>

In [None]:
labeled_data_path = 'sample-data.csv'
t_day = datetime.now().strftime('%Y-%m-%d-%H')
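
The notebook expects the CSV to have a text column and a label column. As a minimal sketch of that layout, the snippet below writes and re-reads a tiny file. The rows and the file name example-data.csv are made up for illustration and are not part of the original data.

```python
import csv

# Hypothetical rows: free text plus an integer class id per row
rows = [
    ("the delivery arrived on time", 1),
    ("the package was damaged", 0),
    ("great support, very helpful", 1),
    ("still waiting for a refund", 0),
]

example_path = "example-data.csv"  # illustrative name, not the notebook's real file

with open(example_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])  # the two columns the notebook reads
    writer.writerows(rows)

# Read the file back to confirm the layout
with open(example_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded[0])
```

Pointing labeled_data_path at a file shaped like this should be enough for the loading cell to run, although four rows are far too few to train or stratify on.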

<h2>Load Data</h2>
<h3>Load the data, do some basic processing, split it into train, validation and test sets, and finally convert each split to a datasets.Dataset</h3>

In [None]:
# load data into dataframe
data = pd.read_csv(labeled_data_path)

# set datatype of text to string
data['text'] = data['text'].astype(str)

# the model expects integer class ids; if labels are strings, map them to integers first
data['label'] = data['label'].astype(int)

print(f'No. of rows in data: {data.shape[0]}')

# print first 10 rows
print(f'First few rows look like this:\n{data[:10]}')

# unique label values and counts
print(f'Label counts: {data.label.value_counts()}')

# split data into train, valid and test sets
train_df, test_df = train_test_split(data, random_state=42, train_size=0.9, stratify=data.label.values)
train_df, valid_df = train_test_split(train_df, random_state=42, train_size=0.8, stratify=train_df.label.values)
print(f'train: {train_df.shape}; valid: {valid_df.shape}; test: {test_df.shape}')

# convert to datasets.Dataset
train_dataset = Dataset.from_pandas(train_df)
valid_dataset = Dataset.from_pandas(valid_df)
test_dataset = Dataset.from_pandas(test_df)
print('Converted to datasets.Dataset')

# print first few rows of train_dataset
print('First few rows of train_dataset:')
print(train_dataset[0:2])
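
The model expects integer class ids running from 0 to the number of labels minus 1. If the label column holds strings, one option is to build a lookup table before casting. The sketch below uses made-up label names, and label2id and id2label are only illustrative variable names.

```python
# Hypothetical string labels as they might appear in the label column
raw_labels = ["sports", "politics", "sports", "tech", "politics"]

# Sort the unique names so the mapping is deterministic across runs
classes = sorted(set(raw_labels))
label2id = {name: idx for idx, name in enumerate(classes)}
id2label = {idx: name for name, idx in label2id.items()}

# Replace every string label with its integer id
encoded = [label2id[name] for name in raw_labels]

print(label2id)
print(encoded)
```

With a pandas DataFrame, the same mapping could be applied with data['label'].map(label2id) before the integer cast.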


<h2>Model hyperparameters</h2>

In [None]:
# model parameters

model_name = 'bert-base-multilingual-uncased'  # drop in any other model from the Hugging Face model hub
batch_size = 32
shuffle_buffer_size = 1000
label_length = len(data.label.unique())
num_epochs = 10


<h2>Tokenize data</h2>


In [None]:
# initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# tokenize dataset
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True)

# tokenize train and valid datasets
train_tokenized = train_dataset.map(tokenize, batched=True)
valid_tokenized = valid_dataset.map(tokenize, batched=True)
print('train and valid datasets tokenized')

# print first 2 rows of tokenized train dataset
print(train_tokenized[0:2])

<h2>Initialize Model</h2>
Load the pretrained model from Hugging Face

In [None]:

# initialize model
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=label_length)

<h2>Convert tokenized dataset to TF datasets</h2>

In [None]:
# set format for tensorflow
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

# prepare tensorflow datasets
tf_train_set = model.prepare_tf_dataset(
    train_tokenized,
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    valid_tokenized,
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)
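
DataCollatorWithPadding pads each batch to the length of its longest sequence rather than to one fixed length for the whole dataset. The following plain-Python sketch shows that idea only and is not the library's implementation. pad_batch is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one in the batch and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # right-pad with pad_id
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up sequences of different lengths
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch keeps short batches short, which usually saves compute compared with padding every row to the longest text in the dataset.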

<h2>Define optimizer and custom metric callback function</h2>

In [None]:
# define optimizer
batches_per_epoch = len(train_tokenized) / batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

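# custom metrics
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    result = accuracy.compute(predictions=predictions, references=labels)
    # use a distinct key: the callback merges this dict into the Keras logs, where an
    # 'accuracy' key would overwrite the training accuracy tracked by compile(metrics=...)
    return {'eval_accuracy': result['accuracy']}

# metric callback that reports accuracy on the validation set after each epoch
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)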

callbacks = [metric_callback]
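
With num_warmup_steps=0 and its other arguments left at their defaults, create_optimizer is understood to decay the learning rate linearly from init_lr toward zero over num_train_steps. The sketch below reproduces that arithmetic in plain Python for intuition. The training-set size of 7,200 rows is hypothetical, and linear_decay_lr is an illustrative helper, not a library function.

```python
def linear_decay_lr(step, init_lr, total_steps):
    """Learning rate after `step` updates under a linear decay to zero."""
    step = min(step, total_steps)  # hold at zero once the schedule is finished
    return init_lr * (1 - step / total_steps)

example_rows = 7200       # hypothetical number of training rows
example_batch_size = 32
example_epochs = 10

example_batches_per_epoch = example_rows // example_batch_size      # 225 full batches
example_total_steps = example_batches_per_epoch * example_epochs    # 2250 optimizer steps

# Learning rate at the start, the midpoint and the end of training
for step in (0, example_total_steps // 2, example_total_steps):
    print(step, linear_decay_lr(step, 2e-5, example_total_steps))
```

If the step count passed to the optimizer is smaller than the number of steps actually run, the rate reaches zero early and the remaining updates have no effect, so it is worth checking this number against the epoch count.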


<h2>Compile the model</h2>

In [None]:
model.compile(optimizer=optimizer, metrics=['accuracy'])
model.summary()

<h2>Train the model</h2>

In [None]:
# train the model
h = model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=num_epochs, callbacks=callbacks, verbose=1)

<h2>Plot accuracy and loss for training and validation sets</h2>

In [None]:
# get accuracy and loss from history
acc = h.history['accuracy']
val_acc = h.history['val_accuracy']
loss = h.history['loss']
val_loss = h.history['val_loss']

# plot accuracy
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'r', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()


# plot loss
plt.clf()
plt.plot(epochs, loss, 'r', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend()
plt.show()

<h2>Predict the test set</h2>
Predict on the test set and check the accuracy

In [None]:
# predict the test data
test_true = test_df.label.values
test_tokenized = test_dataset.map(tokenize, batched=True)
tf_test_set = model.prepare_tf_dataset(
    test_tokenized,
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)
test_pred = model.predict(tf_test_set)
test_pred = np.argmax(test_pred.logits, axis=1)
test_df['pred_label'] = test_pred

# get accuracy (a separate name keeps the evaluate metric stored in `accuracy` intact)
test_accuracy = accuracy_score(test_true, test_pred)
print(f'Test Accuracy: {test_accuracy:.2%}')

# get classification report
report = classification_report(test_true, test_pred)
print(report)

# get confusion matrix
matrix = confusion_matrix(test_true, test_pred)
print(matrix)
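
When reading the output, note that in scikit-learn's confusion_matrix the rows are true labels and the columns are predicted labels. The sketch below recomputes the matrix, the accuracy and the per-class recall by hand on made-up labels, so the numbers printed above are easier to interpret.

```python
# Made-up true and predicted labels for three classes
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

n_classes = 3
cm = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1  # row = true label, column = predicted label

# Accuracy is the share of samples on the diagonal
correct = sum(cm[i][i] for i in range(n_classes))
acc_by_hand = correct / len(y_true)

# Recall per class: diagonal entry divided by the row total
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

print(cm)
print(acc_by_hand, recall)
```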

<h2>Save model</h2>

In [None]:
# save the model and the tokenizer to the same directory
save_dir = 'model/saved_model-%s' % t_day
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
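
A saved model is typically reloaded later for inference. The following is a minimal sketch, assuming the model and tokenizer were saved to the same directory. saved_model_dir and predict_labels are hypothetical helper names, and the timestamp is made up.

```python
def saved_model_dir(t_day):
    """Build the directory name used when saving, e.g. model/saved_model-2024-01-15-09."""
    return f"model/saved_model-{t_day}"

def predict_labels(texts, model_dir):
    """Reload tokenizer and model from model_dir and return predicted class ids."""
    # Imported here so the path helper above works without these packages installed
    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = TFAutoModelForSequenceClassification.from_pretrained(model_dir)
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
    logits = model(**inputs).logits
    return logits.numpy().argmax(axis=1).tolist()

example_dir = saved_model_dir("2024-01-15-09")  # hypothetical timestamp
print(example_dir)

# Requires a model saved at example_dir, so it is left commented out:
# predict_labels(["some text to classify"], example_dir)
```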

