# "Authorship Identification: Part-2 (DistilBERT Transformer)"
> "Using transfer learning to fine-tune a pretrained DistilBERT transformer for authorship identification. In a nutshell, DistilBERT is a small version of BERT that is 'smaller, faster, cheaper, and lighter'. It has 40% fewer parameters than the original BERT, runs 60% faster, and preserves over 95% of BERT's performance (measured on the GLUE language understanding benchmark)."

- toc: false
- sticky_rank: 2
- branch: master
- badges: true
- comments: true
- categories: [project, machine-learning, notebook, python]
- image: images/vignette/base.jpg
- hide: false
- search_exclude: false


# Abstract
**This is a follow-up post on the authorship identification project.**<br/>
I regard the past few years as the inception of the era of Transformers, which started with the popular research paper "Attention Is All You Need" by Vaswani et al. in 2017. Several transformer architectures have appeared since then. Some of the best known are GPT, GPT-2, and the latest, GPT-3, which has outperformed many previous state-of-the-art models on several NLP tasks. BERT (by Google) is also one of the most popular transformers.<br/>
Transformers are often very large models, some with billions of parameters. Pretrained transformers have shown tremendous capability when combined with a downstream task head in transfer learning, much like pretrained CNNs in computer vision.<br/>
In this part, I'll fine-tune a DistilBERT transformer, a smaller version of the original BERT, for the downstream classification task.<br/>
I'll use the `transformers` library from Hugging Face, which provides numerous state-of-the-art transformers and supports several downstream tasks out of the box. In short, I consider Hugging Face a great starting point for anyone interested in NLP, and it offers a lot of useful functionality.<br/>
I'll provide links to resources for you to learn more about these technologies.
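
To make the "downstream task head" concrete: in the Hugging Face implementation, the DistilBERT sequence-classification head is a small feed-forward network on top of the hidden state of the first (`[CLS]`) token. A sketch of that computation, assuming the hidden size of 768 used by `distilbert-base-uncased` and the 50 authors of this project (dropout omitted for brevity):

```latex
\begin{aligned}
h &= \mathrm{DistilBERT}(x)_{[\mathrm{CLS}]} \in \mathbb{R}^{768} \\
z &= W_2\,\mathrm{ReLU}(W_1 h + b_1) + b_2,
  \qquad W_1 \in \mathbb{R}^{768 \times 768},\; W_2 \in \mathbb{R}^{50 \times 768} \\
p(\text{author} = k \mid x) &= \frac{e^{z_k}}{\sum_{j=1}^{50} e^{z_j}}
\end{aligned}
```

The head parameters $W_1, b_1, W_2, b_2$ are newly initialized, while the encoder weights come from pretraining; fine-tuning updates both.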


In [9]:
# Imports
import keras
import tensorflow as tf
import numpy as np
from pathlib import Path
from utils import plot_history
from keras.preprocessing import text_dataset_from_directory

# Dataset location and basic settings
ds_dir = Path('data/C50/')
train_dir = ds_dir / 'train'
test_dir = ds_dir / 'test'
seed = 1000
batch_size = 16


# 80/20 train/validation split of the training directory.
# Both calls must share the same seed so the two subsets do not overlap.
train_ds = text_dataset_from_directory(train_dir,
                                       label_mode='int',
                                       seed=seed,
                                       shuffle=True,
                                       validation_split=0.2,
                                       subset='training')

val_ds = text_dataset_from_directory(train_dir,
                                     label_mode='int',
                                     seed=seed,
                                     shuffle=True,
                                     validation_split=0.2,
                                     subset='validation')

# Held-out test set, loaded from its own directory
test_ds = text_dataset_from_directory(test_dir,
                                      label_mode='int',
                                      seed=seed,
                                      shuffle=True,
                                      batch_size=batch_size)

class_names = train_ds.class_names


Found 2500 files belonging to 50 classes.
Using 2000 files for training.
Found 2500 files belonging to 50 classes.
Using 500 files for validation.
Found 2500 files belonging to 50 classes.
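
The counts in the output above follow directly from `validation_split=0.2`. A small arithmetic sketch (the variable names are mine, and the per-author figure is an average derived from the reported totals) that also gives the chance-level accuracy to compare the model against:

```python
# Numbers reported by text_dataset_from_directory above
n_files = 2500
n_classes = 50
validation_split = 0.2

n_val = int(validation_split * n_files)   # files held out for validation
n_train = n_files - n_val                 # files left for training

docs_per_author = n_files / n_classes     # average, before the split
chance_accuracy = 1 / n_classes           # accuracy of random guessing

print(n_train, n_val)                     # 2000 500
print(docs_per_author)                    # 50.0
print(f"{chance_accuracy:.0%}")           # 2%
```

With only about 50 documents per author and a 2% chance baseline, any validation accuracy well above a few percent already indicates that the model is picking up author-specific signal.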


In [10]:
# Prepare and Configure the datasets
from utils import get_text, prepare_batched
from transformers import DistilBertTokenizerFast

AUTOTUNE = tf.data.AUTOTUNE

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

batch_size = 2  # overrides the value set in the first cell

train_ds = prepare_batched(train_ds, tokenizer, batch_size=batch_size)
val_ds = prepare_batched(val_ds, tokenizer, batch_size=batch_size)
test_ds = prepare_batched(test_ds, tokenizer, batch_size=batch_size)
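
`prepare_batched` comes from the project's `utils` module and is not shown here. The core idea behind preparing text for DistilBERT is that every example in a batch must become a fixed-length list of token ids plus an attention mask marking which positions are real tokens. The following is a pure-Python illustration of that idea, not the tokenizer's actual implementation: `pad_or_truncate` and `make_batches` are hypothetical helpers, the token ids are made up, and special tokens such as `[CLS]` and `[SEP]` are ignored for brevity.

```python
from typing import Dict, Iterator, List


def pad_or_truncate(ids: List[int], max_len: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Return fixed-length input ids and the matching attention mask."""
    kept = ids[:max_len]                      # truncate long sequences
    n_pad = max_len - len(kept)               # number of pad positions needed
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 1 = real token, 0 = padding
    }


def make_batches(examples: List[List[int]], batch_size: int,
                 max_len: int) -> Iterator[List[Dict[str, List[int]]]]:
    """Yield consecutive batches of encoded examples."""
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        yield [pad_or_truncate(ids, max_len) for ids in chunk]


# Toy token ids standing in for three tokenized documents
docs = [[7, 8, 9], [4, 5, 6, 7, 8, 9], [3]]
batches = list(make_batches(docs, batch_size=2, max_len=4))

print(len(batches))      # 2
print(batches[0][0])     # {'input_ids': [7, 8, 9, 0], 'attention_mask': [1, 1, 1, 0]}
print(batches[0][1])     # {'input_ids': [4, 5, 6, 7], 'attention_mask': [1, 1, 1, 1]}
```

With the real tokenizer, DistilBERT accepts at most 512 tokens per example, so longer articles have to be truncated (or split) before they reach the model.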

In [None]:
# Fine-tuning the model
keras.backend.clear_session()
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=50)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),  # `lr` is deprecated
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()]
)

history = model.fit(train_ds, validation_data=val_ds, epochs=20)

plot_history(history, 'sparse_categorical_accuracy')
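# Hugging Face TF models are subclassed Keras models, which cannot be saved
# to a single HDF5 file with model.save(); save_pretrained writes the weights
# and config to a directory instead.
model.save_pretrained("DistilBERT_finetuned")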
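
A quick way to check that training starts sensibly: `SparseCategoricalCrossentropy(from_logits=True)` applies a softmax to the logits and takes the negative log-probability of the true class. The pure-Python sketch below (the function name is mine; it mirrors the formula, not the Keras implementation) shows that a model with no preference among the 50 authors has a loss of ln 50 ≈ 3.91, so the loss at the start of training should be roughly in that range and then fall.

```python
import math
from typing import List


def sparse_categorical_crossentropy_from_logits(logits: List[float], label: int) -> float:
    """Negative log-softmax probability of the true class for one example."""
    m = max(logits)                                   # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


n_classes = 50
uniform_logits = [0.0] * n_classes                    # no preference for any author
loss = sparse_categorical_crossentropy_from_logits(uniform_logits, label=3)

print(round(loss, 3))                                 # 3.912
```

A loss that stays near 3.9 over several epochs would suggest the model is not learning, for example because the learning rate is unsuitable or the labels are misaligned with the inputs.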
