<a href="https://colab.research.google.com/github/badrinarayanan02/LLM/blob/main/2348507_LLMLab7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization and Encodings for a Domain Specific Dataset

This notebook fine-tunes the DistilBERT language model using Transformers and TensorFlow. Tokenizing and encoding the text is an essential first step.


DistilBERT is a smaller version of BERT. It is trained to mimic the pretrained BERT model and requires less computation.

Loading the libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast
import tensorflow as tf
import numpy as np
from transformers import TFDistilBertForSequenceClassification

Loading the Dataset

In [2]:
data = pd.read_csv('/content/TranslatedDigikalaDataset.csv')
data.head()

Unnamed: 0,Comment,Liked
0,A great front-facing phone that's great\r\n,1
1,The appearance of the back may not be like a s...,0
2,Hardware is very powerful and very heavy softw...,1
3,"If you're having trouble with it, it's a good ...",1
4,The screen of this handset is one of the most ...,1


The dataset belongs to the social media domain. It has two columns: `Comment` and `Liked`. Positive comments are labeled 1 and negative comments are labeled 0.

Before DistilBERT can be fine-tuned on these comments, they must be tokenized and encoded.

In [4]:
data.shape

(719, 2)

Converting into Independent and Dependent Features

In [5]:
X = list(data['Comment'])

In [6]:
y = list(data['Liked'])

In [7]:
X[:5]

["A great front-facing phone that's great\r\n",
 "The appearance of the back may not be like a series of friends because it's not very good",
 'Hardware is very powerful and very heavy software runs smoothly',
 "If you're having trouble with it, it's a good phone and a lot better than the rest",
 'The screen of this handset is one of the most revolutionary types available in the market, which has made other manufacturers mimic it']
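The first comment above ends with a stray `\r\n`. The tokenizer will most likely ignore trailing whitespace anyway, so this is mainly a tidiness step. A minimal sketch of trimming it, using a short sample list (`sample_comments`, a stand-in for the full `X`):

```python
# Sample standing in for the full list of comments `X`.
sample_comments = [
    "A great front-facing phone that's great\r\n",
    "Hardware is very powerful and very heavy software runs smoothly",
]

# str.strip() removes leading and trailing whitespace, including \r and \n.
cleaned_comments = [comment.strip() for comment in sample_comments]

print(repr(cleaned_comments[0]))
```

In the notebook itself the same list comprehension could be applied to `X` before splitting.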

In [8]:
y[:5]

[1, 0, 1, 1, 1]
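Before splitting, it can help to look at how the two labels are distributed, since a strongly imbalanced dataset makes plain accuracy harder to interpret. A small sketch using only the five labels shown above (`sample_labels` is a stand-in for the full `y`):

```python
from collections import Counter

# First five labels, copied from the output above.
sample_labels = [1, 0, 1, 1, 1]

label_counts = Counter(sample_labels)
total = len(sample_labels)

# Print the count and share of each label.
for label, count in sorted(label_counts.items()):
    print(f"label {label}: {count} of {total} ({count / total:.0%})")
```

Running `Counter(y)` on the full list would give the real class balance for all 719 comments.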

Splitting the Data

In [9]:
X_train,X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [10]:
X_train[:5]

['Similar to the previous generation, not having a fast charging price',
 'Its good quality is its superb camera quality, especially its selfie camera',
 'Extremely fast and easy to use. High-performance features',
 "Do not get hot in the usual use. I'm completely satisfied with the use of it in six months.",
 'Antenna is good and can not be disconnected']
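As a rough check on the split sizes, the sketch below works them out by hand. It assumes all 719 rows are used and that `train_test_split` rounds the test share up for a fractional `test_size`, which is my understanding of scikit-learn's behavior rather than something shown in this notebook:

```python
import math

n_samples = 719   # from data.shape above
test_size = 0.2   # same fraction passed to train_test_split

# Assumption: the test count is rounded up and the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Comparing these numbers with `len(X_train)` and `len(X_test)` in the notebook would confirm the assumption.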

Using transformers

Three Important Steps

1) Load the pretrained model

2) Load the tokenizer (each model has its own matching tokenizer)

3) Convert the encodings to a dataset object

Loading the pretrained tokenizer

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Inference: `DistilBertTokenizerFast` is used here for tokenization.

Performing Encodings

In [16]:
train_encodings = tokenizer(X_train, truncation = True, padding = True)
test_encodings = tokenizer(X_test, truncation = True, padding=True)
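To make the effect of `padding = True` and `truncation = True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_and_truncate`, the token IDs, and `max_length=6` are all made up for illustration. The idea is that every sequence in a batch ends up the same length, and an attention mask marks real tokens with 1 and padding with 0.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Toy illustration: cut sequences to max_length, then pad to the longest one.

    Unlike a real tokenizer, this does not preserve special tokens when truncating.
    """
    truncated = [seq[:max_length] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for three "sentences" of different lengths.
batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 5, 6, 102]],
    max_length=6,
)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The real `train_encodings` and `test_encodings` carry the same two keys, `input_ids` and `attention_mask`, which is what the model consumes.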

Converting these encodings into dataset objects using TensorFlow

In [17]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train
)).batch(16)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
)).batch(16)

Inference: Batched pairs of encodings and labels are the input format the model expects during fine-tuning for sequence classification.
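For a sense of scale, the number of batches each dataset yields can be worked out by hand. This sketch assumes split sizes of 575 training and 144 test examples (derived from 719 rows with `test_size=0.2`, not read from the notebook output) and relies on `.batch(16)` keeping a smaller final batch by default. `batch_layout` is a helper name introduced only for this example.

```python
import math


def batch_layout(n_examples, batch_size):
    """Return (number of batches, size of the final batch) when no remainder is dropped."""
    n_batches = math.ceil(n_examples / batch_size)
    last_batch = n_examples - (n_batches - 1) * batch_size
    return n_batches, last_batch


# Assumed split sizes; check len(X_train) and len(X_test) in the notebook.
for name, n in [("train", 575), ("test", 144)]:
    n_batches, last_batch = batch_layout(n, batch_size=16)
    print(f"{name}: {n_batches} batches, last batch has {last_batch} examples")
```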

Initializing the Pretrained Model

In [None]:
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

In [19]:
model.compile(
    optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics = ['accuracy']
)

Inference: The Keras `compile` and `fit` API is used here because `TFTrainer` is deprecated in the Transformers library.
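The loss above is created with `from_logits=True` because the model outputs raw logits rather than probabilities, so the softmax is applied inside the loss. A small NumPy sketch of what sparse categorical cross-entropy computes for a single example follows. The function name and the logit values are illustrative, and Keras additionally averages this quantity over the batch.

```python
import numpy as np


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    logits = np.asarray(logits, dtype=np.float64)
    # Log-softmax with the maximum subtracted for numerical stability.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])


logits = [2.8974602, -3.176048]  # illustrative two-class logits

loss_if_label_0 = sparse_categorical_crossentropy_from_logits(logits, label=0)
loss_if_label_1 = sparse_categorical_crossentropy_from_logits(logits, label=1)

print(loss_if_label_0)  # small: the model strongly favors class 0
print(loss_if_label_1)  # large: the model strongly favors the wrong class
```

The integer `label` is what "sparse" refers to: labels are class indices such as 0 or 1, not one-hot vectors.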

In [20]:
history = model.fit(train_dataset, validation_data = test_dataset, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [21]:
loss, accuracy = model.evaluate(test_dataset)
print(f"Test Accuracy: {accuracy:.4f}")

Test Accuracy: 0.9236


In [22]:
predictions = model.predict(test_dataset)



In [33]:
predictions

TFSequenceClassifierOutput(loss=None, logits=array([[ 2.8974602 , -3.176048  ],
       [-2.928574  ,  2.5583763 ],
       [-3.014189  ,  2.6637576 ],
       [ 2.7336676 , -3.0418036 ],
       [-2.400051  ,  2.0738413 ],
       [ 2.8868978 , -3.213017  ],
       [-2.9602954 ,  2.5321367 ],
       [-2.9555707 ,  2.5422337 ],
       [ 2.7116237 , -2.935691  ],
       [-0.3448124 ,  0.18342088],
       [ 2.9197206 , -3.202659  ],
       [-0.7763336 ,  0.52196366],
       [ 2.5420768 , -2.8114276 ],
       [ 2.9299495 , -3.1626778 ],
       [-3.1809778 ,  2.7263377 ],
       [ 2.6206288 , -2.8762083 ],
       [-1.5650862 ,  1.255106  ],
       [-1.2775854 ,  1.1024499 ],
       [ 2.7955906 , -3.0574121 ],
       [-2.729405  ,  2.302347  ],
       [-2.9271507 ,  2.4907396 ],
       [-2.8982148 ,  2.526477  ],
       [-2.9406412 ,  2.4755232 ],
       [ 2.833     , -3.1813922 ],
       [ 2.7729177 , -3.1040587 ],
       [-3.0436535 ,  2.657355  ],
       [-2.7511394 ,  2.171258  ],
       [-2

These are the logits: the raw, unnormalized scores output by the model, one per class for each comment.

In [25]:
logits = predictions.logits
probabilities = tf.nn.softmax(logits,axis=1)
predicted_labels = tf.argmax(probabilities, axis=-1).numpy()

The logits are converted to probabilities with softmax, and the predicted label is the class with the highest probability.
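The same conversion can be traced by hand with NumPy. This sketch uses rows 0, 1 and 9 of the logits printed earlier; the `softmax` helper is written out here for illustration and stands in for `tf.nn.softmax`.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the row maximum subtracted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Rows 0, 1 and 9 of the logits printed earlier.
sample_logits = np.array([
    [2.8974602, -3.176048],
    [-2.928574, 2.5583763],
    [-0.3448124, 0.18342088],
])

probs = softmax(sample_logits)
labels = probs.argmax(axis=1)

print(probs.round(3))
print(labels)
```

Since softmax preserves the ordering of the scores, taking `argmax` of the logits directly gives the same labels. The probabilities are still useful for spotting low-confidence predictions, such as the third row here, where the two logits are close together.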

In [27]:
y_test_np = np.array(y_test)
print('Predicted Labels:',predicted_labels[:20])

Predicted Labels: [0 1 1 0 1 0 1 1 0 1 0 1 0 0 1 0 1 1 0 1]
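The cell above builds `y_test_np` but does not compare it with the predictions. A sketch of that comparison follows, using the first ten predicted labels from the output and a set of hypothetical true labels made up for illustration (the notebook does not print the real ones).

```python
import numpy as np

# First ten predicted labels from the output above.
predicted = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])
# Hypothetical true labels, invented for this sketch only.
true = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1])

# Fraction of positions where prediction and truth agree.
accuracy = (predicted == true).mean()

# Confusion counts: rows are true labels, columns are predicted labels.
confusion = np.zeros((2, 2), dtype=int)
for t, p in zip(true, predicted):
    confusion[t, p] += 1

print(f"accuracy: {accuracy:.2f}")
print(confusion)
```

In the notebook, `predicted_labels` and `y_test_np` would take the place of `predicted` and `true`.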


In [28]:
model.save_pretrained("sentiment_model")

Predictions on user inputs

In [29]:
def preprocess_input(text,tokenizer):
  encodings = tokenizer(text, truncation = True, padding = True, return_tensors='tf')
  return encodings

In [31]:
def predict_sentiment(text,model,tokenizer):
  input_encodings = preprocess_input(text,tokenizer)
  predictions = model(input_encodings)
  logits = predictions.logits
  probabilities = tf.nn.softmax(logits,axis=1)
  predicted_labels = tf.argmax(probabilities, axis=-1).numpy()[0]

  label_map = {0:'disliked',1:'liked'}
  predictedclass = label_map[predicted_labels]
  return predictedclass

In [32]:
user_comment = "The person who took the session was really bad. He was so rude."
predicted_class = predict_sentiment(user_comment,model,tokenizer)
print(f"Comments is classified as {predicted_class}")

Comments is classified as disliked


# Conclusion

This notebook fine-tuned the DistilBERT model on a custom, domain-specific social media dataset with two columns, so that the model can classify the sentiment of comments. The Transformers library handled tokenization and encoding, both of which are essential for fine-tuning, and TensorFlow converted the encodings into dataset objects. Training used the Keras API because I ran into an issue with `TFTrainer`, which is deprecated in the Transformers library.