1. Downloading the model 
2. Loading dataset
3. Splitting dataset 
4. Embeddings - Semantic Representations 
5. Model Loading 
6. Train / Val
7. Prediction pipeline 

Image credit: https://jalammar.github.io/

In the last session, we learned how to build a sentiment classifier with traditional NLP techniques.

In this session, we will get familiar with more advanced techniques, especially LLMs, and use them to build the same classifier.

In [1]:
import pandas as pd

In [6]:
# let's start by importing the same dataset we used in the previous session
# this is a dataset of Yelp reviews, with each review labelled as positive or negative
# the dataset is available at data/yelp_reviews.txt
# the file is tab-separated with two columns: text and label

df = pd.read_csv("data/yelp_reviews.txt", sep="\t", header=None, names=["text", "label"])
print(df.shape)
df.head(2)


(1000, 2)


Unnamed: 0,text,label
0,Wow... Loved this place.,1
1,Crust is not good.,0


In [23]:
# let's define some helper functions to pre-process the text and extract features using traditional NLP techniques
# these are the same as in the previous session
# a quick recap of what we did and what the processing/feature extraction looks like

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def process_text(text):
    """
    Processes the input text by lowercasing, tokenizing, dropping
    non-alphabetic tokens, stemming, and removing stop words.
    """
    tokens = word_tokenize(text.lower())
    # stem alphabetic tokens only; the stems stay lowercase and alphabetic
    tokens = [stemmer.stem(token) for token in tokens if token.isalpha()]
    # note: stop words are removed after stemming, so a stem such as "thi"
    # (from "this") no longer matches the stop-word list and is kept
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

def identity_tokenizer(x):
    return x


count_vectorizer = CountVectorizer(tokenizer=identity_tokenizer, lowercase=False)
df["tokens"] = df["text"].apply(process_text)
count_vectorizer.fit(df["tokens"])

def extract_features(text):
    """
    Tokenizes the input text and vectorizes it using the fitted CountVectorizer.
    """
    tokens = process_text(text)
    vectorized_text = count_vectorizer.transform([tokens])
    vectorized_df = pd.DataFrame(
        vectorized_text.todense(), columns=count_vectorizer.get_feature_names_out()
    )
    return vectorized_df





In [25]:
# df["tokens"]
# extract_features("This is a great product, I love it!")

<!-- ![alt text](images/traditional-sentiment-classifier.png) -->

<img src="images/traditional-sentiments-classifier.png">


In [13]:
# let's see what the preprocessing and feature extraction look like

In [5]:
df["tokens"] = df.text.apply(process_text)
print(df.shape)
df.head(2)
# "Crust is not good." -> [crust, good]: removing stop words dropped "not", and with it the negation


(1000, 3)


Unnamed: 0,text,label,tokens
0,Wow... Loved this place.,1,"[wow, love, thi, place]"
1,Crust is not good.,0,"[crust, good]"


In [27]:
text1 = "The city is located on the bank of the river"
text2 = "Let's go to the bank and deposit some money"
vectorized_df1 = extract_features(text1)
vectorized_df2 = extract_features(text2)


In [28]:
vectorized_df1

Unnamed: 0,abov,absolut,absolutley,accid,accommod,accomod,accordingli,account,ach,acknowledg,...,year,yellow,yellowtail,yelper,yet,yucki,yukon,yum,yummi,zero
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
vectorized_df2

Unnamed: 0,abov,absolut,absolutley,accid,accommod,accomod,accordingli,account,ach,acknowledg,...,year,yellow,yellowtail,yelper,yet,yucki,yukon,yum,yummi,zero
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
vectorized_df1["bank"].values, vectorized_df2["bank"].values


(array([1]), array([1]))

In [32]:
# bag-of-words features ignore context
# the word "bank" appears in both sentences, so both get a count of 1 in the "bank" column,
# but the meaning differs: in the first sentence "bank" is the side of a river,
# while in the second it is a financial institution.
# count-based features cannot tell the two apart, because they record only which words
# occur, not the words around them. Removing stop words loses further context
# (e.g. "Crust is not good." became [crust, good]).
#
# This is where BERT comes in. BERT is a transformer-based model that uses attention
# to represent each word in the context of the whole sentence.
# It is pre-trained on a large corpus of text and can be fine-tuned for specific tasks
# such as sentiment analysis.
#
# In the next section, we will use the Hugging Face Transformers library to load a
# pre-trained BERT model, fine-tune it on our dataset, evaluate it on the test set,
# and compare it with the traditional NLP techniques we used earlier.
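
The limitation described above can be reproduced in a few lines. The following is a minimal sketch, not the notebook's pipeline: `collections.Counter` stands in for `CountVectorizer`, and the helper name `bag_of_words` and the two example sentences are made up for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring order (a stand-in for CountVectorizer)."""
    return Counter(text.lower().split())


# Two illustrative sentences with opposite meanings but the same words.
bow_a = bag_of_words("the food was good not bad")
bow_b = bag_of_words("the food was bad not good")

# Word order is discarded, so both sentences get the same representation.
print(bow_a == bow_b)
```

Because the two counters compare equal, any classifier built on these features would have to give both sentences the same prediction.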

In [10]:
# problems of basic methods:
# they ignore word order and context

In [9]:
import transformers
import torch
import datasets




In [38]:
MODEL_NAME = "bert-base-uncased"
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}


In [11]:
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
language_model = transformers.AutoModel.from_pretrained(MODEL_NAME)


In [33]:
# show the embeddings of the same word in two different contexts
text1 = "The city is located on the bank of the river"
text2 = "Let's go to the bank and deposit some money"
inputs1 = tokenizer(text1, return_tensors="pt")
inputs2 = tokenizer(text2, return_tensors="pt")
# print(inputs1)
# tokenizer.decode(inputs1["input_ids"][0][-5]), tokenizer.decode(inputs2["input_ids"][0][-6])


In [34]:

with torch.no_grad():
    outputs1 = language_model(**inputs1)
    outputs2 = language_model(**inputs2)
# the negative indices pick the position of the "bank" token in each tokenized sentence
emb1 = outputs1.last_hidden_state[0, -5, :]
emb2 = outputs2.last_hidden_state[0, -6, :]


In [35]:

print(f"Embedding for 'bank': {emb1}")


Embedding for 'bank': tensor([-7.6490e-01, -6.1477e-03, -1.7706e-03, -1.2309e-01,  9.9662e-02,
         7.4598e-01,  4.8393e-01,  2.0661e+00,  2.6952e-01, -7.3961e-01,
         9.7105e-01,  5.0583e-02, -1.7607e-02,  7.8000e-01, -1.3304e+00,
         6.2705e-01,  2.7906e-01, -6.0350e-02,  9.5401e-01,  3.9542e-01,
        -3.1358e-01,  5.0624e-01,  2.9914e-01,  1.1412e+00, -5.0115e-01,
         3.1991e-02,  9.1508e-02,  4.7217e-01,  3.9590e-01,  1.8770e-01,
         1.0081e+00,  3.7353e-01,  3.3866e-01,  5.4936e-02, -5.4455e-01,
        -5.2585e-01, -4.7111e-01, -5.9590e-01, -7.7133e-01, -1.6177e-02,
        -9.0143e-01, -8.3997e-01, -2.2339e-01,  9.8848e-01,  1.6927e-01,
        -5.9720e-02,  8.2857e-01, -3.5977e-01,  4.6731e-01,  9.1251e-01,
         2.0006e-01,  1.7635e+00,  3.0672e-01, -4.0032e-01, -2.5774e-01,
         4.8264e-01, -6.4338e-01, -7.1181e-01, -7.4939e-01, -5.2484e-01,
         9.1113e-01,  8.0004e-02,  3.3433e-01, -5.2414e-02,  1.7914e-01,
         2.6137e-01, -5.7129e

In [None]:
print(f"Embedding for 'bank': {emb2}")


Embedding for 'bank': tensor([ 5.9083e-01,  4.7655e-02,  6.9020e-02,  6.6070e-02,  1.1558e+00,
        -3.4510e-01, -1.8744e-01,  1.0186e+00, -3.6762e-02,  3.6435e-01,
         3.3772e-01, -9.0620e-01, -2.0067e-01,  6.1692e-02, -7.5448e-01,
        -4.1178e-01,  1.9047e-01,  3.9089e-01,  1.2114e+00,  5.5650e-01,
        -3.6517e-01,  1.7254e-01,  6.7308e-01,  1.0322e-02,  2.5009e-01,
         5.2667e-01,  6.8564e-01,  7.1187e-01, -4.8951e-01, -4.5971e-01,
         4.9835e-01,  5.3231e-01,  5.1483e-01,  3.3559e-01,  3.8936e-01,
        -5.0638e-02, -6.9605e-02,  1.3594e-01, -8.7222e-01, -3.8655e-02,
        -3.8724e-01, -1.0177e+00, -3.1066e-01,  3.9200e-01, -6.5535e-01,
        -6.6317e-01,  5.4388e-01,  4.4756e-01, -6.8695e-01, -6.6272e-01,
        -1.4000e-01,  9.2609e-01,  3.4183e-01, -6.4659e-01,  4.6637e-01,
         5.6735e-01, -8.8206e-01, -4.6287e-01, -8.3607e-01, -2.6055e-01,
         7.6207e-01,  2.2162e-01,  3.1346e-01, -5.7102e-01,  3.8509e-01,
        -2.2441e-01,  4.3680e
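
Reading two long vectors side by side is hard, so one common way to compare them is cosine similarity. The sketch below is an illustration under stated assumptions: the helper `cosine_similarity` is written here in plain Python, and the two short vectors are toy values standing in for the 768-dimensional embeddings, not numbers taken from the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 4-dimensional vectors (hypothetical values, for illustration only).
toy_emb_river = [0.9, 0.1, -0.3, 0.5]
toy_emb_money = [0.2, 0.8, 0.4, -0.1]

print(round(cosine_similarity(toy_emb_river, toy_emb_river), 3))  # a vector vs itself
print(round(cosine_similarity(toy_emb_river, toy_emb_money), 3))  # two different vectors
```

In the notebook, the same function could be called as `cosine_similarity(emb1.tolist(), emb2.tolist())`. A value clearly below 1 would indicate that BERT assigns the two occurrences of "bank" different representations.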

In [None]:
# how do they distinguish the context of the words in the sentence?
# it is due to several things:
# - training on large datasets
# - the architecture of the model
# - the attention mechanism
# - the way the model is trained


# instead of tokenizing, preprocessing to remove noise, stemming, and then counting word occurrences,
# BERT splits the text into tokens and converts the tokens to embeddings
# the embeddings are then passed through the model to get the output
# the model is pre-trained on a large corpus of text and can be fine-tuned for specific tasks such as sentiment analysis, text classification, etc.
# the model takes into account the context of the words in the sentence and provides a more accurate representation of the text


![alt text](images/bert-gen-emb-full.png)

![alt text](images/bert-gen-emb.png)

In [None]:
# language models generate contextual embeddings for the words in a sentence using the attention mechanism,
# which estimates how much each token in the sentence should influence every other token
# this is done by calculating attention scores between each pair of tokens in the sentence

![alt text](images/attention.png)
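
The attention scores described above can be sketched without any library. This is a simplified illustration of scaled dot-product self-attention, not BERT's actual implementation: the function names and the toy 2-dimensional vectors are assumptions, and the same vectors are reused as queries, keys, and values, whereas a real model first applies learned `query`, `key`, and `value` projections.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence. Because every output vector mixes information from the whole sentence, the same word can end up with different embeddings in different contexts.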

In [None]:
# innovation - the transformer architecture
# scales to large datasets and trains faster
# enables transfer learning

# it is built in two stages:
# 1. pre-training: learn the general patterns of language (grammar, syntax, semantics)
#    from large text corpora such as Wikipedia, Common Crawl, etc.
# 2. fine-tuning: adapt the pre-trained model to a specific task with a smaller labelled dataset
# pre-trained models can be used for various tasks like text classification,
# named entity recognition, text generation, etc.
# BERT, GPT, RoBERTa, etc. are examples of pre-trained models
# BERT - Bidirectional Encoder Representations from Transformers
# GPT - Generative Pre-trained Transformer

# pre-training objectives:
# language modelling task: predict the next word in a sentence
# masked language modelling task: predict the masked word in a sentence



<img src="images/bert-transfer-learning.png">

In [None]:
# how did transformers achieve this?
# through the attention mechanism
# earlier models used RNNs, LSTMs, GRUs, etc.
# these models processed the input sequence one token at a time, which made them slow and difficult
# to parallelize
# transformers process the entire input sequence at once, which makes them faster and easier to parallelize
# they use self-attention, which lets the model focus on different parts of the input sequence
# and learn the relationships between its tokens


In [55]:
language_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

# there are two types of language modelling tasks
# 1. Causal LM: next word prediction task: predict the next word in a sentence
# 2. Masked LM: masked language modelling task: predict the masked word in a sentence


## Causal LM

<img src="images/clm.png">

## Masked LM

<img src="images/mlm.png">
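
To make the difference between the two objectives concrete, here is a minimal sketch of how training examples could be built from one sentence. The function names and the example sentence are hypothetical, and real pre-training works on token ids and chooses masked positions at random rather than by hand.

```python
def causal_lm_examples(tokens):
    """Pair each prefix with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, position, mask_token="[MASK]"):
    """Hide one token and keep it as the prediction target."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, target


sentence = ["the", "food", "was", "great"]

# Causal LM: the model sees only the left context.
for context, next_token in causal_lm_examples(sentence):
    print(context, "->", next_token)

# Masked LM: the model sees context on both sides of the hidden token.
masked, target = masked_lm_example(sentence, position=1)
print(masked, "->", target)
```

The causal examples only ever look leftwards, which suits text generation (GPT). The masked example uses words on both sides of the gap, which is what makes BERT's representations bidirectional.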

In [None]:
# language models are powerful tools for natural language processing tasks
# they are trained on large datasets and learn the general patterns of language
# they can be fine-tuned on specific tasks such as sentiment analysis, text classification, etc.
# this combination is the reason why they are so powerful

# the current wave of GenAI is due to advancements in language models:
# with more scaling, better architectures, and more data,
# language models are able to generate human-like text, understand the context of the text, and
# perform various natural language processing tasks with high accuracy

In [40]:
# now let's build a sentiment classifier with BERT


![alt](images/llm-sentiments-classifier.png)

In [41]:
tokenizer

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [48]:
tokenized_text = tokenizer("Hello, my dog is cute!", return_tensors="pt")
tokenized_text

{'input_ids': tensor([[  101,  7592,  1010,  2026,  3899,  2003, 10140,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [49]:
[tokenizer.decode(token_id) for token_id in tokenized_text["input_ids"][0]]

['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '!', '[SEP]']

In [50]:
# split the dataset into train and test sets
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(
    df["text"].values, df["label"].values, test_size=0.2, random_state=42
)
print(f"Train shape: {train_x.shape}, Test shape: {test_x.shape}")

Train shape: (800,), Test shape: (200,)


In [51]:
# using datasets library to create datasets for training and testing
import datasets
train_dataset = datasets.Dataset.from_dict({"text": train_x, "label": train_y})
test_dataset = datasets.Dataset.from_dict({"text": test_x, "label": test_y})
train_dataset = train_dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=16),
    batched=True,
)
test_dataset = test_dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=16),
    batched=True,
)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Map: 100%|██████████| 800/800 [00:00<00:00, 6538.62 examples/s]
Map: 100%|██████████| 200/200 [00:00<00:00, 9197.93 examples/s]
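
The `truncation=True, padding="max_length", max_length=16` arguments make every example exactly 16 tokens long. The sketch below imitates that behaviour in plain Python to show what happens to the ids and the attention mask. The helper `pad_or_truncate` is hypothetical and is not how the tokenizer is implemented internally. The ids `0` for `[PAD]` and `102` for `[SEP]` are taken from the tokenizer printout earlier.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0, sep_id=102):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids)
    if len(ids) > max_length:
        # Keep the closing [SEP] token when cutting the sequence short.
        ids = ids[: max_length - 1] + [sep_id]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    # Padded positions get attention_mask 0 so the model ignores them.
    return ids + [pad_id] * padding, mask + [0] * padding


# Token ids of "Hello, my dog is cute!" from the tokenizer output earlier.
hello_ids = [101, 7592, 1010, 2026, 3899, 2003, 10140, 999, 102]

padded_ids, attention_mask = pad_or_truncate(hello_ids, max_length=16)
print(padded_ids)
print(attention_mask)

short_ids, short_mask = pad_or_truncate(hello_ids, max_length=6)
print(short_ids)
```

A `max_length` of 16 keeps training fast, but longer reviews lose everything after the cut, so it is worth checking how many reviews exceed that length before settling on a value.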


![alt text](images/bert-cls.png)

In [53]:
cls_model = transformers.AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [54]:
cls_model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [56]:
pipe = transformers.pipeline("sentiment-analysis", model=cls_model, tokenizer=tokenizer,)
print(pipe("I love this!"))
print(pipe("I hate this!"))

Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.5087944865226746}]
[{'label': 'NEGATIVE', 'score': 0.5132016539573669}]


In [57]:
print(train_dataset[0])

{'text': 'The worst was the salmon sashimi.', 'label': 0, 'input_ids': [101, 1996, 5409, 2001, 1996, 11840, 24511, 27605, 1012, 102, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]}


In [58]:
print(tokenizer.decode(train_dataset[0]["input_ids"], skip_special_tokens=True))
print([tokenizer.decode(id) for id in train_dataset[0]["input_ids"]])

the worst was the salmon sashimi.
['[CLS]', 'the', 'worst', 'was', 'the', 'salmon', 'sash', '##imi', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
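
The split of "sashimi" into `sash` and `##imi` comes from WordPiece tokenization, which breaks a word that is not in the vocabulary into the longest pieces that are. The following is a simplified sketch of that greedy longest-match-first idea. The function `wordpiece` and the four-entry `toy_vocab` are assumptions for illustration, whereas the real vocabulary has 30,522 entries.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical miniature vocabulary.
toy_vocab = {"salmon", "sash", "##imi", "worst"}

print(wordpiece("salmon", toy_vocab))
print(wordpiece("sashimi", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Splitting rare words into known pieces lets the model handle words it never saw as whole tokens, instead of mapping them all to `[UNK]`.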


In [59]:
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="max_length", max_length=16)
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    eval_steps=5,
    max_steps=10,
)
# play with these parameters to see how they affect the training
# e.g., change max_steps to 1000, increase per_device_train_batch_size, etc.
# learn about these parameters here: https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments
trainer = Trainer(
    model=cls_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
    compute_metrics=lambda p: {"accuracy": (p.predictions.argmax(-1) == p.label_ids).mean()},
)
trainer.train()
trainer.save_model("./model")




Step,Training Loss,Validation Loss,Accuracy
5,No log,0.695208,0.48
10,No log,0.694541,0.48
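
An accuracy of 0.48 is roughly chance level for two classes, which is unsurprising given how little training happens in 10 steps. The arithmetic below is a rough sketch that assumes the `Trainer` default batch size of 8 per device and a single device. Neither is set explicitly in the notebook, so treat both as assumptions.

```python
import math

train_examples = 800   # size of the training split above
batch_size = 8         # assumed: default per_device_train_batch_size
max_steps = 10         # value passed to TrainingArguments above

steps_per_epoch = math.ceil(train_examples / batch_size)
examples_seen = max_steps * batch_size
fraction_of_epoch = max_steps / steps_per_epoch

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")
print(f"Fraction of one epoch: {fraction_of_epoch:.0%}")
```

Under these assumptions the model sees only about a tenth of the training data once, so the freshly initialized classification head has barely moved. The `No log` entries in the Training Loss column are probably due to the logging interval (500 steps if left at its default) being longer than the whole run.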


In [60]:
tokenizer.save_pretrained("./model")

('./model/tokenizer_config.json',
 './model/special_tokens_map.json',
 './model/vocab.txt',
 './model/added_tokens.json',
 './model/tokenizer.json')

In [63]:
# use the transformers pipeline to test the saved model
pipe = transformers.pipeline("sentiment-analysis", model="./model", tokenizer="./model",)
print(pipe("I love using Hugging Face transformers!"))
print(pipe("I hate using Hugging Face transformers!"))

Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.5220035314559937}]
[{'label': 'NEGATIVE', 'score': 0.52321457862854}]


In [64]:
print(pipe("I love this!"))
print(pipe("I hate this!"))

[{'label': 'NEGATIVE', 'score': 0.5180331468582153}]
[{'label': 'NEGATIVE', 'score': 0.5273443460464478}]


In [65]:
text = "I love using Hugging Face transformers!"
inputs = tokenizer(text, return_tensors="pt")
outputs = cls_model(**inputs)
logits = outputs.logits
predicted_class_id = logits.argmax().item()
predicted_class = id2label[predicted_class_id]
print(f"Text: {text}\nPredicted class ID: {predicted_class_id}\nPredicted class: {predicted_class}")


Text: I love using Hugging Face transformers!
Predicted class ID: 0
Predicted class: NEGATIVE


In [66]:
# convert the logits to probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print(f"Probabilities: {probabilities}")
logits


Probabilities: tensor([[0.5220, 0.4780]], grad_fn=<SoftmaxBackward0>)


tensor([[0.2455, 0.1574]], grad_fn=<AddmmBackward0>)
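
The softmax step can be checked by hand. The sketch below re-implements softmax in plain Python and applies it to the two logits printed above. The helper name `softmax` is local to this example, and the logits are the rounded values from the output, so the last digits may differ slightly from the tensor.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits printed above for "I love using Hugging Face transformers!"
logits_from_output = [0.2455, 0.1574]
probs = softmax(logits_from_output)
print([f"{p:.4f}" for p in probs])
```

The two logits differ by less than 0.1, so the probabilities land close to 0.5 each. This is why the pipeline reports `NEGATIVE` with a score of only about 0.52: the model is nearly undecided rather than confidently wrong.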