## **Day 19/75 Hard**

# DistilBERT Model Fine-Tuning
## Using IMDB Dataset and Masked Language Modelling Technique
### 📍Masking

Fine-tuning a pre-trained language model means adapting it to a specific domain or task. BERT (Bidirectional Encoder Representations from Transformers) was pre-trained with two objectives, masked language modeling (MLM) and next sentence prediction (NSP); this notebook uses MLM only. In MLM you mask some of the tokens in a sentence and train the model to predict them from the surrounding context, which pushes it to understand the data more deeply.
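
As a rough illustration of what a masked training pair looks like, the sketch below masks a hand-picked position in a whitespace-split sentence. The helper `make_mlm_example` and the chosen position are hypothetical; real pipelines work on token IDs and pick positions at random. The label value `-100` is the index that PyTorch's cross-entropy loss ignores by default, which is why unmasked positions carry it.

```python
IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips targets with this value by default


def make_mlm_example(tokens, positions, mask_token="[MASK]"):
    """Return (inputs, labels) with the given positions masked (toy helper)."""
    inputs = list(tokens)
    labels = [IGNORE_INDEX] * len(tokens)
    for pos in positions:
        labels[pos] = tokens[pos]  # the model must recover the original token
        inputs[pos] = mask_token   # the model only sees the placeholder
    return inputs, labels


tokens = "this is a great movie".split()
inputs, labels = make_mlm_example(tokens, positions=[4])
print(inputs)  # ['this', 'is', 'a', 'great', '[MASK]']
print(labels)  # [-100, -100, -100, -100, 'movie']
```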

 1. Get the dataset.
 2. Choose the fine-tuning technique: masked language modelling, LoRA, QLoRA, and many more.
 3. Based on that technique, prepare your supervised learning data (inputs and labels).
 4. Build the `TrainingArguments` and `Trainer` objects and train the model.

In [2]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [3]:
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'


In [4]:
text = "This is a great [MASK]."

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [6]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'
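
The candidates above are ranked by raw logits, which are unnormalized scores. If probabilities are easier to read, a softmax over the vocabulary dimension converts them. The sketch below does this for a made-up five-word vocabulary; the scores are illustrative, not model output.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the [MASK] position over a toy vocabulary
vocab = ["deal", "success", "adventure", "idea", "feat"]
logits = [6.1, 5.9, 5.7, 5.6, 5.2]

probs = softmax(logits)
ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
for word, p in ranked:
    print(f"{word}: {p:.3f}")
```

In the notebook itself, `torch.softmax(mask_token_logits, dim=-1)` would play the same role before calling `torch.topk`.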


In [7]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [8]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stu

In [17]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

In [18]:
tokenizer.model_max_length

512

In [19]:
chunk_size = 128

In [20]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 363'
'>>> Review 1 length: 304'
'>>> Review 2 length: 133'


In [21]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 800'


In [22]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 32'


In [23]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size tokens
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
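
To see the effect of this function without running it on the full dataset, the sketch below applies the same concatenate-then-chunk logic to a toy batch. It is a standalone re-implementation under assumptions: `chunk_size = 4` and the token IDs are made up, and `group_texts_sketch` flattens with `itertools.chain` instead of `sum(..., [])`, which avoids repeatedly copying lists on large batches.

```python
from itertools import chain

chunk_size = 4  # tiny value for illustration; the notebook uses 128


def group_texts_sketch(examples):
    # Flatten each feature (list of lists) into one long list
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the trailing remainder that does not fill a whole chunk
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start as a copy of the inputs; masking happens later in the collator
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


toy_batch = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 3, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1, 1]],
}
grouped = group_texts_sketch(toy_batch)
print(grouped["input_ids"])
# [[101, 7, 8, 102], [101, 9, 102, 101], [5, 6, 4, 3]]
```

The 13 input tokens yield three chunks of four, and the final `102` is discarded. Chunks can also straddle review boundaries, as the second chunk shows.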

In [24]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

In [25]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

In [26]:
tokenizer.decode(lm_datasets["train"][1]["labels"])


"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

In [27]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)


In [28]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")



'>>> [CLS] i rented i am [MASK] - [MASK] from [MASK] video store because of all the controversy that surrounded it when it was first released in 1967 raked i also heard that at first it was [MASK] by u. s. customs if it ever tried to enter this country, therefore being a fan of films considered " switched " [MASK] really had to see this for myself. < br / > < br / > the plot is centered around a blaze swedish drama student named lena who wants to [MASK] everything she can about life. in particular she wants to focus her attention [MASK] to making some sort of documentary [MASK] what [MASK] average [MASK]ede [MASK] [MASK] certain political issues such'

'>>> as [MASK] [MASK] [MASK] [MASK] race [MASK] in the united states. in between asking politicians and ordinary denizens of stockholm about their [MASK] on politics, she has [MASK] with her drama teacher, classmates, and married men. < [MASK] / [MASK] [MASK] br / > what kills me [MASK] i [MASK] [MASK] - yellow is that 40 years ago, thi
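
The stray words in the collated output (for example `raked` or `blaze`) are expected. As in the original BERT recipe, `DataCollatorForLanguageModeling` does not turn every selected token into `[MASK]`: by default roughly 80% of the selected positions become `[MASK]`, 10% become a random token, and 10% are left unchanged. The following is a simplified pure-Python sketch of that rule, not the library's implementation; `corrupt_tokens` and the toy vocabulary are hypothetical.

```python
import random

IGNORE_INDEX = -100


def corrupt_tokens(tokens, vocab, mlm_probability=0.15, seed=0):
    """Apply an 80/10/10 masking rule to a list of string tokens (toy helper)."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [IGNORE_INDEX] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mlm_probability:
            continue  # position not selected: no loss is computed here
        labels[i] = token  # selected: the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"  # 80%: mask placeholder
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: random replacement
        # remaining 10%: the input keeps the original token
    return inputs, labels


tokens = "the plot is centered around a young swedish drama student".split()
vocab = ["raked", "blaze", "switched", "movie", "film"]
inputs, labels = corrupt_tokens(tokens, vocab, mlm_probability=0.3, seed=1)
print(inputs)
print(labels)
```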

In [29]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)


In [30]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] i rented [MASK] am curious - yellow from my [MASK] [MASK] because [MASK] [MASK] [MASK] [MASK] that surrounded it when it was first released in 1967. i also heard that at first it was seized by u. s. customs if it ever tried to enter [MASK] [MASK], therefore being [MASK] fan of films considered [MASK] [MASK] " [MASK] really [MASK] to see [MASK] for myself [MASK] < [MASK] [MASK] > < br [MASK] > the plot is centered around [MASK] young [MASK] [MASK] [MASK] named lena [MASK] wants to [MASK] [MASK] [MASK] can about life [MASK] [MASK] [MASK] she wants [MASK] focus her attentions to making some sort [MASK] [MASK] on what the average swede [MASK] [MASK] certain political issues such'

'>>> as [MASK] [MASK] war and race issues in [MASK] united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with [MASK] drama teacher, [MASK] [MASK] and [MASK] [MASK]. < [MASK] / > < [MASK] / > [MASK] kills me about [MASK] am curiou
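
The mapping step is the core of the whole-word collator: it groups token positions that share a `word_id`, so that masking a word masks all of its subword pieces together. Here is a standalone sketch with hand-written `word_ids`; the split of "swede" into two pieces is assumed for illustration, and `build_word_mapping` is a hypothetical helper that mirrors the loop in the collator.

```python
import collections


def build_word_mapping(word_ids):
    """Map each word index to the token positions that belong to it."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:  # special tokens such as [CLS] / [SEP]
            continue
        if word_id != current_word:  # a new word starts here
            current_word = word_id
            current_word_index += 1
        mapping[current_word_index].append(idx)
    return dict(mapping)


# tokens:   [CLS]  the  average  sw  ##ede  [SEP]
word_ids = [None, 0, 1, 2, 2, None]
print(build_word_mapping(word_ids))
# {0: [1], 1: [2], 2: [3, 4]}
```

Masking word `2` would therefore replace positions 3 and 4 in one go, instead of leaving a half-masked word.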

In [31]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [32]:
from transformers import TrainingArguments
from transformers import Trainer

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",  # named `eval_strategy` in newer transformers releases
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    logging_steps=logging_steps,
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

  0%|          | 0/16 [00:00<?, ?it/s]

>>> Perplexity: 21.94
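
Perplexity here is the exponential of the mean cross-entropy loss (in nats) over the masked tokens, so a perplexity of 21.94 corresponds to an evaluation loss of about 3.09. A small sketch with made-up per-token probabilities shows the relationship:

```python
import math

# Hypothetical probabilities the model assigned to the correct token
# at four masked positions
correct_token_probs = [0.20, 0.05, 0.01, 0.10]

# Cross-entropy is the mean negative log-probability of the correct tokens
loss = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)
perplexity = math.exp(loss)
print(f"loss={loss:.3f}  perplexity={perplexity:.2f}")

# Going the other way: the loss implied by the perplexity reported above
print(f"{math.log(21.94):.2f}")  # 3.09
```

Lower is better: a perplexity of 1 would mean the model always puts all probability on the correct token.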


In [33]:
trainer.train()

  0%|          | 0/471 [00:00<?, ?it/s]
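
The progress-bar totals follow from the dataset sizes and the batch size. Assuming a single device, no gradient accumulation, and the `TrainingArguments` default of 3 epochs (none of which the cell above overrides), the arithmetic works out as follows:

```python
import math

train_rows, eval_rows, batch_size = 10_000, 1_000, 64
num_epochs = 3  # TrainingArguments default when num_train_epochs is not set

steps_per_epoch = math.ceil(train_rows / batch_size)
total_train_steps = steps_per_epoch * num_epochs
eval_steps = math.ceil(eval_rows / batch_size)
logging_steps = train_rows // batch_size

print(steps_per_epoch)    # 157
print(total_train_steps)  # 471
print(eval_steps)         # 16
print(logging_steps)      # 156
```

Because `logging_steps` is 156 and an epoch has 157 steps, the training loss is logged about once per epoch.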

In [None]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

# Fine Tuning with LoRA and QLoRA

## LoRA

In [2]:
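import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM, AutoTokenizer

# NOTE: `transformers` has no `QLORAdapter` class; importing it fails with
# "ImportError: cannot import name 'QLORAdapter' from 'transformers'".
# This cell therefore defines a minimal LoRA layer by hand (a sketch, not a library API).


class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update: base(x) + scale * x A^T B^T."""

    def __init__(self, base, rank=64, alpha=64):
        super().__init__()
        self.base = base
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        # Zero init means the wrapped layer behaves exactly like `base` at the start
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale


# Step 1: Load the pre-trained model and tokenizer, then freeze every original weight
model_name = "distilbert-base-uncased"
pretrained_model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
for param in pretrained_model.parameters():
    param.requires_grad = False

# Step 2: Prepare a toy dataset (the tokenizer adds [CLS] and [SEP] itself)
texts = ["Hello, how are you?", "I am doing well."]
train_encodings = tokenizer(texts, padding=True, return_tensors="pt")
mask_position = 3  # mask one token per sentence; the loss is computed only there
labels = torch.full_like(train_encodings["input_ids"], -100)
labels[:, mask_position] = train_encodings["input_ids"][:, mask_position]
train_encodings["input_ids"][:, mask_position] = tokenizer.mask_token_id

# Step 3: Wrap the query and value projections of every attention block with LoRA
for layer in pretrained_model.distilbert.transformer.layer:
    layer.attention.q_lin = LoRALinear(layer.attention.q_lin)
    layer.attention.v_lin = LoRALinear(layer.attention.v_lin)

# Step 4: Fine-tune only the LoRA parameters
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pretrained_model.to(device)
lora_params = [p for p in pretrained_model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(lora_params, lr=1e-5)
loss_fn = nn.CrossEntropyLoss()  # ignores label -100 by default

pretrained_model.train()
for epoch in range(10):
    optimizer.zero_grad()
    outputs = pretrained_model(**train_encodings.to(device))
    logits = outputs.logits
    loss = loss_fn(logits.view(-1, logits.shape[-1]), labels.view(-1).to(device))
    loss.backward()
    optimizer.step()

# Step 5: Inference with the fine-tuned model
pretrained_model.eval()
test_text = "How are you [MASK] today?"
test_input = tokenizer(test_text, return_tensors="pt").to(device)
with torch.no_grad():
    output = pretrained_model(**test_input)
mask_index = (test_input["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
predicted_id = output.logits[0, mask_index].argmax().item()
print("Predicted token:", tokenizer.decode([predicted_id]))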
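
For a sense of why low-rank adapters are cheap, compare the parameter count of a full 768×768 projection with a rank-64 LoRA update, the dimensions used in the cell above. LoRA factors the weight update into a 768×r matrix and an r×768 matrix, and only those two factors are trained. The helper `lora_param_counts` is hypothetical and ignores biases.

```python
def lora_param_counts(in_features, out_features, rank):
    """Return (full_weight_params, lora_params) for one linear layer, ignoring bias."""
    full = in_features * out_features
    lora = rank * (in_features + out_features)
    return full, lora


full, lora = lora_param_counts(768, 768, rank=64)
print(full, lora, f"{lora / full:.1%}")  # 589824 98304 16.7%

full_r8, lora_r8 = lora_param_counts(768, 768, rank=8)
print(full_r8, lora_r8, f"{lora_r8 / full_r8:.1%}")  # 589824 12288 2.1%
```

Rank 64 is fairly large for a 768-dimensional layer; smaller ranks such as 8 shrink the trainable share much further.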