# NOTEBOOK CONTENT

In this classification task, we predict a film's genre from the text of its description.

This notebook is divided into two parts. In the first part, we use a Naive Bayes model to classify film descriptions. In the second part, we focus on prompt engineering with the large language model `FLAN-T5`, present its outputs, and evaluate its performance. Finally, we apply parameter-efficient fine-tuning to improve the results.



### Note: In practice, use one approach or the other, not both. Each part runs without errors on its own, so you can do either Part-1 or Part-2.

## Part-1:
- [1.0 - Setup And Import The Requirements.](#1)
- [2.0 - Load Data And Process it.](#2)
    - [2.1- Load Data ](#2.1)
    - [2.2- Cleaning The Data](#2.2)
    - [2.3- Feature Engineering](#2.3)
    - [2.4- Do TFIDF Method](#2.4)
- [3.0 - Load Model And Make Prediction](#3.0)
    - [3.1- Load Models And Choose The Best](#3.1)
    - [3.2- Make prediction](#3.2)

## Part-2:

- [ 1 -Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Test the Model with Zero Shot Inferencing](#1.1)
  - [ 1.2 - Using One Shot and Few Shot Inference ](#1.2)
- [ 2 - Perform Parameter Efficient Fine-Tuning (PEFT)](#2)
  - [2.1 - Preprocess the Classification Dataset](#2.1)
  - [ 2.2 - Setup the PEFT/LoRA model for Fine-Tuning](#2.2)
  - [ 2.3 - Train PEFT Adapter](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
  
- [ 3 - submission ](#1)

# 1.0 - Setup And Import The Requirements.
Let's set up the libraries we need.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import string
import warnings
import nltk
import spacy
import sklearn
import unicodedata
import os
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize , sent_tokenize
from nltk.stem import LancasterStemmer
from nltk.tokenize.toktok import ToktokTokenizer
from wordcloud import WordCloud , STOPWORDS
from bs4 import BeautifulSoup
from textblob import TextBlob , Word
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report , confusion_matrix , accuracy_score

warnings.filterwarnings('ignore')


# 2.0 - Load Data And Process it.

# 2.1- Load Data


From the dataset's description file, we get the following formats.

`Train data format`:

ID ::: TITLE ::: GENRE ::: DESCRIPTION

`Test data format`:

ID ::: TITLE ::: DESCRIPTION
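
To make the record layout concrete, here is a minimal sketch of how one line splits into fields. The sample line and the helper name `parse_line` are made up for illustration and are not taken from the dataset.

```python
# Hypothetical sample line in the train format: ID ::: TITLE ::: GENRE ::: DESCRIPTION
sample = "1 ::: Some Movie (2009) ::: drama ::: A short made-up plot summary."

def parse_line(line, columns):
    """Split one ' ::: '-delimited record into a dict keyed by column name."""
    # maxsplit keeps any ':::' inside the last field (the description) intact
    fields = line.strip().split(" ::: ", maxsplit=len(columns) - 1)
    return dict(zip(columns, fields))

record = parse_line(sample, ["ID", "TITLE", "GENRE", "DESCRIPTION"])
print(record["GENRE"])
```

Note that splitting on the bare `:::` (as `pd.read_csv(..., sep=":::")` does) can leave spaces around each field, so stripping the values afterwards may be worth considering.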

In [None]:
import os
import pandas as pd


train_file= "/kaggle/input/genre-classification-dataset-imdb/Genre Classification Dataset/train_data.txt"
test_file = "/kaggle/input/genre-classification-dataset-imdb/Genre Classification Dataset/test_data.txt"
test_solution= "/kaggle/input/genre-classification-dataset-imdb/Genre Classification Dataset/test_data_solution.txt"


try:
    # The files use ":::" as a separator; multi-character separators need the Python engine
    train_data = pd.read_csv(train_file, sep=":::", engine="python",
                             names=['ID', 'TITLE', 'GENRE', 'DESCRIPTION'])
    print(f"Loaded {len(train_data)} rows from train_data.txt")
except FileNotFoundError:
    print(f"'{train_file}' not found. Please adjust the file name or directory path as needed.")

In [None]:
train_data.head()

In [None]:
train_data['DESCRIPTION'][0]

In [None]:
train_data.GENRE.value_counts()

# 2.2- Cleaning Data

In [None]:
# Preprocessing steps:
# 1. remove HTML
# 2. remove square brackets and their contents
# 3. remove special characters
# 4. stem each word
# 5. remove stopwords
# Finally, collect all steps in one preprocessing function.
#
# Note: `tokenizer` and `stopwords` are globals (a ToktokTokenizer and the
# spaCy stop-word list) that must be defined before `preprocessing` is called.


def remove_html(text):
    # Parse the text as HTML and keep only the visible text
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

def remove_squer_prackets(text):
    # Raw string avoids invalid-escape warnings in the pattern
    return re.sub(r'\[[^]]*\]', '', text)

def remove_special_char(text):
    # Keep only letters, digits and whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)


def stemming(text):
    # PorterStemmer is imported from nltk.stem.porter at the top of the notebook
    stem = PorterStemmer()
    text = ' '.join([stem.stem(word) for word in text.split()])
    return text

def remove_stopwords(text):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    filtering = [word for word in tokens if word.lower() not in stopwords]
    return ' '.join(filtering)

# Collect all steps
def preprocessing(text):
    docs = remove_html(text)
    docs = remove_squer_prackets(docs)
    docs = remove_special_char(docs)
    docs = stemming(docs)
    docs = remove_stopwords(docs)
    return docs
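
The two regular-expression steps can be checked in isolation on a toy string. This is only a sketch: the input sentence is made up, and the patterns are the same ones used in the cleaning functions.

```python
import re

# Made-up input containing a bracketed note and punctuation
text = "A hero [uncredited] fights back!!! (Part 2)"

# Drop anything inside square brackets, including the brackets themselves
no_brackets = re.sub(r'\[[^]]*\]', '', text)

# Keep only letters, digits and whitespace
cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', no_brackets)

print(cleaned)
```

Removing a bracketed span leaves a double space behind, which is harmless here because the later steps split on whitespace.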

# 2.3- Feature Engineering

We add a year column to capture the relation between a film's release year and its genre.


The release year appears in parentheses at the end of each title, so we extract it into a separate column.

In [None]:
def year_find(text):
    year_pattern = r'\((\d{4})\)'
    year = re.search(year_pattern , str(text))
    if year:
        extracted_year = year.group(1)
        return int(extracted_year)
    else:
        extracted_year = None
        return extracted_year


print(int(year_find('Oscar et la dame rose (2009)')))

In [None]:
import spacy


tokenizer = ToktokTokenizer()
nlp = spacy.load('en_core_web_sm')
stopwords = list(nlp.Defaults.stop_words)


train_copy = train_data.copy()
train_copy['DESCRIPTION'] = train_copy['DESCRIPTION'].apply(preprocessing)
train_copy.head()


In [None]:
# extract year

train_copy['YEAR'] = train_copy['TITLE'].apply(year_find)
train_copy = train_copy.dropna()
train_copy['YEAR'] = train_copy['YEAR'].astype(int)
train_copy.shape[0]

In [None]:
train_copy.tail()

In [None]:
list_films=list(train_copy['GENRE'].unique())
len(list_films)

In [None]:
list_films

In [None]:
# prepend the year to the description before applying TF-IDF


train_copy['YEAR'] = train_copy['YEAR'].astype(str)

train_copy['discription'] =  train_copy['YEAR'] + ' '+ train_copy['DESCRIPTION']


In [None]:
train_copy['discription'][88]

In [None]:
# split train data into train and validation sets


# encode the GENRE column as integer class labels

label_encode =  LabelEncoder()
labels = label_encode.fit_transform(train_copy['GENRE'])

train_set , val_set , train_label , val_label = train_test_split(train_copy['discription'] , labels , test_size=0.2 , shuffle=True , random_state = 42)

line_dash = '-'.join('' for _ in range(100))

print(line_dash)
print(f'Size of train data: {train_copy.shape[0]}')
print(line_dash)
print('Split data into train and validation sets')
print(f'Train Set\t: {len(train_set)}\nValidation Set\t: {len(val_set)}')
print(line_dash)

# 2.4- Do TFIDF Method


In [None]:
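# max_df=1.0 (a float) keeps terms no matter how many documents contain them;
# the integer 1 would drop every term that appears in more than one document.
# min_df=1 is the smallest valid integer threshold.
tfidf_model = TfidfVectorizer(ngram_range=(1, 3), use_idf=True, min_df=1, max_df=1.0)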

tf_train = tfidf_model.fit_transform(train_set)
tf_val   = tfidf_model.transform(val_set)
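
For intuition about what the vectorizer computes, here is a pure-Python sketch of TF-IDF on a toy corpus. It follows the smoothed IDF formula and L2 normalisation documented as scikit-learn's defaults, but it is a simplification: the corpus is made up, only unigrams are used, and tokenisation is a plain whitespace split.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration)
docs = ["space war hero", "funny family hero", "space space station"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Smoothed TF-IDF with L2 normalisation for one tokenized document."""
    tf = Counter(doc)
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
print(vec)
```

In the first document, `war` occurs in only one document of the corpus, so it receives a larger weight than `space` or `hero`, which each occur in two.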

# 3.0 - Load Model And Make Prediction


# 3.1- Load Models And Choose The Best

In this classification task, I compare three models by evaluating each of them on the validation set, using grid search (`GridSearchCV`) to find the best hyperparameters:

`models` :

- Naive Bayes
- Logistic Regression
- Support Vector Machine `SVM`

`Let's initialize the hyperparameter grid for each of them.`
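
To see what a grid search costs, the candidate combinations can be enumerated by hand. The sketch below uses the Naive Bayes grid from this notebook with 5-fold cross-validation. It only counts the fits and does not train anything.

```python
from itertools import product

# Naive Bayes grid used in this notebook, with 5-fold cross-validation
param_grid = {"alpha": [0.1, 1.0, 10.0], "fit_prior": [True, False]}
cv_folds = 5

# Cartesian product of all hyperparameter values
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

for combo in combinations:
    print(combo)

total_fits = len(combinations) * cv_folds
print(f"{len(combinations)} candidates x {cv_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` performs one more fit of the best candidate on the full training data, and the result is available as `best_estimator_`.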


In [None]:
# Create a MultinomialNB model
NB_model = MultinomialNB()

# Define the hyperparameters and their possible values
param_grid = {
    'alpha': [0.1, 1.0, 10.0],
    'fit_prior': [True, False]
}

# Create a grid search object with cross-validation
grid_search = GridSearchCV(estimator=NB_model, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the grid search to your data
grid_search.fit(tf_train, train_label)

# Get the best hyperparameters
best_alpha = grid_search.best_params_['alpha']
best_fit_prior = grid_search.best_params_['fit_prior']

# Use the best hyperparameters to create your final model
final_NB_model = MultinomialNB(alpha=best_alpha, fit_prior=best_fit_prior)

# Train your final model on the entire training dataset
final_NB_model.fit(tf_train , train_label)


NB_prediction = final_NB_model.predict(tf_val)
NB_accuracy   = accuracy_score(val_label , NB_prediction)  # (y_true, y_pred)

print(f'Naive Bayes accuracy : {NB_accuracy}')

Note: Logistic Regression and SVC take much longer to train than Naive Bayes and gave the same result, so they are commented out below.

If you have the time, feel free to run them.

In [None]:
# LR_model = LogisticRegression()

# # Define the hyperparameters and their possible values
# param_grid = {
#     'C': [0.001, 0.01, 0.1, 1, 10, 100],  # Regularization strength
#     'penalty': ['l1', 'l2'],  # Regularization type (L1 or L2)
#     'solver': ['liblinear']  # Solver for L1 regularization
# }

# # Create a grid search object with cross-validation
# LR_grid_search = GridSearchCV(estimator =LR_model ,param_grid=param_grid, cv=5, scoring='accuracy'  )

# LR_grid_search.fit(tf_train , train_label)

# # Get the best hyperparameters
# best_C = LR_grid_search.best_params_['C']
# best_penalty = LR_grid_search.best_params_['penalty']

# final_LR_model = LogisticRegression(C=best_C, penalty=best_penalty, solver='liblinear')

# # Train your final model on the entire training dataset
# final_LR_model.fit(tf_train , train_label)
# # Get The Accuracy
# LR_prediction = final_LR_model.predict(tf_val)
# LR_accuracy   = accuracy_score(LR_prediction , val_label)

# print(f'Logistic Regression accuracy : {LR_accuracy}')

In [None]:
# # Create an SVM model
# svm_model = SVC()

# # Define the hyperparameters and their possible values
# param_grid = {
#     'C': [0.1, 1, 10],  # Regularization parameter
#     'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],  # Kernel type
#     'degree': [2, 3],  # Degree of the polynomial kernel (if using poly)
#     'gamma': ['scale', 'auto'] + [0.001, 0.01, 0.1, 1],  # Kernel coefficient (if using poly, rbf, or sigmoid)
# }

# # Create a grid search object with cross-validation
# grid_search = GridSearchCV(estimator=svm_model, param_grid=param_grid, cv=5, scoring='accuracy')

# # Fit the grid search to your data
# grid_search.fit(tf_train, train_label)

# # Get the best hyperparameters
# best_C = grid_search.best_params_['C']
# best_kernel = grid_search.best_params_['kernel']
# best_degree = grid_search.best_params_['degree']
# best_gamma = grid_search.best_params_['gamma']

# # Use the best hyperparameters to create your final model
# final_svm_model = SVC(C=best_C, kernel=best_kernel, degree=best_degree, gamma=best_gamma)

# # Train your final model on the entire training dataset
# final_svm_model.fit(tf_train, train_label)

# # Get The Accuracy
# svm_prediction = final_svm_model.predict(tf_val)
# svm_accuracy   = accuracy_score(svm_prediction , val_label)

# print(f'Support Vector Machine accuracy : {svm_accuracy}')

# [Part-2](#2)

In [None]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

# imports

In [None]:
import numpy as np
import pandas as pd
import torch
import warnings
import time
# import evaluate

from pathlib import Path
from string import Template
from transformers import AutoModelForSeq2SeqLM , AutoTokenizer , GenerationConfig , Trainer , TrainingArguments , T5Tokenizer, T5ForConditionalGeneration
warnings.simplefilter('ignore')



## Load Model and Tokenizer


In [None]:
model_name = 'google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name , torch_dtype = torch.bfloat16)
tokenizer      = AutoTokenizer.from_pretrained(model_name)


### Let's show the number of trainable parameters


We can count the model's parameters and find out how many of them are trainable.

In [None]:
def print_number_of_trainable_parameters(model):
    trainable = 0
    all_params= 0
    for _,param in model.named_parameters():
        all_params+= param.numel()
        if param.requires_grad:
            trainable+= param.numel()
    return f'trainable model parameters: {trainable}\nAll parameters: {all_params}\npercentage of trainable parameters: {100*trainable/all_params:.2f}%'



print( print_number_of_trainable_parameters(original_model))

# 1.1 - Test the Model with Zero Shot Inferencing

In this section, we rely on prompt engineering alone: the prompt contains the instruction and the description, but no worked examples (zero-shot).

In [None]:
preamble = '''
Given the following film descriptions, your task is to classify each description into one of the following categories:
- Thriller
- Comedy
- Documentary
- Drama
- Horror
- Short
- Western
- Sport
- Romance
- War
- Game Show
- Biography
- Adult
- Talk Show
- Family
- Action
- Music
- Crime
- Animation
- Sci-Fi
- Adventure
- Reality TV
- Fantasy
- Mystery
- History
- News
- Musical

Please arrange the classifications from the most likely to be correct to the least likely to be correct.'''

end_prompt   = 'The answer:\n'
template =  Template('$preamble\n\n$prompt\n\n$end_prompt')


In [None]:
def formate_input(df , index):
    prompt     = df.loc[index , 'DESCRIPTION']
    input_text =  template.substitute(preamble = preamble , prompt = prompt , end_prompt=end_prompt)
    return input_text




In [None]:
print(formate_input(train_data , 0))

In [None]:
# let's do zero shot



zero_shot_discription = formate_input(train_data , 0)
zero_shot_answer      = train_data.loc[0 , 'GENRE']

inputs       =  tokenizer(zero_shot_discription , return_tensors='pt')
generate     = original_model.generate(inputs['input_ids'] , max_new_tokens=1)[0]
model_answer = tokenizer.decode(generate , skip_special_tokens =True)
line_dash = '-'.join('' for _ in range(100))
print(line_dash)
print(f'Prompt:\n{zero_shot_discription}')
print(line_dash)
print(f'Actual Answer:\n{zero_shot_answer}')
print(line_dash)
print(f'Model Answer:\n{model_answer}')



# 1.2 - Using One Shot and Few Shot Inference

Let's give the model a few worked examples, each with its answer, before the description we want classified.
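
The structure of a few-shot prompt can be sketched with `string.Template` alone: each example is an instruction, a description, and its answer, and the final block leaves the answer open. The instruction text, descriptions, and labels below are shortened and made up for illustration.

```python
from string import Template

# Shortened, made-up instruction and examples (not from the dataset)
instruction = "Classify the film description into a genre."
shot = Template("$instruction\n\n$description\n\nThe answer:\n$answer\n\n\n")
query = Template("$instruction\n\n$description\n\nThe answer:\n")

examples = [
    ("A detective hunts a killer in a rainy city.", "thriller"),
    ("Two friends open a failing pizza shop.", "comedy"),
]
new_description = "A film crew follows penguins through one winter."

# Worked examples first, then the open question for the model to complete
prompt = "".join(
    shot.substitute(instruction=instruction, description=d, answer=a)
    for d, a in examples
)
prompt += query.substitute(instruction=instruction, description=new_description)
print(prompt)
```

Every added example lengthens the prompt, so the number of shots is limited in practice by the model's maximum input length.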

In [None]:
template2 = Template('$preamble\n\n$prompt\n\n$end_prompt$answer\n\n\n')

def formate2(data , index):
    prompt = data.loc[index , 'DESCRIPTION']
    answer = data.loc[index , 'GENRE']
    text   = template2.substitute(preamble = preamble , prompt = prompt , end_prompt=end_prompt , answer = answer)
    return text




def create_example(example_index , example_class):

    final_prompt = ''

    for index in example_index:
        text = formate2(train_data , index)
        final_prompt += text


    test_text     = formate_input(train_data , example_class)
    final_prompt += test_text
    test_answer   = train_data.loc[example_class , 'GENRE']
    return final_prompt , test_answer


In [None]:
example_index = [0,1,2,3]
example_class = 200

few_shot_text , few_shot_answer = create_example(example_index , example_class)
print(few_shot_text)



In [None]:
# let's test our model again

few_shot_input        = tokenizer(few_shot_text , return_tensors='pt')
few_shot_generate     = original_model.generate(few_shot_input['input_ids'])[0]
few_shot_model_answer = tokenizer.decode(few_shot_generate , skip_special_tokens = True)

print(line_dash)
print(f'Actual Answer :\n{few_shot_answer}')
print(line_dash)
print(f'Model Answer :\n{few_shot_model_answer}')
print(line_dash)


# 2 - Perform Parameter Efficient Fine-Tuning (PEFT)


# 2.1 - Preprocess the Classification Dataset

Convert all the classification data into explicit instructions for the LLM. Each description is prefixed with the instruction, as in this example:


Given the following film descriptions, your task is to classify each description into one of the following categories:
- Thriller
- Comedy
- Documentary
- Drama
- Horror
- Short
- Western
- Sport
- Romance
- War
- Game Show
- Biography
- Adult
- Talk Show
- Family
- Action
- Music
- Crime
- Animation
- Sci-Fi
- Adventure
- Reality TV
- Fantasy
- Mystery
- History
- News
- Musical

Please arrange the classifications from the most likely to be correct to the least likely to be correct.

 L.R. Brane loves his life - his car, his apartment, his job, but especially his girlfriend, Vespa. One day while showering, Vespa runs out of shampoo. L.R. runs across the street to a convenience store to buy some more, a quick trip of no more than a few minutes. When he returns, Vespa is gone and every trace of her existence has been wiped out. L.R.'s life becomes a tortured existence as one strange event after another occurs to confirm in his mind that a conspiracy is working against his finding Vespa.

The answer:

 drama

Then tokenize the prompts and the answers.

In [None]:
tokenize_prompt = [formate_input(train_data, index) for index in range(train_data.shape[0])]


tokenize_train_data = train_data.copy()
tokenize_train_data['new_prompt'] = tokenize_prompt
tokenize_train_data = tokenize_train_data.drop(['ID' ,'TITLE' , 'DESCRIPTION'] , axis=1)

tokens_for_prompt = [tokenizer(tokenize_train_data.loc[index , 'new_prompt'] ,  return_tensors='pt' , truncation=True ,padding='max_length').input_ids for index in range(tokenize_train_data.shape[0])]
tokens_for_answer = [tokenizer(tokenize_train_data.loc[index , 'GENRE'] ,  return_tensors='pt' , truncation=True ,padding='max_length').input_ids for index in range(tokenize_train_data.shape[0])]

tokenize_train_data['input_ids'] = tokens_for_prompt
tokenize_train_data['labels']    = tokens_for_answer

tokenize_train_data = tokenize_train_data.drop(['GENRE' , 'new_prompt'] , axis=1)



tokenize_train_data.head()
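
One detail worth knowing when labels are padded to a fixed length: Hugging Face sequence-to-sequence models skip label positions set to `-100` when computing the loss, so padding is often masked that way. The sketch below is not part of the original pipeline. It assumes the T5 pad id is 0, and the token ids are made up.

```python
PAD_TOKEN_ID = 0      # assumed: T5 tokenizers use 0 as the pad id
IGNORE_INDEX = -100   # label value skipped by the loss in Hugging Face models

def mask_padding(label_ids, pad_id=PAD_TOKEN_ID):
    """Replace pad ids with -100 so they do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in label_ids]

# Made-up token ids: two real tokens, an end-of-sequence id, then padding
padded_labels = [6946, 9, 1, 0, 0, 0]
print(mask_padding(padded_labels))
```

Without such masking, the padded positions count as targets, so the model is also trained to emit pad tokens.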

In [None]:
from datasets import DatasetDict , Dataset

# preparing dataset for LLM

new_tokens_for_prompt = []
new_tokens_for_answer = []
for i in range(len(tokens_for_prompt)):
    new_tokens_for_prompt.append(tokens_for_prompt[i][0])
    new_tokens_for_answer.append(tokens_for_answer[i][0])


train_tokens = {
    'input_ids': new_tokens_for_prompt ,
    'labels': new_tokens_for_answer
}

Train_Dict = Dataset.from_dict(train_tokens)
Train_Dict


# 2.2 - Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (r) hyper-parameter, which defines the rank/dimension of the adapter to be trained.
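
The effect of the rank on adapter size can be estimated by hand. LoRA adds two small matrices per targeted weight, one of shape `r x d_in` and one of shape `d_out x r`. The sketch below assumes the commonly reported FLAN-T5-base shape (hidden size 768, 12 encoder and 12 decoder blocks). These figures are assumptions and are not read from the model.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Assumed FLAN-T5-base shape: hidden size 768, 12 encoder and 12 decoder blocks
d_model = 768
rank = 32
per_module = lora_param_count(d_model, d_model, rank)

# q and v projections in encoder self-attention (12 blocks x 2),
# decoder self-attention and decoder cross-attention (12 blocks x 2 x 2)
encoder_modules = 12 * 2
decoder_modules = 12 * 2 * 2
total = (encoder_modules + decoder_modules) * per_module
print(f"{total:,} trainable LoRA parameters")
```

Halving `r` halves this count, which makes the rank the main knob for trading adapter capacity against memory.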

In [None]:
from peft import LoraConfig , get_peft_model , TaskType

lora_configs = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
                               )

peft_model = get_peft_model(original_model , lora_configs)

print(print_number_of_trainable_parameters(peft_model))

In [None]:
peft_output_dir = f'/kaggle/working/peft-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=peft_output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1)


peft_trainer = Trainer(
    model = peft_model ,
    args = peft_training_args ,
    train_dataset = Train_Dict

)

In [None]:
peft_trainer.train()

peft_model_path = f'/kaggle/working/peft_model_trainer'
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

In [None]:
from peft import PeftModel , PeftConfig 

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer       = AutoTokenizer.from_pretrained("google/flan-t5-base")


peft_model = PeftModel.from_pretrained(peft_model_base ,  
                                      peft_model_path ,
                                      torch_dtype = torch.bfloat16 ,
                                      is_trainable=False)

print(print_number_of_trainable_parameters(peft_model))




In [None]:
# let's test our fine-tuned model 



import random
human_answer = []
peft_model_answer=[]

# choose 2 random row positions (randint includes both endpoints, hence the -1)
random_numbers = [ random.randint(0 , len(train_data['GENRE']) - 1) for _ in range(2)]


for i in random_numbers:
    structer = formate_input(train_data , i)
    inputs   = tokenizer(structer , return_tensors='pt')
    generate = peft_model.generate(inputs['input_ids'])[0]
    outputs  = tokenizer.decode(generate , skip_special_tokens=True)
    peft_model_answer.append(outputs)
    human_answer.append(train_data['GENRE'][i])
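
Once the two lists are collected, a simple first check is exact-match accuracy. The sketch below is one possible way to compute it. The helper name `exact_match_accuracy` is hypothetical, and the sample values are made up to stand in for `human_answer` and `peft_model_answer`.

```python
def exact_match_accuracy(references, predictions):
    """Share of predictions equal to the reference after trimming and lowercasing."""
    if not references:
        return 0.0
    hits = sum(
        ref.strip().lower() == pred.strip().lower()
        for ref, pred in zip(references, predictions)
    )
    return hits / len(references)

# Made-up values standing in for human_answer and peft_model_answer
reference_labels = [" drama ", " comedy ", " horror "]
model_outputs = ["drama", "Comedy", "thriller"]
print(exact_match_accuracy(reference_labels, model_outputs))
```

Trimming and lowercasing matter here because the genre labels in the raw file may carry surrounding spaces.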
