<a href="https://colab.research.google.com/github/dawoodrizwan-05/AI-Voice-Rasa-chatbot/blob/main/Fine_tune_pretrained_model_on_custom_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Name: **Muhammad Dawood Rizwan**
### Dataset: **Spam Collection**

### **Fine-tune a Pretrained Model (DistilBERT) on a Custom Dataset**

In [1]:
! pip install transformers datasets



In [17]:
# Import libraries

# Data manipulation
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Tokenization and dataset preparation
from datasets import Dataset
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

# TensorFlow
import tensorflow as tf


In [6]:
# Load dataset
df = pd.read_csv('SpamCollection.txt', sep='\t', names=["label", "message"])
df.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [7]:
# Prepare the features and labels
X = list(df['message'])
y = list(pd.get_dummies(df['label'], drop_first=True)['spam'])
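
In recent pandas versions (2.0 and later), `pd.get_dummies` returns boolean columns, which appears to be why the classification report further down lists the classes as `False` and `True`. A minimal sketch of an explicit integer mapping follows, assuming the only labels are `ham` and `spam`; `LABEL_TO_ID` and `encode_labels` are hypothetical names, not part of the notebook.

```python
# Explicit label-to-integer mapping (hypothetical helper, not from the notebook).
LABEL_TO_ID = {"ham": 0, "spam": 1}

def encode_labels(labels):
    """Map string labels to integer class ids; an unknown label raises KeyError."""
    return [LABEL_TO_ID[label] for label in labels]

# The sample labels mirror the five rows shown by df.head() above.
sample_labels = ["ham", "ham", "spam", "ham", "ham"]
encoded = encode_labels(sample_labels)
print(encoded)  # [0, 0, 1, 0, 0]
```

On the full DataFrame, the same mapping could be applied with `df['label'].map(LABEL_TO_ID)`.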

In [8]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)


In [9]:
# Initialize tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [10]:
# Tokenize the data
train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(X_test, truncation=True, padding=True, max_length=128)
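
With `truncation=True` and `max_length=128`, sequences longer than 128 tokens are cut; with `padding=True`, each call pads to the longest sequence it receives, so the train and test encodings may end up with different widths. The sketch below is a simplified, pure-Python illustration of that idea, not the tokenizer's actual implementation (the real tokenizer also keeps the closing special token when truncating). `pad_and_truncate` and the toy token ids are made up for the example.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids, purely for illustration.
toy_batch = [[101, 7592, 102], [101, 2489, 4443, 1999, 102], [101, 102]]
encoded_toy = pad_and_truncate(toy_batch, max_length=4)
print(encoded_toy["input_ids"])
print(encoded_toy["attention_mask"])
```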

In [11]:
# Convert to TensorFlow datasets
def to_tf_dataset(encodings, labels):
    dataset = tf.data.Dataset.from_tensor_slices((
        dict(encodings),
        labels
    ))
    return dataset.batch(16)

train_dataset = to_tf_dataset(train_encodings, y_train)
test_dataset = to_tf_dataset(test_encodings, y_test)
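
`dataset.batch(16)` groups consecutive examples into batches of 16 and, by default, keeps a smaller final batch rather than dropping it. As a rough sketch of the arithmetic, the hypothetical helper below lists the batch sizes for a given number of examples; 1115 is the test-set size reported in the classification report at the end of the notebook.

```python
def batch_sizes(num_examples, batch_size):
    """Sizes of the batches produced by fixed-size batching that keeps the remainder."""
    full, remainder = divmod(num_examples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])

sizes = batch_sizes(1115, 16)
print(len(sizes), sizes[-1])  # 70 11
```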


In [12]:
# Load model
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [13]:
# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
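
The model returns raw logits (the evaluation cell reads `predictions.logits`), which is why the loss is created with `from_logits=True`. For a single example, the loss is the negative log of the softmax probability assigned to the true class. The pure-Python sketch below works through that formula with made-up logits; `sparse_ce_from_logits` is a hypothetical helper, not a Keras function.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]

uncertain = sparse_ce_from_logits([0.0, 0.0], label=1)   # equal logits: loss is ln(2)
confident = sparse_ce_from_logits([-3.0, 3.0], label=1)  # correct and confident: small loss
wrong = sparse_ce_from_logits([-3.0, 3.0], label=0)      # confident but wrong: large loss
print(uncertain, confident, wrong)
```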

In [14]:
# Early stopping and best model saving callbacks
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True,
    verbose=1
)

model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'best_model.h5',
    monitor='val_loss',
    save_best_only=True,
    save_weights_only=True,
    verbose=1
)
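
With `patience=2`, training stops once `val_loss` has failed to improve for two consecutive epochs, and `restore_best_weights=True` rolls the model back to the best epoch. The following is a simplified simulation of that rule, not Keras's implementation; `simulate_early_stopping` is a hypothetical helper, and only the first loss value (0.02706) comes from the training log, the other two are invented.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (best_epoch, last_epoch_run), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Only 0.02706 is taken from the log; 0.031 and 0.035 are invented for illustration.
result = simulate_early_stopping([0.02706, 0.031, 0.035], patience=2)
print(result)  # (1, 3)
```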

In [15]:
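# Initial training run without callbacks
history = model.fit(train_dataset, epochs=3, validation_data=test_dataset)

# Continue training with early stopping and checkpointing.
# Note: this second call resumes from the weights produced by the run above,
# and `history` is overwritten with the second run's metrics.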
history = model.fit(
    train_dataset,
    epochs=3,
    validation_data=test_dataset,
    callbacks=[early_stopping, model_checkpoint]
)

# Save the model and tokenizer
model.save_pretrained('./distilbert_spam_model')
tokenizer.save_pretrained('./distilbert_spam_tokenizer_model')



Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 1: val_loss improved from inf to 0.02706, saving model to best_model.h5
Epoch 2/3
Epoch 2: val_loss did not improve from 0.02706
Epoch 3/3
Epoch 3: val_loss did not improve from 0.02706
Epoch 3: early stopping
Restoring model weights from the end of the best epoch: 1.


('./distilbert_spam_tokenizer_model/tokenizer_config.json',
 './distilbert_spam_tokenizer_model/special_tokens_map.json',
 './distilbert_spam_tokenizer_model/vocab.txt',
 './distilbert_spam_tokenizer_model/added_tokens.json',
 './distilbert_spam_tokenizer_model/tokenizer.json')

In [16]:
# Evaluate the model
eval_results = model.evaluate(test_dataset)
print(f"Test Loss: {eval_results[0]}, Test Accuracy: {eval_results[1]}")

# Predictions
predictions = model.predict(test_dataset)
y_pred = tf.argmax(predictions.logits, axis=1).numpy()

# Classification Report
print(classification_report(y_test, y_pred))


Test Loss: 0.02705792337656021, Test Accuracy: 0.9937219619750977
              precision    recall  f1-score   support

       False       1.00      1.00      1.00       955
        True       0.98      0.98      0.98       160

    accuracy                           0.99      1115
   macro avg       0.99      0.99      0.99      1115
weighted avg       0.99      0.99      0.99      1115
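
An accuracy of 0.99372 on 1115 test messages corresponds to 7 misclassified messages. The report does not say how those 7 split between false positives and false negatives, so the sketch below uses hypothetical counts (4 false positives, 3 false negatives) that happen to be consistent with the rounded figures above; it only illustrates how precision, recall, and F1 for the spam class are derived.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for the positive class from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical confusion counts: 160 spam and 955 ham messages, 7 errors in total.
tp, fn, fp, tn = 157, 3, 4, 951

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 4))
```

Because spam makes up only 160 of the 1115 test messages, the per-class precision and recall are more informative than overall accuracy.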

