<a href="https://colab.research.google.com/github/Vamsiratnala/Fine-Tuned-LLM/blob/main/model_fineTuned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install Conda in Colab
!pip install -q condacolab
import condacolab
condacolab.install()


⏬ Downloading https://github.com/jaimergp/miniforge/releases/download/24.11.2-1_colab/Miniforge3-colab-24.11.2-1_colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:20
🔁 Restarting kernel...


In [None]:
# Attempt to pin older versions in the base Conda environment (no new environment is created)
!conda install -y python=3.10 numpy=1.24.3 tensorflow=2.13.0 transformers=4.38.2


Channels:
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: / failed

SpecsConfigurationConflictError: Requested specs conflict with configured specs.
  requested specs: 
    - numpy=1.24.3
    - python=3.10
    - tensorflow=2.13.0
    - transformers=4.38.2
  pinned specs: 
    - cuda-version=12
    - python=3.11
    - python_abi=3.11[build=*cp311*]
Use 'conda config --show-sources' to look for 'pinned_specs' and 'track_features'
configuration parameters.  Pinned specs may also be defined in the file
/usr/local/conda-meta/pinned.
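The failure above comes from the `python=3.10` request clashing with the `python=3.11` pin that the Colab Conda install carries. The check can be sketched in plain Python as follows. This is a deliberately simplified prefix comparison, not Conda's actual solver logic, and `parse_spec` and `find_conflicts` are hypothetical helper names introduced only for illustration.

```python
def parse_spec(spec):
    """Split a simple 'name=version' spec, dropping any '[build=...]' suffix."""
    name, _, version = spec.partition("=")
    version = version.split("[", 1)[0]
    return name.strip(), version.strip()


def find_conflicts(requested, pinned):
    """Return (name, requested_version, pinned_version) for clashing specs.

    Simplification: two versions are treated as compatible when one is a
    string prefix of the other (e.g. '3.11' and '3.11.9').
    """
    pinned_versions = dict(parse_spec(s) for s in pinned)
    conflicts = []
    for spec in requested:
        name, version = parse_spec(spec)
        pin = pinned_versions.get(name)
        if pin is not None and not (version.startswith(pin) or pin.startswith(version)):
            conflicts.append((name, version, pin))
    return conflicts


# Specs copied from the error message above
requested = ["numpy=1.24.3", "python=3.10", "tensorflow=2.13.0", "transformers=4.38.2"]
pinned = ["cuda-version=12", "python=3.11", "python_abi=3.11[build=*cp311*]"]

conflicts = find_conflicts(requested, pinned)
print(conflicts)  # [('python', '3.10', '3.11')]
```

Because the install did not go through, the next cell reports the library versions that ship with the runtime rather than the requested ones.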




In [None]:
import numpy as np
import tensorflow as tf
import transformers

print("✅ Current Library Versions:")
print(f"NumPy version      : {np.__version__}")
print(f"TensorFlow version : {tf.__version__}")
print(f"Transformers version: {transformers.__version__}")


✅ Current Library Versions:
NumPy version      : 2.0.2
TensorFlow version : 2.18.0
Transformers version: 4.52.4


In [None]:

import pandas as pd
df = pd.read_csv('/content/SMSSpamCollection.csv', sep='\t', header=None, names=['label', 'message'])
print(df.head())
print(df.shape)
print(df['label'].value_counts())

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
(5572, 2)
label
ham     4825
spam     747
Name: count, dtype: int64


In [None]:
print(df['label'].unique())
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
print(df['label'].value_counts())


['ham' 'spam']
label
0    4825
1     747
Name: count, dtype: int64


In [None]:
from sklearn.model_selection import train_test_split

# Convert the DataFrame columns to plain lists
all_labels = df['label'].tolist()
all_texts = df['message'].tolist()

# Stratified 70/15/15 split: hold out 30%, then halve it into validation and test
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    all_texts, all_labels, test_size=0.3, stratify=all_labels, random_state=42
)
val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts, temp_labels, test_size=0.5, stratify=temp_labels, random_state=42
)
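The two calls above produce a roughly 70/15/15 split. A minimal sketch of the resulting sizes, assuming scikit-learn rounds a float `test_size` up to the next whole row and gives the remainder to the first part (`split_sizes` is a hypothetical helper, not part of scikit-learn):

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_first, n_second); the second part gets ceil(fraction * n)."""
    n_second = math.ceil(test_fraction * n_samples)
    return n_samples - n_second, n_second


# 5572 rows in total, as reported by df.shape
n_train, n_temp = split_sizes(5572, 0.3)
n_val, n_test = split_sizes(n_temp, 0.5)
print(n_train, n_val, n_test)  # 3900 836 836
```

The validation size of 836 agrees with the support column of the classification report further down.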

In [None]:

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts,truncation=True,padding=True)
val_encodings = tokenizer(val_texts,truncation=True,padding=True)
test_encodings = tokenizer(test_texts,truncation=True,padding=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
#define a conversion function

def convert_to_tf_dataset(encodings, labels):
  return tf.data.Dataset.from_tensor_slices(
      ({'input_ids':encodings['input_ids'],'attention_mask':encodings['attention_mask']},labels)
  )
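`from_tensor_slices` needs rectangular inputs, which is why the tokenizer calls use `padding=True`: every sequence in a call is padded to the longest one, and the attention mask marks which positions are real tokens. A rough pure-Python sketch of that idea follows. The token ids are made up, `pad_batch` is a hypothetical helper, and a pad id of 0 is an assumption about the `distilbert-base-uncased` vocabulary.

```python
def pad_batch(token_id_lists, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two illustrative (made-up) token id sequences of different lengths
batch = pad_batch([[101, 2489, 102], [101, 7929, 2474, 2099, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Since each split is tokenized in its own call, the padded length can differ between the train, validation, and test encodings.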


In [None]:
train_dataset = convert_to_tf_dataset(train_encodings,train_labels)
val_dataset = convert_to_tf_dataset(val_encodings,val_labels)
test_dataset = convert_to_tf_dataset(test_encodings,test_labels)

In [None]:
BATCH_SIZE = 8

train_dataset = train_dataset.shuffle(len(train_labels)).batch(BATCH_SIZE)
val_dataset = val_dataset.batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)

In [None]:
from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2  # since we're doing binary classification: spam vs ham
)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [None]:
from sklearn.utils import class_weight


# Your encoded labels: 0 = ham, 1 = spam
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_labels),
    y=train_labels
)

class_weights_dict = {i : weight for i, weight in enumerate(class_weights)}
print(class_weights_dict)


{0: np.float64(0.5774355937222386), 1: np.float64(3.72848948374761)}
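The `'balanced'` setting weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces the printed values under the assumption that the training split holds 3377 ham and 523 spam messages; these counts are inferred from the weights above, not read from the notebook, and `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Assumed training-split counts (inferred): 0 = ham, 1 = spam
train_counts = {0: 3377, 1: 523}
weights = balanced_class_weights(train_counts)
print(weights)
```

Under this scheme a misclassified spam message costs about 6.5 times as much as a misclassified ham message.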


In [None]:
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)


In [None]:
# Assume class_weights_dict is already defined, like:
# class_weights_dict = {0: 0.55, 1: 3.56}  (example)

# Use per-example loss to apply class weights
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
    reduction=tf.keras.losses.Reduction.NONE
)

epochs = 1

for epoch in range(epochs):
    print(f"\nEpoch {epoch + 1}/{epochs}")
    epoch_loss = 0
    batch_count = 0

    for batch in train_dataset:
        inputs, labels = batch

        with tf.GradientTape() as tape:
            outputs = model(inputs, training=True)
            logits = outputs.logits

            # Step 1: Get un-reduced (per-example) loss
            per_example_loss = loss_fn(labels, logits)

            # Step 2: Look up class weight for each label in the batch
            weights = tf.gather([class_weights_dict[0], class_weights_dict[1]], labels)

            weights = tf.cast(weights, dtype=tf.float32)

            # Step 3: Apply weights and reduce
            weighted_loss = tf.reduce_mean(per_example_loss * weights)

        gradients = tape.gradient(weighted_loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

        epoch_loss += weighted_loss.numpy()
        batch_count += 1

    print(f"✅ Epoch {epoch+1} completed | Average Loss: {epoch_loss / batch_count:.4f}")



Epoch 1/1
✅ Epoch 1 completed | Average Loss: 0.1605
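The three steps inside the gradient tape (per-example loss, weight lookup, weighted mean) can be checked by hand on a tiny batch. The following is a pure-Python sketch of the same arithmetic, not the TensorFlow code path; the function names, logits, and weights are illustrative assumptions.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one example from raw logits (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def weighted_batch_loss(batch_logits, batch_labels, class_weights):
    """Scale each example's loss by its class weight, then average over the batch."""
    weighted = [
        class_weights[y] * softmax_cross_entropy(z, y)
        for z, y in zip(batch_logits, batch_labels)
    ]
    return sum(weighted) / len(weighted)


# Two undecided predictions (equal logits), one ham and one spam example
loss = weighted_batch_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1], {0: 0.5, 1: 3.0})
print(round(loss, 4))  # 1.213
```

The mean divides by the batch size rather than by the sum of the weights, so the loss scale depends on how many spam examples each batch happens to contain.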


In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
import tensorflow as tf

all_preds = []
all_labels = []

for batch in val_dataset:
    inputs, labels = batch
    outputs = model(inputs, training=False)
    logits = outputs.logits
    preds = tf.argmax(logits, axis=1)

    all_preds.extend(preds.numpy())
    all_labels.extend(labels.numpy())

# Convert to numpy arrays
all_preds = np.array(all_preds)
all_labels = np.array(all_labels)

# Classification report
print("📊 Classification Report:")
print(classification_report(all_labels, all_preds, target_names=["ham", "spam"]))

# Confusion matrix
print("🧾 Confusion Matrix:")
print(confusion_matrix(all_labels, all_preds))


📊 Classification Report:
              precision    recall  f1-score   support

         ham       0.99      0.98      0.99       724
        spam       0.90      0.94      0.92       112

    accuracy                           0.98       836
   macro avg       0.94      0.96      0.95       836
weighted avg       0.98      0.98      0.98       836

🧾 Confusion Matrix:
[[712  12]
 [  7 105]]
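The figures in the report follow directly from the confusion matrix, where rows are true classes and columns are predicted classes. A small sketch that recomputes them (`per_class_metrics` is a hypothetical helper, not a scikit-learn function):

```python
def per_class_metrics(cm):
    """Return {class_index: (precision, recall, f1)} from a square confusion matrix."""
    n = len(cm)
    metrics = {}
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k
        recall = tp / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        metrics[k] = (precision, recall, f1)
    return metrics


# Validation confusion matrix from the output above: 0 = ham, 1 = spam
cm = [[712, 12], [7, 105]]
metrics = per_class_metrics(cm)
accuracy = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)

for label, (p, r, f1) in metrics.items():
    print(label, round(p, 2), round(r, 2), round(f1, 2))
print("accuracy", round(accuracy, 2))
```

For spam this gives precision 105/117 and recall 105/112, so 12 ham messages were flagged as spam and 7 spam messages slipped through.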


In [1]:
test_preds = []
test_true = []
for batch in test_dataset:
    inputs, labels = batch
    outputs = model(inputs, training=False)  # outputs is a TFSequenceClassifierOutput
    logits = outputs.logits
    preds = tf.argmax(logits, axis=1)
    test_preds.extend(preds.numpy())
    test_true.extend(labels.numpy())

# Report test-set metrics the same way as for the validation set
print(classification_report(test_true, test_preds, target_names=["ham", "spam"]))
print(confusion_matrix(test_true, test_preds))

In [None]:
model.save_pretrained("distilbert-sms-spam")
tokenizer.save_pretrained("distilbert-sms-spam")