## Project Description

Model =  DistilBert-base-uncased (for sequence classification)  

Downstream task = Classify SMS messages as Spam or Ham

Dataset = SMSSpamCollection.txt

Source = Downloaded the dataset from Kaggle



## Import Dataset




In [None]:
!wget -O SMSSpamCollection.txt https://raw.githubusercontent.com/RAJ123PAPA/SMS-Spam-Finetuning-/main/SMSSpamCollection.txt

In [2]:
path = 'SMSSpamCollection.txt'


## Prepare The Dataset

The dataset is a tab-separated text file, so I use pandas to load it into a format the model pipeline can work with.

First we read the file into a DataFrame and then split it into two lists:

1). x for the messages

2). y for the labels

In [3]:
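import pandas as pd

# The file is tab-separated with no header row: label first, then message text
df = pd.read_csv(path, sep='\t', names=["label", "message"])
df.head()

# Split the two columns into plain Python lists
x = list(df['message'])
y = list(df['label'])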

The dataset has two columns. The label column takes one of two values, spam or ham.

The model needs numeric labels, so I convert spam to 1 and ham to 0.

In [4]:
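# drop_first=True drops the 'ham' column, leaving 'spam' as the indicator; cast to int so labels are 0/1 rather than booleans
y = list(pd.get_dummies(y, drop_first=True)['spam'].astype(int))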
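
The one-hot step above relies on `ham` sorting before `spam` so that `drop_first=True` keeps the `spam` column. As a minimal sketch of an alternative, the same encoding can be written as an explicit dictionary lookup, which makes the 1 = spam, 0 = ham convention visible in the code. The names `sample_labels` and `LABEL_TO_ID` are hypothetical and used only for illustration; they are not part of the original notebook.

```python
# Hypothetical sample standing in for the `label` column of the dataset.
sample_labels = ["ham", "spam", "ham", "ham", "spam"]

# Explicit mapping: spam is treated as the positive class (1).
LABEL_TO_ID = {"ham": 0, "spam": 1}

# A KeyError here would signal an unexpected label value in the data.
encoded = [LABEL_TO_ID[label] for label in sample_labels]
print(encoded)
```

An unexpected label value raises a `KeyError` instead of being silently encoded, which can help catch malformed rows early.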

Now we use scikit-learn to split the dataset into training and test sets, holding out 20% of the data for testing (test_size = 0.2).

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)
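
If spam is the minority class, as is common for SMS data, a purely random split can leave the two sets with slightly different spam ratios. The sketch below shows one way to check this with plain Python. The helper `class_fractions` and the example lists are hypothetical stand-ins for `y_train` and `y_test`, not values from the actual dataset.

```python
from collections import Counter

def class_fractions(labels):
    """Return the share of each label value in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical label lists standing in for y_train and y_test.
example_train = [0, 0, 0, 1, 0, 0, 0, 1]
example_test = [0, 0, 0, 1]

print(class_fractions(example_train))
print(class_fractions(example_test))
```

If the ratios differ noticeably on the real data, passing `stratify=y` to `train_test_split` is one option for keeping them aligned.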

Now install Transformers to get the pretrained model.

I am using the DistilBERT model for sequence classification.

In [None]:
!pip install transformers

## Fine Tune The Model

This is the main code, in which I did the following tasks:

1. Load the model and its tokenizer using transformers
2. Use CUDA (GPU) to speed up training
3. Define a CustomDataset that preprocesses the data so the model can be trained on it
4. Use a DataLoader to split the dataset into batches and shuffle it, so that training is faster and more efficient
5. Use the AdamW optimizer and the CrossEntropyLoss loss function
6. Start the training loop with 3 epochs
7. Save the model under the name SMSSpam_Filter_model
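
The preprocessing step brings every message to the same fixed length so that messages can be stacked into batches. The sketch below illustrates what padding and truncating to a maximum length mean, using plain lists. It is a simplified illustration, not the tokenizer's implementation: it ignores special-token handling, and the token ids, `PAD_ID`, and `pad_or_truncate` are assumptions made up for the example.

```python
PAD_ID = 0  # assumed padding id, for illustration only

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]           # cut off anything past max_length
    n_pad = max_length - len(kept)          # number of padding slots needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token ids for a short and a long message.
short_ids, short_mask = pad_or_truncate([101, 2489, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the dataset returns it alongside the input ids.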


In [None]:
import torch
from torch.utils.data import DataLoader, Dataset
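from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of the PyTorch implementation
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer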

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained DistilBERT model and tokenizer
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name)
model.to(device)  # Move the model to GPU
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

# Dataset wrapper that tokenizes each message when it is accessed
class CustomDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        encoding = self.tokenizer(text, padding='max_length', truncation=True, max_length=self.max_length, return_tensors='pt')
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Define the training data
train_texts = X_train  # List of training texts
train_labels = y_train # List of corresponding labels (1 = spam, 0 = ham)
train_dataset = CustomDataset(train_texts, train_labels, tokenizer, max_length=128)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# Prepare optimizer and loss function
optimizer = AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# Training loop
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        inputs = batch['input_ids'].to(device)  # Move input tensors to GPU
        attention_mask = batch['attention_mask'].to(device)  # Move attention mask to GPU
        labels = batch['labels'].to(device)  # Move labels to GPU
        outputs = model(inputs, attention_mask=attention_mask)[0]
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()

# Save the trained model
model.save_pretrained("SMSSpam_Filter_model")
tokenizer.save_pretrained("tokenizer")
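
The training cell above does not use the held-out `X_test` and `y_test`. Once predictions for the test messages are available, a few metrics with spam as the positive class can summarize how well the filter works. The following is a minimal sketch in plain Python; `spam_metrics` and the example label lists are hypothetical and do not represent results from this model.

```python
def spam_metrics(y_true, y_pred):
    """Accuracy, precision and recall with spam (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical true labels and predictions, for illustration only.
example_true = [0, 0, 1, 1, 0, 1]
example_pred = [0, 1, 1, 0, 0, 1]
metrics = spam_metrics(example_true, example_pred)
print(metrics)
```

Precision and recall are worth reporting next to accuracy here, because a model that labels every message as ham can still score a high accuracy when spam is rare.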


In [None]:
!pip install gradio
!pip install transformers

In [9]:
model_path='SMSSpam_Filter_model'

In [10]:
import os
import shutil

# Move the tokenizer files next to the model weights so both load from one directory
tokenizer_dir = '/content/tokenizer'
destination_path = '/content/SMSSpam_Filter_model'

for file_name in ['added_tokens.json', 'special_tokens_map.json', 'tokenizer_config.json', 'vocab.txt']:
    source_path = os.path.join(tokenizer_dir, file_name)
    # Skip any file that was not written, so the cell does not fail on a missing file
    if os.path.exists(source_path):
        shutil.move(source_path, destination_path)

In [None]:
import gradio as gr
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

# Load the fine-tuned model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizer.from_pretrained(model_path)

def predict_spam(text):
    # Tokenize a single message the same way as during training
    inputs = tokenizer(text, padding='max_length', truncation=True, max_length=128, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # Index of the larger logit: 1 = spam, 0 = ham
    predicted_label = torch.argmax(outputs.logits, dim=-1).item()
    return "Spam" if predicted_label == 1 else "Ham"

iface = gr.Interface(fn=predict_spam, inputs="text", outputs="text")
iface.launch()
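
The prediction function above returns only the winning label. If a confidence score is also wanted, the two logits can be turned into probabilities with a softmax. The sketch below shows the calculation in plain Python. The logit values are hypothetical, and the assumed order is `[ham, spam]`, matching the 0 = ham, 1 = spam encoding used for training.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in [ham, spam] order, for illustration only.
example_logits = [1.2, -0.3]
ham_prob, spam_prob = softmax(example_logits)
print(f"ham: {ham_prob:.3f}, spam: {spam_prob:.3f}")
```

With PyTorch tensors the same calculation is available as `torch.softmax(outputs.logits, dim=-1)`.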
