To build the SMS spam detection project, follow the steps below. The complete Python code for each step comes after the list:

1. Load the dataset.

2. Convert the text labels to numeric labels using LabelEncoder.

3. Perform text preprocessing (optional, for better model performance).

4. Split the dataset into training and test sets.

5. Vectorize the text messages using CountVectorizer and TfidfTransformer.

6. Train a machine learning model (e.g., Support Vector Machine) on the data.

7. Evaluate the model with the test dataset.

8. Save the trained model and necessary transformations for future predictions.

Note: Custom test cases at the end check whether a message is classified as ham or spam.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import pickle
import zipfile

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# Step 1: Load the dataset
# Download the UCI SMS Spam Collection archive and extract the data file
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
with zipfile.ZipFile('smsspamcollection.zip', 'r') as zip_ref:
    zip_ref.extract('SMSSpamCollection', '.')

--2024-05-17 14:31:04--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘smsspamcollection.zip’

smsspamcollection.z     [  <=>               ] 198.65K   804KB/s    in 0.2s    

2024-05-17 14:31:05 (804 KB/s) - ‘smsspamcollection.zip’ saved [203415]



In [3]:
sms_df = pd.read_csv('SMSSpamCollection', sep='\t', header=None)

In [4]:
# Step 2: Convert text labels to numeric labels (ham -> 0, spam -> 1)
label_encoder = LabelEncoder()
sms_df['label'] = label_encoder.fit_transform(sms_df[0])
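
LabelEncoder assigns integer codes in sorted order of the distinct labels, which is why `ham` becomes 0 and `spam` becomes 1 in this notebook. The following is a plain-Python sketch of that mapping; the helper name `encode_labels` is hypothetical and is not part of scikit-learn.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(labels))
    to_code = {label: code for code, label in enumerate(classes)}
    return [to_code[label] for label in labels], classes

codes, classes = encode_labels(["ham", "ham", "spam", "ham"])
print(codes)               # [0, 0, 1, 0]
print(classes)             # ['ham', 'spam']
print(classes[codes[2]])   # spam
```

With the fitted encoder itself, `label_encoder.classes_` holds the same sorted class list, and `label_encoder.inverse_transform` converts numeric codes back to text labels.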

In [5]:
sms_df.head()

      0                                                  1  label
0   ham  Go until jurong point, crazy.. Available only ...      0
1   ham                      Ok lar... Joking wif u oni...      0
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...      1
3   ham  U dun say so early hor... U c already then say...      0
4   ham  Nah I don't think he goes to usf, he lives aro...      0


In [6]:
print(sms_df.columns)

Index([0, 1, 'label'], dtype='object')


In [7]:
# Step 3: Text preprocessing
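# Build the stop-word set and the stemmer once, rather than on every call
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")

# Define a function for preprocessing the text
def preprocess_text(text):
    # Tokenize, drop English stop words, and stem the remaining tokens
    words = nltk.word_tokenize(text)
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)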

# Preprocess the SMS messages
sms_df['processed'] = sms_df[1].apply(preprocess_text)
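
One detail of the preprocessing step: NLTK's English stop-word list is lowercase, and the filter above compares tokens before they are lowercased, so a capitalized stop word such as "You" is kept. The sketch below illustrates the effect with a tiny hypothetical stop-word set (`STOP_WORDS`) and a hypothetical helper (`drop_stop_words`); it is not the notebook's function.

```python
# Hypothetical, tiny stop-word set standing in for stopwords.words('english').
STOP_WORDS = {"you", "a", "the", "have", "it"}

def drop_stop_words(tokens, lowercase_first):
    """Remove stop words, optionally lowercasing the tokens before the check."""
    if lowercase_first:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["You", "have", "won", "a", "prize"]
print(drop_stop_words(tokens, lowercase_first=False))  # ['You', 'won', 'prize']
print(drop_stop_words(tokens, lowercase_first=True))   # ['won', 'prize']
```

Lowercasing before the check in the notebook would change the extracted features, and therefore the reported metrics, so treat it as an optional refinement.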


In [8]:
# Step 4: Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    sms_df['processed'], sms_df['label'], test_size=0.2, random_state=42
)

In [9]:
# Step 5: Text vectorization
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_vectorized)
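
To make the vectorization step less of a black box, here is a plain-Python sketch of the weighting that TfidfTransformer applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, then L2-normalized per document. The function name `tfidf` and the toy documents are assumptions for illustration. The sketch splits on whitespace, whereas CountVectorizer lowercases the text and uses its own token pattern.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    counts = [Counter(doc.split()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for c in counts:
        raw = {t: tf * idf[t] for t, tf in c.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

docs = ["free entri win cup", "ok lar joke", "win free prize", "ok see later"]
vectors = tfidf(docs)
for term, weight in sorted(vectors[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first toy document, "entri" and "cup" appear in only one document and so receive a larger weight than "free" and "win", which appear in two.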


In [10]:
# Step 6: Train a machine learning model
model = SVC(kernel='linear')
model.fit(X_train_tfidf, y_train)

In [11]:
# Step 7: Evaluate the model
# Import the necessary modules
from sklearn.metrics import classification_report
X_test_vectorized = vectorizer.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_vectorized)

y_pred = model.predict(X_test_tfidf)

print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))

# Confusion matrix (rows: true labels, columns: predicted labels)
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.9910313901345291
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       966
           1       1.00      0.93      0.97       149

    accuracy                           0.99      1115
   macro avg       0.99      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115

[[966   0]
 [ 10 139]]
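
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when reading them. The sketch below does that for the spam class and also adds a majority-class baseline (always predicting ham) for comparison; the baseline is an addition for context and was not part of the original evaluation.

```python
# Confusion matrix from the output above: rows are true labels,
# columns are predicted labels, in the order [ham (0), spam (1)].
cm = [[966, 0],
      [10, 139]]

tn, fp = cm[0]   # true ham:  predicted ham, predicted spam
fn, tp = cm[1]   # true spam: predicted ham, predicted spam
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # of predicted spam, how much is spam
recall = tp / (tp + fn)             # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
baseline = (tn + fp) / total        # accuracy of always predicting ham

print(f"accuracy:  {accuracy:.4f}")   # 0.9910
print(f"precision: {precision:.4f}")  # 1.0000
print(f"recall:    {recall:.4f}")     # 0.9329
print(f"f1:        {f1:.4f}")         # 0.9653
print(f"baseline:  {baseline:.4f}")   # 0.8664
```

Because ham makes up about 87% of the test set, accuracy alone flatters the model; the spam recall of 93% (10 missed spam messages) is the more informative number.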


In [12]:
# Step 8: Save the trained model, vectorizer, and transformer
with open('spam_classifier.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
with open('tfidf_transformer.pkl', 'wb') as f:
    pickle.dump(tfidf_transformer, f)
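
Since the model, the vectorizer, and the TF-IDF transformer are only useful together, one option is to store them in a single file so they cannot get out of sync. The sketch below shows the idea with the standard-library `pickle` module; the helper names `save_bundle` and `load_bundle` and the file name are hypothetical, and placeholder strings stand in for the fitted objects so the example runs on its own.

```python
import os
import pickle
import tempfile

def save_bundle(path, **artifacts):
    """Pickle all named artifacts together as one dictionary."""
    with open(path, "wb") as f:
        pickle.dump(artifacts, f)

def load_bundle(path):
    """Load the dictionary of artifacts written by save_bundle."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Placeholders keep the sketch self-contained; in the notebook you would
# pass the fitted model, vectorizer, and tfidf_transformer instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "spam_bundle.pkl")
    save_bundle(path, model="model", vectorizer="vectorizer",
                tfidf_transformer="tfidf_transformer")
    bundle = load_bundle(path)

print(sorted(bundle))  # ['model', 'tfidf_transformer', 'vectorizer']
```

Pickle files should only be loaded from trusted sources, and pickled scikit-learn objects are not guaranteed to load under a different scikit-learn version.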

**Test with Custom Inputs**

In this example:

We first load the trained model (spam_classifier.pkl), the vectorizer (vectorizer.pkl), and the TF-IDF transformer (tfidf_transformer.pkl).

These files should be stored in the same directory as the script or notebook, or we should provide the correct file paths.

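We preprocess the custom input text in nearly the same way as the training data. The function below additionally drops tokens that are not purely alphanumeric, which the training-time function did not do. CountVectorizer's default token pattern ignores punctuation anyway, so the difference is small, but reusing one function for both training and prediction avoids any mismatch.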

We use the trained model to predict whether the message is spam or ham.


In [13]:
# Load the model, vectorizer, and TF-IDF transformer
with open('spam_classifier.pkl', 'rb') as f:
    model = pickle.load(f)
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
with open('tfidf_transformer.pkl', 'rb') as f:
    tfidf_transformer = pickle.load(f)

# Text Preprocessing Function
# Define a text preprocessing function that will be applied to each SMS message
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")

def preprocess_text(text):
    tokens = nltk.word_tokenize(text)
    tokens = [t for t in tokens if t.isalnum()]  # Keep only purely alphanumeric tokens
    tokens = [t for t in tokens if t not in stop_words]  # Remove stopwords
    tokens = [stemmer.stem(t) for t in tokens]  # Stemming
    return ' '.join(tokens)

# Function to predict the class label using the trained model
def predict_label(text):
    # Preprocess the text
    preprocessed_text = preprocess_text(text)

    # Vectorize the text
    vectorized_text = vectorizer.transform([preprocessed_text])

    # Transform vectorized text to TF-IDF
    tfidf_text = tfidf_transformer.transform(vectorized_text)

    # Make prediction
    prediction = model.predict(tfidf_text)

    # Return the predicted label (0 for 'ham', 1 for 'spam')
    return prediction[0]

# Example: Test with custom input text
custom_text = "Congratulations! You’ve won a $500 Amazon gift card. Claim it here [Link]. "
predicted_label = predict_label(custom_text)
print(f"The email/msg is predicted as: {'spam' if predicted_label == 1 else 'ham'}")

The email/msg is predicted as: spam


The SMS spam detection project, using a linear Support Vector Machine (SVM), reached 99.1% accuracy on the 1,115-message test set.

Precision is high for both classes (0.99 for ham, 1.00 for spam), so when the model predicts a label, that label is very likely correct. The spam precision of 100% means that no ham message was misclassified as spam, that is, there were no false positives. Recall for spam is lower at 93%: 10 of the 149 spam messages in the test set were classified as ham.