Dataset Loading

In [1]:
# Upload the file from local disk to the Colab session
# from google.colab import files
# files.upload()

In [2]:
import pandas as pd

In [3]:
data = pd.read_csv(
    "SMSSpamCollection",
    sep="\t",  # Split each row into columns at every TAB character
    header=None,  # The file has no header row
    names=["label", "message"]  # Assign column names manually: label = "spam" or "ham"; message = the SMS text
)

Converting labels to binary

In [4]:
data["label"] = data["label"].map({"ham": 0, "spam": 1})

In [5]:
print(data.head())

   label                                            message
0      0  Go until jurong point, crazy.. Available only ...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...
4      0  Nah I don't think he goes to usf, he lives aro...
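A small check on the label mapping, as a sketch: `Series.map` with a dict returns NaN for any value that is not a key, so a stray label variant would silently become a missing value. The toy labels below (including the malformed "Spam ") are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy labels; "Spam " is a deliberately malformed value (hypothetical, not from the dataset)
labels = pd.Series(["ham", "spam", "Spam ", "ham"])
mapping = {"ham": 0, "spam": 1}

# Values missing from the dict become NaN, which also turns the column into floats
mapped = labels.map(mapping)
n_unmapped = int(mapped.isna().sum())

# Normalising case and whitespace first lets every label map to an integer
normalised = labels.str.strip().str.lower().map(mapping).astype(int)

print("unmapped labels:", n_unmapped)
print(normalised.tolist())
```

Running `data["label"].isna().sum()` after the mapping cell is a quick way to confirm that nothing was lost.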


Text Processing
Preprocess the text data (lowercasing, lemmatization, stop-word removal, removal of non-alphabetic characters, etc.)

In [6]:
import re     # Regular expressions for pattern-based text cleaning
import nltk   # Natural Language Toolkit
from nltk.corpus import stopwords  # Common English stop words ("the", "is", "and") that carry little meaning
from nltk.stem import WordNetLemmatizer  # Reduces words to their dictionary form; treats words as nouns by default (e.g., "lives" → "life")
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
stop_words = set(stopwords.words('english'))  # Set of common English stop words ("the", "is", "and", "to", etc.); exact size depends on the NLTK version
lemmatizer = WordNetLemmatizer()  # Initialize NLTK's lemmatizer

def preprocess(text):
    text = text.lower()  # Convert all characters to lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters (digits, punctuation)
    words = text.split()
    # Apostrophes are already stripped here, so contractions such as "dont" no longer match the stop-word list
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)  # Rejoin the filtered lemmas into a single string for TF-IDF

data['text_clean'] = data['message'].apply(preprocess)
print(data['text_clean'].head())

0    go jurong point crazy available bugis n great ...
1                              ok lar joking wif u oni
2    free entry wkly comp win fa cup final tkts st ...
3                  u dun say early hor u c already say
4             nah dont think go usf life around though
Name: text_clean, dtype: object
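The "dont" in row 4 of the output comes from the order of operations: punctuation is removed before stop words are filtered, so "don't" becomes "dont" and no longer matches the stop-word list. The sketch below shows the effect with the standard library only. `STOP_WORDS` is a hypothetical mini list standing in for NLTK's, and both function names are made up for this illustration.

```python
import re

# Hypothetical mini stop-word list; like NLTK's, it stores contractions with apostrophes
STOP_WORDS = {"i", "don't", "the", "he", "to"}

def strip_then_filter(text):
    """Order used in the cell above: remove punctuation, then drop stop words."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return [w for w in text.split() if w not in STOP_WORDS]

def filter_then_strip(text):
    """Alternative order: drop stop words first, then remove punctuation."""
    kept = [w for w in text.lower().split() if w not in STOP_WORDS]
    cleaned = (re.sub(r"[^a-z]", "", w) for w in kept)
    return [w for w in cleaned if w]  # discard tokens that became empty

sample = "Nah I don't think he goes to usf"
tokens_strip_first = strip_then_filter(sample)
tokens_filter_first = filter_then_strip(sample)

print(tokens_strip_first)   # "dont" survives
print(tokens_filter_first)  # "don't" is removed as a stop word
```

Whether the contraction should be kept is a modelling choice; the point is only that the two orders give different tokens.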


Convert the cleaned text into numerical features using TF-IDF

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# Transform text into a TF-IDF matrix, where each term's weight = term frequency × inverse document frequency.
# max_features=3000 caps the vocabulary at the 3000 most frequent terms.
# min_df=5 ignores terms that appear in fewer than 5 documents (filters rare, noisy terms).
vectorizer = TfidfVectorizer(max_features=3000, min_df=5)
# Note: fitting on the full dataset before splitting lets test-set vocabulary and IDF statistics leak into training.
X = vectorizer.fit_transform(data['text_clean'])
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print("TF-IDF shape:", X.shape)

TF-IDF shape: (5572, 1529)
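To make the weighting concrete, here is a minimal standard-library sketch of the calculation on a made-up three-message corpus. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows) and uses plain whitespace splitting, so it is an approximation of `TfidfVectorizer`, not a replacement for it.

```python
import math
from collections import Counter

docs = ["free entry win", "ok lar", "free win win"]  # toy corpus, not from the dataset
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

vocab = sorted({t for doc in tokenized for t in doc})
# Document frequency: number of documents that contain each term
df = Counter(t for doc in tokenized for t in set(doc))
# Smoothed inverse document frequency (assumed to match scikit-learn's default)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]

for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

Terms that occur in fewer documents ("entry") receive a larger IDF than terms shared across documents ("free"), and repeating a term within a message ("win win") raises its weight in that row.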


Perceptron Training and Evaluation
Train a Perceptron and compute accuracy, plus precision and recall for the non-spam class (ham, class 0)

In [8]:
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

model = Perceptron(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec_ham = precision_score(y_test, y_pred, pos_label=0)  # Non-spam precision
rec_ham = recall_score(y_test, y_pred, pos_label=0)     # Non-spam recall

print(f"Accuracy: {acc:.3f}")
print(f"Ham Precision: {prec_ham:.3f}, Recall: {rec_ham:.3f}")
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))


Accuracy: 0.966
Ham Precision: 0.983, Recall: 0.977
              precision    recall  f1-score   support

         ham       0.98      0.98      0.98       966
        spam       0.86      0.89      0.88       149

    accuracy                           0.97      1115
   macro avg       0.92      0.93      0.93      1115
weighted avg       0.97      0.97      0.97      1115
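The `pos_label=0` argument decides which class counts as "positive" when precision and recall are computed. A hand-rolled sketch on toy labels (made up for illustration, not the model's predictions) shows how the same predictions give different figures per class. The helper `precision_recall` is hypothetical and only mirrors the standard definitions.

```python
def precision_recall(y_true, y_pred, pos_label):
    """Precision and recall, treating `pos_label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == pos_label and t == pos_label)
    fp = sum(1 for t, p in pairs if p == pos_label and t != pos_label)
    fn = sum(1 for t, p in pairs if p != pos_label and t == pos_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels: 0 = ham, 1 = spam
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

ham_scores = precision_recall(y_true_toy, y_pred_toy, pos_label=0)
spam_scores = precision_recall(y_true_toy, y_pred_toy, pos_label=1)

print("ham  precision/recall:", ham_scores)
print("spam precision/recall:", spam_scores)
```

With imbalanced classes the majority class (ham) tends to score higher, which is why the report above lists both classes separately.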



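Limitations
The Perceptron only converges when the classes are linearly separable, so it can struggle with SMS patterns that are not linearly separable in TF-IDF space (e.g., sarcasm). It does not compensate for class imbalance unless configured to do so (for example via the class_weight parameter). It also produces hard class predictions rather than probabilities and may overfit sparse TF-IDF features. It is a reasonable fit for a binary task like this one and reaches about 96.6% accuracy on this test split, but Logistic Regression or a linear SVM are generally the more robust choices.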
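The linear-separability point can be illustrated with a minimal pure-Python sketch of the classic perceptron update rule. This is not scikit-learn's implementation (which adds shuffling, stopping criteria and other options); the function names and the AND/XOR toy data are chosen for illustration only.

```python
def train_perceptron(samples, epochs=100):
    """Classic perceptron rule on 2-D inputs; returns (weights, bias)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), y in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = y - pred  # 0 when correct, +1 or -1 when wrong
            if delta:
                w[0] += delta * x1
                w[1] += delta * x2
                b += delta
                errors += 1
        if errors == 0:  # a full pass without mistakes: converged
            break
    return w, b

def count_errors(samples, w, b):
    """Number of samples the trained weights misclassify."""
    return sum(
        1 for (x1, x2), y in samples
        if (1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) != y
    )

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # linearly separable
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # not linearly separable

and_errors = count_errors(AND, *train_perceptron(AND))
xor_errors = count_errors(XOR, *train_perceptron(XOR))

print("AND errors:", and_errors)
print("XOR errors:", xor_errors)
```

The AND data is learned without mistakes, while no single line can separate XOR, so at least one point stays misclassified however long training runs.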