Spam Detection using SVM 

This notebook builds an SMS spam detector using a Support Vector Machine (SVM). The dataset, loaded from two CSV files, contains messages labeled as either spam or not spam. Preprocessing replaces null values with empty strings and converts the category labels to integers. The text is then transformed into feature vectors using TF-IDF vectorization. The SVM model, configured with a linear kernel and balanced class weights, is trained on the resulting features. Accuracy, precision, recall, and F1 score are calculated for both the training and test sets. Finally, the user can enter a new SMS message, and the notebook displays the model's prediction along with the test-set metrics.

Import Libraries

This block imports the necessary libraries for data manipulation, feature extraction, model training, and performance evaluation.

In [5]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

Load first CSV into dataframe

This block attempts to load the first dataset from sms_dataset.csv using the latin-1 encoding. If a UnicodeDecodeError occurs, it prints an error message and exits the program.

In [6]:
try:
    raw_sms_data = pd.read_csv('sms_dataset.csv', encoding='latin-1')
except UnicodeDecodeError:
    print("Error: Unable to decode using 'latin-1' encoding. Please check the file encoding.")
    exit(1)

Load second CSV into dataframe

As with the first file, this block attempts to load the second dataset from mail_dataset.csv. If a UnicodeDecodeError occurs, it prints an error message and exits the program.

In [7]:
try:
    raw_sms_data2 = pd.read_csv('mail_dataset.csv', encoding='latin-1')
except UnicodeDecodeError:
    print("Error: Unable to decode using 'latin-1' encoding. Please check the file encoding.")
    exit(1)
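
One caveat about the two loading blocks: latin-1 assigns a character to every possible byte value, so decoding with it does not normally raise a UnicodeDecodeError, and the except branch is unlikely to run. A minimal sketch of an alternative, assuming the files might actually be UTF-8, tries the stricter encoding first and falls back to latin-1. The helper name decode_with_fallback is hypothetical, and it works on raw bytes rather than on the CSV files themselves.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes the bytes."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError("none of the candidate encodings could decode the data")


# latin-1 accepts every byte value, so this decode cannot fail.
all_bytes = bytes(range(256))
assert len(all_bytes.decode("latin-1")) == 256

# Valid UTF-8 is decoded as UTF-8; other byte sequences fall back to latin-1.
print(decode_with_fallback("café".encode("utf-8")))
print(decode_with_fallback(b"caf\xe9"))
```

The same order of attempts could be applied to pd.read_csv by passing each candidate to its encoding parameter in turn.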

Concatenate the two datasets

In [8]:
raw_sms_data = pd.concat([raw_sms_data, raw_sms_data2], ignore_index=True)
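
To illustrate what ignore_index=True does, here is a small sketch with two made-up frames. The rows and the "ham" label are illustrative assumptions, not taken from the datasets. It also shows that a column present in only one frame is filled with NaN for the other frame's rows, which is one reason a null-handling step is useful after concatenation.

```python
import pandas as pd

# Two toy frames; the second has an extra column, as real files sometimes do.
first = pd.DataFrame({"Category": ["spam", "ham"],
                      "Message": ["win now", "see you soon"]})
second = pd.DataFrame({"Category": ["ham"],
                       "Message": ["call me"],
                       "Normalization": ["call me"]})

combined = pd.concat([first, second], ignore_index=True)

# The index is renumbered 0..n-1 instead of repeating 0, 1, 0.
print(combined.index.tolist())

# Rows from the frame without the extra column hold NaN in that column.
print(combined["Normalization"].isna().tolist())
```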

In [9]:
raw_sms_data.head()

Unnamed: 0,Category,Message,Normalization
0,spam,J-PC:Spin & Win! 2K Bonus +3% GCASH! Visit 686...,j-pc:spin & win! 2k bonus +3% gcash! visit 686...
1,spam,",Enjoy online slot here at JACKPOT CITY and wi...",",enjoy online slot here at jackpot city and wi..."
2,spam,Claim the J'PC offer! Deposite & get 777 Free ...,claim the j'pc offer! deposite & get 777 free ...
3,spam,Diyos ng kayamanan ay nandito! Magdownload at ...,diyos ng kayamanan ay nandito! magdownload at ...
4,spam,Lahat ay gumagamit ng global currency na plata...,lahat ay gumagamit ng global currency na plata...


Replace Null Values with an Empty String

In [10]:
sms_data = raw_sms_data.fillna('')

Clean the 'Category' column values and convert to integers

This block, along with the block above, replaces null values with empty strings and converts the 'Category' column values to integers (0 for spam, 1 for not spam).

In [11]:
sms_data['Category'] = sms_data['Category'].apply(lambda x: 0 if x.strip() == 'spam' else 1)
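
Note that the rule above maps every value other than the exact string 'spam' to 1, including empty strings left by the null replacement and differently cased labels such as 'Spam'. The sketch below reproduces that behavior and shows a stricter variant. Both function names are hypothetical, and "ham" is an assumed name for the not-spam label.

```python
def encode_category(value):
    """Mirror the notebook's rule: exactly 'spam' (after stripping) is 0, else 1."""
    return 0 if value.strip() == "spam" else 1


samples = [" spam ", "ham", "Spam", ""]
print([encode_category(s) for s in samples])  # [0, 1, 1, 1]

# Stricter variant: normalize case and reject anything unexpected.
LABELS = {"spam": 0, "ham": 1}  # "ham" is an assumed label name


def encode_category_strict(value):
    key = value.strip().lower()
    if key not in LABELS:
        raise ValueError(f"unexpected category: {value!r}")
    return LABELS[key]


print(encode_category_strict("Spam"))  # 0
```

Whether the strict variant is appropriate depends on which label strings the CSV files actually contain.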

Separate data into texts and labels

In [12]:
X = sms_data['Message']
Y = sms_data['Category']


Split the data into training data and test data

This block splits the input texts (X) and labels (Y) into training and test sets, holding out 20% of the data for testing. The fixed random_state makes the split reproducible. The SVM model itself is created later, just before training.

In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)
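
If the classes are imbalanced, a plain random split can leave the test set with a different spam ratio than the training set. A possible refinement, not used in the original notebook, is the stratify parameter of train_test_split. The sketch below uses made-up data with 5 spam and 15 non-spam labels.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 5 messages labeled 0 (spam) and 15 labeled 1 (not spam).
texts = [f"message {i}" for i in range(20)]
labels = [0] * 5 + [1] * 15

X_train, X_test, Y_train, Y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=3, stratify=labels
)

# With stratify, the 1:3 class ratio is preserved in both subsets.
print(len(X_train), len(X_test))
print(Counter(Y_train), Counter(Y_test))
```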

Transform the text data to feature vectors that can be used as input to SVM

This block creates a TF-IDF vectorizer, which converts text into numeric feature vectors suitable as SVM input. The vectorizer also lowercases the text and removes English stop words.

In [14]:
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

Convert X_train and X_test to feature vectors

In [15]:
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)
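
The vectorizer is fitted on the training texts only and then reused to transform the test texts, so the vocabulary and IDF weights come from the training data alone. A small sketch with made-up messages shows the consequence: tokens that never appeared in training are ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["claim your free prize", "lunch at noon tomorrow"]
test_texts = ["zzzunseen qqqword"]  # tokens that never occur in training

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
train_features = vectorizer.fit_transform(train_texts)  # learns the vocabulary
test_features = vectorizer.transform(test_texts)        # reuses the vocabulary

# Both matrices share the same number of columns (one per learned token).
print(train_features.shape[1] == test_features.shape[1])

# The unseen-only message has no non-zero entries.
print(test_features.nnz)
```

A message made up entirely of unseen tokens therefore reaches the model as an all-zero vector.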

Train the model (SVM)

This block initializes an SVM model with a linear kernel and balanced class weights. The next block trains it on the training features.

In [16]:
model = SVC(class_weight='balanced', kernel='linear') 
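
With class_weight='balanced', scikit-learn weights each class inversely to its frequency, using n_samples / (n_classes * count_of_class). The sketch below recomputes that formula by hand on made-up labels; the helper name balanced_class_weights is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute n_samples / (n_classes * class_count) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy labels: 2 spam (0) and 8 not spam (1).
labels = [0] * 2 + [1] * 8
print(balanced_class_weights(labels))  # {0: 2.5, 1: 0.625}
```

The rarer class receives the larger weight, so misclassifying it costs more during training.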

Train the SVM model with the training data

In [17]:
model.fit(X_train_features, Y_train)

Prediction on training data

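This block makes predictions on the training data and calculates accuracy, precision, recall, and F1 score. Because the metric functions use their default positive label of 1, the precision, recall, and F1 values describe the "not spam" class.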

In [18]:
prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
precision_on_training_data = precision_score(Y_train, prediction_on_training_data)
recall_on_training_data = recall_score(Y_train, prediction_on_training_data)
f1_on_training_data = f1_score(Y_train, prediction_on_training_data)
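
Since spam is encoded as 0, it can help to see how the choice of positive class changes the numbers. The sketch below computes precision, recall, and F1 by hand on made-up labels for both choices. The helper precision_recall_f1 is hypothetical; scikit-learn's metric functions offer the same choice through their pos_label parameter.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels: 0 = spam, 1 = not spam; one spam message is missed.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1]

print(precision_recall_f1(y_true, y_pred, positive=1))  # "not spam" view
print(precision_recall_f1(y_true, y_pred, positive=0))  # "spam" view
```

In this toy case, recall is 1.0 for "not spam" but only about 0.67 for "spam", so the same predictions look quite different depending on the class being measured.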

Prediction on test data

This block computes the same metrics on the held-out test data, which the model did not see during training.

In [19]:
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)
precision_on_test_data = precision_score(Y_test, prediction_on_test_data)
recall_on_test_data = recall_score(Y_test, prediction_on_test_data)
f1_on_test_data = f1_score(Y_test, prediction_on_test_data)
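
A confusion matrix can complement these summary scores by showing where the errors fall. The sketch below uses scikit-learn's confusion_matrix, which the original notebook does not import, on made-up labels rather than the notebook's actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = spam, 1 = not spam.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# [[2 1]
#  [0 5]]
```

Here the top-right cell counts one spam message that was predicted as not spam.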

Ask for user input (SMS) for prediction

This block prompts the user to enter an SMS message. The following blocks convert the input to a feature vector with the fitted vectorizer and make a prediction using the trained model.

In [20]:
sms_input = input('Enter SMS: ')

Convert SMS text to feature vectors

In [21]:
input_data_features = feature_extraction.transform([sms_input])

Make prediction

In [22]:
my_prediction = model.predict(input_data_features)
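
The transform-then-predict steps can be wrapped in a small helper for reuse. The sketch below is one way to do that, assuming a fitted vectorizer and model. The function name classify_sms and the toy training data are hypothetical, and the tiny model is only there to make the example runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC


def classify_sms(text, vectorizer, model):
    """Return 'Spam' or 'Not Spam' using the notebook's encoding (0 = spam)."""
    features = vectorizer.transform([text])  # a list with one message
    return "Spam" if model.predict(features)[0] == 0 else "Not Spam"


# Toy training data, only to obtain a fitted vectorizer and model.
toy_texts = ["win free cash prize", "free bonus claim prize",
             "lunch tomorrow maybe", "running late sorry"]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_model = SVC(kernel="linear", class_weight="balanced")
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

print(classify_sms("claim your free prize", toy_vectorizer, toy_model))
```

Note that the input is passed as a one-element list, because the vectorizer expects an iterable of documents rather than a single string.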

Display the prediction

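This block displays the user's input and the predicted label, followed by the accuracy, precision, recall, and F1 score on the test data. These metrics were computed earlier from the test set and are not affected by the message entered.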

In [23]:
print("User Input:", sms_input)
print("Prediction:", "Spam" if my_prediction[0] == 0 else "Not Spam")
print("Accuracy on test data after prediction:", accuracy_on_test_data * 100, '%')
print("Precision on test data:", precision_on_test_data)
print("Recall on test data:", recall_on_test_data)
print("F1 score on test data:", f1_on_test_data)

User Input: haha lol wer u at
Prediction: Not Spam
Accuracy on test data after prediction: 94.1908713692946 %
Precision on test data: 0.9291338582677166
Recall on test data: 0.959349593495935
F1 score on test data: 0.9440000000000001
