In this project, we'll be developing a spam SMS detection system using the Logistic Regression model and TF-IDF for text vectorization. By leveraging libraries such as Pandas for data manipulation and Scikit-learn for machine learning, we'll classify SMS messages as spam or ham, aiming for accurate and efficient predictions.

The SMS Spam Collection Dataset is a popular dataset in the machine learning and data science community, particularly for text classification tasks.

The dataset has two columns (named 'v1' and 'v2' in the CSV file loaded below):

1) Category: This column contains the labels, which are either 'ham' (non-spam) or 'spam'.

2) Message: This column contains the actual SMS text. Each message is a short text string that is either spam or non-spam.

**Importing** necessary libraries for data cleaning and the machine learning algorithm

In [4]:

import pandas as pd # For handling datasets
from sklearn.feature_extraction.text import TfidfVectorizer # Converts text into numeric features, since the model cannot work with raw words
from sklearn.model_selection import train_test_split # Splits the data into training and testing sets
from sklearn.linear_model import LogisticRegression # The classifier
from sklearn.metrics import accuracy_score, classification_report # Evaluation metrics


Loading the dataset

In [5]:
path = "/content/spam.csv"

df = pd.read_csv(path, encoding="latin-1")
df.head(10)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
5,spam,FreeMsg Hey there darling it's been 3 week's n...,,,
6,ham,Even my brother is not like to speak with me. ...,,,
7,ham,As per your request 'Melle Melle (Oru Minnamin...,,,
8,spam,WINNER!! As a valued network customer you have...,,,
9,spam,Had your mobile 11 months or more? U R entitle...,,,


Keeping only the label and message columns and renaming them for clarity

In [9]:
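# Drop the empty 'Unnamed' columns and give the remaining two descriptive names.
# rename() returns a new DataFrame, which avoids a SettingWithCopyWarning later on.
df = df[['v1', 'v2']].rename(columns={'v1': 'label', 'v2': 'message'})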

Encoding the labels as numbers (spam = 1, ham = 0)

In [10]:
df['label'] = df['label'].map({'spam': 1, 'ham': 0})
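
One thing worth checking after this step: `Series.map` with a dictionary turns any value that is not a key of the dictionary into NaN. The sketch below uses a small hypothetical label series (not the real dataset) with one deliberately mis-cased entry to show how such a problem could be detected.

```python
import pandas as pd

# Hypothetical toy labels standing in for the real 'label' column.
# "Spam" (capital S) is deliberately not a key of the mapping below.
toy_labels = pd.Series(["ham", "spam", "ham", "Spam"])

mapped = toy_labels.map({"spam": 1, "ham": 0})

# Values missing from the dictionary become NaN, so count them.
unmapped_count = int(mapped.isna().sum())
print(f"Labels that could not be mapped: {unmapped_count}")
```

If the count is zero on the real data, every row received a valid 0 or 1.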

Converting text messages into numeric features using the TF-IDF vectorizer

In [11]:
vectorizer = TfidfVectorizer(max_features=5000)

X = vectorizer.fit_transform(df['message'])

y = df['label'].values
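
To get a feel for what the vectorizer produces, here is a minimal sketch on three made-up messages (hypothetical data, not taken from the dataset). Each message becomes one row, and each distinct word becomes one column holding its TF-IDF weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up messages, used only for illustration.
toy_messages = [
    "free entry win cash now",
    "are you coming home now",
    "win a free prize",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_messages)

# One row per message, one column per distinct word in the vocabulary.
print(toy_matrix.shape)

# With the default settings, single-character tokens such as "a" are ignored.
print(sorted(toy_vectorizer.vocabulary_))
```

In the notebook, `max_features=5000` additionally caps the vocabulary at the 5000 most frequent terms.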

Splitting data into training (70%) and testing (30%) sets

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Initializing and Training Logistic Regression Model

In [13]:
model = LogisticRegression()  # Initialization

model.fit(X_train, y_train)  # Training
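
Note that the vectorizer in this notebook is fitted on all messages before the split, so the test messages also contribute to the vocabulary and IDF weights. One possible alternative, sketched here on hypothetical toy data, is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn pipeline, so that both are fitted on the training messages only. The toy messages, variable names, and the `stratify` argument are assumptions made for this illustration and are not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = spam, 0 = ham.
toy_messages = [
    "win a free prize now", "free cash claim today",
    "urgent winner call now", "free entry to win cash",
    "claim your free ringtone", "are you coming home",
    "see you at lunch", "call me when you arrive",
    "thanks for the help", "ok talk later",
]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Split the raw text first, keeping the class balance in both parts.
train_msgs, test_msgs, train_y, test_y = train_test_split(
    toy_messages, toy_labels, test_size=0.3, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF and the classifier on the training messages only.
pipeline = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression())
pipeline.fit(train_msgs, train_y)

predictions = pipeline.predict(test_msgs)
print(predictions)
```

A pipeline also makes it possible to call `predict` directly on new raw text messages, because the same fitted vectorizer is applied automatically.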

Evaluating the model on the test set and checking its accuracy

In [14]:
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy Comes out to be: {accuracy * 100:.2f}%")

Model Accuracy Comes out to be: 95.99%
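
Accuracy alone can hide weak spam detection when ham messages heavily outnumber spam. `classification_report` is imported at the top of this notebook but not used, so the sketch below shows how it could complement the accuracy score. The `y_true` and `y_pred` values are hypothetical and hand-written for illustration; they are not outputs of the model above.

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Hypothetical labels: 8 ham (0) and 2 spam (1); one spam message is missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Recall for the spam class: the share of spam messages that were caught.
spam_recall = recall_score(y_true, y_pred, pos_label=1)
print(f"Spam recall: {spam_recall:.2f}")

# Per-class precision, recall and F1 in one table.
print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))
```

In this toy case 9 of 10 predictions are correct, yet only half of the spam messages are detected. On the real data the same calls would take `y_test` and `y_pred` from the cell above.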


In conclusion, with a 70% training / 30% testing split, the model accuracy comes to 95.99%.
With an 80% training / 20% testing split, the model accuracy comes to 96%.
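
The comparison between split sizes can be sketched as a small loop. The helper `accuracy_for_split` and the toy messages below are hypothetical and only illustrate the procedure; the helper follows the same steps as this notebook (vectorize, split, train, score), and the figures it prints for toy data say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def accuracy_for_split(messages, labels, test_size, seed=42):
    """Vectorize, split, train and score for one test-set fraction."""
    features = TfidfVectorizer(max_features=5000).fit_transform(messages)
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=test_size, random_state=seed
    )
    clf = LogisticRegression().fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))


# Hypothetical toy data: 1 = spam, 0 = ham.
toy_messages = [
    "win a free prize now", "free cash claim today",
    "urgent winner call now", "free entry to win cash",
    "claim your free ringtone", "are you coming home",
    "see you at lunch", "call me when you arrive",
    "thanks for the help", "ok talk later",
]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Score the same data with a 30% and a 20% test set.
results = {size: accuracy_for_split(toy_messages, toy_labels, size) for size in (0.3, 0.2)}
for size, acc in results.items():
    print(f"test_size={size}: accuracy={acc * 100:.2f}%")
```

On the real data, `df['message']` and `df['label']` would take the place of the toy lists.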