
In [1]:
# SMS Spam Detection Project

# This project develops a system for detecting spam messages in SMS data. It uses Natural Language Processing (NLP) techniques and several machine learning algorithms to classify incoming messages as either spam or non-spam (ham).

# **Project Workflow & Highlights:**

# 1. **Data Collection and Preprocessing:** The project loads SMS data from separate train and test CSV files, combines them, and preprocesses the messages for model training. The steps are: replacing non-alphanumeric characters with spaces, tokenizing messages into individual words, dropping tokens shorter than three characters, stemming words to their root form (e.g., "running" to "run"), and removing common stop words (e.g., "the", "and") to reduce noise and focus on meaningful words.
# 2. **Feature Extraction:** TF-IDF (Term Frequency-Inverse Document Frequency) converts the text into numerical features that machine learning models can use. It weights a word more heavily when it is frequent within a message and less heavily when it appears in many messages across the dataset.
# 3. **Model Training and Evaluation:** Five classification algorithms are trained and evaluated to identify the most effective model for spam detection: K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Random Forest (RF), Decision Tree (DT), and Naive Bayes (NB, Bernoulli variant). Performance is assessed with accuracy and classification reports (precision, recall, and F1-score per class), which show how well each model separates spam from non-spam messages.

# **Key Findings and Results:**

# * **Best Performing Models:** Among the evaluated models, Random Forest (RF), Naive Bayes (NB), Support Vector Machine (SVM), and Decision Tree (DT) demonstrated superior performance in accurately classifying SMS messages.
# * **High Accuracy:** RF, NB, SVM, and DT achieved accuracy scores above 90% on the test data. This indicates that they classify most messages correctly; per-class errors such as false positives are covered by the classification reports.
# * **Effective Classification:** The classification reports further confirm the models' ability to distinguish between spam and non-spam, exhibiting good precision, recall, and F1-scores for both classes.
# * **K-Nearest Neighbors (KNN) was the least accurate model for SMS spam detection in this project.**

# **Project Goal:**

# To develop a reliable and accurate SMS spam detection system that can effectively filter unwanted messages, providing a seamless and secure communication experience for users.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [3]:
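# These paths assume Google Drive is already mounted at /content/drive in Colab; otherwise read_csv raises FileNotFoundError.
df2 = pd.read_csv('/content/drive/MyDrive/ML DATASETS/SMS_test.csv', encoding='ISO-8859-1')
df1 = pd.read_csv('/content/drive/MyDrive/ML DATASETS/SMS_train.csv', encoding='ISO-8859-1')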

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/ML DATASETS/SMS_test.csv'

In [None]:
# Combine the train and test frames into one; ignore_index avoids duplicate row labels
df = pd.concat([df1, df2], axis=0, ignore_index=True)
df.head()

In [None]:
# Drop the serial-number column, which carries no information for classification
df.drop('S. No.', axis=1, inplace=True)

In [None]:
df

In [None]:
df.isna().sum()

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
# Confirm that no duplicate rows remain after dropping them (expected: 0)
df.duplicated().sum()

In [None]:
#Plot Label Distribution

plt.figure(figsize=(10,5))
df['Label'].value_counts().plot(kind='bar')
plt.title('Label Distribution')
plt.xlabel('Label')
plt.ylabel('Count')
plt.show()

In [None]:
# Converts 'Spam' to 1 and 'Non-Spam' to 0 in the 'Label' column for numerical representation.
df['Label']=df['Label'].map({'Spam':1,'Non-Spam':0})
df
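
The label mapping above silently produces `NaN` for any value that is not exactly `'Spam'` or `'Non-Spam'`. The following is a minimal sketch of a check for that, using a hypothetical toy frame (the `toy` data and the lowercase `'spam'` entry are assumptions for illustration, not values from the dataset):

```python
import pandas as pd

# Hypothetical toy labels; 'spam' (lowercase) is not a key in the mapping.
toy = pd.DataFrame({'Label': ['Spam', 'Non-Spam', 'spam', 'Non-Spam']})

mapped = toy['Label'].map({'Spam': 1, 'Non-Spam': 0})

# Series.map returns NaN for values missing from the dict,
# so list the original labels that failed to map.
unmapped = toy.loc[mapped.isna(), 'Label'].unique().tolist()
print(f'Unmapped labels: {unmapped}')
```

If the list is non-empty, the labels may need normalizing (for example, fixing case or whitespace) before mapping.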

In [None]:
# Extracts the 'Message_body' column from the dataframe 'df' and stores it in the variable 'msg'.
msg = df['Message_body']
msg


In [None]:
# Download necessary NLTK resources for text preprocessing: stopwords and punkt tokenizer models.
import nltk
nltk.download('stopwords')  # Downloads the stopwords dataset
nltk.download('punkt')      # Downloads the punkt tokenizer models


In [None]:
# Replaces every non-alphanumeric character in 'msg' with a space using a regular expression.
msg = msg.str.replace('[^A-Za-z0-9]', ' ', regex=True)
msg

In [None]:
# Keep only tokens that are at least 3 characters long
nltk.download('punkt_tab')  # tokenizer data required by word_tokenize in newer NLTK releases
from nltk.tokenize import word_tokenize
msg = msg.apply(lambda x: ' '.join([i for i in word_tokenize(x) if len(i) >= 3]))
msg
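
To see what the two cleaning steps do to a single message, here is a small standard-library sketch. It assumes a plain whitespace split as a stand-in for `word_tokenize`, which can split some words differently; the `clean_and_filter` helper and the sample message are hypothetical:

```python
import re

def clean_and_filter(text, min_len=3):
    """Replace non-alphanumeric characters with spaces, then keep
    only tokens that have at least min_len characters."""
    cleaned = re.sub(r'[^A-Za-z0-9]', ' ', text)
    tokens = cleaned.split()  # whitespace split; a stand-in for word_tokenize
    return ' '.join(t for t in tokens if len(t) >= min_len)

sample = "WIN a FREE prize!!! Call 0800-123 now, it's easy."
result = clean_and_filter(sample)
print(result)  # WIN FREE prize Call 0800 123 now easy
```

Punctuation disappears, and short fragments such as "a", "it", and the "s" left over from "it's" are dropped by the length filter.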

In [None]:
# Perform text preprocessing using stemming
# - Tokenize each text message into words.
# - Convert each word to lowercase to ensure uniformity.
# - Apply Porter stemming to reduce words to their root forms (e.g., "running" -> "run").
# - Reconstruct the message by joining the stemmed words back into a string.

from nltk.stem import PorterStemmer
ps=PorterStemmer()
msg=msg.apply(lambda x: ' '.join([ps.stem(i.lower()) for i in word_tokenize(x)]))
msg

In [None]:
# Perform text preprocessing by removing stopwords
# - Tokenize each text message into words.
# - Remove common stopwords (e.g., 'the', 'and', 'is') from the list of words.
# - Reconstruct the message by joining the remaining words back into a string.
# This helps reduce noise and focus on the meaningful words in the text.
# Note: the messages were stemmed in the previous cell, so the stopwords are stemmed too;
# otherwise stemmed forms such as 'thi' (from 'this') would not match the list.

from nltk.corpus import stopwords
sw = {ps.stem(w) for w in stopwords.words('english')}  # set gives fast membership checks
msg = msg.apply(lambda x: ' '.join([i for i in word_tokenize(x) if i not in sw]))
msg

In [None]:
# Convert text data into numerical features using TF-IDF
# - TF-IDF helps identify important words by considering how often they appear in a document
#   and how common they are across all documents.

from sklearn.feature_extraction.text import TfidfVectorizer
tf=TfidfVectorizer()
x=tf.fit_transform(msg)
x
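
The weighting that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document) can be reproduced by hand. The sketch below does this on a hypothetical three-message corpus; `docs`, `df_counts`, and `tfidf_row` are illustrative names, not part of the notebook:

```python
import math
from collections import Counter

# Hypothetical toy corpus, already cleaned and lowercased.
docs = ["free prize call", "call home", "free free prize"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
vocab = sorted({t for doc in tokenized for t in doc})

# Document frequency: in how many documents each term appears.
df_counts = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency (the scikit-learn default).
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
print(vocab)
for doc, row in zip(docs, matrix):
    print(f'{doc!r}: {[round(v, 3) for v in row]}')
```

In "call home", the word "home" receives a higher weight than "call" because it appears in only one message, while "call" appears in two.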

In [None]:
y=df['Label'].values

In [None]:
x.dtype

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)
x_train
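
In the cells above, the TF-IDF vocabulary and IDF weights are fitted on all messages before the split, so statistics from the test messages influence the training features. One possible alternative, sketched here on hypothetical toy data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that TF-IDF is fitted on the training split only. The toy `texts`, the choice of `BernoulliNB`, and the `stratify` argument are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages: 1 = spam, 0 = non-spam.
texts = [
    "free prize call now", "win cash prize today", "claim free reward now",
    "urgent win free cash", "free entry win prize",
    "see you at lunch", "call me when home", "meeting moved to monday",
    "thanks for the help", "are you coming tonight",
]
labels = [1] * 5 + [0] * 5

# Split the raw text first; stratify keeps the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=1, stratify=labels
)

# The pipeline fits TF-IDF on the training messages only,
# then reuses that fitted vocabulary to transform the test messages.
model = make_pipeline(TfidfVectorizer(), BernoulliNB())
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(preds)
```

Words that occur only in the test messages are ignored at prediction time, which mirrors how the model would treat unseen messages after deployment.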

In [None]:
# Train and evaluate different classifiers
# - A list of classifiers (KNN, SVM, Random Forest, Decision Tree, Bernoulli Naive Bayes) is defined.
# - For each model, the following steps are performed:
#   1. Fit the model to the training data.
#   2. Make predictions on the test data.
#   3. Print the accuracy and classification report to evaluate the model's performance.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, classification_report
knn = KNeighborsClassifier()
svc = SVC()
rfc = RandomForestClassifier()
dtc = DecisionTreeClassifier()
bnb = BernoulliNB()  # binarizes the TF-IDF values, so it models word presence/absence
models = [knn, svc, rfc, dtc, bnb]
for model in models:
  print(f'Model is {model}')
  print("*************")
  model.fit(x_train, y_train)
  y_pred = model.predict(x_test)
  print(f'Accuracy is {accuracy_score(y_test, y_pred)}')
  print(f'Classification report:\n{classification_report(y_test, y_pred)}')
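
The accuracy and the per-class figures in the classification report all derive from the same four counts: true positives, false positives, false negatives, and true negatives. The standard-library sketch below computes them for the spam class on hypothetical predictions (the `binary_metrics` helper and the toy label lists are illustrative assumptions). It also shows why accuracy alone can mislead when spam is the minority class:

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        'accuracy': (tp + tn) / len(pairs),
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }

# Hypothetical imbalanced test set: 8 non-spam (0), 2 spam (1).
y_true = [0] * 8 + [1] * 2

# A model that never predicts spam still reaches 80% accuracy.
always_ham = binary_metrics(y_true, [0] * 10)

# A model that catches one spam message and raises one false alarm.
mixed = binary_metrics(y_true, [0] * 7 + [1] + [1, 0])

print(always_ham)
print(mixed)
```

Both hypothetical models have the same accuracy, but only the spam-class recall and precision reveal that the first one never detects spam.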