<a href="https://colab.research.google.com/github/akshayaxhacker-12/CODSOFT/blob/main/SPAM_SMS_DETECTION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Build an AI model that can classify SMS messages as spam or legitimate using the "spam.csv" dataset. Use techniques like TF-IDF or word embeddings with classifiers like Naive Bayes, Logistic Regression, or Support Vector Machines to identify spam messages.

## Load the dataset

### Subtask:
Load the `spam.csv` dataset into a pandas DataFrame.


**Reasoning**:
Import pandas and load the spam.csv dataset into a DataFrame.



In [2]:
import pandas as pd

df = pd.read_csv('/content/spam.csv', encoding='latin-1')
df.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |


## Explore the data

### Subtask:
Examine the dataset to understand its structure and content, including the columns, data types, and potential issues like missing values.


**Reasoning**:
Examine the dataset by displaying column names, data types, missing values, and the first few rows to understand its structure and content.



In [3]:
print("Column Names:")
print(df.columns)

print("\nData Types:")
print(df.dtypes)

print("\nMissing Values:")
print(df.isnull().sum())

print("\nFirst few rows:")
display(df.head())

Column Names:
Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

Data Types:
v1            object
v2            object
Unnamed: 2    object
Unnamed: 3    object
Unnamed: 4    object
dtype: object

Missing Values:
v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

First few rows:


| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |
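
Because the three `Unnamed` columns are almost entirely empty, one common cleanup is to keep only `v1` and `v2` and give them descriptive names. The sketch below shows that idea on a two-row stand-in DataFrame. The helper `tidy_columns` and the names `label` and `message` are hypothetical; the rest of this notebook keeps using `v1` and `v2`.

```python
import pandas as pd

def tidy_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep the label and text columns and rename them (hypothetical names)."""
    kept = frame[["v1", "v2"]].copy()
    return kept.rename(columns={"v1": "label", "v2": "message"})

# Tiny stand-in for the real DataFrame, built from rows shown above.
sample = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": [
        "Ok lar... Joking wif u oni...",
        "Free entry in 2 a wkly comp to win FA Cup fina...",
    ],
    "Unnamed: 2": [None, None],
    "Unnamed: 3": [None, None],
    "Unnamed: 4": [None, None],
})

clean = tidy_columns(sample)
print(clean.columns.tolist())
```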


## Preprocess the text data

### Subtask:
Clean and prepare the text data for model training. This may involve tasks such as removing special characters, punctuation, and stop words, converting text to lowercase, and stemming or lemmatizing words.


**Reasoning**:
This subtask cleans the text in the 'v2' column: convert it to lowercase, remove punctuation and stop words, and apply Porter stemming (no lemmatization is performed). The steps run sequentially on 'v2' and the result is stored in a new column, 'preprocessed_v2'. The stop word list comes from the `nltk` library, so its resources must be downloaded first.



In [4]:
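import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# The stop word list is required below. 'punkt' is downloaded as well,
# although preprocess_text tokenizes with str.split() and does not use it.
nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    """Lowercase, strip punctuation, drop stop words, and stem each word."""
    text = text.lower()
    # Delete every ASCII punctuation character (so "don't" becomes "dont").
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Split on whitespace, then filter stop words and apply Porter stemming.
    words = text.split()
    words = [word for word in words if word not in stop_words]
    words = [stemmer.stem(word) for word in words]
    return ' '.join(words)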

df['preprocessed_v2'] = df['v2'].apply(preprocess_text)

display(df[['v2', 'preprocessed_v2']].head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


| | v2 | preprocessed_v2 |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazi avail bugi n great world... |
| 1 | Ok lar... Joking wif u oni... | ok lar joke wif u oni |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | free entri 2 wkli comp win fa cup final tkt 21... |
| 3 | U dun say so early hor... U c already then say... | u dun say earli hor u c alreadi say |
| 4 | Nah I don't think he goes to usf, he lives aro... | nah dont think goe usf live around though |


## Split the dataset

### Subtask:
Divide the dataset into training and testing sets to evaluate the model's performance on unseen data.


**Reasoning**:
Split the data into training and testing sets using `train_test_split`.



In [5]:
from sklearn.model_selection import train_test_split

X = df['preprocessed_v2']
y = df['v1']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (4457,)
Shape of X_test: (1115,)
Shape of y_train: (4457,)
Shape of y_test: (1115,)
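
The split above is purely random. When one class is much rarer than the other, a stratified split is a common alternative that keeps the class proportions the same in both subsets. The sketch below illustrates the `stratify` argument of `train_test_split` on synthetic stand-in data (80 ham, 20 spam); it is not what the notebook ran, and using it would change the reported numbers.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: 80 ham and 20 spam messages.
texts = pd.Series([f"message {i}" for i in range(100)])
labels = pd.Series(["ham"] * 80 + ["spam"] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratification, both subsets keep the original 20% spam share.
train_spam_share = (y_tr == "spam").mean()
test_spam_share = (y_te == "spam").mean()
print(train_spam_share, test_spam_share)
```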


## Extract features

### Subtask:
Convert the text data into numerical features that can be used as input for the classification model. Techniques like TF-IDF or word embeddings can be used for this purpose.


**Reasoning**:
Import TfidfVectorizer and instantiate it, fit it on the training data only, use the fitted vectorizer to transform both the training and testing data, and print the shapes of the resulting matrices.



In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("Shape of X_train_tfidf:", X_train_tfidf.shape)
print("Shape of X_test_tfidf:", X_test_tfidf.shape)

Shape of X_train_tfidf: (4457, 7113)
Shape of X_test_tfidf: (1115, 7113)
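
Both matrices have 7113 columns because the vocabulary is learned from the training messages alone; `transform` reuses that vocabulary and ignores any word it has not seen. The toy example below sketches this behavior. The messages and the made-up token `unseenword` are illustrative assumptions, not data from `spam.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for preprocessed training and test messages.
train_texts = [
    "free entri win cup final",
    "ok lar joke wif",
    "win free prize call now",
    "nah dont think goe usf",
]
test_texts = ["free prize unseenword"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns the vocabulary
test_matrix = vectorizer.transform(test_texts)        # reuses it unchanged

# Same number of columns; the unseen token gets no column at all.
print(train_matrix.shape, test_matrix.shape)
print("unseenword" in vectorizer.vocabulary_)
```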


## Train a classification model

### Subtask:
Choose a suitable classification algorithm (e.g., Naive Bayes, Logistic Regression, or Support Vector Machines) and train it on the training data.


**Reasoning**:
Train a Logistic Regression model using the TF-IDF transformed training data.



In [7]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

## Evaluate the model

### Subtask:
Assess the performance of the trained model using appropriate evaluation metrics (e.g., accuracy, precision, recall, F1-score) on the testing data.


**Reasoning**:
Import the necessary evaluation metrics and calculate the accuracy, precision, recall, and F1-score of the trained model on the test set.



In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='spam')
recall = recall_score(y_test, y_pred, pos_label='spam')
f1 = f1_score(y_test, y_pred, pos_label='spam')

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Accuracy: 0.9507
Precision: 0.9612
Recall: 0.6600
F1-score: 0.7826
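
To see how the four scores relate, the sketch below recomputes them from confusion-matrix counts, with spam as the positive class. The counts are inferred from the rounded metrics and the 1115-message test set; they are not printed by the notebook, so treat them as an assumption.

```python
# Counts inferred from the reported metrics (an assumption, not notebook output).
# Spam is the positive class.
tp, fp, fn, tn = 99, 4, 51, 961

total = tp + fp + fn + tn
accuracy = (tp + tn) / total          # share of all messages classified correctly
precision = tp / (tp + fp)            # share of predicted spam that is spam
recall = tp / (tp + fn)               # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
```

Under these counts, the model lets 51 of 150 spam messages through while flagging only 4 legitimate messages.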


## Summary:

### Data Analysis Key Findings

*   The dataset contains 5572 SMS messages with labels ('ham' or 'spam') and the message text.
*   Three columns ('Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4') are almost entirely empty, with 5522, 5560, and 5566 missing values respectively out of 5572 rows, while the label ('v1') and message text ('v2') columns have no missing values.
*   Text preprocessing involved converting text to lowercase, removing punctuation and stop words, and applying stemming.
*   The dataset was split into training (80%, 4457 samples) and testing (20%, 1115 samples) sets.
*   TF-IDF vectorization, fitted on the training set, produced a sparse feature matrix with 7113 columns, one per vocabulary term.
*   A Logistic Regression model was trained on the TF-IDF features.
*   The trained model achieved an accuracy of 0.9507, a precision of 0.9612, a recall of 0.6600, and an F1-score of 0.7826 on the test set.

### Insights or Next Steps

*   The model has high precision (0.9612) but much lower recall (0.6600): when it predicts spam it is almost always right, but it misses roughly a third of the actual spam in the test set. Tuning the model or trying other classifiers, such as the Naive Bayes or Support Vector Machine options named in the task, might improve recall.
*   Inspect the non-null entries in the 'Unnamed' columns (50, 12, and 6 rows respectively, out of 5572) to determine whether they contain information that could improve the model.
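
One simple way to trade some precision for recall, without changing the model, is to lower the probability threshold at which a message is labeled spam. The sketch below illustrates this with `predict_proba` on synthetic two-feature data; the data, the helper `predict_with_threshold`, and the 0.3 threshold are all hypothetical, and the scores are computed on the training data purely for demonstration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic, imbalanced stand-in features: 200 ham and 40 spam points.
rng = np.random.default_rng(0)
X_ham = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_spam = rng.normal(loc=1.5, scale=1.0, size=(40, 2))
X_demo = np.vstack([X_ham, X_spam])
y_demo = np.array(["ham"] * 200 + ["spam"] * 40)

clf = LogisticRegression().fit(X_demo, y_demo)

# Column of predict_proba that holds the spam probability.
spam_col = list(clf.classes_).index("spam")
spam_proba = clf.predict_proba(X_demo)[:, spam_col]

def predict_with_threshold(proba, threshold):
    """Label a message spam when its spam probability reaches the threshold."""
    return np.where(proba >= threshold, "spam", "ham")

recall_default = recall_score(
    y_demo, predict_with_threshold(spam_proba, 0.5), pos_label="spam"
)
recall_lowered = recall_score(
    y_demo, predict_with_threshold(spam_proba, 0.3), pos_label="spam"
)
print(recall_default, recall_lowered)
```

Lowering the threshold can only keep or increase recall, usually at the cost of more false positives, so the value should be chosen on held-out data. Passing `class_weight='balanced'` to `LogisticRegression` is another option worth comparing.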
