
# Spam SMS Detection

## Objective:

The goal of this project is to build and evaluate a classifier that labels SMS messages as either legitimate ("ham") or unsolicited ("spam"). It combines natural language processing (NLP) preprocessing with a machine learning model, and measures the result with accuracy, precision, recall, and F1-score on a held-out test set.

## Dataset:

We use the "spam.csv" dataset, a publicly available collection of 5,572 SMS messages, each labeled as spam or ham.



## Load the dataset

### Subtask:
Load the "spam.csv" dataset into a pandas DataFrame.


**Reasoning**:
Import pandas, load the CSV file into a DataFrame (reading it with `latin-1` encoding), then display the first few rows and the DataFrame's summary info.



In [29]:
import pandas as pd

df = pd.read_csv('spam.csv', encoding='latin-1')
display(df.head())
df.info()  # info() prints its own summary and returns None, so wrapping it in display() only adds a stray "None"

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB
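
The notebook does not say why `encoding='latin-1'` is passed to `read_csv`. A likely reason (an assumption here) is that the file contains bytes that are not valid UTF-8. The sketch below uses a made-up message, not one from the dataset, to show how such a byte behaves under each codec.

```python
# Hypothetical message; "\u00a3" is the pound sign.
original = "Win \u00a31000 cash"

# In Latin-1 the pound sign is stored as the single byte 0xA3.
raw = original.encode("latin-1")

# A lone 0xA3 byte is not a valid UTF-8 sequence, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte value to a character, so decoding always succeeds.
decoded = raw.decode("latin-1")

print(utf8_ok)
print(decoded == original)
```

Because Latin-1 never raises a decoding error, it also cannot signal that a file was really written in some other encoding, so it is worth spot-checking a few messages that contain symbols.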

## Data preprocessing

### Subtask:
Clean the text data by converting it to lowercase, removing punctuation and stop words, and applying stemming.


**Reasoning**:
Drop unnecessary columns, rename relevant columns, convert labels to numerical representation, and preprocess the text data by converting to lowercase, removing punctuation and stop words, and applying stemming.



In [30]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Drop unnecessary columns
df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)

# Rename columns
df = df.rename(columns={'v1': 'label', 'v2': 'message'})

# Convert labels to numerical
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

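# Download the NLTK stop word list (the only NLTK resource this cell uses)
nltk.download('stopwords', quiet=True)

# Initialize stemmer and stop words
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Translation table that deletes every ASCII punctuation character (built once, reused per message)
punct_table = str.maketrans('', '', string.punctuation)

# Preprocessing function
def preprocess_text(text):
    """Lowercase, strip punctuation, drop stop words, and stem the remaining words."""
    text = text.lower()
    text = text.translate(punct_table)
    words = text.split()
    # Punctuation is removed first, so a contraction such as "don't" becomes "dont"
    # and is no longer matched by the stop word list.
    words = [word for word in words if word not in stop_words]
    words = [stemmer.stem(word) for word in words]
    return ' '.join(words)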

# Apply preprocessing
df['preprocessed_v2'] = df['message'].apply(preprocess_text)

display(df.head())

| | label | message | preprocessed_v2 |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazi avail bugi n great world... |
| 1 | 0 | Ok lar... Joking wif u oni... | ok lar joke wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entri 2 wkli comp win fa cup final tkt 21... |
| 3 | 0 | U dun say so early hor... U c already then say... | u dun say earli hor u c alreadi say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah dont think goe usf live around though |
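
The last row of the output keeps the token `dont`, even though "don't" is a common stop word. This follows from the order of operations: punctuation is stripped before stop words are filtered. The sketch below reproduces the effect with a tiny hand-written stop word set (hypothetical; NLTK's English list is much longer) and compares it with the reverse order. Both helper functions are illustrative and are not part of the notebook.

```python
import string

# Tiny illustrative stop word set (hypothetical, not NLTK's list).
STOP_WORDS = {"i", "he", "to", "don't"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_then_filter(text):
    """Order used in the notebook: remove punctuation, then drop stop words."""
    words = text.lower().translate(PUNCT_TABLE).split()
    return [w for w in words if w not in STOP_WORDS]

def filter_then_strip(text):
    """Alternative order: drop stop words first, then remove punctuation."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    cleaned = [w.translate(PUNCT_TABLE) for w in words]
    return [w for w in cleaned if w]  # discard tokens that were pure punctuation

sample = "Nah I don't think he goes to usf"
print(strip_then_filter(sample))  # "dont" survives the stop word filter
print(filter_then_strip(sample))  # "don't" is removed before punctuation is stripped
```

Neither order is perfect: filtering first misses stop words with punctuation attached (for example `to,`), so a more careful version would tokenize properly before filtering. Whether the leftover `dont` tokens matter for this classifier is untested here.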


## Feature extraction

### Subtask:
Convert the preprocessed text data into numerical features using techniques like TF-IDF or word embeddings.


**Reasoning**:
Import the necessary library for TF-IDF vectorization and apply it to the preprocessed text data.



In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000) # You can adjust max_features

# Fit and transform the preprocessed text data
X = tfidf_vectorizer.fit_transform(df['preprocessed_v2'])

# Print the shape of the resulting feature matrix
print("Shape of TF-IDF feature matrix:", X.shape)

Shape of TF-IDF feature matrix: (5572, 5000)
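
To make the numbers inside `X` less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function name `tfidf` and the three toy documents are made up for illustration. The sketch splits on whitespace only, whereas scikit-learn's default tokenizer also discards one-character tokens such as `u`.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2 row normalization."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

# Toy documents in the style of the preprocessed messages (hypothetical).
docs = ["free entri win", "ok lar joke", "free win win"]
vocab, rows = tfidf(docs)

for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the first document, `entri` receives a higher weight than `free` because it appears in fewer documents. The `max_features=5000` setting in the notebook then keeps only the 5,000 most frequent terms across the corpus, which is why the matrix has exactly 5,000 columns.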


## Split the data

### Subtask:
Split the dataset into training and testing sets.


**Reasoning**:
Hold out 20% of the messages as a test set, using a fixed random seed for reproducibility, and print the shapes of the resulting training and testing sets.



In [32]:
from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, df['label'], test_size=0.2, random_state=42)

# Print the shapes
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (4457, 5000)
Shape of X_test: (1115, 5000)
Shape of y_train: (4457,)
Shape of y_test: (1115,)
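
Two details of this step are worth noting. The TF-IDF vectorizer above was fitted on all 5,572 messages before the split, so its vocabulary and IDF weights were computed with the test messages included. The split is also not stratified, so the spam/ham ratio may differ slightly between the two sets. The sketch below shows a common alternative, which is not what the notebook does: split the raw text first with `stratify`, then fit the vectorizer on the training part only. The toy `texts` and `labels` are hypothetical stand-ins for `df['preprocessed_v2']` and `df['label']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the preprocessed messages and labels (hypothetical data).
texts = ["free entri win", "win cash prize", "free tkt final", "claim prize free",
         "ok lar joke", "dun say earli", "nah dont think", "see later",
         "call home", "lunch today", "go jurong point", "sorri late"]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the spam/ham ratio equal in both parts.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Fit the vocabulary and IDF weights on the training messages only.
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)  # reuses the training vocabulary

print(X_train.shape, X_test.shape)
print(sum(y_train), sum(y_test))  # spam counts in each part
```

How much this would change the reported metrics on the real dataset is not known from the notebook; the effect of fitting TF-IDF on the full corpus is often small, but keeping the test set unseen makes the evaluation easier to trust.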


## Train a classifier

### Subtask:
Choose a suitable classification algorithm (e.g., Naive Bayes, Logistic Regression, or Support Vector Machines) and train it on the training data.


**Reasoning**:
Import the MultinomialNB classifier and train the model on the training data.



In [33]:
from sklearn.naive_bayes import MultinomialNB

# Instantiate the model
model = MultinomialNB()

# Train the model
model.fit(X_train, y_train)
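
For readers who want to see what `MultinomialNB` estimates, the following is a minimal pure-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing, using `alpha=1.0` as in scikit-learn's default. It works on raw token counts for simplicity, whereas the notebook feeds TF-IDF weights into the same kind of model. The functions `train_nb` and `predict_nb` and the toy data are hypothetical and only illustrate the idea.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = {t for doc in docs for t in doc}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        term_counts[y].update(doc)
    model = {}
    for y in class_docs:
        total = sum(term_counts[y].values())
        log_prior = math.log(class_docs[y] / len(docs))
        denom = total + alpha * len(vocab)
        # Additive smoothing keeps a term unseen in one class from zeroing its score.
        log_like = {t: math.log((term_counts[y][t] + alpha) / denom) for t in vocab}
        model[y] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Return the class with the highest posterior log score."""
    scores = {}
    for y, (log_prior, log_like) in model.items():
        # Terms outside the training vocabulary are ignored.
        scores[y] = log_prior + sum(log_like[t] for t in doc if t in log_like)
    return max(scores, key=scores.get)

# Toy training data (hypothetical): 1 = spam, 0 = ham.
train_docs = [["free", "win", "prize"], ["win", "cash"], ["ok", "lar"], ["see", "later", "ok"]]
train_labels = [1, 1, 0, 0]
nb_model = train_nb(train_docs, train_labels)

print(predict_nb(nb_model, ["free", "cash"]))
print(predict_nb(nb_model, ["ok", "later"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for longer messages.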

## Evaluate the model

### Subtask:
Evaluate the trained model's performance on the testing data using appropriate metrics like accuracy, precision, recall, and F1-score.


**Reasoning**:
Calculate accuracy, precision, recall, and F1-score using the predicted labels and true labels, then print the results.



In [34]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Accuracy: 0.9740
Precision: 1.0000
Recall: 0.8067
F1-score: 0.8930
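
The four metrics can be traced back to confusion-matrix counts. The notebook does not print a confusion matrix, so the counts below are inferred: they are the only whole-number counts on a 1,115-message test set that reproduce all four reported values at four decimal places. Treat them as a worked example rather than notebook output.

```python
# Test-set counts inferred from the reported metrics (not printed by the notebook).
tp = 121  # spam correctly flagged
fp = 0    # ham wrongly flagged as spam
fn = 29   # spam that slipped through
tn = 965  # ham correctly passed

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
```

Read this way, every one of the 29 errors is a missed spam message, and none is a legitimate message sent to the spam folder.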


## Summary:

### Data Analysis Key Findings

*   The dataset "spam.csv" was loaded successfully with 5,572 rows. Columns 'v1' (label) and 'v2' (message text) hold the primary data; 'Unnamed: 2', 'Unnamed: 3', and 'Unnamed: 4' are almost entirely empty (50, 12, and 6 non-null values, respectively).
*   Unnecessary columns were removed, 'v1' was renamed to 'label', and 'v2' to 'message'. Labels were converted to numerical (ham: 0, spam: 1).
*   Text data underwent preprocessing, including lowercasing, punctuation removal, stop word removal, and stemming, resulting in a `preprocessed_v2` column.
*   TF-IDF vectorization was applied to the preprocessed text, creating a feature matrix `X` with a shape of (5572, 5000).
*   The dataset was split into training and testing sets: `X_train` (4457, 5000), `X_test` (1115, 5000), `y_train` (4457,), and `y_test` (1115,).
*   A Multinomial Naive Bayes model was trained on the training data.
*   The model achieved the following performance metrics on the test set: Accuracy: 0.9740, Precision: 1.0000, Recall: 0.8067, and F1-score: 0.8930.

### Insights or Next Steps

*   The precision of 1.0000 means that no legitimate message in the 1,115-message test set was classified as spam, which is crucial for user experience. The result comes from a single train/test split, so it is an estimate rather than a guarantee.
*   The recall of 0.8067 means that roughly one in five spam messages in the test set was missed. Improving recall is the natural next step, for example by experimenting with different `max_features` values in the TF-IDF vectorizer, exploring other feature extraction methods (such as word embeddings), or trying different classification algorithms.
