# SMS Spam Detection Project

## 1. Project Overview
This notebook builds an end-to-end SMS spam classifier using Natural Language Processing (NLP) techniques.
The objective is to classify SMS messages as either 'Ham' (legitimate) or 'Spam' (unsolicited).

**Workflow:**
1. **Data Ingestion**: Loading the dataset.
2. **Data Preprocessing**: Cleaning, removal of unused columns, and target encoding.
3. **Exploratory Data Analysis (EDA)**: Analyzing data distribution and structural features.
4. **Feature Engineering**: Creating new features like character, word, and sentence counts.
5. **Text Preprocessing**: Tokenization, stopword removal, and stemming.
6. **Visual Analysis**: Word clouds and frequency distributions.

## 2. Configuration & Setup
We begin by importing the necessary libraries for data manipulation and mathematical operations.

In [None]:
# Core Data Science Utilities
import pandas as pd  # Data manipulation
import numpy as np   # Numerical operations

# Encoder for converting text labels (ham/spam) into numbers (0/1)
from sklearn.preprocessing import LabelEncoder

## 3. Data Ingestion
Loading the SMS Spam Collection dataset from the CSV file.

In [None]:
# Read the CSV file
# 'ISO-8859-1' encoding is used to handle special characters often found in SMS data
df = pd.read_csv('spam.csv', encoding='ISO-8859-1')
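
To illustrate why the encoding argument matters, here is a minimal sketch using a made-up byte string (an assumption for illustration, not a row from the dataset). The byte `0xA3` is the pound sign in ISO-8859-1 but is not a valid character on its own in UTF-8, so decoding it as UTF-8 fails.

```python
# Hypothetical raw bytes, as they might appear in a Latin-1 encoded CSV
raw = b"Win \xa3100 cash now"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xA3 is a continuation byte in UTF-8 and cannot start a character
    utf8_ok = False

latin1_text = raw.decode("ISO-8859-1")
print(utf8_ok)      # False
print(latin1_text)  # Win £100 cash now
```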

## 4. Data Preprocessing
We clean the dataset by removing unnecessary columns generated during import and standardizing the column names for clarity.

In [None]:
# Drop 'Unnamed' columns which often contain parsing errors or empty data
df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)

# Renaming for standard interpretation:
# 'v1' -> 'target' (the label)
# 'v2' -> 'text' (the message content)
df.rename(columns={'v1': 'target', 'v2': 'text'}, inplace=True)

### 4.1 Label Encoding
Converting categorical targets into numerical format for the machine learning model.

In [None]:
# Initialize the LabelEncoder
encoder = LabelEncoder()

# Transform target labels: 'ham' becomes 0, 'spam' becomes 1
df['target'] = encoder.fit_transform(df['target'])
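
The mapping can be checked rather than assumed. The sketch below uses a small toy list in place of the `target` column (the labels are illustrative). `LabelEncoder` sorts the class names, so `'ham'` comes before `'spam'` and receives code 0.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'target' column
labels = ["spam", "ham", "ham", "spam", "ham"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)           # {'ham': 0, 'spam': 1}
print(encoded.tolist())  # [1, 0, 0, 1, 0]
```

Inspecting `encoder.classes_` on the real data is a quick way to confirm which integer stands for which class.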

## 5. Exploratory Data Analysis (EDA)

### 5.1 Data Quality Check
Ensuring data integrity by checking for null values.

In [None]:
# Check for missing values in each column
df.isnull().sum()

### 5.2 Duplicate Handling
Identifying and removing duplicate text messages to prevent bias in the model.

In [None]:
# Count the number of duplicate rows
df.duplicated().sum()

In [None]:
# Remove duplicate rows, keeping the first occurrence
df.drop_duplicates(keep='first', inplace=True)

In [None]:
# Verify the shape of the dataset after duplicate removal
df.shape
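
As a small illustration of what these calls do, the sketch below runs them on a toy frame (the rows are made up). Note that `duplicated()` compares every column by default, so a row counts as a duplicate only when both the label and the text match an earlier row.

```python
import pandas as pd

# Toy frame with one exact duplicate row (rows 0 and 2)
toy = pd.DataFrame({
    "target": [0, 1, 0, 0],
    "text": ["see you soon", "free prize", "see you soon", "ok"],
})

n_dupes = int(toy.duplicated().sum())  # rows that repeat an earlier row
deduped = toy.drop_duplicates(keep="first")

print(n_dupes)        # 1
print(deduped.shape)  # (3, 2)
```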

### 5.3 Target Distribution Analysis
Checking the balance between Spam (1) and Ham (0) messages.

In [None]:
# Get value counts for each class
df['target'].value_counts()
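
Raw counts are easier to compare as proportions. A minimal sketch on toy labels (the values are illustrative, not the dataset's actual distribution):

```python
import pandas as pd

# Toy encoded labels: 0 = ham, 1 = spam
target = pd.Series([0, 0, 0, 0, 1], name="target")

# normalize=True returns fractions instead of raw counts
shares = target.value_counts(normalize=True)
print(shares.loc[0])  # 0.8
print(shares.loc[1])  # 0.2
```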

#### Visualizing Class Imbalance
A pie chart to visualize the proportion of spam vs legitimate messages.

In [None]:
import matplotlib.pyplot as plt

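# Generate a pie chart. value_counts() sorts by frequency, so the labels
# below assume ham (0) is the majority class and therefore comes first.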
plt.pie(df['target'].value_counts(), labels=['ham', 'spam'], autopct='%0.2f')
plt.show()

### 5.4 NLP Library Setup
Installing and importing NLTK (Natural Language Toolkit) for text processing.

In [None]:
# Install NLTK (if not already installed)
!pip install nltk

In [None]:
import nltk

In [None]:
# Download required NLTK data packages
nltk.download('punkt')
nltk.download('punkt_tab')

## 6. Feature Engineering
We extract new features from the raw text to help the model distinguish between spam and ham.

### 6.1 Character Count
Calculating the total length (number of characters) of each message.

In [None]:
# Create 'num_characters' column
df['num_characters'] = df['text'].apply(len)

In [None]:
# Preview the dataframe with the new feature
df.head()

### 6.2 Word Count
Calculating the number of words in each message.

In [None]:
# Calculate word count using NLTK word_tokenize
df['num_words'] = df['text'].apply(lambda x: len(nltk.word_tokenize(x)))

In [None]:
df
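
`nltk.word_tokenize` treats punctuation marks as separate tokens, so `num_words` is usually larger than a plain whitespace word count. The sketch below shows the difference with a regular expression that only approximates this behavior (it is a stand-in for illustration, not NLTK's actual tokenization rules).

```python
import re

text = "Free entry! Call now."

# Whitespace split: punctuation stays attached to the words
whitespace_tokens = text.split()

# Rough punctuation-aware tokenizer: runs of word characters,
# or single non-space punctuation marks
punct_aware_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(whitespace_tokens))   # 4
print(len(punct_aware_tokens))  # 6
```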

### 6.3 Sentence Count
Calculating the number of sentences in each message.

In [None]:
# Calculate sentence count using NLTK sent_tokenize
df['num_sentence'] = df['text'].apply(lambda x: len(nltk.sent_tokenize(x)))

In [None]:
df

In [None]:
# Import seaborn for statistical data visualization
import seaborn as sns

### 6.4 Visualizing Feature Distributions
Comparing the character count distribution for Spam vs Ham messages.

In [None]:
# Plot separate histograms for Ham (target=0) and Spam (target=1)
sns.histplot(df[df['target'] == 0]['num_characters'])
sns.histplot(df[df['target'] == 1]['num_characters'])
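
A per-class summary can complement the histograms with concrete numbers. The sketch below uses a toy frame with made-up lengths (these are not values from the dataset).

```python
import pandas as pd

# Toy data: illustrative character counts for ham (0) and spam (1)
toy = pd.DataFrame({
    "target": [0, 0, 0, 1, 1],
    "num_characters": [20, 35, 50, 140, 160],
})

# Mean message length per class
mean_len = toy.groupby("target")["num_characters"].mean()
print(mean_len.loc[0])  # 35.0
print(mean_len.loc[1])  # 150.0
```

On the real dataframe, the same `groupby` call on `df` would give the per-class means for `num_characters`.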

### 6.5 Pairwise Relationships
Visualizing relationships between all numerical features (characters, words, sentences) to identify separation patterns between the two classes.

In [None]:
# Create a pairplot colored by target class
sns.pairplot(df, hue='target')

### 6.6 Correlation Matrix
Examining the correlation between the numerical features.

In [None]:
# Display heatmap of correlations
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True)

## 7. Text Preprocessing Pipeline
Preparing the text data for modeling by applying a transformation pipeline:
1.  **Lowercasing**: Converting to lowercase.
2.  **Tokenization**: Splitting text into words.
3.  **Special Character Removal**: Removing non-alphanumeric characters.
4.  **Stopword Removal**: Removing common words (is, the, of) that add little semantic meaning.
5.  **Stemming**: Reducing words to their root form (e.g., 'dancing' -> 'danc').

In [None]:
# Import the stopwords list and the PorterStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [None]:
import string

In [None]:
def transform_text(text):
    # 1. Convert to lowercase
    text = text.lower()
    
    # 2. Tokenize
    text = nltk.word_tokenize(text)
    
    # 3. Keep only alphanumeric tokens
    y = []
    for i in text:
        if i.isalnum():
            y.append(i)

    text = y[:]
    y.clear()

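    # 4. Remove stopwords and punctuation
    # Build the stopword set once; reloading the list for every token is slow
    stop_words = set(stopwords.words('english'))
    for i in text:
        if i not in stop_words and i not in string.punctuation:
            y.append(i)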

    text = y[:]
    y.clear()

    # 5. Apply Stemming
    for i in text:
        y.append(ps.stem(i))
    
    return " ".join(y)
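
To make the effect of the first four steps concrete without any NLTK downloads, here is a dependency-free sketch. The name `simple_transform`, the regex tokenizer, and the tiny stopword set are all assumptions for illustration. Stemming is omitted, so the output differs from what `transform_text` produces.

```python
import re

# Tiny illustrative stopword set (assumption); NLTK's English list is much longer
STOP_WORDS = {"is", "the", "of", "a", "to", "you", "have"}

def simple_transform(text):
    """Lowercase, tokenize, keep alphanumeric tokens, drop stopwords."""
    # Steps 1 and 2: lowercase, then split into words and punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 3: keep only alphanumeric tokens
    tokens = [t for t in tokens if t.isalnum()]
    # Step 4: drop stopwords
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

result = simple_transform("You have WON a prize of 1000!")
print(result)  # won prize 1000
```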

### 7.1 Applying Transformation
Applying the preprocessing function to the entire dataset.

In [None]:
# Create the 'transformed_text' column with the preprocessed text.
# An earlier version of this notebook named the column 'transform_text';
# drop it if it is still present to avoid keeping a stale copy.
if 'transform_text' in df.columns:
    df.drop(columns=['transform_text'], inplace=True)
    
df["transformed_text"] = df['text'].apply(transform_text)

In [None]:
# Preview the final processed dataframe
df.head()

## 8. Visual Analysis
Using WordClouds to visualize the most common words in both Spam and Ham messages.

In [None]:
from wordcloud import WordCloud

# Initialize WordCloud object
wc = WordCloud(width=500, height=500, min_font_size=10, background_color='white')

### 8.1 Spam Word Cloud
Visualizing the most frequent words in Spam messages.

In [None]:
# Generate WordCloud for Spam (target=1)
spam_wc = wc.generate(df[df['target'] == 1]['transformed_text'].str.cat(sep=" "))

In [None]:
# Display the Spam WordCloud
plt.figure(figsize=(15,6))
plt.imshow(spam_wc)
plt.show()

### 8.2 Ham Word Cloud
Visualizing the most frequent words in Ham (legitimate) messages.

In [None]:
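# Generate WordCloud for Ham (target=0)
# generate() returns the WordCloud object itself, so reusing `wc` here would
# overwrite spam_wc; a separate instance keeps both clouds available
ham_wc = WordCloud(width=500, height=500, min_font_size=10, background_color='white').generate(
    df[df['target'] == 0]['transformed_text'].str.cat(sep=" "))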

In [None]:
# Display the Ham WordCloud
plt.figure(figsize=(15,6))
plt.imshow(ham_wc)
plt.show()

## 9. Top Frequent Words Analysis
Identifying the top 30 most recurring words in each category.

### 9.1 Top 30 Spam Words

In [None]:
# Collect all words from Spam messages
spam_corpus = []
for messages in df[df['target'] == 1]['transformed_text'].tolist():
    for word in messages.split():
        spam_corpus.append(word)

In [None]:
# Display the total number of words in the spam corpus
len(spam_corpus)

In [None]:
from collections import Counter

# Create a bar plot of the top 30 most common words in Spam
sns.barplot(x=pd.DataFrame(Counter(spam_corpus).most_common(30))[0], 
            y=pd.DataFrame(Counter(spam_corpus).most_common(30))[1])
plt.xticks(rotation='vertical')
plt.show()
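
The frequency table only needs to be computed once and can then feed both plot axes. A minimal sketch on a toy corpus (the words are illustrative stand-ins for `spam_corpus`):

```python
from collections import Counter

# Toy corpus standing in for spam_corpus
corpus = ["free", "call", "free", "prize", "call", "free"]

# Count once and reuse the result for both axes
top = Counter(corpus).most_common(2)
words = [word for word, _ in top]
counts = [count for _, count in top]

print(words)   # ['free', 'call']
print(counts)  # [3, 2]
```

The `words` and `counts` lists could then be passed as `x` and `y` to `sns.barplot`.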

### 9.2 Top 30 Ham Words

In [None]:
# Collect all words from Ham messages
ham_corpus = []
for messages in df[df['target'] == 0]['transformed_text'].tolist():
    for word in messages.split():
        ham_corpus.append(word)

In [None]:
# Create a bar plot of the top 30 most common words in Ham
sns.barplot(x=pd.DataFrame(Counter(ham_corpus).most_common(30))[0], 
            y=pd.DataFrame(Counter(ham_corpus).most_common(30))[1])
plt.xticks(rotation='vertical')
plt.show()