# SMS Spam Detection

This notebook walks through the data preparation steps for a spam classifier built on the SMS Spam Collection dataset.

## Project Steps:
1. **Data Loading**: Importing libraries and reading the dataset.
2. **Data Cleaning**: Dropping unused columns, renaming columns for clarity, checking for missing values, and removing duplicate rows.
3. **Feature Encoding**: Converting categorical labels into numerical format.

## 1. Import Libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

## 2. Load Data

In [3]:
# Load the dataset with the appropriate encoding
df = pd.read_csv('spam.csv', encoding='ISO-8859-1')
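
The `encoding` argument matters because pandas decodes files as UTF-8 by default. The sketch below illustrates the failure mode on a hypothetical two-row sample (not the real `spam.csv`), assuming the file contains a byte such as `0xA3` that is valid ISO-8859-1 but not valid UTF-8.

```python
import io
import pandas as pd

# Hypothetical sample standing in for spam.csv. The pound sign is stored as
# the single byte 0xA3, which is valid ISO-8859-1 but not valid UTF-8.
raw = "v1,v2\nham,See you at 5\nspam,Win £1000 now\n".encode("ISO-8859-1")

try:
    pd.read_csv(io.BytesIO(raw))  # pandas assumes UTF-8 unless told otherwise
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading with the matching encoding recovers the text intact.
sample = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(utf8_ok)
print(sample.loc[1, "v2"])
```

If a file's encoding is unknown, trying UTF-8 first and falling back to a single-byte encoding is a common approach, though a single-byte decoder will accept any byte sequence and can silently produce wrong characters.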

## 3. Data Cleaning

In [4]:
# Drop columns that contain null values and are not needed for analysis
df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)

# Rename columns to meaningful names: 'v1' -> 'target', 'v2' -> 'text'
df.rename(columns={'v1': 'target', 'v2': 'text'}, inplace=True)
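
The same cleaning can be written as a chained expression that returns a new frame instead of modifying `df` in place, which keeps the raw data available for inspection. This is a minimal sketch on a hypothetical miniature of the raw file; the row values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw file: two real columns plus empty extras.
raw = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["See you at 5", "Win a prize now"],
    "Unnamed: 2": [np.nan, np.nan],
    "Unnamed: 3": [np.nan, np.nan],
    "Unnamed: 4": [np.nan, np.nan],
})

# Fraction of missing values per column, checked before deciding what to drop.
null_share = raw.isnull().mean()

# drop() and rename() both return new frames, so `raw` is left untouched.
cleaned = (
    raw.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
       .rename(columns={"v1": "target", "v2": "text"})
)
print(null_share)
print(list(cleaned.columns))
```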

## 4. Feature Encoding

In [5]:
# Initialize the LabelEncoder
encoder = LabelEncoder()

# Encode the 'target' column (ham/spam) into numerical values (0/1)
df['target'] = encoder.fit_transform(df['target'])
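
To confirm which integer each label received, the fitted encoder's `classes_` attribute can be inspected. `LabelEncoder` sorts the classes, so with only `ham` and `spam` present, `ham` maps to 0 and `spam` to 1. A small sketch on hypothetical sample labels:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["ham", "spam", "ham", "ham"]  # hypothetical sample labels

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so each label's position in it is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels.
restored = [str(label) for label in encoder.inverse_transform(codes)]
print(restored)
```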

In [6]:
# Check for missing values in each column

df.isnull().sum()

target    0
text      0
dtype: int64

In [9]:
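# Check for duplicated rows
# Note: this cell was last run after the drop_duplicates cell (execution count 9 vs. 8), so it reports 0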

df.duplicated().sum()

np.int64(0)

In [8]:
# Remove duplicate rows, keeping the first occurrence of each

df.drop_duplicates(keep='first', inplace=True)

In [11]:
df.shape

(5169, 2)
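
Cell order matters for the duplicate check: counting has to happen before dropping to be informative. The sketch below shows the sequence on a hypothetical four-row frame, where a row counts as a duplicate only if every column matches an earlier row.

```python
import pandas as pd

# Hypothetical toy frame: rows 2 and 3 repeat rows 0 and 1 exactly.
toy = pd.DataFrame({
    "target": [0, 1, 0, 1],
    "text": ["ok", "win cash", "ok", "win cash"],
})

# Count first: duplicated() flags rows identical to an earlier row.
n_dupes = int(toy.duplicated().sum())

# Then drop, keeping the first occurrence, and renumber the index.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_dupes)
print(deduped.shape)
```

Resetting the index is optional; it is included here, as an assumption about later use, because positional lookups are easier when the index has no gaps.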