---

# Email Spam Filtering using Text Classification

## Problem Statement

Implement email spam filtering using a text classification algorithm and an appropriate dataset.

## Background

- **Dataset**: The code uses the "Spambase" dataset from the UCI Machine Learning Repository (4,601 emails, 57 numeric features).

## Problem Description

The objective is to build a system for email spam classification, and the code performs the following steps:

1. **Data Retrieval and Preprocessing**:
   - The "Spambase" dataset is fetched from the UCI Machine Learning Repository.
   - Features (X) and labels (y) are extracted from the dataset.
   - The shapes of X and y are displayed to understand the data dimensions.

2. **Data Splitting**:
   - The dataset is divided into training and testing sets using the `train_test_split` function from scikit-learn. This allows the model to be trained on one subset and tested on another.

3. **Text Classification**:
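   - A Multinomial Naive Bayes classifier is used for classification. Spambase ships precomputed numeric features (word and character frequencies, capital-run lengths), so the classifier is fit on those directly rather than on raw email text.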
   - The classifier is trained using the training data.
   - Predictions are made on the test data.

4. **Evaluation**:
   - The accuracy of the spam classification model is calculated.
   - A classification report, which includes precision, recall, F1-score, and support for both spam and non-spam categories, is generated.
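
The evaluation step can be made concrete with a small from-scratch sketch. It is not how scikit-learn computes its report internally; it only applies the usual definitions of accuracy, precision, recall, F1, and support to a made-up pair of label lists. The function name `binary_metrics` and the toy labels are illustrative assumptions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute basic binary metrics, treating `positive` as the spam class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "support": tp + fn,  # number of true positive-class samples
    }


# Hypothetical labels: 1 = spam, 0 = non-spam
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

Calling the function again with `positive=0` gives the corresponding row for the non-spam class.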

## Input

The input for this problem is the "Spambase" dataset, which consists of features extracted from emails and labels indicating whether the emails are spam or not.

## Output

The output is a trained email spam classifier, together with its accuracy and a classification report measured on the held-out test set. In the run shown below, the model reaches an accuracy of 0.79.


## Potential Improvements

In practice, improving the spam filtering system may involve advanced preprocessing techniques, feature engineering, hyperparameter tuning, and the exploration of various machine learning algorithms to achieve higher accuracy and robustness in classifying email messages.
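
As one illustration of the preprocessing idea, the sketch below applies a `log1p` transform to three values copied from the third row of the notebook's `X.head()` output. The choice of `log1p` is an assumption made for illustration, and whether it improves this particular model is not tested here. It only shows how the transform narrows the gap between a frequency column and a run-length column.

```python
import math

# Values copied from the third row (index 2) of the X.head() output
row = {
    "word_freq_all": 0.71,
    "char_freq_!": 0.276,
    "capital_run_length_total": 2259,
}

# log1p(x) = log(1 + x): maps 0 to 0 and keeps non-negative inputs non-negative,
# so the result is still valid input for a Multinomial Naive Bayes model
log_scaled = {name: math.log1p(value) for name, value in row.items()}

# Compare how far apart the largest and a typical feature are, before and after
raw_ratio = row["capital_run_length_total"] / row["word_freq_all"]
scaled_ratio = log_scaled["capital_run_length_total"] / log_scaled["word_freq_all"]

print(f"raw ratio:    {raw_ratio:.1f}")
print(f"scaled ratio: {scaled_ratio:.1f}")
```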

---

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

In [2]:
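from ucimlrepo import fetch_ucirepo

# Fetch the Spambase dataset (UCI id 94)
spambase = fetch_ucirepo(id=94)

# Features and targets are returned as pandas DataFrames
X = spambase.data.features
y = spambase.data.targets

# Dataset metadata, then per-variable information
print(spambase.metadata)
print('\n')
print(spambase.variables)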

{'uci_id': 94, 'name': 'Spambase', 'repository_url': 'https://archive.ics.uci.edu/dataset/94/spambase', 'data_url': 'https://archive.ics.uci.edu/static/public/94/data.csv', 'abstract': 'Classifying Email as Spam or Non-Spam', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 4601, 'num_features': 57, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['Class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1999, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C53G6X', 'creators': ['Mark Hopkins', 'Erik Reeber', 'George Forman', 'Jaap Suermondt'], 'intro_paper': None, 'additional_info': {'summary': 'The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...\n\nThe classification task for this dataset is to determine whether a given email is spam or not.\n\t\nOur collecti

In [3]:
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

Shape of X: (4601, 57)
Shape of y: (4601, 1)


In [4]:
X.head()

| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | word_freq_conference | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.0 | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 |


In [5]:
y.head()

| | Class |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |


In [6]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

In [8]:
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (3680, 57)
Shape of y_train: (3680,)
Shape of X_test: (921, 57)
Shape of y_test: (921,)
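
The split sizes above follow from simple arithmetic on the 4,601 rows. The sketch below reproduces them under the assumption that a fractional `test_size` is rounded up to a whole number of test rows and the training set takes the remainder, which agrees with the shapes printed by the notebook.

```python
import math

n_samples = 4601   # rows reported by X.shape
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up, the training set gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 3680 921
```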


In [9]:
# Train a Multinomial Naive Bayes classifier on the numeric Spambase features
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
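
To show roughly what a Multinomial Naive Bayes model estimates, here is a minimal pure-Python sketch. It is a simplified illustration with Laplace smoothing (`alpha=1.0`), not scikit-learn's implementation. The helper names `fit_multinomial_nb` and `predict_multinomial_nb` and the two-feature toy data are hypothetical.

```python
import math


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log class priors and smoothed log feature probabilities."""
    n_features = len(X[0])
    log_prior, log_theta = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Per-feature totals within the class, plus the smoothing term
        totals = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        denom = sum(totals)
        log_theta[c] = [math.log(t / denom) for t in totals]
    return log_prior, log_theta


def predict_multinomial_nb(model, X):
    """Pick the class with the highest log prior + weighted log likelihood."""
    log_prior, log_theta = model
    preds = []
    for x in X:
        scores = {
            c: log_prior[c] + sum(v * lt for v, lt in zip(x, log_theta[c]))
            for c in log_prior
        }
        preds.append(max(scores, key=scores.get))
    return preds


# Hypothetical toy data: columns are counts of a "spammy" and a "hammy" word
X_toy = [[0, 3], [1, 4], [4, 0], [3, 1]]
y_toy = [0, 0, 1, 1]  # 1 = spam, 0 = non-spam

model = fit_multinomial_nb(X_toy, y_toy)
predictions = predict_multinomial_nb(model, [[5, 1], [0, 2]])
print(predictions)  # [1, 0]
```

The model scores a sample by how its feature mass is distributed across columns, so it requires non-negative inputs. The Spambase features satisfy this, although they are frequencies and run lengths rather than raw word counts.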

In [10]:
pred = classifier.predict(X_test)

accuracy = accuracy_score(y_test, pred)
report = classification_report(y_test, pred, target_names=["Non-Spam", "Spam"])

In [11]:
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", report)

Accuracy: 0.79

Classification Report:
               precision    recall  f1-score   support

    Non-Spam       0.80      0.84      0.82       531
        Spam       0.76      0.72      0.74       390

    accuracy                           0.79       921
   macro avg       0.78      0.78      0.78       921
weighted avg       0.79      0.79      0.79       921
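
As a rough consistency check, the figures in the report can be related to each other with a few lines of arithmetic. The inputs below are the rounded values printed above, so the results are approximate. The majority-class baseline is added here for context and is not part of the original notebook.

```python
# Rounded per-class values copied from the classification report
support = {"Non-Spam": 531, "Spam": 390}
precision = {"Non-Spam": 0.80, "Spam": 0.76}
recall = {"Non-Spam": 0.84, "Spam": 0.72}

total = sum(support.values())

# Accuracy is the support-weighted average of the per-class recalls
accuracy_estimate = sum(recall[c] * support[c] for c in support) / total

# F1 is the harmonic mean of precision and recall
f1 = {c: 2 * precision[c] * recall[c] / (precision[c] + recall[c]) for c in support}

# Accuracy of always predicting the larger class (Non-Spam)
majority_baseline = max(support.values()) / total

print(f"accuracy estimate: {accuracy_estimate:.2f}")   # 0.79
print(f"F1 Non-Spam: {f1['Non-Spam']:.2f}, F1 Spam: {f1['Spam']:.2f}")  # 0.82, 0.74
print(f"majority baseline: {majority_baseline:.2f}")   # 0.58
```

Read this way, the 0.79 accuracy is about 21 percentage points above what a classifier that always answers "Non-Spam" would score on this test split.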

