# Step 1: Data Loading

In [2]:
import pandas as pd

# Load the Excel workbook into a DataFrame
data = pd.read_excel("/content/drive/MyDrive/data sets/SPAM EMAIL DETECTION WITH MACHINE LEARNING/Spam Email Detection.xlsx")

# Display the first few rows of the DataFrame
print(data.head())


     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  


In these first five rows, only columns "v1" and "v2" contain meaningful data, while the other columns ("Unnamed: 2", "Unnamed: 3", and "Unnamed: 4") appear to be empty.
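Because `head()` shows only five rows, it can be worth confirming that a column is empty throughout before dropping it. The sketch below runs on a small hypothetical frame (the `sample` data and its stray value are invented for illustration), not on the real workbook:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the loaded frame; the real data comes from the workbook.
sample = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["ok lar", "free entry", "u dun say"],
    "Unnamed: 2": [np.nan, np.nan, np.nan],        # empty in every row
    "Unnamed: 3": [np.nan, "stray text", np.nan],  # looks empty in row 0 only
})

# Count non-null cells per column across ALL rows, not just the head.
non_null = sample.notna().sum()
print(non_null)

# Columns that are safe to drop without losing any data.
empty_cols = non_null[non_null == 0].index.tolist()
print(empty_cols)
```

Running `data.notna().sum()` on the real DataFrame gives the same kind of per-column count.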



# Step 2: Data Preprocessing

Step 2.1: Data Cleaning

We'll start by dropping the unnecessary columns ("Unnamed: 2", "Unnamed: 3", and "Unnamed: 4") and renaming the remaining columns to something more descriptive.



In [3]:
# Drop unnecessary columns
data.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], inplace=True)

# Rename columns
data.columns = ["label", "email"]

# Display the first few rows of the DataFrame after preprocessing
print(data.head())


  label                                              email
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


The DataFrame now contains two columns: "label" and "email", with meaningful data.

Next, we'll proceed with text preprocessing to clean and prepare the text data for further analysis and modeling.

# Step 3: Text Preprocessing

Step 3.1: Text Cleaning

We'll clean the text data by removing any punctuation, converting text to lowercase, and tokenizing the text into individual words.

In [4]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Define function for text cleaning
def clean_text(text):
    if isinstance(text, str):
        # Tokenize the text
        tokens = word_tokenize(text)
        # Convert to lowercase
        tokens = [word.lower() for word in tokens]
        # Remove punctuation
        table = str.maketrans('', '', string.punctuation)
        stripped = [word.translate(table) for word in tokens]
        # Remove non-alphabetic tokens
        words = [word for word in stripped if word.isalpha()]
        # Remove stop words
        stop_words = set(stopwords.words('english'))
        words = [word for word in words if word not in stop_words]
        # Join the words back into a string
        cleaned_text = ' '.join(words)
        return cleaned_text
    else:
        return ''


# Apply text cleaning to the 'email' column
data['cleaned_email'] = data['email'].apply(clean_text)

# Display the first few rows of the DataFrame after text preprocessing
print(data.head())


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


  label                                              email  \
0   ham  Go until jurong point, crazy.. Available only ...   
1   ham                      Ok lar... Joking wif u oni...   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...   
3   ham  U dun say so early hor... U c already then say...   
4   ham  Nah I don't think he goes to usf, he lives aro...   

                                       cleaned_email  
0  go jurong point crazy available bugis n great ...  
1                            ok lar joking wif u oni  
2  free entry wkly comp win fa cup final tkts may...  
3                u dun say early hor u c already say  
4          nah nt think goes usf lives around though  


The text cleaning worked largely as intended, and the DataFrame now contains a new column named "cleaned_email" with the preprocessed text. One visible artifact is the stray token "nt" in row 4, a leftover of the contraction "don't".
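The "nt" in row 4 of the cleaned output comes from the order of operations: the tokenizer splits "don't" into "do" and "n't", and stripping punctuation then turns "n't" into the alphabetic token "nt". The sketch below reproduces this with hard-coded tokens, so it runs without any NLTK downloads, and shows one hypothetical variant that drops apostrophe tokens first:

```python
import string

# Tokens hard-coded the way a Treebank-style tokenizer splits "don't"
# ("do" + "n't"); this is an assumption made so the sketch is self-contained.
tokens = ["nah", "i", "do", "n't", "think", "he", "goes", "to", "usf"]

table = str.maketrans("", "", string.punctuation)

# Order used in clean_text: strip punctuation, then keep alphabetic tokens.
stripped = [t.translate(table) for t in tokens]
original = [t for t in stripped if t.isalpha()]
print(original)  # "nt" survives because it is purely alphabetic

# Hypothetical variant: discard tokens containing an apostrophe before stripping.
no_clitics = [t for t in tokens if "'" not in t]
variant = [t.translate(table) for t in no_clitics if t.translate(table).isalpha()]
print(variant)
```

Dropping the clitic also discards the negation, so whether this variant helps the classifier is an open question that would need testing.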

Next, we'll proceed with feature engineering by converting the cleaned text into numerical features using the TF-IDF vectorizer.

# Step 4: Feature Engineering


We'll use the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer to convert the cleaned text into numerical features.
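To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three hypothetical cleaned messages. It follows the smoothed IDF formula and L2 row normalization that scikit-learn documents for the `TfidfVectorizer` defaults, but it is an illustration of the idea rather than the library's implementation:

```python
import math
from collections import Counter

# Hypothetical cleaned messages, invented for illustration.
docs = ["free entry win", "ok lar", "free win win"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many messages each word appears.
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed IDF: rarer words receive a larger weight.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized message."""
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for word, weight in zip(vocab, matrix[0]):
    print(f"{word:>6}: {weight:.3f}")
```

In the first message, "entry" appears in only one document and so outweighs "free" and "win", which each appear in two.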

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer(max_features=3000)  # You can adjust max_features as needed

# Fit and transform the cleaned email text
X = vectorizer.fit_transform(data['cleaned_email'])

# Convert to DataFrame for better inspection
X_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Display the feature matrix
print(X_df.head())


   aah  aathi  abi  ability  abiola  abj  able  absolutly  abt  abta  ...  \
0  0.0    0.0  0.0      0.0     0.0  0.0   0.0        0.0  0.0   0.0  ...   
1  0.0    0.0  0.0      0.0     0.0  0.0   0.0        0.0  0.0   0.0  ...   
2  0.0    0.0  0.0      0.0     0.0  0.0   0.0        0.0  0.0   0.0  ...   
3  0.0    0.0  0.0      0.0     0.0  0.0   0.0        0.0  0.0   0.0  ...   
4  0.0    0.0  0.0      0.0     0.0  0.0   0.0        0.0  0.0   0.0  ...   

    yr  yrs  yummy  yun  yunny  yuo  yup  zed  zindgi  zoe  
0  0.0  0.0    0.0  0.0    0.0  0.0  0.0  0.0     0.0  0.0  
1  0.0  0.0    0.0  0.0    0.0  0.0  0.0  0.0     0.0  0.0  
2  0.0  0.0    0.0  0.0    0.0  0.0  0.0  0.0     0.0  0.0  
3  0.0  0.0    0.0  0.0    0.0  0.0  0.0  0.0     0.0  0.0  
4  0.0  0.0    0.0  0.0    0.0  0.0  0.0  0.0     0.0  0.0  

[5 rows x 3000 columns]


# Step 5: Splitting the Dataset

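Next, we split the dataset into training and testing sets. The TF-IDF matrix X serves as the features and the "label" column as the target y. Note that the vectorizer was fitted on the full dataset before this split, so the test rows influenced the vocabulary and IDF weights, and the scores reported below may be slightly optimistic.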

In [6]:
from sklearn.model_selection import train_test_split

# Define the target variable
y = data['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
print(f'Shape of X_train: {X_train.shape}')
print(f'Shape of X_test: {X_test.shape}')
print(f'Shape of y_train: {y_train.shape}')
print(f'Shape of y_test: {y_test.shape}')


Shape of X_train: (4457, 3000)
Shape of X_test: (1115, 3000)
Shape of y_train: (4457,)
Shape of y_test: (1115,)


The dataset has been successfully split into training and testing sets.

# Step 6: Model Training

We'll train a Logistic Regression model on the training data. Logistic Regression is a commonly used algorithm for binary classification problems like spam detection.

In [7]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Display the first few predictions
print(y_pred[:10])


['ham' 'ham' 'ham' 'ham' 'spam' 'ham' 'ham' 'ham' 'ham' 'ham']


The model has been trained and has made some predictions. Now, let's evaluate the model's performance on the test set.

# Step 7: Model Evaluation

We'll use common evaluation metrics such as accuracy, precision, recall, and F1-score to assess the performance of our spam detection model.

In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Calculate precision
precision = precision_score(y_test, y_pred, pos_label='spam')

# Calculate recall
recall = recall_score(y_test, y_pred, pos_label='spam')

# Calculate F1-score
f1 = f1_score(y_test, y_pred, pos_label='spam')

# Generate classification report
report = classification_report(y_test, y_pred, target_names=['ham', 'spam'])

# Print evaluation metrics
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1}')
print('\nClassification Report:\n', report)


Accuracy: 0.9488789237668162
Precision: 0.9514563106796117
Recall: 0.6533333333333333
F1-score: 0.774703557312253

Classification Report:
               precision    recall  f1-score   support

         ham       0.95      0.99      0.97       965
        spam       0.95      0.65      0.77       150

    accuracy                           0.95      1115
   macro avg       0.95      0.82      0.87      1115
weighted avg       0.95      0.95      0.94      1115



## Model Evaluation Summary

### Performance Metrics
- **Accuracy**: 0.95
- **Precision** (for spam): 0.95
- **Recall** (for spam): 0.65
- **F1-score** (for spam): 0.77

### Summary
The model shows high accuracy and precision but a much lower recall for the spam class. When the model flags an email as spam it is right about 95% of the time (precision), yet it catches only about 65% of the actual spam emails (recall), so roughly one in three spam messages slips through as 'ham'.
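The four reported scores can be reproduced from a single set of confusion counts. The counts below are not printed by the notebook; they are inferred from the reported precision and recall together with the support figures (150 spam, 965 ham), so treat them as a reconstruction:

```python
# Confusion counts for the spam class, inferred from the reported metrics.
tp = 98    # spam correctly flagged as spam
fp = 5     # ham wrongly flagged as spam
fn = 52    # spam missed (predicted ham)
tn = 960   # ham correctly left alone

precision = tp / (tp + fp)                   # share of spam predictions that are right
recall = tp / (tp + fn)                      # share of actual spam that is caught
f1 = 2 * tp / (2 * tp + fp + fn)             # harmonic mean of precision and recall
accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are right

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4f}")
```

Accuracy stays high despite the 52 missed spam messages because the 960 correctly classified ham messages dominate the total.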

### Next Steps: Improvements

1. **Balancing the Dataset**:
   - The dataset is imbalanced: the test set alone contains 965 'ham' messages against 150 'spam'. Techniques such as oversampling the minority class (spam) or undersampling the majority class (ham) can help balance the dataset.

2. **Try Different Models**:
   - Some models might perform better for this specific task. Consider using models like Random Forest, Gradient Boosting, or even deep learning models.

3. **Hyperparameter Tuning**:
   - Fine-tuning the hyperparameters of the Logistic Regression model or other models can lead to better performance.

4. **Feature Engineering**:
   - Adding more features or using different vectorization techniques like Word2Vec or BERT embeddings might improve performance.
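For the first item, the degree of imbalance can be estimated from figures printed elsewhere in this notebook: the split shapes, the test-set support counts, and the 3860 'ham' training rows shown in the SMOTE output. The arithmetic below is a sketch built on those printed numbers:

```python
# Figures taken from outputs printed in this notebook.
train_total = 4457                # rows in the training split
train_ham = 3860                  # majority-class count reported after resampling
test_ham, test_spam = 965, 150    # support column of the classification report

train_spam = train_total - train_ham
ham = train_ham + test_ham
spam = train_spam + test_spam
spam_share = spam / (ham + spam)

print(f"ham={ham} spam={spam} spam share={spam_share:.1%}")
```

On the real DataFrame, `data['label'].value_counts(normalize=True)` reports the same proportions directly.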

### Next Step: Balancing the Dataset
Let's start with balancing the dataset using oversampling and see if it improves the recall for the spam class.


# Step 8: Balancing the Dataset
We will use SMOTE (Synthetic Minority Over-sampling Technique) to balance the training set.
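SMOTE does not duplicate minority rows. It creates new points on the line segment between a minority sample and one of its nearest minority neighbours. The function below is a simplified pure-Python illustration of that interpolation step; `smote_like_sample` and the toy `minority` points are hypothetical, and this is not how `imblearn` implements it internally:

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    # Pick a random position on the segment from base to neighbour.
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(base, neighbour)]

# Toy 2-D minority class, invented for illustration.
minority = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9], [0.9, 0.1]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic point is interpolated between two existing samples, it stays inside the range already covered by the minority class.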



In [9]:
from imblearn.over_sampling import SMOTE

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the training data
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Check the new class distribution
print(y_train_balanced.value_counts())


label
ham     3860
spam    3860
Name: count, dtype: int64


The training set is now balanced, with 3860 'ham' and 3860 'spam' samples. The test set is left untouched, so evaluation still reflects the original class distribution. Let's retrain the Logistic Regression model on this balanced training set and evaluate its performance again.

# Step 9: Retraining the Model

We'll retrain the Logistic Regression model on the balanced dataset and make predictions on the test set.

In [10]:
# Initialize the Logistic Regression model
model_balanced = LogisticRegression(max_iter=1000)

# Train the model on the balanced dataset
model_balanced.fit(X_train_balanced, y_train_balanced)

# Predict on the test set
y_pred_balanced = model_balanced.predict(X_test)

# Evaluate the model's performance
accuracy_balanced = accuracy_score(y_test, y_pred_balanced)
precision_balanced = precision_score(y_test, y_pred_balanced, pos_label='spam')
recall_balanced = recall_score(y_test, y_pred_balanced, pos_label='spam')
f1_balanced = f1_score(y_test, y_pred_balanced, pos_label='spam')
report_balanced = classification_report(y_test, y_pred_balanced, target_names=['ham', 'spam'])

# Print evaluation metrics
print(f'Accuracy: {accuracy_balanced}')
print(f'Precision: {precision_balanced}')
print(f'Recall: {recall_balanced}')
print(f'F1-score: {f1_balanced}')
print('\nClassification Report:\n', report_balanced)


Accuracy: 0.9757847533632287
Precision: 0.9072847682119205
Recall: 0.9133333333333333
F1-score: 0.9102990033222591

Classification Report:
               precision    recall  f1-score   support

         ham       0.99      0.99      0.99       965
        spam       0.91      0.91      0.91       150

    accuracy                           0.98      1115
   macro avg       0.95      0.95      0.95      1115
weighted avg       0.98      0.98      0.98      1115



## Model Evaluation Summary (After Balancing)

### Performance Metrics
- **Accuracy**: 0.98
- **Precision** (for spam): 0.91
- **Recall** (for spam): 0.91
- **F1-score** (for spam): 0.91

### Summary
- **Accuracy** improved from 0.95 to 0.98.
- **Precision** for spam dropped from 0.95 to 0.91: 91% of the emails predicted as spam were actually spam, so the model now raises somewhat more false alarms.
- **Recall** for spam rose from 0.65 to 0.91, meaning the model now correctly identifies 91% of the actual spam emails.
- **F1-score** for spam, the harmonic mean of precision and recall, rose from 0.77 to 0.91.
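The trade-off is easier to see as raw error counts. The true-positive and false-positive counts below are inferred from the reported precision and recall with a spam support of 150; they are a reconstruction, not values printed by the notebook:

```python
SPAM_SUPPORT = 150  # actual spam messages in the test set

# Counts inferred from the reported precision and recall of each model.
models = {
    "original": {"tp": 98, "fp": 5},
    "balanced": {"tp": 137, "fp": 14},
}

errors = {}
for name, c in models.items():
    missed = SPAM_SUPPORT - c["tp"]   # spam that reached the inbox
    false_alarms = c["fp"]            # ham that was flagged as spam
    errors[name] = (missed, false_alarms)
    print(f"{name:>8}: missed spam={missed:3d}  false alarms={false_alarms:3d}")
```

Oversampling cut the missed spam from 52 to 13 messages at the cost of 9 extra false alarms, which is why recall rose sharply while precision dipped.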

### Next Steps
1. **Further Hyperparameter Tuning**:
   - Fine-tuning hyperparameters of the Logistic Regression model or trying more sophisticated models can help in pushing the model performance even further.

2. **Feature Engineering**:
   - Adding or refining features could further improve the model's effectiveness.
   - For example, replacing or complementing the TF-IDF features used here with Word2Vec or BERT embeddings might help.

3. **Model Deployment**:
   - Once satisfied with the performance, the model can be deployed into a production environment for real-time spam detection.
