# InboxIQ: Intelligent Email Filtering with Machine Learning  
### Winter Data Science (WiDS) Program  

Welcome to the InboxIQ project! This notebook is designed to guide you through the process of building an intelligent email filtering system to classify emails as ham or spam using a Recurrent Neural Network (RNN).  

## Objectives  
This project aims to:  
1. Provide hands-on experience with email classification using RNNs.  
2. Help you understand fundamental concepts such as data preprocessing, tokenization, padding, label encoding, and model evaluation.  
3. Emphasize the importance of key techniques like early stopping, activation functions, and proper data splitting for training and validation.  

## Submission Requirements  
- **Deadline**: The completed project (code + report) is due on **January 7, 11:59 PM**.  
- **What to Submit**: Your submission must include:  
  - A fully functional Python script or Jupyter Notebook for the email classification system.  
  - A detailed report explaining your implementation and highlighting key concepts.  

## Guidelines  
- The outlined code provided in this notebook is kept intentionally simple to help you focus on understanding the core concepts rather than dealing with excessive complexity.  
- **Flexibility**:  
  - You can use the provided outline as is.  
  - Modify the outlined code as needed.  
  - Alternatively, feel free to write your own implementation from scratch.  

## Key Expectations  
- **Conceptual Understanding**: Your report should demonstrate a thorough understanding of:  
  - Data preprocessing steps (e.g., tokenization, padding, label encoding).  
  - The role of activation functions and loss metrics.  
  - How early stopping helps prevent overfitting.  
  - The purpose and implementation of each line of code.  
- **Exhaustive Submission**: Your submission (code + report) is the sole basis of assessment for certification. The report carries significant weight in the evaluation.  

Take your time to experiment, learn, and reflect on the process. This project is not just about completing the task; it’s about building a solid foundation in machine learning concepts applied to real-world problems.  

Submission Link: https://forms.gle/VA3ArTGhXcHee4nY7

Good luck, and happy coding!  


In [None]:
# Import the necessary libraries; feel free to add any others you need.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Load the dataset

In [None]:
file_path = "________________"  # Provide the correct path to the dataset.
data = pd.read_csv(file_path)

Explore the data

In [None]:
# Explore the data: for example, check how imbalanced the dataset is and print its first 10 rows.
________________
________________
________________
________________
________________
________________
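
If you are unsure where to start, the sketch below shows one way to inspect the first rows and the class balance with pandas. The `toy` DataFrame and its column names `text` and `label` are assumptions made for illustration; your CSV may use different names.

```python
import pandas as pd

# Toy stand-in for the real dataset; the column names are assumptions.
toy = pd.DataFrame({
    "text": [
        "win cash now",
        "lunch at noon?",
        "see you tomorrow",
        "free prize inside",
        "meeting moved to 3pm",
    ],
    "label": ["spam", "ham", "ham", "spam", "ham"],
})

# First rows of the dataset (up to 10).
print(toy.head(10))

# Absolute and relative class frequencies show how imbalanced the data is.
counts = toy["label"].value_counts()
shares = toy["label"].value_counts(normalize=True)
print(counts)
print(shares)
```

A heavily skewed `shares` output is a sign that accuracy alone may not tell the whole story later on.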

Preprocess the text (tokenization, cleaning, etc.)

In [None]:
# Make sure the dataset has a 'text' column for the email content and a 'label' column for spam/ham; ideally, do any renaming or reshaping with a few extra lines of pandas.
texts = ______________  # Fill
labels = ______________

# Convert labels to numerical format (e.g., 0 for ham, 1 for spam). Check the CSV file to see whether this step is actually necessary.
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(labels)
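
To see what `LabelEncoder` does, here is a minimal sketch on a hand-made list of labels (the list is illustrative, not taken from the dataset). The encoder sorts the distinct classes and numbers them in that order, so with string labels `ham` and `spam`, `ham` becomes 0 and `spam` becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels, not taken from the real dataset.
toy_labels = ["ham", "spam", "ham", "ham", "spam"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ holds the sorted distinct labels; a label's index is its code.
print(list(encoder.classes_))
print(encoded.tolist())

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform([0, 1])
print(list(decoded))
```

If the CSV already stores labels as 0/1, the transformation leaves them unchanged.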

Tokenize and pad the sequences

In [None]:
max_words = 1000  # Maximum vocabulary size
max_len = 100     # Maximum sequence length

tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='_____')  # Fill with 'pre' or 'post'; explain your choice in the report.
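
The difference between `'pre'` and `'post'` is easiest to see on a tiny example. The helper `pad` below is a hand-rolled illustration written for this notebook, not the Keras implementation; it only mimics zero padding and the default behaviour of dropping tokens from the front when a sequence is too long.

```python
def pad(seq, maxlen, padding="pre", value=0):
    """Illustrative zero padding for one sequence (hypothetical helper)."""
    # Keep only the last `maxlen` tokens, mirroring Keras' default 'pre' truncation.
    seq = list(seq)[-maxlen:]
    fill = [value] * (maxlen - len(seq))
    # 'pre' puts the zeros before the tokens, 'post' puts them after.
    return fill + seq if padding == "pre" else seq + fill

tokens = [12, 7, 45]
pre_padded = pad(tokens, 6, "pre")
post_padded = pad(tokens, 6, "post")
print(pre_padded)
print(post_padded)
```

Keep in mind that `SimpleRNN` reads each sequence from left to right and, by default, passes only its final hidden state to the next layer, so where the zeros sit affects what the network sees last.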

Splitting the data into training and validation sets

In [None]:
X_train, X_val, y_train, y_val = train_test_split(________, labels, test_size=_______, random_state=42)
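
If the dataset turns out to be imbalanced, you may want both splits to keep the same spam/ham ratio. The sketch below uses the `stratify` argument of `train_test_split` on synthetic data; the arrays and the 20% validation share are assumptions chosen for illustration, not recommended values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 20 labelled spam (1) and 80 labelled ham (0).
X = np.arange(100).reshape(100, 1)
y = np.array([1] * 20 + [0] * 80)

# stratify=y keeps the class proportions the same in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_tr), len(X_va))
print(y_tr.mean(), y_va.mean())  # share of spam in each split
```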


Building the RNN model

In [None]:
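model = Sequential([
    Embedding(input_dim=max_words, output_dim=64, input_length=max_len),
    SimpleRNN(64, activation='____'),  # Fill in an appropriate activation function; justify your choice in the report.
    Dense(1, activation='_____')  # Fill in an output activation suited to binary classification.
])

# Compile the model before training: fit() fails on an uncompiled model, and
# evaluate() returns accuracy only because it is listed in metrics.
# 'adam' is a common default optimizer; feel free to change it.
model.compile(optimizer='adam', loss='____', metrics=['accuracy'])  # Fill in a loss suited to binary classification.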
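
When reasoning about which activation to use where, it can help to compare the output ranges of a few common candidates. The NumPy sketch below evaluates tanh, sigmoid, and ReLU on the same inputs; it is only an aid for your report and does not pick an answer for you.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Zero for negative inputs, identity for positive ones (unbounded above).
    return np.maximum(0.0, z)

z = np.array([-3.0, 0.0, 3.0])

tanh_out = np.tanh(z)      # values in (-1, 1), centred on 0
sigmoid_out = sigmoid(z)   # values in (0, 1)
relu_out = relu(z)         # values in [0, inf)

print(tanh_out)
print(sigmoid_out)
print(relu_out)
```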

Training the model with early stopping

In [None]:
# Early stopping configuration
callback = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Model training
history = model.fit(
    X_train, y_train,
    epochs=____,  # Fill in number of epochs
    validation_data=(____, ____),  # Fill in validation data
    callbacks=[callback],
    batch_size=____  # Fill in batch size
)
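
To build intuition for `patience`, here is a simplified pure-Python sketch of the early stopping rule: stop once the validation loss has failed to improve for `patience` consecutive epochs, and remember the epoch with the best loss. The function `early_stop` and the loss values are made up for illustration; the actual Keras callback has more options (for example `min_delta`).

```python
def early_stop(val_losses, patience=3):
    """Return (epoch training stops at, epoch with the best loss), both 0-indexed."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering early stopping.
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improvement stalls after epoch 2.
losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.50]
stopped_at, best_at = early_stop(losses, patience=3)
print(stopped_at, best_at)
```

In this made-up run, training stops before the later drop to 0.50 is ever reached, which shows the trade-off in choosing `patience`. With `restore_best_weights=True`, the model would be rolled back to the weights from the best epoch.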

Evaluate the model

In [None]:
# Evaluate the model on the validation set
loss, accuracy = model.evaluate(____, ____) # Fill here
print(f"Validation Loss: {loss}, Validation Accuracy: {accuracy}")
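
On an imbalanced dataset, accuracy alone can look better than the model really is. The sketch below uses made-up labels and a hypothetical classifier that always predicts ham: it reaches high accuracy while catching no spam at all, which is why it may be worth reporting recall (or precision) for the spam class as well.

```python
import numpy as np

# Made-up ground truth: 9 ham (0) and 1 spam (1).
y_true = np.array([0] * 9 + [1])
# A hypothetical classifier that always predicts ham.
y_pred = np.zeros(10, dtype=int)

accuracy = (y_true == y_pred).mean()

# Recall for spam: the share of actual spam emails that were flagged.
tp = ((y_true == 1) & (y_pred == 1)).sum()
fn = ((y_true == 1) & (y_pred == 0)).sum()
recall = tp / (tp + fn)

print(f"Accuracy: {accuracy:.2f}, Spam recall: {recall:.2f}")
```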


Testing

In [None]:
# Test your model on a few sample inputs and see for yourself how it behaves.
sample_texts = ["___________________________"] # A message like "Congratulations! you've won a Nobel Pri...."
sample_sequences = tokenizer.texts_to_sequences(sample_texts)
sample_padded = pad_sequences(________________, maxlen=max_len, padding='post')  # Fill in here; change 'post' if you padded with 'pre' during training.
predictions = model.predict(______________) # Fill in here

for text, pred in zip(sample_texts, predictions):
    print(f"Text: {text}")
    print(f"Prediction (Spam Probability): {pred[0]:.2f}")
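
To turn probabilities into ham/spam labels, you can apply a threshold. The sketch below uses a made-up array in the shape a single-output model returns (one row per sample, one column); the 0.5 cut-off is an assumption, and a different value may suit your data better.

```python
import numpy as np

# Made-up outputs shaped (n_samples, 1); not real model predictions.
toy_predictions = np.array([[0.93], [0.08]])

threshold = 0.5  # assumed cut-off; tune it if false positives are costly
labels_out = ["spam" if p >= threshold else "ham" for p in toy_predictions[:, 0]]
print(labels_out)
```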
