---
### **⚙ Text Classification Pipeline for Email Subjects by VoxDroid ⚙**
---
This notebook builds a text classification pipeline that assigns email subject lines to categories. It walks through loading a tab-separated dataset, preprocessing the text, splitting the data, extracting features with CountVectorizer, training a Multinomial Naive Bayes model, evaluating it, saving the trained model, and predicting the category of new text. The implementation uses pandas, scikit-learn, joblib, and NLTK.

The notebook is aimed at readers exploring natural language processing (NLP) or building a practical email categorization system. The input file is expected to contain a `Subject` column holding the text and a `Category` column holding the label.

In [None]:
# @title Step 1: Data Preparation ✔
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
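import joblib

# Path to a tab-separated file with a 'Subject' column (text) and a 'Category' column (label)
file_name = "your_tsv_file_here.tsv" #@param {type:"string"}
data = pd.read_csv(file_name, sep='\t')

# Preview the first rows to confirm the columns loaded as expected
data.head()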
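
As a rough illustration of the input this step expects, the sketch below parses a tiny in-memory TSV and checks for the two columns the later cells use. The sample rows and category names are made up for the example; only the `Subject` and `Category` column names come from the notebook.

```python
import io

import pandas as pd

# Made-up sample rows; real data would come from your own TSV file
sample_tsv = (
    "Subject\tCategory\n"
    "Invoice for March\tbilling\n"
    "Team lunch on Friday\tsocial\n"
)

sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Fail early if a column the later steps rely on is missing
required = {"Subject", "Category"}
missing = required - set(sample.columns)
if missing:
    raise ValueError(f"Missing columns: {sorted(missing)}")

print(sample.shape)  # (2, 2)
```

Running the same column check on `data` right after loading can surface a wrong separator or a misnamed header before the preprocessing cell fails on it.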

In [None]:
# @title Step 2: Text Preprocessing ✔
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

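nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer from this resource

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, strip punctuation, tokenize, and drop English stop words."""
    # Missing subjects arrive as NaN (a float); treat them as empty text
    if not isinstance(text, str):
        return ''
    text = text.lower()
    # Remove ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return ' '.join(tokens)

data['Subject'] = data['Subject'].apply(preprocess_text)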

data['Subject'].head()
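
To make the effect of these steps concrete, here is a dependency-free sketch of the same sequence applied to one made-up subject line. It is a simplification: `DEMO_STOP_WORDS` is a tiny hypothetical list standing in for NLTK's English stop words, and `str.split` stands in for `word_tokenize`.

```python
import string

# Tiny illustrative stop-word list; the notebook uses NLTK's full English list
DEMO_STOP_WORDS = {"the", "is", "a", "for", "your", "to", "and", "of"}

def preprocess_demo(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # stand-in for word_tokenize
    return " ".join(t for t in tokens if t not in DEMO_STOP_WORDS)

result = preprocess_demo("RE: Your invoice for March is ready!")
print(result)  # re invoice march ready
```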

In [None]:
# @title Step 3: Splitting the Data ✔
X_train, X_test, y_train, y_test = train_test_split(data['Subject'], data['Category'], test_size=0.2, random_state=42)
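
If some categories are much rarer than others, a purely random split can leave them under-represented in the test set. One option, sketched below on made-up labels, is the `stratify` argument of `train_test_split`, which keeps class proportions roughly equal across both sets. The data here is hypothetical; in this notebook the argument would be the `Category` column.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up, deliberately imbalanced labels: 15 "billing" and 5 "social"
subjects = [f"subject {i}" for i in range(20)]
categories = ["billing"] * 15 + ["social"] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    subjects, categories, test_size=0.2, random_state=42, stratify=categories
)

# The 3:1 class ratio is preserved in both splits
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

Stratification needs every category to appear at least twice, so very rare categories may have to be merged or removed first.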

In [None]:
# @title Step 4: Handling Missing Values ✔

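# Drop unlabeled training rows with one mask so X_train and y_train stay aligned
labeled = y_train.notna()
X_train = X_train[labeled].fillna('')
y_train = y_train[labeled]
X_test = X_test.fillna('')  # the vectorizer cannot transform NaN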
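
The small example below, on made-up data, shows why texts and labels should be filtered with the same boolean mask when rows with missing labels are dropped. Filtering by mask keeps each text with its own label, whereas dropping labels and then truncating the texts by length pairs a text with a label that belongs to a different row.

```python
import pandas as pd

# Made-up data: the second row has no label
X = pd.Series(["team meeting", "invoice due", "lunch friday"])
y = pd.Series(["work", None, "social"])

# One mask applied to both series keeps rows paired
mask = y.notna()
X_clean = X[mask].fillna("")
y_clean = y[mask]
pairs = list(zip(X_clean, y_clean))
print(pairs)  # [('team meeting', 'work'), ('lunch friday', 'social')]

# Truncating by length instead shifts labels onto the wrong texts
truncated_pairs = list(zip(X[:len(y.dropna())], y.dropna()))
print(truncated_pairs)  # [('team meeting', 'work'), ('invoice due', 'social')]
```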

In [None]:
# @title Step 5: Feature Extraction with CountVectorizer ✔
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

In [None]:
# @title Step 6: Model Training with Multinomial Naive Bayes ✔
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)
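
For intuition about what the model learns, here is a simplified, pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` in `MultinomialNB`. It is an illustration of the idea, not scikit-learn's implementation; the helper names and the training data are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Return log priors and per-class smoothed log word likelihoods."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Smoothed denominator: words seen in class c plus alpha per vocabulary word
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Pick the class with the highest log prior plus summed word log likelihoods."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are skipped
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc.split() if w in log_like[c]
        )
    return max(scores, key=scores.get)

# Made-up training data
docs = ["invoice payment due", "payment receipt", "team lunch friday", "lunch invite"]
labels = ["billing", "billing", "social", "social"]

log_prior, log_like = train_nb(docs, labels)
prediction = predict_nb("invoice payment", log_prior, log_like)
print(prediction)  # billing
```

Smoothing matters here because "invoice" never appears in a "social" subject; without it that class would get a zero probability from a single unseen word.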

In [None]:
# @title Step 7: Model Evaluation ✔
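# Compare labels as strings so true and predicted values share one type
y_test_str = y_test.astype(str)
predictions = model.predict(X_test_vectorized)
predictions_str = predictions.astype(str)

print(f"Accuracy: {accuracy_score(y_test_str, predictions_str):.4f}")
# Per-category precision, recall, and F1 score
print(classification_report(y_test_str, predictions_str))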
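
The numbers in the report can be reproduced by hand. The sketch below computes accuracy plus per-category precision and recall on a made-up set of true and predicted labels; `precision_recall` is a hypothetical helper written for this illustration.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one category, counted directly from the pairs."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels for illustration
y_true = ["billing", "billing", "social", "social", "social"]
y_pred = ["billing", "social", "social", "social", "billing"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
billing_scores = precision_recall(y_true, y_pred, "billing")
social_scores = precision_recall(y_true, y_pred, "social")

print(accuracy)        # 0.6
print(billing_scores)  # (0.5, 0.5)
print(social_scores)   # both values are 2/3
```

Accuracy alone can hide a category the model rarely gets right, which is why the per-category rows of the report are worth reading.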

In [None]:
# @title Step 8: Save the Trained Model and Create a Zip File ✔
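import zipfile

joblib.dump(model, 'esub_model.joblib')
# Predictions also need the fitted vectorizer, so save it alongside the model
joblib.dump(vectorizer, 'esub_vectorizer.joblib')

# Bundle both files into one archive
with zipfile.ZipFile('model.zip', 'w') as zipf:
    zipf.write('esub_model.joblib')
    zipf.write('esub_vectorizer.joblib')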

In [None]:
# @title Step 9: Load the Trained Model and Make a Prediction ✔
loaded_model = joblib.load('esub_model.joblib')

Subject = "Your email subject text here" #@param {type:"string"}

# Apply the same preprocessing and the vectorizer fitted in Step 5,
# so the features match what the model was trained on
preprocessed_text = preprocess_text(Subject)
text_vectorized = vectorizer.transform([preprocessed_text])

prediction = loaded_model.predict(text_vectorized)
print(f"Predicted Category: {prediction[0]}")
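
Because a prediction needs both the fitted vectorizer and the model, one alternative is to bundle them in a scikit-learn `Pipeline` and persist that single object. The sketch below is a minimal example under that assumption, using made-up training data, a temporary directory, and a hypothetical file name `esub_pipeline.joblib`. It is not how the cells above are organized.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration only
subjects = ["invoice payment due", "payment receipt", "team lunch friday", "lunch invite"]
categories = ["billing", "billing", "social", "social"]

# The pipeline vectorizes and classifies in one object
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(subjects, categories)

# Save and reload from a temporary directory (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "esub_pipeline.joblib")
    joblib.dump(text_clf, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw strings directly
predicted = restored.predict(["invoice payment"])[0]
print(predicted)  # billing
```

If the pipeline is trained on subjects that went through `preprocess_text`, new subjects should pass through the same function before `predict`, since that step lives outside the saved object.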