In [3]:
# Import the necessary libraries and packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and text handling
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load in the dataset
df = pd.read_csv('emails.csv')
df.head()

Unnamed: 0,Email Content,Label
0,Thank you for reaching out! Your software solu...,Interested
1,"Hi Sarah, I've booked our meeting for Thursday...",Meeting Booked
2,"Thanks for your email, but we're not intereste...",Not Interested
3,"CONGRATULATIONS! You've won $10,000! Click her...",Spam
4,I am currently out of the office until March 1...,Out of Office


# Exploratory Data Analysis

In [4]:
# Check the shape
print(df.shape)

(55, 2)


In [5]:
# Check for missing values and drop them
print(df.isnull().sum())
df = df.dropna()
print(df.isnull().sum())

Email Content    0
Label            0
dtype: int64
Email Content    0
Label            0
dtype: int64


# TF-IDF Vectorization
Most NLP models need numeric input, so text must first be converted into numerical vectors. TF-IDF (Term Frequency-Inverse Document Frequency) vectorization is a common way to do this: it represents each document as a vector whose entries measure how important a term is to that document relative to the entire corpus. Let's break it down with a step-by-step example:

## Example Corpus:
Consider a small corpus with three documents:

* Document 1: "I love eating apples."
* Document 2: "Apples are delicious fruits."
* Document 3: "Bananas and oranges are also tasty."

### Step 1: Calculate Term Frequency (TF)
Term Frequency (TF) measures how often a term appears in a document relative to the total number of terms in that document.

$$ \text{TF} = \frac{\text{Number of times the term appears in the document}}{\text{Total number of terms in the document}} $$

For instance, let's calculate TF for the term "apples" in Document 1:

* Number of times "apples" appears in Document 1 = 1
* Total number of terms in Document 1 = 4 ("I", "love", "eating", "apples"; no stop words are removed in this example)
* TF("apples", Document 1) = 1 / 4 = 0.25

Similarly, compute TF for all terms in each document.
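
The TF calculation can be sketched in a few lines of plain Python. This is an illustration only: the regex tokenizer and the helper name `term_frequency` are assumptions made here, and no stop words are removed, which matches the count of 4 terms used above.

```python
import re


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency(term, tokens):
    """Share of the document's tokens that equal `term`."""
    return tokens.count(term) / len(tokens)


tokens_doc1 = tokenize("I love eating apples.")
tf_apples = term_frequency("apples", tokens_doc1)
print(tokens_doc1)  # ['i', 'love', 'eating', 'apples']
print(tf_apples)    # 0.25
```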

### Step 2: Calculate Inverse Document Frequency (IDF)
Inverse Document Frequency (IDF) measures the rarity of a term across the entire corpus. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term. This example uses the base-10 logarithm.

$$ \text{IDF} = \log_{10}\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the term}}\right) $$

For example, let's calculate IDF for the term "apples":

* Total number of documents in the corpus = 3
* Number of documents containing the term "apples" = 2
* IDF("apples") = log10(3 / 2) ≈ 0.176

Calculate IDF for all terms in the corpus.
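
As a rough sketch, the IDF step for the example corpus might look like this. The base-10 logarithm is chosen to reproduce the 0.176 figure; the tokenizer and the helper name `inverse_document_frequency` are assumptions made for illustration.

```python
import math
import re

corpus = [
    "I love eating apples.",
    "Apples are delicious fruits.",
    "Bananas and oranges are also tasty.",
]

# Lowercasing makes "Apples" and "apples" count as the same term.
tokenized_docs = [re.findall(r"[a-z]+", doc.lower()) for doc in corpus]


def inverse_document_frequency(term, tokenized_docs):
    """log10(N / df); the term must occur in at least one document."""
    containing = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log10(len(tokenized_docs) / containing)


idf_apples = inverse_document_frequency("apples", tokenized_docs)
print(round(idf_apples, 3))  # 0.176
```

A term that appears in every document gets an IDF of log10(1) = 0, so it contributes nothing to the final score under this unsmoothed definition.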

### Step 3: Compute TF-IDF
TF-IDF is obtained by multiplying TF by IDF for each term in each document.

For "apples" in Document 1:

* TF-IDF("apples", Document 1) = TF("apples", Document 1) * IDF("apples") = 0.25 * 0.176 ≈ 0.044

Similarly, calculate TF-IDF for all terms in each document.

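Collecting these scores for every term and document gives a term-document matrix: each row represents a term, each column represents a document, and each value is the TF-IDF score of that term in that document.

With the vocabulary ordered by first appearance (i, love, eating, apples, are, delicious, fruits, bananas, and, oranges, also, tasty), Document 1 is the column [0.119, 0.119, 0.119, 0.044, 0, 0, 0, 0, 0, 0, 0, 0]. The terms that occur only in Document 1 each score 0.25 * log10(3 / 1) ≈ 0.119, and terms absent from Document 1 score 0.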
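
The three steps can be combined into a short, self-contained sketch that builds the full TF-IDF vector for Document 1. It is an illustration under assumptions rather than a reference implementation: the regex tokenizer, the base-10 logarithm, the first-appearance vocabulary order, and the helper name `tf_idf` are all choices made here.

```python
import math
import re

corpus = [
    "I love eating apples.",
    "Apples are delicious fruits.",
    "Bananas and oranges are also tasty.",
]
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in corpus]

# Vocabulary in order of first appearance (dict keys preserve insertion order).
vocab = list(dict.fromkeys(term for doc in tokenized for term in doc))


def tf_idf(term, tokens):
    """TF (relative frequency in the document) times IDF (log10 of N / df)."""
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for doc in tokenized if term in doc)
    return tf * math.log10(len(tokenized) / df)


doc1_vector = [round(tf_idf(term, tokenized[0]), 3) for term in vocab]
print(vocab)
print(doc1_vector)
# [0.119, 0.119, 0.119, 0.044, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```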

## Advantages of TF-IDF:
1. Term Importance: TF-IDF emphasizes terms that are both frequent in a document and rare across the corpus, highlighting their significance in representing the content.
2. Versatility: TF-IDF can be applied to various NLP tasks such as document classification, information retrieval, and keyword extraction, making it a versatile technique.
3. Language Independence: Beyond tokenization, TF-IDF does not rely on linguistic rules or language-specific features, so it can be applied across different languages and domains.
4. Simple Calculation: TF-IDF scores are straightforward to compute, involving basic arithmetic operations (TF and IDF calculations) applied to each term in each document.

## Disadvantages of TF-IDF:
1. Sparse Representation: TF-IDF matrices tend to be sparse, especially in large corpora with many unique terms, which can lead to storage and computational overhead.
2. Lack of Semantic Understanding: TF-IDF does not consider the semantic relationships between terms, potentially leading to limitations in understanding context and meaning.
3. Sensitivity to Vocabulary: TF-IDF depends on the vocabulary learned from the corpus. Terms that were not seen when the vectorizer was fitted are ignored at prediction time, and very rare words can receive disproportionately high weights.
4. Normalization Issues: TF-IDF scores may need to be normalized to account for document length variations, which can impact the effectiveness of the technique.

## Applications of TF-IDF Vectorization
TF-IDF vectorization finds extensive applications across various NLP tasks:
1. **Document Classification**: TF-IDF vectors serve as features for training classifiers to categorize documents into predefined classes.
2. **Information Retrieval**: Search engines utilize TF-IDF to rank documents based on their relevance to user queries.
3. **Keyword Extraction**: TF-IDF aids in identifying important keywords within documents for summarization and content analysis.

In [7]:
# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(df['Email Content'])
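
The values produced by `TfidfVectorizer` will not match the hand calculation above, because its default settings differ from the textbook formula: it uses raw counts for TF, a smoothed natural-log IDF of ln((1 + N) / (1 + df)) + 1, and L2-normalizes each document vector. Its default tokenizer also lowercases text and drops single-character tokens such as "I". The plain-Python sketch below mimics those defaults on the example corpus; it is an approximation for illustration, and the helper name `tfidf_row` is hypothetical.

```python
import math
import re

docs = [
    "I love eating apples.",
    "Apples are delicious fruits.",
    "Bananas and oranges are also tasty.",
]

# Mimic the default token pattern: lowercase words of two or more characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
vocab = sorted({term for doc in tokenized for term in doc})

n_docs = len(docs)
doc_freq = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_row(tokens):
    """Raw count times smoothed IDF, then scaled to unit L2 length."""
    raw = [tokens.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw


row = tfidf_row(tokenized[0])
for term, weight in zip(vocab, row):
    if weight:
        print(f"{term}: {weight:.3f}")
```

Because of the L2 normalization, every non-empty document vector has length 1, which reduces the effect of document length on the scores.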

# Train-Test Split

In [9]:
# Split the data into training and test sets for the Logistic Regression model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf_vectors, df['Label'], test_size=0.2, random_state=42)
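
One thing to be aware of: in the cells above the vectorizer is fitted on all emails before the split, so the test emails influence the vocabulary and IDF weights. A common alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so the vectorizer only ever sees training data. The sketch below shows the idea on a small hypothetical dataset (the `texts` and `labels` lists are made up and stand in for `df['Email Content']` and `df['Label']`); it is not the notebook's method, and the `stratify` argument is an added assumption to keep class proportions equal in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, for illustration only.
texts = [
    "You have won a free prize, click here now",
    "Claim your free reward today, limited offer",
    "Win cash instantly, click this link",
    "Free gift cards available, act now",
    "Our meeting is booked for Thursday at 3 PM",
    "Confirming the call scheduled for Monday morning",
    "I have booked the room for our meeting tomorrow",
    "The demo meeting is scheduled for Friday",
]
labels = ["Spam"] * 4 + ["Meeting Booked"] * 4

# Split the raw text first so the test emails stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# fit() learns the vocabulary and IDF weights from the training emails only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(X_train, y_train)

predictions = pipeline.predict(X_test)
print(list(zip(X_test, predictions)))
```

A fitted pipeline also accepts raw strings directly, so prediction does not need a separate `transform` call.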

# Machine Learning Models
We will use Logistic Regression, Support Vector Machine, Random Forest Classifier, and Gradient Boosting Classifier models to classify the emails in the given dataset into their categories.

In [10]:
# Importing & calling Machine learning models

# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression()

In [13]:
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score

# Fit on the training split, then score the predictions on the held-out test split
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
accuracy_scores = accuracy_score(y_test, y_pred)
print(lr_model)
print(f'Accuracy Score: {accuracy_scores}')
print()

LogisticRegression()
Accuracy Score: 0.7272727272727273



# Classification Report
The classification report summarizes how well the model identifies each class. It is a standard evaluation tool for supervised classification and lists precision, recall, F1-score, and support per class, along with the overall accuracy (how often the model is correct). In a multi-class problem such as this one, each metric is computed per class by treating that class as "positive" and all other classes as "negative":

1. **Precision**: This metric measures how many of the positive predictions made by the model are actually correct. It's calculated as the ratio of true positives (correctly predicted positive samples) to the sum of true positives and false positives (incorrectly predicted positive samples).

2. **Recall (Sensitivity)**: Recall tells us how many of the actual positive samples were correctly predicted by the model. It's calculated as the ratio of true positives to the sum of true positives and false negatives (actual positive samples missed by the model).

3. **F1-Score**: The F1-score is the harmonic mean of precision and recall. It balances both metrics and provides a single value that represents the overall performance of the model.

4. **Support**: Support indicates the number of samples in each class.

We can generate a classification report with scikit-learn's `classification_report()` function.
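
To make the definitions concrete, here is a minimal plain-Python sketch that computes the same per-class quantities by hand. The helper name `per_class_metrics` and the four-email toy example are hypothetical; in practice scikit-learn's function does this work.

```python
def per_class_metrics(y_true, y_pred):
    """Precision, recall, F1 and support per class, treating each class as 'positive' in turn."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report


# Hypothetical toy example: one Spam email is misclassified as Interested.
y_true = ["Spam", "Spam", "Interested", "Interested"]
y_pred = ["Spam", "Interested", "Interested", "Interested"]

report = per_class_metrics(y_true, y_pred)
for label, scores in report.items():
    print(label, scores)
```

In this toy example, Spam has perfect precision (everything predicted as Spam is Spam) but a recall of 0.5 (one of the two Spam emails was missed), while Interested shows the opposite pattern.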

In [15]:
# String labels are picked up automatically and listed in sorted order, so target_names is not needed
print("Classification Report for Logistic Regression:\n", classification_report(y_test, y_pred))

Classification Report for Logistic Regression:
                 precision    recall  f1-score   support

    Interested       0.50      1.00      0.67         1
Meeting Booked       1.00      0.67      0.80         3
Not Interested       1.00      0.67      0.80         3
 Out of Office       0.50      1.00      0.67         2
          Spam       1.00      0.50      0.67         2

      accuracy                           0.73        11
     macro avg       0.80      0.77      0.72        11
  weighted avg       0.86      0.73      0.74        11



In [18]:
text = ["Hey team, just confirming our sync-up is at 4 PM tomorrow in Room 101."]
category = lr_model.predict(tfidf_vectorizer.transform(text))
print(category)

['Meeting Booked']


In [19]:
text = ["I'm out of the office for the week and will respond upon my return."]
category = lr_model.predict(tfidf_vectorizer.transform(text))
print(category)

['Out of Office']


In [21]:
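import pickle
# Save the fitted model and vectorizer together; assumes the models/ directory already exists
with open('models/email_categoriser.pkl', 'wb') as f:
    pickle.dump(lr_model, f)
with open('models/tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)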
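
Both files are needed at prediction time, because new text must be transformed with the same fitted vectorizer before the model can classify it. A possible way to reload and use them is sketched below; the helper names `load_artifacts` and `categorise` are hypothetical, and the file names are the ones used when saving above.

```python
import pickle
from pathlib import Path


def load_artifacts(model_dir="models"):
    """Load the pickled classifier and vectorizer from `model_dir`.

    Only unpickle files you trust: pickle can execute arbitrary code on load.
    """
    model_dir = Path(model_dir)
    with open(model_dir / "email_categoriser.pkl", "rb") as f:
        model = pickle.load(f)
    with open(model_dir / "tfidf_vectorizer.pkl", "rb") as f:
        vectorizer = pickle.load(f)
    return model, vectorizer


def categorise(texts, model, vectorizer):
    """Vectorize raw email texts with the fitted vectorizer, then predict their labels."""
    return list(model.predict(vectorizer.transform(texts)))


# Example usage (requires the two files saved above):
# model, vectorizer = load_artifacts()
# print(categorise(["I'm out of the office until Monday."], model, vectorizer))
```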