
## Assignment 3: Naive Bayes, Sentiment Classification, and Logistic Regression

#### Introduction

Text classification is a fundamental task in Natural Language Processing (NLP) and machine learning, with applications such as sentiment analysis, spam detection, and topic categorization. Two widely used algorithms for text classification are Naive Bayes and logistic regression.

Naive Bayes is a probabilistic generative model that assumes the features (here, words) are conditionally independent given the class label. It is typically paired with a "bag-of-words" representation, which discards word order and keeps only how often each word occurs; class probabilities are then estimated from those word frequencies. Despite ignoring word order and context, Naive Bayes can categorize documents effectively, for example when distinguishing spam from non-spam emails.
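To make the independence assumption concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes with add-one smoothing. The four training sentences, the helper names `log_score` and `predict`, and the choice to skip unseen words are all assumptions made for illustration; the notebook itself relies on scikit-learn's `MultinomialNB`.

```python
from collections import Counter
from math import log

# Hypothetical toy training data, for illustration only
train = [
    ("win cash now", "spam"),
    ("cheap cash offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

# Count documents per class and word occurrences per class
class_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_counts}
for text, label in train:
    word_counts[label].update(text.split())

# Vocabulary: every word seen in any training document
vocab = {word for counts in word_counts.values() for word in counts}


def log_score(text, label):
    """Return log P(label) + sum of log P(word | label), with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = log(class_counts[label] / len(train))
    for word in text.split():
        if word in vocab:  # words never seen in training are skipped
            score += log((word_counts[label][word] + 1) / (total_words + len(vocab)))
    return score


def predict(text):
    """Pick the class with the highest log score."""
    return max(class_counts, key=lambda label: log_score(text, label))


print(predict("cash offer now"))
print(predict("agenda for meeting"))
```

Each word contributes its own term to the sum regardless of its neighbours, which is exactly what the conditional independence assumption means in practice.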

In contrast, logistic regression is a discriminative model: it learns a direct mapping from the input features to the probability of each class label. It does not assume feature independence, so it tends to cope better with correlated features.
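As a rough sketch of what that mapping looks like, a binary logistic regression model passes a weighted sum of the word counts through the sigmoid function. The weights and bias below are hypothetical values chosen by hand, not parameters learned from the SpamAssassin data.

```python
from math import exp


def sigmoid(z):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


# Hypothetical weights: positive values push towards spam, negative towards ham
weights = {"cash": 1.8, "offer": 1.1, "meeting": -1.5}
bias = -0.5


def spam_probability(text):
    """Return P(spam | text) under the hand-picked weights above."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    z = bias + sum(weights.get(word, 0.0) * count for word, count in counts.items())
    return sigmoid(z)


print(round(spam_probability("cash offer"), 3))
print(round(spam_probability("meeting"), 3))
```

Training adjusts the weights and bias so that these probabilities match the labels as closely as possible; no per-class word distribution is estimated along the way.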

For this assignment, we will utilize a dataset of labeled spam and non-spam emails to train and evaluate both Naive Bayes and logistic regression models. The goal is to predict the class of new documents, demonstrating the effectiveness of these algorithms in real-world text classification tasks.

In [None]:
# Import the libraries
import gc
from collections import defaultdict
from email import message_from_file
from functools import partial
from glob import glob
from os import makedirs, path, remove, rename, rmdir
from re import sub
from shutil import rmtree
from tarfile import open as open_tar
from urllib import parse, request

import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score, recall_score)
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

In [None]:
def download_corpus(dataset_dir: str = 'data'):
    base_url = 'https://spamassassin.apache.org'
    corpus_path = 'old/publiccorpus'
    files = {
        '20021010_easy_ham.tar.bz2': 'ham',
        '20021010_hard_ham.tar.bz2': 'ham',
        '20021010_spam.tar.bz2': 'spam',
        '20030228_easy_ham.tar.bz2': 'ham',
        '20030228_easy_ham_2.tar.bz2': 'ham',
        '20030228_hard_ham.tar.bz2': 'ham',
        '20030228_spam.tar.bz2': 'spam',
        '20030228_spam_2.tar.bz2': 'spam',
        '20050311_spam_2.tar.bz2': 'spam' }

    # Create the folders: downloads, ham, and spam
    downloads_dir = path.join(dataset_dir, 'downloads')
    ham_dir = path.join(dataset_dir, 'ham')
    spam_dir = path.join(dataset_dir, 'spam')

    makedirs(downloads_dir, exist_ok=True)
    makedirs(ham_dir, exist_ok=True)
    makedirs(spam_dir, exist_ok=True)


    for file, spam_or_ham in files.items():
        # Build the download URL for this archive, e.g.
        # https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
        url = parse.urljoin(base_url, f'{corpus_path}/{file}')
        # Local destination, e.g. data/downloads/20021010_easy_ham.tar.bz2
        tar_filename = path.join(downloads_dir, file)
        request.urlretrieve(url, tar_filename)

        # Extract the archive and list the e-mails it contains
        emails = []
        with open_tar(tar_filename) as tar:
            tar.extractall(path=downloads_dir)
            for tarinfo in tar:
                # tarinfo.name has the form 'directory/filename'. A name with
                # more than one component after splitting on '/' is a file
                # inside a directory, i.e. an e-mail.
                if len(tarinfo.name.split('/')) > 1:
                    emails.append(tarinfo.name)

        # Move e-mails to the ham or spam directory
        extracted_dirs = set()
        for email in emails:
            # Split the path to get directory and filename
            directory, filename = email.split('/')
            # Full path to the directory where the e-mail was extracted
            directory = path.join(downloads_dir, directory)
            extracted_dirs.add(directory)
            # Skip files that already exist in the target directory (ham or spam)
            if not path.exists(path.join(dataset_dir, spam_or_ham, filename)):
                rename(path.join(directory, filename),
                       path.join(dataset_dir, spam_or_ham, filename))

        # Remove the extracted directories after moving all e-mails.
        # Tracking them in a set avoids a NameError if an archive has no e-mails.
        for directory in extracted_dirs:
            rmtree(directory)

# Call the function
download_corpus()

##### Comment:
- The function download_corpus sets base_url and corpus_path, which together give the location of the corpus files on the SpamAssassin website.
- files is a dictionary where the keys are the names of the tar.bz2 files to download, and the values indicate whether each file contains 'ham' (non-spam) or 'spam'.
- downloads_dir, ham_dir, and spam_dir are paths to the directories where the downloaded files and extracted emails will be stored. makedirs creates these directories if they do not already exist.
- A loop iterates over each entry in the files dictionary. For each file, the full URL is constructed and the local path where the archive will be saved is set.
- The file is then downloaded from the URL and saved to that local path.
- The archive is extracted, and every member located inside a subdirectory is added to the emails list.
- A second loop goes over the emails list and splits each entry into its directory and file name.
- Each email file is moved from its extracted folder under downloads_dir to either the ham or spam directory, according to the label of its archive. Files that already exist in the target directory are skipped.
- The extracted directory is removed after all emails have been moved.
- Finally, the function is called.
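The URL construction and the subdirectory check described above can be tried without downloading anything. In this small sketch the `members` list is a hypothetical stand-in for the names a tar archive might report; the URL parts are the ones used in `download_corpus`.

```python
from urllib import parse

base_url = 'https://spamassassin.apache.org'
corpus_path = 'old/publiccorpus'
file = '20021010_easy_ham.tar.bz2'

# urljoin resolves the relative path against the base URL
url = parse.urljoin(base_url, f'{corpus_path}/{file}')
print(url)

# Hypothetical archive member names: one directory entry and two files
members = ['easy_ham', 'easy_ham/0001.abc', 'easy_ham/0002.def']

# Keep only names with a directory component, as the download loop does
emails = [name for name in members if len(name.split('/')) > 1]
print(emails)
```

The bare directory entry is filtered out because splitting it on '/' yields a single component.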

In [None]:
# How many e-mails are classified in our dataset as either Spam or not Spam?
ham_dir = path.join('data', 'ham')
spam_dir = path.join('data', 'spam')

print('Number of Non-Spam E-mails:', len(glob(f'{ham_dir}/*')))
print('\nNumber of Spam E-mails:', len(glob(f'{spam_dir}/*')))

Number of Non-Spam E-mails: 6952

Number of Spam E-mails: 2399


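##### Comment:
- ham_dir = path.join('data', 'ham'): Defines the directory path for non-spam (ham) emails located within the 'data' folder.
- spam_dir = path.join('data', 'spam'): Defines the directory path for spam emails located within the 'data' folder.
- glob returns every file path in a directory, so taking len of the result gives the number of emails: 6952 ham and 2399 spam. About 26% of the corpus is therefore spam.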
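The two counts printed above also give a useful reference point for later results. The following sketch uses only those two numbers; the variable names are illustrative. It computes the spam share and the accuracy that a trivial "always predict ham" rule would reach on the full corpus.

```python
# Counts taken from the output of the cell above
n_ham, n_spam = 6952, 2399
total = n_ham + n_spam

# Share of spam in the corpus
spam_share = n_spam / total

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = max(n_ham, n_spam) / total

print(f'Total e-mails: {total}')
print(f'Spam share: {spam_share:.1%}')
print(f'Majority-class baseline accuracy: {majority_baseline:.1%}')
```

Any model trained on this data should be judged against that baseline of roughly 74%, not against 50%.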

In [None]:
# Import the libraries
import os
from email import message_from_file
from glob import glob

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


# Define a function to load emails from a directory
def load_emails(directory):
    emails = []
    for filename in glob(os.path.join(directory, '*')):
        with open(filename, 'r', encoding='latin1') as f:
            msg = message_from_file(f)
            # Note: for multipart messages get_payload() returns a list of
            # sub-messages, so str() then yields the list's repr, not the text.
            emails.append(str(msg.get_payload()))
    return emails

# Load dataset
ham_emails = load_emails(ham_dir)
spam_emails = load_emails(spam_dir)

emails = ham_emails + spam_emails
labels = ['ham'] * len(ham_emails) + ['spam'] * len(spam_emails)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(emails, labels, test_size=0.2, random_state=42, stratify=labels)

# Vectorize the text data
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Train Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_counts, y_train)

# Train Logistic Regression model
log_model = LogisticRegression(max_iter=2000)
log_model.fit(X_train_counts, y_train)


# Evaluate the models
def evaluate_model(model, X_test_counts, y_test):
    y_pred = model.predict(X_test_counts)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy

# Evaluate Naive Bayes
naive_evaluation = evaluate_model(clf, X_test_counts, y_test)
print(f'Naive Bayes Accuracy: {naive_evaluation:.2f}')

# Evaluate Logistic Regression
logistic_evaluation = evaluate_model(log_model, X_test_counts, y_test)
print(f'Logistic Regression Accuracy: {logistic_evaluation:.2f}')

Naive Bayes Accuracy: 0.95
Logistic Regression Accuracy: 0.98


##### Comment:
- The necessary libraries are imported.
- The function load_emails initializes an empty list.
- It loops through each file in the specified directory. The glob function returns a list of the file paths that match the pattern (here, every file in the directory).
- Each file is opened in read mode with 'latin1' encoding. Because latin1 maps every possible byte to a character, decoding never raises an error, even when an email's true encoding is unknown.
- The file content is parsed into an email message object using the email library's message_from_file function.
- The payload (the main content of the email) is converted to a string and appended to the emails list, which the function returns.
- ham_emails holds the result of calling load_emails on the ham_dir directory.
- spam_emails holds the result of calling load_emails on the spam_dir directory.
- The emails list contains all the email texts, ham followed by spam.
- The labels list contains the corresponding label ('ham' or 'spam') for each email in the emails list.
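One caveat about the payload step: for a multipart email, `get_payload()` returns a list of sub-messages and not a string. The sketch below shows one possible way to collect only the plain-text parts with `walk()`. The helper name `extract_text` and the sample message are hypothetical, and using this helper in place of the notebook's approach would change the features and possibly the reported accuracies.

```python
from email import message_from_string


def extract_text(msg):
    """Return the text/plain parts of an email message joined by newlines."""
    parts = []
    for part in msg.walk():  # visits the message itself and all sub-parts
        if part.get_content_type() == 'text/plain':
            payload = part.get_payload()
            if isinstance(payload, str):
                parts.append(payload)
    return '\n'.join(parts)


# Hypothetical multipart message with a plain-text and an HTML alternative
raw = (
    'Subject: demo\n'
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    '\n'
    '--XYZ\n'
    'Content-Type: text/plain\n'
    '\n'
    'Plain text body\n'
    '--XYZ\n'
    'Content-Type: text/html\n'
    '\n'
    '<p>HTML body</p>\n'
    '--XYZ--\n'
)

msg = message_from_string(raw)
print(msg.is_multipart())          # the payload is a list of sub-messages
print(repr(extract_text(msg)))     # only the plain-text part is kept
```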

After that:
- The data is split into training and test sets, with 20% of the data reserved for testing. stratify=labels keeps the ham/spam ratio the same in both sets, and random_state=42 makes the split reproducible.
- An instance of CountVectorizer is created, which converts text data into a matrix of token counts.
- The CountVectorizer is fitted to the training data, which is transformed into a matrix of token counts in the same step.
- The test data is transformed into a matrix of token counts using the already fitted CountVectorizer, so the vocabulary comes from the training data only.
- An instance of the Multinomial Naive Bayes classifier is created.
- The Naive Bayes classifier is trained on the training data and labels.
- A Logistic Regression model is created with max_iter=2000, which gives the solver more iterations to converge than the default of 100.
- The Logistic Regression model is trained on the training data and labels.
- The evaluate_model function measures and returns the accuracy of a given model's predictions on the test data.
- The Naive Bayes classifier is evaluated on the test data, and its accuracy is stored and printed.
- The Logistic Regression model is evaluated on the test data, and its accuracy is stored and printed.
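The difference between fitting and transforming can be seen on a tiny example. The documents below are hypothetical; the point of the sketch is that the vocabulary is fixed by `fit_transform` on the training documents, and words that appear only in test documents are silently dropped by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents
train_docs = ['free cash now', 'meeting at noon', 'free free offer']
test_docs = ['free lunch meeting']

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test = vectorizer.transform(test_docs)        # reuses that vocabulary

# Columns are ordered alphabetically by token
print(sorted(vectorizer.vocabulary_))
print(X_train.toarray())
print(X_test.toarray())  # 'lunch' is not in the vocabulary, so it is ignored
```

Fitting the vectorizer on the training split only, as the notebook does, keeps information from the test set out of the feature definition.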


#### The following email is a test email. You can take this and test your classifier to see if it predicts spam or not.


In [None]:
# Test email
spam_email = """
Subject: Get Rich Quick!

Dear Friend,

Congratulations! You've been selected to participate in an exclusive opportunity to make thousands of dollars from the comfort of your own home. Our revolutionary system guarantees quick and easy cash with minimal effort.

No more struggling to pay bills or worrying about financial security. With our proven method, you can start earning massive amounts of money in no time.

Here's what some of our satisfied customers have to say:
- "I was skeptical at first, but I'm now living my dream life thanks to this incredible system!" - John S.
- "I never thought making money online could be this simple. It's changed my life!" - Sarah L.

Don't miss out on this limited-time offer. Act now to secure your spot and start enjoying a life of financial freedom.

Click the link below to get started:
www.getrichquick.com

Remember, this opportunity is exclusive and won't last long. Take control of your financial future today!

Best regards,
The Get Rich Quick Team
"""


In [None]:
# Vectorize the spam email
X_spam = vectorizer.transform([spam_email])

# Predict with the classifier
Naive_classifier_prediction = clf.predict(X_spam)

# Predict with the logistic regression
logistic_regression_prediction = log_model.predict(X_spam)

# Print the prediction
print(f'By using Naive bayes classifier the email is predicted as: {Naive_classifier_prediction[0]}')
print(f'By using Logistic regression model the email is predicted as: {logistic_regression_prediction[0]}')

By using Naive bayes classifier the email is predicted as: spam
By using Logistic regression model the email is predicted as: spam


##### Comment:
- The test email is converted into numerical features with the already fitted CountVectorizer.
- The features are passed to the trained Naive Bayes classifier and to the trained Logistic Regression model to get a prediction from each.
- Finally, both predictions are printed.
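A predicted label hides how confident each model is. Both `MultinomialNB` and `LogisticRegression` expose `predict_proba`, which can be inspected as in the sketch below. To stay self-contained it trains on four hypothetical sentences and not on the SpamAssassin corpus, so the printed probabilities say nothing about the notebook's models.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, for illustration only
docs = ['win free cash now', 'free cash offer', 'project meeting today', 'lunch meeting notes']
labels = ['spam', 'spam', 'ham', 'ham']

vec = CountVectorizer()
X = vec.fit_transform(docs)

nb = MultinomialNB().fit(X, labels)
lr = LogisticRegression(max_iter=2000).fit(X, labels)

# Class probabilities for a new message
X_new = vec.transform(['win free cash'])
for name, model in [('Naive Bayes', nb), ('Logistic Regression', lr)]:
    probs = dict(zip(model.classes_, model.predict_proba(X_new)[0]))
    print(name, {label: round(float(p), 3) for label, p in probs.items()})
```

In the notebook, the same call would be `clf.predict_proba(X_spam)` and `log_model.predict_proba(X_spam)`.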

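#### Conclusion:
- Logistic regression achieved higher test accuracy (98%) than Naive Bayes (95%) in classifying spam and non-spam emails. For context, always predicting ham would score about 74% on this dataset, since 6952 of the 9351 emails are ham. Both models also labeled the test email as spam, which is consistent with those scores, although a single example is an illustration and not an evaluation.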
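Since the classes are imbalanced, accuracy alone can hide how well spam itself is detected. A possible extension of evaluate_model, sketched below, reports precision, recall, and F1 for the spam class as well. The function name `evaluate_model_detailed` and the ten toy labels are hypothetical; no such numbers were computed in this notebook.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def evaluate_model_detailed(y_true, y_pred, positive='spam'):
    """Return accuracy plus precision, recall, and F1 for the positive class."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, pos_label=positive),
        'recall': recall_score(y_true, y_pred, pos_label=positive),
        'f1': f1_score(y_true, y_pred, pos_label=positive),
    }


# Hypothetical labels: 6 ham and 4 spam, with one false positive and two misses
y_true = ['ham'] * 6 + ['spam'] * 4
y_pred = ['ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'ham', 'ham']

scores = evaluate_model_detailed(y_true, y_pred)
for metric, value in scores.items():
    print(f'{metric}: {value:.2f}')
```

With the notebook's objects, the call would be `evaluate_model_detailed(y_test, clf.predict(X_test_counts))`, and likewise for `log_model`.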