# Naive Bayes

## Introduction to Naive Bayes

Provide a brief overview of Naive Bayes classification and how it works.
Don’t go into too much detail, assume the audience is familiar with math and statistics but is not familiar with Naive Bayes.
Explain the probabilistic nature of Naive Bayes and its Bayes’ theorem foundation.
Clearly define the objectives of what you are trying to do.
Explain what you aim to achieve through Naive Bayes classification.
Describe different variants of Naive Bayes, such as Gaussian, Multinomial, and Bernoulli Naive Bayes, and explain when to use each.

Bayes' theorem is a mathematical result for updating the probability of an outcome in light of collected evidence. More specifically, the theorem calculates "conditional probabilities", which describe the probability of an outcome given that a prior condition or event has occurred. The formula for the theorem is:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
In this formula, to predict the probability of event A given event B, we multiply the probability of event B when event A occurs by the probability of event A. We then divide the product by the total probability of event B.
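As a quick numeric check of the formula, here is a minimal sketch. The function name `posterior` and the probabilities are made up for illustration and do not come from the project data. The denominator P(B) is expanded with the law of total probability.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from Bayes' theorem.

    P(B) is computed as P(B|A)P(A) + P(B|not A)P(not A).
    """
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence


# Hypothetical numbers: A is rare (1%), B is likely when A holds (90%)
# and unlikely otherwise (5%).
p_a_given_b = posterior(prior_a=0.01, p_b_given_a=0.9, p_b_given_not_a=0.05)
print(round(p_a_given_b, 4))
```

Even with strong evidence, the posterior stays well below the likelihood P(B|A) because the prior P(A) is small.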
With that in mind, Naive Bayes is a...
There are many variants under the umbrella of Naive Bayes. These include Complement Naive Bayes, out-of-core Naive Bayes model fitting, Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes. When speaking of Gaussian Naive Bayes... Whereas, Bernoulli Naive Bayes... Likewise, Multinomial Naive Bayes...
(Week 7 lecture)
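To make the three main variants concrete, the sketch below fits each scikit-learn estimator on the kind of feature it is usually paired with: `GaussianNB` on continuous values, `MultinomialNB` on non-negative counts, and `BernoulliNB` on binary indicators. All of the data is synthetic and the names (`X_cont`, `X_counts`, `X_binary`, `y_demo`) are hypothetical. It is an illustration of the API, not the model used in this project.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)   # hypothetical binary labels
shift = y_demo[:, None]                 # column vector so features depend on the class

# Three views of synthetic data, one per variant
X_cont = rng.normal(loc=2.0 * shift, scale=1.0, size=(200, 3))  # continuous measurements
X_counts = rng.poisson(lam=1 + 2 * shift, size=(200, 3))        # non-negative counts
X_binary = (X_counts > 1).astype(int)                            # 0/1 indicators

# Training accuracy of each variant on its matching feature type
scores = {
    "GaussianNB": GaussianNB().fit(X_cont, y_demo).score(X_cont, y_demo),
    "MultinomialNB": MultinomialNB().fit(X_counts, y_demo).score(X_counts, y_demo),
    "BernoulliNB": BernoulliNB().fit(X_binary, y_demo).score(X_binary, y_demo),
}
for name, acc in scores.items():
    print(f"{name}: training accuracy = {acc:.2f}")
```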

## Data Prep

To prepare for Naive Bayes classification, I split my dataset into training and testing sets. I do this by .. Once partitioned, there should be a training set (from which a small validation set can be held out for a quick check of what has been trained) and a test set. The purpose of the partitions is to let the model learn the categories from the training data and then to check how well it classifies data it has not seen.
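One common way to get the same three partitions is scikit-learn's `train_test_split`, applied twice. The sketch below is an alternative to manual index shuffling and is shown on a toy array. `X_demo`, `y_demo`, the 80/20 and 90/10 proportions, and `random_state=42` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with unique values so overlap between partitions is easy to check
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# First split: hold out 20% as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Second split: hold out 10% of the remaining rows as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.1, random_state=42, stratify=y_trainval
)

print(X_train.shape[0], X_val.shape[0], X_test.shape[0])
```

Fixing `random_state` makes the partition reproducible across runs, and `stratify` keeps the class proportions similar in every partition.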

In [2]:
import requests
import json
import re
#import pycountry
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

appledf = pd.read_csv("../websitedata/apple_py.csv")

newAppledf = pd.read_csv("../websitedata/newApple_py.csv")

In [3]:
y = appledf["Peak"]
y=np.array(y)

In [5]:
#training, validation, and testing sets
import random

x = newAppledf.to_numpy()
N = x.shape[0]
l = [*range(N)]     # indices
cut = int(0.8 * N) #80% of the list
random.shuffle(l)   # randomize
train_index = l[:cut] # first 80% of shuffled list
test_index = l[cut:] # last 20% of shuffled list

#little validation set from training set
#valid = int(0.1 * len(train_index))  # 10% of training set will be used as a validation set
#random.shuffle(train_index)
#val_index = train_index[:valid]
#train_index = train_index[valid:]

#x_valid = x[val_index]
#y_valid = y[val_index]

print(train_index[0:10])
print(test_index[0:10])

set_train = set(train_index)
set_test = set(test_index)

[484, 582, 534, 49, 699, 192, 521, 525, 701, 763]
[680, 295, 425, 506, 136, 602, 270, 683, 533, 766]


## Feature Selection

You need to use either R or Python to code Naive Bayes (NB) as a classification model for your data. We will use NB as a wrapper for feature selection. While not required, if you want to take it further, you can try this in both R and Python.

Objective: The primary objective of the Feature Selection component in this project is to identify and choose the most relevant and informative features (variables or attributes) from the dataset, for the given task. Effective feature selection can improve the model’s performance, reduce overfitting, and enhance the interpretability of the results.

Instructions: Generalize and apply the code in the lab assignment and lab demonstration called “Feature selection with text data” to the text and record data you have collected for your project. This code demonstrates feature selection for a text classification task, so map the task onto your project's dataset.

In [9]:
# COMPUTE UPPER AND LOWER LIMIT FOR VARIANCE ACROSS SAMPLES
x_var=np.var(x,axis=0)
print(np.min(x_var))
print(np.max(x_var))

0.5245687484260891
25775958.26785101
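The per-feature variances computed above can drive a simple filter: drop every column whose variance falls below a chosen threshold. The sketch below uses scikit-learn's `VarianceThreshold` on synthetic columns. The data, the name `X_demo`, and the threshold of 1.0 are assumptions for illustration. In practice the threshold would be searched between the minimum and maximum variance of the real features.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(0.0, 0.1, 50),  # low-variance column (variance about 0.01)
    rng.normal(0.0, 5.0, 50),  # high-variance column (variance about 25)
    np.ones(50),               # constant column (variance 0)
])

# Keep only columns whose variance is strictly above the threshold
selector = VarianceThreshold(threshold=1.0)
X_reduced = selector.fit_transform(X_demo)
kept = selector.get_support(indices=True)

print("per-feature variance:", np.var(X_demo, axis=0))
print("kept column indices:", kept)
print("reduced shape:", X_reduced.shape)
```

Because variance depends on the units of each column, features on very different scales (as the wide range printed above suggests) may need rescaling before a single threshold is meaningful.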


## Results- Record Data

Using your optimal feature set from the previous section, fit a final “optimal” NB model for your Record data.
Report and comment on the findings. It is required that you create code, appropriate visualizations, result summaries, confusion matrices, etc.
Describe how the trained model is tested on the testing dataset.
Discuss the evaluation metrics used to assess the performance of the Naive Bayes classifier (e.g., accuracy, precision, recall, F1-score).
Discuss the concepts of overfitting and under-fitting and whether your model is doing it.
Discuss the model’s performance in terms of accuracy and other relevant metrics.
Describe how the project findings will be documented and reported, including the format of reports or presentations.
For example: what output do you generate? What does the output mean? What does it tell you about your data? Does your model do a good job of predicting your test data? Include and discuss relevant visualizations, results, confusion matrices, etc.
Create and include a minimum of three visualizations for each case (text and record classification).
Write a conclusion paragraph interpreting the results. Note, this is not the same as a write-up of technical methodological details.
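The evaluation metrics named above can all be computed with `sklearn.metrics`. The sketch below uses a small set of hand-written labels so every number can be checked by counting. `y_true` and `y_pred` are hypothetical and are not results from this project.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true and predicted labels for ten test points
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)  # TP / (TP + FP) for class 1
rec = recall_score(y_true, y_pred)      # TP / (TP + FN) for class 1
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

print(cm)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Comparing these metrics on the training set and the test set is one way to discuss overfitting: a large gap in favor of the training set suggests the model has fit noise, while low scores on both suggest underfitting.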


## Results- Text Data

Repeat HW-3.2.3 but with your text data

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import time

def train_MNB_model(X,Y, i_print=False):

    if(i_print):
        print(X.shape,Y.shape)

    #SPLIT
    x_train=X[train_index]
    y_train=Y[train_index].flatten()

    x_test=X[test_index]
    y_test=Y[test_index].flatten()


    # INITIALIZE MODEL
    model = MultinomialNB()
    
    # TRAIN MODEL 
    start = time.process_time()
    model.fit(x_train,y_train)
    time_train=time.process_time() - start

    # LABEL PREDICTIONS FOR TRAINING AND TEST SET 
    start = time.process_time()
    yp_train = model.predict(x_train)
    yp_test = model.predict(x_test)
    time_eval=time.process_time() - start

    acc_train= accuracy_score(y_train, yp_train)*100
    acc_test= accuracy_score(y_test, yp_test)*100

    if(i_print):
        print(acc_train, acc_test, time_train,time_eval)

    return (acc_train, acc_test, time_train,time_eval)


#TEST
print(type(x),type(y))
print(x.shape,y.shape)
(acc_train, acc_test,time_train,time_eval)=train_MNB_model(x,y,i_print=True)

In [None]:
def report(y,ypred):
      #ACCURACY COMPUTE 
      print("Accuracy:",accuracy_score(y, ypred)*100)
      print("Number of mislabeled points out of a total %d points = %d"
            % (y.shape[0], (y != ypred).sum()))

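# The model and the data splits are passed in explicitly: train_MNB_model
# keeps them as local variables, so they are not available as globals here.
def print_model_summary(model, x_train, y_train, x_test, y_test):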
      # LABEL PREDICTIONS FOR TRAINING AND TEST SET 
      yp_train = model.predict(x_train)
      yp_test = model.predict(x_test)

      print("ACCURACY CALCULATION\n")

      print("TRAINING SET:")
      report(y_train,yp_train)

      print("\nTEST SET (UNTRAINED DATA):")
      report(y_test,yp_test)

      print("\nCHECK FIRST 20 PREDICTIONS")
      print("TRAINING SET:")
      print(y_train[0:20])
      print(yp_train[0:20])
      print("ERRORS:",yp_train[0:20]-y_train[0:20])

      print("\nTEST SET (UNTRAINED DATA):")
      print(y_test[0:20])
      print(yp_test[0:20])
      print("ERRORS:",yp_test[0:20]-y_test[0:20])

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import os
import shutil

#OUTPUT FOLDER: START FRESH (DELETE OLD ONE IF EXISTS)
output_dir = "output"
if os.path.exists(output_dir) and os.path.isdir(output_dir):
    shutil.rmtree(output_dir)
os.mkdir(output_dir)

newAppledf=pd.read_csv("apple.csv")
print(newAppledf.shape)
print(newAppledf.columns)

In [None]:
##UTILITY FUNCTION TO INITIALIZE RELEVANT ARRAYS
def initialize_arrays():
    global num_features,train_accuracies
    global test_accuracies,train_time,eval_time
    num_features=[]
    train_accuracies=[]
    test_accuracies=[]
    train_time=[]
    eval_time=[]

In [None]:
# INITIALIZE ARRAYS
initialize_arrays()

# DEFINE SEARCH FUNCTION
def partial_grid_search(num_runs, min_index, max_index):
    for i in range(1, num_runs+1):
        # SUBSET FEATURES 
        upper_index=min_index+i*int((max_index-min_index)/num_runs)
        xtmp=x[:,0:upper_index]

        #TRAIN 
        (acc_train,acc_test,time_train,time_eval)=train_MNB_model(xtmp,y,i_print=False)

        if(i%5==0):
            print(i,upper_index,xtmp.shape[1],acc_train,acc_test)
            
        #RECORD 
        num_features.append(xtmp.shape[1])
        train_accuracies.append(acc_train)
        test_accuracies.append(acc_test)
        train_time.append(time_train)
        eval_time.append(time_eval)

# DENSE SEARCH (SMALL NUMBER OF FEATURES (FAST))
partial_grid_search(num_runs=100, min_index=0, max_index=1000)

# SPARSE SEARCH (LARGE NUMBER OF FEATURES (SLOWER))
partial_grid_search(num_runs=20, min_index=1000, max_index=10000)

In [None]:
#UTILITY FUNCTION TO SAVE RESULTS
def save_results(path_root):
    out=np.transpose(np.array([num_features,train_accuracies,test_accuracies,train_time,eval_time])) 
    out=pd.DataFrame(out)
    out.to_csv(path_root+".csv")

In [None]:
#UTILITY FUNCTION TO PLOT RESULTS
def plot_results(path_root):

    #PLOT-1
    plt.plot(num_features,train_accuracies,'-or')
    plt.plot(num_features,test_accuracies,'-ob')
    plt.xlabel('Number of features')
    plt.ylabel('ACCURACY: Training (red) and Test (blue)')
    plt.savefig(path_root+'-1.png')
    plt.show()

    #PLOT-2
    plt.plot(num_features,train_time,'-or')
    plt.plot(num_features,eval_time,'-ob')
    plt.xlabel('Number of features')
    plt.ylabel('Runtime: training time (red) and evaluation time(blue)')
    plt.savefig(path_root+'-2.png')
    plt.show()

    #PLOT-3
    plt.plot(np.array(test_accuracies),train_time,'-or')
    plt.plot(np.array(test_accuracies),eval_time,'-ob')
    plt.xlabel('test_accuracies')
    plt.ylabel('Runtime: training time (red) and evaluation time (blue)')
    plt.savefig(path_root+'-3.png')
    plt.show()

    #PLOT-4
    plt.plot(num_features,np.array(train_accuracies)-np.array(test_accuracies),'-or')
    plt.xlabel('Number of features')
    plt.ylabel('train_accuracies-test_accuracies')
    plt.savefig(path_root+'-4.png')
    plt.show()

In [None]:

save_results(output_dir+"/partial_grid_search")
plot_results(output_dir+"/partial_grid_search")