---
title: "Naïve Bayes"
format:
  html:
    page-layout: full
    code-fold: show
    code-copy: true
    code-tools: true
    code-overflow: wrap
---

# Introduction to Naïve Bayes

In this section, we explore Naïve Bayes (NB).

## Overview & Objective
Naïve Bayes is a probabilistic classification method based on Bayes' theorem. The "naïve" part is the assumption that features are conditionally independent given the class: once the class is known, each feature contributes to the class probability independently of the values of the other features.

## Probabilistic Nature and Bayes' Theorem

Naïve Bayes estimates the posterior probability of a class given a set of features using Bayes' theorem. Let $x$ denote the features and $C_k$ a class. The theorem can be written as:

$$P(C_k|x) = \frac{P(x|C_k) \cdot P(C_k)}{P(x)}$$

where:

- $P(C_k|x)$ is the posterior probability of class $C_k$ given features $x$.

- $P(x|C_k)$ is the likelihood, that is, the probability of features $x$ given class $C_k$.

- $P(C_k)$ is the prior probability of class $C_k$.

- $P(x)$ is the evidence, that is, the marginal probability of features $x$ across all classes.

**Note:** Because $P(x)$ is the same for every class, it is often omitted in practice, and classification reduces to finding the class $C_k$ that maximizes the numerator.
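
As a small numerical illustration of the formula, the sketch below computes the posterior for two classes. The priors and likelihoods are made-up numbers, and the names `prior`, `likelihood`, `C1`, and `C2` are hypothetical; nothing here comes from the project data.

```r
# Hypothetical priors P(C_k) for two classes
prior <- c(C1 = 0.6, C2 = 0.4)

# Hypothetical likelihoods P(x | C_k) for one observed feature vector x
likelihood <- c(C1 = 0.2, C2 = 0.5)

# Numerator of Bayes' theorem: P(x | C_k) * P(C_k)
numerator <- likelihood * prior

# Evidence P(x): the sum of the numerators over all classes
evidence <- sum(numerator)

# Posterior P(C_k | x); the values sum to 1 across classes
posterior <- numerator / evidence
print(posterior)

# Dividing by P(x) rescales every class equally, so the class with the
# largest numerator is also the class with the largest posterior
predicted <- names(which.max(numerator))
print(predicted)
```

Because the evidence only rescales the scores, the predicted class is the same whether it is taken from `numerator` or from `posterior`.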

## Objectives & Aim

The primary objective of Naïve Bayes classification is to determine the most probable class for a given instance based on its features. Because it relies only on simple probability estimates, the algorithm aims to be both accurate and computationally efficient.

By using Naïve Bayes classification, we want to:

- Predict the class of an instance based on its features with a high level of accuracy.

- Offer a simple yet effective model, especially for datasets with many features or large datasets where other algorithms might be computationally intensive.

- Handle both continuous and discrete data through its various variants.

## Variants of Naïve Bayes

**Gaussian Naïve Bayes:**

- Used when features are continuous and follow a Gaussian (normal) distribution.

- It assumes that the feature values within each class are distributed according to a Gaussian distribution.

- Choose Gaussian when dealing with continuous data that is approximately normally distributed.

**Multinomial Naïve Bayes:**

- Appropriate for discrete counts.

- Often used for text classification where the features are the frequencies of words (or n-grams) in documents.

- Choose Multinomial when working with discrete data, especially in applications like text analytics where you are counting word occurrences.

**Bernoulli Naïve Bayes:**

- Works with binary/boolean features.

- Useful in text classification tasks where features represent the presence or absence of a word.

- Choose Bernoulli for binary data, such as "word exists or not" features in text classification.
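
To make the Gaussian variant concrete, here is a minimal from-scratch sketch in base R. It uses R's built-in `iris` data rather than the project data, and the helper name `predict_gnb` is hypothetical. It is an illustration of the idea under those assumptions, not the implementation used elsewhere in this project.

```r
# Gaussian Naive Bayes from scratch on the built-in iris data (illustration only)
data(iris)
set.seed(42)

# Random 75/25 split into training and testing rows
n <- nrow(iris)
train_idx  <- sample(n, size = floor(0.75 * n))
iris_train <- iris[train_idx, ]
iris_test  <- iris[-train_idx, ]

features <- names(iris)[1:4]
classes  <- levels(iris$Species)

# Per-class prior, feature means, and feature standard deviations
prior <- sapply(classes, function(k) mean(iris_train$Species == k))
mu    <- sapply(classes, function(k) colMeans(iris_train[iris_train$Species == k, features]))
sigma <- sapply(classes, function(k) apply(iris_train[iris_train$Species == k, features], 2, sd))

# Score for class k = log prior + sum of per-feature Gaussian log densities.
# Summing the log densities is where the independence assumption enters.
predict_gnb <- function(x) {
  scores <- sapply(classes, function(k) {
    log(prior[[k]]) + sum(dnorm(x, mean = mu[, k], sd = sigma[, k], log = TRUE))
  })
  classes[which.max(scores)]
}

pred <- apply(as.matrix(iris_test[, features]), 1, predict_gnb)
accuracy <- mean(pred == as.character(iris_test$Species))
print(accuracy)
```

Working in log space avoids numerical underflow when many small densities are multiplied together. The Multinomial and Bernoulli variants follow the same pattern with a different per-feature likelihood.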


In conclusion, Naïve Bayes is a versatile probabilistic classification method that offers simplicity and efficiency, especially on high-dimensional datasets. Selecting the variant that matches the nature of your data is important for good results.

In this section, we apply the Naïve Bayes algorithm to both labeled record data and text data. As in our Exploratory Data Analysis (EDA), we use the IHE stock market dataset for the record data. For the text data, we pull articles from NewsAPI, which we worked with in a prior lab session. Detailed descriptions of these datasets can be found under the 'Small Data' and 'Large Data' tabs, respectively.

# Record Data - IHE Stock Market

To prepare the record data for the Naïve Bayes algorithm, we begin by partitioning the dataset into training and testing subsets. The training subset is used to fit the model, and the held-out testing subset gives an empirical basis for validating the model's performance and assessing the reliability of its predictions. The record data are analyzed with R.

```r
library(tidyverse)
library(tidyquant)
library(ggplot2)
library(forecast)
library(astsa)
library(xts)
library(tseries)
library(lubridate)
library(plotly)
library(dplyr)

# Load the cleaned IHE price data
ihe_df <- read.csv("cleaned_data/IHE.csv")
```

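```r
# Parse the date column
ihe_df$Date <- as.Date(ihe_df$Date)

# Keep only the adjusted closing price
ihe.ts <- subset(ihe_df, select = Adj.Close)

# Convert to a time series starting in 2019 (daily stock market prices)
ihe.ts <- ts(ihe.ts, start = c(2019, 1), frequency = 365.25)

# First difference of the series
ihe.diff <- diff(ihe.ts)
```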

```r
# Chronological split: first 799 differenced observations for training,
# the remaining 206 (800 to 1005) for testing
train <- ts(ihe.diff[1:799])
test <- ts(ihe.diff[800:1005])

# Fit a non-seasonal ARIMA model on the training set
fit <- auto.arima(train, seasonal = FALSE)
summary(fit)
```

```
Series: train 
ARIMA(3,0,3) with zero mean 

Coefficients:
          ar1     ar2     ar3     ma1      ma2      ma3
      -0.7611  0.8261  0.8285  0.6540  -0.8722  -0.7183
s.e.   0.0526  0.0384  0.0504  0.0638   0.0333   0.0633

sigma^2 = 3.743:  log likelihood = -1658.24
AIC=3330.48   AICc=3330.62   BIC=3363.26

Training set error measures:
                    ME     RMSE      MAE MPE MAPE      MASE       ACF1
Training set 0.1068385 1.927508 1.443424 Inf  Inf 0.6684412 0.02125072
```
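
The split above uses hard-coded row indices. As a hedged alternative, the sketch below splits a series chronologically by proportion, so it keeps working if the number of observations changes. The helper `chrono_split`, its `train_frac` argument, and the simulated `toy_diff` series are all hypothetical stand-ins, not part of the original analysis.

```r
# Split a series chronologically: the first `train_frac` share of observations
# becomes the training set and the rest becomes the test set.
# `chrono_split` and `train_frac` are hypothetical names for illustration.
chrono_split <- function(x, train_frac = 0.8) {
  n <- length(x)
  n_train <- floor(train_frac * n)
  # Both parts must be non-empty
  stopifnot(n_train >= 1, n_train < n)
  list(
    train = x[seq_len(n_train)],
    test  = x[seq(n_train + 1, n)]
  )
}

# Simulated stand-in for a differenced price series of length 1005
set.seed(1)
toy_diff <- rnorm(1005)

parts <- chrono_split(toy_diff, train_frac = 0.8)
length(parts$train)
length(parts$test)
```

Keeping the observations in time order matters for series like this one: a random shuffle would let the model train on observations that come after the ones it is tested on.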

With the data prepared and partitioned, we are ready to implement the Naïve Bayes algorithm. For the calculations and the feature selection process for record data, refer to the tab labeled "Feature Selection for Record Data". That tab breaks down the methodology and explains how each feature is evaluated and chosen, using time series techniques, based on its statistical relevance and its impact on the model's performance.

# Text Data - News API

Using NewsAPI, we fetch text data relevant to the project. Here, I used "medical cost USA" as the search keyword to build the dataset. As with the record data, the next step is to partition the text into training and testing sets. After splitting, we transform the text into a vectorized format suitable for the model. We then train the Naïve Bayes model and evaluate its performance on the test set. Finally, we use the trained model to make predictions on new, unseen data.
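
The vectorize, train, and predict steps described above can be sketched in base R as follows. The four toy documents, their labels, and the helper names (`count_words`, `predict_mnb`) are hypothetical and stand in for the NewsAPI data. The sketch assumes a simple whitespace tokenizer and a Multinomial Naïve Bayes with add-one (Laplace) smoothing.

```r
# Toy labeled corpus (hypothetical documents and labels, for illustration only)
docs <- c("drug prices rise again",
          "hospital costs rise for patients",
          "team wins the final game",
          "player scores in the game")
labels <- factor(c("health", "health", "sports", "sports"))

# Vectorize: split on whitespace and count each vocabulary word per document
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))
count_words <- function(tk) as.numeric(table(factor(tk, levels = vocab)))
dtm <- t(vapply(tokens, count_words, numeric(length(vocab))))
colnames(dtm) <- vocab

# Train: class log priors and smoothed per-class word log likelihoods
classes   <- levels(labels)
log_prior <- sapply(classes, function(k) log(mean(labels == k)))
log_lik <- sapply(classes, function(k) {
  counts <- colSums(dtm[labels == k, , drop = FALSE])
  log((counts + 1) / (sum(counts) + length(vocab)))  # add-one smoothing
})

# Predict: pick the class with the highest log prior + weighted log likelihood
predict_mnb <- function(text) {
  tk <- strsplit(tolower(text), "\\s+")[[1]]
  x  <- count_words(tk)  # words outside the vocabulary are dropped
  scores <- sapply(classes, function(k) log_prior[[k]] + sum(x * log_lik[, k]))
  classes[which.max(scores)]
}

predict_mnb("patients face rising drug costs")
```

Add-one smoothing keeps a word that never appeared with a class from driving that class's probability to zero. A real pipeline would normally add steps such as stop-word removal, which are left out here to keep the sketch short.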

```r
text_df <- read.csv("cleaned_data/medicine-cleaned.csv")

head(text_df)

summary(text_df)
```

| | title (`<chr>`) | description (`<chr>`) |
| --- | --- | --- |
| 1 | a round-up of the talks from gaconf usa 2023 | gaconf returned to the us this week with a series of accessibility talks hosted at the archer hotel in redmond, washington virtual attendance was available through zoom and youtube talks ranged from those on personal experiences of disability and accessibil… |
| 2 | american can prevent (and control) type 2 diabetes so why arent we doing it? | usa today's health team spoke with scores of experts to understand why, despite solutions, more americans continue to struggle with type 2 diabetes |
| 3 | the steep cost of type 2: when diabetes dragged her down, she chose to fight | the nation's disjointed and confusing health care system leaves many type 2 diabetes patients to navigate it with little support |
| 4 | a hidden system of exploitation underpins us hospitals employment of foreign nurses | this series was produced in partnership with the nonprofit newsroom type investigations, with support from the gertrude blumenthal kasbekar fund, the puffin foundation, and the pulitzer centerread more |
| 5 | the childcare cliff: $122 billion dollar crisis, but whose problem is it? | the childcare crisis in america; the harsh truth of systemic inequality, peculiar economics and lack of support for a industry at the backbone of the economy |
| 6 | ‘mad men meets medicine - #metoo hits the uks national health service | new report reveals one in three female surgeons in the uk suffer sexual harassment women start to organise - at a women in medicine conference |


```
    title           description       
 Length:100         Length:100        
 Class :character   Class :character  
 Mode  :character   Mode  :character  
```

The text dataset contains no numerical variables, so it needs its own approach to splitting into training and testing sets. We aim for a 75-25 split, with 75% of the records used for training and the remaining 25% for testing. Because the dataset is text and is not sequential like the time series data, Python is an apt choice for this task, but the notebook kernel does not mix well with R. For a detailed guide on how to prepare and select features from text data, please refer to the tab labeled "Feature Selection for Text Data", which outlines the steps and practices for preparing text data for machine learning models.
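
For readers who want to stay in R, a 75-25 random split can be sketched in base R as shown below. The data frame `toy_text` is a hypothetical 100-row stand-in with the same two columns as `text_df`, and the seed value is arbitrary. This is only one way the split could be done, not the procedure from the "Feature Selection for Text Data" tab.

```r
# Hypothetical 100-row stand-in with the same columns as text_df
toy_text <- data.frame(
  title = paste("title", 1:100),
  description = paste("description", 1:100),
  stringsAsFactors = FALSE
)

# Fix the seed so the random split is reproducible
set.seed(123)

# Draw 75% of the row indices at random for training
n <- nrow(toy_text)
train_rows <- sample(n, size = floor(0.75 * n))

# The sampled rows form the training set; the remaining rows form the test set
text_train <- toy_text[train_rows, ]
text_test  <- toy_text[-train_rows, ]

dim(text_train)
dim(text_test)
```

A random split is reasonable here because the articles have no time ordering to preserve, unlike the stock price series.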