# Multinomial Naive Bayes Classifier

**Goal:** Classify a document *d* into one of *K* classes $C_k$, given its words.


## 1. Bayes' Theorem

$$
P(C_k \mid d) = \frac{P(d \mid C_k) \cdot P(C_k)}{P(d)}
$$

Since $P(d)$ is the same for every class, it does not affect which class scores highest, so we choose:

$$
\arg\max_{C_k} \left[ P(d \mid C_k) \cdot P(C_k) \right]
$$


## 2. Multinomial Model

- Represent document *d* as a vector of word counts: $d = (n_1, n_2, \dots, n_V)$
- $V$ is the size of the vocabulary.
- $n_i$ is the number of times word $w_i$ appears in document *d*.
- $N_d = \sum_{i=1}^{V} n_i$ is the total number of words in document *d*.

$$
P(d \mid C_k) = \frac{N_d!}{\prod_{i=1}^{V} n_i!} \prod_{i=1}^{V} \left( P(w_i \mid C_k) \right)^{n_i}
$$

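In practice, we drop the multinomial coefficient, which depends only on the document and not on the class, and work with logarithms to avoid numerical underflow. Up to that class-independent constant: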

$$
\log P(d \mid C_k) = \sum_{i=1}^{V} n_i \cdot \log P(w_i \mid C_k)
$$
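As a rough illustration of the log-likelihood above, the sketch below scores a toy document against a single class. The three-word vocabulary and the probabilities are made-up values, not taken from the dataset; `dmultinom()` from base R is used only to check that adding the coefficient back recovers the full multinomial probability.

```r
# Toy setup (assumed values, not from the dataset): V = 3 words, one class.
word_probs <- c(tax = 0.5, budget = 0.3, jobs = 0.2)  # P(w_i | C_k), sums to 1
doc_counts <- c(tax = 2, budget = 1, jobs = 0)        # n_i for document d

# Log-likelihood without the multinomial coefficient:
# sum_i n_i * log P(w_i | C_k)
log_lik <- sum(doc_counts * log(word_probs))

# The coefficient N_d! / prod(n_i!) depends only on the counts, not on the class.
log_coef <- lfactorial(sum(doc_counts)) - sum(lfactorial(doc_counts))

# Full multinomial log-probability, compared with base R's dmultinom().
full_log_prob <- log_coef + log_lik
all.equal(full_log_prob, dmultinom(doc_counts, prob = word_probs, log = TRUE))
```

Because `log_coef` is identical for every class, comparing classes by `log_lik` alone gives the same ranking as comparing them by `full_log_prob`.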


## 3. Priors

$$
P(C_k) = \frac{\text{Number of documents in class } C_k}{\text{Total number of documents}}
$$
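A minimal sketch of the prior estimate, assuming the class labels are available as a plain vector. The `labels` vector here is hypothetical and much shorter than the real training set.

```r
# Hypothetical class labels for six training documents.
labels <- c(1, 1, 2, 3, 3, 3)

# Number of documents per class divided by the total number of documents.
priors <- table(labels) / length(labels)

# Log-priors, which is the form used when scoring documents.
log_priors <- log(priors)
```

The entries of `priors` sum to 1, and a class that never appears in the training labels gets no entry at all, so it can never be predicted.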


## 4. Likelihoods with Smoothing

$$
P(w_i \mid C_k) = \frac{N_{ik} + \alpha}{N_k + \alpha V}
$$

Where:

- $N_{ik}$ = number of times word $w_i$ occurs in documents of class $C_k$
- $N_k$ = total number of words in documents of class $C_k$
- $\alpha$ = smoothing parameter (usually $\alpha = 1$, Laplace smoothing)
- $V$ = size of vocabulary
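The smoothed estimate can be sketched with a small class-by-word count matrix. The matrix `counts` and its class and word names are hypothetical; in the notebook these counts would come from the tokenized training set.

```r
# Hypothetical count matrix N_ik (rows: classes, columns: vocabulary words).
counts <- matrix(c(3, 0, 1,
                   1, 2, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("class_a", "class_b"),
                                 c("tax", "budget", "jobs")))

alpha <- 1               # Laplace smoothing
V <- ncol(counts)        # vocabulary size
N_k <- rowSums(counts)   # total number of words per class

# (N_ik + alpha) / (N_k + alpha * V); N_k is recycled down the rows,
# so each row is divided by its own class total.
likelihoods <- (counts + alpha) / (N_k + alpha * V)
```

Each row of `likelihoods` sums to 1, and the word "budget", which never occurs in `class_a`, still receives probability 1/7 rather than 0. Without smoothing, a single unseen word would drive the whole document score to $-\infty$.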


## 5. Prediction Rule

For a given document *d*, compute for each class:

$$
\log P(C_k) + \sum_{i=1}^{V} n_i \cdot \log P(w_i \mid C_k)
$$

Choose the class with the highest score:

$$
\hat{C} = \arg\max_{C_k} \left[ \log P(C_k) + \sum_{i=1}^{V} n_i \cdot \log P(w_i \mid C_k) \right]
$$
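The prediction rule reduces to one matrix-vector product plus the log-priors. The sketch below assumes already-fitted parameters; all numbers and the names `class_a` and `class_b` are illustrative only.

```r
# Hypothetical fitted parameters for two classes and a three-word vocabulary.
log_priors <- log(c(class_a = 0.5, class_b = 0.5))
log_likelihoods <- log(rbind(
  class_a = c(tax = 4/7, budget = 1/7, jobs = 2/7),
  class_b = c(tax = 2/6, budget = 3/6, jobs = 1/6)
))

# Word counts n_i of the document to classify.
doc_counts <- c(tax = 0, budget = 2, jobs = 1)

# Score per class: log P(C_k) + sum_i n_i * log P(w_i | C_k).
scores <- log_priors + as.vector(log_likelihoods %*% doc_counts)

# The predicted class is the one with the highest score.
predicted_class <- names(which.max(scores))
```

With these numbers the document contains "budget" twice, which is far more likely under `class_b`, so `predicted_class` is `"class_b"`.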


## 6. Summary of Training Steps

1. Compute $P(C_k)$ for each class.
2. For each word $w_i$ and each class $C_k$, compute $P(w_i \mid C_k)$ with smoothing.


## 7. Summary of Prediction Steps

1. For a new document *d*, compute the score for each class.
2. Choose the class with the highest score.



---
# Load the datasets

In [91]:
# Load the required libraries
library(ggplot2)
library(tidytext)
library(dplyr)

In [93]:
# Load the training and testing datasets
train_data = read.csv("data/train.csv")
test_data = read.csv("data/test.csv")

In [95]:
# Split the training dataset into a training set (80%) and a validation set (20%).

set.seed(123)  # For reproducibility

train_indices = sample(1:nrow(train_data), size = 0.8 * nrow(train_data))
train_set = train_data[train_indices, ]
validation_data = train_data[-train_indices, ]
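
The split relies on negative indexing: `train_data[-train_indices, ]` keeps exactly the rows that were not sampled. A small sketch on a made-up data frame (`toy_data` is hypothetical and only mirrors the `Labels` and `Text` columns) shows that the two parts are disjoint and together cover every row.

```r
# Toy data frame standing in for train_data (hypothetical contents).
toy_data <- data.frame(Labels = rep(1:5, times = 4),
                       Text = paste("statement", 1:20))

set.seed(123)  # For reproducibility
toy_indices <- sample(1:nrow(toy_data), size = 0.8 * nrow(toy_data))
toy_train <- toy_data[toy_indices, ]
toy_validation <- toy_data[-toy_indices, ]  # negative index drops the sampled rows

# Sizes of the two parts and the number of rows they share.
n_train <- nrow(toy_train)
n_validation <- nrow(toy_validation)
n_shared <- length(intersect(rownames(toy_train), rownames(toy_validation)))
```

Here `n_train` is 16, `n_validation` is 4, and `n_shared` is 0.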

In [97]:
print("--------- Training Data Summary ---------")
summarise(train_data,
          num_rows = n(), 
          num_cols = ncol(train_data),
          num_missing = sum(is.na(train_data)))

print("--------- Validation Set Summary ---------")
summarise(validation_data,
          num_rows = n(),
          num_cols = ncol(validation_data),
          num_missing = sum(is.na(validation_data)))

print("--------- Testing Data Summary ---------")
summarise(test_data,
          num_rows = n(),
          num_cols = ncol(test_data),
          num_missing = sum(is.na(test_data)))

[1] "--------- Training Data Summary ---------"


num_rows,num_cols,num_missing
10240,3,0


[1] "--------- Validation Set Summary ---------"


num_rows,num_cols,num_missing
2048,3,0


[1] "--------- Testing Data Summary ---------"


num_rows,num_cols,num_missing
1267,2,0


In [109]:
train_set = train_set %>% mutate(doc_id = row_number())
validation_data = validation_data %>% mutate(doc_id = row_number())
test_data = test_data %>% mutate(doc_id = row_number())

In [111]:
train_set$Text = as.character(train_set$Text)
validation_data$Text = as.character(validation_data$Text)
test_data$Text = as.character(test_data$Text)

In [113]:
train_set

Labels,Text,Text_Tag,doc_id
4,Says Wisconsin Gov. Scott Walkers budget calls for raising property taxes by nearly $500 billion.,"state-budget,taxes",1
5,Californias credit rating is the worst in the country.,"economy,message-machine,state-budget",2
5,Michael Thurmond authored major legislation that has provided more than $250 million in tax relief to Georgias senior citizens and working families.,taxes,3
3,"Palin ""fired Wasilla's Police Chief because he 'intimidated' her.""",crime,4
5,[L]ess than one-tenth of Atlantas transportation needs are covered in a referendum to levy a 1-cent sales tax.,transportation,5
1,Im the only candidate for governor whos rolled out any policies so far.,elections,6
1,"Said U.S. Rep. Jim Marshall, D-Macon, sent nearly $2 billion overseas to build wind turbines and create jobs, mostly in China","environment,message-machine,stimulus",7
2,"In Wisconsin, Gov. Scott Walker has created a manufacturing-led jobs recovery. 30,000 new jobs were created this year, with 15,000 created in the struggling manufacturing sector.","jobs,message-machine-2012",8
1,Today there are more men and women out of work in America than there are people working in Canada.,"economy,workers",9
5,Wayne Powell has a stated position of having no objection to taking `In God We Trust off of U.S. currency.,religion,10


# Tokenization
We tokenize each dataset so that every row holds a single word of a statement (one token per row, with the original columns kept alongside it). All letters are converted to lowercase and punctuation is removed. Digits, stopwords such as "and", "of", or "the", empty strings, and single-character words are removed as well.

In [115]:
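# Split the Text column into one lowercase word per row, then drop noise tokens.
# Expects a character column named Text; all other columns are kept.
tokenize_and_clean_text <- function(data_frame) {
  data_frame %>%
    # unnest_tokens() lowercases and strips most punctuation by default
    unnest_tokens(word, Text, drop = FALSE) %>%
    # Remove punctuation left inside tokens (e.g. apostrophes)
    mutate(word = gsub("[[:punct:]]", "", word)) %>%
    # Remove digits
    mutate(word = gsub("[[:digit:]]", "", word)) %>%
    # Remove English stopwords (tidytext's stop_words data frame)
    anti_join(stop_words, by = "word") %>%
    # Filter out tokens that became empty after cleaning
    filter(word != "") %>%
    # Filter out single-character words
    filter(nchar(word) > 1)
}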

In [117]:
tokenized_train_set = tokenize_and_clean_text(train_set)
tokenized_test_data = tokenize_and_clean_text(test_data)
validation_data = tokenize_and_clean_text(validation_data)

In [118]:
tokenized_train_set

Labels,Text,Text_Tag,doc_id,word
4,Says Wisconsin Gov. Scott Walkers budget calls for raising property taxes by nearly $500 billion.,"state-budget,taxes",1,wisconsin
4,Says Wisconsin Gov. Scott Walkers budget calls for raising property taxes by nearly $500 billion.,"state-budget,taxes",1,gov
4,Says Wisconsin Gov. Scott Walkers budget calls for raising property taxes by nearly $500 billion.,"state-budget,taxes",1,scott
4,Says Wisconsin Gov. Scott Walkers budget calls for raising property taxes by nearly $500 billion.,"state-budget,taxes",1,walkers
4,Says Wisconsin Gov. Scott Walkers budget calls for raising property taxes by nearly $500 billion.,"state-budget,taxes",1,budget
4,Says Wisconsin Gov. Scott Walkers budget calls for raising property taxes by nearly $500 billion.,"state-budget,taxes",1,calls
4,Says Wisconsin Gov. Scott Walkers budget calls for raising property taxes by nearly $500 billion.,"state-budget,taxes",1,raising
4,Says Wisconsin Gov. Scott Walkers budget calls for raising property taxes by nearly $500 billion.,"state-budget,taxes",1,property
4,Says Wisconsin Gov. Scott Walkers budget calls for raising property taxes by nearly $500 billion.,"state-budget,taxes",1,taxes
4,Says Wisconsin Gov. Scott Walkers budget calls for raising property taxes by nearly $500 billion.,"state-budget,taxes",1,billion
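
The one-token-per-row layout above maps directly onto the quantities in Section 4. The sketch below is one possible way to obtain $N_{ik}$, $N_k$, and $V$ with `dplyr`; `toy_tokens` is a hand-made stand-in with the same `Labels`, `doc_id`, and `word` columns, and the names `word_counts` and `class_totals` are assumptions rather than objects defined in this notebook.

```r
library(dplyr)

# Hypothetical tokenized rows with the same columns as the output above.
toy_tokens <- data.frame(
  Labels = c(4, 4, 4, 5, 5),
  doc_id = c(1, 1, 1, 2, 2),
  word   = c("budget", "taxes", "taxes", "credit", "taxes")
)

# N_ik: occurrences of each word within each class.
word_counts <- toy_tokens %>% count(Labels, word, name = "N_ik")

# N_k: total number of tokens in each class.
class_totals <- toy_tokens %>% count(Labels, name = "N_k")

# V: vocabulary size (number of distinct words).
V <- n_distinct(toy_tokens$word)
```

Word and class pairs that never co-occur (for example "credit" in class 4) do not appear in `word_counts`; they are treated as $N_{ik} = 0$ and are covered by the smoothing term $\alpha$.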
