Welcome to Lab 1! We are going to get started right away and have you prove to yourself that you can learn a new ML algorithm on your own, using only the concepts you've learned in Lecture 1.

**Naive Bayes**
Naive Bayes (NB) is a simple yet often effective probabilistic classification technique. NB algorithms are primarily known for their ability to classify documents. Well-known examples include identifying spam emails and predicting the category or sentiment of tweets, web pages, etc.

Helpful learning resources:
* https://scikit-learn.org/stable/modules/naive_bayes.html
* https://en.wikipedia.org/wiki/Naive_Bayes_classifier
* https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
* Working with text data: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
* Let me know if you find anything else!

Concepts you'll need from Lecture 1:
* Probability distributions
* Bayes' rule
* Maximum likelihood estimation

After learning the NB algorithm, we will use it to classify tweets about the COVID pandemic based on their sentiment (e.g. extremely positive, positive, negative, extremely negative). The dataset available for the lab was downloaded from https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification.

**How does the Naive Bayes algorithm work?**

We'll split the class in half. Half of you will read sections 1.9-1.9.2 in https://scikit-learn.org/stable/modules/naive_bayes.html. The other half will read the following sections of the Wikipedia entry https://en.wikipedia.org/wiki/Naive_Bayes_classifier: "Probabilistic model" and "Multinomial naive Bayes". Within each group, try to use the materials given to you to answer the following questions. The materials differ in the information they provide, so the exact answer may not be available in your material. If so, do your best to figure out the answer.

Each group will be asked to send a different representative for each question to the whiteboard to present the group's answer and explain the group's reasoning. 

For simplicity, it can be helpful to start with thinking about the simplest data setting: we want to classify data into two classes where we have access to $p$ features. That is, let $X = (X_1,\cdots, X_p)$ be the features and $Y$ be the outcome.  $Y$ can only take on values of 0 or 1.

Q1: What is the main assumption made by the NB algorithm and what is the probability model? Why does it make this assumption? Do you think it is reasonable? In addition to answering the question in words, show the exact equations.

Predictors are conditionally independent given the class, so each feature contributes to the outcome through its own separate term, i.e. $X_1 \perp X_2 \perp \cdots \perp X_p \mid Y$.

Consequently, NB supposes that the joint distribution $\Pr(X_1, \cdots, X_p \mid Y)$ factorizes into $\prod_{i=1}^p \Pr(X_i \mid Y)$.

*Why does NB make this assumption?*
* By assuming conditional independence, the model becomes simpler and easier to implement, since we only need to estimate the individual probability of each word for each class.
* When one has only limited data, NB might actually do quite well.
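To make the factorization and the "simpler model" argument concrete, here is a minimal sketch for three binary features. The probability values are made up for illustration, and `nb_likelihood` is a hypothetical helper, not part of any library.

```python
import math

# Hypothetical values of Pr(X_i = 1 | Y = y) for three binary features.
p_feature_given_class = {
    0: [0.2, 0.5, 0.1],
    1: [0.7, 0.4, 0.6],
}

def nb_likelihood(x, y):
    """Pr(X = x | Y = y) as a product of per-feature terms (conditional independence)."""
    probs = p_feature_given_class[y]
    return math.prod(q if xi == 1 else 1 - q for xi, q in zip(x, probs))

x = (1, 0, 1)
lik1 = nb_likelihood(x, 1)  # 0.7 * (1 - 0.4) * 0.6
lik0 = nb_likelihood(x, 0)  # 0.2 * (1 - 0.5) * 0.1

# Parameters needed per class for binary features:
num_features = 3
params_full_joint = 2 ** num_features - 1  # one probability per feature combination, minus one
params_nb = num_features                   # one probability per feature

print(lik1, lik0, params_full_joint, params_nb)
```

With only three features the saving is small (7 versus 3 parameters per class), but the full joint grows exponentially in the number of features while the NB model grows linearly.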

Q2: How does the NB algorithm calculate the probability that $Y=1$ for a given observation $X$? In addition to answering the question in words, show the exact equations.

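NB estimates the class prior $\Pr(Y)$ and the conditional probability $\Pr(X_i \mid Y)$ for each feature (word) $X_i$ in the data. Then, using Bayes' theorem together with the conditional independence assumption, it computes the probability that $Y=1$ given $X$. That is, $\Pr(Y = 1 \mid X) = \frac{\Pr(X \mid Y=1) \Pr(Y=1)}{\Pr(X)}$, where $\Pr(X \mid Y=y) = \prod_{i=1}^p \Pr(X_i \mid Y=y)$ and $\Pr(X) = \sum_{y \in \{0,1\}} \Pr(X \mid Y=y) \Pr(Y=y)$.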
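The Bayes' rule computation above can be sketched numerically as follows. All numbers are hypothetical, and the calculation is done in log space, which is a common practical choice (an assumption here, not something the lab requires) to avoid underflow when many small probabilities are multiplied.

```python
import math

# Hypothetical inputs for a single observation x and two classes (0 and 1).
prior = {0: 0.6, 1: 0.4}  # Pr(Y = y)
feature_probs = {
    0: [0.8, 0.5, 0.1],   # Pr(X_i = x_i | Y = 0) for the observed values x_i
    1: [0.7, 0.6, 0.6],   # Pr(X_i = x_i | Y = 1)
}

# log[ Pr(X = x | Y = y) * Pr(Y = y) ] using the NB factorization.
log_joint = {
    y: math.log(prior[y]) + sum(math.log(q) for q in feature_probs[y])
    for y in prior
}

# log Pr(X = x): sum over both classes, computed stably (log-sum-exp).
m = max(log_joint.values())
log_evidence = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))

# Posterior Pr(Y = y | X = x) for each class.
posterior = {y: math.exp(log_joint[y] - log_evidence) for y in prior}
print(posterior)
```

Here the unnormalized scores are 0.024 for class 0 and 0.1008 for class 1, so the posterior probability of $Y=1$ is 0.1008 / 0.1248, roughly 0.81.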

Q3: Explain what the Multinomial NB algorithm does and what types of data/problem settings it is suitable for. In addition to answering the question in words, show the exact equations.

Multinomial NB is good for modeling histograms (e.g. the prevalence of different words in a document). That is, we model how often each event occurs within each class. The likelihood of observing a histogram $X$ is given by
$\Pr(X \mid Y) \propto \prod_{i=1}^d \Pr(X = i \mid Y)^{n_i}$ where there are $d$ events ($i=1$ to $d$), $n_i$ is the number of times event $i$ is observed, and $\Pr(X = i \mid Y)$ is the probability that event $i$ occurs for class $Y$.
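As a small illustration of the histogram likelihood, the sketch below scores one word-count histogram under two classes. The event probabilities and class names are hypothetical, and `log_likelihood` is an illustrative helper. The multinomial coefficient is omitted because it is the same for every class, which is why the formula is stated up to proportionality.

```python
import math

# Hypothetical per-class event probabilities for d = 3 events (words).
event_probs = {
    "Negative": [0.3, 0.6, 0.1],
    "Positive": [0.4, 0.1, 0.5],
}

# Histogram of one document: event 1 seen once, event 2 twice, event 3 never.
n = [1, 2, 0]

def log_likelihood(n, probs):
    """Sum of n_i * log(prob_i), i.e. the log of prod prob_i ** n_i."""
    return sum(n_i * math.log(p_i) for n_i, p_i in zip(n, probs) if n_i > 0)

scores = {y: log_likelihood(n, probs) for y, probs in event_probs.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The document is more likely under the "Negative" class (0.3 × 0.6² = 0.108) than under "Positive" (0.4 × 0.1² = 0.004), because the word it repeats is much more probable there.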

Q4: We are planning to classify tweets using NB. To do this, how would you translate a tweet into a feature vector $X$? Skimming the rest of the variants of the NB algorithm for the website references given, which of these options would you try to use? Draw out the steps of how you plan to process the data and train a NB model. (Written response, include equations when helpful)

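Use a bag-of-words representation, where $X$ is either the number of times each word in a fixed vocabulary appears in the tweet or simply an indicator of whether each word appears. Word counts pair naturally with Multinomial NB, while presence/absence indicators pair with Bernoulli NB.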

**Applying Naive Bayes to classify COVID-related tweets**

Now let's apply this procedure to our dataset. The basic steps are to load the data, preprocess it, fit the model, and then evaluate it. You get to fill in the details.

Q1: Load the data (Code Required)

In [None]:
#import libraries
import numpy as np
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import accuracy_score

In [3]:
#load data

train = pd.read_csv("Corona_NLP_train.csv",encoding="latin-1", index_col=None)
test = pd.read_csv("Corona_NLP_test.csv",encoding="latin-1", index_col=None)

In [4]:
train

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative
...,...,...,...,...,...,...
41152,44951,89903,"Wellington City, New Zealand",14-04-2020,Airline pilots offering to stock supermarket s...,Neutral
41153,44952,89904,,14-04-2020,Response to complaint not provided citing COVI...,Extremely Negative
41154,44953,89905,,14-04-2020,You know it’s getting tough when @KameronWild...,Positive
41155,44954,89906,,14-04-2020,Is it wrong that the smell of hand sanitizer i...,Neutral


Q2: What is the prevalence of each sentiment label? Group together the labels so there are three classes (Positive, Neutral, Negative) rather than five. (Code Required)

In [5]:
print("test labels", test.Sentiment.value_counts())
print("train labels", train.Sentiment.value_counts())

test labels Sentiment
Negative              1041
Positive               947
Neutral                619
Extremely Positive     599
Extremely Negative     592
Name: count, dtype: int64
train labels Sentiment
Positive              11422
Negative               9917
Neutral                7713
Extremely Positive     6624
Extremely Negative     5481
Name: count, dtype: int64


In [6]:
train['Sentiment'] = train['Sentiment'].str.replace('Extremely Positive', 'Positive')
train['Sentiment'] = train['Sentiment'].str.replace('Extremely Negative', 'Negative')

test['Sentiment'] = test['Sentiment'].str.replace('Extremely Positive', 'Positive')
test['Sentiment'] = test['Sentiment'].str.replace('Extremely Negative', 'Negative')

In [7]:
print("test labels", test.Sentiment.value_counts())
print("train labels", train.Sentiment.value_counts())

test labels Sentiment
Negative    1633
Positive    1546
Neutral      619
Name: count, dtype: int64
train labels Sentiment
Positive    18046
Negative    15398
Neutral      7713
Name: count, dtype: int64


Q3: Preprocess the training and test data and create feature vectors. You are provided the following function to clean tweets, which removes special characters and tags. (Code required)
```
def cleaning(df):
    def remove_web_urls(text): return re.sub(r'https?://\S+', ' ', text)
    df['OriginalTweet'] = df['OriginalTweet'].apply(remove_web_urls)

    def remove_tags(text): return re.sub(r'@\w*', ' ' , text)
    df['OriginalTweet'] = df['OriginalTweet'].apply(remove_tags)

    def remove_hashtags(text): return re.sub(r'#\w*', ' ' , text)
    df['OriginalTweet'] = df['OriginalTweet'].apply(remove_hashtags)

    def remove_apostrophe(text): return re.sub(r"'s\b", "", text)
    df['OriginalTweet'] = df['OriginalTweet'].apply(remove_apostrophe)

    def remove_special_chars(text): return re.sub(r"[^a-zA-Z0-9\s]", ' ', text)
    df['OriginalTweet'] = df['OriginalTweet'].apply(remove_special_chars)

    def remove_number(text): return re.sub(r'[\d]', ' ', text)
    df['OriginalTweet'] = df['OriginalTweet'].apply(remove_number)

    df['OriginalTweet'] = df['OriginalTweet'].str.lower()
    return df

```
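To see what these substitutions do, the sketch below applies the same regular expressions, in the same order, to a single made-up string rather than a DataFrame column. `clean_text` and the sample tweet are hypothetical and exist only for this illustration.

```python
import re

# Same patterns as the cleaning function above, applied in the same order.
steps = [
    ("urls", r'https?://\S+', ' '),
    ("mentions", r'@\w*', ' '),
    ("hashtags", r'#\w*', ' '),
    ("possessive 's", r"'s\b", ''),
    ("special characters", r"[^a-zA-Z0-9\s]", ' '),
    ("digits", r'[\d]', ' '),
]

def clean_text(text):
    """Apply each substitution in turn, then lowercase the result."""
    for _name, pattern, replacement in steps:
        text = re.sub(pattern, replacement, text)
    return text.lower()

sample = "Stock's low @user1 #COVID19 see https://t.co/abc 50% off!"
tokens = clean_text(sample).split()
print(tokens)  # ['stock', 'low', 'see', 'off']
```

Note that the hashtag pattern removes the whole hashtag word, not just the `#` symbol, so any sentiment carried by hashtags is discarded before vectorization.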


In [10]:
def cleaning(df):
    def remove_web_urls(text): return re.sub(r'https?://\S+', ' ', text)
    df['OriginalTweet'] = df['OriginalTweet'].apply(remove_web_urls)

    def remove_tags(text): return re.sub(r'@\w*', ' ' , text)
    df['OriginalTweet'] = df['OriginalTweet'].apply(remove_tags)

    def remove_hashtags(text): return re.sub(r'#\w*', ' ' , text)
    df['OriginalTweet'] = df['OriginalTweet'].apply(remove_hashtags)

    def remove_apostrophe(text): return re.sub(r"'s\b", "", text)
    df['OriginalTweet'] = df['OriginalTweet'].apply(remove_apostrophe)

    def remove_special_chars(text): return re.sub(r"[^a-zA-Z0-9\s]", ' ', text)
    df['OriginalTweet'] = df['OriginalTweet'].apply(remove_special_chars)

    def remove_number(text): return re.sub(r'[\d]', ' ', text)
    df['OriginalTweet'] = df['OriginalTweet'].apply(remove_number)

    df['OriginalTweet'] = df['OriginalTweet'].str.lower()
    return df

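# keep only the text and label columns; .copy() gives cleaning() its own frame
# to modify, which avoids pandas' SettingWithCopyWarning
train = train[['OriginalTweet', 'Sentiment']].copy()
test = test[['OriginalTweet', 'Sentiment']].copy()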

cleaned_train = cleaning(train)
cleaned_test = cleaning(test)

Q4: Fit the Multinomial Naive Bayes algorithm (Code required)

In [11]:
#make training and test sets
X_train = cleaned_train['OriginalTweet']
y_train = cleaned_train['Sentiment']

In [12]:
#transform text into vectors
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)

In [13]:
#fit Naive Bayes
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)

Q5: Evaluate the accuracy of the Multinomial Naive Bayes algorithm. (Code required)

In [14]:
from sklearn import metrics
X_test_vectorized = vectorizer.transform(cleaned_test.OriginalTweet)
y_pred = model.predict(X_test_vectorized)

accuracy = accuracy_score(cleaned_test.Sentiment, y_pred)
print("Accuracy:", accuracy)

print(metrics.classification_report(cleaned_test.Sentiment,y_pred))

Accuracy: 0.6827277514481306
              precision    recall  f1-score   support

    Negative       0.71      0.77      0.74      1633
     Neutral       0.62      0.20      0.30       619
    Positive       0.66      0.78      0.72      1546

    accuracy                           0.68      3798
   macro avg       0.67      0.58      0.59      3798
weighted avg       0.68      0.68      0.66      3798

