**Name: Xuecheng Liu**

**CSE 5522 Hands-on #2: Naive Bayes**

The goals of today's exercise are to familiarize you with:


*   Naive Bayes
*   Binary Classification
*   Data exploration

**END OF CLASS GOAL:** Submit a link to your notebook (Share > Get Sharable Link) in Carmen so I can see how far you got.  This should be submitted in a group assignment page on Carmen.

**Part 0: Initial setup**

**0.0:** If none of your team members are familiar with Python, this will be difficult to accomplish. You may want to split up and join different groups.

**0.1:** Go to the Carmen Page for CSE 5522, and find the Group signup tab.  Choose a group under HandsOn2-xx (where xx=1 to 20), and sign your group members up.  This will allow you to submit a group assignment at the end.

**0.2:** Make a copy of this page in your Google Drive so that you can edit it. Edit the filename to include your group number.  Share the copied page with your teammates. At the end of class, share a URL and submit (so I can see how far you got).  This will count as the participation grade for all members.

**0.3:** While not completely necessary for this assignment, you may want to familiarize yourself with the following packages: [numpy](https://numpy.org), [scikit-learn](https://scikit-learn.org), [pandas](https://pandas.pydata.org), [matplotlib](https://matplotlib.org).

---
---


**Part 1: A Simple Bayes Net: Naive Bayes**

In class, we discussed how the conditional independences of a joint probability distribution get encoded by a Bayesian Network. One of the simplest forms of BN is the Naive Bayes model, which encodes a set of simple conditional independences: 

- Given a single cause, all of the effects are independent of each other.
- Mathematically: 
$P(\mathit{cause}, \mathit{effect}_1, \ldots, \mathit{effect}_n) = P(\mathit{cause}) \prod_i P(\mathit{effect}_i \mid \mathit{cause})$ 

NB can be used for classification by assuming that the cause is the true (unknown) label and that it (probabilistically) generates all of the features (effects), while the features are independent given the cause. 
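To make the classification rule concrete, here is a minimal sketch with made-up numbers. The three-word vocabulary, the priors, and the word probabilities are purely illustrative assumptions, not values estimated from the tweet data. It scores each label by multiplying the prior by the per-word likelihoods and then normalizes the two scores.

```python
import numpy as np

# Hypothetical numbers for illustration only (not estimated from the tweet data).
prior = {"+": 0.5, "-": 0.5}
# P(word present | label) for a toy vocabulary: ["cold", "sunny", "love"]
p_word = {
    "+": np.array([0.1, 0.6, 0.5]),
    "-": np.array([0.7, 0.1, 0.1]),
}

# A toy tweet containing only "cold": presence vector over the vocabulary.
x = np.array([1, 0, 0])

def joint(label):
    """P(label) * prod_i P(word_i = x_i | label) under the NB assumption."""
    p = p_word[label]
    likelihoods = np.where(x == 1, p, 1 - p)  # use 1 - p for absent words
    return prior[label] * np.prod(likelihoods)

scores = {label: joint(label) for label in prior}
total = sum(scores.values())
posterior = {label: s / total for label, s in scores.items()}
print(posterior)
```

With these toy numbers the negative label gets most of the posterior mass, since "cold" is far more likely under the negative class.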

For example, in sentiment analysis the *cause* is the author's sentiment (say, an unknown label from the set {sad, happy, afraid, surprised, disgusted, angry}) and the *effects* are the words that they write. The simplifying assumption of NB says that, given the latent sentiment, the words of the sentence are independent. We know this assumption is not true because grammar and word use impose some dependency structure between the words in a sentence, but we choose to ignore that in this model.

Although simple, NB has shown good performance in many classification tasks and has become a standard baseline for classification. 

Today we want to perform Twitter sentiment analysis using NB. The goal is to figure out if a tweet has a positive or negative sentiment about the weather.  

**1.0:** Set up the environment (you can click on the play button below to import the appropriate modules).

In [0]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

**1.1** Read the data from GitHub into a pandas dataframe.

In [0]:
TweetUrl='https://github.com/aasiaeet/cse5522data/raw/master/db3_final_clean.csv'
tweet_dataframe=pd.read_csv(TweetUrl)

**1.2** Print out the top of the dataframe to make sure that the data loaded correctly.  It should be a data table with three columns (weight, tweet, label), and 3697 rows.

In [366]:
display(tweet_dataframe.shape)
tweet_dataframe.head()

(3697, 3)

Unnamed: 0,weight,tweet,label
0,1.0,it is very cold out want it to be warmer,-1
1,0.7698,dammmmmmm its pretty cold this morning burr lol,-1
2,0.6146,why does halsey have to be so far away think m...,-1
3,0.9356,dammit stop being so cold so can work out,-1
4,1.0,its too freakin cold,-1


**1.3.** In the next step, we should build our feature matrix by converting the string of words to a vector of numeric values. 

First we need to assign a unique id to each word and create the feature matrix with correct size:

In [0]:
# wordDict maps words to id
# X is the document-word matrix holding the presence/absence of words in each tweet
wordDict = {}
idCounter = 0
for i in range(tweet_dataframe.shape[0]):
  allWords = tweet_dataframe.iloc[i,1].split(" ")
  for word in allWords:
    if word not in wordDict:
      wordDict[word] = idCounter
      idCounter += 1
X = np.zeros((tweet_dataframe.shape[0], idCounter),dtype='float')

Checking head of the dictionary:

In [368]:
dict(list(wordDict.items())[0:10])

{'': 9,
 'be': 7,
 'cold': 3,
 'is': 1,
 'it': 0,
 'out': 4,
 'to': 6,
 'very': 2,
 'want': 5,
 'warmer': 8}
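The empty-string key `''` in the dictionary is worth a note: `str.split(" ")` splits on every individual space, so consecutive or trailing spaces produce empty tokens. A small illustration follows (the sample string is made up, not taken from the dataset):

```python
# Made-up string with a double space and a trailing space.
text = "its too  freakin cold "

tokens_single_space = text.split(" ")  # splits on each individual space
tokens_whitespace = text.split()       # splits on runs of whitespace, drops empties

print(tokens_single_space)  # ['its', 'too', '', 'freakin', 'cold', '']
print(tokens_whitespace)    # ['its', 'too', 'freakin', 'cold']
```

Whether to keep the empty token is a design choice. This notebook keeps it, so `''` is treated as one more feature in the matrix.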

**1.4:** The simplest way of coding a tweet to numbers is to mark the occurrence of a word, and forget about its frequency in the document (tweet). This works well with tweets as there are not many repetitive words in a single tweet. So let's fill the document-word matrix:

In [0]:
for i in range(tweet_dataframe.shape[0]):
  allWords = tweet_dataframe.iloc[i,1].split(" ")
  for word in allWords:
    X[i, wordDict[word]]  = 1

Now we check that the number of words is correct:

In [370]:
np.sum(X[0:5, ], axis = 1)

array([10.,  9., 17.,  9.,  4.])
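As a sanity check on the encoding, the following sketch builds the same kind of presence/absence matrix for two made-up tweets. The names `tweets`, `vocab`, and `X_toy` are hypothetical and used only here. Note that a repeated word still contributes a single 1, so each row sum counts distinct tokens, not total tokens.

```python
import numpy as np

tweets = ["cold cold morning", "warm sunny morning"]  # made-up examples

# Assign ids in order of first appearance.
vocab = {}
for tweet in tweets:
    for word in tweet.split():
        vocab.setdefault(word, len(vocab))

# Binary document-word matrix: 1 if the word occurs at least once in the tweet.
X_toy = np.zeros((len(tweets), len(vocab)))
for row, tweet in enumerate(tweets):
    for word in tweet.split():
        X_toy[row, vocab[word]] = 1

print(vocab)
print(X_toy)
print(X_toy.sum(axis=1))  # number of distinct words per tweet
```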

Finally, we extract the labels from the dataframe:

In [371]:
y = np.array(tweet_dataframe.iloc[:,2])
y[0:5]

array([-1, -1, -1, -1, -1])

Let's compute the total number of positive and negative tweets:

In [372]:
numNeg = np.sum(y<0)
numPos = np.sum(y>=0) #len(y) - numNeg
probNeg = numNeg / (numNeg + numPos)
probPos = 1 - probNeg
display(numNeg, numPos, probNeg, probPos)

1650

2047

0.4463078171490398

0.5536921828509602

So samples 0 through 1649 are negative and samples 1650 through the end are positive.

**1.5: Train/Test Split** Now we do the 80/20 split: learn the word probabilities using the 80% part and test the NB performance on the 20% part. 

In [373]:
from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2, random_state = 0)
display(xTrain.shape, xTest.shape, yTrain.shape, yTest.shape)
#Note: random_state=0 fixes the random seed so we get the same split every run. Don't use this below

(2957, 5989)

(740, 5989)

(2957,)

(740,)

**1.6: Computing Probabilities by Counting** Now the real work begins. Write the code that, from the train feature matrix xTrain, computes the needed word probabilities, i.e., $P(word|label)$ where label is + or - and word is any of the words saved in the `wordDict`:

In [374]:
# compute three distributions (four variables):
def compute_distros(x,y):
  # probWordGivenPositive: P(word|Sentiment = +ive)
  probWordGivenPositive=np.sum(x[y>=0,:],axis=0) #Sum each word (column) to count how many times each word shows up (in positive examples)
  probWordGivenPositive=probWordGivenPositive/np.sum(y>=0) #Divide by total number of (positive) examples to give distribution

  # probWordGivenNegative: P(word|Sentiment = -ive)
  probWordGivenNegative=np.sum(x[y<0,:],axis=0)
  probWordGivenNegative=probWordGivenNegative/np.sum(y<0)

  # priorPositive: P(Sentiment = +ive)
  priorPositive = np.sum(y>=0)/y.shape[0] #Number of positive examples vs. all examples
  # priorNegative: P(Sentiment = -ive)
  priorNegative = 1 - priorPositive
  #  (note these last two form one distribution)

  return probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative

# compute distributions here
probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative = compute_distros(xTrain,yTrain)

# checking the results
display(probWordGivenPositive[0:5])
display(probWordGivenNegative[0:5])
display(priorPositive, priorNegative)

array([0.1185006 , 0.20737606, 0.01088271, 0.01451028, 0.10217654])

array([0.14504988, 0.19493477, 0.00537222, 0.09669992, 0.13967767])

0.5593506932702063

0.44064930672979374

Note that you only needed to compute $P(word = 1| +)$ and $P(word = 1| -)$; the probability of a word being absent from a tweet is just 1 minus the corresponding probability. 

However, as we see in 1.7, for convenience, we will also want to compute $\log P(word = 1 | +)$, $\log P(word = 0 | +)$, $\log P(word = 1 | -)$ and $\log P(word = 0 | -)$.  Also we should compute the log priors.  Let's do so now.
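One reason logs are convenient is numerical: multiplying thousands of small probabilities underflows in floating point, while adding their logs does not. A minimal sketch, assuming 1000 hypothetical per-word probabilities of 0.01 each:

```python
import numpy as np

# 1000 hypothetical per-word probabilities of 0.01 each.
probs = np.full(1000, 0.01)

direct_product = np.prod(probs)   # 1e-2000 is below float64 range, so this is 0.0
log_sum = np.sum(np.log(probs))   # stays finite: 1000 * log(0.01)

print(direct_product)
print(log_sum)
```

With a vocabulary of several thousand words, comparing log-scores is therefore safer than comparing raw products.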


In [375]:
# compute the following:
# logProbWordPresentGivenPositive
# logProbWordAbsentGivenPositive
# logProbWordPresentGivenNegative
# logProbWordAbsentGivenNegative
# logPriorPositive
# logPriorNegative
def compute_logdistros(distros, min_prob):
  if True:
    #Assume missing words are simply very rare
    #So, assign minimum probability to very small elements (e.g. 0 elements)
    distros=np.where(distros>=min_prob,distros,min_prob)
    #Also need to consider minimum probability for "not" distribution
    distros=np.where(distros<=(1-min_prob),distros,1-min_prob)

    return np.log(distros), np.log(1-distros)
  else:
    #Ignore missing words (assume they have P==1, i.e. force log 0 to 0)
    return np.log(np.where(distros>0,distros,1)), np.log(np.where(distros<1,1-distros,1))

min_prob = 1/yTrain.shape[0] #Assume very rare words only appeared once
logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive = compute_logdistros(probWordGivenPositive,min_prob)
logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative = compute_logdistros(probWordGivenNegative,min_prob)
logPriorPositive, logPriorNegative = compute_logdistros(priorPositive,min_prob)

# Did this work, or did you get an error?  (Read below.)
display(logProbWordPresentGivenPositive[0:5])
display(logProbWordAbsentGivenPositive[0:5])
display(logProbWordPresentGivenNegative[0:5])
display(logProbWordAbsentGivenNegative[0:5])
display(logPriorPositive, logPriorNegative)

array([-2.13283722, -1.57322143, -4.52058012, -4.23289805, -2.28105316])

array([-0.12613096, -0.23240639, -0.01094236, -0.01461658, -0.10778182])

array([-1.93067756, -1.63509031, -5.22651443, -2.33614267, -1.96841789])

array([-0.15671216, -0.21683197, -0.0053867 , -0.10170047, -0.15044815])

-0.5809786442688406

-0.819505942727632

You likely received an error when you tried to take $\log(0)$ at some point.  Can your group think of a way to avoid taking $\log(0)$?  Check in with your instructor/TA to see if what you're thinking will work.  Implement that change in your code above.
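The code above avoids $\log(0)$ by clipping probabilities to `min_prob`. Another common option, shown here only as a sketch and not what this notebook uses, is add-alpha (Laplace) smoothing of the counts. The function name `smoothed_word_probs` and the default `alpha=1.0` are assumptions made for this illustration.

```python
import numpy as np

def smoothed_word_probs(x, alpha=1.0):
    """Estimate P(word = 1 | class) from a binary matrix x whose rows all
    belong to one class, using add-alpha smoothing (alpha is an assumption)."""
    n_docs = x.shape[0]
    counts = x.sum(axis=0)
    # Each word has two outcomes (present/absent), hence 2 * alpha in the denominator.
    return (counts + alpha) / (n_docs + 2 * alpha)

# Toy class with 3 tweets and 3 words: word 0 always appears, word 2 never does.
x_toy = np.array([[1, 0, 0],
                  [1, 1, 0],
                  [1, 0, 0]])
p = smoothed_word_probs(x_toy)
print(p)                          # strictly between 0 and 1
print(np.log(p), np.log(1 - p))   # both finite
```

Because the smoothed estimates never reach exactly 0 or 1, both the "present" and "absent" logs are finite without a separate clipping step.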

**Lab 2 Part 1**

We first do the prediction without incorporating absent words.


In [0]:

# predict the label using Naive Bayes (present words only)
def predict(xTest,logProbWordPresentGivenPositive,logProbWordPresentGivenNegative,logPriorPositive, logPriorNegative):
  pred = []
  for i in range(xTest.shape[0]):
    # log-score of the positive class: log prior + sum of log P(word|+) over present words
    probPositive = xTest[i,:] * logProbWordPresentGivenPositive
    probPositive = np.sum(probPositive)
    probPositive = probPositive + logPriorPositive

    # log-score of the negative class, computed the same way
    probNegative = xTest[i,:] * logProbWordPresentGivenNegative
    probNegative = np.sum(probNegative)
    probNegative = probNegative + logPriorNegative
    # predict +1 when the positive score is at least as large (ties go to +1)
    if probPositive - probNegative >= 0:
      pred.append(1)
    else:
      pred.append(-1)
  return pred

pred = predict(xTest,logProbWordPresentGivenPositive,logProbWordPresentGivenNegative,logPriorPositive, logPriorNegative)
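The per-row loop can also be expressed as a matrix-vector product. The following is a sketch of that idea under the same decision rule. `predict_vectorized` is a hypothetical name, the toy log-probabilities are made up, and unlike the loop version it returns a NumPy array instead of a list.

```python
import numpy as np

def predict_vectorized(x, log_p_pos, log_p_neg, log_prior_pos, log_prior_neg):
    """Same rule as the loop: for each row, sum log P(word present | label)
    over the words present, add the log prior, and compare the two scores."""
    score_pos = x @ log_p_pos + log_prior_pos
    score_neg = x @ log_p_neg + log_prior_neg
    return np.where(score_pos >= score_neg, 1, -1)

# Toy check with made-up probabilities for a 3-word vocabulary.
x_toy = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
log_p_pos = np.log(np.array([0.6, 0.1, 0.5]))
log_p_neg = np.log(np.array([0.1, 0.7, 0.2]))
pred_toy = predict_vectorized(x_toy, log_p_pos, log_p_neg, np.log(0.5), np.log(0.5))
print(pred_toy)
```

The first toy row contains words that are likelier under the positive class and the second a word that is likelier under the negative class, so the predictions are +1 and -1 respectively.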


Now we take absent words into consideration.

In [0]:

def predict_Absent(xTest,logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,logPriorPositive, logPriorNegative):
  pred = []
  for i in range(xTest.shape[0]):
    # log-score of the positive class from present words and the log prior
    probPositive = xTest[i,:] * logProbWordPresentGivenPositive
    probPositive = np.sum(probPositive)
    probPositive = probPositive + logPriorPositive
    # add log P(word absent|+) for every word that does not occur in the tweet
    probPositive += np.sum((1-xTest[i,:]) * logProbWordAbsentGivenPositive)

    # log-score of the negative class, computed the same way
    probNegative = xTest[i,:] * logProbWordPresentGivenNegative
    probNegative = np.sum(probNegative)
    probNegative = probNegative + logPriorNegative
    probNegative += np.sum((1-xTest[i,:]) * logProbWordAbsentGivenNegative)

    # predict +1 when the positive score is at least as large (ties go to +1)
    if probPositive >= probNegative:
      pred.append(1)
    else:
      pred.append(-1)
  return pred

pred_absent = predict_Absent(xTest,logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,logPriorPositive, logPriorNegative)
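Written out, the score that `predict_Absent` computes for each label can be stated as follows, where $x_j \in \{0, 1\}$ marks whether word $j$ occurs in the tweet. The symbols $x_j$, $s$, and $\ell$ are introduced here for illustration only.

$$
s(\ell) = \log P(\ell) + \sum_j \Big[ x_j \log P(word_j = 1 \mid \ell) + (1 - x_j) \log P(word_j = 0 \mid \ell) \Big], \qquad \ell \in \{+, -\}
$$

The tweet is labeled positive when $s(+) \ge s(-)$. Dropping the $(1 - x_j)$ term recovers the present-words-only rule used by `predict`.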


**Calculate the accuracy**

In [0]:
def accuracy(pred,yTest):
  correct = 0
  for i in range(yTest.shape[0]):
    if pred[i] == yTest[i]:
      correct+=1
  return correct/yTest.shape[0]
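The same quantity can be computed without an explicit loop. This is a small sketch, with `accuracy_vectorized` as a hypothetical name and made-up inputs:

```python
import numpy as np

def accuracy_vectorized(pred, y_true):
    """Fraction of predictions that exactly match the labels."""
    return np.mean(np.asarray(pred) == np.asarray(y_true))

# Made-up predictions and labels: 3 of the 4 entries match.
acc_toy = accuracy_vectorized([1, -1, 1, 1], np.array([1, -1, -1, 1]))
print(acc_toy)  # 0.75
```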

In [379]:
acc = accuracy(pred,yTest)
acc_absent = accuracy(pred_absent,yTest)
print("The accuracy for not using absent word to predict is: ", acc)
print("The accuracy for using absent word to predict is: ", acc_absent)

The accuracy for not using absent word to predict is:  0.8283783783783784
The accuracy for using absent word to predict is:  0.8297297297297297


**As you can see from the result above, the accuracy is slightly higher when absent words are taken into account than when they are ignored.**

**Part 2**
The labels each came with a weight.  Devise a method for weighting samples, and use that method to recalculate the probability distributions.  Report the effect of weighting samples on the test set. (40 points + 2 bonus points)

In [0]:
def accuracy_weighted(pred,yTest):
  correct = 0
  for i in range(yTest.shape[0]):
    if pred[i] * yTest[i] > 0 :
      correct+=1
  return correct/yTest.shape[0]

**Incorporating the weights into the data**

In [0]:
weight = np.array(tweet_dataframe.iloc[:,0])

X = X * weight.reshape(-1,1)
y = y * weight

In [387]:
xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2, shuffle= True)
min_prob = 1/np.sum(np.abs(y)) 
probWordGivenPositive, probWordGivenNegative, priorPositive, priorNegative = compute_distros(xTrain,yTrain)
logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive = compute_logdistros(probWordGivenPositive,min_prob)
logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative = compute_logdistros(probWordGivenNegative,min_prob)
logPriorPositive, logPriorNegative = compute_logdistros(priorPositive,min_prob)
pred = predict(xTest,logProbWordPresentGivenPositive,logProbWordPresentGivenNegative,logPriorPositive, logPriorNegative)
pred_absent = predict_Absent(xTest,logProbWordPresentGivenPositive, logProbWordAbsentGivenPositive,logProbWordPresentGivenNegative, logProbWordAbsentGivenNegative,logPriorPositive, logPriorNegative)
  
c = accuracy_weighted(pred,yTest)
c_absent = accuracy_weighted(pred_absent,yTest)
print("The accuracy of prediction without absent word(weighted) is: ",c)
print("The accuracy of prediction using absent word(weighted) is: ",c_absent)
 

The mean of the accuracy of prediction without absent word(weighted) is:  0.8094594594594594
The mean of the accuracy of prediction using absent word(weighted) is:  0.822972972972973


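**After incorporating the weights into the data, the accuracy of the model decreased for both conditions.** Note that this run uses a different random train/test split than Part 1 (no fixed `random_state`), so part of the difference may come from the split itself.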
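A different way to use the weights, sketched here as an assumption and not the approach taken above, is to keep the feature matrix binary and the labels in {-1, +1}, and let each tweet contribute its weight (instead of 1) to the counts. The function name `weighted_distros` and the toy data are hypothetical.

```python
import numpy as np

def weighted_distros(x, y, w):
    """Weighted estimates of P(word = 1 | label) and the class priors.
    x: binary document-word matrix, y: labels in {-1, +1}, w: per-tweet weights."""
    pos, neg = y >= 0, y < 0
    w_pos, w_neg = w[pos], w[neg]
    # Each tweet adds its weight to the count of every word it contains.
    p_word_pos = (w_pos @ x[pos]) / w_pos.sum()
    p_word_neg = (w_neg @ x[neg]) / w_neg.sum()
    prior_pos = w_pos.sum() / w.sum()
    return p_word_pos, p_word_neg, prior_pos, 1 - prior_pos

# Toy data (made up): 4 tweets, 2 words.
x_toy = np.array([[1, 0], [1, 1], [0, 1], [1, 1]], dtype=float)
y_toy = np.array([1, 1, -1, -1])
w_toy = np.array([1.0, 0.5, 1.0, 0.25])
p_pos, p_neg, prior_pos, prior_neg = weighted_distros(x_toy, y_toy, w_toy)
print(p_pos, p_neg, prior_pos, prior_neg)
```

Because the labels stay at -1 and +1 under this scheme, the exact-match `accuracy` function would still apply to the test set. Whether it improves on the unweighted model would have to be checked on the same train/test split.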