# HOMEWORK 1

Welcome to your first homework! This will be a mix of conceptual, mathematical, and computational questions. The point is to help you think through what you've learned so far, and (occasionally) to introduce some new skills in a pedagogical fashion. Remember that everything will be graded on a simple, holistic check-minus, check, check-plus system. Feel free to discuss the homework with each other, but remember: you are responsible for writing up your own answers individually. In particular, do not borrow or copy code **from each other**. You are free, however, to ferret out useful code by reading help pages, Stack Overflow, etc. That's part of being an effective computational social scientist! Please make sure you comment your code so both you and I can understand what you are doing.

You can answer everything here in the notebook (using LaTeX for the math questions) or you can write out the non-coding questions in another format--though we strongly encourage you to try writing everything in the notebook!

### Question 1
How would you define machine learning? And why is machine learning important to social scientists? Please answer briefly (4 or 5 sentences).

### Question 2
What is the difference between supervised and unsupervised learning? Give an example of each, and a potential social science application.

### Question 3
Solve Chapter 6, Exercise 1 in *Machine Learning for Predictive Data Analytics*. **Hint**: While some of these can be solved by enumerating cases, you would be well-advised to consult your notes on the binomial distribution. For part c, you may find it much easier to calculate the probability that *fewer* than four people will get heads; from this, it's easy to compute the probability that *at least* four people will get heads.
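
The complement trick from the hint can be sketched in a few lines of Python. This is only an illustration: the values of `n`, `p`, and `k` below are made up and are not the ones in the exercise, and the helper names `binom_pmf` and `prob_at_least` are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_at_least(k, n, p):
    """P(X >= k), computed as 1 - P(X < k) instead of summing the upper tail."""
    return 1.0 - sum(binom_pmf(j, n, p) for j in range(k))

# Hypothetical values, NOT the ones from the exercise
n, p, k = 10, 0.5, 3
print(prob_at_least(k, n, p))
```

Summing the few cases below the threshold is usually less work than summing the many cases at or above it, which is why the hint suggests going through the complement.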

### Question 4
Solve Chapter 6, Exercise 2 in *Machine Learning for Predictive Data Analytics*.

### Question 5
Suppose that $10\%$ of all extra-terrestrial civilizations are evil: when they find out about another planet with intelligent life, they will immediately try to conquer it. This means that $\Pr(evil)= 0.1$ and $\Pr(good)= 0.9$. You can get a hint of whether a civilization is "good" or "evil" by looking at the signals it sends. But this is unreliable. If a civilization is good, its signals will be friendly $80\%$ of the time: $\Pr(friendly\,|\,good)= 0.80$. However, some evil civilizations are SO tricksy that they will send friendly messages too: $\Pr(friendly\,|\,evil)= 0.10$. Suppose that we receive a friendly message from an extra-terrestrial civilization.  What is the probability that this civilization is evil? Using ideas from Bayesian decision theory, explain whether or not we should respond to this message.

### Question 6
What is Naive about the Naive Bayes Classifier? Please answer briefly (3 or 4 sentences); your answer should probably include some math.

### Question 7
What is the advantage of the conditional independence assumption in Naive Bayes? Please answer briefly (3 or 4 sentences); your answer should probably include some math.

### Question 8 
Suppose that we have $S_1 \dots S_n$ mutually exclusive and exhaustive situations; a prior over those situations $P(S_i)$; and some data $\mathcal{D}$ that can be used to compute the likelihood functions $P(\mathcal{D}\,|\,S_i)$. We are trying to guess what situation actually obtains, given the data. Prove that for data $\mathcal{D}$ consisting of discrete features, the maximum likelihood estimate (MLE) of the situation is the same as the maximum a posteriori (MAP) estimate, if we assume a uniform prior $P(S_i) = \frac{1}{n}$, where $n$ counts the number of distinct situations.

### Question 9
Let $X$ and $Y$ be discrete random variables with joint distribution $P(X,Y)$. Prove that if $X$ and $Y$ are independent, i.e., $P(X = x_i,Y = y_j) = P(X = x_i)P(Y = y_j)$, then the covariance $\mathrm{Cov}(X,Y) = 0$. **Hint**: you may want to use the formula $\mathrm{Cov}(X,Y) = \mathrm{E}[XY] - \mathrm{E}[X]\mathrm{E}[Y]$.

### Question 10 
Solve Chapter 6, Exercise 6 in *Machine Learning for Predictive Data Analytics*. **Hint**: when the question is asking for the "target level," it is simply asking which of the two classes a Naive Bayes Classifier will predict for the given document using MAP estimation.

Now we are going to turn to a detailed programming exercise on Naive Bayes for text classification. We will be using data from a recent Kaggle competition; to obtain the data, go to www.kaggle.com/c/word2vec-nlp-tutorial/data. You will need to set up a Kaggle account; make sure you download the labeled training data. The tutorials for this particular competition are interesting, but don't get too distracted; in some cases, they'll be doing a lot more than we will be (particularly data cleaning). 

First let's load in the data. We will use pandas, which is a great data science library for Python.

In [None]:
import pandas
import os
import numpy
print(os.getcwd()) #Make sure that your current working directory is where you've stored the data
kagdata = pandas.read_csv("labeledTrainData.tsv",header=0,delimiter="\t",quoting=3) 
"""
Pandas has a built-in function, read_csv, to read in the data. The remaining parameters tell pandas that
(1) there is a first line of headers, which will become the column names, (2) tabs are used to delimit (mark off) 
individual data entries, and (3) quote characters should be left alone as ordinary text (quoting=3).
""";
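
To see what those `read_csv` arguments do without needing the Kaggle file, here is a minimal sketch on a made-up, in-memory stand-in (the string `fake_tsv` and its contents are invented for illustration). It also shows that the bare `3` passed as `quoting` is the value of the named constant `csv.QUOTE_NONE` from Python's standard library.

```python
import csv
import io
import pandas

# A tiny made-up stand-in for labeledTrainData.tsv: tab-separated, with a header row
fake_tsv = 'id\tsentiment\treview\n"a1"\t1\t"Loved it"\n"b2"\t0\t"He said "no" and left"\n'

toy = pandas.read_csv(io.StringIO(fake_tsv), header=0, delimiter="\t",
                      quoting=csv.QUOTE_NONE)

print(csv.QUOTE_NONE)        # the named constant behind the bare 3
print(toy.columns.values)    # column names come from the header line
print(toy["review"][1])      # quote characters are kept as ordinary text
```

With this setting a stray quotation mark inside a review cannot confuse the parser, because quotes are never treated as field boundaries.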

In [None]:
print (kagdata.shape) #Tells us the shape of the pandas DataFrame, in rows and columns
print (kagdata.index) #Tells us the row indices. Note that these are more like dictionary keys, meaning 
#that if we cut the DataFrame (as we are about to) they stick around
print (kagdata.columns.values) #Tells us what the three columns are

In [None]:
print (kagdata['review'][0]) #The first review. You'll see why Kaggle has you remove some of the HTML stuff...
print (kagdata['sentiment'][0]) #The first sentiment; it will be 0 (negative) or 1 (positive)
"""Note how we are addressing things here. We first get the column (with 'review') and then pull the appropriate row
using the index. Indices usually start with 0 in Python.""";

In [None]:
print (kagdata.review[0]) #We can also grab columns in this way, and then pass it the index

Now that we've ingested the data and looked at it a little bit, we need to split it into training and testing sets. In general, this process is called **cross-validation**. The simplest version, which we will do here, is the **holdout** method. It simply *holds out* a subset of the data as *testing* data. You train the model on the remainder--the *training* data.

In [None]:
split = 0.7 #70% training, 30% testing
kag_train = kagdata[:int(split*len(kagdata))] #This pulls the first 70% of the data into a training set. 
kag_test = kagdata[int(split*len(kagdata)):] #This pulls the remaining 30% into a testing set. 
#See if you can figure out what we did above (called "slicing")
print (kag_train.shape) #You can check that this has worked correctly
print (kag_test.shape) #Ditto. 
#We have the right number of rows; each row corresponds to a unit of data: tuple of (id, review, sentiment)
#And right number of columns: one column for id, one column for review, and one column for sentiment.
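
If slicing is new to you, the same holdout split can be tried on a toy DataFrame first. This is just a sketch: the frame `toy` and its ten made-up rows are assumptions for illustration, not part of the Kaggle data.

```python
import pandas

# Ten made-up rows standing in for the reviews
toy = pandas.DataFrame({"review": list("abcdefghij"), "sentiment": [0, 1] * 5})

split = 0.7
cut = int(split * len(toy))   # position where the training rows end
toy_train = toy[:cut]         # rows at positions 0 .. cut-1
toy_test = toy[cut:]          # rows at positions cut .. end

print(toy_train.shape, toy_test.shape)
print(list(toy_test.index))          # row labels carried over from the original frame
print(toy_test["review"].iloc[0])    # first row of the slice, selected by POSITION
```

Because the row labels stick around after slicing, `.iloc` (position-based) is the safe way to ask for "the first row" of a slice.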

### Question 11
In the following code block, please print the row indices and column labels for the training set and the testing set. Then in a new code block, please print the last review in the training set and the first review in the testing set.

In [None]:
#Print the row and column labels for the training set and the testing set.

In [None]:
#Print the last review in the training set and the first review in the testing set.

Note that just splitting the raw data like this isn't the smartest thing in the world. What if our data hadn't been randomized appropriately? Here's a somewhat smarter way to do it. You will end up using both versions of the holdout in the subsequent questions. 

In [None]:
N = kagdata.shape[0] #First see how many rows of data we have.
N_train = int(0.7*N) #Now figure out how many rows would be 70% of this
print(N) #So you can see...
print(N_train)

In [None]:
r = numpy.random.RandomState(0) #This allows us to call numpy's random functions in a replicable way
#How do you think you might CHANGE the behavior of the random number generator? Guess and see!
idx = r.permutation(N) #This creates an index set... just a randomly ordered list of the numbers from 0 to N-1 (here, 0 to 24999)
idx[0:10] #Look at the first 10 elements

In [None]:
idx_train = idx[:N_train] #Pull 70% of the randomly permuted indices into a list of training indices
idx_test = idx[N_train:] #Pull the remaining 30% into a list of testing indices 
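
A quick sanity check on a permutation-based split is that every row lands in exactly one of the two sets. Here is a minimal sketch on a toy size; the names ending in `_toy` and the size of 10 are made up for illustration.

```python
import numpy

N_toy = 10                                # hypothetical number of rows
r_toy = numpy.random.RandomState(0)
idx_toy = r_toy.permutation(N_toy)        # each of 0 .. N_toy-1 exactly once, shuffled

n_train_toy = int(0.7 * N_toy)
idx_train_toy = idx_toy[:n_train_toy]
idx_test_toy = idx_toy[n_train_toy:]

print(sorted(idx_toy.tolist()))                   # nothing missing, nothing repeated
print(set(idx_train_toy) & set(idx_test_toy))     # empty set: no row is in both
```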

In [None]:
kag_train2_features = kagdata['review'][idx_train] 
#This grabs the column of reviews and all the rows with indices in our training set
kag_train2_targets = kagdata['sentiment'][idx_train]
#This grabs the column of sentiment (what we want to predict) and then all the indices in our training set

In [None]:
print(0 in idx_train) #See whether the first item is in fact in the training indices

In [None]:
#Since it is, we can check that everything worked
print(kag_train2_features[0])
print(kag_train2_targets[0])
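
Note that after selecting rows with a shuffled index, asking for `[0]` looks up the row *labeled* 0, not the row that happens to come first. A small sketch on a made-up Series (the names `s` and `picked` are illustrative) makes the difference explicit with `.loc` and `.iloc`:

```python
import pandas

s = pandas.Series(["a", "b", "c", "d"])   # row labels are 0, 1, 2, 3
picked = s[[2, 0, 3]]                      # select rows by label, in this order

print(list(picked.index))   # the original labels, in the shuffled order
print(picked.loc[0])        # row LABELED 0      -> "a"
print(picked.iloc[0])       # row in POSITION 0  -> "c"
```

This is why checking that 0 is among the training indices comes first: if it were not, a label lookup for 0 would raise a `KeyError`.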

In [None]:
kag_test2_features = kagdata['review'][idx_test] #And of course the syntax is identical for the test data 
kag_test2_targets = kagdata['sentiment'][idx_test]

An even smarter way to do this is to use k-fold cross-validation; after all, we might get lucky (or unlucky) in our train/test split, so splitting the data just once and evaluating the model on that basis isn't necessarily the best way to see how it will perform on totally new data. We will discuss this later in class.
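
As a preview, here is a rough sketch of how k-fold splitting works, using scikit-learn's `KFold` on ten made-up data points. The choice of 5 folds and the variable names are assumptions for illustration only; nothing here is needed for the questions below.

```python
import numpy
from sklearn.model_selection import KFold

toy_X = numpy.arange(10)     # ten made-up data points
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how often each point ends up in a test fold
test_counts = numpy.zeros(len(toy_X), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf.split(toy_X)):
    test_counts[test_idx] += 1
    print(fold, len(train_idx), len(test_idx))

print(test_counts)   # every point is used for testing exactly once
```

Each fold plays the role of the held-out test set once, so the model is evaluated several times on different splits rather than on a single lucky (or unlucky) one.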

Ok, now time to get down to brass tacks. Let's actually do some Naive Bayes on this!

### Question 12
Please comment the code below to describe what is going on.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score 
vectorizer_binary = CountVectorizer(binary=True) 
kag_train_features_bin = vectorizer_binary.fit_transform(kag_train.review)
kag_test_features_bin = vectorizer_binary.transform(kag_test.review)
modelBNB = BernoulliNB() 
modelBNB.fit(kag_train_features_bin,kag_train.sentiment) 
labelsBNB1 = modelBNB.predict(kag_test_features_bin)
accuracy_score(kag_test.sentiment,labelsBNB1)

### Question 13
In the next code block, please rewrite the code in Question 12 to use the ALTERNATE split into training and testing data. Do you see a big difference in performance?

In [None]:
#Hint: you can follow the code above pretty closely...

### Question 14
In the next code block, please use the original train/test split (kag_train and kag_test). Now instead of a Bernoulli Naive Bayes model, try to create and fit a Multinomial Naive Bayes. Before you run it, ask yourself: do you think performance will increase or decrease? Make sure you comment the code appropriately.

In [None]:
#Hint: you can follow the code above pretty closely...

### Question 15
Great job! Penultimate question, so you are almost there. In class, we also learned about transforming your features from counts to tf-idf (term frequency-inverse document frequency) features. In the next code block, please rewrite the code from Question 14 to use tf-idf features. You will still use Multinomial Naive Bayes.

In [None]:
#Hint: you can Google sklearn.feature_extraction.text to figure out how to get a tf-idf vectorizer.

In 4-5 sentences, describe the difference between Bernoulli Naive Bayes, Multinomial Naive Bayes with count features, and Multinomial Naive Bayes with tf-idf features.

### Question 16
Ultimate question! Let's see if we can FOOL our classifier. I will write the code below to use the Bernoulli Naive Bayes Classifier, but you should feel free to use whichever one you like. 

In [None]:
good_review = "This is the best movie since Arrival." #We have to make it a string
good_review_vec = vectorizer_binary.transform([good_review]) #Now we transform our string into a binary feature vector. 
modelBNB.predict(good_review_vec)[0] #Now we predict the class (sentiment).
#It shows a 1, which is good, because this is a good review!
#Note that [0] is to return the sentiment only, instead of an array (which would still contain only the sentiment score)

In [None]:
bad_review = "It stinks! This is the worst movie I have seen since Transformers: Revenge of the Fallen."
bad_review_vec = vectorizer_binary.transform([bad_review])
modelBNB.predict(bad_review_vec)[0]  #It shows a 0, because this is a bad review!

In [None]:
#In this code block, see if you can come up with a good review that is incorrectly classified as a bad review.

In [None]:
#In this code block, see if you can come up with a bad review that is incorrectly classified as a good review