<a href="https://colab.research.google.com/github/ajfisch/deeplearning_bootcamp_2020/blob/master/Introduction_to_Python_and_Sklearn_answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Task: Beer Review Sentiment Analysis

Given long and detailed beer reviews, we want to predict whether the reviewer rated the beer as bad, okay, or good.

# Step 1: Downloading and Preprocessing the Data

To start off, we're going to load the data from some pickle files and do some simple preprocessing. We'll lowercase everything and strip surrounding whitespace; non-alphanumeric characters such as punctuation are discarded later, when the text is tokenized in Step 2.

e.g. "Best Beer ever!!!" -> "best beer ever"

To sanity check the data, we'll look at a few examples.
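The cleaning described above can also be done explicitly in one place. A minimal sketch, assuming a hypothetical helper `clean_text` that is not part of the original notebook (the cell below only lowercases and strips whitespace):

```python
import re

def clean_text(text):
    """Lowercase, drop non-alphanumeric characters, and collapse whitespace."""
    text = text.lower()
    # Keep only ASCII letters, digits, and whitespace (an assumption: accented
    # characters would also be dropped by this pattern).
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(text.split())

print(clean_text("Best Beer ever!!!"))  # best beer ever
```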

In [0]:
import pickle

!apt-get install wget
!wget https://raw.githubusercontent.com/yala/MLCodeLab/master/lab1/data/beer/overall_train.p
!wget https://raw.githubusercontent.com/yala/MLCodeLab/master/lab1/data/beer/overall_dev.p
!wget https://raw.githubusercontent.com/yala/MLCodeLab/master/lab1/data/beer/overall_test.p

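def load_split(path):
    # Pickle files are binary, so open them in "rb" mode and close them promptly.
    with open(path, "rb") as f:
        return pickle.load(f)

train_set = load_split("overall_train.p")
dev_set = load_split("overall_dev.p")
test_set = load_split("overall_test.p")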

# Convert each sample dict into a (text, label) tuple, lowercasing the review text
def preprocess_data(data):
    for indx, sample in enumerate(data):
        text, label = sample['text'], sample['y']
        text = text.lower().strip()
        data[indx] = text, label
    return data

# Preprocess all the data splits.
train_set = preprocess_data(train_set)
dev_set = preprocess_data(dev_set)
test_set = preprocess_data(test_set)

# Separate components into X and Y lists.
trainText = [t[0] for t in train_set]
trainY = [t[1] for t in train_set]

devText = [t[0] for t in dev_set]
devY = [t[1] for t in dev_set]

testText = [t[0] for t in test_set]
testY = [t[1] for t in test_set]

# Sanity check:
print("EXAMPLE INPUTS")
print(trainText[0])
print(trainY[0])

## Exercise 1:

Make sure you have run the first cell to download and read the train, dev, and test data files.

1.   How many examples each are in train, dev, and test?
2.   What are the different fields of each example?
3.   What is the input? What is the output? How many types of outputs are there?

In [0]:
# Question 1:
print("Number of examples in train: %d" % len(trainText))
print("Number of examples in dev: %d" % len(devText))
print("Number of examples in test: %d" % len(testText))

# Question 2:
# The different fields are the "text" (*Text[i]) and the "label" (*Y[i]).
# Within each text, you can see that it talks about "taste", "aroma", and "smell".

# Question 3:
# Input is the text, i.e., *Text[i].
# Output is the label (negative, neutral, positive), i.e., *Y[i].
# Number of outputs is 3:
print("Number of outputs: %d" % len(set(trainY)))

# Step 2: Feature Engineering 

How do we represent a review? We're going to use a simple bag-of-words representation: each review becomes a vector, and the whole set of reviews becomes a large matrix.

For example, suppose our vocabulary is ```[best, ever, beer, cat, good, dog]```.
The bag-of-words representation for
```"best beer ever"``` is ```[1, 1, 1, 0, 0, 0]```,
where 1 indicates that the vocabulary word appeared and 0 that it did not. More generally, each entry counts how many times the word occurs in the review.

With sklearn, we can do this very easily with ```sklearn.feature_extraction.text.CountVectorizer```.

<img src="https://github.com/yala/MLCodeLab/blob/master/lab1/vectorizer.png?raw=true">
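The toy vocabulary above can be reproduced directly. A minimal sketch, assuming we pass the vocabulary explicitly through the `vocabulary` argument so that the column order matches the list (the names `toy_vocab`, `toy_vec`, and `toy_X` are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fix the vocabulary by hand so column i corresponds to toy_vocab[i].
toy_vocab = ["best", "ever", "beer", "cat", "good", "dog"]
toy_vec = CountVectorizer(vocabulary=toy_vocab)

# With a fixed vocabulary there is nothing to learn, so we can transform directly.
toy_X = toy_vec.transform(["best beer ever", "good dog good dog"])
print(toy_X.toarray())
# [[1 1 1 0 0 0]
#  [0 0 0 0 2 2]]
```

The second row shows that entries are counts, not just 0/1 indicators.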

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

# A word must appear in at least 5 reviews (document frequency) to enter the
# vocab, and only the 1000 most frequent words are kept.
min_df = 5
max_features = 1000
countVec = CountVectorizer(min_df=min_df, max_features=max_features)

# Learn vocabulary from train set
countVec.fit(trainText)

# Transform list of reviews to matrix of bag-of-words vectors
trainX = countVec.transform(trainText)
devX = countVec.transform(devText)
testX = countVec.transform(testText)

## Exercise 2:

1. What is the size of the vocabulary?
2. What if you change the minimum token frequency to 500?
3. What is the index of "beer"?

Hint: Use the documentation!


In [0]:
# Question 1:
print("Vocabulary size: %d" % len(countVec.vocabulary_))

# Question 2:
countVec2 = CountVectorizer(min_df=500, max_features=max_features)
countVec2.fit(trainText)
print("New vocabulary size: %d" % len(countVec2.vocabulary_))

# Question 3:
print("Index of beer: %d" % countVec.vocabulary_["beer"])

# Step 3: Pick a Model

sklearn offers several linear classifiers, including Logistic Regression, Passive Aggressive, and Perceptron. Here we'll fit a Logistic Regression model and, in the exercise, a Perceptron. It's very straightforward
to fit a new classifier and get preliminary results!
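Every sklearn classifier follows the same `fit` / `score` pattern, which is why swapping models is cheap. A minimal sketch on a tiny made-up dataset (the toy reviews, labels, and variable names are assumptions for illustration only):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron

# Four made-up reviews with binary labels (1 = positive, 0 = negative).
toy_text = ["great beer", "great taste", "awful beer", "awful taste"]
toy_y = [1, 1, 0, 0]
toy_X = CountVectorizer().fit_transform(toy_text)

# The same two calls work for any classifier we plug in.
toy_scores = {}
for name, model in [("Logistic Regression", LogisticRegression()),
                    ("Perceptron", Perceptron())]:
    model.fit(toy_X, toy_y)
    toy_scores[name] = model.score(toy_X, toy_y)
    print(name, "train accuracy:", toy_scores[name])
```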

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron

lr = LogisticRegression()
lr.fit(trainX, trainY)
print("Logistic Regression Train:", lr.score(trainX, trainY))
print("Logistic Regression Dev:", lr.score(devX, devY))

## Exercise 3:

1. What is the *test* score of the logistic regression model?
2. What are the maximum and minimum weight values?
3. What are the train/dev scores if you use a perceptron?


In [0]:
# Question 1:
print("Logistic Regression Test:", lr.score(testX, testY))

# Question 2:
print("Maximum weight value: %2.4f" % lr.coef_.max())
print("Minimum weight value: %2.4f" % lr.coef_.min())

# Question 3:
perceptron = Perceptron()
perceptron.fit(trainX, trainY)
print("Perceptron Train:", perceptron.score(trainX, trainY))
print("Perceptron Dev:", perceptron.score(devX, devY))


# Step 4: Analysis, Debugging the Model
To understand how to make the model better, it's important to understand what the model is learning and what it's getting wrong.

To do this, we can inspect the highest weighted features of our best LR model and look at some examples the model got wrong on the development set.


In [0]:
lr = LogisticRegression()
lr.fit(trainX, trainY)
print("Logistic Regression Train:", lr.score(trainX, trainY))
print("Logistic Regression Dev:", lr.score(devX, devY))
print("--")

In [0]:
import numpy as np

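print("Interpreting LR")
# get_feature_names_out() replaces get_feature_names(), which was removed in
# scikit-learn 1.2; on versions older than 1.0, use get_feature_names().
vocab = np.array(countVec.get_feature_names_out())
num_features = 10
for label in range(3):
    coefs = lr.coef_[label]

    # Indices of the num_features largest coefficients (in no particular order)
    top = np.argpartition(coefs, -num_features)[-num_features:]
    # Sort those indices by ascending weight
    top = top[np.argsort(coefs[top])]
    s_coef = coefs[top]
    scored_vocab = list(zip(vocab[top], s_coef))
    print("Top weighted features for label {}:\n \n {}\n -- \n".format(label, scored_vocab))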

In [0]:
# Collect the misclassified dev examples
devPred = lr.predict(devX)
errors = []
for indx in range(len(devText)):
    if devPred[indx] != devY[indx]:
        error = "Review: \n {} \n Predicted: {} \n Correct: {} \n ---".format(
            devText[indx],
            devPred[indx],
            devY[indx])
        errors.append(error)

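np.random.seed(2)
# Draw three misclassified reviews. Without a size argument, choice() returns
# a single string, so the newlines in each error print properly.
print("Random dev errors: \n {} \n \n {} \n \n{}".format(
        np.random.choice(errors),
        np.random.choice(errors),
        np.random.choice(errors))
     )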

## Exercise 4:

1. Count the number of false positives in the dev set.
2. Count the number of false negatives in the dev set.

In [0]:
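# Question 1:
# Labels are ordered (negative < neutral < positive), so we count a prediction
# above the true label as a false positive.
fp = 0
for idx in range(len(devY)):
    if devPred[idx] > devY[idx]:
        fp += 1
print("Number of false positives: %d" % fp)

# Question 2:
# Likewise, a prediction below the true label counts as a false negative.
fn = 0
for idx in range(len(devY)):
    if devPred[idx] < devY[idx]:
        fn += 1
print("Number of false negatives: %d" % fn)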
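The same counts can be read off a confusion matrix. A minimal sketch, assuming ordered integer labels 0, 1, 2 and made-up predictions (`toy_true` and `toy_pred` are hypothetical): entries above the diagonal are over-predictions and entries below it are under-predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

toy_true = [0, 1, 2, 2, 1, 0]
toy_pred = [0, 2, 2, 1, 1, 1]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2])

over = int(np.triu(cm, k=1).sum())    # predicted label > true label
under = int(np.tril(cm, k=-1).sum())  # predicted label < true label
print(cm)
print("Over-predictions:", over, "| Under-predictions:", under)
# Over-predictions: 2 | Under-predictions: 1
```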

# Step 5: Play with Regularization

Logistic Regression works best so far, but it is overfitting badly: it does much better on train than on development. A common strategy for dealing with this is to add an extra penalty for model complexity, such as the sum of the squared model weights. We call this idea regularization.

In sklearn, it is very easy to test out various regularization amounts and tune the model. The parameter `C` is the inverse of the regularization strength: the smaller `C` is, the stronger the regularization.
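To see the effect of `C` on the weights themselves, we can compare weight norms across settings. A minimal sketch on synthetic data from `make_classification` (the data and the names `X_toy`, `y_toy`, `norms` are assumptions for illustration, not the beer reviews):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem, used only to illustrate the trend.
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

norms = {}
for C in [1.0, 0.1, 0.01]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_toy, y_toy)
    # The L2 norm of the weight vector shrinks as the penalty gets stronger.
    norms[C] = np.linalg.norm(clf.coef_)
    print("C=%.2f  weight norm=%.4f  train acc=%.3f"
          % (C, norms[C], clf.score(X_toy, y_toy)))
```

Smaller weights mean the model leans less heavily on any single feature, which is what reduces the gap between train and dev accuracy.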

In [0]:
lr = LogisticRegression(C=.5)
lr.fit(trainX, trainY)


print("Logistic Regression Train:", lr.score(trainX, trainY))
print("Logistic Regression Dev:", lr.score(devX, devY))
print("--")

lr = LogisticRegression(C=.1)
lr.fit(trainX, trainY)


print("Logistic Regression Train:", lr.score(trainX, trainY))
print("Logistic Regression Dev:", lr.score(devX, devY))
print("--")

In [0]:
lr = LogisticRegression(C=.01)
lr.fit(trainX, trainY)


print("Logistic Regression Train:", lr.score(trainX, trainY))
print("Logistic Regression Dev:", lr.score(devX, devY))
print("--")

# Step 6: Adding in Ngrams

How does our model distinguish between a review that says
```"great flavor and too bad there isn't more."```
and one that says
```"bad flavor and too great there isn't more."```?

In our bag-of-words model, both have the same vector. To capture some of this ordering dependency, we generalize the bag-of-words model to count "n-grams": sequences of n consecutive words that occur in the training set. A "bi-gram" is a pair of words, a "tri-gram" a triple, etc.

Let's see how this improves our model.
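We can check the claim about the two sentences directly. A minimal sketch, fitting throwaway vectorizers on just these two sentences (the variable names are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

sent_a = "great flavor and too bad there isn't more."
sent_b = "bad flavor and too great there isn't more."

# Unigrams only: both sentences contain exactly the same words.
uni = CountVectorizer().fit([sent_a, sent_b])
uni_a, uni_b = uni.transform([sent_a, sent_b]).toarray()
same_unigrams = bool((uni_a == uni_b).all())

# Unigrams + bigrams: "great flavor" vs. "bad flavor" now differ.
bi = CountVectorizer(ngram_range=(1, 2)).fit([sent_a, sent_b])
bi_a, bi_b = bi.transform([sent_a, sent_b]).toarray()
same_bigrams = bool((bi_a == bi_b).all())

print("Identical with unigrams:", same_unigrams)        # True
print("Identical with uni+bigrams:", same_bigrams)      # False
```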


In [0]:
# An n-gram must appear in at least 5 reviews (document frequency) to enter
# the vocab, and only the 5000 most frequent n-grams are kept.
min_df = 5
ngram_range = (1, 3)
max_features = 5000
countVecNgram = CountVectorizer(min_df=min_df, ngram_range=ngram_range, max_features=max_features)

# Learn vocabulary from train set
countVecNgram.fit(trainText)

# Transform list of reviews to matrix of bag-of-n-grams vectors
trainXNgram = countVecNgram.transform(trainText)
devXNgram = countVecNgram.transform(devText)
testXNgram = countVecNgram.transform(testText)

In [0]:
lrNgram = LogisticRegression(C=1)
lrNgram.fit(trainXNgram, trainY)
print("Logistic Regression Train:", lrNgram.score(trainXNgram, trainY))
print("Logistic Regression Dev:", lrNgram.score(devXNgram, devY))
print("--")

lrNgram = LogisticRegression(C=.5)
lrNgram.fit(trainXNgram, trainY)
print("Logistic Regression Train:", lrNgram.score(trainXNgram, trainY))
print("Logistic Regression Dev:", lrNgram.score(devXNgram, devY))
print("--")

lrNgram = LogisticRegression(C=.1)
lrNgram.fit(trainXNgram, trainY)
print("Logistic Regression Train:", lrNgram.score(trainXNgram, trainY))
print("Logistic Regression Dev:", lrNgram.score(devXNgram, devY))
print("--")

lrNgram = LogisticRegression(C=.01)
lrNgram.fit(trainXNgram, trainY)
print("Logistic Regression Train:", lrNgram.score(trainXNgram, trainY))
print("Logistic Regression Dev:", lrNgram.score(devXNgram, devY))
print("--")

# Step 7: Take the Best Model and Report Results on Test

In [0]:
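# Note: lrNgram is whichever model was fit last in the previous cell (C=.01).
# Refit with the C that scored best on dev before reporting the test score.
print("Logistic Regression Test:", lrNgram.score(testXNgram, testY))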
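Picking the best model can be automated by keeping the model with the highest dev score and touching the test set only once at the end. A minimal sketch on synthetic data (the data, the splits, and the names `candidate_Cs`, `best_C`, `best_model` are assumptions for illustration; with the notebook's variables you would use the n-gram matrices instead):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the train/dev/test splits.
X_all, y_all = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_all, y_all, test_size=0.4, random_state=0)
X_dev, X_te, y_dev, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

candidate_Cs = [1, 0.5, 0.1, 0.01]
best_C, best_dev, best_model = None, -1.0, None
for C in candidate_Cs:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    dev_score = model.score(X_dev, y_dev)
    # Select on dev only; the test set stays untouched during tuning.
    if dev_score > best_dev:
        best_C, best_dev, best_model = C, dev_score, model

test_score = best_model.score(X_te, y_te)
print("Best C:", best_C, "| Dev:", best_dev, "| Test:", test_score)
```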