### Assignment 4 Beer Reviews
#### Allison Hwang and Albert Kuo

In [2]:
import numpy as np
import json 
import matplotlib.pyplot as plt
import random
import re
%matplotlib inline
from pyspark.mllib.feature import HashingTF, IDF, Normalizer
from pyspark.mllib.regression import LabeledPoint

In [3]:
from pyspark.mllib.tree import DecisionTree, RandomForest, \
                                      GradientBoostedTrees

### part (a) Generating features (hashed TF-IDF)

In [4]:
all_reviews = sc.textFile("s3n://stat-37601/ratings.json",minPartitions=1000).map(json.loads)
reviews, reviews_test = all_reviews.randomSplit([.7, .3])
reviews.cache()

seed = random.randint(0,10000)
reviews_cv = reviews.sample(False, 0.7, seed)

In [18]:
def getLabel(review):
    """
    Get the overall rating from a review
    """
    label, total = review["review_overall"].split("/")
    return float(label) / float(total)
labels_cv = reviews_cv.map(getLabel)
labels_cv.first()

0.65

In [6]:
def parse(f):
    """
    Split review text into lowercase word tokens
    """
    # Replace every non-word character with a space, then lowercase and split
    return re.sub(r"[^\w]", " ", f).lower().split()


In [7]:
reviews_wordlists_cv = reviews_cv.map(lambda x: parse(x['review_text']))

I chose 100 features for speed. More features give lower error (I tried several numFeatures values below), but with 100 features the runs are fast enough that the small increase in error is worth it in this case.

In [8]:
hashingTF = HashingTF(100)
tf = hashingTF.transform(reviews_wordlists_cv)
tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
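
The trade-off behind numFeatures can be illustrated outside Spark. The following is a minimal pure-Python sketch of the hashing trick, not MLlib's implementation: `hashed_term_frequencies` and `count_collisions` are hypothetical helper names, and `zlib.crc32` is a stand-in hash chosen only because it is reproducible. With fewer buckets, more distinct words share an index, which is one plausible reason the error rises at 100 features.

```python
import zlib

def hashed_term_frequencies(tokens, num_features):
    """Count tokens into a fixed number of buckets (the hashing trick)."""
    counts = [0] * num_features
    for token in tokens:
        # crc32 is a stand-in hash (assumption); MLlib uses its own hash function
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        counts[bucket] += 1
    return counts

def count_collisions(vocabulary, num_features):
    """Number of distinct words that land in a bucket already taken."""
    used_buckets = set()
    collisions = 0
    for word in vocabulary:
        bucket = zlib.crc32(word.encode("utf-8")) % num_features
        if bucket in used_buckets:
            collisions += 1
        used_buckets.add(bucket)
    return collisions

# Made-up review text and a synthetic 500-word vocabulary
tokens = "a hazy golden pour with a thick white head".split()
vector = hashed_term_frequencies(tokens, 100)

vocabulary = ["word%d" % i for i in range(500)]
for size in (100, 1000):
    print(size, count_collisions(vocabulary, size))
```

With 500 distinct words and only 100 buckets, at least 400 words must collide no matter which hash is used; a larger table leaves room for fewer collisions at the cost of wider feature vectors.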

In [9]:
nor = Normalizer()
normalized_cv = nor.transform(tfidf)
normalized_cv.first()

SparseVector(100, {3: 0.138, 8: 0.025, 15: 0.1444, 16: 0.3203, 17: 0.2554, 18: 0.1868, 24: 0.1704, 31: 0.1397, 33: 0.0893, 34: 0.0707, 37: 0.1003, 38: 0.154, 44: 0.0211, 46: 0.0881, 51: 0.1421, 57: 0.085, 62: 0.2019, 65: 0.1145, 70: 0.5044, 71: 0.1599, 72: 0.1305, 74: 0.1441, 75: 0.0869, 76: 0.0375, 78: 0.1417, 80: 0.1768, 81: 0.222, 85: 0.1395, 89: 0.241, 90: 0.1375, 95: 0.1809, 97: 0.1239})
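
To make the weighting and normalization steps concrete, here is a small pure-Python sketch on made-up term counts. It assumes the smoothed formula idf = log((m + 1) / (df + 1)), which is how I understand MLlib's documentation to describe its IDF, and L2 normalization, which is the Normalizer default. `fit_idf` and `tfidf_unit` are hypothetical names, not part of MLlib.

```python
import math

def fit_idf(tf_vectors):
    """Compute one IDF weight per bucket from a list of term-count vectors."""
    m = len(tf_vectors)
    n = len(tf_vectors[0])
    doc_freq = [sum(1 for v in tf_vectors if v[j] > 0) for j in range(n)]
    # Smoothed IDF (assumed to match MLlib): log((m + 1) / (df + 1))
    return [math.log((m + 1.0) / (df + 1.0)) for df in doc_freq]

def tfidf_unit(tf, idf):
    """Weight term counts by IDF, then scale the vector to unit L2 norm."""
    weighted = [t * w for t, w in zip(tf, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted] if norm > 0 else weighted

# Made-up counts: 3 documents, 3 hash buckets
train_tf = [[2, 0, 1], [0, 1, 1], [1, 0, 1]]
idf = fit_idf(train_tf)
rows = [tfidf_unit(tf, idf) for tf in train_tf]
for row in rows:
    print(row)
```

The bucket that appears in every document gets weight zero, and each resulting row has length 1. The same `idf` weights can be applied to new documents so that they are scaled the same way as the training rows.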

In [10]:
from pyspark.mllib.regression import LabeledPoint
train_data = normalized_cv.zip(labels_cv).map(lambda pair: LabeledPoint(pair[1], pair[0]))

In [13]:
reviews_test.cache()
test_wordlists = reviews_test.map(lambda x: parse(x['review_text']))
test_tf = hashingTF.transform(test_wordlists)
test_tf.cache()
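# Note: this fits a separate IDF on the test set; reusing the training `idf`
# from above would weight train and test features identically
test_idf = IDF().fit(test_tf)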
test_tfidf = test_idf.transform(test_tf)
normalized_test = nor.transform(test_tfidf)
a = normalized_test.first()

In [14]:
test_labels = reviews_test.map(getLabel)

### part (b) function to compute mean-squared-error

In [36]:
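def mean_squared_error(model):
    """
    Score a model on the test set. Despite the name, this returns the
    sum of squared errors; divide by test_labels.count() for the mean.
    """
    yhats = model.predict(normalized_test)
    d = yhats.zip(test_labels)
    diffs = d.map(lambda x: pow((x[0] - x[1]),2))
    mse = diffs.reduce(lambda a,b: a+b)
    return mse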
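
The function above accumulates squared errors with reduce, so the numbers reported in the rest of this notebook are sums over the test reviews rather than per-review averages. A minimal sketch of the distinction on made-up predictions (the helper names are hypothetical):

```python
def sum_squared_error(predictions, labels):
    """Total squared error over all examples."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels))

def mean_squared_error_per_example(predictions, labels):
    """Sum of squared errors divided by the number of examples."""
    return sum_squared_error(predictions, labels) / float(len(labels))

# Made-up predictions and ratings on the 0-1 scale
predictions = [0.6, 0.8, 0.5, 0.9]
labels = [0.65, 0.7, 0.5, 1.0]

print(sum_squared_error(predictions, labels))
print(mean_squared_error_per_example(predictions, labels))
```

Because every model here is scored on the same test set, the sum and the mean rank the models identically; only the scale differs.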
    

### part (c) Trees

## Decision Tree

In [17]:
m = DecisionTree.trainRegressor(train_data, {}, impurity='variance')
mse = mean_squared_error(m)

In [20]:
# default settings:
# impurity = 'variance', maxDepth = 5, numFeatures = 1000
mse

21590.61583910424

I ran out of disk space ("no space left on device") when I used maxDepth = 10 with numFeatures = 1000.

Since numFeatures = 1000 uses too much space, I will try numFeatures = 100 instead.

In [13]:
# cv set, impurity = 'variance', maxDepth = 5, numFeatures = 100
m = DecisionTree.trainRegressor(train_data, {}, impurity='variance')
mse_numF100 = mean_squared_error(m)
# avg is 0.1248

In [15]:
mse_numF100

22317.612618724987

maxDepth = 7, numFeatures=100

In [16]:
m = DecisionTree.trainRegressor(train_data, {}, maxDepth=7, impurity='variance')
mse_maxD7 = mean_squared_error(m)
mse_maxD7

21942.210518037

maxDepth = 10, numFeatures = 100

In [17]:
m = DecisionTree.trainRegressor(train_data, {}, maxDepth=10, impurity='variance')
mse_maxD10 = mean_squared_error(m)
mse_maxD10

21569.050156969046

Increasing the maximum depth lowers the error for decision trees: 22317.6 at depth 5, 21942.2 at depth 7, and 21569.1 at depth 10. However, I ran out of space when I used maxDepth = 20, numFeatures = 100.

## Random Forest

First I look at how the number of trees affects random forests. More trees lower the error, but with diminishing returns: going from 2 to 5 trees reduces it from 22454.0 to 22290.3, while going from 20 to 40 trees only reduces it from 22220.6 to 22198.4.

In [14]:
train_data.cache()
m = RandomForest.trainRegressor(train_data, {}, 2)
mse_RF_2trees = mean_squared_error(m)
mse_RF_2trees

22453.99479241267

In [15]:
m = RandomForest.trainRegressor(train_data, {}, 5)
mse_RF_5trees = mean_squared_error(m)
mse_RF_5trees

22290.344383166343

In [16]:
# featureSubsetStrategy is left at its default ('auto'), which uses one-third of the features for regression forests
m = RandomForest.trainRegressor(train_data, {}, 20)
mse_RF_20trees = mean_squared_error(m)
mse_RF_20trees

22220.551270942728

In [13]:
m = RandomForest.trainRegressor(train_data, {}, 40)
mse_RF_40trees = mean_squared_error(m)
mse_RF_40trees

22198.353201596092

Now I try different feature subset strategies. The default strategy, which considers one-third of the features at each split, works best in this case (22220.6 with 20 trees), compared with 22455.3 for 'all' and 22575.9 for 'sqrt'.

In [17]:
m = RandomForest.trainRegressor(train_data, {}, 20, featureSubsetStrategy='all')
mse_RF_all = mean_squared_error(m)
mse_RF_all

22455.339936908935

In [18]:
m = RandomForest.trainRegressor(train_data, {}, 20, featureSubsetStrategy='sqrt')
mse_RF_sqrt = mean_squared_error(m)
mse_RF_sqrt

22575.90009851554
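
The three strategies compared above differ only in how many of the 100 hashed features each split may consider. A rough sketch of the counts follows; `features_per_split` is a hypothetical helper, and rounding up is my assumption about how the fractions are turned into whole numbers.

```python
import math

def features_per_split(strategy, num_features):
    """Number of candidate features examined at each tree node."""
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        # Rounding up is an assumption
        return int(math.ceil(math.sqrt(num_features)))
    if strategy == "onethird":
        # Rounding up is an assumption
        return int(math.ceil(num_features / 3.0))
    raise ValueError("unknown strategy: %s" % strategy)

for strategy in ("all", "sqrt", "onethird"):
    print(strategy, features_per_split(strategy, 100))
```

With 'all', every tree sees the same candidates at each split, so the trees differ only through the resampled data; smaller subsets make the trees less alike but give each split fewer options.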

## Gradient Boosting

In [13]:
m = GradientBoostedTrees.trainRegressor(train_data, {})
GBT = mean_squared_error(m)
GBT

20050.742135819157

The default is 100 iterations, but that took a long time to run, so here I use 10 iterations. The error is clearly higher with fewer iterations (21880.9 versus 20050.7).

In [13]:
m = GradientBoostedTrees.trainRegressor(train_data, {}, numIterations=10)
GBT_10iterate = mean_squared_error(m)
GBT_10iterate

21880.91291290098

The default learning rate is 0.1; here I set it to 0.2, 0.3, 0.5, and 0.9, keeping 10 iterations. The error decreases as the learning rate increases up to 0.5, but rises again at 0.9.

In [14]:
m = GradientBoostedTrees.trainRegressor(train_data, {}, numIterations=10, learningRate=0.2)
GBT_LR2 = mean_squared_error(m)
GBT_LR2

21497.16044324459

In [15]:
m = GradientBoostedTrees.trainRegressor(train_data, {}, numIterations=10, learningRate=0.3)
GBT_LR3 = mean_squared_error(m)
GBT_LR3

21203.667532941596

In [16]:
m = GradientBoostedTrees.trainRegressor(train_data, {}, numIterations=10, learningRate=0.5)
GBT_LR5 = mean_squared_error(m)
GBT_LR5

21034.56692364598

In [17]:
m = GradientBoostedTrees.trainRegressor(train_data, {}, numIterations=10, learningRate=0.9)
GBT_LR9 = mean_squared_error(m)
GBT_LR9

22530.51713044965
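
The learning rate scales how much of each new tree's correction is added to the running prediction. The following is a minimal pure-Python sketch of gradient boosting for squared error, using one-split trees (stumps) on made-up 1-D data; it is an illustration under those assumptions, not MLlib's implementation, and `fit_stump` and `boost` are hypothetical names.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, num_iterations, learning_rate):
    """Return the training sum of squared errors after each round."""
    # Start from the mean label, then repeatedly fit a stump to the residuals
    predictions = [sum(ys) / len(ys)] * len(ys)
    history = []
    for _ in range(num_iterations):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)))
    return history

# Made-up data: one feature, ratings on the 0-1 scale
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.2, 0.3, 0.35, 0.5, 0.55, 0.7, 0.8, 0.9]
for rate in (0.1, 0.5):
    print(rate, boost(xs, ys, 10, rate)[-1])
```

This sketch only tracks training error, which never increases from round to round for rates in this range. Error on held-out data can behave differently, since large steps may fit noise; that may be part of what happens at learning rate 0.9, though the sketch does not demonstrate it.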

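Now I try different loss functions. For regression the default loss is already least squares error (log loss is the default for classification), so setting it explicitly gives nearly the same error as before (21052.3 versus 21034.6). Least absolute error does far worse (72211.6).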

In [21]:
m = GradientBoostedTrees.trainRegressor(train_data, {}, numIterations=10, learningRate=0.5, loss='leastSquaresError')
GBT_leastSquares = mean_squared_error(m)
GBT_leastSquares

21052.285890285017

In [22]:
m = GradientBoostedTrees.trainRegressor(train_data, {}, numIterations=10, learningRate=0.5, loss='leastAbsoluteError')
GBT_leastAbsErr = mean_squared_error(m)
GBT_leastAbsErr

72211.62011648448
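
One thing to keep in mind when comparing the two losses is that every model in this notebook is scored by squared error. A small sketch on made-up ratings shows how the best constant prediction differs under each loss: the mean minimizes squared error and the median minimizes absolute error, so a model tuned for absolute error starts at a disadvantage on this metric. This is only an illustration and is not meant to account for the full size of the gap observed above.

```python
def sum_squared_error(constant, labels):
    """Squared error of predicting the same value for every label."""
    return sum((y - constant) ** 2 for y in labels)

def sum_absolute_error(constant, labels):
    """Absolute error of predicting the same value for every label."""
    return sum(abs(y - constant) for y in labels)

# Made-up ratings with one low outlier
labels = [0.2, 0.7, 0.75, 0.8, 0.85]
mean = sum(labels) / len(labels)
median = sorted(labels)[len(labels) // 2]

print("squared: ", sum_squared_error(mean, labels), sum_squared_error(median, labels))
print("absolute:", sum_absolute_error(mean, labels), sum_absolute_error(median, labels))
```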

### part (d) beyond basic features
Make bigrams

In [9]:
reviews_wordlists_cv = reviews_cv.map(lambda x: parse(x['review_text']))
def double(wordlist):
    pairs = []
    for i in range(0, len(wordlist)-1):
        pairs.append((wordlist[i], wordlist[i+1]))
    return pairs
bigrams = reviews_wordlists_cv.map(double)
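
The pairing step can be checked on a short word list outside Spark. This sketch builds the same adjacent pairs with zip; `make_bigrams` is a hypothetical name used here to avoid clashing with the `bigrams` RDD above.

```python
def make_bigrams(wordlist):
    """Pair each word with the word that follows it."""
    return list(zip(wordlist, wordlist[1:]))

words = "pours a hazy golden color".split()
print(make_bigrams(words))
```

A list of n words yields n - 1 pairs, so a one-word or empty review produces no bigrams at all.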

In [11]:
hashingTF = HashingTF(100)
tf = hashingTF.transform(bigrams)
tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
nor = Normalizer()
normalized_cv = nor.transform(tfidf)
normalized_cv.first()

SparseVector(100, {3: 0.1075, 7: 0.1509, 8: 0.2788, 10: 0.2356, 12: 0.2277, 13: 0.1663, 15: 0.0889, 16: 0.0981, 17: 0.1807, 18: 0.0855, 20: 0.188, 23: 0.0933, 24: 0.092, 26: 0.1625, 28: 0.1111, 29: 0.1007, 31: 0.2835, 36: 0.1089, 40: 0.113, 41: 0.0993, 42: 0.0948, 43: 0.1987, 45: 0.114, 48: 0.0683, 50: 0.0986, 54: 0.1679, 56: 0.0997, 58: 0.1874, 60: 0.0999, 61: 0.0995, 65: 0.0994, 67: 0.2597, 70: 0.1926, 71: 0.2055, 72: 0.0955, 73: 0.1121, 74: 0.0821, 76: 0.0911, 81: 0.1803, 82: 0.1098, 84: 0.1052, 86: 0.0918, 87: 0.0964, 89: 0.0927, 92: 0.0959, 93: 0.0999, 95: 0.1258})

In [32]:
bigram_train_data = normalized_cv.zip(labels_cv).map(lambda (feature, label): LabeledPoint(label, feature))

### Random Forest with bigrams

In [37]:
m = RandomForest.trainRegressor(bigram_train_data, {}, 5)
mse_RF_5trees_bigram = mean_squared_error(m)
mse_RF_5trees_bigram

24410.33527282741

### Gradient Boosted Trees with bigrams

In [39]:
m = GradientBoostedTrees.trainRegressor(bigram_train_data, {}, numIterations=10, learningRate=0.5)
GBT_LR5_bigram = mean_squared_error(m)
GBT_LR5_bigram

24270.383709230755

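Bigrams do not improve on single words here; the error is actually higher (24410.3 versus 22290.3 for the 5-tree random forest, and 24270.4 versus 21034.6 for gradient boosting with learning rate 0.5). This may be because I used only 100 features when hashing: there are far more distinct bigrams than distinct words, so more of them share a bucket. With more features there would be more ability to differentiate, and I might see a benefit. Note also that, as written above, mean_squared_error scores predictions on normalized_test, which was built from single words, so the comparison may be unfair to the bigram models unless the test features are rebuilt from bigrams as well.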