# Capstone Example

Last week's notebook introduced a cumulative project that tests your knowledge from this series of courses.
This notebook will serve as a reference point while you work on that project. It includes checkpoint answers for your tasks, computed on a fixed example dataset. Make sure your final submission uses a different dataset!

You will be working on 4 tasks:
1. __Data Processing__ 
2. __Classification__ 
3. __Regression__ 
4. __Recommender Systems__

Each of these tasks represents one of the courses in the series, so if you need help with any of them, be sure to look back at the matching course for reference! In addition to the previous courses, checkpoints with given solutions are provided so you can make sure you are headed in the right direction. ___Good Luck!___

# Task 1: Data Processing

## The Data

For this final project you will work on a dataset of your choice. For reference, an example with checkpoint answers is included. The example uses an Amazon dataset, which does not need any cleaning before analysis. It can be found [here](https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Home_Improvement_v1_00.tsv.gz).
This dataset is a set of Home Improvement product reviews from Amazon. It is rather large, so computations may take longer than usual.

### First Step: Imports

In the next cell we will give you all of the imports you should need to do your project. Feel free to add more if you would like, but these should be sufficient.

In [1]:
import gzip
from collections import defaultdict
import random
import numpy
import scipy.optimize
import string
from sklearn import linear_model
from nltk.stem.porter import PorterStemmer # Stemming

### TODO 1: Read the data and Fill your dataset

Take care to cast the votes and rating to int. Also __add this bit of code__ to your for loop, taking off the outer " ":

" d['verified_purchase'] = d['verified_purchase'] == 'Y' "

This simply makes the verified purchase column hold strictly True/False values rather than Y/N strings.
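
As a rough illustration of the reading step, here is a minimal sketch that parses a tiny in-memory gzipped TSV instead of the real file. The two sample rows and the column names `helpful_votes` and `total_votes` are assumptions made for this example, so check them against the header of your own dataset.

```python
import gzip
import io

# Hypothetical two-review sample standing in for the real .tsv.gz file
sample_tsv = (
    "customer_id\tproduct_id\tstar_rating\thelpful_votes\ttotal_votes\t"
    "verified_purchase\treview_body\n"
    "111\tB001\t5\t2\t3\tY\tWorks great\n"
    "222\tB002\t3\t0\t1\tN\tJust okay\n"
)
compressed = gzip.compress(sample_tsv.encode("utf-8"))

dataset = []
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as f:
    # The first line holds the column names
    header = f.readline().rstrip("\n").split("\t")
    for line in f:
        fields = line.rstrip("\n").split("\t")
        d = dict(zip(header, fields))
        # Cast the numeric columns from strings to ints (assumed column names)
        for key in ("star_rating", "helpful_votes", "total_votes"):
            d[key] = int(d[key])
        # Turn the Y/N string into a boolean
        d["verified_purchase"] = d["verified_purchase"] == "Y"
        dataset.append(d)

print(dataset[0]["star_rating"], dataset[0]["verified_purchase"])
```

With a real file you would pass its path to `gzip.open` in place of the `io.BytesIO` object.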

In [1]:
#YOUR CODE HERE


To do this setup properly, you __should__ shuffle your data, and you should do so in your submission. Shuffling would change the checkpoint values, though, so for the sake of this example we will ___not___ shuffle the data.

### TODO 2: Split the data into a Training and Testing set

Use the first 80% of the data as the training set and the remaining 20% as the test set.
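
The split itself is just list slicing. Below is a small sketch on a hypothetical ten-record list; the names `toy_dataset`, `toy_training`, and `toy_test` are made up for this illustration, and the integer arithmetic is one possible way to choose the cut point.

```python
# Toy stand-in for the full dataset (ten hypothetical records)
toy_dataset = [{"star_rating": r} for r in [5, 4, 5, 3, 1, 5, 2, 4, 5, 5]]

# Integer arithmetic keeps the cut point a whole number
split_point = len(toy_dataset) * 8 // 10
toy_training = toy_dataset[:split_point]   # first 80%
toy_test = toy_dataset[split_point:]       # remaining 20%

print(len(toy_training), len(toy_test))  # prints: 8 2
```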

In [5]:
#YOUR CODE HERE

print(len(trainingSet), len(testSet))
print("Lengths should be: 2107824 526957")

2107824 526957
Lengths should be: 2107824 526957


#### Now delete your dataset
From here on, your answers should come from your Training Set rather than your original dataset. Deleting the dataset helps you avoid mistakes later on, especially when comparing against the checkpoint solutions.

In [6]:
del dataset

### TODO 3: Extracting Basic Statistics

Next you need to answer some questions through any means (i.e. write a function or just find the answer) all based on the __Training Set:__
1. What is the __average rating__?
2. What fraction of reviews are from __verified purchases__?
3. How many __total users__ are there?
4. How many __total items__ are there?
5. What fraction of reviews have __5-star ratings__?
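
One possible way to compute these five quantities is sketched below on four made-up reviews. The records and the variable names are assumptions for illustration only; your own numbers should come from your Training Set.

```python
# Four hypothetical reviews using the same keys as the real data
toy_training = [
    {"customer_id": "u1", "product_id": "p1", "star_rating": 5, "verified_purchase": True},
    {"customer_id": "u1", "product_id": "p2", "star_rating": 3, "verified_purchase": False},
    {"customer_id": "u2", "product_id": "p1", "star_rating": 5, "verified_purchase": True},
    {"customer_id": "u3", "product_id": "p3", "star_rating": 2, "verified_purchase": True},
]
n = len(toy_training)

# 1. Average rating
average_rating = sum(d["star_rating"] for d in toy_training) / n
# 2. Fraction of verified purchases (True counts as 1 when summed)
verified_fraction = sum(d["verified_purchase"] for d in toy_training) / n
# 3. and 4. Unique users and items, counted with sets
n_users = len({d["customer_id"] for d in toy_training})
n_items = len({d["product_id"] for d in toy_training})
# 5. Fraction of 5-star ratings
five_star_fraction = sum(d["star_rating"] == 5 for d in toy_training) / n

print(average_rating, verified_fraction, n_users, n_items, five_star_fraction)
```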

In [2]:
#YOUR CODE HERE


### Checkpoint:

Here is a list of answers for the questions above. Use these to reference how you are doing in finding the correct solutions.
1. 4.219492709068689
2. 0.9176558384381238
3. 1396587
4. 294787
5. 0.642813631498645

# Task 2: Classification

Next you will use your knowledge of classification to extract features and make predictions based on them. You will be using a Logistic Regression model here; keep this in mind so you know where to look for help.

### TODO 1: Define the feature function

This implementation will be based on the __star rating__ and the ___length___ of the __review body__. Hint: Remember the offset!
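
A feature function of this kind might look like the sketch below. It assumes that "length" means the number of characters in the review body, and the example review is made up.

```python
def feature(datum):
    # Offset term first, then the two features named in the task.
    # Assumption: "length" is the character count of the review body.
    return [1, datum["star_rating"], len(datum["review_body"])]

# Hypothetical review for illustration
example = {"star_rating": 4, "review_body": "Sturdy and easy to install"}
print(feature(example))
```

The constant 1 is the offset: it lets the model learn an intercept alongside the weights for the other two features.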

In [3]:
#YOUR CODE HERE


### TODO 2: Fit your model

1. Create your __Feature Vector__ based on your feature function defined above. 
2. Create your __Label Vector__ based on the "verified purchase" column of your training set.
3. Define your model as a __Logistic Regression__ model.
4. Fit your model.

In [4]:
#YOUR CODE HERE


### TODO 3: Compute Accuracy of Your Model

1. Make __Predictions__ based on your model.
2. Compute the __Accuracy__ of your model.

In [5]:
#YOUR CODE HERE


### TODO 4: Finding the Balanced Error Rate

1. Compute __True__ and __False Positives__
2. Compute __True__ and __False Negatives__
3. Compute __Balanced Error Rate__ based on your above defined variables.
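
To make the definitions concrete, here is a small sketch that computes accuracy and the Balanced Error Rate from two hypothetical lists of labels and predictions. In your notebook the predictions would come from your fitted model instead.

```python
# Hypothetical labels and predictions (True = verified purchase)
labels      = [True, True, True, False, False, True]
predictions = [True, True, False, True, False, True]

pairs = list(zip(predictions, labels))
TP = sum(1 for p, l in pairs if p and l)          # predicted True, actually True
FP = sum(1 for p, l in pairs if p and not l)      # predicted True, actually False
TN = sum(1 for p, l in pairs if not p and not l)  # predicted False, actually False
FN = sum(1 for p, l in pairs if not p and l)      # predicted False, actually True

accuracy = (TP + TN) / len(labels)

# BER is the average of the false positive rate and the false negative rate
FPR = FP / (FP + TN)
FNR = FN / (FN + TP)
BER = 0.5 * (FPR + FNR)

print(accuracy, BER)
```

A BER near 0.5 means the model does no better than a trivial classifier on at least one of the two classes, even when accuracy looks high on imbalanced data.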

In [6]:
#YOUR CODE HERE


### Checkpoint:

Here is a list of answers for the questions above. Use these to reference how you are doing in finding the correct solutions.
3. Accuracy = 0.9957396822505105
4. BER = 0.4895446422238072

# Task 3: Regression

In this section you will start by working through two examples of altering features to further differentiate. Then you will work through how to evaluate a regularized model.

Let's start by defining a new y vector, specific to our Regression model.

In [16]:
y_reg = [d['star_rating'] for d in trainingSet]

### TODO 1: Unique Words in a Sample Set

We are going to work with a smaller Sample Set here, as stemming on the normal training set will take a very long time. (Feel free to change sampleSet -> trainingSet if you would like to see)

1. Count the number of unique words found within the 'review body' portion of the sample set defined below, making sure to __Ignore Punctuation and Capitalization__.
2. Count the number of unique words found within the 'review body' portion of the sample set defined below, this time with use of __Stemming,__ __Ignoring Punctuation,__ ___and___ __Capitalization__.
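
The first count can be sketched as follows on two made-up review bodies. The names `toy_reviews` and `toy_word_count` are hypothetical; in the notebook you would loop over the sample set and fill `wordCount` instead.

```python
import string
from collections import defaultdict

# Two hypothetical review bodies
toy_reviews = ["Great drill, GREAT price!", "The drill works."]

punctuation = set(string.punctuation)
toy_word_count = defaultdict(int)

for body in toy_reviews:
    # Lowercase the text and drop every punctuation character
    cleaned = "".join(c for c in body.lower() if c not in punctuation)
    for w in cleaned.split():
        toy_word_count[w] += 1

print(len(toy_word_count))  # number of unique words
```

For the stemmed version, the same loop would pass each word through `stemmer.stem(w)` before counting it.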

In [17]:
#GIVEN for 1.
wordCount = defaultdict(int)
punctuation = set(string.punctuation)

#GIVEN for 2.
wordCountStem = defaultdict(int)
stemmer = PorterStemmer() #use stemmer.stem(stuff)

In [18]:
sampleSet = trainingSet[:2*len(trainingSet)//10]

In [7]:
#YOUR CODE HERE


### TODO 2: Evaluating Classifiers

1. Given the feature function and your counts vector, __Define__ your X_reg vector. (This being the X vector, simply labeled for the Regression model)
2. __Fit__ your model using a __Ridge Model__ with (alpha = 1.0, fit_intercept = True).
3. Using your model, __Make your Predictions__.
4. Find the __MSE__ between your predictions and your y_reg vector.
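
For reference, the objective that a Ridge model minimizes can be written as below. The symbols are notation chosen for this note rather than anything defined in the notebook: $x_n$ is the feature vector of review $n$, $y_n$ its rating, $w$ the weights, $b$ the intercept, and $\alpha$ the regularization strength.

```latex
\min_{w,\,b} \; \sum_{n} \bigl( y_n - x_n \cdot w - b \bigr)^2 \;+\; \alpha \, \lVert w \rVert_2^2
```

With `fit_intercept = True` the intercept $b$ is fitted but not penalized, and a larger `alpha` shrinks the weights more strongly toward zero.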

In [24]:
#GIVEN FUNCTIONS
def feature_reg(datum):
    feat = [0]*len(words)
    r = ''.join([c for c in datum['review_body'].lower() if not c in punctuation])
    for w in r.split():
        if w in wordSet:
            feat[wordId[w]] += 1
    return feat

def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)

In [25]:
#GIVEN COUNTS AND SETS
counts = [(wordCount[w], w) for w in wordCount]
counts.sort()
counts.reverse()

#Note: increasing the size of the dictionary may require a lot of memory
words = [x[1] for x in counts[:100]]

wordId = dict(zip(words, range(len(words))))
wordSet = set(words)

In [9]:
#YOUR CODE HERE


### Checkpoint:

Here is a list of answers for the questions above. Use these to reference how you are doing in finding the correct solutions.
1. len(wordCount) = 135769
2. len(wordCountStem) = 113888
4. MSE = 1.2869981011943792 (Roughly, could change slightly due to rounding errors)

In [27]:
# If you would like to work with this example more in your free time, here are some tips to improve your solution:
# 1. Implement a validation pipeline and tune the regularization parameter
# 2. Alter the word features (e.g. dictionary size, punctuation, capitalization, stemming, etc.)
# 3. Incorporate features other than word features

# Task 4: Recommendation Systems

For your final task, you will use your knowledge of simple latent factor-based recommender systems to make predictions. Then you will evaluate the performance of your predictions.

### Starting up

The next cell contains some starter code that you will need for your tasks in this section.
Notice you are back to using the __trainingSet__.

In [28]:
#Create and fill our default dictionaries for our dataset
reviewsPerUser = defaultdict(list)
reviewsPerItem = defaultdict(list)

for d in trainingSet:
    user,item = d['customer_id'], d['product_id']
    reviewsPerUser[user].append(d)
    reviewsPerItem[item].append(d)
    
#Create two dictionaries that will be filled with our rating prediction values
userBiases = defaultdict(float)
itemBiases = defaultdict(float)

#Getting the respective lengths of our dataset and dictionaries
N = len(trainingSet)
nUsers = len(reviewsPerUser)
nItems = len(reviewsPerItem)

#Getting the list of keys
users = list(reviewsPerUser.keys())
items = list(reviewsPerItem.keys())

### You will need to use this list
y_rec = [d['star_rating'] for d in trainingSet]

### TODO 1: Calculate the ratingMean

1. Find the __average rating__ of your training set.
2. Calculate a __baseline MSE value__ from the actual ratings to the average ratings.
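
The baseline is the error you get by always predicting the mean rating. A minimal sketch on five made-up ratings follows; `toy_ratings`, `toy_mean`, and `toy_baseline` are hypothetical names used only here.

```python
# Five hypothetical ratings
toy_ratings = [5, 4, 3, 5, 3]

# Average rating
toy_mean = sum(toy_ratings) / len(toy_ratings)

# MSE of predicting the mean for every review
toy_baseline = sum((r - toy_mean) ** 2 for r in toy_ratings) / len(toy_ratings)

print(toy_mean, toy_baseline)
```

Any model you fit afterwards should reach an MSE below this baseline to be worth using.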

In [10]:
#YOUR CODE HERE


Here we are defining the functions you will need to optimize your MSE value. 

In [31]:
#GIVEN
alpha = ratingMean

def prediction(user, item):
    return alpha + userBiases[user] + itemBiases[item]

def unpack(theta):
    global alpha
    global userBiases
    global itemBiases
    alpha = theta[0]
    userBiases = dict(zip(users, theta[1:nUsers+1]))
    itemBiases = dict(zip(items, theta[1+nUsers:]))
    
def cost(theta, labels, lamb):
    unpack(theta)
    predictions = [prediction(d['customer_id'], d['product_id']) for d in trainingSet]
    cost = MSE(predictions, labels)
    print("MSE = " + str(cost))
    for u in userBiases:
        cost += lamb*userBiases[u]**2
    for i in itemBiases:
        cost += lamb*itemBiases[i]**2
    return cost

def derivative(theta, labels, lamb):
    unpack(theta)
    N = len(trainingSet)
    dalpha = 0
    dUserBiases = defaultdict(float)
    dItemBiases = defaultdict(float)
    for d in trainingSet:
        u,i = d['customer_id'], d['product_id']
        pred = prediction(u, i)
        diff = pred - d['star_rating']
        dalpha += 2/N*diff
        dUserBiases[u] += 2/N*diff
        dItemBiases[i] += 2/N*diff
    for u in userBiases:
        dUserBiases[u] += 2*lamb*userBiases[u]
    for i in itemBiases:
        dItemBiases[i] += 2*lamb*itemBiases[i]
    dtheta = [dalpha] + [dUserBiases[u] for u in users] + [dItemBiases[i] for i in items]
    return numpy.array(dtheta)
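
Written out, the `cost` and `derivative` functions above correspond to the objective and gradients below. The notation is chosen for this note: $r_{u,i}$ is the rating user $u$ gave item $i$, $\beta_u$ and $\beta_i$ are the user and item biases, $\lambda$ is `lamb`, and $I_u$ is the set of items reviewed by user $u$.

```latex
f(\alpha, \beta) = \frac{1}{N} \sum_{(u,i)} \bigl( \alpha + \beta_u + \beta_i - r_{u,i} \bigr)^2
  + \lambda \Bigl( \sum_u \beta_u^2 + \sum_i \beta_i^2 \Bigr)

\frac{\partial f}{\partial \alpha} = \frac{2}{N} \sum_{(u,i)} \bigl( \alpha + \beta_u + \beta_i - r_{u,i} \bigr)

\frac{\partial f}{\partial \beta_u} = \frac{2}{N} \sum_{i \in I_u} \bigl( \alpha + \beta_u + \beta_i - r_{u,i} \bigr) + 2 \lambda \beta_u
```

The gradient with respect to an item bias $\beta_i$ has the same form, summing over the users who reviewed item $i$.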

### TODO 2: Optimize

1. __Optimize__ your MSE using the scipy.optimize.fmin_l_bfgs_b("arguments") function.
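
To show the calling convention of `scipy.optimize.fmin_l_bfgs_b`, here is a sketch on a toy quadratic rather than on the recommender model. The functions `toy_cost` and `toy_derivative` and the `target` vector are made up for this example.

```python
import numpy
import scipy.optimize

# Hypothetical target: the minimizer of the toy cost below
target = numpy.array([1.0, -2.0, 3.0])

def toy_cost(theta, target):
    # Sum of squared distances to the target
    return float(numpy.sum((theta - target) ** 2))

def toy_derivative(theta, target):
    # Gradient of toy_cost with respect to theta
    return 2 * (theta - target)

theta0 = numpy.zeros(3)  # initial guess

# Arguments: cost function, initial parameters, gradient function,
# and a tuple of extra arguments passed to both functions
theta_opt, final_cost, info = scipy.optimize.fmin_l_bfgs_b(
    toy_cost, theta0, toy_derivative, args=(target,)
)

print(theta_opt, final_cost, info["warnflag"])
```

In the notebook the same pattern would apply with `cost` and `derivative`, an initial vector of length `1 + nUsers + nItems`, and `args` holding the labels and a regularization strength of your choosing.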

In [11]:
#YOUR CODE HERE


### Checkpoint:

Here is a list of answers for the questions above. Use these to reference how you are doing in finding the correct solutions.
1. ratingMean = 4.219492709068689
2. baseLine = 1.634495697493549
3. optimized MSE -> converges to roughly 1.6083083.....

## You're all done!

Congratulations! This project was the end of 4 whole courses worth of content! This project clearly didn't cover every single topic from those courses, but it serves as a summary for everything you have learned. This is only the start of Python Data Projects, so continue to learn and good luck in your future endeavors!