### How Are We Doing?

In the last notebook, you created a working version of SVD that still works when the ratings matrix has many missing values.  This is awesome!  The question now is: how well does this solution work?

In this notebook, we are going to simulate exactly what we would do in the real world to tune our recommender.  

Run the cell below to read in the data and get started.

In [1]:
import numpy as np
import pandas as pd

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

`1.` Using the **reviews** dataframe, perform the following tasks to create training and validation datasets that we can use to test the performance of your SVD algorithm using **off-line** validation techniques.

 * Order the reviews dataframe from earliest to most recent 
 * Pull the first 10000 reviews from  the dataset
 * Make the first 8000/10000 reviews the training data 
 * Make the last 2000/10000 the test data
 * Return the training and test datasets

In [4]:
reviews.sort_values(['date'], ascending = True)[:10000]

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
498923,37287,2171847,6,1362062307,2013-02-28 14:38:27,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
442554,33140,444778,8,1362062624,2013-02-28 14:43:44,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
81920,6338,1411238,6,1362062838,2013-02-28 14:47:18,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
584570,43691,1496422,7,1362063503,2013-02-28 14:58:23,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
450669,33799,118799,5,1362063653,2013-02-28 15:00:53,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
521186,39200,408236,2,1363584469,2013-03-18 05:27:49,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
389437,29381,1598828,7,1363584508,2013-03-18 05:28:28,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
521188,39200,1899353,8,1363584544,2013-03-18 05:29:04,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
97966,7530,2093944,7,1363584601,2013-03-18 05:30:01,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [8]:
reviews.sort_values(['date'], ascending = True).iloc[:5]

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
498923,37287,2171847,6,1362062307,2013-02-28 14:38:27,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
442554,33140,444778,8,1362062624,2013-02-28 14:43:44,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
81920,6338,1411238,6,1362062838,2013-02-28 14:47:18,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
584570,43691,1496422,7,1362063503,2013-02-28 14:58:23,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
450669,33799,118799,5,1362063653,2013-02-28 15:00:53,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [11]:
def create_train_test(reviews, order_by, training_size, testing_size):
    '''    
    INPUT:
    reviews - (pandas df) dataframe to split into train and test
    order_by - (string) column name to sort by
    training_size - (int) number of rows in training set
    testing_size - (int) number of rows in the test set
    
    OUTPUT:
    training_df -  (pandas df) dataframe of the training set
    validation_df - (pandas df) dataframe of the test set
    '''
    
    my_reviews = reviews.sort_values([order_by], ascending = True)
    my_reviews = my_reviews.iloc[:training_size + testing_size]
    training_df = my_reviews.iloc[:training_size]
    validation_df = my_reviews.iloc[-testing_size:]

    return training_df, validation_df

In [12]:
# Nothing to change in this or the next cell
# Use our function to create training and test datasets
train_df, val_df = create_train_test(reviews, 'date', 8000, 2000)

In [13]:
# Make sure the dataframes we are using are the right shape
assert train_df.shape[0] == 8000, "The number of rows doesn't look right in the training dataset."
assert val_df.shape[0] == 2000, "The number of rows doesn't look right in the validation dataset"
assert str(train_df.tail(1)['date']).split()[1] == '2013-03-15', "The last date in the training dataset doesn't look like what we expected."
assert str(val_df.tail(1)['date']).split()[1] == '2013-03-18', "The last date in the validation dataset doesn't look like what we expected."
print("Nice job!  Looks like you have written a function that provides training and validation dataframes for you to use in the next steps.")

Nice job!  Looks like you have written a function that provides training and validation dataframes for you to use in the next steps.


In the real world, we might have all of the data up to this final date in the training data.  Then we want to see how well we are doing for each of the new ratings, which show up in the test data.
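
As a small illustration of why the split is chronological, the sketch below repeats the sort-then-slice idea on a toy dataframe and checks that no training review is later than any validation review. The data and the `toy_*` names are made up for this example and are not part of the course dataset.

```python
import pandas as pd

# Toy reviews frame (hypothetical data, not the course dataset)
toy_reviews = pd.DataFrame({
    'user_id':  [1, 2, 1, 3, 2, 3],
    'movie_id': [10, 10, 20, 20, 30, 30],
    'rating':   [7, 8, 5, 9, 6, 4],
    'date': pd.to_datetime([
        '2013-03-05', '2013-03-01', '2013-03-03',
        '2013-03-02', '2013-03-06', '2013-03-04',
    ]),
})

# Same idea as create_train_test: sort by time, then slice
ordered = toy_reviews.sort_values('date', ascending=True)
toy_train = ordered.iloc[:4]    # the four earliest reviews
toy_val = ordered.iloc[4:6]     # the two most recent reviews

# Every training review should be no later than every validation review
no_leakage = toy_train['date'].max() <= toy_val['date'].min()
print(no_leakage)
```

A check like this is a cheap way to confirm that the model is only ever evaluated on ratings that arrive after the ones it was trained on.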

Below is a working version of the FunkSVD function from the previous notebook, which you can use as is or replace with your own.

`2.`  Fit the function to the training data with the following hyperparameters: 15 latent features, a learning rate of 0.005, and 250 iterations. This will take some time to run, so you may choose fewer latent features, a higher learning rate, or fewer iterations if you want to speed up the process.  

**Note:** Again, this might be a good time to take a phone call, go for a walk, or just take a little break.  No need to change the code below unless you would like to make changes to reduce the time needed to obtain results.

In [15]:
def FunkSVD(ratings_mat, latent_features=12, learning_rate=0.0001, iters=100):
    '''
    This function performs matrix factorization using a basic form of FunkSVD with no regularization
    
    INPUT:
    ratings_mat - (numpy array) a matrix with users as rows, movies as columns, and ratings as values
    latent_features - (int) the number of latent features used
    learning_rate - (float) the learning rate 
    iters - (int) the number of iterations
    
    OUTPUT:
    user_mat - (numpy array) a user by latent feature matrix
    movie_mat - (numpy array) a latent feature by movie matrix
    '''
    
    # Set up useful values to be used through the rest of the function
    n_users = ratings_mat.shape[0]
    n_movies = ratings_mat.shape[1]
    num_ratings = np.count_nonzero(~np.isnan(ratings_mat))
    
    # initialize the user and movie matrices with random values
    user_mat = np.random.rand(n_users, latent_features)
    movie_mat = np.random.rand(latent_features, n_movies)
    
    # initialize sse at 0 for first iteration
    sse_accum = 0
    
    # keep track of iteration and MSE
    print("Optimizaiton Statistics")
    print("Iterations | Mean Squared Error ")
    
    # for each iteration
    for iteration in range(iters):

        # update our sse
        old_sse = sse_accum
        sse_accum = 0
        
        # For each user-movie pair
        for i in range(n_users):
            for j in range(n_movies):
                
                # if the rating exists
                if ratings_mat[i, j] > 0:
                    
                    # compute the error as the actual minus the dot product of the user and movie latent features
                    diff = ratings_mat[i, j] - np.dot(user_mat[i, :], movie_mat[:, j])
                    
                    # Keep track of the sum of squared errors for the matrix
                    sse_accum += diff**2
                    
                    # update the values in each matrix in the direction of the gradient
                    for k in range(latent_features):
                        user_mat[i, k] += learning_rate * (2*diff*movie_mat[k, j])
                        movie_mat[k, j] += learning_rate * (2*diff*user_mat[i, k])

        # print results
        print("%d \t\t %f" % (iteration+1, sse_accum / num_ratings))
        
    return user_mat, movie_mat 
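
The inner loop of `FunkSVD` above can be read as a stochastic gradient descent step on the squared error of one observed rating. As a sketch, using notation introduced here rather than taken from the notebook ($r_{ij}$ for the observed rating, $u_{ik}$ for `user_mat[i, k]`, $v_{kj}$ for `movie_mat[k, j]`, and $\alpha$ for `learning_rate`):

$$
e_{ij} = r_{ij} - \sum_{k} u_{ik} v_{kj}
$$

$$
u_{ik} \leftarrow u_{ik} + 2\alpha\, e_{ij}\, v_{kj}, \qquad v_{kj} \leftarrow v_{kj} + 2\alpha\, e_{ij}\, u_{ik}
$$

Note that the code computes $e_{ij}$ once per rating and, for each $k$, updates $u_{ik}$ first, so the update of $v_{kj}$ uses the new value of $u_{ik}$.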

In [16]:
# Create user-by-item matrix - nothing to do here
train_user_item = train_df[['user_id', 'movie_id', 'rating', 'timestamp']]
train_data_df = train_user_item.groupby(['user_id', 'movie_id'])['rating'].max().unstack()
train_data_np = np.array(train_data_df)

# Fit FunkSVD with the specified hyper parameters to the training data
user_mat, movie_mat = FunkSVD(train_data_np, latent_features=15, learning_rate=0.005, iters=250)

Optimizaiton Statistics
Iterations | Mean Squared Error 
1 		 10.771641
2 		 6.036877
3 		 4.211930
4 		 3.152025
5 		 2.457491
6 		 1.967765
7 		 1.605260
8 		 1.327665
9 		 1.109722
10 		 0.935338
11 		 0.793756
12 		 0.677507
13 		 0.581217
14 		 0.500891
15 		 0.433476
16 		 0.376583
17 		 0.328325
18 		 0.287196
19 		 0.251988
20 		 0.221729
21 		 0.195631
22 		 0.173050
23 		 0.153454
24 		 0.136404
25 		 0.121532
26 		 0.108528
27 		 0.097130
28 		 0.087118
29 		 0.078302
30 		 0.070522
31 		 0.063639
32 		 0.057537
33 		 0.052114
34 		 0.047284
35 		 0.042972
36 		 0.039115
37 		 0.035657
38 		 0.032551
39 		 0.029756
40 		 0.027235
41 		 0.024958
42 		 0.022898
43 		 0.021030
44 		 0.019335
45 		 0.017795
46 		 0.016392
47 		 0.015114
48 		 0.013948
49 		 0.012882
50 		 0.011907
51 		 0.011015
52 		 0.010197
53 		 0.009447
54 		 0.008758
55 		 0.008125
56 		 0.007542
57 		 0.007006
58 		 0.006512
59 		 0.006057
60 		 0.005637
61 		 0.005249
62 		 0.004890
63 		 0.004559
64 		 
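
The user-by-item matrix passed to `FunkSVD` above is built with `groupby(...).max().unstack()`. A minimal sketch on toy data (the `toy_*` names and values are hypothetical) shows how every user-movie pair without a rating becomes `NaN`, and how the observed ratings can be counted the same way `FunkSVD` does:

```python
import numpy as np
import pandas as pd

# Toy training ratings (hypothetical data)
toy_train = pd.DataFrame({
    'user_id':  [1, 1, 2, 3],
    'movie_id': [10, 20, 10, 30],
    'rating':   [7, 5, 8, 9],
})

# One row per user, one column per movie; missing pairs become NaN
toy_user_item = toy_train.groupby(['user_id', 'movie_id'])['rating'].max().unstack()
toy_np = np.array(toy_user_item)

print(toy_user_item)

# Number of observed (non-NaN) ratings, as in FunkSVD's num_ratings
toy_num_ratings = np.count_nonzero(~np.isnan(toy_np))
print(toy_num_ratings)
```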

Now that you have created the **user_mat** and **movie_mat**, we can use them to predict how users would rate movies by computing the dot product of the row associated with a user and the column associated with the movie.
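
To make the row-times-column idea concrete, here is a small sketch with random toy matrices (the `toy_*` names are hypothetical stand-ins for `user_mat` and `movie_mat`). It shows that a single prediction is one entry of the full matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)
toy_user_mat = rng.random((4, 3))   # 4 users x 3 latent features
toy_movie_mat = rng.random((3, 5))  # 3 latent features x 5 movies

# One prediction: a row of the user matrix dotted with a column of the movie matrix
single_pred = np.dot(toy_user_mat[1, :], toy_movie_mat[:, 2])

# All predictions at once: the full matrix product (users x movies)
all_preds = toy_user_mat @ toy_movie_mat

print(all_preds.shape)
print(np.isclose(single_pred, all_preds[1, 2]))
```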

`3.` Use the comments in the function below to complete the **predict_rating** function.

In [31]:
movie_mat.shape

(15, 2679)

In [32]:
user_mat.shape

(3278, 15)

In [35]:
len(user_mat[0])

15

In [37]:
len(movie_mat[:,0])

15

In [40]:
np.dot(user_mat[0], movie_mat[:,0])

6.666900093319603

In [103]:
def predict_rating(user_matrix, movie_matrix, user_id, movie_id):
    '''
    INPUT:
    user_matrix - user by latent factor matrix
    movie_matrix - latent factor by movie matrix
    user_id - the user_id from the reviews df
    movie_id - the movie_id according the movies df
    
    OUTPUT:
    pred - the predicted rating for user_id-movie_id according to FunkSVD
    '''
    # Use the training data to create a series of users and movies that matches the ordering in training data
    movies = pd.DataFrame(train_data_df.columns)
    users = pd.DataFrame(train_data_df.index)
    
    # User row and Movie Column
    column = movies[movies['movie_id']==movie_id].index[0]
    row = users[users['user_id']==user_id].index[0]
    
    # Take dot product of that row and column in U and V to make prediction
    
    pred = np.dot(user_matrix[row], movie_matrix[:, column])
   
    return pred

In [104]:
# Test your function with the first user-movie in the user-movie matrix (notice this is a nan)
pred_val = predict_rating(user_mat, movie_mat, 8, 2844)
pred_val

6.666900093319603
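
`predict_rating` above builds two dataframes on every call just to find the row and column positions. One possible alternative, sketched here on toy data, is to look the positions up with pandas `Index.get_loc`. The function name `predict_rating_by_index` and all `toy_*` values are hypothetical; in the notebook the two indexes would correspond to `train_data_df.index` and `train_data_df.columns`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_data_df.index and train_data_df.columns
toy_user_ids = pd.Index([8, 46, 48], name='user_id')
toy_movie_ids = pd.Index([2844, 8133, 13427], name='movie_id')

toy_user_mat = np.arange(6, dtype=float).reshape(3, 2)   # 3 users x 2 features
toy_movie_mat = np.arange(6, dtype=float).reshape(2, 3)  # 2 features x 3 movies

def predict_rating_by_index(user_matrix, movie_matrix, user_ids, movie_ids,
                            user_id, movie_id):
    '''Hypothetical variant of predict_rating that uses Index.get_loc.'''
    row = user_ids.get_loc(user_id)     # position of the user in the matrix rows
    col = movie_ids.get_loc(movie_id)   # position of the movie in the matrix columns
    return np.dot(user_matrix[row, :], movie_matrix[:, col])

toy_pred = predict_rating_by_index(toy_user_mat, toy_movie_mat,
                                   toy_user_ids, toy_movie_ids, 46, 13427)
print(toy_pred)
```

Like the original function, this version fails when an id is missing from the training data; `get_loc` raises a `KeyError` in that case.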

In [82]:
a = pd.DataFrame(train_data_df.columns).head()
a

Unnamed: 0,movie_id
0,2844
1,8133
2,13427
3,14142
4,14538


In [92]:
a[a['movie_id']==13427].index[0]

2

In [78]:
pd.DataFrame(train_data_df.index).head()

Unnamed: 0,user_id
0,8
1,46
2,48
3,51
4,66


In [143]:
movies[movies['movie_id']==2844].movie.tolist()[0]

"Fantômas - À l'ombre de la guillotine (1913)"

In [136]:
movies.movie_id.unique()[:200]

array([    8,    10,    12,    25,    91,   417,   439,   443,   628,
         833,  1223,  1740,  2101,  2130,  2354,  2844,  3740,  3863,
        4099,  4100,  4101,  4210,  4395,  4457,  4518,  4546,  4936,
        4972,  5074,  5078,  5530,  5571,  5960,  6177,  6206,  6414,
        6437,  6684,  6689,  6864,  7145,  7162,  7264,  7507,  7832,
        7880,  8133,  8395,  9018,  9340,  9678,  9893,  9968, 10247,
       10258, 10323, 10806, 10930, 11130, 11267, 11439, 11508, 11541,
       11607, 11656, 11717, 11841, 11870, 12224, 12278, 12349, 12364,
       12494, 12532, 12651, 12844, 13025, 13086, 13099, 13140, 13257,
       13427, 13442, 13486, 13626, 13741, 13858, 14142, 14341, 14390,
       14417, 14429, 14532, 14538, 14624, 14664, 14872, 14972, 15002,
       15064, 15163, 15174, 15175, 15233, 15310, 15324, 15361, 15400,
       15532, 15624, 15648, 15768, 15772, 15864, 15881, 16039, 16172,
       16220, 16230, 16332, 16544, 16600, 16847, 16903, 16954, 17075,
       17136, 17196,

It is great that you now have a way to make predictions. However, it might be nice to get back a short phrase about the user, movie, and rating.

`4.` Use the comments in the function below to complete the **print_prediction_summary** function.  

**Note:** The movie name doesn't come back in a great format, so you can see in the solution I messed around with it a bit just to make it a little nicer.

In [153]:
def print_prediction_summary(user_id, movie_id, prediction):
    '''
    INPUT:
    user_id - the user_id from the reviews df
    movie_id - the movie_id according the movies df
    prediction - the predicted rating for user_id-movie_id
    
    OUTPUT:
    None - prints a statement about the user, movie, and prediction made
    
    '''
    movie_name = movies[movies['movie_id']==movie_id].movie.tolist()[0]
    print('Prediction for user {} for movie \"{}\" is {}'.format(user_id, movie_name, prediction))


In [154]:
# Test your function with the results of the previous function
print_prediction_summary(8, 2844, pred_val)

Prediction for user 8 for movie "Fantômas - À l'ombre de la guillotine (1913)" is 6.666900093319603


Now that we have the ability to make predictions, let's see how well our predictions do on the test ratings we already have.  This will give an indication of how well we have captured the latent features, and our ability to use the latent features to make predictions in the future!

`5.` For each user-movie rating in the **val_df** dataset, compare the actual rating to the prediction you would make.  How do your predictions do?  Do you run into any problems?  If so, what is the problem?  Use the docstrings and comments below to assist as you work through these questions.

In [158]:
val_df.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
650588,49056,1598822,8,1363308721,2013-03-15 00:52:01,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
650569,49056,289879,9,1363308742,2013-03-15 00:52:22,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
650585,49056,1563738,9,1363308780,2013-03-15 00:53:00,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
650583,49056,1458175,4,1363308799,2013-03-15 00:53:19,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
378686,28599,103639,8,1363309112,2013-03-15 00:58:32,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [163]:
val_df.iloc[0]['user_id']

49056

In [174]:
def validation_comparison(val_df, num_preds):
    '''
    INPUT:
    val_df - the validation dataset created in the third cell above
    num_preds - (int) the number of rows (going in order) you would like to make predictions for
    
    OUTPUT:
    Nothing returned - prints a statement about the prediction made for each of the first num_preds rows of val_df
    '''
    for row in range(num_preds):
        movie_id = val_df.iloc[row]['movie_id']
        user_id = val_df.iloc[row]['user_id']
        prediction = predict_rating(user_mat, movie_mat, user_id, movie_id)
        actual = val_df.iloc[row]['rating']
        print('Prediction for user {} for movie {} is {}, actual value is {}'.format(user_id, movie_id, prediction, actual))
        
# Perform the predicted vs. actual for the first 6 rows.  How does it look?
validation_comparison(val_df, 6)        

Prediction for user 49056 for movie 1598822 is 6.06898390369978, actual value is 8
Prediction for user 49056 for movie 289879 is 7.546983911666807, actual value is 9
Prediction for user 49056 for movie 1563738 is 7.38501235565069, actual value is 9
Prediction for user 49056 for movie 1458175 is 6.190076867743516, actual value is 4
Prediction for user 28599 for movie 103639 is 7.657151331297593, actual value is 8
Prediction for user 50593 for movie 1560985 is 5.136406819997321, actual value is 4
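
Reading predictions row by row gives a feel for the errors, but a single summary number is easier to compare across runs. As a sketch, the snippet below computes the root mean squared error (RMSE) and mean absolute error (MAE) for the six rows printed above, with the predictions rounded to three decimals. The variable names are chosen for this example only.

```python
import numpy as np

# Predicted and actual ratings copied from the six rows printed above
# (predictions rounded to three decimals)
predicted = np.array([6.069, 7.547, 7.385, 6.190, 7.657, 5.136])
actual = np.array([8, 9, 9, 4, 8, 4])

errors = actual - predicted

# Root mean squared error penalizes large misses more heavily
rmse = np.sqrt(np.mean(errors ** 2))

# Mean absolute error is the average size of a miss, in rating points
mae = np.mean(np.abs(errors))

print(round(rmse, 3), round(mae, 3))
```

Six rows are far too few to judge the model; the same calculation over all rows of the validation set that can be predicted would be more informative.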


In [175]:
# Perform the predicted vs. actual for the first 7 rows.  What happened?
validation_comparison(val_df, 7)        

Prediction for user 49056 for movie 1598822 is 6.06898390369978, actual value is 8
Prediction for user 49056 for movie 289879 is 7.546983911666807, actual value is 9
Prediction for user 49056 for movie 1563738 is 7.38501235565069, actual value is 9
Prediction for user 49056 for movie 1458175 is 6.190076867743516, actual value is 4
Prediction for user 28599 for movie 103639 is 7.657151331297593, actual value is 8
Prediction for user 50593 for movie 1560985 is 5.136406819997321, actual value is 4


IndexError: index 0 is out of bounds for axis 0 with size 0

**The first 6 rows completed without issue, but there was an error in the 7th row. Why do you think that happened?**

In [169]:
val_df[:7]

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
650588,49056,1598822,8,1363308721,2013-03-15 00:52:01,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
650569,49056,289879,9,1363308742,2013-03-15 00:52:22,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
650585,49056,1563738,9,1363308780,2013-03-15 00:53:00,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
650583,49056,1458175,4,1363308799,2013-03-15 00:53:19,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
378686,28599,103639,8,1363309112,2013-03-15 00:58:32,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
667660,50593,1560985,4,1363309202,2013-03-15 01:00:02,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
385306,29000,287978,9,1363309214,2013-03-15 01:00:14,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [172]:
train_df[train_df['user_id']==29000]

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
385337,29000,1509767,8,1362585272,2013-03-06 15:54:32,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
385328,29000,1045658,10,1363305347,2013-03-14 23:55:47,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
385304,29000,137523,2,1363306471,2013-03-15 00:14:31,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
385334,29000,1375666,10,1363306502,2013-03-15 00:15:02,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
385302,29000,101414,7,1363306813,2013-03-15 00:20:13,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [173]:
train_df[train_df['movie_id']==287978]

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018


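Movie 287978 is not in the training set, so `predict_rating` finds no matching column for it and `.index[0]` raises the `IndexError` shown above. User 29000, by contrast, does appear in the training data.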
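
One way to keep a validation loop from crashing on unseen users or movies is to check membership before looking up positions. The sketch below is a hypothetical variant on toy data, not the notebook's solution: `safe_predict_rating` and the `toy_*` names are made up, and returning `None` for unknown ids is just one possible convention.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_data_df.index and train_data_df.columns
toy_user_ids = pd.Index([8, 46, 48], name='user_id')
toy_movie_ids = pd.Index([2844, 8133, 13427], name='movie_id')
toy_user_mat = np.ones((3, 2))          # 3 users x 2 latent features
toy_movie_mat = np.full((2, 3), 2.0)    # 2 latent features x 3 movies

def safe_predict_rating(user_matrix, movie_matrix, user_ids, movie_ids,
                        user_id, movie_id):
    '''Return a prediction, or None if the user or movie was not in training.'''
    if user_id not in user_ids or movie_id not in movie_ids:
        return None
    row = user_ids.get_loc(user_id)
    col = movie_ids.get_loc(movie_id)
    return float(np.dot(user_matrix[row, :], movie_matrix[:, col]))

# A movie seen in training gets a prediction
known = safe_predict_rating(toy_user_mat, toy_movie_mat,
                            toy_user_ids, toy_movie_ids, 8, 2844)

# A movie absent from training (like 287978 above) gets None instead of an error
unseen = safe_predict_rating(toy_user_mat, toy_movie_mat,
                             toy_user_ids, toy_movie_ids, 8, 287978)

print(known, unseen)
```

Skipping these rows only hides the problem; FunkSVD on its own has nothing to say about users or movies it has never seen, so they need a different strategy.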