# Movie Recommender II

We will build a movie recommendation system to learn scipy's sparse array handling. We'll use data from the [GroupLens](https://grouplens.org) project, an effort by the University of Minnesota Department of Computer Science.  Among other things, GroupLens researches recommender systems.

In [0]:
#requisite libraries:
import numpy as np
import pandas as pd
from urllib.request import urlopen


## Data Files
The data for this exercise comes from publicly available datasets maintained and provided by GroupLens.  The data files are available [here](http://grouplens.org/datasets/).  Please see the [Readme](http://files.grouplens.org/datasets/movielens/ml-latest-README.html) for a description of the complete dataset.

For our purposes, we need only two files from the complete dataset.  These two files, nearly 1 GB in size, have been downloaded, read into pandas dataframes, pickled, and uploaded to AWS S3 storage so that they are easy to access.

**These files may disappear. If you intend to continue using this data after tonight's session, please get your own copy from the source.**

For this exercise, we need the movies.csv file and the ratings.csv file.

### Movies

In [0]:
# read pickled data files
path = 'https://focods.s3.us-east-2.amazonaws.com/movies.pcl'
movies = pd.read_pickle(urlopen(path), compression=None)


In [3]:
movies.head()

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


In [4]:
len(movies)

58098

### Ratings

In [0]:
#Ratings: (takes about 20 seconds)
path = 'https://focods.s3.us-east-2.amazonaws.com/ratings.pcl'
ratings = pd.read_pickle(urlopen(path), compression=None)


In [6]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,307,3.5,2009-10-27 21:00:21
1,1,481,3.5,2009-10-27 21:04:16
2,1,1091,1.5,2009-10-27 21:04:31
3,1,1257,4.5,2009-10-27 21:04:20
4,1,1449,4.5,2009-10-27 21:01:04


In [7]:
len(ratings)

27753444

From the [Readme](http://files.grouplens.org/datasets/movielens/ml-latest-README.html), the definition of the rating column is:
> Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).


## Partition Dataset: Training and Test

Partition on the basis of users: each user is either a training user or a test user.  Each rating then goes to the training or test set according to the partition of the user who provided it.

### Assign Users to Training or Test Partition

Keep the training partition whole for now. Don't break out a validation partition; save that for later in case we do cross-validation.

In [8]:
train_pct = 0.7 # 70% for training
np.random.seed(1234)
users = ratings.userId.unique()
user_partition = np.where(np.random.random(len(users)) < train_pct,'train','test')

np.unique(user_partition, return_counts=True)

(array(['test', 'train'], dtype='<U5'), array([ 84628, 198600]))

In [0]:
# make a look up table user_id -> status
partition_lookup = pd.DataFrame({'Partition': user_partition}, index=users)

In [10]:
# tack the user's partition onto the ratings data frame
ratings['partition'] = partition_lookup.loc[ratings.userId].Partition.values
ratings.partition.value_counts(normalize=True)

train    0.700547
test     0.299453
Name: partition, dtype: float64

### Cut the Ratings by User's Partition

In [0]:
train_ratings = ratings.query("partition == 'train'")
test_ratings = ratings.query("partition == 'test'")

In [12]:
# look at the ratings for an arbitrary user
train_ratings.query('userId == 283225').sort_values('timestamp')

,userId,movieId,rating,timestamp,partition
27753288,283225,784,3.0,2006-02-08 00:19:50,train
27753298,283225,2599,4.0,2006-02-08 00:19:54,train
27753292,283225,1380,2.5,2006-02-08 00:20:15,train
27753293,283225,1500,4.0,2006-02-08 00:20:19,train
27753287,283225,370,3.0,2006-02-08 00:20:21,train
27753284,283225,151,2.5,2006-02-08 00:20:57,train
27753297,283225,2406,3.0,2006-02-08 00:21:01,train
27753283,283225,112,2.5,2006-02-08 00:21:03,train
27753291,283225,1288,3.0,2006-02-08 00:21:05,train
27753290,283225,1252,4.0,2006-02-08 00:21:18,train


### Create Response and Predictor Variables

Each user's most recent rating is the response; the earlier ratings are the predictors. Both are returned as CSR matrices.

Dimension all of the arrays as n_users by n_movies so that the userId and movieId values can be used directly as indices into the arrays.

In [0]:
n_users = ratings.userId.max()+1
n_movies = ratings.movieId.max()+1
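
To illustrate why the dimensions are set to the maximum ID plus one, here is a minimal sketch on made-up data (the names `user_ids`, `movie_ids`, `values`, and `toy` are illustrative and not part of the notebook). Row 0 and column 0 simply stay empty, and every ID can be used as an index without any remapping.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy ratings as (userId, movieId, rating) triplets; IDs start at 1 and have gaps.
user_ids = np.array([1, 1, 3])
movie_ids = np.array([2, 5, 5])
values = np.array([3.5, 4.0, 1.5])

# Size each axis as max ID + 1 so an ID is directly usable as an index.
shape = (user_ids.max() + 1, movie_ids.max() + 1)
toy = csr_matrix((values, (user_ids, movie_ids)), shape=shape)

print(toy.shape)   # (4, 6)
print(toy[3, 5])   # 1.5
print(toy.nnz)     # 3
```

The cost of this convenience is a few unused rows and columns, which a sparse format stores essentially for free.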

In [0]:
from scipy.sparse import csr_matrix

def get_preds_response(df, shape):
  """
  splits the argument data frame into predictor and response sparse matrices.
  The response is the latest movieId and rating for each user in the arg df.
  The predictors are the movieId and ratings of all the other movies reviewed
  by each user.
  The return values are csr_matrices of shape user x movie
  """
  # get the index of each user's latest rating
  indx = df.groupby('userId').timestamp.idxmax()
  # get the rows of the data frame corresponding to each user's latest rating
  y_df = df.loc[indx]
  
  # exclude the latest ratings from the input data frame
  x_df = df.drop(indx)
  
  # turn them into sparse matrices
  preds = csr_matrix((x_df.rating, (x_df.userId, x_df.movieId)), shape=shape)
  resp  = csr_matrix((y_df.rating, (y_df.userId, y_df.movieId)), shape=shape)
  
  return preds, resp

In [0]:
x_train, y_train = get_preds_response(train_ratings, (n_users, n_movies))

Both x_train and y_train are of dimension users x movies.
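
The latest-rating split inside `get_preds_response` can be checked on a tiny frame. This is only a sketch with invented data (`toy_df`, `latest`, `x_toy`, and `y_toy` are hypothetical names); it repeats the `groupby` / `idxmax` / `drop` steps without building the sparse matrices.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'userId':    [1, 1, 1, 2, 2],
    'movieId':   [10, 11, 12, 10, 13],
    'rating':    [3.0, 4.5, 2.0, 5.0, 1.0],
    'timestamp': pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-02',
                                 '2020-02-01', '2020-02-02']),
})

# Row label of each user's most recent rating.
latest = toy_df.groupby('userId').timestamp.idxmax()

y_toy = toy_df.loc[latest]   # one response row per user
x_toy = toy_df.drop(latest)  # everything else is a predictor

print(y_toy[['userId', 'movieId']].values.tolist())  # [[1, 11], [2, 13]]
print(len(x_toy))                                    # 3
```

Because there is exactly one response row per user, the response matrix ends up with one stored element per user in the partition.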

In [16]:
x_train

<283229x193887 sparse matrix of type '<class 'numpy.float64'>'
	with 19243991 stored elements in Compressed Sparse Row format>

In [17]:
y_train

<283229x193887 sparse matrix of type '<class 'numpy.float64'>'
	with 198600 stored elements in Compressed Sparse Row format>

In [0]:
# likewise split out the test data
x_test, y_test = get_preds_response(test_ratings, (n_users, n_movies))

In [19]:
x_test

<283229x193887 sparse matrix of type '<class 'numpy.float64'>'
	with 8226225 stored elements in Compressed Sparse Row format>

In [20]:
y_test

<283229x193887 sparse matrix of type '<class 'numpy.float64'>'
	with 84628 stored elements in Compressed Sparse Row format>

## Create User-Item Collaborative Filter


1.   Segment training users into $n$ categories based on their previous reviews (all but the last review) using latent Dirichlet allocation (LDA)
2. Identify the top 5 or 10 movies that characterize each of  the $n$ categories
3. Predict by:
  1.  Assign test user to most likely category based on viewing history (predictors)
  2. Find first movie in the category that user hasn't seen; **this is the prediction**
  3. Compare prediction to response (the last movie the user rated.)
4. Compute overall Test and Dev accuracy measures
5. Maybe iterate over $n$ to see impact on accuracy measures
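
Step 3.2 of the plan above could be sketched as follows. The function `first_unseen` and its arguments are hypothetical; the sketch assumes the segment is described by a vector of per-movie weights (for example one row of an LDA `components_` array) and that the user's history is available as a boolean mask.

```python
import numpy as np

def first_unseen(segment_scores, seen_mask):
    """Return the index of the highest-scoring movie the user has not rated.

    segment_scores: 1-D array of per-movie weights for one segment.
    seen_mask:      1-D boolean array, True where the user already rated.
    Returns None if every movie has been seen.
    """
    ranked = np.argsort(segment_scores)[::-1]  # best movie first
    unseen = ranked[~seen_mask[ranked]]        # keep ranking order, drop seen movies
    return int(unseen[0]) if unseen.size else None

scores = np.array([0.1, 0.9, 0.4, 0.7])
seen = np.array([False, True, False, False])
print(first_unseen(scores, seen))  # 3
```

Here movie 1 has the highest weight but has already been rated, so the next-ranked movie (index 3) is returned.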




### Some helper routines

In [0]:
def find_movies(title_regexp):
  """
  searches the list of movies with a reg exp
  """
  return movies.title[movies.title.str.match(title_regexp, case=False)]

In [22]:
find_movies('.*toy story.*')

movieId
1                                  Toy Story (1995)
3114                             Toy Story 2 (1999)
78499                            Toy Story 3 (2010)
106022                   Toy Story of Terror (2013)
115875    Toy Story Toons: Hawaiian Vacation (2011)
115879            Toy Story Toons: Small Fry (2011)
120468      Toy Story Toons: Partysaurus Rex (2012)
120474            Toy Story That Time Forgot (2014)
Name: title, dtype: object

In [0]:
def get_movieId(moviename):
  """
  return the movie id given the title/moviename
  """
  mvindex = movies.query('title == @moviename').index
  if len(mvindex) == 0:
    raise Exception(f'Invalid movie name: {moviename}')
  return mvindex[0] # just the first one in case there were multiples
  

In [0]:
# for lda, print the top words (movies) driving each category/segment
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += "; ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

In [25]:
get_movieId('Toy Story 3 (2010)')

78499

In [26]:
movies.loc[78499]

title                                   Toy Story 3 (2010)
genres    Adventure|Animation|Children|Comedy|Fantasy|IMAX
Name: 78499, dtype: object

In [0]:
from sklearn.decomposition import LatentDirichletAllocation

In [0]:
# Parameters for the lda model
#n_samples = 2000
#n_features = 1000
n_components = 20 # number of segments to assign users to
n_top_words = 20

### Don't Run This Code! - Fit the LDA model

...unless you have an hour to spare! This code ran in 3517.131 seconds.

In [0]:
#lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
#                                learning_method='online',
#                                learning_offset=50.,
#                                random_state=0)

In [0]:
#from time import time
#t0 = time()
#lda.fit(x_train)
#print("done in %0.3fs." % (time() - t0))

In [0]:
#from google.colab import drive
#drive.mount('/content/drive')


In [0]:
#!ls /content/drive/'My Drive'

In [0]:
# save the model as a pickle
#import pickle
#pcl = open('/content/drive/My Drive/lda.pcl','wb')
#pickle.dump(lda, pcl)
#pcl.close()

### Read the Pickled Model


In [0]:
# read pickled lda model

import pickle
path = 'https://focods.s3.us-east-2.amazonaws.com/lda.pcl'

lda = pickle.load( urlopen(path) ) 

In [34]:
lda

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='online', learning_offset=50.0,
                          max_doc_update_iter=100, max_iter=5,
                          mean_change_tol=0.001, n_components=20, n_jobs=None,
                          perp_tol=0.1, random_state=0, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [0]:
movie_names =movies['title'].to_dict()

In [36]:
# show which movies drive each segment
print_top_words(lda, movie_names, n_top_words)

Topic #0: Rashomon (Rashômon) (1950); Yojimbo (1961); 400 Blows, The (Les quatre cents coups) (1959); Double Indemnity (1944); All the President's Men (1976); Wild Strawberries (Smultronstället) (1957); Modern Times (1936); M (1931); Seventh Seal, The (Sjunde inseglet, Det) (1957); Third Man, The (1949); Maltese Falcon, The (a.k.a. Dangerous Female) (1931); Conversation, The (1974); Breathless (À bout de souffle) (1960); Seven Samurai (Shichinin no samurai) (1954); City Lights (1931); Sunset Blvd. (a.k.a. Sunset Boulevard) (1950); Hannah and Her Sisters (1986); Ran (1985); Night at the Opera, A (1935); Gold Rush, The (1925)
Topic #1: Arrival (2016); Logan (2017); Guardians of the Galaxy 2 (2017); Rogue One: A Star Wars Story (2016); Ex Machina (2015); Blade Runner 2049 (2017); The Martian (2015); Get Out (2017); Deadpool (2016); Doctor Strange (2016); Mad Max: Fury Road (2015); The Revenant (2015); Interstellar (2014); Zootopia (2016); The Hateful Eight (2015); Gone Girl (2014); Thor: 

In [37]:
lda.components_.shape

(20, 193887)

## Assign Users to Segments

### Take a look at an Arbitrary User

In [38]:
#take a look at an arbitrary user's ratings:
train_ratings.query('userId == 283225')

,userId,movieId,rating,timestamp,partition
27753283,283225,112,2.5,2006-02-08 00:21:03,train
27753284,283225,151,2.5,2006-02-08 00:20:57,train
27753285,283225,163,3.5,2006-02-08 00:21:20,train
27753286,283225,172,2.0,2006-02-08 00:21:54,train
27753287,283225,370,3.0,2006-02-08 00:20:21,train
27753288,283225,784,3.0,2006-02-08 00:19:50,train
27753289,283225,1234,4.0,2006-02-08 00:22:08,train
27753290,283225,1252,4.0,2006-02-08 00:21:18,train
27753291,283225,1288,3.0,2006-02-08 00:21:05,train
27753292,283225,1380,2.5,2006-02-08 00:20:15,train


In [0]:
def show_train_ratings(uid):
  return train_ratings.query('userId == @uid')\
            .merge(movies, left_on="movieId", right_index=True)[['title','rating']]\
            .reset_index(drop=True)

In [40]:
uid = 283225
show_train_ratings(uid)

,title,rating
0,Rumble in the Bronx (Hont faan kui) (1995),2.5
1,Rob Roy (1995),2.5
2,Desperado (1995),3.5
3,Johnny Mnemonic (1995),2.0
4,Naked Gun 33 1/3: The Final Insult (1994),3.0
5,"Cable Guy, The (1996)",3.0
6,"Sting, The (1973)",4.0
7,Chinatown (1974),4.0
8,This Is Spinal Tap (1984),3.0
9,Grease (1978),2.5


### Calculate this User's Segment Affinity

In [41]:
#get this user's training row vector
u_train = x_train.getrow(uid)
u_train

<1x193887 sparse matrix of type '<class 'numpy.float64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [42]:
lda.components_.shape

(20, 193887)

In [43]:
from time import time
t0 = time()
u_class = lda.fit_transform(u_train)
print("done in %0.3fs." % (time() - t0))

done in 2.118s.




In [44]:
u_class

array([[8.26446287e-04, 8.26446287e-04, 8.26446287e-04, 8.26446287e-04,
        9.84297521e-01, 8.26446287e-04, 8.26446287e-04, 8.26446287e-04,
        8.26446287e-04, 8.26446287e-04, 8.26446287e-04, 8.26446287e-04,
        8.26446287e-04, 8.26446287e-04, 8.26446287e-04, 8.26446287e-04,
        8.26446287e-04, 8.26446287e-04, 8.26446287e-04, 8.26446287e-04]])

u_class is the probability that this user belongs to each of the 20 segments

In [45]:
u_class.sum()

1.0

In [46]:
import altair as alt
xx =pd.DataFrame( {'Probability': u_class[0], 'Segment':np.arange(n_components)})
alt.Chart(xx).mark_bar().encode(
  x='Segment',
  y='Probability',

  #color='UserCount'
).properties(
    title='Probability of Segment Membership')

In [0]:
def show_segment(mod, movie_names, segno, max_n=20):
  movie_idx = np.flip(mod.components_[segno].argsort())
  #return movie_idx
  return [movie_names[i] for i in movie_idx[:max_n]]

In [0]:
# lda.fit_transform(u_train) above refits the model on that single user,
# which overwrites lda.components_; lda.transform(u_train) would score the
# user without refitting.
# for now, just reload the fitted model

# read pickled lda model

import pickle
path = 'https://focods.s3.us-east-2.amazonaws.com/lda.pcl'

lda = pickle.load( urlopen(path) ) 
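
The reload is needed because scikit-learn's `fit_transform` fits the estimator before transforming, so calling it on one user's row replaces the learned topics. The sketch below shows the difference on a tiny made-up count matrix (`X`, `one_user`, and `toy_lda` are illustrative names, and the batch learning method is chosen only to keep the toy run small).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Tiny stand-in for the user x movie matrix (6 users x 5 movies).
X = np.array([[3, 2, 0, 0, 1],
              [4, 1, 0, 0, 0],
              [0, 0, 5, 3, 0],
              [0, 1, 4, 4, 0],
              [2, 3, 0, 1, 2],
              [0, 0, 3, 5, 1]])
one_user = X[:1]

toy_lda = LatentDirichletAllocation(n_components=2, learning_method='batch',
                                    random_state=0).fit(X)
fitted = toy_lda.components_.copy()

# transform() only scores the row against the existing topics.
probs = toy_lda.transform(one_user)
unchanged = np.allclose(fitted, toy_lda.components_)

# fit_transform() refits on the single row first, replacing components_.
toy_lda.fit_transform(one_user)
changed = not np.allclose(fitted, toy_lda.components_)

print(unchanged, changed)
```

Under these assumptions both flags should come out true: `transform` leaves `components_` alone, while `fit_transform` on a single row overwrites it.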

In [49]:
show_segment(lda, movie_names,3)

['Aliens (1986)',
 'Alien (1979)',
 'Terminator, The (1984)',
 'Total Recall (1990)',
 'Terminator 2: Judgment Day (1991)',
 'Blade Runner (1982)',
 'Star Trek II: The Wrath of Khan (1982)',
 'Abyss, The (1989)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Army of Darkness (1993)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'RoboCop (1987)',
 'Thing, The (1982)',
 'Predator (1987)',
 'Star Wars: Episode IV - A New Hope (1977)',
 'Star Trek IV: The Voyage Home (1986)',
 'Highlander (1986)',
 'Fifth Element, The (1997)',
 'Star Trek: First Contact (1996)',
 'Road Warrior, The (Mad Max 2) (1981)']

### Don't Run This Code! - Compute Segment Membership for All Training Users

...because it takes the better part of an hour! The code ran in 3573.096 seconds.

In [0]:
#t0 = time()
#user_segs = lda.fit_transform(x_train)
#print("done in %0.3fs." % (time() - t0))



In [0]:
# save the user segment matrix as a pickle
#import pickle
#pcl = open('/content/drive/My Drive/user_segs.pcl','wb')
#pickle.dump(user_segs, pcl)
#pcl.close()

### Load Pre-computed User Segment Matrix

In [0]:
# load pickled user_segs matrix
path = 'https://focods.s3.us-east-2.amazonaws.com/user_segs.pcl'

user_segs = pickle.load( urlopen(path) ) 


In [52]:
user_segs.shape

(283229, 20)

In [0]:
# find some users that don't have really extreme preferences
user_segs_max = user_segs.max(axis=1)
user_segs_min = user_segs.min(axis=1)
user_segs_rng = user_segs_max - user_segs_min

In [54]:
np.where(user_segs_rng == user_segs_rng[user_segs_rng != 0].min())

(array([187432]),)

In [55]:
np.where(user_segs_rng == user_segs_rng[user_segs_rng > 0.5].min())

(array([276890]),)

In [95]:
xx = pd.concat([
  pd.DataFrame( {'userId':'187432','Probability': user_segs[187432], 'Segment':np.arange(n_components)}),
  pd.DataFrame( {'userId':'283225','Probability': user_segs[283225], 'Segment':np.arange(n_components)}),
  pd.DataFrame( {'userId':'276890','Probability': user_segs[276890], 'Segment':np.arange(n_components)})])  
alt.Chart(xx).mark_bar().encode(
  x='Probability',
  y='userId',

  color='userId',
  row = 'Segment'
).properties(
    title='Probability of Segment Membership: Three Users')

In [57]:
x_train.shape

(283229, 193887)

## Average Rating: Movie by Segment

In [0]:
# make hard assignments of users to segments
# assign the user to segment of highest probability of membership
user_segs_max = user_segs.argmax(axis=1) # index of the row-wise maximum

# make one-hot encoded,
#see https://stackoverflow.com/questions/29831489/convert-array-of-indices-to-1-hot-encoded-numpy-array
user_segs_onehot = np.eye(n_components)[user_segs_max]
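
The identity-matrix trick used above is easier to see on a small example. This sketch uses made-up labels (`labels`, `n_segments`, and `onehot` are illustrative names).

```python
import numpy as np

labels = np.array([0, 2, 1, 2])   # hard segment assignment per user
n_segments = 3

# Row i of the identity matrix is the one-hot vector for segment i,
# so indexing with the label array stacks the matching rows.
onehot = np.eye(n_segments)[labels]

print(onehot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```

Taking `argmax` along axis 1 of the result recovers the original labels, which is a quick way to validate the encoding.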

### Some Validation

In [59]:
user_segs_max[:10]

array([ 0, 17, 10, 16,  0,  0,  4, 17,  0,  0])

Users 1 and 7 belong to segment 17, so we should see a 1 at column index 17 in rows 1 and 7 of the one-hot matrix

In [60]:
user_segs_onehot[:10]

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.]])

...which they do. Yay!

In [61]:


# plot the number of users in each segment
import altair as alt
xx = pd.Series(user_segs_max).value_counts()\
      .rename_axis('Segment')\
      .reset_index(name='UserCount')

alt.Chart(xx).mark_bar().encode(
  x='Segment',
  y='UserCount',

  #color='UserCount'
).properties(
    title='UserCount by Segment')

### Calculate Total and Ratings Count by Segment

In [0]:
# Total Rating by (movie by segment)
# x_train = (user x movie) of ratings
# user_segs_onehot = (user x seg)

total_rating = x_train.transpose().dot(user_segs_onehot)

In [63]:
user_segs_onehot.shape

(283229, 20)

In [64]:
x_train.shape

(283229, 193887)

In [65]:
user_segs.shape

(283229, 20)

In [66]:
total_rating.shape # should be movies by segments

(193887, 20)

In [67]:
# count of ratings (movies by segment)
# use csr.sign to get matrix of 1's
x_train_is_rated = x_train.sign()

rating_counts = x_train_is_rated.transpose().dot(user_segs_onehot)
rating_counts.shape


(193887, 20)

Note, we get a dense matrix out of the above multiply operation.
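
A small worked example may make the two matrix products clearer. The data here are invented (`toy_ratings`, `toy_onehot`, `toy_total`, and `toy_count` are illustrative names): three users, two movies, two segments.

```python
import numpy as np
from scipy.sparse import csr_matrix

# 3 users x 2 movies; a zero means "not rated".
toy_ratings = csr_matrix(np.array([[4.0, 0.0],
                                   [2.0, 5.0],
                                   [0.0, 3.0]]))
# Users 0 and 1 are in segment 0, user 2 is in segment 1.
toy_onehot = np.array([[1.0, 0.0],
                       [1.0, 0.0],
                       [0.0, 1.0]])

# (movie x user) times (user x segment) sums ratings per movie and segment.
toy_total = toy_ratings.transpose().dot(toy_onehot)
# sign() turns every stored rating into 1, so the same product counts ratings.
toy_count = toy_ratings.sign().transpose().dot(toy_onehot)

print(toy_total.tolist())        # [[6.0, 0.0], [5.0, 3.0]]
print(toy_count.tolist())        # [[2.0, 0.0], [1.0, 1.0]]
print(type(toy_total).__name__)  # ndarray
```

Movie 0 was rated 4.0 and 2.0 by the two segment-0 users, giving a total of 6.0 over 2 ratings; nobody in segment 1 rated it, so both entries there are zero.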

In [68]:
type(rating_counts)

numpy.ndarray

In [69]:
rating_counts.sum(axis=1).shape

(193887,)

### Validate the Ratings Counts

In [70]:
# number of times each movie was rated
rating_counts.sum(axis=1)[:10] #row-wise sum, i.e. by movie

array([    0., 46834., 18665., 10403.,  2034., 10627., 19774., 10472.,
        1025.,  3009.])

In [71]:
x_train_is_rated

<283229x193887 sparse matrix of type '<class 'numpy.float64'>'
	with 19243991 stored elements in Compressed Sparse Row format>

In [72]:
x_train

<283229x193887 sparse matrix of type '<class 'numpy.float64'>'
	with 19243991 stored elements in Compressed Sparse Row format>

In [73]:
# sum of the ratings counts should match the number of non-zero
# elements in the x_train (users x movies) ratings matrix
rating_counts.sum(), x_train.nnz

(19243991.0, 19243991)

In [74]:
# not all movies were rated in each segment
# hence ratings counts will have a lot of zeros in it
(rating_counts == 0.0).sum()

3571512

### Compute the average rating by segment

...and ignore the divide-by-zero warnings: (movie, segment) pairs with no ratings come out as NaN

In [75]:
# average = total divided by count
avg_rating = total_rating/ rating_counts

avg_rating.shape


(193887, 20)
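
If the warnings are unwelcome, one alternative (not what the notebook does) is to divide only where a count exists. This sketch uses `np.divide` with its `out` and `where` arguments on made-up totals and counts.

```python
import numpy as np

toy_total = np.array([[6.0, 0.0],
                      [5.0, 3.0]])
toy_count = np.array([[2.0, 0.0],
                      [1.0, 1.0]])

# Start from NaN and divide only where at least one rating exists,
# so no 0/0 is ever evaluated and no warning is raised.
toy_avg = np.divide(toy_total, toy_count,
                    out=np.full_like(toy_total, np.nan),
                    where=toy_count > 0)

print(toy_avg)
```

The result matches plain division wherever the count is positive and is NaN elsewhere, which is the same outcome the notebook gets after ignoring the warnings.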

### View Average Ratings by Segment

In [0]:
def show_rating(movieId):

  df = pd.DataFrame({'Segment':np.arange(n_components),
                     'Rating':avg_rating[movieId]})
  chart = alt.Chart(df).mark_bar().encode(
    x='Segment',
    y=alt.Y('Rating', axis=alt.Axis( title='Average Rating'))

    #color='UserCount'
  ).properties(
      title=movie_names[movieId]
  )

  chart.display()


In [77]:
# Movie #1 is Toy Story
show_rating(1)

In [78]:
find_movies('.*repo man.*')

movieId
1965    Repo Man (1984)
Name: title, dtype: object

In [79]:
show_rating(1965)

In [80]:
show_segment(lda, movie_names,13)

['Iron Lady, The (2011)',
 'Visitors, The (Visiteurs, Les) (1993)',
 'The Lunchbox (2013)',
 'Single Man, A (2009)',
 'Queen of Versailles, The (2012)',
 'Dinner Game, The (Dîner de cons, Le) (1998)',
 'Bernie (2011)',
 'George Carlin: You Are All Diseased (1999)',
 'Yi Yi (2000)',
 'Last Days, The (1998)',
 "Doug's 1st Movie (1999)",
 'Rent-a-Kid (1995)',
 "Illusionist, The (L'illusionniste) (2010)",
 'Shootist, The (1976)',
 'Heartbeats (Les amours imaginaires) (2010)',
 'Stories We Tell (2012)',
 "I Am Love (Io sono l'amore) (2009)",
 'Great Beauty, The (Grande Bellezza, La) (2013)',
 'Another Year (2010)',
 'Pride (2014)']

In [81]:
find_movies('.*Independence Day.*')

movieId
780       Independence Day (a.k.a. ID4) (1996)
135567     Independence Day: Resurgence (2016)
143357               Independence Day 3 (2017)
Name: title, dtype: object

In [82]:
show_rating(780)

Independence Day is panned in segment 11. What movies do the people in segment 11 like?

In [83]:
show_segment(lda, movie_names,11)

['Notebook, The (2004)',
 'Love Actually (2003)',
 'Devil Wears Prada, The (2006)',
 '(500) Days of Summer (2009)',
 'Mean Girls (2004)',
 'Juno (2007)',
 'Pride & Prejudice (2005)',
 '10 Things I Hate About You (1999)',
 '50 First Dates (2004)',
 'Titanic (1997)',
 'Forrest Gump (1994)',
 'Legally Blonde (2001)',
 "Bridget Jones's Diary (2001)",
 '13 Going on 30 (2004)',
 'Holiday, The (2006)',
 'Pretty Woman (1990)',
 'How to Lose a Guy in 10 Days (2003)',
 'Miss Congeniality (2000)',
 'Hitch (2005)',
 'Finding Nemo (2003)']

## Predict the Training Data

In [84]:
x_train.shape, y_train.shape

((283229, 193887), (283229, 193887))

In [85]:
# prediction is the average for the pair (segment, movie)
# user_segs_max - is the hard assignment of each user to segment
user_segs_max[:10]

array([ 0, 17, 10, 16,  0,  0,  4, 17,  0,  0])

In [86]:
type(avg_rating), avg_rating.shape

(numpy.ndarray, (193887, 20))

In [0]:
# which movieId are we predicting for each user?
# switch to COO (coordinate sparse array) format to easily get the column indices
y_train_coo = y_train.tocoo()


In [0]:
# a lot easier to work with a data frame
y_train_df = pd.DataFrame({'userId':y_train_coo.row,
                           'movieId': y_train_coo.col,
                           'rating': y_train_coo.data})
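
For readers new to the COO format, here is a minimal sketch on invented data (`toy_y` and `toy_coo` are illustrative names). It shows that the converted matrix exposes one (row, col, data) triplet per stored element, which is what the data frame above is built from.

```python
import numpy as np
from scipy.sparse import csr_matrix

toy_y = csr_matrix((np.array([4.0, 2.5]), (np.array([1, 3]), np.array([5, 2]))),
                   shape=(4, 6))
toy_coo = toy_y.tocoo()

# One (row, col, data) triplet per stored element.
print(toy_coo.row.tolist())   # [1, 3]
print(toy_coo.col.tolist())   # [5, 2]
print(toy_coo.data.tolist())  # [4.0, 2.5]
```

With user IDs as rows and movie IDs as columns, `row` and `col` are directly the userId and movieId of each response rating.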

In [0]:
y_train_df['segment'] = user_segs_max[y_train_df.userId]

In [0]:
# look up each (movie, segment) average with paired integer indexing
y_train_df['pred_rating'] = avg_rating[y_train_df.movieId.values,
                                       y_train_df.segment.values]

In [91]:
y_train_df.tail()

,userId,movieId,rating,segment,pred_rating
198595,283222,1246,5.0,8,3.977692
198596,283223,2401,4.0,18,3.667749
198597,283224,426,4.0,18,3.199468
198598,283225,3623,2.5,8,2.913832
198599,283227,96110,4.0,10,3.0


In [0]:
def r_squared(y, y_hat):
  y_bar = y.mean()
  ss_tot = ((y-y_bar)**2).sum()
  ss_resid = ((y-y_hat)**2).sum()
  r_sq = (1.0 - (ss_resid/ss_tot))
  
  return r_sq
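
The function computes $R^2 = 1 - SS_{resid}/SS_{tot}$. A quick sanity check on made-up values (the function is repeated here in condensed form so the snippet runs on its own) shows the three reference points: a perfect prediction, predicting the mean, and a prediction worse than the mean.

```python
import numpy as np

def r_squared(y, y_hat):
    y_bar = y.mean()
    ss_tot = ((y - y_bar) ** 2).sum()
    ss_resid = ((y - y_hat) ** 2).sum()
    return 1.0 - ss_resid / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])

print(r_squared(y, y))                          # 1.0
print(r_squared(y, np.full_like(y, y.mean())))  # 0.0
print(r_squared(y, y[::-1]) < 0)                # True
```

A value near zero therefore means the model does little better than always predicting the overall mean rating.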

In [93]:
r_squared(y_train_df.rating, y_train_df.pred_rating)

0.12653523375337905

**Not very good $R^2$ on the training data! Uh-oh, need a different model!**

## Conclusions
1. Work with smaller data sets when developing methodology and algorithms, then unleash them on the whole data set
2. scipy.sparse is a great package for dealing with sparsity
3. Segmenting using LDA seems to work well
4. Predicting a rating from the average rating within the user's segment appears to work poorly (training $R^2 \approx 0.13$)
5. Pickles are a great way to save computationally intensive results for future use