<img align="right" style="padding-left:10px; height: 20%; width: 20%" src="figures/projector-300x300.jpg" ></a>

## Case Study: Movie Suggestion

### The Movies Dataset

Collaborative datasets for movies (and other products) can be large! Here is a [small (1 MB) subset of the MovieLens dataset](https://grouplens.org/datasets/movielens/latest/), downloaded and unzipped for your convenience.

The dataset consists of 9742 movies.

In [43]:
import pandas as pd
import numpy as np
from pandas import DataFrame
movies_directory = '../../04-analysis-and-visualization/04-05-recommendations/ml-latest-small/'
movies = pd.read_csv(movies_directory+'movies.csv',header = 0) 

movies.head(5)

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


### Movie Ratings

The 9742 movies above were rated by 610 users, which works out to about 165 ratings per user on average. The ratings are available in the `ratings.csv` file, sampled in the DataFrame below.

In [44]:
ratings = pd.read_csv(movies_directory+'ratings.csv',header = 0) 
ratings.head(5)

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
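
The per-user average quoted above is simply the number of rating rows divided by the number of distinct users. A minimal sketch of that arithmetic, using a small made-up table (`toy_ratings` and its values are hypothetical; only the column names match `ratings.csv`):

```python
import pandas as pd

# Toy stand-in for ratings.csv (hypothetical values, same column names).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [1, 3, 6, 1, 47, 50],
    "rating":  [4.0, 4.0, 4.0, 3.5, 5.0, 5.0],
})

n_ratings = len(toy_ratings)                   # one row per rating
n_users = toy_ratings["userId"].nunique()      # distinct users
avg_per_user = n_ratings / n_users
print(f"{n_ratings} ratings / {n_users} users = {avg_per_user:.1f} per user")
```

Running the same three lines on the real `ratings` DataFrame should reproduce the figure of about 165.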


### Investigating the ratings dataset

Find the number of ratings created by each user.

In [129]:
# Ratings Group By userId:
# print (type(ratings.groupby(["userId"])["userId"].count())) # prints <class 'pandas.core.series.Series'>

# Convert Series to DataFrame
#     Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reset_index.html#pandas.Series.reset_index

counts = ratings.groupby(["userId"])["userId"].count().reset_index(name="Count")
counts.head(10)

| | userId | Count |
|---|---|---|
| 0 | 1 | 232 |
| 1 | 2 | 29 |
| 2 | 3 | 39 |
| 3 | 4 | 216 |
| 4 | 5 | 44 |
| 5 | 6 | 314 |
| 6 | 7 | 152 |
| 7 | 8 | 47 |
| 8 | 9 | 46 |
| 9 | 10 | 140 |
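
The same per-user counts can be obtained in more than one way. The sketch below, on a hypothetical `toy_ratings` table, compares `groupby(...).size()` with `value_counts()`; both are standard pandas calls, and the names `counts_groupby` and `counts_vc` are illustrative only.

```python
import pandas as pd

# Hypothetical ratings: user 1 rated 2 movies, user 2 rated 1, user 3 rated 3.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3, 3],
    "movieId": [1, 3, 1, 1, 3, 6],
})

# Variant 1: group by user and count the rows in each group.
counts_groupby = toy_ratings.groupby("userId").size().reset_index(name="Count")

# Variant 2: count occurrences of each userId, then restore userId order.
counts_vc = (
    toy_ratings["userId"]
    .value_counts()
    .sort_index()
    .rename_axis("userId")
    .reset_index(name="Count")
)

print(counts_groupby)
print(counts_vc)
```

Either form yields one row per user; `size()` counts rows regardless of missing values, which is usually what is wanted for a ratings count.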


### Build a collaborative filtering recommender system

1. Create a function `recommend_movies(uid, threshold)` that takes a `userId` and a distance `threshold` as parameters, reads its data from `movies_directory`, and produces recommendations for the user. Test the code first with `userId = 607`. Try various values of `threshold` until the user gets at least 6 movie recommendations.

In [393]:
import pandas as pd
import numpy as np
from pandas import DataFrame
from scipy.spatial.distance import cdist, euclidean
from scipy.spatial import distance_matrix
from math import sqrt
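# Sanity check on illustrative vectors: scipy's euclidean() agrees with the textbook formula.
the_x, all_x = np.array([4.0, 5.0]), np.array([1.0, 1.0])
distances = the_x - all_x
assert np.isclose(euclidean(the_x, all_x),
                  sqrt(sum([_*_ for _ in distances])))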

In [394]:
# Initial Parameters
given_userId = 607
threshold_distance = 130
movies_directory = '../../04-analysis-and-visualization/04-05-recommendations/ml-latest-small/'
# movies_directory = '../../04-analysis-and-visualization/04-05-recommendations/ml-latest-tiny/'
movies = pd.read_csv(movies_directory+'movies.csv',header = 0) 
ratings = pd.read_csv(movies_directory+'ratings.csv',header = 0) 

In [395]:
from random import random, seed
seed(a=314159)
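# Drop each rating with probability 0.85, keeping a reproducible sample of roughly 15% of the rows.
ratings = ratings.drop(ratings.index[[(random() < 0.85) for i in range(len(ratings.index))]])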
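
For comparison, pandas has a built-in way to subsample rows. The sketch below uses `DataFrame.sample` on a hypothetical 100-row table; it is an alternative technique, not the cell above, and it will not select the same rows as the `random()`-based drop. It keeps exactly the requested fraction, whereas the cell above keeps each row independently with probability 0.15.

```python
import pandas as pd

# Hypothetical table with 100 ratings.
toy_ratings = pd.DataFrame({"userId": range(1, 101), "rating": [3.0] * 100})

# Keep a reproducible 15% sample; random_state plays the role of the seed.
subset = toy_ratings.sample(frac=0.15, random_state=314159)

print(len(subset))
```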

In [400]:
def recommendation_movies(given_userId, threshold_distance):
    ratings_np = ratings.to_numpy(dtype=np.float32)
    x = ratings_np
    the_x_2d = x[np.where(x[:,0] == given_userId)][:, [2]]
    the_x_2d = the_x_2d[0].reshape(1,1) # pick the first one in case we have multiple ratings records for user.
    the_x = the_x_2d.reshape(the_x_2d.shape[0])
    
    all_x_2d = ratings_np[:, [2]]
    all_x = all_x_2d.reshape(all_x_2d.shape[0])
    all_u_2d = ratings_np[:, [0]]
    all_u = all_u_2d.reshape(all_u_2d.shape[0])
    all_m_2d = ratings_np[:, [1]]
    all_m = all_m_2d.reshape(all_m_2d.shape[0])
    dm = distance_matrix(all_x_2d, all_x_2d)
    a = np.apply_along_axis(lambda x: sqrt(sum([_*_ for _ in x])), 0, dm)

    dists = np.stack([all_u,
                  all_m, a])

    dists_df = DataFrame(dists.transpose(), columns=['u', 'm', 'x'])
    
    dists_df = dists_df[dists_df['x'] < threshold_distance]
    
    candidates_df = dists_df[dists_df['u'] == given_userId]
    candidates_np = candidates_df['m']
    dists_np = dists_df['m']
    candidate_movies = [i for i in set(dists_np) if not i in set(candidates_np) ]
    
    # Copy the matching rows so that adding a column does not write to a slice of `movies`.
    movies1 = movies[np.isin(movies['movieId'], list(candidate_movies))].copy()
    movies1['candidates'] = candidate_movies
    movies2 = movies1.sort_values(by=['candidates'])
    return movies2

In [401]:
recommendation_movies(given_userId, threshold_distance)

| | movieId | title | genres | candidates |
|---|---|---|---|---|
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | 1.0 |
| 9 | 10 | GoldenEye (1995) | Action\|Adventure\|Thriller | 2.0 |
| 20 | 21 | Get Shorty (1995) | Comedy\|Crime\|Thriller | 6.0 |
| 30 | 31 | Dangerous Minds (1995) | Drama | 10.0 |
| 28 | 29 | City of Lost Children, The (Cité des enfants p... | Adventure\|Drama\|Fantasy\|Mystery\|Sci-Fi | 12.0 |
| 31 | 32 | Twelve Monkeys (a.k.a. 12 Monkeys) (1995) | Mystery\|Sci-Fi\|Thriller | 16.0 |
| 397 | 456 | Fresh (1994) | Crime\|Drama\|Thriller | 17.0 |
| 37 | 41 | Richard III (1995) | Drama\|War | 20.0 |
| 80 | 89 | Nick of Time (1995) | Action\|Thriller | 21.0 |
| 43 | 47 | Seven (a.k.a. Se7en) (1995) | Mystery\|Thriller | 22.0 |
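
For readers who want to see the general idea in isolation, here is a minimal sketch of user-based collaborative filtering on a tiny made-up dataset. It is not the function above: it compares whole users (one row of ratings per user) rather than individual rating records, and it treats unrated movies as 0, which is a simplifying assumption. The names `toy_ratings` and `recommend_for_user` are hypothetical.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical ratings for three users.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 40, 10, 30, 50],
    "rating":  [5.0, 4.0, 1.0, 5.0, 4.0, 5.0, 1.0, 5.0, 4.0],
})

def recommend_for_user(ratings, user_id, threshold):
    """Return unseen movies rated by users within `threshold` of `user_id`."""
    # Rows are users, columns are movies; unrated entries become 0 (assumption).
    matrix = ratings.pivot_table(index="userId", columns="movieId",
                                 values="rating").fillna(0.0)
    target = matrix.loc[[user_id]]
    # Euclidean distance from the target user to every user.
    distances = cdist(target.to_numpy(), matrix.to_numpy(), metric="euclidean")[0]
    dist = pd.Series(distances, index=matrix.index)
    neighbors = dist[(dist < threshold) & (dist.index != user_id)].index
    # Do not recommend movies the user has already rated.
    seen = set(ratings.loc[ratings["userId"] == user_id, "movieId"])
    from_neighbors = ratings[ratings["userId"].isin(neighbors)]
    unseen = from_neighbors[~from_neighbors["movieId"].isin(seen)]
    # Rank candidates by the neighbors' mean rating.
    return unseen.groupby("movieId")["rating"].mean().sort_values(ascending=False)

print(recommend_for_user(toy_ratings, user_id=1, threshold=6.0))
print(recommend_for_user(toy_ratings, user_id=1, threshold=9.0))
```

With the smaller threshold only user 2 is close enough to user 1, so only movie 40 is suggested; raising the threshold admits user 3 as well and adds movie 50. This mirrors the exercise's advice to widen `threshold` until enough recommendations appear.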


2. Instead of picking the first record for the user, modify the code to begin instead with the movie our user liked the most. 



In [402]:
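# Modify Cell 15
# What movies has the user picked already? Don't recommend those!
# Note: dists_df is built inside recommendation_movies, so it must also exist at notebook scope for this cell to run.
# dists_np holds the distance column 'x' here, so the values printed below are distances, not movie IDs.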
candidates_df = dists_df[dists_df['u'] == 5]
candidates_np = candidates_df['m']
dists_np = (dists_df['x'])
candidate_movies = sorted([i for i in set(dists_np) if not i in set(candidates_np) ])
print (candidate_movies)

[2.1213202476501465, 2.7838821411132812, 3.391165018081665, 3.9370038509368896, 4.663689613342285]


In [407]:
# Incorporate into the suggestion function
def recommendation_movies(given_userId, threshold_distance):
    ratings_np = ratings.to_numpy(dtype=np.float32)
    x = ratings_np
    the_x_2d = x[np.where(x[:,0] == given_userId)][:, [2]]
    the_x_2d = the_x_2d.max(axis=0).reshape(1,1) # start from the user's highest rating instead of the first record.
    the_x = the_x_2d.reshape(the_x_2d.shape[0])
    
    all_x_2d = ratings_np[:, [2]]
    all_x = all_x_2d.reshape(all_x_2d.shape[0])
    all_u_2d = ratings_np[:, [0]]
    all_u = all_u_2d.reshape(all_u_2d.shape[0])
    all_m_2d = ratings_np[:, [1]]
    all_m = all_m_2d.reshape(all_m_2d.shape[0])
    dm = distance_matrix(all_x_2d, all_x_2d)
    a = np.apply_along_axis(lambda x: sqrt(sum([_*_ for _ in x])), 0, dm)

    dists = np.stack([all_u,
                  all_m, a])

    dists_df = DataFrame(dists.transpose(), columns=['u', 'm', 'x'])
    
    dists_df = dists_df[dists_df['x'] < threshold_distance]
    
    candidates_df = dists_df[dists_df['u'] == given_userId]
    candidates_np = candidates_df['m']
    dists_np = dists_df['m']
    candidate_movies = sorted([i for i in set(dists_np) if not i in set(candidates_np) ])
    
    
    # Copy the matching rows so that adding a column does not write to a slice of `movies`.
    movies1 = movies[np.isin(movies['movieId'], list(candidate_movies))].copy()
    movies1['candidates'] = candidate_movies
    movies2 = movies1.sort_values(by=['candidates'])
    return movies2

In [408]:
recommendation_movies(given_userId, threshold_distance)

| | movieId | title | genres | candidates |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 1.0 |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | 2.0 |
| 5 | 6 | Heat (1995) | Action\|Crime\|Thriller | 6.0 |
| 9 | 10 | GoldenEye (1995) | Action\|Adventure\|Thriller | 10.0 |
| 11 | 12 | Dracula: Dead and Loving It (1995) | Comedy\|Horror | 12.0 |
| 15 | 16 | Casino (1995) | Crime\|Drama | 16.0 |
| 16 | 17 | Sense and Sensibility (1995) | Drama\|Romance | 17.0 |
| 19 | 20 | Money Train (1995) | Action\|Comedy\|Crime\|Drama\|Thriller | 20.0 |
| 20 | 21 | Get Shorty (1995) | Comedy\|Crime\|Thriller | 21.0 |
| 21 | 22 | Copycat (1995) | Crime\|Drama\|Horror\|Mystery\|Thriller | 22.0 |
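
Step 2 asks to start from the movie the user liked the most. One way to locate that record is `idxmax` on the user's ratings, sketched below on hypothetical data (`toy_ratings` and `favorite_movie` are illustrative names, and the values are made up). When several movies share the top rating, `idxmax` returns the first of them.

```python
import pandas as pd

# Hypothetical ratings; user 607 rated three movies.
toy_ratings = pd.DataFrame({
    "userId":  [607, 607, 607, 5],
    "movieId": [1, 36, 86, 1],
    "rating":  [4.0, 5.0, 3.0, 4.0],
})

def favorite_movie(ratings, user_id):
    """Return (movieId, rating) of the user's highest-rated movie."""
    user_rows = ratings[ratings["userId"] == user_id]
    # idxmax gives the index label of the first maximum rating.
    best = user_rows.loc[user_rows["rating"].idxmax()]
    return int(best["movieId"]), float(best["rating"])

print(favorite_movie(toy_ratings, 607))
```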


3. Time your code for various values of `userId` and `threshold`. What accounts for the variation in timing?

In [391]:
import timeit

mysetup = '''import pandas as pd
given_userId = 700
threshold_distance = 130
movies_directory = '../../04-analysis-and-visualization/04-05-recommendations/ml-latest-small/'
# movies_directory = '../../04-analysis-and-visualization/04-05-recommendations/ml-latest-tiny/'
movies = pd.read_csv(movies_directory+'movies.csv',header = 0) 
ratings = pd.read_csv(movies_directory+'ratings.csv',header = 0)'''


mycode = ''' 
def recommendation_movies(given_userId, threshold_distance):
    ratings_np = ratings.to_numpy(dtype=np.float32)
    x = ratings_np
    the_x_2d = x[np.where(x[:,0] == given_userId)][:, [2]]
    the_x_2d = the_x_2d[0].reshape(1,1) # pick the first one in case we have multiple ratings records for user.
    the_x = the_x_2d.reshape(the_x_2d.shape[0])
    
    all_x_2d = ratings_np[:, [2]]
    all_x = all_x_2d.reshape(all_x_2d.shape[0])
    all_u_2d = ratings_np[:, [0]]
    all_u = all_u_2d.reshape(all_u_2d.shape[0])
    all_m_2d = ratings_np[:, [1]]
    all_m = all_m_2d.reshape(all_m_2d.shape[0])
    dm = distance_matrix(all_x_2d, all_x_2d)
    a = np.apply_along_axis(lambda x: sqrt(sum([_*_ for _ in x])), 0, dm)

    dists = np.stack([all_u,
                  all_m, a])

    dists_df = DataFrame(dists.transpose(), columns=['u', 'm', 'x'])
    
    dists_df = dists_df[dists_df['x'] < threshold_distance]
    
    candidates_df = dists_df[dists_df['u'] == given_userId]
    candidates_np = candidates_df['m']
    dists_np = dists_df['m']
    candidate_movies = [i for i in set(dists_np) if not i in set(candidates_np) ]
    
    np.isin(movies['movieId'], list(candidate_movies))
    return movies[np.isin(movies['movieId'], list(candidate_movies))]
    
'''
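# timeit statement
# Note: mycode only defines recommendation_movies and never calls it, so this
# measures 100 function definitions rather than 100 recommendation runs.
timeit.timeit(setup=mysetup, stmt=mycode, number=100, globals=None)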

1.0371208190917969e-05

In [388]:
# user 607, thres = 130:  1.0015908628702164e-05
# user 607, thres = 500:  1.0098796337842941e-05
# user 607, thres = 1000: 1.0150950402021408e-05

In [392]:
# user 700, thres = 130: 1.0371208190917969e-05
# user 700, thres = 500: 1.3859011232852936e-05
# user 700, thres = 1000: 1.0411720722913742e-05

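In the runs recorded above, `userId` accounts for most of the variation in timing: the three measurements for user 607 differ by less than 2%, while those for user 700 range from about 1.04e-05 to 1.39e-05 seconds. Because `mycode` only defines `recommendation_movies` and never calls it, these figures measure the cost of defining the function, not of computing recommendations.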
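
To time the work itself, `timeit.timeit` can be given a callable so that the function body actually executes on every repetition. The sketch below is a simplified stand-in, not the notebook's function: `pairwise_norms` is a hypothetical helper that reproduces only the pairwise-distance step on synthetic ratings. In the function above that step is computed from every rating row and does not depend on `userId` or `threshold`, so the number of ratings would be expected to dominate the running time.

```python
import timeit

import numpy as np
from scipy.spatial import distance_matrix

def pairwise_norms(n_ratings, seed=0):
    """Build an n x n distance matrix over synthetic ratings and reduce it."""
    rng = np.random.default_rng(seed)
    # Synthetic ratings in 0.5 .. 5.0, shaped (n, 1) like all_x_2d above.
    x = rng.integers(1, 11, size=(n_ratings, 1)) / 2.0
    dm = distance_matrix(x, x)
    # Root of the sum of squared distances per column.
    return np.sqrt((dm ** 2).sum(axis=0))

for n in (200, 400, 800):
    # The lambda is called 5 times, so the function body really runs.
    seconds = timeit.timeit(lambda: pairwise_norms(n), number=5)
    print(f"n={n}: {seconds:.4f} s for 5 calls")
```

Because the distance matrix has n x n entries, doubling the number of ratings should roughly quadruple the work for this step; the exact timings depend on the machine.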