#  Movie Recommender EDA notebook

author: Ben Sturm <br />
contact: bwsturm@gmail.com <br />
date: 6/12/2018

In [1]:
import pandas as pd
import numpy as np

### Data Extraction Steps

In [2]:
# Reading in the MovieLens 20M ratings dataset
ratings = pd.read_csv('../../data/ml-20m/ratings.csv')

In [3]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


In [4]:
# Reading in the MovieLens 20M movies dataset
movies = pd.read_csv('../../data/ml-20m/movies.csv')

In [5]:
movies.tail()

Unnamed: 0,movieId,title,genres
27273,131254,Kein Bund für's Leben (2007),Comedy
27274,131256,"Feuer, Eis & Dosenbier (2002)",Comedy
27275,131258,The Pirates (2014),Adventure
27276,131260,Rentun Ruusu (2001),(no genres listed)
27277,131262,Innocence (2014),Adventure|Fantasy|Horror


In [6]:
# Reading in the MovieLens Latest ratings dataset
ratings_latest = pd.read_csv('../../data/ml-latest/ratings.csv')

In [7]:
ratings_latest.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556


In [8]:
# Reading in the MovieLens Latest movies dataset
movies_latest = pd.read_csv('../../data/ml-latest/movies.csv')

In [9]:
movies_latest.tail()

Unnamed: 0,movieId,title,genres
45838,176269,Subdue,Children|Drama
45839,176271,Century of Birthing (2011),Drama
45840,176273,Betrayal (2003),Action|Drama|Thriller
45841,176275,Satan Triumphant (1917),(no genres listed)
45842,176279,Queerama (2017),(no genres listed)


### Data Exploration

Some questions I have so far:
* Is it possible to merge the two ratings and movies datasets?
* Is the data consistent between the MovieLens 20M and the MovieLens Latest such that userIds correspond to the same user?

In [10]:
ratings.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,20000260.0,20000260.0,20000260.0,20000260.0
mean,69045.87,9041.567,3.525529,1100918000.0
std,40038.63,19789.48,1.051989,162169400.0
min,1.0,1.0,0.5,789652000.0
25%,34395.0,902.0,3.0,966797700.0
50%,69141.0,2167.0,3.5,1103556000.0
75%,103637.0,4770.0,4.0,1225642000.0
max,138493.0,131262.0,5.0,1427784000.0


Now examining if userId=1 is possibly the same user between the two ratings DataFrames

In [11]:
ratings[ratings['userId']==1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580
5,1,112,3.5,1094785740
6,1,151,4.0,1094785734
7,1,223,4.0,1112485573
8,1,253,4.0,1112484940
9,1,260,4.0,1112484826


In [12]:
ratings_latest[ratings_latest['userId']==1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556
5,1,1968,4.0,1425942148
6,1,2762,4.5,1425941300
7,1,2918,5.0,1425941593
8,1,2959,4.0,1425941601
9,1,4226,4.0,1425942228


In [13]:
ratings_latest.shape

(26024289, 4)

In [14]:
ratings_latest.movieId.sample(10)

5396608       3959
2704545       1127
21718841      1258
16405685      7458
1511883       1680
4092617        372
10230649      5618
16846305    117895
1600666       3507
6491960      50685
Name: movieId, dtype: int64

In [15]:
movies_latest[movies_latest.movieId==98809]

Unnamed: 0,movieId,title,genres
20146,98809,"Hobbit: An Unexpected Journey, The (2012)",Adventure|Fantasy|IMAX


Clearly the userId of 1 is not consistent between the two ratings data sets.  Therefore, I don't think it's safe to merge these two datasets.

Write a function to strip out the Year from the 'title' column in case if I want to filter by year

In [16]:
def get_releaseYear(row):
    row_len = len(row)
    if row_len < 2:
        return np.nan
    else:
        return row[-1].replace(')','')

In [17]:
def get_title(row):
    row_len = len(row)
    if row_len < 2:
        return row[0]
    else:
        return '('.join(row[:-1])

In [18]:
def strip_Year_from_title(df):
    temp = df['title'].str.split('(')
    df['releaseYear'] = temp.apply(get_releaseYear)
    df['title2'] = temp.apply(get_title)
    return df

Running my code on movies DataFrame

In [19]:
movies2 = strip_Year_from_title(movies)
movies2.head()

Unnamed: 0,movieId,title,genres,releaseYear,title2
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,Toy Story
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,Jumanji
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,Grumpier Old Men
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,Waiting to Exhale
4,5,Father of the Bride Part II (1995),Comedy,1995,Father of the Bride Part II


In [20]:
movies2.drop(columns='title',axis=1,inplace=True)
movies2.rename(columns={'title2':'title'},inplace=True)

In [21]:
movies2 = movies2.reindex(columns=['movieId','title','releaseYear','genres'])

In [22]:
movies2.head()

Unnamed: 0,movieId,title,releaseYear,genres
0,1,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji,1995,Adventure|Children|Fantasy
2,3,Grumpier Old Men,1995,Comedy|Romance
3,4,Waiting to Exhale,1995,Comedy|Drama|Romance
4,5,Father of the Bride Part II,1995,Comedy


Running my code on movies_latest DataFrame

In [23]:
movies_latest2 = strip_Year_from_title(movies_latest)
movies_latest2.drop(columns='title',axis=1,inplace=True)
movies_latest2.rename(columns={'title2':'title'},inplace=True)

In [24]:
movies_latest2 = movies_latest2.reindex(columns=['movieId','title','releaseYear','genres'])

In [25]:
movies_latest2.head()

Unnamed: 0,movieId,title,releaseYear,genres
0,1,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji,1995,Adventure|Children|Fantasy
2,3,Grumpier Old Men,1995,Comedy|Romance
3,4,Waiting to Exhale,1995,Comedy|Drama|Romance
4,5,Father of the Bride Part II,1995,Comedy


In [26]:
movies_latest2['releaseYear'].value_counts()

2014                        1953
2015                        1931
2013                        1872
2012                        1726
2011                        1637
2009                        1610
2016                        1593
2010                        1506
2008                        1467
2007                        1333
2006                        1277
2005                        1123
2004                        1000
2002                         906
2003                         894
2001                         856
2000                         819
1998                         727
1999                         717
1997                         684
1996                         636
1995                         606
1994                         559
1993                         494
2017                         488
1988                         466
1987                         463
1992                         461
1989                         444
1990                         431
          

In [27]:
ratings_latest.shape

(26024289, 4)

In [28]:
movies_latest2.sample(30)

Unnamed: 0,movieId,title,releaseYear,genres
44490,172789,Lovey-Dovey,2007.0,Comedy|Fantasy
28956,130038,Little Deaths,2011.0,Horror
11040,45512,Keeping Up with the Steins,2006.0,Comedy
43878,171463,"The Execution of Mary, Queen of Scots",1895.0,(no genres listed)
41432,165397,Cinderella,1899.0,Children|Fantasy|Horror|Sci-Fi
43525,170631,Is That a Gun in Your Pocket?,2016.0,Comedy
15544,78980,Dynamite Girl (Dynamiittityttö),1944.0,Comedy|Crime
16883,85108,Lakota Woman: Siege at Wounded Knee,1994.0,Drama
14842,74097,Cabiria,1914.0,Adventure|Drama|War
16453,82836,"Life of Reilly, The",2006.0,Comedy


In [29]:
movies_latest2.movieId.describe()

count     45843.000000
mean      96578.775626
std       57216.863469
min           1.000000
25%       49202.500000
50%      108799.000000
75%      145270.500000
max      176279.000000
Name: movieId, dtype: float64

### Recommender model

Now I'm going to set up a very basic collaborative recommender using the Surprise library.

In [30]:
from surprise import SVD
from surprise import evaluate, print_perf
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate

My dataset is super large, so I'm going to just sample 1M rows from my dataset and run my first model on my downsampled DataFrame

In [31]:
ratings_latest_1M = ratings_latest.sample(n=1000000,replace=True)
ratings_latest_1M.shape

(1000000, 4)

In [32]:
ratings_latest_1M.head()

Unnamed: 0,userId,movieId,rating,timestamp
7323622,75499,367,2.0,837262574
14056426,146081,54278,4.5,1227494543
4542288,46640,1196,5.0,1445536589
7170441,73999,245,3.0,993169936
16473112,171269,231,2.0,1460148287


In [33]:
ratings_latest_1M.reset_index(drop=True,inplace=True)

In [34]:
ratings_latest_1M.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,75499,367,2.0,837262574
1,146081,54278,4.5,1227494543
2,46640,1196,5.0,1445536589
3,73999,245,3.0,993169936
4,171269,231,2.0,1460148287


In [35]:
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings_latest_1M[['userId', 'movieId', 'rating']], reader)

In [36]:
data.split(n_folds=3)

algo = SVD()

perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
print_perf(perf)



Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9261
MAE:  0.7141
------------
Fold 2
RMSE: 0.9273
MAE:  0.7148
------------
Fold 3
RMSE: 0.9277
MAE:  0.7154
------------
------------
Mean RMSE: 0.9271
Mean MAE : 0.7148
------------
------------
        Fold 1  Fold 2  Fold 3  Mean    
RMSE    0.9261  0.9273  0.9277  0.9271  
MAE     0.7141  0.7148  0.7154  0.7148  


In [37]:
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=None, verbose=True)

user: 196        item: 302        r_ui = None   est = 3.53   {'was_impossible': False}


In [38]:
from collections import defaultdict

from surprise import SVD
from surprise import Dataset


def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

# Than predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
#for uid, user_ratings in top_n.items():
#    print(uid, [iid for (iid, _) in user_ratings])


In [39]:
# Now I'd like to train on the 1M datatset with 3-fold cross validation
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings_latest_1M[['userId', 'movieId', 'rating']], reader)

In [None]:
#data.split(n_folds=3)
#algo = SVD()
#perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
#print_perf(perf)

In [None]:
#trainset = data.build_full_trainset()

Surprise isn't working very well for a large dataset like I have.  I'm going to now try another library called implicit.

In [40]:
from scipy.sparse import csr_matrix
import implicit

In [41]:
# Initialize sparse matrix of ratings
item_user_data = csr_matrix((ratings_latest_1M['rating'].astype(np.double),
                       (ratings_latest_1M['userId'], #row_id
                        ratings_latest_1M['movieId']))) #column_id

#pd.DataFrame(item_user_data.todense()).head()
print(item_user_data)

  (1, 1246)	5.0
  (4, 3285)	3.0
  (5, 1225)	4.0
  (6, 11)	3.0
  (6, 2565)	4.0
  (7, 58559)	5.0
  (8, 585)	2.0
  (8, 1405)	4.0
  (8, 1965)	3.0
  (8, 2919)	3.0
  (8, 3175)	5.0
  (8, 3986)	1.0
  (8, 5033)	1.0
  (9, 1183)	4.0
  (9, 4308)	5.0
  (9, 5013)	3.0
  (9, 5902)	3.5
  (11, 110)	3.5
  (11, 1089)	7.0
  (11, 6870)	3.5
  (11, 33437)	3.0
  (11, 52328)	3.5
  (11, 58839)	2.0
  (12, 273)	3.0
  (12, 290)	5.0
  :	:
  (270893, 2716)	5.0
  (270893, 2717)	4.0
  (270893, 4148)	4.0
  (270894, 296)	3.0
  (270894, 2165)	2.5
  (270894, 2858)	3.5
  (270894, 4475)	4.0
  (270894, 5618)	1.5
  (270894, 6787)	5.0
  (270894, 48774)	2.5
  (270894, 55820)	3.0
  (270894, 108729)	4.5
  (270896, 1)	4.5
  (270896, 153)	2.0
  (270896, 288)	3.5
  (270896, 296)	5.0
  (270896, 300)	3.5
  (270896, 1089)	5.0
  (270896, 1580)	3.0
  (270896, 1777)	3.0
  (270896, 1968)	4.5
  (270896, 2706)	2.5
  (270896, 5952)	5.0
  (270896, 7438)	5.0
  (270896, 35836)	3.5


In [42]:
# initialize a model
model = implicit.als.AlternatingLeastSquares(factors=50)

# train the model on a sparse matrix of item/user/confidence weights
model.fit(item_user_data)

100%|██████████| 15.0/15 [00:12<00:00,  1.14it/s]


In [43]:
# recommend items for a specific user (user_id = 3)
user_items = item_user_data.T
rec = model.recommend(3,user_items)
#recommendations = pd.DataFrame(model.recommend(3, item_user_data), columns=['movieId', 'score'])

# find related items to a specific movie (Jumanji)
#related = pd.DataFrame(model.similar_items(2), columns=['movieId', 'score'])

In [44]:
movies_latest2['index'] = movies_latest2.index + 1
movies_latest2.head()

Unnamed: 0,movieId,title,releaseYear,genres,index
0,1,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy,1
1,2,Jumanji,1995,Adventure|Children|Fantasy,2
2,3,Grumpier Old Men,1995,Comedy|Romance,3
3,4,Waiting to Exhale,1995,Comedy|Drama|Romance,4
4,5,Father of the Bride Part II,1995,Comedy,5


In [46]:
#recommendations_with_title = pd.merge(recommendations,movies_latest2,left_on='movieId', right_on='index')
#print(recommendations.columns)
#print(movies_latest2.columns)

In [None]:
recommendations_with_title

In [None]:
recommendations

In [None]:
related

In [None]:
rec

In [None]:
# Initialize sparse matrix of ratings using the full dataset
item_user_data = csr_matrix((ratings_latest['rating'].astype(np.double),
                       (ratings_latest['userId'], #row_id
                        ratings_latest['movieId']))) #column_id

#pd.DataFrame(item_user_data.todense()).head()
print(item_user_data)

In [None]:
# initialize a model
model = implicit.als.AlternatingLeastSquares(factors=50)

# train the model on a sparse matrix of item/user/confidence weights
model.fit(item_user_data)

In [None]:
# recommend items for a specific user (user_id = 3)
user_items = item_user_data.T
#rec = model.recommend(3,user_items)
recommendations = pd.DataFrame(model.recommend(3, user_items), columns=['movieId', 'score'])

# find related items to a specific movie (Jumanji)
#related = pd.DataFrame(model.similar_items(2), columns=['movieId', 'score'])

In [None]:
recommendations

I have a hunch that the model might not be working because some users had just 1 rating.  In order to test this hunch, I'm going to use the 20M dataset.  The readme states that each user in this datatset had at least 20 ratings.

In [None]:
# Initialize sparse matrix of ratings using the full dataset
item_user_data_20M = csr_matrix((ratings['rating'].astype(np.double),
                       (ratings['userId'], #row_id
                        ratings['movieId']))) #column_id

#pd.DataFrame(item_user_data.todense()).head()
#print(item_user_data)

In [None]:
# initialize a model
model = implicit.als.AlternatingLeastSquares(factors=50)

# train the model on a sparse matrix of item/user/confidence weights
model.fit(item_user_data_20M)

In [None]:
# recommend items for a specific user (user_id = 3)
user_items_20M = item_user_data_20M.T
#rec = model.recommend(3,user_items)
recommendations = pd.DataFrame(model.recommend(3, user_items_20M), columns=['movieId', 'score'])

# find related items to a specific movie (Jumanji)
#related = pd.DataFrame(model.similar_items(2), columns=['movieId', 'score'])

In [None]:
movies2['index'] = movies2.index + 1
movies2.head()

In [None]:
recommendations

In [None]:
recommendations_with_title = pd.merge(recommendations,movies2,left_on='movieId',right_on='index')

In [None]:
recommendations_with_title

In [None]:
movies2[movies2.movieId==34962]