
___
# Chapter 2 -  Machine Learning Based Recommendation Systems
## Segment 2 - Model-based Collaborative Filtering Systems
## SVD Matrix Factorization

In [1]:
import numpy as np
import pandas as pd

import sklearn
from sklearn.decomposition import TruncatedSVD

The MovieLens dataset was collected by the GroupLens Research Project at the University of Minnesota. You can download the dataset for this demonstration at the following URL: https://grouplens.org/datasets/movielens/100k/


Detailed descriptions of the data files:
https://files.grouplens.org/datasets/movielens/ml-100k-README.txt

### Preparing the data

In [2]:
# creating a list of column names for the ratings data
columns = ['user_id', 'item_id', 'rating', 'timestamp']

# reading the ratings data (u.data is tab-separated and has no header row)
frame = pd.read_csv('ml-100k/u.data', sep='\t', names=columns)
frame.head()

,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [3]:
# import item attribute
# creating a list of column names for item data
columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
          'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
          'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

# reading item data
movies = pd.read_csv('ml-100k/u.item', sep='|', names=columns, encoding='latin-1')
movies.head()

,item_id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Childrens,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [4]:
movie_names = movies[['item_id', 'movie title']]
movie_names.head()

,item_id,movie title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [5]:
# Merging user data and item data
combined_movies_data = pd.merge(frame, movie_names, on='item_id')
combined_movies_data.head()

,user_id,item_id,rating,timestamp,movie title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)


Now we have a dataset that pairs every rating with the title of the movie it was given to.

Item IDs repeat across rows because more than one moviegoer rated the same movie.

Now let's see which movies have the most ratings.

This is what we did with the popularity-based recommender at the beginning of this course.

To do that, we'll group the combined movies data frame by item ID, count the ratings given to each movie, and then sort the counts in descending order.

In [6]:
combined_movies_data.groupby('item_id')['rating'].count().sort_values(ascending=False).head()

item_id
50     583
258    509
100    508
181    507
294    485
Name: rating, dtype: int64

Movie number 50 was rated 583 times, which makes it the most-rated movie.

To see the name of that movie, we'll filter the combined movies data frame.

In [7]:
# boolean mask for the rows of item 50 (named so it does not shadow the built-in filter)
is_item_50 = combined_movies_data['item_id'] == 50
combined_movies_data[is_item_50]['movie title'].unique()

array(['Star Wars (1977)'], dtype=object)

We can see that item ID 50 is Star Wars (1977), which appears to have been a very popular movie.
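The two steps above (count the ratings per item, then look up the title) can also be folded into one. The following is a minimal sketch on a made-up toy frame, not the MovieLens data; the `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for combined_movies_data (values are made up for illustration).
toy = pd.DataFrame({
    'user_id':     [1, 2, 3, 1, 2, 3],
    'item_id':     [50, 50, 50, 242, 242, 1],
    'rating':      [5, 4, 5, 3, 4, 4],
    'movie title': ['Star Wars (1977)'] * 3 + ['Kolya (1996)'] * 2 + ['Toy Story (1995)'],
})

# Group on both the ID and the title so the title travels with the count.
rating_counts = (
    toy.groupby(['item_id', 'movie title'])['rating']
       .count()
       .sort_values(ascending=False)
)
print(rating_counts.head())
```

Grouping on both columns keeps the item ID as the key while showing the title next to each count, so no separate lookup is needed.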

Now we're going to build a utility matrix. This matrix holds one value for each combination of user and movie.

Where the user rated the movie, the cell holds that numeric rating.

All other user-movie cells come back as null.

### Building a Utility Matrix

In [8]:
# TruncatedSVD does not accept null values, so we pass fill_value=0 to replace them with zeros.
 
rating_crosstab = combined_movies_data.pivot_table(values='rating', index='user_id', columns='movie title', fill_value=0)
rating_crosstab.head()

user_id,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
1,0,0,2,5,0,0,3,4,0,0,...,0,0,0,5,3,0,0,0,4,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,2,0,0,0,0,4,0,0,...,0,0,0,4,0,0,0,0,4,0
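To see what `fill_value=0` changes, here is a small sketch on a made-up ratings frame (the `toy` data is hypothetical). It also shows a side effect worth knowing: `pivot_table` aggregates with the mean by default, so when two item IDs share one title, a user's ratings for both are averaged into a single cell. This may be why the utility matrix has 1,664 title columns while the dataset README describes 1,682 movies.

```python
import pandas as pd

# Made-up ratings; items 20 and 21 deliberately share the title 'Movie C'.
toy = pd.DataFrame({
    'user_id':     [1, 1, 2, 3, 3],
    'item_id':     [10, 11, 10, 20, 21],
    'rating':      [5, 3, 4, 2, 4],
    'movie title': ['Movie A', 'Movie B', 'Movie A', 'Movie C', 'Movie C'],
})

# Without fill_value, unrated user-movie cells are NaN.
with_nulls = toy.pivot_table(values='rating', index='user_id', columns='movie title')

# With fill_value=0, the same cells become 0 and the matrix has no nulls.
filled = toy.pivot_table(values='rating', index='user_id', columns='movie title',
                         fill_value=0)

print(with_nulls)
print(filled)
```

Filling with zeros makes an unrated movie look the same as a rating of 0 to the SVD model. That is a simplification this demo accepts.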


We're going to take this utility matrix, transpose it, and then use SVD to decompose it into synthetic representations of the user ratings.

Since we want to recommend movies, we want to preserve the movie names as uncompressed rows.

We want to use the similarities between users to decide which movies to recommend, so we'll use truncated SVD to compress all of the user ratings down to just 12 latent variables.

These variables are meant to capture much of the information that was previously stored in the 943 user columns.

They represent a generalized view of users' tastes and preferences. The first thing we'll do is transpose our matrix, so that movies are represented by rows and users by columns.

Then we'll use SVD to compress this matrix down to a 1,664 by 12 matrix. All of the individual movie names are retained along the rows.

The users, however, are compressed down to 12 synthetic components that represent a generalized view of users' tastes.
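The compression step can be sketched on a small made-up matrix. This sketch uses NumPy's exact SVD rather than scikit-learn's TruncatedSVD, so it illustrates the idea and will not reproduce the notebook's numbers. The matrix `X_toy`, the choice `k = 2`, and the "energy" measure (the share of squared singular values that is kept) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(17)

# Hypothetical toy matrix: 6 movies (rows) x 5 users (columns), ratings 0-5.
X_toy = rng.integers(0, 6, size=(6, 5)).astype(float)

# Thin SVD: X_toy = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X_toy, full_matrices=False)

k = 2  # number of latent components to keep
movies_reduced = U[:, :k] * s[:k]  # shape (6, k): one row per movie

# Share of the squared singular values kept by the top k components.
energy_kept = (s[:k] ** 2).sum() / (s ** 2).sum()

print(movies_reduced.shape)
print(round(energy_kept, 3))
```

Each movie keeps its own row, while the five user columns are replaced by two synthetic components. A share close to 1 would suggest that little information is lost. Checking a similar figure on the real matrix is one way to judge whether 12 components are enough.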

### Transposing the Matrix

In [9]:
# get shape of matrix
rating_crosstab.shape

(943, 1664)

In [10]:
X = rating_crosstab.T
X.shape

(1664, 943)

### Decomposing the Matrix

In [11]:
SVD = TruncatedSVD(n_components=12, random_state=17)

# fit_transform() fits the SVD model to matrix X
# and performs the dimensionality reduction on it in one step.
resultant_matrix = SVD.fit_transform(X)

resultant_matrix.shape

(1664, 12)

So we have a 1,664 by 12 matrix.

Next let's generate a correlation matrix.

We'll calculate the Pearson r correlation coefficient for every pair of movies in the resultant matrix, so that the correlation reflects similarities in user preferences.


### Generating a Correlation Matrix

Now we want to find out how similar each movie is to other movies on the basis of generalized user tastes. 

We use the Pearson r correlation coefficient to do that. 

The result is a 1,664 by 1,664 correlation matrix. In our demo, Star Wars (1977) will be our movie of interest.

We'll take the array that represents Star Wars and examine how well its 12 latent components correlate with those of the other movies in the dataset.

For each movie pair in the matrix, we'll calculate how well the two movies correlate, based on user preferences.

The goal is to recommend the movie that has the highest correlation with our movie of interest, based on generalized user tastes.

In [16]:
corr_mat = np.corrcoef(resultant_matrix)
corr_mat.shape

(1664, 1664)
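`np.corrcoef` treats each row of its input as one variable by default, which is why passing the movies-by-components matrix yields a movies-by-movies result. A minimal sketch with made-up latent vectors (the `latent` array is hypothetical):

```python
import numpy as np

# Hypothetical latent vectors for 3 movies (rows) over 4 components (columns).
latent = np.array([
    [1.0, 2.0, 3.0, 4.0],   # movie 0
    [2.0, 4.0, 6.0, 8.5],   # movie 1: nearly proportional to movie 0
    [4.0, 3.0, 2.0, 1.0],   # movie 2: the reverse pattern of movie 0
])

# Rows are variables, so the result has one row and one column per movie.
corr = np.corrcoef(latent)

print(corr.shape)
print(np.round(corr, 3))
```

The diagonal is 1 because every movie correlates perfectly with itself. Movie 1 correlates strongly with movie 0, and movie 2 has a correlation of -1 with movie 0.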

So we have a 1,664 by 1,664 correlation matrix, which is just what we needed.

Now let's isolate Star Wars from this correlation matrix.

### Isolating Star Wars From the Correlation Matrix

The first thing we're going to do is generate a movie names index, which we'll set equal to the columns of the rating crosstab matrix.

This gives us a pandas Index that contains one entry for each movie name in the rating crosstab.

Let's convert this Index to a list by calling list() on movie_names, and we'll call the result movies_list.

Now we need to find the numeric index value of Star Wars, so we can use it as our movie of interest.

To do that, we'll call the .index() method on movies_list and pass in the title of our movie, 'Star Wars (1977)'. We'll name the result star_wars, since that's the movie the index value represents.

And let's display it.

In [17]:
# pull these movie names from our ratings crosstab columns
movie_names = rating_crosstab.columns

# Let's convert this pandas Index to a list
movies_list = list(movie_names)

# Now we need to find the numeric index value of Star Wars, so we can use that as a movie of interest. 
star_wars = movies_list.index('Star Wars (1977)')
star_wars

1398

We see that the index value of Star Wars is 1398.
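The list conversion is one way to get the position. A pandas Index can also be asked directly with `get_loc`. A small sketch, using a made-up three-title Index as a stand-in for `rating_crosstab.columns`:

```python
import pandas as pd

# Hypothetical stand-in for rating_crosstab.columns (a pandas Index of titles).
titles = pd.Index(
    ['Return of the Jedi (1983)', 'Star Wars (1977)', 'Toy Story (1995)'],
    name='movie title',
)

# Option 1: convert to a list and use list.index, as the notebook does.
position_from_list = list(titles).index('Star Wars (1977)')

# Option 2: ask the Index directly; for unique labels get_loc returns an integer.
position_from_index = titles.get_loc('Star Wars (1977)')

print(position_from_list, position_from_index)
```

Both approaches return the same position when the titles are unique, which holds for the columns of a pivot table.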

In [18]:
# From the correlation matrix, isolate the row that represents Star Wars (numeric index 1398, stored in star_wars).
corr_star_wars = corr_mat[star_wars]

# check the shape
corr_star_wars.shape

(1664,)

We have a one-dimensional array of 1,664 values. Each value is a Pearson r coefficient that indicates how well a movie in the dataset correlates with Star Wars, based on user preferences.

### Recommending a Highly Correlated Movie

Now let's generate a list of movie names that exhibit a high degree of correlation with Star Wars.

To do that, we'll call the list function, and then pass in the expression to retrieve only the movies that have a Pearson r coefficient close to one.

In [19]:


# we select the movie names that have a correlation with Star Wars, of less than one but greater than 0.9.
list(movie_names[(corr_star_wars<1.0) & (corr_star_wars > 0.9)])

['Die Hard (1988)',
 'Empire Strikes Back, The (1980)',
 'Fugitive, The (1993)',
 'Raiders of the Lost Ark (1981)',
 'Return of the Jedi (1983)',
 'Terminator 2: Judgment Day (1991)',
 'Terminator, The (1984)',
 'Toy Story (1995)']

Here we get a list of movie names that have a high degree of correlation with Star Wars. 

Now let's make a list of movies that correlate with Star Wars even more closely (a Pearson r greater than 0.95).

In [15]:
list(movie_names[(corr_star_wars<1.0) & (corr_star_wars > 0.95)])

['Return of the Jedi (1983)']

We get 'Return of the Jedi (1983)'. Both movies came out within six years of each other, and they're both very popular sci-fi films.

So it makes sense that a person who likes Star Wars (1977) will probably also like Return of the Jedi (1983).
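A fixed threshold can return many movies or none, and the `corr_star_wars < 1.0` test relies on the movie's correlation with itself being exactly 1.0, which floating-point arithmetic does not guarantee. An alternative is to rank the correlations and keep the top few, excluding the movie of interest by position. The sketch below uses made-up titles and values; the helper `top_n_similar` is hypothetical and not part of the notebook.

```python
import numpy as np

# Hypothetical titles and their correlations with the movie of interest
# (values are made up; position 0 is the movie of interest itself).
titles = np.array(['Star Wars (1977)', 'Return of the Jedi (1983)',
                   'Toy Story (1995)', 'Kolya (1996)'])
corr_with_target = np.array([1.0, 0.96, 0.91, 0.12])
target_position = 0


def top_n_similar(corr_row, names, target, n=2):
    """Return the n (title, correlation) pairs most correlated with the target."""
    order = np.argsort(corr_row)[::-1]   # positions, highest correlation first
    order = order[order != target][:n]   # drop the target itself, keep n
    return [(str(names[i]), float(corr_row[i])) for i in order]


recommendations = top_n_similar(corr_with_target, titles, target_position)
print(recommendations)
```

With the notebook's variables, the equivalent call would pass `corr_star_wars`, `movie_names`, and `star_wars`.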