# HW3: Netflix Data Analysis

In this homework assignment, you will analyze the Netflix Prize data. The data consist of 100,480,507 movie ratings on a scale from 1 to 5 stars. The ratings are distributed across 17,770 movies and 480,189 users. We have provided the training data as a sparse matrix where the row corresponds to the movie ID and the column corresponds to the user ID. A separate file contains the title and year of release for each movie. The original, raw data consist of multiple lists of tuples; each list is a separate movie and each tuple is User ID, Rating, and Rating Year.
The original data can be downloaded here: https://archive.org/download/nf_prize_dataset.tar
Further information about the Netflix Prize is available online:
https://en.wikipedia.org/wiki/Netflix_Prize
https://www.netflixprize.com/
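
The relationship between the raw per-movie lists and the sparse matrix can be illustrated with a minimal sketch. The `raw` dictionary below is a hypothetical toy stand-in for the raw data, not the actual file format, and the sketch assumes that IDs start at 1 so that row 0 and column 0 stay empty.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy stand-in for the raw data: one list of
# (user_id, rating, rating_year) tuples per movie ID.
raw = {
    1: [(6, 3, 2005), (7, 5, 2005)],
    2: [(6, 4, 2004)],
    3: [(2, 1, 2003), (7, 2, 2005)],
}

movie_ids, user_ids, ratings = [], [], []
for movie_id, tuples in raw.items():
    for user_id, rating, _year in tuples:
        movie_ids.append(movie_id)
        user_ids.append(user_id)
        ratings.append(rating)

# Rows are indexed by movie ID and columns by user ID. Assuming IDs start
# at 1, one extra row and column are allocated and index 0 stays empty.
n_movies = max(movie_ids) + 1
n_users = max(user_ids) + 1
toy_csr = csr_matrix((ratings, (movie_ids, user_ids)), shape=(n_movies, n_users))

print(toy_csr.shape)  # (4, 8)
print(toy_csr.nnz)    # 5
```

Unrated (movie, user) pairs are simply not stored, which is what keeps a matrix of this kind small enough to hold in memory.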

In [6]:
import numpy as np
import pandas as pd
import scipy.sparse
from scipy.sparse import csr_matrix

In [7]:
# This file consists of titles and release years associated with each ID
movie_titles = pd.read_csv('data/movie_titles.txt', header = None, names = ['ID','Year','Name'])
print(movie_titles.head())

   ID    Year                          Name
0   1  2003.0               Dinosaur Planet
1   2  2004.0    Isle of Man TT 2004 Review
2   3  1997.0                     Character
3   4  1994.0  Paula Abdul's Get Up & Dance
4   5  2004.0      The Rise and Fall of ECW


In [8]:
# The file stores a sparse matrix of movies by users; transposing it gives users by movies, with each element a rating (1-5) or nonresponse (0)
ratings_csr = scipy.sparse.load_npz('data/netflix_full_csr.npz').T
print(ratings_csr.shape)

(2649430, 17771)


To avoid memory overflow errors, we randomly subsample the data below. Some computers can handle the full dataset (e.g., a 2017 MacBook Pro can perform SVD on the full dataset), but older computers will likely need to subsample. You can also consider using Princeton computing resources and clusters to perform more computationally expensive analyses.

In [9]:
#n_samples = 5000
n_viewers = 10000
#random_sample_movies = np.random.choice(17771, size = n_samples)
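# Sample user rows without replacement so that no viewer is drawn twice
random_sample_viewers = np.random.choice(ratings_csr.shape[0], size = n_viewers, replace = False)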
ratings_small = ratings_csr[random_sample_viewers,:]
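
Because the matrix has 2,649,430 rows but the data contain only 480,189 users, a uniform draw over row indices will mostly select empty rows. One possible refinement, sketched here on a hypothetical toy matrix (`toy` stands in for `ratings_csr`), is to restrict the draw to rows that contain at least one rating. This is an illustrative variation, not part of the provided starter code.

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)

# Hypothetical stand-in for ratings_csr: 1,000 "users" by 50 "movies".
toy = sparse_random(1000, 50, density=0.01, format='csr', random_state=0)

# For a CSR matrix, consecutive differences of indptr give the number of
# stored entries in each row, i.e. the number of ratings per user.
ratings_per_user = np.diff(toy.indptr)
active_users = np.flatnonzero(ratings_per_user > 0)

# Sample without replacement, and only among users with at least one rating.
n_viewers = min(100, active_users.size)
sampled = rng.choice(active_users, size=n_viewers, replace=False)
toy_small = toy[sampled, :]

print(toy_small.shape)
```

The same pattern should carry over to the full matrix by replacing `toy` with `ratings_csr` and choosing a larger `n_viewers`.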

A common method for analyzing large datasets is dimension reduction. Here we perform a truncated SVD, which is suited to sparse datasets, and analyze which movies are associated with the different latent dimensions.
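
To make the two outputs of the decomposition concrete, here is a minimal sketch on a hypothetical 6-user by 4-movie toy matrix (the values are invented for illustration). With users as rows, `fit_transform` returns one row of scores per user, while `components_` holds one loading per movie for each latent dimension.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy ratings: 6 users (rows) by 4 movies (columns), 0 = no rating.
toy_ratings = csr_matrix(np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [5, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
    [0, 0, 5, 5],
], dtype=float))

toy_svd = TruncatedSVD(n_components=2, random_state=0)
user_scores = toy_svd.fit_transform(toy_ratings)  # one row per user
movie_loadings = toy_svd.components_              # one column per movie

print(user_scores.shape)     # (6, 2)
print(movie_loadings.shape)  # (2, 4)
```

Questions about which movies drive a latent dimension are therefore answered from the rows of `components_`, whereas the transformed matrix describes users.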

In [10]:
from sklearn.decomposition import TruncatedSVD

In [11]:
n_components = 5
svd = TruncatedSVD(n_components = n_components)

In [12]:
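# Fit on the full matrix; substitute ratings_small if memory is limited.
# Z has one row per user and one column per latent component.
Z = svd.fit_transform(ratings_csr)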

In [13]:
components = svd.components_

In [14]:
print(svd.explained_variance_ratio_)

[0.22842178 0.02974135 0.01995603 0.0167842  0.01250708]


In [15]:
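# Z holds per-user scores, so its row indices are user IDs, not movie IDs.
# The per-movie loadings are the rows of `components` (n_components x n_movies).
# This assumes the column index equals the movie ID (17,771 columns for 17,770 movies).
for i in range(n_components):
    loadings = components[i, :]
    loadings_sort = np.argsort(np.abs(loadings))
    print('Component ' + str(i))
    # Print the four movies with the largest absolute loading
    for j in range(1, 5):
        movie_index = loadings_sort[-j]
        movie_title = movie_titles[movie_titles['ID'] == movie_index].iloc[0]['Name']
        movie_weight = loadings[movie_index]
        print('{}: {}'.format(movie_title, movie_weight))
    print(' ')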