# Collaborative Filtering on Netflix Prize Data

We start from our combined data set and shrink it by downcasting the columns to smaller data types. We then pivot the dataframe and use it to create a sparse dataframe with pandas. This reduces the size of the dataframe from 2.6 GB to approximately 1.1 GB.
Next, we will use Stochastic Gradient Descent to find the rank-40 SVD of the full matrix, fitting only the observed entries and ignoring the empty values, which allows us to predict ratings for those empty values.

We will also use a nearest-neighbor algorithm for our collaborative filtering. This requires a similarity function that subtracts each user's average rating over the items they rated (null values excluded), and then adds the target user's average back when predicting. From there, we can use the neighbors to predict a rating.
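
The neighborhood idea above can be sketched on a toy ratings matrix. This is a minimal illustration, not the notebook's implementation: the function names (`mean_center`, `cosine_sim`, `predict_rating`), the choice of cosine similarity, and the value of `k` are all assumptions made for the example.

```python
import numpy as np

def mean_center(ratings):
    """Subtract each user's mean over rated items; NaN marks a missing rating."""
    user_means = np.nanmean(ratings, axis=1)
    return ratings - user_means[:, None], user_means

def cosine_sim(a, b):
    """Cosine similarity over the items both users rated (0.0 if undefined)."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return 0.0
    denom = np.linalg.norm(a[both]) * np.linalg.norm(b[both])
    return float(a[both] @ b[both] / denom) if denom > 0 else 0.0

def predict_rating(ratings, user, item, k=2):
    """Predict ratings[user, item] from the k most similar users who rated item."""
    centered, means = mean_center(ratings)
    # Only users who actually rated the item can act as neighbors.
    candidates = [u for u in range(ratings.shape[0])
                  if u != user and not np.isnan(ratings[u, item])]
    sims = np.array([cosine_sim(centered[user], centered[u]) for u in candidates])
    order = np.argsort(-sims)[:k]
    # Keep only positively correlated neighbors (an assumption of this sketch).
    top = [(candidates[j], sims[j]) for j in order if sims[j] > 0]
    if not top:
        return float(means[user])  # fall back to the user's own average
    num = sum(s * centered[u, item] for u, s in top)
    den = sum(abs(s) for _, s in top)
    # Add the target user's average back to the weighted neighbor deviation.
    return float(means[user] + num / den)

# Toy data: rows are users, columns are items, NaN means "not rated".
toy = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 2.0, 1.0],
    [1.0, 2.0, 5.0, np.nan],
])
pred = predict_rating(toy, user=0, item=2)
```

Here user 1 agrees with user 0 on the shared items and rated item 2 below their own average, so the prediction for user 0 lands below user 0's average as well.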

In [2]:
import numpy as np
import pandas as pd
import os

cwd = os.getcwd()
movie = pd.read_csv(cwd + "/data/final.csv")

movie.describe()

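|       | MovieID     | CustomerID  | Rating      |
|-------|-------------|-------------|-------------|
| count | 100480500.0 | 100480500.0 | 100480500.0 |
| mean  | 9070.915    | 1322489.0   | 3.60429     |
| std   | 5131.891    | 764536.8    | 1.085219    |
| min   | 1.0         | 6.0         | 1.0         |
| 25%   | 4677.0      | 661198.0    | 3.0         |
| 50%   | 9051.0      | 1319012.0   | 4.0         |
| 75%   | 13635.0     | 1984455.0   | 4.0         |
| max   | 17770.0     | 2649429.0   | 5.0         |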


In [31]:
movie.head()

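|     | MovieID | CustomerID | Rating | Date       |
|-----|---------|------------|--------|------------|
| 0   | 1       | 1488844    | 3      | 2005-09-06 |
| 375 | 1       | 1605780    | 4      | 2004-09-17 |
| 374 | 1       | 2005193    | 4      | 2005-11-17 |
| 373 | 1       | 1565175    | 5      | 2004-08-10 |
| 372 | 1       | 493945     | 5      | 2005-04-12 |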


## Reducing Data Size

We will convert the Date column from object to 'category' and downcast the remaining columns from int64: MovieID to int16, CustomerID to int32, and Rating to int8.

In [4]:
movie.memory_usage()

Index               128
MovieID       803844056
CustomerID    803844056
Rating        803844056
Date          803844056
dtype: int64

In [5]:
movie.memory_usage().sum() / (1024**2)

3066.421844482422

In [6]:
movie.dtypes

MovieID        int64
CustomerID     int64
Rating         int64
Date          object
dtype: object

In [7]:
movie.Date.value_counts().size

2182

In [23]:
movie['Date'] = movie['Date'].astype('category')
movie['MovieID'] = movie['MovieID'].astype('int16')
movie['CustomerID'] = movie['CustomerID'].astype('int32')
movie['Rating'] = movie['Rating'].astype('int8')

In [24]:
movie.memory_usage().sum() / (1024**2)

1629.116213798523
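
As a sanity check on the dtype choices, the sketch below confirms that the column maxima reported by `describe()` fit in the target integer types, and repeats the conversion on a small synthetic frame. The synthetic data, the frame size, and the use of `memory_usage(deep=True)` are assumptions of this example, so its byte counts are not comparable to the figures above.

```python
import numpy as np
import pandas as pd

# Column maxima taken from the describe() output earlier in the notebook.
observed_max = {"MovieID": 17770, "CustomerID": 2649429, "Rating": 5}
target_dtype = {"MovieID": np.int16, "CustomerID": np.int32, "Rating": np.int8}

# A downcast is safe only if the largest value fits in the target type.
fits = {col: observed_max[col] <= np.iinfo(dt).max
        for col, dt in target_dtype.items()}

# Synthetic stand-in for the real table (hypothetical values, same ranges).
rng = np.random.default_rng(0)
n = 10_000
toy = pd.DataFrame({
    "MovieID": rng.integers(1, 17771, size=n),
    "CustomerID": rng.integers(6, 2649430, size=n),
    "Rating": rng.integers(1, 6, size=n),
    "Date": rng.choice(["2004-08-10", "2005-04-12", "2005-09-06"], size=n),
})

before = toy.memory_usage(deep=True).sum()
small = toy.astype({"MovieID": "int16", "CustomerID": "int32",
                    "Rating": "int8", "Date": "category"})
after = small.memory_usage(deep=True).sum()

print(fits)
print(f"{before} bytes -> {after} bytes")
```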

## Changing into a Sparse Dataframe

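First, we must pivot the dataframe into a CustomerID-by-MovieID table so that it can be stored as a sparse dataframe. Pivoting the whole table at once takes too much memory, so we need to work on far smaller dataframes: we sort by MovieID, split the rows into chunks, pivot each chunk, and then merge the pieces back together.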

In [20]:
movie.sort_values(by = "MovieID", inplace = True)

In [21]:
movie.memory_usage().sum() / (1024**2)

1629.116213798523

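In [29]:
from tqdm import tqdm

n_chunks = 8
# Ceiling division so that the chunks cover every row, including a shorter final chunk.
chunk_size = -(-movie.shape[0] // n_chunks)
chunks = list(range(0, movie.shape[0], chunk_size))

pieces = []
for start in tqdm(chunks):
    # iloc excludes the end position, so the slice needs no "- 1".
    chunk_movie = movie.iloc[start:start + chunk_size]
    pmovie = chunk_movie.pivot_table(values='Rating', index='CustomerID', columns='MovieID')
    pieces.append(pmovie.astype(pd.SparseDtype("int8", np.nan)))

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the pieces instead.
smovie = pd.concat(pieces)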

100%|██████████| 8/8 [17:55<00:00, 134.48s/it]


In [None]:
matrix1 = movie1.pivot_table(values='Rating', index='CustomerID', columns='MovieID')
matrix2 = movie2.pivot_table(values='Rating', index='CustomerID', columns='MovieID')
matrix3 = movie3.pivot_table(values='Rating', index='CustomerID', columns='MovieID')
matrix4 = movie4.pivot_table(values='Rating', index='CustomerID', columns='MovieID')

In [30]:
smovie.head()

| CustomerID / MovieID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 17761 | 17762 | 17763 | 17764 | 17765 | 17766 | 17767 | 17768 | 17769 | 17770 |
|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 6  | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7  | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8  | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [38]:
smovie.shape

(3701937, 17770)
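
Because the pivoted chunks are stacked row-wise, each chunk contributes its own CustomerID rows, so a customer whose ratings fall in several chunks appears on several rows; that is consistent with the 3,701,937 rows above. One possible alternative, sketched here on toy data and assuming SciPy is installed, builds a matrix with one row per customer directly from the long-format table with `scipy.sparse.csr_matrix`. This is not what the notebook does; the toy frame and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy long-format ratings in the same layout as the `movie` dataframe.
toy = pd.DataFrame({
    "MovieID":    [1, 1, 2, 3, 3],
    "CustomerID": [6, 7, 7, 6, 10],
    "Rating":     [3, 4, 5, 2, 1],
})

# Category codes give dense 0-based positions for the rows and columns.
users = toy["CustomerID"].astype("category")
movies = toy["MovieID"].astype("category")

ratings = sparse.csr_matrix(
    (toy["Rating"].to_numpy(dtype=np.int8),
     (users.cat.codes.to_numpy(), movies.cat.codes.to_numpy())),
    shape=(users.cat.categories.size, movies.cat.categories.size),
)

# Row j belongs to users.cat.categories[j], column j to movies.cat.categories[j].
print(ratings.shape, ratings.nnz)
```

In this layout a stored 0 cannot be told apart from a missing rating, which is acceptable here because ratings run from 1 to 5. Duplicate (customer, movie) pairs would be summed, so they should be removed beforehand.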

## Matrix Factorization

First, we need to find a reduced-rank SVD of smovie. We use rank 40 as a starting point.
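
A minimal sketch of the idea, assuming the ratings are available as parallel arrays of row indices, column indices, and values. It fits a low-rank factorization by stochastic gradient descent over the observed entries only. Strictly speaking this yields two factor matrices rather than the orthogonal factors of a true SVD. The function names, the global-mean baseline, the learning rate, the regularization strength, and the epoch count are illustrative assumptions, and the toy example uses rank 2 rather than 40 to keep it fast.

```python
import numpy as np

def sgd_factorize(rows, cols, vals, n_users, n_items,
                  rank=40, lr=0.02, reg=0.01, epochs=300, seed=0):
    """Fit vals[k] ~ mu + U[rows[k]] . V[cols[k]] using observed entries only."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    mu = float(vals.mean())  # global mean rating used as a baseline
    for _ in range(epochs):
        for k in rng.permutation(len(vals)):  # visit ratings in random order
            u, i = rows[k], cols[k]
            err = vals[k] - (mu + U[u] @ V[i])
            u_old = U[u].copy()  # keep the old user factors for the item update
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V, mu

def rmse(rows, cols, vals, U, V, mu):
    """Root-mean-square error of the model on the given entries."""
    pred = mu + np.einsum("ij,ij->i", U[rows], V[cols])
    return float(np.sqrt(np.mean((vals - pred) ** 2)))

# Toy problem: a synthetic rank-2 matrix with about 60% of entries observed.
rng = np.random.default_rng(1)
n_users, n_items, true_rank = 20, 15, 2
full = 3.0 + rng.normal(size=(n_users, true_rank)) @ rng.normal(size=(n_items, true_rank)).T
observed = rng.random((n_users, n_items)) < 0.6
rows, cols = np.nonzero(observed)
vals = full[rows, cols]

U, V, mu = sgd_factorize(rows, cols, vals, n_users, n_items, rank=true_rank)
baseline_rmse = float(np.sqrt(np.mean((vals - mu) ** 2)))  # always predict the mean
train_rmse = rmse(rows, cols, vals, U, V, mu)
print(baseline_rmse, train_rmse)
```

A prediction for an unobserved pair `(u, i)` is then `mu + U[u] @ V[i]`. On the full data set a pure Python loop like this would be far too slow, so it should be read as a description of the update rule rather than a practical implementation.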