# Your own personal Netflix
## Data Preprocessing

To read the dataset, you may need to adjust the path to where it is stored:

In [1]:
import pandas as pd # pandas is a data manipulation library
import numpy as np
from scipy import linalg
# let's explore movies.csv
movies = pd.read_csv('ml-latest-small/movies.csv')
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [2]:
# let's explore ratings.csv
ratings = pd.read_csv('ml-latest-small/ratings.csv', sep=',')
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


The given ratings lie between 0.5 and 5:

In [3]:
min(ratings["rating"]), max(ratings["rating"])

(0.5, 5.0)

We convert the long-format list of movie ratings into a user-by-movie data matrix. Missing ratings are filled with zeros, which is unambiguous because the smallest actual rating is 0.5.

In [4]:
df_movie_ratings = ratings.pivot(
    index='userId',
    columns='movieId',
    values='rating'
).fillna(0)  # fill unobserved entries with 0
df_movie_ratings.head()

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
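
To make the reshaping concrete, here is a minimal sketch on a made-up three-row ratings table. The values and the names `toy_ratings`, `toy`, and `observed` are illustrative assumptions, not part of the dataset. It also shows that the observed entries can be recovered from the zero-filled matrix, since no real rating equals 0.

```python
import pandas as pd

# hypothetical toy data in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [10, 20, 20],
    "rating":  [4.0, 3.5, 5.0],
})

# one row per user, one column per movie, 0 where no rating exists
toy = toy_ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)

# boolean mask of the entries that were actually rated
observed = toy != 0

print(toy)
print(observed.to_numpy().sum(), "observed ratings")
```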


We consider only the movies that have been rated by more than 100 users, which leaves 134 movies. We would not be able to infer a pattern for movies with very few observations anyway, and for this exercise we are mostly interested in the principle and do not need a big dataset.

In [5]:
np.sum(np.sum(df_movie_ratings!=0,0)>100)

134

In [6]:
keep_movie = np.sum(df_movie_ratings!=0,0)>100
df_D = df_movie_ratings.loc[:,keep_movie]
df_D.head()

userId \ movieId,1,2,6,10,32,34,39,47,50,110,...,7153,7361,7438,8961,33794,48516,58559,60069,68954,79132
1,4.0,0.0,4.0,0.0,0.0,0.0,0.0,5.0,5.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.0,4.5,0.0,0.0,4.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,4.0,3.0,0.0,4.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Furthermore, we discard all users who have rated fewer than five of the remaining movies. It would be hard to make recommendations based on four or fewer ratings anyway.

In [7]:
np.sum(np.sum(df_D!=0,1)>=5)

556

The resulting dataset has userIds as rows and movieIds as columns. Because users were removed, a userId no longer matches its row position: userId 3 is gone, so userIds 1, 2, and 4 address the first three rows of this dataset.

In [8]:
keep_user = np.sum(df_D!=0,1)>=5
df_D = df_D.loc[keep_user,:]
df_D.head()

userId \ movieId,1,2,6,10,32,34,39,47,50,110,...,7153,7361,7438,8961,33794,48516,58559,60069,68954,79132
1,4.0,0.0,4.0,0.0,0.0,0.0,0.0,5.0,5.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.0,4.5,0.0,0.0,4.0
4,0.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,4.0,3.0,0.0,4.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,4.0,4.0,3.0,4.0,4.0,0.0,4.0,1.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
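
Because some users were dropped, a userId has to be translated into a row position before indexing the NumPy matrix. A small sketch with a hand-written index that mirrors the first five userIds shown above (the variable names `idx` and `row_of_user_4` are illustrative):

```python
import pandas as pd

# labels of the first five remaining users, as in the table above
idx = pd.Index([1, 2, 4, 5, 6], name="userId")

# get_loc maps a label (userId) to its zero-based row position
row_of_user_4 = idx.get_loc(4)
print(row_of_user_4)
```

The same lookup works on the real data via `df_D.index.get_loc(...)` for users and `df_D.columns.get_loc(...)` for movies.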


The assignment of movieIds to titles is given as follows:

In [9]:
selected_movies = movies.loc[movies['movieId'].isin(df_D.columns)]
selected_movies

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
5,6,Heat (1995),Action|Crime|Thriller
9,10,GoldenEye (1995),Action|Adventure|Thriller
31,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
...,...,...,...
6315,48516,"Departed, The (2006)",Crime|Drama|Thriller
6710,58559,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX
6772,60069,WALL·E (2008),Adventure|Animation|Children|Romance|Sci-Fi
7039,68954,Up (2009),Adventure|Animation|Children|Drama


The resulting data matrix has the following shape:

In [10]:
D = df_D.to_numpy()
D.shape

(556, 134)

## Optimization
Use the following initialization for your implementation of the optimization scheme.

In [11]:
def matrix_completion(D, r, t_max=100, λ = 0.1):
    n,d = D.shape
    # r = np.random.randint(1,np.min([n,d]))

    # random initialization of the movie factors X (d x r) and user factors Y (n x r)
    X = np.random.normal(size =(d,r))
    Y = np.random.normal(size =(n,r))
    # O marks the observed entries: 1 where a rating exists, 0 otherwise
    O = np.zeros_like(D)
    O[np.nonzero(D)] = 1
    
    # alternate between closed-form ridge updates of X and Y
    for t in range(t_max):
        for k in range(d):
            # update the factors of movie k using only the users who rated it
            Oxk = np.diag(O[:,k])
            inv = linalg.inv(np.matmul(np.matmul(np.transpose(Y),Oxk),Y)+λ*np.identity(r))
            X[k] = np.matmul(np.matmul(np.transpose(D[:,k]),Y),inv)
        for i in range(n):
            # update the factors of user i using only the movies they rated
            Oyi= np.diag(O[i,:])
            inv = linalg.inv(np.matmul(np.matmul(np.transpose(X),Oyi),X)+λ*np.identity(r))
            Y[i] = np.matmul(np.matmul(D[i,:],X),inv)
    return X,Y
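
Read as formulas, the two inner loops appear to be the closed-form ridge-regression solutions of a regularized squared-error objective restricted to the observed entries. The notation below is an assumption inferred from the code, not a statement taken from the exercise: $O$ is the indicator matrix of observed entries, $\circ$ the element-wise product, and $X_{k\cdot}$, $Y_{i\cdot}$ are rows of the factor matrices.

```latex
\[
\min_{X \in \mathbb{R}^{d \times r},\; Y \in \mathbb{R}^{n \times r}}
  \; \lVert D - O \circ (Y X^\top) \rVert_F^2
  + \lambda \lVert X \rVert_F^2 + \lambda \lVert Y \rVert_F^2
\]
\[
X_{k\cdot} = D_{\cdot k}^\top Y
  \left( Y^\top \operatorname{diag}(O_{\cdot k})\, Y + \lambda I \right)^{-1},
\qquad
Y_{i\cdot} = D_{i\cdot} X
  \left( X^\top \operatorname{diag}(O_{i\cdot})\, X + \lambda I \right)^{-1}
\]
```

Since unobserved entries of $D$ are zero, $D_{\cdot k}^\top Y$ already sums only over the users who rated movie $k$, which is why the code does not mask $D$ explicitly.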

In [12]:
X,Y = matrix_completion(D,r=20,t_max=100,λ=0.1)
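
Building the n×n matrix `np.diag(O[:,k])` and explicitly inverting can become costly for larger data. The following is a sketch of one alternating pass that selects the observed rows directly and uses `np.linalg.solve` instead. The helper names `als_sweep` and `objective` and the random toy data are assumptions for illustration, and the objective used is the regularized squared error on observed entries that these updates minimize.

```python
import numpy as np

def als_sweep(D, O, X, Y, lam):
    """One alternating pass over all movies and users (hypothetical helper)."""
    r = X.shape[1]
    reg = lam * np.eye(r)
    for k in range(D.shape[1]):
        rated = O[:, k] > 0                 # users who rated movie k
        Yk, dk = Y[rated], D[rated, k]
        X[k] = np.linalg.solve(Yk.T @ Yk + reg, Yk.T @ dk)
    for i in range(D.shape[0]):
        rated = O[i] > 0                    # movies rated by user i
        Xi, di = X[rated], D[i, rated]
        Y[i] = np.linalg.solve(Xi.T @ Xi + reg, Xi.T @ di)
    return X, Y

def objective(D, O, X, Y, lam):
    """Squared error on observed entries plus ridge penalty."""
    resid = O * (D - Y @ X.T)
    return np.sum(resid**2) + lam * (np.sum(X**2) + np.sum(Y**2))

# made-up toy problem: zeros play the role of unobserved ratings
rng = np.random.default_rng(0)
D_toy = rng.integers(0, 6, size=(8, 5)).astype(float)
O_toy = (D_toy != 0).astype(float)
X_toy = rng.normal(size=(5, 2))
Y_toy = rng.normal(size=(8, 2))

before = objective(D_toy, O_toy, X_toy, Y_toy, 0.1)
als_sweep(D_toy, O_toy, X_toy, Y_toy, 0.1)
after = objective(D_toy, O_toy, X_toy, Y_toy, 0.1)
print(before, after)
```

Each row update is an exact minimizer with the other factor held fixed, so the objective should not increase from one pass to the next; printing it per iteration is a cheap convergence check.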



In [13]:
def getAproxError(X,Y):
    O = np.zeros_like(D) 
    O[np.nonzero(D)] = 1
    D_aprox = np.multiply(O,np.matmul(Y,np.transpose(X)))
    err = linalg.norm(D-D_aprox)**2
    print(err)
    return err
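
`getAproxError` returns the squared error summed over all observed ratings, so its magnitude depends on how many ratings there are. A per-rating figure is often easier to interpret. A minimal sketch follows; the name `rmse_observed` and the toy matrices are illustrative assumptions.

```python
import numpy as np

def rmse_observed(D, X, Y):
    """Root-mean-squared error over the observed (non-zero) entries of D."""
    observed = D != 0
    resid = (D - Y @ X.T)[observed]
    return np.sqrt(np.mean(resid**2))

# toy check with hypothetical rank-1 factors
D_toy = np.array([[4.0, 0.0],
                  [0.0, 2.0]])
Y_toy = np.array([[2.0], [1.0]])
X_toy = np.array([[2.0], [1.0]])
toy_rmse = rmse_observed(D_toy, X_toy, Y_toy)
print(toy_rmse)
```

Unlike the notebook function, this sketch takes `D` as an argument instead of reading it from the global scope.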

In [26]:
# total squared approximation error on the observed entries
getAproxError(X,Y);

1810.1583124142587


In [15]:
def getEstimatedRating(X,Y, user, movieTitle):
    D_aprox = np.matmul(Y,np.transpose(X))
    movieId = selected_movies.loc[selected_movies["title"]==movieTitle]["movieId"].iloc[0]
    return D_aprox[df_D.index.get_loc(user),df_D.columns.get_loc(movieId)]


In [16]:
getEstimatedRating(X,Y,1,"Lord of the Rings: The Two Towers, The (2002)")

5.4156372729670545

In [17]:
getEstimatedRating(X,Y,1,"Dark Knight, The (2008)")

5.118671072145141

In [18]:
getEstimatedRating(X,Y,1,"Clueless (1995)")

3.0031359623359766

In [19]:
getEstimatedRating(X,Y,1,"2001: A Space Odyssey (1968)")

2.87769811576338
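
Nothing in the factorization constrains the product to the rating scale, which is why the first two estimates above exceed 5. If a displayable rating is needed, one simple option is to clip the estimate to the valid range. A sketch, where the function name `clip_rating` is an assumption and not part of the notebook:

```python
import numpy as np

def clip_rating(estimate, low=0.5, high=5.0):
    """Clamp an estimated rating to the range of the original scale."""
    return float(np.clip(estimate, low, high))

print(clip_rating(5.4156372729670545))  # estimate above the scale
print(clip_rating(3.0031359623359766))  # estimate already inside the scale
```

Clipping only changes what is shown; the ranking of movies for a user is still better taken from the unclipped estimates, where ties at 5.0 do not occur.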

In [20]:
# 3b
def getmissingValuesStats(X,Y):
    O = np.zeros_like(D) 
    O[np.nonzero(D)] = 1
    mis = np.multiply(np.ones_like(D)-O,np.matmul(Y,np.transpose(X))) # Y X^T at unobserved entries; observed entries are set to 0

    out = mis[np.where(mis < 0.5)].size+mis[np.where(mis > 5)].size
    mean = mis.mean()
    var = mis.var()
    print("Missing value imputations outside [0.5,5]: ",out)
    print("Mean of missing value imputations: ", mean)
    print("Variance of missing value imputations: ", var)
    return out,mean,var
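
Note that multiplying by `1 - O` leaves zeros at the observed positions, so those zeros are counted as values below 0.5 and also enter the mean and variance. A sketch of a variant that restricts the statistics to the unobserved entries via a boolean mask is given here. The name `missing_value_stats` and the explicit `D` argument are assumptions, and its numbers will differ from the outputs reported in this notebook.

```python
import numpy as np

def missing_value_stats(D, X, Y, low=0.5, high=5.0):
    """Count, mean and variance of imputations at unobserved entries only."""
    missing = D == 0
    imputed = (Y @ X.T)[missing]
    outside = int(np.sum((imputed < low) | (imputed > high)))
    return outside, imputed.mean(), imputed.var()

# toy check with hypothetical rank-1 factors
D_toy = np.array([[4.0, 0.0],
                  [0.0, 2.0]])
Y_toy = np.array([[2.0], [1.0]])
X_toy = np.array([[2.0], [3.0]])
out_toy, mean_toy, var_toy = missing_value_stats(D_toy, X_toy, Y_toy)
print(out_toy, mean_toy, var_toy)
```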

In [21]:
X1,Y1 = matrix_completion(D,r=20,t_max=100,λ=0.01)



In [22]:
X2,Y2 = matrix_completion(D,r=20,t_max=100,λ=0.5)



In [23]:
# Approximation error for different values of λ
print("Values for λ=0.01")
e1 = getAproxError(X1,Y1)
print("-----------------")
print("Values for λ=0.1")
e = getAproxError(X,Y)
print("-----------------")
print("Values for λ=0.5")
e2 = getAproxError(X2,Y2)

Values for λ=0.01
1812.5484252663728
-----------------
Values for λ=0.1
1810.1583124142587
-----------------
Values for λ=0.5
2025.305054831553


In [24]:
# Number of outliers, mean and variance of missing value imputations for different values of λ
print("Values for λ=0.01")
r1 = getmissingValuesStats(X1,Y1)
print("-----------------")
print("Values for λ=0.1")
r = getmissingValuesStats(X,Y)
print("-----------------")
print("Values for λ=0.5")
r2 = getmissingValuesStats(X2,Y2)

Values for λ=0.01
Missing value imputations outside [0.5,5]:  35071
Mean of missing value imputations:  2.416781394440063
Variance of missing value imputations:  7.2245358117358025
-----------------
Values for λ=0.1
Missing value imputations outside [0.5,5]:  29992
Mean of missing value imputations:  2.2671360702150567
Variance of missing value imputations:  4.220471492123602
-----------------
Values for λ=0.5
Missing value imputations outside [0.5,5]:  23582
Mean of missing value imputations:  2.4174643245366747
Variance of missing value imputations:  3.086039762964901


In [25]:
if r1[0] > r[0] and r[0] > r2[0]:
    print("The higher λ, the fewer missing value imputations are outside of the original range of ratings in [0.5,5].")

if r1[1] > r[1] and r[1] > r2[1]:
    print("The higher λ, the lower the mean of the missing value imputations is.")

if e1 < e and e < e2:
    print("The higher λ, the higher the approximation error.")

if r1[2] > r[2] and r[2] > r2[2]:
    print("The higher λ, the lower the variance of the missing value imputations is.")


The higher λ, the fewer missing value imputations are outside of the original range of ratings in [0.5,5].
The higher λ, the lower the variance of the missing value imputations is.
