# <center><b>Project: Collaborative Filtering with Gaussian Mixtures</b></center>

In this project, we work with user movie ratings extracted from the Netflix database. Many entries are missing because most users have not seen, or have not rated, most movies. If we can predict what those ratings would be, we can use the predictions to recommend new movies to each user. <b>Collaborative filtering</b> fills these gaps using the ratings that users have already given. We use the Expectation-Maximization <b>(EM)</b> algorithm: it iteratively fits a Gaussian mixture model to the observed ratings, and the fitted model is then used to predict the missing entries.

In [1]:
import numpy as np
import kmeans_clustering
import tools
import em_simple
import em_method

## Testing with Toy Data
Here we first test our algorithm on a small toy dataset to check that the implementation works before moving on to the real data.

In [2]:
X = np.loadtxt('toy_data.txt')
X.shape

(250, 2)

In [3]:
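mix, post = tools.init(X, 3)           # initialize a mixture with 3 components
post, ll = em_method.estep(X, mix)     # E-step: soft assignments and log-likelihood
mix = em_method.mstep(X, post, mix)    # M-step: update the mixture parameters
ll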

-1388.081800044069
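
The implementations of `em_method.estep` and `em_method.mstep` are not shown in this notebook. As a rough illustration of what one EM iteration might look like, here is a minimal sketch. It assumes spherical Gaussian components (one variance per component) and that a value of 0 marks a missing rating. The names `estep_sketch`, `mstep_sketch`, and the `min_var` floor are hypothetical and are not taken from `em_method`.

```python
import numpy as np

def estep_sketch(X, mu, var, p):
    """Soft-assign each row of X to K spherical Gaussians using observed entries only.

    X: (n, d) data, 0 = missing (assumption); mu: (K, d); var: (K,); p: (K,).
    Returns the (n, K) posteriors and the total log-likelihood.
    """
    mask = X != 0                                   # True where an entry is observed
    n_obs = mask.sum(axis=1)                        # observed entries per row
    diff = (X[:, None, :] - mu[None, :, :]) * mask[:, None, :]
    sq = (diff ** 2).sum(axis=2)                    # (n, K) squared distances
    log_joint = (np.log(p)[None, :]
                 - 0.5 * n_obs[:, None] * np.log(2 * np.pi * var)[None, :]
                 - sq / (2 * var)[None, :])
    # log-sum-exp over components, for numerical stability
    top = log_joint.max(axis=1, keepdims=True)
    log_norm = top + np.log(np.exp(log_joint - top).sum(axis=1, keepdims=True))
    post = np.exp(log_joint - log_norm)
    return post, float(log_norm.sum())


def mstep_sketch(X, post, min_var=1e-6):
    """Re-estimate means, variances, and weights from the soft assignments."""
    mask = (X != 0).astype(float)
    weight = post.T @ mask                          # (K, d) soft counts of observed entries
    mu = (post.T @ X) / np.maximum(weight, 1e-12)   # missing entries are 0, so they add nothing
    diff = (X[:, None, :] - mu[None, :, :]) * mask[:, None, :]
    sq = (diff ** 2).sum(axis=2)                    # (n, K)
    var = (post * sq).sum(axis=0) / np.maximum(weight.sum(axis=1), 1e-12)
    var = np.maximum(var, min_var)                  # hypothetical floor to avoid collapse
    p = post.mean(axis=0)                           # mixing weights
    return mu, var, p
```

Alternating the two functions should never decrease the log-likelihood returned by `estep_sketch`, which is a useful sanity check on any EM implementation.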

## Applying on Netflix Data
Now for the real test. We load the Netflix user-movie rating data and count how many entries are missing.

In [4]:
X = np.loadtxt('netflix_incomplete.txt')
# Missing ratings are stored as 0
print(np.sum(X == 0), ' missing entries!')

328232  missing entries!


The <b>EM</b> algorithm fits a Gaussian mixture model to the observed ratings; the missing entries are then filled in from the fitted mixture.
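
The code for `em_method.fill_matrix`, which is used later in this notebook, is not shown. One common way to fill the gaps, assumed here, is to replace each missing entry with the posterior-weighted average of the component means. The sketch below illustrates that idea; `fill_sketch` and its arguments are hypothetical names, and the author's function may differ.

```python
import numpy as np

def fill_sketch(X, mu, post):
    """Replace zeros in X with the posterior-weighted average of the component means.

    X: (n, d) ratings with 0 = missing (assumption); mu: (K, d) component means;
    post: (n, K) soft assignments of each user to each component.
    """
    X_pred = post @ mu                    # (n, d) expected rating under the mixture
    return np.where(X != 0, X, X_pred)    # keep observed ratings, fill only the gaps


# Tiny example: two users, two movies, two components
X_demo = np.array([[5.0, 0.0], [0.0, 2.0]])
mu_demo = np.array([[5.0, 4.0], [1.0, 2.0]])
post_demo = np.array([[1.0, 0.0], [0.5, 0.5]])
filled_demo = fill_sketch(X_demo, mu_demo, post_demo)   # [[5, 4], [3, 2]]
```

Observed ratings are left untouched; only entries that were 0 are replaced by a prediction.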

In [5]:
K = np.array([12])   # numbers of mixture components to try
s = 0                # random seed used for initialization
for kk in K:
    mix, post = tools.init(X, kk, s)
    mixture, post, ll = em_method.run(X, mix, post)
    print('Likelihood is : ', ll)
    #common.plot( X, mixture, post, title='The  model for K = %d'%(kk))

Likelihood is :  -1399820.8093013526


In [6]:
for s in range(4):
    mix, post = tools.init( X, 12, s)
    mixture, post, ll = em_method.run(X, mix, post)
    print('Likelihood for seed ',s, ' is : ',ll)

Likelihood for seed  0  is :  -1399820.8093013526
Likelihood for seed  1  is :  -1390280.999157461
Likelihood for seed  2  is :  -1417137.302463856
Likelihood for seed  3  is :  -1393103.8986528188


The log-likelihood is highest for seed 1, so we use that seed for the final run.
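
The same choice can be made programmatically instead of by eye. The short sketch below reuses the four log-likelihood values printed above; the dictionary name `log_likelihoods` is only illustrative.

```python
# Final log-likelihoods reported above, keyed by seed
log_likelihoods = {
    0: -1399820.8093013526,
    1: -1390280.999157461,
    2: -1417137.302463856,
    3: -1393103.8986528188,
}

# The best seed is the one with the largest (least negative) log-likelihood
best_seed = max(log_likelihoods, key=log_likelihoods.get)
print(best_seed)  # 1
```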

In [10]:
mix, post = tools.init( X, 12, 1)
mixture, post, ll = em_method.run(X, mix, post)

xx = em_method.fill_matrix(X, mixture)
print(np.sum(xx==0),' missing entries!')

1271  missing entries!


The number of missing entries drops from 328232 to only 1271, a large improvement. We can compute the RMSE against the complete data to quantify how good the predictions are.
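
The definition of `tools.rmse` is not shown here. A minimal sketch, assuming it is the root-mean-square error taken over every entry of the two matrices, could look like this (`rmse_sketch` is a hypothetical name):

```python
import numpy as np

def rmse_sketch(X, Y):
    """Root-mean-square error over every entry of two equally shaped matrices."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return float(np.sqrt(np.mean((X - Y) ** 2)))
```

Under this assumption, every entry that is still 0 is scored as if 0 were the predicted rating, so a matrix with many unfilled entries gets a large RMSE.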

In [12]:
Y = np.loadtxt('netflix_complete.txt')

print('The RMSE of the incomplete dataset is ',tools.rmse(X,Y))
print('And the RMSE of the completed dataset is ',tools.rmse(xx,Y))

The RMSE of the incomplete dataset is  1.6787480867863673
And the RMSE of the completed dataset is  0.48050704941977734


The RMSE falls from about 1.68 to about 0.48, so our collaborative filtering approach appears to be a success.