In [2]:
import pandas as pd
import plotly.express as px
import numpy as np
import random as rd
from sklearn.cluster import KMeans

---
### About this notebook:
The purpose of this notebook is to apply K-means clustering to the moviegeeks dataset in order to split users into 22 clusters, as shown on page 179 of the text. This speeds up recommendation computations by restricting the candidate neighbours to users that belong to the same cluster as the target user.
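As a rough illustration of the speed-up idea, the sketch below restricts the neighbour search to users that share the target user's cluster. The `clusters` table and the `candidate_neighbours` helper are hypothetical and use made-up values; they are not part of the moviegeeks code.

```python
import pandas as pd

# Hypothetical cluster assignments, one row per user (toy values, not from the dataset).
clusters = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "cluster": [0, 1, 0, 2, 1, 0],
})

def candidate_neighbours(clusters, target_user):
    """Return the ids of the other users that share the target user's cluster."""
    target_cluster = clusters.loc[clusters["user_id"] == target_user, "cluster"].iloc[0]
    same_cluster = clusters[clusters["cluster"] == target_cluster]
    return same_cluster.loc[same_cluster["user_id"] != target_user, "user_id"].tolist()

# Only these users would need a similarity computation against user 1.
neighbours = candidate_neighbours(clusters, target_user=1)
print(neighbours)
```

Similarity scores then only need to be computed against this shorter list instead of against every user.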

---
### Import dataset:

In [48]:
user_ratings_df = pd.read_csv('data/user_ratings.csv')
kmeans_df = user_ratings_df[['user_id', 'movie_id', 'rating']].sort_values(by='movie_id')
kmeans_df

| | user_id | movie_id | rating |
|---|---|---|---|
| 38088 | 43471 | 8 | 5.0 |
| 399426 | 71498 | 10 | 10.0 |
| 386069 | 70439 | 12 | 10.0 |
| 891143 | 38130 | 25 | 8.0 |
| 189843 | 55201 | 91 | 7.0 |
| ... | ... | ... | ... |
| 125167 | 50282 | 15711402 | 6.0 |
| 477535 | 5608 | 15831978 | 7.0 |
| 66754 | 45682 | 15839820 | 7.0 |
| 409376 | 279 | 15842076 | 10.0 |


---
### Prepare the data:
*Note:* For K-means to work properly, every movie must be a column and every user a row. However, the full dataframe is too big to pivot, which leads to int32 overflow errors. We therefore cut the dataframe into thirds so that it would run on our machine, and only produced output for the bottom third of all movies (the first 300,000 rows when sorted by `movie_id`) to check against the moviegeeks output.

In [45]:
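# Pivoting the full dataframe overflows int32, so that call is left commented out:
# kmeans_df.pivot_table(index='user_id', columns='movie_id', values='rating')
# Take the first 300,000 rows (kmeans_df is sorted by movie_id) and pivot to users x movies.
df1 = kmeans_df.iloc[:300000]
df1_pivot = df1.pivot_table(index='user_id', columns='movie_id', values='rating')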
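One possible way around the size limit, not used in this notebook, is to build the user-by-movie matrix as a SciPy sparse matrix instead of a dense pivot. The sketch below assumes SciPy is installed and uses a toy `ratings` frame with made-up values in place of `kmeans_df`. Note one difference: `csr_matrix` sums duplicate (user, movie) pairs, whereas `pivot_table` averages them by default.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-format ratings with the same columns as kmeans_df (values are made up).
ratings = pd.DataFrame({
    "user_id":  [10, 10, 20, 30, 30],
    "movie_id": [8, 12, 8, 25, 12],
    "rating":   [5.0, 10.0, 7.0, 8.0, 6.0],
})

# Map the raw ids to consecutive row and column positions.
user_codes, user_index = pd.factorize(ratings["user_id"], sort=True)
movie_codes, movie_index = pd.factorize(ratings["movie_id"], sort=True)

# Only observed ratings are stored; missing entries are implicit zeros,
# which matches filling the dense pivot with 0.
user_movie = csr_matrix(
    (ratings["rating"].to_numpy(), (user_codes, movie_codes)),
    shape=(len(user_index), len(movie_index)),
)
print(user_movie.toarray())
```

scikit-learn's `KMeans.fit` accepts sparse input, so a matrix built this way could be passed to it directly. `user_index` maps each row position back to its `user_id`.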

---
### Defining the K-means model with k-means++ initialization

In [46]:
kmeans = KMeans(n_clusters=22, init='k-means++')
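For intuition about `init='k-means++'`: the first centroid is a data point drawn uniformly at random, and each later centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The sketch below is a simplified NumPy version of that seeding idea on made-up data. The `kmeans_pp_init` helper is hypothetical and is not scikit-learn's implementation, which differs in its details.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids from the rows of X using basic k-means++ seeding."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # First centroid: a uniformly random data point.
    centroids = [X[rng.integers(n_samples)]]
    # Squared distance from every point to its nearest chosen centroid.
    closest_sq = np.sum((X - centroids[0]) ** 2, axis=1)
    for _ in range(1, k):
        # Sample the next centroid with probability proportional to that distance.
        probs = closest_sq / closest_sq.sum()
        next_idx = rng.choice(n_samples, p=probs)
        centroids.append(X[next_idx])
        # Update each point's distance to its nearest centroid.
        new_sq = np.sum((X - X[next_idx]) ** 2, axis=1)
        closest_sq = np.minimum(closest_sq, new_sq)
    return np.array(centroids)

# Toy data: three well-separated pairs of 2-D points (made-up values).
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0],
              [10.1, 10.0], [20.0, 0.0], [20.0, 0.1]])
init_centroids = kmeans_pp_init(X, k=3)
print(init_centroids)
```

Because distant points are more likely to be picked, the starting centroids tend to be spread across the data rather than bunched together.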

---
### Fitting the K-means model on the data:

In [51]:
# Fill NaNs (movies a user has not rated) with zeros:
df1_pivot.fillna(0, inplace=True)
kmeans.fit(df1_pivot)

KMeans(n_clusters=22)

---
### Predict and show clusters:

In [42]:
pred = kmeans.predict(df1_pivot)

In [44]:
# Copy the pivot so that adding the cluster column does not alter the features in df1_pivot.
frame = df1_pivot.copy()
frame['cluster'] = pred
frame['cluster'].value_counts()

1     27770
9      1827
0      1577
15     1476
2       972
4       925
16      816
14      750
3       667
19      631
5       271
11      270
12       67
21       55
13        3
17        1
18        1
8         1
10        1
7         1
20        1
6         1
Name: cluster, dtype: int64
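The cluster sizes above are heavily skewed. The sketch below quantifies that skew using the counts copied from the `value_counts()` output; the variable names are illustrative only.

```python
import pandas as pd

# Cluster sizes copied from the value_counts() output above.
cluster_sizes = pd.Series({
    1: 27770, 9: 1827, 0: 1577, 15: 1476, 2: 972, 4: 925, 16: 816, 14: 750,
    3: 667, 19: 631, 5: 271, 11: 270, 12: 67, 21: 55, 13: 3,
    17: 1, 18: 1, 8: 1, 10: 1, 7: 1, 20: 1, 6: 1,
})

total_users = int(cluster_sizes.sum())
largest_share = cluster_sizes.max() / total_users
singleton_clusters = int((cluster_sizes == 1).sum())

print(f"{total_users} users, largest cluster holds {largest_share:.1%}, "
      f"{singleton_clusters} clusters contain a single user")
```

With roughly 73% of users in one cluster and seven clusters holding a single user each, restricting neighbours to the same cluster would likely give only a modest speed-up for most users on this slice of the data.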

---
See the following for a quick refresher on K-means implementation: https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/