# ANN Recommendation system

For this exercise we use a well-known dataset and take advantage of its roughly 20 million rows. The idea is to use the MovieLens dataset to build a real recommendation system.

We do not use everything in the zip we download; we limit ourselves to the file "ratings.csv".
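
Since only `ratings.csv` is needed, one option is to read that single member straight out of the archive instead of extracting everything. A minimal sketch, assuming the archive stores the file as `ml-20m/ratings.csv` (inferred from the path used later in this notebook); the helper name `read_ratings_from_zip` is hypothetical:

```python
import zipfile

import pandas as pd


def read_ratings_from_zip(zip_path, member="ml-20m/ratings.csv"):
    """Load one CSV member of a zip archive into a DataFrame.

    `member` is an assumption about the archive layout; adjust it if the
    file lives under a different path inside the zip.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as csv_file:
            return pd.read_csv(csv_file)
```

It could then be called as `df = read_ratings_from_zip("data/ml-20m.zip")` once the download has finished.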

In [29]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Embedding, Flatten, Concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.utils import shuffle
import os

In [11]:
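if not os.path.isfile("data/ml-20m.zip"):
    !curl -o "data/ml-20m.zip" "http://files.grouplens.org/datasets/movielens/ml-20m.zip"
if not os.path.isdir("data/ml-20m"):  # extract so that data/ml-20m/ratings.csv exists
    !unzip -q "data/ml-20m.zip" -d "data"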

In [13]:
df = pd.read_csv("data/ml-20m/ratings.csv")

In [14]:
df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580
...,...,...,...,...
20000258,138493,68954,4.5,1258126920
20000259,138493,69526,4.5,1259865108
20000260,138493,69644,3.0,1260209457
20000261,138493,70286,5.0,1258126944


#### Note

In this dataset, userId and movieId already look like integers that could index an embedding matrix, but we cannot rely on appearance alone. We want the indices to run consecutively from 0 to N-1, where N is the number of distinct ids, so that the embedding table needs exactly N rows. For example, we want movie ids 0, 1, 2, 3 and not 0, 12, 56, 1234. The same holds for userId.

Let's check whether that is the case.

The output below shows that the userId values are already consecutive (138,493 distinct values with a maximum of 138493, though they start at 1 rather than 0), while there are far fewer distinct movieId values (26,744) than the largest ids suggest, so the movie ids have gaps.
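
To make the requirement concrete, here is a small sketch on made-up ids (not taken from the dataset). The helper `is_contiguous` is hypothetical, and `pd.factorize` is shown as one possible alternative to the `pd.Categorical` codes used in this notebook; note that `factorize` numbers ids in order of first appearance unless `sort=True` is passed.

```python
import numpy as np
import pandas as pd


def is_contiguous(ids):
    """Return True if the distinct ids are exactly 0, 1, ..., n-1."""
    unique_ids = np.unique(ids)  # sorted distinct values
    return bool(np.array_equal(unique_ids, np.arange(len(unique_ids))))


raw_movie_ids = np.array([12, 0, 1234, 56, 12, 0])  # toy ids with gaps
codes, uniques = pd.factorize(raw_movie_ids, sort=True)

print(is_contiguous(raw_movie_ids))  # False
print(codes.tolist())                # [1, 0, 3, 2, 1, 0]
print(is_contiguous(codes))          # True
```

With the raw ids, an embedding table indexed directly would need `max(id) + 1 = 1235` rows for only 4 distinct movies; after remapping it needs 4.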

In [18]:
df["userId"].unique(), len(df["userId"].unique())

(array([     1,      2,      3, ..., 138491, 138492, 138493], dtype=int64),
 138493)

In [19]:
df["movieId"].unique(), len(df["movieId"].unique())

(array([     2,     29,     32, ..., 121021, 110167, 110510], dtype=int64),
 26744)

In [23]:
df.userId = pd.Categorical(df.userId)
df["newUserId"] = df.userId.cat.codes
df.newUserId

0                0
1                0
2                0
3                0
4                0
             ...  
20000258    138492
20000259    138492
20000260    138492
20000261    138492
20000262    138492
Name: newUserId, Length: 20000263, dtype: int32

In [24]:
df.movieId = pd.Categorical(df.movieId)
df["newMoovieId"] = df.movieId.cat.codes
df.newMoovieId

0               1
1              28
2              31
3              46
4              49
            ...  
20000258    13754
20000259    13862
20000260    13875
20000261    13993
20000262    14277
Name: newMoovieId, Length: 20000263, dtype: int16

In [26]:
user_ids = df.newUserId.values
moovie_ids = df.newMoovieId.values
ratings = df.rating.values

In [28]:
N = len(np.unique(user_ids))    # number of distinct users
M = len(np.unique(moovie_ids))  # number of distinct movies

D = 10 # Embedding dimension

In [35]:
user_ids, moovie_ids, ratings = shuffle(user_ids, moovie_ids, ratings)
n_train = int(len(ratings)*0.8)

user_train = user_ids[:n_train]
moovie_train = moovie_ids[:n_train]
ratings_train = ratings[:n_train]

user_test = user_ids[n_train:]
moovie_test = moovie_ids[n_train:]
ratings_test = ratings[n_train:]

# Centering ratings
avg_rating = np.mean(ratings_train)
ratings_train = ratings_train - avg_rating
ratings_test = ratings_test - avg_rating
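
Because the model is trained on centered ratings, its outputs are offsets from the training mean; adding `avg_rating` back gives a value on the original scale. A minimal sketch with made-up numbers; the `np.clip` to the 0.5 to 5.0 range is an assumption about the MovieLens rating scale and is not something this notebook does:

```python
import numpy as np

toy_train = np.array([3.5, 4.0, 2.0, 5.0, 3.0])  # toy training ratings
toy_avg = toy_train.mean()                        # 3.5
toy_centered = toy_train - toy_avg                # what the model is fit on

centered_pred = np.array([0.75, -2.0, 2.5])       # hypothetical model outputs
pred = np.clip(centered_pred + toy_avg, 0.5, 5.0) # back on the rating scale
print(pred.tolist())  # [4.25, 1.5, 5.0]
```

Centering uses the training mean only, so no information from the test split leaks into the preprocessing.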

In [38]:
u = Input(shape=(1,))
m = Input(shape=(1,))
u_emb = Embedding(N, D)(u) # Shape: (n_samples, 1, D)
m_emb = Embedding(M, D)(m)
u_emb = Flatten()(u_emb) # Shape is now (n_samples, D)
m_emb = Flatten()(m_emb)
i = Concatenate()([u_emb, m_emb])
o = Dense(1024, activation="relu")(i)
o = Dense(1, activation="linear")(o)

model = Model(inputs=[u, m], outputs=o)
model.compile(loss="mse", optimizer="adam")
model.summary()
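
The shapes in the model cell can be checked without TensorFlow. The following NumPy sketch mimics the embedding lookup, flatten, and concatenate steps; the sizes are toy values and the tables are random stand-ins (all names here are hypothetical), not the trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, emb_dim = 5, 7, 10            # toy sizes
user_table = rng.normal(size=(n_users, emb_dim))   # stand-in for Embedding(N, D)
movie_table = rng.normal(size=(n_movies, emb_dim)) # stand-in for Embedding(M, D)

u_idx = np.array([[0], [3], [4]])  # batch of 3 user indices, shape (3, 1)
m_idx = np.array([[6], [2], [2]])  # batch of 3 movie indices, shape (3, 1)

u_vec = user_table[u_idx]                  # lookup: shape (3, 1, emb_dim)
m_vec = movie_table[m_idx]
u_flat = u_vec.reshape(len(u_idx), -1)     # flatten: shape (3, emb_dim)
m_flat = m_vec.reshape(len(m_idx), -1)
x = np.concatenate([u_flat, m_flat], axis=1)  # shape (3, 2 * emb_dim)

print(u_vec.shape, u_flat.shape, x.shape)  # (3, 1, 10) (3, 10) (3, 20)
```

Each index simply selects one row of its table, which is why the indices must stay within `0 ... n_users - 1` and `0 ... n_movies - 1`.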

In [39]:
r = model.fit(
    [user_train, moovie_train],
    ratings_train,
    validation_data=([user_test, moovie_test], ratings_test),
    epochs=25,
    batch_size=1024,
    # verbose=2 prints one line per epoch instead of a progress bar, which can make training slightly faster
)

Epoch 1/25
15626/15626 ━━━━━━━━━━━━━━━━━━━━ 219s 14ms/step - loss: 0.7762 - val_loss: 0.6904
Epoch 2/25
15626/15626 ━━━━━━━━━━━━━━━━━━━━ 220s 14ms/step - loss: 0.6694 - val_loss: 0.6632
Epoch 3/25
15626/15626 ━━━━━━━━━━━━━━━━━━━━ 220s 14ms/step - loss: 0.6286 - val_loss: 0.6471
Epoch 4/25
15626/15626 ━━━━━━━━━━━━━━━━━━━━ 235s 15ms/step - loss: 0.6013 - val_loss: 0.6371
Epoch 5/25
15626/15626 ━━━━━━━━━━━━━━━━━━━━ 245s 16ms/step - loss: 0.5818 - val_loss: 0.6331
Epoch 6/25
15626/15626 ━━━━━━━━━━━━━━━━━━━━ 241s 15ms/step - loss: 0.5692 - val_loss: 0.6298
Epoch 7/25
15626/15626 ━━━━━━━━━━━━━━━━━━━━ 239s 15ms/step - loss: 0.5599 - val_loss: 0.6285
Epoch 8/25
15626/15626 ━━━━━━━━━━━━━━━━━━━━ 238s 15ms/step - loss: 0.5533 - v

In [None]:
# Dataset Benchmark: https://datascience.stackexchange.com/questions/29740/benchmark-result-for-movielens-dataset
# MSE: 0.6, RMSE: 0.8
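
Since the model is compiled with `loss="mse"`, the logged values can be compared with an RMSE figure by taking a square root. A small sketch using the benchmark MSE quoted in the comment and the validation loss logged at epoch 7:

```python
import math

benchmark_mse = 0.6   # from the benchmark comment
val_mse = 0.6285      # val_loss logged at epoch 7

print(round(math.sqrt(benchmark_mse), 3))  # 0.775
print(round(math.sqrt(val_mse), 3))        # 0.793
```

Subtracting the same mean from targets and predictions does not change the squared error, so the centered losses are directly comparable with a benchmark on the raw rating scale.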