# Datasets:

* "**Movies**" and "**Ratings**" datasets
* Timestamps are converted to datetimes
* **Ratings** contains no data before 1995 or after 2018
* The datasets hold about 27.8 million ratings and 58 thousand movies

## Ratings:

In [1]:
import pandas as pd
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.linalg import spsolve
import implicit

ratings = pd.read_csv("datasets/ratings.csv")
print(ratings.shape)
ratings.head()

(27753444, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,307,3.5,1256677221
1,1,481,3.5,1256677456
2,1,1091,1.5,1256677471
3,1,1257,4.5,1256677460
4,1,1449,4.5,1256677264


##### Convert timestamps to datetime:
* Convert the "timestamp" column of the ratings dataset
* Possibly use it as the index

In [2]:
ratings["date"] = pd.to_datetime(ratings.timestamp, unit = "s")
ratings = ratings.drop(columns = "timestamp")
ratings.head()

# If the date is needed as the index:
# ratings.index = pd.DatetimeIndex(ratings.date)

Unnamed: 0,userId,movieId,rating,date
0,1,307,3.5,2009-10-27 21:00:21
1,1,481,3.5,2009-10-27 21:04:16
2,1,1091,1.5,2009-10-27 21:04:31
3,1,1257,4.5,2009-10-27 21:04:20
4,1,1449,4.5,2009-10-27 21:01:04


**There are no ratings before 1995 or after 2018**

In [3]:
print(ratings.loc[ratings.date.dt.year < 1995])
print(ratings.loc[ratings.date.dt.year > 2018])

Empty DataFrame
Columns: [userId, movieId, rating, date]
Index: []
Empty DataFrame
Columns: [userId, movieId, rating, date]
Index: []
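
The same check can be done by looking at the minimum and maximum of the date column. The sketch below assumes a small hypothetical frame `demo`, built from the first timestamps shown above, rather than the full dataset.

```python
import pandas as pd

# Hypothetical mini-frame; the timestamps are the first rows of the ratings head
demo = pd.DataFrame({"timestamp": [1256677221, 1256677456, 1256677471]})
demo["date"] = pd.to_datetime(demo.timestamp, unit="s")

# The range of years present is given by the min and max of the column
first_year = demo.date.min().year
last_year = demo.date.max().year
print(first_year, last_year)

# Equivalent to the two empty-frame checks: no row falls outside [1995, 2018]
outside = demo[(demo.date.dt.year < 1995) | (demo.date.dt.year > 2018)]
print(outside.empty)
```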


## Movies:

In [4]:
movies = pd.read_csv("datasets/movies.csv")
print(movies.shape)
movies.head()

(58098, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
# Extract the release year from the title (raw string so the backslashes reach the regex engine)
movies["year"] = movies.title.str.extract(r"\((\d{4})\)")
movies.title = movies.title.str[:-7]
movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
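
Slicing with `str[:-7]` assumes that every title ends in exactly `" (YYYY)"`. A slightly more defensive variant, sketched below on a hypothetical frame `demo`, removes only the matched suffix, so titles with trailing whitespace or without a year are not truncated. The pattern name `year_pattern` is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical titles: one regular, one with trailing whitespace, one without a year
demo = pd.DataFrame({"title": ["Toy Story (1995)", "Jumanji (1995) ", "Some Untitled Film"]})

year_pattern = r"\s*\((\d{4})\)\s*$"
# expand=False returns a Series instead of a one-column DataFrame
demo["year"] = demo.title.str.extract(year_pattern, expand=False)
# Remove only the matched suffix, so titles without a year are left intact
demo["title"] = demo.title.str.replace(year_pattern, "", regex=True).str.strip()
print(demo)
```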


In [6]:
movies = movies.dropna()
movies.shape

(57771, 4)

# Sampling the dataset:
* Ratings are joined with movies
* The movie's year is used as the sampling weight (older movies have fewer ratings and matter less for recommendation)
* **BIAS:** Using the year as the sampling weight biases the sample toward more recent movies.
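
To get a feel for the size of this bias, recall that `DataFrame.sample` normalizes the weights so that they sum to 1, which makes each row's probability proportional to its year. The sketch below uses a handful of hypothetical years and, for comparison, a hypothetical steeper weight (years elapsed since 1900) that is not used in this notebook.

```python
import pandas as pd

# Hypothetical release years spanning the range seen in the data
years = pd.Series([1941, 1985, 1995, 2018], name="year")

# Weights are normalized to sum to 1, so probability is proportional to the year
prob_year = years / years.sum()
print(prob_year.round(4).tolist())

# Relative preference of the newest movie over the oldest one
ratio_year = prob_year.iloc[-1] / prob_year.iloc[0]
print(round(ratio_year, 3))

# Hypothetical steeper alternative: weight by years elapsed since a reference year
recency = years - 1900
prob_recency = recency / recency.sum()
ratio_recency = prob_recency.iloc[-1] / prob_recency.iloc[0]
print(round(ratio_recency, 3))
```

With the raw year as weight, the newest movie in this example is only a few percent more likely to be drawn than the oldest, so the bias is present but mild.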


#### Number of unique users and movies:

In [7]:
movie_ratings = ratings.join(movies.set_index("movieId"), on = "movieId")
movie_ratings.drop(columns = ["genres"], inplace = True)
print(movie_ratings.shape)
print(movie_ratings.userId.unique().size)
print(movie_ratings.title.unique().size)
movie_ratings.head()

(27753444, 6)
283228
50591


Unnamed: 0,userId,movieId,rating,date,title,year
0,1,307,3.5,2009-10-27 21:00:21,Three Colors: Blue (Trois couleurs: Bleu),1993
1,1,481,3.5,2009-10-27 21:04:16,Kalifornia,1993
2,1,1091,1.5,2009-10-27 21:04:31,Weekend at Bernie's,1989
3,1,1257,4.5,2009-10-27 21:04:20,Better Off Dead...,1985
4,1,1449,4.5,2009-10-27 21:01:04,Waiting for Guffman,1996


In [8]:
def get_count_movie_ratings(df, ratings_threshold):
    # Number of ratings per movie, keeping only movies above the threshold
    movie_ratings = df.groupby("movieId")["rating"].count()
    return movie_ratings[movie_ratings > ratings_threshold]

In [9]:
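def ratings_from_movie(df, ratings_threshold):
    # Keep only the rows whose movie has more than ratings_threshold ratings
    movie_ratings = get_count_movie_ratings(df, ratings_threshold)
    # isin is vectorized, unlike apply with a Python lambda
    return df[df.movieId.isin(movie_ratings.index)]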

In [10]:
movie_ratings = ratings_from_movie(movie_ratings, 2000)
print("user count:", movie_ratings.userId.unique().shape)
print("movie count:", movie_ratings.movieId.unique().shape)

user count: (282438,)
movie count: (2564,)


### get_count_users_ratings:
Returns every user in a dataframe who has made more than a given number of ratings (*ratings_threshold*)
* *df* -> dataframe with userId and ratings
* *ratings_threshold* -> number of ratings above which a user is considered relevant

In [11]:
def get_count_users_ratings(df, ratings_threshold):
    user_ratings = df.groupby("userId")["rating"].count()
    return user_ratings[user_ratings > ratings_threshold]

In [12]:
# Usage example
user_ratings = get_count_users_ratings(movie_ratings, 400)
print(user_ratings.shape)
user_ratings.head()

(10585,)


userId
4      659
56     636
81     834
100    477
134    842
Name: rating, dtype: int64

### ratings_from_users:
Keeps only the ratings of users who have more than a given number of ratings in a dataframe, using the function **get_count_users_ratings**.

In [13]:
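def ratings_from_users(df, ratings_threshold):
    # Keep only the rows whose user has more than ratings_threshold ratings
    user_ratings = get_count_users_ratings(df, ratings_threshold)
    # isin is vectorized, unlike apply with a Python lambda
    return df[df.userId.isin(user_ratings.index)]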

In [14]:
# Example
movie_ratings = ratings_from_users(movie_ratings, 1000)
print(movie_ratings.shape)
print("movie count:", movie_ratings.movieId.unique().shape)
print("user count:", movie_ratings.userId.unique().shape)
movie_ratings.head()

(1132119, 6)
movie count: (2564,)
user count: (910,)


Unnamed: 0,userId,movieId,rating,date,title,year
22715,235,1,5.0,2001-12-30 12:50:08,Toy Story,1995
22716,235,2,3.0,2001-12-30 12:40:16,Jumanji,1995
22717,235,6,4.0,2002-01-09 14:07:39,Heat,1995
22718,235,10,5.0,2002-01-04 20:11:16,GoldenEye,1995
22719,235,11,4.0,2001-12-30 12:43:01,"American President, The",1995


Sample of 200K ratings, giving more weight to more recent years

In [15]:
sampled_ratings = pd.DataFrame(movie_ratings)
sampled_ratings = sampled_ratings.sample(n = 200000, replace = False, weights = "year", random_state = 1)
print(sampled_ratings.shape)
sampled_ratings.head()

(200000, 6)


Unnamed: 0,userId,movieId,rating,date,title,year
11486193,117790,480,5.0,2015-06-21 08:57:35,Jurassic Park,1993
20239568,206360,913,3.5,2006-01-02 04:21:21,"Maltese Falcon, The",1941
22857,235,593,5.0,2001-11-21 18:58:52,"Silence of the Lambs, The",1991
8444100,87000,799,3.5,2005-10-23 19:06:58,"Frighteners, The",1996
4157730,42704,519,3.0,2005-02-09 07:58:48,RoboCop 3,1993


Check whether a pivot_table can be built from the users and movies in the sample. The number of cells (users times movies) must not exceed the maximum value of an **int32**.

In [16]:
# there are no missing values in the dataframe
print(sampled_ratings.dropna().shape)
print("user count:",sampled_ratings.userId.unique().size)
print("movie count:",sampled_ratings.movieId.unique().size)
print(sampled_ratings.userId.unique().size * sampled_ratings.movieId.unique().size)

# Check that the size of the sample (as a pivot table) stays below the int32 maximum (to avoid memory errors)
print(2147483647 > (sampled_ratings.userId.unique().size * sampled_ratings.movieId.unique().size))

(500000, 6)
user count: 910
movie count: 2564
2333240
True
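
Beyond the int32 limit, it can help to estimate how sparse the pivot table will be and how much memory a dense version needs. The sketch below reuses the user and movie counts printed above and assumes the sample size of 200,000 from the sampling cell; the variable names are illustrative.

```python
# Counts printed above; n_ratings is the sample size requested in the sampling cell
n_users, n_movies, n_ratings = 910, 2564, 200_000

cells = n_users * n_movies          # cells in the user x movie pivot table
density = n_ratings / cells         # fraction of cells that hold a rating
dense_mib = cells * 8 / 2**20       # float64 uses 8 bytes per cell

print(cells, round(density, 4), round(dense_mib, 1))
print(cells < 2**31 - 1)            # same int32 check as above
```

Under these assumptions fewer than one cell in ten holds a rating, which is why a sparse representation is usually preferred for the factorization step.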


In [17]:
sampled_ratings.userId = pd.to_numeric(sampled_ratings.userId, downcast = "integer")
sampled_ratings.movieId = pd.to_numeric(sampled_ratings.movieId, downcast = "integer")
sampled_ratings.rating = pd.to_numeric(sampled_ratings.rating, downcast = "float")

# implicit needs both matrices (user x movie and movie x user)
merge_matrix = sampled_ratings.pivot_table(index = "userId", columns = "title", values = "rating")
#user_movie_matrix = sampled_ratings.pivot_table(index = "userId", columns = "title", values = "rating")
#movie_user_matrix = sampled_ratings.pivot_table(index = "title", columns = "userId", values = "rating")

In [18]:
import correlation as corr
corr.ratings_correlation(sampled_ratings, merge_matrix, "Shrek", 3, 10, 0.5).get_correlations()

title
Marley & Me                                       0.684252
Insurgent                                         0.677486
Sandlot, The                                      0.643707
Wings of the Dove, The                            0.638555
Shrek 2                                           0.637273
Houseguest                                        0.628128
Ulee's Gold                                       0.623437
Defending Your Life                               0.610485
Spy                                               0.609055
For Love of the Game                              0.607001
Coco                                              0.606300
Chariots of Fire                                  0.605136
Crucible, The                                     0.603622
Doctor Strange                                    0.594131
Fantastic Beasts and Where to Find Them           0.592157
She's the Man                                     0.591017
Made in America                                   
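
The `correlation` module is not shown here, so the meaning of its arguments is unknown. As a rough illustration of the general idea, the sketch below computes item-item Pearson correlations on a user x title matrix and filters them by a minimum number of common raters and a minimum correlation. The function `similar_titles`, its parameters `min_common` and `min_corr`, and the ratings are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical user x title matrix; NaN means the user did not rate the title
matrix = pd.DataFrame(
    {
        "Shrek":   [5.0, 4.0, 1.0, 2.0, np.nan],
        "Shrek 2": [4.5, 4.0, 1.5, 2.5, 3.0],
        "Heat":    [1.0, 2.0, 5.0, 4.0, np.nan],
        "Coco":    [5.0, np.nan, np.nan, np.nan, 4.0],
    }
)

def similar_titles(matrix, title, min_common=3, min_corr=0.5):
    target = matrix[title]
    # Pearson correlation of every column with the target, over co-rated rows only
    corr = matrix.corrwith(target)
    # Number of users who rated both the target and each other title
    rated = matrix.notna().astype(int)
    common = rated.mul(target.notna().astype(int), axis=0).sum()
    keep = (common >= min_common) & (corr >= min_corr)
    return corr[keep].drop(title, errors="ignore").sort_values(ascending=False)

result = similar_titles(matrix, "Shrek")
print(result)
```

Requiring a minimum overlap matters because a correlation computed from one or two common raters is either undefined or meaningless.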

# Baseline predictors:

In [38]:
import baseline as base
base.make_baseline(sampled_ratings, damping_factor = 25).get_ratings().head()

Unnamed: 0,userId,movieId,rating,date,title,year,bi,bu,bui
0,381,257,5.0,2015-06-21 08:57:35,Jurassic Park,1993,0.395913,0.463309,5.859221
1,658,408,3.5,2006-01-02 04:21:21,"Maltese Falcon, The",1941,0.62893,-0.095349,4.03358
2,0,313,5.0,2001-11-21 18:58:52,"Silence of the Lambs, The",1991,0.916927,0.539949,6.456875
3,278,372,3.5,2005-10-23 19:06:58,"Frighteners, The",1996,0.018827,-0.081526,3.437302
4,138,278,3.0,2005-02-09 07:58:48,RoboCop 3,1993,-1.240506,0.444984,2.204478
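
The `baseline` module is not shown, so the following is only a sketch of the usual damped-mean baseline, not necessarily what `make_baseline` does. In the rows printed above, `bui` equals `rating + bi + bu`; the sketch uses the global mean in place of the observed rating, which is the more common definition. The function `damped_baseline` and the ratings are hypothetical.

```python
import pandas as pd

# Hypothetical ratings
demo = pd.DataFrame({
    "userId":  [0, 0, 1, 1, 2],
    "movieId": [0, 1, 0, 2, 1],
    "rating":  [5.0, 3.0, 4.0, 2.0, 1.0],
})

def damped_baseline(df, damping_factor=25):
    out = df.copy()
    mu = out.rating.mean()  # global mean rating
    # Item bias: damped mean deviation from the global mean
    item_dev = (out.rating - mu).groupby(out.movieId)
    bi = item_dev.sum() / (item_dev.count() + damping_factor)
    out["bi"] = out.movieId.map(bi)
    # User bias: damped mean of what is left after removing mu and the item bias
    user_dev = (out.rating - mu - out.bi).groupby(out.userId)
    bu = user_dev.sum() / (user_dev.count() + damping_factor)
    out["bu"] = out.userId.map(bu)
    # Baseline prediction for each (user, item) pair
    out["bui"] = mu + out.bi + out.bu
    return out

scored = damped_baseline(demo, damping_factor=1)
print(scored)
```

The damping factor shrinks the biases of users and movies with few ratings toward zero, so a single extreme rating does not dominate the estimate.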


# Alternating Least Squares (ALS)
## Tensor
This recommender system is defined by (U, I, A), where U is the set of users, of size *n*, I is the set of items (movies), of size *p*, and A is the set of ratings given by users to items, of size *q*.

* $r_{uia}$ represents the feedback of user $u$ on item $i$ through rating $a$.
* All the feedback together forms a *tensor* (multidimensional array) **R** of dimension $n \times p \times q$
* Not every entry of **R** is observed, so the goal of the recommender system is to fill in all the values of this tensor
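
As a small illustration of the definition above, the following sketch builds such a tensor from a few hypothetical feedback triples. The sizes and the triples are made up, and entries are set to 1 where feedback was observed.

```python
import numpy as np

# Hypothetical feedback triples (user u, item i, rating level a), all zero-based
triples = [(0, 0, 4), (0, 2, 2), (1, 1, 0), (2, 2, 3)]
n, p, q = 3, 3, 5  # users, items, rating levels

R = np.zeros((n, p, q))
for u, i, a in triples:
    R[u, i, a] = 1.0  # r_uia = 1 where the feedback was observed

observed = int(R.sum())
# Fraction of (user, item) pairs that have any feedback
coverage = observed / (n * p)
print(R.shape, observed, round(coverage, 3))
```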


### Tensor factorization:
* CANDECOMP-PARAFAC (CP)
* Tucker decompositions

These problems are computationally very hard, so the approach taken here is:

## Higher Order Alternating Least Squares (HOALS):
* Uses Tucker decomposition
* Tensor of order n = 2
* Each feature must be treated independently
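
For a tensor of order 2 the problem reduces to factorizing a user x item matrix. The sketch below is a dense, unoptimized NumPy version of implicit-feedback ALS, assuming the common formulation with binary preference and confidence `1 + alpha * r`. It is meant to show the alternating structure only and is not necessarily what `implicit` or the `als_recommender` class do internally; the function `implicit_als` and the ratings are hypothetical.

```python
import numpy as np

def implicit_als(R, factors=2, regularization=0.1, alpha=40, iterations=10, seed=0):
    """R is a dense user x item array; 0 means no feedback."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = (R > 0).astype(float)   # binary preference
    C = 1.0 + alpha * R         # confidence in each preference
    X = rng.normal(scale=0.01, size=(n_users, factors))  # user factors
    Y = rng.normal(scale=0.01, size=(n_items, factors))  # item factors
    reg = regularization * np.eye(factors)

    def solve_side(fixed, C_side, P_side):
        # One regularized, confidence-weighted least-squares problem per row
        out = np.empty((C_side.shape[0], factors))
        for k in range(C_side.shape[0]):
            Cw = np.diag(C_side[k])
            A = fixed.T @ Cw @ fixed + reg
            b = fixed.T @ Cw @ P_side[k]
            out[k] = np.linalg.solve(A, b)
        return out

    for _ in range(iterations):
        X = solve_side(Y, C, P)        # fix item factors, solve for users
        Y = solve_side(X, C.T, P.T)    # fix user factors, solve for items
    return X, Y

# Hypothetical 4 users x 5 items ratings matrix
R = np.array([
    [5, 4, 0, 0, 0],
    [4, 5, 0, 0, 1],
    [0, 0, 5, 4, 0],
    [0, 0, 4, 5, 0],
], dtype=float)
X, Y = implicit_als(R)
scores = X @ Y.T
print(scores.round(2))
```

Each half-step is an ordinary least-squares solve because one set of factors is held fixed, which is what makes the alternating scheme tractable.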

In [19]:
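alpha = 40
# movie_user_matrix has to exist before this cell runs (its pivot_table line in In [17] is commented out)
confidence = (movie_user_matrix * alpha).astype("double")

# ALS model with 10 latent factors, lambda = 0.1 and 10 alternating iterations
als_model = implicit.als.AlternatingLeastSquares(factors = 10, regularization = 0.1, iterations = 10)
# %time has to prefix the statement it measures
%time als_model.fit(confidence)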

NameError: name 'movie_user_matrix' is not defined

In [36]:
#Using the implicit built in user recommender function
recommended = als_model.recommend(10, user_movie_matrix)

for rec in recommended:
    i, score = rec
    print("Title:", sampled_ratings.loc[sampled_ratings.movieId == i].title.iloc[0], "\nScore:", score)

Title: You Can Count on Me 
Score: 1.0374522
Title: Living in Oblivion 
Score: 1.0325506
Title: American Movie 
Score: 1.0321634
Title: Everyone Says I Love You 
Score: 1.0266441
Title: From Here to Eternity 
Score: 1.0256691
Title: Charade 
Score: 1.0253102
Title: Meet Me in St. Louis 
Score: 1.024713
Title: His Girl Friday 
Score: 1.0230129
Title: Purple Rose of Cairo, The 
Score: 1.0183117
Title: 8 1/2 (8½) 
Score: 1.014023


In [17]:
#Fitting with class
import als_recommender as als
model = als.ALSRecommender(iterations = 10, latent = 10, alpha = 40, regularizer = 0.1)
model.fit(sampled_ratings)





In [18]:
#Find movies similar to a certain movie with class
model.similar_to_movie(movie_id = 0, n_similar = 10)

Unnamed: 0,movie_title,similarity
0,Toy Story,0.327326
1,"Goonies, The",0.32616
2,Spaceballs,0.325734
3,Airplane!,0.325431
4,A.I. Artificial Intelligence,0.325042
5,Predator,0.324934
6,Home Alone,0.324923
7,"Lord of the Rings: The Two Towers, The",0.324891
8,Austin Powers: International Man of Mystery,0.324873
9,Indiana Jones and the Temple of Doom,0.324666
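
How `similar_to_movie` scores items is not shown. A common choice is the cosine similarity between item-factor vectors, sketched below on a hypothetical factor matrix. With cosine similarity an item's similarity to itself is 1, whereas Toy Story scores 0.327 above, so the class presumably uses a different scaling. The function `most_similar`, the factors and the placeholder titles are assumptions of this sketch.

```python
import numpy as np

# Hypothetical item-factor matrix (one row per movie), e.g. as learned by ALS
item_factors = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.7],
])
titles = ["A", "B", "C", "D"]  # placeholder titles

def most_similar(item_factors, item_index, n_similar=2):
    # Cosine similarity between the chosen item and every item
    norms = np.linalg.norm(item_factors, axis=1)
    cosine = item_factors @ item_factors[item_index] / (norms * norms[item_index])
    # Indices of the highest similarities, best first
    order = np.argsort(-cosine)[:n_similar]
    return [(titles[j], float(cosine[j])) for j in order]

top = most_similar(item_factors, 0)
print(top)
```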


In [29]:
model.recommend_to_user(user_id = 10, n_movies = 10)

Unnamed: 0,movie_title,score
0,Manhattan Murder Mystery,1.033533
1,His Girl Friday,1.026646
2,Charade,1.025606
3,From Here to Eternity,1.024198
4,"Purple Rose of Cairo, The",1.020313
5,Everyone Says I Love You,1.017498
6,Mighty Aphrodite,1.016955
7,Spellbound,1.016857
8,M,1.015822
9,"Streetcar Named Desire, A",1.015375
