# Datasets:

* The "**Movies**" and "**Ratings**" datasets
* Conversion of timestamps to datetimes
* The **Ratings** dataset has no data before 1995 or after 2018
* The datasets contain about 27 million ratings and 58 thousand movies

## Ratings:

In [1]:
import pandas as pd
import numpy as np
ratings = pd.read_csv("datasets/ratings.csv")
print(ratings.shape)
ratings.head()

(27753444, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,307,3.5,1256677221
1,1,481,3.5,1256677456
2,1,1091,1.5,1256677471
3,1,1257,4.5,1256677460
4,1,1449,4.5,1256677264


##### Convert timestamps to datetime:
* Convert the "timestamp" column of the ratings dataset
* Possibly use it as the index

In [2]:
ratings["date"] = pd.to_datetime(ratings.timestamp, unit = "s")
ratings = ratings.drop(columns = "timestamp")
ratings.head()

# If the date is needed as the index
#ratings.index = pd.DatetimeIndex(ratings.date)

Unnamed: 0,userId,movieId,rating,date
0,1,307,3.5,2009-10-27 21:00:21
1,1,481,3.5,2009-10-27 21:04:16
2,1,1091,1.5,2009-10-27 21:04:31
3,1,1257,4.5,2009-10-27 21:04:20
4,1,1449,4.5,2009-10-27 21:01:04
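As a minimal sketch of the "date as index" idea, the toy frame below reuses the first two timestamps from the output above. The names `toy` and `toy_by_date` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy frame built from the first two timestamps shown in the ratings output
toy = pd.DataFrame({"rating": [3.5, 3.5], "timestamp": [1256677221, 1256677456]})

# Unix seconds -> datetime, then use the date as the index
toy["date"] = pd.to_datetime(toy["timestamp"], unit="s")
toy_by_date = toy.drop(columns="timestamp").set_index("date")

# With a DatetimeIndex, rows can be selected by a partial date string
print(toy_by_date.loc["2009"])
```

A datetime index mostly pays off for time-based selection and resampling; for the joins further down, a plain column is enough.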


**There are no ratings before 1995 or after 2018**

In [3]:
print(ratings.loc[ratings.date.dt.year < 1995])
print(ratings.loc[ratings.date.dt.year > 2018])

Empty DataFrame
Columns: [userId, movieId, rating, date]
Index: []
Empty DataFrame
Columns: [userId, movieId, rating, date]
Index: []
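The same range check can be written as a single boolean test. The sketch below uses made-up dates in a hypothetical `toy` frame rather than the real ratings.

```python
import pandas as pd

# Made-up dates standing in for the ratings "date" column
toy = pd.DataFrame({"date": pd.to_datetime(["1996-03-01", "2009-10-27", "2018-09-26"])})

# True for every row whose year lies in the inclusive range 1995..2018
in_range = toy["date"].dt.year.between(1995, 2018)
print(in_range.all())

# The extremes give the same information at a glance
print(toy["date"].min(), toy["date"].max())
```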


## Movies:

In [4]:
movies = pd.read_csv("datasets/movies.csv")
print(movies.shape)
movies.head()

(58098, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
# Extract the year
movies["year"] = movies.title.str.extract(r"\((\d{4})\)")
# Drop the trailing " (YYYY)" suffix; assumes every title ends with it
movies.title = movies.title.str[:-7]
movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
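Slicing off the last seven characters also shortens titles that carry no year. A possible alternative, sketched on a hypothetical `toy` frame, removes the suffix only when it is present. Anchoring the pattern to the end of the title is an assumption of this sketch, not something the original cell does.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["Toy Story (1995)", "Movie Without Year"]})

# A four-digit year in parentheses at the very end of the title
year_suffix = r"\s*\((\d{4})\)\s*$"

# expand=False returns a Series; titles without a match get NaN
toy["year"] = toy["title"].str.extract(year_suffix, expand=False)

# Remove the suffix only where it matches, leaving other titles untouched
toy["title"] = toy["title"].str.replace(year_suffix, "", regex=True)
print(toy)
```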


In [104]:
movies = movies.dropna()
movies.shape

(57447, 5)

# Sampling the dataset:
* Join the ratings with the movies
* Use the movie's year as the sampling weight (older movies have fewer ratings and matter less for recommendation)
* **BIAS:** Using the year as the sampling weight introduces a bias in favour of more recent movies.
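To get a feel for how strong this bias is, recall that `DataFrame.sample` normalises the weights so they sum to 1. The sketch below applies that normalisation by hand to three illustrative years (the values are chosen for the example, not taken from the data).

```python
import pandas as pd

# Three illustrative release years
years = pd.Series([1950, 1985, 2018], name="year")

# Selection probability of each row when the raw year is the weight
probs = years / years.sum()
print(probs.round(4).tolist())

# Relative advantage of the newest over the oldest movie
ratio = years.max() / years.min()
print(round(ratio, 3))
```

Because the weights are proportional to the raw year, a 2018 movie is only about 3.5% more likely to be drawn than a 1950 one, so the bias introduced this way is mild.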


#### Number of unique users and movies:

In [7]:
movie_ratings = ratings.join(movies.set_index("movieId"), on = "movieId")
movie_ratings.drop(columns = ["genres"], inplace = True)
print(movie_ratings.shape)
print(movie_ratings.userId.unique().size)
print(movie_ratings.title.unique().size)
movie_ratings.head()

(27753444, 6)
283228
50591


Unnamed: 0,userId,movieId,rating,date,title,year
0,1,307,3.5,2009-10-27 21:00:21,Three Colors: Blue (Trois couleurs: Bleu),1993
1,1,481,3.5,2009-10-27 21:04:16,Kalifornia,1993
2,1,1091,1.5,2009-10-27 21:04:31,Weekend at Bernie's,1989
3,1,1257,4.5,2009-10-27 21:04:20,Better Off Dead...,1985
4,1,1449,4.5,2009-10-27 21:01:04,Waiting for Guffman,1996


### get_count_users_ratings:
Extracts all users in a dataframe who made more than a given number of ratings (*ratings_threshold*)
* *df* -> dataframe with userId and ratings
* *ratings_threshold* -> number of ratings above which a user is considered relevant

In [107]:
def get_count_users_ratings(df, ratings_threshold):
    # Number of ratings per user
    user_ratings = df.groupby("userId")["rating"].count()
    return user_ratings[user_ratings > ratings_threshold]

In [108]:
# Usage example
user_ratings = get_count_users_ratings(movie_ratings, 400)
print(user_ratings.shape)
user_ratings.head()

(14412,)


userId
4      736
56     764
73     468
79     425
81    1093
Name: rating, dtype: int64

### ratings_from_users:
Keeps only the rows of a dataframe that belong to users with more than a given number of ratings, using the **get_count_users_ratings** function.

In [109]:
def ratings_from_users(df, ratings_threshold):
    user_ratings = get_count_users_ratings(df, ratings_threshold)
    # Vectorised membership test instead of a row-by-row apply
    mask = df.userId.isin(user_ratings.index)
    return df[mask]

In [110]:
# Example
movie_ratings = ratings_from_users(movie_ratings, 400)
print(movie_ratings.shape)
movie_ratings.head()

(11203227, 6)


Unnamed: 0,userId,movieId,rating,date,title,year
42,4,1,4.0,2005-04-17 19:25:37,Toy Story,1995
43,4,2,4.0,2005-04-17 19:48:26,Jumanji,1995
44,4,5,2.0,2005-08-14 03:34:13,Father of the Bride Part II,1995
45,4,6,4.5,2005-04-17 19:47:22,Heat,1995
46,4,10,4.0,2005-04-17 19:26:35,GoldenEye,1995
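The same filter can also be expressed in one step with `groupby(...).transform("count")`, which broadcasts each user's rating count back onto that user's rows. The sketch below runs on a hypothetical `toy` frame with a threshold of 2.

```python
import pandas as pd

toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 3, 3],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 1.0],
})

# Per-row count of ratings made by that row's user
counts = toy.groupby("userId")["rating"].transform("count")

# Keep only rows from users with more than 2 ratings
active = toy[counts > 2]
print(active)
```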


Sample of 100K ratings, giving more weight to more recent years

In [21]:
sampled_ratings = pd.DataFrame(movie_ratings)
sampled_ratings = sampled_ratings.sample(n = 100000, replace = False, weights = "year", random_state = 1)
print(sampled_ratings.shape)
sampled_ratings.head()

(100000, 6)


Unnamed: 0,userId,movieId,rating,date,title,year
11602861,118903,6711,3.5,2003-10-06 00:16:00,Lost in Translation,2003
19961188,203569,1544,1.5,2010-11-15 11:40:29,"Lost World: Jurassic Park, The",1997
5287,56,7004,3.0,2010-05-25 10:03:01,Kindergarten Cop,1990
8460254,87201,110,4.0,2006-01-17 05:12:51,Braveheart,1995
4103898,42151,4011,4.5,2003-09-04 20:44:49,Snatch,2000


Check whether a pivot_table can be built from the users and movies in the sample. The number of cells (users × movies) must not exceed the maximum value of an **int32**.

In [24]:
# There are no missing values in the dataframe
print(sampled_ratings.dropna().shape)
print(sampled_ratings.userId.unique().size)
print(sampled_ratings.movieId.unique().size)
print(sampled_ratings.userId.unique().size * sampled_ratings.movieId.unique().size)

# Check that the sample, as a pivot table, has fewer cells than the int32 maximum (to avoid memory errors)
print(2147483647 > (sampled_ratings.userId.unique().size * sampled_ratings.movieId.unique().size))

(100000, 6)
14301
12249
175172949
True
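Staying under the int32 limit does not by itself mean the table fits in memory. As a rough estimate, the sketch below multiplies the cell count printed above by the item size of two float dtypes. It assumes a dense matrix with one column per movieId, whereas the pivot built later uses titles as columns, so the real figure may differ slightly.

```python
import numpy as np

# Unique users and movies in the sample, as printed above
n_users, n_movies = 14301, 12249
cells = n_users * n_movies
int32_max = np.iinfo(np.int32).max
print(cells, cells < int32_max)

# Approximate dense size for two candidate dtypes
for dtype in (np.float64, np.float32):
    gigabytes = cells * np.dtype(dtype).itemsize / 1e9
    print(f"{np.dtype(dtype).name}: {gigabytes:.2f} GB")
```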


In [30]:
sampled_ratings.userId = pd.to_numeric(sampled_ratings.userId, downcast = "integer")
sampled_ratings.movieId = pd.to_numeric(sampled_ratings.movieId, downcast = "integer")
sampled_ratings.rating = pd.to_numeric(sampled_ratings.rating, downcast = "float")

merge_matrix = sampled_ratings.pivot_table(index = "userId", columns = "title", values = "rating")
merge_matrix.head()

title,#Horror,#realityhigh,$9.99,'71,'R Xmas,'Round Midnight,'Salem's Lot,'Til There Was You,"'burbs, The",'night Mother,...,a/k/a Tommy Chong,eXistenZ,iBoy,loudQUIETloud: A Film About the Pixies,xXx,xXx: Return of Xander Cage,xXx: State of the Union,¡Three Amigos!,Путь к себе,チェブラーシカ
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4,,,,,,,,,,,...,,,,,,,,,,
56,,,,,,,,,,,...,,,,,,,,,,
73,,,,,,,,,,,...,,,,,,,,,,
79,,,,,,,,,,,...,,,,,,,,,,
81,,,,,,,,,,,...,,,,,,,,,,


### find_intersections:
For every movie, counts the users who rated both that movie and a given movie (*col*) (P(X|Y)).
* *df* -> pivot_table
* *col* -> movie whose ratings are compared against

In [31]:
def find_intersections(df, col):
    return df.apply(lambda x: find_number_of_intersections(df, col, x.name))

In [32]:
def find_number_of_intersections(df, col1, col2):
    col1_unique = df[col1].dropna().reset_index().userId.unique()
    col2_unique = df[col2].dropna().reset_index().userId.unique()
    return np.intersect1d(col1_unique, col2_unique).size
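The per-column loop above can be slow on a wide pivot table. As one possible alternative (not the approach used in this notebook), the co-rating counts can be computed with a single matrix product over a 0/1 "has rated" mask. The frame `pivot` below is a hypothetical toy example.

```python
import numpy as np
import pandas as pd

# Toy pivot table: rows are users, columns are movies, NaN means "not rated"
pivot = pd.DataFrame(
    {"A": [5.0, np.nan, 3.0], "B": [4.0, 2.0, np.nan], "C": [np.nan, 1.0, 2.0]},
    index=pd.Index([1, 2, 3], name="userId"),
)

# 1 where the user rated the movie, 0 otherwise
rated = pivot.notna().astype(int)

# For each movie, number of users who rated it and also rated "A"
common = rated.T.dot(rated["A"])
print(common)
```

The entry for "A" itself is simply the number of users who rated "A".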

### most_relevant:
Counts, for every movie, the users who rated both it and *movie*, and keeps the movies that have at least a given number of users in common (*least_common_ratings*)
* *df* -> pivot_table
* *movie* -> the movie chosen for the comparison
* *least_common_ratings* -> threshold for the number of users who rated both movies

In [38]:
def most_relevant(df, movie, least_common_ratings):
    intersections = find_intersections(df, movie)
    return intersections[intersections >= least_common_ratings]

### movies_with_counts:
This function finds the movies that have more than a given number of ratings (*ratings_threshold*)

In [90]:
def movies_with_counts(df, cols, ratings_threshold):
    # Mean rating and number of ratings per title
    ratings = df.groupby("title")["rating"].agg(rating = "mean", rating_count = "count")
    # Keep only the requested titles
    ratings = ratings.loc[cols].reset_index()
    return ratings[ratings.rating_count > ratings_threshold]

In [39]:
relevant = most_relevant(merge_matrix, "Shawshank Redemption, The", 3)
relevant

title
Aladdin                                                                      3
American Beauty                                                              3
Beautiful Mind, A                                                            3
Beautiful Thing                                                              3
Breakfast Club, The                                                          3
City Slickers                                                                3
Dragonheart                                                                  3
Father of the Bride Part II                                                  3
Fight Club                                                                   3
Gladiator                                                                    3
Godfather, The                                                               3
Independence Day (a.k.a. ID4)                                                3
Mission: Impossible                           

### movie_correlations:
Computes the correlation between the ratings of a given movie (*movie*) and those of the other movies to compare (*movies_to_compare*)

In [40]:
def movie_correlations(df, movie, movies_to_compare):
    movie_user_ratings = df[movie]
    return df[movies_to_compare].corrwith(movie_user_ratings).sort_values(ascending = False)
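To illustrate what `corrwith` does here, the sketch below uses a hypothetical toy pivot table. For each column, the Pearson correlation is computed only over the users who rated both movies.

```python
import numpy as np
import pandas as pd

# Toy pivot table: B moves with A, C moves against A
pivot = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0],
        "B": [2.0, 4.0, np.nan, 8.0],
        "C": [4.0, 3.0, 2.0, 1.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

# Correlate every other movie with "A", ignoring missing ratings pairwise
corr = pivot[["B", "C"]].corrwith(pivot["A"]).sort_values(ascending=False)
print(corr)
```

With only two users in common, a Pearson correlation is always exactly +1 or -1 (or undefined), which is why a minimum number of common raters matters before trusting a high value.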

### get_correlations:
Convenience function that combines all the others. Returns a series with the highest correlations, in order to find a recommendation.
* *sample* -> sample of the ratings dataset
* *pivot* -> the pivot_table needed to compare the ratings
* *movie* -> the movie for which a recommendation is wanted
* *n_movies_to_compare* -> minimum number of users who rated both a movie and *movie* for that movie to be considered relevant
* *ratings_threshold* -> number of ratings a movie must have to be considered relevant
* *corr_threshold* -> minimum correlation for a movie to be recommended

In [97]:
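def get_correlations(sample, pivot, movie, n_movies_to_compare = 3, ratings_threshold = 50, corr_threshold = 0.5):
    # Movies sharing at least n_movies_to_compare raters with the chosen movie
    relevant = most_relevant(pivot, movie, n_movies_to_compare)
    # Of those, keep the movies with more than ratings_threshold ratings
    rel = movies_with_counts(sample, relevant.index.values.tolist(), ratings_threshold)
    similar_to_movie = movie_correlations(pivot, movie, rel.title)
    # Remove the movie itself; errors = "ignore" avoids a KeyError if it was filtered out above
    similar_to_movie = similar_to_movie.drop(movie, errors = "ignore")
    return similar_to_movie[similar_to_movie > corr_threshold]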

In [103]:
corrs = get_correlations(sampled_ratings, merge_matrix, "Shrek", 3, 10, 0.5)
corrs

title
Austin Powers: The Spy Who Shagged Me    0.981981
Equilibrium                              0.576557
Independence Day (a.k.a. ID4)            0.500000
dtype: float64