## Movie Recommendation Project

#### What will we do?

We will build movie recommendation algorithms for a few scenarios:
- Based on the most popular movies (for users new to the platform)
- Based on the user's viewing history
- Based on the choices of the user's network of contacts


#### Where will the data come from?

In this project, we will use movie datasets made available by GroupLens, a research group at the University of Minnesota with publications in several fields of study.

Sources:
- https://grouplens.org/
- https://movielens.org/

#### How will we divide the project?

We will divide the project as follows:
- Importing the tools we will use
- Data ingestion
- Data cleaning
    - Checking for null values
    - Checking for duplicate values
    - Merging the tables
    - Removing outliers
- Identifying categorical variables and shaping the DataFrame
- Clustering with KMeans
- Model development


---------------------------------------------------------------------

###### Importing the tools

In [2]:
import pandas as pd
import numpy as np
import os
# Raw string so the backslashes in the Windows path are not read as escape sequences
os.chdir(r"D:\Desktop\estudos\RecomendadorFilmes\ml-latest-small")

###### Ingesting the datasets

In [159]:
filmes = pd.read_csv("movies.csv")
notas = pd.read_csv("ratings.csv")

###### Checking the structure of the datasets

In [7]:
filmes.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [10]:
notas.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
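
The `timestamp` values are large integers. Assuming they are Unix epoch seconds, as the MovieLens documentation describes, they can be converted to readable dates. A minimal sketch using two of the values shown above (`notas_demo` and the `data` column are illustrative names, not part of the notebook):

```python
import pandas as pd

# Two timestamp values copied from the notas.head() output
notas_demo = pd.DataFrame({"timestamp": [964982703, 964981247]})

# unit="s" interprets the integers as seconds since 1970-01-01 UTC
notas_demo["data"] = pd.to_datetime(notas_demo["timestamp"], unit="s")
```

A datetime column like this makes it possible to sort or filter ratings by period, should a later step need it.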


###### Checking for null values

In [12]:
filmes.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [13]:
notas.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

###### Checking for and removing duplicate movies

In [138]:
filmes.title.value_counts()

Eros (2004)                                     2
Confessions of a Dangerous Mind (2002)          2
War of the Worlds (2005)                        2
Emma (1996)                                     2
Saturn 3 (1980)                                 2
                                               ..
Wrath of the Titans (2012)                      1
Foxfire (1996)                                  1
Mod Squad, The (1999)                           1
Fay Grim (2006)                                 1
Scooby-Doo! Curse of the Lake Monster (2010)    1
Name: title, Length: 9737, dtype: int64

In [139]:

filmes_duplicados = filmes[filmes.duplicated(subset="title")]["movieId"].tolist()


In [160]:
filmes_duplicados = filmes[filmes.duplicated(subset="title")]["movieId"].tolist()
filmes = filmes.query(f'movieId not in {filmes_duplicados}')

In [141]:
filmes.title.value_counts()

Elephant (2003)                                 1
Band Wagon, The (1953)                          1
Rum Diary, The (2011)                           1
Love Song for Bobby Long, A (2004)              1
Dirty Dozen, The (1967)                         1
                                               ..
Wrath of the Titans (2012)                      1
Foxfire (1996)                                  1
Mod Squad, The (1999)                           1
Fay Grim (2006)                                 1
Scooby-Doo! Curse of the Lake Monster (2010)    1
Name: title, Length: 9737, dtype: int64
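
The filter above keeps the first occurrence of each repeated title and drops the later ones. To inspect every copy before deciding which to drop, `duplicated(keep=False)` can be used. The sketch below runs on a toy frame (the rows are made up for illustration, not taken from `movies.csv`) and also shows a boolean mask with `isin` as an alternative to building the `query` string:

```python
import pandas as pd

# Toy stand-in for movies.csv (illustrative rows only)
filmes_demo = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["A (2000)", "B (2001)", "A (2000)", "C (2002)"],
    "genres": ["Drama", "Comedy", "Drama|Romance", "Action"],
})

# keep=False marks every row whose title appears more than once
todos_duplicados = filmes_demo[filmes_demo.duplicated(subset="title", keep=False)]

# The default keep="first" marks only the later copies, as in the cell above
ids_duplicados = filmes_demo.loc[
    filmes_demo.duplicated(subset="title"), "movieId"
].tolist()

# Boolean mask equivalent to query(f"movieId not in {ids_duplicados}")
filmes_sem_duplicados = filmes_demo[~filmes_demo["movieId"].isin(ids_duplicados)]
```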

###### Merging the tables

In [142]:
filmes_notas = notas.merge(filmes, how='left', on='movieId')

In [116]:
filmes_notas.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
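
Because the merge above is a left join, any rating whose `movieId` is missing from `filmes` (for example, one of the removed duplicates) ends up with an empty `title`. One way to surface such rows is the `indicator` option of `merge`. A minimal sketch on toy data (all frames and values below are illustrative):

```python
import pandas as pd

notas_demo = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 30],
    "rating": [4.0, 3.5, 5.0],
})
filmes_demo = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["A (2000)", "B (2001)"],
})

# indicator=True adds a "_merge" column describing where each row came from;
# validate="many_to_one" raises if movieId is not unique in the movies table
juncao = notas_demo.merge(
    filmes_demo, how="left", on="movieId", indicator=True, validate="many_to_one"
)

# Ratings whose movieId has no matching row in the movies table
sem_filme = juncao[juncao["_merge"] == "left_only"]
```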


Checking the highest-rated movies

In [126]:
filmes_notas.groupby('title')['rating'].mean().sort_values(ascending=False).head()

title
Gena the Crocodile (1969)              5.0
True Stories (1986)                    5.0
Cosmic Scrat-tastrophe (2015)          5.0
Love and Pigeons (1985)                5.0
Red Sorghum (Hong gao liang) (1987)    5.0
Name: rating, dtype: float64

Checking how many votes they received

In [127]:
filmes_notas[filmes_notas.title == 'Gena the Crocodile (1969)']

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
16929,105,175293,5.0,1526208082,Gena the Crocodile (1969),Animation|Children
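
The top-rated title has a single vote, so a plain mean favours movies that almost nobody rated. One simple remedy is to compute the mean and the vote count together and ignore titles below a minimum number of votes. A sketch on toy data, where `min_votos` is a hypothetical cutoff chosen only for illustration:

```python
import pandas as pd

notas_demo = pd.DataFrame({
    "title": ["A", "B", "B", "B", "C", "C"],
    "rating": [5.0, 4.0, 4.5, 3.5, 2.0, 3.0],
})

# Mean and number of ratings per title in one pass
resumo = notas_demo.groupby("title")["rating"].agg(media="mean", votos="count")

# Hypothetical cutoff: ignore titles with fewer than min_votos ratings
min_votos = 2
ranking = resumo[resumo["votos"] >= min_votos].sort_values("media", ascending=False)
```

Here title `A` has the highest mean but only one vote, so it is left out of `ranking`.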


In [144]:
# Mean rating per movie; reset_index turns the movieId index into a regular column for the merge
media = filmes_notas.groupby('movieId')['rating'].mean().reset_index()

In [161]:
filmes = filmes.merge(media, on='movieId')

In [162]:
filmes.columns = ['movieId', 'title', 'genres', 'average_rating']

In [163]:
filmes.head()

Unnamed: 0,movieId,title,genres,average_rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.92093
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.431818
2,3,Grumpier Old Men (1995),Comedy|Romance,3.259615
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.357143
4,5,Father of the Bride Part II (1995),Comedy,3.071429


In [151]:
filmes_notas.groupby('title').size().describe()

count    9719.000000
mean       10.374524
std        22.405799
min         1.000000
25%         1.000000
50%         3.000000
75%         9.000000
max       329.000000
dtype: float64

In [186]:
# Number of ratings per movie, as a one-column DataFrame indexed by movieId
qtde_notas = filmes_notas.groupby('movieId').size().to_frame('qtde_notas')

In [176]:
filmes.columns

Index(['movieId', 'title', 'genres', 'average_rating'], dtype='object')

In [187]:
filmes = filmes.merge(qtde_notas, how='left', on='movieId')

In [192]:
filmes

Unnamed: 0,movieId,title,genres,average_rating,qtde_notas
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.92093,215
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.431818,110
2,3,Grumpier Old Men (1995),Comedy|Romance,3.259615,52
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.357143,7
4,5,Father of the Bride Part II (1995),Comedy,3.071429,49
...,...,...,...,...,...
9714,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,4.0,1
9715,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,3.5,1
9716,193585,Flint (2017),Drama,3.5,1
9717,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,3.5,1
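
With both `average_rating` and `qtde_notas` available, the two can be combined into a single score. The sketch below uses a damped (Bayesian-style) average, a common technique that this notebook does not itself apply: each movie's mean is pulled toward a global mean, and more strongly when it has few votes. The toy rows, the prior weight `m`, and the column `nota_ponderada` are assumptions for illustration; the prior `C` is taken here as the mean of the per-movie averages for simplicity.

```python
import pandas as pd

filmes_demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "average_rating": [5.0, 4.5, 3.0],
    "qtde_notas": [1, 100, 10],
})

C = filmes_demo["average_rating"].mean()  # prior: mean of the per-movie averages
m = 10                                    # hypothetical prior weight, in "votes"

v = filmes_demo["qtde_notas"]
R = filmes_demo["average_rating"]

# Score approaches R when v is large and approaches C when v is small
filmes_demo["nota_ponderada"] = (v * R + m * C) / (v + m)

ordem = filmes_demo.sort_values("nota_ponderada", ascending=False)["title"].tolist()
```

In this toy data, `B` (many votes, high mean) ranks above `A` (one perfect vote).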


In [193]:
filmes_notas = notas.merge(filmes, on='movieId')

#### Top 5 most-rated movies

In [206]:
filmes.sort_values(by='qtde_notas', ascending=False).head(5)

Unnamed: 0,movieId,title,genres,average_rating,qtde_notas
314,356,Forrest Gump (1994),Comedy|Drama|Romance|War,4.164134,329
277,318,"Shawshank Redemption, The (1994)",Crime|Drama,4.429022,317
257,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,4.197068,307
510,593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,4.16129,279
1938,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,4.192446,278
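
The ranking above is the basis for the first scenario, recommending popular movies to new users. It could be wrapped in a small helper such as the one sketched below; `recomendar_populares` is a hypothetical name, and the toy frame only mimics the `title` and `qtde_notas` columns of `filmes`:

```python
import pandas as pd

def recomendar_populares(filmes, n=5):
    """Return the n titles with the most ratings (hypothetical helper)."""
    return filmes.nlargest(n, "qtde_notas")["title"].tolist()

# Toy stand-in for the filmes DataFrame (illustrative values)
filmes_demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "qtde_notas": [50, 300, 120, 5],
})

top2 = recomendar_populares(filmes_demo, n=2)
```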


## Clustering

In [208]:
filmes = pd.read_csv("movies.csv")
notas = pd.read_csv("ratings.csv")

In [209]:
filmes_duplicados = filmes[filmes.duplicated(subset="title")]["movieId"].tolist()
filmes = filmes.query(f'movieId not in {filmes_duplicados}')

In [211]:
filmes.value_counts('title')

title
'71 (2014)                                   1
Particle Fever (2013)                        1
Parent Trap, The (1961)                      1
Parent Trap, The (1998)                      1
Parental Guidance (2012)                     1
                                            ..
Friends & Lovers (1999)                      1
Friends with Benefits (2011)                 1
Friends with Kids (2011)                     1
Friends with Money (2006)                    1
À nous la liberté (Freedom for Us) (1931)    1
Length: 9737, dtype: int64

In [214]:
filmes['genres'].str.split('|')

0       [Adventure, Animation, Children, Comedy, Fantasy]
1                          [Adventure, Children, Fantasy]
2                                       [Comedy, Romance]
3                                [Comedy, Drama, Romance]
4                                                [Comedy]
                              ...                        
9737                 [Action, Animation, Comedy, Fantasy]
9738                         [Animation, Comedy, Fantasy]
9739                                              [Drama]
9740                                  [Action, Animation]
9741                                             [Comedy]
Name: genres, Length: 9737, dtype: object
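
Once the genres are split into lists, `explode` gives one row per movie and genre pair, which makes it easy to see how often each genre occurs. A minimal sketch on toy data (the rows are illustrative):

```python
import pandas as pd

filmes_demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Comedy|Romance", "Comedy", "Drama|Comedy"],
})

# One row per (movie, genre) pair, then count the movies in each genre
contagem = filmes_demo["genres"].str.split("|").explode().value_counts()
```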

##### Splitting the movie genres (categorical data) and encoding them as columns

In [216]:
generos = filmes.genres.str.get_dummies('|')

In [217]:
generos['movieId'] = filmes.movieId

In [219]:
filmes = filmes.merge(generos, on='movieId')

In [220]:
filmes

Unnamed: 0,movieId,title,genres,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,0,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9732,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,0,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
9733,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
9734,193585,Flint (2017),Drama,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9735,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
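
The 0/1 genre columns are the kind of input the planned KMeans step can work with. The sketch below assumes scikit-learn's `KMeans` and runs on toy data; the number of clusters and the `cluster` column name are arbitrary choices for illustration, not values taken from this notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the movies table (illustrative rows)
filmes_demo = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "genres": [
        "Comedy|Romance",
        "Comedy|Romance",
        "Action|Thriller",
        "Action|Thriller",
    ],
})

# Same encoding as above: one 0/1 column per genre
generos_demo = filmes_demo["genres"].str.get_dummies("|")

# n_clusters=2 is an arbitrary choice for this toy data
modelo = KMeans(n_clusters=2, n_init=10, random_state=0)
filmes_demo["cluster"] = modelo.fit_predict(generos_demo)
```

Movies with identical genre profiles land in the same cluster. On the real data, the number of clusters would need to be chosen by inspection or with a criterion such as inertia.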
