# Preprocessing the MovieLens 100k dataset

This dataset contains about 100,000 ratings from roughly 600 users for almost 9,000 movies. It also includes about 3,600 tag applications.

## Importing dependencies and data

In [2]:
import pandas as pd
import os

In [3]:
data_path = "../data/raw/ml-latest-small"

In [61]:
output_path = "../data/processed"

In [32]:
movies = pd.read_csv(os.path.join(data_path, "movies.csv"))
ratings = pd.read_csv(os.path.join(data_path, "ratings.csv"))

## Preprocessing 'ratings'

In this section, we convert the 'timestamp' column of the 'ratings' dataframe to a datetime type.

In [33]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Converting the Unix timestamp (in seconds) to datetime:

In [34]:
ratings['datetime'] = pd.to_datetime(ratings['timestamp'], unit='s')
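
As a small illustration of what this conversion does, the sketch below applies it to two timestamp values copied from the first rows of 'ratings'. The `sample` frame and the `datetime_utc` column are names introduced only for this example; the `utc=True` variant is an optional assumption and is not used in the notebook.

```python
import pandas as pd

# Two timestamp values copied from the first rows of 'ratings'.
sample = pd.DataFrame({"timestamp": [964982703, 964981247]})

# unit='s' reads the integers as seconds elapsed since 1970-01-01 00:00:00 UTC.
sample["datetime"] = pd.to_datetime(sample["timestamp"], unit="s")

# Optional variant (assumption, not in the notebook): keep the timezone explicit.
sample["datetime_utc"] = pd.to_datetime(sample["timestamp"], unit="s", utc=True)

print(sample)
```

The plain conversion yields timezone-naive values that correspond to UTC wall-clock time, which is worth keeping in mind if the ratings are later compared with local times.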

Dropping the 'timestamp' column:

In [35]:
ratings = ratings.drop('timestamp', axis=1)

In [36]:
ratings.head()

Unnamed: 0,userId,movieId,rating,datetime
0,1,1,4.0,2000-07-30 18:45:03
1,1,3,4.0,2000-07-30 18:20:47
2,1,6,4.0,2000-07-30 18:37:04
3,1,47,5.0,2000-07-30 19:03:35
4,1,50,5.0,2000-07-30 18:48:51


## Preprocessing 'movies'

In this section, we move the release year out of the movie title into a new column and expand the genres into separate indicator columns (one-hot encoding).

In [37]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [38]:
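# Capture the four-digit year in parentheses into its own column (stored as strings).
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)')
# Remove the "(YYYY)" part from the title and trim leftover whitespace.
movies['title'] = movies['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()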
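
To see how the pattern behaves, here is a minimal sketch on a few sample titles. The third title is hypothetical and only illustrates a title without a year; converting the year to a nullable integer with `pd.to_numeric` and `Int64` is an assumption that goes beyond what the notebook does.

```python
import pandas as pd

# The first two titles come from the dataset; the last one is a hypothetical
# title without a year, used to show the missing-value case.
titles = pd.Series(["Toy Story (1995)", "Jumanji (1995)", "Untitled Example"])

# expand=False returns a Series instead of a one-column DataFrame.
year = titles.str.extract(r"\((\d{4})\)", expand=False)

# Same cleanup as in the notebook: drop "(YYYY)" and trim whitespace.
clean_titles = titles.str.replace(r"\(\d{4}\)", "", regex=True).str.strip()

# Assumption (not in the notebook): store the year as a nullable integer,
# so titles without a year become <NA> instead of breaking the conversion.
year_int = pd.to_numeric(year).astype("Int64")

print(pd.DataFrame({"title": clean_titles, "year": year_int}))
```

Titles that do not contain a four-digit year in parentheses end up with a missing year, so it may be worth counting those rows before relying on the column.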

In [42]:
# Reorder the columns; .copy() avoids a SettingWithCopyWarning on later assignments.
movies = movies[['movieId', 'title', 'year', 'genres']].copy()

In [48]:
movies['genres'] = movies['genres'].str.replace('|', ',', regex=False)

In [49]:
movies.head()

Unnamed: 0,movieId,title,year,genres
0,1,Toy Story,1995,"Adventure,Animation,Children,Comedy,Fantasy"
1,2,Jumanji,1995,"Adventure,Children,Fantasy"
2,3,Grumpier Old Men,1995,"Comedy,Romance"
3,4,Waiting to Exhale,1995,"Comedy,Drama,Romance"
4,5,Father of the Bride Part II,1995,Comedy


### Applying one-hot encoding to the genres

In [50]:
genre_dummies = movies['genres'].str.get_dummies(sep=',')

In [52]:
movies = pd.concat([movies, genre_dummies], axis=1)

In [57]:
movies = movies.drop('genres', axis=1)
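
The three steps above can be sketched on a tiny frame as follows. The `movies_demo` name is introduced only for this example. The sketch splits directly on `|`, which is the default separator of `Series.str.get_dummies`, so the result should be equivalent to replacing the pipes with commas first.

```python
import pandas as pd

# Two rows with genre strings in the original pipe-separated format.
movies_demo = pd.DataFrame({
    "movieId": [3, 5],
    "genres": ["Comedy|Romance", "Comedy"],
})

# One indicator column per distinct genre, 1 where the movie has that genre.
genre_dummies = movies_demo["genres"].str.get_dummies(sep="|")

# Attach the indicator columns and drop the original string column.
movies_demo = pd.concat([movies_demo.drop(columns="genres"), genre_dummies], axis=1)

print(movies_demo)
```

Each movie can have several genres, so a row may contain more than one 1 across the genre columns.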

In [58]:
movies.head()

Unnamed: 0,movieId,title,year,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story,1995,0,0,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji,1995,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men,1995,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale,1995,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II,1995,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


## Merging 'ratings' with 'movies'

In [59]:
df = ratings.merge(movies, on='movieId', how='left')
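
A left merge keeps every rating even when no matching movie exists, so it can help to check the join. The sketch below is one possible check, not part of the notebook: it uses the `validate` and `indicator` parameters of `DataFrame.merge` on small demo frames, and the movieId 999 is a hypothetical value with no match.

```python
import pandas as pd

# Small demo frames; movieId 999 is hypothetical and has no entry in movies_demo.
ratings_demo = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 999],
    "rating": [4.0, 4.0, 3.5],
})
movies_demo = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story", "Grumpier Old Men"],
})

# validate="many_to_one" raises MergeError if movieId is duplicated in movies_demo;
# indicator=True adds a "_merge" column that records where each row came from.
merged = ratings_demo.merge(
    movies_demo, on="movieId", how="left", validate="many_to_one", indicator=True
)

# Ratings whose movieId was not found keep NaN in the movie columns.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

If `unmatched` is empty on the real data, every rating found its movie and the `_merge` column can be dropped.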

In [60]:
df.head()

Unnamed: 0,userId,movieId,rating,datetime,title,year,(no genres listed),Action,Adventure,Animation,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,1,4.0,2000-07-30 18:45:03,Toy Story,1995,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,1,3,4.0,2000-07-30 18:20:47,Grumpier Old Men,1995,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,1,6,4.0,2000-07-30 18:37:04,Heat,1995,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
3,1,47,5.0,2000-07-30 19:03:35,Seven (a.k.a. Se7en),1995,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
4,1,50,5.0,2000-07-30 18:48:51,"Usual Suspects, The",1995,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0


## Exporting the preprocessed data

In [62]:
os.makedirs(output_path, exist_ok=True)

In [63]:
df.to_csv(os.path.join(output_path, "movielens_100k_interactions.csv"), index=False)
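
CSV stores the 'datetime' column as text, so it has to be parsed again on load. The following sketch shows one way to do that with `parse_dates`; it writes a one-row demo frame to a temporary directory, and both `df_demo` and the file name `interactions_demo.csv` are hypothetical names used only here.

```python
import os
import tempfile

import pandas as pd

# One-row demo frame shaped like the first columns of the exported data.
df_demo = pd.DataFrame({
    "userId": [1],
    "movieId": [1],
    "rating": [4.0],
    "datetime": pd.to_datetime([964982703], unit="s"),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "interactions_demo.csv")
    df_demo.to_csv(csv_path, index=False)

    # Without parse_dates the column would come back as plain strings.
    restored = pd.read_csv(csv_path, parse_dates=["datetime"])

print(restored.dtypes)
```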