
# Introduction

In this assignment, we are asked to create a **data pipeline** from two CSV files taken from [MovieLens](http://movielens.org).

Our goal is to turn the raw data into clean, validated, and transformed data from which a data scientist can easily extract information. We are told that the following query will be run on the output of our pipeline:

> **Obtain a list of movie genres, sorted by their average score
> obtained in the last week**.

We will design our code so that this query is as easy as possible to run.

First, we import the libraries we need. `pandas` will help us load the *.csv* files and transform the data.

In [16]:
# Imports cell
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline


Now we can read the data from the provided *.csv* files and show the first 5 rows of each one to get a first look at the data.

In [17]:
movies = pd.read_csv("datasets/movies.csv")
ratings = pd.read_csv("datasets/ratings.csv")

In [18]:
ratings.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [19]:
movies.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


As we can see, the two files contain quite different data, so we will process each one separately and combine them at the end to produce the output that our query needs.


# Ratings 

We begin with the ratings file. Our goal is to clean and transform this data so that obtaining the rating of each film is as easy as possible.

The most basic step is to remove the rows of the dataframe that contain null values.

In [14]:
# Function that removes rows that have a null value
remove_nulls = lambda df : df.dropna()
        




Having removed the rows with missing values, we also want to remove ratings that look plausible but are not valid. A valid rating is a multiple of half a star between 0.5 and 5, that is, one of the values $\{0.5\cdot i\}_{i=1}^{10}$.

In [42]:
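def remove_wrong_evaluations(df):
    # Valid ratings are 0.5, 1.0, ..., 5.0
    correct_vals = np.arange(0.5, 5.01, 0.5)
    # `in` does not work element-wise on a Series, so build a boolean mask with `isin`
    return df[df.rating.isin(correct_vals)]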

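The two cleaning steps described above can be chained into a single expression. The following is a minimal sketch, not the notebook's actual pipeline: the toy data and the function names `drop_null_rows` and `keep_valid_ratings` are hypothetical, and `DataFrame.pipe` is used here only as one possible way to chain the steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for ratings.csv
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating": [5.0, np.nan, -1.5, 3.5, 0.5],
    "timestamp": [100, 200, 300, 400, 500],
})

# The ten allowed values: 0.5, 1.0, ..., 5.0
VALID_RATINGS = np.arange(0.5, 5.01, 0.5)


def drop_null_rows(df):
    """Drop every row that has at least one missing value."""
    return df.dropna()


def keep_valid_ratings(df):
    """Keep only the rows whose rating is one of the allowed values."""
    return df[df["rating"].isin(VALID_RATINGS)]


# Each step receives the output of the previous one
clean_ratings = toy_ratings.pipe(drop_null_rows).pipe(keep_valid_ratings)
print(clean_ratings)
```

Writing each step as a function that takes and returns a dataframe keeps the steps independent, so more of them can be added to the chain later.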


In [None]:
# Number of seconds in one week (the timestamps are expressed in seconds)
week = 60*60*24*7

In [None]:
# Mean rating of each movie
mean_by_id = ratings.groupby("movieId")['rating'].mean()
print(mean_by_id)

# Movies

We now turn to the movies file. We first remove the movies that have no genres listed, since they cannot contribute to the average score of any genre.

In [4]:
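# Remove movies that have no genres listed
print("Movies before removing: {}".format(len(movies.index)))
# .copy() avoids a SettingWithCopyWarning when columns are assigned later
movies = movies[movies.genres != '(no genres listed)'].copy()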
print("Movies after removing: {}".format(len(movies.index)))

Movies before removing: 62423
Movies after removing: 57361


In [6]:
# Turn the pipe-separated genre string into a list of genres
movies['genres'] = movies['genres'].str.split("|")

In [22]:
# Flatten the per-movie genre lists and keep each genre once (sorted)
arr = np.concatenate(np.array(movies['genres'].tolist(), dtype=object))
unique = np.unique(arr)

In [25]:
# Add one boolean column per genre: True if the movie belongs to that genre
for genre in unique:
    movies[genre] = movies['genres'].apply(lambda x: genre in x)
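As an alternative to the loop, the same kind of boolean columns can be built in one step with `Series.str.get_dummies`. This is a different technique from the one used in the notebook, and it works on the original pipe-separated `genres` string rather than on the split lists. The sketch below uses three rows copied from the `movies.head(5)` output; the names `toy_movies` and `genre_flags` are hypothetical.

```python
import pandas as pd

# Three rows copied from the movies.head(5) output shown earlier
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 column per genre, converted to booleans
genre_flags = toy_movies["genres"].str.get_dummies(sep="|").astype(bool)

# Attach the indicator columns to the original dataframe
toy_movies = pd.concat([toy_movies, genre_flags], axis=1)
print(toy_movies.drop(columns="genres"))
```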

In [27]:
movies.head()

Unnamed: 0,movieId,title,genres,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,Horror,IMAX,Musical,Mystery,Psychological Thriller,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",False,True,True,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",False,True,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,3,Grumpier Old Men (1995),"[Comedy, Romance]",False,False,False,False,True,False,False,...,False,False,False,False,False,True,False,False,False,False
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",False,False,False,False,True,False,False,...,False,False,False,False,False,True,False,False,False,False
4,5,Father of the Bride Part II (1995),[Comedy],False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
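
With one boolean column per genre, the target query reduces to a merge followed by one mean per genre. The following is a minimal sketch under several assumptions: the toy data is hypothetical, the ratings are assumed to be already restricted to the last week, and the result is sorted in descending order, which the query does not specify.

```python
import pandas as pd

# Hypothetical toy data with the same layout as the transformed dataframes
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "Adventure": [True, True, False],
    "Comedy": [True, False, True],
    "Romance": [False, False, True],
})
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 2, 3],
    "rating": [5.0, 4.0, 3.0, 2.0],
})
genre_columns = ["Adventure", "Comedy", "Romance"]

# Attach the genre indicator columns to every rating
merged = toy_ratings.merge(toy_movies, on="movieId", how="inner")

# For each genre, average the ratings of the movies flagged with it.
# A movie with several genres contributes to each of them.
genre_means = pd.Series(
    {genre: merged.loc[merged[genre], "rating"].mean() for genre in genre_columns},
    name="mean_rating",
).sort_values(ascending=False)

print(genre_means)
```

Note that this averages over individual ratings, so a movie with many ratings weighs more than one with few. Averaging the per-movie means instead would be a different design choice.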
