<a href="https://colab.research.google.com/github/galvaowesley/DataScience_Learning/blob/master/MovieLens/MovieLens_Data_Analysing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Data Analysis of the MovieLens Dataset
===

Wesley Galvão

May, 2020

# The Data

## Dictionary

This section describes the meaning of each column in the datasets. For more information, see the official dataset [README](http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html).

* `movieId`  :  movie ID
* `title`    :  movie title
* `genres`   :  movie genres
* `userId`   :  ID of the user who rated one or more movies on the platform
* `rating`   :  movie rating score. Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
* `timestamp`:  time of the rating, in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970
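
Because `timestamp` is stored as raw seconds since the Unix epoch, it is not very readable as is. The following is a minimal sketch of how it could be converted to a date with `pd.to_datetime`. The tiny `sample` frame and the `rated_at` column name are assumptions made for illustration only; they are not part of the dataset.

```python
import pandas as pd

# Hypothetical two-row sample holding epoch timestamps in seconds
sample = pd.DataFrame({"timestamp": [964982703, 964981247]})

# unit="s" tells pandas the integers are seconds since 1970-01-01 (UTC)
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s", utc=True)

print(sample)
```

The same call should work on the full `timestamp` column once the rating dataset is loaded.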

# Feature Engineering

In [0]:
import pandas as pd
import matplotlib.pyplot as plt

First, we import the movies dataset, which contains the `movieId`, `title` and `genres` of each movie. Each line of this dataset, after the header row, represents one movie.

In [0]:
# Importing movies dataset
movies = pd.read_csv("https://raw.githubusercontent.com/alura-cursos/introducao-a-data-science/master/aula0/ml-latest-small/movies.csv")
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [0]:
print("Shape of movies dataset")
movies.shape

Shape of movies dataset


(9742, 3)

Similarly, we will import the rating dataset, which contains the fields `userId`, `movieId`, `rating` and `timestamp`. Each line of this dataset, after the header row, represents one rating of one movie by one user.
We can also take a look at its shape.

In [0]:
# Importing movies rating dataset
rating = pd.read_csv("https://github.com/alura-cursos/introducao-a-data-science/blob/master/aula0/ml-latest-small/ratings.csv?raw=true")
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [0]:
print("Shape of rating dataset")
rating.shape

Shape of rating dataset


(100836, 4)

At this point, we can see that the two datasets have a different number of rows, yet they are related through `movieId`. So, how can we combine and analyse them?

To answer this question, let's first take a look at the `rating` column and get some information about it.

Since each line contains one rating for one film, a film is expected to have different ratings from different users. We can therefore choose a single movie and analyse its statistics. For this, we will use the `query()` method, which takes as an argument a string with a Boolean expression over the columns. For example, the command `rating.query('movieId == 1')` returns only the lines for which the expression `'movieId == 1'` is true.

Once the information about Toy Story (1995) is collected, we can apply the `describe()` method to obtain the movie's rating statistics.
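
To make the behaviour of `query()` more concrete, here is a small sketch on made-up data. The `toy_rating` frame and the `movie_id` variable are hypothetical and only mimic the structure of the rating dataset. It shows that `query()` selects the same rows as a Boolean mask, and that a Python variable can be referenced inside the expression with `@`.

```python
import pandas as pd

# Toy stand-in for the rating dataset (values are made up for illustration)
toy_rating = pd.DataFrame({
    "userId":  [1, 2, 3, 1],
    "movieId": [1, 1, 2, 2],
    "rating":  [4.0, 3.0, 5.0, 2.5],
})

# Select rows with a query string ...
by_query = toy_rating.query("movieId == 1")

# ... or with an equivalent Boolean mask
by_mask = toy_rating[toy_rating["movieId"] == 1]
print(by_query.equals(by_mask))

# A local variable can be used inside the expression with the @ prefix
movie_id = 2
by_variable = toy_rating.query("movieId == @movie_id")
print(by_variable["rating"].describe())
```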


In [0]:
# Query the rows of the rating dataset where movieId == 1
rating_movie_1 = rating.query('movieId == 1')
rating_movie_1.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
516,5,1,4.0,847434962
874,7,1,4.5,1106635946
1434,15,1,2.5,1510577970
1667,17,1,4.5,1305696483


In [0]:
# Get the statistics from movie 1 rating. 
rating_movie_1.rating.describe()

count    215.000000
mean       3.920930
std        0.834859
min        0.500000
25%        3.500000
50%        4.000000
75%        4.500000
max        5.000000
Name: rating, dtype: float64

The summary above reveals a lot of information. Now we know that 215 users rated Toy Story (1995). The lowest score is 0.5 and the average score is 3.92.
If we only want the mean, or average, we can use the `mean()` method.

In [0]:
# Get the average rating of Toy Story
rating_movie_1_avg  = rating_movie_1.rating.mean()
print('Toy Story rating average: %0.2f' % rating_movie_1_avg)

Toy Story rating average: 3.92


What's the next step? Now that we know how to get the average score for one movie, we will compute the average for all movies. This new set of information will then be joined with the `movies` dataset.

To make this possible, we will use a new method called `groupby()`, which groups the rows of the original dataset by the values of the column passed as a parameter.

For example, given the `rating` dataset, we will group by `movieId` and get the average rating for each movie:

In [0]:
# Get the average rating score per movie
avg_score_per_movie = rating.groupby('movieId')['rating'].mean()
avg_score_per_movie.head()

movieId
1    3.920930
2    3.431818
3    3.259615
4    2.357143
5    3.071429
Name: rating, dtype: float64

Similarly, we can create a new feature that contains how many ratings a movie received.

In [0]:
# Get how many ratings a movie received
rating_counting = rating.groupby('movieId')['rating'].count()
rating_counting.head()

movieId
1    215
2    110
3     52
4      7
5     49
Name: rating, dtype: int64
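
As a side note, the mean and the count can also be computed in a single `groupby()` pass using named aggregation. The sketch below runs on a made-up `toy_rating` frame (hypothetical data, not the real dataset); the output column names `rating_avg` and `count` are chosen here for illustration.

```python
import pandas as pd

# Toy stand-in for the rating dataset (values are made up for illustration)
toy_rating = pd.DataFrame({
    "userId":  [1, 2, 3, 1],
    "movieId": [1, 1, 2, 2],
    "rating":  [4.0, 3.0, 5.0, 2.5],
})

# One pass over the groups: each keyword becomes an output column
summary = toy_rating.groupby("movieId")["rating"].agg(
    rating_avg="mean",
    count="count",
)
print(summary)
```

The result is a DataFrame indexed by `movieId`, so it can be joined to another frame on that key in one step.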

In [0]:
# Join the per-movie average and count to the movies dataframe,
# naming each Series so the new columns are labelled explicitly
movies_2 = (
    movies
    .join(avg_score_per_movie.rename('rating_avg'), on='movieId')
    .join(rating_counting.rename('count'), on='movieId')
)
movies_2.head()

Unnamed: 0,movieId,title,genres,rating_avg,count
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.92093,215.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.431818,110.0
2,3,Grumpier Old Men (1995),Comedy|Romance,3.259615,52.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.357143,7.0
4,5,Father of the Bride Part II (1995),Comedy,3.071429,49.0
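
Note that `count` is displayed as a float (`215.0`) in the output above. A likely explanation is that `join()` performs a left join by default: every movie is kept, and movies with no match in the right-hand Series receive `NaN`, which forces the column to a floating-point dtype. The sketch below illustrates this on hypothetical toy data (`toy_movies` and `toy_counts` are made up for the example).

```python
import pandas as pd

# Three toy movies, but only two of them have a rating count
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_counts = pd.Series(
    [2, 1],
    index=pd.Index([1, 2], name="movieId"),
    name="count",
)

# join() keeps every row of the left frame (how="left" is the default)
joined = toy_movies.join(toy_counts, on="movieId")
print(joined)

# The unmatched movie gets NaN, so the integer counts become floats
print(joined["count"].dtype)
```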



# Exploratory data analysis

## Missing values

In [0]:
# Count missing values per column
movies_2.isna().sum()

movieId        0
title          0
genres         0
rating_avg    18
count         18
dtype: int64
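
How to treat these missing values depends on the analysis. One possible approach, sketched below on hypothetical toy data, is to set the missing `count` to 0 (the movie simply has no ratings) while leaving `rating_avg` as `NaN`, or to drop unrated movies when only rated ones matter. The frame `toy` and the names `cleaned` and `rated_only` are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking movies_2, with one unrated movie
toy = pd.DataFrame({
    "movieId":    [1, 2, 3],
    "rating_avg": [3.9, np.nan, 3.2],
    "count":      [215.0, np.nan, 52.0],
})

# Option 1: an unrated movie has zero ratings; its average stays undefined
cleaned = toy.copy()
cleaned["count"] = cleaned["count"].fillna(0).astype(int)

# Option 2: keep only the movies that have at least one rating
rated_only = toy.dropna(subset=["rating_avg"])

print(cleaned)
print(rated_only)
```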

In [0]:
# Show the rows where rating_avg is missing
movies_2[pd.isnull(movies_2.rating_avg)]

Unnamed: 0,movieId,title,genres,rating_avg,count
816,1076,"Innocents, The (1961)",Drama|Horror|Thriller,,
2211,2939,Niagara (1953),Drama|Thriller,,
2499,3338,For All Mankind (1989),Documentary,,
2587,3456,"Color of Paradise, The (Rang-e khoda) (1999)",Drama,,
3118,4194,I Know Where I'm Going! (1945),Drama|Romance|War,,
4037,5721,"Chosen, The (1981)",Drama,,
4506,6668,"Road Home, The (Wo de fu qin mu qin) (1999)",Drama|Romance,,
4598,6849,Scrooge (1970),Drama|Fantasy|Musical,,
4704,7020,Proof (1991),Comedy|Drama|Romance,,
5020,7792,"Parallax View, The (1974)",Thriller,,
