## MovieLens 20M Dataset Recommender System - 01_Data Wrangling

This dataset describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. These data were created by users between January 09, 1995 and March 31, 2015. This dataset was generated on March 31, 2015, and updated on October 17, 2016 to update links.csv and add genome-* files. 

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in six files, 'genome-scores.csv', 'genome-tags.csv', 'links.csv', 'movies.csv', 'ratings.csv' and 'tags.csv'.

We would like to use this dataset to build a recommender system.

This is the first part of the study: Data Wrangling.

### Content

### 1. Import Modules and Read Data Files

#### 1.1 Import Necessary Modules

In [71]:
import pandas as pd
import numpy as np

#### 1.2 Read Movie File

In [72]:
# First, read the movie file.
movies = pd.read_csv('./ml-20m/movies.csv')

In [73]:
# look at the first few rows of the movie file.
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [74]:
# look at the size of the movie file.
movies.shape

(27278, 3)

In [75]:
# see if every movie is unique.
movies.movieId.nunique()

27278

We can see that every row is one movie and they are all unique in the movie file.
Let's then read the rating file.

#### 1.3 Read Rating File

In [76]:
# read the rating file.
ratings = pd.read_csv('./ml-20m/ratings.csv')

In [77]:
# look at first few rows in the rating file.
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


In [78]:
# column 'timestamp' is not necessary to our model. So let's drop it.
ratings = ratings.drop(columns=['timestamp'])

In [79]:
# see if there are missing values in the rating file.
ratings.isna().sum()

userId     0
movieId    0
rating     0
dtype: int64

It shows that there are no missing values in the rating file.

In [80]:
# look at the size of the rating file. 
ratings.shape

(20000263, 3)

Next, we want to build a model to predict users' ratings. The steps are: 1. select a subset of the data with a suitable size; 2. split the data into training and test sets; 3. apply several algorithms and compare their performance. In this notebook we work only on the first step, which is to create a dataset of the right size. Considering the capacity of a personal computer, we think about 100,000 rows is a good size for this study.
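Since the capacity of a personal computer is the constraint, one option is to load only the needed columns with compact dtypes. The following is a minimal sketch, not the method used in this notebook: it reads a tiny in-memory sample instead of `ratings.csv`, and the dtype choices (`int32`, `float32`) and the name `ratings_small` are assumptions.

```python
import io
import pandas as pd

# A tiny stand-in for './ml-20m/ratings.csv' (same columns as the real file).
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,2,3.5,1112486027\n"
    "1,29,3.5,1112484676\n"
    "2,32,4.0,1112484819\n"
)

# Skip 'timestamp' at read time and use narrower dtypes than the defaults.
# Assumption: int32 is wide enough for the ids, and float32 stores
# half-star ratings exactly.
ratings_small = pd.read_csv(
    sample_csv,
    usecols=["userId", "movieId", "rating"],
    dtype={"userId": "int32", "movieId": "int32", "rating": "float32"},
)

print(ratings_small.dtypes)
print(ratings_small.memory_usage(deep=True))
```

With the real file, the same `usecols` and `dtype` arguments would be passed to `pd.read_csv('./ml-20m/ratings.csv', ...)`.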

In [81]:
# look at how many users gave movie ratings.
ratings['userId'].nunique()

138493

In [82]:
# look at how many movies are in the rating file.
ratings['movieId'].nunique()

26744

There are 138,493 users and 26,744 movies in the rating file.

#### 1.4 Read Tag File

In [83]:
tags = pd.read_csv('./ml-20m/tags.csv')

In [84]:
# show the first few rows
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,1240597180
1,65,208,dark hero,1368150078
2,65,353,dark hero,1368150079
3,65,521,noir thriller,1368149983
4,65,592,dark hero,1368150078


In [85]:
# show the top 10 tag values
tags.tag.value_counts().head(10)

sci-fi             3384
based on a book    3281
atmospheric        2917
comedy             2779
action             2657
surreal            2427
BD-R               2334
twist ending       2323
funny              2072
dystopia           1991
Name: tag, dtype: int64

In [86]:
# show the tag file size
tags.shape

(465564, 4)

In [87]:
# show how many users give tags
tags['userId'].nunique()

7801

In [88]:
#show how many movies have tags
tags['movieId'].nunique()

19545

#### 1.5 Read Genome Score File, Genome Tag File, and Links File

In [89]:
genome_scores = pd.read_csv('./ml-20m/genome-scores.csv')

In [90]:
genome_tags = pd.read_csv('./ml-20m/genome-tags.csv')

In [91]:
links = pd.read_csv('./ml-20m/links.csv')

In [92]:
genome_scores.head()

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.025
1,1,2,0.025
2,1,3,0.05775
3,1,4,0.09675
4,1,5,0.14675


In [93]:
genome_tags.head()

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [94]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [95]:
genome_scores.tagId.value_counts()

1       10381
750     10381
756     10381
755     10381
754     10381
        ...  
383     10381
384     10381
385     10381
386     10381
1128    10381
Name: tagId, Length: 1128, dtype: int64

In [96]:
genome_scores.movieId.nunique()

10381

In [97]:
genome_scores.shape

(11709768, 3)
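The shape is consistent with a dense table in which every one of the 10,381 movies has a score for every one of the 1,128 tags. A small sketch of that arithmetic, plus a pivot of a toy frame into a movie-by-tag matrix (the toy values and the name `relevance_matrix` are illustrative assumptions):

```python
import pandas as pd

# Counts taken from the outputs above.
n_movies, n_tags = 10381, 1128
expected_rows = n_movies * n_tags
print(expected_rows)  # should match genome_scores.shape[0]

# Toy stand-in for genome_scores: 2 movies x 2 tags, one row per pair.
toy_scores = pd.DataFrame(
    {
        "movieId": [1, 1, 2, 2],
        "tagId": [1, 2, 1, 2],
        "relevance": [0.025, 0.05775, 0.1, 0.2],
    }
)

# One row per movie, one column per tag.
relevance_matrix = toy_scores.pivot(
    index="movieId", columns="tagId", values="relevance"
)
print(relevance_matrix)
```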

### 2. Explore the data

Let's look at statistics on the data.

#### What is the average rating?

In [98]:
ratings['rating'].mean()

3.5255285642993797

The average of all ratings is about 3.5.

#### How many ratings does each user give on average?

In [99]:
ratings.shape[0] / ratings['userId'].nunique()

144.4135299257002

On average, each user gives 144 movie ratings.
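The ratio above equals the mean of the per-user rating counts. Computing the counts explicitly also shows how they are distributed, which a single average hides. A minimal sketch on toy data (the frame `toy` is an assumption, not part of the dataset):

```python
import pandas as pd

# Toy ratings: user 1 rated 3 movies, user 2 rated 2, user 3 rated 1.
toy = pd.DataFrame(
    {
        "userId": [1, 1, 1, 2, 2, 3],
        "movieId": [10, 20, 30, 10, 40, 20],
    }
)

# Number of ratings per user.
per_user = toy.groupby("userId").size()

# Two equivalent ways to get the average number of ratings per user.
mean_via_groupby = per_user.mean()
mean_via_ratio = len(toy) / toy["userId"].nunique()

print(mean_via_groupby, mean_via_ratio)
print(per_user.describe())  # min, quartiles and max of the per-user counts
```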

#### Which movie has the most ratings?

In [100]:
ratings['movieId'].value_counts().head(1)

296    67310
Name: movieId, dtype: int64

In [101]:
movies[movies['movieId']==296]

Unnamed: 0,movieId,title,genres
293,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller


The movie "Pulp Fiction (1994)" is the most-rated movie, with 67,310 ratings.
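The two lookups above can also be combined so that the movie id does not have to be typed by hand. A sketch on toy frames, assuming the same column names as `ratings` and `movies`:

```python
import pandas as pd

toy_ratings = pd.DataFrame({"movieId": [296, 296, 1, 296, 1, 2]})
toy_movies = pd.DataFrame(
    {
        "movieId": [1, 2, 296],
        "title": ["Toy Story (1995)", "Jumanji (1995)", "Pulp Fiction (1994)"],
    }
)

# Count ratings per movie, then take the id with the highest count.
counts = toy_ratings["movieId"].value_counts()
top_id = counts.idxmax()

# Look up the title for that id.
top_title = toy_movies.set_index("movieId").loc[top_id, "title"]

print(top_id, top_title, counts.max())
```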

### 3. Data Wrangling and Export the Data

Let's merge the rating file with the movie file, so that each rating also carries the movie's title and genres.

In [102]:
# merge ratings with movie file. 
new_ratings = ratings.merge(movies, on='movieId', how='left')

In [103]:
# check the size of the rating file before and after the merge.
ratings.shape, new_ratings.shape

((20000263, 3), (20000263, 5))
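Comparing shapes confirms that the left merge did not add or drop rows. As an optional extra check, `merge` can itself verify that `movieId` is unique on the movie side and flag ratings with no matching movie. A sketch on toy frames (the toy data and variable names are assumptions):

```python
import pandas as pd

toy_ratings = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [1, 2, 1], "rating": [4.0, 3.5, 5.0]}
)
toy_movies = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
        "genres": ["Adventure|Animation", "Adventure|Children", "Comedy|Romance"],
    }
)

# validate='many_to_one' raises MergeError if movieId repeats in toy_movies.
# indicator=True adds a '_merge' column showing where each row came from.
merged = toy_ratings.merge(
    toy_movies, on="movieId", how="left", validate="many_to_one", indicator=True
)

# Ratings whose movieId was not found in the movie table.
unmatched = int((merged["_merge"] == "left_only").sum())

print(merged.shape, unmatched)
```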

Let's then think about reducing data size. As mentioned before, our target is to reduce its size to about 100,000 rows. In order to do that, let's consider the genres, the count on users, and the count on movies.

For the genres, we want to keep only the top 20 genre combinations.

In [104]:
# create a dataframe with the top 20 genres
most_rated_genres = pd.DataFrame(new_ratings.genres.value_counts(ascending=False).reset_index().head(20))

In [105]:
most_rated_genres

Unnamed: 0,index,genres
0,Drama,1467402
1,Comedy,1316161
2,Comedy|Romance,793252
3,Comedy|Drama,656474
4,Drama|Romance,644626
5,Comedy|Drama|Romance,615897
6,Crime|Drama,467417
7,Action|Adventure|Sci-Fi,441351
8,Action|Adventure|Thriller,313902
9,Action|Crime|Thriller,310685


In [106]:
most_rated_genres = most_rated_genres.rename(columns=({'index':'genres', 'genres':'count'}))

In [107]:
most_rated_genres

Unnamed: 0,genres,count
0,Drama,1467402
1,Comedy,1316161
2,Comedy|Romance,793252
3,Comedy|Drama,656474
4,Drama|Romance,644626
5,Comedy|Drama|Romance,615897
6,Crime|Drama,467417
7,Action|Adventure|Sci-Fi,441351
8,Action|Adventure|Thriller,313902
9,Action|Crime|Thriller,310685
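The rename step depends on how `reset_index()` names the columns of a `value_counts()` result. As far as we know, pandas 2.0 and later already name them `genres` and `count`, in which case renaming `'index'` and `'genres'` would not give the intended result. A sketch of a form that should produce the same two columns under either naming scheme, shown on toy data (`top_n` and the toy frame are assumptions):

```python
import pandas as pd

toy = pd.DataFrame(
    {"genres": ["Drama", "Comedy", "Drama", "Comedy|Romance", "Drama", "Comedy"]}
)
top_n = 2  # the notebook uses 20

# Name the index and the counts explicitly instead of renaming afterwards.
genre_counts = (
    toy["genres"]
    .value_counts()
    .rename_axis("genres")
    .reset_index(name="count")
    .head(top_n)
)

print(genre_counts)
```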


We will then select the ratings whose genres are among the top genres.

In [108]:
ratings_top_genres = new_ratings[new_ratings['genres'].isin(most_rated_genres['genres'])]

In [109]:
ratings_top_genres.shape, new_ratings.shape

((9252357, 5), (20000263, 5))

There are 9,252,357 rows in the new dataset. Let's then consider the count on users. We will select only users who give more than 100 ratings in the new dataset.

In [110]:
# count the ratings given by each user in the new rating data.
user_counts = pd.DataFrame(ratings_top_genres.userId.value_counts().reset_index())

In [111]:
most_rating_users = user_counts[user_counts['userId']>100]

In [112]:
most_rating_users = most_rating_users.rename(columns=({'index':'userId', 'userId':'count'}))

In [113]:
most_rating_users

Unnamed: 0,userId,count
0,118205,4985
1,8405,3843
2,8963,3266
3,82418,3233
4,121535,2855
...,...,...
23635,117640,101
23636,11709,101
23637,91894,101
23638,14687,101


In [114]:
# create the new rating file
new_ratings = ratings_top_genres[ratings_top_genres['userId'].isin(most_rating_users['userId'])]

In [115]:
# check the new and old data size
new_ratings.shape, ratings_top_genres.shape

((5486765, 5), (9252357, 5))
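The same filter can be written without building and renaming a separate count table. This is an alternative sketch, not the notebook's method: it attaches each user's rating count to every row with `groupby(...).transform` and keeps rows above the threshold. The toy data and the threshold `min_ratings = 2` are assumptions (the notebook uses 100).

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "userId": [1, 1, 1, 2, 3, 3],
        "movieId": [10, 20, 30, 10, 20, 30],
    }
)
min_ratings = 2  # hypothetical threshold for the toy data

# For every row, the number of ratings given by that row's user.
ratings_per_user = toy.groupby("userId")["movieId"].transform("count")

# Keep only rows from users with more than min_ratings ratings.
active = toy[ratings_per_user > min_ratings]

print(active)
```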

Finally, let's consider the count on movie ids in the new dataset. We will drop movies with 100 or fewer ratings.

In [116]:
movie_count = new_ratings['movieId'].value_counts()

In [117]:
movie_count.values

array([18386, 18377, 18336, ...,     1,     1,     1], dtype=int64)

In [118]:
most_rated_movies = movie_count[movie_count.values>100].index

In [119]:
most_rated_movies

Int64Index([  480,  2571,   318,   260,  1196,  2858,  1210,   110,   780,
              527,
            ...
            46959,  5215,  1519,  2545,  2061,   394,  6180, 62235,  6528,
             6467],
           dtype='int64', length=4078)

In [120]:
latest_ratings = new_ratings[new_ratings['movieId'].isin(most_rated_movies)]

In [121]:
latest_ratings.shape, new_ratings.shape

((5319088, 5), (5486765, 5))

In [122]:
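# randomly sample 10% of the remaining users to shrink the data further.
# no random_state is set, so each run draws a different sample.
final_users = most_rating_users.sample(frac=0.1)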

In [123]:
final_users.head()

Unnamed: 0,userId,count
4193,120840,316
3390,120118,353
14409,80559,148
22396,85166,105
17134,79859,130


In [124]:
final_ratings = latest_ratings[latest_ratings['userId'].isin(final_users['userId'])]

In [125]:
final_ratings.shape, latest_ratings.shape

((535404, 5), (5319088, 5))
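Because `sample` draws at random, rerunning the notebook gives a different set of users and a different final row count. Passing `random_state` makes the draw repeatable. A minimal sketch on a toy user table (the seed value 42 and the toy frame are arbitrary assumptions):

```python
import pandas as pd

# Toy stand-in for most_rating_users: 100 users with made-up counts.
users = pd.DataFrame({"userId": range(1, 101), "count": range(101, 201)})

# The same seed gives the same 10% sample on every run.
sample_a = users.sample(frac=0.1, random_state=42)
sample_b = users.sample(frac=0.1, random_state=42)

print(len(sample_a), sample_a.equals(sample_b))
```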

So "final_ratings" is our final dataset for rating prediction. It has 535,404 rows. Let's look at its details.

In [126]:
final_ratings.head()

Unnamed: 0,userId,movieId,rating,title,genres
2538,24,5,2.0,Father of the Bride Part II (1995),Comedy
2539,24,6,4.0,Heat (1995),Action|Crime|Thriller
2540,24,7,3.0,Sabrina (1995),Comedy|Romance
2541,24,10,3.0,GoldenEye (1995),Action|Adventure|Thriller
2542,24,16,5.0,Casino (1995),Crime|Drama


Let's then count the ratings per userId, movieId, and genres.

In [127]:
final_ratings['userId'].value_counts()

118205    3103
125794    2235
135425    1260
57735     1214
65401     1167
          ... 
7526        96
73097       95
55324       95
83638       94
36505       93
Name: userId, Length: 2364, dtype: int64

In [128]:
final_ratings['movieId'].value_counts()

480      1864
318      1864
2571     1856
260      1803
1196     1784
         ... 
6180        5
7227        5
61210       4
33815       4
3132        3
Name: movieId, Length: 4078, dtype: int64

In [129]:
final_ratings['genres'].value_counts()

Drama                               90527
Comedy                              80598
Comedy|Romance                      44903
Comedy|Drama                        41303
Comedy|Drama|Romance                37750
Drama|Romance                       37543
Crime|Drama                         25083
Action|Adventure|Sci-Fi             20774
Drama|Thriller                      17757
Action|Adventure|Thriller           16638
Action|Crime|Thriller               14811
Crime|Drama|Thriller                14153
Action|Adventure|Sci-Fi|Thriller    14037
Comedy|Crime                        13978
Drama|War                           13588
Action|Sci-Fi|Thriller              12019
Documentary                         11041
Action|Drama|War                    10609
Action|Crime|Drama|Thriller         10200
Thriller                             8092
Name: genres, dtype: int64

They all look good. Let's write the data to a new csv file.

In [130]:
file_name = './ml-20m/new_ratings.csv'
final_ratings.to_csv(file_name, index=False)
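A quick way to confirm the export is to read the file back and compare it with the frame that was written. The sketch below does this with a toy frame and a temporary directory, so it does not touch `./ml-20m/`; the toy rows are copied from the `final_ratings.head()` output above.

```python
import os
import tempfile
import pandas as pd

toy_final = pd.DataFrame(
    {
        "userId": [24, 24],
        "movieId": [5, 6],
        "rating": [2.0, 4.0],
        "title": ["Father of the Bride Part II (1995)", "Heat (1995)"],
        "genres": ["Comedy", "Action|Crime|Thriller"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "new_ratings.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    toy_final.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy_final.shape
same_columns = list(reloaded.columns) == list(toy_final.columns)

print(same_shape, same_columns)
```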