### Recommendations with MovieTweetings: Getting to Know The Data

Throughout this lesson, you will be working with the [MovieTweetings Data](https://github.com/sidooms/MovieTweetings/tree/master/recsyschallenge2014).  To get started, you can read more about this project and the dataset from the [publication here](http://crowdrec2013.noahlab.com.hk/papers/crowdrec2013_Dooms.pdf).

**Note:** Solutions to each notebook are available by clicking the orange Jupyter logo in the top left of this notebook. You can also watch me work through the solutions in the screencasts that follow each workbook.

To get started, import the libraries and read in the two datasets you will use throughout the lesson with the code below.

 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t

from sklearn.preprocessing import OneHotEncoder

%matplotlib inline

In [2]:
# Read in the datasets
movies = pd.read_csv('original_movies.dat',
                     delimiter='::',
                     header=None,
                     names=['movie_id', 'movie', 'genre'],
                     dtype={'movie_id': object}, engine='python')

reviews = pd.read_csv('original_ratings.dat',
                      delimiter='::',
                      header=None,
                      names=['user_id', 'movie_id', 'rating', 'timestamp'],
                      dtype={'movie_id': object, 'user_id': object, 'timestamp': object},
                      engine='python')

# Reduce the size of the reviews dataset
# (.loc slices by label and includes the end label, so this keeps rows 0..100000)
reviews = reviews.loc[:100000, :].copy()
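
One detail worth checking in the cell above: label-based slicing with `.loc` includes the end label, so slicing up to `100000` keeps 100,001 rows, which matches the `count` that `reviews.describe()` reports later in this notebook. Here is a minimal sketch on a toy frame (the name `df` and its values are made up for illustration):

```python
import pandas as pd

# Toy frame with the default RangeIndex 0..9 (illustrative data only)
df = pd.DataFrame({"rating": range(10)})

by_label = df.loc[:4, :]      # label slice: the end label 4 is included
by_position = df.iloc[:4, :]  # positional slice: the end position 4 is excluded

print(len(by_label), len(by_position))  # 5 4
```

If exactly 100,000 rows are wanted, `.iloc[:100000]` would be one way to get them.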

#### 1. Take a Look At The Data 

Take a look at the data and use your findings to fill in the dictionary below with the correct responses to show your understanding of the data.

In [3]:
# Use this cell to find the following information:
# number of movies
# number of ratings
# number of different genres
# number of unique users
# number of missing ratings
# the `average`, `min`, and `max` ratings given
movies.head()

Unnamed: 0,movie_id,movie,genre
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short
1,10,La sortie des usines Lumière (1895),Documentary|Short
2,12,The Arrival of a Train (1896),Documentary|Short
3,25,The Oxford and Cambridge University Boat Race ...,
4,91,Le manoir du diable (1896),Short|Horror


In [4]:
movies.describe()

Unnamed: 0,movie_id,movie,genre
count,35479,35479,35230
unique,35479,35416,2736
top,8,Home (2016),Drama
freq,1,4,3602


In [5]:
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,114508,8,1381006850
1,2,208092,5,1586466072
2,2,358273,9,1579057827
3,2,10039344,5,1578603053
4,2,6751668,9,1578955697


In [6]:
reviews.describe()

Unnamed: 0,rating
count,100001.0
mean,7.397666
std,1.833944
min,0.0
25%,6.0
50%,8.0
75%,9.0
max,10.0


In [7]:
# number of different genres
# first, split str in genres by |
# the result looks like
    # 0: [Documentary, Short]
genres = movies.genre[movies.genre.notna()].str.split('|')
# second, transform each element of a list-like to a row
# the result looks like:
    # 0: Documentary
    # 0: Short
genres = genres.explode()
# third, get unique values
genres = genres.unique()
# fourth, count
genres.size
# in summary: genres = movies.genre[movies.genre.notna()].str.split('|').explode().unique().size

28
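
The same split, explode, and count steps can be tried on a small series to see how missing genres and repeated genres are handled. This is only a sketch; the series `genre` below is toy data, not the real `movies.genre` column:

```python
import pandas as pd

# Toy genre column in the same pipe-delimited format (values are illustrative)
genre = pd.Series(["Documentary|Short", None, "Short|Horror", "Drama"])

# drop missing -> split into lists -> one row per genre -> count distinct values
n_genres = genre.dropna().str.split("|").explode().nunique()

print(n_genres)  # 4
```

`nunique()` is shorthand for `unique().size` here, since the missing values were already dropped.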

In [8]:
# Use your findings to match each variable to the correct statement in the dictionary
dict_sol1 = {
'The number of movies in the dataset': movies.movie_id.size, 
'The number of ratings in the dataset': reviews.rating.size,
'The number of different genres': movies.genre[movies.genre.notna()].str.split('|').explode().unique().size, 
'The number of unique users in the dataset': reviews.user_id.unique().size, 
'The number missing ratings in the reviews dataset': reviews.rating.isna().sum(), 
'The average rating given across all ratings': int(reviews.rating.mean()),
'The minimum rating given across all ratings': reviews.rating.min(),
'The maximum rating given across all ratings': reviews.rating.max()
}

# Check your solution
t.q1_check(dict_sol1)

That looks good to me!
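
The dictionary above wraps the mean in `int()`, which truncates rather than rounds. A small sketch with made-up ratings shows the difference, along with the other summary calls used above (`toy_reviews` is a hypothetical frame, not the real data):

```python
import pandas as pd

# Hypothetical reviews with a mean of 7.75
toy_reviews = pd.DataFrame({
    "user_id": ["1", "2", "2", "3"],
    "rating": [8, 5, 9, 9],
})

mean_rating = toy_reviews["rating"].mean()         # 7.75
truncated = int(mean_rating)                       # int() drops the fraction
rounded = round(mean_rating)                       # round() goes to the nearest integer
n_users = toy_reviews["user_id"].nunique()         # distinct users
n_missing = int(toy_reviews["rating"].isna().sum())

print(truncated, rounded, n_users, n_missing)  # 7 8 3 0
```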


#### 2. Data Cleaning

Next, we need to pull some additional relevant information out of the existing columns. 

For each of the datasets, there are a couple of cleaning steps we need to take care of:

#### Movies
* Pull the date from the title and create a new column
* Dummy the date column with 1's and 0's for each century of a movie (1800's, 1900's, and 2000's)
* Dummy the genre column with 1's and 0's for each genre

#### Reviews
* Create a date out of the timestamp

You can check your results against the header of my solution by running the cell below with the **show_clean_dataframes** function.

In [9]:
movies.head()

Unnamed: 0,movie_id,movie,genre
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short
1,10,La sortie des usines Lumière (1895),Documentary|Short
2,12,The Arrival of a Train (1896),Documentary|Short
3,25,The Oxford and Cambridge University Boat Race ...,
4,91,Le manoir du diable (1896),Short|Horror


In [10]:
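# title = movies.movie.str.extract(r'^(.*?)\s*\(\d{4}\)$', expand=False)
# raw strings keep the regex escapes (\(, \d) from being read as string escapes
year = movies.movie.str.extract(r'^.*\((\d{4})\)$', expand=False)
# first two digits + "00's"; the .str accessor leaves missing years as NaN
century = year.str[:-2] + "00's"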
movies['date'] = year
movies['century'] = century

movies.head()

Unnamed: 0,movie_id,movie,genre,date,century
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1800's
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1800's
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1800's
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1800's
4,91,Le manoir du diable (1896),Short|Horror,1896,1800's
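
To see what the year extraction does with a title that has no trailing year, here is a minimal sketch on toy titles. The third title is a hypothetical edge case, and the shorter pattern is an assumption that should behave like the one above for titles ending in `(YYYY)`:

```python
import pandas as pd

titles = pd.Series([
    "Le manoir du diable (1896)",
    "Home (2016)",
    "A Title Without A Year",  # hypothetical edge case
])

# Capture the four digits inside the final parentheses; no match gives NaN
year = titles.str.extract(r"\((\d{4})\)$", expand=False)

# Drop the last two digits and append "00's"; NaN stays NaN with the .str accessor
century = year.str[:-2] + "00's"

print(year.tolist())
print(century.tolist())
```

Using `year.apply(lambda x: x[:-2] + "00's")` instead would raise a `TypeError` on the missing value, because `NaN` is a float and cannot be sliced.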


In [11]:
# One-hot Encoding century
# create OneHotEncoder object, output int instead of float
encoder = OneHotEncoder(dtype=int)

# fit and transform data
encoded_data = encoder.fit_transform(movies[['century']]).toarray()

# create new dataframe with encoded data
_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['century']))
# remove prefix century_
_df.columns = map(lambda x: x.split('century_')[1], _df.columns)

# concat movies and _df
movies = pd.concat([movies, _df], axis=1)
# drop col century
movies.drop(columns=['century'], inplace=True)

movies.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0
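
As an alternative to `OneHotEncoder`, `pd.get_dummies` can produce the same kind of century columns without the prefix-stripping step. This is a sketch on a toy frame, not the approach taken in the solution notebook:

```python
import pandas as pd

# Toy frame with a century column like the one built above
toy = pd.DataFrame({"century": ["1800's", "1900's", "2000's", "1900's"]})

# One 0/1 column per distinct century, named after the value itself
century_dummies = pd.get_dummies(toy["century"], dtype=int)

# Replace the original column with its dummy columns
toy = pd.concat([toy.drop(columns="century"), century_dummies], axis=1)

print(toy)
```

`OneHotEncoder` is the better fit when the same encoding must be reapplied to new data later, since the fitted encoder remembers the categories.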


In [12]:
# One-hot Encoding genre
genre_dummy = movies['genre'].str.get_dummies(sep='|')
movies = pd.concat([movies, genre_dummy], axis=1)

movies.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,Action,Adult,Adventure,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
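
The row with a missing genre above ends up with 0 in every genre column. A small sketch with toy values shows how `str.get_dummies` treats a missing entry (the series `genre` is illustrative only):

```python
import pandas as pd

# Toy genre column with one missing value
genre = pd.Series(["Documentary|Short", None, "Short|Horror"])

# One 0/1 column per genre; columns come back in sorted order
genre_dummy = genre.str.get_dummies(sep="|")

print(genre_dummy)
```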


In [13]:
# save cleaned movies to data/processed/movies_clean.csv
movies.to_csv('data/processed/movies_clean.csv', index=False)

In [14]:
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,114508,8,1381006850
1,2,208092,5,1586466072
2,2,358273,9,1579057827
3,2,10039344,5,1578603053
4,2,6751668,9,1578955697


In [15]:
# timestamp is str, we need to transform it to int
reviews.dtypes

user_id      object
movie_id     object
rating        int64
timestamp    object
dtype: object

In [16]:
reviews['date'] = pd.to_datetime(reviews['timestamp'].astype(int), unit='s')

reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,date
0,1,114508,8,1381006850,2013-10-05 21:00:50
1,2,208092,5,1586466072,2020-04-09 21:01:12
2,2,358273,9,1579057827,2020-01-15 03:10:27
3,2,10039344,5,1578603053,2020-01-09 20:50:53
4,2,6751668,9,1578955697,2020-01-13 22:48:17
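
The conversion treats each timestamp as seconds since the Unix epoch and returns timezone-naive values in UTC. A minimal sketch using the first two timestamps from the table above (the series name `timestamp` is just for illustration):

```python
import pandas as pd

# Timestamps stored as strings, as in the reviews data
timestamp = pd.Series(["1381006850", "1586466072"])

# Cast to int first, then interpret the values as seconds since 1970-01-01 UTC
date = pd.to_datetime(timestamp.astype(int), unit="s")

print(date.iloc[0])  # 2013-10-05 21:00:50
```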


In [17]:
# save cleaned reviews to data/processed/reviews_clean.csv
reviews.to_csv('data/processed/reviews_clean.csv', index=False)
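
Passing `index=False` keeps the row index out of the file. Without it, reading the CSV back yields an extra `Unnamed: 0` column. That may be where the `Unnamed: 0` column in the frames returned by `t.show_clean_dataframes()` comes from, although this is an assumption since that function's source is not shown here. A sketch using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2], "rating": [8, 5]})

# Default index=True writes the index as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'user_id', 'rating']
print(cols_without_index)  # ['user_id', 'rating']
```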

In [18]:
reviews_new, movies_new = t.show_clean_dataframes()

   Unnamed: 0  user_id  movie_id  rating   timestamp                 date
0           0        1    114508       8  1381006850  2013-10-05 21:00:50
1           1        2    208092       5  1586466072  2020-04-09 21:01:12
2           2        2    358273       9  1579057827  2020-01-15 03:10:27
3           3        2  10039344       5  1578603053  2020-01-09 20:50:53
4           4        2   6751668       9  1578955697  2020-01-13 22:48:17
   Unnamed: 0  movie_id                                              movie   
0           0         8      Edison Kinetoscopic Record of a Sneeze (1894)  \
1           1        10                La sortie des usines Lumière (1895)   
2           2        12                      The Arrival of a Train (1896)   
3           3        25  The Oxford and Cambridge University Boat Race ...   
4           4        91                         Le manoir du diable (1896)   

               genre  date  1800's  1900's  2000's  Documentary  Adventure   
0  Docume...

In [19]:
movies_new

Unnamed: 0.1,Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,Documentary,Adventure,...,War,Romance,Musical,Film-Noir,Sport,Mystery,Western,Horror,Comedy,Music
0,0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35474,35474,12036052,Elephant (2020),Documentary|Adventure|Biography|Family,2020,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
35475,35475,12058882,Coachella: 20 Years in the Desert (2020),Documentary|Music,2020,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,1
35476,35476,12078990,Money Heist: The Phenomenon (2020),Documentary,2020,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
35477,35477,12131262,Out of Shadows (2020),Documentary,2020,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
reviews_new

Unnamed: 0.1,Unnamed: 0,user_id,movie_id,rating,timestamp,date
0,0,1,114508,8,1381006850,2013-10-05 21:00:50
1,1,2,208092,5,1586466072,2020-04-09 21:01:12
2,2,2,358273,9,1579057827,2020-01-15 03:10:27
3,3,2,10039344,5,1578603053,2020-01-09 20:50:53
4,4,2,6751668,9,1578955697,2020-01-13 22:48:17
...,...,...,...,...,...,...
99996,99996,8022,40746,9,1585954942,2020-04-03 23:02:22
99997,99997,8022,41959,9,1586569384,2020-04-11 01:43:04
99998,99998,8022,43014,9,1587085691,2020-04-17 01:08:11
99999,99999,8022,44079,9,1586738312,2020-04-13 00:38:32
