### Building a Recommendation System with MovieTweetings: Getting to Know the Data

In this lesson, you will work with the [MovieTweetings data](https://github.com/sidooms/MovieTweetings/tree/master/recsyschallenge2014). To start, you can read more about the project and the dataset in [this paper](http://crowdrec2013.noahlab.com.hk/papers/crowdrec2013_Dooms.pdf).

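**Note:** Click the orange Jupyter logo in the top-left corner of the notebook to get to the solution for each notebook. You can also watch my screencast on the page that follows each workbook to see me walk through the process.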

First, run the code below to load the libraries and the two datasets that are used throughout this lesson.

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t

%matplotlib inline

# Read in the datasets
movies = pd.read_csv('movies_raw.csv', index_col=0)
reviews = pd.read_csv('reviews_raw.csv', index_col=0)

#### 1. Take a Look at the Data

Take a look at the data and fill in the dictionary below. The questions are meant to check your understanding of the data.

In [14]:
movies.head()

Unnamed: 0,movie_id,movie,genre
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short
1,10,La sortie des usines Lumière (1895),Documentary|Short
2,12,The Arrival of a Train (1896),Documentary|Short
3,25,The Oxford and Cambridge University Boat Race ...,
4,91,Le manoir du diable (1896),Short|Horror


In [15]:
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,114508,8,1381006850
1,2,358273,9,1579057827
2,2,10039344,5,1578603053
3,2,6751668,9,1578955697
4,2,7131622,8,1579559244


In [16]:
movies.shape

(34996, 3)

In [17]:
reviews.shape

(846973, 4)

In [18]:
genres = []
for genre_str in movies['genre']:
    try:
        genres.extend(genre_str.split('|'))
    except AttributeError:
        # Missing genres are NaN (a float), which has no .split method
        pass
genres = set(genres)
print(len(genres))

28


In [19]:
len(np.unique(reviews.user_id))

65425

In [20]:
reviews['rating'].isnull().sum()

0

In [21]:
reviews['rating'].describe()

count    846973.000000
mean          7.315004
std           1.853283
min           0.000000
25%           6.000000
50%           8.000000
75%           9.000000
max          10.000000
Name: rating, dtype: float64

In [33]:
# Use your findings to match each variable to the correct statement in the dictionary


dict_sol1 = {
'The number of movies in the dataset': 34996, 
'The number of ratings in the dataset': 846973, 
'The number of different genres': 28, 
'The number of unique users in the dataset': 65425, 
'The number missing ratings in the reviews dataset': 0, 
'The average rating given across all ratings': 7.315, 
'The minimum rating given across all ratings': 0, 
'The maximum rating given across all ratings': 10
}

# Originally, I had this to check your solution, but the 
# links are live and updating.  That didn't end up being
# a great idea
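
The values in `dict_sol1` can also be derived programmatically instead of being copied by hand. The following is a minimal sketch on two tiny, made-up frames: `toy_movies` and `toy_reviews` are hypothetical stand-ins with the same columns as `movies` and `reviews`, and `summarize` is an illustrative helper, not part of the course's `tests` module.

```python
import pandas as pd

def summarize(movies_df, reviews_df):
    """Compute the summary statistics asked for in the dictionary above."""
    # Drop missing genres, split 'A|B' strings into lists, then flatten them
    genres = movies_df['genre'].dropna().str.split('|').explode()
    return {
        'n_movies': movies_df.shape[0],
        'n_ratings': reviews_df.shape[0],
        'n_genres': genres.nunique(),
        'n_users': reviews_df['user_id'].nunique(),
        'n_missing_ratings': int(reviews_df['rating'].isnull().sum()),
        'mean_rating': reviews_df['rating'].mean(),
        'min_rating': reviews_df['rating'].min(),
        'max_rating': reviews_df['rating'].max(),
    }

# Hypothetical toy data with the same columns as the real frames
toy_movies = pd.DataFrame({
    'movie_id': [8, 25, 91],
    'movie': ['Edison Kinetoscopic Record of a Sneeze (1894)',
              'A Title Without Genre (1895)',
              'Le manoir du diable (1896)'],
    'genre': ['Documentary|Short', None, 'Short|Horror'],
})
toy_reviews = pd.DataFrame({
    'user_id': [1, 2, 2],
    'movie_id': [8, 25, 91],
    'rating': [8, 9, 5],
})

summary = summarize(toy_movies, toy_reviews)
print(summary)
```

Calling `summarize(movies, reviews)` on the real frames should give numbers comparable to the cell outputs above, although the exact values depend on the snapshot of the data you downloaded.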


#### 2. Data Cleaning

Next, we need to extract some additional relevant information from the existing columns.

Each dataset needs a few cleaning steps:

#### Movies
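* Pull the date out of the title and create a new column
* Create dummy date columns of 1s and 0s for each century a movie belongs to (1800s, 1900s, and 2000s)
* Create dummy genre columns of 1s and 0s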

#### Reviews
* Create a date from the timestamp

You can run the cell below with the **show_clean_dataframes** function to check your results against the headers of my solution.

In [25]:
# The year is the four characters inside the trailing parentheses, e.g. 'Title (1894)'
create_date = lambda x: x[-5:-1] if x[-1] == ')' else np.nan
movies['date'] = movies['movie'].apply(create_date)
movies.head()

Unnamed: 0,movie_id,movie,genre,date
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895
2,12,The Arrival of a Train (1896),Documentary|Short,1896
3,25,The Oxford and Cambridge University Boat Race ...,,1895
4,91,Le manoir du diable (1896),Short|Horror,1896
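
The slicing approach above assumes that a title ending in `)` always ends with `(YYYY)`. As a hedged alternative, a regular expression can pull out a trailing four-digit year and leave `NaN` wherever the pattern is absent. The titles below are illustrative, and the last one is made up to show the missing case.

```python
import pandas as pd

titles = pd.Series([
    'Edison Kinetoscopic Record of a Sneeze (1894)',
    'Le manoir du diable (1896)',
    'Hypothetical Title With No Year',
])

# Capture four digits inside parentheses at the very end of the title.
# expand=False returns a Series instead of a one-column DataFrame.
years = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)
print(years.tolist())
```

As with the `date` column above, the extracted years are strings, so the century check on the first two characters still applies.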


In [39]:
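def get_century(val, yr):
    # Return 1 if the year starts with the two-digit prefix `yr`, else 0.
    # str() guards against missing dates (NaN), which cannot be sliced.
    return 1 if str(val)[0:2] == yr else 0

yrs = ['18', '19', '20']
for yr in yrs:
    movies[yr + "00's"] = movies['date'].apply(get_century, args=(yr,))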

In [40]:
movies.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0


In [49]:
genres = []
for item in movies['genre']:
    try:
        genres.extend(item.split('|'))
    except AttributeError:
        pass
genres = set(genres)
def get_genre(val, genre):
    # Return 1 if the pipe-separated genre string contains `genre`, else 0
    if isinstance(val, str):
        return 1 if genre in val.split('|') else 0
    return 0  # missing genre (NaN)

for genre in genres:
    movies[str(genre)] = movies['genre'].apply(get_genre, args=(genre,))

In [56]:
movies.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,Adult,Action,Crime,...,Horror,Mystery,Western,Animation,Talk-Show,Biography,Thriller,Short,Fantasy,Sport
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
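
For comparison, pandas can build the same kind of 0/1 genre columns in a single call with `Series.str.get_dummies`. This is a sketch of an alternative on a hypothetical three-row frame (`toy`), not the approach used in the solution above.

```python
import pandas as pd

# Hypothetical toy frame; the middle row has a missing genre
toy = pd.DataFrame({
    'movie': ['Edison Kinetoscopic Record of a Sneeze (1894)',
              'Hypothetical Title (1895)',
              'Le manoir du diable (1896)'],
    'genre': ['Documentary|Short', None, 'Short|Horror'],
})

# One 0/1 column per genre; rows with a missing genre get all zeros
genre_dummies = toy['genre'].str.get_dummies(sep='|')
toy = pd.concat([toy, genre_dummies], axis=1)
print(toy.drop(columns='movie'))
```

The resulting columns come out in sorted order, whereas iterating over a `set` of genres, as in the loop above, gives no guaranteed column order.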


In [106]:
import datetime

change_timestamp = lambda val: datetime.datetime.fromtimestamp(int(val)).strftime('%Y-%m-%d %H:%M:%S')
reviews['date_time'] = reviews['timestamp'].apply(change_timestamp)

In [107]:
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,date_time
0,1,114508,8,1381006850,2013-10-06 05:00:50
1,2,358273,9,1579057827,2020-01-15 11:10:27
2,2,10039344,5,1578603053,2020-01-10 04:50:53
3,2,6751668,9,1578955697,2020-01-14 06:48:17
4,2,7131622,8,1579559244,2020-01-21 06:27:24
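
Note that `datetime.datetime.fromtimestamp` converts to the local time zone of the machine running the notebook, so the strings above can differ between machines (the output shown is consistent with a UTC+8 clock). As a sketch of a time-zone-independent alternative, `pd.to_datetime` with `unit='s'` interprets the values as seconds since the Unix epoch in UTC. The frame `toy_reviews` and the column name `date_time_utc` are illustrative choices.

```python
import pandas as pd

# Two timestamps taken from the first rows of the reviews output above
toy_reviews = pd.DataFrame({'timestamp': [1381006850, 1579057827]})

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC
toy_reviews['date_time_utc'] = pd.to_datetime(toy_reviews['timestamp'], unit='s')
print(toy_reviews)
```

This also yields a proper `datetime64` column rather than strings, which makes later filtering or sorting by date more direct.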
