### Content-based recommendations ###

We solve a regression problem (predicting a movie's rating) on the Movies dataset, using the following features:

* TF-IDF on tags
* TF-IDF on genres

The target variable is the movie's mean rating (mean).

Quality is measured by RMSE on the held-out test set.
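
For reference, RMSE is the square root of the mean squared difference between the true and the predicted ratings. A minimal sketch on made-up numbers (the arrays below are illustrative and not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted mean ratings for four movies
y_true = np.array([4.0, 3.0, 2.0, 5.0])
y_hat = np.array([3.0, 3.0, 4.0, 5.0])

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# The same quantity via scikit-learn's mean_squared_error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

RMSE is expressed in the same units as the rating itself, so a value of 1.0 means predictions are off by roughly one rating point on average.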

### Loading and inspecting the data ###

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from tqdm import tqdm_notebook
%matplotlib inline

In [2]:
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')

In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


### Preparing the dataset ###

To build the content-based recommender system (CBRS), we assemble the dataset as a single table with the following features:

* movieId
* title
* genres_formatted - all genres of the movie as a string of lowercase words separated by spaces
* tags_formatted - all tags that any user assigned to the movie, joined into a string of lowercase words separated by spaces
* rating_mean - the movie's mean rating across all users
* rating_median - the median of the movie's ratings across all users (the 50th percentile)
* rating_variance - the variance of the movie's ratings, i.e. how far users' ratings spread around the mean

In [6]:
x = movies.copy()

Convert the genres into space-separated words

In [7]:
def change_string(s):
    return ' '.join(s.lower().replace(' ', '').replace('-', '').split('|'))
x['genres_formatted'] = x.genres.apply(change_string)
x.drop(columns=['genres'], inplace=True)

In [8]:
x.head()

Unnamed: 0,movieId,title,genres_formatted
0,1,Toy Story (1995),adventure animation children comedy fantasy
1,2,Jumanji (1995),adventure children fantasy
2,3,Grumpier Old Men (1995),comedy romance
3,4,Waiting to Exhale (1995),comedy drama romance
4,5,Father of the Bride Part II (1995),comedy
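
The spaces and hyphens inside a genre name are removed before splitting on `|`, presumably so that each genre stays a single token: `CountVectorizer`'s default tokenizer splits on punctuation. A small sketch of that effect (the input strings are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer


def change_string(s):
    # Lowercase, drop spaces and hyphens inside genre names, then turn '|' into spaces
    return ' '.join(s.lower().replace(' ', '').replace('-', '').split('|'))


# Hyphenated genres collapse into single words
example = change_string('Sci-Fi|Film-Noir|Action')
print(example)

# Default CountVectorizer tokenization, with and without the hyphen
analyzer = CountVectorizer().build_analyzer()
tokens_with_hyphen = analyzer('sci-fi')
tokens_formatted = analyzer(example)
print(tokens_with_hyphen)
print(tokens_formatted)
```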


---
Adding tags

---

In [9]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [10]:
tags_grouped = tags.groupby('movieId')
tags_preprocessed = tags_grouped.tag.apply(lambda g: ' '.join(list(g)).lower())
tags_preprocessed_df = pd.DataFrame()
tags_preprocessed_df['movieId'] = tags_preprocessed.index
tags_preprocessed_df['tags_formatted'] = np.array(tags_preprocessed)
tags_preprocessed_df.head()

Unnamed: 0,movieId,tags_formatted
0,1,pixar pixar fun
1,2,fantasy magic board game robin williams game
2,3,moldy old
3,5,pregnancy remake
4,7,remake
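
The same aggregation can be written as one chained expression. This is only a sketch on a toy frame standing in for `tags.csv` (the values are illustrative); the `astype(str)` call is an added safeguard against non-string tags and is not part of the original cell:

```python
import pandas as pd

# Toy stand-in for tags.csv
tags_toy = pd.DataFrame({
    'movieId': [1, 1, 2, 2],
    'tag': ['pixar', 'Fun', 'Robin Williams', 'game'],
})

# Lowercase every tag, join all tags of a movie into one string,
# and keep movieId as a regular column
tags_toy_df = (
    tags_toy.assign(tag=tags_toy['tag'].astype(str).str.lower())
            .groupby('movieId')['tag']
            .agg(' '.join)
            .reset_index()
            .rename(columns={'tag': 'tags_formatted'})
)
print(tags_toy_df)
```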


In [11]:
x = pd.merge(x, tags_preprocessed_df, on='movieId', how='outer')
x.head()

Unnamed: 0,movieId,title,genres_formatted,tags_formatted
0,1,Toy Story (1995),adventure animation children comedy fantasy,pixar pixar fun
1,2,Jumanji (1995),adventure children fantasy,fantasy magic board game robin williams game
2,3,Grumpier Old Men (1995),comedy romance,moldy old
3,4,Waiting to Exhale (1995),comedy drama romance,
4,5,Father of the Bride Part II (1995),comedy,pregnancy remake


In [12]:
# replace missing tags with empty strings
x.tags_formatted = x.tags_formatted.fillna('')
x.head()

Unnamed: 0,movieId,title,genres_formatted,tags_formatted
0,1,Toy Story (1995),adventure animation children comedy fantasy,pixar pixar fun
1,2,Jumanji (1995),adventure children fantasy,fantasy magic board game robin williams game
2,3,Grumpier Old Men (1995),comedy romance,moldy old
3,4,Waiting to Exhale (1995),comedy drama romance,
4,5,Father of the Bride Part II (1995),comedy,pregnancy remake


In [13]:
x.isnull().any().any()

False

---
Adding the rating statistics: mean, median, variance (only rating_mean is actually computed in the cells below)

---

In [14]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [15]:
ratings_by_movies = ratings.groupby('movieId')

In [16]:
rmean = ratings_by_movies.mean()[['rating']]
rmean.rename(columns={'rating': 'rating_mean'}, inplace=True)
rmean.head()

Unnamed: 0_level_0,rating_mean
movieId,Unnamed: 1_level_1
1,3.92093
2,3.431818
3,3.259615
4,2.357143
5,3.071429
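
The feature list also names `rating_median` and `rating_variance`, which this notebook never builds. A possible way to compute all three statistics in one pass, sketched on a toy frame (the data is illustrative, and `rating_stats` is a hypothetical name):

```python
import pandas as pd

# Toy stand-in for ratings.csv
ratings_toy = pd.DataFrame({
    'movieId': [1, 1, 1, 2, 2],
    'rating': [4.0, 5.0, 3.0, 2.0, 4.0],
})

# Named aggregation: one output column per statistic
rating_stats = (
    ratings_toy.groupby('movieId')['rating']
               .agg(rating_mean='mean', rating_median='median', rating_variance='var')
               .reset_index()
)
print(rating_stats)
```

Note that pandas computes the sample variance (`ddof=1`), so a movie with a single rating gets `NaN` in `rating_variance`.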


In [17]:
x = pd.merge(x, rmean, on='movieId', how='outer')
x.head()

Unnamed: 0,movieId,title,genres_formatted,tags_formatted,rating_mean
0,1,Toy Story (1995),adventure animation children comedy fantasy,pixar pixar fun,3.92093
1,2,Jumanji (1995),adventure children fantasy,fantasy magic board game robin williams game,3.431818
2,3,Grumpier Old Men (1995),comedy romance,moldy old,3.259615
3,4,Waiting to Exhale (1995),comedy drama romance,,2.357143
4,5,Father of the Bride Part II (1995),comedy,pregnancy remake,3.071429


In [18]:
x.isnull().any()

movieId             False
title               False
genres_formatted    False
tags_formatted      False
rating_mean          True
dtype: bool

In [19]:
# replace missing values (movies with no ratings) with 0
x.rating_mean = x.rating_mean.fillna(0)
x.head()

Unnamed: 0,movieId,title,genres_formatted,tags_formatted,rating_mean
0,1,Toy Story (1995),adventure animation children comedy fantasy,pixar pixar fun,3.92093
1,2,Jumanji (1995),adventure children fantasy,fantasy magic board game robin williams game,3.431818
2,3,Grumpier Old Men (1995),comedy romance,moldy old,3.259615
3,4,Waiting to Exhale (1995),comedy drama romance,,2.357143
4,5,Father of the Bride Part II (1995),comedy,pregnancy remake,3.071429


In [20]:
x.isnull().any()

movieId             False
title               False
genres_formatted    False
tags_formatted      False
rating_mean         False
dtype: bool
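
Filling the missing target with 0 puts a value into `rating_mean` that no user ever gave, which can distort both training and RMSE. One alternative worth considering (not what this notebook does) is to drop unrated movies instead. A sketch on a toy frame with illustrative values:

```python
import numpy as np
import pandas as pd

# Toy table: movie 2 has no ratings
movies_toy = pd.DataFrame({
    'movieId': [1, 2, 3],
    'rating_mean': [3.9, np.nan, 2.4],
})

# Keep only movies that actually have a rating, so 0 never enters the target
rated_only = movies_toy.dropna(subset=['rating_mean']).reset_index(drop=True)
print(rated_only)
```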

In [21]:
# Separate the target variable (rating_mean) as a 1-D Series,
# which is the shape scikit-learn estimators expect for y
y = x['rating_mean']
x.drop(columns=['rating_mean'], inplace=True)

In [22]:
x.head()

Unnamed: 0,movieId,title,genres_formatted,tags_formatted
0,1,Toy Story (1995),adventure animation children comedy fantasy,pixar pixar fun
1,2,Jumanji (1995),adventure children fantasy,fantasy magic board game robin williams game
2,3,Grumpier Old Men (1995),comedy romance,moldy old
3,4,Waiting to Exhale (1995),comedy drama romance,
4,5,Father of the Bride Part II (1995),comedy,pregnancy remake


### Splitting off the validation set ###

In [23]:
x.shape

(9742, 4)

In [24]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15)
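
The split above has no `random_state`, so every run produces a different partition and slightly different RMSE values. A sketch of a reproducible split on toy arrays (the seed 42 is an arbitrary choice, and the reported numbers in this notebook were not produced with it):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with the same seed return identical partitions
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```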

### Preprocessing the data ###

In [25]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
cv_genres = CountVectorizer()
tf_genres = TfidfTransformer()
cv_tags = CountVectorizer()
tf_tags = TfidfTransformer()

In [26]:
# preprocess the text fields (vectorizers are fitted on the training split only)
genres_train_cv = cv_genres.fit_transform(x_train.genres_formatted)
genres_train_tf = tf_genres.fit_transform(genres_train_cv)
tags_train_cv = cv_tags.fit_transform(x_train.tags_formatted)
tags_train_tf = tf_tags.fit_transform(tags_train_cv)
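
`CountVectorizer` followed by `TfidfTransformer` can be replaced by a single `TfidfVectorizer`, which combines the two steps. A sketch comparing both routes on a few illustrative genre strings:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

docs_train = ['comedy romance', 'adventure children fantasy', 'comedy']
docs_test = ['comedy drama romance']  # 'drama' is unseen in training and is ignored

# Two-step route, as in the cell above
cv = CountVectorizer()
tf = TfidfTransformer()
two_step_train = tf.fit_transform(cv.fit_transform(docs_train))
two_step_test = tf.transform(cv.transform(docs_test))

# One-step route
tv = TfidfVectorizer()
one_step_train = tv.fit_transform(docs_train)
one_step_test = tv.transform(docs_test)

train_match = np.allclose(two_step_train.toarray(), one_step_train.toarray())
test_match = np.allclose(two_step_test.toarray(), one_step_test.toarray())
print(train_match, test_match)
```

In both routes the vocabulary and IDF weights come from the training documents only, so words that appear only in the test split contribute nothing to its vectors.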

### Training the models (KNeighborsRegressor): one for genres, one for tags ###

In [27]:
from sklearn.neighbors import KNeighborsRegressor
kr_genres = KNeighborsRegressor(n_neighbors=7, n_jobs=-1, metric='euclidean')
kr_genres.fit(genres_train_tf, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='euclidean',
          metric_params=None, n_jobs=-1, n_neighbors=7, p=2,
          weights='uniform')

In [28]:
kr_tags = KNeighborsRegressor(n_neighbors=7, n_jobs=-1, metric='euclidean')
kr_tags.fit(tags_train_tf, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='euclidean',
          metric_params=None, n_jobs=-1, n_neighbors=7, p=2,
          weights='uniform')
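
With `weights='uniform'`, a `KNeighborsRegressor` predicts the plain average of the targets of the nearest training points. A toy illustration with one feature and two neighbours (all values are made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Four training points on a line, each with a target value
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([1.0, 3.0, 5.0, 4.0])

knn = KNeighborsRegressor(n_neighbors=2, metric='euclidean')
knn.fit(X_toy, y_toy)

# The two nearest neighbours of 0.4 are the points at 0.0 and 1.0,
# so the prediction is the mean of their targets
pred = knn.predict(np.array([[0.4]]))[0]
print(pred)
```

In the notebook the same averaging happens over the 7 movies whose TF-IDF vectors are closest to the query movie.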

### Computing RMSE for the models on the validation set ###

In [29]:
genres_test_cv = cv_genres.transform(x_test.genres_formatted)
genres_test_tf = tf_genres.transform(genres_test_cv)
tags_test_cv = cv_tags.transform(x_test.tags_formatted)
tags_test_tf = tf_tags.transform(tags_test_cv)

In [30]:
y_pred_genres = kr_genres.predict(genres_test_tf)
y_pred_tags = kr_tags.predict(tags_test_tf)

In [31]:
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, y_pred_genres))

0.923900356738036

In [32]:
np.sqrt(mean_squared_error(y_test, y_pred_tags))

0.9526650313112206

### Trying other models (for genres) ###

**Linear regression**

In [33]:
from sklearn.linear_model import LinearRegression
lr_genres = LinearRegression(n_jobs=-1)
lr_genres.fit(genres_train_tf, y_train)
np.sqrt(mean_squared_error(y_test, lr_genres.predict(genres_test_tf)))

0.8284395874821902

**RandomForestRegressor**

In [34]:
from sklearn.ensemble import RandomForestRegressor
rf_genres = RandomForestRegressor(n_jobs=-1)
rf_genres.fit(genres_train_tf, y_train)
np.sqrt(mean_squared_error(y_test, rf_genres.predict(genres_test_tf)))

0.841068227452273

**SVR**

In [35]:
from sklearn.svm import SVR
svr_genres = SVR()
svr_genres.fit(genres_train_tf, y_train)
np.sqrt(mean_squared_error(y_test, svr_genres.predict(genres_test_tf)))

0.8321525918352745

**Lasso**

In [36]:
from sklearn.linear_model import Lasso
ls_genres = Lasso()
ls_genres.fit(genres_train_tf, y_train)
np.sqrt(mean_squared_error(y_test, ls_genres.predict(genres_test_tf)))

0.869836214126563
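
The RMSE values above are easier to interpret next to a trivial baseline that always predicts the mean training rating. A sketch with `DummyRegressor` on toy targets (all arrays are illustrative; in the notebook one would fit on `genres_train_tf`, `y_train` and score on `genres_test_tf`, `y_test`):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Toy targets; the features are ignored by the dummy model
y_train_toy = np.array([3.0, 4.0, 2.0, 3.0])
y_test_toy = np.array([4.0, 2.0])
X_train_toy = np.zeros((4, 1))
X_test_toy = np.zeros((2, 1))

# Always predicts the mean of the training targets
baseline = DummyRegressor(strategy='mean').fit(X_train_toy, y_train_toy)
baseline_rmse = np.sqrt(
    mean_squared_error(y_test_toy, baseline.predict(X_test_toy))
)
print(baseline_rmse)
```

A model is only useful to the extent that its RMSE falls below this baseline.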

### Conclusions ###

Prediction quality is low: RMSE on the held-out set ranges from about 0.83 to 0.95, depending on the model.

Of the models tried, LinearRegression has the lowest RMSE (0.828), narrowly ahead of SVR (0.832).

Possible ways to improve quality:

- more data
- hyperparameter tuning with GridSearchCV or RandomizedSearchCV
- ensembling the models
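
The hyperparameter tuning mentioned above could look roughly like the sketch below. It runs on a tiny made-up corpus; the grid values, the 3-fold setting, and the choice of tuning KNN are assumptions for illustration, not settings used in this notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

# Illustrative genre strings and mean ratings
texts = [
    'comedy romance', 'comedy', 'comedy drama', 'drama romance',
    'adventure fantasy', 'adventure children', 'children fantasy', 'action thriller',
    'action adventure', 'thriller drama', 'comedy children', 'fantasy romance',
]
targets = [3.2, 3.0, 3.4, 3.6, 3.5, 3.1, 3.3, 2.9, 3.0, 3.7, 3.1, 3.4]

# Vectorizer inside the pipeline: TF-IDF is re-fitted on each training fold
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('knn', KNeighborsRegressor()),
])

param_grid = {
    'knn__n_neighbors': [1, 3, 5],
    'knn__weights': ['uniform', 'distance'],
}

# scikit-learn maximizes scores, hence the negated RMSE
search = GridSearchCV(pipe, param_grid, scoring='neg_root_mean_squared_error', cv=3)
search.fit(texts, targets)

best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```

Keeping the vectorizer inside the `Pipeline` prevents vocabulary and IDF statistics from leaking from the validation folds into training.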