### Homework for the topic:   
**Content-based recommendations**

1. Use the [MovieLens](https://grouplens.org/datasets/movielens/latest/) dataset.  
2. Build recommendations (regression, predicting the rating) on these features:  
 TF-IDF on tags and genres    
 Per-user and per-movie rating statistics (mean, median, variance, etc.)  
3. Evaluate RMSE on the test set.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

Let's take a look at the tables in the dataset:

In [3]:
links = pd.read_csv('links.csv')
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')

In [4]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


This table holds identifier mappings. According to the [MovieLens](https://grouplens.org/datasets/movielens/latest/) dataset description, `movieId`, `imdbId` and `tmdbId` are the movie's identifiers on MovieLens, IMDb and TMDb, so a given film can be looked up on each of those sites. The table adds nothing useful for this task and is not used further.

In [5]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [7]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


Let's check how many rows each table has:

In [8]:
links.shape  # for reference: the row count matches movies.shape below

(9742, 3)

In [9]:
movies.shape

(9742, 3)

In [10]:
ratings.shape

(100836, 4)

In [11]:
tags.shape

(3683, 4)

The tables differ in row count, so the data has to be prepared before the tables can be joined.

The helper below normalizes a `|`-separated string: it lowercases the text, removes spaces and hyphens inside each item, and joins the items with spaces instead of vertical bars.

---

In [12]:
def change_string(s):
    return ' '.join(s.lower().replace(' ', '').replace('-', '').split('|'))
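To make the effect of `change_string` concrete, here is a small sketch that applies it to a few strings in the style of the `genres` and `tag` columns. The function body is repeated so the snippet runs on its own; the `examples` dictionary is only an illustration.

```python
def change_string(s):
    # Lowercase, drop spaces and hyphens, then turn '|' separators into spaces.
    return ' '.join(s.lower().replace(' ', '').replace('-', '').split('|'))

examples = {
    'Adventure|Children|Fantasy': change_string('Adventure|Children|Fantasy'),
    'Sci-Fi|Film-Noir': change_string('Sci-Fi|Film-Noir'),
    'Highly quotable': change_string('Highly quotable'),
}

for raw, cleaned in examples.items():
    print(f'{raw!r} -> {cleaned!r}')
```

Because spaces are removed before splitting, a multi-word tag collapses into a single token, which is why column names such as `thoughtprovoking` or `filmnoir` show up in the feature tables later on.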

---

### Building the features

Create a new DataFrame from the ratings table.

In [13]:
umr = ratings.copy()
del umr['timestamp']
umr.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [14]:
umr.shape

(100836, 3)

---

#### Per-user rating statistics (mean, median, variance, etc.)

Create a new DataFrame that holds the per-user rating statistics as features.

Start with the **median** of each user's ratings.

In [15]:
umr_users = pd.DataFrame(umr.groupby('userId').rating.median().round(2)).rename(columns={'rating': 'user_median_rating'})

In [16]:
umr_users.reset_index().head()

Unnamed: 0,userId,user_median_rating
0,1,5.0
1,2,4.0
2,3,0.5
3,4,4.0
4,5,4.0


In [17]:
umr_users.shape

(610, 1)

Add a column with the **variance** of each user's ratings.

In [18]:
umr_users = pd.merge(umr_users, pd.DataFrame(umr.groupby('userId').rating.var().round(2)).rename(columns={'rating': 'user_variance_rating'}),
         left_index=True,
         right_index=True)

In [19]:
umr_users.reset_index().head()

Unnamed: 0,userId,user_median_rating,user_variance_rating
0,1,5.0,0.64
1,2,4.0,0.65
2,3,0.5,4.37
3,4,4.0,1.73
4,5,4.0,0.98


Add a column with the **min** of each user's ratings.

In [20]:
umr_users = pd.merge(umr_users, pd.DataFrame(umr.groupby('userId').rating.min().round(2)).rename(columns={'rating': 'user_min_rating'}),
         left_index=True,
         right_index=True)

In [21]:
umr_users.reset_index().head()

Unnamed: 0,userId,user_median_rating,user_variance_rating,user_min_rating
0,1,5.0,0.64,1.0
1,2,4.0,0.65,2.0
2,3,0.5,4.37,0.5
3,4,4.0,1.73,1.0
4,5,4.0,0.98,1.0


Add a column with the **max** of each user's ratings.

In [22]:
umr_users = pd.merge(umr_users, pd.DataFrame(umr.groupby('userId').rating.max().round(2)).rename(columns={'rating': 'user_max_rating'}),
         left_index=True,
         right_index=True)

In [23]:
umr_users.reset_index().head()

Unnamed: 0,userId,user_median_rating,user_variance_rating,user_min_rating,user_max_rating
0,1,5.0,0.64,1.0,5.0
1,2,4.0,0.65,2.0,5.0
2,3,0.5,4.37,0.5,5.0
3,4,4.0,1.73,1.0,5.0
4,5,4.0,0.98,1.0,5.0


Add a column with the **mean** of each user's ratings.

In [24]:
umr_users = pd.merge(umr_users, pd.DataFrame(umr.groupby('userId').rating.mean().round(2)).rename(columns={'rating': 'user_mean_rating'}),
         left_index=True,
         right_index=True)

In [25]:
umr_users.reset_index().head()

Unnamed: 0,userId,user_median_rating,user_variance_rating,user_min_rating,user_max_rating,user_mean_rating
0,1,5.0,0.64,1.0,5.0,4.37
1,2,4.0,0.65,2.0,5.0,3.95
2,3,0.5,4.37,0.5,5.0,2.44
3,4,4.0,1.73,1.0,5.0,3.56
4,5,4.0,0.98,1.0,5.0,3.64


In [26]:
umr_users.shape

(610, 5)
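The five merges above can also be written as a single grouped aggregation. The sketch below is an alternative formulation, not the notebook's own code: it uses pandas named aggregation on a made-up `toy` table and produces the same column names.

```python
import pandas as pd

# Toy stand-in for the ratings table (values are made up).
toy = pd.DataFrame({
    'userId': [1, 1, 1, 2, 2],
    'rating': [4.0, 5.0, 3.0, 2.0, 4.0],
})

# One pass over the groups; each keyword becomes an output column.
user_stats = (
    toy.groupby('userId')['rating']
    .agg(
        user_median_rating='median',
        user_variance_rating='var',
        user_min_rating='min',
        user_max_rating='max',
        user_mean_rating='mean',
    )
    .round(2)
)
print(user_stats)
```

The result is indexed by `userId`, so it can be merged into the ratings table with `left_on='userId', right_index=True` exactly like `umr_users`.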

#### Per-movie rating statistics (mean, median, variance, etc.)

$\odot$ **movie_median_rating**

In [27]:
umr_films = pd.DataFrame(umr.groupby('movieId').rating.median().round(2)).rename(columns={'rating': 'movie_median_rating'})

$\odot$ **movie_variance_rating**

In [28]:
umr_films = pd.merge(umr_films, pd.DataFrame(umr.groupby('movieId').rating.var().round(2)).rename(columns={'rating': 'movie_variance_rating'}),
         left_index=True,
         right_index=True).fillna(0)

$\odot$ **movie_min_rating**

In [29]:
umr_films = pd.merge(umr_films, pd.DataFrame(umr.groupby('movieId').rating.min().round(2)).rename(columns={'rating': 'movie_min_rating'}),
         left_index=True,
         right_index=True)

$\odot$ **movie_max_rating**

In [30]:
umr_films = pd.merge(umr_films, pd.DataFrame(umr.groupby('movieId').rating.max().round(2)).rename(columns={'rating': 'movie_max_rating'}),
         left_index=True,
         right_index=True)

$\odot$ **movie_mean_rating**

In [31]:
umr_films = pd.merge(umr_films, pd.DataFrame(umr.groupby('movieId').rating.mean().round(2)).rename(columns={'rating': 'movie_mean_rating'}),
         left_index=True,
         right_index=True)

In [32]:
umr_films.reset_index().head()

Unnamed: 0,movieId,movie_median_rating,movie_variance_rating,movie_min_rating,movie_max_rating,movie_mean_rating
0,1,4.0,0.7,0.5,5.0,3.92
1,2,3.5,0.78,0.5,5.0,3.43
2,3,3.0,1.11,0.5,5.0,3.26
3,4,3.0,0.73,1.0,3.0,2.36
4,5,3.0,0.82,0.5,5.0,3.07


In [33]:
umr_films.shape

(9724, 5)
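The movie variance cell calls `.fillna(0)` because the sample variance is undefined for a movie with a single rating, and pandas returns `NaN` in that case. A minimal sketch on made-up data; the `n_ratings` count is a hypothetical extra feature, not something the notebook computes.

```python
import pandas as pd

# Movie 10 has two ratings, movie 20 has only one (made-up values).
toy = pd.DataFrame({
    'movieId': [10, 10, 20],
    'rating': [3.0, 5.0, 4.0],
})

grouped = toy.groupby('movieId')['rating']
raw_var = grouped.var()          # NaN for the single-rating movie
filled_var = raw_var.fillna(0)   # what the notebook does
n_ratings = grouped.count()      # hypothetical helper feature

print(pd.DataFrame({'raw_var': raw_var, 'filled_var': filled_var, 'n_ratings': n_ratings}))
```

Filling with zero makes a single-rating movie look the same as a movie whose ratings all agree, so a count column like `n_ratings` may help a model tell the two apart.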

There are 610 unique users and 9724 unique rated movies (the `movies` table has 9742 rows, so some movies have no ratings).

---

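### Now let's work on the tags and genres.

We will need the scikit-learn text feature modules imported below.

**Important:** the first version of this homework used two separate models (fitted separately for tags and for genres); now the same model objects are shared between tags and genres.

[The difference between .fit(), .fit_transform() and .transform()](https://stackoverflow.com/questions/23838056/what-is-the-difference-between-transform-and-fit-transform-in-sklearn)  
`fit` learns the vocabulary and statistics from the data, `transform` applies what was already learned, and `fit_transform` does both in one call. Which one to use depends on whether the data at hand is what the model should learn from or new data that only needs to be encoded.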
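A short sketch of that difference with `CountVectorizer`, using made-up documents: the vocabulary is learned once with `fit_transform`, and `transform` then encodes new text against that fixed vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['comedy romance', 'comedy drama']
new_docs = ['drama western']  # 'western' was never seen during fit

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)  # learns the vocabulary and encodes
X_new = vec.transform(new_docs)          # encodes only, vocabulary unchanged

vocab = sorted(vec.vocabulary_)
print(vocab)
print(X_new.toarray())
```

The unseen word `western` is silently dropped from `X_new`, because `transform` never extends the vocabulary.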

In [34]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

#### **Apply the shared model to tags and genres:**

In [35]:
cvect = CountVectorizer()
ttrans = TfidfTransformer()

###### Working with tags.

Normalize each tag with the same helper used for genres (lowercase, spaces and hyphens removed), so that a multi-word tag becomes a single token.

In [36]:
movie_gtags = [change_string(g) for g in tags.tag.values]


Fit the vectorizer and build the count matrix.

In [37]:
preproc = cvect.fit_transform(movie_gtags)

Name the columns after the vocabulary terms.

In [38]:
preproc_df = pd.DataFrame(preproc.toarray(), columns=cvect.get_feature_names())

Select the popular tags: terms that occur more than 5 times, sorted by descending frequency.

In [39]:
popular_tags = preproc_df.sum().sort_values(ascending=False)[preproc_df.sum().sort_values(ascending=False) > 5].index

Fit the TF-IDF transform (term frequency and inverse document frequency) on the counts.

In [40]:
preproc = ttrans.fit_transform(preproc)
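As a sanity check on what the two-step `CountVectorizer` + `TfidfTransformer` pipeline computes, the sketch below compares it with `TfidfVectorizer` and with the smoothed IDF formula that scikit-learn uses by default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The documents are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

docs = ['comedy romance', 'comedy drama romance', 'comedy']

counts = CountVectorizer().fit_transform(docs)
ttrans = TfidfTransformer()
two_step = ttrans.fit_transform(counts).toarray()
one_step = TfidfVectorizer().fit_transform(docs).toarray()

# Recompute the IDF weights by hand from document frequencies.
dense = counts.toarray()
n_docs = dense.shape[0]
doc_freq = (dense > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(np.allclose(two_step, one_step))
print(np.allclose(manual_idf, ttrans.idf_))
print(np.linalg.norm(two_step, axis=1))
```

Each row has unit length after normalization, so a document with a single term gets a weight of exactly 1.0 for that term.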

Name the columns after the vocabulary terms.

In [41]:
movie_tags = pd.DataFrame(preproc.toarray(), columns=cvect.get_feature_names())

Keep only the columns of the popular tags.

In [42]:
movie_tags = movie_tags[popular_tags]

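Attach the `movieId` column on the left. Note that this join is positional: row *i* of the tag matrix (the *i*-th record of `tags.csv`) is paired with row *i* of `movies`, not with the movie the tag was actually applied to.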

In [43]:
movie_tags = pd.merge(movies[['movieId']], movie_tags, how='left', left_index=True, right_index=True)

Use it as the index.

In [44]:
movie_tags.index = movie_tags.movieId

And drop it as a column.

In [45]:
movie_tags.drop(columns=['movieId'], inplace=True)

Fill the missing values with zeros.

In [46]:
movie_tags.fillna(0.0, inplace=True)

Display the resulting DataFrame.

In [47]:
movie_tags.head()

movieId,innetflixqueue,atmospheric,thoughtprovoking,funny,scifi,surreal,superhero,disney,quirky,religion,...,wedding,zombies,twins,hitmen,visuallystunning,fantasy,dystopia,gambling,greatsoundtrack,gothic
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
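In the table above, movie 1 gets `funny = 1.0` only because `funny` is the first record in `tags.csv`; that tag was applied to movie 60756. One way to tie tags to the right movie is to join all tags of a movie into one document, vectorize those documents, and then reindex by `movieId`. The sketch below is an assumption about how that could look, not the notebook's method; `toy_movies` and `toy_tags` are made-up tables and `TfidfVectorizer` is swapped in for the two-step pipeline.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def change_string(s):
    # Same normalization as the notebook's helper.
    return ' '.join(s.lower().replace(' ', '').replace('-', '').split('|'))

toy_movies = pd.DataFrame({'movieId': [1, 2, 3]})
toy_tags = pd.DataFrame({
    'movieId': [3, 3, 1],
    'tag': ['funny', 'Highly quotable', 'pixar'],
})

# One text document per movie: all of its normalized tags joined by spaces.
docs = (
    toy_tags.assign(tag=toy_tags['tag'].map(change_string))
    .groupby('movieId')['tag']
    .agg(' '.join)
)

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)
columns = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Rows are labeled by movieId; movies without tags become all-zero rows.
tag_features = pd.DataFrame(tfidf.toarray(), index=docs.index, columns=columns)
tag_features = tag_features.reindex(toy_movies['movieId'], fill_value=0.0)
print(tag_features)
```

Here movie 2 has no tags and ends up as a row of zeros, while movie 3 carries both of its own tags.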


###### Working with genres using the same model.

In [48]:
mg = movies[['movieId', 'genres']]

In [49]:
movie_genres = [change_string(g) for g in mg.genres.values]


In [50]:
preproc = cvect.fit_transform(movie_genres)
preproc = ttrans.fit_transform(preproc)

In [51]:
movie_genres = pd.DataFrame(preproc.toarray(), columns=cvect.get_feature_names())

In [52]:
movie_genres = pd.merge(movies[['movieId']], movie_genres, how='left', left_index=True, right_index=True)
movie_genres.index = movie_genres.movieId
movie_genres.drop(columns=['movieId'], inplace=True)
movie_genres.fillna(0.0, inplace=True)
movie_genres.head()

movieId,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,filmnoir,horror,imax,musical,mystery,nogenreslisted,romance,scifi,thriller,war,western
1,0.0,0.416846,0.516225,0.504845,0.267586,0.0,0.0,0.0,0.48299,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.512361,0.0,0.620525,0.0,0.0,0.0,0.0,0.593662,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.570915,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.821009,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.505015,0.0,0.0,0.466405,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.726241,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---

Assemble the dataset for modeling.

In [53]:
mvs = ratings[['userId', 'movieId', 'rating']]
mvs = pd.merge(mvs, umr_users, how='left', left_on='userId', right_index=True)
mvs = pd.merge(mvs, umr_films, how='left', left_on='movieId', right_index=True)
mvs = pd.merge(mvs, movie_genres, how='left', left_on='movieId', right_index=True)
mvs = pd.merge(mvs, movie_tags, how='left', left_on='movieId', right_index=True)
mvs.sample(7) 

Unnamed: 0,userId,movieId,rating,user_median_rating,user_variance_rating,user_min_rating,user_max_rating,user_mean_rating,movie_median_rating,movie_variance_rating,...,wedding,zombies,twins,hitmen,visuallystunning,fantasy_y,dystopia,gambling,greatsoundtrack,gothic
5203,34,1717,3.5,4.0,1.84,0.5,5.0,3.42,3.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3394,21,7570,4.5,3.5,1.01,0.5,5.0,3.26,3.5,0.78,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
73909,474,2866,3.5,3.5,0.69,0.5,5.0,3.4,3.5,0.17,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
52855,346,7387,4.0,4.0,0.46,1.0,5.0,3.68,4.0,1.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
714,6,343,3.0,3.0,0.72,1.0,5.0,3.49,3.0,0.8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21268,140,1693,3.0,3.5,0.65,0.5,5.0,3.5,3.5,0.75,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25282,177,3969,5.0,3.5,0.92,0.5,5.0,3.38,3.75,1.41,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [54]:
mvs.shape

(100836, 170)
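The `fantasy_y` column in the sample above appears because the genre and tag tables share some column names, and `pd.merge` disambiguates overlapping names with the default `_x` / `_y` suffixes. A minimal sketch on made-up frames, with `genre_` and `tag_` prefixes as one hypothetical way to keep the origin of each column readable:

```python
import pandas as pd

base = pd.DataFrame({'movieId': [1, 2]})
genres = pd.DataFrame({'fantasy': [0.5, 0.0]}, index=[1, 2])
tags = pd.DataFrame({'fantasy': [0.0, 1.0]}, index=[1, 2])

# Default behaviour: the clashing name gets _x and _y suffixes.
default = (
    base.merge(genres, how='left', left_on='movieId', right_index=True)
        .merge(tags, how='left', left_on='movieId', right_index=True)
)

# Prefixing each block first keeps the names self-explanatory.
explicit = (
    base.merge(genres.add_prefix('genre_'), how='left', left_on='movieId', right_index=True)
        .merge(tags.add_prefix('tag_'), how='left', left_on='movieId', right_index=True)
)

print(list(default.columns))
print(list(explicit.columns))
```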

In [55]:
mvs.isna().sum().sort_values()

userId         0
creepy         0
stylized       0
family         0
martialarts    0
              ..
aliens         0
dreamlike      0
blackcomedy    0
highschool     0
gothic         0
Length: 170, dtype: int64

In [56]:
mvs.isnull().any().any()

False

---

#### Scale the dataset, keeping the column names

In [57]:
from sklearn.preprocessing import StandardScaler

In [58]:
ss = StandardScaler()
mvs_norm = pd.DataFrame(ss.fit_transform(mvs), columns=mvs.columns)
mvs_norm.head()

Unnamed: 0,userId,movieId,rating,user_median_rating,user_variance_rating,user_min_rating,user_max_rating,user_mean_rating,movie_median_rating,movie_variance_rating,...,wedding,zombies,twins,hitmen,visuallystunning,fantasy_y,dystopia,gambling,greatsoundtrack,gothic
0,-1.780374,-0.54697,0.478112,2.568746,-0.583237,0.145061,0.182856,1.880009,0.696738,-0.275264,...,-0.031664,-0.053145,-0.026544,-0.038469,-0.045245,-0.04676,-0.04264,-0.028528,-0.015104,-0.050549
1,-1.780374,-0.546914,0.478112,2.568746,-0.583237,0.145061,0.182856,1.880009,-0.93145,0.657741,...,-0.031664,-0.053145,-0.026544,-0.038469,-0.045245,-0.04676,-0.04264,-0.028528,-0.015104,-0.050549
2,-1.780374,-0.54683,0.478112,2.568746,-0.583237,0.145061,0.182856,1.880009,0.696738,-0.343533,...,-0.031664,-0.053145,-0.026544,-0.038469,-0.045245,-0.04676,-0.04264,-0.028528,-0.015104,-0.050549
3,-1.780374,-0.545676,1.437322,2.568746,-0.583237,0.145061,0.182856,1.880009,0.696738,0.066079,...,-0.031664,-0.053145,-0.026544,-0.038469,-0.045245,-0.04676,-0.04264,-0.028528,-0.015104,-0.050549
4,-1.780374,-0.545591,1.437322,2.568746,-0.583237,0.145061,0.182856,1.880009,1.510833,-0.411801,...,-0.031664,-0.053145,-0.026544,-0.038469,-0.045245,-0.04676,-0.04264,-0.028528,-0.015104,-0.050549


In [80]:
mvs_norm.sample(5)

Unnamed: 0,userId,movieId,rating,user_median_rating,user_variance_rating,user_min_rating,user_max_rating,user_mean_rating,movie_median_rating,movie_variance_rating,...,wedding,zombies,twins,hitmen,visuallystunning,fantasy_y,dystopia,gambling,greatsoundtrack,gothic
89963,1.41209,-0.535178,0.478112,0.722867,0.831777,0.145061,0.182856,1.31686,-0.93145,0.134348,...,-0.031664,-0.053145,-0.026544,-0.038469,-0.045245,-0.04676,-0.04264,-0.028528,-0.015104,-0.050549
3263,-1.670855,-0.512381,-0.960704,-0.200073,0.319444,-0.662724,0.182856,-0.524205,0.696738,0.930816,...,-0.031664,-0.053145,-0.026544,-0.038469,-0.045245,-0.04676,-0.04264,-0.028528,-0.015104,-0.050549
10731,-1.413487,-0.483026,-0.960704,-0.661543,-0.192888,-0.662724,0.182856,-0.589183,-0.117356,-0.024946,...,-0.031664,-0.053145,-0.026544,-0.038469,-0.045245,-0.04676,-0.04264,-0.028528,-0.015104,-0.050549
48886,-0.049982,-0.353645,0.478112,0.722867,0.270651,1.76063,0.182856,0.493796,0.696738,-0.343533,...,-0.031664,-0.053145,-0.026544,-0.038469,-0.045245,-0.04676,-0.04264,-0.028528,-0.015104,-0.050549
43608,-0.18688,-0.479423,-0.481099,-0.200073,-0.070904,-0.662724,0.182856,-0.437566,0.696738,-0.571095,...,-0.031664,-0.053145,-0.026544,-0.038469,-0.045245,-0.04676,-0.04264,-0.028528,-0.015104,-0.050549


---

### Run train_test_split on the prepared dataset.

In [59]:
mvs_norm.shape  # size of the data we are working with

(100836, 170)

In [60]:
from sklearn.model_selection import train_test_split

In [61]:
y = mvs_norm['rating']
x = mvs_norm.drop(columns=['rating'])

In [62]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
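In the cells above the scaler is fitted on the whole table, including the `rating` target, before the split, so test rows influence the scaling statistics. A common alternative is to split first, fit the scaler on the training features only, and leave the target in its original units. The sketch below shows that order on synthetic data; the arrays and the `random_state` value are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(200, 4))  # synthetic features
y = rng.uniform(0.5, 5.0, size=200)                # synthetic ratings in stars

# Split first; a fixed random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=8)

# Learn mean and std from the training rows only, then apply to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.mean(axis=0).round(6), X_te_s.shape)
```

With the target left unscaled, an RMSE computed on `y_te` reads directly in rating stars.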

### Fit a regression and compute RMSE on the test set:

### [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [63]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

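[sklearn.metrics.mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html): with `squared=False` it returns RMSE; otherwise (the default) it returns MSE. The cells below additionally wrap the call in `np.sqrt`, so the printed values are the square root of the RMSE, measured on the standardized rating.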
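A small worked example of the relationship between MSE, RMSE, and a root taken twice, on made-up values. It uses `np.sqrt` on the plain MSE, which does not depend on the `squared` argument.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([4.0, 3.0, 5.0, 2.0])
y_hat = np.array([3.0, 3.0, 4.0, 4.0])

# Errors are 1, 0, 1, -2, so the squared errors average to 1.5.
mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)          # root mean squared error
double_root = np.sqrt(rmse)  # what a second np.sqrt would report

print(mse, rmse, double_root)
```

Since the square root is monotonic, a double root preserves the ranking of models, but the number itself is no longer in the units of the target.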

In [64]:
lr = LinearRegression(n_jobs=-1)
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
print ('RMSE: ',np.sqrt(mean_squared_error(y_test, y_pred, squared=False)))

RMSE:  0.8791157605612001


### [RandomForestRegressor model](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

In [65]:
from sklearn.ensemble import RandomForestRegressor

Use five folds in cross-validation (`cv=5`).

In [66]:
from sklearn.model_selection import RandomizedSearchCV

In [67]:
params = {   'criterion': ['mse'],
             'max_depth': list(range(5, 10, 1)),
             'max_features': ['auto', 'sqrt', 'log2'],
             'min_samples_leaf': list(range(1, 8, 1)),
             'min_samples_split': list(range(2, 8, 1)),
             'n_estimators': list(range(5, 10, 1)),
             'n_jobs': [-1],
             'random_state': [8]}

rf_regr = RandomizedSearchCV(RandomForestRegressor(), params, cv=5, n_jobs=-1, n_iter=7, random_state=8)
rf_regr =  rf_regr.fit(x_train, y_train).best_estimator_
rf_regr

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=8, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=5,
                      min_samples_split=6, min_weight_fraction_leaf=0.0,
                      n_estimators=8, n_jobs=-1, oob_score=False,
                      random_state=8, verbose=0, warm_start=False)
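Besides `best_estimator_`, the fitted search object also exposes the chosen parameters and the cross-validated score. The sketch below runs a much smaller search on synthetic data and asks for RMSE directly through `scoring='neg_root_mean_squared_error'`; the data, the grid, and the fold count are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=120, n_features=5, noise=0.5, random_state=8)

param_distributions = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [1, 3, 5],
    'n_estimators': [5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=8),
    param_distributions,
    n_iter=4,                               # number of sampled combinations
    cv=3,
    scoring='neg_root_mean_squared_error',  # scores are negated RMSE
    random_state=8,
)
search.fit(X, y)

cv_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(search.best_params_, cv_rmse)
```

Comparing `cv_rmse` with the test-set RMSE gives a rough idea of whether the selected model generalizes beyond the folds it was tuned on.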

In [69]:
print('RMSE:', np.sqrt(mean_squared_error(y_test, rf_regr.predict(x_test), squared=False)))

RMSE: 0.8719943184092788


#### Try wider parameter ranges

In [71]:
params = {   'criterion': ['mse'],
             'max_depth': list(range(5, 12, 1)),
             'max_features': ['auto', 'sqrt', 'log2'],
             'min_samples_leaf': list(range(1, 8, 1)),
             'min_samples_split': list(range(2, 8, 1)),
             'n_estimators': list(range(5, 20, 1)),
             'n_jobs': [-1],
             'random_state': [8]}

rf_regr = RandomizedSearchCV(RandomForestRegressor(), params, cv=5, n_jobs=-1, n_iter=9, random_state=8)
rf_regr =  rf_regr.fit(x_train, y_train).best_estimator_
rf_regr

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=11, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=7,
                      min_samples_split=7, min_weight_fraction_leaf=0.0,
                      n_estimators=17, n_jobs=-1, oob_score=False,
                      random_state=8, verbose=0, warm_start=False)

In [72]:
print('RMSE:', np.sqrt(mean_squared_error(y_test, rf_regr.predict(x_test), squared=False)))

RMSE: 0.8678670051309639


The RMSE improved slightly.

##### Predict ratings on the test set.

In [75]:
ratings_pred = rf_regr.predict(x_test)

In [76]:
ratings_pred

array([ 0.99099967,  0.09146114,  0.14475991, ..., -0.57190593,
       -0.83842859, -0.84266688])
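The predicted values are in standardized units, because `rating` was scaled together with the features. They can be mapped back to stars with the mean and scale that `StandardScaler` stored for the `rating` column. A sketch on a made-up two-column table (in the notebook the fitted scaler is `ss` and the columns come from `mvs`):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    'rating': [1.0, 3.0, 5.0, 3.0],
    'feature': [10.0, 20.0, 30.0, 40.0],
})
scaler = StandardScaler().fit(toy)

# Position of the target column inside the scaler's per-column statistics.
col = list(toy.columns).index('rating')

z_pred = np.array([-1.0, 0.0, 0.5])  # made-up standardized predictions
stars = z_pred * scaler.scale_[col] + scaler.mean_[col]
print(stars)
```

The same `scale_[col]` factor converts an RMSE measured on the standardized target into an RMSE in stars.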

### [Ridge regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ridge_regression.html)

In [93]:
from sklearn.linear_model import Ridge

In [95]:
params = {'alpha':list(np.arange(0.1, 100.0, 0.1)),
          'solver':['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'],
          'max_iter':[100, 1000, 2000, 5000, 10000],
          'tol':[1e-5, 1e-4, 1e-3, 1e-2] # default is 1e-3
          }
linear_model = RandomizedSearchCV(Ridge(), params, cv=8, random_state=8, n_iter=10)
linear_model = linear_model.fit(x_train,y_train).best_estimator_
linear_model

Ridge(alpha=56.1, copy_X=True, fit_intercept=True, max_iter=5000,
      normalize=False, random_state=None, solver='svd', tol=0.01)

In [96]:
print('RMSE:', np.sqrt(mean_squared_error(y_test, linear_model.predict(x_test), squared=False)))

RMSE: 0.8791171115672273


---

### The RandomForestRegressor model gave the best result; its rating predictions for the test set were obtained above.