#### Assignment

1. Use the `MovieLens` data.
2. Any model from the package may be used.
3. Achieve an `RMSE` of 0.87 or lower on the test set.
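
For reference, the target metric can be computed by hand. The sketch below is a minimal pure-Python RMSE; the function name `rmse` and the sample ratings are hypothetical and only illustrate the formula, they are not taken from the dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical true and predicted ratings, for illustration only
true_ratings = [4.0, 4.5, 2.5, 5.0]
predicted_ratings = [3.5, 4.0, 3.0, 4.5]

error = rmse(true_ratings, predicted_ratings)
target_met = error <= 0.87  # threshold from the assignment
print(error, target_met)
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.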

#### Loading the required data and libraries into the working environment

In [1]:
import pandas as pd
import numpy as np

In [2]:
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import train_test_split

In [4]:
# `movies` and `ratings` are assumed to be DataFrames loaded in a cell that is not shown here
movies_with_ratings = movies.merge(ratings, on='movieId').reset_index(drop=True)
movies_with_ratings.dropna(inplace=True)
movies_with_ratings.head()

   movieId             title                                       genres  userId  rating   timestamp
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy       1     4.0   964982703
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy       5     4.0   847434962
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy       7     4.5  1106635946
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      15     2.5  1510577970
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      17     4.5  1305696483


In [5]:
dataset = pd.DataFrame({
    'uid': movies_with_ratings.userId,
    'iid': movies_with_ratings.title,
    'rating': movies_with_ratings.rating
})

In [6]:
ratings.rating.max()

5.0

In [7]:
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(dataset, reader)

In [8]:
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

In [9]:
from surprise import KNNWithMeans

algo = KNNWithMeans(k=50, sim_options={
    'name': 'cosine',
    'user_based': True  # compute similarities between users
})
algo.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x17f0a82f550>

In [10]:
test_pred = algo.test(testset)
accuracy.rmse(test_pred, verbose=True)

RMSE: 0.8959


0.8959492188641812
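
The KNN model above relies on a user-to-user cosine similarity. As a rough illustration of that measure, here is a minimal pure-Python sketch that compares two users over the items both have rated. The helper `cosine_sim` and the sample users are hypothetical; this is not Surprise's internal implementation.

```python
import math

def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity of two users, using only items both have rated.

    Each argument maps item id -> rating. Returns 0.0 when there are no common items.
    """
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

# Hypothetical users, for illustration only
user_a = {"Toy Story (1995)": 4.0, "Heat (1995)": 2.0}
user_b = {"Toy Story (1995)": 2.0, "Heat (1995)": 1.0, "Jumanji (1995)": 5.0}
user_c = {"Jumanji (1995)": 3.0}

sim_ab = cosine_sim(user_a, user_b)  # ratings are proportional on the common items
sim_ac = cosine_sim(user_a, user_c)  # no common items
print(sim_ab, sim_ac)
```

Users `a` and `b` rate on different scales yet come out as nearly identical, because cosine similarity only looks at direction. This is one reason to subtract user means in a model such as `KNNWithMeans`.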

In [3]:
data = pd.read_csv('ratings.csv')

#### Converting the pandas DataFrame to the Surprise format

In [4]:
# Create the Reader object
reader = Reader(rating_scale=(0.5, 5))

In [5]:
# Load the data in Surprise format
data_surprise = Dataset.load_from_df(data[['userId', 'movieId', 'rating']], reader)

In [6]:
# Split the data into train and test sets
trainset, testset = train_test_split(data_surprise, test_size=0.2)
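
The split above is random, so the reported test RMSE can vary from run to run unless a seed is fixed, as the earlier cell did with `random_state=42`. The sketch below shows the idea with a hypothetical helper `split_ratings` in pure Python; it is not Surprise's implementation.

```python
import random

def split_ratings(rows, test_size=0.2, seed=None):
    """Shuffle rating rows and return (train, test); a fixed seed makes the split repeatable."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# 100 hypothetical (user, item, rating) rows, for illustration only
rows = [(user, item, 3.0) for user in range(10) for item in range(10)]

train_a, test_a = split_ratings(rows, seed=42)
train_b, test_b = split_ratings(rows, seed=42)
print(len(train_a), len(test_a), test_a == test_b)
```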

#### Tuning the parameters of the chosen model with GridSearchCV

_After reviewing tests of various models from the Surprise library, I decided to go with SVD._
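
To show what the tuned parameters control, here is a minimal sketch of a biased matrix-factorization prediction and one stochastic gradient descent step, following the update rule described in the Surprise documentation for `SVD`. The functions `predict` and `sgd_step` and all starting values are hypothetical. The length of the factor vectors plays the role of `n_factors`, `lr` of `lr_all`, and `reg` of `reg_all`.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Estimated rating: global mean + user bias + item bias + dot(q_i, p_u)."""
    return mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr, reg):
    """One SGD update for a single observed rating r_ui; returns the updated parameters."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Hypothetical starting point: global mean 3.5 and two latent factors
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u, q_i = [0.1, -0.1], [0.1, 0.1]
observed = 4.5

before = abs(observed - predict(mu, b_u, b_i, p_u, q_i))
b_u, b_i, p_u, q_i = sgd_step(observed, mu, b_u, b_i, p_u, q_i, lr=0.007, reg=0.09)
after = abs(observed - predict(mu, b_u, b_i, p_u, q_i))
print(before, after)
```

A larger learning rate moves the parameters further per step, while a larger regularization term pulls them back toward zero. This is why the two are usually tuned together.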

In [21]:
from surprise.model_selection import GridSearchCV

In [8]:
param_grid = {
    'n_factors': [10, 20, 100, 350, 500],
    'n_epochs': [10, 20, 35, 50, 70],
    'lr_all': [0.002, 0.007, 0.02, 0.05, 0.09],
    'reg_all': [0.008, 0.02, 0.09, 0.15, 0.3]
}

In [9]:
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
gs.fit(data_surprise)
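
An exhaustive grid search trains one model per parameter combination per fold, so the cost grows quickly. The sketch below counts the fits for the grid and `cv=5` used here, using only the standard library.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'n_factors': [10, 20, 100, 350, 500],
    'n_epochs': [10, 20, 35, 50, 70],
    'lr_all': [0.002, 0.007, 0.02, 0.05, 0.09],
    'reg_all': [0.008, 0.02, 0.09, 0.15, 0.3]
}
cv_folds = 5

# Every combination of one value per parameter
combinations = list(product(*param_grid.values()))
n_combinations = len(combinations)
total_fits = n_combinations * cv_folds
print(n_combinations, total_fits)
```

With five values for each of four parameters, that is 625 combinations and 3125 model fits, which explains why this cell takes a long time to run.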

In [14]:
# Best parameters
print(gs.best_params['rmse'])

{'n_factors': 350, 'n_epochs': 70, 'lr_all': 0.007, 'reg_all': 0.09}


In [18]:
# Best algorithm
best_algo = gs.best_estimator['rmse']

In [19]:
# Train the model selected by the grid search
best_algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x17f0804aeb0>

In [20]:
# Get predictions from the tuned model
predictions = best_algo.test(testset)

In [21]:
# Evaluate the model
rmse = accuracy.rmse(predictions)
print("RMSE on the test set:", rmse)

RMSE: 0.8763
RMSE on the test set: 0.8763489231778587


#### Cross-validation

In [16]:
from surprise.model_selection import cross_validate

In [19]:
# Evaluate RMSE on 5 folds
results = cross_validate(best_algo, data_surprise, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8497  0.8469  0.8523  0.8484  0.8471  0.8489  0.0020  
Fit time          5.41    5.48    5.60    5.67    5.52    5.54    0.09    
Test time         0.19    0.11    0.19    0.11    0.19    0.16    0.04    


In [20]:
# Print the mean RMSE over the 5 folds
print('Mean RMSE:', results['test_rmse'].mean())

Mean RMSE: 0.8488580106541658
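
The summary columns of the table above can be checked by hand from the per-fold values. The sketch below recomputes them from the rounded fold RMSEs printed in the table. It assumes the `Std` column is a population standard deviation, which is what matches the printed value.

```python
from statistics import fmean, pstdev

# Per-fold RMSE values copied from the cross-validation table above
fold_rmse = [0.8497, 0.8469, 0.8523, 0.8484, 0.8471]

mean_rmse = fmean(fold_rmse)
std_rmse = pstdev(fold_rmse)  # population standard deviation (divides by n)
print(f"Mean: {mean_rmse:.4f}  Std: {std_rmse:.4f}")
```

The small spread across folds suggests the cross-validated estimate is stable for this parameter set.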
