**Домашнее задание**

Что делать?  
1.Датасет ml-latest  
2.Вспомнить подходы, которые мы разбирали  
3.Выбрать понравившийся подход к гибридным системам  
4.Написать свою  

In [1]:
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.neighbors import NearestNeighbors

import numpy as np
import pandas as pd

In [2]:
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
genome_scores = pd.read_csv('genome-scores.csv')
genome_tag = pd.read_csv('genome-tags.csv')

Отбросим ненужные данные.

In [3]:
ratings.drop(columns=["timestamp"], inplace=True)

Восстановим геном тегов.

In [4]:
genomes = genome_scores.pivot(index='movieId', columns='tagId', values='relevance').reset_index()
genomes.columns = ["movieId"] + genome_tag["tag"].tolist()
genomes.dropna(inplace=True)
genome_scores = None
genome_tag = None
genomes.head()

Unnamed: 0,movieId,007,007 (series),18th century,1920s,1930s,1950s,1960s,1970s,1980s,...,world politics,world war i,world war ii,writer's life,writers,writing,wuxia,wwii,zombie,zombies
0,1,0.029,0.02375,0.05425,0.06875,0.16,0.19525,0.076,0.252,0.2275,...,0.03775,0.0225,0.04075,0.03175,0.1295,0.0455,0.02,0.0385,0.09125,0.02225
1,2,0.03625,0.03625,0.08275,0.08175,0.102,0.069,0.05775,0.101,0.08225,...,0.04775,0.0205,0.0165,0.0245,0.1305,0.027,0.01825,0.01225,0.09925,0.0185
2,3,0.0415,0.0495,0.03,0.09525,0.04525,0.05925,0.04,0.1415,0.04075,...,0.058,0.02375,0.0355,0.02125,0.12775,0.0325,0.01625,0.02125,0.09525,0.0175
3,4,0.0335,0.03675,0.04275,0.02625,0.0525,0.03025,0.02425,0.07475,0.0375,...,0.049,0.03275,0.02125,0.03675,0.15925,0.05225,0.015,0.016,0.09175,0.015
4,5,0.0405,0.05175,0.036,0.04625,0.055,0.08,0.0215,0.07375,0.02825,...,0.05375,0.02625,0.0205,0.02125,0.17725,0.0205,0.015,0.0155,0.08875,0.01575


Оставим только те movieId, которые есть всех датафреймах.

In [5]:
m = set(movies["movieId"])
r = set(ratings["movieId"])
g = set(genomes["movieId"])
common_ids = m.intersection(r).intersection(g)
movies = movies[movies['movieId'].isin(common_ids)]
ratings = ratings[ratings['movieId'].isin(common_ids)]
genomes = genomes[genomes['movieId'].isin(common_ids)]

В результате данной операции остается примерно 1/4 от исходного объема movieId. что для реальной задачи недопустимо. Но цель данной работы опробовать гибридные рекомендательные системы, поэтому, считаю, что такое в данном случае допустимо.

Обучаем SVD на рейтингах.

In [6]:
# ВНИМАНИЕ: данная ячейка требует 15+ GB RAM

dataset = pd.DataFrame({
    'uid': ratings["userId"],
    'iid': ratings["movieId"],
    'rating': ratings["rating"]
})

reader = Reader(rating_scale=(ratings["rating"].min(), ratings["rating"].max()))
data = Dataset.load_from_df(dataset, reader)

trainset, testset = train_test_split(data, test_size=.15, random_state=42)

ratings_algo = SVD(n_factors=20, n_epochs=20)
ratings_algo.fit(trainset)

test_pred = ratings_algo.test(testset)
accuracy.rmse(test_pred, verbose=True)

RMSE: 0.7950


0.7950404843090519

Обучаем модель с использованием алгоритма ближайших соседей на жанрах.

In [7]:
def change_string(g):
    return g.replace(' ', '').replace('-', '').replace('|', ' ')

movie_genres = [change_string(g) for g in movies.genres.values]

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(movie_genres)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf

genres_algo = NearestNeighbors(n_neighbors=50, n_jobs=-1, metric='euclidean') 
genres_algo.fit(X_train_tfidf);

Обучаем модель с использованием алгоритма ближайших соседей на геноме тегов.

In [8]:
genomes.set_index("movieId", inplace=True)
genomes_algo = NearestNeighbors(n_neighbors=50, n_jobs=-1, metric='euclidean') 
genomes_algo.fit(genomes);

Строим функцию предсказания.

In [9]:
def recommend_for_user(user_id):
    # получаем список фильмов для данного пользователя
    user_movies = ratings[ratings["userId"] == user_id]["movieId"].unique()
    
    # берем последний просмотренный пользователем фильм
    last_user_movie = user_movies[-1]
    
    # получаем жанры и готовим их к применению в TF-IDF
    movie_genres = change_string(movies.loc[movies["movieId"] == last_user_movie]["genres"].item())

    # на уже обученной модели получаем предсказание по жанру последнего фильма
    predict = count_vect.transform([movie_genres])
    X_tfidf2 = tfidf_transformer.transform(predict)
    res = genres_algo.kneighbors(X_tfidf2, return_distance=True)
    genres_ids = set(res[1][0])
    
    # на уже обученной модели получаем предсказание по геному тэгов последнего фильма
    res = genomes_algo.kneighbors([genomes.loc[last_user_movie]], return_distance=True)
    genomes_ids = set(res[1][0])
    
    # выполняем смешивание предсказания лучших фильмов от обеих моделей
    movies_to_score = genres_ids.union(genomes_ids)

    # для предсказанных фильмов, не вошедших в список уже просмотренных пользователем,
    # предсказываем оценку на скрытых факторах
    scores = []
    ids = []
    for movie in movies_to_score:
        if movie in user_movies:
            continue
        scores.append(ratings_algo.predict(uid=user_id, iid=movie).est)
        ids.append(movie)
        
    # сортируем и формируем датафрейм с рекомендациями
    best_indexes = np.argsort(scores)[-10:]
    mov = []
    sc = []
    for i in reversed(best_indexes):
        mov.append(movies.loc[movies["movieId"] == ids[i]]["title"].item())
        sc.append(scores[i])
        
    return pd.DataFrame({"Movie": mov, "Predict rating": sc})

Строим рекомендации для пользователя.

In [10]:
user_id = 42
print(f"Рекомендации для пользователя с ID = {user_id}.")
recommend_for_user(user_id)

Рекомендации для пользователя с ID = 42.


Unnamed: 0,Movie,Predict rating
0,Spirited Away (Sen to Chihiro no kamikakushi) ...,4.57282
1,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",4.494807
2,In a Lonely Place (1950),4.279389
3,Inherit the Wind (1960),4.210076
4,"Killing, The (1956)",4.200182
5,Scratch (2001),4.125531
6,Diary of a Country Priest (Journal d'un curé d...,4.043176
7,Standing in the Shadows of Motown (2002),3.936566
8,Man of Marble (Czlowiek z Marmuru) (1977),3.901991
9,Strange Brew (1983),3.883549
