## Homework: Content-Based Recommendations

### Assignment

#### Assignment questions

1. Use the MovieLens dataset (https://grouplens.org/datasets/movielens/latest/).
2. Build recommendations (regression, predicting the rating) on these features:
- TF-IDF over tags and genres;
- average ratings (plus median, variance, etc.) of the user and of the movie.

3. Evaluate RMSE on the test set.
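For reference, RMSE is the square root of the mean of the squared differences between true and predicted ratings. A minimal plain-Python sketch follows; the function name `rmse` and the sample ratings are illustrative assumptions, not part of the assignment.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings, for illustration only.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
score = rmse(actual, predicted)
print(score)
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.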

#### Solution

In [1]:
import os

import pandas as pd
import numpy as np
from datetime import datetime

from tqdm import notebook
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer 
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
os.listdir('../ml_latest_small')
prefix = '../ml_latest_small'

In [3]:
links = pd.read_csv(os.path.join(prefix, 'links.csv'))
movies = pd.read_csv(os.path.join(prefix, 'movies.csv'))
ratings = pd.read_csv(os.path.join(prefix, 'ratings.csv'))
tags = pd.read_csv(os.path.join(prefix, 'tags.csv'))

In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [7]:
tags.shape

(3683, 4)

##### First, prepare the data (tags)

> *Comment*:\
After listing the unique values, I found that some tags start with a capital letter and also occur in all-lowercase form,\
so it is better to convert everything to lowercase.\
I also found abbreviations and misspellings, but fixing them is not part of the assignment and is not the goal of this homework; they could be corrected in the future.


In [8]:
# # unique_tags = tags['tag'].unique() # Find the unique values in the 'tag' column
# unique_tags = sorted(tags['tag'].unique()) # Sorted list of unique values
# unique_tags
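Instead of reading through the sorted list by eye, the case-only duplicates can be found programmatically. The sketch below assumes a toy frame with made-up tag values standing in for `tags.csv`; the names `toy_tags`, `variants`, and `case_duplicates` are hypothetical.

```python
import pandas as pd

# Toy stand-in for tags.csv; the values are made up for illustration.
toy_tags = pd.DataFrame({"tag": ["Funny", "funny", "Pixar", "pixar", "MMA", "sci-fi"]})

# Group the distinct raw spellings by their lowercase form.
variants = (
    toy_tags["tag"]
    .drop_duplicates()
    .groupby(toy_tags["tag"].str.lower())
    .agg(list)
)

# Keep only lowercase forms that have more than one raw spelling.
case_duplicates = variants[variants.map(len) > 1]
print(case_duplicates)
```

Any lowercase form that appears in `case_duplicates` would be split into separate features if the case were left untouched.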

In [10]:
# Convert all values in the 'tag' column to lowercase
tags['tag'] = tags['tag'].str.lower()

In [11]:
# Concatenate the tags of each movie into one string (repeated tags are kept)

tag_list = []
movie_list = []

for k, v in notebook.tqdm(tags.groupby('movieId').tag):
    tag_list.append(' '.join(v.values))
    movie_list.append(k)

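The same per-movie concatenation can be written without an explicit loop. This is a sketch on a toy frame with made-up values; it also shows one possible way to drop repeated tags per movie, which the loop above does not do.

```python
import pandas as pd

# Toy stand-in for the lowercased tags frame; values are illustrative.
toy_tags = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2],
    "tag": ["pixar", "pixar", "fun", "fantasy", "magic board game"],
})

# One row per movie, all tags joined with spaces (duplicates kept).
joined = toy_tags.groupby("movieId")["tag"].agg(" ".join).reset_index()

# Variant that drops repeated tags per movie, preserving first-seen order.
joined_unique = (
    toy_tags.drop_duplicates(["movieId", "tag"])
    .groupby("movieId")["tag"]
    .agg(" ".join)
    .reset_index()
)
print(joined)
print(joined_unique)
```

Keeping duplicates raises the term frequency of tags that several users applied to the same movie; dropping them treats each tag as present or absent. Which is preferable depends on the model.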

In [12]:
# Create a movieId -> tags DataFrame

movies_t = pd.DataFrame(
    {
        'movie_id': movie_list,
        'tag': tag_list
    }
)
movies_t

Unnamed: 0,movie_id,tag
0,1,pixar pixar fun
1,2,fantasy magic board game robin williams game
2,3,moldy old
3,5,pregnancy remake
4,7,remake
...,...,...
1567,183611,comedy funny rachel mcadams
1568,184471,adventure alicia vikander video game adaptation
1569,187593,josh brolin ryan reynolds sarcasm
1570,187595,emilia clarke star wars


##### Prepare the data (genres)

> Next, remove the spaces and hyphens inside genre names and replace the `|` separator with a space

In [13]:
def change_string(s):
    return ' '.join(s.replace(' ', '').replace('-', '').split('|'))
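The reason for removing hyphens and inner spaces is tokenization: with its default token pattern, `TfidfVectorizer` would split `Sci-Fi` into two tokens. The sketch below mimics that default pattern with the standard `re` module rather than calling scikit-learn; the sample genre string is illustrative.

```python
import re


def change_string(s):
    return ' '.join(s.replace(' ', '').replace('-', '').split('|'))


# Default token pattern of scikit-learn's text vectorizers:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

raw = "Action|Sci-Fi|Thriller"

# Without cleaning, the hyphenated genre is split into separate tokens.
tokens_raw = TOKEN_PATTERN.findall(raw.lower())

# After cleaning, each genre stays a single token.
tokens_clean = TOKEN_PATTERN.findall(change_string(raw).lower())

print(tokens_raw)
print(tokens_clean)
```

The same cleaning turns the placeholder `(no genres listed)` into the single token `nogenreslisted`.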

In [14]:
movies_g = movies.genres.apply(change_string).tolist()

In [15]:
# Create the TfidfVectorizer objects
vectorizer_g = TfidfVectorizer()
vectorizer_t = TfidfVectorizer()

# Transform the genres and tags features into TF-IDF space

movie_genres_g = vectorizer_g.fit_transform(movies_g)
movie_genres_t = vectorizer_t.fit_transform(movies_t['tag'].values)
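To make the resulting weights easier to interpret, here is a plain-Python sketch of the computation under the `TfidfVectorizer` defaults: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It is a simplified re-implementation for illustration (whitespace tokenization, hypothetical function name `tfidf_rows`), not the library code.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2 row normalization (sketch)."""
    # Simplification: split on whitespace instead of the library tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# Illustrative genre strings.
docs = ["Comedy Romance", "Comedy", "Comedy Drama Romance"]
rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, row)
```

Two effects are visible: a movie with a single genre gets weight 1.0 for it because of the row normalization, and a rarer genre receives a larger weight than a common one within the same movie.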


In [16]:
# Convert the sparse matrix to a dense matrix (from the lecture)
dense_matrix_g = movie_genres_g.todense()

# Convert the dense matrices to DataFrames (from the lecture)
df_g = pd.DataFrame(dense_matrix_g, index=movies.index)
df_t = pd.DataFrame(movie_genres_t.toarray())

# Use the TF-IDF feature names as column names
df_g.columns = vectorizer_g.get_feature_names_out()
df_t.columns = vectorizer_t.get_feature_names_out()

df_t.insert(loc=0, column='movieId', value=movies_t.movie_id)
df_t


Unnamed: 0,movieId,06,1900s,1920s,1950s,1960s,1970s,1980s,1990s,2001,...,york,you,younger,your,zellweger,zither,zoe,zombie,zombies,zooey
0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1567,183611,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1568,184471,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1569,187593,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1570,187595,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
df_g.insert(loc=0, column='movieId', value=movies.movieId)
df_g

Unnamed: 0,movieId,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,...,horror,imax,musical,mystery,nogenreslisted,romance,scifi,thriller,war,western
0,1,0.000000,0.416846,0.516225,0.504845,0.267586,0.0,0.0,0.000000,0.482990,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,2,0.000000,0.512361,0.000000,0.620525,0.000000,0.0,0.0,0.000000,0.593662,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,3,0.000000,0.000000,0.000000,0.000000,0.570915,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.821009,0.0,0.0,0.0,0.0
3,4,0.000000,0.000000,0.000000,0.000000,0.505015,0.0,0.0,0.466405,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.726241,0.0,0.0,0.0,0.0
4,5,0.000000,0.000000,0.000000,0.000000,1.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,0.436010,0.000000,0.614603,0.000000,0.318581,0.0,0.0,0.000000,0.575034,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9738,193583,0.000000,0.000000,0.682937,0.000000,0.354002,0.0,0.0,0.000000,0.638968,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9739,193585,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,1.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
9740,193587,0.578606,0.000000,0.815607,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0


##### Prepare the average ratings of movies and users

In [18]:
# Compute the average rating of each movie
avg_movie_rating = ratings.groupby('movieId')['rating'].mean()

# Compute the average rating of each user
avg_user_rating = ratings.groupby('userId')['rating'].mean()
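The assignment also mentions median, variance, and similar statistics. A sketch of computing several aggregates in one pass is shown below on toy data; the column names `user_mean`, `user_median`, `user_var`, and `user_count` are hypothetical.

```python
import pandas as pd

# Toy ratings; values are made up for illustration.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2],
    "movieId": [10, 20, 30, 10, 20],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.0],
})

# Several per-user aggregates at once, using named aggregation.
user_stats = (
    toy_ratings.groupby("userId")["rating"]
    .agg(user_mean="mean", user_median="median", user_var="var", user_count="count")
    .reset_index()
)
print(user_stats)
```

The same pattern with `groupby("movieId")` gives the per-movie statistics. Note that `var` is the sample variance and is NaN for groups with a single rating, so those values would need filling before modeling.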

In [19]:
# Normalize the ratings so that all features lie in the range 0 to 1

In [20]:
def min_max_scaling(series):
    return (series - series.min()) / (series.max() - series.min())

In [21]:
avg_movie_rating = min_max_scaling(avg_movie_rating)

avg_user_rating = min_max_scaling(avg_user_rating)

In [22]:
# Convert to DataFrames

df_amr = avg_movie_rating.to_frame().reset_index(drop=False)

df_aur = avg_user_rating.to_frame().reset_index(drop=False)

In [23]:
# Rename the rating columns to descriptive names

df_amr.rename(columns={'rating': 'rating_am'}, inplace=True)

df_aur.rename(columns={'rating': 'rating_au'}, inplace=True)

In [24]:
df_amr.shape

(9724, 2)

In [25]:
df_aur.shape

(610, 2)

##### Merge the DataFrames

In [47]:
# Merge in the DataFrames with the average ratings

df_final = ratings.merge(df_aur, on='userId')
df_final = df_final.merge(df_amr, on='movieId')
df_final

Unnamed: 0,userId,movieId,rating,timestamp,rating_au,rating_am
0,1,1,4.0,964982703,0.829900,0.760207
1,5,1,4.0,847434962,0.633923,0.760207
2,7,1,4.5,1106635946,0.524903,0.760207
3,15,1,2.5,1510577970,0.583395,0.760207
4,17,1,4.5,1305696483,0.787792,0.760207
...,...,...,...,...,...,...
100831,610,160341,2.5,1479545749,0.647935,0.444444
100832,610,160527,4.5,1479544998,0.647935,0.888889
100833,610,160836,3.0,1493844794,0.647935,0.555556
100834,610,163937,3.5,1493848789,0.647935,0.666667


In [27]:
# Merge in the DataFrames with genres and tags
df_final = df_final.merge(df_g, on='movieId')
df_final = df_final.merge(df_t, on='movieId')
df_final

Unnamed: 0,userId,movieId,rating,timestamp,rating_au,rating_am,action_x,adventure_x,animation_x,children_x,...,york,you,younger,your,zellweger,zither,zoe,zombie,zombies,zooey
0,1,1,4.0,964982703,0.829900,0.760207,0.0,0.416846,0.516225,0.504845,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5,1,4.0,847434962,0.633923,0.760207,0.0,0.416846,0.516225,0.504845,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,7,1,4.5,1106635946,0.524903,0.760207,0.0,0.416846,0.516225,0.504845,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,15,1,2.5,1510577970,0.583395,0.760207,0.0,0.416846,0.516225,0.504845,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,17,1,4.5,1305696483,0.787792,0.760207,0.0,0.416846,0.516225,0.504845,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48282,567,176419,3.0,1525287581,0.260525,0.611111,0.0,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48283,599,176419,3.5,1516604655,0.366993,0.611111,0.0,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48284,594,7023,4.5,1108972356,0.711294,0.888889,0.0,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48285,606,6107,4.0,1171324428,0.639570,0.777778,0.0,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
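Two side effects of this merge are visible in the output. The row count falls from roughly 100,000 to roughly 48,000, because `merge` performs an inner join by default and only movies that have tags survive. The genre columns also gain an `_x` suffix, because some tag tokens (for example `action`) share a name with a genre column. The sketch below, on toy frames with made-up values, shows one possible alternative: a left join with explicit suffixes.

```python
import pandas as pd

# Toy frames; values are made up for illustration.
ratings_toy = pd.DataFrame({"movieId": [1, 2, 3], "rating": [4.0, 3.5, 5.0]})
genres_toy = pd.DataFrame({"movieId": [1, 2, 3], "action": [0.0, 1.0, 0.5]})
tags_toy = pd.DataFrame({"movieId": [1, 3], "action": [0.7, 0.0], "pixar": [0.7, 0.0]})

# Inner join (the default) keeps only movies present in both frames.
inner = ratings_toy.merge(genres_toy, on="movieId").merge(tags_toy, on="movieId")

# Left join keeps every rating; explicit suffixes label the source of
# columns that exist in both feature frames. Missing tag features become 0.
left = (
    ratings_toy.merge(genres_toy, on="movieId", how="left")
    .merge(tags_toy, on="movieId", how="left", suffixes=("_genre", "_tag"))
    .fillna(0)
)
print(inner)
print(left)
```

Whether to keep untagged movies is a modeling choice; the left join simply makes that choice explicit instead of silently dropping rows.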


In [28]:
df_final.columns[4:]

Index(['rating_au', 'rating_am', 'action_x', 'adventure_x', 'animation_x',
       'children_x', 'comedy_x', 'crime_x', 'documentary_x', 'drama_x',
       ...
       'york', 'you', 'younger', 'your', 'zellweger', 'zither', 'zoe',
       'zombie', 'zombies', 'zooey'],
      dtype='object', length=1766)

In [29]:
X = df_final[df_final.columns[4:]]

In [30]:
X = X.fillna(0)

In [31]:
y = df_final['rating']

> Predict the rating

In [32]:
from sklearn.metrics import mean_squared_error as mse
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split 


In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_test.index

Index([32603, 46805, 42031, 42747, 43054, 45383, 31544, 24905, 22633, 35194,
       ...
       35071, 36229, 16739, 16883, 42955, 23579, 30400, 30018, 30999, 28425],
      dtype='int64', length=14487)
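Note that `rating_au` and `rating_am` were computed from all ratings before this split, so each test rating contributes to its own features. A minimal sketch of one way to avoid that is to compute the aggregates from the training rows only and then attach them to both parts. The toy data, the fixed split, and the global-mean fallback are assumptions made for illustration.

```python
import pandas as pd

# Toy ratings; values are made up for illustration.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 5.0, 2.0, 3.0, 1.0],
})

# Hypothetical split: the last two rows play the role of the test set.
train, test = toy.iloc[:3], toy.iloc[3:]

# Aggregates are computed from training rows only.
user_mean = train.groupby("userId")["rating"].mean().rename("rating_au")
global_mean = train["rating"].mean()

# Attach to both parts; users unseen in training fall back to the global mean.
train_feat = train.join(user_mean, on="userId")
test_feat = test.join(user_mean, on="userId").fillna({"rating_au": global_mean})
print(test_feat)
```

The same idea applies to the per-movie average and to any scaling parameters such as the minimum and maximum used for normalization.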

Regression based on k-nearest neighbors. \
The target is predicted by local interpolation of the targets associated of the nearest neighbors in the training set \
Source: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

In [44]:
knr = KNeighborsRegressor(5, n_jobs=-1)
knr.fit(X_train, y_train)
predict_knr = knr.predict(X_test)

np.sqrt(mse(y_test, predict_knr))

0.9133774140766328
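The score above corresponds to a single setting, `n_neighbors=5`. A sketch of comparing several settings with cross-validation on the training data is shown below. It uses synthetic data in place of `X_train` and `y_train`, and the parameter grid is an arbitrary assumption, not the set of values tried in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for X_train / y_train.
rng = np.random.default_rng(42)
X_demo = rng.random((200, 5))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(0, 0.1, 200)

# Cross-validated search over the number of neighbors and the weighting.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [3, 5, 10, 20], "weights": ["uniform", "distance"]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_rmse = -search.best_score_  # the scorer returns negated RMSE
print(best_params, best_rmse)
```

Selecting parameters by cross-validation on the training part keeps the test set for the final estimate only.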

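> On the MovieLens rating scale of 0.5 to 5, an RMSE of about 0.91 means the predictions are typically off by a little under one star, which is a reasonable result for this setup. I tried other parameters, but the current ones gave the best result.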