# Movie Recommender System

---
**Goal**: Build a recommender system that suggests to each user a list of the *n* movies most likely to interest them.
___
**Tasks**:

1. Preprocess the data.
2. Implement a retrieval step that builds the pool of candidate movies from which the *n* most interesting ones are selected.
3. Build a matrix factorization algorithm to predict rating values.
4. Test the model.
---

## About the dataset

The dataset comes from Kaggle (https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset#). It was collected by the GroupLens Research Project from the **MovieLens** site and contains 100,000 ratings from 943 users on 1,682 movies. It also includes basic demographic information about the users (age, gender, occupation, zip code).

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets:
History and Context. ACM Transactions on Interactive Intelligent
Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.
DOI=http://dx.doi.org/10.1145/2827872

## Sources

1. Kaggle (https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset#).
2. Machine learning handbook from the Yandex School of Data Analysis (ШАД) (https://education.yandex.ru/handbook/ml).
3. An Introduction to Matrix factorization and Factorization Machines in Recommendation System, and Beyond (https://arxiv.org/pdf/2203.11026).

## Approach

To predict ratings, I plan to use a factorization machine (FM).
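For reference, the second-order factorization machine can be written as below. This is the standard textbook form, and it appears to match what the `predict` method later in this notebook computes. The symbols $w_0$, $w$ and $V$ mirror the attribute names in that class, while $n$ (number of features), $k$ (number of latent factors) and $v_i$ (the $i$-th row of $V$) are notation introduced here for illustration.

$$
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
$$

The pairwise term does not have to be evaluated with a double loop over features. It can be rewritten so that it costs $O(nk)$ per sample:

$$
\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
= \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{if} x_i \right)^{2} - \sum_{i=1}^{n} v_{if}^{2} x_i^{2} \right]
$$

Differentiating the right-hand side with respect to a single latent weight gives the gradient used during training:

$$
\frac{\partial \hat{y}}{\partial v_{if}} = x_i \sum_{j=1}^{n} v_{jf} x_j - v_{if} x_i^{2}
$$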

## EDA and data preparation

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import SGDRegressor

In [2]:
PATH_TO_DATA = 'ml-100k/'

In [3]:
ratings = pd.read_csv(
    filepath_or_buffer=PATH_TO_DATA + 'u.data',
    sep='\t',
    header=None,
    )
ratings.columns = ["user_id", "movie_id", "rating", "timestamp"]

print(ratings.info())
ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   user_id    100000 non-null  int64
 1   movie_id   100000 non-null  int64
 2   rating     100000 non-null  int64
 3   timestamp  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB
None


| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


In [4]:
movies = pd.read_csv(
    filepath_or_buffer=PATH_TO_DATA + 'u.item',
    sep='|',
    header=None,
    encoding='latin-1'
    )
movies.columns = [
    "movie_id", "movie_title", "release_date", "video_release_date",
    "imdb_url", "unknown", "action", "adventure", "animation",
    "children's", "comedy", "crime",  "documentary", "drama", "fantasy",
    "film-noir", "horror", "musical", "mystery", "romance", "sci-fi",
    "thriller", "war", "western"
    ]

print(movies.info())
movies.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movie_id            1682 non-null   int64  
 1   movie_title         1682 non-null   object 
 2   release_date        1681 non-null   object 
 3   video_release_date  0 non-null      float64
 4   imdb_url            1679 non-null   object 
 5   unknown             1682 non-null   int64  
 6   action              1682 non-null   int64  
 7   adventure           1682 non-null   int64  
 8   animation           1682 non-null   int64  
 9   children's          1682 non-null   int64  
 10  comedy              1682 non-null   int64  
 11  crime               1682 non-null   int64  
 12  documentary         1682 non-null   int64  
 13  drama               1682 non-null   int64  
 14  fantasy             1682 non-null   int64  
 15  film-noir           1682 non-null   int64  
 16  horror

| | movie_id | movie_title | release_date | video_release_date | imdb_url | unknown | action | adventure | animation | children's | ... | fantasy | film-noir | horror | musical | mystery | romance | sci-fi | thriller | war | western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


In [5]:
users = pd.read_csv(
    filepath_or_buffer=PATH_TO_DATA + 'u.user',
    sep='|',
    header=None
)
users.columns = [
    "user_id", "age", "gender", "occupation", "zip_code"
]

print(users.info())
users.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     943 non-null    int64 
 1   age         943 non-null    int64 
 2   gender      943 non-null    object
 3   occupation  943 non-null    object
 4   zip_code    943 non-null    object
dtypes: int64(2), object(3)
memory usage: 37.0+ KB
None


| | user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |


In the movies table, the video_release_date column is completely empty, and release_date and imdb_url contain a few missing values. For the chosen method I consider the timestamp column in ratings uninformative, and likewise imdb_url in movies and zip_code in users. Below we discard the unneeded columns and then remove the rows of movies that still have missing feature values.

In [6]:
ratings = ratings.drop(['timestamp'], axis=1)

movies = movies.drop(['video_release_date', 'imdb_url'], axis=1)
movies = movies.dropna()

users = users.drop(['zip_code'], axis=1)

Next, the tables have to be converted into a form that can be fed to the factorization machine algorithm.

All features must be numeric. For this I will use one-hot encoding for the categorical columns, and in release_date I will keep only the release year.

In [7]:
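# Dense output; the argument is `sparse_output` in scikit-learn >= 1.2 (older versions call it `sparse`)
encoder = OneHotEncoder(sparse_output=False)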
encoded_features = encoder.fit_transform(users[['gender', 'occupation']])
encoded_columns = encoder.get_feature_names_out(['gender', 'occupation'])

users = pd.concat([users.drop(['gender', 'occupation'], axis=1), pd.DataFrame(encoded_features, columns=encoded_columns)], axis=1)

# Keep only the release year; pd.to_datetime parses strings such as '01-Jan-1995'
movies['release_date'] = pd.to_datetime(movies['release_date']).dt.year

In [8]:
matrix = {'user_' + str(user_id) : [0]*ratings.shape[0] for user_id in users['user_id'].unique()} # user columns

matrix.update({'movie_' + str(movie_id) : [0]*ratings.shape[0] for movie_id in movies['movie_id'].unique()}) # movie columns

user_features_columns = users.drop('user_id', axis=1).columns
matrix.update({user_feature : [0]*ratings.shape[0] for user_feature in user_features_columns}) # user features

movie_features_columns = movies.drop(['movie_id', 'movie_title'], axis=1).columns
matrix.update({movie_feature : [0]*ratings.shape[0] for movie_feature in movie_features_columns}) # movie features

matrix.update({'rating' : [0]*ratings.shape[0]}) # target - rating

matrix = pd.DataFrame(matrix)
matrix

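| | user_1 | user_2 | user_3 | user_4 | user_5 | user_6 | user_7 | user_8 | user_9 | user_10 | ... | film-noir | horror | musical | mystery | romance | sci-fi | thriller | war | western | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 99996 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 99997 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 99998 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |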


In [9]:
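# Index the feature tables by id so that each lookup returns one row of integers
users_by_id = users.set_index('user_id')[user_features_columns].astype(int)
movies_by_id = movies.set_index('movie_id')[movie_features_columns].astype(int)
rows_without_movie = []  # ratings whose movie was removed by dropna() above

for i in range(ratings.shape[0]):
    row = ratings.iloc[i]
    try:
        movie_features = movies_by_id.loc[row['movie_id']].to_numpy()
    except KeyError:
        print('exception occured')
        rows_without_movie.append(i)
        continue
    # .loc[row, column] writes into matrix directly instead of chained indexing
    matrix.loc[i, 'user_' + str(row['user_id'])] = 1
    matrix.loc[i, 'movie_' + str(row['movie_id'])] = 1
    matrix.loc[i, 'rating'] = row['rating']
    matrix.loc[i, user_features_columns] = users_by_id.loc[row['user_id']].to_numpy()
    matrix.loc[i, movie_features_columns] = movie_features

matrix = matrix.drop(index=rows_without_movie)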

exception occured
exception occured
exception occured
exception occured
exception occured
exception occured
exception occured
exception occured
exception occured


In [10]:
matrix = matrix.dropna()
matrix.to_csv('matrix.csv')
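
The row-by-row construction above is easy to follow but slow on 100,000 ratings. As a rough sketch only, the same kind of design matrix can be assembled with joins and `pd.get_dummies`. The tiny `*_demo` tables and their values are made up for illustration and are not part of the dataset. Unlike the cells above, this variant creates indicator columns only for ids that actually appear in the joined ratings.

```python
import pandas as pd

# Toy stand-ins for the ratings, users and movies tables (values are made up)
ratings_demo = pd.DataFrame({
    "user_id": [1, 2, 1, 3],
    "movie_id": [10, 10, 20, 30],  # movie 30 has no row in movies_demo
    "rating": [5, 3, 4, 2],
})
users_demo = pd.DataFrame({"user_id": [1, 2, 3], "age": [24, 53, 23]})
movies_demo = pd.DataFrame({
    "movie_id": [10, 20],
    "release_date": [1995, 1996],
    "comedy": [1, 0],
})

# Inner joins attach the side features and drop ratings whose movie is missing
merged = ratings_demo.merge(users_demo, on="user_id").merge(movies_demo, on="movie_id")

# One-hot indicator columns for the user and movie ids
user_onehot = pd.get_dummies(merged["user_id"], prefix="user", dtype=int)
movie_onehot = pd.get_dummies(merged["movie_id"], prefix="movie", dtype=int)

# Indicators first, then the remaining side features and the rating target
matrix_demo = pd.concat(
    [user_onehot, movie_onehot, merged.drop(columns=["user_id", "movie_id"])],
    axis=1,
)
print(matrix_demo)
```

Because the join is an inner join, the rating for the missing movie disappears without any explicit exception handling, which plays the role of the `dropna()` step above.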

In [17]:
X = matrix.drop('rating', axis=1)[:5000].values
scaler = StandardScaler()
X = scaler.fit_transform(X)
y = matrix['rating'][:5000].values

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
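
Note that in the two cells above `StandardScaler` is fitted on all 5,000 rows before the split, so the statistics of the test rows influence the scaling. One common alternative is to split first and fit the scaler on the training part only. The sketch below shows that order on synthetic data. The `*_demo` arrays, the `random_state` value and the other names are assumptions made for illustration, not the notebook's actual variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features
y_demo = rng.integers(1, 6, size=100)                   # synthetic ratings in 1..5

# Split first, so that the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True, random_state=0
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # mean and std come from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse the training statistics
```

Fixing `random_state` also makes the split, and therefore the reported MSE, reproducible between runs.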

## Model implementation

In [13]:
class FactorizationMachines:
    def __init__(self, n_features, n_factors, learning_rate=1e-2, n_epochs=50):
        self.w0 = 0  # global bias
        self.w = np.zeros(n_features)  # feature weights
        self.V = np.random.normal(scale=0.001, size=(n_features, n_factors))  # latent factors
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs

    def predict(self, X):
        return self.w0 + np.dot(X, self.w) + 0.5 * np.sum((X @ self.V)**2 - (X**2) @ (self.V**2), axis=1)

    def fit(self, X, y):
        for epoch in range(self.n_epochs):
            y_pred = self.predict(X)
            loss = mean_squared_error(y, y_pred)

            # Gradient for w0
            w0_grad = np.mean(y_pred - y)
            self.w0 -= self.learning_rate * w0_grad

            # Gradient for w
            w_grad = X.T @ (y_pred - y) / X.shape[0]
            self.w -= self.learning_rate * w_grad

            # Gradient for V: d y_pred / d v_if = x_i * (X @ V[:, f]) - v_if * x_i**2
            for f in range(self.V.shape[1]):
                for i in range(self.V.shape[0]):
                    v_if_grad = np.sum((y_pred - y) * (X[:, i] * (X @ self.V[:, f]) - X[:, i]**2 * self.V[i, f]))
                    self.V[i, f] -= self.learning_rate * v_if_grad / X.shape[0]


            print(f"Epoch {epoch + 1}/{self.n_epochs}, Loss: {loss:.4f}")
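
The interaction term in `predict` relies on rewriting the sum over feature pairs so that it needs no double loop over features. As a sanity check, the sketch below compares that vectorized expression with a naive pairwise sum on small random inputs. The helper names `fm_pairwise_naive` and `fm_pairwise_fast` and the `*_demo` arrays are hypothetical and exist only for this illustration.

```python
import numpy as np

def fm_pairwise_naive(x, V):
    """Sum over i < j of <v_i, v_j> * x_i * x_j, computed with explicit loops."""
    n = x.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.dot(V[i], V[j]) * x[i] * x[j]
    return total

def fm_pairwise_fast(X, V):
    """Same quantity for every row of X, using the expression from predict()."""
    return 0.5 * np.sum((X @ V) ** 2 - (X ** 2) @ (V ** 2), axis=1)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 6))  # 4 samples, 6 features
V_demo = rng.normal(size=(6, 3))  # 6 features, 3 latent factors

naive = np.array([fm_pairwise_naive(x, V_demo) for x in X_demo])
fast = fm_pairwise_fast(X_demo, V_demo)

# The two computations should agree up to floating-point error
print(np.allclose(naive, fast))
```

The naive version costs on the order of n² · k operations per sample, while the vectorized one costs on the order of n · k, which matters here because the design matrix has one column per user and per movie.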

In [19]:
n_samples = X_train.shape[0]
n_features = X_train.shape[1]
n_factors = 5

fm = FactorizationMachines(n_features=n_features, n_factors=n_factors)
fm.fit(X_train, y_train)

preds = fm.predict(X_test)
print("MSE:", mean_squared_error(y_test, preds))


Epoch 1/50, Loss: 14.0131
Epoch 2/50, Loss: 13.7107
Epoch 3/50, Loss: 13.4155
Epoch 4/50, Loss: 13.1275
Epoch 5/50, Loss: 12.8463
Epoch 6/50, Loss: 12.5719
Epoch 7/50, Loss: 12.3041
Epoch 8/50, Loss: 12.0425
Epoch 9/50, Loss: 11.7871
Epoch 10/50, Loss: 11.5376
Epoch 11/50, Loss: 11.2939
Epoch 12/50, Loss: 11.0558
Epoch 13/50, Loss: 10.8230
Epoch 14/50, Loss: 10.5955
Epoch 15/50, Loss: 10.3730
Epoch 16/50, Loss: 10.1554
Epoch 17/50, Loss: 9.9425
Epoch 18/50, Loss: 9.7340
Epoch 19/50, Loss: 9.5298
Epoch 20/50, Loss: 9.3298
Epoch 21/50, Loss: 9.1336
Epoch 22/50, Loss: 8.9412
Epoch 23/50, Loss: 8.7523
Epoch 24/50, Loss: 8.5667
Epoch 25/50, Loss: 8.3841
Epoch 26/50, Loss: 8.2045
Epoch 27/50, Loss: 8.0275
Epoch 28/50, Loss: 7.8528
Epoch 29/50, Loss: 7.6803
Epoch 30/50, Loss: 7.5097
Epoch 31/50, Loss: 7.3408
Epoch 32/50, Loss: 7.1731
Epoch 33/50, Loss: 7.0066
Epoch 34/50, Loss: 6.8408
Epoch 35/50, Loss: 6.6755
Epoch 36/50, Loss: 6.5103
Epoch 37/50, Loss: 6.3451
Epoch 38/50, Loss: 6.1796
Epoch