Use the MovieLens dataset: https://grouplens.org/datasets/movielens/latest/

Build recommendations (regression: predict the rating) using these features:
- TF-IDF on tags and genres
- Mean ratings (plus median, variance, etc.) of the user and of the movie

Evaluate RMSE on the test set.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [2]:
links = pd.read_csv('links.csv')
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')

Let's look at the contents of movies.

In [3]:
movies.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


There are no missing values.

Convert the genres to a TF-IDF representation.

In [5]:
def change_string(s):
    # Glue multi-word genres into one token (e.g. "Sci-Fi" -> "SciFi"), then split genres on '|'
    return s.replace(' ', '').replace('-', '').replace('|',' ')

movies["genres"] = movies["genres"].apply(change_string)
movies.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,2,Jumanji (1995),Adventure Children Fantasy
2,3,Grumpier Old Men (1995),Comedy Romance
3,4,Waiting to Exhale (1995),Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(movies["genres"])
X_train_counts

<9742x20 sparse matrix of type '<class 'numpy.int64'>'
	with 22084 stored elements in Compressed Sparse Row format>

In [7]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf

<9742x20 sparse matrix of type '<class 'numpy.float64'>'
	with 22084 stored elements in Compressed Sparse Row format>
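
As a side note, scikit-learn documents `TfidfVectorizer` as equivalent to `CountVectorizer` followed by `TfidfTransformer`. The sketch below checks this on a few made-up genre strings (the `genres` list is illustrative, not taken from the dataset), assuming default parameters for all three classes.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy genre strings in the same space-separated form as movies["genres"]
genres = [
    "Adventure Animation Children Comedy Fantasy",
    "Adventure Children Fantasy",
    "Comedy Romance",
    "Comedy",
]

# Two-step route: raw counts first, then TF-IDF weighting
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(genres))

# One-step route: counting and weighting in a single estimator
one_step = TfidfVectorizer().fit_transform(genres)

# Both routes should give the same matrix when defaults are used
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same, one_step.shape)
```

With non-default settings (for example a custom `norm` or `smooth_idf`), the same arguments would have to be passed on both routes for the results to match.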

In [8]:
genres_tfidf_df = pd.DataFrame(X_train_tfidf.toarray())
genres_tfidf_df.columns = ["c_" + str(x) for x in range(genres_tfidf_df.shape[1])]
genres_tfidf_df.shape

(9742, 20)

Let's look at the tags data.

In [9]:
tags

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978


In [10]:
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


Group the tags by movieId and convert them to a TF-IDF representation.

In [11]:
def change_string2(s):
    # Lowercase, glue multi-word tags into one token, then split tags on '|'
    return s.lower().replace(' ', '').replace('-', '').replace('|',' ')

tags_by_movie = tags.groupby("movieId")["tag"].agg(lambda x: change_string2("|".join(x))).reset_index()
tags_by_movie.head(5)

Unnamed: 0,movieId,tag
0,1,pixar pixar fun
1,2,fantasy magicboardgame robinwilliams game
2,3,moldy old
3,5,pregnancy remake
4,7,remake
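
To make the effect of `change_string2` concrete, here is a small worked example on made-up tags (the `toy_tags` frame is an assumption for illustration). Each multi-word tag collapses into a single token, so the vectorizer treats "magic board game" as one term rather than three.

```python
import pandas as pd

def change_string2(s):
    # Lowercase, glue multi-word tags into one token, then split tags on '|'
    return s.lower().replace(' ', '').replace('-', '').replace('|', ' ')

# Hypothetical tags for two movies
toy_tags = pd.DataFrame({
    "movieId": [2, 2, 2, 3],
    "tag": ["fantasy", "magic board game", "Robin Williams", "sci-fi"],
})

# Join each movie's tags with '|' so that spaces inside a tag can be removed safely
grouped = (
    toy_tags.groupby("movieId")["tag"]
    .agg(lambda x: change_string2("|".join(x)))
    .reset_index()
)
print(grouped)
```

One consequence of this design is that tags which differ only in spacing, hyphenation, or case end up as the same token.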


In [12]:
tags_vect = CountVectorizer()
tags_counts = tags_vect.fit_transform(tags_by_movie["tag"])
tags_counts

<1572x1472 sparse matrix of type '<class 'numpy.int64'>'
	with 3598 stored elements in Compressed Sparse Row format>

In [13]:
tags_tfidf_transform = TfidfTransformer()
tags_tfidf = tags_tfidf_transform.fit_transform(tags_counts)
tags_tfidf

<1572x1472 sparse matrix of type '<class 'numpy.float64'>'
	with 3598 stored elements in Compressed Sparse Row format>

In [14]:
tags_tfidf_df = pd.DataFrame(tags_tfidf.toarray())
tags_tfidf_df.shape

(1572, 1472)

Restore the mapping to movieId so that the result can be joined with the movies dataset.

In [15]:
tags_to_join = tags_by_movie.join(tags_tfidf_df).drop(columns=["tag"])
tags_to_join.head(5)

Unnamed: 0,movieId,0,1,2,3,4,5,6,7,8,...,1462,1463,1464,1465,1466,1467,1468,1469,1470,1471
0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Join movies and tags on movieId. Missing values, if any, can be filled with zeros: a zero TF-IDF weight simply means the term is absent, so the meaning of the representation is preserved.

In [16]:
movies_tags = movies.join(genres_tfidf_df).merge(tags_to_join, on='movieId', how="left").fillna(0)
movies_tags.head(5)

Unnamed: 0,movieId,title,genres,c_0,c_1,c_2,c_3,c_4,c_5,c_6,...,1462,1463,1464,1465,1466,1467,1468,1469,1470,1471
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,0.0,0.416846,0.516225,0.504845,0.267586,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure Children Fantasy,0.0,0.512361,0.0,0.620525,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy Romance,0.0,0.0,0.0,0.0,0.570915,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy Drama Romance,0.0,0.0,0.0,0.0,0.505015,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
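
The left join keeps every movie, and movies without any tags get NaN in all tag columns before `fillna(0)` is applied. A minimal sketch on toy frames (the frame names and the `tfidf_0`/`tfidf_1` column names are hypothetical) shows the pattern and counts how many rows were actually filled.

```python
import pandas as pd

# Toy stand-ins for the movies table and the per-movie tag TF-IDF table
movies_toy = pd.DataFrame({"movieId": [1, 2, 4], "title": ["A", "B", "C"]})
tags_toy = pd.DataFrame({
    "movieId": [1, 2],
    "tfidf_0": [0.7, 0.0],
    "tfidf_1": [0.0, 1.0],
})

# how="left" keeps movie 4 even though it has no tags
merged = movies_toy.merge(tags_toy, on="movieId", how="left")

# Count untagged movies before the gaps are hidden by fillna
n_untagged = int(merged["tfidf_0"].isna().sum())

# A zero weight means "term absent", which is what an untagged movie should get
filled = merged.fillna(0)
print(n_untagged)
print(filled)
```

Counting the NaN rows first is a cheap way to see how sparse the tag coverage is.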


Let's look at ratings.

In [17]:
ratings.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [18]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


There are no missing values.

Group the ratings by movieId and compute the mean rating together with its median and variance.

In [19]:
# Per-movie rating statistics; movieId becomes the index of rat_df
by_movie = ratings.groupby("movieId")["rating"]
rat_df = pd.DataFrame()
rat_df["rating"] = by_movie.mean()
rat_df["rating_median"] = by_movie.median()
# var() is NaN for movies with a single rating; those gaps are filled with the movie's last rating
rat_df["rating_var"] = by_movie.var().fillna(by_movie.last())
rat_df.head(5)

movieId,rating,rating_median,rating_var
1,3.92093,4.0,0.69699
2,3.431818,3.5,0.777419
3,3.259615,3.0,1.112651
4,2.357143,3.0,0.72619
5,3.071429,3.0,0.822917
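
The same statistics can be computed in one pass with pandas named aggregation. The sketch below uses a toy ratings frame (an assumption for illustration) and also shows why the variance needs a fill value: the sample variance of a single rating is undefined, so pandas returns NaN. Filling with 0.0, as done here, is one possible choice and differs from the cell above, which fills with the movie's last rating.

```python
import pandas as pd

# Toy ratings: movie 1 has three ratings, movie 2 has only one
toy = pd.DataFrame({
    "movieId": [1, 1, 1, 2],
    "rating": [4.0, 5.0, 3.0, 2.5],
})

# Named aggregation: output column name = aggregation function
stats = toy.groupby("movieId")["rating"].agg(
    rating="mean",
    rating_median="median",
    rating_var="var",
)

# Sample variance (ddof=1) of a single value is NaN
single_rating_var_is_nan = bool(pd.isna(stats.loc[2, "rating_var"]))

# Assumption: treat "no observed spread" as zero variance
stats_filled = stats.fillna({"rating_var": 0.0})
print(stats_filled)
```

Whichever fill value is used, it affects only movies with exactly one rating.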


Join with movies_tags.

In [20]:
movies_tags_rating = movies_tags.merge(rat_df, on='movieId', how="left")
movies_tags_rating.head(5)

Unnamed: 0,movieId,title,genres,c_0,c_1,c_2,c_3,c_4,c_5,c_6,...,1465,1466,1467,1468,1469,1470,1471,rating,rating_median,rating_var
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,0.0,0.416846,0.516225,0.504845,0.267586,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.92093,4.0,0.69699
1,2,Jumanji (1995),Adventure Children Fantasy,0.0,0.512361,0.0,0.620525,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.431818,3.5,0.777419
2,3,Grumpier Old Men (1995),Comedy Romance,0.0,0.0,0.0,0.0,0.570915,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.259615,3.0,1.112651
3,4,Waiting to Exhale (1995),Comedy Drama Romance,0.0,0.0,0.0,0.0,0.505015,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.357143,3.0,0.72619
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.071429,3.0,0.822917


Let's see how many movies ended up without a rating.

In [21]:
print(len(movies_tags_rating[movies_tags_rating["rating"].isna()]))

18


Only a few movies have no rating, so these rows can be dropped.

In [22]:
movies_tags_rating.drop(movies_tags_rating[movies_tags_rating["rating"].isna()].index, inplace=True)
movies_tags_rating.head(5)

Unnamed: 0,movieId,title,genres,c_0,c_1,c_2,c_3,c_4,c_5,c_6,...,1465,1466,1467,1468,1469,1470,1471,rating,rating_median,rating_var
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,0.0,0.416846,0.516225,0.504845,0.267586,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.92093,4.0,0.69699
1,2,Jumanji (1995),Adventure Children Fantasy,0.0,0.512361,0.0,0.620525,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.431818,3.5,0.777419
2,3,Grumpier Old Men (1995),Comedy Romance,0.0,0.0,0.0,0.0,0.570915,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.259615,3.0,1.112651
3,4,Waiting to Exhale (1995),Comedy Drama Romance,0.0,0.0,0.0,0.0,0.505015,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.357143,3.0,0.72619
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.071429,3.0,0.822917
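
Counting and dropping rows with a missing target can also be written with `isna().sum()` and `dropna(subset=...)`. This is a sketch on a toy frame (the data is made up); it returns a new frame instead of modifying the original in place.

```python
import numpy as np
import pandas as pd

# Toy frame: movie 2 has no rating
toy = pd.DataFrame({"movieId": [1, 2, 3], "rating": [3.9, np.nan, 3.2]})

# Number of rows with a missing rating
n_missing = int(toy["rating"].isna().sum())

# Keep only rows where the target is present
kept = toy.dropna(subset=["rating"])
print(n_missing, kept["movieId"].tolist())
```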


Do the final data preparation for the model and split the data into training and test sets.

In [23]:
movies_tags_rating.set_index("movieId", inplace=True)
Y = movies_tags_rating["rating"]
X = movies_tags_rating.drop(columns=["title", "genres", "rating"])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
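
One practical caveat: `X` mixes string column names (`c_0`, `rating_median`) with integer ones (`0` ... `1471`). To my knowledge, newer scikit-learn releases refuse to fit on a DataFrame whose column names are not all strings, so it may be necessary to convert them first. A minimal sketch on a toy frame (`X_toy` is hypothetical):

```python
import pandas as pd

# Toy feature frame with the same mix of string and integer column names
X_toy = pd.DataFrame({
    "c_0": [0.0, 1.0],
    0: [0.5, 0.0],
    "rating_median": [3.0, 4.0],
})

# Convert every column name to a string so the names have a single type
X_toy.columns = X_toy.columns.astype(str)
print(X_toy.columns.tolist())
```

In the notebook the equivalent line would be applied to `X` before `train_test_split`.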

Build a model that predicts a movie's mean rating from its genre and tag TF-IDF features together with rating_median and rating_var, and evaluate its RMSE on the test set.

In [24]:
model = RandomForestRegressor(random_state=42)
model.fit(X_train, Y_train)
print(f"RMSE on the test set: {mean_squared_error(Y_test, model.predict(X_test), squared=False)}.")

RMSE on the test set: 0.16687671037632537.
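
The `squared=False` argument is deprecated and later removed in recent scikit-learn releases, so a version-independent way to get RMSE is to take the square root of the MSE. The sketch below uses made-up numbers (not the notebook's predictions) and also computes the RMSE of a constant predictor as a reference point, which helps judge whether a model's RMSE is actually good.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted ratings
y_true = np.array([3.0, 4.0, 2.0, 5.0])
y_pred = np.array([2.5, 4.0, 3.0, 4.5])

# RMSE = sqrt(mean((y_true - y_pred)^2)); works on any scikit-learn version
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Reference: always predict one constant value
# (in the notebook this would be Y_train.mean(); here the toy mean is used)
constant_pred = np.full_like(y_true, y_true.mean())
baseline_rmse = float(np.sqrt(mean_squared_error(y_true, constant_pred)))

print(rmse, baseline_rmse)
```

A model is only useful to the extent that its RMSE is clearly below that of the constant predictor.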


CountVectorizer produces 1472 features for the tags. To increase the influence of the rating_median and rating_var features on the model, the number of features output by CountVectorizer can be reduced. With the hyperparameters min_df=0.008 and max_df=0.011, their number drops to 16. This experiment was run but gave no noticeable improvement: the RMSE was about the same as before the reduction. The experiment is not included in the report, to avoid hurting readability with duplicated cells.
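
For reference, float values of `min_df` and `max_df` are interpreted as proportions of documents. With 1572 tagged movies, min_df=0.008 and max_df=0.011 correspond to roughly 12.6 and 17.3 documents, so only tags that occur in 13 to 17 movies survive. The sketch below shows the same mechanism on a toy corpus; the documents and thresholds are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus of six "movies"; document frequencies:
# pixar=2, fun=2, remake=3, animation=1, comedy=1, classic=1
docs = [
    "pixar fun",
    "pixar animation",
    "remake",
    "remake comedy",
    "fun remake",
    "classic",
]

# No filtering: every distinct token becomes a feature
full = CountVectorizer().fit(docs)

# Floats are proportions of documents: keep terms with
# 0.3 * 6 = 1.8 <= document frequency <= 0.4 * 6 = 2.4
reduced = CountVectorizer(min_df=0.3, max_df=0.4).fit(docs)

print(len(full.vocabulary_), sorted(reduced.vocabulary_))
```

Here the rare tokens fall below `min_df` and "remake" exceeds `max_df`, leaving only "fun" and "pixar".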