### Homework:
### Recommender Systems No. 1

#### Денис Иванов

Use the MovieLens dataset.

Build recommendations (regression: predict the rating) from these features:
- TF-IDF on tags and genres
- Average ratings (plus median, variance, etc.) per user and per movie

Evaluate RMSE on the test set.
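
For reference, RMSE is the square root of the mean squared difference between true and predicted ratings. The helper below is a minimal sketch of that definition; the name `rmse` and the toy ratings are illustrative assumptions, not part of the assignment.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy ratings (made up for illustration): errors are 0.5, 0.0 and 1.0
example_rmse = rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0])
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones.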

In [102]:
import pandas as pd
import numpy as np
import re

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.neighbors import NearestNeighbors

%matplotlib inline

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from sklearn.linear_model import LinearRegression, SGDRegressor, RidgeCV, LassoCV
from sklearn.svm import SVR

from sklearn import model_selection
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, KFold, cross_val_score

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, StackingRegressor

from warnings import filterwarnings 
filterwarnings('ignore')

In [2]:
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')

### Movies

In [3]:
movies.head(1)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


The genre labels `Sci-Fi` and `(no genres listed)` get broken up when passed through the vectorizer, so we fix them first.
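
To see why these labels are a problem, the sketch below mimics the vectorizer's tokenization using the regular expression that scikit-learn documents as the default `token_pattern` of `CountVectorizer` (words of two or more characters). The helper name `tokenize` is an assumption made for illustration. The same splitting would apply to `Film-Noir`, which becomes `film` and `noir`.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(genres):
    """Lowercase a genre string, split on '|', and extract word tokens."""
    return TOKEN_PATTERN.findall(genres.lower().replace("|", " "))

raw_tokens = tokenize("Drama|Sci-Fi")                # the hyphen splits the genre in two
no_genre_tokens = tokenize("(no genres listed)")     # one label becomes three tokens
fixed_tokens = tokenize("Drama|Sci-Fi".replace("Sci-Fi", "SciFi"))
```

After the replacement each genre maps to a single token, so it gets a single TF-IDF column.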

In [4]:
movies.loc[(movies.genres.str.contains('Sci'))].head()
#movies.loc[movies.genres.str.contains('genr')].head()            

Unnamed: 0,movieId,title,genres
23,24,Powder (1995),Drama|Sci-Fi
28,29,"City of Lost Children, The (Cité des enfants p...",Adventure|Drama|Fantasy|Mystery|Sci-Fi
31,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
59,66,Lawnmower Man 2: Beyond Cyberspace (1996),Action|Sci-Fi|Thriller
68,76,Screamers (1995),Action|Sci-Fi|Thriller


In [16]:
movies['genres'] = movies.genres.replace('(no genres listed)', 'Unknown')
movies['genres'] = movies.genres.replace('Sci-Fi', 'SciFi', regex=True)

In [17]:
movies.loc[(movies.genres.str.contains('Sci'))].head()

Unnamed: 0,movieId,title,genres,year_of_film,epoch_of_film
23,24,Powder (1995),Drama|SciFi,1995,1.0
28,29,"City of Lost Children, The (Cité des enfants p...",Adventure|Drama|Fantasy|Mystery|SciFi,1995,1.0
31,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|SciFi|Thriller,1995,1.0
59,66,Lawnmower Man 2: Beyond Cyberspace (1996),Action|SciFi|Thriller,1996,1.0
68,76,Screamers (1995),Action|SciFi|Thriller,1995,1.0


Apply CountVectorizer and TF-IDF to the movie genres

In [18]:
movies_vect        = movies['genres'].str.lower().str.replace('|', ' ', regex=False)  # '|' is a literal separator, not a regex
prefix             = 'genre'

vect1              = CountVectorizer()
tfidf_transformer1 = TfidfTransformer() 

genres_vect        = vect1.fit_transform(movies_vect)
genres_tfidf       = tfidf_transformer1.fit_transform(genres_vect)


To refine the recommendations, add two features: the movie's release year and its epoch (the release year bucketed into coarse ranges of a decade or more)

In [19]:
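def set_year(row):
    # Take the last 4-digit group in the title as the release year,
    # e.g. "Twelve Monkeys (a.k.a. 12 Monkeys) (1995)" -> 1995
    z = re.findall(r'(\d{4})', row.title)
    if not z:
        return 9999  # sentinel value: no year found in the title
    return int(z[-1])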

movies['year_of_film']= movies.apply(set_year, axis=1)

In [20]:
movies['epoch_of_film'] = pd.cut(movies.year_of_film, 
                                [1000, 1980, 2000, 2010, 2020, 10000], 
                                labels=False, 
                                ).astype('float64') 
movies['epoch_of_film'].value_counts()

1.0    3583
2.0    2814
3.0    1684
0.0    1649
4.0      12
Name: epoch_of_film, dtype: int64
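
The following sketch shows how `pd.cut` with `labels=False` maps years to integer bin codes for the same bin edges. The sample years are made up for illustration. Intervals are closed on the right by default, so 1980 still falls into bin 0, and the 9999 sentinel lands in the last bin.

```python
import pandas as pd

bins = [1000, 1980, 2000, 2010, 2020, 10000]
years = pd.Series([1975, 1980, 1995, 2005, 2015, 9999])  # illustrative values

# labels=False returns the integer index of the bin instead of an Interval
epochs = pd.cut(years, bins, labels=False)
```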

Merge all movie features into one dataframe

In [21]:
movies_1 = pd.concat([movies.iloc[:,[0,3]], pd.DataFrame(genres_tfidf.toarray())], axis = 1)
movies_1.shape

(9742, 23)

In [22]:
movies_1.head(1)

Unnamed: 0,movieId,year_of_film,0,1,2,3,4,5,6,7,...,11,12,13,14,15,16,17,18,19,20
0,1,1995,0.0,0.416846,0.516225,0.504845,0.267586,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


###  tags

In [23]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


Drop rows without tags

In [24]:
tags = tags[~tags.tag.isna()]

In [25]:
tags.shape

(3683, 4)

Apply CountVectorizer and TF-IDF to the tags

In [27]:
vect2              = CountVectorizer()
tags_vect          = vect2.fit_transform(tags['tag'].str.lower())

tfidf_transformer2 = TfidfTransformer() 
tags_tfidf         = tfidf_transformer2.fit_transform(tags_vect)

Merge all tag features into one dataframe

In [28]:
tags_1 = pd.concat([tags.iloc[:,:2], pd.DataFrame(tags_tfidf.toarray())], axis = 1)
tags_1.shape

(3683, 1746)

In [29]:
tags_1.head(1)

Unnamed: 0,userId,movieId,0,1,2,3,4,5,6,7,...,1734,1735,1736,1737,1738,1739,1740,1741,1742,1743
0,2,60756,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
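
The merge in the ratings section refers to a table named `tags_2`, which the cells shown here do not define. One possible construction, and this is purely an assumption, is to collapse `tags_1` to one row per movie by averaging the tag vectors. The toy frame and the `_toy` names below are hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-in for tags_1: one row per (userId, movieId, tag) with TF-IDF columns
tags_1_toy = pd.DataFrame({
    "userId":  [2, 2, 7],
    "movieId": [60756, 60756, 89774],
    0: [0.0, 1.0, 0.0],
    1: [0.5, 0.0, 0.0],
    2: [0.0, 0.0, 1.0],
})

# Assumption: average the tag vectors of each movie (sum or max would also be plausible)
tags_2_toy = (tags_1_toy.drop(columns="userId")
              .groupby("movieId", as_index=False)
              .mean())
```

Having exactly one row per `movieId` matters for a later left merge with the ratings: several tag rows per movie would duplicate rating rows.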


### ratings

In [31]:
ratings.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224


Add extra features: the typical rating of each movie and the typical rating of each user (both computed here as the median)

In [32]:
rate_1 = pd.DataFrame(ratings.groupby(['movieId']).rating.median())
rate_2 = pd.DataFrame(ratings.groupby(['userId']).rating.median())

In [33]:
ratings = ratings.iloc[:,:3].merge(rate_1, how = 'left', left_on='movieId', right_on='movieId')
ratings = ratings.merge(rate_2, how = 'left', left_on='userId', right_on='userId')

In [41]:
ratings.head()

Unnamed: 0,userId,movieId,rating_x,rating_y,rating
0,1,1,4.0,4.0,5.0
1,1,3,4.0,3.0,5.0
2,1,6,4.0,4.0,5.0
3,1,47,5.0,4.0,5.0
4,1,50,5.0,4.5,5.0
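
The task statement also mentions mean and variance. A minimal sketch of how several per-user statistics could be computed in one pass with `groupby(...).agg(...)` is shown below. The toy data and the `user_rating_` prefix are assumptions, chosen so that the merged columns get readable names instead of `rating_x` / `rating_y`.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.0],
})

# One row per user with mean, median and (sample) variance of their ratings
user_stats = (ratings_toy.groupby("userId")["rating"]
              .agg(["mean", "median", "var"])
              .add_prefix("user_rating_")
              .reset_index())

# Attach the per-user statistics to every rating row
features = ratings_toy.merge(user_stats, on="userId", how="left")
```

The same pattern with `groupby("movieId")` would give the per-movie statistics.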


Merge all features onto the ratings in one dataframe

In [43]:
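# NOTE: tags_2 is not defined in the cells shown here; judging by the merge output,
# it holds one row of tag TF-IDF features per movieId
movies_findf = ratings.merge(tags_2, how = 'left', left_on=['movieId'], right_on=['movieId'])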
movies_findf = movies_findf.merge(movies_1, how = 'left', left_on=['movieId'], right_on=['movieId'])
movies_findf.head()

Unnamed: 0,userId,movieId,rating_x,rating_y,rating,0_x,1_x,2_x,3_x,4_x,...,11_y,12_y,13_y,14_y,15_y,16_y,17_y,18_y,19_y,20_y
0,1,1,4.0,4.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,3,4.0,3.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.821009,0.0,0.0,0.0,0.0,0.0
2,1,6,4.0,4.0,5.0,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.542042,0.0,0.0,0.0
3,1,47,5.0,4.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.823735,0.0,0.0,0.0,0.566975,0.0,0.0,0.0
4,1,50,5.0,4.5,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.685854,0.0,0.0,0.0,0.472071,0.0,0.0,0.0


In [44]:
movies_findf[movies_findf['0_x'].isna()] = movies_findf[movies_findf['0_x'].isna()].fillna(0)

### Построение модели

In [45]:
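# Plain ndarrays instead of np.matrix, which recent scikit-learn versions reject
x  = movies_findf.iloc[:, 4:].values   # features: user median rating, tag TF-IDF, year, genre TF-IDF
Y  = movies_findf.iloc[:, 3].values    # target: the movie's median rating (rating_y), as a 1-D array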
                      
x_train, x_test, y_train, y_test = train_test_split(x, 
                                                    Y, 
                                                    test_size = 0.2, 
                                                    random_state = 100)

Use stacking of several regression models to improve accuracy

In [46]:
estimators = [
              ('lasso',  LassoCV(random_state=100)),              
              ('sgd',    SGDRegressor()),
              ('ridge',  RidgeCV()),
              ('rfr',    RandomForestRegressor()),
              ('abr',    AdaBoostRegressor())
            ]

In [47]:
reg = StackingRegressor(
                        estimators=estimators,
                        final_estimator=GradientBoostingRegressor(random_state=100))

In [48]:
reg.fit(x_train, y_train)


StackingRegressor(estimators=[('lasso', LassoCV(random_state=100)),
                              ('sgd', SGDRegressor()),
                              ('ridge',
                               RidgeCV(alphas=array([ 0.1,  1. , 10. ]))),
                              ('rfr', RandomForestRegressor()),
                              ('abr', AdaBoostRegressor())],
                  final_estimator=GradientBoostingRegressor(random_state=100))

In [49]:
reg.score(x_test, y_test)

0.5620715727447314

The result (R² ≈ 0.56) leaves a lot to be desired, even with stacking

In [50]:
from math import sqrt
sqrt(mean_squared_error(y_test, reg.predict(x_test)))  # RMSE of the stacked model on the test set

0.40696132848806404

In [51]:
# RMSE of each base estimator on the test set
p1 = [sqrt(mean_squared_error(y_test, est.predict(x_test)))
      for est in reg.named_estimators_.values()]

pd.DataFrame(p1,index=reg.named_estimators_.keys()).style.format({'0': '{:.3f}'})

Unnamed: 0,0
lasso,0.541953
sgd,618581000000000.0
ridge,0.469308
rfr,0.347546
abr,0.604993
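
The `sgd` row stands out with an RMSE of about 6e14. A likely cause, stated here as an assumption rather than something verified on this data, is unscaled features: `year_of_film` is in the thousands while the TF-IDF weights lie in [0, 1], and scikit-learn's documentation notes that SGD is sensitive to feature scaling. The sketch below uses synthetic data (all names and numbers are made up) to show the usual remedy of putting a `StandardScaler` in front of `SGDRegressor`.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
year = rng.uniform(1900, 2020, size=n)    # large-magnitude feature, like year_of_film
tfidf_like = rng.uniform(0, 1, size=n)    # small-magnitude feature, like a TF-IDF weight
X = np.column_stack([year, tfidf_like])
y = 0.02 * (year - 1960) + tfidf_like + rng.normal(0, 0.1, size=n)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# Standardize every feature to zero mean and unit variance before SGD
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_sgd.fit(X_train, y_train)
rmse_scaled = np.sqrt(mean_squared_error(y_test, scaled_sgd.predict(X_test)))
```

In the notebook the same idea would mean replacing `SGDRegressor()` in `estimators` with such a pipeline.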


### A very simple recommender system

The recommendations are based on matching the following:

1. Genre match: the movie's genre profile is reduced to the 6 broad categories actually used, and each category gets an averaged TF-IDF score computed from the movie's individual genre scores.
2. Match on the movie's median rating level
 - For simplicity, the weights of the aggregated genres are multiplied by the movie's median rating
3. Match on the time (epoch) in which the movie was made

After that, we look for the largest cosine between the feature vectors.
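
The cosine step can be computed for all films at once. The following is a minimal NumPy sketch, not the notebook's own function: the name `top_k_by_cosine` and the toy feature matrix are assumptions. It normalizes every row to unit length, takes dot products with the query row, and explicitly excludes the query itself from the ranking.

```python
import numpy as np

def top_k_by_cosine(features, query_pos, k=5):
    """Return positions and scores of the k rows most similar to row `query_pos`."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1)
    norms[norms == 0] = 1.0              # avoid division by zero for all-zero rows
    unit = features / norms[:, None]     # every row scaled to unit length
    sims = unit @ unit[query_pos]        # cosine similarity to the query row
    sims[query_pos] = -np.inf            # never recommend the query item itself
    order = np.argsort(-sims)[:k]
    return order, sims[order]

toy = np.array([
    [4.0, 2.0, 0.0, 1.0],   # query
    [2.0, 1.0, 0.0, 0.5],   # same direction as the query
    [0.0, 0.0, 3.0, 0.0],   # orthogonal to the query
    [4.0, 2.0, 1.0, 1.0],   # close to the query
])
positions, scores = top_k_by_cosine(toy, query_pos=0, k=2)
```

Cosine similarity ignores vector length, so a film whose features are a scaled copy of the query's features still scores 1.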

In [53]:
q_base = pd.concat([movies_findf.iloc[:,:3],
                        movies_findf.iloc[:,1750:]], 
                       axis = 1).groupby(['movieId']).median().iloc[:,1:].round(2)

Unnamed: 0_level_0,rating_x,0_y,1_y,2_y,3_y,4_y,5_y,6_y,7_y,8_y,...,11_y,12_y,13_y,14_y,15_y,16_y,17_y,18_y,19_y,20_y
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,0.42,0.52,0.5,0.27,0.0,0.0,0.0,0.48,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [54]:
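# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array
q_base.columns = ['origin_rate_avg'] + list(vect1.get_feature_names_out())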

In [55]:
q_base.head(1) 

Unnamed: 0_level_0,origin_rate_avg,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,...,imax,musical,mystery,noir,romance,scifi,thriller,unknown,war,western
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,0.42,0.52,0.5,0.27,0.0,0.0,0.0,0.48,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [56]:
q_base['action_g'] =(q_base['action']   + q_base['adventure']  + q_base['war'] + q_base['western']) * q_base['origin_rate_avg'] / 4

q_base['child_g']  =(q_base['animation']+ q_base['children']   + q_base['fantasy']) * q_base['origin_rate_avg'] / 3
q_base['comedy_g'] =(q_base['comedy']   + q_base['romance']    + q_base['musical']) * q_base['origin_rate_avg'] / 3
q_base['serious_g']=(q_base['drama']    + q_base['documentary']                   ) * q_base['origin_rate_avg'] / 2
q_base['fantast_g']=(q_base['horror']   + q_base['mystery']    + q_base['scifi']  ) * q_base['origin_rate_avg'] / 3
q_base['crime_g']  =(q_base['thriller'] + q_base['noir']       + q_base['crime']  ) * q_base['origin_rate_avg'] / 3
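
As a sanity check, the arithmetic can be traced by hand for Toy Story, using the rounded genre weights and the median rating from the `q_base.head(1)` row shown earlier (genres not listed in the dictionary have weight 0.0 for this film). The variable names are illustrative.

```python
# Rounded TF-IDF weights and median rating of Toy Story, copied from the q_base row
rating = 4.0
w = {"adventure": 0.42, "animation": 0.52, "children": 0.5,
     "comedy": 0.27, "fantasy": 0.48}

# action + adventure + war + western, averaged over 4 genres and scaled by the rating
action_g = (0.0 + w["adventure"] + 0.0 + 0.0) * rating / 4
# animation + children + fantasy, averaged over 3 genres
child_g = (w["animation"] + w["children"] + w["fantasy"]) * rating / 3
# comedy + romance + musical, averaged over 3 genres
comedy_g = (w["comedy"] + 0.0 + 0.0) * rating / 3
```

Rounded to two decimals, these reproduce the `action_g`, `child_g` and `comedy_g` values in the merged `q_base` row for this film.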

In [57]:
q_base = pd.merge (q_base.iloc[:,[0,22,23,24,25,26,27]],
                   movies.iloc[:,[0,4,1]].set_index('movieId'),
                   how = 'left',
                   on = 'movieId')
q_base.head(1)                 

Unnamed: 0_level_0,origin_rate_avg,action_g,child_g,comedy_g,serious_g,fantast_g,crime_g,epoch_of_film,title
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,4.0,0.42,2.0,0.36,0.0,0.0,0.0,1.0,Toy Story (1995)


In [67]:
from IPython.display import display, HTML

In [80]:
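def recomend_film (code):
    # `code` is the positional row number of the reference film in q_base (not its movieId)
    code                = int(code)
    # NOTE: this filter compares `code` with movieId labels, so the reference film
    # itself can remain among the candidates (it then scores a cosine of 1.0)
    q_base_1            = q_base[q_base.index != code].copy()   # copy avoids SettingWithCopyWarning
    b                   = np.array(q_base.iloc[code][:8].astype('float64'))
    bLength             = np.linalg.norm( b )
    q_base_1['cos-s']     = np.nan
    
    # cosine similarity between the reference film and every candidate
    for c in range(len(q_base_1)):
        a               = np.array(q_base_1.iloc[c][:8].astype('float64'))
        aLength         = np.linalg.norm( a )
        q_base_1.iloc[c,9]= np.dot ( a, b ) / ( aLength * bLength )
    
    # keep title, rating and similarity; most similar and best rated first
    rec_films = q_base_1.iloc[:,[8,0,9]
                ]. sort_values(['cos-s','origin_rate_avg'
                ], ascending = False).head(6)
    
    print('Reference film for comparison:', q_base.iloc[code ,8])
    print('Recommended similar films:')
    display(HTML(rec_films.to_html()))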

In [113]:
recomend_film(8699)

Reference film for comparison: A Flintstones Christmas Carol (1994)
Recommended similar films:


Unnamed: 0_level_0,title,origin_rate_avg,cos-s
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
126088,A Flintstones Christmas Carol (1994),5.0,1.0
170777,There Once Was a Dog (1982),5.0,1.0
72692,Mickey's Once Upon a Christmas (1999),5.0,0.999996
745,Wallace & Gromit: A Close Shave (1995),4.0,0.999049
95858,For the Birds (2000),4.0,0.999049
170837,Life-Size (2000),4.0,0.999032


In [114]:
recomend_film(5)

Reference film for comparison: Heat (1995)
Recommended similar films:


Unnamed: 0_level_0,title,origin_rate_avg,cos-s
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
6,Heat (1995),4.0,1.0
1036,Die Hard (1988),4.0,1.0
4946,"Eye for an Eye, An (1981)",4.0,1.0
27022,Thursday (1998),4.0,1.0
26736,Riki-Oh: The Story of Ricky (Lik Wong) (1991),4.5,0.9997
27480,Dead or Alive 2: Tôbôsha (2000),4.5,0.9997


In [115]:
recomend_film(7137)

Reference film for comparison: Zombieland (2009)
Recommended similar films:


Unnamed: 0_level_0,title,origin_rate_avg,cos-s
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
71535,Zombieland (2009),4.0,1.0
26593,Hell Comes to Frogtown (1988),2.0,1.0
2450,Howard the Duck (1986),2.0,0.99986
33004,"Hitchhiker's Guide to the Galaxy, The (2005)",3.5,0.998359
67168,Dance of the Dead (2008),3.5,0.998359
3264,Buffy the Vampire Slayer (1992),2.5,0.996753
