In this session we will try a regression task. The data is in this same folder; we will train on a dataset of movies from IMDB.

Before training a model, the data has to be prepared:

- find or collect the data
- clean and preprocess it
- convert it into matrices


In [None]:
# import the required libraries
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline

# import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error


In [None]:
data = pd.read_csv('IMDB-Movie-Data.csv')
print(data.shape)

data.head(3)

(1000, 12)


Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0


## What to do with NaN?
There are three options:

In [None]:
# 1. Drop the rows that contain NaN (first check which columns have them)
print(data.isna().any())
data.shape

Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)     True
Metascore              True
dtype: bool


(1000, 12)

In [None]:
print(data.shape)
tmp = data.dropna()
tmp.shape

(1000, 12)


(838, 12)

In [None]:
# 2. Replace NaN with 0
print(data.shape)
tmp = data.fillna(0)
print(tmp.shape)

(1000, 12)
(1000, 12)


In [None]:
# 3. Replace NaN with the column mean

# compute the means of the columns that have missing values
meta_mean = data['Metascore'].mean()
rev_mean = data['Revenue (Millions)'].mean()

# replace the gaps with the means (assign back instead of a chained inplace call,
# which recent pandas versions may not apply to the original frame)
data['Metascore'] = data['Metascore'].fillna(meta_mean)
data['Revenue (Millions)'] = data['Revenue (Millions)'].fillna(rev_mean)

# check whether any NaN remain
data.isna().any()

Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool
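
The three options behave differently, which is easiest to see on a tiny frame. The sketch below uses made-up values and a hypothetical `toy` DataFrame (not the IMDB data) to compare how each option affects the row count and the column mean.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and gaps in one numeric column.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "score": [60.0, np.nan, 80.0, np.nan],
})

dropped = toy.dropna()                                     # option 1: fewer rows, nothing invented
zero_filled = toy.fillna({"score": 0})                     # option 2: keeps rows, pulls the mean down
mean_filled = toy.fillna({"score": toy["score"].mean()})   # option 3: keeps rows and the mean

print(len(dropped), zero_filled["score"].mean(), mean_filled["score"].mean())
```

Dropping rows loses data, zero-filling distorts the distribution, and mean-filling keeps the mean but shrinks the variance; which trade-off is acceptable depends on the task.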

## Data preparation

We will try to predict a movie's rating from its cast, genres, year, runtime, number of votes, box-office revenue and Metascore.

The "Rating" column becomes the **target variable** (y)<br>
The remaining data form the **feature matrix** (X)

In [None]:
data.Actors

0      Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
1      Noomi Rapace, Logan Marshall-Green, Michael Fa...
2      James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
3      Matthew McConaughey,Reese Witherspoon, Seth Ma...
4      Will Smith, Jared Leto, Margot Robbie, Viola D...
                             ...                        
995    Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...
996    Lauren German, Heather Matarazzo, Bijou Philli...
997    Robert Hoffman, Briana Evigan, Cassie Ventura,...
998    Adam Pally, T.J. Miller, Thomas Middleditch,Sh...
999    Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...
Name: Actors, Length: 1000, dtype: object

In [None]:
data.Genre

0       Action,Adventure,Sci-Fi
1      Adventure,Mystery,Sci-Fi
2               Horror,Thriller
3       Animation,Comedy,Family
4      Action,Adventure,Fantasy
                 ...           
995         Crime,Drama,Mystery
996                      Horror
997         Drama,Music,Romance
998            Adventure,Comedy
999       Comedy,Family,Fantasy
Name: Genre, Length: 1000, dtype: object

In [None]:
# prepare the actor lists
data["text"] = data.Actors.apply(lambda x: x.lower()) 

data["text"]

0      chris pratt, vin diesel, bradley cooper, zoe s...
1      noomi rapace, logan marshall-green, michael fa...
2      james mcavoy, anya taylor-joy, haley lu richar...
3      matthew mcconaughey,reese witherspoon, seth ma...
4      will smith, jared leto, margot robbie, viola d...
                             ...                        
995    chiwetel ejiofor, nicole kidman, julia roberts...
996    lauren german, heather matarazzo, bijou philli...
997    robert hoffman, briana evigan, cassie ventura,...
998    adam pally, t.j. miller, thomas middleditch,sh...
999    kevin spacey, jennifer garner, robbie amell,ch...
Name: text, Length: 1000, dtype: object

In [None]:
# and here the genres
data["text1"] = data.Genre.apply(lambda x: x.lower()) 

data["text1"]

0       action,adventure,sci-fi
1      adventure,mystery,sci-fi
2               horror,thriller
3       animation,comedy,family
4      action,adventure,fantasy
                 ...           
995         crime,drama,mystery
996                      horror
997         drama,music,romance
998            adventure,comedy
999       comedy,family,fantasy
Name: text1, Length: 1000, dtype: object
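
Since the genre column holds a small, closed set of labels, a simpler alternative to an embedding is a multi-hot encoding. This is a different technique from the Doc2Vec approach used in this notebook, shown only as a sketch on a few made-up rows; the names `genres` and `genre_flags` are illustrative.

```python
import pandas as pd

# Three made-up rows in the same comma-separated format as the Genre column.
genres = pd.Series(["action,adventure,sci-fi", "horror,thriller", "horror"])

# One 0/1 column per distinct genre; a row can have several ones.
genre_flags = genres.str.get_dummies(sep=",")

print(genre_flags)
```

Each resulting column has a direct meaning (for example, "is this a horror movie"), which makes the coefficients of a linear model easier to interpret.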

In [None]:
data.text.values

array(['chris pratt, vin diesel, bradley cooper, zoe saldana',
       'noomi rapace, logan marshall-green, michael fassbender, charlize theron',
       'james mcavoy, anya taylor-joy, haley lu richardson, jessica sula',
       'matthew mcconaughey,reese witherspoon, seth macfarlane, scarlett johansson',
       'will smith, jared leto, margot robbie, viola davis',
       'matt damon, tian jing, willem dafoe, andy lau',
       'ryan gosling, emma stone, rosemarie dewitt, j.k. simmons',
       'essie davis, andrea riseborough, julian barratt,kenneth branagh',
       'charlie hunnam, robert pattinson, sienna miller, tom holland',
       'jennifer lawrence, chris pratt, michael sheen,laurence fishburne',
       'eddie redmayne, katherine waterston, alison sudol,dan fogler',
       'taraji p. henson, octavia spencer, janelle monáe,kevin costner',
       'felicity jones, diego luna, alan tudyk, donnie yen',
       "auli'i cravalho, dwayne johnson, rachel house, temuera morrison",
       'anne

In [None]:
data.text1.values

array(['action,adventure,sci-fi', 'adventure,mystery,sci-fi',
       'horror,thriller', 'animation,comedy,family',
       'action,adventure,fantasy', 'action,adventure,fantasy',
       'comedy,drama,music', 'comedy', 'action,adventure,biography',
       'adventure,drama,romance', 'adventure,family,fantasy',
       'biography,drama,history', 'action,adventure,sci-fi',
       'animation,adventure,comedy', 'action,comedy,drama',
       'animation,adventure,comedy', 'biography,drama,history',
       'action,thriller', 'biography,drama', 'drama,mystery,sci-fi',
       'adventure,drama,thriller', 'drama', 'crime,drama,horror',
       'animation,adventure,comedy', 'action,adventure,sci-fi', 'comedy',
       'action,adventure,drama', 'horror,thriller', 'comedy',
       'action,adventure,drama', 'comedy', 'drama,thriller',
       'action,adventure,sci-fi', 'action,adventure,comedy',
       'action,horror,sci-fi', 'action,adventure,sci-fi',
       'adventure,drama,sci-fi', 'action,adventure,fant

In [None]:
input_text = list(data.text.values)

In [None]:
input_text1 = list(data.text1.values)

In [None]:
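# NOTE: TaggedDocument expects `words` to be a list of tokens; a raw string is iterated character by character
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(input_text)]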
documents[10:12]

[TaggedDocument(words='eddie redmayne, katherine waterston, alison sudol,dan fogler', tags=[10]),
 TaggedDocument(words='taraji p. henson, octavia spencer, janelle monáe,kevin costner', tags=[11])]

In [None]:
documents1 = [TaggedDocument(doc, [i]) for i, doc in enumerate(input_text1)]
documents1[10:12]

[TaggedDocument(words='adventure,family,fantasy', tags=[10]),
 TaggedDocument(words='biography,drama,history', tags=[11])]
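
The `words` field of a `TaggedDocument` is meant to hold a list of tokens, whereas the cells above pass whole strings. A minimal sketch of one way to tokenize these comma-separated strings follows; `split_names` is a hypothetical helper, and treating each full name (or genre) as a single token is an assumption, not the only reasonable choice.

```python
def split_names(text):
    """Split a comma-separated string into clean tokens, one per name."""
    return [part.strip() for part in text.split(",") if part.strip()]


sample = "matthew mcconaughey,reese witherspoon, seth macfarlane, scarlett johansson"
tokens = split_names(sample)
print(tokens)

# A token list like this could then be passed as TaggedDocument(tokens, [i]).
```

Stripping whitespace matters here because the source data separates names inconsistently, sometimes with `, ` and sometimes with a bare `,`.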

Train the models on the actor lists and the genres (feel free to vary the parameters)

In [None]:
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)



In [None]:
model1 = Doc2Vec(documents1, vector_size=5, window=2, min_count=1, workers=4)



In [None]:
model.save("D2V.model") # save the model

In [None]:
model1.save("D2V.model1") # save the model

In [None]:
# this is how to look at the vectors of the texts the model was trained on
# the index in [] after documents is the index of the text in the dataset

model[documents[0].tags[0]]


array([0.008977  , 0.07020255, 0.03544287, 0.07257991, 0.03677884],
      dtype=float32)

In [None]:
model1[documents1[0].tags[0]]

array([0.04987272, 0.09646527, 0.05061137, 0.06437829, 0.00872933],
      dtype=float32)

Now the vectors need to be added to the dataset alongside the other features

In [None]:
# build a list with a vector for each text
vectors = []
for x in documents:
    vec = list(model[x.tags][0])
    vectors.append(vec)

In [None]:
vectors1 = []
for x in documents1:
    vec1 = list(model1[x.tags][0])
    vectors1.append(vec1)

In [None]:
# this gives a dataframe where each vector component sits in its own column
split_df = pd.DataFrame(vectors,
                        columns=['v1', 'v2', 'v3','v4',"v5"])

split_df


Unnamed: 0,v1,v2,v3,v4,v5
0,0.008977,0.070203,0.035443,0.072580,0.036779
1,-0.059852,0.067650,-0.018647,-0.045910,-0.042930
2,-0.052643,0.039504,0.097905,-0.099311,-0.004637
3,0.023638,-0.042326,0.075104,0.068815,-0.065641
4,0.137344,0.000145,-0.054334,-0.065204,0.014699
...,...,...,...,...,...
995,0.056886,0.085300,-0.022210,0.017304,0.011205
996,0.109090,-0.025286,-0.031353,-0.003599,-0.067574
997,0.000899,0.138512,0.019535,-0.021260,0.030467
998,0.090195,0.091013,-0.056516,0.053828,-0.115001


In [None]:
split_df1 = pd.DataFrame(vectors1,
                        columns=['v6', 'v7', 'v8', 'v9', "v10"])

split_df1


Unnamed: 0,v6,v7,v8,v9,v10
0,0.049873,0.096465,0.050611,0.064378,0.008729
1,-0.019248,0.102725,-0.000538,-0.068627,-0.074150
2,-0.044217,0.036046,0.091019,-0.102154,0.006679
3,0.029136,-0.038013,0.077169,0.071650,-0.073289
4,0.109567,-0.032476,-0.079647,-0.054894,0.062066
...,...,...,...,...,...
995,0.084563,0.112261,-0.012664,-0.003644,-0.025953
996,0.083413,-0.057065,-0.055301,0.015388,-0.019769
997,-0.042089,0.089231,-0.008513,-0.002862,0.094790
998,0.061715,0.057574,-0.077713,0.067690,-0.065399


In [None]:
# now join it to the main dataframe
result = data.join(split_df, how='left')

result1=result.join(split_df1, how='left')
result1.shape

(1000, 24)

In [None]:
result1

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,...,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,...,0.008977,0.070203,0.035443,0.072580,0.036779,0.049873,0.096465,0.050611,0.064378,0.008729
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,...,-0.059852,0.067650,-0.018647,-0.045910,-0.042930,-0.019248,0.102725,-0.000538,-0.068627,-0.074150
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,...,-0.052643,0.039504,0.097905,-0.099311,-0.004637,-0.044217,0.036046,0.091019,-0.102154,0.006679
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,...,0.023638,-0.042326,0.075104,0.068815,-0.065641,0.029136,-0.038013,0.077169,0.071650,-0.073289
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,...,0.137344,0.000145,-0.054334,-0.065204,0.014699,0.109567,-0.032476,-0.079647,-0.054894,0.062066
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,Secret in Their Eyes,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,...,0.056886,0.085300,-0.022210,0.017304,0.011205,0.084563,0.112261,-0.012664,-0.003644,-0.025953
996,997,Hostel: Part II,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,...,0.109090,-0.025286,-0.031353,-0.003599,-0.067574,0.083413,-0.057065,-0.055301,0.015388,-0.019769
997,998,Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,...,0.000899,0.138512,0.019535,-0.021260,0.030467,-0.042089,0.089231,-0.008513,-0.002862,0.094790
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,...,0.090195,0.091013,-0.056516,0.053828,-0.115001,0.061715,0.057574,-0.077713,0.067690,-0.065399


In [None]:


data_sm = result1[['Runtime (Minutes)',"Year",
                'Rating', 'Votes',
                'Revenue (Millions)','Metascore',"v1","v2","v3","v4","v5", 'v6', 'v7', 'v8', 'v9', 'v10']
              ]


data_sm.head(3)

Unnamed: 0,Runtime (Minutes),Year,Rating,Votes,Revenue (Millions),Metascore,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10
0,121,2014,8.1,757074,333.13,76.0,0.008977,0.070203,0.035443,0.07258,0.036779,0.049873,0.096465,0.050611,0.064378,0.008729
1,124,2012,7.0,485820,126.46,65.0,-0.059852,0.06765,-0.018647,-0.04591,-0.04293,-0.019248,0.102725,-0.000538,-0.068627,-0.07415
2,117,2016,7.3,157606,138.12,62.0,-0.052643,0.039504,0.097905,-0.099311,-0.004637,-0.044217,0.036046,0.091019,-0.102154,0.006679


## Preparing the matrices

In [None]:
# define X and y

X = data_sm.drop(["Rating"],axis=1).values 

display(X, X.shape)

array([[ 1.21000000e+02,  2.01400000e+03,  7.57074000e+05, ...,
         5.06113730e-02,  6.43782914e-02,  8.72932747e-03],
       [ 1.24000000e+02,  2.01200000e+03,  4.85820000e+05, ...,
        -5.38208638e-04, -6.86266422e-02, -7.41500184e-02],
       [ 1.17000000e+02,  2.01600000e+03,  1.57606000e+05, ...,
         9.10191163e-02, -1.02154262e-01,  6.67934865e-03],
       ...,
       [ 9.80000000e+01,  2.00800000e+03,  7.06990000e+04, ...,
        -8.51280894e-03, -2.86246091e-03,  9.47902873e-02],
       [ 9.30000000e+01,  2.01400000e+03,  4.88100000e+03, ...,
        -7.77129903e-02,  6.76898435e-02, -6.53994083e-02],
       [ 8.70000000e+01,  2.01600000e+03,  1.24350000e+04, ...,
         6.89533129e-02,  2.44382247e-02, -1.01444602e-01]])

(1000, 15)

In [None]:
data_sm.isna().any()

Runtime (Minutes)     False
Year                  False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
v1                    False
v2                    False
v3                    False
v4                    False
v5                    False
v6                    False
v7                    False
v8                    False
v9                    False
v10                   False
dtype: bool

In [None]:
y = data_sm['Rating'].values 
y.shape

(1000,)

Sometimes it is useful to [normalize](https://en.wikipedia.org/wiki/Normalization_(statistics)) the data: this helps when features are expressed in different units of measurement.
StandardScaler is used for this; it rescales each feature to zero mean and unit variance.
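
What the scaler computes can be checked by hand: for each column, subtract the mean and divide by the standard deviation. The sketch below does this on a small made-up matrix (`X_toy` is illustrative, not the notebook's `X`) and compares the result with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up columns on very different scales (think minutes and votes).
X_toy = np.array([[90.0, 1000.0], [120.0, 50000.0], [150.0, 700000.0]])

# z = (x - mean) / std, computed per column with the population std (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
```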

Before normalization:

In [None]:
list(X[0])

[121.0,
 2014.0,
 757074.0,
 333.13,
 76.0,
 0.008977003395557404,
 0.0702025517821312,
 0.03544287011027336,
 0.07257990539073944,
 0.03677884489297867,
 0.04987271502614021,
 0.09646527469158173,
 0.05061137303709984,
 0.0643782913684845,
 0.008729327470064163]

In [None]:
# apply the standard scaler
sc = StandardScaler()

X_train, X_test, y_train, y_test = train_test_split(sc.fit_transform(X), y, random_state=75)

After:

In [None]:
list(sc.fit_transform(X)[0])

[0.4163497512303056,
 0.37979525138136244,
 3.1126899627963738,
 2.5961363010556906,
 1.0233613578368184,
 -0.3540382934158426,
 0.5679551687556464,
 0.2170235738014099,
 1.4687431444602803,
 1.1905037762079025,
 0.31386401396818725,
 1.1602466448264703,
 0.5912188894033786,
 1.343995547315544,
 0.5918845468111715]

Now the data is easier to work with and to train on
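
One caveat: fitting the scaler on the full matrix before splitting lets test-set statistics leak into the training features. A common alternative, sketched here on synthetic data with illustrative names (`X_demo`, `y_demo`, `scaler`), is to split first and fit the scaler on the training part only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(200, 3))  # synthetic features
y_demo = rng.normal(size=200)                              # synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=75)

scaler = StandardScaler().fit(X_tr)     # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)    # test rows reuse the training mean and std
```

On a dataset of this size the difference in metrics is likely to be small, but the split-then-fit order keeps the test set a fair stand-in for unseen data.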

In [None]:
# define the regression model
# the regularization strength can be varied with the alpha parameter
regressor = Ridge(alpha=0.01) 


# train
regressor.fit(X_train, y_train)

Ridge(alpha=0.01)

In [None]:
# let's predict the results for the test set

y_preds = regressor.predict(X_test)

### Evaluating the results

As the metric we will use the [mean absolute error](https://www.youtube.com/watch?v=ZejnwbcU8nw). It shows the deviation from the correct answer in the same units as the target.

*(in general there are [various ways](https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b))*
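
To make the metrics concrete, here is a small worked example that computes MAE, MSE and RMSE by hand with NumPy. The ratings and predictions are made up for illustration.

```python
import numpy as np

y_true = np.array([7.0, 6.0, 8.0])   # made-up ratings
y_hat = np.array([6.5, 6.0, 9.0])    # made-up predictions

errors = y_true - y_hat
mae = np.mean(np.abs(errors))        # average size of the miss, in rating points
mse = np.mean(errors ** 2)           # squaring penalizes large misses more
rmse = np.sqrt(mse)                  # back in rating points

print(mae, mse, rmse)
```

Here the single one-point miss dominates MSE and RMSE, while MAE weighs all misses equally.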

In [None]:
mean_absolute_error(y_test, y_preds) 

0.46833119500208864

In [None]:
mean_squared_error(y_test, y_preds)

0.41160824550280156

In [None]:
from math import sqrt
sqrt(mean_squared_error(y_test, y_preds)) 

0.6415670233910106

Try different values of the regularization parameter alpha when training the model. How do they affect the error?
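
One way to run such an experiment is a simple loop over candidate values. The sketch below uses synthetic data and illustrative names (`X_syn`, `y_syn`, `mae_by_alpha`), so the numbers it prints say nothing about the IMDB dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 5))                           # synthetic features
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 0.0])            # made-up coefficients
y_syn = X_syn @ true_coef + rng.normal(scale=0.5, size=300)

X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, random_state=75)

mae_by_alpha = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X_a, y_a)
    mae_by_alpha[alpha] = mean_absolute_error(y_b, ridge.predict(X_b))

for alpha, mae in mae_by_alpha.items():
    print(f"alpha={alpha:<6} MAE={mae:.3f}")
```

In practice alpha is usually chosen on a validation split or by cross-validation rather than on the test set, so that the final test error stays an honest estimate.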

In [None]:
# Lasso regression
regressor1 = Lasso(alpha=0.01) 
regressor1.fit(X_train, y_train)
y_preds1 = regressor1.predict(X_test)


In [None]:
mean_absolute_error(y_test, y_preds1) 

0.4598751883201147

In [None]:
mean_squared_error(y_test, y_preds1)

0.4050405163303119

In [None]:
sqrt(mean_squared_error(y_test, y_preds1)) 

0.636427934907254

In [None]:
# Linear regression
regressor2 = LinearRegression() 
regressor2.fit(X_train, y_train)
y_preds2 = regressor2.predict(X_test)

In [None]:
mean_absolute_error(y_test, y_preds2) 

0.46838130580125337

In [None]:
mean_squared_error(y_test, y_preds2)

0.4116381833188707

In [None]:
sqrt(mean_squared_error(y_test, y_preds2)) 

0.6415903547582917

In [None]:
# The training data was extended with vector features built from the "Genre" and "Actors" columns, and I dropped the "Description" data because, in my view,
# a movie's description does not radically affect its rating, so it can be left out when training the model. The best alpha turned out to be
# 0.01, with random_state=75. The MAE, MSE and RMSE metrics indicate that the model learns poorly: neither changing the hyperparameters nor more careful
# preprocessing of the training data changed the metrics much (they fluctuated within about ±0.01). Conclusion: linear regression, Lasso and Ridge models are poor at
# predicting on a dataset where qualitative features, even vectorized ones, appear alongside quantitative ones.
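
Whether an error is large or small is easier to judge against a baseline that ignores the features entirely. A minimal sketch, assuming made-up ratings and using scikit-learn's `DummyRegressor` to always predict the training mean:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

y_fit = np.array([6.0, 7.0, 8.0, 7.0])   # made-up training ratings
y_eval = np.array([6.5, 7.5, 5.0])       # made-up test ratings

# The baseline ignores its inputs, so placeholder feature columns are enough.
baseline = DummyRegressor(strategy="mean").fit(np.zeros((len(y_fit), 1)), y_fit)
baseline_mae = mean_absolute_error(y_eval, baseline.predict(np.zeros((len(y_eval), 1))))

print(baseline_mae)
```

Running the same comparison on the notebook's `y_train` and `y_test` would show how much of the error the regression models actually remove relative to predicting the average rating.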