Задание 2

Вам необходимо построить модель, которая на основании данных, поступающих каждую минуту, определяют качество продукции, производимое на обжиговой машине.

Обжиговая машина представляет собой агрегат, состоящий из 5 одинаковых по размеру камер, в каждой камере установлено по 3 датчика температур. Кроме этого, для данной задачи Вы собрали данные о высоте слоя сырья и его влажности. Высота слоя и влажность измеряются при входе сырья в машину. Сырье проходит через обжиговую машину за час.

Данные с показателями работы обжиговой машины содержатся в файле X_data.csv.

Качество продукции измеряется в лаборатории по пробам, которые забираются каждый час, данные по известным анализам содержатся в файле Y_train.csv. В файле указано время забора пробы, проба забирается на выходе из обжиговой машины.


Вы договорились с заказчиком, что оценкой модели будет являться показатель MAE, для оценки модели необходимо сгенерировать предсказания за период, указанный в файле Y_submit.csv (5808 предиктов).



In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from catboost import CatBoostRegressor, cv, Pool

%matplotlib inline

In [3]:
df = pd.read_csv('X_data.csv', sep=';', index_col=0, parse_dates=True)
df.head(3)

Unnamed: 0,T_data_1_1,T_data_1_2,T_data_1_3,T_data_2_1,T_data_2_2,T_data_2_3,T_data_3_1,T_data_3_2,T_data_3_3,T_data_4_1,T_data_4_2,T_data_4_3,T_data_5_1,T_data_5_2,T_data_5_3,H_data,AH_data
2015-01-01 00:00:00,212,210,211,347,353,347,474,473,481,346,348,355,241,241,243,167.85,9.22
2015-01-01 00:01:00,212,211,211,346,352,346,475,473,481,349,348,355,241,241,243,162.51,9.22
2015-01-01 00:02:00,212,211,211,345,352,346,476,473,481,352,349,355,242,241,242,164.99,9.22


In [4]:
# Признаки и метки для обучения

Y = pd.read_csv('Y_train.csv', sep=';', index_col=0, header=None, parse_dates=True)
X = df[df.index.isin(Y.index)]
X.shape, Y.shape

((29184, 17), (29184, 1))

In [5]:
Y.head()

Unnamed: 0_level_0,1
0,Unnamed: 1_level_1
2015-01-04 00:05:00,392
2015-01-04 01:05:00,384
2015-01-04 02:05:00,393
2015-01-04 03:05:00,399
2015-01-04 04:05:00,400


### Исследование признаков

Все признаки являются вещественными. Пропусков нет. Таблица корреляции показывает линейную зависимость между показаниями датчиков, установленных в одной камере, что закономерно.

In [89]:
X.corr()

Unnamed: 0,T_data_1_1,T_data_1_2,T_data_1_3,T_data_2_1,T_data_2_2,T_data_2_3,T_data_3_1,T_data_3_2,T_data_3_3,T_data_4_1,T_data_4_2,T_data_4_3,T_data_5_1,T_data_5_2,T_data_5_3,H_data,AH_data
T_data_1_1,1.0,0.659668,0.649625,-0.005235,0.000708,0.000122,-0.010778,-0.007205,-0.009579,0.000628,-0.016972,-0.005507,-0.00556,-0.00618,-0.015449,-0.019331,-0.001418
T_data_1_2,0.659668,1.0,0.690438,0.00871,0.009286,0.011731,-0.002342,-0.013459,-0.011845,-0.001686,-0.020158,-0.007527,-0.014022,-0.018872,-0.023108,-0.017738,-0.000291
T_data_1_3,0.649625,0.690438,1.0,0.002835,0.007235,0.011749,-0.006373,-0.007384,-0.010108,0.005209,-0.003454,0.000794,-0.021769,-0.030411,-0.035314,-0.01202,-0.002702
T_data_2_1,-0.005235,0.00871,0.002835,1.0,0.35537,0.389946,-0.010062,-0.007176,-0.004971,0.021294,0.010942,0.025101,0.002258,-0.00957,-0.003175,-0.001611,-0.001315
T_data_2_2,0.000708,0.009286,0.007235,0.35537,1.0,0.40436,-0.023292,-0.000592,-0.017482,0.008122,-0.00591,0.014449,0.010317,0.008459,0.008991,0.013029,0.002594
T_data_2_3,0.000122,0.011731,0.011749,0.389946,0.40436,1.0,-0.000805,-0.000372,0.001864,0.010093,0.006046,0.013371,0.01039,-2.8e-05,0.011401,0.008324,0.002564
T_data_3_1,-0.010778,-0.002342,-0.006373,-0.010062,-0.023292,-0.000805,1.0,0.527216,0.558841,-0.025132,-0.025089,-0.015367,0.007238,0.008074,0.00319,0.018755,-0.005755
T_data_3_2,-0.007205,-0.013459,-0.007384,-0.007176,-0.000592,-0.000372,0.527216,1.0,0.540306,-0.018672,-0.031407,-0.01987,0.005811,0.001764,0.008637,0.013194,0.001493
T_data_3_3,-0.009579,-0.011845,-0.010108,-0.004971,-0.017482,0.001864,0.558841,0.540306,1.0,-0.010582,-0.012529,-0.011642,-0.009149,0.003999,-0.001516,0.010508,0.001088
T_data_4_1,0.000628,-0.001686,0.005209,0.021294,0.008122,0.010093,-0.025132,-0.018672,-0.010582,1.0,0.414581,0.421842,-0.005767,-0.006066,0.001401,-0.011916,0.000479


Обучим и оценим модели SGD-регрессии, k-ближайших соседей и градиентного бустинга. Признаки масштабируем. Оцениваем модель на 3-фолдах кроссвалидации с перемешиванием и валидационной выборке. Для оценки используем MAE.

Разобьем обучающую выборку на 2 части в пропорции 7/3 с перемешиванием.

In [6]:
x_train, x_valid, y_train, y_valid = train_test_split(X, Y, test_size=0.3)
x_train.shape, x_valid.shape

((20428, 17), (8756, 17))

### SGD-регрессор

In [59]:

cross_val_results = []
sgdmodel = SGDRegressor(random_state=32)
scaler = StandardScaler()
sgd_pipeline = Pipeline([('scaler', scaler), ('estimator', sgdmodel)])
kf = KFold(n_splits = 3, shuffle=True, random_state=32)
for train_indices, test_indices in kf.split(x_train):
    sgd_pipeline.fit(x_train.values[train_indices], np.ravel(y_train)[train_indices])
    cross_val_results.append(mean_absolute_error(sgd_pipeline.predict(x_train.values[test_indices]), 
                                                 np.ravel(y_train)[test_indices]))

print('Средняя абсолютная ошибка на кросс-валидации: %f' % np.mean(cross_val_results))

Средняя абсолютная ошибка на кросс-валидации: 15.599230


In [60]:
sgd_pipeline.fit(x_train.values, np.ravel(y_train))
print('Средняя абсолютная ошибка на отложенной выборке: %f' 
      % mean_absolute_error(pipeline.predict(x_valid), y_valid))

Средняя абсолютная ошибка на отложенной выборке: 16.480262


In [36]:
# Предсказанные метки

pipeline.predict(X.values[test_indices])

array([354.74456372, 372.22854195, 401.3326712 , ..., 477.25866122,
       483.71824358, 465.97613312])

### kNN

In [62]:
cross_val_results = []
knnmodel = KNeighborsRegressor()
scaler = StandardScaler()
knn_pipeline = Pipeline([('scaler', scaler), ('estimator', knnmodel)])
kf = KFold(n_splits = 3, shuffle=True, random_state=32)
for train_indices, test_indices in kf.split(x_train):
    knn_pipeline.fit(x_train.values[train_indices], np.ravel(y_train)[train_indices])
    cross_val_results.append(mean_absolute_error(knn_pipeline.predict(x_train.values[test_indices]), 
                                                 np.ravel(y_train)[test_indices]))

print('Средняя абсолютная ошибка на кросс-валидации: %f' % np.mean(cross_val_results))

Средняя абсолютная ошибка на кросс-валидации: 14.494560


In [63]:
knn_pipeline.fit(x_train.values, np.ravel(y_train))
print('Средняя абсолютная ошибка на отложенной выборке: %f' 
      % mean_absolute_error(knnpipeline.predict(x_valid), y_valid))

Средняя абсолютная ошибка на отложенной выборке: 12.091777


### Gradient boosting with Catboost

In [45]:
# Кросс-валидация 

params = {'loss_function':'RMSE',
         'verbose':200,
         'random_seed':32
         }
pool = Pool(x_train, y_train)
scores = cv(pool=pool,
            params=params,
           fold_count=3,
           seed=32,
            partition_random_seed=32,
           shuffle=True,
           )

0:	learn: 393.9631860	test: 393.9738584	best: 393.9738584 (0)	total: 304ms	remaining: 5m 3s
200:	learn: 15.1240805	test: 15.4844940	best: 15.4844940 (200)	total: 56.5s	remaining: 3m 44s
400:	learn: 14.2288866	test: 14.6772222	best: 14.6772222 (400)	total: 1m 46s	remaining: 2m 39s
600:	learn: 14.0953422	test: 14.5658353	best: 14.5658353 (600)	total: 2m 32s	remaining: 1m 40s
800:	learn: 14.0077515	test: 14.4964979	best: 14.4964382 (791)	total: 3m 21s	remaining: 50s
999:	learn: 13.9462791	test: 14.4425649	best: 14.4425138 (993)	total: 4m 6s	remaining: 0us


In [69]:
# Обучение и оценка на отложенной выборке (итоговый результат MAE: 10.217)

cross_val_results = []
tb = CatBoostRegressor(random_seed=32, eval_metric = 'MAE', verbose = 200, iterations=1200)
scaler = StandardScaler()
tb.fit(x_train, np.ravel(y_train), early_stopping_rounds = 500, 
               eval_set=(x_valid, y_valid),
              use_best_model=True);

0:	learn: 37.0597542	test: 36.9169338	best: 36.9169338 (0)	total: 79.7ms	remaining: 1m 35s
200:	learn: 10.9712755	test: 10.9232868	best: 10.9232868 (200)	total: 11.4s	remaining: 56.8s
400:	learn: 10.3254859	test: 10.3691404	best: 10.3691404 (400)	total: 23.6s	remaining: 47s
600:	learn: 10.1376623	test: 10.2358794	best: 10.2358794 (600)	total: 34.9s	remaining: 34.7s
800:	learn: 9.9972581	test: 10.1487308	best: 10.1487308 (800)	total: 46.4s	remaining: 23.1s
1000:	learn: 9.8719306	test: 10.0814464	best: 10.0814464 (1000)	total: 57.4s	remaining: 11.4s
1199:	learn: 9.7715085	test: 10.0366331	best: 10.0366331 (1199)	total: 1m 9s	remaining: 0us

bestTest = 10.03663308
bestIteration = 1199



<catboost.core.CatBoostRegressor at 0x1aba7efc860>

In [70]:
# Вклад признаков в предсказание

tb.get_feature_importance()

array([ 1.77453377,  2.07662789,  2.07340606,  0.87726172,  0.91681897,
        0.9779288 , 27.85800356, 27.69484092, 20.93391305,  0.13196822,
        0.15129945,  0.14993423,  3.45356574,  2.49358566,  3.34300304,
        4.98125546,  0.11205347])

Остановимся на модели бустинга в силу более высокого качества на кросс-валидации и отложенной выборке.

Проведем подбор параметров бустинга по сетке.

In [7]:
tb_grid = CatBoostRegressor(random_seed=32, eval_metric = 'MAE', verbose = 200, iterations=1200)
params = {'depth': [5, 7],
        'l2_leaf_reg': [1, 2, 5]
         }
grid_search_result = tb_grid.grid_search(params, X=x_train, y=y_train, train_size=0.7)

0:	loss: 10.4560293	best: 10.4560293 (0)	total: 37.1s	remaining: 3m 5s
1:	loss: 10.5052125	best: 10.4560293 (0)	total: 1m 16s	remaining: 2m 32s
2:	loss: 10.6261227	best: 10.4560293 (0)	total: 1m 56s	remaining: 1m 56s
3:	loss: 11.2247201	best: 10.4560293 (0)	total: 2m 47s	remaining: 1m 23s
4:	loss: 11.3022512	best: 10.4560293 (0)	total: 3m 46s	remaining: 45.3s
5:	loss: 11.1387208	best: 10.4560293 (0)	total: 5m 33s	remaining: 0us
Estimating final quality...


Остановимся на модели с параметрами по-умолчанию.

In [71]:
# Признаки и метки для предикта

y_submit = pd.read_csv('Y_submit.csv', sep=';', index_col=0, header=None, parse_dates=True)
x_submit = df[df.index.isin(y_submit.index)]
y_submit.shape, x_submit.shape

((5808, 1), (5808, 17))

In [114]:
predictions = tb.predict(x_submit)
y_submit.iloc[:, 0] = predictions.round().astype(int)
y_submit.head()

Unnamed: 0_level_0,1
0,Unnamed: 1_level_1
2018-05-04 00:05:00,441
2018-05-04 01:05:00,441
2018-05-04 02:05:00,423
2018-05-04 03:05:00,406
2018-05-04 04:05:00,409


In [112]:
with open('predictions.csv', 'w') as f:
    f.write(y_submit.to_csv(sep=';', header=False))