## Условия проекта

Сервис по продаже автомобилей «Не бит, не крашен» разрабатывает приложение для привлечения новых клиентов. В нём можно быстро узнать рыночную стоимость своего автомобиля. В вашем распоряжении исторические данные: технические характеристики , комплектации и цены автомобилей. Требуется построить модель для определения стоимости. 

Заказчику важны:

- качество предсказания;
- скорость предсказания;
- время обучения.

## Содержание проекта:
* [1. Подготовка данных.](#1-bullet)
* [2. Создание моделей.](#2-bullet)
* [3. Анализ моделей.](#3-bullet)


<a id='1-bullet'></a>

# 1. Подготовка данных

Первый шаг всегда один - портирование нужных на впоследствии библиотек!

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
%%time
import pandas as pd
from catboost import CatBoostRegressor

import lightgbm as lgb
from sklearn import linear_model
import numpy as np
from sklearn.impute import SimpleImputer

import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder


from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import sklearn.metrics
from sklearn.model_selection import cross_val_score


Wall time: 5.07 s


Параллельно был введен показатель времени, затраченного на выполнение этого блока кода. Очень, очень долгие 1.3 секунды были потрачены не впустую...

Теперь взглянем на данные:

In [3]:
df_autos = pd.read_csv('C:/Users/mi/Downloads/autos.csv')
df_autos.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


С первого же взгляда бросается в глаза пара пропусков в различных столбцах. С ними разберемся по-умному - с помощью SimpleImputer, но об этом позже, сначала посмотрим сколько пропусков всего

In [4]:
df_autos.isna().sum()

DateCrawled              0
Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Kilometer                0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired          71154
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

71 тысяча пропусков в столбце с ремонтом... Это очень плохо. Без заполнения пропусков никак. 

In [5]:
df_autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

Еще нужно заметить, что некоторые даты приведены в типе "обджекта". Переведем их в нужный нам формат даты.

In [6]:
def to_datetime(series):
    series = pd.to_datetime(series)
    return series

date = ['DateCrawled','DateCreated','LastSeen']

for series in date:
    df_autos[series] = to_datetime(df_autos[series])

А теперь, ради разумной вставки пропусков разделим данные на категориальные и численые

In [7]:
df_autos['NumberOfPictures'].unique()

array([0], dtype=int64)

Еще один признак NumberOfPictures был удален

In [8]:
%%time
categorical_features = df_autos.select_dtypes(include="object").columns
integer_features = df_autos.select_dtypes(exclude="object").columns
integer_features = integer_features.drop(['DateCrawled','LastSeen','Price', 'DateCreated', 
                                          'RegistrationMonth', 'PostalCode','NumberOfPictures'])

Wall time: 95.8 ms


Некоторые признаки нам не понадобятся при анализе модели, и их участь предрешена: удалить.

Начнем же издеваться над категориальными данными! Нaчнем преображение с заполнения пропусков(как я и заспойлерил выше - с помощью SimpleImputer)

In [9]:
%%time
imputer = SimpleImputer(strategy = 'most_frequent')
imputer.fit(df_autos[categorical_features])
df_autos[categorical_features] = imputer.transform(df_autos[categorical_features])
df_autos[categorical_features]

Wall time: 8min 48s


Unnamed: 0,VehicleType,Gearbox,Model,FuelType,Brand,NotRepaired
0,sedan,manual,golf,petrol,volkswagen,no
1,coupe,manual,golf,gasoline,audi,yes
2,suv,auto,grand,gasoline,jeep,no
3,small,manual,golf,petrol,volkswagen,no
4,small,manual,fabia,gasoline,skoda,no
...,...,...,...,...,...,...
354364,sedan,manual,colt,petrol,mitsubishi,yes
354365,sedan,manual,golf,petrol,sonstige_autos,no
354366,convertible,auto,fortwo,petrol,smart,no
354367,bus,manual,transporter,gasoline,volkswagen,no


In [10]:
categorical_cols_without_encoding = df_autos[categorical_features]

Тут я решил провести эксперимент: как лучше работает с категориальными дыннами модели бустинга - с преобразоваными или с "сырыми".

Теперь это выглядит чуть-чуть красивее. Но душа требует чисел. И преобразования в OrdinalEncoder. 

In [11]:
%%time
encoder = OrdinalEncoder()
category_cols_for_encoding = pd.DataFrame(encoder.fit_transform(df_autos[categorical_features]),
                            columns=df_autos[categorical_features].columns)

Wall time: 753 ms


А теперь чуть поиздеваемся над численными данными. Заполним пропуски, а потом применим масштабирование.

In [12]:
df_autos[integer_features].isna().sum()

RegistrationYear    0
Power               0
Kilometer           0
dtype: int64

Пропусков в числовых признаках нет

In [13]:
category_cols_for_encoding.head()

Unnamed: 0,VehicleType,Gearbox,Model,FuelType,Brand,NotRepaired
0,4.0,1.0,116.0,6.0,38.0,0.0
1,2.0,1.0,116.0,2.0,1.0,1.0
2,6.0,0.0,117.0,2.0,14.0,0.0
3,5.0,1.0,116.0,6.0,38.0,0.0
4,5.0,1.0,101.0,2.0,31.0,0.0


In [14]:
categorical_cols_without_encoding.head()

Unnamed: 0,VehicleType,Gearbox,Model,FuelType,Brand,NotRepaired
0,sedan,manual,golf,petrol,volkswagen,no
1,coupe,manual,golf,gasoline,audi,yes
2,suv,auto,grand,gasoline,jeep,no
3,small,manual,golf,petrol,volkswagen,no
4,small,manual,fabia,gasoline,skoda,no


Готово! Теперь данные почти готовы для взаимодействия с моделями. Только сделать один маленький шаг

In [15]:
integer_cols = df_autos[integer_features]

In [16]:
df_for_boosting_with_encode = category_cols_for_encoding.merge(integer_cols, 
                                                               on = category_cols_for_encoding.index).drop('key_0', axis = 1)
df_for_boosting_with_encode =  df_for_boosting_with_encode.merge(df_autos['Price'],
                                                                 on = df_for_boosting_with_encode.index).drop('key_0', axis = 1)
df_for_boosting_with_encode.head()

Unnamed: 0,VehicleType,Gearbox,Model,FuelType,Brand,NotRepaired,RegistrationYear,Power,Kilometer,Price
0,4.0,1.0,116.0,6.0,38.0,0.0,1993,0,150000,480
1,2.0,1.0,116.0,2.0,1.0,1.0,2011,190,125000,18300
2,6.0,0.0,117.0,2.0,14.0,0.0,2004,163,125000,9800
3,5.0,1.0,116.0,6.0,38.0,0.0,2001,75,150000,1500
4,5.0,1.0,101.0,2.0,31.0,0.0,2008,69,90000,3600


In [17]:
df_for_boosting_without_encoding = categorical_cols_without_encoding.merge(integer_cols, 
                                                                           on = categorical_cols_without_encoding.index).drop('key_0', axis = 1)
df_for_boosting_without_encoding =  df_for_boosting_without_encoding.merge(df_autos['Price'], 
                                                                           on = df_for_boosting_without_encoding.index).drop('key_0', axis = 1)
df_for_boosting_without_encoding.head()

Unnamed: 0,VehicleType,Gearbox,Model,FuelType,Brand,NotRepaired,RegistrationYear,Power,Kilometer,Price
0,sedan,manual,golf,petrol,volkswagen,no,1993,0,150000,480
1,coupe,manual,golf,gasoline,audi,yes,2011,190,125000,18300
2,suv,auto,grand,gasoline,jeep,no,2004,163,125000,9800
3,small,manual,golf,petrol,volkswagen,no,2001,75,150000,1500
4,small,manual,fabia,gasoline,skoda,no,2008,69,90000,3600


Франкеншейны готов! Осталось только опять их расчленить, на этот раз на признаки факторы и признаки результаты. А потом пускаем в город!

In [18]:
features_auto = df_for_boosting_with_encode.drop('Price', axis = 1)
target_auto = df_for_boosting_with_encode['Price']

features_train, features_test, target_train, target_test = train_test_split(
    features_auto, target_auto , test_size=0.25, random_state=12345)

In [19]:
features_auto_1 = df_for_boosting_without_encoding.drop('Price', axis = 1)
target_auto_1 = df_for_boosting_without_encoding['Price']

features_train_1, features_test_1, target_train_1, target_test_1 = train_test_split(
    features_auto_1, target_auto_1 , test_size=0.25, random_state=12345)

In [20]:
%%time
scaler = StandardScaler()
scaler.fit(features_train[integer_features])

features_train[integer_features] = scaler.transform(features_train[integer_features])
features_test[integer_features] = scaler.transform(features_test[integer_features])

Wall time: 250 ms


In [21]:
%%time
scaler = StandardScaler()
scaler.fit(features_train_1[integer_features])

features_train_1[integer_features]= scaler.transform(features_train_1[integer_features])
features_test_1[integer_features]= scaler.transform(features_test_1[integer_features])

Wall time: 263 ms


<a id='2-bullet'></a>

# 2. Обучение моделей

Начнем с обучения модели CatBoost. Сначала спонтанная модель, а потом начнем играться с скоростью обучения)

In [22]:
%%time
model_cat_classic = CatBoostRegressor(iterations = 50)
model_cat_classic.fit(features_train, target_train, verbose=10)

Learning rate set to 0.5
0:	learn: 3271.0558946	total: 215ms	remaining: 10.5s
10:	learn: 2097.3272661	total: 643ms	remaining: 2.28s
20:	learn: 1994.2054436	total: 1.13s	remaining: 1.57s
30:	learn: 1945.6550064	total: 1.64s	remaining: 1s
40:	learn: 1911.0637789	total: 2.16s	remaining: 474ms
49:	learn: 1883.0373601	total: 2.66s	remaining: 0us
Wall time: 3.06 s


<catboost.core.CatBoostRegressor at 0x26181b52e08>

In [23]:
%%time
model_cat_classic_1 = CatBoostRegressor(iterations = 50)
model_cat_classic_1.fit(features_train_1, target_train_1, 
                        cat_features = categorical_cols_without_encoding.columns, verbose=10)

Learning rate set to 0.5
0:	learn: 3287.1983593	total: 132ms	remaining: 6.48s
10:	learn: 2043.1935484	total: 1.43s	remaining: 5.06s
20:	learn: 1958.7595897	total: 2.67s	remaining: 3.68s
30:	learn: 1919.5529672	total: 3.88s	remaining: 2.38s
40:	learn: 1888.1804600	total: 5.09s	remaining: 1.12s
49:	learn: 1870.0882217	total: 6.18s	remaining: 0us
Wall time: 7.38 s


<catboost.core.CatBoostRegressor at 0x26181b453c8>

In [24]:
cat_features = categorical_cols_without_encoding.columns
cat_features

Index(['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired'], dtype='object')

Надо признать, пока кэтбуст лучше справляется с сырыми категориальными данными. Немного, но точнее их предсказывает, хоть и почти двукратно медленнее

In [25]:
%%time
model_cat = CatBoostRegressor(iterations = 50, learning_rate = 0.2)
model_cat.fit(features_train, target_train, verbose=10)


0:	learn: 3964.6447766	total: 63.6ms	remaining: 3.12s
10:	learn: 2347.1484525	total: 641ms	remaining: 2.27s
20:	learn: 2139.9928747	total: 1.2s	remaining: 1.65s
30:	learn: 2050.4325904	total: 1.7s	remaining: 1.04s
40:	learn: 2005.6992659	total: 2.24s	remaining: 491ms
49:	learn: 1975.1347034	total: 2.72s	remaining: 0us
Wall time: 3 s


<catboost.core.CatBoostRegressor at 0x26181b63688>

In [26]:
%%time
model_cat_1 = CatBoostRegressor(iterations = 50, learning_rate = 0.2)
model_cat_1.fit(features_train_1, target_train_1,
                cat_features = categorical_features, verbose=10)

0:	learn: 3971.0071308	total: 167ms	remaining: 8.18s
10:	learn: 2287.7796526	total: 1.36s	remaining: 4.81s
20:	learn: 2075.5786181	total: 2.53s	remaining: 3.49s
30:	learn: 2010.4034569	total: 3.73s	remaining: 2.28s
40:	learn: 1970.2826258	total: 4.92s	remaining: 1.08s
49:	learn: 1948.3853710	total: 6s	remaining: 0us
Wall time: 7.22 s


<catboost.core.CatBoostRegressor at 0x26181b65dc8>

Вообще значение скорости обучения в 0.3 оптимальное с точки зрения точности (и метрики rmse)

А уж время-то какое! 9-18 секунд, в зависимости от входных данных, думаю, заказчик будет доволен!

А теперь проведем кросс валидацию

In [27]:
from sklearn.metrics import  make_scorer
def rmse_for_cv(y_true, y_pred):
    return  mean_squared_error(y_true, y_pred) **0.5

rMSE = make_scorer(rmse_for_cv, greater_is_better = False)

In [28]:
%%time
from catboost import Pool, cv
cv_dataset = Pool(data=features_train_1,
                  label=target_train_1,
                  cat_features=categorical_features)

params = {"iterations": 50,
          "verbose": 10,
          "learning_rate" : 0.3,
         "loss_function" : "RMSE"}

scores = cv(cv_dataset,
            params,
            fold_count=5)


0:	learn: 4852.1742726	test: 4853.0280584	best: 4853.0280584 (0)	total: 1.24s	remaining: 1m
10:	learn: 2144.0577603	test: 2146.0384871	best: 2146.0384871 (10)	total: 8.65s	remaining: 30.7s
20:	learn: 2017.9794464	test: 2024.0531725	best: 2024.0531725 (20)	total: 14.8s	remaining: 20.4s
30:	learn: 1967.1487957	test: 1976.1249549	best: 1976.1249549 (30)	total: 20.8s	remaining: 12.8s
40:	learn: 1933.5965262	test: 1945.6903111	best: 1945.6903111 (40)	total: 27.3s	remaining: 6s
49:	learn: 1909.4297651	test: 1924.0676067	best: 1924.0676067 (49)	total: 32.8s	remaining: 0us
Wall time: 34 s


In [29]:
mean_scores = scores.mean()
mean_scores 

iterations           24.500000
test-RMSE-mean     2173.805804
test-RMSE-std        17.190263
train-RMSE-mean    2166.582628
train-RMSE-std        8.512543
dtype: float64

In [30]:
%%time
cv_dataset_1 = Pool(data=features_train,
                  label=target_train,
                  )

params_1 = {"iterations": 50,
          "verbose": 10,
         "loss_function" : "RMSE"}

scores_1 = cv(cv_dataset_1,
            params_1,
            fold_count=5)

score_mean_1 = scores_1.mean()

0:	learn: 6156.8007920	test: 6156.7922636	best: 6156.7922636 (0)	total: 528ms	remaining: 25.9s
10:	learn: 4897.6425149	test: 4897.9323895	best: 4897.9323895 (10)	total: 4.11s	remaining: 14.6s
20:	learn: 4018.7737812	test: 4019.2891713	best: 4019.2891713 (20)	total: 7.46s	remaining: 10.3s
30:	learn: 3416.4297881	test: 3417.3616376	best: 3417.3616376 (30)	total: 10.8s	remaining: 6.6s
40:	learn: 3009.9156972	test: 3010.9829269	best: 3010.9829269 (40)	total: 14.4s	remaining: 3.16s
49:	learn: 2763.3856170	test: 2764.7833186	best: 2764.7833186 (49)	total: 17.4s	remaining: 0us
Wall time: 17.7 s


In [31]:
%%time
scores_2 = cv(cv_dataset,
            params_1,
            fold_count=5)
score_mean_2 = scores_2.mean()

0:	learn: 6159.6023736	test: 6159.6728246	best: 6159.6728246 (0)	total: 1.12s	remaining: 55.1s
10:	learn: 4899.6632508	test: 4899.6574800	best: 4899.6574800 (10)	total: 7.58s	remaining: 26.9s
20:	learn: 4019.6242773	test: 4019.9621061	best: 4019.9621061 (20)	total: 13.7s	remaining: 19s
30:	learn: 3415.7151036	test: 3416.5437601	best: 3416.5437601 (30)	total: 19.9s	remaining: 12.2s
40:	learn: 3006.2375555	test: 3006.6409153	best: 3006.6409153 (40)	total: 26.3s	remaining: 5.78s
49:	learn: 2750.9375094	test: 2751.0970660	best: 2751.0970660 (49)	total: 31.9s	remaining: 0us
Wall time: 32.2 s


In [32]:
%%time
scores_3 = cv(cv_dataset_1,
            params,
            fold_count=5)
score_mean_3 = scores_2.mean()

0:	learn: 4821.4628523	test: 4821.5223976	best: 4821.5223976 (0)	total: 492ms	remaining: 24.1s
10:	learn: 2216.3170190	test: 2220.0556564	best: 2220.0556564 (10)	total: 3.84s	remaining: 13.6s
20:	learn: 2059.5448971	test: 2064.5432260	best: 2064.5432260 (20)	total: 7.83s	remaining: 10.8s
30:	learn: 1995.4765468	test: 2004.2011311	best: 2004.2011311 (30)	total: 13.5s	remaining: 8.27s
40:	learn: 1957.8892921	test: 1969.1384738	best: 1969.1384738 (40)	total: 16.8s	remaining: 3.68s
49:	learn: 1931.7642723	test: 1946.0324585	best: 1946.0324585 (49)	total: 19.7s	remaining: 0us
Wall time: 20 s


In [33]:
score_mean_1

iterations           24.500000
test-RMSE-mean     3972.391836
test-RMSE-std        14.646398
train-RMSE-mean    3971.723598
train-RMSE-std        4.165810
dtype: float64

In [34]:
score_mean_2

iterations           24.500000
test-RMSE-mean     3971.605204
test-RMSE-std        15.607540
train-RMSE-mean    3971.264805
train-RMSE-std        4.157375
dtype: float64

Кросс-валидация дала примерно такие же результаты как и простое обучение(значения метрик похожи в схожих итерациях)

<a id='3-bullet'></a>

# 3. Анализ моделей

In [43]:
def rmse(model,features_test, test_target):
    predictions = model.predict(features_test)
    
    return (mean_squared_error(predictions, test_target) **0.5).round(2)

In [44]:
%%time
rmse_cat = rmse(model_cat, features_test, target_test)
rmse_cat

Wall time: 16 ms


1987.36

In [46]:
%%time
rmse_cat1 = rmse(model_cat_1, features_test_1, target_test_1)
rmse_cat1

Wall time: 172 ms


1954.56

In [47]:
%%time
rmse_cat_classic = rmse(model_cat_classic, features_test, target_test)
rmse_cat_classic

Wall time: 17 ms


1906.17

In [48]:
%%time
rmse_cat_calssic_1 = rmse(model_cat_classic_1, features_test_1, target_test_1)
rmse_cat_calssic_1

Wall time: 182 ms


1884.82

In [41]:
total_results = pd.DataFrame(columns = ['model_name','cv_rmse','test_rmse','train_time','test_time'])
total_results = total_results.append({'model_name' : "model_cat_1", 'cv_rmse': mean_scores['test-RMSE-mean'], 'test_rmse' : rmse_cat1,'train_time' : '18 s', 'test_time' : '268 ms'}, ignore_index = True)
total_results = total_results.append({'model_name' : "model_cat_classic", 'cv_rmse': score_mean_1['test-RMSE-mean'],'test_rmse' : rmse_cat_classic,'train_time' : '10.2 s', 'test_time': '32.5 ms'}, ignore_index = True)
total_results = total_results.append({'model_name' : "model_cat_classic_1", 'cv_rmse': score_mean_2['test-RMSE-mean'],'test_rmse' :rmse_cat_calssic_1,'train_time' : '19.6 s','test_time' :'256 ms'}, ignore_index = True)
total_results = total_results.append({'model_name' : "model_cat", 'cv_rmse': score_mean_3['test-RMSE-mean'],'test_rmse' : rmse_cat,'train_time' : '9.62 s', 'test_time' :'32 ms'}, ignore_index = True)

In [42]:
total_results

Unnamed: 0,model_name,cv_rmse,test_rmse,train_time,test_time
0,model_cat_1,2173.805804,1954.560725,18 s,268 ms
1,model_cat_classic,3972.391836,1906.166576,10.2 s,32.5 ms
2,model_cat_classic_1,3971.605204,1884.818097,19.6 s,256 ms
3,model_cat,3971.605204,1987.356353,9.62 s,32 ms


Не все модели смогли дать результат RMSE менее 2500, но тем не менее, есть те кто прошли к финишу с эдакой "победой". ЛУчше всех себя показала модель LGBM, как по скорости обучения почти лучший результат, а по точности лучший результат. Аве ЛГБМ!