<a href="https://colab.research.google.com/github/avgalkov/collab-notebooks/blob/main/machine_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Downloading the dataset and loading it into Colab

https://drive.google.com/file/d/11HcDhlPAJ92FHSUW_kA8kN33nWkmcLV0/view?usp=sharing


In [1]:
! gdown --id 11HcDhlPAJ92FHSUW_kA8kN33nWkmcLV0

Downloading...
From: https://drive.google.com/uc?id=11HcDhlPAJ92FHSUW_kA8kN33nWkmcLV0
To: /content/UK_used_cars.zip
100% 1.15M/1.15M [00:00<00:00, 58.3MB/s]


In [2]:
! unzip /content/UK_used_cars.zip -d /content/uk-used-cars

Archive:  /content/UK_used_cars.zip
  inflating: /content/uk-used-cars/audi.csv  
  inflating: /content/uk-used-cars/bmw.csv  
  inflating: /content/uk-used-cars/cclass.csv  
  inflating: /content/uk-used-cars/focus.csv  
  inflating: /content/uk-used-cars/ford.csv  
  inflating: /content/uk-used-cars/hyundi.csv  
  inflating: /content/uk-used-cars/merc.csv  
  inflating: /content/uk-used-cars/skoda.csv  
  inflating: /content/uk-used-cars/toyota.csv  
  inflating: /content/uk-used-cars/unclean cclass.csv  
  inflating: /content/uk-used-cars/unclean focus.csv  
  inflating: /content/uk-used-cars/vauxhall.csv  
  inflating: /content/uk-used-cars/vw.csv  


In [3]:
import pandas as pd
import matplotlib.pyplot as plt


plt.style.use('dark_background')

In [4]:
df = pd.read_csv('/content/uk-used-cars/bmw.csv')

In [5]:
df.head()

Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize
0,5 Series,2014,11200,Automatic,67068,Diesel,125,57.6,2.0
1,6 Series,2018,27000,Automatic,14827,Petrol,145,42.8,2.0
2,5 Series,2016,16000,Automatic,62794,Diesel,160,51.4,3.0
3,1 Series,2017,12750,Automatic,26676,Diesel,145,72.4,1.5
4,7 Series,2014,14500,Automatic,39554,Diesel,160,50.4,3.0


## Training and test sets (train and test)

In [6]:
from sklearn.model_selection import train_test_split

In [8]:
train, test = train_test_split(df,train_size = 0.6, random_state = 1)

# random_state = 1 fixes the shuffle, so the split is the same on every run
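
To illustrate what a fixed seed gives us, here is a minimal sketch that uses only the standard library. The helper `shuffled_indices` is hypothetical and is not part of scikit-learn; it only mimics the idea of a seeded shuffle behind `random_state`.

```python
import random

def shuffled_indices(n, seed):
    """Return indices 0..n-1 shuffled with a dedicated, seeded generator."""
    rng = random.Random(seed)  # independent generator, global state is untouched
    idx = list(range(n))
    rng.shuffle(idx)
    return idx

first = shuffled_indices(10, seed=1)
second = shuffled_indices(10, seed=1)
other = shuffled_indices(10, seed=2)

print(first == second)                   # same seed gives the same order
print(sorted(first) == list(range(10)))  # still a permutation of all rows
```

With the same seed, the rows end up in the same subsets on every run, which makes metrics comparable between experiments.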

## Validation set

*We split the data into train, val, and test in a 60% / 20% / 20% ratio.
The validation set is used to check how well the model is learning during training, while the test set is completely new data for the model. We train on train, check the results on val, and only then apply the model to test.*
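
The two-step arithmetic behind the 60/20/20 split can be sketched without scikit-learn. The function `split_60_20_20` below is a hypothetical helper written for illustration: it first holds out 40% of the rows and then halves that held-out part.

```python
import random

def split_60_20_20(rows, seed=1):
    """Split rows into train/val/test in two steps, like two consecutive splits."""
    rng = random.Random(seed)
    shuffled = rows[:]                  # copy so the caller's list is untouched
    rng.shuffle(shuffled)

    n_train = int(len(shuffled) * 0.6)  # step 1: 60% train, 40% held out
    train_part, rest = shuffled[:n_train], shuffled[n_train:]

    n_val = int(len(rest) * 0.5)        # step 2: halve the held-out 40%
    val_part, test_part = rest[:n_val], rest[n_val:]
    return train_part, val_part, test_part

rows = list(range(1000))
train_rows, val_rows, test_rows = split_60_20_20(rows)
print(len(train_rows), len(val_rows), len(test_rows))
```

Half of the remaining 40% is 20% of the original data, which is why the second split uses `train_size = 0.5`.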

In [9]:
len(train) / len(df)

0.5999443465355718

In [10]:
len(test) / len(df)

0.4000556534644282

In [11]:
val, test = train_test_split(test,train_size = 0.5, random_state = 1)

In [12]:
len(train) / len(df)

0.5999443465355718

In [13]:
len(test) / len(df)

0.20007420461923756

In [14]:
len(val) / len(df)

0.19998144884519062

## Feature list

In [15]:
train.columns

Index(['model', 'year', 'price', 'transmission', 'mileage', 'fuelType', 'tax',
       'mpg', 'engineSize'],
      dtype='object')

* The feature list must not contain the target, so we split the columns into two groups: features and target.

* For CatBoost it is important to specify which features contain text. To be used, they first have to be converted into numeric features; CatBoost can do this under the hood.

* In our case the text features are 'model', 'transmission', and 'fuelType'. The remaining features are numeric.
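
The grouping described above can also be derived from column dtypes instead of being typed by hand. This is a minimal sketch on a toy frame built from two rows of `df.head()`; the names `toy`, `text_cols`, and `num_cols` are illustrative, and it assumes every non-numeric column is a categorical feature.

```python
import pandas as pd

# Toy frame with the same columns as bmw.csv (two rows copied from df.head()).
toy = pd.DataFrame({
    'model': ['5 Series', '6 Series'],
    'year': [2014, 2018],
    'price': [11200, 27000],
    'transmission': ['Automatic', 'Automatic'],
    'mileage': [67068, 14827],
    'fuelType': ['Diesel', 'Petrol'],
    'tax': [125, 145],
    'mpg': [57.6, 42.8],
    'engineSize': [2.0, 2.0],
})

target = 'price'
features = toy.drop(columns=[target])  # the target must not be among the features

# Non-numeric columns are treated as categorical (text) features.
text_cols = features.select_dtypes(exclude='number').columns.tolist()
num_cols = features.select_dtypes(include='number').columns.tolist()

print(text_cols)
print(num_cols)
```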


In [16]:
X = ['year',  'mileage', 'tax',
       'mpg', 'engineSize']
cat_features = ['model', 'transmission', 'fuelType']
y = ['price']

## How categorical features are encoded

One-hot encoding is one of the methods for converting categorical features into numeric ones.

In [17]:
pd.get_dummies(train['transmission'])

Unnamed: 0,Automatic,Manual,Semi-Auto
7882,1,0,0
1381,0,1,0
8859,1,0,0
9354,1,0,0
4192,0,0,1
...,...,...,...
2895,0,1,0
7813,1,0,0
905,0,1,0
5192,0,0,1
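
To make the principle behind the `pd.get_dummies` output explicit, here is a hand-written sketch in plain Python. The function `one_hot` is a hypothetical helper for illustration, not the pandas implementation.

```python
def one_hot(values):
    """Encode a list of category labels as rows of 0/1 indicators."""
    categories = sorted(set(values))  # one column per distinct label
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

labels = ['Automatic', 'Manual', 'Automatic', 'Semi-Auto']
columns, encoded = one_hot(labels)

print(columns)
for label, row in zip(labels, encoded):
    print(label, row)
```

Each row contains exactly one 1, placed in the column that matches the row's label.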


## First CatBoost run

In [18]:
! pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp38-none-manylinux1_x86_64.whl (76.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.6/76.6 MB 11.7 MB/s eta 0:00:00
Installing collected packages: catboost
Successfully installed catboost-1.1.1


In [20]:
from catboost import CatBoostRegressor
# for classification tasks, import CatBoostClassifier instead

For the first run, we use the same features we used in the "human learning" exercise.

In [64]:
X = ['year', 'transmission', 'engineSize']
cat_features = ['transmission']
y = ['price']

In [65]:
model = CatBoostRegressor(cat_features = cat_features,
                          eval_metric = 'MAPE', 
                          random_seed = 42,
                          verbose=100)

In [66]:
model.fit(train[X], train[y],eval_set=(val[X], val[y]))

Learning rate set to 0.068263
0:	learn: 0.4621086	test: 0.4489079	best: 0.4489079 (0)	total: 2.57ms	remaining: 2.57s
100:	learn: 0.1537470	test: 0.1534632	best: 0.1534632 (100)	total: 178ms	remaining: 1.58s
200:	learn: 0.1515149	test: 0.1521516	best: 0.1521514 (197)	total: 336ms	remaining: 1.34s
300:	learn: 0.1503412	test: 0.1520507	best: 0.1519138 (261)	total: 508ms	remaining: 1.18s
400:	learn: 0.1495316	test: 0.1521949	best: 0.1519138 (261)	total: 698ms	remaining: 1.04s
500:	learn: 0.1488269	test: 0.1521153	best: 0.1519138 (261)	total: 883ms	remaining: 880ms
600:	learn: 0.1482608	test: 0.1521817	best: 0.1519138 (261)	total: 1.36s	remaining: 903ms
700:	learn: 0.1478179	test: 0.1523615	best: 0.1519138 (261)	total: 1.87s	remaining: 797ms
800:	learn: 0.1473592	test: 0.1526559	best: 0.1519138 (261)	total: 2.29s	remaining: 568ms
900:	learn: 0.1469682	test: 0.1527051	best: 0.1519138 (261)	total: 2.71s	remaining: 297ms
999:	learn: 0.1466731	test: 0.1529079	best: 0.1519138 (261)	total: 2.91s	

<catboost.core.CatBoostRegressor at 0x7fabf3f3f760>

Making predictions on the test set

In [67]:
model.predict(test[X])

array([27791.79270642,  9414.28084977, 45523.73943666, ...,
       18197.14290638, 10477.94541465, 16233.48447626])

In [68]:
test['price_pred'] = model.predict(test[X])

In [69]:
test.head()

Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize,price_pred,price_pred_all
342,5 Series,2019,28500,Semi-Auto,3816,Diesel,145,62.8,2.0,27791.792706,29094.170853
7604,3 Series,2013,9990,Manual,51000,Diesel,145,55.4,2.0,9414.28085,10886.267555
6579,5 Series,2019,32896,Automatic,6071,Diesel,145,53.3,3.0,45523.739437,34656.929246
9739,3 Series,2013,9195,Manual,82260,Diesel,30,62.8,2.0,9414.28085,8659.105197
1932,M4,2019,44500,Semi-Auto,5644,Petrol,145,34.0,3.0,44965.633234,45768.460724


In [70]:
from sklearn.metrics import mean_absolute_error,mean_absolute_percentage_error

In [71]:
# wrap the error calculations in a function: prints MAE, then MAPE (as a fraction)

def error(y_true,y_pred):
    print(mean_absolute_error(y_true,y_pred))
    print(mean_absolute_percentage_error(y_true,y_pred))


In [72]:
error(test['price'], test['price_pred'])

3631.492851599049
0.15887974860073903
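
For reference, the two numbers printed by `error` can be reproduced by hand. The sketch below implements the standard MAE and MAPE definitions in plain Python; the prices in it are illustrative and are not taken from the dataset.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.15 means 15%)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative prices only, not taken from the dataset.
actual = [10000, 20000, 40000]
predicted = [11000, 18000, 40000]

print(mae(actual, predicted))
print(mape(actual, predicted))
```

MAE is expressed in the units of the target (here, pounds), while MAPE is relative, so a value of about 0.159 means the predictions are off by roughly 16% on average.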


## Training on all features

In [73]:
X = ['year',  'mileage', 'tax',
       'mpg', 'engineSize','model', 'transmission', 'fuelType']
cat_features = ['model', 'transmission', 'fuelType']
y = ['price']

In [79]:
# put the parameters in a dictionary for convenience

parameters = {'cat_features' : cat_features,
              'eval_metric' : 'MAPE',
              'verbose' : 100}

In [80]:
model = CatBoostRegressor(**parameters)

In [81]:
model.fit(train[X], train[y],eval_set=(val[X], val[y]))

Learning rate set to 0.068263
0:	learn: 0.4620476	test: 0.4491287	best: 0.4491287 (0)	total: 4.89ms	remaining: 4.89s
100:	learn: 0.0907457	test: 0.0909975	best: 0.0909975 (100)	total: 402ms	remaining: 3.58s
200:	learn: 0.0782808	test: 0.0813269	best: 0.0813269 (200)	total: 790ms	remaining: 3.14s
300:	learn: 0.0728162	test: 0.0770531	best: 0.0770531 (300)	total: 1.17s	remaining: 2.72s
400:	learn: 0.0697744	test: 0.0753101	best: 0.0753101 (400)	total: 1.58s	remaining: 2.35s
500:	learn: 0.0674429	test: 0.0740215	best: 0.0740215 (500)	total: 1.97s	remaining: 1.97s
600:	learn: 0.0652693	test: 0.0730381	best: 0.0730315 (593)	total: 2.36s	remaining: 1.57s
700:	learn: 0.0636495	test: 0.0724704	best: 0.0724670 (698)	total: 2.93s	remaining: 1.25s
800:	learn: 0.0622577	test: 0.0719561	best: 0.0719342 (796)	total: 3.95s	remaining: 982ms
900:	learn: 0.0609696	test: 0.0715338	best: 0.0715338 (900)	total: 4.58s	remaining: 503ms
999:	learn: 0.0598680	test: 0.0712426	best: 0.0712395 (996)	total: 4.98s	

<catboost.core.CatBoostRegressor at 0x7fabf35635b0>

In [77]:
test['price_pred_all'] = model.predict(test[X])

In [78]:
error(test['price'], test['price_pred_all'])

1624.3527686112777
0.07376991781222213


## Number of iterations and learning rate

In [82]:
X = ['year', 'transmission', 'engineSize']
cat_features = ['transmission']
y = ['price']

In [91]:
model = CatBoostRegressor(cat_features = cat_features,
                          learning_rate = 0.03, # change the learning rate
                          eval_metric = 'MAPE', 
                          random_seed = 42,
                          verbose=100)

In [92]:
model.fit(train[X], train[y],eval_set=(val[X], val[y]))

0:	learn: 0.4766117	test: 0.4630623	best: 0.4630623 (0)	total: 2.64ms	remaining: 2.64s
100:	learn: 0.1686204	test: 0.1665831	best: 0.1665831 (100)	total: 203ms	remaining: 1.81s
200:	learn: 0.1546768	test: 0.1538231	best: 0.1538231 (200)	total: 378ms	remaining: 1.5s
300:	learn: 0.1527610	test: 0.1524028	best: 0.1524003 (299)	total: 537ms	remaining: 1.25s
400:	learn: 0.1520242	test: 0.1519914	best: 0.1519914 (400)	total: 701ms	remaining: 1.05s
500:	learn: 0.1513796	test: 0.1517251	best: 0.1517218 (499)	total: 881ms	remaining: 878ms
600:	learn: 0.1509089	test: 0.1516070	best: 0.1516070 (600)	total: 1.05s	remaining: 701ms
700:	learn: 0.1504319	test: 0.1515781	best: 0.1515400 (678)	total: 1.24s	remaining: 527ms
800:	learn: 0.1500219	test: 0.1515251	best: 0.1515251 (800)	total: 1.41s	remaining: 350ms
900:	learn: 0.1496119	test: 0.1514039	best: 0.1514021 (895)	total: 1.6s	remaining: 176ms
999:	learn: 0.1492990	test: 0.1513532	best: 0.1513502 (996)	total: 1.79s	remaining: 0us

bestTest = 0.151

<catboost.core.CatBoostRegressor at 0x7fabf3f3f5e0>

In [93]:
X = ['year',  'mileage', 'tax',
       'mpg', 'engineSize','model', 'transmission', 'fuelType']
cat_features = ['model', 'transmission', 'fuelType']
y = ['price']

In [94]:
# put the parameters in a dictionary for convenience

parameters = {'cat_features' : cat_features,
              'learning_rate' : 0.08, # change the learning rate
              'eval_metric' : 'MAPE',
              'verbose' : 100}

In [95]:
model = CatBoostRegressor(**parameters)

In [96]:
model.fit(train[X], train[y],eval_set=(val[X], val[y]))

0:	learn: 0.4576088	test: 0.4448361	best: 0.4448361 (0)	total: 5.54ms	remaining: 5.53s
100:	learn: 0.0859551	test: 0.0887269	best: 0.0887269 (100)	total: 566ms	remaining: 5.04s
200:	learn: 0.0756432	test: 0.0802535	best: 0.0802365 (199)	total: 980ms	remaining: 3.9s
300:	learn: 0.0708377	test: 0.0768255	best: 0.0768255 (300)	total: 1.37s	remaining: 3.18s
400:	learn: 0.0675903	test: 0.0746676	best: 0.0746662 (399)	total: 1.75s	remaining: 2.61s
500:	learn: 0.0656762	test: 0.0739730	best: 0.0739730 (500)	total: 2.14s	remaining: 2.13s
600:	learn: 0.0639073	test: 0.0733349	best: 0.0733349 (600)	total: 2.53s	remaining: 1.68s
700:	learn: 0.0621661	test: 0.0725205	best: 0.0725090 (687)	total: 2.94s	remaining: 1.25s
800:	learn: 0.0606551	test: 0.0719604	best: 0.0719604 (800)	total: 3.35s	remaining: 831ms
900:	learn: 0.0594107	test: 0.0716771	best: 0.0716771 (900)	total: 3.73s	remaining: 410ms
999:	learn: 0.0582401	test: 0.0712909	best: 0.0712452 (992)	total: 4.12s	remaining: 0us

bestTest = 0.071

<catboost.core.CatBoostRegressor at 0x7fabf3563b20>

In [97]:
X = ['year', 'transmission', 'engineSize']
cat_features = ['transmission']
y = ['price']

In [99]:
model = CatBoostRegressor(cat_features = cat_features,
                          early_stopping_rounds = 200, # if the metric does not improve for 200 iterations, training stops
                          eval_metric = 'MAPE', 
                          random_seed = 42,
                          verbose=100)

In [100]:
model.fit(train[X], train[y],eval_set=(val[X], val[y]))

Learning rate set to 0.068263
0:	learn: 0.4621086	test: 0.4489079	best: 0.4489079 (0)	total: 2.48ms	remaining: 2.47s
100:	learn: 0.1537470	test: 0.1534632	best: 0.1534632 (100)	total: 180ms	remaining: 1.6s
200:	learn: 0.1515149	test: 0.1521516	best: 0.1521514 (197)	total: 412ms	remaining: 1.64s
300:	learn: 0.1503412	test: 0.1520507	best: 0.1519138 (261)	total: 657ms	remaining: 1.52s
400:	learn: 0.1495316	test: 0.1521949	best: 0.1519138 (261)	total: 855ms	remaining: 1.28s
Stopped by overfitting detector  (200 iterations wait)

bestTest = 0.1519137901
bestIteration = 261

Shrink model to first 262 iterations.


<catboost.core.CatBoostRegressor at 0x7fabf350bb80>
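
The stopping rule seen in the log can be sketched in a few lines. This is a simplified illustration of the idea behind `early_stopping_rounds`, not CatBoost's actual overfitting detector; `early_stopping` is a hypothetical helper and the scores are made up.

```python
def early_stopping(val_scores, patience):
    """Return (best_iteration, stop_iteration) for a lower-is-better metric."""
    best_score = float('inf')
    best_iter = -1
    for i, score in enumerate(val_scores):
        if score < best_score:
            best_score, best_iter = score, i  # new best, reset the wait
        elif i - best_iter >= patience:
            return best_iter, i               # waited `patience` iterations in vain
    return best_iter, len(val_scores) - 1     # never triggered, ran to the end

# Made-up validation scores: the best value is at iteration 4.
scores = [0.45, 0.30, 0.20, 0.16, 0.15, 0.152, 0.153, 0.151, 0.154]
best, stopped = early_stopping(scores, patience=3)
print(best, stopped)
```

Iterations are counted from zero, so keeping the model up to the best iteration means keeping `best + 1` trees. This matches the log, where `bestIteration = 261` is followed by "Shrink model to first 262 iterations".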

## Training on all the data

In [101]:
len(train)

6468

In [102]:
len(val)

2156

In [104]:
pd.concat([train,val])

Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize
7882,X1,2019,25000,Automatic,4534,Petrol,145,46.3,1.5
1381,3 Series,2016,15621,Manual,32927,Diesel,125,60.1,2.0
8859,3 Series,2014,14500,Automatic,45140,Diesel,125,57.6,3.0
9354,2 Series,2016,16359,Automatic,17748,Diesel,125,57.6,2.0
4192,3 Series,2018,23980,Semi-Auto,11717,Diesel,145,57.7,2.0
...,...,...,...,...,...,...,...,...,...
7010,X6,2019,59991,Automatic,4509,Diesel,145,34.9,3.0
6206,3 Series,2020,27492,Semi-Auto,3500,Petrol,145,42.2,2.0
4558,1 Series,2016,12995,Semi-Auto,33291,Diesel,20,67.3,2.0
618,6 Series,2015,17498,Semi-Auto,51412,Diesel,160,49.6,3.0


In [105]:
train_full = pd.concat([train,val])

In [106]:
X = ['year',  'mileage', 'tax',
       'mpg', 'engineSize','model', 'transmission', 'fuelType']
cat_features = ['model', 'transmission', 'fuelType']
y = ['price']

In [107]:
# put the parameters in a dictionary for convenience

parameters = {'cat_features' : cat_features,
              'learning_rate' : 0.08, # change the learning rate
              'eval_metric' : 'MAPE',
              'verbose' : 100}

In [108]:
model = CatBoostRegressor(**parameters)

In [109]:
model.fit(train[X], train[y],eval_set=(val[X], val[y]))

0:	learn: 0.4576088	test: 0.4448361	best: 0.4448361 (0)	total: 5.33ms	remaining: 5.33s
100:	learn: 0.0859551	test: 0.0887269	best: 0.0887269 (100)	total: 668ms	remaining: 5.95s
200:	learn: 0.0756432	test: 0.0802535	best: 0.0802365 (199)	total: 1.1s	remaining: 4.37s
300:	learn: 0.0708377	test: 0.0768255	best: 0.0768255 (300)	total: 1.51s	remaining: 3.5s
400:	learn: 0.0675903	test: 0.0746676	best: 0.0746662 (399)	total: 1.91s	remaining: 2.86s
500:	learn: 0.0656762	test: 0.0739730	best: 0.0739730 (500)	total: 2.34s	remaining: 2.33s
600:	learn: 0.0639073	test: 0.0733349	best: 0.0733349 (600)	total: 2.73s	remaining: 1.81s
700:	learn: 0.0621661	test: 0.0725205	best: 0.0725090 (687)	total: 3.14s	remaining: 1.34s
800:	learn: 0.0606551	test: 0.0719604	best: 0.0719604 (800)	total: 3.54s	remaining: 880ms
900:	learn: 0.0594107	test: 0.0716771	best: 0.0716771 (900)	total: 3.95s	remaining: 434ms
999:	learn: 0.0582401	test: 0.0712909	best: 0.0712452 (992)	total: 4.38s	remaining: 0us

bestTest = 0.071

<catboost.core.CatBoostRegressor at 0x7fabf3efd9a0>

In [112]:
model.best_iteration_

992

In [114]:
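# put the parameters in a dictionary for convenience

parameters = {'iterations': model.best_iteration_ + 1,  # best_iteration_ is 0-based
              'random_seed' : 42,
              'cat_features' : cat_features,
              'learning_rate' : 0.08, # change the learning rate
              'eval_metric' : 'MAPE',
              'verbose' : 100}

In [115]:
parameters

{'iterations': 993,
 'random_seed': 42,
 'cat_features': ['model', 'transmission', 'fuelType'],
 'learning_rate': 0.08,
 'eval_metric': 'MAPE',
 'verbose': 100}

In [116]:
# NOTE: `parameters` is not passed to a new CatBoostRegressor here, so this call
# refits the existing model with its previous settings (1000 iterations, see the log).
# To apply the dictionary, create the model first: CatBoostRegressor(**parameters).
model.fit(train_full[X], train_full[y])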

0:	learn: 0.4517876	total: 6.08ms	remaining: 6.07s
100:	learn: 0.0864716	total: 496ms	remaining: 4.42s
200:	learn: 0.0764252	total: 970ms	remaining: 3.85s
300:	learn: 0.0712969	total: 1.42s	remaining: 3.29s
400:	learn: 0.0685636	total: 1.9s	remaining: 2.83s
500:	learn: 0.0662134	total: 2.33s	remaining: 2.33s
600:	learn: 0.0641178	total: 2.81s	remaining: 1.87s
700:	learn: 0.0626798	total: 3.28s	remaining: 1.4s
800:	learn: 0.0613395	total: 3.73s	remaining: 927ms
900:	learn: 0.0602465	total: 4.2s	remaining: 461ms
999:	learn: 0.0594203	total: 4.65s	remaining: 0us


<catboost.core.CatBoostRegressor at 0x7fabf3efd9a0>

In [117]:
model.predict(test[X])

array([29212.46392165, 11110.7768548 , 34861.35679728, ...,
       18203.20561521, 11238.04519552, 16372.6855144 ])

In [118]:
test['price_pred_all_features_and_date'] = model.predict(test[X])

In [119]:
error(test['price'], test['price_pred_all_features_and_date'])

1586.3120712760026
0.07329083203555277


In [120]:
# previously (trained on train only) we had
error(test['price'], test['price_pred_all'])

1624.3527686112777
0.07376991781222213
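
To put the two results side by side, the relative improvement can be computed directly from the printed metrics. The numbers below are copied from the two `error(...)` outputs above; `relative_change` is a small helper introduced here for illustration.

```python
# Test-set metrics copied from the two error(...) outputs.
mae_before, mape_before = 1624.3527686112777, 0.07376991781222213
mae_after, mape_after = 1586.3120712760026, 0.07329083203555277

def relative_change(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before

print(f"MAE improved by {relative_change(mae_before, mae_after):.1%}")
print(f"MAPE improved by {relative_change(mape_before, mape_after):.1%}")
```

The gain from adding the validation rows to the training data is small here, a few percent of MAE at most.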


## Changing the loss function

In [121]:
X = ['year',  'mileage', 'tax',
       'mpg', 'engineSize','model', 'transmission', 'fuelType']
cat_features = ['model', 'transmission', 'fuelType']
y = ['price']

In [140]:
# put the parameters in a dictionary for convenience

parameters = {
              'random_seed' : 42,
              'cat_features' : cat_features,
              'learning_rate' : 0.12, # change the learning rate
              'eval_metric' : 'MAPE',
              'verbose' : 100,
              'loss_function' : 'MAE' }

In [141]:
model = CatBoostRegressor(**parameters)

In [142]:
model.fit(train[X], train[y],eval_set=(val[X], val[y]))

0:	learn: 0.3875962	test: 0.3735069	best: 0.3735069 (0)	total: 5.95ms	remaining: 5.94s
100:	learn: 0.0735575	test: 0.0763044	best: 0.0763044 (100)	total: 488ms	remaining: 4.34s
200:	learn: 0.0649629	test: 0.0714696	best: 0.0714696 (200)	total: 990ms	remaining: 3.94s
300:	learn: 0.0609392	test: 0.0700053	best: 0.0700053 (300)	total: 2.06s	remaining: 4.78s
400:	learn: 0.0589254	test: 0.0695055	best: 0.0695012 (399)	total: 3.03s	remaining: 4.53s
500:	learn: 0.0576315	test: 0.0693322	best: 0.0693254 (495)	total: 3.5s	remaining: 3.48s
600:	learn: 0.0565014	test: 0.0691157	best: 0.0690975 (584)	total: 3.95s	remaining: 2.62s
700:	learn: 0.0554363	test: 0.0688067	best: 0.0687862 (694)	total: 4.42s	remaining: 1.89s
800:	learn: 0.0544517	test: 0.0686674	best: 0.0686410 (797)	total: 4.87s	remaining: 1.21s
900:	learn: 0.0537573	test: 0.0685480	best: 0.0685476 (899)	total: 5.33s	remaining: 586ms
999:	learn: 0.0529895	test: 0.0683610	best: 0.0683588 (998)	total: 5.77s	remaining: 0us

bestTest = 0.068

<catboost.core.CatBoostRegressor at 0x7fabf3f1ee80>

In [143]:
test['price_pred_mae'] = model.predict(test[X])

In [144]:
error(test['price'], test['price_pred_mae'])

1611.426989828046
0.07150901109184499
