<a href="https://colab.research.google.com/github/avgalkov/collab-notebooks/blob/main/machine_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Download the dataset and load it into Colab

https://drive.google.com/file/d/11HcDhlPAJ92FHSUW_kA8kN33nWkmcLV0/view?usp=sharing


In [1]:
! gdown --id 11HcDhlPAJ92FHSUW_kA8kN33nWkmcLV0

Downloading...
From: https://drive.google.com/uc?id=11HcDhlPAJ92FHSUW_kA8kN33nWkmcLV0
To: /content/UK_used_cars.zip
100% 1.15M/1.15M [00:00<00:00, 58.3MB/s]


In [2]:
! unzip /content/UK_used_cars.zip -d /content/uk-used-cars

Archive:  /content/UK_used_cars.zip
  inflating: /content/uk-used-cars/audi.csv  
  inflating: /content/uk-used-cars/bmw.csv  
  inflating: /content/uk-used-cars/cclass.csv  
  inflating: /content/uk-used-cars/focus.csv  
  inflating: /content/uk-used-cars/ford.csv  
  inflating: /content/uk-used-cars/hyundi.csv  
  inflating: /content/uk-used-cars/merc.csv  
  inflating: /content/uk-used-cars/skoda.csv  
  inflating: /content/uk-used-cars/toyota.csv  
  inflating: /content/uk-used-cars/unclean cclass.csv  
  inflating: /content/uk-used-cars/unclean focus.csv  
  inflating: /content/uk-used-cars/vauxhall.csv  
  inflating: /content/uk-used-cars/vw.csv  


In [3]:
import pandas as pd
import matplotlib.pyplot as plt


plt.style.use('dark_background')

In [4]:
df = pd.read_csv('/content/uk-used-cars/bmw.csv')

In [5]:
df.head()

Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize
0,5 Series,2014,11200,Automatic,67068,Diesel,125,57.6,2.0
1,6 Series,2018,27000,Automatic,14827,Petrol,145,42.8,2.0
2,5 Series,2016,16000,Automatic,62794,Diesel,160,51.4,3.0
3,1 Series,2017,12750,Automatic,26676,Diesel,145,72.4,1.5
4,7 Series,2014,14500,Automatic,39554,Diesel,160,50.4,3.0


## Training and test sets (train and test)

In [6]:
from sklearn.model_selection import train_test_split

In [8]:
train, test = train_test_split(df, train_size=0.6, random_state=1)

# random_state = 1 freezes our samples: the same rows land in train and test on every run
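
A small check of what "freezing" means here: two calls to `train_test_split` with the same `random_state` return the same rows in the same order, while a different seed generally selects different rows. The toy DataFrame `demo` and the names `train_a`, `train_b`, `train_c` are made up for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, made up for this illustration (not the BMW dataset)
demo = pd.DataFrame({'x': range(100)})

train_a, test_a = train_test_split(demo, train_size=0.6, random_state=1)
train_b, test_b = train_test_split(demo, train_size=0.6, random_state=1)
train_c, test_c = train_test_split(demo, train_size=0.6, random_state=2)

# Same seed: identical rows in identical order
same_seed_equal = train_a.index.equals(train_b.index)
# Different seed: almost certainly a different selection of rows
other_seed_equal = train_a.index.equals(train_c.index)

print(same_seed_equal, other_seed_equal)
```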

## Validation set

*Let's split our data into train, val and test in a 60% / 20% / 20% ratio.
The validation set is used to check how well the model is learning, while the test set is completely new data for the model. We train on train, check the results on val, and only then apply the model to test.*

In [9]:
len(train) / len(df)

0.5999443465355718

In [10]:
len(test) / len(df)

0.4000556534644282

In [11]:
val, test = train_test_split(test,train_size = 0.5, random_state = 1)

In [12]:
len(train) / len(df)

0.5999443465355718

In [13]:
len(test) / len(df)

0.20007420461923756

In [14]:
len(val) / len(df)

0.19998144884519062
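
The two-step split above can be wrapped into one helper. This is only a sketch: the function name `split_train_val_test` and its parameters are hypothetical, and the toy DataFrame stands in for the real data. The one non-obvious step is that the validation share has to be rescaled, because the second split works on the 40% remainder rather than on the full dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_val_test(data, train_frac=0.6, val_frac=0.2, random_state=1):
    """Split `data` into train / val / test; test gets the remaining share."""
    train_part, rest = train_test_split(
        data, train_size=train_frac, random_state=random_state)
    # val_frac is a share of the full data, so rescale it to a share of `rest`
    val_share_of_rest = val_frac / (1 - train_frac)
    val_part, test_part = train_test_split(
        rest, train_size=val_share_of_rest, random_state=random_state)
    return train_part, val_part, test_part


# Toy data, made up for this illustration
demo = pd.DataFrame({'x': range(1000)})
train_part, val_part, test_part = split_train_val_test(demo)
print(len(train_part), len(val_part), len(test_part))
```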

## Feature list

In [15]:
train.columns

Index(['model', 'year', 'price', 'transmission', 'mileage', 'fuelType', 'tax',
       'mpg', 'engineSize'],
      dtype='object')

* The feature list must not contain the target, so we split our columns into two groups.

* For CatBoost it is important to specify which features contain text. To use them, they first have to be converted into numeric features. CatBoost can do this under the hood.

* In our case the text features are 'model', 'transmission' and 'fuelType'. The remaining features are numeric.
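
Instead of typing the two groups by hand, they can also be derived from the column dtypes. A minimal sketch, assuming that every non-numeric column should be treated as a text feature. The `sample` frame holds the first three rows of the `df.head()` output above, and the names `target`, `text_cols` and `numeric_cols` are introduced only for this example.

```python
import pandas as pd

# First three rows copied from the df.head() output above
sample = pd.DataFrame({
    'model': ['5 Series', '6 Series', '5 Series'],
    'year': [2014, 2018, 2016],
    'price': [11200, 27000, 16000],
    'transmission': ['Automatic', 'Automatic', 'Automatic'],
    'mileage': [67068, 14827, 62794],
    'fuelType': ['Diesel', 'Petrol', 'Diesel'],
    'tax': [125, 145, 160],
    'mpg': [57.6, 42.8, 51.4],
    'engineSize': [2.0, 2.0, 3.0],
})

target = 'price'
# Non-numeric columns are treated as text (categorical) features
text_cols = sample.select_dtypes(exclude='number').columns.tolist()
# Numeric columns, with the target removed
numeric_cols = [c for c in sample.select_dtypes(include='number').columns
                if c != target]

print(text_cols)
print(numeric_cols)
```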


In [16]:
X = ['year',  'mileage', 'tax',
       'mpg', 'engineSize']
cat_features = ['model', 'transmission', 'fuelType']
y = ['price']
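
A quick sanity check for the lists in this cell, where `X` holds only the numeric columns: the target should not appear among the features, the numeric and categorical lists should not overlap, and together the three lists should account for every column. The variable names `all_columns`, `target_leaks`, `overlap` and `unassigned` are introduced only for this sketch.

```python
# Column names as printed by train.columns above
all_columns = ['model', 'year', 'price', 'transmission', 'mileage',
               'fuelType', 'tax', 'mpg', 'engineSize']

X = ['year', 'mileage', 'tax', 'mpg', 'engineSize']
cat_features = ['model', 'transmission', 'fuelType']
y = ['price']

# The target must not leak into either feature list
target_leaks = set(y) & (set(X) | set(cat_features))
# In this cell the numeric and categorical lists are kept separate
overlap = set(X) & set(cat_features)
# Together the three lists should account for every column
unassigned = set(all_columns) - set(X) - set(cat_features) - set(y)

print(target_leaks, overlap, unassigned)
```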

## How categorical features are converted

One-hot encoding is one of the methods for converting categorical features into numeric ones.

In [17]:
pd.get_dummies(train['transmission'])

Unnamed: 0,Automatic,Manual,Semi-Auto
7882,1,0,0
1381,0,1,0
8859,1,0,0
9354,1,0,0
4192,0,0,1
...,...,...,...
2895,0,1,0
7813,1,0,0
905,0,1,0
5192,0,0,1
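
In pandas 2.0 and later, `get_dummies` returns `True`/`False` columns by default, so the 0/1 table above may look different on a newer installation. The sketch below passes `dtype=int` to get integers back and shows the defining property of one-hot encoding: every row contains exactly one 1. The toy Series is made up from the three transmission values seen above.

```python
import pandas as pd

# Toy column using the three transmission values from the table above
transmission = pd.Series(
    ['Automatic', 'Manual', 'Automatic', 'Semi-Auto'], name='transmission')

# dtype=int requests 0/1 integers instead of True/False
encoded = pd.get_dummies(transmission, dtype=int)

# Each row has exactly one 1: the column matching its category
print(encoded)
print(encoded.sum(axis=1).tolist())
```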


## First CatBoost run

In [18]:
! pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp38-none-manylinux1_x86_64.whl (76.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.6/76.6 MB 11.7 MB/s eta 0:00:00
Installing collected packages: catboost
Successfully installed catboost-1.1.1


In [20]:
from catboost import CatBoostRegressor
# for classification tasks, import CatBoostClassifier instead

For the first run, we use the same features that we used for the "human learning" exercise

In [64]:
X = ['year', 'transmission', 'engineSize']
cat_features = ['transmission']
y = ['price']

In [65]:
model = CatBoostRegressor(cat_features = cat_features,
                          eval_metric = 'MAPE', 
                          random_seed = 42,
                          verbose=100)

In [66]:
model.fit(train[X], train[y],eval_set=(val[X], val[y]))

Learning rate set to 0.068263
0:	learn: 0.4621086	test: 0.4489079	best: 0.4489079 (0)	total: 2.57ms	remaining: 2.57s
100:	learn: 0.1537470	test: 0.1534632	best: 0.1534632 (100)	total: 178ms	remaining: 1.58s
200:	learn: 0.1515149	test: 0.1521516	best: 0.1521514 (197)	total: 336ms	remaining: 1.34s
300:	learn: 0.1503412	test: 0.1520507	best: 0.1519138 (261)	total: 508ms	remaining: 1.18s
400:	learn: 0.1495316	test: 0.1521949	best: 0.1519138 (261)	total: 698ms	remaining: 1.04s
500:	learn: 0.1488269	test: 0.1521153	best: 0.1519138 (261)	total: 883ms	remaining: 880ms
600:	learn: 0.1482608	test: 0.1521817	best: 0.1519138 (261)	total: 1.36s	remaining: 903ms
700:	learn: 0.1478179	test: 0.1523615	best: 0.1519138 (261)	total: 1.87s	remaining: 797ms
800:	learn: 0.1473592	test: 0.1526559	best: 0.1519138 (261)	total: 2.29s	remaining: 568ms
900:	learn: 0.1469682	test: 0.1527051	best: 0.1519138 (261)	total: 2.71s	remaining: 297ms
999:	learn: 0.1466731	test: 0.1529079	best: 0.1519138 (261)	total: 2.91s	

<catboost.core.CatBoostRegressor at 0x7fabf3f3f760>

Make predictions on test

In [67]:
model.predict(test[X])

array([27791.79270642,  9414.28084977, 45523.73943666, ...,
       18197.14290638, 10477.94541465, 16233.48447626])

In [68]:
test['price_pred'] = model.predict(test[X])

In [69]:
test.head()

Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize,price_pred,price_pred_all
342,5 Series,2019,28500,Semi-Auto,3816,Diesel,145,62.8,2.0,27791.792706,29094.170853
7604,3 Series,2013,9990,Manual,51000,Diesel,145,55.4,2.0,9414.28085,10886.267555
6579,5 Series,2019,32896,Automatic,6071,Diesel,145,53.3,3.0,45523.739437,34656.929246
9739,3 Series,2013,9195,Manual,82260,Diesel,30,62.8,2.0,9414.28085,8659.105197
1932,M4,2019,44500,Semi-Auto,5644,Petrol,145,34.0,3.0,44965.633234,45768.460724


In [70]:
from sklearn.metrics import mean_absolute_error,mean_absolute_percentage_error

In [71]:
# wrap the error-metric calculations in a function

def error(y_true, y_pred):
    """Print MAE, then MAPE (as a fraction, not a percentage)."""
    print(mean_absolute_error(y_true, y_pred))
    print(mean_absolute_percentage_error(y_true, y_pred))


In [72]:
error(test['price'], test['price_pred'])

3631.492851599049
0.15887974860073903
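
To make the two numbers concrete, here is a sketch that computes both metrics by hand with NumPy. It uses only the five rows shown in the `test.head()` output above, so the values it prints describe those five rows and not the whole test set. The variable names are introduced for this example.

```python
import numpy as np

# First five test rows from the test.head() output above
y_true = np.array([28500, 9990, 32896, 9195, 44500], dtype=float)
y_pred = np.array([27791.792706, 9414.28085, 45523.739437,
                   9414.28085, 44965.633234])

abs_err = np.abs(y_true - y_pred)

# MAE: average absolute error, in the same units as the price
mae = abs_err.mean()
# MAPE: average absolute error relative to the true price (a fraction, not %)
mape = (abs_err / np.abs(y_true)).mean()

print(mae, mape)
```

MAE is dominated by large misses on expensive cars, while MAPE weighs every car by its own price, which is why the two are reported together.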


## Training on all features

In [73]:
X = ['year',  'mileage', 'tax',
       'mpg', 'engineSize','model', 'transmission', 'fuelType']
cat_features = ['model', 'transmission', 'fuelType']
y = ['price']

In [74]:
parameters = {'cat_features' : cat_features,
              'eval_metric' : 'MAPE',
              'verbose' : 100}

In [75]:
model = CatBoostRegressor(**parameters)

In [76]:
model.fit(train[X], train[y],eval_set=(val[X], val[y]))

Learning rate set to 0.068263
0:	learn: 0.4620476	test: 0.4491287	best: 0.4491287 (0)	total: 5.17ms	remaining: 5.16s
100:	learn: 0.0907457	test: 0.0909975	best: 0.0909975 (100)	total: 386ms	remaining: 3.43s
200:	learn: 0.0782808	test: 0.0813269	best: 0.0813269 (200)	total: 778ms	remaining: 3.09s
300:	learn: 0.0728162	test: 0.0770531	best: 0.0770531 (300)	total: 1.16s	remaining: 2.69s
400:	learn: 0.0697744	test: 0.0753101	best: 0.0753101 (400)	total: 1.54s	remaining: 2.3s
500:	learn: 0.0674429	test: 0.0740215	best: 0.0740215 (500)	total: 1.93s	remaining: 1.92s
600:	learn: 0.0652693	test: 0.0730381	best: 0.0730315 (593)	total: 2.3s	remaining: 1.53s
700:	learn: 0.0636495	test: 0.0724704	best: 0.0724670 (698)	total: 2.7s	remaining: 1.15s
800:	learn: 0.0622577	test: 0.0719561	best: 0.0719342 (796)	total: 3.09s	remaining: 768ms
900:	learn: 0.0609696	test: 0.0715338	best: 0.0715338 (900)	total: 3.48s	remaining: 383ms
999:	learn: 0.0598680	test: 0.0712426	best: 0.0712395 (996)	total: 3.88s	rem

<catboost.core.CatBoostRegressor at 0x7fabf3f45760>

In [77]:
test['price_pred_all'] = model.predict(test[X])

In [78]:
error(test['price'], test['price_pred_all'])

1624.3527686112777
0.07376991781222213
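
The two runs can be compared directly using the test-set numbers printed by `error()` above. A short sketch of the arithmetic, with variable names chosen only for this example:

```python
# Test-set metrics printed by error() in the two runs above
mae_three, mape_three = 3631.492851599049, 0.15887974860073903
mae_all, mape_all = 1624.3527686112777, 0.07376991781222213

# Relative reduction of each metric when moving to all features
mae_reduction = 1 - mae_all / mae_three
mape_reduction = 1 - mape_all / mape_three

print(f"MAE reduced by {mae_reduction:.1%}")
print(f"MAPE reduced by {mape_reduction:.1%}")
```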
