<a href="https://colab.research.google.com/github/dnevo/Practicum/blob/master/S12_Numerical_Methods_%E2%80%93_market_value_of_car.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Review

Hi, my name is Daria! I'm reviewing your project. 

You can find my comments under the heading «Review». 
I’m using __<font color='green'>green</font>__ color if everything is done perfectly. Recommendations and remarks are highlighted in __<font color='blue'>blue</font>__. 
If the topic requires some extra work, the color will be  __<font color='red'>red</font>__. 

You did an outstanding job on data processing and model training! You didn't make any mistakes in the general data science workflow :) The only thing you need to work on is adding another type of regression model. Waiting for your update!


Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

# 1. Data preparation

In [None]:
colab = True
if colab:
    data_path = 'https://raw.githubusercontent.com/dnevo/Practicum-NM/master/car_data.zip'
    !pip install catboost
else:
    data_path = '/datasets/car_data.csv'



In [None]:
import pandas as pd
import numpy as np
import lightgbm as lgbm
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:11,.2f}'.format

In [None]:
data = pd.read_csv(data_path,parse_dates=['DateCrawled', 'DateCreated', 'LastSeen'])
data.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:00,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24,0,70435,2016-07-04 03:16:00
1,2016-03-24 10:58:00,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24,0,66954,2016-07-04 01:46:00
2,2016-03-14 12:52:00,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14,0,90480,2016-05-04 12:47:00
3,2016-03-17 16:54:00,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17,0,91074,2016-03-17 17:40:00
4,2016-03-31 17:25:00,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31,0,60437,2016-06-04 10:17:00


## <font color='green'>Review
    
Nice use of ``parse_dates=`` parameter! </font>

In [None]:
data.describe(include='all')

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
count,354369,354369.0,316879,354369.0,334536,354369.0,334664,354369.0,354369.0,321474,354369,283215,354369,354369.0,354369.0,354369
unique,15470,,8,,2,,250,,,7,40,2,109,,,18592
top,2016-05-03 14:25:00,,sedan,,manual,,golf,,,petrol,volkswagen,no,2016-03-04 00:00:00,,,2016-07-04 07:16:00
freq,66,,91457,,268251,,29232,,,216352,77013,247161,13719,,,654
first,2016-01-04 00:06:00,,,,,,,,,,,,2014-10-03 00:00:00,,,2016-01-04 00:15:00
last,2016-12-03 23:59:00,,,,,,,,,,,,2016-12-03 00:00:00,,,2016-12-03 23:54:00
mean,,4416.66,,2004.23,,110.09,,128211.17,5.71,,,,,0.0,50508.69,
std,,4514.16,,90.23,,189.85,,37905.34,3.73,,,,,0.0,25783.1,
min,,0.0,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,1050.0,,1999.0,,69.0,,125000.0,3.0,,,,,0.0,30165.0,


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   DateCrawled        354369 non-null  datetime64[ns]
 1   Price              354369 non-null  int64         
 2   VehicleType        316879 non-null  object        
 3   RegistrationYear   354369 non-null  int64         
 4   Gearbox            334536 non-null  object        
 5   Power              354369 non-null  int64         
 6   Model              334664 non-null  object        
 7   Mileage            354369 non-null  int64         
 8   RegistrationMonth  354369 non-null  int64         
 9   FuelType           321474 non-null  object        
 10  Brand              354369 non-null  object        
 11  NotRepaired        283215 non-null  object        
 12  DateCreated        354369 non-null  datetime64[ns]
 13  NumberOfPictures   354369 non-null  int64   

## As can be seen above:
- total of 354,369 rows
- some of the columns / features have missing values
- several of the features are categorical nominal: `VehicleType`, `NumberOfPictures`, `PostalCode` , `Gearbox`, `Model`, `FuelType`, `Brand`, `NotRepaired`.
- Note that `PostalCode` and `Model` have high cardinality - which may be problematic for tree algorithms...
- 3 features (`DateCrawled`, `DateCreated`, `LastSeen`) are datetime, which cannot be handled by regression models - therefore we will engineer another feature out of them.
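
The observations above can be reproduced with a short summary of missing values and cardinality per column. The sketch below uses a tiny made-up frame (`sample` is hypothetical, not the project data); in the notebook the same two calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical miniature of the dataset, only to illustrate the checks
sample = pd.DataFrame({
    'VehicleType': ['sedan', None, 'small', 'sedan'],
    'Model': ['golf', 'golf', None, 'fabia'],
    'PostalCode': [70435, 66954, 90480, 91074],
})

# Share of missing values per column, in percent
missing_pct = sample.isna().mean() * 100

# Number of distinct non-null values per column (cardinality)
cardinality = sample.nunique()

summary = pd.DataFrame({'missing_pct': missing_pct, 'nunique': cardinality})
print(summary)
```

A column with many distinct values relative to the number of rows (here `PostalCode`) is a candidate for dropping or grouping before one-hot encoding.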

## <font color='green'>Review
    
Good, everything is correct here</font>


Drop columns:
- `NumberOfPictures` - the column carries no information (always 0)
- `PostalCode` - a nominal categorical feature with more than 8,000 different codes, which makes it too fragmented to be useful
- `RegistrationMonth` - no impact on `Price` (only the year matters)

## <font color='green'>Review
    
A reasonable decision :)</font>


In [None]:
data.drop(['NumberOfPictures', 'PostalCode', 'RegistrationMonth'], axis=1, inplace=True)

Delete rows with `Model` = 'other' or `Model` = NaN. Reason: price is highly correlated with the model.

In [None]:
data = data.loc[data['Model'] != 'other']
data = data.loc[~data['Model'].isna()]

Delete rows whose `Model` occurs fewer than 20 times

In [None]:
# Keep only models that appear at least 20 times
ge_20 = data['Model'].value_counts().gt(19)
data = data.loc[data['Model'].isin(ge_20[ge_20].index)]

Delete rows with `Price` < 10

In [None]:
data = data.loc[data['Price'] >= 10]

Delete outliers in registration year < 1960 (error or antiques) and registration year > 2016 (the data is from 2016...)

## <font color='green'>Review
    
Great that you noticed these outliers!</font>


In [None]:
data = data.loc[data['RegistrationYear'] >= 1960]
data = data.loc[data['RegistrationYear'] <= 2016]

`Gearbox`, `VehicleType` and `FuelType` - NaN will be replaced by the most frequent value within the same `Model`

In [None]:
# Fill NaNs with the most frequent value within each Model group
for col in ['Gearbox', 'VehicleType', 'FuelType']:
    most_frequent = data.groupby('Model')[col].transform(lambda x: x.value_counts().index[0])
    data[col] = data[col].fillna(most_frequent)

`NotRepaired` - replace NaNs with 'no', which is by far the most frequent value

In [None]:
data['NotRepaired'] = data['NotRepaired'].fillna('no')

`Power` == 0 is not a real value (it was probably left unfilled) and should be replaced by the most frequent value per model. We will do this for every `Power` < 20.

There are also abnormally large values (e.g. `Power` = 20000). We will replace values above the 99.5th percentile (329) in the same way.

## <font color='blue'>Review
    
There are also abnormally large values in this feature. Might be useful to replace them as well :)</font>


In [None]:
data['Power'].quantile(0.995)

329.0

In [None]:
data['Power'] = np.where(data['Power'] < 20 , data.groupby('Model')['Power'].transform(lambda x:x.value_counts().index[0]), data['Power'])
data['Power'] = np.where(data['Power'] > 329, data.groupby('Model')['Power'].transform(lambda x:x.value_counts().index[0]), data['Power'])
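
One caveat with the replacement above: the most frequent value per model is computed over all rows, including the invalid ones, so a model whose most common `Power` is 0 would keep that placeholder. A possible alternative, sketched here on made-up data (the frame `toy` and the names `LOW`, `HIGH`, `mode_per_model` are hypothetical), takes the mode over valid readings only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'corsa' has more zero readings than valid ones
toy = pd.DataFrame({
    'Model': ['golf', 'golf', 'golf', 'corsa', 'corsa', 'corsa', 'corsa'],
    'Power': [75, 75, 0, 0, 0, 60, 20000],
})

LOW, HIGH = 20, 329  # thresholds taken from the text above

# Rows whose Power lies inside the accepted range (both ends inclusive)
valid = toy['Power'].between(LOW, HIGH)

# Most frequent valid Power per model; invalid rows are masked to NaN first
mode_per_model = (
    toy['Power'].where(valid)
    .groupby(toy['Model'])
    .agg(lambda s: s.mode().iloc[0] if s.notna().any() else np.nan)
)

# Replace only the invalid rows, looking up the mode for each row's model
toy['Power'] = toy['Power'].where(valid, toy['Model'].map(mode_per_model))
print(toy)
```

A model with no valid readings at all ends up with NaN here, which makes such cases visible instead of silently keeping an implausible value.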

Create new feature out of the datetime features before dropping them

In [None]:
data['days_seen'] = (data.LastSeen - data.DateCreated).dt.days
data.drop(['DateCrawled', 'DateCreated', 'LastSeen'], axis=1, inplace=True)

Assign dtype = category for the categorical features - this is needed for the models to work

In [None]:
categorical_features = data.select_dtypes(exclude=['number']).columns.tolist()
for c in categorical_features:
    data[c] = data[c].astype('category')

In [None]:
for cat in categorical_features:
    print(f'--- feature: {cat}, nuniques: {data[cat].nunique()} ---')
    print(data[cat].value_counts())

--- feature: VehicleType, nuniques: 8 ---
sedan          87359
small          75834
wagon          61039
bus            24738
convertible    17742
coupe          12240
suv             9038
other           1964
Name: VehicleType, dtype: int64
--- feature: Gearbox, nuniques: 2 ---
manual    233292
auto       56662
Name: Gearbox, dtype: int64
--- feature: Model, nuniques: 234 ---
golf       26660
3er        18599
polo       11987
corsa      11549
astra       9991
           ...  
delta         31
b_max         26
charade       26
9000          24
musa          22
Name: Model, Length: 234, dtype: int64
--- feature: FuelType, nuniques: 7 ---
petrol      195185
gasoline     89981
lpg           4074
cng            476
hybrid         123
other           90
electric        25
Name: FuelType, dtype: int64
--- feature: Brand, nuniques: 39 ---
volkswagen       67158
opel             33996
bmw              32937
mercedes_benz    26472
audi             25581
ford             21254
renault          1

## <font color='green'>Review
    
You did a very thoughtful work on data preprocessing!</font>


# 2. Model training

In [None]:
def split_groups (df, target_col):
    df_train, df_temp = train_test_split(df, test_size=0.4, random_state=12345)
    df_valid, df_test = train_test_split(df_temp, test_size=0.5, random_state=12345)

    features_train = df_train.drop([target_col], axis=1)
    target_train = df_train[target_col]
    features_valid = df_valid.drop([target_col], axis=1)
    target_valid = df_valid[target_col]
    features_test = df_test.drop([target_col], axis=1)
    target_test = df_test[target_col]
    return features_train, target_train, features_valid, target_valid, features_test, target_test
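
The two-step split in `split_groups` yields a 60/20/20 partition: 40% is held out first and then halved into validation and test. A quick check on a hypothetical 1,000-row frame (`df` is made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with 1,000 rows, used only to check the split sizes
df = pd.DataFrame({'feature': range(1000), 'Price': range(1000)})

# Same two-step split as in split_groups: 60% train, then the rest halved
df_train, df_temp = train_test_split(df, test_size=0.4, random_state=12345)
df_valid, df_test = train_test_split(df_temp, test_size=0.5, random_state=12345)

print(len(df_train), len(df_valid), len(df_test))

# The three parts should not share any rows
overlap = set(df_train.index) & set(df_valid.index) | set(df_valid.index) & set(df_test.index)
print(len(overlap))
```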

## Train with Random Forest

Sadly, there are 234 unique models. OHE would require 233 columns, which would make the model impractical (very long runtime). Therefore we group the less popular models (all but the 50 most frequent) under 'other'.

In [None]:
data1 = data.copy()
top_50 = data1['Model'].value_counts()[:50].index
data1['Model'] = np.where(data1['Model'].isin(top_50),data1['Model'], 'other')
data_ohe = pd.get_dummies(data1, columns=categorical_features, drop_first=True)
features_train, target_train, features_valid, target_valid, features_test, target_test = split_groups(data_ohe, 'Price')
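
`pd.get_dummies` encodes the whole frame before it is split. An alternative, sketched below on made-up data, is to fit scikit-learn's `OneHotEncoder` (already imported above) on the training part only and reuse it for the other parts. With `handle_unknown='ignore'`, a category not seen during fitting is encoded as all zeros. This is an illustration, not the method used in this notebook; the frames `train` and `valid` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train / validation frames with one categorical column
train = pd.DataFrame({'Model': ['golf', 'polo', 'other', 'golf']})
valid = pd.DataFrame({'Model': ['polo', 'fabia']})  # 'fabia' is unseen in train

# Learn the category list from the training data only
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['Model']])

# transform returns a sparse matrix by default; convert it for display
train_ohe = encoder.transform(train[['Model']]).toarray()
valid_ohe = encoder.transform(valid[['Model']]).toarray()

print(encoder.categories_[0])  # categories learned from the training data
print(valid_ohe)
```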

In [None]:
%%timeit -n1 -r1
for depth in range(9,11):
    model = RandomForestRegressor(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    score_train = mean_squared_error(target_train, model.predict(features_train)) ** 0.5
    score_valid = mean_squared_error(target_valid, model.predict(features_valid)) ** 0.5
    dif = 100*(score_valid-score_train) / score_train
    print(depth, score_train, score_valid, dif)

9 1828.3493648316462 1890.6402763436508 3.4069479668477016
10 1732.3187959387506 1824.222195575718 5.305224410912459
1 loop, best of 1: 3min 45s per loop


As shown above, depth=9 provides an acceptable result (depth=10 results in overfitting: the train/validation gap grows from 3.4% to 5.3%).

We split into 3 groups again because the dataframe is different: the following 2 algorithms can work with categorical data directly, so no one-hot encoding is needed.

In [None]:
features_train, target_train, features_valid, target_valid, features_test, target_test = split_groups(data, 'Price')

## Train with CatBoost

In [None]:
def print_rmse():
    rmse_train = mean_squared_error(target_train, prediction_train) ** 0.5
    rmse_valid = mean_squared_error(target_valid, prediction_valid) ** 0.5
    rmse_test = mean_squared_error(target_test, prediction_test) ** 0.5
    print(f'RMSE Train: {rmse_train:,.0f}, Valid: {rmse_valid:,.0f}, Test: {rmse_test:,.0f}, diff(Train,Valid):{100*(rmse_valid - rmse_train)/rmse_train:,.1f}%')
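
`print_rmse` reads the targets and predictions from global variables, so its output depends on whichever cell ran last. A minimal alternative sketch passes the values explicitly; the helper names `rmse` and `relative_gap` are hypothetical.

```python
import numpy as np

def rmse(target, prediction):
    """Root mean squared error between two equally sized sequences."""
    target = np.asarray(target, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    return float(np.sqrt(np.mean((target - prediction) ** 2)))

def relative_gap(rmse_train, rmse_valid):
    """How much the validation error exceeds the training error, in percent."""
    return 100 * (rmse_valid - rmse_train) / rmse_train

# Worked example with made-up prices
example_rmse = rmse([1000, 2000, 3000], [1100, 1900, 3200])
print(example_rmse)

# Values from the Random Forest run at depth=9 above
example_gap = relative_gap(1828.35, 1890.64)
print(round(example_gap, 1))
```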

In [None]:
for depth_ in range(13,16):
    model = CatBoostRegressor(loss_function="RMSE", depth=depth_,n_estimators=150)
    print (f'Depth: {depth_}')
    model.fit(features_train, target_train, cat_features=categorical_features, verbose=50)
    prediction_train = model.predict(features_train)
    prediction_valid = model.predict(features_valid)
    prediction_test = model.predict(features_test)
    print_rmse()

Depth: 4
Learning rate set to 0.413406
0:	learn: 3516.6052688	total: 84.4ms	remaining: 12.6s
50:	learn: 1709.2577708	total: 2.41s	remaining: 4.69s
100:	learn: 1633.4110934	total: 4.7s	remaining: 2.28s
149:	learn: 1596.9050195	total: 6.9s	remaining: 0us
RMSE Train: 1,592, Valid: 1,628, Test: 1,634, diff(Train,Valid):2.3%
Depth: 5
Learning rate set to 0.413406
0:	learn: 3463.8670965	total: 68.4ms	remaining: 10.2s
50:	learn: 1655.2418615	total: 2.71s	remaining: 5.25s
100:	learn: 1585.3398139	total: 5.29s	remaining: 2.56s
149:	learn: 1547.5676461	total: 7.87s	remaining: 0us
RMSE Train: 1,543, Valid: 1,600, Test: 1,594, diff(Train,Valid):3.7%
Depth: 6
Learning rate set to 0.413406
0:	learn: 3399.1912738	total: 84.2ms	remaining: 12.5s
50:	learn: 1607.7470864	total: 3.1s	remaining: 6.02s
100:	learn: 1537.4609654	total: 6.05s	remaining: 2.94s
149:	learn: 1499.9093999	total: 8.99s	remaining: 0us
RMSE Train: 1,497, Valid: 1,570, Test: 1,571, diff(Train,Valid):4.8%
Depth: 7
Learning rate set to 0

## <font color='green'>Review
    
Great that you used `cat_features=` parameter :)</font>

As above, depth=6 provides the best result. Execution time = 11s

## Train with LightGBM

Using `num_leaves`=30, there is overfitting (RMSE train=1,397, RMSE valid=1,491 - diff 6.8%)

In [None]:
nleaves = 30
params = {
 'boosting_type': 'gbdt',
 'objective': 'regression',
 'metric': {'root_mean_squared_error'},
 'num_leaves': nleaves,
 'learning_rate': 0.05,
 'feature_fraction': 0.9,
 'bagging_fraction': 0.8,
 'bagging_freq': 5,
 'verbose': 0
}
lgb_train = lgbm.Dataset(features_train, target_train)
lgb_eval = lgbm.Dataset(features_valid, target_valid, reference=lgb_train)
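# Early stopping and periodic logging are passed as callbacks;
# the early_stopping_rounds / verbose_eval arguments were removed in LightGBM 4.0.
gbm = lgbm.train(params,
                lgb_train,
                num_boost_round=800,
                valid_sets=[lgb_eval],
                callbacks=[lgbm.early_stopping(stopping_rounds=5),
                           lgbm.log_evaluation(period=100)])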

prediction_train = gbm.predict(features_train, num_iteration=gbm.best_iteration)
prediction_valid = gbm.predict(features_valid, num_iteration=gbm.best_iteration)
prediction_test = gbm.predict(features_test, num_iteration=gbm.best_iteration)
print_rmse()



Training until validation scores don't improve for 5 rounds.
[100]	valid_0's rmse: 1579.77
[200]	valid_0's rmse: 1526.38
[300]	valid_0's rmse: 1508.61
[400]	valid_0's rmse: 1499.29
[500]	valid_0's rmse: 1491.42
Early stopping, best iteration is:
[502]	valid_0's rmse: 1491.32
RMSE Train: 1,397, Valid: 1,491, Test: 1,493, diff(Train,Valid):6.8%


In [None]:
nleaves = 17
params = {
 'boosting_type': 'gbdt',
 'objective': 'regression',
 'metric': {'root_mean_squared_error'},
 'num_leaves': nleaves,
 'learning_rate': 0.05,
 'feature_fraction': 0.9,
 'bagging_fraction': 0.8,
 'bagging_freq': 5,
 'verbose': 0
}
lgb_train = lgbm.Dataset(features_train, target_train)
lgb_eval = lgbm.Dataset(features_valid, target_valid, reference=lgb_train)
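# Early stopping and periodic logging are passed as callbacks;
# the early_stopping_rounds / verbose_eval arguments were removed in LightGBM 4.0.
gbm = lgbm.train(params,
                lgb_train,
                num_boost_round=800,
                valid_sets=[lgb_eval],
                callbacks=[lgbm.early_stopping(stopping_rounds=5),
                           lgbm.log_evaluation(period=100)])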

prediction_train = gbm.predict(features_train, num_iteration=gbm.best_iteration)
prediction_valid = gbm.predict(features_valid, num_iteration=gbm.best_iteration)
prediction_test = gbm.predict(features_test, num_iteration=gbm.best_iteration)
print_rmse()



Training until validation scores don't improve for 5 rounds.
[100]	valid_0's rmse: 1637.71
[200]	valid_0's rmse: 1565.95
[300]	valid_0's rmse: 1544.36
[400]	valid_0's rmse: 1533.16
[500]	valid_0's rmse: 1523.02
Early stopping, best iteration is:
[590]	valid_0's rmse: 1517.64
RMSE Train: 1,446, Valid: 1,518, Test: 1,522, diff(Train,Valid):5.0%


LightGBM: now using `num_leaves`=17, overfitting is reduced (RMSE train=1,446, RMSE valid=1,518 - diff 5.0%). Execution time: 48.7s

## <font color='green'>Review
    
All good :) You correctly train and test your models. Hyperparameter tuning is fine too, though it would be better to try cross-validation and to explore more parameter sets.</font>

<font color='red'>But both CatBoost and LightGBM are gradient boosting models. We can't claim to have chosen an optimal algorithm if we haven't tried anything else :) So I ask you to apply some other type of regression and compare its performance with boosting.
    
</font>
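
One way to address both remarks, sketched here on synthetic data rather than the car dataset, is to score a non-boosting model such as linear regression with k-fold cross-validation. The arrays `X` and `y` are generated for illustration only, and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared features and prices
rng = np.random.default_rng(12345)
X = rng.normal(size=(500, 3))
y = 5000 + X @ np.array([1500.0, -800.0, 300.0]) + rng.normal(scale=200, size=500)

# scikit-learn maximises scores, so RMSE is exposed with a negative sign
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring='neg_root_mean_squared_error')
rmse_per_fold = -scores

print(rmse_per_fold.round(1), rmse_per_fold.mean().round(1))
```

On the real data the categorical columns would first need one-hot encoding (as in `data_ohe`), since linear regression cannot consume the `category` dtype directly.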

# 3. Model analysis
- Random Forest (depth=9): RMSE=1,890, execution time=1m
- CatBoost (depth=6): RMSE=1,570, execution time=11s
- LightGBM (`num_leaves`=17): RMSE=1,518, execution time=48.7s

As we can see, LightGBM provides the lowest validation RMSE of the three.
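
The times listed above cover whole training cells. Prediction speed, which the task also asks about, can be measured separately. A minimal sketch with `time.perf_counter` on synthetic data follows; the helper `timed` and the data sizes are hypothetical.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Synthetic stand-in data; sizes are arbitrary
rng = np.random.default_rng(12345)
X = rng.normal(size=(2000, 5))
y = X[:, 0] * 1000 + rng.normal(scale=100, size=2000)

model = RandomForestRegressor(n_estimators=20, max_depth=9, random_state=12345)

# Time training and prediction independently
_, fit_seconds = timed(model.fit, X, y)
predictions, predict_seconds = timed(model.predict, X)

print(f'fit: {fit_seconds:.3f}s, predict: {predict_seconds:.3f}s')
```

The same helper could wrap the `fit` and `predict` calls of each of the three models, giving one training time and one prediction time per model for the comparison.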

# <font color='green'>Review

Great results! Thank you for considering both error and execution time of models :)</font>

## Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed