Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that determines the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

In [203]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split

from catboost import CatBoostRegressor
import lightgbm as lgb
import xgboost as xgb
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler 
import math
import warnings
from datetime import date
import time

RANDOM_STATE = 12345
warnings.filterwarnings('ignore')

### Download and look at the data

In [204]:
car_data = pd.read_csv('https://code.s3.yandex.net/datasets/car_data.csv')

In [205]:
car_data['LastSeen'].sort_values(ascending=False)

72809     31/03/2016 23:54
262683    31/03/2016 23:51
36473     31/03/2016 23:50
231191    31/03/2016 23:48
158969    31/03/2016 23:47
                ...       
105825    01/04/2016 00:15
328139    01/04/2016 00:15
237989    01/04/2016 00:15
286293    01/04/2016 00:15
323643    01/04/2016 00:15
Name: LastSeen, Length: 354369, dtype: object
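
The output above is sorted as text, because `LastSeen` is still a day-first string at this point. That is why `31/03/2016` ranks above `01/04/2016`. A minimal sketch, using only the standard library and three timestamps copied from the output, of how parsing with an explicit day-first format restores chronological order (the `parse_day_first` helper is hypothetical and not part of the notebook):

```python
from datetime import datetime

# A few LastSeen strings taken from the output above (day/month/year).
last_seen = ["31/03/2016 23:54", "01/04/2016 00:15", "31/03/2016 23:47"]

def parse_day_first(text):
    """Parse a 'dd/mm/YYYY HH:MM' string into a datetime."""
    return datetime.strptime(text, "%d/%m/%Y %H:%M")

# Sorting the raw strings compares characters, so '31/03' ranks above '01/04'.
as_text = sorted(last_seen, reverse=True)

# Sorting on the parsed values compares real dates, newest first.
as_dates = sorted(last_seen, key=parse_day_first, reverse=True)

print(as_text[0])   # 31/03/2016 23:54
print(as_dates[0])  # 01/04/2016 00:15
```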

In [206]:
car_data['RegistrationMonth'].value_counts()

0     37352
3     34373
6     31508
4     29270
5     29153
7     27213
10    26099
12    24289
11    24186
9     23813
1     23219
8     22627
2     21267
Name: RegistrationMonth, dtype: int64

In [207]:
RegistrationMonth = car_data[car_data['RegistrationMonth'] ==0]['RegistrationMonth'].count()
print(f"There are {RegistrationMonth} observations with RegistrationMonth=0 ")

There are 37352 observations with RegistrationMonth=0 


In [208]:
price = car_data[car_data['Price']<=0]['Price'].count()
print(f"There are {price} observations with illogical price values ")

There are 10772 observations with illogical price values 


In [209]:
# Cars date from 1886, so registration years before 1885 or in the future are counted as illogical
RegistrationYear = car_data[(car_data['RegistrationYear'] > date.today().year) | \
                                (car_data['RegistrationYear'] < 1885)]['Price'].count()
print(f"There are {RegistrationYear} observations with illogical RegistrationYear values ")

There are 171 observations with illogical RegistrationYear values 


In [210]:
# Power values above 1500 are counted as illogical
power = car_data[car_data['Power'] >1500]['Price'].count()
print(f"There are {power} observations with illogical power values ")

There are 203 observations with illogical power values 


In [211]:
# Drop rows with non-positive prices
car_data = car_data[car_data['Price'] > 0]
# Drop rows where RegistrationMonth is 0 (not a valid month)
car_data = car_data[car_data['RegistrationMonth'] > 0]

# Drop rows with illogical RegistrationYear values
car_data = car_data[(car_data['RegistrationYear'] <= date.today().year) &
                    (car_data['RegistrationYear'] > 1885)]

# Drop rows with illogical Power values
car_data = car_data[car_data['Power'] < 1500]



In [212]:
car_data.shape

(310522, 16)

In [213]:
# observations with null values
car_data[car_data.isnull().any(axis=1)].sample(50)

| | DateCrawled | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Mileage | RegistrationMonth | FuelType | Brand | NotRepaired | DateCreated | NumberOfPictures | PostalCode | LastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25460 | 11/03/2016 08:36 | 2500 | NaN | 2017 | auto | 136 | other | 150000 | 9 | petrol | suzuki | no | 11/03/2016 00:00 | 0 | 39576 | 21/03/2016 15:44 |
| 267235 | 04/04/2016 19:39 | 50 | sedan | 2000 | NaN | 0 | NaN | 5000 | 1 | petrol | smart | NaN | 04/04/2016 00:00 | 0 | 93107 | 06/04/2016 21:45 |
| 292076 | 08/03/2016 15:53 | 850 | sedan | 1997 | manual | 75 | golf | 150000 | 3 | petrol | volkswagen | NaN | 08/03/2016 00:00 | 0 | 27211 | 06/04/2016 20:19 |
| 327289 | 16/03/2016 13:37 | 9199 | sedan | 2005 | auto | 249 | NaN | 150000 | 11 | petrol | sonstige_autos | no | 16/03/2016 00:00 | 0 | 46535 | 06/04/2016 09:17 |
| 254597 | 01/04/2016 17:37 | 1590 | NaN | 2016 | manual | 58 | agila | 100000 | 12 | petrol | opel | no | 01/04/2016 00:00 | 0 | 92363 | 03/04/2016 13:48 |
| 297537 | 31/03/2016 08:55 | 5500 | sedan | 2002 | auto | 333 | NaN | 150000 | 4 | petrol | bmw | no | 31/03/2016 00:00 | 0 | 21522 | 31/03/2016 09:49 |
| 121527 | 24/03/2016 09:57 | 300 | NaN | 2016 | auto | 60 | corsa | 150000 | 9 | petrol | opel | NaN | 24/03/2016 00:00 | 0 | 27616 | 06/04/2016 23:45 |
| 48988 | 20/03/2016 22:48 | 2899 | NaN | 2016 | manual | 75 | golf | 100000 | 5 | petrol | volkswagen | NaN | 20/03/2016 00:00 | 0 | 61352 | 22/03/2016 13:10 |
| 197381 | 09/03/2016 13:54 | 10500 | sedan | 2008 | manual | 143 | 1er | 125000 | 10 | petrol | bmw | NaN | 09/03/2016 00:00 | 0 | 54570 | 09/03/2016 13:54 |
| 64325 | 11/03/2016 11:50 | 7999 | suv | 2000 | auto | 184 | cherokee | 150000 | 3 | petrol | jeep | NaN | 11/03/2016 00:00 | 0 | 22527 | 05/04/2016 12:55 |


### Fill missing values for FuelType

In [214]:
car_data[car_data['VehicleType'] == 'bus']['FuelType'].value_counts(dropna=False)

gasoline    16629
petrol       9008
NaN           666
lpg           484
cng           233
other           7
hybrid          5
electric        1
Name: FuelType, dtype: int64

For VehicleType=bus, the most common FuelType is gasoline.


In [215]:
# fill missing values for  VehicleType=bus -> FuelType=gasoline
car_data.loc[car_data['VehicleType'] == 'bus','FuelType'] = car_data.loc[car_data['VehicleType'] == 'bus','FuelType'].fillna('gasoline')

In [216]:
car_data[car_data['VehicleType'] == 'suv']['FuelType'].value_counts(dropna=False)

gasoline    6243
petrol      4207
lpg          520
NaN          277
hybrid        10
other          9
cng            3
Name: FuelType, dtype: int64

For VehicleType=suv, the most common FuelType is gasoline.


In [217]:
# fill missing values for  VehicleType=suv -> FuelType=gasoline
car_data.loc[car_data['VehicleType'] == 'suv','FuelType'] = car_data.loc[car_data['VehicleType'] == 'suv','FuelType'].fillna('gasoline')

In [218]:
car_data[car_data['VehicleType'] == 'wagon']['FuelType'].value_counts(dropna=False)

gasoline    31310
petrol      24835
NaN          1983
lpg          1066
cng           128
other          20
hybrid         20
electric        5
Name: FuelType, dtype: int64

For VehicleType=wagon, the most common FuelType is gasoline.


In [219]:
# fill missing values for  VehicleType=wagon -> FuelType=gasoline
car_data.loc[car_data['VehicleType'] == 'wagon','FuelType'] = car_data.loc[car_data['VehicleType'] == 'wagon','FuelType'].fillna('gasoline')

In [220]:
car_data[car_data['VehicleType'] == 'coupe']['FuelType'].value_counts(dropna=False)

petrol      11606
gasoline     1966
NaN           511
lpg           284
hybrid         16
electric        5
cng             2
other           1
Name: FuelType, dtype: int64

For VehicleType=coupe, the most common FuelType is petrol.


In [221]:
# fill missing values for  VehicleType=coupe -> FuelType=petrol
car_data.loc[car_data['VehicleType'] == 'coupe','FuelType'] = car_data.loc[car_data['VehicleType'] == 'coupe','FuelType'].fillna('petrol')

In [222]:
car_data[car_data['VehicleType'] == 'small']['FuelType'].value_counts(dropna=False)

petrol      61778
gasoline     6323
NaN          2959
lpg           460
cng            75
electric       45
hybrid         33
other          21
Name: FuelType, dtype: int64

For VehicleType=small, the most common FuelType is petrol.


In [223]:
# fill missing values for  VehicleType=small -> FuelType=petrol
car_data.loc[car_data['VehicleType'] == 'small','FuelType'] = car_data.loc[car_data['VehicleType'] == 'small','FuelType'].fillna('petrol')

In [224]:
car_data[car_data['VehicleType'] == 'sedan']['FuelType'].value_counts(dropna=False)

petrol      55868
gasoline    24072
NaN          2563
lpg          1529
hybrid        126
cng            36
other          35
electric        6
Name: FuelType, dtype: int64

For VehicleType=sedan, the most common FuelType is petrol.


In [225]:
# fill missing values for  VehicleType=sedan -> FuelType=petrol
car_data.loc[car_data['VehicleType'] == 'sedan','FuelType'] = car_data.loc[car_data['VehicleType'] == 'sedan','FuelType'].fillna('petrol')

In [226]:
car_data[car_data['VehicleType'] == 'convertible']['FuelType'].value_counts(dropna=False)

petrol      16529
gasoline     1461
NaN           512
lpg           217
electric        6
other           5
cng             3
Name: FuelType, dtype: int64

For VehicleType=convertible, the most common FuelType is petrol.

In [227]:
# fill missing values for  VehicleType=convertible -> FuelType=petrol
car_data.loc[car_data['VehicleType'] == 'convertible','FuelType'] = car_data.loc[car_data['VehicleType'] == 'convertible','FuelType'].fillna('petrol')

In [228]:
car_data[car_data['VehicleType'] == 'other']['FuelType'].value_counts(dropna=False)

gasoline    1232
petrol      1119
NaN          190
other         25
lpg           23
cng           12
electric      10
hybrid         2
Name: FuelType, dtype: int64

For VehicleType=other, the most common FuelType is gasoline.

In [229]:
# Fill missing values for VehicleType=other -> FuelType=gasoline
car_data.loc[car_data['VehicleType'] == 'other','FuelType'] = car_data.loc[car_data['VehicleType'] == 'other','FuelType'].fillna('gasoline')

In [230]:
car_data[car_data['FuelType'].isnull()]['VehicleType'].value_counts(dropna=False)

NaN    8419
Name: VehicleType, dtype: int64
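
The per-type cells above all follow one pattern: find the most frequent `FuelType` within a `VehicleType` and use it to fill the gaps. A minimal sketch of the same idea in a single pass, on a toy frame with made-up values (the names `toy` and `mode_by_type` are illustrative and do not appear in the notebook). As in the output above, rows whose `VehicleType` is itself missing stay unfilled:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for car_data; the values are illustrative only.
toy = pd.DataFrame({
    "VehicleType": ["bus", "bus", "bus", "coupe", "coupe", "coupe", np.nan],
    "FuelType": ["gasoline", "gasoline", np.nan, "petrol", "petrol", np.nan, np.nan],
})

# Most frequent FuelType per VehicleType, computed from the known values.
mode_by_type = (
    toy.dropna(subset=["VehicleType", "FuelType"])
       .groupby("VehicleType")["FuelType"]
       .agg(lambda s: s.mode().iloc[0])
)

# Look up each row's group mode and use it only where FuelType is missing.
toy["FuelType"] = toy["FuelType"].fillna(toy["VehicleType"].map(mode_by_type))

print(toy["FuelType"].tolist())
```

Ties between equally frequent values are resolved by taking the first entry that `mode()` returns, which is a choice of this sketch rather than something the notebook specifies.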

In [231]:
car_data.pivot_table(
    index='VehicleType',
    columns='Gearbox',
    values='PostalCode',
    aggfunc='count',
    dropna=False
)


| VehicleType | Gearbox: auto | Gearbox: manual |
|---|---|---|
| bus | 4013 | 22430 |
| convertible | 4018 | 14218 |
| coupe | 4120 | 9953 |
| other | 280 | 2217 |
| sedan | 20331 | 62011 |
| small | 5566 | 63837 |
| suv | 4832 | 6185 |
| wagon | 15097 | 43081 |

In [232]:
# Check Gearbox distribution
car_data.pivot_table(
    index='VehicleType',
    columns='Gearbox',
    values='PostalCode',
    aggfunc='count',
)

| VehicleType | Gearbox: auto | Gearbox: manual |
|---|---|---|
| bus | 4013 | 22430 |
| convertible | 4018 | 14218 |
| coupe | 4120 | 9953 |
| other | 280 | 2217 |
| sedan | 20331 | 62011 |
| small | 5566 | 63837 |
| suv | 4832 | 6185 |
| wagon | 15097 | 43081 |


In every vehicle type, most cars have Gearbox=manual, so we replace the NaN values with manual.

In [233]:
car_data['Gearbox'] = car_data['Gearbox'].fillna('manual')

In [234]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 310522 entries, 1 to 354368
Data columns (total 16 columns):
DateCrawled          310522 non-null object
Price                310522 non-null int64
VehicleType          289335 non-null object
RegistrationYear     310522 non-null int64
Gearbox              310522 non-null object
Power                310522 non-null int64
Model                298275 non-null object
Mileage              310522 non-null int64
RegistrationMonth    310522 non-null int64
FuelType             302103 non-null object
Brand                310522 non-null object
NotRepaired          263854 non-null object
DateCreated          310522 non-null object
NumberOfPictures     310522 non-null int64
PostalCode           310522 non-null int64
LastSeen             310522 non-null object
dtypes: int64(7), object(9)
memory usage: 40.3+ MB


In [235]:
car_data.isnull().sum()

DateCrawled              0
Price                    0
VehicleType          21187
RegistrationYear         0
Gearbox                  0
Power                    0
Model                12247
Mileage                  0
RegistrationMonth        0
FuelType              8419
Brand                    0
NotRepaired          46668
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

In [236]:
#Number of observations with/without null 
car_data.isna().any(axis=1).value_counts()

False    245043
True      65479
dtype: int64

In [237]:
# Drop rows with null values
car_data.dropna(inplace=True)

21% of the observations contain null values, which is a lot of data to remove.
After dropping them, 245043 observations remain.

### Data type replacement

In [238]:
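# The timestamps are day-first strings (e.g. 31/03/2016 23:54), so the format is given explicitly
date_format = '%d/%m/%Y %H:%M'
car_data['LastSeen'] = pd.to_datetime(car_data['LastSeen'], format=date_format)
car_data['DateCreated'] = pd.to_datetime(car_data['DateCreated'], format=date_format)
car_data['NotRepaired'] = car_data['NotRepaired'].map(dict(yes=1, no=0))
car_data['RegistrationMonth'] = car_data['RegistrationMonth'].astype('int8')
# RegistrationYear holds plain years, so parse it as '%Y' rather than as nanoseconds since the epoch
car_data['RegistrationYear'] = pd.to_datetime(car_data['RegistrationYear'], format='%Y')
car_data['DateCrawled'] = pd.to_datetime(car_data['DateCrawled'], format=date_format)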

In [239]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 245043 entries, 3 to 354367
Data columns (total 16 columns):
DateCrawled          245043 non-null datetime64[ns]
Price                245043 non-null int64
VehicleType          245043 non-null object
RegistrationYear     245043 non-null datetime64[ns]
Gearbox              245043 non-null object
Power                245043 non-null int64
Model                245043 non-null object
Mileage              245043 non-null int64
RegistrationMonth    245043 non-null int8
FuelType             245043 non-null object
Brand                245043 non-null object
NotRepaired          245043 non-null int64
DateCreated          245043 non-null datetime64[ns]
NumberOfPictures     245043 non-null int64
PostalCode           245043 non-null int64
LastSeen             245043 non-null datetime64[ns]
dtypes: datetime64[ns](4), int64(6), int8(1), object(5)
memory usage: 30.1+ MB


### Duplicated data

In [240]:
# Check for duplicated data
car_data.duplicated().sum()

255

In [241]:
# Drop duplicates
car_data.drop_duplicates(inplace=True)

### Conclusions
- 21% of the observations contained null values, which is a lot of data to remove. After dropping them, 245043 observations remain.
- Data types were replaced for several features.
- Duplicated rows were dropped.

## Model training

In [242]:
# Remove unnecessary features
car_data.drop(['DateCrawled','RegistrationYear','DateCreated','LastSeen','PostalCode'], axis=1,inplace=True) 

In [243]:
car_data_ohe = pd.get_dummies(car_data, drop_first=True)

In [244]:
car_data_ohe.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244788 entries, 3 to 354367
Columns: 306 entries, Price to Brand_volvo
dtypes: int64(5), int8(1), uint8(300)
memory usage: 81.5 MB


In [245]:
#categorical feature list
categorical_feature  = ['VehicleType','Gearbox','Model','FuelType','Brand']

#categorical feature index
categorical_index= [car_data.columns.get_loc(col) for col in categorical_feature]
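
The positions above are measured on `car_data`, which still contains `Price` as its first column, while the models later receive a feature frame without `Price`. A minimal sketch, using plain lists of the column names shown in this notebook, of how the two sets of positions differ (`index_in_features` and `selected` are hypothetical names introduced for illustration):

```python
# Column order after the drop above (Price is still the first column).
all_columns = ["Price", "VehicleType", "Gearbox", "Power", "Model", "Mileage",
               "RegistrationMonth", "FuelType", "Brand", "NotRepaired",
               "NumberOfPictures"]
categorical_feature = ["VehicleType", "Gearbox", "Model", "FuelType", "Brand"]

# The feature matrix is the same frame without the target column.
feature_columns = [col for col in all_columns if col != "Price"]

# Positions measured on the full frame, as in the cell above.
index_with_target = [all_columns.index(col) for col in categorical_feature]

# Positions measured on the feature matrix (hypothetical alternative).
index_in_features = [feature_columns.index(col) for col in categorical_feature]

# Columns the first list would select if applied to the feature matrix.
selected = [feature_columns[i] for i in index_with_target]

print(index_with_target)  # [1, 2, 4, 7, 8]
print(index_in_features)  # [0, 1, 3, 6, 7]
print(selected)           # ['Gearbox', 'Power', 'Mileage', 'Brand', 'NotRepaired']
```

Passing column names instead of positions, where a library accepts them, avoids this kind of offset.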

In [246]:
car_data_ordinal = car_data.copy()
car_data_ordinal = car_data_ordinal.reset_index()
# Encode categorical features
encoder = OrdinalEncoder()

car_data_ordinal[categorical_feature] = pd.DataFrame(encoder.fit_transform(car_data[categorical_feature]), columns=categorical_feature)

car_data_ordinal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244788 entries, 0 to 244787
Data columns (total 12 columns):
index                244788 non-null int64
Price                244788 non-null int64
VehicleType          244788 non-null float64
Gearbox              244788 non-null float64
Power                244788 non-null int64
Model                244788 non-null float64
Mileage              244788 non-null int64
RegistrationMonth    244788 non-null int8
FuelType             244788 non-null float64
Brand                244788 non-null float64
NotRepaired          244788 non-null int64
NumberOfPictures     244788 non-null int64
dtypes: float64(5), int64(6), int8(1)
memory usage: 20.8 MB


In [247]:
target = car_data['Price']
features = car_data.drop(['Price'], axis=1)

target_ohe = car_data_ohe['Price']
features_ohe = car_data_ohe.drop(['Price'], axis=1)

target_ordinal = car_data_ordinal['Price']
features_ordinal = car_data_ordinal.drop(['Price'], axis=1)


In [248]:
# Split data into train and validation
#features_train, features_valid, target_train, target_valid = train_test_split(
#    features, target, test_size=0.2, random_state=RANDOM_STATE)

# Split data into train and test.
#features_train, features_test, target_train, target_test = train_test_split(
#    features_train, target_train, test_size=0.25, random_state=RANDOM_STATE)


# Split data into train and validation
features_train_ordinal, features_valid_ordinal, target_train_ordinal, target_valid_ordinal = train_test_split(
    features_ordinal, target_ordinal, test_size=0.2, random_state=RANDOM_STATE)

# Split data into train and test.
#features_train_ordinal, features_test_ordinal, target_train_ordinal, target_test_ordinal = train_test_split(
#    features_train_ordinal, target_train_ordinal, test_size=0.25, random_state=RANDOM_STATE)



In [249]:
# Feature Scaling
scaler = StandardScaler()
scaler.fit(features_train_ordinal)
features_train_ordinal = scaler.transform(features_train_ordinal)
features_valid_ordinal = scaler.transform(features_valid_ordinal)

In [250]:
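def train_fit_score(model, features, target, params, name):
    """Grid-search `model`, then return (RMSE, fit time, predict time).

    RMSE and predict time are measured on the same data used for fitting;
    `name` is accepted as a label but is not used.
    """
    model = GridSearchCV(model, param_grid=params, cv=5, verbose=0, refit=True)

    # Time the grid search, including the final refit on all of `features`
    start = time.time()
    model.fit(features, target)
    stop = time.time()
    model_fit_time = stop - start

    # Time prediction with the refitted best estimator
    start = time.time()
    predict = model.predict(features)
    stop = time.time()
    model_predict_time = stop - start

    # RMSE computation
    rmse = math.sqrt(mean_squared_error(target, predict))
    return (rmse, model_fit_time, model_predict_time)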
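
For reference, the value returned as `rmse` is the root mean squared error. With $y_i$ the observed prices, $\hat{y}_i$ the predictions, and $n$ the number of rows passed to `mean_squared_error` (symbols introduced here for illustration), it can be written as:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

The result is in the same units as `Price`, which makes it easier to read than the mean squared error itself.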

### RandomForest

In [251]:

params = {'n_estimators' : range(10, 25, 25),
          'max_depth' : range(5, 10, 10),
          'min_impurity_decrease' : np.arange(.05, .1, .1),
          'min_samples_split' : [3,5,7]
         }

rf_result = train_fit_score(RandomForestRegressor(random_state=RANDOM_STATE),
                               features_train_ordinal,
                               target_train_ordinal,
                               params, 
                               'Random Forest')


print(f'RMSE: {rf_result[0]:.2f}')
print(f'fit time={rf_result[1]}  \npredict time={rf_result[2]}')

RMSE: 2983.93
fit time=29.389227390289307  
predict time=0.07725834846496582
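
In the grid above, each `range(start, stop, step)` has a step larger than its span, so it yields only the start value, and `np.arange(.05, .1, .1)` likewise produces the single value 0.05. The search therefore varies only `min_samples_split`. A minimal standard-library sketch of that effect (the wider range at the end is a hypothetical example, not a recommended setting):

```python
from itertools import product

# The integer ranges from the grid above: a step larger than the span
# leaves only the start value.
n_estimators = list(range(10, 25, 25))
max_depth = list(range(5, 10, 10))
min_samples_split = [3, 5, 7]

print(n_estimators)  # [10]
print(max_depth)     # [5]

# Number of candidate combinations, with min_impurity_decrease fixed at 0.05.
candidates = list(product(n_estimators, max_depth, [0.05], min_samples_split))
print(len(candidates))  # 3

# A hypothetical wider range: a step smaller than the span yields several values.
wider_n_estimators = list(range(10, 60, 20))
print(wider_n_estimators)  # [10, 30, 50]
```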


###  CatBoostRegressor

In [252]:
#%%time
params = {'depth': range(10, 25, 25),
              'learning_rate' : np.arange(.05, .1, .1),
              'iterations'    : range(10, 25, 25)
                 }
model_CBR = CatBoostRegressor() 
grid = GridSearchCV(model_CBR, param_grid = params, cv=5, n_jobs=-1)
start = time.time()
grid.fit(features, target, cat_features=categorical_feature , verbose=10)
stop = time.time()
CBR_fit_time = stop-start
# Predict with the refitted best estimator
start = time.time()
predict = grid.predict(features)
stop = time.time()
CBR_predict_time = stop-start  
# RMSE Computation
catBoost_rmse = math.sqrt(mean_squared_error(target, predict))
print(f'RMSE: {catBoost_rmse:.2f}')
print(f'fit time={CBR_fit_time}  \npredict time={CBR_predict_time}')

0:	learn: 4562.8800114	total: 267ms	remaining: 2.4s
9:	learn: 3666.0677058	total: 2.57s	remaining: 0us
0:	learn: 4562.1444992	total: 147ms	remaining: 1.32s
9:	learn: 3664.2289416	total: 2.62s	remaining: 0us
0:	learn: 4562.2887011	total: 215ms	remaining: 1.93s
9:	learn: 3666.9450088	total: 2.61s	remaining: 0us
0:	learn: 4563.8743418	total: 294ms	remaining: 2.64s
9:	learn: 3662.8466000	total: 2.57s	remaining: 0us
0:	learn: 4573.4998302	total: 198ms	remaining: 1.78s
9:	learn: 3677.0055692	total: 2.67s	remaining: 0us
0:	learn: 4565.8233177	total: 234ms	remaining: 2.1s
9:	learn: 3666.3922177	total: 3.02s	remaining: 0us
RMSE: 3666.88
fit time=44.69465112686157  
predict time=0.4393754005432129


### LightGBM 

In [254]:
# Change categorical features to data type 'category'
for feature in categorical_feature:
    features[feature] = features[feature].astype('category')

model_lgb = lgb.LGBMRegressor()
params = {
    'n_estimators': range(10, 25, 25),
    'colsample_bytree': np.arange(0.1, 0.9,0.5),
    'max_depth': range(10, 25, 25),
    'num_leaves': range(50, 200, 150),
    'reg_alpha': np.arange(1.1, 1.5, 0.4),
    'reg_lambda': np.arange(1.1, 1.3, 0.2),
    'min_split_gain': np.arange(0.3, 0.4,0.1),
    'subsample': np.arange(0.7, 0.9, 0.2),
    'subsample_freq': [20]
}
grid = GridSearchCV(model_lgb, param_grid = params, cv=5)

start = time.time()
grid.fit(features, target, categorical_feature=categorical_index)
stop = time.time()
lgb_fit_time = stop-start

# Predict with the refitted best estimator
start = time.time()
predict = grid.predict(features)
stop = time.time()
lgb_predict_time = stop-start  
# RMSE Computation
lightGBM_rmse = math.sqrt(mean_squared_error(target, predict))
print(f'RMSE: {lightGBM_rmse:.2f}')
print(f'fit time={lgb_fit_time}\npredict time={lgb_predict_time}')

RMSE: 3390.70
fit time=21.59748387336731
predict time=0.2270030975341797


### XGBoost 

In [258]:
model_xgb = xgb.XGBRegressor(objective ='reg:squarederror',n_estimators = 10, seed = 123)

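# num_leaves, min_split_gain and subsample_freq are LightGBM parameter names
# carried over from the previous grid; they are not XGBRegressor parameters.
params = {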
    'n_estimators': range(10, 25, 25),
    'colsample_bytree': np.arange(0.1, 0.9,0.5),
    'max_depth': range(10, 25, 25),
    'num_leaves': range(50, 200, 150),
    'reg_alpha': np.arange(1.1, 1.5, 0.4),
    'reg_lambda': np.arange(1.1, 1.3, 0.2),
    'min_split_gain': np.arange(0.3, 0.4,0.1),
    'subsample': np.arange(0.7, 0.9, 0.2),
    'subsample_freq': [20]
}
grid = GridSearchCV(model_xgb, param_grid = params, cv=5)

start = time.time()
grid.fit(features_train_ordinal, target_train_ordinal)
stop = time.time()
xgb_fit_time = stop-start  

# Predict with the refitted best estimator
start = time.time()
predict = grid.predict(features_train_ordinal)
stop = time.time()
xgb_predict_time = stop-start  
  
# RMSE Computation

XGBoost_rmse = math.sqrt(mean_squared_error(target_train_ordinal, predict))
print(f'RMSE: {XGBoost_rmse:.2f}')
print(f'fit time={xgb_fit_time}\npredict time={xgb_predict_time}')

RMSE: 3542.49
fit time=62.85644268989563
predict time=0.21706509590148926


####  Sanity check

In [261]:
reg = LinearRegression()
#cross_val_score(reg, features_train_ordinal, target_train_ordinal, cv=5)
grid = GridSearchCV(reg, param_grid = {}, cv=5)

start = time.time()
grid.fit(features_train_ordinal, target_train_ordinal)
stop = time.time()
lr_fit_time = stop-start  
 
start = time.time()    
predict = grid.predict(features_valid_ordinal)
stop = time.time()
lr_predict_time = stop-start  
 
lr_rmse = math.sqrt(mean_squared_error(target_valid_ordinal, predict))
print(f'RMSE: {lr_rmse:.2f}')
print(f'fit time={lr_fit_time}\npredict time={lr_predict_time}')

RMSE: 3279.75
fit time=1.4908909797668457
predict time=0.009427547454833984


## Model analysis

In [263]:
data = [[lr_rmse,XGBoost_rmse,lightGBM_rmse,catBoost_rmse,rf_result[0]],
        [lr_fit_time,xgb_fit_time,lgb_fit_time,CBR_fit_time,rf_result[1]],
        [lr_predict_time,xgb_predict_time,lgb_predict_time,CBR_predict_time,rf_result[2]]
       ]
columns=['LinearRegression','XGBoost','LightGBM','CatBoost','RandomForest']
index = ('rmse','fit time','predict time')
scores = pd.DataFrame(data,columns=columns,index=index)
scores

| | LinearRegression | XGBoost | LightGBM | CatBoost | RandomForest |
|---|---|---|---|---|---|
| rmse | 3279.74924 | 3542.489717 | 3390.703057 | 3666.880721 | 2983.926871 |
| fit time | 1.490891 | 62.856443 | 21.597484 | 44.694651 | 29.389227 |
| predict time | 0.009428 | 0.217065 | 0.227003 | 0.439375 | 0.077258 |


- RandomForest has the smallest RMSE.
- RandomForest has the second fastest fit time (not counting LinearRegression).
- RandomForest has the fastest predict time (not counting LinearRegression).
- All models except RandomForest performed worse than the LinearRegression model (the sanity check).
- The best model is RandomForest: it has the smallest RMSE and it is fast.
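
In the table above, the LinearRegression RMSE comes from the validation split, while the other four scores were computed on the data the models were fitted on, so the columns are not strictly like-for-like. A minimal sketch of a helper that fits on one split and scores on another, written with the standard library only; `MeanRegressor` and `score_on_validation` are hypothetical names, and the toy numbers are illustrative rather than taken from the dataset:

```python
import math
import time

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

class MeanRegressor:
    """Hypothetical stand-in model: always predicts the training mean."""
    def fit(self, features, target):
        self.mean_ = sum(target) / len(target)
        return self

    def predict(self, features):
        return [self.mean_ for _ in features]

def score_on_validation(model, x_train, y_train, x_valid, y_valid):
    """Fit on the training split, then time and score on the validation split."""
    start = time.perf_counter()
    model.fit(x_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    predicted = model.predict(x_valid)
    predict_time = time.perf_counter() - start

    return rmse(y_valid, predicted), fit_time, predict_time

# Toy data, illustrative only.
x_train, y_train = [[1], [2], [3]], [100.0, 200.0, 300.0]
x_valid, y_valid = [[4], [5]], [150.0, 250.0]

score, fit_time, predict_time = score_on_validation(
    MeanRegressor(), x_train, y_train, x_valid, y_valid)
print(f"validation RMSE: {score:.2f}")  # validation RMSE: 50.00
```

Any object with `fit` and `predict` methods, including the grid-search objects used in this notebook, could be passed in place of the stand-in model.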