## Predicting the Sale Price of Bulldozers using Machine Learning

This notebook walks through a machine learning project whose goal is to predict the sale price of bulldozers.

### 1. Problem definition
How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for?

### 2. Data
The data comes from the Kaggle Blue Book for Bulldozers competition: https://www.kaggle.com/competitions/bluebook-for-bulldozers/data

The data for this competition is split into three parts:

- Train.csv is the training set, which contains data through the end of 2011.
- Valid.csv is the validation set, which contains data from January 1, 2012 to April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
- Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 to November 2012. Your score on the test set determines your final rank for the competition.

The key fields in train.csv are:

- SalesID: the unique identifier of the sale
- MachineID: the unique identifier of a machine. A machine can be sold multiple times.
- SalePrice: what the machine sold for at auction (only provided in train.csv)
- saledate: the date of the sale

### 3. Evaluation
The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

For more on the evaluation of this project check: https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview/evaluation

Note: The goal for most regression evaluation metrics is to minimize the error. For this project, the goal is to build a machine learning model that minimizes RMSLE.
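
To make the metric concrete, here is a minimal NumPy sketch of RMSLE, assuming the usual definition based on `log(1 + x)`. The prices are made up for illustration. It shows that RMSLE penalises relative error: predicting double the price and predicting half the price both give a value close to ln 2 (about 0.693), regardless of how expensive the machine is.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error: sqrt(mean((log(1 + pred) - log(1 + true))^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up prices, for illustration only
actual = np.array([10000.0, 50000.0])
too_high = actual * 2   # every prediction is double the actual price
too_low = actual / 2    # every prediction is half the actual price

print(rmsle(actual, actual))    # perfect predictions give zero error
print(rmsle(actual, too_high))
print(rmsle(actual, too_low))
```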

### 4. Features
Kaggle provides a data dictionary detailing all of the features of the dataset. You can view this data dictionary by visiting this link: https://www.kaggle.com/competitions/bluebook-for-bulldozers/data

In [1]:
import numpy as np
import pandas as pd
import sklearn
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("data/TrainAndValid.csv", 
                 low_memory=False,
                 parse_dates=["saledate"])
df.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1139246,66000.0,999089,3157,121,3.0,2004,68.0,Low,2006-11-16,...,,,,,,,,,Standard,Conventional
1,1139248,57000.0,117657,77,121,3.0,1996,4640.0,Low,2004-03-26,...,,,,,,,,,Standard,Conventional
2,1139249,10000.0,434808,7009,121,3.0,2001,2838.0,High,2004-02-26,...,,,,,,,,,,
3,1139251,38500.0,1026470,332,121,3.0,2001,3486.0,High,2011-05-19,...,,,,,,,,,,
4,1139253,11000.0,1057373,17311,121,3.0,2007,722.0,Medium,2009-07-23,...,,,,,,,,,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   SalesID                   412698 non-null  int64         
 1   SalePrice                 412698 non-null  float64       
 2   MachineID                 412698 non-null  int64         
 3   ModelID                   412698 non-null  int64         
 4   datasource                412698 non-null  int64         
 5   auctioneerID              392562 non-null  float64       
 6   YearMade                  412698 non-null  int64         
 7   MachineHoursCurrentMeter  147504 non-null  float64       
 8   UsageBand                 73670 non-null   object        
 9   saledate                  412698 non-null  datetime64[ns]
 10  fiModelDesc               412698 non-null  object        
 11  fiBaseModel               412698 non-null  object        
 ...

In [4]:
df.isna().sum()

SalesID                          0
SalePrice                        0
MachineID                        0
ModelID                          0
datasource                       0
auctioneerID                 20136
YearMade                         0
MachineHoursCurrentMeter    265194
UsageBand                   339028
saledate                         0
fiModelDesc                      0
fiBaseModel                      0
fiSecondaryDesc             140727
fiModelSeries               354031
fiModelDescriptor           337882
ProductSize                 216605
fiProductClassDesc               0
state                            0
ProductGroup                     0
ProductGroupDesc                 0
Drive_System                305611
Enclosure                      334
Forks                       214983
Pad_Type                    331602
Ride_Control                259970
Stick                       331602
Transmission                224691
Turbocharged                331602
...

In [5]:
df.saledate[:5]

0   2006-11-16
1   2004-03-26
2   2004-02-26
3   2011-05-19
4   2009-07-23
Name: saledate, dtype: datetime64[ns]

In [6]:
# Work on a copy so the original DataFrame stays untouched
df_temp = df.copy()

#### Add datetime parameters for `saledate` column

In [7]:
df_temp["saleyear"] = df_temp.saledate.dt.year
df_temp["salemonth"] = df_temp.saledate.dt.month
df_temp["saleday"] = df_temp.saledate.dt.day
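
The same `.dt` accessor exposes other calendar attributes that may also carry signal. The sketch below uses a two-row toy frame rather than the real data; the `saledayofweek` and `saledayofyear` columns are hypothetical additions, not part of this notebook's pipeline.

```python
import pandas as pd

# Toy frame with two of the sale dates shown above
toy = pd.DataFrame({"saledate": pd.to_datetime(["2006-11-16", "2004-03-26"])})

# Features used in this notebook
toy["saleyear"] = toy["saledate"].dt.year
toy["salemonth"] = toy["saledate"].dt.month
toy["saleday"] = toy["saledate"].dt.day

# Hypothetical extra features (assumption: not used elsewhere in the notebook)
toy["saledayofweek"] = toy["saledate"].dt.dayofweek   # Monday=0 ... Sunday=6
toy["saledayofyear"] = toy["saledate"].dt.dayofyear

print(toy.drop(columns="saledate"))
```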

In [8]:
df_temp.head().T

Unnamed: 0,0,1,2,3,4
SalesID,1139246,1139248,1139249,1139251,1139253
SalePrice,66000.0,57000.0,10000.0,38500.0,11000.0
MachineID,999089,117657,434808,1026470,1057373
ModelID,3157,77,7009,332,17311
datasource,121,121,121,121,121
auctioneerID,3.0,3.0,3.0,3.0,3.0
YearMade,2004,1996,2001,2001,2007
MachineHoursCurrentMeter,68.0,4640.0,2838.0,3486.0,722.0
UsageBand,Low,Low,High,High,Medium
saledate,2006-11-16 00:00:00,2004-03-26 00:00:00,2004-02-26 00:00:00,2011-05-19 00:00:00,2009-07-23 00:00:00


In [9]:
df_temp.drop("saledate", axis=1, inplace=True)

#### Convert strings to categories

In [10]:
for label, content in df_temp.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

UsageBand
fiModelDesc
fiBaseModel
fiSecondaryDesc
fiModelSeries
fiModelDescriptor
ProductSize
fiProductClassDesc
state
ProductGroup
ProductGroupDesc
Drive_System
Enclosure
Forks
Pad_Type
Ride_Control
Stick
Transmission
Turbocharged
Blade_Extension
Blade_Width
Enclosure_Type
Engine_Horsepower
Hydraulics
Pushblock
Ripper
Scarifier
Tip_Control
Tire_Size
Coupler
Coupler_System
Grouser_Tracks
Hydraulics_Flow
Track_Type
Undercarriage_Pad_Width
Stick_Length
Thumb
Pattern_Changer
Grouser_Type
Backhoe_Mounting
Blade_Type
Travel_Controls
Differential_Type
Steering_Controls


In [11]:
for label, content in df_temp.items():
    if pd.api.types.is_string_dtype(content):
        df_temp[label] = content.astype("category").cat.as_ordered()

In [12]:
df_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 55 columns):
 #   Column                    Non-Null Count   Dtype   
---  ------                    --------------   -----   
 0   SalesID                   412698 non-null  int64   
 1   SalePrice                 412698 non-null  float64 
 2   MachineID                 412698 non-null  int64   
 3   ModelID                   412698 non-null  int64   
 4   datasource                412698 non-null  int64   
 5   auctioneerID              392562 non-null  float64 
 6   YearMade                  412698 non-null  int64   
 7   MachineHoursCurrentMeter  147504 non-null  float64 
 8   UsageBand                 73670 non-null   category
 9   fiModelDesc               412698 non-null  category
 10  fiBaseModel               412698 non-null  category
 11  fiSecondaryDesc           271971 non-null  category
 12  fiModelSeries             58667 non-null   category
 ...

In [13]:
# Check missing data
df_temp.isna().sum() / len(df_temp)

SalesID                     0.000000
SalePrice                   0.000000
MachineID                   0.000000
ModelID                     0.000000
datasource                  0.000000
auctioneerID                0.048791
YearMade                    0.000000
MachineHoursCurrentMeter    0.642586
UsageBand                   0.821492
fiModelDesc                 0.000000
fiBaseModel                 0.000000
fiSecondaryDesc             0.340993
fiModelSeries               0.857845
fiModelDescriptor           0.818715
ProductSize                 0.524851
fiProductClassDesc          0.000000
state                       0.000000
ProductGroup                0.000000
ProductGroupDesc            0.000000
Drive_System                0.740520
Enclosure                   0.000809
Forks                       0.520921
Pad_Type                    0.803498
Ride_Control                0.629928
Stick                       0.803498
Transmission                0.544444
Turbocharged                0.803498
...

#### Fill missing values

In [14]:
# Fill missing values in numeric columns with the column median
for label, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            df_temp[label] = content.fillna(content.median())
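
Median filling hides which values were originally missing, and for a column such as `MachineHoursCurrentMeter` (about 64% missing above) that is most of the rows. One common variation, sketched here on toy data, records that information in a boolean column before filling. The `_is_missing` suffix is an assumed naming choice, and this step is not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"MachineHoursCurrentMeter": [68.0, np.nan, 2838.0, np.nan, 722.0]})

# Iterate over a fixed list of names because new columns are added inside the loop
for label in list(toy.columns):
    content = toy[label]
    if pd.api.types.is_numeric_dtype(content) and content.isna().any():
        # Hypothetical indicator column: True where the value was missing
        toy[label + "_is_missing"] = content.isna()
        # Same median fill as in the cell above
        toy[label] = content.fillna(content.median())

print(toy)
```

If this idea is used, computing the median from the training rows only and reusing it for validation and test data keeps information from later sales out of the training features.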

In [15]:
for label, content in df_temp.items():
    if pd.api.types.is_string_dtype(content):
        df_temp[label] = content.astype("category").cat.as_ordered()

In [16]:
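# Turn categorical values into numbers. Missing values get the code -1,
# so adding 1 maps them to 0 and shifts every real category to 1 or higher.
for label, content in df_temp.items():
    if not pd.api.types.is_numeric_dtype(content):
        df_temp[label] = pd.Categorical(content).codes + 1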
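
A small toy example shows what the `codes + 1` step does. Categories are sorted alphabetically, a missing value is coded as -1, and adding 1 turns that into 0. The series below is made up, but the resulting codes match the `UsageBand` values seen later in this notebook (High is 1, Low is 2, Medium is 3).

```python
import pandas as pd

usage = pd.Series(["Low", "High", None, "Medium"])

# Same conversion as above: string column -> ordered category
as_cat = usage.astype("category").cat.as_ordered()
print(list(as_cat.cat.categories))   # categories are sorted alphabetically

codes = pd.Categorical(as_cat).codes
shifted = codes + 1

print(list(codes))     # the missing value is coded as -1
print(list(shifted))   # after +1 the missing value becomes 0
```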

In [17]:
df_temp.isnull().sum()

SalesID                     0
SalePrice                   0
MachineID                   0
ModelID                     0
datasource                  0
auctioneerID                0
YearMade                    0
MachineHoursCurrentMeter    0
UsageBand                   0
fiModelDesc                 0
fiBaseModel                 0
fiSecondaryDesc             0
fiModelSeries               0
fiModelDescriptor           0
ProductSize                 0
fiProductClassDesc          0
state                       0
ProductGroup                0
ProductGroupDesc            0
Drive_System                0
Enclosure                   0
Forks                       0
Pad_Type                    0
Ride_Control                0
Stick                       0
Transmission                0
Turbocharged                0
Blade_Extension             0
Blade_Width                 0
Enclosure_Type              0
Engine_Horsepower           0
Hydraulics                  0
Pushblock                   0
...

In [18]:
df_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 55 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SalesID                   412698 non-null  int64  
 1   SalePrice                 412698 non-null  float64
 2   MachineID                 412698 non-null  int64  
 3   ModelID                   412698 non-null  int64  
 4   datasource                412698 non-null  int64  
 5   auctioneerID              412698 non-null  float64
 6   YearMade                  412698 non-null  int64  
 7   MachineHoursCurrentMeter  412698 non-null  float64
 8   UsageBand                 412698 non-null  int8   
 9   fiModelDesc               412698 non-null  int16  
 10  fiBaseModel               412698 non-null  int16  
 11  fiSecondaryDesc           412698 non-null  int16  
 12  fiModelSeries             412698 non-null  int8   
 ...

In [19]:
%%time
from catboost import CatBoostRegressor

# Instantiate model
model = CatBoostRegressor()
# Fit model
model.fit(df_temp.drop("SalePrice", axis=1),
          df_temp["SalePrice"])

Learning rate set to 0.106035
0:	learn: 21946.9445013	total: 141ms	remaining: 2m 21s
1:	learn: 20856.3289472	total: 224ms	remaining: 1m 51s
2:	learn: 20004.7671552	total: 307ms	remaining: 1m 41s
3:	learn: 19237.3101062	total: 391ms	remaining: 1m 37s
4:	learn: 18591.1565892	total: 473ms	remaining: 1m 34s
5:	learn: 18002.8360752	total: 554ms	remaining: 1m 31s
6:	learn: 17413.3995110	total: 628ms	remaining: 1m 29s
7:	learn: 16969.0149462	total: 700ms	remaining: 1m 26s
8:	learn: 16554.1588727	total: 767ms	remaining: 1m 24s
9:	learn: 16095.8726827	total: 832ms	remaining: 1m 22s
10:	learn: 15759.9617000	total: 902ms	remaining: 1m 21s
11:	learn: 15469.5520581	total: 960ms	remaining: 1m 19s
12:	learn: 15215.6857889	total: 1.02s	remaining: 1m 17s
13:	learn: 14966.8471933	total: 1.09s	remaining: 1m 16s
14:	learn: 14676.5882374	total: 1.15s	remaining: 1m 15s
15:	learn: 14482.5452913	total: 1.21s	remaining: 1m 14s
16:	learn: 14284.9433506	total: 1.27s	remaining: 1m 13s
...

<catboost.core.CatBoostRegressor at 0x7fb14c7a3a30>

In [20]:
# R^2 on the same data the model was fitted on, so this is an optimistic estimate
model.score(df_temp.drop("SalePrice", axis=1), df_temp["SalePrice"])

0.8990335730834754

In [21]:
df_temp.saleyear.value_counts()

2009    43849
2008    39767
2011    35197
2010    33390
2007    32208
2006    21685
2005    20463
2004    19879
2001    17594
2000    17415
2002    17246
2003    15254
1998    13046
1999    12793
2012    11573
1997     9785
1996     8829
1995     8530
1994     7929
1993     6303
1992     5519
1991     5109
1989     4806
1990     4529
Name: saleyear, dtype: int64

#### Split data into training and validation sets

In [22]:
# Split by sale year: 2012 sales form the validation set, earlier sales the training set
df_val = df_temp[df_temp["saleyear"]==2012]
df_train = df_temp[df_temp["saleyear"]!=2012]

len(df_val), len(df_train)

(11573, 401125)
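
Because the problem is about predicting future prices, the split is made by time rather than at random, so the model is never validated on sales that precede its training data. The toy sketch below uses made-up rows and a `val_year` variable introduced only for illustration. It uses `<` instead of `!=`, which gives the same result here because 2012 is the latest year in the data.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "saleyear": [2009, 2010, 2011, 2012, 2012],
    "SalePrice": [66000.0, 57000.0, 10000.0, 38500.0, 11000.0],
})

val_year = 2012  # hypothetical name for the cut-off year
train = toy[toy["saleyear"] < val_year]
val = toy[toy["saleyear"] == val_year]

# Sanity check: every training sale precedes every validation sale
assert train["saleyear"].max() < val["saleyear"].min()
print(len(val), len(train))
```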

In [23]:
# Split into X (features) and y (target) sets
X_train, y_train = df_train.drop("SalePrice", axis = 1), df_train["SalePrice"]
X_val, y_val = df_val.drop("SalePrice", axis = 1), df_val["SalePrice"] 

#### Evaluation metrics

In [24]:
from sklearn.metrics import r2_score, mean_squared_log_error

def rmsle(y_true, y_pred):
    """Calculate the Root Mean Squared Log Error."""
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

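def showscores(model):
    """Return RMSLE and R^2 for `model` on the training and validation sets."""
    # RMSLE is undefined for negative values, so take absolute values of the predictions
    train_preds = np.abs(model.predict(X_train))
    valid_preds = np.abs(model.predict(X_val))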
    scores = {"Training RMSLE": rmsle(y_train, train_preds),
              "Valid RMSLE": rmsle(y_val, valid_preds),
              "Training r2_score": r2_score(y_train, train_preds),
              "Validr 2_score": r2_score(y_val, valid_preds)}
    
    return scores

In [25]:
%%time
# Instantiate model
model = CatBoostRegressor()

# Fit the model
model.fit(X_train, y_train)

Learning rate set to 0.10556
0:	learn: 21832.5536188	total: 91.8ms	remaining: 1m 31s
1:	learn: 20763.3852283	total: 175ms	remaining: 1m 27s
2:	learn: 19897.2005121	total: 262ms	remaining: 1m 27s
3:	learn: 19083.3432370	total: 342ms	remaining: 1m 25s
4:	learn: 18444.1617210	total: 436ms	remaining: 1m 26s
5:	learn: 17870.1436627	total: 525ms	remaining: 1m 26s
6:	learn: 17309.2436353	total: 607ms	remaining: 1m 26s
7:	learn: 16863.8964345	total: 692ms	remaining: 1m 25s
8:	learn: 16435.2672050	total: 787ms	remaining: 1m 26s
9:	learn: 16077.7164201	total: 870ms	remaining: 1m 26s
10:	learn: 15736.3804948	total: 946ms	remaining: 1m 25s
11:	learn: 15445.2744325	total: 1.03s	remaining: 1m 25s
12:	learn: 15183.4925086	total: 1.1s	remaining: 1m 23s
13:	learn: 14952.0888560	total: 1.19s	remaining: 1m 23s
14:	learn: 14729.5230658	total: 1.25s	remaining: 1m 22s
15:	learn: 14511.9147581	total: 1.34s	remaining: 1m 22s
16:	learn: 14294.3938734	total: 1.42s	remaining: 1m 21s
...

<catboost.core.CatBoostRegressor at 0x7fb111f432e0>

In [26]:
showscores(model)

{'Training RMSLE': 0.2411435284413031,
 'Valid RMSLE': 0.26078551374578773,
 'Training r2_score': 0.89960093842323,
 'Validr 2_score': 0.8776601776961745}

#### Hyperparameter tuning using RandomizedSearchCV

In [27]:
from sklearn.model_selection import RandomizedSearchCV

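# Candidate values for the search: 5 x 5 x 5 = 125 possible combinations
cat_grid = {"depth": [6, 8, 10, 12, 14],
            "learning_rate": [0.01, 0.05, 0.1, 0.15, 0.2],
            "iterations": [30, 50, 100, 200, 300]}

# Sample 100 of those combinations, each evaluated with 5-fold cross-validation
cat_model = RandomizedSearchCV(CatBoostRegressor(),
                               param_distributions=cat_grid,
                               n_iter=100,
                               cv=5,
                               verbose=True)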

cat_model.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
0:	learn: 22296.7129180	total: 292ms	remaining: 58.1s
1:	learn: 21605.4637664	total: 578ms	remaining: 57.2s
2:	learn: 20948.9391505	total: 850ms	remaining: 55.8s
3:	learn: 20325.7284632	total: 1.11s	remaining: 54.3s
4:	learn: 19741.5960811	total: 1.36s	remaining: 53s
5:	learn: 19200.3226639	total: 1.63s	remaining: 52.7s
6:	learn: 18651.5441879	total: 1.88s	remaining: 52s
7:	learn: 18179.4077690	total: 2.13s	remaining: 51.3s
8:	learn: 17720.4186986	total: 2.38s	remaining: 50.6s
9:	learn: 17296.6242667	total: 2.63s	remaining: 50s
10:	learn: 16885.3776727	total: 2.88s	remaining: 49.4s
11:	learn: 16514.7665320	total: 3.13s	remaining: 49s
12:	learn: 16156.1270774	total: 3.38s	remaining: 48.7s
13:	learn: 15824.2544159	total: 3.65s	remaining: 48.4s
14:	learn: 15518.1256627	total: 3.89s	remaining: 48s
15:	learn: 15207.2929079	total: 4.14s	remaining: 47.6s
16:	learn: 14921.2347324	total: 4.38s	remaining: 47.1s
...

RandomizedSearchCV(cv=5,
                   estimator=<catboost.core.CatBoostRegressor object at 0x7fb111eacb80>,
                   n_iter=100,
                   param_distributions={'depth': [6, 8, 10, 12, 14],
                                        'iterations': [30, 50, 100, 200, 300],
                                        'learning_rate': [0.01, 0.05, 0.1, 0.15,
                                                          0.2]},
                   verbose=True)
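
No `scoring` argument is passed above, so the search ranks candidates with the estimator's own `score` method rather than with RMSLE. The sketch below shows one way a search could be pointed at the competition metric with `make_scorer`. It is an illustration under assumptions: the data is synthetic, and `DecisionTreeRegressor` with a tiny grid stands in for `CatBoostRegressor` so that the example runs quickly.

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

def rmsle(y_true, y_pred):
    # Clip at zero so the logarithm is always defined
    y_pred = np.clip(y_pred, 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# greater_is_better=False makes scikit-learn negate the value, so that
# "higher is better" still holds inside the search
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

# Synthetic stand-in data: positive "prices" driven mostly by the first feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 1000 + 500 * X[:, 0] + rng.normal(0, 50, size=200)

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),   # stand-in for CatBoostRegressor
    param_distributions={"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]},
    n_iter=5,
    cv=3,
    scoring=rmsle_scorer,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)   # flip the sign back to get a positive RMSLE
```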

In [28]:
cat_model.best_params_

{'learning_rate': 0.2, 'iterations': 300, 'depth': 12}

In [29]:
showscores(cat_model)

{'Training RMSLE': 0.19663254328820254,
 'Valid RMSLE': 0.25246939764222354,
 'Training r2_score': 0.9342902245542312,
 'Validr 2_score': 0.8809995050922712}

### Make Predictions on test data

In [30]:
df_test = pd.read_csv("data/Test.csv",
                      parse_dates=["saledate"])
df_test.head()

Unnamed: 0,SalesID,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,fiModelDesc,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1227829,1006309,3168,121,3,1999,3688.0,Low,2012-05-03,580G,...,,,,,,,,,,
1,1227844,1022817,7271,121,3,1000,28555.0,High,2012-05-10,936,...,,,,,,,,,Standard,Conventional
2,1227847,1031560,22805,121,3,2004,6038.0,Medium,2012-05-10,EC210BLC,...,None or Unspecified,"9' 6""",Manual,None or Unspecified,Double,,,,,
3,1227848,56204,1269,121,3,2006,8940.0,High,2012-05-10,330CL,...,None or Unspecified,None or Unspecified,Manual,Yes,Triple,,,,,
4,1227863,1053887,22312,121,3,2005,2286.0,Low,2012-05-10,650K,...,,,,,,None or Unspecified,PAT,None or Unspecified,,


In [31]:
df_temp.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,...,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,saleyear,salemonth,saleday
0,1139246,66000.0,999089,3157,121,3.0,2004,68.0,2,963,...,0,0,0,0,0,4,2,2006,11,16
1,1139248,57000.0,117657,77,121,3.0,1996,4640.0,2,1745,...,0,0,0,0,0,4,2,2004,3,26
2,1139249,10000.0,434808,7009,121,3.0,2001,2838.0,1,336,...,0,0,0,0,0,0,0,2004,2,26
3,1139251,38500.0,1026470,332,121,3.0,2001,3486.0,1,3716,...,0,0,0,0,0,0,0,2011,5,19
4,1139253,11000.0,1057373,17311,121,3.0,2007,722.0,3,4261,...,0,0,0,0,0,0,0,2009,7,23


#### Preprocess test data

In [32]:
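def preprocess(df):
    """
    Apply the same steps used on the training data: add date features, fill
    missing numeric values with the median, and encode other columns as integers.
    Modifies `df` in place. Medians and category codes are computed from `df` itself.
    """
    # Add datetime features, then drop the original column
    df["saleyear"] = df.saledate.dt.year
    df["salemonth"] = df.saledate.dt.month
    df["saleday"] = df.saledate.dt.day

    df.drop("saledate", axis=1, inplace=True)

    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            # Fill missing numeric values with the column median
            if pd.isnull(content).sum():
                df[label] = content.fillna(content.median())
        if not pd.api.types.is_numeric_dtype(content):
            # Encode categories as integers; +1 so that missing values (-1) become 0
            df[label] = pd.Categorical(content).codes + 1

    return df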
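
One caveat with encoding the test set on its own: integer codes depend on which values happen to appear in each frame, so the same string can receive different codes in the training and test data. The toy sketch below (made-up values, not the real columns) contrasts independent encoding with reusing the training categories. The shared-category approach is an assumption about how the mismatch could be avoided, not what this notebook does.

```python
import pandas as pd

# Made-up values, for illustration only
train = pd.DataFrame({"state": ["Texas", "Alabama", "Florida", None]})
test = pd.DataFrame({"state": ["Florida", "Texas", "Ohio"]})

# Independent encoding: each frame builds its own category list
train_codes = pd.Categorical(train["state"]).codes + 1
test_independent = pd.Categorical(test["state"]).codes + 1

# Shared encoding: reuse the categories learned from the training frame
train_categories = pd.Categorical(train["state"]).categories
test_aligned = pd.Categorical(test["state"], categories=train_categories).codes + 1

print(list(train_codes))        # Florida is 2 in the training frame
print(list(test_independent))   # Florida becomes 1 when encoded independently
print(list(test_aligned))       # Florida is 2 again; unseen "Ohio" maps to 0
```

With shared categories, values never seen during training fall back to 0, the same code used for missing values.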

In [33]:
for label, content in df_test.items():
    if pd.api.types.is_string_dtype(content):
        df_test[label] = content.astype("category").cat.as_ordered()

In [34]:
preprocess(df_test)

Unnamed: 0,SalesID,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,fiBaseModel,...,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,saleyear,salemonth,saleday
0,1227829,1006309,3168,121,3,1999,3688.0,2,499,180,...,0,0,0,0,0,0,0,2012,5,3
1,1227844,1022817,7271,121,3,1000,28555.0,1,831,292,...,0,0,0,0,0,3,2,2012,5,10
2,1227847,1031560,22805,121,3,2004,6038.0,3,1177,404,...,1,1,0,0,0,0,0,2012,5,10
3,1227848,56204,1269,121,3,2006,8940.0,1,287,113,...,2,2,0,0,0,0,0,2012,5,10
4,1227863,1053887,22312,121,3,2005,2286.0,2,566,196,...,0,0,1,4,5,0,0,2012,5,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12452,6643171,2558317,21450,149,2,2008,3525.0,0,713,235,...,1,1,0,0,0,0,0,2012,10,24
12453,6643173,2558332,21434,149,2,2005,3525.0,0,186,80,...,1,1,0,0,0,0,0,2012,10,24
12454,6643184,2558342,21437,149,2,1000,3525.0,0,325,123,...,1,1,0,0,0,0,0,2012,10,24
12455,6643186,2558343,21437,149,2,2006,3525.0,0,325,123,...,1,1,0,0,0,0,0,2012,10,24


In [35]:
# Rebuild the model with the best hyperparameters found by the randomized search
ideal_model = CatBoostRegressor(learning_rate=0.2, iterations=300, depth=12)

ideal_model.fit(X_train, y_train)

0:	learn: 20025.9346019	total: 475ms	remaining: 2m 21s
1:	learn: 17745.9245836	total: 1.07s	remaining: 2m 38s
2:	learn: 15928.0901716	total: 1.51s	remaining: 2m 29s
3:	learn: 14691.0948718	total: 1.86s	remaining: 2m 17s
4:	learn: 13582.2983639	total: 2.22s	remaining: 2m 11s
5:	learn: 12691.4316858	total: 2.57s	remaining: 2m 5s
6:	learn: 12096.8735044	total: 2.92s	remaining: 2m 2s
7:	learn: 11623.4757582	total: 3.29s	remaining: 2m
8:	learn: 11183.8055250	total: 3.67s	remaining: 1m 58s
9:	learn: 10860.9301251	total: 4.03s	remaining: 1m 56s
10:	learn: 10579.6695146	total: 4.4s	remaining: 1m 55s
11:	learn: 10391.8847359	total: 4.8s	remaining: 1m 55s
12:	learn: 10140.0890523	total: 5.16s	remaining: 1m 53s
13:	learn: 9961.8188842	total: 5.5s	remaining: 1m 52s
14:	learn: 9797.6376892	total: 5.85s	remaining: 1m 51s
15:	learn: 9662.0861261	total: 6.19s	remaining: 1m 49s
16:	learn: 9521.0080943	total: 6.56s	remaining: 1m 49s
17:	learn: 9401.7339696	total: 6.91s	remaining: 1m 48s
...

<catboost.core.CatBoostRegressor at 0x7fb111be4b80>

In [36]:
test_preds = ideal_model.predict(df_test)

In [37]:
showscores(ideal_model)

{'Training RMSLE': 0.19663254328820254,
 'Valid RMSLE': 0.25246939764222354,
 'Training r2_score': 0.9342902245542312,
 'Validr 2_score': 0.8809995050922712}

In [38]:
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"]
df_preds["SalesPrice"] = test_preds
df_preds.head()

Unnamed: 0,SalesID,SalesPrice
0,1227829,22318.34091
1,1227844,17916.558229
2,1227847,48233.653416
3,1227848,63445.848931
4,1227863,51137.538229


# END OF PROJECT