Predicting the sale price of bulldozers using ML
This notebook works through predicting the auction sale price of bulldozers with machine learning.

# 1. Problem definition
Predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration.

# 2. Data
The data comes from the Kaggle "Blue Book for Bulldozers" competition: https://www.kaggle.com/c/bluebook-for-bulldozers/data

There are 3 main datasets:

- Train.csv is the training set, which contains data through the end of 2011.
- Valid.csv is the validation set, which contains data from January 1, 2012 to April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
- Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 to November 2012. Your score on the test set determines your final rank for the competition.

# 3. Evaluation
RMSLE (root mean squared log error) between the actual and predicted auction prices.
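
As a rough illustration of the metric, the sketch below computes RMSLE with NumPy. It assumes the `log(1 + x)` form (the same convention scikit-learn's `mean_squared_log_error` uses) and non-negative prices; the helper name `rmsle` is made up for this example and is not part of the competition code.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error, assuming the log(1 + x) form.

    Both inputs must be non-negative. `rmsle` is a hypothetical helper name.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Made-up prices: a perfect prediction scores 0
perfect_score = rmsle([10000, 25000], [10000, 25000])
# Predicting twice the actual price gives roughly log(2) per row
double_score = rmsle([10000, 25000], [20000, 50000])
print(perfect_score, double_score)
```

Because the error is measured on logged prices, being off by a given ratio costs about the same on a cheap machine as on an expensive one.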

# Importing essential tools

In [None]:
# Regular EDA and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# preprocessor
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Models from Scikit-Learn
from sklearn.ensemble import RandomForestRegressor

# Model Evaluations
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_squared_log_error, mean_absolute_error, make_scorer

# Pipeline
from sklearn.pipeline import Pipeline

from datetime import datetime

# On Matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')

# Load data
Parsing saledate as a datetime column

In [None]:
# combined dataset of training and validation set
df = pd.read_csv("../input/blue-book-for-bulldozer/Train/Train.csv",parse_dates=['saledate'],low_memory=False) 
# test set
test_df = pd.read_csv("../input/blue-book-for-bulldozer/Test.csv",parse_dates=['saledate'],low_memory=False)
# sorting df according to the saledate
df.sort_values(by='saledate',inplace=True)

In [None]:
df.head().T

In [None]:
df.info() # most of the features have object dtype

In [None]:
test_df.info()

In [None]:
# shape of the dataframe
df.shape

In [None]:
test_df.shape

# Preprocessing

In [None]:
df.isna().sum()

In [None]:
test_df.isna().sum()

# Visualize missing data

In [None]:
# visualizing missing entries
df_missing_percentage = ((df.isna().sum()/df.shape[0])*100)
test_df_missing_percentage = ((test_df.isna().sum()/test_df.shape[0])*100)

In [None]:
pd.DataFrame(df_missing_percentage,columns=['missing%']).sort_values(by='missing%').plot(kind='barh',figsize=(7,15));
plt.xticks(fontsize = 15);
plt.yticks(fontsize = 10);

In [None]:
pd.DataFrame(test_df_missing_percentage,columns=['missing%']).sort_values(by='missing%').plot(kind='barh',figsize=(7,15));
plt.xticks(fontsize = 15);
plt.yticks(fontsize = 10);

### Adding Missing Indicators for Numerical and Categorical columns
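A missing indicator is an extra boolean column that records where a value was absent, so that information survives after the gap is filled. The sketch below shows the idea on a toy frame; the column names and values are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one numerical and one categorical column (illustrative values)
toy = pd.DataFrame({
    "MachineHoursCurrentMeter": [68.0, np.nan, 2838.0],
    "UsageBand": ["Low", None, "High"],
})

# Add a "<column>_ismissing" flag for every column that contains NaN
for label in list(toy.columns):
    if toy[label].isna().any():
        toy[f"{label}_ismissing"] = toy[label].isna()

print(toy)
```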

In [None]:
# First, concatenate all data points so that missing indicators can be added in one pass
# test_df has no SalePrice column, so its rows get NaN in SalePrice when concatenated with df
Concat = pd.concat((df,test_df),axis = 0).reset_index(drop=True)

# Converting all columns with object dtype to category dtype
for label,content in Concat.items() :
    if pd.api.types.is_object_dtype(content):
        Concat[label] = content.astype('category')
        
# Enriching features
Concat['year'] = Concat.saledate.dt.year
Concat['month']= Concat.saledate.dt.month
Concat['day']= Concat.saledate.dt.day

In [None]:
cat=[] # list for storing all columns with 'category' dtype
cat_missing = [] # list for storing columns with 'category' dtype and having missing values
num_missing = [] # list for storing columns with 'numerical' dtype and having missing values

In [None]:
for label,content in Concat.items():

    if label == 'SalePrice': # the target is NaN only for test rows, so it gets no indicator
        continue

    if pd.api.types.is_numeric_dtype(content): # checking for numerical features
        if content.isna().sum() > 0: # checking if the feature has any missing values
            Concat[f'{label}_ismissing'] = content.isna()
            num_missing.append(label)

    # isinstance check replaces is_categorical_dtype, which is deprecated in recent pandas
    elif isinstance(content.dtype, pd.CategoricalDtype): # checking for categorical features
        cat.append(label)
        if content.isna().sum() > 0: # checking if the feature has any missing values
            Concat[f'{label}_ismissing'] = content.isna()
            cat_missing.append(label)

cat_not_missing = list(set(cat) - set(cat_missing))

### Filling categorical values
Another reason to combine all data points into a single dataset is that the codes assigned to categorical data then cover every category value that appears in any of the sets.
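
The toy example below sketches both points: pandas gives missing categorical values the code `-1`, and encoding two sets separately can give the same label different codes. The series and their values are made up for illustration.

```python
import pandas as pd

# Missing values get code -1; adding 1 shifts them to 0
s = pd.Series(["Low", None, "High", "Medium"], dtype="category")
codes = s.cat.codes      # categories are sorted: High=0, Low=1, Medium=2
shifted = codes + 1      # missing -> 0, High=1, Low=2, Medium=3

# Encoding two sets separately: "Low" gets code 1 in one and 0 in the other
train = pd.Series(["Low", "High"])
test = pd.Series(["Medium", "Low"])
separate_train = train.astype("category").cat.codes.tolist()
separate_test = test.astype("category").cat.codes.tolist()

# Encoding the combined data keeps one consistent mapping
combined = pd.concat([train, test], ignore_index=True).astype("category")
combined_codes = combined.cat.codes.tolist()

print(codes.tolist(), shifted.tolist())
print(separate_train, separate_test, combined_codes)
```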

In [None]:
# pandas assigns code `-1` to missing categorical values, so add 1 to shift missing values to 0
Concat[cat_missing] = Concat[cat_missing].apply(lambda i : i.cat.codes+1)

# For features with no missing values, simply assigning code
Concat[cat_not_missing] = Concat[cat_not_missing].apply(lambda i : i.cat.codes)

In [None]:
(Concat.isna().sum() != 0).sum() # number of columns that still contain NaN; one of them is SalePrice, which is not imputed

### Filling numerical values
Missing numerical values are filled with the median.
To avoid data leakage, we first separate the training, validation and test sets, so that the medians are computed from the training set only.
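
A minimal sketch of leak-free imputation on toy arrays (the values are made up): the imputer learns the median from the training data only and reuses it for the validation data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature arrays (illustrative values)
train = np.array([[1.0], [3.0], [np.nan], [5.0]])
valid = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # median is learned from train only
valid_filled = imputer.transform(valid)      # the train median is reused here

print(imputer.statistics_)    # median learned from the training data
print(valid_filled.ravel())
```

Calling `fit_transform` on the validation data instead would compute a median from rows the model is later scored on.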

In [None]:
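valid_start = pd.Timestamp(year=2012, month=1, day=1)
valid_end = pd.Timestamp(year=2012, month=4, day=30)

# training set: everything sold before 2012
train_df = Concat.loc[Concat.saledate < valid_start, :].drop('saledate', axis=1)

# validation set: January 1, 2012 to April 30, 2012 (both ends inclusive)
valid_df = Concat.loc[Concat.saledate.between(valid_start, valid_end), :].drop('saledate', axis=1)

# test set: strictly after April 30, 2012, so it does not overlap the validation set
test_df = Concat.loc[Concat.saledate > valid_end, :].drop(['SalePrice','saledate'], axis=1)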

In [None]:
train_df.shape

In [None]:
test_df.shape

In [None]:
valid_df.shape

In [None]:
train_df[num_missing].isna().sum()

In [None]:
valid_df[num_missing].isna().sum()

In [None]:
num_imputer = SimpleImputer(strategy='median')
transformer = ColumnTransformer(transformers=[('num_missing',num_imputer,train_df.columns)],remainder='passthrough',)

train_df_filled = transformer.fit_transform(train_df) # fitting on training data 
valid_df_filled = transformer.transform(valid_df) # transforming the validation set with medians learned on the training set to avoid data leakage

train_df_filled = pd.DataFrame(train_df_filled,columns=train_df.columns)
valid_df_filled = pd.DataFrame(valid_df_filled,columns=valid_df.columns)

In [None]:
train_df_filled

In [None]:
train_df_filled[num_missing].isna().sum()

In [None]:
valid_df_filled[num_missing].isna().sum()

### Modelling

In [None]:
# separating features and labels
X_train_filled,y_train_filled = train_df_filled.drop(['SalePrice'],axis=1),train_df_filled.SalePrice 
X_valid_filled,y_valid_filled = valid_df_filled.drop(['SalePrice'],axis=1),valid_df_filled.SalePrice

# unfilled features; labels come from the same frames so the indices line up
X_train,y_train = train_df.drop(['SalePrice'],axis=1),train_df.SalePrice
X_valid,y_valid = valid_df.drop(['SalePrice'],axis=1),valid_df.SalePrice