# Blue Book for Bulldozers

## Business Problem

### Description

The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration.  
The data is sourced from auction result postings and includes information on usage and equipment configurations.

### Problem definition

Predict the auction sale price of a piece of heavy equipment in order to create a "blue book" for bulldozers.

### Data

The data is downloaded from the Kaggle Bluebook for Bulldozers competition: https://www.kaggle.com/c/bluebook-for-bulldozers/data

There are 3 main datasets:

- Train.csv is the training set, which contains data through the end of 2011.
- Valid.csv is the validation set, which contains data from January 1, 2012 to April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
- Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 to November 2012. Your score on the test set determines your final rank for the competition.

### Evaluation

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

For more on the evaluation of this project check: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation

Note: The goal for most regression evaluation metrics is to minimize the error. For example, our goal for this project will be to build a machine learning model which minimises RMSLE.
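
For reference, RMSLE is commonly written as shown below. This is a sketch of the usual definition, assuming the log(1 + x) form (the one scikit-learn's `mean_squared_log_error` uses); here $\hat{y}_i$ is the predicted price, $y_i$ the actual price, and $n$ the number of sales. The evaluation page linked above remains the reference for the exact competition definition.

```latex
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2}
```

Because the error is measured on log prices, it reflects relative rather than absolute differences: being off by $1,000 matters more on a $10,000 machine than on a $100,000 one.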

### Features

Kaggle provides a data dictionary detailing all of the features of the dataset. You can view this data dictionary on Google Sheets: https://docs.google.com/spreadsheets/d/18ly-bLR8sbDJLITkWG7ozKm8l3RyieQ2Fpgix-beSYI/edit?usp=sharing

## Importing the data and preparing it for modelling

- Import the libraries needed for loading and exploring the data: pandas (as `pd`), NumPy (as `np`), Matplotlib's pyplot (as `plt`), and scikit-learn (`sklearn`).

- Now that our tools for data analysis are ready, we can import the data and start to explore it.
- Import the training and validation set.
- First import it without `parse_dates` and check the dtype of `saledate`.

### Parsing dates

- When working with time series data, it's a good idea to make sure any date data is in the format of a datetime object (a Python data type which encodes specific information about dates).
- Import the data again, this time with `parse_dates`, and check the dtype of `saledate`.
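
The effect of `parse_dates` can be sketched on a tiny in-memory CSV. The rows below are made up and stand in for the competition file; `SalesID` and the ISO date format are assumptions chosen to keep the example self-contained.

```python
import io

import pandas as pd

# Toy CSV standing in for the competition file (made-up values).
csv_text = """SalesID,SalePrice,saledate
1,66000,2006-11-16
2,57000,2004-03-26
3,10000,2011-05-19
"""

# Without parse_dates, saledate is read as plain text.
df_raw = pd.read_csv(io.StringIO(csv_text))
print(df_raw["saledate"].dtype)

# With parse_dates, saledate becomes a datetime64 column.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])
print(df["saledate"].dtype)
```

Only the parsed version exposes the `.dt` accessor, which is what later steps rely on to pull out the year, month, and day.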

### Sort DataFrame by saledate

- As we're working on a time series problem and trying to predict future examples given past examples, it makes sense to sort our data by date.

- Sort the DataFrame in date order.
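
A minimal sketch of the sort, using a three-row toy DataFrame (the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "SalePrice": [66000, 57000, 10000],
    "saledate": pd.to_datetime(["2006-11-16", "2004-03-26", "2011-05-19"]),
})

# Sort oldest sale first; sort_values returns a new DataFrame by default.
df_sorted = df.sort_values(by="saledate", ascending=True)
print(df_sorted)
```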

### Make a copy of the original DataFrame

- Since we're going to be manipulating the data, we'll make a copy of the original DataFrame and perform our changes on the copy.
- This keeps the original DataFrame intact.

### Add datetime parameters for saledate column

- Adding datetime parameters lets us enrich our dataset with as much information as possible.
- Because we imported the data using `read_csv()` and asked pandas to parse the dates using `parse_dates=["saledate"]`, we can now access the different datetime attributes of the `saledate` column.
- Add datetime parameters for `saledate`.
- Drop the original `saledate` column.
- Check the different values of different columns.
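
The datetime attributes can be pulled out with the pandas `.dt` accessor. In this sketch the new column names (`saleYear`, `saleMonth`, and so on) are assumed for illustration, and the data is a made-up two-row sample.

```python
import pandas as pd

df_tmp = pd.DataFrame({
    "SalePrice": [66000, 57000],
    "saledate": pd.to_datetime(["2006-11-16", "2004-03-26"]),
})

# Derive numeric features from the datetime column via the .dt accessor.
df_tmp["saleYear"] = df_tmp["saledate"].dt.year
df_tmp["saleMonth"] = df_tmp["saledate"].dt.month
df_tmp["saleDay"] = df_tmp["saledate"].dt.day
df_tmp["saleDayOfWeek"] = df_tmp["saledate"].dt.dayofweek  # Monday=0
df_tmp["saleDayOfYear"] = df_tmp["saledate"].dt.dayofyear

# The raw datetime column is no longer needed once the parts are extracted.
df_tmp = df_tmp.drop(columns="saledate")
print(df_tmp)
```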

### Modelling

- Now start to do some model-driven EDA.

- Fitting a model straight away won't work, since we've got missing numbers and categories.
- Check for missing categories and different datatypes.
- Check for missing values.
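
Both checks are one-liners in pandas. The sketch below uses a made-up sample; the column names are illustrative.

```python
import numpy as np
import pandas as pd

df_tmp = pd.DataFrame({
    "SalePrice": [66000, 57000, 10000],
    "MachineHoursCurrentMeter": [68.0, np.nan, 2838.0],
    "UsageBand": ["Low", None, "High"],
})

# Column dtypes: text columns cannot be fed to a scikit-learn model as-is.
print(df_tmp.dtypes)

# Count of missing values per column.
missing_counts = df_tmp.isna().sum()
print(missing_counts)
```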

### Convert string to categories

- One way we can turn all of our data into numbers is by converting it into pandas categories.
- We can check the different datatypes compatible with pandas here: https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#data-types-related-functionality
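
A minimal sketch of the conversion, assuming that every non-numeric column holds text and should become a category. The sample data is made up.

```python
import pandas as pd

df_tmp = pd.DataFrame({
    "SalePrice": [66000, 57000, 10000],
    "UsageBand": ["Low", None, "High"],
    "state": ["Alabama", "Texas", "Alabama"],
})

# Convert every non-numeric column to an ordered pandas category.
for label in list(df_tmp.columns):
    if not pd.api.types.is_numeric_dtype(df_tmp[label]):
        df_tmp[label] = df_tmp[label].astype("category").cat.as_ordered()

# Each category has an integer code behind it; missing values get -1.
print(df_tmp["state"].cat.categories)
print(df_tmp["UsageBand"].cat.codes)
```

The integer codes are what eventually get passed to the model; the labels stay available through `.cat.categories`.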

### Save Processed Data

- Save the preprocessed data.
- Import the preprocessed data.
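
Saving and reloading can be sketched with `to_csv()` and `read_csv()`. The file name `train_tmp.csv` and the temporary directory are hypothetical, chosen so the example cleans up after itself.

```python
import os
import tempfile

import pandas as pd

df_tmp = pd.DataFrame({
    "SalePrice": [66000, 57000],
    "state": pd.Categorical(["Alabama", "Texas"]),
})

# Hypothetical file name inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_tmp.csv")
    df_tmp.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    df_loaded = pd.read_csv(path)

# CSV stores plain text, so the category dtype is not preserved.
print(df_loaded.dtypes)
```

Since the category dtype does not survive the round trip, the conversion to categories has to be repeated after reloading.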

### Fill missing values

- Fill the numerical values first.
- Check which numeric columns have null values.
- Add a binary column which tells us whether the data was missing or not.
- Fill missing numeric values with the median, since it's more robust to outliers than the mean.
- Check whether there are any null values left.
- Check how many examples were missing.
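
The numeric fill can be sketched as a loop over the columns. The `_is_missing` suffix is an assumed naming convention, and the data is a made-up sample.

```python
import numpy as np
import pandas as pd

df_tmp = pd.DataFrame({
    "SalePrice": [66000, 57000, 10000, 38500],
    "MachineHoursCurrentMeter": [68.0, np.nan, 2838.0, 3486.0],
})

for label in list(df_tmp.columns):
    content = df_tmp[label]
    if pd.api.types.is_numeric_dtype(content) and content.isna().any():
        # Record which rows were missing before the gap is filled.
        df_tmp[label + "_is_missing"] = content.isna()
        # Fill with the column median.
        df_tmp[label] = content.fillna(content.median())

print(df_tmp.isna().sum())
print(df_tmp["MachineHoursCurrentMeter_is_missing"].value_counts())
```

The indicator column keeps the information that a value was absent, which would otherwise be lost once the median is written in.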

### Filling and turning categorical variables to numbers

- Now we've filled the numeric values, we'll do the same with the categorical values at the same time as turning them into numbers.

- Check the columns which *aren't* numeric.
- Turn categorical variables into numbers.
- Add a binary column to indicate whether the sample had a missing value.
- We add the +1 because pandas encodes missing categories as -1.
- Fit the model.
- Score the model.
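
The encoding step can be sketched as follows; it covers only the conversion to numbers, not the model fit. The `_is_missing` suffix is an assumed naming convention, and the sample is made up.

```python
import pandas as pd

df_tmp = pd.DataFrame({
    "SalePrice": [66000, 57000, 10000],
    "UsageBand": pd.Categorical(["Low", None, "High"]),
})

for label in list(df_tmp.columns):
    content = df_tmp[label]
    if not pd.api.types.is_numeric_dtype(content):
        df_tmp[label + "_is_missing"] = content.isna()
        # Codes start at 0 and missing values are -1, so +1 maps missing to 0.
        df_tmp[label] = pd.Categorical(content).codes + 1

print(df_tmp)
```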

### Splitting data into train/validation sets

- Split the data into training and validation sets.
- Split the data into X & y.
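
A sketch of the split, assuming it is made on the sale year (the data description says the validation set covers sales from 2012) and that a `saleYear` column already exists. The rows are made up.

```python
import pandas as pd

df_tmp = pd.DataFrame({
    "saleYear": [2010, 2011, 2011, 2012, 2012],
    "YearMade": [1996, 2004, 2001, 2007, 1998],
    "SalePrice": [66000, 57000, 10000, 38500, 11000],
})

# Validation = sales from 2012, training = everything earlier.
df_val = df_tmp[df_tmp["saleYear"] == 2012]
df_train = df_tmp[df_tmp["saleYear"] != 2012]

# Features (X) are every column except the target (y).
X_train, y_train = df_train.drop(columns="SalePrice"), df_train["SalePrice"]
X_valid, y_valid = df_val.drop(columns="SalePrice"), df_val["SalePrice"]

print(X_train.shape, y_train.shape, X_valid.shape, y_valid.shape)
```

Splitting by date rather than at random keeps future sales out of the training set, which matches how the model will be used.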

### Building an evaluation function

- Create an evaluation function (the competition uses root mean squared log error).
- Create a function to evaluate our model.
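
A minimal NumPy sketch of such an evaluation function, assuming the log(1 + x) form of the metric. The helper names `rmsle` and `show_scores` are hypothetical, and the prices passed in are made up.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + x)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


def show_scores(y_train, train_preds, y_valid, valid_preds):
    """Collect training and validation RMSLE in one dictionary."""
    return {
        "Training RMSLE": rmsle(y_train, train_preds),
        "Valid RMSLE": rmsle(y_valid, valid_preds),
    }


scores = show_scores(
    y_train=[10000, 20000], train_preds=[10000, 20000],
    y_valid=[10000, 20000], valid_preds=[12000, 18000],
)
print(scores)
```

Reporting the training and validation scores side by side makes overfitting easy to spot: a training error far below the validation error is the usual sign.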

### Testing our model on a subset

- Retraining an entire model would take far too long for us to keep experimenting as fast as we want to.
- So what we'll do is take a sample of the training set and tune the hyperparameters on that before training a larger model.
- Retrain a model on the training data.
- Cutting down the max number of samples each tree can see improves training time.
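
In scikit-learn, the number of rows each tree sees is controlled by the `max_samples` parameter of `RandomForestRegressor`. The sketch below uses synthetic data in place of the preprocessed training set, and the values 20 and 100 are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the preprocessed training data.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(1000, 5))
y_train = 3 * X_train[:, 0] + rng.normal(scale=0.1, size=1000)

# max_samples caps how many rows are drawn to train each tree.
model = RandomForestRegressor(n_estimators=20, max_samples=100, random_state=42)
model.fit(X_train, y_train)

print(model.score(X_train, y_train))
```

Each tree is fitted on 100 rows instead of 1000, so every experiment runs faster at the cost of some accuracy.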

### Hyperparameter tuning with RandomizedSearchCV

- You can increase `n_iter` to try more combinations of hyperparameters, but in our case we'll try 20 and see where it gets us.
- Define the different `RandomForestRegressor` hyperparameters to search over.
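
A sketch of the search on synthetic data. The grid `rf_grid` is hypothetical and its values are illustrative rather than tuned; `n_iter` and `cv` are kept small here so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(42)
X_train = rng.normal(size=(300, 5))
y_train = 3 * X_train[:, 0] + rng.normal(scale=0.1, size=300)

# Hypothetical search space; the values are illustrative, not tuned.
rf_grid = {
    "n_estimators": [10, 20, 40],
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": [2, 6, 10],
    "min_samples_leaf": [1, 3, 5],
    "max_features": [0.5, 1.0, "sqrt"],
    "max_samples": [100],
}

rs_model = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=rf_grid,
    n_iter=5,  # kept small for the toy data; the text above tries 20
    cv=3,
    random_state=42,
)
rs_model.fit(X_train, y_train)
print(rs_model.best_params_)
```

`RandomizedSearchCV` samples `n_iter` combinations from the grid instead of trying all of them, which is why raising `n_iter` trades time for coverage.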

### Make predictions on test data

- Now we've got a trained model, it's time to make predictions on the test data.

- See how the model goes predicting on the test data.

### Preprocessing the data

- The test dataset needs to be in the same format as the training dataset, so we collect the preprocessing steps into a function.
- Add datetime parameters for `saledate`.
- Drop the original `saledate` column.
- Fill numeric rows with the median.
- Turn categorical variables into numbers.
- We add the +1 because pandas encodes missing categories as -1.
- Now that we've got a function for preprocessing data, preprocess the test dataset into the same format as our training dataset.
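
The steps listed above can be sketched as a single function. The name `preprocess_data`, the `_is_missing` suffix, and the date column names are assumptions, and the three test rows are made up.

```python
import numpy as np
import pandas as pd


def preprocess_data(df):
    """Return a copy of df with date parts added, gaps filled, and text encoded."""
    df = df.copy()

    # Add datetime parameters for saledate, then drop the original column.
    df["saleYear"] = df["saledate"].dt.year
    df["saleMonth"] = df["saledate"].dt.month
    df["saleDay"] = df["saledate"].dt.day
    df = df.drop(columns="saledate")

    for label in list(df.columns):
        content = df[label]
        if pd.api.types.is_numeric_dtype(content):
            if content.isna().any():
                df[label + "_is_missing"] = content.isna()
                df[label] = content.fillna(content.median())
        else:
            df[label + "_is_missing"] = content.isna()
            # +1 because pandas encodes missing categories as -1.
            df[label] = pd.Categorical(content).codes + 1
    return df


df_test = pd.DataFrame({
    "saledate": pd.to_datetime(["2012-05-03", "2012-06-10", "2012-07-21"]),
    "MachineHoursCurrentMeter": [120.0, np.nan, 560.0],
    "UsageBand": ["Low", None, "High"],
})
df_test_processed = preprocess_data(df_test)
print(df_test_processed)
```

One caveat with this simple approach: category codes and medians are computed from whichever DataFrame is passed in, so the encoding on the test set is not guaranteed to match the one used for training.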

- We can find how the columns differ using sets.
- Match the test dataset columns to the training dataset.
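
A sketch of the column matching, assuming the only columns absent from the test set are `_is_missing` indicators, which can safely be filled with `False`. The two small DataFrames are made up.

```python
import pandas as pd

X_train = pd.DataFrame({
    "YearMade": [1996, 2004],
    "auctioneerID": [3.0, 1.0],
    "auctioneerID_is_missing": [False, True],
})
df_test = pd.DataFrame({
    "auctioneerID": [2.0, 1.0],
    "YearMade": [2007, 1998],
})

# Columns the model was trained on that the test set lacks.
missing_cols = set(X_train.columns) - set(df_test.columns)
print(missing_cols)

# Assumption: these are indicator columns, so no test value was missing.
for col in missing_cols:
    df_test[col] = False

# Reorder so the columns appear exactly as they did during training.
df_test = df_test[X_train.columns]
print(df_test.columns.tolist())
```

The reordering matters as much as the added column, because scikit-learn expects the features at prediction time to match the ones seen during fitting.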

### Feature Importance

- Now that we've built a model which is able to make predictions, we can look at feature importance.
- Feature importance seeks to figure out which different attributes of the data were most important when it comes to predicting the target variable (`SalePrice`).
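
With a fitted random forest, the importances are available through the `feature_importances_` attribute. The sketch below uses synthetic data whose target depends almost entirely on one column, so the ranking is easy to check; the column names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_train = pd.DataFrame(
    rng.normal(size=(500, 3)),
    columns=["YearMade", "ProductSize", "saleYear"],
)
# Toy target driven almost entirely by the first column.
y_train = 5 * X_train["YearMade"] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=20, random_state=42)
model.fit(X_train, y_train)

# feature_importances_ has one value per column and sums to 1.
importances = (
    pd.DataFrame({
        "features": X_train.columns,
        "feature_importances": model.feature_importances_,
    })
    .sort_values("feature_importances", ascending=False)
    .reset_index(drop=True)
)
print(importances)
```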

- Use `RandomForestRegressor` for the final implementation.
- After running all of the algorithms, `RandomForestRegressor` gives the better result.