## Predicting the Sale Price of Bulldozers using Machine Learning
In this notebook, we're going to go through an example machine learning project with the goal of predicting the sale price of bulldozers.
___

### 1. Problem definition
>How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for?

### 2. Data 
The data is downloaded from the [Kaggle Bluebook for Bulldozers competition](https://www.kaggle.com/c/bluebook-for-bulldozers/data).

There are 3 main datasets:
* Train.csv is the training set, which contains data through the end of 2011.
* Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
* Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

### 3. Evaluation
The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

For more on the evaluation of this project check: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation

**Note:** The goal for most regression evaluation metrics is to minimize the error. For example, our goal for this project will be to build a machine learning model which minimizes RMSLE.
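
To make the metric concrete, here is a minimal sketch of RMSLE written with NumPy. It assumes the common `log(1 + x)` form of the metric (the competition's evaluation page is the authoritative definition), and the function name `rmsle` and the prices are made up for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error between two arrays of non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x), which keeps zero values finite
    squared_log_errors = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(np.mean(squared_log_errors)))

# Made-up prices, for illustration only
actual = [10000, 25000, 60000]
predicted = [12000, 24000, 50000]
print(rmsle(actual, predicted))
```

Because the error is computed on logarithms, it reflects the ratio between predicted and actual prices rather than the absolute difference, so being off by 10% on a cheap machine costs about as much as being off by 10% on an expensive one.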

### 4. Features

Kaggle provides a data dictionary detailing all of the features of the dataset. You can view this data dictionary on Google Sheets: https://docs.google.com/spreadsheets/d/18ly-bLR8sbDJLITkWG7ozKm8l3RyieQ2Fpgix-beSYI/edit?usp=sharing

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

In [None]:
# Import training and validation sets
df = pd.read_csv('TrainAndValid.csv', low_memory=False)
df.head()

>Large datasets may trigger a `DtypeWarning`. By default, pandas parses the file in chunks to keep memory use low, which can lead to mixed type inference within a column. To avoid this warning, we can set `low_memory` to `False` (or specify column types with the `dtype` parameter).

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.columns

In [None]:
fig, ax = plt.subplots()
ax.scatter(df['saledate'][:1000], df['SalePrice'][:1000]);

In [None]:
df.saledate[:1000]

In [None]:
df.saledate.dtype

In [None]:
df.SalePrice.plot.hist()

#### Parsing dates
When we work with time series data, we want to enrich the time & date component as much as possible.

We can do that by telling pandas which of our columns contain dates, using the `parse_dates` parameter.

In [None]:
# Import data again but this time parse dates
df = pd.read_csv('TrainAndValid.csv',
                  low_memory=False,
                  parse_dates=['saledate'])
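
As a small illustration of what `parse_dates` changes, the sketch below reads the same CSV text twice, once without and once with the parameter. The two-row CSV is a made-up stand-in for the real file, and it uses ISO-style dates for simplicity.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = "SalesID,SalePrice,saledate\n1,9500,2006-11-16\n2,14000,2004-03-26\n"

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

# Without parse_dates the column holds plain strings
print(pd.api.types.is_datetime64_any_dtype(raw["saledate"]))
# With parse_dates the column holds datetimes, so the .dt accessor works
print(pd.api.types.is_datetime64_any_dtype(parsed["saledate"]))
print(parsed["saledate"].dt.year.tolist())
```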

In [None]:
df.saledate.dtype

In [None]:
df.saledate[:1000]

In [None]:
fig, ax = plt.subplots()
ax.scatter(df['saledate'][:1000], df['SalePrice'][:1000])

In [None]:
df.head()

In [None]:
df.head().T

In [None]:
df.saledate.head(20)

#### Sort DataFrame by saledate
When working with time series data, it's a good idea to sort it by date.

In [None]:
# Sort DataFrame in date order
df.sort_values(by=['saledate'], inplace=True, ascending=True)
df.saledate.head(20)
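
One reason sorting helps is that the competition's splits are chronological (training data through 2011, validation data from 2012), so rows in date order are easier to reason about and slice. The sketch below shows the same sort on a made-up three-row frame, using the non-`inplace` form, which returns a new DataFrame and leaves the input untouched.

```python
import pandas as pd

# Made-up rows, deliberately out of date order
toy = pd.DataFrame({
    "saledate": pd.to_datetime(["2011-06-01", "1989-01-17", "2004-03-26"]),
    "SalePrice": [30000, 9500, 14000],
})

# sort_values returns a new, sorted DataFrame when inplace is not set;
# reset_index(drop=True) renumbers the rows to match the new order
toy_sorted = toy.sort_values(by="saledate").reset_index(drop=True)
print(toy_sorted)
```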

#### Make a copy of the original DataFrame
We make a copy of the original DataFrame so that when we manipulate the copy, we've still got our original data.

In [None]:
# Make a copy of the original DataFrame to perform edits on
df_tmp = df.copy()
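
The difference between copying and plain assignment can be seen on a tiny made-up frame. In this sketch, `alias` is just another name for the same object, while `working` is an independent copy.

```python
import pandas as pd

original = pd.DataFrame({"SalePrice": [9500, 14000]})

alias = original            # plain assignment: same object, no copy made
working = original.copy()   # independent copy (deep by default)

# Editing the copy leaves the original untouched
working["SalePrice"] = working["SalePrice"] * 2

print(alias is original)               # True
print(original["SalePrice"].tolist())  # [9500, 14000]
print(working["SalePrice"].tolist())   # [19000, 28000]
```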

### Feature Engineering
Feature engineering involves creating or manipulating features.

We can use pandas datetime attributes to enrich our dataset with features derived from the sale date.

For any time series project, you may want to add datetime parameters using [pandas date time attributes](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DatetimeIndex.html).

Since we've parsed our `saledate` column as datetimes, we can access the attributes of each date through the `.dt` accessor...

In [None]:
df_tmp[:1].saledate.dt.year

In [None]:
df_tmp[:1].saledate.dt.day

In [None]:
df_tmp[:1].saledate

#### Add datetime parameters for `saledate` column

In [None]:
df_tmp['saleYear'] = df_tmp.saledate.dt.year
df_tmp['saleMonth'] = df_tmp.saledate.dt.month
df_tmp['saleDay'] = df_tmp.saledate.dt.day
df_tmp['saleDayOfWeek'] = df_tmp.saledate.dt.dayofweek
df_tmp['saleDayOfYear'] = df_tmp.saledate.dt.dayofyear
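
To check what each attribute returns, the sketch below applies the same `.dt` attributes to two hand-picked dates whose values are easy to verify on a calendar. The dates are chosen for illustration and are not taken from the dataset.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2006-11-16", "2012-01-01"]))

features = pd.DataFrame({
    "saleYear": dates.dt.year,
    "saleMonth": dates.dt.month,
    "saleDay": dates.dt.day,
    "saleDayOfWeek": dates.dt.dayofweek,  # Monday=0 ... Sunday=6
    "saleDayOfYear": dates.dt.dayofyear,  # 1 for January 1
})
print(features)
```

Note that `dayofweek` counts from Monday as 0, so a Sunday such as 2012-01-01 comes out as 6.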

In [None]:
df_tmp.head().T

In [None]:
# Now we've enriched our DataFrame with date time features, we can remove 'saledate'
df_tmp.drop('saledate', axis=1, inplace=True)

In [None]:
# Check the values of different columns
df_tmp.state.value_counts()

### 5. Modeling
We've done enough EDA for now (we could always do more), so let's start doing some model-driven EDA.

In [None]:
# Let's build a machine learning model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_jobs=-1,
                              random_state=42)

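# Expected to raise a ValueError at this stage: df_tmp still contains
# string columns, which the model can't convert to numbers
model.fit(df_tmp.drop('SalePrice', axis=1), df_tmp['SalePrice'])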

In [None]:
df_tmp.info()

In [None]:
df_tmp['UsageBand'].dtype

In [None]:
df_tmp.isna().sum()

### Converting a string to categories
One way we can turn all of our data into numbers is by converting the string columns into pandas categories.

We can check the different datatypes compatible with pandas here: https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#data-types-related-functionality

In [None]:
df_tmp.head().T

In [None]:
pd.api.types.is_string_dtype(df_tmp['UsageBand'])

In [None]:
# Find the columns which contain strings
for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

In [None]:
# If you're wondering what df.items() does, it works like dict.items():
# a DataFrame yields (column label, column) pairs. Here's a dictionary example
random_dict = {'key1': 'hello',
               'key2': 'world!'}

for key, value in random_dict.items():
    print(f'this is a key: {key}',
          f'this is a value: {value}')

In [None]:
# This will turn all of the string values into category values
for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        df_tmp[label] = content.astype('category').cat.as_ordered()

In [None]:
df_tmp.info()

In [None]:
df_tmp.state.cat.categories

In [None]:
df_tmp.state.cat.codes
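
The mapping between categories and codes is easier to see on a tiny made-up Series. This sketch uses the same `astype('category').cat.as_ordered()` call as above and includes one missing value on purpose.

```python
import pandas as pd

# Made-up values with one missing entry
states = pd.Series(["Texas", "Alabama", None, "Texas"])
as_category = states.astype("category").cat.as_ordered()

print(list(as_category.cat.categories))  # ['Alabama', 'Texas'] (sorted)
print(as_category.cat.codes.tolist())    # [1, 0, -1, 1]
```

Each code is the position of the value in the sorted category list, and a missing value gets the code `-1`, which is worth keeping in mind when missing data is handled later.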

Thanks to pandas Categories we now have a way to access all of our data in the form of numbers.

But we still have a bunch of missing data...

In [None]:
# Check missing data
# Divide by length of dataframe to view as ratios
df_tmp.isnull().sum()/len(df_tmp)
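
An equivalent way to get the same ratios is to take the mean of the boolean mask, since the mean of `True`/`False` values is the fraction that are `True`. The sketch below compares both forms on a made-up frame.

```python
import numpy as np
import pandas as pd

# Made-up frame: 2 of 4 UsageBand values and 1 of 4 SalePrice values are missing
toy = pd.DataFrame({
    "UsageBand": ["Low", None, None, "High"],
    "SalePrice": [9500.0, 14000.0, np.nan, 30000.0],
})

ratio_from_sum = toy.isna().sum() / len(toy)
ratio_from_mean = toy.isna().mean()  # same fractions, computed in one step

# Sorting puts the columns with the most missing data first
print(ratio_from_mean.sort_values(ascending=False))
```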

#### Save preprocessed data
Before we move on to filling any missing data, it may be a good idea to save our dataframe.
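
One thing to keep in mind when saving to CSV is that the file stores plain text, so pandas-specific dtypes such as `category` are not preserved when the file is read back. The sketch below demonstrates this with a made-up frame and an in-memory buffer standing in for a file on disk.

```python
import io
import pandas as pd

# Made-up frame with one category column
toy = pd.DataFrame({
    "state": pd.Series(["Texas", "Alabama"], dtype="category"),
    "SalePrice": [9500, 14000],
})

buffer = io.StringIO()           # in-memory stand-in for a file on disk
toy.to_csv(buffer, index=False)  # index=False avoids writing the row index as a column
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(toy["state"].dtype)        # category
print(reloaded["state"].dtype)   # no longer category after the round trip
```

After reloading a saved CSV, string columns would need to be converted to categories again.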

In [None]:
# Export current tmp dataframe