# Introduction: Machine Learning Project Part 3

In the first two parts of this project, we implemented the first six of the eight steps in the machine learning pipeline:

1. Data cleaning and structuring
2. Exploratory Data Analysis
3. Feature Engineering/Selection
4. Evaluate/compare several machine learning models on a performance metric
5. Perform hyperparameter tuning on the best model
6. Evaluate the best model on the testing set
7. Interpret the model results
8. Draw conclusions and write a well-documented report

In this notebook, we concentrate on the last two steps, interpreting the model and documenting the conclusions, which is where much of the project's value comes from. We have our final model and its results, but what can we take away from them? To answer this question, we can use a variety of techniques to understand how the model makes its predictions.

### Imports

We will use a similar stack of data science and machine learning imports as in the previous parts. These are all fairly standard tools of the trade, so being familiar with them will be very useful in your data science career!

In [1]:
# Pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# No warnings about setting value on copy of slice
pd.options.mode.chained_assignment = None

# Matplotlib and seaborn for visualization
import matplotlib.pyplot as plt
%matplotlib inline

# Set default font size
plt.rcParams['font.size'] = 24

from IPython.core.pylabtools import figsize

import seaborn as sns

sns.set(font_scale = 2)

pd.set_option('display.max_columns', 60)

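# Imputing missing values and scaling values
# (Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement,
# aliased here so the rest of the notebook can keep the original name)
from sklearn.impute import SimpleImputer as Imputer
from sklearn.preprocessing import MinMaxScaler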

# Machine Learning Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

import itertools

### Read in Data

In [8]:
# Read the data into dataframes from GitHub URLs
X = pd.read_csv('https://raw.githubusercontent.com/WillKoehrsen/machine-learning-project/master/data/training_features.csv', header = 0)
X_test = pd.read_csv('https://raw.githubusercontent.com/WillKoehrsen/machine-learning-project/master/data/testing_features.csv', header = 0)
y = pd.read_csv('https://raw.githubusercontent.com/WillKoehrsen/machine-learning-project/master/data/training_labels.csv', header = 0)
y_test = pd.read_csv('https://raw.githubusercontent.com/WillKoehrsen/machine-learning-project/master/data/testing_labels.csv', header = 0)

# Display sizes of data
print('Training Feature Size: ', X.shape)
print('Testing Feature Size:  ', X_test.shape)
print('Training Labels Size:  ', y.shape)
print('Testing Labels Size:   ', y_test.shape)

Training Feature Size:  (6749, 77)
Testing Feature Size:   (2893, 77)
Training Labels Size:   (6749, 1)
Testing Labels Size:    (2893, 1)
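
The shapes printed above imply roughly a 70/30 split between training and testing rows. A quick check of that arithmetic, using only the row counts reported in the output (the variable names are just for illustration):

```python
# Row counts copied from the shapes printed above
n_train, n_test = 6749, 2893

n_total = n_train + n_test
train_fraction = n_train / n_total
test_fraction = n_test / n_total

print('Total rows:      %d' % n_total)
print('Train fraction:  %0.3f' % train_fraction)
print('Test fraction:   %0.3f' % test_fraction)
```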


## Recreate Final Model

In [9]:
# Make sure that all problem values are recorded as np.nan
X = X.replace({np.inf: np.nan, -np.inf: np.nan})
X_test = X_test.replace({np.inf: np.nan, -np.inf: np.nan})

# Create an imputer object with a median filling strategy
imputer = Imputer(strategy='median')

# Train on the training features
imputer.fit(X)

# Transform both training data and testing data
X = imputer.transform(X)
X_test = imputer.transform(X_test)

# Create the scaler object with a range of 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data
scaler.fit(X)

# Transform both the training and testing data
X = scaler.transform(X)
X_test = scaler.transform(X_test)

# Sklearn wants the labels as one-dimensional vectors
y = np.array(y).reshape((-1,))
y_test = np.array(y_test).reshape((-1,))
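
The cell above fits the imputer and the scaler on the training features only and then applies them to both sets, so no information from the test set leaks into preprocessing. As a minimal sketch of the same idea, the two steps could be chained in a scikit-learn `Pipeline`. This is an alternative arrangement, not what the notebook itself does; it assumes scikit-learn 0.20 or newer (for `SimpleImputer`), and the `*_demo` arrays are synthetic stand-ins rather than the project's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical, not the project's features)
rng = np.random.RandomState(42)
X_train_demo = rng.normal(size=(100, 3))
X_test_demo = rng.normal(size=(40, 3))

# Knock out a few values to mimic missing entries
X_train_demo[::10, 0] = np.nan
X_test_demo[::8, 1] = np.nan

# Median imputation followed by scaling to the 0-1 range
preprocess = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler(feature_range=(0, 1))),
])

# Fit on the training data only, then reuse the learned medians and ranges
X_train_scaled = preprocess.fit_transform(X_train_demo)
X_test_scaled = preprocess.transform(X_test_demo)

print('Training range: ', X_train_scaled.min(), X_train_scaled.max())
print('Testing range:  ', X_test_scaled.min(), X_test_scaled.max())
```

Because the scaler only sees the training data, the scaled test values are not guaranteed to stay inside 0-1; that is expected and is the price of keeping the test set untouched.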

In [10]:
# Function to calculate mean absolute error
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))
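
To make the metric concrete, here is a small worked example of the same `mae` function on made-up numbers. The `*_demo` arrays are hypothetical values chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Same definition as in the cell above
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))

# Hypothetical true values and predictions
y_true_demo = np.array([50.0, 60.0, 70.0, 80.0])
y_pred_demo = np.array([55.0, 58.0, 70.0, 90.0])

# Absolute errors are 5, 2, 0, and 10; their mean is 17 / 4 = 4.25
demo_mae = mae(y_true_demo, y_pred_demo)
print('MAE = %0.4f' % demo_mae)
```

The result is in the same units as the target, which is what makes the mean absolute error easy to interpret: on average, these predictions are off by 4.25 points.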

In [None]:
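# On scikit-learn >= 1.2, use criterion='absolute_error' instead of 'mae';
# on >= 1.3, replace max_features='auto' with 1.0 (all features, the same behavior)
model = RandomForestRegressor(n_estimators=300, max_depth=60, max_features='auto',
                              criterion='mae', min_samples_leaf=1, n_jobs=-1,
                              random_state=42, verbose=1)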

model.fit(X, y)

In [12]:
#  Make predictions on the test set
model_pred = model.predict(X_test)

print('Final Model Performance on the test set: MAE = %0.4f' % mae(y_test, model_pred))

Final Model Performance on the test set: MAE = 9.4806


[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:    0.0s finished
