# Introduction: Machine Learning Project Part 2

In this series of notebooks, we are working on a supervised, regression machine learning problem. Using real-world New York City building energy data, we want to predict the Energy Star Score and determine the factors that influence the score.

We are working through the outline of a machine learning project:

1. Data cleaning and structuring
2. Exploratory Data Analysis
3. Feature Engineering/Selection
4. Evaluate/compare several machine learning models on a performance metric
5. Perform hyperparameter tuning on the best model
6. Evaluate the best model on the testing set
7. Interpret the model results
8. Draw conclusions and write a well-documented report

The first notebook covered steps 1-3, and in this notebook, we will cover 4-6. I skip over all of the details of the machine learning models used here to focus on the implementation, but I would suggest reading this excellent book to get an idea of how they work and how to use them effectively in Python.


### Imports 

We will use most of the same imports as for the first part with the addition of some machine learning models. 

In [5]:
# Pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# No warnings about setting value on copy of slice
pd.options.mode.chained_assignment = None

# Matplotlib and seaborn for visualization
import matplotlib.pyplot as plt
%matplotlib inline

# Set default font size
plt.rcParams['font.size'] = 24

from IPython.core.pylabtools import figsize

import seaborn as sns

sns.set(font_scale = 2)

pd.set_option('display.max_columns', 60)

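# Imputing missing values
# (sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;
# SimpleImputer is its replacement and is aliased here to keep the name)
from sklearn.impute import SimpleImputer as Imputer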

# Machine Learning Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# Evaluating Models
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, mean_absolute_error

import itertools

### Read in Data

Here we will read in the formatted data that we cleaned in the previous notebook. 

In [55]:
# Read in data into dataframes from GitHub url
X = pd.read_csv('https://raw.githubusercontent.com/WillKoehrsen/machine-learning-project/master/data/training_features.csv')
X_test = pd.read_csv('https://raw.githubusercontent.com/WillKoehrsen/machine-learning-project/master/data/testing_features.csv')
y = pd.read_csv('https://raw.githubusercontent.com/WillKoehrsen/machine-learning-project/master/data/training_labels.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/WillKoehrsen/machine-learning-project/master/data/testing_labels.csv')

# Display sizes of data
print('Training Feature Size: ', X.shape)
print('Testing Feature Size:  ', X_test.shape)
print('Training Labels Size:  ', y.shape)
print('Testing Labels Size:   ', y_test.shape)

Training Feature Size:  (6749, 77)
Testing Feature Size:   (2893, 77)
Training Labels Size:   (6748, 1)
Testing Labels Size:    (2892, 1)


In [56]:
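def clean_dataset(df):
    """Return a float64 copy of df without rows that contain NaN or +/-inf."""
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    # Drop rows with missing values without modifying the caller's dataframe
    df = df.dropna()
    # Keep only rows in which no entry is infinite
    indices_to_keep = ~df.isin([np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)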

As a reminder, here is what the formatted data looks like. In the first notebook, we engineered a number of features by taking the natural log of the variables, and we selected features by removing highly collinear ones. Mostly we are focusing on numerical features, but we also included two categorical features that we saw are related to the Energy Star Score. These categorical variables have been one-hot encoded.

In [22]:
X.head(12)

Unnamed: 0,Order,Property Id,DOF Gross Floor Area,Largest Property Use Type - Gross Floor Area (ft²),Year Built,Number of Buildings - Self-reported,Occupancy,Weather Normalized Site EUI (kBtu/ft²),Weather Normalized Site Electricity Intensity (kWh/ft²),Natural Gas Use (kBtu),Weather Normalized Site Natural Gas Use (therms),Indirect GHG Emissions (Metric Tons CO2e),Water Use (All Water Sources) (kgal),Water Intensity (All Water Sources) (gal/ft²),Latitude,Longitude,Community Board,Census Tract,log_Site EUI (kBtu/ft²),log_Weather Normalized Site EUI (kBtu/ft²),log_Weather Normalized Site Electricity Intensity (kWh/ft²),log_Direct GHG Emissions (Metric Tons CO2e),log_Water Use (All Water Sources) (kgal),log_Water Intensity (All Water Sources) (gal/ft²),Borough_Staten Island,Largest Property Use Type_Adult Education,Largest Property Use Type_Ambulatory Surgical Center,Largest Property Use Type_Automobile Dealership,Largest Property Use Type_Bank Branch,Largest Property Use Type_College/University,...,Largest Property Use Type_Museum,Largest Property Use Type_Non-Refrigerated Warehouse,Largest Property Use Type_Other,Largest Property Use Type_Other - Education,Largest Property Use Type_Other - Entertainment/Public Assembly,Largest Property Use Type_Other - Lodging/Residential,Largest Property Use Type_Other - Mall,Largest Property Use Type_Other - Public Services,Largest Property Use Type_Other - Recreation,Largest Property Use Type_Other - Services,Largest Property Use Type_Other - Specialty Hospital,Largest Property Use Type_Other - Technology/Science,Largest Property Use Type_Outpatient Rehabilitation/Physical Therapy,Largest Property Use Type_Parking,Largest Property Use Type_Performing Arts,Largest Property Use Type_Pre-school/Daycare,Largest Property Use Type_Refrigerated Warehouse,"Largest Property Use Type_Repair Services (Vehicle, Shoe, Locksmith, etc.)",Largest Property Use Type_Residence Hall/Dormitory,Largest Property Use Type_Residential Care 
Facility,Largest Property Use Type_Restaurant,Largest Property Use Type_Retail Store,Largest Property Use Type_Self-Storage Facility,Largest Property Use Type_Senior Care Community,Largest Property Use Type_Social/Meeting Hall,Largest Property Use Type_Strip Mall,Largest Property Use Type_Supermarket/Grocery Store,Largest Property Use Type_Urgent Care/Clinic/Other Outpatient,Largest Property Use Type_Wholesale Club/Supercenter,Largest Property Use Type_Worship Facility
0,4212,4932827,111567.0,98016.0,1913,1,100,63.8,12.1,2217126.8,23407.5,402.3,1924.7,18.37,40.769624,-73.96823,8.0,122.0,4.135167,4.155753,2.493205,4.768988,7.562525,2.910719,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,6167,4406956,54030.0,63250.0,1930,1,100,,,4716513.9,,11.0,5184.6,81.97,40.847999,-73.940296,12.0,265.0,4.335983,,,5.523459,8.553448,4.406353,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,10770,4370684,149450.0,149450.0,1934,1,100,56.4,2.8,5988399.9,63472.7,136.9,,,40.645638,-73.981046,12.0,498.0,3.974058,4.032469,1.029619,5.869014,,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,6960,4401878,148827.0,159146.0,1982,1,100,74.2,4.8,8491331.1,92259.8,245.0,,,40.821001,-73.89558,2.0,159.0,4.247066,4.306764,1.568616,6.111467,,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4928,2682511,377823.0,344857.0,1976,1,100,78.7,6.4,18623690.7,205135.3,747.4,,,40.773428,-73.953129,8.0,138.0,4.306764,4.365643,1.856298,6.896897,,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,8272,3114850,72500.0,80475.0,1924,1,100,,3.4,5252521.5,56063.4,90.9,,,40.835794,-73.852112,9.0,222.0,4.348987,,1.223775,5.631212,,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,11731,3114772,68076.0,75564.0,1940,1,100,,2.2,4101299.9,43882.4,54.9,6365.2,84.24,40.629099,-73.957021,14.0,772.0,4.188138,,0.788457,5.477718,8.758601,4.43367,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,3735,2817479,62237.0,73187.0,1924,1,100,97.2,15.0,,,1814.2,2024.2,26.51,40.756962,-73.976256,5.0,94.0,4.525044,4.576771,2.70805,-inf,7.61293,3.277522,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,13808,4404055,53914.0,54000.0,1929,1,100,140.6,3.2,1495400.0,14954.0,57.6,4405.1,81.58,,,,,4.859812,4.945919,1.163151,6.0974,8.390518,4.401584,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,12447,3782892,100000.0,100000.0,1959,1,100,59.2,7.6,2934999.7,33086.2,242.2,1310.2,13.1,,,,,4.01458,4.080922,2.028148,5.049215,7.177935,2.572612,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
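
The `-inf` in the `log_Direct GHG Emissions (Metric Tons CO2e)` column of row 7 is the kind of value a natural log produces for a zero input. The minimal sketch below uses made-up numbers (the `emissions` array is hypothetical, not taken from the dataset) to show how such a value can arise and why a NaN check alone does not catch it.

```python
import numpy as np

# Hypothetical emissions values; one building reports zero
emissions = np.array([117.8, 250.5, 0.0])

# np.log(0) evaluates to -inf; silence the divide-by-zero RuntimeWarning
with np.errstate(divide='ignore'):
    log_emissions = np.log(emissions)

print(log_emissions)

# -inf is not NaN, so a missing-value check does not flag it
print('Any NaN:       ', np.isnan(log_emissions).any())
print('Any non-finite:', (~np.isfinite(log_emissions)).any())
```

Checking with `np.isfinite` rather than `np.isnan` covers both kinds of problem value at once.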


# Evaluating and Comparing Machine Learning Models

## Imputing Missing Values

We can see there are missing values in a number of columns. Although we dropped features with more than 50% missing values, there are still quite a few left that must be addressed before we do machine learning. There are a number of methods for filling in missing values (known as imputation) but here we will use the relatively simple method of replacing missing values with the median of the column. In the code below, we create a scikit-learn `Imputer` object and then fill in the missing values.

Notice that we train the imputer (using the `.fit` method) on the training data but not the testing data. We then transform both the training data and testing data. This means that the missing values in the testing set are filled in with the median value of the corresponding columns in the training set. We have to do it this way rather than training on all the data because at production time, we will have to impute the missing values based on the previous training data and not on any new observations we get. 

In [45]:
# Create an imputer object with a median filling strategy
imputer = Imputer(strategy='median')

# Train on the training features
imputer.fit(X)

# Transform both training data and testing data
X = imputer.transform(X)
X_test = imputer.transform(X_test)

In [46]:
print('Missing values in training features: ', np.sum(np.isnan(X)))
print('Missing values in testing features:  ', np.sum(np.isnan(X_test)))

Missing values in training features:  0
Missing values in testing features:   0
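
To make the fit-on-train, transform-both pattern concrete, here is a small sketch on made-up arrays. It assumes a recent scikit-learn, where the median imputer is available as `sklearn.impute.SimpleImputer`; `train`, `test`, and `toy_imputer` are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): one feature column, NaN marks a missing value
train = np.array([[1.0], [2.0], [100.0], [np.nan]])
test = np.array([[np.nan], [5.0]])

# Learn the median from the training column only
toy_imputer = SimpleImputer(strategy='median')
toy_imputer.fit(train)

train_filled = toy_imputer.transform(train)
test_filled = toy_imputer.transform(test)

# The missing test value receives the training median (2.0),
# not a statistic computed from the test set
print('Learned median:', toy_imputer.statistics_)
print('Imputed test:  ', test_filled.ravel())
```

The test set never influences the learned median, which mirrors what happens when new observations arrive after the model has been trained.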


Just to remind ourselves, here is the naive baseline performance measured in mean absolute error.

In [47]:
# Function to calculate mean absolute error
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))

baseline_guess = np.median(y)
print("Baseline Performance on the test set: MAE = %0.4f" % mae(y_test, baseline_guess))

Baseline Performance on the test set: MAE = 25.3485
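
As a small worked example of the metric, mean absolute error is the average of the absolute differences between the true values and the predictions. The sketch below uses made-up scores (hypothetical, not from the dataset) with the training median as a constant guess.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between true values and predictions."""
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical scores, for illustration only
train_scores = np.array([10.0, 50.0, 90.0])
test_scores = np.array([40.0, 70.0])

# The constant guess is the median of the training scores: 50.0
guess = np.median(train_scores)

# |40 - 50| = 10 and |70 - 50| = 20, so the mean absolute error is 15.0
print('Toy baseline MAE = %0.4f' % mae(test_scores, guess))
```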


### Models to Evaluate

We will compare five different models:

1. Linear Regression
2. Support Vector Machine Regression
3. Random Forest Regression
4. Gradient Boosting Regression
5. K-Nearest Neighbors Regression

To evaluate the models, we are going to use the scikit-learn defaults for the model hyperparameters. Generally these will perform decently, but they should be optimized before actually using a model. For now we just want to determine the baseline performance of each model, and then we can select the best performer for further optimization. I don't want to get bogged down in the model theory or hyperparameters, so I'll leave it up to you to do some research. Just know that the default hyperparameters will get a model up and running, but they should nearly always be adjusted using some sort of search to find the best settings for your problem.

One of the best parts about scikit-learn is that all models are implemented in basically the same manner: once you know how to build one, you can implement an extremely diverse array of models. Here we will implement the entire training and testing procedures for a number of models in just a few lines of code.
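
As a rough illustration of that shared interface, the sketch below trains and scores all five model types through one helper. It runs on synthetic data from `make_regression` rather than the building energy data, and `fit_and_evaluate`, `X_demo`, `y_demo`, and the split names are hypothetical names chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic stand-in data (not the building energy dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=42)

def fit_and_evaluate(model):
    """Train on the demo training split and return MAE on the demo test split."""
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_te, model.predict(X_te))

# Default hyperparameters, apart from random_state for repeatability
models = {
    'Linear Regression': LinearRegression(),
    'Support Vector Machine': SVR(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'K-Nearest Neighbors': KNeighborsRegressor(),
}

# Every estimator exposes the same fit/predict methods
results = {name: fit_and_evaluate(model) for name, model in models.items()}
for name, score in results.items():
    print('%s: MAE = %0.4f' % (name, score))
```

Because every estimator exposes `fit` and `predict`, adding another model to the comparison is a one-line change to the dictionary.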

In [48]:
np.array(y_test.values)

array([[ 82.],
       [ 60.],
       [100.],
       ...,
       [ 49.],
       [ 35.],
       [ 46.]])

In [50]:
# np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype
X = X.astype(np.float64)
y = y.astype(np.float64)

In [51]:
X

array([[4.212000e+03, 4.932827e+06, 1.115670e+05, ..., 0.000000e+00,
        0.000000e+00, 0.000000e+00],
       [6.167000e+03, 4.406956e+06, 5.403000e+04, ..., 0.000000e+00,
        0.000000e+00, 0.000000e+00],
       [1.077000e+04, 4.370684e+06, 1.494500e+05, ..., 0.000000e+00,
        0.000000e+00, 0.000000e+00],
       ...,
       [8.173000e+03, 3.269943e+06, 7.050000e+04, ..., 0.000000e+00,
        0.000000e+00, 0.000000e+00],
       [1.518000e+03, 3.527450e+06, 1.259140e+05, ..., 0.000000e+00,
        0.000000e+00, 0.000000e+00],
       [1.124400e+04, 3.129145e+06, 5.280000e+04, ..., 0.000000e+00,
        0.000000e+00, 0.000000e+00]])

In [52]:
# Create the model
lr = LinearRegression()

# Train the model
lr.fit(X, y)

# Make predictions and evaluate
lr_pred = lr.predict(X_test)
lr_mae = mae(y_test, lr_pred)
print('Linear Regression Performance on the test set: MAE = %0.4f' % lr_mae)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In [54]:
random_forest = RandomForestRegressor()

random_forest.fit(X, y)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

In [18]:
np.any(pd.isna(X_test))

False
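
The `ValueError` messages above mention infinity as well as NaN, while `pd.isna` reports only NaN. The sketch below shows one way to look for infinite entries and route them through the same median imputation. `features` is a hypothetical toy frame, and treating `-inf` as a missing value is an assumption about what is appropriate here, not something the notebook establishes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one NaN and one -inf
features = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],
    'log_b': [0.5, -np.inf, 1.5, 2.5],
})

# Count non-finite entries (NaN or +/-inf) per column
non_finite = (~np.isfinite(features)).sum()
print(non_finite)

# Convert infinities to NaN so the imputer treats them as missing
features = features.replace([np.inf, -np.inf], np.nan)
filled = SimpleImputer(strategy='median').fit_transform(features)

print('All finite after imputation:', np.isfinite(filled).all())
```

Running the same `np.isfinite` check on the real training and testing features would show whether infinite values are what the models are rejecting.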