<a href="https://colab.research.google.com/github/YuxingW/machine_learning/blob/main/hw5_hpo_autogluon/hpo_autogluon_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### What is the story

This notebook predicts instantaneous power consumption from smart home sensor data and latent variables. <br />
Dataset<br />
&nbsp;  &nbsp; &nbsp; Home C sensor data set in 2015<br />

Objective function<br />
Predict instantaneous power consumption from sensor data, solar generation, and latent variables (weather information).<br />
We tune hyperparameters for RandomForestRegressor and MLPRegressor, then run AutoGluon on the same prediction task and compare the results.



In [1]:
# Uninstall mkl for faster neural-network training time
!pip uninstall -qy mkl
# Upgrade pip to ensure the latest package versions are available
!pip install -qU pip
# Upgrade setuptools to be compatible with namespace packages
!pip install -qU setuptools
!pip install -qU "mxnet<2.0.0"
# Install pre-release, frozen to a particular pre-release for stability
!pip install -q --pre "autogluon==0.0.16b20201214"
!pip install -qU ipykernel



In [2]:
from autogluon.tabular import TabularPrediction as task
import requests
import numpy as np
import pandas as pd
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from io import BytesIO


Load the dataset, which aggregates the sensor power consumption data of Home C and combines it with weather data.

In [3]:
train_url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTAH13xa_KY8QeIy48jCgbAY7OkFQCQYfHh-7Nt-TgXKTBeH8buWJZ6Izm91LxIXCo9W9oebwmbgbeF/pub?output=csv#'
r = requests.get(train_url)
data = r.content
stats = pd.read_csv(BytesIO(data))
stats.head()



Unnamed: 0,time,use [kW],gen [kW],House overall [kW],Dishwasher [kW],Furnace 1 [kW],Furnace 2 [kW],Home office [kW],Fridge [kW],Wine cellar [kW],Garage door [kW],Kitchen 12 [kW],Kitchen 14 [kW],Kitchen 38 [kW],Barn [kW],Well [kW],Microwave [kW],Living room [kW],Solar [kW],temperature,icon,humidity,visibility,summary,apparentTemperature,pressure,windSpeed,cloudCover,windBearing,precipIntensity,dewPoint,precipProbability
0,1451624400,0.932833,0.003483,0.932833,3.3e-05,0.0207,0.061917,0.442633,0.12415,0.006983,0.013083,0.000417,0.00015,0.0,0.03135,0.001017,0.004067,0.001517,0.003483,36.14,clear-night,0.62,10.0,Clear,29.26,1016.91,9.18,cloudCover,282,0.0,24.4,0.0
1,1451624401,0.934333,0.003467,0.934333,0.0,0.020717,0.063817,0.444067,0.124,0.006983,0.013117,0.000417,0.00015,0.0,0.0315,0.001017,0.004067,0.00165,0.003467,36.14,clear-night,0.62,10.0,Clear,29.26,1016.91,9.18,cloudCover,282,0.0,24.4,0.0
2,1451624402,0.931817,0.003467,0.931817,1.7e-05,0.0207,0.062317,0.446067,0.123533,0.006983,0.013083,0.000433,0.000167,1.7e-05,0.031517,0.001,0.004067,0.00165,0.003467,36.14,clear-night,0.62,10.0,Clear,29.26,1016.91,9.18,cloudCover,282,0.0,24.4,0.0
3,1451624403,1.02205,0.003483,1.02205,1.7e-05,0.1069,0.068517,0.446583,0.123133,0.006983,0.013,0.000433,0.000217,0.0,0.0315,0.001017,0.004067,0.001617,0.003483,36.14,clear-night,0.62,10.0,Clear,29.26,1016.91,9.18,cloudCover,282,0.0,24.4,0.0
4,1451624404,1.1394,0.003467,1.1394,0.000133,0.236933,0.063983,0.446533,0.12285,0.00685,0.012783,0.00045,0.000333,0.0,0.0315,0.001017,0.004067,0.001583,0.003467,36.14,clear-night,0.62,10.0,Clear,29.26,1016.91,9.18,cloudCover,282,0.0,24.4,0.0


### Data preprocessing
Remove the unit suffix ` [kW]` from the dataset column names.

In [4]:
stats.columns = [i.replace(' [kW]', '') for i in stats.columns]

Some rows of the 'cloudCover' column contain the literal string 'cloudCover' instead of a number. Replace these invalid values with the next valid value (backfill).

In [5]:
# Coerce the stray 'cloudCover' strings to NaN, then backfill each NaN with the next valid value
stats['cloudCover'] = pd.to_numeric(stats['cloudCover'], errors='coerce').bfill()
stats.head()

Unnamed: 0,time,use,gen,House overall,Dishwasher,Furnace 1,Furnace 2,Home office,Fridge,Wine cellar,Garage door,Kitchen 12,Kitchen 14,Kitchen 38,Barn,Well,Microwave,Living room,Solar,temperature,icon,humidity,visibility,summary,apparentTemperature,pressure,windSpeed,cloudCover,windBearing,precipIntensity,dewPoint,precipProbability
0,1451624400,0.932833,0.003483,0.932833,3.3e-05,0.0207,0.061917,0.442633,0.12415,0.006983,0.013083,0.000417,0.00015,0.0,0.03135,0.001017,0.004067,0.001517,0.003483,36.14,clear-night,0.62,10.0,Clear,29.26,1016.91,9.18,0.75,282,0.0,24.4,0.0
1,1451624401,0.934333,0.003467,0.934333,0.0,0.020717,0.063817,0.444067,0.124,0.006983,0.013117,0.000417,0.00015,0.0,0.0315,0.001017,0.004067,0.00165,0.003467,36.14,clear-night,0.62,10.0,Clear,29.26,1016.91,9.18,0.75,282,0.0,24.4,0.0
2,1451624402,0.931817,0.003467,0.931817,1.7e-05,0.0207,0.062317,0.446067,0.123533,0.006983,0.013083,0.000433,0.000167,1.7e-05,0.031517,0.001,0.004067,0.00165,0.003467,36.14,clear-night,0.62,10.0,Clear,29.26,1016.91,9.18,0.75,282,0.0,24.4,0.0
3,1451624403,1.02205,0.003483,1.02205,1.7e-05,0.1069,0.068517,0.446583,0.123133,0.006983,0.013,0.000433,0.000217,0.0,0.0315,0.001017,0.004067,0.001617,0.003483,36.14,clear-night,0.62,10.0,Clear,29.26,1016.91,9.18,0.75,282,0.0,24.4,0.0
4,1451624404,1.1394,0.003467,1.1394,0.000133,0.236933,0.063983,0.446533,0.12285,0.00685,0.012783,0.00045,0.000333,0.0,0.0315,0.001017,0.004067,0.001583,0.003467,36.14,clear-night,0.62,10.0,Clear,29.26,1016.91,9.18,0.75,282,0.0,24.4,0.0
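
The backfill step can be checked on a toy series. This is a minimal sketch, assuming the invalid entries are the literal string 'cloudCover' as in the rows shown above; the series `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy column: the header string has leaked into some rows.
toy = pd.Series(["cloudCover", "cloudCover", "0.75", "0.5", "cloudCover", "0.31"])

# Non-numeric entries become NaN, then each NaN takes the next valid value.
cleaned = pd.to_numeric(toy, errors="coerce").bfill()

print(cleaned.tolist())  # [0.75, 0.75, 0.75, 0.5, 0.31, 0.31]
```

Backfilling leaves NaN in place when no later valid value exists, so a column that ends with invalid entries would still need a forward fill or a drop.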


Combine 'Furnace 1' and 'Furnace 2' into 'Furnace', and combine 'Kitchen 12', 'Kitchen 14', and 'Kitchen 38' into 'Kitchen'.

In [6]:
stats['Furnace'] = stats[['Furnace 1','Furnace 2']].sum(axis=1)
stats['Kitchen'] = stats[['Kitchen 12','Kitchen 14','Kitchen 38']].sum(axis=1)

Inspect the feature correlations with 'use'. Solar generation is negatively correlated with power consumption, as expected.

In [7]:
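feature = 'use'
# Correlate numeric columns only; text columns such as 'icon' and 'summary' cannot be correlated
df_corr = stats.select_dtypes(include='number').corr()[feature]
# Keep features whose absolute correlation with 'use' exceeds 0.1
golden_features_list = df_corr[abs(df_corr) > 0.1].sort_values(ascending=True)
print("There is {} correlated values with feature {}:\n{}"
      .format(len(golden_features_list), feature, golden_features_list))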

There is 21 correlated values with feature use:
gen                   -0.309440
Solar                 -0.309440
time                  -0.159069
apparentTemperature   -0.154378
temperature           -0.152890
dewPoint              -0.120681
Kitchen 38             0.107878
Kitchen 14             0.154937
Home office            0.155206
Microwave              0.158953
Kitchen                0.170290
Fridge                 0.188517
Barn                   0.239522
Well                   0.257693
Dishwasher             0.302407
Furnace 1              0.329671
Living room            0.342365
Furnace 2              0.483417
Furnace                0.543849
House overall          1.000000
use                    1.000000
Name: use, dtype: float64


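'use' and 'House overall' are duplicates, as are 'gen' and 'Solar'; keep one of each pair. <br />
Keep only features whose absolute correlation with 'use' exceeds 0.1, excluding the duplicates. The combined 'Furnace' and 'Kitchen' columns stand in for their components, and 'time' is also left out.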

In [8]:
feature_list = ['gen' , 'apparentTemperature', 'temperature', 'dewPoint', 'Kitchen', 'Furnace', \
                'Home office', 'Microwave', 'Fridge', 'Barn', 'Well', 'Dishwasher', 'Living room', 'use']
df = stats[feature_list]
df.head()

Unnamed: 0,gen,apparentTemperature,temperature,dewPoint,Kitchen,Furnace,Home office,Microwave,Fridge,Barn,Well,Dishwasher,Living room,use
0,0.003483,29.26,36.14,24.4,0.000567,0.082617,0.442633,0.004067,0.12415,0.03135,0.001017,3.3e-05,0.001517,0.932833
1,0.003467,29.26,36.14,24.4,0.000567,0.084533,0.444067,0.004067,0.124,0.0315,0.001017,0.0,0.00165,0.934333
2,0.003467,29.26,36.14,24.4,0.000617,0.083017,0.446067,0.004067,0.123533,0.031517,0.001,1.7e-05,0.00165,0.931817
3,0.003483,29.26,36.14,24.4,0.00065,0.175417,0.446583,0.004067,0.123133,0.0315,0.001017,1.7e-05,0.001617,1.02205
4,0.003467,29.26,36.14,24.4,0.000783,0.300917,0.446533,0.004067,0.12285,0.0315,0.001017,0.000133,0.001583,1.1394
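
The duplicate pairs were spotted from their identical correlation values. A minimal sketch of how such pairs could be confirmed programmatically follows; the helper `find_duplicate_columns` and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def find_duplicate_columns(frame: pd.DataFrame) -> list:
    """Return (first, second) pairs of numeric columns holding the same values."""
    numeric = frame.select_dtypes(include="number")
    cols = list(numeric.columns)
    pairs = []
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            # allclose tolerates tiny floating-point differences between columns
            if np.allclose(numeric[left], numeric[right], equal_nan=True):
                pairs.append((left, right))
    return pairs


# Toy frame with made-up values that mimic the structure of the sensor data
toy = pd.DataFrame({
    "use": [0.93, 0.93, 1.02],
    "House overall": [0.93, 0.93, 1.02],
    "gen": [0.0035, 0.0035, 0.0035],
    "Solar": [0.0035, 0.0035, 0.0035],
    "Fridge": [0.124, 0.124, 0.123],
})

duplicates = find_duplicate_columns(toy)
print(duplicates)  # [('use', 'House overall'), ('gen', 'Solar')]
```

The pairwise loop is quadratic in the number of columns, which is fine for a few dozen columns like here.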


In [9]:
df.shape

(113176, 14)

### Import libraries and split the data into train and test sets

In [10]:
from time import time

import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

In [11]:
X, y = df.drop(['use'], axis=1), df['use']

X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=.2, shuffle=True)
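
The cell above fits `StandardScaler` on all rows before splitting, so statistics from the test rows influence the scaling. A minimal sketch of one common alternative is to put the scaler in a pipeline so that it is fitted on the training rows only. The synthetic data and the names `X_demo`, `y_demo`, and `model` are assumptions for illustration, and `LinearRegression` stands in for any regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sensor features (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split first, so the test rows stay unseen by every fitted step
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=True, random_state=42)

# The pipeline fits the scaler on X_tr only and reuses those statistics on X_te
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

scaler = model.named_steps["standardscaler"]
print("scaler means (training rows only):", scaler.mean_)
print("test r2:", model.score(X_te, y_te))
```

A pipeline can also be passed to `RandomizedSearchCV` as the estimator, in which case the scaler is refitted inside each cross-validation fold.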

### Hyperparameter tuning loop for RandomForestRegressor and MLPRegressor

In [20]:
def tuning_hyper_parameters():
  random_grid = {#'Linear Regression', 
           'Random Forest': {'bootstrap': [True, False],
                             'max_depth': [10, 50, 100, None],
                             'max_features': ['auto', 'sqrt'],
                             'min_samples_leaf': [1, 2, 4],
                             'min_samples_split': [2, 5, 10],
                             'n_estimators': [200, 1000, 2000]},
           #'Decision Tree',
          'MLP Regressor': {'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,1)],
                            'activation': ['relu','tanh','logistic'],
                            'alpha': [0.0001, 0.05],
                            'learning_rate': ['constant','adaptive'],
                            'solver': ['lbfgs','sgd','adam']},
           #'XBoost Regressor',
           #'KNN Regressor'
          }
  regression = [
            #LinearRegression(),
            RandomForestRegressor(),
            #DecisionTreeRegressor(),
            MLPRegressor(),
            #xgb.XGBRegressor(),
            #KNeighborsRegressor(),
          ]
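  # Pair each grid with its estimator; this relies on both being listed in the same order
  for name, clf in zip(random_grid.keys(), regression):
    # Sample 3 random parameter combinations and score each with 2-fold cross-validation
    search = RandomizedSearchCV(estimator=clf, param_distributions=random_grid[name], n_iter=3, cv=2, verbose=2, random_state=42, n_jobs=-1)
    search.fit(X_train, y_train)
    print('model %s best_params %s' % (name, search.best_params_))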

In [21]:
tuning_hyper_parameters()

Fitting 2 folds for each of 3 candidates, totalling 6 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:  9.7min finished


model Random Forest best_params {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 50, 'bootstrap': True}
Fitting 2 folds for each of 3 candidates, totalling 6 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:  3.8min finished


model MLP Regressor best_params {'solver': 'adam', 'learning_rate': 'adaptive', 'hidden_layer_sizes': (50, 50, 50), 'alpha': 0.0001, 'activation': 'logistic'}
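
With `n_iter = 3`, each search tries only 3 of the possible combinations (432 for the Random Forest grid and 108 for the MLP grid above), so the reported best parameters are the best of a small sample. The sketch below shows how the sampled candidates and their cross-validation scores can be inspected. The synthetic data, the reduced grid, and the names `X_demo`, `y_demo`, and `cv_table` are assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (hypothetical stand-in for the sensor data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Reduced grid: 3 * 3 * 3 = 27 possible combinations
param_distributions = {
    "n_estimators": [10, 20, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=3,          # only 3 of the 27 combinations are evaluated
    cv=2,
    random_state=42,
)
search.fit(X_demo, y_demo)

# One row per sampled candidate, with its mean cross-validated r2 and rank
cv_table = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(cv_table.sort_values("rank_test_score"))
print(search.best_params_, search.best_score_)
```

Raising `n_iter` or `cv` covers more of the grid and gives steadier scores, at the cost of a longer search.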


### Apply the best parameters in the muller loop and compare the results

In [14]:
def muller_loop(X_train, y_train, X_test, y_test):
  names = ['Random Forest',
           'Random Forest Tuning',
           'MLP Regressor',
           'MLP Regressor Tuning',
          ]
  regression = [
            # Untuned baseline
            RandomForestRegressor(max_depth=5, n_estimators=10),
            # Best parameters reported by the random search
            RandomForestRegressor(max_depth=50, n_estimators=200, max_features='sqrt', bootstrap=True, min_samples_leaf=4, min_samples_split=5),
            # Untuned baseline
            MLPRegressor(random_state=1, alpha=1, max_iter=1000),
            # Parameters from the random search, except activation:
            # the search reported 'logistic', while this run uses 'tanh'
            MLPRegressor(solver='adam', learning_rate='adaptive', hidden_layer_sizes=(50, 50, 50), alpha=0.0001, activation='tanh'),
          ]
  rows = []

  # Iterate over the regressors: fit, predict, and record metrics and elapsed time
  for name, clf in zip(names, regression):
      start_time = time()
      clf.fit(X_train, y_train)
      r2 = 100.0 * clf.score(X_test, y_test)  # r2 score as a percentage
      y_pred = clf.predict(X_test)
      mse = mean_squared_error(y_test, y_pred)
      mae = mean_absolute_error(y_test, y_pred)
      rows.append({'model': name, 'r2 score': r2, 'mean squared error': mse, 'mean absolute error': mae, 'time': (time() - start_time)})

  # Build the result table from the collected rows
  result = pd.DataFrame(rows, columns=['model', 'r2 score', 'mean squared error', 'mean absolute error', 'time'])
  display(result)


In [15]:
muller_loop(X_train, y_train, X_test, y_test)



| | model | r2 score (%) | mean squared error | mean absolute error | time (s) |
|---|---|---|---|---|---|
| 0 | Random Forest | 62.573965 | 0.174889 | 0.243374 | 2.988457 |
| 1 | Random Forest Tuning | 89.775175 | 0.04778 | 0.08004 | 60.248286 |
| 2 | MLP Regressor | 76.764494 | 0.108578 | 0.144515 | 10.040634 |
| 3 | MLP Regressor Tuning | 86.544679 | 0.062876 | 0.120644 | 304.453364 |
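
The three metrics in the table can be reproduced by hand on a tiny example, which makes their definitions concrete. This is a minimal sketch with made-up values; `y_true` and `y_hat` are hypothetical arrays, not notebook data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.0])

residuals = y_true - y_hat

# Mean squared error: average of squared residuals
mse = np.mean(residuals ** 2)                      # 0.125
# Mean absolute error: average of absolute residuals
mae = np.mean(np.abs(residuals))                   # 0.25
# r2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)                    # 0.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # 5.0
r2 = 1.0 - ss_res / ss_tot                         # 0.9

# The hand-computed values agree with scikit-learn
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(r2, r2_score(y_true, y_hat))
print(mse, mae, r2)
```

The notebook multiplies the r2 score by 100, so a value of 0.9 here would appear as 90 in the table.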


### Use AutoML AutoGluon to complete the prediction

In [27]:
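label_column = 'use'
# Train on the first 20,000 rows and hold out every remaining row for testing
train_data = df[:20000]
predictor = task.fit(train_data=train_data, label=label_column, eval_metric='r2')
# Start at row 20000 so that no row is skipped between the train and test sets
test_data = df[20000:]
y_test = test_data[label_column]

# Predict on the features only, then score against the held-out labels
test_data = test_data.drop(labels=[label_column], axis=1)
y_pred = predictor.predict(test_data)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred)
results = predictor.fit_summary()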

No output_directory specified. Models will be saved in: AutogluonModels/ag-20210307_054451/
Beginning AutoGluon training ...
AutoGluon will save models to AutogluonModels/ag-20210307_054451/
AutoGluon Version:  0.0.16b20201214
Train Data Rows:    20000
Train Data Columns: 13
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (7.378916667, 0.000466667, 1.0705, 0.69513)
	If 'regression' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    12290.24 MB
	Train Data (Original)  Memory Usage: 2.08 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_m

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L1   0.970730       0.591439   48.822532                0.000808           0.541473            1       True         11
1              CatBoost   0.969270       0.007375   17.304661                0.007375          17.304661            0       True          7
2              LightGBM   0.963676       0.185555    9.250434                0.185555           9.250434            0       True          5
3        LightGBMCustom   0.957901       0.093998    5.668685                0.093998           5.668685            0       True         10
4               XGBoost   0.956450       0.243168   33.663420                0.243168          33.663420            0       True          8
5         ExtraTreesMSE   0.947469       0.303703   16.057278                0.303703          16.

In [28]:
# The leaderboard needs the label column, so rebuild the test set with 'use' included
test_data = df[20000:]
leaderboard = predictor.leaderboard(test_data)

                  model  score_test  score_val  pred_time_test  pred_time_val    fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0         ExtraTreesMSE    0.735984   0.947469        5.425150       0.303703   16.057278                 5.425150                0.303703          16.057278            0       True          2
1   WeightedEnsemble_L1    0.734841   0.970730       13.841974       0.591439   48.822532                 0.047729                0.000808           0.541473            1       True         11
2        LightGBMCustom    0.723812   0.957901        4.115636       0.093998    5.668685                 4.115636                0.093998           5.668685            0       True         10
3              LightGBM    0.720576   0.963676        4.010450       0.185555    9.250434                 4.010450                0.185555           9.250434            0       True          5
4            LightGBMXT    0.714009

### Conclusion 
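1. Both models perform much better with tuned hyperparameters than with the untuned baselines.
<br />
2. The r2 score of the Random Forest rose by about 27 percentage points (62.6% to 89.8%), and that of the MLP by about 10 percentage points (76.8% to 86.5%). MSE and MAE also improved, especially for the Random Forest. <br />
3. AutoGluon produced well-performing models and hyperparameters: CatBoost reached an r2 score of about 0.97 and XGBoost about 0.96 on the validation data. However, the top models score only about 0.71 to 0.74 on the test data, lower than the manually tuned Random Forest Regressor at about 90%. The two setups are not directly comparable, though: AutoGluon was trained on the first 20,000 rows and tested on all later rows, while the scikit-learn models used a shuffled 80/20 split of the full dataset.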
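
One possible contributor to the gap between validation and test scores is the way the data was split. This is offered as an assumption, not something the notebook establishes. The sketch below uses synthetic data with a slow drift over time to show how the same model can score very differently under a shuffled split and a chronological split. All names and values in it are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic series: the target depends on one feature plus a slow upward drift
rng = np.random.default_rng(0)
n = 1000
t = np.arange(n)
X_demo = rng.normal(size=(n, 3))
y_demo = X_demo[:, 0] + 0.002 * t + rng.normal(scale=0.1, size=n)


def fit_and_score(X_tr, y_tr, X_te, y_te):
    """Fit a small random forest and return its r2 on the held-out rows."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


# Shuffled split: train and test rows are drawn from the whole time range
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.8, shuffle=True, random_state=0)
r2_shuffled = fit_and_score(X_tr, y_tr, X_te, y_te)

# Chronological split: train on the first 20% of rows, test on everything later
cut = n // 5
r2_chrono = fit_and_score(X_demo[:cut], y_demo[:cut], X_demo[cut:], y_demo[cut:])

print("shuffled split r2:     ", round(r2_shuffled, 3))
print("chronological split r2:", round(r2_chrono, 3))
```

Evaluating the tuned scikit-learn models on the same chronological split used for AutoGluon, or AutoGluon on the shuffled split, would make the two sets of scores comparable.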