# Occupancy detection

### A multivariate time series example

In this tutorial, you will learn how to apply getML to multivariate time series. It also demonstrates how to use getML's [high-level interface for hyperparameter tuning](https://docs.getml.com/latest/user_guide/hyperopt/hyperopt.html#tuning-routines).

Summary:

- Prediction type: __Binary classification__
- Domain: __Energy__
- Prediction target: __Room occupancy__
- Source data: __1 table, 32k rows__
- Population size: __32k__

_Author: Dr. Johannes King_

# Background

Our use case is a public domain data set for predicting room occupancy from sensor data. The results achieved using getML outperform all published results on this data set. Note that this is not only a neat use case for machine learning algorithms, but a real-world application with tangible consequences: If room occupancy is known with sufficient certainty, it can be applied to the control systems of a building. Such a system can reduce the energy consumption by [up to 50 %](https://ieeexplore.ieee.org/document/7566062). For further details about the data set, refer to [link to the occupancy notebook](#).

### Comparison of propositionalization approaches

Approaches to propositionalization:
- getML (FastProp)
- featuretools
- tsfresh

### A web frontend for getML

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and it allows easy data and feature exploration. You can launch the getML monitor [here](http://localhost:1709).

### Where is this running?

Your getML live session is running inside a docker container on [mybinder.org](https://mybinder.org/), a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

# Analysis

Let's get started with the analysis and set up your session:

In [1]:
import datetime
import os
from urllib import request
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.style.use("seaborn")
%matplotlib inline

import getml

from utils import FTTimeSeriesBuilder, TSFreshBuilder

print(f"getML API version: {getml.__version__}\n")

getml.engine.set_project("occupancy")

getML API version: 0.16.0


Loading pipelines...

Connected to project 'occupancy'


## 1. Loading data


The data set can be downloaded directly from GitHub. It is conveniently separated into a train, a validation and a testing set. This allows us to directly benchmark our results against the results of the original paper later.

In [2]:
data_test, data_train, data_validate = getml.datasets.load_occupancy(
    roles=True
).values()

The train set looks like this:

In [3]:
data_train

| Name | date | Occupancy | Temperature | Humidity | Light | CO2 | HumidityRatio |
|---|---|---|---|---|---|---|---|
| Role | time_stamp | target | numerical | numerical | numerical | numerical | numerical |
| Units | time stamp, comparison only | | | | | | |
| 0 | 2015-02-04 17:51:00 | 1 | 23.18 | 27.272 | 426 | 721.25 | 0.004793 |
| 1 | 2015-02-04 17:51:59 | 1 | 23.15 | 27.2675 | 429.5 | 714 | 0.004783 |
| 2 | 2015-02-04 17:53:00 | 1 | 23.15 | 27.245 | 426 | 713.5 | 0.004779 |
| 3 | 2015-02-04 17:54:00 | 1 | 23.15 | 27.2 | 426 | 708.25 | 0.004772 |
| 4 | 2015-02-04 17:55:00 | 1 | 23.1 | 27.2 | 426 | 704.5 | 0.004757 |
| | ... | ... | ... | ... | ... | ... | ... |
| 8138 | 2015-02-10 09:29:00 | 1 | 21.05 | 36.0975 | 433 | 787.25 | 0.005579 |
| 8139 | 2015-02-10 09:29:59 | 1 | 21.05 | 35.995 | 433 | 789.5 | 0.005563 |
| 8140 | 2015-02-10 09:30:59 | 1 | 21.1 | 36.095 | 433 | 798.5 | 0.005596 |
| 8141 | 2015-02-10 09:32:00 | 1 | 21.1 | 36.26 | 433 | 820.3333 | 0.005621 |


## 2. Predictive modeling

We loaded the data and defined the roles and units. Next, we define the abstract data model and create a getML pipeline for relational learning.

### 2.1 Propositionalization with getML's FastProp

We use all possible aggregations. Because tsfresh and featuretools are single-threaded, we limit our FastProp algorithm to one thread as well, to ensure a fair comparison.

In [4]:
population = getml.data.Placeholder("population")
peripheral = getml.data.Placeholder("peripheral")

population.join(
    peripheral,
    time_stamp="date",
    # We want our time series features to only use
    # data from the last 15 minutes
    memory=getml.data.time.minutes(15),
    # Our forecast horizon is 0.
    # We do not predict the future, instead we infer
    # the present state from current and past sensor data.
    horizon=0.0,
    # We do not allow the time series features
    # to use target values from the past.
    allow_lagged_targets=False,
)

population

feature_learner = getml.feature_learning.FastPropModel(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    aggregation=getml.feature_learning.FastPropModel.agg_sets.All,
    num_threads=1
)
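
To make the `memory` and `horizon` settings of the join concrete, here is a minimal pure-Python sketch of the kind of windowed aggregation such a self-join describes. It is an illustration only, not getML's implementation: the helper `window_aggregate` and the toy readings are hypothetical, and the sketch assumes that for a row at time `t` only rows with time stamps in the half-open window `(t - horizon - memory, t - horizon]` are aggregated.

```python
from datetime import datetime, timedelta
from statistics import mean


def window_aggregate(rows, column, memory, horizon=timedelta(0), agg=mean):
    """For each row, aggregate `column` over all rows whose time stamp lies
    in the window (t - horizon - memory, t - horizon] (assumed semantics)."""
    features = []
    for current in rows:
        upper = current["date"] - horizon
        lower = upper - memory
        window = [r[column] for r in rows if lower < r["date"] <= upper]
        features.append(agg(window) if window else None)
    return features


# Toy CO2 readings, one every 10 minutes (made-up values).
start = datetime(2015, 2, 4, 18, 0)
rows = [
    {"date": start + timedelta(minutes=10 * i), "CO2": value}
    for i, value in enumerate([700.0, 720.0, 740.0, 800.0])
]

# With a 15-minute memory, each row sees itself and the reading before it.
avg_co2 = window_aggregate(rows, "CO2", memory=timedelta(minutes=15))
max_co2 = window_aggregate(rows, "CO2", memory=timedelta(minutes=15), agg=max)
```

Each choice of column and aggregation yields one candidate feature. The sketch scans all rows for every window, which is fine for a toy example but quadratic in the number of rows.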

Next, we create the pipeline. In contrast to our usual approach, we create _two pipelines_ in
this notebook: one for feature learning (suffix `_fl`) and one for prediction (suffix `_pr`).
This allows for a fair comparison of runtimes.

In [5]:
pipe_fp_fl = getml.pipeline.Pipeline(
    feature_learners=[feature_learner],
    peripheral=[peripheral],
    population=population,
    tags=["feature learning", "fastprop"],
)

In [6]:
pipe_fp_fl.check(data_train, [data_train])

Checking data model...
OK.


The wrappers around featuretools and tsfresh fit on the training set and then return the training features. We therefore measure the time it takes getML's FastProp algorithm to fit on the training set and create the training features.

In [7]:
begin = time.time()

pipe_fp_fl.fit(data_train, [data_train])

fastprop_train = pipe_fp_fl.transform(
    data_train, 
    [data_train], 
    df_name="fastprop_train"
)

end = time.time()

fastprop_runtime = datetime.timedelta(seconds=end - begin)

Checking data model...
OK.

FastProp: Trying 194 features...

Trained pipeline.
Time taken: 0h:0m:0.115086


FastProp: Building features...



In [8]:
fastprop_test = pipe_fp_fl.transform(data_test, [data_test], df_name="fastprop_test")


FastProp: Building features...



Now we create a dedicated prediction pipeline and provide the FastProp features
(contained in `fastprop_train` and `fastprop_test`).

In [9]:
predictor = getml.predictors.XGBoostClassifier()

pipe_fp_pr = getml.pipeline.Pipeline(
    tags=["prediction", "fastprop"], predictors=[predictor]
)

In [10]:
pipe_fp_pr.check(fastprop_train)

pipe_fp_pr.fit(fastprop_train)



Checking data model...




Checking data model...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:0m:4.231413



In [11]:
pipe_fp_pr.score(fastprop_test)




| | date time | set used | target | accuracy | auc | cross entropy |
|---|---|---|---|---|---|---|
| 0 | 2021-05-14 16:39:03 | fastprop_train | Occupancy | 0.9998 | 1.0 | 0.004514 |
| 1 | 2021-05-14 16:39:03 | fastprop_test | Occupancy | 0.9936 | 0.9986 | 0.025329 |
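
The three columns reported by `score` are standard binary classification metrics. As a rough guide to how they are usually defined, the following sketch computes them from scratch for a handful of made-up labels and predicted probabilities. The function names and the 0.5 decision threshold are assumptions of this sketch, not getML internals.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Share of rows whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random positive is ranked above a random negative
    (ties count one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Made-up labels and predicted occupancy probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

acc = accuracy(y_true, y_prob)
roc_auc = auc(y_true, y_prob)
loss = cross_entropy(y_true, y_prob)
```

Accuracy depends on the chosen threshold, whereas AUC only depends on how the predictions are ranked, which is why a model can reach an AUC of 1.0 on the training set without perfect accuracy.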


### 2.2 Propositionalization with featuretools

In [12]:
dfs_pandas = {}

for df in getml.project.data_frames:
    dfs_pandas[df.name] = df.to_pandas()
    dfs_pandas[df.name]["id"] = 1

In [13]:
ft_builder = FTTimeSeriesBuilder(
    num_features=200,
    horizon=pd.Timedelta(minutes=0),
    memory=pd.Timedelta(minutes=15),
    column_id="id",
    time_stamp="date",
    target="Occupancy",
)

The `FTTimeSeriesBuilder` provides a `fit` method that is designed to be equivalent
to the `fit` method of the predictorless getML pipeline above.
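
The actual `FTTimeSeriesBuilder` lives in the notebook's `utils` module and is not reproduced here. Purely to illustrate the interface the comparison relies on (a `fit` that returns the training features and records a `runtime`, and a `transform` that builds the same features for new data), a hypothetical stand-in could look like the sketch below. The class `ToyBuilder` and its difference features are inventions for this example.

```python
import datetime
import time


class ToyBuilder:
    """Hypothetical stand-in that mimics a fit/transform/runtime interface."""

    def __init__(self, columns):
        self.columns = columns
        self.runtime = None  # set by fit() as a datetime.timedelta

    def _build(self, rows):
        # One toy feature per column: the difference to the previous row.
        features = []
        for i, row in enumerate(rows):
            prev = rows[i - 1] if i > 0 else row
            features.append({f"{c}_diff": row[c] - prev[c] for c in self.columns})
        return features

    def fit(self, rows):
        # Time the feature construction so that runtimes can be compared.
        begin = time.perf_counter()
        features = self._build(rows)
        self.runtime = datetime.timedelta(seconds=time.perf_counter() - begin)
        return features

    def transform(self, rows):
        return self._build(rows)


builder = ToyBuilder(columns=["CO2"])
train_features = builder.fit([{"CO2": 700.0}, {"CO2": 720.0}, {"CO2": 715.0}])
test_features = builder.transform([{"CO2": 500.0}, {"CO2": 510.0}])
```

Because `fit` both builds the training features and stores the elapsed time, its runtime can be compared with the combined `fit` and `transform` timing measured for FastProp.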

In [14]:
featuretools_train = ft_builder.fit(dfs_pandas["train"])
featuretools_test = ft_builder.transform(dfs_pandas["test"])

df_featuretools_train = getml.data.DataFrame.from_pandas(
    featuretools_train, name="featuretools_train", roles=data_train.roles
)
df_featuretools_test = getml.data.DataFrame.from_pandas(
    featuretools_test, name="featuretools_test", roles=data_train.roles
)

df_featuretools_train.set_role(
    df_featuretools_train.unused_names, getml.data.roles.numerical
)

df_featuretools_test.set_role(
    df_featuretools_test.unused_names, getml.data.roles.numerical
)

featuretools: Trying features...


  agg_primitives: ['all', 'any', 'entropy', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.


Selecting the best out of 122 features...
Time taken: 0h:3m:25.446083



  agg_primitives: ['all', 'any', 'entropy', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.


In [15]:
predictor = getml.predictors.XGBoostClassifier()

pipe_ft_pr = getml.pipeline.Pipeline(
    tags=["prediction", "featuretools"], predictors=[predictor]
)

pipe_ft_pr

In [16]:
pipe_ft_pr.check(df_featuretools_train)



Checking data model...


In [17]:
pipe_ft_pr.fit(df_featuretools_train)



Checking data model...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:0m:2.783537



In [18]:
pipe_ft_pr.score(df_featuretools_test)




| | date time | set used | target | accuracy | auc | cross entropy |
|---|---|---|---|---|---|---|
| 0 | 2021-05-14 16:47:20 | featuretools_train | Occupancy | 0.999 | 1.0 | 0.006113 |
| 1 | 2021-05-14 16:47:20 | featuretools_test | Occupancy | 0.9937 | 0.998 | 0.026789 |


### 2.3 Propositionalization with tsfresh

In [19]:
tsfresh_builder = TSFreshBuilder(
    num_features=200, memory=15, column_id="id", time_stamp="date", target="Occupancy"
)

tsfresh_train = tsfresh_builder.fit(dfs_pandas["train"])
tsfresh_test = tsfresh_builder.transform(dfs_pandas["test"])

df_tsfresh_train = getml.data.DataFrame.from_pandas(
    tsfresh_train, name="tsfresh_train", roles=data_train.roles
)
df_tsfresh_test = getml.data.DataFrame.from_pandas(
    tsfresh_test, name="tsfresh_test", roles=data_train.roles
)

df_tsfresh_train.set_role(df_tsfresh_train.unused_names, getml.data.roles.numerical)

df_tsfresh_test.set_role(df_tsfresh_test.unused_names, getml.data.roles.numerical)

Rolling: 100%|██████████| 20/20 [00:12<00:00,  1.61it/s]
Feature Extraction: 100%|██████████| 20/20 [00:11<00:00,  1.73it/s]
Feature Extraction: 100%|██████████| 20/20 [00:21<00:00,  1.07s/it]


Selecting the best out of 55 features...
Time taken: 0h:0m:49.4377



Rolling: 100%|██████████| 20/20 [00:16<00:00,  1.24it/s]
Feature Extraction: 100%|██████████| 20/20 [00:14<00:00,  1.36it/s]
Feature Extraction: 100%|██████████| 20/20 [00:26<00:00,  1.34s/it]


In [20]:
pipe_tsf_pr = getml.pipeline.Pipeline(
    tags=["prediction", "tsfresh"], predictors=[predictor]
)

pipe_tsf_pr

In [21]:
pipe_tsf_pr.check(df_tsfresh_train)



Checking data model...


In [22]:
pipe_tsf_pr.fit(df_tsfresh_train)



Checking data model...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:0m:1.837107



In [23]:
pipe_tsf_pr.score(df_tsfresh_test)




| | date time | set used | target | accuracy | auc | cross entropy |
|---|---|---|---|---|---|---|
| 0 | 2021-05-14 16:49:17 | tsfresh_train | Occupancy | 0.9991 | 1.0 | 0.00581 |
| 1 | 2021-05-14 16:49:17 | tsfresh_test | Occupancy | 0.9931 | 0.9983 | 0.033657 |


## 3. Comparison

In [26]:
num_features = dict(
    fastprop=194,
    featuretools=122,
    tsfresh=55,
)

runtime_per_feature = [
    fastprop_runtime / num_features['fastprop'],
    ft_builder.runtime / num_features['featuretools'],
    tsfresh_builder.runtime / num_features['tsfresh'],
]

speedup_per_feature = [r/runtime_per_feature[0] for r in runtime_per_feature]

comparison = pd.DataFrame(
    dict(
        runtime=[fastprop_runtime, ft_builder.runtime, tsfresh_builder.runtime],
        num_features=num_features.values(),
        runtime_per_feature=runtime_per_feature,
        speedup=[1, ft_builder.runtime/fastprop_runtime, tsfresh_builder.runtime/fastprop_runtime],
        speedup_per_feature=speedup_per_feature,
        auc=[pipe_fp_pr.auc, pipe_ft_pr.auc, pipe_tsf_pr.auc],
    )
)

comparison.index = ["getML: FastProp", "featuretools", "tsfresh"]

In [27]:
comparison

| | runtime | num_features | runtime_per_feature | speedup | speedup_per_feature | auc |
|---|---|---|---|---|---|---|
| getML: FastProp | 0 days 00:00:01.987577 | 194 | 0 days 00:00:00.010245 | 1.0 | 1.0 | 0.998594 |
| featuretools | 0 days 00:03:25.446083 | 122 | 0 days 00:00:01.683984 | 103.365094 | 164.371303 | 0.997984 |
| tsfresh | 0 days 00:00:49.437700 | 55 | 0 days 00:00:00.898867 | 24.873351 | 87.73714 | 0.998251 |
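
The `speedup` columns express how many times longer each library took than FastProp, in total and per generated feature. As a worked check of that arithmetic, the sketch below recomputes the ratios from the runtimes and feature counts in the comparison table using only the standard library. The dictionary layout is an assumption of this sketch; the numbers are copied from the table.

```python
from datetime import timedelta

# Runtimes and feature counts copied from the comparison table.
runtimes = {
    "fastprop": timedelta(seconds=1, microseconds=987577),
    "featuretools": timedelta(minutes=3, seconds=25, microseconds=446083),
    "tsfresh": timedelta(seconds=49, microseconds=437700),
}
num_features = {"fastprop": 194, "featuretools": 122, "tsfresh": 55}

# timedelta / int gives a timedelta, rounded to whole microseconds.
per_feature = {name: runtimes[name] / num_features[name] for name in runtimes}

# timedelta / timedelta gives a plain float ratio.
baseline = runtimes["fastprop"]
speedup = {name: runtimes[name] / baseline for name in runtimes}
speedup_per_feature = {
    name: per_feature[name] / per_feature["fastprop"] for name in runtimes
}

for name in runtimes:
    print(
        f"{name:>12}: {speedup[name]:7.2f}x total, "
        f"{speedup_per_feature[name]:7.2f}x per feature"
    )
```

The per-feature ratio is larger than the total ratio because FastProp also produced more features (194) than featuretools (122) or tsfresh (55) in less time.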


## 4. Conclusion

This tutorial demonstrates that relational learning is a powerful tool for time series. We are able to outperform the benchmarks from a scientific paper on a simple public domain time series data set with relatively little effort.

If you want to learn more about getML, check out the [official documentation](https://getml.com/product).

# Next Steps

This tutorial went through the basics of applying getML to multivariate time series and hyperparameter tuning.

If you are interested in further real-world applications of getML, head back to the [notebook overview](welcome.md) and choose one of the remaining examples.

Here is some additional material from our [documentation](https://docs.getml.com/latest/) if you want to learn more about getML:
* [Feature learning with Multirel](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#multirel)
* [Feature learning with Relboost](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relboost)

# Get in contact

If you have any questions, schedule a [call with Alex](https://go.getml.com/meetings/alexander-uhlig/getml-demo), the co-founder of getML, or write us an [email](mailto:team@getml.com). Prefer a private demo of getML? Just contact us to make an appointment.