This tutorial walks through the steps to detect anomalies in time-series data using [Open Anomaly Detection](https://github.com/algorithmia-algorithms/OpenAnomalyDetection), [Open Forecast](https://github.com/algorithmia-algorithms/OpenForecast), and the [M4 competition](https://www.mcompetitions.unic.ac.cy/) dataset.

First, let's download the M4 hourly training set and parse it as CSV.

In [None]:
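import requests
import csv

# Download the hourly training set; fail early on an HTTP error instead of parsing an error page.
raw_data = requests.get('https://github.com/M4Competition/M4-methods/raw/master/Dataset/Train/Hourly-train.csv')
raw_data.raise_for_status()

# Parse the CSV text into a list of rows (each row is a list of strings).
reader = csv.reader(raw_data.text.splitlines())
data = list(reader)
print(len(data))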

The data is not laid out the way you might expect for a CSV: each row holds one time series and each column holds one time step, so we will need to transpose it at some point. But first, let's get it into a proper NumPy array.
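Before doing that, here is a small dependency-free sketch of the layout. The CSV text below is made up for illustration (it is not M4 data) and only mimics the shape described above: a header row, an ID column, and a shorter series padded with an empty cell. It drops the header row and the ID column, then transposes with `zip` so that each row becomes one time step.

```python
import csv

# Hypothetical miniature file in the same row-per-series layout (not real M4 data).
sample_text = '''"V1","V2","V3","V4"
"H1","10","11","12"
"H2","20","21",""
'''

rows = list(csv.reader(sample_text.splitlines()))

# Drop the header row and the ID column so only the values remain.
series = [row[1:] for row in rows[1:]]

# Transpose: each tuple now holds one time step across all series.
time_major = list(zip(*series))

print(series)
print(time_major)
```

Note that the shorter series still carries an empty string after transposing, which is why the values need to be trimmed before they are stacked.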

In [None]:
import numpy as np
import pandas as pd
def trim_to_first_nan(variable):
    r"""
    Convert `variable` to numeric values with `pandas`, turning anything non-numeric
    (missing values or invalid entries) into NaN.
    If a NaN is found, trim the sequence so that it ends at the last value before the first NaN.
    """

    variable = pd.to_numeric(variable, errors='coerce')
    nans = np.isnan(variable)
    if nans.any():
        # np.argmax on a boolean array returns the index of the first True value.
        first_nan_index = np.argmax(nans)
        output = variable[:first_nan_index]
    else:
        output = variable
    return output
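As a quick illustration of the trimming rule, here is a dependency-free sketch of the same idea. `trim_at_first_invalid` is a hypothetical helper written for this example; it uses plain `float()` parsing instead of `pandas`, so edge cases may differ from `pd.to_numeric`.

```python
import math

def trim_at_first_invalid(values):
    """Return the leading values that parse as numbers, stopping at the first invalid one."""
    trimmed = []
    for value in values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            break  # non-numeric entry, e.g. an empty padding cell
        if math.isnan(number):
            break  # an explicit NaN also ends the series
        trimmed.append(number)
    return trimmed

# Made-up values: a series padded with empty cells, and a complete series.
print(trim_at_first_invalid(['605.0', '586.0', '', '']))
print(trim_at_first_invalid(['1', '2', '3']))
```

Everything after the first invalid entry is discarded, even if later entries are valid numbers, which matches the "trim at the first NaN" behaviour described in the docstring above.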

In [None]:
r"""
* We only look at the first "max_vars" variables and keep those with at least "length" valid values.
Variables longer than "length" are truncated to "length" to keep the formatted dataset trim.
* And finally, for our demo we only select the first variable as the 'key_variable'; you can change this
as desired.
"""

max_vars = 5
length = 500

in_tensor = np.asarray(data)[1:, 1:]
out_tensor = []
for i in range(max_vars):
    variable = in_tensor[i, :]
    var_data = trim_to_first_nan(variable)
    if var_data.shape[0] >= length:
        var_data = var_data[0:length]
        out_tensor.append(var_data)
if len(out_tensor) == 0:
    raise Exception('the requested sequence length is too long for your data, please select a smaller number.')
else:
    out_tensor = np.stack(out_tensor, axis=1)
serializable_tensor = out_tensor.tolist()
ingestable_input = {'tensor': serializable_tensor}
print(out_tensor.shape)
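The printed shape has one row per time step and one column per kept variable. The sketch below reproduces that selection logic on made-up numbers with plain Python lists, so the resulting layout and JSON payload are easy to inspect. `build_payload` is a hypothetical helper for illustration only and is not part of the tutorial's pipeline.

```python
import json

def build_payload(series_list, max_vars, length):
    """Keep long-enough series among the first `max_vars`, truncate them, and transpose."""
    kept = []
    for series in series_list[:max_vars]:
        if len(series) >= length:
            kept.append(series[:length])
    if not kept:
        raise ValueError('no series has at least %d points' % length)
    # Transpose so rows are time steps and columns are variables.
    tensor = [list(step) for step in zip(*kept)]
    return {'tensor': tensor}

# Made-up series of different lengths.
toy_series = [
    [1.0, 2.0, 3.0, 4.0],  # longer than needed, truncated to 3 points
    [5.0, 6.0],            # too short, dropped
    [7.0, 8.0, 9.0],       # exactly long enough, kept as-is
]
payload = build_payload(toy_series, max_vars=3, length=3)
print(payload)
print(json.dumps(payload))
```

Because short series are dropped rather than padded, the number of columns can be smaller than `max_vars`.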

Great, our tensor is formatted and ready to go. Let's serialize it to a file so we can use Algorithmia to train a model with it.

In [None]:
algorithmia_api_key = input("What's your Algorithmia API key? ")

In [None]:
import Algorithmia
client = Algorithmia.client(algorithmia_api_key)

client.file('data://.my/example_collection/m4-hourly-data.json').putJson(ingestable_input)

With that done, let's create a forecasting model using the Open Forecast algorithm on Algorithmia.

In [None]:
forecast_algorithm = client.algo('algo://timeseries/openforecast/1.1.0')
forecast_input = {
    'mode': 'train',
    'data_path': 'data://.my/example_collection/m4-hourly-data.json',
    'model_output_path': 'data://.my/example_collection/m4-hourly-model_0.1.0.zip',
    'training_time': 100,
    'model_complexity': 0.65,
    'forecast_length': 5
}

# Uncomment the next two lines to train the model.
#result = forecast_algorithm.pipe(forecast_input).result
#print(result)

Since the above takes roughly 100 seconds or more to complete, let's skip that step and use a model that we've already trained.

In [None]:
model_path = 'data://.my/example_collection/m4-hourly-model_0.1.0.zip'
data_path = 'data://.my/example_collection/m4-hourly-data.json'
output_graph_path = 'data://.my/example_collection/graph_path.png'

anomaly_algorithm = client.algo('algo://timeseries/openanomalydetection/1.0.0')
anom_input = {
    'data_path': data_path,
    'model_input_path': model_path,
    'graph_save_path': output_graph_path,
    'sigma_threshold': 3,
    'variable_index': 0,
    'calibration_percentage': 0.1
}

# .pipe() returns a response object; .result holds the algorithm's actual output.
anom_result = anomaly_algorithm.pipe(anom_input).result

print(anom_result)
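This tutorial does not spell out how `sigma_threshold` and `calibration_percentage` are used internally, so the following is only a sketch of one common approach. It is an assumption, not a description of Open Anomaly Detection: estimate the mean and standard deviation of the forecast errors on an initial calibration slice, then flag any later point whose error lies more than `sigma_threshold` standard deviations from that mean. The function name and the toy error values are made up for illustration.

```python
import statistics

def flag_anomalies(errors, sigma_threshold=3, calibration_percentage=0.1):
    """Hypothetical sigma-rule detector; not the Open Anomaly Detection implementation.

    Needs at least two error values so a standard deviation can be computed.
    """
    calibration_size = max(2, int(len(errors) * calibration_percentage))
    calibration = errors[:calibration_size]
    mean = statistics.mean(calibration)
    std = statistics.stdev(calibration)
    flagged = []
    # Only points after the calibration slice are tested.
    for index in range(calibration_size, len(errors)):
        if abs(errors[index] - mean) > sigma_threshold * std:
            flagged.append(index)
    return flagged

# Made-up forecast errors: small noise with one obvious spike at index 15.
toy_errors = [0.1, -0.2, 0.15, -0.1, 0.05, 0.0, -0.05, 0.2, -0.15, 0.1,
              0.05, -0.1, 0.1, 0.0, -0.2, 5.0, 0.1, -0.05, 0.15, -0.1]
print(flag_anomalies(toy_errors, sigma_threshold=3, calibration_percentage=0.5))
```

Under this reading, a larger `sigma_threshold` flags fewer points, and a larger `calibration_percentage` gives a steadier estimate of normal behaviour at the cost of leaving fewer points to test.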


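Sweet! Finally, let's fetch our graph from the data collection and take a look at it in the notebook.

In [None]:
from IPython.display import Image
# The graph was saved to the Algorithmia data collection, so fetch its bytes before displaying it.
graph_bytes = client.file(output_graph_path).getBytes()
Image(data=graph_bytes, format='png')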