# Machine Learning Models

Continuation of the notes on Aileen Nielsen's SciPy 2019 talk.

## Time series feature generation

There are no (none at all?) machine learning algorithms that have been developed specifically for time series data. Indeed, many ML algorithms have no notion of temporal order. Therefore, you need to transform your time series data into a representation that the ML algorithm in question does support. For example, decision trees are not built to handle temporal data.

You can handle this mismatch by generating suitable features from your time series data. For example, record the minimum and maximum values (and time stamps?), count the number of peaks and valleys in time windows, compute the mean and median over such windows.
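As a rough illustration of this kind of feature generation, the sketch below summarizes a series over non-overlapping windows. The function name `window_features`, the choice of non-overlapping windows, and the definition of a peak or valley as a strict local extremum are all assumptions made for this example; they are not taken from the talk.

```python
import numpy as np


def window_features(values, window):
    """Summarize each non-overlapping window of a 1-D series (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    rows = []
    for start in range(0, len(values) - window + 1, window):
        w = values[start:start + window]
        inner = w[1:-1]
        # Assumption: a peak (valley) is an interior point strictly above (below) both neighbors.
        n_peaks = int(np.sum((inner > w[:-2]) & (inner > w[2:])))
        n_valleys = int(np.sum((inner < w[:-2]) & (inner < w[2:])))
        rows.append({
            "start": start,
            "min": float(w.min()),
            "argmin": start + int(w.argmin()),  # position (time step) of the minimum
            "max": float(w.max()),
            "argmax": start + int(w.argmax()),  # position (time step) of the maximum
            "mean": float(w.mean()),
            "median": float(np.median(w)),
            "n_peaks": n_peaks,
            "n_valleys": n_valleys,
        })
    return rows


series = [0, 2, 1, 3, 0, 5, 4, 4, 6, 1]
features = window_features(series, window=5)
for row in features:
    print(row)
```

Each row is a fixed-length description of one window, which is the kind of tabular input that a tree-based model can consume.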

If you have many time series that span long periods of time and compute such features once every minute (or whatever the time step in your data is), then this can become computationally expensive.

There are canonical feature sets that have been developed for time series. According to Aileen, if you have a specific use case / domain, you should be able to do better than a canonical feature set using domain knowledge and analyzing your data. Time series features are discipline-specific. 

This may be true, but I am still wondering what to do when faced with very large multivariate time series data sets.

A few of the models being discussed:
* Tree ensembles, such as random forests and gradient-boosted trees (xgboost). In boosting, you learn your first decision tree and then learn the second one based on the errors of the first one, and so on. In practice, xgboost is said to perform well with time series classification.
* Clustering. Difficult both conceptually and due to computational costs. Need to be careful to pick a good distance metric (e.g., not Euclidean distance). Need to be careful that you really cluster time series with similar features. Put differently, you need to find features that distinguish well between different time series (domain knowledge?).
  * Clustering can be based on features that you identify in the time series (e.g., either by looking for the features from a canonical set or by writing custom code to look for relevant features in your domain). This can become computationally expensive. Think thousands of time series, each with tens of thousands of data points you need to analyze.
  * Clustering can also be done using a suitable distance metric. The one recommended for time series data is [Dynamic Time Warping](https://en.wikipedia.org/wiki/Dynamic_time_warping).
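To make the distance-metric point concrete, here is a minimal sketch of the textbook dynamic-programming formulation of Dynamic Time Warping, compared with Euclidean distance on two series that have the same shape but are shifted by one time step. The function name `dtw_distance` and the use of the absolute difference as the local cost are choices made for this example. A production implementation would typically add a warping-window constraint and be vectorized or compiled.

```python
import numpy as np


def dtw_distance(a, b):
    """Plain O(len(a) * len(b)) DTW with absolute difference as the local cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = cheapest alignment of a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance in a, advance in b, or advance in both.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


# Same bump, shifted by one time step.
x = [0, 1, 2, 1, 0, 0]
y = [0, 0, 1, 2, 1, 0]

euclidean = float(np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float)))
dtw = dtw_distance(x, y)
print("Euclidean:", euclidean)
print("DTW:", dtw)
```

Euclidean distance penalizes the shift point by point, while DTW can stretch the time axis to align the two bumps. The nested loop also shows where the cost comes from: every pairwise comparison is quadratic in the series length, which adds up quickly across thousands of long series.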


## Classification using trees

```python
import time

import cesium
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xgboost as xgb

# Example data sets and feature extraction for time series
from cesium import datasets
from cesium import featurize as ft

import scipy
from scipy.stats import pearsonr, spearmanr
from scipy.stats import skew

import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Use large square figures by default
plt.rcParams["figure.figsize"] = [10, 10]
```

In this environment, the cell fails with the following error:

```
ImportError: cannot import name 'reraise' from 'dask.compatibility' (/Users/brunow/anaconda3/envs/py37scipy/lib/python3.7/site-packages/dask/compatibility.py)
```

There are packages that can extract features from time series; `cesium` is one such package. See the [cesium repository](https://github.com/cesium-ml/cesium) for more details, including the list of features, such as the number of peaks, the index of the $i^{th}$ largest peak, the total number of observed values, the difference between the maximum and minimum time values, the mean of the time values, amplitude, skewness and many more.