# Sequence classification with Neural Networks: a primer
## Part 2: Basic Tree model

We'll start with decision trees, one of the most common traditional classification models.

These models have no memory and cannot accept sequences as input, so we flatten our sequential samples and treat the features at each timestamp as an individual training sample.

We then introduce outliers into our train samples with increasing probability and measure how the accuracy changes.
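The generator in `tmdprimer.datagen` is not reproduced here, so the following is only a sketch of how outlier injection could work. It assumes that an outlier is a train-labelled timestamp whose speed is replaced by a walk-like value; the `inject_outliers` helper and the speed range are hypothetical.

```python
import random


def inject_outliers(speeds, labels, outlier_prob, rng):
    """Replace each train speed (label 1) with a walk-like speed with probability outlier_prob."""
    noisy = []
    for speed, label in zip(speeds, labels):
        if label == 1 and rng.random() < outlier_prob:
            # Assumed walking range; the real generator may use other values.
            noisy.append(rng.uniform(3.0, 6.0))
        else:
            noisy.append(speed)
    return noisy


rng = random.Random(42)
speeds = [5.0, 60.0, 61.5, 4.7, 59.2]
labels = [0, 1, 1, 0, 1]
print(inject_outliers(speeds, labels, outlier_prob=0.5, rng=rng))
```

The labels stay unchanged, which is what makes these points hard: the feature value says "walk" while the label still says "train".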

In [2]:
import altair as alt
import pandas as pd

import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from tmdprimer.datagen import Dataset

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
data_trees = []
for outlier_prob in (0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    X, y = Dataset.generate(train_outlier_prob=outlier_prob).get_flat_X_y()
    clf = RandomForestClassifier(n_estimators=10, class_weight="balanced")
    clf.fit(X, y)
    X_test, y_test = Dataset.generate(train_outlier_prob=outlier_prob, n_samples=20).get_flat_X_y()
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    data_trees.append({'outlier_prob': outlier_prob, 'accuracy': acc})
df_trees = pd.DataFrame(data_trees)

In [6]:
alt.Chart(df_trees).mark_line().encode(x='outlier_prob', y='accuracy')

As expected, since outlier train speeds are indistinguishable from walk speeds, accuracy decreases linearly as the outlier probability increases.
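One way to see where the linear trend comes from is a small simulation with a fixed speed threshold standing in for the trained model. This is a sketch under stated assumptions, not the notebook's data generator: classes are assumed balanced, the speed ranges and the threshold of 30 are hypothetical, and every outlier is assumed to be predicted as walk. Under those assumptions the expected accuracy is roughly `1 - 0.5 * outlier_prob`.

```python
import random


def simulate_accuracy(outlier_prob, n=20000, seed=0):
    """Estimate the accuracy of a fixed-threshold classifier on synthetic speeds."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        label = rng.randint(0, 1)  # 0 = walk, 1 = train; balanced classes (assumption)
        if label == 1 and rng.random() >= outlier_prob:
            speed = rng.uniform(50.0, 70.0)  # regular train speed (hypothetical range)
        else:
            speed = rng.uniform(3.0, 6.0)  # walk speed, or a train outlier that looks like walking
        predicted = 1 if speed > 30.0 else 0  # fixed threshold in place of a fitted tree
        correct += predicted == label
    return correct / n


for p in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(p, round(simulate_accuracy(p), 3))
```

Walk timestamps are always classified correctly, while a train timestamp is only correct when it is not an outlier, so each increase in the outlier probability removes a proportional share of correct predictions.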

Honestly, for such simple univariate data we don't even need the ensemble classifier; a single decision tree would do just as well. We keep it here for reference, though, as this is the model you are probably going to use in production.
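To illustrate why one tree is enough for a single feature: a depth-1 tree on univariate data reduces to one threshold. The sketch below fits such a "stump" by counting misclassifications at each candidate split. Note that this is a simplification: scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default, and the `fit_stump` helper and toy data are hypothetical.

```python
def fit_stump(speeds, labels):
    """Return the threshold that best separates label 0 (below) from label 1 (above)."""
    values = sorted(set(speeds))
    # Candidate splits are the midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(labels) + 1
    for threshold in candidates:
        errors = sum((s > threshold) != bool(l) for s, l in zip(speeds, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


speeds = [4.5, 5.0, 5.5, 58.0, 60.0, 63.0]
labels = [0, 0, 0, 1, 1, 1]
threshold = fit_stump(speeds, labels)
print("split at speed >", threshold)
```

An ensemble of such trees would average many near-identical thresholds here, which is why it adds little on data this simple.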

Let's plot the same graph using a single decision tree model.

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
data_tree = []
for outlier_prob in (0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    X, y = Dataset.generate(train_outlier_prob=outlier_prob).get_flat_X_y()
    clf = DecisionTreeClassifier(max_depth=2)
    clf.fit(X, y)
    X_test, y_test = Dataset.generate(train_outlier_prob=outlier_prob, n_samples=20).get_flat_X_y()
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    # Append to data_tree (not data_trees) so the random forest results are not mixed in.
    data_tree.append({'outlier_prob': outlier_prob, 'accuracy': acc})
df_tree = pd.DataFrame(data_tree)

In [4]:
alt.Chart(df_tree).mark_line().encode(x='outlier_prob', y='accuracy')

In the next notebook, we'll see how to use NN models on sequence data to improve classification performance with outliers.