# HMM (Hidden Markov Model) Filter Demo

This notebook demonstrates how to use the HMM filter package with scikit-learn on a toy dataset generated in `dataset.ipynb`:

* Step 1: Train a scikit-learn `RandomForestClassifier` on the training dataset
* Step 2: Predict labels for the unlabeled dataset using the trained random forest
* Step 3: Estimate the HMM state transition matrix from the predicted labels of the unlabeled dataset
* Step 4: Estimate class probability distributions for the test dataset using the trained random forest
* Step 5: Predict the most likely sequence of states for each session in the test dataset using the HMM filter

[HMMs](https://en.wikipedia.org/wiki/Hidden_Markov_model) are defined by hidden states, state transition probabilities, possible observations and their emission probabilities. In our problem, the HMM parameters are the following:

* Hidden states are drawn from the categorical distribution of the classification class labels. E.g., `"0:0"`
* State transition probabilities are represented by the state transition matrix. E.g., `P("0:0" -> "0:1") = 0.2`
* Possible observations are drawn from the categorical distribution of the classification class labels. E.g., `"0:0"`
* Emission probabilities are estimated by the prediction probability estimates. E.g., `{"0:0": 0.4, "0:1": 0.6}`

The HMM filter uses the [Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm) to revise the predictions according to their uncertainty and the state transition matrix estimated from unlabeled data. For example, it might revise the sequence of predictions `["0:0", "1:1", "0:0"]` to `["0:0", "0:0", "0:0"]`, since remaining in the same cell is more likely according to the transition matrix and the classifier was uncertain about the correct label in the second position (e.g., `{"0:0": 0.4, "1:1": 0.6}`).
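To make the revision step concrete, here is a minimal Viterbi sketch in plain Python. It is not the `hmm_filter` implementation: the function name `viterbi`, the `floor` smoothing value, the uniform prior over the first state, and all numbers in the usage example are assumptions chosen for illustration. As in this notebook, the classifier's class probabilities are used as emission scores.

```python
import math


def viterbi(emissions, transitions, floor=1e-6):
    """Return the most likely state sequence for one session.

    emissions:   list of dicts, one per time step, {state: probability}
    transitions: dict keyed by (from_state, to_state), value is a probability
    floor:       hypothetical smoothing value used for missing entries,
                 so that log(0) never occurs
    """
    if not emissions:
        return []
    states = sorted({s for step in emissions for s in step})

    def log_t(prev, cur):
        return math.log(transitions.get((prev, cur), floor))

    def log_e(step, state):
        return math.log(step.get(state, floor))

    # Log-score of the best path ending in each state at the first step
    # (uniform prior over initial states, so the prior term is omitted).
    score = {s: log_e(emissions[0], s) for s in states}
    backpointers = []

    for step in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in states:
            best_prev = max(states, key=lambda prev: score[prev] + log_t(prev, cur))
            pointers[cur] = best_prev
            new_score[cur] = score[best_prev] + log_t(best_prev, cur) + log_e(step, cur)
        score = new_score
        backpointers.append(pointers)

    # Backtrack from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Illustrative (made-up) transition matrix and classifier probabilities
transitions = {
    ("0:0", "0:0"): 0.8, ("0:0", "1:1"): 0.2,
    ("1:1", "0:0"): 0.2, ("1:1", "1:1"): 0.8,
}
emissions = [
    {"0:0": 0.9, "1:1": 0.1},
    {"0:0": 0.4, "1:1": 0.6},  # uncertain prediction
    {"0:0": 0.9, "1:1": 0.1},
]

raw = [max(step, key=step.get) for step in emissions]  # per-step argmax
filtered = viterbi(emissions, transitions)
print(raw)
print(filtered)
```

With these numbers the per-step argmax yields `["0:0", "1:1", "0:0"]`, while the filtered path stays in `"0:0"` throughout, because two unlikely transitions cost more than overriding one uncertain prediction.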


Accuracy is used as the evaluation metric.

In [1]:
# Imports

# Disable future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from os import cpu_count
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from hmm_filter.hmm_filter import HMMFilter


In [2]:
# Configuration

# Training dataset, used to train a classifier to predict column true_class based on the x,y features
train_pathname = "data/train.csv"

# Test dataset, used to evaluate the predictions
test_pathname = "data/test.csv"

# Dataset of measurements without known true_class label, used to estimate the HMM transition matrix
unlabeled_pathname = "data/unlabeled.csv"


## Load datasets


In [3]:
# Load input datasets
train_dataset = pd.read_csv(train_pathname)
unlabeled_dataset = pd.read_csv(unlabeled_pathname)
test_dataset = pd.read_csv(test_pathname)

# Sort rows in ascending timestamp order (already sorted, but included here to stress that rows need to be ordered)
unlabeled_dataset.sort_values("timestamp", ascending=True, inplace=True)
test_dataset.sort_values("timestamp", ascending=True, inplace=True)


In [4]:
train_dataset.head()

Unnamed: 0,session_id,timestamp,x_sample,y_sample,true_class
0,0,0,2.07,1.81,2:2
1,0,1,1.84,1.9,2:2
2,0,2,1.98,1.98,2:2
3,0,3,1.99,1.99,2:2
4,0,4,2.14,2.08,2:2


In [5]:
test_dataset.head()

Unnamed: 0,session_id,timestamp,x_sample,y_sample,true_class
0,0,20000,3.93,0.37,4:0
1,0,20001,3.77,0.38,4:0
2,0,20002,3.94,0.45,4:0
3,0,20003,3.71,0.42,4:0
4,0,20004,3.85,0.19,4:0


In [6]:
unlabeled_dataset.head()

Unnamed: 0,session_id,timestamp,x_sample,y_sample
0,0,40000,3.98,3.22
1,0,40001,4.01,3.2
2,0,40002,4.02,3.1
3,0,40003,3.87,3.32
4,0,40004,3.85,3.24


In [7]:
len(train_dataset), len(test_dataset), len(unlabeled_dataset)

(20000, 20000, 1000000)

## Extract features and labels

In [8]:
# Prepare features and labels

# training dataset
X_train = train_dataset[["x_sample", "y_sample"]].values
y_train = train_dataset["true_class"].values

# test dataset
X_test = test_dataset[["x_sample", "y_sample"]].values
y_test = test_dataset["true_class"].values

# unlabeled dataset
X_unlabeled = unlabeled_dataset[["x_sample", "y_sample"]].values


## Step 1: Train random forest classifier

In [9]:
# cross-validate random forest model on training dataset
clf = RandomForestClassifier(n_jobs=cpu_count())

# print average accuracy across all folds
avg_accuracy = np.mean(cross_validate(clf, X_train, y_train, cv=2, scoring="accuracy")["test_score"])
avg_accuracy

0.7945002684801719

In [10]:
# Instantiate random forest classifier and fit to training data
clf = RandomForestClassifier(n_jobs=cpu_count())
clf.fit(X_train, y_train)


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [11]:
# Evaluation of trained random forest on test dataset
test_dataset["prediction_rf"] = clf.predict(X_test)

# Evaluate accuracy of predictions
rf_accuracy = len(test_dataset[test_dataset.true_class == test_dataset.prediction_rf]) / len(test_dataset)
rf_accuracy

0.80715
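The accuracy used throughout this notebook is the fraction of rows whose predicted label equals the true label. A minimal sketch of that computation on plain lists, where the helper name `accuracy` and the sample labels are made up for illustration:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sequence")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Illustrative labels: three of four predictions are correct
y_true_demo = ["4:0", "4:0", "4:1", "4:1"]
y_pred_demo = ["4:0", "4:1", "4:1", "4:1"]
demo_accuracy = accuracy(y_true_demo, y_pred_demo)
print(demo_accuracy)
```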

## Step 2: Predict labels for unlabeled dataset

In [12]:
# predict classes for unlabeled dataset
unlabeled_dataset["prediction_rf"] = clf.predict(X_unlabeled)


## Step 3: Estimate HMM state transition matrix

In [13]:
# train HMM filter by estimating the state transition matrix
hmmfilter = HMMFilter()
hmmfilter.fit(unlabeled_dataset, session_column="session_id", prediction_column="prediction_rf")


In [14]:
# what's the probability of remaining in the same cell?
hmmfilter.A[('2:2', '2:2')]

0.7159761363261808

In [15]:
# what's the probability of moving to a neighboring cell along one axis (one jump)?
hmmfilter.A[('2:2', '2:3')]

0.06343320478591911

In [16]:
# what's the probability of moving diagonally, one cell along each axis (two jumps)?
hmmfilter.A[('2:2', '3:3')]

0.007960051418965688

The estimated probabilities are intuitively what one would expect: it is likely to remain in the same cell, less likely to jump to a neighboring cell in one direction, and even less likely to jump along both axes to a neighboring cell.
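One common way to estimate such a transition matrix is to count consecutive label pairs within each session and normalize by the number of transitions leaving each state. The sketch below illustrates that idea only. Whether `HMMFilter.fit` counts, normalizes, or smooths in exactly this way is an assumption, and the function name `estimate_transitions` and the sample sessions are hypothetical. The result is keyed by `(from_state, to_state)` tuples, mirroring how `hmmfilter.A` is indexed in this notebook.

```python
from collections import Counter


def estimate_transitions(sessions):
    """Estimate P(to_state | from_state) from label sequences.

    sessions: iterable of label lists, one list per session,
              each already sorted by timestamp
    """
    pair_counts = Counter()  # (from_state, to_state) -> number of occurrences
    from_counts = Counter()  # from_state -> number of outgoing transitions
    for labels in sessions:
        # Pairs never span two sessions, since each session is zipped on its own.
        for prev, cur in zip(labels, labels[1:]):
            pair_counts[(prev, cur)] += 1
            from_counts[prev] += 1
    return {pair: n / from_counts[pair[0]] for pair, n in pair_counts.items()}


# Illustrative (made-up) predicted labels for two sessions
sessions = [
    ["2:2", "2:2", "2:2", "2:3", "2:3"],
    ["2:2", "2:2", "3:3", "3:3"],
]
A = estimate_transitions(sessions)
print(A[("2:2", "2:2")], A[("2:2", "2:3")], A[("2:2", "3:3")])
```

Row ordering matters here: shuffling the rows of a session would change which pairs are counted, which is why the datasets are sorted by timestamp before fitting.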

## Step 4: Estimate class probability distributions

In [17]:
# estimate random forest class probabilities, one dict per row: {class_label: probability}
d = pd.DataFrame.from_records(clf.predict_proba(X_test), columns=clf.classes_).to_dict(orient="records")
# keep only the classes with non-zero probability
test_dataset["probabs"] = [{k: v for k, v in r.items() if v > 0} for r in d]
test_dataset.head()

Unnamed: 0,session_id,timestamp,x_sample,y_sample,true_class,prediction_rf,probabs
0,0,20000,3.93,0.37,4:0,4:0,"{'4:0': 0.8, '4:1': 0.2}"
1,0,20001,3.77,0.38,4:0,4:0,"{'4:0': 0.725, '4:1': 0.275}"
2,0,20002,3.94,0.45,4:0,4:0,"{'4:0': 0.9, '4:1': 0.1}"
3,0,20003,3.71,0.42,4:0,4:0,{'4:0': 1.0}
4,0,20004,3.85,0.19,4:0,4:0,{'4:0': 1.0}


Column `probabs` reflects the prediction uncertainty of the random forest classifier. In the first row, the classifier assigns probability `0.8` to class label `"4:0"` and probability `0.2` to `"4:1"`.
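The shape of a `probabs` entry can be illustrated without pandas or a trained model. In the sketch below, `classes` and `proba_rows` are made-up stand-ins for `clf.classes_` and the array returned by `clf.predict_proba`. Each row becomes a dict with zero-probability classes dropped.

```python
# Hypothetical stand-ins for clf.classes_ and clf.predict_proba(X_test)
classes = ["4:0", "4:1", "4:2"]
proba_rows = [
    [0.8, 0.2, 0.0],
    [1.0, 0.0, 0.0],
]

# One dict per row, keeping only classes with non-zero probability
probabs = [
    {label: p for label, p in zip(classes, row) if p > 0}
    for row in proba_rows
]

# The plain classifier prediction is the most probable label in each dict
predictions = [max(d, key=d.get) for d in probabs]

print(probabs)
print(predictions)
```

Dropping zero entries keeps the dicts small when there are many classes; the filter then only needs to consider states the classifier considers possible at each step.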

## Step 5: Predict most likely sequence of states using HMM filter

In [18]:
# We can now combine the transition matrix (probability of jumping between
# any pair of class labels) with the probabilistic class predictions.
# Since the dataset is split by session ID, and each session is processed
# in parallel, a new dataframe is returned.
df = hmmfilter.predict(test_dataset, session_column='session_id', probabs_column="probabs", prediction_column='prediction')


In [19]:
# Evaluate accuracy of predictions
hmm_accuracy = len(df[df.true_class == df.prediction]) / len(df)
hmm_accuracy

0.8751

## Comparison of accuracy results

In [20]:
print("Accuracy of the Random Forest classifier (RF) is {:.2f}%".format(rf_accuracy * 100))
print("Accuracy of the HMM filter applied on the Random Forest classifier (HMM-RF) is {:.2f}%".format(hmm_accuracy * 100))
print("HMM filter contribution to accuracy is {:.2f}%".format((hmm_accuracy - rf_accuracy) * 100))

Accuracy of the Random Forest classifier (RF) is 80.72%
Accuracy of the HMM filter applied on the Random Forest classifier (HMM-RF) is 87.51%
HMM filter contribution to accuracy is 6.79%


## Conclusions

The HMM filter increased accuracy by 6.79 percentage points (absolute). Varying the properties of the synthetic dataset, we obtained an accuracy increase ranging between zero and 10%, depending on how much data is used to train the classifier and to estimate the transition matrix, and on the noise standard deviation. In general, the HMM filter is rather robust: either it improves accuracy or accuracy remains the same.
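Since the gain is reported as an absolute difference, it can help to distinguish it from the relative improvement over the baseline. The short sketch below reuses the two accuracies measured in this notebook; the variable names `absolute_gain` and `relative_gain` are introduced here for illustration.

```python
# Accuracies measured earlier in this notebook
rf_accuracy = 0.80715
hmm_accuracy = 0.8751

# Absolute gain, in percentage points
absolute_gain = (hmm_accuracy - rf_accuracy) * 100

# Relative gain, as a percentage of the random forest baseline
relative_gain = (hmm_accuracy - rf_accuracy) / rf_accuracy * 100

print("absolute gain: {:.2f} percentage points".format(absolute_gain))
print("relative gain: {:.2f}% of the baseline accuracy".format(relative_gain))
```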