# Identifying Users Using Topological Data Analysis

## 1. Introduction

In this notebook we calculate persistence diagrams for every subject in the
study and then try to identify users from the bottleneck distances between
their persistence diagrams.

## 2. Approach

We will train a machine learning model to identify users based on the
bottleneck distances between the persistence diagrams of their typing data and
check its accuracy. Then we will introduce a new sample of typing data and
check how accurately the model matches the new sample to a user.

### 2.1. Setup

First we load the typing data:

In [1]:
import gudhi                 as gd
import pandas                as pd
import matplotlib.pyplot     as plt
import numpy                 as np
import gudhi.representations as gdrep
from sklearn.preprocessing   import MinMaxScaler
from sklearn.pipeline        import Pipeline
from sklearn.svm             import SVC
from sklearn.neighbors       import KNeighborsClassifier
from sklearn.ensemble        import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

plt.rcParams['font.size'] = 16
plt.rcParams['font.family'] = 'serif'

strong_password_data_frame = pd.read_csv('data/DSL-StrongPasswordData.csv',
                                   # declare type of 'subject' column
                                   dtype = {'subject' : 'string'},
                                   index_col = ['subject', 'sessionIndex', 'rep'])

As in the previous notebook, we separate each subject's typing data and store
it in a list of DataFrames, where each entry contains all the typing data for
one user:

In [2]:
def subjects_in_range(start, stop):
    """Returns a list of labels for subjects in the subject column.

    :param start: integer between 2 and 57, inclusive
    :param stop: integer between 2 and 57, inclusive. Should be greater than or
                 equal to start.
    :returns: list of zero-padded subject labels beginning with s{start} to s{stop}
    """
    return [f's{i:03}' for i in range(start, 1 + stop) if i not in [6, 9, 14, 23, 45]]

people = [strong_password_data_frame.loc[subject] for subject in subjects_in_range(2,57)]
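As a quick sanity check on the labels, the helper can be run on its own. The sketch below repeats the notebook's `subjects_in_range` so that it runs without the CSV file; the name `MISSING` is introduced here only for illustration.

```python
# Subject numbers skipped by the helper, copied from the cell above.
MISSING = [6, 9, 14, 23, 45]

def subjects_in_range(start, stop):
    """Same logic as the notebook's helper, repeated so this cell runs alone."""
    return [f's{i:03}' for i in range(start, 1 + stop) if i not in MISSING]

labels = subjects_in_range(2, 57)
print(len(labels))   # 51
print(labels[:5])    # ['s002', 's003', 's004', 's005', 's007']
```

The range 2 to 57 covers 56 numbers, and five of them are skipped, which leaves the 51 subjects used in the rest of the notebook.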

### 2.2. Calculating Persistence Diagrams for Training

Now we use the first six sessions to calculate three persistence diagrams for
each user, one per pair of consecutive sessions. We will only be using the
first homology since it appears to be the best choice for identifying a user.

In [3]:
dimension = 1
last_person = 57
max_edge_length = 3.0
max_dimension = 2

In [4]:
train_diagrams_for_person = []

train_labels = []
label_idx = 0

for person, name in zip(people, subjects_in_range(2,last_person)):
    diagrams = []
    for n in range(3):
        train_labels.append(label_idx)
        points = person.loc[2*n + 1 : 2*(n+1)]
        simplicial_complex = gd.RipsComplex(points = points.to_numpy(),
                                            max_edge_length=max_edge_length)
        simplex_tree = simplicial_complex.create_simplex_tree(max_dimension = max_dimension)
        # persistence() must run before intervals can be queried; its return value is not needed
        simplex_tree.persistence()
        diagrams.append(simplex_tree.persistence_intervals_in_dimension(dimension))

    train_diagrams_for_person.append(diagrams)
    label_idx = label_idx + 1
    # print(f'Training diagrams for {name} complete.')
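The slice `person.loc[2*n + 1 : 2*(n+1)]` is easy to misread, so here is a minimal sketch of which sessions each training diagram covers and how the labels line up. It relies on pandas label slices being inclusive at both ends; `session_window` and `n_people` are hypothetical names used only for this illustration.

```python
def session_window(n):
    """First and last session label (both inclusive) used for diagram n."""
    return 2 * n + 1, 2 * (n + 1)

windows = [session_window(n) for n in range(3)]
print(windows)  # [(1, 2), (3, 4), (5, 6)]

# Each person contributes three diagrams, so each label appears three times.
n_people = 2  # hypothetical small value; the notebook uses 51 subjects
train_labels = [person_idx for person_idx in range(n_people) for _ in range(3)]
print(train_labels)  # [0, 0, 0, 1, 1, 1]
```

Sessions 7 and 8 are left out of these windows, which is what keeps them available as unseen test data.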

### 2.3. Calculating Persistence Diagrams for Testing

Similarly, we use the last two sessions to calculate a persistence diagram for each user.

In [5]:
test_diagrams_for_person = []
test_labels = []
label_idx = 0

for person, name in zip(people, subjects_in_range(2,last_person)):
    diagrams = []
    test_labels.append(label_idx)
    points = person.loc[7:8] # sessions 7 and 8 (label slices are inclusive)
    simplicial_complex = gd.RipsComplex(points = points.to_numpy(),
                                        max_edge_length=max_edge_length)
    simplex_tree = simplicial_complex.create_simplex_tree(max_dimension = max_dimension)
    # persistence() must run before intervals can be queried; its return value is not needed
    simplex_tree.persistence()
    diagrams.append(simplex_tree.persistence_intervals_in_dimension(dimension))

    test_diagrams_for_person.append(diagrams)
    label_idx = label_idx + 1
    # print(f'Test diagrams for {name} complete.')

In [6]:
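# Flatten to one entry per diagram. The diagrams have different numbers of
# points, so a plain list avoids building a ragged NumPy array.
training_data = [diagram for diagrams in train_diagrams_for_person for diagram in diagrams]
test_data = [diagram for diagrams in test_diagrams_for_person for diagram in diagrams]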

## 3. Results and Analysis

Using the bottleneck distance with a k-nearest-neighbors classifier, our model
correctly matched typing data it had already seen to its user with about 49%
accuracy. However, when we introduced new typing data, its accuracy dropped to
about 2%.

In [7]:
pipe = Pipeline([
                 ("TDA",       gd.representations.BottleneckDistance(0.001)),
                 ("Estimator", KNeighborsClassifier(n_neighbors=2, metric="precomputed"))
])
model = pipe.fit(training_data, train_labels)
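To make the distance used in this pipeline concrete, here is a minimal brute-force sketch of the bottleneck distance between two tiny diagrams. It is not how GUDHI computes it: the library uses an efficient algorithm, and the `0.001` passed above appears to be its approximation tolerance. The names `DIAGONAL`, `match_cost`, and `bottleneck_brute_force` are hypothetical, and the factorial running time makes this usable only for a handful of points.

```python
from itertools import permutations

DIAGONAL = None  # placeholder meaning "matched to the diagonal"

def match_cost(p, q):
    """L-infinity cost of pairing p with q; either may be the diagonal."""
    if p is DIAGONAL and q is DIAGONAL:
        return 0.0
    if p is DIAGONAL:
        return (q[1] - q[0]) / 2
    if q is DIAGONAL:
        return (p[1] - p[0]) / 2
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def bottleneck_brute_force(diag_a, diag_b):
    """Exact bottleneck distance by trying every matching (tiny inputs only)."""
    # Pad each side with one diagonal slot per point on the other side,
    # so that every point has the option of being matched to the diagonal.
    a = list(diag_a) + [DIAGONAL] * len(diag_b)
    b = list(diag_b) + [DIAGONAL] * len(diag_a)
    if not a:
        return 0.0
    # The distance is the smallest achievable "worst pair" cost.
    return min(
        max(match_cost(p, q) for p, q in zip(a, perm))
        for perm in permutations(b)
    )

diag_a = [(0.0, 1.0), (0.25, 0.5)]  # (birth, death) pairs
diag_b = [(0.0, 0.75)]
print(bottleneck_brute_force(diag_a, diag_b))  # 0.25
```

In the best matching, the long-lived point `(0.0, 1.0)` pairs with `(0.0, 0.75)` at cost 0.25, and the short-lived point `(0.25, 0.5)` goes to the diagonal at cost 0.125. Only the largest cost counts, so short-lived features have little influence on the result.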

In [8]:
model.score(training_data, train_labels)

0.49019607843137253

In [9]:
model.score(test_data, test_labels)

0.0196078431372549
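The two scores are easier to interpret as counts. The sketch below assumes 51 subjects with three training diagrams and one test diagram each, as the loops in Section 2 produce; `correct_count` is a hypothetical helper.

```python
from fractions import Fraction

n_subjects = 51
train_samples = 3 * n_subjects  # three training diagrams per subject
test_samples = n_subjects       # one test diagram per subject

def correct_count(score, n_samples):
    """Number of correct predictions implied by an accuracy score."""
    return round(score * n_samples)

train_correct = correct_count(0.49019607843137253, train_samples)
test_correct = correct_count(0.0196078431372549, test_samples)
chance = Fraction(1, n_subjects)  # expected accuracy of uniform random guessing

print(train_correct, "of", train_samples)  # 75 of 153
print(test_correct, "of", test_samples)    # 1 of 51
print(chance)                              # 1/51
```

Under these assumptions the test score corresponds to a single correct prediction out of 51, which is the same as the expected accuracy of guessing a subject uniformly at random.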

## 4. Discussion and Future Work

Our machine learning experience is very limited, so we might have achieved
better results with different parameters. We were also limited by the amount
of session data, our hardware, our time constraints, and other outside
factors.

There are other tools in Topological Data Analysis besides the bottleneck
distance that could lead to better results. For example, the persistence
landscape gave us 100% accuracy on typing data it had already seen and about
6% accuracy on new typing data in the run shown below. The random forest is
not seeded, so the exact figure varies between runs.

In [10]:
# The "TDA" and "Estimator" steps are placeholders; the parameter grid in the next cell replaces them.
pipe = Pipeline([("Scaler",    gd.representations.DiagramScaler(scalers=[([0,1], MinMaxScaler())])),
                 ("TDA",       gd.representations.BottleneckDistance(0.001)),
                 ("Estimator", KNeighborsClassifier(n_neighbors=3, metric="precomputed"))])

In [11]:
param = [
    {"Scaler__use":         [True],
     "TDA":                 [gd.representations.Landscape()],
     "TDA__resolution":     [150],
     "TDA__num_landscapes": [3],
     "Estimator":           [RandomForestClassifier()]},
]
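For readers unfamiliar with persistence landscapes, the following is a minimal sketch of the textbook definition sampled on a grid. It is an illustration only: GUDHI's `Landscape` may scale and sample differently, and the `landscape` function and toy `grid` here are hypothetical. In the grid above, `resolution` plays the role of the number of grid points and `num_landscapes` the number of landscape functions kept.

```python
def landscape(diagram, num_landscapes, grid):
    """Sample the first num_landscapes landscape functions on grid.

    Each (birth, death) pair contributes a tent function
    max(0, min(t - birth, death - t)); the k-th landscape at t is the
    k-th largest tent value at t.
    """
    rows = [[0.0] * len(grid) for _ in range(num_landscapes)]
    for j, t in enumerate(grid):
        tents = sorted(
            (max(0.0, min(t - b, d - t)) for b, d in diagram),
            reverse=True,
        )
        for k, value in enumerate(tents[:num_landscapes]):
            rows[k][j] = value
    return rows

diagram = [(0.0, 4.0), (1.0, 3.0)]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
rows = landscape(diagram, num_landscapes=3, grid=grid)
for row in rows:
    print(row)
# [0.0, 1.0, 2.0, 1.0, 0.0]
# [0.0, 0.0, 1.0, 0.0, 0.0]
# [0.0, 0.0, 0.0, 0.0, 0.0]

# Concatenating the rows gives a fixed-length feature vector per diagram.
features = [value for row in rows for value in row]
print(len(features))  # 15
```

Unlike the bottleneck distance, this turns every diagram into a vector of the same length, which is why the landscape can feed a standard classifier such as a random forest directly.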

In [12]:
model = GridSearchCV(pipe, param, cv=3)

In [13]:
model = model.fit(training_data, train_labels)

In [14]:
model.best_params_

{'Estimator': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=None, oob_score=False, random_state=None,
                        verbose=0, warm_start=False),
 'Scaler__use': True,
 'TDA': Landscape(num_landscapes=3, resolution=150, sample_range=[nan, nan]),
 'TDA__num_landscapes': 3,
 'TDA__resolution': 150}

In [15]:
model.score(training_data, train_labels)

1.0

In [16]:
model.score(test_data, test_labels)

0.058823529411764705