# TODO - ROC curves, minimal equal error rate plot, evaluation of ROC over time (overlay with average?)
# Implement gridsearch to optimise the model? (Use validation set of data)
Working on this problem: https://www.cs.cmu.edu/~keystroke/.
Supporting paper: http://www.cs.cmu.edu/~keystroke/KillourhyMaxion09.pdf

Data comes from 51 subjects typing ".tie5Roanl" 400 times across multiple sessions.

Our goal is to develop a model which has a minimal equal error rate. 

(Diagram of minimal equal error rate https://api.intechopen.com/media/chapter/66135/media/F2.png).
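To make the metric concrete: the equal error rate is the point on the threshold sweep where the false-alarm rate and the miss rate coincide. Below is a minimal sketch, assuming anomaly scores where a higher score means "more likely an imposter". The function name `equal_error_rate` and the scores are made up for illustration, and the paper may well interpolate the ROC curve rather than pick the nearest threshold as done here.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER from anomaly scores (higher = more anomalous)."""
    genuine_scores = np.asarray(genuine_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    # Candidate thresholds: every observed score
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    # False alarm: a genuine sample scores above the threshold (gets rejected)
    false_alarm = np.array([(genuine_scores > t).mean() for t in thresholds])
    # Miss: an imposter sample scores at or below the threshold (gets accepted)
    miss = np.array([(impostor_scores <= t).mean() for t in thresholds])
    # Pick the threshold where the two rates are closest and average them
    idx = np.argmin(np.abs(false_alarm - miss))
    return (false_alarm[idx] + miss[idx]) / 2.0

# Made-up scores, purely for illustration
genuine = [0.1, 0.2, 0.3, 0.4]
impostor = [0.35, 0.6, 0.7, 0.8]
eer = equal_error_rate(genuine, impostor)
print(eer)
```

A lower value is better: 0 would mean some threshold separates the genuine and imposter scores perfectly.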

Questions that immediately need answering:
- What type of problem is this (classification or regression)?
- Has anyone attempted this problem before?
    - If so, how did they approach it? 
        - Which detectors / feature sets / models did they use?
        - What was successful about their approach? 
        - What were their limitations?
- What do the features in the dataset represent?
- Which do we prioritise: false positives or false negatives (in this context, false-alarm rates and miss rates)?
    - From the literature (and common sense, to be honest), we should prioritise lowering miss rates: it's better to lock out a genuine user than to let a threat access the system.

These were largely answered through reading the aforementioned paper, and doing some background reading and research.

The aforementioned paper also detailed a method by which different detectors could be compared on the same dataset. So to evaluate how our 'new' model performs against its competitors, it makes sense to first implement a pre-existing model, then our new model, and compare performance under the same conditions.

Note: The paper implemented the techniques using R (which I've not used before). An implementation in Python _should_ give the same results, but there may be some underlying differences between the R and Python mathematics libraries.

# Imports and file processing
Let's import some relevant modules and see what the file's contents are.

In [None]:
# First, imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# for Manhattan detector, need cityblock distance
from scipy.spatial.distance import cityblock

In [None]:
# Read in csv file and check what's inside
df = pd.read_csv('DSL-StrongPasswordData.csv')
df.head()

# Exploratory Analysis

In [None]:
df.info()
# 20,400 rows, 34 columns

In [None]:
df.describe()
# Mean DD looks to be ~ double H and UD

In [None]:
subjects = df["subject"].unique()
print(subjects) 
# Confirmation there are 51 unique subjects

In [None]:
sessions = df["sessionIndex"].unique()
print(sessions)
# 400 times across 8 sessions means each subject typed the string ~ 50 times per session

In [None]:
sns.pairplot(df.filter(['H.period','H.t','H.i','H.e','H.five','H.Shift.r','H.o','H.a','H.n','H.l','H.Return'],axis=1))

In [None]:
sns.pairplot(df.filter(['DD.period.t','DD.t.i','DD.i.e','DD.e.five','DD.five.Shift.r','DD.Shift.r.o','DD.o.a','DD.a.n','DD.n.l','DD.l.Return'],axis=1))

In [None]:
sns.pairplot(df.filter(['UD.period.t','UD.t.i','UD.i.e','UD.e.five','UD.five.Shift.r','UD.Shift.r.o','UD.o.a','UD.a.n','UD.n.l','UD.l.Return'],axis=1))

In [None]:
# Check for duplicate rows - there aren't any, this is good
df[df.duplicated()]

In [None]:
# check for null values
print(df.isnull().sum())

In [None]:
sns.boxplot(x=df["H.period"])

In [None]:
sns.boxplot(x=df["H.t"])

In [None]:
sns.boxplot(x=df["DD.period.t"])

In [None]:
sns.boxplot(x=df["DD.t.i"])

In [None]:
sns.boxplot(x=df["UD.period.t"])

In [None]:
sns.boxplot(x=df["UD.t.i"])

So it looks like there are some serious outliers with UD and DD, but H has a more even spread.
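To put a number on what the box plots show, one option is to count the points that fall outside the whiskers. This is a rough sketch, assuming the usual 1.5 x IQR whisker rule; the helper name `iqr_outlier_fraction` and the sample values are hypothetical.

```python
import numpy as np

def iqr_outlier_fraction(values, k=1.5):
    """Fraction of values lying beyond k * IQR from the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outside = (values < lower) | (values > upper)
    return outside.mean()

# Made-up timings in seconds, with one obviously long pause
sample = np.array([0.10, 0.11, 0.12, 0.09, 0.10, 0.11, 0.10, 2.50])
frac = iqr_outlier_fraction(sample)
print(frac)
```

On the real data this could be called per column, e.g. `iqr_outlier_fraction(df["UD.t.i"])`, to compare the H, DD and UD features.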

# Model development

It is evident this is a classification problem, rather than a regression problem. 

Firstly, let's approach this using standard anomaly detection practices: for each user we train a model to recognise that user's typing pattern, then test it against the remaining users' samples, from which we can obtain an anomaly score.

In [None]:
for subject in subjects:
    #print('Training new model for subject {}'.format(subject))
    real_user = df.loc[df.subject == subject]
    fake_user = df.loc[df.subject != subject]

    # We train our model using a genuine user's data
    training_data = real_user[:200].loc[:, 'H.period':'H.Return']
    
    # To test our model, we need both more data from the original user, and imposter user data
    genuine_user_data = real_user[200:].loc[:, 'H.period':'H.Return']
    imposter_user_data = fake_user[:].loc[:, 'H.period':'H.Return']
    
    # Let's check dimensions of our training and testing tuples are the same...just in case
    # (raise instead of sys.exit: sys was never imported, and the old else/continue skipped the mean calculation)
    if training_data.shape != genuine_user_data.shape:
        raise ValueError("training_data and genuine_user_data shapes don't match: {} | {}".format(training_data.shape, genuine_user_data.shape))
    if imposter_user_data.shape[0] != genuine_user_data.shape[0]*100:
        raise ValueError("imposter_user_data and genuine_user_data rows aren't 20000 and 200: {} | {}".format(imposter_user_data.shape[0], genuine_user_data.shape[0]))

    mean_vector = training_data.mean().values # store mean vector in a numpy array

For simplicity, let's implement the Manhattan detector first, and then later we can compare our model's performance to this.
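As a starting point, here is a minimal sketch of how a Manhattan detector could score samples, assuming the anomaly score is the city-block distance between a test vector and the mean of the training vectors. The function name `manhattan_scores` and the toy arrays are illustrative only; the row-wise `np.abs(...).sum(axis=1)` should give the same numbers as calling `cityblock` on each row.

```python
import numpy as np

def manhattan_scores(train, test):
    """City-block distance of each test row from the training mean vector."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mean_vector = train.mean(axis=0)
    # Sum of absolute differences per row = Manhattan (city-block) distance
    return np.abs(test - mean_vector).sum(axis=1)

# Toy timing vectors (two features), purely for illustration
train = np.array([[0.10, 0.20], [0.12, 0.22], [0.11, 0.18]])
test = np.array([[0.11, 0.20], [0.30, 0.50]])
scores = manhattan_scores(train, test)
print(scores)
```

Inside the per-subject loop, this would be called once with `genuine_user_data.values` and once with `imposter_user_data.values`, giving the two score sets needed for an equal error rate.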

# Conclusion and Future Improvements
Things that could be looked at in future:
- Try multi-class classification vs anomaly detection.
    - Rather than training with respect to one user's data, then testing against the rest, with MCC you could use multiple users' samples to form decision boundaries by which users could be distinguished.
- We have not accounted for correlations between dataset features (whereas in reality, DD values are made up of both H and UD components, implying correlation). We could retrain having normalised out these effects.
- There appear to be a lot of outliers present in the UD and DD data (considering the box plots). We could use some sort of filter (see Manhattan filter detector) to remove these components, and see if performance improves.
- Could implement hypothesis testing to verify our model is better than the others presented (rather than due to just random chance).
- Could look at other keystroke data available online (as mentioned in the paper) - although if we were to integrate it, we'd have to ensure it was recorded under similar conditions.
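On the correlation point above, a quick sketch of why DD is expected to track H and UD. It assumes each DD time is the sum of the matching H and UD times, and it uses synthetic timings rather than the real dataset, so the numbers themselves mean nothing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic timings in seconds (NOT the real data): hold and up-down times
h = rng.normal(loc=0.09, scale=0.03, size=1000)
ud = rng.normal(loc=0.15, scale=0.10, size=1000)
# Assumption: down-down time = hold time + up-down time
dd = h + ud

# Pearson correlation of DD with each of its components
r_dd_h = np.corrcoef(dd, h)[0, 1]
r_dd_ud = np.corrcoef(dd, ud)[0, 1]
print(r_dd_h, r_dd_ud)
```

On the real data, the same assumption could be checked directly by looking at `df["DD.period.t"] - (df["H.period"] + df["UD.period.t"])`, which should be close to zero if it holds.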