# Unsupervised anomaly detection
After our initial EDA, we decided to try unsupervised anomaly detection built around one main feature: the number of distinct usernames with a failed login attempt in a given minute.

## Setup

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sqlite3

with sqlite3.connect('logs/logs.db') as conn:
    logs_2018 = pd.read_sql(
        'SELECT * FROM logs WHERE datetime BETWEEN "2018-01-01" AND "2019-01-01";', 
        conn, parse_dates=['datetime'], index_col='datetime'
    )
logs_2018.head()

## Prepping our data
We need a function to transform our log data into our X for the model:

In [None]:
def get_X(log, day):
    """
    Get data we can use for the X
    
    Parameters:
        - log: The logs dataframe
        - day: A day or single value we can use as a datetime index slice
    
    Returns: 
        A `pandas.DataFrame` object
    """
    return pd.get_dummies(log.loc[day].assign(
        failures=lambda x: 1 - x.success
    ).query('failures > 0').resample('1min').agg(
        {'username': 'nunique', 'failures': 'sum'}
    ).dropna().rename(
        columns={'username': 'usernames_with_failures'}
    ).assign(
        day_of_week=lambda x: x.index.dayofweek, 
        hour=lambda x: x.index.hour
    ).drop(columns=['failures']), columns=['day_of_week', 'hour'])

We will work with January 2018:

In [None]:
X = get_X(logs_2018, '2018-01')
X.columns
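
To make the transformation in `get_X()` concrete, here is a minimal sketch on a made-up log. The usernames, timestamps, and the `toy_log` name are hypothetical, and only the hour is one-hot encoded to keep the output small:

```python
import pandas as pd

# Hypothetical toy log: three failed attempts and one success across two minutes
toy_log = pd.DataFrame(
    {
        'username': ['alice', 'bob', 'alice', 'carol'],
        'success': [0, 0, 1, 0],
    },
    index=pd.to_datetime([
        '2018-01-01 00:00:05', '2018-01-01 00:00:40',
        '2018-01-01 00:01:10', '2018-01-01 00:01:50',
    ]).rename('datetime'),
)

# Keep only the failures, then count distinct usernames per minute
failures_per_minute = (
    toy_log
    .query('success == 0')
    .resample('1min')
    .agg({'username': 'nunique'})
    .rename(columns={'username': 'usernames_with_failures'})
)

# One-hot encode the hour, as get_X() does for day_of_week and hour
features = pd.get_dummies(
    failures_per_minute.assign(hour=lambda x: x.index.hour),
    columns=['hour']
)
print(features)
```

The first minute has failures for two distinct usernames and the second for one; the successful login by `alice` is ignored.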

## Isolation Forest
We will fit an isolation forest, estimating that 5% of the data is anomalous (`contamination=0.05`):

In [None]:
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iso_forest_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('iforest', IsolationForest(
        random_state=0, contamination=0.05
    ))
]).fit(X)

Let's see how many outliers versus inliers we got. Outliers are marked as -1 and inliers as 1:

In [None]:
isolation_forest_preds = iso_forest_pipeline.predict(X)
pd.Series(np.where(
    isolation_forest_preds == -1, 'outlier', 'inlier'
)).value_counts()
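
As a rough illustration of what `contamination` does, the sketch below fits an isolation forest to synthetic data (the `X_toy` array is made up, not our log data). It checks that the points labeled -1 are exactly those with a negative decision score, and that their share is close to the contamination value:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic, hypothetical data: 1,000 points with a single feature
X_toy = rng.normal(size=(1000, 1))

forest = IsolationForest(random_state=0, contamination=0.05).fit(X_toy)
scores = forest.decision_function(X_toy)  # anomaly score shifted by offset_
preds = forest.predict(X_toy)

# Points with a negative decision score are the ones labeled -1
same_split = np.array_equal(preds == -1, scores < 0)
outlier_share = (preds == -1).mean()
print(same_split, outlier_share)
```

On the training data, the share of outliers lands near 5% by construction, which is why this count tells us little about how many true anomalies there are.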

### Local Outlier Factor
Since we have no labeled data, we can't use grid search to tune our hyperparameters (we can't calculate performance metrics). Therefore, we will accept the default parameters for LOF, which will use 20 neighbors:

In [None]:
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

lof_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('lof', LocalOutlierFactor())
]).fit(X)

After fitting, the model exposes a negative outlier factor for each training point, which doesn't tell us on its own whether the point is an outlier or an inlier:

In [None]:
lof_preds = lof_pipeline.named_steps['lof'].negative_outlier_factor_ 
lof_preds

For that, we need to compare it to the offset. Values less than the offset are outliers:

In [None]:
pd.Series(np.where(
    lof_preds < lof_pipeline.named_steps['lof'].offset_, 'outlier', 'inlier'
)).value_counts()
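
The comparison against the offset can be checked on a small synthetic example (the data here is made up). With the default `contamination='auto'`, scikit-learn fixes the offset at -1.5, and `fit_predict()` labels as -1 exactly the points whose negative outlier factor falls below it:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic data: a cluster of 100 points plus one far-away point
X_toy = np.vstack([rng.normal(size=(100, 2)), [[10.0, 10.0]]])

lof = LocalOutlierFactor()  # defaults: n_neighbors=20, contamination='auto'
labels = lof.fit_predict(X_toy)

# Compare each point's negative outlier factor to the threshold
below_offset = lof.negative_outlier_factor_ < lof.offset_
print(lof.offset_, np.array_equal(labels == -1, below_offset))
```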

### Check agreement between unsupervised methods
While we can't compare their performance without labeled data, we can measure how much the two methods agree using Cohen's kappa. Here, the level of agreement is low:

In [None]:
from sklearn.metrics import cohen_kappa_score

is_lof_outlier = np.where(
    lof_preds < lof_pipeline.named_steps['lof'].offset_, 
    'outlier', 'inlier'
)
is_iso_outlier = np.where(
    isolation_forest_preds == -1, 'outlier', 'inlier'
)

cohen_kappa_score(is_lof_outlier, is_iso_outlier)
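
For intuition, Cohen's kappa is $(p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance. The following sketch computes it by hand on hypothetical labels and compares the result with `cohen_kappa_score()`:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two detectors over ten minutes
a = np.array(['outlier', 'outlier'] + ['inlier'] * 8)
b = np.array(['outlier', 'inlier', 'outlier'] + ['inlier'] * 7)

p_observed = (a == b).mean()  # share of minutes where the labels match
# Chance agreement: both say outlier, or both say inlier, independently
p_expected = (
    (a == 'outlier').mean() * (b == 'outlier').mean()
    + (a == 'inlier').mean() * (b == 'inlier').mean()
)
kappa_by_hand = (p_observed - p_expected) / (1 - p_expected)

print(kappa_by_hand, cohen_kappa_score(a, b))
```

The two detectors match on 80% of the minutes, yet kappa is only 0.375 because most of that agreement is expected by chance when inliers dominate.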

### Evaluating the models
We have been given the labeled data. Now we can truly compare these models.

In [None]:
with sqlite3.connect('logs/logs.db') as conn:
    hackers_jan_2018 = pd.read_sql(
        """
        SELECT * 
        FROM attacks 
        WHERE start BETWEEN "2018-01-01" AND "2018-02-01";
        """, conn, parse_dates=['start', 'end']
    ).assign(
        duration=lambda x: x.end - x.start,
        start_floor=lambda x: x.start.dt.floor('min'),
        end_ceil=lambda x: x.end.dt.ceil('min')
    )
hackers_jan_2018.shape

Note that this table records only one of the IP addresses involved in each attack, so it's a good thing we aren't relying on IP addresses anymore. Also note that, while the attacks are short, our minute-by-minute data means we will trigger many alerts per attack:

In [None]:
hackers_jan_2018

We want to mark each minute that had an attack, so we can use the `start_floor` and `end_ceil` columns to create a range of datetimes. Then, we can check if the data we marked as outliers falls within that range:

In [None]:
def get_y(datetimes, hackers, resolution='1min'):
    """
    Get data we can use for the y (whether or not a hacker attempted
    a login during that time).
    
    Parameters:
        - datetimes: The datetimes to check for hackers
        - hackers: The dataframe indicating when the attacks started and stopped
        - resolution: The granularity of the datetime. Default is 1 minute.
        
    Returns:
        `pandas.Series` of Booleans.
    """
    date_ranges = hackers.apply(
        lambda x: pd.date_range(x.start_floor, x.end_ceil, freq=resolution), 
        axis=1
    )
    # Flatten the per-attack ranges into one collection of attack minutes
    # (avoids repeatedly concatenating onto an empty Series)
    dates = pd.concat(
        [date_range.to_series() for date_range in date_ranges]
    )
    return datetimes.isin(dates)
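
To see how the flooring and ceiling mark minutes, here is a small sketch with a single hypothetical attack (the timestamps are made up):

```python
import pandas as pd

# Hypothetical attack lasting from 00:01:30 to 00:03:10
start = pd.Timestamp('2018-01-01 00:01:30')
end = pd.Timestamp('2018-01-01 00:03:10')

# Widen to whole minutes, as the start_floor and end_ceil columns do
attack_minutes = pd.date_range(
    start.floor('min'), end.ceil('min'), freq='1min'
)

# Minutes we want to label (00:00 through 00:05)
minutes = pd.Series(
    pd.date_range('2018-01-01 00:00', periods=6, freq='1min')
)
labels = minutes.isin(attack_minutes)
print(labels.tolist())
# [False, True, True, True, True, False]
```

Because the range includes both endpoints, the minute starting at 00:04 is also flagged, even though the attack ended at 00:03:10.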

Let's grab our labeled `y` data:

In [None]:
is_hacker = get_y(X.reset_index().datetime, hackers_jan_2018)

We will create partials for the performance metrics functions for less typing:

In [None]:
from functools import partial
from sklearn.metrics import classification_report
from ml_utils.classification import confusion_matrix_visual

report = partial(classification_report, is_hacker)
conf_matrix = partial(
    confusion_matrix_visual, is_hacker, class_labels=[False, True]
)
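
As a quick check of what the partials do, the sketch below uses hypothetical labels to show that a partial with `y_true` fixed gives the same report as the full call:

```python
from functools import partial
from sklearn.metrics import classification_report

# Hypothetical ground truth and predictions
y_true = [False, False, True, True]
y_pred = [False, True, True, True]

# Fix the first positional argument (y_true) ahead of time
toy_report = partial(classification_report, y_true)

# Calling the partial with only y_pred matches the full call
same_output = toy_report(y_pred) == classification_report(y_true, y_pred)
print(same_output)
print(toy_report(y_pred))
```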

#### Isolation forest

In [None]:
iso_forest_predicts_hacker = isolation_forest_preds == -1

print(report(iso_forest_predicts_hacker))

#### Local Outlier Factor

In [None]:
lof_predicts_hacker = lof_preds < lof_pipeline.named_steps['lof'].offset_

print(report(lof_predicts_hacker))

#### Comparing confusion matrices

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
conf_matrix(iso_forest_predicts_hacker, ax=axes[0], title='Isolation Forest')
conf_matrix(lof_predicts_hacker, ax=axes[1], title='Local Outlier Factor')
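
The `confusion_matrix_visual()` helper comes from the `ml_utils` package rather than scikit-learn. If it isn't available, the same counts can be read from scikit-learn's `confusion_matrix()`. This is a sketch with hypothetical labels, not our actual results:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: True means a hacker was active in that minute
y_true = np.array([False, False, False, True, True])
y_pred = np.array([False, True, False, True, False])

# Rows are actual classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred, labels=[False, True])
tn, fp, fn, tp = matrix.ravel()

precision = tp / (tp + fp)  # share of alerts that were real attacks
recall = tp / (tp + fn)     # share of attack minutes we caught
print(matrix)
print(precision, recall)
```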
