# Exploratory data analysis on unlabeled data
We don't have labeled data yet, but we can still explore the data to see whether anything stands out.

## Setup

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

For this chapter, we will be reading data from a SQLite database:

In [None]:
import sqlite3

with sqlite3.connect('logs/logs.db') as conn:
    logs_2018 = pd.read_sql(
        """
        SELECT * 
        FROM logs 
        WHERE datetime BETWEEN '2018-01-01' AND '2019-01-01';
        """, 
        conn, parse_dates=['datetime'], index_col='datetime'
    )
logs_2018.head()
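
The query above hardcodes the date range, and `with sqlite3.connect(...)` only manages the transaction; it does not close the connection. The following is a minimal sketch of a variant that closes the connection and passes the dates as query parameters. It uses a small in-memory table as a stand-in for `logs/logs.db`; the rows, the failure reason label, and the `read_logs()` helper are made up for illustration:

```python
import sqlite3
from contextlib import closing

import pandas as pd


def read_logs(conn, start, end):
    """Read log rows with start <= datetime < end (hypothetical helper)."""
    return pd.read_sql(
        'SELECT * FROM logs WHERE datetime >= ? AND datetime < ?;',
        conn, params=(start, end),
        parse_dates=['datetime'], index_col='datetime'
    )


# in-memory stand-in for logs/logs.db with made-up rows
with closing(sqlite3.connect(':memory:')) as conn:
    conn.execute(
        'CREATE TABLE logs (datetime TEXT, source_ip TEXT, username TEXT, '
        'success INTEGER, failure_reason TEXT)'
    )
    conn.executemany(
        'INSERT INTO logs VALUES (?, ?, ?, ?, ?)',
        [
            ('2017-12-31 23:59:59', '1.1.1.1', 'alice', 1, None),
            ('2018-06-01 12:00:00', '1.1.1.1', 'alice', 1, None),
            ('2018-06-01 12:00:05', '2.2.2.2', 'bob', 0, 'error_wrong_password'),
            ('2019-01-01 00:00:00', '1.1.1.1', 'alice', 1, None),
        ]
    )
    # only the two 2018 rows fall inside the half-open range
    sample_logs = read_logs(conn, '2018-01-01', '2019-01-01')
```

Using a half-open range (`>=` and `<`) makes it explicit that midnight on January 1, 2019 is excluded.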

## EDA
The `success` column comes back as an integer because SQLite has no Boolean type and stores such values as 0 and 1:

In [None]:
logs_2018.dtypes
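
If we wanted Booleans back, the conversion is a one-liner. This is a small sketch on made-up rows; the rest of this notebook keeps the integers, since they are convenient for arithmetic such as `1 - success`:

```python
import pandas as pd

# made-up rows mimicking what comes back from SQLite
sample = pd.DataFrame({'success': [1, 0, 1]})

# 0/1 integers map cleanly onto False/True
as_bool = sample.assign(success=lambda x: x.success.astype(bool))
```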

We are working with data for all of 2018, so it's important to keep an eye on memory usage:

In [None]:
logs_2018.info()
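
By default, `info()` does not measure the strings stored in object columns; passing `memory_usage='deep'` does. As a rough sketch of one way to cut memory on columns that repeat a few values, we can convert them to the `category` dtype. The data below is made up, and whether this helps on the real logs depends on how many distinct values each column holds:

```python
import pandas as pd

# made-up log rows; the same few strings repeat many times
sample = pd.DataFrame({
    'source_ip': ['1.1.1.1', '2.2.2.2'] * 500,
    'failure_reason': ['error_wrong_password', None] * 500,
})

# deep=True counts the bytes used by the strings themselves
before = sample.memory_usage(deep=True).sum()

compact = sample.astype(
    {'source_ip': 'category', 'failure_reason': 'category'}
)
after = compact.memory_usage(deep=True).sum()
```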

The most common failure reason is providing the wrong password. We can also see that the number of unique usernames tried is well over the number of users in our user base (133), indicating some suspicious activity:

In [None]:
logs_2018.describe(include='all')

### Distinct users per IP address
Most IP addresses are associated with a single user, but at least one is associated with many:

In [None]:
logs_2018.groupby('source_ip').agg(dict(username='nunique'))\
    .username.describe()
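
To see what the aggregation above computes, here is a toy version on made-up rows. The cutoff of 2 is a hypothetical value chosen only for this illustration:

```python
import pandas as pd

# made-up log rows
sample = pd.DataFrame({
    'source_ip': ['1.1.1.1', '1.1.1.1', '2.2.2.2', '2.2.2.2', '2.2.2.2'],
    'username': ['alice', 'alice', 'bob', 'carol', 'dave'],
})

# number of distinct usernames seen from each IP address
users_per_ip = sample.groupby('source_ip').username.nunique()

# hypothetical cutoff for illustration only
many_users = users_per_ip[users_per_ip > 2]
```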

### Calculate metrics per IP address
The top five rows of the pivot table, sorted by number of attempts, seem to belong to valid users, since they have relatively high success rates:

In [None]:
pivot = logs_2018.pivot_table(
    values='success', index='source_ip', 
    columns=logs_2018.failure_reason.fillna('success'), 
    aggfunc='count', fill_value=0
)
pivot.insert(0, 'attempts', pivot.sum(axis=1))
pivot = pivot.sort_values('attempts', ascending=False).assign(
    success_rate=lambda x: x.success / x.attempts,
    error_rate=lambda x: 1 - x.success_rate
)
pivot.head()
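
The same counts can be sketched with `pd.crosstab()`, used here in place of `pivot_table()` since we are only counting co-occurrences. The rows and failure reason labels below are made up:

```python
import pandas as pd

# made-up log rows: one mostly successful IP and one that always fails
sample = pd.DataFrame({
    'source_ip': ['1.1.1.1'] * 4 + ['2.2.2.2'] * 3,
    'success': [1, 1, 1, 0, 0, 0, 0],
    'failure_reason': [
        None, None, None, 'error_wrong_password',
        'error_wrong_username', 'error_wrong_username',
        'error_wrong_password'
    ],
})

# rows are IP addresses; columns are outcomes (missing reason = success)
counts = pd.crosstab(
    sample.source_ip, sample.failure_reason.fillna('success')
)
counts.insert(0, 'attempts', counts.sum(axis=1))
counts = counts.assign(
    success_rate=lambda x: x.success / x.attempts,
    error_rate=lambda x: 1 - x.success_rate
)
```

In this toy data, the first IP address succeeds on 3 of 4 attempts, while the second never succeeds.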

### Visual anomaly detection
Let's see if something jumps out at us when plotting successes versus attempts by IP address:

In [None]:
pivot.plot(
    kind='scatter', x='attempts', y='success', 
    title='successes vs. attempts by IP address', alpha=0.25
)

Valid users probably have close to a 1:1 relationship between attempts and successes, so we can imagine a boundary that separates them from the attackers:

In [None]:
ax = pivot.plot(
    kind='scatter', x='attempts', y='success', 
    title='successes vs. attempts by IP address', alpha=0.25
)
ax.plot([30, 350], [0, 340], 'r--', label='sample boundary')
plt.legend()
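
The dashed line is only a visual aid, but it can also be written as a rule. The sketch below takes the two endpoints of the sample boundary, derives the line through them, and selects the IP addresses that fall below it. The per-IP counts are made up, and the boundary itself is a sample, not a tuned threshold:

```python
import pandas as pd

# endpoints of the sample boundary drawn in the plot
(x0, y0), (x1, y1) = (30, 0), (350, 340)
slope = (y1 - y0) / (x1 - x0)

# made-up per-IP counts
sample = pd.DataFrame(
    {'attempts': [100, 100, 300], 'success': [95, 20, 40]},
    index=['1.1.1.1', '2.2.2.2', '3.3.3.3']
)

# successes the boundary expects for each IP's number of attempts
boundary = y0 + slope * (sample.attempts - x0)

# IPs with fewer successes than the boundary look suspicious
below_boundary = sample[sample.success < boundary]
```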

Most of the IP addresses belong to attackers, because an attacker gets a new IP address for each attack, while valid users stick with the 1-3 IP addresses they have. As a result, the outliers in terms of successes are the valid users rather than the attackers:

In [None]:
pivot[['attempts', 'success']].plot(
    kind='box', subplots=True, figsize=(10, 3),
    title='stats per IP address'
)
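
By default, matplotlib box plots draw whiskers at 1.5 times the interquartile range (IQR) and show anything beyond them as outlier points. A minimal sketch of the same rule on made-up success counts, where most IP addresses have few or no successes:

```python
import pandas as pd

# made-up successes per IP address: mostly attackers, two valid users
successes = pd.Series([0, 0, 0, 0, 1, 0, 0, 250, 0, 310])

q1, q3 = successes.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# points above the upper fence would be drawn as outliers
outliers = successes[successes > upper_fence]
```

Here the two high counts are the outliers, which mirrors the observation that valid users stand out on successes.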

Does anything stand out when we look at the data at one-minute resolution?

In [None]:
from matplotlib.ticker import MultipleLocator

ax = logs_2018.loc['2018-01'].assign(
    failures=lambda x: 1 - x.success
).groupby('source_ip').resample('1min').agg(
    {'username': 'nunique', 'success': 'sum', 'failures': 'sum'}
).assign(
    attempts=lambda x: x.success + x.failures
).dropna().query('attempts > 0').reset_index().plot(
    y=['attempts', 'username', 'failures'], kind='hist',
    subplots=True, layout=(1, 3), figsize=(20, 3),
    title='January 2018 distributions of minutely stats by IP address'
)
for axes in ax.flatten():
    axes.xaxis.set_major_locator(MultipleLocator(1))
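
As an alternative sketch of the per-IP, per-minute aggregation, we can group on the IP address together with `pd.Grouper(freq='1min')`. With more than one grouping key, this should only produce rows for the (IP address, minute) pairs that occur in the data; the `query()` call is kept as a guard. The rows below are made up:

```python
import pandas as pd

# made-up log rows indexed by timestamp
sample = pd.DataFrame(
    {
        'source_ip': ['1.1.1.1', '2.2.2.2', '2.2.2.2', '2.2.2.2', '1.1.1.1'],
        'username': ['alice', 'bob', 'carol', 'dave', 'alice'],
        'success': [1, 0, 0, 0, 1],
    },
    index=pd.to_datetime([
        '2018-01-01 00:00:10', '2018-01-01 00:00:20', '2018-01-01 00:00:30',
        '2018-01-01 00:00:40', '2018-01-01 00:02:05',
    ])
)

per_minute = sample.assign(
    failures=lambda x: 1 - x.success
).groupby(['source_ip', pd.Grouper(freq='1min')]).agg(
    {'username': 'nunique', 'success': 'sum', 'failures': 'sum'}
).assign(
    attempts=lambda x: x.success + x.failures
).query('attempts > 0')
```

In this toy data, the second IP address tries three different usernames within a single minute and fails every time.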

The number of distinct usernames with failures per minute looks like a signal we can use, so let's plot it for all of 2018:

In [None]:
logs_2018.loc['2018'].assign(
    failures=lambda x: 1 - x.success
).query('failures > 0').resample('1min').agg(
    {'username': 'nunique', 'failures': 'sum'}
).dropna().rename(
    columns={'username': 'usernames_with_failures'}
).usernames_with_failures.plot(
    title='usernames with failures per minute in 2018',
    figsize=(15, 3)
).set_ylabel('usernames with failures')
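
One simple way to act on this metric is to flag the minutes in which it exceeds a cutoff. The sketch below does so on made-up rows; the threshold of 2 is a hypothetical value for illustration, not one derived from the data:

```python
import pandas as pd

# made-up log rows indexed by timestamp
sample = pd.DataFrame(
    {
        'username': ['alice', 'bob', 'carol', 'dave', 'alice', 'alice'],
        'success': [0, 0, 0, 0, 0, 1],
    },
    index=pd.to_datetime([
        '2018-01-01 00:00:05', '2018-01-01 00:00:15', '2018-01-01 00:00:25',
        '2018-01-01 00:00:35', '2018-01-01 00:03:10', '2018-01-01 00:03:40',
    ])
)

# distinct usernames with at least one failure in each minute
usernames_with_failures = (
    sample.query('success == 0').username.resample('1min').nunique()
)

threshold = 2  # hypothetical cutoff
flagged = usernames_with_failures[usernames_with_failures > threshold]
```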
