# Solutions
Simulation results will differ unless you use the same seed. If your results differ, check that your strategy for completing the exercises is similar to the sample solutions here.

## Exercise 1
Simulate December 2018 using a seed of 27.

In [1]:
!python3 simulate.py \
    -s 27 \
    -u user_data/user_base.txt \
    -i user_data/user_ips.json \
    -l dec_2018_log.csv \
    -hl dec_2018_attacks.csv \
    31 "2018-12-01"

[INFO] [ simulate.py ] Simulating 31.0 days...
[INFO] [ simulate.py ] Saving logs
[INFO] [ simulate.py ] All done!


## Imports for Remaining Exercises

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

## Exercise 2
Find the number of unique usernames, attempts, successes, failures, and success/failure rates per IP address.
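
One possible approach is a single `groupby()` on the source IP address. The sketch below runs on a tiny made-up log rather than on `dec_2018_log.csv`. The `username` and `success` column names are assumptions about the log's layout, since only `source_ip` appears elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the simulated log; every column name other than
# `source_ip` is an assumption about the real file's layout.
toy_log = pd.DataFrame({
    'source_ip': ['1.1.1.1', '1.1.1.1', '1.1.1.1', '2.2.2.2', '2.2.2.2'],
    'username': ['alice', 'bob', 'carol', 'dave', 'dave'],
    'success': [False, False, True, True, True],
})

log_aggs = toy_log.assign(
    failures=lambda x: ~x.success  # a failure is any unsuccessful attempt
).groupby('source_ip').agg(
    usernames=('username', 'nunique'),  # distinct usernames tried
    attempts=('success', 'size'),       # every row is one attempt
    successes=('success', 'sum'),
    failures=('failures', 'sum'),
).assign(
    success_rate=lambda x: x.successes / x.attempts,
    failure_rate=lambda x: x.failures / x.attempts,
)
```

Here `log_aggs` has one row per IP address; in the toy data, `1.1.1.1` tried three usernames in three attempts and failed twice.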

## Exercise 3
Create two subplots with failures versus attempts on the left and failure rate versus distinct usernames on the right. Draw decision boundaries based on what you see. Be sure to color each point by whether or not it is a hacker IP address.
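
A minimal sketch of the plot layout, using plain `matplotlib` on made-up per-IP aggregates. The `ip_stats` data, its column names, and the boundary values are all hypothetical; with the real data, the boundaries should be chosen by inspecting the scatter plots.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-IP aggregates; `is_hacker` marks known attacker IPs.
ip_stats = pd.DataFrame({
    'attempts': [4, 6, 5, 40, 55],
    'failures': [1, 1, 0, 35, 50],
    'usernames': [1, 2, 1, 30, 45],
    'is_hacker': [False, False, False, True, True],
}).assign(failure_rate=lambda x: x.failures / x.attempts)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# (x column, y column) for the left and right panels
panels = [('attempts', 'failures'), ('usernames', 'failure_rate')]
for ax, (x_col, y_col) in zip(axes, panels):
    # one scatter call per class so each gets its own color and legend entry
    for is_hacker, group in ip_stats.groupby('is_hacker'):
        ax.scatter(
            group[x_col], group[y_col], alpha=0.6,
            label='hacker' if is_hacker else 'valid user'
        )
    ax.set(xlabel=x_col, ylabel=y_col)
    ax.legend()

# Illustrative boundaries picked by eye for the toy data above
axes[0].axhline(10, color='black', linestyle='--')
axes[1].axvline(10, color='black', linestyle='--')
axes[1].axhline(0.5, color='black', linestyle='--')
```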

## Exercise 4
Build a rule-based criterion using percent difference from the median that flags an IP address if failures and attempts are both 5 times their respective medians OR if the number of distinct usernames is 5 times its median.

Function from chapter for getting baselines:

In [6]:
def get_baselines(hourly_ip_logs, func, *args, **kwargs):
    """
    Calculate hourly bootstrapped statistic per column.
    
    Parameters:
        - hourly_ip_logs: Data to sample from.
        - func: Statistic to calculate.
        - args: Additional positional arguments for `func`
        - kwargs: Additional keyword arguments for `func`
    
    Returns:
        `pandas.DataFrame` of hourly bootstrapped statistics
    """
    ...  # implementation given in the chapter

Get the median baselines:

In [7]:
medians = get_baselines(hourly_ip_logs, 'median')

Flag an IP address if both failures and attempts are at least 5 times their medians, or if the number of usernames tried is at least 5 times its median:
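
The rule can be sketched on toy data as follows. This is a simplification under stated assumptions: it uses plain per-hour medians instead of the bootstrapped baselines from `get_baselines()`, and the `hourly` frame and its column names are made up for illustration.

```python
import pandas as pd

# Toy hourly counts per IP address; the column names are assumptions
hourly = pd.DataFrame({
    'hour': [0, 0, 0, 1, 1, 1],
    'source_ip': ['a', 'b', 'c', 'a', 'b', 'd'],
    'attempts': [2, 2, 20, 4, 4, 4],
    'failures': [1, 1, 18, 2, 2, 2],
    'usernames': [1, 1, 2, 1, 1, 10],
})

# Simplification: plain per-hour medians rather than bootstrapped baselines
toy_medians = hourly.groupby('hour')[
    ['attempts', 'failures', 'usernames']
].median()

# Line each observation up with the median for its hour
joined = hourly.join(toy_medians, on='hour', rsuffix='_median')

multiplier = 5  # 5 times the median, i.e. a 400% difference from it

many_failed_attempts = (
    (joined.failures >= multiplier * joined.failures_median)
    & (joined.attempts >= multiplier * joined.attempts_median)
)
many_usernames = joined.usernames >= multiplier * joined.usernames_median

# An IP address is flagged if either condition holds in any hour
toy_flagged = joined[
    many_failed_attempts | many_usernames
].source_ip.drop_duplicates()
```

In the toy data, `c` is flagged for its failures and attempts in hour 0, and `d` is flagged for the number of usernames it tried in hour 1.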

## Exercise 5
Calculate metrics to evaluate how well the ensemble method performed. We can use the `evaluate()` function from the chapter:

In [9]:
def evaluate(alerted_ips, attack_ips, log_ips):
    """
    Calculate true positives (TP), false positives (FP), 
    true negatives (TN), and false negatives (FN) for 
    IP addresses flagged as suspicious.
    
    Parameters:
        - alerted_ips: `pandas.Series` of flagged IP addresses
        - attack_ips: `pandas.Series` of attacker IP addresses
        - log_ips: `pandas.Series` of all IP addresses seen
    
    Returns:
        Tuple of form (TP, FP, TN, FN)
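    """
    # alerted IP addresses that really are attackers
    tp = alerted_ips.isin(attack_ips).sum()
    # alerted IP addresses that are not attackers
    fp = (~alerted_ips.isin(attack_ips)).sum()
    # IP addresses in the logs that were neither alerted on nor attackers
    tn = (~log_ips.isin(alerted_ips) & ~log_ips.isin(attack_ips)).sum()
    # attacker IP addresses that were never alerted on
    fn = (~attack_ips.isin(alerted_ips)).sum()
    return tp, fp, tn, fn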

Next, we make a partial to store the attacker IP addresses and the unique IP addresses in the logs:

In [10]:
# make this easier to call
from functools import partial
scores = partial(
    evaluate, 
    attack_ips=pd.read_csv('dec_2018_attacks.csv').source_ip, 
    log_ips=dec_log.source_ip.drop_duplicates()
)
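
To see what the partial buys us, here is a self-contained sketch on toy data. `count_outcomes()` is a hypothetical stand-in with the same signature as `evaluate()`, and the IP labels are made up; the point is only that the attacker list and the log IP addresses are fixed once, so each later call needs just the alerted IP addresses.

```python
from functools import partial

import pandas as pd


def count_outcomes(alerted_ips, attack_ips, log_ips):
    """Hypothetical stand-in with the same signature as `evaluate()`."""
    tp = alerted_ips.isin(attack_ips).sum()
    fp = (~alerted_ips.isin(attack_ips)).sum()
    tn = (~log_ips.isin(alerted_ips) & ~log_ips.isin(attack_ips)).sum()
    fn = (~attack_ips.isin(alerted_ips)).sum()
    return tp, fp, tn, fn


# Toy data: five IP addresses seen, two of them attackers
toy_log_ips = pd.Series(['a', 'b', 'c', 'd', 'e'])
toy_attack_ips = pd.Series(['d', 'e'])

# Fix the two arguments that never change between evaluations
toy_scores = partial(
    count_outcomes, attack_ips=toy_attack_ips, log_ips=toy_log_ips
)

# Only the alerted IP addresses vary from call to call
result = toy_scores(pd.Series(['c', 'd']))
```

With `c` and `d` alerted, `result` holds one true positive (`d`), one false positive (`c`), two true negatives (`a`, `b`), and one false negative (`e`).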

We can evaluate the performance with the `classification_stats()` function from the chapter:

In [11]:
def classification_stats(tp, fp, tn, fn):
    """Calculate metrics"""
    return {
        'FPR': fp / (fp + tn),
        'FDR': fp / (fp + tp),
        'FNR': fn / (fn + tp),
        'FOR': fn / (fn + tn)
    }

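Performance is decent: we miss only about 1% of the attacker IP addresses (FNR) while flagging roughly 7% of the valid IP addresses (FPR), so about 19% of our alerts are false alarms (FDR):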

In [12]:
classification_stats(*scores(flagged_ips))

{'FPR': 0.07003891050583658,
 'FDR': 0.18947368421052632,
 'FNR': 0.01282051282051282,
 'FOR': 0.004166666666666667}
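
As a sanity check on these rates, the sketch below plugs in the smallest whole-number counts that reproduce them. These counts are inferred from the printed output, not taken from the notebook, so treat them as an illustration of how each rate is formed rather than as the actual confusion matrix.

```python
import math

# Smallest whole-number counts consistent with the rates printed above
# (inferred from the output; the raw counts are not shown in the notebook)
tp, fp, tn, fn = 77, 18, 239, 1

rates = {
    'FPR': fp / (fp + tn),  # share of valid IP addresses that were flagged
    'FDR': fp / (fp + tp),  # share of flagged IP addresses that are valid
    'FNR': fn / (fn + tp),  # share of attacker IP addresses that were missed
    'FOR': fn / (fn + tn),  # share of unflagged IP addresses that are attackers
}

reported = {
    'FPR': 0.07003891050583658,
    'FDR': 0.18947368421052632,
    'FNR': 0.01282051282051282,
    'FOR': 0.004166666666666667,
}

matches = all(math.isclose(rates[k], reported[k]) for k in reported)
```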
