# Anomaly Detection

In this notebook we finally perform our anomaly detection. We perform this in 6 steps:
1. Create a data frame for each individual username showing authentication types for each of the 8 or 24 time periods that each day is split into
2. Use EDA, Isolation Forests, Local Outlier Factor and other models to find 'normal' days or 'normal' usernames to allow us to train the CP_APR model
3. Train the CP_APR model with the data we've identified in step 2
4. Run the trained CP_APR model on the other data to identify anomalies in the 'test' data
5. Use a function to return the anomalous entry from the original data frame based on the output of the CP_APR function
6. Create a new data frame of anomalies

Finally, we may verify this process through other means such as HTM studio for a subset or other anomaly detection techniques. We may also use the original red team authentication data to determine whether the events given there were picked up by the CP_APR method.

First we import the libraries we need.

In [1]:
from pyCP_APR import CP_APR

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import os.path
import gzip
import shutil
import datetime
import networkx as nx
import pickle
from scipy import stats
from scipy import sparse
import bz2
import random
random.seed(1134)

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor

from IPython.display import clear_output

### Original Data

Now we import the original data.

In [2]:
try:
    print('Attempting to read entire data set.')
    authentication_data = pd.read_csv('../Data/Authentication data.gz', compression='gzip', index_col = 0)
    process_data = pd.read_csv('../Data/Process data.gz', compression='gzip', index_col = 0)
except Exception:
    clear_output()
    print('Unable to read entire data set, reading from original files.')
    rootdir = 'C:/Users/corri/OneDrive/Documents/Uni/Postgraduate/Final Project/LANL/ATI Data/Summaries/wls'
    unzippeddir = 'C:/Users/corri/OneDrive/Documents/Uni/Postgraduate/Final Project/LANL/ATI Data/Summaries/wls/Unzipped'
    frames = []

    count = 0

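    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            if file.endswith('.gz'):
                # join with subdir (not rootdir) so files in nested folders resolve
                filedir = os.path.join(subdir, file)
                # read from the opened gzip handle (text mode)
                with gzip.open(filedir, 'rt') as f:
                    df = pd.read_csv(f, header=None)
                    frames.append(df)
                # count authentication rows so the two data sets can be separated later
                if 'authentications' in str(file):
                    count = count + len(df)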

    df = pd.concat(frames)

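    # the first `count` rows are treated as authentication rows; this relies on
    # the authentication files having been read before the process files
    authentication_data = df[:count]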
    authentication_data.columns = ['UserName', 'SrcDevice','DstDevice', 'Authent Type', 'Failure', 'DailyCount']

    process_data = df[count:]
    process_data = process_data[[0,1,2,3,4]]
    process_data.columns = ['UserName', 'Device', 'ProcessName', 'ParentProcessName', 'DailyCount']

    authentication_data.to_csv('../Data/Authentication data.gz', header=True, compression='gzip')
    process_data.to_csv('../Data/Process data.gz', header=True, compression='gzip')

Attempting to read entire data set.




### Other required data

#### Possible Username Lists

We need a list of usernames to consider for training and testing. To begin with we consider all usernames for both, and we reduce these lists as we go.

In [3]:
train_users = list(authentication_data['UserName'].unique())
test_users = list(authentication_data['UserName'].unique())

#### Authentication Types

We'll need a dictionary of authentication types for later use.

In [4]:
a_t = list(authentication_data['Authent Type'].unique())
AT_dict = { i : a_t[i] for i in range(0, len(a_t) ) }

#### Authentication Day Starts

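The code below finds the positions where each day begins in the authentication data. An index value of 0 marks the first row of a day, and the total length is appended to close off the final day.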

In [5]:
auth_index_list = authentication_data.index.tolist()
auth_start_days = [i for i, e in enumerate(auth_index_list) if e == 0]
auth_start_days.append(len(authentication_data))
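
As a minimal sketch of this idea, the toy example below assumes each day's rows were read with their own zero-based index before being concatenated, so the index restarts at 0 at every day boundary. The frame `toy_auth` and its contents are made up for illustration and are not the real data.

```python
import pandas as pd

# Toy stand-in for authentication_data: three "days" of 3, 2 and 4 rows,
# each with its own zero-based index, concatenated together.
toy_days = [pd.DataFrame({'UserName': ['u'] * k}) for k in (3, 2, 4)]
toy_auth = pd.concat(toy_days)

# A position whose index label is 0 marks the first row of a day.
toy_index = toy_auth.index.tolist()
toy_start_days = [i for i, e in enumerate(toy_index) if e == 0]

# Appending the total length closes the range of the final day.
toy_start_days.append(len(toy_auth))

print(toy_start_days)  # [0, 3, 5, 9]
```

Day `d` then occupies rows `toy_start_days[d]` up to (but not including) `toy_start_days[d+1]`.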

### Step 1: DataFrame Creation

This first function splits a data frame into n roughly equal chunks. Since we need to split each day into 8 or 24 periods, we use this function to split the day's rows into equally sized groups. This may not be perfectly representative of the actual hour split, but it should be a good estimate since we don't have the original time stamps.

In [6]:
def split_dataframe(df,n):
    # split df into n consecutive chunks of roughly equal row count
    chunks = list()
    chunk_size = int(np.round(df.shape[0]/n))
    num_chunks = n
    for i in range(num_chunks):
        if i != num_chunks-1:
            chunks.append(df[i*chunk_size:(i+1)*chunk_size])
        else:
            # the last chunk takes whatever rows remain
            chunks.append(df[i*chunk_size:])
    return chunks
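
To see how this chunking rule behaves on a small input, the sketch below reproduces the same rounding rule on a made-up 14-row frame and compares it with an alternative based on `np.array_split`. The alternative is only an illustration of a different way to balance chunk sizes, not what the notebook uses.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'row': range(14)})  # a toy "day" of 14 rows
n = 8

# Rule used by split_dataframe: a fixed, rounded chunk size, with the last
# chunk taking whatever rows remain.
chunk_size = int(np.round(len(toy) / n))
rounded_sizes = []
for i in range(n):
    stop = (i + 1) * chunk_size if i != n - 1 else len(toy)
    rounded_sizes.append(len(toy[i * chunk_size:stop]))

# Alternative: split the row positions so chunk sizes differ by at most one.
even_chunks = [toy.iloc[idx] for idx in np.array_split(np.arange(len(toy)), n)]
even_sizes = [len(c) for c in even_chunks]

print(rounded_sizes)  # [2, 2, 2, 2, 2, 2, 2, 0]
print(even_sizes)     # [2, 2, 2, 2, 2, 2, 1, 1]
```

When the number of rows is small relative to n, the rounded chunk size can leave the final chunk empty. For days with many rows the effect is negligible.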

This function creates the required data frames. It takes a username and the number of periods per day (8 or 24) and returns a data frame of the user's authentication events, counted by type for each period over the 90 days.

In [7]:
def auth_type_un_df(user,n):
    # authentication types, in the same order used for AT_dict
    auth_types = list(authentication_data['Authent Type'].unique())
    auth_type_df = pd.DataFrame(index = auth_types)
    auth_type_dict = {}

    for i in range(len(auth_start_days)-1):
        # rows for day i, split into n periods
        chunks = split_dataframe(authentication_data[auth_start_days[i]:auth_start_days[i+1]],n)
        for j in range(n):
            data = chunks[j]
            # number of events of each authentication type for this user in period j
            auth_type_data = data[data['UserName'] == user].groupby('Authent Type').size()
            auth_type_dict[i*n + j] = auth_type_df.index.to_series().map(auth_type_data.to_dict())

    # rows: time periods, columns: authentication types; missing counts become 0
    auth_type_df = pd.DataFrame(data=auth_type_dict,index = auth_types)
    auth_type_df = auth_type_df.transpose()
    auth_type_df = auth_type_df.fillna(0)

    return auth_type_df
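
The shape of the resulting data frame can be illustrated on toy data. The sketch below assumes a precomputed `period` column (hypothetical; the notebook derives periods by chunking each day) and made-up type labels `TypeA` and `TypeB`. It uses `pd.crosstab` rather than the loop above, purely to show the target layout: one row per time period, one column per authentication type, zeros where the user has no events.

```python
import pandas as pd

toy = pd.DataFrame({
    'period':       [0, 0, 0, 1, 1, 2],  # hypothetical time-period label
    'UserName':     ['a', 'a', 'b', 'a', 'b', 'b'],
    'Authent Type': ['TypeA', 'TypeB', 'TypeA', 'TypeA', 'TypeB', 'TypeB'],
})

auth_types = list(toy['Authent Type'].unique())
periods = sorted(toy['period'].unique())

# Count user a's events per (period, authentication type) and fill in
# periods and types with no events as 0.
user_rows = toy[toy['UserName'] == 'a']
counts = (pd.crosstab(user_rows['period'], user_rows['Authent Type'])
            .reindex(index=periods, columns=auth_types, fill_value=0))

print(counts)
```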

This function creates the inputs for our CP_APR model. We pass a list of usernames and the number of periods per day, and it returns the values of the non-zero entries in our data matrices along with the corresponding set of coordinate tuples (i,j,e). Here i is the row of the matrix (the time period), j is the column (the authentication type) and e is the position of the username in the list. We can instead pass a list containing a single username, which returns this for just one user, but the function is written to run over all users when required.

In [8]:
def sparse_df(usernamelist,n):
    
    coords = []
    vals_list = []
    
    for e,user in enumerate(usernamelist):
        df = auth_type_un_df(user,n)
    
        s = sparse.coo_matrix(df)
        co = [[s.row[i],s.col[i],e] for i in range(len(s.row))]
        vals = s.data
        
        coords.append(co)
        vals_list.append(vals)
    
    coords = np.array([item for sublist in coords for item in sublist])
    vals_list = np.array([item for sublist in vals_list for item in sublist])
    
    return vals_list, coords
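
As a small worked example of this coordinate format, the sketch below builds the (i,j,e) tuples for two made-up users, each with a 2x2 count matrix. The matrices are invented for illustration only.

```python
import numpy as np
from scipy import sparse

# Two toy users, each with a (time period x authentication type) count matrix.
user_matrices = [
    np.array([[0, 2], [1, 0]]),
    np.array([[3, 0], [0, 0]]),
]

coords, vals = [], []
for e, dense in enumerate(user_matrices):
    s = sparse.coo_matrix(dense)  # keeps only the non-zero entries
    coords.extend([i, j, e] for i, j in zip(s.row, s.col))
    vals.extend(s.data)

coords = np.array(coords)
vals = np.array(vals)

print(coords.tolist())  # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(vals.tolist())    # [2, 1, 3]
```

Each row of `coords` lines up with the entry at the same position in `vals`, so the pair describes a sparse three-way tensor of time period, authentication type and user.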

### Step 2: Determining Training Data

Here we determine the training data we'll use in our final model.
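
One possible way to shortlist 'normal' usernames is sketched below with an Isolation Forest. This is only an illustration under stated assumptions: the per-user feature matrix is randomly generated, and the feature choice, `contamination` value and variable names are hypothetical rather than the selection actually used in this project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical per-user features, e.g. total events per authentication type.
features = rng.poisson(lam=20, size=(50, 3)).astype(float)
features[0] = [500, 400, 300]  # one deliberately extreme user

# contamination is the assumed share of outlying users (illustrative value).
forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = forest.fit_predict(features)  # +1 = inlier, -1 = outlier

# Rows labelled as inliers are candidates for the training set.
candidate_train_rows = np.flatnonzero(labels == 1)
print(len(candidate_train_rows), 'of', len(features), 'users kept as candidates')
```

The same pattern applies to `LocalOutlierFactor`, which also exposes `fit_predict` and uses the same +1/-1 labelling.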

### Step 3: Train the CP_APR model

Here we define our CP_APR model. We then train it on the data we have determined to be 'normal' above to teach the model what is likely to be normal activity in the authentication sense.

In [9]:
cp_apr = CP_APR(n_iters=10, random_state=42, verbose=200, method='numpy', return_type='numpy')

In [10]:
#factors = cp_apr.fit(coords=train_coords, values=train_vals)
#factors

### Step 4: Apply the CP_APR model to the actual data

Here we apply the model to the data we want to find anomalies in. This data will then be used to find the final set of anomalies to pass into the final stage of our project.

In [11]:
#p_values = cp_apr.predict_scores(coords=test_coords, values=test_vals)
#p_values

### Step 5: Obtain the data frame of anomalies

Here we use the p-values found above to retrieve the final set of anomalies from the original data frame.

This function returns a single anomaly. It takes the test coordinates array (the coordinates of the actual data we look for anomalies in), the entry value (the position of the anomaly in the array output by our CP_APR model) and n, the number of periods we split each day into.

In [12]:
def orig_finder(test_coords, entry_val, n):
    
    # gets the coordinates of the anomalous entry
    orig_co = test_coords[entry_val]
    
    # gets the authentication type
    authent = AT_dict[orig_co[1]]
    
    # gets the username of the individual the anomaly occurred with
    username = test_users[orig_co[2]]
    
    # gets the day the anomaly occurred (n is the number of periods we split each day into)
    day = int(orig_co[0] // n)
    
    # gets the period within the day the anomaly occurred in
    hour = orig_co[0] - n * day
    
    # gets the n hour chunks for that day
    chunks = split_dataframe(authentication_data[auth_start_days[day]:auth_start_days[day+1]],n)
    
    # gets the hour
    data = chunks[hour]
    
    # finds the anomaly
    anom = data[(data['UserName'] == username) & (data['Authent Type'] == authent)]
    
    return anom
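
The day and period arithmetic in this function can be checked on a single number. The sketch below uses a made-up row index of 53 with n = 24 and Python's built-in `divmod`, which gives the same result as the integer division and subtraction above.

```python
n = 24    # periods per day
row = 53  # made-up row index from the coordinates array

# Row indices are laid out as day * n + period, so divmod recovers both parts.
day, hour = divmod(row, n)

print(day, hour)  # 2 5
```

Row 53 therefore refers to period 5 of day 2 (both counted from zero), since 2 * 24 + 5 = 53.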

The p-values array used below will be the output of the CP_APR model. We then set a threshold on these scores to determine what we will class as an anomaly. Using the np.where function we find all entries that fall below the threshold and return a data frame of the anomalies that we have found.

In [13]:
#frames = []
#threshold = 0.05

#for entry in np.where(p_values < threshold)[0]:
#    anom = orig_finder(test_coords, entry, 24)
#    frames.append(anom)
    
#anomalies = pd.concat(frames)
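
The thresholding step can be tried on its own with made-up scores. In the sketch below the `p_values` array is invented for illustration, and `np.flatnonzero` is used as a shorthand for `np.where(...)[0]` on a one-dimensional array.

```python
import numpy as np

p_values = np.array([0.40, 0.01, 0.73, 0.049, 0.05])  # made-up scores
threshold = 0.05

# Positions of entries strictly below the threshold.
anomalous_entries = np.flatnonzero(p_values < threshold)

print(anomalous_entries.tolist())  # [1, 3]
```

Because the comparison is strict, an entry exactly equal to the threshold is not flagged.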