# Reducing the process and authentication data for easier usage

In this notebook I apply techniques that I have researched and experimented with in the other notebooks in this folder to reduce the size of the Process and Authentication data, so that it is easier to use with other methods.

We start by importing various modules; check [here](https://github.com/Galeforse/Advanced-Cyber-Analytics-for-Attack-Detection/blob/main/Modules/01%20-%20Using%20Modules.ipynb) for information about using custom modules.

In [1]:
import sys
sys.path.insert(0,'G:/Users/Gabriel/Documents/Education/UoB/GitHubDesktop/Advanced-Cyber-Analytics-for-Attack-Detection/Modules/')
import pandas as pd
import numpy as np
from dt import *
from startup_g import *
import networkx as nx
from pyvis.network import Network

The import step itself is wrapped in functions from the custom modules imported above, which saves space in this notebook.

In [2]:
df_p = process_import()
df_a = auth_import()

Looking for local copy of Process data...




Process data fetched locally in 0:01:12.035862
Looking for local copy of Auth data...
Auth data fetched locally in 0:00:23.062016


In [3]:
index_list = df_p.index.tolist()
proc_start_days = [i for i, e in enumerate(index_list) if e == 0]
proc_start_days.append(len(df_p))
df_p_d1 = df_p[proc_start_days[0]:proc_start_days[1]]
df_p_d1.head()

Unnamed: 0,UserName,Device,ProcessName,ParentProcessName,DailyCount
0,Comp748297$,Comp748297,Proc391839.exe,Proc387473,1
1,Comp563664$,Comp563664,rundll32.exe,services,1
2,User607396,Comp609111,Proc417435.exe,Proc417435,1
3,Comp641702$,Comp641702,Proc249569.exe,services,1
4,Comp157389$,Comp157389,Proc402696.exe,services,1


Above, I split the data by day, following the indexing approach used by Matt, so that I could test the techniques below on a smaller subset and keep computation manageable. I have since deleted that testing, but the split is left in for potential future experimentation.
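The day split relies on the row index restarting at 0 for each day. A minimal sketch of the same idea on a toy dataframe (the data and the names `toy`, `day_starts` and `days` are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the process data: three "days" stacked together,
# each day keeping its own row index that restarts at 0.
toy = pd.concat(
    [
        pd.DataFrame({"UserName": ["u1", "u2", "u3"]}),
        pd.DataFrame({"UserName": ["u1", "u4"]}),
        pd.DataFrame({"UserName": ["u5"]}),
    ]
)

# Positions where the index label is 0 mark the first row of each day.
index_list = toy.index.tolist()
day_starts = [pos for pos, label in enumerate(index_list) if label == 0]
day_starts.append(len(toy))  # sentinel so the last day has an end position

# Slice each day by position (iloc), not by label.
days = [toy.iloc[start:end] for start, end in zip(day_starts, day_starts[1:])]
day_sizes = [len(d) for d in days]
print(day_sizes)
```

Each element of `days` is one day's worth of rows, so `days[0]` plays the role of `df_p_d1` above.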

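The function below was originally developed in another of my notebooks. It takes a dataset selector (`"p"` for process, `"a"` for authentication), counts how many times each distinct entry appears in a chosen column, then separates out any entries whose count is at or above `upper_value` or at or below `lower_value`. It returns both the counts that are kept and the counts that are discarded.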

In [4]:
def data_slicer(dataset,variable,upper_value,lower_value):
    """Cuts down data based on counts of entries
    
    Keyword arguments:
    dataset -- "p" to use the process data (df_p) or "a" to use the authentication data (df_a)
    variable -- name of column in string format to reduce by
    upper_value -- any entry of the column with a count of this value or higher will be removed
    lower_value -- any entry of the column with a count of this value or lower will be removed. set to 0 to not cut any from the lower end of the data.

    Returns two dataframes of [variable, 'Count'] rows: the entries kept and the entries discarded.
    """
    # Select the global dataframe to work on
    if dataset == "p":
        df = df_p
    elif dataset == "a":
        df = df_a
    else:
        raise ValueError('dataset must be "p" or "a"')
    # Count occurrences of each entry, largest count first
    z = df.groupby(variable).size().sort_values(ascending=False)
    z = z.reset_index()
    z.columns = [variable,'Count']
    # Entries strictly between the two limits are kept, the rest are discarded
    keep = (z["Count"] < upper_value) & (z["Count"] > lower_value)
    counts_reduced = z[keep]
    counts_discard = z[~keep]
    return counts_reduced, counts_discard
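To make the thresholding concrete, here is a small self-contained sketch of the same count-and-split idea on toy data. The helper `slice_by_count` and the toy dataframe are hypothetical and take the dataframe as an argument rather than reading `df_p` or `df_a`; it is an illustration, not the function used for the results in this notebook.

```python
import pandas as pd

def slice_by_count(df, column, upper_value, lower_value=0):
    """Hypothetical helper: split the values of `column` by how often they occur."""
    counts = (
        df.groupby(column)
        .size()
        .reset_index(name="Count")
        .sort_values("Count", ascending=False)
    )
    # Keep values whose count lies strictly between the two limits.
    keep = (counts["Count"] < upper_value) & (counts["Count"] > lower_value)
    return counts[keep], counts[~keep]

# "a" appears 5 times, "b" 3 times, "c" once.
toy = pd.DataFrame({"UserName": ["a"] * 5 + ["b"] * 3 + ["c"] * 1})
kept, discarded = slice_by_count(toy, "UserName", upper_value=5, lower_value=0)
print(kept)
print(discarded)
```

With `upper_value=5`, the value that appears exactly 5 times is discarded, which shows that the upper limit itself is excluded.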

The following block of code applies the `data_slicer` defined above to our datasets. It has to be run twice, once for process and once for authentication. The columns and thresholds are hard-coded in the function; I chose the values based on boxplots that I plotted in another notebook. The boxplots can be found [here for authentication](https://github.com/Galeforse/Advanced-Cyber-Analytics-for-Attack-Detection/tree/main/Gabriel/Plots/Boxplot/Auth) and [here for process](https://github.com/Galeforse/Advanced-Cyber-Analytics-for-Attack-Detection/tree/main/Gabriel/Plots/Boxplot/Process).

The function returns the reduced data, which we assign to a new cut-down variable when calling it.

In [5]:
def total_data_reducer(dataset):
    if dataset == "p":
        counts_p_reduced_un, counts_p_discard_un = data_slicer("p","UserName",7600,0)
        counts_p_reduced_d, counts_p_discard_d = data_slicer("p","Device",9500,0)
        counts_p_discard_un = counts_p_discard_un["UserName"].tolist()
        counts_p_discard_d = counts_p_discard_d["Device"].tolist()
        print("Starting length of dataframe: "+str(len(df_p)))
        df_p_cut = df_p[~df_p["UserName"].isin(counts_p_discard_un)]
        print("1st drop length of dataframe: "+str(len(df_p_cut)))
        df_p_cut = df_p_cut[~df_p_cut["Device"].isin(counts_p_discard_d)]
        print("Final length of dataframe: "+str(len(df_p_cut)))
        return df_p_cut
    elif dataset == "a":
        counts_a_reduced_un, counts_a_discard_un = data_slicer("a","UserName",1600,0)
        counts_a_reduced_sd, counts_a_discard_sd = data_slicer("a","SrcDevice",3300,0)
        #counts_a_reduced_dd, counts_a_discard_dd = data_slicer("a","DstDevice",350,0)
        counts_a_discard_un = counts_a_discard_un["UserName"].tolist()
        counts_a_discard_sd = counts_a_discard_sd["SrcDevice"].tolist()
        #counts_a_discard_dd = counts_a_discard_dd["DstDevice"].tolist()
        print("Starting length of dataframe: "+str(len(df_a)))
        df_a_cut = df_a[~df_a["UserName"].isin(counts_a_discard_un)]
        print("1st drop length of dataframe: "+str(len(df_a_cut)))
        df_a_cut = df_a_cut[~df_a_cut["SrcDevice"].isin(counts_a_discard_sd)]
        #print("2nd drop length of dataframe: "+str(len(df_a_cut)))
        #df_a_cut = df_a_cut[~df_a_cut["DstDevice"].isin(counts_a_discard_dd)]
        print("Final length of dataframe: "+str(len(df_a_cut)))
        return df_a_cut

In [6]:
dtn()
df_p_cut1 = total_data_reducer("p")
gen_end()
print("")
dtn()
df_a_cut1 = total_data_reducer("a")
gen_end()

Starting length of dataframe: 55981618
1st drop length of dataframe: 55334015
Final length of dataframe: 53171763
Completed in :0:00:33.430418

Starting length of dataframe: 15953681
1st drop length of dataframe: 12940224
Final length of dataframe: 11793285
Completed in :0:00:10.142453
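The two-stage drop can be sketched on toy data as follows. As in `total_data_reducer`, both discard lists are computed from the full table before any rows are removed. The helper `over_threshold`, the toy data and the thresholds are assumptions made for illustration only.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "UserName": ["u1", "u1", "u1", "u2", "u3"],
        "Device": ["d1", "d1", "d2", "d1", "d2"],
    }
)

def over_threshold(df, column, upper_value):
    """Hypothetical helper: values of `column` occurring `upper_value` times or more."""
    counts = df[column].value_counts()
    return counts[counts >= upper_value].index.tolist()

# Both discard lists come from the full table, before any rows are dropped.
discard_users = over_threshold(toy, "UserName", 3)
discard_devices = over_threshold(toy, "Device", 3)

# Drop rows in two steps, mirroring the "1st drop" and "Final" lengths printed above.
step1 = toy[~toy["UserName"].isin(discard_users)]
step2 = step1[~step1["Device"].isin(discard_devices)]
print(len(toy), len(step1), len(step2))
```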


The `data_reducer` function below does something similar to the above, but this time uses network analysis to drop the small data. We do not want to drop every entry that has only one or a few connections, because it may be attached to a larger cluster and could therefore be relevant. By dropping small connected components instead (those with fewer than `comp` nodes), we avoid losing these entries and remove only small isolated clusters of connections, which of course includes pairs that only appear on their own. After dropping these nodes, the remaining source and target nodes are listed and used to reduce the data further, so that only the rows that were not removed are included.

In [7]:
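def data_reducer(df,sourc,targ,comp):
    """Drops rows that belong to small connected components

    Keyword arguments:
    df -- pandas dataframe to cut down
    sourc -- name of the column used as the edge source
    targ -- name of the column used as the edge target
    comp -- components with fewer nodes than this are removed
    """
    # Build an undirected graph with one edge per (sourc, targ) pair
    G = nx.from_pandas_edgelist(df,source=sourc,target=targ)
    # Remove every node that sits in a component smaller than comp
    for component in list(nx.connected_components(G)):
        if len(component)<comp:
            for node in component:
                G.remove_node(node)
    # Collect the endpoints of the surviving edges.
    # Note: G is undirected, so each edge tuple is not guaranteed to be ordered (sourc, targ)
    source_edge = []
    targ_edge = []
    for e in G.edges():
        source, target = e
        source_edge.append(source)
        targ_edge.append(target)
    # Remove duplicates while keeping order
    source_edge = list(dict.fromkeys(source_edge))
    targ_edge = list(dict.fromkeys(targ_edge))
    # Keep only rows whose source and target both appear in the lists above
    df_cut = df[df[sourc].isin(source_edge)]
    df_cut = df_cut[df_cut[targ].isin(targ_edge)]
    return df_cut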
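One point worth checking when reusing this approach: the graph is undirected, so `G.edges()` yields each edge as a pair with no guarantee that the first element came from the source column. A possible variant, sketched below under that assumption, filters on the set of surviving nodes instead of on edge endpoints. The function name `reduce_by_component` and the toy data are hypothetical, and this is not the function that produced the figures in this notebook.

```python
import networkx as nx
import pandas as pd

def reduce_by_component(df, source_col, target_col, min_size):
    """Hypothetical variant: keep rows whose endpoints both lie in a
    connected component with at least `min_size` nodes."""
    G = nx.from_pandas_edgelist(df, source=source_col, target=target_col)
    keep_nodes = set()
    for component in nx.connected_components(G):
        if len(component) >= min_size:
            keep_nodes |= component
    # Membership in the node set does not depend on edge orientation.
    mask = df[source_col].isin(keep_nodes) & df[target_col].isin(keep_nodes)
    return df[mask]

toy = pd.DataFrame(
    {
        "UserName": ["u1", "u2", "u3", "u4", "u9"],
        "Device": ["d1", "d1", "d1", "d2", "d9"],
    }
)
# Components: {u1, u2, u3, d1} (4 nodes), {u4, d2} (2 nodes), {u9, d9} (2 nodes)
reduced = reduce_by_component(toy, "UserName", "Device", min_size=4)
print(reduced)
```

Here `u2` and `u3` each have a single connection, but they are kept because they attach to the larger cluster around `d1`, while the two isolated pairs are dropped.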

In [8]:
x = len(df_p_cut1)
dtn()
print("length of process data before calling data_reducer: "+str(x))
df_p_cut = data_reducer(df_p_cut1,"UserName","Device",5)
y = len(df_p_cut)
print("length of process data after calling data_reducer: "+str(y))
print("Data reduced in size by "+str(100 - round((y/x)*100,2))+" %.")
print("")
gen_end()

length of process data before calling data_reducer: 53171763
length of process data after calling data_reducer: 34298709
Data reduced in size by 35.489999999999995 %.

Completed in :0:02:45.020675


In [9]:
x = len(df_a_cut1)
dtn()
print("length of auth data before calling data_reducer: "+str(x))
df_a_cut = data_reducer(df_a_cut1,"UserName","SrcDevice",5)
y = len(df_a_cut)
print("length of auth data after calling data_reducer: "+str(y))
print("Data reduced in size by "+str(100 - round((y/x)*100,2))+" %.")
print("")
gen_end()

length of auth data before calling data_reducer: 11793285
length of auth data after calling data_reducer: 6927651
Data reduced in size by 41.26 %.

Completed in :0:00:40.158020


Total process reduction percentage after applying both methods:

In [10]:
100-round((len(df_p_cut)/len(df_p))*100,2)

38.73

Total authentication reduction percentage after applying both methods:

In [11]:
100-round((len(df_a_cut)/len(df_a))*100,2)

56.58
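Some of the printed percentages, such as `35.489999999999995`, show floating-point noise because the subtraction from 100 happens after rounding. A small sketch of a helper that rounds last instead (the name `reduction_pct` is hypothetical; the row counts are the ones printed above):

```python
def reduction_pct(before, after):
    """Hypothetical helper: percentage of rows removed, rounded to 2 decimal places."""
    # Rounding is the final step, so no arithmetic happens on the rounded value.
    return round(100 * (1 - after / before), 2)

process_pct = reduction_pct(53171763, 34298709)  # process data, network step
auth_pct = reduction_pct(11793285, 6927651)      # auth data, network step
print(process_pct, auth_pct)
```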

We then save these reduced dataframes to `.csv.gz` files.

In [12]:
df_p_cut.to_csv("G:/Users/Gabriel/Documents/Education/UoB/GitHubDesktop/Advanced-Cyber-Analytics-for-Attack-Detection/Data/Reduced/proc_data_reduced.csv.gz",compression="gzip")

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.


KeyboardInterrupt



In [None]:
df_a_cut.to_csv("G:/Users/Gabriel/Documents/Education/UoB/GitHubDesktop/Advanced-Cyber-Analytics-for-Attack-Detection/Data/Reduced/auth_data_reduced.csv.gz",compression="gzip")
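A minimal sketch of the save and reload round trip on toy data, written to a temporary directory. The file name and data are made up. Passing `index=False` is an optional choice assumed here: without it, the row index is written as an extra unnamed column, which shows up as `Unnamed: 0` when the file is read back.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {"UserName": ["u1", "u2"], "Device": ["d1", "d2"], "DailyCount": [1, 3]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_reduced.csv.gz")  # hypothetical file name
    # Write gzip-compressed CSV without the row index.
    toy.to_csv(path, compression="gzip", index=False)
    # read_csv infers gzip compression from the .gz suffix by default.
    restored = pd.read_csv(path)

print(restored)
```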

## Further Experimentation

The `experiment` function below repeats the network-based reduction with a different minimum component size `numb`, so that the effect of the threshold can be compared. Note that in these runs the authentication data passed to `data_reducer` has 1219138 rows rather than the 11793285 rows produced above, so the authentication percentages are not directly comparable with the earlier results.

In [147]:
def experiment(numb):
    x = len(df_p_cut1)
    print("length of process data before calling data_reducer: "+str(x))
    df_p_cut2 = data_reducer(df_p_cut1,"UserName","Device",numb)
    y = len(df_p_cut2)
    print("length of process data after calling data_reducer: "+str(y))
    print("Data reduced in size by "+str(100 - round((y/x)*100,2))+" %.")
    print("")
    x = len(df_a_cut1)
    print("length of auth data before calling data_reducer: "+str(x))
    df_a_cut2 = data_reducer(df_a_cut1,"UserName","SrcDevice",numb)
    y = len(df_a_cut2)
    print("length of auth data after calling data_reducer: "+str(y))
    print("Data reduced in size by "+str(100 - round((y/x)*100,2))+" %.")
    print("")
    print("Total process reduction percentage after applying both methods:")
    print(100-round((len(df_p_cut2)/len(df_p))*100,2))
    print("")
    print("Total authentication reduction percentage after applying both methods:")
    print(100-round((y/len(df_a))*100,2))

In [148]:
experiment(4)

length of process data before calling data_reducer: 53171763
length of process data after calling data_reducer: 34800345
Data reduced in size by 34.55 %.

length of auth data before calling data_reducer: 1219138
length of auth data after calling data_reducer: 667614
Data reduced in size by 45.24 %.

Total process reduction percentage after applying both methods:
37.84

Total authentication reduction percentage after applying both methods:
95.82


In [149]:
experiment(3)

length of process data before calling data_reducer: 53171763
length of process data after calling data_reducer: 40123718
Data reduced in size by 24.540000000000006 %.

length of auth data before calling data_reducer: 1219138
length of auth data after calling data_reducer: 772570
Data reduced in size by 36.63 %.

Total process reduction percentage after applying both methods:
28.33

Total authentication reduction percentage after applying both methods:
95.16


In [150]:
experiment(6)

length of process data before calling data_reducer: 53171763
length of process data after calling data_reducer: 33504094
Data reduced in size by 36.99 %.

length of auth data before calling data_reducer: 1219138
length of auth data after calling data_reducer: 638470
Data reduced in size by 47.63 %.

Total process reduction percentage after applying both methods:
40.15

Total authentication reduction percentage after applying both methods:
96.0
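Each call to `experiment` rebuilds the graph for every threshold. One possible alternative, sketched here on toy data, is to compute the component size of every node once and then count the rows that would survive each threshold. The names `component_size`, `row_size` and `rows_kept` are hypothetical, and the sketch counts rows by component membership rather than calling `data_reducer`.

```python
import networkx as nx
import pandas as pd

toy = pd.DataFrame(
    {
        "UserName": ["u1", "u2", "u3", "u4", "u5", "u9"],
        "Device": ["d1", "d1", "d1", "d2", "d2", "d9"],
    }
)
G = nx.from_pandas_edgelist(toy, source="UserName", target="Device")

# Map every node to the size of its connected component (computed once).
component_size = {}
for component in nx.connected_components(G):
    for node in component:
        component_size[node] = len(component)

# Both endpoints of a row share a component, so the source column is enough.
row_size = toy["UserName"].map(component_size)

# Rows kept for each minimum component size.
rows_kept = {t: int((row_size >= t).sum()) for t in (3, 4, 5)}
print(rows_kept)
```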
