# Intro

We are wrangling data from simulated APT activity that was captured on a mock production network with the goal of creating a "realistic, semi-synthetic" dataset. I will document some of the process I followed, since a good deal of it was exploratory, and cover the changes I had to make to the original dataset.

To limit scope while still covering both host and network logs, I will only wrangle the Netflow data, the Linux auth and audit logs, and the Windows Security events.

The source for this dataset can be found here: https://doi.org/10.1016/j.comnet.2023.109688

The original data can be found here: https://www.kaggle.com/datasets/ernie55ernie/unraveled-advanced-persistent-threats-dataset/data

Due to the size of the data (835 MB ZIP, decompressed to 4.43 GB of plaintext and binary data), the download takes some time. I used the following command:

```zsh
  curl -L -o ~/Downloads/unraveled-advanced-persistent-threats-dataset.zip \
    https://www.kaggle.com/api/v1/datasets/download/ernie55ernie/unraveled-advanced-persistent-threats-dataset
```
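
Before extracting, it can be handy to check what the archive actually contains. The sketch below uses Python's standard `zipfile` module to total the compressed and uncompressed sizes. The function name `summarize_zip` is my own, and the tiny in-memory archive only stands in for the real download.

```python
import io
import zipfile


def summarize_zip(source):
    """Return (file_count, compressed_bytes, uncompressed_bytes) for a ZIP archive."""
    with zipfile.ZipFile(source) as zf:
        infos = [info for info in zf.infolist() if not info.is_dir()]
        compressed = sum(info.compress_size for info in infos)
        uncompressed = sum(info.file_size for info in infos)
    return len(infos), compressed, uncompressed


# Stand-in archive so the example runs without the real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("network-flows/sample.csv", "a,b\n" + "1,2\n" * 1000)
    zf.writestr("host-logs/sample.log", "type=USER_START\n" * 1000)

count, compressed, uncompressed = summarize_zip(buffer)
print(f"{count} files, {compressed} B compressed, {uncompressed} B uncompressed")
```

For the real file, passing the path of the downloaded ZIP to `summarize_zip` should work the same way.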

While cleaning up the data, we should keep our hypothesis in mind and trim away anything that is unlikely to help us support or reject it.

Hypothesis: Choose one
- APTs adjust TTPs in response to defensive measures and signs of detection.
- We can better detect APTs based on their TTPs versus specific artifacts
  - Testing the highest scoring/most important features of a model

# Setup

Below I threw together a few helper functions to solve a couple problems I ran into when trying to load CSVs.

In [1]:
import pandas as pd
import numpy as np
import os
import re
import chardet

%matplotlib inline

def get_encoding(path):
    with open(path, 'rb') as f:
        raw = f.read(4096)  # read first 4 KB

        # Use chardet lib to detect the encoding
        result = chardet.detect(raw)

        return result['encoding']

def get_files_recurse(path):
    result = []
    
    # For each file, append its full path to a list
    for root, dirs, files in os.walk(path):
        for file in files:
            fullpath = os.path.join(root, file)
            result.append(fullpath)
            
    return result

def load_all_csv(path, sep=',', recurse=False, verbose=False, encoding='auto'):
    # os.path.join works whether or not `path` ends with a separator
    files = [os.path.join(path, x) for x in os.listdir(path)] if not recurse else get_files_recurse(path)
    d = dict()
    
    # For each file, check its encoding scheme, then store as DF in dict with fullpath as key
    for f in files:
        if not os.path.isfile(f):
            continue
        
        if verbose:
            print(f)
        
        enc = get_encoding(f) if encoding == 'auto' else encoding
        d[f] = pd.read_csv(f, delimiter=sep, encoding=enc)
        
    # Concatenate all DFs in the dictionary, ignoring the indexes so they don't collide
    df = pd.concat(d.values(), ignore_index=True)
        
    return df
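
`chardet` only guesses from the first 4 KB, so the guess can be wrong for files whose unusual bytes appear later. One possible fallback, sketched here with the standard library only, is to try a short list of candidate encodings in order. The helper name `first_working_encoding` and the candidate list are assumptions for illustration, not part of the notebook.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


utf8_bytes = "Activity,Stage\nnaïve,Benign\n".encode("utf-8")
legacy_bytes = "Activity,Stage\nnaïve,Benign\n".encode("cp1252")

print(first_working_encoding(utf8_bytes))    # utf-8
print(first_working_encoding(legacy_bytes))  # cp1252
```

`latin-1` goes last because it accepts any byte sequence, so it never fails and would otherwise hide a better match.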

# Network Logs

## Netflow

In [202]:
path = '../data/unraveled-apt/network-flows/'

df = load_all_csv(path, recurse=True)

  d[f] = pd.read_csv(f, delimiter=sep, encoding=enc)

In [203]:
dropme = [
    # Identifiers
    'fgid', 'id', ' id',
    
    # Redundant and unneeded Layer 2/3 info
    'src_oui', 'dst_oui', 'tunnel_id', 'ip_version',
    'vlan_id',
    
    # Sparse application metadata
    'requested_server_name', 'client_fingerprint', 
    'content_type', 'application_is_guessed',
    
    # Redundant bidirectional stats
    'bidirectional_min_ps', 'bidirectional_mean_ps', 
    'bidirectional_stddev_ps', 'bidirectional_max_ps',
    'bidirectional_min_piat_ms', 'bidirectional_mean_piat_ms',
    'bidirectional_stddev_piat_ms', 'bidirectional_max_piat_ms',
    
    # Redundant bidirectional TCP flags (keep directional)
    'bidirectional_syn_packets', 'bidirectional_cwr_packets',
    'bidirectional_ece_packets', 'bidirectional_urg_packets',
    'bidirectional_ack_packets', 'bidirectional_psh_packets',
    'bidirectional_rst_packets', 'bidirectional_fin_packets',
    
    # Potentially redundant timing
    'src2dst_last_seen_ms', 'dst2src_last_seen_ms'
]

reduced_df = df.drop(columns=dropme)

In [204]:
# Replace null values
reduced_df['Signature'] = reduced_df['Signature'].fillna('Normal')

In [205]:
reduced_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6877157 entries, 0 to 6877156
Data columns (total 62 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   Unnamed: 0                   object 
 1   expiration_id                float64
 2   src_ip                       object 
 3   src_mac                      object 
 4   src_port                     float64
 5   dst_ip                       object 
 6   dst_mac                      object 
 7   dst_port                     float64
 8   protocol                     float64
 9   bidirectional_first_seen_ms  float64
 10  bidirectional_last_seen_ms   float64
 11  bidirectional_duration_ms    float64
 12  bidirectional_packets        float64
 13  bidirectional_bytes          float64
 14  src2dst_first_seen_ms        float64
 15  src2dst_duration_ms          float64
 16  src2dst_packets              float64
 17  src2dst_bytes                float64
 18  dst2src_first_seen_ms        float64
 19  

Even after dropping a decent number of columns, we still have a DataFrame that takes up over 6 GB of memory. All of the datatypes appear to be the defaults (`float64`/`object`), so we can change those and save a fair amount.

I applied the methodology commented into the code below:

In [None]:
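# We're only working with whole numbers here.
# Note: astype('int64') raises if a float column still contains NaN, so nulls must be handled first.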
floats = reduced_df.dtypes[reduced_df.dtypes == 'float64'].index
reduced_df[floats] = reduced_df[floats].astype('int64')

# Storing some frequently referenced sets of values
int_cols = reduced_df.dtypes[reduced_df.dtypes == 'int64'].index
maxes = reduced_df[int_cols].max()


# I used unsigned ints because none of the values are negative.
# uint16 = 0-65535
uint16 = maxes < 65536
uint16 = uint16[uint16].index

# uint32 = 0-4294967295
uint32 = maxes < 4294967296  # includes uint16 cols, but we will do that type change after this one
uint32 = uint32[uint32].index

# For these, I manually checked how many nunique() they had, and it was on the lower end.
categories = [
    'src_ip', 'dst_ip', 'src_mac', 
    'dst_mac', 'expiration_id', 'application_name', 
    'user_agent', 'server_fingerprint', 'Activity', 
    'DefenderResponse', 'Signature', 'Stage',
    'application_category_name'
]

In [207]:
reduced_df[categories].nunique().sort_values()

expiration_id                    2
DefenderResponse                 3
Signature                        3
Stage                            8
Activity                        15
application_category_name       25
user_agent                      57
src_mac                        123
dst_mac                        132
server_fingerprint             180
application_name               201
src_ip                        5937
dst_ip                       18035
dtype: int64
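
To see why low-cardinality columns are good candidates for the `category` dtype, here is a small sketch on synthetic data (the series below is made up and only mimics a label column such as `Stage`). A categorical column stores each distinct value once plus a compact integer code per row.

```python
import pandas as pd

# Synthetic column: many rows, few distinct values.
stage = pd.Series(["Benign", "Lateral Movement", "Benign", "Benign"] * 25_000)

as_object = stage.astype("object").memory_usage(deep=True)
as_category = stage.astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

The saving shrinks as the number of distinct values grows, which is why checking `nunique()` first is worthwhile.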

In [208]:
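reduced_df[categories] = reduced_df[categories].astype('category')

# Assumption: `float32` is every column that is still float64 at this point.
# Caution: float32 cannot hold epoch-millisecond values exactly, so the *_ms columns lose precision.
float32 = reduced_df.select_dtypes(include='float64').columns
reduced_df[float32] = reduced_df[float32].astype('float32')
reduced_df[uint32] = reduced_df[uint32].astype('uint32')
reduced_df[uint16] = reduced_df[uint16].astype('uint16')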

reduced_df['flow_start'] = pd.to_datetime(reduced_df['bidirectional_first_seen_ms'], unit='ms')
reduced_df['flow_end'] = pd.to_datetime(reduced_df['bidirectional_last_seen_ms'], unit='ms')
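
One caveat when downcasting: `float32` has a 24-bit significand, so it cannot represent epoch-millisecond values exactly. The sketch below uses a hypothetical timestamp to show how coarse the grid becomes at that magnitude. The `flow_start` values in the preview further down (for example 18:02:55.872 and 18:05:06.944, which are 131.072 s apart) are consistent with this effect.

```python
import numpy as np

ts_ms = 1_622_052_175_123  # hypothetical epoch-millisecond timestamp from 2021

as_f32 = np.float32(ts_ms)
gap = np.spacing(as_f32)  # distance to the next representable float32 value

print("original :", ts_ms)
print("float32  :", int(as_f32))
print("grid step:", float(gap), "ms")
```

If exact timestamps matter, one option is to derive the datetime columns before downcasting, or to leave the `*_ms` columns at 64 bits.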

In [209]:
import gc

# let's free up some memory
del df
gc.collect()

5910

In [210]:
# Move the start and end timestamps to the front of the DataFrame
reduced_df = reduced_df[['flow_start', 'flow_end'] + 
                        [c for c in reduced_df.columns if c not in ['flow_start', 'flow_end']]]

reduced_df.head()

Unnamed: 0.1,flow_start,flow_end,Unnamed: 0,expiration_id,src_ip,src_mac,src_port,dst_ip,dst_mac,dst_port,...,dst2src_rst_packets,dst2src_fin_packets,application_name,application_category_name,server_fingerprint,user_agent,Activity,Stage,DefenderResponse,Signature
0,2021-05-26 18:02:55.872,2021-05-26 18:07:18.016,,0.0,10.1.2.17,fa:16:3e:a2:d6:e6,123.0,192.81.135.252,fa:16:3e:10:2d:11,123.0,...,0.0,0.0,NTP,System,,,Normal,Benign,Benign,Normal
1,2021-05-26 18:02:55.872,2021-05-26 18:07:18.016,,0.0,10.1.2.17,fa:16:3e:a2:d6:e6,123.0,74.6.168.72,fa:16:3e:10:2d:11,123.0,...,0.0,0.0,NTP,System,,,Normal,Benign,Benign,Normal
2,2021-05-26 18:05:06.944,2021-05-26 18:05:06.944,,0.0,10.1.2.17,fa:16:3e:a2:d6:e6,123.0,23.131.160.7,fa:16:3e:10:2d:11,123.0,...,0.0,0.0,NTP,System,,,Normal,Benign,Benign,Normal
3,2021-05-26 18:02:55.872,2021-05-26 18:02:55.872,,0.0,10.1.2.17,fa:16:3e:a2:d6:e6,123.0,209.115.181.108,fa:16:3e:10:2d:11,123.0,...,0.0,0.0,NTP,System,,,Normal,Benign,Benign,Normal
4,2021-05-26 18:05:06.944,2021-05-26 18:05:06.944,,0.0,10.1.2.17,fa:16:3e:a2:d6:e6,123.0,108.61.73.243,fa:16:3e:10:2d:11,123.0,...,0.0,0.0,NTP,System,,,Normal,Benign,Benign,Normal


In [212]:
reduced_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6877157 entries, 0 to 6877156
Data columns (total 64 columns):
 #   Column                       Dtype         
---  ------                       -----         
 0   flow_start                   datetime64[ns]
 1   flow_end                     datetime64[ns]
 2   Unnamed: 0                   object        
 3   expiration_id                float32       
 4   src_ip                       category      
 5   src_mac                      category      
 6   src_port                     float32       
 7   dst_ip                       category      
 8   dst_mac                      category      
 9   dst_port                     float32       
 10  protocol                     float32       
 11  bidirectional_first_seen_ms  float32       
 12  bidirectional_last_seen_ms   float32       
 13  bidirectional_duration_ms    float32       
 14  bidirectional_packets        float32       
 15  bidirectional_bytes          float32       
 16  

In [213]:
# Export as pickle to save all that hard work we did converting datatypes
reduced_df.to_pickle('../data/cleaned/netflow.pkl')

These Netflow logs should be ready for us to play around with further.
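
The reason for pickling rather than writing CSV is that pickle keeps the dtypes we just set, while CSV re-infers them on load. A minimal sketch on a made-up two-column frame (column names borrowed from the data, values invented):

```python
import io
import os
import tempfile

import pandas as pd

df_small = pd.DataFrame({
    "dst_port": pd.Series([53, 443, 123], dtype="uint16"),
    "Stage": pd.Series(["Benign", "Benign", "Lateral Movement"], dtype="category"),
})

# Round trip through pickle: dtypes survive.
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "sample.pkl")
    df_small.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)

# Round trip through CSV: dtypes are inferred again on load.
from_csv = pd.read_csv(io.StringIO(df_small.to_csv(index=False)))

print(from_pickle.dtypes.to_dict())
print(from_csv.dtypes.to_dict())
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.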

In [193]:
reduced_df = pd.read_pickle('../data/cleaned/netflow.pkl')

# Linux Host Logs

## `audit`

In [167]:
path = os.path.split(os.getcwd())[0] + '/data/unraveled-apt/host-logs/audit/'
audit_df = load_all_csv(path, sep=';')

In [168]:
audit_df.shape

(264320, 5)

On the last row, there appears to be some preceding whitespace in the LogEvent column. Let's handle that:

In [169]:
for col in audit_df.columns:
    try:
        audit_df[col] = audit_df[col].str.strip()
    except AttributeError:
        # Non-string columns have no .str accessor; leave them untouched
        continue

With that out of the way, we will need to address the log message that is sometimes nested under the `msg` field.

In [170]:
audit_df.LogEvent.iloc[111]

'type=USER_START ts=1621862701.432 tsid=600 pid=15765 uid=0 auid=0 ses=3575 msg=\'op=PAM:session_open acct="root" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success\''

```sh
type=USER_START 
ts=1621862701.432 
...
# I still want to keep this mostly intact
msg=\'op=PAM:session_open acct="root" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success\'
```

I plan to keep this by extracting `msg` out of the string, processing it separately from the rest of the log, then throwing `msg` into the rest of the log as a column.
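
For a single line, the plan above can be sketched in plain Python. This is an illustration only, not the pandas code used in the notebook, and the function name `parse_audit_line` is hypothetical.

```python
import re

MSG_RE = re.compile(r"msg='(.*)'")


def parse_audit_line(line):
    """Split an audit record into key/value fields, keeping msg as one string."""
    match = MSG_RE.search(line)
    msg = match.group(1) if match else None
    rest = MSG_RE.sub("", line)
    # Split on the first '=' only, so values containing '=' stay intact.
    fields = dict(tok.split("=", 1) for tok in rest.split() if "=" in tok)
    fields["msg"] = msg
    return fields


line = (
    "type=USER_START ts=1621862701.432 tsid=600 pid=15765 uid=0 auid=0 ses=3575 "
    "msg='op=PAM:session_open acct=\"root\" exe=\"/usr/sbin/cron\" "
    "hostname=? addr=? terminal=cron res=success'"
)

record = parse_audit_line(line)
print(record["type"], record["ts"])
print(record["msg"])
```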

In [171]:
msg_df = audit_df.LogEvent.str.extract(r"msg=('.*')")
no_msg = audit_df.LogEvent.str.replace(r"msg=('.*')", repl='', regex=True)

This is what we extracted:

In [172]:
msg_df.head(10)

Unnamed: 0,0
0,
1,
2,
3,
4,"'unit=auditd comm=""systemd"" exe=""/lib/systemd/..."
5,"'op=PAM:session_close acct=""root"" exe=""/usr/bi..."
6,"'op=PAM:setcred acct=""root"" exe=""/usr/bin/sudo..."
7,"'op=PAM:authentication acct=""ubuntu"" exe=""/usr..."
8,"'op=PAM:accounting acct=""ubuntu"" exe=""/usr/lib..."
9,"'op=PAM:authentication acct=""ubuntu"" exe=""/usr..."


Here is what the `no_msg` series looks like now. We can proceed with converting this into a DataFrame, then concatenating `msg_df` to it.

In [173]:
print("Message removed:", no_msg.iloc[111])
print("Original log:   ", audit_df.LogEvent.iloc[111])

Message removed: type=USER_START ts=1621862701.432 tsid=600 pid=15765 uid=0 auid=0 ses=3575 
Original log:    type=USER_START ts=1621862701.432 tsid=600 pid=15765 uid=0 auid=0 ses=3575 msg='op=PAM:session_open acct="root" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'


We'll store our final result in a var called `logs`. Very descriptive.

In [174]:
logs = no_msg.str.split()

In [175]:
logs.head()

0    [type=DAEMON_START, ts=1621837767.969, tsid=93...
1    [type=CONFIG_CHANGE, ts=1621837767.983, tsid=4...
2    [type=CONFIG_CHANGE, ts=1621837767.983, tsid=4...
3    [type=CONFIG_CHANGE, ts=1621837767.983, tsid=4...
4    [type=SERVICE_START, ts=1621837767.987, tsid=4...
Name: LogEvent, dtype: object

In [176]:
logs.iloc[logs.shape[0]-1]  # We are doing it this way because I like how it formats the text better. No judging!

['type=USER_START',
 'ts=1625992741.358',
 'tsid=78499',
 'pid=789481',
 'uid=0',
 'auid=1000',
 'ses=5754',
 'subj==unconfined',
 'UID="root"',
 'AUID="ubuntu"']

In [177]:
# Split each "key=value" token on the first '=' only, so values that contain '=' are kept intact
expand_logs = logs.apply(
    lambda fields: dict(field.split('=', 1) for field in fields)
).to_dict()
list(expand_logs.items())[:2] # logs are expanded to a dictionary of dictionaries

[(0,
  {'type': 'DAEMON_START',
   'ts': '1621837767.969',
   'tsid': '9329',
   'op': 'start',
   'ver': '2.8.2',
   'format': 'raw',
   'kernel': '5.3.0-40-generic',
   'auid': '4294967295',
   'pid': '13687',
   'uid': '0',
   'ses': '4294967295',
   'subj': 'unconfined',
   'res': 'success'}),
 (1,
  {'type': 'CONFIG_CHANGE',
   'ts': '1621837767.983',
   'tsid': '489',
   'op': 'set',
   'audit_backlog_limit': '8192',
   'old': '64',
   'auid': '4294967295',
   'ses': '4294967295',
   'res': '1'})]

In [178]:
log_df = pd.DataFrame(expand_logs).T
print(log_df.head(6))

del expand_logs

            type              ts  tsid     op    ver format            kernel  \
0   DAEMON_START  1621837767.969  9329  start  2.8.2    raw  5.3.0-40-generic   
1  CONFIG_CHANGE  1621837767.983   489    set    NaN    NaN               NaN   
2  CONFIG_CHANGE  1621837767.983   490    set    NaN    NaN               NaN   
3  CONFIG_CHANGE  1621837767.983   491    set    NaN    NaN               NaN   
4  SERVICE_START  1621837767.987   492    NaN    NaN    NaN               NaN   
5       USER_END  1621837780.539   493    NaN    NaN    NaN               NaN   

         auid    pid  uid  ...  sig  dev prom old_prom AUID  UID OLD-AUID  \
0  4294967295  13687    0  ...  NaN  NaN  NaN      NaN  NaN  NaN      NaN   
1  4294967295    NaN  NaN  ...  NaN  NaN  NaN      NaN  NaN  NaN      NaN   
2  4294967295    NaN  NaN  ...  NaN  NaN  NaN      NaN  NaN  NaN      NaN   
3  4294967295    NaN  NaN  ...  NaN  NaN  NaN      NaN  NaN  NaN      NaN   
4  4294967295      1    0  ...  NaN  NaN  NaN  

In [179]:
log_df.columns  # no msg column

Index(['type', 'ts', 'tsid', 'op', 'ver', 'format', 'kernel', 'auid', 'pid',
       'uid', 'ses', 'subj', 'res', 'audit_backlog_limit', 'old',
       'audit_failure', 'audit_backlog_wait_time', 'old-auid', 'tty',
       'old-ses', 'apparmor', 'operation', 'profile', 'name', 'comm',
       'requested_mask', 'denied_mask', 'fsuid', 'ouid', 'gid', 'exe', 'sig',
       'dev', 'prom', 'old_prom', 'AUID', 'UID', 'OLD-AUID', 'ID', 'GID',
       'info'],
      dtype='object')

In [180]:
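# Treat '.' as a literal character, then cast to int so unit='ms' is applied to numbers rather than strings
log_df['ts'] = pd.to_datetime(log_df['ts'].str.replace('.', '', regex=False).astype('int64'), unit='ms')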
log_df['ts'].head()

0   2021-05-24 06:29:27.969
1   2021-05-24 06:29:27.983
2   2021-05-24 06:29:27.983
3   2021-05-24 06:29:27.983
4   2021-05-24 06:29:27.987
Name: ts, dtype: datetime64[ns]
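
Dropping the `.` works because every `ts` value here has exactly three fractional digits. A value such as `1621837767.9` would otherwise be read as 16218377679 ms. A more defensive conversion, sketched with the standard library (the helper name `audit_ts_to_datetime` is hypothetical), pads or truncates the fraction first:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def audit_ts_to_datetime(ts):
    """Convert a 'seconds.fraction' string to an aware UTC datetime (ms precision)."""
    seconds, _, fraction = ts.partition(".")
    millis = int((fraction + "000")[:3])  # pad or truncate to exactly three digits
    return EPOCH + timedelta(seconds=int(seconds), milliseconds=millis)


print(audit_ts_to_datetime("1621837767.969"))
print(audit_ts_to_datetime("1621837767.9"))
```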

In [181]:
# Create DataFrame of labeled audit log data
labeled_audit_df = pd.concat([
        msg_df,  # contains the retained msg field
        log_df,  # contains the rest of the log, parsed
        audit_df[audit_df.columns[1:]]  # slice off first column, since we just expanded that.
        # This will give us LogEvent expanded out into more columns as well as the labels.
    ], 
    axis=1)

labeled_audit_df.rename({0: 'msg'}, inplace=True, axis=1)

In [182]:
# reordering the columns to put the msg field in position 11
labeled_audit_df = labeled_audit_df[labeled_audit_df.columns[1:].insert(11, 'msg')]

In [184]:
labeled_audit_df['Signature'] = labeled_audit_df['Signature'].fillna('Normal')

In [185]:
# We can save a lot of memory just by changing some columns to type 'category'
categories = labeled_audit_df.columns[labeled_audit_df.nunique() < 100]
labeled_audit_df[categories] = labeled_audit_df[categories].astype('category')

In [186]:
# More than half of the values in each of these columns are null
dropme = labeled_audit_df.columns[labeled_audit_df.notna().sum() / labeled_audit_df.shape[0] < 0.50]
dropme

Index(['op', 'ver', 'format', 'kernel', 'res', 'audit_backlog_limit', 'old',
       'audit_failure', 'audit_backlog_wait_time', 'old-auid', 'tty',
       'old-ses', 'apparmor', 'operation', 'profile', 'name', 'comm',
       'requested_mask', 'denied_mask', 'fsuid', 'ouid', 'gid', 'exe', 'sig',
       'dev', 'prom', 'old_prom', 'OLD-AUID', 'ID', 'GID', 'info'],
      dtype='object')
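
The same sparsity filter can be written in terms of the null share per column, which some readers may find easier to follow. A small sketch on a toy frame (values invented, column names borrowed from the audit data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "type": ["USER_START", "USER_END", "CONFIG_CHANGE", "USER_START"],
    "pid": ["15765", "15765", np.nan, "15766"],               # 25% null: kept
    "kernel": [np.nan, np.nan, "5.3.0-40-generic", np.nan],   # 75% null: dropped
})

null_share = toy.isna().mean()                    # fraction of nulls per column
sparse_cols = null_share[null_share > 0.5].index  # same rule as "under 50% non-null"
trimmed = toy.drop(columns=sparse_cols)

print(null_share.to_dict())
print(list(trimmed.columns))
```

The 50% cutoff is a judgment call. A sparse column can still be informative if its non-null rows line up with the attack labels.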

In [187]:
labeled_audit_df.drop(columns=dropme, inplace=True)

In [188]:
labeled_audit_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264320 entries, 0 to 264319
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   type              264320 non-null  category      
 1   ts                264320 non-null  datetime64[ns]
 2   tsid              264320 non-null  object        
 3   auid              258041 non-null  category      
 4   pid               264298 non-null  object        
 5   uid               258020 non-null  category      
 6   ses               258041 non-null  object        
 7   msg               223500 non-null  object        
 8   subj              226225 non-null  category      
 9   AUID              226222 non-null  category      
 10  UID               226210 non-null  category      
 11  Activity          264320 non-null  category      
 12  Stage             264320 non-null  category      
 13  DefenderResponse  264320 non-null  category      
 14  Sign

In [189]:
labeled_audit_df[['pid', 'ses']] = labeled_audit_df[['pid', 'ses']].fillna(0)

labeled_audit_df.pid = labeled_audit_df.pid.astype('uint32')
labeled_audit_df.tsid = labeled_audit_df.tsid.astype('uint32')
labeled_audit_df.ses = labeled_audit_df.ses.astype('uint32')

In [191]:
# Saving our work to save me some time.
labeled_audit_df.to_pickle('../data/cleaned/audit.pkl')

## `auth`

In [32]:
path = os.path.split(os.getcwd())[0] + '/data/unraveled-apt/host-logs/auth/'
auth_df = load_all_csv(path, sep='|')

In [33]:
for col in auth_df.columns[1:]:
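    # Build the separator without nesting quotes inside an f-string (only valid on Python 3.12+)
    print(auth_df[col].value_counts(), end='\n' + '-' * 20 + '\n')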

Activity
Normal                       89135
Network Service Discovery       38
Maintain Access                 36
Name: count, dtype: int64
--------------------
Stage
Benign              89135
Lateral Movement       74
Name: count, dtype: int64
--------------------
DefenderResponse
Benign    89209
Name: count, dtype: int64
--------------------
Signature
APT    74
Name: count, dtype: int64
--------------------


In [64]:
auth_df.LogEvent.iloc[[5, 10, 15, 20, 25, 100, 200, 300, 1000, 2000]].values

array(['Jun 13 00:15:01 kali CRON[328966]: pam_unix(cron:session): session closed for user root',
       'Jun 13 00:35:01 kali CRON[329034]: pam_unix(cron:session): session opened for user root by (uid=0)',
       'Jun 13 00:45:01 kali CRON[329086]: pam_unix(cron:session): session closed for user root',
       'Jun 13 01:09:01 kali CRON[329103]: pam_unix(cron:session): session opened for user root by (uid=0)',
       'Jun 13 01:17:01 kali CRON[329153]: pam_unix(cron:session): session closed for user root',
       'Jun 13 05:17:01 kali CRON[329689]: pam_unix(cron:session): session opened for user root by (uid=0)',
       'Jun 13 10:39:01 kali CRON[330494]: pam_unix(cron:session): session opened for user root by (uid=0)',
       'Jun 13 16:15:01 kali CRON[331321]: pam_unix(cron:session): session opened for user root by (uid=0)',
       "Jun 14 11:30:10 kali sshd[336289]: lastlog_openseek: Couldn't stat /var/log/lastlog: No such file or directory",
       'Jun 15 23:09:01 kali CRON[390994

In [35]:
logs = auth_df.LogEvent.apply(lambda x: x.split(' ', maxsplit=5))
logs.head().values

array([list(['Jun', '13', '00:05:01', 'kali', 'CRON[328914]:', 'pam_unix(cron:session): session opened for user root by (uid=0)']),
       list(['Jun', '13', '00:05:01', 'kali', 'CRON[328914]:', 'pam_unix(cron:session): session closed for user root']),
       list(['Jun', '13', '00:09:01', 'kali', 'CRON[328918]:', 'pam_unix(cron:session): session opened for user root by (uid=0)']),
       list(['Jun', '13', '00:09:01', 'kali', 'CRON[328918]:', 'pam_unix(cron:session): session closed for user root']),
       list(['Jun', '13', '00:15:01', 'kali', 'CRON[328966]:', 'pam_unix(cron:session): session opened for user root by (uid=0)'])],
      dtype=object)

In [36]:
df = pd.DataFrame(data=logs.tolist(), columns=['month', 'day', 'time', 'hostname', 'app', 'msg'])


In [37]:
# syslog timestamps omit the year; the capture took place in 2021
df['ts'] = "2021-"+df['month']+"-"+df['day']+" "+df['time']
df['ts'] = pd.to_datetime(df['ts'], format='%Y-%b-%d %H:%M:%S')

In [38]:
# Drop redundant date cols and make ts col 0
df.drop(['month', 'day', 'time'], axis=1, inplace=True, errors='ignore')
df = df[df.columns[:-1].insert(0, 'ts')]

In [39]:
df.head()

Unnamed: 0,ts,hostname,app,msg
0,2021-06-13 00:05:01,kali,CRON[328914]:,pam_unix(cron:session): session opened for use...
1,2021-06-13 00:05:01,kali,CRON[328914]:,pam_unix(cron:session): session closed for use...
2,2021-06-13 00:09:01,kali,CRON[328918]:,pam_unix(cron:session): session opened for use...
3,2021-06-13 00:09:01,kali,CRON[328918]:,pam_unix(cron:session): session closed for use...
4,2021-06-13 00:15:01,kali,CRON[328966]:,pam_unix(cron:session): session opened for use...


In [40]:
tmp = df['app'].str.split('[')

In [41]:
tmp = tmp.apply(lambda x: [e.strip(']:') for e in x])
tmp = tmp.apply(lambda x: x+[0] if len(x) == 1 else x)

In [42]:
tmp

0        [CRON, 328914]
1        [CRON, 328914]
2        [CRON, 328918]
3        [CRON, 328918]
4        [CRON, 328966]
              ...      
89204         [sudo, 0]
89205         [sudo, 0]
89206         [sudo, 0]
89207         [sudo, 0]
89208         [sudo, 0]
Name: app, Length: 89209, dtype: object

In [43]:
tmp = pd.DataFrame(tmp.tolist(), columns=['app','pid'])

In [44]:
tmp.head()

Unnamed: 0,app,pid
0,CRON,328914
1,CRON,328914
2,CRON,328918
3,CRON,328918
4,CRON,328966


In [45]:
df['app'] = tmp['app']
df['pid'] = tmp['pid']


In [46]:
df = df[['ts','hostname','app','pid','msg']]
df.head()

Unnamed: 0,ts,hostname,app,pid,msg
0,2021-06-13 00:05:01,kali,CRON,328914,pam_unix(cron:session): session opened for use...
1,2021-06-13 00:05:01,kali,CRON,328914,pam_unix(cron:session): session closed for use...
2,2021-06-13 00:09:01,kali,CRON,328918,pam_unix(cron:session): session opened for use...
3,2021-06-13 00:09:01,kali,CRON,328918,pam_unix(cron:session): session closed for use...
4,2021-06-13 00:15:01,kali,CRON,328966,pam_unix(cron:session): session opened for use...
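
The split-based steps above could also be done in one pass with a regular expression. The sketch below is an alternative, not the notebook's method. It also tolerates the double space that syslog uses for single-digit days and treats the PID as optional. The function name `parse_auth_line` and the second sample line are hypothetical.

```python
import re
from datetime import datetime

SYSLOG_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<hostname>\S+)\s+(?P<app>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?:\s*(?P<msg>.*)$"
)


def parse_auth_line(line, year=2021):
    """Parse one auth.log line; `year` is supplied because syslog omits it."""
    match = SYSLOG_RE.match(line)
    if match is None:
        return None
    rec = match.groupdict()
    stamp = f"{year} {rec.pop('month')} {rec.pop('day')} {rec.pop('time')}"
    rec["ts"] = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
    rec["pid"] = int(rec["pid"]) if rec["pid"] else 0  # 0 marks "no PID", as in the notebook
    return rec


with_pid = parse_auth_line(
    "Jun 13 00:15:01 kali CRON[328966]: pam_unix(cron:session): session closed for user root"
)
# Hypothetical line: single-digit day and no PID
without_pid = parse_auth_line(
    "Jun  3 01:02:03 kali sudo: pam_unix(sudo:session): session closed for user root"
)

print(with_pid["ts"], with_pid["app"], with_pid["pid"])
print(without_pid["ts"], without_pid["app"], without_pid["pid"])
```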


In [47]:
del tmp

In [48]:
df['msg'] = df['msg'].str.strip()

In [59]:
categories = df.columns[df.nunique() < 100]
categories

Index(['hostname', 'app'], dtype='object')

In [60]:
df[categories] = df[categories].astype('category')

In [53]:
df.pid = df.pid.astype('uint32')

In [63]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89209 entries, 0 to 89208
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   ts        89209 non-null  datetime64[ns]
 1   hostname  89209 non-null  category      
 2   app       89209 non-null  category      
 3   pid       89209 non-null  uint32        
 4   msg       89209 non-null  object        
dtypes: category(2), datetime64[ns](1), object(1), uint32(1)
memory usage: 10.9 MB


Now to finally add the labels!

In [65]:
labels = auth_df.columns[1:]
df[labels] = auth_df[labels].astype('category')

In [66]:
df.head()

Unnamed: 0,ts,hostname,app,pid,msg,Activity,Stage,DefenderResponse,Signature
0,2021-06-13 00:05:01,kali,CRON,328914,pam_unix(cron:session): session opened for use...,Normal,Benign,Benign,
1,2021-06-13 00:05:01,kali,CRON,328914,pam_unix(cron:session): session closed for use...,Normal,Benign,Benign,
2,2021-06-13 00:09:01,kali,CRON,328918,pam_unix(cron:session): session opened for use...,Normal,Benign,Benign,
3,2021-06-13 00:09:01,kali,CRON,328918,pam_unix(cron:session): session closed for use...,Normal,Benign,Benign,
4,2021-06-13 00:15:01,kali,CRON,328966,pam_unix(cron:session): session opened for use...,Normal,Benign,Benign,


In [67]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89209 entries, 0 to 89208
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   ts                89209 non-null  datetime64[ns]
 1   hostname          89209 non-null  category      
 2   app               89209 non-null  category      
 3   pid               89209 non-null  uint32        
 4   msg               89209 non-null  object        
 5   Activity          89209 non-null  category      
 6   Stage             89209 non-null  category      
 7   DefenderResponse  89209 non-null  category      
 8   Signature         74 non-null     category      
dtypes: category(6), datetime64[ns](1), object(1), uint32(1)
memory usage: 11.2 MB


In [68]:
df.to_pickle('../data/cleaned/auth.pkl')

import gc

del df, auth_df
gc.collect()

5958

## Combined

In [47]:
combined_linux_host_df = pd.concat([audit_df, auth_df], ignore_index=True)

In [57]:
combined_linux_host_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 353529 entries, 0 to 353528
Data columns (total 5 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   LogEvent          353529 non-null  object
 1   Activity          353529 non-null  object
 2   Stage             353529 non-null  object
 3   DefenderResponse  353529 non-null  object
 4   Signature         74 non-null      object
dtypes: object(5)
memory usage: 13.5+ MB


-----

# Windows Host Logs

## Security.evtx

In [214]:
df = load_all_csv(path='../data/unraveled-apt/host-logs/windows/', encoding='utf-8')

In [215]:
df.head()

Unnamed: 0,Type,DateTime,Source,EventID,TaskCategory,Description,Activity,Stage,DefenderResponse,Signature,LogMessage
0,Audit Success,7/17/2021 10:28:37 PM,Microsoft-Windows-Security-Auditing,4672,Special Logon,Special privileges assigned to new logon.,Normal,Benign,Benign,,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...
1,Audit Success,7/17/2021 10:28:37 PM,Microsoft-Windows-Security-Auditing,4624,Logon,An account was successfully logged on.,Normal,Benign,Benign,,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...
2,Audit Success,7/17/2021 10:28:36 PM,Microsoft-Windows-Security-Auditing,4798,User Account Management,A users local group membership was enumerated.,Normal,Benign,Benign,,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\us...
3,Audit Success,7/17/2021 10:28:36 PM,Microsoft-Windows-Security-Auditing,4799,Security Group Management,A security-enabled local group membership was ...,Normal,Benign,Benign,,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...
4,Audit Success,7/17/2021 10:28:36 PM,Microsoft-Windows-Security-Auditing,4799,Security Group Management,A security-enabled local group membership was ...,Normal,Benign,Benign,,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...


In [216]:
df['Signature'] = df['Signature'].fillna('Normal')

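Perusing the data suggests that the original cleaning misaligned event ID 4625, probably because those events were formatted slightly differently: the log text landed in `Activity`, leaving `Stage`, `DefenderResponse`, and `LogMessage` empty. We'll fix this by moving the text back to `LogMessage` and filling in the label columns.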

In [217]:
tmp = df[df.EventID == 4625].copy()
tmp.head()

Unnamed: 0,Type,DateTime,Source,EventID,TaskCategory,Description,Activity,Stage,DefenderResponse,Signature,LogMessage
56,Audit Failure,7/17/2021 9:59:17 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...,,,Normal,
16470,Audit Failure,7/2/2021 5:53:14 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\an...,,,Normal,
16472,Audit Failure,7/2/2021 5:51:53 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\an...,,,Normal,
16474,Audit Failure,7/2/2021 5:50:56 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\an...,,,Normal,
16555,Audit Failure,7/2/2021 5:05:41 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\an...,,,Normal,


In [218]:
tmp['LogMessage'] = tmp['Activity']  # the log text ended up in the Activity column
tmp['Activity'] = 'Normal'
tmp[['Stage', 'DefenderResponse']] = tmp[['Stage', 'DefenderResponse']].fillna('Benign')

In [219]:
tmp.head()

Unnamed: 0,Type,DateTime,Source,EventID,TaskCategory,Description,Activity,Stage,DefenderResponse,Signature,LogMessage
56,Audit Failure,7/17/2021 9:59:17 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...
16470,Audit Failure,7/2/2021 5:53:14 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\an...
16472,Audit Failure,7/2/2021 5:51:53 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\an...
16474,Audit Failure,7/2/2021 5:50:56 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\an...
16555,Audit Failure,7/2/2021 5:05:41 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\an...


There we go, that looks better.

In [220]:
df[df.EventID == 4625] = tmp

In [221]:
df[df.EventID == 4625].head()

Unnamed: 0,Type,DateTime,Source,EventID,TaskCategory,Description,Activity,Stage,DefenderResponse,Signature,LogMessage
56,Audit Failure,7/17/2021 9:59:17 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...
16470,Audit Failure,7/2/2021 5:53:14 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\an...
16472,Audit Failure,7/2/2021 5:51:53 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\an...
16474,Audit Failure,7/2/2021 5:50:56 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\an...
16555,Audit Failure,7/2/2021 5:05:41 PM,Microsoft-Windows-Security-Auditing,4625,Logon,An account failed to log on.,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\an...
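
The write-back relies on pandas aligning the patch to the target by index label rather than by position. A minimal sketch of that behavior, using a made-up frame and a narrower single-column assignment than the cell above:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "EventID": [4624, 4625, 4672, 4625],
    "Activity": ["Normal", "raw message a", "Normal", "raw message b"],
})

mask = df_toy["EventID"] == 4625

# Build a patch from the matching rows, then reverse its row order on purpose.
patch = df_toy.loc[mask].copy()
patch["Activity"] = patch["Activity"].str.upper()
patch = patch.iloc[::-1]

# Assignment through .loc matches rows by index label, so the reversed
# order of the patch does not matter.
df_toy.loc[mask, "Activity"] = patch["Activity"]
print(df_toy)
```

Rows outside the mask are left as they were, which is why only the 4625 events change.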


In [226]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143473 entries, 0 to 143472
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Type              143473 non-null  object
 1   DateTime          143473 non-null  object
 2   Source            143473 non-null  object
 3   EventID           143473 non-null  int64 
 4   TaskCategory      143473 non-null  object
 5   Description       143473 non-null  object
 6   Activity          143473 non-null  object
 7   Stage             143473 non-null  object
 8   DefenderResponse  143473 non-null  object
 9   Signature         143473 non-null  object
 10  LogMessage        143412 non-null  object
dtypes: int64(1), object(10)
memory usage: 227.8 MB


In [240]:
df.nunique().sort_values()

DefenderResponse        1
Type                    2
Source                  2
Activity                2
Stage                   2
Signature               2
TaskCategory           14
Description            29
EventID                30
LogMessage          25279
DateTime            32796
dtype: int64

In [242]:
df.DateTime = pd.to_datetime(df.DateTime)

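
Passing an explicit format to `pd.to_datetime` avoids per-element format inference and is usually faster. The format string below is an assumption based on the sample rows shown earlier (for example `7/2/2021 5:51:53 PM`); it has not been checked against every row in the dataset.

```python
import pandas as pd

# Two timestamps copied from the sample output above.
raw = pd.Series(["7/2/2021 5:51:53 PM", "7/17/2021 9:59:17 PM"])

# Month/day/year with a 12-hour clock and an AM/PM marker (assumed format).
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")
print(parsed)
```

If some rows do not follow this format, adding `errors="coerce"` turns them into `NaT` instead of raising, which makes them easy to find afterwards.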


In [231]:
categories = df.columns[df.nunique() < 1000]
categories

Index(['Type', 'Source', 'EventID', 'TaskCategory', 'Description', 'Activity',
       'Stage', 'DefenderResponse', 'Signature'],
      dtype='object')

In [232]:
df[categories] = df[categories].astype('category')

In [245]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143473 entries, 0 to 143472
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Type              143473 non-null  category      
 1   DateTime          143473 non-null  datetime64[ns]
 2   Source            143473 non-null  category      
 3   EventID           143473 non-null  category      
 4   TaskCategory      143473 non-null  category      
 5   Description       143473 non-null  category      
 6   Activity          143473 non-null  category      
 7   Stage             143473 non-null  category      
 8   DefenderResponse  143473 non-null  category      
 9   Signature         143473 non-null  category      
 10  LogMessage        143412 non-null  object        
dtypes: category(9), datetime64[ns](1), object(1)
memory usage: 148.6 MB
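
The drop from 227.8 MB to 148.6 MB comes from the low-cardinality columns: a categorical stores each distinct value once plus a small integer code per row. A rough sketch of the effect on a single synthetic column (the row count is arbitrary):

```python
import pandas as pd

n = 10_000
# Two distinct strings repeated many times, similar to the Source column.
s = pd.Series(
    ["Microsoft-Windows-Security-Auditing", "Microsoft-Windows-Eventlog"] * (n // 2),
    dtype=object,
)

as_object = s.memory_usage(deep=True)
as_category = s.astype("category").memory_usage(deep=True)
print(f"object: {as_object} bytes, category: {as_category} bytes")
```

`LogMessage` is left as `object` because it has far too many distinct values to benefit, so it likely accounts for most of the remaining footprint.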


In [246]:
df[df.LogMessage.isna()].head()

Unnamed: 0,Type,DateTime,Source,EventID,TaskCategory,Description,Activity,Stage,DefenderResponse,Signature,LogMessage
3672,Audit Success,2021-07-14 17:54:45,Microsoft-Windows-Security-Auditing,5024,Other System Events,The Windows Firewall service started successfu...,Normal,Benign,Benign,Normal,
3694,Audit Success,2021-07-14 17:54:41,Microsoft-Windows-Security-Auditing,5033,Other System Events,The Windows Firewall Driver started successfully.,Normal,Benign,Benign,Normal,
3748,Audit Success,2021-07-14 17:51:40,Microsoft-Windows-Eventlog,1100,Service shutdown,The event logging service has shut down.,Normal,Benign,Benign,Normal,
12537,Audit Success,2021-07-07 18:06:33,Microsoft-Windows-Security-Auditing,5024,Other System Events,The Windows Firewall service started successfu...,Normal,Benign,Benign,Normal,
12560,Audit Success,2021-07-07 18:06:07,Microsoft-Windows-Security-Auditing,5033,Other System Events,The Windows Firewall Driver started successfully.,Normal,Benign,Benign,Normal,


Let's check the two fields that had NaNs, Signature and LogMessage, on the rows where LogMessage is missing: only LogMessage should still be entirely null.

In [247]:
df[df.LogMessage.isna()][['Signature', 'LogMessage']].isna().all()

Signature     False
LogMessage     True
dtype: bool

In [248]:
df.to_pickle('../data/cleaned/win-security.pkl')

In [249]:
df = pd.read_pickle('../data/cleaned/win-security.pkl')
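
Pickle is used here instead of CSV because it keeps the dtypes set above, so the `category` and `datetime64[ns]` columns do not need to be converted again after loading. A minimal round-trip sketch, using a temporary directory rather than the real `../data/cleaned/` path:

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({
    "DateTime": pd.to_datetime(["2021-07-17 22:28:37", "2021-07-17 22:28:36"]),
    "EventID": pd.Series([4672, 4624]).astype("category"),
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "roundtrip.pkl")  # throwaway file name
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# Compare dtype names before and after the round trip.
before = [str(t) for t in frame.dtypes]
after = [str(t) for t in restored.dtypes]
print(before, after)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.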

In [250]:
df.head()

Unnamed: 0,Type,DateTime,Source,EventID,TaskCategory,Description,Activity,Stage,DefenderResponse,Signature,LogMessage
0,Audit Success,2021-07-17 22:28:37,Microsoft-Windows-Security-Auditing,4672,Special Logon,Special privileges assigned to new logon.,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...
1,Audit Success,2021-07-17 22:28:37,Microsoft-Windows-Security-Auditing,4624,Logon,An account was successfully logged on.,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...
2,Audit Success,2021-07-17 22:28:36,Microsoft-Windows-Security-Auditing,4798,User Account Management,A users local group membership was enumerated.,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\us...
3,Audit Success,2021-07-17 22:28:36,Microsoft-Windows-Security-Auditing,4799,Security Group Management,A security-enabled local group membership was ...,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...
4,Audit Success,2021-07-17 22:28:36,Microsoft-Windows-Security-Auditing,4799,Security Group Management,A security-enabled local group membership was ...,Normal,Benign,Benign,Normal,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...
