# Exploratory Analysis

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [63]:
audit_df = pd.read_pickle('../data/cleaned/audit.pkl')
auth_df = pd.read_pickle('../data/cleaned/auth.pkl')
net_df = pd.read_pickle('../data/cleaned/netflow.pkl')
win_df = pd.read_pickle('../data/cleaned/win-security.pkl')

## Host Logs

### `audit` Logs

With these audit logs, I want to review the following:
- Most common and least common value of each categorical field.
- 
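As a rough sketch of the first review item, the helper below reports the most and least common value of every `category` column. The function name `categorical_extremes` and the toy rows are made up for illustration; on the real data it would be called on `audit_df`.

```python
import pandas as pd

SUMMARY_COLUMNS = ["column", "most_common", "most_count", "least_common", "least_count"]

def categorical_extremes(df: pd.DataFrame) -> pd.DataFrame:
    """Most and least common observed value for each categorical column (hypothetical helper)."""
    rows = []
    for col in df.select_dtypes(include="category").columns:
        counts = df[col].value_counts()   # sorted by count, descending; nulls are excluded
        counts = counts[counts > 0]       # ignore categories that never occur
        if counts.empty:
            continue
        rows.append({
            "column": col,
            "most_common": counts.index[0],
            "most_count": int(counts.iloc[0]),
            "least_common": counts.index[-1],
            "least_count": int(counts.iloc[-1]),
        })
    return pd.DataFrame(rows, columns=SUMMARY_COLUMNS).set_index("column")

# Toy stand-in for audit_df; the values are invented for the example.
toy = pd.DataFrame({
    "type": pd.Categorical(["CONFIG_CHANGE", "CONFIG_CHANGE", "CONFIG_CHANGE",
                            "SERVICE_START", "SERVICE_START", "DAEMON_START"]),
    "op": pd.Categorical(["set", "set", None, None, "start", None]),
    "tsid": [489, 490, 491, 492, 493, 494],   # not categorical, so it is skipped
})

summary = categorical_extremes(toy)
print(summary)
```

Ties between equally rare values are broken arbitrarily here, so for a field with many singletons it may be more useful to look at the full tail of `value_counts()`.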

I also have the following concerns:
- Lots of columns are mostly null.
- Lots of rows contain at least one null value.
- The DataFrame needs some sort of feature reduction; otherwise a classifier will take ages to get through it.
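One way to put numbers on the null concerns is to measure the null fraction per column and count the rows that contain any null. This is a minimal sketch on invented data; the helper names and the cutoff passed as `max_null_frac` are assumptions, not values taken from this analysis.

```python
import numpy as np
import pandas as pd

def null_report(df: pd.DataFrame) -> pd.Series:
    """Fraction of null values per column, worst offenders first."""
    return df.isna().mean().sort_values(ascending=False)

def drop_sparse_columns(df: pd.DataFrame, max_null_frac: float = 0.95) -> pd.DataFrame:
    """Keep only the columns whose null fraction is at or below the cutoff (assumed threshold)."""
    keep = df.columns[df.isna().mean() <= max_null_frac]
    return df[keep]

# Toy stand-in for audit_df; the values are invented for the example.
toy = pd.DataFrame({
    "type": ["DAEMON_START", "CONFIG_CHANGE", "CONFIG_CHANGE", "SERVICE_START"],
    "ver": ["2.8.2", None, None, None],   # 75% null
    "pid": [13687.0, np.nan, 1.0, 1.0],   # 25% null
})

report = null_report(toy)
reduced = drop_sparse_columns(toy, max_null_frac=0.5)
rows_with_null = int(toy.isna().any(axis=1).sum())

print(report)
print(list(reduced.columns), rows_with_null)
```

Dropping sparse columns is only a crude first pass at feature reduction; a column that is almost always null can still be informative if its few non-null rows line up with malicious activity.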

After we play around, I'll talk about what's next for this DataFrame.

In [3]:
audit_df.head()

Unnamed: 0,type,ts,tsid,op,ver,format,kernel,auid,pid,uid,...,AUID,UID,OLD-AUID,ID,GID,info,Activity,Stage,DefenderResponse,Signature
0,DAEMON_START,2021-05-24 06:29:27.969,9329,start,2.8.2,raw,5.3.0-40-generic,4294967000.0,13687.0,0.0,...,,,,,,,Normal,Benign,Benign,
1,CONFIG_CHANGE,2021-05-24 06:29:27.983,489,set,,,,4294967000.0,,,...,,,,,,,Normal,Benign,Benign,
2,CONFIG_CHANGE,2021-05-24 06:29:27.983,490,set,,,,4294967000.0,,,...,,,,,,,Normal,Benign,Benign,
3,CONFIG_CHANGE,2021-05-24 06:29:27.983,491,set,,,,4294967000.0,,,...,,,,,,,Normal,Benign,Benign,
4,SERVICE_START,2021-05-24 06:29:27.987,492,,,,,4294967000.0,1.0,0.0,...,,,,,,,Normal,Benign,Benign,


In [52]:
audit_df.to_pickle('../data/cleaned/audit.pkl')

In [62]:
audit_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264320 entries, 0 to 264319
Data columns (total 46 columns):
 #   Column                   Non-Null Count   Dtype   
---  ------                   --------------   -----   
 0   type                     264320 non-null  category
 1   ts                       264320 non-null  object  
 2   tsid                     264320 non-null  int64   
 3   op                       28 non-null      category
 4   ver                      7 non-null       category
 5   format                   7 non-null       category
 6   kernel                   7 non-null       category
 7   auid                     258041 non-null  category
 8   pid                      264298 non-null  float64 
 9   uid                      258020 non-null  category
 10  ses                      258041 non-null  float64 
 11  msg                      223500 non-null  object  
 12  subj                     7 non-null       category
 13  res                      34540 non-null   ca

Alright, so the plan is:
- Select the best-performing classification model.
- Fit that model on the majority of the features, then apply feature reduction techniques and fit it again on the reduced set.
- Pit the old model against the new one to validate that performance holds up.
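The comparison in the plan above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the data is synthetic (`make_classification`), the classifier is a random forest, the reduction step is univariate selection with `SelectKBest`, and the metric is F1. None of these choices come from the analysis itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for an encoded log DataFrame: 40 features, only 5 informative.
X, y = make_classification(
    n_samples=400, n_features=40, n_informative=5, n_redundant=0, random_state=0
)

# "Old" model: trained on every feature.
baseline = RandomForestClassifier(n_estimators=50, random_state=0)

# "New" model: keep the 10 highest-scoring features, then train the same classifier.
reduced = make_pipeline(
    SelectKBest(f_classif, k=10),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

baseline_f1 = cross_val_score(baseline, X, y, cv=5, scoring="f1").mean()
reduced_f1 = cross_val_score(reduced, X, y, cv=5, scoring="f1").mean()

print(f"all features:     F1 = {baseline_f1:.3f}")
print(f"reduced features: F1 = {reduced_f1:.3f}")
```

Keeping the selection step inside the pipeline means it is refit on each training fold, so the held-out fold never influences which features are kept.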

### `auth` Logs

I'm interested in the following:
- What are some common features among benign logs?
  - Successes and failures
- What about the logs of malicious activity?
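A first pass at these questions could tag each message as a success, a failure, or neither, then cross-tabulate that against `Stage`. The sketch below runs on invented rows: the message text, the keyword patterns, and the `Malicious` stage label are all placeholders, so the patterns would need to be checked against the real `msg` values in `auth_df`.

```python
import pandas as pd

def label_outcome(msg: pd.Series) -> pd.Series:
    """Tag each message as 'success', 'failure', or 'other' using keyword patterns (assumed)."""
    outcome = pd.Series("other", index=msg.index)
    is_failure = msg.str.contains(r"failed|failure|invalid", case=False, regex=True, na=False)
    is_success = msg.str.contains(r"accepted|session opened", case=False, regex=True, na=False)
    outcome.loc[is_failure] = "failure"
    outcome.loc[is_success] = "success"
    return outcome

# Toy stand-in for auth_df; messages and the 'Malicious' label are invented.
toy = pd.DataFrame({
    "app": ["CRON", "CRON", "sshd", "sshd", "sshd"],
    "msg": [
        "pam_unix(cron:session): session opened for user root",
        "pam_unix(cron:session): session closed for user root",
        "Failed password for root from 10.0.0.5 port 22 ssh2",
        "Failed password for root from 10.0.0.5 port 22 ssh2",
        "Accepted password for root from 10.0.0.5 port 22 ssh2",
    ],
    "Stage": ["Benign", "Benign", "Malicious", "Malicious", "Malicious"],
})

table = pd.crosstab(toy["Stage"], label_outcome(toy["msg"]))
print(table)
```

The same cross-tabulation against `app` or `hostname` would show whether particular services dominate the malicious rows.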


In [4]:
auth_df.head()

Unnamed: 0,ts,hostname,app,pid,msg,Activity,Stage,DefenderResponse,Signature
0,2021-06-13 00:05:01,kali,CRON,328914,pam_unix(cron:session): session opened for use...,Normal,Benign,Benign,
1,2021-06-13 00:05:01,kali,CRON,328914,pam_unix(cron:session): session closed for use...,Normal,Benign,Benign,
2,2021-06-13 00:09:01,kali,CRON,328918,pam_unix(cron:session): session opened for use...,Normal,Benign,Benign,
3,2021-06-13 00:09:01,kali,CRON,328918,pam_unix(cron:session): session closed for use...,Normal,Benign,Benign,
4,2021-06-13 00:15:01,kali,CRON,328966,pam_unix(cron:session): session opened for use...,Normal,Benign,Benign,


In [61]:
auth_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89209 entries, 0 to 89208
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   ts                89209 non-null  datetime64[ns]
 1   hostname          89209 non-null  category      
 2   app               89209 non-null  category      
 3   pid               89209 non-null  uint32        
 4   msg               89209 non-null  object        
 5   Activity          89209 non-null  category      
 6   Stage             89209 non-null  category      
 7   DefenderResponse  89209 non-null  category      
 8   Signature         74 non-null     category      
dtypes: category(6), datetime64[ns](1), object(1), uint32(1)
memory usage: 11.2 MB


### Windows `Security.evtx` Logs

Here are some things I am curious about:
- f

In [5]:
win_df.head()

Unnamed: 0,Type,DateTime,Source,EventID,TaskCategory,Description,Activity,Stage,DefenderResponse,Signature,LogMessage
0,Audit Success,7/17/2021 10:28:37 PM,Microsoft-Windows-Security-Auditing,4672,Special Logon,Special privileges assigned to new logon.,Normal,Benign,Benign,,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...
1,Audit Success,7/17/2021 10:28:37 PM,Microsoft-Windows-Security-Auditing,4624,Logon,An account was successfully logged on.,Normal,Benign,Benign,,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...
2,Audit Success,7/17/2021 10:28:36 PM,Microsoft-Windows-Security-Auditing,4798,User Account Management,A users local group membership was enumerated.,Normal,Benign,Benign,,Subject:\n\tSecurity ID:\t\tDESKTOP-56DUI1B\us...
3,Audit Success,7/17/2021 10:28:36 PM,Microsoft-Windows-Security-Auditing,4799,Security Group Management,A security-enabled local group membership was ...,Normal,Benign,Benign,,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...
4,Audit Success,7/17/2021 10:28:36 PM,Microsoft-Windows-Security-Auditing,4799,Security Group Management,A security-enabled local group membership was ...,Normal,Benign,Benign,,Subject:\n\tSecurity ID:\t\tSYSTEM\n\tAccount ...


In [57]:
win_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143473 entries, 0 to 143472
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Type              143473 non-null  category
 1   DateTime          143473 non-null  object  
 2   Source            143473 non-null  category
 3   EventID           143473 non-null  category
 4   TaskCategory      143473 non-null  category
 5   Description       143473 non-null  category
 6   Activity          143473 non-null  category
 7   Stage             143473 non-null  category
 8   DefenderResponse  143473 non-null  category
 9   Signature         360 non-null     category
 10  LogMessage        143412 non-null  object  
dtypes: category(9), object(2)
memory usage: 155.5 MB


In [56]:
# Treat any column with fewer than 100 distinct values as categorical to cut memory use.
categories = win_df.columns[win_df.nunique() < 100]
win_df[categories] = win_df[categories].astype('category')
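The `info()` output still lists `DateTime` as `object`, so a possible follow-up is to parse it into a real datetime column. This is a sketch on a toy frame: the format string is inferred from the sample rows shown by `win_df.head()` (month first, 12-hour clock with AM/PM) and is an assumption about the rest of the column.

```python
import pandas as pd

# Toy stand-in for win_df["DateTime"], repeating two timestamps from the head() output.
toy = pd.DataFrame({
    "DateTime": ["7/17/2021 10:28:37 PM", "7/17/2021 10:28:36 PM"] * 1000
})

bytes_before = toy.memory_usage(deep=True)["DateTime"]

# An explicit format avoids per-row format guessing and fails loudly on unexpected values.
toy["DateTime"] = pd.to_datetime(toy["DateTime"], format="%m/%d/%Y %I:%M:%S %p")

bytes_after = toy.memory_usage(deep=True)["DateTime"]

print(toy["DateTime"].dtype)
print(bytes_before, bytes_after)
```

Besides the smaller footprint, a datetime column allows sorting, resampling, and time-window filtering, which plain strings do not.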

## Network Logs

### Netflow

In [10]:
net_df.head()

Unnamed: 0,flow_start,flow_end,expiration_id,src_ip,src_mac,src_port,dst_ip,dst_mac,dst_port,protocol,...,dst2src_rst_packets,dst2src_fin_packets,application_name,application_category_name,server_fingerprint,user_agent,Activity,Stage,DefenderResponse,Signature
0,2021-05-26 18:03:17.917,2021-05-26 18:06:39.941,0,10.1.2.17,fa:16:3e:a2:d6:e6,123,192.81.135.252,fa:16:3e:10:2d:11,123,17,...,0,0,NTP,System,,,Normal,Benign,Benign,
1,2021-05-26 18:03:18.917,2021-05-26 18:06:42.960,0,10.1.2.17,fa:16:3e:a2:d6:e6,123,74.6.168.72,fa:16:3e:10:2d:11,123,17,...,0,0,NTP,System,,,Normal,Benign,Benign,
2,2021-05-26 18:04:34.917,2021-05-26 18:04:34.939,0,10.1.2.17,fa:16:3e:a2:d6:e6,123,23.131.160.7,fa:16:3e:10:2d:11,123,17,...,0,0,NTP,System,,,Normal,Benign,Benign,
3,2021-05-26 18:03:20.917,2021-05-26 18:03:20.998,0,10.1.2.17,fa:16:3e:a2:d6:e6,123,209.115.181.108,fa:16:3e:10:2d:11,123,17,...,0,0,NTP,System,,,Normal,Benign,Benign,
4,2021-05-26 18:04:37.917,2021-05-26 18:04:37.988,0,10.1.2.17,fa:16:3e:a2:d6:e6,123,108.61.73.243,fa:16:3e:10:2d:11,123,17,...,0,0,NTP,System,,,Normal,Benign,Benign,
