# Firewall Action Predictions
This notebook explores whether the action a firewall takes on a connection (allow, deny, drop, or reset) can be predicted from the logged traffic data.

In [1]:
import pandas as pd

In [3]:
data_path = '../datasets/firewall_log.csv'
df = pd.read_csv(data_path)
df.head()

,Source Port,Destination Port,NAT Source Port,NAT Destination Port,Action,Bytes,Bytes Sent,Bytes Received,Packets,Elapsed Time (sec),pkts_sent,pkts_received
0,57222,53,54587,53,allow,177,94,83,2,30,1,1
1,56258,3389,56258,3389,allow,4768,1600,3168,19,17,10,9
2,6881,50321,43265,50321,allow,238,118,120,2,1199,1,1
3,50553,3389,50553,3389,allow,3327,1438,1889,15,17,8,7
4,50002,443,45848,443,allow,25358,6778,18580,31,16,13,18


# Data Cleaning

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65532 entries, 0 to 65531
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Source Port           65532 non-null  int64 
 1   Destination Port      65532 non-null  int64 
 2   NAT Source Port       65532 non-null  int64 
 3   NAT Destination Port  65532 non-null  int64 
 4   Action                65532 non-null  object
 5   Bytes                 65532 non-null  int64 
 6   Bytes Sent            65532 non-null  int64 
 7   Bytes Received        65532 non-null  int64 
 8   Packets               65532 non-null  int64 
 9   Elapsed Time (sec)    65532 non-null  int64 
 10  pkts_sent             65532 non-null  int64 
 11  pkts_received         65532 non-null  int64 
dtypes: int64(11), object(1)
memory usage: 6.0+ MB


In [5]:
df.describe()

,Source Port,Destination Port,NAT Source Port,NAT Destination Port,Bytes,Bytes Sent,Bytes Received,Packets,Elapsed Time (sec),pkts_sent,pkts_received
count,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0
mean,49391.969343,10577.385812,19282.972761,2671.04993,97123.95,22385.8,74738.15,102.866,65.833577,41.39953,61.466505
std,15255.712537,18466.027039,21970.689669,9739.162278,5618439.0,3828139.0,2463208.0,5133.002,302.461762,3218.871288,2223.332271
min,0.0,0.0,0.0,0.0,60.0,60.0,0.0,1.0,0.0,1.0,0.0
25%,49183.0,80.0,0.0,0.0,66.0,66.0,0.0,1.0,0.0,1.0,0.0
50%,53776.5,445.0,8820.5,53.0,168.0,90.0,79.0,2.0,15.0,1.0,1.0
75%,58638.0,15000.0,38366.25,443.0,752.25,210.0,449.0,6.0,30.0,3.0,2.0
max,65534.0,65535.0,65535.0,65535.0,1269359000.0,948477200.0,320881800.0,1036116.0,10824.0,747520.0,327208.0


## Description Notes
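* The minimum of every port column is 0, so some records have no port associated with them.
* Elapsed time is 0 seconds for at least a quarter of the records (the 25th percentile is 0).
* `pkts_sent` is 1 for at least half of the records (the minimum and the median are both 1); connections with a single packet sent might be the ones that were dropped or denied.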
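
The notes above can be checked by counting the rows behind each one. The sketch below is a minimal example under stated assumptions: `summarize_oddities` is a hypothetical helper that is not part of the original notebook, and `sample` is made-up data that only reuses the column names reported by `df.info()`. In the notebook, `df` would be passed in place of `sample`.

```python
import pandas as pd

PORT_COLUMNS = [
    "Source Port",
    "Destination Port",
    "NAT Source Port",
    "NAT Destination Port",
]


def summarize_oddities(frame: pd.DataFrame) -> dict:
    """Count rows with a port of 0, zero elapsed time, or one packet sent."""
    # True for rows where at least one of the four port columns is 0
    zero_port = (frame[PORT_COLUMNS] == 0).any(axis=1)
    zero_elapsed = frame["Elapsed Time (sec)"] == 0
    single_sent = frame["pkts_sent"] == 1
    return {
        "rows_with_port_0": int(zero_port.sum()),
        "rows_with_zero_elapsed": int(zero_elapsed.sum()),
        # Which actions were taken when only one packet was sent
        "actions_when_one_packet_sent": (
            frame.loc[single_sent, "Action"].value_counts().to_dict()
        ),
    }


# Made-up rows for illustration only; not taken from the dataset
sample = pd.DataFrame({
    "Source Port": [57222, 0, 50553, 40000],
    "Destination Port": [53, 0, 3389, 445],
    "NAT Source Port": [54587, 0, 50553, 0],
    "NAT Destination Port": [53, 0, 3389, 0],
    "Action": ["allow", "drop", "allow", "deny"],
    "Elapsed Time (sec)": [30, 0, 17, 0],
    "pkts_sent": [1, 1, 8, 1],
})

summary = summarize_oddities(sample)
print(summary)
```

If single-packet connections really are mostly dropped or denied, the last entry of the summary should be dominated by those two actions when run on the full data.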

In [23]:
df['Action'].value_counts()

allow         37640
deny          14987
drop          12851
reset-both       54
Name: Action, dtype: int64
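
The counts are easier to compare as shares of the whole. The sketch below rebuilds the counts from the output above as a `Series` so that it runs on its own; in the notebook, `df['Action'].value_counts(normalize=True)` should give the same proportions directly. The variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above
action_counts = pd.Series(
    {"allow": 37640, "deny": 14987, "drop": 12851, "reset-both": 54},
    name="Action",
)

# Fraction of all records that fall into each class
action_share = action_counts / action_counts.sum()
print(action_share.round(4))
```

The classes are clearly imbalanced: `allow` accounts for more than half of the records, while `reset-both` makes up less than 0.1% of them.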

## Classification notes
* `allow` and `deny` are self-explanatory.
* `drop` means the packet is silently discarded.
* `reset-both` was found to be an action available in Palo Alto firewalls:
  * It sends a TCP RST packet to both the client and the server.
  * It terminates connections detected to be malicious.
  * It is generally not recommended, as it can be abused in DoS attacks against your firewall.

In [30]:
# Showing any missing values for each column
for column in df.columns:
    print(df[column].isna().sum(), "missing values in column", column)

0 missing values in column Source Port
0 missing values in column Destination Port
0 missing values in column NAT Source Port
0 missing values in column NAT Destination Port
0 missing values in column Action
0 missing values in column Bytes
0 missing values in column Bytes Sent
0 missing values in column Bytes Received
0 missing values in column Packets
0 missing values in column Elapsed Time (sec)
0 missing values in column pkts_sent
0 missing values in column pkts_received
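
The same check can also be written without an explicit loop, since `DataFrame.isna().sum()` returns the per-column counts as a `Series`. This is a minimal sketch on a made-up frame (`sample` is not part of the dataset and deliberately contains one gap); in the notebook, `df.isna().sum()` would be the equivalent call.

```python
import pandas as pd

# Made-up frame with one missing value, for illustration only
sample = pd.DataFrame({
    "Action": ["allow", None, "drop"],
    "Bytes": [177, 4768, 238],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Total number of missing values across the whole frame
total_missing = int(missing_per_column.sum())
print(total_missing, "missing values in total")
```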


# Exploratory Data Analysis