# Exploratory Data Analysis (UNSW-NB15)

Exploratory analysis of the UNSW-NB15 network traffic dataset: row counts, label and attack-category distributions, missing values, flow duration, protocol mix, and byte volumes.

#### Feature Descriptions

##### Addressing and protocol

**1. `srcip` (string)**  
Source IP address of the network flow.

**2. `sport` (port number)**  
Source transport-layer port.

**3. `dstip` (string)**  
Destination IP address of the network flow.

**4. `dsport` (port number)**  
Destination transport-layer port.

**5. `proto` (categorical)**  
Transport-layer protocol (e.g., tcp, udp, icmp).

**6. `state` (categorical)**  
Connection state derived by Argus (e.g., CON, INT, FIN, REQ).

---

##### Flow timing and volume

**7. `dur` (seconds)**  
Total duration of the flow.

**8. `sbytes` (bytes)**  
Total number of bytes sent from source to destination.

**9. `dbytes` (bytes)**  
Total number of bytes sent from destination to source.

**10. `sttl` (hop count)**  
Time-To-Live value of packets sent from source.

**11. `dttl` (hop count)**  
Time-To-Live value of packets sent from destination.

**12. `sloss` (packets)**  
Number of packets lost from source to destination.

**13. `dloss` (packets)**  
Number of packets lost from destination to source.

---

##### Service and load characteristics

**14. `service` (categorical)**  
Application-layer service (e.g., http, ftp, dns).

**15. `sload` (bits/second)**  
Source-to-destination data transfer rate.

**16. `dload` (bits/second)**  
Destination-to-source data transfer rate.

**17. `spkts` (packets)**  
Number of packets sent from source.

**18. `dpkts` (packets)**  
Number of packets sent from destination.

---

##### TCP window and packet size

**19. `swin` (bytes)**  
TCP window size advertised by the source.

**20. `dwin` (bytes)**  
TCP window size advertised by the destination.

**21. `stcpb` (sequence number)**  
TCP base sequence number from the source.

**22. `dtcpb` (sequence number)**  
TCP base sequence number from the destination.

**23. `smeansz` (bytes)**  
Mean packet size sent from source.

**24. `dmeansz` (bytes)**  
Mean packet size sent from destination.

---

##### Application behavior

**25. `trans_depth` (count)**  
Depth of HTTP transaction pipelining.

**26. `res_bdy_len` (bytes)**  
Length of HTTP response body.

---

##### Jitter and timing features

**27. `sjit` (milliseconds)**  
Source packet inter-arrival jitter.

**28. `djit` (milliseconds)**  
Destination packet inter-arrival jitter.

**29. `stime` (epoch seconds)**  
Start time of the flow.

**30. `ltime` (epoch seconds)**  
End time of the flow.

**31. `sintpkt` (milliseconds)**  
Mean inter-arrival time between source packets.

**32. `dintpkt` (milliseconds)**  
Mean inter-arrival time between destination packets.

---

##### TCP handshake timing

**33. `tcprtt` (milliseconds)**  
TCP connection setup round-trip time (the sum of `synack` and `ackdat`).

**34. `synack` (milliseconds)**  
Time between SYN and SYN-ACK packets.

**35. `ackdat` (milliseconds)**  
Time between SYN-ACK and ACK packets.
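
The dataset documentation describes `tcprtt` as the sum of `synack` and `ackdat`, which gives a cheap consistency check on these three fields. The sketch below is one way to express it; the helper name `rtt_is_consistent`, the tolerance, and the sample values are illustrative assumptions, not taken from the data.

```python
import math

def rtt_is_consistent(tcprtt: float, synack: float, ackdat: float,
                      tol: float = 1e-6) -> bool:
    """Return True if tcprtt equals synack + ackdat within an absolute tolerance."""
    return math.isclose(tcprtt, synack + ackdat, rel_tol=0.0, abs_tol=tol)

# Hypothetical flow records (values invented for illustration).
flows = [
    {"tcprtt": 0.000668, "synack": 0.000540, "ackdat": 0.000128},
    {"tcprtt": 0.002000, "synack": 0.000500, "ackdat": 0.000500},  # inconsistent
]
flags = [rtt_is_consistent(f["tcprtt"], f["synack"], f["ackdat"]) for f in flows]
print(flags)
```

Rows that fail the check would be worth inspecting before the three fields are used together as model inputs.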

---

##### Flow structure indicators

**36. `is_sm_ips_ports` (binary)**  
Indicates whether source and destination IPs and ports are identical.

**37. `ct_state_ttl` (count)**  
Number of connections with the same state and TTL values.

**38. `ct_flw_http_mthd` (count)**  
Number of flows that use an HTTP method such as GET or POST.

**39. `is_ftp_login` (binary)**  
Indicates whether an FTP login was detected.

**40. `ct_ftp_cmd` (count)**  
Number of FTP command flows.
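
The `is_sm_ips_ports` flag described above can be sketched as a small function. The function signature and the example addresses are hypothetical; only the rule itself (both the IP addresses and the ports must match) comes from the feature description.

```python
def is_sm_ips_ports(srcip: str, sport: int, dstip: str, dsport: int) -> int:
    """Return 1 when source and destination share both IP address and port, else 0."""
    return int(srcip == dstip and sport == dsport)

# Hypothetical flows for illustration.
same = is_sm_ips_ports("10.0.0.5", 1043, "10.0.0.5", 1043)       # identical endpoints
diff_port = is_sm_ips_ports("10.0.0.5", 1043, "10.0.0.5", 80)    # same IP, other port
diff_ip = is_sm_ips_ports("10.0.0.5", 1043, "10.0.0.9", 1043)    # other IP, same port
print(same, diff_port, diff_ip)
```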

---

##### Connection aggregation features

**41. `ct_srv_src` (count)**  
Number of connections with the same service and source address among the last 100 connections (ordered by `ltime`).

**42. `ct_srv_dst` (count)**  
Number of connections with the same service and destination address among the last 100 connections (ordered by `ltime`).

**43. `ct_dst_ltm` (count)**  
Number of connections with the same destination address among the last 100 connections (ordered by `ltime`).

**44. `ct_src_ltm` (count)**  
Number of connections with the same source address among the last 100 connections (ordered by `ltime`).

**45. `ct_src_dport_ltm` (count)**  
Number of connections with the same source address and destination port among the last 100 connections (ordered by `ltime`).

**46. `ct_dst_sport_ltm` (count)**  
Number of connections with the same destination address and source port among the last 100 connections (ordered by `ltime`).

**47. `ct_dst_src_ltm` (count)**  
Number of connections with the same source and destination address pair among the last 100 connections (ordered by `ltime`).
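
The `ct_*` features above are windowed counts. The dataset documentation defines the window as 100 connections ordered by `ltime`. A minimal sketch of that idea follows. It assumes the records are already sorted by `ltime` and that the current record counts toward its own window; both are assumptions, and the helper `windowed_counts` and the sample records are hypothetical.

```python
from collections import deque

def windowed_counts(records, key_fields, window=100):
    """For each record, count how many of the most recent `window` records
    (including the record itself) share its values for `key_fields`."""
    recent = deque(maxlen=window)  # oldest keys fall off automatically
    counts = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        recent.append(key)
        counts.append(sum(1 for k in recent if k == key))
    return counts

# Hypothetical records, assumed to be sorted by ltime already.
records = [
    {"srcip": "10.0.0.1", "service": "http"},
    {"srcip": "10.0.0.1", "service": "http"},
    {"srcip": "10.0.0.2", "service": "dns"},
    {"srcip": "10.0.0.1", "service": "http"},
]
# A ct_srv_src-style count keyed on (service, source address).
ct_srv_src_like = windowed_counts(records, ("service", "srcip"))
print(ct_srv_src_like)
```

Changing `key_fields` to, say, `("srcip", "dsport")` would give a `ct_src_dport_ltm`-style count under the same assumptions.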

---

##### Labels

**48. `attack_cat` (categorical)**  
Attack category label (e.g., Exploits, Fuzzers, DoS, Reconnaissance).

**49. `label` (binary)**  
Binary class label where:
- `0` indicates normal (benign) traffic  
- `1` indicates attack traffic

---

##### Source  
UNSW-NB15 Dataset  
https://research.unsw.edu.au/projects/unsw-nb15-dataset


In [1]:
import boto3
import sagemaker
import pandas as pd
from sqlalchemy import create_engine

# Show all rows and columns, untruncated, when displaying DataFrames
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.width", None)



## SageMaker / Athena setup

In [2]:
sess = sagemaker.Session()
region = boto3.Session().region_name

results_bucket = sess.default_bucket()
athena_results_path = f"s3://{results_bucket}/athena/staging/"

database_name = "aai540_eda"

engine = create_engine(
    f"awsathena+rest://@athena.{region}.amazonaws.com:443/{database_name}",
    connect_args={
        "s3_staging_dir": athena_results_path,
        "region_name": region,
    },
)

def read_sql(query: str) -> pd.DataFrame:
    """Run a SQL query against Athena and return the result as a DataFrame."""
    return pd.read_sql(query, engine)

print("Region:", region)
print("Athena results:", athena_results_path)

Region: us-east-1
Athena results: s3://sagemaker-us-east-1-128131109986/athena/staging/


## Athena Tables

In [3]:
read_sql(f"SHOW TABLES IN {database_name}")

|   | tab_name |
|---|---|
| 0 | cic_ids2017_raw |
| 1 | ton_iot_raw |
| 2 | unsw_nb15_raw |


## UNSW-NB15 EDA

### Count rows

In [4]:
read_sql(f"""
SELECT COUNT(*) AS total_rows
FROM {database_name}.unsw_nb15_raw
""")


|   | total_rows |
|---|---|
| 0 | 2540047 |


### Columns

In [5]:
read_sql(f"SHOW COLUMNS FROM {database_name}.unsw_nb15_raw")

|   | field |
|---|---|
| 0 | srcip |
| 1 | sport |
| 2 | dstip |
| 3 | dsport |
| 4 | proto |
| 5 | state |
| 6 | dur |
| 7 | sbytes |
| 8 | dbytes |
| 9 | sttl |


### Binary label distribution

In [6]:
read_sql(f"""
SELECT
  CASE WHEN label = 0 THEN 'BENIGN' ELSE 'MALICIOUS' END AS class,
  COUNT(*) AS cnt,
  COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS pct
FROM {database_name}.unsw_nb15_raw
GROUP BY
  CASE WHEN label = 0 THEN 'BENIGN' ELSE 'MALICIOUS' END
ORDER BY cnt DESC
""")


|   | class | cnt | pct |
|---|---|---|---|
| 0 | BENIGN | 2218764 | 0.873513 |
| 1 | MALICIOUS | 321283 | 0.126487 |
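
The class counts above can be cross-checked against the total row count and turned into a rough measure of imbalance. The sketch below recomputes the shares and derives "balanced" class weights with the common heuristic `total / (n_classes * class_count)`. Using these weights in a model is an assumption about later steps, not something this notebook does.

```python
# Counts copied from the query output above.
counts = {"BENIGN": 2_218_764, "MALICIOUS": 321_283}
total = sum(counts.values())

# Share of each class; this should reproduce the pct column.
shares = {cls: n / total for cls, n in counts.items()}

# "Balanced" weights: the rarer class receives the larger weight.
weights = {cls: total / (len(counts) * n) for cls, n in counts.items()}

# How many benign flows there are per malicious flow.
imbalance_ratio = counts["BENIGN"] / counts["MALICIOUS"]

print(total)
print({cls: round(s, 6) for cls, s in shares.items()})
print({cls: round(w, 3) for cls, w in weights.items()})
print(round(imbalance_ratio, 2))
```

The two counts add up to 2,540,047, which matches the row count reported earlier, so no rows have a NULL `label`.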


### Attack type distribution

In [7]:
read_sql(f"""
SELECT
  attack_cat,
  COUNT(*) AS cnt,
  COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS pct
FROM {database_name}.unsw_nb15_raw
WHERE label = 1
GROUP BY attack_cat
ORDER BY cnt DESC
""")

|   | attack_cat | cnt | pct |
|---|---|---|---|
| 0 | Generic | 215481 | 0.670689 |
| 1 | Exploits | 44525 | 0.138585 |
| 2 | Fuzzers | 19195 | 0.059745 |
| 3 | DoS | 16353 | 0.050899 |
| 4 | Reconnaissance | 12228 | 0.03806 |
| 5 | Fuzzers | 5051 | 0.015721 |
| 6 | Analysis | 2677 | 0.008332 |
| 7 | Backdoor | 1795 | 0.005587 |
| 8 | Reconnaissance | 1759 | 0.005475 |
| 9 | Shellcode | 1288 | 0.004009 |
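
`Fuzzers` and `Reconnaissance` each appear twice in the output above, which suggests the raw `attack_cat` values differ in something the display hides, such as surrounding whitespace. Assuming that is the cause (it is not verified here), the variants could be merged as sketched below. The counts are copied from the output; the padded spellings are hypothetical stand-ins.

```python
from collections import Counter

# (raw label, count) pairs. Counts come from the output above; the padded
# spellings are hypothetical stand-ins for whatever separates the duplicates.
raw_counts = [
    ("Generic", 215481), ("Exploits", 44525), (" Fuzzers ", 19195),
    ("DoS", 16353), (" Reconnaissance ", 12228), (" Fuzzers", 5051),
    ("Analysis", 2677), ("Backdoor", 1795), ("Reconnaissance", 1759),
    ("Shellcode", 1288),
]

merged = Counter()
for label, cnt in raw_counts:
    merged[label.strip()] += cnt  # collapse whitespace variants into one key

for label, cnt in merged.most_common():
    print(f"{label:<15} {cnt}")
```

In Athena the same normalization could be applied by using `TRIM(attack_cat)` in both the `SELECT` and the `GROUP BY`. The ten rows shown sum to 320,352, short of the 321,283 malicious rows, so the displayed output appears to omit some smaller categories.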


### Missing values analysis

In [9]:
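# List the table's columns; the first column of the result holds the names.
cols_df = read_sql(f"SHOW COLUMNS FROM {database_name}.unsw_nb15_raw")
col_field = cols_df.columns[0]

columns = (
    cols_df[col_field]
    .dropna()
    .astype(str)
    .tolist()
)
# Drop blank entries and any lines that start with "#".
columns = [c for c in columns if c.strip() and not c.lower().startswith("#")]

# Count NULLs for every column in a single pass over the table.
missing_sql = f"""
SELECT
  {", ".join([f"SUM(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END) AS {c}" for c in columns])}
FROM {database_name}.unsw_nb15_raw
"""
# The query returns one row; take it as a Series indexed by column name.
missing = read_sql(missing_sql).iloc[0]
missing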

srcip                     0
sport                     8
dstip                     0
dsport                  304
proto                     0
state                     0
dur                       0
sbytes                    0
dbytes                    0
sttl                      0
dttl                      0
sloss                     0
dloss                     0
service                   0
sload                     0
dload                     0
spkts                     0
dpkts                     0
swin                      0
dwin                      0
stcpb                     0
dtcpb                     0
smeansz                   0
dmeansz                   0
trans_depth               0
res_bdy_len               0
sjit                      0
djit                      0
stime                     0
ltime                     0
sintpkt                   0
dintpkt                   0
tcprtt                    0
synack                    0
ackdat                    0
is_sm_ips_ports     
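
To see the query-building pattern in isolation, the sketch below generates the same kind of per-column NULL count and runs it against an in-memory SQLite table instead of Athena. The table `flows`, its three columns, and its rows are made up for illustration. Column identifiers are double-quoted here as a precaution against reserved words, which the cell above does not do.

```python
import sqlite3

def build_missing_sql(table: str, columns: list[str]) -> str:
    """Build one SELECT that returns the NULL count of every listed column."""
    parts = [
        f'SUM(CASE WHEN "{c}" IS NULL THEN 1 ELSE 0 END) AS "{c}"'
        for c in columns
    ]
    return f'SELECT {", ".join(parts)} FROM {table}'

# Hypothetical three-column table standing in for the Athena table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flows (srcip TEXT, sport INTEGER, dsport INTEGER)")
conn.executemany(
    "INSERT INTO flows VALUES (?, ?, ?)",
    [("10.0.0.1", 80, 443), ("10.0.0.2", None, 53), ("10.0.0.3", None, None)],
)

demo_columns = ["srcip", "sport", "dsport"]
row = conn.execute(build_missing_sql("flows", demo_columns)).fetchone()
missing_counts = dict(zip(demo_columns, row))
print(missing_counts)
conn.close()
```

Because every column is aggregated in one statement, the table is scanned once regardless of how many columns it has.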

### Duration distribution

In [10]:
read_sql(f"""
SELECT
  MIN(dur) AS min_duration,
  APPROX_PERCENTILE(dur, 0.50) AS p50,
  APPROX_PERCENTILE(dur, 0.95) AS p95,
  APPROX_PERCENTILE(dur, 0.99) AS p99,
  MAX(dur) AS max_duration
FROM {database_name}.unsw_nb15_raw
""")


|   | min_duration | p50 | p95 | p99 | max_duration |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.015387 | 1.673901 | 12.528759 | 8786.637695 |
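
`APPROX_PERCENTILE` returns an approximation rather than an exact quantile. If a sample of `dur` values is pulled into Python, an exact percentile can be computed for comparison. The sketch below uses the nearest-rank method, which is one of several common definitions; the function name and the sample values are hypothetical and are not drawn from the dataset.

```python
import math

def percentile_nearest_rank(values, q):
    """Exact q-quantile (0 < q <= 1) using the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))  # 1-based rank into the sorted list
    return ordered[rank - 1]

# Hypothetical sample of flow durations in seconds (not real data).
sample = [0.0, 0.001, 0.002, 0.015, 0.02, 0.5, 1.2, 1.7, 12.5, 8786.6]
p50 = percentile_nearest_rank(sample, 0.50)
p95 = percentile_nearest_rank(sample, 0.95)
print(p50, p95)
```

The gap between p99 (about 12.5 s) and the maximum (about 8,787 s) in the output above points to a long right tail, so small differences between approximate and exact percentiles matter less than the handling of the extreme values.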


### Duration by class

In [11]:
read_sql(f"""
SELECT
  CASE WHEN label = 0 THEN 'BENIGN' ELSE 'MALICIOUS' END AS class,
  APPROX_PERCENTILE(dur, 0.50) AS p50,
  APPROX_PERCENTILE(dur, 0.95) AS p95
FROM {database_name}.unsw_nb15_raw
GROUP BY
  CASE WHEN label = 0 THEN 'BENIGN' ELSE 'MALICIOUS' END
""")


|   | class | p50 | p95 |
|---|---|---|---|
| 0 | BENIGN | 0.020982 | 1.653411 |
| 1 | MALICIOUS | 9e-06 | 1.705538 |


### Protocol distribution

In [12]:
read_sql(f"""
SELECT proto, COUNT(*) AS cnt
FROM {database_name}.unsw_nb15_raw
GROUP BY proto
ORDER BY cnt DESC
""")


|   | proto | cnt |
|---|---|---|
| 0 | tcp | 1495074 |
| 1 | udp | 990435 |
| 2 | unas | 16202 |
| 3 | arp | 10064 |
| 4 | ospf | 7798 |
| 5 | sctp | 1525 |
| 6 | icmp | 524 |
| 7 | any | 411 |
| 8 | gre | 324 |
| 9 | rsvp | 274 |
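
The protocol counts above are dominated by `tcp` and `udp`. One common way to handle such a long-tailed categorical feature is to pool rare values into a single bucket. The sketch below does that with the displayed counts; the 1% threshold and the helper `bucket_rare` are illustrative assumptions, not choices made in this notebook.

```python
# Counts copied from the output above (only the ten rows displayed).
proto_counts = {
    "tcp": 1495074, "udp": 990435, "unas": 16202, "arp": 10064, "ospf": 7798,
    "sctp": 1525, "icmp": 524, "any": 411, "gre": 324, "rsvp": 274,
}
TOTAL_ROWS = 2540047  # total row count reported earlier in the notebook

def bucket_rare(counts, total, min_share=0.01):
    """Keep categories holding at least `min_share` of rows; pool the rest as 'other'."""
    kept = {k: v for k, v in counts.items() if v / total >= min_share}
    kept["other"] = total - sum(kept.values())
    return kept

bucketed = bucket_rare(proto_counts, TOTAL_ROWS)
top_two_share = (proto_counts["tcp"] + proto_counts["udp"]) / TOTAL_ROWS
print(bucketed, round(top_two_share, 4))
```

Computing `other` from the total rather than from the listed rows also accounts for protocols that the ten-row display does not show.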


### Byte volume distribution

In [14]:
read_sql(f"""
SELECT
  APPROX_PERCENTILE(sbytes, 0.50) AS sbytes_p50,
  APPROX_PERCENTILE(sbytes, 0.95) AS sbytes_p95,
  APPROX_PERCENTILE(dbytes, 0.50) AS dbytes_p50,
  APPROX_PERCENTILE(dbytes, 0.95) AS dbytes_p95
FROM {database_name}.unsw_nb15_raw
""")


|   | sbytes_p50 | sbytes_p95 | dbytes_p50 | dbytes_p95 |
|---|---|---|---|---|
| 0 | 1285 | 19099 | 1848 | 73797 |


### Byte volume by class

In [15]:
read_sql(f"""
SELECT
  CASE WHEN label = 0 THEN 'BENIGN' ELSE 'MALICIOUS' END AS class,
  APPROX_PERCENTILE(sbytes, 0.95) AS sbytes_p95,
  APPROX_PERCENTILE(dbytes, 0.95) AS dbytes_p95
FROM {database_name}.unsw_nb15_raw
GROUP BY
  CASE WHEN label = 0 THEN 'BENIGN' ELSE 'MALICIOUS' END
""")


|   | class | sbytes_p95 | dbytes_p95 |
|---|---|---|---|
| 0 | BENIGN | 22222 | 85271 |
| 1 | MALICIOUS | 2503 | 1837 |
