# Exploratory Data Analysis (CIC-IDS2017)

Analysis of missing values, label distribution, and basic feature ranges for the CIC-IDS2017 network traffic dataset.

#### CIC-IDS2017 Feature Descriptions
**1. `flow_duration` (microseconds)**  
Total duration of the network flow from the first packet to the last packet.

**2. `total_fwd_packets` (packets)**  
Total number of packets sent in the forward (source → destination) direction.

**3. `total_backward_packets` (packets)**  
Total number of packets sent in the backward (destination → source) direction.

**4. `total_length_of_fwd_packets` (bytes)**  
Total number of bytes transmitted in forward packets.

**5. `total_length_of_bwd_packets` (bytes)**  
Total number of bytes transmitted in backward packets.

---

##### Forward packet length statistics

**6. `fwd_packet_length_max` (bytes)**  
Maximum packet size observed in the forward direction.

**7. `fwd_packet_length_min` (bytes)**  
Minimum packet size observed in the forward direction.

**8. `fwd_packet_length_mean` (bytes)**  
Mean packet size in the forward direction.

**9. `fwd_packet_length_std` (bytes)**  
Standard deviation of forward packet sizes.

---

##### Backward packet length statistics

**10. `bwd_packet_length_max` (bytes)**  
Maximum packet size observed in the backward direction.

**11. `bwd_packet_length_min` (bytes)**  
Minimum packet size observed in the backward direction.

**12. `bwd_packet_length_mean` (bytes)**  
Mean packet size in the backward direction.

**13. `bwd_packet_length_std` (bytes)**  
Standard deviation of backward packet sizes.

---

##### Flow rate features

**14. `flow_bytes_s` (bytes/second)**  
Average number of bytes transmitted per second over the flow duration.

**15. `flow_packets_s` (packets/second)**  
Average number of packets transmitted per second over the flow duration.
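
Both rate features divide a flow total by the flow duration, which is recorded in microseconds. The sketch below illustrates that relationship. `rate_per_second` is a hypothetical helper, not the tool that generated the dataset, and returning NaN for non-positive durations is an assumption made here for illustration.

```python
import math


def rate_per_second(total: float, duration_us: float) -> float:
    """Return `total` per second for a flow lasting `duration_us` microseconds.

    Assumption: non-positive durations are treated as undefined (NaN)
    instead of dividing by zero.
    """
    if duration_us <= 0:
        return math.nan
    return total / (duration_us / 1_000_000)


# 1,000 bytes over 2 seconds (2,000,000 microseconds)
print(rate_per_second(1_000, 2_000_000))  # 500.0
```

A generator that divides without such a guard can emit infinite or undefined rates for zero-duration flows, so these two columns are worth checking for non-finite values.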

---

##### Flow inter-arrival time (IAT) statistics

**16. `flow_iat_mean` (microseconds)**  
Mean inter-arrival time between packets in the flow.

**17. `flow_iat_std` (microseconds)**  
Standard deviation of packet inter-arrival times.

**18. `flow_iat_max` (microseconds)**  
Maximum inter-arrival time between packets.

**19. `flow_iat_min` (microseconds)**  
Minimum inter-arrival time between packets.
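
To make the IAT statistics concrete, the sketch below derives them from a list of packet timestamps. `iat_stats` is a hypothetical helper. The use of the population standard deviation and the zero values returned for single-packet flows are assumptions, not confirmed behavior of the tool that produced the dataset. Applying the same calculation to only forward or only backward packets would give the directional IAT features, including their totals.

```python
from statistics import mean, pstdev


def iat_stats(timestamps_us: list[int]) -> dict[str, float]:
    """Summarize gaps between consecutive packet timestamps (microseconds)."""
    ordered = sorted(timestamps_us)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        # Assumption: a single-packet flow has no gaps, so report zeros.
        return {"total": 0.0, "mean": 0.0, "std": 0.0, "max": 0.0, "min": 0.0}
    return {
        "total": float(sum(gaps)),
        "mean": float(mean(gaps)),
        "std": float(pstdev(gaps)),  # assumption: population standard deviation
        "max": float(max(gaps)),
        "min": float(min(gaps)),
    }


# Four packets produce three gaps: 100, 300, and 500 microseconds
stats = iat_stats([0, 100, 400, 900])
print(stats)
```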

---

##### Forward inter-arrival time (IAT) statistics

**20. `fwd_iat_total` (microseconds)**  
Total inter-arrival time between forward packets.

**21. `fwd_iat_mean` (microseconds)**  
Mean inter-arrival time of forward packets.

**22. `fwd_iat_std` (microseconds)**  
Standard deviation of forward packet inter-arrival times.

**23. `fwd_iat_max` (microseconds)**  
Maximum forward packet inter-arrival time.

**24. `fwd_iat_min` (microseconds)**  
Minimum forward packet inter-arrival time.

---

##### Backward inter-arrival time (IAT) statistics

**25. `bwd_iat_total` (microseconds)**  
Total inter-arrival time between backward packets.

**26. `bwd_iat_mean` (microseconds)**  
Mean inter-arrival time of backward packets.

**27. `bwd_iat_std` (microseconds)**  
Standard deviation of backward packet inter-arrival times.

**28. `bwd_iat_max` (microseconds)**  
Maximum backward packet inter-arrival time.

**29. `bwd_iat_min` (microseconds)**  
Minimum backward packet inter-arrival time.

---

##### TCP flag indicators (directional)

**30. `fwd_psh_flags` (count)**  
Number of PSH (Push) flags set in forward packets.

**31. `bwd_psh_flags` (count)**  
Number of PSH flags set in backward packets.

**32. `fwd_urg_flags` (count)**  
Number of URG (Urgent) flags set in forward packets.

**33. `bwd_urg_flags` (count)**  
Number of URG flags set in backward packets.

---

##### Header length and directional packet rate features

**34. `fwd_header_length` (bytes)**  
Total length of headers in forward packets.

**35. `bwd_header_length` (bytes)**  
Total length of headers in backward packets.

**36. `fwd_packets_s` (packets/second)**  
Rate of forward packets per second.

**37. `bwd_packets_s` (packets/second)**  
Rate of backward packets per second.

---

##### Packet length statistics (combined)

**38. `min_packet_length` (bytes)**  
Minimum packet size observed in the flow.

**39. `max_packet_length` (bytes)**  
Maximum packet size observed in the flow.

**40. `packet_length_mean` (bytes)**  
Mean packet size across the entire flow.

**41. `packet_length_std` (bytes)**  
Standard deviation of packet sizes.

**42. `packet_length_variance` (bytes²)**  
Variance of packet sizes.

---

##### TCP flag counts (aggregate)

**43. `fin_flag_count` (count)**  
Number of FIN flags observed in the flow.

**44. `syn_flag_count` (count)**  
Number of SYN flags observed.

**45. `rst_flag_count` (count)**  
Number of RST flags observed.

**46. `psh_flag_count` (count)**  
Number of PSH flags observed.

**47. `ack_flag_count` (count)**  
Number of ACK flags observed.

**48. `urg_flag_count` (count)**  
Number of URG flags observed.

**49. `cwe_flag_count` (count)**  
Number of CWR (Congestion Window Reduced) flags observed. The column name keeps the dataset's `cwe` spelling.

**50. `ece_flag_count` (count)**  
Number of ECE flags observed.

---

##### Traffic directionality and aggregation

**51. `down_up_ratio` (ratio)**  
Ratio of backward packets to forward packets.

**52. `average_packet_size` (bytes)**  
Average packet size across the flow.

**53. `avg_fwd_segment_size` (bytes)**  
Average TCP segment size in the forward direction.

**54. `avg_bwd_segment_size` (bytes)**  
Average TCP segment size in the backward direction.

---

##### Bulk transfer features

**55. `fwd_header_length_1` (bytes)**  
Alternative or duplicated calculation of forward header length.

**56. `fwd_avg_bytes_bulk` (bytes)**  
Average number of bytes transferred per forward bulk segment.

**57. `fwd_avg_packets_bulk` (packets)**  
Average number of packets per forward bulk segment.

**58. `fwd_avg_bulk_rate` (bytes/second)**  
Average forward bulk transfer rate.

**59. `bwd_avg_bytes_bulk` (bytes)**  
Average number of bytes transferred per backward bulk segment.

**60. `bwd_avg_packets_bulk` (packets)**  
Average number of packets per backward bulk segment.

**61. `bwd_avg_bulk_rate` (bytes/second)**  
Average backward bulk transfer rate.

---

##### Subflow features

**62. `subflow_fwd_packets` (packets)**  
Average number of forward packets per subflow.

**63. `subflow_fwd_bytes` (bytes)**  
Average number of forward bytes per subflow.

**64. `subflow_bwd_packets` (packets)**  
Average number of backward packets per subflow.

**65. `subflow_bwd_bytes` (bytes)**  
Average number of backward bytes per subflow.

---

##### TCP window and segment features

**66. `init_win_bytes_forward` (bytes)**  
Initial TCP window size in the forward direction.

**67. `init_win_bytes_backward` (bytes)**  
Initial TCP window size in the backward direction.

**68. `act_data_pkt_fwd` (packets)**  
Number of forward packets carrying application-layer payload data.

**69. `min_seg_size_forward` (bytes)**  
Minimum TCP segment size observed in the forward direction.

---

##### Flow activity and idle time statistics

**70. `active_mean` (microseconds)**  
Mean duration of active periods within the flow.

**71. `active_std` (microseconds)**  
Standard deviation of active durations.

**72. `active_max` (microseconds)**  
Maximum active duration.

**73. `active_min` (microseconds)**  
Minimum active duration.

**74. `idle_mean` (microseconds)**  
Mean duration of idle periods.

**75. `idle_std` (microseconds)**  
Standard deviation of idle durations.

**76. `idle_max` (microseconds)**  
Maximum idle duration.

**77. `idle_min` (microseconds)**  
Minimum idle duration.

---

##### Label

**78. `label` (categorical)**  
Ground-truth class label indicating whether the flow is benign or malicious.

---

##### Source  
CIC-IDS2017 Dataset  
https://www.unb.ca/cic/datasets/ids-2017.html


In [21]:
import boto3
import sagemaker
import pandas as pd
from sqlalchemy import create_engine, text

# Show all rows and columns, untruncated, in notebook output
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.width", None)

## SageMaker / Athena setup

In [2]:
sess = sagemaker.Session()
region = boto3.Session().region_name

results_bucket = sess.default_bucket()
athena_results_path = f"s3://{results_bucket}/athena/staging/"

database_name = "aai540_eda"

engine = create_engine(
    f"awsathena+rest://@athena.{region}.amazonaws.com:443/{database_name}",
    connect_args={
        "s3_staging_dir": athena_results_path,
        "region_name": region,
    },
)

def read_sql(query: str) -> pd.DataFrame:
    """Run a SQL query against Athena and return the result as a DataFrame."""
    return pd.read_sql(query, engine)

print("Region:", region)
print("Athena results:", athena_results_path)

Region: us-east-1
Athena results: s3://sagemaker-us-east-1-128131109986/athena/staging/


## Athena Tables

In [3]:
read_sql(f"SHOW TABLES IN {database_name}")

|   | tab_name |
|---|---|
| 0 | cic_ids2017_raw |
| 1 | ton_iot_raw |
| 2 | unsw_nb15_raw |


## CIC-IDS2017 EDA

### Row counts

In [25]:
read_sql(f"SELECT COUNT(*) AS total_rows FROM {database_name}.cic_ids2017_raw")

|   | total_rows |
|---|---|
| 0 | 2830743 |


### Columns

In [26]:
read_sql(f"SHOW COLUMNS FROM {database_name}.cic_ids2017_raw")

|   | field |
|---|---|
| 0 | destination_port |
| 1 | flow_duration |
| 2 | total_fwd_packets |
| 3 | total_backward_packets |
| 4 | total_length_of_fwd_packets |
| 5 | total_length_of_bwd_packets |
| 6 | fwd_packet_length_max |
| 7 | fwd_packet_length_min |
| 8 | fwd_packet_length_mean |
| 9 | fwd_packet_length_std |


### Binary label distribution

In [30]:
read_sql(f"""
SELECT
  CASE
    WHEN label = 'BENIGN' THEN 'BENIGN'
    ELSE 'MALICIOUS'
  END AS binary_label,
  COUNT(*) AS cnt
FROM {database_name}.cic_ids2017_raw
GROUP BY
  CASE
    WHEN label = 'BENIGN' THEN 'BENIGN'
    ELSE 'MALICIOUS'
  END
ORDER BY cnt DESC
""")

|   | binary_label | cnt |
|---|---|---|
| 0 | BENIGN | 2273097 |
| 1 | MALICIOUS | 557646 |
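
The two counts sum to the 2,830,743 rows reported in the row count query. As a small illustration, the sketch below turns them into class shares and a benign-to-malicious ratio. The variable names are hypothetical and the counts are copied from the output above.

```python
# Counts copied from the binary label query output
counts = {"BENIGN": 2_273_097, "MALICIOUS": 557_646}

total = sum(counts.values())
for label, cnt in counts.items():
    # Share of each class as a percentage of all flows
    print(f"{label}: {cnt / total:.2%}")

# Benign-to-malicious ratio, a rough measure of class imbalance
imbalance_ratio = counts["BENIGN"] / counts["MALICIOUS"]
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio well above 1 suggests that plain accuracy may be a misleading metric for any classifier trained on this split.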


### Attack type distribution

In [57]:
read_sql(f"""
SELECT label, COUNT(*) AS cnt
FROM {database_name}.cic_ids2017_raw
GROUP BY label
ORDER BY cnt DESC
""")

|   | label | cnt |
|---|---|---|
| 0 | BENIGN | 2273097 |
| 1 | DoS Hulk | 231073 |
| 2 | PortScan | 158930 |
| 3 | DDoS | 128027 |
| 4 | DoS GoldenEye | 10293 |
| 5 | FTP-Patator | 7938 |
| 6 | SSH-Patator | 5897 |
| 7 | DoS slowloris | 5796 |
| 8 | DoS Slowhttptest | 5499 |
| 9 | Bot | 1966 |


### Missing values analysis

In [60]:
cols_df = read_sql(f"SHOW COLUMNS FROM {database_name}.cic_ids2017_raw")
col_field = cols_df.columns[0]
columns = (
    cols_df[col_field]
    .dropna()
    .astype(str)
    .tolist()
)
# Drop blank entries and any rows starting with "#" (metadata rows, not column names)
columns = [c for c in columns if c.strip() and not c.lower().startswith("#")]

missing_sql = f"""
SELECT
  {", ".join([f"SUM(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END) AS {c}" for c in columns])}
FROM {database_name}.cic_ids2017_raw
"""
missing = read_sql(missing_sql).iloc[0]
missing

destination_port               0
flow_duration                  0
total_fwd_packets              0
total_backward_packets         0
total_length_of_fwd_packets    0
total_length_of_bwd_packets    0
fwd_packet_length_max          0
fwd_packet_length_min          0
fwd_packet_length_mean         0
fwd_packet_length_std          0
bwd_packet_length_max          0
bwd_packet_length_min          0
bwd_packet_length_mean         0
bwd_packet_length_std          0
flow_bytes_s                   0
flow_packets_s                 0
flow_iat_mean                  0
flow_iat_std                   0
flow_iat_max                   0
flow_iat_min                   0
fwd_iat_total                  0
fwd_iat_mean                   0
fwd_iat_std                    0
fwd_iat_max                    0
fwd_iat_min                    0
bwd_iat_total                  0
bwd_iat_mean                   0
bwd_iat_std                    0
bwd_iat_max                    0
bwd_iat_min                    0
fwd_psh_fl
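
A NULL count of zero does not rule out NaN or infinite values, which SQL treats as ordinary floating-point numbers. The rate queries later in this notebook do report `inf` maxima. The sketch below builds a companion query in the same style as `missing_sql`. `non_finite_sql` is a hypothetical helper, and it assumes the listed columns are stored as DOUBLE so that Athena's `is_nan` and `is_infinite` functions apply.

```python
def non_finite_sql(table: str, columns: list[str]) -> str:
    """Build one query counting NaN and infinite values per column."""
    parts = []
    for c in columns:
        parts.append(f"COUNT_IF(is_nan({c})) AS {c}_nan")
        parts.append(f"COUNT_IF(is_infinite({c})) AS {c}_inf")
    select_list = ",\n  ".join(parts)
    return f"SELECT\n  {select_list}\nFROM {table}"


# Assumption: these two columns are stored as DOUBLE in the Athena table.
query = non_finite_sql(
    "aai540_eda.cic_ids2017_raw", ["flow_bytes_s", "flow_packets_s"]
)
print(query)
```

The resulting string could be passed to `read_sql` in the same way as `missing_sql`.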

### Basic feature sanity (ports, duration)

In [16]:
read_sql(f"""
SELECT
  MIN(destination_port) AS min_port,
  MAX(destination_port) AS max_port,
  MIN(flow_duration) AS min_duration,
  APPROX_PERCENTILE(flow_duration, 0.50) AS p50_duration,
  APPROX_PERCENTILE(flow_duration, 0.95) AS p95_duration,
  APPROX_PERCENTILE(flow_duration, 0.99) AS p99_duration,
  MAX(flow_duration) AS max_duration
FROM {database_name}.cic_ids2017_raw
""")

|   | min_port | max_port | min_duration | p50_duration | p95_duration | p99_duration | max_duration |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 65535 | -13.0 | 37136.134073 | 101872200.0 | 117743300.0 | 119999998.0 |
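
The minimum duration of -13.0 shows that at least one flow has a negative duration, which is not physically meaningful. One possible follow-up is to count how many flows fall into each sign bucket. The sketch below only builds the SQL string. `duration_sign_sql` is a hypothetical helper, and it is assumed that the result would be run through `read_sql`.

```python
def duration_sign_sql(table: str) -> str:
    """Build a query that counts flows by the sign of flow_duration."""
    return f"""
SELECT
  COUNT_IF(flow_duration < 0) AS negative_duration,
  COUNT_IF(flow_duration = 0) AS zero_duration,
  COUNT_IF(flow_duration > 0) AS positive_duration
FROM {table}
""".strip()


print(duration_sign_sql("aai540_eda.cic_ids2017_raw"))
```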


### Packet and byte rate distributions

In [44]:
read_sql(f"""
SELECT
  MIN(flow_bytes_s) AS bytes_min,
  APPROX_PERCENTILE(flow_bytes_s, 0.50) AS bytes_p50,
  APPROX_PERCENTILE(flow_bytes_s, 0.95) AS bytes_p95,
  MAX(flow_bytes_s) AS bytes_max
FROM {database_name}.cic_ids2017_raw
WHERE flow_bytes_s IS NOT NULL
  AND flow_bytes_s > 0
""")


|   | bytes_min | bytes_p50 | bytes_p95 | bytes_max |
|---|---|---|---|---|
| 0 | 0.050344 | 8920.704327 | 2817338.0 | inf |


In [43]:
read_sql(f"""
SELECT
  MIN(flow_packets_s) AS packets_min,
  APPROX_PERCENTILE(flow_packets_s, 0.50) AS packets_p50,
  APPROX_PERCENTILE(flow_packets_s, 0.95) AS packets_p95,
  MAX(flow_packets_s) AS packets_max
FROM {database_name}.cic_ids2017_raw
""")

|   | packets_min | packets_p50 | packets_p95 | packets_max |
|---|---|---|---|---|
| 0 | -2000000.0 | 242.991748 | 592760.063827 | inf |
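
The `inf` maxima and the negative minimum make the raw rate columns unsuitable for summary statistics as they stand. A minimal sketch of one way to flag such values is shown below. `clean_rate` is a hypothetical helper, and treating negative rates as invalid is an assumption made here rather than a rule from the dataset documentation.

```python
import math
from typing import Optional


def clean_rate(value: float) -> Optional[float]:
    """Return the rate if it is finite and non-negative, else None."""
    if not math.isfinite(value) or value < 0:
        return None
    return value


# Illustrative values only, not rows from the dataset
raw = [242.99, math.inf, -2_000_000.0, math.nan, 0.0]
cleaned = [clean_rate(v) for v in raw]
valid = [v for v in cleaned if v is not None]

print(cleaned)
print(max(valid))
```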
