# Exploratory Data Analysis (TON_IOT)

Analysis of label distribution, missing values, connection duration, and protocol distribution for the TON_IOT network traffic dataset.

#### Feature Descriptions

##### Service Profile: Connection Activity

**1. `ts` (epoch seconds)**  
Timestamp indicating when the connection between flow identifiers was observed.

**2. `src_ip` (string)**  
Source IP address representing the originating endpoint.

**3. `src_port` (port number)**  
Source TCP/UDP port number used by the originating endpoint.

**4. `dst_ip` (string)**  
Destination IP address representing the responding endpoint.

**5. `dst_port` (port number)**  
Destination TCP/UDP port number used by the responding endpoint.

**6. `proto` (categorical)**  
Transport-layer protocol of the flow (e.g., TCP, UDP, ICMP).

**7. `service` (categorical)**  
Dynamically detected application-layer service (e.g., DNS, HTTP, SSL).

**8. `duration` (seconds)**  
Duration of the connection, calculated as the difference between the time of the last packet seen and the first packet seen.

**9. `src_bytes` (bytes)**  
Number of payload bytes sent from the source endpoint.

**10. `dst_bytes` (bytes)**  
Number of payload bytes sent from the destination endpoint.

**11. `conn_state` (categorical)**  
Connection state as defined by Zeek, such as:
- `S0`: connection attempt with no reply  
- `S1`: connection established  
- `REJ`: connection attempt rejected

**12. `missed_bytes` (bytes)**  
Number of bytes missing due to packet loss or capture gaps.
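
To make the connection fields above concrete, the following minimal sketch converts a `ts` value to a UTC datetime and expands the three `conn_state` codes listed above. The helper names and the sample record are hypothetical, and only the states documented here are mapped; any other Zeek state falls through to a placeholder string.

```python
from datetime import datetime, timezone

# Only the states described above; Zeek defines further states not listed here.
CONN_STATE_DESCRIPTIONS = {
    "S0": "connection attempt with no reply",
    "S1": "connection established",
    "REJ": "connection attempt rejected",
}


def describe_conn_state(code: str) -> str:
    """Return the documented description, or a placeholder for unmapped codes."""
    return CONN_STATE_DESCRIPTIONS.get(code, f"unmapped state: {code}")


def ts_to_utc(ts: float) -> datetime:
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# Hypothetical record, for illustration only.
record = {"ts": 1556000000, "conn_state": "S0"}
print(ts_to_utc(record["ts"]).isoformat(), describe_conn_state(record["conn_state"]))
```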

---

##### Service Profile: Statistical Activity

**13. `src_pkts` (packets)**  
Number of packets sent from the source endpoint.

**14. `src_ip_bytes` (bytes)**  
Total number of IP-layer bytes sent from the source endpoint, including headers.

**15. `dst_pkts` (packets)**  
Number of packets sent from the destination endpoint.

**16. `dst_ip_bytes` (bytes)**  
Total number of IP-layer bytes sent from the destination endpoint, including headers.
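
Because `src_ip_bytes` includes headers while `src_bytes` counts payload only, their difference gives a rough per-direction header overhead. The sketch below assumes that relationship holds for a given flow (it may not when packets are missed or retransmitted); the function names and the sample flow are hypothetical.

```python
def header_overhead(ip_bytes: int, payload_bytes: int) -> int:
    """Approximate header bytes: IP-layer bytes minus payload bytes."""
    return ip_bytes - payload_bytes


def mean_packet_size(ip_bytes: int, pkts: int) -> float:
    """Average IP-layer bytes per packet; 0.0 when no packets were seen."""
    return ip_bytes / pkts if pkts > 0 else 0.0


# Hypothetical flow values, for illustration only.
flow = {"src_bytes": 120, "src_pkts": 3, "src_ip_bytes": 276}
print(header_overhead(flow["src_ip_bytes"], flow["src_bytes"]))
print(mean_packet_size(flow["src_ip_bytes"], flow["src_pkts"]))
```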

---

##### Service Profile: DNS Activity

**17. `dns_query` (string)**  
Domain name queried in DNS requests.

**18. `dns_qclass` (integer code)**  
DNS query class value (e.g., IN = 1).

**19. `dns_qtype` (integer code)**  
DNS query type value (e.g., A = 1, AAAA = 28).

**20. `dns_rcode` (integer code)**  
DNS response code indicating the query result.

**21. `dns_AA` (boolean)**  
Authoritative Answer flag; true if the responding server is authoritative.

**22. `dns_RD` (boolean)**  
Recursion Desired flag; true if recursive lookup was requested.

**23. `dns_RA` (boolean)**  
Recursion Available flag; true if the server supports recursive queries.

**24. `dns_rejected` (boolean)**  
Indicates whether the DNS query was rejected by the server.
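
The integer codes in `dns_qclass`, `dns_qtype`, and `dns_rcode` are easier to read once mapped to their standard mnemonics. The lookup tables below are a partial sketch covering common values from the DNS specification; the `decode` helper is hypothetical, and unknown codes are passed through rather than guessed.

```python
# Partial lookup tables with standard DNS values; not exhaustive.
DNS_QCLASS_NAMES = {1: "IN"}
DNS_QTYPE_NAMES = {
    1: "A", 2: "NS", 5: "CNAME", 6: "SOA",
    12: "PTR", 15: "MX", 16: "TXT", 28: "AAAA",
}
DNS_RCODE_NAMES = {
    0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
    3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED",
}


def decode(code: int, table: dict) -> str:
    """Map an integer code to its mnemonic, keeping unknown codes visible."""
    return table.get(code, f"UNKNOWN({code})")


print(decode(28, DNS_QTYPE_NAMES), decode(1, DNS_QCLASS_NAMES), decode(3, DNS_RCODE_NAMES))
```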

---

##### Service Profile: SSL Activity

**25. `ssl_version` (string)**  
SSL/TLS protocol version offered by the server.

**26. `ssl_cipher` (string)**  
Cipher suite selected by the server during SSL/TLS negotiation.

**27. `ssl_resumed` (boolean)**  
Indicates whether the SSL session was resumed.

**28. `ssl_established` (boolean)**  
Indicates whether the SSL/TLS connection was successfully established.

**29. `ssl_subject` (string)**  
Subject field of the X.509 certificate offered by the server.

**30. `ssl_issuer` (string)**  
Issuer of the SSL/TLS certificate (certificate authority).

---

##### Service Profile: HTTP Activity

**31. `http_trans_depth` (integer)**  
Pipelining depth of the HTTP connection.

**32. `http_method` (string)**  
HTTP request method (e.g., GET, POST, HEAD).

**33. `http_uri` (string)**  
URI requested by the HTTP client.

**35. `http_version` (string)**  
HTTP protocol version used (e.g., HTTP/1.1).

**36. `http_request_body_len` (bytes)**  
Uncompressed size of the HTTP request body sent by the client.

**37. `http_response_body_len` (bytes)**  
Uncompressed size of the HTTP response body sent by the server.

**38. `http_status_code` (integer)**  
HTTP response status code returned by the server.

**39. `http_user_agent` (numeric or encoded)**  
Encoded value representing the HTTP User-Agent header.

**40. `http_orig_mime_types` (string)**  
Ordered list of MIME types observed in HTTP requests.

**41. `http_resp_mime_types` (string)**  
Ordered list of MIME types observed in HTTP responses.

---

##### Service Profile: Violation Activity

**42. `weird_name` (string)**  
Name of the protocol anomaly or violation detected by Zeek.

**43. `weird_addl` (string)**  
Additional contextual information related to the anomaly or violation.

**44. `weird_notice` (boolean)**  
Indicates whether the anomaly or violation triggered a Zeek notice.

---

##### Service Profile: Data Labelling

**45. `label` (binary)**  
Binary class label where:
- `0` indicates normal (benign) traffic  
- `1` indicates attack traffic

**46. `type` (categorical)**  
Attack category label (e.g., normal, DoS, DDoS, backdoor).
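
Since `label` and `type` encode overlapping information, a quick consistency check can catch mislabelled rows. The sketch below assumes that benign rows carry the `type` value `normal`, as the example above suggests; the function name and the sample rows are hypothetical.

```python
from typing import Iterable, List, Tuple

BENIGN_TYPE = "normal"  # assumption: benign traffic is tagged with this type


def inconsistent_labels(rows: Iterable[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Return (label, type) pairs where the binary label disagrees with the type."""
    bad = []
    for label, attack_type in rows:
        is_normal_type = attack_type.strip().lower() == BENIGN_TYPE
        # label 0 should pair with the benign type, label 1 with any other type.
        if (label == 0) != is_normal_type:
            bad.append((label, attack_type))
    return bad


# Hypothetical rows, for illustration only.
sample = [(0, "normal"), (1, "ddos"), (1, "normal"), (0, "xss")]
print(inconsistent_labels(sample))
```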

---

##### Source  
TON-IoT Dataset - Network Feature Description  
https://research.unsw.edu.au/projects/toniot-datasets  


In [21]:
import boto3
import sagemaker
import pandas as pd
from sqlalchemy import create_engine, text

# Show every row and column, untruncated, when the notebook displays a DataFrame
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.width", None)

## SageMaker / Athena setup

In [2]:
sess = sagemaker.Session()
region = boto3.Session().region_name

results_bucket = sess.default_bucket()
athena_results_path = f"s3://{results_bucket}/athena/staging/"

database_name = "aai540_eda"

engine = create_engine(
    f"awsathena+rest://@athena.{region}.amazonaws.com:443/{database_name}",
    connect_args={
        "s3_staging_dir": athena_results_path,
        "region_name": region,
    },
)

def read_sql(query: str) -> pd.DataFrame:
    """Run a SQL query against Athena and return the result as a DataFrame."""
    return pd.read_sql(query, engine)

print("Region:", region)
print("Athena results:", athena_results_path)

Region: us-east-1
Athena results: s3://sagemaker-us-east-1-128131109986/athena/staging/


## Athena Tables

In [3]:
read_sql(f"SHOW TABLES IN {database_name}")

Unnamed: 0,tab_name
0,cic_ids2017_raw
1,ton_iot_raw
2,unsw_nb15_raw


## TON_IOT EDA

### Count rows

In [46]:
read_sql(f"""
SELECT COUNT(*) AS total_rows
FROM {database_name}.ton_iot_raw
""")

Unnamed: 0,total_rows
0,22339021


### Columns

In [49]:
read_sql(f"SHOW COLUMNS FROM {database_name}.ton_iot_raw")

Unnamed: 0,field
0,ts
1,src_ip
2,src_port
3,dst_ip
4,dst_port
5,proto
6,service
7,duration
8,src_bytes
9,dst_bytes


### Binary label distribution

In [61]:
read_sql(f"""
SELECT
  CASE WHEN label = 0 THEN 'BENIGN' ELSE 'MALICIOUS' END AS class,
  COUNT(*) AS cnt,
  COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS pct
FROM {database_name}.ton_iot_raw
GROUP BY
  CASE WHEN label = 0 THEN 'BENIGN' ELSE 'MALICIOUS' END
ORDER BY cnt DESC
""")

Unnamed: 0,class,cnt,pct
0,MALICIOUS,21556114,0.964953
1,BENIGN,782907,0.035047
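
The counts above show a strong class imbalance. As a minimal sketch, the shares can be recomputed from the counts and turned into inverse-frequency weights of the form `total / (n_classes * count)`. This weighting scheme is one common option and an assumption here, not something the notebook applies.

```python
# Counts copied from the query output above.
class_counts = {"MALICIOUS": 21556114, "BENIGN": 782907}

total = sum(class_counts.values())
n_classes = len(class_counts)

# Share of each class in the full table.
shares = {name: cnt / total for name, cnt in class_counts.items()}

# Inverse-frequency weights: rarer classes receive larger weights.
weights = {name: total / (n_classes * cnt) for name, cnt in class_counts.items()}

for name in class_counts:
    print(f"{name}: share={shares[name]:.6f}, weight={weights[name]:.4f}")
```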


### Attack type distribution

In [54]:
read_sql(f"""
SELECT
  type AS attack_type,
  COUNT(*) AS cnt
FROM {database_name}.ton_iot_raw
WHERE label = 1
GROUP BY type
ORDER BY cnt DESC
""")

Unnamed: 0,attack_type,cnt
0,ddos,6165008
1,scanning,6153634
2,dos,3375328
3,xss,2108944
4,password,1718568
5,backdoor,508116
6,injection,452659
7,ransomware,72805
8,mitm,1052


### Missing values analysis

In [62]:
cols_df = read_sql(f"SHOW COLUMNS FROM {database_name}.ton_iot_raw")
col_field = cols_df.columns[0]
columns = (
    cols_df[col_field]
    .dropna()
    .astype(str)
    .tolist()
)
# Drop blank entries and any '#'-prefixed rows that are not real column names
columns = [c for c in columns if c.strip() and not c.lower().startswith("#")]

missing_sql = f"""
SELECT
  {", ".join([f"SUM(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END) AS {c}" for c in columns])}
FROM {database_name}.ton_iot_raw
"""
missing = read_sql(missing_sql).iloc[0]
missing

ts                              0
src_ip                          0
src_port                  1000000
dst_ip                          0
dst_port                  1000000
proto                           0
service                         0
duration                  1000000
src_bytes                  391921
dst_bytes                       0
conn_state                      0
missed_bytes              1000000
src_pkts                        0
src_ip_bytes                    0
dst_pkts                        0
dst_ip_bytes                    0
dns_query                       0
dns_qclass                1000000
dns_qtype                       0
dns_rcode                       0
dns_aa                          0
dns_rd                          0
dns_ra                          0
dns_rejected                    0
ssl_version                     0
ssl_cipher                      0
ssl_resumed                     0
ssl_established                 0
ssl_subject                     0
ssl_issuer    

### Duration distribution

In [63]:
read_sql(f"""
SELECT
  MIN(duration) AS min_duration,
  APPROX_PERCENTILE(duration, 0.50) AS p50_duration,
  APPROX_PERCENTILE(duration, 0.95) AS p95_duration,
  APPROX_PERCENTILE(duration, 0.99) AS p99_duration,
  MAX(duration) AS max_duration
FROM {database_name}.ton_iot_raw
""")


Unnamed: 0,min_duration,p50_duration,p95_duration,p99_duration,max_duration
0,0.0,0.000413,60.893127,65.434704,93516.92917


### Duration by class

In [64]:
read_sql(f"""
SELECT
  CASE WHEN label = 0 THEN 'BENIGN' ELSE 'MALICIOUS' END AS class,
  APPROX_PERCENTILE(duration, 0.50) AS p50,
  APPROX_PERCENTILE(duration, 0.95) AS p95,
  APPROX_PERCENTILE(duration, 0.99) AS p99
FROM {database_name}.ton_iot_raw
GROUP BY
  CASE WHEN label = 0 THEN 'BENIGN' ELSE 'MALICIOUS' END
""")


Unnamed: 0,class,p50,p95,p99
0,BENIGN,3e-05,0.93827,31.345933
1,MALICIOUS,0.000562,60.916354,65.40621


### Protocol distribution

In [69]:
read_sql(f"""
SELECT
  proto,
  COUNT(*) AS cnt
FROM {database_name}.ton_iot_raw
GROUP BY proto
ORDER BY cnt DESC
LIMIT 50
""")


Unnamed: 0,proto,cnt
0,tcp,19650665
1,udp,1669806
2,icmp,18550
3,53,13555
4,80,7155
5,443,6541
6,143,3170
7,22,3119
8,993,3067
9,587,2922
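
Several `proto` values in the output above (`53`, `80`, `443`, and so on) are numeric rather than protocol names, which may indicate misaligned or malformed rows. The sketch below flags values outside an expected set; the set `{tcp, udp, icmp}` is an assumption based on the feature description, and the counts are copied from the rows shown above.

```python
# Assumed set of valid transport-layer protocol names.
EXPECTED_PROTOCOLS = {"tcp", "udp", "icmp"}

# (proto, cnt) pairs copied from the rows shown in the query output.
proto_counts = [
    ("tcp", 19650665), ("udp", 1669806), ("icmp", 18550),
    ("53", 13555), ("80", 7155), ("443", 6541), ("143", 3170),
    ("22", 3119), ("993", 3067), ("587", 2922),
]

# Keep values that are not recognised protocol names.
unexpected = [p for p, _ in proto_counts if p.lower() not in EXPECTED_PROTOCOLS]
unexpected_rows = sum(cnt for p, cnt in proto_counts if p.lower() not in EXPECTED_PROTOCOLS)

print("Unexpected proto values:", unexpected)
print("Rows affected (among the values shown):", unexpected_rows)
```

Rows flagged this way are worth inspecting in the raw files before deciding whether to drop or repair them.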
