# Normalize Attack Categories Across Datasets

This notebook explores the attack type categories from the three merged datasets (CIC-IDS2017, TON_IoT, UNSW-NB15) and creates a normalized taxonomy for attack types.

## Goals
1. Explore attack type distributions in each dataset
2. Identify overlapping/similar attack categories
3. Create a unified attack taxonomy
4. Map original attack types to normalized categories

In [1]:
!pip -q install "PyAthena[SQLAlchemy]" sqlalchemy s3fs

In [2]:
import boto3
import sagemaker
import pandas as pd
import numpy as np
from sqlalchemy import create_engine, text

# Display settings
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.width", None)



## Connect to Athena

In [3]:
sess = sagemaker.Session()
region = boto3.Session().region_name

results_bucket = sess.default_bucket()
athena_results_path = f"s3://{results_bucket}/athena/staging/"

database_name = "aai540_eda"

engine = create_engine(
    f"awsathena+rest://@athena.{region}.amazonaws.com:443/{database_name}",
    connect_args={"s3_staging_dir": athena_results_path, "region_name": region},
)
print("Region:", region)
print("Athena results:", athena_results_path)

Region: us-east-1
Athena results: s3://sagemaker-us-east-1-128131109986/athena/staging/


In [4]:
# Helper functions for queries
def exec_ddl(sql: str):
    with engine.begin() as conn:
        conn.execute(text(sql))

def read_sql(sql: str) -> pd.DataFrame:
    return pd.read_sql(sql, engine)

## Explore Attack Types by Dataset

In [5]:
# Get attack type distribution for all datasets
attack_dist = read_sql(f"""
SELECT 
    source_dataset,
    attack_type,
    COUNT(*) AS count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY source_dataset), 2) AS percent
FROM {database_name}.merged_canonical
GROUP BY source_dataset, attack_type
ORDER BY source_dataset, count DESC
""")

print(f"Total unique attack types across all datasets: {attack_dist['attack_type'].nunique()}")
attack_dist

Total unique attack types across all datasets: 39


Unnamed: 0,source_dataset,attack_type,count,percent
0,CIC-IDS2017,BENIGN,2273097,80.3
1,CIC-IDS2017,DoS Hulk,231073,8.16
2,CIC-IDS2017,PortScan,158930,5.61
3,CIC-IDS2017,DDoS,128027,4.52
4,CIC-IDS2017,DoS GoldenEye,10293,0.36
5,CIC-IDS2017,FTP-Patator,7938,0.28
6,CIC-IDS2017,SSH-Patator,5897,0.21
7,CIC-IDS2017,DoS slowloris,5796,0.2
8,CIC-IDS2017,DoS Slowhttptest,5499,0.19
9,CIC-IDS2017,Bot,1966,0.07


### CIC-IDS2017 Attack Types

In [6]:
# CIC-IDS2017 attack types
cic_attacks = attack_dist[attack_dist['source_dataset'] == 'CIC-IDS2017'].copy()
print(f"CIC-IDS2017 unique attack types: {len(cic_attacks)}")
cic_attacks

CIC-IDS2017 unique attack types: 15


Unnamed: 0,source_dataset,attack_type,count,percent
0,CIC-IDS2017,BENIGN,2273097,80.3
1,CIC-IDS2017,DoS Hulk,231073,8.16
2,CIC-IDS2017,PortScan,158930,5.61
3,CIC-IDS2017,DDoS,128027,4.52
4,CIC-IDS2017,DoS GoldenEye,10293,0.36
5,CIC-IDS2017,FTP-Patator,7938,0.28
6,CIC-IDS2017,SSH-Patator,5897,0.21
7,CIC-IDS2017,DoS slowloris,5796,0.2
8,CIC-IDS2017,DoS Slowhttptest,5499,0.19
9,CIC-IDS2017,Bot,1966,0.07


### TON_IoT Attack Types

In [7]:
# TON_IoT attack types
ton_attacks = attack_dist[attack_dist['source_dataset'] == 'TON_IoT'].copy()
print(f"TON_IoT unique attack types: {len(ton_attacks)}")
ton_attacks

TON_IoT unique attack types: 10


Unnamed: 0,source_dataset,attack_type,count,percent
15,TON_IoT,ddos,6165008,28.89
16,TON_IoT,scanning,6153634,28.84
17,TON_IoT,dos,3375328,15.82
18,TON_IoT,xss,2108944,9.88
19,TON_IoT,password,1718568,8.05
20,TON_IoT,normal,782038,3.66
21,TON_IoT,backdoor,508116,2.38
22,TON_IoT,injection,452659,2.12
23,TON_IoT,ransomware,72805,0.34
24,TON_IoT,mitm,1052,0.0


### UNSW-NB15 Attack Types

In [8]:
# UNSW-NB15 attack types
unsw_attacks = attack_dist[attack_dist['source_dataset'] == 'UNSW-NB15'].copy()
print(f"UNSW-NB15 unique attack types: {len(unsw_attacks)}")
unsw_attacks

UNSW-NB15 unique attack types: 14


Unnamed: 0,source_dataset,attack_type,count,percent
25,UNSW-NB15,,2218764,87.35
26,UNSW-NB15,Generic,215481,8.48
27,UNSW-NB15,Exploits,44525,1.75
28,UNSW-NB15,Fuzzers,19195,0.76
29,UNSW-NB15,DoS,16353,0.64
30,UNSW-NB15,Reconnaissance,12228,0.48
31,UNSW-NB15,Fuzzers,5051,0.2
32,UNSW-NB15,Analysis,2677,0.11
33,UNSW-NB15,Backdoor,1795,0.07
34,UNSW-NB15,Reconnaissance,1759,0.07


## Create Unified Attack Taxonomy

Based on the attack types observed, we'll create a normalized taxonomy that groups similar attacks.

Common attack categories in network intrusion detection:
- **Normal**: Benign traffic
- **DoS/DDoS**: Denial of Service attacks
- **Probe/Reconnaissance**: Network scanning, port scanning
- **Exploits**: Exploitation attempts
- **Web Attacks**: HTTP/Web-based attacks
- **Brute Force**: Password cracking, SSH brute force
- **Botnet**: Bot/Botnet activity
- **Infiltration**: Network infiltration
- **Backdoor**: Backdoor access
- **Injection**: SQL injection, code injection
- **Generic Malware**: Generic malicious activity
- **Other**: Uncategorized attacks

In [9]:
# Get all unique attack types
all_attack_types = attack_dist['attack_type'].unique()
print(f"Total unique attack type labels: {len(all_attack_types)}")
print("\nAll attack types:")
for at in sorted(all_attack_types):
    print(f"  - {at}")

Total unique attack type labels: 39

All attack types:
  - 
  -  Fuzzers
  -  Fuzzers 
  -  Reconnaissance 
  -  Shellcode 
  - Analysis
  - BENIGN
  - Backdoor
  - Backdoors
  - Bot
  - DDoS
  - DoS
  - DoS GoldenEye
  - DoS Hulk
  - DoS Slowhttptest
  - DoS slowloris
  - Exploits
  - FTP-Patator
  - Generic
  - Heartbleed
  - Infiltration
  - PortScan
  - Reconnaissance
  - SSH-Patator
  - Shellcode
  - Web Attack � Brute Force
  - Web Attack � Sql Injection
  - Web Attack � XSS
  - Worms
  - backdoor
  - ddos
  - dos
  - injection
  - mitm
  - normal
  - password
  - ransomware
  - scanning
  - xss


## Define Attack Type Mapping

Create a mapping from original attack type names to normalized categories.

In [10]:
# Define normalized attack taxonomy mapping
# Keys must match the raw labels listed above exactly, including case and surrounding whitespace
attack_type_mapping = {
    # Normal traffic (various representations)
    'BENIGN': 'Normal',
    'Normal': 'Normal',
    'normal': 'Normal',
    '': 'Normal',  # Empty string - UNSW-NB15 normal traffic
    
    # DoS/DDoS attacks
    'DDoS': 'DoS/DDoS',
    'DoS Hulk': 'DoS/DDoS',
    'DoS GoldenEye': 'DoS/DDoS',
    'DoS slowloris': 'DoS/DDoS',
    'DoS Slowhttptest': 'DoS/DDoS',
    'DoS': 'DoS/DDoS',
    'ddos': 'DoS/DDoS',
    'dos': 'DoS/DDoS',  # TON_IoT lowercase dos
    
    # Reconnaissance/Scanning
    'PortScan': 'Reconnaissance',
    'Port Scanning': 'Reconnaissance',
    'Reconnaissance': 'Reconnaissance',
    ' Reconnaissance ': 'Reconnaissance',  # With spaces
    'Analysis': 'Reconnaissance',
    'scanning': 'Reconnaissance',
    
    # Exploits
    'Exploits': 'Exploits',
    'exploit': 'Exploits',
    'Heartbleed': 'Exploits',
    'Shellcode': 'Exploits',
    ' Shellcode ': 'Exploits',  # With spaces
    
    # Web Attacks
    'Web Attack � Brute Force': 'Web Attack',
    'Web Attack � XSS': 'Web Attack',
    'Web Attack � Sql Injection': 'Injection',
    'xss': 'Web Attack',
    
    # Brute Force
    'FTP-Patator': 'Brute Force',
    'SSH-Patator': 'Brute Force',
    'Brute Force': 'Brute Force',
    'password': 'Brute Force',
    
    # Botnet & Backdoors
    'Bot': 'Botnet',
    'Botnet': 'Botnet',
    'backdoor': 'Backdoor',
    'Backdoor': 'Backdoor',  # Capital B
    'Backdoors': 'Backdoor',  # Plural form
    
    # Infiltration
    'Infiltration': 'Infiltration',
    
    # Fuzzers
    'Fuzzers': 'Fuzzing',
    ' Fuzzers': 'Fuzzing',  # With leading space
    ' Fuzzers ': 'Fuzzing',  # With leading and trailing spaces
    
    # Generic malware
    'Generic': 'Generic Malware',
    'Worms': 'Generic Malware',
    'ransomware': 'Generic Malware',

    # Injection and man-in-the-middle (TON_IoT)
    'injection': 'Injection',
    'mitm': 'Man-in-the-Middle',
}

print(f"Defined mappings for {len(attack_type_mapping)} attack types")
print(f"Normalized categories: {sorted(set(attack_type_mapping.values()))}")

Defined mappings for 45 attack types
Normalized categories: ['Backdoor', 'Botnet', 'Brute Force', 'DoS/DDoS', 'Exploits', 'Fuzzing', 'Generic Malware', 'Infiltration', 'Injection', 'Man-in-the-Middle', 'Normal', 'Reconnaissance', 'Web Attack']


In [11]:
# Check which attack types don't have mappings yet
unmapped = [at for at in all_attack_types if at not in attack_type_mapping]
if unmapped:
    print("⚠️  Unmapped attack types (need to be added to mapping):")
    for at in sorted(unmapped):
        print(f"  - '{at}'")
else:
    print("✓ All attack types have been mapped!")

✓ All attack types have been mapped!
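
The check passes for the current data, but a label that appears later would be handled differently on each side: a plain dictionary lookup in pandas yields a missing value, while the SQL `CASE ... ELSE 'Other'` used further down assigns `Other`. A sketch of a lookup that applies the same fallback, assuming a hypothetical `normalize_label` helper and a small stand-in mapping (`zero-day` is an invented label used only to exercise the fallback):

```python
def normalize_label(raw, mapping, default="Other"):
    """Look up a raw label, retrying with whitespace stripped, else return default.

    Hypothetical helper for illustration; not part of the notebook.
    """
    if not isinstance(raw, str):
        # Missing values (None/NaN) fall through to the default category
        return default
    if raw in mapping:
        return mapping[raw]
    return mapping.get(raw.strip(), default)


# Stand-in for attack_type_mapping, reduced to three entries
demo_mapping = {"BENIGN": "Normal", "Fuzzers": "Fuzzing", "dos": "DoS/DDoS"}
demo_labels = ["BENIGN", " Fuzzers ", "dos", "zero-day", None]
normalized = [normalize_label(label, demo_mapping) for label in demo_labels]
print(normalized)
```

In the notebook this could be applied with `attack_dist['attack_type'].map(lambda s: normalize_label(s, attack_type_mapping))`, so that the pandas preview and the Athena table agree on unmapped labels.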


## Apply Normalization and Preview

In [12]:
# Apply mapping to the attack distribution dataframe
attack_dist['normalized_attack_type'] = attack_dist['attack_type'].map(attack_type_mapping)

# Show mapping results
print("Original vs Normalized Attack Types:")
mapping_preview = attack_dist[['source_dataset', 'attack_type', 'normalized_attack_type', 'count']].sort_values(['source_dataset', 'count'], ascending=[True, False])
mapping_preview

Original vs Normalized Attack Types:


Unnamed: 0,source_dataset,attack_type,normalized_attack_type,count
0,CIC-IDS2017,BENIGN,Normal,2273097
1,CIC-IDS2017,DoS Hulk,DoS/DDoS,231073
2,CIC-IDS2017,PortScan,Reconnaissance,158930
3,CIC-IDS2017,DDoS,DoS/DDoS,128027
4,CIC-IDS2017,DoS GoldenEye,DoS/DDoS,10293
5,CIC-IDS2017,FTP-Patator,Brute Force,7938
6,CIC-IDS2017,SSH-Patator,Brute Force,5897
7,CIC-IDS2017,DoS slowloris,DoS/DDoS,5796
8,CIC-IDS2017,DoS Slowhttptest,DoS/DDoS,5499
9,CIC-IDS2017,Bot,Botnet,1966


In [13]:
# Aggregate by normalized attack type
normalized_dist = attack_dist.groupby(['source_dataset', 'normalized_attack_type']).agg({
    'count': 'sum',
    'attack_type': lambda x: ', '.join(sorted(set(x)))
}).reset_index()
normalized_dist.columns = ['source_dataset', 'normalized_attack_type', 'count', 'original_types']

# Add percentage
normalized_dist['percent'] = normalized_dist.groupby('source_dataset')['count'].transform(
    lambda x: round(x / x.sum() * 100, 2)
)

print("Normalized Attack Distribution by Dataset:")
normalized_dist.sort_values(['source_dataset', 'count'], ascending=[True, False])

Normalized Attack Distribution by Dataset:


Unnamed: 0,source_dataset,normalized_attack_type,count,original_types,percent
6,CIC-IDS2017,Normal,2273097,BENIGN,80.3
2,CIC-IDS2017,DoS/DDoS,380688,"DDoS, DoS GoldenEye, DoS Hulk, DoS Slowhttptest, DoS slowloris",13.45
7,CIC-IDS2017,Reconnaissance,158930,PortScan,5.61
1,CIC-IDS2017,Brute Force,13835,"FTP-Patator, SSH-Patator",0.49
8,CIC-IDS2017,Web Attack,2159,"Web Attack � Brute Force, Web Attack � XSS",0.08
0,CIC-IDS2017,Botnet,1966,Bot,0.07
4,CIC-IDS2017,Infiltration,36,Infiltration,0.0
5,CIC-IDS2017,Injection,21,Web Attack � Sql Injection,0.0
3,CIC-IDS2017,Exploits,11,Heartbleed,0.0
11,TON_IoT,DoS/DDoS,9540336,"ddos, dos",44.71


In [14]:
# Overall normalized distribution across all datasets
overall_normalized = normalized_dist.groupby('normalized_attack_type').agg({
    'count': 'sum'
}).reset_index().sort_values('count', ascending=False)

overall_normalized['percent'] = round(overall_normalized['count'] / overall_normalized['count'].sum() * 100, 2)

print("Overall Normalized Attack Distribution (All Datasets):")
overall_normalized

Overall Normalized Attack Distribution (All Datasets):


Unnamed: 0,normalized_attack_type,count,percent
3,DoS/DDoS,9937377,37.21
11,Reconnaissance,6329228,23.7
10,Normal,5273899,19.75
12,Web Attack,2111103,7.9
2,Brute Force,1732403,6.49
0,Backdoor,510445,1.91
8,Injection,452680,1.69
6,Generic Malware,288460,1.08
4,Exploits,46047,0.17
5,Fuzzing,24246,0.09


## Create SQL CASE Statement for Normalization

Generate the SQL needed to create a normalized attack type column in Athena.

In [15]:
# Generate SQL CASE statement
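def generate_case_statement(mapping_dict, column_name='attack_type', alias='normalized_attack_type'):
    """Build a SQL CASE expression mapping raw labels to normalized categories."""
    def quote(value):
        # Double single quotes so a label containing ' cannot break the SQL literal
        return "'" + value.replace("'", "''") + "'"

    lines = ["CASE"]
    for original, normalized in sorted(mapping_dict.items()):
        lines.append(f"    WHEN {column_name} = {quote(original)} THEN {quote(normalized)}")
    lines.append("    ELSE 'Other'")
    lines.append(f"END AS {alias}")
    return '\n'.join(lines)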

case_statement = generate_case_statement(attack_type_mapping)
print("SQL CASE statement for normalization:")
print("="*80)
print(case_statement)
print("="*80)

SQL CASE statement for normalization:
CASE
    WHEN attack_type = '' THEN 'Normal'
    WHEN attack_type = ' Fuzzers' THEN 'Fuzzing'
    WHEN attack_type = ' Fuzzers ' THEN 'Fuzzing'
    WHEN attack_type = ' Reconnaissance ' THEN 'Reconnaissance'
    WHEN attack_type = ' Shellcode ' THEN 'Exploits'
    WHEN attack_type = 'Analysis' THEN 'Reconnaissance'
    WHEN attack_type = 'BENIGN' THEN 'Normal'
    WHEN attack_type = 'Backdoor' THEN 'Backdoor'
    WHEN attack_type = 'Backdoors' THEN 'Backdoor'
    WHEN attack_type = 'Bot' THEN 'Botnet'
    WHEN attack_type = 'Botnet' THEN 'Botnet'
    WHEN attack_type = 'Brute Force' THEN 'Brute Force'
    WHEN attack_type = 'DDoS' THEN 'DoS/DDoS'
    WHEN attack_type = 'DoS' THEN 'DoS/DDoS'
    WHEN attack_type = 'DoS GoldenEye' THEN 'DoS/DDoS'
    WHEN attack_type = 'DoS Hulk' THEN 'DoS/DDoS'
    WHEN attack_type = 'DoS Slowhttptest' THEN 'DoS/DDoS'
    WHEN attack_type = 'DoS slowloris' THEN 'DoS/DDoS'
    WHEN attack_type = 'Exploits' THEN 'Expl

## Save Mapping to CSV

In [16]:
# Create mapping dataframe
mapping_df = pd.DataFrame([
    {'original_attack_type': k, 'normalized_attack_type': v}
    for k, v in sorted(attack_type_mapping.items())
])

# Add counts from the distribution
mapping_with_counts = mapping_df.merge(
    attack_dist[['attack_type', 'count']].groupby('attack_type').sum().reset_index(),
    left_on='original_attack_type',
    right_on='attack_type',
    how='left'
).drop('attack_type', axis=1)

# Save to CSV
output_path = '/home/sagemaker-user/AAI-540-Group5/feature-eng/attack_type_mapping.csv'
mapping_with_counts.to_csv(output_path, index=False)
print(f"Saved attack type mapping to: {output_path}")
mapping_with_counts

Saved attack type mapping to: /home/sagemaker-user/AAI-540-Group5/feature-eng/attack_type_mapping.csv


Unnamed: 0,original_attack_type,normalized_attack_type,count
0,,Normal,2218764.0
1,Fuzzers,Fuzzing,5051.0
2,Fuzzers,Fuzzing,19195.0
3,Reconnaissance,Reconnaissance,12228.0
4,Shellcode,Exploits,1288.0
5,Analysis,Reconnaissance,2677.0
6,BENIGN,Normal,2273097.0
7,Backdoor,Backdoor,1795.0
8,Backdoors,Backdoor,534.0
9,Bot,Botnet,1966.0


## Create Normalized Table in Athena

Create a new table with the normalized attack types.

In [17]:
# Define S3 location for normalized table
normalized_location = f"s3://{results_bucket}/merged_canonical_normalized/"

# Clean up S3 location
s3_client = boto3.client('s3')
bucket = results_bucket
prefix = "merged_canonical_normalized/"

print(f"Cleaning S3 location: {normalized_location}")
try:
    paginator = s3_client.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix)
    
    delete_count = 0
    for page in pages:
        if 'Contents' in page:
            objects = [{'Key': obj['Key']} for obj in page['Contents']]
            if objects:
                s3_client.delete_objects(Bucket=bucket, Delete={'Objects': objects})
                delete_count += len(objects)
    
    print(f"Deleted {delete_count} objects from S3")
except Exception as e:
    print(f"Note: {e}")

# Drop table if exists
exec_ddl(f"DROP TABLE IF EXISTS {database_name}.merged_canonical_normalized")
print("Dropped existing table (if any)")
print(f"New table will be written to: {normalized_location}")

Cleaning S3 location: s3://sagemaker-us-east-1-128131109986/merged_canonical_normalized/
Deleted 0 objects from S3
Dropped existing table (if any)
New table will be written to: s3://sagemaker-us-east-1-128131109986/merged_canonical_normalized/


In [18]:
# Create normalized table with CTAS
normalized_ctas_query = f"""
CREATE TABLE {database_name}.merged_canonical_normalized
WITH (
    format = 'PARQUET',
    external_location = '{normalized_location}',
    parquet_compression = 'SNAPPY'
) AS
SELECT
    duration,
    pkt_total,
    bytes_total,
    pkt_fwd,
    pkt_bwd,
    bytes_fwd,
    bytes_bwd,
    label,
    attack_type AS original_attack_type,
    {generate_case_statement(attack_type_mapping, 'attack_type', 'attack_category')},
    source_dataset
FROM {database_name}.merged_canonical
"""

print("Creating normalized table...")
exec_ddl(normalized_ctas_query)
print("\nNormalized table created successfully!")

Creating normalized table...

Normalized table created successfully!


## Verify Normalized Table

In [19]:
# Verify the normalized table
read_sql(f"SHOW TABLES IN {database_name}")

Unnamed: 0,tab_name
0,cic_ids2017_raw
1,merged_canonical
2,merged_canonical_normalized
3,ton_iot_raw
4,unsw_nb15_raw


In [20]:
# Check schema
read_sql(f"SHOW COLUMNS FROM {database_name}.merged_canonical_normalized")

Unnamed: 0,field
0,duration
1,pkt_total
2,bytes_total
3,pkt_fwd
4,pkt_bwd
5,bytes_fwd
6,bytes_bwd
7,label
8,original_attack_type
9,attack_category


In [21]:
# Preview normalized data
read_sql(f"""
SELECT *
FROM {database_name}.merged_canonical_normalized
LIMIT 10
""")

Unnamed: 0,duration,pkt_total,bytes_total,pkt_fwd,pkt_bwd,bytes_fwd,bytes_bwd,label,original_attack_type,attack_category,source_dataset
0,0.000127,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT
1,6.6e-05,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT
2,2e-05,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT
3,1.5e-05,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT
4,2e-06,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT
5,1e-06,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT
6,1e-05,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT
7,3e-06,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT
8,2e-06,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT
9,1e-06,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT


In [22]:
# Get normalized attack distribution
normalized_verify = read_sql(f"""
SELECT 
    source_dataset,
    attack_category,
    COUNT(*) AS count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY source_dataset), 2) AS percent
FROM {database_name}.merged_canonical_normalized
GROUP BY source_dataset, attack_category
ORDER BY source_dataset, count DESC
""")

print("Normalized Attack Category Distribution:")
normalized_verify

Normalized Attack Category Distribution:


Unnamed: 0,source_dataset,attack_category,count,percent
0,CIC-IDS2017,Normal,2273097,80.3
1,CIC-IDS2017,DoS/DDoS,380688,13.45
2,CIC-IDS2017,Reconnaissance,158930,5.61
3,CIC-IDS2017,Brute Force,13835,0.49
4,CIC-IDS2017,Web Attack,2159,0.08
5,CIC-IDS2017,Botnet,1966,0.07
6,CIC-IDS2017,Infiltration,36,0.0
7,CIC-IDS2017,Injection,21,0.0
8,CIC-IDS2017,Exploits,11,0.0
9,TON_IoT,DoS/DDoS,9540336,44.71
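
The Athena counts above should match the pandas-side aggregation computed earlier in `normalized_dist`. Rather than comparing the two tables by eye, the comparison can be sketched as a small diff over dictionaries keyed by `(source_dataset, category)`. The helper `diff_counts` is hypothetical; the two sample dictionaries reuse rows shown in the outputs above.

```python
def diff_counts(expected, actual):
    """Return {key: (expected_count, actual_count)} for every key whose counts differ.

    Keys missing on one side are treated as a count of 0.
    """
    mismatches = {}
    for key in sorted(set(expected) | set(actual)):
        exp = expected.get(key, 0)
        act = actual.get(key, 0)
        if exp != act:
            mismatches[key] = (exp, act)
    return mismatches


# Two rows taken from the pandas-side and Athena-side outputs
pandas_side = {("CIC-IDS2017", "Normal"): 2273097, ("CIC-IDS2017", "DoS/DDoS"): 380688}
athena_side = {("CIC-IDS2017", "Normal"): 2273097, ("CIC-IDS2017", "DoS/DDoS"): 380688}
mismatches = diff_counts(pandas_side, athena_side)
print(mismatches)  # an empty dict means both sides agree
```

The dictionaries could be built from the DataFrames with, for example, `normalized_verify.set_index(['source_dataset', 'attack_category'])['count'].to_dict()`.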


In [23]:
# Overall summary
overall_summary = read_sql(f"""
SELECT 
    attack_category,
    COUNT(*) AS total_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS percent,
    COUNT(DISTINCT source_dataset) AS num_datasets
FROM {database_name}.merged_canonical_normalized
GROUP BY attack_category
ORDER BY total_count DESC
""")

print("\nOverall Normalized Attack Category Summary:")
overall_summary


Overall Normalized Attack Category Summary:


Unnamed: 0,attack_category,total_count,percent,num_datasets
0,DoS/DDoS,9937377,37.21,3
1,Reconnaissance,6329228,23.7,3
2,Normal,5273899,19.75,3
3,Web Attack,2111103,7.9,2
4,Brute Force,1732403,6.49,2
5,Backdoor,510445,1.91,2
6,Injection,452680,1.69,2
7,Generic Malware,288460,1.08,2
8,Exploits,46047,0.17,2
9,Fuzzing,24246,0.09,1


## Sanity Check: Compare with Original Data

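Only 19.75% of rows fall into the `Normal` category, which is lower than expected given that CIC-IDS2017 and UNSW-NB15 are each more than 80% benign. The queries below check the binary `label` column in the original `merged_canonical` table to see whether this imbalance is already present before normalization.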

In [24]:
# Check binary label distribution in original merged_canonical table
label_dist = read_sql(f"""
SELECT 
    label,
    CASE WHEN label = 0 THEN 'Normal' ELSE 'Attack' END AS label_type,
    COUNT(*) AS count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS percent
FROM {database_name}.merged_canonical
GROUP BY label
ORDER BY label
""")

print("Binary Label Distribution (Original Data):")
label_dist

Binary Label Distribution (Original Data):


Unnamed: 0,label,label_type,count,percent
0,0,Normal,5273899,19.75
1,1,Attack,21435043,80.25


In [25]:
# Check label distribution by source dataset
label_by_source = read_sql(f"""
SELECT 
    source_dataset,
    SUM(CASE WHEN label = 0 THEN 1 ELSE 0 END) AS normal_count,
    SUM(CASE WHEN label = 1 THEN 1 ELSE 0 END) AS attack_count,
    COUNT(*) AS total_count,
    ROUND(SUM(CASE WHEN label = 0 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS normal_percent,
    ROUND(SUM(CASE WHEN label = 1 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS attack_percent
FROM {database_name}.merged_canonical
GROUP BY source_dataset
ORDER BY source_dataset
""")

print("\nLabel Distribution by Source Dataset (Original Data):")
label_by_source


Label Distribution by Source Dataset (Original Data):


Unnamed: 0,source_dataset,normal_count,attack_count,total_count,normal_percent,attack_percent
0,CIC-IDS2017,2273097,557646,2830743,80.3,19.7
1,TON_IoT,782038,20556114,21338152,3.66,96.34
2,UNSW-NB15,2218764,321283,2540047,87.35,12.65
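
Matching totals do not by themselves prove that every row is consistent: a row with `label = 0` should carry the `Normal` category, and a row with `label = 1` should not. A sketch of that row-level check, assuming the input is the result of grouping `merged_canonical_normalized` by `label` and `attack_category`. The helper `find_label_conflicts` is hypothetical and the rows below are toy values, not notebook output.

```python
def find_label_conflicts(rows):
    """Return the (label, category, count) rows where label and category disagree.

    rows: iterable of (label, attack_category, row_count) tuples.
    """
    conflicts = []
    for label, category, count in rows:
        label_says_normal = (label == 0)
        category_says_normal = (category == "Normal")
        if label_says_normal != category_says_normal:
            conflicts.append((label, category, count))
    return conflicts


# Toy grouped rows; the last two are deliberately inconsistent
toy_rows = [
    (0, "Normal", 100),
    (1, "DoS/DDoS", 40),
    (1, "Normal", 3),
    (0, "Exploits", 2),
]
conflicts = find_label_conflicts(toy_rows)
print(conflicts)
```

An empty result on the real table would indicate that the normalized categories and the binary labels tell the same story.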


In [32]:
merged = read_sql(f"""
SELECT *
FROM {database_name}.merged_canonical_normalized
LIMIT 5
""")
merged

Unnamed: 0,duration,pkt_total,bytes_total,pkt_fwd,pkt_bwd,bytes_fwd,bytes_bwd,label,original_attack_type,attack_category,source_dataset
0,0.000127,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT
1,6.6e-05,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT
2,2e-05,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT
3,1e-06,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT
4,1e-05,2,0,1,1,0,0,1,dos,DoS/DDoS,TON_IoT


In [40]:
attack_type_dist = read_sql(f"""
SELECT
  COALESCE(NULLIF(trim(attack_category), ''), 'UNKNOWN') AS attack_category,
  COUNT(*) AS row_count
FROM {database_name}.merged_canonical_normalized
GROUP BY 1
ORDER BY row_count DESC
""")

attack_type_dist

Unnamed: 0,attack_category,row_count
0,DoS/DDoS,9937377
1,Reconnaissance,6329228
2,Normal,5273899
3,Web Attack,2111103
4,Brute Force,1732403
5,Backdoor,510445
6,Injection,452680
7,Generic Malware,288460
8,Exploits,46047
9,Fuzzing,24246


In [41]:
label_dist = read_sql(f"""
SELECT
  label,
  COUNT(*) AS row_count
FROM {database_name}.merged_canonical
GROUP BY label
ORDER BY label
""")

label_dist



Unnamed: 0,label,row_count
0,0,5273899
1,1,21435043
