# IoT Network Threat Detection - Exploratory Data Analysis

**Project:** IoT Network Threat Detection using Machine Learning  
**Dataset:** TON_IoT from UNSW Canberra Cyber  
**Date:** July 18, 2025  
**Phase:** 1 - Data Processing and Environment Setup

## Objectives
1. Understand the structure and characteristics of TON_IoT datasets
2. Identify key features for threat detection
3. Analyze class distributions and imbalance
4. Define a data preprocessing strategy

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Data paths
DATA_DIR = Path('../data')
IOT_DIR = DATA_DIR / 'Processed_IoT_dataset'
NETWORK_DIR = DATA_DIR / 'Processed_Network_dataset'
GROUND_TRUTH_DIR = DATA_DIR / 'SecuityEvents_GroundTruth_datasets'

## 1. IoT Device Data Analysis

Let's start by examining the IoT device datasets. These contain telemetry (sensor readings such as temperature and device status) from various IoT devices, with each record labeled as normal or attack.

In [None]:
# Load IoT Thermostat dataset as our primary example
df_thermostat = pd.read_csv(IOT_DIR / 'IoT_Thermostat.csv')

print("=== IoT THERMOSTAT DATASET OVERVIEW ===")
print(f"Shape: {df_thermostat.shape}")
print(f"Memory usage: {df_thermostat.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"\nColumns: {df_thermostat.columns.tolist()}")
print(f"\nData types:\n{df_thermostat.dtypes}")
print(f"\nNull values:\n{df_thermostat.isnull().sum()}")

In [None]:
# Class distribution analysis
print("=== CLASS DISTRIBUTION ===")
print("\nBinary labels:")
label_counts = df_thermostat['label'].value_counts()
print(label_counts)
print(f"\nClass imbalance ratio (normal:attack): {label_counts[0] / label_counts[1]:.2f}:1")

print("\nAttack types:")
print(df_thermostat['type'].value_counts())
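
The ratio above indexes `value_counts()` by the labels `0` and `1`, which raises a `KeyError` if a file encodes labels differently or contains only one class. As a minimal sketch (the helper names `imbalance_ratio` and `balanced_class_weights` are hypothetical, and the toy labels are made up), the ratio can be computed from the majority and minority counts instead. The weights use the formula behind scikit-learn's `class_weight='balanced'`: `n_samples / (n_classes * count_c)`.

```python
import pandas as pd


def imbalance_ratio(labels: pd.Series) -> float:
    """Return majority-class count divided by minority-class count."""
    counts = labels.value_counts()
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a ratio")
    return counts.max() / counts.min()


def balanced_class_weights(labels: pd.Series) -> dict:
    """Per-class weights w_c = n_samples / (n_classes * count_c)."""
    counts = labels.value_counts()  # NaN labels are excluded by default
    n_samples = counts.sum()
    return {cls: n_samples / (len(counts) * n) for cls, n in counts.items()}


# Toy labels (made up for illustration): 8 normal, 2 attack
toy_labels = pd.Series([0] * 8 + [1] * 2)
ratio = imbalance_ratio(toy_labels)
weights = balanced_class_weights(toy_labels)
print(f"Imbalance ratio: {ratio:.2f}:1")
print(f"Balanced class weights: {weights}")
```

Because the helper does not assume which class is the majority, the same call works for both the thermostat data (mostly normal) and the network data (mostly attacks).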

In [None]:
# Visualization: Class distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Binary classification
df_thermostat['label'].value_counts().plot(kind='bar', ax=axes[0])
axes[0].set_title('Binary Classification\n(0=Normal, 1=Attack)')
axes[0].set_xlabel('Class')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=0)

# Multi-class classification
df_thermostat['type'].value_counts().plot(kind='bar', ax=axes[1])
axes[1].set_title('Multi-class Classification\n(Attack Types)')
axes[1].set_xlabel('Attack Type')
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Feature analysis
print("=== FEATURE ANALYSIS ===")
print("\nStatistical summary:")
print(df_thermostat.describe())

print("\nUnique values per feature:")
for col in df_thermostat.columns:
    print(f"{col}: {df_thermostat[col].nunique()} unique values")

In [None]:
# Temperature and status analysis
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Temperature distribution by class
normal_data = df_thermostat[df_thermostat['label'] == 0]['current_temperature']
attack_data = df_thermostat[df_thermostat['label'] == 1]['current_temperature']

axes[0].hist(normal_data, bins=50, alpha=0.7, label='Normal', density=True)
axes[0].hist(attack_data, bins=50, alpha=0.7, label='Attack', density=True)
axes[0].set_xlabel('Temperature')
axes[0].set_ylabel('Density')
axes[0].set_title('Temperature Distribution by Class')
axes[0].legend()

# Thermostat status distribution
status_counts = df_thermostat.groupby(['thermostat_status', 'label']).size().unstack(fill_value=0)
status_counts.plot(kind='bar', ax=axes[1])
axes[1].set_title('Thermostat Status vs Attack Labels')
axes[1].set_xlabel('Thermostat Status')
axes[1].set_ylabel('Count')
axes[1].legend(['Normal', 'Attack'])
axes[1].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

## 2. Network Traffic Data Analysis

Now let's examine the network traffic datasets, which contain more detailed network-level features.

In [None]:
# Load one file of the network dataset as a sample
df_network = pd.read_csv(NETWORK_DIR / 'Network_dataset_1.csv', low_memory=False)

print("=== NETWORK DATASET OVERVIEW ===")
print(f"Shape: {df_network.shape}")
print(f"Memory usage: {df_network.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"\nColumns ({len(df_network.columns)}):")
print(df_network.columns.tolist())
print(f"\nData types:\n{df_network.dtypes}")

In [None]:
# Network dataset class distribution
print("=== NETWORK DATASET CLASS DISTRIBUTION ===")
print("\nBinary labels:")
network_label_counts = df_network['label'].value_counts()
print(network_label_counts)
print(f"\nClass imbalance ratio (attack:normal): {network_label_counts[1] / network_label_counts[0]:.2f}:1")

if 'type' in df_network.columns:
    print("\nAttack types:")
    print(df_network['type'].value_counts())

In [None]:
# Key network features analysis
print("=== KEY NETWORK FEATURES ===")
key_features = ['duration', 'src_bytes', 'dst_bytes', 'src_pkts', 'dst_pkts']
available_features = [f for f in key_features if f in df_network.columns]

print(f"\nAnalyzing features: {available_features}")
print(df_network[available_features].describe())

# Null values in key features
print("\nNull values in key features:")
print(df_network[available_features].isnull().sum())

In [None]:
# Protocol and service analysis
print("=== PROTOCOL AND SERVICE ANALYSIS ===")
if 'proto' in df_network.columns:
    print("\nProtocol distribution:")
    print(df_network['proto'].value_counts().head(10))

if 'service' in df_network.columns:
    print("\nService distribution:")
    print(df_network['service'].value_counts().head(10))

if 'conn_state' in df_network.columns:
    print("\nConnection state distribution:")
    print(df_network['conn_state'].value_counts().head(10))

## 3. Data Quality Assessment

Let's assess the overall data quality and identify preprocessing requirements.

In [None]:
# Data quality assessment
def assess_data_quality(df, dataset_name):
    print(f"=== DATA QUALITY ASSESSMENT: {dataset_name} ===")
    
    # Missing values (percentage of rows per column)
    missing_pct = df.isnull().mean() * 100
    missing_cols = missing_pct[missing_pct > 0].sort_values(ascending=False)
    print("\nColumns with missing values:")
    if missing_cols.empty:
        print("  (none)")
    for col, pct in missing_cols.items():
        print(f"  {col}: {pct:.2f}%")
    
    # Duplicate records
    duplicates = df.duplicated().sum()
    print(f"\nDuplicate records: {duplicates} ({duplicates/len(df)*100:.2f}%)")
    
    # Number of columns per data type
    print("\nData types:")
    for dtype, n_cols in df.dtypes.value_counts().items():
        print(f"  {dtype}: {n_cols} columns")
    
    return missing_cols, duplicates

# Assess both datasets
thermostat_missing, thermostat_dupes = assess_data_quality(df_thermostat, "IoT Thermostat")
network_missing, network_dupes = assess_data_quality(df_network, "Network Traffic")
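
One caveat: `isnull()` only counts real `NaN` values. If a file encodes missing fields with a placeholder string, those cells are reported as present. The sketch below assumes a `'-'` placeholder, which is an assumption to confirm against the actual CSV files. The helper name `normalize_missing` and the toy values are hypothetical.

```python
import pandas as pd

# Assumed placeholder tokens; confirm against the actual files before use.
PLACEHOLDERS = ["-"]


def normalize_missing(df: pd.DataFrame, placeholders=PLACEHOLDERS) -> pd.DataFrame:
    """Return a copy of df where placeholder strings are replaced by NaN."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # numeric columns cannot hold string placeholders
        # mask() blanks out every cell where the condition is True
        out[col] = out[col].mask(out[col].isin(placeholders))
    return out


# Toy frame (made-up values) using two column names from the network dataset
toy = pd.DataFrame({
    "service": ["http", "-", "dns", "-"],
    "src_bytes": [120, 0, 64, 0],
})
before = toy.isnull().sum()
after = normalize_missing(toy).isnull().sum()
print("Nulls before:", before.to_dict())
print("Nulls after: ", after.to_dict())
```

Passing `na_values=['-']` to `pd.read_csv` would have the same effect at load time. Running `assess_data_quality` on the normalized frame then reports these cells as missing.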

## 4. Feature Engineering Strategy

Based on the EDA, let's define our feature engineering and preprocessing strategy.

In [None]:
print("=== FEATURE ENGINEERING STRATEGY ===")
print("\n1. IoT Device Features:")
print("   - current_temperature: Normalize/scale")
print("   - thermostat_status: Binary feature (already encoded)")
print("   - date/time: Extract temporal features (hour, day, etc.)")

print("\n2. Network Traffic Features:")
print("   - Numerical features: duration, src_bytes, dst_bytes, src_pkts, dst_pkts")
print("   - Categorical features: proto, service, conn_state")
print("   - IP addresses: Extract network segments or encode")

print("\n3. Class Imbalance Handling:")
thermostat_counts = df_thermostat['label'].value_counts()
network_counts = df_network['label'].value_counts()
print(f"   - IoT Thermostat: {thermostat_counts[0] / thermostat_counts[1]:.1f}:1 imbalance (normal:attack)")
print(f"   - Network Traffic: {network_counts[1] / network_counts[0]:.1f}:1 imbalance (attack:normal)")
print("   - Use class_weight='balanced' (scikit-learn) or scale_pos_weight (XGBoost)")
print("   - Consider SMOTE for severe imbalance")

print("\n4. Data Preprocessing Steps:")
print("   - Handle missing values (drop or impute)")
print("   - Remove duplicates")
print("   - Encode categorical variables")
print("   - Scale numerical features")
print("   - Feature selection based on importance")
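
To make the strategy above concrete, here is a minimal pandas-only sketch of three of the listed steps: temporal feature extraction, numerical scaling, and categorical encoding. The column names follow this notebook, but the toy values and the date/time string formats are assumptions, and thermostat and network columns are mixed into one frame purely for brevity.

```python
import pandas as pd

# Toy frame: made-up values, assumed date/time formats.
toy = pd.DataFrame({
    "date": ["2019-04-25", "2019-04-25", "2019-04-26"],
    "time": ["08:15:00", "23:40:10", "12:00:30"],
    "current_temperature": [25.0, 27.0, 29.0],
    "proto": ["tcp", "udp", "tcp"],
})

# Temporal features: combine date and time, then extract hour and weekday.
timestamp = pd.to_datetime(
    toy["date"] + " " + toy["time"], format="%Y-%m-%d %H:%M:%S"
)
features = pd.DataFrame({
    "hour": timestamp.dt.hour,
    "day_of_week": timestamp.dt.dayofweek,  # Monday=0 ... Sunday=6
})

# Numerical scaling: z-score using the population standard deviation.
temp = toy["current_temperature"]
features["temperature_scaled"] = (temp - temp.mean()) / temp.std(ddof=0)

# Categorical encoding: one indicator column per protocol value.
features = features.join(pd.get_dummies(toy["proto"], prefix="proto", dtype=int))

print(features)
```

In a real pipeline the scaling statistics (mean and standard deviation) and the set of category values should be learned from the training split only and then reused on the test split, so that no information leaks from test data into the features.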

## 5. Summary and Next Steps

### Key Findings:

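1. **IoT Thermostat Dataset**: 442K records, 6 columns (including the `label` and `type` targets), 87.3% normal vs 12.7% attacks
2. **Network Dataset** (`Network_dataset_1.csv` only): 1M records, 46 columns, 79.1% attacks vs 20.9% normal
3. **Class Imbalance**: Both datasets are imbalanced, but in opposite directions: the thermostat data is mostly normal (about 6.9:1), while the network file is mostly attacks (about 3.8:1)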
4. **Data Quality**: Some missing values in date/time fields, minimal duplicates
5. **Feature Types**: Mix of numerical, categorical, and temporal features

### Next Steps (IOT-R-02 completion):

1. **Implement preprocessing pipeline** in `src/preprocess.py`
2. **Feature selection** based on correlation and importance
3. **Handle class imbalance** with appropriate techniques
4. **Create train/test splits** for model development
5. **Prepare data** for Random Forest and XGBoost training
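
Given the class imbalance noted in the findings, one option for step 4 is a stratified split, which keeps the label proportions the same in the train and test sets. Below is a minimal pandas-only sketch. The helper name `stratified_split` is hypothetical, the toy data is made up, and the helper assumes the DataFrame has a unique index.

```python
import pandas as pd


def stratified_split(df, label_col="label", test_frac=0.2, seed=42):
    """Split df into (train, test) while preserving per-class proportions.

    Assumes df has a unique index, since test rows are removed by index.
    """
    # Sample the same fraction from every class to build the test set
    test = df.groupby(label_col, group_keys=False).sample(
        frac=test_frac, random_state=seed
    )
    train = df.drop(index=test.index)
    return train, test


# Toy data (made up): 80 normal rows and 20 attack rows
toy = pd.DataFrame({"x": range(100), "label": [0] * 80 + [1] * 20})
train, test = stratified_split(toy)

print("Train size:", len(train), "Test size:", len(test))
print("Attack share in test:", (toy.loc[test.index, "label"] == 1).mean())
```

scikit-learn's `train_test_split(..., stratify=y)` is the more common way to get the same behavior once the modeling dependencies are in place.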

### Model Strategy:

- **Random Forest**: Good baseline, handles mixed data types, built-in feature importance
- **XGBoost**: Higher performance target, fast training, class imbalance can be addressed with its `scale_pos_weight` parameter
- **Evaluation**: Precision, Recall, F1-Score (accuracy alone is misleading on imbalanced classes)
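
To illustrate why accuracy is not enough here, the sketch below scores a trivial classifier that predicts "normal" for every record. The labels are toy values (90 normal, 10 attack), not results from the datasets, and `precision_recall_f1` is a hypothetical helper written out by hand to show the definitions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy labels (made up): 90 normal (0) and 10 attack (1)
y_true = [0] * 90 + [1] * 10
always_normal = [0] * 100  # a classifier that never flags an attack

accuracy = sum(t == p for t, p in zip(y_true, always_normal)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, always_normal)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")
```

The classifier reaches 90% accuracy while detecting no attacks at all, which is why recall and F1 on the attack class are the numbers to watch.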
