# Project 2 - Machine Learning Modeling Overview

## Selection of dataset
This project uses the Detection of IoT Botnet Attacks (N-BaIoT) dataset from the UCI Machine Learning Repository:

https://archive.ics.uci.edu/dataset/442/detection+of+iot+botnet+attacks+n+baiot

The dataset is organized as one folder per IoT device. Each folder contains multiple CSV files, and each file holds either benign traffic or traffic from one type of attack captured from that device. Each row of a CSV is a feature vector extracted from a window of traffic.

# Business Understanding

## Background
A botnet is a network of internet-connected devices (such as computers, phones, and IoT gadgets) that have been infected with malware and are controlled remotely by an attacker, often without the owner's knowledge.

- Bot = infected device ("robot")
- Net = network
- Botnet = a network of infected devices working together

Botnets are used for:
- DDoS attacks - flooding a website with traffic until it crashes
- Spam campaigns - sending millions of phishing emails
- Data theft/keylogging - stealing login credentials
- Crypto mining - using the infected device to mine cryptocurrency
- Click fraud - generating fake ad clicks for revenue
- Spreading malware - infecting other systems

How botnets work:
1. Malware infects a device (e.g., via phishing, weak passwords, or open ports)
2. The device becomes a "zombie" that quietly waits for commands
3. All zombie devices are controlled by a command-and-control (C&C) server
4. The attacker sends orders to launch attacks or harvest data

IoT botnets increasingly target smart devices such as doorbells, cameras, thermostats, and baby monitors, because they often have poor security, making them easy targets.

## Dataset Details
What is the N-BaIoT dataset?
It’s a benchmark dataset created by researchers to help build and evaluate machine learning models for detecting botnet attacks on Internet of Things (IoT) devices.

Key facts:
1. Contains data from 9 IoT devices (e.g., doorbell, camera, baby monitor)
2. Covers benign traffic and multiple botnet attack types (e.g., Mirai and Bashlite; Bashlite is also known as Gafgyt, the name used in the dataset's folder names)
3. Files are CSVs per device, each representing one type of behavior
4. Features are pre-extracted statistical features from raw network traffic
5. There is no "label" column, but the file names indicate whether the traffic is benign or an attack
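Because the labels live in the file paths rather than in a column, a finer-grained label can be derived the same way as the binary one. The sketch below assumes the folder layout visible in the loader output of this notebook (`<family>_attacks\<type>.csv` next to `benign_traffic.csv`). `label_from_path` is a hypothetical helper introduced here for illustration; it is not part of the dataset or of pandas.

```python
from pathlib import PureWindowsPath


def label_from_path(rel_path):
    """Return (binary_label, family, attack_type) for a CSV path.

    Assumes attack files sit in a '<family>_attacks' folder and that
    benign files have 'benign' in their file name.
    """
    # PureWindowsPath accepts both '\\' and '/' as separators.
    path = PureWindowsPath(rel_path)
    if "benign" in path.name.lower():
        return ("benign", "benign", "benign")

    parts = path.parts
    # The parent folder (e.g. 'gafgyt_attacks') names the botnet family.
    family = parts[-2].lower().replace("_attacks", "") if len(parts) > 1 else "unknown"
    # The file name without extension (e.g. 'combo') names the attack type.
    attack_type = path.stem.lower()
    return ("attack", family, attack_type)


for p in [r"gafgyt_attacks\combo.csv", r"mirai_attacks\ack.csv", "benign_traffic.csv"]:
    print(p, "->", label_from_path(p))
```

Keeping the family and attack type as separate fields would let the same loader feed both the binary task and a later multi-class task.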

Purpose of dataset:
To support network-based intrusion detection systems that:
1. monitor traffic from IoT devices
2. detect abnormal patterns linked to botnet activity
3. work without needing deep packet inspection (for speed and privacy)

## Project goals
1. Train a binary classifier to detect whether an IoT event is benign or an attack
2. If time allows, train a multi-class classifier to identify specific botnet families

## Project Considerations
1. Labeling - there is no "label" column, but the file names indicate benign traffic or the attack type
2. Class imbalance - the data is highly imbalanced: about 92% of rows are attack traffic and only about 8% are benign. A model can reach high accuracy simply by always predicting "attack" while still having poor recall on the minority class (benign)
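The effect of the imbalance can be made concrete with a little arithmetic. The sketch below uses the row counts printed by `value_counts()` in the Data Understanding section and scores a trivial baseline that predicts "attack" for every row. The baseline is hypothetical and only illustrates why accuracy is misleading here.

```python
# Row counts taken from the label value_counts() output in this notebook.
n_attack = 6_506_674
n_benign = 555_932
n_total = n_attack + n_benign

attack_share = n_attack / n_total
benign_share = n_benign / n_total

# Trivial baseline: predict "attack" for every row.
baseline_accuracy = n_attack / n_total   # every attack row is counted as correct
baseline_benign_recall = 0 / n_benign    # no benign row is ever identified

print(f"attack share:           {attack_share:.1%}")
print(f"benign share:           {benign_share:.1%}")
print(f"baseline accuracy:      {baseline_accuracy:.1%}")
print(f"baseline benign recall: {baseline_benign_recall:.1%}")
```

Any real model should therefore be compared against this baseline rather than against 50% accuracy.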



# Data Understanding

In [None]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [None]:
# define function to load data

def load_n_baiot_data(device_dir):
    dataframes = []
    
    # walk through the directory structure
    for root, dirs, files in os.walk(device_dir):
        for f in files:
            if f.endswith('.csv'):
                path = os.path.join(root, f)

                # determine label based on filename
                label = 'benign' if 'benign' in f.lower() else 'attack'
                print(f"Loading: {path} as {label}")
                # read the CSV file; the first row is used as column names (pandas default)
                df = pd.read_csv(path)

                # add label (benign or attack) and source file to the dataframe
                df['label'] = label
                df['source_file'] = f
                dataframes.append(df)

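    # fail with a clear message if the folder contained no CSV files
    if not dataframes:
        raise FileNotFoundError(f"No CSV files found under {device_dir}")
    return pd.concat(dataframes, ignore_index=True)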

# define function to load all devices
def load_all_devices(base_dir):
    all_dfs = []

    for device_name in os.listdir(base_dir):
        device_path = os.path.join(base_dir, device_name)
        if os.path.isdir(device_path):
            print(f"Loading: {device_name}")
            df = load_n_baiot_data(device_path)
            df['device'] = device_name
            all_dfs.append(df)

    return pd.concat(all_dfs, ignore_index=True)



In [None]:
# define the folder where the data is stored

# path for PC
# base_dir = r"C:\Users\jtlee\Documents\Flatiron\data\detection of iot botnet attacks n baiot"

# path for laptop
base_dir = r
# load all devices data
full_df = load_all_devices(base_dir)

Loading: Danmini_Doorbell
Loading: C:\Users\jtlee\Documents\Flatiron\data\detection of iot botnet attacks n baiot\Danmini_Doorbell\benign_traffic.csv as benign
Loading: C:\Users\jtlee\Documents\Flatiron\data\detection of iot botnet attacks n baiot\Danmini_Doorbell\gafgyt_attacks\combo.csv as attack
Loading: C:\Users\jtlee\Documents\Flatiron\data\detection of iot botnet attacks n baiot\Danmini_Doorbell\gafgyt_attacks\junk.csv as attack
Loading: C:\Users\jtlee\Documents\Flatiron\data\detection of iot botnet attacks n baiot\Danmini_Doorbell\gafgyt_attacks\scan.csv as attack
Loading: C:\Users\jtlee\Documents\Flatiron\data\detection of iot botnet attacks n baiot\Danmini_Doorbell\gafgyt_attacks\tcp.csv as attack
Loading: C:\Users\jtlee\Documents\Flatiron\data\detection of iot botnet attacks n baiot\Danmini_Doorbell\gafgyt_attacks\udp.csv as attack
Loading: C:\Users\jtlee\Documents\Flatiron\data\detection of iot botnet attacks n baiot\Danmini_Doorbell\mirai_attacks\ack.csv as attack
Loading: 

In [None]:
print("Full device dataframe shape:\n", full_df.shape)
print("\nFull device dataframe - # of observations:\n", full_df['device'].value_counts())
print("\nFull device dataframe - # of types of observations:\n", full_df['label'].value_counts())
print("\nFull device dataframe - preview:\n", full_df.head())

Full device dataframe shape:
 (7062606, 118)

Full device dataframe - # of observations:
 device
Philips_B120N10_Baby_Monitor                1098677
Danmini_Doorbell                            1018298
SimpleHome_XCS7_1002_WHT_Security_Camera     863056
SimpleHome_XCS7_1003_WHT_Security_Camera     850826
Provision_PT_838_Security_Camera             836891
Ecobee_Thermostat                            835876
Provision_PT_737E_Security_Camera            828260
Samsung_SNH_1011_N_Webcam                    375222
Ennio_Doorbell                               355500
Name: count, dtype: int64

Full device dataframe - # of types of observations:
 label
attack    6506674
benign     555932
Name: count, dtype: int64

Full device dataframe - preview:
    MI_dir_L5_weight  MI_dir_L5_mean  MI_dir_L5_variance  MI_dir_L3_weight  \
0          1.000000       60.000000            0.000000          1.000000   
1          1.000000      354.000000            0.000000          1.000000   
2          1.857879  

The data is highly imbalanced, so `train_test_split` should use stratified splits, and models such as logistic regression, random forest, and XGBoost should be trained with class weights.
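A minimal sketch of a stratified split combined with class weighting is shown below. It runs on synthetic data with roughly the same 92/8 class ratio, not on the N-BaIoT features. The feature matrix, the row count, and the choice of `LogisticRegression` are assumptions made only to keep the example small and fast.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the real data: 1 = attack (~92%), 0 = benign (~8%).
n_rows = 5000
y = (rng.random(n_rows) < 0.92).astype(int)
X = rng.normal(size=(n_rows, 5)) + y[:, None] * 1.5

# stratify=y keeps the class ratio the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

print("benign share (train):", round(1 - y_train.mean(), 4))
print("benign share (test): ", round(1 - y_test.mean(), 4))
```

`RandomForestClassifier` accepts the same `class_weight` argument. XGBoost's scikit-learn wrapper handles binary imbalance through `scale_pos_weight` instead.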

Accuracy alone is not a reliable metric here. Evaluation will rely on precision, recall, F1-score, ROC-AUC, and the confusion matrix.
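The sketch below shows how these metrics can be computed with scikit-learn, reporting precision, recall, and F1 for the minority (benign) class. The labels and scores are a hypothetical ten-row toy example, not model output from this project.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical toy data: 1 = attack, 0 = benign.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
# Predicted probability of "attack" for each row.
y_score = np.array([0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.99, 0.75, 0.65, 0.2])
y_pred = (y_score >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes, ordered [benign, attack].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# pos_label=0 reports the metrics for the benign (minority) class.
benign_precision = precision_score(y_true, y_pred, pos_label=0)
benign_recall = recall_score(y_true, y_pred, pos_label=0)
benign_f1 = f1_score(y_true, y_pred, pos_label=0)
auc = roc_auc_score(y_true, y_score)

print("confusion matrix:\n", cm)
print("benign precision:", benign_precision)
print("benign recall:   ", benign_recall)
print("benign F1:       ", round(benign_f1, 3))
print("ROC-AUC:         ", auc)
```

In this toy case 9 of 10 predictions are correct, yet only half of the benign rows are recovered, which is the gap that accuracy hides.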

# Data Preparation

# Exploratory Data Analysis

# Statistical Data Analysis

# Evaluation

# Conclusion