# Anomaly Detection API Notebook
This notebook demonstrates the reusable API for loading, preprocessing, and training models on the **UNSW-NB15 dataset**.

### Contents
1. Overview of API structure
2. Import and test key functions
3. Demonstrate a minimal end-to-end example


In [5]:
%pip install scikit-learn

In [2]:
%pip install xgboost==1.7.6

Note: you may need to restart the kernel to use updated packages.


In [1]:
from anomaly_utils import (
    load_unsw_from_zip,
    basic_eda,
    build_preprocess_and_split,
    fast_numeric_feature_selection,
)


In [2]:
df = load_unsw_from_zip("archive.zip", extract_dir="./data")
basic_eda(df)


✅ Dataset Loaded Successfully
Shape: (2797720, 49)

--- Data Types ---
float64    42
object      6
int64       1
Name: count, dtype: int64

--- Missing Values (Top 10) ---
attack_cat          2476437
ct_ftp_cmd          1429879
is_ftp_login        1429879
ct_flw_http_mthd    1348145
dsport               257977
ct_src_ltm           257673
ct_dst_src_ltm       257673
ct_dst_sport_ltm     257673
sport                     8
sttl                      0
dtype: int64

--- Sample Rows ---
            srcip    sport          dstip  dsport proto state       dur  \
49     59.166.0.0   1390.0  149.171.126.6    53.0   udp   CON  0.001055   
50     59.166.0.0  33661.0  149.171.126.9  1024.0   udp   CON  0.036133   
51     59.166.0.6   1464.0  149.171.126.7    53.0   udp   CON  0.001119   

    sbytes  dbytes  sttl  ...  ct_ftp_cmd  ct_srv_src  ct_srv_dst ct_dst_ltm  \
49   132.0   164.0  31.0  ...         0.0         3.0         7.0        1.0   
50   528.0   304.0  31.0  ...         0.0         2.0
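
The report above (shape, dtype counts, top missing-value columns) can be approximated with plain pandas. This is a minimal sketch, not the implementation of `basic_eda`, whose source is not shown here. The function name `eda_summary` and the toy frame are hypothetical and exist only so the cell runs without the dataset.

```python
import pandas as pd


def eda_summary(df: pd.DataFrame, top_n: int = 10) -> dict:
    """Hypothetical stand-in for basic_eda: returns the summary instead of printing it."""
    return {
        "shape": df.shape,
        # Number of columns per dtype
        "dtype_counts": df.dtypes.value_counts(),
        # Columns with the most missing values, largest first
        "missing_top": df.isna().sum().sort_values(ascending=False).head(top_n),
    }


# Tiny synthetic frame (assumption: stands in for the real UNSW-NB15 data)
toy = pd.DataFrame(
    {
        "sport": [1390.0, None, 1464.0],
        "proto": ["udp", "udp", "tcp"],
        "attack_cat": [None, None, "DoS"],
    }
)

summary = eda_summary(toy)
print("Shape:", summary["shape"])
print(summary["missing_top"])
```

Returning a dictionary rather than printing makes the summary easy to assert on in tests or to log alongside a training run.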

## Building our preprocessing and feature-selection API layer

In [3]:
(
    preprocess,
    X_train,
    X_test,
    y_train,
    y_test,
    X_train_proc,
    X_test_proc,
    num_cols,
    cat_cols,
) = build_preprocess_and_split(df)


Preprocessing pipeline ready.
Train: (2098290, 48)  Test: (699430, 48)
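
The printed shapes correspond to a 75/25 split (699430 of 2797720 rows go to the test set) with one column removed as the target. The sketch below shows one way a helper with the same return signature could be built from scikit-learn parts. It is an illustration under assumptions, not the code behind `build_preprocess_and_split`: the function name `split_and_preprocess`, the target column name `label`, the stratified split, the median imputer, and the scaler/encoder choices are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def split_and_preprocess(df, target="label", test_size=0.25, seed=42):
    """Hypothetical helper: split first, then fit preprocessing on the training part only."""
    X = df.drop(columns=[target])
    y = df[target]
    num_cols = X.select_dtypes(include="number").columns.tolist()
    cat_cols = [c for c in X.columns if c not in num_cols]

    numeric = Pipeline(
        [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
    )
    preprocess = ColumnTransformer(
        [
            ("num", numeric, num_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ]
    )

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Fit on train only so test-set statistics never leak into the transform
    X_train_proc = preprocess.fit_transform(X_train)
    X_test_proc = preprocess.transform(X_test)
    return (preprocess, X_train, X_test, y_train, y_test,
            X_train_proc, X_test_proc, num_cols, cat_cols)


# Synthetic data (assumption: column names borrowed from the dataset for readability)
rng = np.random.default_rng(0)
n = 40
toy = pd.DataFrame(
    {
        "dur": rng.random(n),
        "sbytes": rng.integers(0, 1000, n).astype(float),
        "proto": np.where(np.arange(n) % 2 == 0, "udp", "tcp"),
        "label": np.repeat([0, 1], n // 2),
    }
)
toy.loc[3, "sbytes"] = np.nan  # one missing value for the imputer to fill

(preprocess, X_train, X_test, y_train, y_test,
 X_train_proc, X_test_proc, num_cols, cat_cols) = split_and_preprocess(toy)
print("Train:", X_train.shape, " Test:", X_test.shape)
```

Splitting before fitting the transformer is the important design choice here: scaling and imputation statistics come from the training rows alone.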


## Numeric feature-selection API

In [4]:
from anomaly_utils import fast_numeric_feature_selection

selected_numeric = fast_numeric_feature_selection(
    X_train,
    y_train,
    numeric_columns=num_cols,
    out_dir="outputs",   # will save outputs/selected_numeric.json
    top_k_mi=8,          # keep the 8 features with the highest mutual information (MI)
    min_k=3,             # EFS: min subset size
    max_k=5,             # EFS: max subset size
    sample_size=50000,   # run EFS only on a 50k-row sample (keeps it fast)
)

print("✅ Final selected numeric features:", selected_numeric)


Top MI features: ['sttl', 'dttl', 'Sload', 'ct_state_ttl', 'smeansz', 'sbytes', 'dur', 'Dintpkt']
Best EFS subset: ['sttl', 'dttl', 'Sload', 'ct_state_ttl', 'smeansz']
Saved to: outputs/selected_numeric.json
✅ Final selected numeric features: ['sttl', 'dttl', 'Sload', 'ct_state_ttl', 'smeansz']
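
The output suggests a two-stage procedure: rank features by mutual information, then search subsets of the top-ranked features. With `top_k_mi=8`, `min_k=3`, and `max_k=5`, such a search covers C(8,3) + C(8,4) + C(8,5) = 56 + 70 + 56 = 182 subsets, which is why running it on a sample keeps it fast. The sketch below illustrates the idea with scikit-learn and `itertools`. It is not the implementation of `fast_numeric_feature_selection`: the function name `mi_then_exhaustive`, the logistic-regression scorer, and 3-fold cross-validated accuracy are assumptions, and the data is synthetic.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def mi_then_exhaustive(X, y, names, top_k_mi=8, min_k=3, max_k=5, seed=0):
    """Hypothetical two-stage selector: MI ranking, then brute-force subset search."""
    # Stage 1: keep the top_k_mi features with the highest mutual information
    mi = mutual_info_classif(X, y, random_state=seed)
    top_idx = np.argsort(mi)[::-1][:top_k_mi]

    # Stage 2: score every subset of size min_k..max_k by cross-validated accuracy
    model = LogisticRegression(max_iter=1000)
    best_score, best_subset = -np.inf, None
    for k in range(min_k, max_k + 1):
        for subset in combinations(top_idx, k):
            score = cross_val_score(model, X[:, list(subset)], y, cv=3).mean()
            if score > best_score:
                best_score, best_subset = score, subset

    top_names = [names[i] for i in top_idx]
    best_names = [names[i] for i in best_subset]
    return top_names, best_names, best_score


# Synthetic classification data (assumption: stands in for the sampled training rows)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
names = [f"f{i}" for i in range(10)]

# Smaller settings than the notebook call so the demo finishes quickly
top, best, score = mi_then_exhaustive(X, y, names, top_k_mi=5, min_k=2, max_k=3)
print("Top MI features:", top)
print("Best subset:", best)
```

The MI filter bounds the cost of the second stage, since the number of subsets grows combinatorially with the number of candidate features.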
