## 🛡️ TII-SSRC-23 Dataset – Overview

The **TII-SSRC-23 Dataset** is a modern cybersecurity dataset designed for **AI-based Intrusion Detection and Prevention Systems (IDS/IPS)**. It contains labeled network traffic data representing both **benign activity** and **multiple types of cyberattacks**.

Each flow carries three label columns, `Label`, `Traffic Type` (8 classes), and `Traffic Subtype` (32 classes), so the dataset supports **binary and multi-class classification**, **anomaly detection**, and **machine learning/deep learning experiments** in network security. It reflects **realistic and recent attack behaviors**, which makes it a good fit for evaluating intelligent IDS/IPS models.


### ***Environment Setup***

In [2]:
# Cell 1: Import necessary libraries
# -------------------------------------------------
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import os


### ***Data Loading and Initial Overview***

In [3]:
import pandas as pd
print('Loading dataset... This may take some time. Please monitor the RAM usage at the top right corner.')
df = pd.read_csv('/kaggle/input/tii-ssrc-23/csv/data.csv')

Loading dataset... This may take some time. Please monitor the RAM usage at the top right corner.
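Because the full CSV is large enough to strain RAM, one option is to skip the identifier columns at parse time (the cleaning step drops them anyway) and read the file in chunks. The sketch below is an assumption about how that could look, not what this notebook runs: `load_without_identifiers` and the `chunksize` value are hypothetical, and a tiny in-memory CSV with made-up rows stands in for `data.csv`.

```python
import io
import pandas as pd

# Identifier columns that the cleaning step removes later on
DROP_COLS = {"Flow ID", "Src IP", "Src Port", "Dst IP", "Dst Port", "Timestamp"}

def load_without_identifiers(source, chunksize=100_000):
    """Read a CSV in chunks, skipping the identifier columns while parsing."""
    chunks = pd.read_csv(
        source,
        usecols=lambda name: name not in DROP_COLS,  # callable filter on column names
        chunksize=chunksize,                          # returns an iterator of DataFrames
    )
    return pd.concat(list(chunks), ignore_index=True)

# Stand-in for data.csv (hypothetical rows)
sample_csv = io.StringIO(
    "Flow ID,Src IP,Flow Duration,Label\n"
    "a-b-1-2-6,192.168.1.90,52601173.0,Benign\n"
    "b-a-2-1-6,192.168.1.3,5589.0,Benign\n"
    "a-b-3-2-6,192.168.1.90,118166562.0,Benign\n"
)
df_small = load_without_identifiers(sample_csv, chunksize=2)
print(df_small.columns.tolist())
```

Most of the saving here comes from never materializing the string-typed identifier columns; chunking mainly limits how much the parser holds at once.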


In [3]:
df.head()

| | Flow ID | Src IP | Src Port | Dst IP | Dst Port | Protocol | Timestamp | Flow Duration | Total Fwd Packet | Total Bwd packets | ... | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | Label | Traffic Type | Traffic Subtype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 192.168.1.90-192.168.1.3-53930-64738-6 | 192.168.1.90 | 53930.0 | 192.168.1.3 | 64738 | 6.0 | 01/01/1970 07:41:46 AM | 52601173.0 | 1701.0 | 1793.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Benign | Audio | Audio |
| 1 | 192.168.1.3-192.168.1.90-64738-37700-6 | 192.168.1.3 | 64738.0 | 192.168.1.90 | 37700 | 6.0 | 01/01/1970 07:41:46 AM | 119106942.0 | 36.0 | 57.0 | ... | 3416174.0 | 19996926.0 | 14078617.0 | 5001511.0 | 1737.400069 | 5003516.0 | 5000449.0 | Benign | Audio | Audio |
| 2 | 192.168.1.3-192.168.1.90-22-40854-6 | 192.168.1.3 | 22.0 | 192.168.1.90 | 40854 | 6.0 | 01/01/1970 07:41:46 AM | 5589.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Benign | Audio | Audio |
| 3 | 192.168.1.70-192.168.1.3-55422-64738-6 | 192.168.1.70 | 55422.0 | 192.168.1.3 | 64738 | 6.0 | 01/01/1970 07:41:47 AM | 118166562.0 | 3932.0 | 4196.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Benign | Audio | Audio |
| 4 | 192.168.1.90-192.168.1.3-59658-64738-17 | 192.168.1.90 | 59658.0 | 192.168.1.3 | 64738 | 17.0 | 01/01/1970 07:41:50 AM | 119988385.0 | 25.0 | 6795.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Benign | Audio | Audio |


In [4]:
df.columns

Index(['Flow ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Protocol',
       'Timestamp', 'Flow Duration', 'Total Fwd Packet', 'Total Bwd packets',
       'Total Length of Fwd Packet', 'Total Length of Bwd Packet',
       'Fwd Packet Length Max', 'Fwd Packet Length Min',
       'Fwd Packet Length Mean', 'Fwd Packet Length Std',
       'Bwd Packet Length Max', 'Bwd Packet Length Min',
       'Bwd Packet Length Mean', 'Bwd Packet Length Std', 'Flow Bytes/s',
       'Flow Packets/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max',
       'Flow IAT Min', 'Fwd IAT Total', 'Fwd IAT Mean', 'Fwd IAT Std',
       'Fwd IAT Max', 'Fwd IAT Min', 'Bwd IAT Total', 'Bwd IAT Mean',
       'Bwd IAT Std', 'Bwd IAT Max', 'Bwd IAT Min', 'Fwd PSH Flags',
       'Bwd PSH Flags', 'Fwd URG Flags', 'Bwd URG Flags', 'Fwd Header Length',
       'Bwd Header Length', 'Fwd Packets/s', 'Bwd Packets/s',
       'Packet Length Min', 'Packet Length Max', 'Packet Length Mean',
       'Packet Length Std', 'Packet Len

### ***Data Cleaning***

In [4]:
# Cell 3: A. Data Cleaning
# ============================================================================
def clean_data(df):
    """
    Clean data according to article methodology
    """
    print("🧹 Step 1: Data Cleaning")
    # Create a copy
    df_clean = df.copy()
    # 1. Remove columns specified in the article
    columns_to_remove = [
        'Flow ID',      # Unique identifier - carries no discriminative information
        'Src IP',       # Source IP address - too specific, causes overfitting
        'Src Port',     # Source port - not relevant for general classification
        'Dst IP',       # Destination IP address - too specific
        'Dst Port',     # Destination port - not relevant
        'Timestamp',    # Timestamp - not relevant for classification
    ]
    
    # Remove only existing columns
    existing_cols = [col for col in columns_to_remove if col in df_clean.columns]
    if existing_cols:
        df_clean = df_clean.drop(columns=existing_cols)
        print(f"   ✓ Removed columns: {existing_cols}")
    
    # 2. Check for missing values
    missing_values = df_clean.isnull().sum().sum()
    if missing_values > 0:
        print(f"   ⚠️  Missing values detected: {missing_values}")
        # Show columns with missing values
        missing_cols = df_clean.columns[df_clean.isnull().any()].tolist()
        print(f"   Columns with missing values: {missing_cols[:10]}")
    else:
        print("   ✓ No missing values detected")
    
    # 3. Remove duplicates
    initial_rows = len(df_clean)
    df_clean = df_clean.drop_duplicates()
    duplicates_removed = initial_rows - len(df_clean)
    if duplicates_removed > 0:
        print(f"   ✓ Duplicates removed: {duplicates_removed}")
    
    # 4. Data type information
    print("\n   📋 Data types:")
    dtypes_summary = df_clean.dtypes.value_counts()
    for dtype, count in dtypes_summary.items():
        print(f"   {dtype}: {count} columns")
    
    return df_clean

# Apply cleaning
df_clean = clean_data(df)
print(f"\n✅ Cleaning completed. New shape: {df_clean.shape}")

🧹 Step 1: Data Cleaning
   ✓ Removed columns: ['Flow ID', 'Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Timestamp']
   ✓ No missing values detected
   ✓ Duplicates removed: 1404612

   📋 Data types:
   float64: 77 columns
   object: 3 columns

✅ Cleaning completed. New shape: (7252155, 80)
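The large duplicate count is plausible given the order of operations: once the identifier columns are gone, flows that differed only in `Flow ID`, addresses, ports, or timestamp collapse into identical rows. A toy illustration with made-up values (the frame below is hypothetical, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "Flow ID": ["f1", "f2", "f3"],
    "Flow Duration": [5589.0, 5589.0, 120.0],
    "Total Fwd Packet": [1.0, 1.0, 4.0],
    "Label": ["Benign", "Benign", "Benign"],
})

# With Flow ID present every row is unique
before = int(toy.duplicated().sum())

# Without Flow ID, rows f1 and f2 become identical
after = int(toy.drop(columns=["Flow ID"]).duplicated().sum())

print(before, after)
```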


In [13]:
df_clean.columns

Index(['Protocol', 'Flow Duration', 'Total Fwd Packet', 'Total Bwd packets',
       'Total Length of Fwd Packet', 'Total Length of Bwd Packet',
       'Fwd Packet Length Max', 'Fwd Packet Length Min',
       'Fwd Packet Length Mean', 'Fwd Packet Length Std',
       'Bwd Packet Length Max', 'Bwd Packet Length Min',
       'Bwd Packet Length Mean', 'Bwd Packet Length Std', 'Flow Bytes/s',
       'Flow Packets/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max',
       'Flow IAT Min', 'Fwd IAT Total', 'Fwd IAT Mean', 'Fwd IAT Std',
       'Fwd IAT Max', 'Fwd IAT Min', 'Bwd IAT Total', 'Bwd IAT Mean',
       'Bwd IAT Std', 'Bwd IAT Max', 'Bwd IAT Min', 'Fwd PSH Flags',
       'Bwd PSH Flags', 'Fwd URG Flags', 'Bwd URG Flags', 'Fwd Header Length',
       'Bwd Header Length', 'Fwd Packets/s', 'Bwd Packets/s',
       'Packet Length Min', 'Packet Length Max', 'Packet Length Mean',
       'Packet Length Std', 'Packet Length Variance', 'FIN Flag Count',
       'SYN Flag Count', 'RST Flag Count',

### ***Data Split***

In [5]:
FEATURES = [
    col for col in df_clean.columns
    if col not in ['Label', 'Traffic Type', 'Traffic Subtype']
]

X = df_clean[FEATURES]


In [6]:
def stratified_split(X, y, test_size=0.15, val_size=0.15, random_state=42):
    X_tmp, X_test, y_tmp, y_test = train_test_split(
        X, y,
        test_size=test_size,
        stratify=y,
        random_state=random_state
    )

    val_ratio = val_size / (1 - test_size)

    X_train, X_val, y_train, y_val = train_test_split(
        X_tmp, y_tmp,
        test_size=val_ratio,
        stratify=y_tmp,
        random_state=random_state
    )

    return X_train, X_val, X_test, y_train, y_val, y_test
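The second split is applied to the 85% that remains after the test set is carved out, so `val_size` has to be rescaled to keep the validation set at 15% of the full data. A quick arithmetic check with the default arguments, using only the standard library:

```python
test_size, val_size = 0.15, 0.15

remaining = 1 - test_size            # share left after removing the test set
val_ratio = val_size / remaining     # share of the remainder used for validation

val_share = remaining * val_ratio    # validation share of the full dataset
train_share = remaining - val_share  # what is left for training

print(f"val_ratio={val_ratio:.4f}, train={train_share:.2f}, "
      f"val={val_share:.2f}, test={test_size:.2f}")
# val_ratio=0.1765, train=0.70, val=0.15, test=0.15
```

With the defaults, the function therefore yields a 70/15/15 train/validation/test split.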


In [7]:
def save_dataset(base_path, X_train, X_val, X_test, y_train, y_val, y_test):
    os.makedirs(base_path, exist_ok=True)

    X_train.to_csv(f"{base_path}/X_train.csv", index=False)
    X_val.to_csv(f"{base_path}/X_val.csv", index=False)
    X_test.to_csv(f"{base_path}/X_test.csv", index=False)

    y_train.to_csv(f"{base_path}/y_train.csv", index=False)
    y_val.to_csv(f"{base_path}/y_val.csv", index=False)
    y_test.to_csv(f"{base_path}/y_test.csv", index=False)
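Since the targets are written with `index=False`, each `y_*.csv` holds a single column whose header is the label name, and `pd.read_csv` returns it as a one-column DataFrame. A sketch of a round trip through a temporary folder follows; the tiny frames and the `reload_split` helper are hypothetical and only illustrate the file layout.

```python
import os
import tempfile
import pandas as pd

def reload_split(base_path, name):
    """Load X_<name>.csv and y_<name>.csv; return features and a target Series."""
    X = pd.read_csv(os.path.join(base_path, f"X_{name}.csv"))
    # The target file has one column, so take it as a Series
    y = pd.read_csv(os.path.join(base_path, f"y_{name}.csv")).iloc[:, 0]
    return X, y

with tempfile.TemporaryDirectory() as tmp:
    # Made-up stand-ins for a saved training split
    X_demo = pd.DataFrame({"Flow Duration": [5589.0, 120.0]})
    y_demo = pd.Series(["Benign", "Benign"], name="Label")
    X_demo.to_csv(os.path.join(tmp, "X_train.csv"), index=False)
    y_demo.to_csv(os.path.join(tmp, "y_train.csv"), index=False)

    X_back, y_back = reload_split(tmp, "train")

print(X_back.shape, y_back.name, y_back.tolist())
```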


#### ***Dataset BINARY***

In [17]:
y_binary = df_clean['Label']

Xb_train, Xb_val, Xb_test, yb_train, yb_val, yb_test = stratified_split(
    X, y_binary
)

save_dataset(
    "dataset/dataset_binary",
    Xb_train, Xb_val, Xb_test,
    yb_train, yb_val, yb_test
)

#### ***Dataset TYPE (8 classes)***

In [9]:
y_type = df_clean['Traffic Type']

Xt_train, Xt_val, Xt_test, yt_train, yt_val, yt_test = stratified_split(
    X, y_type
)

save_dataset(
    "dataset/dataset_type",
    Xt_train, Xt_val, Xt_test,
    yt_train, yt_val, yt_test
)


#### ***Dataset SUBTYPE (32 classes)***

In [8]:
y_subtype = df_clean['Traffic Subtype']

Xs_train, Xs_val, Xs_test, ys_train, ys_val, ys_test = stratified_split(
    X, y_subtype
)
save_dataset(
    "dataset/dataset_subtype",
    Xs_train, Xs_val, Xs_test,
    ys_train, ys_val, ys_test
)

In [9]:
import shutil

shutil.make_archive(
    "dataset_subtype",          # archive name (without .zip)
    "zip",
    "dataset/dataset_subtype"   # folder to compress
)

'/kaggle/working/dataset_subtype.zip'
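Only the subtype folder is archived above. If the binary and type splits should also be downloadable, the same `shutil.make_archive` call could be looped over the three folders. The sketch below builds throwaway folders in a temporary directory so it runs anywhere; the file contents are placeholders, not real splits.

```python
import os
import shutil
import tempfile
import zipfile

names = ["dataset_binary", "dataset_type", "dataset_subtype"]
archives = []

with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        folder = os.path.join(tmp, "dataset", name)
        os.makedirs(folder, exist_ok=True)
        # Placeholder file standing in for the saved CSV splits
        with open(os.path.join(folder, "X_train.csv"), "w") as fh:
            fh.write("Flow Duration\n5589.0\n")

        # base_name carries no extension; make_archive appends ".zip"
        zip_path = shutil.make_archive(os.path.join(tmp, name), "zip", folder)
        with zipfile.ZipFile(zip_path) as zf:
            archives.append((os.path.basename(zip_path), zf.namelist()))

print([zip_name for zip_name, _ in archives])
```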