## Data Loading, Sampling, and Splitting (UNSW-NB15)

This cell prepares the UNSW-NB15 dataset for modeling. It loads the data, assigns column names, draws a stratified subset of manageable size, and splits that subset into training, validation, and test sets that preserve the original class distribution.

1.  **Import Libraries:**
    * `pandas` is imported for data manipulation and reading CSV files.
    * `train_test_split` from `sklearn.model_selection` is used for both the stratified sampling and the stratified splits.

2.  **Define Column Names:**
    * A list `column_names` is created containing all 49 column names for the UNSW-NB15 dataset: 47 features plus the two target labels (`attack_cat`, `Label`).

3.  **Load Dataset:**
    * The dataset is loaded from `/kaggle/input/1-dataset/UNSW-NB15_1.csv` using `pd.read_csv`.
    * `header=None` is specified because the original CSV file does not contain a header row.
    * `names=column_names` assigns the predefined column names to the DataFrame.
    * The number of rows and columns is printed for verification. The run shown here reports 700,001 rows and 49 columns.

4.  **Stratified Sampling:**
    * A smaller sample of 240,000 rows is drawn from the full dataset.
    * `train_test_split` is used for this sampling step (even though it is typically used for train/test splitting) by specifying `train_size=240000`. The complementary part of the split is discarded.
    * `stratify=df['Label']` ensures that the proportion of normal (`Label=0`) and attack (`Label=1`) instances in the 240,000-row sample matches that of the full 700,001-row dataset. This matters because attacks make up only about 3.2% of the rows, so an unstratified sample could under-represent them.
    * `random_state=42` ensures reproducibility of the sampling process.

5.  **Train/Validation/Test Split:**
    * The 240,000-row sample (`sampled_df`) is split into:
        * A combined training and validation set (`train_val_df`) of 200,000 rows.
        * A test set (`test_df`) of 40,000 rows.
        * This split is also stratified by `Label` using `stratify=sampled_df['Label']` and uses `random_state=42`.
    * The 200,000-row `train_val_df` is further split into:
        * A training set (`train_df`) of 160,000 rows.
        * A validation set (`val_df`) of 40,000 rows.
        * This split is again stratified by `Label` using `stratify=train_val_df['Label']` and uses `random_state=42`.

6.  **Verification:**
    * The row counts of `train_df`, `val_df`, and `test_df` are printed after each split to confirm the sizes (160k, 40k, and 40k respectively).
    * The normalized distribution (`value_counts(normalize=True)`) of the `Label` column is printed for each of the three sets to verify that stratification kept the class proportions nearly identical across the splits (about 96.8% normal and 3.2% attack in each).

7.  **Save Splits:**
    * The resulting `train_df`, `val_df`, and `test_df` DataFrames are saved as CSV files (`train_set.csv`, `val_set.csv`, `test_set.csv`) in the `/kaggle/working/` directory.
    * `index=False` prevents pandas from writing the DataFrame index as a column in the CSV files.
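
The effect of `index=False` can be illustrated with a small round trip. The sketch below is not part of the original cell: the toy DataFrame and the helper `to_csv_text` are hypothetical, and the CSV is written to an in-memory buffer rather than to `/kaggle/working/`.

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, as it would look after a shuffled split.
df = pd.DataFrame({"dur": [0.1, 0.2, 0.3], "Label": [0, 1, 0]}, index=[7, 2, 5])

def to_csv_text(frame, **kwargs):
    """Write a DataFrame to an in-memory CSV and return the text (hypothetical helper)."""
    buf = io.StringIO()
    frame.to_csv(buf, **kwargs)
    return buf.getvalue()

# Default behaviour writes the index as an unnamed first column.
with_index = pd.read_csv(io.StringIO(to_csv_text(df)))
# index=False writes only the data columns.
without_index = pd.read_csv(io.StringIO(to_csv_text(df, index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

When the file is read back, the version saved with the index gains an extra unnamed column, which would otherwise have to be dropped before training.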

This process results in three distinct, stratified datasets ready for model training, validation, and testing.
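
The sample-then-split pattern described above can be tried on synthetic data before running it on the full file. The following is a minimal sketch under stated assumptions: the toy DataFrame, its 5% positive rate, the scaled-down sizes, and the helper name `stratified_three_way` are all illustrative and do not come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 5% positives (assumed values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "Label": np.repeat([0, 1], [9_500, 500]),
})

def stratified_three_way(frame, n_sample, n_train, n_val, label="Label", seed=42):
    """Draw n_sample rows, then split them into train/val/test, stratifying every step."""
    # Step 1: stratified sample; the remainder is discarded.
    sample, _ = train_test_split(
        frame, train_size=n_sample, stratify=frame[label], random_state=seed
    )
    # Step 2: carve off the test set (whatever is left after train + val).
    train_val, test = train_test_split(
        sample, train_size=n_train + n_val, stratify=sample[label], random_state=seed
    )
    # Step 3: separate training rows from validation rows.
    train, val = train_test_split(
        train_val, train_size=n_train, stratify=train_val[label], random_state=seed
    )
    return train, val, test

train, val, test = stratified_three_way(toy, n_sample=2_400, n_train=1_600, n_val=400)
for name, part in [("train", train), ("val", val), ("test", test)]:
    # Each split should show a positive rate close to the 5% of the toy data.
    print(name, len(part), round(part["Label"].mean(), 4))
```

Because each call receives a DataFrame and returns row subsets of it, the three resulting sets keep their original index values and do not share any rows.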

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Define the column names
column_names = [
    'srcip', 'sport', 'dstip', 'dsport', 'proto', 'state', 'dur', 'sbytes', 'dbytes', 
    'sttl', 'dttl', 'sloss', 'dloss', 'service', 'Sload', 'Dload', 'Spkts', 'Dpkts', 
    'swin', 'dwin', 'stcpb', 'dtcpb', 'smeansz', 'dmeansz', 'trans_depth', 'res_bdy_len', 
    'Sjit', 'Djit', 'Stime', 'Ltime', 'Sintpkt', 'Dintpkt', 'tcprtt', 'synack', 'ackdat', 
    'is_sm_ips_ports', 'ct_state_ttl', 'ct_flw_http_mthd', 'is_ftp_login', 'ct_ftp_cmd', 
    'ct_srv_src', 'ct_srv_dst', 'ct_dst_ltm', 'ct_src_ltm', 'ct_src_dport_ltm', 
    'ct_dst_sport_ltm', 'ct_dst_src_ltm', 'attack_cat', 'Label'
]

# Load the dataset with column names
df = pd.read_csv('/kaggle/input/1-dataset/UNSW-NB15_1.csv', names=column_names, header=None)
print(f"Total rows: {df.shape[0]}")  # 700,001 in the run shown below
print(f"Columns: {df.shape[1]}")     # Should be 49
print("Column names:", df.columns.tolist())

# Step 1: Sample 240,000 rows with stratification based on 'Label'
sampled_df, _ = train_test_split(df, train_size=240000, stratify=df['Label'], random_state=42)
print(f"Sampled rows: {sampled_df.shape[0]}")  # Should be 240,000

# Step 2: Split into train+val (200,000) and test (40,000)
train_val_df, test_df = train_test_split(sampled_df, train_size=200000, stratify=sampled_df['Label'], random_state=42)
print(f"Train+Val rows: {train_val_df.shape[0]}, Test rows: {test_df.shape[0]}")

# Step 3: Split train+val into train (160,000) and val (40,000)
train_df, val_df = train_test_split(train_val_df, train_size=160000, stratify=train_val_df['Label'], random_state=42)
print(f"Training rows: {train_df.shape[0]}, Validation rows: {val_df.shape[0]}")

# Verify label distribution
print("Training label distribution:\n", train_df['Label'].value_counts(normalize=True))
print("Validation label distribution:\n", val_df['Label'].value_counts(normalize=True))
print("Testing label distribution:\n", test_df['Label'].value_counts(normalize=True))

# Save the splits to CSV files
train_df.to_csv('/kaggle/working/train_set.csv', index=False)
val_df.to_csv('/kaggle/working/val_set.csv', index=False)
test_df.to_csv('/kaggle/working/test_set.csv', index=False)
print("Datasets saved: train_set.csv, val_set.csv, test_set.csv")

Total rows: 700001
Columns: 49
Column names: ['srcip', 'sport', 'dstip', 'dsport', 'proto', 'state', 'dur', 'sbytes', 'dbytes', 'sttl', 'dttl', 'sloss', 'dloss', 'service', 'Sload', 'Dload', 'Spkts', 'Dpkts', 'swin', 'dwin', 'stcpb', 'dtcpb', 'smeansz', 'dmeansz', 'trans_depth', 'res_bdy_len', 'Sjit', 'Djit', 'Stime', 'Ltime', 'Sintpkt', 'Dintpkt', 'tcprtt', 'synack', 'ackdat', 'is_sm_ips_ports', 'ct_state_ttl', 'ct_flw_http_mthd', 'is_ftp_login', 'ct_ftp_cmd', 'ct_srv_src', 'ct_srv_dst', 'ct_dst_ltm', 'ct_src_ltm', 'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm', 'attack_cat', 'Label']
Sampled rows: 240000
Train+Val rows: 200000, Test rows: 40000
Training rows: 160000, Validation rows: 40000
Training label distribution:
 Label
0    0.968263
1    0.031738
Name: proportion, dtype: float64
Validation label distribution:
 Label
0    0.96825
1    0.03175
Name: proportion, dtype: float64
Testing label distribution:
 Label
0    0.968275
1    0.031725
Name: proportion, dtype: float64
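
As a sanity check on the output above, the printed proportions can be converted back into approximate attack counts per split. This sketch assumes the printed values are rounded to six decimals, so each product is rounded to the nearest whole row. The dictionary `splits` is a hand-copied summary of the output, not something the notebook produces.

```python
# Proportion of Label == 1 copied from the printed output, with each split's row count.
splits = {
    "train": (160_000, 0.031738),
    "val": (40_000, 0.03175),
    "test": (40_000, 0.031725),
}

# Printed proportions are rounded, so round each product to the nearest whole row.
attack_counts = {name: round(n * p) for name, (n, p) in splits.items()}
total_rows = sum(n for n, _ in splits.values())
total_attacks = sum(attack_counts.values())

print(attack_counts)
print(total_attacks, total_rows, round(total_attacks / total_rows, 6))
```

Under that assumption the splits contain roughly 5,078, 1,270, and 1,269 attack rows, which is about 3.17% of the 240,000 sampled rows and consistent with the stratification having worked.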