11/28/2025 - Nafisa - Load all raw data files
**Purpose:**  
Load all CSV, JSON, and Parquet files from the fall2025_L directory, merge them into a single DataFrame, clean labels, and save the combined dataset for downstream processing.
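
A minimal sketch of the load-and-merge step described above, not the actual pipeline code. The `READERS` mapping and the `load_and_merge` name are hypothetical, and `lines=True` assumes the JSON files are newline-delimited.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader.
# lines=True assumes the JSON files are newline-delimited records.
READERS = {
    ".csv": pd.read_csv,
    ".json": lambda p: pd.read_json(p, lines=True),
    ".parquet": pd.read_parquet,
}


def load_and_merge(raw_dir):
    """Read every supported file in raw_dir and stack them into one DataFrame."""
    frames = []
    for path in sorted(Path(raw_dir).iterdir()):
        reader = READERS.get(path.suffix.lower())
        if reader is None:
            continue  # skip files with unsupported extensions
        frames.append(reader(path))
    if not frames:
        raise ValueError(f"no supported files found in {raw_dir}")
    # ignore_index=True gives the merged frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Calling `load_and_merge("../data/raw/fall2025_L")` from the notebook folder would then return the merged frame, assuming a Parquet engine such as `pyarrow` is installed.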


**Interpretation / Findings:** 
- Five unique labels found: BENIGN, DoS Hulk, DoS GoldenEye, DoS Slowhttptest, Heartbleed.
- Heartbleed (11 rows) removed as required.
- Created a binary target column `Attack`: 1 for attacks, 0 for benign.
- Final dataset size after cleaning:
    - Attack = 1: 56,112 rows
    - Attack = 0: 5,005 rows
- Output saved to: data/interm/combined_raw.csv
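
The cleaning steps listed above can be sketched as follows. This is an illustration under assumptions, not the code that produced the numbers: `clean_labels` is a hypothetical helper, the label column is assumed to be named `' Label'` (with a leading space), and the demo frame is made up.

```python
import pandas as pd


def clean_labels(df, label_col=" Label", drop_labels=("Heartbleed",)):
    """Strip label whitespace, drop excluded classes, and add a binary Attack column."""
    out = df.copy()
    # Normalise label text so stray spaces do not create extra classes
    out[label_col] = out[label_col].astype(str).str.strip()
    # Remove the classes that should not reach the model
    out = out[~out[label_col].isin(drop_labels)].reset_index(drop=True)
    # Everything that is not BENIGN counts as an attack
    out["Attack"] = (out[label_col] != "BENIGN").astype(int)
    return out


# Tiny made-up frame for illustration only; not project data.
demo = pd.DataFrame({" Label": ["BENIGN", "DoS Hulk", "Heartbleed", " DoS GoldenEye"]})
cleaned = clean_labels(demo)
print(cleaned)
```

Saving the result with `cleaned.to_csv(..., index=False)` avoids writing the row index as an extra column.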

**Notes for Team:**
The combined dataset is now ready for feature engineering and model training. Before proceeding, please review the cleaned labels and the binary target column for consistency.
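
One possible way to run that consistency review is sketched below. `check_target_consistency` is a hypothetical helper, the column names `' Label'` and `Attack` are assumed from the notes above, and the demo frame is made up.

```python
import pandas as pd


def check_target_consistency(df, label_col=" Label", target_col="Attack"):
    """Return the rows where the binary target disagrees with the label."""
    # Expected target: 0 for BENIGN, 1 for every other label
    expected = (df[label_col].str.strip() != "BENIGN").astype(int)
    return df[df[target_col] != expected]


# Made-up frame with one deliberate mismatch in the last row.
demo = pd.DataFrame(
    {" Label": ["BENIGN", "DoS Hulk", "DoS Hulk"], "Attack": [0, 1, 0]}
)
mismatches = check_target_consistency(demo)
print(mismatches)
```

An empty result means every row's `Attack` value agrees with its label.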

### Done with Task

In [3]:
import pandas as pd

# read in the dataset 
df = pd.read_csv('../data/interm/combined_raw.csv')

# Count the rows for each label (the column name has a leading space)
label_counts = df[' Label'].value_counts()
print("Label Counts:")
print(label_counts)
print()

# Count the total number of DoS attacks (every label except BENIGN)
total_dos = label_counts[label_counts.index != 'BENIGN'].sum()
print(f"Total DOS attacks: {total_dos}")
print(f"Total BENIGN: {label_counts['BENIGN']}")
print(f"Total records: {len(df)}")

Label Counts:
 Label
DoS Hulk            30027
DoS GoldenEye       20586
DoS Slowhttptest     5499
BENIGN               5005
Name: count, dtype: int64

Total DOS attacks: 56112
Total BENIGN: 5005
Total records: 61117


### Label Distribution

| Label | Count |
|-------|-------|
| DoS Hulk | 30,027 |
| DoS GoldenEye | 20,586 |
| DoS Slowhttptest | 5,499 |
| BENIGN | 5,005 |

**Summary:**
- Total DoS attacks: 56,112
- Total BENIGN: 5,005
- Total records: 61,117

DONE WITH TASK 

11/28/2025 **Moosa - Raw Data Overview** 

This section documents all raw data files stored in `data/raw/fall2025_L/`. Each file is inspected to determine its format, loaded once to check its shape, and briefly described.

The goals of this task are:
- to understand the structure and size of each raw dataset,
- to confirm that all files are readable,
- and to provide clear documentation for the ETL pipeline before cleaning and processing.

Below, I load each file programmatically and print its `(rows, columns)` shape.  
After that, I summarize the files in markdown.


In [3]:
import os
import pandas as pd

path = "../data/raw/fall2025_L"

print("Looking in:", os.path.abspath(path), "\n")

for f in sorted(os.listdir(path)):
    full = os.path.join(path, f)

    if f.endswith(".csv"):
        df = pd.read_csv(full)
    elif f.endswith(".json"):
        df = pd.read_json(full, lines=True)
    elif f.endswith(".parquet"):
        df = pd.read_parquet(full)
    else:
        continue

    print(f"{f} → {df.shape}")


Looking in: d:\ML Assignment\Project\classifier-DoS-project\data\raw\fall2025_L 

ids_0.csv → (1001, 79)
ids_1.csv → (1001, 79)
ids_10.json → (5510, 79)
ids_11.parquet → (1025, 79)
ids_2.csv → (1001, 79)
ids_3.json → (1001, 79)
ids_4.json → (1001, 79)
ids_5.parquet → (15001, 79)
ids_6.parquet → (5001, 79)
ids_7.json → (9000, 79)
ids_8.parquet → (10293, 79)
ids_9.json → (10293, 79)


## Summary
- Total raw files: **12**
- Formats used: **CSV (3), JSON (5), Parquet (4)**
- All files load successfully, each with **79 columns**
- Only row counts vary between files
- These files are the direct inputs to `etl/load_files.py` before merging and cleaning
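
Equal column counts do not by themselves guarantee identical column names, so a quick name comparison may be worth running before the merge. The sketch below is one way to do it; `column_mismatches` is a hypothetical helper and the demo frames are made up.

```python
import pandas as pd


def column_mismatches(frames):
    """Compare each frame's column names to the first frame's.

    frames: dict mapping a file name to its DataFrame.
    Returns {file name: (missing columns, extra columns)} for files that differ.
    """
    names = list(frames)
    reference = set(frames[names[0]].columns)
    report = {}
    for name in names[1:]:
        cols = set(frames[name].columns)
        if cols != reference:
            report[name] = (sorted(reference - cols), sorted(cols - reference))
    return report


# Made-up frames for illustration: b.json has the same column count
# as a.csv but a different column name.
demo_frames = {
    "a.csv": pd.DataFrame(columns=["x", "y"]),
    "b.json": pd.DataFrame(columns=["x", "z"]),
    "c.parquet": pd.DataFrame(columns=["y", "x"]),
}
report = column_mismatches(demo_frames)
print(report)
```

An empty report means every file has the same column names as the first one; column order is ignored because `pd.concat` aligns columns by name.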

Task Completed.