# Sentinel-SLM: Comprehensive Exploratory Data Analysis (EDA)

**Objective:** Analyze the distribution, balance, and linguistic properties of the aggregated Sentinel-SLM dataset (1.6M+ samples) across 8 safety categories.

**Dataset:** `data/processed/final_augmented_dataset.parquet`

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Set Style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = [12, 6]

# Load Data
DATA_PATH = "../data/processed/final_augmented_dataset.parquet"
try:
    df = pd.read_parquet(DATA_PATH)
    print(f"✅ Loaded {len(df):,} samples.")
except FileNotFoundError:
    # Stop here: every later cell depends on `df`.
    raise FileNotFoundError(f"❌ Dataset not found at {DATA_PATH}. Run pipeline first.") from None

## 1. Source Distribution
Where does the data come from? This cell compares the contribution of KoalaAI with that of the other sources.

In [None]:
source_counts = df['source'].value_counts()
print(source_counts)

plt.figure(figsize=(10, 5))
sns.barplot(x=source_counts.index, y=source_counts.values, palette="viridis")
plt.title("Dataset Source Distribution")
plt.ylabel("Count")
plt.yscale('log')  # Log scale due to KoalaAI dominance
plt.show()
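
A log-scale bar chart compresses large differences, so a table of relative shares can be a useful companion. The following is a minimal sketch on toy data, not part of the original pipeline: `source_shares` is a hypothetical helper, and the source names other than KoalaAI are made up for illustration.

```python
import pandas as pd

def source_shares(sources: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share per source, largest first."""
    counts = sources.value_counts()
    shares = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "share_pct": shares})

# Toy stand-in for df['source']; "other_a" and "other_b" are made-up names.
toy_sources = pd.Series(["KoalaAI"] * 8 + ["other_a", "other_b"])
summary = source_shares(toy_sources)
print(summary)
```

In the notebook, the same helper could be called as `source_shares(df['source'])`.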

## 2. Category Balance (The 8 Taxonomy Classes)
Is the dataset balanced across categories? (Spoiler: 'Safe' is usually dominant.)

In [None]:
from src.sentinel.utils.taxonomy import CATEGORY_NAMES

# Flatten the per-sample label lists (a sample with several labels counts once per label)
all_labels = [label for sublist in df['labels'] for label in sublist]
label_counts = pd.Series(all_labels).map(CATEGORY_NAMES).value_counts()

plt.figure(figsize=(12, 6))
sns.barplot(y=label_counts.index, x=label_counts.values, orient='h', palette="magma")
plt.title("Category Distribution (Log Scale)")
plt.xscale('log')
plt.xlabel("Count (Log)")
plt.show()

print("Exact Counts:")
print(label_counts)
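
Flattening the label lists hides how many labels each sample carries. If `labels` holds one list per row, as the flattening step implies, the number of labels per sample can be tallied separately. This is a sketch on toy data: `labels_per_sample` is a hypothetical helper and the integer label ids are invented for illustration.

```python
import pandas as pd

def labels_per_sample(labels: pd.Series) -> pd.Series:
    """Count how many samples carry 1, 2, 3, ... labels."""
    return labels.apply(len).value_counts().sort_index()

# Toy stand-in for df['labels']; the ids below are made up.
toy_labels = pd.Series([[0], [0], [1, 3], [2], [1, 2, 4]])
dist = labels_per_sample(toy_labels)
print(dist)
```

In the notebook this could be run as `labels_per_sample(df['labels'])` to see how common multi-label samples are.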

## 3. Class Imbalance Ratio
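Checking which categories are under-represented. Each ratio is the majority-class count divided by that category's count, so a higher value means a rarer category.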

In [None]:
majority_class = label_counts.max()
imbalance_ratios = majority_class / label_counts
print("Imbalance Ratios (1 = Majority):")
print(imbalance_ratios.sort_values(ascending=False))
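
One common way to act on these ratios is to turn the counts into inverse-frequency class weights for a loss function. The sketch below is an assumption about how the counts might be used, not something the notebook itself does. `inverse_frequency_weights` is a hypothetical helper, and the category names other than 'Safe' are placeholders. It applies the usual "balanced" heuristic, total / (number of classes × count), to the flattened label counts, so for multi-label data the total is the number of label occurrences rather than the number of samples.

```python
import pandas as pd

def inverse_frequency_weights(label_counts: pd.Series) -> pd.Series:
    """Weight each class by total / (n_classes * count); rare classes get larger weights."""
    total = label_counts.sum()
    n_classes = len(label_counts)
    return total / (n_classes * label_counts)

# Toy stand-in for label_counts; "rare_a" and "rare_b" are placeholder names.
toy_counts = pd.Series({"Safe": 900, "rare_a": 90, "rare_b": 10})
weights = inverse_frequency_weights(toy_counts)
print(weights.sort_values(ascending=False))
```

The weights have the same ordering as the imbalance ratios. The two differ only by a constant factor.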

## 4. Text Length Analysis
SLMs have limited context windows. How long are the inputs? Length is measured in characters here, which is only a rough proxy for token count.

In [None]:
df['char_length'] = df['text'].str.len()

plt.figure(figsize=(12, 5))
sns.histplot(df['char_length'], bins=100, binrange=(0, 1000), log_scale=(False, True))
plt.title("Text Length Distribution (Log Frequency)")
plt.xlabel("Characters")
plt.xlim(0, 1000)  # Zoom in on typical range; bins are restricted to match
plt.show()

print("Length Stats:")
print(df['char_length'].describe())
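
To connect the length statistics to a context budget, it can help to report upper percentiles and the share of texts exceeding a chosen limit. This is a minimal sketch on toy data. `length_coverage` is a hypothetical helper, and `max_chars=1000` simply mirrors the plot's zoom range rather than any real model limit; a tokenizer-based count would be needed for an exact answer.

```python
import pandas as pd

def length_coverage(texts: pd.Series, max_chars: int) -> dict:
    """Summarize character lengths and the share of texts longer than max_chars."""
    lengths = texts.str.len()
    return {
        "p50": float(lengths.quantile(0.50)),
        "p95": float(lengths.quantile(0.95)),
        "p99": float(lengths.quantile(0.99)),
        "share_over_budget": float((lengths > max_chars).mean()),
    }

# Toy stand-in for df['text']: four short strings and one long outlier.
toy_texts = pd.Series(["a" * n for n in (10, 20, 30, 40, 2000)])
stats = length_coverage(toy_texts, max_chars=1000)
print(stats)
```

In the notebook the call would be `length_coverage(df['text'], max_chars=...)`, with the budget chosen to match the target model.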