# Sentinel-SLM: Comprehensive Exploratory Data Analysis (EDA)

**Objective:** Analyze the distribution, balance, and linguistic properties of the aggregated Sentinel-SLM dataset (1.6M+ samples) across 8 safety categories.

**Dataset:** `data/processed/final_augmented_dataset_enriched.parquet`, falling back to the standard `data/processed/final_augmented_dataset.parquet` when the enriched file is absent.

In [None]:
import sys
import os
sys.path.append(os.path.abspath('..'))
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Set Style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = [12, 6]

# Load Data (Prefer Enriched)
ENRICHED_PATH = "../data/processed/final_augmented_dataset_enriched.parquet"
STANDARD_PATH = "../data/processed/final_augmented_dataset.parquet"

if os.path.exists(ENRICHED_PATH):
    DATA_PATH = ENRICHED_PATH
    print(f"üîπ Loading Enriched Dataset: {ENRICHED_PATH}")
else:
    DATA_PATH = STANDARD_PATH
    print(f"üî∏ Loading Standard Dataset: {STANDARD_PATH}")

try:
    df = pd.read_parquet(DATA_PATH)
    print(f"✅ Loaded {len(df):,} samples.")
except FileNotFoundError:
    print("❌ Dataset not found. Run pipeline or enrichment first.")
    raise  # Stop here: every later cell needs df

## 1. Source Distribution
Where is the data coming from? Visualizing the contribution of KoalaAI vs. others.

In [None]:
source_counts = df['source'].value_counts()
print(source_counts)

plt.figure(figsize=(10, 5))
sns.barplot(x=source_counts.index, y=source_counts.values, palette="viridis")
plt.title("Dataset Source Distribution")
plt.ylabel("Count")
plt.yscale('log')  # Log scale due to KoalaAI dominance
plt.show()

## 2. Category Balance (The 8 Taxonomy Classes)
Are we balanced? (Spoiler: 'Safe' is usually dominant).

In [None]:
from src.sentinel.utils.taxonomy import CATEGORY_NAMES

# Flatten labels
all_labels = [label for sublist in df['labels'] for label in sublist]
label_counts = pd.Series(all_labels).map(CATEGORY_NAMES).value_counts()

plt.figure(figsize=(12, 6))
sns.barplot(y=label_counts.index, x=label_counts.values, orient='h', palette="magma")
plt.title("Category Distribution (Log Scale)")
plt.xscale('log')
plt.xlabel("Count (Log)")
plt.show()

print("Exact Counts:")
print(label_counts)
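
Because each sample can carry several labels, the flattened counts above measure label occurrences rather than samples, so their total can exceed the number of rows. A minimal sketch on toy data illustrates this. The frame `toy` and the mapping `toy_names` are hypothetical stand-ins, assuming `labels` holds lists of integer ids and `CATEGORY_NAMES` behaves like a dict from id to name.

```python
import pandas as pd

# Hypothetical toy data: 4 samples, one of them with two labels
toy = pd.DataFrame({"labels": [[0], [1, 2], [2], [0]]})

# Hypothetical id -> name mapping (stand-in for CATEGORY_NAMES)
toy_names = {0: "safe", 1: "class_a", 2: "class_b"}

# explode() gives one row per (sample, label) pair, like the list comprehension above
toy_label_counts = toy["labels"].explode().map(toy_names).value_counts()

print(toy_label_counts)
print("rows:", len(toy), "| label occurrences:", toy_label_counts.sum())
```

On the real data, comparing `label_counts.sum()` with `len(df)` shows how much multi-label overlap there is.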

## 3. Class Imbalance Ratio
Checking which categories are under-represented.

In [None]:
majority_class = label_counts.max()
imbalance_ratios = majority_class / label_counts
print("Imbalance Ratios (1 = Majority):")
print(imbalance_ratios.sort_values(ascending=False))
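
To make the ratio easier to read, here is a small worked example on made-up counts. The helper `imbalance_ratio_table` and the numbers in `toy_counts` are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd

def imbalance_ratio_table(counts: pd.Series) -> pd.Series:
    """Return majority_count / class_count per class, largest ratio first."""
    return (counts.max() / counts).sort_values(ascending=False)

# Hypothetical counts, for illustration only
toy_counts = pd.Series({"safe": 1000, "class_a": 250, "class_b": 10})
toy_ratios = imbalance_ratio_table(toy_counts)

print(toy_ratios)
```

A ratio of 100 means the majority class has 100 samples for every sample of that class; the majority class itself always has a ratio of 1.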

## 4. Text Length Analysis
SLMs have limited context windows. How long are the inputs?

In [None]:
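# Character length per sample. Missing texts are masked back to NaN, because
# astype(str) would otherwise count NaN as the 3-character string 'nan'.
df['char_length'] = df['text'].astype(str).str.len().where(df['text'].notna())
valid_lengths = df['char_length'].dropna()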

plt.figure(figsize=(12, 5))
# Bin only the 0-1000 character range so the zoomed view keeps 100 bins of
# resolution; longer texts are still covered by describe() below.
sns.histplot(valid_lengths, bins=100, binrange=(0, 1000), kde=False)
plt.title("Text Length Distribution (Log Frequency)")
plt.xlabel("Characters")
plt.ylabel("Count (Log)")
plt.yscale('log')
plt.xlim(0, 1000)  # Zoom in on typical range
plt.show()

print("Length Stats:")
print(valid_lengths.describe())
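
One way to relate these lengths to a context window is to ask what share of texts fits under a given character budget. The sketch below uses a hypothetical helper `coverage_at` and made-up lengths. Characters are only a rough proxy for tokens, so the budgets shown are placeholders rather than actual model limits.

```python
import pandas as pd

def coverage_at(lengths: pd.Series, max_chars: int) -> float:
    """Fraction of texts whose character length is <= max_chars."""
    return float((lengths <= max_chars).mean())

# Hypothetical lengths, for illustration only
toy_lengths = pd.Series([20, 80, 150, 400, 2500])

# Placeholder budgets; swap in values that match the target tokenizer
for budget in (100, 500, 1000):
    print(f"<= {budget} chars: {coverage_at(toy_lengths, budget):.0%}")
```

The same function can be applied to `valid_lengths` to pick a truncation length that keeps most samples intact.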

## 5. Language Distribution
Analyzing multilingual spread.

In [None]:
if 'lang' in df.columns:
    print("🔹 Using pre-calculated 'lang' column.")
    lang_counts = df['lang'].value_counts().head(20)
else:
    print("🔸 'lang' column not found. Estimating on sample (slow)...")
    try:
        from langdetect import DetectorFactory, detect
        DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded

        def safe_detect(text):
            try:
                return detect(text)
            except Exception:  # e.g. empty or symbol-only text
                return "unknown"

        # Cap the sample size so small datasets do not raise in df.sample
        sample_df = df.sample(n=min(5000, len(df)), random_state=42)
        lang_counts = sample_df['text'].apply(lambda x: safe_detect(str(x)[:500])).value_counts().head(20)
    except ImportError:
        print("langdetect not found. Run 'pip install langdetect'")
        lang_counts = pd.Series(dtype="int64")

print(lang_counts)

if not lang_counts.empty:
    plt.figure(figsize=(14, 6))
    sns.barplot(x=lang_counts.index, y=lang_counts.values, palette="coolwarm")
    plt.title("Language Distribution (Top 20)")
    plt.ylabel("Count")
    plt.show()