# STAGE 2 : **Program for Analyzing the CSV Dataset**

This program analyzes the set of CSV files produced by the data extraction step in the previous program. The analysis computes the average number of frames per sequence across those CSV files, together with the minimum, maximum, median, and key percentiles. The result is used as a sequence-length limit that helps make training of the LSTM algorithm more consistent and accurate.

- To run each program block, press *Shift* + *Enter* or click the '*Play*' ▶︎ icon.
- <b>RUN EVERY PROGRAM BLOCK SEQUENTIALLY!</b>
- The content of this program can also be used as a tutorial guide.
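
To illustrate how a sequence-length limit of this kind might be applied before LSTM training, here is a minimal sketch. It assumes each sequence is a 2-D array of shape (frames, features), that longer sequences are cut at the end, and that shorter ones are zero-padded at the end. The source does not specify any of these choices, and the helper name `pad_or_truncate` and the sample arrays are hypothetical.

```python
import numpy as np

def pad_or_truncate(sequence, limit, pad_value=0.0):
    """Return `sequence` cut or padded to exactly `limit` rows (frames)."""
    sequence = np.asarray(sequence, dtype=np.float32)
    n_frames, n_features = sequence.shape
    if n_frames >= limit:
        # Assumption: drop the frames beyond the limit (truncate at the end)
        return sequence[:limit]
    # Assumption: append constant-valued padding frames at the end
    padding = np.full((limit - n_frames, n_features), pad_value, dtype=np.float32)
    return np.vstack([sequence, padding])

# Hypothetical sequences: 3 frames and 7 frames, 2 features each
short = np.ones((3, 2))
long = np.ones((7, 2))
limit = 5

# After padding/truncation all sequences share one shape and can be stacked
batch = np.stack([pad_or_truncate(short, limit), pad_or_truncate(long, limit)])
print(batch.shape)  # (2, 5, 2)
```

With a fixed limit, every sequence ends up with the same number of frames, so the whole dataset can be stacked into a single (samples, frames, features) array. A smaller limit truncates more of the long sequences, while a larger one adds more padding to the short ones.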

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from collections import defaultdict

# === CONFIG: change this if your CSV folder is different ===
csv_folder = Path("input/")

# Dictionary to store lengths per label
stats = defaultdict(list)
all_lengths = []

# Iterate through all CSV files
for csv_file in sorted(csv_folder.glob("*.csv")):  # sorted for a reproducible order
    name = csv_file.stem  # e.g., "Hello_003"
    
    # Extract label (everything before the last "_NNN")
    parts = name.split("_")
    if len(parts) < 2:
        # Filename doesn't follow the label_idx pattern: treat the whole name as the label
        label = name
    else:
        label = "_".join(parts[:-1])  # supports multi-word labels

    df = pd.read_csv(csv_file)
    seq_length = len(df)

    stats[label].append(seq_length)
    all_lengths.append(seq_length)

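# Convert to numpy array for global stats
all_lengths = np.array(all_lengths)

# Stop early with a clear message; min()/max() below fail on an empty array
if all_lengths.size == 0:
    raise FileNotFoundError(f"No CSV files found in {csv_folder.resolve()}")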

print("=== PER-LABEL STATISTICS ===\n")
for label, lengths in stats.items():
    lengths = np.array(lengths)
    print(f"Label: {label}")
    print(f"  Number of sequences  : {len(lengths)}")
    print(f"  Min sequence length  : {lengths.min()}")
    print(f"  Max sequence length  : {lengths.max()}")
    print(f"  Avg sequence length  : {lengths.mean():.2f}")
    print(f"  Median length        : {np.median(lengths):.2f}")
    print("")

print("=== GLOBAL SEQUENCE LENGTH DISTRIBUTION ===")
print(f"Total sequences        : {len(all_lengths)}")
print(f"Global min length      : {all_lengths.min()}")
print(f"Global max length      : {all_lengths.max()}")
print(f"Global mean length     : {all_lengths.mean():.2f}")
print(f"Global median length   : {np.median(all_lengths):.2f}")

# Key percentiles
p75 = np.percentile(all_lengths, 75)
p90 = np.percentile(all_lengths, 90)
p95 = np.percentile(all_lengths, 95)

print(f"75th percentile (P75)  : {p75:.2f}")
print(f"90th percentile (P90)  : {p90:.2f}")
print(f"95th percentile (P95)  : {p95:.2f}")

# === Auto-suggest padding length ===
# Heuristic: use P90 as a good trade-off, rounded up to nearest int
suggested_padding = int(np.ceil(p90))

print("\n=== SUGGESTED PADDING LENGTH ===")
print(f"Suggested padding length (≈P90): {suggested_padding}")

print("\nYou might choose:")
print(f"  - {int(np.ceil(p75))}  (P75)  → more truncation, faster training")
print(f"  - {suggested_padding} (P90)  → balanced")
print(f"  - {int(np.ceil(p95))} (P95)  → minimal truncation, more padding")
print(f"  - {int(all_lengths.max())} (max) → no truncation, lots of padding")

=== PER-LABEL STATISTICS ===

Label: 10
  Number of sequences  : 220
  Min sequence length  : 24
  Max sequence length  : 110
  Avg sequence length  : 61.66
  Median length        : 68.00

Label: 11
  Number of sequences  : 180
  Min sequence length  : 23
  Max sequence length  : 99
  Avg sequence length  : 57.01
  Median length        : 62.50

Label: 12
  Number of sequences  : 180
  Min sequence length  : 21
  Max sequence length  : 90
  Avg sequence length  : 55.55
  Median length        : 59.50

Label: 13
  Number of sequences  : 180
  Min sequence length  : 19
  Max sequence length  : 107
  Avg sequence length  : 54.57
  Median length        : 57.00

Label: 14
  Number of sequences  : 180
  Min sequence length  : 20
  Max sequence length  : 91
  Avg sequence length  : 55.92
  Median length        : 58.00

Label: 15
  Number of sequences  : 180
  Min sequence length  : 24
  Max sequence length  : 100
  Avg sequence length  : 57.29
  Median length        : 60.00

Label: 16
  Number 