# 02 — Stratified Sampling + Train/Val/Test Split

**Goals:**
- Draw a balanced subset from the original dataset (stratified sampling by label)
- Split it into train/val/test (70/15/15 or 80/10/10)
- Save the splits as lightweight CSV files for training

**Rationale:** The original dataset is large (>500k rows), so a total of 30k–80k samples is enough for training and robustness evaluation.

In [1]:
# Imports
from pathlib import Path
import pandas as pd
from sklearn.model_selection import train_test_split
import sys

# Add current folder to path for utils import
sys.path.insert(0, str(Path.cwd()))
from utils import stratified_sample_from_csv

print('pandas:', pd.__version__)

pandas: 2.3.3


## 1) Config: paths + columns + sampling size

In [2]:
# Dataset path (adjust if needed)
DATA_PATH = Path('twitter_sentiment_dataset.csv')

if not DATA_PATH.exists():
    candidates = list(Path.cwd().rglob('twitter_sentiment_dataset.csv'))
    if candidates:
        DATA_PATH = candidates[0]
    else:
        raise FileNotFoundError('Could not find twitter_sentiment_dataset.csv')

print('Dataset path:', DATA_PATH.resolve())

# Columns (set these based on your EDA in notebook 01)
TEXT_COL = 'cleaned_text'  # or 'text', adjust based on your dataset
LABEL_COL = 'sentiment'  # or 'label', adjust

# Sampling config
TARGET_PER_LABEL = 15000  # samples per label (adjust down to 5000-10000 if memory tight)
# Expected total: ~45k if 3 labels, ~30k if 2 labels

# Split ratios
TRAIN_RATIO = 0.70
VAL_RATIO = 0.15
TEST_RATIO = 0.15

RANDOM_STATE = 42

Dataset path: C:\Paper\Twitter proejct\Twitter Sentiment Analysis Dataset\Twitter Sentiment Analysis Dataset\twitter_sentiment_dataset.csv


## 2) Stratified sampling (memory-efficient)

In [3]:
# Use utils function to sample
df_sampled = stratified_sample_from_csv(
    csv_path=str(DATA_PATH),
    label_col=LABEL_COL,
    text_col=TEXT_COL,
    target_per_label=TARGET_PER_LABEL,
    chunksize=200_000,
    random_state=RANDOM_STATE
)

print('Sampled shape:', df_sampled.shape)
display(df_sampled[LABEL_COL].value_counts())

Sampled shape: (45000, 2)


sentiment
positive    15000
negative    15000
neutral     15000
Name: count, dtype: int64
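
The body of `stratified_sample_from_csv` lives in `utils` and is not shown in this notebook. As a rough illustration of how a memory-efficient per-label sample can be drawn in a single pass, here is a minimal standard-library sketch based on reservoir sampling. The function name `reservoir_sample_per_label` and its parameters are hypothetical, and the real helper may work differently (for example, by reading pandas chunks of `chunksize` rows).

```python
import csv
import io
import random
from collections import defaultdict


def reservoir_sample_per_label(rows, label_col, target_per_label, random_state=42):
    """Keep at most `target_per_label` uniformly sampled rows per label.

    Hypothetical helper for illustration; NOT the `utils` implementation.
    `rows` is any iterable of dicts, so the full file never sits in memory.
    """
    rng = random.Random(random_state)
    reservoirs = defaultdict(list)  # label -> rows currently kept
    seen = defaultdict(int)         # label -> rows encountered so far

    for row in rows:
        label = row[label_col]
        seen[label] += 1
        bucket = reservoirs[label]
        if len(bucket) < target_per_label:
            bucket.append(row)  # reservoir not full yet: always keep
        else:
            # Replace a kept row with probability target_per_label / seen[label]
            j = rng.randrange(seen[label])
            if j < target_per_label:
                bucket[j] = row

    return dict(reservoirs)


# Tiny in-memory CSV standing in for the real dataset (made-up rows)
demo_csv = io.StringIO(
    "cleaned_text,sentiment\n"
    + "".join(f"pos {i},positive\n" for i in range(50))
    + "".join(f"neg {i},negative\n" for i in range(20))
    + "".join(f"neu {i},neutral\n" for i in range(3))
)
sampled = reservoir_sample_per_label(
    csv.DictReader(demo_csv), label_col="sentiment", target_per_label=5
)
for label, kept in sampled.items():
    print(label, len(kept))
```

A label with fewer rows than the target (here `neutral`, with 3) is kept in full, so the result is balanced only when every label has at least `target_per_label` rows.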

## 3) Train/val/test split (stratified)

In [4]:
# First split: train vs temp (val+test)
df_train, df_temp = train_test_split(
    df_sampled,
    train_size=TRAIN_RATIO,
    stratify=df_sampled[LABEL_COL],
    random_state=RANDOM_STATE
)

# Second split: val vs test
val_test_ratio = VAL_RATIO / (VAL_RATIO + TEST_RATIO)  # val's share of the held-out (val+test) portion
df_val, df_test = train_test_split(
    df_temp,
    train_size=val_test_ratio,
    stratify=df_temp[LABEL_COL],
    random_state=RANDOM_STATE
)

print('Train:', df_train.shape)
print('Val  :', df_val.shape)
print('Test :', df_test.shape)

print('\nLabel distribution (train):')
display(df_train[LABEL_COL].value_counts(normalize=True))
print('\nLabel distribution (val):')
display(df_val[LABEL_COL].value_counts(normalize=True))
print('\nLabel distribution (test):')
display(df_test[LABEL_COL].value_counts(normalize=True))

Train: (31499, 2)
Val  : (6750, 2)
Test : (6751, 2)

Label distribution (train):


sentiment
negative    0.333344
neutral     0.333344
positive    0.333312
Name: proportion, dtype: float64


Label distribution (val):


sentiment
neutral     0.333333
positive    0.333333
negative    0.333333
Name: proportion, dtype: float64


Label distribution (test):


sentiment
positive    0.333432
negative    0.333284
neutral     0.333284
Name: proportion, dtype: float64
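
The train split above has 31,499 rows rather than 31,500. A likely cause is that `train_test_split` converts a float `train_size` to a row count by flooring `0.70 * 45000`, which lands slightly below 31500 in floating point. If exact sizes matter, one option is to compute integer counts up front and pass those instead, since `train_size` also accepts an absolute number of samples. A minimal sketch follows; `split_sizes` is a hypothetical helper, not part of this notebook or of scikit-learn.

```python
import math


def split_sizes(n_total, train_ratio, val_ratio, test_ratio):
    """Turn split ratios into integer row counts that sum to n_total.

    Hypothetical helper for illustration only.
    """
    if not math.isclose(train_ratio + val_ratio + test_ratio, 1.0):
        raise ValueError("split ratios must sum to 1")
    n_train = round(n_total * train_ratio)
    n_val = round(n_total * val_ratio)
    n_test = n_total - n_train - n_val  # remainder, so no row is lost
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(45000, 0.70, 0.15, 0.15)
print(n_train, n_val, n_test)

# Flooring the float product instead of rounding it drops one row
print(math.floor(0.70 * 45000))
```

The counts can then be passed as `train_size=n_train` in the first split and `train_size=n_val` in the second.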

## 4) Save to CSV

In [5]:
OUTPUT_DIR = Path('data_splits')
OUTPUT_DIR.mkdir(exist_ok=True)

df_train.to_csv(OUTPUT_DIR / 'train.csv', index=False)
df_val.to_csv(OUTPUT_DIR / 'val.csv', index=False)
df_test.to_csv(OUTPUT_DIR / 'test.csv', index=False)

print(f'✅ Saved train/val/test to {OUTPUT_DIR.resolve()}')
print('Files:')
for f in OUTPUT_DIR.glob('*.csv'):
    print(f'  - {f.name} ({f.stat().st_size / 1024:.1f} KB)')

✅ Saved train/val/test to C:\Paper\Twitter proejct\Twitter Sentiment Analysis Dataset\Twitter Sentiment Analysis Dataset\data_splits
Files:
  - test.csv (538.0 KB)
  - train.csv (2512.4 KB)
  - val.csv (536.5 KB)


## 5) Quick sanity check: sample examples

In [6]:
# Show 2 random examples per label from train
for lbl, grp in df_train.groupby(LABEL_COL):
    print(f'\n=== Label: {lbl} ===')
    display(grp[[TEXT_COL, LABEL_COL]].sample(min(2, len(grp)), random_state=RANDOM_STATE))


=== Label: negative ===


| | cleaned_text | sentiment |
|---|---|---|
| 12326 | Throughout direction base later forget down tr... | negative |
| 10566 | Least bit forget white use see. Eat throughout... | negative |



=== Label: neutral ===


| | cleaned_text | sentiment |
|---|---|---|
| 17853 | Backup systems engaged after zero-day exploit. | neutral |
| 17116 | Man-in-the-Middle breached our web_portal. | neutral |



=== Label: positive ===


| | cleaned_text | sentiment |
|---|---|---|
| 39004 | Great firewall update! | positive |
| 42979 | Incident response was flawless. Policy buy ans... | positive |
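
Beyond eyeballing examples, a common extra check is that no identical text appears in more than one split, since duplicates leaking from train into val/test can inflate evaluation scores. Here is a minimal sketch using plain Python sets. `find_overlap` is a hypothetical helper, and the toy lists are made-up stand-ins for `df_train[TEXT_COL]`, `df_val[TEXT_COL]`, and `df_test[TEXT_COL]`.

```python
from itertools import combinations


def find_overlap(splits):
    """Return {(name_a, name_b): shared_texts} for every pair of splits.

    Hypothetical helper for illustration; `splits` maps a split name to an
    iterable of text strings.
    """
    as_sets = {name: set(texts) for name, texts in splits.items()}
    return {
        (a, b): as_sets[a] & as_sets[b]
        for a, b in combinations(as_sets, 2)
    }


# Made-up toy data; in the notebook these would be the text columns
toy_splits = {
    "train": ["great update", "server down again", "patch applied"],
    "val": ["login failed", "patch applied"],
    "test": ["all systems normal"],
}
overlap = find_overlap(toy_splits)
for pair, shared in overlap.items():
    print(pair, len(shared))
```

Because `train_test_split` partitions rows, the splits never share a row. This check only catches distinct rows that happen to carry the same text.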


---
## Next steps
- Notebook 03: Train sentiment model (distilbert-multilingual)
- Notebook 04: Translate val/test EN→ES/FR + evaluate robustness
- Notebook 05: XAI + explanation consistency metrics