# Yelp Dataset Processing - Sampling + Multi-file Strategy
## Processing a large dataset without running out of RAM

**Strategy:**
- Sample 40% from each dataset
- Process in batches of 100k records
- Output: multiple small files
- RAM usage: < 2GB

**Expected output:**
- ~2.8M reviews (from ~7M)
- ~24 train files + 24 test files
- Each train file ~96k records, each test file ~24k records
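
As a rough check of these numbers, the file count can be derived from the batch settings alone. The sketch below is an illustration, not part of the pipeline: `estimate_outputs` is a hypothetical helper, and the line count 6,990,280 is the one reported for the review file later in this notebook. It ignores null and duplicate removal, so the real files come out slightly smaller.

```python
def estimate_outputs(total_records, batch_size=100_000, sample_rate=0.4,
                     records_per_file=100_000):
    """Estimate the sampled total and per-file sizes (before deduplication)."""
    full_batches, remainder = divmod(total_records, batch_size)

    # Each full batch contributes int(batch_size * sample_rate) sampled rows.
    batch_samples = [int(batch_size * sample_rate)] * full_batches
    last = int(remainder * sample_rate)
    if last > 0:
        batch_samples.append(last)

    # Accumulate sampled batches and flush a file once the threshold is reached.
    files, accumulated = [], 0
    for n in batch_samples:
        accumulated += n
        if accumulated >= records_per_file:
            files.append(accumulated)
            accumulated = 0
    if accumulated:
        files.append(accumulated)  # final, smaller file

    return sum(batch_samples), files


total, files = estimate_outputs(6_990_280)
print(total, len(files), files[0], files[-1])  # 2796112 24 120000 36112
```

With 40,000 sampled rows per batch, the 100k threshold is first crossed after three batches, which is why each combined file holds about 120k rows rather than 100k.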

## 0. Install libraries

In [6]:
!pip install tqdm -q

In [7]:
import json
import pandas as pd
import numpy as np
from datetime import datetime
import os
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import psutil
import time
import gc

print("✅ Libraries imported!")
print(f"📅 Start: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ Libraries imported!
📅 Start: 2025-10-12 15:37:39


## 1. Configuration

In [8]:
# ⚙️ CONFIGURATION - CHANGE THE PATHS TO MATCH YOUR SETUP
DATA_PATH = "Yelp/yelp_dataset/"

FILE_PATHS = {
    'business': DATA_PATH + 'yelp_academic_dataset_business.json',
    'review': DATA_PATH + 'yelp_academic_dataset_review.json',
    'user': DATA_PATH + 'yelp_academic_dataset_user.json'
}

OUTPUT_DIR = "processed_data/"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Processing settings
BATCH_SIZE = 100000          # Read 100k records per batch
SAMPLE_RATE = 0.4            # Keep 40% of each batch
COMBINE_BATCHES = 5          # Intended: combine 5 batches per output file (not referenced below)
RECORDS_PER_FILE = 100000    # Write an output file once >= 100k records have accumulated

# Train/Test split
TRAIN_RATIO = 0.8
TEST_RATIO = 0.2
RANDOM_STATE = 42

print(f"📂 Data path: {DATA_PATH}")
print(f"📂 Output: {OUTPUT_DIR}")
print(f"📊 Batch size: {BATCH_SIZE:,}")
print(f"🎲 Sample rate: {SAMPLE_RATE:.0%}")
print(f"📦 Records per output file: ~{RECORDS_PER_FILE:,}")
print(f"✂️ Train/Test: {TRAIN_RATIO:.0%}/{TEST_RATIO:.0%}")

📂 Data path: Yelp/yelp_dataset/
📂 Output: processed_data/
📊 Batch size: 100,000
🎲 Sample rate: 40%
📦 Records per output file: ~100,000
✂️ Train/Test: 80%/20%


## 2. Helper Functions

In [9]:
def get_memory_mb():
    """Get current memory usage in MB"""
    return psutil.Process().memory_info().rss / 1024 / 1024

def count_lines(filepath):
    """Count total lines"""
    print(f"⏳ Counting lines in {filepath.split('/')[-1]}...")
    with open(filepath, 'r', encoding='utf-8') as f:
        count = sum(1 for _ in f)
    print(f"   Total: {count:,} lines")
    return count

def load_and_sample_batch(filepath, batch_size, sample_rate):
    """
    Load JSON file in batches and sample immediately
    Yields sampled DataFrames
    """
    total_lines = count_lines(filepath)
    filename = filepath.split('/')[-1]
    
    batch_data = []
    processed = 0
    errors = 0
    
    with open(filepath, 'r', encoding='utf-8') as f:
        pbar = tqdm(total=total_lines, desc=f"Processing {filename}", unit=" lines")
        
        for line in f:
            line = line.strip()
            if not line:
                pbar.update(1)
                continue
            
            try:
                obj = json.loads(line)
                batch_data.append(obj)
                processed += 1
                
                # When batch is full, sample and yield
                if len(batch_data) >= batch_size:
                    df_batch = pd.DataFrame(batch_data)
                    
                    # Random sample
                    sample_size = int(len(df_batch) * sample_rate)
                    df_sampled = df_batch.sample(n=sample_size, random_state=RANDOM_STATE)
                    
                    yield df_sampled
                    
                    # Clear memory
                    batch_data = []
                    del df_batch
                    gc.collect()
                    
                    pbar.set_postfix({
                        'RAM': f'{get_memory_mb():.0f}MB',
                        'Sampled': f'{sample_size:,}'
                    })
            
            except json.JSONDecodeError:
                errors += 1
            
            pbar.update(1)
        
        # Process remaining data
        if batch_data:
            df_batch = pd.DataFrame(batch_data)
            sample_size = int(len(df_batch) * sample_rate)
            if sample_size > 0:
                df_sampled = df_batch.sample(n=sample_size, random_state=RANDOM_STATE)
                yield df_sampled
        
        pbar.close()
    
    print(f"✅ Processed {processed:,} records ({errors} errors)")
    print(f"💾 Peak RAM: {get_memory_mb():.0f} MB\n")

print("✅ Helper functions ready!")

✅ Helper functions ready!
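
One detail of `load_and_sample_batch` is worth knowing: because every batch is sampled with the same `random_state`, pandas draws the same row positions from every batch of the same length. The sketch below demonstrates this on two toy frames and shows one possible alternative, offsetting the seed by a batch index. The `batch_index` offset is an assumption for illustration, not something the notebook does.

```python
import pandas as pd

RANDOM_STATE = 42

# Two toy "batches" of equal length with different contents.
batch_a = pd.DataFrame({"value": range(10)})
batch_b = pd.DataFrame({"value": range(100, 110)})

# Same seed + same length -> the same row positions are drawn from both.
pos_a = batch_a.sample(n=4, random_state=RANDOM_STATE).index.tolist()
pos_b = batch_b.sample(n=4, random_state=RANDOM_STATE).index.tolist()
print(pos_a == pos_b)  # True

# Possible alternative (hypothetical): vary the seed per batch.
# The run stays reproducible because the offsets are deterministic.
for batch_index, batch in enumerate([batch_a, batch_b]):
    sampled = batch.sample(n=4, random_state=RANDOM_STATE + batch_index)
    print(batch_index, sampled.index.tolist())
```

Each batch is still a uniform 40% sample of its own rows either way; the fixed seed only means the selected positions repeat from batch to batch.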


## 3. Process Business Data (40% Sample)

In [10]:
print("="*80)
print("🏢 PROCESSING BUSINESS DATA (40% SAMPLE)")
print("="*80)

output_file = OUTPUT_DIR + 'business.csv'
first_write = True
total_records = 0

for sampled_df in load_and_sample_batch(
    FILE_PATHS['business'], 
    BATCH_SIZE, 
    SAMPLE_RATE
):
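    # Overwrite on the first batch, then append (avoids duplicated rows when the cell is re-run)
    sampled_df.to_csv(output_file, mode='w' if first_write else 'a', header=first_write, index=False)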
    first_write = False
    total_records += len(sampled_df)
    
    del sampled_df
    gc.collect()

print(f"\n✅ Business sampled: {total_records:,} records")
print(f"📁 Saved to: {output_file}\n")

🏢 PROCESSING BUSINESS DATA (40% SAMPLE)
⏳ Counting lines in yelp_academic_dataset_business.json...
   Total: 150,348 lines


Processing yelp_academic_dataset_business.json: 100%|██████████| 150348/150348 [00:03<00:00, 49442.92 lines/s, RAM=639MB, Sampled=40,000] 

✅ Processed 150,346 records (2 errors)
💾 Peak RAM: 607 MB


✅ Business sampled: 60,138 records
📁 Saved to: processed_data/business.csv






## 4. Process User Data (40% Sample)

In [11]:
print("="*80)
print("👥 PROCESSING USER DATA (40% SAMPLE)")
print("="*80)

output_file = OUTPUT_DIR + 'user.csv'
first_write = True
total_records = 0

for sampled_df in load_and_sample_batch(
    FILE_PATHS['user'], 
    BATCH_SIZE, 
    SAMPLE_RATE
):
    # Map yelping_since to since
    if 'yelping_since' in sampled_df.columns:
        sampled_df['since'] = sampled_df['yelping_since']
    
    # Select required columns
    user_cols = ['user_id', 'name', 'review_count', 'since', 'useful', 'fans', 'average_stars']
    available_cols = [col for col in user_cols if col in sampled_df.columns]
    sampled_df = sampled_df[available_cols]
    
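    # Overwrite on the first batch, then append (avoids duplicated rows when the cell is re-run)
    sampled_df.to_csv(output_file, mode='w' if first_write else 'a', header=first_write, index=False)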
    first_write = False
    total_records += len(sampled_df)
    
    del sampled_df
    gc.collect()

print(f"\n✅ User sampled: {total_records:,} records")
print(f"📁 Saved to: {output_file}\n")

👥 PROCESSING USER DATA (40% SAMPLE)
⏳ Counting lines in yelp_academic_dataset_user.json...
   Total: 1,987,897 lines


Processing yelp_academic_dataset_user.json: 100%|██████████| 1987897/1987897 [00:28<00:00, 68945.94 lines/s, RAM=708MB, Sampled=40,000] 

✅ Processed 1,987,897 records (0 errors)
💾 Peak RAM: 694 MB


✅ User sampled: 795,158 records
📁 Saved to: processed_data/user.csv






## 5. Process Review Data (40% Sample + Multi-file)

In [12]:
print("="*80)
print("📝 PROCESSING REVIEW DATA (40% SAMPLE + MULTI-FILE)")
print("="*80)

accumulated_data = []
file_counter = 1
total_sampled = 0
null_removed = 0

for sampled_df in load_and_sample_batch(
    FILE_PATHS['review'], 
    BATCH_SIZE, 
    SAMPLE_RATE
):
    # Create sentiment labels: 1-2 stars -> 0 (negative), 3 stars -> 2 (neutral), 4-5 stars -> 1 (positive)
    sampled_df['label'] = sampled_df['stars'].apply(
        lambda x: 0 if x <= 2 else (2 if x == 3 else 1)
    )
    
    # Remove null text
    before = len(sampled_df)
    sampled_df = sampled_df.dropna(subset=['text'])
    null_removed += (before - len(sampled_df))
    
    # Select columns
    sampled_df = sampled_df[['text', 'label']]
    sampled_df.columns = ['review', 'label']
    
    # Accumulate data
    accumulated_data.append(sampled_df)
    total_sampled += len(sampled_df)
    
    # Check if we have enough data for a file
    total_accumulated = sum(len(df) for df in accumulated_data)
    
    if total_accumulated >= RECORDS_PER_FILE:
        # Combine accumulated batches
        combined_df = pd.concat(accumulated_data, ignore_index=True)
        
        # Remove duplicates within this file
        before_dup = len(combined_df)
        combined_df = combined_df.drop_duplicates(subset=['review'])
        dup_removed = before_dup - len(combined_df)
        
        print(f"\n📦 Creating file #{file_counter}:")
        print(f"   Records: {len(combined_df):,}")
        print(f"   Duplicates removed: {dup_removed:,}")
        
        # Save combined file
        review_file = OUTPUT_DIR + f'review_combined_{file_counter}.csv'
        combined_df.to_csv(review_file, index=False)
        print(f"   ✅ Saved: {review_file}")
        
        # Clear memory
        del combined_df
        accumulated_data = []
        gc.collect()
        
        file_counter += 1
    
    del sampled_df
    gc.collect()

# Process remaining data
if accumulated_data:
    combined_df = pd.concat(accumulated_data, ignore_index=True)
    before_dup = len(combined_df)
    combined_df = combined_df.drop_duplicates(subset=['review'])
    dup_removed = before_dup - len(combined_df)
    
    print(f"\n📦 Creating final file #{file_counter}:")
    print(f"   Records: {len(combined_df):,}")
    print(f"   Duplicates removed: {dup_removed:,}")
    
    review_file = OUTPUT_DIR + f'review_combined_{file_counter}.csv'
    combined_df.to_csv(review_file, index=False)
    print(f"   ✅ Saved: {review_file}")
    
    del combined_df
    gc.collect()

num_review_files = file_counter

print(f"\n{'='*80}")
print(f"✅ Review processing complete!")
print(f"   Total sampled: {total_sampled:,} records")
print(f"   Null reviews removed: {null_removed:,}")
print(f"   Output files: {num_review_files} files")
print(f"{'='*80}\n")

📝 PROCESSING REVIEW DATA (40% SAMPLE + MULTI-FILE)
⏳ Counting lines in yelp_academic_dataset_review.json...
   Total: 6,990,280 lines


Processing yelp_academic_dataset_review.json:   4%|▍         | 299571/6990280 [00:01<00:38, 174407.86 lines/s, RAM=420MB, Sampled=40,000]


📦 Creating file #1:
   Records: 119,938
   Duplicates removed: 62


Processing yelp_academic_dataset_review.json:   4%|▍         | 299999/6990280 [00:03<00:38, 174407.86 lines/s, RAM=450MB, Sampled=40,000]

   ✅ Saved: processed_data/review_combined_1.csv


Processing yelp_academic_dataset_review.json:   9%|▊         | 597106/6990280 [00:05<00:38, 165939.93 lines/s, RAM=452MB, Sampled=40,000]


📦 Creating file #2:
   Records: 119,948
   Duplicates removed: 52


Processing yelp_academic_dataset_review.json:   9%|▉         | 616959/6990280 [00:06<02:22, 44809.56 lines/s, RAM=464MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_2.csv


Processing yelp_academic_dataset_review.json:  13%|█▎        | 899244/6990280 [00:08<00:34, 174346.49 lines/s, RAM=466MB, Sampled=40,000]


📦 Creating file #3:
   Records: 119,970
   Duplicates removed: 30


Processing yelp_academic_dataset_review.json:  13%|█▎        | 920197/6990280 [00:09<02:04, 48806.45 lines/s, RAM=469MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_3.csv


Processing yelp_academic_dataset_review.json:  17%|█▋        | 1181264/6990280 [00:11<00:38, 151559.07 lines/s, RAM=470MB, Sampled=40,000]


📦 Creating file #4:
   Records: 119,946
   Duplicates removed: 54


Processing yelp_academic_dataset_review.json:  17%|█▋        | 1200678/6990280 [00:12<02:06, 45681.18 lines/s, RAM=473MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_4.csv


Processing yelp_academic_dataset_review.json:  21%|██        | 1481782/6990280 [00:14<00:34, 159545.98 lines/s, RAM=473MB, Sampled=40,000]


📦 Creating file #5:
   Records: 119,954
   Duplicates removed: 46


Processing yelp_academic_dataset_review.json:  21%|██▏       | 1501861/6990280 [00:15<01:59, 46044.71 lines/s, RAM=476MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_5.csv


Processing yelp_academic_dataset_review.json:  26%|██▌       | 1786740/6990280 [00:17<00:32, 161191.20 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #6:
   Records: 119,937
   Duplicates removed: 63


Processing yelp_academic_dataset_review.json:  26%|██▌       | 1806996/6990280 [00:18<01:47, 48285.82 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_6.csv


Processing yelp_academic_dataset_review.json:  30%|██▉       | 2090046/6990280 [00:20<00:29, 168835.20 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #7:
   Records: 119,921
   Duplicates removed: 79


Processing yelp_academic_dataset_review.json:  30%|███       | 2110297/6990280 [00:21<01:44, 46724.41 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_7.csv


Processing yelp_academic_dataset_review.json:  34%|███▍      | 2390581/6990280 [00:23<00:26, 173783.16 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #8:
   Records: 119,936
   Duplicates removed: 64


Processing yelp_academic_dataset_review.json:  34%|███▍      | 2411398/6990280 [00:24<01:32, 49704.10 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_8.csv


Processing yelp_academic_dataset_review.json:  39%|███▊      | 2692208/6990280 [00:26<00:26, 165269.77 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #9:
   Records: 119,936
   Duplicates removed: 64


Processing yelp_academic_dataset_review.json:  39%|███▉      | 2712239/6990280 [00:27<01:32, 46122.16 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_9.csv


Processing yelp_academic_dataset_review.json:  43%|████▎     | 2991897/6990280 [00:29<00:23, 170852.51 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #10:
   Records: 119,954
   Duplicates removed: 46


Processing yelp_academic_dataset_review.json:  43%|████▎     | 3012425/6990280 [00:30<01:22, 48252.88 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_10.csv


Processing yelp_academic_dataset_review.json:  47%|████▋     | 3295833/6990280 [00:32<00:21, 168257.49 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #11:
   Records: 119,932
   Duplicates removed: 68


Processing yelp_academic_dataset_review.json:  47%|████▋     | 3315975/6990280 [00:33<01:17, 47306.17 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_11.csv


Processing yelp_academic_dataset_review.json:  51%|█████▏    | 3591495/6990280 [00:35<00:20, 168577.44 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #12:
   Records: 119,953
   Duplicates removed: 47


Processing yelp_academic_dataset_review.json:  52%|█████▏    | 3611644/6990280 [00:36<01:13, 46094.89 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_12.csv


Processing yelp_academic_dataset_review.json:  56%|█████▌    | 3886774/6990280 [00:38<00:19, 162323.40 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #13:
   Records: 119,932
   Duplicates removed: 68


Processing yelp_academic_dataset_review.json:  56%|█████▌    | 3906273/6990280 [00:40<01:07, 45748.72 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_13.csv


Processing yelp_academic_dataset_review.json:  60%|█████▉    | 4185758/6990280 [00:41<00:16, 166140.64 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #14:
   Records: 119,949
   Duplicates removed: 51


Processing yelp_academic_dataset_review.json:  60%|██████    | 4205819/6990280 [00:43<01:00, 45750.66 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_14.csv


Processing yelp_academic_dataset_review.json:  64%|██████▍   | 4488534/6990280 [00:45<00:14, 170735.06 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #15:
   Records: 119,921
   Duplicates removed: 79


Processing yelp_academic_dataset_review.json:  65%|██████▍   | 4509053/6990280 [00:46<00:51, 48076.48 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_15.csv


Processing yelp_academic_dataset_review.json:  68%|██████▊   | 4785703/6990280 [00:48<00:13, 165707.64 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #16:
   Records: 119,951
   Duplicates removed: 49


Processing yelp_academic_dataset_review.json:  69%|██████▊   | 4805608/6990280 [00:49<00:47, 45962.72 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_16.csv


Processing yelp_academic_dataset_review.json:  73%|███████▎  | 5097184/6990280 [00:51<00:11, 167261.37 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #17:
   Records: 119,948
   Duplicates removed: 52


Processing yelp_academic_dataset_review.json:  73%|███████▎  | 5117130/6990280 [00:52<00:40, 45738.24 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_17.csv


Processing yelp_academic_dataset_review.json:  77%|███████▋  | 5397155/6990280 [00:54<00:09, 169664.76 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #18:
   Records: 119,941
   Duplicates removed: 59


Processing yelp_academic_dataset_review.json:  77%|███████▋  | 5417397/6990280 [00:55<00:33, 47636.76 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_18.csv


Processing yelp_academic_dataset_review.json:  81%|████████▏ | 5695100/6990280 [00:57<00:07, 170011.30 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #19:
   Records: 119,961
   Duplicates removed: 39


Processing yelp_academic_dataset_review.json:  82%|████████▏ | 5715639/6990280 [00:58<00:26, 47426.02 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_19.csv


Processing yelp_academic_dataset_review.json:  86%|████████▌ | 5979696/6990280 [01:00<00:06, 158798.24 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #20:
   Records: 119,936
   Duplicates removed: 64


Processing yelp_academic_dataset_review.json:  86%|████████▌ | 6000000/6990280 [01:01<00:20, 48022.84 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_20.csv


Processing yelp_academic_dataset_review.json:  90%|█████████ | 6297068/6990280 [01:03<00:04, 166703.60 lines/s, RAM=477MB, Sampled=40,000]


📦 Creating file #21:
   Records: 119,951
   Duplicates removed: 49


Processing yelp_academic_dataset_review.json:  90%|█████████ | 6316780/6990280 [01:04<00:14, 46027.72 lines/s, RAM=477MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_21.csv


Processing yelp_academic_dataset_review.json:  94%|█████████▍| 6578854/6990280 [01:06<00:02, 158042.08 lines/s, RAM=478MB, Sampled=40,000]


📦 Creating file #22:
   Records: 119,931
   Duplicates removed: 69


Processing yelp_academic_dataset_review.json:  95%|█████████▍| 6621437/6990280 [01:07<00:05, 62709.39 lines/s, RAM=478MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_22.csv


Processing yelp_academic_dataset_review.json:  98%|█████████▊| 6882657/6990280 [01:09<00:00, 157872.13 lines/s, RAM=478MB, Sampled=40,000]


📦 Creating file #23:
   Records: 119,923
   Duplicates removed: 77


Processing yelp_academic_dataset_review.json:  99%|█████████▊| 6902446/6990280 [01:10<00:01, 46962.87 lines/s, RAM=478MB, Sampled=40,000] 

   ✅ Saved: processed_data/review_combined_23.csv


Processing yelp_academic_dataset_review.json: 100%|██████████| 6990280/6990280 [01:11<00:00, 97880.31 lines/s, RAM=478MB, Sampled=40,000] 


✅ Processed 6,990,280 records (0 errors)
💾 Peak RAM: 483 MB


📦 Creating final file #24:
   Records: 36,104
   Duplicates removed: 8
   ✅ Saved: processed_data/review_combined_24.csv

✅ Review processing complete!
   Total sampled: 2,796,112 records
   Null reviews removed: 0
   Output files: 24 files
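
The labelling rule applied in the review cell above can also be written as a small standalone function, which makes the unusual numbering (neutral is 2, not 1) easy to check. `stars_to_label` is a hypothetical name; the thresholds are the ones used in the lambda.

```python
def stars_to_label(stars):
    """Map a Yelp star rating to the notebook's sentiment label."""
    if stars <= 2:
        return 0  # negative
    if stars == 3:
        return 2  # neutral
    return 1      # positive (4 or 5 stars)


print([stars_to_label(s) for s in (1, 2, 3, 4, 5)])  # [0, 0, 2, 1, 1]
```

It could be applied per batch with `sampled_df['stars'].map(stars_to_label)`.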



## 6. Train/Test Split for Each File

In [13]:
print("="*80)
print("✂️ TRAIN/TEST SPLIT (STRATIFIED)")
print("="*80)

total_train = 0
total_test = 0

for i in range(1, num_review_files + 1):
    review_file = OUTPUT_DIR + f'review_combined_{i}.csv'
    
    print(f"\n📂 Processing file {i}/{num_review_files}...")
    
    # Load file
    df = pd.read_csv(review_file)
    print(f"   Loaded: {len(df):,} records")
    
    # Split with stratify
    train_df, test_df = train_test_split(
        df,
        test_size=TEST_RATIO,
        random_state=RANDOM_STATE,
        stratify=df['label']
    )
    
    # Save train
    train_file = OUTPUT_DIR + f'train_part{i}.csv'
    train_df.to_csv(train_file, index=False)
    print(f"   ✅ Train: {len(train_df):,} → {train_file}")
    
    # Save test
    test_file = OUTPUT_DIR + f'test_part{i}.csv'
    test_df.to_csv(test_file, index=False)
    print(f"   ✅ Test: {len(test_df):,} → {test_file}")
    
    total_train += len(train_df)
    total_test += len(test_df)
    
    # Clean memory
    del df, train_df, test_df
    gc.collect()

print(f"\n{'='*80}")
print(f"✅ Split complete!")
print(f"   Total train: {total_train:,} records")
print(f"   Total test: {total_test:,} records")
print(f"   Train files: {num_review_files}")
print(f"   Test files: {num_review_files}")
print(f"{'='*80}\n")

✂️ TRAIN/TEST SPLIT (STRATIFIED)

📂 Processing file 1/24...
   Loaded: 119,938 records
   ✅ Train: 95,950 → processed_data/train_part1.csv
   ✅ Test: 23,988 → processed_data/test_part1.csv

📂 Processing file 2/24...
   Loaded: 119,948 records
   ✅ Train: 95,958 → processed_data/train_part2.csv
   ✅ Test: 23,990 → processed_data/test_part2.csv

📂 Processing file 3/24...
   Loaded: 119,970 records
   ✅ Train: 95,976 → processed_data/train_part3.csv
   ✅ Test: 23,994 → processed_data/test_part3.csv

📂 Processing file 4/24...
   Loaded: 119,946 records
   ✅ Train: 95,956 → processed_data/train_part4.csv
   ✅ Test: 23,990 → processed_data/test_part4.csv

📂 Processing file 5/24...
   Loaded: 119,954 records
   ✅ Train: 95,963 → processed_data/train_part5.csv
   ✅ Test: 23,991 → processed_data/test_part5.csv

📂 Processing file 6/24...
   Loaded: 119,937 records
   ✅ Train: 95,949 → processed_data/train_part6.csv
   ✅ Test: 23,988 → processed_data/test_part6.csv

📂 Processing file 7/24...
   [remaining output truncated]
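
The train and test sizes printed above are consistent with the test set being the ceiling of `TEST_RATIO` times the file size and the train set being the remainder, which appears to be how scikit-learn sizes a float `test_size`. The sketch below reproduces that arithmetic only; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

TEST_RATIO = 0.2


def split_sizes(n_records, test_ratio=TEST_RATIO):
    """Return (n_train, n_test), assuming n_test = ceil(n_records * test_ratio)."""
    n_test = math.ceil(n_records * test_ratio)
    return n_records - n_test, n_test


# File sizes taken from the split output above.
for n in (119_938, 119_948, 119_970):
    n_train, n_test = split_sizes(n)
    print(f"{n:,} -> train {n_train:,}, test {n_test:,}")
```

Because of the ceiling, the test share is 20% or marginally above it in every file, never below.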

## 7. Summary Report

In [14]:
print("="*80)
print("📊 FINAL SUMMARY REPORT")
print("="*80)

# List all files
all_files = os.listdir(OUTPUT_DIR)
business_files = [f for f in all_files if f.startswith('business')]
user_files = [f for f in all_files if f.startswith('user')]
review_files = [f for f in all_files if f.startswith('review_combined')]
train_files = [f for f in all_files if f.startswith('train_part')]
test_files = [f for f in all_files if f.startswith('test_part')]

print(f"\n📁 Output Files:")
print(f"   Business: {len(business_files)} file(s)")
print(f"   User: {len(user_files)} file(s)")
print(f"   Review combined: {len(review_files)} file(s)")
print(f"   Train: {len(train_files)} file(s)")
print(f"   Test: {len(test_files)} file(s)")

# Calculate total size
total_size = 0
for f in all_files:
    filepath = OUTPUT_DIR + f
    if os.path.isfile(filepath):
        total_size += os.path.getsize(filepath)

print(f"\n💾 Total output size: {total_size / 1024 / 1024:.1f} MB")

print(f"\n📊 Record Counts:")
print(f"   Business: {pd.read_csv(OUTPUT_DIR + 'business.csv').shape[0]:,}")
print(f"   User: {pd.read_csv(OUTPUT_DIR + 'user.csv').shape[0]:,}")
print(f"   Train (total): {total_train:,}")
print(f"   Test (total): {total_test:,}")

print(f"\n⚙️ Processing Settings:")
print(f"   Sample rate: {SAMPLE_RATE:.0%}")
print(f"   Batch size: {BATCH_SIZE:,}")
print(f"   Records per file: ~{RECORDS_PER_FILE:,}")

print(f"\n💾 Peak RAM usage: {get_memory_mb():.0f} MB")
print(f"📅 Completed: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print("\n" + "="*80)
print("🎉 PROCESSING COMPLETE!")
print("="*80)

print(f"\n📚 How to use:")
print(f"   1. Train files: train_part1.csv, train_part2.csv, ...")
print(f"   2. Test files: test_part1.csv, test_part2.csv, ...")
print(f"   3. Load multiple files in training loop")
print(f"   4. Labels: 0=Negative, 1=Positive, 2=Neutral")

# Save summary to file
summary_file = OUTPUT_DIR + 'processing_summary.txt'
with open(summary_file, 'w', encoding='utf-8') as f:
    f.write("YELP DATASET PROCESSING SUMMARY\n")
    f.write("="*80 + "\n\n")
    f.write(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
    f.write(f"Business files: {len(business_files)}\n")
    f.write(f"User files: {len(user_files)}\n")
    f.write(f"Review combined files: {len(review_files)}\n")
    f.write(f"Train files: {len(train_files)}\n")
    f.write(f"Test files: {len(test_files)}\n\n")
    f.write(f"Total train records: {total_train:,}\n")
    f.write(f"Total test records: {total_test:,}\n\n")
    f.write(f"Sample rate: {SAMPLE_RATE:.0%}\n")
    f.write(f"Total size: {total_size / 1024 / 1024:.1f} MB\n")

print(f"\n✅ Summary saved to: {summary_file}")

📊 FINAL SUMMARY REPORT

📁 Output Files:
   Business: 1 file(s)
   User: 1 file(s)
   Review combined: 24 file(s)
   Train: 24 file(s)
   Test: 24 file(s)

💾 Total output size: 6367.9 MB

📊 Record Counts:
   Business: 60,138
   User: 795,158
   Train (total): 2,235,805
   Test (total): 558,968

⚙️ Processing Settings:
   Sample rate: 40%
   Batch size: 100,000
   Records per file: ~100,000

💾 Peak RAM usage: 478 MB
📅 Completed: 2025-10-12 15:40:01

🎉 PROCESSING COMPLETE!

📚 How to use:
   1. Train files: train_part1.csv, train_part2.csv, ...
   2. Test files: test_part1.csv, test_part2.csv, ...
   3. Load multiple files in training loop
   4. Labels: 0=Negative, 1=Positive, 2=Neutral

✅ Summary saved to: processed_data/processing_summary.txt
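
Step 3 of the usage notes ("load multiple files in training loop") could look roughly like the generator below. This is a minimal sketch under assumptions: `iter_parts` and its `chunksize` default are hypothetical, and it relies only on the `train_part{i}.csv` / `test_part{i}.csv` naming produced above.

```python
import glob
import os
import re

import pandas as pd


def iter_parts(output_dir, prefix="train_part", chunksize=10_000):
    """Yield DataFrame chunks from <prefix>1.csv, <prefix>2.csv, ... in numeric order."""

    def part_number(path):
        # Sort by the trailing number so part10 comes after part2, not after part1.
        match = re.search(r"(\d+)\.csv$", path)
        return int(match.group(1)) if match else -1

    pattern = os.path.join(output_dir, f"{prefix}*.csv")
    for path in sorted(glob.glob(pattern), key=part_number):
        # chunksize makes read_csv return an iterator, so only one chunk is in memory.
        for chunk in pd.read_csv(path, chunksize=chunksize):
            yield chunk


# Example (assumes the files written above exist in processed_data/):
# for chunk in iter_parts("processed_data/"):
#     texts, labels = chunk["review"], chunk["label"]
```

Reading in chunks keeps memory use bounded by `chunksize` rather than by the size of a whole part file.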


## 8. Quick Preview

In [15]:
# Preview first train file
print("📋 PREVIEW: train_part1.csv\n")
df_preview = pd.read_csv(OUTPUT_DIR + 'train_part1.csv')
print(df_preview.head(10))

print(f"\n📊 Label distribution in train_part1:")
print(df_preview['label'].value_counts().sort_index())

📋 PREVIEW: train_part1.csv

                                              review  label
0  I stopped by Café Beignet (bourbon street loca...      1
1  First a disclaimer:  I only speak of one cab r...      1
2  Wonderful brunch option in the Funk Zone. Food...      1
3  I haven't been downtown for a while, but happe...      1
4  Good as it gets for bbq in Tampa.  Service is ...      1
5  I went to V Vegaz because they had Diva-Curls ...      1
6  The food is 5 stars, but the service is 1. If ...      0
7  Good food. Menu was too busy and they did not ...      2
8  Nice late night food and pub. Wish the variety...      2
9  Moonshine me up, baby!\n\nStopped in with my f...      2

📊 Label distribution in train_part1:
label
0    18398
1    66550
2    11002
Name: count, dtype: int64
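
The distribution above is imbalanced, with the positive class dominating. The sketch below computes each class's share and an inverse-frequency weight, a common heuristic (the formula matches what scikit-learn documents for `class_weight='balanced'`). Whether to weight classes at all is a modelling choice outside this notebook; the counts are copied from the output above.

```python
# Label counts for train_part1.csv, copied from the output above.
counts = {0: 18398, 1: 66550, 2: 11002}
total = sum(counts.values())

# Fraction of the file taken by each label.
shares = {label: n / total for label, n in counts.items()}

# "Balanced" heuristic: total / (number_of_classes * class_count).
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in sorted(counts):
    print(f"label {label}: share={shares[label]:.1%}, weight={weights[label]:.2f}")
```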
