# PySpark Leaderboard Analysis - Local Mode

This notebook runs PySpark in local mode to generate leaderboard snapshot data from a Parquet file.

## Pipeline:
1. Read data from the Parquet file
2. Transform rows into Score objects with an event time
3. Compute each user's total score over a sliding window (the most recent 5 minutes)
4. Compute the TopN with retraction logic
5. Take a snapshot every 7 minutes of event time
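
The pipeline does not spell out what "retraction" means in step 4. One common reading, borrowed from streaming engines, is that a user's previously emitted total is withdrawn before the new total is added, which appears to be the role of the `previousTotalScore` field defined later. The sketch below illustrates that reading in plain Python. It is an assumption, not the notebook's implementation, and `apply_update`, `top_n`, and `board` are hypothetical names.

```python
from typing import Dict, List, Tuple

Change = Tuple[str, str, float]  # (kind, user_id, total)

def apply_update(board: Dict[str, float], user_id: str, new_total: float) -> List[Change]:
    """Update the board in place and return the retract/add messages it implies."""
    changes: List[Change] = []
    if user_id in board:
        # The previously emitted total is no longer valid, so withdraw it first.
        changes.append(("retract", user_id, board[user_id]))
    board[user_id] = new_total
    changes.append(("add", user_id, new_total))
    return changes

def top_n(board: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Highest totals first; ties are broken by user id for a stable order."""
    return sorted(board.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

board: Dict[str, float] = {}
log: List[Change] = []
for user_id, total in [("a", 5.0), ("b", 9.0), ("a", 12.0), ("c", 7.0)]:
    log.extend(apply_update(board, user_id, total))

print(log)
print(top_n(board, 2))  # [('a', 12.0), ('b', 9.0)]
```

Only the third update produces a `retract` message, because `a` is the only user whose total changes after it was first emitted.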


In [3]:
# Import required libraries
import os
import sys
from datetime import datetime, timezone
from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass
from collections import defaultdict, deque
import math

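# PySpark imports
from pyspark.sql import SparkSession, DataFrame
# Note: the wildcard import below shadows Python built-ins such as sum, min, and max
# with Spark column functions; use builtins.sum etc. where the Python versions are needed.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window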

print("Libraries imported successfully!")


Libraries imported successfully!


In [7]:
# Create Spark Session with proper error handling and cleanup
from pyspark.sql import SparkSession
from pyspark import SparkContext

def create_spark_session():
    """Create Spark session with proper cleanup and error handling"""
    try:
        # First, try to stop any existing SparkContext
        try:
            sc = SparkContext._active_spark_context
            if sc is not None:
                print("Stopping existing SparkContext...")
                sc.stop()
        except Exception:
            # Nothing to stop, or the old context is already gone
            pass
        
        # Create new Spark session
        spark = SparkSession.builder \
            .appName("LeaderBoardAnalysis") \
            .master("local[2]") \
            .config("spark.sql.adaptive.enabled", "true") \
            .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
            .getOrCreate()
        
        print("✅ Spark session created successfully!")
        print(f"Spark version: {spark.version}")
        # The UI port is not always 4040 (Spark picks the next free port), so ask the context
        print(f"Spark UI: {spark.sparkContext.uiWebUrl}")
        return spark
        
    except Exception as e:
        print(f"❌ Error creating Spark session: {e}")
        print("\nTroubleshooting steps:")
        print("1. Make sure Java is installed and JAVA_HOME is set")
        print("2. Check Java version: java -version")
        print("3. Check JAVA_HOME: echo $JAVA_HOME (Linux/Mac) or echo %JAVA_HOME% (Windows)")
        print("4. Try restarting Jupyter kernel completely")
        print("5. Check if any other Spark processes are running")
        
        # Try to get more detailed error information
        try:
            import subprocess
            result = subprocess.run(['java', '-version'], capture_output=True, text=True)
            print(f"\nJava version check: {result.stderr}")
        except FileNotFoundError:
            print("\nJava not found in PATH")
        
        return None

# Create the Spark session
spark = create_spark_session()


Stopping existing SparkContext...
❌ Error creating Spark session: 'JavaPackage' object is not callable

Troubleshooting steps:
1. Make sure Java is installed and JAVA_HOME is set
2. Check Java version: java -version
3. Check JAVA_HOME: echo $JAVA_HOME (Linux/Mac) or echo %JAVA_HOME% (Windows)
4. Try restarting Jupyter kernel completely
5. Check if any other Spark processes are running

Java version check: java version "17.0.12" 2024-07-16 LTS
Java(TM) SE Runtime Environment (build 17.0.12+8-LTS-286)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.12+8-LTS-286, mixed mode, sharing)



In [None]:
# Check Java installation and environment
import os
import subprocess
import sys

def check_java_installation():
    """Check Java installation and environment variables"""
    print("=== JAVA INSTALLATION CHECK ===")
    
    # Check JAVA_HOME
    java_home = os.environ.get('JAVA_HOME')
    if java_home:
        print(f"✅ JAVA_HOME is set to: {java_home}")
    else:
        print("❌ JAVA_HOME is not set")
    
    # Check Java version
    try:
        result = subprocess.run(['java', '-version'], capture_output=True, text=True, timeout=10)
        if result.returncode == 0:
            print("✅ Java is installed:")
            print(result.stderr)
        else:
            print("❌ Java command failed")
    except subprocess.TimeoutExpired:
        print("❌ Java command timed out")
    except FileNotFoundError:
        print("❌ Java not found in PATH")
    except Exception as e:
        print(f"❌ Error checking Java: {e}")
    
    # Check Python environment
    print(f"\n=== PYTHON ENVIRONMENT ===")
    print(f"Python version: {sys.version}")
    print(f"Python executable: {sys.executable}")
    
    # Check if we can import pyspark
    try:
        import pyspark
        print(f"✅ PySpark version: {pyspark.__version__}")
    except ImportError as e:
        print(f"❌ PySpark import error: {e}")

# Run the check
check_java_installation()


## Spark troubleshooting guide

### Error 1: `'JavaPackage' object is not callable`
**Cause:** PySpark cannot talk to the JVM correctly. Common reasons are a Java version that the installed Spark release does not support, or a `pyspark` package that does not match the Spark distribution `SPARK_HOME` points to.

**Fix:**
1. **Install a Java version your Spark release supports:**
   - Download Java from: https://adoptium.net/
   - Java 8 and 11 work with Spark 3.x; Java 17 is supported only from Spark 3.3 onward, so it can cause conflicts with older releases

2. **Set JAVA_HOME:**
   - Windows: `set JAVA_HOME=C:\Program Files\Eclipse Adoptium\jdk-8.0.xxx`
   - Linux/Mac: `export JAVA_HOME=/usr/lib/jvm/java-8-openjdk`

3. **Add Java to PATH:**
   - Windows: add `%JAVA_HOME%\bin` to PATH
   - Linux/Mac: add `$JAVA_HOME/bin` to PATH
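
To decide whether the installed Java is the problem, it helps to extract the major version from the `java -version` output shown earlier in this notebook. The following is a minimal sketch; `java_major_version` is a hypothetical helper, and it assumes the usual `version "..."` banner format, including the legacy `1.8` numbering used by Java 8.

```python
import re
from typing import Optional

def java_major_version(version_output: str) -> Optional[int]:
    """Extract the Java major version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy scheme: "1.8.0_292" means Java 8.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first

print(java_major_version('java version "17.0.12" 2024-07-16 LTS'))  # 17
print(java_major_version('openjdk version "1.8.0_292"'))            # 8
```

In the notebook, the input would be `result.stderr` from `subprocess.run(['java', '-version'], capture_output=True, text=True)`, since Java prints its version banner to stderr.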

### Error 2: `Cannot run multiple SparkContexts at once`
**Cause:** A SparkContext from a previous attempt is still running

**Fix:**
1. **Restart the Jupyter kernel completely:**
   - Kernel → Restart & Clear Output
   - Or Kernel → Shutdown, then start it again

2. **Check for running processes:**
   - Windows: `tasklist | findstr java`
   - Linux/Mac: `ps aux | grep java`

3. **Kill the processes if needed (this terminates every Java process, not only Spark):**
   - Windows: `taskkill /f /im java.exe`
   - Linux/Mac: `pkill -f java`

### Quick fix steps:
1. **Restart the Jupyter kernel**
2. **Re-run the Java check cell**
3. **If the error persists, reinstall Java and set JAVA_HOME**
4. **Restart the computer if necessary**


In [None]:
# Alternative Spark session creation with minimal configuration
def create_minimal_spark():
    """Create Spark session with absolute minimal configuration"""
    try:
        # Stop any existing context first
        try:
            from pyspark import SparkContext
            if SparkContext._active_spark_context:
                SparkContext._active_spark_context.stop()
        except Exception:
            # Nothing to stop, or the old context is already gone
            pass
        
        # Create with minimal config
        spark = SparkSession.builder \
            .appName("LeaderBoardAnalysis") \
            .master("local[1]") \
            .getOrCreate()
        
        print("✅ Minimal Spark session created successfully!")
        return spark
        
    except Exception as e:
        print(f"❌ Minimal Spark session failed: {e}")
        return None

# Try creating minimal Spark session
if spark is None:
    print("Trying minimal Spark configuration...")
    spark = create_minimal_spark()


In [4]:
# Initialize Spark Session for local mode
spark = SparkSession.builder \
    .appName("LeaderBoardAnalysis") \
    .master("local[*]") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.enabled", "true") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()

print(f"Spark version: {spark.version}")
print(f"Spark UI: http://localhost:4040")
print("Spark session created successfully!")


TypeError: 'JavaPackage' object is not callable

In [None]:
# Data classes for type safety
@dataclass
class UserData:
    uid: str
    level: int
    team: int
    updatedAt: str
    name: str
    geo: str

@dataclass
class Score:
    id: str
    score: float
    lastUpdateTime: int
    previousScore: Optional[float] = None

@dataclass
class UserTotalScore:
    userId: str
    totalScore: float
    previousTotalScore: float
    lastUpdateTime: int

@dataclass
class LeaderBoardEntry:
    userId: str
    totalScore: float
    rank: int
    lastUpdateTime: int
    snapshotTime: int

print("Data classes defined successfully!")


In [None]:
# Helper functions
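def parse_timestamp(timestamp_str: str) -> int:
    """Parse an ISO-8601 timestamp string to epoch milliseconds"""
    try:
        # Try parsing ISO format
        dt = datetime.fromisoformat(timestamp_str.replace('Z', '+00:00'))
        # Treat timestamps without an offset as UTC, matching format_timestamp
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return int(dt.timestamp() * 1000)
    except (AttributeError, TypeError, ValueError):
        # Fallback to current time when the value cannot be parsed
        return int(datetime.now(timezone.utc).timestamp() * 1000)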

def format_timestamp(timestamp: int) -> str:
    """Format timestamp for display"""
    dt = datetime.fromtimestamp(timestamp / 1000, tz=timezone.utc)
    return dt.isoformat()

print("Helper functions defined successfully!")
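
A quick round trip makes the unit and time-zone handling of these helpers concrete. The sketch below is self-contained, so it re-implements the two conversions under the hypothetical names `to_epoch_ms` and `to_iso`; it assumes that timestamps without an offset are meant as UTC.

```python
from datetime import datetime, timezone

def to_epoch_ms(timestamp_str: str) -> int:
    """Parse an ISO-8601 string to epoch milliseconds, treating naive values as UTC."""
    dt = datetime.fromisoformat(timestamp_str.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

def to_iso(epoch_ms: int) -> str:
    """Format epoch milliseconds as an ISO-8601 string in UTC."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc).isoformat()

ms = to_epoch_ms("2024-01-01T00:00:00Z")
print(ms)          # 1704067200000
print(to_iso(ms))  # 2024-01-01T00:00:00+00:00

# A value without an offset maps to the same instant as its UTC counterpart
print(to_epoch_ms("2024-01-01T00:05:00") - ms)  # 300000
```

Without the explicit UTC assumption, a naive timestamp would be interpreted in the machine's local time zone, so the same Parquet file could yield different event times on different machines.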


In [None]:
# Step 1: Read data from Parquet file
input_path = "app-python/fixed-dataset.parquet"

if os.path.exists(input_path):
    print(f"Reading data from: {input_path}")
    user_data = spark.read.parquet(input_path)
    print(f"Data loaded successfully! Rows: {user_data.count()}")
    user_data.show(10, False)
    user_data.printSchema()
else:
    print(f"File not found: {input_path}")
    print("Available files in app-python directory:")
    if os.path.exists("app-python"):
        for file in os.listdir("app-python"):
            print(f"  - {file}")


In [None]:
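# Step 2: Transform to Score objects
# An explicit schema is required: previousScore is None for every row,
# so Spark cannot infer its type from the data.
score_schema = StructType([
    StructField('id', StringType(), True),
    StructField('score', DoubleType(), True),
    StructField('lastUpdateTime', LongType(), True),
    StructField('previousScore', DoubleType(), True)
])

def create_score(row):
    """Map one user row to (id, score, lastUpdateTime, previousScore)"""
    timestamp = parse_timestamp(row.updatedAt)
    score = float(row.level)
    return (row.uid, score, timestamp, None)

scores = spark.createDataFrame(user_data.rdd.map(create_score), schema=score_schema)
print("Scores transformed successfully!")
scores.show(10, False)
scores.printSchema()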


In [None]:
# Step 3: Calculate total scores in sliding window (5 minutes)
import builtins  # the wildcard import from pyspark.sql.functions shadows the built-in sum
from pyspark.sql import Row

window_size_minutes = 5
window_size_ms = window_size_minutes * 60 * 1000

def window_total(score_list, end_time):
    """Sum the scores whose event time falls in (end_time - window, end_time]"""
    window_start = end_time - window_size_ms
    return builtins.sum(
        s['score'] for s in score_list
        if window_start < s['lastUpdateTime'] <= end_time
    )

def calculate_user_total_scores(user_scores):
    user_id = user_scores[0]
    score_list = sorted(user_scores[1], key=lambda x: x['lastUpdateTime'])
    
    results = []
    for i, current_score in enumerate(score_list):
        current_time = current_score['lastUpdateTime']
        
        # Total of all scores inside the window that ends at the current event
        total_score = window_total(score_list, current_time)
        
        # Total as of the previous event (0.0 for the user's first event)
        previous_total_score = 0.0
        if i > 0:
            previous_total_score = window_total(score_list, score_list[i-1]['lastUpdateTime'])
        
        results.append(Row(
            userId=user_id,
            totalScore=total_score,
            previousTotalScore=previous_total_score,
            lastUpdateTime=current_time
        ))
    
    return results

# Group by user and apply the function
total_scores = scores.rdd \
    .groupBy(lambda x: x['id']) \
    .flatMap(calculate_user_total_scores) \
    .toDF()

print(f"Total scores calculated for {window_size_minutes} minute window!")
total_scores.show(10, False)
print(f"Total records: {total_scores.count()}")
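
The per-user computation above rescans the whole score list for every event. The notebook imports `deque` but never uses it, so here is a sketch of a linear-time alternative for one user's events using a running sum and a deque. `sliding_totals` is a hypothetical name, and the sketch assumes `window_ms > 0` and at most one event per timestamp for a given user.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple

def sliding_totals(events: Iterable[Tuple[int, float]], window_ms: int) -> List[Tuple[int, float]]:
    """For each event at time t, sum the scores with timestamp in (t - window_ms, t]."""
    window: Deque[Tuple[int, float]] = deque()
    running = 0.0
    totals: List[Tuple[int, float]] = []
    for ts, score in sorted(events):
        window.append((ts, score))
        running += score
        # Evict events that have fallen out of the window ending at ts.
        # The current event always stays, so the deque never becomes empty.
        while window[0][0] <= ts - window_ms:
            _, old_score = window.popleft()
            running -= old_score
        totals.append((ts, running))
    return totals

events = [(0, 1.0), (60_000, 2.0), (300_000, 4.0), (420_000, 8.0)]
print(sliding_totals(events, window_ms=5 * 60 * 1000))
# [(0, 1.0), (60000, 3.0), (300000, 6.0), (420000, 12.0)]
```

The event at `t = 0` drops out exactly when the window ends at `300000`, because the lower bound of the window is exclusive.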


In [None]:
# Step 4: Define the TopN and snapshot functions (one snapshot every 7 minutes)
def calculate_leaderboard_at_snapshots(total_scores: DataFrame, snapshot_times: List[int], 
                                     top_n: int, ttl_minutes: int) -> List[LeaderBoardEntry]:
    """Compute the leaderboard at the given snapshot times"""
    ttl_ms = ttl_minutes * 60 * 1000
    all_snapshots = []
    
    for snapshot_time in snapshot_times:
        cutoff_time = snapshot_time - ttl_ms
        
        # Select all scores that are valid at the snapshot time
        valid_scores = total_scores.filter(
            (col('lastUpdateTime') <= snapshot_time) & 
            (col('lastUpdateTime') > cutoff_time)
        ).collect()
        
        # Group by user and keep the latest score for each user
        user_latest_scores = {}
        for score in valid_scores:
            user_id = score['userId']
            if user_id not in user_latest_scores or score['lastUpdateTime'] > user_latest_scores[user_id]['lastUpdateTime']:
                user_latest_scores[user_id] = score
        
        # Sort by total score and take the top N
        sorted_users = sorted(user_latest_scores.values(), key=lambda x: x['totalScore'], reverse=True)
        
        # Create the leaderboard entries for this snapshot
        for i, user_score in enumerate(sorted_users[:top_n]):
            all_snapshots.append(LeaderBoardEntry(
                userId=user_score['userId'],
                totalScore=user_score['totalScore'],
                rank=i + 1,
                lastUpdateTime=user_score['lastUpdateTime'],
                snapshotTime=snapshot_time
            ))
    
    return all_snapshots

def generate_snapshots(total_scores: DataFrame, top_n: int, ttl_minutes: int, 
                      snapshot_interval_minutes: int = 7) -> List[LeaderBoardEntry]:
    """Generate a snapshot every snapshot_interval_minutes (default 7) of event time"""
    # Collect all distinct timestamps
    all_timestamps = [row['lastUpdateTime'] for row in total_scores.select('lastUpdateTime').distinct().collect()]
    all_timestamps.sort()
    
    if not all_timestamps:
        return []
    
    first_timestamp = all_timestamps[0]
    last_timestamp = all_timestamps[-1]
    
    # Generate snapshot times (one per interval)
    snapshot_interval_ms = snapshot_interval_minutes * 60 * 1000
    snapshot_times = []
    
    # The first snapshot falls one interval after the first record
    first_snapshot_time = first_timestamp + snapshot_interval_ms
    current_time = first_snapshot_time
    
    while current_time <= last_timestamp:
        snapshot_times.append(current_time)
        current_time += snapshot_interval_ms
    
    print(f"Generated {len(snapshot_times)} snapshot times")
    for ts in snapshot_times:
        print(f"  Snapshot time: {format_timestamp(ts)}")
    
    # Compute the leaderboard at the snapshot times
    return calculate_leaderboard_at_snapshots(total_scores, snapshot_times, top_n, ttl_minutes)

print("Snapshot functions defined successfully!")
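
The selection rule inside `calculate_leaderboard_at_snapshots` can be checked on a few hand-written records without Spark. The sketch below mirrors that rule in plain Python: keep each user's latest total inside the TTL window, then rank by total. `top_n_at` and the `(user_id, total, timestamp)` tuple layout are assumptions made for the example.

```python
import heapq
from typing import Dict, List, Tuple

def top_n_at(records: List[Tuple[str, float, int]], snapshot_time: int,
             ttl_ms: int, top_n: int) -> List[Tuple[int, str, float]]:
    """Rank users by their latest total within (snapshot_time - ttl_ms, snapshot_time]."""
    latest: Dict[str, Tuple[int, float]] = {}
    for user_id, total, ts in records:
        if snapshot_time - ttl_ms < ts <= snapshot_time:
            # Keep only the most recent record per user
            if user_id not in latest or ts > latest[user_id][0]:
                latest[user_id] = (ts, total)
    best = heapq.nlargest(top_n, latest.items(), key=lambda kv: kv[1][1])
    return [(rank, user_id, total)
            for rank, (user_id, (_, total)) in enumerate(best, start=1)]

records = [("a", 5.0, 100), ("b", 9.0, 200), ("a", 7.0, 300), ("c", 3.0, 900)]
print(top_n_at(records, snapshot_time=400, ttl_ms=350, top_n=2))
# [(1, 'b', 9.0), (2, 'a', 7.0)]
```

User `c` is excluded because its record arrives after the snapshot time, and user `a` is ranked by its later total of 7.0 rather than the earlier 5.0.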


In [None]:
# Step 5: Generate leaderboard snapshots
top_n = 10
ttl_minutes = 30
snapshot_interval_minutes = 7

print(f"Generating leaderboard snapshots with:")
print(f"  Top N: {top_n}")
print(f"  TTL: {ttl_minutes} minutes")
print(f"  Snapshot interval: {snapshot_interval_minutes} minutes")

snapshots = generate_snapshots(
    total_scores, 
    top_n, 
    ttl_minutes, 
    snapshot_interval_minutes
)

print(f"\nGenerated {len(snapshots)} leaderboard entries across snapshots.")
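
With a 7-minute interval, the snapshot times form an arithmetic sequence that starts one interval after the first event and stops at the last event. The sketch below reproduces the `while` loop from `generate_snapshots` with `range`; `snapshot_times` is a hypothetical helper name, and the timestamps are made-up epoch milliseconds.

```python
from typing import List

def snapshot_times(first_ms: int, last_ms: int, interval_minutes: int = 7) -> List[int]:
    """Snapshot times from first_ms + interval up to and including last_ms."""
    interval_ms = interval_minutes * 60 * 1000
    return list(range(first_ms + interval_ms, last_ms + 1, interval_ms))

# Events spanning 20 minutes yield snapshots at minute 7 and minute 14
print(snapshot_times(0, 20 * 60 * 1000))  # [420000, 840000]
```

If the events span less than one interval, the list is empty and no leaderboard entries are generated.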


In [None]:
# Step 6: Display results
if snapshots:
    # Convert to DataFrame for better display
    snapshot_data = []
    for entry in snapshots:
        snapshot_data.append({
            'userId': entry.userId,
            'totalScore': entry.totalScore,
            'rank': entry.rank,
            'lastUpdateTime': entry.lastUpdateTime,
            'snapshotTime': entry.snapshotTime,
            'snapshotTimeFormatted': format_timestamp(entry.snapshotTime),
            'lastUpdateTimeFormatted': format_timestamp(entry.lastUpdateTime)
        })
    
    snapshot_df = spark.createDataFrame(snapshot_data)
    
    print("\n=== LEADERBOARD SNAPSHOTS ===")
    snapshot_df.orderBy("snapshotTime", "rank").show(50, False)
    
    # Show summary by snapshot time
    print("\n=== SNAPSHOT SUMMARY ===")
    snapshot_summary = snapshot_df.groupBy("snapshotTimeFormatted") \
        .agg(
            count("userId").alias("userCount"),
            max("totalScore").alias("maxScore"),
            min("totalScore").alias("minScore"),
            avg("totalScore").alias("avgScore")
        ) \
        .orderBy("snapshotTimeFormatted")
    
    snapshot_summary.show(20, False)
    
    # Show top users across all snapshots
    print("\n=== TOP USERS ACROSS ALL SNAPSHOTS ===")
    top_users = snapshot_df.groupBy("userId") \
        .agg(
            count("rank").alias("snapshotCount"),
            max("totalScore").alias("maxScore"),
            avg("totalScore").alias("avgScore"),
            min("rank").alias("bestRank")
        ) \
        .orderBy(desc("maxScore"), desc("avgScore"))
    
    top_users.show(20, False)
    
else:
    print("No snapshots generated!")


In [None]:
# Step 7: Save results to file
output_path = "spark-jobs/result/leaderboard_snapshots"

if snapshots:
    # Create output directory if it doesn't exist
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    
    # Write as JSON with partitioning by snapshot time
    snapshot_df.write \
        .mode("overwrite") \
        .partitionBy("snapshotTime") \
        .json(output_path)
    
    print(f"Snapshots saved to: {output_path}")
    
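    # Also save as CSV for easier viewing (Spark writes a directory with this name
    # that contains a single part file, not a standalone .csv file)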
    csv_path = "spark-jobs/result/leaderboard_snapshots.csv"
    snapshot_df.coalesce(1).write \
        .mode("overwrite") \
        .option("header", "true") \
        .csv(csv_path)
    
    print(f"CSV version saved to: {csv_path}")
else:
    print("No snapshots to save!")


In [None]:
# Step 8: Performance analysis
print("\n=== PERFORMANCE ANALYSIS ===")
print(f"Total user data records: {user_data.count()}")
print(f"Total score records: {scores.count()}")
print(f"Total calculated scores: {total_scores.count()}")
print(f"Total leaderboard entries: {len(snapshots)}")

if snapshots:
    unique_users = len(set(entry.userId for entry in snapshots))
    unique_snapshots = len(set(entry.snapshotTime for entry in snapshots))
    
    print(f"Unique users in leaderboard: {unique_users}")
    print(f"Unique snapshot times: {unique_snapshots}")
    
    # Calculate average users per snapshot
    avg_users_per_snapshot = len(snapshots) / unique_snapshots if unique_snapshots > 0 else 0
    print(f"Average users per snapshot: {avg_users_per_snapshot:.2f}")

print("\n=== ANALYSIS COMPLETED ===")


In [None]:
# Clean up
print("Stopping Spark session...")
spark.stop()
print("Spark session stopped successfully!")
