# ⚡ Chess Analysis Pipeline - Data Decompression

## Overview
This notebook decompresses Zstandard-compressed (`.zst`) chess game files from the Lichess database. It streams the data in fixed-size chunks so that large files never have to fit in memory, reports progress in real time, and removes the partial output if the run is interrupted.

### 🔧 Technical Specifications
- **Input Format**: `.pgn.zst` (PGN files compressed with Zstandard)
- **Output Format**: `.pgn` (Plain text Portable Game Notation)
- **Compression Ratio**: Roughly 7:1 to 10:1 for this data (see Expected Results below)
- **Memory Usage**: Stream-based processing in 16 MiB chunks

### 📈 Performance Features
- **Streaming Decompression**: Processes data in chunks to minimize memory usage
- **Real-time Monitoring**: Live progress tracking with speed metrics
- **Error Recovery**: Deletes the partial output file when the run is interrupted by the user
- **Size Tracking**: Reports compressed bytes read and decompressed bytes written

---

## 📦 Required Dependencies

This cell imports `zstandard` for decompression and `rich` for the progress display. Both are third-party packages, so install them first if they are missing.

In [None]:
import zstandard as zst
import os
from rich.progress import (
    Progress,
    BarColumn,
    TextColumn,
    DownloadColumn,
    TransferSpeedColumn,
    TimeRemainingColumn,
)

## 🛠️ Utility Functions

### File Size Formatting
The `format_size()` function converts raw byte counts into human-readable format using binary units (KiB, MiB, GiB) for accurate file size representation.

In [None]:
def format_size(byte_count):
    power = 1024
    n = 0
    power_labels = {0: 'B', 1: 'KiB', 2: 'MiB', 3: 'GiB', 4: 'TiB'}
    while byte_count >= power and n < len(power_labels) - 1:
        byte_count /= power
        n += 1
    return f"{byte_count:.2f} {power_labels[n]}"
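
A quick way to see the unit scaling is to call the function on a few values. The sketch below repeats the same logic so that it runs on its own; the list-based `labels` and the sample inputs are illustrative choices, not part of the original cell.

```python
def format_size(byte_count):
    """Same logic as the notebook's format_size, repeated so this cell runs alone."""
    power = 1024
    labels = ['B', 'KiB', 'MiB', 'GiB', 'TiB']
    n = 0
    # Divide by 1024 until the value fits the current unit or the units run out.
    while byte_count >= power and n < len(labels) - 1:
        byte_count /= power
        n += 1
    return f"{byte_count:.2f} {labels[n]}"

print(format_size(512))               # 512.00 B
print(format_size(1536))              # 1.50 KiB
print(format_size(16 * 1024 * 1024))  # 16.00 MiB (the chunk size used below)
```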

## 🗜️ Zstandard Decompression Process

### Why Zstandard?
Zstandard (.zst) is chosen for chess databases because:
- **Superior Compression**: 15-30% better than gzip for text data
- **Fast Decompression**: High-speed streaming decompression
- **Memory Efficient**: Stream-based processing without loading the entire file

### Process Flow
1. **File Validation**: Checks that the compressed file exists
2. **Stream Setup**: Creates a decompression stream that is read in 16 MiB chunks
3. **Progress Tracking**: Real-time monitoring of decompression progress
4. **Output Generation**: Writes decompressed PGN data to output file
5. **Cleanup**: Handles interruptions and errors gracefully
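
The core of steps 2 to 4 is a plain read/write loop. Below is a minimal sketch of that loop, assuming any file-like `reader`; the helper name `copy_in_chunks` and its `on_chunk` callback are hypothetical and exist only for illustration. An in-memory `io.BytesIO` stands in for the decompression stream so the example runs without a `.zst` file.

```python
import io

def copy_in_chunks(reader, writer, chunk_size=16 * 1024 * 1024, on_chunk=None):
    """Copy reader to writer one chunk at a time and return the bytes written."""
    total = 0
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:          # empty bytes means end of stream
            break
        writer.write(chunk)
        total += len(chunk)
        if on_chunk is not None:
            on_chunk(total)    # e.g. update a progress bar with the running total
    return total

# Stand-in data: in the notebook, reader would be the Zstandard stream reader.
source = io.BytesIO(b"1. e4 c6 2. d4 d5 " * 1000)
sink = io.BytesIO()
seen = []
written = copy_in_chunks(source, sink, chunk_size=4096, on_chunk=seen.append)
print(written, "bytes copied in", len(seen), "chunks")
```

Only one chunk is held in memory at a time, which is why peak memory stays near the chunk size regardless of the file size.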

### Expected Results
- **Input**: ~1.0 GB compressed file
- **Output**: ~7-10 GB decompressed PGN file
- **Processing Time**: 2-5 minutes depending on system

> **Note**: The decompression will show both input progress and output size growth in real-time.

In [None]:
compressed_file = "lichess_db_standard_rated_2025-08.pgn.zst"
decompressed_file = "lichess_db_standard_rated_2025-08.pgn"

if not os.path.exists(compressed_file):
    print(f"❌ Error: Input file not found at '{compressed_file}'")
else:
    total_size = os.path.getsize(compressed_file)

    progress = Progress(
        TextColumn("[bold cyan]{task.fields[filename]}", justify="right"),
        BarColumn(bar_width=None),
        "[progress.percentage]{task.percentage:>3.1f}%",
        "•",
        DownloadColumn(binary_units=True),
        "•",
        TransferSpeedColumn(),
        "•",
        TimeRemainingColumn(),
        "•",
        TextColumn("[green]Output Size: {task.fields[out_size]}")
    )

    try:
        with open(compressed_file, 'rb') as f_in:
            with open(decompressed_file, 'wb') as f_out:
                dctx = zst.ZstdDecompressor()
                reader = dctx.stream_reader(f_in)
                
                with progress:
                    task_id = progress.add_task(
                        "Decompressing",
                        filename=os.path.basename(compressed_file),
                        total=total_size,
                        out_size="0 B"
                    )
                    
                    decompressed_bytes_so_far = 0
                    
                    while True:
                        chunk = reader.read(16 * 1024 * 1024)
                        if not chunk:
                            break
                        
                        f_out.write(chunk)
                        decompressed_bytes_so_far += len(chunk)
                        
                        progress.update(
                            task_id,
                            completed=f_in.tell(),
                            out_size=format_size(decompressed_bytes_so_far)
                        )

                    # Mark the bar complete while the live display is still active.
                    progress.update(task_id, completed=total_size)

        print(f"\n✅ Decompression complete! Final size: {format_size(decompressed_bytes_so_far)}")

    except FileNotFoundError:
        print(f"\n❌ Error: The file '{compressed_file}' was not found.")
    except KeyboardInterrupt:
        print("\n⏹️ Decompression interrupted by user. Cleaning up...")
        if os.path.exists(decompressed_file):
            os.remove(decompressed_file)
            print("🧹 Partial file removed.")
    except Exception as e:
        print(f"\nAn unexpected error occurred: {e}")

## 📊 Decompression Results & Next Steps

### Expected Output
After successful decompression, you should have:
- **File Size**: 7-10 GB of uncompressed PGN data
- **Game Count**: Approximately 200,000-300,000 chess games
- **Format**: Standard PGN with complete game metadata

### Data Structure Preview
The decompressed PGN file contains chess games in this format:
```
[Event "Rated Blitz game"]
[Site "https://lichess.org/abc123"]
[Date "2025.08.01"]
[White "Player1"]
[Black "Player2"]
[Result "1-0"]
[WhiteElo "1500"]
[BlackElo "1600"]
[TimeControl "180+2"]
[ECO "B10"]
[Opening "Caro-Kann Defense"]
[Termination "Normal"]

1. e4 c6 2. d4 d5 3. Nc3 dxe4...
```
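
To make the tag structure concrete, here is a minimal sketch that pulls the header tags of one game into a dictionary. It is not the parser used by the next notebook; the names `TAG_RE` and `parse_headers` are hypothetical, and the regular expression assumes simple values without escaped quotes.

```python
import re

# Matches lines such as: [White "Player1"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')

def parse_headers(game_text):
    """Return the PGN header tags of a single game as a dict of strings."""
    headers = {}
    for line in game_text.splitlines():
        match = TAG_RE.match(line)
        if match:
            headers[match.group(1)] = match.group(2)
    return headers

sample = '''[Event "Rated Blitz game"]
[White "Player1"]
[Black "Player2"]
[Result "1-0"]
[WhiteElo "1500"]

1. e4 c6 2. d4 d5 3. Nc3 dxe4
'''
headers = parse_headers(sample)
print(headers["White"], headers["Result"], int(headers["WhiteElo"]))
```

All values arrive as strings, so numeric fields such as `WhiteElo` need an explicit conversion.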

### 🔄 Pipeline Continuation
The next notebook (`03_data_processing.ipynb`) will:
1. **Parse PGN Format**: Convert text data into structured DataFrame
2. **Data Cleaning**: Handle malformed records and missing fields
3. **Spark Integration**: Load data into distributed processing framework
4. **Schema Validation**: Ensure data quality for downstream analysis

### 💾 Storage Considerations
- **Disk Space**: Ensure at least 15 GB free space for processing
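- **Temporary Files**: The decompression cell writes directly to the output file and creates no intermediate files; a partial output is deleted if the run is interrupted
- **Backup Strategy**: Consider keeping the compressed version for recovery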
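
The free-space requirement can be checked before starting. A minimal sketch using the standard library's `shutil.disk_usage` follows; the helper name `has_free_space` is hypothetical, and reading "15 GB" as decimal gigabytes is an assumption.

```python
import shutil

def has_free_space(path, required_bytes):
    """Return True if the filesystem holding path has at least required_bytes free."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes

REQUIRED = 15 * 1000**3  # 15 GB, interpreted here as decimal gigabytes (assumption)
if has_free_space(".", REQUIRED):
    print("Enough free space to decompress.")
else:
    print("Not enough free space; free up disk before decompressing.")
```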

---

## 🎯 Success Metrics
- **Compression Ratio**: ~7:1 to 10:1 (1 GB → 7-10 GB typical)
- **Processing Speed**: 50-200 MB/s decompression rate
- **Data Integrity**: Zero data loss during decompression
- **Error Handling**: Graceful recovery from interruptions
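
The first two metrics are simple arithmetic on the sizes and the elapsed time. The sketch below shows the calculation; the helper name `summarize_run` is hypothetical, the inputs are illustrative round numbers rather than measurements, and the speed is assumed to be measured on decompressed output in decimal megabytes.

```python
def summarize_run(compressed_bytes, decompressed_bytes, seconds):
    """Return (compression ratio, decompressed MB per second) for one run."""
    ratio = decompressed_bytes / compressed_bytes
    mb_per_s = decompressed_bytes / 1_000_000 / seconds
    return ratio, mb_per_s

# Illustrative values only: 1 GB in, 10 GB out, 100 seconds elapsed.
ratio, speed = summarize_run(1_000_000_000, 10_000_000_000, 100)
print(f"{ratio:.1f}:1 at {speed:.0f} MB/s")  # 10.0:1 at 100 MB/s
```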