# 🏆 Chess Game Analysis - Data Acquisition Pipeline

## Project Overview
This notebook implements a robust data acquisition system for downloading and processing large-scale chess game datasets from Lichess. The pipeline is designed to handle multi-gigabyte compressed files with resume capability and progress tracking.

### 📊 Dataset Information
- **Source**: Lichess Database (https://database.lichess.org/)
- **Format**: PGN (Portable Game Notation) compressed with Zstandard
- **Content**: Standard rated chess games
- **Target Size**: 1.0 GB (configurable; the code counts 1 GB as 1024³ bytes)

### 🎯 Key Features
- ✅ **Resume Downloads**: Re-running the download continues from an existing partial file
- ✅ **Progress Tracking**: Real-time progress bars with speed and ETA
- ✅ **Error Handling**: Catches network errors and user interrupts, keeping the partial file for a later resume
- ✅ **Memory Efficient**: Streams data in 8 MiB chunks to minimize RAM usage

---

## 📦 Dependencies and Imports

Setting up the required libraries for HTTP requests and progress visualization.

In [None]:
import requests
import os
from rich.progress import (
    Progress,
    BarColumn,
    TextColumn,
    DownloadColumn,
    TransferSpeedColumn,
    TimeRemainingColumn,
)

## 🚀 Download Function Implementation

### Core Functionality
The `download_partial_file()` function provides several advanced features:

1. **Partial Download Support**: Downloads only the specified amount of data (useful for sampling large datasets)
2. **Resume Capability**: Automatically detects existing partial downloads and continues from where it left off
3. **Progress Visualization**: Uses Rich library to display real-time download progress
4. **Memory Management**: Processes data in 8 MiB chunks (`8 * 1024**2` bytes) to handle large files efficiently
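
The size cap from item 1 and the chunked processing from item 4 can be looked at in isolation. The following is a minimal sketch rather than the notebook's own code: `capped_chunks` is a hypothetical helper, and the tiny byte strings stand in for the 8 MiB chunks a real download would yield.

```python
from typing import Iterable, Iterator


def capped_chunks(chunks: Iterable[bytes], already_have: int, target_bytes: int) -> Iterator[bytes]:
    """Yield chunks, trimming the last one so the running total never exceeds target_bytes."""
    total = already_have
    for chunk in chunks:
        remaining = target_bytes - total
        if remaining <= 0:
            break
        piece = chunk[:remaining]  # a no-op slice unless this chunk crosses the cap
        total += len(piece)
        yield piece


# Three 4-byte chunks, a 10-byte target, nothing on disk yet.
pieces = list(capped_chunks([b"aaaa", b"bbbb", b"cccc"], already_have=0, target_bytes=10))
print(pieces)  # [b'aaaa', b'bbbb', b'cc']
```

Only one chunk is alive at a time, so memory use is bounded by the chunk size regardless of the target size.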

### Technical Details
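- **HTTP Range Requests**: Sends a `Range` header that starts at the size of the existing partial file, so the server skips bytes already on disk
- **Streaming Downloads**: Uses `stream=True` with `iter_content()` so only one chunk is held in memory at a time
- **Error Recovery**: Catches `requests` exceptions and `KeyboardInterrupt`; the partial file stays on disk so the next run can resume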
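
To make the range mechanics concrete, here is a small sketch using only the standard library. Both function names are hypothetical and are not part of the notebook. The sketch relies on two properties of HTTP range requests: byte ranges are inclusive on both ends, and a server that honors the range answers with status 206, while one that ignores it answers 200 and sends the file from byte 0.

```python
from typing import Optional


def build_range_header(start: int, end_exclusive: Optional[int] = None) -> dict:
    """Return a Range header dict covering bytes [start, end_exclusive)."""
    if end_exclusive is None:
        return {"Range": f"bytes={start}-"}
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return {"Range": f"bytes={start}-{end_exclusive - 1}"}


def resume_is_safe(status_code: int, start: int) -> bool:
    """True if the response body can be written at offset `start` without corrupting the file."""
    if start == 0:
        return status_code in (200, 206)  # a fresh download accepts either
    return status_code == 206  # appending requires the server to honor the range


print(build_range_header(100, 200))  # {'Range': 'bytes=100-199'}
print(resume_is_safe(200, 100))      # False
```

Checking the status code before appending matters because a 200 response appended to a partial file would silently produce a corrupt archive.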

In [None]:
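def download_partial_file(url, filename, target_gb):
    """Download the first `target_gb` GB (1024**3 bytes each) of `url`, resuming any partial file."""
    target_bytes = int(target_gb * 1024**3)

    progress = Progress(
        TextColumn("[bold blue]{task.fields[filename]}", justify="right"),
        BarColumn(bar_width=None, complete_style="bold green"),
        "[progress.percentage]{task.percentage:>3.1f}%",
        "•",
        DownloadColumn(binary_units=True),
        "•",
        TransferSpeedColumn(),
        "•",
        TimeRemainingColumn(),
    )

    file_mode = 'wb'
    initial_pos = 0

    # An existing file is treated as a partial download and extended in append mode.
    if os.path.exists(filename):
        initial_pos = os.path.getsize(filename)
        if initial_pos >= target_bytes:
            print(f"✅ File '{filename}' already exists and meets target size ({target_gb} GB).")
            return
        else:
            file_mode = 'ab'

    # Request only the missing bytes; HTTP byte ranges are inclusive on both ends.
    headers = {'Range': f'bytes={initial_pos}-{target_bytes - 1}'}

    try:
        with progress:
            # Messages go through the progress console so they render above the bar.
            if initial_pos > 0:
                progress.console.print(f"Resuming download for '{filename}'.")
            else:
                progress.console.print(f"Starting new download for '{filename}'.")

            task_id = progress.add_task("download", filename=filename, total=target_bytes, start=True)
            progress.update(task_id, completed=initial_pos)

            # The timeout (in seconds) keeps a stalled connection from hanging indefinitely.
            with requests.get(url, headers=headers, stream=True, timeout=30) as r:
                r.raise_for_status()

                # 206 means the server honored the Range header. A plain 200 carries the
                # file from byte 0, so appending it to a partial file would corrupt it.
                if initial_pos > 0 and r.status_code != 206:
                    progress.console.print("Server ignored the Range header; restarting from byte 0.")
                    file_mode = 'wb'
                    initial_pos = 0
                    progress.update(task_id, completed=0)

                with open(filename, file_mode) as f:
                    bytes_downloaded = initial_pos
                    for chunk in r.iter_content(chunk_size=8 * 1024**2):
                        if not chunk:
                            break

                        # Trim the final chunk so the file never exceeds the target size.
                        if bytes_downloaded + len(chunk) > target_bytes:
                            chunk = chunk[:target_bytes - bytes_downloaded]

                        f.write(chunk)
                        progress.update(task_id, advance=len(chunk))
                        bytes_downloaded += len(chunk)

                        if bytes_downloaded >= target_bytes:
                            break

            # Report success only if the target was actually reached.
            size_gb = os.path.getsize(filename) / 1024**3
            if bytes_downloaded >= target_bytes:
                progress.console.print(f"\n✅ Download complete! '{filename}' is now {size_gb:.2f} GB.")
            else:
                progress.console.print(f"\nStream ended early: '{filename}' is {size_gb:.2f} GB, below the target.")

    except requests.exceptions.RequestException as e:
        print(f"\nAn error occurred: {e}")
    except KeyboardInterrupt:
        print("\nDownload interrupted by user. Run script again to resume.")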

## 📥 Data Download Execution

### Configuration Parameters
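- **Dataset**: August 2025 Lichess standard rated games
- **Compression**: Zstandard (`.zst`), the format Lichess publishes its database files in
- **Sample Size**: The first 1.0 GB of the compressed file. The number of games this contains depends on the compression ratio, and because the file is cut mid-stream it has to be read with a streaming decompressor, with the final game possibly incomplete.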
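
As a quick sanity check on these parameters, the sketch below derives the local file name and the byte budget from the configuration. It is illustrative only: the lowercase variable names are chosen here to avoid clashing with the notebook's constants, and `urlparse` is used as an alternative to splitting the URL on `/`.

```python
import posixpath
from urllib.parse import urlparse

url = "https://database.lichess.org/standard/lichess_db_standard_rated_2025-08.pgn.zst"
target_size_gb = 1.0

# The file name is the last component of the URL path.
filename = posixpath.basename(urlparse(url).path)

# The download function counts 1 GB as 1024**3 bytes and reads 8 MiB chunks.
target_bytes = int(target_size_gb * 1024**3)
chunk_size = 8 * 1024**2
full_chunks, remainder = divmod(target_bytes, chunk_size)

print(filename)                  # lichess_db_standard_rated_2025-08.pgn.zst
print(target_bytes)              # 1073741824
print(full_chunks, remainder)    # 128 0
```

With the default settings the target is an exact multiple of the chunk size, so the final chunk only needs trimming when a resume starts at an offset that is not chunk-aligned.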

### Why This Dataset?
- **Recent Data**: August 2025 reflects current chess trends and player behavior
- **High Quality**: Rated games only, where results affect player ratings
- **Large Scale**: Sufficient volume for meaningful statistical analysis
- **Standardized Format**: PGN is the standard interchange format and is supported by most chess analysis tools

> **Note**: The download will automatically resume if interrupted. Run the cell again to continue from where it left off.
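
The resume decision described in the note depends only on the size of the file already on disk. The sketch below isolates that decision; `plan_download` and the file name `sample.pgn.zst` are hypothetical and used purely for illustration.

```python
import os
import tempfile


def plan_download(path, target_bytes):
    """Return (start_offset, file_mode), or None when the file already meets the target."""
    existing = os.path.getsize(path) if os.path.exists(path) else 0
    if existing >= target_bytes:
        return None
    return existing, ("ab" if existing > 0 else "wb")


with tempfile.TemporaryDirectory() as tmp:
    partial = os.path.join(tmp, "sample.pgn.zst")  # hypothetical file name
    print(plan_download(partial, 100))   # (0, 'wb')   no file yet
    with open(partial, "wb") as f:
        f.write(b"x" * 40)
    print(plan_download(partial, 100))   # (40, 'ab')  resume at byte 40
    print(plan_download(partial, 40))    # None        target already met
```

Because the offset comes from the file size alone, this approach assumes the bytes already on disk are an intact prefix of the remote file.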

In [None]:
URL = "https://database.lichess.org/standard/lichess_db_standard_rated_2025-08.pgn.zst"
FILENAME = URL.split('/')[-1]
TARGET_SIZE_GB = 1.0

download_partial_file(URL, FILENAME, TARGET_SIZE_GB)