# Data Collection Pipeline - Production Guide

This notebook walks through the complete data collection pipeline for the Pronósticos Football project. It shows how to run the project's Python scripts to collect football data from FBRef.

## Pipeline Overview

The data collection pipeline consists of 4 main steps:

1. **Team ID Mapping** - Extract team IDs and create mappings
2. **Fixtures Collection** - Collect match fixtures and results
3. **Wages Collection** - Collect player wage information
4. **Match Statistics Collection** - Collect detailed match statistics

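Later steps depend on earlier outputs: fixtures and wages collection rely on the team mapping from Step 1, and match statistics collection relies on the fixtures from Step 2. Together they produce the dataset used for analysis and prediction modeling.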
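
The dependencies between the steps can be made explicit. The sketch below is only an illustration: the step names follow the `Enabled steps` list printed by the configuration summary, and the dependency edges are inferred from the checks in this notebook (Step 3 needs the teams file, Step 4 needs the fixtures file), not taken from the orchestrator script.

```python
from graphlib import TopologicalSorter

# Assumed dependency graph: each step maps to the steps it needs first.
STEP_DEPENDENCIES = {
    "team_mapping": set(),
    "fixtures": {"team_mapping"},
    "wages": {"team_mapping"},
    "match_stats": {"fixtures"},
}


def execution_order(dependencies):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(STEP_DEPENDENCIES)
print(order)
```

Any order returned this way starts with `team_mapping` and places `fixtures` before `match_stats`; `wages` can run at any point after the team mapping.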

## Setup and Configuration

In [None]:
import os
import sys
import subprocess
import json
import pandas as pd
from datetime import datetime
import time

# Add scripts directory to path
# Adjust this path based on notebook location:
# From production/: '../..'
# From production/data_collections/: '../../..'
scripts_path = os.path.join('..', '..', 'scripts')  # Change to '../../../scripts' if moving to data_collections/
if scripts_path not in sys.path:
    sys.path.append(scripts_path)

# Import our utilities
from utils.data_utils import load_teams_from_json, fixtures_data_to_dataframe
from utils.scraping_utils import get_page

print("✅ Setup complete - Ready to run data collection pipeline")
print(f"📁 Working directory: {os.getcwd()}")
print(f"🐍 Python version: {sys.version}")
print(f"📂 Scripts path: {os.path.abspath(scripts_path)}")

## Configuration Parameters

Define the parameters for your data collection run:

In [None]:
# Configuration-Based Approach (Recommended)
# Load configuration from YAML files instead of defining parameters here

import sys
import os

# Add scripts path for configuration utilities
# Adjust this path based on notebook location:
# From production/: '../..'
# From production/data_collections/: '../../..'
scripts_path = os.path.join('..', '..', 'scripts')  # Change to '../../../scripts' if moving to data_collections/
if scripts_path not in sys.path:
    sys.path.append(scripts_path)

from utils.config_utils import load_config

# Choose your configuration
CONFIG_NAME = 'prod'  # Options: 'prod', 'dev', 'testing', or path to custom config

# Load the configuration
config = load_config(CONFIG_NAME)

# Display configuration summary
print("🚀 Using Configuration-Based Data Collection")
config.print_summary()

# Ensure data directories exist
config.ensure_data_directories()
print(f"✅ Data directories ensured for environment: {config.environment}")

# Legacy approach (deprecated but still available)
print("\n" + "="*60)
print("LEGACY APPROACH (for reference):")
print("You can still define parameters manually if needed:")

# Configuration
SEASONS = config.seasons
OUTPUT_FORMATS = config.output_formats
ENVIRONMENT = config.environment
DATA_DIR = config.get_raw_data_path()
# Adjust this path based on notebook location:
SCRIPTS_DIR = '../../scripts/data_collection/'  # Change to '../../../scripts/data_collection/' if moving to data_collections/

print(f"   Seasons: {SEASONS}")
print(f"   Formats: {OUTPUT_FORMATS}")
print(f"   Environment: {ENVIRONMENT}")
print(f"   Data directory: {DATA_DIR}")
print("\n💡 To switch configurations, change CONFIG_NAME above to 'dev' or 'testing'")
print(f"📂 Scripts directory: {os.path.abspath(SCRIPTS_DIR)}")
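
The loader itself lives in `utils.config_utils` and is not shown in this notebook. As a rough sketch of how a dotted setting such as `steps.wages.enabled` could be resolved, the example below uses a plain dictionary in place of the parsed YAML. `get_setting`, this version of `is_step_enabled`, and the sample values are assumptions for illustration, not the project's implementation.

```python
# Hypothetical stand-in for a parsed YAML configuration file.
SAMPLE_CONFIG = {
    "environment": "testing",
    "seasons": ["2023-2024"],
    "output_formats": ["json", "csv"],
    "steps": {
        "team_mapping": {"enabled": True},
        "fixtures": {"enabled": True},
        "wages": {"enabled": False},
        "match_stats": {"enabled": True},
    },
}


def get_setting(config, dotted_key, default=None):
    """Walk a nested dict using a dotted key such as 'steps.wages.enabled'."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


def is_step_enabled(config, step_name):
    """Treat a step as disabled unless the config enables it (assumed default)."""
    return bool(get_setting(config, f"steps.{step_name}.enabled", False))


print(is_step_enabled(SAMPLE_CONFIG, "wages"))     # False
print(is_step_enabled(SAMPLE_CONFIG, "fixtures"))  # True
```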

## Helper Functions

Utility functions to run scripts and track progress:

In [4]:
def run_script(script_name, args=None, capture_output=True):
    """
    Run a data collection script and return the result.
    """
    script_path = os.path.join(SCRIPTS_DIR, script_name)
    cmd = [sys.executable, script_path]
    
    if args:
        cmd.extend(args)
    
    print(f"🚀 Running: {' '.join(cmd)}")
    print(f"⏰ Start time: {datetime.now().strftime('%H:%M:%S')}")
    
    start_time = time.time()
    
    try:
        result = subprocess.run(cmd, capture_output=capture_output, text=True, check=True)
        duration = time.time() - start_time
        
        print(f"✅ Script completed successfully in {duration:.1f}s")
        if capture_output and result.stdout:
            print("📋 Output:")
            print(result.stdout)
        
        return True, result
    
    except subprocess.CalledProcessError as e:
        duration = time.time() - start_time
        print(f"❌ Script failed after {duration:.1f}s with return code {e.returncode}")
        if capture_output and e.stderr:
            print("🚨 Error output:")
            print(e.stderr)
        return False, e
    
    except Exception as e:
        duration = time.time() - start_time
        print(f"❌ Unexpected error after {duration:.1f}s: {e}")
        return False, e


def check_file_exists(filepath, description=""):
    """
    Check if a file exists and show file info.
    """
    if os.path.exists(filepath):
        size = os.path.getsize(filepath)
        modified = datetime.fromtimestamp(os.path.getmtime(filepath))
        print(f"✅ {description}found: {filepath}")
        print(f"   📏 Size: {size:,} bytes")
        print(f"   📅 Modified: {modified.strftime('%Y-%m-%d %H:%M:%S')}")
        return True
    else:
        print(f"❌ {description}not found: {filepath}")
        return False


def show_data_summary(filepath, data_type=""):
    """
    Show a summary of collected data.
    """
    if not os.path.exists(filepath):
        print(f"❌ Cannot show summary - file not found: {filepath}")
        return
    
    try:
        if filepath.endswith('.json'):
            with open(filepath, 'r', encoding='utf-8') as f:
                data = json.load(f)
            
            print(f"📊 {data_type}Summary:")
            print(f"   📁 File: {os.path.basename(filepath)}")
            
            if isinstance(data, dict):
                print(f"   🗂️  Main keys: {len(data)} items")
                
                # Callers pass labels with a trailing space (e.g. "Teams "), so strip before comparing
                if data_type.strip().lower() == "teams":
                    total_seasons = sum(len(team_data.get('seasons', [])) for team_data in data.values())
                    print(f"   🏟️  Teams: {len(data)}")
                    print(f"   📅 Total team-seasons: {total_seasons}")
            
            # List-shaped JSON is handled at the same level as the dict case
            elif isinstance(data, list):
                print(f"   📊 Records: {len(data)}")
                if data and isinstance(data[0], dict):
                    print(f"   🔑 Sample keys: {list(data[0].keys())[:5]}")
        
        elif filepath.endswith('.csv'):
            df = pd.read_csv(filepath)
            print(f"📊 {data_type}Summary:")
            print(f"   📁 File: {os.path.basename(filepath)}")
            print(f"   📊 Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
            print(f"   🔑 Columns: {list(df.columns)[:5]}{'...' if len(df.columns) > 5 else ''}")
    
    except Exception as e:
        print(f"❌ Error reading file {filepath}: {e}")

print("✅ Helper functions loaded")

✅ Helper functions loaded
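
`run_script` returns a `(success, result)` pair rather than raising, so each step can branch on the outcome. A minimal, self-contained sketch of the same pattern is shown below. It runs inline Python commands through `sys.executable` instead of the project's scripts, and `run_command` is a hypothetical name used only here.

```python
import subprocess
import sys


def run_command(cmd):
    """Run a command and return (success, text) instead of raising on failure."""
    try:
        # check=True turns a non-zero exit code into CalledProcessError
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return True, result.stdout.strip()
    except subprocess.CalledProcessError as e:
        return False, e.stderr.strip()


ok, out = run_command([sys.executable, "-c", "print('hello')"])
failed_ok, err = run_command(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
)

print(ok, out)
print(failed_ok, err)
```

Returning a flag keeps the notebook cells linear: a failed step prints its error and the cell decides whether the pipeline can continue.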


---

# Step 1: Team ID Mapping

The first step extracts team IDs from FBRef for all Premier League teams across the specified seasons. This creates the foundation mapping needed for all subsequent data collection.

In [9]:
print("🏗️  STEP 1: TEAM ID MAPPING")
print("=" * 50)

# RECOMMENDED: Configuration-based approach
print("Using configuration-based script...")

# Run with configuration
success, result = run_script('team_id_mapper_config.py', ['--config', CONFIG_NAME])

if success:
    # Check the output file
    teams_file = config.get_raw_data_path('all_teams.json')
    check_file_exists(teams_file, "Teams file ")
    show_data_summary(teams_file, "Teams ")
    
    # Load and preview the data
    try:
        teams_data = load_teams_from_json(teams_file)
        print(f"\n📋 Sample teams:")
        for i, (team_id, team_info) in enumerate(list(teams_data.items())[:3]):
            print(f"   {i+1}. {team_info['team_name']} (ID: {team_id})")
            print(f"      Seasons: {team_info['seasons']}")
    except Exception as e:
        print(f"❌ Error loading teams data: {e}")
else:
    print("❌ Step 1 failed - cannot proceed with pipeline")
    print("💡 Check the error messages above and ensure FBRef is accessible")

print("\n" + "="*50)
print("ALTERNATIVE: Run complete pipeline with single command")
print("python run_all_collectors_config.py --config prod")
print("python run_all_collectors_config.py --config dev")
print("python run_all_collectors_config.py --config testing --dry-run")

🏗️  STEP 1: TEAM ID MAPPING
Using configuration-based script...
🚀 Running: c:\Users\50230\anaconda3\python.exe ../../scripts/data_collection/team_id_mapper_config.py --config prod
⏰ Start time: 19:53:06
✅ Script completed successfully in 26.0s
📋 Output:
Configuration Summary:
   Config file: c:\Users\50230\OneDrive\Escritorio\Proyectos y trabajos\Personales\Pronósticos Football\config\prod.yaml
   Environment: prod
   Seasons: ['2019-2020', '2020-2021', '2021-2022', '2022-2023', '2023-2024', '2024-2025']
   Output formats: ['json', 'csv']
   Teams filter: All teams
   Competitions: ['Premier League']
   Max matches: No limit
   Enhanced scraper: True
   Log level: INFO
   Enabled steps: ['team_mapping', 'fixtures', 'wages', 'match_stats']
Data saved to ../../data\prod\raw\all_teams.json

Successfully extracted 27 teams
Seasons processed: 6
Environment: prod
Output saved to: ../../data\prod\raw\all_teams.json

✅ Teams file found: ../../data\prod\raw\all_teams.json
   📏 Size: 6,546 bytes
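
Judging by how the preview code reads it, `all_teams.json` maps each team ID to a record with a `team_name` and a list of `seasons`. The sketch below uses made-up IDs to show that shape and how team-seasons are counted. With 20 Premier League teams per season, six seasons give 6 × 20 = 120 team-seasons even though only 27 distinct teams appear.

```python
# Made-up IDs and names for illustration; real FBRef team IDs differ.
sample_teams = {
    "id_001": {"team_name": "Team A", "seasons": ["2022-2023", "2023-2024"]},
    "id_002": {"team_name": "Team B", "seasons": ["2023-2024"]},
    "id_003": {"team_name": "Team C", "seasons": ["2022-2023"]},
}


def count_team_seasons(teams):
    """Total number of (team, season) pairs in the mapping."""
    return sum(len(info.get("seasons", [])) for info in teams.values())


def teams_in_season(teams, season):
    """Names of the teams listed for the given season, sorted."""
    return sorted(
        info["team_name"]
        for info in teams.values()
        if season in info.get("seasons", [])
    )


print(count_team_seasons(sample_teams))            # 4
print(teams_in_season(sample_teams, "2023-2024"))  # ['Team A', 'Team B']
```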

---

# Step 2: Fixtures Collection

This step collects match fixtures and results for all teams. The resulting dataset covers every match played by Premier League teams, including domestic cup and European fixtures, not only league games.

In [10]:
print("🏗️  STEP 2: FIXTURES COLLECTION")
print("=" * 50)

# RECOMMENDED: Configuration-based approach
print("Using configuration-based script...")

# Run with configuration
success, result = run_script('fixtures_collector_config.py', ['--config', CONFIG_NAME])

if success:
    # Check output files
    fixtures_file = config.get_raw_data_path('all_competitions_fixtures.json')
    check_file_exists(fixtures_file, "Fixtures file ")
    show_data_summary(fixtures_file, "Fixtures ")
    
    # Check DataFrame files
    for fmt in config.output_formats:
        df_file = config.get_raw_data_path(f'all_competitions_fixtures_dataframe.{fmt}')
        if check_file_exists(df_file, f"Fixtures DataFrame ({fmt}) "):
            if fmt == 'csv':
                show_data_summary(df_file, "Fixtures DataFrame ")
    
    # Preview the data
    try:
        with open(fixtures_file, 'r', encoding='utf-8') as f:
            fixtures_data = json.load(f)
        
        fixtures_df = fixtures_data_to_dataframe(fixtures_data)
        print(f"\n📋 Fixtures DataFrame Preview:")
        print(f"   Shape: {fixtures_df.shape}")
        print(f"   Columns: {list(fixtures_df.columns)[:8]}...")
        print(f"   Sample matches:")
        for i in range(min(3, len(fixtures_df))):
            row = fixtures_df.iloc[i]
            print(f"   {i+1}. {row['team_name']} vs {row.get('opponent', 'N/A')} ({row.get('date', 'N/A')})")
    except Exception as e:
        print(f"❌ Error previewing fixtures data: {e}")

else:
    print("❌ Step 2 failed - check error messages above")

print("\n" + "="*50)
print("LEGACY APPROACH (for reference):")
print("The old approach required manual argument preparation:")
print(f"   Teams filter: {config.get_effective_teams() or 'All teams'}")
print(f"   Seasons filter: {config.get_effective_seasons()}")
print("   All parameters are now handled by the configuration file!")

🏗️  STEP 2: FIXTURES COLLECTION
Using configuration-based script...
🚀 Running: c:\Users\50230\anaconda3\python.exe ../../scripts/data_collection/fixtures_collector_config.py --config prod
⏰ Start time: 19:53:43
✅ Script completed successfully in 1049.4s
📋 Output:
Configuration Summary:
   Config file: c:\Users\50230\OneDrive\Escritorio\Proyectos y trabajos\Personales\Pronósticos Football\config\prod.yaml
   Environment: prod
   Seasons: ['2019-2020', '2020-2021', '2021-2022', '2022-2023', '2023-2024', '2024-2025']
   Output formats: ['json', 'csv']
   Teams filter: All teams
   Competitions: ['Premier League']
   Max matches: No limit
   Enhanced scraper: True
   Log level: INFO
   Enabled steps: ['team_mapping', 'fixtures', 'wages', 'match_stats']
Data saved to ../../data\prod\raw\fixtures_progress_10.json
Data saved to ../../data\prod\raw\fixtures_progress_20.json
Data saved to ../../data\prod\raw\fixtures_progress_30.json
Data saved to ../../data\prod\raw\fixtures_progress_40.json
D
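
The `fixtures_progress_10.json`, `fixtures_progress_20.json`, ... files in the log suggest that the collector writes a checkpoint after every ten items. The collector's code is not part of this notebook, so the following is only a sketch of that idea under assumed names (`collect_with_checkpoints`, `fetch`), writing to a temporary directory.

```python
import json
import os
import tempfile


def collect_with_checkpoints(items, fetch, out_dir, every=10):
    """Fetch each item and dump everything collected so far every `every` items."""
    collected = {}
    written = []
    for count, item in enumerate(items, start=1):
        collected[item] = fetch(item)
        if count % every == 0:
            path = os.path.join(out_dir, f"fixtures_progress_{count}.json")
            with open(path, "w", encoding="utf-8") as f:
                json.dump(collected, f)
            written.append(os.path.basename(path))
    return collected, written


with tempfile.TemporaryDirectory() as tmp:
    items = [f"team_{i}" for i in range(25)]
    # The lambda stands in for a real scraping call.
    data, checkpoints = collect_with_checkpoints(items, lambda t: {"fixtures": []}, tmp)
    print(checkpoints)
```

With 25 items and a checkpoint every ten, two progress files are written; the last five items are only saved by whatever final write the collector performs.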

---

# Step 3: Wages Collection

This step collects player wage information for all teams. Note that wage data might not be available for all teams/seasons on FBRef.

In [11]:
print("🏗️  STEP 3: WAGES COLLECTION")
print("=" * 50)

# Check if wages collection is enabled
if not config.is_step_enabled('wages'):
    print("⚠️  Wages collection is disabled in the current configuration")
    print(f"📋 To enable it, set steps.wages.enabled=true in {CONFIG_NAME}.yaml")
else:
    print("⚠️  Note: Wage collection can take a long time and may have limited data availability")
    print("📊 This step is optional for the overall pipeline")
    
    # RECOMMENDED: Configuration-based approach
    print("Using configuration-based script...")
    
    # Note: wages_collector_config.py not created yet - would need to be implemented
    # For now, show what would happen
    print("⚠️  Configuration-based wages collector not yet implemented")
    print("📋 This would run: wages_collector_config.py --config", CONFIG_NAME)
    
    # Legacy approach for now
    teams_file = config.get_raw_data_path('all_teams.json')
    if check_file_exists(teams_file, "Teams file "):
        wages_args = [
            '--environment', config.environment,
            '--output-formats'] + config.output_formats + [
            '--summary',
            '--log-level', config.log_level
        ]
        
        # Add filters from configuration
        if config.get_effective_teams():
            wages_args.extend(['--teams'] + config.get_effective_teams())
        
        effective_seasons = config.get_effective_seasons()
        if effective_seasons != config.seasons:
            wages_args.extend(['--seasons'] + effective_seasons)
        
        # Run wages collection (legacy script)
        success, result = run_script('wages_collector.py', wages_args)
        
        if success:
            # Check output files
            wages_file = config.get_raw_data_path('premier_league_wages.json')
            check_file_exists(wages_file, "Wages file ")
            show_data_summary(wages_file, "Wages ")
            
            # Check summary file
            summary_file = config.get_raw_data_path('premier_league_wages_summary.json')
            if check_file_exists(summary_file, "Wages summary "):
                try:
                    with open(summary_file, 'r', encoding='utf-8') as f:
                        summary = json.load(f)
                    print(f"\n📊 Wages Collection Summary:")
                    print(f"   Teams processed: {summary.get('total_teams', 'N/A')}")
                    print(f"   Total seasons: {summary.get('total_seasons', 'N/A')}")
                    print(f"   Total players: {summary.get('total_players', 'N/A')}")
                except Exception as e:
                    print(f"❌ Error reading wages summary: {e}")
            
            # Check DataFrame files
            for fmt in config.output_formats:
                df_file = config.get_raw_data_path(f'premier_league_wages_dataframe.{fmt}')
                check_file_exists(df_file, f"Wages DataFrame ({fmt}) ")
        
        else:
            print("⚠️  Step 3 failed - this is common due to limited wage data availability")
            print("📋 The pipeline can continue without wage data")
    else:
        print("❌ Cannot proceed - teams file not found")
        print("💡 Please run Step 1 first")

🏗️  STEP 3: WAGES COLLECTION
⚠️  Note: Wage collection can take a long time and may have limited data availability
📊 This step is optional for the overall pipeline
Using configuration-based script...
⚠️  Configuration-based wages collector not yet implemented
📋 This would run: wages_collector_config.py --config prod
✅ Teams file found: ../../data\prod\raw\all_teams.json
   📏 Size: 6,546 bytes
   📅 Modified: 2025-07-28 19:53:32
🚀 Running: c:\Users\50230\anaconda3\python.exe ../../scripts/data_collection/wages_collector.py --environment prod --output-formats json csv --summary --log-level INFO
⏰ Start time: 20:12:30
✅ Script completed successfully in 600.3s
📋 Output:
Data saved to ../../data/prod/raw\wages_progress_10.json
Data saved to ../../data/prod/raw\wages_progress_20.json
Data saved to ../../data/prod/raw\wages_progress_30.json
Data saved to ../../data/prod/raw\wages_progress_40.json
Data saved to ../../data/prod/raw\wages_progress_50.json
Data saved to ../../data/prod/raw\wages_p
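
Because wage data may be missing for some team-seasons, it can help to compare what was requested with what came back. The sketch below assumes the collected wages can be reduced to a set of `(team, season)` pairs. The names and sample data are illustrative, and the real layout of `premier_league_wages.json` may differ.

```python
def missing_team_seasons(teams, collected_pairs):
    """Return the (team_name, season) pairs that have no wage data."""
    expected = {
        (info["team_name"], season)
        for info in teams.values()
        for season in info.get("seasons", [])
    }
    return sorted(expected - set(collected_pairs))


# Made-up mapping in the same shape as the teams file.
sample_teams = {
    "id_001": {"team_name": "Team A", "seasons": ["2022-2023", "2023-2024"]},
    "id_002": {"team_name": "Team B", "seasons": ["2023-2024"]},
}
# Pairs for which wage data was (hypothetically) collected.
collected = {("Team A", "2022-2023"), ("Team B", "2023-2024")}

gaps = missing_team_seasons(sample_teams, collected)
print(gaps)  # [('Team A', '2023-2024')]
```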

---

# Step 4: Match Statistics Collection

This is the most extensive step: it collects detailed statistics for every match in scope. It uses enhanced scraping with anti-blocking measures such as rate limiting. At roughly 20 seconds per match, the full production run shown below (2,280 matches) is estimated at about 12.7 hours.
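
Resuming an interrupted run amounts to skipping matches that already have saved statistics. How `match_stats_collector.py` does this is not shown here. The sketch below assumes each saved record carries a `match_id` (as the preview code in this notebook suggests) and that the pending work is a list of match IDs.

```python
def remaining_matches(all_match_ids, saved_records):
    """Return match IDs not yet covered by saved records, preserving order."""
    done = {record["match_id"] for record in saved_records}
    return [match_id for match_id in all_match_ids if match_id not in done]


# Hypothetical IDs and records for illustration.
all_ids = ["m1", "m2", "m3", "m4"]
saved = [
    {"match_id": "m1", "stat_name": "Possession"},
    {"match_id": "m1", "stat_name": "Fouls"},
    {"match_id": "m3", "stat_name": "Possession"},
]

todo = remaining_matches(all_ids, saved)
print(todo)  # ['m2', 'm4']
```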

In [None]:
print("🏗️  STEP 4: MATCH STATISTICS COLLECTION")
print("=" * 50)

# Check if step is enabled
if not config.is_step_enabled('match_stats'):
    print("⚠️  Match statistics collection is disabled in the current configuration")
    print(f"📋 To enable it, set steps.match_stats.enabled=true in {CONFIG_NAME}.yaml")
else:
    # Check if we have fixtures data
    fixtures_file = config.get_raw_data_path('all_competitions_fixtures.json')
    if not check_file_exists(fixtures_file, "Fixtures file "):
        print("❌ Cannot proceed - fixtures file not found")
        print("💡 Please run Step 2 first")
    else:
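        # Estimate the scope using configuration parameters.
        # Defaults keep the size check further down safe if the analysis fails.
        unique_matches = 0
        estimated_hours = 0.0
        try: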
            with open(fixtures_file, 'r', encoding='utf-8') as f:
                fixtures_data = json.load(f)
            fixtures_df = fixtures_data_to_dataframe(fixtures_data)
            
            # Apply filters from configuration
            filtered_df = fixtures_df.copy()
            
            # Apply team filter if specified
            if config.get_effective_teams():
                filtered_df = filtered_df[filtered_df['team_name'].isin(config.get_effective_teams())]
            
            # Apply season filter if specified
            effective_seasons = config.get_effective_seasons()
            if effective_seasons != config.seasons:
                filtered_df = filtered_df[filtered_df['season'].isin(effective_seasons)]
            
            # Apply competition filter
            filtered_df = filtered_df[filtered_df['comp'].isin(config.competitions_filter)]
            
            unique_matches = filtered_df['match_report_href'].nunique() if 'match_report_href' in filtered_df.columns else len(filtered_df)
            
            print(f"📊 Scope Analysis (based on configuration):")
            print(f"   Total fixtures: {len(fixtures_df):,}")
            print(f"   After configuration filters: {len(filtered_df):,}")
            print(f"   Unique matches to process: {unique_matches:,}")
            print(f"   Teams filter: {config.get_effective_teams() or 'All teams'}")
            print(f"   Competitions: {config.competitions_filter}")
            
            if config.max_matches:
                print(f"   Limited to: {config.max_matches} matches (from configuration)")
                unique_matches = min(unique_matches, config.max_matches)
            
            # Time estimation
            estimated_hours = unique_matches * 20 / 3600  # ~20 seconds per match
            print(f"   ⏰ Estimated time: {estimated_hours:.1f} hours")
        
        except Exception as e:
            print(f"❌ Error analyzing scope: {e}")
        
        print("\n⚠️  Important Notes:")
        print("   - This step can take several hours for the complete dataset")
        print("   - Uses enhanced scraping with rate limiting")
        print("   - Progress is saved every 20 matches")
        print("   - Can be resumed if interrupted")
        
        # Warn before processing many matches (no interactive confirmation is requested)
        if not config.max_matches and unique_matches > 100:
            print(f"\n🔔 About to process {unique_matches:,} matches (estimated {estimated_hours:.1f} hours)")
            print("💡 Consider using 'testing' config for initial testing (limited to 10 matches)")
        
        # RECOMMENDED: Configuration-based approach
        print("\nUsing configuration-based script...")
        
        # Note: match_stats_collector_config.py not created yet - would need to be implemented
        print("⚠️  Configuration-based match stats collector not yet implemented")
        print("📋 This would run: match_stats_collector_config.py --config", CONFIG_NAME)
        
        # Legacy approach for now
        stats_args = [
            '--environment', config.environment,
            '--output-formats'] + config.output_formats + [
            '--competitions'] + config.competitions_filter + [
            '--enhanced-scraper' if config.enhanced_scraper else '--no-enhanced-scraper',
            '--log-level', config.log_level
        ]
        
        # Add filters from configuration
        if config.get_effective_teams():
            stats_args.extend(['--teams'] + config.get_effective_teams())
        
        effective_seasons = config.get_effective_seasons()
        if effective_seasons != config.seasons:
            stats_args.extend(['--seasons'] + effective_seasons)
        
        if config.max_matches:
            stats_args.extend(['--max-matches', str(config.max_matches)])
        
        # Run match statistics collection
        print(f"\n🚀 Starting match statistics collection...")
        start_time = datetime.now()
        
        success, result = run_script('match_stats_collector.py', stats_args, capture_output=False)
        
        end_time = datetime.now()
        duration = end_time - start_time
        
        print(f"\n⏰ Collection completed in: {duration}")
        
        if success:
            # Check output files
            stats_file = config.get_raw_data_path('match_stats', 'all_match_stats.json')
            check_file_exists(stats_file, "Match stats file ")
            
            # Check DataFrame files
            for fmt in config.output_formats:
                df_file = config.get_raw_data_path('match_stats', f'all_match_stats_dataframe.{fmt}')
                if check_file_exists(df_file, f"Match stats DataFrame ({fmt}) "):
                    if fmt == 'csv':
                        show_data_summary(df_file, "Match Stats DataFrame ")
            
            # Preview the data
            if os.path.exists(stats_file):
                try:
                    with open(stats_file, 'r', encoding='utf-8') as f:
                        stats_data = json.load(f)
                    
                    if stats_data:
                        print(f"\n📋 Match Statistics Preview:")
                        print(f"   Total records: {len(stats_data):,}")
                        
                        # Show unique matches and stats
                        unique_matches = len(set(record['match_id'] for record in stats_data))
                        unique_stats = len(set(record['stat_name'] for record in stats_data))
                        print(f"   Unique matches: {unique_matches:,}")
                        print(f"   Unique statistics: {unique_stats}")
                        
                        # Sample statistics
                        sample_stats = sorted(set(record['stat_name'] for record in stats_data[:100]))
                        print(f"   Sample stats: {sample_stats[:5]}...")
                except Exception as e:
                    print(f"❌ Error previewing match stats: {e}")
        
        else:
            print("❌ Step 4 failed - check error messages above")
            print("💡 Match statistics collection can be resumed using existing progress files")

🏗️  STEP 4: MATCH STATISTICS COLLECTION
✅ Fixtures file found: ../../data\prod\raw\all_competitions_fixtures.json
   📏 Size: 7,334,359 bytes
   📅 Modified: 2025-07-28 20:11:12
📊 Scope Analysis (based on configuration):
   Total fixtures: 5,751
   After configuration filters: 4,560
   Unique matches to process: 2,280
   Teams filter: All teams
   Competitions: ['Premier League']
   ⏰ Estimated time: 12.7 hours

⚠️  Important Notes:
   - This step can take several hours for the complete dataset
   - Uses enhanced scraping with rate limiting
   - Progress is saved every 20 matches
   - Can be resumed if interrupted

🔔 About to process 2,280 matches (estimated 12.7 hours)
💡 Consider using 'testing' config for initial testing (limited to 10 matches)

Using configuration-based script...
⚠️  Configuration-based match stats collector not yet implemented
📋 This would run: match_stats_collector_config.py --config prod

🚀 Starting match statistics collection...
🚀 Running: c:\Users\50230\anaconda3
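
The scope figures above follow from simple arithmetic: the fixtures table holds one row per team per match, so 4,560 Premier League rows collapse to 2,280 unique matches, and at roughly 20 seconds per match that is about 12.7 hours. A small sketch of the same calculation, using plain Python and hypothetical report links in place of the fixtures DataFrame:

```python
SECONDS_PER_MATCH = 20  # rough per-match cost used in the notebook's estimate


def estimate_scope(rows, max_matches=None):
    """Count unique matches by report link and estimate the runtime in hours."""
    unique_matches = len({row["match_report_href"] for row in rows})
    if max_matches:
        unique_matches = min(unique_matches, max_matches)
    return unique_matches, unique_matches * SECONDS_PER_MATCH / 3600


# Each match appears twice: once from each team's point of view.
rows = []
for n in range(2280):
    href = f"/hypothetical/match/{n}"  # placeholder, not a real FBRef path
    rows.append({"team_name": "home", "match_report_href": href})
    rows.append({"team_name": "away", "match_report_href": href})

matches, hours = estimate_scope(rows)
print(f"{len(rows):,} rows -> {matches:,} matches, ~{hours:.1f} hours")
# 4,560 rows -> 2,280 matches, ~12.7 hours
```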

---

# Pipeline Summary and Data Overview

Review the complete dataset collected through the pipeline:

In [5]:
print("📊 PIPELINE SUMMARY")
print("=" * 50)

# Check all major output files using configuration paths
output_files = {
    'Teams Mapping': config.get_raw_data_path('all_teams.json'),
    'Fixtures': config.get_raw_data_path('all_competitions_fixtures.json'),
    'Wages': config.get_raw_data_path('premier_league_wages.json'),
    'Match Statistics': config.get_raw_data_path('match_stats', 'all_match_stats.json')
}

pipeline_success = True
file_sizes = {}

for description, filepath in output_files.items():
    if os.path.exists(filepath):
        size = os.path.getsize(filepath)
        file_sizes[description] = size
        print(f"✅ {description}: {size:,} bytes ({size/1024/1024:.1f} MB)")
    else:
        print(f"❌ {description}: Not found")
        # Only consider it a failure if the step is enabled in configuration
        step_name = {
            'Teams Mapping': 'team_mapping',
            'Fixtures': 'fixtures', 
            'Wages': 'wages',
            'Match Statistics': 'match_stats'
        }.get(description)
        
        if step_name and config.is_step_enabled(step_name) and description != 'Wages':
            pipeline_success = False

print(f"\n📈 Total data collected: {sum(file_sizes.values()):,} bytes ({sum(file_sizes.values())/1024/1024:.1f} MB)")

if pipeline_success:
    print("\n🎉 Pipeline completed successfully!")
    print("📁 All enabled datasets have been collected")
    print("🚀 Ready for data engineering and analysis phase")
else:
    print("\n⚠️  Pipeline completed with some missing components")
    print("💡 Review the error messages above to address any issues")

# Show configuration summary
print(f"\n📋 Configuration used: {CONFIG_NAME}")
print(f"   Environment: {config.environment}")
print(f"   Data directory: {config.get_raw_data_path()}")
print(f"   Available formats: {config.output_formats}")

# Count files by format
for fmt in config.output_formats:
    format_files = []
    data_dir = config.get_raw_data_path()
    for root, dirs, files in os.walk(data_dir):
        format_files.extend([f for f in files if f.endswith(f'.{fmt}')])
    print(f"   {fmt.upper()}: {len(format_files)} files")

📊 PIPELINE SUMMARY
✅ Teams Mapping: 6,546 bytes (0.0 MB)
✅ Fixtures: 7,334,359 bytes (7.0 MB)
✅ Wages: 1,208,728 bytes (1.2 MB)
✅ Match Statistics: 20,331,565 bytes (19.4 MB)

📈 Total data collected: 28,881,198 bytes (27.5 MB)

🎉 Pipeline completed successfully!
📁 All enabled datasets have been collected
🚀 Ready for data engineering and analysis phase

📋 Configuration used: prod
   Environment: prod
   Data directory: ../../data\prod\raw
   Available formats: ['json', 'csv']
   JSON: 7 files
   CSV: 2 files
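
The success rule in the summary cell can be stated compactly: a missing output only counts as a failure when its step is enabled and is not the optional wages step. A sketch of that rule as a pure function follows; the function name and the sample inputs are illustrative.

```python
OPTIONAL_STEPS = {"wages"}


def pipeline_succeeded(found, enabled, optional=OPTIONAL_STEPS):
    """Both arguments map step names to booleans (output found / step enabled)."""
    return all(
        found.get(step, False)
        for step, is_on in enabled.items()
        if is_on and step not in optional
    )


enabled = {"team_mapping": True, "fixtures": True, "wages": True, "match_stats": True}

# Wages output missing: still a success because the step is optional.
print(pipeline_succeeded(
    {"team_mapping": True, "fixtures": True, "wages": False, "match_stats": True},
    enabled,
))  # True

# Fixtures output missing: failure because the step is enabled and required.
print(pipeline_succeeded(
    {"team_mapping": True, "fixtures": False, "wages": True, "match_stats": True},
    enabled,
))  # False
```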


---

# Quick Data Exploration

Let's explore the collected data to understand what we have:

In [7]:
print("🔍 QUICK DATA EXPLORATION")
print("=" * 50)

# Load and explore teams data
teams_file = config.get_raw_data_path('all_teams.json')
if os.path.exists(teams_file):
    try:
        teams_data = load_teams_from_json(teams_file)
        print(f"\n🏟️  TEAMS DATA:")
        print(f"   Total teams: {len(teams_data)}")
        
        # Count by seasons
        season_counts = {}
        for team_data in teams_data.values():
            for season in team_data.get('seasons', []):
                season_counts[season] = season_counts.get(season, 0) + 1
        
        print(f"   Teams by season:")
        for season, count in sorted(season_counts.items()):
            print(f"     {season}: {count} teams")
    except Exception as e:
        print(f"❌ Error exploring teams data: {e}")

# Load and explore fixtures data
fixtures_file = config.get_raw_data_path('all_competitions_fixtures.json')
if os.path.exists(fixtures_file):
    try:
        with open(fixtures_file, 'r', encoding='utf-8') as f:
            fixtures_data = json.load(f)
        
        fixtures_df = fixtures_data_to_dataframe(fixtures_data)
        print(f"\n⚽ FIXTURES DATA:")
        print(f"   Total fixtures: {len(fixtures_df):,}")
        
        # By competition
        if 'comp' in fixtures_df.columns:
            comp_counts = fixtures_df['comp'].value_counts()
            print(f"   By competition:")
            for comp, count in comp_counts.items():
                print(f"     {comp}: {count:,} matches")
        
        # By season
        if 'season' in fixtures_df.columns:
            season_counts = fixtures_df['season'].value_counts().sort_index()
            print(f"   By season:")
            for season, count in season_counts.items():
                print(f"     {season}: {count:,} matches")
                
    except Exception as e:
        print(f"❌ Error exploring fixtures data: {e}")

# Load and explore match stats data
stats_file = config.get_raw_data_path('match_stats', 'all_match_stats.json')
if os.path.exists(stats_file):
    try:
        with open(stats_file, 'r', encoding='utf-8') as f:
            stats_data = json.load(f)
        
        print(f"\n📊 MATCH STATISTICS DATA:")
        print(f"   Total records: {len(stats_data):,}")
        
        if stats_data:
            # Unique matches and statistics
            unique_matches = len(set(record['match_id'] for record in stats_data))
            unique_stats = sorted(set(record['stat_name'] for record in stats_data))
            
            print(f"   Unique matches: {unique_matches:,}")
            print(f"   Available statistics ({len(unique_stats)}):")
            
            # Group stats by category
            basic_stats = [s for s in unique_stats if any(word in s.lower() for word in ['possession', 'passing', 'shots', 'saves'])]
            advanced_stats = [s for s in unique_stats if s not in basic_stats]
            
            print(f"     Basic: {basic_stats}")
            print(f"     Advanced: {advanced_stats[:10]}{'...' if len(advanced_stats) > 10 else ''}")
                
    except Exception as e:
        print(f"❌ Error exploring match stats data: {e}")

print(f"\n✅ Data exploration complete for {config.environment} environment!")
print(f"📁 Data location: {config.get_raw_data_path()}")

# Show configuration filters applied
print(f"\n📋 Configuration filters applied:")
print(f"   Teams: {config.get_effective_teams() or 'All teams'}")
print(f"   Seasons: {config.get_effective_seasons()}")
print(f"   Competitions: {config.competitions_filter}")
print(f"   Max matches: {config.max_matches or 'No limit'}")

🔍 QUICK DATA EXPLORATION

🏟️  TEAMS DATA:
   Total teams: 27
   Teams by season:
     2019-2020: 20 teams
     2020-2021: 20 teams
     2021-2022: 20 teams
     2022-2023: 20 teams
     2023-2024: 20 teams
     2024-2025: 20 teams

⚽ FIXTURES DATA:
   Total fixtures: 5,751
   By competition:
     Premier League: 4,560 matches
     FA Cup: 362 matches
     EFL Cup: 341 matches
     Champions Lg: 235 matches
     Europa Lg: 177 matches
     Conf Lg: 60 matches
     Community Shield: 10 matches
     Super Cup: 4 matches
     FA Community Shield: 2 matches
   By season:
     2019-2020: 957 matches
     2020-2021: 960 matches
     2021-2022: 955 matches
     2022-2023: 948 matches
     2023-2024: 956 matches
     2024-2025: 975 matches

📊 MATCH STATISTICS DATA:
   Total records: 98,164
   Unique matches: 3,236
   Available statistics (16):
     Basic: ['Passing Accuracy', 'Possession', 'Saves', 'Shots on Target']
     Advanced: ['Aerials Won', 'Clearances', 'Corners', 'Crosses', 'Fouls', 'G

---

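# Alternative: Run the Complete Pipeline with a Single Command

You can also run the entire pipeline with the orchestrator script, `run_all_collectors_config.py`. The cell below only prints the available commands and options; it does not execute them.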

In [None]:
print("🚀 CONFIGURATION-BASED PIPELINE (RECOMMENDED)")
print("=" * 50)
print("\nThe pipeline now supports YAML configuration files for better organization:")

print("\n📋 Quick Start:")
print("   # Production data collection")
print("   python run_all_collectors_config.py --config prod")
print("   ")
print("   # Development (faster, limited data)")  
print("   python run_all_collectors_config.py --config dev")
print("   ")
print("   # Quick testing (minimal data)")
print("   python run_all_collectors_config.py --config testing")

print("\n📋 Individual steps:")
print("   python team_id_mapper_config.py --config prod")
print("   python fixtures_collector_config.py --config dev")
print("   python run_all_collectors_config.py --config prod --step fixtures")

print("\n📋 Configuration options:")
print("   --config prod     # Complete production collection")
print("   --config dev      # Development with limited teams/seasons")  
print("   --config testing  # Quick validation (10 matches only)")
print("   --dry-run         # Show what would be done")
print("   --skip-existing   # Skip steps where output exists")
print("   --step STEP       # Run only specific step")

print("\n💡 Configuration files are in config/ directory:")
print("   config/prod.yaml     # Production settings")
print("   config/dev.yaml      # Development settings")
print("   config/testing.yaml  # Quick testing settings")

print("\n🎯 Benefits of configuration approach:")
print("   ✅ All settings in one place")
print("   ✅ Easy environment switching (dev/prod)")
print("   ✅ No long command-line arguments") 
print("   ✅ Reusable configurations")
print("   ✅ Documentation in config files")
print("   ✅ Version control for parameters")

print(f"\n📊 Current configuration: {CONFIG_NAME}")
config.print_summary()
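
To launch the orchestrator from the notebook rather than a terminal, the commands printed above can be assembled and run with `subprocess`. The following is a minimal sketch, not part of the project scripts: `build_pipeline_command` is a hypothetical helper, it assumes `run_all_collectors_config.py` is in the current working directory, and it assumes the flags listed above can be combined freely.

```python
import subprocess
import sys


def build_pipeline_command(config_name, step=None, dry_run=False,
                           skip_existing=False,
                           script="run_all_collectors_config.py"):
    """Assemble the orchestrator command line (hypothetical helper).

    Only the flags listed in the cell above are used:
    --config, --step, --dry-run and --skip-existing.
    """
    cmd = [sys.executable, script, "--config", config_name]
    if step:
        cmd += ["--step", step]
    if dry_run:
        cmd.append("--dry-run")
    if skip_existing:
        cmd.append("--skip-existing")
    return cmd


cmd = build_pipeline_command("dev", step="fixtures", dry_run=True)
print(" ".join(cmd[1:]))

# Uncomment to actually run the pipeline.
# Assumption: the script lives in the current working directory.
# result = subprocess.run(cmd, check=False)
# print(f"Exit code: {result.returncode}")
```

Starting with `--dry-run` on the `dev` or `testing` configuration is a cheap way to confirm the command before a full production run.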

---

# Next Steps

After completing the data collection pipeline, the work continues through three phases: data engineering, analysis, and modeling.

## 🔄 Data Engineering Phase
- **Data Validation**: Verify data quality and completeness
- **Data Transformation**: Clean and standardize the collected data
- **Feature Engineering**: Create derived features for modeling
- **Data Integration**: Combine different data sources
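
As an illustration of the validation step, the sketch below flags match statistics records that lack a required field. It relies only on the `match_id` and `stat_name` keys used in the exploration cell earlier; the `value` key in the sample data and the helper name `find_incomplete_records` are assumptions made for the example.

```python
def find_incomplete_records(records, required_keys=("match_id", "stat_name")):
    """Return (index, missing_keys) for every record that lacks a required
    key or holds an empty value for it (hypothetical helper)."""
    problems = []
    for index, record in enumerate(records):
        missing = [key for key in required_keys
                   if record.get(key) in (None, "")]
        if missing:
            problems.append((index, missing))
    return problems


# Illustrative sample only; real records come from all_match_stats.json.
sample = [
    {"match_id": "m1", "stat_name": "Possession", "value": 55},
    {"match_id": "m1", "stat_name": "", "value": 3},
    {"stat_name": "Fouls", "value": 11},
]

for index, missing in find_incomplete_records(sample):
    print(f"Record {index} is missing: {missing}")
```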

## 📊 Analysis Phase
- **Exploratory Data Analysis**: Understand patterns and relationships
- **Statistical Analysis**: Generate insights from the data
- **Visualization**: Create charts and dashboards

## 🤖 Modeling Phase
- **Feature Selection**: Choose relevant variables for prediction
- **Model Training**: Build and train prediction models
- **Model Evaluation**: Assess model performance
- **Model Deployment**: Deploy models for production use

## 📁 Data Structure Summary

The pipeline creates the following key datasets in your chosen environment:

```
data/
├── dev/                                     # Development environment
│   └── raw/
└── prod/                                    # Production environment
    └── raw/
        ├── all_teams.json                          # Team ID mappings
        ├── all_competitions_fixtures.json          # Match fixtures and results
        ├── all_competitions_fixtures_dataframe.csv # Fixtures as DataFrame
        ├── premier_league_wages.json               # Player wage data
        ├── premier_league_wages_dataframe.csv      # Wages as DataFrame
        └── match_stats/
            ├── all_match_stats.json                # Detailed match statistics
            └── all_match_stats_dataframe.csv       # Stats as DataFrame
```
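
A quick way to check a collection run against this layout is to test for each expected file. The sketch below is an assumption-laden example: it takes `data` relative to the working directory as the root, assumes `dev/raw` mirrors the `prod/raw` layout, and uses a hypothetical helper name, `missing_raw_files`.

```python
from pathlib import Path

# File names taken from the tree above, relative to <root>/<environment>/raw
EXPECTED_RAW_FILES = [
    "all_teams.json",
    "all_competitions_fixtures.json",
    "all_competitions_fixtures_dataframe.csv",
    "premier_league_wages.json",
    "premier_league_wages_dataframe.csv",
    "match_stats/all_match_stats.json",
    "match_stats/all_match_stats_dataframe.csv",
]


def missing_raw_files(environment, data_root="data"):
    """List the expected raw files that are absent (hypothetical helper)."""
    raw_dir = Path(data_root) / environment / "raw"
    return [name for name in EXPECTED_RAW_FILES
            if not (raw_dir / name).is_file()]


for env in ("dev", "prod"):
    missing = missing_raw_files(env)
    status = "complete" if not missing else f"missing {len(missing)} file(s)"
    print(f"{env}: {status}")
    for name in missing:
        print(f"   - {name}")
```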

### Environment Management
- **dev/**: For development, testing, and experimentation
- **prod/**: For production-ready, validated datasets
- To switch between environments, change the configuration loaded above (`CONFIG_NAME`, for example `dev` or `prod`)

Most datasets are stored as JSON alongside a CSV DataFrame export, which gives flexibility in downstream processing.

### Data Collection Scripts Integration
All formal scripts are designed to work with this environment structure:
- Use `--output-file` parameters to specify environment paths
- Scripts automatically create necessary subdirectories
- Progress files are saved in environment-specific locations
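
The project scripts are not shown in this notebook, so the following is only a sketch of the subdirectory behaviour described above: given an output path such as the one passed to `--output-file`, create the parent directories before writing. The helper name `write_json_output` is hypothetical, and the example writes to a temporary directory so it leaves the real `data/` tree untouched.

```python
import json
import tempfile
from pathlib import Path


def write_json_output(records, output_file):
    """Write records as JSON, creating parent directories first
    (hypothetical helper, not the project's implementation)."""
    path = Path(output_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "dev" / "raw" / "match_stats" / "all_match_stats.json"
    written = write_json_output(
        [{"match_id": "m1", "stat_name": "Possession"}], target
    )
    print(f"Wrote {written.relative_to(tmp)}")
```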