# Data Exploration - Production Data

This notebook explores the JSON files produced by the production data collection pipeline. It covers four datasets: teams, fixtures, wages, and match statistics.

## Data Sources

The following JSON files from `data/prod/raw/` will be analyzed:

1. **all_teams.json** - Team ID mappings and metadata
2. **all_competitions_fixtures.json** - Match fixtures and results
3. **all_competitions_fixtures_dataframe.json** - Fixtures in DataFrame format
4. **premier_league_wages.json** - Player wage information
5. **premier_league_wages_dataframe.json** - Wages in DataFrame format
6. **premier_league_wages_summary.json** - Wage statistics summary
7. **match_stats/all_match_stats.json** - Detailed match statistics

## Objectives

- Load and understand the structure of each dataset
- Identify data quality issues and missing values
- Explore relationships between different data sources
- Generate summary statistics and visualizations
- Prepare insights for data engineering and modeling phases

## Setup and Imports

In [1]:
import os
import sys
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set up paths
DATA_DIR = '../../../data/prod/raw/'
SCRIPTS_DIR = '../../../scripts/'

# Add scripts to path for utilities
if SCRIPTS_DIR not in sys.path:
    sys.path.append(SCRIPTS_DIR)

from utils.data_utils import load_teams_from_json, fixtures_data_to_dataframe

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("✅ Setup complete")
print(f"📁 Data directory: {os.path.abspath(DATA_DIR)}")
print(f"📂 Available files: {os.listdir(DATA_DIR)}")

✅ Setup complete
📁 Data directory: c:\Users\50230\OneDrive\Escritorio\Proyectos y trabajos\Personales\Pronósticos Football\data\prod\raw
📂 Available files: ['all_competitions_fixtures.json', 'all_competitions_fixtures_dataframe.csv', 'all_competitions_fixtures_dataframe.json', 'all_teams.json', 'match_stats', 'premier_league_wages.json', 'premier_league_wages_dataframe.csv', 'premier_league_wages_dataframe.json', 'premier_league_wages_summary.json']


## 1. Team Data Analysis

In [2]:
# Load teams data
teams_file = os.path.join(DATA_DIR, 'all_teams.json')

print("📊 TEAMS DATA ANALYSIS")
print("=" * 50)

if os.path.exists(teams_file):
    teams_data = load_teams_from_json(teams_file)
    
    print(f"📈 Total teams: {len(teams_data)}")
    
    # Analyze seasons coverage
    all_seasons = set()
    team_season_count = {}
    
    for team_id, team_info in teams_data.items():
        team_name = team_info['team_name']
        seasons = team_info['seasons']
        
        all_seasons.update(seasons)
        team_season_count[team_name] = len(seasons)
    
    print(f"📅 Seasons covered: {sorted(all_seasons)}")
    print(f"📊 Total team-season combinations: {sum(team_season_count.values())}")
    
    # Teams by number of seasons
    season_distribution = pd.Series(team_season_count.values()).value_counts().sort_index()
    print("\n🏟️ Teams by number of seasons played:")
    for seasons, count in season_distribution.items():
        print(f"   {seasons} seasons: {count} teams")
    
    # Show teams with different season counts
    print("\n📋 Team participation:")
    for team_name, season_count in sorted(team_season_count.items(), key=lambda x: x[1], reverse=True):
        print(f"   {team_name}: {season_count} seasons")
    
    # Store for later use
    teams_df = pd.DataFrame([(team_id, team_info['team_name'], len(team_info['seasons']), team_info['seasons']) 
                            for team_id, team_info in teams_data.items()],
                           columns=['team_id', 'team_name', 'seasons_count', 'seasons_list'])
    
    print(f"\n✅ Teams data loaded: {len(teams_df)} teams")
    display(teams_df.head())
else:
    print("❌ Teams file not found")
    teams_data = None
    teams_df = None

📊 TEAMS DATA ANALYSIS
📈 Total teams: 27
📅 Seasons covered: ['2019-2020', '2020-2021', '2021-2022', '2022-2023', '2023-2024', '2024-2025']
📊 Total team-season combinations: 120

🏟️ Teams by number of seasons played:
   1 seasons: 3 teams
   2 seasons: 2 teams
   3 seasons: 3 teams
   4 seasons: 4 teams
   5 seasons: 2 teams
   6 seasons: 13 teams

📋 Team participation:
   Arsenal: 6 seasons
   Aston Villa: 6 seasons
   Brighton: 6 seasons
   Chelsea: 6 seasons
   Crystal Palace: 6 seasons
   Everton: 6 seasons
   Liverpool: 6 seasons
   Manchester City: 6 seasons
   Manchester Utd: 6 seasons
   Newcastle Utd: 6 seasons
   Tottenham: 6 seasons
   West Ham: 6 seasons
   Wolves: 6 seasons
   Leicester City: 5 seasons
   Southampton: 5 seasons
   Bournemouth: 4 seasons
   Burnley: 4 seasons
   Fulham: 4 seasons
   Brentford: 4 seasons
   Sheffield Utd: 3 seasons
   Leeds United: 3 seasons
   Nott'ham Forest: 3 seasons
   Norwich City: 2 seasons
   Watford: 2 seasons
   West Brom: 1 seasons


|   | team_id | team_name | seasons_count | seasons_list |
|---|---|---|---|---|
| 0 | 18bb7c10 | Arsenal | 6 | [2019-2020, 2020-2021, 2021-2022, 2022-2023, 2... |
| 1 | 8602292d | Aston Villa | 6 | [2019-2020, 2020-2021, 2021-2022, 2022-2023, 2... |
| 2 | 4ba7cbea | Bournemouth | 4 | [2019-2020, 2022-2023, 2023-2024, 2024-2025] |
| 3 | d07537b9 | Brighton | 6 | [2019-2020, 2020-2021, 2021-2022, 2022-2023, 2... |
| 4 | 943e8050 | Burnley | 4 | [2019-2020, 2020-2021, 2021-2022, 2023-2024] |


## 2. Fixtures Data Analysis

In [11]:
# Load fixtures data
fixtures_file = os.path.join(DATA_DIR, 'all_competitions_fixtures.json')
fixtures_df_file = os.path.join(DATA_DIR, 'all_competitions_fixtures_dataframe.json')

print("⚽ FIXTURES DATA ANALYSIS")
print("=" * 50)

if os.path.exists(fixtures_file):
    # Load raw fixtures data
    with open(fixtures_file, 'r', encoding='utf-8') as f:
        fixtures_raw = json.load(f)
    
    print(f"📈 Teams with fixtures data: {len(fixtures_raw)}")
    
    # Convert to DataFrame
    fixtures_df = fixtures_data_to_dataframe(fixtures_raw)
    
    print(f"📊 Total fixture records: {len(fixtures_df):,}")
    print(f"📅 Date range: {fixtures_df['date'].min()} to {fixtures_df['date'].max()}")
    
    # Analyze by competition
    if 'comp' in fixtures_df.columns:
        comp_counts = fixtures_df['comp'].value_counts()
        print("\n🏆 Matches by competition:")
        for comp, count in comp_counts.items():
            print(f"   {comp}: {count:,} matches")
    
    # Analyze by season
    if 'season' in fixtures_df.columns:
        season_counts = fixtures_df['season'].value_counts().sort_index()
        print("\n📅 Matches by season:")
        for season, count in season_counts.items():
            print(f"   {season}: {count:,} matches")
    
    # Check for missing values
    print("\n🔍 Missing values analysis:")
    missing_analysis = fixtures_df.isnull().sum()
    missing_pct = (missing_analysis / len(fixtures_df) * 100).round(2)
    missing_df = pd.DataFrame({
        'Missing Count': missing_analysis,
        'Missing %': missing_pct
    }).sort_values('Missing %', ascending=False)
    
    print(missing_df[missing_df['Missing %'] > 0].head(10))
    
    # Sample of data
    print("\n📋 Sample fixtures data:")
    display(fixtures_df[['date', 'team_name', 'opponent', 'venue', 'result', 'comp']].head())
    
    print(f"\n✅ Fixtures data loaded: {len(fixtures_df):,} records")
else:
    print("❌ Fixtures file not found")
    fixtures_df = None

⚽ FIXTURES DATA ANALYSIS
📈 Teams with fixtures data: 27
📊 Total fixture records: 5,751
📅 Date range: 2019-07-25 to 2025-05-28

🏆 Matches by competition:
   Premier League: 4,560 matches
   FA Cup: 362 matches
   EFL Cup: 341 matches
   Champions Lg: 235 matches
   Europa Lg: 177 matches
   Conf Lg: 60 matches
   Community Shield: 10 matches
   Super Cup: 4 matches
   FA Community Shield: 2 matches

📅 Matches by season:
   2019-2020: 957 matches
   2020-2021: 960 matches
   2021-2022: 955 matches
   2022-2023: 948 matches
   2023-2024: 956 matches
   2024-2025: 975 matches

🔍 Missing values analysis:
               Missing Count  Missing %
notes                   5400      93.90
attendance              1100      19.13
xg_against               737      12.82
xg_for                   737      12.82
possession                40       0.70
referee                    6       0.10
opp_formation              2       0.03
captain_href               2       0.03
captain                    2     

|   | date | team_name | opponent | venue | result | comp |
|---|---|---|---|---|---|---|
| 0 | 2019-08-11 | Arsenal | Newcastle Utd | Away | W | Premier League |
| 1 | 2019-08-17 | Arsenal | Burnley | Home | W | Premier League |
| 2 | 2019-08-24 | Arsenal | Liverpool | Away | L | Premier League |
| 3 | 2019-09-01 | Arsenal | Tottenham | Home | D | Premier League |
| 4 | 2019-09-15 | Arsenal | Watford | Away | D | Premier League |



✅ Fixtures data loaded: 5,751 records
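
The Premier League count above (4,560) equals 6 seasons × 380 matches × 2, which is consistent with each league match being stored once per participating team. If that reading is right, match-level analysis needs de-duplication first. The following is a minimal sketch under that assumption, using toy rows that mirror the columns in the sample; `add_match_key` and `match_key` are hypothetical names, not part of the project's utilities.

```python
import pandas as pd

# Toy rows mirroring the sample columns above (values are illustrative only).
toy_fixtures = pd.DataFrame({
    "date": ["2019-08-11", "2019-08-11", "2019-08-17"],
    "team_name": ["Arsenal", "Newcastle Utd", "Arsenal"],
    "opponent": ["Newcastle Utd", "Arsenal", "Burnley"],
    "venue": ["Away", "Home", "Home"],
    "comp": ["Premier League", "Premier League", "Premier League"],
})


def add_match_key(df: pd.DataFrame) -> pd.DataFrame:
    """Add an order-independent key so both perspectives of a match share it."""
    out = df.copy()
    # Sort the two team names so (A, B) and (B, A) produce the same pair string.
    pairs = [" vs ".join(sorted(p)) for p in zip(out["team_name"], out["opponent"])]
    out["match_key"] = out["date"] + " | " + pd.Series(pairs, index=out.index)
    return out


keyed = add_match_key(toy_fixtures)
unique_matches = keyed.drop_duplicates("match_key")
print(len(keyed), "team-level rows ->", len(unique_matches), "unique matches")
```

Matches against opponents outside the tracked teams would appear only once, so the row count after de-duplication is not simply half the original.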


## 3. Wages Data Analysis

In [13]:
# Load wages data
wages_file = os.path.join(DATA_DIR, 'premier_league_wages.json')
wages_df_file = os.path.join(DATA_DIR, 'premier_league_wages_dataframe.json')
wages_summary_file = os.path.join(DATA_DIR, 'premier_league_wages_summary.json')

print("💰 WAGES DATA ANALYSIS")
print("=" * 50)

# Load wages summary first
if os.path.exists(wages_summary_file):
    with open(wages_summary_file, 'r', encoding='utf-8') as f:
        wages_summary = json.load(f)
    
    print("📊 Wages Collection Summary:")
    print(f"   Total teams: {wages_summary.get('total_teams', 'N/A')}")
    print(f"   Total seasons: {wages_summary.get('total_seasons', 'N/A')}")
    print(f"   Total players: {wages_summary.get('total_players', 'N/A')}")
    
    if 'teams_by_season' in wages_summary:
        print("\n📅 Teams with wage data by season:")
        for season, count in sorted(wages_summary['teams_by_season'].items()):
            print(f"   {season}: {count} teams")
    
    if 'tables_coverage' in wages_summary:
        print("\n📋 Table coverage:")
        for table, count in wages_summary['tables_coverage'].items():
            print(f"   {table}: {count} team-seasons")

# Load detailed wages data
if os.path.exists(wages_df_file):
    wages_df = pd.read_json(wages_df_file)
    
    print(f"\n📈 Total wage records: {len(wages_df):,}")
    
    if len(wages_df) > 0:
        print(f"📊 Columns: {list(wages_df.columns)}")
        
        # Analyze data availability
        if 'season' in wages_df.columns:
            season_wage_counts = wages_df['season'].value_counts().sort_index()
            print("\n📅 Player records by season:")
            for season, count in season_wage_counts.items():
                print(f"   {season}: {count:,} players")
        
        if 'team_name' in wages_df.columns:
            team_wage_counts = wages_df['team_name'].value_counts()
            print("\n🏟️ Top teams by wage records:")
            for team, count in team_wage_counts.head(10).items():
                print(f"   {team}: {count:,} records")
        
        # Check for missing values
        print("\n🔍 Missing values in wages data:")
        wages_missing = wages_df.isnull().sum()
        wages_missing_pct = (wages_missing / len(wages_df) * 100).round(2)
        wages_missing_df = pd.DataFrame({
            'Missing Count': wages_missing,
            'Missing %': wages_missing_pct
        }).sort_values('Missing %', ascending=False)
        
        print(wages_missing_df[wages_missing_df['Missing %'] > 0].head(10))
        
        # Sample of data
        print("\n📋 Sample wages data:")
        display(wages_df.head())
        
        print(f"\n✅ Wages data loaded: {len(wages_df):,} records")
    else:
        print("⚠️ Wages DataFrame is empty")
        wages_df = None
else:
    print("❌ Wages DataFrame file not found")
    wages_df = None

💰 WAGES DATA ANALYSIS
📊 Wages Collection Summary:
   Total teams: 27
   Total seasons: 120
   Total players: 3341

📅 Teams with wage data by season:
   2019-2020: 20 teams
   2020-2021: 20 teams
   2021-2022: 20 teams
   2022-2023: 20 teams
   2023-2024: 20 teams
   2024-2025: 20 teams

📋 Table coverage:
   wages: 120 team-seasons
   div_wages: 0 team-seasons

📈 Total wage records: 3,341
📊 Columns: ['team_name', 'season', 'player_name', 'age', 'annual_wages', 'weekly_wages', 'team_id', 'tables_found', 'table_source', 'nationality', 'position', 'notes']

📅 Player records by season:
   2019-2020: 530 players
   2020-2021: 568 players
   2021-2022: 531 players
   2022-2023: 573 players
   2023-2024: 560 players
   2024-2025: 579 players

🏟️ Top teams by wage records:
   Chelsea: 180 records
   Newcastle Utd: 178 records
   Manchester Utd: 177 records
   Brighton: 175 records
   Wolves: 173 records
   Tottenham: 166 records
   Everton: 163 records
   Aston Villa: 162 records
   Liverpool: 

|   | team_name | season | player_name | age | annual_wages | weekly_wages | team_id | tables_found | table_source | nationality | position | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arsenal | 2019-2020 | Mesut Özil | 30 | £ 18,200,000 (€ 21,704,678, $22,117,026) | £ 350,000 (€ 417,398, $425,327) | 18bb7c10 | wages | wages | de GER | MF |  |
| 1 | Arsenal | 2019-2020 | Pierre-Emerick Aubameyang | 30 | £ 13,000,000 (€ 15,503,341, $15,797,876) | £ 250,000 (€ 298,141, $303,805) | 18bb7c10 | wages | wages | ga GAB | FW |  |
| 2 | Arsenal | 2019-2020 | Alexandre Lacazette | 28 | £ 9,470,000 (€ 11,293,588, $11,508,145) | £ 182,115 (€ 217,184, $221,310) | 18bb7c10 | wages | wages | fr FRA | FW |  |
| 3 | Arsenal | 2019-2020 | Héctor Bellerín | 24 | £ 5,720,000 (€ 6,821,470, $6,951,065) | £ 110,000 (€ 131,182, $133,674) | 18bb7c10 | wages | wages | es ESP | DF,MF |  |
| 4 | Arsenal | 2019-2020 | David Luiz | 32 | £ 5,250,000 (€ 6,260,965, $6,379,911) | £ 100,962 (€ 120,403, $122,691) | 18bb7c10 | wages | wages | br BRA | DF |  |



✅ Wages data loaded: 3,341 records
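
The `annual_wages` and `weekly_wages` values in the sample are formatted strings that carry three currencies, so they cannot be aggregated directly. The sketch below extracts the pound amount as an integer. It assumes every populated value follows the `£ 1,234 (€ ..., $...)` pattern seen in the sample; `parse_gbp` is a hypothetical helper, not an existing project function.

```python
import re
from typing import Optional

# Capture the digits and thousands separators that follow the pound sign.
_GBP_PATTERN = re.compile(r"£\s*([\d,]+)")


def parse_gbp(wage_text: Optional[str]) -> Optional[int]:
    """Return the GBP amount from a formatted wage string, or None if absent."""
    if not isinstance(wage_text, str):
        return None
    match = _GBP_PATTERN.search(wage_text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


print(parse_gbp("£ 18,200,000 (€ 21,704,678, $22,117,026)"))  # 18200000
print(parse_gbp("£ 350,000 (€ 417,398, $425,327)"))            # 350000
print(parse_gbp(None))                                         # None
```

Applied to the loaded data, this could look like `wages_df['annual_wages_gbp'] = wages_df['annual_wages'].map(parse_gbp)`, where the new column name is only a suggestion.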


## 4. Match Statistics Data Analysis

In [14]:
# Load match statistics data
match_stats_file = os.path.join(DATA_DIR, 'match_stats', 'all_match_stats.json')

print("📊 MATCH STATISTICS DATA ANALYSIS")
print("=" * 50)

if os.path.exists(match_stats_file):
    with open(match_stats_file, 'r', encoding='utf-8') as f:
        match_stats = json.load(f)
    
    print(f"📈 Total match stat records: {len(match_stats):,}")
    
    if len(match_stats) > 0:
        # Convert to DataFrame for analysis
        match_stats_df = pd.DataFrame(match_stats)
        
        print(f"📊 Columns: {list(match_stats_df.columns)}")
        
        # Unique matches and statistics
        unique_matches = match_stats_df['match_id'].nunique()
        unique_teams = match_stats_df['team_name'].nunique()
        unique_stats = match_stats_df['stat_name'].nunique()
        
        print(f"🏟️ Unique matches: {unique_matches:,}")
        print(f"⚽ Unique teams: {unique_teams}")
        print(f"📈 Unique statistics: {unique_stats}")
        
        # Most common statistics
        stat_counts = match_stats_df['stat_name'].value_counts()
        print("\n📊 Most common statistics:")
        for stat, count in stat_counts.head(15).items():
            print(f"   {stat}: {count:,} records")
        
        # Teams with most match stats
        team_stats_counts = match_stats_df['team_name'].value_counts()
        print("\n🏟️ Teams with most stat records:")
        for team, count in team_stats_counts.head(10).items():
            print(f"   {team}: {count:,} records")
        
        # Data quality check
        print("\n🔍 Data quality analysis:")
        print(f"   Records with missing stat_value: {match_stats_df['stat_value'].isnull().sum():,}")
        print(f"   Records with empty team_name: {(match_stats_df['team_name'] == '').sum():,}")
        print(f"   Records with empty match_id: {(match_stats_df['match_id'] == '').sum():,}")
        
        # Sample of different stat types
        print("\n📋 Sample match statistics:")
        sample_stats = match_stats_df.groupby('stat_name').first().reset_index()[['stat_name', 'team_name', 'stat_value']].head(10)
        display(sample_stats)
        
        print(f"\n✅ Match statistics data loaded: {len(match_stats_df):,} records")
    else:
        print("⚠️ Match statistics data is empty")
        match_stats_df = None
else:
    print("❌ Match statistics file not found")
    match_stats_df = None

📊 MATCH STATISTICS DATA ANALYSIS
📈 Total match stat records: 98,164
📊 Columns: ['match_id', 'team_name', 'stat_name', 'stat_value']
🏟️ Unique matches: 3,236
⚽ Unique teams: 243
📈 Unique statistics: 16

📊 Most common statistics:
   Possession: 6,472 records
   Shots on Target: 6,472 records
   Saves: 6,472 records
   Fouls: 6,472 records
   Corners: 6,472 records
   Crosses: 6,472 records
   Offsides: 6,458 records
   Interceptions: 6,422 records
   Touches: 5,818 records
   Goal Kicks: 5,818 records
   Throw Ins: 5,818 records
   Long Balls: 5,818 records
   Clearances: 5,814 records
   Passing Accuracy: 5,790 records
   Tackles: 5,788 records

🏟️ Teams with most stat records:
   Utd: 5,813 records
   City: 5,635 records
   Manchester: 5,275 records
   Liverpool: 5,085 records
   Chelsea: 5,008 records
   Arsenal: 4,887 records
   Tottenham: 4,759 records
   Wolves: 4,135 records
   Brighton: 4,099 records
   Everton: 4,007 records

🔍 Data quality analysis:
   Records with missing stat

|   | stat_name | team_name | stat_value |
|---|---|---|---|
| 0 | Aerials Won | Newcastle | 21 |
| 1 | Clearances | Newcastle | 17 |
| 2 | Corners | Newcastle | 5 |
| 3 | Crosses | Newcastle | 15 |
| 4 | Fouls | Newcastle | 12 |
| 5 | Goal Kicks | Newcastle | 8 |
| 6 | Interceptions | Newcastle | 20 |
| 7 | Long Balls | Newcastle | 68 |
| 8 | Offsides | Newcastle | 1 |
| 9 | Passing Accuracy | Newcastle | 75% |



✅ Match statistics data loaded: 98,164 records
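
The output above reports 243 unique team labels, and the most frequent ones include `Utd`, `City`, and `Manchester`. That suggests multi-word team names may be getting split or truncated somewhere in collection, although cup opponents outside the tracked teams would also add labels. A quick way to quantify the problem is to count the labels that do not match any known team name. This is a sketch on toy data; `unmatched_team_labels` and `known_names` are hypothetical, and in the notebook the known set could come from `teams_df['team_name']`.

```python
import pandas as pd

# Hypothetical reference set; in practice, build it from the teams metadata.
known_names = {"Arsenal", "Manchester Utd", "Manchester City", "Newcastle Utd"}

# Toy match-stat rows with the kind of fragmented labels seen in the output.
toy_stats = pd.DataFrame({
    "team_name": ["Arsenal", "Utd", "City", "Manchester", "Newcastle", "Utd"],
    "stat_name": ["Possession"] * 6,
})


def unmatched_team_labels(stats: pd.DataFrame, known: set) -> pd.Series:
    """Return record counts for team labels that are absent from the known set."""
    counts = stats["team_name"].value_counts()
    return counts[~counts.index.isin(known)]


unmatched = unmatched_team_labels(toy_stats, known_names)
print(unmatched)
```

Labels with high counts in this result are the first candidates to trace back to the scraping or parsing step.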


In [15]:
match_stats_df

|   | match_id | team_name | stat_name | stat_value |
|---|---|---|---|---|
| 0 | https://fbref.com/en/matches/1405a610/Newcastl... | Newcastle | Possession | 38% |
| 1 | https://fbref.com/en/matches/1405a610/Newcastl... | Arsenal | Possession | 62% |
| 2 | https://fbref.com/en/matches/1405a610/Newcastl... | Newcastle | Passing Accuracy | 75% |
| 3 | https://fbref.com/en/matches/1405a610/Newcastl... | Arsenal | Passing Accuracy | 84% |
| 4 | https://fbref.com/en/matches/1405a610/Newcastl... | Newcastle | Shots on Target | 22% |
| ... | ... | ... | ... | ... |
| 98159 | https://fbref.com/en/matches/be42686a/Coventry... | Town | Crosses | 9 |
| 98160 | https://fbref.com/en/matches/be42686a/Coventry... | Coventry | Interceptions | 13 |
| 98161 | https://fbref.com/en/matches/be42686a/Coventry... | Town | Interceptions | 14 |
| 98162 | https://fbref.com/en/matches/be42686a/Coventry... | Coventry | Offsides | 3 |
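
The match statistics are stored in long format (one row per match, team, and statistic), and `stat_value` mixes plain counts with percentage strings such as `38%`. For modeling, a wide table with one numeric column per statistic is usually easier to work with. The sketch below shows one possible conversion on toy rows; the `stat_numeric` column name is an assumption, and the Arsenal corner count is invented for illustration.

```python
import pandas as pd

# Toy long-format rows using the four columns shown above.
toy_long = pd.DataFrame({
    "match_id": ["m1"] * 4,
    "team_name": ["Newcastle", "Arsenal", "Newcastle", "Arsenal"],
    "stat_name": ["Possession", "Possession", "Corners", "Corners"],
    "stat_value": ["38%", "62%", "5", "7"],
})

# Strip a trailing percent sign, then coerce anything unparseable to NaN.
cleaned = toy_long["stat_value"].astype(str).str.rstrip("%")
toy_long["stat_numeric"] = pd.to_numeric(cleaned, errors="coerce")

# One row per (match, team), one column per statistic.
wide = toy_long.pivot_table(
    index=["match_id", "team_name"],
    columns="stat_name",
    values="stat_numeric",
    aggfunc="first",
)
print(wide)
```

Dropping the percent sign loses the unit, so it may be worth recording which statistics are percentages before converting.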


## 5. Cross-Dataset Analysis

In [None]:
print("🔗 CROSS-DATASET ANALYSIS")
print("=" * 50)

# Data availability summary
data_availability = {
    'Teams': teams_df is not None,
    'Fixtures': fixtures_df is not None,
    'Wages': wages_df is not None,
    'Match Stats': match_stats_df is not None
}

print("📋 Data availability:")
for dataset, available in data_availability.items():
    status = "✅ Available" if available else "❌ Missing"
    print(f"   {dataset}: {status}")

# If we have multiple datasets, analyze relationships
if teams_df is not None and fixtures_df is not None:
    print("\n🔍 Team coverage analysis:")
    
    # Teams in both datasets
    teams_in_metadata = set(teams_df['team_name'])
    teams_in_fixtures = set(fixtures_df['team_name'])
    
    teams_both = teams_in_metadata.intersection(teams_in_fixtures)
    teams_only_metadata = teams_in_metadata - teams_in_fixtures
    teams_only_fixtures = teams_in_fixtures - teams_in_metadata
    
    print(f"   Teams in both datasets: {len(teams_both)}")
    print(f"   Teams only in metadata: {len(teams_only_metadata)}")
    print(f"   Teams only in fixtures: {len(teams_only_fixtures)}")
    
    if teams_only_metadata:
        print(f"   Teams missing from fixtures: {sorted(teams_only_metadata)}")
    if teams_only_fixtures:
        print(f"   Teams missing from metadata: {sorted(teams_only_fixtures)}")

# Data size summary
print("\n📊 Dataset sizes:")
if teams_df is not None:
    print(f"   Teams: {len(teams_df)} teams")
if fixtures_df is not None:
    print(f"   Fixtures: {len(fixtures_df):,} records")
if wages_df is not None:
    print(f"   Wages: {len(wages_df):,} records")
if match_stats_df is not None:
    print(f"   Match Stats: {len(match_stats_df):,} records")

print("\n✅ Cross-dataset analysis complete")
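
The cell above compares datasets by team name. Since both the teams table and the wages table carry a `team_id` column, joining on the identifier may be more robust than matching names. A minimal sketch with toy frames, using the `indicator` option of `DataFrame.merge` to show which side each row came from (player names are placeholders):

```python
import pandas as pd

# Toy frames; the team IDs are taken from the teams sample earlier in the notebook.
toy_teams = pd.DataFrame({
    "team_id": ["18bb7c10", "8602292d", "4ba7cbea"],
    "team_name": ["Arsenal", "Aston Villa", "Bournemouth"],
})
toy_wages = pd.DataFrame({
    "team_id": ["18bb7c10", "18bb7c10", "8602292d"],
    "player_name": ["Player A", "Player B", "Player C"],
})

# Reduce wages to one row per team so the merge compares team coverage only.
wage_team_ids = toy_wages[["team_id"]].drop_duplicates()
coverage = toy_teams.merge(wage_team_ids, on="team_id", how="outer", indicator=True)

print(coverage["_merge"].value_counts())
missing = coverage.loc[coverage["_merge"] == "left_only", "team_name"].tolist()
print("Teams without wage rows:", missing)
```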

## 6. Data Quality Summary

In [None]:
print("🔍 DATA QUALITY SUMMARY")
print("=" * 50)

quality_report = []

# Teams data quality
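if teams_df is not None:
    # Compute completeness from the data instead of assuming it is 100%
    teams_missing_pct = (teams_df.isnull().sum().sum() / (len(teams_df) * len(teams_df.columns)) * 100)
    teams_quality = {
        'Dataset': 'Teams',
        'Records': len(teams_df),
        'Completeness': f"{100 - teams_missing_pct:.1f}%",
        'Issues': 'None identified'
    }
    quality_report.append(teams_quality)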

# Fixtures data quality
if fixtures_df is not None:
    missing_pct = (fixtures_df.isnull().sum().sum() / (len(fixtures_df) * len(fixtures_df.columns)) * 100)
    completeness = f"{100 - missing_pct:.1f}%"
    
    issues = []
    if 'result' in fixtures_df.columns and fixtures_df['result'].isnull().sum() > 0:
        issues.append(f"{fixtures_df['result'].isnull().sum()} matches without results")
    if 'gf' in fixtures_df.columns and fixtures_df['gf'].isnull().sum() > 0:
        issues.append(f"{fixtures_df['gf'].isnull().sum()} matches without goals data")
    
    fixtures_quality = {
        'Dataset': 'Fixtures',
        'Records': f"{len(fixtures_df):,}",
        'Completeness': completeness,
        'Issues': '; '.join(issues) if issues else 'None identified'
    }
    quality_report.append(fixtures_quality)

# Wages data quality
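if wages_df is not None and len(wages_df) > 0:
    wages_missing_pct = (wages_df.isnull().sum().sum() / (len(wages_df) * len(wages_df.columns)) * 100)
    wages_completeness = f"{100 - wages_missing_pct:.1f}%"
    
    # Flag wage columns that are still formatted text (e.g. "£ 350,000 (...)")
    wages_issues = [
        f"{col} stored as formatted text; parse to numeric before analysis"
        for col in ('annual_wages', 'weekly_wages')
        if col in wages_df.columns and not pd.api.types.is_numeric_dtype(wages_df[col])
    ]
    
    wages_quality = {
        'Dataset': 'Wages',
        'Records': f"{len(wages_df):,}",
        'Completeness': wages_completeness,
        'Issues': '; '.join(wages_issues) if wages_issues else 'None identified'
    }
    quality_report.append(wages_quality)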

# Match stats data quality
if match_stats_df is not None and len(match_stats_df) > 0:
    stats_missing = match_stats_df['stat_value'].isnull().sum()
    stats_completeness = f"{((len(match_stats_df) - stats_missing) / len(match_stats_df) * 100):.1f}%"
    
    stats_issues = []
    if stats_missing > 0:
        stats_issues.append(f"{stats_missing:,} records with missing stat values")
    
    stats_quality = {
        'Dataset': 'Match Statistics',
        'Records': f"{len(match_stats_df):,}",
        'Completeness': stats_completeness,
        'Issues': '; '.join(stats_issues) if stats_issues else 'None identified'
    }
    quality_report.append(stats_quality)

# Display quality report
if quality_report:
    quality_df = pd.DataFrame(quality_report)
    display(quality_df)

print("\n📋 Recommendations for data engineering:")
print("1. Handle missing values in fixtures data (notably notes, attendance, and xG columns)")
print("2. Standardize team names across datasets")
print("3. Create unified date formats and validation")
print("4. Implement data quality checks in the pipeline")
print("5. Consider data imputation strategies for wages data")
print("6. Validate match statistics consistency")

print("\n✅ Data quality analysis complete")

## 7. Summary and Next Steps

In [None]:
print("📋 DATA EXPLORATION SUMMARY")
print("=" * 50)

# Calculate total data points
total_records = 0
datasets_loaded = 0

if teams_df is not None:
    total_records += len(teams_df)
    datasets_loaded += 1
    print(f"✅ Teams: {len(teams_df)} teams across multiple seasons")

if fixtures_df is not None:
    total_records += len(fixtures_df)
    datasets_loaded += 1
    print(f"✅ Fixtures: {len(fixtures_df):,} match records")

if wages_df is not None and len(wages_df) > 0:
    total_records += len(wages_df)
    datasets_loaded += 1
    print(f"✅ Wages: {len(wages_df):,} player wage records")

if match_stats_df is not None and len(match_stats_df) > 0:
    total_records += len(match_stats_df)
    datasets_loaded += 1
    print(f"✅ Match Stats: {len(match_stats_df):,} statistical records")

print(f"\n📊 Total: {datasets_loaded}/4 datasets loaded, {total_records:,} total records")

# Key insights
print("\n🔑 Key Insights:")
if teams_df is not None:
    most_seasons = teams_df['seasons_count'].max()
    least_seasons = teams_df['seasons_count'].min()
    print(f"   • Team participation ranges from {least_seasons} to {most_seasons} seasons")

if fixtures_df is not None:
    if 'season' in fixtures_df.columns:
        seasons_covered = fixtures_df['season'].nunique()
        print(f"   • Fixtures data covers {seasons_covered} seasons")
    if 'comp' in fixtures_df.columns:
        competitions = fixtures_df['comp'].nunique()
        print(f"   • Data includes {competitions} different competitions")

if match_stats_df is not None and len(match_stats_df) > 0:
    matches_with_stats = match_stats_df['match_id'].nunique()
    stats_per_match = len(match_stats_df) / matches_with_stats
    print(f"   • Average {stats_per_match:.0f} statistics per match")

print("\n🚀 Recommended Next Steps:")
print("1. 🔧 Data Engineering:")
print("   - Create unified team identifier mapping")
print("   - Standardize date formats across datasets")
print("   - Handle missing values with appropriate strategies")
print("   - Create data validation and quality checks")

print("\n2. 📊 Feature Engineering:")
print("   - Calculate team performance metrics")
print("   - Create rolling averages for statistics")
print("   - Engineer target variables for prediction")
print("   - Normalize statistics for fair comparison")

print("\n3. 🤖 Modeling Preparation:")
print("   - Split data into training/validation/test sets")
print("   - Create time-based splits to avoid data leakage")
print("   - Define prediction targets (match outcomes, goals, etc.)")
print("   - Select relevant features for different model types")

print("\n✅ Data exploration complete! Ready for data engineering phase.")
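
The modeling step above recommends time-based splits to avoid data leakage. A minimal sketch of that idea follows: rows are assigned to training, validation, or test sets purely by date, so the model never sees matches that come after the ones it is evaluated on. The cutoff dates and the `time_split` helper are illustrative assumptions, not choices made by this project.

```python
import pandas as pd

# Toy fixture rows, one per season, with parsed dates.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-08-11", "2020-09-12", "2021-08-14",
        "2022-08-06", "2023-08-12", "2024-08-17",
    ]),
    "result": ["W", "L", "D", "W", "W", "L"],
})


def time_split(df, date_col, train_end, valid_end):
    """Split chronologically: train <= train_end < valid <= valid_end < test."""
    ordered = df.sort_values(date_col)
    train = ordered[ordered[date_col] <= train_end]
    valid = ordered[(ordered[date_col] > train_end) & (ordered[date_col] <= valid_end)]
    test = ordered[ordered[date_col] > valid_end]
    return train, valid, test


train, valid, test = time_split(
    toy, "date", pd.Timestamp("2022-06-30"), pd.Timestamp("2023-06-30")
)
print(len(train), len(valid), len(test))
```

Placing cutoffs between seasons keeps each season entirely within one split, which also avoids leaking within-season form across sets.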