# Notebook 1: Data Collection and Preprocessing

This notebook demonstrates how to use the `data_collection.py` and `data_preprocessing.py` modules to fetch and prepare sports data.

**Note:** To fetch real data from `football-data.org`, you need to set the `FOOTBALL_DATA_API_KEY` environment variable with your API token.

## 1. Setup and Imports

In [None]:
import os
import sys
from datetime import datetime, timedelta
import numpy as np
import pandas as pd

# Add the src directory to the Python path so the custom modules can be imported.
# This assumes the notebook runs from the project's 'notebooks' directory.
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
src_path = os.path.join(project_root, 'src')
if src_path not in sys.path:
    sys.path.append(src_path)

try:
    from data_collection import get_matches_for_date
    from data_preprocessing import preprocess_match_data
except ImportError as e:
    print(f"Error importing modules: {e}")
    print(f"Make sure '{src_path}' exists and contains data_collection.py and data_preprocessing.py.")
    print(f"Current sys.path: {sys.path}")
    raise  # The cells below cannot run without these functions

## 2. Data Collection

Fetch matches for a specific date. We use yesterday's date as the example because most of that day's matches should already be finished.
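
One caveat: `datetime.now()` returns local time, while the match records carry a `utcDate` field, so close to midnight the local "yesterday" can differ from the UTC one. If the API interprets the requested date in UTC (an assumption, not something this notebook confirms), a UTC-based date may be the safer choice. A minimal sketch, where `utc_yesterday` is a hypothetical helper and not part of `data_collection.py`:

```python
from datetime import datetime, timedelta, timezone


def utc_yesterday(now=None):
    """Return yesterday's date in UTC as 'YYYY-MM-DD'.

    `now` must be timezone-aware if given; it is injectable for testing
    and defaults to the current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalize any timezone-aware input to UTC before stepping back one day
    now_utc = now.astimezone(timezone.utc)
    return (now_utc - timedelta(days=1)).strftime("%Y-%m-%d")


print(utc_yesterday(datetime(2024, 3, 1, 0, 30, tzinfo=timezone.utc)))  # 2024-02-29
```

The resulting string has the same `YYYY-MM-DD` shape as the one passed to `get_matches_for_date` in the next cell.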

In [None]:
api_key = os.getenv("FOOTBALL_DATA_API_KEY", "YOUR_API_TOKEN")

if api_key == "YOUR_API_TOKEN":
    print("Warning: FOOTBALL_DATA_API_KEY is not set. Using placeholder.")
    print("Data fetching will likely return an empty list or fail.")
    print("Please set the environment variable for actual data.")

yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
print(f"Fetching matches for: {yesterday}")

raw_matches_data = get_matches_for_date(yesterday, api_key=api_key)

if raw_matches_data:
    print(f"Successfully fetched {len(raw_matches_data)} matches.")
    # Uncomment to inspect the first match:
    # print("Example match data:", raw_matches_data[0])
else:
    print("No matches fetched. This could be due to no games on that day, API limits, or an invalid/missing API key.")

## 3. Data Preprocessing

Convert the raw list of matches into a pandas DataFrame and perform initial cleaning and feature extraction.
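
To make the flattening step concrete, here is a rough sketch of how one nested match record might be turned into a flat row. The input layout (`homeTeam`, `awayTeam`, `score` with a nested `fullTime` entry) is an assumption about the API response, and `flatten_match` is a hypothetical helper; the actual logic lives in `preprocess_match_data` and may differ.

```python
def flatten_match(match):
    """Flatten one nested match dict into a single-level row.

    Missing keys become None so that unfinished matches do not raise.
    """
    # Assumed layout: match["score"]["fullTime"] holds the final score
    full_time = (match.get("score") or {}).get("fullTime") or {}
    return {
        "utcDate": match.get("utcDate"),
        "status": match.get("status"),
        "home_team_name": (match.get("homeTeam") or {}).get("name"),
        "away_team_name": (match.get("awayTeam") or {}).get("name"),
        "home_team_score": full_time.get("home"),
        "away_team_score": full_time.get("away"),
    }


# Made-up record, for illustration only
sample = {
    "utcDate": "2024-03-02T15:00:00Z",
    "status": "FINISHED",
    "homeTeam": {"name": "Team A"},
    "awayTeam": {"name": "Team B"},
    "score": {"fullTime": {"home": 2, "away": 1}},
}
row = flatten_match(sample)
print(row["home_team_name"], row["home_team_score"], "-",
      row["away_team_score"], row["away_team_name"])
```

A list of such rows can be passed to `pd.DataFrame(rows)` to obtain a table with the column names used in the next cell.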

In [None]:
# Preprocess only if the API returned a non-empty list of matches
if isinstance(raw_matches_data, list) and raw_matches_data:
    matches_df = preprocess_match_data(raw_matches_data)
    if not matches_df.empty:
        print("Preprocessed DataFrame info:")
        matches_df.info()
        print("\nDataFrame head:")
        preview_cols = ['home_team_name', 'away_team_name', 'home_team_score',
                        'away_team_score', 'utcDate', 'status']
        print(matches_df[preview_cols].head())
    else:
        print("Preprocessing resulted in an empty DataFrame.")
elif not raw_matches_data:  # None or an empty list: nothing came back from the API
    print("Skipping preprocessing as no raw match data was fetched.")
    matches_df = pd.DataFrame()  # Empty placeholder so later cells can still check matches_df.empty
else:
    print(f"Skipping preprocessing: expected a list but got {type(raw_matches_data).__name__}.")
    matches_df = pd.DataFrame()

## 4. Further Exploration (Placeholder)

At this stage, you would typically perform more in-depth exploratory data analysis (EDA) and feature engineering.
- Analyze the distributions of scores and outcomes.
- Engineer features such as:
    - Recent form (win/draw/loss ratio and goals scored/conceded over the last N games).
    - Head-to-head statistics between teams.
    - Team strength based on league position or ratings (e.g., Elo).
    - Home/away advantage.

This often involves fetching more historical data for each team.
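
As one concrete instance of the rating-based features listed above, the standard Elo update can be sketched as follows. The K-factor of 20 and the function names are illustrative assumptions, and no home-advantage term is included.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(home_rating, away_rating, home_result, k=20.0):
    """Return updated (home, away) ratings after one match.

    home_result is 1.0 for a home win, 0.5 for a draw, 0.0 for an away win.
    """
    exp_home = expected_score(home_rating, away_rating)
    delta = k * (home_result - exp_home)
    # Elo is zero-sum: whatever the home side gains, the away side loses
    return home_rating + delta, away_rating - delta


print(update_elo(1500.0, 1500.0, 1.0))  # (1510.0, 1490.0)
```

When ratings like these are used as model features, each match should see only the ratings computed from earlier matches; otherwise the feature leaks the result it is meant to predict.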

In [None]:
if not matches_df.empty:
    print("Example: Value counts of match status (if available and processed):")
    if 'status' in matches_df.columns:
        print(matches_df['status'].value_counts())
    else:
        print("'status' column not found in the preprocessed DataFrame.")
    
    # Derive a 'result_label' column (home win, draw, away win).
    # This requires 'home_team_score' and 'away_team_score' to be present.
    if 'home_team_score' in matches_df.columns and 'away_team_score' in matches_df.columns:
        # Coerce scores to numeric; missing or invalid scores become NaN and yield a NaN label
        h_scores = pd.to_numeric(matches_df['home_team_score'], errors='coerce')
        a_scores = pd.to_numeric(matches_df['away_team_score'], errors='coerce')
        
        conditions = [
            h_scores > a_scores,
            h_scores == a_scores,
            h_scores < a_scores
        ]
        outcomes = [0, 1, 2] # 0: Home Win, 1: Draw, 2: Away Win
        matches_df['result_label'] = pd.Series(np.select(conditions, outcomes, default=np.nan), index=matches_df.index)
        print("\n'result_label' distribution (0=Home, 1=Draw, 2=Away):")
        print(matches_df['result_label'].value_counts(dropna=False))
    else:
        print("\nScore columns not found, cannot determine match result label.")
else:
    print("Matches DataFrame is empty, skipping further exploration.")