# Tennis Prediction Setup & Quick Start

This notebook will:
1. Verify that all required packages are installed
2. Check that the data file exists
3. Load the data and print a quick summary
4. Import the project modules and run a quick ELO calculation on 2024-2025 matches

**Run this first to make sure everything works!**

## Step 1: Import Packages

In [None]:
# Import required packages
import sys
import os

# Add src to path
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb

print("All packages imported successfully!")
print(f"   pandas: {pd.__version__}")
print(f"   numpy: {np.__version__}")
print(f"   xgboost: {xgb.__version__}")
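
If one of the imports above fails, the cell stops at the first missing package. As a rough alternative, the sketch below reports every missing package at once using the standard library's `importlib.util.find_spec`. The `REQUIRED_PACKAGES` list and the `find_missing_packages` helper are illustrative names, not part of this project.

```python
import importlib.util

# Assumption: this list mirrors the third-party imports in the cell above.
REQUIRED_PACKAGES = ["pandas", "numpy", "matplotlib", "seaborn", "xgboost"]


def find_missing_packages(names):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing_packages(REQUIRED_PACKAGES)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are available.")
```

`find_spec` only locates a package without importing it, so this check is fast but does not catch packages that are installed yet broken.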

## Step 2: Check Data Files

In [None]:
# Check if data file exists
data_path = '../data/raw/atp_tennis.csv'

if os.path.exists(data_path):
    file_size = os.path.getsize(data_path) / (1024 * 1024)  # Size in MB
    print(f"Data file found: {data_path}")
    print(f"   File size: {file_size:.2f} MB")
else:
    print(f"Data file NOT found: {data_path}")
    print("   Please make sure atp_tennis.csv is in the data/raw/ folder!")

## Step 3: Quick Data Load

In [None]:
# Load the data
print("Loading data...")
df = pd.read_csv(data_path)

print("Data loaded!")
print(f"   Total matches: {len(df):,}")
print(f"   Columns: {len(df.columns)}")
print("\nFirst few rows:")
df.head()
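
The later cells in this notebook read the columns `Date`, `Surface`, `Player_1`, `Player_2`, and `Winner`. A small check like the one below can catch a mismatched CSV early. This is a minimal sketch: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the example column list is made up for illustration.

```python
# Assumption: these are the columns referenced elsewhere in this notebook.
REQUIRED_COLUMNS = ["Date", "Surface", "Player_1", "Player_2", "Winner"]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Illustrative input only; in the notebook you would pass df.columns.
example_columns = ["Date", "Surface", "Player_1", "Player_2"]
print(f"Missing columns: {missing_columns(example_columns)}")
```

In the notebook itself, `missing_columns(df.columns)` should return an empty list if the file has the expected layout.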

## Step 4: Quick Data Summary

In [None]:
# Convert date and add year
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year

print("DATA SUMMARY")
print("=" * 60)
print(f"\nDate Range: {df['Date'].min().date()} to {df['Date'].max().date()}")
print("\nMatches by Year:")
print(df['Year'].value_counts().sort_index())

print("\nSurface Distribution:")
print(df['Surface'].value_counts())

# Check for 2025 data
test_2025 = df[df['Year'] == 2025]
print(f"\n2025 Test Data: {len(test_2025):,} matches")

# Check for training data
train_pre2025 = df[df['Year'] < 2025]
print(f"Pre-2025 Training Data: {len(train_pre2025):,} matches")

## Step 5: Test Our Modules

In [None]:
# Import our custom modules
from data_loader import TennisDataLoader
from elo_calculator import EloCalculator
from feature_engineering import FeatureEngineer
from model import TennisPredictionModel
from visualizations import plot_elo_evolution

print("All custom modules imported successfully!")
print("   You're ready to go!")

## Step 6: Quick End-to-End Test

Let's run the ELO calculation on a small subset (2024-2025 matches) to confirm that the pipeline works before processing the full dataset.

In [None]:
# Take a small sample for quick testing
print("Running quick test on recent data...\n")

# Use only 2024-2025 for quick test
df_test = df[df['Year'] >= 2024].copy().sort_values('Date').reset_index(drop=True)

print(f"Test data: {len(df_test):,} matches (2024-2025)")

# Quick ELO calculation on small dataset
from elo_calculator import calculate_elo_for_dataframe

df_with_elo, elo_calc = calculate_elo_for_dataframe(df_test)

print("\nELO calculation successful!")
print("\nSample ELO ratings:")
print(df_with_elo[['Date', 'Player_1', 'Player_2', 'elo_1', 'elo_2', 'Winner']].head(10))
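
For intuition about what the `elo_1` and `elo_2` columns represent, here is a minimal sketch of the standard Elo update rule. It is not the implementation in `elo_calculator`, which may differ. The starting rating of 1500, the K-factor of 32, and the player names are assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner_rating, loser_rating, k=32.0):
    """Return the (winner, loser) ratings after one match."""
    expected_win = expected_score(winner_rating, loser_rating)
    delta = k * (1.0 - expected_win)  # larger gain for an unexpected win
    return winner_rating + delta, loser_rating - delta


# Hypothetical (winner, loser) pairs in chronological order.
matches = [("Player A", "Player B"), ("Player A", "Player C"), ("Player C", "Player B")]

ratings = {}
for winner, loser in matches:
    # Assumption: unseen players start at 1500.
    winner_before = ratings.get(winner, 1500.0)
    loser_before = ratings.get(loser, 1500.0)
    ratings[winner], ratings[loser] = update_elo(winner_before, loser_before)

for player, rating in sorted(ratings.items()):
    print(f"{player}: {rating:.1f}")
```

Ratings are updated only after each match, so the pre-match values can serve as model features without leaking the result.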

## Setup Complete!

If every cell above ran without errors, you're ready to go!

### Next Steps:

Run the notebooks in order:
1. **0_Setup.ipynb** (You are here!)
2. **1_DataExploration.ipynb** - Explore the dataset in detail
3. **2_CleanData.ipynb** - Clean and split data properly
4. **3_FeatureEngineering.ipynb** - Calculate ELO for full dataset
5. **4_TrainModel.ipynb** - Train XGBoost model
6. **5_MakePredictions.ipynb** - Predict 2025 matches
7. **6_EvaluateResults.ipynb** - Analyze accuracy

**Each notebook builds on the previous one!**