# 🎾 Tennis Prediction Setup & Quick Start

This notebook will:
1. Verify all packages are installed
2. Check that data files exist
3. Load and explore the data quickly
4. Run a complete end-to-end example

**Run this first to make sure everything works!**

## Step 1: Import Packages

In [1]:
# Import required packages
import sys
import os

# Add src to path
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb

print("✅ All packages imported successfully!")
print(f"   pandas: {pd.__version__}")
print(f"   numpy: {np.__version__}")
print(f"   xgboost: {xgb.__version__}")

✅ All packages imported successfully!
   pandas: 2.3.3
   numpy: 2.3.5
   xgboost: 3.1.1
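
The cell above stops at the first failed import, so one missing package can hide others. A minimal sketch of an alternative, assuming a hypothetical helper named `check_packages` (not part of this project), probes each package with `importlib` and reports all of them in one pass:

```python
import importlib


def check_packages(names):
    """Return {package name: version string, or None if the import fails}."""
    results = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            results[name] = None  # package is missing or cannot be imported
        else:
            # Not every module exposes __version__, so fall back to a label.
            results[name] = getattr(module, "__version__", "unknown")
    return results


required = ["pandas", "numpy", "matplotlib", "seaborn", "xgboost"]
for name, version in check_packages(required).items():
    status = "OK" if version is not None else "MISSING"
    print(f"{status:8} {name}: {version}")
```

Any package reported as `MISSING` can then be installed before re-running the notebook.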


## Step 2: Check Data Files

In [2]:
# Check if data file exists
data_path = '../data/raw/atp_tennis.csv'

if os.path.exists(data_path):
    file_size = os.path.getsize(data_path) / (1024 * 1024)  # Size in MB
    print(f"✅ Data file found: {data_path}")
    print(f"   File size: {file_size:.2f} MB")
else:
    print(f"❌ Data file NOT found: {data_path}")
    print("   Please make sure atp_tennis.csv is in the data/raw/ folder!")

✅ Data file found: ../data/raw/atp_tennis.csv
   File size: 8.47 MB
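
The path `../data/raw/atp_tennis.csv` is relative, so the check depends on the directory the notebook is started from. As a small sketch (the helper name `file_size_mb` and the variable `csv_path` are illustrative, not part of this project), `pathlib` can report the resolved absolute path when the file is missing, which makes a wrong working directory easier to spot:

```python
from pathlib import Path


def file_size_mb(path):
    """Return the file size in MB, or None if the path is not a file."""
    p = Path(path)
    if not p.is_file():
        return None
    return p.stat().st_size / (1024 * 1024)


csv_path = Path("../data/raw/atp_tennis.csv")
size = file_size_mb(csv_path)
if size is None:
    # The resolved path shows where Python actually looked.
    print(f"Not found: {csv_path.resolve()}")
else:
    print(f"Found {csv_path} ({size:.2f} MB)")
```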


## Step 3: Quick Data Load

In [3]:
# Load the data
print("Loading data...")
df = pd.read_csv(data_path)

print("✅ Data loaded!")
print(f"   Total matches: {len(df):,}")
print(f"   Columns: {len(df.columns)}")
print("\nFirst few rows:")
df.head()

Loading data...
✅ Data loaded!
   Total matches: 66,681
   Columns: 17

First few rows:


| | Tournament | Date | Series | Court | Surface | Round | Best of | Player_1 | Player_2 | Winner | Rank_1 | Rank_2 | Pts_1 | Pts_2 | Odd_1 | Odd_2 | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Dosedel S. | Ljubicic I. | Dosedel S. | 63 | 77 | -1 | -1 | -1.0 | -1.0 | 6-4 6-2 |
| 1 | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Clement A. | Enqvist T. | Enqvist T. | 56 | 5 | -1 | -1 | -1.0 | -1.0 | 3-6 3-6 |
| 2 | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Escude N. | Baccanello P. | Escude N. | 40 | 655 | -1 | -1 | -1.0 | -1.0 | 6-7 7-5 6-3 |
| 3 | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Knippschild J. | Federer R. | Federer R. | 87 | 65 | -1 | -1 | -1.0 | -1.0 | 1-6 4-6 |
| 4 | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Fromberg R. | Woodbridge T. | Fromberg R. | 81 | 198 | -1 | -1 | -1.0 | -1.0 | 7-6 5-7 6-4 |
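
In the preview rows, `Pts_1`, `Pts_2`, `Odd_1` and `Odd_2` are all `-1`, which looks like a placeholder for missing values rather than real ranking points or odds. Assuming that reading is correct (it is worth confirming against the dataset's documentation), a sketch of marking those values as missing before computing any statistics could look like this. The helper name `mark_missing` is illustrative only.

```python
import numpy as np
import pandas as pd

# Columns where -1 appears to mean "not recorded" (an assumption based on
# the preview rows, not on the dataset's documentation).
SENTINEL_COLUMNS = ["Pts_1", "Pts_2", "Odd_1", "Odd_2"]


def mark_missing(frame, columns=SENTINEL_COLUMNS, sentinel=-1):
    """Return a copy of frame with sentinel values replaced by NaN."""
    out = frame.copy()
    out[columns] = out[columns].replace(sentinel, np.nan)
    return out


# Tiny stand-in frame so the example runs without the CSV.
sample = pd.DataFrame({
    "Rank_1": [63, 56],
    "Pts_1": [-1, 1200],
    "Pts_2": [-1, -1],
    "Odd_1": [-1.0, 1.5],
    "Odd_2": [-1.0, 2.6],
})
cleaned = mark_missing(sample)
print(cleaned.isna().sum())
```

On the real data this would be called as `mark_missing(df)`; averages over points or odds would then skip the unrecorded matches instead of being pulled toward `-1`.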


## Step 4: Quick Data Summary

In [4]:
# Convert date and add year
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year

print("üìä DATA SUMMARY")
print("=" * 60)
print(f"\nüìÖ Date Range: {df['Date'].min().date()} to {df['Date'].max().date()}")
print(f"\nüéæ Matches by Year:")
print(df['Year'].value_counts().sort_index())

print(f"\nüèüÔ∏è Surface Distribution:")
print(df['Surface'].value_counts())

# Check for 2025 data
test_2025 = df[df['Year'] == 2025]
print(f"\n✅ 2025 Test Data: {len(test_2025):,} matches")

# Check for training data
train_pre2025 = df[df['Year'] < 2025]
print(f"✅ Pre-2025 Training Data: {len(train_pre2025):,} matches")

📊 DATA SUMMARY

📅 Date Range: 2000-01-03 to 2025-11-16

🎾 Matches by Year:
Year
2000    2874
2001    2979
2002    2700
2003    2698
2004    2782
2005    2815
2006    2793
2007    2712
2008    2565
2009    2607
2010    2562
2011    2565
2012    2508
2013    2538
2014    2448
2015    2518
2016    2520
2017    2528
2018    2568
2019    2497
2020    1231
2021    2402
2022    2545
2023    2607
2024    2631
2025    2488
Name: count, dtype: int64

🏟️ Surface Distribution:
Surface
Hard      36216
Clay      21389
Grass      7444
Carpet     1632
Name: count, dtype: int64

✅ 2025 Test Data: 2,488 matches
✅ Pre-2025 Training Data: 64,193 matches
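
The two subsets add up to the full dataset (64,193 + 2,488 = 66,681), so no rows fall outside the split. Splitting by year rather than at random keeps every training match earlier than every test match. A minimal sketch of the same split with that sanity check built in, using a hypothetical helper `split_by_year` that is not part of this project:

```python
import pandas as pd


def split_by_year(frame, test_year, year_col="Year"):
    """Split chronologically: rows before test_year train, rows in it test."""
    train_df = frame[frame[year_col] < test_year]
    test_df = frame[frame[year_col] == test_year]
    # Rows after test_year, or with a missing year, would land in neither
    # subset, so fail loudly instead of dropping them silently.
    assert len(train_df) + len(test_df) == len(frame), "rows outside both subsets"
    return train_df, test_df


# Tiny stand-in frame so the example runs without the CSV.
sample = pd.DataFrame({"Year": [2023, 2024, 2024, 2025, 2025, 2025]})
train_df, test_df = split_by_year(sample, test_year=2025)
print(len(train_df), len(test_df))  # 3 3
```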


## Step 5: Test Our Modules

In [5]:
# Import our custom modules
from data_loader import TennisDataLoader
from elo_calculator import EloCalculator
from feature_engineering import FeatureEngineer
from model import TennisPredictionModel
from visualizations import plot_elo_evolution

print("✅ All custom modules imported successfully!")
print("   You're ready to go!")

✅ All custom modules imported successfully!
   You're ready to go!


## Step 6: Quick End-to-End Test

Let's run a quick test on a small subset to make sure everything works!

In [6]:
# Take a small sample for quick testing
print("Running quick test on recent data...\n")

# Use only 2024-2025 for quick test
df_test = df[df['Year'] >= 2024].copy().sort_values('Date').reset_index(drop=True)

print(f"Test data: {len(df_test):,} matches (2024-2025)")

# Quick ELO calculation on small dataset
from elo_calculator import calculate_elo_for_dataframe

df_with_elo, elo_calc = calculate_elo_for_dataframe(df_test)

print("\n✅ ELO calculation successful!")
print("\nSample ELO ratings:")
print(df_with_elo[['Date', 'Player_1', 'Player_2', 'elo_1', 'elo_2', 'Winner']].head(10))

Running quick test on recent data...

Test data: 5,119 matches (2024-2025)
🎮 Calculating ELO ratings...
   This may take a few minutes for large datasets...
   Processed 5,000 matches...
✅ ELO calculation complete!
   Average ELO: 1556.5
   ELO range: 1331.8 - 2031.5

✅ ELO calculation successful!

Sample ELO ratings:
        Date              Player_1       Player_2   elo_1   elo_2  \
0 2024-01-01          Safiullin R.     Shelton B.  1500.0  1500.0   
1 2024-01-01              Bonzi B.  Ruusuvuori E.  1500.0  1500.0   
2 2024-01-01  Van De Zandschulp B.   Mochizuki S.  1500.0  1500.0   
3 2024-01-01              Kotov P.      Borges N.  1500.0  1500.0   
4 2024-01-01              Djere L.       Shang J.  1500.0  1500.0   
5 2024-01-01            Purcell M.        Rune H.  1500.0  1500.0   
6 2024-01-01           Dimitrov G.      Murray A.  1500.0  1500.0   
7 2024-01-02             Wolf J.J.   Duckworth J.  1500.0  1500.0   
8 2024-01-02             Daniel T.  Kecmanovic M.  1
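
Every rating in the sample output is 1500.0, which is consistent with each player starting from a default rating: this quick test only sees matches from 2024 onward, so no player has any history yet. The internals of `calculate_elo_for_dataframe` are not shown in this notebook. As a rough illustration only, the textbook Elo update looks like the sketch below. The K-factor of 32 and the function names are assumptions for illustration, not the project's actual settings.

```python
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner_rating, loser_rating, k=32.0):
    """Return (new_winner_rating, new_loser_rating) after one match."""
    expected_win = expected_score(winner_rating, loser_rating)
    delta = k * (1.0 - expected_win)  # larger when the win was an upset
    return winner_rating + delta, loser_rating - delta


# Two unrated players both start at 1500, so the expected score is 0.5
# and the winner gains exactly k / 2 points.
print(update_elo(1500.0, 1500.0))  # (1516.0, 1484.0)
```

Because the points gained by the winner equal the points lost by the loser, the average rating only moves when new players enter at the default value.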

## ✅ Setup Complete!

If you see all green checkmarks above, you're ready to go!

### Next Steps:

Run the notebooks in order:
1. ✅ **0_Setup.ipynb** (You are here!)
2. **1_DataExploration.ipynb** - Explore the dataset in detail
3. **2_CleanData.ipynb** - Clean and split data properly
4. **3_FeatureEngineering.ipynb** - Calculate ELO for full dataset
5. **4_TrainModel.ipynb** - Train XGBoost model
6. **5_MakePredictions.ipynb** - Predict 2025 matches
7. **6_EvaluateResults.ipynb** - Analyze accuracy

**Each notebook builds on the previous one!**