# **Preliminary Analysis**

---

## **Introduction**

The Rugby World Cup (RWC) is the most prestigious tournament in men’s international rugby, held every four years and featuring the top-performing nations from around the globe. Since its inception in 1987, all winning teams have come from the group of tier-one rugby nations—those with well-established professional structures and a history of competitive success. The World Cup's high stakes, unique pressure, and knockout format often produce different outcomes from regular international fixtures, yet the path to RWC success may still be written in a team’s performance leading up to the tournament.

This project seeks to bridge our understanding of regular international performance with World Cup outcomes through a predictive modeling lens. Our guiding research question is:  
> **Can regular fixture performance since 1999 reliably predict World Cup success for tier-one rugby nations?**

We expect that teams with consistently high win rates, dominant scoring margins, and strong ratings (e.g., Elo) leading into a World Cup are more likely to progress further or win the tournament. Based on our preliminary model, and consistent with historical trends, we predict that **South Africa (RSA)**, **Ireland (IRE)**, and **New Zealand (NZ)** are the most likely candidates to win the 2027 Rugby World Cup.

---

## **Data**

### **Data Source and Collection**
The data for this project was sourced from an international rugby match dataset containing results from test matches played by men's national teams. The dataset includes **2,783 matches**, of which **1,230** were retained after filtering for:
- Matches played **between 1999 and 2024**, inclusive
- **Tier-one teams** only
- Complete match records (with no missing scores or team names)
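
The three filters above can be sketched in pandas as follows. This is a minimal illustration rather than the project's actual cleaning code: the helper name `filter_matches` and the `TIER_ONE` set are assumptions (the ten teams are the ones listed in the simulation code later in this notebook), and the toy rows are illustrative, not taken from the dataset.

```python
import pandas as pd

# Assumed tier-one set: the ten teams used in the simulation code later on.
TIER_ONE = {
    "New Zealand", "South Africa", "Australia", "England", "France",
    "Ireland", "Wales", "Scotland", "Argentina", "Italy",
}

REQUIRED = ["date", "home_team", "away_team", "home_score", "away_score"]


def filter_matches(matches: pd.DataFrame) -> pd.DataFrame:
    """Keep complete 1999-2024 matches played between two tier-one teams."""
    out = matches.dropna(subset=REQUIRED).copy()       # complete records only
    out["date"] = pd.to_datetime(out["date"])
    in_window = out["date"].dt.year.between(1999, 2024)  # inclusive bounds
    both_tier_one = out["home_team"].isin(TIER_ONE) & out["away_team"].isin(TIER_ONE)
    return out.loc[in_window & both_tier_one].reset_index(drop=True)


# Illustrative rows: one is too early, one is kept, one has a missing score,
# and one involves a team outside the assumed tier-one set.
toy = pd.DataFrame({
    "date": ["1998-11-14", "2003-11-22", "2015-10-31", "2019-09-28"],
    "home_team": ["England", "Australia", "New Zealand", "Japan"],
    "away_team": ["South Africa", "England", "Australia", "Ireland"],
    "home_score": [13, 17, 34, 19],
    "away_score": [7, 20, None, 12],
})
filtered = filter_matches(toy)
print(filtered[["date", "home_team", "away_team"]])
```

Only the second toy row survives all three filters, which mirrors how the full dataset shrinks from 2,783 to 1,230 matches.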

### **Cases and Variables**
Each row in the dataset represents a single international rugby match and includes variables such as:
- `date`: Match date
- `home_team`, `away_team`: Competing teams
- `home_score`, `away_score`: Points scored by each team
- `competition`: Type of match (e.g., World Cup, Six Nations, etc.)
- `neutral`: Boolean indicator for neutral venue
- `world_cup`: Boolean indicator for whether the match is a World Cup fixture

### **Data Wrangling and Feature Engineering**
To support our analysis, we created several new variables and tidied the dataset:
- Extracted `year` from the match date for time-series analysis
- Created binary indicators: `homeWin` and `awayWin`
- Calculated `count` to aid in aggregating match totals
- Grouped matches by team to calculate:
  - **Home and away win percentages**
  - **Average points scored and conceded (home and away)**
  - **World Cup vs. regular fixture performance**
- Developed year-by-year win percentage timelines for each team
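
A rough sketch of these wrangling steps is shown below. It assumes the column names listed under Cases and Variables; the intermediate names (`home_long`, `away_long`, `appearances`) and the toy scores are made up for illustration.

```python
import pandas as pd

# Made-up rows, used only to show the mechanics.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-03", "2021-11-13", "2022-07-09", "2022-11-05"]),
    "home_team": ["Ireland", "Ireland", "New Zealand", "Ireland"],
    "away_team": ["Wales", "New Zealand", "Ireland", "South Africa"],
    "home_score": [30, 29, 12, 19],
    "away_score": [10, 20, 23, 19],
})

toy["year"] = toy["date"].dt.year
toy["homeWin"] = (toy["home_score"] > toy["away_score"]).astype(int)
toy["awayWin"] = (toy["away_score"] > toy["home_score"]).astype(int)
toy["count"] = 1  # one row per match, so summing gives matches played

# Home win percentage per team: home wins divided by home matches.
home = toy.groupby("home_team").agg(wins=("homeWin", "sum"), played=("count", "sum"))
home["home_win_pct"] = 100 * home["wins"] / home["played"]

# Year-by-year win percentage per team, combining home and away appearances.
home_long = toy[["year", "home_team", "homeWin"]].rename(
    columns={"home_team": "team", "homeWin": "win"})
away_long = toy[["year", "away_team", "awayWin"]].rename(
    columns={"away_team": "team", "awayWin": "win"})
appearances = pd.concat([home_long, away_long], ignore_index=True)
yearly = appearances.groupby(["team", "year"])["win"].mean().mul(100).rename("win_pct")

print(home)
print(yearly)
```

Note that a draw counts as neither a home win nor an away win here, so it lowers both teams' win percentages.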

### **Variables for Modeling**
We plan to include the following variables in our predictive models:
- **Win Percentage (last 2 years before RWC)**: Measures recent performance
- **Average Point Differential**: Offensive and defensive strength indicator
- **Elo Rating** (if included or computed): Captures opponent-adjusted team strength
- **Tournament Flag**: Helps compare regular matches to World Cup fixtures
- **Home/Away Advantage**: Quantified through win rates and score differentials

These variables were chosen for their interpretability and demonstrated relevance in differentiating strong and weak tournament performers.
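
If Elo ratings are not supplied with the data, they could be computed from the match results. The sketch below uses the standard Elo expectation formula; the K-factor of 20, the starting rating of 1500, the absence of a home-advantage term, and the function names are all assumptions for illustration, not choices made in this project.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that team A beats team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, home, away, home_score, away_score, k=20.0, start=1500.0):
    """Update `ratings` in place after one match (k and start are assumed values)."""
    r_home = ratings.get(home, start)
    r_away = ratings.get(away, start)
    if home_score > away_score:
        actual = 1.0
    elif home_score < away_score:
        actual = 0.0
    else:
        actual = 0.5  # a draw counts as half a win for each side
    delta = k * (actual - expected_score(r_home, r_away))
    ratings[home] = r_home + delta
    ratings[away] = r_away - delta  # zero-sum: the loser gives up what the winner gains


# Hypothetical matches, processed in chronological order.
toy_matches = [
    ("Team A", "Team B", 24, 17),
    ("Team B", "Team C", 20, 20),
    ("Team C", "Team A", 10, 30),
]
ratings = {}
for home, away, hs, as_ in toy_matches:
    update_elo(ratings, home, away, hs, as_)

for team, rating in sorted(ratings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{team}: {rating:.1f}")
```

Because each rating depends only on earlier matches, taking a team's rating just before a World Cup gives an opponent-adjusted strength measure that does not use tournament results.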

---

## **Methodology**

### **Research Question**
Our goal is to explore whether consistent performance in international fixtures since 1999 can predict a team’s success at the Rugby World Cup (RWC). We focus on tier-one nations, and based on trends in both regular and tournament data, we aim to forecast the winner of the 2027 RWC. Our preliminary model predicts that **South Africa, Ireland, or New Zealand** are the most likely champions.

---

### **Data Overview and Cleaning**

We used a dataset containing **2,783 international matches**, which was reduced to **1,230 matches** by keeping:
- **Matches from 1999 through 2024**.
- **Tier-one teams only**, excluding matches with insufficient data or unclear team tier.

Cleaning Steps:
- Extracted **year** from match dates for temporal analysis.
- Created binary flags for **homeWin** and **awayWin**.
- Dropped rows with missing values.
- Created a **"count"** variable to aggregate matches played per team.

---

### **Feature Engineering and Wrangling**

We created several new variables from existing match data:
- **Home/Away Win Percentages**: Calculated by aggregating match results per team.
- **Average Points Scored and Conceded**: For both World Cup and non-World Cup matches.
- **World Cup Match Flag**: To compare regular vs. tournament performance.
- **Yearly Win Percentage**: A normalized view of performance over time per team.

> Code implementation used pandas and NumPy, with seaborn and matplotlib for plotting.

---

### **Exclusions**
We excluded:
- All matches involving **non-tier-one nations**.
- Matches with **missing scores** or unclear venues.
- Non-competitive matches that could skew performance metrics (e.g., experimental squads).

---

### **Summary Statistics**

| Variable            | Mean   | Std Dev | Min   | Max   |
|---------------------|--------|---------|-------|-------|
| Home Score          | 25.07  | 13.15   | 0     | 101   |
| Away Score          | 20.78  | 11.38   | 0     | 68    |
| Home Win Rate       | 59.6%  | —       | 0     | 1     |
| Away Win Rate       | 38.4%  | —       | 0     | 1     |
| Year                | 2011.5 | 7.2     | 1999  | 2024  |

**Observation**:  
On average, the **home team outscores the away team by ~4.3 points** (25.07 vs. 20.78). The overall home win rate (59.6%) is well above the away win rate (38.4%).
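
The figures in the summary table can be checked with a few lines of arithmetic. The draw share below is inferred, assuming every match is a home win, an away win, or a draw; it is not a value reported in the table.

```python
# Values copied from the summary statistics table.
home_mean, away_mean = 25.07, 20.78
home_win_rate, away_win_rate = 59.6, 38.4  # percentages

margin = round(home_mean - away_mean, 2)
draw_rate = round(100 - home_win_rate - away_win_rate, 1)  # inferred, see note above

print(f"Average home margin: {margin} points")
print(f"Implied draw share: {draw_rate}%")
```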

---

### **Key Visualizations**

**1. Home and Away Win Rates by Team**
*Figure: Win Rate Plot (home and away win rates by team).*

- New Zealand dominates both at home and away.
- All teams perform **better at home**, confirming home advantage.
- Italy underperforms consistently.

**2. Average Points Scored vs. Conceded**
- Teams like **New Zealand** and **South Africa** show high scores with low concession.
- **Italy** and **Scotland** concede more than they score, especially in away games.

**3. World Cup vs. Regular Match Comparison**
- **Ireland and South Africa** maintain strong performance in both contexts.
- Some teams **drop significantly in performance during World Cups**, which may reflect tournament pressure or a lack of squad depth.

**4. Year-wise Win Percentage Trends**
- Shows **consistency and peaks** for contenders like **NZ, RSA, and IRE**.
- Allows modeling team form trajectory over time.

---

### **Modeling Approach**

#### **Random Forest Classifier**
- Used for **binary classification**: World Cup Winner vs. Non-Winner.
- Captures **non-linear patterns** and **interaction effects**.
- Feature importance helps identify key performance indicators.
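
A minimal sketch of this setup with scikit-learn is shown below. The data are synthetic and the feature names (`win_pct`, `avg_point_diff`, `elo_rating`) are borrowed from the modeling variables above; the hyperparameters and the use of `class_weight="balanced"` are assumptions, included because winners are rare compared with non-winners.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200  # synthetic team-tournament rows, not real data

win_pct = rng.uniform(0.2, 0.9, n)
avg_point_diff = 30 * (win_pct - 0.5) + rng.normal(0, 10, n)
elo_rating = 1500 + 400 * (win_pct - 0.5) + rng.normal(0, 30, n)
X = np.column_stack([win_pct, avg_point_diff, elo_rating])

# Synthetic label: the top 10% by win percentage are marked as "winners".
y = (win_pct > np.quantile(win_pct, 0.9)).astype(int)

clf = RandomForestClassifier(
    n_estimators=200,
    class_weight="balanced",  # compensate for the rare positive class
    random_state=0,
)
clf.fit(X, y)

feature_names = ["win_pct", "avg_point_diff", "elo_rating"]
importances = dict(zip(feature_names, clf.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

With real data there are very few winners in total, so any importance ranking from such a model should be read with caution.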

#### **Time-Series Analysis**
- Year-wise win percentage trends reveal **form, slumps, and growth**.
- Useful for **pre-tournament prediction**, especially for 2027.
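
One simple way to turn yearly win percentages into form indicators is a rolling mean plus a year-over-year change, as sketched below. The numbers are invented and the column names `form_3y` and `trend` are hypothetical; the three-year window is an assumption, not a tuned choice.

```python
import pandas as pd

# Invented yearly win percentages for two teams.
yearly = pd.DataFrame({
    "team": ["Ireland"] * 5 + ["Italy"] * 5,
    "year": list(range(2019, 2024)) * 2,
    "win_pct": [60.0, 70.0, 80.0, 75.0, 90.0, 10.0, 0.0, 20.0, 15.0, 25.0],
}).sort_values(["team", "year"]).reset_index(drop=True)

# Three-year rolling mean per team smooths out single-season noise.
yearly["form_3y"] = yearly.groupby("team")["win_pct"].transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)

# Year-over-year change per team; the first year of each team has no predecessor.
yearly["trend"] = yearly.groupby("team")["win_pct"].diff()

print(yearly)
```

The value of `form_3y` in the year before a World Cup could then serve as a pre-tournament feature.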

---

### **Why These Methods?**
- **Ordinal logistic regression** suits ordered outcome categories, such as the World Cup stage a team reaches.
- **Random forests** are robust to outliers and handle complex variable relationships.
- **Visualization-backed wrangling** ensured each model was driven by interpretable and meaningful variables.

---

#### **XGBoost Classifier with Monte Carlo Simulations**
XGBoost (Extreme Gradient Boosting) is an efficient, scalable gradient-boosted tree algorithm that is well suited to structured/tabular data such as match records. Its advantages include:

- High Predictive Accuracy: Often matches or outperforms alternatives such as logistic regression, random forests, or neural networks on structured datasets.

- Handles Non-Linearity & Complex Relationships: Captures intricate patterns in data that simpler models might miss.

- Feature Importance: Provides insights into which variables most influence predictions.

- Overfitting Controls: Includes L1/L2 regularization and early stopping, which reduce (but do not eliminate) the risk of overfitting.

- Works Well with Imbalanced Data: Can handle classification tasks where one class is rare (e.g., World Cup winners versus all other teams).

Monte Carlo (MC) simulations model uncertainty and probabilistic outcomes by running thousands of random simulations. When combined with XGBoost, they help:

- Quantify Uncertainty: XGBoost gives a point prediction (e.g., the probability that one team beats another), and MC sampling from those probabilities shows how outcomes are distributed across many simulated tournaments.

- Risk Assessment: If inputs have randomness (e.g., fluctuating team form), MC can propagate this uncertainty through the XGBoost model.

- Sensitivity Analysis: Test how small changes in input variables affect the output (e.g., "What if a team's win percentage varies by ±10%?").

- Decision-Making Under Uncertainty: The same approach is used in finance (portfolio risk), healthcare (treatment outcomes), and engineering (system failures).
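
To make the idea concrete, the sketch below runs a Monte Carlo simulation of a small single-elimination bracket. It is a simplified illustration, not the simulation used in this notebook: `simulate_knockout` and `toy_win_prob` are hypothetical names, the teams and strengths are placeholders, and in practice the pairwise win probability would come from a fitted model.

```python
import random
from collections import Counter


def simulate_knockout(bracket, win_prob, n_sims=10_000, seed=0):
    """Estimate title probabilities for a single-elimination bracket.

    bracket  : list of team names in draw order (length must be a power of two)
    win_prob : function (a, b) -> probability that team a beats team b
    """
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(n_sims):
        alive = list(bracket)
        while len(alive) > 1:
            # Pair neighbours in the draw and sample a winner for each tie.
            alive = [
                a if rng.random() < win_prob(a, b) else b
                for a, b in zip(alive[0::2], alive[1::2])
            ]
        titles[alive[0]] += 1
    return {team: titles[team] / n_sims for team in bracket}


# Placeholder strengths standing in for model-based win probabilities.
strength = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}


def toy_win_prob(a, b):
    return strength[a] / (strength[a] + strength[b])


probs = simulate_knockout(["A", "D", "B", "C"], toy_win_prob)
for team, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{team}: {p:.1%}")
```

Unlike a point prediction for a single match, the simulated title shares account for the path each team must take through the draw.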


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
import xgboost as xgb
from collections import defaultdict
import random

# Load the dataset
df = pd.read_csv('data/rugby.csv')

# Data preprocessing
# Filter to only international matches (remove club matches)
international_matches = df['competition'].str.contains('International|Championship|Nations|World Cup|Test Match', na=False)
df = df.loc[international_matches].copy()

# Feature engineering
# Create target variable - winner of each match
conditions = [
    df['home_score'] > df['away_score'],
    df['home_score'] < df['away_score']
]
choices = [df['home_team'], df['away_team']]
df.loc[:, 'winner'] = np.select(conditions, choices, default='Draw')

# Remove draws for classification
df = df.loc[df['winner'] != 'Draw'].copy()

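# Encode categorical variables
# Fit on every team that appears as home or away, so transform() cannot fail
# on a team that only ever played away.
le = LabelEncoder()
le.fit(pd.concat([df['home_team'], df['away_team']]))
df.loc[:, 'home_team_encoded'] = le.transform(df['home_team'])
df.loc[:, 'away_team_encoded'] = le.transform(df['away_team'])
df.loc[:, 'winner_encoded'] = le.transform(df['winner'])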

# Create features based on historical performance
def calculate_team_stats(df):
    team_stats = defaultdict(lambda: {'games': 0, 'wins': 0, 'points_for': 0, 'points_against': 0})
    
    for _, row in df.iterrows():
        home_team = row['home_team']
        away_team = row['away_team']
        home_score = row['home_score']
        away_score = row['away_score']
        
        # Update home team stats
        team_stats[home_team]['games'] += 1
        team_stats[home_team]['points_for'] += home_score
        team_stats[home_team]['points_against'] += away_score
        team_stats[home_team]['wins'] += 1 if home_score > away_score else 0
        
        # Update away team stats
        team_stats[away_team]['games'] += 1
        team_stats[away_team]['points_for'] += away_score
        team_stats[away_team]['points_against'] += home_score
        team_stats[away_team]['wins'] += 1 if away_score > home_score else 0
    
    return team_stats

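# Calculate aggregate per-team stats over the whole dataset.
# Note: these are all-time totals, not rolling stats, and they include matches
# that later land in the test split, so the features leak some outcome information.
team_stats = calculate_team_stats(df)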

# Add features to dataframe
def add_features(df, team_stats):
    df = df.copy()

    def per_game(team, key, default):
        # Per-match average of a stat; `default` covers teams with no games
        games = team_stats[team]['games']
        return team_stats[team][key] / games if games > 0 else default

    for side in ('home', 'away'):
        teams = df[f'{side}_team']
        df[f'{side}_win_pct'] = teams.apply(lambda t: per_game(t, 'wins', 0.5))
        df[f'{side}_points_avg'] = teams.apply(lambda t: per_game(t, 'points_for', 0))
        df[f'{side}_points_against_avg'] = teams.apply(lambda t: per_game(t, 'points_against', 0))

    # Add some interaction features
    df['win_pct_diff'] = df['home_win_pct'] - df['away_win_pct']
    df['points_diff'] = df['home_points_avg'] - df['away_points_avg']

    return df

df = add_features(df, team_stats)

In [None]:
# Split into features and target
feature_cols = ['home_team_encoded', 'away_team_encoded', 
                'home_win_pct', 'away_win_pct', 
                'home_points_avg', 'away_points_avg',
                'home_points_against_avg', 'away_points_against_avg',
                'win_pct_diff', 'points_diff']
X = df.loc[:, feature_cols]
y = df.loc[:, 'winner_encoded']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost model
model = xgb.XGBClassifier(
    objective='multi:softprob',
    num_class=len(le.classes_),
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    random_state=42
)

model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")

In [None]:
# Monte Carlo simulation for World Cup prediction
TOP_TEAMS = ['New Zealand', 'South Africa', 'Australia', 'England', 'France',
             'Ireland', 'Wales', 'Scotland', 'Argentina', 'Italy']

def simulate_world_cup(teams, model, le, n_simulations=1000):
    """Estimate how often each team wins a randomly drawn final.

    Simplification: both finalists are drawn uniformly at random, so this is
    not a bracket simulation and it ignores each team's path to the final.
    """
    # Keep only teams the encoder saw during training
    known = set(le.classes_)
    teams = [t for t in teams if t in known]
    codes = dict(zip(teams, le.transform(teams)))
    win_counts = {team: 0 for team in teams}
    prob_cache = {}

    def per_game(team, key):
        # Simplified: all-time averages (in practice you'd want more recent stats)
        stats = team_stats[team]
        return stats[key] / max(1, stats['games'])

    def team1_win_prob(team1, team2):
        # P(team1 beats team2) with team1 in the "home" slot, cached per pairing
        if (team1, team2) not in prob_cache:
            row = pd.DataFrame([[
                codes[team1], codes[team2],
                per_game(team1, 'wins'), per_game(team2, 'wins'),
                per_game(team1, 'points_for'), per_game(team2, 'points_for'),
                per_game(team1, 'points_against'), per_game(team2, 'points_against'),
                per_game(team1, 'wins') - per_game(team2, 'wins'),
                per_game(team1, 'points_for') - per_game(team2, 'points_for'),
            ]], columns=feature_cols)
            probs = model.predict_proba(row)[0]
            # The model scores every team as a possible winner; keep only the
            # two finalists and renormalise so the winner is one of them.
            p1, p2 = probs[codes[team1]], probs[codes[team2]]
            prob_cache[(team1, team2)] = p1 / (p1 + p2) if (p1 + p2) > 0 else 0.5
        return prob_cache[(team1, team2)]

    for _ in range(n_simulations):
        # Randomly select 2 teams to "play" in the final
        team1, team2 = random.sample(teams, 2)
        winner = team1 if random.random() < team1_win_prob(team1, team2) else team2
        win_counts[winner] += 1

    # Convert counts to percentages
    total = max(1, sum(win_counts.values()))
    return {team: (count / total) * 100 for team, count in win_counts.items()}

# Run simulation
win_probabilities = simulate_world_cup(TOP_TEAMS, model, le, n_simulations=100000)

# Display results
print("\nPredicted 2027 Rugby World Cup Win Probabilities:")
for team, prob in sorted(win_probabilities.items(), key=lambda x: x[1], reverse=True):
    if prob > 0:
        print(f"{team}: {prob:.1f}%")

#### **Ordinal Logistic Regression**
- Predicts **World Cup stage reached** (Group → QF → SF → Final → Winner).
- Fits ordinal nature of outcomes and highlights incremental performance gains.
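
In the proportional-odds form of ordinal logistic regression, the probability of reaching stage k or lower is a logistic function of a threshold minus the linear predictor, and each stage's probability is the difference between adjacent cumulative probabilities. The sketch below shows that mapping with NumPy; the thresholds and linear-predictor values are made up, and `stage_probabilities` is a hypothetical helper, not part of any library.

```python
import numpy as np

STAGES = ["Group", "QF", "SF", "Final", "Winner"]


def stage_probabilities(linear_predictor, thresholds):
    """Return P(stage = k) for each stage under a proportional-odds model."""
    cuts = np.asarray(thresholds, dtype=float)
    # Cumulative probabilities P(stage <= k) for the first four stages.
    cdf = 1.0 / (1.0 + np.exp(-(cuts - linear_predictor)))
    # Pad with 0 and 1 so that differences give one probability per stage.
    cdf = np.concatenate([[0.0], cdf, [1.0]])
    return np.diff(cdf)


thresholds = [-1.0, 0.5, 1.5, 2.5]  # made-up cut points between the five stages
weak = stage_probabilities(-1.0, thresholds)   # low linear predictor
strong = stage_probabilities(2.0, thresholds)  # high linear predictor

for name, p_weak, p_strong in zip(STAGES, weak, strong):
    print(f"{name:<6} weak={p_weak:.3f} strong={p_strong:.3f}")
```

A higher linear predictor (for example, from a higher Elo rating with a positive coefficient) shifts probability mass toward the later stages.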

Create a World Cup dataset with one row per team per tournament, containing that team's outcome (stage reached), Elo rating, `avg_point_diff`, and `win_pct` for that year.

In [None]:
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Expected columns in the modelling dataset:
# stage: 1=Group, 2=QF, 3=SF, 4=Final, 5=Winner (ordinal target)
# win_pct, avg_point_diff, elo_rating, etc. are explanatory variables

# Load your cleaned dataset (replace with actual file or DataFrame)
# Separate names (rwc_df, X_rwc, ...) avoid overwriting df, X, y and model from the cells above
rwc_df = pd.read_csv("your_rwc_model_data.csv")

# Drop any rows with missing values in the modelling columns
model_cols = ['stage', 'win_pct', 'avg_point_diff', 'elo_rating']
rwc_df = rwc_df.dropna(subset=model_cols)

# Define features and outcome variable
# No constant column is added: OrderedModel estimates thresholds instead of an intercept
X_rwc = rwc_df[['win_pct', 'avg_point_diff', 'elo_rating']]  # Add other variables if needed
y_rwc = rwc_df['stage']  # Ordinal outcome

# Create the ordinal logistic regression model
ordinal_model = OrderedModel(
    endog=y_rwc,
    exog=X_rwc,
    distr='logit'  # You could also try 'probit'
)

# Fit the model
res = ordinal_model.fit(method='bfgs')
print(res.summary())