# Replication with Actual Data

This notebook replicates the paper's econometric analysis using the **actual** panel dataset built from FIFA rankings, club rankings, and World Development Indicators.

**Data Sources**:
- FIFA World Rankings (1993-2010)
- FIFA Club Rankings (2000-2010)
- World Bank WDI (1993-2010)

**Output**: Regression results for equations (1), (2), and (3) saved to CSV files.

**Project docs**: [README.md](../README.md), [Data Dictionary](../docs/data-dictionary.md), [Data Pipeline](../docs/data-pipeline.md)

In [7]:
import subprocess
import sys
from pathlib import Path

import numpy as np
import pandas as pd
from IPython.display import display
from linearmodels.panel import PanelOLS
from linearmodels.iv import IV2SLS

print(f"Python: {sys.version}")
print(f"Pandas: {pd.__version__}")
print(f"Working directory: {Path.cwd()}")

Python: 3.14.2 (main, Dec  5 2025, 16:49:16) [Clang 17.0.0 (clang-1700.6.3.2)]
Pandas: 3.0.0
Working directory: /Users/dkritz/git/fifa-paper/notebooks


## Load Actual Panel Data

Load the panel dataset built from FIFA points, club rankings, and WDI data (1993-2010).

In [8]:
# Load the panel dataset
repo_root = Path.cwd().parent
data_path = repo_root / "data/analysis/panel.csv"

if not data_path.exists():
    print(f"Panel data not found at {data_path}. Building panel...")
    # Use the running interpreter so the build sees the same environment as the notebook
    subprocess.run([sys.executable, "scripts/build_panel.py"], cwd=repo_root, check=True)

if not data_path.exists():
    raise FileNotFoundError(
        f"Panel data not found at {data_path}. Run: python3 scripts/build_panel.py"
    )

df = pd.read_csv(data_path)
df["gdp_pc_k"] = df["gdp_pc"] / 1000.0
df["gdp_pc_k_sq"] = df["gdp_pc_k"] ** 2
df["pop_m"] = df["pop"] / 1_000_000.0
df["pop_m_sq"] = df["pop_m"] ** 2
print(f"Loaded {len(df)} observations")
print(f"Countries: {df['country'].nunique()}")
print(f"Years: {df['year'].min()}-{df['year'].max()}")
print(f"Variables: {list(df.columns)}")
print("\nConfederations:")
print(df['confed'].value_counts())

Loaded 3276 observations
Countries: 182
Years: 1993-2010
Variables: ['country', 'year', 'confed', 'fifa_points', 'gdp_pc', 'gdp_pc_sq', 'pop', 'pop_sq', 'trade', 'infl', 'oil', 'leb', 'club', 'urbpop', 'urbpop_sq', 'gdp_pc_k', 'gdp_pc_k_sq', 'pop_m', 'pop_m_sq']

Confederations:
confed
CAF         918
UEFA        810
AFC         720
CONCACAF    522
CONMEBOL    180
OFC         126
Name: count, dtype: int64


### Data Snapshot

In [9]:
# Display the first ten rows and the dataset shape
display(df.head(10))
print(f"\nDataset shape: {df.shape}")

# Check for missing values
missing = df.isnull().sum()
print("\nMissing values:")
print(missing[missing > 0])

Unnamed: 0,country,year,confed,fifa_points,gdp_pc,gdp_pc_sq,pop,pop_sq,trade,infl,oil,leb,club,urbpop,urbpop_sq,gdp_pc_k,gdp_pc_k_sq,pop_m,pop_m_sq
0,Albania,1993,UEFA,15.0,380.808911,145015.4,3224901.0,10399990000000.0,77.715069,85.004751,4.245053,71.487,,37.9,1436.41,0.380809,0.145015,3.224901,10.399986
1,Albania,1994,UEFA,15.0,624.535068,390044.1,3179442.0,10108850000000.0,50.302012,22.565053,2.3091,71.617732,,38.4,1474.56,0.624535,0.390044,3.179442,10.108851
2,Albania,1995,UEFA,15.0,771.862535,595771.8,3141102.0,9866522000000.0,46.987422,7.793219,2.03482,71.870293,,38.9,1513.21,0.771863,0.595772,3.141102,9.866522
3,Albania,1996,UEFA,21.0,968.071952,937163.3,3112597.0,9688260000000.0,47.547304,12.725478,1.912895,72.241073,,39.46,1557.0916,0.968072,0.937163,3.112597,9.68826
4,Albania,1997,UEFA,21.0,710.314788,504547.1,3091902.0,9559858000000.0,47.210926,33.180274,1.783663,72.696073,,40.02,1601.6004,0.710315,0.504547,3.091902,9.559858
5,Albania,1998,UEFA,24.0,885.908631,784834.1,3079037.0,9480469000000.0,45.13491,20.642859,0.815573,73.197805,,40.58,1646.7364,0.885909,0.784834,3.079037,9.480469
6,Albania,1999,UEFA,29.0,1117.705767,1249266.0,3072725.0,9441639000000.0,49.315913,0.389438,0.873985,73.721829,,41.14,1692.4996,1.117706,1.249266,3.072725,9.441639
7,Albania,2000,UEFA,428.0,1200.137437,1440330.0,3071856.0,9436299000000.0,56.625339,0.050018,1.436102,74.238732,0.0,41.7,1738.89,1.200137,1.44033,3.071856,9.436299
8,Albania,2001,UEFA,480.0,1329.385031,1767265.0,3077378.0,9470255000000.0,59.230181,3.107588,1.02929,74.723659,0.0,42.32,1790.9824,1.329385,1.767265,3.077378,9.470255
9,Albania,2002,UEFA,448.0,1440.030143,2073687.0,3089778.0,9546728000000.0,67.135909,7.770526,1.219249,75.16122,0.0,42.94,1843.8436,1.44003,2.073687,3.089778,9.546728



Dataset shape: (3276, 19)

Missing values:
fifa_points      96
gdp_pc          111
gdp_pc_sq       111
pop               2
pop_sq            2
trade           232
infl            401
oil             914
leb             109
club           1274
gdp_pc_k        111
gdp_pc_k_sq     111
pop_m             2
pop_m_sq          2
dtype: int64


In [10]:
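# Missingness summary (rates)
missing_counts = df.isnull().sum()
missing_rates = (missing_counts / len(df)).sort_values(ascending=False)
# Note: the DataFrame constructor realigns both Series on their index, so the
# sort above is not preserved in `summary`. To rank by rate, chain
# .sort_values("missing_rate", ascending=False) onto the displayed frame.
summary = pd.DataFrame({"missing_count": missing_counts, "missing_rate": missing_rates})
display(summary[summary['missing_count'] > 0])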

Unnamed: 0,missing_count,missing_rate
club,1274,0.388889
fifa_points,96,0.029304
gdp_pc,111,0.033883
gdp_pc_k,111,0.033883
gdp_pc_k_sq,111,0.033883
gdp_pc_sq,111,0.033883
infl,401,0.122405
leb,109,0.033272
oil,914,0.278999
pop,2,0.000611
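
The estimators used later in this notebook drop any row with a missing input, so the estimation sample shrinks as regressors are added (3076, 2105, and 1320 observations for equations (1) to (3) in the outputs of this notebook). A minimal sketch of how the per-specification sample size could be checked up front follows; the helper name `complete_case_counts` and the toy frame are illustrative assumptions, not part of the replication script.

```python
import numpy as np
import pandas as pd


def complete_case_counts(frame, specs):
    """Count rows with no missing value in each specification's columns."""
    rows = {}
    for name, cols in specs.items():
        complete = frame[cols].notna().all(axis=1)
        rows[name] = {
            "n_complete": int(complete.sum()),
            "share": float(complete.mean()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy frame standing in for the panel (values are made up for illustration).
toy = pd.DataFrame(
    {
        "fifa_points": [10.0, 20.0, np.nan, 40.0],
        "gdp_pc_k": [1.0, 2.0, 3.0, np.nan],
        "trade": [50.0, np.nan, 60.0, 70.0],
    }
)
specs = {
    "eq1": ["fifa_points", "gdp_pc_k"],
    "eq2": ["fifa_points", "gdp_pc_k", "trade"],
}
counts = complete_case_counts(toy, specs)
print(counts)
```

On the real panel, the same helper could be called with `df` and the column lists for each equation once those lists are defined, which would show where observations are lost before any model is fitted.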


## Load Model Runner Helpers

Import the regression functions from the replication script.

In [11]:
# Add repo root to path
repo_root = Path.cwd().parent
sys.path.append(str(repo_root))

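# Note: the script helpers are bound to underscore-prefixed aliases, so this block
# only checks that the script imports cleanly. The notebook-local fit_fe_ols and
# fit_fe_iv defined below are what every later cell calls.
try:
    from scripts.replicate_stata import fit_fe_ols as _fit_fe_ols, fit_fe_iv as _fit_fe_iv
    print("Successfully imported fit_fe_ols and fit_fe_iv from scripts.replicate_stata")
except Exception as e:
    print(f"Import failed: {e}")
    print("Using notebook implementations...")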

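def demean_by_entity(df, cols, entity):
    """Subtract each entity's mean from the given columns (within transformation)."""
    out = df.copy()
    for col in cols:
        ent_mean = out.groupby(entity)[col].transform("mean")
        out[col] = out[col] - ent_mean
    return out

def fit_fe_ols(panel, dep, exog):
    """Fit OLS with entity fixed effects and entity-clustered standard errors.

    `panel` must be indexed by (entity, time).
    """
    y = panel[dep]
    X = panel[exog]
    model = PanelOLS(y, X, entity_effects=True, time_effects=False, check_rank=False)
    return model.fit(cov_type="clustered", cluster_entity=True, debiased=True)

def fit_fe_iv(df, dep, exog, endog, instr, entity, time):
    """Fit 2SLS on entity-demeaned data with entity-clustered standard errors.

    `time` only enters the complete-case filter; it is not used in estimation.
    """
    cols = [dep] + exog + [endog] + instr + [entity, time]
    work = df[cols].dropna().copy()
    # Demeaning absorbs the entity effects, so no constant is added below.
    work = demean_by_entity(work, [dep] + exog + [endog] + instr, entity)
    y = work[dep]
    X = work[exog]
    endog_v = work[endog]
    Z = work[instr]
    model = IV2SLS(y, X, endog_v, Z)
    return model.fit(cov_type="clustered", clusters=work[entity], debiased=True)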

Successfully imported fit_fe_ols and fit_fe_iv from scripts.replicate_stata


## Equation (1): Baseline Model

**Specification**: FIFA points ~ GDP per capita (thousands) + GDP per capita² + Population (millions) + Population²

**Estimation**: Fixed-effects OLS with country effects and standard errors clustered by country
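
Country fixed effects can be read as a within transformation: subtracting each country's mean from every variable and running OLS on the demeaned data gives the same slopes as including one indicator per country. The sketch below checks this on synthetic data with plain NumPy and pandas. The data-generating values and the names `unit`, `x`, and `y` are assumptions for illustration only, and the sketch does not reproduce the clustered standard errors.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic panel: 5 units, 8 periods, unit-specific intercepts (made-up values).
n_units, n_periods = 5, 8
unit = np.repeat(np.arange(n_units), n_periods)
alpha = rng.normal(size=n_units)[unit]
x = rng.normal(size=unit.size) + alpha  # regressor correlated with the unit effect
y = 2.0 * x + 3.0 * alpha + rng.normal(scale=0.1, size=unit.size)
toy = pd.DataFrame({"unit": unit, "x": x, "y": y})

# Within estimator: demean by unit, then OLS without a constant.
demeaned = toy[["x", "y"]] - toy.groupby("unit")[["x", "y"]].transform("mean")
beta_within = np.linalg.lstsq(
    demeaned[["x"]].to_numpy(), demeaned["y"].to_numpy(), rcond=None
)[0][0]

# Dummy-variable estimator: x plus one indicator column per unit.
dummies = pd.get_dummies(toy["unit"], dtype=float).to_numpy()
design = np.column_stack([toy["x"].to_numpy(), dummies])
beta_lsdv = np.linalg.lstsq(design, toy["y"].to_numpy(), rcond=None)[0][0]

print(f"within: {beta_within:.6f}, dummy-variable: {beta_lsdv:.6f}")
```

The two slopes agree up to floating-point error, which is why a demeaning helper and an estimator with entity effects can be used interchangeably for point estimates.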

In [12]:
# Define variables
dep = "fifa_points"
base = ["gdp_pc_k", "gdp_pc_k_sq", "pop_m", "pop_m_sq"]

# Prepare panel
panel = df.set_index(["country", "year"])

# Run Equation (1)
print("="*80)
print("EQUATION (1): Baseline Model")
print("="*80)
res1 = fit_fe_ols(panel, dep, base)
print(res1.summary)
display(res1.summary)

EQUATION (1): Baseline Model
                          PanelOLS Estimation Summary                           
Dep. Variable:            fifa_points   R-squared:                        0.3025
Estimator:                   PanelOLS   R-squared (Between):             -32.594
No. Observations:                3076   R-squared (Within):               0.3025
Date:                Wed, Feb 04 2026   R-squared (Overall):             -20.928
Time:                        16:49:49   Log-likelihood                -2.066e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      313.72
Entities:                         179   P-value                           0.0000
Avg Obs:                       17.184   Distribution:                  F(4,2893)
Min Obs:                       1.0000                                           
Max Obs:                       18.000   F-statistic (robust):             38.144

Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)


0,1,2,3
Dep. Variable:,fifa_points,R-squared:,0.3025
Estimator:,PanelOLS,R-squared (Between):,-32.594
No. Observations:,3076,R-squared (Within):,0.3025
Date:,"Wed, Feb 04 2026",R-squared (Overall):,-20.928
Time:,16:49:49,Log-likelihood,-2.066e+04
Cov. Estimator:,Clustered,,
,,F-statistic:,313.72
Entities:,179,P-value,0.0000
Avg Obs:,17.184,Distribution:,"F(4,2893)"
Min Obs:,1.0000,,

0,1,2,3,4,5,6
,Parameter,Std. Err.,T-stat,P-value,Lower CI,Upper CI
gdp_pc_k,35.216,4.4687,7.8807,0.0000,26.454,43.978
gdp_pc_k_sq,-0.2108,0.0437,-4.8207,0.0000,-0.2966,-0.1251
pop_m,24.418,4.4978,5.4290,0.0000,15.599,33.238
pop_m_sq,-0.0104,0.0022,-4.7379,0.0000,-0.0147,-0.0061
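
With a quadratic specification, the implied turning point of each variable is where the marginal effect b1 + 2·b2·x reaches zero, that is x* = -b1 / (2·b2). The small sketch below applies this to the equation (1) point estimates printed above. The helper name `turning_point` is an assumption, and the calculation ignores sampling uncertainty.

```python
def turning_point(linear, quadratic):
    """Return x* = -linear / (2 * quadratic), where the marginal effect is zero."""
    if quadratic == 0:
        raise ValueError("no turning point when the quadratic coefficient is zero")
    return -linear / (2.0 * quadratic)


# Point estimates copied from the equation (1) table above.
gdp_peak = turning_point(35.216, -0.2108)  # in thousands of GDP-per-capita units
pop_peak = turning_point(24.418, -0.0104)  # in millions of people

print(f"GDP per capita turning point: {gdp_peak:.1f} thousand")
print(f"Population turning point: {pop_peak:.0f} million")
```

Under these estimates the fitted relationship peaks at roughly 83.5 thousand in GDP per capita and about 1,174 million in population, so it is increasing over most of the observed range.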


## Equation (2): Model with Macro/Resource Controls

**Specification**: Equation (1) + Trade + Inflation + Oil rents + Life expectancy

**Estimation**: Fixed-effects OLS with country effects and standard errors clustered by country

In [13]:
# Define additional controls
macro = ["trade", "infl", "oil", "leb"]

# Run Equation (2)
print("="*80)
print("EQUATION (2): Model with Controls")
print("="*80)
res2 = fit_fe_ols(panel, dep, base + macro)
print(res2.summary)
display(res2.summary)

EQUATION (2): Model with Controls
                          PanelOLS Estimation Summary                           
Dep. Variable:            fifa_points   R-squared:                        0.4317
Estimator:                   PanelOLS   R-squared (Between):             -105.34
No. Observations:                2105   R-squared (Within):               0.4317
Date:                Wed, Feb 04 2026   R-squared (Overall):             -70.131
Time:                        16:50:00   Log-likelihood                -1.417e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      186.95
Entities:                         128   P-value                           0.0000
Avg Obs:                       16.445   Distribution:                  F(8,1969)
Min Obs:                       1.0000                                           
Max Obs:                       18.000   F-statistic (robust):             2

Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)


0,1,2,3
Dep. Variable:,fifa_points,R-squared:,0.4317
Estimator:,PanelOLS,R-squared (Between):,-105.34
No. Observations:,2105,R-squared (Within):,0.4317
Date:,"Wed, Feb 04 2026",R-squared (Overall):,-70.131
Time:,16:50:00,Log-likelihood,-1.417e+04
Cov. Estimator:,Clustered,,
,,F-statistic:,186.95
Entities:,128,P-value,0.0000
Avg Obs:,16.445,Distribution:,"F(8,1969)"
Min Obs:,1.0000,,

0,1,2,3,4,5,6
,Parameter,Std. Err.,T-stat,P-value,Lower CI,Upper CI
gdp_pc_k,33.170,4.7123,7.0390,0.0000,23.928,42.411
gdp_pc_k_sq,-0.2481,0.0356,-6.9732,0.0000,-0.3178,-0.1783
pop_m,13.455,4.8729,2.7611,0.0058,3.8982,23.011
pop_m_sq,-0.0060,0.0022,-2.7061,0.0069,-0.0103,-0.0017
trade,3.1053,0.8514,3.6474,0.0003,1.4356,4.7750
infl,-0.0024,0.0107,-0.2275,0.8201,-0.0235,0.0186
oil,2.9092,2.2218,1.3094,0.1905,-1.4480,7.2665
leb,41.406,13.530,3.0602,0.0022,14.871,67.942


## Equation (3): IV/2SLS with Club Strength

**Specification**: Equation (2) + Club strength (instrumented by urban population + urban population²)

**Estimation**: Fixed-effects IV/2SLS with country effects and standard errors clustered by country

**Instruments**: urbpop, urbpop_sq (for endogenous "club" variable)

In [14]:
# Define IV variables
club = "club"
instr = ["urbpop", "urbpop_sq"]

# Check club data availability
club_available = df[club].notna().sum()
club_total = len(df)
print(f"\nClub data availability: {club_available}/{club_total} ({100*club_available/club_total:.1f}%)")

# Run Equation (3) if we have club data
if club_available > 0:
    print("="*80)
    print("EQUATION (3): IV/2SLS with Club Strength")
    print("="*80)
    try:
        res3 = fit_fe_iv(df, dep, base + macro, club, instr, "country", "year")
        print(res3.summary)
        display(res3.summary)
    except Exception as e:
        print(f"Error running IV regression: {e}")
        print("\nThis may occur if:")
        print("- All instruments are absorbed by fixed effects")
        print("- No variation in instruments after demeaning")
else:
    print("No club data available - skipping Equation (3)")


Club data availability: 2002/3276 (61.1%)
EQUATION (3): IV/2SLS with Club Strength
                          IV-2SLS Estimation Summary                          
Dep. Variable:            fifa_points   R-squared:                     -69.871
Estimator:                    IV-2SLS   Adj. R-squared:                -70.358
No. Observations:                1320   F-statistic:                    12.174
Date:                Wed, Feb 04 2026   P-value (F-stat)                0.0000
Time:                        16:50:12   Distribution:                F(9,1311)
Cov. Estimator:             clustered                                         
                                                                              
                              Parameter Estimates                              
             Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------
gdp_pc_k        9.4107     18.083     0.5204

0,1,2,3
Dep. Variable:,fifa_points,R-squared:,-69.871
Estimator:,IV-2SLS,Adj. R-squared:,-70.358
No. Observations:,1320,F-statistic:,12.174
Date:,"Wed, Feb 04 2026",P-value (F-stat),0.0000
Time:,16:50:12,Distribution:,"F(9,1311)"
Cov. Estimator:,clustered,,
,,,

0,1,2,3,4,5,6
,Parameter,Std. Err.,T-stat,P-value,Lower CI,Upper CI
gdp_pc_k,9.4107,18.083,0.5204,0.6029,-26.064,44.885
gdp_pc_k_sq,-0.0761,0.1235,-0.6158,0.5381,-0.3183,0.1662
pop_m,5.5183,17.062,0.3234,0.7464,-27.954,38.991
pop_m_sq,-0.0027,0.0069,-0.3876,0.6984,-0.0163,0.0109
trade,-2.0191,4.4469,-0.4540,0.6499,-10.743,6.7048
infl,-0.0017,0.0047,-0.3504,0.7261,-0.0109,0.0076
oil,5.3385,13.912,0.3837,0.7012,-21.955,32.632
leb,-24.297,45.723,-0.5314,0.5952,-113.99,65.401
club,-21.309,40.368,-0.5279,0.5977,-100.50,57.883
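
The wide confidence intervals and strongly negative R-squared above may point to weak instruments, which this notebook does not test. One common diagnostic is the first-stage F-statistic for the excluded instruments. The sketch below computes it with plain NumPy on synthetic data that is treated as already demeaned. The function name `first_stage_f`, the simulated variables, and the homoskedastic F formula (no clustering) are all assumptions for illustration.

```python
import numpy as np


def first_stage_f(endog, exog, instr):
    """Homoskedastic F-test that the instruments add nothing to the first stage."""

    def rss(design):
        coef = np.linalg.lstsq(design, endog, rcond=None)[0]
        resid = endog - design @ coef
        return float(resid @ resid)

    full = np.column_stack([exog, instr])
    rss_restricted, rss_full = rss(exog), rss(full)
    q = instr.shape[1]                    # number of excluded instruments
    dof = endog.shape[0] - full.shape[1]  # residual degrees of freedom
    return ((rss_restricted - rss_full) / q) / (rss_full / dof)


rng = np.random.default_rng(1)
n = 500
exog = rng.normal(size=(n, 2))  # stand-ins for the exogenous controls
z = rng.normal(size=(n, 2))     # stand-ins for the two instruments
noise = rng.normal(size=n)

strong = 1.0 * z[:, 0] + 0.5 * z[:, 1] + noise  # instruments matter
weak = 0.02 * z[:, 0] + noise                   # instruments barely matter

f_strong = first_stage_f(strong, exog, z)
f_weak = first_stage_f(weak, exog, z)
print(f"strong first-stage F: {f_strong:.1f}")
print(f"weak first-stage F:   {f_weak:.1f}")
```

A frequently cited rule of thumb treats a first-stage F below about 10 as a warning sign. With country-clustered errors, as used in this notebook, a cluster-robust version of the statistic would be the more appropriate check.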


## Confederation-Level Analysis

Run regressions separately for each football confederation.

In [15]:
# Run by confederation
if "confed" in df.columns:
    confederations = sorted(df["confed"].dropna().unique())
    
    for confed in confederations:
        subset = df[df["confed"] == confed]
        if subset.empty:
            continue
        
        print(f"\n{'='*60}")
        print(f"CONFEDERATION: {confed}")
        print(f"Observations: {len(subset)}")
        print(f"Countries: {subset['country'].nunique()}")
        print(f"{'='*60}")
        
        panel_sub = subset.set_index(["country", "year"])
        
        # Equation (1)
        try:
            res1c = fit_fe_ols(panel_sub, dep, base)
            print(f"\nEquation (1) - {confed}")
            print(res1c.summary.tables[1])
        except Exception as e:
            print(f"\nEquation (1) failed for {confed}: {e}")
        
        # Equation (2)
        try:
            res2c = fit_fe_ols(panel_sub, dep, base + macro)
            print(f"\nEquation (2) - {confed}")
            print(res2c.summary.tables[1])
        except Exception as e:
            print(f"\nEquation (2) failed for {confed}: {e}")
        
        # Equation (3) - typically only for UEFA
        if confed.upper() == "UEFA" and club_available > 0:
            try:
                res3c = fit_fe_iv(subset, dep, base + macro, club, instr, "country", "year")
                print(f"\nEquation (3) - {confed}")
                print(res3c.summary.tables[1])
            except Exception as e:
                print(f"\nEquation (3) failed for {confed}: {e}")


CONFEDERATION: AFC
Observations: 720
Countries: 40

Equation (1) - AFC
                              Parameter Estimates                              
             Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------
gdp_pc_k        32.832     7.5408     4.3539     0.0000      18.023      47.641
gdp_pc_k_sq    -0.3267     0.0874    -3.7372     0.0002     -0.4984     -0.1550
pop_m           9.6184     3.0381     3.1659     0.0016      3.6520      15.585
pop_m_sq       -0.0037     0.0015    -2.4834     0.0133     -0.0067     -0.0008

Equation (2) - AFC
                              Parameter Estimates                              
             Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------
gdp_pc_k        10.145     9.5689     1.0602     0.2896     -8.6592      28.949
gdp_pc_k_sq    -0.0636     0

Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)


CONFEDERATION: UEFA
Observations: 810
Countries: 45

Equation (1) - UEFA
                              Parameter Estimates                              
             Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------
gdp_pc_k        42.324     6.1059     6.9317     0.0000      30.337      54.311
gdp_pc_k_sq    -0.2394     0.0515    -4.6447     0.0000     -0.3406     -0.1382
pop_m           116.99     67.845     1.7244     0.0851     -16.203      250.19
pop_m_sq       -0.6276     0.3291    -1.9070     0.0569     -1.2737      0.0185

Equation (2) - UEFA
                              Parameter Estimates                              
             Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------
gdp_pc_k        19.999     4.9178     4.0666     0.0001      10.343      29.655
gdp_pc_k_sq    -0.1722   

Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)


## Export Results

Save regression results to CSV files for further analysis.

In [16]:
# Create results directory
results_dir = Path("results")
results_dir.mkdir(exist_ok=True)

def save_summary(res, path):
    """Save regression summary to CSV."""
    table = res.summary.tables[1].as_csv()
    with open(path, "w") as f:
        f.write(table)
    print(f"Saved: {path}")

# Save full sample results
try:
    save_summary(res1, results_dir / "eq1_full.csv")
    save_summary(res2, results_dir / "eq2_full.csv")
    if 'res3' in dir():
        save_summary(res3, results_dir / "eq3_full.csv")
except Exception as e:
    print(f"Error saving results: {e}")

print(f"\nResults saved to {results_dir}/")
print(f"Files: {list(results_dir.glob('*.csv'))}")

Saved: results/eq1_full.csv
Saved: results/eq2_full.csv
Saved: results/eq3_full.csv

Results saved to results/
Files: [PosixPath('results/eq2_full.csv'), PosixPath('results/eq3_full.csv'), PosixPath('results/eq1_full.csv')]


## Summary and Comparison

### Key Findings from Actual Data

Compare these results to the synthetic data notebook and eventually to the original paper.

In [17]:
# Summary statistics
print("DATASET SUMMARY")
print("="*60)
print(f"Total observations: {len(df)}")
print(f"Countries: {df['country'].nunique()}")
print(f"Year range: {df['year'].min()}-{df['year'].max()}")
print(f"Balanced panel: {len(df) / (df['country'].nunique() * df['year'].nunique()):.1%} complete")

print("\nCONFEDERATION DISTRIBUTION:")
print(df['confed'].value_counts())

print("\nDEPENDENT VARIABLE (fifa_points):")
print(df['fifa_points'].describe())

print("\nKEY INDEPENDENT VARIABLES:")
for var in ['gdp_pc', 'pop', 'club', 'urbpop']:
    if var in df.columns:
        missing = df[var].isnull().sum()
        print(f"  {var}: {missing}/{len(df)} missing ({100*missing/len(df):.1f}%)")

DATASET SUMMARY
Total observations: 3276
Countries: 182
Year range: 1993-2010
Balanced panel: 100.0% complete

CONFEDERATION DISTRIBUTION:
confed
CAF         918
UEFA        810
AFC         720
CONCACAF    522
CONMEBOL    180
OFC         126
Name: count, dtype: int64

DEPENDENT VARIABLE (fifa_points):
count    3180.000000
mean      271.935535
std       278.741446
min         0.000000
25%        33.000000
50%       176.500000
75%       477.000000
max      1663.000000
Name: fifa_points, dtype: float64

KEY INDEPENDENT VARIABLES:
  gdp_pc: 111/3276 missing (3.4%)
  pop: 2/3276 missing (0.1%)
  club: 1274/3276 missing (38.9%)
  urbpop: 0/3276 missing (0.0%)


## Next Steps

### Validation Tasks
- [ ] Compare coefficients to original paper
- [ ] Verify sign and significance match expectations
- [ ] Check standard error magnitudes
- [ ] Validate R-squared values

### Data Quality Checks
- [ ] Investigate countries with missing club data
- [ ] Verify country name mappings
- [ ] Check for outliers in FIFA points
- [ ] Validate WDI data ranges

### Extensions
- [ ] Run placebo tests
- [ ] Test robustness to different time periods
- [ ] Add additional controls
- [ ] Export formatted tables

### For Comparison to Paper
1. Locate original Stata output files (if available)
2. Create coefficient comparison table
3. Document any discrepancies
4. Investigate sources of difference (Stata 8/9 defaults, data revisions, etc.)
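
For step 2, a minimal sketch of a comparison helper follows. It is shown on the equation (1) and (2) estimates from this notebook because the paper's published coefficients are not included here; the same function could be given a dictionary of paper values once they are located. The function name and column labels are assumptions.

```python
import pandas as pd


def coefficient_comparison(left, right, left_name="left", right_name="right"):
    """Align two {variable: coefficient} mappings and report their differences."""
    table = pd.DataFrame({left_name: pd.Series(left), right_name: pd.Series(right)})
    table["diff"] = table[right_name] - table[left_name]
    table["pct_diff"] = 100.0 * table["diff"] / table[left_name].abs()
    return table


# Point estimates copied from the equation (1) and (2) tables in this notebook.
eq1 = {"gdp_pc_k": 35.216, "gdp_pc_k_sq": -0.2108, "pop_m": 24.418, "pop_m_sq": -0.0104}
eq2 = {
    "gdp_pc_k": 33.170, "gdp_pc_k_sq": -0.2481, "pop_m": 13.455, "pop_m_sq": -0.0060,
    "trade": 3.1053, "infl": -0.0024, "oil": 2.9092, "leb": 41.406,
}

comparison = coefficient_comparison(eq1, eq2, "eq1", "eq2")
print(comparison.round(4))
```

Variables present in only one mapping appear with missing values in the other column, which makes specification differences visible alongside coefficient differences.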