# Replication with Actual Data

This notebook replicates the paper's econometric analysis using the **actual** panel dataset built from FIFA rankings, club rankings, and World Development Indicators.

**Data Sources**:
- FIFA World Rankings (1993-2010)
- FIFA Club Rankings (2000-2010)
- World Bank WDI (1993-2010)

**Output**: Regression results for equations (1), (2), and (3), saved to CSV files for each model that estimates successfully.

**Project docs**: [README.md](../README.md), [Data Dictionary](../docs/data-dictionary.md), [Data Pipeline](../docs/data-pipeline.md)

In [1]:
import sys
from pathlib import Path

import numpy as np
import pandas as pd
from IPython.display import display
from linearmodels.panel import PanelOLS
from linearmodels.iv import IV2SLS

print(f"Python: {sys.version}")
print(f"Pandas: {pd.__version__}")
print(f"Working directory: {Path.cwd()}")

Python: 3.14.2 (main, Dec  5 2025, 16:49:16) [Clang 17.0.0 (clang-1700.6.3.2)]
Pandas: 3.0.0
Working directory: /Users/dkritz/git/fifa-paper/notebooks


## Load Actual Panel Data

Load the panel dataset built from FIFA points, club rankings, and WDI data (1993-2010).

In [2]:
# Load the panel dataset
data_path = Path("../data/analysis/panel.csv")

if not data_path.exists():
    raise FileNotFoundError(f"Panel data not found at {data_path}. Run: python3 scripts/build_panel.py")

df = pd.read_csv(data_path)
print(f"Loaded {len(df)} observations")
print(f"Countries: {df['country'].nunique()}")
print(f"Years: {df['year'].min()}-{df['year'].max()}")
print(f"Variables: {list(df.columns)}")
print("\nConfederations:")
print(df['confed'].value_counts())

Loaded 3276 observations
Countries: 182
Years: 1993-2010
Variables: ['country', 'year', 'confed', 'fifa_points', 'gdp_pc', 'gdp_pc_sq', 'pop', 'pop_sq', 'trade', 'infl', 'oil', 'leb', 'club', 'urbpop', 'urbpop_sq']

Confederations:
confed
CAF         918
UEFA        810
AFC         720
CONCACAF    522
CONMEBOL    180
OFC         126
Name: count, dtype: int64
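Before the `(country, year)` pair is used as a panel index, it can help to confirm that the keys are unique and that every country has a row for every year. The following is a minimal sketch on a toy frame; `check_panel_keys` is a hypothetical helper and is not part of the repository's scripts.

```python
import pandas as pd


def check_panel_keys(frame: pd.DataFrame, entity: str = "country", time: str = "year") -> dict:
    """Report duplicate keys and row-level balance for a long-format panel."""
    n_dupes = int(frame.duplicated(subset=[entity, time]).sum())
    n_entities = frame[entity].nunique()
    n_periods = frame[time].nunique()
    rows_per_entity = frame.groupby(entity)[time].nunique()
    return {
        "duplicate_keys": n_dupes,
        "entities": n_entities,
        "periods": n_periods,
        # True only when every entity has exactly one row for every period
        "row_balanced": n_dupes == 0 and bool((rows_per_entity == n_periods).all()),
    }


# Toy data: country B has no 1994 row, so the panel is not row-balanced.
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B"],
        "year": [1993, 1994, 1993],
        "fifa_points": [10.0, 12.0, 7.0],
    }
)
report = check_panel_keys(toy)
print(report)
```

For the panel loaded above, 3276 rows equal 182 countries × 18 years, so the check would be expected to report a row-balanced panel provided there are no duplicate keys.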


### Data Snapshot

In [3]:
# Display the first rows, the dataset shape, and missing-value counts
display(df.head(10))
print(f"\nDataset shape: {df.shape}")

# Check for missing values
missing = df.isnull().sum()
print("\nMissing values:")
print(missing[missing > 0])

,country,year,confed,fifa_points,gdp_pc,gdp_pc_sq,pop,pop_sq,trade,infl,oil,leb,club,urbpop,urbpop_sq
0,Albania,1993,UEFA,15.0,380.808911,145015.4,3224901.0,10399990000000.0,77.715069,85.004751,4.245053,71.487,,37.9,1436.41
1,Albania,1994,UEFA,15.0,624.535068,390044.1,3179442.0,10108850000000.0,50.302012,22.565053,2.3091,71.617732,,38.4,1474.56
2,Albania,1995,UEFA,15.0,771.862535,595771.8,3141102.0,9866522000000.0,46.987422,7.793219,2.03482,71.870293,,38.9,1513.21
3,Albania,1996,UEFA,21.0,968.071952,937163.3,3112597.0,9688260000000.0,47.547304,12.725478,1.912895,72.241073,,39.46,1557.0916
4,Albania,1997,UEFA,21.0,710.314788,504547.1,3091902.0,9559858000000.0,47.210926,33.180274,1.783663,72.696073,,40.02,1601.6004
5,Albania,1998,UEFA,24.0,885.908631,784834.1,3079037.0,9480469000000.0,45.13491,20.642859,0.815573,73.197805,,40.58,1646.7364
6,Albania,1999,UEFA,29.0,1117.705767,1249266.0,3072725.0,9441639000000.0,49.315913,0.389438,0.873985,73.721829,,41.14,1692.4996
7,Albania,2000,UEFA,428.0,1200.137437,1440330.0,3071856.0,9436299000000.0,56.625339,0.050018,1.436102,74.238732,0.0,41.7,1738.89
8,Albania,2001,UEFA,480.0,1329.385031,1767265.0,3077378.0,9470255000000.0,59.230181,3.107588,1.02929,74.723659,0.0,42.32,1790.9824
9,Albania,2002,UEFA,448.0,1440.030143,2073687.0,3089778.0,9546728000000.0,67.135909,7.770526,1.219249,75.16122,0.0,42.94,1843.8436



Dataset shape: (3276, 15)

Missing values:
fifa_points      96
gdp_pc          111
gdp_pc_sq       111
pop               2
pop_sq            2
trade           232
infl            401
oil             914
leb             109
club           1274
dtype: int64


In [4]:
# Missingness summary (counts and rates), listed alphabetically by variable
missing_counts = df.isnull().sum()
summary = pd.DataFrame({"missing_count": missing_counts, "missing_rate": missing_counts / len(df)})
display(summary[summary["missing_count"] > 0].sort_index())

variable,missing_count,missing_rate
club,1274,0.388889
fifa_points,96,0.029304
gdp_pc,111,0.033883
gdp_pc_sq,111,0.033883
infl,401,0.122405
leb,109,0.033272
oil,914,0.278999
pop,2,0.000611
pop_sq,2,0.000611
trade,232,0.070818
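Because the estimators drop any row with a missing value in the dependent variable or a regressor, the usable sample shrinks as controls are added. The sketch below counts complete cases per specification on toy data; `complete_case_counts` is a hypothetical helper, and the toy columns only stand in for the notebook's variable lists.

```python
import numpy as np
import pandas as pd


def complete_case_counts(frame, dep, specs, entity="country"):
    """Count rows and entities with no missing values for each specification."""
    rows = []
    for name, regressors in specs.items():
        complete = frame.dropna(subset=[dep] + list(regressors))
        rows.append(
            {"spec": name, "n_obs": len(complete), "n_entities": complete[entity].nunique()}
        )
    return pd.DataFrame(rows).set_index("spec")


# Toy data with gaps in the dependent variable and in one control.
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "fifa_points": [10.0, 12.0, 7.0, np.nan],
        "gdp_pc": [1.0, 1.1, 2.0, 2.1],
        "trade": [50.0, np.nan, np.nan, 40.0],
    }
)
specs = {"eq1": ["gdp_pc"], "eq2": ["gdp_pc", "trade"]}
counts = complete_case_counts(toy, "fifa_points", specs)
print(counts)
```

In the run recorded in this notebook, Equation (1) is estimated on 3076 observations and Equation (2) on 2105, which is the kind of drop this count makes visible before any model is fitted.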


## Load Model Runner Helpers

Import the regression functions from the replication script.

In [5]:
# Add repo root to path
repo_root = Path.cwd().parent
sys.path.append(str(repo_root))

try:
    from scripts.replicate_stata import fit_fe_ols, fit_fe_iv
    print("Successfully imported fit_fe_ols and fit_fe_iv from scripts.replicate_stata")
except Exception as e:
    print(f"Import failed: {e}")
    print("Using fallback implementations...")
    
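    def two_way_demean(df, cols, entity, time):
        """Subtract entity and time means and add back the overall mean.

        This one-pass transform is exact only for balanced panels; with gaps
        (for example after dropna) it only approximates two-way fixed effects.
        """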
        out = df.copy()
        for col in cols:
            overall = out[col].mean()
            ent_mean = out.groupby(entity)[col].transform("mean")
            time_mean = out.groupby(time)[col].transform("mean")
            out[col] = out[col] - ent_mean - time_mean + overall
        return out

    def _drop_absorbed(panel, exog):
        names = list(panel.index.names)
        if len(names) < 2 or names[0] is None or names[1] is None:
            return exog, []
        entity, time = names[0], names[1]
        work = panel.reset_index()
        demeaned = two_way_demean(work, exog, entity, time)
        dropped = []
        keep = []
        for col in exog:
            if demeaned[col].var() <= 1e-12:
                dropped.append(col)
            else:
                keep.append(col)
        return keep, dropped

    def fit_fe_ols(panel, dep, exog):
        keep, dropped = _drop_absorbed(panel, exog)
        if dropped:
            print(f"Dropped absorbed variables: {', '.join(dropped)}")
        if not keep:
            raise ValueError("All regressors absorbed by fixed effects.")
        y = panel[dep]
        X = panel[keep]
        model = PanelOLS(y, X, entity_effects=True, time_effects=True)
        return model.fit(cov_type="clustered", cluster_entity=True, debiased=True)

    def fit_fe_iv(df, dep, exog, endog, instr, entity, time):
        cols = [dep] + exog + [endog] + instr + [entity, time]
        work = df[cols].dropna().copy()
        work = two_way_demean(work, [dep] + exog + [endog] + instr, entity, time)
        keep_exog = [c for c in exog if work[c].var() > 1e-12]
        if not keep_exog:
            raise ValueError("All exogenous regressors absorbed.")
        endog_v = work[endog]
        if endog_v.var() <= 1e-12:
            raise ValueError("Endogenous regressor absorbed.")
        keep_instr = [c for c in instr if work[c].var() > 1e-12]
        if not keep_instr:
            raise ValueError("All instruments absorbed.")
        y = work[dep]
        X = work[keep_exog]
        Z = work[keep_instr]
        model = IV2SLS(y, X, endog_v, Z)
        return model.fit(cov_type="clustered", clusters=work[entity], debiased=True)

Successfully imported fit_fe_ols and fit_fe_iv from scripts.replicate_stata


## Equation (1): Baseline Model

**Specification**: FIFA points ~ GDP per capita + GDP per capita² + Population + Population²

**Estimation**: Fixed-effects OLS with two-way (country + time) effects and clustered standard errors
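Written out with country and year effects, the baseline specification takes the form below. The symbols ($\beta_k$, $\alpha_i$ for country effects, $\lambda_t$ for year effects, $\varepsilon_{it}$ for the error term) are notation assumed here for illustration; the paper's own notation may differ.

$$
\text{FIFA}_{it} = \beta_1\,\text{GDPpc}_{it} + \beta_2\,\text{GDPpc}_{it}^2 + \beta_3\,\text{Pop}_{it} + \beta_4\,\text{Pop}_{it}^2 + \alpha_i + \lambda_t + \varepsilon_{it}
$$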

In [6]:
# Define variables
dep = "fifa_points"
base = ["gdp_pc", "gdp_pc_sq", "pop", "pop_sq"]

# Prepare panel
panel = df.set_index(["country", "year"])

# Run Equation (1)
print("="*80)
print("EQUATION (1): Baseline Model")
print("="*80)
try:
    res1 = fit_fe_ols(panel, dep, base)
except Exception as e:
    msg = str(e)
    if "full column rank" in msg or "absorbed" in msg:
        print("Rank/absorption issue detected. Retrying with check_rank=False and drop_absorbed=True.")
        y = panel[dep]
        X = panel[base]
        model = PanelOLS(
            y,
            X,
            entity_effects=True,
            time_effects=True,
            check_rank=False,
            drop_absorbed=True,
        )
        res1 = model.fit(cov_type="clustered", cluster_entity=True, debiased=True)
    else:
        raise
print(res1.summary)
display(res1.summary)

EQUATION (1): Baseline Model
Rank/absorption issue detected. Retrying with check_rank=False and drop_absorbed=True.
                          PanelOLS Estimation Summary                           
Dep. Variable:            fifa_points   R-squared:                        0.0021
Estimator:                   PanelOLS   R-squared (Between):             -0.0568
No. Observations:                3076   R-squared (Within):              -0.0018
Date:                Tue, Feb 03 2026   R-squared (Overall):             -0.0375
Time:                        23:14:21   Log-likelihood                -1.946e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      3.0938
Entities:                         179   P-value                           0.0455
Avg Obs:                       17.184   Distribution:                  F(2,2878)
Min Obs:                       1.0000                                     

Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
Variables have been fully absorbed and have removed from the regression:

gdp_pc, pop

  res1 = model.fit(cov_type="clustered", cluster_entity=True, debiased=True)


                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
gdp_pc_sq   2.272e-09  1.243e-08     0.1828     0.8550   -2.21e-08   2.664e-08
pop_sq     -3.484e-16  1.478e-16    -2.3574     0.0185  -6.382e-16  -5.862e-17
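The regressors in this specification span very different magnitudes: in the data snapshot, `pop_sq` is on the order of 1e13 while `gdp_pc` is in the hundreds. Such a spread can make a numerical rank check fail even when the variables are not truly collinear with the fixed effects. Whether that explains the dropped variables in this run is not established here. The sketch below only illustrates, on synthetic data, how rescaling the levels before squaring changes the condition number of the design matrix. The helper name and the chosen units (thousands for GDP per capita, millions for population) are assumptions.

```python
import numpy as np
import pandas as pd


def rescale_levels(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with GDP per capita in thousands and population in millions,
    recomputing the squared terms from the rescaled levels."""
    out = frame.copy()
    out["gdp_pc"] = out["gdp_pc"] / 1e3
    out["pop"] = out["pop"] / 1e6
    out["gdp_pc_sq"] = out["gdp_pc"] ** 2
    out["pop_sq"] = out["pop"] ** 2
    return out


# Synthetic data with magnitudes loosely resembling the panel (not real values).
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    {
        "gdp_pc": rng.uniform(300, 40_000, size=200),
        "pop": rng.uniform(1e5, 1e8, size=200),
    }
)
raw["gdp_pc_sq"] = raw["gdp_pc"] ** 2
raw["pop_sq"] = raw["pop"] ** 2

cols = ["gdp_pc", "gdp_pc_sq", "pop", "pop_sq"]
scaled = rescale_levels(raw)
cond_raw = np.linalg.cond(raw[cols].to_numpy())
cond_scaled = np.linalg.cond(scaled[cols].to_numpy())
print(f"condition number, raw units:      {cond_raw:.3e}")
print(f"condition number, rescaled units: {cond_scaled:.3e}")
```

Rescaling changes the units of the coefficients (for example, the coefficient on `pop` becomes an effect per million people) but not the fitted values, so estimates need converting back before they are compared with the paper.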


## Equation (2): Model with Macro/Resource Controls

**Specification**: Equation (1) + Trade + Inflation + Oil rents + Life expectancy

**Estimation**: Fixed-effects OLS with two-way effects and clustered SEs

In [7]:
# Define additional controls
macro = ["trade", "infl", "oil", "leb"]

# Run Equation (2)
print("="*80)
print("EQUATION (2): Model with Controls")
print("="*80)
try:
    res2 = fit_fe_ols(panel, dep, base + macro)
except Exception as e:
    msg = str(e)
    if "full column rank" in msg or "absorbed" in msg:
        print("Rank/absorption issue detected. Retrying with check_rank=False and drop_absorbed=True.")
        y = panel[dep]
        X = panel[base + macro]
        model = PanelOLS(
            y,
            X,
            entity_effects=True,
            time_effects=True,
            check_rank=False,
            drop_absorbed=True,
        )
        res2 = model.fit(cov_type="clustered", cluster_entity=True, debiased=True)
    else:
        raise
print(res2.summary)
display(res2.summary)

EQUATION (2): Model with Controls
Rank/absorption issue detected. Retrying with check_rank=False and drop_absorbed=True.
                          PanelOLS Estimation Summary                           
Dep. Variable:            fifa_points   R-squared:                        0.0091
Estimator:                   PanelOLS   R-squared (Between):             -0.1515
No. Observations:                2105   R-squared (Within):              -0.0088
Date:                Tue, Feb 03 2026   R-squared (Overall):             -0.1107
Time:                        23:14:21   Log-likelihood                -1.331e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      17.978
Entities:                         128   P-value                           0.0000
Avg Obs:                       16.445   Distribution:                  F(1,1959)
Min Obs:                       1.0000                                

Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
Variables have been fully absorbed and have removed from the regression:

gdp_pc, gdp_pc_sq, pop, trade, infl, oil, leb

  res2 = model.fit(cov_type="clustered", cluster_entity=True, debiased=True)


                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
pop_sq     -6.156e-16  1.102e-16    -5.5863     0.0000  -8.317e-16  -3.995e-16


## Equation (3): IV/2SLS with Club Strength

**Specification**: Equation (2) + Club strength (instrumented by urban population + urban population²)

**Estimation**: Fixed-effects IV/2SLS with clustered SEs

**Instruments**: urbpop, urbpop_sq (for endogenous "club" variable)
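In two-stage form, with $X_{it}$ collecting the Equation (2) regressors, the specification can be written as follows. The symbols are notation assumed here for illustration and are not taken from the paper.

$$
\begin{aligned}
\text{Club}_{it} &= \pi_1\,\text{UrbPop}_{it} + \pi_2\,\text{UrbPop}_{it}^2 + X_{it}'\gamma + a_i + d_t + u_{it} \\
\text{FIFA}_{it} &= \delta\,\widehat{\text{Club}}_{it} + X_{it}'\beta + \alpha_i + \lambda_t + \varepsilon_{it}
\end{aligned}
$$

The first stage predicts club strength from the instruments and the exogenous regressors; the second stage replaces club strength with that prediction. Identification relies on the instruments being correlated with club strength and affecting FIFA points only through it.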

In [8]:
# Define IV variables
club = "club"
instr = ["urbpop", "urbpop_sq"]

# Check club data availability
club_available = df[club].notna().sum()
club_total = len(df)
print(f"\nClub data availability: {club_available}/{club_total} ({100*club_available/club_total:.1f}%)")

# Run Equation (3) if we have club data
if club_available > 0:
    print("="*80)
    print("EQUATION (3): IV/2SLS with Club Strength")
    print("="*80)
    try:
        res3 = fit_fe_iv(df, dep, base + macro, club, instr, "country", "year")
        print(res3.summary)
        display(res3.summary)
    except Exception as e:
        print(f"Error running IV regression: {e}")
        print("\nThis may occur if:")
        print("- All instruments are absorbed by fixed effects")
        print("- No variation in instruments after demeaning")
else:
    print("No club data available - skipping Equation (3)")


Club data availability: 2002/3276 (61.1%)
EQUATION (3): IV/2SLS with Club Strength


Error running IV regression: regressors [exog endog] do not have full column rank

This may occur if:
- All instruments are absorbed by fixed effects
- No variation in instruments after demeaning
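The error reported above concerns the rank of the combined exogenous and endogenous regressors rather than the instruments alone, so a useful first diagnostic is the rank and conditioning of that regressor matrix after demeaning. The sketch below shows one way to summarise this on synthetic data; `rank_report` is a hypothetical helper, and the 1e-12 variance threshold simply mirrors the one used in the fallback functions.

```python
import numpy as np
import pandas as pd


def rank_report(X: pd.DataFrame) -> dict:
    """Summarise numerical rank, conditioning, and near-constant columns of X."""
    values = X.to_numpy(dtype=float)
    return {
        "n_columns": values.shape[1],
        "rank": int(np.linalg.matrix_rank(values)),
        "condition_number": float(np.linalg.cond(values)),
        "near_constant": [c for c in X.columns if X[c].var() <= 1e-12],
    }


rng = np.random.default_rng(1)
X = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
X["a_copy"] = 2.0 * X["a"]  # exact linear dependence on column "a"
X["flat"] = 0.0             # no variation, as after fixed effects absorb a variable

report = rank_report(X)
print(report)
```

A rank below the number of columns with no near-constant column points to collinearity among regressors, while a full-rank matrix with a very large condition number points to a scaling problem.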


## Confederation-Level Analysis

Run regressions separately for each football confederation.

In [9]:
# Run by confederation
if "confed" in df.columns:
    confederations = sorted(df["confed"].dropna().unique())
    
    for confed in confederations:
        subset = df[df["confed"] == confed]
        if subset.empty:
            continue
        
        print(f"\n{'='*60}")
        print(f"CONFEDERATION: {confed}")
        print(f"Observations: {len(subset)}")
        print(f"Countries: {subset['country'].nunique()}")
        print(f"{'='*60}")
        
        panel_sub = subset.set_index(["country", "year"])
        
        # Equation (1)
        try:
            res1c = fit_fe_ols(panel_sub, dep, base)
            print(f"\nEquation (1) - {confed}")
            print(res1c.summary.tables[1])
        except Exception as e:
            print(f"\nEquation (1) failed for {confed}: {e}")
        
        # Equation (2)
        try:
            res2c = fit_fe_ols(panel_sub, dep, base + macro)
            print(f"\nEquation (2) - {confed}")
            print(res2c.summary.tables[1])
        except Exception as e:
            print(f"\nEquation (2) failed for {confed}: {e}")
        
        # Equation (3) - typically only for UEFA
        if confed.upper() == "UEFA" and club_available > 0:
            try:
                res3c = fit_fe_iv(subset, dep, base + macro, club, instr, "country", "year")
                print(f"\nEquation (3) - {confed}")
                print(res3c.summary.tables[1])
            except Exception as e:
                print(f"\nEquation (3) failed for {confed}: {e}")


CONFEDERATION: AFC
Observations: 720
Countries: 40



Equation (1) failed for AFC: exog does not have full column rank. If you wish to proceed with model estimation irrespective of the numerical accuracy of coefficient estimates, you can set check_rank=False.



Equation (2) failed for AFC: exog does not have full column rank. If you wish to proceed with model estimation irrespective of the numerical accuracy of coefficient estimates, you can set check_rank=False.

CONFEDERATION: CAF
Observations: 918
Countries: 51


Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)



Equation (1) - CAF
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
gdp_pc        -0.0019     0.0168    -0.1117     0.9111     -0.0348      0.0311
gdp_pc_sq   2.736e-07  6.414e-07     0.4266     0.6698  -9.854e-07   1.533e-06
pop         1.504e-05  1.537e-06     9.7867     0.0000   1.203e-05   1.806e-05
pop_sq     -9.291e-15  1.181e-14    -0.7866     0.4317  -3.248e-14   1.389e-14

Equation (2) failed for CAF: exog does not have full column rank. If you wish to proceed with model estimation irrespective of the numerical accuracy of coefficient estimates, you can set check_rank=False.

CONFEDERATION: CONCACAF
Observations: 522
Countries: 29

Equation (1) - CONCACAF
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    L

Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)



Equation (2) failed for CONCACAF: exog does not have full column rank. If you wish to proceed with model estimation irrespective of the numerical accuracy of coefficient estimates, you can set check_rank=False.

CONFEDERATION: CONMEBOL
Observations: 180
Countries: 10

Equation (1) - CONMEBOL
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
gdp_pc        -0.1195     0.0459    -2.6014     0.0102     -0.2103     -0.0287
gdp_pc_sq   6.988e-06  2.512e-06     2.7813     0.0061   2.024e-06   1.195e-05
pop         2.093e-05  5.335e-06     3.9223     0.0001   1.038e-05   3.147e-05
pop_sq      5.989e-15   1.55e-14     0.3863     0.6998  -2.464e-14   3.662e-14

Equation (2) failed for CONMEBOL: exog does not have full column rank. If you wish to proceed with model estimation irrespective of the numerical accura


Equation (1) - OFC
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
gdp_pc         0.0412     0.0153     2.6908     0.0086      0.0108      0.0717
gdp_pc_sq  -8.813e-07  4.144e-07    -2.1266     0.0364  -1.705e-06  -5.718e-08
pop            0.0008     0.0005     1.5838     0.1170     -0.0002      0.0019
pop_sq     -6.993e-11  4.329e-11    -1.6152     0.1100   -1.56e-10   1.617e-11


Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)



Equation (2) failed for OFC: 
The model cannot be estimated. The included effects have fully absorbed
one or more of the variables. This occurs when one or more of the dependent
variable is perfectly explained using the effects included in the model.

The following variables or variable combinations have been fully absorbed
or have become perfectly collinear after effects are removed:

          gdp_pc, gdp_pc_sq, pop, pop_sq, trade, infl, oil, leb

Set drop_absorbed=True to automatically drop absorbed variables.




CONFEDERATION: UEFA
Observations: 810
Countries: 45



Equation (1) - UEFA
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
gdp_pc         0.0079     0.0049     1.6056     0.1088     -0.0018      0.0175
gdp_pc_sq  -7.081e-08  2.655e-08    -2.6669     0.0078  -1.229e-07  -1.868e-08
pop         2.194e-05  1.058e-05     2.0743     0.0384   1.173e-06    4.27e-05
pop_sq     -7.931e-14  1.346e-13    -0.5894     0.5558  -3.435e-13   1.849e-13


Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)



Equation (2) failed for UEFA: exog does not have full column rank. If you wish to proceed with model estimation irrespective of the numerical accuracy of coefficient estimates, you can set check_rank=False.

Equation (3) failed for UEFA: regressors [exog endog] do not have full column rank
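With this many subsample failures, it can be easier to collect outcomes in a table than to read them from interleaved printouts. The sketch below records, for each group, either the fitted result or the error text. `run_by_group` and `toy_fit` are hypothetical; in the notebook the `fit` argument would wrap a call such as `fit_fe_ols`.

```python
import pandas as pd


def run_by_group(frame, group, fit):
    """Apply `fit` to each group's rows; keep the result or the error message."""
    results, status = {}, []
    for name, subset in frame.groupby(group):
        try:
            results[name] = fit(subset)
            status.append({"group": name, "ok": True, "error": ""})
        except Exception as exc:
            status.append({"group": name, "ok": False, "error": str(exc)})
    return results, pd.DataFrame(status).set_index("group")


def toy_fit(subset):
    # Stand-in for a regression call: fails on tiny samples, else returns a mean.
    if len(subset) < 2:
        raise ValueError("too few observations")
    return subset["fifa_points"].mean()


toy = pd.DataFrame(
    {
        "confed": ["UEFA", "UEFA", "OFC"],
        "fifa_points": [10.0, 14.0, 3.0],
    }
)
results, status = run_by_group(toy, "confed", toy_fit)
print(status)
```

The status table can then be filtered for failed groups and the error text inspected in one place.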


## Export Results

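Save regression results to CSV files for further analysis. Only models estimated in this session are written, and files left in `results/` by earlier runs are not removed. This is presumably why `eq3_full.csv` appears in the file listing below even though Equation (3) failed in this run.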

In [10]:
# Create results directory
results_dir = Path("results")
results_dir.mkdir(exist_ok=True)

def save_summary(res, path):
    """Save regression summary to CSV."""
    table = res.summary.tables[1].as_csv()
    with open(path, "w") as f:
        f.write(table)
    print(f"Saved: {path}")

# Save full sample results
try:
    save_summary(res1, results_dir / "eq1_full.csv")
    save_summary(res2, results_dir / "eq2_full.csv")
    if 'res3' in dir():
        save_summary(res3, results_dir / "eq3_full.csv")
except Exception as e:
    print(f"Error saving results: {e}")

print(f"\nResults saved to {results_dir}/")
print(f"Files: {list(results_dir.glob('*.csv'))}")

Saved: results/eq1_full.csv
Saved: results/eq2_full.csv

Results saved to results/
Files: [PosixPath('results/eq2_full.csv'), PosixPath('results/eq3_full.csv'), PosixPath('results/eq1_full.csv')]
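As an alternative to writing the rendered summary table, the estimates can be collected into a tidy frame with one row per regressor, which is easier to merge and compare later. The sketch assumes the results object exposes `params`, `std_errors`, `tstats`, and `pvalues` as pandas Series, as `linearmodels` results objects are expected to. `tidy_coefficients` is a hypothetical helper, and the stand-in object below just carries the Equation (1) estimates reported earlier.

```python
from types import SimpleNamespace

import pandas as pd


def tidy_coefficients(res) -> pd.DataFrame:
    """Collect point estimates and inference statistics, one row per term."""
    return pd.DataFrame(
        {
            "coef": res.params,
            "std_err": res.std_errors,
            "t_stat": res.tstats,
            "p_value": res.pvalues,
        }
    ).rename_axis("term")


# Stand-in for a fitted model, holding the Equation (1) full-sample estimates.
idx = ["gdp_pc_sq", "pop_sq"]
stub = SimpleNamespace(
    params=pd.Series([2.272e-09, -3.484e-16], index=idx),
    std_errors=pd.Series([1.243e-08, 1.478e-16], index=idx),
    tstats=pd.Series([0.1828, -2.3574], index=idx),
    pvalues=pd.Series([0.8550, 0.0185], index=idx),
)
table = tidy_coefficients(stub)
print(table.to_csv(float_format="%.6g"))
```

Calling `table.to_csv(path)` with a file path writes the same content to disk.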


## Summary and Comparison

### Key Findings from Actual Data

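In this run, the full-sample estimates should be treated with caution: Equation (1) retained only `gdp_pc_sq` and `pop_sq` after `gdp_pc` and `pop` were dropped as absorbed, Equation (2) retained only `pop_sq`, and Equation (3) failed with a rank error. Once these issues are resolved, compare the results to the synthetic-data notebook and, eventually, to the original paper.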

In [11]:
# Summary statistics
print("DATASET SUMMARY")
print("="*60)
print(f"Total observations: {len(df)}")
print(f"Countries: {df['country'].nunique()}")
print(f"Year range: {df['year'].min()}-{df['year'].max()}")
# Share of country-year cells that have a row; missing values within rows are not counted
print(f"Balanced panel: {len(df) / (df['country'].nunique() * df['year'].nunique()):.1%} complete")

print("\nCONFEDERATION DISTRIBUTION:")
print(df['confed'].value_counts())

print("\nDEPENDENT VARIABLE (fifa_points):")
print(df['fifa_points'].describe())

print("\nKEY INDEPENDENT VARIABLES:")
for var in ['gdp_pc', 'pop', 'club', 'urbpop']:
    if var in df.columns:
        missing = df[var].isnull().sum()
        print(f"  {var}: {missing}/{len(df)} missing ({100*missing/len(df):.1f}%)")

DATASET SUMMARY
Total observations: 3276
Countries: 182
Year range: 1993-2010
Balanced panel: 100.0% complete

CONFEDERATION DISTRIBUTION:
confed
CAF         918
UEFA        810
AFC         720
CONCACAF    522
CONMEBOL    180
OFC         126
Name: count, dtype: int64

DEPENDENT VARIABLE (fifa_points):
count    3180.000000
mean      271.935535
std       278.741446
min         0.000000
25%        33.000000
50%       176.500000
75%       477.000000
max      1663.000000
Name: fifa_points, dtype: float64

KEY INDEPENDENT VARIABLES:
  gdp_pc: 111/3276 missing (3.4%)
  pop: 2/3276 missing (0.1%)
  club: 1274/3276 missing (38.9%)
  urbpop: 0/3276 missing (0.0%)


## Next Steps

### Validation Tasks
- [ ] Compare coefficients to original paper
- [ ] Verify sign and significance match expectations
- [ ] Check standard error magnitudes
- [ ] Validate R-squared values

### Data Quality Checks
- [ ] Investigate missing club data (the 1274 missing values equal 182 countries × 7 years, consistent with club rankings starting in 2000)
- [ ] Verify country name mappings
- [ ] Check for outliers in FIFA points
- [ ] Validate WDI data ranges

### Extensions
- [ ] Run placebo tests
- [ ] Test robustness to different time periods
- [ ] Add additional controls
- [ ] Export formatted tables

### For Comparison to Paper
1. Locate original Stata output files (if available)
2. Create coefficient comparison table
3. Document any discrepancies
4. Investigate sources of difference (Stata 8/9 defaults, data revisions, etc.)
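The coefficient comparison in step 2 can be sketched as a small alignment helper. Everything in the example is illustrative: `compare_coefficients` is a hypothetical helper, and the term names and numbers are placeholders, not values from the paper or from this notebook.

```python
import numpy as np
import pandas as pd


def compare_coefficients(replicated: pd.Series, paper: pd.Series) -> pd.DataFrame:
    """Align two coefficient vectors by term and flag sign agreement."""
    table = pd.concat({"replicated": replicated, "paper": paper}, axis=1)
    table["diff"] = table["replicated"] - table["paper"]
    # A term is comparable only when both sources report it.
    table["comparable"] = table[["replicated", "paper"]].notna().all(axis=1)
    same = np.sign(table["replicated"]) == np.sign(table["paper"])
    table["same_sign"] = table["comparable"] & same
    return table


# PLACEHOLDER values for illustration only (not taken from the paper).
replicated = pd.Series({"term_a": 0.5, "term_b": -1.2})
paper = pd.Series({"term_a": 0.4, "term_b": 0.9, "term_c": 2.0})

comparison = compare_coefficients(replicated, paper)
print(comparison)
```

Terms present in only one source show up with `comparable` set to `False`, which gives a starting list for step 3.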