# VAR Fairness Audit: Statistical Analysis

**DS 112 Final Project**

This notebook uses a chi-square test of independence and a logistic regression model to check whether favorable VAR decisions are associated with team characteristics.

In [None]:
# Install required packages
!pip install pandas numpy matplotlib seaborn plotly scikit-learn scipy

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting style
plt.style.use('default')
plt.rcParams['figure.figsize'] = (10, 6)

## Load and Prepare Data

First, let's load the combined dataset and prepare it for statistical analysis.

In [None]:
# Load the combined dataset
df = pd.read_csv('var_combined.csv')

# Display basic information
print("Dataset shape:", df.shape)
df.head()
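
The later cells assume certain columns are present in `var_combined.csv`. Before running them, a quick check along the following lines might help. This is only a sketch: the helper name `check_columns` and the `REQUIRED_COLUMNS` list are illustrative, collected from the column names used further down in this notebook, and are not part of the dataset's documentation.

```python
import pandas as pd

# Column names used by the later cells of this notebook (assumed to exist in the CSV).
REQUIRED_COLUMNS = [
    "team_rank",
    "decision_favorable",
    "market_value",
    "avg_attendance",
    "historical_success",
]


def check_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> pd.Series:
    """Raise KeyError if a required column is absent.

    Otherwise return the number of missing values per required column.
    """
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    return frame[required].isna().sum()
```

Calling `check_columns(df)` right after loading would either fail early with the names of the absent columns or show how many missing values each column has.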

## Statistical Tests

Let's perform some statistical tests to check for potential bias in VAR decisions.

In [None]:
# Import statistical libraries
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

### Chi-Square Test

Let's test whether there's a relationship between team ranking and favorable VAR decisions. The null hypothesis is that the two are independent.

In [None]:
# Create a contingency table (team rank vs. favorable decisions)
# Assuming 'team_rank' and 'decision_favorable' columns exist

# Bin teams into four equal-sized tiers by ranking.
# The labels assume rank 1 is the best team, so the lowest quartile is 'Top Tier'.
tier_labels = ['Top Tier', 'Upper Mid', 'Lower Mid', 'Bottom Tier']
df['team_tier'] = pd.qcut(df['team_rank'], q=4, labels=tier_labels)

# Create contingency table
contingency = pd.crosstab(df['team_tier'], df['decision_favorable'])
print("Contingency Table:")
print(contingency)

# Chi-square test
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"p-value: {p:.4f}")
print(f"Degrees of freedom: {dof}")
print("\nExpected frequencies:")
print(pd.DataFrame(expected, index=contingency.index, columns=contingency.columns))
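
A p-value alone does not say how strong an association is, and the chi-square approximation is usually considered unreliable when expected counts are small (a common rule of thumb is at least 5 per cell). The sketch below adds Cramér's V as an effect size and reports the smallest expected count. The helper name `chi_square_summary` is made up for illustration, and the 5-per-cell threshold is a rule of thumb, not a hard requirement.

```python
import numpy as np
import pandas as pd
from scipy import stats


def chi_square_summary(table: pd.DataFrame) -> dict:
    """Chi-square test plus Cramér's V and the smallest expected count."""
    # Yates' correction only affects 2x2 tables; it is disabled here so that
    # Cramér's V is always computed from the uncorrected statistic.
    chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    cramers_v = np.sqrt(chi2 / (n * min_dim))
    return {
        "chi2": chi2,
        "p_value": p,
        "dof": dof,
        "cramers_v": cramers_v,          # 0 = no association, 1 = perfect association
        "min_expected": expected.min(),  # rule of thumb: should be >= 5
    }
```

In this notebook, `chi_square_summary(contingency)` would give the same statistic as the cell above (the table has more than two rows, so no correction applies), together with the effect size.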

### Logistic Regression Model

Let's build a model to predict favorable VAR decisions based on team characteristics.

In [None]:
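# Prepare features and target variable
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = df[['team_rank', 'market_value', 'avg_attendance', 'historical_success']]
y = df['decision_favorable']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features before fitting: they are on very different scales
# (e.g. market value vs. rank), so raw coefficients would not be comparable
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Get the coefficients of the fitted logistic regression step
coefs = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.named_steps['logisticregression'].coef_[0]
})
coefs = coefs.sort_values('Coefficient', ascending=False)

# Plot coefficients (change in log-odds per one standard deviation of each feature)
plt.figure(figsize=(10, 6))
sns.barplot(x='Coefficient', y='Feature', data=coefs)
plt.title('Standardized Coefficients for Predicting Favorable VAR Decisions')
plt.axvline(x=0, color='black', linestyle='--')
plt.tight_layout()
plt.show()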
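
The coefficients only mean something if the model predicts reasonably well on data it has not seen. The notebook imports `classification_report` and `confusion_matrix` but has not used them yet, so here is one possible way to apply them to the held-out test set. The helper name `evaluate_classifier` is an illustrative choice, not something defined elsewhere in the project.

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix


def evaluate_classifier(fitted_model, X_eval, y_eval):
    """Return a labeled confusion matrix and a text classification report."""
    y_pred = fitted_model.predict(X_eval)
    labels = fitted_model.classes_  # class labels seen during fitting
    cm = pd.DataFrame(
        confusion_matrix(y_eval, y_pred, labels=labels),
        index=[f"actual {label}" for label in labels],
        columns=[f"predicted {label}" for label in labels],
    )
    # zero_division=0 avoids warnings if a class is never predicted
    report = classification_report(y_eval, y_pred, labels=labels, zero_division=0)
    return cm, report
```

In this notebook that would be `cm, report = evaluate_classifier(model, X_test, y_test)`, followed by `print(cm)` and `print(report)`. If accuracy is close to the share of the majority class, the coefficients should be read with caution.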