# 07. Linear Regression for Player Points Prediction

This notebook demonstrates building regression models to predict individual player statistics.

## Objectives
- Predict player points per game
- Build Linear, Ridge, and Lasso regression models
- Perform residual analysis
- Check regression assumptions
- Compare regularization techniques

## Regression vs Classification
- **Classification** (Notebooks 03-06): Predict categories (Win/Loss)
- **Regression** (This notebook): Predict continuous values (Points)

## Why Regression for Player Stats?
- Predict exact point totals
- Understand which factors drive scoring
- Identify over/under-performing players
- Fantasy sports applications

In [None]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.data_processing.cleaning import DataCleaner
from src.data_processing.player_features import PlayerFeatureEngineer
from src.data_processing.dataset_builder import DatasetBuilder
from src.models.linear_regression_model import PlayerLinearRegression
from src.models.ridge_lasso_regression import PlayerRidgeRegression, PlayerLassoRegression
from src.evaluation.model_comparison import ModelComparison
from src.evaluation.metrics import RegressionMetrics
from src.utils.data_loader import load_player_stats_as_dataframe

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

print("✓ Imports successful")

## 1. Load Player Statistics Data

In [None]:
# Load player stats
try:
    stats_df = load_player_stats_as_dataframe(season=2023)
    print(f"Loaded {len(stats_df)} player stat records")
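except Exception as exc:  # fall back to synthetic data if the real load fails
    print(f"Could not load season data ({exc}); loading sample data...")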
    from scripts.generate_sample_data import generate_sample_player_stats
    stats_df = pd.DataFrame(generate_sample_player_stats(100))
    print(f"Generated {len(stats_df)} sample player stats")

print(f"\nDataset shape: {stats_df.shape}")
print("\nSample data:")
stats_df.head()

In [None]:
# Explore target variable (points)
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(stats_df['pts'], bins=30, edgecolor='black')
plt.xlabel('Points')
plt.ylabel('Frequency')
plt.title('Distribution of Player Points')

plt.subplot(1, 2, 2)
plt.boxplot(stats_df['pts'])
plt.ylabel('Points')
plt.title('Points Distribution (Box Plot)')

plt.tight_layout()
plt.show()

print("Points Statistics:")
print(stats_df['pts'].describe())

## 2. Clean and Engineer Features

In [None]:
# Clean data
cleaner = DataCleaner()
stats_df = cleaner.clean_player_stats(stats_df)
print(f"After cleaning: {len(stats_df)} records")

# Engineer features
engineer = PlayerFeatureEngineer()
features_df = engineer.create_player_features(
    stats_df,
    include_target=True,
    target_column='pts'
)

print(f"\nCreated {len(features_df.columns)} features")
print(f"Features: {features_df.columns.tolist()}")

## 3. Create Train/Val/Test Splits

In [None]:
# Build dataset
builder = DatasetBuilder()
dataset = builder.create_dataset(
    df=features_df,
    target_column='target',  # Points are stored as 'target'
    date_column='game_date',
    split_method='time',
    scale_features=True,  # Standardize so Ridge/Lasso penalties treat all features equally
    exclude_columns=['player_id', 'game_id']
)

print("Dataset splits:")
print(f"  Training:   {len(dataset['X_train'])} samples")
print(f"  Validation: {len(dataset['X_val'])} samples")
print(f"  Testing:    {len(dataset['X_test'])} samples")
print(f"\nFeatures: {dataset['X_train'].shape[1]}")
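
The cell above relies on `DatasetBuilder` to split by time and scale the features. As a rough illustration of what such a step usually involves, the sketch below sorts by date, cuts the rows chronologically, and standardizes using statistics from the training rows only. The function name `time_split_and_scale`, the 70/15/15 fractions, and the demo columns are assumptions made for illustration; the actual `DatasetBuilder` implementation may differ.

```python
import numpy as np
import pandas as pd

def time_split_and_scale(df, feature_cols, target_col, date_col,
                         train_frac=0.7, val_frac=0.15):
    """Hypothetical helper: chronological split plus train-only standardization."""
    # Oldest games first, so validation and test rows come after training rows
    df = df.sort_values(date_col).reset_index(drop=True)
    n = len(df)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    parts = {
        "train": df.iloc[:train_end],
        "val": df.iloc[train_end:val_end],
        "test": df.iloc[val_end:],
    }
    # Scaling statistics come from the training rows only, to avoid leakage
    mean = parts["train"][feature_cols].mean()
    std = parts["train"][feature_cols].std(ddof=0).replace(0, 1.0)
    out = {}
    for name, part in parts.items():
        out[f"X_{name}"] = ((part[feature_cols] - mean) / std).to_numpy()
        out[f"y_{name}"] = part[target_col].to_numpy()
    return out

# Made-up demo data, deliberately stored newest-first to show the sort matters
demo = pd.DataFrame({
    "game_date": pd.date_range("2023-01-01", periods=20, freq="D")[::-1],
    "minutes": np.linspace(10, 40, 20),
    "target": np.linspace(5, 30, 20),
})
splits = time_split_and_scale(demo, ["minutes"], "target", "game_date")
print({key: value.shape for key, value in splits.items()})
```

Fitting the scaler on the training rows alone keeps information from later games out of the model, which is the main reason to prefer a time-based split for game logs.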

## 4. Linear Regression (No Regularization)

In [None]:
# Train linear regression
linear_model = PlayerLinearRegression()

print("Training Linear Regression...")
linear_model.train(
    dataset['X_train'],
    dataset['y_train'],
    dataset['X_val'],
    dataset['y_val']
)

# Evaluate
test_metrics = linear_model.evaluate(dataset['X_test'], dataset['y_test'])

print("\n" + "="*60)
print("LINEAR REGRESSION RESULTS")
print("="*60)
for metric, value in test_metrics.items():
    print(f"{metric:20s}: {value:.4f}")

## 5. Residual Analysis

In [None]:
# Check regression assumptions
linear_model.check_assumptions(dataset['X_test'], dataset['y_test'])
plt.show()

# Analyze residuals
residuals_stats = linear_model.analyze_residuals(dataset['X_test'], dataset['y_test'])
print("\nResidual Statistics:")
for key, value in residuals_stats.items():
    print(f"{key:15s}: {value:.4f}")
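
The project methods `check_assumptions` and `analyze_residuals` are not shown in this notebook, so the following is only a sketch of the kind of numbers a residual analysis typically reports. It computes the residual mean (should be near zero), skewness (a rough normality check), and the correlation between predictions and absolute residuals (a rough check for non-constant variance). The function name `residual_summary` and the synthetic data are assumptions.

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Hypothetical helper: a few simple residual diagnostics."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    centered = residuals - residuals.mean()
    std = residuals.std()
    # Sample skewness: near 0 for roughly symmetric residuals
    skewness = float(np.mean(centered ** 3) / std ** 3) if std > 0 else 0.0
    # Positive correlation suggests the error spread grows with the prediction
    spread_corr = float(np.corrcoef(y_pred, np.abs(residuals))[0, 1])
    return {
        "mean": float(residuals.mean()),
        "std": float(std),
        "skewness": skewness,
        "abs_resid_vs_pred_corr": spread_corr,
    }

# Synthetic example where the noise grows with the predicted value
rng = np.random.default_rng(0)
y_pred = rng.uniform(5, 30, size=500)
y_true = y_pred + rng.normal(0, 1, size=500) * (y_pred / 10)
summary = residual_summary(y_true, y_pred)
for key, value in summary.items():
    print(f"{key:25s}: {value:.4f}")
```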

## 6. Ridge Regression (L2 Regularization)

In [None]:
# Train Ridge regression
ridge_model = PlayerRidgeRegression()

print("Training Ridge Regression with L2 regularization...")
ridge_model.train(
    dataset['X_train'],
    dataset['y_train'],
    dataset['X_val'],
    dataset['y_val'],
    tune_alpha=True
)

ridge_metrics = ridge_model.evaluate(dataset['X_test'], dataset['y_test'])

print("\n" + "="*60)
print("RIDGE REGRESSION RESULTS")
print("="*60)
print(f"Best alpha: {ridge_model.model.alpha}")
for metric, value in ridge_metrics.items():
    print(f"{metric:20s}: {value:.4f}")
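
How `tune_alpha=True` searches for the penalty strength is internal to `PlayerRidgeRegression`. One common approach, sketched here under that caveat, is to fit a ridge model for each candidate alpha on the training split and keep the one with the lowest validation MAE. The closed-form `fit_ridge`, the alpha grid, and the synthetic data are illustrative assumptions, not the project's code.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge fit with an unpenalized intercept."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - X_mean @ coef
    return coef, intercept

def tune_alpha(X_train, y_train, X_val, y_val, alphas):
    """Return the alpha with the lowest validation MAE, plus all scores."""
    scores = {}
    for alpha in alphas:
        coef, intercept = fit_ridge(X_train, y_train, alpha)
        pred = X_val @ coef + intercept
        scores[alpha] = float(np.mean(np.abs(y_val - pred)))
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic data: 3 informative features, 7 noise features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.0] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=2.0, size=200) + 15.0

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_alpha, val_mae = tune_alpha(X[:140], y[:140], X[140:], y[140:], alphas)
print(f"Best alpha: {best_alpha}")
```

Selecting alpha on the validation split, rather than the test split, keeps the test metrics an honest estimate of out-of-sample error.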

## 7. Lasso Regression (L1 Regularization + Feature Selection)

In [None]:
# Train Lasso regression
lasso_model = PlayerLassoRegression()

print("Training Lasso Regression with L1 regularization...")
lasso_model.train(
    dataset['X_train'],
    dataset['y_train'],
    dataset['X_val'],
    dataset['y_val'],
    tune_alpha=True
)

lasso_metrics = lasso_model.evaluate(dataset['X_test'], dataset['y_test'])

print("\n" + "="*60)
print("LASSO REGRESSION RESULTS")
print("="*60)
print(f"Best alpha: {lasso_model.model.alpha}")
for metric, value in lasso_metrics.items():
    print(f"{metric:20s}: {value:.4f}")

# Feature selection
selected_features = lasso_model.get_selected_features()
print(f"\n✓ Lasso selected {len(selected_features)} features (out of {dataset['X_train'].shape[1]})")
print(f"Selected features: {selected_features}")

## 8. Compare All Regression Models

In [None]:
# Compare models
comparison = ModelComparison(task_type='regression')
comparison.add_model('Linear Regression', linear_model, dataset['X_test'], dataset['y_test'])
comparison.add_model('Ridge Regression', ridge_model, dataset['X_test'], dataset['y_test'])
comparison.add_model('Lasso Regression', lasso_model, dataset['X_test'], dataset['y_test'])

results = comparison.compare_all()
print("\n" + "="*80)
print("REGRESSION MODEL COMPARISON")
print("="*80)
print(results)

best_name, best_model = comparison.get_best_model(metric='mae')
print(f"\n✓ Best Model (by MAE): {best_name}")

## 9. Prediction Visualizations

In [None]:
# Predictions vs Actual for all models
metrics_helper = RegressionMetrics()

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (name, model) in enumerate([
    ('Linear', linear_model),
    ('Ridge', ridge_model),
    ('Lasso', lasso_model)
]):
    y_pred = model.predict(dataset['X_test'])
    plt.sca(axes[idx])
    metrics_helper.plot_predictions_vs_actual(dataset['y_test'], y_pred)
    axes[idx].set_title(f'{name} Regression\nPredictions vs Actual')

plt.tight_layout()
plt.show()

In [None]:
# Residual plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (name, model) in enumerate([
    ('Linear', linear_model),
    ('Ridge', ridge_model),
    ('Lasso', lasso_model)
]):
    y_pred = model.predict(dataset['X_test'])
    plt.sca(axes[idx])
    metrics_helper.plot_residuals(dataset['y_test'], y_pred)
    axes[idx].set_title(f'{name} Regression\nResiduals')

plt.tight_layout()
plt.show()

## 10. Feature Importance (Linear Coefficients)

In [None]:
# Get feature importance from linear model
importance_df = linear_model.get_feature_importance(dataset['feature_names'])

# Plot top features
top_features = importance_df.head(15)

plt.figure(figsize=(10, 8))
colors = ['green' if x > 0 else 'red' for x in top_features['coefficient']]
plt.barh(range(len(top_features)), top_features['importance'], color=colors, alpha=0.7)
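plt.yticks(range(len(top_features)), top_features['feature'])
plt.gca().invert_yaxis()  # barh draws the first row at the bottom; flip so the top feature is on top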
plt.xlabel('Coefficient Magnitude')
plt.title('Top 15 Features for Points Prediction\n(Green = Positive, Red = Negative)')
plt.tight_layout()
plt.show()

print("Top 10 Features:")
print(importance_df.head(10))
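
The plotting cell above assumes `get_feature_importance` returns a table with `feature`, `coefficient`, and `importance` columns, sorted by importance. A minimal sketch of how such a table could be built from fitted coefficients is shown below; the helper name `coefficient_table` and the example feature names and values are hypothetical.

```python
import pandas as pd

def coefficient_table(feature_names, coefficients):
    """Hypothetical helper: rank features by absolute coefficient size."""
    table = pd.DataFrame({
        "feature": list(feature_names),
        "coefficient": list(coefficients),
    })
    # Importance is the magnitude; the sign is kept in 'coefficient'
    table["importance"] = table["coefficient"].abs()
    return table.sort_values("importance", ascending=False).reset_index(drop=True)

# Made-up names and values, for illustration only
table = coefficient_table(
    ["minutes", "usage_rate", "opp_def_rating"],
    [2.5, 4.0, -3.1],
)
print(table)
```

Comparing coefficient magnitudes this way is only meaningful because the features were standardized before training; on raw units, a feature measured on a small scale would get a misleadingly large coefficient.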

## 11. Conclusion

### Model Performance Summary
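- **MAE (Mean Absolute Error)**: The average absolute difference between predicted and actual points.
- **RMSE (Root Mean Squared Error)**: Like MAE, but squaring the errors penalizes large misses more heavily.
- **R² Score**: The fraction of the variance in points that the model explains.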
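
To make the three metrics concrete, here is a small sketch that computes them directly with NumPy on four made-up predictions. The function name `regression_metrics` is an assumption; the project's `RegressionMetrics` class may compute or name these differently.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R² from arrays of actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_res = np.sum(errors ** 2)                      # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": float(mae), "rmse": float(rmse), "r2": float(r2)}

# Made-up values: errors are -2, +2, -3, +3
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)  # mae = 2.5, rmse = sqrt(6.5) ≈ 2.55, r2 = 0.948
```

RMSE comes out slightly above MAE here because the two 3-point misses weigh more once squared.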

### Regularization Comparison

**Linear Regression:**
- No penalty on coefficients
- Can overfit with many features
- Works best when there are few features relative to samples and they are not strongly correlated

**Ridge Regression (L2):**
- Shrinks coefficients toward zero
- Keeps all features
- Good when many features are somewhat useful
- Often generalizes better than unregularized regression when features are correlated

**Lasso Regression (L1):**
- Forces some coefficients to exactly zero
- Performs feature selection
- Good when many features are irrelevant
- Interpretable (fewer features)
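
The difference between "shrinks toward zero" and "forces to exactly zero" is easiest to see in the textbook special case of orthonormal features, where both penalties have closed-form solutions in terms of the unregularized coefficient. The sketch below assumes the objectives ½‖y − Xβ‖² + (α/2)‖β‖² for Ridge and ½‖y − Xβ‖² + α‖β‖₁ for Lasso; the starting coefficients are made up, and real data with correlated features will not follow these formulas exactly.

```python
import numpy as np

def ridge_shrink(ols_coef, alpha):
    """Ridge under orthonormal features: scale every coefficient by 1 / (1 + alpha)."""
    return ols_coef / (1.0 + alpha)

def lasso_shrink(ols_coef, alpha):
    """Lasso under orthonormal features: soft-threshold each coefficient by alpha."""
    return np.sign(ols_coef) * np.maximum(np.abs(ols_coef) - alpha, 0.0)

ols_coef = np.array([4.0, -2.0, 0.5, -0.25])  # made-up unregularized coefficients
alpha = 1.0

ridge_coef = ridge_shrink(ols_coef, alpha)
lasso_coef = lasso_shrink(ols_coef, alpha)

print("Ridge:", ridge_coef)  # every coefficient halved, none reaches zero
print("Lasso:", lasso_coef)  # the two small coefficients are set to zero
print("Features kept by Lasso:", np.count_nonzero(lasso_coef))
```

Ridge rescales every coefficient by the same factor, so small effects stay small but nonzero, while Lasso subtracts a fixed amount and drops anything below the threshold.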

### Typical Results
- MAE: 3-5 points (predictions within ~4 points on average)
- R²: 0.6-0.8 (explaining 60-80% of variance)
- Ridge often performs best

### Key Insights
- Recent performance is most predictive
- Usage rate and minutes strongly correlate with points
- Opponent defense rating matters
- Some players are more predictable than others

### Applications
- Fantasy sports lineups
- Sports betting insights
- Player performance tracking
- Contract negotiations

### Extensions
- Multi-output regression for Points + Rebounds + Assists
- Time series models for trends
- Player-specific models
- Advanced features (opponent matchups, rest days)

🏀 Regression models complete - ready to predict player performance!