# NBA Halftime Prediction Model

Can we predict whether the home team wins using only first-half data?

**Features:**
- Halftime scores and score differential
- Q1 scores and score differential
- Scoring momentum (Q2 vs Q1 scoring change)
- FG% by shot zone (paint, midrange, three) for each team
- Shot distribution by zone for each team

**Models:** Logistic Regression, Random Forest

**Data:** ~913,000 play-by-play rows across 1,830 NBA games (2024-25 + partial 2025-26)

## 1. Load Data

In [None]:
import pandas as pd
import psycopg2
import os
from dotenv import load_dotenv

load_dotenv()

conn = psycopg2.connect(
    dbname=os.getenv('DB_NAME'),
    user=os.getenv('DB_USER'),
    host=os.getenv('DB_HOST'),
    password=os.getenv('DB_PASSWORD'),
    port=os.getenv('DB_PORT')
)

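# Later cells use .last() per game, so rows must be in chronological order;
# without an ORDER BY clause, SQL does not guarantee that.
df = pd.read_sql('SELECT * FROM nba_data', conn)
conn.close()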

print(f'Rows: {len(df):,}')
df.head()

## 2. Target Variable

Determine the winner of each game from the final scores (last row per game).

In [None]:
final_scores = df.groupby('gameid')[['scorehome', 'scoreaway']].last()
final_scores['home_win'] = (final_scores['scorehome'] > final_scores['scoreaway']).astype(int)

print(f"Games: {len(final_scores)}")
print(final_scores['home_win'].value_counts())
final_scores.head(10)
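
Because scores are cumulative and never decrease, a per-game maximum recovers the final score even if rows are not in chronological order. The following is a minimal sketch on made-up rows, assuming `scorehome` and `scoreaway` are numeric; `toy` and `toy_final` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy play-by-play rows (made-up values), deliberately shuffled out of order.
toy = pd.DataFrame({
    'gameid':    [1, 1, 1, 2, 2, 2],
    'scorehome': [102, 50, 98, 40, 95, 88],
    'scoreaway': [99, 48, 99, 45, 101, 90],
})

# Scores only go up during a game, so max() recovers the final score
# without relying on row order.
toy_final = toy.groupby('gameid')[['scorehome', 'scoreaway']].max()
toy_final['home_win'] = (toy_final['scorehome'] > toy_final['scoreaway']).astype(int)
print(toy_final)
```

On these shuffled rows, `.last()` would pick 98-99 for game 1 and label it an away win, while `.max()` gives the true final of 102-99.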

## 3. Feature Engineering

### 3a. Halftime Scores and Differential

In [None]:
first_half = df[df['period'] <= 2]

lead = first_half.groupby('gameid')[['scorehome', 'scoreaway']].last()
lead['halftime_diff'] = lead['scorehome'] - lead['scoreaway']
lead.head(10)

### 3b. Q1 Scores and Differential

In [None]:
first_quarter = df[df['period'] <= 1].groupby('gameid')[['scorehome', 'scoreaway']].last()
# Named to match halftime_diff
first_quarter['q1_diff'] = first_quarter['scorehome'] - first_quarter['scoreaway']

features = lead.join(first_quarter, lsuffix='_half', rsuffix='_q1')
features.head(10)

### 3c. Scoring Momentum

Momentum = Q2 scoring - Q1 scoring. Positive means the team scored more in Q2 than Q1.

In [None]:
home_momentum = (features['scorehome_half'] - features['scorehome_q1']) - features['scorehome_q1']
away_momentum = (features['scoreaway_half'] - features['scoreaway_q1']) - features['scoreaway_q1']

features['home_momentum'] = home_momentum
features['away_momentum'] = away_momentum
features.head(10)
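
The arithmetic can be checked by hand on a small example. This sketch uses made-up cumulative scores for two games and reuses the column names from the cell above; only the home side is shown.

```python
import pandas as pd

# Made-up cumulative scores for two games.
toy = pd.DataFrame(
    {'scorehome_q1': [30, 22], 'scorehome_half': [55, 54]},
    index=pd.Index([1, 2], name='gameid'),
)

q2_points = toy['scorehome_half'] - toy['scorehome_q1']   # points scored in Q2 alone
toy['home_momentum'] = q2_points - toy['scorehome_q1']    # Q2 points minus Q1 points
print(toy)
```

Momentum works out to `scorehome_half - 2 * scorehome_q1`, a linear combination of two columns already in the feature table. It therefore adds no new information for a linear model such as logistic regression, though tree-based models can still split on it directly.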

### 3d. Shot Zones

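Classify first-half shots into three zones based on shot distance alone:
- **Paint**: 0-8 ft
- **Midrange**: 8-23.75 ft
- **Three**: 23.75+ ft

Distance is only an approximation of the court zones: the NBA three-point line is 22 ft in the corners, so corner threes fall into the midrange bin.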

In [None]:
first_half_shots = first_half[first_half['isfieldgoal'] == 1].copy()
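# right=False makes the bins [0, 8), [8, 23.75), [23.75, 100), so shots recorded
# at 0 ft count as paint and shots at exactly 23.75 ft count as threes.
# With the pandas default (right=True), 0 ft shots would fall outside every bin.
first_half_shots['zone'] = pd.cut(
    first_half_shots['shotdistance'],
    bins=[0, 8, 23.75, 100],
    labels=['paint', 'midrange', 'three'],
    right=False
)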
first_half_shots['made'] = (first_half_shots['shotresult'] == 'Made').astype(int)

print(f"First-half shot attempts: {len(first_half_shots):,}")
first_half_shots[['gameid', 'shotdistance', 'zone', 'shotresult']].head(10)
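
How `pd.cut` assigns boundary distances depends on which side of each bin is closed. The sketch below compares the pandas default (`right=True`) with `right=False` on the zone edges; the distances are made up for illustration.

```python
import pandas as pd

distances = pd.Series([0, 8, 23.75, 30])
bins = [0, 8, 23.75, 100]
labels = ['paint', 'midrange', 'three']

# right=True (pandas default): intervals are (0, 8], (8, 23.75], (23.75, 100]
right_closed = pd.cut(distances, bins=bins, labels=labels)
# right=False: intervals are [0, 8), [8, 23.75), [23.75, 100)
left_closed = pd.cut(distances, bins=bins, labels=labels, right=False)

print(pd.DataFrame({
    'distance': distances,
    'right=True': right_closed,
    'right=False': left_closed,
}))
```

With the default, a shot recorded at 0 ft falls outside every bin and gets a missing zone, and a 23.75 ft shot counts as midrange. With `right=False`, both land where the zone definitions above put them.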

### 3e. FG% by Zone

In [None]:
fg_pct = first_half_shots.groupby(['gameid', 'location', 'zone'], observed=True)['made'].mean()
fg_pct_flat = fg_pct.unstack(['location', 'zone'])
fg_pct_flat.columns = [
    f"{'home' if loc == 'h' else 'away'}_fg_{zone}"
    for loc, zone in fg_pct_flat.columns
]
fg_pct_flat.head()
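
The group, unstack, and rename steps are easier to follow on a handful of rows. This is a sketch on made-up shots for one game; `toy_shots`, `toy_pct`, and `toy_flat` are hypothetical names, and `'v'` is assumed to be the away-team code.

```python
import pandas as pd

toy_shots = pd.DataFrame({
    'gameid':   [1, 1, 1, 1, 1, 1],
    'location': ['h', 'h', 'h', 'v', 'v', 'v'],
    'zone':     ['paint', 'paint', 'three', 'paint', 'three', 'three'],
    'made':     [1, 0, 1, 1, 0, 0],
})

# Mean of the 0/1 'made' flag is the field-goal percentage for each group.
toy_pct = toy_shots.groupby(['gameid', 'location', 'zone'])['made'].mean()

# Move (location, zone) from the row index into columns: one row per game.
toy_flat = toy_pct.unstack(['location', 'zone'])

# Flatten the two-level column labels into names like 'home_fg_paint'.
toy_flat.columns = [
    f"{'home' if loc == 'h' else 'away'}_fg_{zone}"
    for loc, zone in toy_flat.columns
]
print(toy_flat)
```

Neither team attempted a midrange shot here, so no midrange column appears at all. In a multi-game table the same situation shows up as `NaN` for that game, which is not the same thing as shooting 0%.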

### 3f. Shot Distribution by Zone

In [None]:
shot_dist = first_half_shots.groupby(['gameid', 'location'], observed=True)['zone'].value_counts(normalize=True)
shot_dist_flat = shot_dist.unstack(['location', 'zone'])
shot_dist_flat.columns = [
    f"{'home' if loc == 'h' else 'away'}_dist_{zone}"
    for loc, zone in shot_dist_flat.columns
]
shot_dist_flat.head()

### 3g. Join All Features

In [None]:
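final_df = features.join(fg_pct_flat).join(shot_dist_flat).join(final_scores['home_win'])
# NaN here means a team took no first-half shots from that zone. Filling with 0 is
# exact for shot shares but treats "no attempts" as 0% for FG%.
final_df = final_df.fillna(0)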

print(f"Games: {len(final_df)}, Features: {len(final_df.columns) - 1}")
final_df.head()

## 4. Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

X = final_df.drop('home_win', axis=1)
y = final_df['home_win']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")

## 5. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_scaled, y_train)

print(f"Train accuracy: {lr_model.score(X_train_scaled, y_train):.3f}")
print(f"Test accuracy: {lr_model.score(X_test_scaled, y_test):.3f}")
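
A single 80/20 split leaves only a few hundred games for testing, so the accuracy estimate is noisy. One common alternative is k-fold cross-validation with the scaler inside a pipeline. This is a minimal sketch on synthetic data; `X_demo` and `y_demo` are stand-ins, not the notebook's features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y); the notebook would pass its own features and home_win.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Putting the scaler inside the pipeline refits it on each training fold,
# so no statistics from the held-out fold leak into preprocessing.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')

print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The spread across folds gives a rough sense of how much a single test-set accuracy could move from one split to another.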

## 6. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)

print(f"Train accuracy: {rf_model.score(X_train, y_train):.3f}")
print(f"Test accuracy: {rf_model.score(X_test, y_test):.3f}")
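
Accuracy is easier to interpret next to a trivial baseline. The sketch below computes two on made-up data: always predicting the majority class, and predicting that whichever team leads at halftime wins. The values are illustrative only, not results from this dataset.

```python
import pandas as pd

# Made-up halftime differentials and outcomes, for illustration only.
toy = pd.DataFrame({
    'halftime_diff': [12, -3, 5, -8, 0, 7, -1, 4],
    'home_win':      [1, 0, 1, 0, 1, 0, 1, 1],
})

# Baseline 1: always predict the more common class.
home_rate = toy['home_win'].mean()
majority_acc = max(home_rate, 1 - home_rate)

# Baseline 2: predict that the halftime leader wins. Ties go to the home team,
# an arbitrary choice for this sketch.
leader_pred = (toy['halftime_diff'] >= 0).astype(int)
leader_acc = (leader_pred == toy['home_win']).mean()

print(f"Majority-class baseline: {majority_acc:.3f}")
print(f"Halftime-leader baseline: {leader_acc:.3f}")
```

Running the same two lines on `final_df` would show how much of the models' accuracy the halftime lead alone accounts for.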

## 7. Results

### Confusion Matrix

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Logistic Regression
cm_lr = confusion_matrix(y_test, lr_model.predict(X_test_scaled))
ConfusionMatrixDisplay(cm_lr, display_labels=['Away Win', 'Home Win']).plot(ax=axes[0], cmap='Blues')
axes[0].set_title('Logistic Regression')

# Random Forest
cm_rf = confusion_matrix(y_test, rf_model.predict(X_test))
ConfusionMatrixDisplay(cm_rf, display_labels=['Away Win', 'Home Win']).plot(ax=axes[1], cmap='Greens')
axes[1].set_title('Random Forest')

plt.tight_layout()
plt.show()

### Feature Importance (Random Forest)

In [None]:
importances = pd.Series(rf_model.feature_importances_, index=X.columns).sort_values(ascending=True)

plt.figure(figsize=(10, 8))
importances.plot(kind='barh')
plt.xlabel('Importance')
plt.title('Feature Importance (Random Forest)')
plt.tight_layout()
plt.show()
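
`feature_importances_` is impurity-based and computed from the training data, and it splits credit among correlated features (here `halftime_diff` is exactly the difference of the two halftime score columns). Permutation importance on held-out data is a common cross-check. This is a sketch on synthetic data; `X_demo`, `y_demo`, and `rf_demo` are stand-ins rather than the notebook's objects.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and record the drop in accuracy.
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=0)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```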

### Halftime Differential vs Win Probability

In [None]:
# Bin halftime differentials and calculate actual win rate per bin.
# The bins live in a separate Series so final_df's columns stay unchanged
# if earlier cells are re-run.
diff_bin = pd.cut(final_df['halftime_diff'], bins=20)
win_rate = final_df.groupby(diff_bin, observed=True)['home_win'].mean()

plt.figure(figsize=(10, 6))
win_rate.plot(kind='bar', color='steelblue')
plt.axhline(y=0.5, color='red', linestyle='--', label='50% win rate')
plt.xlabel('Halftime Differential (Home - Away)')
plt.ylabel('Home Win Rate')
plt.title('Halftime Lead vs Actual Home Win Rate')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
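
Bins at the extremes of the differential range usually contain few games, so their win rates are noisy. Reporting the game count next to the mean makes that visible. This is a sketch on made-up data with hand-picked bin edges, both assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'halftime_diff': [-20, -9, -8, -2, -1, 1, 3, 4, 10, 25],
    'home_win':      [0, 0, 1, 0, 1, 0, 1, 1, 1, 1],
})

# Hand-picked edges; intervals are right-closed, e.g. (-10, 0].
bins = pd.cut(toy['halftime_diff'], bins=[-30, -10, 0, 10, 30])

# Win rate and number of games in each bin.
summary = toy.groupby(bins, observed=True)['home_win'].agg(['mean', 'count'])
print(summary)
```

In this toy table the outermost bins hold a single game each, so their 0% and 100% win rates say very little on their own.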

## 8. Summary

| Model | Train Accuracy | Test Accuracy |
|-------|---------------|---------------|
| Logistic Regression | 73.0% | 72.7% |
| Random Forest (max_depth=5) | 78.8% | 71.6% |

**Key Findings:**
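- Halftime score differential is the dominant predictor (~28% of Random Forest importance)
- Raw halftime scores (home and away) are the next most important features (~13% each)
- Q1 differential adds meaningful signal (~9%)
- Shot-zone features (FG% and shot distribution by zone) contribute 1-3% each
- Both models reach ~72% test accuracy. With a test set of roughly 366 games (20% of 1,830), the 1.1-point gap between them is within sampling noise. This hints at a ceiling for these halftime features, but a single split of two models does not establish one
- Random Forest's train accuracy (78.8%) is about 7 points above its test accuracy (71.6%), so it overfits somewhat even at max_depth=5; Logistic Regression shows almost no gap
- The halftime lead vs win rate chart shows a clear S-curve: teams leading by 15 or more at halftime win roughly 90% of the time or more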
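
To gauge how much of the difference between the two models could be sampling noise, the binomial standard error of a test accuracy can be worked out directly. The sketch assumes a test set of 366 games (20% of the 1,830 games stated above) and uses the accuracies from the table.

```python
import math

n_test = 366                    # assumed: 20% of 1,830 games
acc_lr, acc_rf = 0.727, 0.716   # test accuracies from the summary table

# Standard error of a proportion: sqrt(p * (1 - p) / n)
se_lr = math.sqrt(acc_lr * (1 - acc_lr) / n_test)
gap = acc_lr - acc_rf

print(f"SE of LR test accuracy: {se_lr:.3f}")
print(f"LR - RF gap: {gap:.3f}")
```

The gap between the models is smaller than one standard error, so this split alone cannot rank them.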