# AI05b: Random Forest Walkthrough
## Predicting NBA Game Outcomes

**AI/ML Course | Medina County Career Center**

Run each cell in order. Focus on the outputs, not memorizing the code.

## Setup

Run this cell to load the libraries we need.

In [2]:
import pandas as pd                                        # pandas = data tables (like Excel in Python)
from sklearn.model_selection import train_test_split       # splits data into training vs testing sets
from sklearn.ensemble import RandomForestClassifier        # the Random Forest model itself
from sklearn.metrics import accuracy_score                 # compares predictions to actual results
from nba_api.stats.endpoints import LeagueGameLog          # pulls real NBA game stats from nba.com
from nba_api.stats.static import teams                     # gives us a list of all NBA teams and their IDs

print("Libraries loaded.")

Libraries loaded.


## Step 1: Pull NBA Game Data

We're using the `nba_api` library (from ai05a) to get real game data. Each row in the raw data is one team's stats from one game — so every game produces two rows (one per team). This may take 10-20 seconds.

In [6]:
# hit the NBA API and pull every team's game log for this season
game_log = LeagueGameLog(season="2025-26", season_type_all_star="Regular Season")

# .get_data_frames() returns a list of tables — [0] grabs the first (and only) one
games = game_log.get_data_frames()[0]

# the API returns column names in ALL CAPS like "PTS", "AST"
# .str.lower() converts them to lowercase so we can type games['pts'] instead of games['PTS']
games.columns = games.columns.str.lower()

# each game has TWO rows (one per team), so total rows / 2 = number of actual games
print(f"Got {len(games)} team-game records ({len(games)//2} games)")

# show a few rows so we can see what the raw data looks like
# we pick specific columns to keep it readable (there are 20+ columns total)
display(games[['game_date', 'team_name', 'matchup', 'wl', 'pts', 'ast', 'reb', 'fg_pct']].head(10))

Got 1576 team-game records (788 games)


| | game_date | team_name | matchup | wl | pts | ast | reb | fg_pct |
|---|---|---|---|---|---|---|---|---|
| 0 | 2025-10-21 | Golden State Warriors | GSW @ LAL | W | 119 | 29 | 40 | 0.487 |
| 1 | 2025-10-21 | Los Angeles Lakers | LAL vs. GSW | L | 109 | 23 | 39 | 0.545 |
| 2 | 2025-10-21 | Houston Rockets | HOU @ OKC | L | 124 | 23 | 52 | 0.443 |
| 3 | 2025-10-21 | Oklahoma City Thunder | OKC vs. HOU | W | 125 | 29 | 38 | 0.442 |
| 4 | 2025-10-22 | Boston Celtics | BOS vs. PHI | L | 116 | 16 | 42 | 0.451 |
| 5 | 2025-10-22 | Cleveland Cavaliers | CLE @ NYK | L | 111 | 21 | 32 | 0.465 |
| 6 | 2025-10-22 | New Orleans Pelicans | NOP @ MEM | L | 122 | 20 | 47 | 0.459 |
| 7 | 2025-10-22 | Chicago Bulls | CHI vs. DET | W | 115 | 29 | 50 | 0.448 |
| 8 | 2025-10-22 | LA Clippers | LAC @ UTA | L | 108 | 28 | 38 | 0.443 |
| 9 | 2025-10-22 | Miami Heat | MIA @ ORL | L | 121 | 26 | 47 | 0.484 |
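
To see how the `matchup` text marks home vs. away, here is a tiny sketch in plain Python using the first two rows above. The `game_id` value `"G1"` and the `team` key are made up for illustration; the real API uses its own IDs and column names.

```python
# Two rows for the same game, copied from the first two rows of the table above.
# "G1" is a hypothetical game_id used only for this sketch.
rows = [
    {"game_id": "G1", "team": "GSW", "matchup": "GSW @ LAL", "wl": "W"},
    {"game_id": "G1", "team": "LAL", "matchup": "LAL vs. GSW", "wl": "L"},
]

games_by_id = {}
for row in rows:
    # "vs." means this team was at home, "@" means it was on the road
    side = "home" if "vs." in row["matchup"] else "away"
    game = games_by_id.setdefault(row["game_id"], {})
    game[side] = row["team"]
    if side == "home":
        # the target is always from the home team's point of view
        game["home_win"] = int(row["wl"] == "W")

print(games_by_id)
# {'G1': {'away': 'GSW', 'home': 'LAL', 'home_win': 0}}
```

Two team rows collapse into one game record, which is the same idea Step 2 applies to the whole season.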


## Step 2: Calculate Team Averages and Differentials

The model doesn't see team names. It sees **stat differences** between the home team and the away team. Every column below is: **home team's season average minus away team's season average.** Positive = the home team averages more of that stat, which is an advantage for everything except turnovers (`tov_diff`), where a lower number is better.

We use season averages (not actual game stats) because in a real prediction, we wouldn't know the game stats beforehand.
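
One thing to keep in mind: a full-season average also includes the game we're trying to predict, so the model gets a small peek at the outcome. A stricter version would average only the games a team played *before* each matchup. The sketch below shows one possible way to do that on a made-up table. The numbers and the `pts_avg_before` column name are assumptions for illustration, not part of this notebook's pipeline.

```python
import pandas as pd

# Made-up game log: two hypothetical teams, three games each
toy = pd.DataFrame({
    "team_id": [1, 1, 1, 2, 2, 2],
    "game_date": pd.to_datetime(["2025-10-21", "2025-10-23", "2025-10-25"] * 2),
    "pts": [100, 110, 120, 90, 95, 130],
})

# Sort so "earlier games" really come first within each team
toy = toy.sort_values(["team_id", "game_date"]).reset_index(drop=True)

# shift(1) hides the current game, expanding().mean() averages everything before it
toy["pts_avg_before"] = (
    toy.groupby("team_id")["pts"]
       .transform(lambda s: s.shift(1).expanding().mean())
)

print(toy)
```

Each team's first game has no earlier games to average, so it shows up as `NaN` and would need to be dropped or filled before training.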

In [7]:
# ---------- PART A: calculate each team's season averages ----------

# .groupby('team_id') = "make a separate pile for each team"
# .agg({'pts': 'mean', ...}) = "for each pile, calculate the average of these columns"
# result: one row per team with their average pts, ast, reb, etc. for the whole season
team_avgs = games.groupby('team_id').agg({
    'pts': 'mean', 'ast': 'mean', 'reb': 'mean',
    'stl': 'mean', 'blk': 'mean', 'tov': 'mean',
    'fg_pct': 'mean', 'fg3_pct': 'mean'
}).reset_index()                                           # .reset_index() turns team_id back into a normal column

# ---------- PART B: figure out which team was home vs away in each game ----------

# matchup column looks like "CLE vs. BOS" (home) or "BOS @ CLE" (away)
# "vs." = this team was at home, "@" = this team was on the road
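# regex=False = treat 'vs.' as plain text (otherwise the "." means "any character")
home_games = games[games['matchup'].str.contains('vs.', regex=False)].copy()   # keep only the home team's rows
away_games = games[games['matchup'].str.contains('@', regex=False)].copy()     # keep only the away team's rows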

# .merge() = combine two tables by matching on a shared column (like VLOOKUP in Excel)
# we match on game_id so each row becomes: home team info + away team's team_id
# suffixes=('_home', '_away') adds labels so we know which team_id is which
matchups = home_games.merge(
    away_games[['game_id', 'team_id']],
    on='game_id', suffixes=('_home', '_away')
)

# create target column: 1 if home team won, 0 if away team won
# (matchups['wl'] == 'W') returns True/False, .astype(int) converts True=1 False=0
matchups['home_win'] = (matchups['wl'] == 'W').astype(int)

# ---------- PART C: build the training data (one row per game, stat diffs only) ----------

training_rows = []                                         # empty list — we'll add one dict per game

for _, game in matchups.iterrows():                        # loop through every game (row by row)

    # look up this game's home team averages and away team averages
    # team_avgs[team_avgs['team_id'] == ...] = "find the row where team_id matches"
    home = team_avgs[team_avgs['team_id'] == game['team_id_home']]
    away = team_avgs[team_avgs['team_id'] == game['team_id_away']]

    if len(home) == 0 or len(away) == 0:                   # skip if either team not found (shouldn't happen)
        continue

    # .iloc[0] = "grab the first (and only) row as a simple data series"
    # without iloc[0] we'd have a mini-table instead of a single row, and subtraction wouldn't work right
    h, a = home.iloc[0], away.iloc[0]

    # for each stat, subtract: home team avg - away team avg
    # positive = home team averages more of that stat (good, except for turnovers where more is worse)
    training_rows.append({
        'pts_diff': h['pts'] - a['pts'],                   # points per game difference
        'ast_diff': h['ast'] - a['ast'],                   # assists per game difference
        'reb_diff': h['reb'] - a['reb'],                   # rebounds per game difference
        'stl_diff': h['stl'] - a['stl'],                   # steals per game difference
        'blk_diff': h['blk'] - a['blk'],                   # blocks per game difference
        'tov_diff': h['tov'] - a['tov'],                   # turnovers per game difference
        'fg_pct_diff': h['fg_pct'] - a['fg_pct'],         # field goal % difference
        'fg3_pct_diff': h['fg3_pct'] - a['fg3_pct'],      # 3-point % difference
        'home_win': game['home_win']                       # 1 = home won, 0 = away won
    })

# turn the list of dicts into a pandas table (each dict becomes one row)
training_data = pd.DataFrame(training_rows)

print(f"Training data: {len(training_data)} games, {len(training_data.columns)-1} features")
print(f"Home team win rate: {training_data['home_win'].mean():.1%}\n")  # .mean() of 1s and 0s = win percentage
display(training_data.head())                              # show first 5 rows so we can see the format

Training data: 783 games, 8 features
Home team win rate: 54.8%



| | pts_diff | ast_diff | reb_diff | stl_diff | blk_diff | tov_diff | fg_pct_diff | fg3_pct_diff | home_win |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.533481 | -3.502775 | -1.367 | -1.764706 | -0.411395 | -0.467999 | 0.039387 | -0.008385 | 0 |
| 1 | 4.768405 | 0.781724 | -5.105808 | 0.944136 | 0.047355 | -2.839808 | 0.01448 | -0.011065 | 1 |
| 2 | -1.559507 | -1.093251 | 1.761974 | -1.377358 | -0.427068 | -1.731858 | 0.004085 | 0.00705 | 0 |
| 3 | -0.432852 | 2.610433 | -0.743618 | -3.306696 | -1.312986 | -0.900111 | -0.010563 | 0.018632 | 1 |
| 4 | -2.320755 | -1.849057 | 1.679245 | -1.132075 | -1.018868 | -0.773585 | -0.006019 | 0.017679 | 1 |
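
The loop in Part C is easy to read, but the same differentials could also be computed without a loop. This is a sketch on invented numbers, assuming tables shaped like `team_avgs` and `matchups` above (only two stats are included to keep it short). It is not the code that produced the output.

```python
import pandas as pd

# Invented season averages for three hypothetical teams
team_avgs = pd.DataFrame({
    "team_id": [1, 2, 3],
    "pts": [115.0, 110.0, 120.0],
    "tov": [13.0, 15.0, 12.0],
})

# Invented matchups: one row per game, already paired home/away
matchups = pd.DataFrame({
    "team_id_home": [1, 2],
    "team_id_away": [2, 3],
    "home_win": [1, 0],
})

stats = ["pts", "tov"]

# Look up averages for every home team and every away team in one step each.
# how="left" keeps the games in their original order.
home = matchups.merge(team_avgs, left_on="team_id_home", right_on="team_id", how="left")
away = matchups.merge(team_avgs, left_on="team_id_away", right_on="team_id", how="left")

# Subtract whole columns at once: home average minus away average
diffs = pd.DataFrame({f"{s}_diff": home[s] - away[s] for s in stats})
diffs["home_win"] = matchups["home_win"]

print(diffs)
```

In the first toy game the home team scores 5 more points per game and commits 2 fewer turnovers, so `pts_diff` is 5.0 and `tov_diff` is -2.0.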


## Step 3: Train/Test Split

We split the data so the model learns from 80% of games and gets tested on the remaining 20% it has never seen. This keeps the evaluation honest — without the split, the model could just memorize answers.

In [8]:
# these are the 8 stat columns the model will use to make predictions
# (everything EXCEPT home_win, which is the answer we're trying to predict)
feature_cols = ['pts_diff', 'ast_diff', 'reb_diff', 'stl_diff',
                'blk_diff', 'tov_diff', 'fg_pct_diff', 'fg3_pct_diff']

X = training_data[feature_cols]                            # X = the inputs (stat diffs we know before the game)
y = training_data['home_win']                              # y = the answer (did the home team actually win?)

# train_test_split randomly shuffles the data and splits it into two groups:
#   X_train, y_train = 80% of games (the model learns from these)
#   X_test,  y_test  = 20% of games (held back to test accuracy on unseen data)
# random_state=42 = use the same random shuffle every time so results are reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training on: {len(X_train)} games")
print(f"Testing on:  {len(X_test)} games (model has never seen these)")

Training on: 626 games
Testing on:  157 games (model has never seen these)
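
To make the shuffle-and-split idea concrete, here is a hand-rolled sketch in plain Python. It is not how `train_test_split` works internally and it will not reproduce the same shuffle. `simple_split` is a hypothetical helper that only illustrates a seeded shuffle followed by an 80/20 cut.

```python
import math
import random

def simple_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items with a fixed seed, then hold out the last test_size share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)            # same seed = same shuffle every run
    n_test = math.ceil(len(shuffled) * test_size)    # round the test share up to a whole game
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

# Pretend each number is one game from the 783-game table
train, test = simple_split(range(783))
print(len(train), len(test))   # prints: 626 157
```

No game lands in both groups, and running the function again with the same seed gives the same two groups.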


## Step 4: Build the Random Forest

This is where 100 decision trees get built. Remember from the slides: each tree gets a random sample of the training games, and at every split it draws a random subset of features to consider. Then all 100 trees will vote on predictions.

In [10]:
# create the Random Forest model with our settings
model = RandomForestClassifier(
    n_estimators=100,                                      # build 100 separate decision trees
    max_depth=10,                                          # each tree can ask up to 10 levels of questions
    random_state=42                                        # same random seed = same results every run
)

# .fit() = "train the model" — this is where all 100 trees get built
# each tree gets a random sample of games and picks random features at every split
# (this is the bagging + random feature selection from the slides)
model.fit(X_train, y_train)

print(f"Done -- {model.n_estimators} trees built, each trained on {len(X_train)} games")

Done -- 100 trees built, each trained on 626 games
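
The two kinds of randomness from the slides can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's code. The number of candidate features per split assumes scikit-learn's default square-root rule for classifiers, which is an assumption about the defaults rather than something set in this notebook.

```python
import math
import random

rng = random.Random(0)            # fixed seed so the sketch is repeatable
n_games = 626                     # size of the training set above

# Bagging: draw 626 row numbers WITH replacement,
# so some games repeat and others are left out of this tree's sample
sample = [rng.randrange(n_games) for _ in range(n_games)]
unique_share = len(set(sample)) / n_games
print(f"Rows drawn: {len(sample)}, distinct games: {unique_share:.0%}")

# Random feature selection: at each split only a few features are candidates
feature_cols = ['pts_diff', 'ast_diff', 'reb_diff', 'stl_diff',
                'blk_diff', 'tov_diff', 'fg_pct_diff', 'fg3_pct_diff']
n_candidates = max(1, int(math.sqrt(len(feature_cols))))   # assumption: square-root rule
candidates = rng.sample(feature_cols, n_candidates)
print(f"Candidates at this split: {candidates}")
```

Because rows are drawn with replacement, a tree's 626 draws typically cover only about 63% of the distinct games. That is part of why the trees end up different from each other.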


## Step 5: How Accurate Is It?

We test on the 20% of games the model never trained on. Compare to baselines — random guessing gets 50%, and always picking the home team gets around 54-55%.

In [11]:
# .predict() = "for each test game, combine all 100 trees and return the side they favor"
# (scikit-learn averages each tree's predicted probabilities, which works much like a vote)
# y_pred is an array of 1s and 0s, one prediction per test game
y_pred = model.predict(X_test)

# accuracy_score compares our predictions (y_pred) to what actually happened (y_test)
# it returns the fraction that matched (e.g., 0.65 = we got 65% of test games right)
accuracy = accuracy_score(y_test, y_pred)

# compare our model to two simple baselines:
#   - random guessing = 50% (flip a coin)
#   - always picking home = whatever % of games the home team actually wins (~54%)
# if our model can't beat these, it's not learning anything useful
print(f"Accuracy: {accuracy:.1%}")                         # :.1% formats 0.65 as "65.0%"
print()                                                    # blank line
print("  Random guessing:   50.0%")
print(f"  Always pick home:  {y.mean():.1%}")              # y.mean() = home win rate in the full dataset
print(f"  Our model:         {accuracy:.1%}")

Accuracy: 63.1%

  Random guessing:   50.0%
  Always pick home:  54.8%
  Our model:         63.1%
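
Accuracy and the "always pick home" baseline are simple enough to compute by hand. The sketch below uses ten invented results, not the real test set, and measures the baseline on the same games as the model, which is one reasonable way to keep the comparison apples-to-apples.

```python
# Invented results for ten test games: 1 = home win, 0 = away win
y_test = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # what actually happened
y_pred = [1, 0, 0, 1, 0, 1, 1, 1, 1, 0]   # what a model predicted

# Accuracy = share of games where prediction and result match
accuracy = sum(t == p for t, p in zip(y_test, y_pred)) / len(y_test)

# "Always pick home" is right exactly as often as the home team wins
home_baseline = sum(y_test) / len(y_test)

print(f"Model:            {accuracy:.1%}")       # 80.0%
print(f"Always pick home: {home_baseline:.1%}")  # 60.0%
```

The model only earns credit for the gap between the two numbers, here 20 percentage points.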


## Step 6: Which Stats Matter Most?

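Random Forest reports how much each feature contributed to its splits across all 100 trees. That hints at which statistical advantages actually go along with wins. Read the scores as a rough ranking: stats that move together (like points and shooting percentage) share the credit between them.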

In [12]:
# model.feature_importances_ = a list of scores (one per feature) showing how much
# each stat contributed to the predictions across all 100 trees
# higher score = that feature was more useful for separating wins from losses
importance = pd.DataFrame({
    'Feature': feature_cols,                               # the column names
    'Importance': model.feature_importances_               # the importance scores (add up to 1.0)
}).sort_values('Importance', ascending=False)               # sort highest to lowest

# print a simple text bar chart so we can visualize the differences
for _, row in importance.iterrows():                       # loop through each feature
    bar = '#' * int(row['Importance'] * 50)                # scale the score to a bar width (e.g., 0.20 = 10 #'s)
    print(f"  {row['Feature']:15} {bar} {row['Importance']:.1%}")  # :15 pads name to 15 chars for alignment

print(f"\nTop predictor: {importance.iloc[0]['Feature']}")  # iloc[0] = first row (highest importance)

  pts_diff        ######## 16.3%
  fg_pct_diff     ####### 14.3%
  tov_diff        ###### 12.6%
  reb_diff        ##### 11.9%
  fg3_pct_diff    ##### 11.8%
  stl_diff        ##### 11.8%
  ast_diff        ##### 11.2%
  blk_diff        ##### 10.0%

Top predictor: pts_diff
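
One way to read these scores is to compare each one to an "even share." With 8 features, a feature that mattered exactly as much as every other would score 1/8 = 12.5%. The sketch below uses the rounded percentages printed above. The even-share comparison is just a reading aid, not something scikit-learn reports.

```python
# Importance scores copied from the printed output above (rounded to 0.1%)
importances = {
    'pts_diff': 0.163, 'fg_pct_diff': 0.143, 'tov_diff': 0.126, 'reb_diff': 0.119,
    'fg3_pct_diff': 0.118, 'stl_diff': 0.118, 'ast_diff': 0.112, 'blk_diff': 0.100,
}

even_share = 1 / len(importances)   # 0.125 when there are 8 features

for name, score in importances.items():
    ratio = score / even_share      # above 1.0 = more than an even share
    print(f"  {name:15} {score:.1%}  ({ratio:.2f}x an even share)")

print(f"\nTotal: {sum(importances.values()):.1%}")   # close to 100% (rounding)
```

The top feature sits at about 1.3x an even share and the bottom at 0.8x, so no single stat dominates in this run.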


## Step 7: Predict a Real Matchup

Change `HOME_TEAM` and `AWAY_TEAM` below to predict different games. The model calculates the stat differentials, sends them through all 100 trees, and counts the votes.

Remember: the model never sees team names. It only sees the row of numbers (stat differences) and asks "how often did the home team win in similar situations?"

In [13]:
# ---------- PART A: create a quick lookup so we can type team nicknames ----------

# teams.get_teams() returns a list of all 30 NBA teams with their IDs and names
# each item looks like: {'id': 1610612739, 'full_name': 'Cleveland Cavaliers', ...}
nba_teams = teams.get_teams()

# turn that list into a dictionary: {'Cleveland Cavaliers': 1610612739, ...}
# this lets us look up any team's numeric ID from their name
team_lookup = {t['full_name']: t['id'] for t in nba_teams}

# we make a shorter TEAMS dictionary so you don't have to type full names every time
# just map nicknames to their numeric IDs using the team_lookup we just built
# add more teams here if you want to try other matchups!
TEAMS = {
    'Cavaliers': team_lookup['Cleveland Cavaliers'],
    'Celtics': team_lookup['Boston Celtics'],
    'Lakers': team_lookup['Los Angeles Lakers'],
    'Warriors': team_lookup['Golden State Warriors'],
    'Bucks': team_lookup['Milwaukee Bucks'],
    'Heat': team_lookup['Miami Heat'],
    'Knicks': team_lookup['New York Knicks'],
    'Thunder': team_lookup['Oklahoma City Thunder'],
}

# ---------- PART B: pick the two teams ----------

# CHANGE THESE TWO LINES TO PREDICT DIFFERENT MATCHUPS
HOME_TEAM = 'Cavaliers'
AWAY_TEAM = 'Celtics'

# ---------- PART C: look up their season averages and calculate stat diffs ----------

# find the home team's row in team_avgs and grab it as a single row (.iloc[0])
# without .iloc[0] we'd get a mini-table which makes the subtraction below messy
h = team_avgs[team_avgs['team_id'] == TEAMS[HOME_TEAM]].iloc[0]
a = team_avgs[team_avgs['team_id'] == TEAMS[AWAY_TEAM]].iloc[0]

# build one row of data in the same format the model was trained on:
# for each feature like 'pts_diff', grab the stat name ('pts') and subtract away from home
# .replace('_diff','') turns 'pts_diff' into 'pts' so we can look it up in team_avgs
# the result is a single-row DataFrame the model can make a prediction on
pred_input = pd.DataFrame([{
    col: h[col.replace('_diff','')] - a[col.replace('_diff','')]
    for col in feature_cols                                # loop through all 8 feature columns
}])

# ---------- PART D: get the prediction ----------

# .predict() returns 1 (home win) or 0 (away win), whichever side the 100 trees favor overall
prediction = model.predict(pred_input)[0]                  # [0] because predict returns an array, we want the single answer

# .predict_proba() returns [away win probability, home win probability]
# each tree reports its own probability and the forest averages them,
# so [0.32, 0.68] means about 68% of the forest leans home win (not always exactly 68 trees)
proba = model.predict_proba(pred_input)[0]                 # [0] to get the single result

winner = HOME_TEAM if prediction == 1 else AWAY_TEAM      # translate 1/0 back to a team name

print(f"{HOME_TEAM} (Home) vs {AWAY_TEAM} (Away)")
print()                                                    # blank line
print(f"  Prediction:  {winner.upper()} WIN")
print(f"  Confidence:  {max(proba)*100:.0f}%")             # max(proba) = the winning side's vote %
print(f"  Tree votes:  {proba[1]*100:.0f} for {HOME_TEAM}, {proba[0]*100:.0f} for {AWAY_TEAM}")  # :.0f rounds; int() would chop off the decimals

Cavaliers (Home) vs Celtics (Away)

  Prediction:  CELTICS WIN
  Confidence:  66%
  Tree votes:  34 for Cavaliers, 66 for Celtics
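
The "votes" are best read as an average of what each tree thinks. A tree can report a probability like 0.55 rather than a flat yes or no, and scikit-learn's forest averages those probabilities. The sketch below uses five invented trees to show how counting yes/no votes and averaging probabilities can even disagree. The numbers are made up for illustration.

```python
# Each number is one hypothetical tree's probability that the home team wins
tree_probs = [0.9, 0.6, 0.55, 0.2, 0.1]

# Hard vote: each tree says "home" if its own probability is above 50%
hard_votes_home = sum(p > 0.5 for p in tree_probs)           # 3 of 5 trees
hard_prediction = int(hard_votes_home > len(tree_probs) / 2)

# Averaged probability: add up the probabilities and divide by the number of trees
avg_prob_home = sum(tree_probs) / len(tree_probs)            # about 0.47
soft_prediction = int(avg_prob_home > 0.5)

print(f"Hard vote:   {hard_votes_home} of {len(tree_probs)} trees say home -> {hard_prediction}")
print(f"Averaged:    {avg_prob_home:.2f} home win probability -> {soft_prediction}")
```

Three of the five trees lean home, but two of them only barely, so the averaged probability ends up below 50% and the averaged prediction is an away win.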


---

## Summary

**What we did:** Built 100 decision trees, each trained on a random sample of NBA games with random features at each split. All 100 trees vote on every prediction.

**Key outputs:**
- **Accuracy** -- how often the model is correct on games it never saw (compare to baselines)
- **Feature importance** -- which stat differences matter most for predicting wins
- **Confidence** -- roughly what share of the 100 trees sided with the prediction (e.g., 0.68 for a home win = about 68% confidence)

**Why the accuracy is actually solid:** NBA games have real randomness -- injuries, hot/cold shooting, rest days -- that no model can predict. We're working with season averages only, and we still beat both random guessing and always-pick-home.