# Predicting Early Match Success in League of Legends

**Name(s)**: Gabrielle Despaigne

**Website Link**: https://gdespaigne.github.io/dsc259r-final/

In [6]:
import pandas as pd
import numpy as np
from pathlib import Path

import plotly.express as px
pd.options.plotting.backend = 'plotly'

# from dsc259r_utils import * # Feel free to uncomment and use this.

## Step 1: Introduction

### Quick Stats About the Dataset
The League of Legends dataset contains detailed statistics from professional matches. Each game is represented by multiple rows, including player-level statistics and team-level summary rows. The dataset includes 165 columns covering match metadata (league, patch, split, date), performance metrics (kills, gold, objectives), and match outcome (result).

The raw dataset contains 12 rows per game:

- 10 player-level rows (5 per team)

- 2 team-level summary rows

At this stage, no filtering has been performed. This initial inspection helps identify how the dataset is structured before any cleaning or transformation.
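
The 12-rows-per-game structure described above can be checked directly. The following is a minimal sketch on a small made-up frame: the `toy` data and the position labels other than `team` are illustrative assumptions, not rows from the real file. On the real data, the same two checks would be run on `df`.

```python
import pandas as pd

# Hypothetical toy data: 2 games, each with 10 player rows and 2 team rows.
# Position labels other than "team" are placeholders for illustration.
positions = ["top", "jng", "mid", "bot", "sup"] * 2 + ["team", "team"]
toy = pd.DataFrame({
    "gameid": ["G1"] * 12 + ["G2"] * 12,
    "position": positions * 2,
})

# Number of rows contributed by each game (expected: 12 for every game).
rows_per_game = toy.groupby("gameid").size()

# Number of team-level summary rows per game (expected: 2 for every game).
team_rows_per_game = (toy["position"] == "team").groupby(toy["gameid"]).sum()

print(rows_per_game.value_counts())
print(team_rows_per_game.value_counts())
```

If either count varied across games, the team-level filtering used later would not yield exactly two rows per game.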

In [22]:
df = pd.read_csv("/Users/gabrielledespaigne/Documents/GitHub/dsc259r-final/data/2022_LoL_esports_match_data_from_OraclesElixir.csv")
df.head()


| | gameid | datacompleteness | url | league | year | split | playoffs | date | game | patch | ... | opp_csat25 | golddiffat25 | xpdiffat25 | csdiffat25 | killsat25 | assistsat25 | deathsat25 | opp_killsat25 | opp_assistsat25 | opp_deathsat25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ESPORTSTMNT01_2690210 | complete | NaN | LCKC | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 203.0 | 605.0 | -525.0 | 9.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 1 | ESPORTSTMNT01_2690210 | complete | NaN | LCKC | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 163.0 | 421.0 | -903.0 | -28.0 | 2.0 | 4.0 | 2.0 | 1.0 | 5.0 | 1.0 |
| 2 | ESPORTSTMNT01_2690210 | complete | NaN | LCKC | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 187.0 | -149.0 | -224.0 | -5.0 | 1.0 | 3.0 | 0.0 | 3.0 | 4.0 | 3.0 |
| 3 | ESPORTSTMNT01_2690210 | complete | NaN | LCKC | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 284.0 | -1288.0 | -2005.0 | -85.0 | 2.0 | 1.0 | 2.0 | 3.0 | 4.0 | 0.0 |
| 4 | ESPORTSTMNT01_2690210 | complete | NaN | LCKC | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | ... | 27.0 | 499.0 | -314.0 | 12.0 | 1.0 | 3.0 | 2.0 | 0.0 | 7.0 | 2.0 |


In [87]:
## before filtering
display(df.shape)
display(df.columns)

(148980, 165)

Index(['gameid', 'datacompleteness', 'url', 'league', 'year', 'split',
       'playoffs', 'date', 'game', 'patch',
       ...
       'opp_csat25', 'golddiffat25', 'xpdiffat25', 'csdiffat25', 'killsat25',
       'assistsat25', 'deathsat25', 'opp_killsat25', 'opp_assistsat25',
       'opp_deathsat25'],
      dtype='object', length=165)

### Initial Thoughts
From the initial inspection, several important characteristics are evident:

- The dataset includes both player-level and team-level rows.

- There are numerous early-game metrics (e.g., killsat10, golddiffat10).

- Several columns indicate early objective control (e.g., firstdragon, firstherald, firsttower).

- Some columns contain missing values, suggesting structural changes across patches.

### Brainstorming
Based on the structure of the dataset, several potential research questions arise:

- Does securing first blood increase a team’s probability of winning?

- Are early gold advantages at 10 minutes predictive of match outcome?

- Do certain patches produce more snowball-heavy games?

- Is early objective control more predictive of victory than early kills?

- Does the relationship between early-game advantage and winning differ by league?

### Proposed Research Question
After reviewing the available early-game metrics, I focus on the following research question:

<b><i>Is early objective control a stronger indicator of match outcome than early kills in professional League of Legends matches?</i></b>

This question is compelling because it contrasts macro-level strategic control (objectives) with combat-based advantage (kills). Understanding which type of early-game advantage better predicts victory provides insight into competitive strategy and match dynamics.

Importantly, this question can be addressed through both hypothesis testing and predictive modeling, aligning with the later stages of the project.

## Step 2: Data Cleaning and Exploratory Data Analysis

### Data Cleaning
The raw dataset contains both player-level and team-level rows (12 rows per game). Since this analysis focuses on match outcomes at the team level, I restrict the dataset to rows where position == "team", resulting in exactly two rows per game.

In [89]:
df_team = df[df["position"] == "team"].copy()

df_team.shape

(24830, 165)

In [90]:
df_team.groupby("gameid").size().value_counts()

2    12415
Name: count, dtype: int64

This results in 24,830 team-level observations, with exactly two rows per game.

In [99]:
bool_cols = ["firstdragon", "firstherald", "firsttower", "firstblood"]

for col in bool_cols:
    df_team[col] = df_team[col].fillna(0).astype(int)

In [100]:
objective_cols = ["firstdragon", "firstherald", "firsttower"]

df_team["early_objective_control"] = (
    df_team[objective_cols].sum(axis=1) >= 1
).astype(int)

To capture macro-level early advantage, I created a new binary variable, early_objective_control, defined as securing at least one of firstdragon, firstherald, or firsttower.

### Univariate Analysis
The dataset is exactly balanced between wins and losses, as expected since each game contributes one win and one loss.
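
The balance follows from each game having one winner and one loser, which can be verified per game. This is a minimal sketch on hypothetical team-level rows (the `toy_team` frame is made up for illustration); on the real data the same check would be applied to `df_team`.

```python
import pandas as pd

# Hypothetical team-level rows: two rows per game, one winner and one loser.
toy_team = pd.DataFrame({
    "gameid": ["G1", "G1", "G2", "G2", "G3", "G3"],
    "result": [1, 0, 0, 1, 1, 0],
})

# Each game should contain exactly one win, so the per-game sum of result is 1.
wins_per_game = toy_team.groupby("gameid")["result"].sum()
one_winner_per_game = bool((wins_per_game == 1).all())

# With one win and one loss per game, the overall win rate is exactly 0.5.
overall_win_rate = toy_team["result"].mean()
print(one_winner_per_game, overall_win_rate)
```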

In [121]:
df_team["result_label"] = df_team["result"].map({
    0: "Loss",
    1: "Win"
})

fig1 = px.histogram(
    df_team,
    x="result_label",
    color="result_label",
    title="League of Legends 2022 - Distribution of Match Outcomes",
    labels={
        "result_label": "Match Outcome",
        "count": "Number of Matches"
    },
    color_discrete_map={
        "Loss": "red",
        "Win": "blue"
    }
)

fig1.update_layout(
    yaxis_title="Number of Matches",
    template="plotly_white"
)

fig1.show()

In [128]:
df_team["objective_label"] = df_team["early_objective_control"].map({
    0: "No",
    1: "Yes"
})

fig2 = px.histogram(
    df_team,
    x="objective_label",
    color="objective_label",
    title="Distribution of Early Objective Control",
    labels={
        "objective_label": "Early Objective Control",
        "count": "Number of Matches"
    },
    color_discrete_map={
        "No": "red",
        "Yes": "blue"
    }
)

fig2.update_layout(
    yaxis_title="Number of Matches",
    template="plotly_white"
)

fig2.show()

### Bivariate Analysis

In [43]:
## yes/no early objective results
[c for c in df_team.columns if "result" in c.lower() or "win" in c.lower()]

['result']

In [44]:
## result: 0 = loss, 1 = win
df_team["result"].value_counts()

result
0    12415
1    12415
Name: count, dtype: int64

Teams with early objective control have a substantially higher win rate (55.8%) compared to teams without (32.8%).

In [45]:
## finding early objective terms
[c for c in df_team.columns if "first" in c.lower()]

['firstPick',
 'firstblood',
 'firstbloodkill',
 'firstbloodassist',
 'firstbloodvictim',
 'firstdragon',
 'firstherald',
 'firstbaron',
 'firsttower',
 'firstmidtower',
 'firsttothreetowers']

In [46]:
[c for c in df_team.columns if "dragon" in c.lower() or "herald" in c.lower() or "tower" in c.lower()]

['firstdragon',
 'dragons',
 'opp_dragons',
 'dragons (type unknown)',
 'firstherald',
 'heralds',
 'opp_heralds',
 'firsttower',
 'towers',
 'opp_towers',
 'firstmidtower',
 'firsttothreetowers',
 'damagetotowers']

In [48]:
df_team[["firstdragon", "firstherald", "firsttower"]].head()

| | firstdragon | firstherald | firsttower |
|---|---|---|---|
| 10 | 0.0 | 1.0 | 1.0 |
| 11 | 1.0 | 0.0 | 0.0 |
| 22 | 0.0 | 1.0 | 0.0 |
| 23 | 1.0 | 0.0 | 1.0 |
| 34 | 0.0 | NaN | 1.0 |


In [49]:
df_team[["firstdragon", "firstherald", "firsttower"]].sum()

firstdragon    11315.0
firstherald    10519.0
firsttower     11329.0
dtype: float64

<b>Note:</b> I use firstdragon, firstherald, and firsttower as early-game indicators because columns such as dragons count objectives over the whole game. firstbaron is excluded because Baron is usually a mid-game objective, and firsttothreetowers is excluded because it can also be decided in the mid game.

Under this definition, early_objective_control = 1 if a team secured at least one of firstdragon, firstherald, or firsttower, and 0 otherwise.

In [50]:
objective_cols = ["firstdragon", "firstherald", "firsttower"]

df_team["early_objective_control"] = (
    df_team[objective_cols].sum(axis=1) >= 1
).astype(int)

df_team["early_objective_control"].value_counts()

early_objective_control
1    18589
0     6241
Name: count, dtype: int64

In [51]:
df_team.groupby("early_objective_control")["result"].mean()

early_objective_control
0    0.328473
1    0.557588
Name: result, dtype: float64

In [129]:
win_rates = (
    df_team.groupby("early_objective_control")["result"]
    .mean()
    .reset_index()
)

win_rates["objective_label"] = win_rates["early_objective_control"].map({
    0: "No",
    1: "Yes"
})

fig4 = px.bar(
    win_rates,
    x="objective_label",
    y="result",
    color="objective_label",
    title="League of Legends 2022 – Win Rate by Early Objective Control",
    labels={
        "objective_label": "Early Objective Control",
        "result": "Win Rate"
    },
    color_discrete_map={
        "No": "red",
        "Yes": "blue"
    }
)

fig4.update_layout(
    yaxis_title="Win Rate",
    template="plotly_white",
    showlegend=False
)

fig4.update_yaxes(tickformat=".0%")

fig4.show()

In [131]:
win_by_fb = (
    df_team.groupby("firstblood")["result"]
    .mean()
    .reset_index()
)

win_by_fb["firstblood_label"] = win_by_fb["firstblood"].map({
    0: "No First Blood",
    1: "First Blood Secured"
})

fig3 = px.bar(
    win_by_fb,
    x="firstblood_label",
    y="result",
    color="firstblood_label",
    title="League of Legends 2022 – Win Rate by First Blood",
    labels={
        "firstblood_label": "First Blood",
        "result": "Win Rate"
    },
    color_discrete_map={
        "No First Blood": "red",
        "First Blood Secured": "blue"
    }
)

fig3.update_layout(
    yaxis_title="Win Rate",
    template="plotly_white",
    showlegend=False
)

fig3.update_yaxes(tickformat=".0%")

fig3.show()

This allows comparison between combat-based early advantage (first blood) and objective-based advantage.

In [133]:
## objective effect
obj_rates = df_team.groupby("early_objective_control")["result"].mean()
obj_effect = obj_rates[1] - obj_rates[0]

## first blood effect
fb_rates = df_team.groupby("firstblood")["result"].mean()
fb_effect = fb_rates[1] - fb_rates[0]

obj_effect, fb_effect

(np.float64(0.22911481963259428), np.float64(0.22021810893981475))

In [135]:
effect_df = pd.DataFrame({
    "Metric": ["Early Objective Control", "First Blood"],
    "Win Rate Increase": [obj_effect, fb_effect]
})

fig5 = px.bar(
    effect_df,
    x="Metric",
    y="Win Rate Increase",
    color="Metric",
    title="Comparison of Early-Game Advantage Impact on Win Rate",
    labels={"Win Rate Increase": "Increase in Win Rate"},
    color_discrete_map={
        "Early Objective Control": "blue",
        "First Blood": "red"
    }
)

fig5.update_layout(
    template="plotly_white",
    showlegend=False
)

fig5.update_yaxes(tickformat=".0%")

fig5.show()

In [136]:
comparison_df = pd.DataFrame({
    "Metric": ["Objective", "Objective", "First Blood", "First Blood"],
    "Condition": ["No Advantage", "Advantage Secured",
                  "No Advantage", "Advantage Secured"],
    "Win Rate": [
        obj_rates[0],
        obj_rates[1],
        fb_rates[0],
        fb_rates[1]
    ]
})

In [138]:
fig6 = px.bar(
    comparison_df,
    x="Metric",
    y="Win Rate",
    color="Condition",
    barmode="group",
    title="League of Legends 2022 - Win Rates by Type of Early Advantage",
    color_discrete_map={
        "No Advantage": "red",
        "Advantage Secured": "blue"
    }
)

fig6.update_layout(
    template="plotly_white"
)

fig6.update_yaxes(tickformat=".0%")

fig6.show()

While both early objective control and first blood are associated with substantially higher win rates, their effect sizes are remarkably similar (about 22.9 and 22.0 percentage points, respectively). This suggests that both macro-level control and early combat advantage play comparably important roles in determining match outcomes.
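
One way to judge whether two win-rate gaps of this kind are distinguishable is a bootstrap interval for their difference. The sketch below is illustrative only: it runs on synthetic data with made-up win probabilities, the helper names `win_rate_gap` and `bootstrap_gap_difference` are hypothetical, and it resamples rows independently, which ignores the fact that the two rows of a game are linked.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for df_team (values are made up for illustration only).
n = 2000
toy = pd.DataFrame({
    "early_objective_control": rng.integers(0, 2, size=n),
    "firstblood": rng.integers(0, 2, size=n),
})
# In this toy setup, each advantage raises the win probability by 0.2.
p_win = 0.3 + 0.2 * toy["early_objective_control"] + 0.2 * toy["firstblood"]
toy["result"] = (rng.random(n) < p_win).astype(int)


def win_rate_gap(frame: pd.DataFrame, flag: str) -> float:
    """Win rate when flag == 1 minus win rate when flag == 0."""
    rates = frame.groupby(flag)["result"].mean()
    return float(rates.loc[1] - rates.loc[0])


def bootstrap_gap_difference(frame, n_boot=500, seed=0):
    """Percentile interval for (objective gap - first blood gap)."""
    boot_rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample rows with replacement (a simplification: rows from the
        # same game are treated as independent here).
        idx = boot_rng.integers(0, len(frame), size=len(frame))
        sample = frame.iloc[idx]
        diffs[i] = (win_rate_gap(sample, "early_objective_control")
                    - win_rate_gap(sample, "firstblood"))
    return np.percentile(diffs, [2.5, 97.5])


ci_low, ci_high = bootstrap_gap_difference(toy)
print(f"95% bootstrap interval for the difference in gaps: "
      f"[{ci_low:.3f}, {ci_high:.3f}]")
```

An interval that contains 0 would be consistent with the two advantages having similar effect sizes. A game-level resampling scheme would respect the pairing of rows more faithfully.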

### Interesting Aggregates
This table shows how the relationship between early objective control and win rate varies across patches. In most patches, teams with early objective control maintain a consistent advantage.

In [97]:
agg_table = (
    df_team.groupby(["patch", "early_objective_control"])["result"]
    .mean()
    .unstack()
)

agg_table.head()

| patch | early_objective_control = 0 | early_objective_control = 1 |
|---|---|---|
| 12.01 | 0.236559 | 0.558566 |
| 12.02 | 0.363328 | 0.559322 |
| 12.03 | 0.246429 | 0.550141 |
| 12.04 | 0.40249 | 0.556901 |
| 12.05 | 0.310606 | 0.551921 |


## Step 3: Assessment of Missingness

In [156]:
df_team.isna().sum().sort_values(ascending=False).head(15)

opp_atakhans               24830
playername                 24830
firstbloodkill             24830
firstbloodassist           24830
firstbloodvictim           24830
damageshare                24830
atakhans                   24830
champion                   24830
playerid                   24830
earnedgoldshare            24830
total cs                   24830
opp_void_grubs             23240
void_grubs                 23240
monsterkillsownjungle      20958
monsterkillsenemyjungle    20958
dtype: int64

In [None]:
missing = df_team.isna().sum()
missing = missing[(missing > 0) & (missing < len(df_team))]
missing.sort_values(ascending=False).head(15)

In [None]:
## does missingness of void_grubs depend on patch?
## it being missing in most rows suggests it wasn't added until after this year
df_team["void_grubs_missing"] = df_team["void_grubs"].isna().astype(int)
df_team["void_grubs_missing"].value_counts()

In [None]:
## is void_grubs missing in every patch of the year? this suggests no: it was recorded for some patches
df_team.groupby("patch")["void_grubs_missing"].mean()

#### Column Selection
The column void_grubs contains substantial missingness (23,240 missing values out of 24,830 team-level rows). Because this missingness is non-trivial and appears structurally meaningful, it is selected for further analysis.

To analyze missingness, I create a missingness indicator:

In [139]:
df_team["void_grubs_missing"] = df_team["void_grubs"].isna().astype(int)

df_team["void_grubs_missing"].value_counts()

void_grubs_missing
1    23240
0     1590
Name: count, dtype: int64

This confirms that the majority of observations have missing values for this column.

#### NMAR Analysis

To determine whether the missingness is likely Not Missing At Random (NMAR), we need to reason about how the data were generated.

Void Grubs were introduced in a later season of League of Legends (2024) and did not exist in earlier patches. Therefore, in matches played before their introduction, this column would naturally be missing. Since patch version is observed in the dataset, and missingness can plausibly be explained by patch version, the missingness of void_grubs is consistent with Missing At Random (MAR), not NMAR.

There is no evidence that missingness depends on the value of void_grubs itself. As such, the data are unlikely to be NMAR.

#### Missingness Dependency Tests

##### Dependency on Patches
Exploratory analysis showed that missingness rates vary across patch versions: some patches exhibit 100% missingness, while others show slightly lower rates.

To simplify the test, patches were grouped into:

- High-missingness patches (rate = 1.0)

- Partially-missing patches (rate < 1.0)

In [149]:
df_team.groupby("patch")["void_grubs_missing"].mean()

patch
12.01    0.863103
12.02    0.874615
12.03    1.000000
12.04    0.912844
12.05    0.958435
12.06    1.000000
12.07    1.000000
12.08    1.000000
12.09    1.000000
12.10    0.908148
12.11    0.919063
12.12    0.942580
12.13    0.934608
12.14    0.902357
12.15    0.858696
12.16    1.000000
12.17    1.000000
12.18    1.000000
12.19    1.000000
12.20    1.000000
12.21    1.000000
12.23    1.000000
Name: void_grubs_missing, dtype: float64

In [150]:
## grouping variable to start permutation testing
df_team["high_missing_patch"] = (
    df_team["patch"].isin(
        df_team.groupby("patch")["void_grubs_missing"]
        .mean()
        .loc[lambda x: x == 1.0]
        .index
    )
).astype(int)

df_team["high_missing_patch"].value_counts()

high_missing_patch
0    18574
1     6256
Name: count, dtype: int64

In [151]:
## observed diff
obs_diff = (
    df_team.groupby("high_missing_patch")["void_grubs_missing"]
    .mean()
    .diff()
    .iloc[-1]
)

obs_diff

np.float64(0.08560353181867131)

Observed difference ≈ 0.0856

In [152]:
n_perms = 5000
perm_diffs = []

for _ in range(n_perms):
    shuffled = df_team["void_grubs_missing"].sample(frac=1).values
    temp = df_team.copy()
    temp["shuffled"] = shuffled
    
    diff = (
        temp.groupby("high_missing_patch")["shuffled"]
        .mean()
        .diff()
        .iloc[-1]
    )
    
    perm_diffs.append(diff)

perm_diffs = np.array(perm_diffs)

p_value = np.mean(perm_diffs >= obs_diff)

obs_diff, p_value

(np.float64(0.08560353181867131), np.float64(0.0))

p-value ≈ 0.0

The permutation test yields a p-value approximately equal to 0, indicating strong evidence that missingness of void_grubs depends on patch version. This supports the earlier reasoning that missingness is structurally tied to patch changes.
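
An alternative that avoids defining the groups from the missingness rates themselves is to compare the full distribution of patch between missing and non-missing rows using total variation distance (TVD). The following is a sketch under stated assumptions: the data are synthetic, the missingness probabilities are made up, and `tvd_by_group` and `tvd_permutation_test` are hypothetical helper names rather than functions from the notebook.

```python
import numpy as np
import pandas as pd


def tvd_by_group(frame, category_col, group_col):
    """TVD between the category distributions of the two groups."""
    dist = pd.crosstab(frame[category_col], frame[group_col],
                       normalize="columns")
    return float(dist.iloc[:, 0].sub(dist.iloc[:, 1]).abs().sum() / 2)


def tvd_permutation_test(frame, category_col, group_col,
                         n_perms=500, seed=0):
    """Shuffle the group labels and compare simulated TVDs to the observed."""
    rng = np.random.default_rng(seed)
    observed = tvd_by_group(frame, category_col, group_col)
    labels = frame[group_col].to_numpy()
    work = frame[[category_col]].copy()
    stats = np.empty(n_perms)
    for i in range(n_perms):
        work[group_col] = rng.permutation(labels)
        stats[i] = tvd_by_group(work, category_col, group_col)
    return observed, float(np.mean(stats >= observed))


# Synthetic example: missingness probability differs by patch (made-up rates).
data_rng = np.random.default_rng(1)
patches = data_rng.choice(["12.01", "12.02", "12.03"], size=600)
miss_prob = pd.Series(patches).map({"12.01": 0.5, "12.02": 0.8, "12.03": 1.0})
toy = pd.DataFrame({
    "patch": patches,
    "void_grubs_missing": (data_rng.random(600) < miss_prob.to_numpy()).astype(int),
})

observed_tvd, p_value = tvd_permutation_test(toy, "patch", "void_grubs_missing")
print(observed_tvd, p_value)
```

Because TVD is non-negative and grows as the two distributions diverge, large simulated values count as evidence against the null, so the comparison is `>=` without taking absolute values.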

##### Independence from Side
Next, we test whether missingness depends on team side (Blue vs Red). There is no theoretical reason that side selection would influence whether Void Grubs existed in a patch.

###### Null Hypothesis:
Missingness of void_grubs is independent of side.

###### Alternative Hypothesis:
Missingness of void_grubs depends on side.

In [153]:
obs_side_diff = (
    df_team.groupby("side")["void_grubs_missing"]
    .mean()
    .diff()
    .iloc[-1]
)

obs_side_diff

np.float64(0.0)

In [154]:
n_perms = 5000
perm_diffs_side = []

for _ in range(n_perms):
    shuffled = df_team["void_grubs_missing"].sample(frac=1, replace=False).values
    
    temp = df_team.copy()
    temp["shuffled_missing"] = shuffled
    
    diff = (
        temp.groupby("side")["shuffled_missing"]
        .mean()
        .diff()
        .iloc[-1]
    )
    
    perm_diffs_side.append(diff)

perm_diffs_side = np.array(perm_diffs_side)

p_value_side = np.mean(np.abs(perm_diffs_side) >= np.abs(obs_side_diff))

obs_side_diff, p_value_side

(np.float64(0.0), np.float64(1.0))

The observed difference is exactly 0 and the permutation test yields a p-value of 1.0, so there is no evidence that missingness of void_grubs depends on team side. This aligns with expectations, as side selection is unrelated to patch-level structural changes.

In [155]:
fig_side = px.histogram(
    perm_diffs_side,
    nbins=40,
    title="Permutation Distribution: Missingness vs Side",
    labels={"value": "Difference in Missingness Rate"}
)

fig_side.add_vline(
    x=obs_side_diff,
    line_color="red",
    annotation_text="Observed Difference",
    annotation_position="top right"
)

fig_side.update_layout(template="plotly_white")

fig_side.show()

## Step 4: Hypothesis Testing

### Research Question:
Does securing early objective control increase the probability of winning a match?

### Hypotheses

#### Null Hypothesis (H₀):
The probability of winning is the same for teams with early objective control and teams without early objective control.

#### Alternative Hypothesis (H₁):
Teams with early objective control have a higher probability of winning than teams without early objective control.

This is a one-sided test because we are specifically testing whether early objective control increases win probability.

### Test Statistic

The test statistic is the difference in win proportions:

Win rate (early objective = 1) − Win rate (early objective = 0)

This measures how much early objective control shifts the probability of winning.
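
The statistic above and its one-sided permutation p-value can be sketched in plain NumPy. This is a minimal illustration on made-up arrays; `diff_in_proportions` and `one_sided_permutation_test` are hypothetical helper names, not part of the notebook.

```python
import numpy as np


def diff_in_proportions(outcome, group):
    """Mean of outcome where group == 1 minus mean where group == 0."""
    outcome = np.asarray(outcome, dtype=float)
    group = np.asarray(group).astype(bool)
    return outcome[group].mean() - outcome[~group].mean()


def one_sided_permutation_test(outcome, group, n_perms=2000, seed=0):
    """Shuffle outcomes and count statistics at least as large as observed."""
    rng = np.random.default_rng(seed)
    observed = diff_in_proportions(outcome, group)
    perm_stats = np.array([
        diff_in_proportions(rng.permutation(outcome), group)
        for _ in range(n_perms)
    ])
    p_value = float(np.mean(perm_stats >= observed))
    return observed, p_value


# Made-up example: 10 team rows, the first 5 with early objective control.
toy_result = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
toy_control = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

observed, p_value = one_sided_permutation_test(toy_result, toy_control)
print(observed, p_value)
```

Shuffling the outcome while holding the group labels fixed simulates the null hypothesis that results are unrelated to early objective control. Seeding the generator makes the simulated p-value reproducible.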

In [157]:
obs_diff = (
    df_team.groupby("early_objective_control")["result"]
    .mean()
    .diff()
    .iloc[-1]
)

obs_diff

np.float64(0.22911481963259428)

The observed difference in win rates is approximately 22.9 percentage points.

In [None]:
n_perms = 5000
perm_diffs = []

for _ in range(n_perms):
    shuffled = df_team["result"].sample(frac=1, replace=False).values
    
    temp = df_team.copy()
    temp["shuffled_result"] = shuffled
    
    diff = (
        temp.groupby("early_objective_control")["shuffled_result"]
        .mean()
        .diff()
        .iloc[-1]
    )
    
    perm_diffs.append(diff)

perm_diffs = np.array(perm_diffs)

p_value = np.mean(perm_diffs >= obs_diff)

obs_diff, p_value

Under the null hypothesis, match outcomes are independent of early objective control.

### Summary
The permutation test yields a p-value approximately equal to 0, indicating that the observed difference in win rates is extremely unlikely under the null hypothesis. Therefore, we reject the null hypothesis and conclude that early objective control is strongly associated with match outcome.

This provides statistical evidence that teams with early objective control win significantly more often than teams without it.
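
A simulated p-value of exactly 0 only means that none of the permuted statistics reached the observed one, so it is bounded by the number of permutations rather than being truly zero. A common convention counts the observed statistic as one of the draws, which gives a smallest possible value of 1 / (n_perms + 1). The sketch below illustrates that convention with made-up inputs; `permutation_p_value` is a hypothetical helper name.

```python
import numpy as np


def permutation_p_value(perm_stats, observed):
    """One-sided p-value that counts the observed statistic as one draw."""
    perm_stats = np.asarray(perm_stats)
    exceed = np.sum(perm_stats >= observed)
    return (exceed + 1) / (len(perm_stats) + 1)


# Hypothetical case: none of 5000 simulated statistics reach the observed one.
fake_perm_stats = np.zeros(5000)
p = permutation_p_value(fake_perm_stats, observed=0.229)
print(p)  # 1 / 5001
```

With 5,000 permutations, a result like this would be reported as p < 0.001 rather than p = 0.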

## Step 5: Framing a Prediction Problem

In [17]:
# TODO

## Step 6: Baseline Model

In [18]:
# TODO

## Step 7: Final Model

In [19]:
# TODO

## Step 8: Fairness Analysis

In [20]:
# TODO