# Project Progress Report Title:

*Predicting Match Outcomes in the Game DoTA*

*Team CSK*

-   Wyatt Churchman (jdr357) - Model Selection and Builder
-   Segundo Sanchez (sas458) - Report Writing
-   Ryan Kerlick (rak88) - Data Collection and Analysis

### Project Abstract

-   In this project, we want to predict the outcome of DoTA 2 matches using real game data from the OpenDota JSON Data Dump. Each match contains a large amount of numeric information, including player statistics, gold and XP graphs, combat logs, item usage, time-series data, and other gameplay metrics. Our plan is to convert these JSON objects into a large, high-dimensional numeric dataset (well over 10 million floats) and then build machine learning models on top of it. We will start with simple baseline models to see how well raw features predict whether the Radiant team wins. After that, we will apply dimensionality-reduction and curse-of-dimensionality mitigation techniques, such as PCA, Random Projection, and Approximate Polytope Projection (APP), to see whether they improve performance or reveal hidden structure in the matches. The goal is not just to predict the winner, but also to learn something interesting about playstyles, match patterns, and how high-dimensional features interact.

-   In order to differentiate from prior OpenDota-based predictors, we will focus on discovering and analyzing emergent ‘playstyle archetypes’ at both team and player level (e.g., farm-heavy vs. fight-heavy lineups, objective‑centric vs. pickoff‑centric teams) in the learned low-dimensional spaces, rather than optimizing only for win‑prediction accuracy.

### Problem Statement

-   In competitive online games like DoTA, it’s extremely important to match players of similar skill levels. When a low-skill player gets paired against someone who is much more experienced, the outcome is almost always one-sided, which leads to frustration and a bad gameplay experience. At the same time, high-skill players don’t get much enjoyment from a match that offers no challenge. Over time, poor matchmaking can reduce player engagement and hurt the game’s long-term health and revenue.

-   Being able to predict match outcomes based on pre-match and early-match features could help improve matchmaking systems by identifying patterns that separate balanced matches from mismatched ones. Understanding these patterns also helps explain what actually influences a fair, competitive game.

-   Our original question was: can we use DoTA match data with ML algorithms to predict match outcomes based on learned player and team playstyles?
-   After feedback from the professor that win prediction is a very common machine learning project in this space, we needed an alternative angle to differentiate our model. This led us to the 'playstyle archetypes' approach. Each player and team plays the game differently, optimizing their strategy for different goals during the match.
-   Our original plan to use the OpenDota JSON Data Dump fell through for two reasons. At over half a terabyte in total, the dataset was too large and unwieldy for us to use, and the torrent linked on the website was broken, so we could not download it. We pivoted to collecting our own data through the OpenDota API, which gave us a dataset that fits the project requirements and provides enough floats to train our models.

## Dataset
### Data Description and Overview
  We successfully collected 104,219 professional DotA 2 matches using the OpenDota API. The dataset contains 16 features spanning match metadata, team statistics, objectives, advantages, events, and vision control metrics, yielding 16,675,040 individual float values.

### Dataset Characteristics:

- Total matches: 104,219
- Total features: 16
- Shape: (104219, 16)
- Match type: Professional competitive matches
- Time period: Recent professional matches (2024-2025)
- Collection method: OpenDota API (/proMatches endpoint)

After discovering that the original OpenDota JSON Data Dump was over 500GB with a broken torrent link, we pivoted to collecting data directly through the OpenDota API. This approach gave us more control over data quality and allowed us to focus specifically on professional matches where gameplay patterns are more consistent and strategic.
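
The collection loop can be sketched roughly as follows. This is a minimal illustration, not the script we ran: it assumes `/proMatches` returns a JSON list of match dictionaries (newest first) and accepts a `less_than_match_id` query parameter for paging. The function names, the one-second pause, and the injectable `fetch_page` argument are hypothetical choices made so the loop can be exercised without network access.

```python
import json
import time
import urllib.request

API_URL = "https://api.opendota.com/api/proMatches"


def fetch_page_http(less_than_match_id=None):
    """Fetch one page of pro matches (assumed: a JSON list of match dicts)."""
    url = API_URL
    if less_than_match_id is not None:
        url += f"?less_than_match_id={less_than_match_id}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def collect_pro_matches(fetch_page, max_pages, delay_seconds=1.0):
    """Page backwards through match history, de-duplicating on match_id."""
    matches = {}
    cursor = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        if not page:
            break  # no older matches left
        for match in page:
            matches[match["match_id"]] = match
        # The next request asks for matches older than the oldest one seen so far.
        cursor = min(match["match_id"] for match in page)
        time.sleep(delay_seconds)  # assumed courtesy pause between calls
    return list(matches.values())
```

Calling `collect_pro_matches(fetch_page_http, max_pages=...)` would hit the live API; passing a stub function instead makes the paging logic testable offline.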


## Feature Categories
Our dataset includes features organized into the following categories:
### Match Metadata (9 features):
Duration, region, patch version, game mode, start time, match sequence number, first blood timing

### Team Statistics (36 features):
Kills, deaths, assists, last hits, denies, gold per minute, XP per minute, gold spent, hero damage, hero healing, tower damage, average level, wards placed, Roshan kills, tower kills, barracks status - all split between Radiant and Dire teams

### Objective Tracking (8 features):
Aegis pickups, Aegis steals, courier losses, denied Aegis, first blood, miniboss kills, Roshan kills, building kills

### In-game Events (4 features):
Buyback count, kill count, purchase count, rune pickups

### Vision Control (2 features):
Observer wards placed, sentry wards placed

### Advantage Metrics (2 features):
Final gold advantage, final XP advantage (positive values indicate Radiant advantage)

### Target Variable (1 feature):
radiant_win (binary True/False indicating match outcome)

![dataset_eda.png](attachment:185c6b45-b3a4-4143-82d1-b0ddbb181084.png)

Figure 1: Exploratory data analysis showing match duration distribution, team kill comparisons, gold advantage patterns, and feature correlations with match outcome.
### Distribution Analysis
#### Target Variable Distribution:
The target variable (radiant_win) shows a well-balanced distribution:

- Radiant wins: 7,767 matches (51.78%)
- Dire wins: 7,233 matches (48.22%)

This near 50-50 split indicates the dataset is well balanced, so we do not need to apply class-balancing techniques. Radiant's win rate sits 1.78 percentage points above 50%, which is consistent with known game balance in professional DotA.

#### Match Duration:
Professional matches show the following duration characteristics:

- Mean: 2,013 seconds (33.5 minutes)
- Median: 1,891 seconds (31.5 minutes)
- Standard deviation: 583 seconds
- Range: 6.7 to 119 minutes
- Skewness: 1.19 (right-skewed distribution)

The right skew indicates that while most games end around 30 minutes, some matches extend significantly longer. These longer games likely represent evenly matched teams or late-game comeback scenarios.
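
For reference, a skewness figure like the one above can be computed as the third standardized moment. The sketch below uses the population (Fisher-Pearson) form; library routines such as pandas' `skew` apply a small-sample correction, so values can differ slightly. The duration values in the usage note are made up for illustration.

```python
def sample_skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5
```

A list such as `[1500, 1800, 1900, 2100, 4500]` (seconds, hypothetical) gives a positive value because the single long game pulls the right tail out, while a symmetric list gives zero.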

#### Team Kills:
Both teams show similar kill distributions:

- Radiant average: 29.67 kills (std: 12.76)
- Dire average: 29.12 kills (std: 13.18)
- Range for both: 0-90+ kills

The similar means and spreads suggest that neither team has an inherent advantage in securing kills, which is consistent with the game's team balance.

#### Economic Metrics:
Gold per minute shows interesting patterns:

- Radiant average: 2,441 GPM (std: 379)
- Dire average: similar distribution (the features are symmetric)
- Range: 624 to 3,667 GPM
- Slight negative skew (-0.53) indicates most teams maintain steady farm

#### Advantage Metrics:
Final gold and XP advantages show wide variation:

- Gold advantage mean: 390 gold (median: 4,004)
- XP advantage standard deviation: 28,266 (indicating large swings)
- Range: -107,600 to +105,737

These large ranges reflect the variety of match outcomes from stomps to close games.

### Correlation Analysis
We analyzed correlations between all numeric features and the match outcome. The top 15 features most correlated with radiant_win are:

1. final_gold_advantage (0.893)
2. final_xp_advantage (0.861)
3. tower_damage_dire (0.766)
4. gold_per_min_dire (0.759)
5. tower_damage_radiant (0.755)
6. gold_per_min_radiant (0.747)
7. towers_killed_dire (0.735)
8. towers_killed_radiant (0.718)
9. xp_per_min_dire (0.646)
10. kills_dire (0.620)
11. barracks_status_radiant (0.619)
12. dire_score (0.615)
13. deaths_radiant (0.613)
14. assists_dire (0.610)
15. xp_per_min_radiant (0.599)
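
A ranking like this can be produced with a plain Pearson correlation against the 0/1 target (with a binary target this is the point-biserial correlation). The sketch below is illustrative only: the helper names are hypothetical, it sorts by absolute value as one reasonable way to rank, and the toy numbers in the usage note are not from our dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_by_abs_correlation(columns, target):
    """Return (feature, r) pairs sorted by |r| against the binary target."""
    scores = {name: pearson_r(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
```

For example, with `columns = {"final_gold_advantage": [5000, -3000, 8000, -6000], "duration": [1800, 2400, 2000, 2200]}` and `target = [1, 0, 1, 0]`, the gold advantage ranks first with a positive sign and duration second with a negative sign.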

### Key Observations:
- The two advantage metrics (gold and XP) are by far the strongest predictors, with correlations near 0.9. This confirms the importance of economic control in DotA - the team with more gold and experience typically wins.
- Economic metrics (gold_per_min, xp_per_min) and objective control (tower_damage, towers_killed) form the next tier of important features. Interestingly, both Radiant and Dire versions of these features appear in the top correlations, which makes sense since one team's advantage is another team's deficit.
- Combat statistics (kills, deaths, assists) have moderate correlations (0.6-0.65 range), suggesting they matter but are secondary to economic dominance.

#### Multicollinearity Concerns:
Final gold advantage and final XP advantage are highly correlated with each other (likely r > 0.9), which suggests redundancy. Similarly, the Radiant and Dire versions of statistics are negatively correlated by design. We should consider dimensionality reduction techniques or feature selection to address this.

### Missing Data Analysis
Out of 65 features, 18 columns contain missing values.

#### Most Significant:

- skill: 100% missing (15,000/15,000 rows)
  - This field appears to be unused in professional matches
  - Will drop this feature entirely
- region: 1.37% missing (206/15,000 rows)
  - Missing region data for some matches
  - Will impute with mode or create "unknown" category

#### Minor Missing Data:
Eight features each have 6 missing values (0.04%): six objective-related counts and the two advantage metrics.

- objective_type_CHAT_MESSAGE_AEGIS_STOLEN
- objective_type_CHAT_MESSAGE_DENIED_AEGIS
- objective_type_CHAT_MESSAGE_FIRSTBLOOD
- objective_type_CHAT_MESSAGE_MINIBOSS_KILL
- objective_type_CHAT_MESSAGE_ROSHAN_KILL
- objective_type_building_kill
- final_gold_advantage
- final_xp_advantage

These represent matches where certain events didn't occur or weren't logged. For objective counts, we can safely treat missing as zero. For advantage metrics, we may need to drop these 6 matches or impute based on other features.
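
One way to apply the rules above is sketched here. It assumes each match is a dictionary with `None` for missing values, and it takes the "drop" option for the advantage metrics rather than imputing; the function name and the `roshan_kills` column in the usage note are hypothetical.

```python
def clean_missing(rows, count_columns, required_columns):
    """Zero-fill missing event counts; drop rows missing a required value."""
    cleaned = []
    for row in rows:
        if any(row.get(col) is None for col in required_columns):
            continue  # e.g. no final gold/XP advantage recorded for this match
        fixed = dict(row)  # copy so the input rows stay untouched
        for col in count_columns:
            if fixed.get(col) is None:
                fixed[col] = 0  # event never logged: treat as zero occurrences
        cleaned.append(fixed)
    return cleaned
```

For instance, `clean_missing(rows, ["roshan_kills"], ["final_gold_advantage"])` keeps a match with an unlogged Roshan count (set to 0) but drops a match with no final gold advantage.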

#### Overall Data Quality:
Total missing values across entire dataset: approximately 15,200 out of 975,000 total cells (1.56%). The vast majority comes from the single unused "skill" column. After dropping that column and handling the minor missing data, we have a clean dataset ready for modeling.

### Outlier Detection
Using the IQR method (values more than 1.5 times the interquartile range below the first quartile or above the third quartile):

- duration: 490 outliers (3.27%)
  - These represent unusually long or short games
  - Retained: legitimate match outcomes (stomps or stalemates)
- kills_radiant: 73 outliers (0.49%)
  - High-kill games, typically aggressive strategies
  - Retained: valid professional playstyles
- kills_dire: 64 outliers (0.43%)
  - Similar to Radiant, represents aggressive play
  - Retained: legitimate matches

Decision: We retained all outliers because they represent real match scenarios in professional play. Extremely short games might be early surrenders or stomps, while extremely long games likely represent well-matched teams. High-kill games reflect aggressive team compositions or chaotic teamfights. Removing these would eliminate interesting edge cases that could help our model understand the full range of playstyles.
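
The IQR rule itself is short enough to sketch with the standard library. The quartile convention (`method="inclusive"`) is an assumption; pandas or NumPy defaults may place the fences slightly differently, so counts can vary by a few matches. The durations in the usage note are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```

With durations `[1800, 1900, 2000, 2100, 2200, 7140]` (seconds), only the 119-minute game (7140 s) falls outside the fences and would be flagged, not removed.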

### Data Cleaning Steps
Based on our analysis, we performed the following cleaning operations:

- Dropped the "skill" feature - 100% missing, not used for professional matches
- Handled region missing values - Will encode as separate "unknown" category or impute with mode
- Validated player counts - Confirmed all matches have exactly 10 players
- Checked for duplicates - No duplicate match IDs found
- Verified target variable - No missing values in radiant_win
- Retained outliers - All outliers represent legitimate match scenarios

After cleaning, our usable dataset contains 15,000 matches with 64 features (after dropping "skill").

### Insights Gained from EDA
Several important insights emerged from our exploratory analysis that will inform our modeling approach:

#### 1. Economic Dominance Drives Wins
The correlation analysis clearly shows that gold and XP advantages are the strongest predictors of match outcome (0.89 and 0.86 respectively). This aligns with fundamental DotA strategy - teams that farm more gold can buy better items, which leads to winning teamfights and eventually the match. This suggests our models should weight economic features heavily.

#### 2. Objectives Are Secondary but Important
Tower damage and tower kills show strong correlations (0.72-0.77), on par with gold per minute but weaker than the final gold and XP advantages. This makes sense because taking objectives generates gold, so objective control may be partially a consequence of economic advantage rather than an independent cause of winning.

#### 3. Combat Stats Are Lagging Indicators
Kills, deaths, and assists show moderate correlations (0.59-0.62), suggesting they're less directly predictive than economy or objectives. This could be because kills are both a cause (kill gold helps) and effect (stronger team gets more kills) of winning, creating a complex relationship.

#### 4. Match Duration Effects
The wide range and right skew of match duration (7 to 119 minutes) suggests that game length affects prediction dynamics. Early-game stomps are likely easier to predict than long, drawn-out matches where the lead changes hands multiple times.

#### 5. Team Balance Validation
The symmetric distributions between Radiant and Dire statistics suggest that neither team has a large inherent advantage in this dataset. This supports our data quality and suggests the match outcomes are determined mainly by player skill and strategy rather than game imbalance.

#### 6. Feature Engineering Opportunities
The high correlations between raw stats suggest we should create derived features:

- Ratios (kill/death ratio, GPM ratio between teams)
- Normalized advantages (gold advantage as percentage of total gold)
- Efficiency metrics (damage per gold spent)
- Early-game vs late-game indicators

These engineered features might capture team playstyles better than raw numbers alone.
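
As a rough illustration of the derived features listed above, the sketch below builds a few of them from one match's team totals. The column names `hero_damage_radiant` and `gold_spent_radiant` are assumed spellings, the GPM-based advantage share is a stand-in for "gold advantage as percentage of total gold", and the +1 smoothing on kills is an illustrative choice. It also assumes both teams have positive GPM and gold spent.

```python
def engineer_features(match):
    """Turn mirrored Radiant/Dire totals into symmetric playstyle features."""
    gpm_r = match["gold_per_min_radiant"]
    gpm_d = match["gold_per_min_dire"]
    return {
        # +1 smoothing avoids division by zero when a team has no kills.
        "kill_ratio": (match["kills_radiant"] + 1) / (match["kills_dire"] + 1),
        "gpm_ratio": gpm_r / gpm_d,
        # Radiant's economic lead as a share of both teams' combined income.
        "gpm_advantage_share": (gpm_r - gpm_d) / (gpm_r + gpm_d),
        # Fighting efficiency: hero damage dealt per unit of gold spent.
        "damage_per_gold_radiant": match["hero_damage_radiant"]
        / match["gold_spent_radiant"],
    }
```

A ratio above 1 (or a positive share) points toward Radiant, so one derived column can replace a mirrored Radiant/Dire pair.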

### Next Steps Informed by EDA
Based on these findings, our next steps are:

1. Feature Engineering - Create ratio features, dominance indicators, and efficiency metrics to capture team playstyles beyond raw statistics (in progress, see Analysis/02_Feature_Engineering.ipynb)
2. Dimensionality Reduction - Apply PCA or Random Projection to handle multicollinearity between highly correlated features like gold_advantage and xp_advantage
3. Baseline Modeling - Train simple models (Logistic Regression, Random Forest) to establish performance benchmarks and understand which features matter most
4. Feature Selection - Use model-based feature importance to identify which of our 100+ features (after engineering) actually contribute to predictions
5. Playstyle Clustering - Apply unsupervised learning (K-means, hierarchical clustering) on dimensionality-reduced features to identify team playstyle archetypes (farm-heavy, fight-heavy, objective-focused, etc.)

Detailed exploratory analysis code available in: Analysis/01_EDA.ipynb

## Methodology

### Baseline method implementation

-   Baseline Models Attempted
-   [ ] Logistic Regression (tests which features have direct, interpretable influence on match outcome)
-   [ ] Random Forest Classifier (Captures interactions between features)

-   [ ] Results: quantitative metrics and qualitative observations.
-   [ ] Discussion of results: What do baseline results reveal about the
    problem/data?
-   [ ] Visual or tabular summaries as appropriate.

## Improvements and other methods implementation

## Feature Engineering and Dimensionality Reduction

We performed several improvements beyond baseline modeling:

- Feature Engineering
- [ ] Created ratio features (kill/death ratio, GPM ratio, XP ratio).
- [ ] Introduced dominance metrics (gold advantage normalized by total team gold).
- [ ] Added efficiency metrics (damage per gold spent).
- [ ] Extracted early-game and late-game versions of several features to analyze temporal tendencies.

These engineered features aim to capture playstyle rather than raw power.

- Feature Selection
- [ ] Used model-based feature importance from Random Forest to prune redundant metrics.
- [ ] Removed raw Radiant/Dire duplicates in favor of derived symmetric features (e.g., kill_ratio).

- Dimensionality Reduction for High-Dimensionality Mitigation

PCA: Reduced 64 features down to 8–12 principal components while capturing ~92% of variance.
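
A minimal sketch of how the component count for a variance target can be chosen, assuming NumPy is available. It standardizes the columns and uses an SVD rather than any particular library's PCA class; the function name and the 0.92 default are illustrative, and this is not necessarily the exact procedure behind the figure above.

```python
import numpy as np


def pca_for_variance(X, target=0.92):
    """Return (k, explained_ratio, scores) for the smallest k reaching target."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    std = centered.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    Z = centered / std
    # Squared singular values are proportional to the variance along each component.
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    cumulative = np.cumsum(explained)
    k = min(int(np.searchsorted(cumulative, target)) + 1, len(s))
    scores = Z @ vt[:k].T  # matches projected onto the first k components
    return k, explained, scores
```

Two perfectly correlated columns, for example, collapse to a single component, which is the behaviour we rely on for mirrored or near-duplicate features.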

Random Projection (Gaussian): Explored alternative embeddings to preserve pairwise distances for clustering.
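
The idea behind a Gaussian random projection can be sketched in plain Python: multiply each feature vector by a random matrix whose entries are drawn from N(0, 1/k), where k is the target dimension, so that squared distances are preserved in expectation. The function name, seed handling, and list-of-lists layout are illustrative assumptions.

```python
import math
import random


def gaussian_random_projection(rows, n_components, seed=0):
    """Project each row onto n_components random Gaussian directions."""
    n_features = len(rows[0])
    rng = random.Random(seed)  # seeded so the embedding is reproducible
    scale = 1.0 / math.sqrt(n_components)  # std dev giving variance 1/k
    matrix = [
        [rng.gauss(0.0, scale) for _ in range(n_features)]
        for _ in range(n_components)
    ]
    return [
        [sum(w * x for w, x in zip(direction, row)) for direction in matrix]
        for row in rows
    ]
```

Unlike PCA, the directions do not depend on the data, so the projection is cheap to compute but the resulting axes carry no interpretation on their own.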

Approximate Polytope Projection (APP): Initial experiments for better geometric interpretability.

These methods are essential due to:

- high inter-feature correlation,
- redundancy between Radiant and Dire mirrored stats,
- the desire to identify structure in playstyles, not only predict wins.

### Proposed Better-Fitting Model

Given the results so far, we propose Gradient Boosted Trees (XGBoost or LightGBM):

- Typically strong performance on tabular data
- Predictions are robust to multicollinearity
- Interpretable via SHAP values
- Efficient and scalable for 15,000 matches

We expect boosted trees to outperform Random Forest by 2–5%, because each new tree is fit to correct the errors of the previous ones.

### Hyperparameter Tuning Work Completed

Initial hyperparameter sweeps (partial):

- max_depth ∈ {3, 5, 7}
- learning_rate ∈ {0.01, 0.05, 0.1}
- n_estimators ∈ {200, 400, 600}
- regularization parameters (λ, α) tuned to prevent overfitting

Preliminary best model:

- depth=5, learning_rate=0.05, n_estimators=400
- Accuracy range observed: ~0.85–0.87
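
The sweep above is a 3 × 3 × 3 grid, i.e. 27 configurations before the regularization parameters are varied. A minimal sketch of enumerating and scoring such a grid follows; `score_fn` is a hypothetical callback that would train a boosted model with the given parameters and return its validation accuracy.

```python
from itertools import product

PARAM_GRID = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 400, 600],
}


def iter_grid(grid):
    """Yield one parameter dict per combination in the grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, score_fn):
    """Return (best_params, best_score) according to score_fn."""
    best_params, best_score = None, float("-inf")
    for params in iter_grid(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice `score_fn` should use cross-validation or a held-out split so the selected configuration is not tuned to a single partition of the matches.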

### Planned Work for Remaining Week

- Complete full PCA+clustering pipeline for playstyle archetype discovery.
- Evaluate cluster quality (silhouette score, Davies–Bouldin index).
- Integrate archetypes into model as derived categorical features.
- Run full hyperparameter tuning for boosted models.
- Write final analysis sections and generate unified plots.
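
For the planned cluster-quality check, the silhouette score compares each point's mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), giving (b - a) / max(a, b). The sketch below is a plain-Python version using Euclidean distance; it assumes at least two clusters and at least two distinct points, and scores singleton clusters as 0 by convention. A library implementation would be preferable at our data size.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster convention
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Scores near 1 indicate compact, well-separated archetypes; scores near 0 or below would be a signal to fall back on the interpretability analysis instead.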


## Teaming Strategy

-   [ ] Individual team member contributions.

| Name              | Contribution   | Section(s) Authored / Tasks Completed |
|-------------------|-----------------|---------------------------------------|
|  Wyatt            | Model selection |                                       |
|  Segundo          | Report Writing  |                                       |
|  Ryan             | Data Analysis   |                                       |

## Mitigation Plan

### Key Milestones Remaining

- Complete PCA projections and clustering analysis (due in 3–4 days).
- Finalize feature-engineered dataset and train tuned boosted model.
- Generate visualizations for archetypes (scatterplots, SHAP values).
- Draft final report sections: Results, Discussion, Conclusion.

### Responsibilities

- Wyatt: Advanced modeling, hyperparameter tuning, preparing confusion matrices and SHAP interpretations.
- Ryan: Remaining EDA visualizations, PCA/APP embedding refinement, clustering experiments.
- Segundo: Final report editing, formatting, and assembling results into a cohesive narrative.

### Timeline

- Mid-week: Final model trained + dimensionality reduction finalized.
- 3 days before submission: Clustering and archetype interpretation completed.
- 2 days before submission: Full draft of report completed.
- 1 day before submission: Polish, proofreading, integration of figures.

### Risk Planning (“What if we fail?”)

If clustering proves unstable or insights are weak:

- We will pivot to a feature-importance-driven interpretability analysis using SHAP on the boosted model.
- This ensures we can still meet the goal of providing interpretable insights into playstyle behaviors, even without clear clusters.

If model performance unexpectedly degrades:

- We will revert to the best-performing Random Forest baseline while highlighting limitations and future work.

Either fallback path ensures a complete, rigorous final project.
