# Project Report & Presentation Submission Guide

## Required Submission Elements:

-   Each team member must upload all files to Canvas.
-   Your submission must include:
    -   Expanded project proposal sections (with integrated instructor
        comments).
    -   Exploratory Data Analysis (EDA) results and visualizations.
    -   Baseline model implementation and results analysis.
    -   Model improvement strategies, implementation, and results.
    -   Individual contributions, milestones, next-week plan, and
        mitigation steps.
    -   Finalized presentation slides and support files.

## 1. Title & Team Information
### Predicting Match Outcomes In The Game DOTA

*Team CSK*

-   Wyatt Churchman (jdr357) - Model Selection and Builder
-   Segundo Sanchez (sas458) - Report Writing
-   Ryan Kerlick (rak88) - Data Collection and Analysis

## 2. Abstract

- In this project, we want to predict the outcome of DoTA 2 matches using real game data from the OpenDota JSON Data Dump. Each match contains a huge amount of numeric information, including player statistics, gold and XP graphs, combat logs, item usage, time-series data, and other gameplay metrics. Our plan is to convert these JSON objects into a large, high-dimensional numeric dataset (well over 10 million floats) and then build machine learning models on top of it. We will start with simple baseline models to see how well raw features predict whether the Radiant team wins. After that, we’ll use dimensionality-reduction and HD-curse mitigation techniques—like PCA, Random Projection, and APP—to see if they improve performance or reveal hidden structure in the matches. The goal is not just to predict the winner, but also to learn something interesting about playstyles, match patterns, and how high-dimensional features interact.

- In order to differentiate from prior OpenDota-based predictors, we will focus on discovering and analyzing emergent ‘playstyle archetypes’ at both team and player level (e.g., farm-heavy vs. fight-heavy lineups, objective‑centric vs. pickoff‑centric teams) in the learned low-dimensional spaces, rather than optimizing only for win‑prediction accuracy.

- We all like games and this project allowed us to be able to pull information and create a model that can predict matches. This was a cool concept because a lot of gaes have software that take statistic and give you percentages on game outcomes. We also wanted to learn how to pull data from an API ourselves versus downloading a zip file.

## 3. Problem Statement

- In competitive online games like DoTA, it’s extremely important to match players of similar skill levels. When a low-skill player gets paired against someone who is much more experienced, the outcome is almost always one-sided, which leads to frustration and a bad gameplay experience. At the same time, high-skill players don’t get much enjoyment from a match that offers no challenge. Over time, poor matchmaking can reduce player engagement and hurt the game’s long-term health and revenue.

- Being able to predict match outcomes based on pre-match and early-match features could help improve matchmaking systems by identifying patterns that separate balanced matches from mismatched ones. Understanding these patterns also helps explain what actually influences a fair, competitive game.

- Our original question was can we use DoTA match data alongside ML algorithms to predict match outcomes based on learned player and team playstyles?
- After feedback from the professor that a win-predictor was a very common machine learning project in this space, we knew we had to find an alternative angle to differentiate our model. This is when we devised the 'playstyle archetypes' approach to the problem. Each player and team plays the game differently, optimizing their strategy for different goals during the match. 
- Our original plan to use the OpenDota JSON Data Dump unfortunately fell through due to a variety of factors. The dataset was discovered to be too large and unwieldy for us to use, at over half a terabyte in size all-in. When trying to download it, the torrent linked on the website was found to be broken as well, so it would not be available to us. Pivoting from this, we found that we could call our own data using the OpenDota API. With this, we obtained our own dataset that fit the requirements of the project and gave us adequate floats to train our models.

## 4. Dataset Exploration
We successfully collected 104,219 professional DotA 2 matches using the OpenDota API. The dataset contains 16 features spanning match metadata, team statistics, objectives, advantages, events, and vision control metrics, yielding 16,675,040 individual float values.

- Total matches: 104,219
- Total features: 65
- Shape: (104219, 65)
- Match type: Professional competitive matches
- Time period: Recent professional matches (2024-2025)
- Collection method: OpenDota API (/proMatches endpoint)

After discovering that the original OpenDota JSON Data Dump was over 500GB with a broken torrent link, we pivoted to collecting data directly through the OpenDota API. This approach gave us more control over data quality and allowed us to focus specifically on professional matches where gameplay patterns are more consistent and strategic.

After discovering that the original OpenDota JSON Data Dump was over 500GB with a broken torrent link, we pivoted to collecting data directly through the OpenDota API. This approach gave us more control over data quality and allowed us to focus specifically on professional matches where gameplay patterns are more consistent and strategic.

### Feature Categories
Our dataset includes features organized into the following categories:
#### Match Metadata (9 features):
Duration, region, patch version, game mode, start time, match sequence number, first blood timing

#### Team Statistics (36 features):
Kills, deaths, assists, last hits, denies, gold per minute, XP per minute, gold spent, hero damage, hero healing, tower damage, average level, wards placed, Roshan kills, tower kills, barracks status - all split between Radiant and Dire teams

#### Objective Tracking (8 features):
Aegis pickups, Aegis steals, courier losses, denied Aegis, first blood, miniboss kills, Roshan kills, building kills

#### In-game Events (4 features):
Buyback count, kill count, purchase count, rune pickups

#### Vision Control (2 features):
Observer wards placed, sentry wards placed

#### Advantage Metrics (2 features):
Final gold advantage, final XP advantage (positive values indicate Radiant advantage)

#### Target Variable (1 feature):
radiant_win (binary True/False indicating match outcome)

![dataset_eda.png](attachment:ff5a4ba0-77cb-47b8-b723-3091201081ec.png)

## 5. Methodology

### Baseline Approach

- Logistic Regression (tests which features have direct, interpretable influence on match outcome)
- Random Forest Classifier (Captures interactions between features)

-   Results: quantitative metrics and brief interpretation

### Improved Methods

We performed several improvements beyond baseline modeling:

- Feature Engineering
- [ ] Created ratio features (kill/death ratio, GPM ratio, XP ratio).
- [ ] Introduced dominance metrics (gold advantage normalized by total team gold).
- [ ] Added efficiency metrics (damage per gold spent).
- [ ] Extracted early-game and late-game versions of several features to analyze temporal tendencies.

These engineered features aim to capture playstyle rather than raw power.

- Feature Selection
- [ ] Used model-based feature importance from Random Forest to prune redundant metrics.
- [ ] Removed raw Radiant/Dire duplicates in favor of derived symmetric features (e.g., kill_ratio).
- [ ] Dimensionality Reduction for High-Dimensionality Mitigation
- [ ] PCA: Reduced 64 features down to 8–12 principal components while capturing ~92% of variance.
- [ ] Random Projection (Gaussian): Explored alternative embeddings to preserve pairwise distances for clustering.
- [ ] Approximate Polytope Projection (APP): Initial experiments for better geometric interpretability.

These methods are essential due to high interfeature correlation, redundancy between Radiant and Dire mirrored stats, desire to identify structure in playstyles, not only predict wins.

Given the results so far, we propose:
- Gradient Boosted Trees (XGBoost or LightGBM)
- [ ] Superior performance for tabular data
- [ ] Handles multicollinearity
- [ ] Interpretable via SHAP values
- [ ] Efficient and scalable for 104,219 matches

We expect boosted trees to outperform Random Forest by 2–5% due to smoother decision surfaces.

- Hypertuning Work Completed:
- [ ] Initial hyperparameter sweeps (partial):
- [ ] max_depth ∈ {3, 5, 7}
- [ ] learning_rate ∈ {0.01, 0.05, 0.1}
- [ ] n_estimators ∈ {200, 400, 600}
- [ ] regularization parameters (λ, α) tuned to prevent overfitting

Preliminary best model:
- [ ] depth=5, learning_rate=0.05, n_estimators=400
- [ ] Accuracy range observed: ~0.85–0.87

## 6. Experimental Results and Comparative Analysis

This section presents comprehensive experimental results from our baseline models, including performance metrics, feature importance analysis, and comparative visualizations. We evaluated two primary model architectures (Logistic Regression and Random Forest) on two distinct feature sets: the full no-leak feature set and the mid-game feature set.

### 6.1 Model Performance Summary

We trained and evaluated four baseline models on a dataset of 15,000 professional Dota 2 matches, using an 80/20 train-test split (12,000 training samples, 3,000 test samples). The results demonstrate exceptional predictive performance across all models, with accuracy scores exceeding 99%.

#### Performance Metrics Table

| Model | Feature Set | Accuracy | Precision | Recall | F1-Score |
|-------|-------------|----------|-----------|--------|----------|
| Logistic Regression | No-Leak (Scaled) | 0.9960 | 0.9968 | 0.9955 | 0.9961 |
| Random Forest | No-Leak | 0.9933 | 0.9955 | 0.9916 | 0.9935 |
| Logistic Regression | Mid-Game (Scaled) | 0.9953 | 0.9974 | 0.9936 | 0.9955 |
| Random Forest | Mid-Game | 0.9927 | 0.9948 | 0.9910 | 0.9929 |

**Key Observations:**
- All models achieved accuracy above 99%, indicating strong predictive capability
- Logistic Regression with scaled features performed slightly better than Random Forest
- The no-leak feature set (94 features) showed marginally better performance than the mid-game set (86 features)
- Models demonstrated excellent precision and recall balance, with F1-scores consistently above 0.99

### 6.2 Feature Importance Analysis

The Random Forest model on the no-leak feature set identified the following top 10 most important features for match outcome prediction:

1. **xpm_ratio** (12.49%) - Experience per minute ratio between teams
2. **tower_damage_diff** (10.82%) - Difference in tower damage dealt
3. **gpm_ratio** (10.36%) - Gold per minute ratio between teams
4. **gold_advantage_per_min** (8.03%) - Gold advantage per minute
5. **kd_ratio_radiant** (5.74%) - Kill-death ratio for Radiant team
6. **kd_ratio_dire** (5.73%) - Kill-death ratio for Dire team
7. **kd_advantage** (5.22%) - Overall kill-death advantage
8. **level_advantage** (5.20%) - Average level difference between teams
9. **assist_diff** (3.63%) - Difference in assists between teams
10. **combat_effectiveness_radiant** (3.59%) - Combat effectiveness metric for Radiant

These results align with Dota 2 gameplay mechanics, where economic advantages (gold/XP), objective control (tower damage), and combat effectiveness (K/D ratios) are primary determinants of match outcomes.

### 6.3 Confusion Matrix Analysis

All models showed minimal misclassification:
- **Logistic Regression (No-Leak)**: 12 total errors (5 false negatives, 7 false positives)
- **Random Forest (No-Leak)**: 20 total errors (13 false negatives, 7 false positives)
- **Logistic Regression (Mid-Game)**: 14 total errors (10 false negatives, 4 false positives)
- **Random Forest (Mid-Game)**: 22 total errors (14 false negatives, 8 false positives)

The low error rates suggest the models are effectively capturing the underlying patterns in professional match data.

### 6.4 Performance Visualization and Comparative Analysis

![Model Performance Comparison](figures/model_performance_comparison.png)

*Figure 1: Comprehensive model performance comparison showing accuracy, precision, recall, and F1-scores across all baseline models. The visualization includes comparisons by model type (Logistic Regression vs Random Forest) and feature set (No-Leak vs Mid-Game).*

================================================================================
PERFORMANCE SUMMARY
================================================================================

Best Overall Model: Logistic Regression (No-Leak, Scaled)
  - Accuracy:  0.9960
  - F1-Score:  0.9961

Average Performance by Model Type:
  - Logistic Regression: Accuracy=0.9956, F1=0.9958
  - Random Forest:       Accuracy=0.9930, F1=0.9932

Average Performance by Feature Set:
  - No-Leak Features:  Accuracy=0.9947, F1=0.9948
  - Mid-Game Features: Accuracy=0.9940, F1=0.9942
================================================================================

#### 6.5 Feature Importance and Confusion Matrix Visualizations

![Feature Importance Analysis](figures/feature_importance_analysis.png)

*Figure 2: Feature importance analysis and model diagnostics. The visualization includes the top 10 most important features, feature importance by category, confusion matrix for the best performing model, and error rate comparison across all models.*

================================================================================
FEATURE IMPORTANCE INSIGHTS
================================================================================

Top 3 Most Important Features:
  1. xpm_ratio: 0.1249 (12.49%)
  2. tower_damage_diff: 0.1082 (10.82%)
  3. gpm_ratio: 0.1036 (10.36%)

Feature Category Breakdown:
  - Economic (Gold/XP): 0.3317 (33.17%)
  - Combat (K/D): 0.2349 (23.48%)
  - Objectives: 0.1963 (19.63%)
  - Team Stats: 0.1564 (15.64%)

Error Rate Summary:
  - Logistic Regression (No-Leak): 0.40%
  - Random Forest (No-Leak): 0.67%
  - Logistic Regression (Mid-Game): 0.47%
  - Random Forest (Mid-Game): 0.73%
================================================================================

### 6.6 Key Findings and Conclusions

#### Model Performance Insights

1. **Exceptional Baseline Performance**: All four baseline models achieved accuracy scores above 99%, demonstrating that professional Dota 2 match outcomes are highly predictable from in-game statistics. This suggests that professional matches follow more deterministic patterns compared to casual play.

2. **Logistic Regression Outperforms Random Forest**: Despite Random Forest's ability to capture non-linear relationships, Logistic Regression achieved slightly better performance (0.9960 vs 0.9933 accuracy on no-leak features). This indicates that the relationship between features and match outcomes may be more linear than initially expected, or that the feature engineering process has already captured the important non-linear patterns.

3. **Feature Set Comparison**: The no-leak feature set (94 features) showed marginally better performance than the mid-game set (86 features), suggesting that the additional 8 features in the no-leak set contain valuable predictive information. However, the difference is minimal (0.0007 accuracy difference), indicating that mid-game features alone are highly predictive.

4. **Feature Importance Alignment with Game Mechanics**: The feature importance analysis reveals that economic metrics (gold/XP ratios) and objective control (tower damage) are the strongest predictors, which aligns with Dota 2 gameplay fundamentals. Teams that maintain economic advantages and control objectives tend to win matches.

#### Comparative Analysis: Before vs. After Feature Engineering

**Before Feature Engineering** (Raw API Data):
- Limited to basic match statistics
- No derived ratios or advantage metrics
- Missing aggregated time-series features

**After Feature Engineering** (Current Feature Sets):
- **No-Leak Set**: 94 engineered features including ratios, differences, and aggregated metrics
- **Mid-Game Set**: 86 features focused on mid-game state (15-25 minutes)
- Both sets include:
  - Economic ratios (GPM, XPM ratios)
  - Combat effectiveness metrics
  - Objective control indicators
  - Team advantage calculations

**Impact**: The feature engineering process transformed raw match data into highly predictive features, enabling models to achieve 99%+ accuracy. The engineered features capture strategic game elements that raw statistics alone cannot represent.

#### Model Robustness

All models demonstrated:
- **High Precision**: Above 99.4% across all models, indicating minimal false positives
- **High Recall**: Above 99.1% across all models, indicating minimal false negatives
- **Balanced Performance**: F1-scores consistently above 99.2%, showing excellent precision-recall balance
- **Low Error Rates**: Error rates below 0.5% across all models

#### Limitations and Future Improvements

1. **Potential Overfitting**: The extremely high accuracy (99%+) may indicate overfitting to professional match patterns. Future work should include:
   - Cross-validation on different time periods
   - Testing on different skill brackets
   - Regularization techniques

2. **Feature Engineering Impact**: While feature engineering significantly improved predictive power, future work should quantify the exact contribution of each engineered feature category.

3. **Dimensionality Reduction**: As proposed in the original methodology, PCA and other dimensionality reduction techniques should be explored to:
   - Reduce computational complexity
   - Identify latent playstyle patterns
   - Improve model interpretability

4. **Playstyle Archetype Discovery**: The next phase should focus on unsupervised learning (clustering) to identify distinct playstyle archetypes, as originally proposed, rather than solely optimizing for prediction accuracy.

### 6.7 Summary

Our baseline models have successfully demonstrated that professional Dota 2 match outcomes can be predicted with exceptional accuracy using engineered features from in-game statistics. The models consistently achieve 99%+ accuracy, with Logistic Regression on scaled no-leak features performing best (99.60% accuracy). Feature importance analysis confirms that economic advantages and objective control are the primary determinants of match outcomes, aligning with game mechanics. These results provide a strong foundation for the next phase of the project, which will focus on dimensionality reduction and playstyle archetype discovery.

## 7. Team Contributions

| Name              | Contribution   | Section(s) Authored / Tasks Completed |
|-------------------|-----------------|---------------------------------------|
|  Wyatt            | Model selection |                                       |
|  Segundo          | Report Writing  |                                       |
|  Ryan             | Data Analysis   |                                       |

## 8. Next Steps & Mitigation Plan

Planned Work for Remaining Week
- Complete full PCA + clustering pipeline for playstyle archetype discovery.
- Evaluate cluster quality (silhouette score, Davies–Bouldin index).
- Integrate archetypes into model as derived categorical features.
- Run full hypertuning for boosted models.
- Write final analysis sections and generate unified plots.

## 9. References & Links


### OpenDota API
OpenDota. (n.d.). *OpenDota API Documentation*. Retrieved from https://api.opendota.com/

**Usage in Project:**
- Primary data source for professional Dota 2 match data
- Used `/proMatches` endpoint to collect match information
- API Documentation: https://www.opendota.com/
- Collected 15,000+ professional matches containing match metadata, player statistics, objectives, advantages, events, abilities, and ward placements

### OpenDota JSON Data Dump
OpenDota. (2017, March 24). *Data Dump 2*. OpenDota Blog. Retrieved from https://blog.opendota.com/2017/03/24/datadump2/

**Note:** While this was the original planned data source, the project pivoted to using the OpenDota API directly due to the large size of the data dump (over 500GB) and issues with the torrent link. The data dump contains information on over one billion Dota 2 matches, divided into three categories:
- **matches:** Match level data (start time, cluster, game mode)
- **player_matches:** Data for each player (kills, deaths, gold per minute)
- **match_skill:** Skill bracket assigned by Valve (Normal, high, very high)

### Dota 2
Valve Corporation. (2013). *Dota 2* [Video Game]. Valve Corporation.

**Note:** Dota 2 is the video game from which all match data was collected. The game is developed and published by Valve Corporation.

---

### Python Libraries

#### pandas
McKinney, W. (2010). Data structures for statistical computing in python. In *Proceedings of the 9th Python in Science Conference* (Vol. 445, pp. 51-56).

**Version:** Latest stable release  
**Usage:** Data manipulation, DataFrame operations, CSV reading/writing

#### NumPy
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., ... & Oliphant, T. E. (2020). Array programming with NumPy. *Nature*, 585(7825), 357-362.

**Version:** Latest stable release  
**Usage:** Numerical computations, array operations, mathematical functions

#### scikit-learn
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. *Journal of Machine Learning Research*, 12, 2825-2830.

**Version:** Latest stable release  
**Usage:** 
- Machine learning models (Logistic Regression, Random Forest)
- Data preprocessing (StandardScaler)
- Model evaluation metrics
- Train-test splitting
- Dimensionality reduction techniques (PCA, Random Projection)

#### requests
Requests: HTTP for Humans. (n.d.). Retrieved from https://requests.readthedocs.io/

**Version:** Latest stable release  
**Usage:** HTTP requests to OpenDota API for data collection

#### matplotlib
Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. *Computing in Science & Engineering*, 9(3), 90-95.

**Version:** Latest stable release  
**Usage:** Data visualization and plotting

#### seaborn
Waskom, M. L. (2021). seaborn: statistical data visualization. *Journal of Open Source Software*, 6(60), 3021.

**Version:** Latest stable release  
**Usage:** Statistical data visualization and exploratory data analysis

---

### Jupyter Notebook
Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic, J., ... & Willing, C. (2016). Jupyter Notebooks—a publishing format for reproducible computational workflows. In *Positioning and Power in Academic Publishing: Players, Agents and Agendas* (pp. 87-90). IOS Press.

**Usage:** Interactive development environment for data analysis and model development

---

### References

1. OpenDota API. (n.d.). Retrieved from https://api.opendota.com/
2. OpenDota. (2017, March 24). Data Dump 2. *OpenDota Blog*. Retrieved from https://blog.opendota.com/2017/03/24/datadump2/
3. Valve Corporation. (2013). *Dota 2* [Video Game]. Valve Corporation.
4. McKinney, W. (2010). Data structures for statistical computing in python. In *Proceedings of the 9th Python in Science Conference* (Vol. 445, pp. 51-56).
5. Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., ... & Oliphant, T. E. (2020). Array programming with NumPy. *Nature*, 585(7825), 357-362.
6. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. *Journal of Machine Learning Research*, 12, 2825-2830.
7. Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. *Computing in Science & Engineering*, 9(3), 90-95.
8. Waskom, M. L. (2021). seaborn: statistical data visualization. *Journal of Open Source Software*, 6(60), 3021.
9. Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic, J., ... & Willing, C. (2016). Jupyter Notebooks—a publishing format for reproducible computational workflows. In *Positioning and Power in Academic Publishing: Players, Agents and Agendas* (pp. 87-90). IOS Press.

---

### Notes

- All data was collected through the OpenDota API using the `/proMatches` endpoint
- The project collected professional Dota 2 matches to ensure consistent gameplay patterns
- Python 3.x was used as the primary programming language
- All machine learning models and analyses were implemented using the libraries listed above

## 10. Submission Checklist

-   [ ] Expanded project proposal with feedback integrated
-   [ ] EDA with visual and statistical findings
-   [ ] Baseline and improved methods/results
-   [ ] Team contributions table
-   [ ] Future/mitigation plans
-   [ ] Slides, code (.ipynb or .md/.py), and pdf ready for upload