# Project Report & Presentation Submission Guide

## Required Submission Elements:

-   Each team member must upload all files to Canvas.
-   Your submission must include:
    -   Expanded project proposal sections (with integrated instructor
        comments).
    -   Exploratory Data Analysis (EDA) results and visualizations.
    -   Baseline model implementation and results analysis.
    -   Model improvement strategies, implementation, and results.
    -   Individual contributions, milestones, next-week plan, and
        mitigation steps.
    -   Finalized presentation slides and support files.

## 1. Title & Team Information

-   Project Title: Predicting Match Outcomes in the Game Dota 2

-   Team Member Names, NetIDs, and assigned roles
    -   Wyatt Churchman, JDR357, data gathering and model selection

## 2. Abstract

-   In this project, we want to predict the outcome of Dota 2 matches using real game data from the OpenDota JSON Data Dump. Each match contains a huge amount of numeric information, including player statistics, gold and XP graphs, combat logs, item usage, time-series data, and other gameplay metrics. Our plan is to convert these JSON objects into a large, high-dimensional numeric dataset (well over 10 million floats) and then build machine learning models on top of it. We will start with simple baseline models to see how well raw features predict whether the Radiant team wins. After that, we’ll use dimensionality-reduction and curse-of-dimensionality mitigation techniques (such as PCA, Random Projection, and APP) to see if they improve performance or reveal hidden structure in the matches. The goal is not just to predict the winner, but also to learn something interesting about playstyles, match patterns, and how high-dimensional features interact.

-   To differentiate our work from prior OpenDota-based predictors, we will focus on discovering and analyzing emergent ‘playstyle archetypes’ at both the team and player level (e.g., farm-heavy vs. fight-heavy lineups, objective-centric vs. pickoff-centric teams) in the learned low-dimensional spaces, rather than optimizing only for win-prediction accuracy.

-   We all like games, and this project let us pull match information ourselves and build a model that predicts match outcomes. The concept appealed to us because a lot of games have software that takes statistics and gives percentage estimates of game outcomes. We also wanted to learn how to pull data from an API ourselves rather than downloading a zip file.

## 3. Problem Statement

-   In competitive online games like Dota 2, it’s extremely important to match players of similar skill levels. When a low-skill player gets paired against someone who is much more experienced, the outcome is almost always one-sided, which leads to frustration and a bad gameplay experience. At the same time, high-skill players don’t get much enjoyment from a match that offers no challenge. Over time, poor matchmaking can reduce player engagement and hurt the game’s long-term health and revenue.

-   Being able to predict match outcomes based on pre-match and early-match features could help improve matchmaking systems by identifying patterns that separate balanced matches from mismatched ones. Understanding these patterns also helps explain what actually influences a fair, competitive game.

-   Our original question was: can we use Dota 2 match data with ML algorithms to predict match outcomes based on learned player and team playstyles?
-   After feedback from the professor that win predictors are a very common machine learning project in this space, we knew we had to find an alternative angle to differentiate our model. That is when we devised the 'playstyle archetypes' approach: each player and team plays the game differently, optimizing their strategy for different goals during the match.
-   Our original plan to use the OpenDota JSON Data Dump fell through for two reasons. The dataset turned out to be too large and unwieldy for us, at over half a terabyte in total, and the torrent linked on the website was broken, so we could not download it anyway. We pivoted to pulling our own data through the OpenDota API, which gave us a dataset that fits the project requirements and provides enough numeric values to train our models.

## 4. Dataset Exploration

-   We successfully collected 15,000 professional Dota 2 matches using the OpenDota API. The dataset contains 65 features spanning match metadata, team statistics, objectives, advantages, events, and vision control metrics, for 15,000 × 65 = 975,000 individual values.

### Dataset Characteristics

-   Total matches: 15,000
-   Total features: 65
-   Shape: (15000, 65)
-   Match type: Professional competitive matches
-   Time period: Recent professional matches (2024-2025)
-   Collection method: OpenDota API (`/proMatches` endpoint)


After discovering that the original OpenDota JSON Data Dump was over 500GB with a broken torrent link, we pivoted to collecting data directly through the OpenDota API. This approach gave us more control over data quality and allowed us to focus specifically on professional matches where gameplay patterns are more consistent and strategic.

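As a rough illustration of the API-based collection described above, the sketch below pages through the `/proMatches` endpoint using only the Python standard library. The base URL, the `less_than_match_id` paging parameter, the one-second pause between requests, and all function names are assumptions made for illustration; this is not a record of the script we actually ran. It only gathers the raw match records the endpoint returns, and assembling the 65 features is outside its scope.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, List, Optional

# Assumed base URL of the public OpenDota API.
PRO_MATCHES_URL = "https://api.opendota.com/api/proMatches"


def fetch_page(less_than_match_id: Optional[int] = None) -> List[Dict]:
    """Fetch one page of professional match records.

    Assumption: passing `less_than_match_id` returns matches older than
    the given match ID, which lets us walk backwards through history.
    """
    url = PRO_MATCHES_URL
    if less_than_match_id is not None:
        url += f"?less_than_match_id={less_than_match_id}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def collect_pro_matches(
    target: int,
    fetch: Callable[[Optional[int]], List[Dict]] = fetch_page,
    pause_seconds: float = 1.0,
) -> List[Dict]:
    """Collect up to `target` match records by paging backwards.

    `fetch` is injectable so the paging logic can be exercised offline.
    """
    matches: List[Dict] = []
    cursor: Optional[int] = None
    while len(matches) < target:
        page = fetch(cursor)
        if not page:
            break  # no older matches available
        matches.extend(page)
        # The next request asks for matches older than the oldest one seen.
        cursor = min(record["match_id"] for record in page)
        time.sleep(pause_seconds)  # hypothetical pause to stay polite to the API
    return matches[:target]
```

Under these assumptions, `collect_pro_matches(15_000)` would return a list of 15,000 match records, which could then be written to disk before feature extraction.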
### Feature Categories

Our dataset includes features organized into the following categories:

-   Match Metadata (9 features): duration, region, patch version, game mode, start time, match sequence number, first blood timing
-   Team Statistics (36 features): kills, deaths, assists, last hits, denies, gold per minute, XP per minute, gold spent, hero damage, hero healing, tower damage, average level, wards placed, Roshan kills, tower kills, barracks status, all split between Radiant and Dire teams
-   Objective Tracking (8 features): Aegis pickups, Aegis steals, courier losses, denied Aegis, first blood, miniboss kills, Roshan kills, building kills
-   In-game Events (4 features): buyback count, kill count, purchase count, rune pickups
-   Vision Control (2 features): observer wards placed, sentry wards placed
-   Advantage Metrics (2 features): final gold advantage, final XP advantage (positive values indicate Radiant advantage)
-   Target Variable (1 feature): `radiant_win` (binary True/False indicating match outcome)
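
To make the sign convention of the advantage metrics and the encoding of the target concrete, here is a minimal sketch. The column name `final_gold_adv` is hypothetical (only `radiant_win` is named in this report), and the four toy rows are made-up values for illustration, not rows from our dataset.

```python
from typing import Dict, List


def encode_target(radiant_win: bool) -> int:
    """Map the boolean target to 1 (Radiant win) or 0 (Dire win)."""
    return 1 if radiant_win else 0


def advantage_agreement(rows: List[Dict], adv_key: str = "final_gold_adv") -> float:
    """Fraction of matches where the side with the final lead also won.

    A positive advantage means Radiant is ahead, so it maps to a
    predicted Radiant win; zero or negative maps to a Dire win.
    """
    hits = 0
    for row in rows:
        predicted = 1 if row[adv_key] > 0 else 0
        hits += int(predicted == encode_target(row["radiant_win"]))
    return hits / len(rows)


# Made-up rows; `final_gold_adv` is a hypothetical column name.
toy_rows = [
    {"final_gold_adv": 12500.0, "radiant_win": True},
    {"final_gold_adv": -8300.0, "radiant_win": False},
    {"final_gold_adv": 2100.0, "radiant_win": False},  # Radiant led but lost
    {"final_gold_adv": -15400.0, "radiant_win": False},
]

print(advantage_agreement(toy_rows))
```

A check like this is mainly useful as a sanity test that the sign of the advantage columns matches the stated convention.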

## 5. Methodology

### Baseline Approach

-   [ ] Logistic Regression (tests which features have a direct, interpretable influence on match outcome)
-   [ ] Random Forest Classifier (captures interactions between features)

-   Results: quantitative metrics and brief interpretation
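
The logistic regression baseline listed above can be sketched without any dependencies as follows. This is a minimal illustration trained with plain batch gradient descent, not our actual implementation; in practice a library implementation such as scikit-learn's `LogisticRegression` would likely be used. The two toy features (standardized kill and gold-per-minute differences), the learning rate, and the epoch count are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(
    X: Sequence[Sequence[float]], y: Sequence[int], lr: float = 0.1, epochs: int = 500
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = sum(w * v for w, v in zip(weights, xi)) + bias
            error = sigmoid(z) - yi  # derivative of the log loss w.r.t. z
            for j, v in enumerate(xi):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(X: Sequence[Sequence[float]], weights: Sequence[float], bias: float) -> List[int]:
    """Predict 1 (Radiant win) when the estimated probability is at least 0.5."""
    return [
        1 if sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias) >= 0.5 else 0
        for xi in X
    ]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up, standardized (kill difference, GPM difference) pairs.
X = [[1.2, 0.8], [0.5, 1.0], [0.9, 0.3], [-1.2, -0.8], [-0.5, -1.0], [-0.9, -0.3]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = fit_logistic(X, y)
print("weights:", weights, "bias:", bias)
print("training accuracy:", accuracy(y, predict(X, weights, bias)))
```

The sign and size of each fitted weight is what makes this baseline interpretable: a positive weight means larger values of that feature push the prediction toward a Radiant win.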

### Improved Methods

-   Feature engineering, selection, high-dimensionality handling
-   New or tuned models, implementation details with rationale
-   Comparative analysis of baseline vs. improved methods
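
One of the high-dimensionality techniques named in the abstract is Random Projection. The sketch below shows a Gaussian variant in plain Python; the choice of the Gaussian variant, the target dimension of 8, and the function names are assumptions for illustration (scikit-learn's `GaussianRandomProjection` is a library alternative). The input rows are randomly generated, not real match data.

```python
import math
import random
from typing import List, Sequence


def random_projection_matrix(n_features: int, n_components: int, seed: int = 0) -> List[List[float]]:
    """Build an n_features x n_components matrix with N(0, 1/n_components) entries.

    Scaling by 1/sqrt(n_components) keeps squared vector lengths
    unchanged in expectation after projection.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
        for _ in range(n_features)
    ]


def project(rows: Sequence[Sequence[float]], matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Multiply each row vector by the projection matrix."""
    n_components = len(matrix[0])
    return [
        [sum(row[i] * matrix[i][j] for i in range(len(row))) for j in range(n_components)]
        for row in rows
    ]


# Randomly generated stand-in for a few 65-feature match rows.
data_rng = random.Random(1)
toy_rows = [[data_rng.gauss(0.0, 1.0) for _ in range(65)] for _ in range(3)]

matrix = random_projection_matrix(n_features=65, n_components=8, seed=42)
reduced = project(toy_rows, matrix)
print(len(reduced), "rows x", len(reduced[0]), "components")
```

Because the matrix is drawn once from a fixed seed, the same projection can be reused on training and test data so that both live in the same reduced space.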

## 6. Experimental Results and Comparative Analysis

-   Tabular and visual summaries of model performance
-   Comparative charts showing before/after improvements

## 7. Team Contributions

| Name    | Contribution    | Section(s) Authored / Tasks Completed |
|---------|-----------------|---------------------------------------|
| Wyatt   | Model selection |                                       |
| Segundo | Report writing  |                                       |
| Ryan    | Data analysis   |                                       |

## 8. Next Steps & Mitigation Plan

-   Planned remaining tasks for final week, responsibilities, and
    checkpoints
-   Plan if data/tasks/models require change or team dynamics shift

## 9. References & Links

-   Cite all data sources, packages, and relevant literature

## 10. Submission Checklist

-   [ ] Expanded project proposal with feedback integrated
-   [ ] EDA with visual and statistical findings
-   [ ] Baseline and improved methods/results
-   [ ] Team contributions table
-   [ ] Future/mitigation plans
-   [ ] Slides, code (.ipynb or .md/.py), and pdf ready for upload