Collects and processes replays for use in training ML models

dvarkless/sc2_replay_converter

Starcraft II replay converter

Extracts data from websites and creates datasets for ML or analysis purposes.

Setup | Configuration | Usage | Table schemas

About the Project

This repository gathers and organizes datasets for machine-learning-based StarCraft II bots. The project has two aims: first, it provides a tool to collect replay data that can be used in supervised training methods; second, it creates datasets suitable for use with value functions in reinforcement learning algorithms.

Available functionality:

  • Collect replays from two websites
  • Preprocess data into a human-readable form
  • Transform data and load it into the DB.

Limitations to consider:

  • The only available game mode is 1v1.
  • Built for game versions 5.0.0 through 5.0.11.

Prerequisites

  • Python 3.9 or earlier (the latest sc2replay library supports Python versions up to 3.9).
  • Access to configured PostgreSQL database.
  • Packages listed in requirements.txt.
  • Optionally: jupyter notebook
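
Since the project targets Python 3.9 or earlier, it can help to check the interpreter before creating the virtual environment. The following is a minimal sketch; the helper name `is_supported_python` is illustrative and not part of the repository.

```python
import sys

MAX_SUPPORTED = (3, 9)  # upper bound taken from the prerequisites list


def is_supported_python(version=None):
    """Return True if the version is Python 3 and no newer than MAX_SUPPORTED.

    `version` is a (major, minor, ...) tuple; defaults to the running interpreter.
    """
    major, minor = (version or sys.version_info)[:2]
    return major == 3 and (major, minor) <= MAX_SUPPORTED


if __name__ == "__main__":
    if not is_supported_python():
        print(
            f"Python {sys.version_info.major}.{sys.version_info.minor} is newer "
            "than 3.9; consider creating the venv with an older interpreter."
        )
```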

Setup

  1. Create a new database in PostgreSQL (you can use this guide for Linux or this guide for Windows). Using psql:
create database sc2replays;
\c sc2replays
  2. Clone the repository:
git clone https://github.com/dvarkless/sc2_replay_converter.git
  3. Create a Python virtual environment:
cd sc2_replay_converter
python -m venv venv
  4. Activate the virtual environment.

If you are using Linux or macOS:

source ./venv/bin/activate

If you are using Windows:

./venv/Scripts/activate.ps1
  5. Install the packages:
pip install -r requirements.txt
  6. Download the submodule:
git submodule update --init --recursive

Configuration

Configuration files can be found in the ./configs directory.

Database access:

File ./configs/secrets.yml

db_host: localhost # Database host address
db_name: sc2replays # Database name
db_user: dvarkless # User that can interact with the database
db_password: password # Password for this user; set to `None` if the user has no password
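
How the project opens its database connection is not shown here. As a rough sketch, assuming the file has been loaded into a plain dict with the four keys above, the values could be combined into a libpq-style keyword/value connection string of the kind PostgreSQL drivers accept. The function name `build_dsn` and the way `None` is handled are assumptions made for illustration.

```python
def build_dsn(secrets):
    """Build a libpq keyword/value connection string from secrets.yml-style keys."""
    parts = [
        f"host={secrets['db_host']}",
        f"dbname={secrets['db_name']}",
        f"user={secrets['db_user']}",
    ]
    password = secrets.get("db_password")
    # Assumption: a real null, an empty string, and the literal string "None"
    # all mean "this user has no password".
    if password not in (None, "None", ""):
        parts.append(f"password={password}")
    # Values containing spaces or quotes would need libpq quoting; omitted here.
    return " ".join(parts)


example = {
    "db_host": "localhost",
    "db_name": "sc2replays",
    "db_user": "dvarkless",
    "db_password": "None",
}
print(build_dsn(example))  # host=localhost dbname=sc2replays user=dvarkless
```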

File ./configs/downloader_config.yml

The only setting you are likely to need to change here is the user agent:

headers:
  user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
  # Chrome on a Windows device

To add another site, add it to the config and write a corresponding method in the ReplayDownloader class (def name_yield: ...).
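
The real signature of such a method, and how ReplayDownloader calls it, are not shown here. Purely as an illustration of the generator idea, the sketch below yields replay links found in one listing page using only the standard library. The names `example_site_yield` and `_ReplayLinkParser` are hypothetical, and the sketch assumes the site links directly to `.SC2Replay` files.

```python
from html.parser import HTMLParser


class _ReplayLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at .SC2Replay files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".sc2replay"):
                self.links.append(value)


def example_site_yield(page_html):
    """Hypothetical generator: yield each replay link found in one HTML page."""
    parser = _ReplayLinkParser()
    parser.feed(page_html)
    yield from parser.links


sample = '<a href="/dl/game1.SC2Replay">Game 1</a> <a href="/about">About</a>'
print(list(example_site_yield(sample)))  # ['/dl/game1.SC2Replay']
```

A generator fits this job because the downloader can stop consuming links once its replay limit is reached, without fetching or parsing further pages.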

Usage

Example code is provided in download_and_process.ipynb.

  1. Collect replays:
from replay_downloader import ReplayDownloader

REPLAY_DIR = "../replays"
DOWNLOADER_CONFIG = "./configs/downloader_config.yml"

downloader = ReplayDownloader(REPLAY_DIR, DOWNLOADER_CONFIG, max_count=500, jupyter=True)
downloader.start_download("sc2rep")
# downloader.start_download("spawningtool")
  2. Preprocess files:
from replay_process import ReplayProcess, ReplayFilter
from datetime import datetime

REPLAY_DIR = "../replays"
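SECRETS = "./configs/secrets.yml"
DATABASE_CONFIG = "./configs/database.yml" # Assumed file name; point this at your database config in ./configs
GAME_INFO_FILE = "./starcraft2_replay_parse/game_info.csv"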

processor = ReplayProcess(
    SECRETS,
    DATABASE_CONFIG,
    GAME_INFO_FILE,
    jupyter=True
)

# Set up the filter
replay_filter = ReplayFilter()
replay_filter.is_1v1 = True # Select only 1v1 games
replay_filter.game_len = [1920, 38400] # Game length in ticks, from 2 to 40 minutes
replay_filter.time_played = datetime(2021, 1, 1) # Earliest allowed game date

# Process replays (this may take a while)
processor.process_replays(REPLAY_DIR, filt=replay_filter)
  3. Create dataset tables:
from itertools import product
from pipeline import PipelineComposer

MINS_PER_SAMPLE = 4 # Take an input sample every 4 minutes on average
PRED_STEP = 1 # Take the matching target sample 1 minute later
MIN_LEAGUE = 3 # Minimum league is Gold

r_pairs = product("ZTP", repeat=2) # ((Z, Z), (Z, T), ...)
matchups = ["v".join((r1, r2)) for r1, r2 in r_pairs] # ['ZvZ', 'ZvT', ...]
composer = PipelineComposer("ZvZ", tick_step=32)

# Build and run the composition pipeline for each matchup
for matchup in matchups:
    composer.change_matchup(matchup)
    comp_pipeline = composer.get_compositon(MINS_PER_SAMPLE, PRED_STEP, MIN_LEAGUE)
    comp_pipeline.run()
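
The filter bounds in step 2 are given in ticks: 1920 ticks for 2 minutes implies 16 ticks per game second. The sketch below makes that conversion explicit. The constant `TICKS_PER_SECOND` and both helper names are assumptions inferred from those numbers, and it is also an assumption that `tick_step` uses the same unit.

```python
TICKS_PER_SECOND = 16  # assumption inferred from 1920 ticks == 2 minutes


def minutes_to_ticks(minutes, ticks_per_second=TICKS_PER_SECOND):
    """Convert game minutes to ticks under the assumed tick rate."""
    return int(minutes * 60 * ticks_per_second)


def ticks_to_seconds(ticks, ticks_per_second=TICKS_PER_SECOND):
    """Convert ticks back to game seconds under the assumed tick rate."""
    return ticks / ticks_per_second


game_len = [minutes_to_ticks(2), minutes_to_ticks(40)]
print(game_len)              # [1920, 38400]
print(ticks_to_seconds(32))  # 2.0
```

Under these assumptions, `tick_step=32` would correspond to one step every 2 game seconds.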

Table schemas:

Table schemas can be found in ./queries/create_*.sql

Dataset tables are created dynamically.
PRIMARY KEY: (tick, game_id). FOREIGN KEY: game_id REFERENCES game_info.
Their structure:

*_comp tables:

[NOTE] These tables are used to train the agent to choose which unit to build next, based on army composition and scouting info.

player_unit: INTEGER,
...
player_building: INTEGER,
...
player_minerals_available: INTEGER, 
player_vespene_available: INTEGER, 
enemy_unit: INTEGER,
...
out_unit: NUMERIC(4, 3) # precision 0.001; player's units 1 minute after the current tick
...

*_winprob tables:

[NOTE] These tables are used to train agents to predict the game outcome based on the available information.

game_id: INTEGER,
tick: INTEGER,
player_unit: INTEGER,
...
player_building: INTEGER,
...
player_upgrade: INTEGER,
...
player_minerals_available: INTEGER, 
player_vespene_available: INTEGER, 
enemy_unit: INTEGER,
...
enemy_building: INTEGER,
...
out_winprob: NUMERIC(4, 3) # precision 0.001; probability that this game ends in 1 minute
                           # with 1 = player's win
                           # or 0 = player's defeat

*_enemycomp tables:

[NOTE] These tables are used to train agents to predict the enemy's composition based on scouted buildings.

game_id: INTEGER,
tick: INTEGER,
enemy_building: INTEGER,
...
out_unit: NUMERIC(4, 3) # precision 0.001; enemy units 1 minute from now
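
All three table types store their out_* columns as NUMERIC(4, 3). In PostgreSQL that means at most four significant digits with three after the decimal point, so values from -9.999 to 9.999 in steps of 0.001. The sketch below mimics that constraint in Python when preparing values for insertion. The helper name `to_numeric_4_3` is illustrative, and the rounding mode follows PostgreSQL's documented behaviour of rounding numeric ties away from zero.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_numeric_4_3(value):
    """Round to 3 decimal places and check the result fits NUMERIC(4, 3)."""
    rounded = Decimal(str(value)).quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)
    if abs(rounded) >= 10:
        # Four digits in total with three after the point leaves one before it.
        raise ValueError(f"{value} does not fit NUMERIC(4, 3)")
    return rounded


print(to_numeric_4_3(0.73456))  # 0.735
print(to_numeric_4_3(1))        # 1.000
```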

Matchups:

The first letter of a matchup is the player's race.
The last letter is the enemy's race.
For example, 'ZvT' means player = 'Zerg', enemy = 'Terran'.
This affects the table's unit, building, and upgrade columns. The column names can be found in ./starcraft2_replay_parse/data/game_info.csv.

[NOTE] Mirror matchups are counted twice, with the player and enemy swapping places.
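
The matchup naming convention can be sketched in a few lines. The helper names `parse_matchup` and `is_mirror` and the `RACES` mapping are illustrative and not part of the repository.

```python
from itertools import product

RACES = {"Z": "Zerg", "T": "Terran", "P": "Protoss"}


def parse_matchup(matchup):
    """Split e.g. 'ZvT' into (player_race, enemy_race) full names."""
    player, enemy = matchup.split("v")
    return RACES[player], RACES[enemy]


def is_mirror(matchup):
    """Return True when both sides play the same race."""
    player, enemy = matchup.split("v")
    return player == enemy


# Same construction as in the usage example: all nine ordered race pairs.
matchups = ["v".join(pair) for pair in product("ZTP", repeat=2)]
print(parse_matchup("ZvT"))                   # ('Zerg', 'Terran')
print([m for m in matchups if is_mirror(m)])  # ['ZvZ', 'TvT', 'PvP']
```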

License

Distributed under the MIT License. See LICENSE.txt for more information.