# League of Legends Performance Predictor
Zachary Schwartz
4/28

# Background
League of Legends is a 5v5 battle, where both teams are attempting to destroy the opponents base. Each player selects one of 168 champions to play, and a specific role (also known as position) to play in the game. Most champions are only playable in one or two roles, and similarly most people that play the game are only good at one or two roles. People tend to play the roles and champions they are best at, which will make up most of the data used in this project. In the game, there are objectives that players will fight over, and destroying them is beneficial to them team. Getting more objectives than the opposing team is generally a sign of a winning team. As well, when I refer to statistics, I am referring to a list of performance metrics that can give a sense of how well a player performed, which I will get more into later.  
  
There is a ranked ladder that is made up of different divisions. Since I would like to use the model myself, I gathered data that was similarish to my skill level. I gathered games from these ranks, in order from highest to lowest skill: Challenger, Grandmaster, Masters, Diamond, and Emerald. Challenger is the top 300 players in a region, and grandmaster is the top 1000. Masters has roughly 4000 people in it (Best 0.5% of players). Challenger, Grandmaster, Masters will sometimes be referred to as Masters+. Diamond and Emerald are both split into 4 divisions from most to least skilled: I, II, III, and IV.  
  
Personally, I am ranked usually on the lower end of masters, and I played competitevly for IWU all of my years here. This year we ranked as a top 64 team in the country, and I wanted to do a project that could somehow augment my play, or just generally be based on the game that I love  
  
Note: The data gathering aspect of the code will not work since the developer api key is needed to run it, which resets every 24 hours.

# Introduction
The goal of this project is to use deep learning to predict what players are going to do in a league of legends game. More specifically, each player can be represented by multiple statistics that can be accessed after a game. These stats can be seen in the **player_columns** list below, but involve damage dealt, damage taken, assists, etc. The model is End to End, meaning that it involves one network that processes all of the data. There are 1760 inputs into the model, and 256 outputs. The inputs are made up from the players in a game, and the champions they are playing. Each player or champion is represented by 24 statistics, each statistic being represented by 4 values, the 25th, 50th, 75 percentile, and the standard deviation. 4 values per statistic, multiplied by 22 statistics, multiplied by 10 players and 10 champions, giving us 1760 inputs. The outputs are represnted by 24 statistics per player, however there is only 1 value per statistic, and only a value given for the 10 players in the game. That leaves us with 240 values, and then an extra 13 statistics (1 for each team) to represent the objectives present in the game, leaving us off with 266 predictions. These are towers and other neutral objectives that provide value for a team when destroyed, and the model will predict how many of each are destroyed by which team, the length of the game, and ultimately who ends up winning as well.  



If this model turns out to be effective, it could be a very useful tool for competitive games. The model could be run with your opponents stats, and could show your team what champions they are strong with, which would provide a massive advantage in the drafting phase when champions are picked and banned. Conversely, it could also give your own team a better sense of what your best picks are, and what synergies work and don't work.  
  
However, my hypothesis going into this project is that it would be unable to make effective predictions. There are many limitations to what could affect the overall accuracy of the model. Are people playing worse than usual? Can player impact not be accurately detected by their statistics? Is the data I've gathered now outdated due to recent balance changes? Is there not enough data? Does the IQR + STD provide enough info for a model? Did I make the right choices in which statistics to involve or remove? Was an End to End model a good choice, or should I have had separate networks to analyze players/champions separately?  
  
After doing some research, it does appear that match outcome has been predicted by deep learning models, however it doesnt seem like they tried to predict the actual stats of the players in a game like my project is attempting. I can use this paper as a benchmark to see if my methods were effective in at least figuring out which team ended up winning with the same accuracy that this paper has.
https://dl.acm.org/doi/abs/10.1145/3472538.3472579#:~:text=In%20this%20paper%2C%20we%20propose,which%20occurs%20before%20gameplay%20begins.
  

Below is setup, importing various modules I will use, and setting up global variables to be used throughout.

In [1]:
import sklearn
import requests
import time
import sqlite3
import json
import threading
import torch
import torch.nn
from torchvision import models
from torch.utils.data import DataLoader, TensorDataset, Dataset
from sklearn.model_selection import train_test_split

region_list = [
    "br1",
    "eun1",
    "euw1",
    "jp1",
    "kr",
    "la1",
    "la2",
    "na1",
    "oc1",
    "ph2",
    "ru",
    "sg2",
    "th2",
    "tr1",
    "tw2",
    "vn2",
]

continents_dictionary = {
    "americas": ["na1", "br1", "la1", "la2"],
    "asia": ["jp1", "kr"],
    "europe": ["eun1", "euw1", "ru", "tr1"],
    "sea": ["oc1", "ph2", "sg2", "th2", "tw2", "vn2"],
}

league_list = ["challenger", "grandmaster", "master"]
division_list = ["I", "II", "III", "IV"]
roles = ["TOP", "JUNGLE", "MIDDLE", "BOTTOM", "UTILITY"]
match_file_path = "match_database.db"
api_key = ""

player_columns =   [
    "assists",
    "champLevel",
    "damageDealtToObjectives",
    "damageSelfMitigated",
    "deaths",
    "goldSpent",
    "killingSprees",
    "kills",
    "largestKillingSpree",
    "largestMultiKill",
    "timeCCingOthers",
    "totalDamageDealt",
    "totalDamageDealtToChampions",
    "totalDamageShieldedOnTeammates",
    "totalDamageTaken",
    "totalHeal",
    "totalHealsOnTeammates",
    "totalMinionsKilled",
    "totalTimeSpentDead",
    "visionWardsBoughtInGame",
    "wardsKilled",
    "wardsPlaced",
    "gameDuration",
    "win"
]

team_columns = [
    "win",
    "BaronFirst",
    "ChampionFirst",
    "DragonFirst",
    "InhibitorFirst",
    "RiftHeraldFirst",
    "TurretFirst",
    "BaronKills",
    "ChampionKills",
    "DragonKills",
    "InhibitorKills",
    "RiftHeraldKills",
    "TurretKills"
]

As it is important to set up global variables, it is also important to create the databases as well. It takes up a lot of space but here is the code that would create them if running this notebook from scratch. This is also valuable to read through to know what statistics I am using as predictions, and as labels. The columns in champ_stats and player_stats are the inputs for the model, and the columns in player_matches and team_matches are the outputs.

In [2]:
match_conn = sqlite3.connect(match_file_path)
match_cursor = match_conn.cursor()
match_cursor.execute('''CREATE TABLE IF NOT EXISTS players (
                    summonerid TEXT,
                    rank TEXT,
                    region TEXT,
                    puuid TEXT
                  )''')
match_conn.close()



match_cursor.execute('''CREATE TABLE IF NOT EXISTS player_matches (
                    matchId INTEGER,
                    puuid TEXT,
                    teamPosition TEXT,
                    championName TEXT,
                    gameDuration INTEGER,
                    win INTEGER,
                    assists INTEGER,
                    champLevel INTEGER,
                    damageDealtToObjectives INTEGER,
                    damageSelfMitigated INTEGER,
                    deaths INTEGER,
                    goldSpent INTEGER,
                    killingSprees INTEGER,
                    kills INTEGER,
                    largestKillingSpree INTEGER,
                    largestMultiKill INTEGER,
                    timeCCingOthers INTEGER,
                    totalDamageDealt INTEGER,
                    totalDamageDealtToChampions INTEGER,
                    totalDamageShieldedOnTeammates INTEGER,
                    totalDamageTaken INTEGER,
                    totalHeal INTEGER,
                    totalHealsOnTeammates INTEGER,
                    totalMinionsKilled INTEGER,
                    totalTimeSpentDead INTEGER,
                    visionWardsBoughtInGame INTEGER,
                    wardsKilled INTEGER,
                    wardsPlaced INTEGER
                  )''')
match_conn.commit()

match_cursor.execute('''CREATE TABLE IF NOT EXISTS champ_stats (
                    championName TEXT,
                    teamPosition TEXT,
                    assists INTEGER,
                    champLevel INTEGER,
                    damageDealtToObjectives INTEGER,
                    damageSelfMitigated INTEGER,
                    deaths INTEGER,
                    goldSpent INTEGER,
                    killingSprees INTEGER,
                    kills INTEGER,
                    largestKillingSpree INTEGER,
                    largestMultiKill INTEGER,
                    timeCCingOthers INTEGER,
                    totalDamageDealt INTEGER,
                    totalDamageDealtToChampions INTEGER,
                    totalDamageShieldedOnTeammates INTEGER,
                    totalDamageTaken INTEGER,
                    totalHeal INTEGER,
                    totalHealsOnTeammates INTEGER,
                    totalMinionsKilled INTEGER,
                    totalTimeSpentDead INTEGER,
                    visionWardsBoughtInGame INTEGER,
                    wardsKilled INTEGER,
                    wardsPlaced INTEGER
                  )''')
match_conn.commit()
match_cursor.execute('''CREATE TABLE IF NOT EXISTS player_stats (
                    puuid TEXT,
                    teamPosition TEXT,
                    assists INTEGER,
                    champLevel INTEGER,
                    damageDealtToObjectives INTEGER,
                    damageSelfMitigated INTEGER,
                    deaths INTEGER,
                    goldSpent INTEGER,
                    killingSprees INTEGER,
                    kills INTEGER,
                    largestKillingSpree INTEGER,
                    largestMultiKill INTEGER,
                    timeCCingOthers INTEGER,
                    totalDamageDealt INTEGER,
                    totalDamageDealtToChampions INTEGER,
                    totalDamageShieldedOnTeammates INTEGER,
                    totalDamageTaken INTEGER,
                    totalHeal INTEGER,
                    totalHealsOnTeammates INTEGER,
                    totalMinionsKilled INTEGER,
                    totalTimeSpentDead INTEGER,
                    visionWardsBoughtInGame INTEGER,
                    wardsKilled INTEGER,
                    wardsPlaced INTEGER
                  )''')
match_conn.commit()

match_cursor.execute('''CREATE TABLE IF NOT EXISTS match_stats (
                    matchId TEXT,
                    win INTEGER,
                    BaronFirst INTEGER,
                    ChampionFirst INTEGER,
                    DragonFirst INTEGER,
                    InhibitorFirst INTEGER,
                    RiftHeraldFirst INTEGER,
                    TurretFirst INTEGER,
                    BaronKills INTEGER,
                    ChampionKills INTEGER,
                    DragonKills INTEGER,
                    InhibitorKills INTEGER,
                    RiftHeraldKills INTEGER,
                    TurretKills INTEGER
                  )''')
match_conn.commit()

match_conn.close()

# Part 1: Data Gathering
An equally important step to the actual training of the model for this project was in creating the dataset the model was to be trained on. The basic steps of which can be seen below  
1. Setup a riot developer account, which allows me to query their databases for the information I will use https://developer.riotgames.com/
2. Every account has a summonerId and a puuid (Player Universally Unique IDentifiers). The summonerIds can be queried many at a time by the different ranks, and will be our querying the databse for information. https://developer.riotgames.com/apis#league-v4/GET_getLeagueEntries
3. Unfortunately although summonerIds are easier to obtain than the puuids, I can only access a players match history through their puuid, so the next step will be finding the puuids. I can then query riots database with a summoner Id, and get the associated puuids. https://developer.riotgames.com/apis#summoner-v4/GET_getBySummonerId
4. Now that I have every players puuid, the next step is gathering a list of every match played. Each match has a unique id, and I can query the database for a list of matchIds associated with a player https://developer.riotgames.com/apis#match-v5/GET_getMatchIdsByPUUID
5. Now I can now query with a matchId, and be returned all of the statistics associated. https://developer.riotgames.com/apis#match-v5/GET_getMatch
6. For the last bit of organizing, I now need to set up our inputs to the model. Since every player and champion will be represented by one large tensor of values, each of which that are calculated from finding averages and standard deviations, it would take the model a very long time to run those calculations each time to run the model. Becaue of this, I can run these calculations once for each player and champion, and then store the result since looking up precomputed data is much quicker than recomputing it.  
   
Put simply, gather summonerIds, use summonerIds to gather puuids, use puuids to gather matchId list, use matchId list to gather match statistics, use match statistics to generate calculations for the model input  
  
Other things of note: There is a rate limit of 200 calls per minute, per region or continent depending on the kind of call. If you as the reader would get your own developer key, you would see some of these functions run significantly faster than others because of whether they can query a region or continent. Continents are just made up of multiple regions, for example the Americas are made up of the north american, brazilian, and two separate latin american servers. Running all of these would take over 24 hours in one sitting.  
  
**Databases**  
  
The tables I made throughout this project went through many modifications (generally getting smaller as I learned what data I didnt actually need.) I originally created two databases, which although was not ultimately necessary, it did reduce the size of both databases, and made them faster to download and upload individually. My thinking to create the two databases was that I wanted the statistics related to matches in a separate place from the secondary info (user ids and list of matchids) that was needed to acquire them in the first place. The first database was the player database, which stored the summonerId, rank, region, puuid, and eventually the list of matchIds played in a region. Steps 2-4 involve the player database.  
  
The match database has the bulk of the information stored, with 4 separate tables. Firstly, the player_matches holds all the statistics and other identifying information for each game played. Each matchId has 10 separate rows, one for each player and their performances. This table was used directly to create the rest the inputs and outputs. The first 4 columns are identifying info, matchid, puuid, champion, and role. The next 2 columns are used for labels, and the last 24 columns are the statistics recorded for the player in that game. The next two tables data are calculations based on the information found in player_matches  
  
player_stats and champ_stats both hold averages based on the statistics gathered. These two tables make up the inputs to the model. In champ_stats each row is a champion, and the roles they have played in the data. (Most champions have about 4 rows, but ususally about 2 of those have less than 10 games since the champion isn't meant to be played there). The rest of the columns are dedicated to the stastistics I gathered that are being used to make the predictions. The actual values stored is a list of 4 values, the 25th, 50th, 75th percentile, and the standard deviation, for each column. player_stats is the exact same as champ_stats, but instead of champion names they are puuids instead. When gathering inputs to the model, a given match id will find out what champions and players participated in the game, and will then find each of those precalculated statistics from these tables.
  
team_matches holds the objective information for both the teams in a match, thus there are two rows per matchId. It records which team won, the first team to take each objective, and how many of the objectives they gathered. This data is only used for labels in the model.

All of the data gathering functions were built with their own call function, that allows each to be called one by one in case the api key resets.  
  
Generally, the functions mentioned in steps 2-5 above all follow the same structure. They accept a region, and possibly ranks as well as inputs. Then in a while loop will request values from the api based upon the arguments given, and the purpose of the function. A while loop is used rather than a for loop since it can handle errors more easily in case of database or server errors. If the request is successful (a 200 status code is recieved), then the value is loaded and inserted into a database, and then a counter variable is incremented to move onto the next request. There are also try excepts to help debug and see where issues arise. The error code of 429 means that the function must wait a few seconds before requesting again. I believe this was generally the most effective style of retrieving and storing data.  
  
for get_all_players specifically, its made to handle both masters+ and diamond/emerald ranks despite them having different api requests, to minimize duplicated code. The call_get_all_players function has a few for loops to make sure to find all the players in their differing regions and ranks.

In [3]:
def get_all_players(region, rank, division):
    # Gathers every summonerId in masters +, diamond, and emerald
    player_conn = sqlite3.connect(match_file_path)
    player_cursor = player_conn.cursor() # Create connection to database
    count = 1
    while True:
        # Handles which request to send depending on if the league is masters+ or not
        if division != "":
            request = requests.get(f"https://{region}.api.riotgames.com/lol/league/v4/entries/RANKED_SOLO_5x5/{rank.upper()}/{division}?page={str(count)}&api_key={api_key}")
        else:
            request = requests.get(f"https://{region}.api.riotgames.com/lol/league/v4/{rank}leagues/by-queue/RANKED_SOLO_5x5?page={str(count)}&api_key={api_key}")
    
        if request.status_code == 200: # If the request is successful, input basic info about the user
            
            try:
                users = json.loads(request.text)
                if users == []:
                    break
                    
                for record in users:
                    # Input the data used to find more about these players
                    player_cursor.execute(
                        """
                        INSERT INTO players (summonerId, rank, region)
                        VALUES (?, ?, ?)
                        """,
                        (
                            record["summonerId"],
                            rank + division,
                            region,
                        ),
                    )
                    player_conn.commit()
                    count += 1
                    
            except Exception as e: # Except statement in case of error
                print("Error processing API response:", e)
                
        elif request.status_code == 429: # Should I get rate limited, the program will wait to try to request info again
            print(region, division)
            time.sleep(10)
            
        else:
            print(request.status_code)

def call_get_all_players():
    # Stores every summonerid from every region from emerald IV to challenger in player database
    threads = []
    for region in region_list: # creates a unique thread for each region, then calls the puuid finder
        # First loops through masters+ leagues, then loops through the different divisions of diamond and emerald
        for league in league_list:
            thread = threading.Thread(target=get_all_players, args=(region, league, "",)) # Using threading makes this function run significantly faster
            threads.append(thread)
            thread.start()
            print("new_thread created")
    
        for division in division_list:
            thread = threading.Thread(target=get_all_players, args=(region, "diamond", division,)) # Using threading makes this function run significantly faster
            threads.append(thread)
            thread.start()
    
            thread = threading.Thread(target=get_all_players, args=(region, "emerald", division,)) # Using threading makes this function run significantly faster
            threads.append(thread)
            thread.start()

    for thread in threads: # ensures the main thread waits for all threads to finish
        thread.join()

In [4]:
def get_puuids_by_region(region):
    # Given a region, places every users puuid into the database
    region_conn = sqlite3.connect(match_file_path)
    region_cursor = region_conn.cursor()
    
    summoner_id_list = region_cursor.execute("SELECT summonerId FROM players WHERE region = ?", (region,)).fetchall() # create list of every summonerid, then loop through the list
    summoner_ids = [row[0] for row in summoner_id_list]
    index = 0

    # Makes request to server for the information regarding this summonerid
    while index < len(summoner_ids):
        request = requests.get(
            "https://"
            + region
            + ".api.riotgames.com/lol/summoner/v4/summoners/"
            + summoner_ids[index]
            + api_key
        )
        
        if request.status_code == 200: # If successful, find the matching puuid for the given summoner id
            try:
                user = json.loads(request.text)
                region_cursor.execute(
                    """
                            UPDATE players
                            SET puuid = ?
                            WHERE summonerId = ?
                        """,
                    (user["puuid"], summoner_ids[index]),
                )
                region_conn.commit()
                index += 1
                
            except Exception as e:
                print("Error processing API response:", e)
                
        elif request.status_code == 429:
            # If too many requests were sent, wait a few moments before trying again
            print("sleeping")
            time.sleep(20)
            
        else:
            print(request.status_code)


def call_puuids():
    threads = []
    for region in region_list: # Creates a unique thread for each region, then calls the puuid finder
        thread = threading.Thread(target=get_puuids_by_region, args=(region,)) # Using threading makes this function run significantly faster
        threads.append(thread)
        thread.start()
        print("new_thread created")

    for thread in threads: # ensures the main thread waits for all threads to finish
        thread.join()

Now that all of the puuids are acquired, the next part is to start looking for all the matches played. This function had a unique issue, where I couldn't just accept every single matchId played because then I would have 10 repeats of the same match in our data. That would pose an issue in the next function where I would then gather data based on every matchId, and if there are repeats I could end up with up to 120 rows of data about the same game, when I should only have 12 (10 for each of the players, 2 for each team). To solve this, I added every matchId found to one set. The nature of sets is that they will not accept non unique values, which meant I could add every matchId found to it, without having to worry about conditional statements. The downside of this is that one large set isn't really ideal for database management, since now all the matchIds found in a region would go into just one row. There may be a better way to handle that, however my function didn't pose me any major issues. Otherwise this function functions similarly to the others, except that it only has one insert statement at the end, as opposed to throughout the while loop like the others. The while loop in this function merely adds the found matchIds to the set. This is also the first function that now works with continents, which slows down the speed at which this function runs compared to the others.

In [5]:
def get_matches_played(continent, regions):
    # Given a continent and regions, collect every unique matchId played for each region
    continent_conn = sqlite3.connect(match_file_path)
    continent_cursor = continent_conn.cursor()

    for region in regions: # loop through all regions within the given continent
        puuid_list = continent_cursor.execute(
            "SELECT puuid FROM players WHERE region = ?", (region,)
        ).fetchall()
        puuids = [row[0] for row in puuid_list] # find all puuids for the given region
        index = 0 # tracks where in the puuids the function is
        match_count = 0 # This variable tracks where to begin the search in a players history
        repeat_matches = set() # Holds the list of mathes for a region, since it is a set it will not take non unique values

        # In case the databse already has some matches, this code will add them to repeat_matches so they dont get mixed up
        # repeat_list = continent_cursor.execute(
        #     "SELECT matches FROM players WHERE region = ?", (region,)
        # ).fetchall()
        # [repeat_matches.add(row[0]) for row in repeat_list]

        # This code loops requests every match played by a user this season, and adds them to the set
        # Loop increments by 100 to look at the next 100 matches, and only moves on from a user once the request comes back empty
        while index < len(puuids):
            request = requests.get(
                f"https://{continent}.api.riotgames.com/lol/match/v5/matches/by-puuid/{puuids[index]}/ids?startTime=16417728&type=ranked&start={str(match_count)}&count=100&{api_key}"
            )
            if request.status_code == 200:
                
                try:
                    matches = json.loads(request.text)
                    
                    if len(matches) == 0: # Base Case for incrementing the while loop, moving on from the user
                        index += 1
                        continue
                        
                    for match in matches:
                        repeat_matches.add(match) # Adds the matches to the set
                    match_count += 100 # The request brings back 100 matches at a time, so this code looks at next 100
                    
                except Exception as e:
                    print("Error processing API response:", e)
                    
            elif request.status_code == 429:
                # If too many requests were sent, wait a few moments before trying again
                print("sleeping" + continent + region + str(index))
                time.sleep(20)
                
            else:
                print(request.status_code)

        # Perhaps suboptimal database management, but I place the entire list of matches in 1 row per region
        # This occurs after the while loop has finished finding every match
        continent_cursor.execute(
                            """
                                UPDATE players
                                SET matches = ?
                                WHERE puuid = ?
                            """,
                            (json.dumps(list(repeat_matches)), puuids[1]),
                                )
        continent_conn.commit()


def call_match_list():
    # Calls function that stores every match played
    threads = []
    for continent, regions in continents_dictionary.items():
        thread = threading.Thread(
            target=get_matches_played,
            args=(continent,regions,),)
        threads.append(thread)
        thread.start()
        print("new_thread created")

    for thread in threads:
        thread.join()

This function went through many iterations, since I wasn't sure which features I would end up using for the model. Because of the way I stored the matchIds previously, there is some string splicing required at the beginning to make the list iterable. Ultimately though it isn't much different from the previous functions, it requests the statistics and places some into player rows, and some into team rows.

In [6]:
def store_matches(continent, regions):
    # Stores all values from a match into the match database, from every match
    player_conn = sqlite3.connect(match_file_path)
    player_cursor = player_conn.cursor()
    match_conn = sqlite3.connect(match_file_path)
    match_cursor = match_conn.cursor()
    
    for region in regions:
        # Gather list the list of all matches in a region
        print(region)
        match_list = player_cursor.execute(
            "SELECT matches FROM players WHERE region = ?", (region,)
        ).fetchall()
        
        matches = [row[0] for row in match_list] # the match ids are stored as a list with one element, a giant string
        
        # Some string splicing needs to be done in order to properly loop through the list
        matches = matches[0][1:-1]
        matches = [match[1:-2] for match in matches.split()] # this line splits this string into a list that can be looped through
        
        match_index = 0 # index for overall match list
        
        while match_index < len(matches):
            request = requests.get(
                f"https://{continent}.api.riotgames.com/lol/match/v5/matches/{matches[match_index]}?api_key={api_key}"
            )
            if request.status_code == 200:
                try:
                    match_data = json.loads(request.text)

                    # Create variables for ease of use later
                    info = match_data["info"]
                    teams = info["teams"]
                    # Inserts the two team rows into database
                    for team in teams:
                        objectives = team['objectives']
                        match_cursor.execute(
                            """
                            INSERT INTO team_matches (
                                matchId,
                                win,
                                BaronFirst,
                                ChampionFirst,
                                DragonFirst,
                                InhibitorFirst,
                                RiftHeraldFirst,
                                TurretFirst,
                                BaronKills,
                                ChampionKills,
                                DragonKills,
                                InhibitorKills,
                                RiftHeraldKills,
                                TurretKills
                            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                            """,
                            (matches[match_index],
                            int(team['win']),
                            objectives['baron']['first'],
                            objectives['champion']['first'],
                            objectives['dragon']['first'],
                            objectives['inhibitor']['first'],
                            objectives['riftHerald']['first'],
                            objectives['tower']['first'],
                            objectives['baron']['kills'],
                            objectives['champion']['kills'],
                            objectives['dragon']['kills'],
                            objectives['inhibitor']['kills'],
                            objectives['riftHerald']['kills'],
                            objectives['tower']['kills'],
                            )
                        )
                    # Inserts the 10 rows of participants into database
                    participants = info["participants"]
                    for participant in participants:
                        match_cursor.execute(
                        """
                        INSERT INTO player_matches (
                                    "matchId",
                                    "puuid",
                                    "teamPosition",
                                    "championName",
                                    "gameDuration",
                                    "win",
                                    "assists",
                                    "champLevel",
                                    "damageDealtToObjectives",
                                    "damageSelfMitigated",
                                    "deaths",
                                    "goldSpent",
                                    "killingSprees",
                                    "kills",
                                    "largestKillingSpree",
                                    "largestMultiKill",
                                    "timeCCingOthers",
                                    "totalDamageDealt",
                                    "totalDamageDealtToChampions",
                                    "totalDamageShieldedOnTeammates",
                                    "totalDamageTaken",
                                    "totalHeal",
                                    "totalHealsOnTeammates",
                                    "totalMinionsKilled",
                                    "totalTimeSpentDead",
                                    "visionWardsBoughtInGame",
                                    "wardsKilled",
                                    "wardsPlaced"
                        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                    """,
                        (
                            matches[match_index],
                            participant['puuid'],
                            participant['teamPosition'],
                            participant['championName'],
                            info['gameDuration'],
                            participant['win'],
                            participant['assists'],
                            participant['champLevel'],
                            participant['damageDealtToObjectives'],
                            participant['damageSelfMitigated'],
                            participant['deaths'],
                            participant['goldSpent'],
                            participant['killingSprees'],
                            participant['kills'],
                            participant['largestKillingSpree'],
                            participant['largestMultiKill'],
                            participant['timeCCingOthers'],
                            participant['totalDamageDealt'],
                            participant['totalDamageDealtToChampions'],
                            participant['totalDamageShieldedOnTeammates'],
                            participant['totalDamageTaken'],
                            participant['totalHeal'],
                            participant['totalHealsOnTeammates'],
                            participant['totalMinionsKilled'],
                            participant['totalTimeSpentDead'],
                            participant['visionWardsBoughtInGame'],
                            participant['wardsKilled'],
                            participant['wardsPlaced']
                        ),
                    )
                
                    match_conn.commit()
                    match_index += 1
                    
                except Exception as e:
                    # End case
                    print("Error processing API response:", e)
                    if "list index out of range" in str(e):
                        match_index +=1
                        continue
                        
            elif request.status_code == 429:
                # If too many requests were sent, wait a few moments before trying again
                print("sleeping " + continent + " " + region + " " + str(match_index))
                time.sleep(10)
                
            elif request.status_code == 404:
                # If match doesnt have any info, continue on
                match_index += 1
                continue
                
            else:
                print(request.status_code)


def call_store_matches():
    # Calls function that stores every statistic from every match played
    threads = []
    for continent, regions in continents_dictionary.items():
        thread = threading.Thread(
            target=store_matches,
            args=(continent, regions, ),)
        threads.append(thread)
        thread.start()
        print("new_thread created")

    for thread in threads:
        thread.join()

To gather data on players, there is an issue of the amount of data played. Some people play one role and a few champs for hundreds of games, some people play a little of everything, and some barely play at all. One possible choice was using an RNN or GRU, which takes data of varying lengths, modifying its outputs as it takes in the random length of data. I decided against it since I didn't believe it would really fit the problem very well. Moreso, a large part of RNNs is deciding importance of inputs, and I imagine each game should generally be as equally valuable as the last. That being said there have been improvements that handle those issues, and an RNN based model could be an idea for a future project  
  
To capture information on a wild gaps in the amount of games played, I decided to combine it all into key statistics that capture the essence of the data. Essentially statistics that the model could understand the general idea for how well this player or champion performed as a certain statistic. Since only 4 values are stored per statistics, this also speeds up the model significantly. This function prepares this data in an sqlite database, taking in a champion or player, then the role they played. Every statistic on them is called that I've stored, and the percentiles are taken from those values.  
  
The main issue of this approach is the case where there is one game of someone playing a role once, or the champion playing in a role they aren't supposed to. The rest of the data in that game could be valuable, so I don't want to completely throw out or neglect games with offrole players or champions. This is most apparent because some of these games they may perform incredibly well or poorly and have wide swings, affecting data quality. Another side effect of this issue is when they play only 1 game, the torch.std doesn't work on a single value, and returns NaN isntead. Although I could have gone into the database to change the values, I elected to just handle the case in find_stats later on. That may increase the time it takes to create a dataset object because of it. Ideally I would have so much data, that any time there is an offrole player or champion I could ignore that game, but I want to use every piece of data I can

In [7]:
def gather_stats(column_type, name, role, table):
    # Given a champion name or puuid, enter a tensor of the iqr and std into champ_stats
    # This is not used in the actual model, but was instead created to have the values ready, rather than having to run the calculations during training
    
    stats = torch.tensor([])
    match_conn = sqlite3.connect(match_file_path)
    match_cursor = match_conn.cursor()
    # Gather all the games of a player or champion in that specific role
    # The stats for a playe or character will be different based on the role played, so its important not to combine them.
    column_vals = match_cursor.execute(f"SELECT * FROM player_matches WHERE {column_type} = ? AND teamPosition = ?", (name, str(role))).fetchall()

    # Loop through each of the player statistic columns, ignoring the last 2 since they are part of what I am are predicting
    match_cursor.execute(f"INSERT INTO {table} ({column_type}, teamPosition) VALUES (?, ?)", (name, role))
    match_conn.commit()
    for index, column in enumerate(player_columns[:-2]):
        data = torch.tensor([])
        # Loop through every row to gather each instance of the statistic
        for row in column_vals:
            data = torch.cat((data, torch.tensor([float(row[index+6])])))
        
        # Gather the iqr of the statistics
        iqr = torch.quantile(data, torch.tensor([0.25, 0.5, 0.75])).unsqueeze(0)
        stats = torch.cat((stats, iqr), dim=0)
        
        # Gather the standard deviation
        std_dev = torch.std(data).unsqueeze(0).unsqueeze(0)
        stats = torch.cat((stats, std_dev), dim=1)
        # This line was supposed to fix the NaN issue, but it didn't work
        torch.nan_to_num(stats)
        
        # Convert into a format that sqlite accepts
        stats_list = stats.numpy()
        stats_list = json.dumps(stats_list.tolist())
        match_cursor.execute(f"UPDATE {table} SET {column} = ? WHERE {column_type} = '{name}' AND teamPosition = '{role}'", (stats_list,))
        match_conn.commit()
        
        # Reset the variables
        stats = torch.empty(0)
    match_conn.close()


def call_gather_stats():
    # Creates player_stats and champ_stats which will be called in the training of the model
    match_conn = sqlite3.connect(match_file_path)
    match_cursor = match_conn.cursor()
    
    # Gather every champion and user
    champion_names = match_cursor.execute("SELECT DISTINCT championName FROM player_matches").fetchall()
    player_names = match_cursor.execute("SELECT DISTINCT puuid FROM player_matches").fetchall()
    match_conn.close()

    # Loop through every champion, user and their roles to create the tables
    for champ in champion_names:
        for role in roles:
            gather_stats("championName", champ[0], role, "champ_stats")
    for player in player_names:
        for role in roles:
            gather_stats("puuid", player[0], role, "player_stats")

To simplify the calls, this function calls all of the separate calls to the data gathering process in one place

In [8]:
def call_all_data():
    call_get_all_players() # Gather summonerIds
    call_puuids() # Gather puuids
    call_match_list() # Gather list of matchIds
    call_store_matches() # Gather match statistics
    call_gather_stats() # Generate percentiles and stds based on champions and players

These next 2 functions are called in the dataloader. find_stats takes all of the values stored in player_stas and champ_stats that are related to the specific match id given, so it uses the related players and champions. find_labels instead takes the related values from player_matches and team_matches, the true values of the match played. Both functions are essentially repeated calls to their respective two tables in match_database. The major difference between the two is how they are stored in the database. Where each value that find_labels looks at is a single float, each value find_stats looks at is a list of 4 values. I wanted each input to be one large tensor of values, so I had to extract the tensor, and use .view(-1) on it to seamlessly add it to the tensor. Also due to the NaN issue from gather_stats, I added the conditional to check if the std failed, and replace with a 0 float instead.  
  
The next step is some quick data cleaning, as after running call_all_data(), not everything is perfect

In [9]:
match_conn = sqlite3.connect(match_file_path)
match_cursor = match_conn.cursor()

# Some matches only get 8 entries stored, with no team position
match_cursor.execute("DELETE FROM player_matches WHERE teamPosition = '' OR teamPosition IS NULL")

# Some teams end up with multiple instances of the same team, although this may have been because I ran store_matches multiple times and never removed the inputs
match_cursor.execute("""
    DELETE FROM team_matches
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM team_matches
        GROUP BY win, matchId
    )
""")
match_conn.commit()

# Some matches simply lose a player, and I can't run a model down a player
match_cursor.execute("""DELETE FROM player_matches
WHERE matchId IN (
    SELECT matchId
    FROM player_matches
    GROUP BY matchId
    HAVING COUNT(*) = 9
);""")
match_conn.commit()

# Some matches simply lose a team as well, and I can't run a model down a team
match_cursor.execute("""DELETE FROM player_matches
WHERE matchId IN (
    SELECT matchId
    FROM team_matches
    GROUP BY matchId
    HAVING COUNT(*) = 1
);""")
match_conn.commit()

In [10]:
def find_stats(match_id):
    # Given a match id, return all of the aggregated performanced related to that match (data for model)
    input_list = []
    match_conn = sqlite3.connect(match_file_path)
    match_cursor = match_conn.cursor()
    
    # Gather the the info from the database
    match_participants = match_cursor.execute("SELECT championName, puuid, teamPosition FROM player_matches WHERE matchId = ?", (str(match_id),)).fetchall()

    for participant in match_participants:
        # Both of these calls only return one row, so the next for loops will automatically index into the only row gotten from the call
        match_champ_stats = match_cursor.execute("SELECT * FROM champ_stats WHERE championName = ? AND teamPosition = ?", (participant[0], participant[2],)).fetchall()
        match_player_stats = match_cursor.execute("SELECT * FROM player_stats WHERE puuid = ? AND teamPosition = ?", (participant[1], participant[2],)).fetchall()
        
        # Skipping the first 2 columns since they are identifying columns
        for stat in match_champ_stats[0][2:]:
            stat_list = json.loads(stat)
            stat_tensor = torch.tensor(stat_list)
            # Some of the standard deviation values were marked as NaN, so this conditional handles replacement
            if torch.isnan(stat_tensor[0][-1]):
                stat_tensor[0][-1] = 0.0
            input_list.append(stat_tensor.view(-1))
        
        for stat in match_player_stats[0][2:]:
            stat_list = json.loads(stat)
            stat_tensor = torch.tensor(stat_list)
            if torch.isnan(stat_tensor[0][-1]):
                stat_tensor[0][-1] = 0.0
            input_list.append(stat_tensor.view(-1))
            
    match_conn.close()
    # Concatenates the input list of tensors into one row
    return torch.cat(input_list, dim=0)

In [11]:
def find_labels(match_id):
    # Given a match id, return all of the statistics related to that match (labels for model)
    label_list = []
    match_conn = sqlite3.connect(match_file_path)
    match_cursor = match_conn.cursor()
    
    # Gather the the info from the database
    player_match_vals = match_cursor.execute("SELECT * FROM player_matches WHERE matchId = ?", (str(match_id),)).fetchall()
    team_match_vals = match_cursor.execute("SELECT * FROM team_matches WHERE matchId = ?", (str(match_id),)).fetchall()
    match_conn.close()
    
    #loop through the player rows, and then the team rows, concatenating all into one large tensor
    for player in player_match_vals:
        for column in player[4:]:
            label_list.append(float(column))

    for team in team_match_vals:
        for column in team[1:]:
            label_list.append(float(column))

    # Creates a tensor out of the label_list
    return torch.tensor(label_list, dtype=torch.float32)

This dataset is fairly standard compared to the ones we've seen in class, it has inputs and labels as attribrutes of the class, and a get item method that gathers them when called. The main difference is that as an argument the class knows if it is a training or test module, and creates the dataset as such. If I want to get different training/test datasets, I can always modify the random_state.

In [None]:
class GameDataset(Dataset):
    def __init__(self, incorrect_match_ids, train):
        # The matchIds come in as strings, with parentheses attached, so this for loop removes them making suitable for sqlite queries
        match_ids = [match_id[0] for match_id in incorrect_match_ids]
        self.stats = []
        self.labels = []
        
        # Creates a split of the matchIds to usefor the data
        train_match_ids, test_match_ids = train_test_split(match_ids, test_size=0.1, random_state=42)

        # This conditional selects which matchids will be used for the dataset
        if train is True:
            self.match_ids = train_match_ids
        else:
            self.match_ids = test_match_ids 
            
        for match_id in self.match_ids:
            stats = find_stats(match_id)
            labels = find_labels(match_id)
            self.stats.append(stats)
            self.labels.append(labels)
            
            # Tracks progress for creating a dataset object
            if len(self.stats) % 1000 == 0:
                print(len(self.stats))

    def __len__(self):
        return len(self.stats)

    def __getitem__(self, ids):
        return self.stats[ids], self.labels[ids]


In [None]:
def create_dataloader():
    match_conn = sqlite3.connect(match_file_path)
    match_cursor = match_conn.cursor()
    matches = match_cursor.execute("SELECT DISTINCT matchId FROM player_matches").fetchall()
    match_conn.close()
    print("beginning attempt")
    train_dataset = GameDataset(matches, train=True)
    print("almost there...")
    test_dataset = GameDataset(matches, train=False)
    train_dl = DataLoader(train_dataset, batch_size=64, shuffle=True)
    test_dl = DataLoader(test_dataset, batch_size=64, shuffle=False)
    print("finished!")
    return train_dl, test_dl

train_dl, test_dl = create_dataloader()

I decided to go with use of fully connected layers, with some splashes of normalization and regularization. I normalized each layer before running it through a fully connected layer. I believed fully connected layers would prove incredibly valuable for this problem. For example, if one top laner is playing a very strong champion, and also performs much better at the role than his opponent, then that would affect the prediction of how many kills and deaths both top laners would have, but it should have little effect on the kills and deaths of the bottom laner, since they are across the map. By doing this, the model could supposedly see which values have the biggest impact for each player. Unfortunately the model does not seem very effective, and isn't able to make sense of any of the values.

In [None]:
class GameModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.norm1 = torch.nn.BatchNorm1d(1760)
        self.norm2 = torch.nn.BatchNorm1d(1550)
        self.norm3 = torch.nn.BatchNorm1d(1400)
        self.norm4 = torch.nn.BatchNorm1d(1250)
        self.norm5 = torch.nn.BatchNorm1d(1100)
        self.norm6 = torch.nn.BatchNorm1d(900)
        self.dropout1 = torch.nn.Dropout(0.2)  
        self.dropout2 = torch.nn.Dropout(0.2)  
        self.fc1 = torch.nn.Linear(1760, 1550)
        self.fc2 = torch.nn.Linear(1550, 1400)
        self.fc3 = torch.nn.Linear(1400, 1250)
        self.fc4 = torch.nn.Linear(1250, 1100)
        self.fc5 = torch.nn.Linear(1100, 900)
        self.fc6 = torch.nn.Linear(900, 600)
        self.fc7 = torch.nn.Linear(600, 400)
        self.fc8 = torch.nn.Linear(400, 266)
        
    
    def forward(self, x):
        x = self.norm1(x)
        x = torch.relu(self.fc1(x))
        x = self.dropout1(x)
        x = self.norm2(x)
        x = torch.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.norm3(x)
        x = torch.relu(self.fc3(x))
        x = self.norm4(x)
        x = torch.relu(self.fc4(x))
        x = self.norm5(x)
        x = torch.relu(self.fc5(x))
        x = self.norm6(x)
        x = torch.relu(self.fc6(x))
        x = torch.relu(self.fc7(x))
        x = self.fc8(x)
        return x

model = GameModel()

Standard train_model function, I used MSELoss since the data was not catagorical, and adam optimizer since it seems like its the best. Most of this code is taken directly by my wonderful professor.

In [None]:
def train_model(model, epochs, learning_rate, dataloader):
    start_time = time.time()
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    for epoch in range(epochs):
        running_loss = 0.0
        for i, data in enumerate(dataloader, 0):
                inputs, labels = data
                optimizer.zero_grad()
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()
                running_loss += loss.item()
                if i % 100 == 90:    # print every 100 mini-batches
                    print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 6400:.3f}')
                    running_loss = 0.0
    print(f"Finished Training.  Elapsed time: {time.time()-start_time:0.2f}")
    print(f"Finished Training.  Elapsed time: {time.time()-start_time:0.2f}")

train_model(model, 70, 0.001, train_dl)

I wanted to find out by what percent are my predictions are off by, so I can know how to modify my model. This function provides a percent for each batch in the testing batch.

In [29]:
def accuracy_percent(given_model, dataloader):
    all_predictions = []
    
    for inputs, labels in dataloader:
        with torch.no_grad():
            outputs = given_model(inputs) # Model makes predictions
        abs_diff = torch.abs(outputs - labels) # Find absolute value difference in the predictions
        percentages = (abs_diff / (labels + 0.0000000001)) * 100 # Calculate the percent, while avoiding infinite values
        all_predictions.append(torch.mean(percentages)) # Add the average percent difference to return
        
    return all_predictions

accuracy_percent(model, test_dl)

[tensor(4.3720e+13),
 tensor(3.9440e+13),
 tensor(3.6535e+13),
 tensor(4.5583e+13),
 tensor(4.0176e+13),
 tensor(4.4834e+13),
 tensor(4.1637e+13),
 tensor(4.7029e+13),
 tensor(4.2871e+13),
 tensor(3.8384e+13),
 tensor(4.2960e+13),
 tensor(3.9691e+13),
 tensor(4.0508e+13),
 tensor(3.6381e+13),
 tensor(3.9320e+13),
 tensor(3.7066e+13),
 tensor(4.0170e+13),
 tensor(3.9836e+13),
 tensor(3.6931e+13),
 tensor(1.4939e+13)]

# Results  
  
Unfortunately I could not get the model to provide anything useful. The predicted outputs have massive losses compared to the real outcomes of the games. The loss is quite high, and even eyeballing the results shows that the predictions are rarely near what they should be. train_model usually outputs losses during training upwards of 1 million. I was a little rushed in analyzing the data, and it is possible that the model isn't quite as innacurate as I think it is now, its difficult to say since so many of the values have different scales (damage to champions is measured in tens of thousands, kills is usually single digits)
  
It's difficult to pinpoint exactly where it went wrong, maybe I didn't have enough features, matches, the task was way beyond me in the first place or I messed up execution of the model. I believe there is some truth to all four possibilities, thought it would be difficult to assign a percentage. I ended up only using 22 features to try to make all these predictions, and maybe that isnt enough information to go off of. It's likely that even the whole api doesn't have enough info to make better predictions. Ideally we would take in player positioning, how much damage dealt at different points in the game, or other even more match specific information. Although the IQR + STD method made the model "work" it likely is too inspecific to get meaningful data out of. As well after data cleaning, I only ended up with about 12,000 games, which considering the amount of players and champions, isn't really a lot to make good predictions. Not to mention the data was taken across multiple balance patches and ranks, further diluting the quality. Perhaps if I took another stab at it, the project could turn out better, or perhaps theres inherently not enough data to account for the randomness of human action in the game.

# Conclusion  
  
In this project I gathered information on a great number of league games and tried to see how much of the game could be predicted by the players and champions within it. It required learning about using an API, and slowly step by step gathering the info that allowed me to get the statistics I wanted to train on. Once I had them, I could begin setting up the model to train on the data gathered and see how effective it really was. I used an end to end model which was a simple way to process all the data, but a GRU could also possibly provide effective results.  
  
I learned a lot over the course of this project, first being about using an api. How to look through documentation, make requests, and then also storing that information in sqlite was definitely very valuable. This project also showed a lot of value in planning, as had I planned better it would have gone a little smoother. I'm always tempted to rush into projects to get started instantly, but this taught me the value of looking at the bigger picture and planning my inputs and outputs (kind of like planning a cpu, sort of). I lost a lot of time attempting long runs of code without being 100% certain that it will work, which I will try not to repeat in future projects. By being greedy and not testing properly before running code, I ended up wasting far more time than if I had been careful in the first place. As well, using the time module to help give a sense of how long something takes, and if that is correct. It will greatly help any future project I do, even if its not related to deep learning. I learned a lot about shaping up the various inputs and outputs of the databases, datasets, and dataloaders. Towards the end I also researched more possible deep learning models that may have given a better outcome than my own.  
  
There are quite a few ways the project could be continued from here. I unfortunately didn't end up training the model as much as I would have liked, so I'm sure it could've been more accurate. Beyond that a GRU model could be an interesting way to go, as well as selecting more features than I did to make predictions.  I tried to stick with features I knew would be valuable, to minimze redundancy. As well, a CNN could also possibly have had some success, since the way I adapted the input data could be read by one and possibly have found patterns. Making the model more conventionally useful by adding a function that accepts the list of players and champions from a new game, and runs the inference very quickly, for when I am about to play a game, essentially making the whole project quick and easy to use. This of course would have to come after finding a working model for this problem. Lastly, pivoting to finding just the likelihood of one team winning, as opposed to every possible statistic could make for a more accurate project