# Exploring the data
The data is split into two groups of files: `root_games.csv` and `tail_games[1-4].csv`. For our analysis, we can merge both groups into one larger dataset. This dataset will contain 1518 features, so we need to explore the data in order to select the most important ones.

How can we view all the features and select the best ones? The dataset author suggests inspecting the `dummy_league_match.json` file with a tool such as [json formatter](https://jsonformatter.curiousconcept.com/). Below we list the features we selected based on that JSON file.
1. Every feature from "**teams**" field (e.g., teamID, win or lose, first blood, first tower, etc.).
2. Participant-related features ("**participants**" field), such as the number of kills, deaths, and assists; [vision score](https://leagueoflegends.fandom.com/wiki/Vision_score); and timeline information such as creeps per minute, xp per minute, etc.
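
As a local complement to a web formatter, nested JSON can be flattened with pandas to see which dotted column names it produces. The sketch below is only an illustration: the `participant` dictionary is a made-up, heavily trimmed stand-in, not the contents of `dummy_league_match.json`.

```python
import pandas as pd

# Hypothetical, trimmed stand-in for one participant entry of a match.
participant = {
    "teamId": 100,
    "stats": {"kills": 3, "deaths": 1, "assists": 7, "visionScore": 22},
    "timeline": {"creepsPerMinDeltas": {"0-10": 6.4}},
}

# json_normalize flattens nested dictionaries into dotted column names.
flat = pd.json_normalize(participant)
feature_names = sorted(flat.columns)
print(feature_names)
```

The flattened names (for example `stats.kills`) follow the same dotted pattern as the CSV columns used in the rest of this notebook.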

With this in mind, how can we extract this information from the dataset?

# Extracting the information

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [3]:
team_fields = [
    # Blue team
    'teamId_100.win', 'teamId_100.firstBlood', 'teamId_100.firstTower', 'teamId_100.firstInhibitor',
    'teamId_100.firstBaron', 'teamId_100.firstDragon', 'teamId_100.firstRiftHerald', 'teamId_100.towerKills',
    'teamId_100.inhibitorKills', 'teamId_100.baronKills', 'teamId_100.dragonKills', 'teamId_100.riftHeraldKills',
    # Red team
    'teamId_200.win', 'teamId_200.firstBlood', 'teamId_200.firstTower', 'teamId_200.firstInhibitor',
    'teamId_200.firstBaron', 'teamId_200.firstDragon', 'teamId_200.firstRiftHerald', 'teamId_200.towerKills',
    'teamId_200.inhibitorKills', 'teamId_200.baronKills', 'teamId_200.dragonKills', 'teamId_200.riftHeraldKills',
]

participant_fields = [
    # teamId
    'participant1.teamId', 'participant2.teamId', 'participant3.teamId', 'participant4.teamId',
    'participant5.teamId', 'participant6.teamId', 'participant7.teamId', 'participant8.teamId',
    'participant9.teamId', 'participant10.teamId',
    # Kill, death, assists
    'participant1.stats.kills', 'participant1.stats.deaths', 'participant1.stats.assists',
    'participant2.stats.kills', 'participant2.stats.deaths', 'participant2.stats.assists',
    'participant3.stats.kills', 'participant3.stats.deaths', 'participant3.stats.assists',
    'participant4.stats.kills', 'participant4.stats.deaths', 'participant4.stats.assists',
    'participant5.stats.kills', 'participant5.stats.deaths', 'participant5.stats.assists',
    'participant6.stats.kills', 'participant6.stats.deaths', 'participant6.stats.assists',
    'participant7.stats.kills', 'participant7.stats.deaths', 'participant7.stats.assists',
    'participant8.stats.kills', 'participant8.stats.deaths', 'participant8.stats.assists',
    'participant9.stats.kills', 'participant9.stats.deaths', 'participant9.stats.assists',
    'participant10.stats.kills', 'participant10.stats.deaths', 'participant10.stats.assists',
    # Vision score
    'participant1.stats.visionScore', 'participant2.stats.visionScore', 'participant3.stats.visionScore',
    'participant4.stats.visionScore', 'participant5.stats.visionScore', 'participant6.stats.visionScore',
    'participant7.stats.visionScore', 'participant8.stats.visionScore', 'participant9.stats.visionScore',
    'participant10.stats.visionScore',
    # Lane phase farm
    'participant1.timeline.creepsPerMinDeltas.0-10', 'participant2.timeline.creepsPerMinDeltas.0-10',
    'participant3.timeline.creepsPerMinDeltas.0-10', 'participant4.timeline.creepsPerMinDeltas.0-10',
    'participant5.timeline.creepsPerMinDeltas.0-10', 'participant6.timeline.creepsPerMinDeltas.0-10',
    'participant7.timeline.creepsPerMinDeltas.0-10', 'participant8.timeline.creepsPerMinDeltas.0-10',
    'participant9.timeline.creepsPerMinDeltas.0-10', 'participant10.timeline.creepsPerMinDeltas.0-10',
    # Lane phase capitalized gold
    'participant1.timeline.goldPerMinDeltas.0-10', 'participant2.timeline.goldPerMinDeltas.0-10',
    'participant3.timeline.goldPerMinDeltas.0-10', 'participant4.timeline.goldPerMinDeltas.0-10',
    'participant5.timeline.goldPerMinDeltas.0-10', 'participant6.timeline.goldPerMinDeltas.0-10',
    'participant7.timeline.goldPerMinDeltas.0-10', 'participant8.timeline.goldPerMinDeltas.0-10',
    'participant9.timeline.goldPerMinDeltas.0-10', 'participant10.timeline.goldPerMinDeltas.0-10',
    # Crowd control time
    'participant1.stats.timeCCingOthers', 'participant2.stats.timeCCingOthers',
    'participant3.stats.timeCCingOthers', 'participant4.stats.timeCCingOthers',
    'participant5.stats.timeCCingOthers', 'participant6.stats.timeCCingOthers',
    'participant7.stats.timeCCingOthers', 'participant8.stats.timeCCingOthers',
    'participant9.stats.timeCCingOthers', 'participant10.stats.timeCCingOthers',
]

merged_fields = team_fields + participant_fields
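
Because the column names follow a regular pattern, the same lists could also be generated instead of typed by hand. The following is a minimal sketch of that idea; the `generated_*` names are introduced here for illustration only, and the order of the names differs from the hand-written lists, which should not matter when they are passed to `usecols`.

```python
# Team-level fields: the same 12 statistics for blue (100) and red (200).
team_stats = [
    "win", "firstBlood", "firstTower", "firstInhibitor",
    "firstBaron", "firstDragon", "firstRiftHerald", "towerKills",
    "inhibitorKills", "baronKills", "dragonKills", "riftHeraldKills",
]
generated_team_fields = [
    f"teamId_{team}.{stat}" for team in (100, 200) for stat in team_stats
]

# Participant-level fields: the same 8 statistics for participants 1 to 10.
participant_stats = [
    "teamId", "stats.kills", "stats.deaths", "stats.assists",
    "stats.visionScore", "timeline.creepsPerMinDeltas.0-10",
    "timeline.goldPerMinDeltas.0-10", "stats.timeCCingOthers",
]
generated_participant_fields = [
    f"participant{i}.{stat}" for stat in participant_stats for i in range(1, 11)
]

generated_fields = generated_team_fields + generated_participant_fields
print(len(generated_fields))
```

The generated list has 24 team fields and 80 participant fields, which matches the 104 columns reported by `info()`.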

In [4]:
root_games = pd.read_csv("../DATA/root_games.csv", usecols=merged_fields)
root_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1087 entries, 0 to 1086
Columns: 104 entries, participant1.teamId to teamId_200.riftHeraldKills
dtypes: bool(12), float64(20), int64(70), object(2)
memory usage: 794.1+ KB


# Manipulating the dataframe
Now we need to manipulate the dataframe. First, we remove the null values, which come from the timeline fields. These rows can be dropped because they belong to matches that ended before the 10-minute mark and, therefore, ended because one of the teams surrendered. We also need to prepare the dataset for the modeling process.

In [5]:
# Open the files as dataframes
tail_games1 = pd.read_csv("../DATA/tail_games1.csv", usecols=merged_fields)
tail_games2 = pd.read_csv("../DATA/tail_games2.csv", usecols=merged_fields)
tail_games3 = pd.read_csv("../DATA/tail_games3.csv", usecols=merged_fields)
tail_games4 = pd.read_csv("../DATA/tail_games4.csv", usecols=merged_fields)

# Merge all dataframes into one
tail_games = pd.concat([tail_games1, tail_games2, tail_games3, tail_games4])
tail_games.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16414 entries, 0 to 1413
Columns: 104 entries, participant1.teamId to teamId_200.riftHeraldKills
dtypes: bool(12), float64(20), int64(70), object(2)
memory usage: 11.8+ MB
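
The `info()` output above reports 16414 entries with index labels running only from 0 to 1413, which indicates that each file keeps its own row labels after concatenation, so labels repeat. If unique labels are wanted, `pd.concat` accepts `ignore_index=True`. The sketch below shows the difference on two made-up toy frames, not on the real files.

```python
import pandas as pd

# Two toy frames standing in for two of the tail_games files.
part_a = pd.DataFrame({"kills": [3, 5]})
part_b = pd.DataFrame({"kills": [2, 8]})

# Default behaviour: each frame keeps its own labels, so they repeat.
stacked = pd.concat([part_a, part_b])

# ignore_index=True renumbers the rows from 0 to n - 1.
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels do not affect the row-wise operations used in this notebook, but they can be surprising when selecting rows by label with `.loc`.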


## Dealing with null values

In [6]:
tail_games.dropna(inplace=True)

In [7]:
tail_games.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16272 entries, 0 to 1413
Columns: 104 entries, participant1.teamId to teamId_200.riftHeraldKills
dtypes: bool(12), float64(20), int64(70), object(2)
memory usage: 11.7+ MB
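
Comparing the two `info()` outputs, the row count goes from 16414 to 16272, so 142 rows were dropped. To confirm that nulls really are confined to the timeline columns, they can be counted per column before dropping. This is a minimal sketch on a made-up toy frame; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame: one regular stat column and one timeline column with a gap.
games = pd.DataFrame({
    "participant1.stats.kills": [4, 0, 7],
    "participant1.timeline.creepsPerMinDeltas.0-10": [6.1, np.nan, 5.8],
})

# Count nulls per column and keep only the columns that have any.
null_counts = games.isna().sum()
columns_with_nulls = null_counts[null_counts > 0].index.tolist()

# Count the rows that dropna() would remove.
rows_with_nulls = int(games.isna().any(axis=1).sum())

cleaned = games.dropna()
print(columns_with_nulls, rows_with_nulls, len(cleaned))
```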


## Converting boolean values to integer values
Some columns are stored as booleans (e.g., `teamId_100.firstBaron`). In addition, the match outcome is encoded as the strings "Win" or "Fail". So, we need to convert all of these values into dummy integers, that is, zeros and ones.

In [8]:
tail_games['teamId_100.win'] = tail_games['teamId_100.win'].map({'Win': 1, 'Fail': 0})
tail_games['teamId_200.win'] = tail_games['teamId_200.win'].map({'Win': 1, 'Fail': 0})
tail_games[['teamId_100.win', 'teamId_200.win']]

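|      | teamId_100.win | teamId_200.win |
|------|----------------|----------------|
| 0    | 1              | 0              |
| 1    | 0              | 1              |
| 2    | 1              | 0              |
| 3    | 0              | 1              |
| 4    | 0              | 1              |
| ...  | ...            | ...            |
| 1409 | 0              | 1              |
| 1410 | 1              | 0              |
| 1411 | 1              | 0              |
| 1412 | 0              | 1              |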
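
One property of `Series.map` with a dictionary is that any value missing from the dictionary becomes `NaN`. A quick check after mapping can catch unexpected outcome strings. The sketch below uses a made-up series in which "Remake" is a hypothetical unexpected value, not something observed in the dataset.

```python
import pandas as pd

# Toy outcome column; "Remake" is a made-up unexpected value.
outcome = pd.Series(["Win", "Fail", "Win", "Remake"])

# Values not present in the mapping are turned into NaN.
encoded = outcome.map({"Win": 1, "Fail": 0})
unmapped = int(encoded.isna().sum())

print(unmapped)
```

If `unmapped` is zero on the real columns, every outcome was either "Win" or "Fail".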


In [9]:
tail_games.replace([True, False], [1, 0], inplace=True)

In [10]:
tail_games.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16272 entries, 0 to 1413
Columns: 104 entries, participant1.teamId to teamId_200.riftHeraldKills
dtypes: float64(20), int64(84)
memory usage: 13.0 MB
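
An alternative to replacing `True` and `False` across the whole dataframe is to convert only the boolean columns. The sketch below shows this on a made-up toy frame; it is one possible approach under that assumption, not the method used above.

```python
import pandas as pd

# Toy frame with one boolean column and one integer column.
games = pd.DataFrame({
    "teamId_100.firstBaron": [True, False, True],
    "teamId_100.towerKills": [9, 2, 11],
})

# Select the boolean columns and cast only those to 64-bit integers.
bool_columns = games.select_dtypes(include="bool").columns
games = games.astype({column: "int64" for column in bool_columns})

print(games.dtypes)
```

Restricting the cast to boolean columns leaves the numeric columns untouched and makes the intent explicit.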


# Preparation for the final dataset

## Creating explainable features

In [11]:
blue_kills = [
    'participant1.stats.kills', 'participant2.stats.kills',
    'participant3.stats.kills', 'participant4.stats.kills',
    'participant5.stats.kills'
]
red_kills = [
    'participant6.stats.kills', 'participant7.stats.kills',
    'participant8.stats.kills', 'participant9.stats.kills',
    'participant10.stats.kills'
]

blue_deaths = [
    'participant1.stats.deaths', 'participant2.stats.deaths',
    'participant3.stats.deaths', 'participant4.stats.deaths',
    'participant5.stats.deaths'
]
red_deaths = [
    'participant6.stats.deaths', 'participant7.stats.deaths',
    'participant8.stats.deaths', 'participant9.stats.deaths',
    'participant10.stats.deaths'
]

blue_assists = [
    'participant1.stats.assists', 'participant2.stats.assists',
    'participant3.stats.assists', 'participant4.stats.assists',
    'participant5.stats.assists'
]
red_assists = [
    'participant6.stats.assists', 'participant7.stats.assists',
    'participant8.stats.assists', 'participant9.stats.assists',
    'participant10.stats.assists'
]

blue_vision_score = [
    'participant1.stats.visionScore', 'participant2.stats.visionScore',
    'participant3.stats.visionScore', 'participant4.stats.visionScore',
    'participant5.stats.visionScore'
]
red_vision_score = [
    'participant6.stats.visionScore', 'participant7.stats.visionScore',
    'participant8.stats.visionScore', 'participant9.stats.visionScore',
    'participant10.stats.visionScore'
]

blue_cs_perMin = [
    'participant1.timeline.creepsPerMinDeltas.0-10', 'participant2.timeline.creepsPerMinDeltas.0-10',
    'participant3.timeline.creepsPerMinDeltas.0-10', 'participant4.timeline.creepsPerMinDeltas.0-10',
    'participant5.timeline.creepsPerMinDeltas.0-10'
]
red_cs_perMin = [
    'participant6.timeline.creepsPerMinDeltas.0-10', 'participant7.timeline.creepsPerMinDeltas.0-10',
    'participant8.timeline.creepsPerMinDeltas.0-10', 'participant9.timeline.creepsPerMinDeltas.0-10',
    'participant10.timeline.creepsPerMinDeltas.0-10'
]

blue_gold_perMin = [
    'participant1.timeline.goldPerMinDeltas.0-10', 'participant2.timeline.goldPerMinDeltas.0-10',
    'participant3.timeline.goldPerMinDeltas.0-10', 'participant4.timeline.goldPerMinDeltas.0-10',
    'participant5.timeline.goldPerMinDeltas.0-10'
]
red_gold_perMin = [
    'participant6.timeline.goldPerMinDeltas.0-10', 'participant7.timeline.goldPerMinDeltas.0-10',
    'participant8.timeline.goldPerMinDeltas.0-10', 'participant9.timeline.goldPerMinDeltas.0-10',
    'participant10.timeline.goldPerMinDeltas.0-10'
]

blue_cc_time = [
    'participant1.stats.timeCCingOthers', 'participant2.stats.timeCCingOthers',
    'participant3.stats.timeCCingOthers', 'participant4.stats.timeCCingOthers',
    'participant5.stats.timeCCingOthers'
]
red_cc_time = [
    'participant6.stats.timeCCingOthers', 'participant7.stats.timeCCingOthers',
    'participant8.stats.timeCCingOthers', 'participant9.stats.timeCCingOthers',
    'participant10.stats.timeCCingOthers'
]
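
The lists above assume that participants 1 to 5 play on the blue team (100) and participants 6 to 10 on the red team (200). Since the `participantN.teamId` columns are still available at this point, that assumption can be checked. The sketch below runs the check on a made-up toy frame.

```python
import pandas as pd

# Toy frame: three matches with the assumed blue/red participant layout.
games = pd.DataFrame({
    f"participant{i}.teamId": [100 if i <= 5 else 200] * 3
    for i in range(1, 11)
})

blue_ids = [f"participant{i}.teamId" for i in range(1, 6)]
red_ids = [f"participant{i}.teamId" for i in range(6, 11)]

# True only if every match has team 100 in slots 1-5 and team 200 in 6-10.
blue_ok = bool((games[blue_ids] == 100).all().all())
red_ok = bool((games[red_ids] == 200).all().all())

print(blue_ok, red_ok)
```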

In [12]:
tail_games['teamId_100.kills'] = tail_games[blue_kills].sum(axis=1)
tail_games['teamId_200.kills'] = tail_games[red_kills].sum(axis=1)

tail_games['teamId_100.deaths'] = tail_games[blue_deaths].sum(axis=1)
tail_games['teamId_200.deaths'] = tail_games[red_deaths].sum(axis=1)

tail_games['teamId_100.assists'] = tail_games[blue_assists].sum(axis=1)
tail_games['teamId_200.assists'] = tail_games[red_assists].sum(axis=1)

tail_games['teamId_100.visionScore'] = tail_games[blue_vision_score].sum(axis=1)
tail_games['teamId_200.visionScore'] = tail_games[red_vision_score].sum(axis=1)

tail_games['teamId_100.csPerMin'] = tail_games[blue_cs_perMin].sum(axis=1)
tail_games['teamId_200.csPerMin'] = tail_games[red_cs_perMin].sum(axis=1)

tail_games['teamId_100.goldPerMin'] = tail_games[blue_gold_perMin].sum(axis=1)
tail_games['teamId_200.goldPerMin'] = tail_games[red_gold_perMin].sum(axis=1)

tail_games['teamId_100.crowdControlTime'] = tail_games[blue_cc_time].sum(axis=1)
tail_games['teamId_200.crowdControlTime'] = tail_games[red_cc_time].sum(axis=1)
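
The same team-level sums can be written as a loop over teams and statistics, which avoids repeating one line per feature. This is a minimal sketch on a made-up toy frame with two statistics; the `suffixes` and `teams` dictionaries are names introduced here for illustration.

```python
import pandas as pd

# New feature name -> column suffix of the per-participant statistic.
suffixes = {"kills": "stats.kills", "visionScore": "stats.visionScore"}

# Toy frame with two matches and made-up values for all ten participants.
games = pd.DataFrame({
    f"participant{i}.{suffix}": [i, 2 * i]
    for suffix in suffixes.values()
    for i in range(1, 11)
})

# Team prefix -> participant numbers on that team.
teams = {"teamId_100": range(1, 6), "teamId_200": range(6, 11)}

for team, participants in teams.items():
    for name, suffix in suffixes.items():
        columns = [f"participant{i}.{suffix}" for i in participants]
        # Row-wise sum over the five players of the team.
        games[f"{team}.{name}"] = games[columns].sum(axis=1)

print(games[["teamId_100.kills", "teamId_200.kills"]])
```

Note that for the per-minute columns a sum gives the team total per minute, not a per-player average.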

In [13]:
teamId = [
    'participant1.teamId', 'participant2.teamId', 'participant3.teamId',
    'participant4.teamId', 'participant5.teamId', 'participant6.teamId',
    'participant7.teamId', 'participant8.teamId', 'participant9.teamId',
    'participant10.teamId'
]
# Per-participant columns are no longer needed once the team totals exist
columns_to_drop = (
    blue_kills + red_kills + blue_deaths + red_deaths
    + blue_assists + red_assists + blue_vision_score + red_vision_score
    + blue_cs_perMin + red_cs_perMin + blue_gold_perMin + red_gold_perMin
    + blue_cc_time + red_cc_time + teamId
)
tail_games.drop(columns_to_drop, axis=1, inplace=True)

## Rename all features

In [14]:
new_column_names = tail_games.columns.str.replace("teamId_100", "blue", regex=True).str.replace("teamId_200", "red", regex=True)
tail_games.columns = new_column_names
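
The renaming can also be expressed with `DataFrame.rename`, which accepts a function applied to each column name and returns a new dataframe. The sketch below is one possible variant on a made-up toy frame; `rename_team` and `prefix_map` are hypothetical helpers introduced for illustration.

```python
import pandas as pd

# Toy frame with the original team prefixes.
games = pd.DataFrame({"teamId_100.win": [1], "teamId_200.win": [0]})

prefix_map = {"teamId_100": "blue", "teamId_200": "red"}


def rename_team(column: str) -> str:
    """Swap a known team prefix for its colour; leave other names as-is."""
    for old, new in prefix_map.items():
        if column.startswith(old + "."):
            return new + column[len(old):]
    return column


renamed = games.rename(columns=rename_team)
print(list(renamed.columns))
```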

In [15]:
tail_games.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16272 entries, 0 to 1413
Data columns (total 38 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   blue.win               16272 non-null  int64  
 1   blue.firstBlood        16272 non-null  int64  
 2   blue.firstTower        16272 non-null  int64  
 3   blue.firstInhibitor    16272 non-null  int64  
 4   blue.firstBaron        16272 non-null  int64  
 5   blue.firstDragon       16272 non-null  int64  
 6   blue.firstRiftHerald   16272 non-null  int64  
 7   blue.towerKills        16272 non-null  int64  
 8   blue.inhibitorKills    16272 non-null  int64  
 9   blue.baronKills        16272 non-null  int64  
 10  blue.dragonKills       16272 non-null  int64  
 11  blue.riftHeraldKills   16272 non-null  int64  
 12  red.win                16272 non-null  int64  
 13  red.firstBlood         16272 non-null  int64  
 14  red.firstTower         16272 non-null  int64  
 15  red

# Exporting as CSV file

In [16]:
tail_games.to_csv('../DATA/league_games.csv')
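
Whether the row index is written affects how the file reads back later. By default `to_csv` writes the index as an unnamed first column; passing `index=False` omits it. The sketch below shows both cases on a made-up toy frame, using an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Toy frame with a repeated index label, as happens after concatenation.
games = pd.DataFrame({"blue.win": [1, 0], "red.win": [0, 1]}, index=[0, 0])

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
games.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
games.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

When reading a file that was written with the index, `pd.read_csv(..., index_col=0)` restores it as the index instead of a data column.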