## League of Legends Churn Prediction [Extraction] - Ben Pacheco

In this notebook I call the Riot API endpoints to extract the JSON data I need for my churn prediction. Once I have extracted all the data I need, I will transform it into tabular form in the next notebook.

### Preface

#### Accessing the Data

To build the final master dataset for my analysis, I need to query multiple endpoints.

* From the league v4 endpoint I will get the ranked data.
* Using that ranked data I will then query for summoner data.
* From that summoner data I will take the player user ids (puuids).
* Using those puuids I will query for match ids, and finally use those match ids to query for match data.

Joining the ranked + match data should give me all the data that I need. Let's get started!

#### Limitations

Riot limits us to 100 requests per 2 minutes. You will notice throughout the notebook that I split the calls to several endpoints across multiple cells and batches in order to stay within that limit.

The request outputs and some prints have been omitted to keep the notebook quick to load.
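
As a complement to pausing between batches, a request helper can also react when the server reports that the limit was hit. The sketch below assumes the API answers with HTTP status 429 and, possibly, a `Retry-After` header given in seconds. The names `retry_after_seconds` and `get_with_retry` are hypothetical and not part of this notebook.

```python
import time


def retry_after_seconds(headers, default_wait=120.0):
    """Return the wait suggested by a Retry-After header, or default_wait."""
    try:
        return float(headers.get("Retry-After", default_wait))
    except ValueError:
        # Retry-After can also be an HTTP date; fall back to the default.
        return default_wait


def get_with_retry(send, max_attempts=3, default_wait=120.0, sleep=time.sleep):
    """Call send() until the response is not a 429 or the attempts run out.

    `send` is any zero-argument callable that returns an object with
    `status_code` and `headers`, for example
    lambda: requests.get(url, headers=headers).
    """
    response = send()
    for _ in range(max_attempts - 1):
        if response.status_code != 429:
            break
        sleep(retry_after_seconds(response.headers, default_wait))
        response = send()
    return response
```

The `sleep` argument is injectable only so that the helper can be exercised without actually waiting.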

In [4]:
#importing packages
import os
%load_ext dotenv
%dotenv
import json
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import schedule
import math

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [5]:
api_key = str(os.getenv("API_KEY"))

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Charset": "application/x-www-form-urlencoded; charset=UTF-8",
    "Origin": "https://developer.riotgames.com"}
#append all data to this list
league_list = []

In [None]:
list_of_tiers = ["IRON/I", "IRON/II", "IRON/III", "IRON/IV", "BRONZE/I",  "BRONZE/II", "BRONZE/III", "BRONZE/IV"]


for tier in list_of_tiers:
    for page in range(1, 11, 1):
        url = "https://na1.api.riotgames.com/lol/league/v4/entries/RANKED_SOLO_5x5/"+tier+"?page="+str(page)+"&api_key="+api_key
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            data_json = json.loads(req.text)
            league_list.extend(data_json)

In [None]:
list_of_tiers = ["SILVER/I", "SILVER/II", "SILVER/III", "SILVER/IV", "GOLD/I", "GOLD/II", "GOLD/III", "GOLD/IV"]


for tier in list_of_tiers:
    for page in range(1, 11, 1):
        url = "https://na1.api.riotgames.com/lol/league/v4/entries/RANKED_SOLO_5x5/"+tier+"?page="+str(page)+"&api_key="+api_key
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            data_json = json.loads(req.text)
            league_list.extend(data_json)

In [None]:
list_of_tiers = ["PLATINUM/I", "PLATINUM/II", "PLATINUM/III", "PLATINUM/IV",
                 "DIAMOND/I", "DIAMOND/II",  "DIAMOND/III", "DIAMOND/IV"]


for tier in list_of_tiers:
    for page in range(1, 11, 1):
        url = "https://na1.api.riotgames.com/lol/league/v4/entries/RANKED_SOLO_5x5/"+tier+"?page="+str(page)+"&api_key="+api_key
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            data_json = json.loads(req.text)
            league_list.extend(data_json)
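
The three cells above differ only in their tier list. Each one makes 8 x 10 = 80 requests, which keeps a single cell under the limit of 100. A minimal sketch of generating every tier, division and page combination in one place is shown here. `entry_urls` is a hypothetical helper, and the API key still has to be appended as in the cells above.

```python
from itertools import product

TIERS = ["IRON", "BRONZE", "SILVER", "GOLD", "PLATINUM", "DIAMOND"]
DIVISIONS = ["I", "II", "III", "IV"]
BASE = "https://na1.api.riotgames.com/lol/league/v4/entries/RANKED_SOLO_5x5"


def entry_urls(pages=range(1, 11)):
    """Yield one URL per (tier, division, page), without the API key."""
    for tier, division, page in product(TIERS, DIVISIONS, pages):
        yield f"{BASE}/{tier}/{division}?page={page}"


urls = list(entry_urls())
print(len(urls))  # 6 tiers * 4 divisions * 10 pages = 240
print(urls[0])
```

Since 240 requests exceed the limit, a single loop over `urls` would still need a pause between groups of requests.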

In [None]:
#master
url = "https://na1.api.riotgames.com/lol/league/v4/masterleagues/by-queue/RANKED_SOLO_5x5?api_key="+api_key
req = requests.get(url, headers=headers)
if req.status_code == 200:
    data_json = json.loads(req.text)
    for entry in data_json['entries']:
        entry['tier'] = data_json['tier']
        entry['leagueId'] = data_json['leagueId']
        entry['queue'] = data_json['queue']
        league_list.append(entry)

In [None]:
#grandmaster
url = "https://na1.api.riotgames.com/lol/league/v4/grandmasterleagues/by-queue/RANKED_SOLO_5x5?api_key="+api_key
req = requests.get(url, headers=headers)
if req.status_code == 200:
    data_json = json.loads(req.text)
    for entry in data_json['entries']:
        entry['tier'] = data_json['tier']
        entry['leagueId'] = data_json['leagueId']
        entry['queue'] = data_json['queue']
        league_list.append(entry)

In [None]:
#challenger
url = "https://na1.api.riotgames.com/lol/league/v4/challengerleagues/by-queue/RANKED_SOLO_5x5?api_key="+api_key
req = requests.get(url, headers=headers)
if req.status_code == 200:
    data_json = json.loads(req.text)
    for entry in data_json['entries']:
        entry['tier'] = data_json['tier']
        entry['leagueId'] = data_json['leagueId']
        entry['queue'] = data_json['queue']
        league_list.append(entry)
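
The master, grandmaster and challenger cells repeat the same flattening step: the league-level fields are copied onto every entry. A small sketch of that step as a reusable function follows. `flatten_league` is a hypothetical name and the sample payload is made up, using only the keys the cells above rely on.

```python
def flatten_league(data_json):
    """Copy tier, leagueId and queue from a league payload onto each entry."""
    shared = {key: data_json[key] for key in ("tier", "leagueId", "queue")}
    return [{**entry, **shared} for entry in data_json["entries"]]


# Made-up payload with the same shape the cells above expect.
sample = {
    "tier": "MASTER",
    "leagueId": "example-league-id",
    "queue": "RANKED_SOLO_5x5",
    "entries": [{"summonerName": "PlayerOne", "leaguePoints": 12}],
}
flattened = flatten_league(sample)
print(flattened)
```

With it, each of the three cells could reduce to `league_list.extend(flatten_league(data_json))`.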

In [None]:
import time

start_time = time.time()
with open('../data/raw/league_json.json', mode='a') as file:
    for league in league_list:
        league.pop("miniSeries", None)
        league.pop("entries", None)
        league.pop("name", None)
        try:
            json.dump(league, file)
            file.write('\n')
        except Exception as e:
            print("No league found: {} with error {}".format(str(league), str(e)))

end_time = time.time()
print(end_time - start_time)

In [6]:
league_dict_list = []
with open('../data/raw/league_json.json', encoding='utf-8') as json_file:
    for line in json_file:
        json_data = json.loads(line)
        league_dict_list.append(json_data)
league_data = pd.DataFrame(league_dict_list, columns = json_data.keys())

I've obtained my first dataset. Next, I need to loop through the summonerName column to query the summoner data.
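
The write-one-JSON-object-per-line and read-it-back pattern used above comes up again for the later datasets. A minimal sketch of both halves as helpers is shown below. The names `write_json_lines` and `read_json_lines` are hypothetical, and the records in the demo are made up.

```python
import json
import os
import tempfile


def write_json_lines(path, records, mode="a"):
    """Write one JSON object per line; mode="a" appends, mode="w" overwrites."""
    with open(path, mode=mode, encoding="utf-8") as file:
        for record in records:
            file.write(json.dumps(record) + "\n")


def read_json_lines(path):
    """Read a file with one JSON object per line, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as file:
        for line in file:
            if line.strip():
                records.append(json.loads(line))
    return records


# Round trip on a temporary file with made-up records.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.json")
    write_json_lines(demo_path, [{"summonerName": "PlayerOne", "wins": 3}])
    write_json_lines(demo_path, [{"summonerName": "PlayerTwo", "wins": 5}])
    loaded = read_json_lines(demo_path)
print(loaded)
```

Because the notebook opens its files with `mode='a'`, re-running a write cell appends the same rows again. Exposing `mode` makes it easy to overwrite instead when that is what is wanted.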

In [7]:
league_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53435 entries, 0 to 53434
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   summonerId    53435 non-null  object
 1   summonerName  53435 non-null  object
 2   leaguePoints  53435 non-null  int64 
 3   rank          53435 non-null  object
 4   wins          53435 non-null  int64 
 5   losses        53435 non-null  int64 
 6   veteran       53435 non-null  bool  
 7   inactive      53435 non-null  bool  
 8   freshBlood    53435 non-null  bool  
 9   hotStreak     53435 non-null  bool  
 10  tier          53435 non-null  object
 11  leagueId      53435 non-null  object
 12  queue         4235 non-null   object
dtypes: bool(4), int64(3), object(6)
memory usage: 3.9+ MB


In [8]:
league_data.head()

Unnamed: 0,summonerId,summonerName,leaguePoints,rank,wins,losses,veteran,inactive,freshBlood,hotStreak,tier,leagueId,queue
0,NNtfHnUGL5EBl6-BAbPZvPL9MXKwHH4nh28ny546ezLceL2u,Soundwave2017,0,I,3,11,False,False,False,False,IRON,a311bb83-f707-4834-a07f-54d3c9fd0178,
1,LIb_E8K_EyNMOXJcVjde50-dC-zvW5H-WxmpxBtA0Ex1PcoS,kalelover666,74,I,6,10,False,False,False,True,IRON,8e8a1f5e-0135-4dee-9886-57411a0bf631,
2,mMuKeCWaFDHLCBjKBTX02xql_RdrI_SWkD8lfvbuxCj5jIyB,Davidc409,100,I,33,32,False,False,False,False,IRON,605ac070-5ef0-4f60-a2a2-7648710a3e55,
3,1cs3ovWaXUtFrEz9ijGUXG5EdHshSJd9SIV64xTnbclfpWFU,brownknight420,53,I,5,6,False,False,False,False,IRON,27f966d1-476d-44ea-a78e-bf0e31b319ba,
4,WNu7KL4i1uHT_bZ_t6kwFi31SFsgcRiPi5JTLJt9YG2w-vFj,1st Lt LeBlanc,8,I,6,14,False,False,False,False,IRON,5069447f-305f-4b2a-ad36-9539ff7a097e,


I ran into an obstacle when querying the API. The whole process takes about 8 hours for the number of rows I was pulling in. I left the cell running overnight, but the Riot server kept disconnecting me. I had to implement a class that pauses and restarts the process so that it does not trigger the timeout.
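
The idea behind the class is: send up to 100 requests, then sleep for whatever remains of the 2 minute window before starting the next batch. A stripped-down sketch of that idea follows, with hypothetical helper names (`batches`, `run_batches`, `summoner_url`). It also URL-encodes the summoner name with `urllib.parse.quote`, which handles spaces as well as other special characters.

```python
import time
from urllib.parse import quote

SUMMONER_URL = "https://na1.api.riotgames.com/lol/summoner/v4/summoners/by-name/"


def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_batches(items, handle, size=100, interval=120.0,
                clock=time.monotonic, sleep=time.sleep):
    """Call handle(item) for every item, one batch per `interval` seconds."""
    all_batches = list(batches(items, size))
    for number, batch in enumerate(all_batches):
        started = clock()
        for item in batch:
            handle(item)
        if number < len(all_batches) - 1:
            # Sleep for the rest of the window; no pause after the last batch.
            sleep(max(0.0, interval - (clock() - started)))


def summoner_url(name):
    """Build the by-name URL (without the API key) for one summoner name."""
    return SUMMONER_URL + quote(name, safe="")


print(summoner_url("1st Lt LeBlanc"))
```

Here `handle` would be a function that requests one URL and appends the result to a list. `clock` and `sleep` are parameters only so the loop can be tried out without waiting.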

In [None]:
summoner_list = []
summoner_names = league_data['summonerName']

class Timedout:
    def __init__(self, list_of_names):
        self.list_of_names = list_of_names
        self.number_of_requests_per_limit = 100
        self.number_of_iterations = math.ceil(len(self.list_of_names)/ self.number_of_requests_per_limit)
        self.batch = 0
        self.interval_in_seconds = 120 #120 seconds is our limit here
        self.execution_time = 0

    def request_name(self):
        start_time = time.time()
        print(self.list_of_names)
        for i in range (0, self.number_of_requests_per_limit):
            # we don't want to access the index for the element out of list
            try:
                url = "https://na1.api.riotgames.com/lol/summoner/v4/summoners/by-name/"+self.list_of_names[i].replace(' ', '%20')+"?api_key="+api_key
                req = requests.get(url, headers=headers)
                if req.status_code == 200:
                    data_json = json.loads(req.text)
                    summoner_list.append(data_json)
            except (IndexError, KeyError) as error:
                # index i is past the end of the list, so looking the name up again would raise the same error
                print("Reached the end of the list at index", i)
                break
        if(len(self.list_of_names) > self.number_of_requests_per_limit):
            # slice list_of_names so the next time request_name is called it will call the next batch
            self.list_of_names = self.list_of_names[self.number_of_requests_per_limit:].reset_index(drop=True)
        self.execution_time = time.time() - start_time

    def my_schedule(self):
        while 1:
            if (self.batch >= self.number_of_iterations):
                break
            print("Batch number", self.batch)
            self.request_name()
            schedule.run_pending()
            sleep_time = self.interval_in_seconds - self.execution_time
            if (sleep_time < 0):
                sleep_time = 0
            print("Sleep for: ", sleep_time)
            time.sleep(sleep_time)
            self.batch += 1



myClass = Timedout(summoner_names)
myClass.my_schedule()

In [None]:
start_time = time.time()
with open('../data/raw/summoner_json.json', mode='a') as file:
    for summoner in summoner_list:
        try:
            json.dump(summoner, file)
            file.write('\n')
        except Exception as e:
            print("No summoner found: {} with error {}".format(str(summoner), str(e)))

end_time = time.time()
print(end_time - start_time)

In [9]:
summoner_dict_list = []
with open('../data/raw/summoner_json.json', encoding='utf-8') as json_file:
    for line in json_file:
        json_data = json.loads(line)
        summoner_dict_list.append(json_data)
summoner_data = pd.DataFrame(summoner_dict_list, columns = json_data.keys())

Querying the summoner data is complete! Now we need to use the puuids to get match data.

In [10]:
summoner_data.head()

Unnamed: 0,id,accountId,puuid,name,profileIconId,revisionDate,summonerLevel
0,NE4oDvvTCBGJIKhEDCpoDXo_vujt5xfiJDWdgopM8v4nlGiH,nLnSo8JWwox13L8E-02EBESRwT2k0MZY4rEsrylwJcE9FC...,qVOdaQQvh_u8SusqUUEPl2KvqehyNpsT-RkhUFO60Ss9fA...,Soundwave2017,5240,1652239392000,211
1,Fl9aoX72L1_vXqFbWLkMt2sjLPPFTJf01EOWG6eUfl8PXID9,GvRSKZcw57uUl1KWfg0DrubVZwqc14CR85W2Ye5jU9lhWB...,1j0dLBNv6C1Efi1uu6b4UBovbtqU1g3WcHFDO1MzU5nMei...,kalelover666,4902,1652161979000,240
2,fNpQis_en_lhQvYttAasaOsdq-y-ZS4TwtDfCsc-gQK4a-JG,vLM5Ex6Mi1oIuZ2vKlj0r_qm4JEYqKd0Qr3Qy3XFmmWB0a...,kGVq3XIa5GvuXZmQfNiPP9_Ijns6oahP8mIjxl_iZbgxcM...,Davidc409,4272,1652204358000,153
3,jA3Lwl4tcfhBI0gGvfn17opUtGsmzyzneJMUln-rCvjz7Jr3,4-8phKXR7ZGhmxdvO8pdKokPXJTnUWY-MFkQL29RIsPyTW...,D5E-826NdMh-w1pNJj7ZmTTHgpSIbYobQhng4QDT2I9Kx8...,brownknight420,5082,1652242178000,159
4,erasEE67tBk9w1o1iJMrzKtE_CfXlOFEDFpQHlde3N56G-NS,zJrkNxjxH_WqGpYwEN5s-QPx2hGK97fKV0e5UhNdZ2K54T...,IYSM3VZ46iG6wuI832EHoILn4qywJvcSMv0Exc0Yc3ZH5t...,1st Lt LeBlanc,1162,1645664109000,141


In [11]:
summoner_data.puuid.value_counts()

aSEgGIsEBsnZuZ5tn6gOjFtaKdwPJRt2LsDqKsgoMlOqQqT7CIOMDzcfJlrwTiqLlLmfUssJiL5GbA    2
FM2sYkSMZEp2zEjbN_4LCWmYjuKQqTZ3n7g_-53KWvsgQNNqvvXzsqL046jYBJIiVpIIH9NrMxgcJA    2
l2J6OT9vbv0P2SwVBgax7ivcsujWB0U5rQbYLraT8OPPACPCepdgqr3rNy-T9KYjLGa7Ydl0Q-r4dQ    2
o2zGARIyRajzc1OFpGLstEXFEGSz-tCcEeWt3JCU-JIcvyvScD-ckmW46BnEp-xmauHc8vD-aDh_MA    2
dof9apSv11Mm99isv5ASQJJKA9D06MD3DHIvgsLq1rxkB9pvVO-_NYhqdIPIFAZ3YdGjme9Nhncqug    2
                                                                                 ..
mRcza6dmI7XqUQJOwzhkM08t0T8n3_uS7c3wQdrzwaBC-zp0C4PikHJIrAPYgDrOqrdR2J3Rxi9wiQ    1
rBuRDV9FcPkvk5oOctfdoH1gBru_z7KCOZAXPR0Xv-zW8Wd6QktGocYNfETrB_x0VHxlIkyc7z8QsA    1
EOcljVwVtj--8NGiLDer-tmdpdcgJMeZyEliUsTKqKHwu5ZbVpGivgYn7bPBN_-RYD7BKUa3pAipsQ    1
_cVzazEXis0uY-RC9OCaX3oxiBGz5V0z4yDU5axrDirjC-Vna9N92tkkv-KYqa8qGoPBmmW96BqQ_g    1
sQYQFOr-ZV-4d3mGDNNrT6Wmh3-yyLYZUkHmYEqQ6oPKYlfFvZjCCDC1Ht_V9NBqYTR0hGfYROP70w    1
Name: puuid, Length: 50560, dtype: int64

Get the match ids per puuid.
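
The `value_counts` output above shows that some puuids appear twice, and requesting those twice would spend rate-limited calls on data we already have. A small sketch of dropping repeats before querying is shown below. `dedupe_by_key` is a hypothetical helper and the records are made up.

```python
def dedupe_by_key(records, key):
    """Keep the first record seen for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


# Made-up records; the first and third share a puuid.
rows = [
    {"puuid": "aaa", "name": "PlayerOne"},
    {"puuid": "bbb", "name": "PlayerTwo"},
    {"puuid": "aaa", "name": "PlayerOne"},
]
unique_rows = dedupe_by_key(rows, "puuid")
print(len(unique_rows))
```

On the DataFrame itself, `summoner_data.drop_duplicates(subset="puuid")` would do the same job.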

In [None]:
#get match ids per puuid
puuid_list = []
match_puuids = summoner_data['puuid']

class Timedout:
    def __init__(self, list_of_names):
        self.list_of_names = list_of_names
        self.number_of_requests_per_limit = 100
        self.number_of_iterations = math.ceil(len(self.list_of_names)/ self.number_of_requests_per_limit)
        self.batch = 0
        self.interval_in_seconds = 122 #the limit is 120 seconds; 2 extra seconds as a buffer
        self.excecution_time = 0

    def request_name(self):
        start_time = time.time()
        print(self.list_of_names)
        success_rate = 0 # number of successful requests in this batch
        for i in range (0, self.number_of_requests_per_limit):
            # we don't want to access the index for the element out of list
            try:
                url = "https://americas.api.riotgames.com/lol/match/v5/matches/by-puuid/"+self.list_of_names[i]+"/ids?type=ranked&start=0&count=100&api_key="+api_key
                req = requests.get(url, headers=headers)
                if req.status_code == 200:
                    success_rate += 1
                    data_json = json.loads(req.text)
                    puuid_list.append(data_json)
                time.sleep(1)
            except requests.exceptions.ConnectionError as error:
                # connection dropped: wait a minute, then move on to the next puuid
                print(error)
                time.sleep(60)
            except (IndexError, KeyError):
                # ran past the end of the list, so this batch is done
                break
        if(len(self.list_of_names) > self.number_of_requests_per_limit):
            # slice list_of_names so the next time request_name is called it will call the next batch
            self.list_of_names = self.list_of_names[self.number_of_requests_per_limit:].reset_index(drop=True)
        self.excecution_time = time.time() - start_time
        print(success_rate) # number of successful requests in this batch

    def my_schedule(self):
        while 1:
            if (self.batch >= self.number_of_iterations):
                break
            print("Batch number", self.batch)
            self.request_name()
            schedule.run_pending()
            sleep_time = self.interval_in_seconds - self.excecution_time
            if (sleep_time < 0):
                sleep_time = 0
            print("sleep for: ", sleep_time)
            time.sleep(sleep_time)
            self.batch += 1



myClass = Timedout(match_puuids)
myClass.my_schedule()

In [None]:
start_time = time.time()
with open('../data/raw/puuid_json.json', mode='a') as file:
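    # flatten the list of per-player match id lists without converting to NumPy arrays
    flat_list = [match_id for match_ids in puuid_list for match_id in match_ids]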
    for puuid in flat_list:
        try:
            json.dump(puuid, file)
            file.write('\n')
        except Exception as e:
            print("No puuid found: {} with error {}".format(str(puuid), str(e)))

end_time = time.time()
print(end_time - start_time)

All that's left is to query the match data with the given match ids.
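
A ranked match has ten participants, so the same match id can show up in the lists of several players. A short sketch of flattening the per-player lists and keeping each id once, in first-seen order, is given here. `unique_match_ids` is a hypothetical helper and the ids are made up.

```python
from itertools import chain


def unique_match_ids(lists_of_ids):
    """Flatten lists of match ids and drop repeats, keeping first-seen order."""
    return list(dict.fromkeys(chain.from_iterable(lists_of_ids)))


# Made-up ids: two players share the match "NA1_2".
per_player = [["NA1_1", "NA1_2"], [], ["NA1_2", "NA1_3"]]
match_ids = unique_match_ids(per_player)
print(match_ids)
```

Requesting each match only once means fewer rate-limited calls for the same data.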

In [None]:
#get matches per match id from flattened list
puuid_flattened_list = []
with open('../data/raw/puuid_json.json', encoding='utf-8') as json_file:
    for line in json_file:
        json_data = json.loads(line)
        puuid_flattened_list.append(json_data)
        
match_list = []

class Timedout:
    def __init__(self, list_of_names):
        self.list_of_names = list_of_names
        self.number_of_requests_per_limit = 100
        self.number_of_iterations = math.ceil(len(self.list_of_names)/ self.number_of_requests_per_limit)
        self.batch = 0
        self.interval_in_seconds = 180 #the limit is 120 seconds; wait 180 to leave a margin
        self.excecution_time = 0

    def request_name(self):
        start_time = time.time()
        print(self.list_of_names)
        success_rate = 0
        for i in range (0, self.number_of_requests_per_limit):
            # we don't want to access the index for the element out of list
            try:
                url = "https://americas.api.riotgames.com/lol/match/v5/matches/"+self.list_of_names[i]+"?api_key="+api_key
                req = requests.get(url, headers=headers)
                if req.status_code == 200:
                    success_rate += 1
                    data_json = json.loads(req.text)
                    match_list.append(data_json)
                time.sleep(1)
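            except requests.exceptions.ConnectionError as error:
                # connection dropped: wait a minute, then move on to the next match id
                print(error)
                time.sleep(60)
            except (IndexError, KeyError):
                # ran past the end of the list, so this batch is done
                break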
        if(len(self.list_of_names) > self.number_of_requests_per_limit):
            # slice list_of_names so the next time request_name is called it will call the next batch
            self.list_of_names = self.list_of_names[self.number_of_requests_per_limit:]
        self.excecution_time = time.time() - start_time
        print(success_rate) # success rate of this batch

    def my_schedule(self):
        while 1:
            if (self.batch >= self.number_of_iterations):
                break
            print("Batch number", self.batch)
            self.request_name()
            schedule.run_pending()
            sleep_time = self.interval_in_seconds - self.excecution_time
            if (sleep_time < 0):
                sleep_time = 0
            print("sleep for: ", sleep_time)
            time.sleep(sleep_time)
            self.batch += 1


myClass = Timedout(puuid_flattened_list)
myClass.my_schedule()

In [None]:
start_time = time.time()
with open('../data/raw/match_json.json', mode='a') as file:
    for match in match_list:
        try:
            json.dump(match, file)
            file.write('\n')
        except Exception as e:
            print("No match found: {} with error {}".format(str(match), str(e)))

end_time = time.time()
print(end_time - start_time)

We have all the JSON files we need to form the master dataset. Follow along in the next notebook for EDA and wrangling!