# Data Science for League of Legends Champion Analysis

## Part 1: Collecting Champion Data via Web Scraping

## Introduction:
Do you play League of Legends? Do you wonder why you lose games? Do you want to know which factors affect the win rate most? This project looks for answers in the data.
It first collects the data from the relevant URL via web scraping. The collected results are then stored as DataFrames and exported as CSV files. Machine learning models will be built in Part 2.

## Note:

Since the results from web scraping are very lengthy, most of them are not displayed.

## Objectives:

Here are the objectives that will be completed throughout Part 1.
1. [Collecting Data via web scraping](#1)<br>
2. [Selecting only the relevant part of the web scraping results](#2)<br>
3. [Storing data into DataFrames with appropriate table structures](#3)<br>
4. [Conclusions of Part 1](#4)<br>

Import the relevant modules for web scraping and data processing

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

### Collecting data via web scraping <a id="1"></a>

Get the correct URL and collect the data via the `requests.get()` function

**Note:** The results apply only to Patch 11.17

In [2]:
url = "https://u.gg/lol/one-for-all-tier-list"
data = requests.get(url).text


Create a `BeautifulSoup` object

In [3]:
soup = BeautifulSoup(data,"html5lib")  

Get the body part

In [4]:
x=soup.body

The data we want are inside a **script** tag, so we use the `find_all` function to narrow the results

In [5]:
x=soup.body
scripts=x.find_all("script")

In [6]:
lists = list(scripts[0]) # Convert the children of the first <script> tag to a list
string = str(lists[0]) # Convert the first child (the script text) to a string

### Selecting only the relevant part of the web scraping results <a id="2"></a>

In [7]:
left = [i for i in range(len(string)) if string.startswith("{", i)] # Indexes of every "{" in the string
right = [i for i in range(len(string)) if string.startswith("}", i)] # Indexes of every "}" in the string
relevant_data = string[left[1]:(right[-2]+1)] # Slice from the second "{" to the second-to-last "}" (inclusive)

At this point, all of the relevant data have been extracted. However, the data are still a string rather than a dictionary, which makes them hard to insert into a DataFrame.
Thus, we import `json` and convert the data from a string to a dictionary.

In [8]:
import json
data_dict = json.loads(relevant_data)
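
The brace-index slicing above depends on the exact layout of the script text. As a hedged alternative, the sketch below scans for the first position where a non-empty JSON object can be decoded, using `json.JSONDecoder.raw_decode` from the standard library. The helper name `extract_first_json_object` and the sample string are made up for illustration; the real page may need a different starting point, so treat this as a sketch rather than a drop-in replacement.

```python
import json


def extract_first_json_object(text):
    """Return the first non-empty JSON object that can be decoded from `text`."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and ignores what follows
            obj, _end = decoder.raw_decode(text, pos)
            if isinstance(obj, dict) and obj:
                return obj
        except json.JSONDecodeError:
            pass  # Not valid JSON at this brace; try the next one
        pos = text.find("{", pos + 1)
    raise ValueError("no JSON object found")


# Hypothetical stand-in for the script text; the real content is far longer
sample = 'window.__DATA__ = {"a": {"b": 1}, "c": [1, 2]}; window.other = 1;'
parsed = extract_first_json_object(sample)
```

Because the decoder stops at the end of the first complete object, trailing JavaScript after the closing brace does not need to be trimmed by hand.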

The results are now in dictionary form, but we still want to further refine them.

Create an empty DataFrame for the basic attributes of champions

In [9]:
hero_base_data_set=pd.DataFrame(columns=["Champion Name","Champion Title","Champion ID","Attack","Defense","Magic","Difficulty","Main Tag","Alternative Tag"])
hero_base_data_set

Unnamed: 0,Champion Name,Champion Title,Champion ID,Attack,Defense,Magic,Difficulty,Main Tag,Alternative Tag


* We now have a suitable structure for the web scraping results. We keep only the attributes listed as columns in the empty DataFrame above.
* Keep in mind that a champion can have 1 or 2 tags, so we use an `if` statement to handle both cases. If a champion has only 1 tag, its **Main Tag** and **Alternative Tag** are the same. Any other number of tags is recorded as "Unknown".
* This table describes the base data, so it is named **hero_base_data_set**.
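
The tag rule can be sketched as a small helper. The function name `split_tags` is hypothetical and not part of the notebook; it only mirrors the branching described above.

```python
def split_tags(tags):
    """Return (main_tag, alternative_tag) for a champion's list of tags."""
    if len(tags) == 1:
        # A single tag fills both columns
        return tags[0], tags[0]
    if len(tags) == 2:
        return tags[0], tags[1]
    # Any other number of tags is treated as unknown
    return "Unknown", "Unknown"


main_tag, alternative_tag = split_tags(["Mage"])
```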

### Storing data into DataFrames with appropriate table structures <a id="3"></a>

In [10]:
basedata=data_dict['https://static.u.gg/assets/lol/riot_static/11.16.1/data/en_US/champion.json']['data']
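
The key above embeds a version number (`11.16.1`), so the lookup breaks once the site moves to a newer one. A minimal sketch of a more tolerant lookup is to search the dictionary keys by suffix. The helper `find_key_ending_with` and the toy dictionary with `example.invalid` URLs are assumptions made for illustration only.

```python
def find_key_ending_with(mapping, suffix):
    """Return the single key in `mapping` that ends with `suffix`."""
    matches = [key for key in mapping if key.endswith(suffix)]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one key ending with {suffix!r}, found {len(matches)}")
    return matches[0]


# Hypothetical stand-in for data_dict; the real keys are the URLs used in this notebook
toy_data_dict = {
    "https://example.invalid/assets/11.16.1/data/en_US/champion.json": {"data": {}},
    "https://example.invalid/stats/11_17/overall/1.4.0.json": {"data": {}},
}
champion_key = find_key_ending_with(toy_data_dict, "/champion.json")
```

Raising an error when zero or several keys match keeps a silent wrong choice from slipping through.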

In [11]:
rows = []  # Collect one dictionary per champion, then build the DataFrame once
for item in basedata:
    champion = basedata[item]
    info = champion['info']
    tags = champion['tags']
    if len(tags)==1:
        tag1 = tag2 = tags[0]  # A single tag fills both columns
    elif len(tags)==2:
        tag1, tag2 = tags
    else:
        tag1 = tag2 = "Unknown"

    rows.append({"Champion Name":champion['name'],"Champion Title":champion['title'],"Champion ID":champion['key'],
                 "Attack":info['attack'],"Defense":info['defense'],"Magic":info['magic'],
                 "Difficulty":info['difficulty'],"Main Tag":tag1,"Alternative Tag":tag2})

# DataFrame.append was removed in pandas 2.0, so the rows are passed to the constructor instead
hero_base_data_set = pd.DataFrame(rows, columns=hero_base_data_set.columns)
hero_base_data_set

Unnamed: 0,Champion Name,Champion Title,Champion ID,Attack,Defense,Magic,Difficulty,Main Tag,Alternative Tag
0,Annie,the Dark Child,1,2,3,10,6,Mage,Mage
1,Olaf,the Berserker,2,9,5,3,3,Fighter,Tank
2,Galio,the Colossus,3,1,10,6,5,Tank,Mage
3,Twisted Fate,the Card Master,4,6,2,6,9,Mage,Mage
4,Xin Zhao,the Seneschal of Demacia,5,8,6,3,2,Fighter,Assassin
...,...,...,...,...,...,...,...,...,...
151,Pyke,the Bloodharbor Ripper,555,9,3,1,7,Support,Assassin
152,Yone,the Unforgotten,777,8,4,4,8,Assassin,Fighter
153,Sett,the Boss,875,8,5,1,2,Fighter,Tank
154,Lillia,the Bashful Bloom,876,0,2,10,8,Fighter,Mage


Import the project token and save the DataFrame above as a CSV file

In [12]:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='55eb1fc7-84bf-450f-ace7-ed8423edb869', project_access_token='p-cf7918da44037827f4061ca5d48b70f5e7b32452')
pc = project.project_context


In [13]:
project.save_data(data=hero_base_data_set.to_csv(index=True),file_name='Hero Base Data with Index.csv',overwrite=True)

{'file_name': 'Hero Base Data with Index.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'lolbackup-donotdelete-pr-taxcphl3f5hkzi',
 'asset_id': 'e4dd7fe1-9f15-435a-89e7-fa093ab5517e'}
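
`project_lib` is specific to the hosted platform this notebook ran on. Outside that platform, `DataFrame.to_csv` can write the same content directly. The sketch below also shows what `index=True` does to the file: the index is written as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back without `index_col`. The toy frame is an assumption made for illustration and uses two rows from the table above.

```python
import io

import pandas as pd

# Toy frame with two rows taken from the base table
toy = pd.DataFrame({"Champion Name": ["Annie", "Olaf"], "Champion ID": ["1", "2"]})

# With no path argument, to_csv returns the CSV text as a string
csv_with_index = toy.to_csv(index=True)
csv_without_index = toy.to_csv(index=False)

# Reading the text back shows how the index column is interpreted
reloaded_with_index = pd.read_csv(io.StringIO(csv_with_index))
reloaded_without_index = pd.read_csv(io.StringIO(csv_without_index))
reloaded_index_col = pd.read_csv(io.StringIO(csv_with_index), index_col=0)
```

Passing a file name such as `toy.to_csv("Hero Base Data.csv", index=False)` writes to disk instead of returning a string.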

Next, we turn to the win and loss data for champions. We first examine the structure of the results.
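
Printing a whole record can be very long. As a hedged alternative for examining the structure, the sketch below lists keys and value types down to a chosen depth. The helper `describe` is hypothetical, and the record is an abridged, illustrative version of one win-rate entry.

```python
def describe(obj, max_depth=2, depth=0):
    """Return indented 'key: type' lines for nested dictionaries and lists."""
    lines = []
    if isinstance(obj, dict) and depth < max_depth:
        for key, value in obj.items():
            lines.append("  " * depth + f"{key}: {type(value).__name__}")
            lines.extend(describe(value, max_depth, depth + 1))
    elif isinstance(obj, list) and obj and depth < max_depth:
        # Only the first element is inspected, assuming the list is homogeneous
        lines.append("  " * depth + f"[0]: {type(obj[0]).__name__} (of {len(obj)})")
        lines.extend(describe(obj[0], max_depth, depth + 1))
    return lines


# Abridged record for illustration
record = {
    "win_rate": 68.11926605504587,
    "champion_id": "80",
    "champion_link": {"champion_id": "80", "role": "adc"},
    "worst_against": {"bad_against": [{"champion_id": 27, "wins": 3}]},
}
summary = describe(record)
print("\n".join(summary))
```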

In [14]:
win_loss_data=data_dict['https://stats2.u.gg/lol/1.1/champion_ranking/world/11_17/one_for_all/overall/1.4.0.json']['data']['win_rates']

In [15]:
win_loss_data2=data_dict['https://stats2.u.gg/lol/1.1/champion_ranking/world/11_17/one_for_all/overall/1.4.0.json']['data']['win_rates']['adc']

In [16]:
win_loss_data2[2]

{'win_rate': 68.11926605504587,
 'pick_rate': 1.5324054548010684,
 'ban_rate': 3.251089554337129,
 'avg_damage': 18.085357798165138,
 'avg_kda': 2.7512096774193546,
 'avg_cs': 70.68348623853211,
 'avg_gold': 12691.720183486239,
 'champion_id': '80',
 'role': 'adc',
 'champion_link': {'champion_id': '80', 'role': 'adc'},
 'worst_against': {'bad_against': [{'champion_id': 27,
    'wins': 3,
    'matches': 3,
    'win_rate': 0,
    'opp_win_rate': 100},
   {'champion_id': 74,
    'wins': 3,
    'matches': 3,
    'win_rate': 0,
    'opp_win_rate': 100},
   {'champion_id': 31,
    'wins': 5,
    'matches': 6,
    'win_rate': 16.666666666666664,
    'opp_win_rate': 83.33333333333334},
   {'champion_id': 16,
    'wins': 2,
    'matches': 3,
    'win_rate': 33.333333333333336,
    'opp_win_rate': 66.66666666666666},
   {'champion_id': 90,
    'wins': 2,
    'matches': 3,
    'win_rate': 33.333333333333336,
    'opp_win_rate': 66.66666666666666},
   {'champion_id': 86,
    'wins': 2,
    'match

* After checking the data structure, we select the win and loss attributes of champions, which are listed below.
* Another empty DataFrame is created for these attributes.

In [17]:
hero_win_loss_data = pd.DataFrame(columns=["Champion ID","Pick Rate","Ban Rate","Average Damage","Average KDA","Average Gold","Overall Win Rate"])
hero_win_loss_data

Unnamed: 0,Champion ID,Pick Rate,Ban Rate,Average Damage,Average KDA,Average Gold,Overall Win Rate


In [18]:
rows = []  # Collect one dictionary per (role, champion) entry
for role in win_loss_data:
    for item in win_loss_data[role]:
        rows.append({"Champion ID":item['champion_id'],"Pick Rate":item['pick_rate'],"Ban Rate":item['ban_rate'],
                     "Average Damage":item['avg_damage'],"Average KDA":item['avg_kda'],
                     "Average Gold":item['avg_gold'],"Overall Win Rate":item['win_rate']})

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
hero_win_loss_data = pd.DataFrame(rows, columns=hero_win_loss_data.columns)
hero_win_loss_data

Unnamed: 0,Champion ID,Pick Rate,Ban Rate,Average Damage,Average KDA,Average Gold,Overall Win Rate
0,5,0.604527,1.672993,16.206645,2.411706,12538.750000,72.093023
1,24,1.001687,3.286236,15.558144,2.121140,12160.592982,69.122807
2,80,1.532405,3.251090,18.085358,2.751210,12691.720183,68.119266
3,82,2.312667,7.451146,15.602570,1.962002,12141.948328,67.173252
4,85,1.054407,1.490229,18.676103,2.523585,13244.830000,64.666667
...,...,...,...,...,...,...,...
1535,60,0.193308,1.455082,15.384000,1.655367,11225.000000,23.636364
1536,60,0.193308,1.455082,13.060436,1.553719,11040.236364,23.636364
1537,60,0.193308,1.455082,16.349545,1.567493,12062.745455,23.636364
1538,60,0.193308,1.455082,12.402218,1.485255,11060.090909,23.636364


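Since the loop visits every role, the DataFrame above contains many rows with the same **Champion ID**. We drop these duplicates using **Champion ID** as the criterion and keep the first occurrence. Note that rows sharing a **Champion ID** can differ in their averages, so only the values from the first occurrence are retained.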

In [19]:
hero_win_loss_data_refined = hero_win_loss_data.drop_duplicates(subset=['Champion ID'],keep="first")
hero_win_loss_data_refined 

Unnamed: 0,Champion ID,Pick Rate,Ban Rate,Average Damage,Average KDA,Average Gold,Overall Win Rate
0,5,0.604527,1.672993,16.206645,2.411706,12538.750000,72.093023
1,24,1.001687,3.286236,15.558144,2.121140,12160.592982,69.122807
2,80,1.532405,3.251090,18.085358,2.751210,12691.720183,68.119266
3,82,2.312667,7.451146,15.602570,1.962002,12141.948328,67.173252
4,85,1.054407,1.490229,18.676103,2.523585,13244.830000,64.666667
...,...,...,...,...,...,...,...
149,526,0.298749,0.653733,14.648400,2.451485,11345.223529,29.411765
150,79,1.233657,1.117672,17.852595,1.839076,12676.501425,26.495726
151,40,0.390131,0.579924,14.124847,2.005690,11622.387387,26.126126
152,8,0.499086,2.207226,15.598352,1.324521,11062.147887,25.352113
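
To make the effect of `keep="first"` concrete, here is a small sketch on three rows taken from the table above (two for Champion ID 60, one for Champion ID 5). The second computation, a per-champion mean via `groupby`, is not what this notebook does; it is shown only as an assumed alternative for comparison.

```python
import pandas as pd

rows = pd.DataFrame(
    {
        "Champion ID": ["60", "60", "5"],
        "Average Damage": [15.384, 13.060436, 16.206645],
        "Overall Win Rate": [23.636364, 23.636364, 72.093023],
    }
)

# Keep the first row seen for each Champion ID and discard the rest
first_only = rows.drop_duplicates(subset=["Champion ID"], keep="first")

# Alternative (assumption, not used in this notebook): average the rows per champion
per_champion_mean = rows.groupby("Champion ID", as_index=False, sort=False).mean()
```

With `keep="first"`, the Average Damage for Champion ID 60 stays at 15.384, while the mean would blend both rows.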


In [20]:
project.save_data(data=hero_win_loss_data_refined.to_csv(index=True),file_name='Hero Win Loss Data Refined.csv',overwrite=True)

{'file_name': 'Hero Win Loss Data Refined.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'lolbackup-donotdelete-pr-taxcphl3f5hkzi',
 'asset_id': 'c1e172ca-fdd8-4dd6-9a86-58b4ca31a247'}

We want to combine the 2 tables above. Here are the steps.
* Another table called **all_hero_data** is created by copying the data from the **hero_base_data_set** table.
* All columns except **Champion ID** from the **hero_win_loss_data_refined** table are added to the **all_hero_data** table.

In [21]:
all_hero_data = hero_base_data_set.copy()
all_hero_data[['Pick Rate','Ban Rate','Average Damage','Average KDA','Average Gold','Overall Win Rate']]=np.nan # Insert empty values to additional columns
all_hero_data

Unnamed: 0,Champion Name,Champion Title,Champion ID,Attack,Defense,Magic,Difficulty,Main Tag,Alternative Tag,Pick Rate,Ban Rate,Average Damage,Average KDA,Average Gold,Overall Win Rate
0,Annie,the Dark Child,1,2,3,10,6,Mage,Mage,,,,,,
1,Olaf,the Berserker,2,9,5,3,3,Fighter,Tank,,,,,,
2,Galio,the Colossus,3,1,10,6,5,Tank,Mage,,,,,,
3,Twisted Fate,the Card Master,4,6,2,6,9,Mage,Mage,,,,,,
4,Xin Zhao,the Seneschal of Demacia,5,8,6,3,2,Fighter,Assassin,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
151,Pyke,the Bloodharbor Ripper,555,9,3,1,7,Support,Assassin,,,,,,
152,Yone,the Unforgotten,777,8,4,4,8,Assassin,Fighter,,,,,,
153,Sett,the Boss,875,8,5,1,2,Fighter,Tank,,,,,,
154,Lillia,the Bashful Bloom,876,0,2,10,8,Fighter,Mage,,,,,,


Values are copied into **all_hero_data** by matching **Champion ID** between the two tables **hero_base_data_set** and **hero_win_loss_data_refined**.

In [22]:
for i in range(1,hero_win_loss_data_refined.shape[1]): # Skip column 0 (Champion ID)
    for j in range(0,all_hero_data.shape[0]):
        for k in range(0,hero_win_loss_data_refined.shape[0]):
            # Use positional access (.iloc) on both sides so the lookup does not depend on index labels
            if all_hero_data['Champion ID'].iloc[j]==hero_win_loss_data_refined['Champion ID'].iloc[k]:
                all_hero_data.iloc[j,i+8]=hero_win_loss_data_refined.iloc[k,i] # Statistic columns start at position 9
all_hero_data

Unnamed: 0,Champion Name,Champion Title,Champion ID,Attack,Defense,Magic,Difficulty,Main Tag,Alternative Tag,Pick Rate,Ban Rate,Average Damage,Average KDA,Average Gold,Overall Win Rate
0,Annie,the Dark Child,1,2,3,10,6,Mage,Mage,0.987628,5.672712,20.622160,2.146867,13331.708185,58.362989
1,Olaf,the Berserker,2,9,5,3,3,Fighter,Tank,0.309293,0.797835,15.141841,1.804788,11725.920455,54.545455
2,Galio,the Colossus,3,1,10,6,5,Tank,Mage,0.727541,1.230142,15.852937,2.349731,11926.589372,45.893720
3,Twisted Fate,the Card Master,4,6,2,6,9,Mage,Mage,0.868129,0.499086,15.523543,1.607903,12372.056680,40.485830
4,Xin Zhao,the Seneschal of Demacia,5,8,6,3,2,Fighter,Assassin,0.604527,1.672993,16.206645,2.411706,12538.750000,72.093023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
151,Pyke,the Bloodharbor Ripper,555,9,3,1,7,Support,Assassin,2.664136,8.132996,15.466116,2.679753,15424.076517,38.390501
152,Yone,the Unforgotten,777,8,4,4,8,Assassin,Fighter,3.567412,18.624350,15.389760,1.596215,12084.491626,61.970443
153,Sett,the Boss,875,8,5,1,2,Fighter,Tank,2.871503,11.053704,17.243182,2.057845,12418.867809,62.668299
154,Lillia,the Bashful Bloom,876,0,2,10,8,Fighter,Mage,1.690567,3.187825,17.974692,2.415511,12841.209979,63.617464
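
The same matching on **Champion ID** can also be expressed with `DataFrame.merge`. The sketch below is an alternative to the nested loops, not the method used in this notebook. It uses a toy subset of the tables; Olaf's statistics are deliberately left out of the toy data to show how unmatched rows end up as `NaN`.

```python
import pandas as pd

# Toy subset of the base table
base = pd.DataFrame(
    {"Champion Name": ["Annie", "Olaf", "Xin Zhao"], "Champion ID": ["1", "2", "5"]}
)

# Toy subset of the refined win/loss table, in a different row order
stats = pd.DataFrame(
    {
        "Champion ID": ["5", "1"],
        "Pick Rate": [0.604527, 0.987628],
        "Overall Win Rate": [72.093023, 58.362989],
    }
)

# A left join keeps every base row and fills missing statistics with NaN
merged = base.merge(stats, on="Champion ID", how="left")
```

Both **Champion ID** columns need the same dtype (strings here) for the keys to match.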


### Conclusions of Part 1 <a id="4"></a>

Great. This is the final dataset we want. It contains the essential attributes of all champions in the game. The dataset is saved as a CSV file. In Part 2, the CSV file is read and machine learning models are built.

In [23]:
project.save_data(data=all_hero_data.to_csv(index=True),file_name='All Hero Data.csv',overwrite=True)

{'file_name': 'All Hero Data.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'lolbackup-donotdelete-pr-taxcphl3f5hkzi',
 'asset_id': '401fced6-64db-4d17-8c94-77e100df0e1e'}

## Thanks for reading