<a href="https://colab.research.google.com/github/gabrielcordeiro2/LOL-Champion-Analysis/blob/main/Scrapping_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Important:** if you run into any errors:
- go to Runtime > Restart runtime, or just press Ctrl + M followed by (.)
- then try running all cells again

**Important:** you need to download the **files folder** from this repository and upload it to Colab:
- https://github.com/gabrielcordeiro2/LOL-Champion-Analysis
- or you can download it directly from the command line:

In [1]:
!apt install unzip -q
!wget https://tinyurl.com/3e6vnh2j -q
!unzip 3e6vnh2j

Reading package lists...
Building dependency tree...
Reading state information...
unzip is already the newest version (6.0-21ubuntu1.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 62 not upgraded.
Archive:  3e6vnh2j
   creating: files/
 extracting: files/__init__.py       
  inflating: files/lol_genders.py    
  inflating: files/lol_names.py      
  inflating: files/lol_regions.py    
  inflating: files/wiki_names.py     


Import necessary files and libs:

In [2]:
import requests
import pandas as pd
import re
from bs4 import BeautifulSoup
from files.lol_regions import regions
from files.lol_genders import genders
from files.lol_names import correction_roles
from files.wiki_names import wiki_tags

### __Champion Voices:__

Get champions list from API:

In [3]:
version = "12.12.1"
response = requests.get(f"http://ddragon.leagueoflegends.com/cdn/{version}/data/en_US/champion.json").json()
champions_all_info = list(response['data'].items())
champions = list(response['data'].keys())

Apply search corrections to the names:

In [4]:
lol_champions = list(map(wiki_tags.get, champions, champions))
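
The call above relies on `dict.get(key, default)`: passing `champions` to `map` twice supplies each name as both the key and the fallback, so names missing from `wiki_tags` pass through unchanged. A minimal sketch of that behavior, using a hypothetical two-entry stand-in for `wiki_tags` (the real mapping lives in `files/wiki_names.py` and may differ):

```python
# Hypothetical stand-in for files.wiki_names.wiki_tags (illustrative entries only).
sample_tags = {"MonkeyKing": "Wukong", "Nunu": "Nunu_&_Willump"}
sample_champions = ["Aatrox", "MonkeyKing", "Nunu"]

# map(f, a, b) calls f(a[i], b[i]); here that is sample_tags.get(name, name).
corrected = list(map(sample_tags.get, sample_champions, sample_champions))

# Equivalent, more explicit form of the same lookup-with-fallback.
corrected_explicit = [sample_tags.get(name, name) for name in sample_champions]

print(corrected)
```

The list comprehension is arguably easier to read, but both forms produce the same list.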

In [5]:
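def remove_tag_voices(v_line): # <i>"text 123"</i> --> ('text 123', True)
    tag_removed = str(v_line)[3:-4] # Strip the surrounding <i> and </i>
    # Quoted lines are spoken; the length check guards against empty tags such as <i></i>
    if len(tag_removed) >= 2 and tag_removed.startswith('"') and tag_removed.endswith('"'):
        return (tag_removed[1:-1], True)
    else:
        return (tag_removed, False)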
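
To see what the slicing does, here is a small sketch that restates the same logic on plain strings. The function name `strip_italic_tag` and the sample markup are made up for illustration; the notebook itself passes BeautifulSoup tags and converts them with `str()`.

```python
def strip_italic_tag(markup):
    """Illustrative restatement of the slicing logic, operating on a plain string."""
    inner = markup[3:-4]  # drop the leading "<i>" (3 chars) and trailing "</i>" (4 chars)
    quoted = len(inner) >= 2 and inner.startswith('"') and inner.endswith('"')
    # Quoted text is treated as a spoken line; anything else is kept as-is.
    return (inner[1:-1], True) if quoted else (inner, False)

spoken = strip_italic_tag('<i>"Sample spoken line."</i>')
effect = strip_italic_tag('<i>Sample sound effect.</i>')
empty = strip_italic_tag('<i></i>')

print(spoken, effect, empty)
```

Note that the slice assumes the tag is exactly `<i>...</i>` with no attributes; a tag such as `<i class="x">` would leave part of the markup behind.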

In [6]:
def scrap_voices(dataframe, champ_name):
    champ = requests.get(f"https://leagueoflegends.fandom.com/wiki/{champ_name}/LoL/Audio")
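    if champ.status_code == 404:
        print(champ_name + " " + str(champ.status_code))
        return dataframe # Skip champions whose audio page is missing
    # Cut everything from the Trivia heading onward so only voice lines remain
    no_trivia = re.sub(r'<span class="mw-headline" id="Trivia">.*', '', champ.text, flags=re.DOTALL).strip()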
    soup = BeautifulSoup(no_trivia, 'html.parser')
    unsorted_lines = list(soup.find('div', class_="mw-parser-output").find_all('i'))
    
    sorted_lines = list(map(remove_tag_voices, unsorted_lines))
    formatted_lines = pd.DataFrame([[champ_name,line[0],line[1]] for line in sorted_lines],
                                   columns=['champion', 'voice_line', 'is_spoken'])
    dataframe = pd.concat([dataframe,formatted_lines])
    return dataframe
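
The Trivia-stripping step can be tried on its own. The sketch below uses a hypothetical, heavily simplified page (not real wiki markup) to show how `re.DOTALL` lets `.*` run across newlines and delete everything from the Trivia heading to the end of the document.

```python
import re

# Hypothetical, simplified markup standing in for a champion audio page.
sample_html = (
    '<div class="mw-parser-output">'
    '<i>"A voice line."</i>\n'
    '<h2><span class="mw-headline" id="Trivia">Trivia</span></h2>\n'
    '<i>Not a voice line.</i>'
    '</div>'
)

# With re.DOTALL, "." also matches newlines, so the match extends to the end of the text.
no_trivia = re.sub(
    r'<span class="mw-headline" id="Trivia">.*', '', sample_html, flags=re.DOTALL
).strip()

print(no_trivia)
```

The truncated markup is left with unclosed tags, which a lenient parser such as `html.parser` generally tolerates.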

Run multi-page scraping (150+ pages):

In [7]:
unfiltered_voices = pd.DataFrame(columns=['champion', 'voice_line', 'is_spoken'])
for champion in lol_champions:
    unfiltered_voices = scrap_voices(unfiltered_voices, champion)
unfiltered_voices.to_csv("files/unfiltered_voices.csv", index=False, encoding='utf-8')

In [None]:
unfiltered_voices

Create a function to clean HTML tags:

In [9]:
CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def cleanhtml(row):
    cleantext = re.sub(CLEANR, '', row)
    return cleantext
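
A quick check of what this pattern removes, using made-up sample strings. It deletes tags and entities outright rather than decoding them, so `&amp;` disappears instead of becoming `&`. If decoding were wanted instead, the standard library's `html.unescape` is one option (shown only for comparison; the notebook does not use it).

```python
import html
import re

CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

samples = [
    'Hello <b>there</b>!',
    'Fish &amp; chips',
    'Quote&#39;s &#x27;here&#x27;',
]

# Tags and entities are replaced by the empty string.
cleaned = [CLEANR.sub('', s) for s in samples]

# For comparison: decoding the entity instead of deleting it.
decoded = html.unescape('Fish &amp; chips')

print(cleaned)
print(decoded)
```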

Organize and drop audio files:

In [10]:
voice_df = pd.read_csv("files/unfiltered_voices.csv") # read_csv already returns a DataFrame
voice_df = voice_df.dropna() # Remove empty rows
voice_df.drop(voice_df[voice_df.voice_line.str.endswith(".ogg")].index, inplace = True) # Remove audios
voice_df.drop_duplicates(subset ="voice_line", keep = "first", inplace = True)
voice_df.voice_line = voice_df.voice_line.map(cleanhtml) # Remove Html tags
voice_df.sort_values(by="champion",ignore_index=True, inplace=True)
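
The same cleaning steps can be followed on a toy frame. The rows below are invented placeholders, and the sketch uses reassignment instead of `inplace=True`, which is a stylistic choice rather than what the notebook does.

```python
import pandas as pd

# Invented placeholder rows, not scraped data.
toy = pd.DataFrame({
    "champion": ["Zed", "Ahri", "Ahri", "Zed", "Ahri"],
    "voice_line": ["first line", "clip.ogg", "second line", None, "second line"],
    "is_spoken": [True, False, True, False, True],
})

toy = toy.dropna()                                    # remove empty rows
toy = toy[~toy["voice_line"].str.endswith(".ogg")]    # remove audio file names
toy = toy.drop_duplicates(subset="voice_line", keep="first")
toy = toy.sort_values(by="champion", ignore_index=True)

print(toy)
```

Because duplicates are dropped on `voice_line` alone, a line shared by two champions is kept only for whichever appears first.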

In [None]:
voice_df

### __Champion gender:__

Set main dataframe:

In [24]:
lol_df = voice_df.copy() # Independent copy, so columns added later stay out of voice_df

Add gender to respective champions:

In [30]:
for k,v in genders.items():
    lol_df.loc[lol_df["champion"] == k, "gender"] = v
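
An alternative to looping over the dictionary is `Series.map`, which looks up every champion in one pass. This is a sketch with a hypothetical two-entry stand-in for `genders` (the real mapping is in `files/lol_genders.py`); champions missing from the mapping end up as `NaN`, the same as with the loop.

```python
import pandas as pd

# Hypothetical stand-in for files.lol_genders.genders.
sample_genders = {"Ahri": "Female", "Zed": "Male"}

toy = pd.DataFrame({"champion": ["Ahri", "Zed", "Ahri", "Blitzcrank"]})

# Each champion name is looked up in the dict; unknown names become NaN.
toy["gender"] = toy["champion"].map(sample_genders)

print(toy)
```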

In [None]:
voice_df

In [None]:
lol_df

### __Champion roles:__

Create list from API keys:

In [33]:
champions_keys = []
for info_tuple in champions_all_info:
    champions_keys.append(info_tuple[0])

In [None]:
champions_keys

Store information about champions:

In [35]:
champions_tuple = tuple(champions_all_info)

Create list with champions:

In [36]:
list_champions = []
for i in champions_tuple:
    list_champions.append(i[1]["name"])

Create list with champions and tags:

In [37]:
list_tags = []
for k,c in zip(champions_keys, list_champions):
    tag_champion = response['data'][k]['tags']
    try:
        list_tags.append([c, f"{tag_champion[0]}, {tag_champion[1]}"])
    except IndexError: # Champion has only one tag
        list_tags.append([c, f"{tag_champion[0]}"])
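
The "first two tags, or just one" rule can also be written without exception handling by slicing the list before joining. A minimal sketch; `format_tags` is a hypothetical helper name, not part of the notebook.

```python
def format_tags(tags):
    """Join at most the first two tags with ', ' (hypothetical helper)."""
    # Slicing never raises, even when the list has fewer than two items.
    return ", ".join(tags[:2])

two_tags = format_tags(["Fighter", "Tank"])
one_tag = format_tags(["Mage"])

print(two_tags)
print(one_tag)
```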

In [None]:
list_tags

Create Dataframe:

In [39]:
tags_df = pd.DataFrame(list_tags, columns=["champion", "role"])

Apply name correction to the dataframe:

In [40]:
for key, value in correction_roles.items():
    lol_df.loc[lol_df["champion"] == key, "champion"] = value

Merge Dataframes:

In [41]:
lol_df = lol_df.merge(tags_df)
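
`merge` with no arguments performs an inner join on the shared `champion` column, so any champion whose name differs between the two frames is silently dropped. One way to check for that, sketched on invented rows, is a left merge with `indicator=True`:

```python
import pandas as pd

# Invented rows; "Nunu" vs "Nunu & Willump" simulates a naming mismatch.
left = pd.DataFrame({"champion": ["Ahri", "Wukong", "Nunu"],
                     "voice_line": ["a", "b", "c"]})
right = pd.DataFrame({"champion": ["Ahri", "Wukong", "Nunu & Willump"],
                      "role": ["role A", "role B", "role C"]})

inner = left.merge(right)  # default: how="inner" on the common column

# A left merge with indicator=True adds a "_merge" column that flags unmatched rows.
check = left.merge(right, how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "champion"].tolist()

print(len(inner), unmatched)
```

An empty `unmatched` list would suggest the name corrections cover every champion.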

In [None]:
lol_df

### __Champion Stats with Selenium:__

**Note:**
- you need to install Selenium and Kora to proceed.
- Selenium will emulate a hidden browser in your Colab.

In [None]:
!pip install requests -q
!pip install kora -q
!pip install selenium -q

In [45]:
from kora.selenium import wd
from selenium.webdriver.common.by import By

In [46]:
url = "https://na.op.gg/statistics/champions?hl=en_US&region=global"

wd.get(url)
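wd.implicitly_wait(7) # Element lookups below retry for up to 7 seconds
# Looking up row 118 blocks until the stats table has loaded that far
wd.find_element(By.XPATH, '//*[@id="content-container"]/div[2]/table/tbody/tr[118]/td[2]')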

html = wd.page_source
wd.close()

Transform html table into dataframe:

In [47]:
stats_df = pd.read_html(html)
stats_df = stats_df[1]

In [None]:
stats_df

Clean and organize column names:

In [50]:
stats_df.drop(["#", "CS", "Gold", "Games played", "KDA"], inplace=True, axis=1, errors='ignore')
stats_df.rename(
    inplace=True,
    columns={"Champion": "champion",
             "Win rate": "win_rate",
             "Pick ratio per game": "pick_rate",
             "Ban ratio per game": "ban_rate"})

stats_df.sort_values("champion", inplace=True, axis=0, ignore_index=True)

Merge Dataframes:

In [51]:
lol_df = lol_df.merge(stats_df, sort=True)

In [None]:
stats_df

In [None]:
lol_df

### __Champion Region:__

In [56]:
def get_json_and_scrap(reg):
    url = f'https://universe-meeps.leagueoflegends.com/v1/en_gb/factions/{reg}/index.json'
    response = requests.get(url).json()

    region_name = response['faction']['name']
    region_members = response['associated-champions']

    for i in region_members:
        champ_name = i['title']
        champs_with_region.append([champ_name, region_name])
    return
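
The extraction logic can be exercised offline against a small payload. The JSON below is a hypothetical, trimmed-down stand-in that keeps only the keys the function reads (`faction.name` and `associated-champions[].title`); the real response contains more fields. This variant returns the pairs instead of appending to a global list, which is a design choice of the sketch.

```python
import json

# Hypothetical, trimmed payload containing only the keys used above.
sample_payload = json.loads("""
{
  "faction": {"name": "Ionia"},
  "associated-champions": [{"title": "Ahri"}, {"title": "Zed"}]
}
""")

def extract_region_pairs(payload):
    """Return [champion, region] pairs from one faction payload (hypothetical helper)."""
    region_name = payload["faction"]["name"]
    return [[champ["title"], region_name] for champ in payload["associated-champions"]]

pairs = extract_region_pairs(sample_payload)
print(pairs)
```

Returning a list makes the function easier to test, and the caller can collect results with `champs_with_region.extend(...)`.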

Run multi-page scraping:

In [57]:
champs_with_region = []
for region in regions:  #   ~8 seconds
    get_json_and_scrap(region)

In [None]:
champs_with_region

Create Dataframe:

In [59]:
scrap_region_df = pd.DataFrame(champs_with_region, columns=["champion","region"])

In [None]:
scrap_region_df

Create a function to organize dataframes:

In [61]:
def drop_and_sort_rows(dframe):
    dframe.drop_duplicates(subset="champion", keep="first", inplace=True)
    dframe.sort_values(by="champion",ignore_index=True, inplace=True)
    return
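
Because both operations use `inplace=True`, the function changes the frame it receives and returns `None`. A short sketch on invented rows makes that visible:

```python
import pandas as pd

def drop_and_sort_rows(dframe):
    # Same logic as above: keep the first row per champion, then sort by name.
    dframe.drop_duplicates(subset="champion", keep="first", inplace=True)
    dframe.sort_values(by="champion", ignore_index=True, inplace=True)

# Invented rows; "Zed" appears twice with different regions.
toy = pd.DataFrame({"champion": ["Zed", "Ahri", "Zed"],
                    "region": ["Ionia", "Ionia", "Runeterra"]})

result = drop_and_sort_rows(toy)  # mutates toy; nothing is returned

print(result)
print(toy)
```

Since only the first row per champion survives, row order before the call determines which region is kept.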

Organize and apply name correction:

In [62]:
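# Normalize curly apostrophes; assign back instead of relying on inplace on a column
scrap_region_df["champion"] = scrap_region_df["champion"].str.replace("’", "'", regex=False)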
drop_and_sort_rows(scrap_region_df)

Create template with organized champions:

In [63]:
main_champions = pd.DataFrame({"champion":lol_df.champion.unique()})
drop_and_sort_rows(main_champions)

Merge data to region Dataframe:

In [64]:
full_region_df = main_champions.merge(scrap_region_df, how="left", sort=True)
drop_and_sort_rows(full_region_df)

Add Runeterra for champions without region:

In [65]:
full_region_df.loc[full_region_df.region.isnull(), "region"] = "Runeterra" # Single .loc call avoids chained assignment
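
The left merge and the default region work together: champions absent from the scraped region list come out of the merge with `NaN`, which is then replaced. A minimal sketch on invented rows:

```python
import pandas as pd

# Invented rows; "Ryze" has no entry in the scraped region list.
main = pd.DataFrame({"champion": ["Ahri", "Ryze", "Zed"]})
regions_found = pd.DataFrame({"champion": ["Ahri", "Zed"],
                              "region": ["Ionia", "Ionia"]})

# how="left" keeps every champion from `main`, filling missing regions with NaN.
full = main.merge(regions_found, how="left", sort=True)

# Assign the default region to the rows that had no match.
full.loc[full["region"].isnull(), "region"] = "Runeterra"

print(full)
```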

Merge and organize dataframes:

In [66]:
lol_df = lol_df.merge(full_region_df, sort=True)

lol_info_df = lol_df.drop(["voice_line", "is_spoken"], axis=1)
drop_and_sort_rows(lol_info_df)

In [None]:
# Voices Dataframe:
voice_df

In [None]:
# Complete Dataframe with voices:
lol_df

In [None]:
# Complete dataframe without voices:
lol_info_df