# Scraping Data

This notebook will cover the web-scraping and data collection part of the project. I am getting the data from [Tim Sevenhuysen's](https://twitter.com/TimSevenhuysen) website: [Oracle's Elixir](https://oracleselixir.com/). I will scrape and store all player and match statistics from his website into sqlite3 databases. 

## League of Legends team and player statistics
Since most of the data is in HTML tables, I will first use a parser that makes these tables easier to extract. I have borrowed some code from [Scott Rome](http://srome.github.io/Parsing-HTML-Tables-in-Python-with-BeautifulSoup-and-pandas/), who wrote an excellent and robust script for extracting tables from HTML pages. We want the statistics for each region in its own database. This is easy to do since the links are consistently formatted: for example, all European links contain "/eu/" in the URL. I keep a list of such patterns, one per region, as `to_scrape`. I will create an HTML parser, give it the URL and a pattern, and it will collect all the tables matching that pattern and put them in a sqlite3 database. Simple!

In [1]:
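import os
from scrape_utils import HTMLTableParser

# Page to scrape for player statistics
url = "https://oracleselixir.com/statistics/player-stats/"
# One pattern per region; e.g. European links contain "/eu/" in the URL
to_scrape = ["na","eu","lck","lms","lpl","international","cblol","tcl"]

parser = HTMLTableParser()
# Work inside the databases folder so each .db file is created there
directory = os.path.join(".", "databases")
os.chdir(directory)
for region in to_scrape:
    db = region + ".db"
    parser.scrape_data(url, region, db)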

Creating database na.db and inserting tables...
Inserted 213 tables into na.db
Creating database eu.db and inserting tables...
Inserted 99 tables into eu.db
Creating database lck.db and inserting tables...
Inserted 71 tables into lck.db
Creating database lms.db and inserting tables...
Inserted 44 tables into lms.db
Creating database lpl.db and inserting tables...
Inserted 30 tables into lpl.db
Creating database international.db and inserting tables...
Inserted 67 tables into international.db
Creating database cblol.db and inserting tables...
Inserted 30 tables into cblol.db
Creating database tcl.db and inserting tables...
Inserted 15 tables into tcl.db
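
The pattern matching described above can be illustrated with a minimal sketch. This is not the code in `scrape_utils`; it only shows, using the standard library, how links containing a region pattern such as "/eu/" might be picked out of a page. The class `LinkCollector`, the function `links_for_region`, and the `example.com` URLs are all hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose URL contains a given pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.pattern in href:
            self.links.append(href)


def links_for_region(html, region):
    """Return every link in `html` whose URL contains '/<region>/'."""
    collector = LinkCollector("/" + region + "/")
    collector.feed(html)
    return collector.links


# Hypothetical page content; the real site layout may differ
sample_html = """
<a href="https://example.com/stats/eu/2018-spring/">EU 2018 Spring</a>
<a href="https://example.com/stats/na/2018-spring/">NA 2018 Spring</a>
<a href="https://example.com/stats/eu/2017-summer/">EU 2017 Summer</a>
"""

eu_links = links_for_region(sample_html, "eu")
print(eu_links)
```

Wrapping the region in slashes keeps a short pattern like "na" from matching unrelated parts of a URL, such as the "na" in "international".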


Let's also get the data dictionaries for the tables; I'm not sure what some of the columns mean so it's important to have some documentation.

In [3]:
import sqlite3
url="http://oracleselixir.com/definitions/"
conn = sqlite3.connect("data_dictionary.db")
hp = HTMLTableParser()
table = hp.parse_url(url)[0][1]
table.columns = ["Variable","Description"]
name="player_team_stats_dictionary"
print(table.head())
table.to_sql(name,con=conn,if_exists='fail')

  Variable                                        Description
0       GP                                       Games Played
1        W                                               Wins
2        L                                             Losses
3      AGT  Average Game Time (sometimes also called “G Len”)
4       P%  Percentage of games champion was picked in the...
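
Once the dictionary is stored, looking up a column's meaning is a single query. The sketch below assumes the table and column names used above (`player_team_stats_dictionary`, `Variable`, `Description`) and uses an in-memory database filled by hand with three rows from the printed output, so it runs without the real `data_dictionary.db`. The helper `describe` is hypothetical.

```python
import sqlite3

# In-memory stand-in for data_dictionary.db, filled with rows from the output above
dict_conn = sqlite3.connect(":memory:")
dict_conn.execute(
    "CREATE TABLE player_team_stats_dictionary (Variable TEXT, Description TEXT)"
)
dict_conn.executemany(
    "INSERT INTO player_team_stats_dictionary VALUES (?, ?)",
    [("GP", "Games Played"), ("W", "Wins"), ("L", "Losses")],
)


def describe(conn, variable):
    """Return the description stored for a variable, or None if it is missing."""
    row = conn.execute(
        "SELECT Description FROM player_team_stats_dictionary WHERE Variable = ?",
        (variable,),
    ).fetchone()
    return row[0] if row else None


gp_description = describe(dict_conn, "GP")
print(gp_description)
```

Note that a table written with `to_sql` and default arguments also carries an extra `index` column, which this hand-built stand-in omits.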


## Match data

Now that we have all the team and player statistics, let's put the match data into SQL databases. I had to download the match data and convert it to tab-separated text files in Excel beforehand. I will just use pandas and sqlite3 to create the SQL tables.

In [10]:
import sqlite3, os, glob
import pandas as pd
directory = "..\\matchdata"
conn = sqlite3.connect("match_data.db")
cur = conn.cursor()
os.chdir(directory)
for ii in glob.glob("*.txt"):
    df = pd.read_csv(ii,sep="\t",encoding='latin')
    name=ii.split("OraclesElixir")[0].replace("a-","a")
    print(name)
    try:
        df.to_sql(name,con=conn,if_exists='fail')
    except ValueError:
        # if_exists='fail' raises ValueError when the table already exists; skip it
        continue

2016-complete-match-data
2017matchdata
2018-spring-match-data
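
A quick sanity check after loading is to list the tables in the database with their row counts. The sketch below is one possible way to do this with plain `sqlite3`; the helper `table_row_counts` is hypothetical, and the in-memory database with two hand-made tables stands in for `match_data.db`. Because these table names start with a digit and may contain hyphens, they have to be double-quoted in SQL.

```python
import sqlite3


def table_row_counts(conn):
    """Map each table name in the database to its number of rows."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Quote the identifier: names like 2018-spring-match-data are not bare-word safe
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# In-memory stand-in for match_data.db with made-up rows
match_conn = sqlite3.connect(":memory:")
match_conn.execute('CREATE TABLE "2017matchdata" (gameid TEXT, league TEXT)')
match_conn.execute('CREATE TABLE "2018-spring-match-data" (gameid TEXT, league TEXT)')
match_conn.executemany(
    'INSERT INTO "2017matchdata" VALUES (?, ?)',
    [("1", "NALCS"), ("2", "EULCS")],
)
match_conn.commit()

match_counts = table_row_counts(match_conn)
print(match_counts)
```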


pandas emitted some warnings about mixed data types while reading these files, but we will take care of those later on! For now, let's get the data dictionary for the match data:

In [11]:
directory = "..\\databases"
os.chdir(directory)
url="http://oracleselixir.com/match-data/match-data-dictionary/"
conn = sqlite3.connect("data_dictionary.db")
hp = HTMLTableParser()
table = hp.parse_url(url)[0][1]
table.columns = ["Variable","Description"]
table = table.drop(0,axis=0)
name="matches_data_dictionary"
print(table.head())
table.to_sql(name,con=conn,if_exists='fail')

  Variable                                        Description
1   gameid         Unique game identifier from Riot’s server.
2      url                                 Match history link
3   league                                             League
4    split  Time period covered, denoted by year and suffi...
5     week  Within-split week and day (“week within season...


And we're done!