## Web Scraping

The following code scrapes data from the Premier League website. It collects 42 stats (some of which may need to be discarded) for each team from the 2006-2007 season through 2017-2018, which comes to roughly 504 web requests (one per stat per season). These stats will serve as the features of a dataset that will be fed into an ANN and an RNN.
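As a quick check of the request count quoted above, the arithmetic can be spelled out. This is only a sketch: the figure of 42 stats and the season range come from the paragraph, while the variable names are illustrative.

```python
# One request per stat per season, as described above.
N_STATS = 42  # stats collected per team (figure quoted in the text)

# Seasons 2006-2007 through 2017-2018, labelled the same way as in this notebook.
seasons = [f"{year}-{year + 1}" for year in range(2006, 2018)]

total_requests = N_STATS * len(seasons)
print(len(seasons), total_requests)
```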

In [None]:
pip install requests

In [4]:
pip install progressbar

Defaulting to user installation because normal site-packages is not writeable
Collecting progressbar
  Downloading progressbar-2.5.tar.gz (10 kB)
Building wheels for collected packages: progressbar
  Building wheel for progressbar (setup.py) ... done
  Created wheel for progressbar: filename=progressbar-2.5-py3-none-any.whl size=12082 sha256=26c4f550b0a8b87b352d93e57289cb9cb25ae04fdc9c742a1a2476302f02defd
  Stored in directory: /Users/ali/Library/Caches/pip/wheels/d7/d9/89/a3f31c76ff6d51dc3b1575628f59afe59e4ceae3f2748cd7ad
Successfully built progressbar
Installing collected packages: progressbar
Successfully installed progressbar-2.5
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install bs4

Defaulting to user installation because normal site-packages is not writeable
Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)
Collecting soupsieve>1.2
  Downloading soupsieve-2.5-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.12.3 bs4-0.0.2 soupsieve-2.5
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [26]:
pip install --upgrade pip

Defaulting to user installation because normal site-packages is not writeable
Collecting pip
  Downloading pip-24.0-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
Successfully installed pip-24.0
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [27]:
pip install --upgrade pandas

Defaulting to user installation because normal site-packages is not writeable
Collecting pandas
  Downloading pandas-2.2.1-cp39-cp39-macosx_11_0_arm64.whl.metadata (19 kB)
Downloading pandas-2.2.1-cp39-cp39-macosx_11_0_arm64.whl (11.3 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.2.0
    Uninstalling pandas-2.2.0:
      Successfully uninstalled pandas-2.2.0
Successfully installed pandas-2.2.1
Note: you may need to restart the kernel to use updated packages.


In [28]:
import os
import json
import requests
import progressbar
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

## Page

In [29]:
webpage = requests.get('https://www.premierleague.com/stats/top/clubs/wins?se=578')
soup = BeautifulSoup(webpage.text, 'html.parser')

# Create the output folders; exist_ok avoids a failure when 'files' exists but a subfolder doesn't
os.makedirs('files/stats', exist_ok=True)
os.makedirs('files/results', exist_ok=True)

## Links

**select** is used instead of find_all since it accepts CSS selectors <br>

------

**General** <br>
*index 0-5* <br>
wins, losses, goals, yellow cards, red cards, substitutions on

**Attack** <br>
*index 6-15* <br>
shots, shots on target, hit woodwork, goals from header, goals from penalty, goals from free kick, goals from inside box, goals from outside box, goals from counter attack, offsides

**Defence** <br>
*index 16-29* <br>
clean sheets, goals conceded, saves, blocks, interceptions, tackles, last man tackles, clearances, headed clearances, caught opponent offside, own goals, penalties conceded, goals conceded from penalty, fouls

**Team Play** <br>
*index 30-35* <br>
passes, through balls, long passes, backwards passes, crosses, corners taken

**Others** <br>
*index 36-42* <br>
*non-duplicate attributes from top, i.e. those that don't appear in more* <br>
touches, big chances missed, clearances off line, dispossessed, penalties saved, high claims, punches

In [8]:
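def attributes(links):
    # keep only the last path segment of each href, e.g. '.../wins' -> 'wins'
    return [link[link.rfind('/')+1:] for link in links]

def uniques(links):
    # drop duplicates while preserving the original order
    seen = []
    for link in links:
        if link not in seen:
            seen.append(link)
    return seen

# hrefs from the headline stats and from the 'more stats' menu
top = [link['href'] for link in soup.select('a.topStatsLink')]
more = [link['href'] for link in soup.select('nav.moreStatsMenu a')]
links = uniques(attributes(more) + attributes(top))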
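The two helpers can be tried without a network call. The sketch below uses hypothetical hrefs (shaped like the site's stat links, not scraped output) and shows an equivalent order-preserving de-duplication with `dict.fromkeys`; `attribute_name` and the `sample_*` names are illustrative only.

```python
# Hypothetical hrefs shaped like the site's stat links; not scraped output.
sample_more = ['/stats/top/clubs/wins', '/stats/top/clubs/losses', '/stats/top/clubs/goals']
sample_top = ['/stats/top/clubs/wins', '/stats/top/clubs/touches']

def attribute_name(href):
    """Return the last path segment of a link, e.g. '.../wins' -> 'wins'."""
    return href.rsplit('/', 1)[-1]

# dict.fromkeys keeps the first occurrence of each key in insertion order,
# so this removes duplicates without reordering the remaining entries.
sample_links = list(dict.fromkeys(attribute_name(h) for h in sample_more + sample_top))
print(sample_links)
```

Because the `more` links come first, only the attributes unique to `top` end up at the end of the list, which is what the index ranges listed above rely on.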

## Dates

Data is collected from the 2006/2007 season onwards, since detailed and consistent stats are only available from then.

Dates (i.e. seasons and their corresponding ids) cannot be scraped since they don't appear in the HTML, so I'll manually extract them from the HTML that Chrome Dev Tools displays. Note that this HTML often differs from the plain HTML returned by cURL or by downloading the page, since the browser executes JavaScript that changes the document.

In [9]:
dates = {'2006-2007':15, '2007-2008':16, '2008-2009':17, '2009-2010':18, 
         '2010-2011':19, '2011-2012':20, '2012-2013':21, '2013-2014':22, 
         '2014-2015':27, '2015-2016':42, '2016-2017':54, '2017-2018':79,
         '2018-2019':210, '2019-2020':274, '2020-2021':363, '2021-2022':418,
         '2022-2023':489, '2023-2024':578}
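To illustrate how a season id is used, here is a minimal sketch that builds the stats page URL for one attribute and one season. The URL pattern is the one that appears elsewhere in this notebook (`.../stats/top/clubs/<attribute>?se=<id>`); `season_url` and `sample_dates` are hypothetical names introduced for the example.

```python
from urllib.parse import urlencode

def season_url(attribute, season_id):
    """Build the stats page URL for one attribute and one season id."""
    base = 'https://www.premierleague.com/stats/top/clubs/' + attribute
    return base + '?' + urlencode({'se': season_id})

# Two entries copied from the mapping above.
sample_dates = {'2017-2018': 79, '2023-2024': 578}
print(season_url('wins', sample_dates['2023-2024']))
```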

## Data

### Stats

The data can't be scraped from the webpage itself because the site uses AJAX to update the table that holds the stats. When you visit the page for a specific attribute (one of the links), the HTML contains the data for 'all seasons'. Choosing a particular season updates the table via AJAX, so the page source never reflects the new values and only ever shows the data for 'all seasons'.

To work around this, open the Network tab of Chrome Dev Tools and filter the resources by XHR (i.e. XML and JSON documents from XMLHttpRequests), which are the responses to AJAX requests. The last document listed contains the data from the AJAX request and is what we need. However, we can't simply send a GET request to that JSON document's URL, because it's an API that checks where the request comes from; a plain request results in a 403 Forbidden response.

To get past that last hurdle, we need to include the headers (particularly the general and request headers) that the Premier League website sends to the API, so that our request looks like one coming from the Premier League site. The required headers can be found in the same place in Chrome Dev Tools: click on the JSON document/response needed, then open the Headers tab in the panel to the right.

https://www.agenty.com/docs/how-to/18/how-to-crawl-an-infinite-scrolling-ajax-website <br>
https://www.codementor.io/codementorteam/how-to-scrape-an-ajax-website-using-python-qw8fuitvi <br>
https://www.quora.com/What-are-the-best-ways-to-scrape-the-AJAX-driven-websites <br>
https://stackoverflow.com/questions/44080707/web-scraping-a-strange-html-setup-with-python-beautifulsoup-urllib
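The shape of such a request can be inspected without sending anything. The sketch below uses the standard library to assemble the same endpoint, query parameters and `Origin` header that the scraping cell in this notebook passes to `requests.get`; `build_stats_request` is a hypothetical helper, and whether the API still accepts this header is not something the sketch verifies.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_stats_request(attribute, season_id):
    """Assemble (but do not send) a request for one ranked team stat."""
    api = 'https://footballapi.pulselive.com/football/stats/ranked/teams/' + attribute
    params = {'page': 0, 'pageSize': 20, 'compSeasons': season_id, 'comps': 1, 'altIds': 'true'}
    # The Origin header makes the request look like it comes from the site itself.
    headers = {'Origin': 'https://www.premierleague.com'}
    return Request(api + '?' + urlencode(params), headers=headers)

req = build_stats_request('wins', 79)
print(req.full_url)
print(req.get_header('Origin'))
```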

-----

One thing I noticed: if an attribute is NaN for all teams, that stat wasn't collected that season. If it's NaN for only some teams, the value for those teams is meant to be 0. For example, Burnley had 0 red cards in 2017-2018 according to their stats page (https://www.premierleague.com/clubs/43/Burnley/stats?se=79), but the page the data is sourced from (https://www.premierleague.com/stats/top/clubs/total_red_card?se=79) leaves them out of the table because their value is 0.
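The rule can be sketched in plain Python, independent of pandas: drop a stat that is missing for every team, and treat a value missing for only some teams as 0. The team names, the stat `hypothetical_stat` and all values below are invented for illustration.

```python
# Invented values for three teams; None marks a team missing from a stat's table.
table = {
    'total_red_card': {'Team A': 2, 'Team B': None, 'Team C': 4},
    'hypothetical_stat': {'Team A': None, 'Team B': None, 'Team C': None},
}

cleaned = {}
for stat, by_team in table.items():
    if all(value is None for value in by_team.values()):
        continue  # missing for every team: the stat wasn't collected, so drop it
    # missing for only some teams: those teams simply had a value of 0
    cleaned[stat] = {team: (0 if value is None else value) for team, value in by_team.items()}

print(cleaned)
```

In the notebook the same two steps are done with `dropna(axis=1, how='all')` followed by `fillna(0)`.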

In [10]:
for date, season_id in dates.items():

    df = pd.DataFrame()
    bar = progressbar.ProgressBar(maxval=len(links), widgets=[date + '\t', progressbar.Bar('-', '[', ']'), ' ', progressbar.Percentage()])
    bar.start()
    for i, attribute in enumerate(links):

        # setup
        api = 'https://footballapi.pulselive.com/football/stats/ranked/teams/' + attribute
        headers = {'Origin': 'https://www.premierleague.com'}
        params = {'page': '0', 'pageSize': '20', 'compSeasons': season_id, 'comps': '1', 'altIds': 'true'}

        # request (fail early on e.g. a 403 instead of trying to parse an error page)
        response = requests.get(api, params=params, headers=headers)
        response.raise_for_status()
        data = response.json()

        # parse: one value per team for this attribute
        teams = []
        values = []
        for team in data['stats']['content']:
            teams.append(team['owner']['name'])
            values.append(team['value'])
        series = pd.Series(values, index=teams, dtype=float, name=attribute)
        if df.index.empty:
            df = pd.DataFrame(series)
        else:
            df = df.join(series)

        # progress
        bar.update(i+1)

    bar.finish()
    # a stat missing for every team wasn't collected; one missing for some teams means 0
    df = df.dropna(axis=1, how='all').fillna(0)
    df.to_csv('files/stats/' + date + '.csv')

2006-2007	[                                                              ]   0%

2006-2007	[--------------------------------------------------------------] 100%
2007-2008	[--------------------------------------------------------------] 100%
2008-2009	[--------------------------------------------------------------] 100%
2009-2010	[--------------------------------------------------------------] 100%
2010-2011	[--------------------------------------------------------------] 100%
2011-2012	[--------------------------------------------------------------] 100%
2012-2013	[--------------------------------------------------------------] 100%
2013-2014	[--------------------------------------------------------------] 100%
2014-2015	[--------------------------------------------------------------] 100%
2015-2016	[--------------------------------------------------------------] 100%
2016-2017	[--------------------------------------------------------------] 100%
2017-2018	[--------------------------------------------------------------] 100%
2018-2019	[-----------------------------
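The parsing step of the loop can be checked offline. The payload below is hypothetical: it contains only the keys the loop reads (`stats`, `content`, `owner`, `name`, `value`), and the team names and values are invented, so it says nothing about what else the real API returns.

```python
import json

# Hypothetical payload with only the keys the scraping loop reads.
sample_response = '''
{"stats": {"content": [
    {"owner": {"name": "Team A"}, "value": 12.0},
    {"owner": {"name": "Team B"}, "value": 7.0}
]}}
'''

data = json.loads(sample_response)
# One column of the season table: team name -> value for a single attribute.
column = {entry['owner']['name']: float(entry['value']) for entry in data['stats']['content']}
print(column)
```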

#### Validation

In [16]:
datasets = []
for date in dates.keys():
    dataset = pd.read_csv('files/stats/' + date + '.csv', index_col=0)
    datasets.append(dataset)
    print(date + '\t' + str(len(dataset.columns)))

2006-2007	19
2007-2008	20
2008-2009	20
2009-2010	20
2010-2011	21
2011-2012	21
2012-2013	21
2013-2014	21
2014-2015	21
2015-2016	21
2016-2017	21
2017-2018	21
2018-2019	21
2019-2020	21
2020-2021	21
2021-2022	21
2022-2023	21
2023-2024	21


The above shows the number of statistics collected for each team in each of the listed seasons. The count is only stable from season 2010-2011 onwards, where every season has the same number of stats (21). A matching count doesn't guarantee that the same stats were collected, so we'll also check that the column names match.

In [17]:
mismatches = []

# indices 4-11 are seasons 2010-2011 through 2017-2018; compare every pair of them
for i in range(4,12):
    for j in range(i+1,12):
        if datasets[i].columns.tolist() != datasets[j].columns.tolist():
            mismatches.append((i,j))

if len(mismatches) == 0:
    print('Valid Set')
else:
    print('Invalid Set')

Valid Set
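The pairwise comparison can also be written with `itertools.combinations`, which makes it easy to report which seasons disagree. The column lists below are hypothetical stand-ins for `datasets[i].columns.tolist()`.

```python
from itertools import combinations

# Hypothetical column lists per season.
columns_by_season = {
    '2010-2011': ['wins', 'losses', 'goals'],
    '2011-2012': ['wins', 'losses', 'goals'],
    '2012-2013': ['wins', 'goals', 'losses'],  # same names, different order
}

# List comparison is order-sensitive, like the check in the cell above.
mismatched = [(a, b) for a, b in combinations(columns_by_season, 2)
              if columns_by_season[a] != columns_by_season[b]]
print(mismatched)
```

If column order doesn't matter for the later steps, comparing `set(...)` of each list instead would treat the third season as matching.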


### Results

In [30]:
def get_team_ids(date):
    # setup
    api = 'https://footballapi.pulselive.com/football/compseasons/' + str(dates[date]) + '/teams'
    headers = {'Origin': 'https://www.premierleague.com'}

    # request
    response = requests.get(api, headers=headers)
    teams = json.loads(response.text)

    # parse: comma-separated ids, passed as the 'teams' parameter in get_results
    team_ids = []
    for team in teams:
        team_ids.append(int(team['id']))
    return ','.join(map(str, team_ids))

In [37]:
def get_results(date, team_ids):
    # setup
    api = 'https://footballapi.pulselive.com/football/fixtures'
    headers = {'Origin': 'https://www.premierleague.com'}
    params = {'comps':'1', 'compSeasons':dates[date], 'teams':team_ids, 'page':'0', 'pageSize':'380', 'sort':'asc', 'statuses':'C', 'altIds':'true'}

    # request
    response = requests.get(api, params=params, headers=headers)
    results = json.loads(response.text)

    # parse: collect the rows in a list and build the frame once
    # (DataFrame.append was removed in pandas 2.0 and raised an AttributeError here)
    columns = ['home_team', 'away_team', 'home_goals', 'away_goals', 'result']
    rows = []
    for result in results['content']:
        home, away = result['teams'][0], result['teams'][1]
        rows.append([home['team']['name'], away['team']['name'], home['score'], away['score'], result['outcome']])

    return pd.DataFrame(rows, columns=columns)

In [38]:
bar = progressbar.ProgressBar(maxval=len(dates), widgets=['', '\t', progressbar.Bar('-', '[', ']'), ' ', progressbar.Percentage()])
bar.start()
for i, date in enumerate(dates):
    bar.widgets[0] = date
    team_ids = get_team_ids(date)
    results = get_results(date, team_ids)
    bar.update(i+1)
    results.to_csv('files/results/' + date + '.csv')
bar.finish()
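The row-building logic can be tested without the API. The fixture below is hypothetical: it has only the keys `get_results` reads, the team names and scores are invented, and `'H'` is a placeholder rather than a confirmed value of the API's `outcome` field. As in the notebook, the first entry of `teams` is assumed to be the home side.

```python
# Hypothetical fixtures payload with only the keys get_results reads.
sample_results = {'content': [
    {'teams': [{'team': {'name': 'Team A'}, 'score': 2},
               {'team': {'name': 'Team B'}, 'score': 1}],
     'outcome': 'H'},  # placeholder outcome value
]}

COLUMNS = ['home_team', 'away_team', 'home_goals', 'away_goals', 'result']

def fixture_row(fixture):
    """Flatten one fixture into a dict keyed by the result columns."""
    home, away = fixture['teams']  # assumes index 0 is the home team
    values = [home['team']['name'], away['team']['name'],
              home['score'], away['score'], fixture['outcome']]
    return dict(zip(COLUMNS, values))

rows = [fixture_row(fixture) for fixture in sample_results['content']]
print(rows)
```

A list of such dicts can be handed to `pd.DataFrame(rows)` in one go, which avoids growing a frame row by row.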

### Concatenation

Concatenate the separate per-season sets into one set covering all seasons: one for stats and another for results.

In [3]:
files = ['2006-2007.csv', '2007-2008.csv', '2008-2009.csv', '2009-2010.csv', '2010-2011.csv', '2011-2012.csv', '2012-2013.csv', '2013-2014.csv', '2014-2015.csv', '2015-2016.csv', '2016-2017.csv', '2017-2018.csv']

stats_frames = []
results_frames = []
for name in files:
    season = name[:-4]  # strip '.csv'

    # Stats: the first column holds the team names
    stats_season = pd.read_csv('files/stats/' + name, index_col=False)
    stats_season = stats_season.rename(columns={stats_season.columns[0]: 'team'})
    stats_season['season'] = season  # works for any number of teams, not just 20
    stats_frames.append(stats_season)

    # Results: the first column is the saved row index, so read it back as the index
    results_season = pd.read_csv('files/results/' + name, index_col=0)
    results_season['season'] = season  # works for any number of fixtures, not just 380
    results_frames.append(results_season)

# keep only the columns present in the last season read
stats_df = pd.concat(stats_frames)[stats_season.columns.tolist()]
stats_df.to_csv('files/stats/stats.csv', index=False)

results_df = pd.concat(results_frames)
results_df.to_csv('files/results/results.csv', index=False)
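The idea behind the concatenation (tag every row with its season, then stack the seasons) can be shown with the standard library alone. The in-memory CSV text below is a hypothetical stand-in for the per-season files, with invented teams and values.

```python
import csv
import io

# Hypothetical per-season CSV contents standing in for files/stats/<season>.csv.
season_files = {
    '2016-2017': "team,wins\nTeam A,20\nTeam B,10\n",
    '2017-2018': "team,wins\nTeam A,18\nTeam C,9\n",
}

combined = []
for season, text in season_files.items():
    for row in csv.DictReader(io.StringIO(text)):
        row['season'] = season  # tag each row with the season it came from
        combined.append(row)

print(len(combined), combined[0])
```

Note that `csv.DictReader` keeps every value as a string, whereas `pd.read_csv` infers numeric types.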