# European Football Match Scraper
> Downloading data for the maching learning model

- toc: false
- badges: true
- comments: true
- categories: [soccer, machine learning, webscraping]

The philosophy behind this model is this: people have been trying to figure out the best stat for predicting match outcomes for decades, we don't need to create something new. We can base the model on who has the best stats & probably get a great result. We'll also add in the odds from the casinos, but that's later.

Right now, we need to download the stats for all those games. on [fbref.com](http://fbref.com), we see 122 different stats per team per game. Sometimes these are at the player level, sometimes at the team level. For this model, I am going to do all calculations at the team level. You'll see that we'll get a good result.

This post is just the web scraper. The next post will build and train a model using the odds from the previous post and the matches from this post.

In [1]:
import requests
import re
from bs4 import BeautifulSoup as bs
import pandas as pd
import datetime as dt

import warnings
warnings.filterwarnings('ignore')

The next cell processes the schedule pages since 2017 and grabs the links for each indivual match. Before 2017, the stats on fbref are different, not as complete.

In [2]:
# Full stats for fbref.com seem to start in 2017
sched_pages = [
    "https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures",
    "https://fbref.com/en/comps/9/1631/schedule/2017-2018-Premier-League-Scores-and-Fixtures",
    "https://fbref.com/en/comps/9/1889/schedule/2018-2019-Premier-League-Scores-and-Fixtures",
    "https://fbref.com/en/comps/9/3232/schedule/2019-2020-Premier-League-Scores-and-Fixtures",
    "https://fbref.com/en/comps/13/3243/schedule/2019-2020-Ligue-1-Scores-and-Fixtures",
    "https://fbref.com/en/comps/13/2104/schedule/2018-2019-Ligue-1-Scores-and-Fixtures",
    "https://fbref.com/en/comps/13/1632/schedule/2017-2018-Ligue-1-Scores-and-Fixtures",
    "https://fbref.com/en/comps/13/schedule/Ligue-1-Scores-and-Fixtures",
    "https://fbref.com/en/comps/20/3248/schedule/2019-2020-Bundesliga-Scores-and-Fixtures",
    "https://fbref.com/en/comps/20/2109/schedule/2018-2019-Bundesliga-Scores-and-Fixtures",
    "https://fbref.com/en/comps/20/1634/schedule/2017-2018-Bundesliga-Scores-and-Fixtures",
    "https://fbref.com/en/comps/20/schedule/Bundesliga-Scores-and-Fixtures",
    "https://fbref.com/en/comps/11/3260/schedule/2019-2020-Serie-A-Scores-and-Fixtures",
    "https://fbref.com/en/comps/11/1896/schedule/2018-2019-Serie-A-Scores-and-Fixtures",
    "https://fbref.com/en/comps/11/1640/schedule/2017-2018-Serie-A-Scores-and-Fixtures",
    "https://fbref.com/en/comps/11/schedule/Serie-A-Scores-and-Fixtures",
    "https://fbref.com/en/comps/12/3239/schedule/2019-2020-La-Liga-Scores-and-Fixtures",
    "https://fbref.com/en/comps/12/1886/schedule/2018-2019-La-Liga-Scores-and-Fixtures",
    "https://fbref.com/en/comps/12/1652/schedule/2017-2018-La-Liga-Scores-and-Fixtures",
    "https://fbref.com/en/comps/12/chedule/La-Liga-Scores-and-Fixtures",
]
match_pages = set()
for s in sched_pages:
    html = requests.get(s).text
    match_url_regex = "\/en\/matches\/.{8}\/[^\"]+"
    matches = re.findall(match_url_regex, html)
    match_pages.update(matches)
print(f"found {len(match_pages)} matches...")

found 6110 matches...


Here is the routine that parses the match data. The inline comments give some detail. There is an imporvement needed on to get the full data for the goalkeepers, if more than one is used in the match. This is a pretty rare event, so I haven't written the code to handle it yet.

In [3]:
def parse_match(html):
    soup = bs(html, 'lxml')
    match = {}
    # get game data
    match['game_id'] = url.split('/')[-1]
    match['date'] = soup.find('span',{'class':'venuetime'})['data-venue-date']
    match['game_time'] = soup.find('span',{'class':'venuetime'})['data-venue-time']
    
    # get scores
    scorebox = soup.find('div',{'class':'scorebox'})
    match['away_team'] = scorebox.findAll('div',{'itemprop':'performer'})[1].text.strip()
    match['home_team'] = scorebox.findAll('div',{'itemprop':'performer'})[0].text.strip()
    match['away_score'] = scorebox.findAll('div',{'class':'score'})[1].text.strip()
    match['home_score'] = scorebox.findAll('div',{'class':'score'})[0].text.strip()
    match['away_score_xg'] = scorebox.findAll('div',{'class':'score_xg'})[1].text.strip()
    match['home_score_xg'] = scorebox.findAll('div',{'class':'score_xg'})[0].text.strip()
    
    # get stats from footers of stat tables
    stat_table_regexes = [
        'stats_.+_passing$',
        'stats_.+_passing_types',
        'stats_.+_defense',
        'stats_.+_possession',
        'stats_.+_misc'
    ]
    for p in stat_table_regexes:
        tbls = soup.findAll('table',{'id':re.compile(p)})
        # tbls[1] is the away team
        for c in tbls[1].find('tfoot').findAll('td'):
            if len(c.text)>0: match['away_'+c['data-stat']] = c.text
        # tbls[0] is the home team
        for c in tbls[0].find('tfoot').findAll('td'):
            if len(c.text)>0: match['home_'+c['data-stat']] = c.text
    
    #goalkeeper stats from first row of goalkeepers (no summary row) 
    #TODO: get all goalkeepers in match if more than one
    tbls = soup.findAll('table',{'id':re.compile('keeper_stats_')})
    # tbls[1] is the away team
    for c in tbls[1].findAll('tr')[2].findAll('td')[3:]: # skip the first 3 columns (name, age, country)
        if len(c.text)>0: match['away_'+c['data-stat']] = c.text

    # tbls[0] is the home team
    for c in tbls[1].findAll('tr')[2].findAll('td')[3:]: # skip the first 3 columns (name, age, country)
        if len(c.text)>0: match['home_'+c['data-stat']] = c.text
    return match

Below is the iterator to parse all of the matches. If there's an error parsing the match, it provides the link. Most of these are delayed matches. I haven't found any other types of problems.

In [4]:
match_pages = list(match_pages)
matches = []
for m in match_pages:
    url = 'https://fbref.com'+m
    html = requests.get(url).text
    try:
        match = parse_match(html)
    except:
        # no advanced stats, or match was cancelled. Uncomment the below line to see examples
        # print(url)
        continue
    matches.append(match)
    if len(matches)%250==0: print(len(matches), dt.datetime.now().time())

250 09:22:09.653225
500 09:25:59.893262
750 09:30:01.152984
1000 09:34:28.578578
1250 09:38:47.283655
1500 09:43:17.526664
1750 09:47:39.714315
2000 09:51:46.943318
2250 09:55:56.813498
2500 09:59:37.172234
2750 10:03:53.583911
3000 10:07:55.117692
3250 10:13:05.095473
3500 10:18:18.249996
3750 10:26:15.529503
4000 10:37:23.986049
4250 10:47:48.410901
4500 10:57:33.316190
4750 11:06:57.992304
5000 11:16:44.733365
5250 11:28:27.944883
5500 11:40:59.823845
5750 11:50:52.897307
6000 12:00:06.576418


Last, we'll save the output to disk.

In [5]:
df = pd.DataFrame(matches)
df.to_csv("matches.csv.gzip", index=False, compression='gzip')
df.shape

(6001, 251)

That's 253 columns per match. It's a lot of stats... 

- 5 for game identification: game_id, date, time, home team and away team
- 4 for score (score and score_xg, home and away)
- 122 individual stats for each team

I don't want to do a full EDA of this dataset in this notebook, but let's at least run a correlation to see which stats might be important. Starting with home team vs their score.

In [2]:
df = pd.read_csv("matches.csv.gzip", compression='gzip')

In [3]:
home_columns = [x for x in df.columns if 'home' in x]
df['home_score'] = pd.to_numeric(df['home_score'], errors='coerce')
df[home_columns].corr()['home_score'].abs().sort_values(ascending=False)[:25]

home_score                       1.000000
home_goals_against_gk            0.989576
home_assists                     0.844361
home_psxg_gk                     0.762950
home_save_pct                    0.614315
home_score_xg                    0.592121
home_shots_on_target_against     0.550086
home_xa                          0.522682
home_touches_att_pen_area        0.275894
home_assisted_shots              0.272490
home_through_balls               0.251706
home_passes_completed_short      0.250673
home_passes_pct                  0.245690
home_passes_ground               0.238994
home_passes_short                0.237699
home_passes_pct_long             0.233341
home_passes_completed            0.226572
home_passes_received             0.226568
home_passes_high                 0.214659
home_carries                     0.213738
home_pens_won                    0.211345
home_pass_targets                0.210090
home_touches_live_ball           0.208687
home_passes_live                 0

psxg_gk is "Post-Shot Expected Goals", a goalkeeper stat. Hover over the item on [fbref.com](http://fbref.com) to find out what each is.

Also let's see what correlates with the away team score, in case its different.

In [4]:
away_columns = [x for x in df.columns if 'away' in x]
df['away_score'] = pd.to_numeric(df['away_score'], errors='coerce')
df[away_columns].corr()['away_score'].abs().sort_values(ascending=False)[:25]

away_score                          1.000000
away_assists                        0.844739
away_score_xg                       0.611637
away_xa                             0.529720
away_assisted_shots                 0.313385
away_touches_att_pen_area           0.291722
away_through_balls                  0.245553
away_pens_won                       0.215357
away_passes_into_penalty_area       0.201255
away_passes_completed_short         0.179558
away_passes_progressive_distance    0.172739
away_passes_short                   0.170302
away_touches_live_ball              0.163794
away_passes_received                0.162729
away_passes_completed               0.162645
away_passes_pct                     0.161670
away_passes_ground                  0.159949
away_touches                        0.157365
away_pass_targets                   0.152591
away_passes_pct_long                0.150881
away_passes_live                    0.149583
away_carries                        0.145819
away_passe

This is interesting, but I'm much more interested in how a machine learning model feels about the stats. We'll do that next.