# Portfolio Project - Web Scraping Football Matches From The EPL in Python

### Introduction

This project is the first part of a machine learning project in which I will predict the winner of each football match in the English Premier League (EPL).
The main goal of this part is to gather data into a cleaned pandas DataFrame via web scraping.

### Downloading and Exploring the EPL Stats Page

In [3]:
import requests

In [4]:
url = "https://fbref.com/en/comps/9/2021-2022/2021-2022-Premier-League-Stats"

In [5]:
page = requests.get(url)

In [7]:
page.status_code # if the code is 200 then OK

200
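
Checking `status_code` by eye works in a notebook, but the check can also be made automatic. The following is a minimal sketch, not the notebook's own code: the helper name `download_page`, its `http` parameter, and the 10-second timeout are assumptions chosen for illustration. It relies on `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses.

```python
import requests


def download_page(url, http=requests, timeout=10):
    """Fetch `url` and return the response, failing loudly on HTTP errors.

    `download_page`, `http`, and the default timeout are hypothetical
    names/values for this sketch, not part of the original notebook.
    """
    # A timeout keeps the call from hanging forever on a stalled connection.
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes, so an error page
    # is never handed to the parser by mistake.
    response.raise_for_status()
    return response
```

With this helper, `page = download_page(url)` would stand in for the separate `requests.get(url)` and `page.status_code` cells.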

Inspecting the page in a web browser, I can see that the `id` of the table tag that contains the teams and their results is `results2021-202291_overall`. Furthermore, the HTML element that contains each team name and the URL of its detail page is identified by `team`.

### Parsing HTML Links

In [14]:
from bs4 import BeautifulSoup

In [15]:
soup = BeautifulSoup(page.content, "html.parser")

In [21]:
teams_table = soup.find(id="results2021-202291_overall")

In [38]:
links = teams_table.find_all("a")
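
`soup.find` returns `None` when nothing matches, so if the table `id` differs (for another season, say), the `find_all` call above would fail with an `AttributeError`. A small guard makes that failure easier to read. This is a sketch under assumptions: `find_table` is a hypothetical helper, and the sample markup below is hand-written rather than taken from the real page.

```python
from bs4 import BeautifulSoup


def find_table(soup, table_id):
    """Return the element with the given id, or raise a descriptive error.

    `find_table` is a hypothetical helper for this sketch.
    """
    table = soup.find(id=table_id)
    if table is None:
        raise ValueError(
            f"no element with id={table_id!r}; the page layout may have changed"
        )
    return table


# Hand-written stand-in markup, not the real fbref page.
demo_soup = BeautifulSoup('<table id="standings"><tr><td>x</td></tr></table>', "html.parser")
demo_table = find_table(demo_soup, "standings")
print(demo_table.name)
```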

In [52]:
"""
Example links:

Link: /en/squads/19538871/2021-2022/Manchester-United-Stats, Text: Manchester Utd    - if contains word 'squads' then it refer to club
Link: /en/players/dea698d9/Cristiano-Ronaldo, Text: Cristiano Ronaldo
"""

club_paths = []

for link in links:
    path = link.get("href")
    
    if ("squads") in path:
        club_paths.append(path)
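
The same filtering step can be tried without downloading anything. The sketch below runs it on a small hand-written HTML fragment (an assumption for illustration, not the real page markup) and passes a function as the `href` filter to `find_all`. Beautiful Soup calls that function with each tag's `href` value, or `None` when the attribute is missing. All `sample_*` names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the league table; not the real page markup.
sample_html = """
<table id="results2021-202291_overall">
  <tr>
    <td><a href="/en/squads/19538871/2021-2022/Manchester-United-Stats">Manchester Utd</a></td>
    <td><a href="/en/players/dea698d9/Cristiano-Ronaldo">Cristiano Ronaldo</a></td>
    <td><a>anchor without href</a></td>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find(id="results2021-202291_overall")

# The filter function sees the href value (None if absent) and keeps club links only.
sample_paths = [
    link["href"]
    for link in sample_table.find_all("a", href=lambda href: href and "/squads/" in href)
]
print(sample_paths)
# → ['/en/squads/19538871/2021-2022/Manchester-United-Stats']
```

Filtering inside `find_all` collapses the explicit loop into one expression. The loop version is easier to step through while exploring the page.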

In [55]:
club_paths

['/en/squads/b8fd03ef/2021-2022/Manchester-City-Stats',
 '/en/squads/822bd0ba/2021-2022/Liverpool-Stats',
 '/en/squads/cff3d9bb/2021-2022/Chelsea-Stats',
 '/en/squads/361ca564/2021-2022/Tottenham-Hotspur-Stats',
 '/en/squads/18bb7c10/2021-2022/Arsenal-Stats',
 '/en/squads/19538871/2021-2022/Manchester-United-Stats',
 '/en/squads/7c21e445/2021-2022/West-Ham-United-Stats',
 '/en/squads/a2d435b3/2021-2022/Leicester-City-Stats',
 '/en/squads/d07537b9/2021-2022/Brighton-and-Hove-Albion-Stats',
 '/en/squads/8cec06e1/2021-2022/Wolverhampton-Wanderers-Stats',
 '/en/squads/b2b47a98/2021-2022/Newcastle-United-Stats',
 '/en/squads/47c64c55/2021-2022/Crystal-Palace-Stats',
 '/en/squads/cd051869/2021-2022/Brentford-Stats',
 '/en/squads/8602292d/2021-2022/Aston-Villa-Stats',
 '/en/squads/33c895d4/2021-2022/Southampton-Stats',
 '/en/squads/d3fd31cc/2021-2022/Everton-Stats',
 '/en/squads/5bfb9659/2021-2022/Leeds-United-Stats',
 '/en/squads/943e8050/2021-2022/Burnley-Stats',
 '/en/squads/2abfe087/2021-

In [56]:
from urllib.parse import urljoin

# Build absolute URLs. Each path starts with "/", so urljoin resolves it against
# the site root (https://fbref.com) instead of appending it to the league page URL.
club_urls = [urljoin(url, path) for path in club_paths]

In [57]:
club_urls

['https://fbref.com/en/squads/b8fd03ef/2021-2022/Manchester-City-Stats',
 'https://fbref.com/en/squads/822bd0ba/2021-2022/Liverpool-Stats',
 'https://fbref.com/en/squads/cff3d9bb/2021-2022/Chelsea-Stats',
 'https://fbref.com/en/squads/361ca564/2021-2022/Tottenham-Hotspur-Stats',
 'https://fbref.com/en/squads/18bb7c10/2021-2022/Arsenal-Stats',
 'https://fbref.com/en/squads/19538871/2021-2022/Manchester-United-Stats',
 'https://fbref.com/en/squads/7c21e445/2021-2022/West-Ham-United-Stats',
 'https://fbref.com/en/squads/a2d435b3/2021-2022/Leicester-City-Stats',
 'https://fbref.com