# **Football Data Task 4 : Web Scraping Football UnderStat with Beautiful Soup - Aston Villa vs Arsenal 2024/2025** #
### **Background** ###

This notebook provides a step-by-step guide to scraping football match data from UnderStat. 
We will write a program that extracts shot data from a specific match, given a match ID entered by the user. Match IDs can be found in the URL of any match page on UnderStat. In this example we 
will scrape shot data for the Aston Villa vs. Arsenal match, which took place on 24/08/2024.
<br>

### **Step by Step Guide** ###
Here’s a step-by-step guide to the process:
<br> 

1) **Ask the User for the UnderStat Match ID:** Obtain the match ID from the user; it is used to construct the URL.

2) **Make Initial HTML Request:** Send a request to the UnderStat website to retrieve the HTML content of the match page.

3) **Scrape HTML with BeautifulSoup:** Use BeautifulSoup to parse the HTML and extract shot data in JSON format.

4) **Convert JSON to Python Dictionary:** Transform the JSON data into a Python dictionary for easier manipulation.

5) **Organize Data:** Initialize lists to organize the shot information separately for home and away teams.

6) **Populate Data Structures:** Populate these lists with data from the Python dictionary.

7) **Construct a DataFrame:** Construct a DataFrame from the organized data.

8) **Export to CSV:** Save the DataFrame as a CSV file for further analysis.
<br>


By the end of this notebook you will have a clear understanding of how to extract and analyse football match data from UnderStat.


In [12]:
# Imports 

import requests 
from bs4 import BeautifulSoup
import json 
import pandas as pd 

In [13]:
# Asking the user to input the desired match id and formatting the url 
# For our example, the match id for the Aston Villa vs Arsenal game is 26618
base_url = 'https://understat.com/match/'
match = input('Please enter the match id: ').strip()  # input() already returns a string
url = base_url + match


# Use requests to fetch the webpage and BeautifulSoup to parse it; 'lxml' selects the lxml parser, a separate package that must be installed
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')

# Checking results 
print(soup)

<!DOCTYPE html>
<html>
<head>
<base href="https://understat.com/"/>
<title>Aston Villa 0 - 2 Arsenal (August 24 2024) | EPL | 2024/2025 | xG | Understat.com</title>
<meta charset="utf-8"/>
<meta content="Aston Villa 0 - 2 Arsenal. Check out detailed player statistic, goals, assists, key passes, xG, shot map, xGplot." name="description"/>
<meta content="Aston Villa, Arsenal, EPL, 2024/2025, (August 24 2024), xG, expected goals, shot map" name="Keywords"/>
<link href="apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="manifest.json" rel="manifest"/>
<link color="#5bbad5" href="safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="understat" name="apple-mobile-web-app-title"/>
<meta content="understat" name="application-name"/>
<meta content="#ffffff" name="theme-color"/>
<meta content="no-cache" http-equiv="cache-c
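
Because the match ID comes straight from `input()`, a stray space or a pasted URL would produce a request for a page that does not exist. Below is a minimal sketch of a guard, assuming UnderStat match IDs are purely numeric (as the example ID 26618 is). `build_match_url` is a hypothetical helper name, not part of any library.

```python
BASE_URL = 'https://understat.com/match/'

def build_match_url(match_id):
    """Return the match page URL, rejecting ids that are not plain digits."""
    cleaned = str(match_id).strip()
    if not cleaned.isdigit():
        raise ValueError(f'Match id must be numeric, got {match_id!r}')
    return BASE_URL + cleaned

print(build_match_url(' 26618 '))  # https://understat.com/match/26618

# With requests imported as in the Imports cell, the fetch could then be:
# res = requests.get(build_match_url(match), timeout=10)
# res.raise_for_status()  # fail loudly on a 4xx/5xx response
```

Calling `raise_for_status()` before parsing means an error page is never mistaken for a match page.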

In [14]:
# Finding all of the script tags in the parsed page
scripts = soup.find_all('script')
print(scripts)

[<script>
			var THEME = localStorage.getItem("theme") || 'DARK';
			document.body.className = "theme-" + THEME.toLowerCase();
		</script>, <script>
	var shotsData 	= JSON.parse('\x7B\x22h\x22\x3A\x5B\x7B\x22id\x22\x3A\x22586301\x22,\x22minute\x22\x3A\x2224\x22,\x22result\x22\x3A\x22MissedShots\x22,\x22X\x22\x3A\x220.8919999694824219\x22,\x22Y\x22\x3A\x220.56\x22,\x22xG\x22\x3A\x220.3767159879207611\x22,\x22player\x22\x3A\x22Ollie\x20Watkins\x22,\x22h_a\x22\x3A\x22h\x22,\x22player_id\x22\x3A\x228865\x22,\x22situation\x22\x3A\x22OpenPlay\x22,\x22season\x22\x3A\x222024\x22,\x22shotType\x22\x3A\x22RightFoot\x22,\x22match_id\x22\x3A\x2226618\x22,\x22h_team\x22\x3A\x22Aston\x20Villa\x22,\x22a_team\x22\x3A\x22Arsenal\x22,\x22h_goals\x22\x3A\x220\x22,\x22a_goals\x22\x3A\x222\x22,\x22date\x22\x3A\x222024\x2D08\x2D24\x2016\x3A30\x3A00\x22,\x22player_assisted\x22\x3A\x22Morgan\x20Rogers\x22,\x22lastAction\x22\x3A\x22Pass\x22\x7D,\x7B\x22id\x22\x3A\x22586304\x22,\x22minute\x22\x3A\x2243\x22,\x22r

In [15]:
# Getting the shots data only: it sits in the second script tag (index 1) 
strings = scripts[1].string
print(strings)



	var shotsData 	= JSON.parse('\x7B\x22h\x22\x3A\x5B\x7B\x22id\x22\x3A\x22586301\x22,\x22minute\x22\x3A\x2224\x22,\x22result\x22\x3A\x22MissedShots\x22,\x22X\x22\x3A\x220.8919999694824219\x22,\x22Y\x22\x3A\x220.56\x22,\x22xG\x22\x3A\x220.3767159879207611\x22,\x22player\x22\x3A\x22Ollie\x20Watkins\x22,\x22h_a\x22\x3A\x22h\x22,\x22player_id\x22\x3A\x228865\x22,\x22situation\x22\x3A\x22OpenPlay\x22,\x22season\x22\x3A\x222024\x22,\x22shotType\x22\x3A\x22RightFoot\x22,\x22match_id\x22\x3A\x2226618\x22,\x22h_team\x22\x3A\x22Aston\x20Villa\x22,\x22a_team\x22\x3A\x22Arsenal\x22,\x22h_goals\x22\x3A\x220\x22,\x22a_goals\x22\x3A\x222\x22,\x22date\x22\x3A\x222024\x2D08\x2D24\x2016\x3A30\x3A00\x22,\x22player_assisted\x22\x3A\x22Morgan\x20Rogers\x22,\x22lastAction\x22\x3A\x22Pass\x22\x7D,\x7B\x22id\x22\x3A\x22586304\x22,\x22minute\x22\x3A\x2243\x22,\x22result\x22\x3A\x22BlockedShot\x22,\x22X\x22\x3A\x220.779000015258789\x22,\x22Y\x22\x3A\x220.3920000076293945\x22,\x22xG\x22\x3A\x220.0289393812417984
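
Indexing with `scripts[1]` relies on the shot data always being the second `<script>` tag, which is an assumption about the current page layout. A hedged alternative is to pick the script by its content instead. In this sketch `find_shots_script` is a hypothetical helper; it takes the text of each script, which in this notebook would be `(s.string for s in scripts)`.

```python
def find_shots_script(script_texts):
    """Return the first script text that mentions shotsData."""
    for text in script_texts:
        # .string can be None, e.g. for a script tag that has no inline code
        if text and 'shotsData' in text:
            return text
    raise ValueError('No script containing shotsData was found')

# Shortened stand-ins for the real script contents
example_scripts = [
    'var THEME = localStorage.getItem("theme") || "DARK";',
    None,
    "var shotsData = JSON.parse('...');",
]
print(find_shots_script(example_scripts))
```

If the page layout changes, this version raises a clear error rather than silently returning the wrong script.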

In [16]:
# Locate the start and end indices of the JSON data within the string
ind_start = strings.index("('")+2
ind_end = strings.index("')")

# Extract and clean the JSON data
json_data = strings[ind_start:ind_end]
json_data = json_data.encode('utf8').decode('unicode_escape')  # decodes the \xNN escape sequences (e.g. \x7B becomes '{') into the characters they represent

# Convert the cleaned JSON string to a Python dictionary
data = json.loads(json_data)
data


{'h': [{'id': '586301',
   'minute': '24',
   'result': 'MissedShots',
   'X': '0.8919999694824219',
   'Y': '0.56',
   'xG': '0.3767159879207611',
   'player': 'Ollie Watkins',
   'h_a': 'h',
   'player_id': '8865',
   'situation': 'OpenPlay',
   'season': '2024',
   'shotType': 'RightFoot',
   'match_id': '26618',
   'h_team': 'Aston Villa',
   'a_team': 'Arsenal',
   'h_goals': '0',
   'a_goals': '2',
   'date': '2024-08-24 16:30:00',
   'player_assisted': 'Morgan Rogers',
   'lastAction': 'Pass'},
  {'id': '586304',
   'minute': '43',
   'result': 'BlockedShot',
   'X': '0.779000015258789',
   'Y': '0.3920000076293945',
   'xG': '0.0289393812417984',
   'player': 'Morgan Rogers',
   'h_a': 'h',
   'player_id': '12412',
   'situation': 'OpenPlay',
   'season': '2024',
   'shotType': 'RightFoot',
   'match_id': '26618',
   'h_team': 'Aston Villa',
   'a_team': 'Arsenal',
   'h_goals': '0',
   'a_goals': '2',
   'date': '2024-08-24 16:30:00',
   'player_assisted': 'Ollie Watkins',
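
The escaped payload can look opaque, so here is a small worked example of the same slice, decode and parse sequence on a shortened string. The sample is made up for illustration (one home shot with only an `xG` field), and `parse_shots_script` is a hypothetical helper name. One caveat: the `encode('utf8').decode('unicode_escape')` round-trip is only safe while the text between the quotes is plain ASCII, which appears to be the case for the escaped output shown above.

```python
import json

def parse_shots_script(script_text):
    """Slice the escaped JSON out of a JSON.parse('...') call and decode it."""
    marker = "JSON.parse('"
    start = script_text.index(marker) + len(marker)
    end = script_text.index("')", start)
    escaped = script_text[start:end]
    # Turn \x7B, \x22, ... into the characters they stand for
    decoded = escaped.encode('utf8').decode('unicode_escape')
    return json.loads(decoded)

# Illustrative sample, not real UnderStat output
sample = r"var shotsData = JSON.parse('\x7B\x22h\x22\x3A\x5B\x7B\x22xG\x22\x3A\x220.5\x22\x7D\x5D,\x22a\x22\x3A\x5B\x5D\x7D')"
print(parse_shots_script(sample))  # {'h': [{'xG': '0.5'}], 'a': []}
```

Searching for the closing `')` only after the opening marker avoids matching an earlier occurrence elsewhere in the script.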
   

In [17]:
# Initialising the lists that will make up our pandas dataframe 
# Home and away shots are read separately but appended to the same lists
x = [] # x pitch coordinates of the shot position
y = [] # y pitch coordinates of the shot position
xG = [] # xG from the shot  
result = [] # outcome of the shot
last_action = [] # action taken before the shot 
situation = [] # pitch situation before the shot was taken
team = [] # team name of the shot maker 
player = [] # player name of the shot maker
data_away = data['a'] # all of the away team's shot information
data_home = data['h'] # all of the home team's shot information


In [18]:
# Obtaining the home team's shot data from the results to append into the relevant lists 
for shot in data_home:
    x.append(shot['X'])
    y.append(shot['Y'])
    team.append(shot['h_team'])
    xG.append(shot['xG'])
    player.append(shot['player'])
    last_action.append(shot['lastAction'])
    result.append(shot['result'])
    situation.append(shot['situation'])

In [19]:
# Obtaining the away team's shot data from the results to append into the relevant lists 
for shot in data_away:
    x.append(shot['X'])
    y.append(shot['Y'])
    team.append(shot['a_team'])
    xG.append(shot['xG'])
    player.append(shot['player'])
    last_action.append(shot['lastAction'])
    result.append(shot['result'])
    situation.append(shot['situation'])

In [20]:
# Checking that all lists are the same length, in order to form the pandas dataframe
list_length = len(situation)

if all(len(lst) == list_length for lst in [team, player, x, y, xG, result, last_action, situation]) : 
    print('All of the lists have the same length')
else : 
    print('The lists do not all have the same length')

All of the lists have the same length
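
The length check is needed because eight parallel lists can drift out of step. One possible alternative, sketched here under the assumption that every shot carries the keys seen in the dictionary output above, is to build one dictionary per shot so each row stays together. `shots_to_rows` is a hypothetical helper, and the away shot in the example is a made-up placeholder, not real match data.

```python
def shots_to_rows(data):
    """Flatten {'h': [...], 'a': [...]} into one list of row dictionaries."""
    rows = []
    # Home shots take their team name from 'h_team', away shots from 'a_team'
    for side, team_key in (('h', 'h_team'), ('a', 'a_team')):
        for shot in data[side]:
            rows.append({
                'team_name': shot[team_key],
                'player_name': shot['player'],
                'X': shot['X'],
                'Y': shot['Y'],
                'xG': shot['xG'],
                'result': shot['result'],
                'last_action': shot['lastAction'],
                'situation': shot['situation'],
            })
    return rows

teams = {'h_team': 'Aston Villa', 'a_team': 'Arsenal'}
example = {
    'h': [{'X': '0.8919999694824219', 'Y': '0.56', 'xG': '0.3767159879207611',
           'player': 'Ollie Watkins', 'result': 'MissedShots',
           'lastAction': 'Pass', 'situation': 'OpenPlay', **teams}],
    # Placeholder values for illustration only
    'a': [{'X': '0.5', 'Y': '0.5', 'xG': '0.1',
           'player': 'Example Player', 'result': 'SavedShot',
           'lastAction': 'Pass', 'situation': 'OpenPlay', **teams}],
}
rows = shots_to_rows(example)
print(len(rows), rows[0]['team_name'], rows[1]['team_name'])
```

A list like this can be passed directly to `pd.DataFrame(rows)`, which gives one row per shot with no transpose required.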


In [21]:
# Making the pandas dataframe
col_names = ['team_name', 'player_name', 'X', 'Y', 'xG', 'result', 'last_action', 'situation']
df = pd.DataFrame([team, player, x, y, xG, result, last_action, situation], index=col_names)

# Transposing the dataframe so each row is one shot, and keeping the result so the export uses this layout
df = df.T
df

| | team_name | player_name | X | Y | xG | result | last_action | situation |
|---|---|---|---|---|---|---|---|---|
| 0 | Aston Villa | Ollie Watkins | 0.8919999694824219 | 0.56 | 0.3767159879207611 | MissedShots | Pass | OpenPlay |
| 1 | Aston Villa | Morgan Rogers | 0.779000015258789 | 0.3920000076293945 | 0.0289393812417984 | BlockedShot | Pass | OpenPlay |
| 2 | Aston Villa | Leon Bailey | 0.899000015258789 | 0.3029999923706055 | 0.0544026866555213 | SavedShot | BallTouch | OpenPlay |
| 3 | Aston Villa | Amadou Onana | 0.8059999847412109 | 0.5029999923706054 | 0.0576391480863094 | BlockedShot | Pass | OpenPlay |
| 4 | Aston Villa | Ollie Watkins | 0.94 | 0.519000015258789 | 0.4753965139389038 | SavedShot | Rebound | OpenPlay |
| 5 | Aston Villa | Morgan Rogers | 0.9009999847412108 | 0.2970000076293945 | 0.0502687245607376 | BlockedShot | Pass | OpenPlay |
| 6 | Aston Villa | Amadou Onana | 0.894000015258789 | 0.500999984741211 | 0.0543505139648914 | MissedShots | Cross | FromCorner |
| 7 | Aston Villa | Morgan Rogers | 0.8190000152587891 | 0.3479999923706054 | 0.1237539574503898 | BlockedShot | TakeOn | OpenPlay |
| 8 | Aston Villa | Jacob Ramsey | 0.8730000305175781 | 0.6469999694824219 | 0.0420124232769012 | SavedShot | | OpenPlay |
| 9 | Aston Villa | Jhon Durán | 0.924000015258789 | 0.5940000152587891 | 0.0651178881525993 | BlockedShot | Cross | OpenPlay |
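
The scraped coordinates and xG values are strings (for example `'xG': '0.3767159879207611'` in the dictionary output), so they need converting before any arithmetic. The sketch below shows one way to do that with `pd.to_numeric`, building the frame from a dictionary of columns. It uses two rows copied from the output above and only a subset of the columns; the variable names are illustrative.

```python
import pandas as pd

# Two example rows from the home team; values are strings, as scraped
columns = {
    'team_name': ['Aston Villa', 'Aston Villa'],
    'player_name': ['Ollie Watkins', 'Morgan Rogers'],
    'X': ['0.8919999694824219', '0.779000015258789'],
    'Y': ['0.56', '0.3920000076293945'],
    'xG': ['0.3767159879207611', '0.0289393812417984'],
}
shots = pd.DataFrame(columns)

# Convert the numeric columns from strings to floats
numeric_cols = ['X', 'Y', 'xG']
shots[numeric_cols] = shots[numeric_cols].apply(pd.to_numeric)

print(shots.dtypes)
print(shots['xG'].sum())
```

With float columns, operations such as summing xG per team work as expected instead of concatenating strings.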


In [22]:
# Exporting the dataframe to a .csv file 
df.to_csv('avfc_afc_2425.csv', index = False )
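
Passing `index=False` leaves the row index out of the file. The following sketch shows why that matters, writing to an in-memory buffer rather than a file on disk. The one-row frame and the `round_trip_columns` helper are illustrative only.

```python
import io
import pandas as pd

example = pd.DataFrame({'team_name': ['Aston Villa'], 'player_name': ['Ollie Watkins']})

def round_trip_columns(frame, **to_csv_kwargs):
    """Write the frame as CSV to memory, read it back, and return the column names."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)

print(round_trip_columns(example))               # ['Unnamed: 0', 'team_name', 'player_name']
print(round_trip_columns(example, index=False))  # ['team_name', 'player_name']
```

When the index is written, reading the file back yields an extra `Unnamed: 0` column, which `index=False` avoids.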