# **Football Data Task 2: Web Scraping SofaScore - Arsenal vs Chelsea 2023/2024** #

### **Background** ###
This notebook provides a step-by-step guide to scraping data from www.sofascore.com.
We will use two different methods to obtain the data: first, we will request the HTML directly from the website; second, we will use a cURL command to fetch the data directly from the website's API endpoint.

In this example, we will explore how to scrape data on the shots taken by Arsenal players during the Arsenal vs. Chelsea game from the 2023/2024 season. 

### **Step by Step Guide** ###
The steps outlined below will guide you through the entire process:
<br>

1) **Method 1 - HTML Requests** : We'll demonstrate two different ways to make initial HTML requests:
* Simple HTML Request - A simple request using basic settings.
* HTML Request with Headers - A request that specifies headers and a user-agent to mimic a real browser session, which can help avoid restrictions or blocks.

2) **Method 2 - cURL Commands** : We will copy the browser's request as a cURL command and use it to call the API endpoint that supplies the page's data. This step is crucial because the shot data is loaded by JavaScript and does not appear in the HTML returned by a standard request.

3) **Parse Data :** After retrieving the necessary data in JSON format, we will parse it to construct our DataFrame.

4) **Export DataFrame :** Export the constructed DataFrame to a CSV file.
<br>

By the end of this notebook, you will have a clear understanding of how to scrape and process web data, even from complex sites with dynamic content. 

In [1]:
# Importing the relevant libraries 
import requests 
import pandas as pd
from pandas import json_normalize
from bs4 import BeautifulSoup
import json

In [2]:
# Requesting the webpage 
# Verify the status code to ensure proper functionality. 
url = 'https://www.sofascore.com/football/match/arsenal-chelsea/NR#id:12190322'
response = requests.get(url)
response.status_code


200
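A status code printed in a cell is easy to overlook once the code moves into a script. One possible sketch wraps the request in a small helper that raises on an error status instead. The name `fetch_page` and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
from typing import Optional

import requests


def fetch_page(url: str, headers: Optional[dict] = None, timeout: float = 10.0) -> requests.Response:
    """Request a page and raise requests.HTTPError on a 4xx/5xx status.

    `fetch_page` is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() does nothing for non-error responses such as 200
    response.raise_for_status()
    return response
```

Calling `fetch_page(url)` with the match URL above would then either return the response or stop with an exception that names the failing status.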

In [3]:
# If you encounter a 403 error, you can use whatsmybrowser.org to retrieve your browser's header and user-agent information.
# Example below
url = 'https://www.sofascore.com/football/match/arsenal-chelsea/NR#id:12190322'
response = requests.get(url, headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15 '})
response.status_code

200
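If several requests need the same browser-like headers, a `requests.Session` can carry them so they are not repeated on every call. The sketch below is one way to set this up; `make_session` is a hypothetical helper name, and the header values are simply the ones used elsewhere in this notebook.

```python
import requests

# User-agent string copied from the request above; replace it with your own browser's value
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15"
)


def make_session(user_agent: str = USER_AGENT) -> requests.Session:
    """Return a session whose default headers mimic a browser (hypothetical helper)."""
    session = requests.Session()
    session.headers.update(
        {
            "User-Agent": user_agent,
            "Accept-Language": "en-GB,en;q=0.9",
        }
    )
    return session


session = make_session()
# Every session.get(...) call now sends these headers automatically
print(session.headers["User-Agent"])
```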

In [4]:
# Locate the HTML element containing shot data.
# This doesn't work as expected because the page content is loaded dynamically via JavaScript, so the selector returns an empty list. We can use a different method to access the data.
soup = BeautifulSoup(response.text, 'html.parser')
soup.select('g[cursor="pointer"]')

[]
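The empty list can be reproduced offline. The sketch below runs the same selector against two made-up HTML snippets: one resembling what a server sends before any JavaScript runs, and one resembling the DOM after the browser has drawn the shot map. Both snippets (including the `id="root"` container) are hypothetical stand-ins, not SofaScore's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML returned by requests: an empty container plus a script tag
raw_html = """
<html><body>
  <div id="root"></div>
  <script src="/app.js"></script>
</body></html>
"""

# Hypothetical stand-in for the DOM after JavaScript has rendered the shot markers
rendered_html = """
<html><body>
  <div id="root"><svg><g cursor="pointer"></g><g cursor="pointer"></g></svg></div>
</body></html>
"""

selector = 'g[cursor="pointer"]'
raw_matches = BeautifulSoup(raw_html, "html.parser").select(selector)
rendered_matches = BeautifulSoup(rendered_html, "html.parser").select(selector)

print(len(raw_matches), len(rendered_matches))  # 0 2
```

The selector itself is fine; the elements simply do not exist in the HTML that `requests` receives.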

In [5]:
# As an alternative, you can use cURL to retrieve the information.
# cURL (Client for URL) is a tool used to transfer data to and from a server.
# To do this, right-click the desired element in your browser and select "Inspect".
# Navigate to the "Network" tab and locate the request associated with the element you want.
# Right-click on the request and select "Copy as cURL".
# Paste the cURL command into a cURL converter to generate the necessary code for scraping.
cookies = {
    '_ga_HNQ9P9MGZR': 'GS1.1.1724919866.4.1.1724919963.32.0.0',
    '__browsiSessionID': 'ec21ca3d-2836-4b37-bdf2-394a598cdcd7&true&SEARCH&gb&desktop-4.28.123&false',
    '__browsiUID': '63da2817-f606-43bd-bbde-1f3692805dc9',
    'FCNEC': '%5B%5B%22AKsRol-kFHX7Yzhzy5e2tsZo8_j5VIdzKIaOsckouf6XeXgHmmj-1zFZilY6K09jhTYI8wBRF-jquafduotTzk0JHhB9Rp_vdY_Z5JOHfFEzA2KV6sfiy6MinMYlA-KPHz3OOS0P5g-qvcnxrzcqqU8MBgB7p4xjFw%3D%3D%22%5D%5D',
    'cto_bundle': 'sNMBJ19hR0c3eHI4NkMwMnBLbUF1cXRHWkc4QzBZVGEyVUcxcUE4b3ZxMmglMkI4JTJCdzNKbTIxcWJBdHlaUEhNJTJCSHFwenI2ajZkOW5lRGQwa0NQNkZwMENlQ0dvVTdKd1JkNlY1ZWVteHZ2dTQyM0F2WmY0aXFHbzkyeFlaVnNSZkxOYjZKWA',
    '__eoi': 'ID=15ca5a67fa7ab7de:T=1724840295:RT=1724919866:S=AA-AfjboWmuFLU54absAIy8E53XX',
    '__gads': 'ID=e8bfc438140ca62a:T=1724840295:RT=1724919866:S=ALNI_Mb8oXgZEbMqH-xUnKI84tLH51zeqw',
    '__gpi': 'UID=00000eaa4f3887ad:T=1724840295:RT=1724919866:S=ALNI_MaNs1wj75qHhn3jkaTE_opuNy8vfA',
    '_ga': 'GA1.1.220782722.1724840295',
    'gcid_first': 'bf063a4c-d0a7-4bc6-8825-64ca3b964122',
    'FCCDCF': '%5Bnull%2Cnull%2Cnull%2C%5B%22CQEEBUAQEEBUAEsACBENBDFoAP_gAEPgAA6II3gB5C5ETSFBYH51KIsEYAEHwAAAIsAgAAYBAQABQBKQAIQCAGAAEAhAhCACgAAAIEYBIAEACAAQAAAAAAAAIAAEIAAQAAAIICAAAAAAAABIAAAIAAAAEAAAwCAABAAA0AgEAJIISNgAAAAAAAAAAgAAAAAAAgAAAEhAAAEIAAAAACgAEABAEAAAAAEIABBII3gB5C5ETSFBYHhVIIsEIAEXwAAAIsAgAAYBAQABQBKQAIQCEGAAAAgAACAAAAAAIEQBIAEAAAgAAAAAAAAAIAAEAAAQAAAIICAAAAAAAABAAAAIAAAAEAAAwCAABAAAwQhEAJIASFgAAAAgAAAAAoAAAAAAAgAAAEhAAAEAAAAAAAAAEABAEAAAAAAAABBIAAA.dnAACAgAAAA%22%2C%222~70.89.108.149.211.313.358.415.486.540.621.981.1029.1046.1092.1097.1126.1205.1301.1516.1558.1584.1598.1651.1697.1716.1753.1810.1832.1985.2328.2373.2440.2571.2572.2575.2577.2628.2642.2677.2767.2860.2878.2887.2922.3182.3190.3234.3290.3292.3331.10631~dv.%22%2C%227CC1CE98-D2D1-4901-9D95-7C158EB525F7%22%5D%5D',
    '__qca': 'P0-1961469510-1724840292643',
    '_lr_env_src_ats': 'false',
}

headers = {
    'Accept': '*/*',
    'Sec-Fetch-Site': 'same-origin',
    'Accept-Language': 'en-GB,en;q=0.9',
    #'Accept-Encoding': 'gzip, deflate, br',
    'Sec-Fetch-Mode': 'cors',
    'Cache-Control': 'max-age=0',
    'Host': 'www.sofascore.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15',
    'Referer': 'https://www.sofascore.com/football/match/arsenal-chelsea/NR',
    'Connection': 'keep-alive',
    #'Cookie': '_ga_HNQ9P9MGZR=GS1.1.1724919866.4.1.1724919963.32.0.0; __browsiSessionID=ec21ca3d-2836-4b37-bdf2-394a598cdcd7&true&SEARCH&gb&desktop-4.28.123&false; __browsiUID=63da2817-f606-43bd-bbde-1f3692805dc9; FCNEC=%5B%5B%22AKsRol-kFHX7Yzhzy5e2tsZo8_j5VIdzKIaOsckouf6XeXgHmmj-1zFZilY6K09jhTYI8wBRF-jquafduotTzk0JHhB9Rp_vdY_Z5JOHfFEzA2KV6sfiy6MinMYlA-KPHz3OOS0P5g-qvcnxrzcqqU8MBgB7p4xjFw%3D%3D%22%5D%5D; cto_bundle=sNMBJ19hR0c3eHI4NkMwMnBLbUF1cXRHWkc4QzBZVGEyVUcxcUE4b3ZxMmglMkI4JTJCdzNKbTIxcWJBdHlaUEhNJTJCSHFwenI2ajZkOW5lRGQwa0NQNkZwMENlQ0dvVTdKd1JkNlY1ZWVteHZ2dTQyM0F2WmY0aXFHbzkyeFlaVnNSZkxOYjZKWA; __eoi=ID=15ca5a67fa7ab7de:T=1724840295:RT=1724919866:S=AA-AfjboWmuFLU54absAIy8E53XX; __gads=ID=e8bfc438140ca62a:T=1724840295:RT=1724919866:S=ALNI_Mb8oXgZEbMqH-xUnKI84tLH51zeqw; __gpi=UID=00000eaa4f3887ad:T=1724840295:RT=1724919866:S=ALNI_MaNs1wj75qHhn3jkaTE_opuNy8vfA; _ga=GA1.1.220782722.1724840295; gcid_first=bf063a4c-d0a7-4bc6-8825-64ca3b964122; FCCDCF=%5Bnull%2Cnull%2Cnull%2C%5B%22CQEEBUAQEEBUAEsACBENBDFoAP_gAEPgAA6II3gB5C5ETSFBYH51KIsEYAEHwAAAIsAgAAYBAQABQBKQAIQCAGAAEAhAhCACgAAAIEYBIAEACAAQAAAAAAAAIAAEIAAQAAAIICAAAAAAAABIAAAIAAAAEAAAwCAABAAA0AgEAJIISNgAAAAAAAAAAgAAAAAAAgAAAEhAAAEIAAAAACgAEABAEAAAAAEIABBII3gB5C5ETSFBYHhVIIsEIAEXwAAAIsAgAAYBAQABQBKQAIQCEGAAAAgAACAAAAAAIEQBIAEAAAgAAAAAAAAAIAAEAAAQAAAIICAAAAAAAABAAAAIAAAAEAAAwCAABAAAwQhEAJIASFgAAAAgAAAAAoAAAAAAAgAAAEhAAAEAAAAAAAAAEABAEAAAAAAAABBIAAA.dnAACAgAAAA%22%2C%222~70.89.108.149.211.313.358.415.486.540.621.981.1029.1046.1092.1097.1126.1205.1301.1516.1558.1584.1598.1651.1697.1716.1753.1810.1832.1985.2328.2373.2440.2571.2572.2575.2577.2628.2642.2677.2767.2860.2878.2887.2922.3182.3190.3234.3290.3292.3331.10631~dv.%22%2C%227CC1CE98-D2D1-4901-9D95-7C158EB525F7%22%5D%5D; __qca=P0-1961469510-1724840292643; _lr_env_src_ats=false',
    'Sec-Fetch-Dest': 'empty',
    'X-Requested-With': '975421',
}

response = requests.get('https://www.sofascore.com/api/v1/event/12190322/shotmap', cookies=cookies, headers=headers)
response.status_code

200
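To make the conversion step less of a black box, here is a rough sketch of what a cURL converter does: it splits the copied command into tokens and collects the URL and the `-H` headers. `parse_curl` is a hypothetical helper that handles only the URL, `-H`/`--header`, and flags without arguments; the example command is a shortened, hand-written version of a copied request, not the full original.

```python
import shlex


def parse_curl(command):
    """Extract the URL and headers from a simple cURL command (hypothetical helper).

    Only the URL and -H/--header options are understood in this sketch.
    """
    tokens = shlex.split(command)
    url = None
    headers = {}
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header"):
            # Split "Name: value" at the first colon only, so URLs in values survive
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip()] = value.strip()
            i += 2
        elif token.startswith("-"):
            i += 1  # flags without arguments, e.g. --compressed
        else:
            url = token
            i += 1
    return url, headers


example_command = (
    "curl 'https://www.sofascore.com/api/v1/event/12190322/shotmap' "
    "-H 'Accept: */*' "
    "-H 'Referer: https://www.sofascore.com/football/match/arsenal-chelsea/NR' "
    "--compressed"
)
api_url, api_headers = parse_curl(example_command)
print(api_url)
print(api_headers)
```

The resulting `api_url` and `api_headers` can be passed to `requests.get(api_url, headers=api_headers)` in the same way as the generated code above.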

In [6]:
# If you encounter a 403 error, try adding an If-Modified-Since header carrying a recent date and time.
headers['If-Modified-Since'] = 'Weds 28th Aug 00:00:00 GMT'
response = requests.get('https://www.sofascore.com/api/v1/event/12190322/shotmap', cookies = cookies, headers = headers)
response.status_code


200
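The cell above sets `If-Modified-Since` with a hand-typed date. HTTP headers normally carry dates in a fixed format such as `Wed, 28 Aug 2024 00:00:00 GMT`, and the standard library can produce it. The sketch below assumes the year 2024 for illustration (the original value gives only the day and month), and `http_date` is a hypothetical helper name. Note that a server which honours a valid `If-Modified-Since` value may answer with 304 and an empty body, so check the status code before parsing.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def http_date(moment: datetime) -> str:
    """Format a datetime as an HTTP-date string in GMT (hypothetical helper)."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


# The year 2024 is an assumption made for this example
example_date = http_date(datetime(2024, 8, 28, tzinfo=timezone.utc))
print(example_date)  # Wed, 28 Aug 2024 00:00:00 GMT

# For the current time instead:
now_header = http_date(datetime.now(timezone.utc))
```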

In [7]:
# Now we can read the shot data from the response as JSON and pull out the relevant information.
response_json = response.json()
allshots = response_json['shotmap']
allshots

[{'player': {'name': 'Gabriel Martinelli',
   'slug': 'gabriel-martinelli',
   'shortName': 'G. Martinelli',
   'position': 'F',
   'jerseyNumber': '11',
   'userCount': 64026,
   'id': 922573,
   'fieldTranslations': {'nameTranslation': {'ar': 'مارتينيلي, غابرييل'},
    'shortNameTranslation': {'ar': 'غ. مارتينيلي'}}},
  'isHome': True,
  'shotType': 'save',
  'situation': 'assisted',
  'playerCoordinates': {'x': 9.8, 'y': 37, 'z': 0},
  'bodyPart': 'right-foot',
  'goalMouthLocation': 'low-centre',
  'goalMouthCoordinates': {'x': 0, 'y': 50, 'z': 11.4},
  'blockCoordinates': {'x': 2, 'y': 46, 'z': 0},
  'xg': 0.41483238339424,
  'xgot': 0.0811,
  'id': 2762206,
  'time': 90,
  'addedTime': 7,
  'timeSeconds': 5769,
  'draw': {'start': {'x': 37, 'y': 9.8},
   'block': {'x': 46, 'y': 2},
   'end': {'x': 50, 'y': 0},
   'goal': {'x': 50, 'y': 88.6}},
  'reversedPeriodTime': 1,
  'reversedPeriodTimeSeconds': 531,
  'incidentType': 'shot'},
 {'player': {'name': 'Gabriel Martinelli',
   's
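Indexing `response_json['shotmap']` directly raises a bare `KeyError` if the API ever returns an error payload instead. A more defensive sketch is shown below, run against a trimmed copy of the first shot from the output above so that it works offline. `extract_shots` is a hypothetical helper, and `sample_payload` keeps only a few of the real fields.

```python
# Trimmed copy of the first record shown above; most fields are omitted
sample_payload = {
    "shotmap": [
        {
            "player": {"name": "Gabriel Martinelli", "id": 922573},
            "isHome": True,
            "shotType": "save",
            "xg": 0.41483238339424,
            "time": 90,
            "addedTime": 7,
        }
    ]
}


def extract_shots(payload):
    """Return the list of shots, or raise a clear error if it is missing (hypothetical helper)."""
    shots = payload.get("shotmap")
    if not isinstance(shots, list):
        raise ValueError("response has no 'shotmap' list")
    return shots


shots = extract_shots(sample_payload)
first = shots[0]
# 'addedTime' appears in the record above; .get() covers records that lack it
minute_label = f"{first['time']}+{first['addedTime']}" if first.get("addedTime") else str(first["time"])
print(first["player"]["name"], minute_label)  # Gabriel Martinelli 90+7
```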

In [8]:
# Converting into a dataframe and formatting. 
shots_df = pd.json_normalize(allshots)
shots_df['isHome'] = shots_df['isHome'].apply(lambda x: 'Arsenal' if x else 'Chelsea')
shots_df = shots_df[['time', 'player.name', 'isHome', 'xg', 'playerCoordinates.x', 'playerCoordinates.y', 'goalMouthCoordinates.y', 'goalMouthCoordinates.z', 'shotType', 'situation']]
shots_df = shots_df.rename(
    columns={'player.name': 'player_name', 'isHome': 'team', 'shotType': 'type', 'xg': 'xG'}
).sort_values(by=['team', 'time'])
shots_df


| | time | player_name | team | xG | playerCoordinates.x | playerCoordinates.y | goalMouthCoordinates.y | goalMouthCoordinates.z | type | situation |
|---|---|---|---|---|---|---|---|---|---|---|
| 33 | 5 | Leandro Trossard | Arsenal | 0.125989 | 4.8 | 34.4 | 53.7 | 1.3 | goal | assisted |
| 32 | 10 | Thomas Partey | Arsenal | 0.066875 | 20.3 | 66.0 | 47.7 | 19.0 | block | assisted |
| 31 | 15 | Takehiro Tomiyasu | Arsenal | 0.294427 | 4.8 | 54.1 | 46.7 | 72.2 | miss | corner |
| 26 | 22 | Declan Rice | Arsenal | 0.069264 | 18.5 | 44.3 | 45.5 | 61.1 | miss | assisted |
| 27 | 22 | Bukayo Saka | Arsenal | 0.032008 | 19.0 | 54.9 | 45.4 | 19.0 | block | assisted |
| 25 | 23 | Thomas Partey | Arsenal | 0.045113 | 20.1 | 63.6 | 52.0 | 19.0 | block | assisted |
| 22 | 26 | Kai Havertz | Arsenal | 0.128959 | 13.5 | 56.8 | 46.4 | 3.2 | save | assisted |
| 23 | 26 | Bukayo Saka | Arsenal | 0.029585 | 12.6 | 65.6 | 47.7 | 0.6 | save | assisted |
| 24 | 26 | Leandro Trossard | Arsenal | 0.019448 | 13.3 | 28.4 | 45.7 | 19.0 | block | assisted |
| 21 | 27 | Leandro Trossard | Arsenal | 0.278696 | 10.2 | 59.9 | 61.0 | 5.6 | miss | assisted |
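The dotted column names such as `player.name` and `playerCoordinates.x` come from `pd.json_normalize`, which flattens nested dictionaries into one column per leaf value. The sketch below shows this on two small records. The first uses values from the table above; the second is a made-up away-team placeholder, included only so that both branches of the team mapping are exercised.

```python
import pandas as pd

sample_shots = [
    {   # values taken from the first row of the table above
        "player": {"name": "Leandro Trossard"},
        "isHome": True,
        "xg": 0.125989,
        "playerCoordinates": {"x": 4.8, "y": 34.4},
        "time": 5,
    },
    {   # made-up placeholder record, not real match data
        "player": {"name": "Placeholder Away Player"},
        "isHome": False,
        "xg": 0.05,
        "playerCoordinates": {"x": 15.0, "y": 50.0},
        "time": 12,
    },
]

flat = pd.json_normalize(sample_shots)
# Nested keys are joined with "." to form the column names
print(sorted(flat.columns))

# Equivalent to the lambda used above: home side is Arsenal, away side is Chelsea
flat["team"] = flat["isHome"].map({True: "Arsenal", False: "Chelsea"})
print(flat[["time", "player.name", "team"]])
```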


In [9]:
# Filter through to obtain shots made by Arsenal players only
arsenal_shots_df = shots_df[shots_df['team'] == 'Arsenal']
arsenal_shots_df

| | time | player_name | team | xG | playerCoordinates.x | playerCoordinates.y | goalMouthCoordinates.y | goalMouthCoordinates.z | type | situation |
|---|---|---|---|---|---|---|---|---|---|---|
| 33 | 5 | Leandro Trossard | Arsenal | 0.125989 | 4.8 | 34.4 | 53.7 | 1.3 | goal | assisted |
| 32 | 10 | Thomas Partey | Arsenal | 0.066875 | 20.3 | 66.0 | 47.7 | 19.0 | block | assisted |
| 31 | 15 | Takehiro Tomiyasu | Arsenal | 0.294427 | 4.8 | 54.1 | 46.7 | 72.2 | miss | corner |
| 26 | 22 | Declan Rice | Arsenal | 0.069264 | 18.5 | 44.3 | 45.5 | 61.1 | miss | assisted |
| 27 | 22 | Bukayo Saka | Arsenal | 0.032008 | 19.0 | 54.9 | 45.4 | 19.0 | block | assisted |
| 25 | 23 | Thomas Partey | Arsenal | 0.045113 | 20.1 | 63.6 | 52.0 | 19.0 | block | assisted |
| 22 | 26 | Kai Havertz | Arsenal | 0.128959 | 13.5 | 56.8 | 46.4 | 3.2 | save | assisted |
| 23 | 26 | Bukayo Saka | Arsenal | 0.029585 | 12.6 | 65.6 | 47.7 | 0.6 | save | assisted |
| 24 | 26 | Leandro Trossard | Arsenal | 0.019448 | 13.3 | 28.4 | 45.7 | 19.0 | block | assisted |
| 21 | 27 | Leandro Trossard | Arsenal | 0.278696 | 10.2 | 59.9 | 61.0 | 5.6 | miss | assisted |


In [10]:
# Exporting as a .csv file
arsenal_shots_df.to_csv('afc_shots_cfc_2324.csv', index = False )
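The `index = False` argument matters when the file is read back. The sketch below writes a tiny DataFrame to an in-memory buffer with and without the index and shows the columns each version produces on reload; the two rows are taken from the table above, and the buffers stand in for real files.

```python
import io

import pandas as pd

# Two rows from the table above, keeping their original index labels
df = pd.DataFrame(
    {"time": [5, 10], "player_name": ["Leandro Trossard", "Thomas Partey"]},
    index=[33, 32],
)

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)  # ['Unnamed: 0', 'time', 'player_name']

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()
print(columns_without_index)  # ['time', 'player_name']
```

If the original row labels are needed later, an alternative is to keep the index when writing and pass `index_col=0` to `pd.read_csv` when reading.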