# **Web Scraping Task 3: Using cURL to Extract Horse Race Data**

### **Background** ###
This notebook provides a step-by-step guide to scraping data by using cURL commands to fetch data directly from a website's API.
<br>

In this example, we'll scrape tomorrow's AU/NZ gallops horse race data from www.swiftbet.com.au.
<br> 

### **cURL Commands** ###

In a previous task I achieved a similar result by using selenium and bs4 to navigate, scrape and parse data from the website directly. This involved programming a selenium bot to navigate through the website and wait for JavaScript elements to render.
<br>

To improve on this, I used a cURL command to fetch the data directly from the website's API in JSON format. I found this far faster and more efficient than automating a bot and scraping the HTML, and it took far fewer lines of code. It also let me extract data without running a browser through selenium, which uses fewer resources. Overall, I found it a more robust way to extract the data.
<br>

It's worth noting, though, that because cURL is purely an HTTP client, it does not execute JavaScript and so cannot render or interact with dynamically generated content. Some websites load additional content via AJAX calls or JavaScript frameworks (React, Angular or Vue); that content does not appear in the HTML a plain cURL request returns. However, you may still be able to get it by finding the API endpoints that the JavaScript calls and requesting them directly.
<br>

To find the API endpoint and capture the request as a cURL command, you can:
<br>

1) Open the webpage you want to scrape and wait for it to finish loading the content you are after.

2) Open your browser's developer tools and identify the element that holds the data you want.

3) Navigate to the 'Network' tab and find the request whose response contains the data shown in that element.

4) Right-click the request and choose 'Copy as cURL'.

5) Use a cURL-to-code converter to turn the command into a Python script.

6) Import the script into VS Code.
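
To make step 5 less of a black box, here is a minimal sketch of the kind of work a cURL converter does. It is not the converter used for this notebook: `parse_curl` is a hypothetical helper, the `example.com` command is a placeholder, and only the URL and `-H`/`--header` flags are handled.

```python
import shlex


def parse_curl(command):
    """Split a simple 'Copy as cURL' string into a URL and a headers dict.

    Only the URL and -H/--header flags are handled; every other flag is
    skipped, so this is an illustration rather than a full converter.
    """
    # Browsers break long commands with a backslash at the end of each line
    tokens = shlex.split(command.replace('\\\n', ' '))
    if not tokens or tokens[0] != 'curl':
        raise ValueError("expected a command starting with 'curl'")

    url = None
    headers = {}
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token in ('-H', '--header') and i + 1 < len(tokens):
            # A header looks like 'Name: value'
            name, _, value = tokens[i + 1].partition(':')
            headers[name.strip()] = value.strip()
            i += 2
            continue
        if not token.startswith('-') and url is None:
            url = token
        i += 1
    return {'url': url, 'headers': headers}


# Placeholder command: example.com stands in for a real endpoint
example = """curl 'https://api.example.com/races?date=2024-09-06' \\
  -H 'locale: en' \\
  -H 'Referer: https://www.example.com/'"""

parsed = parse_curl(example)
print(parsed['url'])
print(parsed['headers'])
```

The resulting URL and headers map directly onto a `requests.get(url, headers=headers)` call, which is the shape of the script used later in this notebook.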
<br>

### **Step by Step Guide** ###
The steps outlined below will guide you through the entire process : 
<br>

1) **Import Libraries :** Import the relevant libraries. Some may require pip installs.

2) **Find Tomorrow's Date :** Retrieve and format tomorrow's date.

3) **Use cURL Command :** Use the cURL command to retrieve a JSON response.

4) **Extract JSON data containing Race Data :** Extract race data from the JSON response.

5) **Parse Through Data for Race Data :** Parse through the race data to extract the relevant horse race data for tomorrow.

6) **Get Ready to Construct DataFrame :** Check all lists are the same length and construct our pandas DataFrame.

7) **Export DataFrame :** Export Pandas DataFrame as a .CSV file. 
<br> 

By the end of this notebook, you will have a clear understanding of how to scrape and process web data using a cURL command. 

In [1]:
# Imports
from datetime import datetime, date, timedelta
import requests
import pandas as pd


In [2]:
# Retrieve tomorrow's date
today = date.today()
tomorrow = today + timedelta(days=1)

# Format date in a standard format expected by the API
formatted_tomorrow = tomorrow.strftime('%Y-%m-%d')  # Assuming the API expects 'YYYY-MM-DD'

# Send the request (converted from the cURL command) to the website's API endpoint
headers = {
    'sec-ch-ua': '"Chromium";v="128", "Not;A=Brand";v="24", "Google Chrome";v="128"',
    'website-version': '31.21.30',
    'client-brand': 'SWIFTBET',
    'locale': 'en',
    'sec-ch-ua-mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36',
    'Referer': 'https://www.swiftbet.com.au/',
    'sec-ch-ua-platform': '"macOS"',
}

params = {
    'order': 'true',
    'date': formatted_tomorrow  # Correctly format the date according to API expectations
}

response = requests.get('https://api.swiftbet.com.au/api/v2/combined/meetings/races', params=params, headers=headers)

# Check that the API call was successful; debug if necessary
print(response.status_code)

200
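
When debugging a request like the one above, it can help to see the exact URL that the `params` dictionary produces. The sketch below rebuilds that URL with the standard library only, so nothing is sent over the network. `build_races_url` is a hypothetical helper name, and the fixed date is used purely to keep the example reproducible.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

BASE_URL = 'https://api.swiftbet.com.au/api/v2/combined/meetings/races'


def build_races_url(for_date):
    """Return the full request URL for one day's meetings."""
    params = {'order': 'true', 'date': for_date.strftime('%Y-%m-%d')}
    return f'{BASE_URL}?{urlencode(params)}'


# A fixed date keeps the example reproducible
sample_tomorrow = date(2024, 9, 5) + timedelta(days=1)
sample_url = build_races_url(sample_tomorrow)
print(sample_url)
```

Pasting the printed URL into a browser is a quick way to compare the raw response with what `requests` returns.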


In [3]:
# Convert the response to JSON and extract race data
response_json = response.json()
race_data = response_json['data']
race_data

[{'id': 1077078,
  'name': 'NEWBURY',
  'state': 'GBR',
  'track': 'GOOD',
  'weather': None,
  'type': 'R',
  'start_date': '2024-09-06 01:05:00',
  'grade': None,
  'country': 'UK',
  'next_race_date': '2024-09-06 01:05:00',
  'next_race_number': 1,
  'order': 0,
  'fixed_odds_enabled': False,
  'feature_products': [],
  'exotic_pools': [],
  'gbs_meeting_id': '526897',
  'srm_enabled': True,
  'races': [{'id': 1838455,
    'type': 'R',
    'start_date': '2024-09-06 01:05:00',
    'number': 1,
    'name': "Get Best Odds Guaranteed With Betvictor 'Newcomers' Maiden Filli",
    'status': 'selling',
    'result_string': None,
    'fixed_odds_enabled': False,
    'race_site_link': 'https://swiftbet.com.au//racing/gallops/newbury/race-1-1838455-1077078',
    'global_tote_available': False,
    'srm_enabled': True,
    'requested_at': 17255269517935,
    'srm_status': []},
   {'id': 1838495,
    'type': 'R',
    'start_date': '2024-09-06 01:40:00',
    'number': 2,
    'name': "Take A Luck
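
Because the raw output is long, a quick summary of which `country` and `type` values are present can help when deciding what to filter on. The sketch below assumes each meeting has the `country`, `type` and `races` keys seen in the output above. `summarise_meetings` is a hypothetical helper, and `sample_race_data` is a trimmed, illustrative stand-in for `race_data`.

```python
from collections import Counter

# Trimmed, illustrative stand-in for `race_data`
sample_race_data = [
    {'name': 'NEWBURY', 'country': 'UK', 'type': 'R',
     'races': [{'number': 1}, {'number': 2}]},
    {'name': 'MORNINGTON', 'country': 'AU', 'type': 'R',
     'races': [{'number': 1}]},
]


def summarise_meetings(meetings):
    """Count races per (country, type) pair."""
    counts = Counter()
    for meeting in meetings:
        key = (meeting.get('country'), meeting.get('type'))
        counts[key] += len(meeting.get('races', []))
    return counts


summary = summarise_meetings(sample_race_data)
for (country, race_type), total in summary.most_common():
    print(country, race_type, total)
```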

In [4]:
# Initialise the lists that will make up our DataFrame
refined_list = []
race_countries = []
race_locations = []
race_start_times = []
race_numbers = []
race_links = []

# Parse through the race data to extract the relevant horse race data for tomorrow
for meeting in race_data:

    # Keep only meetings where 'country' is 'AU' or 'NZ' and 'type' is 'R'
    if meeting['country'] in ('AU', 'NZ') and meeting['type'] == 'R':
        refined_list.append(meeting)

        # Extract the relevant information from each race at this meeting
        for race in meeting['races']:
            race_countries.append(meeting['country'])
            race_locations.append(meeting['name'])
            race_start_times.append(race['start_date'])
            race_numbers.append('Race ' + str(race['number']))
            race_links.append(race['race_site_link'])

In [5]:
# Check that all of the lists are the same length so we can construct our DataFrame
if len(race_countries) == len(race_numbers) == len(race_locations) == len(race_start_times) == len(race_links):
    print('all lists are the same length')
else:
    print('they are not the same length')

all lists are the same length
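
An alternative that makes the length check unnecessary is to collect one dictionary per race instead of five parallel lists. This is only a sketch of that idea, not the approach this notebook uses: `flatten_meetings` is a hypothetical helper, and `sample_meetings` is illustrative data with placeholder `example.com` links.

```python
def flatten_meetings(meetings, countries=('AU', 'NZ'), race_type='R'):
    """Return one dict per race for meetings matching the filters."""
    rows = []
    for meeting in meetings:
        if meeting['country'] not in countries or meeting['type'] != race_type:
            continue
        for race in meeting['races']:
            rows.append({
                'Race Country': meeting['country'],
                'Location': meeting['name'],
                'Race Number': f"Race {race['number']}",
                'Race Start Time': race['start_date'],
                'Race Links': race['race_site_link'],
            })
    return rows


# Illustrative stand-in for `race_data`; the links are placeholders
sample_meetings = [
    {'name': 'NEWBURY', 'country': 'UK', 'type': 'R', 'races': [
        {'number': 1, 'start_date': '2024-09-06 01:05:00',
         'race_site_link': 'https://example.com/newbury/1'}]},
    {'name': 'MORNINGTON', 'country': 'AU', 'type': 'R', 'races': [
        {'number': 1, 'start_date': '2024-09-06 12:30:00',
         'race_site_link': 'https://example.com/mornington/1'},
        {'number': 2, 'start_date': '2024-09-06 13:00:00',
         'race_site_link': 'https://example.com/mornington/2'}]},
]

rows = flatten_meetings(sample_meetings)
print(len(rows))
print(rows[0]['Race Number'])
```

Since every row carries all five fields, the columns cannot drift out of step, and `pd.DataFrame(rows)` would build a DataFrame with the same column names.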


In [10]:
# Construct our DataFrame
swiftbet_gallops_df = pd.DataFrame({
    'Race Country' : race_countries, 
    'Location' : race_locations, 
    'Race Number' : race_numbers, 
    'Race Start Time' : race_start_times, 
    'Race Links' : race_links})

# Check Result
swiftbet_gallops_df

Unnamed: 0,Race Country,Location,Race Number,Race Start Time,Race Links
0,AU,MORNINGTON,Race 1,2024-09-06 12:30:00,https://swiftbet.com.au//racing/gallops/mornin...
1,AU,MORNINGTON,Race 2,2024-09-06 13:00:00,https://swiftbet.com.au//racing/gallops/mornin...
2,AU,MORNINGTON,Race 3,2024-09-06 13:30:00,https://swiftbet.com.au//racing/gallops/mornin...
3,AU,MORNINGTON,Race 4,2024-09-06 14:00:00,https://swiftbet.com.au//racing/gallops/mornin...
4,AU,MORNINGTON,Race 5,2024-09-06 14:30:00,https://swiftbet.com.au//racing/gallops/mornin...
5,AU,MORNINGTON,Race 6,2024-09-06 15:00:00,https://swiftbet.com.au//racing/gallops/mornin...
6,AU,MORNINGTON,Race 7,2024-09-06 15:30:00,https://swiftbet.com.au//racing/gallops/mornin...
7,AU,MORNINGTON,Race 8,2024-09-06 16:00:00,https://swiftbet.com.au//racing/gallops/mornin...
8,AU,MORNINGTON,Race 9,2024-09-06 16:30:00,https://swiftbet.com.au//racing/gallops/mornin...
9,AU,MORNINGTON,Race 10,2024-09-06 17:00:00,https://swiftbet.com.au//racing/gallops/mornin...


In [9]:
# Get today's date and format the file name
today_date = datetime.now().date()
file_name = f'swiftbet_gallops_curl_data_{today_date}.csv'

# Export the DataFrame as a .csv file
swiftbet_gallops_df.to_csv(file_name, index=True)