## Ofsted Inspection Reports Scrape Tool 


<b>Summary:</b><br>
<span style="font-size:10pt">Scrapes every local authority's 'Children's Services Ofsted Inspection' reports in pdf format* and creates a summary overview report by LA, with key inspection data points, in either csv or xlsx format. The Children's Services full inspection reports (i.e. not interim or monitoring visits) are downloaded locally as their original pdf, organised into folders** by local authority name and URN. The LA naming is cleaned to better match the more standard LA naming used in onward processes, e.g. 'Nottingham City Council' becomes 'Nottingham' and 'Barnsley Metropolitan Borough Council' becomes 'Barnsley'. </span><br>
    
<ul style="font-size:10pt; list-style-type:disc; margin-left: 20px;">
    <li>*from https://reports.ofsted.gov.uk/ .</li>
    <li>**folder naming convention: \provider_urn_local_authority_name (lowercase)\...pdf inspection files.</li>
</ul> 
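
<span style="font-size:10pt">As an illustration of the naming clean-up and folder convention described above, here is a minimal sketch that mirrors the cleaning rules used later in this notebook. The helper names <code>clean_name</code> and <code>folder_for</code>, and the URN <code>00000</code>, are placeholders for illustration only.</span>

```python
import os

# Phrases and single words stripped from provider names, mirroring this tool's cleaning rules
PHRASES = ["royal borough of ", "city of ", "metropolitan district council",
           "london borough of", "council of"]
WORDS = {"city", "metropolitan", "borough", "council", "county", "district", "the"}


def clean_name(name):
    """Lower-case the name, drop known phrases, then drop leftover single words."""
    name = name.lower().replace("  ", " ")
    for phrase in PHRASES:
        name = name.replace(phrase, "")
    return " ".join(part for part in name.split() if part not in WORDS)


def folder_for(urn, provider_name):
    """Build the per-authority folder path as <urn>_<cleaned name>."""
    return os.path.join(".", f"{urn}_{clean_name(provider_name)}")


# '00000' is a placeholder URN, not a real provider reference
print(clean_name("Nottingham City Council"))
print(clean_name("Barnsley Metropolitan Borough Council"))
print(folder_for("00000", "Nottingham City Council"))
```

<span style="font-size:10pt">Note that the cleaned names come out in lowercase, which is what the folder naming relies on.</span>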


<br>
<b>N.B./Pre-requisites:</b><br>
<span style="font-size:10pt">Relies on Ofsted's continued use of nonvisual css element descriptors on the web site. It is not ideal to rely on anything in the web-space, but any scrape process, however robust, is at the mercy of subsequent page changes. The tool avoids Selenium or similar, as that approach is more likely to be impacted by visual design changes on the page(s). Instead it relies on the underlying php search process and the associated php-generated links.</span>
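
<span style="font-size:10pt">To make the dependency on the nonvisual descriptors concrete, the sketch below shows how a descriptor string is used to select full inspection reports and to recover the publication date. The sample descriptor strings are assumptions about what the site returns, written for illustration, and the helper names <code>is_full_inspection</code> and <code>published_date</code> are hypothetical.</span>

```python
from datetime import datetime


def is_full_inspection(nonvisual_text):
    """True when the descriptor mentions children, services and inspection."""
    text = nonvisual_text.lower().strip()
    return all(term in text for term in ("children", "services", "inspection"))


def published_date(nonvisual_text):
    """Recover the publication date that follows the final '-' in the descriptor."""
    filename = nonvisual_text.lower().strip().replace(", pdf", "") + ".pdf"
    date_str = filename.split("-")[-1].strip().split(".")[0]
    return datetime.strptime(date_str, "%d %B %Y").date()


# Assumed descriptor formats, for illustration only
samples = [
    "Children's services inspection, pdf - 11 January 2023",
    "Children's services focused visit, pdf - 02 March 2022",
]

for sample in samples:
    if is_full_inspection(sample):
        print(sample, "->", published_date(sample).strftime("%d/%m/%y"))
    else:
        print(sample, "-> skipped")
```

<span style="font-size:10pt">If Ofsted changes the wording or layout of these descriptors, both the selection test and the date extraction would need revisiting.</span>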

<b>Backlog/to-do:</b><br>

<ul style="font-size:10pt; list-style-type:disc; margin-left: 20px;">
    <li>Moved to Trello: https://trello.com/c/4TihKpvQ</li>

</ul> 

<b>Known bugs:</b><br>

<ul style="font-size:10pt; list-style-type:disc; margin-left: 20px;">
    <li>Moved to Trello</li>
    
</ul> 

In [70]:
#
# Export options

export_summary_filename = 'ofsted_childrens_services_overview'
export_file_type         = 'csv' # 'excel' / 'csv' currently supported


# scrape inspection grade/data from pdf reports
pdf_data_capture = True # True is default (scrape within pdf inspection reports)
                        # This impacts run time, e.g. False == ~1m20 / True == ~4m10
                        # False == only the pdfs and a list of LAs + link to most recent report are exported, not grades

In [71]:
#
# Ofsted site/page admin settings

short_inspection_threshold    = 7 # ILACS inspection duration in days
standard_inspection_threshold = 14

max_page_results = 200 # Set max number of search results to show on page(MUST be > total number of LA's!) 
url_stem = 'https://reports.ofsted.gov.uk/'
search_url = 'search?q=&location=&lat=&lon=&radius=&level_1_types=3&level_2_types%5B%5D=12' # On to-do list
max_page_results_url = '&rows=' + str(max_page_results) # Coerce results page to display ALL providers on single results page without next/pagination

# resultant complete url to process
url = url_stem + search_url + max_page_results_url




In [72]:
#
# Script admin settings

# Keep warnings quiet unless priority
import logging
import subprocess
import warnings
logging.getLogger('org.apache.pdfbox').setLevel(logging.ERROR)
warnings.filterwarnings('ignore')
                    

In [73]:
# # Non-standard modules that might need installing
# !pip install PyPDF2
# !pip install tabula-py
# !pip install textblob
# !pip install gensim

In [74]:

import os
import requests
import pandas as pd
from bs4 import BeautifulSoup

from requests.exceptions import RequestException

# pdf search/data extraction
import io
import tabula   
import PyPDF2   
import re       

# used in handling inspection dates
from dateutil import parser 
from datetime import datetime

# nlp stuff for sentiment
import nltk # stop words used in get_sentiment_and_topics(); needs nltk.download('stopwords') once
from textblob import TextBlob
from gensim import corpora, models

In [75]:
#
# Function defs

def get_soup(url):
    """
    Given a URL, returns a BeautifulSoup object.
    Args: url (str): The URL to fetch and parse.
    Returns: BeautifulSoup: The parsed HTML content.
    """
    timeout_seconds = 10 # lets not assume the Ofsted page is up

    try:
        response = requests.get(url, timeout=timeout_seconds)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup
    except RequestException as e:
        print(f"An error occurred while fetching the URL '{url}': {e}")
        return None


def clean_provider_name(name):
    """
    Cleans the la/provider name according to:
                - expected output based on existing ILACS sheet
                - historic string issues seen on Ofsted site

    Args:
        name (str): The original name to be cleaned.
    Returns:
        str: The cleaned name.
    """
    # Convert to lowercase and remove extra spaces
    name = name.lower().replace('  ', ' ')
    
    # Remove specific phrases
    name = name.replace("royal borough of ", "").replace("city of ", "").replace("metropolitan district council", "").replace("london borough of", "").replace("council of", "")
    
    # Remove further undesired 'single' words and join the remaining parts
    name_parts = [part for part in name.split() if part not in ['city', 'metropolitan', 'borough', 'council', 'county', 'district', 'the']]
    return ' '.join(name_parts)


def get_framework_type(start_date, end_date, short_inspection_threshold, standard_inspection_threshold):
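    """
    Returns an inspection framework type based on the duration between the start and end dates.

    Args:
        start_date (datetime.date or None): Inspection start date.
        end_date (datetime.date or None): Inspection end date.
        short_inspection_threshold (int): Duration threshold in days for a short inspection.
        standard_inspection_threshold (int): Duration threshold in days for a standard inspection.

    Returns:
        str: Inspection framework type: "short", "standard", "inspection duration longer than standard framework",
             or an "invalid ..." message if the dates are missing or out of order.
    """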

    # Check if both start and end dates have been accessible
    if start_date is not None and end_date is not None:

        # Check if end date is not earlier than start date
        if end_date < start_date:
            inspection_framework_str = "invalid end or start date extracted"

        # Calculate the number of days between inspection start and end dates
        else:
            delta = end_date - start_date
            inspection_duration_days = delta.days

            # Determine the inspection framework based on the duration days
            # Note: Needs further investigation to sense check real-world timeframes here, i.e. are thresholds 'working days'?
            # For most instances this appears to be sufficiently accurate as-is. 
            if inspection_duration_days <= short_inspection_threshold:
                inspection_framework_str = "short"
            elif short_inspection_threshold < inspection_duration_days <= standard_inspection_threshold + 1:
                inspection_framework_str = "standard"
            else:
                inspection_framework_str = "inspection duration longer than standard framework"

    # Handle cases where start or end date is not provided 
    # Note: end date most likely to have not been extracted due to formatting issues
    else:
        inspection_framework_str = "invalid date format"

    return inspection_framework_str


def format_date(date_str: str, input_format: str, output_format: str) -> str:
    """
    Convert and format a date string.

    Args:
        date_str (str): The input date string.
        input_format (str): The format of the input date string.
        output_format (str): The desired output format.

    Returns:
        str: The formatted date string.
    """
    dt = datetime.strptime(date_str, input_format)
    date_obj = dt.date()

    return date_obj.strftime(output_format)


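def parse_date(date_str, input_format):
    """
    Parses a date string into a date object.

    Args:
        date_str (str or None): The input date string.
        input_format (str): The format of the input date string.

    Returns:
        datetime.date or None: The parsed date, or None if date_str is missing or does not match input_format.
    """
    if date_str is None:
        return None

    try:
        dt = datetime.strptime(date_str, input_format)
    except ValueError:
        return None

    return dt.date()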


def format_date_for_report(date_obj, output_format_str):
    """
    Formats a date object as a string in the given output format, or returns an empty string if the input is None.

    Args:
        date_obj (datetime.date or None): The date object to format, or None.
        output_format_str (str): The strftime format to use, e.g. "%d/%m/%Y".

    Returns:
        str: The formatted date string, or an empty string if date_obj is None.
    """
    if date_obj is not None:
        return date_obj.strftime(output_format_str)
    else:
        return ""


def extract_inspection_grade(row, column_name):
    """
    Extracts the grade from the given row and column name. If the grade contains
    the phrase "requires improvement", it returns the cleaned-up value.
    
    Args:
        row (pd.Series): A row from a Pandas DataFrame.
        column_name (str): The name of the column containing the grade.
    
    Returns:
        str: The extracted grade.
    
    Raises:
        ValueError: If the grade value cannot be converted to a string.
    """
    try:
        grade = str(row[column_name])
    except Exception as e:
        # grade is not yet bound here, so report the column and the underlying error instead
        raise ValueError(f"unknown value found in column '{column_name}': {e}") from e

    if "requires improvement" in grade.lower():
        # Some RI text has further comment that we don't want, i.e. 'RI, *to become good*'
        grade = "Requires improvement"
    return grade
    


def extract_inspection_data(pdf_content):
    """
    Extracts the inspector's name, overall Ofsted grade, and inspection dates from the first page of a PDF report.

    Args:
        pdf_content (bytes): The content of the PDF file as bytes.

    Returns:
        dict: A dictionary containing the inspector's name, overall Ofsted grade, and inspection dates, or None if not found.

    Notes:
        This function extracts information from the first page of the PDF report. The inspector's name is extracted using a
        regular expression search for the string "Lead inspector:". The overall Ofsted grade is extracted from a table that
        appears on the first page of the report. The function uses the tabula library to extract the table data. The inspection
        dates are also extracted using a regular expression search for the string "Inspection dates:". The function attempts to
        parse the inspection dates into date objects (formatting as "dd/mm/yyyy" is left to the caller). The final output is a
        dictionary containing the extracted information, with None for any item that could not be found.

    Raises:
        ValueError: Raised when an unknown grade value is found during grade extraction.
    """

    # Create a file-like buffer for the PDF content
    with io.BytesIO(pdf_content) as buffer:
        # Read the PDF content for text extraction
        reader = PyPDF2.PdfReader(buffer)
        page = reader.pages[0]
        text = page.extract_text()

        # Find the inspector's name using a regular expression
        match = re.search(r"Lead inspector:\s*(.+)", text)
        if match:
            inspector_name = match.group(1)
            
            inspector_name = inspector_name.split(',')[0].strip()       # Remove everything after the first comma (some contain '.., Her Majesty’s Inspector')
            inspector_name = inspector_name.replace("HMI", "").rstrip() # Remove "HMI" and any trailing spaces(some inspectors add this to name)

        else:
            inspector_name = None

        # Read the PDF and extract the table on the first page
        try:
            buffer.seek(0)  # Reset the buffer position to the beginning
            tables = tabula.read_pdf(buffer, pages=1, multiple_tables=True)
        except Exception as e:
            print(f"An error occurred while reading the PDF: {e}")
            tables = []

    # Initialize variables to store inspection grades
    inspection_grade = None
    impact_of_leaders_grade_str = None
    help_and_protection_grade_str = None
    care_and_care_leavers_grade_str = None

    # Loop through tables to find the table containing inspection grades
    # (Obv at the moment only 1, but just in case someone adds another)
    for table in tables:
        # Check if table contains necessary columns
        if 'Judgement' in table.columns and 'Grade' in table.columns:
            # Iterate through rows of the table
            for index, row in table.iterrows():
                # Treat NaN/Null judgements as empty, otherwise convert to lower case for case-insensitive matching
                if pd.isna(row['Judgement']):
                    judgement = ''
                else:
                    judgement = str(row['Judgement']).lower()

                # Check if report summary table/row contains 'Overall effectiveness'
                if 'overall effectiveness' == judgement:
                    inspection_grade = extract_inspection_grade(row, 'Grade')

                # Check if report summary table/row contains 'The impact of leaders on social work practice with children and families'
                elif re.search('impact of leaders', judgement):
                    impact_of_leaders_grade_str = extract_inspection_grade(row, 'Grade')

                # Check if report summary table/row contains 'The experiences and progress of children who need help and protection'
                elif re.search('help and protection', judgement):
                    help_and_protection_grade_str = extract_inspection_grade(row, 'Grade')

                # Check if report summary table/row contains 'The experiences and progress of children in care and care leavers'
                elif re.search('care and care leavers', judgement):
                    care_and_care_leavers_grade_str = extract_inspection_grade(row, 'Grade')


            # Once the overall grade has been found, stop searching further tables
            # (the other three grades are optional and may legitimately be missing)
            if inspection_grade is not None:
                break


    # Find the inspection dates using a regular expression
    date_match = re.search(r"Inspection dates:\s*(.+)", text)

    if date_match:
        # IF there was date data


        inspection_dates = date_match.group(1).strip()
            
        # Some initial clean up based on historic data obs
        inspection_dates = inspection_dates.replace(".", "")
        inspection_dates = inspection_dates.replace("\u00A0", " ") # Remove non-breaking space (Seen in nottingham report)
        inspection_dates = re.sub(r"[\u2012\u2013\u2014\u2212\-]+", " to ", inspection_dates) # replace en dash char ("\u2013"), em dash ("\u2014"), or ("-") 
        inspection_dates = inspection_dates.split("and")[0].strip() # Need this because we have such as :
                                                                    # "8 July 2019 to 12 July 2019 and 7 August 2019 to 8 August 2019"
                                                                    # E.g. Derbyshire
        inspection_dates = re.sub(r'(\d)\s(\d)', r'\1\2', inspection_dates) # Fix white spaces between date numbers e.g. "wiltshire,	1 9 June 2019"



        if isinstance(inspection_dates, str):
            # data was as expected
            year_match = re.search(r"\d{4}", inspection_dates)
            if year_match:
                year = year_match.group(0) # get single copy of yyyy

                # Now remove the year from the inspection_dates string
                inspection_dates_cleaned = inspection_dates.replace(year, "").strip()

            else:
                # We had inspection_dates data but no recognisable year
                year = None
                inspection_dates_cleaned = inspection_dates

        else:
            # spurious data
            # inspection_dates arrived with non-str, set default val
            print("Error: inspection_dates is not a string. Type is", type(inspection_dates))
            inspection_dates_cleaned = None 


        # Now that we have already removed/cleaned those with 'and .....'
        # Split the inspection_dates_cleaned string using ' to ' as the delimiter and limit the number of splits to 1
        date_parts = inspection_dates_cleaned.split(' to ', maxsplit=1) # expect only 1 instance of 'to' between date vals
        

  
        # Get the separate inspection date(s) 
        start_date = date_parts[0].strip()
        end_date = date_parts[1].strip() if len(date_parts) > 1 else None
        
        # Check if the month text is written in *both* the date strings
        # Required work-around as Ofsted reports contain inspection date strings in multiple formats (i/ii/iii...)
        #   i)      "15 to 26 November"  
        #   ii)     "28 February to 4 March" or "8 October to 19 October" (majority)
        #   iii)    ['8 July ', '12 July   and 7 August  to'] (*recently seen)
        #   iv)     "11 September 2017 to 5 October 2017" (double year)
        #   v)      "Inspection dates: 19 November–30 November 2018" (Bromley)
        if len(start_date) <= 2 and end_date: # i.e. do we only have a day number with no month text
            end_date_parts = end_date.split()
            if len(end_date_parts) > 1:
                # Borrow the month text from the end date
                start_date = f"{start_date} {end_date_parts[1]}"

        # Append the inspection year to the start_date and end_date
        start_date_str = f"{start_date} {year}"
        end_date_str = f"{end_date} {year}" if end_date else None


        # format current str dates (as dt objects)
        start_date_formatted = parse_date(start_date_str, '%d %B %Y') #  str '8 January 2021' 
        end_date_formatted = parse_date(end_date_str, '%d %B %Y')

        # calculate inspection duration and return framework string
        # Note: Problems arising here generally relate to the end_date extraction from pdf
        inspection_framework_str = get_framework_type(start_date_formatted, end_date_formatted, short_inspection_threshold, standard_inspection_threshold)

    else:
        # unable to extract the data or it didn't exist
        start_date_formatted = None
        end_date_formatted = None
        inspection_framework_str = None


    return {'inspector_name': inspector_name, 
            'overall_inspection_grade': inspection_grade,
            'inspection_start_date': start_date_formatted,
            'inspection_end_date': end_date_formatted,
            'inspection_framework': inspection_framework_str,
            'impact_of_leaders_grade': impact_of_leaders_grade_str,
            'help_and_protection_grade': help_and_protection_grade_str,
            'care_and_care_leavers_grade': care_and_care_leavers_grade_str
            }


def process_provider_links(provider_links):
    """
    Processes provider links and returns a list of dictionaries containing URN, local authority, and inspection link.

    Args:
        provider_links (list): A list of BeautifulSoup Tag objects representing provider links.

    Returns:
        list: A list of dictionaries containing URN, local authority, inspection link, and, if enabled, additional inspection data.
    """
    
    data = []
    global pdf_data_capture # Bool flag
    
    for link in provider_links:
        # Extract the URN and provider name from the web link shown
        urn = link['href'].rsplit('/', 1)[-1]
        name = clean_provider_name(link.text.strip())

        # Create the provider directory path
        provider_dir = os.path.join('.', urn + '_' + name)

        # Create the provider directory if it doesn't exist
        if not os.path.exists(provider_dir):
            os.makedirs(provider_dir)

        # Get the child page content
        child_url = 'https://reports.ofsted.gov.uk' + link['href']
        child_soup = get_soup(child_url)
        if child_soup is None:
            # Provider page could not be fetched; skip this provider
            continue

        # Find all publication links in the provider's child page
        pdf_links = child_soup.find_all('a', {'class': 'publication-link'})

        # Initialize a flag to indicate if an inspection link has been found
        # Important: This assumes that the provider's reports are returned/organised most recent FIRST
        found_inspection_link = False

        # Iterate through the publication links
        for pdf_link in pdf_links:

            # Check if the current/next href-link meets the selection criteria
            # This block obv relies on Ofsted continued use of nonvisual element descriptors
            # containing the type(s) of inspection text. We use  "children's services inspection"

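            nonvisual_span = pdf_link.select_one('span.nonvisual')
            if nonvisual_span is None:
                # No descriptor on this link, so we cannot tell what type of report it is
                continue
            nonvisual_text = nonvisual_span.text.lower().strip()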

            # For now at least, search terms hard-coded. 
            if 'children' in nonvisual_text and 'services' in nonvisual_text and 'inspection' in nonvisual_text:


                # Create the filename and download the PDF (this filetype needs to be hard-coded here)
                filename = nonvisual_text.replace(', pdf', '') + '.pdf'

                pdf_content = requests.get(pdf_link['href'], timeout=10).content # same timeout as get_soup()
                with open(os.path.join(provider_dir, filename), 'wb') as f:
                    f.write(pdf_content)


                # Extract the local authority and inspection link, and add the data to the list
                if not found_inspection_link:

                    # Capture the data that will be exported about the most recent inspection only
                    local_authority = provider_dir.split('_', 1)[-1].replace('_', ' ').strip()
                    inspection_link = pdf_link['href']
                    
                    # Extract the report published date
                    report_published_date_str = filename.split('-')[-1].strip().split('.')[0] # published date appears after '-' 
            
                    # format the published date as a dd/mm/yy string
                    report_published_date = format_date(report_published_date_str, '%d %B %Y', '%d/%m/%y')

                    # Now get the in-document data
                    if pdf_data_capture:
                        # Opt1 : ~x4 slower runtime
                        # Only here if we have set PDF text scrape flag to True
                        # Turn this off, speeds up script if we only need the inspection documents themselves to be retrieved

                        # Scrape inside the pdf inspection reports
                        inspection_data_dict = extract_inspection_data(pdf_content)
                        
                        # Added for readability of returned data/onward
                        overall_effectiveness = inspection_data_dict['overall_inspection_grade']
                        inspector_name = inspection_data_dict['inspector_name']
                        inspection_start_date = inspection_data_dict['inspection_start_date']
                        inspection_end_date = inspection_data_dict['inspection_end_date']
                        inspection_framework = inspection_data_dict['inspection_framework']

                        impact_of_leaders_grade = inspection_data_dict['impact_of_leaders_grade']
                        help_and_protection_grade = inspection_data_dict['help_and_protection_grade']
                        care_and_care_leavers_grade = inspection_data_dict['care_and_care_leavers_grade']

                        # format dates for output                       
                        inspection_start_date_formatted = format_date_for_report(inspection_start_date, "%d/%m/%Y")
                        inspection_end_date_formatted = format_date_for_report(inspection_end_date, "%d/%m/%Y")

                        # Format the provider directory as a file path link for Excel
                        # needs a fix
                        provider_dir_link = 'file:///' + provider_dir.replace("\\", "/")
                                            
                        data.append({
                                        'urn': urn,
                                        'local_authority': local_authority,
                                        'inspection_link': inspection_link,
                                        'overall_effectiveness_grade': overall_effectiveness,
                                        'inspection_framework': inspection_framework,
                                        'inspector_name': inspector_name,
                                        'inspection_start_date': inspection_start_date_formatted,
                                        'inspection_end_date': inspection_end_date_formatted,
                                        'publication_date': report_published_date,
                                        'local_link_to_all_inspections': provider_dir_link,
                                        'impact_of_leaders_grade': impact_of_leaders_grade,
                                        'help_and_protection_grade': help_and_protection_grade,
                                        'care_and_care_leavers_grade': care_and_care_leavers_grade

                                    })
                        
                    else:
                        # Opt2 : ~x4 faster runtime
                        # Only grab the data/docs we can get direct off the Ofsted page 
                        data.append({'urn': urn, 'local_authority': local_authority, 'inspection_link': inspection_link})

                    
                    found_inspection_link = True # Flag to ensure data reporting on only the most recent inspection
    return data


def handle_pagination(soup, url_stem):
    """
    Handles pagination for a BeautifulSoup object representing a web page with paginated content.
    
    Args:
        soup (bs4.BeautifulSoup): The BeautifulSoup object representing the web page.
        url_stem (str): The base URL to which the relative path of the next page will be appended.
        
    Returns:
        str: The full URL of the next page if it exists, otherwise None.
    """
    
    # Find the pagination element in the soup object
    pagination = soup.find('ul', {'class': 'pagination'})

    # Check if the pagination element exists
    if pagination:
        # Find the next page button in the pagination element
        next_page_button = pagination.find('li', {'class': 'next'})

        # Check if the next page button exists
        if next_page_button:
            # Extract the relative URL of the next page
            next_page_url = next_page_button.find('a')['href']
            
            # Return the full URL of the next page by appending the relative URL to the base URL
            return url_stem + next_page_url

    # Return None if there is no next page button or pagination element
    return None


def save_data(data, filename, file_type='csv'):
    """
    Exports data to a specified file type.

    Args:
        data (list or dict): The data to be exported.
        filename (str): The desired name of the output file.
        file_type (str, optional): The desired file type, 'csv' or 'excel'. Defaults to 'csv'.

    Returns:
        None
    """
    if file_type == 'csv':
        filename_with_extension = filename + '.csv'
        pd.DataFrame(data).to_csv(filename_with_extension, index=False)

    elif file_type == 'excel':
        filename_with_extension = filename + '.xlsx'
        pd.DataFrame(data).to_excel(filename_with_extension, index=False)

    else:
        print(f"Error: unsupported file type '{file_type}'. Please choose 'csv' or 'excel'.")
        return
    
    print(f"{filename_with_extension} successfully created!")


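def get_sentiment_and_topics(pdf_filename):
    """
    Returns a sentiment polarity score and the top 3 LDA topics for a PDF report.

    Args:
        pdf_filename (str): Path to the PDF file to analyse.

    Returns:
        tuple: (sentiment polarity as a float, list of topic strings).
    """
    # Read the PDF file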
    with open(pdf_filename, 'rb') as buffer:
        reader = PyPDF2.PdfReader(buffer)
        text = ''
        for page in reader.pages:
            text += page.extract_text()

    # Perform sentiment analysis on the extracted text
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    
    # Identify key themes from the extracted text
    # First, preprocess the text by tokenizing and removing stop words
    tokens = [word for sentence in blob.sentences for word in sentence.words]
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]
    
    # Create a dictionary from the tokenized text
    dictionary = corpora.Dictionary([tokens])
    
    # Create a corpus from the dictionary and the tokenized text
    corpus = [dictionary.doc2bow(tokens)]
    
    # Create an LDA model from the corpus
    lda_model = models.LdaModel(corpus, num_topics=3, id2word=dictionary)
    
    # Get the top 3 topics from the LDA model
    topics = [lda_model.print_topic(topic_num) for topic_num in range(3)]
    
    return sentiment, topics
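
The inspection date strings handled in extract_inspection_data() come in several layouts (day range with one month, two months, dash-separated, double year). The sketch below is a condensed, standalone version of that clean-up so the layouts can be tried in isolation. The function name parse_inspection_dates is hypothetical and it returns None for anything it cannot parse, which is an assumption about the desired behaviour rather than what the cell above does.

```python
import re
from datetime import datetime


def parse_inspection_dates(raw):
    """Return (start, end) as date objects; either may be None if not recoverable."""
    s = raw.replace(".", "").replace("\u00A0", " ")
    s = re.sub(r"[\u2012\u2013\u2014\u2212\-]+", " to ", s)  # dashes become ' to '
    s = s.split("and")[0].strip()                             # keep the first range only
    s = re.sub(r"(\d)\s(\d)", r"\1\2", s)                     # '1 9 June' becomes '19 June'

    year_match = re.search(r"\d{4}", s)
    if not year_match:
        return None, None
    year = year_match.group(0)

    # Remove every copy of the year, then split into start and end parts
    parts = [p.strip() for p in s.replace(year, "").split(" to ", maxsplit=1)]
    start = parts[0]
    end = parts[1] if len(parts) > 1 and parts[1] else None

    # Start has a day number only, so borrow the month text from the end date
    if start.isdigit() and end and len(end.split()) > 1:
        start = f"{start} {end.split()[1]}"

    def to_date(text):
        try:
            return datetime.strptime(f"{text} {year}", "%d %B %Y").date()
        except (TypeError, ValueError):
            return None

    return to_date(start), (to_date(end) if end else None)


examples = [
    "15 to 26 November 2021",
    "28 February to 4 March 2022",
    "19 November\u201330 November 2018",
    "11 September 2017 to 5 October 2017",
    "8 July 2019 to 12 July 2019 and 7 August 2019 to 8 August 2019",
]
for example in examples:
    print(example, "->", parse_inspection_dates(example))
```

Subtracting the two returned dates gives the duration in days that get_framework_type() compares against the short and standard thresholds.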

In [76]:
# # In Progress / New Dev
# # LA to region lookup. For now, this hard-coded. 

local_authorities_lookup = {"Brent": ["GL", "304"], "Birmingham": ["WM", "330"], "Haringey": ["GL", "309"], "Bexley": ["GL", "303"], "East Riding of Yorkshire": ["YH", "811"], "Hertfordshire": ["E", "919"], "Rochdale": ["NW", "354"], "Slough": ["SE", "871"], "Bristol": ["SW", "801"], "Blackpool": ["NW", "890"], "Lancashire": ["NW", "888"], "Trafford": ["NW", "358"], "Bradford": ["YH", "380"], "Kingston upon Hull": ["YH", "810"], "Norfolk": ["E", "926"], "Wandsworth": ["GL", "212"], "Solihull": ["WM", "334"], "Dudley": ["WM", "332"], "Lambeth": ["GL", "208"], "North Lincolnshire": ["YH", "813"], "Darlington": ["NE", "841"], "Stoke-on-Trent": ["WM", "861"], "West Northamptonshire": ["EM", "928"], "North Northamptonshire": ["EM", "928"], "Southwark": ["GL", "210"], "Cumbria": ["NW", "909"], "Herefordshire": ["WM", "884"], "Newham": ["GL", "316"], "Somerset": ["SW", "933"], "Nottingham": ["EM", "892"], "Luton": ["E", "821"], "Rotherham": ["YH", "372"], "Coventry": ["WM", "331"], "Redcar and Cleveland": ["NE", "807"], "Bracknell Forest": ["SE", "867"], "Durham": ["NE", "840"], "Kent": ["SE", "886"], "Sandwell": ["WM", "333"], "Wigan": ["NW", "359"], "Camden": ["GL", "202"], "Stockport": ["NW", "356"], "Wolverhampton": ["WM", "336"], "Manchester": ["NW", "352"], "Derby": ["EM", "831"], "Torbay": ["SW", "880"], "West Berkshire": ["SE", "869"], "York": ["YH", "816"], "Bath and North East Somerset": ["SW", "800"], "Merton": ["GL", "315"], "Leeds": ["YH", "383"], "Sefton": ["NW", "343"], "Doncaster": ["YH", "371"], "Gloucestershire": ["SW", "916"], "Shropshire": ["WM", "893"], "Richmond upon Thames": ["GL", "318"], "Blackburn with Darwen": ["NW", "889"], "Central Bedfordshire": ["E", "823"], "Surrey": ["SE", "936"], "Sutton": ["GL", "319"], "Bournemouth, Christchurch & Poole": ["SW", "839"], "Buckinghamshire": ["SE", "825"], "Newcastle upon Tyne": ["NE", "391"], "Warwickshire": ["WM", "937"], "Bedford": ["E", "822"], "Wakefield": ["YH", "384"], "Bury": ["NW", "351"], 
"Milton Keynes": ["SE", "826"], "Knowsley": ["NW", "340"], "North East Lincolnshire": ["YH", "812"], "Walsall": ["WM", "335"], "Dorset": ["SW", "835"], "Leicester": ["EM", "856"], "Sunderland": ["NE", "394"], "Islington": ["GL", "206"], "North Tyneside": ["NE", "392"], "City of London": ["GL", "201"], "Halton": ["NW", "876"], "Rutland": ["EM", "857"], "North Somerset": ["SW", "802"], "Harrow": ["GL", "310"], "Croydon": ["GL", "306"], "Northumberland": ["NE", "929"], "Telford and Wrekin": ["WM", "894"], "Devon": ["SW", "878"], "Windsor and Maidenhead": ["SE", "868"], "Greenwich": ["GL", "203"], "Middlesbrough": ["NE", "806"], "Cheshire East": ["NW", "895"], "Southampton": ["SE", "852"], "Hackney": ["GL", "204"], "Thurrock": ["E", "883"], "Ealing": ["GL", "307"], "Kingston upon Thames": ["GL", "314"], "Cornwall": ["SW", "908"], "Isles of Scilly": ["SW", "420"], "Nottinghamshire": ["EM", "891"], "Leicestershire": ["EM", "855"], "St Helens": ["NW", "342"], "Reading": ["SE", "870"], "Hammersmith and Fulham": ["GL", "205"], "Kensington and Chelsea": ["GL", "207"], "Westminster": ["GL", "213"], "Medway": ["SE", "887"], "Sheffield": ["YH", "373"], "Southend-on-Sea": ["E", "882"], "Derbyshire": ["EM", "830"], "Lewisham": ["GL", "209"], "Warrington": ["NW", "877"], "Swindon": ["SW", "866"], "Northamptonshire": ["EM", "928"], "Wirral": ["NW", "344"], "Worcestershire": ["WM", "885"], "Kirklees": ["YH", "382"], "Tower Hamlets": ["GL", "211"], "Wiltshire": ["SW", "865"], "Wokingham": ["SE", "872"], "Stockton-on-Tees": ["NE", "808"], "Barnet": ["GL", "302"], "Tameside": ["NW", "357"], "Gateshead": ["NE", "390"], "Hampshire": ["SE", "850"], "Lincolnshire": ["EM", "925"], "Redbridge": ["GL", "317"], "Suffolk": ["E", "935"], "Cheshire West and Chester": ["NW", "896"], "Enfield": ["GL", "308"], "South Gloucestershire": ["SW", "803"], "West Sussex": ["SE", "938"], "Barking and Dagenham": ["GL", "301"], "Staffordshire": ["WM", "860"], "Waltham Forest": ["GL", "320"], "Oldham": ["NW", "353"], 
"Cambridgeshire": ["E", "873"], "Bromley": ["GL", "305"], "Essex": ["E", "881"], "Isle of Wight": ["SE", "921"], "Calderdale": ["YH", "381"], "Plymouth": ["SW", "879"], "Barnsley": ["YH", "370"], "Salford": ["NW", "355"], "Hounslow": ["GL", "313"], "Portsmouth": ["SE", "851"], "Hartlepool": ["NE", "805"], "Bournemouth": ["SW", "837"], "East Sussex": ["SE", "845"], "Brighton and Hove": ["SE", "846"], "North Yorkshire": ["YH", "815"], "Peterborough": ["E", "874"], "Havering": ["GL", "311"], "Liverpool": ["NW", "341"], "Bolton": ["NW", "350"], "Hillingdon": ["GL", "312"], "Oxfordshire": ["SE", "931"], "South Tyneside": ["NE", "393"] }
# test
print(local_authorities_lookup["South Tyneside"])

['NE', '393']
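
The lookup keys are title-cased, whereas clean_provider_name() returns lowercase names, so a direct dictionary lookup on a cleaned name would miss. One possible approach, sketched here on a three-entry excerpt of the lookup, is to build a lowercase index first. The names lookup_excerpt, lookup_lower and region_for are hypothetical and not part of the tool.

```python
# Small excerpt of the lookup above: LA name -> [region code, LA code]
lookup_excerpt = {
    "South Tyneside": ["NE", "393"],
    "Nottingham": ["EM", "892"],
    "Barnsley": ["YH", "370"],
}

# Index the same entries by lowercase name so cleaned provider names can be matched
lookup_lower = {name.lower(): codes for name, codes in lookup_excerpt.items()}


def region_for(cleaned_name):
    """Return the region code for a cleaned (lowercase) LA name, or None if unknown."""
    codes = lookup_lower.get(cleaned_name.lower().strip())
    return codes[0] if codes else None


print(region_for("nottingham"))
print(region_for("south tyneside"))
print(region_for("not a real authority"))
```

Names that the cleaning step changes beyond case would still need separate handling. For example, 'City of London' is cleaned to 'london', which would not match its lookup key.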


In [77]:
#
# Scrape data

data = []
while True:
    # Fetch and parse the HTML content of the current URL
    soup = get_soup(url)
    if soup is None:
        # Search page could not be fetched, so there is nothing to process
        break
    
    # Find all 'provider' links on the page
    provider_links = soup.find_all('a', href=lambda href: href and '/provider/' in href)

    # Process the provider links and extend the data list with the results
    data.extend(process_provider_links(provider_links))

    
    # Since all results are on a single page, no need to handle pagination. 
    # Processing complete.   
    break


# Export summary data
save_data(data, export_summary_filename, export_file_type)

Got stderr: May 02, 2023 3:07:13 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
May 02, 2023 3:07:14 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>

Got stderr: May 02, 2023 3:07:28 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>

Got stderr: May 02, 2023 3:08:00 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



ofsted_childrens_services_overview.csv successfully created!
