## Ofsted Inspection Reports Scrape Tool 


<b>Summary:</b><br>
<span style="font-size:10pt">Scrapes every local authority's 'Children's Services Ofsted Inspection' reports* to create an enhanced version of the 'Ofsted ILACS Outcomes summary' report**. The summary can be output as either csv or Excel (xlsx). As well as creating the summary report, the process downloads the Children's Services <b>full</b> inspection reports (i.e. not interim, focused or monitoring visits) in their original pdf format and organises them into named folders***. Each LA name is 'cleaned' to give a standard form for onward processing, e.g. 'Nottingham City Council' becomes 'Nottingham' and 'Barnsley Metropolitan Borough Council' becomes 'Barnsley'. </span><br>
    
<ul style="font-size:10pt; list-style-type:disc; margin-left: 20px;">
    <li>*Available at: https://reports.ofsted.gov.uk/</li>
    <li>**Available at: https://adcs.org.uk/inspection/article/ilacs-outcomes-summary</li>
    <li>***Folder structure: \\export_data\inspection_reports\provider_urn+local_authority_name(lowercase)...pdf</li>
</ul> 
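
The name cleaning mentioned in the summary can be illustrated with a reduced sketch. `simplify_la_name` is a hypothetical helper written only for this illustration; the notebook's own cleaning function handles more cases (prefixes such as "London Borough of", for example). The result is lower case, which matches the folder naming noted above.

```python
def simplify_la_name(name):
    """Hypothetical, reduced version of the LA name cleaning (illustration only)."""
    # Words that carry no identifying information in an LA name
    drop_words = {'city', 'metropolitan', 'borough', 'council', 'county', 'district', 'the'}
    words = name.lower().split()
    return ' '.join(word for word in words if word not in drop_words)


print(simplify_la_name('Nottingham City Council'))
print(simplify_la_name('Barnsley Metropolitan Borough Council'))
```

A word-list approach like this is simple to extend, but it assumes none of the dropped words is ever the identifying part of a name.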

<br>
<b>Exports:</b><br>
<span style="font-size:10pt">Downloaded inspection report pdfs are saved beneath the export folder, in one subfolder per provider named from the provider URN and the cleaned (lowercase) local authority name. </span><br>

`script root`
<ul style="font-size:10pt; list-style-type:disc; margin-left: 20px;">
    <li>\export_data\
        <ul>
            <li>\inspection_reports\</li>
                <ul>
                <li>\provider_urn+local_authority_name\*.pdf</li>
                </ul>
        </ul>
    </li>
</ul>
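
As a rough sketch of how a per-provider folder path in this layout could be assembled, the snippet below uses `pathlib`. The folder names are the notebook's defaults; `provider_folder` is a hypothetical helper and the URN shown is a made-up placeholder, not a real provider reference.

```python
from pathlib import Path

# Default folder names used by the notebook's export options
root_export_folder = 'export_data'
inspections_subfolder = 'inspection_reports'


def provider_folder(urn, la_name, base='.'):
    """Hypothetical helper: build <base>/export_data/inspection_reports/<urn>_<la name>."""
    return Path(base) / root_export_folder / inspections_subfolder / f"{urn}_{la_name.lower()}"


folder = provider_folder('00000', 'Barnsley')  # '00000' is a placeholder URN
print(folder.as_posix())
```

Calling `folder.mkdir(parents=True, exist_ok=True)` would then create the whole chain in one step without failing when the folder already exists.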

<b>Imports:</b><br>
<span style="font-size:10pt">Static data can be added to enrich the summary output; as an example, the Local Authority Number is added within this process. The process is structured so that further enrichment (e.g. geospatial data) can be added easily, provided a suitable key column is available to join on. </span><br>

`script root`
<ul style="font-size:10pt; list-style-type:disc; margin-left: 20px;">
    <li>\import_data\
        <ul>
            <li>\la_lookup\</li>
        </ul>
    </li>
</ul>
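
The enrichment idea can be sketched as a keyed lookup join. Everything named here is an assumption for illustration: the lookup content, the column names `local_authority` and `la_number`, and the placeholder values. The sketch uses only the standard library to stay dependency-free; with pandas the same step would typically be a left merge on the key column.

```python
import csv
import io

# Hypothetical lookup content; column names and values are placeholders
LOOKUP_CSV = """local_authority,la_number
barnsley,001
nottingham,002
"""


def load_lookup(csv_text, key='local_authority'):
    """Read lookup rows into a dict keyed on the join column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def enrich(rows, lookup, key='local_authority'):
    """Left-join style enrichment: unmatched rows are kept unchanged."""
    enriched = []
    for row in rows:
        extra = lookup.get(row[key], {})
        enriched.append({**row, **{k: v for k, v in extra.items() if k != key}})
    return enriched


summary_rows = [
    {'urn': 'X1', 'local_authority': 'barnsley'},
    {'urn': 'X2', 'local_authority': 'unknown la'},
]
result = enrich(summary_rows, load_lookup(LOOKUP_CSV))
print(result)
```

Keeping unmatched rows means a gap in the lookup file shows up as a missing value in the summary rather than a silently dropped local authority.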

<br>
<b>N.B./Pre-requisites:</b><br>
<span style="font-size:10pt">Relies on Ofsted's continued use of 'nonvisual' CSS element descriptors on its web site. Relying on page markup is not ideal, but any scrape process, however robust, is at the mercy of subsequent page changes. The tool avoids Selenium or similar, as that approach is more likely to be affected by visual design changes to the page(s). Instead it relies on the underlying PHP search process and the links it generates.</span>
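
To make the dependency on the 'nonvisual' descriptors concrete, here is a minimal sketch of the selection rule. The sample markup and the `example.org` links are hypothetical, modelled only on the class names the notebook looks for (`publication-link` and `nonvisual`). The notebook itself uses BeautifulSoup; the standard-library `HTMLParser` is swapped in here purely to keep the sketch dependency-free.

```python
from html.parser import HTMLParser

# Hypothetical markup, not copied from the Ofsted site
SAMPLE_HTML = """
<ul>
  <li><a class="publication-link" href="https://example.org/files/1.pdf">
      Report <span class="nonvisual">Children's services inspection, pdf - 08 January 2021</span></a></li>
  <li><a class="publication-link" href="https://example.org/files/2.pdf">
      Report <span class="nonvisual">Children's services focused visit, pdf - 14 March 2022</span></a></li>
</ul>
"""


class NonvisualCollector(HTMLParser):
    """Collects (href, nonvisual text) pairs for each publication link."""

    def __init__(self):
        super().__init__()
        self.results = []            # (href, lower-cased descriptor text)
        self._href = None            # href of the publication link being read
        self._in_nonvisual = False   # True while inside span.nonvisual
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'a' and 'publication-link' in classes:
            self._href = attrs.get('href')
        elif tag == 'span' and 'nonvisual' in classes and self._href:
            self._in_nonvisual = True
            self._text = []

    def handle_data(self, data):
        if self._in_nonvisual:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'span' and self._in_nonvisual:
            self._in_nonvisual = False
            self.results.append((self._href, ''.join(self._text).lower().strip()))
        elif tag == 'a':
            self._href = None


def is_full_inspection(text):
    """Same hard-coded terms as the notebook's filter."""
    return all(term in text for term in ('children', 'services', 'inspection'))


collector = NonvisualCollector()
collector.feed(SAMPLE_HTML)
full_inspections = [(href, text) for href, text in collector.results if is_full_inspection(text)]
print(full_inspections)
```

If Ofsted renamed or removed the `nonvisual` spans, `collector.results` would come back empty, which is the failure mode this note warns about.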

<b>Backlog/to-do:</b><br>

<ul style="font-size:10pt; list-style-type:disc; margin-left: 20px;">
    <li>Moved to Trello: https://trello.com/c/4TihKpvQ</li>

</ul> 

<b>Known bugs:</b><br>

<ul style="font-size:10pt; list-style-type:disc; margin-left: 20px;">
    <li>Moved to Trello</li>
    
</ul> 

In [6]:
#
# Export options

export_summary_filename = 'ofsted_childrens_services_overview'
# export_file_type         = 'csv' # Excel / csv currently supported
export_file_type         = 'excel' 

# Default (sub)folder structure
# Defined to offer some ease of onward flexibility
root_export_folder = 'export_data'              # <all> exports folder
inspections_subfolder = 'inspection_reports'    # downloaded report pdfs


# scrape inspection grade/data from pdf reports
pdf_data_capture = True # True is default (scrape within pdf inspection reports for inspection results etc)
                        # This impacts run time E.g False == ~1m20 / True == ~ 4m10
                        # False == only pdfs/list of LA's+link to most recent exported. Not inspection results.

In [7]:
#
# Ofsted site/page admin settings

short_inspection_threshold    = 7 # ILACS inspection duration in days
standard_inspection_threshold = 14

max_page_results = 200 # Set max number of search results to show on page(MUST be > total number of LA's!) 
url_stem = 'https://reports.ofsted.gov.uk/'
search_url = 'search?q=&location=&lat=&lon=&radius=&level_1_types=3&level_2_types%5B%5D=12' # On to-do list
max_page_results_url = '&rows=' + str(max_page_results) # Coerce results page to display ALL providers on single results page without next/pagination

# resultant complete url to process
url = url_stem + search_url + max_page_results_url
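
```python
# Alternative sketch (not the notebook's method): build the same search URL from
# key/value pairs with urllib.parse.urlencode rather than a hand-written query string.
# The parameter names and values are copied from search_url above; their meaning on
# the Ofsted side is not documented here and is not assumed.
from urllib.parse import urlencode

url_stem = 'https://reports.ofsted.gov.uk/'
max_page_results = 200

query = [
    ('q', ''), ('location', ''), ('lat', ''), ('lon', ''), ('radius', ''),
    ('level_1_types', 3),
    ('level_2_types[]', 12),   # urlencode percent-encodes the brackets as %5B%5D
    ('rows', max_page_results),
]
url_from_params = url_stem + 'search?' + urlencode(query)
print(url_from_params)
```

Building the query from pairs keeps each search parameter on its own line, which may make it easier to adjust if the search options change.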




In [8]:
#
# Script admin settings

# Keep warnings quiet unless priority
import logging
import subprocess
import warnings
logging.getLogger('org.apache.pdfbox').setLevel(logging.ERROR)
warnings.filterwarnings('ignore')
                    

In [9]:
# # Non-standard modules that might need installing
# !pip install PyPDF2
# !pip install tabula-py
# !pip install textblob
# !pip install gensim
# !pip install matplotlib
# !pip install openpyxl
# !pip install XlsxWriter


In [10]:

import os
import requests
import pandas as pd
from bs4 import BeautifulSoup

from requests.exceptions import RequestException

# pdf search/data extraction
import io
import tabula   
import PyPDF2   
import re       

# used in handling inspection dates
from dateutil import parser 
from datetime import datetime

# nlp stuff for sentiment
from textblob import TextBlob
from gensim import corpora, models

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

# handle optional excel export+active file links
import openpyxl
from openpyxl.styles import Font
from openpyxl.utils.dataframe import dataframe_to_rows
import xlsxwriter



# visuals
import matplotlib.pyplot as plt

[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
#
# Function defs

def get_soup(url):
    """
    Given a URL, returns a BeautifulSoup object.
    Args: url (str): The URL to fetch and parse.
    Returns: BeautifulSoup: The parsed HTML content.
    """
    timeout_seconds = 10 # let's not assume the Ofsted page is up

    try:
        response = requests.get(url, timeout=timeout_seconds)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup
    except RequestException as e:
        print(f"An error occurred while fetching the URL '{url}': {e}")
        return None


def clean_provider_name(name):
    """
    Cleans the la/provider name according to:
                - expected output based on existing ILACS sheet
                - historic string issues seen on Ofsted site

    Args:
        name (str): The original name to be cleaned.
    Returns:
        str: The cleaned name.
    """
    # Convert to lowercase and remove extra spaces
    name = name.lower().replace('  ', ' ')
    
    # Remove specific phrases
    name = name.replace("royal borough of ", "").replace("city of ", "").replace("metropolitan district council", "").replace("london borough of", "").replace("council of", "")
    
    # Remove further undesired 'single' words and join the remaining parts
    name_parts = [part for part in name.split() if part not in ['city', 'metropolitan', 'borough', 'council', 'county', 'district', 'the']]
    return ' '.join(name_parts)


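def get_framework_type(start_date, end_date, short_inspection_threshold, standard_inspection_threshold):
    """
    Returns an inspection framework type based on the duration between the start and end dates.
    The dates are scraped from the report text, as this is currently the only reference available.
    This is not ideal, as it relies entirely on text-based data in varied formats, hence the
    validity checks included here.

    Args:
        start_date (datetime.date or None): Inspection start date.
        end_date (datetime.date or None): Inspection end date.
        short_inspection_threshold (int): Duration threshold in days for a "short" inspection.
        standard_inspection_threshold (int): Duration threshold in days for a "standard" inspection.

    Returns:
        str: Inspection framework type: "short", "standard" or "inspection duration longer than
        standard framework"; or "invalid end or start date extracted" / "invalid date format"
        when the dates cannot be used.
    """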

    # Check if both start and end dates have been accessible
    if start_date is not None and end_date is not None:

        # Check if end date is earlier than start date (i.e. the extracted dates are invalid)
        if end_date < start_date:
            inspection_framework_str = "invalid end or start date extracted"

        # Calculate the number of days between inspection start and end dates
        else:
            delta = end_date - start_date
            inspection_duration_days = delta.days

            # Determine the inspection framework based on the duration days
            # Note: Needs further investigation to sense check real-world timeframes here, i.e. are thresholds 'working days'?
            # For most instances this appears to be sufficiently accurate as-is. 
            if inspection_duration_days <= short_inspection_threshold:
                inspection_framework_str = "short"
            elif short_inspection_threshold < inspection_duration_days <= standard_inspection_threshold + 1:
                inspection_framework_str = "standard"
            else:
                inspection_framework_str = "inspection duration longer than standard framework"

    # Handle cases where start or end date is not provided 
    # Note: end date most likely to have not been extracted due to formatting issues
    else:
        inspection_framework_str = "invalid date format"

    return inspection_framework_str


def format_date(date_str: str, input_format: str, output_format: str) -> str:
    """
    Convert and format a date string.

    Args:
        date_str (str): The input date string.
        input_format (str): The format of the input date string.
        output_format (str): The desired output format.

    Returns:
        str: The formatted date string.
    """
    dt = datetime.strptime(date_str, input_format)
    date_obj = dt.date()

    return date_obj.strftime(output_format)


def parse_date(date_str, input_format):
    """Parse date_str using the strptime pattern input_format and return a datetime.date."""
    dt = datetime.strptime(date_str, input_format)

    return dt.date()


def format_date_for_report(date_obj, output_format_str):
    """
    Formats a date object as a string using the given format, or returns an empty string if the input is None.

    Args:
        date_obj (datetime.date or None): The date object to format, or None.
        output_format_str (str): The strftime format to apply, e.g. "%d/%m/%Y".

    Returns:
        str: The formatted date string, or an empty string if date_obj is None.
    """
    if date_obj is not None:
        return date_obj.strftime(output_format_str)
    else:
        return ""


def extract_inspection_grade(row, column_name):
    """
    Extracts the grade from the given row and column name. If the grade contains
    the phrase "requires improvement", it returns the cleaned-up value.
    
    Args:
        row (pd.Series): A row from a Pandas DataFrame.
        column_name (str): The name of the column containing the grade.
    
    Returns:
        str: The extracted grade.
    
    Raises:
        ValueError: If the grade value cannot be converted to a string.
    """
    try:
        grade = str(row[column_name])
    except Exception as e:
        # 'grade' is unbound if the lookup/conversion failed, so report the column instead
        raise ValueError(f"unknown grade value found in column '{column_name}': {e}") from e

    if "requires improvement" in grade.lower():
        # Some RI text has further comment that we don't want, i.e. 'RI, *to become good*' 
        grade = "Requires improvement"
    return grade
    


def extract_inspection_data(pdf_content):
    """
    Extracts the inspector's name, overall Ofsted grade, and inspection dates from the first page of a PDF report.

    Args:
        pdf_content (bytes): The content of the PDF file as bytes.

    Returns:
        dict: A dictionary containing results extracted from the ofsted inspection report(s) 
        Incl. inspector's name, overall Ofsted grade, inspection dates, or None if not found.
        Additional extracted data can easily be added here, but must be added to the returned dict. 

    Notes:
        This function extracts information from the first page of the PDF report. The inspector's name is extracted using a
        regular expression search for the string "Lead inspector:". The overall Ofsted grade is extracted from a table that
        appears on the first page of the report. The function uses the tabula library to extract the table data. The inspection
        dates are also extracted using a regular expression search for the string "Inspection dates:". The function attempts to
        parse the inspection dates into date objects (formatting for output is left to the caller). The final output is a dictionary
        containing the extracted information, with None for any item that could not be found.

    Raises:
        ValueError: Raised when an unknown grade value is found during grade extraction.
    """

    # Create a file-like buffer for the PDF content
    with io.BytesIO(pdf_content) as buffer:
        # Read the PDF content for text extraction
        reader = PyPDF2.PdfReader(buffer)
        
        # Extract the first page of inspection report pdf
        # This to ensure when we iterate/search the summary table, chance of invalid table reduced
        first_page_text = reader.pages[0].extract_text()

        # Extract text from <all> pages in the pdf
        full_text = ''
        for page in reader.pages:
            full_text += page.extract_text()

        # Call the get_sentiment_and_topics function
        sentiment_val, key_inspection_themes_lst = get_sentiment_and_topics(buffer)


        # Convert val to a <general> sentiment text/str for (readable) reporting
        sentiment_summary_str = get_sentiment_category(sentiment_val)


        # #################
        # # testing / in dev-progress
        # print(sentiment_summary_str) # testing
        # # Call the updated get_sentiment** function # testing
        # sentiment_val2, filtered_themes = get_sentiment_and_sentiment_by_theme(buffer, "leadership", "results", "management") # testing
        # plot_filtered_topics(filtered_themes) # testing
        # #################



        # Find the inspector's name using a regular expression
        match = re.search(r"Lead inspector:\s*(.+)", first_page_text)
        if match:
            inspector_name = match.group(1)
            
            inspector_name = inspector_name.split(',')[0].strip()       # Remove everything after the first comma (some contain '.., Her Majesty’s Inspector')
            inspector_name = inspector_name.replace("HMI", "").rstrip() # Remove "HMI" and any trailing spaces(some inspectors add this to name)

        else:
            inspector_name = None

        # Read the PDF and extract the table on the first page
        try:
            buffer.seek(0)  # Reset the buffer position to the beginning
            tables = tabula.read_pdf(buffer, pages=1, multiple_tables=True)
        except Exception as e:
            print(f"An error occurred while reading the PDF: {e}")
            tables = []

    # Initialize variables to store inspection grades
    inspection_grade = None
    impact_of_leaders_grade_str = None
    help_and_protection_grade_str = None
    care_and_care_leavers_grade_str = None

    # Loop through tables to find the table containing inspection grades
    # (currently only one is expected, but just in case another is added)
    for table in tables:
        # Check if table contains necessary columns
        if 'Judgement' in table.columns and 'Grade' in table.columns:
            # Iterate through rows of the table
            for index, row in table.iterrows():
                # Treat NaN/null as empty; otherwise lower-case the judgement for case-insensitive matching
                if pd.isna(row['Judgement']):
                    judgement = ''
                else:
                    judgement = str(row['Judgement']).lower()

                # Check if report summary table/row contains 'Overall effectiveness'
                if 'overall effectiveness' == judgement:
                    inspection_grade = extract_inspection_grade(row, 'Grade')

                # Check if report summary table/row contains 'The impact of leaders on social work practice with children and families'
                elif re.search('impact of leaders', judgement):
                    impact_of_leaders_grade_str = extract_inspection_grade(row, 'Grade')

                # Check if report summary table/row contains 'The experiences and progress of children who need help and protection'
                elif re.search('need help', judgement):
                    help_and_protection_grade_str = extract_inspection_grade(row, 'Grade')

                # Check if report summary table/row contains 'The experiences and progress of children in care and care leavers'
                elif re.search('in care', judgement):
                    care_and_care_leavers_grade_str = extract_inspection_grade(row, 'Grade')


            # Stop once the overall grade has been found; the other judgement grades are optional
            # (the previous all(...)/any(...) test on the optional grades was always True)
            if inspection_grade is not None:
                break


    # Find the inspection dates using a regular expression
    date_match = re.search(r"Inspection dates:\s*(.+)", first_page_text)

    if date_match:
        # IF there was date data


        inspection_dates = date_match.group(1).strip()
            
        # Some initial clean up based on historic data obs
        inspection_dates = inspection_dates.replace(".", "")
        inspection_dates = inspection_dates.replace("\u00A0", " ") # Remove non-breaking space (Seen in nottingham report)
        inspection_dates = re.sub(r"[\u2012\u2013\u2014\u2212\-]+", " to ", inspection_dates) # replace en dash char ("\u2013"), em dash ("\u2014"), or ("-") 
        inspection_dates = inspection_dates.split("and")[0].strip() # Need this because we have such as :
                                                                    # "8 July 2019 to 12 July 2019 and 7 August 2019 to 8 August 2019"
                                                                    # E.g. Derbyshire
        inspection_dates = re.sub(r'(\d)\s(\d)', r'\1\2', inspection_dates) # Fix white spaces between date numbers e.g. "wiltshire,	1 9 June 2019"



        if isinstance(inspection_dates, str):
            # data was as expected
            year_match = re.search(r"\d{4}", inspection_dates)
            if year_match:
                year = year_match.group(0) # get single copy of yyyy

                # Now remove the year from the inspection_dates string
                inspection_dates_cleaned = inspection_dates.replace(year, "").strip()

            else:
                # We had inspection_dates data but no recognisable year
                year = None
                inspection_dates_cleaned = inspection_dates

        else:
            # spurious data
            # inspection_dates arrived with non-str, set default val
            print("Error: inspection_dates is not a string. Type is", type(inspection_dates))
            inspection_dates_cleaned = None 


        # Now that we have already removed/cleaned those with 'and .....'
        # Split the inspection_dates_cleaned string using ' to ' as the delimiter and limit the number of splits to 1
        date_parts = inspection_dates_cleaned.split(' to ', maxsplit=1) # expect only 1 instance of 'to' between date vals
        

  
        # Get the seperate inspection date(s) 
        start_date = date_parts[0].strip()
        end_date = date_parts[1].strip() if len(date_parts) > 1 else None
        
        # Check if the month text is written in *both* the date strings
        # Required work-around as Ofsted reports contain inspection date strings in multiple formats (i/ii/iii...)
        #   i)      "15 to 26 November"  
        #   ii)     "28 February to 4 March" or "8 October to 19 October" (majority)
        #   iii)    ['8 July ', '12 July   and 7 August  to'] (*recently seen)
        #   iv)     "11 September 2017 to 5 October 2017" (double year)
        #   v)      "Inspection dates: 19 November–30 November 2018" (Bromley)
        if len(start_date) <= 2 and end_date: # i.e. we only have a day number with no month text (and an end date to borrow it from)
            inspection_month = end_date.split()[1]
            start_date = f"{start_date} {inspection_month}"

        # Append the inspection year to the start_date and end_date
        start_date_str = f"{start_date} {year}"
        end_date_str = f"{end_date} {year}" if end_date else None


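        # format current str dates (as date objects)
        start_date_formatted = parse_date(start_date_str, '%d %B %Y') #  str '8 January 2021' 
        # end date may not have been extracted, in which case there is nothing to parse
        end_date_formatted = parse_date(end_date_str, '%d %B %Y') if end_date_str else None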

        # calculate inspection duration and return framework string
        # Note: Problems arising here generally relate to the end_date extraction from pdf
        inspection_framework_str = get_framework_type(start_date_formatted, end_date_formatted, short_inspection_threshold, standard_inspection_threshold)

    else:
        # unable to extract the data or didnt exist
        start_date_formatted = None
        end_date_formatted = None
        inspection_framework_str = None


    return {'inspector_name': inspector_name, 
            'overall_inspection_grade': inspection_grade,
            'inspection_start_date': start_date_formatted,
            'inspection_end_date': end_date_formatted,
            'inspection_framework': inspection_framework_str,
            'impact_of_leaders_grade': impact_of_leaders_grade_str,
            'help_and_protection_grade': help_and_protection_grade_str,
            'care_and_care_leavers_grade': care_and_care_leavers_grade_str,
            'sentiment_score': sentiment_val, 
            'sentiment_summary': sentiment_summary_str,
            'main_inspection_topics': key_inspection_themes_lst
            }


def process_provider_links(provider_links):
    """
    Processes provider links and returns a list of dictionaries containing URN, local authority, and inspection link.

    Args:
        provider_links (list): A list of BeautifulSoup Tag objects representing provider links.

    Returns:
        list: A list of dictionaries containing URN, local authority, inspection link, and, if enabled, additional inspection data.
    """
    
    data = []
    global pdf_data_capture # Bool flag
    global root_export_folder
    global inspections_subfolder


    for link in provider_links:
        # Extract the URN and provider name from the web link shown
        urn = link['href'].rsplit('/', 1)[-1]
        name = clean_provider_name(link.text.strip())


        clean_provider_dir = os.path.join(root_export_folder, inspections_subfolder, urn + '_' + name)
        provider_dir = os.path.join('.', root_export_folder, inspections_subfolder, urn + '_' + name)


        # Create the provider directory if it doesn't exist
        if not os.path.exists(provider_dir):
            os.makedirs(provider_dir)

        # Get the child page content
        child_url = 'https://reports.ofsted.gov.uk' + link['href']
        child_soup = get_soup(child_url)

        # get_soup() returns None on a failed request, so skip this provider rather than fail
        if child_soup is None:
            continue

        # Find all publication links in the provider's child page
        pdf_links = child_soup.find_all('a', {'class': 'publication-link'})

        # Initialize a flag to indicate if an inspection link has been found
        # Important: This assumes that the provider's reports are returned/organised most recent FIRST
        found_inspection_link = False

        # Iterate through the publication links
        for pdf_link in pdf_links:

            # Check if the current/next href-link meets the selection criteria
            # This block obv relies on Ofsted continued use of nonvisual element descriptors
            # containing the type(s) of inspection text. We use  "children's services inspection"

            nonvisual_text = pdf_link.select_one('span.nonvisual').text.lower().strip()

            # For now at least, search terms hard-coded. 
            if 'children' in nonvisual_text and 'services' in nonvisual_text and 'inspection' in nonvisual_text:


                # Create the filename and download the PDF (this filetype needs to be hard-coded here)
                filename = nonvisual_text.replace(', pdf', '') + '.pdf'


                pdf_content = requests.get(pdf_link['href']).content
                with open(os.path.join(provider_dir, filename), 'wb') as f:
                    f.write(pdf_content)


                # Extract the local authority and inspection link, and add the data to the list
                if not found_inspection_link:

                    # Capture the data that will be exported about the most recent inspection only
                    # Use the cleaned name directly: splitting provider_dir on '_' would break at 'export_data' first
                    local_authority = name
                    inspection_link = pdf_link['href']
                    
                    # Extract the report published date
                    report_published_date_str = filename.split('-')[-1].strip().split('.')[0] # published date appears after '-' 
            
                    # get/format the published date (as a dd/mm/yy string)
                    report_published_date = format_date(report_published_date_str, '%d %B %Y', '%d/%m/%y')

                    # Now get the in-document data
                    if pdf_data_capture:
                        # Opt1 : ~x4 slower runtime
                        # Only here if we have set PDF text scrape flag to True
                        # Turn this off, speeds up script if we only need the inspection documents themselves to be retrieved

                        # Scrape inside the pdf inspection reports
                        inspection_data_dict = extract_inspection_data(pdf_content)
                        

                        # Dict extract here for readability of returned data/onward

                        # inspection basics
                        overall_effectiveness = inspection_data_dict['overall_inspection_grade']
                        inspector_name = inspection_data_dict['inspector_name']
                        inspection_start_date = inspection_data_dict['inspection_start_date']
                        inspection_end_date = inspection_data_dict['inspection_end_date']
                        inspection_framework = inspection_data_dict['inspection_framework']
                        # additional inspection grades if available
                        impact_of_leaders_grade = inspection_data_dict['impact_of_leaders_grade']
                        help_and_protection_grade = inspection_data_dict['help_and_protection_grade']
                        care_and_care_leavers_grade = inspection_data_dict['care_and_care_leavers_grade']
                        # NLP extract 
                        sentiment_score = inspection_data_dict['sentiment_score']
                        sentiment_summary = inspection_data_dict['sentiment_summary']
                        main_inspection_topics = inspection_data_dict['main_inspection_topics']


                        # format dates for output                       
                        inspection_start_date_formatted = format_date_for_report(inspection_start_date, "%d/%m/%Y")
                        inspection_end_date_formatted = format_date_for_report(inspection_end_date, "%d/%m/%Y")

                        # Format the provider directory as a file path link (in readiness for such as Excel)

                        provider_dir_link = f"{provider_dir}"

                        # deprecated - testing
                        provider_dir_link = provider_dir_link.replace('/', '\\') # fix for Windows systems
                        
                        # file_link = f"/{provider_dir}"
                        # provider_dir_link = 'file:///' + provider_dir.replace("\\", "/") # to remove
                                            
                        data.append({
                                        'urn': urn,
                                        'local_authority': local_authority,
                                        'inspection_link': inspection_link,
                                        'overall_effectiveness_grade': overall_effectiveness,
                                        'inspection_framework': inspection_framework,
                                        'inspector_name': inspector_name,
                                        'inspection_start_date': inspection_start_date_formatted,
                                        'inspection_end_date': inspection_end_date_formatted,
                                        'publication_date': report_published_date,
                                        'local_link_to_all_inspections': provider_dir_link,
                                        'impact_of_leaders_grade': impact_of_leaders_grade,
                                        'help_and_protection_grade': help_and_protection_grade,
                                        'care_and_care_leavers_grade': care_and_care_leavers_grade,
                                        'sentiment_score': sentiment_score,
                                        'sentiment_summary': sentiment_summary,
                                        'main_inspection_topics': main_inspection_topics

                                    })
                        
                    else:
                        # Opt2 : ~x4 faster runtime
                        # Only grab the data/docs we can get direct off the Ofsted page 
                        data.append({'urn': urn, 'local_authority': local_authority, 'inspection_link': inspection_link})

                    
                    found_inspection_link = True # Flag to ensure data reporting on only the most recent inspection
    return data


def handle_pagination(soup, url_stem):
    """
    Handles pagination for a BeautifulSoup object representing a web page with paginated content.
    
    Args:
        soup (bs4.BeautifulSoup): The BeautifulSoup object representing the web page.
        url_stem (str): The base URL to which the relative path of the next page will be appended.
        
    Returns:
        str: The full URL of the next page if it exists, otherwise None.
    """
    
    # Find the pagination element in the soup object
    pagination = soup.find('ul', {'class': 'pagination'})

    # Check if the pagination element exists
    if pagination:
        # Find the next page button in the pagination element
        next_page_button = pagination.find('li', {'class': 'next'})

        # Check if the next page button exists
        if next_page_button:
            # Extract the relative URL of the next page
            next_page_url = next_page_button.find('a')['href']
            
            # Return the full URL of the next page by appending the relative URL to the base URL
            return url_stem + next_page_url

    # Return None if there is no next page button or pagination element
    return None


def save_data(data, filename, file_type='csv'):
    """
    Exports data to a specified file type.

    Args:
        data (list or dict): The data to be exported.
        filename (str): The desired name of the output file.
        file_type (str, optional): The desired file type. Defaults to 'csv'.

    Returns:
        None
    """
    if file_type == 'csv':
        filename_with_extension = filename + '.csv'
        pd.DataFrame(data).to_csv(filename_with_extension, index=False)

    elif file_type == 'excel':
        filename_with_extension = filename + '.xlsx'
        pd.DataFrame(data).to_excel(filename_with_extension, index=False)

    else:
        print(f"Error: unsupported file type '{file_type}'. Please choose 'csv' or 'excel'.")
        return
    
    print(f"{filename_with_extension} successfully created!")




def save_data_update(data, filename, file_type='csv', hyperlink_column=None):
    """
    Exports a DataFrame to CSV or Excel, optionally writing one column as hyperlinks (Excel only).

    Args:
        data (DataFrame): The data to be exported.
        filename (str): The desired name of the output file.
        file_type (str, optional): The desired file type. Defaults to 'csv'.
        hyperlink_column (str, optional): The column containing folder names for hyperlinks. Defaults to None.

    Returns:
        None
    """
    if file_type == 'csv':
        filename_with_extension = filename + '.csv'
        data.to_csv(filename_with_extension, index=False)

    elif file_type == 'excel':
        filename_with_extension = filename + '.xlsx'

        # Create a new workbook and add a worksheet
        workbook = xlsxwriter.Workbook(filename_with_extension)
        sheet = workbook.add_worksheet()

        hyperlink_col_index = data.columns.get_loc(hyperlink_column) if hyperlink_column else None

        # Define hyperlink format
        hyperlink_format = workbook.add_format({'font_color': 'blue', 'underline': 1})


        # Write DataFrame to the worksheet
        for row_num, (index, row) in enumerate(data.iterrows(), start=1):
            for col_num, (column, cell_value) in enumerate(row.items()):
                if hyperlink_col_index is not None and col_num == hyperlink_col_index:
                    # Add hyperlink using the HYPERLINK formula
                    link = f".\\{cell_value}"
                    sheet.write_formula(row_num, col_num, f'=HYPERLINK("{link}", "{cell_value}")', hyperlink_format)

                else:
                    sheet.write(row_num, col_num, str(cell_value))

        # Write header
        header_format = workbook.add_format({'bold': True})
        for col_num, column in enumerate(data.columns):
            sheet.write(0, col_num, column, header_format)

        # Save the workbook
        workbook.close()

    else:
        print(f"Error: unsupported file type '{file_type}'. Please choose 'csv' or 'excel'.")
        return

    print(f"{filename_with_extension} successfully created!")




def get_sentiment_and_topics(pdf_buffer):
    """
    Analyze the sentiment and extract the top 3 topics from a PDF document.

    This function takes a file-like buffer containing a PDF document as input and
    performs the following tasks:
    1. Reads the content of the PDF file using the PyPDF2 library.
    2. Extracts the text from each page and concatenates it into a single string.
    3. Performs sentiment analysis on the extracted text using the TextBlob library.
       The sentiment polarity score ranges from -1 (most negative) to 1 (most positive).
    4. Identifies key themes or topics from the extracted text using the Latent Dirichlet
       Allocation (LDA) model from the Gensim library.
    5. Returns the sentiment polarity score and the top 3 topics extracted from the PDF file.

    Args:
        pdf_buffer (io.BytesIO): A file-like buffer containing the PDF content.

    Returns:
        tuple: A tuple containing the sentiment polarity score (float) and a list of
               the top 3 topics (strings).
    """

    # Read the PDF and concatenate the text of every page
    reader = PyPDF2.PdfReader(pdf_buffer)
    text = ''
    for page in reader.pages:
        text += page.extract_text() or ''  # guard against pages that yield no text

    # Perform sentiment analysis on the extracted text
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    
    # Identify key themes from the extracted text
    # First, preprocess the text by tokenising and removing stop words
    tokens = [word for sentence in blob.sentences for word in sentence.words]
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]
    
    # Create a dictionary from the tokenized text
    dictionary = corpora.Dictionary([tokens])
    
    # Create a corpus from the dictionary and the tokenised text
    corpus = [dictionary.doc2bow(tokens)]
    
    # Create an LDA model from the corpus
    lda_model = models.LdaModel(corpus, num_topics=3, id2word=dictionary)
    
    # Get the top 3 topics from the LDA model
    topics = [lda_model.print_topic(topic_num) for topic_num in range(3)]

    # return sentiment, [extract_words(topic) for topic in topics] # Alternative to return the words with NO weights
    return sentiment, topics


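# Extended version of get_sentiment_and_topics(): keeps only topics similar to the given themes
def get_sentiment_and_sentiment_by_theme(pdf_buffer, theme1, theme2, theme3):
    """
    Analyse overall sentiment and return the LDA topics that resemble the given themes.

    Args:
        pdf_buffer (io.BytesIO): A file-like buffer containing the PDF content.
        theme1, theme2, theme3 (str): Theme strings used to filter the 10 LDA topics.

    Returns:
        tuple: The sentiment polarity score (float) and a list of the topic strings
               whose cosine similarity to at least one theme exceeds 0.2.
    """
    # Read the PDF and concatenate the text of every page
    reader = PyPDF2.PdfReader(pdf_buffer)
    text = ''
    for page in reader.pages:
        text += page.extract_text() or ''  # guard against pages that yield no text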

    # Perform sentiment analysis on the extracted text
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    
    # Identify key themes from the extracted text
    # First, preprocess the text by tokenising and removing stop words
    tokens = [word for sentence in blob.sentences for word in sentence.words]
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]
    
    # Create a dictionary from the tokenized text
    dictionary = corpora.Dictionary([tokens])
    
    # Create a corpus from the dictionary and the tokenised text
    corpus = [dictionary.doc2bow(tokens)]


    # Create an LDA model from the corpus with a higher number of topics
    lda_model = models.LdaModel(corpus, num_topics=10, id2word=dictionary)
    
    # Get all topics from the LDA model
    all_topics = [lda_model.print_topic(topic_num) for topic_num in range(10)]

    # Define a function to calculate the cosine similarity between two strings
    def string_similarity(s1, s2):
        # Bag-of-words count vectors for the two strings
        count_matrix = CountVectorizer().fit_transform([s1, s2])
        vectors = count_matrix.toarray()
        return cosine_similarity(vectors)[0, 1]

    # Filter topics based on the similarity to the provided theme strings
    filtered_topics = []
    themes = [theme1, theme2, theme3]
    for topic in all_topics:
        for theme in themes:
            if string_similarity(topic, theme) > 0.2:  # Adjust the threshold as needed
                filtered_topics.append(topic)
                break

    return sentiment, filtered_topics

def get_sentiment_category(sentiment):
    """
    Return the sentiment category based on the sentiment value.
    For now, the ranges have been hard-coded. Might need to review that.

    This function takes a sentiment value as input and returns the sentiment category
    according to the following ranges:
    - sentiment > 0.5: very positive
    - 0 < sentiment <= 0.5: positive
    - sentiment == 0: neutral
    - -0.5 <= sentiment < 0: negative
    - sentiment < -0.5: very negative

    Args:
        sentiment (float): Sentiment value ranging from -1 (most negative) to 1 (most positive).

    Returns:
        str: The sentiment category.
    """

    if sentiment > 0.5:
        return "Sentiment very positive"
    elif 0 < sentiment <= 0.5:
        return "Sentiment positive"
    elif sentiment == 0:
        return "Sentiment neutral"
    elif -0.5 <= sentiment < 0:
        return "Sentiment negative"
    else:
        return "Sentiment very negative"


def extract_words(topic_string):
    """Return only the words from an LDA topic string, dropping the per-word topic weights."""
    words = re.findall(r'\*"(.*?)"', topic_string)
    return words


def plot_filtered_topics(filtered_topics):
    """
    Note: Only applicable when using get_sentiment_and_sentiment_by_theme(pdf_buffer, theme1, theme2, theme3).

    Visualise filtered inspection topics as a bar chart.

    This function takes a list of filtered topics as input and creates a bar chart
    to visualise the weighted words for each topic.

    Args:
        filtered_topics (list): List of filtered topics as strings.

    Returns:
        None
    """
    # Function to extract words and their weights from a topic string
    def extract_words_weights(topic_string):
        words_weights = [ww.split('*') for ww in topic_string.split(' + ')]
        return [(float(weight.strip()), word.strip(" '\"")) for weight, word in words_weights]

    # Extract words and their weights from the filtered_topics
    topics_words_weights = [extract_words_weights(topic) for topic in filtered_topics]

    # Create a bar chart for each topic
    for idx, (words_weights, topic) in enumerate(zip(topics_words_weights, filtered_topics), 1):
        words, weights = zip(*words_weights)

        fig, ax = plt.subplots()
        ax.barh(words, weights)
        ax.set_xlabel('Weights')
        ax.set_title(f'Topic {idx}: {topic[:50]}...')
        ax.invert_yaxis()  # Invert y-axis to show higher weights at the top

        plt.show()


def import_csv_from_folder(folder_name):
    """
    Imports a single CSV file from a local folder relative to the root of the script.

    The CSV file must be located in the specified folder. If multiple CSV files are found,
    a ValueError is raised. If no CSV files are found, a ValueError is raised.

    Parameters:
    folder_name (str): The name of the folder containing the CSV file.

    Returns:
    pandas.DataFrame: A DataFrame containing the data from the CSV file.
    """
    file_names = [f for f in os.listdir(folder_name) if f.endswith('.csv')]
    if len(file_names) == 0:
        raise ValueError('No CSV file found in the specified folder')
    elif len(file_names) > 1:
        raise ValueError('More than one CSV file found in the specified folder')
    else:
        file_path = os.path.join(folder_name, file_names[0])
        df = pd.read_csv(file_path)
        return df
    
    

def merge_and_select_columns(df1, df2, key_column, columns_to_add):
    """
    Merges two dataframes on a key column and returns df1 with the selected columns from
    df2 added. The merge is an inner join, so rows without a matching key in df2 are dropped.

    Parameters:
    df1 (pandas.DataFrame): The first dataframe to merge.
    df2 (pandas.DataFrame): The second dataframe to merge.
    key_column (str): The name of the key column to merge on.
    columns_to_add (list): A list of column names from df2 to add to df1.

    Returns:
    pandas.DataFrame: A new dataframe with merged data from df1 and selected columns from df2.
    """
    merged = df1.merge(df2[columns_to_add + [key_column]], on=key_column)
    return merged


def reorder_columns(df, key_col, columns_to_add):
    """
    Reorders the columns of a dataframe so that the key column comes first, followed by
    the additional columns and then the remaining columns, without any duplicates.

    Parameters:
    df (pandas.DataFrame): The dataframe to reorder.
    key_col (str): The name of the key column to include.
    columns_to_add (list): A list of additional column names to include.

    Returns:
    pandas.DataFrame: A new dataframe with the reordered columns.
    """
    # Build the column order: added columns first, then the existing ones, de-duplicated
    # (dict.fromkeys keeps the first occurrence of each name, in order)
    columns_to_keep = list(dict.fromkeys(columns_to_add + df.columns.tolist()))

    # Put the key column first, followed by the remaining columns in that order
    reordered_columns = [key_col] + [c for c in columns_to_keep if c != key_col]
    return df[reordered_columns]
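
The topic strings handled by `extract_words` and `plot_filtered_topics` use the `weight*"word" + weight*"word"` layout that gensim's `print_topic` returns. As a rough illustration only, the sketch below parses one such string with the standard library. The sample string and the helper name `parse_topic_string` are made up for this example and do not come from a real inspection report.

```python
import re


def parse_topic_string(topic_string):
    """Split a 'weight*"word" + ...' topic string into (word, weight) pairs."""
    # Each term looks like 0.020*"children"; capture the weight and the quoted word
    pairs = re.findall(r'([0-9.]+)\*"(.*?)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


# Hypothetical topic string in the layout described above
sample_topic = '0.020*"children" + 0.015*"care" + 0.010*"leaders"'

parsed = parse_topic_string(sample_topic)
words_only = [word for word, _ in parsed]

print(parsed)
print(words_only)
```

Keeping the weights alongside the words means the same parsed result can feed either a words-only summary column or a weighted bar chart.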

In [12]:
#
# Scrape data

data = []
while True:
    # Fetch and parse the HTML content of the current URL
    soup = get_soup(url)
    
    # Find all 'provider' links on the page
    provider_links = soup.find_all('a', href=lambda href: href and '/provider/' in href)

    # Process the provider links and extend the data list with the results
    data.extend(process_provider_links(provider_links))

    
    # Since all results are on a single page, no need to handle pagination. 
    # Processing complete.   
    break





# Convert the 'data' list to a DataFrame
ilacs_inspection_summary_df = pd.DataFrame(data)



Got stderr: May 09, 2023 3:24:33 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
May 09, 2023 3:24:34 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>

Got stderr: May 09, 2023 3:24:57 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
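
The scrape loop above stops after one page because the results currently fit on a single page. If that ever changes, `handle_pagination` returns either the next URL or `None`, so the loop could continue until it receives `None`. The sketch below shows that loop shape using stand-in data so it runs without network access: `fake_site`, `fetch_page`, `next_page_url` and `scrape_all_pages` are hypothetical names, standing in for the real site, `get_soup` and `handle_pagination`.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for the site: URL -> (provider links, relative 'next' href or None)
fake_site = {
    "https://example.org/search?page=1": (["/provider/1", "/provider/2"], "/search?page=2"),
    "https://example.org/search?page=2": (["/provider/3"], None),
}


def fetch_page(url):
    """Stand-in for get_soup(): return the stored page tuple for a URL."""
    return fake_site[url]


def next_page_url(page, current_url):
    """Stand-in for handle_pagination(): absolute next URL, or None on the last page."""
    _, next_href = page
    return urljoin(current_url, next_href) if next_href else None


def scrape_all_pages(start_url, max_pages=50):
    """Follow 'next' links until none remain, with a page cap and a revisit guard."""
    collected, seen, url = [], set(), start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch_page(url)
        collected.extend(page[0])
        url = next_page_url(page, url)
    return collected


all_links = scrape_all_pages("https://example.org/search?page=1")
print(all_links)
```

Note that this sketch swaps the notebook's string concatenation (`url_stem + next_page_url`) for `urllib.parse.urljoin`, which resolves both relative and absolute hrefs. The `seen` set and `max_pages` cap are extra safeguards against a 'next' link that loops back on itself.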



In [13]:
# Bring in additional flat-file stored data 
#

# Enables broader onward use and cross/backwards-compatible access
# Shown here as an example to make re-use easier for future enrichment, e.g. geospatial...
# Note: Where possible, avoid any reliance on flat-file stored dynamic data!
#       This process is ideally only for static data, or where obtaining specific data points dynamically isn't possible.


# Enrichment: LA codes
# Ofsted data is keyed on URN, but some users may need the historic LA number

# import the needed external/local data
local_authorities_lookup_df = import_csv_from_folder('import_data/la_lookup/') # bring data in 

# Ensure the key column ('urn') has the same dtype in both dataframes
key_col = 'urn'
ilacs_inspection_summary_df[key_col] = ilacs_inspection_summary_df[key_col].astype('int64')
local_authorities_lookup_df[key_col] = local_authorities_lookup_df[key_col].astype('int64')

# Define what data is required to be merged in
columns_to_add = ['la_code', 'region_code']
ilacs_inspection_summary_df = merge_and_select_columns(ilacs_inspection_summary_df, local_authorities_lookup_df, key_col, columns_to_add)

# re-organise column structure now with new cols
ilacs_inspection_summary_df = reorder_columns(ilacs_inspection_summary_df, key_col, columns_to_add)
## End enrichment 1 ## 



# Enrichment: Anything else to enrich...? 

## End enrichment 2 ##
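
The enrichment step merges on `urn` using the default pandas inner join, so any URN missing from the lookup file would be dropped from the summary without warning. A quick coverage check before merging can surface such gaps. The sketch below uses plain Python sets and made-up rows; `find_unmatched_keys` is a hypothetical helper and is not part of this notebook.

```python
def find_unmatched_keys(summary_rows, lookup_rows, key):
    """Return key values present in summary_rows but absent from lookup_rows, sorted."""
    # Cast to int so '80001' and 80001 compare equal, mirroring the astype('int64') step
    lookup_keys = {int(row[key]) for row in lookup_rows}
    summary_keys = {int(row[key]) for row in summary_rows}
    return sorted(summary_keys - lookup_keys)


# Made-up rows standing in for the scraped summary and the LA lookup file
summary_rows = [{"urn": "80001"}, {"urn": 80002}, {"urn": 80003}]
lookup_rows = [{"urn": 80001, "la_code": 101}, {"urn": 80003, "la_code": 303}]

missing = find_unmatched_keys(summary_rows, lookup_rows, "urn")
print(missing)
```

An empty result would suggest every scraped authority has a lookup row. A non-empty one lists the URNs to add to the lookup file before relying on the merged output.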




In [14]:
# Export summary data
#


# Also define the active hyperlink col if exporting to Excel
save_data_update(ilacs_inspection_summary_df, export_summary_filename, file_type=export_file_type, hyperlink_column='local_link_to_all_inspections')




ofsted_childrens_services_overview.xlsx successfully created!
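
For Excel output, `save_data_update` writes the hyperlink column as a `HYPERLINK` formula that points to a path relative to the workbook. The sketch below builds the same kind of formula string. It makes one assumption beyond what the notebook does: it doubles any double quotes in the value, which is how Excel escapes quotes inside a string literal. The helper name `build_hyperlink_formula` and the folder name are invented for illustration.

```python
def build_hyperlink_formula(folder_name):
    """Return an Excel HYPERLINK formula linking to a folder relative to the workbook."""
    # Assumption: escape embedded double quotes by doubling them (Excel string-literal rule)
    safe_name = str(folder_name).replace('"', '""')
    link = f".\\{safe_name}"
    return f'=HYPERLINK("{link}", "{safe_name}")'


# Invented folder name in the provider_urn + local_authority_name style
formula = build_hyperlink_formula("80001_barnsley")
print(formula)
```

Because the link is relative (`.\folder`), it only resolves while the workbook stays alongside the exported report folders.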
