This Python script is designed to audit data publishing endpoints listed in the Digital Land platform by identifying which endpoints are missing associated documentation URLs. It does this by connecting to the Datasette API, running a SQL query to fetch endpoint metadata, analyzing the results, and saving the findings to a CSV file.

Here's how the script works in detail:

1. **Command-Line Interface**:
   The script uses `argparse` to accept an `--output-dir` argument, which determines where the resulting CSV file will be saved.

2. **Data Fetching with Pagination**:
   A paginated SQL query is constructed and repeatedly executed against the Datasette API endpoint (`https://datasette.planning.data.gov.uk/digital-land.json`). The query inner-joins the `endpoint`, `source`, `source_pipeline`, and `organisation` tables, so only endpoints with a matching source, pipeline, and organisation are returned, sorted by entry date (newest first). The joined result gives a combined view of each endpoint, including:
   - Organisation name and ID
   - Dataset (pipeline)
   - Endpoint URL
   - Documentation URL
   - Entry and end dates

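   The SQL uses `LIMIT 1000 OFFSET {offset}` pagination to work around any row limits imposed by the API. The loop stops when a page returns no rows or when a request returns a non-200 status; in the latter case the rows fetched so far are still returned.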

3. **Data Analysis**:
   The fetched dataset is analyzed to identify missing documentation:
   - A `documentation_missing` column is created, where `True` indicates the `documentation_url` is `null`, an empty string, or whitespace only.
   - A breakdown of statistics is printed to the console:
     - Total number of endpoints
     - Number and percentage of endpoints missing documentation
     - Top 10 datasets (pipelines) most affected by missing documentation
     - A breakdown of missing documentation for **active** (no `end_date`) vs. **ended** endpoints
     - The most recent entry date among endpoints missing documentation

4. **Data Enrichment**:
   - A boolean column `is_active` is created to flag whether an endpoint is still active (i.e., `end_date` is blank).
   - The `entry_date` is parsed into a `datetime` object for more accurate filtering and reporting.

5. **Output**:
   Despite the file name, the entire dataset (both documented and undocumented endpoints) is saved as a CSV file named `endpoints_missing_documentation_urls.csv` in the directory specified by the `--output-dir` argument; the directory is created if it does not exist. The file includes the endpoint metadata and the two computed flags, `documentation_missing` and `is_active`.
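
The flagging logic from steps 3 and 4 can be tried on a small hand-made DataFrame before running it against the live API. The sketch below is illustrative only: the sample rows, dataset names, and URLs are hypothetical and are not taken from the Digital Land platform.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the Datasette query.
sample = pd.DataFrame(
    {
        "pipeline/dataset": ["conservation-area", "tree", "tree", "article-4-direction"],
        "documentation_url": ["https://example.org/docs", None, "   ", ""],
        "entry_date": ["2024-01-15", "2024-03-01", "not-a-date", "2023-11-30"],
        "end_date": ["", None, "2024-06-01", ""],
    }
)

# Null, empty, and whitespace-only URLs all count as missing.
sample["documentation_missing"] = sample["documentation_url"].fillna("").str.strip() == ""
# An endpoint is treated as active when it has no end date.
sample["is_active"] = sample["end_date"].fillna("").str.strip() == ""
# Unparseable dates become NaT instead of raising an error.
sample["entry_date"] = pd.to_datetime(sample["entry_date"], errors="coerce")

missing = sample[sample["documentation_missing"]]
missing_count = int(missing.shape[0])
active_missing = int(missing["is_active"].sum())
ended_missing = missing_count - active_missing
latest_missing = missing["entry_date"].max()  # NaT values are skipped
per_dataset = missing.groupby("pipeline/dataset").size().sort_values(ascending=False)

print(f"Missing: {missing_count} (active: {active_missing}, ended: {ended_missing})")
print(f"Most recent entry with missing documentation: {latest_missing.date()}")
print(per_dataset.to_string())
```

Three of the four sample rows are flagged as missing documentation, which shows that `None`, `""`, and `"   "` are treated alike. The row with an unparseable entry date stays in the data but does not affect the "most recent" calculation.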

This script is useful for assessing data publishing completeness across planning authorities and helps ensure that all data sources are properly documented for transparency, traceability, and usability.
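
The offset pagination described in step 2 can be sketched without any network access. In this minimal example, `fetch_all`, `fake_fetch_page`, and `FAKE_TABLE` are hypothetical names introduced for illustration; the stub simply slices an in-memory list where the real script calls the Datasette API.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 1000  # matches the LIMIT used in the script's SQL


def fetch_all(
    fetch_page: Callable[[int], List[Dict]], page_size: int = PAGE_SIZE
) -> List[Dict]:
    """Collect rows page by page until a page comes back empty."""
    rows: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset)
        if not page:
            break  # an empty page signals that all rows have been read
        rows.extend(page)
        offset += page_size
    return rows


# Hypothetical in-memory "table" of 2,500 rows standing in for the API.
FAKE_TABLE = [{"endpoint": f"e{i}"} for i in range(2500)]
calls: List[int] = []  # records every offset that was requested


def fake_fetch_page(offset: int) -> List[Dict]:
    calls.append(offset)
    return FAKE_TABLE[offset : offset + PAGE_SIZE]


all_rows = fetch_all(fake_fetch_page)
print(len(all_rows), calls)
```

The loop always issues one extra request that returns an empty page, which is its stop signal. Offset pagination also assumes the query returns rows in the same order on every request; if the ordering is ambiguous, rows can be skipped or duplicated between pages.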


In [None]:
import requests
import pandas as pd
import os
import argparse

# Constants
DATASSETTE_URL = "https://datasette.planning.data.gov.uk/digital-land.json"

# Base SQL query to retrieve endpoint metadata
BASE_SQL = """
SELECT 
    o.name,
    s.organisation, 
    sp.pipeline AS "pipeline/dataset", 
    e.endpoint_url, 
    s.documentation_url,
    s.entry_date,
    s.end_date,
    e.endpoint
FROM 
    endpoint e
    INNER JOIN source s ON e.endpoint = s.endpoint
    INNER JOIN source_pipeline sp ON s.source = sp.source
    INNER JOIN organisation o ON o.organisation = s.organisation
ORDER BY s.entry_date DESC, s.source, sp.pipeline, e.endpoint
LIMIT 1000 OFFSET {offset}
"""

def parse_args():
    """
    Parses command-line arguments for the output directory.
    """
    parser = argparse.ArgumentParser(description="Export endpoints with missing documentation URLs.")
    parser.add_argument(
        "--output-dir",
        type=str,
        required=True,
        help="Directory to save the output CSV"
    )
    return parser.parse_args()

def fetch_endpoint_data():
    """
    Fetches all endpoint metadata using paginated SQL from Datasette API.

    Returns:
        pd.DataFrame: Combined result of all pages as a DataFrame.
    """
    all_rows, offset = [], 0
    columns = []

    while True:
        paginated_sql = BASE_SQL.format(offset=offset)
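        response = requests.get(
            DATASSETTE_URL,
            params={"sql": paginated_sql, "_size": 1000},
            timeout=30,  # avoid hanging indefinitely on a stalled connection
        )

        if response.status_code != 200:
            # Stop paginating; rows fetched so far are still returned.
            print(f"Failed to fetch data from Datasette (HTTP {response.status_code}) at offset {offset}.")
            break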

        json_data = response.json()
        rows = json_data.get("rows", [])

        if not rows:
            break

        if offset == 0:
            columns = json_data.get("columns", [])

        all_rows.extend(rows)
        offset += 1000

    return pd.DataFrame(all_rows, columns=columns) if all_rows else pd.DataFrame()

def analyze_missing_docs(df):
    """
    Analyzes the dataset to flag missing documentation URLs and report stats.

    Args:
        df (pd.DataFrame): Raw endpoint metadata.

    Returns:
        pd.DataFrame: DataFrame with added helper columns.
    """
    total = len(df)
    df["documentation_missing"] = df["documentation_url"].fillna("").str.strip() == ""
    missing_count = df["documentation_missing"].sum()
    percent_missing = (missing_count / total) * 100

    print(f"Total endpoints: {total}")
    print(f"Missing documentation_url: {missing_count}")
    print(f"Percent missing: {percent_missing:.2f}%")

    top_missing = (
        df[df["documentation_missing"]]
        .groupby("pipeline/dataset")
        .size()
        .sort_values(ascending=False)
        .head(10)
    )
    print("\nTop affected pipelines:")
    print(top_missing.to_string())

    df["is_active"] = df["end_date"].fillna("").str.strip() == ""
    active_missing = df.query("documentation_missing and is_active").shape[0]
    ended_missing = df.query("documentation_missing and not is_active").shape[0]

    print(f"\nActive endpoints missing documentation: {active_missing}")
    print(f"Ended endpoints missing documentation: {ended_missing}")

    df["entry_date"] = pd.to_datetime(df["entry_date"], errors="coerce")
    recent_missing = df[df["documentation_missing"]]["entry_date"].max()
    recent_str = recent_missing.date() if pd.notnull(recent_missing) else "N/A"
    print(f"\nMost recent entry with missing documentation: {recent_str}")

    return df

def save_results(df, output_dir):
    """
    Saves the full analyzed DataFrame (documented and undocumented endpoints) to CSV.

    Args:
        df (pd.DataFrame): The analyzed DataFrame.
        output_dir (str): Output directory path.
    """
    os.makedirs(output_dir, exist_ok=True)
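    # The full DataFrame is exported. To export only active endpoints that are
    # missing documentation, filter first:
    # df = df.query("documentation_missing and is_active")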
    output_path = os.path.join(output_dir, "endpoints_missing_documentation_urls.csv")
    df.to_csv(output_path, index=False)
    print(f"CSV saved: {output_path}")

def main():
    """
    Main workflow to fetch, analyze, and save data.
    """
    args = parse_args()
    df = fetch_endpoint_data()

    if df.empty:
        print("No data found to process.")
        return

    df = analyze_missing_docs(df)
    save_results(df, args.output_dir)

if __name__ == "__main__":
    main()
