This script automates the process of exporting SQL query results from a Datasette API and saving them as CSV files. It is designed to batch export data from multiple Datasette URLs using custom SQL queries.

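The script defines a `parse_args` helper that reads an output directory from the command line via `argparse` (`--output-dir`); in the version shown here that call is commented out and a hard-coded path is used instead. It defines a dictionary of Datasette URLs (`urls`) and a list of SQL queries (`sqls`) that are paired with those URLs by position, so the two must be the same length. Each SQL query is URL-encoded and appended to its Datasette endpoint in the form `.json?sql=...&_shape=array`, which returns the query results as a JSON array of row objects.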
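The URL construction described above can be sketched in isolation. The helper name `build_query_url` and the example query are illustrative assumptions, not part of the script; the encoding call and URL format are the ones the script uses.

```python
import urllib.parse


def build_query_url(base_url: str, sql: str) -> str:
    """Return the Datasette JSON URL that runs `sql` against `base_url`.

    `build_query_url` is a hypothetical helper used only for illustration.
    """
    # Percent-encode spaces, parentheses and other reserved characters
    encoded_sql = urllib.parse.quote(sql)
    # `_shape=array` asks Datasette for a plain JSON array of row objects
    return f"{base_url}.json?sql={encoded_sql}&_shape=array"


# Example query (an assumption, chosen to keep the URL short)
example_url = build_query_url(
    "https://datasette.planning.data.gov.uk/digital-land",
    "SELECT COUNT(*) AS n FROM operational_issue",
)
print(example_url)
```

Keeping the construction in one small function makes it easy to inspect the exact URL before any request is sent.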

Using `requests`, the script sends a GET request to each full URL. The JSON response is loaded into a pandas DataFrame, any `entry-date` column is renamed to `entry_date`, and the result is written as a CSV file to the specified output directory. The script creates the directory if it does not exist, and when an export fails it prints an error message and continues with the next item.
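The "report the failure and carry on" behaviour can be sketched without any network access. Everything here is an illustrative assumption: `export_all`, `fake_fetch` and `fake_save` are hypothetical stand-ins for the real request and CSV-writing steps, and returning the list of failed names is an addition the script itself does not make.

```python
def export_all(jobs: dict, fetch, save) -> list:
    """Run every job, printing failures instead of stopping on them."""
    failed = []
    for name, url in jobs.items():
        try:
            rows = fetch(url)
            save(name, rows)
        except Exception as exc:
            # Report the problem and move on to the next job
            print(f"Failed to export {name}: {exc}")
            failed.append(name)
    return failed


# Hypothetical stand-ins for the HTTP request and the CSV write
saved = {}


def fake_fetch(url: str) -> list:
    if "broken" in url:
        raise RuntimeError("simulated HTTP error")
    return [{"entry_date": "2024-01-08", "issue_count": 3}]


def fake_save(name: str, rows: list) -> None:
    saved[name] = rows


failed = export_all(
    {
        "bad": "https://example.invalid/broken",
        "good": "https://example.invalid/ok",
    },
    fake_fetch,
    fake_save,
)
```

Collecting the failed names is one possible design choice: it lets a caller summarise problems at the end of a batch rather than relying on the printed messages alone.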

Included examples:
- `log_by_week`: Groups log records by week and HTTP status (200 vs. failures), showing request volumes.
- `operational_issues`: Counts operational issues per `entry-date` over the past 6 months.
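The weekly grouping behind `log_by_week` relies on the SQLite expression `DATE(entry_date, 'weekday 0', '-6 days')`, which maps a date to the Monday of its week. The sketch below tries that expression against an in-memory SQLite table. The sample rows, the `request_count` alias and the `GROUP BY` clause are assumptions made for illustration, not the script's full query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (entry_date TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?)",
    [
        ("2024-01-08", 200),  # Monday
        ("2024-01-10", 200),  # Wednesday, same week
        ("2024-01-14", 404),  # Sunday, same week
        ("2024-01-15", 200),  # Monday of the following week
    ],
)

# 'weekday 0' moves forward to Sunday (or stays put if the date is already
# a Sunday); '-6 days' then steps back to the Monday of that week.
rows = conn.execute(
    """
    SELECT
        DATE(entry_date, 'weekday 0', '-6 days') AS week_start,
        CASE WHEN status = 200 THEN '200' ELSE 'FAIL' END AS status_group,
        COUNT(*) AS request_count
    FROM log
    GROUP BY week_start, status_group
    ORDER BY week_start, status_group
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```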

Each result is saved using the key name from the `urls` dictionary (e.g. `log_by_week.csv`, `operational_issues.csv`).
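Because names and queries are matched by position, the order of the `urls` dictionary decides which query feeds which file. A minimal sketch of that pairing follows; the placeholder queries and the `exports` directory are assumptions.

```python
import os

urls = {
    "log_by_week": "https://datasette.planning.data.gov.uk/digital-land",
    "operational_issues": "https://datasette.planning.data.gov.uk/digital-land",
}
# Placeholder queries (assumptions); the first query goes with the first key
sqls = ["SELECT 1 AS placeholder_a", "SELECT 2 AS placeholder_b"]

# Pair each output file with its query, in dictionary insertion order
planned = [
    (os.path.join("exports", f"{name}.csv"), sql)
    for (name, _url), sql in zip(urls.items(), sqls)
]

for path, sql in planned:
    print(f"{path} <- {sql}")
```

Reordering either `urls` or `sqls` without the other silently changes which query is written to which file, so the two should be edited together.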



In [None]:
import requests
import pandas as pd
import urllib.parse
import os
import argparse

def sql_queried_datasette_tables(urls: dict, sqls: list, save_dir: str):
    """
    Fetches data from a dictionary of Datasette URLs, running one SQL query
    per URL, and saves each result as a CSV file in the specified directory.

    Args:
        urls (dict): Mapping of output names (used as CSV file names) to Datasette base URLs.
        sqls (list of str): SQL queries, matched to the URLs by position.
        save_dir (str): Directory path where the resulting CSV files will be saved.

    Raises:
        ValueError: If the lengths of the URLs and SQL lists do not match.
    """
    if len(urls) != len(sqls):
        raise ValueError("The number of URLs and SQL queries must match.")

    # Ensure the output directory exists
    os.makedirs(save_dir, exist_ok=True)

    # Iterate over each (name, URL) and associated SQL
    for (name, url), sql in zip(urls.items(), sqls):
        try:
            # Define the output CSV filename
            csv_name = f"{name}.csv"

            # Encode SQL query and construct JSON API URL
            encoded_sql = urllib.parse.quote(sql)
            full_url = f"{url}.json?sql={encoded_sql}&_shape=array"
            
            print(f"Fetching: {name} from SQL URL:\n{full_url}")

            # Fetch JSON data and load into DataFrame
            # A timeout stops a stalled request from blocking the whole batch
            response = requests.get(full_url, timeout=60)
            response.raise_for_status()
            data = response.json()
            print(f"Rows returned: {len(data)}")
            df = pd.DataFrame(data)

            # Rename the hyphenated column to the expected snake_case name
            df.rename(columns={'entry-date': 'entry_date'}, inplace=True)

            # Save DataFrame to CSV in the specified directory
            save_path = os.path.join(save_dir, csv_name)
            df.to_csv(save_path, index=False)
            print(f"Saved: {save_path}")

        except Exception as e:
            # Log the failure (fetching, parsing, or saving) and continue
            print(f"Failed to export {name} from {url}: {e}")

def parse_args():
    """
    Parses command-line arguments for the output directory.

    Returns:
        argparse.Namespace: Parsed arguments containing the output directory path.
    """
    parser = argparse.ArgumentParser(description="Datasette batch exporter")
    parser.add_argument(
        "--output-dir",
        type=str,
        required=True,
        help="Directory to save exported CSVs"
    )
    return parser.parse_args()

if __name__ == "__main__":
    # Parse arguments from the CLI (currently disabled; a fixed path is used below)
    # args = parse_args()

    # Define URLs and SQL queries to export
    urls = {
        "operational_issues": "https://datasette.planning.data.gov.uk/digital-land"
    }

    sqls = [
        # SQL to count operational issues by week over the last 6 months
        """
        SELECT
            [entry-date],
            COUNT(rowid) AS issue_count
        FROM
            operational_issue
        WHERE
            [entry-date] >= DATE('now', '-6 months')
        GROUP BY
            [entry-date];
        """
    ]

    # Execute the export (swap output_dir for args.output_dir when using the CLI)
    output_dir = "C:/Users/DanielGodden/Documents/MCHLG/collecting_and_managing_data"
    sql_queried_datasette_tables(urls, sqls, output_dir)


Fetching: log_by_week from SQL URL:
https://datasette.planning.data.gov.uk/digital-land.json?sql=%0A%20%20%20%20%20%20%20%20SELECT%0A%20%20%20%20%20%20%20%20%20%20%20%20COUNT%28endpoint%29%20AS%20endpoint_count%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20SUBSTR%28entry_date%2C%201%2C%2010%29%20AS%20entrydate%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20DATE%28entry_date%2C%20%27weekday%200%27%2C%20%27-6%20days%27%29%20AS%20week_start%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20CASE%20%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20WHEN%20status%20%3D%20200%20THEN%20%27200%27%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20ELSE%20%27FAIL%27%0A%20%20%20%20%20%20%20%20%20%20%20%20END%20AS%20status_group%0A%20%20%20%20%20%20%20%20FROM%20log%0A%20%20%20%20%20%20%20%20WHERE%0A%20%20%20%20%20%20%20%20%20%20%20%20entry_date%20%3E%3D%20DATE%28%27now%27%2C%20%27-6%20months%27%29%0A%20%20%20%20%20%20%20%20%20%20%20%20AND%20SUBSTR%28entry_date%2C%201%2C%2010%29%20%3C%3D%20DATE%28entry_date%2C%20%27week