## 🧭 Argo FormatChecker Notebook (AMRIT Consortium)

This notebook uses the **Argo Format Checker** provided by **AMRIT** to validate Argo NetCDF files.  
The FormatChecker performs both **format** and **content** checks on Argo NetCDF files to ensure compliance with the Argo data standards.

📘 **References:**
- The Argo NetCDF format is defined in the [Argo User’s Manual](http://dx.doi.org/10.13155/29825).  
- More details and documentation are available on the [Argo Data Management website](https://www.argodatamgt.org/Documentation).


## The main steps to run the checker

1. Set up
    - Before running this notebook, install the dependencies with:
    ```bash
    pip install -r notebooks/requirements.txt
    ```
2. Check and fix the configuration
    - Check that the API URL and DAC are correct.
        - Set `API_BASE_URL` to where the File Checker API is running.
            - If you are running the File Checker locally inside Docker, set it to the address where the
                Docker container exposes the API, for example `http://localhost:8000`.
            - If you are running the File Checker in a Kubernetes cluster, use the cluster
                URL where the service is exposed. For example:
                `https://livkrakentst.clusters.bodc.me/ewetchy/amrit/argo-toolbox/api/file-checker`
        - Only one of the URLs should be uncommented, depending on your environment.
3. Run the complete notebook
    - This will:
        - Import the necessary packages.
        - Set the configuration and DAC.
        - Check the connectivity to the API.
4. Check selected files
    - If you only need to check a few selected files, make sure the list of files to check is configured.
    - The `samples` folder contains a few files used for the demo.
5. Optionally, run the checker on all files of a deployment
    - If you want to check a whole deployment, make sure the path is correctly configured and run the corresponding cell.
    - The `profiles` subdirectory of the sample float in the `samples` folder holds a subset of the full deployment files, used for demo purposes.
6. Results are shown on the console and, if a results path is configured, also saved to a CSV file.
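
The file-checking steps above depend on the configured paths pointing at real files. As a minimal sketch, a pre-flight check could list any configured path that does not exist before an upload is attempted. The helper name `find_missing_files` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def find_missing_files(paths):
    """Return the configured paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Example with the sample paths used in this notebook's configuration
configured = [
    "samples/6902892/profiles/BR6902892_001.nc",
    "samples/6902892/profiles/BR6902892_002.nc",
]
missing = find_missing_files(configured)
if missing:
    print("Missing files:", missing)
else:
    print("All configured files were found.")
```

Running this before the checking cells gives a clear message about a wrong path instead of a `FileNotFoundError` part-way through the upload preparation.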

In [None]:
# ===============================
# 1️⃣ Importing necessary packages
# ===============================


import tempfile
from pathlib import Path

import pandas as pd
import requests
from IPython.display import Markdown, display

## Check the configuration and make any changes needed.

In [None]:
# ===============================
# 2️⃣ Configuration
# ===============================

# Local Docker instance
API_BASE_URL = "http://localhost:8000"

# Kubernetes test instance
# API_BASE_URL = "https://livkrakentst.clusters.bodc.me/ewetchy/amrit/argo-toolbox/api/file-checker"

# Default DAC for validation
DEFAULT_DAC = "bodc"

# Mount location of all the deployment files.
# If you are running the API locally via Docker, ensure this path matches the volume mount in your Docker setup,
# for example: docker run --rm --name argo-file-checker2 -p 8000:8000 argo-file-checker

# Local path where deployment files are located
DEPLOYMENTS_BASE_PATH = "samples/6902892/profiles"
# Files to check, along with their paths
FILES_TO_CHECK = [
    "samples/6902892/profiles/BR6902892_001.nc",
    "samples/6902892/profiles/BR6902892_002.nc",
]

# Result file (CSV), written by both the selected-file and the full-deployment checks.
# Leave it empty to skip saving.
RESULTS = "results/deployment_files_check_result.csv"

# Endpoints
CHECK_FILE_ENDPOINT = "/check-files"
FULL_DEPLOYMENT_CHECK_ENDPOINT = "/check-deployment"

# Request settings
TIMEOUT = 30  # API request timeout in seconds
HEADERS = {
    "accept": "application/json",
}

# Full URL for file check endpoint (the endpoint already starts with "/")
FILE_CHECK_URL = f"{API_BASE_URL}{CHECK_FILE_ENDPOINT}"

print(f"🔗 API Base URL: {API_BASE_URL}")
print(f"🏛️ DAC: {DEFAULT_DAC}")
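
Because the endpoint constants start with `/` and a base URL may or may not end with one, joining them by hand can produce a doubled or missing slash. The sketch below shows one way to normalize the join. The helper name `build_url` is hypothetical, and the example values are copied from the configuration above.

```python
def build_url(base_url: str, endpoint: str) -> str:
    """Join a base URL and an endpoint path with exactly one slash between them."""
    return f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"


# Values taken from the configuration cell
print(build_url("http://localhost:8000", "/check-files"))
print(build_url("http://localhost:8000/", "check-deployment"))
```

Plain string handling is used here rather than `urllib.parse.urljoin`, since `urljoin` would drop the path of a base URL such as the Kubernetes one when the endpoint starts with `/`.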

## Check the API connection.

In [None]:
# ===============================
# 3️⃣ Check the API connection
# ===============================

response = requests.get(f"{API_BASE_URL}/", timeout=5)
response.raise_for_status()
print("✅ API is reachable.")
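
When the API is not running, the connectivity cell stops with a raw exception. As a hedged alternative, the sketch below wraps the same idea (a GET on the base URL) in a function that returns a boolean. It is an assumption-laden variant, not the notebook's method: it uses the standard library's `urllib` instead of `requests` so that it runs without extra dependencies, and the name `api_is_reachable` is hypothetical.

```python
import urllib.error
import urllib.request


def api_is_reachable(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on the base URL succeeds, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url.rstrip('/')}/", timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # URLError covers refused connections and HTTP error statuses
        # (HTTPError is a subclass); OSError covers socket timeouts;
        # ValueError covers malformed URLs.
        return False


if api_is_reachable("http://localhost:8000"):
    print("API is reachable.")
else:
    print("API is not reachable; check API_BASE_URL and that the service is running.")
```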

## Check a file or list of files using the API


In [None]:
# ===============================
# 4️⃣ Checking selected files
# ===============================
print(f"Configured files to check: {FILES_TO_CHECK}")
file_paths = []
with tempfile.TemporaryDirectory() as tmp_dir:
    # Save uploaded files to temporary directory
    for filepath in FILES_TO_CHECK:
        fname = Path(filepath).name  # Get the filename from the path
        tmp_path = Path(tmp_dir) / fname

        # Copy the file to tmp_dir
        with Path(filepath).open("rb") as src:
            with tmp_path.open("wb") as dst:
                dst.write(src.read())
        file_paths.append(tmp_path)

    # Prepare files for upload
    files_data = []
    for file_path in file_paths:
        file_obj = Path(file_path).open("rb")
        files_data.append(("files", (Path(file_path).name, file_obj, "application/x-netcdf")))

    # Prepare request parameters
    params = {"dac": DEFAULT_DAC}

    # Make the POST request to check files
    try:
        response = requests.post(
            FILE_CHECK_URL,
            files=files_data,
            params=params,
            headers=HEADERS,
            timeout=TIMEOUT,
        )
    finally:
        # Close all opened file objects, even if the request fails
        for _, (_, file_obj, _) in files_data:
            file_obj.close()

    if files_data:
        # Process the response
        response.raise_for_status()  # Stop here on an HTTP error status
        checked_result = response.json()['results']
        # Display the results in a DataFrame
        if checked_result:
            for r in checked_result:
                r["errors_messages"] = "\n".join(r.get("errors_messages", []))
                r["warnings_messages"] = "\n".join(r.get("warnings_messages", []))
            df = pd.DataFrame(checked_result, columns=[
                "file",
                "result",
                "phase",
                "errors_number",
                "warnings_number",
                "errors_messages",
                "warnings_messages",
            ])

            # Display the results
            with pd.option_context(
                'display.max_columns', None,
                'display.width', None,
                'display.max_colwidth', None
                ):
                display(df)

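            # Optional: Save to CSV
            if RESULTS:
                # Create the output folder first so that to_csv does not fail
                Path(RESULTS).parent.mkdir(parents=True, exist_ok=True)
                df.to_csv(RESULTS, index=False, encoding="utf-8")
                print(f"✅ Results saved to: {RESULTS}")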
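
The cell above opens one file handle per upload and closes them by hand. One possible way to make the clean-up automatic is `contextlib.ExitStack`, sketched below. The helper name `open_for_upload` is hypothetical. The tuple layout mirrors the `files_data` list built in this notebook.

```python
from contextlib import ExitStack
from pathlib import Path


def open_for_upload(stack: ExitStack, paths):
    """Open each path in binary mode and register it for closing on `stack`.

    Returns a list shaped like the `files` argument built in this notebook:
    ("files", (filename, file object, content type)).
    """
    files_data = []
    for path in map(Path, paths):
        file_obj = stack.enter_context(path.open("rb"))
        files_data.append(("files", (path.name, file_obj, "application/x-netcdf")))
    return files_data


# Usage sketch (left as comments because it needs a running API):
# with ExitStack() as stack:
#     files_data = open_for_upload(stack, FILES_TO_CHECK)
#     response = requests.post(FILE_CHECK_URL, files=files_data,
#                              params={"dac": DEFAULT_DAC},
#                              headers=HEADERS, timeout=TIMEOUT)
```

Every handle registered on the stack is closed when the `with` block exits, including when the request raises.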

## [Optional] Check all the files of a deployment.

In [None]:
# ===============================
# 5️⃣ Checking all files in a deployment
# ===============================
print("The deployment folder configured is:", DEPLOYMENTS_BASE_PATH)
deployment_folder = DEPLOYMENTS_BASE_PATH
if not Path(deployment_folder).exists():
    print(f"📁 Folder does not exist: {deployment_folder}")
else:
    # Count number of files in the folder
    file_paths = [str(f) for f in Path(deployment_folder).iterdir() if f.is_file()]
    num_files = len(file_paths)

    if num_files == 0:
        print(f"No files found in deployment folder: {deployment_folder}")
    else:
        print(f"✔️ Number of files found in deployment folder: {num_files}")
        files_data = []
        # Prepare files for upload
        for file_path in file_paths:
            # Skip anything that is not an accessible regular file
            if not Path(file_path).is_file():
                print(f"📁 Error: Cannot access file: {Path(file_path)}")
                continue
            file_obj = Path(file_path).open("rb")
            files_data.append(("files", (Path(file_path).name, file_obj, "application/x-netcdf")))

        # set the request parameters
        params = {"dac": DEFAULT_DAC}

        # Send files to API
        try:
            response = requests.post(
                FILE_CHECK_URL,
                files=files_data,
                params=params,
                headers=HEADERS,
                timeout=TIMEOUT,
            )
        finally:
            # Close all opened file objects, even if the request fails
            for _, (_, file_obj, _) in files_data:
                file_obj.close()

        # Process the response
        response.raise_for_status()  # Stop here on an HTTP error status
        checked_result = response.json()['results']
        if checked_result:
            for r in checked_result:
                r["errors_messages"] = "\n".join(r.get("errors_messages", []))
                r["warnings_messages"] = "\n".join(r.get("warnings_messages", []))
            df = pd.DataFrame(checked_result, columns=[
                "file",
                "result",
                "phase",
                "errors_number",
                "warnings_number",
                "errors_messages",
                "warnings_messages",
            ])

            # Display the results
            with pd.option_context(
                'display.max_columns', None,
                'display.width', None,
                'display.max_colwidth', None
                ):
                display(df)

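            # Optional: Save to CSV
            if RESULTS:
                # Create the output folder first so that to_csv does not fail
                Path(RESULTS).parent.mkdir(parents=True, exist_ok=True)
                df.to_csv(RESULTS, index=False, encoding="utf-8")
                print(f"✅ Results saved to: {RESULTS}")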

        display(Markdown("**✅ Check complete!**"))
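
The deployment cell uploads every regular file found in the folder. If the folder might also hold non-NetCDF files, one option is to filter on the `.nc` extension and sort the result so that runs are reproducible. This is a sketch under that assumption about the folder layout, and the helper name `list_netcdf_files` is hypothetical.

```python
from pathlib import Path


def list_netcdf_files(folder):
    """Return the `.nc` files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".nc"
    )


# Example with the sample deployment folder from the configuration
folder = Path("samples/6902892/profiles")
if folder.is_dir():
    for path in list_netcdf_files(folder):
        print(path.name)
else:
    print(f"Folder does not exist: {folder}")
```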