## 🧭 Argo FormatChecker Notebook (AMRIT Consortium)

This pre-configured notebook lets you use the **Argo Format Checker** provided by **AMRIT** to validate Argo NetCDF files.  
The FormatChecker performs both **format** and **content** checks on Argo NetCDF files to ensure compliance with the Argo data standards.

📘 **References:**
- The Argo NetCDF format is defined in the [Argo User's Manual](http://dx.doi.org/10.13155/29825).
- More details and documentation are available on the [Argo Data Management website](https://www.argodatamgt.org/Documentation).


## The main steps to run the checker
1. Make sure the file checker API service is running. See the main README of this repository for instructions.
2. Make sure data is available in the examples/data directory. Two deployments are provided: floats 2903996 (coriolis) and 3901945 (bodc).
3. Run the complete notebook. This will:
    - Import the necessary packages.
    - Pull the example data from Git LFS if it is absent.
    - Set the configuration and the DAC.
    - Check the connectivity to the API.
4. Check selected files.
    - To check only a few selected files, make sure the list of files to check is configured.
    - The 'data' folder contains a few files that have been used for the demo.
5. Optionally, run the checker on all files of a deployment.
    - To check a whole deployment, make sure the deployment path is correctly configured, then run the corresponding cell.
6. Results are shown on the console and, if a results path is configured, saved to a CSV file.

In [1]:
# ===============================
# 1️⃣ Importing necessary packages
# ===============================


import subprocess
import tempfile
from pathlib import Path

import pandas as pd
import requests
from IPython.display import Markdown, display

In [2]:
# ====================================
# 2️⃣ Pull LFS example data if missing
# ====================================

# Make sure the data and results folders exist
data_dirs = [
    Path('../data/2903996'),
    Path('../data/3901945'),
    Path('../results'),
]
for data_dir in data_dirs:
    if not data_dir.exists():
        print(f"Creating directory '{data_dir}'...")
        data_dir.mkdir(parents=True, exist_ok=True)

# Look for .nc files and pull them from Git LFS if none are found
nc_files = []
for data_dir in data_dirs:
    nc_files.extend(data_dir.glob("*.nc"))

if not nc_files:
    print("No files found. Pulling files from LFS...")
    try:
        subprocess.run(["git", "lfs", "pull"], check=True, cwd="..")
    except (subprocess.CalledProcessError, FileNotFoundError):
        # CalledProcessError: git ran but failed; FileNotFoundError: git is not installed
        print("Git LFS error. Make sure Git LFS is installed.")
        print("   Installation: git lfs install")
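
The cell above only pulls from LFS when no `.nc` file is found at all. In a clone made without Git LFS, the files can exist as small text pointer stubs, which would pass that test. The sketch below shows one way to spot such stubs, relying on the fact that a Git LFS pointer file starts with a fixed `version` line. The helper names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical and not part of this repository.

```python
from pathlib import Path

# First line of a Git LFS pointer file, as defined by the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like a Git LFS pointer stub, not real data."""
    with Path(path).open("rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(directory, pattern="*.nc"):
    """List files in `directory` matching `pattern` that are still LFS pointers."""
    return sorted(p for p in Path(directory).glob(pattern) if is_lfs_pointer(p))


# Possible use in the notebook (assumption: run before the LFS pull decision):
# stubs = find_lfs_pointers("../data/2903996")
# if stubs:
#     print(f"{len(stubs)} file(s) are LFS pointers; run 'git lfs pull'.")
```

If any stubs are reported, running `git lfs pull` from the repository root should replace them with the actual NetCDF files.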

## Check the configuration and adjust it as needed.

In [None]:
# ===============================
# 3️⃣ Configuration
# ===============================

# Local Docker instance
API_BASE_URL = "http://host.docker.internal:8080/argo-toolbox/api/file-checker"

# Default DAC for validation
DEFAULT_DAC = "coriolis" # use "bodc" for deployment 3901945

# Local path where deployment files are located
DEPLOYMENTS_BASE_PATH = "../data/2903996"  # you can also check files from 3901945

# Files to check, along with their paths
FILES_TO_CHECK = [
    "../data/2903996/R2903996_001.nc",
    "../data/2903996/R2903996_002.nc",
]

# Result file (CSV); written by both the selected-files and the full-deployment checks
RESULTS = "../results/deployment_files_check_result.csv"

# Endpoints
CHECK_FILE_ENDPOINT = "/check-files"

# Request settings
TIMEOUT = 30  # API request timeout in seconds
HEADERS = {
    "accept": "application/json",
}

# Full URL for the file check endpoint (CHECK_FILE_ENDPOINT already starts with "/")
FILE_CHECK_URL = f"{API_BASE_URL}{CHECK_FILE_ENDPOINT}"

print(f"🔗 API Base URL: {API_BASE_URL}")
print(f"🏛️ DAC: {DEFAULT_DAC}")
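
The file check URL is built from a base URL and an endpoint path, and either part may be edited to carry a leading or trailing slash. A minimal sketch of a join that tolerates both spellings follows; `join_url` is a hypothetical helper, not part of the notebook or the API.

```python
def join_url(base, endpoint):
    """Join a base URL and an endpoint path with exactly one slash between them."""
    return f"{base.rstrip('/')}/{endpoint.lstrip('/')}"


# Values copied from the configuration cell above.
API_BASE_URL = "http://host.docker.internal:8080/argo-toolbox/api/file-checker"
CHECK_FILE_ENDPOINT = "/check-files"

print(join_url(API_BASE_URL, CHECK_FILE_ENDPOINT))
```

With this helper, `"/check-files"` and `"check-files"` produce the same URL, so editing the configuration cannot introduce a doubled slash.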

## Check the API connection.

In [None]:
# ===============================
# 4️⃣ Check the API connection
# ===============================

response = requests.get(f"{API_BASE_URL}/", timeout=5)
response.raise_for_status()
print("✅ API is reachable.")
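
The connection check above makes a single attempt, so it fails if the API container is still starting. One option, sketched below, is to retry a probe a few times before giving up. `wait_until_ready` is a hypothetical helper, and the attempt count and delay are arbitrary assumptions rather than values recommended by the service.

```python
import time


def wait_until_ready(probe, attempts=5, delay=2.0, sleep=time.sleep):
    """Call `probe()` until it succeeds or `attempts` is exhausted.

    `probe` is any zero-argument callable that raises on failure.
    Returns the number of attempts used; re-raises the last error if all fail.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            probe()
            return attempt
        except Exception as exc:  # broad on purpose: the probe defines what failure means
            if attempt == attempts:
                raise
            print(f"Attempt {attempt}/{attempts} failed ({exc}); retrying in {delay}s")
            sleep(delay)
```

In this notebook the probe could be `lambda: requests.get(f"{API_BASE_URL}/", timeout=5).raise_for_status()`, which raises both on connection errors and on HTTP error statuses.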

## Check a file or list of files using the API


In [None]:
# ===============================
# 5️⃣ Checking selected files
# ===============================
print(f"Configured files to check: {FILES_TO_CHECK}")
file_paths = []
with tempfile.TemporaryDirectory() as tmp_dir:
    # Copy the files to check into the temporary directory
    for filepath in FILES_TO_CHECK:
        fname = Path(filepath).name  # Get the filename from the path
        tmp_path = Path(tmp_dir) / fname

        # Copy the file to tmp_dir
        with Path(filepath).open("rb") as src:
            with tmp_path.open("wb") as dst:
                dst.write(src.read())
        file_paths.append(tmp_path)

    # Prepare files for upload
    files_data = []
    for file_path in file_paths:
        file_obj = Path(file_path).open("rb")
        files_data.append(("files", (Path(file_path).name, file_obj, "application/x-netcdf")))

    # Prepare request parameters
    params = {"dac": DEFAULT_DAC}

    # Make the POST request; close the file handles even if the request fails
    try:
        response = requests.post(
            FILE_CHECK_URL,
            files=files_data,
            params=params,
            headers=HEADERS,
            timeout=TIMEOUT,
        )
        # Raise on HTTP errors instead of trying to parse an error body
        response.raise_for_status()
    finally:
        for _, (_, file_obj, _) in files_data:
            file_obj.close()

    if files_data:

        # Process the response
        checked_result = response.json()
        # Display the results in a DataFrame
        if checked_result:
            for r in checked_result:
                r["errors_messages"] = "\n".join(r.get("errors_messages", []))
                r["warnings_messages"] = "\n".join(r.get("warnings_messages", []))
            df = pd.DataFrame(checked_result, columns=[
                "file",
                "result",
                "phase",
                "errors_number",
                "warnings_number",
                "errors_messages",
                "warnings_messages",
            ])

            # Display the results
            with pd.option_context(
                'display.max_columns', None,
                'display.width', None,
                'display.max_colwidth', None
                ):
                display(df)

            # Optional: Save to CSV
            if RESULTS:
                df.to_csv(RESULTS, index=False, encoding="utf-8")
                print(f"✅ Results saved to: {RESULTS}")
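
The cell above opens every file by hand and closes the handles afterwards. As an alternative, `contextlib.ExitStack` from the standard library can own all the handles and close them when its block ends, whether or not the request succeeds. This is only a sketch: `open_upload_parts` is a hypothetical helper that builds the same `("files", (name, handle, mime))` tuples the notebook passes to `requests.post`.

```python
from contextlib import ExitStack
from pathlib import Path


def open_upload_parts(paths, stack, field="files", mime="application/x-netcdf"):
    """Open each path in binary mode and build multipart tuples for `files=`.

    Every handle is registered on `stack`, so leaving the ExitStack block
    closes all of them, even if an exception is raised in between.
    """
    parts = []
    for path in paths:
        path = Path(path)
        handle = stack.enter_context(path.open("rb"))
        parts.append((field, (path.name, handle, mime)))
    return parts


# Possible use in the notebook (names taken from the configuration cell):
# with ExitStack() as stack:
#     files_data = open_upload_parts(FILES_TO_CHECK, stack)
#     response = requests.post(FILE_CHECK_URL, files=files_data,
#                              params={"dac": DEFAULT_DAC},
#                              headers=HEADERS, timeout=TIMEOUT)
```

Because the handles are tied to the `with` block, no separate closing loop is needed after the request.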

## [Optional] Check all the files of a deployment.

In [None]:
# ===============================
# 6️⃣ Checking all files in a deployment
# ===============================
print("The deployment folder configured is:", DEPLOYMENTS_BASE_PATH)
deployment_folder = DEPLOYMENTS_BASE_PATH
if not Path(deployment_folder).exists():
    print(f"📁 Folder does not exist: {deployment_folder}")
else:
    # List the files in the folder and count them
    file_paths = [str(f) for f in Path(deployment_folder).iterdir() if f.is_file()]
    num_files = len(file_paths)

    if num_files == 0:
        print(f"No files found in deployment folder: {deployment_folder}")
    else:
        print(f"✔️ Number of files found in deployment folder: {num_files}")
        files_data = []
        # Prepare files for upload
        for file_path in file_paths:
            # Skip entries that are no longer accessible as regular files
            if not Path(file_path).is_file():
                print(f"📁 Error: Cannot access file: {file_path}")
                continue
            file_obj = Path(file_path).open("rb")
            files_data.append(("files", (Path(file_path).name, file_obj, "application/x-netcdf")))

        # set the request parameters
        params = {"dac": DEFAULT_DAC}

        # Send files to API; close the file handles even if the request fails
        try:
            response = requests.post(
                FILE_CHECK_URL,
                files=files_data,
                params=params,
                headers=HEADERS,
                timeout=TIMEOUT,
            )
            # Raise on HTTP errors instead of trying to parse an error body
            response.raise_for_status()
        finally:
            for _, (_, file_obj, _) in files_data:
                file_obj.close()

        # Process the response
        checked_result = response.json()
        if checked_result:
            for r in checked_result:
                r["errors_messages"] = "\n".join(r.get("errors_messages", []))
                r["warnings_messages"] = "\n".join(r.get("warnings_messages", []))
            df = pd.DataFrame(checked_result, columns=[
                "file",
                "result",
                "phase",
                "errors_number",
                "warnings_number",
                "errors_messages",
                "warnings_messages",
            ])

            # Display the results
            with pd.option_context(
                'display.max_columns', None,
                'display.width', None,
                'display.max_colwidth', None
                ):
                display(df)

            # Optional: Save to CSV
            if RESULTS:
                df.to_csv(RESULTS, index=False, encoding="utf-8")
                print(f"✅ Results saved to: {RESULTS}")

        display(Markdown("**✅ Check complete!**"))
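
The cell above sends every file of the deployment in one request with a 30-second timeout. For a deployment with many profiles it may be safer to send the files in smaller batches and concatenate the results. The sketch below shows only the batching step. `chunked` and `BATCH_SIZE` are hypothetical, and whether the service limits upload size is an assumption not confirmed by this notebook.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical batch size; tune it to the service and the timeout in use.
BATCH_SIZE = 20

# Possible use in the notebook: one POST per batch, results gathered in a list.
# all_results = []
# for batch in chunked(file_paths, BATCH_SIZE):
#     ...open the files in `batch`, post them, then...
#     all_results.extend(response.json())
```

The per-batch responses can then be passed together to the same `pd.DataFrame` construction used above.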