# Artificial Intelligence
## AI Ready Data - 006
### Profile the processing of each loaded dataset using various techniques



<center>
<table align="center">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/christophergarthwood/jbooks/blob/main/STEM-006_AIReadyData-Speed-Tests-003.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/notebooks?referrer=search&hl=en&project=ai-bootcamp">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Link to Colab Enterprise
    </a>
  </td>   
  <td style="text-align: center">
    <a href="https://github.com/christophergarthwood/jbooks/blob/main/STEM-006_AIReadyData-Speed-Tests-003.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/instances?referrer=search&hl=en&project=ai-bootcamp">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Link to Vertex AI Workbench
    </a>
  </td>
</table>
</center>
<br><br><br>

| | |
|-|-|
|Author(s) | [Christopher G Wood](https://github.com/christophergarthwood)  |

# Overview

Using various data sources, we will download, review, and package data in several formats, exploring the options for "AI Ready" data and what that term means.

## What is "**AI Ready**" Data?


## References:

#### Data Formats
+ [List of ML File Formats](https://github.com/trailofbits/ml-file-formats)
+ [ML Guide to Data Formats](https://www.hopsworks.ai/post/guide-to-file-formats-for-machine-learning)
+ [Why are ML Data Structures Different?](https://stackoverflow.blog/2023/01/04/getting-your-data-in-shape-for-machine-learning/)

#### FAIR
+ [FAIR and AI-Ready](https://repository.niddk.nih.gov/public/NIDDKCR_Office_Hours_AI-Readiness_and_Preparing_AI-Ready_+Datasets_12_2023.pdf)
+ [AI-Ready-Data](https://www.rishabhsoft.com/blog/ai-ready-data)
+ [AI-Ready FAIR Data](https://medium.com/@sean_hill/ai-ready-fair-data-accelerating-science-through-responsible-ai-and-data-stewardship-3b4f21c804fd)
+ [AI-Ready Data ... Quality](https://www.elucidata.io/blog/building-ai-ready-data-why-quality-matters-more-than-quantity)
+ [AI-Ready Data Explained](https://acodis.io/hubfs/pdfs/AI-ready%20data%20Explained%20Whitepaper%20(1).pdf)

+ [GCP with BigQuery DataFrames](https://cloud.google.com/blog/products/data-analytics/building-aiml-apps-in-python-with-bigquery-dataframes)

#### Format Libraries / Standards
+ [Earth Science Information partners (ESIP)](https://www.esipfed.org/checklist-ai-ready-data/)
+ [Zarr - Storage of N-dimensional arrays (tensors)](https://zarr.dev/#description)
  + [Zarr explained](https://aijobs.net/insights/zarr-explained/)
+ [Apache Parquet](https://parquet.apache.org/)
  + [All about Parquet](https://medium.com/data-engineering-with-dremio/all-about-parquet-part-01-an-introduction-b62a5bcf70f8)
+ [PySTAC - SpatioTemporal Asset Catalogs](https://pystac.readthedocs.io/en/stable/)
  + [John Hogland's Spatial Modeling Tutorials](https://github.com/jshogland/SpatialModelingTutorials/blob/main/README.md)
 

In [1]:
# Let's define some variables (information holders) for our project overall

global PROJECT_ID, BUCKET_NAME, LOCATION
BUCKET_NAME = "ai-bootcamp-vertex-colab"
PROJECT_ID = "ai-bootcamp"
LOCATION = "us-central1"

BOLD_START = "\033[1m"
BOLD_END = "\033[0m"

In [2]:
# Now create a means of enforcing project id selection

import ipywidgets as widgets
from IPython.display import display


def wait_for_button_press():

    button_pressed = False

    # Create widgets
    html_widget = widgets.HTML(
        value="""
        <center><table><tr><td><h1 style="font-family: Roboto;font-size: 24px"><b>&#128721; &#9888;&#65039; WARNING &#9888;&#65039; &#128721; </b></h1></td></tr></table></center></br></br>

        <table><tr><td>
            <span style="font-family: Tahoma;font-size: 18px">
              This notebook was designed to work in Jupyter Notebook or Google Colab with the understanding that certain permissions might be enabled.</br>
              Please verify that you are in the appropriate project and that the:</br>
              <center><code><b>PROJECT_ID</b></code> </br></center>
              aligns with the Project Id in the upper left corner of this browser and that the location:
              <center><code><b>LOCATION</b></code> </br></center>
              aligns with the instructions provided.
            </span>
          </td></tr></table></br></br>

    """
    )

    project_list = [
        "ai-bootcamp",
        "usfs-ai-bootcamp",
        "usfa-ai-advanced-training",
        "I will setup my own",
    ]
    dropdown = widgets.Dropdown(
        options=project_list,
        value=project_list[0],
        description="Set Your Project:",
    )

    html_widget2 = widgets.HTML(
        value="""
        <center><table><tr><td><h1 style="font-family: Roboto;font-size: 24px"><b>&#128721; &#9888;&#65039; WARNING &#9888;&#65039; &#128721; </b></h1></td></tr></table></center></br></br>
          """
    )

    button = widgets.Button(description="Accept")

    # Function to handle the selection change
    def on_change(change):
        global PROJECT_ID
        if change["type"] == "change" and change["name"] == "value":
            # print("Selected option:", change['new'])
            PROJECT_ID = change["new"]

    # Observe the dropdown for changes
    dropdown.observe(on_change)

    def on_button_click(b):
        nonlocal button_pressed
        global PROJECT_ID
        button_pressed = True
        # button.disabled = True
        button.close()  # Remove the button from display
        with output:
            # print(f"Button pressed...continuing")
            # print(f"Selected option: {dropdown.value}")
            PROJECT_ID = dropdown.value

    button.on_click(on_button_click)
    output = widgets.Output()

    # Create centered layout
    centered_layout = widgets.VBox(
        [
            html_widget,
            widgets.HBox([dropdown, button]),
            html_widget2,
        ],
        layout=widgets.Layout(
            display="flex", flex_flow="column", align_items="center", width="100%"
        ),
    )
    # Display the layout
    display(centered_layout)


wait_for_button_press()

VBox(children=(HTML(value='\n        <center><table><tr><td><h1 style="font-family: Roboto;font-size: 24px"><b…

## Environment Check

In [3]:
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# - Google Colab Check
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
import datetime

RunningInCOLAB = False
RunningInCOLAB = "google.colab" in str(get_ipython())
current_time = datetime.datetime.now()

if RunningInCOLAB:
    print(
        f"You are running this notebook in Google Colab at {current_time} in the {BOLD_START}{PROJECT_ID}{BOLD_END} lab."
    )
else:
    print(
        f"You are likely running this notebook with Jupyter iPython runtime at {current_time} in the {PROJECT_ID} lab."
    )

You are likely running this notebook with Jupyter iPython runtime at 2025-06-02 17:46:03.903692 in the ai-bootcamp lab.
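
The check above relies on `get_ipython()`, which only exists inside an IPython kernel. As a minimal sketch for code that may also run as a plain script, the same question can be asked of `sys.modules`. The helper name `running_in_colab` is hypothetical, and the approach assumes that Colab preloads the `google.colab` package into the kernel.

```python
import sys


def running_in_colab() -> bool:
    """Best-effort Colab detection that also works outside IPython.

    Assumption: a Colab kernel has already imported the google.colab package,
    so it shows up in sys.modules. Outside Colab this simply returns False.
    """
    return "google.colab" in sys.modules


print(f"Running in Colab: {running_in_colab()}")
```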


## Library Management
### Load Libraries necessary for this operation via pip install

In [4]:
# Import key libraries necessary to support dynamic installation of additional libraries
import sys

# Use subprocess to run operating system commands from Python. The notebook "bang" (!)
# syntax also works, but it does not carry over to a plain Python script, so subprocess
# is the more portable approach.
import subprocess
import importlib.util

In [5]:
# Identify the libraries you'd like to add to this Runtime environment.
# Commented out as this adds time but is critical for initial run.
"""
libraries = [
    "backoff",
    "python-dotenv",
    "seaborn",
    "piexif",
    "unidecode",
    "icecream",
    "watermark",
    "watermark[GPU]",
    "rich",
    "rich[jupyter]",
    "numpy",
    "pydot",
    "polars[all]",
    "dask[complete]",
    "xarray",
    "pandas",
    "pystac",
    "pystac[jinja2]",
    "pystac[orjson]",
    "pystac[validation]",
    "fastparquet",
    "zarr",
    "gdown",
    "wget",
]

# Loop through each library and test for existence, if not present install quietly
for library in libraries:
    if library == "Pillow":
        spec = importlib.util.find_spec("PIL")
    else:
        spec = importlib.util.find_spec(library)
    if spec is None:
        print("Installing library " + library)
        subprocess.run(["pip", "install", library, "--quiet"], check=True)
    else:
        print("Library " + library + " already installed.")

# Specialized install for GPU enabled capability with CUDF
# pip install --extra-index-url=https://pypi.nvidia.com "cudf-cu12==25.2.*" "dask-cudf-cu12==25.2.*" "cuml-cu12==25.2.*" "cugraph-cu12==25.2.*" "nx-cugraph-cu12==25.2.*" "cuspatial-cu12==25.2.*"     "cuproj-cu12==25.2.*" "cuxfilter-cu12==25.2.*" "cucim-cu12==25.2.*"
try:
    library="cudf-cu12"
    spec = importlib.util.find_spec(library)
    if spec is None:
        subprocess.run(
            [
                "pip",
                "install",
                "--extra-index-url=https://pypi.nvidia.com",
                library,
                "--quiet",
            ],
            check=True,
        )
    else:
        print("Library " + library + " already installed.")

    library="dask-cudf-cu12"
    spec = importlib.util.find_spec(library)
    if spec is None:
        subprocess.run(
            [
                "pip",
                "install",
                "--extra-index-url=https://pypi.nvidia.com",
                library,
                "--quiet",
            ],
            check=True,
        )
    else:
        print("Library " + library + " already installed.")

except (subprocess.CalledProcessError, RuntimeError, Exception) as e:
    print(repr(e))
"""

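
One caveat with the commented-out installer above: `importlib.util.find_spec` expects an importable module name, while entries such as `python-dotenv` or `polars[all]` are pip requirement strings, so the lookup reports them as missing even when they are installed. The sketch below shows one way to bridge the two. The `IMPORT_NAMES` table and both helper names are hypothetical, and the table only lists a few distributions whose import name differs from the pip name.

```python
import importlib.util

# Hypothetical lookup table: pip distribution name -> importable module name.
# Only distributions whose two names differ need an entry.
IMPORT_NAMES = {
    "python-dotenv": "dotenv",
    "Pillow": "PIL",
    "cudf-cu12": "cudf",
    "dask-cudf-cu12": "dask_cudf",
}


def import_name(requirement: str) -> str:
    """Drop an extras marker such as "[all]" and map the rest to a module name."""
    base = requirement.split("[", 1)[0].strip()
    return IMPORT_NAMES.get(base, base.replace("-", "_"))


def is_installed(requirement: str) -> bool:
    """Return True when the module behind a requirement string can be located."""
    return importlib.util.find_spec(import_name(requirement)) is not None


for requirement in ["polars[all]", "python-dotenv", "json"]:
    print(requirement, "->", import_name(requirement), is_installed(requirement))
```

With a helper like this, the install loop would call `is_installed(library)` and still pass the original requirement string, extras included, to `pip install`.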

### Library Import

In [6]:
# - Import additional libraries that add value to the project related to NLP

# - Set of libraries that perhaps should always be in Python source
import backoff
import datetime
from dotenv import load_dotenv
import gc
import getopt
import glob
import inspect
import io
import itertools
import json
import math
import os
from pathlib import Path
import pickle
import platform
import random
import re
import shutil
import string
from io import StringIO
import subprocess
import socket
import sys
import textwrap
import tqdm
import traceback
import warnings
import time
import uuid

#- Datastructures
from dataclasses import dataclass, fields, field
from typing import List

#- Profiling
from time import perf_counter
import gc
import io
import tracemalloc
import psutil
import cProfile
import pstats
from pstats import SortKey

#- Text formatting
from rich import print as rprint
from rich.console import Console
from rich.traceback import install
from tabulate import tabulate
import locale

# - Displays system info
from watermark import watermark as the_watermark
from py3nvml import py3nvml

# - Additional libraries for this work
import math
from base64 import b64decode
from IPython.display import Image, Markdown
import pandas, IPython.display as display, io, jinja2, base64
from IPython.display import clear_output  # used to support real-time plotting
import requests
import unidecode
import pydot
import wget

# - Data Science Libraries
import pandas as pd
import numpy as np
import polars as pl
import dask as da
import dask.dataframe as dd
import dask.bag as db
import xarray as xr
import cupy_xarray  # never actually invoked in source itself use ds=ds.cupy.as_cupy()
import pystac as pys
import pystac
from pystac.utils import datetime_to_str

# - Statistics
import statistics

# from stacframes import df_from
import fastparquet as fq
import zarr
from zarr import Group
import netCDF4 as nc
from netCDF4 import Dataset

try:
    import cudf
except Exception as e:
    pass

try:
    import cupy
except Exception as e:
    pass

# Tensorflow and related AI libraries
import tensorflow as tf
from tensorflow import data as tf_data

# Torch
import torch

# - Graphics
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.cbook import get_sample_data
from matplotlib.offsetbox import AnnotationBbox, DrawingArea, OffsetImage, TextArea
from matplotlib.pyplot import imshow
from matplotlib.patches import Circle
from PIL import Image as PIL_Image
import PIL.ImageOps
import matplotlib.image as mpimg
from imageio import imread
import seaborn as sns

from mpl_toolkits.basemap import Basemap
from pylab import *

# - Image meta-data for Section 508 compliance
import piexif
from piexif.helper import UserComment

# - Progress bar
from tqdm import tqdm
from tqdm.notebook import trange, tqdm


--------------------------------------------------------------------------------

  CuPy may not function correctly because multiple CuPy packages are installed
  in your environment:

    cupy, cupy-cuda12x

  Follow these steps to resolve this issue:

    1. For all packages listed above, run the following command to remove all
       existing CuPy installations:

         $ pip uninstall <package_name>

      If you previously installed CuPy via conda, also run the following:

         $ conda uninstall cupy

    2. Install the appropriate CuPy package.
       Refer to the Installation Guide for detailed instructions.

         https://docs.cupy.dev/en/stable/install.html

--------------------------------------------------------------------------------

  import cupy
2025-06-02 17:46:08.103249: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders.

## DataClasses

In [7]:
## Dataclass used to represent each metric used during execution
#
@dataclass
class aggregate_metrics:
    id: str

    # see @Profile
    runtime: List[float] = field(default_factory=list)
    
    # milliseconds, reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_read_time: List[float] = field(default_factory=list)
    
    # milliseconds, reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_write_time: List[float] = field(default_factory=list)
    
    # number read operations[end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_read_count: List[float] = field(default_factory=list)

    # number write operations[end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_write_count: List[float] = field(default_factory=list)

    # megabytes read [end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_read_throughput: List[float] = field(default_factory=list)

    # megabytes written [end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_write_throughput: List[float] = field(default_factory=list)

    # number read operations[end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_os_read_count: List[float] = field(default_factory=list)

    # number write operations[end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_os_write_count: List[float] = field(default_factory=list)

    # milliseconds, reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_os_read_time: List[float] = field(default_factory=list)

    # milliseconds, reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_os_write_time: List[float] = field(default_factory=list)

    # calculated in MBs, reference: https://www.geeksforgeeks.org/monitoring-memory-usage-of-a-running-python-program/
    mem_current: List[float] = field(default_factory=list)

    # calculated in MBs, reference: https://www.geeksforgeeks.org/monitoring-memory-usage-of-a-running-python-program/
    mem_peak: List[float] = field(default_factory=list)

    def calculate_stats(self, field_name) -> List:
        # Returns [mean, median, mode, stdev, variance] for the named field.
        # statistics.stdev and statistics.variance need at least two samples.
        my_stats = []
        my_field_data = getattr(self, field_name)
        my_field_data = [float(x) for x in my_field_data]
        my_stats.append(statistics.mean(my_field_data))
        my_stats.append(statistics.median(my_field_data))
        my_stats.append(statistics.mode(my_field_data))
        my_stats.append(statistics.stdev(my_field_data))
        my_stats.append(statistics.variance(my_field_data))

        return my_stats
        
    def __str__(self):
        # Caret-delimited record: id, runtime, disk read counts, disk write counts, general disk read counts, general disk write counts, target disk read time, target disk write time, general disk read time, general disk write time, memory current, memory peak
        return f"""
             {self.id}^{self.calculate_stats("runtime")}^{self.calculate_stats("io_disk_read_count")}^{self.calculate_stats("io_disk_write_count")}^{self.calculate_stats("io_os_read_count")}^{self.calculate_stats("io_os_write_count")}^{self.calculate_stats("io_disk_read_time")}^{self.calculate_stats("io_disk_write_time")}^{self.calculate_stats("io_os_read_time")}^{self.calculate_stats("io_os_write_time")}^{self.calculate_stats("mem_current")}^{self.calculate_stats("mem_peak")} 
             """ 


In [8]:
 
        
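    # Alternate, human-readable formatter for aggregate_metrics. It is an indented method
    # with no enclosing class in this cell, so it only runs once moved into the class body.
    def __strs__(self):
        return f"""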
             Id---------------------------------------------
                                   Id: {self.id}
             Runtime----------------------------------------
                   Runtime Stats:    {self.calculate_stats("runtime")} seconds

             I/O Counts-------------------------------------
                   Targeted disk read: {self.calculate_stats("io_disk_read_count")} counts
                  Targeted disk write: {self.calculate_stats("io_disk_write_count")} counts
                    General disk read: {self.calculate_stats("io_os_read_count")} counts
                   General disk write: {self.calculate_stats("io_os_write_count")} counts

             I/O Time---------------------------------------
              Targeted disk read time: {self.calculate_stats("io_disk_read_time")} milliseconds
             Targeted disk write time: {self.calculate_stats("io_disk_write_time")} milliseconds
               General disk read time: {self.calculate_stats("io_os_read_time")} milliseconds
              General disk write time: {self.calculate_stats("io_os_write_time")} milliseconds

             Memory------------------------------------------
                              Current: {self.calculate_stats("mem_current")} MB
                                 Peak: {self.calculate_stats("mem_peak")} MB

             """ 

In [9]:
@dataclass
class runtime_metrics:
    id: str

    # see @Profile
    runtime: float = field(default=0.0)

    # reference: https://docs.python.org/3/library/profile.html
    profile_data: cProfile.Profile = field(init=False)

    # reference: https://www.geeksforgeeks.org/how-to-get-file-size-in-python/
    file_size: float = field(
        default=0.0,
    )

    # milliseconds, reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_read_time: float = field(
        default=0.0,
    )

    # milliseconds, reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_write_time: float = field(
        default=0.0,
    )
    
    # number read operations[end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_read_count: float = field(
        default=0.0,
    )

    # number write operations[end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_write_count: float = field(
        default=0.0,
    )

    # megabytes read [end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_read_throughput: float = field(
        default=0.0,
    )

    # megabytes written [end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_write_throughput: float = field(
        default=0.0,
    )

    # number read operations[end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_os_read_count: float = field(
        default=0.0,
    )

    # number write operations[end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_os_write_count: float = field(
        default=0.0,
    )

    # milliseconds, reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_os_read_time: float = field(
        default=0.0,
    )

    # milliseconds, reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_os_write_time: float = field(
        default=0.0,
    )

    # calculated in MBs, reference: https://www.geeksforgeeks.org/monitoring-memory-usage-of-a-running-python-program/
    mem_current: float = field(
        default=0.0,
    )

    # calculated in MBs, reference: https://www.geeksforgeeks.org/monitoring-memory-usage-of-a-running-python-program/
    mem_peak: float = field(
        default=0.0,
    )

    def __str__(self):
        return f"""
                Id---------------------------------------------
                                      Id: {self.id}
                Runtime----------------------------------------
                                 Runtime: {self.runtime:,.2f} seconds

                I/O Size---------------------------------------
                               File Size: {self.file_size:,.2f} bytes

                I/O Counts-------------------------------------
                      Targeted disk read: {self.io_disk_read_count:,.2f} counts
                     Targeted disk write: {self.io_disk_write_count:,.2f} counts
                       General disk read: {self.io_os_read_count:,.2f} counts
                      General disk write: {self.io_os_write_count:,.2f} counts

                I/O Time---------------------------------------
                 Targeted disk read time: {self.io_disk_read_time:,.2f} milliseconds
                Targeted disk write time: {self.io_disk_write_time:,.2f} milliseconds
                  General disk read time: {self.io_os_read_time:,.2f} milliseconds
                 General disk write time: {self.io_os_write_time:,.2f} milliseconds

                Memory------------------------------------------
                                 Current: {self.mem_current:,.2f} MB
                                    Peak: {self.mem_peak:,.2f} MB

                """

    #def __repr__(self):
    #    return f'{self.__class__.__name__}(name={self.name!r}, unit_price={self.unit_price!r}, quantity={self.quantity_on_hand!r})'

    # TODO - CGW
    # def __post_init__(self):
    #    self.id = f'{self.phrase}_{self.word_type.name.lower()}'

    # worthy consideration - https://www.geeksforgeeks.org/psutil-module-in-python/
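
The two dataclasses above are related: `runtime_metrics` holds one run, while `aggregate_metrics` holds a list per field across many runs. The sketch below illustrates how per-run records might be folded into such lists and then summarized. `RunRecord`, `collect`, and `summarize` are hypothetical names with a trimmed-down field set, not the notebook's own classes, and the sample values are made up for illustration. The guard on sample count is there because `statistics.stdev` and `statistics.variance` raise `StatisticsError` when given fewer than two values.

```python
import statistics
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class RunRecord:
    """Hypothetical, trimmed-down stand-in for runtime_metrics."""
    id: str
    runtime: float = 0.0   # seconds
    mem_peak: float = 0.0  # MB


def collect(records: List[RunRecord]) -> Dict[str, List[float]]:
    """Gather every non-id field across runs into one list per field."""
    names = [f.name for f in fields(RunRecord) if f.name != "id"]
    return {name: [float(getattr(r, name)) for r in records] for name in names}


def summarize(samples: List[float]) -> Dict[str, float]:
    """Mean, median, mode, stdev, and variance, tolerating a single sample."""
    summary = {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "mode": statistics.mode(samples),
    }
    # stdev and variance are undefined for fewer than two samples
    enough = len(samples) >= 2
    summary["stdev"] = statistics.stdev(samples) if enough else float("nan")
    summary["variance"] = statistics.variance(samples) if enough else float("nan")
    return summary


# Illustrative values only
runs = [
    RunRecord("load_csv", 1.0, 10.0),
    RunRecord("load_csv", 2.0, 12.0),
    RunRecord("load_csv", 3.0, 14.0),
]
columns = collect(runs)
print(summarize(columns["runtime"]))
```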

## Profile

In [10]:
# Custom profiling decorator that tracks I/O, memory, and runtime.
# Reference: https://jiffyclub.github.io/snakeviz/
# Reference: https://www.machinelearningplus.com/python/cprofile-how-to-profile-your-python-code/
# Reference: https://cloud.google.com/stackdriver/docs/instrumentation/setup/python
# Reference: https://www.turing.com/kb/python-code-with-cprofile

import functools  # explicit import for functools.wraps, used below


def profile(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):

        # custom metrics values
        current_memories = 0.0
        peak_memories = 0.0
        current_metric = runtime_metrics(id=func.__name__)
        disk = "sdc"  # target block device; must be a key of psutil.disk_io_counters(perdisk=True)

        #####################################################################################################
        # - Cprofiler startup
        # Reference: https://www.google.com/search?client=firefox-b-1-d&q=python+example+use+of+cprofle+for+a+single+function#cobssid=s
        #####################################################################################################
        pr = cProfile.Profile()
        pr.enable()
        start_time = time.perf_counter()

        #####################################################################################################
        # - Memory tracking
        #  Reference: https://docs.python.org/3/library/tracemalloc.html
        #  Reference: https://www.kdnuggets.com/how-to-trace-memory-allocation-in-python
        #  Reference: https://www.geeksforgeeks.org/monitoring-memory-usage-of-a-running-python-program/
        #####################################################################################################
        tracemalloc.start()

        #####################################################################################################
        # - Disk tracking
        # Reference: https://stackoverflow.com/questions/16945664/insight-needed-into-python-psutil-output
        #####################################################################################################
        iocnt1 = psutil.disk_io_counters(perdisk=True)[disk]
        disk_io_counters1 = psutil.disk_io_counters()
        read_bytes_start = iocnt1.read_bytes
        write_bytes_start = iocnt1.write_bytes
        read_counters_start = iocnt1.read_count
        write_counters_start = iocnt1.write_count
        read_time_start = iocnt1.read_time
        write_time_start = iocnt1.write_time
        
        read_os_bytes_start = disk_io_counters1.read_bytes
        write_os_bytes_start = disk_io_counters1.write_bytes
        read_os_counters_start = disk_io_counters1.read_count
        write_os_counters_start = disk_io_counters1.write_count
        read_os_time_start = disk_io_counters1.read_time
        write_os_time_start = disk_io_counters1.write_time
        
        #####################################################################################################
        # - Actual function call
        #####################################################################################################
        result = func(*args, **kwargs)

        # disk close out
        # targeted I/O
        iocnt2 = psutil.disk_io_counters(perdisk=True)[disk]
        disk_io_counters2 = psutil.disk_io_counters()

        #targeted disk
        read_bytes_end = iocnt2.read_bytes
        write_bytes_end = iocnt2.write_bytes
        read_counters_end = iocnt2.read_count
        write_counters_end = iocnt2.write_count
        read_time_end = iocnt2.read_time
        write_time_end = iocnt2.write_time
        #general OS
        read_os_bytes_end = disk_io_counters2.read_bytes
        write_os_bytes_end = disk_io_counters2.write_bytes
        read_os_counters_end = disk_io_counters2.read_count
        write_os_counters_end = disk_io_counters2.write_count
        read_os_time_end = disk_io_counters2.read_time
        write_os_time_end = disk_io_counters2.write_time

        #targeted disk
        read_throughput = (read_bytes_end - read_bytes_start) / (1024 * 1024)  # MB transferred (a volume, not a per-second rate)
        write_throughput = (write_bytes_end - write_bytes_start) / (1024 * 1024)  # MB transferred (a volume, not a per-second rate)
        read_counters = (read_counters_end - read_counters_start)
        write_counters = (write_counters_end - write_counters_start)
        read_time =  (read_time_end - read_time_start)
        write_time = (write_time_end - write_time_start)
        current_metric.io_disk_read_throughput = read_throughput
        current_metric.io_disk_write_throughput = write_throughput
        current_metric.io_disk_read_count = read_counters
        current_metric.io_disk_write_count = write_counters
        current_metric.io_disk_read_time = read_time
        current_metric.io_disk_write_time = write_time

        #general OS
        read_os_throughput = (read_os_bytes_end - read_os_bytes_start) / (1024 * 1024)  # MB transferred (a volume, not a per-second rate)
        write_os_throughput = (write_os_bytes_end - write_os_bytes_start) / (1024 * 1024)  # MB transferred (a volume, not a per-second rate)
        read_os_counters = (read_os_counters_end - read_os_counters_start)
        write_os_counters = (write_os_counters_end - write_os_counters_start)
        read_os_time =  (read_os_time_end - read_os_time_start)
        write_os_time = (write_os_time_end - write_os_time_start)
        current_metric.io_os_read_throughput = read_os_throughput
        current_metric.io_os_write_throughput = write_os_throughput
        current_metric.io_os_read_count = read_os_counters
        current_metric.io_os_write_count = write_os_counters
        current_metric.io_os_read_time = read_os_time
        current_metric.io_os_write_time = write_os_time



        # memory close
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()        
        current_metric.mem_current = current / (1024 * 1024)
        current_metric.mem_peak = peak / (1024 * 1024)
        tracemalloc.clear_traces()


        # CProfiler disabled
        pr.disable()

        #can't pickle this result
        #current_metric.profile_data=pr
        
        # s = io.StringIO()
        # sortby = SortKey.CUMULATIVE
        # ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
        # ps.print_stats()
        # print(s.getvalue())
        end_time = time.perf_counter()
        # other characteristics
        current_metric.runtime = end_time - start_time

        timestamp = datetime.datetime.now()
        #timestamp_str = timestamp.strftime("%Y%m%d%H%M%S%f") # Format as string
        timestamp_str = timestamp.strftime("%Y%m%d%H%M%S") # Format as string
        unique_id = uuid.uuid4()
        filename = f"{output_directory}/{timestamp_str}_{unique_id}_{current_metric.id}_profiler.pkl"
        #print(filename)
        #print("##########################################")
        #print(current_metric)

        with open(filename, "wb") as file:
            pickle.dump(current_metric, file)

        return result

    return wrapper
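
The decorator above follows a start, call, stop, record pattern. A minimal standard-library sketch of that pattern is shown below. `SketchMetric` and `profile_sketch` are hypothetical names used only for illustration; they are not the notebook's `runtime_metrics` or `profile`, and the disk, OS, and cProfile counters are left out.

```python
import functools
import time
import tracemalloc
from dataclasses import dataclass


@dataclass
class SketchMetric:
    """Hypothetical stand-in for the notebook's metric dataclass."""
    id: str
    runtime: float = 0.0
    mem_current: float = 0.0  # MiB still allocated when the call returned
    mem_peak: float = 0.0     # MiB at the high-water mark during the call


def profile_sketch(func):
    """Wrap func and keep a SketchMetric for the latest call on wrapper.last_metric."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        start_time = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        finally:
            # Stop the clock first so teardown cost is not counted as runtime.
            end_time = time.perf_counter()
            current, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            wrapper.last_metric = SketchMetric(
                id=func.__name__,
                runtime=end_time - start_time,
                mem_current=current / (1024 * 1024),
                mem_peak=peak / (1024 * 1024),
            )
        return result

    wrapper.last_metric = None
    return wrapper


@profile_sketch
def build_list(n):
    return list(range(n))


values = build_list(100_000)
metric = build_list.last_metric
print(f"{metric.id}: {metric.runtime:.6f}s, peak {metric.mem_peak:.3f} MiB")
```

Reading the clock before the memory teardown keeps the bookkeeping out of the reported runtime, which is one design choice worth considering for the full decorator as well.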

#### Example Complex Function to Profile

In [11]:
@profile
def complex_function():
    # Define the size of the matrix
    matrix_size = 2048
    # Generate two random matrices
    matrix_a = np.random.rand(matrix_size, matrix_size)
    matrix_b = np.random.rand(matrix_size, matrix_size)
    result_matrix = np.matmul(matrix_a, matrix_b)
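    # np.savez writes a .npz archive and appends ".npz" when the name lacks it,
    # so the target is named with the .npz extension directly.
    np.savez(
        "./folderOnColab/data/local_test.npz",
        the_matrix=result_matrix,
    )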

## Function Declaration

#### Lib Diagnostics

In [12]:
def lib_diagnostics() -> None:

    import pkg_resources

    package_name_length = 20
    package_version_length = 10

    # Show notebook details
    #%watermark?
    #%watermark --github_username christophergwood --email christopher.g.wood@gmail.com --date --time --iso8601 --updated --python --conda --hostname --machine --githash --gitrepo --gitbranch --iversions --gpu
    # Watermark
    print(
        the_watermark(
            author=f"{AUTHOR_NAME}",
            github_username=f"{GITHUB_USERNAME}",
            email=f"{AUTHOR_EMAIL}",
            iso8601=True,
            datename=True,
            current_time=True,
            python=True,
            updated=True,
            hostname=True,
            machine=True,
            gitrepo=True,
            gitbranch=True,
            githash=True,
        )
    )

    print(f"{BOLD_START}Packages:{BOLD_END}")
    print("")
    # Get installed packages
    the_packages = [
        "nltk",
        "numpy",
        "os",
        "pandas",
        "keras",
        "seaborn",
        "fastparquet",
        "zarr",
        "dask",
        "pystac",
        "polars",
        "xarray",
    ]  # Functions are like legos that do one thing, this function outputs library version history of effort.

    installed = {pkg.key: pkg.version for pkg in pkg_resources.working_set}
    for package_idx, package_name in enumerate(installed):
        if package_name in the_packages:
            installed_version = installed[package_name]
            print(
                f"{package_name:<40}#: {str(pkg_resources.parse_version(installed_version)):<20}"
            )

    try:
        print(f"{'TensorFlow version':<40}#: {str(tf.__version__):<20}")
        print(
            f"{'     gpu.count:':<40}#: {str(len(tf.config.experimental.list_physical_devices('GPU')))}"
        )
        print(
            f"{'     cpu.count:':<40}#: {str(len(tf.config.experimental.list_physical_devices('CPU')))}"
        )
    except Exception as e:
        pass

    try:
        print(f"{'Torch version':<40}#: {str(torch.__version__):<20}")
        if torch.cuda.is_available():
            device = torch.device("cuda")
            print(f"{'     GPUs available?':<40}#: {torch.cuda.is_available()}")
            print(f"{'     count':<40}#: {torch.cuda.device_count()}")
            print(f"{'     current':<40}#: {torch.cuda.get_device_name(0)}")
        else:
            device = torch.device("cpu")
            print("No GPU available, using CPU.")
    except Exception as e:
        pass

    try:
        print(f"{'OpenAI Azure Version':<40}#: {str(the_openai_version):<20}")
    except Exception as e:
        pass

    return
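
`pkg_resources` is deprecated in recent setuptools releases, so the version lookup above could also be done with the standard library. The sketch below assumes Python 3.8 or later; `installed_versions` is a hypothetical helper name, not part of the notebook.

```python
from importlib import metadata


def installed_versions(package_names):
    """Return {name: version string or None} for each requested distribution."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed, or not a distribution at all (stdlib modules such as "os").
            versions[name] = None
    return versions


report = installed_versions(["numpy", "pandas", "surely-not-a-real-package-xyz"])
for name, version in report.items():
    print(f"{name:<40}#: {version if version else 'not installed':<20}")
```

Unlike the loop over `pkg_resources.working_set`, this version reports requested packages that are missing instead of silently skipping them.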

#### Section 508 Compliance Tools

In [13]:
# Routines designed to support adding ALT text to an image generated through Matplotlib.


def capture(figure):
    buffer = io.BytesIO()
    # Save explicitly as PNG so the encoded bytes match the MIME type in the data URI.
    figure.savefig(buffer, format="png")
    return f"data:image/png;base64,{base64.b64encode(buffer.getvalue()).decode()}"


def make_accessible(figure, template, **kwargs):
    return display.Markdown(
        f"""![]({capture(figure)} "{template.render(**globals(), **kwargs)}")"""
    )


# requires JPG's or TIFFs
def add_alt_text(image_path, alt_text):
    try:
        if os.path.isfile(image_path):
            img = PIL_Image.open(image_path)
            if "exif" in img.info:
                exif_dict = piexif.load(img.info["exif"])
            else:
                exif_dict = {}

            w, h = img.size
            if "0th" not in exif_dict:
                exif_dict["0th"] = {}
            exif_dict["0th"][piexif.ImageIFD.XResolution] = (w, 1)
            exif_dict["0th"][piexif.ImageIFD.YResolution] = (h, 1)

            software_version = " ".join(
                ["STEM-001 with Python v", str(sys.version).split(" ")[0]]
            )
            exif_dict["0th"][piexif.ImageIFD.Software] = software_version.encode(
                "utf-8"
            )

            if "Exif" not in exif_dict:
                exif_dict["Exif"] = {}
            exif_dict["Exif"][piexif.ExifIFD.UserComment] = UserComment.dump(
                alt_text, encoding="unicode"
            )

            exif_bytes = piexif.dump(exif_dict)
            img.save(image_path, "jpeg", exif=exif_bytes)
        else:
            rprint(
                f"Could not find {image_path} for ALT text modification, please check your paths."
            )

    except Exception as e:
        process_exception(e)


# Appears to solve a problem associated with GPU use on Colab, see: https://github.com/explosion/spaCy/issues/11909
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

#### Library Configuration

In [14]:
def set_library_configuration() -> None:

    ############################################
    # - JUPYTER NOTEBOOK OUTPUT CONTROL / FORMATTING
    ############################################
    # pandas: show floats to 4 decimal places so printed values stay readable
    debug.msg_info("Setting Pandas and Numpy library options.")
    pd.set_option(
        "display.max_colwidth", 10
    )  # set to None to view the full JSON blob in the printed dataframe
    pd.options.display.float_format = "{:,.4f}".format
    np.set_printoptions(precision=4)

#### Custom Exception Display

In [15]:
# this function displays the stack trace on errors from a central location making adjustments to the display on an error easier to manage
# functions perform useful solutions for highly repetitive code
def process_exception(inc_exception: Exception) -> None:
    if DEBUG_STACKTRACE == 1:
        traceback.print_exc()
        console.print_exception(show_locals=True)
    else:
        rprint(repr(inc_exception))

#### Quick Stats for a DataFrame

In [16]:
def quick_df_stats(
    inc_df: pd.DataFrame,
    inc_header_count: int,
) -> None:
    """
    Print a quick structural summary of a pd.DataFrame and validate its column count.

            Parameters:
                   inc_df (pd.DataFrame): Dataframe to be inspected, displayed
                   inc_header_count (int): Expected number of columns (validation check)

            Returns:
                    None (the summary is printed)
    """
    print("Data Resolution has: " + str(inc_df.columns))
    print("\n")
    print(f"""{"size":20} : {inc_df.size:15,} """)
    print(f"""{"shape":20} : {str(inc_df.shape):15} """)
    print(f"""{"ndim":20} : {inc_df.ndim:15,} """)
    print(f"""{"column size":20} : {inc_df.columns.size:15,} """)

    # index added so you get an extra column
    print(f"""{"Read":20} : {inc_df.columns.size:15,} """)
    print(f"""{"Expected":20} : {inc_header_count:15,} """)
    if (inc_df.columns.size) == inc_header_count:
        print(f"{BOLD_START}Expectations met{BOLD_END}.")
    else:
        print(
            f"Expectations {BOLD_START}not met{BOLD_END}, check your datafile, columns don't match."
        )
    rprint("\n")
    # rprint(str(inc_df.describe()))

#### Check your resources from a CPU/GPU perspective

In [17]:
def get_hardware_stats() -> None:
    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    print(
        f"{BOLD_START}List Devices{BOLD_END} #########################################"
    )
    try:
        from tensorflow.python.client import device_lib

        rprint(device_lib.list_local_devices())
        print("")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        rprint(str(repr(e)))

    print(
        f"{BOLD_START}Devices Counts{BOLD_END} ########################################"
    )
    try:
        rprint(
            f"Num GPUs Available: {str(len(tf.config.experimental.list_physical_devices('GPU')))}"
        )
        rprint(
            f"Num CPUs Available: {str(len(tf.config.experimental.list_physical_devices('CPU')))}"
        )
        print("")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        rprint(str(repr(e)))

    print(
        f"{BOLD_START}Optional Enablement{BOLD_END} ####################################"
    )
    gpus = []  # stays empty if the device query below fails
    try:
        gpus = tf.config.experimental.list_physical_devices("GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        rprint(str(repr(e)))

    if gpus:
        # Restrict TensorFlow to only use the first GPU
        try:
            tf.config.experimental.set_visible_devices(gpus[0], "GPU")
            logical_gpus = tf.config.experimental.list_logical_devices("GPU")
            rprint(
                str(
                    str(len(gpus))
                    + " Physical GPUs,"
                    + str(len(logical_gpus))
                    + " Logical GPU"
                )
            )
        except RuntimeError as e:
            # Visible devices must be set before GPUs have been initialized
            rprint(str(repr(e)))
        print("")
    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")

#### Clean House

In [18]:
def clean_house() -> None:
    # rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    gc.collect()

    # could leave the GPU unstable so holding off.
    # torch.cuda.empty_cache()
    # rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

#### Remove Files

In [19]:
def nuke_file(target_filename: str) -> None:
    # rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    if os.path.isfile(target_filename):
        try:
            # remove the existing file; otherwise later writes would append to it
            os.remove(target_filename)
        except OSError as e:
            process_exception(e)
            raise SystemError(f"Unable to remove {target_filename}") from e
    # rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

## Input Sources

#### Read Profiles

In [20]:
def read_profiles(inc_source_filenames: []) -> []:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    the_list = []
    failed_read = []

    rprint(f"...reading pickled profile data from list of {len(inc_source_filenames)} files:")
    for target_filename in inc_source_filenames:
        try:
            rprint(f"......reading profile ({target_filename})")
            #the_netcdf = Dataset(target_filename, "r", format="NETCDF4")
            with (open(target_filename, "rb")) as openfile:
                the_list.append(pickle.load(openfile))
        except Exception as e:
            process_exception(e)
            print(f"...ERROR, investigate this failed read.")
            failed_read.append(target_filename)

    print(f"......{len(the_list)}  of {len(inc_source_filenames)} files successfully read in.")
    print(f"......{len(failed_read)}  of {len(inc_source_filenames)} files failed to read in.")
    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")
    return the_list
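
The read loop above keeps going when a single file is unreadable. A small standalone sketch of that collect-and-continue pattern follows; `read_pickles` is a hypothetical helper, and plain dictionaries stand in for the metric objects. For the real files, the dataclass definitions must be importable at load time, and `pickle.load` should only be used on files you trust.

```python
import os
import pickle
import tempfile


def read_pickles(paths):
    """Return (loaded_objects, failed_paths) without stopping at the first bad file."""
    loaded, failed = [], []
    for path in paths:
        try:
            with open(path, "rb") as handle:
                loaded.append(pickle.load(handle))
        except (OSError, pickle.UnpicklingError, EOFError):
            failed.append(path)
    return loaded, failed


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good_profiler.pkl")
    bad = os.path.join(tmp, "empty_profiler.pkl")
    with open(good, "wb") as handle:
        pickle.dump({"id": "write_demo", "runtime": 0.5}, handle)
    open(bad, "wb").close()  # an empty file cannot be unpickled

    profiles, failures = read_pickles([good, bad])

print(f"{len(profiles)} read, {len(failures)} failed")
```

Returning the failed paths alongside the loaded objects lets the caller decide whether a partial read is acceptable.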

#### Gather Variables

In [21]:
def gather_variables(inc_product_name:str, inc_file_list:[], inc_netcdfs:[]) -> {}:

        rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

        geospatial_lat_nm=LAT_SNAME
        geospatial_lon_nm=LONG_SNAME
        product_nm=inc_product_name
        local_payload={}

        #priming array setup
        if DEBUG_USING_GPU==1:
            master_lat   =cupy.array(inc_netcdfs[0].variables[geospatial_lat_nm][:][:].flatten(),dtype=cupy.double)
            master_lon   =cupy.array(inc_netcdfs[0].variables[geospatial_lon_nm][:][:].flatten(),dtype=cupy.double)
            #varAry=cupy.array(the_netcdf.variables[product_nm][:][:],dtype=cupy.double)
            master_varAry=cupy.array(inc_netcdfs[0].variables[product_nm][:][:][:][:].flatten(),dtype=cupy.double)
        else:
            master_lat   =np.array(inc_netcdfs[0].variables[geospatial_lat_nm][:][:].flatten(),dtype=np.double)
            master_lon   =np.array(inc_netcdfs[0].variables[geospatial_lon_nm][:][:].flatten(),dtype=np.double)
            #varAry=np.array(the_netcdf.variables[product_nm][:][:],dtype=np.double)
            master_varAry=np.array(inc_netcdfs[0].variables[product_nm][:][:][:][:].flatten(),dtype=np.double)


        print("...pulling data for and stacking it to the core data structure:")

        #skip the first record since it's already captured
        for idx, the_netcdf in enumerate(inc_netcdfs):
            if idx==0:
                continue

            print(f"......stacking {inc_file_list[idx]}")

            #note x,y values are shown below as they are part of the APS meta-data
            #based on the NetCDF Best Practice subject x,y vars should not exist.
            #keeping for continuity between BIOCAST code already written to read APS input.
            #single variable exercise

            if DEBUG_USING_GPU==1:
                lat   =cupy.array(the_netcdf.variables[geospatial_lat_nm][:][:].flatten(),dtype=cupy.double)
                lon   =cupy.array(the_netcdf.variables[geospatial_lon_nm][:][:].flatten(),dtype=cupy.double)
                #varAry=cupy.array(the_netcdf.variables[product_nm][:][:],dtype=cupy.double)
                varAry=cupy.array(the_netcdf.variables[product_nm][:][:][:][:].flatten(),dtype=cupy.double)
            else:
                lat   =np.array(the_netcdf.variables[geospatial_lat_nm][:][:].flatten(),dtype=np.double)
                lon   =np.array(the_netcdf.variables[geospatial_lon_nm][:][:].flatten(),dtype=np.double)
                #varAry=np.array(the_netcdf.variables[product_nm][:][:],dtype=np.double)
                varAry=np.array(the_netcdf.variables[product_nm][:][:][:][:].flatten(),dtype=np.double)
                
            print(f"...{BOLD_START}{geospatial_lat_nm:10}{" data type:":20}{BOLD_END}{str(type(lat)):20}")
            print(f".......shape:{lat.shape}")
            print(f"....datatype:{lat.dtype}")
    
            print(f"...{BOLD_START}{geospatial_lon_nm:10}{" data type:":20}{BOLD_END}{str(type(lon)):20}")
            print(f".......shape:{lon.shape}")
            print(f"....datatype:{lon.dtype}")
    
            print(f"...{BOLD_START}{"Data type":20}({product_nm:20}){BOLD_END}{str(type(varAry)):20}")
            print(f".......shape:{varAry.shape}")
            print(f"....datatype:{varAry.dtype}")

            # np.stack requires equal shapes, so it fails from the third file onward
            # (master is already 2-D by then); np.vstack appends each flattened
            # array as one new row instead.
            master_lat = np.vstack([master_lat, lat])
            master_lon = np.vstack([master_lon, lon])
            master_varAry = np.vstack([master_varAry, varAry])
            
        local_payload[LAT_LNAME]=master_lat
        local_payload[LONG_LNAME]=master_lon
        local_payload[PRODUCT_LNAME]=master_varAry
    
        print(f"...{BOLD_START}Master {geospatial_lat_nm:10}{" data type:":20}{BOLD_END}{str(type(master_lat)):20}")
        print(f".......shape:{master_lat.shape}")
        print(f"....datatype:{master_lat.dtype}")

        print(f"...{BOLD_START}Master {geospatial_lon_nm:10}{" data type:":20}{BOLD_END}{str(type(master_lon)):20}")
        print(f".......shape:{master_lon.shape}")
        print(f"....datatype:{master_lon.dtype}")

        print(f"...{BOLD_START}Master {"Data type":20}({product_nm:20}){BOLD_END}{str(type(master_varAry)):20}")
        print(f".......shape:{master_varAry.shape}")
        print(f"....datatype:{master_varAry.dtype}")

        rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")
    
        return local_payload

#### Analysis

In [22]:
#def add_dataclass_values(a: aggregate_metrics, b: runtime_metrics) -> aggregate_metrics:
def add_dataclass_values(a: aggregate_metrics, b: runtime_metrics):
    """Modified to append to an array. """
    field_metadata = a.__dataclass_fields__
    #for field in fields(a):
    for field_name in field_metadata:
        if field_name != "id":
            field_list = getattr(a, field_name)
            new_value  = getattr(b, field_name)
            field_list.append(float(new_value))
            #setattr(a, field_name, field_list)
            #print(f"{field_list} - {new_value}")
    return a

In [23]:
def stats_dataclass_values(a: aggregate_metrics, b: runtime_metrics) -> runtime_metrics:
    """Store the mean of each list collected in `a` on the matching field of `b`."""
    for field in fields(a):
        if field.name not in ("profile_data", "id"):
            setattr(b, field.name, statistics.mean(getattr(a, field.name)))
    return b
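
The aggregation step collects one list of samples per metric and then reduces each list to summary statistics. The sketch below shows that reduction with the standard library. `SketchAggregate` and `summarize` are hypothetical names, and the choice of mean, median, and sample standard deviation is an assumption for illustration rather than the notebook's exact set.

```python
import statistics
from dataclasses import dataclass, field, fields


@dataclass
class SketchAggregate:
    """Hypothetical aggregate: one list of samples per metric."""
    id: str
    runtime: list = field(default_factory=list)
    mem_peak: list = field(default_factory=list)


def summarize(aggregate):
    """Return {field_name: {"mean", "median", "stdev"}} for every non-empty list field."""
    summary = {}
    for f in fields(aggregate):
        if f.name == "id":
            continue
        samples = getattr(aggregate, f.name)
        if not samples:
            continue  # statistics.mean raises StatisticsError on empty input
        summary[f.name] = {
            "mean": statistics.mean(samples),
            "median": statistics.median(samples),
            # sample standard deviation needs at least two points
            "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        }
    return summary


agg = SketchAggregate(id="write_demo")
for runtime, peak in [(1.0, 10.0), (2.0, 10.0), (3.0, 16.0)]:
    agg.runtime.append(runtime)
    agg.mem_peak.append(peak)

result = summarize(agg)
print(result["runtime"])
```

Comparing `f.name == "id"` exactly avoids the substring pitfall of `name not in "id"`, which would also skip a field called `i` or `d`.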

In [24]:
def analyze_outputs(inc_files:[], inc_pattern:str) -> aggregate_metrics:

    # keep only the files whose name contains the requested pattern
    current_pattern=[]
    for filename in inc_files:
        # re.escape treats the pattern as literal text rather than as a regex
        match = re.search(re.escape(inc_pattern), filename)
        if match:
            current_pattern.append(filename)

    dataset_aggregated=aggregate_metrics(id=inc_pattern)
    for profiler_data in current_pattern:
        with open(profiler_data, 'rb') as file:
            single_profile = pickle.load(file)
            # add_dataclass_values appends in place and returns the same object
            dataset_aggregated=add_dataclass_values(dataset_aggregated, single_profile)

    #return stats_dataclass_values(dataset_aggregated)
    return dataset_aggregated

#### Pattern Capture

In [25]:
def get_unique_patterns(inc_files:[]) -> []:
    # filenames look like <timestamp>_<uuid>_<id>_profiler.pkl; the id may itself contain "_"
    delimiter="_"
    split_index=2
    patterns=set()
    for filename in inc_files:
        # use the basename so underscores in directory names cannot shift the split
        filename_pattern=os.path.basename(filename).split(delimiter)[split_index:-1]
        patterns.add(delimiter.join(filename_pattern))

    return list(patterns)
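
To make the split concrete, the sketch below applies the same slicing to a few file names. The paths are hypothetical and follow the `<timestamp>_<uuid>_<id>_profiler.pkl` scheme used when the metrics are pickled; `pattern_from_filename` is an illustrative helper, and it takes the basename first so an underscore in a directory name does not shift the indices.

```python
import os


def pattern_from_filename(path, delimiter="_", split_index=2):
    """Return the id portion of <timestamp>_<uuid>_<id>_profiler.pkl."""
    parts = os.path.basename(path).split(delimiter)
    # Drop timestamp and uuid at the front and "profiler.pkl" at the end.
    return delimiter.join(parts[split_index:-1])


# Hypothetical example paths; note the underscore in the directory name.
example_paths = [
    "./out_dir/20250602174611_1b4e28ba-2fa1-11d2-883f-0016d3cca427_write_pytorch_profiler.pkl",
    "./out_dir/20250602174612_6fa459ea-ee8a-3ca4-894e-db77e160355e_write_pandas_profiler.pkl",
    "./out_dir/20250602174613_0f8fad5b-d9cb-469f-a165-70867728950e_write_pandas_profiler.pkl",
]

unique_patterns = sorted({pattern_from_filename(p) for p in example_paths})
print(unique_patterns)  # ['write_pandas', 'write_pytorch']
```

The UUID is safe to split around because its canonical text form uses hyphens, never underscores.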

## Process

In [26]:
## Main routine: scans the input directory for pickled profiles, aggregates them per pattern, and prints the results.
#
#  @param inc_input_directory (str) directory holding the pickled profiler output
def process(inc_input_directory: str, ) -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

    # identify target files
    for idx, value in enumerate(["write", "read"]):
        #iterate through each data profile saved and gather metrics
        #create a list of unique profiles (per type) and process them
        source_filenames_list = []
        unique_patterns = []
        stats = []
        print(f"...marshaling {value} data files:")
        target_directory = f"{inc_input_directory}{os.sep}"
        if os.path.isdir(target_directory):
            for file in os.listdir(target_directory):
                filename, file_extension = os.path.splitext(file)
                if OUTPUT_PICKLE_EXT.lower() in file_extension.lower():
                    if filename.find(f"_{value}_") > -1:
                        source_filenames_list.append(os.path.join(target_directory, file))
        else:
            print(
                f"Target directory ({target_directory}) does not exist, cannot continue execution.  Check your paths."
            )
            raise SystemError
    
        source_filenames_list = sorted(source_filenames_list)
        print(f"Found {len(source_filenames_list)} potential target files.")
        unique_patterns = get_unique_patterns(source_filenames_list)
        print("    Patterns found are:")
        for pattern in unique_patterns:
            #print(f"    ...{pattern}")
            stats.append(analyze_outputs(source_filenames_list, pattern))
            #print("    ##########################################################################################")
            #print(f"    {stats[-1]}")
            #print("    ##########################################################################################")
            print(f"{stats[-1]}")
            #print("")
            #print("")

    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")
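
`process` calls `analyze_outputs` once per pattern, and each call rescans the whole file list with a substring search. With tens of thousands of files (the run recorded below reports 79942), one alternative is to group the files by their exact pattern in a single pass. This is a sketch under the same file naming assumption as above; `group_by_pattern` is a hypothetical helper, not part of the notebook.

```python
import os
from collections import defaultdict


def group_by_pattern(paths, delimiter="_", split_index=2):
    """Map each exact pattern to its files in one pass over the file list."""
    groups = defaultdict(list)
    for path in paths:
        parts = os.path.basename(path).split(delimiter)
        groups[delimiter.join(parts[split_index:-1])].append(path)
    return dict(groups)


# Hypothetical file names; the short uuid placeholders are for readability only.
paths = [
    "./demo/20250602174611_aaaa_write_pandas_profiler.pkl",
    "./demo/20250602174612_bbbb_write_pandas_profiler.pkl",
    "./demo/20250602174613_cccc_write_pandas_gzip_profiler.pkl",
]

groups = group_by_pattern(paths)
for pattern, files in sorted(groups.items()):
    print(f"{pattern}: {len(files)} file(s)")
```

Exact grouping also keeps `write_pandas` separate from `write_pandas_gzip`, whereas a substring search for `write_pandas` would match all three files.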

# Main Routine (call all other routines)

In [27]:
if __name__ == "__main__":

    # note that this design now deviates from previous methods.
    # Implementation will assume a single execution of a single PIID folder, scanning results and
    # appending metrics to a single ASCII file as the code proceeds thus ensuring multi-processor, *nix driven execution.

    start_t = perf_counter()
    print("BEGIN PROGRAM")

    ############################################
    # CONSTANTS
    ############################################

    # Semantic Versioning
    VERSION_NAME = "MLDATAREADY_ANALYSIS"
    VERSION_MAJOR = 0
    VERSION_MINOR = 0
    VERSION_RELEASE = 2

    DATA_VERSION_RELEASE = "-".join(
        [
            str(VERSION_NAME),
            str(VERSION_MAJOR),
            str(VERSION_MINOR),
            str(VERSION_RELEASE),
        ]
    )

    # OUTPUT EXTENSIONS
    OUTPUT_PICKLE_EXT = "pkl"
    OUTPUT_PANDAS_EXT = "pkl"
    OUTPUT_NUMPY_EXT = "npy"
    OUTPUT_TORCH_EXT = "pt"
    OUTPUT_XARRAY_EXT = "xr"
    OUTPUT_ZARR_EXT = "zarr"
    OUTPUT_PARQUET_EXT = "parquet"
    OUTPUT_TENSORFLOW_EXT = "tf"
    OUTPUT_PYSTAC_EXT = "psc"
    OUTPUT_DASK_EXT = "dask"
    # location of our working files
    # WORKING_FOLDER="/content/folderOnColab"
    WORKING_FOLDER = "./folderOnColab/ANALYSIS3"
    input_directory = "./folderOnColab/ANALYSIS3"
    output_directory = "./folderOnColab/ANALYSIS3"

    # Notebook Author details
    AUTHOR_NAME = "Christopher G Wood"
    GITHUB_USERNAME = "christophergarthwood"
    AUTHOR_EMAIL = "christopher.g.wood@gmail.com"

    # GEOSPATIAL NAMES
    LAT_LNAME = "latitude"
    LAT_SNAME = "lat"
    LONG_LNAME = "longitude"
    LONG_SNAME = "lon"
    #PRODUCT_LNAME = "chlor_a"
    #PRODUCT_SNAME = "chlor_a"
    PRODUCT_LNAME = "cld_amt"
    PRODUCT_SNAME = "cld_amt"

    # PRODUCT_LNAME="salinity"
    # PRODUCT_SNAME="salinity"

    # Encoding
    ENCODING = "utf-8"
    os.environ["PYTHONIOENCODING"] = ENCODING

    BOLD_START = "\033[1m"
    BOLD_END = "\033[0;0m"
    TEXT_WIDTH = 77

    # You can also adjust the verbosity by changing the value of TF_CPP_MIN_LOG_LEVEL:
    #
    # 0 = all messages are logged (default behavior)
    # 1 = INFO messages are not printed
    # 2 = INFO and WARNING messages are not printed
    # 3 = INFO, WARNING, and ERROR messages are not printed
    TF_CPP_MIN_LOG_LEVEL_SETTING = 0

    # Set the Seed for the experiment (ask me why?)
    # seed the pseudorandom number generator
    # THIS IS ESSENTIAL FOR CONSISTENT MODEL OUTPUT, remember these are random in nature.
    # SEED_INIT = 7
    # random.seed(SEED_INIT)
    # tf.random.set_seed(SEED_INIT)
    # np.random.seed(SEED_INIT)

    DEBUG_STACKTRACE = 0
    DEBUG_USING_GPU = 0   #no gpu utilization on 0, 1 is gpu utilization
    NUM_PROCESSORS = 10
    ITERATIONS = 20

    # lower-case the extensions so file extension comparisons are case-insensitive
    EXTENSIONS = [".nc"]
    LOWER_EXTENSIONS = [x.lower() for x in EXTENSIONS]

    THE_DEVICE_NAME = "/job:localhost/replica:0/task:0/device:CPU:0"
    if DEBUG_USING_GPU == 1:
        THE_DEVICE_NAME = "/job:localhost/replica:0/task:0/device:GPU:0"

    warnings.filterwarnings("ignore", category=DeprecationWarning)
    warnings.filterwarnings("ignore", category=FutureWarning)
    warnings.filterwarnings("ignore", category=UserWarning)

    # GPU Setup (for multiple GPU devices); current_device() raises when CUDA is unavailable
    device = torch.cuda.current_device() if torch.cuda.is_available() else None

    # software watermark
    lib_diagnostics()

    # hardware specs
    get_hardware_stats()

    # - Core workhorse routine
    process(input_directory)

    # - Save the results
    # save_output()

    end_t = perf_counter()
    print("END PROGRAM")
    print(f"Elapsed time: {end_t - start_t}")

BEGIN PROGRAM
Author: Christopher G Wood

Github username: GITHUB_USERNAME

Email: christopher.g.wood@gmail.com

Last updated: 2025-06-02T17:46:11.902608-05:00

Python implementation: CPython
Python version       : 3.12.9
IPython version      : 8.30.0

Compiler    : GCC 13.3.0
OS          : Linux
Release     : 6.6.87.1-microsoft-standard-WSL2
Machine     : x86_64
Processor   : x86_64
CPU cores   : 16
Architecture: 64bit

Hostname: ThulsaDoom

Git hash: 50e42c531656f1a386cf2bc2f2699d3a839a9aec

Git repo: git@github.com:christophergarthwood/jbooks.git

Git branch: Updates

Packages:

dask                                    #: 2024.12.1           
fastparquet                             #: 2024.11.0           
keras                                   #: 3.9.0               
numpy                                   #: 1.26.4              
pandas                                  #: 2.2.3               
polars                                  #: 1.24.0              
pystac           

List Devices #########################################


I0000 00:00:1748904371.942930    3884 service.cc:148] XLA service 0x5616fc312950 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1748904371.942979    3884 service.cc:156]   StreamExecutor device (0): Host, Default Version



Devices Counts ########################################


I0000 00:00:1748904372.102127    3884 service.cc:148] XLA service 0x5616fc3147d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1748904372.102152    3884 service.cc:156]   StreamExecutor device (0): NVIDIA GeForce RTX 2060, Compute Capability 7.5
I0000 00:00:1748904372.119175    3884 gpu_device.cc:2022] Created device /device:GPU:0 with 4056 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5



Optional Enablement ####################################


I0000 00:00:1748904372.129503    3884 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 4056 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5





...marshaling write data files:
Found 79942 potential target files.
    Patterns found are:

             write_pytorch^[0.8681899197108153, 0.8575471450003533, 0.8440515000002051, 0.051048647738937226, 0.0026059644359741007]^[0.0001, 0.0, 0.0, 0.01, 0.0001]^[0.0001, 0.0, 0.0, 0.01, 0.0001]^[0.0003, 0.0, 0.0, 0.022359785240462397, 0.0004999599959996]^[0.0003, 0.0, 0.0, 0.022359785240462397, 0.0004999599959996]^[0.0001, 0.0, 0.0, 0.01, 0.0001]^[0.0, 0.0, 0.0, 0.0, 0.0]^[0.0002, 0.0, 0.0, 0.014141428428549924, 0.0001999799979998]^[0.0, 0.0, 0.0, 0.0, 0.0]^[0.048762766361236574, 0.04836273193359375, 0.04840850830078125, 0.0015144182078616452, 2.2934625083028775e-06]^[0.061965049076080324, 0.06151103973388672, 0.06151008605957031, 0.0014616330761963992, 2.136371249431349e-06] 
             

             write_pandas^[0.5497218169279229, 0.5437217500002589, 0.5422853960044449, 0.025237098994186895, 0.0006369111656423892]^[0.0, 0.0, 0.0, 0.0, 0.0]^[0.0, 0.0, 0.0, 0.0, 0.0]^[0.0, 0.0, 0.0, 0

END PROGRAM
Elapsed time: 44.52518039700226
