# Artificial Intelligence
## AI Ready Data - 006
### Download, curate, and process weather and tree data.

<center>
<table align="center">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/christophergarthwood/jbooks/blob/main/STEM-006_AIReadyData.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/notebooks?referrer=search&hl=en&project=usfs-ai-bootcamp">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Link to Colab Enterprise
    </a>
  </td>   
  <td style="text-align: center">
    <a href="https://github.com/christophergarthwood/jbooks/blob/main/STEM-006_AIReadyData.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/instances?referrer=search&hl=en&project=usfs-ai-bootcamp">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Link to Vertex AI Workbench
    </a>
  </td>
</table>
</center>
</br></br></br>

| | |
|-|-|
|Author(s) | [Christopher G Wood](https://github.com/christophergarthwood)  |

# Overview

Using various data sources we will download, review, and package data in various formats exploring the options for "AI Ready" data and what that means.

## What is a "**AI Ready**" Data?


## References:

#### Data Formats
+ [List of ML File Formats](https://github.com/trailofbits/ml-file-formats)
+ [ML Guide to Data Formats](https://www.hopsworks.ai/post/guide-to-file-formats-for-machine-learning)
+ [Why are ML Data Structures Different?](https://stackoverflow.blog/2023/01/04/getting-your-data-in-shape-for-machine-learning/)

#### FAIR
+ [FAIR and AI-Ready](https://repository.niddk.nih.gov/public/NIDDKCR_Office_Hours_AI-Readiness_and_Preparing_AI-Ready_+Datasets_12_2023.pdf)
+ [AI-Ready-Data](https://www.rishabhsoft.com/blog/ai-ready-data)
+ [AI-Ready FAIR Data](https://medium.com/@sean_hill/ai-ready-fair-data-accelerating-science-through-responsible-ai-and-data-stewardship-3b4f21c804fd)
+ [AI-Ready Data ... Quality](https://www.elucidata.io/blog/building-ai-ready-data-why-quality-matters-more-than-quantity)
+ [AI-Ready Data Explained](https://acodis.io/hubfs/pdfs/AI-ready%20data%20Explained%20Whitepaper%20(1).pdf)

+ [GCP with BigQuery DataFrames](https://cloud.google.com/blog/products/data-analytics/building-aiml-apps-in-python-with-bigquery-dataframes)

#### Format Libraries / Standards
+ [Earth Science Information partners (ESIP)](https://www.esipfed.org/checklist-ai-ready-data/)
+ [Zarr - Storage of N-dimensional arrays (tensors)](https://zarr.dev/#description)
  + [Zarr explained](https://aijobs.net/insights/zarr-explained/)
+ [Apache Parquet](https://parquet.apache.org/)
  + [All about Parquet](https://medium.com/data-engineering-with-dremio/all-about-parquet-part-01-an-introduction-b62a5bcf70f8)
+ [PySTAC - SpatioTemporal Asset Catalogs](https://pystac.readthedocs.io/en/stable/)
  + [John Hogland's Spatial Modeling Tutorials](https://github.com/jshogland/SpatialModelingTutorials/blob/main/README.md)

In [1]:
# Let's define some variables (information holders) for our project overall

global PROJECT_ID, BUCKET_NAME, LOCATION
BUCKET_NAME = "cio-training-vertex-colab"
PROJECT_ID = "usfs-ai-bootcamp"
LOCATION = "us-central1"

BOLD_START = "\033[1m"
BOLD_END = "\033[0m"

In [2]:
# Now create a means of enforcing project id selection

import ipywidgets as widgets
from IPython.display import display


def wait_for_button_press():

    button_pressed = False

    # Create widgets
    html_widget = widgets.HTML(
        value="""
        <center><table><tr><td><h1 style="font-family: Roboto;font-size: 24px"><b>&#128721; &#9888;&#65039; WARNING &#9888;&#65039;	&#128721; </b></h1></td></tr></table</center></br></br>

        <table><tr><td>
            <span style="font-family: Tahoma;font-size: 18">
              This notebook was designed to work in Jupyter Notebook or Google Colab with the understnading that certain permissions might be enabled.</br>
              Please verify that you are in the appropriate project and that the:</br>
              <center><code><b>PROJECT_ID</b></code> </br></center>
              aligns with the Project Id in the upper left corner of this browser and that the location:
              <center><code><b>LOCATION</b></code> </br></center>
              aligns with the instructions provided.
            </span>
          </td></tr></table></br></br>

    """
    )

    project_list = [
        "usfs-ai-bootcamp",
        "usfa-ai-advanced-training",
        "I will setup my own",
    ]
    dropdown = widgets.Dropdown(
        options=project_list,
        value=project_list[0],
        description="Set Your Project:",
    )

    html_widget2 = widgets.HTML(
        value="""
        <center><table><tr><td><h1 style="font-family: Roboto;font-size: 24px"><b>&#128721; &#9888;&#65039; WARNING &#9888;&#65039;	&#128721; </b></h1></td></tr></table</center></br></br>
          """
    )

    button = widgets.Button(description="Accept")

    # Function to handle the selection change
    def on_change(change):
        global PROJECT_ID
        if change["type"] == "change" and change["name"] == "value":
            # print("Selected option:", change['new'])
            PROJECT_ID = change["new"]

    # Observe the dropdown for changes
    dropdown.observe(on_change)

    def on_button_click(b):
        nonlocal button_pressed
        global PROJECT_ID
        button_pressed = True
        # button.disabled = True
        button.close()  # Remove the button from display
        with output:
            # print(f"Button pressed...continuing")
            # print(f"Selected option: {dropdown.value}")
            PROJECT_ID = dropdown.value

    button.on_click(on_button_click)
    output = widgets.Output()

    # Create centered layout
    centered_layout = widgets.VBox(
        [
            html_widget,
            widgets.HBox([dropdown, button]),
            html_widget2,
        ],
        layout=widgets.Layout(
            display="flex", flex_flow="column", align_items="center", width="100%"
        ),
    )
    # Display the layout
    display(centered_layout)


wait_for_button_press()

VBox(children=(HTML(value='\n        <center><table><tr><td><h1 style="font-family: Roboto;font-size: 24px"><b…

## Environment Check

In [3]:
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# - Google Colab Check
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
import datetime

RunningInCOLAB = False
RunningInCOLAB = "google.colab" in str(get_ipython())
current_time = datetime.datetime.now()

if RunningInCOLAB:
    print(
        f"You are running this notebook in Google Colab at {current_time} in the {BOLD_START}{PROJECT_ID}{BOLD_END}lab."
    )
else:
    print(
        f"You are likely running this notebook with Jupyter iPython runtime at {current_time} in the {PROJECT_ID} lab."
    )

You are likely running this notebook with Jupyter iPython runtime at 2025-03-13 16:50:47.880951 in the usfs-ai-bootcamp lab.


## Library Management
### Load Libraries necessary for this operation via pip install

In [4]:
# Import key libraries necessary to support dynamic installation of additional libraries
import sys

# Use subprocess to support running operating system commands from the program, using the "bang" (!)
# symbology is supported, however that does not translate to an actual python script, this is a more
# agnostic approach.
import subprocess
import importlib.util

In [5]:
# Identify the libraries you'd like to add to this Runtime environment.

libraries = [
    "backoff",
    "python-dotenv",
    "seaborn",
    "piexif",
    "unidecode",
    "icecream",
    "watermark",
    "watermark[GPU]",
    "rich",
    "rich[jupyter]",
    "numpy",
    "pydot",
    "polars[all]",
    "dask[complete]",
    "xarray",
    "pandas",
    "pystac",
    "pystac[jinja2]",
    "pystac[orjson]",
    "pystac[validation]",
    "fastparquet",
    "zarr",
    "gdown",
]

# Loop through each library and test for existence, if not present install quietly
for library in libraries:
    if library == "Pillow":
        spec = importlib.util.find_spec("PIL")
    else:
        spec = importlib.util.find_spec(library)
    if spec is None:
        print("Installing library " + library)
        subprocess.run(["pip", "install", library, "--quiet"], check=True)
    else:
        print("Library " + library + " already installed.")

# Specialized install for GPU enabled capability with CUDF
# pip install --extra-index-url=https://pypi.nvidia.com "cudf-cu12==25.2.*" "dask-cudf-cu12==25.2.*" "cuml-cu12==25.2.*" "cugraph-cu12==25.2.*" "nx-cugraph-cu12==25.2.*" "cuspatial-cu12==25.2.*"     "cuproj-cu12==25.2.*" "cuxfilter-cu12==25.2.*" "cucim-cu12==25.2.*"
try:
    subprocess.run(
        [
            "pip",
            "install",
            "--extra-index-url=https://pypi.nvidia.com",
            "cudf-cu12",
            "dask-cudf-cu12",
            "--quiet",
        ],
        check=True,
    )
except (subprocess.CalledProcessError, RuntimeError, Exception) as e:
    print(repr(e))

Library backoff already installed.
Installing library python-dotenv
Library seaborn already installed.
Library piexif already installed.
Library unidecode already installed.
Library icecream already installed.
Library watermark already installed.
Installing library watermark[GPU]
Library rich already installed.
Installing library rich[jupyter]
Library numpy already installed.
Library pydot already installed.
Installing library polars[all]
Installing library dask[complete]
Library xarray already installed.
Library pandas already installed.
Library pystac already installed.
Installing library pystac[jinja2]
Installing library pystac[orjson]
Installing library pystac[validation]
Library fastparquet already installed.
Library zarr already installed.
Library gdown already installed.


### Library Import

In [12]:
# - Import additional libraries that add value to the project related to NLP

# - Set of libraries that perhaps should always be in Python source
import backoff
import datetime
from dotenv import load_dotenv
import gc
import getopt
import glob
import inspect
import io
import itertools
import json
import math
import os
from pathlib import Path
import pickle
import platform
import random
import re
import shutil
import string
from io import StringIO
import subprocess
import socket
import sys
import textwrap
import tqdm
import traceback
import warnings
import time

#- Datastructures
from dataclasses import dataclass, field

#- Profiling
from time import perf_counter
import gc
import io
import tracemalloc
import psutil
import cProfile
import pstats
from pstats import SortKey

#- Text formatting
from rich import print as rprint
from rich.console import Console
from rich.traceback import install
from tabulate import tabulate
import locale

# - Displays system info
from watermark import watermark as the_watermark
from py3nvml import py3nvml

# - Additional libraries for this work
import math
from base64 import b64decode
from IPython.display import Image, Markdown
import pandas, IPython.display as display, io, jinja2, base64
from IPython.display import clear_output  # used to support real-time plotting
import requests
import unidecode
import pydot

# - Data Science Libraries
import pandas as pd
import numpy as np
import polars as pl
import dask as da
import xarray as xr
import cupy_xarray  # never actually invoked in source itself use ds=ds.cupy.as_cupy()
import pystac as pys
import pystac
from pystac.utils import datetime_to_str

# from stacframes import df_from
import fastparquet as fq
import zarr
from zarr import Group
import netCDF4 as nc
from netCDF4 import Dataset

try:
    import cudf
except Exception as e:
    pass

try:
    import cupy
except Exception as e:
    pass

# Tensorflow and related AI libraries
import tensorflow as tf
from tensorflow import data as tf_data

# Torch
import torch

# - Graphics
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.cbook import get_sample_data
from matplotlib.offsetbox import AnnotationBbox, DrawingArea, OffsetImage, TextArea
from matplotlib.pyplot import imshow
from matplotlib.patches import Circle
from PIL import Image as PIL_Image
import PIL.ImageOps
import matplotlib.image as mpimg
from imageio import imread
import seaborn as sns

from mpl_toolkits.basemap import Basemap
from pylab import *

# - Image meta-data for Section 508 compliance
import piexif
from piexif.helper import UserComment

# - Progress bar
from tqdm import tqdm
from tqdm.notebook import trange, tqdm


## DataClasses

In [13]:
## Dataclass used to represent each metric used during execution
#
@dataclass
class runtime_metrics:
    id: str

    # see @Profile
    runtime: float = field(init=False, default=0.0)

    # reference: https://docs.python.org/3/library/profile.html
    profile_data: cProfile.Profile = field(init=False)

    # reference: https://www.geeksforgeeks.org/how-to-get-file-size-in-python/
    file_size: float = field(
        init=False,
        default=0.0,
    )

    # milliseconds, reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_read_time: float = field(
        init=False,
        default=0.0,
    )

    # milliseconds, reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_write_time: float = field(
        init=False,
        default=0.0,
    )
    
    # number read operations[end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_read_count: float = field(
        init=False,
        default=0.0,
    )

    # number read operations[end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_write_count: float = field(
        init=False,
        default=0.0,
    )

    # bytes read [end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_read_throughput: float = field(
        init=False,
        default=0.0,
    )

    # bytes read [end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_disk_write_throughput: float = field(
        init=False,
        default=0.0,
    )

    # number read operations[end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_os_read_count: float = field(
        init=False,
        default=0.0,
    )

    # number read operations[end-begin], reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_os_write_count: float = field(
        init=False,
        default=0.0,
    )

    # milliseconds, reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_os_read_time: float = field(
        init=False,
        default=0.0,
    )

    # milliseconds, reference: https://stackoverflow.com/questions/24723092/using-python-to-measure-in-situ-read-write-speed-for-files
    io_os_write_time: float = field(
        init=False,
        default=0.0,
    )

    # calculated in MBs, reference: https://www.geeksforgeeks.org/monitoring-memory-usage-of-a-running-python-program/
    mem_current: float = field(
        init=False,
        default=0.0,
    )

    # calculated in MBs, reference: https://www.geeksforgeeks.org/monitoring-memory-usage-of-a-running-python-program/
    mem_peak: float = field(
        init=False,
        default=0.0,
    )

    def __str__(self):
        return f"""
                Id---------------------------------------------
                                      Id: {self.id}
                Runtime----------------------------------------
                                 Runtime: {self.runtime:,.2f} milliseconds

                I/O Size---------------------------------------
                               File Size: {self.file_size:,.2f} bytes

                I/O Counts-------------------------------------
                      Targeted disk read: {self.io_disk_read_count:,.2f} counts
                     Targeted disk write: {self.io_disk_write_count:,.2f} counts
                       General disk read: {self.io_os_read_count:,.2f} counts
                      General disk write: {self.io_os_write_count:,.2f} counts

                I/O Time---------------------------------------
                 Targeted disk read time: {self.io_disk_read_time:,.2f} milliseconds
                Targeted disk write time: {self.io_disk_write_time:,.2f} milliseconds
                  General disk read time: {self.io_os_read_time:,.2f} milliseconds
                 General disk write time: {self.io_os_write_time:,.2f} milliseconds

                Memory------------------------------------------
                                 Current: {self.mem_current:,.2f} MB
                                    Peak: {self.mem_peak:,.2f} MB

                """

    #def __repr__(self):
    #    return f'{self.__class__.__name__}(name={self.name!r}, unit_price={self.unit_price!r}, quantity={self.quantity_on_hand!r})'

    # TODO - CGW
    # def __post_init__(self):
    #    self.id = f'{self.phrase}_{self.word_type.name.lower()}'

    # worthy consideration - https://www.geeksforgeeks.org/psutil-module-in-python/

### Profile

In [14]:
# Profiling function custom created to track IO, memory, and runtme.
# Reference: https://jiffyclub.github.io/snakeviz/
# Reference: https://www.machinelearningplus.com/python/cprofile-how-to-profile-your-python-code/
# Reference: https://cloud.google.com/stackdriver/docs/instrumentation/setup/python
# Reference: https://www.turing.com/kb/python-code-with-cprofile

def profile(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):

        # custom metrics values
        current_memories = 0.0
        peak_memories = 0.0
        current_metric = runtime_metrics(id=func.__name__)
        disk = "sdc"

        #####################################################################################################
        # - Cprofiler startup
        # Reference: https://www.google.com/search?client=firefox-b-1-d&q=python+example+use+of+cprofle+for+a+single+function#cobssid=s
        #####################################################################################################
        pr = cProfile.Profile()
        pr.enable()
        start_time = time.perf_counter()

        #####################################################################################################
        # - Memory tracking
        #  Reference: https://docs.python.org/3/library/tracemalloc.html
        #  Reference: https://www.kdnuggets.com/how-to-trace-memory-allocation-in-python
        #  Reference: https://www.geeksforgeeks.org/monitoring-memory-usage-of-a-running-python-program/
        #####################################################################################################
        tracemalloc.start()

        #####################################################################################################
        # - Disk tracking
        # Reference: https://stackoverflow.com/questions/16945664/insight-needed-into-python-psutil-output#:~:text=1%20Answer%201%20%C2%B7%20read_count:%20number%20of,write_bytes:%20number%20of%20bytes%20written%20%C2%B7%20read_time:
        #####################################################################################################
        iocnt1 = psutil.disk_io_counters(perdisk=True)[disk]
        disk_io_counters1 = psutil.disk_io_counters()
        read_bytes_start = iocnt1.read_bytes
        write_bytes_start = iocnt1.write_bytes
        read_counters_start = iocnt1.read_count
        write_counters_start = iocnt1.write_count
        read_time_start = iocnt1.read_time
        write_time_start = iocnt1.write_time
        
        read_os_bytes_start = disk_io_counters1.read_bytes
        write_os_bytes_start = disk_io_counters1.write_bytes
        read_os_counters_start = disk_io_counters1.read_count
        write_os_counters_start = disk_io_counters1.write_count
        read_os_time_start = disk_io_counters1.read_time
        write_os_time_start = disk_io_counters1.write_time
        
        #####################################################################################################
        # - Actual function call
        #####################################################################################################
        result = func(*args, **kwargs)

        # disk close out
        # targeted I/O
        iocnt2 = psutil.disk_io_counters(perdisk=True)["sdc"]
        disk_io_counters2 = psutil.disk_io_counters()

        #targeted disk
        read_bytes_end = iocnt2.read_bytes
        write_bytes_end = iocnt2.write_bytes
        read_counters_end = iocnt2.read_count
        write_counters_end = iocnt2.write_count
        read_time_end = iocnt2.read_time
        write_time_end = iocnt2.write_time
        #general OS
        read_os_bytes_end = disk_io_counters2.read_bytes
        write_os_bytes_end = disk_io_counters2.write_bytes
        read_os_counters_end = disk_io_counters2.read_count
        write_os_counters_end = disk_io_counters2.write_count
        read_os_time_end = disk_io_counters2.read_time
        write_os_time_end = disk_io_counters2.write_time

        #targeted disk
        read_throughput = (read_bytes_end - read_bytes_start) / (1024 * 1024)  # MB/s
        write_throughput = (write_bytes_end - write_bytes_start) / (1024 * 1024)  # MB/s
        read_counters = (read_counters_end - read_counters_start)
        write_counters = (read_counters_end - read_counters_start)
        read_time =  (read_time_end - read_time_start)
        write_time = (write_time_end - write_time_start)
        current_metric.io_disk_read_throughput = read_throughput
        current_metric.io_disk_write_throughput = write_throughput
        current_metric.io_disk_read_count = read_counters
        current_metric.io_disk_write_count = write_counters
        current_metric.io_disk_read_time = read_time
        current_metric.io_disk_write_time = write_time

        #general OS
        read_os_throughput = (read_os_bytes_end - read_os_bytes_start) / (1024 * 1024)  # MB/s
        write_os_throughput = (write_os_bytes_end - write_os_bytes_start) / (1024 * 1024)  # MB/s
        read_os_counters = (read_os_counters_end - read_os_counters_start)
        write_os_counters = (read_os_counters_end - read_os_counters_start)
        read_os_time =  (read_os_time_end - read_os_time_start)
        write_os_time = (write_os_time_end - write_os_time_start)
        current_metric.io_os_read_throughput = read_os_throughput
        current_metric.io_os_write_throughput = write_os_throughput
        current_metric.io_os_read_count = read_os_counters
        current_metric.io_os_write_count = write_os_counters
        current_metric.io_os_read_time = read_os_time
        current_metric.io_os_write_time = write_os_time



        # memory close
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()        
        current_metric.mem_current = current / (1024 * 1024)
        current_metric.mem_peak = peak / (1024 * 1024)
        tracemalloc.clear_traces()


        # CProfiler disabled
        pr.disable()
        current_metric.profile_data=pr
        # s = io.StringIO()
        # sortby = SortKey.CUMULATIVE
        # ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
        # ps.print_stats()
        # print(s.getvalue())
        end_time = time.perf_counter()
        # other characteristics
        current_metric.runtime = end_time - start_time

        print(current_metric)

        return result

    return wrapper

#### Example Complex Function to Profile

In [15]:
@profile
def complex_function():
    # Define the size of the matrix
    matrix_size = 2048
    # Generate two random matrices
    matrix_a = np.random.rand(matrix_size, matrix_size)
    matrix_b = np.random.rand(matrix_size, matrix_size)
    result_matrix = np.matmul(matrix_a, matrix_b)
    np.savez(
        "./folderOnColab/data/local_test.npy",
        the_matrix=result_matrix,
    )

## Function Declaration

#### Lib Diagnostics

In [16]:
def lib_diagnostics() -> None:

    import pkg_resources

    package_name_length = 20
    package_version_length = 10

    # Show notebook details
    #%watermark?
    #%watermark --github_username christophergwood --email christopher.g.wood@gmail.com --date --time --iso8601 --updated --python --conda --hostname --machine --githash --gitrepo --gitbranch --iversions --gpu
    # Watermark
    print(
        the_watermark(
            author=f"{AUTHOR_NAME}",
            github_username=f"GITHUB_USERNAME",
            email=f"{AUTHOR_EMAIL}",
            iso8601=True,
            datename=True,
            current_time=True,
            python=True,
            updated=True,
            hostname=True,
            machine=True,
            gitrepo=True,
            gitbranch=True,
            githash=True,
        )
    )

    print(f"{BOLD_START}Packages:{BOLD_END}")
    print("")
    # Get installed packages
    the_packages = [
        "nltk",
        "numpy",
        "os",
        "pandas",
        "keras",
        "seaborn",
        "fastparquet",
        "zarr",
        "dask",
        "pystac",
        "polars",
        "xarray",
    ]  # Functions are like legos that do one thing, this function outputs library version history of effort.

    installed = {pkg.key: pkg.version for pkg in pkg_resources.working_set}
    for package_idx, package_name in enumerate(installed):
        if package_name in the_packages:
            installed_version = installed[package_name]
            print(
                f"{package_name:<40}#: {str(pkg_resources.parse_version(installed_version)):<20}"
            )

    try:
        print(f"{'TensorFlow version':<40}#: {str(tf.__version__):<20}")
        print(
            f"{'     gpu.count:':<40}#: {str(len(tf.config.experimental.list_physical_devices('GPU')))}"
        )
        print(
            f"{'     cpu.count:':<40}#: {str(len(tf.config.experimental.list_physical_devices('CPU')))}"
        )
    except Exception as e:
        pass

    try:
        print(f"{'Torch version':<40}#: {str(torch.__version__):<20}")
        if torch.cuda.is_available():
            device = torch.device("cuda")
            print(f"{'     GPUs available?':<40}#: {torch.cuda.is_available()}")
            print(f"{'     count':<40}#: {torch.cuda.device_count()}")
            print(f"{'     current':<40}#: {torch.cuda.get_device_name(0)}")
        else:
            device = torch.device("cpu")
            print("No GPU available, using CPU.")
    except Exception as e:
        pass

    try:
        print(f"{'OpenAI Azure Version':<40}#: {str(the_openai_version):<20}")
    except Exception as e:
        pass

    return

#### Section 508 Compliance Tools

In [17]:
# Routines designed to support adding ALT text to an image generated through Matplotlib.


def capture(figure):
    buffer = io.BytesIO()
    figure.savefig(buffer)
    # return F"data:image/png;base64,{base64.b64encode(buffer.getvalue()).decode()}"
    return f"data:image/jpg;base64,{base64.b64encode(buffer.getvalue()).decode()}"


def make_accessible(figure, template, **kwargs):
    return display.Markdown(
        f"""![]({capture(figure)} "{template.render(**globals(), **kwargs)}")"""
    )


# requires JPG's or TIFFs
def add_alt_text(image_path, alt_text):
    try:
        if os.path.isfile(image_path):
            img = PIL_Image.open(image_path)
            if "exif" in img.info:
                exif_dict = piexif.load(img.info["exif"])
            else:
                exif_dict = {}

            w, h = img.size
            if "0th" not in exif_dict:
                exif_dict["0th"] = {}
            exif_dict["0th"][piexif.ImageIFD.XResolution] = (w, 1)
            exif_dict["0th"][piexif.ImageIFD.YResolution] = (h, 1)

            software_version = " ".join(
                ["STEM-001 with Python v", str(sys.version).split(" ")[0]]
            )
            exif_dict["0th"][piexif.ImageIFD.Software] = software_version.encode(
                "utf-8"
            )

            if "Exif" not in exif_dict:
                exif_dict["Exif"] = {}
            exif_dict["Exif"][piexif.ExifIFD.UserComment] = UserComment.dump(
                alt_text, encoding="unicode"
            )

            exif_bytes = piexif.dump(exif_dict)
            img.save(image_path, "jpeg", exif=exif_bytes)
        else:
            rprint(
                f"Cound not fine {image_path} for ALT text modification, please check your paths."
            )

    except (FileExistsError, FileNotFoundError, Exception) as e:
        process_exception(e)


# Appears to solve a problem associated with GPU use on Colab, see: https://github.com/explosion/spaCy/issues/11909
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

#### Libary Configuration

In [18]:
def set_library_configuration() -> None:

    ############################################
    # - JUPYTER NOTEBOOK OUTPUT CONTROL / FORMATTING
    ############################################
    # pandas set floating point to 4 places to things don't run loose
    debug.msg_info("Setting Pandas and Numpy library options.")
    pd.set_option(
        "display.max_colwidth", 10
    )  # None if you want to view the full json blob in the printed dataframe, use this
    pd.options.display.float_format = "{:,.4f}".format
    np.set_printoptions(precision=4)

#### Custom Exception Display

In [19]:
# this function displays the stack trace on errors from a central location making adjustments to the display on an error easier to manage
# functions perform useful solutions for highly repetitive code
def process_exception(inc_exception: Exception) -> None:
    if DEBUG_STACKTRACE == 1:
        traceback.print_exc()
        console.print_exception(show_locals=True)
    else:
        rprint(repr(inc_exception))

#### Quick Stats for a DataFrame

In [20]:
def quick_df_stats(
    inc_df: pd.DataFrame,
    inc_header_count: int,
) -> None:
    """
    Load the data and return as a pd.DataFrame.

            Parameters:
                   inc_df (pd.DataFrame): Dataframe to be inspected, displayed
                   inc_header_count (int): Anticipated number of columns to read in (validation check)

            Returns:
                    Printed output
    """
    print("Data Resolution has: " + str(inc_df.columns))
    print("\n")
    print(f"""{"size":20} : {inc_df.size:15,} """)
    print(f"""{"shape":20} : {str(inc_df.shape):15} """)
    print(f"""{"ndim":20} : {inc_df.ndim:15,} """)
    print(f"""{"column size":20} : {inc_df.columns.size:15,} """)

    # index added so you get an extra column
    print(f"""{"Read":20} : {inc_df.columns.size:15,} """)
    print(f"""{"Expected":20} : {inc_header_count:15,} """)
    if (inc_df.columns.size) == inc_header_count:
        print(f"{BOLD_START}Expectations met{BOLD_END}.")
    else:
        print(
            f"Expectations {BOLD_START}not met{BOLD_END}, check your datafile, columns don't match."
        )
    rprint("\n")
    # rprint(str(inc_df.describe()))

#### Download the FIADB Dataset

In [21]:
# Reference: https://research.fs.usda.gov/programs/fia#data-and-tools
# Forest Inventory Asset Database (FIADB)


def download_fiadb() -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    dataset_long_names = [
        "ALASKA_AK",
        "CALIFORNIA_CA",
        "HAWAII_HI",
        "IDAHO_ID",
        "NEVADA_NV",
        "OREGON_OR",
        "WASHINGTON_WA",
        "ARIZONA_AZ",
        "ARKANSAS_AR",
        "COLORADO_CO",
        "IOWA_IA",
        "KANSAS_KS    ",
        "LOUISIANA_LA",
        "MINNESOTA_MN",
        "MISSOURI_MO",
        "MONTANA_MT",
        "NEBRASKA_NE",
        "NEW_MEXICO_NM",
        "NORTH_DAKOTA_ND",
        "OKLAHOMA_OK",
        "SOUTH_DAKOTA_SD",
        "TEXAS_TX",
        "U    TAH_UT",
        "WYOMING_WY",
        "ALABAMA_AL",
        "CONNECTICUT_CT",
        "DELAWARE_DE",
        "FLORIDA_FL",
        "GEORGIA_GA",
        "ILLINOIS_IL",
        "INDIANA_IN",
        "KENTUCKY_KY",
        "MAINE_ME",
        "MARYLAND    _MD",
        "MASSACHUSETTS_MA",
        "MICHIGAN_MI",
        "MISSISSIPPI_MS",
        "NEW_HAMPSHIRE_NH",
        "NEW_JERSEY_NJ",
        "NEW_YORK_NY",
        "NORTH_CAROLINA_NC",
        "OHIO_OH",
        "PENNSYLVANIA_PA",
        "    RHODE_ISLAND_RI",
        "SOUTH_CAROLINA_SC",
        "TENNESSEE_TN",
        "VERMONT_VT",
        "VIRGINIA_VA",
        "WEST_VIRGINIA_WV",
        "WISCONSIN_WI",
        "GUAM_GU",
        "FEDERATED_STATES_OF_MICRONES_FM    ",
        "NORTHERN_MARIANA_ISLANDS_MP",
        "PALAU_PW",
        "AMERICAN_SAMOA_AS",
        "PUERTO_RICO_PR",
        "US_VIRGIN_ISLANDS_VI",
    ]
    dataset_short_names = [
        "AK",
        "AL",
        "AR",
        "AS",
        "AZ",
        "CA",
        "CO",
        "CT",
        "DE",
        "FL",
        "GA",
        "GU",
        "HI",
        "IA",
        "ID",
        "IL",
        "IN",
        "KS",
        "KY",
        "LA",
        "MA",
        "MD",
        "ME",
        "MI",
        "MN",
        "MO",
        "MP",
        "MS",
        "MT",
        "NC",
        "ND",
        "NE",
        "NH",
        "NJ",
        "NM",
        "NV",
        "NY",
        "OH",
        "OK",
        "OR",
        "PA",
        "PR",
        "PW",
        "RI",
        "SC",
        "SD",
        "SFM",
        "TN",
        "TX",
        "UT",
        "VA",
        "VI",
        "VT",
        "WA",
        "WI",
        "WV",
        "WY",
    ]
    # dataset_pattern="https://apps.fs.usda.gov/fia/datamart/CSV/MT_VEG_SUBPLOT.zip"
    dataset_pattern = "https://apps.fs.usda.gov/fia/datamart/CSV/"

    rprint("Performing `wget` on target FIA records.")
    target_folder = WORKING_FOLDER
    if os.path.isdir(target_folder):
        target_directory = f"{target_folder}{os.sep}downloads"
        for idx, filename in enumerate(dataset_short_names):
            if os.path.isdir(target_directory):
                target_filename = f"{filename}_CSV.zip"
                target_url = f"{dataset_pattern}{target_filename}"
                try:
                    rprint(
                        f"...copying {dataset_long_names[idx]} to target folder: {target_directory}"
                    )
                    subprocess.run(
                        [
                            "/usr/bin/wget",
                            "--show-progress",
                            f"--directory-prefix={target_directory}",
                            f"{target_url}",
                        ],
                        check=True,
                    )
                    rprint("......completed")
                except (subprocess.CalledProcessError, Exception) as e:
                    rprocess_exception(e)
            else:
                rprint(
                    f"...target folder: {target_directory} isn't present for {filename} download."
                )
            break
    else:
        rprint(
            "ERROR: Local downloads folder not found/created.  Check the output to ensure your folder is created."
        )
        rprint(f"...target folder: {target_directory}")
        rprint("...if you can't find the problem contact the instructor.")

    # Process the downloaded data, open it up
    rprint("Uncompressing the downloads...")
    if os.path.isdir(target_folder):
        source_directory = f"{target_folder}{os.sep}downloads"
        target_directory = f"{target_folder}{os.sep}data"
        if os.path.isdir(target_directory) and os.path.isdir(source_directory):
            for idx, filename in enumerate(dataset_short_names):
                target_filename = f"{filename}_CSV.zip"
                final_directory = f"{target_directory}{os.sep}{filename}{os.sep}"
                try:
                    if os.path.isfile(f"{source_directory}{os.sep}{target_filename}"):
                        rprint(
                            f"...unzipping {dataset_long_names[idx]} to created target folder: {final_directory}"
                        )
                        subprocess.run(["mkdir", "-p", final_directory], check=True)
                        subprocess.run(
                            [
                                "/usr/bin/unzip",
                                "-o",
                                "-qq",
                                "-d",
                                f"{final_directory}",
                                f"{source_directory}{os.sep}{target_filename}",
                            ],
                            check=True,
                        )
                        process1 = subprocess.Popen(
                            [
                                "/usr/bin/find",
                                f"{final_directory}",
                                "-type",
                                "f",
                                "-print",
                            ],
                            stdout=subprocess.PIPE,
                        )
                        process2 = subprocess.Popen(
                            ["wc", "-l"], stdin=process1.stdout, stdout=subprocess.PIPE
                        )

                        # Close the output of process1 to allow process2 to receive EOF
                        process1.stdout.close()
                        output, error = process2.communicate()
                        process2.stdout.close()
                        number_files = output.decode().strip()
                        rprint(f"......completed, {number_files} files extracted.")
                    else:
                        rprint(
                            f"......failed, unable to find ({source_directory}{os.sep}{target_filename}{os.sep})"
                        )
                except (subprocess.CalledProcessError, Exception) as e:
                    process_exception(e)
                break
        else:
            rprint(
                f"...either the source directory ({source_directory})  or the ({target_directory}) isn't present for extraction."
            )
    else:
        rprint(
            "ERROR: Local downloads folder not found/created.  Check the output to ensure your folder is created."
        )
        rprint(f"...target folder: {target_directory}")
        rprint("...if you can't find the problem contact the instructor.")
    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

#### Download NOAA GDS 0.25 Degree Data for a Range of Dates

In [22]:
# Reference: https://polar.ncep.noaa.gov/global/data_access.shtml
# Global Forecast System (GFS), 0.25 degree resolution
def download_noaa() -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    dataset_url = "https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs.20250304/00/atmos/gfs.t00z.atmf000.nc"
    dataset_url_pattern_begin = (
        f"https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs."
    )
    dataset_filename_pattern = "gfs.t00z.atmf000.nc"
    dataset_url_pattern_end = f"/00/atmos/{dataset_filename_pattern}"
    dataset_date_month = "03"
    dataset_day_start = int(4)
    dataset_day_end = int(12)

    rprint("...Performing `wget` on target GFS records.")
    target_folder = WORKING_FOLDER
    if os.path.isdir(target_folder):
        target_directory = f"{target_folder}{os.sep}data"
        if os.path.isdir(target_directory):
            for idx, day in enumerate(range(dataset_day_start, dataset_day_end)):
                if day < 10:
                    day = f"0{day}"
                target_date = f"2025{dataset_date_month}{day}"
                target_url = "".join(
                    [dataset_url_pattern_begin, target_date, dataset_url_pattern_end]
                )

                # remove potentially partial downloads
                target_partial_file = (
                    f"{target_folder}{os.sep}{dataset_filename_pattern}"
                )
                if os.path.isfile(target_partial_file):
                    rprint(
                        f"...removing {dataset_filename_pattern} as it is likely a partial download."
                    )
                    subprocess.run(
                        ["/usr/bin/rm", "-rf", f"{target_partial_file}"], check=True
                    )
                try:
                    rprint(
                        f"......copying {target_url} to target folder: {target_directory}"
                    )
                    # subprocess.run(["/usr/bin/wget", "--show-progress", f"--directory-prefix={target_directory}", f"{target_url}"], check=True)
                    # subprocess.run(["/usr/bin/wget", "--quiet", f"--directory-prefix={target_directory}", f"{target_url}"], check=True)
                    rprint(f"......completed download of day:{target_date}")
                    target_filename = "_".join([target_date, dataset_filename_pattern])
                    os.rename(dataset_filename_pattern, target_filename)
                    rprint(f".........renamed file to {target_filename}")
                    if os.path.isfile(target_filename):
                        rprint(f".........SUCCESS.")
                    else:
                        rprint(f".........inspect download, there could be a problem.")
                except (subprocess.CalledProcessError, Exception) as e:
                    process_exception(e)

        else:
            rprint(
                f"ERROR: Target folder, {target_directory}, isn't present for {target_date} download."
            )
            raise SystemError
    else:
        rprint(
            "ERROR: Local downloads folder not found/created.  Check the output to ensure your folder is created."
        )
        rprint(f"...target folder: {target_directory}")
        raise SystemError

    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")

#### Download Single Specific NetCDF (MS Bight in Gulf of America) from Google Drive

In [23]:
def download_test() -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    THE_FILE = "ACS.txt"
    THE_ID = "12L8VRY6J1Sj-B1vIf-ODh4kjHWHqIzm8"

    THE_FILE = "MissBight_2020010900.nc"
    THE_ID = "1uYMFrdeVD7_qvG2wRbyu6ir9C6b4wAZC"

    target_folder = f"{WORKING_FOLDER}{os.sep}data"

    target_ids = [THE_ID]
    target_filenames = [THE_FILE]

    for idx, the_id in enumerate(target_ids):
        try:
            if os.path.isfile(f"{target_folder}{os.sep}{target_filenames[idx]}"):
                rprint(f"...no need to download {target_filenames[idx]} again.")
            else:
                rprint(f"...downloading {target_filenames[idx]}.")
                subprocess.run(
                    [
                        "gdown",
                        f"{the_id}",
                        "--no-check-certificate",
                        "--continue",
                        "-O",
                        f"{target_folder}{os.sep}{target_filenames[idx]}",
                    ],
                    check=True,
                )
        except (subprocess.CalledProcessError, Exception) as e:
            process_exception(e)
            raise SystemError
    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")

#### Check your resources from a CPU/GPU perspective

In [24]:
def get_hardware_stats() -> None:
    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    print(
        f"{BOLD_START}List Devices{BOLD_END} #########################################"
    )
    try:
        from tensorflow.python.client import device_lib

        rprint(device_lib.list_local_devices())
        print("")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        rprint(str(repr(e)))

    print(
        f"{BOLD_START}Devices Counts{BOLD_END} ########################################"
    )
    try:
        rprint(
            f"Num GPUs Available: {str(len(tf.config.experimental.list_physical_devices('GPU')))}"
        )
        rprint(
            f"Num CPUs Available: {str(len(tf.config.experimental.list_physical_devices('CPU')))}"
        )
        print("")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        rprint(str(repr(e)))

    print(
        f"{BOLD_START}Optional Enablement{BOLD_END} ####################################"
    )
    try:
        gpus = tf.config.experimental.list_physical_devices("GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        rprint(str(repr(e)))

    if gpus:
        # Restrict TensorFlow to only use the first GPU
        try:
            tf.config.experimental.set_visible_devices(gpus[0], "GPU")
            logical_gpus = tf.config.experimental.list_logical_devices("GPU")
            rprint(
                str(
                    str(len(gpus))
                    + " Physical GPUs,"
                    + str(len(logical_gpus))
                    + " Logical GPU"
                )
            )
        except RuntimeError as e:
            # Visible devices must be set before GPUs have been initialized
            rprint(str(repr(e)))
        print("")
    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

#### Clean House

In [25]:
def clean_house() -> None:
    # rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    gc.collect()

    # could leave the GPU unstable so holding off.
    # torch.cuda.empty_cache()
    # rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

#### Remove Files

In [26]:
def nuke_file(target_filename: str) -> None:
    # rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    if os.path.isfile(target_filename):
        try:
            # removing existing file, else you would append
            subprocess.run(["rm", "-rf", f"{target_filename}"], check=True)
        except (subprocess.CalledProcessError, Exception) as e:
            process_exception(e)
            raise SystemError
    # rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

## Input Sources
### Create the storage locations


In [27]:
# Create the folder that will hold our content.
def create_storage_locations(inc_directory: str) -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    target_folder = inc_directory
    sub_folders = ["downloads", "data"]
    rprint(f"Creating project infrastructure:")
    try:
        for idx, subdir in enumerate(sub_folders):
            target_directory = f"{target_folder}{os.sep}{subdir}"
            rprint(f"...creating ({target_directory}) to store project data.")
            if os.path.isfile(target_directory):
                raise OSError(
                    f"Cannot create your folder ({target_directory}) a file of the same name already exists there, work with your instructor or remove it yourself."
                )
            elif os.path.isdir(target_directory):
                print(
                    f"......folder named ({target_directory}) {BOLD_START}already exists{BOLD_END}, we won't try to create a new folder."
                )
            else:
                subprocess.run(["mkdir", "-p", target_directory], check=True)
    except (subprocess.CalledProcessError, Exception) as e:
        process_exception(e)

    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")

#### Read NetCDF's

In [28]:
def read_netcdfs(inc_source_filenames: []) -> []:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    the_list = []

    rprint(f"...reading NetCDF4 from list of {len(inc_source_filenames)} files:")
    for target_filename in inc_source_filenames:
        try:
            rprint(f"......reading NetCDF4 ({target_filename})")
            the_netcdf = Dataset(target_filename, "r", format="NETCDF4")
            the_list.append(the_netcdf)
        except Exception as e:
            process_exception(e)

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    return the_list

### Mapping

In [29]:
def show_basic_map(data) -> None:

    # plt.figure()
    lat = the_netcdf.variables["latitude"][:]
    lon = the_netcdf.variables["longitude"][:]
    data = the_netcdf.variables["water_temp"][0, 0, :, :]

    m = Basemap(
        projection="mill",
        lat_ts=10,
        llcrnrlon=lon.min(),
        urcrnrlon=lon.max(),
        llcrnrlat=lat.min(),
        urcrnrlat=lat.max(),
        resolution="c",
    )

    Lon, Lat = meshgrid(lon, lat)
    x, y = m(Lon, Lat)

    # cs = m.pcolormesh(x,y,data,shading='flat', cmap=plt.cm.jet)
    cs = m.pcolormesh(x, y, data, cmap=plt.cm.jet)

    m.drawcoastlines()
    m.fillcontinents()
    m.drawmapboundary()
    m.drawparallels(np.arange(-90.0, 120.0, 30.0), labels=[1, 0, 0, 0])
    m.drawmeridians(np.arange(-180.0, 180.0, 60.0), labels=[0, 0, 0, 1])

    colorbar(cs)
    plt.title("Example 1: Global RTOFS SST from NOMADS")
    plt.show()

### Demonstrate Various Data Storage Solutions

#### Pandas DataFrame

In [30]:
def build_pandas(inc_payload: {}) -> pd.DataFrame():

    if DEBUG_USING_GPU == 1:
        import cudf.pandas

        cudf.pandas.install()
        import pandas as pd
    else:
        import pandas as pd

    latSeries = pd.Series(inc_payload[LAT_LNAME].flatten())
    lonSeries = pd.Series(inc_payload[LONG_LNAME].flatten())
    varSeries = pd.Series(inc_payload[PRODUCT_LNAME][0, 0, :, :].flatten())

    # define a Panda.DataFrame()
    frame = {
        LAT_LNAME: latSeries,
        LONG_LNAME: lonSeries,
        PRODUCT_LNAME: varSeries,
    }

    # instantiate a dataframe
    df = pd.DataFrame(frame)

    # ensure the data is cast as expected
    df[LAT_LNAME].astype("float64")
    df[LONG_LNAME].astype("float64")
    df[PRODUCT_LNAME].astype("float64")

    # clean up behind yourself
    del latSeries, lonSeries, varSeries, frame
    clean_house()

    return df

In [31]:
def process_pandas(inc_payload: {}) -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

    target_filename = f"{WORKING_FOLDER}{os.sep}data{os.sep}{DATA_VERSION_RELEASE}.{OUTPUT_PANDAS_EXT}"

    # create dataframe
    df = build_pandas(inc_payload)
    # quick stats
    quick_df_stats(df, 3)

    # Get memory usage of each column in bytes
    memory_usage_per_column = df.memory_usage(deep=True)

    # Get total memory usage of the DataFrame in bytes
    total_memory_usage = df.memory_usage().sum()
    print(f"Original Dataframe memory use: {total_memory_usage:20,}")

    # WRITE
    nuke_file(target_filename)
    start_time = time.perf_counter()
    write_pandas(target_filename, df)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"...Pandas Write Execution time: {execution_time:.4f} seconds")

    # READ
    start_time = time.perf_counter()
    read_pandas(target_filename)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"...Pandas Read Execution time: {execution_time:.4f} seconds")

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

In [32]:
def write_pandas(target_pandas_filename: str, df: pd.DataFrame) -> None:
    df.to_pickle(target_pandas_filename)

In [33]:
def read_pandas(target_pandas_filename: str) -> None:
    df = pd.read_pickle(target_pandas_filename)

#### Numpy

In [34]:
def process_numpy(inc_payload: {}) -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

    target_filename = (
        f"{WORKING_FOLDER}{os.sep}data{os.sep}{DATA_VERSION_RELEASE}.{OUTPUT_NUMPY_EXT}"
    )

    # WRITE
    nuke_file(target_filename)
    start_time = time.perf_counter()
    write_numpy_native(target_filename, inc_payload)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"...Numpy Write Execution time: {execution_time:.4f} seconds")

    df = build_pandas(inc_payload)
    nuke_file(target_filename)
    start_time = time.perf_counter()
    write_numpy_as_df(target_filename, df)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"...Numpy pd.DataFrame Write Execution time: {execution_time:.4f} seconds")

    # READ
    start_time = time.perf_counter()
    read_numpy(target_filename)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"...Numpy Read Execution time: {execution_time:.4f} seconds")

In [35]:
def write_numpy_as_df(target_numpy_filename: str, df: pd.DataFrame) -> None:
    df_numpy = df.to_numpy()
    np.savez(target_numpy_filename, df_numpy)


def write_numpy_native(target_numpy_filename: str, inc_payload: {}) -> None:
    # varying sized arrays
    np.savez(
        target_numpy_filename,
        LAT_LNAME=inc_payload[LAT_LNAME],
        LONG_LNAME=inc_payload[LONG_LNAME],
        PRODUCT_LNAME=inc_payload[PRODUCT_LNAME],
    )

In [36]:
def read_numpy(target_numpy_filename: str) -> None:
    loaded_arr = np.load(target_numpy_filename + ".npz")

    # to unpack
    # new_lat = loaded_arr["lat"]
    # new_lon = loaded_arr["lon"]
    # new_product = loaded_arr["product"]
    # del loaded_arr, new_lat, new_lon, new_product

#### PyTorch

In [37]:
def process_pytorch(inc_payload: {}) -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

    target_filename = (
        f"{WORKING_FOLDER}{os.sep}data{os.sep}{DATA_VERSION_RELEASE}.{OUTPUT_TORCH_EXT}"
    )

    nuke_file(target_filename)
    start_time = time.perf_counter()
    write_pytorch(target_filename, inc_payload)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"...PyTorch Write Execution time: {execution_time:.4f} seconds")

    start_time = time.perf_counter()
    read_pytorch(target_filename)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"...PyTorch Read Execution time: {execution_time:.4f} seconds")

In [38]:
def write_pytorch(target_pytorch_filename: str, inc_payload: {}):
    if DEBUG_USING_GPU == 1:
        dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    else:
        dev = "cpu"

    # input_tensor = input_tensor.to(device)
    lat_tensor = torch.tensor(inc_payload[LAT_LNAME].flatten()).to(dev)
    lon_tensor = torch.tensor(inc_payload[LONG_LNAME].flatten()).to(dev)
    var_tensor = torch.tensor(inc_payload[PRODUCT_LNAME].flatten()).to(dev)

    # lonSeries=pd.Series(lon.flatten())
    # varSeries=pd.Series(varAry[0,0,:,:].flatten())

    # Save the tensor
    # Save multiple tensors as a list
    tensors_list = [lat_tensor, lon_tensor, var_tensor]
    torch.save(tensors_list, target_pytorch_filename)

In [39]:
def read_pytorch(target_pytorch_filename: str) -> None:
    # Load the tensor from the file
    tensor_loaded = torch.load(target_pytorch_filename)

#### TensorFlow

In [40]:
def process_tensorflow(inc_payload: {}) -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

    if DEBUG_USING_GPU == 1:
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"
        os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
        os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
        # gpus = tf.config.list_physical_devices('GPU')
        # tf.config.experimental.set_memory_growth(gpus[0], True)

    target_filename = f"{WORKING_FOLDER}{os.sep}data{os.sep}{DATA_VERSION_RELEASE}.{OUTPUT_TENSORFLOW_EXT}"

    nuke_file(target_filename)
    start_time = time.perf_counter()
    write_tensorflow(target_filename, inc_payload)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"...TensorFlow Write Execution time: {execution_time:.4f} seconds")

    start_time = time.perf_counter()
    read_tensorflow(target_filename)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"...TensorFlow Read Execution time: {execution_time:.4f} seconds")

    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")

In [41]:
def write_tensorflow(target_tensorflow_filename: str, inc_payload: {}):

    with tf.device(THE_DEVICE_NAME):
        # create a TFRecord to store the data
        lat_list = tf.train.FloatList(value=inc_payload[LAT_LNAME].flatten().tolist())
        lon_list = tf.train.FloatList(value=inc_payload[LONG_LNAME].flatten().tolist())
        varAry_list = tf.train.FloatList(
            value=inc_payload[PRODUCT_LNAME].flatten().tolist()
        )
        feature = {
            LAT_LNAME: tf.train.Feature(float_list=lat_list),
            LONG_LNAME: tf.train.Feature(float_list=lon_list),
            PRODUCT_LNAME: tf.train.Feature(float_list=varAry_list),
        }
        tfRecord = tf.train.Example(
            features=tf.train.Features(feature=feature)
        ).SerializeToString()
        # dataset = tf.data.Dataset.from_tensor_slices((lat_tensor, lon_tensor, var_tensor))
        # print(f"Size of TFRecord is {sys.getsizeof(tfRecord):20,} bytes.")
        # tf.io.write_file(target_tensorflow_filename, tf.io.serialize_tensor(tensors_list))
        with tf.io.TFRecordWriter(target_tensorflow_filename) as writer:
            writer.write(tfRecord)

In [42]:
def parse_tfrecord_fn(example_proto):
    feature_description = {
        "latitude": tf.io.FixedLenSequenceFeature(
            [], dtype=tf.float32, allow_missing=True
        ),
        "longitude": tf.io.FixedLenSequenceFeature(
            [], dtype=tf.float32, allow_missing=True
        ),
        "product": tf.io.FixedLenSequenceFeature(
            [], dtype=tf.float32, allow_missing=True
        ),
    }
    return tf.io.parse_single_example(example_proto, feature_description)

In [43]:
def read_tensorflow(target_tensorflow_filename: str) -> None:

    with tf.device(THE_DEVICE_NAME):
        dataset = tf.data.TFRecordDataset(target_tensorflow_filename)
        tfRecord = dataset.map(parse_tfrecord_fn)
        # for record in tfRecord:
        #    lat = record["latitude"]
        #    lon = record["longitude"]
        #    varAry = record["product"]
        #    break

#### Xarray

In [44]:
# def process_xarray(the_netcdf) -> None:
def process_xarray(inc_netcdf_filename: str) -> None:

    # xarray on GPU: https://github.com/xarray-contrib/cupy-xarray
    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

    """
    the_netcdf = Dataset(inc_netcdf_filename, "r", format="NETCDF4")

    geospatial_lat_nm=LAT_SNAME
    geospatial_lon_nm=LONG_SNAME
    product_nm=PRODUCT_LNAME
    lat   =np.array(the_netcdf.variables[geospatial_lat_nm][:][:],dtype=np.double)
    lon   =np.array(the_netcdf.variables[geospatial_lon_nm][:][:],dtype=np.double)
    varAry=np.array(the_netcdf.variables[product_nm][0][0][:][:],dtype=np.double)
    lat   =lat.flatten()
    lon   =lon.flatten()
    varAry=varAry[0, 0,]
    #error in cupy for saving to netcdf, defaulting to CPU only
    if DEBUG_USING_GPU==1:
        lat   =cupy.array(inc_payload[LAT_LNAME].flatten(),dtype=cupy.double)
        lon   =cupy.array(inc_payload[LONG_LNAME].flatten(),dtype=cupy.double)
        varAry=cupy.array(inc_payload[PRODUCT_LNAME][0, 0,],dtype=cupy.double)
    else:
        lat   =np.array(inc_payload[LAT_LNAME].flatten(),dtype=np.double)
        lon   =np.array(inc_payload[LONG_LNAME].flatten(),dtype=np.double)
        varAry=np.array(inc_payload[PRODUCT_LNAME][0, 0,],dtype=np.double)

    #build the xarray dataset from scratch
    #Reference: https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html
    ds_xr = xr.Dataset(
        data_vars=dict(
            data=([LAT_LNAME, LONG_LNAME], varAry),
            data=(["geospatial_coordinates"], varAry),
        ),
        coords=dict(
            LAT_LNAME=(LAT_LNAME,lat),
            LONG_LNAME=(LONG_LNAME,lon),
        ),
        attrs=dict(description="Weather related data."),
    )

    #https://github.com/xarray-contrib/cupy-xarray
    ds_xr = ds_xr.cupy.as_cupy()
    """

    ds_xr = xr.open_dataset(inc_netcdf_filename, engine="netcdf4")

    # WRITE - NetCDF
    try:
        target_filename = f"{WORKING_FOLDER}{os.sep}data{os.sep}{DATA_VERSION_RELEASE}.{OUTPUT_XARRAY_EXT}"
        nuke_file(target_filename)
        start_time = time.perf_counter()
        write_xarray_netcdf(target_filename, ds_xr)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"...Xarray Write (NetCDF) Execution time: {execution_time:.4f} seconds")

        start_time = time.perf_counter()
        read_xarray_netcdf(target_filename)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"...Xarray Read (NetCDF) Execution time: {execution_time:.4f} seconds")
    except Exception as e:
        process_exception(e)
        pass

    # WRITE - HDF5
    try:
        target_filename = (
            f"{WORKING_FOLDER}{os.sep}data{os.sep}{DATA_VERSION_RELEASE}.h5"
        )
        nuke_file(target_filename)
        start_time = time.perf_counter()
        write_xarray_hdf5(target_filename, ds_xr)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"...Xarray Write (HDF5) Execution time: {execution_time:.4f} seconds")

        start_time = time.perf_counter()
        read_xarray_hdf5(target_filename)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"...Xarray Read (HDF5) Execution time: {execution_time:.4f} seconds")
    except Exception as e:
        process_exception(e)
        pass

    # WRITE - ZARR
    try:
        target_filename = f"{WORKING_FOLDER}{os.sep}data{os.sep}{DATA_VERSION_RELEASE}.{OUTPUT_ZARR_EXT}"
        print(f"delete {target_filename}")
        nuke_file(target_filename)
        start_time = time.perf_counter()
        write_xarray_zarr(target_filename, ds_xr)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"...Xarray Write (Zarr) Execution time: {execution_time:.4f} seconds")

        start_time = time.perf_counter()
        read_xarray_zarr(target_filename)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"...Xarray Read (Zarr) Execution time: {execution_time:.4f} seconds")
    except Exception as e:
        process_exception(e)
        pass

    # WRITE - PICKLE
    try:
        target_filename = f"{WORKING_FOLDER}{os.sep}data{os.sep}{DATA_VERSION_RELEASE}{OUTPUT_PANDAS_EXT}2"
        nuke_file(target_filename)
        start_time = time.perf_counter()
        write_xarray_pickle(target_filename, ds_xr)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"...Xarray Write (Pickle) Execution time: {execution_time:.4f} seconds")

        start_time = time.perf_counter()
        read_xarray_pickle(target_filename)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"...Xarray Read (Pickle) Execution time: {execution_time:.4f} seconds")
    except Exception as e:
        process_exception(e)
        pass

    del ds_xr
    clean_house()

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

In [45]:
def write_xarray_netcdf(target_xarray_filename: str, ds_xr: xr.Dataset) -> None:
    ds_xr.to_netcdf(target_xarray_filename)


def write_xarray_hdf5(target_xarray_filename: str, ds_xr: xr.Dataset) -> None:
    ds_xr.to_netcdf(target_xarray_filename)


def write_xarray_zarr(target_xarray_filename: str, ds_xr: xr.Dataset) -> None:
    ds_xr.to_zarr(target_xarray_filename)


def write_xarray_pickle(target_xarray_filename: str, ds_xr: xr.Dataset) -> None:
    pkl = pickle.dumps(ds_xr, protocol=-1)
    with open(target_xarray_filename, "wb") as file:
        # Use pickle.dump() to serialize and write the data to the file
        pickle.dump(pkl, target_xarray_filename)

In [46]:
def read_xarray_netcdf(target_xarray_filename: str) -> None:
    ds_xr_loaded = xr.open_dataset(target_xarray_filename, engine="netcdf4")


def read_xarray_hdf5(target_xarray_filename: str) -> None:
    ds_xr_loaded = xr.open_dataset(target_xarray_filename, engine="netcdf4")


def read_xarray_zarr(target_xarray_filename: str) -> None:
    ds_xr_loaded = xr.open_zarr(target_xarray_filename)


def read_xarray_pickle(target_xarray_filename: str) -> None:
    try:
        with open(target_xarray_filename, "rb") as file:
            ds_xr = pickle.load(file, protocol=-1)
    except (FileNotFoundError, Exception) as e:
        process_exception(e)

#### Apache Parquet

In [47]:
def process_parquet(inc_payload: {}) -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    # build the xarray dataset from scratch
    df = build_pandas(inc_payload)

    # WRITE - NetCDF
    try:
        target_filename = f"{WORKING_FOLDER}{os.sep}data{os.sep}{DATA_VERSION_RELEASE}.{OUTPUT_PARQUET_EXT}"
        nuke_file(target_filename)
        start_time = time.perf_counter()
        write_parquet(target_filename, df)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"...Parquet Write Execution time: {execution_time:.4f} seconds")

        # READ
        start_time = time.perf_counter()
        read_parquet(target_filename)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"...Parquet Read Execution time: {execution_time:.4f} seconds")
    except Exception as e:
        process_exception(e)
        pass

    del df
    clean_house()

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

In [57]:
@profile
def write_parquet(target_parquet_filename: str, df: pd.DataFrame) -> None:
    df.to_parquet(target_parquet_filename, compression="gzip")

In [58]:
@profile
def read_parquet(target_parquet_filename: str) -> None:
    df_parquet = pd.read_parquet(target_parquet_filename)

#### Zarr

In [50]:
def process_zarr(the_netcdf) -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

    # Zarr
    # Save xarray Dataset to Zarr
    # ds_zarr=ds_xr.to_zarr('data.zarr')
    import zarr
    import cupy as cp

    zarr.config.enable_gpu()
    # ds_xr = xr.open_dataset(target_filename)
    geospatial_lat_nm = LAT_SNAME
    geospatial_lon_nm = LONG_SNAME
    product_nm = PRODUCT_LNAME
    lat = np.array(the_netcdf.variables[geospatial_lat_nm][:][:], dtype=np.double)
    lon = np.array(the_netcdf.variables[geospatial_lon_nm][:][:], dtype=np.double)
    varAry = np.array(the_netcdf.variables[product_nm][:][:], dtype=np.double)
    lat = lat.flatten()
    lon = lon.flatten()
    varAry = varAry[0, 0, :, :].flatten()

    # root = zarr.open_group('data/group.zarr', mode='w')
    # store = zarr.storage.MemoryStore()
    # store= zarr.storage.LocalStore(target_zarr_filename, read_only=False)
    # store = zarr.storage.ZipStore(target_zarr_filename, mode='w')
    # root = zarr.create_group(store=store)

    root = zarr.group()
    z_lat_grp = root.create_group(LAT_LNAME)
    z_lat_ary = z_lat_grp.create_array(
        name=LAT_LNAME, shape=lat.shape, chunks="auto", dtype=np.float32
    )
    z_lat_ary[:] = lat

    z_lon_grp = root.create_group(LONG_LNAME)
    z_lon_ary = z_lon_grp.create_array(
        name=LONG_LNAME, shape=lon.shape, chunks="auto", dtype=np.float32
    )
    z_lon_ary[:] = lon

    z_product_grp = root.create_group(PRODUCT_LNAME)
    z_product_ary = z_product_grp.create_array(
        name=PRODUCT_LNAME, shape=varAry.shape, chunks="auto", dtype=np.float32
    )
    z_product_ary[:] = varAry

    print(f"{BOLD_START}Zarr Node Tree:{BOLD_END}")
    print(root.tree())

    # WRITE
    try:
        target_filename = f"{WORKING_FOLDER}{os.sep}data{os.sep}{DATA_VERSION_RELEASE}.{OUTPUT_ZARR_EXT}"
        nuke_file(target_filename)
        start_time = time.perf_counter()
        write_zarr(target_filename, root)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"...Zarr Write Execution time: {execution_time:.4f} seconds")

        # READ
        start_time = time.perf_counter()
        read_zarr(target_filename)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"...Zarr Read Execution time: {execution_time:.4f} seconds")
    except Exception as e:
        process_exception(e)
        pass

    # store.close()
    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")

In [51]:
def write_zarr(target_zarr_filename: str, root: zarr.Group) -> None:
    zarr.save(target_zarr_filename, root)

In [52]:
def read_zarr(target_zarr_filename: str) -> None:
    # Load the group from the directory
    loaded_group = zarr.open(target_zarr_filename)

    """
    # Verify the structure and data
    assert isinstance(loaded_group, zarr.Group)
    assert 'root' in loaded_group
    assert 'bar' in loaded_group['foo']
    assert 'baz' in loaded_group['foo']
    np.testing.assert_array_equal(bar, loaded_group['foo']['bar'])
    np.testing.assert_array_equal(baz, loaded_group['foo']['baz'])
    """

#### Dask

#### Gather Variables

In [53]:
def gather_variables(inc_product_name:str, the_netcdf) -> {}:

        rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

        geospatial_lat_nm=LAT_SNAME
        geospatial_lon_nm=LONG_SNAME
        product_nm=inc_product_name
        local_payload={}
        #note x,y values are shown below as they are part of the APS meta-data
        #based on the NetCDF Best Practice subject x,y vars should not exist.
        #keeping for continuity between BIOCAST code already written to read APS input.

        if DEBUG_USING_GPU==1:
            lat   =cupy.array(the_netcdf.variables[geospatial_lat_nm][:][:],dtype=cupy.double)
            lon   =cupy.array(the_netcdf.variables[geospatial_lon_nm][:][:],dtype=cupy.double)
            varAry=cupy.array(the_netcdf.variables[product_nm][:][:],dtype=cupy.double)
        else:
            lat   =np.array(the_netcdf.variables[geospatial_lat_nm][:][:],dtype=np.double)
            lon   =np.array(the_netcdf.variables[geospatial_lon_nm][:][:],dtype=np.double)
            varAry=np.array(the_netcdf.variables[product_nm][:][:],dtype=np.double)

        print(f"...{BOLD_START}{geospatial_lat_nm:10}{" data type:":20}{BOLD_END}{str(type(lat)):20}")
        print(f".......shape:{lat.shape}")
        print(f"....datatype:{lat.dtype}")


        print(f"...{BOLD_START}{geospatial_lon_nm:10}{" data type:":20}{BOLD_END}{str(type(lon)):20}")
        print(f".......shape:{lon.shape}")
        print(f"....datatype:{lon.dtype}")

        print(f"...{BOLD_START}{"Oceanographic data type":20}({product_nm:20}){BOLD_END}{str(type(varAry)):20}")
        print(f".......shape:{varAry.shape}")
        print(f"....datatype:{varAry.dtype}")

        local_payload[LAT_LNAME]=lat
        local_payload[LONG_LNAME]=lon
        local_payload[PRODUCT_LNAME]=varAry

        rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
        return local_payload

## Process

In [55]:
## Main routine that executes all code, does return a data frame of data for further analysis if desired.
#
#  @param (None)
def process(inc_input_directory: str) -> {}:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")

    # variables
    source_nc_list = []
    source_filenames_list = []

    # setup storage solution
    create_storage_locations(inc_input_directory)

    # download the data
    download_test()

    # identify target files
    print("...marshaling data files:")
    target_directory = f"{inc_input_directory}{os.sep}data"
    if os.path.isdir(target_directory):
        for file in os.listdir(target_directory):
            print(f"......processing {file} from {target_directory}")
            filename, file_extension = os.path.splitext(file)
            if file_extension.lower() in LOWER_EXTENSIONS:
                source_filenames_list.append(os.path.join(target_directory, file))
    else:
        print(
            "Target directory ({target_directory}) does not exist, cannot continue execution.  Check your paths."
        )
        raise SystemError

    print(source_filenames_list)

    # iterate through netCDFs and read them into an array
    source_nc_list = read_netcdfs(source_filenames_list)

    for idx, the_netcdf in enumerate(source_nc_list):
        the_payload = gather_variables(PRODUCT_LNAME, the_netcdf)
        # process_numpy(the_payload)
        # process_pandas(the_payload)
        # process_pytorch(the_payload)
        # process_tensorflow(the_payload)
        #process_xarray(source_filenames_list[idx])
        #process_zarr(the_netcdf)
        process_parquet(the_payload)

        break

    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")

# Main Routine (call all other routines)

In [59]:
if __name__ == "__main__":

    # note that this design now deviates from previous methods.
    # Implementation will assume a single execution of a single PIID folder, scanning results and
    # appending metrics to a single ASCII file as the code proceeds thus ensuring multi-processor, *nix driven execution.

    start_t = perf_counter()
    print("BEGIN PROGRAM")

    ############################################
    # CONSTANTS
    ############################################

    # Semantic Versioning
    VERSION_NAME = "MLDATAREADY"
    VERSION_MAJOR = 0
    VERSION_MINOR = 0
    VERSION_RELEASE = 1

    DATA_VERSION_RELEASE = "-".join(
        [
            str(VERSION_NAME),
            str(VERSION_MAJOR),
            str(VERSION_MINOR),
            str(VERSION_RELEASE),
        ]
    )

    # OUTPUT EXTENSIONS
    OUTPUT_PANDAS_EXT = "pkl"
    OUTPUT_NUMPY_EXT = "npy"
    OUTPUT_TORCH_EXT = "pt"
    OUTPUT_XARRAY_EXT = "xr"
    OUTPUT_ZARR_EXT = "zarr"
    OUTPUT_PARQUET_EXT = "parquet"
    OUTPUT_TENSORFLOW_EXT = "tf"
    OUTPUT_PYSTAC_EXT = "psc"
    OUTPUT_DASK_EXT = "dask"
    # location of our working files
    # WORKING_FOLDER="/content/folderOnColab"
    WORKING_FOLDER = "./folderOnColab"
    input_directory = "./folderOnColab"
    output_directory = "./folderOnColab"

    # Notebook Author details
    AUTHOR_NAME = "Christopher G Wood"
    GITHUB_USERNAME = "christophergarthwood"
    AUTHOR_EMAIL = "christopher.g.wood@gmail.com"

    # GEOSPATIAL NAMES
    LAT_LNAME = "latitude"
    LAT_SNAME = "lat"
    LONG_LNAME = "longitude"
    LONG_SNAME = "lon"
    PRODUCT_LNAME = "chlor_a"
    PRODUCT_SNAME = "chlor_a"

    # PRODUCT_LNAME="salinity"
    # PRODUCT_SNAME="salinity"

    # Encoding
    ENCODING = "utf-8"
    os.environ["PYTHONIOENCODING"] = ENCODING

    BOLD_START = "\033[1m"
    BOLD_END = "\033[0;0m"
    TEXT_WIDTH = 77

    # You can also adjust the verbosity by changing the value of TF_CPP_MIN_LOG_LEVEL:
    #
    # 0 = all messages are logged (default behavior)
    # 1 = INFO messages are not printed
    # 2 = INFO and WARNING messages are not printed
    # 3 = INFO, WARNING, and ERROR messages are not printed
    TF_CPP_MIN_LOG_LEVEL_SETTING = 0

    # Set the Seed for the experiment (ask me why?)
    # seed the pseudorandom number generator
    # THIS IS ESSENTIAL FOR CONSISTENT MODEL OUTPUT, remember these are random in nature.
    # SEED_INIT = 7
    # random.seed(SEED_INIT)
    # tf.random.set_seed(SEED_INIT)
    # np.random.seed(SEED_INIT)

    DEBUG_STACKTRACE = 0
    DEBUG_USING_GPU = 0   #no gpu utilization on 0, 1 is gpu utilization
    NUM_PROCESSORS = 10
    ITERATIONS = 20

    # make comparisons lower case and include wild card character at the end of each to catch anomalous file extensions like xlsx, etc.
    EXTENSIONS = [".nc"]
    LOWER_EXTENSIONS = [x.lower() for x in EXTENSIONS]

    THE_DEVICE_NAME = "/job:localhost/replica:0/task:0/device:CPU:0"
    if DEBUG_USING_GPU == 1:
        THE_DEVICE_NAME = "/job:localhost/replica:0/task:0/device:GPU:0"

    warnings.filterwarnings("ignore", category=DeprecationWarning)
    warnings.filterwarnings("ignore", category=FutureWarning)
    warnings.filterwarnings("ignore", category=UserWarning)

    # GPU Setup (for multiple GPU devices)
    device = torch.cuda.current_device()

    # softare watermark
    lib_diagnostics()

    # hardware specs
    get_hardware_stats()

    # - Core workhorse routine
    process(input_directory)
    # - Save the results
    # save_output(docs, output_directory, "policy")

    end_t = perf_counter()
    print("END PROGRAM")
    print(f"Elapsed time: {end_t - start_t}")

BEGIN PROGRAM
Author: Christopher G Wood

Github username: GITHUB_USERNAME

Email: christopher.g.wood@gmail.com

Last updated: 2025-03-14T10:24:28.266841-05:00

Python implementation: CPython
Python version       : 3.12.9
IPython version      : 8.30.0

Compiler    : GCC 13.3.0
OS          : Linux
Release     : 5.15.167.4-microsoft-standard-WSL2
Machine     : x86_64
Processor   : x86_64
CPU cores   : 16
Architecture: 64bit

Hostname: ThulsaDoom

Git hash: e51e8809604eeafbcb18e553442050e39802904d

Git repo: git@github.com:christophergarthwood/jbooks.git

Git branch: Updates

[1mPackages:[0;0m

dask                                    #: 2024.12.1           
fastparquet                             #: 2024.11.0           
keras                                   #: 3.9.0               
numpy                                   #: 1.26.4              
pandas                                  #: 2.2.3               
polars                                  #: 1.24.0              
pystac         

[1mList Devices[0;0m #########################################


I0000 00:00:1741965868.296515    3500 gpu_device.cc:2022] Created device /device:GPU:0 with 4056 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5



[1mDevices Counts[0;0m ########################################



[1mOptional Enablement[0;0m ####################################





......folder named (./folderOnColab/downloads) [1malready exists[0;0m, we won't try to create a new folder.


......folder named (./folderOnColab/data) [1malready exists[0;0m, we won't try to create a new folder.


...marshaling data files:
......processing MLDATAREADY-0-0-1.zarr from ./folderOnColab/data
......processing MLDATAREADY-0-0-1.parquet from ./folderOnColab/data
......processing MLDATAREADY-0-0-1.h5 from ./folderOnColab/data
......processing MissBight_2020010900.nc from ./folderOnColab/data
......processing local_test.npy.npz from ./folderOnColab/data
......processing MLDATAREADY-0-0-1.xr from ./folderOnColab/data
......processing MLDATAREADY-0-0-1pkl2 from ./folderOnColab/data
['./folderOnColab/data/MissBight_2020010900.nc']


...[1mlat        data type:         [0;0m<class 'numpy.ndarray'>
.......shape:(400,)
....datatype:float64
...[1mlon        data type:         [0;0m<class 'numpy.ndarray'>
.......shape:(800,)
....datatype:float64
...[1mOceanographic data type(chlor_a             )[0;0m<class 'numpy.ndarray'>
.......shape:(1, 29, 400, 800)
....datatype:float64



                Id---------------------------------------------
                                      Id: write_parquet
                Runtime----------------------------------------
                                 Runtime: 0.13 milliseconds

                I/O Size---------------------------------------
                               File Size: 0.00 bytes

                I/O Counts-------------------------------------
                      Targeted disk read: 0.00 counts
                     Targeted disk write: 0.00 counts
                       General disk read: 0.00 counts
                      General disk write: 0.00 counts

                I/O Time---------------------------------------
                 Targeted disk read time: 0.00 milliseconds
                Targeted disk write time: 0.00 milliseconds
                  General disk read time: 0.00 milliseconds
                 General disk write time: 2.00 milliseconds

                Memory----------------------------

END PROGRAM
Elapsed time: 0.9928143040015129
