# Artificial Intelligence
## AI Ready Data - 006
### Download datasets

<center>
<table align="center">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/christophergarthwood/jbooks/blob/main/STEM-006_AIReadyData-Download-001.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/notebooks?referrer=search&hl=en&project-test=ai-bootcamp">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Link to Colab Enterprise
    </a>
  </td>   
  <td style="text-align: center">
    <a href="https://github.com/christophergarthwood/jbooks/blob/main/STEM-006_AIReadyData-Download-001.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/instances?referrer=search&hl=en&project=ai-bootcamp">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Link to Vertex AI Workbench
    </a>
  </td>
</table>
</center>
</br></br></br>

| | |
|-|-|
|Author(s) | [Christopher G Wood](https://github.com/christophergarthwood)  |

# Overview

Using several data sources, we will download data in a variety of formats and explore what it means for data to be "AI Ready".

## What is "**AI Ready**" Data?


## References:

#### Data Formats
+ [List of ML File Formats](https://github.com/trailofbits/ml-file-formats)
+ [ML Guide to Data Formats](https://www.hopsworks.ai/post/guide-to-file-formats-for-machine-learning)
+ [Why are ML Data Structures Different?](https://stackoverflow.blog/2023/01/04/getting-your-data-in-shape-for-machine-learning/)

#### FAIR
+ [FAIR and AI-Ready](https://repository.niddk.nih.gov/public/NIDDKCR_Office_Hours_AI-Readiness_and_Preparing_AI-Ready_+Datasets_12_2023.pdf)
+ [AI-Ready-Data](https://www.rishabhsoft.com/blog/ai-ready-data)
+ [AI-Ready FAIR Data](https://medium.com/@sean_hill/ai-ready-fair-data-accelerating-science-through-responsible-ai-and-data-stewardship-3b4f21c804fd)
+ [AI-Ready Data ... Quality](https://www.elucidata.io/blog/building-ai-ready-data-why-quality-matters-more-than-quantity)
+ [AI-Ready Data Explained](https://acodis.io/hubfs/pdfs/AI-ready%20data%20Explained%20Whitepaper%20(1).pdf)

+ [GCP with BigQuery DataFrames](https://cloud.google.com/blog/products/data-analytics/building-aiml-apps-in-python-with-bigquery-dataframes)

#### Format Libraries / Standards
+ [Earth Science Information Partners (ESIP)](https://www.esipfed.org/checklist-ai-ready-data/)
+ [Zarr - Storage of N-dimensional arrays (tensors)](https://zarr.dev/#description)
  + [Zarr explained](https://aijobs.net/insights/zarr-explained/)
+ [Apache Parquet](https://parquet.apache.org/)
  + [All about Parquet](https://medium.com/data-engineering-with-dremio/all-about-parquet-part-01-an-introduction-b62a5bcf70f8)
+ [PySTAC - SpatioTemporal Asset Catalogs](https://pystac.readthedocs.io/en/stable/)
  + [John Hogland's Spatial Modeling Tutorials](https://github.com/jshogland/SpatialModelingTutorials/blob/main/README.md)
 

# ***TODO***:
+ 1. Save profile data for later analysis.
+ 2. Perform complex operation on data with particular tool for metrics.


In [1]:
# Let's define some variables (information holders) for our project overall

global PROJECT_ID, BUCKET_NAME, LOCATION
BUCKET_NAME = "ai-bootcamp-training-vertex-colab"
PROJECT_ID = "ai-bootcamp"
LOCATION = "us-central1"

BOLD_START = "\033[1m"
BOLD_END = "\033[0m"

In [2]:
# Now create a means of enforcing project id selection

import ipywidgets as widgets
from IPython.display import display


def wait_for_button_press():

    button_pressed = False

    # Create widgets
    html_widget = widgets.HTML(
        value="""
        <center><table><tr><td><h1 style="font-family: Roboto;font-size: 24px"><b>&#128721; &#9888;&#65039; WARNING &#9888;&#65039;	&#128721; </b></h1></td></tr></table></center></br></br>

        <table><tr><td>
            <span style="font-family: Tahoma;font-size: 18px">
              This notebook was designed to work in Jupyter Notebook or Google Colab with the understanding that certain permissions might be enabled.</br>
              Please verify that you are in the appropriate project and that the:</br>
              <center><code><b>PROJECT_ID</b></code> </br></center>
              aligns with the Project Id in the upper left corner of this browser and that the location:
              <center><code><b>LOCATION</b></code> </br></center>
              aligns with the instructions provided.
            </span>
          </td></tr></table></br></br>

    """
    )

    project_list = [
        "ai-bootcamp",
        "usfs-ai-bootcamp",
        "usfa-ai-advanced-training",
        "I will setup my own",
    ]
    dropdown = widgets.Dropdown(
        options=project_list,
        value=project_list[0],
        description="Set Your Project:",
    )

    html_widget2 = widgets.HTML(
        value="""
        <center><table><tr><td><h1 style="font-family: Roboto;font-size: 24px"><b>&#128721; &#9888;&#65039; WARNING &#9888;&#65039;	&#128721; </b></h1></td></tr></table></center></br></br>
          """
    )

    button = widgets.Button(description="Accept")

    # Function to handle the selection change
    def on_change(change):
        global PROJECT_ID
        if change["type"] == "change" and change["name"] == "value":
            # print("Selected option:", change['new'])
            PROJECT_ID = change["new"]

    # Observe the dropdown for changes
    dropdown.observe(on_change)

    def on_button_click(b):
        nonlocal button_pressed
        global PROJECT_ID
        button_pressed = True
        # button.disabled = True
        button.close()  # Remove the button from display
        with output:
            # print(f"Button pressed...continuing")
            # print(f"Selected option: {dropdown.value}")
            PROJECT_ID = dropdown.value

    button.on_click(on_button_click)
    output = widgets.Output()

    # Create centered layout
    centered_layout = widgets.VBox(
        [
            html_widget,
            widgets.HBox([dropdown, button]),
            html_widget2,
        ],
        layout=widgets.Layout(
            display="flex", flex_flow="column", align_items="center", width="100%"
        ),
    )
    # Display the layout
    display(centered_layout)


wait_for_button_press()

VBox(children=(HTML(value='\n        <center><table><tr><td><h1 style="font-family: Roboto;font-size: 24px"><b…

## Environment Check

In [3]:
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# - Google Colab Check
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
import datetime

RunningInCOLAB = False
RunningInCOLAB = "google.colab" in str(get_ipython())
current_time = datetime.datetime.now()

if RunningInCOLAB:
    print(
        f"You are running this notebook in Google Colab at {current_time} in the {BOLD_START}{PROJECT_ID}{BOLD_END} lab."
    )
else:
    print(
        f"You are likely running this notebook with Jupyter iPython runtime at {current_time} in the {PROJECT_ID} lab."
    )

You are likely running this notebook with Jupyter iPython runtime at 2025-03-24 14:34:06.380041 in the ai-bootcamp lab.
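
The check above relies on `get_ipython()`, which only exists inside an IPython kernel. As a hedged alternative for code that may also run as a plain script, the sketch below tests whether the `google.colab` package can be found. The helper name `running_in_colab` is hypothetical, and an importable package is only a strong hint, not proof, that the runtime is Colab.

```python
import importlib.util
import sys


def running_in_colab() -> bool:
    """Best-effort check: True when the google.colab package is loaded or importable."""
    if "google.colab" in sys.modules:
        return True
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # ImportError: the parent package `google` is not installed at all.
        # ValueError: a module entry exists but carries no usable spec.
        return False


print(f"Running in Colab: {running_in_colab()}")
```

Because the function never touches `get_ipython()`, it can be imported from a module or a test without raising a `NameError`.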


## Library Management
### Load Libraries necessary for this operation via pip install

In [4]:
# Import key libraries necessary to support dynamic installation of additional libraries
import sys

# Use subprocess to run operating system commands from Python. The notebook "bang" (!) syntax
# also works, but it does not carry over to a plain Python script, so subprocess is the more
# portable approach.
import subprocess
import importlib.util

In [5]:
# Identify the libraries you'd like to add to this Runtime environment.
# The install check adds time to every run but is critical for the initial run.
libraries = [
    "backoff",
    "python-dotenv",
    "seaborn",
    "piexif",
    "unidecode",
    "icecream",
    "watermark",
    "watermark[GPU]",
    "rich",
    "rich[jupyter]",
    "numpy",
    "pydot",
    "polars[all]",
    "dask[complete]",
    "xarray",
    "pandas",
    "pystac",
    "pystac[jinja2]",
    "pystac[orjson]",
    "pystac[validation]",
    "fastparquet",
    "zarr",
    "gdown",
    "wget",
]

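# Loop through each library and test for existence, if not present install quietly.
# Note: find_spec() expects an importable module name, so entries whose pip name differs from
# the module name (python-dotenv) or that carry extras (polars[all]) are always handed to pip,
# which returns quickly when the requirement is already satisfied.
for library in libraries:
    if library == "Pillow":
        spec = importlib.util.find_spec("PIL")
    else:
        spec = importlib.util.find_spec(library)
    if spec is None:
        print("Installing library " + library)
        # Call pip through the running interpreter so packages land in this kernel's environment
        subprocess.run(
            [sys.executable, "-m", "pip", "install", library, "--quiet"], check=True
        )
    else:
        print("Library " + library + " already installed.")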

# Specialized install for GPU enabled capability with CUDF
# pip install --extra-index-url=https://pypi.nvidia.com "cudf-cu12==25.2.*" "dask-cudf-cu12==25.2.*" "cuml-cu12==25.2.*" "cugraph-cu12==25.2.*" "nx-cugraph-cu12==25.2.*" "cuspatial-cu12==25.2.*"     "cuproj-cu12==25.2.*" "cuxfilter-cu12==25.2.*" "cucim-cu12==25.2.*"
try:
    # Map each pip distribution name to the module it provides, because find_spec()
    # needs the importable module name rather than the pip name.
    gpu_libraries = {"cudf-cu12": "cudf", "dask-cudf-cu12": "dask_cudf"}
    for library, module_name in gpu_libraries.items():
        spec = importlib.util.find_spec(module_name)
        if spec is None:
            subprocess.run(
                [
                    sys.executable,
                    "-m",
                    "pip",
                    "install",
                    "--extra-index-url=https://pypi.nvidia.com",
                    library,
                    "--quiet",
                ],
                check=True,
            )
        else:
            print("Library " + library + " already installed.")

except (subprocess.CalledProcessError, RuntimeError, Exception) as e:
    print(repr(e))

Library backoff already installed.
Installing library python-dotenv
Library seaborn already installed.
Library piexif already installed.
Library unidecode already installed.
Library icecream already installed.
Library watermark already installed.
Installing library watermark[GPU]
Library rich already installed.
Installing library rich[jupyter]
Library numpy already installed.
Library pydot already installed.
Installing library polars[all]
Installing library dask[complete]
Library xarray already installed.
Library pandas already installed.
Library pystac already installed.
Installing library pystac[jinja2]
Installing library pystac[orjson]
Installing library pystac[validation]
Library fastparquet already installed.
Library zarr already installed.
Library gdown already installed.
Library wget already installed.
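
The output above reports several libraries as needing installation on every run (for example `python-dotenv` and `polars[all]`) because `importlib.util.find_spec` looks for module names, not pip distribution names. A minimal sketch of an alternative check follows, assuming the standard library `importlib.metadata` is acceptable for this purpose. The helper name `is_installed` and the placeholder package name are hypothetical, and the check does not verify that requested extras are present.

```python
import re
from importlib import metadata


def is_installed(requirement: str) -> bool:
    """Check whether the distribution named in a pip requirement string is installed."""
    # Keep only the distribution name: "polars[all]" -> "polars", "pkg==1.0" -> "pkg"
    name = re.split(r"[\[<>=!~; ]", requirement.strip(), maxsplit=1)[0]
    try:
        metadata.version(name)
        return True
    except metadata.PackageNotFoundError:
        return False


# The last entry is a deliberately made-up name to show the "missing" branch.
for requirement in ["python-dotenv", "polars[all]", "surely-not-a-real-package-xyz-12345"]:
    status = "installed" if is_installed(requirement) else "missing"
    print(f"{requirement:<40} {status}")
```

Distribution metadata is keyed by the pip name, so `python-dotenv` is found even though its module is imported as `dotenv`.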


### Library Import

In [6]:
# - Import additional libraries that add value to the project

# - Set of libraries that perhaps should always be in Python source
import backoff
import datetime
from datetime import date, timedelta
from dotenv import load_dotenv
import gc
import getopt
import glob
import inspect
import io
import itertools
import json
import math
import os
from pathlib import Path
import pickle
import platform
import random
import re
import shutil
import string
from io import StringIO
import subprocess
import socket
import sys
import textwrap
import tqdm
import traceback
import warnings
import time
import uuid

#- Datastructures
from dataclasses import dataclass, field

#- Profiling
from time import perf_counter
import tracemalloc
import psutil
import cProfile
import pstats
from pstats import SortKey

#- Text formatting
from rich import print as rprint
from rich.console import Console
from rich.traceback import install
from tabulate import tabulate
import locale

# - Displays system info
from watermark import watermark as the_watermark
from py3nvml import py3nvml

# - Additional libraries for this work
from base64 import b64decode
from IPython.display import Image, Markdown
import pandas, IPython.display as display, io, jinja2, base64
from IPython.display import clear_output  # used to support real-time plotting
import requests
import unidecode
import pydot
import wget

# - Data Science Libraries
import pandas as pd
import numpy as np
import polars as pl
import dask as da
import dask.dataframe as dd
import dask.bag as db
import xarray as xr
import cupy_xarray  # never actually invoked in source itself use ds=ds.cupy.as_cupy()
import pystac as pys
import pystac
from pystac.utils import datetime_to_str

# from stacframes import df_from
import fastparquet as fq
import zarr
from zarr import Group
import netCDF4 as nc
from netCDF4 import Dataset

try:
    import cudf
except Exception as e:
    pass

try:
    import cupy
except Exception as e:
    pass

# Tensorflow and related AI libraries
import tensorflow as tf
from tensorflow import data as tf_data

# Torch
import torch

# - Graphics
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.cbook import get_sample_data
from matplotlib.offsetbox import AnnotationBbox, DrawingArea, OffsetImage, TextArea
from matplotlib.pyplot import imshow
from matplotlib.patches import Circle
from PIL import Image as PIL_Image
import PIL.ImageOps
import matplotlib.image as mpimg
from imageio import imread
import seaborn as sns

from mpl_toolkits.basemap import Basemap
from pylab import *

# - Image meta-data for Section 508 compliance
import piexif
from piexif.helper import UserComment

# - Progress bar
from tqdm.notebook import trange, tqdm


--------------------------------------------------------------------------------

  CuPy may not function correctly because multiple CuPy packages are installed
  in your environment:

    cupy, cupy-cuda12x

  Follow these steps to resolve this issue:

    1. For all packages listed above, run the following command to remove all
       existing CuPy installations:

         $ pip uninstall <package_name>

      If you previously installed CuPy via conda, also run the following:

         $ conda uninstall cupy

    2. Install the appropriate CuPy package.
       Refer to the Installation Guide for detailed instructions.

         https://docs.cupy.dev/en/stable/install.html

--------------------------------------------------------------------------------

  import cupy
2025-03-24 14:34:26.960989: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders.

## Function Declaration

#### Lib Diagnostics

In [7]:
def lib_diagnostics() -> None:

    import pkg_resources

    package_name_length = 20
    package_version_length = 10

    # Show notebook details
    #%watermark?
    #%watermark --github_username christophergwood --email christopher.g.wood@gmail.com --date --time --iso8601 --updated --python --conda --hostname --machine --githash --gitrepo --gitbranch --iversions --gpu
    # Watermark
    print(
        the_watermark(
            author=f"{AUTHOR_NAME}",
            github_username=f"{GITHUB_USERNAME}",
            email=f"{AUTHOR_EMAIL}",
            iso8601=True,
            datename=True,
            current_time=True,
            python=True,
            updated=True,
            hostname=True,
            machine=True,
            gitrepo=True,
            gitbranch=True,
            githash=True,
        )
    )

    print(f"{BOLD_START}Packages:{BOLD_END}")
    print("")
    # Get installed packages
    the_packages = [
        "nltk",
        "numpy",
        "os",
        "pandas",
        "keras",
        "seaborn",
        "fastparquet",
        "zarr",
        "dask",
        "pystac",
        "polars",
        "xarray",
    ]
    # Functions are like LEGO bricks that each do one thing; this one reports the versions of the libraries used in this effort.

    installed = {pkg.key: pkg.version for pkg in pkg_resources.working_set}
    for package_idx, package_name in enumerate(installed):
        if package_name in the_packages:
            installed_version = installed[package_name]
            print(
                f"{package_name:<40}#: {str(pkg_resources.parse_version(installed_version)):<20}"
            )

    try:
        print(f"{'TensorFlow version':<40}#: {str(tf.__version__):<20}")
        print(
            f"{'     gpu.count:':<40}#: {str(len(tf.config.experimental.list_physical_devices('GPU')))}"
        )
        print(
            f"{'     cpu.count:':<40}#: {str(len(tf.config.experimental.list_physical_devices('CPU')))}"
        )
    except Exception as e:
        pass

    try:
        print(f"{'Torch version':<40}#: {str(torch.__version__):<20}")
        if torch.cuda.is_available():
            device = torch.device("cuda")
            print(f"{'     GPUs available?':<40}#: {torch.cuda.is_available()}")
            print(f"{'     count':<40}#: {torch.cuda.device_count()}")
            print(f"{'     current':<40}#: {torch.cuda.get_device_name(0)}")
        else:
            device = torch.device("cpu")
            print("No GPU available, using CPU.")
    except Exception as e:
        pass

    try:
        print(f"{'OpenAI Azure Version':<40}#: {str(the_openai_version):<20}")
    except Exception as e:
        pass

    return
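
`pkg_resources` is deprecated in recent setuptools releases, so the version lookup in `lib_diagnostics` may eventually need a replacement. The sketch below shows one possible approach using the standard library `importlib.metadata`. The function name `report_versions` is hypothetical, and the column widths simply mirror the format strings used above.

```python
from importlib import metadata


def report_versions(package_names: list[str]) -> dict[str, str]:
    """Return the installed version of each package, or a marker when it is absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


for name, version in report_versions(["numpy", "pandas", "zarr"]).items():
    print(f"{name:<40}#: {version:<20}")
```

Unlike iterating over every installed distribution, this looks up only the packages of interest and reports the ones that are missing.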

#### Check your resources from a CPU/GPU perspective

In [8]:
def get_hardware_stats() -> None:
    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    print(
        f"{BOLD_START}List Devices{BOLD_END} #########################################"
    )
    try:
        from tensorflow.python.client import device_lib

        rprint(device_lib.list_local_devices())
        print("")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        rprint(str(repr(e)))

    print(
        f"{BOLD_START}Devices Counts{BOLD_END} ########################################"
    )
    try:
        rprint(
            f"Num GPUs Available: {str(len(tf.config.experimental.list_physical_devices('GPU')))}"
        )
        rprint(
            f"Num CPUs Available: {str(len(tf.config.experimental.list_physical_devices('CPU')))}"
        )
        print("")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        rprint(str(repr(e)))

    print(
        f"{BOLD_START}Optional Enablement{BOLD_END} ####################################"
    )
    gpus = []  # default so the check below still works if device discovery fails
    try:
        gpus = tf.config.experimental.list_physical_devices("GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        rprint(str(repr(e)))

    if gpus:
        # Restrict TensorFlow to only use the first GPU
        try:
            tf.config.experimental.set_visible_devices(gpus[0], "GPU")
            logical_gpus = tf.config.experimental.list_logical_devices("GPU")
            rprint(
                str(
                    str(len(gpus))
                    + " Physical GPUs,"
                    + str(len(logical_gpus))
                    + " Logical GPU"
                )
            )
        except RuntimeError as e:
            # Visible devices must be set before GPUs have been initialized
            rprint(str(repr(e)))
        print("")
    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")

#### Section 508 Compliance Tools

In [9]:
# Routines designed to support adding ALT text to an image generated through Matplotlib.


def capture(figure):
    buffer = io.BytesIO()
    # Save explicitly as JPEG so the bytes match the MIME type declared in the data URI
    figure.savefig(buffer, format="jpeg")
    return f"data:image/jpeg;base64,{base64.b64encode(buffer.getvalue()).decode()}"


def make_accessible(figure, template, **kwargs):
    return display.Markdown(
        f"""![]({capture(figure)} "{template.render(**globals(), **kwargs)}")"""
    )


# Requires JPEG or TIFF images, since the ALT text is stored in EXIF metadata
def add_alt_text(image_path, alt_text):
    try:
        if os.path.isfile(image_path):
            img = PIL_Image.open(image_path)
            if "exif" in img.info:
                exif_dict = piexif.load(img.info["exif"])
            else:
                exif_dict = {}

            w, h = img.size
            if "0th" not in exif_dict:
                exif_dict["0th"] = {}
            exif_dict["0th"][piexif.ImageIFD.XResolution] = (w, 1)
            exif_dict["0th"][piexif.ImageIFD.YResolution] = (h, 1)

            software_version = " ".join(
                ["STEM-001 with Python v", str(sys.version).split(" ")[0]]
            )
            exif_dict["0th"][piexif.ImageIFD.Software] = software_version.encode(
                "utf-8"
            )

            if "Exif" not in exif_dict:
                exif_dict["Exif"] = {}
            exif_dict["Exif"][piexif.ExifIFD.UserComment] = UserComment.dump(
                alt_text, encoding="unicode"
            )

            exif_bytes = piexif.dump(exif_dict)
            img.save(image_path, "jpeg", exif=exif_bytes)
        else:
            rprint(
                f"Could not find {image_path} for ALT text modification, please check your paths."
            )

    except (FileExistsError, FileNotFoundError, Exception) as e:
        process_exception(e)


# Appears to solve a problem associated with GPU use on Colab, see: https://github.com/explosion/spaCy/issues/11909
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"
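
In the Markdown image syntax used by `make_accessible`, the quoted string after the URL is the image title, while the text inside the square brackets is the alternative text that screen readers announce. A minimal sketch of an HTML variant that fills the `alt` attribute directly is shown below. The function name `image_tag_with_alt` is hypothetical, and the sample bytes are a placeholder, not a real image.

```python
import base64
import html


def image_tag_with_alt(image_bytes: bytes, alt_text: str, mime_type: str = "image/png") -> str:
    """Build an <img> tag with an inline data URI and an escaped alt attribute."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    # Escape quotes and angle brackets so the description cannot break the attribute
    safe_alt = html.escape(alt_text, quote=True)
    return f'<img src="data:{mime_type};base64,{encoded}" alt="{safe_alt}">'


# Placeholder bytes stand in for the output of figure.savefig(...)
tag = image_tag_with_alt(b"\x89PNG", 'Line chart titled "Trees per plot"')
print(tag)
```

In a notebook, the returned string could be passed to `IPython.display.HTML` for rendering.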

#### Library Configuration

In [10]:
def set_library_configuration() -> None:

    ############################################
    # - JUPYTER NOTEBOOK OUTPUT CONTROL / FORMATTING
    ############################################
    # Show pandas and NumPy floating point values with 4 decimal places so output stays compact
    debug.msg_info("Setting Pandas and Numpy library options.")
    pd.set_option(
        "display.max_colwidth", 10
    )  # Use None instead of 10 to view the full JSON blob in the printed dataframe
    pd.options.display.float_format = "{:,.4f}".format
    np.set_printoptions(precision=4)

#### Custom Exception Display

In [11]:
# This function displays the stack trace for errors from one central location, which makes the error display easier to adjust.
# Functions are a good fit for highly repetitive code like this.
def process_exception(inc_exception: Exception) -> None:
    if DEBUG_STACKTRACE == 1:
        traceback.print_exc()
        console.print_exception(show_locals=True)
    else:
        rprint(repr(inc_exception))
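
`traceback.print_exc()` reports the exception currently being handled, so it only produces a useful trace when called from inside an `except` block. The sketch below shows a variant that formats the exception object it is given, which also works when the exception has been stored and is reported later. The function name `describe_exception` and its `show_stacktrace` flag are hypothetical stand-ins for `process_exception` and `DEBUG_STACKTRACE`.

```python
import traceback


def describe_exception(exc: BaseException, show_stacktrace: bool = False) -> str:
    """Return a one-line summary, or the full formatted traceback when requested."""
    if show_stacktrace:
        return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return repr(exc)


try:
    1 / 0
except ZeroDivisionError as error:
    short_form = describe_exception(error)
    long_form = describe_exception(error, show_stacktrace=True)

print(short_form)
print(long_form)
```

Returning a string instead of printing leaves the caller free to send the message to `rprint`, a log file, or both.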

#### Download the FIADB Dataset

In [12]:
# Reference: https://research.fs.usda.gov/programs/fia#data-and-tools
# Forest Inventory and Analysis Database (FIADB)


def download_fiadb() -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    # Ordered to match dataset_short_names below, because both lists are read by the same index
    dataset_long_names = [
        "ALASKA_AK",
        "ALABAMA_AL",
        "ARKANSAS_AR",
        "AMERICAN_SAMOA_AS",
        "ARIZONA_AZ",
        "CALIFORNIA_CA",
        "COLORADO_CO",
        "CONNECTICUT_CT",
        "DELAWARE_DE",
        "FLORIDA_FL",
        "GEORGIA_GA",
        "GUAM_GU",
        "HAWAII_HI",
        "IOWA_IA",
        "IDAHO_ID",
        "ILLINOIS_IL",
        "INDIANA_IN",
        "KANSAS_KS",
        "KENTUCKY_KY",
        "LOUISIANA_LA",
        "MASSACHUSETTS_MA",
        "MARYLAND_MD",
        "MAINE_ME",
        "MICHIGAN_MI",
        "MINNESOTA_MN",
        "MISSOURI_MO",
        "NORTHERN_MARIANA_ISLANDS_MP",
        "MISSISSIPPI_MS",
        "MONTANA_MT",
        "NORTH_CAROLINA_NC",
        "NORTH_DAKOTA_ND",
        "NEBRASKA_NE",
        "NEW_HAMPSHIRE_NH",
        "NEW_JERSEY_NJ",
        "NEW_MEXICO_NM",
        "NEVADA_NV",
        "NEW_YORK_NY",
        "OHIO_OH",
        "OKLAHOMA_OK",
        "OREGON_OR",
        "PENNSYLVANIA_PA",
        "PUERTO_RICO_PR",
        "PALAU_PW",
        "RHODE_ISLAND_RI",
        "SOUTH_CAROLINA_SC",
        "SOUTH_DAKOTA_SD",
        "FEDERATED_STATES_OF_MICRONES_FM",
        "TENNESSEE_TN",
        "TEXAS_TX",
        "UTAH_UT",
        "VIRGINIA_VA",
        "US_VIRGIN_ISLANDS_VI",
        "VERMONT_VT",
        "WASHINGTON_WA",
        "WISCONSIN_WI",
        "WEST_VIRGINIA_WV",
        "WYOMING_WY",
    ]
    dataset_short_names = [
        "AK",
        "AL",
        "AR",
        "AS",
        "AZ",
        "CA",
        "CO",
        "CT",
        "DE",
        "FL",
        "GA",
        "GU",
        "HI",
        "IA",
        "ID",
        "IL",
        "IN",
        "KS",
        "KY",
        "LA",
        "MA",
        "MD",
        "ME",
        "MI",
        "MN",
        "MO",
        "MP",
        "MS",
        "MT",
        "NC",
        "ND",
        "NE",
        "NH",
        "NJ",
        "NM",
        "NV",
        "NY",
        "OH",
        "OK",
        "OR",
        "PA",
        "PR",
        "PW",
        "RI",
        "SC",
        "SD",
        "SFM",
        "TN",
        "TX",
        "UT",
        "VA",
        "VI",
        "VT",
        "WA",
        "WI",
        "WV",
        "WY",
    ]
    # dataset_pattern="https://apps.fs.usda.gov/fia/datamart/CSV/MT_VEG_SUBPLOT.zip"
    dataset_pattern = "https://apps.fs.usda.gov/fia/datamart/CSV/"

    rprint("Performing `wget` on target FIA records.")
    target_folder = WORKING_FOLDER
    if os.path.isdir(target_folder):
        target_directory = f"{target_folder}{os.sep}downloads"
        for idx, filename in enumerate(dataset_short_names):
            if os.path.isdir(target_directory):
                target_filename = f"{filename}_CSV.zip"
                target_url = f"{dataset_pattern}{target_filename}"
                try:
                    rprint(
                        f"...copying {dataset_long_names[idx]} to target folder: {target_directory}"
                    )
                    subprocess.run(
                        [
                            "/usr/bin/wget",
                            "--show-progress",
                            f"--directory-prefix={target_directory}",
                            f"{target_url}",
                        ],
                        check=True,
                    )
                    rprint("......completed")
                except (subprocess.CalledProcessError, Exception) as e:
                    process_exception(e)
            else:
                rprint(
                    f"...target folder: {target_directory} isn't present for {filename} download."
                )
    else:
        rprint(
            "ERROR: Local working folder not found/created.  Check the output to ensure your folder is created."
        )
        rprint(f"...target folder: {target_folder}")
        rprint("...if you can't find the problem contact the instructor.")

    # Process the downloaded data, open it up
    rprint("Uncompressing the downloads...")
    if os.path.isdir(target_folder):
        source_directory = f"{target_folder}{os.sep}downloads"
        target_directory = f"{target_folder}{os.sep}data"
        if os.path.isdir(target_directory) and os.path.isdir(source_directory):
            for idx, filename in enumerate(dataset_short_names):
                target_filename = f"{filename}_CSV.zip"
                final_directory = f"{target_directory}{os.sep}{filename}{os.sep}"
                try:
                    if os.path.isfile(f"{source_directory}{os.sep}{target_filename}"):
                        rprint(
                            f"...unzipping {dataset_long_names[idx]} to created target folder: {final_directory}"
                        )
                        subprocess.run(["mkdir", "-p", final_directory], check=True)
                        subprocess.run(
                            [
                                "/usr/bin/unzip",
                                "-o",
                                "-qq",
                                "-d",
                                f"{final_directory}",
                                f"{source_directory}{os.sep}{target_filename}",
                            ],
                            check=True,
                        )
                        process1 = subprocess.Popen(
                            [
                                "/usr/bin/find",
                                f"{final_directory}",
                                "-type",
                                "f",
                                "-print",
                            ],
                            stdout=subprocess.PIPE,
                        )
                        process2 = subprocess.Popen(
                            ["wc", "-l"], stdin=process1.stdout, stdout=subprocess.PIPE
                        )

                        # Close the output of process1 to allow process2 to receive EOF
                        process1.stdout.close()
                        output, error = process2.communicate()
                        process2.stdout.close()
                        number_files = output.decode().strip()
                        rprint(f"......completed, {number_files} files extracted.")
                    else:
                        rprint(
                            f"......failed, unable to find ({source_directory}{os.sep}{target_filename})"
                        )
                except (subprocess.CalledProcessError, Exception) as e:
                    process_exception(e)
                # NOTE: this break stops the loop after the first archive; remove it to extract every download
                break
        else:
            rprint(
                f"...either the source directory ({source_directory}) or the target directory ({target_directory}) isn't present for extraction."
            )
    else:
        rprint(
            "ERROR: Local downloads folder not found/created.  Check the output to ensure your folder is created."
        )
        rprint(f"...target folder: {target_folder}")
        rprint("...if you can't find the problem contact the instructor.")
    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")
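
The extraction step above shells out to `mkdir`, `unzip`, `find`, and `wc`, which ties the routine to a Unix-like system with those tools at fixed paths. A minimal sketch of the same step using only the standard library follows. The function name `extract_archive`, the `XX` state code, and the CSV contents are all made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive_path: Path, destination: Path) -> int:
    """Extract a zip archive and return the number of files under the destination."""
    destination.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(destination)
    # Count regular files, the equivalent of `find <dir> -type f | wc -l`
    return sum(1 for path in destination.rglob("*") if path.is_file())


# Self-contained demonstration using a throwaway archive with invented contents
with tempfile.TemporaryDirectory() as scratch:
    scratch_path = Path(scratch)
    demo_zip = scratch_path / "XX_CSV.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("XX_PLOT.csv", "id,year\n1,2020\n")
        archive.writestr("XX_TREE.csv", "id,plot_id\n1,1\n")
    extracted_count = extract_archive(demo_zip, scratch_path / "data" / "XX")
    print(f"......completed, {extracted_count} files extracted.")
```

A corrupt download raises `zipfile.BadZipFile`, which a caller could route through the same exception handler used elsewhere in this notebook.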

#### Download NOAA GFS 0.25 Degree Data for a Range of Dates
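
Each daily file sits under a date-stamped folder of the form `gfs.YYYYMMDD`. As a hedged sketch, the date stamps can be generated from `datetime.date` objects, which removes manual zero padding and handles ranges that cross a month boundary. The names `GFS_BASE_URL`, `GFS_FILENAME`, and `gfs_urls` are hypothetical, and the end date is treated as inclusive here, which is an assumption rather than the behavior of the routine in this notebook.

```python
from datetime import date, timedelta

GFS_BASE_URL = "https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod"
GFS_FILENAME = "gfs.t00z.atmf000.nc"


def gfs_urls(start: date, end: date) -> list[str]:
    """Build one download URL per day from start to end, inclusive."""
    day_count = (end - start).days + 1
    return [
        f"{GFS_BASE_URL}/gfs.{(start + timedelta(days=offset)):%Y%m%d}/00/atmos/{GFS_FILENAME}"
        for offset in range(day_count)
    ]


for url in gfs_urls(date(2025, 3, 4), date(2025, 3, 6)):
    print(url)
```

Working with real dates also means an impossible day such as February 30 fails when the `date` is constructed, before any download is attempted.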

In [14]:
# Reference: https://polar.ncep.noaa.gov/global/data_access.shtml
# Global Forecast System (GFS), 0.25 degree resolution
def download_noaa(inc_month:int, inc_start_day: int, inc_end_day: int) -> None:

    print(f"Entering {__name__} {inspect.stack()[0][3]}")
    dataset_url = "https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs.20250304/00/atmos/gfs.t00z.atmf000.nc"
    dataset_url_pattern_begin = (
        f"https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs."
    )
    dataset_filename_pattern = "gfs.t00z.atmf000.nc"
    dataset_url_pattern_end = f"/00/atmos/{dataset_filename_pattern}"
    # Zero-pad the month to two digits, for example 3 -> "03"
    dataset_date_month = f"{int(inc_month):02d}"
    dataset_day_start=int(inc_start_day)
    dataset_day_end=int(inc_end_day)

    print("...Performing `wget` on target GFS records.")
    target_folder = WORKING_FOLDER
    target_directory = f"{target_folder}{os.sep}data"
    if os.path.isdir(target_directory):
        for idx, day in enumerate(range(dataset_day_start, dataset_day_end)):
            if day < 10:
                day = f"0{day}"
            target_date = f"2025{dataset_date_month}{day}"
            target_url = "".join(
                [dataset_url_pattern_begin, target_date, dataset_url_pattern_end]
            )

            # remove potentially partial downloads; wget writes into target_directory
            target_partial_file = (
                f"{target_directory}{os.sep}{dataset_filename_pattern}"
            )
            if os.path.isfile(target_partial_file):
                print(
                    f"...removing {dataset_filename_pattern} as it is likely a partial download."
                )
                os.remove(target_partial_file)

            exit_code=0
            stdout=""
            stderr=""
            try:
                print(f"......starting download of day:{target_date}")
                #print(f"........./usr/bin/wget --quiet --no-check-certificate --directory-prefix={target_directory} {target_url}")
                # subprocess.run(["/usr/bin/wget", "--show-progress", f"--directory-prefix={target_directory}", f"{target_url}"], check=True)
                #process = subprocess.run(["/usr/bin/wget", "--quiet", "--no-check-certificate", f"--directory-prefix={target_directory}", f"{target_url}"], check=True)
                #process = subprocess.run(["/usr/bin/wget", "--show-progress" , "--no-check-certificate", f"--directory-prefix={target_directory}", f"{target_url}"], check=True)
                #filename = wget.download(target_url, out=target_directory, bar=bar_thermometer)
                filename = wget.download(target_url, out=target_directory, )
                print("")
                print(f"...{filename} wget completed.")
                #print(process)
                #stdout = process.stdout
                #stderr = process.stderr
                #exit_code = process.check_returncode()
                #exit_code = process.returncode
            except Exception as e:
                # wget.download is a Python call, not a subprocess, so it never
                # raises CalledProcessError; catch its errors generically instead
                process_exception(e)

            # wget saves into target_directory, so verify and rename the file there
            downloaded_file = f"{target_directory}{os.sep}{dataset_filename_pattern}"
            try:
                print(f"......testing for existence of {downloaded_file}")
                if os.path.isfile(downloaded_file):
                    print(f"......completed download of day:{target_date}")
                    # prefix the date so successive days don't overwrite each other
                    target_filename = "_".join([target_date, dataset_filename_pattern])
                    target_path = f"{target_directory}{os.sep}{target_filename}"
                    os.rename(downloaded_file, target_path)
                    print(f".........renamed file to {target_path}")
                    if os.path.isfile(target_path):
                        print(f".........{BOLD_START}SUCCESS{BOLD_END}.")
                    else:
                        print(".........inspect download, there could be a problem.")
                else:
                    print(f"......didn't complete download of day:{target_date}")
                    print(f".........{BOLD_START}FAIL{BOLD_END}.")
                    print("")
                    print("")
            except Exception as e:
                process_exception(e)
    else:
        # target_date is only defined inside the loop, so it can't be reported here
        print(
            f"ERROR: Target folder, {target_directory}, isn't present for the download."
        )
        raise SystemError

    print(f"Exiting {__name__} {inspect.stack()[0][3]}")
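
`download_noaa` takes day-of-month integers, so a window that crosses a month boundary (for example, 8 days back from the 3rd) produces an empty `range()`. A minimal sketch of one alternative is shown below: derive each date stamp from a real `date` object so the year and month roll over correctly. The helper name `build_gfs_urls` and its parameters are hypothetical and not part of the notebook; the URL layout is copied from the pattern used above.

```python
from datetime import date, timedelta

# URL pieces copied from download_noaa above.
GFS_URL_PREFIX = "https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs."
GFS_FILENAME = "gfs.t00z.atmf000.nc"


def build_gfs_urls(end_day: date, days_back: int) -> list[tuple[str, str]]:
    """Return (YYYYMMDD, url) pairs for the days_back days before end_day.

    Hypothetical helper: end_day itself is excluded, mirroring range(start, end).
    """
    urls = []
    for offset in range(days_back, 0, -1):
        day = end_day - timedelta(days=offset)
        stamp = day.strftime("%Y%m%d")  # year and month come from the date itself
        urls.append((stamp, f"{GFS_URL_PREFIX}{stamp}/00/atmos/{GFS_FILENAME}"))
    return urls


# A window that starts in February and ends in March.
for stamp, url in build_gfs_urls(date(2025, 3, 3), 3):
    print(stamp, url)
```

Each returned URL could then be passed to `wget.download` exactly as in the loop above; only the way the date string is produced changes.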

#### Download Single Specific NetCDF (MS Bight in Gulf of America) from Google Drive

In [15]:
def download_test() -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    # Alternate test file, kept for reference (previously overwritten by the pair below):
    # THE_FILE = "ACS.txt"
    # THE_ID = "12L8VRY6J1Sj-B1vIf-ODh4kjHWHqIzm8"

    THE_FILE = "MissBight_2020010900.nc"
    THE_ID = "1uYMFrdeVD7_qvG2wRbyu6ir9C6b4wAZC"

    target_folder = f"{WORKING_FOLDER}{os.sep}data"

    target_ids = [THE_ID]
    target_filenames = [THE_FILE]

    for idx, the_id in enumerate(target_ids):
        try:
            if os.path.isfile(f"{target_folder}{os.sep}{target_filenames[idx]}"):
                rprint(f"...no need to download {target_filenames[idx]} again.")
            else:
                rprint(f"...downloading {target_filenames[idx]}.")
                subprocess.run(
                    [
                        "gdown",
                        f"{the_id}",
                        "--no-check-certificate",
                        "--continue",
                        "-O",
                        f"{target_folder}{os.sep}{target_filenames[idx]}",
                    ],
                    check=True,
                )
        except (subprocess.CalledProcessError, Exception) as e:
            process_exception(e)
            raise SystemError
    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")
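
A download can finish without error and still deliver something other than NetCDF, such as an error page saved under the expected filename. The sketch below is one cheap sanity check, not part of the notebook: it inspects the first bytes of the file for the NetCDF classic signature (`CDF`) or the HDF5 signature used by NetCDF-4. The function name `looks_like_netcdf` is hypothetical. The check does not detect a truncated file, because a partial download still starts with a valid header.

```python
import os
import tempfile
from pathlib import Path

HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # NetCDF-4 files are HDF5 containers
CLASSIC_MAGIC = b"CDF"             # NetCDF classic and 64-bit offset files


def looks_like_netcdf(path: str) -> bool:
    """Return True when the file exists and starts with a NetCDF/HDF5 signature."""
    candidate = Path(path)
    if not candidate.is_file():
        return False
    with candidate.open("rb") as handle:
        header = handle.read(8)
    return header.startswith(HDF5_MAGIC) or header.startswith(CLASSIC_MAGIC)


# Demonstration with two throwaway files: one with an HDF5 header, one with HTML.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.nc"
    good.write_bytes(HDF5_MAGIC + b"\x00" * 16)
    bad = Path(tmp) / "bad.nc"
    bad.write_bytes(b"<html>error page</html>")
    print(looks_like_netcdf(str(good)), looks_like_netcdf(str(bad)))
```

Opening the file with a NetCDF reader remains the stronger test; this check is only a fast first filter before that step.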

#### Check your resources from a CPU/GPU perspective

In [16]:
def get_hardware_stats() -> None:
    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    print(
        f"{BOLD_START}List Devices{BOLD_END} #########################################"
    )
    try:
        from tensorflow.python.client import device_lib

        rprint(device_lib.list_local_devices())
        print("")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        rprint(str(repr(e)))

    print(
        f"{BOLD_START}Devices Counts{BOLD_END} ########################################"
    )
    try:
        rprint(
            f"Num GPUs Available: {str(len(tf.config.experimental.list_physical_devices('GPU')))}"
        )
        rprint(
            f"Num CPUs Available: {str(len(tf.config.experimental.list_physical_devices('CPU')))}"
        )
        print("")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        rprint(str(repr(e)))

    print(
        f"{BOLD_START}Optional Enablement{BOLD_END} ####################################"
    )
    gpus = []  # default so the check below still works if the query fails
    try:
        gpus = tf.config.experimental.list_physical_devices("GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        rprint(str(repr(e)))

    if gpus:
        # Restrict TensorFlow to only use the first GPU
        try:
            tf.config.experimental.set_visible_devices(gpus[0], "GPU")
            logical_gpus = tf.config.experimental.list_logical_devices("GPU")
            rprint(f"{len(gpus)} Physical GPUs, {len(logical_gpus)} Logical GPU")
        except RuntimeError as e:
            # Visible devices must be set before GPUs have been initialized
            rprint(str(repr(e)))
        print("")
    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")

#### Remove Files

In [17]:
def nuke_file(target_filename: str) -> None:
    # rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    if os.path.isfile(target_filename):
        try:
            # remove the existing file, otherwise later writes would append to it
            # (check_output rejects check=True, and shell=True with a list runs a bare `rm`)
            os.remove(target_filename)
        except OSError as e:
            print(f"ERROR in removal of file {target_filename}: {e}")
            process_exception(e)
            raise SystemError from e
    # rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")

## Input Sources
### Create the storage locations


In [18]:
# Create the folder that will hold our content.
def create_storage_locations(inc_directory: str) -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")
    target_folder = inc_directory
    sub_folders = ["downloads", "data"]
    rprint(f"Creating project infrastructure:")
    try:
        for idx, subdir in enumerate(sub_folders):
            target_directory = f"{target_folder}{os.sep}{subdir}"
            rprint(f"...creating ({target_directory}) to store project data.")
            if os.path.isfile(target_directory):
                raise OSError(
                    f"Cannot create your folder ({target_directory}); a file of the same name already exists there. Work with your instructor or remove it yourself."
                )
            elif os.path.isdir(target_directory):
                print(
                    f"......folder named ({target_directory}) {BOLD_START}already exists{BOLD_END}, we won't try to create a new folder."
                )
            else:
                subprocess.run(["mkdir", "-p", target_directory], check=True)
    except (subprocess.CalledProcessError, Exception) as e:
        process_exception(e)

    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")
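
The same folder layout can be created without shelling out to `mkdir`. The sketch below is an assumption about how one might do it with the standard library, not the notebook's own method: `os.makedirs(..., exist_ok=True)` makes the call idempotent, and an explicit check reports a file that blocks the folder name. The function name `ensure_folders` is hypothetical.

```python
import os
import tempfile


def ensure_folders(base_directory: str, sub_folders: list[str]) -> list[str]:
    """Create each sub-folder under base_directory and return the resulting paths."""
    created = []
    for subdir in sub_folders:
        target = os.path.join(base_directory, subdir)
        if os.path.isfile(target):
            # a regular file with the same name blocks folder creation
            raise FileExistsError(f"A file named {target} blocks folder creation.")
        os.makedirs(target, exist_ok=True)  # no error if the folder already exists
        created.append(target)
    return created


# Demonstration in a throwaway directory; calling it twice is harmless.
with tempfile.TemporaryDirectory() as base:
    print(ensure_folders(base, ["downloads", "data"]))
    print(ensure_folders(base, ["downloads", "data"]))
```

Because the call is idempotent, rerunning the notebook cell does not need a separate "already exists" branch.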

# Main

In [None]:
if __name__ == "__main__":

    # note that this design now deviates from previous methods.
    # Implementation will assume a single execution of a single PIID folder, scanning results and
    # appending metrics to a single ASCII file as the code proceeds thus ensuring multi-processor, *nix driven execution.

    start_t = perf_counter()
    print("BEGIN PROGRAM")

    ############################################
    # CONSTANTS
    ############################################

    # Semantic Versioning
    VERSION_NAME = "MLDATAREADY"
    VERSION_MAJOR = 0
    VERSION_MINOR = 0
    VERSION_RELEASE = 1

    DATA_VERSION_RELEASE = "-".join(
        [
            str(VERSION_NAME),
            str(VERSION_MAJOR),
            str(VERSION_MINOR),
            str(VERSION_RELEASE),
        ]
    )

    # OUTPUT EXTENSIONS
    OUTPUT_PANDAS_EXT = "pkl"
    OUTPUT_NUMPY_EXT = "npy"
    OUTPUT_TORCH_EXT = "pt"
    OUTPUT_XARRAY_EXT = "nc"
    OUTPUT_ZARR_EXT = "zarr"
    OUTPUT_PARQUET_EXT = "parquet"
    OUTPUT_TENSORFLOW_EXT = "tf"
    OUTPUT_PYSTAC_EXT = "psc"
    OUTPUT_DASK_EXT = "dask"
    # location of our working files
    # WORKING_FOLDER="/content/folderOnColab"
    WORKING_FOLDER = "./folderOnColab"
    input_directory = "./folderOnColab"
    output_directory = "./folderOnColab"

    # Notebook Author details
    AUTHOR_NAME = "Christopher G Wood"
    GITHUB_USERNAME = "christophergarthwood"
    AUTHOR_EMAIL = "christopher.g.wood@gmail.com"

    # GEOSPATIAL NAMES
    LAT_LNAME = "latitude"
    LAT_SNAME = "lat"
    LONG_LNAME = "longitude"
    LONG_SNAME = "lon"
    #PRODUCT_LNAME = "chlor_a"
    #PRODUCT_SNAME = "chlor_a"
    PRODUCT_LNAME = "cld_amt"
    PRODUCT_SNAME = "cld_amt"

    # PRODUCT_LNAME="salinity"
    # PRODUCT_SNAME="salinity"

    # Encoding
    ENCODING = "utf-8"
    os.environ["PYTHONIOENCODING"] = ENCODING

    BOLD_START = "\033[1m"
    BOLD_END = "\033[0;0m"
    TEXT_WIDTH = 77
    AI_NUMPY_DATA_TYPE  = np.float32
    AI_PANDAS_DATA_TYPE = "float32"
    AI_TORCH_DATA_TYPE = torch.float32
    AI_XARRAY_REMOVE_VARIABLES= ["clwmr", "delz", "dpres", "dzdt", "grle", "hgtsfc", "icmr", "o3mr", "pressfc", "rwmr", "snmr", "spfh", "tmp", "ugrd", "vgrd",]
    # You can also adjust the verbosity by changing the value of TF_CPP_MIN_LOG_LEVEL:
    #
    # 0 = all messages are logged (default behavior)
    # 1 = INFO messages are not printed
    # 2 = INFO and WARNING messages are not printed
    # 3 = INFO, WARNING, and ERROR messages are not printed
    TF_CPP_MIN_LOG_LEVEL_SETTING = 0

    # Set the Seed for the experiment (ask me why?)
    # seed the pseudorandom number generator
    # THIS IS ESSENTIAL FOR CONSISTENT MODEL OUTPUT, remember these are random in nature.
    # SEED_INIT = 7
    # random.seed(SEED_INIT)
    # tf.random.set_seed(SEED_INIT)
    # np.random.seed(SEED_INIT)

    DEBUG_STACKTRACE = 0
    DEBUG_USING_GPU = 0   #no gpu utilization on 0, 1 is gpu utilization
    NUM_PROCESSORS = 10
    ITERATIONS = 20

    # compare file extensions in lower case so variants such as .NC still match
    EXTENSIONS = [".nc"]
    LOWER_EXTENSIONS = [x.lower() for x in EXTENSIONS]

    THE_DEVICE_NAME = "/job:localhost/replica:0/task:0/device:CPU:0"
    if DEBUG_USING_GPU == 1:
        THE_DEVICE_NAME = "/job:localhost/replica:0/task:0/device:GPU:0"

    warnings.filterwarnings("ignore", category=DeprecationWarning)
    warnings.filterwarnings("ignore", category=FutureWarning)
    warnings.filterwarnings("ignore", category=UserWarning)

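    # GPU Setup (for multiple GPU devices)
    # current_device() raises on CPU-only machines, so check availability first
    device = torch.cuda.current_device() if torch.cuda.is_available() else None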

    # software watermark
    lib_diagnostics()

    # hardware specs
    get_hardware_stats()

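    # download the data
    # NOTE: download_noaa() takes day-of-month integers, so this window only works
    # when today and the day 8 days earlier fall in the same month; otherwise the
    # start day exceeds the end day and nothing is downloaded.
    today = date.today()
    this_month = int(today.strftime("%m"))
    this_day = int(today.strftime("%d"))
    yester_day = int((today - timedelta(days=8)).strftime("%d"))  # 8 days ago, despite the name
    download_noaa(this_month, yester_day, this_day)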
    #download_test()
    
    end_t = perf_counter()
    print("END PROGRAM")
    print(f"Elapsed time: {end_t - start_t}")

BEGIN PROGRAM
Author: Christopher G Wood

Github username: GITHUB_USERNAME

Email: christopher.g.wood@gmail.com

Last updated: 2025-03-24T14:34:55.828125-05:00

Python implementation: CPython
Python version       : 3.12.9
IPython version      : 8.30.0

Compiler    : GCC 13.3.0
OS          : Linux
Release     : 5.15.167.4-microsoft-standard-WSL2
Machine     : x86_64
Processor   : x86_64
CPU cores   : 16
Architecture: 64bit

Hostname: ThulsaDoom

Git hash: f824406166462e63ed44e15a1c1f57341698d566

Git repo: git@github.com:christophergarthwood/jbooks.git

Git branch: Updates

Packages:

dask                                    #: 2024.12.1           
fastparquet                             #: 2024.11.0           
keras                                   #: 3.9.0               
numpy                                   #: 1.26.4              
pandas                                  #: 2.2.3               
polars                                  #: 1.24.0              
pystac         

List Devices #########################################


I0000 00:00:1742844895.913668   15060 service.cc:148] XLA service 0x560156808770 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1742844895.913930   15060 service.cc:156]   StreamExecutor device (0): Host, Default Version
I0000 00:00:1742844896.062862   15060 service.cc:148] XLA service 0x560156812e40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1742844896.062892   15060 service.cc:156]   StreamExecutor device (0): NVIDIA GeForce RTX 2060, Compute Capability 7.5
I0000 00:00:1742844896.092399   15060 gpu_device.cc:2022] Created device /device:GPU:0 with 4056 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5



Devices Counts ########################################



Optional Enablement ####################################


I0000 00:00:1742844896.105886   15060 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 4056 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5





Entering __main__ download_noaa
...Performing `wget` on target GFS records.
......starting download of day:20250316
  1% [.                                                                   ]  117776384 / 7183618369