# Artificial Intelligence
## AI Ready Data - 006
### Download, curate, and process weather and tree data.

<center>
<table align="center">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/christophergarthwood/jbooks/blob/main/STEM-006_AIReadyData.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/notebooks?referrer=search&hl=en&project=usfs-ai-bootcamp">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Link to Colab Enterprise
    </a>
  </td>   
  <td style="text-align: center">
    <a href="https://github.com/christophergarthwood/jbooks/blob/main/STEM-006_AIReadyData.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/instances?referrer=search&hl=en&project=usfs-ai-bootcamp">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Link to Vertex AI Workbench
    </a>
  </td>
</table>
</center>
</br></br></br>

| | |
|-|-|
|Author(s) | [Christopher G Wood](https://github.com/christophergarthwood)  |

# Overview

Using several public data sources, we will download, review, and package data in a range of formats to explore what "AI Ready" data means and which options exist for producing it.

## What is "**AI Ready**" Data?


## References:

#### Data Formats
+ [List of ML File Formats](https://github.com/trailofbits/ml-file-formats)
+ [ML Guide to Data Formats](https://www.hopsworks.ai/post/guide-to-file-formats-for-machine-learning)
+ [Why are ML Data Structures Different?](https://stackoverflow.blog/2023/01/04/getting-your-data-in-shape-for-machine-learning/)

#### FAIR
+ [FAIR and AI-Ready](https://repository.niddk.nih.gov/public/NIDDKCR_Office_Hours_AI-Readiness_and_Preparing_AI-Ready_+Datasets_12_2023.pdf)
+ [AI-Ready-Data](https://www.rishabhsoft.com/blog/ai-ready-data)
+ [AI-Ready FAIR Data](https://medium.com/@sean_hill/ai-ready-fair-data-accelerating-science-through-responsible-ai-and-data-stewardship-3b4f21c804fd)
+ [AI-Ready Data ... Quality](https://www.elucidata.io/blog/building-ai-ready-data-why-quality-matters-more-than-quantity)
+ [AI-Ready Data Explained](https://acodis.io/hubfs/pdfs/AI-ready%20data%20Explained%20Whitepaper%20(1).pdf)

+ [GCP with BigQuery DataFrames](https://cloud.google.com/blog/products/data-analytics/building-aiml-apps-in-python-with-bigquery-dataframes)

#### Format Libraries / Standards
+ [Earth Science Information Partners (ESIP)](https://www.esipfed.org/checklist-ai-ready-data/)
+ [Zarr - Storage of N-dimensional arrays (tensors)](https://zarr.dev/#description)
  + [Zarr explained](https://aijobs.net/insights/zarr-explained/)
+ [Apache Parquet](https://parquet.apache.org/)
  + [All about Parquet](https://medium.com/data-engineering-with-dremio/all-about-parquet-part-01-an-introduction-b62a5bcf70f8)
+ [PySTAC - SpatioTemporal Asset Catalogs](https://pystac.readthedocs.io/en/stable/)
  + [John Hogland's Spatial Modeling Tutorials](https://github.com/jshogland/SpatialModelingTutorials/blob/main/README.md)

In [1]:
# Let's define some variables (information holders) for our project overall

global PROJECT_ID, BUCKET_NAME, LOCATION
BUCKET_NAME ="cio-training-vertex-colab"
PROJECT_ID  ="usfs-ai-bootcamp"
LOCATION    = "us-central1"

BOLD_START="\033[1m"
BOLD_END="\033[0m"

In [2]:
# Now create a means of enforcing project id selection

import ipywidgets as widgets
from IPython.display import display

def wait_for_button_press():

    button_pressed = False

    # Create widgets
    html_widget = widgets.HTML(

    value="""
        <center><table><tr><td><h1 style="font-family: Roboto;font-size: 24px"><b>&#128721; &#9888;&#65039; WARNING &#9888;&#65039; &#128721; </b></h1></td></tr></table></center></br></br>

        <table><tr><td>
            <span style="font-family: Tahoma;font-size: 18px">
              This notebook was designed to work in Jupyter Notebook or Google Colab with the understanding that certain permissions might be enabled.</br>
              Please verify that you are in the appropriate project and that the:</br>
              <center><code><b>PROJECT_ID</b></code> </br></center>
              aligns with the Project Id in the upper left corner of this browser and that the location:
              <center><code><b>LOCATION</b></code> </br></center>
              aligns with the instructions provided.
            </span>
          </td></tr></table></br></br>

    """)

    project_list=["usfs-ai-bootcamp", "usfa-ai-advanced-training", "I will setup my own"]
    dropdown = widgets.Dropdown(
        options=project_list,
        value=project_list[0],
        description='Set Your Project:',
    )

    html_widget2 = widgets.HTML(
    value="""
        <center><table><tr><td><h1 style="font-family: Roboto;font-size: 24px"><b>&#128721; &#9888;&#65039; WARNING &#9888;&#65039; &#128721; </b></h1></td></tr></table></center></br></br>
          """)

    button = widgets.Button(description="Accept")

    # Function to handle the selection change
    def on_change(change):
        global PROJECT_ID
        if change['type'] == 'change' and change['name'] == 'value':
            #print("Selected option:", change['new'])
            PROJECT_ID=change['new']

    # Observe the dropdown for changes
    dropdown.observe(on_change)

    def on_button_click(b):
        nonlocal button_pressed
        global PROJECT_ID
        button_pressed = True
        #button.disabled = True
        button.close()  # Remove the button from display
        with output:
          #print(f"Button pressed...continuing")
          #print(f"Selected option: {dropdown.value}")
          PROJECT_ID=dropdown.value

    button.on_click(on_button_click)
    output = widgets.Output()

    # Create centered layout
    centered_layout = widgets.VBox([
                                    html_widget,
                                    widgets.HBox([dropdown, button]),
                                    html_widget2,
    ], layout=widgets.Layout(
                              display='flex',
                              flex_flow='column',
                              align_items='center',
                              width='100%'
    ))
    # Display the layout
    display(centered_layout)


wait_for_button_press()

VBox(children=(HTML(value='\n        <center><table><tr><td><h1 style="font-family: Roboto;font-size: 24px"><b…

## Environment Check

In [3]:
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#- Google Colab Check
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
import datetime

RunningInCOLAB = False
RunningInCOLAB = 'google.colab' in str(get_ipython())
current_time   = datetime.datetime.now()

if RunningInCOLAB:
    print(f"You are running this notebook in Google Colab at {current_time} in the {BOLD_START}{PROJECT_ID}{BOLD_END} lab.")
else:
    print(f"You are likely running this notebook with Jupyter iPython runtime at {current_time} in the {PROJECT_ID} lab.")

You are likely running this notebook with Jupyter iPython runtime at 2025-03-13 09:05:15.660394 in the usfs-ai-bootcamp lab.
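
The check above relies on `get_ipython()`, which exists only inside an IPython session. As a minimal sketch, assuming the same test should also work from a plain Python script, the Colab package can be probed through `importlib` instead. The helper name `running_in_colab` is hypothetical and is not part of the notebook.

```python
import importlib.util
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab package is importable or already loaded."""
    if "google.colab" in sys.modules:
        return True
    try:
        # find_spec imports the parent package ("google") to resolve the submodule,
        # so guard against environments where that parent does not exist.
        return importlib.util.find_spec("google.colab") is not None
    except (ModuleNotFoundError, ValueError):
        return False


print(f"Running in Colab: {running_in_colab()}")
```

Because the helper never calls `get_ipython()`, it returns `False` rather than raising `NameError` when run outside a notebook.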


# Library Management
## Load Libraries necessary for this operation via pip install

In [1]:
# Import key libraries necessary to support dynamic installation of additional libraries
import sys
# Use subprocess to run operating system commands from Python. The notebook "bang" (!) syntax
# also works, but it does not carry over to a plain Python script, so subprocess is the more
# portable approach.
import subprocess
import importlib.util

In [5]:
# Identify the libraries you'd like to add to this Runtime environment.

libraries=["backoff", "python-dotenv", "seaborn","piexif", "unidecode", "icecream","watermark", "watermark[GPU]", "rich", "rich[jupyter]", 
           "numpy", "pydot", "polars[all]", "dask[complete]", "xarray","pandas",
           "pystac", "pystac[jinja2]", "pystac[orjson]", "pystac[validation]",
           "fastparquet",
           "zarr",
           "gdown",
           ]

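# pip names do not always match import names, and extras such as "polars[all]" are never importable,
# so map each requirement to the module name that find_spec can actually look up.
import_names = {"Pillow": "PIL", "python-dotenv": "dotenv"}

# Loop through each library and test for existence, if not present install quietly
for library in libraries:
    base_name = library.split("[")[0]
    module_name = import_names.get(base_name, base_name)
    has_extras = "[" in library
    spec = importlib.util.find_spec(module_name)
    # Extras cannot be verified through an import check, so always hand them to pip.
    if spec is None or has_extras:
      print("Installing library " + library)
      subprocess.run([sys.executable, "-m", "pip", "install" , library, "--quiet"], check=True)
    else:
      print("Library " + library + " already installed.")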
    
# Specialized install for GPU enabled capability with CUDF
#pip install --extra-index-url=https://pypi.nvidia.com "cudf-cu12==25.2.*" "dask-cudf-cu12==25.2.*" "cuml-cu12==25.2.*" "cugraph-cu12==25.2.*" "nx-cugraph-cu12==25.2.*" "cuspatial-cu12==25.2.*"     "cuproj-cu12==25.2.*" "cuxfilter-cu12==25.2.*" "cucim-cu12==25.2.*"
try:
    subprocess.run([sys.executable, "-m", "pip", "install" , "--extra-index-url=https://pypi.nvidia.com", "cudf-cu12", "dask-cudf-cu12","--quiet",], check=True)
except Exception as e:
  # A failed GPU install is not fatal; the import cell falls back to plain pandas.
  print(repr(e))

Library backoff already installed.
Installing library python-dotenv
Library seaborn already installed.
Library piexif already installed.
Library unidecode already installed.
Library icecream already installed.
Library watermark already installed.
Installing library watermark[GPU]
Library rich already installed.
Installing library rich[jupyter]
Library numpy already installed.
Library pydot already installed.
Installing library polars[all]
Installing library dask[complete]
Library xarray already installed.
Library pandas already installed.
Library pystac already installed.
Installing library pystac[jinja2]
Installing library pystac[orjson]
Installing library pystac[validation]
Library fastparquet already installed.
Library zarr already installed.
Library gdown already installed.
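
The install loop has to decide whether a pip requirement such as `polars[all]` or `python-dotenv` is already present, yet those strings are distribution names rather than importable module names. A minimal sketch, assuming Python 3.9 or later, of checking by distribution name with the standard library's `importlib.metadata`; the helper names `split_requirement` and `is_distribution_installed` are hypothetical.

```python
from importlib import metadata


def split_requirement(requirement: str) -> tuple[str, list[str]]:
    """Split 'name[extra1,extra2]' into ('name', ['extra1', 'extra2'])."""
    name, _, extras = requirement.partition("[")
    extra_list = [e.strip() for e in extras.rstrip("]").split(",") if e.strip()]
    return name.strip(), extra_list


def is_distribution_installed(requirement: str) -> bool:
    """Check for an installed distribution using its pip name, ignoring extras."""
    name, _ = split_requirement(requirement)
    try:
        metadata.version(name)
        return True
    except metadata.PackageNotFoundError:
        return False


for requirement in ["python-dotenv", "polars[all]", "pystac[validation]"]:
    name, extras = split_requirement(requirement)
    print(f"{requirement:<22} -> distribution={name!r}, extras={extras}, "
          f"installed={is_distribution_installed(requirement)}")
```

This only confirms that the base distribution is present. Whether the optional dependencies behind an extra are installed still has to be left to pip.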


## Library Import

In [2]:
#- Import additional libraries that add value to the project

#- Set of libraries that perhaps should always be in Python source
import backoff
import datetime
from dotenv import load_dotenv
import gc
import getopt
import glob
import inspect
import io
import itertools
import json
import math
import os
from pathlib import Path
import pickle
import platform
import random
import re
import shutil
import string
from io import StringIO
import subprocess
import socket
import sys
import textwrap
import tqdm
import traceback
import warnings
import time
from time import perf_counter
from rich import print as rprint
from rich.console import Console
from rich.traceback import install
from tabulate import tabulate
import locale

#- Displays system info
from watermark import watermark as the_watermark
from py3nvml import py3nvml

#- Additional libraries for this work
from base64 import b64decode
from IPython.display import Image, Markdown
import pandas, IPython.display as display, io, jinja2, base64
from IPython.display import clear_output #used to support real-time plotting
import requests
import unidecode
import pydot

#- Data Science Libraries
import numpy as np
import polars as pl
import dask as da
import xarray as xr
import pystac as pys
import fastparquet as fq
import zarr
import netCDF4 as nc
from netCDF4 import Dataset

try:
    import cudf 
    import cudf.pandas
    cudf.pandas.install()
except Exception as e:
    pass
finally:
    import pandas as pd


# Tensorflow and related AI libraries
import tensorflow as tf
from tensorflow import data as tf_data

# Torch 
import torch


#- Graphics
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.cbook import get_sample_data
from matplotlib.offsetbox import (AnnotationBbox, DrawingArea, OffsetImage,
                                  TextArea)
from matplotlib.pyplot import imshow
from matplotlib.patches import Circle
from PIL import Image as PIL_Image
import PIL.ImageOps
import matplotlib.image as mpimg
from imageio import imread
import seaborn as sns

#- Image meta-data for Section 508 compliance
import piexif
from piexif.helper import UserComment

#- Progress bar: the notebook-aware tqdm (this rebinds the name `tqdm` from the module imported above)
from tqdm.notebook import trange, tqdm

2025-03-13 09:41:28.521564: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-13 09:41:28.889430: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1741876889.030975   16799 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741876889.071761   16799 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-13 09:41:29.445373: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr

## Function Declaration

#### Lib Diagnostics

In [4]:
def lib_diagnostics() -> None:

    import pkg_resources

    package_name_length=20
    package_version_length=10

    # Show notebook details
    #%watermark?
    #%watermark --github_username christophergwood --email christopher.g.wood@gmail.com --date --time --iso8601 --updated --python --conda --hostname --machine --githash --gitrepo --gitbranch --iversions --gpu
    # Watermark
    print(the_watermark(author=f"{AUTHOR_NAME}", github_username=f"{GITHUB_USERNAME}", email=f"{AUTHOR_EMAIL}",iso8601=True, datename=True, current_time=True, python=True, updated=True, hostname=True, machine=True, gitrepo=True, gitbranch=True, githash=True))


    print(f"{BOLD_START}Packages:{BOLD_END}")
    print("")
    # Functions are like Lego bricks that each do one thing; this one prints the versions of the key libraries in use.
    # Get installed packages
    the_packages=["nltk", "numpy", "os", "pandas", "keras", "seaborn","fastparquet", "zarr", "dask", "pystac", "polars","xarray",]

    installed = {pkg.key: pkg.version for pkg in pkg_resources.working_set}
    for package_idx, package_name in enumerate(installed):
         if package_name in the_packages:
             installed_version = installed[package_name]
             print(f"{package_name:<40}#: {str(pkg_resources.parse_version(installed_version)):<20}")

    try:
        print(f"{'TensorFlow version':<40}#: {str(tf.__version__):<20}")
        print(f"{'     gpu.count:':<40}#: {str(len(tf.config.experimental.list_physical_devices('GPU')))}")
        print(f"{'     cpu.count:':<40}#: {str(len(tf.config.experimental.list_physical_devices('CPU')))}")
    except Exception as e:
        pass

    try:
        print(f"{'Torch version':<40}#: {str(torch.__version__):<20}")
        if torch.cuda.is_available():
            device = torch.device('cuda')
            print(f"{'     GPUs available?':<40}#: {torch.cuda.is_available()}")
            print(f"{'     count':<40}#: {torch.cuda.device_count()}")
            print(f"{'     current':<40}#: {torch.cuda.get_device_name(0)}")
        else:
            device = torch.device('cpu')
            print('No GPU available, using CPU.')        
    except Exception as e:
        pass


    try:
      print(f"{'OpenAI Azure Version':<40}#: {str(the_openai_version):<20}")
    except Exception as e:
      pass

    return
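
`pkg_resources` is deprecated in recent setuptools releases. As a hedged alternative, the sketch below collects the same version information with the standard library's `importlib.metadata`; the function name `report_versions` is hypothetical, and the output format simply mirrors the one used above.

```python
from importlib import metadata


def report_versions(package_names: list[str]) -> dict[str, str]:
    """Return {distribution name: version} for the names that are installed."""
    found = {}
    for name in package_names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed, or not a pip distribution at all (e.g. the stdlib module "os").
            continue
    return found


versions = report_versions(["numpy", "pandas", "zarr", "xarray", "os"])
for name, version in versions.items():
    print(f"{name:<40}#: {version:<20}")
```

Names that are missing are skipped silently, so the printed list shows only what is actually present in the runtime.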

#### Section 508 Compliance Tools

In [5]:
# Routines designed to support adding ALT text to an image generated through Matplotlib.

def capture(figure):
   # Render the figure as a JPEG so the bytes match the MIME type declared in the data URI.
   buffer = io.BytesIO()
   figure.savefig(buffer, format="jpeg")
   return F"data:image/jpeg;base64,{base64.b64encode(buffer.getvalue()).decode()}"

def make_accessible(figure, template, **kwargs):
   return display.Markdown(F"""![]({capture(figure)} "{template.render(**globals(), **kwargs)}")""")


# requires JPG's or TIFFs
def add_alt_text(image_path, alt_text):
    try:
        if os.path.isfile(image_path):
          img = PIL_Image.open(image_path)
          if "exif" in img.info:
              exif_dict = piexif.load(img.info["exif"])
          else:
              exif_dict={}

          w, h = img.size
          if "0th" not in exif_dict:
            exif_dict["0th"]={}
          exif_dict["0th"][piexif.ImageIFD.XResolution] = (w, 1)
          exif_dict["0th"][piexif.ImageIFD.YResolution] = (h, 1)

          software_version=" ".join(["STEM-001 with Python v", str(sys.version).split(" ")[0]])
          exif_dict["0th"][piexif.ImageIFD.Software]=software_version.encode("utf-8")

          if "Exif" not in exif_dict:
            exif_dict["Exif"]={}
          exif_dict["Exif"][piexif.ExifIFD.UserComment] = UserComment.dump(alt_text, encoding="unicode")

          exif_bytes = piexif.dump(exif_dict)
          img.save(image_path, "jpeg", exif=exif_bytes)
        else:
          rprint(f"Could not find {image_path} for ALT text modification, please check your paths.")

    except (FileExistsError, FileNotFoundError, Exception) as e:
        process_exception(e)

# Appears to solve a problem associated with GPU use on Colab, see: https://github.com/explosion/spaCy/issues/11909
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
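
`capture` and `make_accessible` work by embedding the image as a base64 data URI inside a Markdown image tag whose title carries the description. The sketch below isolates that idea using only the standard library. The helper names `to_data_uri` and `markdown_image` are hypothetical, the bytes are placeholder data rather than a real figure, and, unlike the notebook's version, the description is also placed in the alt slot.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URI; mime_type must match the real encoding."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def markdown_image(image_bytes: bytes, alt_text: str, mime_type: str = "image/png") -> str:
    """Build a Markdown image tag that carries alt text and a hover title."""
    # Double quotes would end the Markdown title early, so swap them for single quotes.
    safe_text = alt_text.replace('"', "'")
    return f'![{safe_text}]({to_data_uri(image_bytes, mime_type)} "{safe_text}")'


# The PNG signature bytes below are placeholder data, not a viewable image.
placeholder = b"\x89PNG\r\n\x1a\n"
print(markdown_image(placeholder, "Line chart of daily temperature"))
```

The declared MIME type and the actual image encoding need to agree, otherwise some renderers will refuse to display the image.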

#### Library Configuration

In [4]:
def set_library_configuration() -> None:

    ############################################
    # - JUPYTER NOTEBOOK OUTPUT CONTROL / FORMATTING
    ############################################
    # Limit pandas and NumPy display precision to 4 decimal places so printed output stays compact
    rprint("Setting Pandas and Numpy library options.")
    pd.set_option(
        "display.max_colwidth", 10
    )  # set to None to view full cell contents (e.g. a JSON blob) in the printed dataframe
    pd.options.display.float_format = "{:,.4f}".format
    np.set_printoptions(precision=4)
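
To see what the `"{:,.4f}"` display format does, the short sketch below applies the same format specification to a few sample numbers in plain Python. The sample values are arbitrary illustrations.

```python
# The same format specification that the notebook assigns to pd.options.display.float_format.
float_format = "{:,.4f}".format

samples = [1234.56789, 0.000012345, 1_000_000.5]
for value in samples:
    print(f"{value!r:>15} -> {float_format(value)}")
```

Values are rounded to four decimal places and grouped with thousands separators, so very small numbers display as `0.0000`. This affects only how values are shown, not the stored data.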

#### Profiling

In [5]:
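import cProfile
import functools
import pstats
from pstats import SortKey

def profile_function(func):
    # Decorator: profile each call to func and print the stats sorted by cumulative time
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        pr = cProfile.Profile()
        pr.enable()
        result = func(*args, **kwargs)
        pr.disable()
        s = io.StringIO()
        sortby = SortKey.CUMULATIVE
        ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
        ps.print_stats()
        print(s.getvalue())
        return result

    return wrapper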
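
A profiling decorator like the one above is applied with the `@` syntax. The sketch below is a self-contained variation, not the notebook's function: the hypothetical `profile_top` takes a `limit` argument so that only the costliest entries are printed, and it reports even when the wrapped function raises.

```python
import cProfile
import functools
import io
import pstats
from pstats import SortKey


def profile_top(limit: int = 5):
    """Decorator factory: profile a call and print only the `limit` costliest entries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            profiler.enable()
            try:
                return func(*args, **kwargs)
            finally:
                # Always stop profiling and report, even if func raises.
                profiler.disable()
                stream = io.StringIO()
                stats = pstats.Stats(profiler, stream=stream)
                stats.sort_stats(SortKey.CUMULATIVE).print_stats(limit)
                print(stream.getvalue())
        return wrapper
    return decorator


@profile_top(limit=3)
def sum_of_squares(n: int) -> int:
    return sum(i * i for i in range(n))


total = sum_of_squares(10_000)
print(f"sum_of_squares(10_000) = {total}")
```

Sorting by cumulative time puts the functions that account for the most total runtime, including their callees, at the top of the report.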

#### Custom Exception Display

In [6]:
# Display errors from one central location so changes to error formatting are made in a single place.
# Set DEBUG_STACKTRACE to 1 for a full stack trace; otherwise only a one-line summary of the exception is printed.
def process_exception(inc_exception: Exception) -> None:
    if DEBUG_STACKTRACE == 1:
        traceback.print_exc()
        console.print_exception(show_locals=True)
    else:
        rprint(repr(inc_exception))

#### Quick Stats for a DataFrame

In [7]:
def quick_df_stats(inc_df:pd.DataFrame,
                   inc_header_count: int,
                   ) -> None:
    '''
    Print summary statistics for a pd.DataFrame and check its column count.

            Parameters:
                   inc_df (pd.DataFrame): Dataframe to be inspected, displayed
                   inc_header_count (int): Expected number of columns (validation check)

            Returns:
                    None; the statistics are printed
    '''
    print("Data Resolution has: " + str(inc_df.columns))
    print("\n")
    print(f"""{"size":20} : {inc_df.size:15,} """)
    print(f"""{"shape":20} : {str(inc_df.shape):15} """)
    print(f"""{"ndim":20} : {inc_df.ndim:15,} """)
    print(f"""{"column size":20} : {inc_df.columns.size:15,} """)

    #index added so you get an extra column
    print(f"""{"Read":20} : {inc_df.columns.size:15,} """)
    print(f"""{"Expected":20} : {inc_header_count:15,} """)
    if ( (inc_df.columns.size) == inc_header_count):
        print(f"{BOLD_START}Expectations met{BOLD_END}.")
    else:
        print(f"Expectations {BOLD_START}not met{BOLD_END}, check your datafile, columns don't match.")
    rprint("\n")
    #rprint(str(inc_df.describe()))



#### Download the FIADB Dataset

In [42]:
#Reference: https://research.fs.usda.gov/programs/fia#data-and-tools
#Forest Inventory and Analysis Database (FIADB)

def download_fiadb() -> None:

        # Ordered to align index-for-index with dataset_short_names below.
        dataset_long_names=[
        "ALASKA_AK", "ALABAMA_AL", "ARKANSAS_AR", "AMERICAN_SAMOA_AS", "ARIZONA_AZ", "CALIFORNIA_CA", "COLORADO_CO", "CONNECTICUT_CT", "DELAWARE_DE", "FLORIDA_FL",
        "GEORGIA_GA", "GUAM_GU", "HAWAII_HI", "IOWA_IA", "IDAHO_ID", "ILLINOIS_IL", "INDIANA_IN", "KANSAS_KS", "KENTUCKY_KY", "LOUISIANA_LA",
        "MASSACHUSETTS_MA", "MARYLAND_MD", "MAINE_ME", "MICHIGAN_MI", "MINNESOTA_MN", "MISSOURI_MO", "NORTHERN_MARIANA_ISLANDS_MP", "MISSISSIPPI_MS", "MONTANA_MT", "NORTH_CAROLINA_NC",
        "NORTH_DAKOTA_ND", "NEBRASKA_NE", "NEW_HAMPSHIRE_NH", "NEW_JERSEY_NJ", "NEW_MEXICO_NM", "NEVADA_NV", "NEW_YORK_NY", "OHIO_OH", "OKLAHOMA_OK", "OREGON_OR",
        "PENNSYLVANIA_PA", "PUERTO_RICO_PR", "PALAU_PW", "RHODE_ISLAND_RI", "SOUTH_CAROLINA_SC", "SOUTH_DAKOTA_SD", "FEDERATED_STATES_OF_MICRONES_FM", "TENNESSEE_TN", "TEXAS_TX", "UTAH_UT",
        "VIRGINIA_VA", "US_VIRGIN_ISLANDS_VI", "VERMONT_VT", "WASHINGTON_WA", "WISCONSIN_WI", "WEST_VIRGINIA_WV", "WYOMING_WY",
        ]
        dataset_short_names=[
             "AK", "AL", "AR", "AS", "AZ", "CA", "CO", "CT", "DE", "FL", "GA", "GU", "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME", "MI", "MN", "MO", "MP", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA", "PR", "PW", "RI", "SC", "SD", "SFM", "TN", "TX", "UT", "VA", "VI", "VT", "WA", "WI", "WV", "WY", 
        ]
        #dataset_pattern="https://apps.fs.usda.gov/fia/datamart/CSV/MT_VEG_SUBPLOT.zip"
        dataset_pattern="https://apps.fs.usda.gov/fia/datamart/CSV/"
        
        rprint("Performing `wget` on target FIA records.")
        target_folder=WORKING_FOLDER
        if os.path.isdir(target_folder):
            target_directory=f"{target_folder}{os.sep}downloads"
            for idx,filename in enumerate(dataset_short_names):
                if os.path.isdir(target_directory):
                    target_filename=f"{filename}_CSV.zip"
                    target_url=f"{dataset_pattern}{target_filename}"
                    try:
                      rprint(f"...copying {dataset_long_names[idx]} to target folder: {target_directory}")
                      subprocess.run(["/usr/bin/wget", "--show-progress", f"--directory-prefix={target_directory}", f"{target_url}"], check=True)            
                      rprint("......completed")
                    except (subprocess.CalledProcessError, Exception) as e:
                      process_exception(e)
                else:
                    rprint(f"...target folder: {target_directory} isn't present for {filename} download.")
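                break  # NOTE: as written, the loop stops after the first dataset; remove this line to download all of them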
        else:
            rprint("ERROR: Local downloads folder not found/created.  Check the output to ensure your folder is created.")
            rprint(f"...target folder: {target_folder}")
            rprint("...if you can't find the problem contact the instructor.")

        #Process the downloaded data, open it up
        rprint("Uncompressing the downloads...")
        if os.path.isdir(target_folder):
            source_directory=f"{target_folder}{os.sep}downloads"
            target_directory=f"{target_folder}{os.sep}data"
            if os.path.isdir(target_directory) and os.path.isdir(source_directory):
                for idx,filename in enumerate(dataset_short_names):
                    target_filename=f"{filename}_CSV.zip"
                    final_directory=f"{target_directory}{os.sep}{filename}{os.sep}"
                    try:
                      if os.path.isfile(f"{source_directory}{os.sep}{target_filename}"):
                          rprint(f"...unzipping {dataset_long_names[idx]} to created target folder: {final_directory}")
                          subprocess.run(["mkdir", "-p" , final_directory], check=True)
                          subprocess.run(["/usr/bin/unzip", "-o", "-qq", "-d", f"{final_directory}", f"{source_directory}{os.sep}{target_filename}"], check=True)          
                          process1 = subprocess.Popen(["/usr/bin/find", f"{final_directory}", "-type", "f", "-print",], stdout=subprocess.PIPE)
                          process2 = subprocess.Popen(['wc', '-l'], stdin=process1.stdout, stdout=subprocess.PIPE)
        
                          # Close the output of process1 to allow process2 to receive EOF
                          process1.stdout.close()
                          output, error = process2.communicate()
                          process2.stdout.close()
                          number_files=output.decode().strip()
                          rprint(f"......completed, {number_files} files extracted.")
                      else:
                        rprint(f"......failed, unable to find ({source_directory}{os.sep}{target_filename}{os.sep})")
                    except (subprocess.CalledProcessError, Exception) as e:
                      process_exception(e)
                    break  # NOTE: as written, the loop stops after the first dataset; remove this line to extract all of them
            else:
                rprint(f"...either the source directory ({source_directory})  or the ({target_directory}) isn't present for extraction.")
        else:
            rprint("ERROR: Local downloads folder not found/created.  Check the output to ensure your folder is created.")
            rprint(f"...target folder: {target_folder}")
            rprint("...if you can't find the problem contact the instructor.")
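
The function above shells out to `wget`, `unzip`, `find`, and `wc`, which ties it to a Unix-like system. As a hedged sketch of a portable alternative, the URL construction and the extract-and-count step can be done with the standard library alone. The helper names are hypothetical, and the demonstration uses a tiny archive created on the fly rather than a real FIA download. A real fetch could use `urllib.request.urlretrieve(fia_zip_url("AK"), some_path)`, which is not executed here.

```python
import tempfile
import zipfile
from pathlib import Path

FIA_BASE_URL = "https://apps.fs.usda.gov/fia/datamart/CSV/"  # same base URL as above


def fia_zip_url(state_code: str) -> str:
    """Build the download URL for one state's CSV bundle, e.g. 'MT' -> .../MT_CSV.zip."""
    return f"{FIA_BASE_URL}{state_code}_CSV.zip"


def extract_and_count(zip_path: Path, destination: Path) -> int:
    """Extract an archive into destination and return how many files it contained."""
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
        return sum(1 for member in archive.infolist() if not member.is_dir())


# Offline demonstration with a tiny archive created on the fly (placeholder data).
with tempfile.TemporaryDirectory() as workdir:
    demo_zip = Path(workdir) / "XX_CSV.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("XX_PLOT.csv", "CN,INVYR\n1,2020\n")
        archive.writestr("XX_TREE.csv", "CN,SPCD\n1,202\n")
    extracted = extract_and_count(demo_zip, Path(workdir) / "data" / "XX")
    print(f"{fia_zip_url('AK')} -> demo archive held {extracted} files")
```

Counting members from the archive listing replaces the `find | wc -l` pipeline and avoids managing two subprocess handles.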
                  



#### Download NOAA GFS 0.25 Degree Data for a Range of Dates

In [43]:
#Reference: https://polar.ncep.noaa.gov/global/data_access.shtml
#Global Forecast System (GFS), 0.25 degree resolution
def download_noaa() -> None:
    
        dataset_url="https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs.20250304/00/atmos/gfs.t00z.atmf000.nc"
        dataset_url_pattern_begin=f"https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs."
        dataset_filename_pattern="gfs.t00z.atmf000.nc"
        dataset_url_pattern_end=f"/00/atmos/{dataset_filename_pattern}"
        dataset_date_month="03"
        dataset_day_start=int(4)
        dataset_day_end=int(12)
        
        rprint("Performing `wget` on target GFS records.")
        target_folder=WORKING_FOLDER
        if os.path.isdir(target_folder):
            target_directory=f"{target_folder}{os.sep}data"
            if os.path.isdir(target_directory):
                for idx,day in enumerate(range(dataset_day_start,dataset_day_end)):
                        if day < 10:
                            day=f"0{day}"
                        target_date=f"2025{dataset_date_month}{day}"
                        target_url="".join([dataset_url_pattern_begin,target_date,dataset_url_pattern_end])
        
                        #remove a leftover, undated file in the target folder; it is likely a partial download
                        target_partial_file=f"{target_directory}{os.sep}{dataset_filename_pattern}"
                        if os.path.isfile(target_partial_file):
                            rprint(f"...removing {dataset_filename_pattern} as it is likely a partial download.")
                            os.remove(target_partial_file)
                        try:
                          rprint(f"...copying {target_url} to target folder: {target_directory}")
                          subprocess.run(["/usr/bin/wget", "--quiet", f"--directory-prefix={target_directory}", f"{target_url}"], check=True)
                          rprint(f"......completed download of day:{target_date}")
                          #wget saves under the undated name, so rename inside the target folder to keep one file per day
                          target_filename=f"{target_directory}{os.sep}{target_date}_{dataset_filename_pattern}"
                          os.rename(target_partial_file, target_filename)
                          rprint(f".........renamed file to {target_filename}")
                          if os.path.isfile(target_filename):
                              rprint(f".........SUCCESS.")
                          else:
                              rprint(f".........inspect download, there could be a problem.")
                        except (subprocess.CalledProcessError, Exception) as e:
                          process_exception(e)

            else:
                rprint(f"ERROR: Target folder, {target_directory}, isn't present for the download.")
                raise SystemError
        else:
            rprint("ERROR: Local downloads folder not found/created.  Check the output to ensure your folder is created.")
            rprint(f"...target folder: {target_folder}")
            raise SystemError
        
        rprint(f"Display of top-level files in target folder: {WORKING_FOLDER}{os.sep}data")
        
        from pathlib import Path
        source = Path(f"{WORKING_FOLDER}{os.sep}data")
        the_files=[x.name for x in source.iterdir()]
        for file in the_files:
          rprint(f"./{file:50}")
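
The function above builds each date string by zero-padding a day number within a fixed month, which cannot cross a month boundary. A minimal sketch, assuming the same URL layout as above, of generating the dated URLs with `datetime` instead. The helper name `gfs_urls` is hypothetical, and unlike the `range` call above it treats the end date as inclusive.

```python
from datetime import date, timedelta

GFS_BASE = "https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs."
GFS_FILE = "gfs.t00z.atmf000.nc"


def gfs_urls(start: date, end: date) -> list[tuple[str, str]]:
    """Return (url, dated_filename) pairs for every day from start to end inclusive."""
    pairs = []
    current = start
    while current <= end:
        stamp = current.strftime("%Y%m%d")  # zero-padded, e.g. 20250304
        pairs.append((f"{GFS_BASE}{stamp}/00/atmos/{GFS_FILE}", f"{stamp}_{GFS_FILE}"))
        current += timedelta(days=1)
    return pairs


# A range that crosses a month boundary, which manual day arithmetic would get wrong.
for url, filename in gfs_urls(date(2025, 2, 27), date(2025, 3, 2)):
    print(f"{filename:<32} <- {url}")
```

Each pair couples the remote URL with the date-prefixed local filename, so the download and rename steps can share one source of truth.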


#### Download Single Specific NetCDF (MS Bight in Gulf of America) from Google Drive

In [44]:
def download_test() -> None:
        # Alternate test file; uncomment this pair and comment out the pair below to use it instead
        #THE_FILE="ACS.txt"
        #THE_ID="12L8VRY6J1Sj-B1vIf-ODh4kjHWHqIzm8"
        
        THE_FILE="MissBight_2020010900.nc"
        THE_ID="1uYMFrdeVD7_qvG2wRbyu6ir9C6b4wAZC"
        
        target_folder=f"{WORKING_FOLDER}{os.sep}data"
        
        target_ids=[THE_ID]
        target_filenames=[THE_FILE]
        
        for idx, the_id in enumerate(target_ids):
          try:
            if os.path.isfile(f"{target_folder}{os.sep}{target_filenames[idx]}"):
                rprint(f"...no need to download {target_filenames[idx]} again.")
            else:
                rprint(f"...downloading {target_filenames[idx]}.")
                subprocess.run(["gdown", f"{the_id}", "--no-check-certificate",  "--continue", "-O", f"{target_folder}{os.sep}{target_filenames[idx]}"], check=True)
          except (subprocess.CalledProcessError, Exception) as e:
            process_exception(e)
            raise SystemError

#### Check your resources from a CPU/GPU perspective

In [45]:
def get_hardware_stats() -> None:
    print(f"{BOLD_START}List Devices{BOLD_END} #########################################")
    try:
      from tensorflow.python.client import device_lib
      rprint(device_lib.list_local_devices())
      print("")
    except RuntimeError as e:
      # Visible devices must be set before GPUs have been initialized
      rprint(str(repr(e)))
    
    print(f"{BOLD_START}Device Counts{BOLD_END} ########################################")
    try:
      rprint(f"Num GPUs Available: {str(len(tf.config.experimental.list_physical_devices('GPU')))}" )
      rprint(f"Num CPUs Available: {str(len(tf.config.experimental.list_physical_devices('CPU')))}" )
      print("")
    except RuntimeError as e:
      # Visible devices must be set before GPUs have been initialized
      rprint(str(repr(e)))
    
    print(f"{BOLD_START}Optional Enablement{BOLD_END} ####################################")
    gpus = []  # default so the check below still works if device discovery fails
    try:
      gpus = tf.config.experimental.list_physical_devices('GPU')
    except RuntimeError as e:
      rprint(str(repr(e)))
    
    if gpus:
      # Restrict TensorFlow to only use the first GPU
      try:
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        rprint(f"{len(gpus)} Physical GPUs, {len(logical_gpus)} Logical GPU")
      except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        rprint(str(repr(e)))
      print("")

#### Clean House

In [None]:
def clean_house()-> None:
    gc.collect()
    torch.cuda.empty_cache()

## Input Sources
### Create the storage locations


In [12]:
# Create the folder that will hold our content.
target_folder=WORKING_FOLDER
sub_folders=["downloads","data"]
rprint(f"Creating project infrastructure:")
try:
  for idx, subdir in enumerate(sub_folders):
      target_directory=f"{target_folder}{os.sep}{subdir}"
      rprint(f"...creating ({target_directory}) to store project data.")
      if os.path.isfile(target_directory):
        raise OSError(f"Cannot create your folder ({target_directory}) a file of the same name already exists there, work with your instructor or remove it yourself.")
      elif os.path.isdir(target_directory):
        print(f"......folder named ({target_directory}) {BOLD_START}already exists{BOLD_END}, we won't try to create a new folder.")
      else:
        subprocess.run(["mkdir", "-p" , target_directory], check=True)
except (subprocess.CalledProcessError, Exception) as e:
  process_exception(e)

......folder named (./folderOnColab/downloads) already exists, we won't try to create a new folder.


......folder named (./folderOnColab/data) already exists, we won't try to create a new folder.
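
The same folder layout can be created without shelling out to `mkdir`. The following is a minimal sketch using only the standard library; the helper name `ensure_subfolders` is hypothetical and is not part of the notebook.

```python
from pathlib import Path


def ensure_subfolders(root, names):
    """Create root/<name> for each name and return the resulting paths."""
    created = []
    for name in names:
        target = Path(root) / name
        if target.is_file():
            # Mirror the notebook's guard: a file with the same name blocks creation.
            raise OSError(f"A file named ({target}) already exists.")
        # parents=True behaves like `mkdir -p`; exist_ok=True makes reruns harmless.
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created
```

In the notebook this would be called as `ensure_subfolders(WORKING_FOLDER, ["downloads", "data"])`.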


#### Download the Data

In [None]:
download_noaa()

#### Reading Base NetCDF

In [None]:
source_nc_list=[]
rprint(f"Reading base NetCDF4:")
for target_filename in the_source_files:
    try:
        rprint(f"...reading NetCDF4 ({target_filename})")
        the_netcdf = Dataset(target_filename, "r", format="NETCDF4")
        source_nc_list.append(the_netcdf)
    except Exception as e:
       process_exception(e)

#### Gather data as simple arrays

In [30]:
geospatial_lat_nm='latitude'
geospatial_lon_nm='longitude'
product_nm="salinity"
#note x,y values are shown below as they are part of the APS meta-data
#based on the NetCDF Best Practice subject x,y vars should not exist.
#keeping for continuity between BIOCAST code already written to read APS input.
geospatial_vars = [geospatial_lat_nm, geospatial_lon_nm, ]

#lat   =np.array(the_netcdf.variables[geospatial_lat_nm][:][:]).flatten()
lat   =np.array(the_netcdf.variables[geospatial_lat_nm][:][:],dtype=np.double)
print(f"{BOLD_START}{'Latitude data type:':30}{BOLD_END}{str(type(lat)):20}")
print(f".......shape:{lat.shape}")
print(f"....datatype:{lat.dtype}")
#df_describe = pd.DataFrame(lat)
#print(df_describe.describe())

#lon   =np.array(the_netcdf.variables[geospatial_lon_nm][:][:]).flatten()
lon   =np.array(the_netcdf.variables[geospatial_lon_nm][:][:],dtype=np.double)
print(f"{BOLD_START}{'Longitude data type:':30}{BOLD_END}{str(type(lon)):20}")
print(f".......shape:{lon.shape}")
print(f"....datatype:{lon.dtype}")
#df_describe = pd.DataFrame(lon)
#print(df_describe.describe())

varAry=np.array(the_netcdf.variables[product_nm][:][:],dtype=np.double)
print(f"{BOLD_START}{'Oceanographic data type:':30}{BOLD_END}{str(type(varAry)):20}")
print(f".......shape:{varAry.shape}")
print(f"....datatype:{varAry.dtype}")


Latitude data type:           <class 'numpy.ndarray'>
.......shape:(2001,)
....datatype:float64
Longitude data type:          <class 'numpy.ndarray'>
.......shape:(4500,)
....datatype:float64
Oceanographic data type:      <class 'numpy.ndarray'>
.......shape:(1, 1, 2001, 4500)
....datatype:float64


### Demonstrate Various Data Storage Solutions

#### Pandas DataFrame

In [32]:
target_pandas_filename=f"{WORKING_FOLDER}{os.sep}data{os.sep}acs.pkl"

In [31]:
latSeries=pd.Series(lat.flatten())
lonSeries=pd.Series(lon.flatten())
varSeries=pd.Series(varAry[0,0,:,:].flatten())

#define a Panda.DataFrame()
frame={geospatial_lat_nm: latSeries, geospatial_lon_nm: lonSeries, product_nm: varSeries}

#instantiate a dataframe
df=pd.DataFrame(frame)

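#ensure the data is cast as expected (astype returns a new Series, so assign it back)
df[geospatial_lat_nm]=df[geospatial_lat_nm].astype('float64')
df[geospatial_lon_nm]=df[geospatial_lon_nm].astype('float64')
df[product_nm]=df[product_nm].astype('float64')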

#quick stats
quick_df_stats(df, 3)

# Get memory usage of each column in bytes
memory_usage_per_column = df.memory_usage(deep=True)

# Get total memory usage of the DataFrame in bytes
total_memory_usage = df.memory_usage().sum()

print(f"Original Dataframe memory use: {total_memory_usage:20,}")

Data Resolution has: Index(['latitude', 'longitude', 'salinity'], dtype='object')


size                 :      27,013,500 
shape                : (9004500, 3)    
ndim                 :               2 
column size          :               3 
Read                 :               3 
Expected             :               3 
Expectations met.


Original Dataframe memory use:          290,395,136
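
Note that the three Series have different lengths (2,001 latitudes, 4,500 longitudes, 9,004,500 grid values), so `pd.DataFrame` aligns them on the row index and pads the shorter columns with `NaN` rather than pairing each grid value with its coordinates. If a one-row-per-grid-cell table is wanted, the coordinates have to be expanded first. The sketch below shows the row-major pairing on a tiny made-up grid using only the standard library; `grid_to_rows` is a hypothetical helper. For the real arrays, `np.repeat(lat, lon.size)` and `np.tile(lon, lat.size)` produce the same ordering far more efficiently.

```python
from itertools import product


def grid_to_rows(lat, lon, grid):
    """Pair every grid cell with its coordinates in row-major order.

    grid[i][j] is assumed to be the value at (lat[i], lon[j]).
    """
    rows = []
    for (i, la), (j, lo) in product(enumerate(lat), enumerate(lon)):
        rows.append((la, lo, grid[i][j]))
    return rows


# Tiny stand-in for the 2001 x 4500 salinity grid (values are made up).
lat = [30.0, 30.5]
lon = [-90.0, -89.5, -89.0]
grid = [[35.1, 35.2, 35.3],
        [34.9, 35.0, 35.1]]

rows = grid_to_rows(lat, lon, grid)
print(len(rows))   # one row per grid cell: len(lat) * len(lon)
print(rows[0])
print(rows[-1])
```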


In [51]:
%%timeit
"""
if os.path.isfile(target_pandas_filename):
  try:
    #removing existing file, else you would append
    subprocess.run(["rm", "-rf", f"{target_pandas_filename}"], check=True)
  except (subprocess.CalledProcessError, Exception) as e:
    process_exception(e)
    raise SystemError
"""
df.to_pickle(target_pandas_filename)  

61.8 ms ± 33.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [52]:
%%timeit
df = pd.read_pickle(target_pandas_filename)  

8.15 ms ± 836 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
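
`%%timeit` reports elapsed time only. When comparing storage formats it can also help to record the size of the file on disk. The sketch below is one way to capture both for any writer/reader pair; it uses the standard library's `pickle` as a stand-in format, the helper names are hypothetical, and a single-shot timing like this is noisier than the averaged `%%timeit` figures.

```python
import os
import pickle
import tempfile
from time import perf_counter


def time_round_trip(obj, path, writer, reader):
    """Time one write and one read of obj at path and report the file size."""
    start = perf_counter()
    writer(obj, path)
    write_s = perf_counter() - start

    start = perf_counter()
    loaded = reader(path)
    read_s = perf_counter() - start

    return {"write_s": write_s, "read_s": read_s,
            "bytes": os.path.getsize(path), "loaded": loaded}


def pickle_writer(obj, path):
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def pickle_reader(path):
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    sample = {"latitude": [30.0, 30.5], "salinity": [35.1, 34.9]}  # made-up values
    stats = time_round_trip(sample, os.path.join(tmp, "sample.pkl"),
                            pickle_writer, pickle_reader)
    print(f"write {stats['write_s']:.6f}s, read {stats['read_s']:.6f}s, {stats['bytes']:,} bytes")
```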


In [53]:
del df, latSeries, lonSeries, varSeries, frame

#### Numpy

In [54]:
target_numpy_filename=f"{WORKING_FOLDER}{os.sep}data{os.sep}acs.npy"
#np.savez appends ".npz" to a name that lacks it, so the file on disk is acs.npy.npz
if os.path.isfile(f"{target_numpy_filename}.npz"):
  try:
    #remove the previous archive so each run starts clean
    subprocess.run(["rm", "-rf", f"{target_numpy_filename}.npz"], check=True)
  except (subprocess.CalledProcessError, Exception) as e:
    process_exception(e)
    raise SystemError

In [55]:
%%timeit
#df_numpy=df.to_numpy()
#np.save(target_numpy_filename, df_numpy)

#varying sized arrays
np.savez(target_numpy_filename, lat=lat, lon=lon, product=varAry)

706 ms ± 195 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [56]:
%%timeit
loaded_arr = np.load(target_numpy_filename+".npz")

36.5 μs ± 464 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
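
The load time is tiny compared with the 706 ms save because `np.load` on an `.npz` file returns a lazy `NpzFile` object: the arrays are only read when they are indexed by key, as the next cell does. An `.npz` file is a zip archive of `.npy` members, so the same open-now, read-later behavior can be illustrated with the standard library's `zipfile`. This is a sketch of the container behavior, not of NumPy's internals; the member contents are placeholder bytes.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "demo.zip")

    # Write two members, standing in for the lat.npy / product.npy entries of an .npz.
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("lat.npy", b"\x00" * 16)
        archive.writestr("product.npy", b"\x00" * 1_000_000)

    # Opening the archive reads only its table of contents.
    with zipfile.ZipFile(archive_path) as archive:
        member_names = archive.namelist()
        # The payload is read only when a member is requested.
        lat_bytes = archive.read("lat.npy")

print(member_names)
print(len(lat_bytes))
```

A fair comparison with the other formats would therefore time the array access (`loaded_arr['product']`) as well as the `np.load` call.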


In [57]:
loaded_arr = np.load(target_numpy_filename+".npz")
new_lat = loaded_arr['lat']
new_lon = loaded_arr['lon']
new_product = loaded_arr['product']

In [58]:
del loaded_arr, new_lat,new_lon,new_product

#### PyTorch

In [59]:
target_pytorch_filename=f"{WORKING_FOLDER}{os.sep}data{os.sep}acs.pt"
if os.path.isfile(target_pytorch_filename):
  try:
    #removing existing file, else you would append
    subprocess.run(["rm", "-rf", f"{target_pytorch_filename}"], check=True)
  except (subprocess.CalledProcessError, Exception) as e:
    process_exception(e)
    raise SystemError

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")    

In [60]:
#input_tensor = input_tensor.to(device)
lat_tensor = torch.tensor(lat.flatten()).to(dev)
lon_tensor = torch.tensor(lon.flatten()).to(dev)
var_tensor = torch.tensor(varAry.flatten()).to(dev)

#lonSeries=pd.Series(lon.flatten())
#varSeries=pd.Series(varAry[0,0,:,:].flatten())

# Collect the tensors in a list so a single torch.save call can write them together
tensors_list = [lat_tensor,lon_tensor,var_tensor]

In [61]:
%%timeit
torch.save(tensors_list, target_pytorch_filename)

584 ms ± 148 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [62]:
%%timeit
# Load the tensor from the file
tensor_loaded = torch.load(target_pytorch_filename)

193 ms ± 19.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [63]:
del lat_tensor,lon_tensor,var_tensor,tensors_list

In [71]:
import gc
gc.collect()
torch.cuda.empty_cache()

#### TensorFlow

In [93]:
target_tensorflow_filename=f"{WORKING_FOLDER}{os.sep}data{os.sep}acs.tf"

os.environ['CUDA_VISIBLE_DEVICES']='0'
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = "true"
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'

#gpus = tf.config.list_physical_devices('GPU')
#tf.config.experimental.set_memory_growth(gpus[0], True)

if os.path.isfile(target_tensorflow_filename):
  try:
    #removing existing file, else you would append
    subprocess.run(["rm", "-rf", f"{target_tensorflow_filename}"], check=True)
  except (subprocess.CalledProcessError, Exception) as e:
    process_exception(e)
    raise SystemError

# Re-read the arrays from the NetCDF file, then tile them to enlarge the dataset
THE_DEVICE_NAME="/job:localhost/replica:0/task:0/device:GPU:0" 

with tf.device(THE_DEVICE_NAME):
    lat   =np.array(the_netcdf.variables[geospatial_lat_nm][:][:],dtype=np.double)
    print(f"{BOLD_START}{'Latitude data type:':30}{BOLD_END}{str(type(lat)):20}")
    print(f".......shape:{lat.shape}")
    print(f"....datatype:{lat.dtype}")
    #df_describe = pd.DataFrame(lat)
    #print(df_describe.describe())
    
    #lon   =np.array(the_netcdf.variables[geospatial_lon_nm][:][:]).flatten()
    lon   =np.array(the_netcdf.variables[geospatial_lon_nm][:][:],dtype=np.double)
    print(f"{BOLD_START}{'Longitude data type:':30}{BOLD_END}{str(type(lon)):20}")
    print(f".......shape:{lon.shape}")
    print(f"....datatype:{lon.dtype}")
    #df_describe = pd.DataFrame(lon)
    #print(df_describe.describe())
    
    varAry=np.array(the_netcdf.variables[product_nm][:][:],dtype=np.double)
    print(f"{BOLD_START}{'Oceanographic data type:':30}{BOLD_END}{str(type(varAry)):20}")
    print(f".......shape:{varAry.shape}")
    print(f"....datatype:{varAry.dtype}")

    #repeat each array DATASET_SIZE times along its trailing axes
    final_lat = np.tile(lat, DATASET_SIZE,)
    lat=final_lat
    print(f"{BOLD_START}{'Latitude data type:':30}{BOLD_END}{str(type(lat)):20}")
    print(f".......shape:{lat.shape}")
    print(f"....datatype:{lat.dtype}")
    
    final_lon = np.tile(lon, DATASET_SIZE,)
    lon=final_lon
    print(f"{BOLD_START}{'Longitude data type:':30}{BOLD_END}{str(type(lon)):20}")
    print(f".......shape:{lon.shape}")
    print(f"....datatype:{lon.dtype}")
    
    final_varAry = np.tile(varAry, (DATASET_SIZE, DATASET_SIZE),)
    varAry=final_varAry
    print(f"{BOLD_START}{'Oceanographic data type:':30}{BOLD_END}{str(type(varAry)):20}")
    print(f".......shape:{varAry.shape}")
    print(f"....datatype:{varAry.dtype}")

Latitude data type:           <class 'numpy.ndarray'>
.......shape:(400,)
....datatype:float64
Longitude data type:          <class 'numpy.ndarray'>
.......shape:(800,)
....datatype:float64
Oceanographic data type:      <class 'numpy.ndarray'>
.......shape:(1, 29, 400, 800)
....datatype:float64
Latitude data type:           <class 'numpy.ndarray'>
.......shape:(800,)
....datatype:float64
Longitude data type:          <class 'numpy.ndarray'>
.......shape:(1600,)
....datatype:float64
Oceanographic data type:      <class 'numpy.ndarray'>
.......shape:(1, 29, 800, 1600)
....datatype:float64
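
The shapes before and after tiling follow from how `np.tile` lines up the repetition counts with the array axes: whichever of the two is shorter is left-padded with ones, and each axis length is then multiplied by its count. The sketch below reproduces that rule in plain Python. `tiled_shape` is a hypothetical helper, and `DATASET_SIZE = 2` is an assumption inferred from the printed shapes, since the constant is not defined in this section.

```python
def tiled_shape(shape, reps):
    """Return the shape np.tile would produce, following its documented rule."""
    if isinstance(reps, int):
        reps = (reps,)
    shape, reps = tuple(shape), tuple(reps)
    # The shorter of the two is left-padded with ones so both have equal length.
    width = max(len(shape), len(reps))
    shape = (1,) * (width - len(shape)) + shape
    reps = (1,) * (width - len(reps)) + reps
    return tuple(s * r for s, r in zip(shape, reps))


print(tiled_shape((400,), 2))                    # (800,)
print(tiled_shape((1, 29, 400, 800), (2, 2)))    # (1, 29, 800, 1600)
```

Because the counts `(2, 2)` apply to the last two axes, only the latitude and longitude dimensions of the product grow; the leading axes of length 1 and 29 are untouched.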


In [94]:
with tf.device(THE_DEVICE_NAME):
    #tensor = tf.convert_to_tensor(df)   
    #the_lat = [tf.cast(x, tf.float32).numpy() for x in lat]
    #lat_tensor = tf.convert_to_tensor(lat.flatten())
    #lon_tensor = tf.convert_to_tensor(lon.flatten())
    #var_tensor = tf.convert_to_tensor(varAry.flatten())
    
    #create a TFRecord to store the data
    lat_list = tf.train.FloatList(value=lat.flatten().tolist())
    lon_list = tf.train.FloatList(value=lon.flatten().tolist())
    varAry_list = tf.train.FloatList(value=varAry.flatten().tolist())
    feature = {
        "latitude": tf.train.Feature(float_list=lat_list),
        "longitude": tf.train.Feature(float_list=lon_list),
        "product": tf.train.Feature(float_list=varAry_list)
    }
    tfRecord=tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()
    #dataset = tf.data.Dataset.from_tensor_slices((lat_tensor, lon_tensor, var_tensor))
    
    print(f"Size of TFRecord is {sys.getsizeof(tfRecord):20,} bytes.")

Size of TFRecord is          148,489,712 bytes.
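
`tf.train.FloatList` stores 32-bit floats, so the `float64` arrays are narrowed when they are packed into the record. The reported size is consistent with that: the three arrays hold 1 × 29 × 800 × 1600 + 800 + 1600 values, and at 4 bytes each that is 148,489,600 bytes, leaving roughly a hundred bytes of overhead. The sketch below uses the standard library's `struct` to show the precision change for a single made-up value and to redo the size arithmetic; `as_float32` is a hypothetical helper.

```python
import struct


def as_float32(value):
    """Round-trip a Python float (64-bit) through a 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


original = 35.123456789  # made-up salinity value
stored = as_float32(original)
print(original, stored, abs(original - stored))

# Payload estimate from the shapes printed above: 4 bytes per value.
n_values = 1 * 29 * 800 * 1600 + 800 + 1600
print(f"{n_values * 4:,} bytes")
```

If full double precision matters downstream, one option is to serialize the raw array bytes into a `BytesList` feature instead; that trade-off is outside what this notebook demonstrates.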


In [95]:
%%timeit
with tf.device(THE_DEVICE_NAME):
    #tf.io.write_file(target_tensorflow_filename, tf.io.serialize_tensor(tensors_list))
    with tf.io.TFRecordWriter(target_tensorflow_filename) as writer:
        writer.write(tfRecord)    

605 ms ± 170 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [96]:
def parse_tfrecord_fn(example_proto):
   feature_description = {
       "latitude": tf.io.FixedLenSequenceFeature([], dtype=tf.float32, allow_missing=True),
       "longitude": tf.io.FixedLenSequenceFeature([], dtype=tf.float32, allow_missing=True),
       "product": tf.io.FixedLenSequenceFeature([], dtype=tf.float32, allow_missing=True)
   }
   return tf.io.parse_single_example(example_proto, feature_description)

In [102]:
%%timeit
with tf.device(THE_DEVICE_NAME):
    dataset = tf.data.TFRecordDataset(target_tensorflow_filename)
    tfRecord = dataset.map(parse_tfrecord_fn)  
    for record in tfRecord:
        lat=record['latitude']
        lon=record['longitude']
        varAry=record['product']
        break

141 ms ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
import gc

del tfRecord, lat_list,lon_list,varAry_list,feature
gc.collect()
torch.cuda.empty_cache()

#### Xarray

In [111]:
#Xarray
# Open the source NetCDF file as an xarray Dataset so it can be written back out as NetCDF
target_xarray_filename=f"{WORKING_FOLDER}{os.sep}data{os.sep}acs.xr"
if os.path.isfile(target_xarray_filename):
  try:
    #removing existing file, else you would append
    subprocess.run(["rm", "-rf", f"{target_xarray_filename}"], check=True)
  except (subprocess.CalledProcessError, Exception) as e:
    process_exception(e)
    raise SystemError
ds_xr = xr.open_dataset(target_filename)

In [107]:
%%timeit
ds_xr.to_netcdf(target_xarray_filename)

114 ms ± 8.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [108]:
%%timeit
ds_xr_loaded=xr.open_dataset(target_xarray_filename, engine="netcdf4")

7.15 ms ± 128 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
#clean up your mess
del ds_xr
gc.collect()
torch.cuda.empty_cache()

#### Apache Parquet

In [191]:
latSeries=pd.Series(lat.flatten())
lonSeries=pd.Series(lon.flatten())
varSeries=pd.Series(varAry[0,0,:,:].flatten())

#define a Panda.DataFrame()
frame={geospatial_lat_nm: latSeries, geospatial_lon_nm: lonSeries, product_nm: varSeries}

#instantiate a dataframe
df=pd.DataFrame(frame)

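#ensure the data is cast as expected (astype returns a new Series, so assign it back)
df[geospatial_lat_nm]=df[geospatial_lat_nm].astype('float64')
df[geospatial_lon_nm]=df[geospatial_lon_nm].astype('float64')
df[product_nm]=df[product_nm].astype('float64')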

target_parquet_filename=f"{WORKING_FOLDER}{os.sep}data{os.sep}ACS.parquet.gzip"
if os.path.isfile(target_parquet_filename):
  try:
    #removing existing file, else you would append
    subprocess.run(["rm", "-rf", f"{target_parquet_filename}"], check=True)
  except (subprocess.CalledProcessError, Exception) as e:
    process_exception(e)
    raise SystemError

In [192]:
%%timeit
df.to_parquet(target_parquet_filename,compression='gzip')  

269 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [196]:
%%timeit
df_parquet=pd.read_parquet(target_parquet_filename)  

28.8 ms ± 3.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:
#clean up your mess
del df, frame, latSeries, lonSeries, varSeries
gc.collect()
torch.cuda.empty_cache()

#### Zarr

In [198]:
#Zarr
# Save xarray Dataset to Zarr
#ds_zarr=ds_xr.to_zarr('data.zarr')
import zarr
import cupy as cp  
zarr.config.enable_gpu()
ds_xr = xr.open_dataset(target_filename)

target_zarr_filename=f"{WORKING_FOLDER}{os.sep}data{os.sep}acs.zarr"
if os.path.isdir(target_zarr_filename):
  try:
    #removing existing file, else you would append
    subprocess.run(["rm", "-rf", f"{target_zarr_filename}"], check=True)
  except (subprocess.CalledProcessError, Exception) as e:
    process_exception(e)
    raise SystemError
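
Each storage section repeats the same existence check followed by `rm -rf`. A portable alternative that does not depend on a Unix shell is sketched below using only the standard library; `remove_if_exists` is a hypothetical helper and is not defined elsewhere in the notebook.

```python
import os
import shutil


def remove_if_exists(path):
    """Delete a file or a directory tree; return True if something was removed."""
    if os.path.isdir(path) and not os.path.islink(path):
        shutil.rmtree(path)   # directory stores such as *.zarr
        return True
    if os.path.lexists(path):
        os.remove(path)       # regular files and symlinks
        return True
    return False
```

With this helper, a guard such as the one above reduces to `remove_if_exists(target_zarr_filename)`, and the same call works for the single-file formats.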

In [None]:
%%timeit
#write the xarray Dataset as a Zarr store; mode="w" overwrites the store on every timing loop
ds_xr.to_zarr(target_zarr_filename, mode="w")

In [None]:
%%timeit
ds_zarr_loaded=xr.open_zarr(target_zarr_filename)


In [None]:
#clean up your mess
del ds_xr
gc.collect()
torch.cuda.empty_cache()

In [None]:
"""
#%%timeit

target_zarr_filename=f"/home/cwood/jbooks/folderOnColab/data/acs.zarr"
if os.path.isdir(target_zarr_filename):
  try:
    #removing existing file, else you would append
    subprocess.run(["rm", "-rf", f"{target_zarr_filename}"], check=True)
  except (subprocess.CalledProcessError, Exception) as e:
    process_exception(e)
    raise SystemError

#root = zarr.open_group('data/group.zarr', mode='w')
#store = zarr.storage.MemoryStore()
#store= zarr.storage.LocalStore(target_zarr_filename, read_only=False)
store = zarr.storage.ZipStore(target_zarr_filename, mode='w')
root = zarr.create_group(store=store)

z_lat_grp = root.create_group('latitude')
z_lat_ary = z_lat_grp.create_array(name='latitude', shape=lat.shape, chunks='auto', dtype='float32')
z_lat_ary[:] = lat

z_lon_grp = root.create_group('longitude')
z_lon_ary = z_lon_grp.create_array(name='longitude', shape=lon.shape, chunks='auto', dtype='float32')
z_lon_ary[:] = lon

z_product_grp = root.create_group('product')
z_product_ary = z_product_grp.create_array(name='product', shape=varAry.shape, chunks='auto', dtype='float32')
z_product_ary[:] = varAry

#print(root.tree())
#type(z_lat_ary[:])  
zarr.save_group(root)
store.close()

target_zarr_filename=f"/home/cwood/jbooks/folderOnColab/data/acs.zarr"
# Open the ZIP store in read-only mode
#z = zarr.open(target_zarr_filename, mode='r+')
z = zarr.open_group(target_zarr_filename, mode='r')

# Open the Zarr group from the store
#root = zarr.group(z)

# Now you can access the data in the Zarr file
# For example, to print the structure:
#print(root.tree())

# To access an array named 'data' (if it exists):
#data = root['product'][:]
#print(data.shape)

# Don't forget to close the store when you're done
#store.close()

"""

#### Dask

In [67]:
import pandas as pd
import dask.dataframe as dd
import dask.bag as db

target_dask_filename=f"{WORKING_FOLDER}{os.sep}data{os.sep}ACS"
      
df_dask = dd.from_pandas(df, npartitions=1)

# Convert DataFrame rows to dictionaries and create a Dask bag
df_bag = db.from_sequence(df.to_dict(orient='records'))

# Peek at the first few records; computing and printing the whole bag floods the notebook output
print(df_bag.take(3))

schema = {'name': 'People', 'doc': "Set of people's scores",
          'type': 'record',
          'fields': [
              {'name': 'name', 'type': 'string'},
              {'name': 'value', 'type': 'int'}]}


In [None]:
%%timeit
#df_bag.to_avro(f"{target_dask_filename}.*.avro", schema)  
df_bag.to_avro(f"{target_dask_filename}.*.avro")  

In [None]:
%%timeit
loaded_df_bag = db.read_avro(f"{target_dask_filename}.*.avro")
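
The `People` schema defined earlier describes `name` and `value` fields rather than the columns held in the bag, and the commented-out call shows `to_avro` being given a schema. An Avro record schema needs one entry per field in the records being written. The sketch below builds such a schema from a column-to-dtype mapping using plain Python. It assumes the `latitude`/`longitude`/`salinity` columns from the earlier sections, and both `avro_schema_from_dtypes` and the record name `GridCell` are hypothetical. With a DataFrame in hand, `df.dtypes.to_dict()` could supply the mapping.

```python
def avro_schema_from_dtypes(record_name, dtypes):
    """Build an Avro record schema from a {column: dtype-name} mapping."""
    # Mapping from common NumPy/pandas dtype names to Avro primitive types.
    avro_types = {
        "float64": "double",
        "float32": "float",
        "int64": "long",
        "int32": "int",
        "bool": "boolean",
        "object": "string",
    }
    fields = [{"name": column, "type": avro_types[str(dtype)]}
              for column, dtype in dtypes.items()]
    return {"name": record_name, "type": "record", "fields": fields}


schema = avro_schema_from_dtypes(
    "GridCell",  # hypothetical record name
    {"latitude": "float64", "longitude": "float64", "salinity": "float64"},
)
print(schema)
```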

In [70]:
df.head()

Unnamed: 0,index,Year,Month,Day,Hour,Minute,Second(UTC),Longitude(deg),Latitude(deg),Pressure(dbar),...,A709.2,A713.4,A717.1,A720.8,A724.6,A728.6,A732.1,A735.6,A738.9,A742.7
0,0,2018.0,7.0,18.0,3.0,54.0,55.25,82.50043,6.388097,56.674187,...,0.014609,0.018101,0.020589,0.023703,0.02616,0.02727,0.026489,0.027488,0.03079,0.032814
1,1,2018.0,7.0,18.0,3.0,54.0,56.25,82.500432,6.388067,56.763824,...,0.00761,0.0097,0.015743,0.019404,0.023228,0.028279,0.030916,0.030992,0.032134,0.030258
2,2,2018.0,7.0,18.0,3.0,54.0,57.249001,82.500433,6.388037,56.860547,...,0.018078,0.016237,0.015958,0.010981,0.00708,0.007578,0.005932,0.007313,0.007289,0.004662
3,3,2018.0,7.0,18.0,3.0,54.0,58.25,82.500434,6.388008,56.958323,...,0.018148,0.019505,0.021039,0.022217,0.023576,0.031535,0.033917,0.028271,0.02615,0.023259
4,4,2018.0,7.0,18.0,3.0,54.0,59.249001,82.500434,6.38798,57.056766,...,0.0125,0.013016,0.012733,0.010946,0.010736,0.012069,0.014006,0.01561,0.014729,0.019891


#### PySTAC

In [None]:
import pystac
from pystac.utils import datetime_to_str
from stacframes import df_from

# Sample pandas DataFrame
data = {'id': ['item1', 'item2'], 
        'geometry': [{'type': 'Point', 'coordinates': [1, 1]}, 
                     {'type': 'Point', 'coordinates': [2, 2]}],
        'datetime': [pd.Timestamp('2023-01-01'), pd.Timestamp('2023-01-02')],
        'properties': [{'prop1': 'value1'}, {'prop1': 'value2'}]}
df_stac = pd.DataFrame(data)

# Create a STAC Catalog (a description is required alongside the id)
catalog = pystac.Catalog(id="acs", description="Sample catalog built from a pandas DataFrame")

# Convert DataFrame rows to STAC Items and add to Catalog
for index, row in df_stac.iterrows():
    x, y = row['geometry']['coordinates']
    item = pystac.Item(id=row['id'], 
                        geometry=row['geometry'],
                        bbox=[x, y, x, y],  # a point's bounding box collapses to the point itself
                        datetime=row['datetime'].to_pydatetime(),
                        properties=row['properties'])
    catalog.add_item(item)

# Write the catalog and its items to disk
catalog.normalize_hrefs("./pystac_data")
catalog.save(catalog_type=pystac.CatalogType.SELF_CONTAINED)

# Read STAC catalog into a DataFrame
#df_from_stac = df_from(catalog)
#print(df_from_stac)

## Process

In [40]:
## Main routine that executes all code; currently a placeholder that only logs entry and exit.
#
#  @param inc_input_directory folder containing the input files
def process(inc_input_directory:str) -> None:

    rprint(f"Entering {__name__} {inspect.stack()[0][3]}")


    
    rprint(f"Exiting {__name__} {inspect.stack()[0][3]}")


# Main Routine (call all other routines)

In [41]:
if __name__ == "__main__":

    #note that this design now deviates from previous methods.  
    #Implementation will assume a single execution of a single PIID folder, scanning results and 
    #appending metrics to a single ASCII file as the code proceeds thus ensuring multi-processor, *nix driven execution.
    
    start_t = perf_counter()
    print("BEGIN PROGRAM")
    
    ############################################
    # CONSTANTS
    ############################################
    
    # Semantic Versioning
    VERSION_NAME    = "MLDATAREADY"
    VERSION_MAJOR   = 0
    VERSION_MINOR   = 0
    VERSION_RELEASE = 1

    # location of our working files
    #WORKING_FOLDER="/content/folderOnColab"
    WORKING_FOLDER="./folderOnColab"
    input_directory="./folderOnColab"
    output_directory="./folderOnColab"

    # Notebook Author details
    AUTHOR_NAME="Christopher G Wood"
    GITHUB_USERNAME="christophergarthwood"
    AUTHOR_EMAIL="christopher.g.wood@gmail.com"

    # Encoding
    ENCODING  ="utf-8"
    os.environ['PYTHONIOENCODING']=ENCODING

    BOLD_START = "\033[1m"
    BOLD_END = "\033[0;0m"
    TEXT_WIDTH=77

    #You can also adjust the verbosity by changing the value of TF_CPP_MIN_LOG_LEVEL:
    #
    #0 = all messages are logged (default behavior)
    #1 = INFO messages are not printed
    #2 = INFO and WARNING messages are not printed
    #3 = INFO, WARNING, and ERROR messages are not printed
    TF_CPP_MIN_LOG_LEVEL_SETTING=0
    
    # Set the Seed for the experiment (ask me why?)
    # seed the pseudorandom number generator
    # THIS IS ESSENTIAL FOR CONSISTENT MODEL OUTPUT, remember these are random in nature.
    SEED_INIT=7
    random.seed(SEED_INIT)
    tf.random.set_seed(SEED_INIT)
    np.random.seed(SEED_INIT)    
    
    DEBUG_STACKTRACE = 0
    DEBUG_USING_GPU = 1    
    NUM_PROCESSORS=10

    #make comparisons lower case so file extensions match regardless of case
    EXTENSIONS=["nc"]
    LOWER_EXTENSIONS = [x.lower() for x in EXTENSIONS]

    THE_DEVICE_NAME="/job:localhost/replica:0/task:0/device:CPU:0"
    if DEBUG_USING_GPU==1:
        THE_DEVICE_NAME="/job:localhost/replica:0/task:0/device:GPU:0"
    
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    warnings.filterwarnings("ignore", category=FutureWarning)
    warnings.filterwarnings("ignore", category=UserWarning)
    
    # GPU Setup (for multiple GPU devices)
    device = torch.cuda.current_device() if torch.cuda.is_available() else None

    # software watermark
    lib_diagnostics()

    # hardware specs
    get_hardware_stats()

    # - Core workhorse routine
    process(input_directory)
    # - Save the results
    #save_output(docs, output_directory, "policy")    
    
    end_t = perf_counter()
    print("END PROGRAM")
    print(f"Elapsed time: {end_t - start_t}")

BEGIN PROGRAM


NameError: name 'set_library_configuration' is not defined