# Preparing AI-ready PHIPS Image Classification Data
##### This notebook reads in the clean versions of the P-3 aircraft met/nav and PHIPS ice crystal image datasets to create an AI-ready xarray.Dataset that contains 2-D numpy arrays for each image and corresponding meta data from the P-3 met/nav dataset (e.g. P-3 lat/lon coordinates, temperature, altitude)

In [10]:
# Import necessary packages
import tarfile
import numpy as np
import pandas as pd
import os
import glob 
from PIL import Image
import io 
import xarray as xr

In [11]:
# Read in P-3 met/nav data for each flight

dates = ['2020-02-07', '2022-02-17', '2023-01-23', '2023-02-14'] # flight dates

for date in dates:
    datestr = date.replace('-', '')
    fname_p3 = glob.glob(f'/home/disk/meso-home/vgarcia1/PHIPS_classification/MLGEO2024_Snowflake_Classification/data/clean/{date}_P3_MetNav.nc')[0]

    globals()[f'p3_nav_{datestr}'] = xr.open_dataset(fname_p3)

    print(f'p3_nav_{datestr}')
    print(globals()[f'p3_nav_{datestr}'])
    print()

p3_nav_20200207
<xarray.Dataset>
Dimensions:      (time: 21359)
Coordinates:
  * time         (time) datetime64[ns] 2020-02-07T14:05:47 ... 2020-02-07T20:...
Data variables: (12/35)
    lon          (time) float64 ...
    lat          (time) float64 ...
    alt_gps      (time) float64 ...
    alt_pres     (time) float64 ...
    alt_radar    (time) float64 ...
    grnd_spd     (time) float64 ...
    ...           ...
    svp_ice      (time) float64 ...
    rh           (time) float64 ...
    zenith       (time) float64 ...
    sun_elev_P3  (time) float64 ...
    sun_az       (time) float64 ...
    sun_az_P3    (time) float64 ...
Attributes: (12/27)
    Experiment:           IMPACTS
    Platform:             Wallops P-3 N426NA
    Mission PI:           Lynn McMurdie (lynnm@uw.edu)
    PI_CONTACT_INFO:      Melissa Yang Martin, m.yang@baeri.org
    LOCATION:             Included in data records
    ASSOCIATED_DATA:      This flight represents IMPACTS Science Flight #5 ou...
    ...       

In [12]:
# List of ice crystal habit types with corresponding folder names
habits = {
    'aggregates': 'Aggregates', 'bullet_rosettes': 'Bullet-rosettes', 'capped_columns': 'Capped-columns', 
    'columns': 'Columns', 'dendrites': 'Dendrites', 'graupel': 'Graupel', 
    'needles': 'Needles', 'plates': 'Plates', 'polycrystals': 'Polycrystals', 
    'side_planes': 'Side-planes', 'tiny': 'Tiny'
}

# List of ice crystal habit types in singular form
habits_singular = ['aggregate', 'bullet_rosette', 'capped_column', 'column', 'dendrite', 'graupel', 'needle', 'plate', 'polycrystal', 'side_plane', 'tiny']

# Flight navigation datasets
nav_datasets = {
    "2020-02-07": p3_nav_20200207,
    "2022-02-17": p3_nav_20220217,
    "2023-01-23": p3_nav_20230123,
    "2023-02-14": p3_nav_20230214
}

In [13]:
# Define a list to hold the dataset entries
dataset_entries = []

# Loop through each habit type
for habit, folder_name in habits.items():
    tar_file = f'/home/disk/meso-home/vgarcia1/PHIPS_classification/MLGEO2024_Snowflake_Classification/data/clean/PHIPS_{habit}.tar.gz'  # Construct the tar.gz file name
    
    # Make label singular if necessary
    if habit == 'tiny' or habit == 'graupel':
        label = habit
    else:
        label = habit[:-1]  # Assign the label based on the habit type

    print(f"Extracting images for {habit}...")

    # Open the tar.gz file
    with tarfile.open(tar_file, "r:gz") as tar:
        for member in tar.getmembers():
            # Check if the member is a file and ends with .png
            if member.isfile() and member.name.lower().endswith(".png"):
                # Check if the file is inside the folder named according to the corresponding folder name
                if os.path.dirname(member.name).split('/')[-1] == folder_name:
                    print(member.name)
                    # Extract and read image data
                    file = tar.extractfile(member)
                    if file is not None:
                        image = Image.open(io.BytesIO(file.read()))
                        image_array = np.array(image)
                        
                        # Extract timestamp from the filename
                        parts = member.name.split('_')

                        date2 = parts[4][:8]  # Date of the flight
                        time2 = parts[4][8:]  # Time of the flight (HHMMSS format)

                        timestamp = f"{date2}T{time2[:2]}:{time2[2:4]}:{time2[4:]}"  # Convert to ISO format
                        
                        # Convert date2 to match the format in nav_datasets keys (YYYY-MM-DD)
                        date2_formatted = f"{date2[:4]}-{date2[4:6]}-{date2[6:]}"

                        # Match the date with the appropriate navigation dataset
                        nav_data = nav_datasets.get(date2_formatted)

                        # Extract lat, lon, and temperature from the nav data if available
                        if nav_data is not None:
                            # Correct the timestamp format to "YYYY-MM-DDTHH:MM:SS"
                            timestamp = f"{date2[:4]}-{date2[4:6]}-{date2[6:]}T{time2[:2]}:{time2[2:4]}:{time2[4:]}"
                            time = np.datetime64(timestamp)
                            
                            # Print the timestamp from the image filename
                            print("Timestamp of the image:", timestamp)
                            
                            # Get the time values from nav_data
                            nav_time = nav_data['time'].values  # Assuming time is an array in nav_data

                            # Convert nav_time to np.datetime64 if necessary
                            if not np.issubdtype(nav_time.dtype, np.datetime64):
                                print("Converting nav_time to np.datetime64")
                                nav_time = nav_time.astype('datetime64[s]')

                            # Find the nearest time index
                            nearest_time_idx = np.abs(nav_time - time).argmin()

                            # Print the timestamp from the nav_data
                            print("Timestamp from nav_data:", nav_data.time[nearest_time_idx].values)

                            # Extract latitude, longitude, and temperature
                            lat = nav_data['lat'].isel(time=nearest_time_idx).values
                            lon = nav_data['lat'].isel(time=nearest_time_idx).values
                            temperature = np.round(nav_data['temp'].isel(time=nearest_time_idx).values, 3)
                            altitude = np.round(nav_data['alt_gps'].isel(time=nearest_time_idx).values/1000, 3) # Convert to km
                        else:
                            lat = np.nan
                            lon = np.nan
                            temperature = np.nan
                            altitude = np.nan
                        
                        # Create a dictionary for this entry
                        dataset_entry = { 
                            "image_array": image_array,
                            "timestamp": timestamp,
                            'label': label,  # Use singular label from the mapping
                            "latitude": lat,
                            "longitude": lon,
                            "temperature": temperature,
                            "altitude": altitude
                        }

                        # Print dimensions of the image array
                        print("Image array shape:", image_array.shape)
                        print()


                        dataset_entries.append(dataset_entry)


Extracting images for aggregates...
Aggregates/IMPACTS_PHIPS_20200207_1347_20200207151036_003174_C1.png
Timestamp of the image: 2020-02-07T15:10:36
Timestamp from nav_data: 2020-02-07T15:10:36.000000000
Image array shape: (1024, 1360)

Aggregates/IMPACTS_PHIPS_20200207_1347_20200207151037_003177_C1.png
Timestamp of the image: 2020-02-07T15:10:37
Timestamp from nav_data: 2020-02-07T15:10:37.000000000
Image array shape: (1024, 1360)

Aggregates/IMPACTS_PHIPS_20200207_1347_20200207151039_003183_C1.png
Timestamp of the image: 2020-02-07T15:10:39
Timestamp from nav_data: 2020-02-07T15:10:39.000000000
Image array shape: (1024, 1360)

Aggregates/IMPACTS_PHIPS_20200207_1347_20200207151040_003184_C1.png
Timestamp of the image: 2020-02-07T15:10:40
Timestamp from nav_data: 2020-02-07T15:10:40.000000000
Image array shape: (1024, 1360)

Aggregates/IMPACTS_PHIPS_20200207_1347_20200207151040_003185_C1.png
Timestamp of the image: 2020-02-07T15:10:40
Timestamp from nav_data: 2020-02-07T15:10:40.0000000

In [14]:
print("Number of dataset entries:", len(dataset_entries))
print("Sample dataset entry:", dataset_entries[0])

Number of dataset entries: 440
Sample dataset entry: {'image_array': array([[204, 201, 198, ...,  88,  75,  71],
       [206, 203, 200, ...,  94,  80,  71],
       [206, 206, 202, ...,  97,  88,  71],
       ...,
       [255, 254, 249, ..., 126, 122, 115],
       [255, 255, 249, ..., 135, 123, 115],
       [255, 255, 249, ..., 136, 124, 117]], dtype=uint8), 'timestamp': '2020-02-07T15:10:36', 'label': 'aggregate', 'latitude': array(43.124961), 'longitude': array(43.124961), 'temperature': -15.21, 'altitude': 5.147}


In [18]:
# Define lists to hold dataset entries
image_arrays = []
timestamps = []
labels = []
latitudes = []
longitudes = []
temperatures = []
altitudes = []

# Iterate over dataset_entries to fill arrays/lists
for entry in dataset_entries:
    # Append the image array and other attributes to lists
    image_arrays.append(entry['image_array'])  # Keep the original shape
    timestamps.append(entry['timestamp'])
    labels.append(entry['label'])
    latitudes.append(entry['latitude'])
    longitudes.append(entry['longitude'])
    temperatures.append(entry['temperature'])
    altitudes.append(entry['altitude'])

# Convert lists to numpy arrays
image_arrays = np.array(image_arrays)  # Shape will be (samples, height, width)
timestamps = np.array(timestamps)
labels = np.array(labels)
latitudes = np.array(latitudes)
longitudes = np.array(longitudes)
temperatures = np.array(temperatures)
altitudes = np.array(altitudes)


# Create an Xarray dataset
ds = xr.Dataset(
    {
        'image_array': (['samples', 'height', 'width'], image_arrays),  # Use 3D shape (samples, height, width)
        'timestamp': ('samples', timestamps),
        'label': ('samples', labels),
        'latitude': ('samples', latitudes),
        'longitude': ('samples', longitudes),
        'temperature': ('samples', temperatures),
        'altitude': ('samples', altitudes),
    },
    coords={
        'samples': np.arange(len(dataset_entries)),  # Coordinate for the samples
    }
)

# Adding global attributes
ds.attrs['description'] = 'AI-ready dataset for ice crystal habit classification in high-resolution images taken by the PHIPS instrument during the NASA IMPACTS field campaign.'
ds.attrs['creation_date'] = '2024-10-24'
ds.attrs['author'] = 'Valeria Garcia (vgarcia1@uw.edu)'

# Adding attributes to a specific variable
ds['image_array'].attrs['units'] = 'pixel values'
ds['latitude'].attrs['units'] = 'degrees_north'
ds['longitude'].attrs['units'] = 'degrees_east'
ds['temperature'].attrs['units'] = 'Celsius'
ds['altitude'].attrs['units'] = 'kilometers'


In [19]:
# Print the dataset to check its structure
print(ds)

<xarray.Dataset>
Dimensions:      (samples: 440, height: 1024, width: 1360)
Coordinates:
  * samples      (samples) int64 0 1 2 3 4 5 6 7 ... 433 434 435 436 437 438 439
Dimensions without coordinates: height, width
Data variables:
    image_array  (samples, height, width) uint8 204 201 198 192 ... 157 157 156
    timestamp    (samples) <U19 '2020-02-07T15:10:36' ... '2023-01-23T16:57:54'
    label        (samples) <U14 'aggregate' 'aggregate' ... 'tiny' 'tiny'
    latitude     (samples) float64 43.12 43.12 43.12 43.12 ... 44.84 44.79 44.65
    longitude    (samples) float64 43.12 43.12 43.12 43.12 ... 44.84 44.79 44.65
    temperature  (samples) float64 -15.21 -15.28 -15.25 ... -13.14 -5.29 -5.31
    altitude     (samples) float64 5.147 5.147 5.147 5.147 ... 3.967 2.674 2.686
Attributes:
    description:    AI-ready dataset for ice crystal habit classification in ...
    creation_date:  2024-10-24
    author:         Valeria Garcia (vgarcia1@uw.edu)


In [20]:
# Save the dataset to a NetCDF file (outside of GitHub repo since the file size exceeds the Github limit)
ds.to_netcdf('/home/disk/meso-home/vgarcia1/PHIPS_classification/PHIPS_CrystalHabitAI_Dataset.nc')

## PHIPS Crystal Habit AI-Ready Dataset 

#### Size of NetCDF file: 584.4 MB

#### Note: Due to the size of the dataset exceeding 100 MB, it cannot be directly uploaded to Github in data/ai_ready. Instead, the file was downloaded locally and shared on Google Drive. The dataset can be downloaded automatically by clicking this link: https://drive.google.com/uc?id=1gnfZpiBP954-qddiRfEZInfuAoMI7Y__

#### The dataset contains the following variables:


- `image_array`: A 3D array of greyscale images with dimensions `(samples, height, width)`. Each entry represents a 2D array of pixel intensities for an image.
- `timestamp`: The timestamp for each ice crystal image in ISO format (`YYYY-MM-DDTHH:MM:SS`), indicating the time it was captured during a flight.
- `label`: The classified habit type of the ice crystal, with labels in singular form (e.g., 'aggregate', 'column').
- `latitude`: The geographic latitude coordinate where the image was captured.
- `longitude`: The geographic longitude coordinate where the image was captured.
- `temperature`: The air temperature (in degrees Celsius) at the time and location corresponding to each image.
- `altitude`: The flight altitude in kilometers at the time the image was captured.

#### This dataset contains a total of 440 samples of various ice crystals images imaged during the NASA IMPACTS deployments (on flights 2020-02-07, 2022-02-17, 2023-01-23, or 2023-02-14), each converted to numpy array. All images are in greyscale and equally sized (1024 height x 1360 width pixels). Each image is labeled with one of 11 ice crystal habit categories (aggregate, bullet-rosette, capped column, column, dendrite, graupel, meedle, plate, polycrystal, side-plane, tiny), with 40 images labeled per category.

