Brent Knopp  
CS 5621  
Assignment 2  
University of Idaho

#Overview:

This code implements a complete metadata generation and validation pipeline for silver and non-silver image datasets. It reads existing CSV records, scans image folders in Google Drive, verifies file existence, and extracts image-level properties such as image format (JPEG), file size, color mode, dimensions, and aspect ratio. Missing or unavailable images are explicitly flagged. The resulting metadata is organized into structured dictionaries and converted into tabular form. Finally, the metadata for each dataset is exported to both JSON (hierarchical) and CSV (tabular) formats, providing persistent, verifiable records for dataset integrity checks and downstream machine learning analysis.

#1. Setup
This block imports the required libraries for data handling, file operations, web requests, and image processing, and mounts Google Drive so files can be accessed and saved within the Colab environment.

In [8]:
# imports
import pandas as pd
import os
import json
import requests
from io import BytesIO
from PIL import Image
# google drive setup
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#2. Test Integrity of Data Download
This code performs consistency and integrity checks between image files stored in Google Drive and their corresponding metadata stored in CSV files. For both the silver and non-silver image datasets, it verifies that the number of downloaded images recorded in the dataframe matches the actual number of JPEG images present in the image folders. It also checks each row marked as downloaded to ensure that the expected image file (image_<index>.jpg) exists in the correct directory. Any missing images are identified and reported, allowing for validation and debugging of dataset completeness before model training.

In [9]:
# file path
# non-silver image path
path_1 = "/content/drive/MyDrive/CS5621/non_silver_images"
# silver image path
path_2 = "/content/drive/MyDrive/CS5621/silver_images"

# read csv file
df_non_silver = pd.read_csv("/content/drive/MyDrive/CS5621/non_silver.csv")
df_silver = pd.read_csv("/content/drive/MyDrive/CS5621/silver.csv")

def check_downloaded(path, df):
  # count the rows marked as downloaded in the dataframe
  downloaded_count = (df["downloaded"] == 1).sum()

  # obtain actual jpeg image count in folder
  file_count = len([
      f for f in os.listdir(path)
      if f.lower().endswith((".jpg", ".jpeg"))
  ])

  # get name of folder
  folder_name = os.path.basename(path)

  # test 1: downloaded images equal the downloaded count in the dataframe
  # display header and test 1 results
  print(f"Folder: {folder_name} vs DataFrame: {folder_name}")
  print("test actual downloaded versus DataFrame count")
  print("----------------------------------------------------")
  print("Test 1")
  print("Downloaded (df):", downloaded_count)
  print("JPG/JPEG files in folder:", file_count)
  print("Difference (folder - df):", file_count - downloaded_count)
  print(f"Pass: {not(bool(file_count - downloaded_count))}")
  print("----------------------------------------------------")
  print("Test 2")

  # test 2: for rows with downloaded==1, confirm image_<index>.jpg exists
  missing_images = []
  not_dl = 0

  # cycle dataframe to examine all downloaded cases
  for i in df.index:
    # build file path for downloaded images
    if df.loc[i, "downloaded"] == 1:
        image_name = f"image_{i}.jpg"
        image_path = os.path.join(path, image_name)  # <-- use full path
        # check whether the image exists
        if not os.path.exists(image_path):
            missing_images.append(i)
            not_dl += 1

  # display results
  print("Not downloaded:", not_dl)
  print("Rows marked downloaded but missing file:", len(missing_images))
  print("First missing indices:", missing_images[:10])
  print(f"All files exist: {len(missing_images) == 0}")
  print("----------------------------------------------------\n\n")

# check non-silver dataset for test 1 and test 2
check_downloaded(path_1, df_non_silver)
# check silver dataset for test 1 and test 2
check_downloaded(path_2, df_silver)

Folder: non_silver_images vs DataFrame: non_silver_images
test actual downloaded versus DataFrame count
----------------------------------------------------
Test 1
Downloaded (df): 16695
JPG/JPEG files in folder: 16695
Difference (folder - df): 0
Pass: True
----------------------------------------------------
Test 2
Not downloaded: 0
Rows marked downloaded but missing file: 0
First missing indices: []
All files exist: True
----------------------------------------------------


Folder: silver_images vs DataFrame: silver_images
test actual downloaded versus DataFrame count
----------------------------------------------------
Test 1
Downloaded (df): 3869
JPG/JPEG files in folder: 3869
Difference (folder - df): 0
Pass: True
----------------------------------------------------
Test 2
Not downloaded: 0
Rows marked downloaded but missing file: 0
First missing indices: []
All files exist: True
----------------------------------------------------
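
The two tests above look from the dataframe toward the folder. A complementary check, sketched below, also looks the other way and reports files on disk that no downloaded row accounts for. The helper name `compare_folder_to_indices` and the filename pattern are assumptions for illustration, not part of the original notebook, and the demo uses a temporary folder so it runs without Google Drive.

```python
import os
import re
import tempfile

# Hypothetical helper (not in the original notebook): compare the indices
# encoded in image_<index>.jpg filenames with the indices marked downloaded.
IMAGE_NAME = re.compile(r"^image_(\d+)\.jpe?g$", re.IGNORECASE)


def compare_folder_to_indices(folder, downloaded_indices):
    """Return (missing, orphans) as sorted lists of integer indices."""
    on_disk = set()
    for name in os.listdir(folder):
        match = IMAGE_NAME.match(name)
        if match:
            on_disk.add(int(match.group(1)))
    expected = set(downloaded_indices)
    missing = sorted(expected - on_disk)  # marked downloaded, but no file
    orphans = sorted(on_disk - expected)  # file present, but not marked downloaded
    return missing, orphans


# Demo on a throwaway folder holding image_0.jpg, image_2.jpg and image_5.jpg.
with tempfile.TemporaryDirectory() as tmp:
    for idx in (0, 2, 5):
        open(os.path.join(tmp, f"image_{idx}.jpg"), "wb").close()
    missing, orphans = compare_folder_to_indices(tmp, [0, 1, 2])
    print("missing:", missing)
    print("orphans:", orphans)
```

With the notebook's data, the second argument could be `df.index[df["downloaded"] == 1]`, assuming the dataframe index matches the number in each filename as it does in `check_downloaded`.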




#3. Data Confirmation Metrics
This code evaluates the integrity of both the silver and non-silver image datasets by comparing the metadata stored in pandas dataframes with the actual image files stored in Google Drive. For each dataset, it counts the number of JPEG image files present in the corresponding image folder and compares this count to the number of records in the dataframe. It then reports any difference between the expected and actual number of images. Additionally, the code counts how many images were marked as not downloaded in the metadata, helping identify missing or failed downloads. Together, these checks ensure that the recorded dataset information accurately reflects the image files available for analysis.

In [10]:
# silver image test metric
path_1 = "/content/drive/MyDrive/CS5621/silver_images"
# count images in silver folder
file_count = len([
      f for f in os.listdir(path_1)
      if f.lower().endswith((".jpg", ".jpeg"))])
# display number of rows recorded in the silver dataframe
print("Silver df: ", len(df_silver))
# display actual images in silver folder
print("actual images: ", file_count)
# difference between actual images and recorded rows
print("Difference: ", file_count - len(df_silver))
# number of images that did not get downloaded
# (single quotes inside the f-string keep this valid before Python 3.12)
print(f"Non-download: {(df_silver['downloaded'] == 0).sum()}")

# nonsilver image test metric
path_2 = "/content/drive/MyDrive/CS5621/non_silver_images"
# count images in nonsilver folder
file_count = len([
      f for f in os.listdir(path_2)
      if f.lower().endswith((".jpg", ".jpeg"))
  ])
# display number of rows recorded in the nonsilver dataframe
print("\n\nNon-Silver df: ", len(df_non_silver))
# display actual images in nonsilver folder
print("actual images: ", file_count)
# difference between actual images and recorded rows
print("Difference: ", file_count - len(df_non_silver))
# number of images that did not get downloaded
# (single quotes inside the f-string keep this valid before Python 3.12)
print(f"Non-download: {(df_non_silver['downloaded'] == 0).sum()}")

Silver df:  5124
actual images:  3869
Difference:  -1255
Non-download: 1255


Non-Silver df:  25620
actual images:  16695
Difference:  -8925
Non-download: 8925
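
The printed figures can be cross-checked with simple arithmetic: if every row is either a file on disk or a failed download, then rows minus non-downloads should equal the file count. The sketch below hard-codes the numbers from the output above. The dictionary layout and the name `is_consistent` are illustrative assumptions, and it presumes the `downloaded` column holds only 0 and 1.

```python
# Figures copied from the printed output above.
reported = {
    "silver": {"rows": 5124, "files": 3869, "not_downloaded": 1255},
    "non_silver": {"rows": 25620, "files": 16695, "not_downloaded": 8925},
}


def is_consistent(stats):
    """True when rows minus failed downloads equals the files on disk."""
    return stats["rows"] - stats["not_downloaded"] == stats["files"]


results = {name: is_consistent(stats) for name, stats in reported.items()}
print(results)
```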


#4. Produce Metadata From Data

This step generates structured metadata from the raw image data by iterating over every row of each dataset's CSV and attempting to open the matching image_<index>.jpg file. For each image that opens, the code records the image index, label, mineral name, format, file size, color mode, width, height, and aspect ratio. Rows whose image is missing or unreadable are kept in the dictionary but flagged with the label "drop". The resulting metadata serves as the reference for dataset validation and downstream model training.


#Silver Images

In [11]:
# silver image metadata
silver_path = "/content/drive/MyDrive/CS5621/silver_images"
silver_df = pd.read_csv("/content/drive/MyDrive/CS5621/silver.csv")
# silver image metadata
silver_metadata = {}
# image count
count = 0
# successful images
opened = 0
# failed images
dropped = 0
# number of rows in the dataframe (downloaded or not)
N = len(silver_df)
print("Total rows:", N)
# extract metadata
for i in range(N):
    filename = f"image_{i}.jpg"
    image_path = os.path.join(silver_path, filename)
    count += 1
    # image opened successfully
    try:
        with Image.open(image_path) as img:
            img.load()
            opened += 1
            width, height = img.size
            # get metadata
            silver_metadata[filename] = {
              "image_index": i,
              "label": 0,
              "mineral": silver_df.loc[i, "mineral_name"],
              "format": img.format,
              "file_size": os.path.getsize(image_path),
              "mode": img.mode,
              "image_width": width,
              "image_height": height,
              "aspect_ratio": (width / height) if height else None
            }
    # image missing or unreadable
    except Exception:
        dropped += 1
        # flag to drop image
        silver_metadata[filename] = {
          "image_index": i,
          "label": "drop"
        }

# display results
print("\nFINISHED")
print("Total processed:", count)
print("Opened:", opened)
print("Dropped:", dropped)
print("Total entries:", len(silver_metadata))
# validation entry
if silver_metadata:
    print("\nFirst 3 entries:")
    for k, v in list(silver_metadata.items())[:3]:
        print(k, "->", v)
else:
    print("Metadata dictionary is empty.")

Total rows: 5124

FINISHED
Total processed: 5124
Opened: 3869
Dropped: 1255
Total entries: 5124

First 3 entries:
image_0.jpg -> {'image_index': 0, 'label': 0, 'mineral': 'Silver', 'format': 'JPEG', 'file_size': 52135, 'mode': 'RGB', 'image_width': 356, 'image_height': 403, 'aspect_ratio': 0.8833746898263027}
image_1.jpg -> {'image_index': 1, 'label': 0, 'mineral': 'Silver', 'format': 'JPEG', 'file_size': 58291, 'mode': 'RGB', 'image_width': 423, 'image_height': 338, 'aspect_ratio': 1.2514792899408285}
image_2.jpg -> {'image_index': 2, 'label': 0, 'mineral': 'Silver', 'format': 'JPEG', 'file_size': 29233, 'mode': 'RGB', 'image_width': 335, 'image_height': 428, 'aspect_ratio': 0.7827102803738317}


#Non-Silver Images

In [14]:
# nonsilver image metadata
nonsilver_path = "/content/drive/MyDrive/CS5621/non_silver_images"
nonsilver_df = pd.read_csv("/content/drive/MyDrive/CS5621/non_silver.csv")
# nonsilver image metadata
nonsilver_metadata = {}
# image count
count = 0
# successful images
opened = 0
# failed images
dropped = 0
# number of rows in the dataframe (downloaded or not)
N = len(nonsilver_df)
print("Total rows:", N)
# extract metadata
for i in range(N):
    filename = f"image_{i}.jpg"
    image_path = os.path.join(nonsilver_path, filename)
    count += 1
    # image opened successfully
    try:
        with Image.open(image_path) as img:
            img.load()
            opened += 1
            width, height = img.size
            # get metadata
            nonsilver_metadata[filename] = {
                "image_index": i,
                "label": 0,
                "mineral": nonsilver_df.loc[i, "mineral_name"],
                "format": img.format,
                "file_size": os.path.getsize(image_path),
                "mode": img.mode,
                "image_width": width,
                "image_height": height,
                "aspect_ratio": (width / height) if height else None
            }
    # image missing or unreadable
    except Exception:
        dropped += 1
        # flag to drop image
        nonsilver_metadata[filename] = {
            "image_index": i,
            "label": "drop"
        }

# display results
print("\nFINISHED")
print("Total processed:", count)
print("Opened:", opened)
print("Dropped:", dropped)
print("Total entries:", len(nonsilver_metadata))
# validation entry
if nonsilver_metadata:
    print("\nFirst 3 entries:")
    for k, v in list(nonsilver_metadata.items())[:3]:
        print(k, "->", v)
else:
    print("Metadata dictionary is empty.")

Total rows: 25620

FINISHED
Total processed: 25620
Opened: 16695
Dropped: 8925
Total entries: 25620

First 3 entries:
image_0.jpg -> {'image_index': 0, 'label': 0, 'mineral': 'Galena', 'format': 'JPEG', 'file_size': 75564, 'mode': 'RGB', 'image_width': 1024, 'image_height': 768, 'aspect_ratio': 1.3333333333333333}
image_1.jpg -> {'image_index': 1, 'label': 'drop'}
image_2.jpg -> {'image_index': 2, 'label': 0, 'mineral': 'Hematite', 'format': 'JPEG', 'file_size': 107742, 'mode': 'RGB', 'image_width': 700, 'image_height': 490, 'aspect_ratio': 1.4285714285714286}
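
The silver and non-silver cells run the same loop with different inputs, so the logic could be folded into one function. The following is a minimal sketch under that assumption, not the notebook's own code: `extract_metadata` and its `minerals` and `label` parameters are hypothetical names, and the demo writes a small test image to a temporary folder instead of reading from Google Drive.

```python
import os
import tempfile

from PIL import Image


def extract_metadata(folder, minerals, label):
    """Build {filename: metadata} for image_<i>.jpg, one entry per mineral row.

    `minerals` is a sequence of mineral names ordered by row position
    (hypothetical parameter; the notebook reads these from the dataframe).
    """
    metadata = {}
    for i, mineral in enumerate(minerals):
        filename = f"image_{i}.jpg"
        image_path = os.path.join(folder, filename)
        try:
            with Image.open(image_path) as img:
                img.load()  # force a full decode so damaged files fail here
                width, height = img.size
                metadata[filename] = {
                    "image_index": i,
                    "label": label,
                    "mineral": mineral,
                    "format": img.format,
                    "file_size": os.path.getsize(image_path),
                    "mode": img.mode,
                    "image_width": width,
                    "image_height": height,
                    "aspect_ratio": (width / height) if height else None,
                }
        except Exception:
            # same convention as the cells above: flag the row for dropping
            metadata[filename] = {"image_index": i, "label": "drop"}
    return metadata


# Demo: image_0.jpg exists (40x20 pixels), image_1.jpg is deliberately absent.
with tempfile.TemporaryDirectory() as tmp:
    Image.new("RGB", (40, 20)).save(os.path.join(tmp, "image_0.jpg"), "JPEG")
    demo = extract_metadata(tmp, ["Silver", "Galena"], label=0)

print(demo["image_0.jpg"]["aspect_ratio"])
print(demo["image_1.jpg"])
```

Both cells above assign the label 0; the `label` parameter exists only so that a single function can serve either dataset with whatever value is intended.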


#5. Write Metadata Files
This code exports the generated metadata for both the silver and non-silver image datasets to persistent storage in Google Drive. For each dataset, the metadata is saved in two formats: JSON and CSV. The JSON files preserve the hierarchical dictionary structure keyed by image filename, while the CSV files provide a tabular representation suitable for inspection and downstream data analysis. Status messages are printed after each write operation to confirm that the files were successfully created.

In [13]:
# write silver and nonsilver metadata to json and csv files in google drive

# convert the nonsilver dictionary to a DataFrame (one row per image)
nonsilver_metadata_df = pd.DataFrame.from_dict(
    nonsilver_metadata,
    orient="index"
).reset_index(drop=True)

# convert the silver dictionary to a DataFrame (one row per image)
silver_metadata_df = pd.DataFrame.from_dict(
    silver_metadata,
    orient="index"
).reset_index(drop=True)

# write JSON (nonsilver images)
json_path = "/content/drive/MyDrive/CS5621/non_silver_metadata.json"
with open(json_path, "w") as f:
    json.dump(nonsilver_metadata, f, indent=2)
# display results
print("JSON file written to:", json_path)

# write CSV (nonsilver)
csv_path = "/content/drive/MyDrive/CS5621/non_silver_metadata.csv"
nonsilver_metadata_df.to_csv(csv_path, index=True)
# display results
print("CSV file written to:", csv_path)

# write JSON (silver)
json_path = "/content/drive/MyDrive/CS5621/silver_metadata.json"
with open(json_path, "w") as f:
    json.dump(silver_metadata, f, indent=2)
# display results
print("JSON file written to:", json_path)

# write CSV (silver)
csv_path = "/content/drive/MyDrive/CS5621/silver_metadata.csv"
silver_metadata_df.to_csv(csv_path, index=True)
# display results
print("CSV file written to:", csv_path)

JSON file written to: /content/drive/MyDrive/CS5621/non_silver_metadata.json
CSV file written to: /content/drive/MyDrive/CS5621/non_silver_metadata.csv
JSON file written to: /content/drive/MyDrive/CS5621/silver_metadata.json
CSV file written to: /content/drive/MyDrive/CS5621/silver_metadata.csv
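
A quick way to confirm that exported files are usable is to read them back. The sketch below does this on a two-entry stand-in dictionary shaped like the printed entries, writing to a temporary folder. It also illustrates one design point: `reset_index(drop=True)` discards the filename keys, whereas `rename_axis("filename").reset_index()` keeps them as a column. The variable names and the `filename` column are assumptions for illustration.

```python
import json
import os
import tempfile

import pandas as pd

# Stand-in entries shaped like the metadata dictionaries built earlier.
sample_metadata = {
    "image_0.jpg": {"image_index": 0, "label": 0, "mineral": "Silver",
                    "image_width": 356, "image_height": 403},
    "image_1.jpg": {"image_index": 1, "label": "drop"},
}

# orient="index" turns each filename key into a row label.
table = pd.DataFrame.from_dict(sample_metadata, orient="index")

# Keep the row labels as a regular "filename" column instead of dropping them.
table_with_names = table.rename_axis("filename").reset_index()

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "sample_metadata.json")
    csv_path = os.path.join(tmp, "sample_metadata.csv")

    # write both formats
    with open(json_path, "w") as f:
        json.dump(sample_metadata, f, indent=2)
    table_with_names.to_csv(csv_path, index=False)

    # read both formats back
    with open(json_path) as f:
        reloaded_json = json.load(f)
    reloaded_csv = pd.read_csv(csv_path)

json_round_trip_ok = reloaded_json == sample_metadata
csv_filenames = reloaded_csv["filename"].tolist()
print("JSON round trip:", json_round_trip_ok)
print("CSV filenames:", csv_filenames)
```

Keeping the filename column makes the CSV self-describing, although `image_index` already allows each row to be matched back to its file.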
