Brent Knopp  
CS 5621  
Assignment 2  
University of Idaho

#Overview:
This code implements a complete data preparation and image collection pipeline for a silver versus non-silver mineral classification task. The process begins by mounting Google Drive, loading a raw CSV file of mineral image URLs, and validating the dataset structure. The data are then cleaned by removing malformed entries, standardizing mineral names, filtering valid image URLs, and adding metadata fields to track download status.

Mineral samples are labeled into binary silver and non-silver classes using keyword-based matching, and the non-silver class is further subdivided into hard negatives and easy negatives based on visual similarity to silver-bearing minerals. Dataset integrity checks are performed to ensure no silver samples leak into negative subsets. A training dataset is then constructed using a fixed 5:1 non-silver–to–silver ratio, with non-silver samples composed of a controlled mix of hard and easy negatives.

A custom image download function retrieves images from their URLs, validates content types, saves them into structured Google Drive directories, and records download success or failure directly in the associated DataFrames. Images that fail to download are retained in the dataset and explicitly marked as not downloaded. Finally, the processed metadata for silver and non-silver images is saved to Google Drive in both CSV and JSON formats to support reproducibility, verification, and downstream machine learning workflows.

#1. Setup

This code mounts Google Drive, loads an image URL dataset from a CSV file into a pandas DataFrame while skipping malformed entries, and prints sample rows and column headers to validate data integrity and structure.

In [None]:
# libaries
import pandas as pd
import os
import requests
# google drive setup
from google.colab import drive
drive.mount('/content/drive')
# load image URL data set for mineral images
path = '/content/drive/MyDrive/img_url_list.csv'
df = pd.read_csv(path, on_bad_lines='skip', engine='python')
# validate data set
print("\nDisplay Head:")
print(df.head())
print("\n Column names:")
print(list(df.columns))

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Display Head:
   https://www.mindat.org/photos/u00/00/au00002.jpg   Aurichalcite   
0  https://www.mindat.org/photos/r00/00/ar00003.jpg      Aragonite   
1  https://www.mindat.org/photos/n00/00/an00001.jpg      Anglesite   
2  https://www.mindat.org/photos/d00/00/ad00001.jpg        Adamite   
3  https://www.mindat.org/photos/r00/00/ar00002.jpg       Artinite   
4  https://www.mindat.org/photos/p00/00/ap00001.jpg   Fluorapatite   

 Column names:
['https://www.mindat.org/photos/u00/00/au00002.jpg', ' Aurichalcite', ' ']


#2. Data Loading and Preprocessing
An image URL dataset is loaded from a CSV file, with URL and mineral name fields cleaned and standardized. Invalid and duplicate entries are removed, and a download-tracking flag is added. The dataset is then reorganized into a consistent, readable format and displayed to verify that preprocessing steps were applied correctly.

In [None]:
# preprocess and clean data
df = pd.read_csv(
    path,
    header=None,
    names=["image_url", "mineral_name", "extra"],
    engine="python",
    on_bad_lines="skip"
)

# remove leading and trailing whitespace
df["image_url"] = df["image_url"].astype(str).str.strip()
df["mineral_name"] = df["mineral_name"].astype(str).str.strip()
# add a column to track whether an image has been downloaded (0 = not downloaded)
df["downloaded"] = 0
# keep only rows where the image URL is a valid HTTP or HTTPS link and remove duplicates
df = df[df["image_url"].str.startswith(("http://", "https://"))].drop_duplicates()
# drop the unused 'extra' column and reset the DataFrame index
df = df.drop(columns=["extra"]).reset_index(drop=True)
# convert mineral names to title case for consistency and readability
df["mineral_name"] = df["mineral_name"].str.title()
# reorder columns into a logical and readable format
df = df[["mineral_name", "image_url", "downloaded"]]
# limit the number of rows displayed in the notebook output
pd.set_option("display.max_rows", 20)
# allow full display of long text fields such as image URLs
pd.set_option("display.max_colwidth", None)

# Display sample rows to verify data cleaning, filtering, and column formatting
df.head()

Unnamed: 0,mineral_name,image_url,downloaded
0,Aurichalcite,https://www.mindat.org/photos/u00/00/au00002.jpg,0
1,Aragonite,https://www.mindat.org/photos/r00/00/ar00003.jpg,0
2,Anglesite,https://www.mindat.org/photos/n00/00/an00001.jpg,0
3,Adamite,https://www.mindat.org/photos/d00/00/ad00001.jpg,0
4,Artinite,https://www.mindat.org/photos/r00/00/ar00002.jpg,0


#3. Assign Silver vs Non-Silver Labels

Each mineral in the dataset is labeled using a binary classification scheme to support downstream analysis and model training. A value of 0 indicates a non-silver mineral, while a value of 1 indicates a silver or silver-bearing mineral. This labeling approach provides a clear and consistent target variable for supervised machine learning tasks.

In [None]:
# list of mineral names that are classified as silver or silver-bearing that
# will be included in the silver class
SILVER_WORDS = [
    "silver", "native silver", "argentum",
    "argentite", "acanthite",
    "chlorargyrite", "cerargyrite", "horn silver",
    "proustite", "pyrargyrite",
    "stephanite", "polybasite", "pearceite",
    "miargyrite", "freibergite"
]
# check to see if the mineral name is in the silver class
def is_silver(name):
    n = name.lower()
    return any(w in n for w in SILVER_WORDS)
# build column in dataframe to classify whether it is silver
df["is_silver"] = df["mineral_name"].apply(is_silver)

# validate silver set
print("Check Silver:")
print(f"All silver images: {df["is_silver"].all()}")
# display dataframe
df.head(5)

Check Silver:
All silver images: False


Unnamed: 0,mineral_name,image_url,downloaded,is_silver
0,Aurichalcite,https://www.mindat.org/photos/u00/00/au00002.jpg,0,False
1,Aragonite,https://www.mindat.org/photos/r00/00/ar00003.jpg,0,False
2,Anglesite,https://www.mindat.org/photos/n00/00/an00001.jpg,0,False
3,Adamite,https://www.mindat.org/photos/d00/00/ad00001.jpg,0,False
4,Artinite,https://www.mindat.org/photos/r00/00/ar00002.jpg,0,False


#4. Label Hard Negatives vs Non-Hard Negatives
To further improve classification robustness, a subset of non-silver minerals that closely resemble silver in appearance is explicitly identified as hard negatives. These minerals are commonly confused with silver due to similar visual or textural characteristics. Labeling hard negatives separately helps distinguish genuinely silver-bearing minerals from visually similar non-silver samples and reduces false-positive classifications in downstream analysis. A value of 0 indicates a non-hard negative, while a value of 1 indicates a hard negative.

In [None]:
# List of mineral names classified as non-silver hard negatives
HARD_NEG_WORDS = [
    "galena",
    "pyrite",
    "hematite",
    "graphite",
    "stibnite",
    "sphalerite",
    "chalcopyrite",
    "molybdenite"
]
# check to see if the mineral name is in the hard negative class
def is_hard_negative(name):
    n = str(name).lower()
    return any(w in n for w in HARD_NEG_WORDS)
# build column in dataframe to classify whether it is hard negative
df["is_hard_negative"] = df["mineral_name"].apply(is_hard_negative)
# display dataframe
df.head()

Unnamed: 0,mineral_name,image_url,downloaded,is_silver,is_hard_negative
0,Aurichalcite,https://www.mindat.org/photos/u00/00/au00002.jpg,0,False,False
1,Aragonite,https://www.mindat.org/photos/r00/00/ar00003.jpg,0,False,False
2,Anglesite,https://www.mindat.org/photos/n00/00/an00001.jpg,0,False,False
3,Adamite,https://www.mindat.org/photos/d00/00/ad00001.jpg,0,False,False
4,Artinite,https://www.mindat.org/photos/r00/00/ar00002.jpg,0,False,False


#5. Create 2 binary class
The dataset is partitioned into two primary binary classes: silver and non-silver, with duplicate image URLs removed to ensure uniqueness. The non-silver class is further subdivided into hard negatives and easy negatives based on mineral similarity to silver. This additional subdivision enables more granular control over negative samples and helps validate that no silver-labeled images leak into either negative subset. Summary metrics are printed to confirm correct class separation and dataset integrity.

In [None]:
# create two binary class of silver and non-silver
# silver class dataframe
silver_df = (
    df[df["is_silver"]]
    .drop_duplicates(subset=["image_url"])
    .reset_index(drop=True)
)
# non-silver class dataframe
nonsilver_df = (
    df[~df["is_silver"]]
    .drop_duplicates(subset=["image_url"])
    .reset_index(drop=True)
)
# display metrics to validate
print("Silver (unique):", len(silver_df))
print("Non-silver (unique):", len(nonsilver_df))

# Split the non-silver set to include a hard-negative subdivision
# hard negative dataframe of non-silver class
hard_neg_df = (
    nonsilver_df[nonsilver_df["is_hard_negative"] == True]
    .reset_index(drop=True)
)
# easy negative dataframe of non-silver class
easy_neg_df = (
    nonsilver_df[nonsilver_df["is_hard_negative"] == False]
    .reset_index(drop=True)
)

# display validation metrics
print("\nHard Negative Metrics:")
print("------------------------------")
print("Hard negatives:", len(hard_neg_df))
print("Easy negatives:", len(easy_neg_df))
print("Silver leaked into hard:", hard_neg_df["is_silver"].any())
print("Silver leaked into easy:", easy_neg_df["is_silver"].any())

Silver (unique): 5124
Non-silver (unique): 411560

Hard Negative Metrics:
------------------------------
Hard negatives: 18192
Easy negatives: 393368
Silver leaked into hard: False
Silver leaked into easy: False


#6. Build a Single DataFrame with the Correct Image Ratio

A fixed training ratio is established in which non-silver images outnumber silver images by a factor of five. The total number of non-silver samples is calculated accordingly and further divided into hard negatives and easy negatives using a 70/30 split. Random sampling is applied to each negative subset to construct a balanced non-silver training set. Finally, all silver and selected non-silver images are combined into a single training DataFrame, and summary outputs are displayed to validate that the intended class ratios have been correctly applied.

In [None]:
# training image ratio of non-silver images to silver images
IMAGE_FACTOR = 5
# compute silver and non-silver set sizes at a 5:1 ratio
model_silver_image_count = len(silver_df)
model_nonsilver_image_count = IMAGE_FACTOR * model_silver_image_count
# display validation metrics
print("Model silver image count:", model_silver_image_count)
print("Model non-silver image count:", model_nonsilver_image_count)
# compute hard negatives and easy negative sizes
model_hard_neg_image_count = round(.7 * model_nonsilver_image_count)
model_easy_neg_image_count = model_nonsilver_image_count - model_hard_neg_image_count
# display validation metrics
print("Hard negative image count:", model_hard_neg_image_count)
print("Easy negative image count:", model_easy_neg_image_count)
# build image training set with correct ratios of images
# all of silver images
model_silver_df = silver_df
# correct ratio of non-silver images with correct ratio of hard negative images
model_nonsilver_df = pd.concat(
    [hard_neg_df.sample(n=model_hard_neg_image_count, random_state=42),
     easy_neg_df.sample(n=model_easy_neg_image_count, random_state=42)],
     ignore_index=True
)
# combine all images
model_images = pd.concat([model_silver_df, model_nonsilver_df], ignore_index=True)
# display dataframe validation
model_nonsilver_df.head()

Model silver image count: 5124
Model non-silver image count: 25620
Hard negative image count: 17934
Easy negative image count: 7686


Unnamed: 0,mineral_name,image_url,downloaded,is_silver,is_hard_negative
0,Galena,https://www.mindat.org/photos/699/16/0699167001259349578.jpg,0,False,True
1,Arsenopyrite,https://www.mindat.org/xpic.php?fname=0004594001422912443.jpg&h=8c51d9ebeaa1a97d652ca94c0f1f8180,0,False,True
2,Hematite,https://www.mindat.org/photos/805/09/0805095001401484050.jpg,0,False,True
3,Galena,https://www.mindat.org/photos/815/80/08158060014371196313768.jpg,0,False,True
4,Hematite (Var: Martite),https://www.mindat.org/xpic.php?fname=0104938001236512794.jpg&h=3ef5cafac8559e5ba7f34c4c75958710,0,False,True


#7. Download Image Data to Google Drive

This section defines folder names and a base Google Drive directory for storing downloaded images. A download_images() function is then created to download images from a list of URLs, save them into a specified Drive folder, and update a DataFrame column (downloaded) to record whether each download succeeded. During downloading, each HTTP response is validated using its Content-Type header to ensure the content is an actual image before saving it to disk. After the download step, full directory paths for silver and non-silver image folders are constructed, URL lists are extracted from the silver and non-silver DataFrames, and both image sets are downloaded into their corresponding folders. Finally, the processed metadata for each class (including download tracking) is saved to Google Drive in both CSV and JSON formats to support verification and downstream machine learning workflows. Images that fail to download are retained in the DataFrame and explicitly marked as not downloaded.

In [57]:
# folder name for silver images
FOLDER_1 = "silver_images"
# folder name for non-silver images
FOLDER_2 = "non_silver_images"
# base path directory
BASE_PATH = "/content/drive/MyDrive/"

def download_images(urls, folder, frame):
    """
    Download images from a list of URLs and save them to a specified
    directory in Google Drive. Each download is validated to ensure
    the content is an image, and the DataFrame is updated to track
    successful and failed downloads.
    """
    # create target folder if it does not exist
    os.makedirs(folder, exist_ok=True)
    # total images downloaded
    image_count = 0
    # total images that failed to download
    fail=0

    # download all urls in dataframe
    for i, url in enumerate(urls):
        try:
            # Send HTTP request with timeout and browser-like headers
            r = requests.get(url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})

            # validate that the response contains an image
            if "image" not in r.headers.get("Content-Type", ""):
              image_count += 1
              fail=fail+1
              frame.loc[i, "downloaded"] = 0
              continue

            # define output filename and path
            filename = f"image_{i}.jpg"
            filepath = os.path.join(folder, filename)

            # write image bytes to disk
            with open(filepath, "wb") as f:
                f.write(r.content)

            # mark successful download in dataframe
            frame.loc[i, "downloaded"] = 1
            image_count += 1

        except Exception as e:
            # handle network or file I/O errors
            print("Failed to download:", url)

    # display download summary statistics
    print(f"Downloaded {image_count} images to {folder}")
    print(f"Failed to download {fail} images to {folder}")

# get urls
url_silver = model_silver_df["image_url"].tolist()[:]
url_nonsilver = model_nonsilver_df["image_url"].tolist()[:]

# download silver images
download_images(url_silver, path_1, model_silver_df)
# download nonsilver images
download_images(url_nonsilver, path_2, model_nonsilver_df)

# save data
save_dir = "/content/drive/MyDrive/my_data"
os.makedirs(save_dir, exist_ok=True)
# save metadata for silver images
model_silver_df.to_csv(os.path.join(save_dir, "silver.csv"), index=False)
model_silver_df.to_json(os.path.join(save_dir, "silver.json"), orient="records", indent=2)
# save metadata for non-silver images
model_nonsilver_df.to_csv(os.path.join(save_dir, "non_silver.csv"), index=False)
model_nonsilver_df.to_json(os.path.join(save_dir, "non_silver.json"), orient="records", indent=2)

Downloaded 5124 images to /content/drive/MyDrive/silver_images
Failed to download 1255 images to /content/drive/MyDrive/silver_images
Downloaded 25620 images to /content/drive/MyDrive/non_silver_images
Failed to download 8925 images to /content/drive/MyDrive/non_silver_images
