## Cell 1: The Ingestion Toolchain
We first define the project's directory layout. The code assumes the notebook's working directory is `notebooks/`, so it moves up one level to find the project root and stores the download under `dataset/`.

In [6]:
import os
import shutil
import pandas as pd
import kagglehub

# Absolute paths for the project root, the dataset directory, and the final image directory
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))
DATASET_DIR = os.path.join(PROJECT_ROOT, 'dataset')
FINAL_DATA_DIR = os.path.join(DATASET_DIR, 'PlantVillage')

# Ensure the dataset directory exists before we initiate the download
os.makedirs(DATASET_DIR, exist_ok=True)

print(f"Project topography initialized. Target directory: {DATASET_DIR}")

Project topography initialized. Target directory: /Users/angelonelson/Projects/crop-disease-identifier/ml/dataset
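
Because `os.getcwd()` depends on where the kernel was launched, the one-level-up assumption can silently point at the wrong directory. Below is a minimal sketch of a more defensive lookup. It assumes the project root is the nearest ancestor that contains a `notebooks/` folder. The helper name `find_project_root` and its `marker` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path

def find_project_root(start, marker="notebooks"):
    """Return the nearest ancestor of `start` (inclusive) that contains a `marker` directory."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains a '{marker}/' directory")
```

With this helper, `PROJECT_ROOT = str(find_project_root(os.getcwd()))` would resolve to the same root whether the kernel starts in `notebooks/` or in the project root itself.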


## Cell 2: Download and Extract via kagglehub
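We use `kagglehub` to download and extract the dataset into its local cache. Public datasets such as this one normally download without credentials. If authentication is required, `kagglehub` can pick up credentials from `~/.kaggle/kaggle.json` or from an interactive `kagglehub.login()` call.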

In [7]:
# Download latest version
path = kagglehub.dataset_download("abdallahalidev/plantvillage-dataset")

print("Path to dataset files:", path)

Downloading to /Users/angelonelson/.cache/kagglehub/datasets/abdallahalidev/plantvillage-dataset/3.archive...


100%|██████████| 2.04G/2.04G [03:27<00:00, 10.6MB/s]

Extracting files...

Path to dataset files: /Users/angelonelson/.cache/kagglehub/datasets/abdallahalidev/plantvillage-dataset/versions/3


## Cell 3: Structural Normalization
Copy the data from the `kagglehub` cache into our project's `dataset/PlantVillage` directory. Only the `color` variant, which holds the class folders, is copied, and any redundant nested `PlantVillage` directory is flattened.

In [10]:
print("Normalizing directory structure into project tree...")

# Walk down through any PlantVillage nesting in the cache
source_dir = path
for _ in range(3):
    nested = os.path.join(source_dir, 'PlantVillage')
    if os.path.exists(nested):
        source_dir = nested
    else:
        break

print(f"Source resolved to: {source_dir}")

# The actual images reside in the 'color' subdirectory of this specific Kaggle payload
source_color_dir = os.path.join(source_dir, 'color')
if not os.path.exists(source_color_dir):
    source_color_dir = os.path.join(source_dir, 'plantvillage dataset', 'color')

print(f"Color source resolved to: {source_color_dir}")

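# Copy ONLY the color directory, which contains the class folders.
# Validate the source first so a bad path never deletes existing data.
if not os.path.isdir(source_color_dir):
    raise FileNotFoundError(f"Color directory not found: {source_color_dir}")

if os.path.exists(FINAL_DATA_DIR):
    print("PlantVillage already exists. Removing and re-copying...")
    shutil.rmtree(FINAL_DATA_DIR)

shutil.copytree(source_color_dir, FINAL_DATA_DIR)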

# Final flatten if there's still a nested PlantVillage
nested_dir = os.path.join(FINAL_DATA_DIR, 'PlantVillage')
if os.path.exists(nested_dir):
    print("Flattening redundant nesting...")
    for item in os.listdir(nested_dir):
        shutil.move(os.path.join(nested_dir, item), os.path.join(FINAL_DATA_DIR, item))
    os.rmdir(nested_dir)

print(f"Data ready at: {FINAL_DATA_DIR}")

Normalizing directory structure into project tree...
Source resolved to: /Users/angelonelson/.cache/kagglehub/datasets/abdallahalidev/plantvillage-dataset/versions/3
Color source resolved to: /Users/angelonelson/.cache/kagglehub/datasets/abdallahalidev/plantvillage-dataset/versions/3/plantvillage dataset/color
PlantVillage already exists. Removing and re-copying...
Data ready at: /Users/angelonelson/Projects/crop-disease-identifier/ml/dataset/PlantVillage
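
The two hard-coded candidate paths above cover the layouts observed for this payload. A hedged alternative is to search the download for the directory by name. The helper `find_subdir` below is hypothetical, and it assumes the first match in a top-down walk is the directory wanted.

```python
import os

def find_subdir(root, name="color"):
    """Return the path of the first directory called `name` under `root`, or None."""
    for dirpath, dirnames, _ in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        if name in dirnames:
            return os.path.join(dirpath, name)
    return None
```

Calling `find_subdir(path)` on the cache path returned by `kagglehub` would replace both `os.path.join` candidates. A `None` result signals that the layout has changed and the copy should not proceed.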


## Cell 4: The Agronomic Audit (Sanity Check)
Before we feed this to a neural network, we verify the integrity of the data empirically. How many images do we actually have, and how severe is the class imbalance? The audit quantifies the imbalance that motivates the use of Focal Loss in the subsequent `mainmodel.ipynb`.

In [11]:
class_counts = {}
total_images = 0

# Traverse the finalized directory and count the image files (.png, .jpg, .jpeg)
for class_name in os.listdir(FINAL_DATA_DIR):
    class_path = os.path.join(FINAL_DATA_DIR, class_name)

    # Skip non-directory entries such as .DS_Store
    if os.path.isdir(class_path):
        num_images = len([f for f in os.listdir(class_path) if f.lower().endswith(('.png', '.jpg', '.jpeg'))])
        class_counts[class_name] = num_images
        total_images += num_images

# Convert to a DataFrame for an elegant, readable output
df_stats = pd.DataFrame(list(class_counts.items()), columns=['Taxonomy', 'Image Count'])
df_stats = df_stats.sort_values(by='Image Count', ascending=False).reset_index(drop=True)

print(f"Audit complete. Total verified images: {total_images}")
print(f"Total distinct crop/disease classifications: {len(df_stats)}\n")
print("Top 5 Dominant Classes (The Majority):")
print(df_stats.head(5).to_string(index=False))
print("\nBottom 5 Underrepresented Classes (The Minority):")
print(df_stats.tail(5).to_string(index=False))

Audit complete. Total verified images: 54305
Total distinct crop/disease classifications: 38

Top 5 Dominant Classes (The Majority):
                                Taxonomy  Image Count
Orange___Haunglongbing_(Citrus_greening)         5507
  Tomato___Tomato_Yellow_Leaf_Curl_Virus         5357
                       Soybean___healthy         5090
                  Peach___Bacterial_spot         2297
                 Tomato___Bacterial_spot         2127

Bottom 5 Underrepresented Classes (The Minority):
                    Taxonomy  Image Count
Tomato___Tomato_mosaic_virus          373
         Raspberry___healthy          371
             Peach___healthy          360
    Apple___Cedar_apple_rust          275
            Potato___healthy          152
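
To put a number on the imbalance, the largest class above (5507 images) is roughly 36 times the size of the smallest (152 images). The sketch below computes that ratio and evaluates the focal loss term `-(1 - p_t)^gamma * log(p_t)` for a single prediction, without the optional class-weighting factor. The function name, the default `gamma=2.0`, and the example probabilities are illustrative assumptions, not values taken from `mainmodel.ipynb`.

```python
import math

# Counts copied from the audit output above (largest and smallest class).
largest, smallest = 5507, 152
imbalance_ratio = largest / smallest
print(f"Imbalance ratio: {imbalance_ratio:.1f}x")  # prints "Imbalance ratio: 36.2x"

def focal_loss(p_t, gamma=2.0):
    """Focal loss for one sample; p_t is the predicted probability of the true class."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# With gamma=0 the term reduces to ordinary cross-entropy.
easy, hard = 0.95, 0.30  # hypothetical confident and unconfident predictions
print("easy:", focal_loss(easy, gamma=0.0), "->", focal_loss(easy))
print("hard:", focal_loss(hard, gamma=0.0), "->", focal_loss(hard))
```

With `gamma=2.0`, the confident prediction's loss is scaled by `(1 - 0.95)^2 = 0.0025`, while the unconfident one is scaled by `(1 - 0.30)^2 = 0.49`. Well-classified samples, which tend to come from the majority classes, therefore contribute far less to the total loss.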
