## Cell 1: The Ingestion Toolchain
We start by defining the project's directory layout. Because this notebook lives in `notebooks/`, it goes up one level from the working directory to place the data in `dataset/`. This assumes the kernel's working directory is `notebooks/`.

In [1]:
import os
import shutil
import pandas as pd
import kagglehub

# Absolute paths for the project layout (assumes the working directory is notebooks/)
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))
DATASET_DIR = os.path.join(PROJECT_ROOT, 'dataset')
FINAL_DATA_DIR = os.path.join(DATASET_DIR, 'PlantVillage')

# Ensure the dataset directory exists before we initiate the download
os.makedirs(DATASET_DIR, exist_ok=True)

print(f"Project topography initialized. Target directory: {DATASET_DIR}")

Project topography initialized. Target directory: /Users/angelonelson/Projects/crop-disease-identifier/ml/dataset
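
The path logic above only works when the working directory is `notebooks/`. A minimal sketch of a more defensive variant follows, assuming a hypothetical helper `find_project_root` that walks upward until it finds a marker directory. The marker name `notebooks` is an assumption about this project's layout, not something the notebook itself checks.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "notebooks") -> Path:
    """Return the nearest ancestor of `start` (or `start` itself) containing `marker`.

    `marker` is a hypothetical choice: any directory that exists only at the
    project root would do. If no ancestor contains it, fall back to the parent
    of `start`, which matches the behaviour of the cell above.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return start.parent
```

With this helper, `PROJECT_ROOT = find_project_root(Path.cwd())` would give the same result whether the kernel starts in `notebooks/` or in the project root.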


## Cell 2: Download and Extract via kagglehub
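We use `kagglehub` to download and extract the dataset into its local cache. Public datasets such as this one can typically be downloaded without authentication; if credentials are required, `kagglehub` reads them from the environment (for example `KAGGLE_USERNAME` and `KAGGLE_KEY`) or from a `kaggle.json` file, or prompts for them via `kagglehub.login()`.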

In [2]:
print("Downloading PlantVillage dataset via kagglehub...")
cached_path = kagglehub.dataset_download("emmarex/plantdisease")
print(f"Procurement successful. Data cached at: {cached_path}")
print(f"Cached contents: {os.listdir(cached_path)}")

Downloading PlantVillage dataset via kagglehub...


Downloading to /Users/angelonelson/.cache/kagglehub/datasets/emmarex/plantdisease/1.archive...


100%|████████| 658M/658M [01:07<00:00, 10.3MB/s]

Extracting files...





Procurement successful. Data cached at: /Users/angelonelson/.cache/kagglehub/datasets/emmarex/plantdisease/versions/1
Cached contents: ['PlantVillage']
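
Listing only the top level of the cache says little about whether extraction completed. As a rough sanity check, the sketch below counts the files and bytes under a directory; `summarize_tree` is a hypothetical helper, not part of `kagglehub`.

```python
import os
from typing import Tuple


def summarize_tree(root: str) -> Tuple[int, int]:
    """Return (file_count, total_bytes) for every file found under `root`."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return file_count, total_bytes
```

Calling `summarize_tree(cached_path)` after the download gives a quick figure to compare across runs. The extracted size need not match the 658M archive size reported by the progress bar, since that figure refers to the compressed download.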


## Cell 3: Structural Normalization
Copy the data from the kagglehub cache into our project's `dataset/PlantVillage` directory, flattening any redundant nested `PlantVillage` directories along the way.

In [3]:
print("Normalizing directory structure into project tree...")

# Walk down through any PlantVillage nesting in the cache
source_dir = cached_path
for _ in range(3):
    nested = os.path.join(source_dir, 'PlantVillage')
    if os.path.exists(nested):
        source_dir = nested
    else:
        break

print(f"Source resolved to: {source_dir}")

# Copy to our project's dataset directory
if os.path.exists(FINAL_DATA_DIR):
    print("PlantVillage already exists. Removing and re-copying...")
    shutil.rmtree(FINAL_DATA_DIR)

shutil.copytree(source_dir, FINAL_DATA_DIR)

# Final flatten if there's still a nested PlantVillage
nested_dir = os.path.join(FINAL_DATA_DIR, 'PlantVillage')
if os.path.exists(nested_dir):
    print("Flattening redundant nesting...")
    for item in os.listdir(nested_dir):
        shutil.move(os.path.join(nested_dir, item), os.path.join(FINAL_DATA_DIR, item))
    os.rmdir(nested_dir)

print(f"Data ready at: {FINAL_DATA_DIR}")

Normalizing directory structure into project tree...
Source resolved to: /Users/angelonelson/.cache/kagglehub/datasets/emmarex/plantdisease/versions/1/PlantVillage/PlantVillage


Data ready at: /Users/angelonelson/Projects/crop-disease-identifier/ml/dataset/PlantVillage
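
The descent through nested `PlantVillage` directories can be isolated into a small function, which makes it easy to test against a throwaway directory. This is a sketch of the same loop as in the cell above; the name `resolve_nested` is hypothetical, and it uses `os.path.isdir` where the cell uses `os.path.exists`, so a stray file named `PlantVillage` would not be followed.

```python
import os


def resolve_nested(root: str, name: str = "PlantVillage", max_depth: int = 3) -> str:
    """Descend through at most `max_depth` nested `name` directories under `root`."""
    current = root
    for _ in range(max_depth):
        nested = os.path.join(current, name)
        if not os.path.isdir(nested):
            break  # no further nesting; `current` holds the class folders
        current = nested
    return current
```

For the cache layout shown in the output above, `resolve_nested(cached_path)` would stop at `.../versions/1/PlantVillage/PlantVillage`, the same directory the cell reports as the resolved source.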


## Cell 4: The Agronomic Audit (Sanity Check)
Before we feed this to a neural network, we must empirically verify the integrity of the data. How many images do we actually have? How severe is the class imbalance? As the output below shows, the largest class (3,208 images) is roughly 21 times the size of the smallest (152 images), which is the motivation for using Focal Loss in the subsequent `mainmodel.ipynb`.

In [4]:
class_counts = {}
total_images = 0

# Traverse the finalized directory and count image files (.png, .jpg, .jpeg)
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')
for class_name in os.listdir(FINAL_DATA_DIR):
    class_path = os.path.join(FINAL_DATA_DIR, class_name)

    # Only directories are classes; this skips stray files such as .DS_Store
    if os.path.isdir(class_path):
        num_images = sum(
            f.lower().endswith(IMAGE_EXTENSIONS) for f in os.listdir(class_path)
        )
        class_counts[class_name] = num_images
        total_images += num_images

# Convert to a DataFrame for an elegant, readable output
df_stats = pd.DataFrame(list(class_counts.items()), columns=['Taxonomy', 'Image Count'])
df_stats = df_stats.sort_values(by='Image Count', ascending=False).reset_index(drop=True)

print(f"Audit complete. Total verified images: {total_images}")
print(f"Total distinct crop/disease classifications: {len(df_stats)}\n")
print("Top 5 Dominant Classes (The Majority):")
print(df_stats.head(5).to_string(index=False))
print("\nBottom 5 Underrepresented Classes (The Minority):")
print(df_stats.tail(5).to_string(index=False))

Audit complete. Total verified images: 20638
Total distinct crop/disease classifications: 15

Top 5 Dominant Classes (The Majority):
                                   Taxonomy  Image Count
      Tomato__Tomato_YellowLeaf__Curl_Virus         3208
                      Tomato_Bacterial_spot         2127
                         Tomato_Late_blight         1909
                  Tomato_Septoria_leaf_spot         1771
Tomato_Spider_mites_Two_spotted_spider_mite         1676

Bottom 5 Underrepresented Classes (The Minority):
                     Taxonomy  Image Count
         Potato___Late_blight         1000
Pepper__bell___Bacterial_spot          997
             Tomato_Leaf_Mold          952
  Tomato__Tomato_mosaic_virus          373
             Potato___healthy          152
