
# Collecting Image Datasets (Clean & English Version)

This notebook helps you **collect image datasets** with the `bing-image-downloader` package, **zip the dataset**, and (optionally) **save it to Google Drive** when running on Google Colab.

> **How to use this notebook**
> 1. Run the setup cell to install dependencies.
> 2. Set your parameters (dataset name, search queries, number of images).
> 3. Download the images.
> 4. Zip the dataset folder.
> 5. (Optional, Colab) Mount Google Drive and copy the ZIP file there.



## 1) Setup — Install Dependencies

This cell installs the required Python package.  
- If you're using **Google Colab**, it works out of the box.  
- If you're using **local Jupyter**, make sure you have internet access enabled to install packages.

In some environments you may need to **restart the kernel** after installation before the package can be imported.


In [None]:
# Install the image downloader package into the environment of the running kernel
%pip install -q bing-image-downloader
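
To check that the installed package is visible to the running kernel, the sketch below looks the module up without importing it. The helper name `is_installed` is hypothetical; `bing_image_downloader` is the import name used later in this notebook.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bing_image_downloader"):
    print("bing_image_downloader is available.")
else:
    print("Package not found; re-run the install cell or restart the kernel.")
```

If the check still fails after a successful install, the package probably went into a different Python environment than the one the kernel uses.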


## 2) Configure Your Dataset Parameters

Customize the values below:
- **dataset_name**: A folder/zip name for your dataset (letters, numbers, underscores).
- **queries**: A list of search terms to collect images for (e.g., different classes).
- **limit_per_query**: Maximum number of images to download for each search term.
- **base_output_dir**: Where to store downloaded images (default is current working directory).  
  - On **Colab**, you may prefer `/content` to store files on the VM.
- **adult_filter_off**: Set to `True` to allow all results, `False` to filter adult content.
- **force_replace**: If `True`, any existing folder for a query is deleted and recreated before downloading.


In [None]:
# ==== USER SETTINGS ==================================================
from pathlib import Path

dataset_name = "spice_dataset"          # Folder/ZIP base name for your dataset
queries = [
    "Cuminum cyminum",   # cumin
    "Lavandula",         # lavender
]
limit_per_query = 200      # images per query
base_output_dir = "."      # use "/content" on Colab for convenience
adult_filter_off = True    # set False to filter adult content
force_replace = False      # overwrite existing query folders if True

# ---- Derived paths ----
BASE_DIR = Path(base_output_dir).resolve()
DATASET_DIR = BASE_DIR / dataset_name
DATASET_DIR.mkdir(parents=True, exist_ok=True)

print(f"Dataset directory: {DATASET_DIR}")
print(f"Queries: {queries}")
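
The naming rule for `dataset_name` (letters, numbers, underscores) can be checked before any folders are created. This is a minimal sketch; the helper name `is_valid_dataset_name` is hypothetical, and restricting the check to ASCII characters is an assumption.

```python
import re


def is_valid_dataset_name(name: str) -> bool:
    """Accept only non-empty names made of ASCII letters, digits, and underscores."""
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None


# Try a few candidate names, including ones with a space or a hyphen.
for candidate in ["spice_dataset", "spice dataset", "herbs-2025"]:
    print(candidate, "->", is_valid_dataset_name(candidate))
```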



## 3) Download Images with Bing Image Downloader

What this cell does:
- Creates (or reuses) a subfolder for each **query** inside your dataset folder.
- Downloads up to **`limit_per_query`** images per query.
- Reuses an existing query folder as-is unless `force_replace=True`, in which case the folder is deleted and recreated first.

**Tips**
- Try **specific queries** (e.g., *"Lavandula flower close-up"*).
- Review downloaded images for quality and remove irrelevant ones.
- Respect **copyright** and **usage rights** of images.


In [None]:
from bing_image_downloader import downloader
from pathlib import Path

summary = []

for q in queries:
    print(f"\n=== Downloading: {q} ===")
    try:
        downloader.download(
            query=q,
            limit=limit_per_query,
            output_dir=DATASET_DIR.as_posix(),
            adult_filter_off=adult_filter_off,
            force_replace=force_replace,
            timeout=60,
            verbose=True,
        )
        summary.append((q, "OK"))
    except Exception as e:
        print(f"[WARN] Failed for {q}: {e}")
        summary.append((q, f"ERROR: {e}"))

print("\nSummary:")
for q, status in summary:
    print(f"- {q}: {status}")
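
An "OK" status only means the download call returned without raising an exception; it does not say how many images were saved. The sketch below counts image-like files in each query subfolder, assuming the `<dataset folder>/<query>/` layout described above. The helper name `count_images` and the extension set are assumptions.

```python
import tempfile
from pathlib import Path

# Assumed set of extensions to treat as images; adjust as needed.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def count_images(dataset_dir):
    """Map each subfolder name to the number of image-like files directly inside it."""
    counts = {}
    for sub in sorted(Path(dataset_dir).iterdir()):
        if sub.is_dir():
            counts[sub.name] = sum(
                1
                for f in sub.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTS
            )
    return counts


# Demo on a throwaway folder; in this notebook, call count_images(DATASET_DIR).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "Lavandula").mkdir()
    (demo / "Lavandula" / "Image_1.jpg").write_bytes(b"")
    (demo / "Lavandula" / "notes.txt").write_text("not an image")
    demo_counts = count_images(demo)
    print(demo_counts)
```

A count well below `limit_per_query` suggests that the search returned fewer usable results, or that some downloads failed.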



## 4) Zip Your Dataset Folder

This cell compresses the **entire dataset directory** (which contains subfolders for each query) into a single ZIP file.  
- The ZIP is created **next to** the dataset directory, and the query subfolders sit at the top level of the archive.
- If a ZIP already exists with the same name, it will be **overwritten**.


In [None]:
import shutil
from pathlib import Path

zip_base = DATASET_DIR.as_posix()
zip_path = f"{zip_base}.zip"

# Remove any existing archive first so the result is always a fresh file
p = Path(zip_path)
if p.exists():
    p.unlink()

# Create the zip archive
archive_path = shutil.make_archive(base_name=zip_base, format='zip', root_dir=DATASET_DIR)
print(f"Created ZIP: {archive_path}")
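
To confirm what ended up in the archive, the standard-library `zipfile` module can list its entries. The sketch below builds a tiny archive the same way as the cell above (with `root_dir` set to the dataset folder) and prints the top-level names. The helper name `top_level_entries` and the demo folder names are hypothetical.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def top_level_entries(zip_path):
    """Return the sorted top-level names stored in a ZIP archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({name.split("/")[0] for name in zf.namelist()})


# Demo: archive a throwaway dataset folder that has one class subfolder.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset"
    (dataset / "classA").mkdir(parents=True)
    (dataset / "classA" / "a.txt").write_text("x")
    archive = shutil.make_archive(str(dataset), "zip", root_dir=str(dataset))
    entries = top_level_entries(archive)
    print(entries)
```

In this notebook, `top_level_entries(archive_path)` should return one name per query folder.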



## 5) (Optional) Save to Google Drive (Colab)

If you're running on **Google Colab**, you can mount Google Drive and copy the generated ZIP file there:
- The ZIP file path is printed in the previous cell (e.g., `/content/spice_dataset.zip` on Colab).
- Update `dest_dir` below if you want a custom folder inside your Drive.


In [None]:
import shutil
import sys
from pathlib import Path

# Detect Colab environment
IN_COLAB = 'google.colab' in sys.modules
print(f"Running in Colab: {IN_COLAB}")

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')

    src_zip = Path(f"{DATASET_DIR}.zip")
    dest_dir = Path('/content/drive/MyDrive') / 'datasets'
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest_zip = dest_dir / src_zip.name

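    # Copy only if the ZIP from the previous step actually exists
    if src_zip.exists():
        shutil.copy(src_zip, dest_zip)
        print(f"Copied to Drive: {dest_zip}")
    else:
        print(f"ZIP not found: {src_zip}. Run the zip cell first.")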
else:
    print("Not running on Colab. Skipping Drive copy step.")
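
For large archives it can be worth confirming that the copy matches the source. The sketch below compares SHA-256 digests after copying. The helper names `sha256_of` and `copy_and_verify` are hypothetical, and the demo uses temporary files rather than a mounted Drive.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dest):
    """Copy src to dest and return True if both files have the same digest."""
    shutil.copy(src, dest)
    return sha256_of(src) == sha256_of(dest)


# Demo with temporary files standing in for the ZIP and the Drive folder.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "spice_dataset.zip"
    src.write_bytes(b"example bytes")
    dest = Path(tmp) / "copy.zip"
    ok = copy_and_verify(src, dest)
    print(f"Copy verified: {ok}")
```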



## 6) Next Steps & Notes

- Review images and remove **duplicate**, **corrupted**, or **irrelevant** files.
- Consider normalizing file names and ensuring **class labels** match folder names.
- For training ML models, consider creating `train/`, `val/`, `test/` subfolders.
- Always check the **licensing** of images and follow fair-use and attribution rules where applicable.
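
Two of the steps above can be sketched with the standard library alone: finding byte-identical duplicates and splitting a file list into train/val/test. The helper names `find_duplicates` and `split_items`, the 10% fractions, and the fixed seed are assumptions. Hash-based matching only catches exact copies, not resized or re-encoded versions of the same picture.

```python
import hashlib
import random
import tempfile
from pathlib import Path


def find_duplicates(folder):
    """Return files whose bytes match an earlier file (by sorted name) in the folder."""
    seen, duplicates = {}, []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(path)
        else:
            seen[digest] = path
    return duplicates


def split_items(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle deterministically, then slice into train/val/test lists."""
    items = sorted(items)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_frac)
    n_test = int(len(items) * test_frac)
    return {
        "val": items[:n_val],
        "test": items[n_val:n_val + n_test],
        "train": items[n_val + n_test:],
    }


# Demo: two identical files and one different file in a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.jpg").write_bytes(b"same")
    (folder / "b.jpg").write_bytes(b"same")
    (folder / "c.jpg").write_bytes(b"different")
    dup_names = [p.name for p in find_duplicates(folder)]
    print("Duplicates:", dup_names)

parts = split_items([f"Image_{i}.jpg" for i in range(10)])
print({name: len(files) for name, files in parts.items()})
```

Splitting per class folder, rather than over the whole dataset at once, keeps the class balance roughly the same across the three subsets.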

**Provenance**: This English, cleaned notebook was generated from a Persian notebook and includes clearer structure and instructions for each step.  
_Last updated: 2025-09-05 10:46 UTC_
