# Data Collection through Image Scraping

In this notebook, we will explore two different ways to scrape images from the web:

**[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)** - Best for static HTML pages

**[Gallery-DL](https://github.com/mikf/gallery-dl)** - Best for structured bulk downloading from image-heavy sites

Before you proceed with the code examples, make sure you install the necessary modules.

Some dependencies may conflict with each other. If a package fails to install or does not work properly, you can work around it or choose another method to scrape images from the web; several options are given here.

Feel free to also explore **[Selenium](https://selenium-python.readthedocs.io/getting-started.html)**, which is best for JavaScript-heavy, dynamically loaded content.

An extra, code-free option for creating a dataset is to use **[Kaggle](https://www.kaggle.com)**. Kaggle hosts many public datasets that you can download, e.g. the [animal image dataset](https://www.kaggle.com/datasets/iamsouravbanerjee/animal-image-dataset-90-different-animals/data).

**NEVER FORGET TO DECLARE WHERE YOU EXTRACTED YOUR DATA FROM, NO MATTER WHICH OF THESE APPROACHES YOU CHOOSE**
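
One lightweight way to keep track of provenance is to write a small manifest file next to the dataset. The sketch below is only an illustration: the helper name `record_source`, the file name `SOURCES.json`, and the entry fields are assumptions made for this example, not part of any tool used in this notebook.

```python
import json
import os
from datetime import date


def record_source(dataset_dir, url, method, manifest_name="SOURCES.json"):
    """Append one provenance entry to a JSON manifest inside dataset_dir.

    The manifest name and the entry fields are hypothetical choices.
    """
    os.makedirs(dataset_dir, exist_ok=True)
    manifest_path = os.path.join(dataset_dir, manifest_name)

    # Load existing entries so repeated calls accumulate instead of overwriting
    entries = []
    if os.path.exists(manifest_path):
        with open(manifest_path, "r", encoding="utf-8") as f:
            entries = json.load(f)

    entries.append({
        "url": url,
        "method": method,
        "retrieved": date.today().isoformat(),
    })

    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=4)

    return manifest_path
```

For example, `record_source("./data/bs_dataset_imgs", "https://www.bbc.co.uk/news", "BeautifulSoup")` would add one entry for the images scraped with Beautiful Soup.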

In [None]:
# !conda install -c conda-forge beautifulsoup4 selenium webdriver-manager
# !pip install gallery_dl

## Beautiful Soup | a simple HTML scraper

It is best for basic scraping from static HTML pages, like [Wikipedia](https://www.wikipedia.org) and [BBC](https://www.bbc.co.uk).

In [None]:
import requests
from bs4 import BeautifulSoup
import os

In [None]:
# Scraping text from a website
# This script will scrape the headlines from the BBC News website
url = "https://www.bbc.co.uk/news"
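response = requests.get(url, timeout=10)  # Avoid hanging forever on a slow server
response.raise_for_status()  # Stop early on HTTP errors (e.g. 403, 404)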
soup = BeautifulSoup(response.text, "html.parser")

# Find all headlines
headlines = soup.find_all("h3")

# Print the headlines
for headline in headlines:
    print(headline.text.strip())


In [None]:
# Define base directory
bs_dataset_txt = "./data/bs_dataset_txt"
# Create directories if they don't exist
os.makedirs(bs_dataset_txt, exist_ok=True)

# Find all headlines
headlines = [h3.text.strip() for h3 in soup.find_all("h3")]

# Save to a text file inside `bs_dataset_txt/`
file_path = os.path.join(bs_dataset_txt, "headlines.txt")
with open(file_path, "w", encoding="utf-8") as f:
    for headline in headlines:
        f.write(headline + "\n")

print(f"Saved {len(headlines)} headlines to {file_path}")

As with text, Beautiful Soup can also be used to scrape images.

Note that this method does not work on websites where images are loaded dynamically using JavaScript.

In [None]:
# Scraping images from a website
# This script will scrape the image URLs from the BBC News website
url = "https://www.bbc.co.uk/news"
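response = requests.get(url, timeout=10)  # Avoid hanging forever on a slow server
response.raise_for_status()  # Stop early on HTTP errors (e.g. 403, 404)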
soup = BeautifulSoup(response.text, "html.parser")

# Find all image tags
images = soup.find_all("img")

# Print image URLs (some <img> tags have no src attribute, so use .get)
for img in images:
    src = img.get("src")
    if src:
        print(src)


In [None]:
# Define base directory
bs_dataset_imgs = "./data/bs_dataset_imgs"
# Create directories if they don't exist
os.makedirs(bs_dataset_imgs, exist_ok=True)

# Download and save images
downloaded = 0
for idx, img in enumerate(images[:10]):  # Limit to first 10 images
    img_url = img.get("src")

    if img_url and img_url.startswith("http"):  # Skip missing, relative, and data: URLs
        img_path = os.path.join(bs_dataset_imgs, f"image_{idx}.jpg")

        # Download the image and save it only if the request succeeded
        img_response = requests.get(img_url, timeout=10)
        if img_response.ok:
            with open(img_path, "wb") as f:
                f.write(img_response.content)
            downloaded += 1

# Report the number of files actually written, not the number of tags inspected
print(f"Downloaded {downloaded} images into {bs_dataset_imgs}")
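
The download loop above only accepts `src` values that already start with `http`, so relative paths (`/img/a.png`) and protocol-relative URLs (`//cdn.example.com/a.png`) are skipped, and every file is saved with a `.jpg` extension. The following is a minimal sketch of how these cases could be handled with the standard library. The helper names `resolve_image_urls` and `extension_from_url` are made up for this example.

```python
import os
from urllib.parse import urljoin, urlparse


def resolve_image_urls(page_url, srcs):
    """Turn raw src values into absolute http(s) URLs, dropping unusable ones."""
    resolved = []
    for src in srcs:
        if not src or src.startswith("data:"):
            continue  # Skip missing values and inline base64 images
        # urljoin handles "/path", "//host/path", and "relative/path"
        absolute = urljoin(page_url, src)
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


def extension_from_url(url, default=".jpg"):
    """Guess a file extension from the URL path, falling back to a default."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return ext if ext in (".jpg", ".jpeg", ".png", ".gif", ".webp") else default
```

With the variables from the cells above, `resolve_image_urls(url, [img.get("src") for img in images])` would give a list of absolute URLs that can be passed to `requests.get`.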

## Gallery-DL | a bulk image scraper

It is best for structured bulk downloads, from [Pinterest](https://uk.pinterest.com), [Instagram](https://www.instagram.com), [Flickr](https://www.flickr.com), and similar sites.

In [None]:
# Check if the package is installed
!gallery-dl --version

We will organize the images into folders like class_1, class_2, etc.

In [None]:
import os
import shutil
import json

# Define base dataset path
my_dataset_path = "./data/gallery_dl_dataset"

# Number of classes you need (change if needed)
num_classes = 3  

# Create base dataset directory if it doesn't exist
os.makedirs(my_dataset_path, exist_ok=True)

# Create multiple class folders dynamically
class_folders = []
for i in range(1, num_classes + 1):
    class_folder = os.path.join(my_dataset_path, f"class_{i}")
    os.makedirs(class_folder, exist_ok=True)
    class_folders.append(class_folder)

print(f"Created class folders: {class_folders}")


We modify Gallery-DL’s settings to download only image files and exclude unnecessary files like JSON metadata.

In [None]:
# Define gallery-dl config directory **** choose one of the following based on your OS
config_dir = os.path.expanduser("~/.config/gallery-dl")  # for Linux/macOS
# config_dir = os.path.join(os.environ["APPDATA"], "gallery-dl")  # for Windows (%APPDATA%\gallery-dl)
# Build the file path after choosing config_dir so it always matches the selected OS
config_path = os.path.join(config_dir, "config.json")

# Ensure the config directory exists
os.makedirs(config_dir, exist_ok=True)

# Define the config to exclude non-image files
gallery_dl_config = {
    "extractor": {
        "base-directory": "pinterest_downloads",  # Base directory
        "directory": ["class-{num}"],  # Organizes into class folders
        "skip-metadata": True,  # Prevents JSON files from being downloaded
        "archive": False,  # Avoids unnecessary archive files
        "postprocessors": [],
        "image-filter": "extension in ('jpg', 'jpeg', 'png', 'gif', 'webp')"  # Filters only image files
    }
}

# Write to the config file (this overwrites any existing config.json)
with open(config_path, "w") as f:
    json.dump(gallery_dl_config, f, indent=4)

print(f"Gallery-DL configuration updated at: {config_path}")
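
The cell above replaces the whole config file, which discards any settings you may already have. As a hedged alternative, the sketch below merges new options into the `extractor` section of an existing file and leaves everything else untouched. The helper name `update_gallery_dl_config` is hypothetical, and the sketch assumes the existing file, if any, is valid JSON.

```python
import json
import os


def update_gallery_dl_config(config_path, extractor_options):
    """Merge extractor_options into an existing config file, or create a new one."""
    config = {}
    if os.path.exists(config_path):
        with open(config_path, "r", encoding="utf-8") as f:
            config = json.load(f)

    # Only touch the "extractor" section; leave every other setting as it was
    config.setdefault("extractor", {}).update(extractor_options)

    os.makedirs(os.path.dirname(config_path) or ".", exist_ok=True)
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4)
    return config
```

With the variables defined above, this could be called as `update_gallery_dl_config(config_path, gallery_dl_config["extractor"])`.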


This will store images from each Pinterest board into separate folders (class_1, class_2, etc.).

In [None]:
# Define Pinterest boards to download from, one per class
# (the same board is repeated here as a placeholder; replace with your own boards)
pinterest_boards = [
    "https://uk.pinterest.com/dfordesignoc/_-mid-century-modern/eichler-homes/",
    "https://uk.pinterest.com/dfordesignoc/_-mid-century-modern/eichler-homes/",
    "https://uk.pinterest.com/dfordesignoc/_-mid-century-modern/eichler-homes/",
]

# Download images from each board
for idx, board_url in enumerate(pinterest_boards):
    class_folder = class_folders[idx % len(class_folders)]  # Rotate through class folders
    print(f"Downloading from {board_url} into {class_folder}...")

    # Run gallery-dl (Downloads images from Pinterest board)
    !gallery-dl -d "{class_folder}" "{board_url}"
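
The `!gallery-dl ...` syntax only works inside a notebook. If you want to run the same download from a plain Python script, one possible approach is to call the command through `subprocess`. This is a sketch under assumptions: the helper names `build_gallery_dl_command` and `download_board` are invented for this example, and `max_files` relies on gallery-dl's `--range` option to cap the number of files per board.

```python
import subprocess


def build_gallery_dl_command(url, destination, max_files=None):
    """Build the argument list for a gallery-dl call without running it."""
    command = ["gallery-dl", "-d", destination]
    if max_files is not None:
        # --range selects which files to download, e.g. "1-50" for the first 50
        command += ["--range", f"1-{max_files}"]
    command.append(url)
    return command


def download_board(url, destination, max_files=None):
    """Run gallery-dl and return its exit code (0 means success).

    Raises FileNotFoundError if the gallery-dl executable is not on the PATH.
    """
    result = subprocess.run(build_gallery_dl_command(url, destination, max_files))
    return result.returncode
```

Passing the arguments as a list avoids shell quoting problems with URLs or folder names that contain spaces or special characters.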


Gallery-DL may create extra subdirectories, so we need to move images directly into the class folder. 

This flattens the directory structure so all images are directly inside class_X folders. 

If this confuses you or fails for some reason, you can always organise your folders manually.

In [None]:
# Function to move images from subdirectories to the main class folder
def move_images_to_class_folder(class_folder):
    """
    Moves all images from subdirectories inside `class_folder`
    directly into `class_folder`, flattening the structure.
    """
    for root, dirs, files in os.walk(class_folder, topdown=False):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            new_path = os.path.join(class_folder, file_name)

            # Move the file only if it's not already in the main class folder
            if root != class_folder:
                shutil.move(file_path, new_path)
                print(f"Moved {file_path} → {new_path}")

        # Remove empty subdirectories after moving files
        for dir_name in dirs:
            dir_path = os.path.join(root, dir_name)
            if not os.listdir(dir_path):  # Check if empty
                os.rmdir(dir_path)
                print(f"Removed empty directory: {dir_path}")

# Apply function to all class folders
for class_folder in class_folders:
    move_images_to_class_folder(class_folder)
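
Flattening can bring together files from different subdirectories that share the same name. Depending on the operating system, `shutil.move` may then overwrite the existing file or raise an error. The sketch below shows one way to pick a non-clashing file name first. The helper `unique_destination` and its `_1`, `_2`, ... suffix scheme are assumptions for illustration.

```python
import os


def unique_destination(folder, file_name):
    """Return a path inside folder that does not clash with an existing file."""
    stem, ext = os.path.splitext(file_name)
    candidate = os.path.join(folder, file_name)
    counter = 1
    while os.path.exists(candidate):
        # Append _1, _2, ... before the extension until the name is free
        candidate = os.path.join(folder, f"{stem}_{counter}{ext}")
        counter += 1
    return candidate
```

To use it, `new_path = os.path.join(class_folder, file_name)` inside `move_images_to_class_folder` could be replaced with `new_path = unique_destination(class_folder, file_name)`.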
