# **Image Scraping for Dataset Creation**

To build a custom dataset for malnutrition detection, we scraped images from **Google** and **Bing** using automated crawlers (`icrawler` for Google, `bing_image_downloader` for Bing).  
- Images of **healthy child faces** and **malnourished child faces** were collected with varied search queries to encourage variation in lighting, facial expressions, and backgrounds.  
- The scraped images form the raw dataset, which is later cleaned, split, and augmented to train a ResNet50 model.


!pip install bing-image-downloader


In [None]:
from bing_image_downloader import downloader

# List of queries (You can add more keywords)
queries = ["malnourished child face", "healthy child face"]

# Loop through each query
for query in queries:
    print(f"Downloading images for: {query}")
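    downloader.download(
        query,
        limit=500,  # Maximum number of images per keyword
        output_dir='dataset',  # Images will be saved in 'dataset/<query>/'
        adult_filter_off=True,  # True turns Bing's adult-content filter off; set to False to keep it on
        force_replace=False,  # Keep an existing 'dataset/<query>/' folder instead of deleting it first
        timeout=60  # Seconds to wait on each connection before giving up
    )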

print("Download completed!")


Downloading images for: malnourished child face
[%] Downloading Images to C:\Users\asmit\Desktop\Malnutrition Detection using DL\dataset\malnourished child face


[!!]Indexing page: 1

[%] Indexed 35 Images on Page 1.


[%] Downloading Image #1 from https://images.fineartamerica.com/images/artworkimages/mediumlarge/2/malnourished-nigerian-child-bettmann.jpg
[%] File Downloaded !

[%] Downloading Image #2 from https://images.fineartamerica.com/images-medium-large/malnourished-child-mauro-fermariello.jpg
[%] File Downloaded !

[%] Downloading Image #3 from https://www.marham.pk/healthblog/wp-content/uploads/2018/04/malnourished-1.jpg
[!] Issue getting: https://www.marham.pk/healthblog/wp-content/uploads/2018/04/malnourished-1.jpg
[!] Error:: HTTP Error 410: Gone
[%] Downloading Image #3 from https://static.toiimg.com/thumb/msid-107789467,width-1070,height-580,imgsize-1507363,resizemode-75,overlay-toi_sw,pt-32,y_pad-40/photo.jpg
[%] File Downloaded !

[%] Downloading Image #4 from https:/
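
The log above shows that some URLs fail (for example the `HTTP Error 410: Gone`), so a folder may end up with fewer files than the requested `limit`. A minimal sketch for checking how many images each class folder actually holds is given below. The helper name `count_images` and the extension list are illustrative assumptions, not part of `bing_image_downloader`.

```python
from pathlib import Path

# Assumed set of extensions to count as images; adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}

def count_images(dataset_dir="dataset"):
    """Return {class_folder_name: number_of_image_files} for each subfolder."""
    root = Path(dataset_dir)
    counts = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling `count_images('dataset')` after the download returns one entry per query folder, which makes a class imbalance visible before any training starts.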

In [None]:
!pip install icrawler

from icrawler.builtin import GoogleImageCrawler
import os

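# Download images for every query into a single class folder
def download_images(folder_name, queries, per_query=50):
    os.makedirs(folder_name, exist_ok=True)
    for q in queries:
        print(f"Downloading images for query: {q}")
        google_crawler = GoogleImageCrawler(storage={'root_dir': folder_name})
        # file_idx_offset='auto' continues numbering after the files already in
        # the folder; with the default offset every query restarts at 000001 and
        # skips names that already exist, so only the first query adds images.
        google_crawler.crawl(keyword=q, max_num=per_query, file_idx_offset='auto')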

# Queries for healthy faces
healthy_queries = [
    "healthy child face",
    "healthy baby portrait",
    "smiling child face",
    "happy toddler face",
    "healthy kid closeup",
    "cute baby headshot",
    "smiling school child",
    "healthy baby closeup face",
    "happy kid portrait",
    "child face outdoor smiling"
]

# Queries for malnourished faces
malnourished_queries = [
    "malnourished child face",
    "thin baby face malnutrition",
    "starving child closeup",
    "malnutrition baby portrait",
    "undernourished child face",
    "skinny baby malnourished",
    "child suffering malnutrition",
    "severe malnourished baby face",
    "malnutrition kid closeup",
    "malnourished toddler portrait"
]

# Download up to ~500 images per class (50 per query * 10 queries); the crawler may return fewer
download_images('dataset/healthy child face', healthy_queries, per_query=50)
download_images('dataset/malnourished child face', malnourished_queries, per_query=50)

print("Scraping finished for both classes!")
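
Because every query for a class writes into the same folder, and both the Bing and Google crawlers target the same `dataset/<class>/` directories, the same picture can be saved more than once. As a first cleaning step, a minimal sketch of exact-duplicate removal is shown below. The helper name `remove_exact_duplicates` is hypothetical, and the approach only catches byte-identical files; resized or re-encoded copies of an image would need a perceptual comparison instead.

```python
import hashlib
from pathlib import Path

def remove_exact_duplicates(folder, dry_run=True):
    """Find byte-identical files in `folder`.

    Returns the list of duplicate paths. Files are deleted only when
    dry_run is False; the first file seen for each content is kept.
    """
    seen = {}        # SHA-256 digest -> first file seen with that content
    duplicates = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(path)
            if not dry_run:
                path.unlink()
        else:
            seen[digest] = path
    return duplicates
```

Running `remove_exact_duplicates('dataset/healthy child face')` first in its default dry-run mode lists what would be deleted; passing `dry_run=False` then removes those files.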
