
# **Image Scraping for Dataset Creation**

To build a custom dataset for malnutrition detection, we scraped images from **Google** and **Bing** using automated crawlers (`icrawler` for Google, `bing_image_downloader` for Bing).
- Images of **healthy child faces** and **malnourished child faces** were collected with diverse search queries to capture variation in lighting, facial expression, and background.
- The scraped images form the basis of our dataset, which is later cleaned, split, and augmented to train a ResNet50 model.
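
Overlapping search queries often return the same picture more than once, so the cleaning step mentioned above typically begins by dropping exact duplicates. The following is a minimal sketch, assuming duplicates are byte-identical files. The helper name `remove_exact_duplicates` is hypothetical and not part of the original notebook, and resized or recompressed copies are not detected.

```python
import hashlib
from pathlib import Path


def remove_exact_duplicates(folder):
    """Delete byte-identical files in `folder`, keeping the first of each.

    Hypothetical helper for illustration; returns the number of files removed.
    """
    seen = set()
    removed = 0
    # Sort so that the file kept for each hash is deterministic
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()  # same content already seen under another name
            removed += 1
        else:
            seen.add(digest)
    return removed
```

It could be run on a class folder once scraping has finished, for example `remove_exact_duplicates('dataset/healthy child face')`.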


In [2]:
!pip install icrawler


Collecting icrawler
  Obtaining dependency information for icrawler from https://files.pythonhosted.org/packages/c1/14/1d68f9d2b01955f4c4c63d378e0a331497055b4b96ec1d3a175222411544/icrawler-0.6.10-py3-none-any.whl.metadata
  Downloading icrawler-0.6.10-py3-none-any.whl.metadata (6.2 kB)
Downloading icrawler-0.6.10-py3-none-any.whl (36 kB)
Installing collected packages: icrawler
Successfully installed icrawler-0.6.10


In [3]:

from icrawler.builtin import GoogleImageCrawler
import os

# Ensure the "healthy child face" folder is created
os.makedirs('dataset/healthy child face', exist_ok=True)

# Download healthy child images
google_crawler = GoogleImageCrawler(storage={'root_dir': 'dataset/healthy child face'})
google_crawler.crawl(keyword='healthy child face', max_num=200)

print("Healthy child images downloaded!")


2025-07-20 02:46:10,945 - INFO - icrawler.crawler - start crawling...
2025-07-20 02:46:10,945 - INFO - icrawler.crawler - starting 1 feeder threads...
2025-07-20 02:46:10,946 - INFO - feeder - thread feeder-001 exit
2025-07-20 02:46:10,949 - INFO - icrawler.crawler - starting 1 parser threads...
2025-07-20 02:46:10,951 - INFO - icrawler.crawler - starting 1 downloader threads...
2025-07-20 02:46:12,979 - INFO - parser - parsing result page https://www.google.com/search?q=healthy+child+face&ijn=0&start=0&tbs=&tbm=isch
2025-07-20 02:46:13,496 - ERROR - downloader - Response status code 403, file https://www.shutterstock.com/image-photo/happy-baby-face-little-child-260nw-1414784948.jpg
2025-07-20 02:46:13,965 - ERROR - downloader - Response status code 403, file https://static.vecteezy.com/system/resources/previews/009/713/845/large_2x/outdoor-portrait-handsome-young-boy-looking-at-camera-with-smiling-face-healthy-kid-with-a-happy-face-standing-alone-in-the-park-positive-child-playing-out

Healthy child images downloaded!


In [4]:
from icrawler.builtin import GoogleImageCrawler
import os

# Ensure the "healthy child face" folder is created
os.makedirs('dataset/healthy child face', exist_ok=True)

# Multiple queries for more diversity and image count
queries = [
    "healthy child face",
    "happy child face",
    "healthy baby face",
    "cute baby portrait",
    "smiling healthy kid face",
    "child portrait studio photo",
    "happy baby closeup face",
    "child outdoor face",
    "smiling toddler portrait",
    "healthy boy face",
    "healthy girl face",
    "kids headshot",
    "school child portrait",
    "cute smiling baby face",
    "happy children closeup face",
    "baby face high resolution"
]

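# Download up to 30 images per query (16 queries, so at most 480 images)
for query in queries:
    google_crawler = GoogleImageCrawler(storage={'root_dir': 'dataset/healthy child face'})
    # file_idx_offset='auto' continues numbering after the files already in the
    # folder. With the default offset of 0, every query restarts at 000001.jpg
    # and existing files are skipped instead of new images being saved.
    google_crawler.crawl(keyword=query, max_num=30, file_idx_offset='auto')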

print("400+ healthy child images downloaded (approx)!")


2025-07-20 02:51:16,090 - INFO - icrawler.crawler - start crawling...
2025-07-20 02:51:16,091 - INFO - icrawler.crawler - starting 1 feeder threads...
2025-07-20 02:51:16,091 - INFO - feeder - thread feeder-001 exit
2025-07-20 02:51:16,094 - INFO - icrawler.crawler - starting 1 parser threads...
2025-07-20 02:51:16,096 - INFO - icrawler.crawler - starting 1 downloader threads...




2025-07-20 02:51:17,653 - INFO - parser - parsing result page https://www.google.com/search?q=healthy+child+face&ijn=0&start=0&tbs=&tbm=isch
2025-07-20 02:51:17,765 - INFO - downloader - skip downloading file 000001.jpg
2025-07-20 02:51:17,767 - INFO - downloader - skip downloading file 000002.jpg
2025-07-20 02:51:17,769 - INFO - downloader - skip downloading file 000003.jpg
2025-07-20 02:51:17,770 - INFO - downloader - skip downloading file 000004.jpg
2025-07-20 02:51:17,771 - INFO - downloader - skip downloading file 000005.jpg
2025-07-20 02:51:17,772 - INFO - downloader - skip downloading file 000006.jpg
2025-07-20 02:51:17,772 - INFO - downloader - skip downloading file 000007.jpg
2025-07-20 02:51:17,773 - INFO - downloader - skip downloading file 000008.jpg
2025-07-20 02:51:17,775 - INFO - downloader - skip downloading file 000009.jpg
2025-07-20 02:51:17,776 - INFO - downloader - skip downloading file 000010.jpg
2025-07-20 02:51:17,776 - INFO - downloader - skip downloading file 0

400+ healthy child images downloaded (approx)!


2025-07-20 02:52:43,273 - INFO - parser - downloaded image reached max num, thread parser-001 is ready to exit
2025-07-20 02:52:43,273 - INFO - parser - thread parser-001 exit
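
The "skip downloading file" lines in the log above suggest that file names already on disk were reused, so the printed "400+" figure may overstate what was actually saved. A small sketch for checking the real count follows. The helper `count_images` and its extension list are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

# Assumed set of extensions; adjust to whatever the crawler actually saved
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def count_images(folder):
    """Count files under `folder` (recursively) with an image-like extension."""
    root = Path(folder)
    if not root.is_dir():
        return 0
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Calling `count_images('dataset/healthy child face')` after the loop gives the number of images present on disk, regardless of how many downloads failed or were skipped.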


In [None]:
from bing_image_downloader import downloader

# List of diverse search queries to get more healthy child images
healthy_queries = [
    "healthy child face",
    "healthy baby face",
    "smiling healthy kid",
    "happy toddler face",
    "child portrait photo",
    "cute baby closeup",
    "school child smiling",
    "baby face high quality",
    "happy child outdoor",
    "cute toddler headshot"
]

# Download images for each query
for query in healthy_queries:
    print(f"Downloading images for: {query}")
    downloader.download(
        query,
        limit=100,  # Up to 100 images per query
        output_dir='dataset',  # Saved under dataset/<query>/
        adult_filter_off=False,  # Keep Bing's adult-content filter enabled
        force_replace=False,
        timeout=60
    )

print("All healthy child images downloaded!")


Downloading images for: healthy child face
[%] Downloading Images to C:\Users\asmit\Desktop\Malnutrition Detection using DL\dataset\healthy child face


[!!]Indexing page: 1

[%] Indexed 12 Images on Page 1.


[%] Downloading Image #1 from https://s10443.pcdn.co/wp-content/uploads/2017/01/ThinkstockPhotos-512965947.jpg
[%] File Downloaded !

[%] Downloading Image #2 from https://www.thehealthy.com/wp-content/uploads/2018/02/GettyImages-1155240409.jpg
[%] File Downloaded !

[%] Downloading Image #3 from https://domf5oio6qrcr.cloudfront.net/medialibrary/5850/e58e6784-ed7e-4aad-aa4f-822b8ae4bee4.jpg
[%] File Downloaded !

[%] Downloading Image #4 from https://www.news-medical.net/image.axd?picture=2020%2F11%2Fshutterstock_1572592888.jpg
[%] File Downloaded !

[%] Downloading Image #5 from https://www.thehealthy.com/wp-content/uploads/2019/05/food-healthy.jpg?resize=1024
[%] File Downloaded !

[%] Downloading Image #6 from https://domf5oio6qrcr.cloudfront.net/medialibrary/12356/d26a84e6-11
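
Because `bing_image_downloader` writes each query to its own `dataset/<query>/` folder, the healthy-class images end up spread across several directories. One way to gather them into a single class folder is sketched below. The helper `merge_query_folders` and its `<query>_<name>` renaming scheme are assumptions for illustration, not part of the original notebook.

```python
import shutil
from pathlib import Path


def merge_query_folders(dataset_dir, queries, target_name):
    """Move files from dataset_dir/<query>/ into dataset_dir/<target_name>/.

    Each file name is prefixed with its query (spaces replaced by underscores)
    so identically named files from different queries do not collide.
    Returns the number of files moved.
    """
    root = Path(dataset_dir)
    target = root / target_name
    target.mkdir(parents=True, exist_ok=True)
    moved = 0
    for query in queries:
        source = root / query
        if not source.is_dir() or source == target:
            continue  # nothing downloaded for this query, or already in place
        for path in sorted(source.iterdir()):
            if path.is_file():
                new_name = f"{query.replace(' ', '_')}_{path.name}"
                shutil.move(str(path), str(target / new_name))
                moved += 1
    return moved
```

With the query list from the cell above, `merge_query_folders('dataset', healthy_queries, 'healthy child face')` would collect everything into the folder already used by the Google crawler, leaving files that are already there untouched.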
