<a href="https://colab.research.google.com/github/joedockrill/image-scraper/blob/master/ImageScraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DuckDuckGo and Google Image Scraper

This notebook is an image scraper for creating deep learning datasets. Expand this section for help. 

This notebook can scrape from Google and DuckDuckGo, but Google is really just an emergency backup in case the DuckDuckGo code breaks at some point.

DDG thumbnails are larger, its search options are better, and its results include the URL of the original (full-size) image, which you can download instead of the thumbnail by passing any `img_size` other than `ImgSize.Thumbs`.

Bear in mind that you will get more failures when downloading original images because of out-of-date links, truncated downloads, and sites which ban hot-linking.

**Version 2** \
\
You can now constrain DDG searches as follows:

```
duckduckgo_search(label: str, keywords: str, max_results: int=100,
                      img_size: ImgSize=ImgSize.Thumbs, 
                      img_type: ImgType=ImgType.Photo,
                      img_layout: ImgLayout=ImgLayout.Square,
                      img_color: ImgColor=ImgColor.All) -> None:

img_size can be one of the following: (default=ImgSize.Thumbs)
Thumbs, Small, Medium, Large, Wallpaper
 
img_type can be one of the following: (default=ImgType.Photo)
All, Photo, Clipart, Gif, Transparent

img_layout can be one of the following: (default=ImgLayout.Square)
All, Square, Tall, Wide
  
img_color can be one of the following: (default = ImgColor.All)
All, Color, Monochrome, Red, Orange, Yellow, Green, Blue, Purple, Pink, Brown, Black, Gray, Teal, White
```
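
As a rough sketch of how these options combine, the snippet below mirrors the way the code cell assembles the filter string it sends to DuckDuckGo. `build_constraints` is a hypothetical helper written only for illustration (the notebook does this inline inside `duckduckgo_scrape_urls`), and the enums here are trimmed copies of the real ones.

```python
from enum import Enum

# Trimmed copies of the notebook's enums, just enough for the example
class ImgSize(Enum):
    Thumbs = ""
    Large = "Large"

class ImgType(Enum):
    All = ""
    Photo = "photo"

class ImgLayout(Enum):
    All = ""
    Square = "Square"
    Wide = "Wide"

class ImgColor(Enum):
    All = ""
    Red = "Red"

def build_constraints(img_size=ImgSize.Thumbs, img_type=ImgType.Photo,
                      img_layout=ImgLayout.Square, img_color=ImgColor.All) -> str:
    """Hypothetical helper: one comma-separated slot per option, empty when unconstrained."""
    parts = [
        "" if img_size == ImgSize.Thumbs else "size:" + img_size.name,
        "" if img_type == ImgType.All else "type:" + img_type.name,
        "" if img_layout == ImgLayout.All else "layout:" + img_layout.name,
        "" if img_color == ImgColor.All else "color:" + img_color.name,
    ]
    return ",".join(parts)

defaults = build_constraints()
custom = build_constraints(ImgSize.Large, ImgType.All, ImgLayout.Wide, ImgColor.Red)
print(defaults)
print(custom)
```

In the notebook itself you would pass the same enum members straight to the search function, for example `duckduckgo_search("Nice", "nice clowns", max_results=20, img_size=ImgSize.Large, img_layout=ImgLayout.Wide)`.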

Workflow:
- Write some search functions in the "Download your images here" cell
- Run the image cleaner to delete rubbish
- Zip it all up
- Download it or copy it to Google Drive

Images will be downloaded into folders by label name. If you want a single-level zip file with all the images at the root, just pass an empty string as the label name.
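
To make the folder layout concrete, here is a small sketch of how a save path is derived from a label. `target_path` is a hypothetical helper for illustration only; it follows the same naming scheme as the code cell (label plus a zero-padded counter, always with a `.jpg` extension).

```python
import os

BASE_FOLDER = "images"  # same base folder name the notebook uses

def target_path(label: str, index: int) -> str:
    """Hypothetical helper: where image number `index` for `label` would be saved."""
    folder = os.path.join(BASE_FOLDER, label)
    return os.path.join(folder, label + str(index).zfill(3) + ".jpg")

labelled = target_path("Nice", 1)  # lands in a per-label subfolder
flat = target_path("", 1)          # empty label: lands directly under BASE_FOLDER
print(labelled)
print(flat)
```

With an empty label there is no subfolder and no filename prefix, which is why the resulting zip is flat.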

If you would prefer to create a CSV file of label/url pairs you can do that at the bottom of the notebook.

Hugs & kisses, Joe Dockrill. 

credits: 
- https://github.com/deepanprabhu/duckduckgo-images-api for the base DuckDuckGo code
- Iegor Timukhin for pointing out that the param for search constraints was sitting under my nose empty the whole time


# Code

In [None]:
#@title RUN THIS CELL.
#@markdown If you want to see the code, select this cell, click the ... menu in the top right of
#@markdown the cell then click Form->Hide Form

import os
import glob
import requests
import re
import json
import time
import shutil
from bs4 import BeautifulSoup
from PIL import Image
import ipywidgets as widgets
from ipywidgets import interactive
from IPython.display import display
from google.colab import files
from google.colab import drive
from typing import Callable
from enum import Enum
import pandas as pd # needed by save_urls()

BASE_FOLDER = "images"

##########################################################################################
# scraping
##########################################################################################
def google_scrape_urls(keywords: str, max_results: int) -> list:
  '''scrape urls from google image search'''
  BASE_URL = "https://www.google.com/search?site=&tbm=isch&source=hp&biw=1873&bih=990&q="

  HEADERS = {
      'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
      'Accept-Encoding': 'none',
      'Accept-Language': 'en-US,en;q=0.8',
      'Connection': 'keep-alive',
  }
  
  searchurl = BASE_URL + keywords
  resp = requests.get(searchurl, headers=HEADERS)
  html = resp.text
  
  soup = BeautifulSoup(html, "html.parser")
  results = soup.findAll("img", {"data-src":True}, limit=max_results)
  
  links = []
  for result in results: # avoid shadowing the re module
    links.append(result["data-src"])

  return links  

class ImgSize(Enum):
  Thumbs=""
  Small="Small"
  Medium="Medium"
  Large="Large"
  Wallpaper="Wallpaper"

class ImgType(Enum):
  All=""
  Photo="photo"
  Clipart="clipart"
  Gif="gif"
  Transparent="transparent"

class ImgLayout(Enum):
  All=""
  Square="Square"
  Tall="Tall"
  Wide="Wide"
  
class ImgColor(Enum):
  All=""
  Color="color"
  Monochrome="Monochrome"
  Red="Red"
  Orange="Orange"
  Yellow="Yellow"
  Green="Green"
  Blue="Blue"
  Purple="Purple"
  Pink="Pink" 
  Brown="Brown"
  Black="Black" 
  Gray="Gray" 
  Teal="Teal"
  White="White"

def duckduckgo_scrape_urls(keywords: str, max_results: int, 
                           img_size: ImgSize=ImgSize.Thumbs, 
                           img_type: ImgType=ImgType.Photo,
                           img_layout: ImgLayout=ImgLayout.Square,
                           img_color: ImgColor=ImgColor.All) -> list:
  '''scrape urls from duckduckgo image search'''
  BASE_URL = 'https://duckduckgo.com/'
  params = {
    'q': keywords
  }
  results = 0
  links = []

  resp = requests.post(BASE_URL, data=params)
  match = re.search(r'vqd=([\d-]+)\&', resp.text, re.M|re.I)
  assert match is not None, "Failed to obtain search token"

  HEADERS = {
      'authority': 'duckduckgo.com',
      'accept': 'application/json, text/javascript, */*; q=0.01',
      'sec-fetch-dest': 'empty',
      'x-requested-with': 'XMLHttpRequest',
      'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
      'sec-fetch-site': 'same-origin',
      'sec-fetch-mode': 'cors',
      'referer': 'https://duckduckgo.com/',
      'accept-language': 'en-US,en;q=0.9',
  }

  constraints = ""
  if(img_size != ImgSize.Thumbs): constraints +=  "size:" + img_size.name
  constraints += ","
  if(img_type != ImgType.All): constraints +=  "type:" + img_type.name
  constraints += ","
  if(img_layout != ImgLayout.All): constraints +=  "layout:" + img_layout.name
  constraints += ","
  if(img_color != ImgColor.All): constraints +=  "color:" + img_color.name
  
  PARAMS = (
      ('l', 'us-en'),
      ('o', 'json'),
      ('q', keywords),
      ('vqd', match.group(1)),
      ('f', constraints),
      ('p', '1'),
      ('v7exp', 'a'),
  )

  requestUrl = BASE_URL + "i.js"

  while True:
      while True:
          try:
              resp = requests.get(requestUrl, headers=HEADERS, params=PARAMS)
              data = json.loads(resp.text)
              break
          except ValueError:
              print("Hit request throttle, sleeping and retrying")
              time.sleep(5) #seems a lot but ok...
              continue

      #result["thumbnail"] is normally big enough for most purposes
      #result["width"], result["height"] are for the full size img in result["image"]
      #result["image"] url to full size img on orig site (so may be less reliable) 
      #result["url"], result["title"].encode('utf-8') from the page the img came from
      
      for result in data["results"]:
        if(img_size == ImgSize.Thumbs): links.append(result["thumbnail"])
        else:                       links.append(result["image"])

        if(max_results is not None):
          if(len(links) >= max_results) : return links

      if "next" not in data:
          #no next page, all done
          return links

      requestUrl = BASE_URL + data["next"]

##########################################################################################
# searching & downloading
##########################################################################################
def google_search(label: str, keywords: str, max_results: int=100) -> None:
  '''run a google search and download the images'''
  print("Google search: ", keywords)
  links = google_scrape_urls(keywords,max_results)
  download_urls(label, links)

def duckduckgo_search(label: str, keywords: str, max_results: int=100,
                           img_size: ImgSize=ImgSize.Thumbs, 
                           img_type: ImgType=ImgType.Photo,
                           img_layout: ImgLayout=ImgLayout.Square,
                           img_color: ImgColor=ImgColor.All) -> None:
  '''run a duckduckgo search and download the images'''
  print("Duckduckgo search:", keywords)
  links = duckduckgo_scrape_urls(keywords, max_results, img_size, img_type, img_layout, img_color)
  download_urls(label, links)

def download_urls(label: str, links: list) -> None:
  '''downloads urls into the folder for that label'''
  if(len(links) == 0):
    print("Nothing to download!"); return

  print("Downloading", len(links), "images into", label)

  folder = os.path.join(BASE_FOLDER, label)
  if not os.path.exists(folder): os.makedirs(folder)

  bar = widgets.IntProgress(value=0, min=0, max=len(links)) # one tick per link
  display(bar)

  i = 1
  mk_fn = lambda i: os.path.join(folder, label + str(i).zfill(3) + ".jpg")
  is_file = lambda i: os.path.isfile(mk_fn(i))
  while is_file(i): i += 1 # don't overwrite previous searches
  
  for link in links:
      try:
        resp = requests.get(link)      
        fn = mk_fn(i)
        with open(fn, "wb") as file:
            file.write(resp.content)

        try:
          img = Image.open(fn)
          img.verify()
          img.close()
        except Exception:
          print(fn, "is invalid")
          os.remove(fn)
      except Exception:
        print("Exception occurred while retrieving", link)

      i += 1
      bar.value += 1

  bar.bar_style = "success"

def save_urls(filename: str, scrape_func: Callable, label: str, keywords: str, max_results: int) -> None:
  '''run a search and concat the urls to a csv'''
  if not os.path.isfile(filename):
    df = pd.DataFrame(columns=["URL", "Label"])
    df.to_csv(filename, index=False)

  urls = scrape_func(keywords, max_results)
  rows = []

  for url in urls:
    rows.append({"URL":url, "Label":label})
    
  df = pd.concat([pd.read_csv(filename), pd.DataFrame(rows)]) 
  df.to_csv(filename, index=False)

##########################################################################################
# moving files around
##########################################################################################
def download_file(filename: str) -> None:
  '''trigger a file download from colab to local system'''
  files.download(filename)

def transfer_to_drive(filename: str, dest_folder: str="Datasets") -> None:
  '''transfer file from colab runtime to google drive'''
  drive.mount('/content/drive')
  folder = os.path.join("/content/drive/My Drive", dest_folder)
  if not os.path.exists(folder): os.makedirs(folder)
  shutil.copyfile(filename, os.path.join(folder, os.path.basename(filename)))

**Run this cell to delete all image files (to create another dataset or reset)**

In [None]:
!rm -r -f images/*

**Download your images here**

In [None]:
ZIP_NAME = "images.zip" # change this to something more meaningful

# and run some searches: (help and options are in the section at the top)
#
# duckduckgo_search(label: str, keywords: str, max_results: int=100,
#                       img_size: ImgSize=ImgSize.Thumbs, 
#                       img_type: ImgType=ImgType.Photo,
#                       img_layout: ImgLayout=ImgLayout.Square,
#                       img_color: ImgColor=ImgColor.All) -> None:

# EG:
# ZIP_NAME = "Clowns.zip"
# duckduckgo_search("Nice", "nice clowns", max_results=20)
# duckduckgo_search("Scary", "scary clowns", max_results=20)


# you can also use google_search() if you prefer or if the ddg code breaks.


In [None]:
#@title Quick & Dirty Dataset Cleaner 
#@markdown Run this cell for a quick image cleaner. When you hit delete it's done immediately but 
#@markdown you'll need to run the cell again or swap folders to refresh the view.

#@markdown This is SLOW at loading a lot of images. Pagination is on my list of things to do.


def click_handler(btn):
  os.remove(btn.tag)
  btn.disabled = True

def render_image_cleaner(folder):
  items = []
  if(folder == "/"): folder = ""
  path = os.path.join(BASE_FOLDER, folder)
  
  for filename in os.listdir(path):
      if filename.endswith(".jpg"):
          with open(os.path.join(path, filename), "rb") as file: # close the handle after reading
              fstream = file.read()
          img = widgets.Image(value=fstream, format='jpg')
          img.layout.width="150px"
          btn = widgets.Button(description="Delete")
          btn.tag = os.path.join(path, filename)
          btn.on_click(click_handler)
          box = widgets.VBox(children=(img,btn))
          box.layout.margin = "5px"
          items.append(box)
  
  grid = widgets.GridBox(items, layout=widgets.Layout(grid_template_columns="repeat(4, 25%)"))
  grid.layout.margin = "15px"
  display(grid)

# not named "files": that would shadow google.colab.files and break download_file()
root_files = [o for o in glob.glob(BASE_FOLDER + "/*") if os.path.isfile(o)]
folders = next(os.walk(BASE_FOLDER))[1]
folders = [f for f in folders if f[0] != "."]
folders.sort()
if(len(root_files) > 0): folders = ["/"] + folders

w = interactive(render_image_cleaner, folder=folders)
display(w)
display(w.children[0]) # dropdown top & bottom

**Run this cell to create a zip file**

In [None]:
!rm -f {ZIP_NAME}
!zip -q -r {ZIP_NAME} images

**Run one of these cells to get your zip file**

In [None]:
# download to your local system
download_file(ZIP_NAME)

In [None]:
# copy to google drive 
transfer_to_drive(ZIP_NAME, dest_folder="Datasets")

# Create a CSV file of URLs

If you'd rather distribute a file with the image URLs and labels, and have people download the images themselves, you can do so here.
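
For whoever receives the CSV, a minimal sketch of reading it back might look like the following. It assumes the `URL` and `Label` column names that `save_urls` writes; `urls_by_label` is a hypothetical helper and the sample rows use placeholder URLs, not real search results.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the contents of images.csv; the URLs are placeholders
sample_csv = """URL,Label
https://example.com/a.jpg,dogs
https://example.com/b.jpg,cats
https://example.com/c.jpg,dogs
"""

def urls_by_label(csv_text: str) -> dict:
    """Hypothetical helper: group the URL column by the Label column."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Label"]].append(row["URL"])
    return dict(grouped)

grouped = urls_by_label(sample_csv)
for label, urls in grouped.items():
    print(label, len(urls))
```

Each label's list could then be handed to whatever downloader the recipient prefers, one folder per label.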

In [None]:
CSV_NAME = "images.csv" #change this to something more meaningful

!rm -f {CSV_NAME}

# save_urls(CSV_NAME, duckduckgo_scrape_urls, "dogs", "dogs or puppies", 10)
# save_urls(CSV_NAME, duckduckgo_scrape_urls, "cats", "cats or kittens", 10)
# save_urls(CSV_NAME, duckduckgo_scrape_urls, "rabbits", "rabbits sitting in mugs", 10)

In [None]:
# download to your local system
download_file(CSV_NAME)

In [None]:
# copy to google drive 
transfer_to_drive(CSV_NAME, dest_folder="Datasets")