# DuckDuckGo Image Scraper

This was originally an image scraper notebook for creating deep learning datasets.

It has since been turned into an installable library, which is much easier to use: you can drop a few lines of code into your own notebook while you're experimenting.

This notebook now shows you how to use the library.

Docs are at [joedockrill.github.io/jmd_imagescraper/](https://joedockrill.github.io/jmd_imagescraper/)

Hugs & kisses, Joe Dockrill. 

## Install



In [None]:
!pip install -q jmd_imagescraper

## Download images

In [None]:
from pathlib import Path
root = Path().cwd()/"images"

from jmd_imagescraper.core import * # don't worry, it's designed to work with import *

# duckduckgo_search(root, "Cats", "cute kittens", max_results=10)
# duckduckgo_search(root, "Dogs", "cute puppies", max_results=10)
# duckduckgo_search(root, "Birds", "cute baby ducks and chickens", max_results=10)
duckduckgo_search(root, "people", "portrait photography linkedin", max_results=100)
# file paths are returned so if you want to snag a list of downloaded files as you go, do this:

# images = []
# images.extend(duckduckgo_search(root, "Cats", "cute kittens", max_results=10))
# images.extend(duckduckgo_search(root, "Dogs", "cute puppies", max_results=10))
# images.extend(duckduckgo_search(root, "Birds", "cute baby ducks and chickens", max_results=10))
# images

Duckduckgo search: portrait photography linkedin
Downloading results into /content/images/people


[PosixPath('/content/images/people/001_3494ed2a.jpg'),
 PosixPath('/content/images/people/002_c645e58d.jpg'),
 PosixPath('/content/images/people/003_1a895d91.jpg'),
 PosixPath('/content/images/people/004_85b5c267.jpg'),
 PosixPath('/content/images/people/005_7fa149ab.jpg'),
 PosixPath('/content/images/people/006_bf12f046.jpg'),
 PosixPath('/content/images/people/007_c84010a5.jpg'),
 PosixPath('/content/images/people/008_684738f0.jpg'),
 PosixPath('/content/images/people/009_8014795e.jpg'),
 PosixPath('/content/images/people/010_b5cdcbd0.jpg'),
 PosixPath('/content/images/people/011_fbc682c6.jpg'),
 PosixPath('/content/images/people/012_51af085f.jpg'),
 PosixPath('/content/images/people/013_99290b2c.jpg'),
 PosixPath('/content/images/people/014_e7aa6ddc.jpg'),
 PosixPath('/content/images/people/015_9af6f5c9.jpg'),
 PosixPath('/content/images/people/016_66b150fc.jpg'),
 PosixPath('/content/images/people/017_dd1db495.jpg'),
 PosixPath('/content/images/people/018_8c7afcc1.jpg'),
 PosixPath
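
The returned list holds `Path` objects shaped like the ones in the output above, with the label as the parent folder name. A minimal sketch of how you might tally downloads per label follows. `count_images_by_label` and `IMAGE_SUFFIXES` are hypothetical names for illustration, not part of `jmd_imagescraper`, and the set of extensions is an assumption.

```python
from collections import Counter
from pathlib import Path

# Assumed set of image extensions; adjust to whatever your dataset contains.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def count_images_by_label(paths):
    """Count image files per label, taking the parent folder name as the label."""
    counts = Counter()
    for p in map(Path, paths):
        if p.suffix.lower() in IMAGE_SUFFIXES:
            counts[p.parent.name] += 1
    return dict(counts)


# Two paths copied from the output above stand in for a real search result.
example = [
    Path("/content/images/people/001_3494ed2a.jpg"),
    Path("/content/images/people/002_c645e58d.jpg"),
]
print(count_images_by_label(example))  # {'people': 2}
```

In a notebook you would pass the `images` list built with `images.extend(...)` instead of `example`.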

## Changing params across multiple searches

In [None]:
# If you're going to override default params across multiple searches you can use a 
# dictionary like this (so you can change search params for the entire dataset once).

# params = {
#     "max_results": 10,             # this can go up to 477 at the time of writing
#     "img_size":    ImgSize.Cached, 
#     "img_type":    ImgType.Photo,
#     "img_layout":  ImgLayout.Square,
#     "img_color":   ImgColor.Purple
# }

# duckduckgo_search(root, "Nice", "nice clowns", **params)
# duckduckgo_search(root, "Scary", "scary clowns", **params)
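
The `**params` pattern is plain Python keyword unpacking, so it can be tried without the library. The sketch below uses a hypothetical stand-in function, `describe_search`, that only echoes its arguments; plain strings stand in for the library's enum values. It also shows one way to override a single shared setting for one search.

```python
def describe_search(root, label, query, max_results=100, **filters):
    """Hypothetical stand-in for a search function; it only reports its arguments."""
    return {"root": root, "label": label, "query": query,
            "max_results": max_results, **filters}


# Shared settings for the whole dataset (strings stand in for enum members).
shared = {"max_results": 10, "img_layout": "Square"}

# Every search picks up the shared settings.
nice = describe_search("images", "Nice", "nice clowns", **shared)

# Merge into a new dict to override one value for a single search.
# Writing max_results=25 next to **shared would raise a TypeError,
# because the same keyword would be passed twice.
scary = describe_search("images", "Scary", "scary clowns",
                        **{**shared, "max_results": 25})

print(nice["max_results"], scary["max_results"])  # 10 25
```

Merging into a fresh dictionary leaves `shared` untouched, so later searches still get the dataset-wide defaults.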

## Deleting all images

In [None]:
#rmtree(root)
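
If you would rather not get an error when the folder is already gone, a small guard around `shutil.rmtree` is one option. This is a sketch only; `delete_image_folder` is a hypothetical helper, not something the library provides, and the demo runs against a throwaway temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def delete_image_folder(folder):
    """Delete folder and everything in it; return True if something was removed."""
    folder = Path(folder)
    if not folder.is_dir():
        return False
    shutil.rmtree(folder)
    return True


# Demo on a temporary directory so nothing real is deleted.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "images"
    (demo / "Cats").mkdir(parents=True)
    (demo / "Cats" / "001.jpg").write_bytes(b"placeholder")
    first = delete_image_folder(demo)    # folder existed, so it is removed
    second = delete_image_folder(demo)   # nothing left to remove
    still_there = demo.exists()
```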

## Displaying the image cleaner

Use this to get rid of unsuitable images without leaving your notebook.

In [None]:
from jmd_imagescraper.imagecleaner import *

display_image_cleaner(root)

## Create a zip to download or transfer to Google Drive

In [None]:
# create zip

ZIP_NAME = "images.zip" # maybe change this?

!rm -f {ZIP_NAME}
!zip -q -r {ZIP_NAME} {root}
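
The shell commands above rely on `zip` being installed. Where it is not, the standard-library `zipfile` module can do a similar job. The following is a sketch under that assumption; `zip_folder` is a hypothetical helper, and it stores paths relative to the folder's parent so the archive layout does not depend on where the notebook runs.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_folder(folder, zip_path):
    """Write every file under folder into zip_path, with paths relative to folder's parent."""
    folder = Path(folder).resolve()
    zip_path = Path(zip_path).resolve()
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(folder.rglob("*")):
            # Skip directories, and skip the archive itself if it lives inside folder.
            if file.is_file() and file != zip_path:
                zf.write(file, file.relative_to(folder.parent).as_posix())
    return zip_path


# Demo on a temporary directory with one placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp) / "images"
    (demo_root / "people").mkdir(parents=True)
    (demo_root / "people" / "001.jpg").write_bytes(b"placeholder")
    archive = zip_folder(demo_root, Path(tmp) / "images.zip")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)  # ['images/people/001.jpg']
```

In the notebook you would call `zip_folder(root, ZIP_NAME)` in place of the two shell lines.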

In [None]:
# download to your local system

from google.colab import files
files.download(ZIP_NAME)

In [None]:
# copy to Google Drive

from google.colab import drive
import shutil

DESTINATION_FOLDER = "Datasets_images" # where would you like this in Google Drive?

drive.mount("/content/drive") 
folder = Path("/content/drive/My Drive/HETIC PFA")/DESTINATION_FOLDER
folder.mkdir(parents=True, exist_ok=True)

shutil.copyfile(ZIP_NAME, str(folder/ZIP_NAME))

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


'/content/drive/My Drive/HETIC PFA/Datasets_images/images.zip'

## Create a CSV file of URLs

If you'd rather distribute a file with the image URLs and labels, and have people download the images themselves, you can do so here.

In [None]:
CSV_NAME = "images.csv" # maybe change this?

!rm -f {CSV_NAME}

csv = Path.cwd()/CSV_NAME
save_urls_to_csv(csv, "people", "portrait photography linkedin", max_results=100)
#save_urls_to_csv(csv, "Scary", "scary clowns", max_results=5)
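
Someone receiving such a file may want to see which URLs belong to which label before downloading anything. The sketch below groups URLs by label with the standard `csv` module. The exact column layout written by `save_urls_to_csv` is not shown here, so the column names are passed in explicitly; the `url` and `label` headers in the demo file are assumptions for illustration, and `group_urls_by_label` is a hypothetical helper.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def group_urls_by_label(csv_path, url_column, label_column):
    """Read a CSV with a header row and return {label: [url, ...]}."""
    grouped = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            grouped[row[label_column]].append(row[url_column])
    return dict(grouped)


# Demo file with assumed column names; check your own CSV's header first.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text(
        "url,label\n"
        "https://example.com/a.jpg,people\n"
        "https://example.com/b.jpg,people\n"
        "https://example.com/c.jpg,Scary\n",
        encoding="utf-8",
    )
    grouped = group_urls_by_label(sample, url_column="url", label_column="label")

print({label: len(urls) for label, urls in grouped.items()})  # {'people': 2, 'Scary': 1}
```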

## Alternative: google_images_download

In [None]:
!pip install google_images_download

Collecting google_images_download
  Downloading google_images_download-2.8.0.tar.gz (14 kB)
Collecting selenium
  Downloading selenium-4.1.5-py3-none-any.whl (979 kB)
Collecting urllib3[secure,socks]~=1.26
  Downloading urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting trio~=0.17
  Downloading trio-0.20.0-py3-none-any.whl (359 kB)
Collecting outcome
  Downloading outcome-1.1.0-py2.py3-none-any.whl (9.7 kB)
Collecting sniffio
  Downloading sniffio-1.2.0-py3-none-any.whl (10 kB)
Collecting async-generator>=1.9
  Downloading async_generator-1.10-py3-none-any.whl (18 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.1.0-py3-none-any.whl (24 kB)
Collecting pyOpenSSL>=0.14
  Downloading pyOpenSSL

In [None]:
from google_images_download import google_images_download  # import the library

response = google_images_download.googleimagesdownload()  # instantiate the downloader class

# build the dictionary of arguments (comma-separated keywords, max images per keyword)
arguments = {"keywords": "Polar bears,baloons,Beaches", "limit": 20, "print_urls": True}
paths = response.download(arguments)  # returns a tuple: (paths per keyword, error count)
print(paths)  # print the absolute paths of the downloaded images


Item no.: 1 --> Item name = Polar bears
Evaluating...
Starting Download...


Unfortunately all 20 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!

Errors: 0


Item no.: 2 --> Item name = baloons
Evaluating...
Starting Download...


Unfortunately all 20 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!

Errors: 0


Item no.: 3 --> Item name = Beaches
Evaluating...
Starting Download...


Unfortunately all 20 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!

Errors: 0

({'Polar bears': [], 'baloons': [], 'Beaches': []}, 0)
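
The tuple printed above pairs a dictionary of paths per keyword with an error count, and here every list is empty even though the error count is 0. A small check like the following makes that case easy to spot in code. It is a sketch based only on the shape of the output shown above; `summarize_download` is a hypothetical helper.

```python
def summarize_download(result):
    """Split a (paths_by_keyword, error_count) tuple into found and empty keywords."""
    paths_by_keyword, error_count = result
    found = {k: len(v) for k, v in paths_by_keyword.items() if v}
    empty = [k for k, v in paths_by_keyword.items() if not v]
    return {"found": found, "empty": empty, "errors": error_count}


# The literal below is copied from the output above.
result = ({'Polar bears': [], 'baloons': [], 'Beaches': []}, 0)
summary = summarize_download(result)

if not summary["found"]:
    print("No images were downloaded for:", ", ".join(summary["empty"]))
```

In the notebook, `summarize_download(paths)` would apply the same check to the value returned by `response.download(arguments)`.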
