<a href="https://colab.research.google.com/github/christianwiloejo/hello-world/blob/master/Dataset_creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset creation in Colab

In this notebook I'll show how to automatically build an image dataset by scraping Google Images. You will also learn how to browse the Colab and Drive filesystems.

Disclaimer: the content of this notebook is for *informational use* only. For massive or frequent use, I recommend consulting the [Google Cloud section](https://cloud.google.com/products/) or the [Custom Search API](https://developers.google.com/custom-search/).

## 1. Install *google_images_download*

Our scraper will be [google_images_download](https://github.com/hardikvasa/google-images-download#troubleshooting-errors-issues), a beautiful and very easy-to-use script. It uses a Selenium-driven browser to scrape the web, but don't worry: no other program will open, because it runs in the background.  
By the way: it is NOT an official Google package.

In [None]:
!pip install google_images_download

Collecting google_images_download
  Downloading https://files.pythonhosted.org/packages/18/ed/0319d30c48f3653802da8e6dcfefcea6370157d10d566ef6807cceb5ec4d/google_images_download-2.8.0.tar.gz
Collecting selenium
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
     |████████████████████████████████| 911kB 2.8MB/s 
Building wheels for collected packages: google-images-download
  Building wheel for google-images-download (setup.py) ... done
  Created wheel for google-images-download: filename=google_images_download-2.8.0-py2.py3-none-any.whl size=14550 sha256=e6e6d0d393d2dba4f0fef032b123808a1b4eeb20f3293872768216f178ff3e72
  Stored in directory: /root/.cache/pip/wheels/1f/28/ad/f56e7061e1d2a9a1affe2f9c649c2570cb9198dd24ede0bbab
Successfully built google-images-download
Installing collected packages: selenium, google-images-download
Successfully install

## 2. Download Chromedriver

As mentioned above, the scraper needs to run a browser in the background, so the browser must be installed and reachable by the scraper.<br>The browser has already been installed with the package, so all that is missing is a way to access it.<br>Let's download *chromedriver*, which provides exactly that.

In [None]:
!wget https://chromedriver.storage.googleapis.com/2.42/chromedriver_linux64.zip  && unzip chromedriver_linux64

--2020-02-24 12:08:18--  https://chromedriver.storage.googleapis.com/2.42/chromedriver_linux64.zip
Resolving chromedriver.storage.googleapis.com (chromedriver.storage.googleapis.com)... 74.125.141.128, 2607:f8b0:400c:c07::80
Connecting to chromedriver.storage.googleapis.com (chromedriver.storage.googleapis.com)|74.125.141.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4039043 (3.9M) [application/zip]
Saving to: ‘chromedriver_linux64.zip’


2020-02-24 12:08:18 (176 MB/s) - ‘chromedriver_linux64.zip’ saved [4039043/4039043]

Archive:  chromedriver_linux64.zip
  inflating: chromedriver            
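
Before pointing the scraper at the unzipped binary, it can be worth checking that the file is really there and can be executed. The helper below is only a sketch: `ensure_executable` is a name I made up for illustration, and it assumes a Linux runtime like Colab's, where the execute permission bit matters.

```python
import os
import stat


def ensure_executable(path):
    """Return True if `path` is a regular file with the user-execute bit set.

    If the file exists but is not executable, the bit is added first.
    Returns False when the file does not exist.
    """
    if not os.path.isfile(path):
        return False
    mode = os.stat(path).st_mode
    if not mode & stat.S_IXUSR:
        # Add the execute permission for the owner, keep the other bits.
        os.chmod(path, mode | stat.S_IXUSR)
    return bool(os.stat(path).st_mode & stat.S_IXUSR)


# Hypothetical usage: the archive above is unzipped into the current directory.
print(ensure_executable("chromedriver"))
```

If this prints `False`, the download or unzip step probably did not leave a `chromedriver` file in the working directory.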


## 3. Set the *chromedriver*  path for the script

We mount Google Drive, which lets us browse the filesystem of the Google environment and get the path of the chromedriver.<br>
We will use the [colabtools module](https://github.com/googlecolab/colabtools) by Google, a tool set that is still in development but already very powerful.
<br>
Execute the next cell, follow the link, and paste the authorization code into the text field below to grant access to your Drive.

In [None]:
import os
#Mount the drive from Google to save the dataset
from google.colab import drive # this will be our driver
drive.mount('/gdrive')
root = '/gdrive/My Drive/'     # if you want to operate on your Google Drive

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /gdrive


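We store the default *Colab* working directory in a variable so we can easily reuse it later. Note that `/gdrive/../content/` is simply another way of writing `/content/`.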

In [None]:
colab_path = '/gdrive/../content/'

In [None]:
chromedriver_path = '/gdrive/../content/chromedriver'
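
The `/gdrive/../` prefix in the two paths above can look odd at first. As a small sketch of what is going on, the snippet below normalizes both strings with `posixpath.normpath`; it works purely on the text of the path and assumes POSIX-style paths, as on the Linux machines behind Colab. The `normalized_*` variable names are mine.

```python
import posixpath

colab_path = '/gdrive/../content/'
chromedriver_path = '/gdrive/../content/chromedriver'

# normpath collapses "gdrive/.." and drops the trailing slash.
normalized_colab_path = posixpath.normpath(colab_path)
normalized_chromedriver_path = posixpath.normpath(chromedriver_path)

print(normalized_colab_path)         # /content
print(normalized_chromedriver_path)  # /content/chromedriver
```

So the variables could equally be written as `/content/` and `/content/chromedriver`; the longer form just makes the relation to the mount point explicit.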

## 4. Scrape!

Here is an example of how to use the scraper. You will find many other arguments on the [official page](https://github.com/hardikvasa/google-images-download#troubleshooting-errors-issues), so check it out.

In [None]:
from google_images_download import google_images_download  # importing the library

keyws = " cats with mustaches"
limit = 100
chromedriver = chromedriver_path
offset = None      # how many links to skip
color_type = None  # color type to apply to the images [full-color, black-and-white, transparent]
size = None        # relative size of the images to download [large, medium, icon, >400*300, >640*480, >800*600, >1024*768, >2MP, >4MP, >6MP, >8MP, >10MP, >12MP, >15MP, >20MP, >40MP, >70MP]
usage_rights = 'labeled-for-reuse'  # very important! Check the docs

arguments = {
    "keywords": keyws,
    "limit": limit,
    "chromedriver": chromedriver,
    "offset": offset,
    "color_type": color_type,
    "size": size,
    "usage_rights": usage_rights,
}  # creating the dictionary of arguments
response = google_images_download.googleimagesdownload()  # class instantiation
response.download(arguments)


Item no.: 1 --> Item name =  cats with mustaches
Evaluating...
Starting Download...


Unfortunately all 100 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!

Errors: 0



({' cats with mustaches': []}, 0)
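
The last line of the output above suggests that `download` returns a tuple: a dictionary mapping each keyword to the list of downloaded file paths, and an error count. That shape is inferred from this run only, not from the library documentation, so treat the following as a sketch. `summarize_download` is a hypothetical helper that counts how many files arrived per keyword.

```python
def summarize_download(result):
    """Count downloaded files per keyword.

    `result` is assumed to be a (paths_by_keyword, error_count) tuple,
    as printed by the cell above.
    """
    paths_by_keyword, error_count = result
    counts = {
        keyword.strip(): len(files)
        for keyword, files in paths_by_keyword.items()
    }
    return counts, error_count


# Fed with the exact value printed above.
counts, errors = summarize_download(({' cats with mustaches': []}, 0))
print(counts, errors)
```

A count of zero with zero errors, as in this run, means that the search itself returned nothing to download, so the following cells will have no files to work with.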

Now the **filesystem** comes into play. We will manage our dataset through its paths.

In [None]:
dataset_path = 'downloads/' + keyws + '/'
dataset = [ dataset_path + img_name for img_name in os.listdir( colab_path + dataset_path ) ]
dataset[:10]  #Peek the first ten

[]
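
As a slightly more defensive variant of the listing above, here is a sketch that returns an empty list when the folder does not exist and keeps only files that look like images. The name `list_images` and the extension list are my own choices, and unlike the cell above it returns full paths rather than paths relative to `colab_path`.

```python
import os

# Assumed set of extensions; adjust it to the formats you expect.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.bmp', '.webp')


def list_images(directory):
    """Return the sorted full paths of the image files in `directory`."""
    if not os.path.isdir(directory):
        return []
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )


# Hypothetical usage with the folder used in this notebook.
print(list_images('/content/downloads/ cats with mustaches')[:10])
```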

## 4.1 Check it!

To make sure the code worked, let's pick a sample and visualize it with matplotlib.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow( plt.imread( colab_path + dataset[4] ) )
plt.grid(False)  # remove grid

IndexError: ignored
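
The `IndexError` above is what happens when `dataset` holds fewer than five items, as in this run, where nothing was downloaded. A minimal sketch of a guard follows; `pick_sample` is a hypothetical helper, not part of any library used here.

```python
def pick_sample(dataset, index=0):
    """Return dataset[index], falling back to the last item if the list is
    shorter than expected, or None if the list is empty."""
    if not dataset:
        return None
    return dataset[min(index, len(dataset) - 1)]


# With the empty dataset from this run there is nothing to show.
sample = pick_sample([], 4)
print(sample)
```

With this in place, the plotting cell could first check `sample is not None` and only then call `plt.imread`.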

Well done! We now have a brand-new dataset on Colab! 👌