# Creating your own dataset from Google Images

*by: Francisco Ingham and Jeremy Howard. Inspired by [Adrian Rosebrock](https://www.pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/)*

This is part of Lesson 2 of the fast.ai V3 classes. This is **not** the original notebook; it is my version, with my notes and changes as I work through the classes.

This notebook shows how to create an image dataset through Google Images.

The next cell was not part of Jeremy's original notebook. It has to be added, especially the `matplotlib` magic, so that the images display in the notebook.

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
from fastai.vision import *

## Get a list of URLs

### Search and scroll

Go to [Google Images](http://images.google.com) and search for the images you are interested in. The more specific you are in your Google Search, the better the results and the less manual pruning you will have to do.

Scroll down until you've seen all the images you want to download, or until you see a button that says 'Show more results'. All the images you scrolled past are now available to download. To get more, click on the button, and continue scrolling. The maximum number of images Google Images shows is 700.

It is a good idea to put things you want to exclude into the search query. For instance, if you are searching for the Eurasian wolf, "canis lupus lupus", it might be a good idea to exclude other variants:

    "canis lupus lupus" -dog -arctos -familiaris -baileyi -occidentalis

You can also limit your results to show only photos by clicking on Tools and selecting Photos from the Type dropdown. **Note:** This improved the quality of the data I got, as it removed GIFs and other non-picture items.

### Download into file

**NOTE**

These instructions have only been tested in the Chrome browser.

Now you must run some JavaScript code in your browser, which will save the URLs of all the images you want for your dataset.

Press <kbd>Ctrl</kbd><kbd>Shift</kbd><kbd>J</kbd> on Windows/Linux or <kbd>Cmd</kbd><kbd>Opt</kbd><kbd>J</kbd> on Mac, and a small window, the JavaScript 'Console', will appear. That is where you will paste the JavaScript commands.

You will need to get the URLs of each of the images. You can do this by running the following commands:

```javascript
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
```
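
The file produced this way is just one URL per line, so it can be sanity-checked before downloading. A minimal sketch, using only the standard library; `load_urls` is a hypothetical helper of mine and not part of fastai:

```python
from pathlib import Path
from urllib.parse import urlparse

def load_urls(csv_path):
    """Read one URL per line, dropping blanks, non-HTTP entries, and duplicates."""
    seen, urls = set(), []
    for line in Path(csv_path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        # Skip empty lines and anything that is not an http(s) URL.
        if not url or urlparse(url).scheme not in ("http", "https"):
            continue
        # Keep only the first occurrence, preserving the original order.
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Calling `len(load_urls(some_file))` gives a quick idea of how many usable URLs were captured.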

## Get the Images

**NOTE**

I did not follow the 'upload' instructions from Jeremy in the original notebook. Jeremy needed to do this because he downloaded the images directly to his machine and needed to upload the files to the server (an AWS instance?) where his notebook was hosted.

For me, when I executed the JavaScript above, it generated a `Download.csv` file in my Downloads directory. As I downloaded each set of photos (in this case, bears, grizzly, etc.), I renamed the `.csv` file to the appropriate name (for example `urls_black.csv`) and placed it in the `data/bears` directory in the notebooks area.
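
The rename-and-move step can also be scripted. This is only a sketch of what I did by hand; `file_url_list` is a hypothetical helper, and the location of the downloaded file is an assumption that depends on your browser settings:

```python
import shutil
from pathlib import Path

def file_url_list(download, data_dir, label):
    """Move a downloaded URL file to data_dir/urls_<label>.csv and return the new path."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / f"urls_{label}.csv"
    # shutil.move also works across filesystems, unlike Path.rename.
    shutil.move(str(download), str(target))
    return target
```

For the black bears this would be something like `file_url_list(Path.home()/'Downloads'/'Download.csv', 'data/bears', 'black')`, run once after each search.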

I have modified Jeremy's original notebook so that you can execute everything sequentially without having to loop over the categories. A little bit of repetition, but it made it more repeatable for me.
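
For comparison, the loop that the sequential cells replace could look roughly like the sketch below. `download_all` is a hypothetical wrapper of mine; the downloader is passed in as an argument so the function does not depend on fastai, and the `urls_<class>.csv` naming is the convention used in this notebook:

```python
from pathlib import Path

def download_all(path, classes, download_fn, max_pics=200):
    """For each class, create path/<class> and call download_fn on path/urls_<class>.csv."""
    path = Path(path)
    for c in classes:
        dest = path / c
        dest.mkdir(parents=True, exist_ok=True)
        # download_fn is expected to accept (url_file, destination, max_pics=...).
        download_fn(path / f"urls_{c}.csv", dest, max_pics=max_pics)
```

In this notebook the call would be `download_all('data/bears', ['teddys','grizzly','black'], download_images)`, with `download_images` coming from `fastai.vision`.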

**NOTE** There may be some errors reported. These are not errors in the code, but rather errors retrieving some of the images from their sources.

In [None]:
classes = ['teddys','grizzly','black']

In [None]:
# Take care of the black bears
folder = 'black'
file = 'urls_black.csv'

In [None]:
path = Path('data/bears')
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)

In [None]:
#Download the black bears
download_images(path/file, dest, max_pics=200)

In [None]:
#Take care of the Teddy Bears
folder = 'teddys'
file = 'urls_teddys.csv'

In [None]:
path = Path('data/bears')
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)

In [None]:
#Download the Teddy bears
download_images(path/file, dest, max_pics=200)

In [None]:
#Take care of the Grizzly Bears
folder = 'grizzly'
file = 'urls_grizzly.csv'

In [None]:
path = Path('data/bears')
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)

In [None]:
#Download the Grizzly bears
download_images(path/file, dest, max_pics=200)

In [None]:
path.ls()

In [None]:
# If you have problems downloading, try with `max_workers=0` to see exceptions:
download_images(path/file, dest, max_pics=20, max_workers=0)

Then we can remove any images that can't be opened:

In [None]:
for c in classes:
    print(c)
    verify_images(path/c, delete=True, max_size=500)
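
To get a feel for what kind of files end up being removed, a much cruder check than actually opening each image is to look at the first few bytes of each file. This sketch is not what `verify_images` does; it only flags files that do not start with a JPEG, PNG, or GIF signature (for example an HTML error page saved with a `.jpg` name). The helper names are hypothetical:

```python
from pathlib import Path

# Leading bytes ("magic numbers") of a few common image formats.
SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"GIF87a", b"GIF89a")

def looks_like_image(file_path):
    """Return True if the file starts with a known JPEG, PNG, or GIF signature."""
    with open(file_path, "rb") as f:
        header = f.read(8)
    return header.startswith(SIGNATURES)

def find_suspect_files(folder):
    """List files in folder whose header matches none of the known signatures."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and not looks_like_image(p))
```

Unlike the cell above, this only reports suspects and deletes nothing.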

## View data

In [None]:
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
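
The seed matters because `valid_pct=0.2` picks the validation images at random; with a fixed seed the same images land in the validation set on every run. A minimal sketch of the idea, using the standard library's `random` rather than NumPy and not fastai's actual code (`split_indices` is a hypothetical name):

```python
import random

def split_indices(n_items, valid_pct=0.2, seed=42):
    """Shuffle indices 0..n_items-1 with a fixed seed and split off a validation share."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(n_items * valid_pct)
    # The first `cut` shuffled indices become validation, the rest training.
    return idxs[cut:], idxs[:cut]

train_idx, valid_idx = split_indices(100)
```

Calling `split_indices(100)` again returns exactly the same two lists, which is what makes results comparable between runs.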

In [None]:
# If you already cleaned your data, run this cell instead of the one before
# np.random.seed(42)
# data = ImageDataBunch.from_csv(path, folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
#         ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

Good! Let's take a look at some of our pictures then.

In [None]:
data.classes

In [None]:
data.show_batch(rows=3, figsize=(7,8))

In [None]:
data.classes, data.c, len(data.train_ds), len(data.valid_ds)
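
The numbers above can be cross-checked against what is actually on disk. A small sketch, assuming one sub-folder per class as in this notebook; `count_per_class` is a hypothetical helper:

```python
from pathlib import Path

def count_per_class(path):
    """Map each sub-folder name under path to the number of files it contains."""
    return {d.name: sum(1 for f in d.iterdir() if f.is_file())
            for d in sorted(Path(path).iterdir()) if d.is_dir()}
```

`count_per_class('data/bears')` should list `black`, `grizzly` and `teddys`. Any other sub-folder, such as a `models` folder created after saving a model, will show up in the result too.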

## Train Model

In this notebook I do not train the models for as long as was done in the class. I was not interested in the actual problem; I just wanted to make sure I understood the methodology, so I simplified the steps.

In [None]:
learn = cnn_learner(data, models.resnet34, metrics=error_rate)
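
The `error_rate` metric is simply the fraction of validation images that were classified incorrectly, i.e. one minus accuracy. A plain-Python sketch of that definition (`simple_error_rate` is a hypothetical name, and it works on label lists rather than tensors):

```python
def simple_error_rate(predicted, actual):
    """Fraction of predictions that differ from the true labels (1 - accuracy)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

rate = simple_error_rate(['black', 'grizzly', 'teddys', 'black'],
                         ['black', 'black', 'teddys', 'black'])
```

Here one of four predictions is wrong, so `rate` is `0.25`.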

In [None]:
learn.fit_one_cycle(1)

In [None]:
learn.save('stage-1')

In [None]:
learn.unfreeze()

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(2, max_lr=slice(3e-5,3e-4))
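
As I understand it, passing `slice(3e-5,3e-4)` means the earliest layers train with the smallest rate and the last layers with the largest, with the groups in between spaced evenly on a log scale. The sketch below shows that spacing; the geometric spacing is my reading of fastai v1 and the choice of three layer groups is an assumption, so treat it as an illustration rather than the library's code:

```python
def spread_learning_rates(start, stop, n_groups):
    """Return n_groups learning rates spaced geometrically from start to stop."""
    if n_groups == 1:
        return [stop]
    # Constant ratio between consecutive groups.
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

lrs = spread_learning_rates(3e-5, 3e-4, 3)
```

With three groups, the middle value is the geometric mean of the two ends, about `9.5e-5`.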

In [None]:
learn.save('stage-2')

## Interpretation

In [None]:
learn.load('stage-2');

In [None]:
interp = ClassificationInterpretation.from_learner(learn)

In [None]:
interp.plot_confusion_matrix()
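
The plot above is a table of counts with one row per actual class and one column per predicted class. A minimal sketch of how such a table is built from two label lists (a hypothetical helper, independent of fastai):

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Return nested lists of counts: rows are actual classes, columns are predicted."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

labels = ['black', 'grizzly', 'teddys']
cm = confusion_matrix(['black', 'black', 'grizzly', 'teddys', 'grizzly'],
                      ['black', 'grizzly', 'grizzly', 'teddys', 'black'],
                      labels)
```

Correct predictions fall on the diagonal; everything off the diagonal is a confusion between two classes, here only between `black` and `grizzly`.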

## Cleaning up

The rest of the notebook will follow Jeremy's notebook. Very few comments unless I change things.

In [None]:
from fastai.widgets import *

In order to clean the entire set of images, we need to create a new dataset without the split. The video lecture demonstrated the use of the `ds_type` param, which no longer has any effect. See the thread for more details.

In [None]:
db = (ImageList.from_folder(path)
                   .no_split()
                   .label_from_folder()
                   .transform(get_transforms(), size=224)
                   .databunch()
     )

In [None]:
#New learner with db above
learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)


In [None]:
learn_cln.load('stage-1');

In [None]:
ds, idxs = DatasetFormatter().from_toplosses(learn_cln)
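
The idea behind "top losses" is to look first at the images the model got most wrong, since those are the most likely to be mislabeled or junk. A plain-Python sketch of that ordering, assuming per-image class probabilities and cross-entropy loss (`rank_by_loss` is a hypothetical helper, not the fastai implementation):

```python
import math

def rank_by_loss(probabilities, targets):
    """Return sample indices sorted from highest to lowest cross-entropy loss."""
    # Loss for each sample is -log of the probability given to its true class.
    losses = [-math.log(probs[t]) for probs, t in zip(probabilities, targets)]
    return sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)

order = rank_by_loss([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], [0, 0, 1])
```

In this toy example, sample 1 comes first because the model gave its true class only a 0.2 probability.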

In [None]:
# Take care of the black bears
folder = 'black'
path = Path('data/bears')
path.ls()

In [None]:
# This did not appear to work: all I got was a note that said "A Jupyter Widget".
# Not sure how much time I will put into this for this example, but I would really like to
# figure out what the problem is and how it works, as I expect to need to clean
# data in other datasets in the future.

# The problem is that the GUI to select the pictures did not show up. cleaned.csv was
# created but no GUI.

# TBD

ImageCleaner(ds, idxs, path)