# Building the hotdog / no hotdog classifier

Started on 13th June 2018. Let's see how far I get.

## Project flow:

1. Get training data manually: ImageNet + Google Images search for 'hotdog'

2. Crop the Google Images and ImageNet pictures "manually" so that they all have the same input dimensions

3. Train a first example

    * Write a writeup for presentations, blog posts, etc.
    
4. Refine the example

    * Write a writeup for presentations, blog posts, etc.

5. Automate the training data collection

    * Write a writeup for presentations, blog posts, etc.

6. Implement data augmentation

7. Play around with advanced concepts: transfer learning, resizing like in fastai, test-time augmentation, stochastic gradient descent with restarts...

    * Write a writeup for presentations, blog posts, etc.

8. Port the thing to [Android]?

[Android]: https://medium.com/joytunes/deploying-a-tensorflow-model-to-android-69d04d1b0cba

# Getting some hotdogs

We'll go the easy way: get some results from Google Image Search.

https://stackoverflow.com/questions/36438261/extracting-images-from-google-images-using-src-and-beautifulsoup

https://gist.github.com/genekogan/ebd77196e4bf0705db51f86431099e57

In [None]:
from bs4 import BeautifulSoup
import requests
import urllib3
import os
import json

# Identify ourselves to Google
# if we don't pass a header, the response will contain everything but the images
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}

In [None]:
# requests URL-encodes the params itself, so use a plain space rather than '+'
search_term = 'hot dog'
search_type = 'isch'

response = requests.get(url='http://www.google.es/search', 
                        params={'q' : search_term, 'tbm' : search_type},
                        headers=header)
response

In [None]:
response.url

In [None]:
soup = BeautifulSoup(response.content, 'html5lib')

If we go straight for the links, we get `#`s. I think this is because the href gets populated dynamically.

In [None]:
dead_end = soup.find_all('a', {'class' : 'rg_l'})
[a['href'] for a in dead_end[:10]]

In [None]:
# How did the guys in 
# https://gist.github.com/genekogan/ebd77196e4bf0705db51f86431099e57
# find this?

divs = soup.find_all("div",{"class":"rg_meta"})
props = [json.loads(div.text) for div in divs]
image_sources = [(prop['ou'], prop['ity']) for prop in props]

In [None]:
len(image_sources)

In [None]:
image_sources[:5]

In [None]:
http = urllib3.PoolManager()
target_folder = 'data/hotdogsfromgoogle'
os.makedirs(target_folder, exist_ok=True)
urllib3.disable_warnings()

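for image_url, image_type in image_sources:
    try:
        response = http.request('GET', image_url)
    except urllib3.exceptions.HTTPError as e:
        # Connection problems should not abort the whole loop
        print(f'Something went wrong with image {image_url}: {e}')
        continue

    if response.status == 200:
        filename = image_url.split('/')[-1].split('?')[0]
        # The context manager closes the file even if the write fails
        with open(os.path.join(target_folder, filename), 'wb') as handle:
            handle.write(response.data)
    else:
        print(f'Something went wrong with image {image_url}')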
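
The download loop derives each filename by splitting the URL on `/` and `?`. A slightly more defensive sketch follows. `filename_from_url` and its `fallback` parameter are hypothetical names introduced here, not part of the notebook. It relies on `urllib.parse` from the standard library and falls back to a default name when the URL path has no last component.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, fallback='image'):
    """Return the last path component of a URL, ignoring query and fragment.

    Hypothetical helper: `fallback` is used when the path ends in '/'.
    """
    path = urlsplit(url).path                  # drops '?query' and '#fragment'
    name = posixpath.basename(unquote(path))   # decode %20 etc., keep last part
    return name or fallback


print(filename_from_url('http://example.com/img/hot%20dog.jpg?width=450'))
print(filename_from_url('http://example.com/gallery/'))
```

Two different URLs can still map to the same filename, so later downloads may overwrite earlier ones.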
            

In [None]:
# more than 100: we will need to find the link to the next page, I guess

# For that, we need selenium
# Part 2 of the blog entry?


## ImageNet

We need a set of images that are NOT hotdogs in order to be able to train our algorithm. For that we can use [ImageNet]. 

ImageNet is a set of tagged images that has been used for a yearly competition whose 2012 edition is credited with starting the current Deep Learning boom.

![The contest that started it](https://blogs.nvidia.com/wp-content/uploads/2016/06/DefenseAIPicture3-002.png)

The thing is, ImageNet contains more than a million images totalling around 100GB. Not only would that take a long time to download, it would also be prohibitively costly to train on. We are going to download only a part of it for now.

If I give up I can always use [this](https://github.com/xkumiyu/imagenet-downloader/)

[ImageNet]: http://image-net.org

For that, we need to understand how ImageNet is structured. It is based upon [WordNet], a lexical database of English. WordNet contains nouns, verbs, adjectives and adverbs grouped into _synsets_ (sets of synonyms), but ImageNet contains images corresponding only to the nouns.

Each synset is identified by its wnid (WordNet id). There is actually a hotdog synset: n07697537.

ImageNet doesn't own the images, so they only provide them after you register and promise to use them only for non-commercial research and/or education.

However, they do provide the image urls freely, so we are going to use our downloader to get them. We will need to get the urls first.

[WordNet]: https://wordnet.princeton.edu/
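
As a sketch of what the wnid-to-term mapping looks like: this assumes each line of `words.txt` holds a wnid followed by its terms, separated by a tab. The sample lines are illustrative rather than copied from the file, and `parse_words` is a hypothetical helper.

```python
def parse_words(lines):
    """Map each wnid to its full, comma-separated list of terms."""
    mapping = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue                            # skip empty lines
        wnid, _, terms = line.partition('\t')   # split on the first tab only
        mapping[wnid] = terms
    return mapping


sample = [
    'n07697537\thotdog, hot dog, red hot\n',   # illustrative line
    'n00000001\texample term\n',               # made-up wnid for the demo
]
wnids_demo = parse_words(sample)
print(wnids_demo['n07697537'])
```

Keeping the whole term string, rather than only its first word, matters when filtering on multi-word terms such as "hot dog".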

In [2]:
%%time

import urllib3
import os

# TODO: async retrieve images
http = urllib3.PoolManager()

synset_ids = 'http://image-net.org/archive/words.txt'

filename = synset_ids.split('/')[-1]

if not os.path.exists(filename):
    response = http.request('GET', synset_ids)
    with open(filename, 'wb') as f:
        f.write(response.data)

with open(filename) as f:
    # Note: this keeps only the first whitespace-separated word of each term
    wnids = {line.split()[0]: line.split()[1] for line in f}

CPU times: user 68.5 ms, sys: 20.2 ms, total: 88.7 ms
Wall time: 86.8 ms


Synsets overlap: there are general terms and more specific terms contained within them (hyponyms). The easiest way to deal with this is to get the URLs for the images we want and deduplicate them.

We will also have to deal with the hotdogs in the set. That should be easy, though.
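
A toy illustration of the deduplication idea, using made-up synsets and URLs (none of these values come from ImageNet). A general synset and its hyponym share an image, a set union keeps each URL once, and a set difference removes the hotdogs.

```python
# Made-up data: 'food' is a general synset, 'hotdog' one of its hyponyms.
urls_by_synset = {
    'food': ['http://example.com/a.jpg', 'http://example.com/hd1.jpg'],
    'hotdog': ['http://example.com/hd1.jpg', 'http://example.com/hd2.jpg'],
    'dog': ['http://example.com/b.jpg'],
}

# Union over all synsets: each URL is kept once, however many synsets list it.
all_demo_urls = set().union(*urls_by_synset.values())

# Everything that is not a hotdog image.
hotdog_demo_urls = set(urls_by_synset['hotdog'])
other_demo_urls = all_demo_urls - hotdog_demo_urls

print(len(all_demo_urls), len(other_demo_urls))
```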

In [3]:
def urls_from_wnid(wnid):
    response = http.request('GET', 'http://www.image-net.org/api/text/imagenet.synset.geturls', fields={'wnid' : wnid})
    urls = response.data.decode('utf-8').splitlines()
    
    return urls

urls_from_wnid('n07697537')     
        

['http://www.loafnjug.com/images/hot-dog-and-tea.jpg',
 'http://farm1.static.flickr.com/91/220588966_8350522b9a.jpg',
 'http://farm3.static.flickr.com/2200/2252143352_1f628be218.jpg',
 'http://farm2.static.flickr.com/1411/722638089_cd4a75d59a.jpg',
 'http://farm4.static.flickr.com/3645/3396903223_f8601dcdd7.jpg',
 'http://farm1.static.flickr.com/75/182770478_fd71abd390.jpg',
 'http://farm4.static.flickr.com/3614/3363189102_c0916e49ca.jpg',
 'http://farm4.static.flickr.com/3627/3392893009_6c1b71803a.jpg',
 'http://blogs.nashvillescene.com/bites/chicago_hotdog.jpg',
 'http://www.itulip.com/images/hotdog.jpg',
 'http://farm4.static.flickr.com/3076/2655061848_ab58208110.jpg',
 'http://farm3.static.flickr.com/2462/3646689147_25cb943ddf.jpg',
 'http://www.roaringsprings.com/images/group-hotdog.jpg',
 'http://farm4.static.flickr.com/3020/2841765609_a70ea611ae.jpg',
 'http://pic.pimg.tw/smallwhite01/4b08294a99d32.jpg',
 'http://farm4.static.flickr.com/3257/2447063600_98889696b0.jpg',
 'http://

This would work, but it is extremely slow because we are making over 80,000 requests.

Also, the damn ImageNet site is dog slow. I left it running overnight so you can see just how slow it can be:

In [4]:
%%time

# all_urls = {url for wnid in wnids.keys() for url in urls_from_wnid(wnid)}

CPU times: user 7 µs, sys: 1 µs, total: 8 µs
Wall time: 15.3 µs


Good nested list comprehension explanation: https://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python
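
For reference, a nested comprehension reads left to right in the same order as the equivalent nested `for` loops. A small sketch with made-up data:

```python
groups = {'a': [1, 2], 'b': [2, 3]}

# Comprehension form: outer loop first, inner loop second.
flat = {item for key in groups for item in groups[key]}

# Equivalent explicit loops.
flat_loops = set()
for key in groups:
    for item in groups[key]:
        flat_loops.add(item)

print(flat == flat_loops, sorted(flat))
```

Because the result is a set, the value `2` that appears in both groups is kept only once.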

In [5]:
%%time

hotdog_wnids = {wnid for wnid, term in wnids.items() if 'hot' in term and 'dog' in term}
hotdog_urls = {url for wnid in hotdog_wnids for url in urls_from_wnid(wnid)} 

CPU times: user 20.1 ms, sys: 959 µs, total: 21 ms
Wall time: 32.2 s


In [6]:
len(hotdog_urls)

1216

Downloading all image urls:

In [7]:
%%time

imagenet_fall11_urls = 'http://image-net.org/imagenet_data/urls/imagenet_fall11_urls.tgz'
filename = imagenet_fall11_urls.split('/')[-1]

# Don't want to redownload if already have it
if not os.path.exists(filename):
    response = http.request('GET', imagenet_fall11_urls)

    with open(filename, 'wb') as f:
        f.write(response.data)


CPU times: user 33 µs, sys: 4 µs, total: 37 µs
Wall time: 42.7 µs


In [None]:
%%time
import tarfile

tf = tarfile.open('imagenet_fall11_urls.tgz')

# Shortcut: I know it only contains one file
content = tf.extractfile(tf.getmembers()[0])

wnid_urls = []
for line in content:
    try:
        wnid, url = line.decode('utf-8')[:-1].split('\t')
    except ValueError:
        # Malformed line (e.g. a tab inside the URL): report it and skip it
        print(line)
        continue

    wnid_urls.append((wnid, url))


b'n01878061_350\thttp://www.tregemboanimalpark.com/resize.asp?path=D:Sites\tregemboanimalpark.comimagesuploaded/wallaby2437-opt.jpg&width=450\n'
b'n02138169_7620\thttp://www.tregemboanimalpark.com/resize.asp?path=D:Sites\tregemboanimalpark.comimagesuploaded/Zoo%20Pics%20for%20Web2%2003035634-opt.jpg&width=450\n'
b'n02365108_10590\thttp://www.tregemboanimalpark.com/resize.asp?path=D:Sites\tregemboanimalpark.comimagesuploaded/2005%20zoo%20pics%200487288-opt.jpg&width=450\n'
b'n02396427_21834\thttp://www.tregemboanimalpark.com/resize.asp?path=D:Sites\tregemboanimalpark.comimagesuploaded/pot%20bellied%20pigs56018-opt.jpg&width=450\n'
b'n02416104_8034\thttp://www.tregemboanimalpark.com/resize.asp?path=D:Sites\tregemboanimalpark.comimagesuploaded/animals%2000534794-opt.jpg&width=450\n'
b'n02437312_5800\thttp://www.tregemboanimalpark.com/resize.asp?path=D:Sites\tregemboanimalpark.comimagesuploaded/websiteIMG_274089594-opt.jpg&width=450\n'
b'n02487547_6331\thttp://www.tregemboanimalpark.com/re
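
The lines printed above appear to fail because the URL itself contains a tab (the `\t` after `D:Sites`), so splitting on every tab yields three fields instead of two. Below is a sketch of a more tolerant parser that splits only on the first tab. `parse_url_line` is a hypothetical helper. Treating the part of the id before the underscore as the wnid is an assumption, inferred from ids such as `n01878061_350`.

```python
def parse_url_line(raw):
    """Split one line of the URL list into (wnid, image_id, url)."""
    text = raw.decode('utf-8', errors='replace').rstrip('\n')
    image_id, url = text.split('\t', 1)   # maxsplit=1: tabs inside the URL survive
    wnid = image_id.split('_', 1)[0]      # assumption: id looks like '<wnid>_<number>'
    return wnid, image_id, url


# Made-up line with a tab inside the URL, mimicking the failing ones.
line = b'n01878061_350\thttp://example.com/resize.asp?path=D:Sites\texample/wallaby.jpg\n'
print(parse_url_line(line))
```

A line with no tab at all still raises `ValueError`, so the caller can keep reporting and skipping those.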

In [None]:
len(wnid_urls)

In [None]:
all_urls = {url for wnid, url in wnid_urls}
# We already have the hotdog urls from before -> tidy up
#hotdog_urls = {url for wnid, url in wnid_urls if wnid in hotdog_wnids}
other_urls = all_urls - hotdog_urls

In [None]:
print((len(all_urls), len(hotdog_urls), len(other_urls)))

In [None]:
base_dir = 'data'

train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')

In [None]:
import random
from sklearn.model_selection import train_test_split

# random.sample needs a sequence: sets are rejected since Python 3.11
hotdogs_sample = random.sample(list(hotdog_urls), k=200)
nohotdogs_sample = random.sample(list(other_urls), k=2000)

hd_train, hd_val = train_test_split(hotdogs_sample)
nohd_train, nohd_val = train_test_split(nohotdogs_sample)

In [None]:
%%time
import shutil
import imghdr

debug=False
urllib3.disable_warnings()

folders_urls = { os.path.join(train_dir, 'hotdog'): hd_train,
                 os.path.join(train_dir, 'nohotdog'): nohd_train,
                 os.path.join(validation_dir, 'hotdog'): hd_val,
                 os.path.join(validation_dir, 'nohotdog'): nohd_val }

In [None]:
import imghdr

for folder, urls in folders_urls.items():
    n = 0
    shutil.rmtree(folder, ignore_errors=True)
    os.makedirs(folder)
    
    print(f'Starting with {folder}')
    
    for image_url in urls:
        try:
            response = http.request('GET', image_url)
            
            if response.status == 200 and imghdr.what(None, h=response.data) in ['jpeg', 'png']:
                filename = image_url.split('/')[-1].split('?')[0]
                
                # The context manager closes the file once it is written
                with open(os.path.join(folder, filename), 'wb') as f:
                    f.write(response.data)
                n += 1
                if debug: print(f'wrote {filename}')
                
            else: 
                print(f'{response.status} response for {image_url} of type {imghdr.what(None, h=response.data)}')
                
        except Exception as e:
            print(f'Something went wrong with image {image_url}: {e}')
                    
    print(f'Wrote {n} images in folder {folder}')
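
`imghdr` has been deprecated since Python 3.11 and was removed in Python 3.13, so the type check above will stop working on newer interpreters. Below is a minimal stand-in that checks only the JPEG and PNG signatures the loop accepts. `sniff_image_type` is a hypothetical name, and unlike `imghdr` it knows no other formats.

```python
JPEG_MAGIC = b'\xff\xd8\xff'
PNG_MAGIC = b'\x89PNG\r\n\x1a\n'


def sniff_image_type(data):
    """Return 'jpeg', 'png' or None based on the leading bytes of `data`."""
    if data.startswith(JPEG_MAGIC):
        return 'jpeg'
    if data.startswith(PNG_MAGIC):
        return 'png'
    return None


print(sniff_image_type(b'\xff\xd8\xff\xe0' + b'\x00' * 16))
print(sniff_image_type(b'<html>not an image</html>'))
```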

    

To do: either get all the images and subset, or keep retrying connections until I get the desired number of images. There are _a lot_ of broken links.
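
One way to sketch the second option is to walk through a shuffled list of candidate URLs and stop as soon as enough downloads have succeeded. `collect_n` and the `download` callable are hypothetical. `download` is assumed to return `True` when an image was saved, and to return `False` or raise otherwise.

```python
import random


def collect_n(urls, download, n, seed=None):
    """Try URLs in random order until `n` downloads succeed.

    Returns the list of URLs that succeeded (possibly fewer than `n`
    if the candidates run out).
    """
    candidates = list(urls)
    random.Random(seed).shuffle(candidates)

    succeeded = []
    for url in candidates:
        if len(succeeded) >= n:
            break
        try:
            ok = download(url)
        except Exception:
            ok = False          # treat any failure as a broken link
        if ok:
            succeeded.append(url)
    return succeeded


# Demo with a fake downloader: every odd-numbered link is "broken".
fake_urls = [f'http://example.com/{i}.jpg' for i in range(20)]


def fake_download(url):
    number = int(url.split('/')[-1][:-4])
    return number % 2 == 0


got = collect_n(fake_urls, fake_download, n=5, seed=0)
print(len(got))
```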

TODO: simultaneous requests for speedup
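
A possible shape for this uses a thread pool from the standard library, since the work is dominated by waiting on the network. `fetch_all` is a hypothetical helper, and `fetch` stands for whatever function performs one request (for example a wrapper around `http.request`). The worker count is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=16):
    """Call `fetch(url)` for every URL concurrently.

    Returns (results, errors): two dicts keyed by URL.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:      # one broken link must not stop the rest
                errors[url] = exc
    return results, errors


def fake_fetch(url):
    # Stand-in for a real request so the sketch runs offline.
    if 'broken' in url:
        raise IOError('simulated dead link')
    return len(url)


demo_urls = ['http://example.com/a.jpg', 'http://example.com/broken.jpg']
results, errors = fetch_all(demo_urls, fake_fetch)
print(sorted(results), sorted(errors))
```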

Tidy up

Convert into proper blog post

http://image-net.org/synset?wnid=n07697537

http://www.image-net.org/api/text/imagenet.synset.geturls?wnid=n07697537

http://caffe.berkeleyvision.org/gathered/examples/imagenet.html

### Notes


Last layer -> sigmoid (binary classification: hotdog vs. everything else)
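
As a reminder of what the sigmoid does to the last layer's raw output, here is a plain-Python sketch, independent of any framework. The 0.5 threshold is a conventional choice, not something fixed by this notebook, and `is_hotdog` is a hypothetical name.

```python
import math


def sigmoid(z):
    """Squash a real-valued score into the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, this form avoids overflow in exp()
    e = math.exp(z)
    return e / (1.0 + e)


def is_hotdog(score, threshold=0.5):
    """Interpret the sigmoid output as P(hotdog) and apply a threshold."""
    return sigmoid(score) >= threshold


print(sigmoid(0.0), is_hotdog(2.0), is_hotdog(-2.0))
```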

How to calculate learning rates with Keras?? -> maybe another blog post

  * Somewhat related: what does Jeremy Howard mean by each of the three parts of a net? And can you set differential learning rates in Keras?

### List of blog posts to write

In no particular order

* First approximation

* Training data collection: ImageNet and Google Image Search

* Refinements to the first model

  * Data augmentation with Keras

  * Using a pretrained network, then training the last layer, then unfreezing and training all layers with a small learning rate.


* Prototyping your model locally with a small subset of images then training it in the cloud (?)

* Porting a learned model to mobile (?), and how to minimize disk and memory usage by quantizing the model.

### Further reading

* [Learning rate schedules and adaptive learning rate methods for deep learning](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1)

* [Keras issue #898](https://github.com/keras-team/keras/issues/898)

* [Deploying a TensorFlow model to Android](https://medium.com/joytunes/deploying-a-tensorflow-model-to-android-69d04d1b0cba)

### Doubts

Why is normalizing the input values beneficial or required for deep learning?

Where does it make more sense to put a dropout layer? After a conv layer, before it...?