# Building the hotdog / no hotdog classifier

Started on 13th June 2018. Let's see how far I get.

## Project flow:

1. Get training data manually: ImageNet + a Google Images search for 'hotdog'

2. Crop the Google Images and ImageNet pictures "manually" so that they all have the same input dimensions

3. Train a first example

    * Write a writeup for presentations, blog posts, etc.
    
4. Refine the example, as per _Deep Learning with Python_, page 130: data augmentation, pretrained network, fine-tuning (maybe two different blog posts?).

    * Write a writeup for presentations, blog posts, etc.

5. Automate the training data collection

    * Write a writeup for presentations, blog posts, etc.

6. Implement data augmentation

7. Play around with advanced concepts: transfer learning, resizing like in fastai, test-time augmentation, stochastic gradient descent with restarts...

    * Write a writeup for presentations, blog posts, etc.

8. Port the thing to [Android]?

[Android]: https://medium.com/joytunes/deploying-a-tensorflow-model-to-android-69d04d1b0cba
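For step 2, the arithmetic behind a center crop can be sketched without any imaging library. This is only a sketch under my own assumptions: the function name `center_crop_box` and the default square ratio are made up for illustration. The returned tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so the box could be passed to it and the result resized to the network's input size.

```python
def center_crop_box(width, height, target_ratio=1.0):
    """Return (left, upper, right, lower) of the largest centered box
    with the given width/height ratio that fits inside the image.

    Hypothetical helper, not part of any library.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if width / height > target_ratio:
        # Image is too wide: keep the full height and trim the sides
        new_w, new_h = round(height * target_ratio), height
    else:
        # Image is too tall (or already fits): keep the full width
        new_w, new_h = width, round(width / target_ratio)
    left = (width - new_w) // 2
    upper = (height - new_h) // 2
    return left, upper, left + new_w, upper + new_h


print(center_crop_box(400, 300))  # landscape image, square crop
print(center_crop_box(300, 500))  # portrait image, square crop
```

Cropping before resizing keeps the hotdogs from being squashed, at the cost of losing whatever sits near the edges.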

### List of blog posts to write

In no particular order

* Training data collection: ImageNet

* Splitting into train, validation and test set

* First approximation, overfitting, data augmentation

* Dropout to combat overfitting

* Transfer learning: Using a pretrained network, then training the last layer, then unfreezing and training all layers with a small learning rate.

* Confusion matrix? where?

* Cross validation?

* Prototyping your model locally with a small subset of images then training it in the cloud (?). Paperspace?

* Porting a trained model to mobile (?), and how to minimize disk and memory usage by quantizing the model.
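For the post on splitting into train, validation and test sets, a minimal sketch could look like the one below. The function name `split_dataset`, the 70/20/10 proportions and the fixed seed are all assumptions of mine, not something decided in these notes.

```python
import random


def split_dataset(items, val_fraction=0.2, test_fraction=0.1, seed=42):
    """Shuffle items reproducibly and cut them into train / validation / test.

    Hypothetical helper; the fractions are illustrative defaults.
    """
    items = list(items)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    n_val = int(len(items) * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


urls = [f"img_{i}.jpg" for i in range(100)]
train, val, test = split_dataset(urls)
print(len(train), len(val), len(test))
```

Fixing the seed matters here: if the split changed between runs, images seen during training could leak into the validation set.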

# Scraps

### Downloading the images from Python

In [41]:
import os

base_dir = 'data'

train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')

# Map each target folder to the list of image URLs to download into it.
# The hd_* / nohd_* URL lists are assumed to come from earlier cells.
folders_urls = { os.path.join(train_dir, 'hotdog'): hd_train,
                 os.path.join(train_dir, 'nohotdog'): nohd_train,
                 os.path.join(validation_dir, 'hotdog'): hd_val,
                 os.path.join(validation_dir, 'nohotdog'): nohd_val }

We should really use the `logging` module here, but let's keep things simple and write to a plain text file.

TODO: download in background.
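One way to read this TODO is to run the downloads concurrently in a thread pool, which is a standard-library technique and not something these notes have tried yet. In the sketch below, `download_all` and its `fetch` argument are hypothetical names: `fetch` stands for any callable that downloads one URL, for example a small wrapper around `requests.get(url, timeout=10)`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL using a pool of worker threads.

    Returns (ok, failed): ok maps url -> fetch result,
    failed maps url -> the exception that fetch raised.
    """
    ok, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which URL
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                ok[url] = future.result()
            except Exception as exc:
                # One broken URL must not stop the other downloads
                failed[url] = exc
    return ok, failed


def fake_fetch(url):
    # Stand-in for a real download, so the sketch runs offline
    if "bad" in url:
        raise IOError(f"cannot fetch {url}")
    return url.upper()


ok, failed = download_all(["a", "bad", "c"], fake_fetch, max_workers=2)
print(sorted(ok), sorted(failed))
```

Downloads spend most of their time waiting on the network, so threads are enough; there is no need for multiple processes.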

In [None]:
import imghdr  # note: deprecated since Python 3.11 and removed in 3.13
import os
import shutil
from datetime import datetime

import requests
from tqdm import tqdm

debug = False

with open('log_img_download', 'w') as log:
    for folder, urls in folders_urls.items():
        ok = 0
        fail = 0
        # Start from an empty folder on every run
        shutil.rmtree(folder, ignore_errors=True)
        os.makedirs(folder)

        log.write(f'{datetime.now()} Starting with {folder}\n')

        # tqdm is a very nice little module for progress reporting
        for image_url in tqdm(urls):
            try:
                # It is VERY important to set a timeout here! Otherwise the program
                # stalls on servers that never answer (10 seconds is an arbitrary choice)
                response = requests.get(image_url, timeout=10)
                image_type = imghdr.what(None, h=response.content)

                if response.status_code == 200 and image_type in ['jpeg', 'png']:
                    filename = image_url.split('/')[-1].split('?')[0]

                    # The with block closes the file even if the write fails
                    with open(os.path.join(folder, filename), 'wb') as f:
                        f.write(response.content)
                    if debug: log.write(f'{datetime.now()} wrote {filename}\n')
                    ok += 1
                else:
                    log.write(f'{datetime.now()} {response.status_code} response for {image_url} of type {image_type}\n')
                    fail += 1

            except Exception as e:
                log.write(f'{datetime.now()} Something went wrong with image {image_url}: {e}\n')
                fail += 1

        log.write(f'{datetime.now()} Wrote {ok} images in folder {folder}, with {fail} failures\n')

 75%|███████▌  | 687/912 [30:38<10:01,  2.68s/it]
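One weak spot in the loop above is the filename: taking the last path segment means that two URLs ending in `hotdog.jpg` overwrite each other, and a URL ending in `/` gives an empty name. A possible fix, sketched with a hypothetical helper `filename_for`, is to append a short hash of the full URL. The fallback stem and extension are my own assumptions.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def filename_for(image_url, default_ext=".jpg"):
    """Build a file name that is unique per URL (hypothetical helper)."""
    # urlparse drops the query string from .path, so '?size=large' is ignored
    base = posixpath.basename(urlparse(image_url).path)
    stem, ext = posixpath.splitext(base)
    # A short digest of the whole URL keeps equal base names apart
    digest = hashlib.sha1(image_url.encode("utf-8")).hexdigest()[:8]
    return f"{stem or 'image'}_{digest}{ext or default_ext}"


print(filename_for("http://example.com/a/hotdog.jpg?size=large"))
print(filename_for("http://example.org/b/hotdog.jpg"))
```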

### Getting some hotdogs

We'll go the easy way: get some results from Google Image Search.

https://stackoverflow.com/questions/36438261/extracting-images-from-google-images-using-src-and-beautifulsoup

https://gist.github.com/genekogan/ebd77196e4bf0705db51f86431099e57

In [None]:
from bs4 import BeautifulSoup
import requests
import urllib3
import os
import json

# Identify ourselves to Google
# if we don't pass a header, the response will contain everything but the images
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}

In [None]:
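# requests URL-encodes the parameters itself, so use a plain space
# (a literal '+' would be sent as %2B)
search_term = 'hot dog'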
search_type = 'isch'

response = requests.get(url='http://www.google.es/search', 
                        params={'q' : search_term, 'tbm' : search_type},
                        headers=header)
response

In [None]:
response.url

In [None]:
soup = BeautifulSoup(response.content, 'html5lib')

If we go straight for the links, we get #'s. I think this is because the href gets populated dynamically.

In [None]:
dead_end = soup.find_all('a', {'class' : 'rg_l'})
[a['href'] for a in dead_end[:10]]

In [None]:
# How did the guys in 
# https://gist.github.com/genekogan/ebd77196e4bf0705db51f86431099e57
# find this?

divs = soup.find_all("div",{"class":"rg_meta"})
props = [json.loads(div.text) for div in divs]
image_sources = [(prop['ou'], prop['ity']) for prop in props]

In [None]:
len(image_sources)

In [None]:
image_sources[:5]
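Since `image_sources` holds (URL, type) pairs, it could be filtered before downloading so that only the formats we want are fetched. The sketch below runs offline on made-up payloads: only the `ou` and `ity` keys come from the cell above, while the sample values, the helper name `extract_sources` and the list of wanted types are assumptions.

```python
import json

# Made-up payloads shaped like the JSON text of the rg_meta divs;
# only the 'ou' (URL) and 'ity' (type) keys are taken from the notebook
sample_meta = [
    '{"ou": "http://example.com/a.jpg", "ity": "jpg"}',
    '{"ou": "http://example.com/b.gif", "ity": "gif"}',
    '{"ity": "png"}',  # no URL: should be skipped, not crash
]


def extract_sources(meta_texts, wanted_types=("jpg", "jpeg", "png")):
    """Return (url, type) pairs for entries of a wanted image type."""
    sources = []
    for text in meta_texts:
        try:
            prop = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip entries whose text is not valid JSON
        if not isinstance(prop, dict):
            continue
        url = prop.get("ou")
        kind = (prop.get("ity") or "").lower()
        if url and kind in wanted_types:
            sources.append((url, kind))
    return sources


print(extract_sources(sample_meta))
```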

In [None]:
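http = urllib3.PoolManager()
target_folder = 'data/hotdogsfromgoogle'
os.makedirs(target_folder, exist_ok=True)
urllib3.disable_warnings()

for image_url, image_type in image_sources:
    try:
        # Set a timeout so that one unresponsive server cannot stall the loop
        response = http.request('GET', image_url, timeout=10.0)
        if response.status == 200:
            filename = image_url.split('/')[-1].split('?')[0]
            # The with block closes the file even if the write fails
            with open(os.path.join(target_folder, filename), 'wb') as handle:
                handle.write(response.data)
        else:
            print(f'Something went wrong with image {image_url}')
    except Exception as e:
        # Keep going with the remaining images if one download fails
        print(f'Something went wrong with image {image_url}: {e}')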
            

In [None]:
# To get more than 100 images we will need to find the link to the next page, I guess

# For that, we need Selenium
# Part 2 of the blog entry?


### ResNet50

WTF?? It doesn't learn anything transferable at all!!

In [14]:
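import keras
from keras.applications import ResNet50
from keras.layers import Dense, Dropout, Flatten

# Pretrained convolutional base, without the ImageNet classification head
resnet = ResNet50(weights='imagenet',
                  include_top=False,
                  input_shape=(200,200,3))

# Freeze the base so that only the new dense layers get trained
resnet.trainable = False

othermodel = keras.Sequential()
othermodel.add(resnet)
othermodel.add(Flatten())
othermodel.add(Dense(256, activation='relu'))
othermodel.add(Dropout(0.5))
othermodel.add(Dense(128, activation='relu'))
othermodel.add(Dropout(0.5))
othermodel.add(Dense(64, activation='relu'))
othermodel.add(Dropout(0.5))
# Single sigmoid unit: hotdog vs. no hotdog
othermodel.add(Dense(1, activation='sigmoid'))

othermodel.summary()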

ValueError: Variable bn_conv1/moving_mean/biased already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:

  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1011, in moving_average_update
    x, value, momentum, zero_debias=True)
  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/keras/layers/normalization.py", line 195, in call
    self.momentum),
  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/keras/engine/base_layer.py", line 460, in __call__
    output = self.call(inputs, **kwargs)


### Further reading

* Differential learning rates (1): https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1

* Differential learning rates (2): https://github.com/keras-team/keras/issues/898

* Porting to Android: https://medium.com/joytunes/deploying-a-tensorflow-model-to-android-69d04d1b0cba

### Doubts

Why is normalizing the input values beneficial, or even required, for deep learning?

Where does it make the most sense to put a dropout layer? After a conv layer, before it...?

### Notes


Last layer -> sigmoid (1 vs all classification)
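To make the note above concrete: a single sigmoid unit squashes the last layer's raw score into a value between 0 and 1, which is then thresholded to pick a class. A minimal sketch in plain Python, where the 0.5 threshold and the choice of which class counts as 1 are my own assumptions:

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(z) cannot overflow
    e = math.exp(z)
    return e / (1.0 + e)


def predict_label(z, threshold=0.5):
    # Assumption: outputs close to 1 mean 'hotdog'; the real mapping
    # depends on how the labels were encoded during training
    return "hotdog" if sigmoid(z) >= threshold else "nohotdog"


for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3), predict_label(score))
```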

How to calculate learning rates with Keras?? -> maybe another blog post

  * Somewhat related: what does Jeremy Howard refer to as each of the three parts of a net? And can you set differential learning rates in Keras?
  
[Image(url=url, width=400) for url in urls_from_wnid('n07697537')[:5]]

https://stackoverflow.com/questions/24398044/downloading-a-lot-of-files-using-python