## ImageNet

We need a set of images that are NOT hotdogs to train our algorithm. For that we can use [ImageNet].

ImageNet is a set of tagged images that has been used for a yearly competition whose 2012 edition is credited with starting the current Deep Learning boom.

![The contest that started it](https://blogs.nvidia.com/wp-content/uploads/2016/06/DefenseAIPicture3-002.png)

The thing is, ImageNet contains more than a million images totalling around 100GB. Downloading all of that would take a long time, and training on it would be prohibitively costly. We are going to download only a part of it for now.

If I give up, I can always fall back on [imagenet-downloader](https://github.com/xkumiyu/imagenet-downloader/).

[ImageNet]: http://image-net.org

For that, we need to understand how ImageNet is structured. It is based upon [WordNet], a lexical database of English. WordNet contains nouns, verbs, adjectives and adverbs grouped into _synsets_ (sets of synonyms), but ImageNet contains images corresponding only to the nouns.

Each synset is identified by its wnid (WordNet id). There is actually one for hotdogs: n07697537.
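To make the wnid format concrete, here is a minimal sketch that splits one into its parts. It assumes, as far as I understand the convention, that a wnid is the WordNet part-of-speech letter (`n` for nouns, the only kind ImageNet uses) followed by an 8-digit synset offset. `parse_wnid` is a hypothetical helper, not part of any ImageNet tooling.

```python
from typing import Tuple


def parse_wnid(wnid: str) -> Tuple[str, int]:
    """Split a wnid such as 'n07697537' into (part of speech, offset).

    Assumption: one part-of-speech letter followed by 8 digits.
    """
    pos, offset = wnid[:1], wnid[1:]
    if pos != 'n' or len(offset) != 8 or not offset.isdigit():
        raise ValueError(f'not a noun wnid: {wnid!r}')
    return pos, int(offset)


print(parse_wnid('n07697537'))
```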

ImageNet doesn't own the images, so it only provides them after registration and a request promising to use them for non-commercial research and/or educational purposes.

However, it does provide the image URLs freely, so we are going to use our own downloader to get the images. We will need to get the URLs first.

[WordNet]: https://wordnet.princeton.edu/

In [None]:
%%time

import requests
import os

# TODO: async retrieve images

synset_ids = 'http://image-net.org/archive/words.txt'

filename = synset_ids.split('/')[-1]

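# Download the wnid -> term list only if we don't have it yet
if not os.path.exists(filename):
    response = requests.get(synset_ids)
    response.raise_for_status()

    with open(filename, 'wb') as f:
        f.write(response.content)

# Each line is '<wnid> <term>'. Split only once so that multi-word
# terms such as 'hot dog' are kept whole.
wnids = {}
with open(filename) as f:
    for line in f:
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:
            wnids[parts[0]] = parts[1]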

The wnids overlap: there are general terms, and more specific terms contained within them (hyponyms). The easiest way to deal with this is to get the URLs for all the images we want and deduplicate them.

We will also have to deal with the hotdogs in the set. That should be easy, though.
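Both steps come down to set operations. Here is a minimal sketch on toy data; the wnids and URLs below are made up for illustration and do not come from ImageNet.

```python
# Toy data: made-up wnids and URLs, for illustration only
urls_by_wnid = {
    'n00000001': ['http://example.com/a.jpg', 'http://example.com/b.jpg'],  # general term
    'n00000002': ['http://example.com/b.jpg', 'http://example.com/c.jpg'],  # hyponym, overlaps
}
toy_hotdog_urls = {'http://example.com/c.jpg'}

# A set comprehension flattens the lists and drops duplicates in one go
all_unique = {url for urls in urls_by_wnid.values() for url in urls}

# Set difference removes the hotdogs from the "everything else" pool
toy_other_urls = all_unique - toy_hotdog_urls

print(len(all_unique), len(toy_other_urls))
```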

In [None]:
def urls_from_wnid(wnid):
    response = requests.get('http://www.image-net.org/api/text/imagenet.synset.geturls', params={'wnid' : wnid})
    urls = response.content.decode('utf-8').splitlines()
    
    return urls

urls_from_wnid('n07697537')     
        

This would work, but it is extremely slow because we would be making over 80000 requests.

Also, the damn ImageNet site is dog slow. I left it running overnight so you can see just how slow it can be:

In [None]:
%%time

# all_urls = {url for wnid in wnids.keys() for url in urls_from_wnid(wnid)}

Good nested list comprehension explanation: https://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python
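For reference, here is a small sketch of how a nested comprehension reads: the `for` clauses appear in the same order as the equivalent nested loops. The data is a toy example.

```python
nested = [['a', 'b'], ['c'], [], ['d', 'e']]

# Outer loop first, inner loop second, just like the loops below
flat = [item for sublist in nested for item in sublist]

# The same thing written as explicit loops
as_loops = []
for sublist in nested:
    for item in sublist:
        as_loops.append(item)

print(flat)
```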

In [None]:
%%time

hotdog_wnids = {wnid for wnid, term in wnids.items() if 'hot' in term and 'dog' in term}
hotdog_urls = {url for wnid in hotdog_wnids for url in urls_from_wnid(wnid)} 

In [None]:
len(hotdog_urls)

Downloading all image urls:

In [None]:
%%time

imagenet_fall11_urls = 'http://image-net.org/imagenet_data/urls/imagenet_fall11_urls.tgz'
filename = imagenet_fall11_urls.split('/')[-1]

# Don't redownload the archive if we already have it
if not os.path.exists(filename):
    response = requests.get(imagenet_fall11_urls)
    response.raise_for_status()

    with open(filename, 'wb') as f:
        f.write(response.content)


In [None]:
%%time
import tarfile

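tf = tarfile.open('imagenet_fall11_urls.tgz')

# Shortcut: I know it only contains one file
content = tf.extractfile(tf.getmembers()[0])

wnid_urls = []
for line in content:
    try:
        wnid, url = line.decode('utf-8')[:-1].split('\t')
    except (UnicodeDecodeError, ValueError):
        # Show the malformed line and skip it, instead of appending
        # the previous (wnid, url) pair a second time
        print(line)
        continue

    wnid_urls.append((wnid, url))

tf.close()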


In [None]:
len(wnid_urls)

In [None]:
all_urls = {url for wnid, url in wnid_urls}
# We already have the hotdog URLs from the API calls above; tidy this up later
# hotdog_urls = {url for wnid, url in wnid_urls if wnid in hotdog_wnids}
other_urls = all_urls - hotdog_urls

print((len(all_urls), len(hotdog_urls), len(other_urls)))

In [None]:
base_dir = 'data'

train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')

In [None]:
import random
from sklearn.model_selection import train_test_split

random.seed(42)

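# random.sample() needs a sequence (sets are rejected from Python 3.11 on);
# sorting first also makes the sample reproducible for a given seed
hotdogs_sample = random.sample(sorted(hotdog_urls), k=min(1216, len(hotdog_urls))) # Use all for now
nohotdogs_sample = random.sample(sorted(other_urls), k=5000)

# train_test_split ignores random.seed(), so pass a seed explicitly
hd_train, hd_val = train_test_split(hotdogs_sample, random_state=42)
nohd_train, nohd_val = train_test_split(nohotdogs_sample, random_state=42)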

In [None]:
%%time

debug=False

folders_urls = { os.path.join(train_dir, 'hotdog'): hd_train,
                 os.path.join(train_dir, 'nohotdog'): nohd_train,
                 os.path.join(validation_dir, 'hotdog'): hd_val,
                 os.path.join(validation_dir, 'nohotdog'): nohd_val }

https://stackoverflow.com/questions/24398044/downloading-a-lot-of-files-using-python

We should really use a dedicated logging module, but let's keep things simple.
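For comparison, this is roughly what the same messages would look like with the standard library's `logging` module. It is only a sketch: the logger name `img_download` is made up, and the example logs to an in-memory buffer so that it is self-contained. Swapping the handler for `logging.FileHandler('log_img_download')` would write to a file instead.

```python
import io
import logging

# In-memory target so the example leaves no files behind
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))

logger = logging.getLogger('img_download')  # hypothetical logger name
logger.setLevel(logging.INFO)  # plays the role of debug=False
logger.addHandler(handler)
logger.propagate = False  # keep messages out of the root logger

logger.info('Starting with %s', 'data/train/hotdog')
logger.debug('wrote %s', 'example.jpg')  # dropped: below the INFO level
logger.warning('%s response for %s', 404, 'http://example.com/missing.jpg')

print(buffer.getvalue())
```

The level check replaces the manual `if debug:` test, and the timestamp comes from the formatter rather than from `datetime.now()` in every message.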

In [None]:
import imghdr  # note: deprecated in Python 3.11 and removed in 3.13
import shutil
from datetime import datetime

log = open('log_img_download', 'w')

for folder, urls in folders_urls.items():
    ok = 0
    fail = 0
    shutil.rmtree(folder, ignore_errors=True)
    os.makedirs(folder)

    log.write(f'{datetime.now()} Starting with {folder}\n')

    for image_url in urls:
        try:
            # The timeout keeps a dead host from hanging the whole loop
            response = requests.get(image_url, timeout=10)
            image_type = imghdr.what(None, h=response.content)

            if response.status_code == 200 and image_type in ['jpeg', 'png']:
                filename = image_url.split('/')[-1].split('?')[0]

                with open(os.path.join(folder, filename), 'wb') as f:
                    f.write(response.content)

                if debug: log.write(f'{datetime.now()} wrote {filename}\n')
                ok += 1
            else:
                log.write(f'{datetime.now()} {response.status_code} response for {image_url} of type {image_type}\n')
                fail += 1

        except Exception as e:
            log.write(f'{datetime.now()} Something went wrong with image {image_url}: {e}\n')
            fail += 1

    log.write(f'{datetime.now()} Wrote {ok} images in folder {folder}, with {fail} failures\n')

log.close()
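One caveat: `imghdr` is deprecated since Python 3.11 and was removed in 3.13. Since we only care about JPEG and PNG, a possible replacement is to check the leading magic bytes ourselves. This is a minimal sketch; `sniff_image_type` is a hypothetical helper, and it only looks at the file signature, so it does not prove the rest of the file is a valid image.

```python
from typing import Optional

JPEG_MAGIC = b'\xff\xd8\xff'      # start-of-image marker
PNG_MAGIC = b'\x89PNG\r\n\x1a\n'  # 8-byte PNG signature


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return 'jpeg' or 'png' based on the first bytes, else None."""
    if data.startswith(JPEG_MAGIC):
        return 'jpeg'
    if data.startswith(PNG_MAGIC):
        return 'png'
    return None


print(sniff_image_type(b'\x89PNG\r\n\x1a\n' + b'rest of file'))
print(sniff_image_type(b'<html>404 not found</html>'))
```

The second call shows the case that matters most here: many dead links answer with an HTML error page and a 200 status code.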

To do: either get all the images and subset, or keep retrying connections until I get the desired number of images. There are _a lot_ of broken links.
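The second option could look something like this sketch: shuffle the candidate URLs and keep trying until enough downloads succeed. `collect_until` and `try_download` are hypothetical names. In the real notebook `try_download` would wrap the request-and-save logic and return whether it worked; here a fake one stands in so the example runs offline.

```python
import random
from typing import Callable, Iterable, List


def collect_until(candidates: Iterable[str], wanted: int,
                  try_download: Callable[[str], bool], seed: int = 42) -> List[str]:
    """Try URLs in random order until `wanted` of them succeed.

    Returns the URLs that succeeded, which may be fewer than `wanted`
    if the candidates run out first.
    """
    pool = sorted(candidates)  # fixed order, so the shuffle is reproducible
    random.Random(seed).shuffle(pool)

    succeeded = []
    for url in pool:
        if len(succeeded) >= wanted:
            break
        if try_download(url):
            succeeded.append(url)
    return succeeded


# Fake downloader: pretend every odd-numbered link is broken
def fake_download(url: str) -> bool:
    return int(url.split('/')[-1].split('.')[0]) % 2 == 0


toy_urls = [f'http://example.com/{i}.jpg' for i in range(20)]
got = collect_until(toy_urls, 5, fake_download)
print(len(got))
```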

https://stackoverflow.com/questions/36554365/downloading-many-images-with-python-requests-and-multiprocessing

TODO: simultaneous requests for speedup
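Since the downloads spend most of their time waiting on the network, a thread pool is one plausible way to get that speedup. This is a sketch using the standard library's `concurrent.futures`. `fetch_all` is a hypothetical helper, and `fake_fetch` stands in for the real download function so the example runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a pool of threads.

    Returns a dict mapping each URL to fetch's result, or to the
    exception it raised, so one broken link does not abort the batch.
    """
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception as e:
            return e

    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input
        return dict(zip(urls, pool.map(safe_fetch, urls)))


# Fake fetch: sleeps to imitate network latency, fails on one URL
def fake_fetch(url):
    time.sleep(0.05)
    if url.endswith('broken.jpg'):
        raise OSError('broken link')
    return len(url)


demo_urls = ['http://example.com/a.jpg', 'http://example.com/broken.jpg', 'http://example.com/b.jpg']
results = fetch_all(demo_urls, fake_fetch)
print(results)
```

The right number of workers is a guess here; too many simultaneous requests to the same host may get us throttled.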

Tidy up

Convert into proper blog post

http://image-net.org/synset?wnid=n07697537

http://www.image-net.org/api/text/imagenet.synset.geturls?wnid=n07697537

http://caffe.berkeleyvision.org/gathered/examples/imagenet.html

### Notes


Last layer -> sigmoid (1 vs all classification)
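As a reminder of what that means numerically, here is a plain-Python sketch (no Keras involved) of a sigmoid output and the binary cross-entropy loss it is usually paired with. The function names are mine, and the 0.5 threshold for calling something a hotdog is an assumption.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true: int, p: float) -> float:
    """Loss for one example: y_true is 1 (hotdog) or 0 (not hotdog)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


score = 2.0                   # made-up raw output of the last layer
p_hotdog = sigmoid(score)
is_hotdog = p_hotdog >= 0.5   # assumed decision threshold

print(p_hotdog, is_hotdog, binary_cross_entropy(1, p_hotdog))
```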

How to calculate learning rates with Keras?? -> maybe another blog post

  * Somewhat related: what does Jeremy Howard refer to as each of the three parts of a net? and can you set differential learning rates in Keras??