# Description
- Scrape images of common grocery items. These include:
    - Butter
    - Soy Milk
    - Soy Sauce
    - Rice bags
    - Chin Kiang Vinegar
    - Yoghurt
    - Kecap Manis
    - Coconut Water
    - Canned Tuna
    - Ms Chens Prawn Hargow
    
I used [this FloydHub blog post](https://blog.floydhub.com/web-scraping-with-python/) as a web-scraping guide. Key ideas:
- Jupyter notebooks are great for agile development
- But cache results in a CSV file to avoid re-scraping every time
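
The caching idea above can be sketched roughly as follows. This is a minimal illustration, not code from the blog post: `load_or_fetch`, `cache_path` and `fetch_fn` are hypothetical names, and it assumes the scraped results are a list of flat dictionaries.

```python
import csv
import os

def load_or_fetch(cache_path, fetch_fn, fieldnames):
    """Return cached rows from cache_path, or call fetch_fn() and cache its result."""
    if os.path.exists(cache_path):
        with open(cache_path, newline='', encoding='utf-8') as f:
            return list(csv.DictReader(f))

    rows = fetch_fn()  # the slow scraping / API step runs only on a cache miss
    with open(cache_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Note that rows read back from the CSV hold every value as a string, so numeric fields such as `height` and `width` would need converting after a cache hit.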

## Surprise... don't scrape Google Images!
Reason: https://stackoverflow.com/questions/36438261/extracting-images-from-google-images-using-src-and-beautifulsoup

TL;DR:
- Google makes scraping its image results very difficult
- Use Google's [custom search API](https://cse.google.com/cse/create/new) instead
- In fact, ALWAYS look for an API before scraping!
- HTTP request: `GET https://www.googleapis.com/customsearch/v1`
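
As a rough sketch of what that GET request looks like, the query string can be built with the standard library so that the search term is URL-encoded. The helper name `build_search_url` and the placeholder key and engine ID are assumptions for illustration. The parameter names (`q`, `cx`, `searchType`, `start`, `key`) are the ones used in the cell below.

```python
from urllib.parse import urlencode

API_ENDPOINT = 'https://www.googleapis.com/customsearch/v1'

def build_search_url(query, api_key, engine_id, start=1):
    """Build an image-search URL; start is the 1-based index of the first result."""
    params = {
        'q': query,
        'cx': engine_id,
        'searchType': 'image',
        'start': start,
        'key': api_key,
    }
    return f'{API_ENDPOINT}?{urlencode(params)}'

# Placeholder credentials, not real ones
print(build_search_url('dried bonito flakes', 'YOUR_API_KEY', 'YOUR_ENGINE_ID', start=11))
```

Reading the key from an environment variable instead of hard-coding it would keep it out of the notebook.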

In [1]:
# Use Google Custom Search Engine to collect image links and metadata
import requests
import json
import numpy as np

ROOT_URL = 'https://www.googleapis.com/customsearch/v1?'
google_custom_api_key = 'AIzaSyBOzEUMnkNGzrTDCK1USlTyTFovHY95czw' #Bob Jin's private API key
SE_ID = '007117080109183320818:lzdpxbparoi' #Custom entire web image search engine
product = 'katsuobushi'
search_term = 'katsuobushi dried bonito flakes'
searchType = 'image'

q = f'q={search_term}'
key = f'key={google_custom_api_key}'
cx = f'cx={SE_ID}'
searchType = f'searchType={searchType}'

# Initialise list of returned images and data
imgs = []

# Get up to 100 images over 10 searches (10 results each). Limit of 100 searches per day!
for start_num in np.arange(1,101,10):
    print(f'Getting data for images {start_num}-{start_num+9}')
    start = f'start={start_num}'

    params = [q,cx,searchType,start,key]
    SUFFIX_URL = '&'.join(params)
    url = ROOT_URL + SUFFIX_URL
    result = requests.get(url)
    
    c = json.loads(result.text)

    # 'items' is absent when the response has no results or is an error payload
    items = c.get('items', [])

    for item in items:
        img = {
            'product': product,
            'link': item['link'],
            'height': item['image']['height'],
            'width': item['image']['width'],
        }
        imgs.append(img)

Getting data for images 1-10
Getting data for images 11-20
Getting data for images 21-30
Getting data for images 31-40
Getting data for images 41-50
Getting data for images 51-60
Getting data for images 61-70
Getting data for images 71-80
Getting data for images 81-90
Getting data for images 91-100


# Method to Store Images for Later Access
https://realpython.com/storing-images-in-python/#storing-many-images

TL;DR:
- Use the HDF5 format to improve efficiency
- I will have to store my images on disk anyway
- I can save space by augmenting images in memory and saving them to an HDF5 database (this will need helper functions)
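
A minimal sketch of the HDF5 idea, assuming the third-party `h5py` package is installed and that all images have been resized to the same shape beforehand. The function names and the dataset names `images` and `labels` are hypothetical choices, not taken from the linked article.

```python
import h5py
import numpy as np

def save_images_hdf5(path, images, labels):
    """Write a stack of equally sized images and their integer labels to one HDF5 file."""
    with h5py.File(path, 'w') as f:
        f.create_dataset('images', data=np.asarray(images, dtype=np.uint8), compression='gzip')
        f.create_dataset('labels', data=np.asarray(labels, dtype=np.int64))

def load_images_hdf5(path):
    """Read the images and labels back into memory as NumPy arrays."""
    with h5py.File(path, 'r') as f:
        return f['images'][:], f['labels'][:]
```

Keeping one file per dataset instead of one file per image is what makes later bulk reads cheap.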


# Augmenting image data
Useful Medium blogs:
- [Part 1](https://medium.com/nanonets/nanonets-how-to-use-deep-learning-when-you-have-limited-data-f68c0b512cab)
- [Part 2](https://medium.com/nanonets/how-to-use-deep-learning-when-you-have-limited-data-part-2-data-augmentation-c26971dc8ced)
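
To give a flavour of in-memory augmentation, here is a small NumPy-only sketch. The `augment` helper and the noise level are illustrative assumptions. It shows three simple transforms (flip, 90-degree rotation, Gaussian noise) and is not a reproduction of the posts' code.

```python
import numpy as np

def augment(image, rng):
    """Return simple variants of an H x W x C uint8 image: flip, rotation, noise."""
    flipped = np.fliplr(image)              # mirror left-right
    rotated = np.rot90(image)               # rotate 90 degrees counter-clockwise
    noise = rng.normal(0, 10, image.shape)  # Gaussian pixel noise, sigma = 10
    noisy = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    return [flipped, rotated, noisy]

# Usage with a random stand-in image
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(420, 600, 3), dtype=np.uint8)
variants = augment(image, rng)
```

Each source image yields three extra training samples here, and the rotated variant swaps height and width.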


In [2]:
imgs[11]

{'product': 'katsuobushi',
 'link': 'https://image.shutterstock.com/image-photo/katsuobushi-dried-bonito-flakes-600w-1363113500.jpg',
 'height': 420,
 'width': 600}

In [3]:
# Download images and store them under data/<product>/
import os

def requests_image(file_url, new_file_name):
    """Download file_url to new_file_name. Return True on success, False otherwise."""
    try:
        i = requests.get(file_url, timeout=10)
    except requests.RequestException:
        return False
    if i.status_code == requests.codes.ok:
        with open(new_file_name, 'wb') as file:
            file.write(i.content)
        return True
    return False

path = f'data/{product}'
try:
    os.makedirs(path)  # also creates the parent 'data' directory if needed
except OSError:
    print(f'Creation of the directory {path} failed')
else:
    print(f'Successfully created directory {path}')

# Every file is saved with a .png extension, whatever the source image format
for img_count, img in enumerate(imgs, start=1):
    img_filename = f'{product}_{img_count}'

    img_path = f'{path}/{img_filename}.png'
    if requests_image(img['link'], img_path):
        print(f'Downloaded {img_filename}')
    else:
        print(f'Failed to download {img_filename}')

Successfully created directory data/katsuobushi
Downloaded katsuobushi_1
Downloaded katsuobushi_2
Downloaded katsuobushi_3
Downloaded katsuobushi_4
Downloaded katsuobushi_5
Downloaded katsuobushi_6
Downloaded katsuobushi_7
Downloaded katsuobushi_8
Downloaded katsuobushi_9
Downloaded katsuobushi_10
Downloaded katsuobushi_11
Downloaded katsuobushi_12
Downloaded katsuobushi_13
Downloaded katsuobushi_14
Downloaded katsuobushi_15
Downloaded katsuobushi_16
Downloaded katsuobushi_17
Downloaded katsuobushi_18
Downloaded katsuobushi_19
Downloaded katsuobushi_20
Downloaded katsuobushi_21
Downloaded katsuobushi_22
Downloaded katsuobushi_23
Downloaded katsuobushi_24
Downloaded katsuobushi_25
Downloaded katsuobushi_26
Downloaded katsuobushi_27
Downloaded katsuobushi_28
Downloaded katsuobushi_29
Downloaded katsuobushi_30
Downloaded katsuobushi_31
Downloaded katsuobushi_32
Downloaded katsuobushi_33
Downloaded katsuobushi_34
Downloaded katsuobushi_35
Downloaded katsuobushi_36
Downloaded katsuobushi_37

In [4]:
imgs[12]

{'product': 'katsuobushi',
 'link': 'https://images.japancentre.com/images/pics/15333/medium/13042_Makurazaki_France_Katsuobushi_Dried_Bonito_Flakes_-_Thin_Type_100.jpg?1548767376',
 'height': 250,
 'width': 250}