# Description
- Scrape images of common grocery items. These include:
    - Butter
    - Soy Milk
    - Soy Sauce
    - Rice bags
    - Chin Kiang Vinegar
    - Yoghurt
    - Kecap Manis
    - Coconut Water
    - Canned Tuna
    - Ms Chens Prawn Hargow
    
I used [this FloydHub blog post](https://blog.floydhub.com/web-scraping-with-python/) as a web-scraping guide. Key ideas:
- Jupyter notebooks are great for agile development
- But cache results in a CSV file to avoid re-scraping every time
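
The caching idea can be sketched roughly as follows. This is a minimal sketch under assumptions, not code from the blog post: `load_or_scrape`, `scrape_fn`, and the cache path are hypothetical names, and each scraped record is assumed to be a flat dict with the same keys.

```python
import csv
import os

def load_or_scrape(cache_path, scrape_fn):
    """Return cached rows from cache_path if it exists; otherwise call
    scrape_fn(), save its rows to cache_path as CSV, and return them.

    scrape_fn is a hypothetical zero-argument callable that returns a list
    of flat dicts sharing the same keys.
    """
    if os.path.exists(cache_path):
        with open(cache_path, newline='', encoding='utf-8') as f:
            # Values read back from a CSV file are always strings
            return list(csv.DictReader(f))

    rows = scrape_fn()
    if rows:
        with open(cache_path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows
```

Re-running a notebook cell that calls `load_or_scrape` would then hit the network only when the cache file is missing. Numeric fields such as image height come back as strings and need converting.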

## Surprise...don't scrape Google Images!
Reason: https://stackoverflow.com/questions/36438261/extracting-images-from-google-images-using-src-and-beautifulsoup
TL;DR:
- Google makes scraping its image results very difficult
- Use Google's [Custom Search API](https://cse.google.com/cse/create/new) instead
- In fact, ALWAYS check for an API before resorting to scraping!
- HTTP request: `GET https://www.googleapis.com/customsearch/v1`
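
To show what such a GET request looks like, here is a small sketch that only assembles the URL and makes no network call. `build_search_url` is a hypothetical helper, and `YOUR_API_KEY` and `YOUR_ENGINE_ID` are placeholders, not real credentials.

```python
from urllib.parse import urlencode

API_ENDPOINT = 'https://www.googleapis.com/customsearch/v1'

def build_search_url(query, api_key, engine_id, start=1):
    """Build a Custom Search request URL for one page of image results."""
    params = {
        'q': query,              # search terms
        'cx': engine_id,         # ID of the custom search engine
        'key': api_key,          # API key (placeholder in this sketch)
        'searchType': 'image',   # restrict results to images
        'start': start,          # 1-based index of the first result on the page
    }
    # urlencode escapes spaces and special characters in the values
    return f'{API_ENDPOINT}?{urlencode(params)}'

print(build_search_url('soy milk', 'YOUR_API_KEY', 'YOUR_ENGINE_ID', start=11))
```

Using `urlencode` rather than joining strings by hand keeps multi-word queries such as `soy milk` valid in the URL.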

In [3]:
# Use the Google Custom Search JSON API to collect image results
import os
import requests

ROOT_URL = 'https://www.googleapis.com/customsearch/v1'
# Read the API key from the environment rather than hard-coding a private key
# (GOOGLE_CSE_API_KEY is an assumed variable name; set it before running)
google_custom_api_key = os.environ['GOOGLE_CSE_API_KEY']
SE_ID = '007117080109183320818:lzdpxbparoi'  # Custom search engine: image search over the entire web
product_to_search = 'lurpak'
search_type = 'image'

# Initialise list of returned images and data
imgs = []

# Each request returns one page of results; `start` is the 1-based index of the first result
for start in [1, 11, 21, 31, 41]:
    params = {
        'q': product_to_search,
        'cx': SE_ID,
        'searchType': search_type,
        'start': start,
        'key': google_custom_api_key,
    }
    # requests builds and URL-encodes the query string from `params`
    result = requests.get(ROOT_URL, params=params)
    result.raise_for_status()  # Fail loudly on quota or authentication errors

    c = result.json()

    # 'items' may be absent when a page has no results
    items = c.get('items', [])

    for item in items:
        img = {
            'product': product_to_search,
            'link': item['link'],
            'height': item['image']['height'],
            'width': item['image']['width'],
        }
        imgs.append(img)

# Method to Store Images for Later Access
Source: https://realpython.com/storing-images-in-python/#storing-many-images
TL;DR:
- Use the HDF5 format to improve efficiency
- I will have to store my images on disk anyway
- I can save space by augmenting images in memory and saving them to an HDF5 database (will need functions)
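
A rough sketch of the functions this would need, assuming the third-party `h5py` and NumPy packages are installed and that all images have been resized to the same shape. The function names and the dataset names `images` and `labels` are hypothetical choices, not taken from the Real Python article.

```python
import h5py
import numpy as np

def store_images_hdf5(path, images, labels):
    """Write a batch of equally sized images and their labels to one HDF5 file."""
    with h5py.File(path, 'w') as f:
        # images: array-like of shape (n, height, width, channels)
        f.create_dataset('images', data=np.asarray(images, dtype=np.uint8),
                         compression='gzip')
        f.create_dataset('labels', data=np.asarray(labels, dtype=np.int64))

def read_images_hdf5(path):
    """Read all images and labels back into memory as NumPy arrays."""
    with h5py.File(path, 'r') as f:
        return f['images'][:], f['labels'][:]
```

Integer labels are used here for simplicity. Product names such as `lurpak` would need mapping to integers first, or storing as a string dataset.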


# Augmenting image data
Useful Medium blog posts:
- [Part 1](https://medium.com/nanonets/nanonets-how-to-use-deep-learning-when-you-have-limited-data-f68c0b512cab)
- [Part 2](https://medium.com/nanonets/how-to-use-deep-learning-when-you-have-limited-data-part-2-data-augmentation-c26971dc8ced)
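
As a small illustration of augmenting in memory, the sketch below produces flipped and rotated variants of one image. It is an assumption-laden toy: pixels are held in plain nested lists so it runs without extra libraries, the function names are hypothetical, and a real pipeline would more likely use NumPy arrays or an image library.

```python
def flip_horizontal(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror the image top to bottom by reversing the row order."""
    return [row[:] for row in image[::-1]]

def rotate_90_clockwise(image):
    """Rotate 90 degrees clockwise: reverse the row order, then transpose."""
    return [list(col) for col in zip(*image[::-1])]

def augment(image):
    """Yield the original image followed by its simple augmented variants."""
    yield image
    yield flip_horizontal(image)
    yield flip_vertical(image)
    yield rotate_90_clockwise(image)

# A tiny 2x2 "image" with one value per pixel
sample = [[1, 2],
          [3, 4]]
variants = list(augment(sample))
print(len(variants))
```

Because `augment` is a generator, variants are produced one at a time and never need to be written to disk as separate image files.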


In [32]:
imgs[11]

{'product': 'lurpak',
 'link': 'https://cdnprod.mafretailproxy.com/cdn-cgi/image/format=auto,onerror=redirect/sys-master-prod/h52/hc6/9049142919198/662867_main.jpg_480Wx480H',
 'height': 480,
 'width': 480}

In [35]:
# Download images and store in data
import os

def requests_image(file_url, new_file_name):
    """Download file_url to new_file_name; return True on success, False otherwise."""
    try:
        response = requests.get(file_url, timeout=10)
    except requests.RequestException:
        return False

    if response.status_code != requests.codes.ok:
        return False

    with open(new_file_name, 'wb') as file:
        file.write(response.content)
    return True

# open() fails if the target directory does not exist
os.makedirs('data', exist_ok=True)

# Note: every file is saved with a .png extension, whatever format the server returns
for img_count, img in enumerate(imgs, start=1):
    img_filename = f'{product_to_search}_{img_count}'
    img_path = f'data/{img_filename}.png'
    requests_image(img['link'], img_path)
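
The download loop saves every image with a `.png` extension, even though links such as the `imgs[11]` one above point at JPEG data and end in a non-standard suffix. One possible way to pick a better extension is sketched below. `guess_image_extension` and the mapping table are hypothetical, not part of the original notebook, and the table covers only a few common image types.

```python
import os
from urllib.parse import urlsplit

# Small, hand-written mapping from MIME type to file extension (assumed subset)
CONTENT_TYPE_TO_EXT = {
    'image/jpeg': '.jpg',
    'image/png': '.png',
    'image/gif': '.gif',
    'image/webp': '.webp',
    'image/svg+xml': '.svg',
    'image/tiff': '.tif',
}
KNOWN_EXTS = set(CONTENT_TYPE_TO_EXT.values()) | {'.jpeg', '.tiff'}

def guess_image_extension(file_url, content_type=None, default='.png'):
    """Pick a file extension from the Content-Type header, else from the URL."""
    # 1. Prefer the server's Content-Type header, ignoring parameters like charset
    if content_type:
        mime = content_type.split(';')[0].strip().lower()
        if mime in CONTENT_TYPE_TO_EXT:
            return CONTENT_TYPE_TO_EXT[mime]

    # 2. Fall back to the URL path suffix, but only if it is a known image extension
    ext = os.path.splitext(urlsplit(file_url).path)[1].lower()
    if ext in KNOWN_EXTS:
        return ext

    # 3. Otherwise use the caller's default
    return default
```

Inside `requests_image`, this could be called as `guess_image_extension(file_url, response.headers.get('Content-Type'))` before choosing the output file name.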
