# Description
- Scrape images of common grocery items. These include:
    - Butter
    - Soy Milk
    - Soy Sauce
    - Rice bags
    - Chin Kiang Vinegar
    - Yoghurt
    - Kecap Manis
    - Coconut Water
    - Canned Tuna
    - Ms Chens Prawn Hargow
    
I used [this FloydHub blog post](https://blog.floydhub.com/web-scraping-with-python/) as a web-scraping guide. Key ideas:
- Jupyter notebooks are great for agile development.
- However, cache results in a CSV file to avoid re-scraping every time.
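
A minimal sketch of the CSV caching idea, using only the standard library. The helper names (`save_results`, `load_results`, `cached`) and the column list are assumptions for illustration; the columns mirror the fields collected later in this notebook.

```python
import csv
import os

# Columns to persist (assumed to match the scraped records)
FIELDS = ["product", "link", "height", "width"]

def save_results(path, rows):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

def load_results(path):
    """Return the cached rows, or None if no cache file exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def cached(path, fetch):
    """Call fetch() only when there is no cache file, then read from the cache."""
    if load_results(path) is None:
        save_results(path, fetch())
    # Always return what is on disk so the types are consistent between runs
    return load_results(path)
```

Note that `csv.DictReader` returns every value as a string, so numeric fields such as `height` need converting with `int()` after loading.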

## Surprise...don't scrape Google Images!
Reason: https://stackoverflow.com/questions/36438261/extracting-images-from-google-images-using-src-and-beautifulsoup

TL;DR:
- Google makes scraping its image results very difficult.
- Use Google's [Custom Search API](https://cse.google.com/cse/create/new) instead.
- In fact, ALWAYS look for an API before scraping!
- HTTP request: `GET https://www.googleapis.com/customsearch/v1`

In [114]:
# Use the Google Custom Search API to collect image links
import os
import requests

ROOT_URL = 'https://www.googleapis.com/customsearch/v1'
# Private API key: read from an environment variable instead of hard-coding it
# (the variable name GOOGLE_CSE_API_KEY is an assumption; use whatever you export)
google_custom_api_key = os.environ['GOOGLE_CSE_API_KEY']
SE_ID = '007117080109183320818:lzdpxbparoi'  # Custom entire-web image search engine
product_to_search = 'lurpak'
search_type = 'image'

# Initialise list of returned images and data
imgs = []

# The API returns at most 10 results per request; `start` is the index of the first result
for start in [1, 11, 21, 31, 41]:
    params = {
        'q': product_to_search,
        'cx': SE_ID,
        'searchType': search_type,
        'start': start,
        'key': google_custom_api_key,
    }
    # Let requests build and URL-encode the query string
    result = requests.get(ROOT_URL, params=params)
    result.raise_for_status()  # Stop early on HTTP errors (bad key, quota exceeded, ...)

    c = result.json()

    # Fall back to an empty list if the response has no 'items' key
    items = c.get('items', [])

    for item in items:
        img = {
            'product': product_to_search,
            'link': item['link'],
            'height': item['image']['height'],
            'width': item['image']['width'],
        }
        imgs.append(img)
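
The per-item parsing in the loop above can also be pulled into a small helper, which makes it testable without calling the API. This is a sketch only: `extract_images` is a hypothetical name, and the field names (`items`, `link`, `image.height`, `image.width`) are taken from the cell above rather than from the API reference.

```python
def extract_images(response_json, product):
    """Turn one decoded Custom Search response into a list of image records."""
    images = []
    # Assumption: a response may lack 'items', so default to an empty list
    for item in response_json.get("items", []):
        meta = item.get("image", {})
        images.append({
            "product": product,
            "link": item["link"],
            "height": meta.get("height"),
            "width": meta.get("width"),
        })
    return images
```

With this helper, the body of the paging loop reduces to `imgs.extend(extract_images(c, product_to_search))`.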

# Method to Store Images for Later Access
Reference: https://realpython.com/storing-images-in-python/#storing-many-images

TL;DR:
- Use the HDF5 format to improve efficiency.
- I will have to store my images on disk anyway.
- I can save space by augmenting images in memory and saving them to an HDF5 database (this will need helper functions).
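
A rough sketch of what the HDF5 helpers might look like, assuming `h5py` and `numpy` are installed and that all images have already been resized to the same shape. The function names, the dataset names (`images`, `labels`) and the use of integer labels are assumptions, not something this notebook has settled on.

```python
import numpy as np
import h5py

def save_images_hdf5(path, images, labels):
    """Store a stack of equally sized images and their integer labels in one file."""
    with h5py.File(path, "w") as f:
        # Images as one (N, H, W, C) uint8 array, gzip-compressed to save space
        f.create_dataset("images", data=np.asarray(images, dtype=np.uint8),
                         compression="gzip")
        f.create_dataset("labels", data=np.asarray(labels, dtype=np.int64))

def load_images_hdf5(path):
    """Read the full image stack and labels back into memory."""
    with h5py.File(path, "r") as f:
        return f["images"][:], f["labels"][:]
```

Images of different sizes would need resizing or padding first, since a single dataset holds one fixed-shape array.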


# Augmenting image data
Useful Medium blogs:
- [Part 1](https://medium.com/nanonets/nanonets-how-to-use-deep-learning-when-you-have-limited-data-f68c0b512cab)
- [Part 2](https://medium.com/nanonets/how-to-use-deep-learning-when-you-have-limited-data-part-2-data-augmentation-c26971dc8ced)
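
As an illustration of the kind of in-memory augmentation those posts describe, here is a small NumPy-only sketch. It assumes images are `uint8` arrays of shape `(height, width, channels)`; the function names and the brightness offset of 30 are arbitrary choices for illustration.

```python
import numpy as np

def flip_horizontal(image):
    """Mirror the image left to right by reversing the width axis."""
    return image[:, ::-1]

def adjust_brightness(image, delta):
    """Add delta to every pixel, clipping to the valid 0..255 range."""
    shifted = image.astype(np.int16) + delta  # widen first to avoid uint8 wrap-around
    return np.clip(shifted, 0, 255).astype(np.uint8)

def augment(image):
    """Return the original image plus three simple variants."""
    return [
        image,
        flip_horizontal(image),
        adjust_brightness(image, 30),
        adjust_brightness(image, -30),
    ]
```

Each source image then yields four training examples without any extra downloads.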


In [116]:
imgs

[{'product': 'lurpak',
  'link': 'https://cdn0.woolworths.media/content/wowproductimages/large/223868.jpg',
  'height': 1200,
  'width': 1200},
 {'product': 'lurpak',
  'link': 'https://cdn.bmstores.co.uk/images/hpcProductImage/imgFull/335483-lurpak-500g.jpg',
  'height': 800,
  'width': 800},
 {'product': 'lurpak',
  'link': 'https://cdn0.woolworths.media/content/wowproductimages/large/223865.jpg',
  'height': 1200,
  'width': 1200},
 {'product': 'lurpak',
  'link': 'https://cdnprod.mafretailproxy.com/cdn-cgi/image/format=auto,onerror=redirect/sys-master-prod/h0d/h22/9215148064798/76732_main.jpg_480Wx480H',
  'height': 480,
  'width': 480},
 {'product': 'lurpak',
  'link': 'https://cdn0.woolworths.media/content/wowproductimages/large/223867.jpg',
  'height': 1200,
  'width': 1200},
 {'product': 'lurpak',
  'link': 'https://www.lurpak.com/siteassets/ui/default/logo/lurpak-start.png',
  'height': 179,
  'width': 238},
 {'product': 'lurpak',
  'link': 'https://www.lurpak.co.uk/siteassets

In [111]:
test_url = c['items'][0]['link']

img_data = requests.get(test_url).content
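
Once the raw bytes are in `img_data`, they still need to be written to disk. A minimal sketch, assuming the file name can be taken from the last path segment of the URL; `filename_from_url`, `save_image` and the fallback name are hypothetical.

```python
import os
import posixpath
from urllib.parse import urlparse

def filename_from_url(url, default="image.jpg"):
    """Use the last path segment of the URL as the file name."""
    name = posixpath.basename(urlparse(url).path)
    return name or default  # Fall back when the URL path ends in '/'

def save_image(img_bytes, url, directory):
    """Write downloaded image bytes into directory and return the file path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename_from_url(url))
    with open(path, "wb") as f:
        f.write(img_bytes)
    return path
```

For the cell above this would be called as `save_image(img_data, test_url, 'images/lurpak')`, where the directory name is only an example.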