# Webscraping for Images
This script will be used to scrape images for the following types of stemware:
- champagne flutes
- martini glasses (formally cocktail glasses)
- rummer glasses
- snifters
- red wine glasses
- white wine glasses
- sherry glasses (schooners)

The images will then be passed through a Mechanical Turk-type process (Manual Turk) for label validation. Label validation is important for this exercise because stemware varies widely in shape. The recognition engine should be limited to a standard stemware collection, with other glassware left out of scope.

Having a large sample size is also important. I believe more than 1,000 images per category will lead to high accuracy, but the goal for now is to start with 200 images per category and supplement them with image adjustment techniques (see Jupyter Notebook: 'Image Adjustment').
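
To track progress toward the 200-image starting goal, a small helper can count the files in each search-term folder. This is only a sketch: `TARGET_PER_CATEGORY`, `IMAGE_EXTENSIONS`, `count_images`, and `shortfall` are hypothetical names, and it assumes one sub-folder per category, which is how the scraping functions in this notebook lay out their downloads.

```python
from pathlib import Path

# Hypothetical settings for illustration; adjust to the real project.
TARGET_PER_CATEGORY = 200
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def count_images(root):
    """Return {category folder name: number of image files} under root."""
    counts = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            counts[folder.name] = sum(
                1
                for f in folder.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts


def shortfall(counts, target=TARGET_PER_CATEGORY):
    """Return {category: images still needed} for categories below target."""
    return {name: target - n for name, n in counts.items() if n < target}
```

Calling `shortfall(count_images(images_root))` on the image root folder would list the categories that still need more images.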

Chalices and goblets have been removed because they are ambiguous.

## Bing Scraping
Unfortunately, each Bing query returns only 35 images. That is not enough for our needs, but every data point is useful.

Resources: https://gist.github.com/stephenhouser/c5e2b921c3770ed47eb3b75efbc94799 ; https://stackoverflow.com/questions/18497840/beautifulsoup-how-to-open-images-and-download-them
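
Joining the words of a search term with `+` works for plain words, but it leaves characters such as `&` or accented letters unescaped. A minimal sketch of an alternative, using `urllib.parse.urlencode` from the standard library, is shown here. `build_search_url` is a hypothetical helper name; the base URL and the `q`/`FORM` parameters are the ones used in this notebook.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    return base_url + "?" + urlencode(params)


bing_url = build_search_url(
    "http://www.bing.com/images/search",
    {"q": "champagne flute", "FORM": "HDRSC2"},
)
print(bing_url)
# http://www.bing.com/images/search?q=champagne+flute&FORM=HDRSC2
```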

In [1]:
from bs4 import BeautifulSoup
import requests
import sys
import os
import urllib.request, urllib.error, urllib.parse
import json
from time import sleep

In [2]:
def get_bing_soup(searchterm):
    #uses the search term as the url search
    term = searchterm
    term= term.split()
    term='+'.join(term)
    url="http://www.bing.com/images/search?q=" + str(term) + "&FORM=HDRSC2"
    #creates a folder for the search term
    DIR="C:\\Users\\chris\\Documents\\CUNY\\DATA698 - Final Masters Thesis\\Images\\" + term
    if not os.path.exists(DIR):
        os.mkdir(DIR)
    #header will imitate a user
    header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
    soup = BeautifulSoup(urllib.request.urlopen(urllib.request.Request(url, headers=header)), 'html.parser')
    #will be used to save image details
    ActualImages=[]
    
    for a in soup.find_all("a", {"class":"iusc"}):
        #m is a JSON attribute; murl is the full-size image url, used here for the file name
        m =json.loads(a["m"])
        murl = m["murl"]
        image_name = urllib.parse.urlsplit(murl).path.split("/")[-1]
        #mad is a JSON attribute; turl is the thumbnail url, which is what gets downloaded
        mad=json.loads(a["mad"])
        turl = mad["turl"]
        ActualImages.append((image_name, turl, murl))
    print("there are total" , len(ActualImages), "images")
    
    #saves each image; a failure is reported and the loop moves on
    for image_name, turl, murl in ActualImages:
        try:
            raw_img = urllib.request.urlopen(turl).read()
            #the with block closes the file even if the write fails
            with open(os.path.join(DIR, image_name), 'wb') as f:
                f.write(raw_img)
        except Exception as e:
            print("could not load : " + image_name)
            print(e)

In [3]:
#sleeps 10 seconds after each query to space out the requests
for query in ['champagne flute', 'martini glass', 'rummer glass', 'snifters',
              'red wine glass', 'white wine glass', 'sherry glass']:
    get_bing_soup(query)
    sleep(10)

there are total 35 images
there are total 35 images
there are total 35 images
there are total 35 images
there are total 35 images
there are total 35 images
there are total 35 images
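
Taking the last URL path segment as the file name can produce empty names, names with characters the file system rejects, or collisions between different images. One possible way to make the names safer is sketched here. `filename_from_url` is a hypothetical helper, and appending a short hash of the URL is an assumption about how to keep names unique, not something the original scraper does.

```python
import hashlib
import os
import re
from urllib.parse import unquote, urlsplit


def filename_from_url(url, source, default_ext=".jpg"):
    """Build a file-system-safe name such as 'glass_bing_1a2b3c4d.jpg'."""
    base = os.path.basename(unquote(urlsplit(url).path))
    stem, ext = os.path.splitext(base)
    # Replace anything outside a conservative character set with underscores.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", stem).strip("._") or "image"
    # Fall back to the default extension when the URL has none or an odd one.
    if not re.fullmatch(r"\.[A-Za-z0-9]{1,5}", ext):
        ext = default_ext
    # A short hash of the full URL keeps two different images apart.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{stem}_{source}_{digest}{ext.lower()}"
```

The `source` argument plays the same role as the engine suffixes added in the functions of this notebook: it records which search engine an image came from.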


## Google Images Search
Resources: https://stackoverflow.com/questions/20716842/python-download-images-from-google-image-search ; https://github.com/CumminUp07/imengine/blob/master/get_google_images.py

In [4]:
import re
import urllib.request, urllib.error, urllib.parse

In [5]:
def get_google_soup(searchterm):
    #uses the search term as the url search
    term = searchterm
    term= term.split()
    term='+'.join(term)
    url="https://www.google.com/search?q=" + str(term) + "&source=lnms&tbm=isch"
    #creates a folder for the search term
    DIR="C:\\Users\\chris\\Documents\\CUNY\\DATA698 - Final Masters Thesis\\Images\\" + term
    if not os.path.exists(DIR):
        os.mkdir(DIR)
    #header will imitate a user
    header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
    soup = BeautifulSoup(urllib.request.urlopen(urllib.request.Request(url, headers=header)), 'html.parser')
    ActualImages=[]
    
    #pulls image name, link, and image type
    for a in soup.find_all("div",{"class":"rg_meta"}):
        #parses the JSON metadata once: "ou" is the image link, "ity" the image type
        meta = json.loads(a.text)
        link, Type = meta["ou"], meta["ity"]
        #added google to identify source
        image_name = urllib.parse.urlsplit(link).path.split("/")[-1] + '_google'
        ActualImages.append((image_name, link, Type))
    print("there are total" , len(ActualImages), "images")
    
    #saves each image as jpg when no type is given, otherwise with its own type;
    #the try sits inside the loop so one failed download does not stop the rest
    for image_name, img, Type in ActualImages:
        try:
            raw_img = urllib.request.urlopen(urllib.request.Request(img, headers=header)).read()
            extension = Type if len(Type) > 0 else "jpg"
            with open(os.path.join(DIR, image_name + "." + extension), 'wb') as f:
                f.write(raw_img)
        except Exception as e:
            print("could not load : " + img)
            print(e)

In [6]:
#sleeps 10 seconds after each query to space out the requests
for query in ['champagne flute', 'martini glass', 'rummer glass', 'snifters',
              'red wine glass', 'white wine glass', 'sherry glass']:
    get_google_soup(query)
    sleep(10)

there are total 100 images
there are total 100 images
there are total 100 images
there are total 100 images
could not load : https://images.prod.meredith.com/product/9d547fb10349ba6c2f7dbf68cde9498a/1504157618842/l/orrefors-prestige-set-of-4-cognac-snifters
HTTP Error 403: Forbidden
there are total 100 images
there are total 100 images
there are total 100 images
could not load : https://images.prod.meredith.com/product/061163df1312777a58ba58b729732efd/1507781871925/l/baccarat-perfection-sherry-glass-plain
HTTP Error 403: Forbidden
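
A 403 like the ones in the output above need not end a run: handling errors per image lets the remaining downloads continue and leaves a record of what failed. A minimal sketch follows, assuming a hypothetical `download_all` helper. The `fetch` parameter is also an assumption, added so the loop can be exercised without network access.

```python
import urllib.request


def download_all(urls, fetch=None):
    """Try every URL; return (successes, failures) instead of stopping early."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()

    successes, failures = {}, {}
    for url in urls:
        try:
            successes[url] = fetch(url)
        except Exception as exc:  # e.g. HTTP Error 403: Forbidden
            failures[url] = str(exc)
    return successes, failures
```

The `failures` dictionary can then be printed or retried later, rather than losing every image that came after the first error.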


## Yahoo Image Search

TODO: work out the correct URL for a Yahoo image search.

In [None]:
https://images.search.yahoo.com/search/images;_ylt=AwrWnfY47OdaaHEAV3iLuLkF;_ylc=X1MDOTYwNTc0ODMEX3IDMgRiY2sDOHN2OTNndGQ2cWg2ciUyNmIlM0QzJTI2cyUzRDVnBGZyAwRncHJpZANWQjFVYXJ0dFJHeTVVeERkemJXX0dBBG10ZXN0aWQDbnVsbARuX3N1Z2cDMTAEb3JpZ2luA2ltYWdlcy5zZWFyY2gueWFob28uY29tBHBvcwMwBHBxc3RyAwRwcXN0cmwDBHFzdHJsAzE1BHF1ZXJ5A2NoYW1wYWduZSBmbHV0ZQR0X3N0bXADMTUyNTE0ODc0NwR2dGVzdGlkA251bGw-?gprid=VB1UarttRGy5UxDdzbW_GA&pvid=a4Ft6jEwLjKOfSOHWm1E2wLANzMuNwAAAABcL.ky&fr2=sb-top-images.search.yahoo.com&p=champagne+flute&ei=UTF-8&iscqry=&fr=sfp#id=0&iurl=http%3A%2F%2Fwww.urbanbar.com%2Fwp-content%2Fuploads%2F2016%2F11%2FUB3246-2.png&action=close

In [7]:
def get_yahoo_soup(searchterm):
    #uses the search term as the url search
    term = searchterm
    term= term.split()
    term='+'.join(term)
    #NOTE: still the Google url; the Yahoo url has not been worked out yet
    url="https://www.google.com/search?q=" + str(term) + "&source=lnms&tbm=isch"
    #creates a folder for the search term
    DIR="C:\\Users\\chris\\Documents\\CUNY\\DATA698 - Final Masters Thesis\\Images\\" + term
    if not os.path.exists(DIR):
        os.mkdir(DIR)
    #header will imitate a user
    header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
    soup = BeautifulSoup(urllib.request.urlopen(urllib.request.Request(url, headers=header)), 'html.parser')
    ActualImages=[]
    
    #pulls image name, link, and image type
    for a in soup.find_all("div",{"class":"rg_meta"}):
        link, Type =json.loads(a.text)["ou"], json.loads(a.text)["ity"]
        image_name = urllib.parse.urlsplit(link).path.split("/")[-1] + ''
        ActualImages.append((image_name, link, Type))
    print("there are total" , len(ActualImages), "images")
    
    #saves each image as jpg when no type is given, otherwise with its own type;
    #the try sits inside the loop so one failed download does not stop the rest
    for image_name, img, Type in ActualImages:
        try:
            raw_img = urllib.request.urlopen(urllib.request.Request(img, headers=header)).read()
            extension = Type if len(Type) > 0 else "jpg"
            with open(os.path.join(DIR, image_name + "." + extension), 'wb') as f:
                f.write(raw_img)
        except Exception as e:
            print("could not load : " + img)
            print(e)

In [8]:
term = 'dogs'
term= term.split()
term='+'.join(term)
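#assumption: the /search/images path and ?p= parameter are taken from the reference urls in this cell
url="https://images.search.yahoo.com/search/images?p=" + str(term) + "&ei=UTF-8&iscqry=&fr=sfp"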
#creates a folder for the search term
DIR="C:\\Users\\chris\\Documents\\CUNY\\DATA698 - Final Masters Thesis\\Images\\" + term
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
soup = BeautifulSoup(urllib.request.urlopen(urllib.request.Request(url, headers=header)), 'html.parser')
ActualImages=[]
#reference urls copied from the browser, kept as comments (as bare lines they are a SyntaxError)
#https://images.search.yahoo.com/search/images;_ylt=AwrWnfY47OdaaHEAV3iLuLkF;_ylc=X1MDOTYwNTc0ODMEX3IDMgRiY2sDOHN2OTNndGQ2cWg2ciUyNmIlM0QzJTI2cyUzRDVnBGZyAwRncHJpZANWQjFVYXJ0dFJHeTVVeERkemJXX0dBBG10ZXN0aWQDbnVsbARuX3N1Z2cDMTAEb3JpZ2luA2ltYWdlcy5zZWFyY2gueWFob28uY29tBHBvcwMwBHBxc3RyAwRwcXN0cmwDBHFzdHJsAzE1BHF1ZXJ5A2NoYW1wYWduZSBmbHV0ZQR0X3N0bXADMTUyNTE0ODc0NwR2dGVzdGlkA251bGw-?gprid=VB1UarttRGy5UxDdzbW_GA&pvid=a4Ft6jEwLjKOfSOHWm1E2wLANzMuNwAAAABcL.ky&fr2=sb-top-images.search.yahoo.com&p=champagne+flute&ei=UTF-8&iscqry=&fr=sfp#id=0&iurl=http%3A%2F%2Fwww.urbanbar.com%2Fwp-content%2Fuploads%2F2016%2F11%2FUB3246-2.png&action=close
#https://images.search.yahoo.com/search/images;_ylt=AwrUi5xn7uda4kIAxASLuLkF;_ylc=X1MDOTYwNTc0ODMEX3IDMgRiY2sDOHN2OTNndGQ2cWg2ciUyNmIlM0QzJTI2cyUzRDVnBGZyAwRncHJpZANQTUJmNnU0VFN1ZUxadnN0ZkFvUDVBBG10ZXN0aWQDbnVsbARuX3N1Z2cDMTAEb3JpZ2luA2ltYWdlcy5zZWFyY2gueWFob28uY29tBHBvcwMwBHBxc3RyAwRwcXN0cmwDBHFzdHJsAzMEcXVlcnkDZG9nBHRfc3RtcAMxNTI1MzIxNTY2BHZ0ZXN0aWQDbnVsbA--?gprid=PMBf6u4TSueLZvstfAoP5A&pvid=uUi.xzEwLjKOfSOHWm1E2wAeNzMuNwAAAAB9hL0i&fr2=sb-top-images.search.yahoo.com&p=dog&ei=UTF-8&iscqry=&fr=sfp

## Yandex
Yandex is a Russian company whose search engine, Yandex Search, generates over half of the search traffic in Russia. As such, it is a valuable additional source of images.

In [138]:
import urllib.request, urllib.error, urllib.parse
from urllib.parse import unquote

In [9]:
def get_yandex_soup(searchterm):
    #uses the search term as the url search
    term = searchterm
    term= term.split()
    term='+'.join(term)
    url="https://yandex.com/images/search?text=" + str(term)
    #creates a folder for the search term
    DIR="C:\\Users\\chris\\Documents\\CUNY\\DATA698 - Final Masters Thesis\\Images\\" + term
    if not os.path.exists(DIR):
        os.mkdir(DIR)
    #header will imitate a user
    header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
    soup = BeautifulSoup(urllib.request.urlopen(urllib.request.Request(url, headers=header)), 'html.parser')
    ActualImages=[]

    #pulls the image link and name from each result
    for a in soup.find_all("a",{"class":"serp-item__link"}):
        link = a.get('href')
        #keeps everything after img_url=, decodes it, then drops the trailing &pos... parameters
        match = re.search(r'(?<=img_url=).*', link)
        if match is None:
            continue
        link = unquote(match.group())
        link = re.sub(r'&pos.*', '', link)
        #str.replace removes backslashes; a lone backslash is not a valid regex pattern
        link = link.replace('\\', '')
        name = a.find('img')
        #added yandex to name in case of duplicates
        name = name['alt'] + 'yandex'
        ActualImages.append((link, name))
    print("there are total" , len(ActualImages), "images")
    
    #saves the image; a failure is reported and the loop moves on
    for link, name in ActualImages:
        try:
            raw_img = urllib.request.urlopen(urllib.request.Request(link, headers=header)).read()
            #the with block closes the file even if the write fails
            with open(os.path.join(DIR, name + ".jpg"), 'wb') as f:
                f.write(raw_img)
        except Exception as e:
            print("could not load : " + name)
            print(e)

In [10]:
#sleeps 10 seconds after each query to space out the requests
for query in ['champagne flute', 'martini glass', 'rummer glass', 'snifters',
              'red wine glass', 'white wine glass', 'sherry glass']:
    get_yandex_soup(query)
    sleep(10)

NameError: name 'unquote' is not defined
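
The link extraction in `get_yandex_soup` can also be done without regular expressions or a separate `unquote` call, because `urllib.parse.parse_qs` splits the query string and decodes it in one step. The sketch below assumes the result link carries the image address in an `img_url` query parameter, as the regular expression in the function implies. `extract_img_url` is a hypothetical helper name and the example link is made up.

```python
from urllib.parse import parse_qs, urlsplit


def extract_img_url(href):
    """Return the decoded img_url query parameter of a result link, or None."""
    values = parse_qs(urlsplit(href).query).get("img_url")
    return values[0] if values else None


href = "/images/search?pos=0&img_url=https%3A%2F%2Fexample.com%2Fglass.jpg&text=snifters"
print(extract_img_url(href))
# https://example.com/glass.jpg
```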

## imgdl

In [1]:
import imgdl

In [11]:
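#google.py is a command-line script, so it is run through IPython's ! shell escape
#(assumes google.py is in the working directory); typed as plain Python it is a SyntaxError
!python google.py "champagne flute" -n 600 --interactive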
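
When a command-line script is pasted into a notebook cell, `parser.parse_args()` with no arguments reads `sys.argv`, which holds the kernel's own arguments rather than the query. Passing an explicit list is one way around that. The sketch below uses a cut-down, hypothetical stand-in for the script's parser (`build_parser` is not part of imgdl), with only three of its flags.

```python
import argparse


def build_parser():
    """Cut-down stand-in for the download script's command-line parser."""
    parser = argparse.ArgumentParser(
        description="Download images from a google images query"
    )
    parser.add_argument("query", type=str)
    parser.add_argument("-n", "--n_images", type=int, default=100)
    parser.add_argument("--interactive", action="store_true")
    return parser


# Same arguments as the shell command, passed as a list instead of sys.argv.
args = build_parser().parse_args(["champagne flute", "-n", "600", "--interactive"])
print(args.query, args.n_images, args.interactive)
# champagne flute 600 True
```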

In [2]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import json
from pathlib import Path
from time import sleep

from imgdl import download
from imgdl.settings import config
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

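# notebooks do not define __file__; an empty string makes the paths below
# resolve relative to the current working directory
__file__ = ''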

IMAGE_STORE = Path(__file__).parent / 'images'

CHROME_DRIVER = Path(__file__).parent / 'chromedriver'
CHROME_DRIVER_DOWNLOAD_PAGE = "https://sites.google.com/a/chromium.org/chromedriver/downloads"
MAX_RETRIES = 3

if not CHROME_DRIVER.exists():
    raise FileNotFoundError(f"'chromedriver' executable not found. "
                            f"Download it from {CHROME_DRIVER_DOWNLOAD_PAGE} "
                            f"and place it next to this script file")


def get_driver(headless=True):
    options = webdriver.ChromeOptions()

    if headless:
        options.add_argument("headless")

    driver = webdriver.Chrome(
        executable_path=str(CHROME_DRIVER),
        options=options
    )

    return driver


def scroll_down(driver, click_more_results=False):
    if click_more_results:
        smr = driver.find_element_by_id("smb")
        if smr.is_displayed():
            smr.click()
    else:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")


def parse_urls_from_source(page_source):

    soup = BeautifulSoup(page_source, "lxml")
    return [
        json.loads(rg_di.find("div", class_="rg_meta").contents[0])["ou"]
        for rg_di in soup.find_all("div", class_='rg_di')
    ]


def get_urls(driver, n_images):
    urls = parse_urls_from_source(driver.page_source)
    previous_n = new_n = len(urls)

    current_retries = 0
    n_scrolls = 0
    # Scroll down until there are enough images or unsuccessful retries exceeded maximum retries
    while (new_n < n_images) and (current_retries < MAX_RETRIES):
        scroll_down(
            driver,
            click_more_results=(new_n == previous_n) and (current_retries != 0)
        )
        n_scrolls += 1
        print(f"Scrolled {n_scrolls} times already")
        current_retries += 1
        # Do incremental waits until more images appear
        for i in range(4):
            sleep(0.5 * i + 1)
            urls = parse_urls_from_source(driver.page_source)
            new_n = len(urls)
            if new_n > previous_n:
                current_retries = 0
                print(f"{new_n} images so far")
                break
        previous_n = new_n

    print(f"{len(urls)} images found.")
    return urls


def main(args):

    print(f"Querying google images for '{args.query}'")
    driver = get_driver(headless=not args.interactive)
    driver.get("https://images.google.com")

    elem = driver.find_element_by_name("q")
    elem.send_keys(args.query)
    elem.send_keys(Keys.RETURN)

    urls = get_urls(driver, args.n_images)

    store_path = args.store_path / 'google' / args.query.replace(" ", "_")
    print(f"Downloading to {store_path}")
    paths = download(
        urls,
        store_path=store_path,
        n_workers=args.n_workers,
        timeout=args.timeout,
        min_wait=args.min_wait,
        max_wait=args.max_wait,
        proxies=args.proxy,
        user_agent=args.user_agent,
        notebook=args.notebook,
        debug=args.debug,
        force=args.force,
    )

    return dict(zip(urls, paths))


if __name__ == '__main__':

    import argparse
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
        description="Download images from a google images query"
    )

    parser.add_argument('query', type=str,
                        help="Query string to be executed on google images")

    parser.add_argument('-n', '--n_images', type=int, default=100,
                        help="Number of expected images to download")

    parser.add_argument('--interactive', action='store_true',
                        help="Open up chrome interactively to see the search results and scrolling action.")

    # type=Path so a path given on the command line supports the / operator used in main
    parser.add_argument('-o', '--store_path', type=Path, default=IMAGE_STORE,
                        help="Root path where images should be stored")

    parser.add_argument('--n_workers', type=int, default=config['N_WORKERS'],
                        help="Number of simultaneous threads to use")

    parser.add_argument('--timeout', type=float, default=config['TIMEOUT'],
                        help="Timeout to be given to the url request")

    parser.add_argument('--min_wait', type=float, default=config['MIN_WAIT'],
                        help="Minimum wait time between image downloads")

    parser.add_argument('--max_wait', type=float, default=config['MAX_WAIT'],
                        help="Maximum wait time between image downloads")

    parser.add_argument('--proxy', type=str, action='append', default=config['PROXIES'],
                        help="Proxy or list of proxies to use for the requests")

    parser.add_argument('-u', '--user_agent', type=str, default=config['USER_AGENT'],
                        help="User agent to be used for the requests")

    parser.add_argument('-f', '--force', action='store_true',
                        help="Force the download even if the files already exists")

    parser.add_argument('--notebook', action='store_true',
                        help="Use the notebook version of tqdm")

    parser.add_argument('-d', '--debug', action='store_true',
                        help="Activate debug mode")

    paths = main(parser.parse_args())


FileNotFoundError: 'chromedriver' executable not found. Download it from https://sites.google.com/a/chromium.org/chromedriver/downloads and place it next to this script file
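
The check that raised this error only looks for a file named exactly `chromedriver` next to the script. On Windows the executable is normally called `chromedriver.exe`, and it may also be installed somewhere on `PATH`. A more forgiving lookup could be sketched as follows; `find_chromedriver` is a hypothetical helper and not part of imgdl or Selenium.

```python
import shutil
from pathlib import Path


def find_chromedriver(extra_dirs=(), name="chromedriver"):
    """Return the first chromedriver found in extra_dirs or on PATH, else None."""
    for directory in extra_dirs:
        # Check both the bare name and the Windows .exe variant.
        for candidate in (Path(directory) / name, Path(directory) / (name + ".exe")):
            if candidate.is_file():
                return str(candidate)
    # shutil.which searches PATH and returns None when nothing is found.
    return shutil.which(name)
```

A call such as `find_chromedriver(extra_dirs=["."])` checks the working directory first and then falls back to `PATH`.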

In [None]:
!python google.py "paris by night" -n 600 --interactive

## Other Website Search (?)