# Introduction to Selenium

## Background

<a href="docs.seleniumhq.org">Selenium</a> is a free and open-source testing framework for web browser automation. Because Selenium provides a programmatic way of controlling web browser interactions, it can also be used for simple web scraping or crawling of webpages. 

Implemented in Java, Selenium WebDriver controls browsers by injecting Javascript calls. Selenium come with built-in support for Firefox with supplemental drivers to support other browsers. Selenium offers multiple programming language bindings including Python.

This tutorial is designed to get you familiarized with Selenium and assumes you have basic working knowledge of the <a href="en.wikipedia.org/wiki/Document_Object_Model">Document Object Model (DOM)</a>, which Selenium uses to identify elements on a page.

Make sure you have selenium module installed:<br><br> 
<code>pip install selenium</code><br><br>
and <a href="www.mozilla.org/en-US/firefox/new/">Firefox browser</a> installed.

## Launching the Browser

In [1]:
import time, re, json
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait

Now open a new Firefox browser.

In [2]:
browser = webdriver.Chrome("/Users/flynn_chen/Desktop/Projects/findus/missingpersons/chromedriver")

Navigate to Yelp's search page using <code>get()</code>

In [3]:
browser.get("http://www.yelp.com")
time.sleep(5)

## Looking for Pizza in Berkeley

Now let's try to locate the element Find search bar by id:
<br>
<input autocomplete="off" id="find_desc" maxlength="64" name="find_desc" placeholder="tacos, cheap dinner, Max’s" tabindex="1" value="">
<br><br>
Right-click the element in the browser and select inspect element. You should be taken to the source for the Find search bar. You'll discover that this search bar has <code>id="find_desc"</code>.

In [4]:
find = browser.find_element_by_id("find_desc")

Then type "pizza" into the search bar.

In [5]:
find.clear()
find.send_keys("pizza")

Now let's try locating the element Location search bar by name:<br>
<input autocomplete="off" id="dropperText_Mast" maxlength="64" name="find_loc" placeholder="address, neighborhood, city, state or zip" tabindex="2" value="Berkeley, CA">

In [6]:
# loc = browser.find_element_by_name("find_loc")
loc = browser.find_element_by_id("dropperText_Mast")

Clear the existing text before typing in "Berkeley, CA".

In [7]:
print(loc.clear())
loc.send_keys("Berkeley, CA")

None


Then find and click the search button.

In [8]:
browser.find_element_by_id("header-search-submit").click()
time.sleep(4)

Now that you have all of the search results for pizza in Berkeley, CA in the browser, how we do interact with each result? Well-designed interfaces with static content should label each element with an id or name value. With dynamic content and poorly implemented interfaces, there are times where you have to locate one element in reference to another element, or anchor element.

In [9]:
yelp_page_source_page1 = browser.page_source
soup = BeautifulSoup(yelp_page_source_page1,'html.parser')
all_pizza = soup.find_all('a',{'class':'lemon--a__373c0__IEZFH link__373c0__1G70M link-color--inherit__373c0__3dzpk link-size--inherit__373c0__1VFlE'})
biz_names = [pizza.text for pizza in all_pizza]
print(biz_names)

# items = browser.find_elements_by_xpath("//div[@id='super-container']/div[3]/div[3]/div/div/div/ul/li")
# print(items)
# biz_names = []
# for item in final_data:
#     biz_names.append(item.find_element_by_name("biz-names").text)
#     print(biz_names[-1])

['Philomena', 'Beta Lounge', 'Cheese Board Pizza', 'Sliver Pizzeria', 'Zachary’s Chicago Pizza', 'Gioia Pizzeria', 'Emilia’s Pizzeria', 'Pollara Pizzeria', 'Sliver Pizzeria', 'Artichoke Basille’s Pizza', 'Arinell Pizza', 'Mountain Mike’s Pizza', 'Rotten City Pizza', 'Red Tomato Pizza House', 'Zachary’s Chicago Pizza', 'Little Star Pizza', 'Lanesplitter Pizza & Pub', 'North Beach Pizza', 'West Coast Pizza', 'iSlice', 'Barbarian Grub And Ale', 'La Val’s Pizza', 'Seniore’s Pizza', 'ABE’s Pizza', 'Bobby G’s Pizzeria', 'Baiano Pizzeria - Berkeley', 'Lucia’s Berkeley', 'Pizzaiolo', 'Creekwood', 'Paisan', 'Namaste Pizza', 'iSlice', 'West Coast Pizza', '2', '3', '4', '5', '6', '7', '8', '9']


Click on the link for the first business.

In [10]:
first_restaurant = browser.find_element_by_link_text(biz_names[0]).click()
time.sleep(5)

Let's access some basic information about this business:

* address

* phone number

* website

* number of reviews

* average star rating

In [11]:
while True:
    page_source_page = browser.page_source
    soup = BeautifulSoup(page_source_page,'html.parser')

    street_address = soup.find_all('span', {'itemprop':'streetAddress'})
    locality_address = soup.find_all('span', {'itemprop':'addressLocality'})
    region_address = soup.find_all('span', {'itemprop':'addressRegion'})
    postal_address = soup.find_all('span', {'itemprop':'postalCode'})

    if len(street_address) > 0 or len(locality_address) > 0 or len(region_address) > 0 or len(postal_address) > 0:
        break
    else:
        time.sleep(5)

street_address = soup.find_all('span', {'itemprop':'streetAddress'})[0].text
print(street_address)
locality_address = soup.find_all('span', {'itemprop':'addressLocality'})[0].text
region_address = soup.find_all('span', {'itemprop':'addressRegion'})[0].text
postal_address = soup.find_all('span', {'itemprop':'postalCode'})[0].text
print(locality_address, region_address, postal_address)

phone_re = re.compile('(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})')
phone = soup.find_all('p', {'class':'lemon--p__373c0__3Qnnj text__373c0__2Kxyz text-color--normal__373c0__3xep9 text-align--left__373c0__2XGa-'})
for p in phone:
    if phone_re.search(p.text):
        print(p.text)
        break

website = soup.find_all('a', {'class':'lemon--a__373c0__IEZFH link__373c0__1G70M link-color--blue-dark__373c0__85-Nu link-size--inherit__373c0__1VFlE'})
for w in website:
    if ".com" in w.text:
        print(w.text)
        break
# print(website[0].text)

all_script_json = soup.find_all("script", {"type":"application/ld+json"})
for a in all_script_json:
    if "aggregateRating" in str(a):
        review_json = a

# print(review_json)
review_prefix = '"reviewCount": '
count_number_beg = str(review_json).find(review_prefix)
count_json = str(review_json)[count_number_beg:]
count_number_end = count_json.find(",")
# print(count_number_beg, count_number_end)
review_number = str(review_json)[count_number_beg:count_number_end + count_number_beg].replace(review_prefix, "")
print("total reviews", review_number)


review_prefix = '"AggregateRating", "ratingValue": '
count_number_beg = str(review_json).find(review_prefix)
count_json = str(review_json)[count_number_beg:]
count_number_end = count_json.find("},")
review_number = str(review_json)[count_number_beg:count_number_end + count_number_beg].replace(review_prefix, "")
print("avereage ratings", review_number)

# print(str(review_json))


1801 14th Ave
Oakland CA 94606
(510) 532-2399
philomenapizza.com
total reviews 234
avereage ratings 4.0


Go back to the previous page with all search results.

In [12]:
browser.back()
time.sleep(5)

Now let's combine all of the previous code and visit each business' website to scrape some information.

In [13]:
for biz_name in biz_names:
    
    getting_data = True
    while getting_data:
        try:
            
            browser.find_element_by_link_text(biz_name).click()
            time.sleep(5)

            page_source_page = browser.page_source
            soup = BeautifulSoup(page_source_page,'html.parser')        


            street_address = soup.find_all('span', {'itemprop':'streetAddress'})[0].text

            locality_address = soup.find_all('span', {'itemprop':'addressLocality'})[0].text
            region_address = soup.find_all('span', {'itemprop':'addressRegion'})[0].text
            postal_address = soup.find_all('span', {'itemprop':'postalCode'})[0].text

            phone_re = re.compile('(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})')
            phone = soup.find_all('p', {'class':'lemon--p__373c0__3Qnnj text__373c0__2Kxyz text-color--normal__373c0__3xep9 text-align--left__373c0__2XGa-'})

            website = soup.find_all('a', {'class':'lemon--a__373c0__IEZFH link__373c0__1G70M link-color--blue-dark__373c0__85-Nu link-size--inherit__373c0__1VFlE'})

            all_script_json = soup.find_all("script", {"type":"application/ld+json"})

            review_prefix = '"reviewCount": '
            count_number_beg = str(review_json).find(review_prefix)
            count_json = str(review_json)[count_number_beg:]
            count_number_end = count_json.find(",")
            review_number = str(review_json)[count_number_beg:count_number_end + count_number_beg].replace(review_prefix, "")

            review_prefix = '"AggregateRating", "ratingValue": '
            count_number_beg = str(review_json).find(review_prefix)
            count_json = str(review_json)[count_number_beg:]
            count_number_end = count_json.find("},")
            review_rate = str(review_json)[count_number_beg:count_number_end + count_number_beg].replace(review_prefix, "")

            browser.back()

            time.sleep(5)
            getting_data = False
        except:
            print(biz_name, "failed attempt")
            time.sleep(5)
            pass
        
    print("")
    print("")
    print(biz_name)
    print(street_address)
    print(locality_address, region_address, postal_address)
    for p in phone:
        if phone_re.search(p.text):
            print(p.text)
            break
    for w in website:
        if ".com" in w.text:
            print(w.text)
            break
    for a in all_script_json:
        if "aggregateRating" in str(a):
            review_json = a
    print("total reviews:", review_number)
    print("avereage ratings:", review_rate)
    print("")
    print("")



Philomena
1801 14th Ave
Oakland CA 94606
(510) 532-2399
philomenapizza.com
total reviews 234
avereage ratings 4.0





Beta Lounge
2129 Durant Ave
Berkeley CA 94704
(510) 845-3200
total reviews 234
avereage ratings 4.0





Cheese Board Pizza
1512 Shattuck Ave
Berkeley CA 94709
(510) 549-3183
total reviews 357
avereage ratings 4.0





Sliver Pizzeria
2468 Telegraph Ave
Berkeley CA 94704
(510) 356-4044
sliverpizzeria.com
total reviews 5276
avereage ratings 4.5





Zachary’s Chicago Pizza
1853 Solano Ave
Berkeley CA 94707
(510) 525-5950
zacharys.com
total reviews 1400
avereage ratings 4.5





Gioia Pizzeria
1586 Hopkins St
Berkeley CA 94707
(510) 528-4692
gioiapizzeria.com
total reviews 1961
avereage ratings 4.5





Emilia’s Pizzeria
2995 Shattuck Ave
Berkeley CA 94705
(510) 704-1794
emiliaspizzeria.com
total reviews 819
avereage ratings 4.0





Pollara Pizzeria
1788 4th St
Berkeley CA 94710
(510) 529-4548
pollarapizzeria.com
total reviews 406
avereage ratings 4.5



Sliver Pizzer

KeyboardInterrupt: 