# Using Selenium for Advanced Web Crawling
               
#### By Miriam Rodriguez



## In this tutorial, we will be capturing data with Selenium.

## Selenium

Selenium is a powerful framework designed to automate testing for web applications. It lets developers write tests in several programming languages, including C#, Java, Python, and Ruby. The tests can run against most major web browsers, such as Chrome, Internet Explorer, and Firefox.

Selenium automates browsers. Out of the box, it can open a web browser, go to a page, and perform any action a human could (clicking buttons, filling in forms, etc.), in addition to the base task of parsing HTML for information.

## 1. Install Selenium (Windows):

For the purposes of this tutorial, we will focus on the Windows setup.

Using the command prompt:

- pip install selenium

To upgrade the currently installed Selenium WebDriver package:
    
- pip install -U selenium    
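
To confirm that the install or upgrade took effect, the installed version can be read back from Python. The following is a minimal sketch, assuming Python 3.8 or newer (for `importlib.metadata`); the helper name `installed_version` is made up for this illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("selenium")
if version is None:
    print("selenium is not installed; run: pip install selenium")
else:
    print("selenium", version)
```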



## 2. How to Set Webdrivers for Selenium (Windows)

### Firefox: 
driver = webdriver.Firefox()

### iPhone:
driver = webdriver.Remote(command_executor='http://172.24.101.36:3001/hub', desired_capabilities={'browserName': 'iphone'})
 
### Android:
driver = webdriver.Remote(command_executor='http://127.0.0.1:8080/hub', desired_capabilities={'browserName': 'android'})

### Google Chrome: 
For Selenium to drive Chrome, ChromeDriver has to be downloaded separately.

Download ChromeDriver from:

http://chromedriver.storage.googleapis.com/2.9/chromedriver_win32.zip or 
https://sites.google.com/a/chromium.org/chromedriver/downloads

Place the chromedriver.exe file in the "C:\Python27\Scripts" folder (or anywhere you wish; just make sure you point to it when setting the driver).

The path to the Chrome driver can be written in any of the following forms:

- driver = webdriver.Chrome(executable_path=r"C:\Python27\Scripts\chromedriver.exe")

If you do not want to use a raw string, you can escape each backslash as \\:

- driver = webdriver.Chrome(executable_path="C:\\Python27\\Scripts\\chromedriver.exe")

Or replace each \ with a /:

- driver = webdriver.Chrome(executable_path="C:/Python27/Scripts/chromedriver.exe")
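
To see why the three spellings are interchangeable, the strings can be compared directly. This is a small sketch using only the standard library; the variable names are illustrative, and `PureWindowsPath` is used so that the comparison follows Windows path rules on any operating system.

```python
from pathlib import PureWindowsPath

raw_form = r"C:\Python27\Scripts\chromedriver.exe"
escaped_form = "C:\\Python27\\Scripts\\chromedriver.exe"
forward_form = "C:/Python27/Scripts/chromedriver.exe"

# The raw and escaped literals produce the very same string.
same_string = raw_form == escaped_form

# The forward-slash form is a different string, but Windows path rules
# treat "/" and "\" alike, so it names the same file.
same_path = PureWindowsPath(forward_form) == PureWindowsPath(raw_form)

print(same_string, same_path)  # True True
```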

In [33]:
%matplotlib inline
import json
import re
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from selenium import webdriver



In [34]:
# Set the options and arguments that Chromium needs in order to work with Selenium
options = webdriver.ChromeOptions()
options.add_argument("--ignore-certificate-errors")
options.add_argument("--test-type")
options.binary_location = "/usr/bin/chromium"  # Linux location of Chromium; adjust for your system
options.add_argument("--no-sandbox")  # Disables the sandbox, which makes Chromium reachable
options.add_argument("--no-default-browser-check")  # Skips the default-browser check
options.add_argument("--no-first-run")
options.add_argument("--disable-default-apps")
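
The repeated `add_argument` calls can also be driven from a single list, which makes the set of flags easier to review and reuse. The following is a sketch rather than the tutorial's own method: `CHROME_ARGUMENTS`, `apply_arguments`, and `build_chrome_options` are hypothetical names, and `apply_arguments` works on any object that exposes `add_argument`, such as the `options` object above.

```python
CHROME_ARGUMENTS = [
    "--ignore-certificate-errors",
    "--test-type",
    "--no-sandbox",
    "--no-default-browser-check",
    "--no-first-run",
    "--disable-default-apps",
]


def apply_arguments(options, arguments=CHROME_ARGUMENTS):
    """Call options.add_argument once per flag and return the same object."""
    for argument in arguments:
        options.add_argument(argument)
    return options


def build_chrome_options():
    """Create ChromeOptions carrying the flags above (requires Selenium)."""
    # Imported here so the flag list can be reused without Selenium installed.
    from selenium import webdriver

    return apply_arguments(webdriver.ChromeOptions())
```

The options only take effect once the object is handed to the driver when it is created.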



### Here's an example of how Selenium can be used. For this example, we will obtain reviews from www.loudersound.com (formerly www.teamrock.com).

## Scenario
- Start at the reviews URL for www.loudersound.com
- Go through the different Selenium commands
- Additional applications will be added

### For this tutorial we will use the Chrome driver

#### The cell below opens a Chrome browser. The browser will indicate that it is being controlled by automated test software.

In [35]:
# Set the Chrome driver
driver = webdriver.Chrome(executable_path="C:/Python27/Scripts/chromedriver.exe")
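
The `executable_path` keyword used above belongs to older Selenium releases. In Selenium 4, the driver path is passed through a `Service` object instead. The following is a minimal sketch under that assumption (Selenium 4 installed); the function name `make_chrome_driver` is hypothetical.

```python
def make_chrome_driver(driver_path, options=None):
    """Start Chrome through the Selenium 4 Service API and return the driver."""
    # Imported inside the function so that defining it does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)
```

With Selenium 4, `driver = make_chrome_driver("C:/Python27/Scripts/chromedriver.exe")` would take the place of the cell above.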


#### This will open the URL below

In [36]:
# Start the URL. We are on the first page of the website.
driver.get('http://loudersound.com/reviews/')

In [37]:
# Return the title
# It is good practice to assert on the title to make sure you are where you think you are
driver.title

u'Reviews | Louder'
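
The assertion suggested in the comment above could look like the following sketch. The helper name `assert_title_contains` is an assumption for this example; it relies only on the `driver.title` attribute shown in the cell.

```python
def assert_title_contains(driver, expected):
    """Raise AssertionError unless `expected` appears in the page title."""
    title = driver.title
    assert expected in title, "unexpected page title: {!r}".format(title)
    return title
```

For the page loaded above, `assert_title_contains(driver, 'Reviews')` would pass, while an unexpected redirect would stop the script early with a readable message.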

In [38]:
# Find a link by part of its visible text; the commented-out line requires an exact match instead
# driver.find_element_by_link_text('Led Zeppelin')
driver.find_element_by_partial_link_text('Led Zeppelin')

<selenium.webdriver.remote.webelement.WebElement (session="c4a39ce4baf2a3f89b7b063bd69d7b94", element="0.11668459941225184-1")>
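
If no link matches, `find_element_by_partial_link_text` raises `NoSuchElementException`. When a missing link should not stop the crawl, one option is the plural `find_elements` call, which returns an empty list instead. This is a sketch with a hypothetical helper name; it uses the generic `find_elements(by, value)` form and spells out the locator strategy as a plain string.

```python
# Same value as selenium.webdriver.common.by.By.PARTIAL_LINK_TEXT
PARTIAL_LINK_TEXT = "partial link text"


def first_link_containing(driver, text):
    """Return the first link whose text contains `text`, or None if there is none."""
    matches = driver.find_elements(PARTIAL_LINK_TEXT, text)
    return matches[0] if matches else None
```

A caller can then test the result, for example `link = first_link_containing(driver, 'Led Zeppelin')`, and only click when `link` is not `None`.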

In [39]:
# Chain .click() to click the element. This selects the 'Led Zeppelin' article.
driver.find_element_by_partial_link_text('Led Zeppelin').click()

In [40]:
# Navigate back, as with the browser's Back button. This returns to the previous page.
driver.back()

In [41]:
# Navigate forward, as with the browser's Forward button. This returns to the Led Zeppelin page.
driver.forward()

In [42]:
# Save a screenshot in the current working directory
driver.save_screenshot('loudersound_shot.png')

True
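
`save_screenshot` returns `True` on success, as in the output above, and `False` if the file could not be written. When a crawl takes many screenshots, it can help to give each one a timestamped name and to check that return value. The following is a sketch; `screenshot_name` and `save_checked` are hypothetical helper names.

```python
import os
from datetime import datetime


def screenshot_name(prefix, when, directory="."):
    """Build a file path inside `directory` named like prefix_YYYYMMDD_HHMMSS.png."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return os.path.join(directory, "{}_{}.png".format(prefix, stamp))


def save_checked(driver, path):
    """Save a screenshot and raise IOError if Selenium reports failure."""
    if not driver.save_screenshot(path):
        raise IOError("could not write screenshot to " + path)
    return path
```

For example, `save_checked(driver, screenshot_name('loudersound', datetime.now()))` would write a new file on every call instead of overwriting `loudersound_shot.png`.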

In [43]:
# Close the current browser window when done
driver.close()
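
`driver.close()` closes the current window, while `driver.quit()` closes every window and ends the WebDriver session. If a command fails partway through a crawl, the cleanup line may never run and the browser stays open. One way to guard against that is a small context manager, sketched below; the name `quitting` is an assumption for this example.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield `driver` and always call driver.quit() afterwards, even on errors."""
    try:
        yield driver
    finally:
        driver.quit()
```

The crawl then runs inside a `with quitting(driver) as d:` block, and the browser is shut down whether or not the commands inside succeed.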