### Accessing Data: Some Preliminary Considerations

Whenever you're trying to get information from the web, it's very important to first know whether you're accessing it through appropriate means.

The UC Berkeley library has some excellent resources on this topic. Here is a flowchart that can help guide your course of action.

![](figures/scraping_flowchart.png)

You can see the library's licensed sources [here](http://guides.lib.berkeley.edu/text-mining).

# Installing Selenium

We're going to use Selenium for Firefox, which means we'll have to install `geckodriver`. You can download it [here](https://github.com/mozilla/geckodriver/releases/). Download the right version for your system, and then unzip it.

You'll then need to move it to a folder that is on your system `PATH`. This workshop expects you to be running Python 3.X with Anaconda. If you drag geckodriver into your anaconda/bin folder, then you should be all set.

# Selenium

Very helpful documentation on how to navigate a webpage with Selenium can be found [here](http://selenium-python.readthedocs.io/navigating.html). There are many different ways to navigate, so you'll want to refer to it throughout the workshops, as well as when you're working on your own projects in the future.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup

First we'll set up the (web)driver. This will open up a Firefox window.

In [None]:
# setup driver
driver = webdriver.Firefox()

To go to a webpage, we pass its URL as the argument to the `get` method.

In [None]:
driver.get("http://www.google.com")

In [None]:
# go to page
driver.get("http://wbsec.gov.in/(S(eoxjutirydhdvx550untivvu))/DetailedResult/Detailed_gp_2013.aspx")

### Zilla Parishad Name

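We can use the method `find_element_by_name` to find an element on the page by its `name` attribute. An easy way to find that name is to right-click the element in the browser and inspect it. (The `find_element_by_*` helpers were removed in Selenium 4.3; on newer versions the equivalent call is `driver.find_element(By.NAME, "ddldistrict")`, with `By` imported from `selenium.webdriver.common.by`.)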

In [None]:
# find "district" drop down
district = driver.find_element_by_name("ddldistrict")

To get the different options in this drop down, we search inside the `district` element in the same way. You'll notice that each name is associated with a unique value. Since we're getting multiple elements here, we'll use `find_elements_by_tag_name`.

In [None]:
# find options in that drop down
district_options = district.find_elements_by_tag_name("option")

print(district_options[1].get_attribute("value"))
print(district_options[1].text)

Now we'll make a dictionary associating each name with its value.

In [None]:
d_options = {option.text.strip(): option.get_attribute("value") for option in district_options if option.get_attribute("value").isdigit()}
print(d_options)
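
To see what the `isdigit()` filter in the comprehension is doing, here is a small sketch that runs without a browser. The `(text, value)` pairs are made-up stand-ins for the `<option>` elements, not values taken from the site, and the sketch assumes the drop down starts with a placeholder entry whose value is not a number.

```python
# Hypothetical (text, value) pairs standing in for the <option> elements.
# These are illustrative only, not the real values from the page.
fake_options = [
    ("--Select--", "Select"),   # placeholder entry with a non-numeric value
    (" Bankura ", "101"),       # padded text with a numeric value
    ("Birbhum", "102"),
]

# Same pattern as the cell above: strip the visible text, keep numeric values only.
lookup = {text.strip(): value for text, value in fake_options if value.isdigit()}
print(lookup)  # {'Bankura': '101', 'Birbhum': '102'}
```

The placeholder entry is dropped because its value fails `isdigit()`, which is why the real dictionary only contains selectable districts.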

Now we can select a district by using its name and our dictionary. First we'll wrap the drop down element in Selenium's `Select` class, and then we'll use it to select "Bankura" by its value.

In [None]:
district_select = Select(district)
district_select.select_by_value(d_options["Bankura"])

### Panchayat Samity Name

We can do the same as we did above to find the different blocks.

In [None]:
# find the "block" drop down
block = driver.find_element_by_name("ddlblock")

In [None]:
# get options
block_options = block.find_elements_by_tag_name("option")

print(block_options[1].get_attribute("value"))
print(block_options[1].text)

In [None]:
b_options = {option.text.strip(): option.get_attribute("value") for option in block_options if option.get_attribute("value").isdigit()}
print(b_options)

In [None]:
block_select = Select(block)
block_select.select_by_value(b_options["BANKURA-I"])

### Gram Panchayat Name

Let's do it again for the third drop down menu.

In [None]:
# get options
gp = driver.find_element_by_name("ddlgp")
gp_options = gp.find_elements_by_tag_name("option")

print(gp_options[1].get_attribute("value"))
print(gp_options[1].text)

In [None]:
gp_options = {option.text.strip(): option.get_attribute("value") for option in gp_options if option.get_attribute("value").isdigit()}
print(gp_options)

In [None]:
gp_select = Select(gp)
gp_select.select_by_value(gp_options["ANCHURI"])

### Save data from the generated table

Our selections brought us to a table. Now let's get the underlying html. First we'll identify it by its CSS selector, and then use the `get_attribute` method.

In [None]:
# get the html for the table
table = driver.find_element_by_css_selector("#DataGrid1").get_attribute('innerHTML')

To parse the html, we'll use BeautifulSoup.

In [None]:
# soup-ify
table = BeautifulSoup(table, 'lxml')

In [None]:
table

First we'll get all the rows of the table using the `tr` selector.

In [None]:
# get list of rows
rows = table.select("tr")

The first row is the header, so we'll drop it.

In [None]:
print(rows[0])
print()
print(rows[1])

rows = rows[1:]

Each cell in the row corresponds to the data we want.

In [None]:
rows[0].select('td')

Now it's just a matter of looping through the rows and getting the information we want from each one.

In [None]:
# build one dictionary per table row
data = []
for row in rows:
    cells = row.select('td')  # look up the cells once per row
    dic = {}
    dic['seat'] = cells[0].text
    dic['electors'] = cells[1].text
    dic['polled'] = cells[2].text
    dic['rejected'] = cells[3].text
    dic['osn'] = cells[4].text
    dic['candidate'] = cells[5].text
    dic['party'] = cells[6].text
    dic['secured'] = cells[7].text
    data.append(dic)
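
The repeated indexing above can also be written with a list of column names and `zip`. The following is a minimal sketch on plain lists of strings, so it runs without a browser. `COLUMNS`, `cells_to_dict`, and the sample row are hypothetical names and made-up values; with the real table, the cell texts would come from something like `[td.text for td in row.select('td')]`.

```python
# Column names in the same order as the table cells used above.
COLUMNS = ["seat", "electors", "polled", "rejected",
           "osn", "candidate", "party", "secured"]


def cells_to_dict(cell_texts):
    """Pair each cell's text with its column name (hypothetical helper)."""
    return dict(zip(COLUMNS, cell_texts))


# Made-up row for illustration only.
sample = ["SEAT-1", "1200", "950", "12", "1", "A. Candidate", "XYZ", "480"]
record = cells_to_dict(sample)
print(record["candidate"], record["secured"])
```

One trade-off to keep in mind: `zip` silently stops at the shorter input, so a row with fewer than eight cells would produce a shorter dictionary, whereas explicit indexing would raise an `IndexError`.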

Let's clean up the text a little bit.

In [None]:
# strip whitespace
for dic in data:
    for key in dic:
        dic[key] = dic[key].strip()

In [None]:
# an empty string is falsy, so `not` gives True for a row whose seat cell is blank
not data[0]['seat']

You'll notice that some of the information, such as total electors, is not supplied for each candidate. It only appears on the first row for each seat. This code copies that information down to the candidates who don't have it.

In [None]:
# fill out info: carry the most recent seat-level values down to blank rows
for dic in data:
    if dic['seat']:
        # first row for a seat: remember its values
        seat = dic['seat']
        electors = dic['electors']
        polled = dic['polled']
        rejected = dic['rejected']
    else:
        # later row for the same seat: reuse the remembered values
        dic['seat'] = seat
        dic['electors'] = electors
        dic['polled'] = polled
        dic['rejected'] = rejected

In [None]:
data
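
The fill-down step can also be packaged as a small reusable function. This is a sketch rather than the workshop's method: `fill_down` and the toy records are hypothetical, and unlike the loop above it tracks each key independently instead of keying everything on `seat`, and it returns new dictionaries rather than editing `data` in place.

```python
def fill_down(records, keys):
    """Copy the last non-empty value of each key into rows where it is blank.

    Hypothetical helper: returns new dictionaries and leaves the input as is.
    """
    last_seen = {}
    filled = []
    for rec in records:
        new_rec = dict(rec)
        for key in keys:
            if new_rec[key]:
                last_seen[key] = new_rec[key]   # remember the latest value
            elif key in last_seen:
                new_rec[key] = last_seen[key]   # fill the blank from above
        filled.append(new_rec)
    return filled


# Made-up records for illustration only.
toy = [
    {"seat": "S1", "electors": "100", "candidate": "A"},
    {"seat": "", "electors": "", "candidate": "B"},
    {"seat": "S2", "electors": "80", "candidate": "C"},
]
result = fill_down(toy, ["seat", "electors"])
print(result[1])  # {'seat': 'S1', 'electors': '100', 'candidate': 'B'}
```

With the scraped table, the call would presumably look like `fill_down(data, ['seat', 'electors', 'polled', 'rejected'])`.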

# Selenium demo

A quick and dirty Selenium demo, with lots of useful helper functions.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support import expected_conditions as EC

def click_xpath(xpath):
    
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, xpath)))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, xpath)))
    driver.find_element_by_xpath(xpath).click()
    
def click_name(name):
    
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.NAME, name)))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.NAME, name)))
    driver.find_element_by_name(name).click()
    
def click_id(id_):
    
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, id_)))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, id_)))
    driver.find_element_by_id(id_).click()
    
def click_class(class_):
    
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, class_)))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, class_)))
    driver.find_element_by_class_name(class_).click()
    
def type_xpath(xpath, text):
    
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, xpath)))
    driver.find_element_by_xpath(xpath).click()
    driver.find_element_by_xpath(xpath).clear()  # clear any existing text before typing
    driver.find_element_by_xpath(xpath).send_keys(text)
    
def type_name(name, text):
    
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.NAME, name)))
    driver.find_element_by_name(name).click()
    driver.find_element_by_name(name).clear()
    driver.find_element_by_name(name).send_keys(text)

Let's go to Google to start with an easy website.

In [None]:
url = "https://www.google.com"

driver = webdriver.Firefox()

In [None]:
driver.get(url)

Let's try something simple like clicking on the text box. We can use inspect element to find an identifying attribute, and then use the `selenium` methods to click on it.

It looks like `name="q"`, so we'll do this:

In [None]:
text_box = driver.find_element_by_name("q")
text_box.click()

We could also use this to enter some text.

In [None]:
text_box.send_keys("golden state warriors")

Now let's enter our search.

In [None]:
text_box.send_keys(Keys.RETURN)

Suppose we wanted to gather all the links that show up on the first page. We could use BeautifulSoup to do this easily from the HTML source of the page.

In [None]:
from bs4 import BeautifulSoup

source = driver.page_source
soup = BeautifulSoup(source, 'lxml')
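
With the `soup` object, the links could then be pulled out with something like `[a['href'] for a in soup.find_all('a', href=True)]`. To show the same idea in a form that runs without a browser or any third-party package, here is a minimal sketch using the standard library's `html.parser` instead of BeautifulSoup. `LinkCollector` and `sample_html` are hypothetical; the HTML is a made-up stand-in for `driver.page_source`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Made-up HTML standing in for driver.page_source.
sample_html = """
<html><body>
  <a href="https://example.com/one">One</a>
  <a name="no-href">Anchor without a link</a>
  <a href="/relative/two">Two</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)  # ['https://example.com/one', '/relative/two']
```

Anchors without an `href` are skipped, and relative links are returned as they appear in the page, so they may need to be joined with the page URL before use.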