### Accessing Data: Some Preliminary Considerations

Whenever you're trying to get information from the web, it's very important to first know whether you're accessing it through appropriate means.

The UC Berkeley library has some excellent resources on this topic. Here is a flowchart that can help guide your course of action.

![](figures/scraping_flowchart.png)

You can see the library's licensed sources [here](http://guides.lib.berkeley.edu/text-mining).

# Webscraping with Beautiful Soup
*****


## Intro

In this tutorial, we'll be scraping information on the state senators of Illinois, available [here](http://www.ilga.gov/senate), as well as the list of bills each senator has sponsored (e.g., [here](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True)).

## The Tools

1. [Requests](http://docs.python-requests.org/en/latest/user/quickstart/)
2. [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

If you haven't already installed Beautiful Soup, go to the command line and enter this command:

`pip install beautifulsoup4`

In [None]:
# import required modules
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time
import re
import sys

# Part 1: Using Beautiful Soup
*****

## 1.1 Make a GET Request and Read in HTML

We can use the `requests` library to:
1. make a GET request to the page
2. read in the html of the page

This should be somewhat familiar from when we used it with APIs. Now we're making a request directly to the website, and we're going to have to parse the HTML, instead of something more straightforward like JSON or XML.
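
The request step can also be wrapped in a small helper. The sketch below is one possible pattern rather than part of the tutorial's own code: `fetch_html` is a hypothetical name, and the 10-second timeout is an arbitrary assumption. It calls `raise_for_status()` so that a failed request (a 404, say) raises an error instead of quietly handing an error page to the parser.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text at `url` (hypothetical helper, not from the tutorial)."""
    # timeout (in seconds) keeps the call from hanging forever on a slow server
    response = requests.get(url, timeout=timeout)
    # raises requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text


# Example usage (requires network access):
# src = fetch_html('http://www.ilga.gov/senate/default.asp')
```
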

In [None]:
# make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
# read the content of the server’s response
src = req.text

## 1.2 Soup it

Now we use the `BeautifulSoup` constructor to parse the response into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.
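
To see the parsing step in isolation, here is a minimal sketch that runs offline. The HTML string is made up for illustration, and it assumes Python's built-in `html.parser` rather than `lxml`, so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page standing in for a real server response
html = """
<html>
  <head><title>Example Senate Page</title></head>
  <body>
    <a class="mainmenu" href="/house">House</a>
    <a class="sidemenu" href="/senate/default.asp">Members</a>
  </body>
</html>
"""

# Parse the string into a tree of tags
soup = BeautifulSoup(html, "html.parser")

# The soup object lets us walk the tree, e.g. down to the <title> tag
title_text = soup.title.text
print(type(soup).__name__)
print(title_text)
```
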

In [None]:
# parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml')
# take a look
print(soup.prettify()[:1000])

## 1.3 Find Elements

BeautifulSoup has a number of functions to find things on a page. Like other webscraping tools, Beautiful Soup lets you find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors


Let's search first for **HTML tags**. 

The method `find_all` searches the `soup` tree for all the elements with a particular HTML tag and returns them.

What does the example below do?

In [None]:
# find all elements with a certain tag

soup.find_all("a")

**NB**: Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. 

These two lines of code are equivalent:

In [None]:
soup.find_all("a")
soup("a")

That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get a lot of stuff, much of which you don't want. Remember, the `a` tag defines a hyperlink, so there'll often be a lot of those on a page.

What if we wanted to search for HTML tags ONLY with certain attributes, like particular CSS classes? 

We can do this by adding an additional argument to `find_all`.

In the example below, we are finding all the `a` tags, and then filtering those with `class = "sidemenu"`.

In [None]:
# Get only the 'a' tags in 'sidemenu' class
soup("a", class_="sidemenu")

Often a more efficient way to find things on a website is by **CSS selector**. For this we use a different method, `select()`. Pass it a CSS selector as a string, and it returns all the elements that match that selector.

In the example above, we can use "a.sidemenu" as a CSS selector, which returns all `a` tags with class `sidemenu`.
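
The three search styles can be compared side by side on a small, made-up snippet (the HTML below is an assumption for illustration, parsed with the built-in `html.parser`). Searching by tag alone returns every link, while the keyword argument and the CSS selector should both narrow the result to the same `sidemenu` links.

```python
from bs4 import BeautifulSoup

# Made-up HTML: three links and one non-link element sharing a class
html = """
<a class="sidemenu" href="/senate/default.asp">Members</a>
<a class="sidemenu" href="/senate/committees">Committees</a>
<a class="mainmenu" href="/house">House</a>
<p class="sidemenu">Not a link</p>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. by tag: every <a> element, whatever its class
all_links = soup.find_all("a")

# 2. by tag plus attribute: only <a> elements with class "sidemenu"
by_keyword = soup.find_all("a", class_="sidemenu")

# 3. by CSS selector: the same filter written as a selector
by_selector = soup.select("a.sidemenu")

hrefs_keyword = [a["href"] for a in by_keyword]
hrefs_selector = [a["href"] for a in by_selector]
print(len(all_links))
print(hrefs_keyword == hrefs_selector)
```
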

In [None]:
# get elements with "a.sidemenu" CSS Selector.
soup.select("a.sidemenu")

## Challenge 1

Find all the `<a>` elements in class `mainmenu`

In [None]:
# YOUR CODE HERE


## 1.4 Get Attributes and Text of Elements

Once we identify elements, we want to access the information in those elements. Often this means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:

In [None]:
# this is a list
soup.select("a.sidemenu")

# we first want to get an individual tag object
first_link = soup.select("a.sidemenu")[0]

# check out its class
type(first_link)

It's a tag! Which means it has a `text` member:

In [None]:
print(first_link.text)

Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.

You can access a tag’s attributes by treating the tag like a dictionary:
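
A short offline sketch of both kinds of access, using a made-up pair of links. It also shows `Tag.get()`, which behaves like a dictionary's `get()` and returns `None` when an attribute is missing, whereas square brackets would raise a `KeyError`.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one ordinary link and one anchor with no href attribute
html = '<a class="sidemenu" href="/senate/default.asp"> Members </a><a name="top">Top</a>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

# Text inside the tag, with surrounding whitespace removed
link_text = first.text.strip()

# Attribute access works like a dictionary lookup
link_href = first["href"]

# "class" can hold several values, so Beautiful Soup returns it as a list
link_classes = first["class"]

# .get() avoids a KeyError when the attribute is absent
missing_href = second.get("href")

print(link_text, link_href, link_classes, missing_href)
```
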

In [None]:
print(first_link['href'])

## Challenge 2

Find all the `href` attributes (url) from the mainmenu.

In [None]:
# YOUR CODE HERE

# Part 2
****

Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of BeautifulSoup and Python.

Let's apply these skills to scrape http://www.ilga.gov/senate/default.asp?GA=98

**NB: we're just going to scrape the 98th general assembly.**

Our goal is to scrape information on each senator, including their:

- name
- district
- party

## 2.1 First, make the get request and soup it.

In [None]:
# import required modules from previous session
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time
import re
import sys

In [None]:
# make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# read the content of the server’s response
src = req.text
# soup it
soup = BeautifulSoup(src, "lxml")

## 2.2 Find the right elements and text.

Now let's try to get a list of rows in the table of senators. Remember that rows are identified by the `tr` tag.

In [None]:
# get all tr elements
rows = soup.find_all("tr")
len(rows)

But remember, `find_all` gets *all* the elements with the `tr` tag. We only want some of them. If we use the 'Inspect' function in Google Chrome and look carefully, then we can use some CSS selectors to get just the rows we're interested in.

In [None]:
# returns every element matching the 'tr tr tr' CSS selector
rows = soup.select('tr tr tr')

for r in rows[:5]:
    print(r)
    print()

Looks like we want everything after the first two rows. Let's work with a single row to start, and then from that we'll build our loop.

In [None]:
print(rows[2].prettify())

Let's break this row down into its component cells/columns using the `select` method with CSS selectors. Looking closely at the HTML, there are a couple of ways we could do this.

* We could identify the cells by their tag `td`.

* We could use the class name `.detail`.

* We could combine both and use the selector `td.detail`.

In [None]:
for cell in rows[2].select('td'):
    print(cell)
print()

for cell in rows[2].select('.detail'):
    print(cell)
print()

for cell in rows[2].select('td.detail'):
    print(cell)
print()

We can confirm that these are all the same.

In [None]:
assert rows[2].select('td') == rows[2].select('.detail') == rows[2].select('td.detail')

Let's go with `td.detail` to be as specific as possible.
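
The three selectors happen to agree on this page, but they need not agree in general. The sketch below uses a made-up row (an assumption for illustration, not the real ILGA markup) containing an unstyled `td` and a `th` with class `detail`, to show how `td.detail` is the narrowest of the three.

```python
from bs4 import BeautifulSoup

# Made-up row: two detail cells, one plain cell, one header cell with class "detail"
html = """
<table>
  <tr>
    <td class="detail">Example Senator</td>
    <td class="detail">D</td>
    <td>unstyled cell</td>
    <th class="detail">header cell</th>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
row = soup.select("tr")[0]

n_td = len(row.select("td"))                # any <td>, styled or not
n_detail = len(row.select(".detail"))       # any tag with class "detail"
n_td_detail = len(row.select("td.detail"))  # only <td> tags with class "detail"

print(n_td, n_detail, n_td_detail)
```
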

In [None]:
# select only those 'td' tags with class 'detail'
row = rows[2] 
detailCells = row.select('td.detail')
detailCells

Most of the time, we're interested in the actual **text** of a website, not its tags. Remember, to get the text of an HTML element, use the `text` member.

In [None]:
# Keep only the text in each of those cells
rowData = [cell.text for cell in detailCells]

print(rowData)

Looks good! Now we just use our basic Python knowledge to get the elements of this list that we want. Remember, we want the senator's name, their district, and their party.

In [None]:
# check em out
print(rowData[0]) # Name
print(rowData[3]) # district
print(rowData[4]) # party

## 2.3 Getting rid of junk rows

We saw at the beginning that not all of the rows we got actually correspond to a senator.

In [None]:
# bad rows
print(rows[0])
print()
print(rows[-1])

When we write our for loop, we only want it to apply to the relevant rows. So we'll need to filter out the irrelevant rows. The way to do this is to compare some of these to the rows we do want, see how they differ, and then formulate that in a conditional.

As you can imagine, there are a lot of possible ways to do this, and it'll depend on the website. We'll show some here to give you an idea of how to do this.

In [None]:
# bad row
print(len(rows[0]))
print(len(rows[1]))

# good row
print(len(rows[2]))
print(len(rows[3]))

# maybe this will work?
good_rows = [r for r in rows if len(r) == 5]

# doesn't look like it
print(good_rows[-1])

In [None]:
# bad row
print(rows[-1].select('td.detail'))
print()

# good row
print(rows[5].select('td.detail'))
print()

# how about this?
good_rows = [r for r in rows if r.select('td.detail')]

print("Checking rows...\n")
print(good_rows[0])
print()
print(good_rows[-1])

Looks like we found something that worked!
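
The filter works because `select()` returns an empty list when nothing matches, and an empty list counts as false in a Python conditional. Here is a minimal offline sketch of the same idea on a made-up table (the names and layout are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up table: a header row, two data rows, and a trailing junk row
html = """
<table>
  <tr><th>Senator</th><th>District</th><th>Party</th></tr>
  <tr><td class="detail">Example Senator A</td><td class="detail">1</td><td class="detail">D</td></tr>
  <tr><td class="detail">Example Senator B</td><td class="detail">2</td><td class="detail">R</td></tr>
  <tr><td>Back to top</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("tr")

# Keep a row only if it contains at least one td.detail cell;
# rows without one produce an empty (falsy) list and are dropped
good_rows = [r for r in rows if r.select("td.detail")]

names = [r.select("td.detail")[0].text for r in good_rows]
print(len(rows), len(good_rows))
print(names)
```
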

## 2.4 Loop it all together

Now that we've seen how to get the data we want from one row, as well as filter out the rows we don't want, let's put it all together into a loop.

In [None]:
# make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')

# read the content of the server’s response
src = req.text

# soup it
soup = BeautifulSoup(src, "lxml")

# Create empty list to store our data
members = []

# returns every element matching the 'tr tr tr' CSS selector
rows = soup.select('tr tr tr')

# get rid of junk rows
rows = [r for r in rows if r.select('td.detail')]

# loop through all rows
for row in rows:
    # select only those 'td' tags with class 'detail'
    detailCells = row.select('td.detail')
        
    # Keep only the text in each of those cells
    rowData = [cell.text for cell in detailCells]
    
    # Collect information
    name = rowData[0]
    district = int(rowData[3])
    party = rowData[4]
    
    # Store in a tuple
    tup = (name,district,party)
    
    # Append to list
    members.append(tup)

In [None]:
# should be 61
len(members)

Let's take a look at what we have in `members`.

In [None]:
print(members[:5])

## Challenge 3: Get HREF element pointing to members' bills. 

The code above retrieves information on:

- the senator's name
- their district number
- their party

We now want to retrieve the URL for each senator's list of bills. The format for the list of bills for a given senator is:

http://www.ilga.gov/senate/SenatorBills.asp + ? + GA=98 + &MemberID=**_memberID_** + &Primary=True

to get something like:

http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True

Unfortunately, _memberID_ is not something our scraping code currently pulls out.

Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.

Tips: 

* To do this, you will want to get the appropriate anchor element (`<a>`) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.
* The anchor elements' HTML will look like `<a href="/senate/Senator.asp/...">Bills</a>`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeautifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. (See the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag">documentation</a> for more details).
* There are a _lot_ of different ways to use BeautifulSoup to get things done; whatever you need to do to pull that HREF out is fine.
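
One way to turn a relative link into a full URL is `urljoin` from the standard library. This is only a sketch of that one step, not a solution to the challenge: the `relative_href` value below is an assumption modeled on the example bills URL shown above, and in your own code it would come from the anchor's `href` attribute.

```python
from urllib.parse import urljoin, urlparse, parse_qs

# The page we scraped, used as the base for resolving relative links
page_url = "http://www.ilga.gov/senate/default.asp?GA=98"

# Assumed shape of a relative href (modeled on the example bills URL in the text)
relative_href = "/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True"

# Combine the base URL and the relative link into an absolute URL
full_path = urljoin(page_url, relative_href)
print(full_path)

# If needed, the member ID can be read back out of the query string
query = parse_qs(urlparse(full_path).query)
member_id = query["MemberID"][0]
print(member_id)
```
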

I've started out the code for you. Fill it in where it says `#YOUR CODE HERE` (save the path into an object called `full_path`).

In [None]:
# make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')

# read the content of the server’s response
src = req.text

# soup it
soup = BeautifulSoup(src, "lxml")

# Create empty list to store our data
members = []

# returns every element matching the 'tr tr tr' CSS selector
rows = soup.select('tr tr tr')

# get rid of junk rows
rows = [r for r in rows if r.select('td.detail')]

# loop through all rows
for row in rows:
    # select only those 'td' tags with class 'detail'
    detailCells = row.select('td.detail')
        
    # Keep only the text in each of those cells
    rowData = [cell.text for cell in detailCells]
    
    # Collect information
    name = rowData[0]
    district = int(rowData[3])
    party = rowData[4]
    
    # YOUR CODE HERE.
    
    
    
    
    # Store in a tuple
    tup = (name, district, party, full_path)
    
    # Append to list
    members.append(tup)

In [None]:
# Uncomment to test 

# members[:5]

## Challenge 4: Make a function

Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator. 

In [None]:
# YOUR FUNCTION HERE

In [None]:
# Uncomment to test your code!

# senateMembers = get_members('http://www.ilga.gov/senate/default.asp?GA=98')
# len(senateMembers)

# Part 3: Scrape Bills
****

## 3.1 Writing a Scraper Function

Now we want to scrape the webpages corresponding to the bills sponsored by each senator.

Write a function called `get_bills(url)` to parse a given Bills URL. This will involve:

  - requesting the URL using the <a href="http://docs.python-requests.org/en/latest/">`requests`</a> library
  - using the features of the `BeautifulSoup` library to find all of the `<td>` elements with the class `billlist`
  - returning a _list_ of tuples, each with:
      - bill ID (1st column)
      - description (2nd column)
      - chamber (S or H) (3rd column)
      - the last action (4th column)
      - the last action date (5th column)
      
I've started the function for you. Fill in the rest.

In [None]:
# COMPLETE THIS FUNCTION
def get_bills(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src, "lxml")  # name the parser explicitly, as in the earlier cells
    rows = soup.select('tr')
    bills = []
    for row in rows:
        
        # YOUR CODE HERE
               
        tup = (bill_id, description, chamber, last_action, last_action_date)
        bills.append(tup)
    return bills

In [None]:
# uncomment to test your code:
# test_url = senateMembers[0][3]
# get_bills(test_url)[0:5]

## 3.2 Get all the bills

Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) emanating from that district. You can do this by looping over all of the senate members in `senateMembers` and calling `get_bills()` for each of their associated bill URLs.

NOTE: please call the function `time.sleep(1)` for each iteration of the loop, so that we don't destroy the state's web site.
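
The general shape of a rate-limited loop can be sketched offline with stand-in data. Everything named below (`collect_by_district`, `fake_fetch`, the example tuples and URLs) is hypothetical and exists only to illustrate the pattern; the short delay is there so the demo finishes quickly, whereas a real scrape would pause for a full second as requested above.

```python
import time


def collect_by_district(members, fetch, delay=1.0):
    """Map each district to fetch(url), pausing `delay` seconds after each call."""
    results = {}
    for name, district, party, url in members:
        results[district] = fetch(url)
        time.sleep(delay)  # be polite: wait before making the next request
    return results


# Stand-in data in the same (name, district, party, url) layout used above
example_members = [
    ("Example Senator A", 1, "D", "http://example.com/bills?member=1"),
    ("Example Senator B", 2, "R", "http://example.com/bills?member=2"),
]


def fake_fetch(url):
    """Pretend scraper so the sketch runs without network access."""
    return ["bill list for " + url]


demo = collect_by_district(example_members, fake_fetch, delay=0.01)
print(demo[2])
```
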

In [None]:
# YOUR CODE HERE

In [None]:
# Uncomment to test
# bills_dict[52]