# Web Scraping with Beautiful Soup

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Reflection: To Scape Or Not To Scrape](#when)
2. [Extracting and Parsing HTML](#extract)
3. [Scraping the Illinois General Assembly](#scrape)

<a id='when'></a>

# To Scrape Or Not To Scrape

When we'd like to access data from the web, we first have to make sure if the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. **Check out D-Lab's [Python Web APIs](https://github.com/dlab-berkeley/Python-Web-APIs) workshop if you want to learn how to use APIs.**

However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page, and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus two packages: Requests and Beautiful Soup.

Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) each senator has sponsored. Before we get started, peruse these websites to take a look at their structure.

## Installation

We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:

In [5]:
%pip install requests

Note: you may need to restart the kernel to use updated packages.


In [6]:
%pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


We'll also install the `lxml` package, which helps support some of the parsing that Beautiful Soup performs:

In [7]:
%pip install lxml

Note: you may need to restart the kernel to use updated packages.


In [8]:
# Import required libraries
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time

<a id='extract'></a>

# Extracting and Parsing HTML 

In order to succesfully scrape and analyse HTML, we'll be going through the following 4 steps:
1. Make a GET request
2. Parse the page with Beautiful Soup
3. Search for HTML elements
4. Get attributes and text of these elements

## Step 1: Make a GET Request to Obtain a Page's HTML

We can use the Requests library to:

1. Make a GET request to the page, and
2. Read in the webpage's HTML code.

The process of making a request and obtaining a result resembles that of the Web API workflow. Now, however, we're making a request directly to the website, and we're going to have to parse the HTML ourselves. This is in contrast to being provided data organized into a more straightforward `JSON` or `XML` output.

In [9]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
# Read the content of the server’s response
src = req.text
# View some output
print(src[:1000])

<html lang="en"> 
<!-- Trigger/Open The Modal -->
<div style="position: fixed; z-index: 999; top: 5; left: 600; background-color: navy; display: block">
<button id="myBtn" style="color: white; background-color: navy; display: block">Translate Website</button></div>
<!-- The Modal -->
<div id="myModal" class="modal" style="display: none">
  <!-- Modal content -->
  <div class="modal-content">
      <div class="modal-header"><h3>
    <span class="close">&times;</span></h3></div>    
    <p>The Illinois General Assembly offers the Google Translate service for visitor convenience. In no way should it be considered accurate as to the translation of any content herein.</p>
    <p>Visitors of the Illinois General Assembly website are encouraged to use other translation services available on the internet.</p>
    <p>The English language version is always the official and authoritative version of this website.</p>
    <p>NOTE: To return to the original English language version, se


## Step 2: Parse the Page with Beautiful Soup

Now, we use the `BeautifulSoup` function to parse the reponse into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.

If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools.

In [10]:
# Parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml')
# Take a look
print(soup.prettify()[:1000])

<html lang="en">
 <!-- Trigger/Open The Modal -->
 <body>
  <div style="position: fixed; z-index: 999; top: 5; left: 600; background-color: navy; display: block">
   <button id="myBtn" style="color: white; background-color: navy; display: block">
    Translate Website
   </button>
  </div>
  <!-- The Modal -->
  <div class="modal" id="myModal" style="display: none">
   <!-- Modal content -->
   <div class="modal-content">
    <div class="modal-header">
     <h3>
      <span class="close">
       ×
      </span>
     </h3>
    </div>
    <p>
     The Illinois General Assembly offers the Google Translate service for visitor convenience. In no way should it be considered accurate as to the translation of any content herein.
    </p>
    <p>
     Visitors of the Illinois General Assembly website are encouraged to use other translation services available on the internet.
    </p>
    <p>
     The English language version is always the official and authoritative version of this website.
   

The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page.

## Step 3: Search for HTML Elements

Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors

Let's search first for **HTML tags**. 

The function `find_all` searches the `soup` tree to find all the elements with an a particular HTML tag, and returns all of those elements.

What does the example below do?

In [11]:
# Find all elements with a certain tag
a_tags = soup.find_all("a")
print(a_tags[:10])

[<a class="goog-logo-link" href="https://translate.google.com" target="_blank"><img alt="Google Translate" height="14" src="https://www.gstatic.com/images/branding/googlelogo/1x/googlelogo_color_42x16dp.png" style="padding-right: 3px;" width="37"/>Translate</a>, <a href="/default.asp"><img alt="Illinois General Assembly" border="0" height="49" src="/images/logo_sm.gif" width="462"/></a>, <a class="mainmenu" href="/">Home</a>, <a class="mainmenu" href="/legislation/" onblur="HM_f_PopDown('elMenu1')" onfocus="HM_f_PopUp('elMenu1',event)" onmouseout="HM_f_PopDown('elMenu1')" onmouseover="HM_f_PopUp('elMenu1',event)">Legislation &amp; Laws</a>, <a class="mainmenu" href="/senate/" onblur="HM_f_PopDown('elMenu3')" onfocus="HM_f_PopUp('elMenu3',event)" onmouseout="HM_f_PopDown('elMenu3')" onmouseover="HM_f_PopUp('elMenu3',event)">Senate</a>, <a class="mainmenu" href="/house/" onblur="HM_f_PopDown('elMenu2')" onfocus="HM_f_PopUp('elMenu2',event)" onmouseout="HM_f_PopDown('elMenu2')" onmouseove

Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. 

These two lines of code are equivalent:

In [12]:
a_tags = soup.find_all("a")
a_tags_alt = soup("a")
print(a_tags[0])
print(a_tags_alt[0])

<a class="goog-logo-link" href="https://translate.google.com" target="_blank"><img alt="Google Translate" height="14" src="https://www.gstatic.com/images/branding/googlelogo/1x/googlelogo_color_42x16dp.png" style="padding-right: 3px;" width="37"/>Translate</a>
<a class="goog-logo-link" href="https://translate.google.com" target="_blank"><img alt="Google Translate" height="14" src="https://www.gstatic.com/images/branding/googlelogo/1x/googlelogo_color_42x16dp.png" style="padding-right: 3px;" width="37"/>Translate</a>


How many links did we obtain?

In [13]:
print(len(a_tags))

213


That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get more hits, many of which you might not want. Remember, the `a` tag defines a hyperlink, so you'll usually find many on any given page.

What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes? 

We can do this by adding an additional argument to the `find_all`. In the example below, we are finding all the `a` tags, and then filtering those with `class_="sidemenu"`.

In [14]:
# Get only the 'a' tags in 'sidemenu' class
side_menus = soup("a", class_="sidemenu")
side_menus[:5]

[<a class="sidemenu" href="/senate/default.asp">  Members  </a>,
 <a class="sidemenu" href="/senate/committees/default.asp">  Committees  </a>,
 <a class="sidemenu" href="/senate/schedules/default.asp">  Schedules  </a>,
 <a class="sidemenu" href="/senate/journals/default.asp">  Journals  </a>,
 <a class="sidemenu" href="/senate/transcripts/default.asp">  Transcripts  </a>]

A more efficient way to search for elements on a website is via a **CSS selector**. For this we have to use a different method called `select()`. Just pass a string into the `.select()` to get all elements with that string as a valid CSS selector.

In the example above, we can use `"a.sidemenu"` as a CSS selector, which returns all `a` tags with class `sidemenu`.

In [15]:
# Get elements with "a.sidemenu" CSS Selector.
selected = soup.select("a.sidemenu")
selected[:5]

[<a class="sidemenu" href="/senate/default.asp">  Members  </a>,
 <a class="sidemenu" href="/senate/committees/default.asp">  Committees  </a>,
 <a class="sidemenu" href="/senate/schedules/default.asp">  Schedules  </a>,
 <a class="sidemenu" href="/senate/journals/default.asp">  Journals  </a>,
 <a class="sidemenu" href="/senate/transcripts/default.asp">  Transcripts  </a>]

## 🥊 Challenge: Find All

Use BeautifulSoup to find all the `a` elements with class `mainmenu`.

In [16]:
# YOUR CODE HERE
main_menus = soup.select("a.mainmenu")
main_menus[:5]


[<a class="mainmenu" href="/">Home</a>,
 <a class="mainmenu" href="/legislation/" onblur="HM_f_PopDown('elMenu1')" onfocus="HM_f_PopUp('elMenu1',event)" onmouseout="HM_f_PopDown('elMenu1')" onmouseover="HM_f_PopUp('elMenu1',event)">Legislation &amp; Laws</a>,
 <a class="mainmenu" href="/senate/" onblur="HM_f_PopDown('elMenu3')" onfocus="HM_f_PopUp('elMenu3',event)" onmouseout="HM_f_PopDown('elMenu3')" onmouseover="HM_f_PopUp('elMenu3',event)">Senate</a>,
 <a class="mainmenu" href="/house/" onblur="HM_f_PopDown('elMenu2')" onfocus="HM_f_PopUp('elMenu2',event)" onmouseout="HM_f_PopDown('elMenu2')" onmouseover="HM_f_PopUp('elMenu2',event)">House</a>,
 <a class="mainmenu" href="/mylegislation/" onblur="HM_f_PopDown('elMenu4')" onfocus="HM_f_PopUp('elMenu4',event)" onmouseout="HM_f_PopDown('elMenu4')" onmouseover="HM_f_PopUp('elMenu4',event)">My Legislation</a>]

## Step 4: Get Attributes and Text of Elements

Once we identify elements, we want the access information in that element. Usually, this means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:

In [17]:
# Get all sidemenu links as a list
side_menu_links = soup.select("a.sidemenu")

# Examine the first link
first_link = side_menu_links[0]
print(first_link)

# What class is this variable?
print('Class: ', type(first_link))

<a class="sidemenu" href="/senate/default.asp">  Members  </a>
Class:  <class 'bs4.element.Tag'>


It's a Beautiful Soup tag! This means it has a `text` member:

In [18]:
print(first_link.text)

  Members  


Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.

💡 **Tip**: You can access a tag’s attributes by treating the tag like a dictionary:

In [19]:
print(first_link['href'])

/senate/default.asp


## 🥊 Challenge: Extract specific attributes

Extract all `href` attributes for each `mainmenu` URL.

In [20]:
# YOUR CODE HERE
for link in main_menus:
    print(link['href'])

for i in range(2):
    print(main_menus[i]['href'])

/
/legislation/
/senate/
/house/
/mylegislation/
/sitemap.asp
/
/legislation/


<a id='scrape'></a>

# Scraping the Illinois General Assembly

Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of Beautiful Soup and Python.

Let's apply these skills to scrape the [Illinois 98th General Assembly](http://www.ilga.gov/senate/default.asp?GA=98).

Specifically, our goal is to scrape information on each senator, including their name, district, and party.

## Scrape and Soup the Webpage

Let's scrape and parse the webpage, using the tools we learned in the previous section.

In [21]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# Read the content of the server’s response
src = req.text
# Soup it
soup = BeautifulSoup(src, "lxml")

## Search for the Table Elements

Our goal is to obtain the elements in the table on the webpage. Remember: rows are identified by the `tr` tag. Let's use `find_all` to obtain these elements.

In [22]:
# Get all table row elements
rows = soup.find_all("tr")
len(rows)

73

⚠️ **Warning**: Keep in mind: `find_all` gets *all* the elements with the `tr` tag. We only want some of them. If we use the 'Inspect' function in Google Chrome and look carefully, then we can use some CSS selectors to get just the rows we're interested in. Specifically, we want the inner rows of the table:

In [23]:
# Returns every ‘tr tr tr’ css selector in the page
rows = soup.select('tr tr tr')

for row in rows[:5]:
    print(row, '\n')

<tr><td colspan="5">
<span class="heading">Illinois State Senators</span>
<span class="italics">  98th  General Assembly</span><br/>
<!-- 3/2/09 temp comment out until fixed for GA specific-->
<!-- add 97th ga currently no info -->
<a href="98GA_Senate_Leadership.pdf">Leadership</a> <a href="98th_Senate_Officers.pdf">Officers</a> <a href="98GA_Senate_Seating_Chart.pdf">Senate Seating Chart</a>  <span class="content"><b>Democrats:</b> 40   <b>Republicans:</b> 19</span><br/>
</td></tr> 

<tr>
<td class="header" width="45%"><a class="filetab" href="javascript:Sort('LastName','',98);" title="Sort by Senator">Senator</a></td>
<td align="center" class="header" width="15%">Bills</td>
<td align="center" class="header" width="10%">Committees</td>
<td align="center" class="header" width="15%"><a class="filetab" href="javascript:Sort('DistrictNumber','',98);" title="Sort by District">District</a></td>
<td align="center" class="header" width="15%"><a class="filetab" href="javascript:Sort('Party','

It looks like we want everything after the first two rows. Let's work with a single row to start, and build our loop from there.

In [24]:
example_row = rows[2]
print(example_row.prettify())

<tr>
 <td bgcolor="white" class="detail" width="40%">
  <a class="notranslate" href="/senate/Senator.asp?GA=98&amp;MemberID=1911">
   Pamela J. Althoff
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  <a href="SenatorBills.asp?GA=98&amp;MemberID=1911">
   Bills
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  <a href="SenCommittees.asp?GA=98&amp;MemberID=1911">
   Committees
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  32
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  R
 </td>
</tr>



Let's break this row down into its component cells/columns using the `select` method with CSS selectors. Looking closely at the HTML, there are a couple of ways we could do this.

* We could identify the cells by their tag `td`.
* We could use the the class name `.detail`.
* We could combine both and use the selector `td.detail`.

In [25]:
for cell in example_row.select('td'):
    print(cell)
print()

for cell in example_row.select('.detail'):
    print(cell)
print()

for cell in example_row.select('td.detail'):
    print(cell)
print()

<td bgcolor="white" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=98&amp;MemberID=1911">Pamela J. Althoff</a></td>
<td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?GA=98&amp;MemberID=1911">Bills</a></td>
<td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?GA=98&amp;MemberID=1911">Committees</a></td>
<td align="center" bgcolor="white" class="detail" width="15%">32</td>
<td align="center" bgcolor="white" class="detail" width="15%">R</td>

<td bgcolor="white" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=98&amp;MemberID=1911">Pamela J. Althoff</a></td>
<td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?GA=98&amp;MemberID=1911">Bills</a></td>
<td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?GA=98&amp;MemberID=1911">Committees</a></td>
<td align="center" bgcolor="white" class

We can confirm that these are all the same.

In [26]:
assert example_row.select('td') == example_row.select('.detail') == example_row.select('td.detail')

Let's use the selector `td.detail` to be as specific as possible.

In [27]:
# Select only those 'td' tags with class 'detail' 
detail_cells = example_row.select('td.detail')
detail_cells

[<td bgcolor="white" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=98&amp;MemberID=1911">Pamela J. Althoff</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?GA=98&amp;MemberID=1911">Bills</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?GA=98&amp;MemberID=1911">Committees</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%">32</td>,
 <td align="center" bgcolor="white" class="detail" width="15%">R</td>]

Most of the time, we're interested in the actual **text** of a website, not its tags. Recall that to get the text of an HTML element, we use the `text` member:

In [28]:
# Keep only the text in each of those cells
row_data = [cell.text for cell in detail_cells]

print(row_data)

['Pamela J. Althoff', 'Bills', 'Committees', '32', 'R']


Looks good! Now we just use our basic Python knowledge to get the elements of this list that we want. Remember, we want the senator's name, their district, and their party.

In [29]:
print(row_data[0]) # Name
print(row_data[3]) # District
print(row_data[4]) # Party

Pamela J. Althoff
32
R


## Desaciéndonos de las filas basura

Vimos que en el principio que no todas las files que se obtienen corresponden a un Senador. Vamos a necesitar hacer un poco de limpieza antes de seguir. Veamos algunos ejemplos:

In [30]:
print('Fila 0:\n', rows[0], '\n')
print('Fila 1:\n', rows[1], '\n')
print('Última fila:\n', rows[-1])

Fila 0:
 <tr><td colspan="5">
<span class="heading">Illinois State Senators</span>
<span class="italics">  98th  General Assembly</span><br/>
<!-- 3/2/09 temp comment out until fixed for GA specific-->
<!-- add 97th ga currently no info -->
<a href="98GA_Senate_Leadership.pdf">Leadership</a> <a href="98th_Senate_Officers.pdf">Officers</a> <a href="98GA_Senate_Seating_Chart.pdf">Senate Seating Chart</a>  <span class="content"><b>Democrats:</b> 40   <b>Republicans:</b> 19</span><br/>
</td></tr> 

Fila 1:
 <tr>
<td class="header" width="45%"><a class="filetab" href="javascript:Sort('LastName','',98);" title="Sort by Senator">Senator</a></td>
<td align="center" class="header" width="15%">Bills</td>
<td align="center" class="header" width="10%">Committees</td>
<td align="center" class="header" width="15%"><a class="filetab" href="javascript:Sort('DistrictNumber','',98);" title="Sort by District">District</a></td>
<td align="center" class="header" width="15%"><a class="filetab" href="javascr

Cuando utilizamos el bucle "for", nosotros solo queremos aplicarlo en las filas más relevantes. Por lo que necesitamos filtrar las filas irrelevantes. La manera de hacer esto es comparando algunas de las filas que queremos, para ver que las hace diferentes, y luego utilizar esa información en un condicional.

Como lo puedes imaginar, hay muchas maneras de hacer esto, y va a depender del sitio web. Vamos a mostrar algunas maneras aquí para darte una idea de como hacerlo.

In [31]:
# Bad rows
print(len(rows[0]))
print(len(rows[1]))

# Good rows
print(len(rows[2]))
print(len(rows[3]))

1
11
5
5


Tal vez las filas buenas tienen una longitud de 5. Revisemos:

In [32]:
good_rows = [row for row in rows if len(row) == 5]

# Let's check some rows
print(good_rows[0], '\n')
print(good_rows[-2], '\n')
print(good_rows[-1])

<tr><td bgcolor="white" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=98&amp;MemberID=1911">Pamela J. Althoff</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?GA=98&amp;MemberID=1911">Bills</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?GA=98&amp;MemberID=1911">Committees</a></td><td align="center" bgcolor="white" class="detail" width="15%">32</td><td align="center" bgcolor="white" class="detail" width="15%">R</td></tr> 

<tr><td bgcolor="white" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=98&amp;MemberID=2035">Patricia Van Pelt</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?GA=98&amp;MemberID=2035">Bills</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?GA=98&amp;MemberID=2035">Committees</a></td><td align="center" bgcolor="white

Encontramos una fila footer en nuestra lista que nos gustaría evitar. Vamos a intentar otra cosa:

In [33]:
rows[2].select('td.detail') 

[<td bgcolor="white" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=98&amp;MemberID=1911">Pamela J. Althoff</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?GA=98&amp;MemberID=1911">Bills</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?GA=98&amp;MemberID=1911">Committees</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%">32</td>,
 <td align="center" bgcolor="white" class="detail" width="15%">R</td>]

In [34]:
# Bad row
print(rows[-1].select('td.detail'), '\n')

# Good row
print(rows[5].select('td.detail'), '\n')

# How about this?
good_rows = [row for row in rows if row.select('td.detail')]

print("Checking rows...\n")
print(good_rows[0], '\n')
print(good_rows[-1])

[] 

[<td bgcolor="EBEBEB" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=98&amp;MemberID=2022">Jennifer Bertino-Tarrant</a></td>, <td align="center" bgcolor="EBEBEB" class="detail" width="15%"><a href="SenatorBills.asp?GA=98&amp;MemberID=2022">Bills</a></td>, <td align="center" bgcolor="EBEBEB" class="detail" width="15%"><a href="SenCommittees.asp?GA=98&amp;MemberID=2022">Committees</a></td>, <td align="center" bgcolor="EBEBEB" class="detail" width="15%">49</td>, <td align="center" bgcolor="EBEBEB" class="detail" width="15%">D</td>] 

Checking rows...

<tr><td bgcolor="white" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=98&amp;MemberID=1911">Pamela J. Althoff</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?GA=98&amp;MemberID=1911">Bills</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?GA=98&amp;MemberID=1911">Committees</a></

Parece que encontramos algo que ha funcionado!

## Consolidarlo todo

Ahora que hemos visto como obtener los datos que queremos de una fila, así como también filtrar las filas que no queremos, vamos a consolidar todo en un bucle.

In [35]:
# Define storage list
members = []

# Get rid of junk rows
valid_rows = [row for row in rows if row.select('td.detail')]

# Loop through all rows
for row in valid_rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail')
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]
    # Store in a tuple
    senator = (name, district, party)
    # Append to list
    members.append(senator)

In [36]:
# Should be 61
len(members)

61

Vamos a ver que tenemos en `members`.

In [37]:
print(members[:5])

[('Pamela J. Althoff', 32, 'R'), ('Jason A. Barickman', 53, 'R'), ('Scott M Bennett', 52, 'D'), ('Jennifer Bertino-Tarrant', 49, 'D'), ('Daniel Biss', 9, 'D')]


## 🥊  Reto: Obtener elementos `href` que apunten a los proyectos de ley de los miembros  

El código anterior recupera información sobre:  

- el nombre del senador,  
- su número de distrito,  
- y su partido.  

Ahora queremos obtener la URL de la lista de proyectos de ley de cada senador. Cada URL seguirá un formato específico.  

El formato para la lista de proyectos de ley de un senador dado es:  

`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`  

para obtener algo como:  

`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`  

donde `MEMBER_ID=1911`.  

Deberías notar que, desafortunadamente, `MEMBER_ID` no se está extrayendo actualmente en nuestro código de scraping.  

Tu tarea inicial es modificar el código anterior para que también **recupere la URL completa que apunta a la página correspondiente de proyectos de ley patrocinados principalmente**, para cada miembro, y la devuelva junto con su nombre, distrito y partido.  

### Consejos:  

* Para hacer esto, querrás obtener el elemento de anclaje (`<a>`) apropiado en la fila de la tabla de cada legislador. Puedes usar nuevamente el método `.select()` en el objeto `row` dentro del bucle para hacer esto, de manera similar al comando que encuentra todas las celdas `td.detail` en la fila. Recuerda que solo queremos el enlace a los proyectos de ley del legislador, no los comités ni la página de perfil del legislador.  
* Los elementos de anclaje en el HTML se verán como `<a href="/senate/Senator.asp/...">Bills</a>`. La cadena en el atributo `href` contiene el **enlace relativo** que estamos buscando. Puedes acceder a un atributo de un objeto `Tag` de BeautifulSoup de la misma manera que accedes a un diccionario en Python: `anchor['attributeName']`. Consulta la <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag">documentación</a> para más detalles.  
* Hay _muchas_ formas diferentes de usar BeautifulSoup para hacer esto. Cualquier método que utilices para extraer el `href` está bien.  

El código ha sido parcialmente completado para ti. Complétalo donde dice `#YOUR CODE HERE`. Guarda la ruta en un objeto llamado `full_path`.  


In [38]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# Read the content of the server’s response
src = req.text
# Soup it
soup = BeautifulSoup(src, "lxml")
# Create empty list to store our data
members = []

# Returns every ‘tr tr tr’ css selector in the page
rows = soup.select('tr tr tr')
# Get rid of junk rows
rows = [row for row in rows if row.select('td.detail')]

# Loop through all rows
for row in rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail') 
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]

    # YOUR CODE HERE
    #full_path = ''
    # Buscar el enlace a los proyectos de ley dentro de la fila
    bill_anchor = row.select_one('td.detail a[href*="SenatorBills.asp"]')
    
    # Extraer el 'href' si existe, y construir la URL completa
    if bill_anchor:
        relative_path = bill_anchor['href']
        full_path = f"http://www.ilga.gov{relative_path}"
    else:
        full_path = ''

    # Store in a tuple
    senator = (name, district, party, full_path)
    # Append to list
    members.append(senator)

In [39]:
# Uncomment to test 
members[:5]

[('Pamela J. Althoff',
  32,
  'R',
  'http://www.ilga.govSenatorBills.asp?GA=98&MemberID=1911'),
 ('Jason A. Barickman',
  53,
  'R',
  'http://www.ilga.govSenatorBills.asp?GA=98&MemberID=2018'),
 ('Scott M Bennett',
  52,
  'D',
  'http://www.ilga.govSenatorBills.asp?GA=98&MemberID=2272'),
 ('Jennifer Bertino-Tarrant',
  49,
  'D',
  'http://www.ilga.govSenatorBills.asp?GA=98&MemberID=2022'),
 ('Daniel Biss',
  9,
  'D',
  'http://www.ilga.govSenatorBills.asp?GA=98&MemberID=2020')]

## 🥊  Reto: Modulariza tu código  

Convierte el código anterior en una función que acepte una URL, extraiga los senadores de esa URL y devuelva una lista de tuplas que contengan información sobre cada senador.  

In [80]:
# YOUR CODE HERE
def get_members(url):
    # Hacer la solicitud GET
    req = requests.get(url)
    # Leer el contenido de la respuesta del servidor
    src = req.text
    # Analizar con BeautifulSoup
    soup = BeautifulSoup(src, "lxml")

    # Crear una lista vacía para almacenar los datos
    members = []

    # Obtener todas las filas 'tr tr tr' que contienen datos
    rows = soup.select('tr tr tr')
    # Filtrar las filas basura
    rows = [row for row in rows if row.select('td.detail')]

    # Iterar sobre cada fila
    for row in rows:
        # Seleccionar solo las celdas 'td' con clase 'detail'
        detail_cells = row.select('td.detail')  
        # Extraer el texto de cada celda
        row_data = [cell.text.strip() for cell in detail_cells]

        # Extraer información relevante
        name = row_data[0]
        district = int(row_data[3])
        party = row_data[4]

        # Buscar el enlace a los proyectos de ley dentro de la fila
        bill_anchor = row.select_one('td.detail a[href*="SenatorBills.asp"]')

        # Extraer el 'href' si existe, y construir la URL completa
        if bill_anchor:
            relative_path = bill_anchor['href']
            
            # Verificar si la URL ya tiene el dominio base
            if relative_path.startswith('http://www.ilga.gov'):
                full_path = relative_path  # Si ya tiene la URL completa, usarla tal cual
            elif relative_path.startswith('/'):
                full_path = f"http://www.ilga.gov{relative_path}"  # Concatenar solo si es relativa
            else:
                # Si no tiene la barra inicial, corregirla
                full_path = f"http://www.ilga.gov/{relative_path}"
        else:
            full_path = ''


        # Almacenar en una tupla
        senator = (name, district, party, full_path)
        # Agregar a la lista
        members.append(senator)

    return members


In [81]:
# Test your code
url = 'http://www.ilga.gov/senate/default.asp?GA=98'
senate_members = get_members(url)
len(senate_members)

61

## 🥊 Reto para llevar a casa: Escribir una función de scraping  

Queremos extraer información de las páginas web correspondientes a los proyectos de ley patrocinados por cada legislador.  

Escribe una función llamada `get_bills(url)` para analizar una URL de proyectos de ley dada. Esto implicará:  

- hacer una solicitud a la URL usando la librería <a href="http://docs.python-requests.org/en/latest/">`requests`</a>  
- usar las funciones de la librería `BeautifulSoup` para encontrar todos los elementos `<td>` con la clase `billlist`  
- devolver una _lista_ de tuplas, cada una con:  
    - la descripción (2ª columna)  
    - la cámara (S o H) (3ª columna)  
    - la última acción (4ª columna)  
    - la fecha de la última acción (5ª columna)  

Esta función ha sido parcialmente completada. Complétala.  


In [82]:
def get_bills(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src)
    rows = soup.select('tr')
    bills = []
    for row in rows:
        # YOUR CODE HERE
        cells = row.select('td.billlist')

        # Verificar que la fila tenga suficientes columnas
        if len(cells) >= 5:
            bill_id = cells[0].text.strip()  # ID del proyecto de ley
            description = cells[1].text.strip()  # Descripción
            chamber = cells[2].text.strip()  # Cámara (S o H)
            last_action = cells[3].text.strip()  # Última acción
            last_action_date = cells[4].text.strip()  # Fecha de la última acción
            bill = (bill_id, description, chamber, last_action, last_action_date)
            bills.append(bill)
    return bills

In [83]:
# Uncomment to test your code
test_url = senate_members[0][3]

print("Raw test_url:", test_url)

# Verificar si test_url comienza con la base correcta
base_url = "http://www.ilga.gov"
if test_url.startswith(base_url):
    # Si test_url comienza con la URL base, asegurarse de agregar la palabra 'senate/' entre la base y la ruta
    test_url = test_url[:len(base_url)] + "/senate/" + test_url[len(base_url):]
else:
    # Si no empieza con "http", corregirla añadiendo la base URL correctamente
    if not test_url.startswith('/'):
        test_url = "/" + test_url  # Asegurarse de que comience con una barra '/'
    
    # Concatenar correctamente la URL
    test_url = base_url + "/senate/" + test_url  # Concatenar la URL base con la parte relativa

# Imprimir la URL corregida para verificar que esté bien
print("Fixed test_url:", test_url)  # Verifica que ahora sea válida

get_bills(test_url)[0:5]

Raw test_url: http://www.ilga.gov/SenatorBills.asp?GA=98&MemberID=1911
Fixed test_url: http://www.ilga.gov/senate//SenatorBills.asp?GA=98&MemberID=1911


[('SB2', 'STATE GOVERNMENT-TECH', 'S', 'Session Sine Die', '1/13/2015'),
 ('SB2', 'STATE GOVERNMENT-TECH', 'S', 'Session Sine Die', '1/13/2015'),
 ('SB2', 'STATE GOVERNMENT-TECH', 'S', 'Session Sine Die', '1/13/2015'),
 ('SB9',
  'PUBLIC UTIL-PERFORMANCE-BASED',
  'S',
  'Public Act . . . . . . . . . 98-0015',
  '5/23/2013'),
 ('SB27', 'MEDICAID BUDGET NOTE ACT', 'S', 'Session Sine Die', '1/13/2015')]

### Raspado de Todos los Proyectos de Ley

Finalmente, crea un diccionario `bills_dict` que asocie un número de distrito (la clave) con una lista de proyectos de ley (el valor) provenientes de ese distrito. Puedes hacerlo iterando sobre todos los miembros del senado en `members_dict` y llamando a `get_bills()` para cada una de sus URLs asociadas de proyectos de ley.

**NOTA:** por favor, llama a la función `time.sleep(1)` en cada iteración del bucle, para no sobrecargar el sitio web del estado.

In [91]:
bills_dict = {}

# Iterar sobre todos los senadores en la lista 'members'
for senator in members:
    # Obtener el número de distrito del senador (índice 1 de la tupla)
    district = senator[1]
    
    # Obtener la URL de los proyectos de ley de este senador (índice 3 de la tupla)
    bills_url = senator[3]
    
    # Depuración: Verificar la URL original solo una vez
    if not bills_url.startswith("http"):
        print(f"Raw bills URL: {bills_url}")
    
    # Corregir la URL si no tiene el formato correcto
    if bills_url and not bills_url.startswith("http"):
        # Si la URL es relativa, agregar la base correcta
        if bills_url.startswith("/"):
            bills_url = "http://www.ilga.gov" + bills_url
        else:
            # Si la URL no tiene '/', agregamos el dominio con una barra
            bills_url = "http://www.ilga.gov/" + bills_url
    
    # Asegurarse de que la palabra "senate" esté en la URL antes de 'SenatorBills.asp'
    if "SenatorBills.asp" in bills_url:
        # Agregar 'senate' (minúsculas) a la URL justo antes de 'SenatorBills.asp'
        bills_url = bills_url.replace("SenatorBills.asp", "senate/SenatorBills.asp")
    
    # Corregir la URL para evitar concatenación incorrecta
    if bills_url.startswith("http://www.ilga.gov") and not bills_url.startswith("http://www.ilga.gov/"):
        # Si ya tiene el dominio, asegurarnos de que la URL contenga la barra '/'
        bills_url = "http://www.ilga.gov/" + bills_url[len("http://www.ilga.gov"):]
    
    # Depuración: Verificar la URL corregida solo una vez
    #print(f"Fixed bills URL: {bills_url}")
    
    # Llamar a la función get_bills() para obtener los proyectos de ley
    bills = get_bills(bills_url)
    
    # Agregar los proyectos de ley al diccionario bills_dict
    if district not in bills_dict:
        bills_dict[district] = []  # Si el distrito no está en el diccionario, agregarlo
    
    bills_dict[district].extend(bills)  # Agregar los proyectos de ley a la lista correspondiente al distrito
    
    # Pausar la ejecución para evitar sobrecargar el servidor
    time.sleep(1)

In [92]:
# Uncomment to test your code
bills_dict[52]

[('SR1730',
  'MEMORIAL - THOMAS G. HAYS',
  'S',
  'Resolution Adopted',
  '1/13/2015'),
 ('SR1730',
  'MEMORIAL - THOMAS G. HAYS',
  'S',
  'Resolution Adopted',
  '1/13/2015'),
 ('SR1730',
  'MEMORIAL - THOMAS G. HAYS',
  'S',
  'Resolution Adopted',
  '1/13/2015'),
 ('SB10',
  'CIVIL LAW-TECH',
  'S',
  'Public Act . . . . . . . . . 98-0597',
  '11/20/2013'),
 ('SB10',
  'CIVIL LAW-TECH',
  'S',
  'Public Act . . . . . . . . . 98-0597',
  '11/20/2013'),
 ('SB10',
  'CIVIL LAW-TECH',
  'S',
  'Public Act . . . . . . . . . 98-0597',
  '11/20/2013'),
 ('SB16', 'EDUCATION-TECH', 'S', 'Session Sine Die', '1/13/2015'),
 ('SB30',
  'THOMSON PRISON CESSION ACT',
  'S',
  'Public Act . . . . . . . . . 98-0070',
  '7/15/2013'),
 ('SB34', 'HLTH BENEFITS EX-ADMIN', 'S', 'Session Sine Die', '1/13/2015'),
 ('SB46',
  'UNIVERSITY OF ILLINOIS TRUSTEE',
  'S',
  'Session Sine Die',
  '1/13/2015'),
 ('SB47',
  'PUB AID-CAUSE OF ACTION-NOTICE',
  'S',
  'Public Act . . . . . . . . . 98-0073',
  '7/15