# Web Scraping with Beautiful Soup

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Reflection: To Scrape Or Not To Scrape](#when)
2. [Extracting and Parsing HTML](#extract)
3. [Scraping the Illinois General Assembly](#scrape)

<a id='when'></a>

# To Scrape Or Not To Scrape

When we'd like to access data from the web, we should first check whether the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. **Check out D-Lab's [Python Web APIs](https://github.com/dlab-berkeley/Python-Web-APIs) workshop if you want to learn how to use APIs.**

However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus on two packages: Requests and Beautiful Soup.

Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) each senator has sponsored. Before we get started, peruse these websites to take a look at their structure.

## Installation

We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:


The following section covers installing the required packages: `requests`, `beautifulsoup4`, and `lxml`.

## What Requests does

* Makes it simple to send HTTP requests and handle the responses a web server returns.
* Makes it easy to send parameters, headers, and authentication with a request.

## What beautifulsoup4 does

* Extracts and processes information from web pages obtained through an HTTP request.
* Lets you navigate and search intuitively among the tags, attributes, and text in the HTML.

## What lxml does

* A library specialized in processing and manipulating XML and HTML documents.
* Very efficient for tasks that involve parsing, transforming, or extracting tree-structured data.




In [3]:
%pip install requests

Note: you may need to restart the kernel to use updated packages.


In [4]:
%pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


We'll also install the `lxml` package, which helps support some of the parsing that Beautiful Soup performs:

In [5]:
%pip install lxml

Note: you may need to restart the kernel to use updated packages.


## Importing libraries

This section imports the libraries the script uses for web scraping, date handling, and controlling execution timing.

- **from bs4 import BeautifulSoup** → imports BeautifulSoup, used to parse HTML documents and extract information from them.  
- **from datetime import datetime** → lets us work with dates and times, such as getting the current date or formatting timestamps.  
- **import requests** → used to make HTTP requests and fetch the content of web pages.  
- **import time** → provides functions for pausing the script, for example with `time.sleep()`.



In [1]:
# Import required libraries
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time

<a id='extract'></a>

# Extracting and Parsing HTML 

To successfully scrape and analyze HTML, we'll go through the following 4 steps:
1. Make a GET request
2. Parse the page with Beautiful Soup
3. Search for HTML elements
4. Get attributes and text of these elements

## Step 1: Make a GET Request to Obtain a Page's HTML

We can use the Requests library to:

1. Make a GET request to the page, and
2. Read in the webpage's HTML code.

The process of making a request and obtaining a result resembles that of the Web API workflow. Now, however, we're making a request directly to the website, and we're going to have to parse the HTML ourselves. This is in contrast to being provided data organized into a more straightforward `JSON` or `XML` output.
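Because we are now talking to a website rather than an API, it can help to fail loudly when the request itself goes wrong, instead of parsing an error page by accident. The sketch below is one way to do that. The helper name `fetch_html` and the 10-second default timeout are assumptions made for illustration; `timeout` and `raise_for_status()` are standard parts of Requests.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the HTML text at `url` (hypothetical helper, not part of the workshop)."""
    # A timeout keeps the call from hanging forever on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    return response.text
```

With this in place, `src = fetch_html('http://www.ilga.gov/senate/default.asp')` would play the role of the `req`/`src` pair used in this notebook.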

This block makes an HTTP GET request to a website and shows part of the content received.  

- **req = requests.get('http://www.ilga.gov/senate/default.asp')** → makes an HTTP GET request to the specified URL

- **src = req.text** → gets the body of the server's response as text (HTML) 

- **print(src[:1000])** → shows the first 1000 characters of the content as a preview

In [2]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
# Read the content of the server’s response
src = req.text
# View some output
print(src[:1000])

<!DOCTYPE html>
<html lang="en">
<head id="Head1">
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta charset="utf-8" />
    <meta charset="UTF-8">
    <!-- Meta Description -->
    <meta name="description" content="Welcome to the official government website of the Illinois General Assembly">
    <meta name="contactName" content="Legislative Information System">
    <meta name="contactOrganization" content="LIS Staff Services">
    <meta name="contactStreetAddress1" content="705 Stratton Office Building">
    <meta name="contactCity" content="Springfield">
    <meta name="contactZipcode" content="62706">
    <meta name="contactNetworkAddress" content="webmaster@ilga.gov">
    <meta name="contactPhoneNumber" content="217-782-3944">
    <meta name="contactFaxNumber" content="217-524-6059">
    <meta name


## Step 2: Parse the Page with Beautiful Soup

Now, we use the `BeautifulSoup` function to parse the response into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.

If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools.
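If installing `lxml` is not an option, Beautiful Soup can also use `html.parser`, which ships with Python. The following is a minimal sketch on a hand-written page (the HTML and the name `soup_demo` are made up for illustration). The two parsers can build slightly different trees for malformed HTML, so results on a real page may differ.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical, not taken from ilga.gov).
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# "html.parser" is part of the standard library, so no extra install is needed.
soup_demo = BeautifulSoup(html, "html.parser")

print(soup_demo.title.text)
print(soup_demo.p.text)
```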

## Explanation of the code block ##
This block converts the server's response into an HTML tree using *BeautifulSoup*, which makes it easier to navigate the page and extract information, and then prints the first 1000 characters of the prettified tree.


In [3]:
# Parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml')
# Take a look
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html lang="en">
 <head id="Head1">
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8"/>
  <meta charset="utf-8"/>
  <!-- Meta Description -->
  <meta content="Welcome to the official government website of the Illinois General Assembly" name="description"/>
  <meta content="Legislative Information System" name="contactName"/>
  <meta content="LIS Staff Services" name="contactOrganization"/>
  <meta content="705 Stratton Office Building" name="contactStreetAddress1"/>
  <meta content="Springfield" name="contactCity"/>
  <meta content="62706" name="contactZipcode"/>
  <meta content="webmaster@ilga.gov" name="contactNetworkAddress"/>
  <meta content="217-782-3944" name="contactPhoneNumber"/>
  <meta content="217-524-6059" name="contactFaxNumber"/>
  <meta content="State Of Illinois" name="originatorJur

The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page.

## Step 3: Search for HTML Elements

Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors
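The three search styles listed above can be compared side by side on a small hand-written snippet. This is only a sketch: the HTML, the `nav` class, and the variable names are invented for illustration and do not come from the Illinois site.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, loosely modelled on a navigation menu.
html = """
<ul>
  <li><a class="nav" data-lang="en" href="/home">Home</a></li>
  <li><a class="nav" data-lang="es" href="/inicio">Inicio</a></li>
  <li><a href="/contact">Contact</a></li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

# 1. By HTML tag: every <a> element.
by_tag = demo.find_all("a")
# 2. By attribute: <a> elements whose data-lang attribute is "en".
by_attr = demo.find_all("a", attrs={"data-lang": "en"})
# 3. By CSS selector: <a> elements with class "nav".
by_css = demo.select("a.nav")

print(len(by_tag), len(by_attr), len(by_css))  # prints: 3 1 2
```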

Let's search first for **HTML tags**. 

The function `find_all` searches the `soup` tree to find all the elements with a particular HTML tag, and returns all of those elements.

What does the example below do?

## Explanation of the code block ##

It finds all elements with the `<a>` tag in the HTML tree and prints only the first 10 results.

In [4]:
# Find all elements with a certain tag
a_tags = soup.find_all("a")
print(a_tags[:10])

[<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="af" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-za"></span> Afrikaans
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="sq" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-al"></span> Albanian
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="ar" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-ae"></span> Arabic
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="hy" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-am"></span> Armenian
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="az" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-az"></span> Azerbaijani
            

Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. 

These two lines of code are equivalent:

## Explanation of the code block ##
All elements with the `<a>` tag are found in two equivalent ways (`soup.find_all("a")` and `soup("a")`), and the first element from each is printed. Finally, the number of elements found by the first call is printed.

In [5]:
a_tags = soup.find_all("a")
a_tags_alt = soup("a")
print(a_tags[0])
print(a_tags_alt[0])

<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>
<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>


How many links did we obtain?

In [6]:
print(len(a_tags))

270


That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get more hits, many of which you might not want. Remember, the `a` tag defines a hyperlink, so you'll usually find many on any given page.

What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes? 

We can do this by adding an additional argument to `find_all`. In the example below, we find all the `a` tags and keep only those with `class_="sidemenu"`.

## Explanation of the code block ##
These cells look specifically for `<a>` tags with the class `sidemenu`. The first uses `soup("a", class_="sidemenu")`; the second uses CSS selector syntax, `soup.select("a.sidemenu")`. Both would show only the first 5 results, but no element with the class `sidemenu` is found, so nothing is displayed and the count is 0.

In [13]:
# Get only the 'a' tags in 'sidemenu' class
side_menus = soup("a", class_="sidemenu")
print(len(side_menus))
side_menus[:5]

0


[]

A more efficient way to search for elements on a website is via a **CSS selector**. For this we have to use a different method called `select()`. Just pass a string into the `.select()` to get all elements with that string as a valid CSS selector.

In the example above, we can use `"a.sidemenu"` as a CSS selector, which returns all `a` tags with class `sidemenu`.

In [14]:
# Get elements with "a.sidemenu" CSS Selector.
selected = soup.select("a.sidemenu")
selected[:5]

[]
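When a selector returns an empty list, a reasonable first step is to check which classes the page actually uses. The sketch below tallies class names on `a` tags. It runs on invented HTML: `dropdown-item` echoes the output seen earlier in this notebook, while the other class names are assumptions. On the live page you would run the same loop over `soup` instead of `demo`.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<a class="dropdown-item" href="#">English</a>
<a class="dropdown-item" href="#">Afrikaans</a>
<a class="nav-link active" href="/senate">Senate</a>
<a href="/plain">No class</a>
"""
demo = BeautifulSoup(html, "html.parser")

class_counts = Counter()
for tag in demo.find_all("a"):
    # tag.get("class") is a list of class names; the default covers tags with no class.
    class_counts.update(tag.get("class", []))

print(class_counts.most_common())
```

A class that never appears in the tally, such as `sidemenu` here, cannot match any selector built on it.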

## 🥊 Challenge: Find All

Use BeautifulSoup to find all the `a` elements with class `mainmenu`.

## Explanation of the code block ##
This cell looks for `<a>` tags with the class `mainmenu` using CSS selector syntax and would show the first 5 elements. However, no element with the class `mainmenu` is found, so no data is shown.

In [7]:
# Get elements with "a.mainmenu" CSS Selector.
selected_main = soup.select("a.mainmenu")
selected_main[:5]


[]

## Step 4: Get Attributes and Text of Elements

Once we identify elements, we want to access the information in those elements. Usually, this means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:

## Explanation of the code block ##
This cell tries to get all `<a>` links with the class `sidemenu` and examine the first one. It also checks the type of the variable holding the first link.  
However, because no elements with the class `sidemenu` are found, accessing `side_menu_links[0]` raises an `IndexError`.  
The later cells `first_link.text` and `first_link['href']` then raise a `NameError`, because `first_link` was never defined: `side_menu_links` was empty in the previous step.

In [9]:
# Get all sidemenu links as a list
side_menu_links = soup.select("a.sidemenu")

# Examine the first link
first_link = side_menu_links[0]
print(first_link)

# What class is this variable?
print('Class: ', type(first_link))

IndexError: list index out of range

It's a Beautiful Soup tag! This means it has a `text` member:

In [10]:
print(first_link.text)

NameError: name 'first_link' is not defined

Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.

💡 **Tip**: You can access a tag’s attributes by treating the tag like a dictionary:

In [18]:
print(first_link['href'])

NameError: name 'first_link' is not defined
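Indexing into an empty result, or into a missing attribute, raises an exception. A more defensive pattern, sketched here on made-up HTML, uses `select_one` (which returns `None` when nothing matches) and the tag's `.get()` method (which returns `None` for a missing attribute). The class names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link deliberately has no href.
html = (
    '<a class="dropdown-item" href="#en">English</a>'
    '<a class="dropdown-item">No link</a>'
)
demo = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None when nothing matches.
missing = demo.select_one("a.sidemenu")
if missing is None:
    print("No element matched a.sidemenu")

# .get() returns None for a missing attribute; link["href"] would raise KeyError.
hrefs = [link.get("href") for link in demo.select("a.dropdown-item")]
print(hrefs)
```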

## 🥊 Challenge: Extract specific attributes

Extract all `href` attributes for each `mainmenu` URL.

## Explanation of the code block ##
All `<a>` links with the class `mainmenu` are collected in the list `main_menu_links`. The code then tries to access the first element of that list and print its `href` attribute.  
This raises an `IndexError` because `main_menu_links` is empty: no links with the class `mainmenu` were found in the downloaded HTML.

In [15]:

# Get all mainmenu links as a list
main_menu_links = soup.select("a.mainmenu")
first_link_menu = main_menu_links[0]
print(first_link_menu['href'])


IndexError: list index out of range

<a id='scrape'></a>

# Scraping the Illinois General Assembly

Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of Beautiful Soup and Python.

Let's apply these skills to scrape the [Illinois 98th General Assembly](http://www.ilga.gov/senate/default.asp?GA=98).

Specifically, our goal is to scrape information on each senator, including their name, district, and party.

## Scrape and Soup the Webpage

Let's scrape and parse the webpage, using the tools we learned in the previous section.

## Explanation of the code block ##
An HTTP GET request is made to the page with the parameter `GA=98`, the response body is read, and it is converted into an HTML tree with BeautifulSoup using the `lxml` parser.

In [16]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# Read the content of the server’s response
src = req.text
# Soup it
soup = BeautifulSoup(src, "lxml")

## Search for the Table Elements

Our goal is to obtain the elements in the table on the webpage. Remember: rows are identified by the `tr` tag. Let's use `find_all` to obtain these elements.


## Explanation of the code block ##
All table rows (`<tr>`) in the HTML are retrieved with two methods: `find_all("tr")` and the CSS selector `'tr tr tr'`. The first 5 rows found are then printed.   
However, no rows are found, so there is nothing to print, and `len(rows)` confirms a value of 0. Finally, trying to access the third row raises an `IndexError`, because `rows` is empty.

In [17]:
# Get all table row elements
rows = soup.find_all("tr")
len(rows)

0

⚠️ **Warning**: Keep in mind: `find_all` gets *all* the elements with the `tr` tag. We only want some of them. If we use the 'Inspect' function in Google Chrome and look carefully, then we can use some CSS selectors to get just the rows we're interested in. Specifically, we want the inner rows of the table:

In [21]:
# Returns every ‘tr tr tr’ css selector in the page
rows = soup.select('tr tr tr')

for row in rows[:5]:
    print(row, '\n')

It looks like we want everything after the first two rows. Let's work with a single row to start, and build our loop from there.

In [22]:
example_row = rows[2]
print(example_row.prettify())

IndexError: list index out of range

Let's break this row down into its component cells/columns using the `select` method with CSS selectors. Looking closely at the HTML, there are a couple of ways we could do this.

* We could identify the cells by their tag `td`.
* We could use the class name `.detail`.
* We could combine both and use the selector `td.detail`.
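To see how the three selectors behave, here is a sketch on a single hand-written row. The markup is an assumption about what a senator row might look like (five `td.detail` cells, with name, district, and party at positions 0, 3, and 4 as used in this notebook); the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical row; the actual ilga.gov markup may differ.
html = """
<table><tr>
  <td class="detail">Jane Doe</td>
  <td class="detail">Bills</td>
  <td class="detail">Committees</td>
  <td class="detail">7</td>
  <td class="detail">D</td>
</tr></table>
"""
row = BeautifulSoup(html, "html.parser").tr

by_tag = row.select("td")
by_class = row.select(".detail")
by_both = row.select("td.detail")

# All three selectors match the same five cells in this row.
print(by_tag == by_class == by_both)  # prints: True
print([cell.text for cell in by_both])
```

The selectors agree here only because every `td` in the row carries the `detail` class. On a row that mixes classes, `td.detail` is the narrowest of the three.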

## Explanation of the code block ##
The code loops over the cells of a table row (`example_row`) and tries to print the cells that match different selectors:  
- `td` selects all cells in the row.  
- `.detail` selects the elements that have the class `detail`.  
- `td.detail` selects specifically the `<td>` cells with the class `detail`.  

Finally, `assert` checks that all three selectors return the same result.

The result is a `NameError`, because `example_row` was never defined, for the reason given for the previous code block.

In [23]:
for cell in example_row.select('td'):
    print(cell)
print()

for cell in example_row.select('.detail'):
    print(cell)
print()

for cell in example_row.select('td.detail'):
    print(cell)
print()

NameError: name 'example_row' is not defined

We can confirm that these are all the same.

In [24]:
assert example_row.select('td') == example_row.select('.detail') == example_row.select('td.detail')

NameError: name 'example_row' is not defined

Let's use the selector `td.detail` to be as specific as possible.

## Explanation of the code block ##
This block extracts only the `<td>` cells of a row (`example_row`) that have the class `detail`.  
Then a list `row_data` is built containing only the text of those cells, so that specific positions can be indexed to pull out the name, district, and party.  
The result is a `NameError`, for the reason already mentioned: `example_row` is not defined.

In [25]:
# Select only those 'td' tags with class 'detail' 
detail_cells = example_row.select('td.detail')
detail_cells

NameError: name 'example_row' is not defined

Most of the time, we're interested in the actual **text** of a website, not its tags. Recall that to get the text of an HTML element, we use the `text` member:

In [27]:
# Keep only the text in each of those cells
row_data = [cell.text for cell in detail_cells]

print(row_data)

NameError: name 'detail_cells' is not defined

Looks good! Now we just use our basic Python knowledge to get the elements of this list that we want. Remember, we want the senator's name, their district, and their party.

In [28]:
print(row_data[0]) # Name
print(row_data[3]) # District
print(row_data[4]) # Party

NameError: name 'row_data' is not defined

## Getting Rid of Junk Rows

We saw at the beginning that not all of the rows we got actually correspond to a senator. We'll need to do some cleaning before we can proceed forward. Take a look at some examples:

## Explanation of the code block ##
This block tries to inspect specific rows of the list `rows` and then measure their size with `len(rows[0])` to tell "bad" rows from "good" ones.  

However, accessing `rows[0]` raises an `IndexError` because `rows` is empty. For the same reason, the call `len(rows[0])` fails again: it asks for the length of an element that does not exist.



In [29]:
print('Row 0:\n', rows[0], '\n')
print('Row 1:\n', rows[1], '\n')
print('Last Row:\n', rows[-1])

IndexError: list index out of range

When we write our for loop, we only want it to apply to the relevant rows. So we'll need to filter out the irrelevant rows. The way to do this is to compare some of these to the rows we do want, see how they differ, and then formulate that in a conditional.

As you can imagine, there are a lot of possible ways to do this, and it'll depend on the website. We'll show some here to give you an idea of how to do this.

In [30]:
# Bad rows
print(len(rows[0]))
print(len(rows[1]))

# Good rows
print(len(rows[2]))
print(len(rows[3]))

IndexError: list index out of range

Perhaps good rows have a length of 5. Let's check:

## Explanation of the code block ##

This block tries to filter the "good" rows out of the list `rows`.  
First it builds the list `good_rows`, which keeps only the rows whose size (`len(row)`) equals 5.  

It then tries to print some elements of `good_rows`. However, printing the first element, `good_rows[0]`, raises an `IndexError`, because `good_rows` is empty.

Similarly, accessing `rows[2].select('td.detail')` or `rows[-1]` raises the same `IndexError`: `rows` is empty, so there is no element at those positions.   


In [31]:
good_rows = [row for row in rows if len(row) == 5]

# Let's check some rows
print(good_rows[0], '\n')
print(good_rows[-2], '\n')
print(good_rows[-1])

IndexError: list index out of range

We found a footer row in our list that we'd like to avoid. Let's try something else:

In [32]:
rows[2].select('td.detail') 

IndexError: list index out of range

In [33]:
# Bad row
print(rows[-1].select('td.detail'), '\n')

# Good row
print(rows[5].select('td.detail'), '\n')

# How about this?
good_rows = [row for row in rows if row.select('td.detail')]

print("Checking rows...\n")
print(good_rows[0], '\n')
print(good_rows[-1])

IndexError: list index out of range

Looks like we found something that worked!
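The selector-based filter can be tried offline on a small hand-written table. This sketch uses invented markup with a header row, two data rows, and a footer row. It relies on the fact that an empty list is falsy in Python, so any row without a `td.detail` cell drops out. Note also that `len(row)` counts a tag's direct children, whitespace text nodes included, which is one reason a length test can be fragile.

```python
from bs4 import BeautifulSoup

# Hypothetical table for illustration.
html = """
<table>
  <tr><th>Senator</th><th>District</th></tr>
  <tr><td class="detail">Jane Doe</td><td class="detail">7</td></tr>
  <tr><td class="detail">John Roe</td><td class="detail">12</td></tr>
  <tr><td colspan="2">Footer text</td></tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")
all_rows = demo.select("tr")

# Keep only rows that contain at least one td.detail cell.
good = [r for r in all_rows if r.select("td.detail")]

print(len(all_rows), len(good))  # prints: 4 2
print([r.td.text for r in good])
```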

## Loop it All Together

Now that we've seen how to get the data we want from one row, as well as filter out the rows we don't want, let's put it all together into a loop.

## Explanation of the code block ##
This block processes the valid rows (`valid_rows`) of the list `rows`, stores each senator's information in a list called `members`, and finally shows the first 5 records. Because `rows` is empty from the start, `valid_rows` is also empty and, as a result, `members` remains an empty list, `[]`.

In [37]:
# Define storage list
members = []

# Get rid of junk rows
valid_rows = [row for row in rows if row.select('td.detail')]

# Loop through all rows
for row in valid_rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail')
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]
    # Store in a tuple
    senator = (name, district, party)
    # Append to list
    members.append(senator)

In [42]:
# Should be 61
len(members)

0

Let's take a look at what we have in `members`.

In [39]:
print(members[:5])

[]


## 🥊  Challenge: Get `href` elements pointing to members' bills 

The code above retrieves information on:  

- the senator's name,
- their district number,
- and their party.

We now want to retrieve the URL for each senator's list of bills. Each URL will follow a specific format. 

The format for the list of bills for a given senator is:

`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`

to get something like:

`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`

in which `MEMBER_ID=1911`. 

You should be able to see that, unfortunately, `MEMBER_ID` is not currently something pulled out in our scraping code.

Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.

Tips: 

* To do this, you will want to get the appropriate anchor element (`<a>`) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.
* The anchor elements' HTML will look like `<a href="/senate/Senator.asp/...">Bills</a>`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeautifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. See the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag">documentation</a> for more details.
* There are a _lot_ of different ways to use BeautifulSoup to get things done. Whatever you need to do to pull the `href` out is fine.
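For turning a relative `href` into a full URL, string concatenation works, but the standard library's `urllib.parse.urljoin` handles leading slashes and other edge cases for you. The sketch below assumes the relative link has the same shape as the example bills URL shown above with the host removed. That shape is an assumption, not something read from the live page.

```python
from urllib.parse import parse_qs, urljoin, urlparse

# The page the link was found on, and an assumed relative href.
page_url = "http://www.ilga.gov/senate/default.asp?GA=98"
relative = "/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True"

# urljoin resolves the relative path against the page it came from.
full_path = urljoin(page_url, relative)
print(full_path)

# The member ID can be read back out of the query string.
member_id = parse_qs(urlparse(full_path).query)["MemberID"][0]
print(member_id)  # prints: 1911
```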

The code has been partially filled out for you. Fill it in where it says `#YOUR CODE HERE`. Save the path into an object called `full_path`.

## Explanation of the code block ##
An HTTP request is made to the Senate page and the content is parsed with `BeautifulSoup` using the `lxml` parser.
All `<tr>` rows are selected and filtered down to those containing cells matching `td.detail`, which correspond to the senators' data.  

The code then loops over the rows and, for each valid row, extracts the senator's name, district number, political party, and the link to their bills, building the full URL from the `href` attribute.

All the information is stored in the list `members` as tuples `(name, district, party, full_path)`. Finally, the first five elements can be inspected to check the data.

In [61]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# Read the content of the server’s response
src = req.text
# Soup it
soup = BeautifulSoup(src, "lxml")
# Create empty list to store our data
members = []
# Returns every 'tr' element in the page
rows = soup.select('tr')
# Get rid of junk rows
rows = [row for row in rows if row.select('td.detail')]

# Loop through all rows
for row in rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail') 
    # Keep only the text in each of those cells
    row_data = [cell.text.strip() for cell in detail_cells]  # strip() removes extra whitespace
    #row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]
    # YOUR CODE HERE
    # Find the link to "Bills"
    bill_link = row.select_one('a[href*="SenatorBills.asp"]')
    if bill_link:
        relative_path = bill_link['href']  # extract the href attribute
        full_path = "http://www.ilga.gov" + relative_path  # build the full link
    else:
        full_path = None  # in case no link exists


    # Store in a tuple
    senator = (name, district, party, full_path)
    # Append to list
    members.append(senator)

In [62]:
# Uncomment to test 
members[:5]

[]

## 🥊  Challenge: Modularize Your Code

Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator. 

## Explanation of the code block ##
This block defines a function called `get_members` that scrapes an Illinois Senate page and returns each senator's information as tuples.
The function accepts any Senate page URL that follows the same table structure.
It makes an HTTP request to the URL and parses the content with BeautifulSoup using the `lxml` parser. All `<tr>` rows are selected and filtered down to those containing cells matching `td.detail`, which correspond to the senators' data.  

It extracts each senator's information as in the previous exercise, building the full URL from the `href` attribute. Each senator is stored as a tuple `(name, district, party, full_path)` and appended to the list `members`.  
At the end, the function returns the list of all senators extracted.

In [68]:
# YOUR CODE HERE
def get_members(url):
    # Make the HTTP request
    req = requests.get(url)
    src = req.text
    soup = BeautifulSoup(src, "lxml")

    members = []

    # Select all <tr> rows that contain cells with class 'detail'
    rows = [row for row in soup.select('tr') if row.select('td.detail')]

    # Loop through the valid rows
    for row in rows:
        detail_cells = row.select('td.detail')
        row_data = [cell.text.strip() for cell in detail_cells]

        name = row_data[0]
        district = int(row_data[3])
        party = row_data[4]

        # Get the link to the senator's bills
        bill_link = row.select_one('a[href*="SenatorBills.asp"]')
        if bill_link:
            full_path = "http://www.ilga.gov" + bill_link['href']
        else:
            full_path = None

        # Store the information in the list
        members.append((name, district, party, full_path))

    return members


In [73]:
# Test your code
url = 'http://www.ilga.gov/senate/default.asp?GA=98'
senate_members = get_members(url)
len(senate_members)

0
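Because `get_members` both downloads and parses, an empty result does not say whether the request or the selectors are at fault. One possible refactoring, shown only as a sketch, keeps the parsing in its own function that takes HTML text, so it can be checked offline against a known sample. The name `parse_members`, the sample markup, and the five-cell row layout are all assumptions for illustration.

```python
from bs4 import BeautifulSoup


def parse_members(html):
    """Parse senator rows out of HTML text (hypothetical helper, assumed layout)."""
    soup = BeautifulSoup(html, "html.parser")
    members = []
    for row in soup.select("tr"):
        cells = [cell.text.strip() for cell in row.select("td.detail")]
        # Skip header, footer, and any row that lacks the five expected cells.
        if len(cells) < 5:
            continue
        members.append((cells[0], int(cells[3]), cells[4]))
    return members


# Hand-written sample standing in for a downloaded page.
sample = """
<table>
  <tr><th>Senator</th></tr>
  <tr><td class="detail">Jane Doe</td><td class="detail">Bills</td>
      <td class="detail">Committees</td><td class="detail">7</td>
      <td class="detail">D</td></tr>
</table>
"""
print(parse_members(sample))  # prints: [('Jane Doe', 7, 'D')]
```

If a function like this returns rows for the sample but nothing for the live page, the page's markup most likely no longer matches the selectors.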

## 🥊 Take-home Challenge: Writing a Scraper Function

We want to scrape the webpages corresponding to the bills sponsored by each senator.

Write a function called `get_bills(url)` to parse a given bills URL. This will involve:

  - requesting the URL using the <a href="http://docs.python-requests.org/en/latest/">`requests`</a> library
  - using the features of the `BeautifulSoup` library to find all of the `<td>` elements with the class `billlist`
  - return a _list_ of tuples, each with:
      - description (2nd column)
      - chamber (S or H) (3rd column)
      - the last action (4th column)
      - the last action date (5th column)
      
This function has been partially completed. Fill in the rest.

## Explanation of the code block ##
The added code selects all `<tr>` rows and, within each, only the `<td>` cells with the class `billlist`, which hold the data for each bill.  
It extracts the fields, stores each bill as a tuple `(bill_id, description, chamber, last_action, last_action_date)`, and appends it to the list `bills`. Because `senate_members` contained no data, the test cell raises an `IndexError`: it tries to read an element at a position that does not exist.

In [70]:
def get_bills(url):
    src = requests.get(url).text
    # Name the parser explicitly so results don't depend on what is installed
    soup = BeautifulSoup(src, "lxml")
    rows = soup.select('tr')
    bills = []
    for row in rows:
        # YOUR CODE HERE
        cells = row.select('td.billlist')  # Only the cells with class billlist
        if len(cells) < 5:
            continue  # Skip incomplete rows

        bill_id = cells[0].text.strip()           # 1st column
        description = cells[1].text.strip()       # 2nd column
        chamber = cells[2].text.strip()           # 3rd column
        last_action = cells[3].text.strip()       # 4th column
        last_action_date = cells[4].text.strip()  # 5th column
        bill = (bill_id, description, chamber, last_action, last_action_date)
        bills.append(bill)
    return bills

In [None]:
# Uncomment to test your code
test_url = senate_members[0][3]
get_bills(test_url)[0:5]

IndexError: list index out of range

### Scrape All Bills

Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `members_dict` and calling `get_bills()` for each of their associated bill URLs.

**NOTE:** please call the function `time.sleep(1)` for each iteration of the loop, so that we don't destroy the state's web site.
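The pause between requests can be folded into a small reusable loop. The sketch below is one way to do it. The function name `fetch_all` and its parameters are hypothetical, and the `fetch` argument stands in for a function such as `get_bills`.

```python
import time


def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = {}
    for i, url in enumerate(urls):
        # Sleep before every request except the first, so we never hit the site back-to-back.
        if i > 0:
            time.sleep(delay)
        results[url] = fetch(url)
    return results
```

For example, `fetch_all(bill_urls, get_bills)` would return a dictionary keyed by URL, where `bill_urls` is a list you have built from the scraped members.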

## Explanation of the code block ##
An empty dictionary `bills_dict` is created; its keys are district numbers and its values are the lists of bills for each district. The code then iterates over each senator in `members_dict`.
For each senator, it takes the URL of their bills and calls `get_bills(bill_url)`.
The results are stored in `bills_dict` under the key for the senator's district.
`time.sleep(1)` is called on every iteration to avoid overloading the website.  

In this case, because the senators' data was not extracted correctly in earlier steps, `members_dict` is empty. As a result, `bills_dict` is also empty, and accessing a specific key such as `bills_dict[52]` raises a `KeyError`.

In [81]:
# YOUR CODE HERE
# Create a dictionary to store the bills by district
bills_dict = {}
# Build a dictionary from the list of members
# The key is the district; the value is the senator's full tuple
members_dict = {member[1]: member for member in senate_members}

# Check the dictionary
print(list(members_dict.keys()))  # The districts
print(list(members_dict.values()))  # The senators
# Loop over all senators in members_dict
for district, member_info in members_dict.items():
    bill_url = member_info[3]  # The bills URL for this senator
    if bill_url:  # Make sure a URL exists
        bills = get_bills(bill_url)  # Get this senator's bills
        bills_dict[district] = bills
    else:
        bills_dict[district] = []  # No URL, so leave an empty list

    time.sleep(1)  # Pause 1 second to avoid overloading the site

# Check some results
for district in list(bills_dict.keys())[:5]:
    print(f"District {district}: {len(bills_dict[district])} bills")

[]
[]


In [82]:
# Uncomment to test your code
bills_dict[52]

KeyError: 52