# Web Scraping with Beautiful Soup

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Reflection: To Scrape Or Not To Scrape](#when)
2. [Extracting and Parsing HTML](#extract)
3. [Scraping the Illinois General Assembly](#scrape)

<a id='when'></a>

# To Scrape Or Not To Scrape

When we'd like to access data from the web, we should first check whether the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. **Check out D-Lab's [Python Web APIs](https://github.com/dlab-berkeley/Python-Web-APIs) workshop if you want to learn how to use APIs.**

However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus on two packages: Requests and Beautiful Soup.

Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) each senator has sponsored. Before we get started, peruse these websites to take a look at their structure.

## Installation

We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:

In [16]:
%pip install requests
# Install the Requests library using pip.

Note: you may need to restart the kernel to use updated packages.


In [17]:
%pip install beautifulsoup4
# Install the Beautiful Soup 4 library using pip.

Note: you may need to restart the kernel to use updated packages.


We'll also install the `lxml` package, which helps support some of the parsing that Beautiful Soup performs:

In [18]:
%pip install lxml
# Install the lxml library using pip.

Note: you may need to restart the kernel to use updated packages.


In [4]:
# Import required libraries
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time
# These are the libraries needed for web scraping and data handling.

<a id='extract'></a>

# Extracting and Parsing HTML 

In order to successfully scrape and analyse HTML, we'll be going through the following 4 steps:
1. Make a GET request
2. Parse the page with Beautiful Soup
3. Search for HTML elements
4. Get attributes and text of these elements

## Step 1: Make a GET Request to Obtain a Page's HTML

We can use the Requests library to:

1. Make a GET request to the page, and
2. Read in the webpage's HTML code.

The process of making a request and obtaining a result resembles that of the Web API workflow. Now, however, we're making a request directly to the website, and we're going to have to parse the HTML ourselves. This is in contrast to being provided data organized into a more straightforward `JSON` or `XML` output.

In [19]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
# Read the content of the server's response
src = req.text
# View some output (the first 1000 characters)
print(src[:1000])

<!DOCTYPE html>
<html lang="en">
<head id="Head1">
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta charset="utf-8" />
    <meta charset="UTF-8">
    <!-- Meta Description -->
    <meta name="description" content="Welcome to the official government website of the Illinois General Assembly">
    <meta name="contactName" content="Legislative Information System">
    <meta name="contactOrganization" content="LIS Staff Services">
    <meta name="contactStreetAddress1" content="705 Stratton Office Building">
    <meta name="contactCity" content="Springfield">
    <meta name="contactZipcode" content="62706">
    <meta name="contactNetworkAddress" content="webmaster@ilga.gov">
    <meta name="contactPhoneNumber" content="217-782-3944">
    <meta name="contactFaxNumber" content="217-524-6059">
    <meta name
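
One thing the cell above does not check is whether the request succeeded: `requests.get` returns a response object even when the server answers with an error page. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the workshop code.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the HTML text at `url`, raising an error on a failed request.

    `fetch_html` and its default timeout are hypothetical choices made
    for this sketch.
    """
    # Without a timeout, a request can wait indefinitely on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently handing back the HTML of an error page.
    response.raise_for_status()
    return response.text
```

With this helper, the first step could be written as `src = fetch_html('http://www.ilga.gov/senate/default.asp')`.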


## Step 2: Parse the Page with Beautiful Soup

Now, we use the `BeautifulSoup` constructor to parse the response into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.

If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools.

In [20]:
# Parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml')
# Print the first 1000 characters of the prettified HTML
print(soup.prettify()[:1000])
# BeautifulSoup(src, 'lxml') parses the HTML content (src) with the lxml parser and returns a soup object that represents the structure of the HTML document.
# soup.prettify() returns the HTML with readable indentation, which makes the structure easier to see.
# Printing only the first 1000 characters gives a sample of the parsed HTML, which helps identify the tags and classes needed to extract information.

<!DOCTYPE html>
<html lang="en">
 <head id="Head1">
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8"/>
  <meta charset="utf-8"/>
  <!-- Meta Description -->
  <meta content="Welcome to the official government website of the Illinois General Assembly" name="description"/>
  <meta content="Legislative Information System" name="contactName"/>
  <meta content="LIS Staff Services" name="contactOrganization"/>
  <meta content="705 Stratton Office Building" name="contactStreetAddress1"/>
  <meta content="Springfield" name="contactCity"/>
  <meta content="62706" name="contactZipcode"/>
  <meta content="webmaster@ilga.gov" name="contactNetworkAddress"/>
  <meta content="217-782-3944" name="contactPhoneNumber"/>
  <meta content="217-524-6059" name="contactFaxNumber"/>
  <meta content="State Of Illinois" name="originatorJur

The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page.
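
To get a feel for what "traversing" a soup object means, here is a small sketch on a made-up HTML snippet (not the Illinois page). It uses Python's built-in `html.parser` so that it runs without `lxml`; the workshop cells use `lxml` instead.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration.
html = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <h1 id="top">Hello</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# 'html.parser' ships with Python, so no extra install is needed here.
soup = BeautifulSoup(html, "html.parser")

# Dotted access returns the first matching tag in the tree.
print(soup.title.string)  # text of the <title> tag
print(soup.h1["id"])      # value of the id attribute on <h1>
print(soup.p.text)        # text of the first <p> tag
```

Dotted access only ever returns the first match, which is why the next step turns to methods that return every matching element.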

## Step 3: Search for HTML Elements

Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:

1. HTML tags
2. HTML attributes
3. CSS selectors

Let's search first for **HTML tags**. 

The function `find_all` searches the `soup` tree for all the elements with a particular HTML tag, and returns all of those elements.

What does the example below do?

In [21]:
# Find all elements with the 'a' tag (links)
a_tags = soup.find_all("a")
# Print the first 10 links found
print(a_tags[:10])
# soup.find_all("a") returns a list of every <a> element (link) in the HTML document.

[<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="af" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-za"></span> Afrikaans
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="sq" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-al"></span> Albanian
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="ar" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-ae"></span> Arabic
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="hy" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-am"></span> Armenian
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="az" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-az"></span> Azerbaijani
            

Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. 

These two lines of code are equivalent:

In [22]:
# Both lines find all elements with the 'a' tag (links)
a_tags = soup.find_all("a")
a_tags_alt = soup("a")  # Equivalent to find_all("a")

# Print the first link found by each method
print(a_tags[0])
print(a_tags_alt[0])

<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>
<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>


How many links did we obtain?

In [23]:
# Print the total number of <a> elements found in the HTML document
print(len(a_tags))
# This gives a sense of how many hyperlinks the parsed page contains.


270


That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get many hits, many of which you might not want. Remember, the `a` tag defines a hyperlink, so you'll usually find many on any given page.

What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes? 

We can do this by passing an additional argument to `find_all`. In the example below, we find all the `a` tags and keep only those with `class_="sidemenu"`.

In [24]:
# Get only the 'a' elements with the class 'sidemenu'
side_menus = soup("a", class_="sidemenu")
# Print the first 5 links found with the class 'sidemenu'
print(side_menus[:5])

[]


Another, often more concise, way to search for elements on a website is via a **CSS selector**. For this we use a different method called `select()`. Pass a CSS selector string to `.select()` to get all the elements that match it.

In the example above, we can use `"a.sidemenu"` as a CSS selector, which returns all `a` tags with class `sidemenu`.

In [25]:
# Get the elements matching the CSS selector "a.sidemenu"
selected = soup.select("a.sidemenu")
# Print the first 5 elements found
print(selected[:5])
# soup.select("a.sidemenu") returns every <a> element that has the class sidemenu.

[]
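
The sketch below compares the two search styles on a small, made-up HTML fragment, so it runs without a network connection. The class names `nav` and `external` are invented for this example. It also shows that a search matching nothing returns an empty list, which is what you would see if a page does not contain the class you asked for.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; class names are hypothetical.
html = """
<ul>
  <li><a class="nav" href="/home">Home</a></li>
  <li><a class="nav" href="/about">About</a></li>
  <li><a class="external" href="https://example.com">Elsewhere</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Search by tag only: every <a> element.
all_links = soup.find_all("a")

# Search by tag and class, first with a keyword argument,
# then with the equivalent CSS selector.
nav_by_keyword = soup.find_all("a", class_="nav")
nav_by_selector = soup.select("a.nav")

print(len(all_links))
print(nav_by_keyword == nav_by_selector)

# A class that does not occur in the document gives an empty list.
missing = soup.select("a.sidemenu")
print(missing)
```

Checking for an empty result before indexing into it avoids an `IndexError` later on.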


## 🥊 Challenge: Find All

Use BeautifulSoup to find all the `a` elements with class `mainmenu`.

In [None]:
# Find all 'a' elements with class 'mainmenu'
elementos_mainmenu = soup.find_all('a', class_='mainmenu')

# Alternative syntax (both work the same way):
# elementos_mainmenu = soup.find_all('a', {'class': 'mainmenu'})

print(f"Found {len(elementos_mainmenu)} elements with class 'mainmenu'")

# Print each element
for i, elemento in enumerate(elementos_mainmenu):
    print(f"Element {i + 1}: {elemento}")

# To get the text content of each element:
for elemento in elementos_mainmenu:
    print(f"Text: {elemento.text}")

# To get specific attributes (such as href):
for elemento in elementos_mainmenu:
    if elemento.has_attr('href'):
        print(f"Link: {elemento['href']}")

[]


## Step 4: Get Attributes and Text of Elements

Once we identify elements, we want to access the information in those elements. Usually, this means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:

In [None]:
# Get all links with the class 'sidemenu' as a list
side_menu_links = soup.select("a.sidemenu")

# Examine the first link
first_link = side_menu_links[0]
print(first_link)

# What class is this variable?
print('Class: ', type(first_link))
# This code selects all <a> elements with the class 'sidemenu' using a CSS selector,
# then takes the first link in that list and prints it to show its HTML structure.
# It then prints the type of 'first_link', which will be <class 'bs4.element.Tag'>:
# a Beautiful Soup Tag object, which gives easy access to its attributes and content.

IndexError: list index out of range

It's a Beautiful Soup tag! This means it has a `text` member:

In [None]:
# Print the text contained in the first link with the class 'sidemenu'
print(first_link.text)

Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.

💡 **Tip**: You can access a tag’s attributes by treating the tag like a dictionary:

In [None]:
# Print the value of the 'href' attribute of the first link with the class 'sidemenu'
print(first_link['href'])
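
Here is a self-contained sketch of both kinds of access, again on made-up HTML with an invented class name `menu`. It guards against an empty search result before indexing, and it uses `.get()` for attributes that may be absent, since dictionary-style access raises a `KeyError` when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second link deliberately has no href attribute.
html = '<div><a class="menu" href="/senate/">  Senate </a><a class="menu">No link</a></div>'
soup = BeautifulSoup(html, "html.parser")

links = soup.select("a.menu")

# Indexing an empty list raises IndexError, so check first.
if links:
    first_link = links[0]
    print(first_link.text)                  # raw text, whitespace included
    print(first_link.get_text(strip=True))  # text with whitespace stripped
    print(first_link["href"])               # dictionary-style attribute access
else:
    print("No matching elements found.")

# .get() returns None instead of raising when the attribute is missing.
hrefs = [link.get("href") for link in links]
print(hrefs)
```

Filtering out the `None` values afterwards leaves only the links that actually point somewhere.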

## 🥊 Challenge: Extract specific attributes

Extract all `href` attributes for each `mainmenu` URL.

In [None]:
# YOUR CODE HERE


<a id='scrape'></a>

# Scraping the Illinois General Assembly

Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of Beautiful Soup and Python.

Let's apply these skills to scrape the [Illinois 98th General Assembly](http://www.ilga.gov/senate/default.asp?GA=98).

Specifically, our goal is to scrape information on each senator, including their name, district, and party.

## Scrape and Soup the Webpage

Let's scrape and parse the webpage, using the tools we learned in the previous section.

In [None]:
# Make a GET request to the Illinois Senate page for the 98th General Assembly
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')

# Read the content of the server's response (the page's HTML)
src = req.text

# Parse the HTML with Beautiful Soup using the 'lxml' parser
soup = BeautifulSoup(src, "lxml")

## Search for the Table Elements

Our goal is to obtain the elements in the table on the webpage. Remember: rows are identified by the `tr` tag. Let's use `find_all` to obtain these elements.

In [None]:
# Get all table row (<tr>) elements from the soup object
rows = soup.find_all("tr")
# Display the total number of rows found
len(rows)

⚠️ **Warning**: Keep in mind: `find_all` gets *all* the elements with the `tr` tag. We only want some of them. If we use the 'Inspect' function in Google Chrome and look carefully, then we can use some CSS selectors to get just the rows we're interested in. Specifically, we want the inner rows of the table:

In [None]:
# Return every element on the page that matches the CSS selector 'tr tr tr'
rows = soup.select('tr tr tr')

# Print the first 5 rows found
for row in rows[:5]:
    print(row, '\n')

# The selector matches nested rows (a tr inside a tr inside a tr); printing the first 5 helps us inspect the structure of the extracted data.

It looks like we want everything after the first two rows. Let's work with a single row to start, and build our loop from there.

In [None]:
# Select the third row of the results (index 2)
example_row = rows[2]
# Print the row's HTML with indentation to make it easier to read
print(example_row.prettify())

Let's break this row down into its component cells/columns using the `select` method with CSS selectors. Looking closely at the HTML, there are a couple of ways we could do this.

* We could identify the cells by their tag `td`.
* We could use the class name `.detail`.
* We could combine both and use the selector `td.detail`.

In [None]:
# Print every <td> cell in the example row
for cell in example_row.select('td'):
    print(cell)
print()

# Print every cell with the class 'detail' in the example row
for cell in example_row.select('.detail'):
    print(cell)
print()

# Print every <td> cell that also has the class 'detail' in the example row
for cell in example_row.select('td.detail'):
    print(cell)
print()

We can confirm that these are all the same.

In [None]:
# Check that selecting all <td> cells, all cells with class 'detail', and all <td> cells
# with class 'detail' returns exactly the same list of elements for the example row.
assert example_row.select('td') == example_row.select('.detail') == example_row.select('td.detail')

Let's use the selector `td.detail` to be as specific as possible.

In [None]:
# Select only those 'td' cells with the class 'detail' in the example row
detail_cells = example_row.select('td.detail')
detail_cells  # Displays the list of <td> cells with class 'detail' in the selected row

Most of the time, we're interested in the actual **text** of a website, not its tags. Recall that to get the text of an HTML element, we use the `text` member:

In [None]:
# Keep only the text in each of the 'detail' cells
row_data = [cell.text for cell in detail_cells]

# Print the list of text extracted from the cells
print(row_data)

Looks good! Now we just use our basic Python knowledge to get the elements of this list that we want. Remember, we want the senator's name, their district, and their party.

In [None]:
# Print the senator's name (first element of row_data)
print(row_data[0])  # Name

# Print the senator's district (fourth element of row_data)
print(row_data[3])  # District

# Print the senator's party (fifth element of row_data)
print(row_data[4])  # Party
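
The same steps can be tried offline on a single made-up table row. The senator, the link targets, and the column order below are assumptions chosen to mirror the indices used above; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up row for illustration only.
html = """
<table>
  <tr>
    <td class="detail"><a href="#profile">Jane Doe</a></td>
    <td class="detail"><a href="#bills">Bills</a></td>
    <td class="detail"><a href="#committees">Committees</a></td>
    <td class="detail">7</td>
    <td class="detail">D</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match (or None) rather than a list.
row = soup.select_one("tr")

# get_text(strip=True) drops surrounding whitespace from each cell.
row_data = [cell.get_text(strip=True) for cell in row.select("td.detail")]

name = row_data[0]
district = int(row_data[3])  # convert the district to an integer
party = row_data[4]
print(name, district, party)
```

Converting the district with `int()` fails loudly if a row does not hold a number in that position, which is a useful early warning that a junk row slipped through.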

## Getting Rid of Junk Rows

We saw at the beginning that not all of the rows we got actually correspond to a senator. We'll need to do some cleaning before we can proceed forward. Take a look at some examples:

In [None]:
# Print the first row (index 0) of 'rows'
print('Row 0:\n', rows[0], '\n')
# Print the second row (index 1) of 'rows'
print('Row 1:\n', rows[1], '\n')
# Print the last row of 'rows'
print('Last Row:\n', rows[-1])

When we write our for loop, we only want it to apply to the relevant rows. So we'll need to filter out the irrelevant rows. The way to do this is to compare some of these to the rows we do want, see how they differ, and then formulate that in a conditional.

As you can imagine, there are a lot of possible ways to do this, and it'll depend on the website. We'll show some here to give you an idea of how to do this.

In [None]:
# Bad rows
print(len(rows[0]))   # Length of the first row (probably not relevant)
print(len(rows[1]))   # Length of the second row (probably not relevant)

# Good rows
print(len(rows[2]))   # Length of the third row (probably relevant)
print(len(rows[3]))   # Length of the fourth row (probably relevant)

# len() of a row counts its children, so this compares the structure of irrelevant and relevant rows.

Perhaps good rows have a length of 5. Let's check:

In [None]:
# Keep only the rows that have exactly 5 children.
# Rows holding complete senator data appear to have this length.
good_rows = [row for row in rows if len(row) == 5]

# Check a few rows to confirm that the filter works
# Print the first good row
print(good_rows[0], '\n')
# Print the second-to-last good row
print(good_rows[-2], '\n')
# Print the last good row
print(good_rows[-1])

We found a footer row in our list that we'd like to avoid. Let's try something else:

In [None]:
# Select all <td> cells with the class 'detail' in the third row (rows[2])
rows[2].select('td.detail')

In [None]:
# Try filtering the relevant table rows using CSS selectors.

# 1. Check a "bad" row (for example, the footer or an empty row)
print(rows[-1].select('td.detail'), '\n')  # Usually an empty list, because the row has no cells with the class 'detail'

# 2. Check a "good" row (one that contains senator data)
print(rows[5].select('td.detail'), '\n')  # A list of cells with the class 'detail' (useful data)

# 3. Keep every row that contains at least one cell with the class 'detail'
good_rows = [row for row in rows if row.select('td.detail')]

print("Checking rows...\n")
print(good_rows[0], '\n')      # Print the first good row (senator data)
print(good_rows[-1])           # Print the last good row

Looks like we found something that worked!
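
Why does this filter work? `select()` returns an empty list when nothing matches, and an empty list counts as false in a condition. The sketch below shows this on a made-up table with a header row and a footer row; the class name `footer` is invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up table: one header row, two data rows, one footer row.
html = """
<table>
  <tr><th>Senator</th><th>District</th><th>Party</th></tr>
  <tr><td class="detail">Jane Doe</td><td class="detail">7</td><td class="detail">D</td></tr>
  <tr><td class="detail">John Roe</td><td class="detail">12</td><td class="detail">R</td></tr>
  <tr><td class="footer">Page footer</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("tr")

# Rows without any td.detail cell give an empty (falsy) list,
# so the condition keeps only the rows that hold data cells.
good_rows = [row for row in rows if row.select("td.detail")]

print(len(rows), len(good_rows))
```

Compared with filtering on `len(row)`, this test depends on what the row contains rather than on how many child nodes it happens to have.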

## Loop it All Together

Now that we've seen how to get the data we want from one row, as well as filter out the rows we don't want, let's put it all together into a loop.

In [None]:
# Define the list that will store the senators' data
members = []

# Drop rows without relevant data (keep only rows with 'td.detail' cells)
valid_rows = [row for row in rows if row.select('td.detail')]

# Loop through all valid rows
for row in valid_rows:
    # Select only those 'td' cells with class 'detail'
    detail_cells = row.select('td.detail')
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect the relevant information: name, district, and party
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]
    # Store the data in a tuple
    senator = (name, district, party)
    # Append the tuple to the list of members
    members.append(senator)
# This loop walks the relevant rows of the HTML table, extracts each senator's name, district, and party, and stores them in a list of tuples for later use.

In [None]:
# Should be 61 (the expected number of senate members)
len(members)

Let's take a look at what we have in `members`.

In [None]:
# Print the first 5 elements of 'members', which hold information about the senators
print(members[:5])

## 🥊  Challenge: Get `href` elements pointing to members' bills 

The code above retrieves information on:  

- the senator's name,
- their district number,
- and their party.

We now want to retrieve the URL for each senator's list of bills. Each URL will follow a specific format. 

The format for the list of bills for a given senator is:

`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`

to get something like:

`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`

in which `MEMBER_ID=1911`. 

You should be able to see that, unfortunately, `MEMBER_ID` is not currently something pulled out in our scraping code.

Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.

Tips: 

* To do this, you will want to get the appropriate anchor element (`<a>`) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.
* The anchor elements' HTML will look like `<a href="/senate/Senator.asp/...">Bills</a>`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeautifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. See the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag">documentation</a> for more details.
* There are a _lot_ of different ways to use BeautifulSoup to get things done. Whatever you need to do to pull the `href` out is fine.
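
One possible way to turn a relative link into a full URL is `urljoin` from Python's standard library, which resolves the `href` against the address of the page it was found on. The sketch below assumes a made-up table cell whose anchor follows the URL format described above; the real row markup may differ.

```python
from urllib.parse import parse_qs, urljoin, urlparse

from bs4 import BeautifulSoup

# The page the rows are scraped from in this notebook.
page_url = "http://www.ilga.gov/senate/default.asp?GA=98"

# Made-up cell for illustration; only the URL format comes from the text above.
html = (
    '<td class="detail">'
    '<a href="/senate/SenatorBills.asp?MemberID=1911&amp;GA=98&amp;Primary=True">Bills</a>'
    "</td>"
)
cell = BeautifulSoup(html, "html.parser")

# Find the anchor whose text is exactly "Bills".
anchor = cell.find("a", string="Bills")
relative_path = anchor["href"]

# Resolve the relative path against the page URL.
full_path = urljoin(page_url, relative_path)
print(full_path)

# The member ID can be read back out of the query string if needed.
member_id = parse_qs(urlparse(full_path).query)["MemberID"][0]
print(member_id)
```

Plain string concatenation with the site's domain also works when every `href` starts with `/`; `urljoin` additionally copes with links that are already absolute or that are relative to the current directory.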

The code has been partially filled out for you. Fill it in where it says `#YOUR CODE HERE`. Save the path into an object called `full_path`.

In [None]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# Read the content of the server’s response
src = req.text
# Soup it
soup = BeautifulSoup(src, "lxml")
# Create empty list to store our data
members = []

# Select every element on the page that matches the CSS selector 'tr tr tr'
rows = soup.select('tr tr tr')
# Get rid of junk rows (keep only rows with 'td.detail' cells)
rows = [row for row in rows if row.select('td.detail')]

# Loop through all rows
for row in rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail') 
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]

    # Find the link to the senator's bills.
    # The link is usually the anchor with the text "Bills".
    bills_link = row.find('a', string='Bills')
    if bills_link and bills_link.has_attr('href'):
        # Build the full URL using the base domain
        full_path = 'http://www.ilga.gov' + bills_link['href']
    else:
        full_path = ''

    # Store in a tuple
    senator = (name, district, party, full_path)
    # Append to list
    members.append(senator)

# Explanation:
# This code makes an HTTP request for the Illinois senators page,
# parses it with Beautiful Soup, filters the relevant table rows,
# extracts each senator's name, district, and party,
# and also gets the direct link to each senator's bills ("Bills").
# Finally, it stores everything in a list of tuples.

In [None]:
# Uncomment to test 
# members[:5]

## 🥊  Challenge: Modularize Your Code

Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator. 

In [None]:
# YOUR CODE HERE
def get_members(url):
    # Make a GET request to the given URL and read the HTML content
    req = requests.get(url)
    src = req.text
    # Parse the HTML with Beautiful Soup using the 'lxml' parser
    soup = BeautifulSoup(src, "lxml")
    # Create an empty list to store the members' data
    members = []
    # Select the candidate table rows with the CSS selector 'tr tr tr'
    rows = soup.select('tr tr tr')
    # Keep only the rows that contain at least one <td> cell with class 'detail'
    rows = [row for row in rows if row.select('td.detail')]
    # Loop through all valid rows
    for row in rows:
        # Select only the <td> cells that have the class 'detail'
        detail_cells = row.select('td.detail')
        # Keep only the text in each of those cells
        row_data = [cell.text for cell in detail_cells]
        # Get the senator's name (first element)
        name = row_data[0]
        # Get the senator's district (fourth element) and convert it to an integer
        district = int(row_data[3])
        # Get the senator's party (fifth element)
        party = row_data[4]
        # Find the link to the senator's list of bills:
        # the anchor whose text is 'Bills'
        bill_link = row.find('a', string='Bills')
        # If the link is found, build the full URL
        if bill_link and bill_link.has_attr('href'):
            full_path = 'http://www.ilga.gov' + bill_link['href']
        else:
            full_path = ''
        # Store the information in a tuple
        senator = (name, district, party, full_path)
        # Append the tuple to the list of members
        members.append(senator)
    # Return the list of members
    return members


In [None]:
# Test your code
# Define the URL of the Illinois Senate page for the 98th General Assembly
url = 'http://www.ilga.gov/senate/default.asp?GA=98'
# Call get_members to get the list of senators
senate_members = get_members(url)
# Display the number of senators found
len(senate_members)
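
One design option worth considering is to separate parsing from fetching, so the parsing logic can be checked against a saved or hand-written HTML string without touching the network. The sketch below is only an illustration: `parse_members` is a hypothetical helper, the sample table is made up, and it uses the plain `tr` selector with `html.parser` rather than the `tr tr tr` selector and `lxml` used for the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_members(src, base_url):
    """Parse senator tuples out of an HTML string (hypothetical helper)."""
    soup = BeautifulSoup(src, "html.parser")
    members = []
    for row in soup.select("tr"):
        cells = row.select("td.detail")
        if len(cells) < 5:
            continue  # skip header, footer, and other junk rows
        row_data = [cell.get_text(strip=True) for cell in cells]
        # Look for the anchor labelled "Bills" and resolve its href, if any.
        bills_link = row.find("a", string="Bills")
        href = bills_link.get("href") if bills_link else None
        full_path = urljoin(base_url, href) if href else ""
        members.append((row_data[0], int(row_data[3]), row_data[4], full_path))
    return members


# Made-up sample page so the sketch runs offline.
sample = """
<table>
  <tr><th>Senator</th></tr>
  <tr>
    <td class="detail">Jane Doe</td>
    <td class="detail"><a href="/bills?id=1">Bills</a></td>
    <td class="detail">Committees</td>
    <td class="detail">7</td>
    <td class="detail">D</td>
  </tr>
</table>
"""
result = parse_members(sample, "http://example.com/senate/")
print(result)
```

A `get_members(url)` function could then fetch the page and pass its text to the parsing helper.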

## 🥊 Take-home Challenge: Writing a Scraper Function

We want to scrape the webpages corresponding to the bills sponsored by each senator.

Write a function called `get_bills(url)` to parse a given bills URL. This will involve:

  - requesting the URL using the <a href="http://docs.python-requests.org/en/latest/">`requests`</a> library
  - using the features of the `BeautifulSoup` library to find all of the `<td>` elements with the class `billlist`
  - return a _list_ of tuples, each with:
      - description (2nd column)
      - chamber (S or H) (3rd column)
      - the last action (4th column)
      - the last action date (5th column)
      
This function has been partially completed. Fill in the rest.

In [None]:
def get_bills(url):
    # Make an HTTP GET request to the given URL and read the HTML as text
    src = requests.get(url).text

    # Parse the HTML with Beautiful Soup (naming the parser explicitly)
    soup = BeautifulSoup(src, "lxml")

    # Select all <tr> elements in the document (table rows)
    rows = soup.select('tr')

    # Start with an empty list to hold one tuple per bill
    bills = []

    # Loop through each row found
    for row in rows:
        # YOUR CODE HERE:
        # Extract the <td> cells from the row, then pull each field out of them.
        # For example, you could do something like:
        # cols = row.find_all('td')
        # if len(cols) >= 5:
        #     bill_id = cols[0].text.strip()
        #     description = cols[1].text.strip()
        #     chamber = cols[2].text.strip()
        #     last_action = cols[3].text.strip()
        #     last_action_date = cols[4].text.strip()
        #     bill = (bill_id, description, chamber, last_action, last_action_date)
        #     bills.append(bill)
        pass  # Replace this line with your extraction code

    return bills


In [None]:
# Uncomment to test your code
# test_url = senate_members[0][3]
# get_bills(test_url)[0:5]

### Scrape All Bills

Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `senate_members` and calling `get_bills()` for each of their associated bill URLs.

**NOTE:** please call the function `time.sleep(1)` for each iteration of the loop, so that we don't destroy the state's web site.
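
The general shape of such a polite loop is sketched below with stand-in data, so it runs without contacting the site. The helper name `collect_bills`, its signature, and the fake members and fetcher are all illustrative assumptions; in the real exercise the fetcher would be your own `get_bills`.

```python
import time


def collect_bills(members, get_bills, delay=1.0):
    """Map each district number to its list of bills, pausing between requests.

    `members` is a list of (name, district, party, url) tuples and
    `get_bills` is any function that takes a URL and returns a list.
    The helper name and signature are hypothetical.
    """
    bills_dict = {}
    for name, district, party, url in members:
        bills_dict[district] = get_bills(url)
        time.sleep(delay)  # wait before the next request
    return bills_dict


# Stand-in data and fetcher so the sketch runs without network access.
fake_members = [("Jane Doe", 7, "D", "url-7"), ("John Roe", 12, "R", "url-12")]


def fake_get_bills(url):
    return [f"bill from {url}"]


demo = collect_bills(fake_members, fake_get_bills, delay=0)
print(demo)
```

With the default `delay=1.0`, a run over all senators takes at least one second per member, which is the intended trade-off for not overloading the server.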

In [None]:
# YOUR CODE HERE


In [None]:
# Uncomment to test your code
# bills_dict[52]