# Parsing HTML with BeautifulSoup (C2_S2 · Instructor demo)

**Goal:** create `soup`, choose selectors, extract text and attributes, resolve URLs, and avoid crashing when elements are missing.


# CLASS START ********************

## Downloading a page and parsing it with BeautifulSoup

👉 What you will do next: download a real page, adjust the encoding if needed, and create the BeautifulSoup soup.


In [2]:
import requests
from bs4 import BeautifulSoup

URL = "https://books.toscrape.com/"
r = requests.get(URL, timeout=10)
print("Status:", r.status_code, "| Content-Type:", r.headers.get("Content-Type"))

# If accented characters or symbols look garbled (e.g. "Â£" instead of "£"), try apparent_encoding (not always needed)
#r.encoding = r.apparent_encoding or r.encoding

soup = BeautifulSoup(r.text, "html.parser")
print("Document title:", soup.title.get_text(strip=True))



Status: 200 | Content-Type: text/html
Document title: All products | Books to Scrape - Sandbox
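
The `Content-Type` header above is `text/html` with no `charset`. In that situation `requests` may fall back to ISO-8859-1 when building `r.text`, which is one plausible reason the pound sign appears as `Â£` in the outputs further down. The sketch below reproduces the effect with plain strings and no network call. That this is exactly what happened in the recorded session is an assumption.

```python
# The pound sign takes two bytes when encoded as UTF-8.
raw = "£51.77".encode("utf-8")

# Decoding those bytes as ISO-8859-1 maps each byte to its own character.
wrong = raw.decode("iso-8859-1")
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))

# The damage is reversible as long as no bytes were lost along the way.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired == right)
```

If this is the cause, setting `r.encoding` before reading `r.text` (as in the commented-out line of the cell above) or passing `r.content` to BeautifulSoup should avoid the garbling.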


#### Extracting the title from the head

![Pantallazo](head.png)

![l](parseando.png)

In [7]:
# Extract the <title> tag, which lives inside <head>

head = soup.find("title")

print(head)

<title>
    All products | Books to Scrape - Sandbox
</title>


#### The get_text() method

In [5]:
# Without tags
head = soup.find("title").get_text()

print(head)


    All products | Books to Scrape - Sandbox



In [6]:
# Without surrounding whitespace
head = soup.find("title").get_text(strip=True)

print(head)

All products | Books to Scrape - Sandbox


#### Extracting the site heading
![Pantallazo](image.png)

In [9]:
titulo = soup.find("div", class_="col-sm-8 h1")

print(titulo)


<div class="col-sm-8 h1"><a href="index.html">Books to Scrape</a><small> We love being scraped!</small>
</div>


In [10]:
titulo = soup.find("div", class_="col-sm-8 h1")

print(titulo.get_text(strip=True))

Books to ScrapeWe love being scraped!
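
The output above glues the link text and the `<small>` text together because `get_text(strip=True)` joins the pieces with an empty string. A minimal sketch, using an inline copy of the markup printed above instead of the live page, shows two ways to handle this: pass a separator, or target the inner `<a>` only. The variable names (`demo`, `header`, `glued`, `spaced`, `link_only`) are illustrative.

```python
from bs4 import BeautifulSoup

# Inline copy of the header markup printed above.
html = (
    '<div class="col-sm-8 h1"><a href="index.html">Books to Scrape</a>'
    "<small> We love being scraped!</small>\n</div>"
)
demo = BeautifulSoup(html, "html.parser")

# Chained CSS classes match regardless of their order in the class attribute.
header = demo.select_one("div.col-sm-8.h1")

glued = header.get_text(strip=True)        # pieces joined with ""
spaced = header.get_text(" ", strip=True)  # pieces joined with a space
link_only = header.find("a").get_text(strip=True)

print(glued)
print(spaced)
print(link_only)
```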


#### find_all

In [11]:
print("\nFind all book titles (<h3> tags):")
all_book_titles = soup.find_all("h3")

len(all_book_titles)


Find all book titles (<h3> tags):


20

In [13]:
for title in all_book_titles:
    print(f"  - {title.get_text(strip=True)}")

  - A Light in the ...
  - Tipping the Velvet
  - Soumission
  - Sharp Objects
  - Sapiens: A Brief History ...
  - The Requiem Red
  - The Dirty Little Secrets ...
  - The Coming Woman: A ...
  - The Boys in the ...
  - The Black Maria
  - Starving Hearts (Triangular Trade ...
  - Shakespeare's Sonnets
  - Set Me Free
  - Scott Pilgrim's Precious Little ...
  - Rip it Up and ...
  - Our Band Could Be ...
  - Olio
  - Mesaerion: The Best Science ...
  - Libertarianism for Beginners
  - It's Only the Himalayas


#### Iterating over the products

In [14]:
url = "https://books.toscrape.com/"

In [15]:
# Send the GET request (with a timeout so the call cannot hang indefinitely)
response = requests.get(url, timeout=10)

In [16]:
soup = BeautifulSoup(response.text, "html.parser")

#### Finding all the products

![s](selectores.png)

![pantallazo](products.png)
![pantallazo](products2.png)

In [17]:
products = soup.select("article.product_pod")

In [18]:
print(products)

[<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>, <article class="product_pod">
<div class="image_container">
<a href="catalogue/tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thumbnail" src="media/cache/2

In [19]:
# name, price, and URL

product_list = []

for product in products:
    # Book name

    # Price

    # Image
    pass

![P](h3.png)

In [20]:
# name, price, and URL

product_list = []

for product in products:
    # Book name
    nombre = product.find("h3")
    print(nombre)

    # Price

    # Image

<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
<h3><a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a></h3>
<h3><a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>
<h3><a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3>
<h3><a href="catalogue/the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a></h3>
<h3><a href="catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a></h3>
<h3><a href="catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title=

In [21]:
# name, price, and URL

product_list = []

for product in products:
    # Book name
    nombre = product.find("h3").find("a")
    print(nombre)

    # Price

    # Image

<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
<a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a>
<a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a>
<a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a>
<a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a>
<a href="catalogue/the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a>
<a href="catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a>
<a href="catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title="The Coming Woman: A Novel Based on the Life of the Infamous Femini

In [23]:
# name, price, and URL

product_list = []

for product in products:
    # Book name (the title attribute holds the full, untruncated name)
    nombre = product.find("h3").find("a")["title"]
    print(nombre)

    # Price

    # Image

A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas


![p](price.png)

In [29]:
# name, price, and URL

product_list = []

for product in products:
    # Book name
    # nombre = product.find("h3").find("a")["title"]
    #print(nombre)

    # Price
    precio = product.find("div", class_="product_price").find("p", class_="price_color")
    print(precio)
    # Image

<p class="price_color">Â£51.77</p>
<p class="price_color">Â£53.74</p>
<p class="price_color">Â£50.10</p>
<p class="price_color">Â£47.82</p>
<p class="price_color">Â£54.23</p>
<p class="price_color">Â£22.65</p>
<p class="price_color">Â£33.34</p>
<p class="price_color">Â£17.93</p>
<p class="price_color">Â£22.60</p>
<p class="price_color">Â£52.15</p>
<p class="price_color">Â£13.99</p>
<p class="price_color">Â£20.66</p>
<p class="price_color">Â£17.46</p>
<p class="price_color">Â£52.29</p>
<p class="price_color">Â£35.02</p>
<p class="price_color">Â£57.25</p>
<p class="price_color">Â£23.88</p>
<p class="price_color">Â£37.59</p>
<p class="price_color">Â£51.33</p>
<p class="price_color">Â£45.17</p>


In [30]:
# name, price, and URL

product_list = []

for product in products:
    # Book name
    # nombre = product.find("h3").find("a")["title"]
    #print(nombre)

    # Price
    precio = product.find("p", class_="price_color").get_text()
    print(precio)
    # Image

Â£51.77
Â£53.74
Â£50.10
Â£47.82
Â£54.23
Â£22.65
Â£33.34
Â£17.93
Â£22.60
Â£52.15
Â£13.99
Â£20.66
Â£17.46
Â£52.29
Â£35.02
Â£57.25
Â£23.88
Â£37.59
Â£51.33
Â£45.17
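
The prices above are still strings, and the currency symbol is garbled. If a numeric value is needed later, one option is to pull out only the digits and the decimal point. This is a minimal sketch, not part of the original demo: `parse_price` is a hypothetical helper, and it assumes prices use a dot as the decimal separator, as in the output above.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the first decimal number found in text, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return Decimal(match.group()) if match else None


print(parse_price("Â£51.77"))   # works even with the garbled symbol
print(parse_price("£45.17"))
print(parse_price("In stock"))  # no number present
```

`Decimal` is used here instead of `float` so that amounts keep their exact two-decimal value.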


![p](divImage.png)

In [32]:
# name, price, and URL

product_list = []

for product in products:
    # Book name
    # nombre = product.find("h3").find("a")["title"]
    #print(nombre)

    # Price
    # precio = product.find("p", class_="price_color").get_text()
    #print(precio)

    # Image
    imagen = product.find("div", class_="image_container")
    print(imagen)

<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<div class="image_container">
<a href="catalogue/tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thumbnail" src="media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg"/></a>
</div>
<div class="image_container">
<a href="catalogue/soumission_998/index.html"><img alt="Soumission" class="thumbnail" src="media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg"/></a>
</div>
<div class="image_container">
<a href="catalogue/sharp-objects_997/index.html"><img alt="Sharp Objects" class="thumbnail" src="media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg"/></a>
</div>
<div class="image_container">
<a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html"><img alt="Sapiens: A Brief History of Humankind" class="thumbnail" src="media/cache/be/a5/bea56

In [33]:
# name, price, and URL

product_list = []

for product in products:
    # Book name
    # nombre = product.find("h3").find("a")["title"]
    #print(nombre)

    # Price
    # precio = product.find("p", class_="price_color").get_text()
    #print(precio)

    # Image
    imagen = product.find("div", class_="image_container").find("a")
    print(imagen)

<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
<a href="catalogue/tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thumbnail" src="media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg"/></a>
<a href="catalogue/soumission_998/index.html"><img alt="Soumission" class="thumbnail" src="media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg"/></a>
<a href="catalogue/sharp-objects_997/index.html"><img alt="Sharp Objects" class="thumbnail" src="media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg"/></a>
<a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html"><img alt="Sapiens: A Brief History of Humankind" class="thumbnail" src="media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg"/></a>
<a href="catalogue/the-requiem-red_995/index.html"><img alt="The Requiem Red" class="thumbnail" src="media/cache/68/33/68339b4c9bc034267e1d

##### 🔗 Problem: relative URLs

In the Books to Scrape HTML, the book links are not absolute. Example:

![s](urlsJoin.png)

In [37]:
# name, price, and URL

product_list = []

for product in products:
    # Book name
    # nombre = product.find("h3").find("a")["title"]
    #print(nombre)

    # Price
    # precio = product.find("p", class_="price_color").get_text()
    #print(precio)

    # Image (the src attribute is a relative path)
    imagen = product.find("div", class_="image_container").find("a").find("img")["src"]
    print(imagen)

media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg
media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg
media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg
media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg
media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg
media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg
media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg
media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg
media/cache/58/46/5846057e28022268153beff6d352b06c.jpg
media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg
media/cache/10/48/1048f63d3b5061cd2f424d20b3f9b666.jpg
media/cache/5b/88/5b88c52633f53cacf162c15f4f823153.jpg
media/cache/94/b1/94b1b8b244bce9677c2f29ccc890d4d2.jpg
media/cache/81/c4/81c4a973364e17d01f217e1188253d5e.jpg
media/cache/54/60/54607fe8945897cdcced0044103b10b6.jpg
media/cache/55/33/553310a7162dfbc2c6d19a84da0df9e1.jpg
media/cache/09/a3/09a3aef48557576e1a85ba7efea8ecb7.jpg
media/cach

#### Ready to save

`media/cache/27/a5/27a53d0bb95bdd88288eaf66c9230d7e.jpg` is relative: you cannot open it directly until you complete it with the site's base URL.

If you call requests.get(href_rel) with a value like that, it raises an error (`MissingSchema`) because the string has no scheme or domain.

In [38]:
from urllib.parse import urljoin
# Base URL of the site
base_url = "https://books.toscrape.com/"

product_list = []

for product in products:
    # Book name
    # nombre = product.find("h3").find("a")["title"]
    #print(nombre)

    # Price
    # precio = product.find("p", class_="price_color").get_text()
    #print(precio)

    # Image (relative path)
    imagen = product.find("div", class_="image_container").find("a").find("img")["src"]
    #imagen_url = "https://books.toscrape.com/" + imagen

    # Build the absolute URL with urljoin
    imagen_url = urljoin(base_url, imagen)

    print(imagen_url)

https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg
https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg
https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
https://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg
https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg
https://books.toscrape.com/media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg
https://books.toscrape.com/media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg
https://books.toscrape.com/media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg
https://books.toscrape.com/media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg
https://books.toscrape.com/media/cache/58/46/5846057e28022268153beff6d352b06c.jpg
https://books.toscrape.com/media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg
https://books.toscrape.com/media/cache/10/48/1048f63d3b5061cd2f424d20b3f9b666.jpg
https://books.to
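
String concatenation happens to work on the home page, but `urljoin` also handles cases where the page is not at the site root. The sketch below compares both approaches. The page URL `catalogue/page-2.html` and the `../`-prefixed path are assumptions chosen for illustration, not values taken from the outputs above.

```python
from urllib.parse import urljoin

src = "media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"

# Case 1: the page is the site root, so the relative path is simply appended.
from_root = urljoin("https://books.toscrape.com/", src)

# Case 2 (assumed): the page lives one level down and the path starts with "../".
page = "https://books.toscrape.com/catalogue/page-2.html"
from_subdir = urljoin(page, "../" + src)

# Case 3: an already absolute URL is returned unchanged.
absolute = urljoin(page, "https://example.com/cover.jpg")

# Naive concatenation leaves the "../" segment unresolved.
naive = "https://books.toscrape.com/" + "../" + src

print(from_root)
print(from_subdir)
print(absolute)
print(naive)
```

Passing the URL of the page that was actually fetched as the base is usually the safer choice, since relative paths are resolved against that page.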

#### find/find_all vs select/select_one




![f](article.png)

#### Example with find_all

In [39]:
# All the articles (books)
cards = soup.find_all("article", class_="product_pod")
print("Total books on the page:", len(cards))



Total books on the page: 20


#### Relationship to CSS selectors

The same search can be written with CSS selectors:

In [41]:
# All the articles
cards_css = soup.select("article.product_pod")
print(len(cards_css))  # also 20

# Only the first title
first_css = soup.select_one("article.product_pod h3 a")["title"]
print("First title with CSS:", first_css)


20
First title with CSS: A Light in the Attic
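
CSS selectors can also be applied to a single card rather than to the whole soup, which keeps each lookup scoped to one product. The sketch below uses a trimmed inline copy of the first `article.product_pod` printed earlier. Reading the star rating from the second CSS class is an extra illustration that the original demo does not cover.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first product card shown earlier in the notebook.
html = """
<article class="product_pod">
<p class="star-rating Three"></p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<p class="price_color">£51.77</p>
</article>
"""
card = BeautifulSoup(html, "html.parser").select_one("article.product_pod")

# "h3 > a" matches an <a> that is a direct child of <h3>.
link = card.select_one("h3 > a")
title = link["title"]
href = link["href"]

# The class attribute is returned as a list; the rating is the class that is not "star-rating".
classes = card.select_one("p.star-rating").get("class", [])
rating = next((c for c in classes if c != "star-rating"), "")

print(title, href, rating)
```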


##### Reinforcement

In [43]:
# Compare product counts with find_all vs select
# (uses the 'soup' created from https://books.toscrape.com/)

cards_find = soup.find_all("article", class_="product_pod")
cards_css = soup.select("article.product_pod")

print(f"find_all -> {len(cards_find)} | select -> {len(cards_css)}")

# Compare the first title found with both approaches
first_find = cards_find[0].find("h3").find("a")["title"]
first_css = soup.select_one("article.product_pod h3 a")["title"]

print("First title (find):", first_find)
print("First title (select):", first_css)


find_all -> 20 | select -> 20
First title (find): A Light in the Attic
First title (select): A Light in the Attic
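
Both families also behave the same way when nothing matches: the single-result methods return `None`, and the multi-result methods return an empty list. A minimal sketch on an inline snippet (a made-up card with no price element) illustrates this.

```python
from bs4 import BeautifulSoup

# A card without any <p class="price_color"> element.
html = '<article class="product_pod"><h3><a title="Olio">Olio</a></h3></article>'
demo = BeautifulSoup(html, "html.parser")

found_one = demo.find("p", class_="price_color")      # single result or None
found_all = demo.find_all("p", class_="price_color")  # list, possibly empty
css_one = demo.select_one("p.price_color")            # single result or None
css_all = demo.select("p.price_color")                # list, possibly empty

print(found_one, css_one)
print(len(found_all), len(css_all))
```

Looping over an empty result is harmless, whereas calling a method on `None` raises an `AttributeError`.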


### Safety pattern

⚠️ Typical problem

When scraping, not every element is always present in the HTML.
Examples:

A product with no image.

A broken link with no href.

A price that is missing.

If you do this:

In [44]:
price = product.select_one("p.price_color").get_text(strip=True)

price


'Â£45.17'

and that element does not exist, you get an error:

In [45]:
price = product.select_one("a.price_color").get_text(strip=True)

price

AttributeError: 'NoneType' object has no attribute 'get_text'

✅ Solution: safe fallback

The idea is to never assume the node exists.
Use this pattern:

In [None]:
tag = soup.select_one("...")  # "..." stands for your selector; select_one returns None when nothing matches
texto = tag.get_text(strip=True) if tag else ""

If the node exists → you use it.

If it does not exist → you get "" instead of a crash.

🔎 Why is this useful?

Your code does not blow up on a NoneType.

If a value is missing, the dictionary holds an empty "" → easier to clean up later.

You can iterate over the whole page even when some products have incomplete data.
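
The fallback pattern can be wrapped in small helpers so that each field takes one line. This is a sketch under stated assumptions: `safe_text` and `safe_attr` are hypothetical helper names, not BeautifulSoup functions, and the inline HTML is a made-up pair of cards where the second one lacks both the `title` attribute and the price.

```python
from bs4 import BeautifulSoup


def safe_text(node, selector, default=""):
    """Text of the first match under node, or default when nothing matches."""
    tag = node.select_one(selector)
    return tag.get_text(strip=True) if tag else default


def safe_attr(node, selector, attr, default=""):
    """Attribute of the first match, or default when the tag or attribute is missing."""
    tag = node.select_one(selector)
    return tag.get(attr, default) if tag else default


html = """
<article class="product_pod"><h3><a title="Olio">Olio</a></h3><p class="price_color">£23.88</p></article>
<article class="product_pod"><h3><a>No title attribute</a></h3></article>
"""
demo = BeautifulSoup(html, "html.parser")

rows = [
    {
        "nombre": safe_attr(card, "h3 a", "title"),
        "precio": safe_text(card, "p.price_color"),
    }
    for card in demo.select("article.product_pod")
]
print(rows)
```

The second card yields empty strings instead of raising, so the loop reaches every product.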

### Final example

In [46]:
from urllib.parse import urljoin

# Base URL of the site
base_url = "https://books.toscrape.com/"

product_list = []

for product in products:
    # Book name: guard each step so a missing tag or attribute yields ""
    h3 = product.find("h3")
    a_tag = h3.find("a") if h3 else None
    nombre = a_tag["title"] if a_tag and a_tag.has_attr("title") else ""
    #print(nombre)

    # Price
    p_price = product.find("p", class_="price_color")
    precio = p_price.get_text(strip=True) if p_price else ""
    #print(precio)

    # Image (relative path)
    img_tag = product.find("div", class_="image_container")
    img_tag = img_tag.find("img") if img_tag else None
    imagen = img_tag["src"] if img_tag and img_tag.has_attr("src") else ""

    # Build the absolute URL with urljoin (only when an image path was found)
    imagen_url = urljoin(base_url, imagen) if imagen else ""

    print(nombre, precio, imagen_url)

    product_list.append({
        "nombre": nombre,
        "precio": precio,
        "imagen_url": imagen_url
    })


A Light in the Attic Â£51.77 https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg
Tipping the Velvet Â£53.74 https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg
Soumission Â£50.10 https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
Sharp Objects Â£47.82 https://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg
Sapiens: A Brief History of Humankind Â£54.23 https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg
The Requiem Red Â£22.65 https://books.toscrape.com/media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg
The Dirty Little Secrets of Getting Your Dream Job Â£33.34 https://books.toscrape.com/media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull Â£17.93 https://books.toscrape.com/media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg
The Boys in the Boat: 