# Web Scraping with Requests and BeautifulSoup

This notebook shows how to scrape https://quotes.toscrape.com using `requests` and `beautifulsoup4`.

In [1]:
# Import the required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from typing import List, Dict

## 1. Fetch the page content

In [2]:
# URL of the site to scrape
url = "https://quotes.toscrape.com"

# Send the HTTP request (the timeout keeps the cell from hanging indefinitely)
response = requests.get(url, timeout=10)

# Check that the request succeeded; raises requests.HTTPError on 4xx/5xx
response.raise_for_status()
print(f"Status Code: {response.status_code}")
print(f"Content Length: {len(response.content)} bytes")

# Create the BeautifulSoup object to parse the HTML
soup = BeautifulSoup(response.content, 'html.parser')
print("\nHTML parsed successfully!")

Status Code: 200
Content Length: 11064 bytes

HTML parsed successfully!


## 2. List the first 5 quotes

In [3]:
# Find all quotes on the page
quotes = soup.find_all('div', class_='quote')

# Print the first 5 quotes
print("First 5 quotes:\n")
print("=" * 80)

for i, quote in enumerate(quotes[:5], 1):
    # Extract the quote text
    quote_text = quote.find('span', class_='text').get_text()
    print(f"\n{i}. {quote_text}")
    print("-" * 80)

First 5 quotes:

================================================================================

1. “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
--------------------------------------------------------------------------------

2. “It is our choices, Harry, that show what we truly are, far more than our abilities.”
--------------------------------------------------------------------------------

3. “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
--------------------------------------------------------------------------------

4. “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
--------------------------------------------------------------------------------

5. “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
--------------------------------------------------------------------------------


## 3. List the authors of the first 5 quotes

In [4]:
# Extract the authors of the first 5 quotes
print("Authors of the first 5 quotes:\n")
print("=" * 80)

authors = []
for i, quote in enumerate(quotes[:5], 1):
    author = quote.find('small', class_='author').get_text()
    authors.append(author)
    print(f"{i}. {author}")

print("\n" + "=" * 80)
print(f"\nTotal unique authors: {len(set(authors))}")

Authors of the first 5 quotes:

================================================================================
1. Albert Einstein
2. J.K. Rowling
3. Albert Einstein
4. Jane Austen
5. Marilyn Monroe

================================================================================

Total unique authors: 4
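
`len(set(authors))` gives the number of distinct authors but discards the order in which they appeared. If that order matters, one option is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each name. This is a small sketch rather than part of the original notebook; the `authors` list is hard-coded from the output above so the snippet runs on its own, and `unique_authors_in_order` is a name chosen here for illustration.

```python
# Authors of the first 5 quotes, copied from the output above.
authors = [
    "Albert Einstein",
    "J.K. Rowling",
    "Albert Einstein",
    "Jane Austen",
    "Marilyn Monroe",
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates while preserving first-seen order.
unique_authors_in_order = list(dict.fromkeys(authors))

print(unique_authors_in_order)
```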


## 4. List the tags of the first 5 quotes

In [5]:
# Extract the tags of the first 5 quotes
print("Tags of the first 5 quotes:\n")
print("=" * 80)

all_tags = []
for i, quote in enumerate(quotes[:5], 1):
    # Find all tag links inside this quote
    tag_elements = quote.find_all('a', class_='tag')
    tags = [tag.get_text() for tag in tag_elements]
    all_tags.extend(tags)

    print(f"\nQuote {i}:")
    print(f"  Tags: {', '.join(tags)}")

print("\n" + "=" * 80)
print(f"\nTotal unique tags: {len(set(all_tags))}")
print(f"Unique tags: {', '.join(sorted(set(all_tags)))}")

Tags of the first 5 quotes:

================================================================================

Quote 1:
  Tags: change, deep-thoughts, thinking, world

Quote 2:
  Tags: abilities, choices

Quote 3:
  Tags: inspirational, life, live, miracle, miracles

Quote 4:
  Tags: aliteracy, books, classic, humor

Quote 5:
  Tags: be-yourself, inspirational

================================================================================

Total unique tags: 16
Unique tags: abilities, aliteracy, be-yourself, books, change, choices, classic, deep-thoughts, humor, inspirational, life, live, miracle, miracles, thinking, world


## 5. Build a DataFrame with the full information

In [6]:
# Build a list of dictionaries with the data for the first 5 quotes
quotes_data = []

for quote in quotes[:5]:
    quote_text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    tag_elements = quote.find_all('a', class_='tag')
    tags = [tag.get_text() for tag in tag_elements]

    quotes_data.append({
        'quote': quote_text,
        'author': author,
        'tags': ', '.join(tags),
        'num_tags': len(tags)
    })

# Create the DataFrame
df = pd.DataFrame(quotes_data)

# Display the DataFrame
print("DataFrame with the first 5 quotes:\n")
print(df.to_string(index=False))

DataFrame with the first 5 quotes:

                                                                                                                              quote          author                                         tags  num_tags
                “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” Albert Einstein       change, deep-thoughts, thinking, world         4
                                              “It is our choices, Harry, that show what we truly are, far more than our abilities.”    J.K. Rowling                           abilities, choices         2
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” Albert Einstein inspirational, life, live, miracle, miracles         5
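                           “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”     Jane Austen             aliteracy, books, classic, humor         4
                    “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”  Marilyn Monroe                   be-yourself, inspirational         2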

## 6. Extract all quotes on the page

In [7]:
# Extract every quote on the page (not just the first 5)
all_quotes_data = []

for quote in quotes:
    quote_text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    tag_elements = quote.find_all('a', class_='tag')
    tags = [tag.get_text() for tag in tag_elements]

    all_quotes_data.append({
        'quote': quote_text,
        'author': author,
        'tags': ', '.join(tags),
        'num_tags': len(tags)
    })

# Create a DataFrame with all the quotes
df_all = pd.DataFrame(all_quotes_data)

print(f"Total quotes on the page: {len(df_all)}")
print("\nFirst 3 rows:\n")
print(df_all.head(3).to_string(index=False))

Total quotes on the page: 10

First 3 rows:

                                                                                                                              quote          author                                         tags  num_tags
                “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” Albert Einstein       change, deep-thoughts, thinking, world         4
                                              “It is our choices, Harry, that show what we truly are, far more than our abilities.”    J.K. Rowling                           abilities, choices         2
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” Albert Einstein inspirational, life, live, miracle, miracles         5
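
The extraction loops in sections 5 and 6 are identical apart from the slice, so the per-quote logic could live in one helper. The sketch below is one way to do that and is not part of the original notebook: `parse_quote`, `SAMPLE_HTML`, and `parsed_rows` are hypothetical names, and the sample markup is a hand-written imitation of the structure the notebook queries (`div.quote`, `span.text`, `small.author`, `a.tag`), used so the snippet runs without a network request. It also returns `None` instead of raising when an element is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the markup queried in the notebook:
# div.quote containing span.text, small.author and a.tag elements.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">"Sample quote."</span>
  <span>by <small class="author">Sample Author</small></span>
  <div class="tags">
    <a class="tag" href="/tag/one/">one</a>
    <a class="tag" href="/tag/two/">two</a>
  </div>
</div>
"""


def parse_quote(quote_div):
    """Turn one div.quote element into the dict layout used in sections 5 and 6."""
    text_el = quote_div.find("span", class_="text")
    author_el = quote_div.find("small", class_="author")
    tags = [a.get_text(strip=True) for a in quote_div.find_all("a", class_="tag")]
    return {
        # Fall back to None when the expected element is absent.
        "quote": text_el.get_text(strip=True) if text_el else None,
        "author": author_el.get_text(strip=True) if author_el else None,
        "tags": ", ".join(tags),
        "num_tags": len(tags),
    }


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
parsed_rows = [parse_quote(div) for div in sample_soup.find_all("div", class_="quote")]
print(parsed_rows)
```

With a helper like this, both DataFrames could be built from a list comprehension, for example `pd.DataFrame([parse_quote(q) for q in quotes])`.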


## 7. Basic statistics

In [8]:
# Basic statistics about the quotes
print("Quote statistics:\n")
print("=" * 80)

# Total number of quotes
print(f"Total quotes: {len(df_all)}")

# Number of unique authors
unique_authors = df_all['author'].unique()
print(f"Number of unique authors: {len(unique_authors)}")

# Most frequent authors
author_counts = df_all['author'].value_counts()
print("\nMost frequent authors:")
print(author_counts)

# Average number of tags per quote
avg_tags = df_all['num_tags'].mean()
print(f"\nAverage tags per quote: {avg_tags:.2f}")

# Quote with the most tags
max_tags_idx = df_all['num_tags'].idxmax()
print(f"\nQuote with the most tags ({df_all.loc[max_tags_idx, 'num_tags']} tags):")
print(f"  {df_all.loc[max_tags_idx, 'quote']}")
print(f"  Author: {df_all.loc[max_tags_idx, 'author']}")

Quote statistics:

================================================================================
Total quotes: 10
Number of unique authors: 8

Most frequent authors:
author
Albert Einstein      3
J.K. Rowling         1
Jane Austen          1
Marilyn Monroe       1
André Gide           1
Thomas A. Edison     1
Eleanor Roosevelt    1
Steve Martin         1
Name: count, dtype: int64

Average tags per quote: 3.00

Quote with the most tags (5 tags):
  “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
  Author: Albert Einstein
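
As a sanity check on the pandas results, the same kind of summary can be computed with the standard library alone. The sketch below is an illustration, not part of the original notebook: `sample_rows` is hard-coded from the first five quotes shown in sections 3 and 4, so its numbers differ from the 10-quote output above, and the variable names are chosen here for the example.

```python
from collections import Counter
from statistics import mean

# Author and tag count for the first 5 quotes, copied from sections 3 and 4.
sample_rows = [
    {"author": "Albert Einstein", "num_tags": 4},
    {"author": "J.K. Rowling", "num_tags": 2},
    {"author": "Albert Einstein", "num_tags": 5},
    {"author": "Jane Austen", "num_tags": 4},
    {"author": "Marilyn Monroe", "num_tags": 2},
]

# Counter plays the role of value_counts(); max() with a key replaces idxmax().
counts_by_author = Counter(row["author"] for row in sample_rows)
mean_tags = mean(row["num_tags"] for row in sample_rows)
most_tagged = max(sample_rows, key=lambda row: row["num_tags"])

print(f"Unique authors: {len(counts_by_author)}")
print(f"Most frequent author: {counts_by_author.most_common(1)[0]}")
print(f"Average tags per quote: {mean_tags:.2f}")
print(f"Most tags: {most_tagged['num_tags']} ({most_tagged['author']})")
```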


## 8. Search quotes by author

In [9]:
# Helper to search quotes by author
def get_quotes_by_author(df: pd.DataFrame, author_name: str) -> pd.DataFrame:
    """Return all quotes whose author contains `author_name` (case-insensitive)."""
    # regex=False treats the name literally, so the dots in "J.K. Rowling" are not wildcards
    return df[df['author'].str.contains(author_name, case=False, regex=False)]

# Example: search for quotes by Albert Einstein
einstein_quotes = get_quotes_by_author(df_all, "Einstein")

print(f"Quotes by Albert Einstein ({len(einstein_quotes)}):\n")
print("=" * 80)

for _, row in einstein_quotes.iterrows():
    print(f"\n{row['quote']}")
    print(f"Tags: {row['tags']}")
    print("-" * 80)

Quotes by Albert Einstein (3):

================================================================================

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Tags: change, deep-thoughts, thinking, world
--------------------------------------------------------------------------------

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Tags: inspirational, life, live, miracle, miracles
--------------------------------------------------------------------------------

“Try not to become a man of success. Rather become a man of value.”
Tags: adulthood, success, value
--------------------------------------------------------------------------------
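
The same idea extends to filtering by tag, with one caveat: because `tags` is stored as a comma-joined string, a substring search for `think` would also hit `thinking`. The sketch below splits the string first and compares whole tags. It is an illustration only: `get_quotes_by_tag` is a hypothetical helper that is not in the original notebook, and it works on a plain list of dicts shaped like `all_quotes_data`, filled here with two rows copied from the outputs above.

```python
from typing import Dict, List

# Two rows copied from the outputs above, in the same shape as all_quotes_data.
sample_quotes: List[Dict[str, str]] = [
    {"author": "Albert Einstein", "tags": "change, deep-thoughts, thinking, world"},
    {"author": "J.K. Rowling", "tags": "abilities, choices"},
]


def get_quotes_by_tag(rows: List[Dict[str, str]], tag: str) -> List[Dict[str, str]]:
    """Return the rows whose tag list contains `tag` as a whole tag (case-insensitive)."""
    wanted = tag.strip().lower()
    return [
        row for row in rows
        # Split the joined string back into individual tags before comparing.
        if wanted in (t.strip().lower() for t in row["tags"].split(","))
    ]


# "thinking" is a whole tag on the first row; "think" is only a prefix of one.
print(len(get_quotes_by_tag(sample_quotes, "thinking")))
print(len(get_quotes_by_tag(sample_quotes, "think")))
```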


## 9. List all unique tags and their frequency

In [10]:
# Collect every tag and count how often each one appears
from collections import Counter

all_tags_list = []
for tags_str in df_all['tags']:
    if tags_str:
        all_tags_list.extend([tag.strip() for tag in tags_str.split(',')])

# Count the frequency of each tag
tag_counts = Counter(all_tags_list)

# Show the tags sorted by frequency
print("Tags and their frequency:\n")
print("=" * 80)

for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")

print(f"\nTotal unique tags: {len(tag_counts)}")

Tags and their frequency:

================================================================================
inspirational: 3
life: 2
humor: 2
change: 1
deep-thoughts: 1
thinking: 1
world: 1
abilities: 1
choices: 1
live: 1
miracle: 1
miracles: 1
aliteracy: 1
books: 1
classic: 1
be-yourself: 1
adulthood: 1
success: 1
value: 1
love: 1
edison: 1
failure: 1
paraphrased: 1
misattributed-eleanor-roosevelt: 1
obvious: 1
simile: 1

Total unique tags: 26
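
Splitting the joined `tags` string works here, but it only round-trips because no tag contains a comma. An alternative sketch, assuming the tags are kept as lists at scraping time (the notebook joins them into a string instead), counts them directly with `Counter` and `itertools.chain`. The lists below are copied from the section 4 output for the first five quotes, so the totals differ from the 10-quote output above; `tags_per_quote` and `tag_counter` are names chosen for this example.

```python
from collections import Counter
from itertools import chain

# Tag lists for the first 5 quotes, copied from the section 4 output.
tags_per_quote = [
    ["change", "deep-thoughts", "thinking", "world"],
    ["abilities", "choices"],
    ["inspirational", "life", "live", "miracle", "miracles"],
    ["aliteracy", "books", "classic", "humor"],
    ["be-yourself", "inspirational"],
]

# chain.from_iterable flattens the nested lists without building an intermediate list.
tag_counter = Counter(chain.from_iterable(tags_per_quote))

for tag, count in tag_counter.most_common(3):
    print(f"{tag}: {count}")
print(f"Unique tags: {len(tag_counter)}")
```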


## 10. Save the data to a CSV file

In [11]:
# Save the DataFrame to a CSV file
output_file = 'quotes_data.csv'
df_all.to_csv(output_file, index=False, encoding='utf-8')

print(f"Data saved to '{output_file}'")
print(f"Total records: {len(df_all)}")
print("\nFirst rows of the saved data:")
print(df_all.head().to_string(index=False))

Data saved to 'quotes_data.csv'
Total records: 10

First rows of the saved data:
                                                                                                                              quote          author                                         tags  num_tags
                “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” Albert Einstein       change, deep-thoughts, thinking, world         4
                                              “It is our choices, Harry, that show what we truly are, far more than our abilities.”    J.K. Rowling                           abilities, choices         2
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” Albert Einstein inspirational, life, live, miracle, miracles         5
                           “The person, be it gentleman or lady, who has not pleasure in a good no