**Team members:** Wilson Inga and Anthony Reinoso

# Exercise 12: Web Scraping

## Objective

The goal of this exercise is to build a web scraper that collects data from a website.

## Part 0: Planning
1. Identify the data you want to collect.
2. Choose the target website.
3. Plan the structure of the corpus.
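
One possible way to approach step 3 is to fix the fields of each corpus record before writing any scraping code. The sketch below is only an illustration: `RecipeRecord` and `to_text` are hypothetical names, and the fields simply mirror the ones this notebook extracts later (title, description, ingredients, instructions, nutrition facts).

```python
from dataclasses import dataclass, field, asdict


@dataclass
class RecipeRecord:
    """Hypothetical container for one scraped recipe."""
    url: str
    title: str
    description: str = ""
    ingredients: list = field(default_factory=list)
    instructions: list = field(default_factory=list)
    nutrition_facts: list = field(default_factory=list)

    def to_text(self):
        """Flatten the record into one text block, skipping empty fields."""
        parts = [self.title, self.description, *self.ingredients,
                 *self.instructions, *self.nutrition_facts]
        return "\n".join(part for part in parts if part)


# Values taken from the rotisserie chicken page used in this notebook
record = RecipeRecord(
    url="https://www.allrecipes.com/recipe/93168/rotisserie-chicken/",
    title="Rotisserie Chicken",
    ingredients=["1 (3 pound) whole chicken", "1 pinch salt", "¼ cup butter, melted"],
)
print(record.to_text())
print(sorted(asdict(record)))
```

Keeping the structured fields next to the flattened text makes it possible to answer field-specific questions later without parsing the text again.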

## Part 1: Understanding the target website

- Analyze the structure of the target web page.
- Identify the HTML elements that contain the data of interest.
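
To get a first overview of which elements might hold the data, one option is to count how often each tag and class combination appears. The sketch below uses only the standard library `html.parser` module rather than BeautifulSoup; `ClassCounter` is a hypothetical helper, and `sample_html` is a made-up snippet that reuses class names seen later in this notebook.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts (tag, class) pairs while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # A class attribute can hold several space-separated names
        classes = dict(attrs).get("class") or ""
        for name in classes.split():
            self.counts[(tag, name)] += 1


# Made-up snippet for illustration only
sample_html = """
<ul>
  <li class="mm-recipes-structured-ingredients__list-item">1 pinch salt</li>
  <li class="mm-recipes-structured-ingredients__list-item">1 tablespoon salt</li>
</ul>
<div class="mm-recipes-details__item">
  <div class="mm-recipes-details__label">Servings:</div>
  <div class="mm-recipes-details__value">6</div>
</div>
"""

counter = ClassCounter()
counter.feed(sample_html)
for (tag, name), count in counter.counts.most_common(3):
    print(f'{count:3d}  <{tag} class="{name}">')
```

Classes that repeat many times are often list items or table rows, which makes them good candidates for `find_all`.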

In [27]:
from bs4 import BeautifulSoup

html_path = '/content/rotisserie-chicken.html'

# Load the HTML file (separate names avoid shadowing the path with the file handle)
with open(html_path, "r", encoding="utf-8") as html_file:
    html_content = html_file.read()

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

In [28]:
# Extracting the recipe title
title = soup.find("meta", {"property": "og:title"})["content"]
title

'Rotisserie Chicken'

## Ingredients

In [29]:
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
for ingredient in ingredients_section:
    print(ingredient.text.strip())

1 (3 pound) whole chicken
1 pinch salt
¼ cup butter, melted
1 tablespoon salt
1 tablespoon ground paprika
¼ tablespoon ground black pepper
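
The ingredient lines above mix plain digits with Unicode fraction characters such as `¼`. If numeric quantities are needed later, a rough sketch like the following could split the leading amount from the rest of the line. `parse_quantity` is a hypothetical helper, and it only handles a leading whole number and/or a single fraction character.

```python
import re
import unicodedata

# Optional whole number, optional vulgar fraction character, then the rest
_QUANTITY = re.compile(r"\s*(\d+)?\s*([\u00bc-\u00be\u2150-\u215e])?\s*(.*)")


def parse_quantity(text):
    """Return (quantity, rest); quantity is None when no leading amount exists."""
    whole, fraction, rest = _QUANTITY.match(text).groups()
    if whole is None and fraction is None:
        return None, text.strip()
    quantity = float(whole or 0)
    if fraction:
        # unicodedata.numeric maps '¼' to 0.25, '½' to 0.5, and so on
        quantity += unicodedata.numeric(fraction)
    return quantity, rest


for line in ["1 (3 pound) whole chicken", "¼ cup butter, melted",
             "salt and pepper to taste"]:
    print(parse_quantity(line))
```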


## Rating

In [30]:
calification = float(soup.find("div", class_="comp mm-recipes-review-bar__rating mntl-text-block text-label-300").text.strip())
calification

4.7
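
The rating cell above raises an exception if the element is missing, because `find` returns `None` in that case, and it depends on the full class string matching exactly. A more defensive variant might use a CSS selector on a single class. This is a sketch under assumptions: `first_text` is a hypothetical helper, and `sample` is a one-line stand-in for the real page.

```python
from bs4 import BeautifulSoup


def first_text(soup, selector, default=None):
    """Return the stripped text of the first match, or default if none."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else default


# Stand-in markup that reuses the class names from the cell above
sample = ('<div class="comp mm-recipes-review-bar__rating '
          'mntl-text-block text-label-300"> 4.7 </div>')
soup_demo = BeautifulSoup(sample, "html.parser")

raw = first_text(soup_demo, "div.mm-recipes-review-bar__rating")
rating = float(raw) if raw is not None else None
missing = first_text(soup_demo, "div.does-not-exist")
print(rating, missing)
```

Selecting on one distinctive class keeps the lookup working even if the site adds or reorders the other utility classes.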

## Time and Servings


In [31]:
contenedor_tiempos = soup.find("div", class_="mm-recipes-details__content")

tiempos = contenedor_tiempos.find_all("div", class_="mm-recipes-details__item")

for tiempo in tiempos:
    etiqueta = tiempo.find("div", class_="mm-recipes-details__label").text.strip()
    valor = tiempo.find("div", class_="mm-recipes-details__value").text.strip()
    print(f"{etiqueta} {valor}")

Prep Time: 10 mins
Cook Time: 1 hr 10 mins
Additional Time: 10 mins
Total Time: 1 hr 30 mins
Servings: 6
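
The time values above are free text such as `1 hr 10 mins`. To compare or sum them, they could be converted to minutes. The sketch below assumes the only units are days, hours (`hr`/`hour`) and minutes (`min`); `to_minutes` is a hypothetical helper.

```python
import re

_UNIT_MINUTES = {"day": 1440, "hour": 60, "hr": 60, "min": 1}


def to_minutes(text):
    """Sum every '<number> <unit>' pair in the text, expressed in minutes."""
    total = 0
    for amount, unit in re.findall(r"(\d+)\s*(day|hour|hr|min)s?", text.lower()):
        total += int(amount) * _UNIT_MINUTES[unit]
    return total


# Values copied from the output above
times = {
    "Prep Time": "10 mins",
    "Cook Time": "1 hr 10 mins",
    "Additional Time": "10 mins",
    "Total Time": "1 hr 30 mins",
}
minutes = {label: to_minutes(value) for label, value in times.items()}
print(minutes)

# Sanity check: the three partial times should add up to the total
parts = minutes["Prep Time"] + minutes["Cook Time"] + minutes["Additional Time"]
print(parts == minutes["Total Time"])
```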


## Nutrition Facts


In [32]:
nutri_tabla = soup.find("div", id="mm-recipes-nutrition-facts-summary_1-0")
filas = nutri_tabla.find_all("tr")

print("Información nutricional (por porción):")
for fila in filas:
    valor = fila.find("td", class_="text-body-100-prominent").text.strip()
    etiqueta = fila.find("td", class_="text-body-100").text.strip()
    print(f"{etiqueta}: {valor}")

Información nutricional (por porción):
Calories: 357
Fat: 25g
Carbs: 1g
Protein: 31g
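
The nutrition values above combine a number and an optional unit (`357`, `25g`). A small sketch for separating the two, in case numeric comparisons are needed, could look like this; `split_amount` is a hypothetical helper and assumes the value is a single number followed by an optional alphabetic unit.

```python
import re


def split_amount(value):
    """Return (number, unit) for strings like '25g'; None if the format differs."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Zµ%]*)\s*", value)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


# Values copied from the output above
summary = {"Calories": "357", "Fat": "25g", "Carbs": "1g", "Protein": "31g"}
parsed = {label: split_amount(value) for label, value in summary.items()}
print(parsed)
```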


## Description

In [33]:
descripcion = soup.find("p", class_="article-subheading text-utility-300").text.strip()
print(descripcion)

This rotisserie chicken recipe is so easy to make with simple seasonings on your grill. Occasional basting with a butter mixture ensures crispy skin and moist meat. Our family loves this! Rotisserie chicken is perfect as the main dish with French fries and coleslaw, or with any number of other sides.


## Part 2: Extracting the desired data

* Search the HTML content and extract the information.

In [34]:
# Extracting the description
description = soup.find("meta", {"name": "description"})["content"]

# Extracting the ingredients
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
ingredients = [ingredient.get_text().strip() for ingredient in ingredients_section]

# Extracting the instructions
instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
instructions = [instruction.get_text().strip() for instruction in instructions_section]

# Extracting the nutrition information
nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]

# Print the extracted information
print("Recipe Title:", title)
print("Description:", description)
print("Ingredients:")
for ingredient in ingredients:
    print("-", ingredient)
print("Instructions:")
for i, instruction in enumerate(instructions, 1):
    print(f"{i}. {instruction}")
print("Nutrition Facts:")
for fact in nutrition_facts:
    print("-", fact)


Recipe Title: Rotisserie Chicken
Description: Rotisserie chicken that's easy to cook on a gas grill and turns out moist and juicy with crispy skin. This is a simple recipe that our family loves.
Ingredients:
- 1 (3 pound) whole chicken
- 1 pinch salt
- ¼ cup butter, melted
- 1 tablespoon salt
- 1 tablespoon ground paprika
- ¼ tablespoon ground black pepper
Instructions:
1. Intimidated by the idea of making a rotisserie chicken at home? We're here to help. Get your grill and rotisserie attachment ready — you'll want to try this recipe ASAP.
2. Here's what you'll need to make rotisserie chicken at home:
3. · Whole Chicken: This recipe is meant for a whole 3-pound chicken. If your chicken is larger or smaller, you'll have to adjust the cooking time.· Butter: Butter keeps the chicken moist and juicy, while giving the seasonings something to stick to.· Seasonings: The rotisserie chicken is simply seasoned with salt, pepper, and paprika.
4. You'll find the full, step-by-step recipe below — b
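
The output above suggests that the `p` selector used for instructions also picks up introductory article paragraphs, not only the cooking steps. One alternative worth checking is structured metadata: many recipe pages embed a schema.org `Recipe` object as JSON-LD inside a `<script type="application/ld+json">` tag. Whether this particular page does so is an assumption that has not been verified here. The sketch below uses only the standard library; `JsonLdCollector`, `find_recipe_jsonld` and `sample_html` are hypothetical, and the sample data is invented.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def find_recipe_jsonld(html):
    """Return the first JSON-LD object whose @type includes 'Recipe', else None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Recipe" in types:
                return item
    return None


# Invented sample data, not taken from the real page
sample_html = """<html><head><script type="application/ld+json">
[{"@type": ["Recipe"], "name": "Sample Chicken",
  "recipeInstructions": [{"@type": "HowToStep", "text": "Heat the grill."},
                         {"@type": "HowToStep", "text": "Cook the chicken."}]}]
</script></head><body></body></html>"""

recipe = find_recipe_jsonld(sample_html)
steps = [step["text"] for step in recipe["recipeInstructions"]]
print(steps)
```

If the metadata is present, it tends to be more stable than CSS class names, which sites change frequently.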

## Part 3: Collecting related links
* Find links to other recipes to complete the corpus.

In [35]:
# Find all the links to other recipes
recipe_links = soup.find_all("a", href=True)

# Filter and print only the links that are likely to be recipes
recipe_urls = []
for link in recipe_links:
    href = link['href']
    if "recipe" in href:
        recipe_urls.append(href)

# Print the recipe URLs
print("Linked Recipes:")
for url in recipe_urls:
    print(url)

Linked Recipes:
https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F
/account/add-recipe
https://www.myrecipes.com/favorites
https://www.allrecipes.com/authentication/logout?relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F
https://www.magazines.com/allrecipes-magazine.html?utm_source=allrecipes.com&utm_medium=owned&utm_campaign=i111arr1w2661
https://www.magazines.com/allrecipes-magazine.html
https://www.allrecipes.com/recipes/17562/dinner/
https://www.allrecipes.com/recipes/17057/everyday-cooking/more-meal-ideas/5-ingredients/main-dishes/
https://www.allrecipes.com/recipes/15436/everyday-cooking/one-pot-meals/
https://www.allrecipes.com/recipes/1947/everyday-cooking/quick-and-easy/
https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/
https://www.allrecipes.com/recipes/17889/everyday-cooking/family-friendly/family-dinners/
https://www.allrecipes.com/recipes/94/soups-s

In [36]:
from urllib.parse import urljoin

base_url = "https://www.allrecipes.com"

# Keep only actual recipe pages
recipe_urls = []
for link in recipe_links:
    href = link['href']

    # Make sure the link points to a valid recipe
    if "/recipe/" in href and not any(x in href for x in ["login", "account", "favorites", "subscribe", "#"]):
        # Convert relative URLs to absolute ones
        if href.startswith("/"):
            href = urljoin(base_url, href)

        # Skip duplicates
        if href not in recipe_urls:
            recipe_urls.append(href)

# Show the filtered recipe links
print("Recetas encontradas:")
for url in recipe_urls:
    print(url)


Recetas encontradas:
https://www.allrecipes.com/recipe/238575/cilantro-lime-grilled-chicken/
https://www.allrecipes.com/recipe/275062/buttermilk-barbecue-chicken/
https://www.allrecipes.com/recipe/274724/grilled-spatchcocked-chicken/
https://www.allrecipes.com/recipe/14531/beer-butt-chicken/
https://www.allrecipes.com/recipe/221093/good-frickin-paprika-chicken/
https://www.allrecipes.com/recipe/264278/miso-honey-chicken/
https://www.allrecipes.com/recipe/258659/rosemary-buttermilk-chicken/
https://www.allrecipes.com/recipe/222936/smoked-beer-butt-chicken/
https://www.allrecipes.com/recipe/228070/the-best-beer-can-chicken-ever/
https://www.allrecipes.com/recipe/214619/bbq-beer-can-chicken/
https://www.allrecipes.com/recipe/19944/drunk-chicken/
https://www.allrecipes.com/recipe/275044/grilled-chicken-under-a-brick/
https://www.allrecipes.com/recipe/281255/smoked-whole-chicken/
https://www.allrecipes.com/recipe/34957/easy-barbeque-chicken/
https://www.allrecipes.com/recipe/8998/darn-good-
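
The substring filter above works for this page, but it relies on a block list of words. A stricter sketch could match the URL path against the pattern visible in the output, `/recipe/<numeric id>/<slug>/`, and drop query strings and fragments before de-duplicating. `normalize_recipe_links` and `RECIPE_PATH` are hypothetical names, and the pattern is inferred only from the URLs listed in this notebook.

```python
import re
from urllib.parse import urljoin, urlparse

# Path shape inferred from the URLs printed above
RECIPE_PATH = re.compile(r"^/recipe/\d+/[^/]+/?$")


def normalize_recipe_links(hrefs, base_url="https://www.allrecipes.com"):
    """Return unique absolute recipe URLs, in first-seen order."""
    base_host = urlparse(base_url).netloc
    seen = {}
    for href in hrefs:
        absolute = urljoin(base_url, href)  # leaves absolute URLs unchanged
        parts = urlparse(absolute)
        if parts.netloc != base_host:
            continue  # ignore links to other sites
        if RECIPE_PATH.match(parts.path):
            # Rebuild without query string or fragment
            seen.setdefault(f"{parts.scheme}://{parts.netloc}{parts.path}", None)
    return list(seen)


hrefs = [
    "https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F",
    "/account/add-recipe",
    "https://www.allrecipes.com/recipes/17562/dinner/",
    "/recipe/238575/cilantro-lime-grilled-chicken/",
    "https://www.allrecipes.com/recipe/238575/cilantro-lime-grilled-chicken/#reviews",
    "https://www.allrecipes.com/recipe/14531/beer-butt-chicken/",
]
clean_urls = normalize_recipe_links(hrefs)
print(clean_urls)
```

Matching on the parsed path means a login URL that merely contains an encoded recipe path in its query string is rejected without needing a block list.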

In [44]:
import requests
import time
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Build the corpus from the valid links
corpus = []
receta_info = []  # Stores each recipe's fields separately

for url in recipe_urls[:10]:
    try:
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            continue

        soup = BeautifulSoup(response.text, "html.parser")

        # Extract the key fields
        title_tag = soup.find("meta", {"property": "og:title"})
        desc_tag = soup.find("meta", {"name": "description"})

        if not title_tag or not title_tag.get("content"):
            continue

        title = title_tag["content"]
        description = desc_tag["content"] if desc_tag and desc_tag.get("content") else ""

        ingredients = [
            i.get_text().strip()
            for i in soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
        ]

        instructions = [
            i.get_text().strip()
            for i in soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
        ]

        nutrition_facts = [
            n.parent.get_text().strip().replace('\n', ' ')
            for n in soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
        ]

        full_recipe = f"{title}\n{description}\n" + "\n".join(ingredients) + "\n" + "\n".join(instructions) + "\n" + "\n".join(nutrition_facts)
        corpus.append(full_recipe)

        # Keep the fields separate in case a query targets a specific one
        receta_info.append({
            "title": title,
            "description": description,
            "ingredients": ingredients,
            "instructions": instructions,
            "nutrition_facts": nutrition_facts
        })

        print(f"🍽️ Receta añadida: {title}")

    except Exception as e:
        print(f"🟥 Error en {url}: {e}")

    time.sleep(1)

# Load the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for the corpus
if len(corpus) > 0:
    embeddings = model.encode(corpus, normalize_embeddings=True)
    embeddings = embeddings.astype('float32')

    # Create the FAISS index (inner product on normalized vectors equals cosine similarity)
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatIP(dimension)
    index.add(embeddings)
    print(f"\nÍndice FAISS creado con {len(corpus)} recetas.")
else:
    print("\nEl corpus está vacío. No se pudo crear el índice FAISS.")


🍽️ Receta añadida: Cilantro-Lime Grilled Chicken
🍽️ Receta añadida: Buttermilk Barbecue Chicken
🍽️ Receta añadida: Grilled Spatchcocked Chicken
🍽️ Receta añadida: Beer Butt Chicken
🍽️ Receta añadida: Good Frickin' Paprika Chicken
🍽️ Receta añadida: Miso Honey Chicken
🍽️ Receta añadida: Rosemary Buttermilk Chicken
🍽️ Receta añadida: Smoked Beer Butt Chicken
🍽️ Receta añadida: The Best Beer Can Chicken Ever
🍽️ Receta añadida: Best Beer Can Chicken

Índice FAISS creado con 10 recetas.
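
The cell above normalizes the embeddings and then uses an inner-product index. The reason this pairing works is that the dot product of two unit-length vectors equals their cosine similarity. The toy sketch below checks that identity in plain Python with made-up 2-D vectors; it does not use FAISS or the real embeddings, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k(query, vectors, k):
    """Brute-force inner-product search: list of (score, row index), best first."""
    scored = sorted(((dot(query, v), i) for i, v in enumerate(vectors)), reverse=True)
    return scored[:k]


# Made-up vectors standing in for embeddings
raw_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
raw_query = [2.0, 1.0]

unit_vectors = [normalize(v) for v in raw_vectors]
unit_query = normalize(raw_query)

# Inner product of normalized vectors matches cosine similarity of the originals
for raw, unit in zip(raw_vectors, unit_vectors):
    assert abs(dot(unit_query, unit) - cosine(raw_query, raw)) < 1e-9

best_score, best_row = top_k(unit_query, unit_vectors, k=1)[0]
print(best_row, round(best_score, 3))
```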


## Part 4: RAG over the collected recipes
* Once the corpus has been built, implement and deploy RAG to run searches over it.
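
In a full RAG pipeline, the retrieved recipes are usually passed to a language model as context. The cells in this part stop at retrieval, so the following is only a sketch of what the prompt-assembly step might look like. `build_prompt` is a hypothetical helper, the prompt wording is an assumption, and no model is called.

```python
def build_prompt(question, recipes, max_chars=1500):
    """Assemble a prompt from the retrieved recipe records and the user question."""
    context_parts = []
    for recipe in recipes:
        ingredient_lines = "\n".join(f"- {item}" for item in recipe["ingredients"])
        block = (
            f"Title: {recipe['title']}\n"
            f"Description: {recipe['description']}\n"
            f"Ingredients:\n{ingredient_lines}"
        )
        context_parts.append(block[:max_chars])  # crude per-recipe length cap
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the recipes below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Record shaped like the entries of receta_info; ingredients copied from the
# session output in this notebook, description left empty
retrieved = [{
    "title": "Smoked Beer Butt Chicken",
    "description": "",
    "ingredients": ["1 cup butter, divided", "1 (12 fluid ounce) can beer",
                    "1 (4 pound) whole chicken"],
}]
prompt = build_prompt("Which recipe uses beer?", retrieved)
print(prompt)
```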

In [45]:
# Rebuild the text corpus from receta_info so that corpus[i] still lines up
# with row i of the FAISS index built in the previous cell
corpus = [
    f"{r['title']}\n{r['description']}\n"
    + "\n".join(r["ingredients"]) + "\n"
    + "\n".join(r["instructions"]) + "\n"
    + "\n".join(r["nutrition_facts"])
    for r in receta_info
]

In [46]:
# RAG-style query function
def buscar_receta(query, top_k=1):
    query_lower = query.lower().strip()

    # Check whether any recipe title appears in the question
    for receta in receta_info:
        titulo = receta["title"].lower().strip()
        if titulo in query_lower:
            print(f"\nResultado:")
            if "ingrediente" in query_lower:
                print("Ingredientes:")
                for ing in receta["ingredients"]:
                    print("-", ing)
            elif "instrucci" in query_lower or "preparaci" in query_lower:  # prefixes match with or without accents
                print("Instrucciones:")
                for i, ins in enumerate(receta["instructions"], 1):
                    print(f"{i}. {ins}")
            elif "descrip" in query_lower:
                print("Descripción:")
                print(receta["description"])
            elif "nutri" in query_lower:
                print("Información nutricional:")
                for fact in receta["nutrition_facts"]:
                    print("-", fact)
            else:
                print("Descripción:", receta["description"])
                print("Ingredientes:")
                for ing in receta["ingredients"]:
                    print("-", ing)
                print("Instrucciones:")
                for i, ins in enumerate(receta["instructions"], 1):
                    print(f"{i}. {ins}")
            return

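    # No direct title match: fall back to semantic similarity
    query_vec = model.encode([query], normalize_embeddings=True).astype("float32")

    # FAISS returns scores and row indices, each with shape (1, top_k)
    similarity, indices = index.search(query_vec, top_k)
    for score, idx in zip(similarity[0], indices[0]):
        if idx == -1:  # FAISS pads with -1 when fewer than top_k rows exist
            continue
        receta = receta_info[idx]
        print(f"\nResultado (similitud {score:.2f}): {receta['title']}")
        print("Descripción:", receta["description"])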


In [47]:
# Use the simulated RAG system
print("Sistema RAG de recetas activado. Escribe tu pregunta:")
while True:
    pregunta = input("\nTú: ")
    if pregunta.strip().lower() == "salir":
        break
    buscar_receta(pregunta)

Sistema RAG de recetas activado. Escribe tu pregunta:

Tú: dame los ingredientes de Smoked Beer Butt Chicken

Resultado:
Ingredientes:
- 1 cup butter, divided
- 2 tablespoons garlic salt, divided
- 2 tablespoons paprika, divided
- salt and pepper to taste
- 1 (12 fluid ounce) can beer
- 1 (4 pound) whole chicken

Tú: dame los factores nutricionales de The Best Beer Can Chicken Ever

Resultado:
Información nutricional:
- Total Fat 29g
- Saturated Fat 8g
- Cholesterol 162mg
- Sodium 480mg
- Total Carbohydrate 10g
- Dietary Fiber 1g
- Total Sugars 7g
- Protein 52g
- Vitamin C 1mg
- Calcium 45mg
- Iron 3mg
- Potassium 484mg

Tú: dame una descripcion de Grilled Spatchcocked Chicken

Resultado:
Descripción:
This grilled spatchcock chicken recipe calls for removing the chicken's backbone and spreading it open like a book for quicker and more even cooking.

Tú: Grilled Spatchcocked Chicken

Resultado:
Descripción: This grilled spatchcock chicken recipe calls for removing the chicken's backbone