# Exercise 12: Web Scraping

## Objective

The goal of this exercise is to build a web scraper that collects data from a website.

## Part 0: Plan
1. Identify the data you want to collect.
2. Choose the target website.
3. Plan the structure of the corpus.
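
One way to make step 3 concrete is to write the corpus record down as a type before scraping anything. The sketch below is only an illustration: `RecipeRecord` is a hypothetical name, and the fields and placeholder defaults simply mirror the keys and fallback strings that this exercise collects for each recipe.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class RecipeRecord:
    """One corpus entry (hypothetical type; fields mirror the scraped keys)."""
    title: str
    rating: str = "No rating found"
    total_time: str = "No total time found"
    servings: str = "No servings found"
    description: str = "No description found"
    ingredients: list[str] = field(default_factory=list)
    instructions: list[str] = field(default_factory=list)
    nutrition_facts: list[str] = field(default_factory=list)
    image_url: str = "No image found"


# Fields that were not scraped keep their placeholder defaults.
record = RecipeRecord(title="Rotisserie Chicken", rating="4.7", servings="6")
print(asdict(record))
```

Fixing the fields up front makes it easier to notice when a page is missing one of them.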

## Part 1: Understand the target website

- Analyze the structure of the target web page.
- Identify the HTML elements that contain the data you are looking for.
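
A quick way to find candidate elements is to count how often each tag and CSS class pair occurs, since repeated pairs often mark list-like data such as ingredients. The sketch below uses the standard library's `html.parser` rather than BeautifulSoup, which the notebook uses, so that it runs without extra packages. `ClassCounter`, `sample_html`, and the class names in it are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts how often each (tag, CSS class) pair appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                for css_class in value.split():
                    self.counts[(tag, css_class)] += 1


# Hypothetical snippet standing in for a saved recipe page.
sample_html = """
<ul>
  <li class="ingredient-item">1 whole chicken</li>
  <li class="ingredient-item">1 pinch salt</li>
</ul>
<div class="rating">4.7</div>
"""

counter = ClassCounter()
counter.feed(sample_html)
for (tag, css_class), n in counter.counts.most_common():
    print(f"{tag}.{css_class}: {n}")
```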

In [1]:
from bs4 import BeautifulSoup

html_path = '../data/rotisserie-chicken.html'

# Load the HTML file (separate names avoid rebinding the path variable to the file handle)
with open(html_path, "r", encoding="utf-8") as html_file:
    html_content = html_file.read()
    
# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

In [2]:
# Extracting the recipe title
title = soup.find("meta", {"property": "og:title"})["content"]
title

'Rotisserie Chicken'

In [3]:
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
for ingredient in ingredients_section:
    print(ingredient.text.strip())

1 (3 pound) whole chicken
1 pinch salt
¼ cup butter, melted
1 tablespoon salt
1 tablespoon ground paprika
¼ tablespoon ground black pepper


## Part 2: Extract the target data

* Search the HTML content and extract the information.

In [17]:
def get_details(soup):
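    # Extracting the recipe title (fall back to a placeholder if the meta tag is missing)
    title_tag = soup.find("meta", {"property": "og:title"})
    if title_tag and title_tag.get("content"):
        title = title_tag["content"].strip()
    else:
        title = "No title found"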

    # Extracting the rating
    rating = soup.find("div", class_="mm-recipes-review-bar__rating")
    if not rating:
        rating = "No rating found"
    else:
        rating = rating.get_text().strip()

    # Extracting the time to prepare
    total_time_item = soup.find("div", class_="mm-recipes-details__label", string="Total Time:")
    if total_time_item:
        total_time = total_time_item.find_next_sibling("div", class_="mm-recipes-details__value").get_text().strip()
    else:
        total_time = "No total time found"

    # Extracting the servings
    servings_item = soup.find("div", class_="mm-recipes-details__label", string="Servings:")
    if servings_item:
        servings = servings_item.find_next_sibling("div", class_="mm-recipes-details__value").get_text().strip()
    else:
        servings = "No servings found"

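    # Extracting the description (check the tag itself, since find() returns None when it is absent)
    description_tag = soup.find("meta", {"property": "og:description"})
    if description_tag and description_tag.get("content"):
        description = description_tag["content"].strip()
    else:
        description = "No description found"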
        
    # Extracting the ingredients
    ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
    ingredients = [ingredient.get_text().strip() for ingredient in ingredients_section]

    # Extracting the instructions
    instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
    instructions = [instruction.get_text().strip() for instruction in instructions_section]

    # Extracting the nutrition information
    nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
    nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]

    # Extracting image
    image = soup.find("meta", {"property": "og:image"})
    if image:
        image_url = image["content"].strip()
    else:
        image_url = "No image found"
    
    # Extracting the full text of the recipe
    full_text = f"{title}. {rating}. {total_time}. {servings}. {description}. Ingredients: {', '.join(ingredients)}. Instructions: {' '.join(instructions)}. Nutrition Facts: {' '.join(nutrition_facts)}. Image URL: {image_url}"

    # Create a dictionary to hold all the details
    recipe_details = {
        "title": title,
        "rating": rating,
        "total_time": total_time,
        "servings": servings,
        "description": description,
        "ingredients": ingredients,
        "instructions": instructions,
        "nutrition_facts": nutrition_facts,
        "image_url": image_url,
        "full_text": full_text
    }
    
    return recipe_details
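
The function above keeps `total_time` as display text (for example `'1 hr 30 mins'`), which is awkward to sort or filter on. A minimal sketch of converting such strings to minutes follows. `duration_to_minutes` is a hypothetical helper, and it assumes the only units that appear are `day`, `hr`, and `min`, optionally pluralized.

```python
import re

# Minutes per unit; assumes these are the only units the site uses.
_UNIT_MINUTES = {"day": 1440, "hr": 60, "min": 1}


def duration_to_minutes(text):
    """Convert strings like '1 day 1 hr 30 mins' to minutes; None if nothing matches."""
    matches = re.findall(r"(\d+)\s*(day|hr|min)s?", text)
    if not matches:
        return None
    return sum(int(amount) * _UNIT_MINUTES[unit] for amount, unit in matches)


print(duration_to_minutes("1 hr 30 mins"))         # 90
print(duration_to_minutes("1 day 1 hr 30 mins"))   # 1530
print(duration_to_minutes("No total time found"))  # None
```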


In [19]:
# Call the function to get details
get_details(soup)

{'title': 'Rotisserie Chicken',
 'rating': '4.7',
 'total_time': '1 hr 30 mins',
 'servings': '6',
 'description': "Rotisserie chicken that's easy to cook on a gas grill and turns out moist and juicy with crispy skin. This is a simple recipe that our family loves.",
 'ingredients': ['1 (3 pound) whole chicken',
  '1 pinch salt',
  '¼ cup butter, melted',
  '1 tablespoon salt',
  '1 tablespoon ground paprika',
  '¼ tablespoon ground black pepper'],
 'instructions': ["Intimidated by the idea of making a rotisserie chicken at home? We're here to help. Get your grill and rotisserie attachment ready — you'll want to try this recipe ASAP.",
  "Here's what you'll need to make rotisserie chicken at home:",
  "· Whole Chicken: This recipe is meant for a whole 3-pound chicken. If your chicken is larger or smaller, you'll have to adjust the cooking time.· Butter: Butter keeps the chicken moist and juicy, while giving the seasonings something to stick to.· Seasonings: The rotisserie chicken is sim

## Part 3: Collect related links
* Find links to other recipes to complete the corpus.

In [6]:
import re
def get_recipe_urls(links):
    recipe_urls = []
    for link in links:
        href = link['href']
        # Filter specific to the AllRecipes recipe URL pattern
        if re.search(r'/recipe/\d+/[\w-]+/?$', href):
            recipe_urls.append(href)
        # if "recipe" in href:
        #     recipe_urls.append(href)
    return recipe_urls
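
Pages often mix absolute links, relative links, and links with fragments that point to the same recipe. The sketch below resolves each `href` against a base URL, drops the fragment, applies the same pattern as `get_recipe_urls`, and removes duplicates while preserving order. `normalize_recipe_urls` is a hypothetical helper, and the relative link and `#reviews` fragment in the sample are invented for the demonstration.

```python
import re
from urllib.parse import urldefrag, urljoin

RECIPE_PATTERN = re.compile(r"/recipe/\d+/[\w-]+/?$")


def normalize_recipe_urls(hrefs, base_url):
    """Resolve hrefs against base_url, drop fragments, keep each recipe URL once, in order."""
    seen = {}  # dicts preserve insertion order, so this de-duplicates without reordering
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if RECIPE_PATTERN.search(absolute):
            seen.setdefault(absolute, None)
    return list(seen)


hrefs = [
    "/recipe/14531/beer-butt-chicken/",                                   # relative (made up)
    "https://www.allrecipes.com/recipe/14531/beer-butt-chicken/",         # duplicate of the above
    "https://www.allrecipes.com/recipe/19944/drunk-chicken/#reviews",     # fragment (made up)
    "https://www.allrecipes.com/recipes/201/meat-and-poultry/chicken/",   # category page, not a recipe
]
print(normalize_recipe_urls(hrefs, "https://www.allrecipes.com/"))
```

Unlike `list(set(...))`, this keeps the URLs in the order they were found, which makes repeated runs easier to compare.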

In [7]:
# Find all the links to other recipes
recipe_links = soup.find_all("a", href=True)

# Filter and print only the links that are likely to be recipes
recipe_urls = get_recipe_urls(recipe_links)

# Print the recipe URLs
print("Linked Recipes:")
for url in recipe_urls:
    print(url)

Linked Recipes:
https://www.allrecipes.com/recipe/238575/cilantro-lime-grilled-chicken/
https://www.allrecipes.com/recipe/275062/buttermilk-barbecue-chicken/
https://www.allrecipes.com/recipe/274724/grilled-spatchcocked-chicken/
https://www.allrecipes.com/recipe/14531/beer-butt-chicken/
https://www.allrecipes.com/recipe/221093/good-frickin-paprika-chicken/
https://www.allrecipes.com/recipe/264278/miso-honey-chicken/
https://www.allrecipes.com/recipe/258659/rosemary-buttermilk-chicken/
https://www.allrecipes.com/recipe/222936/smoked-beer-butt-chicken/
https://www.allrecipes.com/recipe/228070/the-best-beer-can-chicken-ever/
https://www.allrecipes.com/recipe/214619/bbq-beer-can-chicken/
https://www.allrecipes.com/recipe/19944/drunk-chicken/
https://www.allrecipes.com/recipe/275044/grilled-chicken-under-a-brick/
https://www.allrecipes.com/recipe/281255/smoked-whole-chicken/
https://www.allrecipes.com/recipe/34957/easy-barbeque-chicken/
https://www.allrecipes.com/recipe/8998/darn-good-chick

In [8]:
import requests

In [9]:
def extract_recipe_links(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    }
    try:
        # A timeout keeps one stalled request from blocking the whole crawl
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an exception for HTTP errors
        soup = BeautifulSoup(response.content, "html.parser")
        recipe_links = soup.find_all("a", href=True)
        return get_recipe_urls(recipe_links)
    except requests.exceptions.RequestException as e:
        print(f"Failed to fetch {url}: {e}")
        return []  # Return an empty list so callers can safely extend() with the result
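
When many pages are requested in a row, transient failures are common, and pausing between attempts is kinder to the server. The following is a sketch of a retry wrapper with a growing delay. `fetch_with_retries` and its parameters are hypothetical, and the fetch function is passed in so the example runs without network access.

```python
import time


def fetch_with_retries(url, fetch, retries=3, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times, pausing between attempts; None if all fail."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # with requests, this would typically be RequestException
            print(f"Attempt {attempt} for {url} failed: {error}")
            if attempt < retries:
                sleep(delay_seconds * attempt)  # linear backoff: wait longer after each failure
    return None


# Demonstration with a fake fetcher that fails twice and then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = fetch_with_retries("https://example.com/page", flaky_fetch, sleep=lambda seconds: None)
print(result)
```

In the notebook, the `fetch` argument could be a small function that wraps the existing `requests.get` call and returns the response body.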

In [10]:
category_urls = [
    "https://www.allrecipes.com/recipes/17562/dinner/",
    "https://www.allrecipes.com/recipes/17057/everyday-cooking/more-meal-ideas/5-ingredients/main-dishes/",
    "https://www.allrecipes.com/recipes/15436/everyday-cooking/one-pot-meals/",
    "https://www.allrecipes.com/recipes/1947/everyday-cooking/quick-and-easy/",
    "https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/",
    "https://www.allrecipes.com/recipes/17889/everyday-cooking/family-friendly/family-dinners/",
    "https://www.allrecipes.com/recipes/94/soups-stews-and-chili/",
    "https://www.allrecipes.com/recipes/16099/everyday-cooking/comfort-food/",
    "https://www.allrecipes.com/recipes/80/main-dish/",
    "https://www.allrecipes.com/recipes/22992/everyday-cooking/sheet-pan-dinners/",
    "https://www.allrecipes.com/recipes-a-z-6735880/",
    "https://www.allrecipes.com/recipes/78/breakfast-and-brunch/",
    "https://www.allrecipes.com/recipes/17561/lunch/",
    "https://www.allrecipes.com/recipes/84/healthy-recipes/",
    "https://www.allrecipes.com/recipes/76/appetizers-and-snacks/",
    "https://www.allrecipes.com/recipes/96/salad/",
    "https://www.allrecipes.com/recipes/81/side-dish/",
    "https://www.allrecipes.com/recipes/16369/soups-stews-and-chili/soup/",
    "https://www.allrecipes.com/recipes/156/bread/",
    "https://www.allrecipes.com/recipes/77/drinks/",
    "https://www.allrecipes.com/recipes/79/desserts/",
    "https://www.allrecipes.com/recipes/201/meat-and-poultry/chicken/",
    "https://www.allrecipes.com/recipes/200/meat-and-poultry/beef/",
    "https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/",
    "https://www.allrecipes.com/recipes/93/seafood/",
    "https://www.allrecipes.com/recipes/95/pasta-and-noodles/",
    "https://www.allrecipes.com/recipes/1058/fruits-and-vegetables/fruits/",
    "https://www.allrecipes.com/recipes/1059/fruits-and-vegetables/vegetables/",
    "https://www.allrecipes.com/recipes/728/world-cuisine/latin-american/mexican/",
    "https://www.allrecipes.com/recipes/723/world-cuisine/european/italian/",
    "https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/",
    "https://www.allrecipes.com/recipes/233/world-cuisine/asian/indian/",
    "https://www.allrecipes.com/recipes/722/world-cuisine/european/german/",
    "https://www.allrecipes.com/recipes/731/world-cuisine/european/greek/",
    "https://www.allrecipes.com/recipes/696/world-cuisine/asian/filipino/",
    "https://www.allrecipes.com/recipes/699/world-cuisine/asian/japanese/"
]

In [11]:
# Processing category URLs to extract recipe links
all_recipe_urls = []
for url in category_urls:
    all_recipe_urls.extend(extract_recipe_links(url))

all_recipe_urls = list(set(all_recipe_urls))  # Remove duplicates
all_recipe_urls

['https://www.allrecipes.com/recipe/266245/pandan-chiffon-cake/',
 'https://www.allrecipes.com/recipe/100008/homemade-pickled-ginger-gari/',
 'https://www.allrecipes.com/recipe/239180/greek-style-lemon-roasted-potatoes/',
 'https://www.allrecipes.com/recipe/212921/atsara-papaya-relish/',
 'https://www.allrecipes.com/recipe/261452/velveting-chicken-breast-chinese-restaurant-style/',
 'https://www.allrecipes.com/recipe/147103/delicious-egg-salad-for-sandwiches/',
 'https://www.allrecipes.com/recipe/240974/a-healthy-egg-salad/',
 'https://www.allrecipes.com/recipe/280052/bbq-chicken-breasts-in-the-oven/',
 'https://www.allrecipes.com/recipe/235153/easy-baked-chicken-thighs/',
 'https://www.allrecipes.com/recipe/244950/baked-chicken-schnitzel/',
 'https://www.allrecipes.com/recipe/237284/creamy-avocado-chicken-salad/',
 'https://www.allrecipes.com/recipe/18480/red-currant-pie/',
 'https://www.allrecipes.com/recipe/283301/sheet-pan-roasted-chicken-thighs-with-brussels-sprouts/',
 'https://w

In [20]:
# all_recipes
recipes_data = []
def fetch_recipe_content(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)  # Timeout avoids hanging on a stalled request
        response.raise_for_status()  # Raise an exception for HTTP errors
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Call the function to get details from the fetched content
        recipe_details = get_details(soup)
        recipes_data.append(recipe_details)

    except requests.exceptions.RequestException as e:
        print(f"Failed to fetch {url}: {e}")

# Loop through the list of recipe URLs and fetch their contents
print("Collecting recipes...")
for url in all_recipe_urls[:100]:
    fetch_recipe_content(url)

Collecting recipes...


In [23]:
# Create a DataFrame from the recipes data
import pandas as pd
recipes_df = pd.DataFrame(recipes_data)
recipes_df.head()

|   | title | rating | total_time | servings | description | ingredients | instructions | nutrition_facts | image_url | full_text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pandan Chiffon Cake | 5.0 | 1 hr | 8 | This pandan chiffon cake recipe makes a fluffy... | [6 large eggs, separated, ¼ teaspoon cream of ... | [Preheat the oven to 325 degrees F (165 degree... | [Total Fat 12g, Saturated Fat 2g, Cholesterol ... | https://www.allrecipes.com/thmb/hI7DjiN8BM8Xzw... | Pandan Chiffon Cake. 5.0. 1 hr. 8. This pandan... |
| 1 | Homemade Pickled Ginger (Gari) | 4.7 | 45 mins | 32 | Pickled ginger, or gari, is served as a palate... | [8 ounces fresh young ginger root, peeled, 1 ½... | [Gather all ingredients., Cut ginger into chun... | [Total Fat 0g, Sodium 83mg, Total Carbohydrate... | https://www.allrecipes.com/thmb/AiFUs73Pna0irz... | Homemade Pickled Ginger (Gari). 4.7. 45 mins. ... |
| 2 | Roasted Greek Lemon Potatoes | 4.8 | 1 hr 15 mins | 6 | These Greek lemon potatoes, with olive oil, le... | [3 pounds potatoes, peeled and cut into thick ... | [Gather all ingredients. Preheat the oven to 4... | [Total Fat 12g, Saturated Fat 2g, Sodium 789mg... | https://www.allrecipes.com/thmb/dMTp0AGi_9eed2... | Roasted Greek Lemon Potatoes. 4.8. 1 hr 15 min... |
| 3 | Atsara (Papaya Relish) | 4.9 | 1 day 1 hr 30 mins | 10 | This atsara recipe is a tasty condiment. It's ... | [4 cups grated fresh green papaya, ¼ cup salt,... | [Toss grated papaya with 1/4 cup salt together... | [Total Fat 0g, Sodium 241mg, Total Carbohydrat... | https://www.allrecipes.com/thmb/WdxYfN3Ld_13ip... | Atsara (Papaya Relish). 4.9. 1 day 1 hr 30 min... |
| 4 | Velveting Chicken Breast, Chinese Restaurant S... | 4.8 | 50 mins | 4 | This velveting chicken recipe uses a Chinese t... | [1 large egg white, 1 tablespoon Chinese rice ... | [Whisk together egg white, vinegar, cornstarch... | [Total Fat 6g, Saturated Fat 1g, Cholesterol 5... | https://www.allrecipes.com/thmb/fuTegA8PjbiwBO... | Velveting Chicken Breast, Chinese Restaurant S... |
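
Because the corpus takes many requests to build, it can be worth writing it to disk so that later steps do not need to scrape again. The sketch below stores one JSON object per line (JSON Lines). `save_jsonl`, `load_jsonl`, and the file name are hypothetical, and the sample record is a shortened stand-in for a real row.

```python
import json
import os
import tempfile


def save_jsonl(records, path):
    """Write one JSON object per line; ensure_ascii=False keeps characters such as '¼' readable."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read the file back into a list of dictionaries."""
    with open(path, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


sample_records = [
    {"title": "Rotisserie Chicken", "ingredients": ["¼ cup butter, melted"]},
]
path = os.path.join(tempfile.mkdtemp(), "recipes.jsonl")
save_jsonl(sample_records, path)
print(load_jsonl(path) == sample_records)
```

A list of dictionaries such as `recipes_data` could be passed to `save_jsonl` directly, since its values are plain strings and lists.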


## Part 4: Run RAG over the collected recipes
* Once the corpus is built, implement and deploy retrieval-augmented generation (RAG) to search it.

In [32]:
from sentence_transformers import SentenceTransformer
import faiss
import os
from dotenv import load_dotenv
from openai import OpenAI

In [26]:
# Create embeddings for the recipes data
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Generating embeddings...")
embeddings = model.encode(recipes_df['full_text'].tolist(), convert_to_numpy=True)

Generating embeddings...


In [27]:
# Create the FAISS index
index = faiss.IndexFlatL2(embeddings.shape[1])  # L2 distance
index.add(embeddings)  # Add embeddings to the index
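
Conceptually, a flat L2 index compares the query against every stored vector and returns the closest ones. The pure-Python sketch below illustrates that idea on toy two-dimensional vectors. It is not how FAISS is implemented, and `squared_l2` and `brute_force_search` are hypothetical names. As far as the FAISS documentation describes, `IndexFlatL2` reports squared L2 distances, which is why the sketch does not take a square root.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query, vectors, k):
    """Return (distances, indices) of the k vectors closest to query, nearest first."""
    scored = sorted((squared_l2(query, vector), i) for i, vector in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy "embeddings"; real ones from all-MiniLM-L6-v2 have many more dimensions.
vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = brute_force_search([0.9, 0.1], vectors, k=2)
print(indices)    # [1, 0]
print(distances)
```

Smaller distances mean closer matches, so results are ordered from most to least similar.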

In [28]:
# Adding embeddings to DataFrame
recipes_df['embeddings'] = embeddings.tolist()  # Convert the NumPy array to a list so each row holds one vector

In [34]:
# Config OpenAI
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)

In [47]:
def seeker(query, k=5):
    """
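    Search for recipes relevant to a query and generate an answer from them.
    Retrieves the k nearest recipes by L2 distance (the index is a faiss.IndexFlatL2)
    and passes them to the OpenAI model as context.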
    """
    query_embedding = model.encode([query], convert_to_numpy=True)
    distances, indices = index.search(query_embedding, k)
    
    context_recipes = []
    recipe_summaries = []
    for i in range(k):
        if indices[0][i] == -1:  # FAISS pads with -1 when the index holds fewer than k vectors
            continue
        recipe = recipes_df.iloc[indices[0][i]]
        context_recipes.append({
            "title": recipe['title'],
            "rating": recipe['rating'],
            "total_time": recipe['total_time'],
            "servings": recipe['servings'],
            "description": recipe['description'],
            "ingredients": recipe['ingredients'],
            "instructions": recipe['instructions'],
            "nutrition_facts": recipe['nutrition_facts'],
            "image_url": recipe['image_url'],
            "distance": distances[0][i]
        })
        
        # Create a summary for the recipe
        summary = f"{recipe['title']} - {recipe['rating']} stars, {recipe['total_time']} to prepare, serves {recipe['servings']}. {recipe['description']}. Ingredients: {', '.join(recipe['ingredients'])}. Instructions: {' '.join(recipe['instructions'])}. Nutrition Facts: {' '.join(recipe['nutrition_facts'])}. Image URL: {recipe['image_url']}"
        recipe_summaries.append(summary)
    
    # Create prompt for the OpenAI model
    prompt = f"""
    Based only on the following recipes, answer the user's query: "{query}"
    
    Relevant recipes:
    {chr(10).join(recipe_summaries)}
    
    Provide a helpful, specific answer based on these recipes. If appropriate, recommend a specific recipe and explain why.
    """
    
    # Generate response using OpenAI
    response = client.responses.create(
        model="gpt-4.1",
        input=prompt
    )

    return response.output_text
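
Each recipe summary includes full instructions and nutrition facts, so the prompt can grow quickly as `k` increases. The sketch below shows one way to cap the context by character count before building the prompt. `build_prompt` and `max_context_chars` are hypothetical, and a character budget is only a rough proxy for the model's token limit.

```python
def build_prompt(query, summaries, max_context_chars=4000):
    """Join recipe summaries under a character budget and wrap them in an instruction."""
    selected, used = [], 0
    for summary in summaries:
        if used + len(summary) > max_context_chars:
            break  # stop before the context grows past the budget
        selected.append(summary)
        used += len(summary)
    context = "\n".join(f"{n}. {text}" for n, text in enumerate(selected, start=1))
    return (
        f'Using only the recipes below, answer the user query: "{query}"\n\n'
        f"Relevant recipes:\n{context}\n\n"
        "If appropriate, recommend one recipe and explain why."
    )


# Shortened stand-ins for real summaries; the small budget fits only the first one.
summaries = ["Salsa Chicken - 35 mins.", "Pandan Chiffon Cake - 1 hr."]
prompt = build_prompt("quick healthy dinner", summaries, max_context_chars=30)
print(prompt)
```

Because the search returns summaries ordered by distance, truncating from the end drops the least relevant recipes first.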

In [48]:
query = "How can I make a quick, healthy dinner?"
response = seeker(query)
print("ChatGPT response:")
print(response)

ChatGPT response:
Sure! If you are looking for a **quick, healthy dinner** based on the recipes you provided, the **best option is "Salsa Chicken"** (chicken with salsa and cheese).

### Why this recipe?
- **Quick:** It needs only 35 minutes of total preparation time.
- **Healthy:** It is high in protein (36g per serving), low in carbohydrates, and moderate in fat; it requires no frying and contains no ultra-processed ingredients or added sugars. You can make it even lighter by using less cheese or a low-sodium salsa.
- **Easy:** You only need to bake the chicken with salsa and cheese; the preparation is very simple.
- **Versatile:** You can serve it with salad, steamed vegetables, or even brown rice for extra fiber.

---

### How do you prepare it?
**Ingredients:**
- 4 skinless, boneless chicken breasts
- 4 teaspoons taco seasoning
- 1 cup salsa (homemade or store-bought)
- 1 cup shredded Cheddar cheese
- (Optional) 2 tablespoons sour cream