# Volcánicas Scraper & Vision 

This repository contains scripts to support **Volcánicas**, a Latin American feminist journalism project, by automating content extraction from key sources and exploring AI-assisted video transcription.

## Overview

- Scrapes articles from Volcánicas links to extract structured information: title, URL, and detailed content.
- Designed to enhance research and editorial workflows with feminist tech principles: autonomy, transparency, and ethical storytelling.


### Web Scraping

Three different approaches were explored to extract content from URLs:

#### 1. BeautifulSoup + Function Calling
This method is fast and produces good results for static HTML pages. A `json_schema` response format was used to keep the output structured. However, it struggles with JavaScript-heavy or dynamic websites, such as those serving `.aspx` pages, which makes it unsuitable for many of the documents we are dealing with.
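As a minimal sketch of the structured-output idea, the schema below mirrors the three fields of the `Document` format (title, URL, content). The wrapper follows the `response_format` shape used by OpenAI structured outputs; the schema name `"document"` and the `parse_document` helper are assumptions made for illustration, not code taken from this repository.

```python
import json

# JSON Schema for one scraped document; the fields mirror the Document format.
DOCUMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "url": {"type": "string"},
        "content": {"type": "string"},
    },
    "required": ["title", "url", "content"],
    "additionalProperties": False,
}

# Value that would be passed as `response_format` in a chat completion request.
# The name "document" is a hypothetical label chosen for this sketch.
RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {"name": "document", "strict": True, "schema": DOCUMENT_SCHEMA},
}


def parse_document(raw: str) -> dict:
    """Parse a model reply and check it has exactly the expected string fields."""
    data = json.loads(raw)
    expected = set(DOCUMENT_SCHEMA["required"])
    if not isinstance(data, dict) or set(data) != expected:
        raise ValueError(f"expected an object with keys {sorted(expected)}")
    for key in expected:
        if not isinstance(data[key], str):
            raise ValueError(f"field {key!r} must be a string")
    return data
```

The local check in `parse_document` is deliberately small: it guards against a reply that is valid JSON but is missing a field, which is the failure that matters most when results are written to disk afterwards.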

#### 2. Browser Use (Agent + Computer Vision)
This approach uses an AI-driven agent, powered by GPT-4.1-mini and computer vision, to scrape content. It iterates through `volcanicas_links.json`, retrieves each URL, and sends it to the agent. The agent returns structured output in the `Document` format, including the title, URL, and full content. Results are saved in two formats: Markdown and JSON. This method was also tested on YouTube pages to extract video descriptions (particularly for videos without audio); the best outcomes were stored in `results/video_browser_use_results.json`. This method was the most accurate and reliable overall.
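The two output formats can be sketched with the standard library alone. The helper names `to_markdown` and `to_json` and the sample document are hypothetical; the front matter layout (title and URL between `---` lines) follows what the notebook writes, with `json.dumps` added here so that quotes inside a title are escaped.

```python
import json


def to_markdown(doc: dict) -> str:
    """Render a document as Markdown with a front matter header."""
    # json.dumps produces a double-quoted string with inner quotes escaped.
    header = [
        "---",
        f"title: {json.dumps(doc['title'], ensure_ascii=False)}",
        f"url: {json.dumps(doc['url'], ensure_ascii=False)}",
        "---",
        "",
    ]
    return "\n".join(header) + "\n" + doc["content"].strip() + "\n"


def to_json(doc: dict) -> str:
    """Render a document as indented JSON, keeping accented characters readable."""
    return json.dumps(doc, ensure_ascii=False, indent=2)


# Hypothetical sample document, used only to exercise the helpers.
example = {
    "title": 'Una "nueva" democracia',
    "url": "https://example.org/articulo",
    "content": "  Texto del artículo.  ",
}
print(to_markdown(example))
print(to_json(example))
```

Keeping `ensure_ascii=False` in both helpers means Spanish text is stored as written rather than as `\u` escape sequences, which makes the saved files easier to review by hand.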

#### 3. ChatGPT Operator
This method was less reliable. While it occasionally returned results in Markdown format, it could not generate downloadable files. Furthermore, it was unable to interpret or extract content from YouTube videos.

#### Conclusion
The most effective approach is Browser Use with AI agents, especially for dynamic and media-rich content. To update the results, edit `volcanicas_links.json` and run the second notebook option. The full process takes approximately 30 minutes and consumes around 21,555 tokens, or about $0.00618.
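As a rough sanity check on those figures, the snippet below derives the blended price per million tokens they imply and scales it to other run sizes. It assumes cost grows linearly with token count and that both numbers describe the same run; neither assumption is confirmed by the source, and real pricing differs between input and output tokens.

```python
# Figures reported in the conclusion above.
REPORTED_TOKENS = 21_555
REPORTED_COST_USD = 0.00618


def implied_rate_per_million(tokens: int, cost_usd: float) -> float:
    """Blended USD price per one million tokens implied by a run."""
    return cost_usd / tokens * 1_000_000


def estimate_cost(tokens: int) -> float:
    """Scale the reported cost linearly to a different token count."""
    return tokens * (REPORTED_COST_USD / REPORTED_TOKENS)


rate = implied_rate_per_million(REPORTED_TOKENS, REPORTED_COST_USD)
print(f"Implied blended rate: ${rate:.2f} per million tokens")
print(f"Estimated cost for twice the tokens: ${estimate_cost(2 * REPORTED_TOKENS):.5f}")
```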



## Option 2: Browser Use

In [1]:
%pip install browser-use python-dotenv pydantic langchain-openai
# Install the Chromium browser used by Playwright.
# `playwright` is a shell command, not an IPython magic, so it needs the `!` prefix.
!playwright install chromium --with-deps --no-shell

Collecting browser-use
  Downloading browser_use-0.1.46-py3-none-any.whl.metadata (8.9 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Collecting pydantic
  Downloading pydantic-2.11.4-py3-none-any.whl.metadata (66 kB)
Collecting langchain-openai
  Downloading langchain_openai-0.3.16-py3-none-any.whl.metadata (2.3 kB)
Collecting anyio>=4.9.0 (from browser-use)
  Using cached anyio-4.9.0-py3-none-any.whl.metadata (4.7 kB)
Collecting click>=8.1.8 (from browser-use)
  Downloading click-8.2.0-py3-none-any.whl.metadata (2.5 kB)
Collecting faiss-cpu>=1.9.0 (from browser-use)
  Using cached faiss_cpu-1.11.0-cp312-cp312-macosx_14_0_x86_64.whl.metadata (4.8 kB)
Collecting google-api-core>=2.24.0 (from browser-use)
  Using cached google_api_core-2.24.2-py3-none-any.whl.metadata (3.0 kB)
Collecting httpx>=0.27.2 (from browser-use)
  Using cached httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-anthropic==0.3.3 (from browser-use)


In [2]:
# === Imports ===
import os
import json
from pathlib import Path

from dotenv import load_dotenv
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from browser_use import Controller, Agent

# === Environment Setup ===
load_dotenv()

# === Models ===
class Document(BaseModel):
    title: str
    url: str
    content: str

folder_name = "data"
# === Ensure Output Folder Exists ===
os.makedirs(folder_name, exist_ok=True)

# === Load and Flatten Input Links ===
with open("volcanicas_links.json", "r", encoding="utf-8") as f:
    links = json.load(f)

all_urls = []
for key in ["opinion", "reportage", "interview"]:
    all_urls.extend(links.get(key, []))

# === Debugging: Limit URLs for Testing ===
# Remove this line to scrape all links
all_urls = all_urls[:1]

# === Helper: Process a Single URL ===
async def process_url(agent, format="markdown"):
    history = await agent.run()
    result = history.final_result()

    try:
        history_json = json.loads(result)
        doc = Document(**history_json)
    except Exception as e:
        print(f"[ERROR] Failed to parse result into Document: {e}")
        return

    # Create safe filename and directories
    file_id = doc.title.replace(" ", "_").replace("/", "_")[:50]
    subfolder = "markdown" if format == "markdown" else "json"
    full_path = Path(f"{folder_name}/{subfolder}")
    full_path.mkdir(parents=True, exist_ok=True)

    filename = full_path / f"{file_id}.{('md' if format == 'markdown' else 'json')}"

    # Save content in the specified format
    try:
        if format == "markdown":
            with open(filename, "w", encoding="utf-8") as f:
                f.write("---\n")
                f.write(f'title: "{doc.title}"\n')
                f.write(f'url: "{doc.url}"\n')
                f.write("---\n\n")
                f.write(doc.content.strip())
        elif format == "json":
            with open(filename, "w", encoding="utf-8") as f:
                # model_dump() replaces the dict() method deprecated in Pydantic V2
                json.dump(doc.model_dump(), f, ensure_ascii=False, indent=2)
        else:
            print(f"[ERROR] Unsupported format: {format}")
            return
        print(f"Saved: {filename}")
    except Exception as e:
        print(f"[ERROR] Failed to save file {filename}: {e}")

# === Main Async Loop ===
async def call_agent(controller: Controller = None, format="markdown"):
    # ChatOpenAI reads OPENAI_API_KEY from the environment (loaded above by load_dotenv)
    model = ChatOpenAI(model="gpt-4.1-mini")
    for url in all_urls:
        task = f"Scrape and summarize the page at {url}"
        agent = Agent(task=task, llm=model, controller=controller) if controller else Agent(task=task, llm=model)
        await process_url(agent, format=format)


controller = Controller(output_model=Document)
await call_agent(controller=controller, format="json")


INFO     [agent] 🧠 Starting an agent with main_model=gpt-4.1-mini +tools +vision +memory, planner_model=None, extraction_model=gpt-4.1-mini 
INFO     [mem0.vector_stores.faiss] Loaded FAISS index from /tmp/mem0_1536_faiss/mem0.faiss with 1 vectors
INFO     [mem0.vector_stores.faiss] Loaded FAISS index from /Users/dnavas/.mem0/migrations_faiss/mem0_migrations.faiss with 14 vectors
INFO     [agent] 🚀 Starting task: Scrape and summarize the page at https://volcanicas.com/ecuadorbajoataque-la-militarizacion-esa-nueva-democracia/
INFO     [agent] 📍 Step 1
INFO     [agent] 👍 Eval: Success - Browser started on a blank page as expected.
INFO     [agent] 🧠 Memory: Browser started at blank page, ready to navigate to target URL.
INFO     [agent] 🎯 Next goal: Navigate to the target URL to begin scraping content.
INFO     [agent] 🛠️  Action 1/1: {"go_to_url":{"url":"https://volcanicas.com/ecuadorbajoataque-la-militarizacion-esa-nueva-democracia/"}}
INFO     [controller] 🔗  Navigated to https://volc

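The `file_id` line in `process_url` replaces only spaces and slashes, so titles containing characters such as `?`, `:` or `"` can still produce awkward or invalid filenames on some systems. The helper below is one possible stricter variant; the name `safe_file_id` and the `"untitled"` fallback are assumptions for this sketch, not part of the notebook.

```python
import re
import unicodedata


def safe_file_id(title: str, max_length: int = 50) -> str:
    """Turn an article title into a conservative ASCII filename stem."""
    # Decompose accented characters, then drop anything that is not ASCII
    # (for example "é" becomes "e" and "¿" is removed).
    normalized = unicodedata.normalize("NFKD", title)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Collapse each run of characters other than letters, digits and hyphens
    # into a single underscore.
    slug = re.sub(r"[^A-Za-z0-9-]+", "_", ascii_only).strip("_")
    # Truncate, tidy a trailing underscore, and fall back if nothing is left.
    return slug[:max_length].rstrip("_") or "untitled"


print(safe_file_id("¿Qué pasa? Ecuador: bajo/ataque"))
```

Dropping accents loses some fidelity in the filename, but the full title is still stored inside the saved file, so nothing is lost from the document itself.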
