## NOTEBOOK `doctorolia`


#### Se navega a `https://www.doctoralia.co/especialidades-medicas` se extraen todos los enlaces de especialdidades, 
#### identificados por contener el texto `Ver mas` (este es selector 1, nivel 1 de profundidad), se guarda cada uno de los enlaces.
#### En cada uno de estos enlaces hay elementos `<a>` que tienen un respectivo enlace ( `href` selector 2, en nivel 2) con un texto relativo a la especilidad en la que se esta navegando, se deben guardar estos enlaces.
#### En cada una de estas paginas existe informacion de doctores, algunas tienen varias pesta'nas de navegacion.

In [20]:
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
import asyncio
import nest_asyncio

# --- CONFIGURATION ---
# The specific page listing medical specialties
URL = "https://www.doctoralia.co/especialidades-medicas"
# Selector targeting an anchor tag <a> that contains the exact visible text "ver más"
# This reliably finds the elements you are interested in.
LINK_SELECTOR = 'a:has-text("ver más")' 
# Common selector for cookie acceptance banners
COOKIE_ACCEPT_SELECTOR = 'text="Aceptar"'
# ---------------------

async def extract_ver_mas_links(url, selector):
    """
    Navigates to the specified URL and extracts the 'href' attribute 
    from all anchor tags containing the text 'ver más'.
    
    Args:
        url (str): The target URL to scrape.
        selector (str): The CSS selector used to locate the links.

    Returns:
        list: A list of strings, where each string is an extracted URL.
    """
    print(f"Starting extraction process for URL: {url}")
    print(f"Looking for elements matching selector: {selector}")
    extracted_urls = []

    async with async_playwright() as p:
        # Launching the browser in headless mode for efficiency
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        try:
            # Navigate to the page and wait for the DOM and network to settle
            await page.goto(url, wait_until="domcontentloaded", timeout=30000)
            await page.wait_for_load_state('networkidle')

            # 1. Handle Cookie Consent (Crucial for interaction stability)
            try:
                cookie_button = page.locator(COOKIE_ACCEPT_SELECTOR)
                # Check for visibility and click if it exists
                if await cookie_button.is_visible(timeout=5000):
                    await cookie_button.click()
                    print("Cookie consent accepted successfully.")
                    await page.wait_for_timeout(500)
            except PlaywrightTimeoutError:
                # If the button doesn't appear within the timeout, gracefully ignore
                pass 
            
            # 2. Locate and Extract Links
            link_locators = page.locator(selector)
            count = await link_locators.count()
            print(f"Found {count} elements matching the selector.")

            # Efficiently extract the 'href' attribute from all located elements 
            # using Playwright's evaluate_all method (running JavaScript in the browser context)
            urls = await link_locators.evaluate_all('elements => elements.map(e => e.href)')
            
            # 3. Clean up the extracted list
            extracted_urls = [url.strip() for url in urls if url and url.strip()]
            
        except PlaywrightTimeoutError:
            print(f"\n[ERROR] Playwright Timeout occurred during navigation or waiting.")
        except Exception as e:
            print(f"\n[CRITICAL ERROR] An unexpected error occurred: {e}")

        finally:
            await browser.close()
            print("\nBrowser closed.")
            
    return extracted_urls

if __name__ == "__main__":
    
    # Use nest_asyncio to allow asyncio.run in interactive/notebook environments
    try:
        nest_asyncio.apply()
    except Exception:
        pass

    # Execute the main asynchronous function
    links = asyncio.run(extract_ver_mas_links(URL, LINK_SELECTOR))
    
    if links:
        print("\n--- Extracted Landing URLs ('ver más') ---")
        for i, link in enumerate(links, 1):
            print(f"{i}. {link}")
        print(f"\nTotal unique links extracted: {len(set(links))}")
    else:
        print("\nNo links were extracted or an error occurred.")

Starting extraction process for URL: https://www.doctoralia.co/especialidades-medicas
Looking for elements matching selector: a:has-text("ver más")
Found 98 elements matching the selector.

Browser closed.

--- Extracted Landing URLs ('ver más') ---
1. https://www.doctoralia.co/especialidades-medicas/en-detalle/alergologo
2. https://www.doctoralia.co/especialidades-medicas/en-detalle/alergologo-pediatrico
3. https://www.doctoralia.co/especialidades-medicas/en-detalle/anestesiologo
4. https://www.doctoralia.co/especialidades-medicas/en-detalle/audiologo
5. https://www.doctoralia.co/especialidades-medicas/en-detalle/auxiliar-de-enfermeria
6. https://www.doctoralia.co/especialidades-medicas/en-detalle/cardiologo
7. https://www.doctoralia.co/especialidades-medicas/en-detalle/cardiologo-pediatrico
8. https://www.doctoralia.co/especialidades-medicas/en-detalle/cirujano-cardiovascular
9. https://www.doctoralia.co/especialidades-medicas/en-detalle/cirujano-de-cabeza-y-cuello
10. https://www.do

In [21]:
substring = "aler"
LINK_SELECTOR = f'a:has-text("{substring}")' 

URL = "https://www.doctoralia.co/especialidades-medicas/en-detalle/alergologo"
if True: #__name__ == "__main__":
    
    # Use nest_asyncio to allow asyncio.run in interactive/notebook environments
    try:
        nest_asyncio.apply()
    except Exception:
        pass

    # Execute the main asynchronous function
    deep_links_1 = asyncio.run(extract_ver_mas_links(URL, LINK_SELECTOR))
    
    if links:
        print("\n--- Extracted Alergólogo/City Landing URLs ---")
        # Print the first 10 links for verification
        for i, linj in enumerate(deep_links_1[:10], 1):
            print(f"{i}. {link}")
        
        # Give a summary of the count
        if len(links) > 10:
             print(f"\n... and {len(links) - 10} more links.")
        print(f"Total unique links extracted: {len(set(links))}")
    else:
        print("\nNo links were extracted or an error occurred.")

Starting extraction process for URL: https://www.doctoralia.co/especialidades-medicas/en-detalle/alergologo
Looking for elements matching selector: a:has-text("aler")
Found 28 elements matching the selector.

Browser closed.

--- Extracted Alergólogo/City Landing URLs ---
1. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
2. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
3. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
4. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
5. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
6. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
7. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
8. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
9. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
10. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo

... and 88 more

In [22]:
substring = "aler"
LINK_SELECTOR = f'a:has-text("{substring}")' 

URL = "https://www.doctoralia.co/especialidades-medicas/en-detalle/alergologo"
if True: #__name__ == "__main__":
    
    # Use nest_asyncio to allow asyncio.run in interactive/notebook environments
    try:
        nest_asyncio.apply()
    except Exception:
        pass

    # Execute the main asynchronous function
    deep_links_1 = asyncio.run(extract_ver_mas_links(URL, LINK_SELECTOR))
    
    if links:
        print("\n--- Extracted Alergólogo/City Landing URLs ---")
        # Print the first 10 links for verification
        for i, linj in enumerate(deep_links_1[:10], 1):
            print(f"{i}. {link}")
        
        # Give a summary of the count
        if len(links) > 10:
             print(f"\n... and {len(links) - 10} more links.")
        print(f"Total unique links extracted: {len(set(links))}")
    else:
        print("\nNo links were extracted or an error occurred.")

Starting extraction process for URL: https://www.doctoralia.co/especialidades-medicas/en-detalle/alergologo
Looking for elements matching selector: a:has-text("aler")
Found 28 elements matching the selector.

Browser closed.

--- Extracted Alergólogo/City Landing URLs ---
1. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
2. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
3. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
4. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
5. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
6. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
7. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
8. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
9. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo
10. https://www.doctoralia.co/especialidades-medicas/en-detalle/urologo

... and 88 more