**Universidad Central**

**Master's in Data Analytics**

**Big Data**

**Student:** Mabel Ayala Meneses

**Date:** 19/10/2025

*Web Scraping Workshop*


Choose the website of a Colombian public entity (preferably a ministry or superintendency) and apply web scraping to:

1. Crawl the pages of the entity's site and download the PDF files on regulations.
2. Extract the text from the PDFs (plain extraction or OCR).
3. Create one JSON file per PDF with the fields "Nombre archivo", "texto", "fecha".
4. Upload the JSON files to a MongoDB collection.
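
As a minimal sketch of step 3, the snippet below writes one JSON file holding the three requested fields. The helper name `write_record`, the sample values, the ISO 8601 date format, and the temporary output directory are illustrative assumptions, not part of the workshop statement.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def write_record(out_dir: Path, pdf_name: str, text: str, when: date) -> Path:
    """Write one JSON file named after the PDF, holding the three required fields."""
    record = {
        "Nombre archivo": pdf_name,
        "texto": text,
        "fecha": when.isoformat(),  # assumption: date stored as an ISO 8601 string
    }
    out_path = out_dir / (Path(pdf_name).stem + ".json")
    out_path.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path


# Hypothetical sample values, for illustration only.
tmp_dir = Path(tempfile.mkdtemp())
saved = write_record(
    tmp_dir, "resolucion_001.pdf", "Texto extraído del PDF", date(2025, 10, 19)
)
loaded = json.loads(saved.read_text(encoding="utf-8"))
print(saved.name, sorted(loaded))
```

Passing `ensure_ascii=False` keeps accented characters readable in the file instead of escaping them.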

In [None]:
# Mount Google Drive from Colab
from google.colab import drive
drive.mount('/content/drive')

1. Libraries

In [None]:
!pip install requests beautifulsoup4 lxml
!pip install pdfminer.six

Collecting pdfminer.six
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Downloading pdfminer_six-20250506-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer.six
Successfully installed pdfminer.six-20250506


In [1]:
!pip install requests beautifulsoup4 lxml pdfminer.six pymongo
# (Optional: OCR)
!apt-get -qq update
!apt-get -qq install poppler-utils tesseract-ocr
!pip install pdf2image pytesseract

Collecting pdfminer.six
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Collecting pymongo
  Downloading pymongo-4.15.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.8.0-py3-none-any.whl.metadata (5.7 kB)
Downloading pdfminer_six-20250506-py3-none-any.whl (5.6 MB)
Downloading pymongo-4.15.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (1.7 MB)
Downloading dnspython-2.8.0-py3-none-any.whl (331 kB)
Installing collected packages: dnspytho

In [2]:
import os
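# NOTE: this connection string embeds the user and password in plain text.
# Outside a throwaway workshop cluster, load it from a secret instead of storing it in the notebook.
os.environ["MONGO_URI"]  = "mongodb+srv://mayala:mayala123@mayala.y4cqo9f.mongodb.net/?retryWrites=true&w=majority&appName=mayala"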
os.environ["MONGO_DB"]   = "MININTERIOR"
os.environ["MONGO_COLL"] = "NORMATIVIDAD"
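
The cell above keeps the user and password inside the notebook. One hedged alternative is to assemble the URI at run time from values entered interactively. In the sketch below, `build_mongo_uri` is a hypothetical helper and the credentials are invented; the host is copied from the cell above. `urllib.parse.quote_plus` percent-encodes characters such as `@` or `/` that would otherwise break the URI.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user: str, password: str, host: str) -> str:
    """Assemble an SRV connection string, percent-encoding the credentials."""
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}"
        "/?retryWrites=true&w=majority"
    )


# Hypothetical credentials, for illustration only.
example_uri = build_mongo_uri("demo_user", "p@ss/word", "mayala.y4cqo9f.mongodb.net")
print(example_uri)
```

In Colab, the password could be read with `getpass.getpass()` and the result assigned to `os.environ["MONGO_URI"]`, so the secret never appears in a saved cell.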

In [5]:
# -*- coding: utf-8 -*-
# Scraper (MinInterior Normograma) -> PDFs -> JSON (+ optional MongoDB)
# Libraries: requests, bs4, lxml, pdfminer.six, (optional) pdf2image+pytesseract, pymongo

import os
import re
import time
import json
import hashlib
import argparse
from dataclasses import dataclass, asdict
from typing import List, Optional, Tuple
from urllib.parse import urljoin
from datetime import datetime

import requests
from bs4 import BeautifulSoup
from pdfminer.high_level import extract_text

# -------- Optional OCR: if the packages are missing, OCR is skipped and nothing breaks --------
USE_OCR = True
try:
    from pdf2image import convert_from_path
    import pytesseract
except Exception:
    USE_OCR = False

# -------- Optional Mongo: if MONGO_URI is not set, nothing is inserted into Mongo --------
from pymongo import MongoClient
from pymongo.errors import PyMongoError
MONGO_URI  = os.environ.get("MONGO_URI")        # If unset, Mongo is disabled
MONGO_DB   = os.environ.get("MONGO_DB", "MININTERIOR")
MONGO_COLL = os.environ.get("MONGO_COLL", "NORMATIVIDAD")
ENABLE_MONGO = bool(MONGO_URI)

# ================== CONFIG ==================
DEST_DIR = "/content/drive/MyDrive/Big data/Taller 1"  # <-- adjust as needed
os.makedirs(DEST_DIR, exist_ok=True)

BASE_URL = "https://www.mininterior.gov.co/normatividad/?filter=true&page={page}"
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119 Safari/537.36"
HEADERS = {"User-Agent": USER_AGENT}
TIMEOUT = 25
PAUSE   = 0.8  # polite delay between requests (seconds)

# Limits
MAX_PDFS  = 50   # Global cap on PDFs to extract
MAX_PAGES = 60   # Cap on HTML pages to visit
VERBOSE   = True

# ================== SESSION WITH RETRIES + CERTIFI ==================
import certifi
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from requests.exceptions import SSLError, RequestException
from urllib3.exceptions import InsecureRequestWarning  # direct import instead of the legacy requests.packages alias

SESSION = requests.Session()
retry = Retry(
    total=3,
    connect=3,
    read=3,
    backoff_factor=0.6,   # 0.6s, 1.2s, 2.4s...
    status_forcelist=(429, 500, 502, 503, 504),
    allowed_methods=frozenset(["GET", "HEAD"]),
)
adapter = HTTPAdapter(max_retries=retry)
SESSION.mount("https://", adapter)
SESSION.mount("http://", adapter)

HEADERS.update({
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "es-CO,es;q=0.9,en;q=0.8",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
})
VERIFY_PATH = certifi.where()

# ================== DATA MODEL ==================
@dataclass
class DocRecord:
    titulo: str
    url_pdf: str
    fuente: str
    fecha_captura: str
    archivo: str
    sha1: str
    texto: str
    ocr_usado: bool

# ================== UTILS ==================
def norm(s: str) -> str:
    return re.sub(r"\s+", " ", s or "").strip()

def safe_name(name: str) -> str:
    name = re.sub(r"[^\w\-.]+", "_", name)
    return name[:150]

def sha1_file(path: str) -> str:
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# ================== WEB LAYER ==================
def fetch_html(url: str) -> Optional[BeautifulSoup]:
    try:
        # Normal attempt: validate SSL against the certifi bundle
        r = SESSION.get(url, headers=HEADERS, timeout=TIMEOUT, verify=VERIFY_PATH, allow_redirects=True)
        r.raise_for_status()
        return BeautifulSoup(r.text, "lxml")

    except SSLError as e:
        # Single fallback: disable verification for this request only (insecure: the server is not authenticated)
        print(f"[WARN] SSL error con verificación. Reintentando sin verify: {e}")
        try:
            requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)
            r = SESSION.get(url, headers=HEADERS, timeout=TIMEOUT, verify=False, allow_redirects=True)
            r.raise_for_status()
            return BeautifulSoup(r.text, "lxml")
        except Exception as e2:
            print(f"[WARN] HTML fetch failed (sin verify): {e2} @ {url}")
            return None

    except RequestException as e:
        print(f"[WARN] HTML fetch failed: {e} @ {url}")
        return None

def collect_pdf_links(pages: List[int]) -> List[Tuple[str, str]]:
    links, seen = [], set()
    for p in pages[:MAX_PAGES]:
        url = BASE_URL.format(page=p)
        soup = fetch_html(url)
        if not soup:
            time.sleep(PAUSE)
            continue
        for a in soup.select("a"):
            href = (a.get("href") or "").strip()
            if href.lower().endswith(".pdf"):
                abs_url = urljoin(url, href)
                if abs_url in seen:
                    continue
                seen.add(abs_url)
                title = norm(a.get_text()) or os.path.basename(abs_url)
                links.append((title, abs_url))
        if len(links) >= MAX_PDFS:
            break
        time.sleep(PAUSE)
    return links[:MAX_PDFS]

# ================== FILES & PARSING ==================
def download_pdf(title: str, pdf_url: str) -> Optional[str]:
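    # NOTE: the name depends only on the link title, so links that share a title
    # (several are titled "Documento") map to the same file and are skipped as existing.
    filename = safe_name(f"{title}.pdf")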
    path = os.path.join(DEST_DIR, filename)
    if os.path.exists(path) and os.path.getsize(path) > 0:
        if VERBOSE: print(f"  • Exists: {filename}")
        return path
    try:
        with SESSION.get(pdf_url, headers=HEADERS, timeout=60, stream=True, verify=VERIFY_PATH) as r:
            r.raise_for_status()
            with open(path, "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 16):
                    if chunk:
                        f.write(chunk)
        if VERBOSE: print(f"  ✓ Downloaded: {filename}")
        return path
    except SSLError as e:
        print(f"[WARN] SSL al descargar {pdf_url}. Reintentando sin verify: {e}")
        try:
            requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)
            with SESSION.get(pdf_url, headers=HEADERS, timeout=60, stream=True, verify=False) as r:
                r.raise_for_status()
                with open(path, "wb") as f:
                    for chunk in r.iter_content(chunk_size=1 << 16):
                        if chunk:
                            f.write(chunk)
            if VERBOSE: print(f"  ✓ Downloaded (sin verify): {filename}")
            return path
        except Exception as e2:
            print(f"[WARN] PDF download failed (sin verify) {pdf_url}: {e2}")
            return None
    except Exception as e:
        print(f"[WARN] PDF download failed ({pdf_url}): {e}")
        return None

def extract_text_pdfminer(path: str) -> str:
    try:
        return norm(extract_text(path) or "")
    except Exception as e:
        print(f"[WARN] pdfminer failed for {os.path.basename(path)}: {e}")
        return ""

def extract_text_ocr(path: str) -> str:
    if not USE_OCR:
        return ""
    try:
        pages = convert_from_path(path)
        texts = [pytesseract.image_to_string(img, lang="spa+eng") for img in pages]
        return norm("\n".join(texts))
    except Exception as e:
        print(f"[WARN] OCR failed for {os.path.basename(path)}: {e}")
        return ""

# ================== MONGODB ==================
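_MONGO_CLIENT: Optional[MongoClient] = None  # created once and reused across upserts

def mongo_client() -> Optional[MongoClient]:
    global _MONGO_CLIENT
    if not ENABLE_MONGO:
        return None
    if _MONGO_CLIENT is None:
        _MONGO_CLIENT = MongoClient(MONGO_URI)
    return _MONGO_CLIENT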

def upsert_document(rec: DocRecord) -> Optional[str]:
    if not ENABLE_MONGO:
        return None
    try:
        client = mongo_client()
        coll = client[MONGO_DB][MONGO_COLL]
        payload = asdict(rec)
        result = coll.update_one({"sha1": rec.sha1}, {"$set": payload}, upsert=True)
        return str(result.upserted_id) if result.upserted_id else None
    except PyMongoError as e:
        print(f"[WARN] MongoDB error: {e}")
        return None

# ================== PIPELINE ==================
def process_link(title: str, pdf_url: str) -> Optional[DocRecord]:
    local_pdf = download_pdf(title, pdf_url)
    if not local_pdf:
        return None

    digest = sha1_file(local_pdf)

    text = extract_text_pdfminer(local_pdf)
    used_ocr = False
    if len(text) < 200:  # fall back to OCR when almost no text was extracted
        ocr_text = extract_text_ocr(local_pdf)
        if len(ocr_text) > len(text):
            text = ocr_text
            used_ocr = True

    return DocRecord(
        titulo=title,
        url_pdf=pdf_url,
        fuente="MinInterior Normatividad",
        fecha_captura=time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),  # UTC; replaces the deprecated datetime.utcnow()
        archivo=os.path.basename(local_pdf),
        sha1=digest,
        texto=text,
        ocr_usado=used_ocr,
    )

def save_json(rec: DocRecord) -> str:
    out_name = os.path.splitext(rec.archivo)[0] + ".json"
    out_path = os.path.join(DEST_DIR, out_name)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(asdict(rec), f, ensure_ascii=False, indent=2)
    return out_path

def run(pages: List[int], limit: int) -> None:
    links = collect_pdf_links(pages)
    print(f"Encontrados {len(links)} PDFs.")
    processed = 0
    for title, url in links:
        if processed >= limit:
            break
        print(f"\n→ Procesando: {title}")
        rec = process_link(title, url)
        if not rec:
            continue
        json_path = save_json(rec)
        print(f"  • JSON guardado: {os.path.basename(json_path)}  | OCR: {rec.ocr_usado}")
        ins_id = upsert_document(rec)
        if ins_id:
            print(f"  • Insertado en MongoDB _id={ins_id}")
        processed += 1
        time.sleep(1.0)

    print(f"\nListo. Documentos procesados: {processed}")

# ================== CLI (compatible with Colab/Jupyter) ==================
def parse_args():
    parser = argparse.ArgumentParser(
        description="Scrape MinInterior PDFs → text → JSON → (optional) MongoDB",
        add_help=True,
    )
    parser.add_argument("--pages", type=str, default="1-2",
                        help="Page range, e.g. '1-3', or a list such as '1,2,3'")
    parser.add_argument("--limit", type=int, default=10,
                        help="Maximum number of PDFs to process")
    # Ignore the unknown arguments that Jupyter/Colab injects (e.g. -f ...)
    args, _ = parser.parse_known_args()
    return args

def parse_pages(s: str) -> List[int]:
    s = s.strip()
    if "-" in s:
        a, b = s.split("-", 1)
        return list(range(int(a), int(b) + 1))
    return [int(x) for x in s.split(",") if x.strip()]

if __name__ == "__main__":
    args = parse_args()
    pages = parse_pages(args.pages)
    run(pages=pages, limit=args.limit)




[WARN] SSL error con verificación. Reintentando sin verify: HTTPSConnectionPool(host='www.mininterior.gov.co', port=443): Max retries exceeded with url: /normatividad/?filter=true&page=1 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))




[WARN] SSL error con verificación. Reintentando sin verify: HTTPSConnectionPool(host='www.mininterior.gov.co', port=443): Max retries exceeded with url: /normatividad/?filter=true&page=2 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))




Encontrados 17 PDFs.

→ Procesando: Carta de trato digno al ciudadano




[WARN] SSL al descargar https://www.mininterior.gov.co/wp-content/uploads/2025/04/28-abr-25-carta-trato-digno-v7.pdf. Reintentando sin verify: HTTPSConnectionPool(host='www.mininterior.gov.co', port=443): Max retries exceeded with url: /wp-content/uploads/2025/04/28-abr-25-carta-trato-digno-v7.pdf (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))
  ✓ Downloaded (sin verify): Carta_de_trato_digno_al_ciudadano.pdf


  • JSON guardado: Carta_de_trato_digno_al_ciudadano.json  | OCR: False
  • Insertado en MongoDB _id=68f54ad242457f3cbdba9cb7





→ Procesando: Documento




[WARN] SSL al descargar https://www.mininterior.gov.co/wp-content/uploads/2025/10/resolucion-numero-cocorpun01782024-de-21-de-octubre-de-2024.pdf. Reintentando sin verify: HTTPSConnectionPool(host='www.mininterior.gov.co', port=443): Max retries exceeded with url: /wp-content/uploads/2025/10/resolucion-numero-cocorpun01782024-de-21-de-octubre-de-2024.pdf (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))
  ✓ Downloaded (sin verify): Documento.pdf


  • JSON guardado: Documento.json  | OCR: False
  • Insertado en MongoDB _id=68f54ad942457f3cbdba9cb8

→ Procesando: Documento
  • Exists: Documento.pdf
  • JSON guardado: Documento.json  | OCR: False

→ Procesando: Documento
  • Exists: Documento.pdf
  • JSON guardado: Documento.json  | OCR: False

→ Procesando: Documento
  • Exists: Documento.pdf
  • JSON guardado: Documento.json  | OCR: False

→ Procesando: Documento
  • Exists: Documento.pdf
  • JSON guardado: Documento.json  | OCR: False

→ Procesando: Documento
  • Exists: Documento.pdf
  • JSON guardado: Documento.json  | OCR: False





→ Procesando: Otras Políticas




[WARN] SSL al descargar https://www.mininterior.gov.co/wp-content/uploads/2022/09/2022-09-22_DOCUMENTO-POLITICA-PUBLICA-DE-PARTICIPACION-CIUDADANA-VERSION-FINAL-AJUSTADA-27092022.pdf. Reintentando sin verify: HTTPSConnectionPool(host='www.mininterior.gov.co', port=443): Max retries exceeded with url: /wp-content/uploads/2022/09/2022-09-22_DOCUMENTO-POLITICA-PUBLICA-DE-PARTICIPACION-CIUDADANA-VERSION-FINAL-AJUSTADA-27092022.pdf (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))
  ✓ Downloaded (sin verify): Otras_Políticas.pdf


  • JSON guardado: Otras_Políticas.json  | OCR: False
  • Insertado en MongoDB _id=68f54af542457f3cbdba9cb9





→ Procesando: Terminos y condiciones




[WARN] SSL al descargar https://www.mininterior.gov.co/wp-content/uploads/2022/09/ir-a-terminos-y-condiciones-e-uso.pdf. Reintentando sin verify: HTTPSConnectionPool(host='www.mininterior.gov.co', port=443): Max retries exceeded with url: /wp-content/uploads/2022/09/ir-a-terminos-y-condiciones-e-uso.pdf (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))
  ✓ Downloaded (sin verify): Terminos_y_condiciones.pdf


  • JSON guardado: Terminos_y_condiciones.json  | OCR: False
  • Insertado en MongoDB _id=68f54afb42457f3cbdba9cba





→ Procesando: Datos personales




[WARN] SSL al descargar https://www.mininterior.gov.co/wp-content/uploads/2022/07/politica_de_tratamiento_de_datos_personales.pdf. Reintentando sin verify: HTTPSConnectionPool(host='www.mininterior.gov.co', port=443): Max retries exceeded with url: /wp-content/uploads/2022/07/politica_de_tratamiento_de_datos_personales.pdf (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))
  ✓ Downloaded (sin verify): Datos_personales.pdf


  • JSON guardado: Datos_personales.json  | OCR: True
  • Insertado en MongoDB _id=68f54b7642457f3cbdba9cc0

Listo. Documentos procesados: 10
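
The log above shows several links titled "Documento" resolving to the same `Documento.pdf`, so only the first one is downloaded and the rest are reported as already existing. One possible remedy, sketched here under the assumption that the URL is what distinguishes two documents, is to append a short hash of the URL to the title. `unique_pdf_name` is a hypothetical helper and the second URL is made up for the example; `safe_name` is repeated so the snippet runs on its own.

```python
import hashlib
import re


def safe_name(name: str) -> str:
    """Same sanitising rule as the scraper: keep word characters, '-' and '.'."""
    return re.sub(r"[^\w\-.]+", "_", name)[:150]


def unique_pdf_name(title: str, pdf_url: str) -> str:
    """Build a filename from the title plus an 8-character SHA-1 tag of the URL."""
    tag = hashlib.sha1(pdf_url.encode("utf-8")).hexdigest()[:8]
    stem = safe_name(title)[:130]  # leave room for the tag and the extension
    return f"{stem}_{tag}.pdf"


url_a = (
    "https://www.mininterior.gov.co/wp-content/uploads/2025/10/"
    "resolucion-numero-cocorpun01782024-de-21-de-octubre-de-2024.pdf"
)
# Hypothetical URL, for illustration only.
url_b = "https://www.mininterior.gov.co/wp-content/uploads/2025/10/otro-documento.pdf"

name_a = unique_pdf_name("Documento", url_a)
name_b = unique_pdf_name("Documento", url_b)
print(name_a, name_b)
```

Because the tag is derived from the URL, re-running the scraper yields the same filename for the same link, so the existing "already downloaded" check would still apply.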

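
`collect_pdf_links` keeps only anchors whose `href` ends in `.pdf`, which would skip a PDF link that carries a query string or fragment. The run above does not show such a link, so this is a precaution rather than a confirmed gap. A sketch of a more tolerant check that tests only the URL path follows; `is_pdf_link` is a hypothetical helper and the sample links are invented.

```python
from urllib.parse import urlparse


def is_pdf_link(href: str) -> bool:
    """Return True when the path component of the link ends in .pdf."""
    path = urlparse(href.strip()).path
    return path.lower().endswith(".pdf")


# Invented sample links, for illustration only.
samples = [
    "/wp-content/uploads/2025/04/carta.pdf",
    "/wp-content/uploads/2025/04/carta.PDF?ver=2",
    "/normatividad/?filter=true&page=2",
    "/descargas/informe.pdf#page=3",
]
flags = [is_pdf_link(s) for s in samples]
print(flags)
```

Only the third sample is rejected, since its path is `/normatividad/` and the `.pdf`-free query string is ignored.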