# Assignment Statement

Universidade Federal de Minas Gerais

Departamento de Ciência da Computação

TCC/TSI/TECC: Information Retrieval

## Programming Assignment #1 - Web Crawler

- **Deadline:** Apr 28th, 2025 23:59 via Moodle

### Overview

The goal of this assignment is to implement a crawler capable of fetching a mid-sized corpus of webpages in a short time frame while respecting the politeness constraints defined by each crawled website. In addition to the source code of your implementation and the actual crawled documents, your submission must include a characterization of these documents.

### Implementation

You must use Python 3 for this assignment. Your code must run in a virtual environment **using only the libraries included** in the provided `requirements.txt` file. Execution errors due to missing libraries or incompatible library versions will result in a zero grade. To make sure you have the correct setup, you can test it in one of the [Linux machines provided by the Department of Computer Science][Link_CRC] using the following commands:

[Link_CRC]: <https://www.crc.dcc.ufmg.br/doku.php/infraestrutura/laboratorios/linux>

```bash
$ python3 -m venv pa1
$ source pa1/bin/activate
$ pip3 install -r /path/to/requirements.txt
```


### Execution

Your implementation should include a `main.py` file, which will be executed in the same virtual environment described above, as follows:

```bash
$ python3 main.py -s <SEEDS> -n <LIMIT> [-d]
```

with the following arguments:

- **-s \<SEEDS>:** the path to a file containing a list of seed URLs (one URL per line) for initializing the crawling process.
- **-n \<LIMIT>:** the target number of webpages to be crawled; the crawler should stop its execution once this target is reached.
- **-d:** (optional argument) run in debug mode (see below).
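
The three arguments above map directly onto the standard `argparse` module. The following is a minimal sketch, not a required structure: the function name `parse_args` and the attribute names `seeds`, `limit`, and `debug` are assumptions chosen for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the -s, -n and -d options described in the assignment."""
    parser = argparse.ArgumentParser(description="Web crawler (PA1)")
    parser.add_argument("-s", dest="seeds", required=True, metavar="SEEDS",
                        help="path to a file with one seed URL per line")
    parser.add_argument("-n", dest="limit", required=True, type=int,
                        metavar="LIMIT",
                        help="target number of webpages to crawl")
    parser.add_argument("-d", dest="debug", action="store_true",
                        help="print a JSON record for each crawled page")
    # argv=None makes argparse read sys.argv[1:], as main.py would need
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to try out interactively
args = parse_args(["-s", "seeds.txt", "-n", "1000", "-d"])
print(args.seeds, args.limit, args.debug)
```

In `main.py`, calling `parse_args()` with no argument would read the real command line instead of the example list.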

### Debugging

When executed in debugging mode (i.e. when `-d` is passed as a command-line argument), your implementation must print a record of each crawled webpage to [standard output][StandardOutput] as it progresses. Such a record must be formatted as a JSON document containing the following fields:

[StandardOutput]: <https://en.wikipedia.org/wiki/Standard_streams#Standard_output_(stdout)>

- `URL`, containing the page URL;
- `Title`, containing the page title;
- `Text`, containing the first 20 words from the page visible text;
- `Timestamp`, containing the [Unix time][UnixTime] when the page was crawled.

[UnixTime]: <https://en.wikipedia.org/wiki/Unix_time>

The following example illustrates the required debugging output format for the first webpage fetched during a crawling execution:

```json
{
    "URL": "https://g1.globo.com/",
    "Title": "G1 - O portal de notícias da Globo",
    "Text": "Deseja receber as notícias mais importantes em tempo real? Ative as notificações do G1! Agora não Ativar Gasto familiar Despesa",
    "Timestamp": 1649945049
}
```
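
One way to produce a record in this format is sketched below. It assumes the title and visible text have already been extracted from the HTML; the function name `make_debug_record` and its parameters are hypothetical, not part of the assignment.

```python
import json
import time


def make_debug_record(url, title, visible_text, timestamp=None):
    """Build the JSON debug record for one crawled page."""
    words = visible_text.split()          # split on any run of whitespace
    record = {
        "URL": url,
        "Title": title,
        "Text": " ".join(words[:20]),     # first 20 words of the visible text
        "Timestamp": int(time.time()) if timestamp is None else timestamp,
    }
    # ensure_ascii=False keeps accented characters readable in the output
    return json.dumps(record, ensure_ascii=False)


print(make_debug_record(
    "https://g1.globo.com/",
    "G1 - O portal de notícias da Globo",
    "Deseja receber as notícias mais importantes em tempo real?",
    timestamp=1649945049,
))
```

Passing the timestamp explicitly is only a convenience for reproducing an example; during a crawl it would default to the current Unix time.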

### Crawling Policies

Your implementation must keep a frontier of URLs to be crawled, which must be initialized with the seed URLs provided as input to you with the -s argument. For each URL consumed from the frontier, your implementation must fetch the corresponding webpage, parse it, store the extracted HTML content in the local corpus, and enqueue the extracted outlinks in the frontier to be crawled later. In addition to this standard workflow, **your implementation must abide by the following crawling policies:**

1. *Selection Policy:* Starting from the provided seed URLs, your implementation **must only follow discovered links to HTML pages** (i.e. resources with MIME type `text/html`). To improve coverage, you may optionally choose to limit the crawling depth of any given website.
2. *Revisitation Policy:* Because this is a one-off crawling exercise, you **must not revisit a previously crawled webpage**. To ensure only new links are crawled, you may choose to normalize URLs and check for duplicates before adding new URLs to the frontier.
3. *Parallelization Policy:* To ensure maximum efficiency, you **must parallelize the crawling process across multiple threads**. You may experiment to find an optimal number of threads to maximize your download rate while minimizing the incurred parallelization overhead.
4. *Politeness Policy:* To avoid overloading the crawled websites, your implementation [**must abide by the robots exclusion protocol**][4]. Unless explicitly stated otherwise in a `robots.txt` file, you must obey a delay of at least 100ms between consecutive requests to the same website.
5. *Storage Policy:* As the main target for this assignment, your implementation **must crawl and store a total of 100,000 unique webpages**. The raw HTML content of the crawled webpages must be packaged using the [WARC format][5], with 1,000 webpages stored per WARC file (totalling 100 such files), compressed with gzip to reduce storage costs.
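
The politeness and parallelization policies interact: several threads may want the same website at once, yet requests to it must stay at least 100ms apart. The sketch below shows one possible way to enforce that with a shared, lock-protected table of per-host time slots. The class name `HostRateLimiter` and its methods are assumptions made for illustration, and a site-specific delay taken from `robots.txt` could be passed through the `delay` parameter.

```python
import threading
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Enforce a minimum delay between consecutive requests to the same host."""

    def __init__(self, default_delay=0.1):
        self.default_delay = default_delay   # seconds (100 ms by default)
        self._next_allowed = {}              # host -> earliest time of next request
        self._lock = threading.Lock()

    def wait(self, url, delay=None):
        """Block until the host of `url` may be requested; return the time slept."""
        host = urlsplit(url).netloc.lower()
        delay = self.default_delay if delay is None else delay
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed.get(host, now))
            # Reserve the slot so other threads queue up behind this one
            self._next_allowed[host] = start + delay
        sleep_for = start - now
        if sleep_for > 0:
            time.sleep(sleep_for)            # sleep outside the lock
        return sleep_for


limiter = HostRateLimiter(default_delay=0.1)
first = limiter.wait("http://example.com/a")    # no previous request: no wait
second = limiter.wait("http://example.com/b")   # same host: waits for its slot
other = limiter.wait("http://example.org/")     # different host: no wait
```

Each worker thread would call `limiter.wait(url)` right before issuing its request, so the sleep happens outside the lock and other hosts are not held back.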

### Deliverables

Before the deadline (Apr 28th, 2025 23:59), you must submit a package file (zip) via Moodle containing the following:

1. Source code of your implementation;
2. Link to your crawled corpus (stored on Google Drive);
3. Documentation file (pdf, max 2 pages).

### Grading

This assignment is worth a total of 15 points distributed as:

- 10 points for your *implementation*, assessed based on the quality of your source code, including its overall organization (modularity, readability, indentation, use of comments) and appropriate use of data structures, as well as on how well it abides by the five aforementioned crawling policies.
- 5 points for your *documentation*, assessed based on a [short (pdf) report][6] describing your implemented data structures and algorithms, their computational complexity, as well as a discussion of their empirical efficiency (e.g. the download rate throughout the crawling execution, the speedup achieved as new threads are added). Your documentation should also include a characterization of your crawled corpus, including (but not limited to) the following statistics: total number of unique domains, size distribution (in terms of number of webpages) per domain, and size distribution (in terms of number of tokens) per webpage.

### Teams

This assignment must be performed individually. Any sign of plagiarism will be investigated and reported to the appropriate authorities.

[4]: <https://en.wikipedia.org/wiki/Robots_exclusion_standard>
[5]: <https://en.wikipedia.org/wiki/Web_ARChive>
[6]: <https://portalparts.acm.org/hippo/latex_templates/acmart-primary.zip> "Your documentation should be no longer than 2 pages and use the ACM LATEX template (sample-sigconf.tex)"

---


## To-Do's

### Execution

- [ ] Install `requirements.txt` libraries
- Define the initialization parameters
  - [ ] Path to Seeds
  - [ ] Target Number of Webpages
  - [ ] Debug Mode

### Debugging Prints

- [ ] URL
- [ ] Title
- [ ] Text
- [ ] Timestamp (Unix time)

### Follow Policies

- Selection Policy
  - [ ] only MIME type `text/html`
    - CONTENT-TYPE || MIMETYPE
  - [ ] Limit crawling depth of any given website (optional)
- Revisitation Policy
  - [ ] Normalize URLs before adding to frontier
- Parallelization Policy
  - [ ] Parallelize the crawling process across multiple threads
- Politeness Policy
  - [ ] Obey the `robots.txt` file
  - [ ] Delay of at least 100ms between consecutive requests to the same website
- Storage Policy
  - [ ] Store 100,000 unique webpages
  - [ ] Package using WARC format
  - [ ] Compress with gzip to reduce storage costs
  - [ ] Store at Google Drive
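
For the "Normalize URLs before adding to frontier" item above, a minimal sketch using only the standard library follows. The notebook imports the `url_normalize` package for this purpose; the stand-in below is a simplified substitute, and `normalize_url`, `enqueue_if_new`, and `seen` are hypothetical names used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form of `url` for duplicate detection."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()     # hostname is case-insensitive
    # Keep the port only when it is not the default one for the scheme
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"                  # empty path and "/" are equivalent
    # Drop the fragment: it is never sent to the server
    return urlunsplit((scheme, host, path, parts.query, ""))


seen = set()


def enqueue_if_new(url, frontier):
    """Append `url` to `frontier` unless its normalized form was already seen."""
    key = normalize_url(url)
    if key in seen:
        return False
    seen.add(key)
    frontier.append(key)
    return True


frontier = []
enqueue_if_new("http://example.com/a", frontier)
enqueue_if_new("HTTP://EXAMPLE.com:80/a#top", frontier)   # duplicate, skipped
print(frontier)
```

With multiple threads, the check and the insertion into `seen` would need to happen under a lock so that two workers cannot enqueue the same URL.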

### Documentation

- 2 pages (pdf)
  - [ ] ACM LATEX template (sample-sigconf.tex)
    - Data Structures
    - Algorithms
    - Computational Complexity
    - Empirical Efficiency
    - Crawled Corpus Characterization
      - Total number of unique domains
      - Size distribution (in terms of number of webpages) per domain
      - Size distribution (in terms of number of tokens) per webpage
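
The three corpus statistics listed above can be computed in a single pass over the crawled pages. The sketch below is one possible approach under two assumptions that the assignment does not fix: a "domain" is taken to be the URL's network location, and tokens are whitespace-separated words. The function name `characterize` and its input format are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def characterize(pages):
    """Summarize a corpus given as an iterable of (url, visible_text) pairs."""
    pages_per_domain = Counter()
    tokens_per_page = []
    for url, text in pages:
        # Assumption: the domain is the lower-cased network location
        pages_per_domain[urlsplit(url).netloc.lower()] += 1
        # Assumption: tokens are whitespace-separated words
        tokens_per_page.append(len(text.split()))
    return {
        "unique_domains": len(pages_per_domain),
        "pages_per_domain": dict(pages_per_domain),
        "tokens_per_page": tokens_per_page,
    }


stats = characterize([
    ("https://a.com/1", "one two three"),
    ("https://A.com/2", "four"),
    ("http://b.org/", ""),
])
print(stats["unique_domains"], stats["pages_per_domain"])
```

The two lists of counts can then be plotted as histograms for the report.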

---


## Ideas:

- Explore URLs
  - Scan the sitemap
  - Descriptive learning tree to store links
    - Should all computationally intensive tasks be delegated to threads?
    - Store a page as soon as it is found?
    - Traverse the tree to extract the data?
    - Traverse it while it is being filled?
  - Threads
    1. One that keeps checking whether new pages were added to the children
    2. One on each child of the root

- Data Structures
  - Descriptive learning tree
    - Each node is a page

```json
{
    "quantity":  2,
    "frontier": [],
    "children": {
        "https://boaforma.abril.com.br": {
            "quantity":  0,
            "frontier": [],
            "URL": "",
            "Title": "",
            "Text": "",
            "Timestamp": 0,
            "children": {}
        },
        "https://olhardigital.com.br": {
            "quantity":  0,
            "frontier": [],
            "URL": "",
            "Title": "",
            "Text": "",
            "Timestamp": 0,
            "children": {}
        },
        "http://www.paguemenos.com.br": {
            "quantity":  0,
            "frontier": [],
            "URL": "",
            "Title": "",
            "Text": "",
            "Timestamp": 0,
            "children": {}
        }
    }
}
```

- Parallelism
  - Postpone the processing of threads that exceed a given processing time.
  - A new thread for each tree node? No... Too many threads.
    - What should the limit be?

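
One common way to cap the number of threads, instead of spawning one per tree node, is a fixed-size thread pool. The sketch below uses `concurrent.futures` from the standard library; `crawl_batch`, `fetch`, and the default of 10 workers are assumptions for illustration, and `fetch` stands for whatever function downloads and parses a single page.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_batch(urls, fetch, max_workers=10):
    """Run `fetch(url)` for every URL using at most `max_workers` threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it is working on
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed page must not stop the rest of the batch
                results[url] = exc
    return results


def fake_fetch(url):
    """Stand-in for a real download, used only to exercise the pool."""
    if url == "bad":
        raise ValueError("boom")
    return len(url)


res = crawl_batch(["a", "bb", "bad"], fake_fetch, max_workers=2)
```

Timing runs with different `max_workers` values would give the speedup curve the report asks for.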

## Pseudocode

- [ ] Install libraries with `requirements.txt`
  - [ ] `pip install -r requirements.txt`
    - [ ] BeautifulSoup4
    - [ ] requests
    - [ ] threading
    - [ ] time
    - [ ] gzip
    - [ ] warcio
    - [ ] json
    - [ ] os
- [ ] Import libraries
- [ ] Define the initialization parameters
  - [ ] Path to Seeds
  - [ ] Target Number of Webpages
  - [ ] Debug Mode
- [ ] Create an initial tree
- [ ] Fetch the initial URLs and store them in the tree
- [ ] Create the page scraper
  - [ ] Define the minimal URL
  - [ ] Check that the minimal URL is valid and equivalent to the initial one
  - [ ] Store
    - [ ] Robots.txt
    - [ ] Sitemap
    - [ ] HTML
    - [ ] the title
    - [ ] URL
    - [ ] the first 20 words of the text
    - [ ] the timestamp
    - [ ] the full HTML
- [ ] Create the thread spawner
- [ ] Think about how to traverse the tree
  - [ ] For now, use a simple dictionary

---


## Required packages

First, we install and import the packages the crawler needs. The project dependencies are listed in the `requirements.txt` file, which is used to install them.


In [3]:
""" Installing Python packages """

# run installation with requirements.txt

%pip install -r requirements.txt


Note: you may need to restart the kernel to use updated packages.


## Imports

The imported packages are essential for the crawler to work. The packages and their roles are listed below:

- **BeautifulSoup4:** Used to parse the HTML and extract relevant information, such as the page title and the visible text.
- **requests:** Used to make HTTP requests and download the content of webpages.
- **datetime:** Used to handle dates and times, in particular to record the timestamp of when a page was downloaded.
- **warcio:** Used to create WARC (Web ARChive) files that store the downloaded content efficiently.
- **os:** Used to interact with the operating system, for example to create directories and handle files.
- **json:** Used to handle data in JSON format, in particular to print the crawler's results.
- **gzip:** Used to compress the WARC files, reducing the storage space required.
- **threading:** Used to create and manage threads, allowing the crawler to download several pages simultaneously.


In [6]:
# Importing the required libraries

# import beautifulsoup4 as bs
import certifi
import charset_normalizer
import idna
from protego import Protego
import requests
import six
import soupsieve
import typing_extensions
from url_normalize import url_normalize
import urllib3
import warcio

# JV
import bs4 as bs
import datetime
import json

## Global Variables

A few global parameters were defined to make the crawler easier to control:

- **SEEDS:** Path to the file containing the initial URLs (seeds) for the crawler.
- **LIMIT:** Maximum number of pages to download.
- **DEBUG_MODE:** Debug mode, which prints detailed information about the download process.
- **DELAY:** Waiting time, in milliseconds, between requests to the same server, to avoid overloading it.
- **MAX_THREADS:** Maximum number of threads that can be created for simultaneous downloads.
- **visited_URLs:** Set of already visited URLs, used to avoid revisits.
- **frontier:** URLs still to be visited, filled with the URLs extracted from the downloaded pages.


In [3]:
""" Global variables """

# To be replaced later by parameters passed on the command line.
SEEDS = './Seeds/seeds-2024711370.txt'
LIMIT = 1000
DEBUG_MODE = True

# Control variables
DELAY = 100  # milliseconds between requests to the same site
MAX_THREADS = 10
visited_URLs = set()
frontier = {}
trash_can = {}


In [4]:
""" Helper functions: Timestamp, Node, Minimal URLs, Debug """

def get_timestamp():
    """ Return the current timestamp in seconds since 1970 (Unix time). """
    return int(datetime.datetime.now().timestamp())

def get_node():
    """ Return an empty node. """
    node = {
        'URL': None,
        'Title': None,
        'Timestamp': get_timestamp(),
        'Text': None,
        
        'time_elapsed': None,
        'raw_content': None,
        'quantity': 0,
        'url': None,
        
        'visited': {},
        'frontier': {},
        'trash_can': {},
        'children': {},
    }
    return node

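def get_minimal_url(URL):
    """ Return the candidate minimal URLs.
    The goal is to find the shortest URL that leads to the same content.
    Formal language of the prefix: htt(p|ps)://(www.|)
    """

    https = 'https://'
    http = 'http://'
    www = 'www.'

    # Strip the scheme and a leading "www." to get the bare host and path
    no_start = URL
    for prefix in (https, http, www):
        if no_start.startswith(prefix):
            no_start = no_start[len(prefix):]

    # Candidates: bare, www only, https, https + www, http, http + www
    minimal_urls = [
        no_start,
        www + no_start,
        https + no_start,
        https + www + no_start,
        http + no_start,
        http + www + no_start,
    ]

    return minimal_urls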

def debug_as_json(node):
    """ Print a node as JSON. """
    # default=str keeps the dump from failing on values that are not JSON-serializable
    print(json.dumps(node, indent=4, default=str))

In [5]:
""" Setting up the seeds """

def get_seeds(SEEDS):
    """ Read the seeds file and return the set of seed URLs. """
    seeds = set()
    with open(SEEDS, 'r') as file:
        for line in file:
            line = line.strip()
            if line and not line.startswith('#'):
                seeds.add(line)
    return seeds

base_seeds = get_seeds(SEEDS)


In [6]:
""" Visiting seeds """

def visit_seed(seed):
    """ Visit a seed and store its content. Returns a node, or None on failure. """
    try:
        # Minimize seed URL
        # seed = get_minimal_url(seed)
        response = requests.get(seed, timeout=5)
        mime = response.headers.get('Content-Type', '').split(';')[0].strip().lower()
        if response.status_code == 200 and mime == 'text/html':
            new_node = get_node()
            # Total elapsed time in microseconds
            # (`elapsed.microseconds` alone drops the whole seconds)
            elapsed_us = response.elapsed // datetime.timedelta(microseconds=1)
            # Exploring the response
            info = {
                'url': response.url,
                'status_code': response.status_code,
                # 'headers': dict(response.headers),
                'encoding': response.encoding,
                'text': response.text,
                # 'content': response.content.decode('utf-8', errors='ignore'),
                'time_elapsed': elapsed_us,
            }
            if DEBUG_MODE:
                # debug_as_json(info)
                # print(info['text'])
                # print(20*'\n')
                # print(info['content'])
                pass

            new_node['url'] = seed
            new_node['quantity'] += 1
            new_node['raw_content'] = response.text
            new_node['time_elapsed'] = elapsed_us

            return new_node
        else:
            # Keep only the status code: Response objects are not JSON-serializable
            trash_can[seed] = response.status_code
            if DEBUG_MODE:
                print(f'Failed to visit {seed}: {response.status_code}')
            return None
    except requests.RequestException as e:
        if DEBUG_MODE:
            print(f'Error visiting {seed}: {e}')
        return None


for seed in base_seeds:
    if DEBUG_MODE:
        print(f'Seed: {seed}')
    visited_URLs.add(seed)
    new_node = visit_seed(seed)
    # visit_seed returns None when the page could not be fetched
    if new_node is not None:
        debug_as_json(new_node)

Seed: http://www.paguemenos.com.br/
{
    "URL": null,
    "Title": null,
    "Timestamp": 1743974493,
    "Text": null,
    "time_elapsed": 42256,
    "quantity": 1,
    "url": "http://www.paguemenos.com.br/",
    "visited": {},
    "frontier": {},
    "trash_can": {},
    "children": {}
}
Seed: https://olhardigital.com.br/
{
    "URL": null,
    "Title": null,
    "Timestamp": 1743974493,
    "Text": null,
    "time_elapsed": 340890,
    "raw_content": "<!DOCTYPE html>\n<html lang=\"pt-BR\" >\n<head>\n\t<meta charset=\"UTF-8\">\n\t<meta name=\"viewport\" content=\"width=device-width, initial-scale=1, maximum-scale=6\">\n\t<meta property=\"fb:app_id\" content=\"1180679052138532\" />\n\t\n\t<link rel=\"preconnect\" href=\"https://img.odcdn.com.br\" /><link rel=\"preconnect\" href=\"https://assets.odcdn.com.br\" /><link rel=\"preload\" as=\"image\" href=\"https://img.odcdn.com.br/wp-content/uploads/2025/04/cropped-image-25-365x232.png\" media=\"(max-width: 480px)\"/><link rel=\"preload\

In [7]:
""" Processing sitemaps and robots.txt """

from urllib.parse import urljoin

def get_sitemap(seed):
    """ Fetch the sitemap of a site. Returns its text, or None on failure. """
    # urljoin avoids a double slash when the seed already ends with '/'
    sitemap_url = urljoin(seed, '/sitemap.xml')
    try:
        response = requests.get(sitemap_url, timeout=5)
        if response.status_code == 200:
            return response.text
        trash_can[sitemap_url] = response.status_code
        if DEBUG_MODE:
            print(f'Failed to get sitemap for {seed}: {response.status_code}')
    except requests.RequestException as e:
        if DEBUG_MODE:
            print(f'Error getting sitemap for {seed}: {e}')
    return None

def get_robots_txt(seed):
    """ Fetch the robots.txt file of a site. Returns its text, or None on failure. """
    # robots.txt always lives at the root of the site
    robots_url = urljoin(seed, '/robots.txt')
    try:
        response = requests.get(robots_url, timeout=5)
        if response.status_code == 200:
            return response.text
        trash_can[robots_url] = response.status_code
        if DEBUG_MODE:
            print(f'Failed to get robots.txt for {seed}: {response.status_code}')
    except requests.RequestException as e:
        if DEBUG_MODE:
            print(f'Error getting robots.txt for {seed}: {e}')
    return None

def get_robots_rules(robots_txt):
    """ Parse a robots.txt file and return its rules. """
    rules = { 'sitemaps': set(), 'user_agents': {} }
    user_agent = None
    for line in robots_txt.splitlines():
        # Split on the first ':' only, since paths and URLs may contain ':'
        field, sep, value = line.partition(':')
        if not sep:
            continue
        # Field names are matched case-insensitively
        field, value = field.strip().lower(), value.strip()
        if field == 'user-agent':
            user_agent = value
            # setdefault keeps earlier rules if the same agent appears twice
            rules['user_agents'].setdefault(user_agent, {'allow': [], 'disallow': []})
        elif field == 'disallow' and user_agent:
            rules['user_agents'][user_agent]['disallow'].append(value)
        elif field == 'allow' and user_agent:
            rules['user_agents'][user_agent]['allow'].append(value)
        elif field == 'sitemap':
            rules['sitemaps'].add(value)
    return rules

def process_sitemaps(robots_rules):
    """ Process every XML sitemap listed in the robots.txt rules. """
    def process_sitemap(sitemap):
        """ Fetch and parse a single sitemap. """
        
        try:
            response = requests.get(sitemap, timeout=5)
            mime = response.headers.get('Content-Type', '').split(';')[0]
            print(f'Processing sitemap: {sitemap}; mime: {mime}')
            # MIMEs: text/xml, application/xml
            if response.status_code == 200:
                soup = bs.BeautifulSoup(response.content, 'lxml')
                # try:
                #     soup = bs.BeautifulSoup(response.content, 'xml')
                # except (bs.FeatureNotFound):
                #     soup = bs.BeautifulSoup(response.content, 'html.parser')
                print(soup)
            #     urls = [url.text for url in soup.find_all('loc')]
            #     for url in urls:
            #         if url not in visited_URLs:
            #             visited_URLs.add(url)
            #             new_node = visit_seed(url)
            #             debug_as_json(new_node)
            # else:
            #     trash_can[sitemap] = response
            #     if DEBUG_MODE:
            #         print(f'Failed to process sitemap {sitemap}: {response.status_code}')
        except requests.RequestException as e:
            if DEBUG_MODE:
                print(f'Error processing sitemap {sitemap}: {e}')
    
    for sitemap in robots_rules['sitemaps']:
        process_sitemap(sitemap)
    

# print(base_seeds)

def process_seed(seed):
    """ Process a single seed (meant to run in its own thread). """
    if DEBUG_MODE:
        print(f'Processing seed: {seed}')
    robots_txt = get_robots_txt(seed)
    if not isinstance(robots_txt, str):
        # robots.txt could not be fetched: there are no rules or sitemaps to process
        return
    robots_rules = get_robots_rules(robots_txt)
    process_sitemaps(robots_rules)
    

for seed in base_seeds:
    process_seed(seed)
    print(100*'=')

Processing seed: http://www.paguemenos.com.br/
Processing sitemap: https://www.paguemenos.com.br/sitemap.xml; mime: text/xml
<html><body><sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><sitemap>
<loc>https://www.paguemenos.com.br/sitemap/appsRoutes-0.xml</loc>
<lastmod>2025-04-04T22:30:32.106Z</lastmod>
</sitemap>
<sitemap>
<loc>https://www.paguemenos.com.br/sitemap/brand-0.xml</loc>
<lastmod>2025-04-04T22:30:32.106Z</lastmod>
</sitemap>
<sitemap>
<loc>https://www.paguemenos.com.br/sitemap/brand-1.xml</loc>
<lastmod>2025-04-04T22:30:32.106Z</lastmod>
</sitemap>
<sitemap>
<loc>https://www.paguemenos.com.br/sitemap/userRoute-0.xml</loc>
<lastmod>2025-04-04T22:30:32.106Z</lastmod>
</sitemap>
<sitemap>
<loc>https://www.paguemenos.com.br/sitemap/category-0.xml</loc>
<lastmod>2025-04-04T22:30:32.106Z</lastmod>
</sitemap>
<sitemap>
<loc>https://www.paguemenos.com.br/sitemap/subcategory-0.xml</loc>
<lastmod>2025-04-04T22:30:32.106Z</lastmod>
</sitemap>
<sitemap>
<loc>https://


Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = bs.BeautifulSoup(response.content, 'lxml')


<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress.com" --><html><body><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url><loc>https://boaforma.abril.com.br/alimentacao/</loc></url><url><loc>https://boaforma.abril.com.br/astrologia/</loc></url><url><loc>https://boaforma.abril.com.br/beleza/</loc></url><url><loc>https://boaforma.abril.com.br/equilibrio/celebridades/</loc></url><url><loc>https://boaforma.abril.com.br/alimentacao/culinaria-saudavel/</loc></url><url><loc>https://boaforma.abril.com.br/alimentacao/dieta/</loc></url><url><loc>https://boaforma.abril.com.br/equilibrio/</loc></url><url><loc>https://boaforma.abril.com.br/equilibrio/estilo-de-vida/</loc></url><u
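
The two outputs above show the two kinds of sitemap documents: a `sitemapindex` that points to further sitemaps, and a `urlset` that lists page URLs. As an alternative to parsing them with an HTML parser, the sketch below uses `xml.etree.ElementTree` from the standard library. The function name `parse_sitemap` and its return shape are assumptions for illustration, and the example documents use a placeholder domain.

```python
import xml.etree.ElementTree as ET

# Namespace declared by the sitemaps.org schema, in ElementTree's {uri} form
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_bytes):
    """Return (kind, locations), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_bytes)
    kind = "index" if root.tag == SITEMAP_NS + "sitemapindex" else "urlset"
    # An index nests <loc> under <sitemap>; a urlset nests it under <url>
    child = "sitemap" if kind == "index" else "url"
    locations = [
        loc.text.strip()
        for loc in root.findall(f"{SITEMAP_NS}{child}/{SITEMAP_NS}loc")
        if loc.text
    ]
    return kind, locations


index_doc = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b'<sitemap><loc>https://example.com/sitemap/a.xml</loc></sitemap>'
    b'</sitemapindex>'
)
print(parse_sitemap(index_doc))
```

When the result is an index, each location is itself a sitemap to fetch and parse in turn; when it is a urlset, the locations are candidate pages for the frontier.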

# Trying to Focus on the Pseudocode

- Get Seeds
- Store Frontier
  1. Dictionary
     1. Without priority
     2. With priority
  2. Tree
- Traverse frontier (1: Single, 2: Threading)
  - If new domain
    - Pre-processing
      - Read Robots
        - Agents
        - Delay
      - Read Sitemap
        - Recursively
          - URL:
            - loc: url
            - lastmod: timestamp
  - Get HTML
    1. focus on delaying
  - Parse
  - Get Links
  - Update Frontier
  - Rerank
  - new domains > outlinks > old inlinks > new inlinks
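
The "Read Robots" step above (agents and delay) can also be sketched with `urllib.robotparser` from the standard library, as an alternative to a hand-written parser. The helper names `load_robots` and `politeness_for` are hypothetical, and the 0.1 second fallback reflects the 100ms minimum delay from the assignment. Note that this parser only recognizes integer `Crawl-delay` values.

```python
from urllib.robotparser import RobotFileParser


def load_robots(robots_txt):
    """Build a parser from the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def politeness_for(parser, url, user_agent="*", default_delay=0.1):
    """Return (allowed, delay_in_seconds) for `url` under the parsed rules."""
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)    # None when no Crawl-delay is set
    return allowed, (default_delay if delay is None else float(delay))


rules = load_robots("User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n")
print(politeness_for(rules, "https://example.com/private/x"))
print(politeness_for(rules, "https://example.com/public"))
```

Parsing once per new domain and caching the parser keeps the per-request cost to a dictionary lookup.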

---

### MVP

- Get Seeds
- Store Frontier on list
- While Frontier:
  - Get URL
  - Parse HTML
  - Get Links
  - Update Frontier
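
The MVP loop above can be sketched in a few lines. This is a single-threaded illustration without politeness or storage; `crawl` and `fetch_links` are hypothetical names, and `fetch_links(url)` stands for whatever downloads a page, parses its HTML, and returns its outlinks.

```python
from collections import deque


def crawl(seeds, fetch_links, limit):
    """Breadth-first crawl that stops after `limit` pages."""
    frontier = deque(dict.fromkeys(seeds))   # drop duplicate seeds, keep order
    visited = set(frontier)                  # every URL ever enqueued
    crawled = []
    while frontier and len(crawled) < limit:
        url = frontier.popleft()             # Get URL
        try:
            links = fetch_links(url)         # Parse HTML + Get Links
        except Exception:
            continue                         # skip pages that fail to download or parse
        crawled.append(url)
        for link in links:                   # Update Frontier
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return crawled


# A tiny in-memory link graph stands in for the web
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
print(crawl(["a"], graph.__getitem__, limit=10))
```

Adding URLs to `visited` at enqueue time, rather than at fetch time, is what keeps a page from entering the frontier twice.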
