# Assignment Statement

Universidade Federal de Minas Gerais

Departamento de Ciência da Computação

TCC/TSI/TECC: Information Retrieval

## Programming Assignment #1 - Web Crawler

- **Deadline:** Apr 28th, 2025 23:59 via Moodle

### Overview

The goal of this assignment is to implement a crawler capable of fetching a mid-sized corpus of webpages in a short time frame while respecting the politeness constraints defined by each crawled website. In addition to the source code of your implementation and the actual crawled documents, your submission must include a characterization of these documents.

### Implementation

You must use Python 3 for this assignment. Your code must run in a virtual environment **using only the libraries included** in the provided `requirements.txt` file. Execution errors due to missing libraries or incompatible library versions will result in a zero grade. To make sure you have the correct setup, you can test it in one of the [Linux machines provided by the Department of Computer Science $^{1}$][Link_CRC] using the following commands:

[Link_CRC]: <https://www.crc.dcc.ufmg.br/doku.php/infraestrutura/laboratorios/linux>

```bash
$ python3 -m venv pa1
$ source pa1/bin/activate
$ pip3 install -r /path/to/requirements.txt
```


### Execution

Your implementation should include a `main.py` file, which will be executed in the same virtual environment described above, as follows:

```bash
$ python3 main.py -s <SEEDS> -n <LIMIT> [-d]
```

with the following arguments:

- **-s \<SEEDS>:** the path to a file containing a list of seed URLs (one URL per line) for initializing the crawling process.
- **-n \<LIMIT>:** the target number of webpages to be crawled; the crawler should stop its execution once this target is reached.
- **-d:** (optional argument) run in debug mode (see below).

### Debugging

When executed in debugging mode (i.e. when -d is passed as a command-line argument), your implementation must print a record of each crawled webpage to [standard output $^2$][StandardOutput] as it progresses. Such a record must be formatted as a JSON document containing the following fields:

[StandardOutput]: <https://en.wikipedia.org/wiki/Standard_streams#Standard_output_(stdout)>

- `URL`, containing the page URL;
- `Title`, containing the page title;
- `Text`, containing the first 20 words from the page visible text;
- `Timestamp`, containing the [Unix time $^3$][UnixTime] when the page was crawled.

[UnixTime]: <https://en.wikipedia.org/wiki/Unix_time>

The following example illustrates the required debugging output format for the first webpage fetched during a crawling execution:

```json
{
    "URL": "https://g1.globo.com/",
    "Title": "G1 - O portal de notícias da Globo",
    "Text": "Deseja receber as notícias mais importantes em tempo real? Ative as notificações do G1! Agora não Ativar Gasto familiar Despesa",
    "Timestamp": 1649945049
}
```

### Crawling Policies

Your implementation must keep a frontier of URLs to be crawled, which must be initialized with the seed URLs provided as input to you with the -s argument. For each URL consumed from the frontier, your implementation must fetch the corresponding webpage, parse it, store the extracted HTML content in the local corpus, and enqueue the extracted outlinks in the frontier to be crawled later. In addition to this standard workflow, **your implementation must abide by the following crawling policies:**

1. *Selection Policy:* Starting from the provided seed URLs, your implementation **must only follow discovered links to HTML pages** (i.e. resources with MIME type `text/html`). To improve coverage, you may optionally choose to limit the crawling depth of any given website.
2. *Revisitation Policy:* Because this is a one-off crawling exercise, you **must not revisit a previously crawled webpage**. To ensure only new links are crawled, you may choose to normalize URLs and check for duplicates before adding new URLs to the frontier.
3. *Parallelization Policy:* To ensure maximum efficiency, you **must parallelize the crawling process across multiple threads**. You may experiment to find an optimal number of threads to maximize your download rate while minimizing the incurred parallelization overhead.
4. *Politeness Policy:* To avoid overloading the crawled websites, your implementation [**must abide by the robots exclusion protocol** $^4$][4]. Unless explicitly stated otherwise in a `robots.txt` file, you must obey a delay of at least 100ms between consecutive requests to the same website.
5. *Storage Policy:* As the main target for this assignment, your implementation **must crawl and store a total of 100,000 unique webpages**. The raw HTML content of the crawled webpages must be packaged using the [WARC format $^5$][5], with 1,000 webpages stored per WARC file (totalling 100 such files), compressed with gzip to reduce storage costs.
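
The politeness policy above can be sketched as a small per-host scheduler. This is only an illustration using the standard library: `HostThrottle`, `reserve`, and `wait` are hypothetical names that do not come from the assignment, and the sketch assumes a "website" is identified by the host part of the URL. A `Crawl-delay` read from `robots.txt` could be passed as `delay` to override the 100 ms default.

```python
import threading
import time
from urllib.parse import urlsplit

MIN_DELAY_SECONDS = 0.1  # 100 ms default from the politeness policy


class HostThrottle:
    """Tracks the earliest time each host may be contacted again (hypothetical helper)."""

    def __init__(self, default_delay=MIN_DELAY_SECONDS):
        self.default_delay = default_delay
        self._next_allowed = {}        # host -> monotonic time of the next allowed request
        self._lock = threading.Lock()  # protects the dict when several threads share it

    def reserve(self, url, delay=None, now=None):
        """Reserve the next request slot for the URL's host and return the seconds to wait first."""
        host = urlsplit(url).netloc.lower()
        delay = self.default_delay if delay is None else delay
        now = time.monotonic() if now is None else now
        with self._lock:
            start = max(now, self._next_allowed.get(host, now))
            self._next_allowed[host] = start + delay
        return start - now

    def wait(self, url, delay=None):
        """Block the calling thread until the host may be contacted again."""
        time.sleep(self.reserve(url, delay))
```

Each worker thread would call `wait(url)` right before issuing its request. Because the slot is reserved under a lock, two threads targeting the same host are spaced apart, while requests to different hosts are not delayed.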

### Deliverables

Before the deadline (Apr 28th, 2025 23:59), you must submit a package file (zip) via Moodle containing the following:

1. Source code of your implementation;
2. Link to your crawled corpus (stored on Google Drive);
3. Documentation file (pdf, max 2 pages).

### Grading

This assignment is worth a total of 15 points distributed as:

- 10 points for your *implementation*, assessed based on the quality of your source code, including its overall organization (modularity, readability, indentation, use of comments) and appropriate use of data structures, as well as on how well it abides by the five aforementioned crawling policies.
- 5 points for your *documentation*, assessed based on a [short (pdf) report $^6$][6] describing your implemented data structures and algorithms, their computational complexity, as well as a discussion of their empirical efficiency (e.g. the download rate throughout the crawling execution, the speedup achieved as new threads are added). Your documentation should also include a characterization of your crawled corpus, including (but not limited to) the following statistics: total number of unique domains, size distribution (in terms of number of webpages) per domain, and size distribution (in terms of number of tokens) per webpage.
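
The corpus statistics requested for the documentation could be computed along the following lines. This is a minimal sketch under stated assumptions: `characterize` is a hypothetical helper, a "domain" is approximated by the URL host, and tokens are whitespace-separated words. The report may well call for a different tokenizer or a registered-domain notion.

```python
from collections import Counter
from statistics import mean, median
from urllib.parse import urlsplit


def characterize(pages):
    """Summarize a corpus given as an iterable of (url, visible_text) pairs."""
    pages_per_domain = Counter()
    tokens_per_page = []
    for url, text in pages:
        pages_per_domain[urlsplit(url).netloc.lower()] += 1
        tokens_per_page.append(len(text.split()))  # whitespace tokens (assumption)
    return {
        'unique_domains': len(pages_per_domain),
        'pages_per_domain': dict(pages_per_domain),
        'mean_tokens_per_page': mean(tokens_per_page) if tokens_per_page else 0,
        'median_tokens_per_page': median(tokens_per_page) if tokens_per_page else 0,
    }
```

The two per-domain and per-page collections are the raw material for the size distributions; plotting them as histograms is left to the report.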

### Teams

This assignment must be performed individually. Any sign of plagiarism will be investigated and reported to the appropriate authorities.

[4]: <https://en.wikipedia.org/wiki/Robots_exclusion_standard>
[5]: <https://en.wikipedia.org/wiki/Web_ARChive>
[6]: <https://portalparts.acm.org/hippo/latex_templates/acmart-primary.zip> "Your documentation should be no longer than 2 pages and use the ACM LATEX template (sample-sigconf.tex)"

---


## To-Do's

### Execution

- [X] Install `requirements.txt` libraries
- Define the initialization parameters
  - [X] Path to Seeds
  - [X] Target Number of Webpages
  - [X] Debug Mode

### Debugging Prints

- [X] URL
- [X] Title
- [X] Text
- [X] Timestamp (Unix time)

### Follow Policies

- Selection Policy
  - [X] only MIME type `text/html`
    - CONTENT-TYPE || MIMETYPE
  - [ ] Limit crawling depth of any given website (optional)
- Revisitation Policy
  - [ ] Normalize URLs before adding to frontier
- Parallelization Policy
  - [ ] Parallelize the crawling process across multiple threads
- Politeness Policy
  - [ ] Obey the `robots.txt` file
  - [ ] Delay of at least 100ms between consecutive requests to the same website
- Storage Policy
  - [ ] Store 100,000 unique webpages
  - [X] Package using WARC format
  - [X] Compress with gzip to reduce storage costs
  - [ ] Store at Google Drive

### Documentation

- 2 pages (pdf)
  - [ ] ACM LATEX template (sample-sigconf.tex)
    - Data Structures
    - Algorithms
    - Computational Complexity
    - Empirical Efficiency
    - Crawled Corpus Characterization
      - Total number of unique domains
      - Size distribution (in terms of number of webpages) per domain
      - Size distribution (in terms of number of tokens) per webpage

---


## Ideas

- Exploring URLs
  - Scan the sitemap
  - Descriptive learning tree for storing links
    - Should all computationally intensive tasks be delegated to threads?
    - Store each page as soon as it is found?
    - Traverse the tree to extract the data?
    - Traverse it while it is being filled?
  - Threads
    1. One that keeps checking whether new pages have been added to the children
    2. One in each child of the root

- Data Structures
  - Descriptive learning tree
    - Each node is a page

```json
{
    "quantity":  2,
    "frontier": [],
    "children": {
        "https://boaforma.abril.com.br": {
            "quantity":  0,
            "frontier": [],
            "URL": "",
            "Title": "",
            "Text": "",
            "Timestamp": 0,
            "children": {}
        },
        "https://olhardigital.com.br": {
            "quantity":  0,
            "frontier": [],
            "URL": "",
            "Title": "",
            "Text": "",
            "Timestamp": 0,
            "children": {}
        },
        "http://www.paguemenos.com.br": {
            "quantity":  0,
            "frontier": [],
            "URL": "",
            "Title": "",
            "Text": "",
            "Timestamp": 0,
            "children": {}
        }
    }
}
```

- Parallelism
  - Defer threads whose processing exceeds a given time limit.
  - A new thread for each tree node? No, that would be too many threads.
    - What should the limit be?
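
One way to avoid a thread per tree node is a bounded worker pool. The sketch below is an assumption-laden illustration, not the chosen design: `crawl_batch` and the `fetch` callback are hypothetical, and the cap of 8 threads is an arbitrary starting point to be tuned experimentally.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_THREADS = 8  # assumed cap; the best value has to be found by experiment


def crawl_batch(urls, fetch, max_threads=MAX_THREADS):
    """Run fetch(url) for every URL on a bounded pool and collect the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as error:  # one failed page must not stop the batch
                results[url] = error
    return results
```

The pool reuses at most `max_threads` threads no matter how many URLs are queued, so the thread count stays fixed while the frontier grows.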


## Pseudocode

- [ ] Install libraries with `requirements.txt`
  - [ ] `pip install -r requirements.txt`
    - [ ] BeautifulSoup4
    - [ ] requests
    - [ ] threading (standard library, no installation needed)
    - [ ] time (standard library, no installation needed)
    - [ ] gzip (standard library, no installation needed)
    - [ ] warcio
    - [ ] json (standard library, no installation needed)
    - [ ] os (standard library, no installation needed)
- [ ] Import libraries
- [ ] Define the initialization parameters
  - [ ] Path to Seeds
  - [ ] Target Number of Webpages
  - [ ] Debug Mode
- [ ] Create an initial tree
- [ ] Fetch the initial URLs and store them in the tree
- [ ] Create the page scraper
  - [ ] Define the minimal URL
  - [ ] Check whether the minimal URL is valid and equal to the initial one
  - [ ] Store
    - [ ] Robots.txt
    - [ ] Sitemap
    - [ ] HTML
    - [ ] Title
    - [ ] URL
    - [ ] First 20 words of the text
    - [ ] Timestamp
    - [ ] Full HTML
- [ ] Create the thread generator
- [ ] Decide how to traverse the tree
  - [ ] For now, use a simple dictionary
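
The sitemap step in the checklist above might look like the following sketch. It assumes the sitemap XML has already been downloaded as text and follows the sitemaps.org schema, where `<url>` entries carry a `<loc>` and an optional `<lastmod>`, and index files list nested sitemaps. `parse_sitemap` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def parse_sitemap(xml_text):
    """Split a sitemap document into page entries and nested sitemap URLs."""
    root = ET.fromstring(xml_text)
    pages = []
    for node in root.findall('sm:url', SITEMAP_NS):
        loc = node.findtext('sm:loc', default='', namespaces=SITEMAP_NS).strip()
        lastmod = node.findtext('sm:lastmod', default=None, namespaces=SITEMAP_NS)
        if loc:
            pages.append({'loc': loc, 'lastmod': lastmod})
    nested = [
        node.findtext('sm:loc', default='', namespaces=SITEMAP_NS).strip()
        for node in root.findall('sm:sitemap', SITEMAP_NS)
    ]
    return pages, [loc for loc in nested if loc]
```

A caller would enqueue the page entries and repeat the call for each nested sitemap URL, which gives the recursive traversal without recursion inside the parser itself.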

---


## Required Packages

First, we install and import the packages the crawler needs. The project dependencies are listed in the `requirements.txt` file, which is used to install them.


In [None]:
""" Installing Python packages """

# run installation with requirements.txt

%pip install -r requirements.txt


## Imports

The crawler relies on the packages listed below, each with its role:

- **BeautifulSoup4:** Parses the HTML and extracts relevant information, such as the page title and the visible text.
- **requests:** Makes HTTP requests and downloads the content of webpages.
- **datetime:** Handles dates and times, in particular to record the timestamp at which a page was downloaded.
- **warcio:** Creates WARC (Web ARChive) files that store the downloaded content.
- **os:** Interacts with the operating system, for example to create directories and handle files.
- **json:** Handles data in JSON format, in particular to print the crawler's results.
- **gzip:** Compresses the WARC files, reducing the storage space required.
- **threading:** Creates and manages threads, allowing the crawler to download several pages simultaneously.


In [None]:
# Import the required libraries

# import bs4                              # The beautifulsoup4 package is imported as bs4 (see below)
# import certifi                          # Root certificates for validating SSL/TLS (used by requests)
# import charset_normalizer               # Used for detecting and normalizing text encodings (dependency of requests)
# import idna                             # Internationalized domain name support (dependency of requests)
# from protego import Protego             # Parses and enforces robots.txt rules
import requests                         # HTTP library for making requests to web resources
# import six                              # Compatibility layer for writing Python 2/3 code (used by many older libs)
# import soupsieve                        # CSS selector engine for BeautifulSoup
# import typing_extensions                # Adds backported or experimental typing features for older Python versions
# from url_normalize import url_normalize # Normalizes URLs into a consistent format
# import urllib3                          # Low-level HTTP library used by requests
# import warcio                           # Library for reading and writing WARC (Web ARChive) files
# from warcio.capture_http import capture_http # Capture HTTP requests and responses for archiving
from warcio.warcwriter import WARCWriter # Writes records to WARC files
from io import BytesIO # In-memory byte stream used as the WARC record payload

# JV
import bs4 as bs # BeautifulSoup wrapper for parsing HTML and XML
import datetime # Getting unix timestamp
import json
import re # Splitting strings
import argparse # Parsing command line arguments
import sys


## Global Variables

A few global parameters were defined to make the crawler easier to control:

- **SEEDS_FILE:** Path to the file containing the initial (seed) URLs for the crawler.
- **PAGES_LIMIT:** Maximum number of pages to download.
- **DEBUG_MODE:** Debug mode, which prints a record of each page as it is downloaded.
- **MIN_DELAY:** Wait time between requests, in milliseconds, to avoid overloading the server.
- **MAX_THREADS:** Maximum number of threads that can be created for simultaneous downloads.
- **visited_urls:** Set storing the URLs already visited, to avoid revisits.
- **frontier:** List of URLs to visit, filled with the URLs extracted from the downloaded pages.
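
The interplay between `visited_urls` and `frontier` can be sketched as follows. This is a simplified illustration using only the standard library: `normalize` and `enqueue` are hypothetical helpers, the normalization rules are a small assumed subset (lowercase scheme and host, no fragment, no default port), and URLs are marked as seen when they are enqueued rather than when they are fetched.

```python
from urllib.parse import urlsplit, urlunsplit

visited_urls = set()  # normalized URLs that were already enqueued
frontier = []         # URLs still waiting to be crawled


def normalize(url):
    """Lowercase scheme and host, drop the fragment and default port, default the path to '/'."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or '').lower()
    default_port = {'http': 80, 'https': 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f'{host}:{parts.port}'
    return urlunsplit((scheme, host, parts.path or '/', parts.query, ''))


def enqueue(url):
    """Add a URL to the frontier unless its normalized form was already seen."""
    key = normalize(url)
    if key in visited_urls:
        return False
    visited_urls.add(key)
    frontier.append(key)
    return True
```

Checking membership in a set is a constant-time operation on average, which keeps the duplicate test cheap even when the frontier holds many thousands of URLs.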


In [None]:
""" Simulating the python CLI arguments """

def get_args():
    """ Set up command line arguments """
    parser = argparse.ArgumentParser(description="Web Crawling script")
    parser.add_argument('-s', '--seeds', type=str, required=True, help="Path to seed file")
    parser.add_argument('-n', '--limit', type=int, required=True, help="Number of pages to crawl")
    parser.add_argument('-d', '--debug', action='store_true', help="Enable debug mode")
    
    if '__file__' not in globals():  # Detects whether the code is running in a notebook
        params = ['-s', './Seeds/seeds-2024711370.txt', '-n', '50', '-d']
        args = parser.parse_args(params)  # Simulates the CLI arguments inside the notebook
    else:
        args = parser.parse_args()   # Reads the real arguments when run from the terminal
    
    return args

ARGS = get_args()

# print(f"Arguments: {ARGS}")  # Debugging line to check the arguments passed to the script

In [None]:
""" Code constants """

SEEDS_FILE = ARGS.seeds
PAGES_LIMIT = ARGS.limit
DEBUG_MODE = ARGS.debug
MIN_DELAY = 100 # Delay in milliseconds between requests


## Focusing on the Pseudocode

- Get Seeds
- Store Frontier
  1. Dictionary
     1. Without priority
     2. With priority
  2. Tree
- Traverse the frontier (1: Single, 2: Threading)
  - If new domain
    - Pre-processing
      - Read Robots
        - Agents
        - Delay
      - Read Sitemap
        - Recursively
          - URL:
            - loc: url
            - lastmod: timestamp
  - Get HTML
    1. focus on delaying
  - Parse
  - Get Links
  - Update Frontier
  - Rerank
    - Priority order: new domains > outlinks > old inlinks > new inlinks
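
The "Read Robots" pre-processing step could be sketched with the standard library's `urllib.robotparser`. The commented-out imports in this notebook suggest `protego` as the intended parser, so this is a swapped-in, self-contained alternative rather than the project's actual approach. `USER_AGENT`, `load_rules`, `allowed`, and `delay_for` are hypothetical names, and the sketch assumes the `robots.txt` text was already downloaded.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = 'pa1-crawler'  # hypothetical user-agent string
MIN_DELAY_SECONDS = 0.1     # 100 ms default from the politeness policy


def load_rules(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed(rules, url, user_agent=USER_AGENT):
    """Tell whether the rules let this user agent fetch the URL."""
    return rules.can_fetch(user_agent, url)


def delay_for(rules, user_agent=USER_AGENT):
    """Return the site's Crawl-delay in seconds when given, else the 100 ms default."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else MIN_DELAY_SECONDS
```

Parsing once per new domain and caching the resulting rules object would keep the cost of this step to a single extra request per site.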

---

## MVP

- Get Seeds
- Store Frontier on list
- While Frontier:
  - Get URL
  - Parse HTML
  - Get Links
  - Update Frontier


In [None]:
# MVPing

def get_seeds(path='./Seeds/seeds-2024711370.txt'):
    """ Reads all seeds from a file and returns them as a set. """
    seeds = set()
    with open(path, 'r') as file:
        for line in file:
            line = line.strip()
            if line and not line.startswith('#'):
                seeds.add(line)
    return seeds

def parse_url(url):
    """ Parses a URL and returns its components. """

    def filter_text(text):
        """ Returns the first 20 whitespace-separated words of the text. """
        # str.split() with no arguments splits on any run of whitespace and drops
        # empty tokens, so punctuation stays attached to its word, as in the
        # required debug output example.
        words = text.split()
        joined_words = ' '.join(words[:20])
        return joined_words
    
    def get_timestamp():
        """ Returns the current timestamp in seconds since 1970 """
        return int(datetime.datetime.now().timestamp())
    
    def get_new_links(soup):
        """ Returns all new links found in the parsed HTML. """
        links = set()
        for link in soup.find_all('a', href=True):
            href = link['href']
            if href.startswith('http'):
                links.add(href)
        return links
    
    
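    base_parsed_url = {
        'URL': url, 'Title': None, 'Text': None, 'Timestamp': None,
        'Outlinks': set(), 'Full_Text': '',  # Defaults so later steps never hit a missing key
    }
    try:
        response = requests.get(url, timeout=10)  # Timeout so a stalled server cannot hang the crawler
        base_parsed_url['Response'] = response
        mime = response.headers.get('Content-Type', '').split(';')[0].strip().lower()
        status_code_200 = response.status_code == 200
        is_HTML = mime == 'text/html'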
        if status_code_200 and is_HTML:
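            soup = bs.BeautifulSoup(response.content, 'html.parser')
            for hidden in soup(['script', 'style', 'noscript']):
                hidden.decompose()  # Drop non-visible content so the text reflects what a reader sees
            full_text = soup.get_text(separator=' ')  # Separator keeps words from adjacent tags apart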
            base_parsed_url['Title'] = soup.title.string if soup.title else None
            base_parsed_url['Text'] = filter_text(full_text)
            base_parsed_url['Timestamp'] = get_timestamp()
            
            base_parsed_url['Outlinks'] = get_new_links(soup)
            base_parsed_url['Full_Text'] = full_text
            
            
    except requests.RequestException as e:
        # Add debug mode later
        print(f'Error parsing URL {url}: {e}')
    return base_parsed_url

def update_frontier(frontier, parsed_url):
    """ Updates the frontier with new links found in the parsed URL. """
    # 'Outlinks' is absent when the fetch failed or the resource is not HTML.
    frontier.update(parsed_url.get('Outlinks', ()))
    return frontier

def debug_print(parsed_url):
    """ Prints the parsed URL in a readable format. """
    debug_info = {
        'URL': parsed_url['URL'],
        'Title': parsed_url['Title'],
        'Text': parsed_url['Text'],
        'Timestamp': parsed_url['Timestamp'],
    }
    print(json.dumps(debug_info, indent=4, ensure_ascii=False))  # Keep accented characters readable

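from warcio.statusandheaders import StatusAndHeaders  # HTTP status line and headers of a record

def store_warc(parsed_url):
    """ Stores the raw HTML of a fetched page in a gzip-compressed WARC file. """
    response = parsed_url.get('Response')
    if response is None or parsed_url['Timestamp'] is None:
        return  # Nothing to store: the fetch failed or the resource is not HTML.
    # Keep only Content-Type: requests already decoded the body, so the original encoding and length headers no longer apply.
    content_type = response.headers.get('Content-Type', 'text/html')
    http_headers = StatusAndHeaders(f'{response.status_code} {response.reason}', [('Content-Type', content_type)], protocol='HTTP/1.1')
    with open('output.warc.gz', 'ab') as output: # ab = append in binary mode.
        writer = WARCWriter(output, gzip=True) # gzip=True compresses each record as it is written.
        record = writer.create_warc_record(
            uri=parsed_url['URL'],
            record_type='response',
            payload=BytesIO(response.content), # Raw HTML bytes, as the storage policy requires.
            http_headers=http_headers
        )
        writer.write_record(record)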

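In [None]:
""" Running MVP """

import time  # Used for the politeness delay between requests

def main():
    """ Main function to run the MVP. """
    frontier = get_seeds(SEEDS_FILE)
    visited = set()  # URLs already fetched, so no page is crawled twice
    stored = 0       # Number of HTML pages written to the corpus so far
    while frontier and stored < PAGES_LIMIT:
        url = frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        parsed_url = parse_url(url)
        time.sleep(MIN_DELAY / 1000)  # Conservative: waits after every request, not only same-site ones
        if parsed_url['Timestamp'] is None:
            continue  # Fetch failed or the resource is not an HTML page
        if DEBUG_MODE:
            debug_print(parsed_url)
        frontier = update_frontier(frontier, parsed_url)
        store_warc(parsed_url)
        stored += 1

if __name__ == '__main__':
    main()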