# Assignment Statement

Universidade Federal de Minas Gerais

Departamento de Ciência da Computação

TCC/TSI/TECC: Information Retrieval

## Programming Assignment #1 - Web Crawler

- **Deadline:** Apr 28th, 2025 23:59 via Moodle

### Overview

The goal of this assignment is to implement a crawler capable of fetching a mid-sized corpus of webpages in a short time frame while respecting the politeness constraints defined by each crawled website. In addition to the source code of your implementation and the actual crawled documents, your submission must include a characterization of these documents.

### Implementation

You must use Python 3 for this assignment. Your code must run in a virtual environment **using only the libraries included** in the provided `requirements.txt` file. Execution errors due to missing libraries or incompatible library versions will result in a zero grade. To make sure you have the correct setup, you can test it in one of the [Linux machines provided by the Department of Computer Science $^{1}$][Link_CRC] using the following commands:

[Link_CRC]: <https://www.crc.dcc.ufmg.br/doku.php/infraestrutura/laboratorios/linux>

```bash
$ python3 -m venv pa1
$ source pa1/bin/activate
$ pip3 install -r /path/to/requirements.txt
```


### Execution

Your implementation should include a `main.py` file, which will be executed in the same virtual environment described above, as follows:

```bash
$ python3 main.py -s <SEEDS> -n <LIMIT> [-d]
```

with the following arguments:

- **-s \<SEEDS>:** the path to a file containing a list of seed URLs (one URL per line) for initializing the crawling process.
- **-n \<LIMIT>:** the target number of webpages to be crawled; the crawler should stop its execution once this target is reached.
- **-d:** (optional argument) run in debug mode (see below).

### Debugging

When executed in debugging mode (i.e. when -d is passed as a command-line argument), your implementation must print a record of each crawled webpage to [standard output $^2$][StandardOutput] as it progresses. Such a record must be formatted as a JSON document containing the following fields:

[StandardOutput]: <https://en.wikipedia.org/wiki/Standard_streams#Standard_output_(stdout)>

- `URL`, containing the page URL;
- `Title`, containing the page title;
- `Text`, containing the first 20 words from the page visible text;
- `Timestamp`, containing the [Unix time $^3$][UnixTime] when the page was crawled.

[UnixTime]: <https://en.wikipedia.org/wiki/Unix_time>

The following example illustrates the required debugging output format for the first webpage fetched during a crawling execution:

```json
{
    "URL": "https://g1.globo.com/",
    "Title": "G1 - O portal de notícias da Globo",
    "Text": "Deseja receber as notícias mais importantes em tempo real? Ative as notificações do G1! Agora não Ativar Gasto familiar Despesa",
    "Timestamp": 1649945049
}
```
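
A record of this shape could be produced along the following lines. This is a minimal sketch using only the standard library; the helper name `make_debug_record` and the whitespace-based word split are assumptions, not part of the assignment statement.

```python
import json
import time


def make_debug_record(url, title, visible_text, timestamp=None):
    """Build the JSON debug record for one crawled page (hypothetical helper)."""
    words = visible_text.split()  # split on any whitespace, dropping empty tokens
    record = {
        "URL": url,
        "Title": title,
        "Text": " ".join(words[:20]),  # first 20 words of the visible text
        "Timestamp": int(time.time()) if timestamp is None else timestamp,
    }
    # ensure_ascii=False keeps accented characters readable on standard output
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    print(make_debug_record("https://example.com/", "Example", "one two three", 1649945049))
```

Passing the timestamp explicitly makes the function easy to test; when it is omitted, the current Unix time is used.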

### Crawling Policies

Your implementation must keep a frontier of URLs to be crawled, which must be initialized with the seed URLs provided as input to you with the -s argument. For each URL consumed from the frontier, your implementation must fetch the corresponding webpage, parse it, store the extracted HTML content in the local corpus, and enqueue the extracted outlinks in the frontier to be crawled later. In addition to this standard workflow, **your implementation must abide by the following crawling policies:**

1. *Selection Policy:* Starting from the provided seed URLs, your implementation **must only follow discovered links to HTML pages** (i.e. resources with MIME type `text/html`). To improve coverage, you may optionally choose to limit the crawling depth of any given website.
2. *Revisitation Policy:* Because this is a one-off crawling exercise, you **must not revisit a previously crawled webpage**. To ensure only new links are crawled, you may choose to normalize URLs and check for duplicates before adding new URLs to the frontier.
3. *Parallelization Policy:* To ensure maximum efficiency, you **must parallelize the crawling process across multiple threads**. You may experiment to find an optimal number of threads to maximize your download rate while minimizing the incurred parallelization overhead.
4. *Politeness Policy:* To avoid overloading the crawled websites, your implementation [**must abide by the robots exclusion protocol** $^4$][4]. Unless explicitly stated otherwise in a `robots.txt` file, you must obey a delay of at least 100ms between consecutive requests to the same website.
5. *Storage Policy:* As the main target for this assignment, your implementation **must crawl and store a total of 100,000 unique webpages**. The raw HTML content of the crawled webpages must be packaged using the [WARC format $^5$][5], with 1,000 webpages stored per WARC file (totalling 100 such files), compressed with gzip to reduce storage costs.
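
One way to honor the politeness delay when several threads share the crawl is a small per-host gate. The sketch below is only an illustration under stated assumptions: the class name `HostRateLimiter` is hypothetical, the "website" is approximated by the URL's network location, and the delay is measured between request start times.

```python
import threading
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Enforce a minimum delay between requests to the same host (illustrative sketch)."""

    def __init__(self, default_delay=0.1):
        self.default_delay = default_delay  # seconds; 0.1 s = 100 ms
        self._next_allowed = {}             # host -> earliest time of the next request
        self._lock = threading.Lock()

    def wait(self, url, delay=None):
        """Block until the host of `url` may be contacted again; return the time slept."""
        host = urlsplit(url).netloc.lower()
        delay = self.default_delay if delay is None else delay
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed.get(host, now))
            # Reserve the slot before sleeping so other threads queue behind it
            self._next_allowed[host] = start + delay
        sleep_for = start - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        return sleep_for
```

A worker thread would call `limiter.wait(url)` right before each request, passing the `Crawl-delay` value from `robots.txt` as `delay` when the site declares one.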

### Deliverables

Before the deadline (Apr 28th, 2025 23:59), you must submit a package file (zip) via Moodle containing the following:

1. Source code of your implementation;
2. Link to your crawled corpus (stored on Google Drive);
3. Documentation file (pdf, max 2 pages).

### Grading

This assignment is worth a total of 15 points distributed as:

- 10 points for your *implementation*, assessed based on the quality of your source code, including its overall organization (modularity, readability, indentation, use of comments) and appropriate use of data structures, as well as on how well it abides by the five aforementioned crawling policies.
- 5 points for your *documentation*, assessed based on a [short (pdf) report $^6$][6] describing your implemented data structures and algorithms, their computational complexity, as well as a discussion of their empirical efficiency (e.g. the download rate throughout the crawling execution, the speedup achieved as new threads are added). Your documentation should also include a characterization of your crawled corpus, including (but not limited to) the following statistics: total number of unique domains, size distribution (in terms of number of webpages) per domain, and size distribution (in terms of number of tokens) per webpage.

### Teams

This assignment must be performed individually. Any sign of plagiarism will be investigated and reported to the appropriate authorities.

[4]: <https://en.wikipedia.org/wiki/Robots_exclusion_standard>
[5]: <https://en.wikipedia.org/wiki/Web_ARChive>
[6]: <https://portalparts.acm.org/hippo/latex_templates/acmart-primary.zip> "Your documentation should be no longer than 2 pages and use the ACM LATEX template (sample-sigconf.tex)"

---


## To-Do's

### Execution

- [X] Install `requirements.txt` libraries
- Define the initialization parameters
  - [X] Path to Seeds
  - [X] Target Number of Webpages
  - [X] Debug Mode

### Debugging Prints

- [X] URL
- [X] Title
- [X] Text
- [X] Timestamp (Unix time)

### Follow Policies

- Selection Policy
  - [X] only MIME type `text/html`
    - CONTENT-TYPE || MIMETYPE
  - [ ] Limit crawling depth of any given website (optional)
- Revisitation Policy
  - [ ] Normalize URLs before adding to frontier
- Parallelization Policy
  - [ ] Parallelize the crawling process across multiple threads
- Politeness Policy
  - [ ] Obey the `robots.txt` file
  - [ ] Delay of at least 100ms between consecutive requests to the same website
- Storage Policy
  - [ ] Store 100,000 unique webpages
  - [X] Package using WARC format
  - [X] Compress with gzip to reduce storage costs
  - [ ] Store at Google Drive

### Documentation

- 2 pages (pdf)
  - [ ] ACM LATEX template (sample-sigconf.tex)
    - Data Structures
    - Algorithms
    - Computational Complexity
    - Empirical Efficiency
    - Crawled Corpus Characterization
      - Total number of unique domains
      - Size distribution (in terms of number of webpages) per domain
      - Size distribution (in terms of number of tokens) per webpage
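
The corpus statistics listed above could be computed roughly as follows. This is a sketch under assumptions: the input shape (pairs of URL and visible text), the function name `characterize`, the use of the URL's network location as the "domain", and whitespace tokenization are all illustrative choices.

```python
from collections import Counter
from statistics import mean, median
from urllib.parse import urlsplit


def characterize(pages):
    """Summarize a corpus given (url, visible_text) pairs (hypothetical input shape)."""
    pages = list(pages)
    # Pages per domain, with the domain approximated by the URL's network location
    per_domain = Counter(urlsplit(url).netloc.lower() for url, _ in pages)
    tokens_per_page = [len(text.split()) for _, text in pages]  # whitespace tokens
    return {
        "unique_domains": len(per_domain),
        "pages_per_domain": dict(per_domain),
        "mean_tokens": mean(tokens_per_page) if tokens_per_page else 0,
        "median_tokens": median(tokens_per_page) if tokens_per_page else 0,
    }
```

The two distributions (`pages_per_domain` and the per-page token counts) are what a histogram in the report would be built from.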

---


## Ideas:

- Explore URLs
  - Scan the sitemap
  - Descriptive learning tree to store links
    - Should all computationally intensive tasks be delegated to threads?
    - Store a page as soon as it is found?
    - Traverse the tree to extract the data?
    - Traverse it while it is being filled?
  - Threads
    1. One that keeps checking whether new pages were added to the children
    2. One in each child of the root

- Data Structures
  - Descriptive learning tree
    - Each node is a page

```json
{
    "quantity":  2,
    "frontier": [],
    "children": {
        "https://boaforma.abril.com.br": {
            "quantity":  0,
            "frontier": [],
            "URL": "",
            "Title": "",
            "Text": "",
            "Timestamp": 0,
            "children": {}
        },
        "https://olhardigital.com.br": {
            "quantity":  0,
            "frontier": [],
            "URL": "",
            "Title": "",
            "Text": "",
            "Timestamp": 0,
            "children": {}
        },
        "http://www.paguemenos.com.br": {
            "quantity":  0,
            "frontier": [],
            "URL": "",
            "Title": "",
            "Text": "",
            "Timestamp": 0,
            "children": {}
        }
    }
}
```

- Parallelism
  - Defer the processing of threads that exceed a given processing time.
  - A new thread for each tree node? No... Too many threads.
    - What limit?
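
One common way to cap the number of threads is a fixed-size pool rather than one thread per node. The following is a minimal sketch, assuming a hypothetical `fetch(url)` callable supplied by the caller; the function name `crawl_batch` and the default of 10 workers are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_batch(urls, fetch, max_workers=10):
    """Fetch a batch of URLs with a bounded pool; return {url: result or exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # a failing page should not stop the batch
                results[url] = exc
    return results
```

Timing the same batch with different `max_workers` values is one way to look for the point where adding threads stops improving the download rate.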


## Pseudocode

- [ ] Install libraries with `requirements.txt`
  - [ ] `pip install -r requirements.txt`
    - [ ] BeautifulSoup4
    - [ ] requests
    - [ ] threading
    - [ ] time
    - [ ] gzip
    - [ ] warcio
    - [ ] json
    - [ ] os
- [ ] Import libraries
- [ ] Define the initialization parameters
  - [ ] Path to Seeds
  - [ ] Target Number of Webpages
  - [ ] Debug Mode
- [ ] Create an initial tree
- [ ] Fetch the initial URLs and store them in the tree
- [ ] Create the page scraper
  - [ ] Define the minimal URL
  - [ ] Check whether the minimal URL is valid and equal to the initial one
  - [ ] Store
    - [ ] Robots.txt
    - [ ] Sitemap
    - [ ] HTML
    - [ ] the title
    - [ ] URL
    - [ ] the first 20 words of the text
    - [ ] the timestamp
    - [ ] the full HTML
- [ ] Create the thread generator
- [ ] Think about how to traverse the tree
  - [ ] For now, follow a simple dictionary

---


# Trying to Focus on the Pseudocode

- Get Seeds
- Store Frontier
  1. Dictionary
     1. Without priority
     2. With priority
  2. Tree
- Traverse frontier (1: Single, 2: Threading)
  - If new domain
    - Pre-processing
      - Read Robots
        - Agents
        - Delay
      - Read Sitemap
        - Recursively
          - URL:
            - loc: url
            - lastmod: timestamp
  - Get HTML
    1. focus on delaying
  - Parse
  - Get Links
  - Update Frontier
  - Rerank
  - new domains > outlinks > old inlinks > new inlinks
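
The reranking order in the notes above could be realized with a heap-based frontier. This is only a sketch: the class name `PriorityFrontier` and the four labels are assumptions taken from the ordering written in the list, and how a URL gets classified into one of them is left out.

```python
import heapq
import itertools

# Lower number = crawled sooner; the classes follow the ordering in the notes above
PRIORITY = {"new_domain": 0, "outlink": 1, "old_inlink": 2, "new_inlink": 3}


class PriorityFrontier:
    """Frontier that pops URLs by priority class, FIFO within a class (sketch)."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker that preserves insertion order

    def push(self, url, kind):
        """Enqueue `url` unless it was enqueued before; return True if it was added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._counter), url))
        return True

    def pop(self):
        """Remove and return the URL with the best (lowest) priority."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Because the `_seen` set is never cleared, a URL that was popped and crawled cannot re-enter the frontier, which also covers the revisitation policy.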

---

## MVP

- Get Seeds
- Store Frontier on list
- While Frontier:
  - Get URL
  - Parse HTML
  - Get Links
  - Update Frontier
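
The MVP loop above can be sketched in a few lines. The version below is illustrative rather than the notebook's implementation: `fetch_links(url)` is a hypothetical callable that downloads and parses a page and returns its outlinks, which keeps the loop itself free of network code.

```python
from collections import deque


def crawl(seeds, fetch_links, limit):
    """Breadth-first crawl: return the URLs visited, in order (illustrative sketch)."""
    frontier = deque(seeds)
    seen = set(seeds)            # URLs already enqueued, to avoid revisits
    visited = []
    while frontier and len(visited) < limit:
        url = frontier.popleft()           # Get URL
        visited.append(url)
        for link in fetch_links(url):      # Parse HTML + Get Links
            if link not in seen:           # Update Frontier, skipping duplicates
                seen.add(link)
                frontier.append(link)
    return visited
```

Swapping `fetch_links` for a stub backed by a small dictionary makes it possible to check the traversal order without touching the network.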


# Might be needed later

```python
import requests

url = "https://example.com/some/path/page.html"
parsed_url = requests.utils.urlparse(url)
scheme = parsed_url.scheme
domain = parsed_url.netloc
path = parsed_url.path
```

## Required Packages

First we install and import the packages the crawler needs. The project dependencies are installed from the `requirements.txt` file, which contains that list.


In [None]:
""" Installing Python packages """

# run installation with requirements.txt

%pip install -r requirements.txt


## Imports

The imported packages are essential to the crawler. The packages and their roles are listed below:

- **BeautifulSoup4:** Parses the HTML and extracts relevant information, such as the page title and the visible text.
- **requests:** Makes HTTP requests and downloads the content of web pages.
- **datetime:** Handles dates and times, in particular to record the timestamp at which a page was downloaded.
- **warcio:** Creates WARC (Web ARChive) files that store the downloaded content efficiently.
- **os:** Interacts with the operating system, for example to create directories and manipulate files.
- **json:** Handles JSON data, in particular to print the crawler's results.
- **gzip:** Compresses the WARC files, reducing the storage space required.
- **threading:** Creates and manages threads, allowing the crawler to download several pages simultaneously.


In [None]:
""" Importing the required libraries """

# import beautifulsoup4 as bs
# import certifi                          # Root certificates for validating SSL/TLS (used by requests)
# import charset_normalizer               # Used for detecting and normalizing text encodings (dependency of requests)
# import idna                             # Internationalized domain name support (dependency of requests)
# from protego import Protego             # Parses and enforces robots.txt rules
import requests                         # HTTP library for making requests to web resources
# import six                              # Compatibility layer for writing Python 2/3 code (used by many older libs)
# import soupsieve                        # CSS selector engine for BeautifulSoup
# import typing_extensions                # Adds backported or experimental typing features for older Python versions
# from url_normalize import url_normalize # Normalizes URLs into a consistent format
# import urllib3                          # Low-level HTTP library used by requests
# import warcio                           # Library for reading and writing WARC (Web ARChive) files
# from warcio.capture_http import capture_http # Capture HTTP requests and responses for archiving
from warcio.warcwriter import WARCWriter # Writes records into a WARC file
from warcio.statusandheaders import StatusAndHeaders # HTTP status line and headers stored with each record
from warcio.archiveiterator import ArchiveIterator # Needed for reading WARC files
import gzip # Needed for reading WARC files
from io import BytesIO # In-memory byte stream (used when building WARC records)

import bs4 as bs # BeautifulSoup package for parsing HTML and XML
import datetime # Getting unix timestamp
import json
import re # Splitting strings
import argparse # Parsing command line arguments
import sys
from collections import deque # Needed for the sitemap exploring
import time # Time functions



# Python CLI

- The `argparse` module is a standard library in Python that provides a way to handle command-line arguments passed to a script. It allows you to define the expected arguments, their types, and whether they are required or optional. The module automatically generates help messages and handles errors related to argument parsing.
- From the provided code, we can see that the `argparse` module is used to define three command-line arguments: `-s`, `-n`, and `-d`. Here's a breakdown of each argument:
  - `-s <SEEDS>`: This argument specifies the path to a file containing a list of seed URLs. The URLs in this file will be used to initialize the crawling process.
  - `-n <LIMIT>`: This argument specifies the target number of webpages to be crawled. The crawler will stop its execution once this target is reached.
  - `-d`: This is an optional argument that indicates whether the crawler should run in debug mode. When this argument is provided, the crawler will print debugging information to the console as it processes each webpage.

In [None]:
""" Simulating the python CLI arguments """

def get_args():
    """ Set up command line arguments """
    parser = argparse.ArgumentParser(description="Web Crawling script")
    parser.add_argument('-s', '--seeds', type=str, required=True, help="Path to seed file")
    parser.add_argument('-n', '--limit', type=int, required=True, help="Number of pages to crawl")
    parser.add_argument('-d', '--debug', action='store_true', help="Enable debug mode")
    
    if '__file__' not in globals():  # Detects whether this is running in a notebook
        params = ['-s', './Seeds/seeds-2024711370.txt', '-n', '2500', '-d']
        args = parser.parse_args(params)  # Simulates the command-line arguments
    else:
        args = parser.parse_args()   # Normal usage from the terminal
    
    return args

ARGS = get_args()

# print(f"Arguments: {ARGS}")  # Debugging line to check the arguments passed to the script

## Global Variables

Some global parameters were defined to make the crawler easier to control. These parameters include:

- **ARGS:** `argparse.Namespace` holding the command-line arguments passed to the script
  - **SEEDS:** Path to the file containing the initial (seed) URLs for the crawler.
  - **DEBUG:** Debug mode, which prints detailed information about the download process.
  - **LIMIT:** Maximum number of pages to download.
- **MIN_DELAY:** Minimum wait, in milliseconds, between requests, to avoid overloading the server.
- **MAX_THREADS:** Maximum number of threads that may be created for simultaneous downloads.


In [None]:
""" Code constants: SEEDS_FILE, PAGES_LIMIT, DEBUG_MODE, MIN_DELAY, MAX_THREADS """

SEEDS_FILE = ARGS.seeds    # required argument, so it is always set
PAGES_LIMIT = ARGS.limit   # required argument, so it is always set
DEBUG_MODE = ARGS.debug    # store_true flag, False unless -d is passed
MIN_DELAY = 100 # Delay in milliseconds between requests
MAX_THREADS = 10 # Maximum number of threads to use for crawling

# Helper Functions

- Here we have some helper functions that are used to perform specific tasks within the crawler. Mostly for debug purposes.
- Those functions are:
  - `print_json`: It prints the JSON object in a formatted way, making it easier to read and understand.

## JSON Pretty Print

- `print_json` function is used to print JSON objects in a human-readable format.

In [None]:
""" print_json: Pretty print JSON data """

def print_json(data):
    """ Pretty prints JSON data. """
    def convert(obj):
        if isinstance(obj, set):
            return list(obj)
        if isinstance(obj, requests.structures.CaseInsensitiveDict):
            return dict(obj)
        raise TypeError(f'Object of type {obj.__class__.__name__} is not JSON serializable')

    print(json.dumps(data, indent=4, default=convert))

## Default Requester

- `default_requester`: This function makes an HTTP GET request to a given URL and wraps the error handling. It takes the URL and an optional timeout (in seconds) and, if the request completes, returns a dictionary with the main attributes of the response (content, text, status code, cleaned headers, and so on). If the request raises a `requests.RequestException`, it prints the error and returns `None`; it does not retry.

In [None]:
""" default_requester: useful for removing try_except blocks from the main code """

def default_requester(url, timeout=5):
    """ Default function to make a GET request to a URL. """
    
    def cleaning_headers(headers):
        """ Clean headers not needed keys """
        new_headers = {
            'Content-Type': headers.get('Content-Type', ''),
            'Cache-Control': headers.get('Cache-Control', ''),
            'Content-Encoding': headers.get('Content-Encoding', ''),
            'Date': headers.get('Date', ''),
            'Strict-Transport-Security': headers.get('Strict-Transport-Security', ''),
            # 'X-Frame-Options': headers.get('X-Frame-Options', ''),
            # 'ETag': headers.get('ETag', ''),
            # 'Vary': headers.get('Vary', ''),
        }
        
        return new_headers
        
    
    def get_response_dict(response):
        """ Convert response to a dictionary. """
        response_dict = {
            'apparent_encoding': response.apparent_encoding,         # Returns the apparent encoding
            # 'close()': response.close,                               # Closes the connection to the server
            'content': response.content,                             # Returns the content of the response, in bytes
            'cookies': response.cookies,                             # Returns a CookieJar object with the cookies sent back from the server
            'elapsed': response.elapsed,                             # Returns a timedelta object with the time elapsed from sending the request to the arrival of the response
            'encoding': response.encoding,                           # Returns the encoding used to decode r.text
            'headers': response.headers,                             # Returns a dictionary of response headers
            'history': response.history,                             # Returns a list of response objects holding the history of request (url)
            'is_permanent_redirect': response.is_permanent_redirect, # Returns True if the response is the permanent redirected url, otherwise False
            'is_redirect': response.is_redirect,                     # Returns True if the response was redirected, otherwise False
            # 'iter_content()': response.iter_content,                 # Iterates over the response
            # 'iter_lines()': response.iter_lines,                     # Iterates over the lines of the response
            # 'json': response.json(),                                 # Returns a JSON object of the result (if the result was written in JSON format, if not it raises an error)
            'links': response.links,                                 # Returns the header links
            'next': response.next,                                   # Returns a PreparedRequest object for the next request in a redirection
            'ok': response.ok,                                       # Returns True if status_code is less than 400, otherwise False
            # 'raise_for_status()': response.raise_for_status,         # If an error occur, this method returns a HTTPError object
            'reason': response.reason,                               # Returns a text corresponding to the status code
            'request': response.request,                             # Returns the request object that requested this response
            'status_code': response.status_code,                     # Returns a number that indicates the status (200 is OK, 404 is Not Found)
            'text': response.text,                                   # Returns the content of the response, in unicode
            'url': response.url,                                     # Returns the URL of the response
            'version': response.raw.version,                             # Returns the version of the HTTP protocol used by the server
        }
        response_dict['headers'] = cleaning_headers(response_dict['headers'])
        return response_dict
    
    try:
        response = requests.get(url, timeout=timeout)
        # print(response.status_code)  # Print the status code of the response
        return get_response_dict(response)  # Call the function to get the response dictionary
    except requests.RequestException as e:
        # response.raise_for_status()  # Raise an error for bad responses
        print(f"Error fetching {url}: {e}")
        return None
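
If retries turn out to be useful, they could be layered on top of a requester that returns `None` on failure. The wrapper below is a hypothetical sketch, not part of the notebook; the name `with_retries`, the attempt count, and the doubling backoff are illustrative choices.

```python
import time


def with_retries(func, *args, attempts=3, backoff=0.5, **kwargs):
    """Call `func` until it returns a non-None value or `attempts` is exhausted.

    Follows the convention of returning None on failure; waits `backoff` seconds
    after the first failure and doubles the wait after each further one.
    """
    delay = backoff
    for attempt in range(1, attempts + 1):
        result = func(*args, **kwargs)
        if result is not None:
            return result
        if attempt < attempts:      # no point sleeping after the last attempt
            time.sleep(delay)
            delay *= 2
    return None
```

For example, `with_retries(default_requester, url, timeout=5)` would try the same URL up to three times before giving up.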

## Benchmark Test

- `benchmark_test`: This function measures how long a function call takes. It takes a function and a list of positional parameters, calls the function, prints the elapsed time in seconds, and returns the function's result. This can be useful for performance testing and optimization.

In [None]:
""" benchmarking """

def benchmark_test(function, parameters):
    """ Benchmarking function to measure execution time. """
    start_time = time.time()
    result = function(*parameters)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(
        f"Function {function.__name__} executed in {elapsed_time:.4f} seconds")
    return result

# benchmark_test(get_robots_txt, ['https://olhardigital.com.br'])
# benchmark_test(get_sitemap, ['https://olhardigital.com.br', {'sitemap': []}])

## Cleaning URL

- `get_base_url`: This function extracts the base URL from a given URL. It takes a URL as an argument and returns the base URL (scheme and network location only), which is useful for grouping URLs by website. The function uses `urlparse`, as exposed by `requests.utils`, to parse the URL and extract the relevant components.

In [None]:
""" get_base_url: Get the base URL from a given URL """

def get_base_url(url):
    """ Get the base URL from a given URL. """
    
    parsed_url = requests.utils.urlparse(url)
    base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
    return base_url
    

## Reading WARC Zipped File

- `read_warc_zipped_file`: This function reads a WARC file, optionally compressed with gzip. It takes the path to the WARC file and, for each `response` record, prints the target URI and the first 500 bytes of the payload. This is useful for inspecting archived web content.

In [None]:
""" read_warc: Reads the WARC file and prints its contents. """

def read_warc_zipped_file(warc_path):
    """ Reads and prints the WARC file contents. """
    with open(warc_path, 'rb') as stream:
        if warc_path.endswith('.gz'):
            stream = gzip.GzipFile(fileobj=stream)

        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                uri = record.rec_headers.get_header('WARC-Target-URI')
                payload = record.content_stream().read()
                print(f"URI: {uri}")
                print(f"Payload: {payload[:500]}...")  # Show first 500 bytes
                print("-" * 50)

# read_warc_zipped_file('output.warc.gz')

# Actual useful functions

Now we have the core functions that perform the main tasks of the crawler.


## Debug Print

- `debug_print`: This function is used to print the debug information in the way that was specified in the assignment. It prints the URL, title, text, and timestamp of the crawled page in JSON format.


In [None]:
""" debug_print: Prints the parsed URL in a readable format that is defined in the assignment. """

def debug_print(parsed_url):
    """ Prints the parsed URL in a readable format. """
    debug_info = {
        'URL': parsed_url['URL'],
        'Title': parsed_url['Title'],
        'Text': parsed_url['Text'],
        'Timestamp': parsed_url['Timestamp'],
    }
    print_json(debug_info)

## Get Timestamp

- `get_timestamp`: This function is used to get the current timestamp in Unix time format. It uses the `datetime` module to get the current date and time and then converts it to a timestamp. This timestamp is used to record when the page was crawled.

In [None]:
""" get_timestamp: Returns the current timestamp in seconds since 1970 """
    
def get_timestamp():
    """ Returns the current timestamp in seconds since 1970 """
    return int(datetime.datetime.now().timestamp())

## Get Seeds

- `get_seeds`: This function gets the seeds from the file specified in the command-line arguments. It reads the file, skips blank lines and lines starting with `#`, and returns a set of URLs to be used as seeds for the crawler.

In [None]:
""" get_seeds: Getting seeds from file """

def get_seeds(path='./Seeds/seeds-2024711370.txt'):
    """ Reads all seeds from a file and returns them as a set. """
    seeds = set()
    with open(path, 'r') as file:
        for line in file:
            line = line.strip()
            if line and not line.startswith('#'):
                seeds.add(line)
    return seeds

## Parsing Robots.txt

- `get_robots_txt`: This function downloads and parses the `robots.txt` file of a website. It extracts the allow and disallow rules per user-agent, the crawl delay, and the sitemap URLs, and returns them as a dictionary (or `None` if the file cannot be fetched). This information is used to determine whether the crawler is allowed to access certain pages on the website.

In [None]:
""" get_robots_txt: processes the contents of the robots.txt file. """

def get_robots_txt(url):
    """ Returns the robots.txt file for a given URL. """
    def robots_scraping(robots_text):
        """ Processes the robots info """
        # Parse the robots.txt file and extract the rules
        rules = {
            'crawl-delay': MIN_DELAY,
            'user-agents': {},
            'sitemap': [],
            'misc': [],
        }
        lines = robots_text.splitlines()
        user_agent = None
        for line in lines:
            line = line.split('#', 1)[0].strip()  # drop comments and surrounding whitespace
            if not line:
                continue
            splitted_line = line.split(':', 1)

            # Only the directive name is case-insensitive; paths and URLs keep their case
            key = splitted_line[0].strip().lower()
            value = splitted_line[1].strip() if len(splitted_line) > 1 else ''
            if key == 'user-agent':
                user_agent = value.lower()
                if user_agent not in rules['user-agents']:
                    rules['user-agents'][user_agent] = { 'disallow': [], 'allow': []}
            
            elif key in ['disallow', 'allow'] and user_agent:
                rules['user-agents'][user_agent][key].append(value)
            elif key == 'crawl-delay' and user_agent:
                try:
                    # Crawl-delay is given in seconds; MIN_DELAY is in milliseconds
                    rules[key] = int(float(value) * 1000)
                except ValueError:
                    pass
            elif key == 'sitemap':
                rules[key].append(value)
            else:
                rules['misc'].append(line)
            
        return rules
        
    
    # Parse the URL to get the base domain
    parsed_url = requests.utils.urlparse(url)
    # print(type(parsed_url))
    # print(dict(parsed_url))
    robots_url = f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt"
    
    try:
        response = requests.get(robots_url, timeout=5)
        if response.status_code == 200:
            return robots_scraping(response.text)
        else:
            print(f"Robots.txt not found for {url}. Status code: {response.status_code}")
            return None
    except requests.RequestException as e:
        print(f"Error fetching robots.txt for {url}: {e}")
    return None
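
As a cross-check for the hand-written parser, the standard library's `urllib.robotparser` can evaluate the same rules. The sketch below works on robots.txt text that has already been downloaded; the sample rules, the helper name `load_rules`, and the user-agent string `my-crawler` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content, invented for this example
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
allowed = rules.can_fetch("my-crawler", "https://example.com/news/")
blocked = rules.can_fetch("my-crawler", "https://example.com/private/x")
delay_s = rules.crawl_delay("my-crawler")  # in seconds, or None when not specified
```

Note that `crawl_delay` reports seconds, so the value has to be converted before it is compared with a delay expressed in milliseconds.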

## Parse Sitemap.xml

- `get_sitemap`: This function parses the sitemaps of a website, starting from the sitemap URLs listed in `robots.txt` and falling back to `/sitemap.xml` when none are listed. Nested sitemaps are followed with a queue, and the result is a dictionary with the sitemap URLs visited and the page URLs found. This information is used to find additional pages to crawl on the website.
  - Possible improvements:
    - Add more info like the lastmod date for URL refreshing.

In [None]:
""" get_sitemap: processes the contents of the sitemap.xml file. """

def get_sitemap(url, robots_info=None):
    def parse_sitemap(sitemap_url):
        """ Processes the sitemap XML and extracts URLs info """
        response = default_requester(sitemap_url)
        if response is None or not response['ok']:
            return [[], []]
        sitemap_text = response['text']  # default_requester returns a dict, not a Response
        soup = bs.BeautifulSoup(sitemap_text, 'xml')
        urls = []
        xmls = []
        
        # Page entries: <url><loc>...</loc></url>
        for url_tag in soup.find_all('url'):
            loc = url_tag.find('loc')
            if loc:
                urls.append(loc.text.strip())
        # Nested sitemap entries: <sitemap><loc>...</loc></sitemap>
        for sitemap in soup.find_all('sitemap'):
            loc = sitemap.find('loc')
            if loc:
                xmls.append(loc.text.strip())
        return [xmls, urls]
    
    def traverse_sitemaps(sitemap_info):
        """ Breadth-first walk over the sitemap files, collecting page URLs """
        sitemap_queue = deque(sitemap_info['sitemap_urls'])
        seen = set(sitemap_info['sitemap_urls'])  # guards against sitemaps that reference each other
        while sitemap_queue:
            sitemap_url = sitemap_queue.popleft()
            nested, pages = parse_sitemap(sitemap_url)
            # nested, pages = benchmark_test(parse_sitemap, [sitemap_url])
            for nested_url in nested:
                if nested_url not in seen:
                    seen.add(nested_url)
                    sitemap_queue.append(nested_url)  # enqueue the nested sitemap
                    sitemap_info['sitemap_urls'].append(nested_url)  # record it in the sitemap list
            sitemap_info['found_urls'].extend(pages)

        return sitemap_info
    
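    # The parameter defaults to None, so fall back to an empty robots record.
    if robots_info is None:
        robots_info = {'sitemap': []}

    # When robots.txt lists no sitemap, try the conventional /sitemap.xml location.
    if not robots_info['sitemap']:
        parsed_url = requests.utils.urlparse(url)
        sitemap_url = f"{parsed_url.scheme}://{parsed_url.netloc}/sitemap.xml"
        robots_info['sitemap'] = [sitemap_url]

    sitemap_info = {'found_urls': [], 'sitemap_urls': robots_info['sitemap'] }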
    
    sitemap_info = traverse_sitemaps(sitemap_info)

    # print_json(sitemap_info)

    return sitemap_info
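
The queue-based traversal above has no record of which sitemaps were already visited, so two sitemaps that reference each other would keep it running indefinitely. The sketch below shows one way to guard against that with a `seen` set. The name `traverse_sitemaps_once` and the injected `parse` callable are assumptions for illustration, not part of the notebook.

```python
from collections import deque


def traverse_sitemaps_once(start_urls, parse):
    """Breadth-first walk over nested sitemaps, visiting each sitemap once.

    `parse(sitemap_url)` is assumed to return (nested_sitemap_urls, page_urls),
    mirroring the shape returned by `parse_sitemap` in this notebook.
    """
    queue = deque(start_urls)
    seen = set(start_urls)       # sitemaps already queued or visited
    visited = []                 # sitemaps in the order they were parsed
    pages = []

    while queue:
        sitemap_url = queue.popleft()
        visited.append(sitemap_url)
        nested, found = parse(sitemap_url)
        pages.extend(found)
        for child in nested:
            if child not in seen:    # skip sitemaps that point back to known ones
                seen.add(child)
                queue.append(child)
    return visited, pages
```

Passing the parser as an argument also makes the traversal easy to exercise with a fake sitemap graph, without issuing any HTTP request.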

## Update Frontier

- `update_frontier`: This function relies on the uniqueness property of Python's `set` data structure. It adds every outlink of a scraped page to the frontier, and the set discards duplicates automatically.
  - Possible improvements:
    - Use a priority queue to prioritize URLs based on their importance or relevance.
    - Implement a more sophisticated deduplication strategy, such as normalizing URLs before adding them to the frontier.
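
Both improvements can be sketched together, under assumptions: `normalize_url` and `PriorityFrontier` are hypothetical names, lower priority values are popped first, and the normalization rules shown (lowercase scheme and host, drop default ports and fragments) are only one possible choice. How this would combine with the notebook's own `get_base_url` helper is left open.

```python
import heapq
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {'http': 80, 'https': 443}


def normalize_url(url):
    """Lowercase scheme and host, drop default ports and fragments.

    Sketch only: userinfo and IPv6 literals are not handled.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or '').lower()
    port = parts.port
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f'{host}:{port}'
    path = parts.path or '/'
    return urlunsplit((scheme, netloc, path, parts.query, ''))


class PriorityFrontier:
    """Frontier that deduplicates normalized URLs and pops the lowest priority first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that preserves insertion order

    def add(self, url, priority=0):
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        heapq.heappush(self._heap, (priority, self._counter, key))
        self._counter += 1
        return True

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Because `_seen` is never emptied, a URL that was already popped is not queued again, which would also make a separate "already scraped" check unnecessary.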


In [None]:
""" update_frontier: Adds the outlinks of a scraped page to the frontier. """

def update_frontier(frontier, scraped_url):
    """ Updates the frontier (a set) with the outlinks found in the scraped page. """
    frontier.update(scraped_url['Outlinks'])
    return frontier

## Scraping URLs

- `scrape_url`: This function fetches a page and collects its contents in a dictionary. It uses the `requests` library to make an HTTP GET request to the URL and retrieves the HTML content of the page, then extracts the title and visible text using BeautifulSoup. The returned dictionary contains the URL, title, first 20 words of text, and timestamp of the crawled page, plus its outlinks, cleaned full text, and response metadata. If the request fails or the response is not a `200` with `text/html` content, those fields are left empty.
  - Possible improvements:
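
One candidate, sketched here as an assumption rather than a description of the current crawler: `get_new_links` keeps only `href` values that start with `http`, so relative links such as `/about` or `../index.html` are dropped. The hypothetical helper `resolve_links` below resolves them against the page URL with the standard library's `urllib.parse`.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def resolve_links(page_url, hrefs):
    """Resolve raw href values against page_url and keep only HTTP(S) links."""
    links = set()
    for href in hrefs:
        # urljoin handles absolute, root-relative, and path-relative hrefs alike.
        absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
        # Discard mailto:, javascript:, tel:, and other non-web schemes.
        if urlsplit(absolute).scheme in ('http', 'https'):
            links.add(absolute)
    return links
```

In the notebook this would receive `link['href']` for every `<a>` tag found by `soup.find_all('a', href=True)`.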


In [None]:
""" scrape_url: Fetches a URL and returns the information extracted from the page. """

def scrape_url(url):
    """ Fetches a URL and returns a dictionary with the extracted page information. """

    def first_words(text):
        """ Only get the first 20 words from the text (ignoring empty tokens) """
        # \W+ = any run of characters that are not letters, digits, or underscores
        words = re.split(r'\W+', text)
        # words = re.findall(r'\b\w[\w\'\-]*[!?.,]?\b', text) # Match words with light trailing punctuation (.,!? etc.)

        # drop the empty strings produced by split
        words = [word for word in words if word]
        joined_words = ' '.join(words[:20])
        return joined_words
    
    def get_new_links(soup):
        """ Returns all new links found in the parsed HTML. """
        links = set()
        for link in soup.find_all('a', href=True):
            href = link['href']
            if href.startswith('http'):
                links.add(get_base_url(href))
        return links
        
    def clean_text(text):
        """ Cleans the text by removing excess whitespace and newlines. """
        # Collapse runs of spaces and tabs, but keep newlines (\s+ would swallow them too).
        cleaned_text = re.sub(r'[^\S\n]+', ' ', text)
        # Collapse blank lines and whitespace around line breaks into a single newline.
        cleaned_text = re.sub(r'\s*\n\s*', '\n', cleaned_text)
        return cleaned_text
    
    def get_useful_info(base_parsed_url, response):
        """ Returns useful information from the response. """
        soup = bs.BeautifulSoup(response['content'], 'html.parser')
        full_text = soup.get_text()
        cleaned_text = clean_text(full_text)
        
        base_parsed_url['Title'] = soup.title.string if soup.title else None
        base_parsed_url['Text'] = first_words(full_text)
        base_parsed_url['Timestamp'] = get_timestamp()
        
        base_parsed_url['Outlinks'] = get_new_links(soup)
        base_parsed_url['Full_Text'] = cleaned_text
        base_parsed_url['Headers'] = response['headers']
        # base_parsed_url['Raw'] = response['raw']
        base_parsed_url['Status_Code'] = response['status_code']
        base_parsed_url['Version'] = response['version']
        return base_parsed_url
    
    base_parsed_url = {
        'URL': url,
        'Title': None,
        'Text': None,
        'Timestamp': None,
        'Outlinks': set(),
        'Full_Text': None,
        'Headers': None,
        'Status_Code': None,
        'Version': None
    }
    response = default_requester(url)
    if response is None:
        return base_parsed_url
    
    # Drop parameters such as "; charset=utf-8" and ignore case and stray spaces.
    mime = response['headers'].get('Content-Type', '').split(';')[0].strip().lower()
    status_code_200 = response['status_code'] == 200
    is_HTML = mime == 'text/html'
    # print_json(response)
    if status_code_200 and is_HTML:
        return get_useful_info(base_parsed_url, response)

    return base_parsed_url

## Store WARC

- `store_warc`: This function stores crawled pages in WARC format. It appends one `response` record per page to `output.warc.gz`, compressing each record with gzip to reduce storage costs. The current version writes the cleaned page text (`Full_Text`) as the payload rather than the raw HTML, and leaves the HTTP headers empty.
  - Needed improvements:
    - Limit the number of pages stored in each WARC file to 1000, as specified in the assignment.
  - Possible improvements:
    - Reduce the size of the WARC file by removing unnecessary metadata or compressing the HTML content further.
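
The 1000-page limit could be handled by choosing the output file from a running record count. The sketch below only computes file names and does not call `warcio`. The class name `WarcRotation` and the `output_NNN.warc.gz` naming scheme are assumptions.

```python
class WarcRotation:
    """Chooses the WARC file for each stored page, rolling over every N pages."""

    def __init__(self, pages_per_file=1000, prefix='output'):
        self.pages_per_file = pages_per_file
        self.prefix = prefix
        self.stored = 0  # number of records handed out so far

    def next_path(self):
        """Return the path that should receive the next record and count it."""
        file_index = self.stored // self.pages_per_file
        self.stored += 1
        return f'{self.prefix}_{file_index:03d}.warc.gz'
```

`store_warc` could then open `rotation.next_path()` in `'ab'` mode instead of the fixed `output.warc.gz`, so that records 0 to 999 land in the first file, 1000 to 1999 in the second, and so on.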

In [None]:
""" store_warc: Stores the parsed URL in a WARC file. """

def store_warc(parsed_url):
    """ Stores the parsed URL in a WARC file. """

    def get_protocol_version(version):
        """ Returns the protocol version used in the response. """
        protocol  = f'unknown {version}'
        if version == 10:
            protocol = 'HTTP/1.0'
        elif version == 11:
            protocol = 'HTTP/1.1'
        elif version == 20:
            protocol = 'HTTP/2.0'
        return protocol

    is_compressed = True
    output_path = 'output.warc'
    if is_compressed:
        output_path += '.gz'

    headers = StatusAndHeaders(
        statusline=str(parsed_url['Status_Code']),
        headers={},
        # headers=parsed_url['Headers'].items(), # More complete, but cluttered.
        protocol=get_protocol_version(parsed_url['Version']),
    )

    with open(output_path, 'ab') as output:  # ab = Append and Binary mode.
        # gzip = True makes it automatically compressed.
        writer = WARCWriter(output, gzip=is_compressed)
        record = writer.create_warc_record(
            uri=parsed_url['URL'],
            record_type='response',
            payload=BytesIO(parsed_url['Full_Text'].encode('utf-8')),
            # payload=parsed_url['Raw'],
            http_headers=headers,
        )
        writer.write_record(record)

## Main Function

- `main`: This is the main function of the crawler.
  - Gets the initial frontier from the seeds file.
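
The assignment runs the crawler as `python3 main.py -s <SEEDS> -n <LIMIT> [-d]`. A minimal sketch of how `main.py` could read those arguments and load the seeds is shown below. `parse_args` and `read_seeds` are hypothetical names, and `read_seeds` only stands in for the notebook's `get_seeds`, whose implementation is not shown in this section.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line flags described in the assignment statement."""
    parser = argparse.ArgumentParser(description='Web crawler')
    parser.add_argument('-s', dest='seeds', metavar='SEEDS', required=True,
                        help='path to a file with one seed URL per line')
    parser.add_argument('-n', dest='limit', metavar='LIMIT', type=int, required=True,
                        help='target number of webpages to crawl')
    parser.add_argument('-d', dest='debug', action='store_true',
                        help='print a JSON record for each crawled page')
    return parser.parse_args(argv)


def read_seeds(path):
    """Read seed URLs (one per line), skipping blank lines."""
    with open(path, encoding='utf-8') as handle:
        return {line.strip() for line in handle if line.strip()}
```

With this in place, `SEEDS_FILE` and `PAGES_LIMIT` would come from `args.seeds` and `args.limit` instead of notebook constants.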


In [None]:
""" Running MVP """

scraping = {
    'count': 0,
    'content': dict(),
    'frontier': get_seeds(SEEDS_FILE).copy(),  # Set of URLs to scrape
}

scraping['frontier'] = {get_base_url(url) for url in scraping['frontier']}

def scrape_once(scraping):
    """ Scrapes a single URL and updates the scraping state. """

    def is_scrapable(scraping):
        """ Checks if there are URLs to scrape and if the limit has not been reached. """
        if scraping['count'] >= PAGES_LIMIT:  # Check if the limit has been reached
            print(f"Scraping limit reached: {PAGES_LIMIT} pages.")
            return False
        if not scraping['frontier']:  # Check if the frontier is empty
            print("No more URLs to scrape.")
            return False
        return True

    def was_scraped(url, scraped_content):
        """ Checks if the URL has already been scraped. """
        if url in scraped_content:
            print(f"URL already scraped: {url}")
            return True
        return False

    if not is_scrapable(scraping):
        return None

    url = scraping['frontier'].pop()
    
    if was_scraped(url, scraping['content']):
        return None
    
    parsed_url = scrape_url(url)
    scraping['content'][url] = parsed_url
    # Update the frontier with new links found in the parsed URL
    scraping['frontier'] = update_frontier(scraping['frontier'], parsed_url)
    scraping['count'] += 1  # Increment the count of pages scraped
    # print_json(parsed_url)  # Debugging line to check the parsed URL
    # store_warc(parsed_url)

def main():
    """ Main function to run the MVP. """
    global scraping

    while scraping['frontier'] and scraping['count'] < PAGES_LIMIT:
        scrape_once(scraping)
        # The frontier can be empty after the last pop, so peek at it without indexing.
        next_url = next(iter(scraping['frontier']), None)
        msg = f'({scraping["count"]}/{len(scraping["frontier"])}) => {PAGES_LIMIT}: Next in frontier: {next_url}'
        print(msg)
        # print_json(scraping)
        # if scraping['count'] % 100 == 0:  # Print status every 100 iterations

if __name__ == '__main__':
    main()

---

In [None]:
""" debug_sitemap """

def debug_sitemap():
    url = "https://olhardigital.com.br"
    robots_info = get_robots_txt(url)
    robots_info['sitemap'] = []  # Clear the sitemap list to force fetching
    result = get_sitemap(url, robots_info)
    return result

# result = debug_sitemap()

In [None]:
print_json(scraping)