# Introduction

The first step in our task is to obtain the data necessary for analysis. Since our company is in the early stages of development and does not have its own database, we intend to use publicly available resources.  
  
For this purpose, the website [Scrape This Site](https://www.scrapethissite.com/pages/forms/) has been recommended to us. Before downloading any data, carefully review the site's [FAQ](https://www.scrapethissite.com/faq/). Pay particular attention to the limits on the number of requests, because our solution must stay within them.  
  
It is expected that after executing the code contained in this notebook, the `data/raw/` folder will be populated with data, which will serve as the source for the next stage of the project.

# Notebook Configuration

## Importing Required Libraries

In [1]:
from selenium.webdriver.common.by import By
from selenium import webdriver

from time import sleep

## Driver and Selenium Configuration

#### Chrome

In [2]:
browser = webdriver.Chrome()

#### Firefox

#### Edge

#### Opera

# Fetching Website Content

This section of the notebook contains code for fetching website content. To properly execute the task, consider the following steps:  
- Ensure all available data on the site has been fetched by checking if there are additional data pages.  
- Locate the data of interest on the page using `html` inspection tools.  
- Navigate between subsequent data pages using browser mechanisms or by analyzing the `url` structure.  
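
The last step, navigating by analyzing the `url` structure, can be sketched with the standard library alone. The `page_num` parameter comes from the address used later in this notebook. The helper names `build_page_url` and `page_number_from_url` are hypothetical, and `per_page` as a query parameter is an assumption that should be confirmed in the browser before relying on it.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://www.scrapethissite.com/pages/forms/"


def build_page_url(page: int, per_page: Optional[int] = None) -> str:
    """Build the address of a single results page (hypothetical helper)."""
    params = {"page_num": page}
    if per_page is not None:
        # Assumption: the site accepts a per_page query parameter.
        params["per_page"] = per_page
    return f"{BASE_URL}?{urlencode(params)}"


def page_number_from_url(url: str) -> int:
    """Read the page number back out of an address built above."""
    query = parse_qs(urlparse(url).query)
    return int(query["page_num"][0])
```

Building the query string with `urlencode` instead of string concatenation keeps the address valid if more parameters are added later.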
  
> Remember to respect the query limits specified in the `FAQ`!  
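
One way to stay within a request limit is to enforce a minimum interval between consecutive requests. The sketch below is only an illustration: the `Throttle` class is hypothetical, and the interval passed to it must be taken from the `FAQ`, not from this example.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval  # seconds; take the real value from the FAQ
        self._last_call = None

    def wait(self) -> None:
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                # Sleep only for the part of the interval that has not yet elapsed.
                time.sleep(remaining)
        self._last_call = time.monotonic()
```

Calling `wait()` on a single `Throttle` instance right before each `browser.get(...)` would space the requests out. Unlike a fixed `sleep`, it does not add extra delay when loading the page already took longer than the interval.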
  
Save each fetched page to the `data/raw/` folder as `hockey_teams_page_{page_number}.html`. At this stage, we only retrieve the data without processing it; analysis will be performed later.  
  
To get the `html` content of the currently loaded page, use `browser.page_source`. Make sure the Selenium driver configured earlier in this notebook is running before you do.  
  
> (Optional) If there are multiple pages to fetch, use the string method [zfill](https://www.programiz.com/python-programming/methods/string/zfill) to pad the page numbers with leading zeros so that the file names sort in page order.
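
A short sketch of why the padding matters. The helper `raw_file_name` is hypothetical, and the width of 2 assumes fewer than 100 pages.

```python
def raw_file_name(page: int, width: int = 2) -> str:
    """Return the file name for a page, padding the number with leading zeros."""
    return f"hockey_teams_page_{str(page).zfill(width)}.html"


# Without padding, alphabetical order puts page 10 before page 2.
unpadded = sorted(f"hockey_teams_page_{p}.html" for p in (1, 2, 10))

# With padding, alphabetical order matches page order.
padded = sorted(raw_file_name(p) for p in (1, 2, 10))
```

Here `unpadded` ends up with page 10 ahead of page 2, while `padded` keeps the pages in numeric order.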



In [4]:
def is_table_empty() -> bool:
    """
    Check whether the table on the current page contains no data.

    :return: True if the table has no data rows, otherwise False
    """
    table_rows = (
        browser
        .find_element(by=By.TAG_NAME, value='table')
        .find_elements(by=By.TAG_NAME, value='tr')
    )

    # an empty table contains only the header row
    return len(table_rows) == 1

In [5]:
per_page = 50
browser.implicitly_wait(10)

In [6]:
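# page_num selects the page; per_page sets how many rows each page returns
url_template = 'https://www.scrapethissite.com/pages/forms/?page_num={page}&per_page={per_page}'
page = 1

while True:
    url = url_template.format(page=page, per_page=per_page)
    browser.get(url)

    # stop once we have moved past the last page that contains data
    if is_table_empty():
        break

    html = browser.page_source
    page_number_str = str(page).zfill(2)
    with open(f'../data/raw/hockey_teams_page_{page_number_str}.html', 'w', encoding='utf-8') as file:
        file.write(html)

    page += 1
    sleep(1)  # pause between requests (see the limits in the FAQ)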

In [7]:
browser.quit()

# Summary

Saving the raw pages locally reduces the risk that a site update disrupts the extraction process partway through. It also keeps the data available in its original form, which is crucial if reprocessing is needed.
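
As an illustration of that reprocessing benefit, the saved pages can be read back in page order without touching the site again. This is a minimal sketch: the helper `load_raw_pages` is hypothetical, and it assumes the files follow the `hockey_teams_page_{page_number}.html` naming used above and were written as UTF-8.

```python
from pathlib import Path
from typing import Dict, Union


def load_raw_pages(raw_dir: Union[str, Path]) -> Dict[str, str]:
    """Read every saved page from raw_dir, keyed by file name, in sorted order."""
    pages = {}
    # Zero-padded page numbers make alphabetical order match page order.
    for path in sorted(Path(raw_dir).glob("hockey_teams_page_*.html")):
        pages[path.name] = path.read_text(encoding="utf-8")
    return pages
```

Calling `load_raw_pages("../data/raw")` from the notebook's location would return the stored `html` of each page, ready for the extraction step.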

In the next step, we will focus on extracting the necessary information from the `html` pages, which is essential for conducting the analysis.