# Introduction

The first step in our task is to obtain the data necessary for analysis. Since our company is in the early stages of development and does not have its own database, we intend to use publicly available resources.  
  
For this purpose, we were pointed to the website [Scrape This Site](https://www.scrapethissite.com/pages/forms/). Before downloading anything, read the site's [FAQ](https://www.scrapethissite.com/faq/) carefully. Pay particular attention to the limits on the number of requests, because our scraper must stay within them.
  
After running the code in this notebook, the `data/raw/` folder should contain the downloaded pages, which serve as the input for the next stage of the project.

# Notebook Configuration

## Importing Required Libraries

In [23]:
%pip install selenium beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [24]:
import time
from pathlib import Path

from selenium import webdriver

## Driver and Selenium Configuration

In [25]:
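# Start a Chrome session and open the first page to confirm the setup works.
driver = webdriver.Chrome()
driver.get("https://www.scrapethissite.com/pages/forms/")

# Give the page a few seconds to finish loading.
time.sleep(3)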



# Fetching Website Content

This section of the notebook contains code for fetching website content. To properly execute the task, consider the following steps:  
- Ensure all available data on the site has been fetched by checking if there are additional data pages.  
- Locate the data of interest on the page using `html` inspection tools.  
- Navigate between subsequent data pages using browser mechanisms or by analyzing the `url` structure.  
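
The first and third steps can be combined: if the pagination links carry the page number in the URL, the highest number found tells us how many pages to request. The snippet below is a minimal sketch under that assumption. The helper `last_page_number` and the `sample_html` fragment are hypothetical and only imitate pagination markup; the real page may be structured differently, so check it with the inspection tools first.

```python
import re


def last_page_number(html: str) -> int:
    """Return the highest page_num value found in the HTML, or 1 if none is present."""
    numbers = [int(n) for n in re.findall(r"page_num=(\d+)", html)]
    return max(numbers, default=1)


# Hypothetical fragment imitating pagination links; not copied from the site.
sample_html = """
<ul class="pagination">
  <li><a href="/pages/forms/?page_num=1">1</a></li>
  <li><a href="/pages/forms/?page_num=2">2</a></li>
  <li><a href="/pages/forms/?page_num=24">24</a></li>
</ul>
"""

print(last_page_number(sample_html))  # prints 24
```

With Selenium, the same function could be applied to the `page_source` of the first page before starting the download loop.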
  
> Remember to respect the request limits specified in the `FAQ`!
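
One way to stay within a request limit is to enforce a minimum gap between consecutive requests instead of relying on a fixed sleep. The class below is a sketch of that idea; `Throttle` and its `min_interval` parameter are hypothetical names, and the actual interval to use has to come from the FAQ, not from this example.

```python
import time


class Throttle:
    """Make sure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first one

    def wait(self) -> float:
        """Sleep only as long as needed and return the number of seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept
```

In a download loop, calling `throttle.wait()` right before each `driver.get(url)` would only add the delay that is still missing, so time already spent loading and saving a page counts toward the gap.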
  
Save each fetched page to a file named `data/raw/hockey_teams_page_{page_number}.html`. At this stage, we only retrieve the data without processing it; the analysis comes later.
  
To fetch the `html` content of the current page, you can use `driver.page_source`. Make sure the browser automation tool (e.g., Selenium) is configured and ready before you start.
  
> (Optional) If there are multiple pages to fetch, use the string method [zfill](https://www.programiz.com/python-programming/methods/string/zfill) to add leading zeros to the page numbers, so that the files sort in page order.
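
To see why the padding matters, the short example below compares how file names sort with and without leading zeros. The page numbers are arbitrary illustration values.

```python
pages = [1, 2, 10, 24]

# Without padding, names are compared character by character, so 10 sorts before 2.
unpadded = sorted(f"hockey_teams_page_{p}.html" for p in pages)

# With zfill(2), every number has two digits and text order matches page order.
padded = sorted(f"hockey_teams_page_{str(p).zfill(2)}.html" for p in pages)

print(unpadded)
print(padded)
```

The format specifier `f"{p:02d}"` produces the same padded text as `str(p).zfill(2)` for non-negative integers, so either form works in a file name.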



In [26]:
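# Reuse the driver started in the configuration cell instead of opening a second browser.
raw = Path.cwd() / 'data' / 'raw'
raw.mkdir(parents=True, exist_ok=True)

try:
    # The loop assumes 24 result pages; adjust the upper bound if the site changes.
    for page in range(1, 25):
        url = f"https://www.scrapethissite.com/pages/forms/?page_num={page}"
        driver.get(url)
        time.sleep(2)  # pause between requests to respect the site's limits

        # Store the page exactly as rendered; parsing happens in a later notebook.
        filename = raw / f'hockey_teams_page_{page:02d}.html'
        filename.write_text(driver.page_source, encoding='utf-8')
        print(f"Saved: {filename}")
finally:
    # Close the browser even if a request fails.
    driver.quit()

print("Done downloading data.")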

Saved: c:\Users\boris\OneDrive\Plocha\Hockey copy\Hockey_project\data\raw\hockey_teams_page_01.html
Saved: c:\Users\boris\OneDrive\Plocha\Hockey copy\Hockey_project\data\raw\hockey_teams_page_02.html
Saved: c:\Users\boris\OneDrive\Plocha\Hockey copy\Hockey_project\data\raw\hockey_teams_page_03.html
Saved: c:\Users\boris\OneDrive\Plocha\Hockey copy\Hockey_project\data\raw\hockey_teams_page_04.html
Saved: c:\Users\boris\OneDrive\Plocha\Hockey copy\Hockey_project\data\raw\hockey_teams_page_05.html
Saved: c:\Users\boris\OneDrive\Plocha\Hockey copy\Hockey_project\data\raw\hockey_teams_page_06.html
Saved: c:\Users\boris\OneDrive\Plocha\Hockey copy\Hockey_project\data\raw\hockey_teams_page_07.html
Saved: c:\Users\boris\OneDrive\Plocha\Hockey copy\Hockey_project\data\raw\hockey_teams_page_08.html
Saved: c:\Users\boris\OneDrive\Plocha\Hockey copy\Hockey_project\data\raw\hockey_teams_page_09.html
Saved: c:\Users\boris\OneDrive\Plocha\Hockey copy\Hockey_project\data\raw\hockey_teams_page_10.html


# Summary

Saving the raw pages locally reduces the risk that a site update partway through extraction causes problems. It also keeps the data available in its original form, which is crucial if reprocessing is needed.
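
As an illustration of that reprocessing path, the saved files can be read back without touching the website again. This is a sketch only: `load_raw_pages` is a hypothetical helper, and it assumes the files follow the `hockey_teams_page_*.html` naming used in this notebook.

```python
from pathlib import Path


def load_raw_pages(raw_dir: Path) -> dict[str, str]:
    """Map each saved file name to its HTML text, in file-name order."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(raw_dir.glob("hockey_teams_page_*.html"))
    }
```

Calling `load_raw_pages(Path.cwd() / "data" / "raw")` would return the pages in page order, provided the file names are zero-padded.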

In the next step, we will focus on extracting the necessary information from the `html` pages, which is essential for conducting the analysis.