# Introduction

The purpose of this task is to extract information about hockey teams from the HTML files in the `data/raw` directory, which were generated in the previous stage, and save it in JSON format. The following fields are scraped and saved:  

- Team Name (`Team Name`),  
- Year (`Year`),  
- Number of wins (`Wins`),  
- Number of losses (`Losses`),  
- Number of overtime losses (`OT Losses`),  
- Win percentage (`Win %`),  
- Number of goals scored (`Goals For (GF)`),  
- Number of goals conceded (`Goals Against (GA)`),  
- Goal differential (`+ / -`).  

Each collected record should be organized into a dictionary with the structure shown below and then added to the results list:  

```python  
{  
    'Team Name': 'Boston Bruins',  
    'Year': '1990',  
    'Wins': '44',  
    'Losses': '24',  
    'OT Losses': '',  
    'Win %': '0.55',  
    'Goals For (GF)': '299',  
    'Goals Against (GA)': '264',  
    '+ / -': '35'  
}  
```
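One way to build such a record is to pair a fixed list of field names with the raw string values in the same order. The sketch below is only an illustration: `FIELDS`, `make_record`, and `results` are hypothetical names that do not come from the workshop materials.

```python
# Field names in the column order used by the record above.
FIELDS = [
    'Team Name', 'Year', 'Wins', 'Losses', 'OT Losses',
    'Win %', 'Goals For (GF)', 'Goals Against (GA)', '+ / -',
]


def make_record(values):
    """Pair each field name with its raw string value (hypothetical helper)."""
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    return dict(zip(FIELDS, values))


# Values are kept as strings, exactly as they would be read from the page.
results = []
results.append(
    make_record(['Boston Bruins', '1990', '44', '24', '', '0.55', '299', '264', '35'])
)
```

Keeping the field names in one list means the column order is defined in a single place, and a row with a missing or extra value is rejected instead of being silently misaligned.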

Place each item into the results list.

The resulting data should be saved in a file named `hockey_teams.json` in the `data/interim/` folder. This file will serve as the data source for further analysis in the next part of the workshop.

> At this point, converting HTML to JSON may seem complex and unnecessary. The goal is to consolidate knowledge of JSON, a format that is ubiquitous not only in data analysis but across IT in general.


# Notebook Configuration

## Import Required Libraries

In [9]:
# Only the libraries actually used in this notebook are imported.
from bs4 import BeautifulSoup as bs  # HTML parsing
from glob import glob                # finding files by pattern
from pathlib import Path             # building paths and creating folders
import json                          # saving the results

# Scraping

To scrape the required information from the saved files, follow these steps:

1. Find all HTML files in the `data/raw` folder using the `glob` module.
2. For each HTML file, use `BeautifulSoup` to scrape the page and extract the needed data.
3. Save the extracted, partially processed data to the `hockey_teams.json` file in the `data/interim/` folder.

These steps will allow for efficient processing of data from HTML files and prepare them for further analysis.

## List of HTML files

Using the `glob` module, find all `html` files in the `data/raw` folder.

In [10]:
path_file = Path('data/raw')
files = glob(str(path_file / '*.html'))
for file in files:
    print(file)

data\raw\hockey_teams_page_01.html
data\raw\hockey_teams_page_02.html
data\raw\hockey_teams_page_03.html
data\raw\hockey_teams_page_04.html
data\raw\hockey_teams_page_05.html
data\raw\hockey_teams_page_06.html
data\raw\hockey_teams_page_07.html
data\raw\hockey_teams_page_08.html
data\raw\hockey_teams_page_09.html
data\raw\hockey_teams_page_10.html
data\raw\hockey_teams_page_11.html
data\raw\hockey_teams_page_12.html
data\raw\hockey_teams_page_13.html
data\raw\hockey_teams_page_14.html
data\raw\hockey_teams_page_15.html
data\raw\hockey_teams_page_16.html
data\raw\hockey_teams_page_17.html
data\raw\hockey_teams_page_18.html
data\raw\hockey_teams_page_19.html
data\raw\hockey_teams_page_20.html
data\raw\hockey_teams_page_21.html
data\raw\hockey_teams_page_22.html
data\raw\hockey_teams_page_23.html
data\raw\hockey_teams_page_24.html
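
The order in which `glob` returns files is not guaranteed, so it can help to sort the result before processing. The sketch below uses `Path.glob` instead of the `glob` module and runs against a temporary folder so that it works anywhere. `list_html_files` is a hypothetical helper name, and the file names only mimic the ones listed above.

```python
import tempfile
from pathlib import Path


def list_html_files(folder):
    """Return the .html files in `folder`, sorted by name (hypothetical helper)."""
    return sorted(Path(folder).glob('*.html'))


# Demonstration on a throwaway folder; in the notebook the folder would be data/raw.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('hockey_teams_page_02.html', 'hockey_teams_page_01.html', 'notes.txt'):
        (Path(tmp) / name).touch()
    found = [p.name for p in list_html_files(tmp)]

print(found)
```

Because the page numbers are zero-padded, sorting by name also sorts the pages numerically.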


## Data extraction

Extract data from `html` files, making sure to maintain the expected structure of a single record:

```python
{
    'Team Name': 'Boston Bruins',
    'Year': '1990',
    'Wins': '44',
    'Losses': '24',
    'OT Losses': '',
    'Win %': '0.55',
    'Goals For (GF)': '299',
    'Goals Against (GA)': '264',
    '+ / -': '35'
}
```

In [11]:


all_teams = []

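# Sort the paths so pages are processed in a deterministic order
# (the order returned by glob is not guaranteed).
html_files = sorted(glob('data/raw/hockey_teams_page_*.html'))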

for file in html_files:
    with open(file, 'r', encoding='utf-8') as f:
        html = f.read()

    soup = bs(html, 'html.parser')
    table = soup.find('table')
    rows = table.find_all('tr')

    for row in rows[1:]:  # skip the header row
        cells = row.find_all('td')

        team_data = {
            'Team Name':          cells[0].text.strip(),
            'Year':               cells[1].text.strip(),
            'Wins':               cells[2].text.strip(),
            'Losses':             cells[3].text.strip(),
            'OT Losses':          cells[4].text.strip(),
            'Win %':              cells[5].text.strip(),
            'Goals For (GF)':     cells[6].text.strip(),
            'Goals Against (GA)': cells[7].text.strip(),
            '+ / -':              cells[8].text.strip(),
        }

        all_teams.append(team_data)

print(f"Total number of teams: {len(all_teams)}")

Total number of teams: 582
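
The loop above assumes that every page contains a table and that every row after the first has nine `td` cells. A more defensive variant might skip anything that does not match, as in the sketch below. The sample HTML is a made-up stand-in for the saved pages, not a copy of them, and `FIELDS` and `parse_rows` are hypothetical names.

```python
from bs4 import BeautifulSoup

FIELDS = [
    'Team Name', 'Year', 'Wins', 'Losses', 'OT Losses',
    'Win %', 'Goals For (GF)', 'Goals Against (GA)', '+ / -',
]

# Assumed structure: one header row of <th> cells, then data rows of <td> cells.
SAMPLE_HTML = """
<table>
  <tr><th>Team Name</th><th>Year</th><th>Wins</th><th>Losses</th><th>OT Losses</th>
      <th>Win %</th><th>Goals For (GF)</th><th>Goals Against (GA)</th><th>+ / -</th></tr>
  <tr><td> Boston Bruins </td><td>1990</td><td>44</td><td>24</td><td></td>
      <td>0.55</td><td>299</td><td>264</td><td>35</td></tr>
</table>
"""


def parse_rows(html):
    """Return one dict per data row; rows of unexpected length are skipped."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:  # page without a table: nothing to extract
        return []
    records = []
    for row in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if len(cells) != len(FIELDS):  # header row (no <td>) or malformed row
            continue
        records.append(dict(zip(FIELDS, cells)))
    return records


records = parse_rows(SAMPLE_HTML)
print(records)
```

Checking the cell count replaces the `rows[1:]` slice: the header row has no `td` cells, so it is filtered out by the same test that catches malformed rows.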


# Summary

After extracting the relevant information, the final step in preparation for analysis is to save the data to disk.

## Saving the file
Save the data to the `data/interim/` folder in a file named `hockey_teams.json`.

> Note: Remember to import the appropriate library for handling the JSON format beforehand.

In [12]:


# Create the output folder if it does not exist yet.
Path('data/interim').mkdir(parents=True, exist_ok=True)

# Quick sanity check before writing.
print(f"Number of teams: {len(all_teams)}")
print("First team:", all_teams[0])
print("Last team:", all_teams[-1])

with open('data/interim/hockey_teams.json', 'w', encoding='utf-8') as f:
    json.dump(all_teams, f)

Number of teams: 582
First team: {'Team Name': 'Boston Bruins', 'Year': '1990', 'Wins': '44', 'Losses': '24', 'OT Losses': '', 'Win %': '0.55', 'Goals For (GF)': '299', 'Goals Against (GA)': '264', '+ / -': '35'}
Last team: {'Team Name': 'Winnipeg Jets', 'Year': '2011', 'Wins': '37', 'Losses': '35', 'OT Losses': '10', 'Win %': '0.451', 'Goals For (GF)': '225', 'Goals Against (GA)': '246', '+ / -': '-21'}
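
A quick way to confirm that the file was written correctly is to load it back and compare it with the data in memory. The sketch below does this round trip in a temporary folder with a single sample record, so it does not touch `data/interim/`. The names `sample`, `out_path`, and `loaded` are illustrative only.

```python
import json
import tempfile
from pathlib import Path

# One record in the expected structure; all values stay strings.
sample = [{
    'Team Name': 'Boston Bruins', 'Year': '1990', 'Wins': '44', 'Losses': '24',
    'OT Losses': '', 'Win %': '0.55', 'Goals For (GF)': '299',
    'Goals Against (GA)': '264', '+ / -': '35',
}]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'hockey_teams.json'

    # Write the list of records as JSON.
    with open(out_path, 'w', encoding='utf-8') as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)

    # Read it back to check that nothing was lost or altered.
    with open(out_path, 'r', encoding='utf-8') as f:
        loaded = json.load(f)

print(loaded == sample)
```

The `indent=2` argument is optional. It makes the file easier to read in an editor at the cost of a larger file.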
