# Introduction

The goal of this task is to extract information about hockey teams from the HTML files in the `data/raw` directory, which were generated in the previous stage, and to save it in JSON format. The following fields should be scraped and saved:  

- Team Name (`Team Name`),  
- Year (`Year`),  
- Number of wins (`Wins`),  
- Number of losses (`Losses`),  
- Number of overtime losses (`OT Losses`),  
- Win percentage (`Win %`),  
- Number of goals scored (`Goals For (GF)`),  
- Number of goals conceded (`Goals Against (GA)`),  
- Goal differential (`+ / -`).  

Each collected record should be organized into a dictionary with the structure shown below and then added to the results list:  

```python  
{  
    'Team Name': 'Boston Bruins',  
    'Year': '1990',  
    'Wins': '44',  
    'Losses': '24',  
    'OT Losses': '',  
    'Win %': '0.55',  
    'Goals For (GF)': '299',  
    'Goals Against (GA)': '264',  
    '+ / -': '35'  
}  
```

The resulting list should be saved to a file named `hockey_teams.json` in the `data/interim/` folder. This file will serve as the data source for further analysis in the next part of the workshop.
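
To give a feel for what the saved file will contain, the following minimal sketch serializes a results list with one shortened record to JSON text. The name `example_results` and the three-field record are assumptions made for illustration; the real records carry all nine fields.

```python
import json

# Hypothetical results list with a single, shortened record.
example_results = []
example_results.append({'Team Name': 'Boston Bruins', 'Year': '1990', 'Wins': '44'})

# indent=4 produces human-readable JSON: a list containing one object.
json_text = json.dumps(example_results, indent=4)
print(json_text)
```

Parsing `json_text` back with `json.loads` yields the same list of dictionaries, which is what makes the format convenient as a hand-off between workshop stages.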

> At this point, converting HTML to JSON may seem complex and unnecessary. The aim is to build familiarity with JSON, a data format that is ubiquitous not only in data analysis but across IT in general.


# Notebook Configuration

## Import Required Libraries

In [1]:
import bs4
import json

from glob import glob

# Scraping

To scrape the required information from the saved files, follow these steps:

1. Find all HTML files in the `data/raw` folder using the `glob` module.
2. For each HTML file, use `BeautifulSoup` to scrape the page and extract the needed data.
3. Save the obtained, partially processed data in the `hockey_teams.json` file located in the `data/interim/` folder.

These steps will allow for efficient processing of data from HTML files and prepare them for further analysis.

## List of HTML files

Using the `glob` module, find all `html` files in the `data/raw` folder.

In [2]:
files = glob('../data/raw/*.html')
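
`glob` returns paths in an order that depends on the file system, so sorting the result can make repeated runs reproducible. The sketch below illustrates this under the assumption that a temporary directory stands in for `data/raw`; the file names are made up for the example.

```python
import tempfile
from glob import glob
from pathlib import Path

# A throwaway directory stands in for data/raw; the file names are hypothetical.
with tempfile.TemporaryDirectory() as raw_dir:
    for name in ('page_2.html', 'page_1.html', 'notes.txt'):
        Path(raw_dir, name).write_text('<html></html>', encoding='utf-8')

    # Only *.html matches the pattern; sorted() fixes the order of the result.
    example_files = sorted(glob(str(Path(raw_dir) / '*.html')))
    example_names = [Path(path).name for path in example_files]

print(example_names)
```

Whether the order matters depends on the later analysis; for a list of independent records it usually only affects which record appears first.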

## Scraping

Extract data from `html` files, making sure to maintain the expected structure of a single record:

```python
{
    'Team Name': 'Boston Bruins',
    'Year': '1990',
    'Wins': '44',
    'Losses': '24',
    'OT Losses': '',
    'Win %': '0.55',
    'Goals For (GF)': '299',
    'Goals Against (GA)': '264',
    '+ / -': '35'
}
```

In [3]:
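data = []
for file in files:
    # Parse the saved page with Python's built-in HTML parser.
    with open(file) as f:
        soup = bs4.BeautifulSoup(f.read(), 'html.parser')

    # The column names come from the header cells of the first table.
    table = soup.find('table')
    header = [cell.text.strip() for cell in table.find_all('th')]

    # Each row with the class "team" describes one team in one season.
    for row in table.find_all('tr', class_='team'):
        cells = row.find_all('td')
        cells = [cell.text.strip() for cell in cells]

        # Pair each column name with the cell in the same position.
        record = dict(zip(header, cells))
        data.append(record)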
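
The record-building step relies on `dict(zip(header, cells))`, which pairs each column name with the cell in the same position. The sketch below shows that step in isolation, using hand-written lists instead of parsed HTML (the `example_` names are assumptions for illustration). It also shows a caveat: `zip` stops at the shorter input, so a row with missing cells would silently lose its trailing keys.

```python
# Hand-written stand-ins for the parsed header and one table row.
example_header = ['Team Name', 'Year', 'Wins', 'Losses', 'OT Losses',
                  'Win %', 'Goals For (GF)', 'Goals Against (GA)', '+ / -']
example_cells = ['Boston Bruins', '1990', '44', '24', '',
                 '0.55', '299', '264', '35']

# Column names become keys, cell texts become values.
example_record = dict(zip(example_header, example_cells))

# zip() stops at the shorter sequence: a truncated row yields fewer keys.
short_record = dict(zip(example_header, example_cells[:3]))

print(example_record['Team Name'], example_record['+ / -'])
print(list(short_record))
```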

In [4]:
data[0]

{'Team Name': 'Boston Bruins',
 'Year': '1990',
 'Wins': '44',
 'Losses': '24',
 'OT Losses': '',
 'Win %': '0.55',
 'Goals For (GF)': '299',
 'Goals Against (GA)': '264',
 '+ / -': '35'}
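
Inspecting `data[0]` checks only one record. As a rough sanity check on the whole list, one could verify that every record has exactly the expected keys. This is a minimal sketch, not part of the original exercise; `EXPECTED_KEYS` and `find_malformed` are hypothetical names, and the sample records are written by hand.

```python
# The nine field names required by the task description.
EXPECTED_KEYS = {
    'Team Name', 'Year', 'Wins', 'Losses', 'OT Losses',
    'Win %', 'Goals For (GF)', 'Goals Against (GA)', '+ / -',
}


def find_malformed(records):
    """Return the indexes of records whose keys differ from EXPECTED_KEYS."""
    return [i for i, rec in enumerate(records) if set(rec) != EXPECTED_KEYS]


complete = {
    'Team Name': 'Boston Bruins', 'Year': '1990', 'Wins': '44',
    'Losses': '24', 'OT Losses': '', 'Win %': '0.55',
    'Goals For (GF)': '299', 'Goals Against (GA)': '264', '+ / -': '35',
}
# A deliberately broken copy that lacks the goal differential.
incomplete = {key: value for key, value in complete.items() if key != '+ / -'}

malformed = find_malformed([complete, incomplete])
print(malformed)
```

In the notebook, calling `find_malformed(data)` and getting an empty list would suggest that every row was parsed with all nine columns.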

# Summary

After extracting the relevant information, the final step in preparation for analysis is to save the data to disk.

## Saving the file
Save the data to the `data/interim/` folder under the name `hockey_teams.json`.

> Note: Remember to import the appropriate library for handling the JSON format beforehand.

In [5]:
with open('../data/interim/hockey_teams.json', 'w') as f:
    json.dump(data, f, indent=4)
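
To confirm that the file was written correctly, it can be read back and compared with the in-memory list. The sketch below does this under the assumption that a temporary directory stands in for `data/interim/`; `example_data` is a shortened, hand-written record list.

```python
import json
import tempfile
from pathlib import Path

# Shortened, hand-written stand-in for the scraped records.
example_data = [{'Team Name': 'Boston Bruins', 'Year': '1990', 'OT Losses': ''}]

# A throwaway directory stands in for data/interim/.
with tempfile.TemporaryDirectory() as interim_dir:
    path = Path(interim_dir) / 'hockey_teams.json'

    with open(path, 'w', encoding='utf-8') as f:
        json.dump(example_data, f, indent=4)

    # Read the file back to check that nothing was lost on the way.
    with open(path, encoding='utf-8') as f:
        loaded = json.load(f)

print(loaded == example_data)
```

Empty strings such as the `OT Losses` value survive the round trip unchanged, so missing values can still be handled in the next stage.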