# PWHL Hockey Database 

### Introduction

#### **Purpose:** 
This milestone focuses on working with query results, *not* designing schemas or practicing subqueries again. <br>
We will use SQL to create a single analysis-ready dataset, load it into pandas, and clean and transform it so it is ready for analysis.

#### **Technique:** 
Instead of copying results out of DB Browser, we will use Python to connect to our `pwhl_hockey.db`, run each query, and load its output directly into a pandas DataFrame for processing. Each DataFrame should represent a meaningful analysis result (the output of a query), not a raw table.

- Tables live in the database. 
- DataFrames live in Python for analysis.

We will only pull data into a DataFrame when we need it.
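
To illustrate the distinction, here is a minimal sketch using an in-memory SQLite database with a made-up two-table schema. The `goal` columns and the sample rows are assumptions for illustration only, not the actual `pwhl_hockey.db` schema. Rather than loading both tables, it pulls a single aggregated result into a DataFrame.

```python
import sqlite3

import pandas as pd

# Hypothetical miniature schema; the real pwhl_hockey.db tables may differ.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE team (team_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE goal (goal_key INTEGER PRIMARY KEY, team_key INTEGER);
    INSERT INTO team VALUES (1, 'Boston Fleet'), (2, 'Minnesota Frost');
    INSERT INTO goal VALUES (1, 1), (2, 1), (3, 2);
""")

# One DataFrame per analysis question: goals scored by each team.
goals_per_team = pd.read_sql_query(
    """
    SELECT t.name AS team_name, COUNT(g.goal_key) AS goals
    FROM team AS t
    LEFT JOIN goal AS g ON g.team_key = t.team_key
    GROUP BY t.team_key, t.name
    ORDER BY goals DESC;
    """,
    demo_conn,
)
demo_conn.close()
print(goals_per_team)
```

The tables stay in the database; only the answer to the question ends up in Python.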



**In this milestone, we will perform the following:**

- **Part 1**: Querying multiple tables with JOINs
- **Part 2**: Loading query results into pandas
- **Part 3**: Cleaning and transforming the result
- **Part 4**: Explaining the reasoning behind those choices

This notebook covers the process and analysis for **Part 2** and **Part 3** listed above. <br>
The other parts can be found in the GitHub repository here: [pwhl-database](https://github.com/alyzukas/pwhl-database/tree/main)

### Connecting to the database

In [6]:
# Import libraries
import sqlite3
import pandas as pd

# Connect to pwhl_hockey SQLite database
db_path = "/Users/alyssa.zukas/cpsc5071/pwhl_hockey.db"  
conn = sqlite3.connect(db_path)

In [7]:
# Sanity check
pd.read_sql("SELECT * FROM team LIMIT 5;", conn)

| | team_key | name | nickname | team_code | division | date_founded |
|---|---|---|---|---|---|---|
| 0 | 1 | Boston Fleet | Fleet | BOS | 1 | 2023-08-29 |
| 1 | 2 | Minnesota Frost | Frost | MIN | 1 | 2023-08-29 |
| 2 | 3 | Montréal Victoire | Victoire | MTL | 1 | 2023-08-29 |
| 3 | 4 | New York Sirens | Sirens | NY | 1 | 2023-08-29 |
| 4 | 5 | Ottawa Charge | Charge | OTT | 1 | 2023-08-29 |


In [8]:
# View all tables
tables = pd.read_sql_query("""
    SELECT name
    FROM sqlite_master
    WHERE type='table'
    ORDER BY name;
""", conn)

tables

| | name |
|---|---|
| 0 | assist |
| 1 | game |
| 2 | goal |
| 3 | location |
| 4 | penalties |
| 5 | period |
| 6 | player |
| 7 | season |
| 8 | shot |
| 9 | team |


### Load SQL Query results into Pandas

After choosing **one** of our JOIN queries to become our main dataset, we will either:

1. Paste the query text into Python and run `pd.read_sql()`, **OR**
2. Read the query text from the .sql file (as a normal text file) and run `pd.read_sql()`.
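
As a sketch of the second option, the snippet below writes a small query to a `.sql` file, reads it back as plain text, and passes the string to `pd.read_sql()`. The in-memory database, the `player` columns, the sample rows, and the file name `main_query.sql` are all stand-ins chosen so the example runs on its own; they are not the project's real schema or files.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in database so the sketch is self-contained (hypothetical columns).
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE team (team_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE player (player_key INTEGER PRIMARY KEY, full_name TEXT, team_key INTEGER);
    INSERT INTO team VALUES (1, 'Boston Fleet'), (2, 'Ottawa Charge');
    INSERT INTO player VALUES (10, 'Player A', 1), (11, 'Player B', 2);
""")

# Stand-in .sql file; in the project this file would already exist.
sql_path = Path(tempfile.mkdtemp()) / "main_query.sql"  # hypothetical file name
sql_path.write_text(
    "SELECT p.full_name, t.name AS team_name\n"
    "FROM player AS p\n"
    "JOIN team AS t ON t.team_key = p.team_key\n"
    "ORDER BY p.player_key;\n",
    encoding="utf-8",
)

# Read the .sql file as ordinary text, then hand the string to pandas.
query_text = sql_path.read_text(encoding="utf-8")
df_main = pd.read_sql(query_text, demo_conn)
demo_conn.close()
print(df_main)
```

Keeping the query in a file means the same text can be run in DB Browser and in the notebook without copy-paste drift.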


#### Chosen Query:

### Data Cleaning and Transformation

As the main focus of this milestone, we will clean and transform the data in the following steps:
1. Identify missing values using `df.isnull().sum()` and apply at least two missing-data strategies, explaining each choice.
2. Perform at least three of the following:
    - Trim whitespace from strings
    - Standardize capitalization
    - Fix data types (dates, integers, floats)
    - Rename unclear column names
    - Remove duplicates if present
3. Create **at least two new columns** that improve usability or analysis, show a short preview so it is clear what changed, and save the result to `cleaned_data.csv`.

#### Handle Missing data 
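
A minimal sketch of what this step might look like, using a small made-up DataFrame in place of the chosen query result. The column names and values are illustrative assumptions, and which strategy suits which column depends on the real data.

```python
import numpy as np
import pandas as pd

# Made-up query result; column names are illustrative only.
df_demo = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C", None],
    "assists": [2.0, np.nan, 1.0, 0.0],
    "penalty_minutes": [np.nan, 4.0, np.nan, 2.0],
})

# Step 1: count missing values per column.
missing_counts = df_demo.isnull().sum()
print(missing_counts)

# Strategy A: drop rows missing an identifying field, since a row
# without a player name cannot be attributed to anyone.
df_demo = df_demo.dropna(subset=["player_name"])

# Strategy B: fill numeric gaps with 0 where a missing value plausibly
# means "none recorded" (an assumption to verify against the source data).
df_demo = df_demo.fillna({"assists": 0, "penalty_minutes": 0})

# Confirm nothing is left missing.
print(df_demo.isnull().sum())
```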

#### Clean and Standardize Columns
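
The sketch below walks through the cleaning operations listed earlier on a deliberately messy, made-up DataFrame. The column `nm` and the sample values are assumptions for illustration, not columns from the chosen query.

```python
import pandas as pd

# Deliberately messy stand-in data (hypothetical column names and values).
df_demo = pd.DataFrame({
    "nm": ["  boston fleet", "Minnesota FROST ", "  boston fleet"],
    "date_founded": ["2023-08-29", "2023-08-29", "2023-08-29"],
    "division": ["1", "1", "1"],
})

# Rename an unclear column name.
df_demo = df_demo.rename(columns={"nm": "team_name"})

# Trim whitespace, then standardize capitalization.
df_demo["team_name"] = df_demo["team_name"].str.strip().str.title()

# Fix data types: text dates become datetimes, text digits become integers.
df_demo["date_founded"] = pd.to_datetime(df_demo["date_founded"])
df_demo["division"] = df_demo["division"].astype(int)

# Remove exact duplicate rows, which only match once the text is normalized.
df_demo = df_demo.drop_duplicates().reset_index(drop=True)

print(df_demo)
print(df_demo.dtypes)
```

Order matters here: dropping duplicates before trimming and recasing would miss rows that differ only in whitespace or capitalization. Note that `.str.title()` is a blunt tool and can mangle names with internal capitals, so it is worth spot-checking the result.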

#### Transform the data
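
As one possible shape for this step, the sketch below derives two new columns from a made-up player summary, previews the result, and saves it. The columns `goals`, `assists`, and `games_played` are assumed for illustration; the real derived columns depend on the chosen query.

```python
import pandas as pd

# Made-up cleaned result (hypothetical columns and values).
df_demo = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C"],
    "goals": [3, 0, 1],
    "assists": [2, 1, 0],
    "games_played": [5, 4, 0],
})

# New column 1: total points (goals plus assists).
df_demo["points"] = df_demo["goals"] + df_demo["assists"]

# New column 2: points per game. Zero games becomes NaN so the
# division cannot produce an infinite value.
games = df_demo["games_played"].where(df_demo["games_played"] > 0)
df_demo["points_per_game"] = (df_demo["points"] / games).round(2)

# Short preview so it is clear what changed.
print(df_demo.head())

# index=False keeps the row index out of the file, which avoids an
# "Unnamed: 0" column when the CSV is read back in.
df_demo.to_csv("cleaned_data.csv", index=False)
```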

### Closing the connection

In [4]:
conn.close()
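
An alternative sketch, for cases where the connection is only needed for a few queries: `contextlib.closing` closes the connection automatically when the block ends, even if a query raises. The `:memory:` database and the single sample row are stand-ins for the real file. Note that using a `sqlite3` connection directly in a `with` statement only commits or rolls back a transaction; it does not close the connection.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# ":memory:" stands in for the real database path.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE team (team_key INTEGER PRIMARY KEY, name TEXT)")
    demo_conn.execute("INSERT INTO team VALUES (1, 'Boston Fleet')")
    teams = pd.read_sql("SELECT * FROM team;", demo_conn)

# The connection is closed at this point, but the DataFrame is still usable.
print(teams)
```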

### (Notes) Leave out of final submission

#### Why not put all tables into one DataFrame

To put everything into one df, we’d have to JOIN:
- player
- tenure
- team
- season
- game
- shot
- goal
- assist
- penalties
- period
- location

That would create a table at the lowest grain (likely one row per shot or per event).

**Result:**

- Massive duplication of player, team, and season data
- Thousands of repeated rows for the same player/game
- Hard-to-debug logic

This is called **denormalization**, and it comes with tradeoffs.
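
The duplication can be seen on a tiny scale with pandas alone. The two DataFrames below are made-up stand-ins for `player` and `shot` with assumed column names; joining them at the shot grain repeats each player's attributes once per shot, while aggregating first keeps one row per player.

```python
import pandas as pd

# Made-up stand-ins for the player and shot tables (hypothetical columns).
players = pd.DataFrame({
    "player_key": [10, 11],
    "player_name": ["Player A", "Player B"],
})
shots = pd.DataFrame({
    "shot_key": [1, 2, 3, 4],
    "player_key": [10, 10, 10, 11],
})

# Joining at the lowest grain: one row per shot, player data repeated.
wide = shots.merge(players, on="player_key", how="left")
print(wide)

# Aggregating to the grain the analysis needs: one row per player.
shots_per_player = (
    shots.groupby("player_key")
    .size()
    .rename("shots")
    .reset_index()
    .merge(players, on="player_key")
)
print(shots_per_player)
```

With every table joined in, the same repetition would compound across team, season, game, and location columns.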