# Team Project, Week 7: ETL Pipeline

## Introduction

### Purpose:
This week we will practice designing and implementing a simple ETL (Extract–Transform–Load) pipeline. The goal is to produce a clean, integrated dataset that reflects our project domain and supports more meaningful analysis.

### Technique:
We will start with our cleaned team dataset and enrich it by ingesting data from the PWHL HockeyTech API (via LeagueStat). Our original database was built from the same data this API provides, which makes it a natural source for continuing to build out this project. The API is reached at the URL https://lscluster.hockeytech.com/feed/index.php, with query parameters selecting which data to return. HockeyTech is a company that provides analytics, statistics, live data feeds, and streaming solutions to hockey organizations around the world, from the junior to the professional level. The lscluster.hockeytech.com host is part of LeagueStat, HockeyTech's statistics engine. The API returns JSON data (wrapped as JSONP, which we unwrap in the Extract step). We selected this data source to enhance our own data and to compare aggregated statistical summaries for players across every game against our individualized in-game view.

### Project Documentation
Individual files for steps not performed in this summative notebook can be found in the GitHub repository here: [pwhl-database](https://github.com/alyzukas/pwhl-database/tree/main)

In [1]:
# Import libraries
import json  # needed for json.loads when parsing the API response
import sqlite3
import pandas as pd
import requests
from pprint import pprint
pd.set_option('display.max_columns', None)

## Part 1 – Extract

### Load your cleaned dataset from Week 6

In [2]:
# Connect to `week6_pwhl` SQLite database 
db_path = "/Users/alyssa.zukas/cpsc5071/week6_pwhl.db"  # -- INSERT YOUR PATH HERE  
conn = sqlite3.connect(db_path)

# QUERY 1 - SHOT-LEVEL DATASET
# Paste the query text 
query_1 = """
SELECT t.name as shooting_team
,gm.game_date
,gm.duration
-- shooter details
,shooter.name as shooter_name
,shooter.player_key
,tn.position
,shooter.shoots as shoots_handed
,shooter.hometown
,shooter.birthdate
,tn.jersey_number
-- shot details
,s.shot_key 
,s.shot_type
,s.shot_time
,s.shot_quality
-- goalie details
,goalie.name as goalie_name
,s.is_goal
--blocker and assists
,blocker.name AS blocker_name
,pa.name AS assist1_name
,pa2.name as assist2_name
--goal details
,gl.is_power_play as is_power_play_goal
,gl.is_short_handed as is_short_handed_goal
,gl.is_empty_net as is_empty_net_goal
,gl.is_penalty_shot as is_penalty_shot_goal
,gl.is_insurance_goal
,gl.is_game_winning_goal
,s.x_location
,s.y_location
FROM shot s
INNER JOIN game gm 
	on s.game_key=gm.game_key
INNER JOIN player shooter 
	on s.shooter_key=shooter.player_key
INNER JOIN player goalie
	on s.goalie_key=goalie.player_key
LEFT JOIN player blocker
	on s.blocker_key=blocker.player_key
INNER JOIN tenure tn
    ON shooter.player_key = tn.player_key
   AND gm.season_key = tn.season_key
INNER JOIN team t
    ON tn.team_key = t.team_key
LEFT JOIN goal gl 
	ON s.game_key=gl.game_key
    AND s.shot_key=gl.shot_key
LEFT JOIN assist a 
	ON gl.game_key=a.game_key
	AND gl.goal_key=a.goal_key
    AND a.assist_number = 1
LEFT JOIN assist a2
	ON gl.game_key=a2.game_key
	AND gl.goal_key=a2.goal_key
    AND a2.assist_number = 2
LEFT JOIN player pa
	ON a.player_key=pa.player_key
LEFT JOIN player pa2
    ON a2.player_key=pa2.player_key
ORDER BY s.shot_key;
"""

In [3]:
# Load the query_1 into Pandas
df = pd.read_sql(query_1, conn)

# Collect the goal fields which need NAs filled with 0
goal_flag_cols = [
    "is_power_play_goal",
    "is_short_handed_goal",
    "is_empty_net_goal",
    "is_penalty_shot_goal",
    "is_insurance_goal",
    "is_game_winning_goal"
]

# Collect the name fields which need NAs filled with "None"
conditional_name_cols = [
    "blocker_name", 
    "assist1_name", 
    "assist2_name"
]

# Fill Categorical missing values with "None"
df[conditional_name_cols] = df[conditional_name_cols].fillna("None")

# Fill Goal Flags with 0 
df[goal_flag_cols] = df[goal_flag_cols].fillna(0).astype(int)

# Collect all object fields which need explicit conversion to string
to_string_cols = [
    "shooting_team",
    "shooter_name",  
    "shoots_handed",
    "position",
    "hometown",
    "shot_type",
    "shot_quality",
    "goalie_name",
    "blocker_name",
    "assist1_name",
    "assist2_name"
]

# Collect all object fields which need explicit conversion to datetime
to_datetime_cols = [
    "duration",
    "game_date",
    "birthdate",
    "shot_time"
]

# Collect all object fields which need explicit conversion to integer
to_int_cols = [
    "is_power_play_goal",
    "is_short_handed_goal",
    "is_empty_net_goal",
    "is_penalty_shot_goal",
    "is_insurance_goal",
    "is_game_winning_goal"
]

#Cast as string
for i in to_string_cols:
    df[i] = df[i].astype("string")

#Cast as datetime
for i in to_datetime_cols:
    df[i] = pd.to_datetime(df[i])

#cast as integer
for i in to_int_cols:
    df[i] = df[i].astype("int64")

#split town and state/territory from 'hometown'
df[['hometown','home_state_or_territory']] = df.hometown.str.split(",", n=1, expand=True)

#create calculated field, 'duration_hours'
df['duration_hours'] = df['duration'].dt.hour + df['duration'].dt.minute / 60
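
Because `duration` is an elapsed time rather than a calendar timestamp, one alternative is to parse it as a timedelta, which avoids the date component that `pd.to_datetime` attaches. The following is a minimal sketch, assuming the raw values are `HH:MM:SS` strings; the sample values are made up for illustration and the variable names are not part of the notebook.

```python
import pandas as pd

# Hypothetical sample; assumes durations are stored as "HH:MM:SS" strings
durations = pd.Series(["02:31:00", "02:45:00"])

# Parse as elapsed time instead of a timestamp
as_timedelta = pd.to_timedelta(durations)

# Convert to fractional hours, including any seconds component
duration_hours = as_timedelta.dt.total_seconds() / 3600
print(duration_hours.round(6).tolist())
```

Unlike the hour-plus-minute calculation, this version also accounts for seconds, which only matters if the source records them.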


In [4]:
df.head()

Unnamed: 0,shooting_team,game_date,duration,shooter_name,player_key,position,shoots_handed,hometown,birthdate,jersey_number,shot_key,shot_type,shot_time,shot_quality,goalie_name,is_goal,blocker_name,assist1_name,assist2_name,is_power_play_goal,is_short_handed_goal,is_empty_net_goal,is_penalty_shot_goal,is_insurance_goal,is_game_winning_goal,x_location,y_location,home_state_or_territory,duration_hours
0,Toronto Sceptres,2026-01-20,2000-01-01 02:31:00,Blayre Turnbull,76,F,R,Stellarton,1993-07-15,19,4,Wrist,2000-01-01 00:01:18,Quality goal,Corinne Schroeder,1,,Claire Dalton,Kali Flanagan,0,0,0,0,0,0,513,196,NS,2.516667
1,Seattle Torrent,2026-01-18,2000-01-01 02:45:00,Alex Carpenter,34,G,L,North Reading,1994-04-13,31,16,Default,2000-01-01 00:06:20,Non quality blocked,Aerin Frankel,0,Jill Saulnier,Daryl Watts,Savannah Harmon,0,0,0,0,0,0,175,143,MA,2.75
2,Boston Fleet,2026-01-18,2000-01-01 02:45:00,Megan Keller,12,C,L,Farmington,1996-05-01,77,36,Slap,2000-01-01 00:12:28,Non quality goal,Corinne Schroeder,1,,Abby Newhook,Alina Müller,1,0,0,0,0,0,423,203,MI,2.75
3,Boston Fleet,2026-01-18,2000-01-01 02:45:00,Megan Keller,12,C,L,Farmington,1996-05-01,77,41,Slap,2000-01-01 00:13:50,Non quality blocked,Corinne Schroeder,0,Mariah Keopple,,,0,0,0,0,0,0,434,217,MI,2.75
4,Boston Fleet,2026-01-18,2000-01-01 02:45:00,Haley Winn,246,F,R,Rochester,2003-07-14,88,43,Default,2000-01-01 00:14:00,Non quality blocked,Corinne Schroeder,0,Lexie Adzija,,,0,0,0,0,0,0,389,173,NY,2.75


### Ingest at least one of the following external data sources

In [5]:
#url for API we will be using
url = "https://lscluster.hockeytech.com/feed/index.php"

#Dictionary of the parameters which the endpoint accepts and the values we will be providing
params = {
    "feed": "statviewfeed",
    "view": "players", 
    "season": 8,
    "team": "all",
    "position": "skaters", 
    "rookies": 0,
    "statsType": "standard",
    "league_id": 1,
    "limit": 500,
    "sort": "points",
    "lang": "en",
    "key": "446521baf8c38984",
    "client_code": "pwhl"
}

response = requests.get(url, params=params)

if response.status_code == 200:
    cleaned_text = response.text.strip()[1:-1]  # note that this data came through as JSONP, so we strip the 1st and last characters: ()
    api_data = json.loads(cleaned_text)
    print("we've got data!")
else:
    print("penalty assessed:", response.status_code)

we've got data!
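
Stripping the first and last characters works as long as the body is exactly `(...)`. A slightly more defensive sketch is shown below: it accepts a bare JSON body, a parenthesised body, or a body with a callback name and trailing semicolon. The helper name `unwrap_jsonp` and the sample bodies are hypothetical and not part of the HockeyTech feed.

```python
import json
import re


def unwrap_jsonp(text: str):
    """Return the parsed payload from a JSON or JSONP response body."""
    stripped = text.strip()
    # Optional callback name, then a parenthesised payload, then an optional semicolon
    match = re.fullmatch(r"[\w.$]*\((.*)\)\s*;?", stripped, flags=re.DOTALL)
    payload = match.group(1) if match else stripped
    return json.loads(payload)


# Made-up bodies illustrating the three shapes the helper accepts
print(unwrap_jsonp('([{"sections": []}])'))
print(unwrap_jsonp('callback([{"sections": []}]);'))
print(unwrap_jsonp('[{"sections": []}]'))
```

In the cell above, this would replace the manual slice with `api_data = unwrap_jsonp(response.text)`.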


In [6]:
# analyze the structure of the data
pprint(api_data)

[{'sections': [{'data': [{'prop': {'active': {'active': '1'},
                                   'name': {'playerLink': '20',
                                            'seoName': 'Kendall Coyne '
                                                       'Schofield'},
                                   'rookie': {'rookie': '0'},
                                   'team_code': {'teamLink': '2'}},
                          'row': {'active': '1',
                                  'assists': '6',
                                  'faceoff_attempts': '1',
                                  'faceoff_pct': '100.0',
                                  'faceoff_wins': '1',
                                  'games_played': '15',
                                  'goals': '10',
                                  'hits': '5',
                                  'hits_per_game_avg': '0.33',
                                  'ice_time_minutes_seconds': '250:40',
                                  'ice_time_p

In [7]:
# nested structure will require appropriate flattening
structure_to_flatten = api_data[0]['sections'][0]['data']

In [8]:
# The stats we want live under each record's 'row' key, so we collect them with a list comprehension:
rows = [player['row'] for player in structure_to_flatten]
api_df = pd.DataFrame(rows)
api_df.head()

Unnamed: 0,player_id,name,active,position,rookie,team_code,games_played,goals,shots,hits,shots_blocked_by_player,ice_time_minutes_seconds,shooting_percentage,assists,points,points_per_game,plus_minus,penalty_minutes,penalty_minutes_per_game,ice_time_per_game_avg,hits_per_game_avg,power_play_goals,power_play_assists,short_handed_goals,short_handed_assists,faceoff_attempts,faceoff_wins,faceoff_pct,rank
0,20,Kendall Coyne Schofield,1,F,0,MIN,15,10,48,5,5,250:40,20.8,6,16,1.07,13,0,0.0,16:42,0.33,0,1,0,0,1,1,100.0,1
1,189,Britta Curl-Salemme,1,F,0,MIN,15,7,28,17,17,266:50,25.0,9,16,1.07,12,8,0.53,17:47,1.13,0,1,0,0,37,19,51.4,2
2,21,Taylor Heise,1,F,0,MIN,15,3,38,13,3,255:51,7.9,13,16,1.07,11,6,0.4,17:03,0.87,0,2,0,0,220,106,48.2,3
3,23,Kelly Pannek,1,F,0,MIN,15,8,33,2,15,256:52,24.2,7,15,1.0,10,2,0.13,17:07,0.13,3,1,0,0,271,157,57.9,4
4,58,Brianne Jenner,1,F,0,OTT,16,8,47,12,5,305:29,17.0,7,15,0.94,2,6,0.38,19:05,0.75,3,2,1,0,280,159,56.8,5
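
If a value that only appears under `prop` is needed later (for example the numeric team link), it can be pulled into the same flat record. This is a sketch on a made-up payload that mirrors the `sections -> data -> prop/row` nesting printed earlier; the player names and the `team_link` column name are illustrative assumptions.

```python
import pandas as pd

# Made-up payload with the same nesting as the feed: sections -> data -> prop/row
sample = [{"sections": [{"data": [
    {"prop": {"name": {"playerLink": "20"}, "team_code": {"teamLink": "2"}},
     "row": {"player_id": "20", "name": "Player A", "goals": "10"}},
    {"prop": {"name": {"playerLink": "21"}, "team_code": {"teamLink": "2"}},
     "row": {"player_id": "21", "name": "Player B", "goals": "3"}},
]}]}]

records = sample[0]["sections"][0]["data"]

flattened = []
for record in records:
    flat = dict(record["row"])  # copy the stat fields
    # Pull one nested value out of 'prop'; missing keys become None
    flat["team_link"] = record.get("prop", {}).get("team_code", {}).get("teamLink")
    flattened.append(flat)

sample_df = pd.DataFrame(flattened)
print(sample_df)
```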


### Document the origin of the external data and why it was chosen
Please see the Technique subsection of the Introduction above.

## Part 2 - Transformation

### Clean and format the external data:
#### Handle missing values appropriately

In [9]:
api_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174 entries, 0 to 173
Data columns (total 29 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   player_id                 174 non-null    object
 1   name                      174 non-null    object
 2   active                    174 non-null    object
 3   position                  174 non-null    object
 4   rookie                    174 non-null    object
 5   team_code                 174 non-null    object
 6   games_played              174 non-null    object
 7   goals                     174 non-null    object
 8   shots                     174 non-null    object
 9   hits                      174 non-null    object
 10  shots_blocked_by_player   174 non-null    object
 11  ice_time_minutes_seconds  174 non-null    object
 12  shooting_percentage       174 non-null    object
 13  assists                   174 non-null    object
 14  points                    

The imported data has no missing values, as every column reports 174 `non-null` entries.
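
One caveat: because every column arrives as text, `info()` counts an empty string or a placeholder such as `"-"` as non-null. The sketch below shows one way to check for such hidden gaps. The toy frame and its placeholder values are hypothetical; we have not confirmed that the feed uses any of them.

```python
import pandas as pd

# Toy frame standing in for api_df; the placeholder values are hypothetical
toy = pd.DataFrame({
    "player_id": ["20", "189", "21"],
    "faceoff_pct": ["100.0", "", "-"],
    "goals": ["10", "7", " "],
})

# Treat empty, whitespace-only, and dash-only strings as missing
placeholder_pattern = r"^\s*-?\s*$"
is_placeholder = toy.apply(lambda col: col.str.match(placeholder_pattern))

# Count hidden gaps per column
hidden_missing = is_placeholder.sum()
print(hidden_missing)

# Blank out the placeholders so isna() can see them
cleaned = toy.mask(is_placeholder)
print(cleaned.isna().sum())
```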
#### Fix inconsistent field names or types

In [10]:
# Converting the different columns' types

str_cols = ['name', 'position', 'team_code', 'ice_time_minutes_seconds', 'ice_time_per_game_avg']

int_cols = [
    'player_id', 'rank', 'active', 'rookie', 'games_played', 'goals', 
    'shots', 'hits', 'shots_blocked_by_player', 'assists', 'points', 
    'plus_minus', 'penalty_minutes', 'power_play_goals', 
    'power_play_assists', 'short_handed_goals', 'short_handed_assists', 
    'faceoff_attempts', 'faceoff_wins'
]

float_cols = [
    'shooting_percentage', 'points_per_game', 'penalty_minutes_per_game', 
    'hits_per_game_avg', 'faceoff_pct' ]

for col in str_cols:
    api_df[col] =  api_df[col].astype("string")

for col in int_cols:
    api_df[col] =  api_df[col].astype("int64")

for col in float_cols:
    api_df[col] =  api_df[col].astype("float64")

api_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174 entries, 0 to 173
Data columns (total 29 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   player_id                 174 non-null    int64  
 1   name                      174 non-null    string 
 2   active                    174 non-null    int64  
 3   position                  174 non-null    string 
 4   rookie                    174 non-null    int64  
 5   team_code                 174 non-null    string 
 6   games_played              174 non-null    int64  
 7   goals                     174 non-null    int64  
 8   shots                     174 non-null    int64  
 9   hits                      174 non-null    int64  
 10  shots_blocked_by_player   174 non-null    int64  
 11  ice_time_minutes_seconds  174 non-null    string 
 12  shooting_percentage       174 non-null    float64
 13  assists                   174 non-null    int64  
 14  points    

In [11]:
api_df.head()

Unnamed: 0,player_id,name,active,position,rookie,team_code,games_played,goals,shots,hits,shots_blocked_by_player,ice_time_minutes_seconds,shooting_percentage,assists,points,points_per_game,plus_minus,penalty_minutes,penalty_minutes_per_game,ice_time_per_game_avg,hits_per_game_avg,power_play_goals,power_play_assists,short_handed_goals,short_handed_assists,faceoff_attempts,faceoff_wins,faceoff_pct,rank
0,20,Kendall Coyne Schofield,1,F,0,MIN,15,10,48,5,5,250:40,20.8,6,16,1.07,13,0,0.0,16:42,0.33,0,1,0,0,1,1,100.0,1
1,189,Britta Curl-Salemme,1,F,0,MIN,15,7,28,17,17,266:50,25.0,9,16,1.07,12,8,0.53,17:47,1.13,0,1,0,0,37,19,51.4,2
2,21,Taylor Heise,1,F,0,MIN,15,3,38,13,3,255:51,7.9,13,16,1.07,11,6,0.4,17:03,0.87,0,2,0,0,220,106,48.2,3
3,23,Kelly Pannek,1,F,0,MIN,15,8,33,2,15,256:52,24.2,7,15,1.0,10,2,0.13,17:07,0.13,3,1,0,0,271,157,57.9,4
4,58,Brianne Jenner,1,F,0,OTT,16,8,47,12,5,305:29,17.0,7,15,0.94,2,6,0.38,19:05,0.75,3,2,1,0,280,159,56.8,5
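
The two ice-time columns stay as `MM:SS` strings after the conversion above, so they cannot be averaged or compared numerically. A minimal sketch of converting them to total seconds follows, using values taken from the preview; the `ice_time_seconds` name is our own.

```python
import pandas as pd

# Values taken from the preview; the Series name matches the feed column
ice_time = pd.Series(["250:40", "266:50", "19:05"], name="ice_time_minutes_seconds")

# Split "MM:SS" into two integer columns, then combine into total seconds
parts = ice_time.str.split(":", n=1, expand=True).astype("int64")
ice_time_seconds = parts[0] * 60 + parts[1]
print(ice_time_seconds.tolist())  # [15040, 16010, 1145]
```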


In [12]:
# Rename columns so their names match the values they represent and align with the merge key

api_df = api_df.rename(columns={
    "name" : "player_name",
    "player_id" : "player_key"
})

#### Drop irrelevant columns
These details are all player-level summary statistics. We will keep only a few of them for comparison against shot stats within a single game. Mostly, we are using this data for practice in merging datasets.

In [13]:
api_df

Unnamed: 0,player_key,player_name,active,position,rookie,team_code,games_played,goals,shots,hits,shots_blocked_by_player,ice_time_minutes_seconds,shooting_percentage,assists,points,points_per_game,plus_minus,penalty_minutes,penalty_minutes_per_game,ice_time_per_game_avg,hits_per_game_avg,power_play_goals,power_play_assists,short_handed_goals,short_handed_assists,faceoff_attempts,faceoff_wins,faceoff_pct,rank
0,20,Kendall Coyne Schofield,1,F,0,MIN,15,10,48,5,5,250:40,20.8,6,16,1.07,13,0,0.00,16:42,0.33,0,1,0,0,1,1,100.0,1
1,189,Britta Curl-Salemme,1,F,0,MIN,15,7,28,17,17,266:50,25.0,9,16,1.07,12,8,0.53,17:47,1.13,0,1,0,0,37,19,51.4,2
2,21,Taylor Heise,1,F,0,MIN,15,3,38,13,3,255:51,7.9,13,16,1.07,11,6,0.40,17:03,0.87,0,2,0,0,220,106,48.2,3
3,23,Kelly Pannek,1,F,0,MIN,15,8,33,2,15,256:52,24.2,7,15,1.00,10,2,0.13,17:07,0.13,3,1,0,0,271,157,57.9,4
4,58,Brianne Jenner,1,F,0,OTT,16,8,47,12,5,305:29,17.0,7,15,0.94,2,6,0.38,19:05,0.75,3,2,1,0,280,159,56.8,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169,269,Vanessa Upson,1,F,1,MIN,13,0,6,0,5,68:54,0.0,0,0,0.00,-2,0,0.00,5:18,0.00,0,0,0,0,36,14,38.9,170
170,217,Taylor House,1,F,0,OTT,15,0,5,14,5,94:34,0.0,0,0,0.00,-3,0,0.00,6:18,0.93,0,0,0,0,13,5,38.5,171
171,129,Mellissa Channell-Watkins,1,D,0,VAN,16,0,12,29,19,305:19,0.0,0,0,0.00,-6,2,0.13,19:04,1.81,0,0,0,0,0,0,0.0,172
172,118,Emma Greco,1,D,0,OTT,16,0,4,8,5,34:28,0.0,0,0,0.00,-2,23,1.44,6:38,0.50,0,0,0,0,0,0,0.0,173


In [14]:
api_df = api_df.drop(['player_name','active','position','team_code'], axis=1)

### Merge or join with your original dataset if relevant

#### Merge two dataframes as a left join

In [15]:
combined_df = pd.merge(df, api_df, on='player_key', how='left')

combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 53 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   shooting_team             15 non-null     string        
 1   game_date                 15 non-null     datetime64[ns]
 2   duration                  15 non-null     datetime64[ns]
 3   shooter_name              15 non-null     string        
 4   player_key                15 non-null     int64         
 5   position                  15 non-null     string        
 6   shoots_handed             15 non-null     string        
 7   hometown                  15 non-null     string        
 8   birthdate                 15 non-null     datetime64[ns]
 9   jersey_number             15 non-null     int64         
 10  shot_key                  15 non-null     int64         
 11  shot_type                 15 non-null     string        
 12  shot_time               
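
Two optional `pd.merge` arguments help confirm that a join like this behaves as intended: `validate` checks the key cardinality, and `indicator` records which rows found a match. The sketch below uses toy frames; the `shots` and `season` names and their values are illustrative, not our real data.

```python
import pandas as pd

# Toy stand-ins: several shots per player on the left, one season row per player on the right
shots = pd.DataFrame({"player_key": [76, 76, 12, 999], "shot_key": [4, 5, 36, 50]})
season = pd.DataFrame({"player_key": [76, 12, 20], "games_played": [15, 16, 15]})

merged = pd.merge(
    shots,
    season,
    on="player_key",
    how="left",
    validate="many_to_one",  # raises MergeError if season has duplicate keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Shots whose shooter has no season row
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

With a left join, every shot is kept, so the row count of the result should equal the row count of the shot-level frame whenever the right side has one row per player.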

### Create at least one derived column (e.g., combining fields, date extraction, totals)

In [16]:
#year of birth
combined_df['year_of_birth'] = combined_df['birthdate'].dt.year
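
A related derived field that combines two columns is the shooter's age on game day. The following is a sketch on toy rows, assuming both columns are already `datetime64`; the `age_at_game` name is our own and the dates are illustrative.

```python
import pandas as pd

# Toy rows; dates are illustrative
toy = pd.DataFrame({
    "birthdate": pd.to_datetime(["1993-07-15", "1996-05-01"]),
    "game_date": pd.to_datetime(["2026-01-20", "2026-05-01"]),
})

birth, game = toy["birthdate"], toy["game_date"]

# True when the birthday has already occurred in the game's calendar year
had_birthday = (game.dt.month > birth.dt.month) | (
    (game.dt.month == birth.dt.month) & (game.dt.day >= birth.dt.day)
)

# Year difference, minus one if the birthday is still to come
toy["age_at_game"] = game.dt.year - birth.dt.year - (~had_birthday).astype("int64")
print(toy["age_at_game"].tolist())  # [32, 30]
```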

### Ensure the final DataFrame is consistent and usable for analysis
Our dataframe was originally scoped to the shot level in an individual game. With the season summary data joined in, we could aggregate the shot-level fields per player and compare those in-game figures against the season-level totals.
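
As a rough illustration of that comparison, the sketch below aggregates shot-level rows per player and sets the result beside the season totals carried on each row. The toy values are made up; the column names `is_goal`, `goals`, and `shots` follow the merged frame, while the output column names are our own.

```python
import pandas as pd

# Toy merged frame: one row per shot, season totals repeated on every row
toy = pd.DataFrame({
    "player_key": [76, 76, 12, 12, 12],
    "is_goal":    [1, 0, 1, 0, 0],
    "goals":      [5, 5, 8, 8, 8],       # season goals from the API
    "shots":      [40, 40, 60, 60, 60],  # season shots from the API
})

summary = toy.groupby("player_key").agg(
    game_shots=("is_goal", "size"),    # number of shot rows
    game_goals=("is_goal", "sum"),     # shots that scored
    season_goals=("goals", "first"),   # season values are constant per player
    season_shots=("shots", "first"),
)

summary["game_shooting_pct"] = 100 * summary["game_goals"] / summary["game_shots"]
summary["season_shooting_pct"] = 100 * summary["season_goals"] / summary["season_shots"]
print(summary)
```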

## Part 3 - Load

### Export the final transformed dataset to a CSV

In [17]:
combined_df.to_csv('week7_final_data.csv', index=False)  # omit the row index, matching the to_sql call below

### Save all intermediate and merged versions for backup

In [18]:
combined_df.to_sql('week7table', conn, if_exists='replace', index=False)
conn.commit()
conn.close()
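
A quick way to confirm a load step is to read the table back and compare it with what was written. The sketch below does this against an in-memory SQLite database so it does not touch the project file; the toy frame and the `demo_conn` name are illustrative.

```python
import sqlite3

import pandas as pd

# Toy frame standing in for combined_df
toy = pd.DataFrame({"player_key": [76, 12], "goals": [5, 8]})

# In-memory database so the sketch does not touch a real file
demo_conn = sqlite3.connect(":memory:")
toy.to_sql("week7table", demo_conn, if_exists="replace", index=False)

# Read the table back and compare it with what was written
round_trip = pd.read_sql("SELECT * FROM week7table", demo_conn)
demo_conn.close()

print(round_trip.shape == toy.shape)
print(round_trip["goals"].tolist())
```

For the real table, note that datetime columns are typically returned from SQLite as text unless `parse_dates` is passed to `pd.read_sql`, so a dtype comparison may differ even when the values match.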

##  Part 4 – Documentation & Reflection

#### Write-up
Please see included file here: [pwhl-database-Week7_ReadMe.txt](https://github.com/alyzukas/pwhl-database/wiki)