<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **DATA 6510**
# **Homework 2: Do It Yourself NBA Box Scores** 
_Fun with NBA game logs, with a couple challenges thrown in for the experts_

## **Learning Objectives**
### **Theory / Be able to explain ...**
- The common forms for SQL `SELECT` queries
- How aggregation is used to summarize data
- The differences between transaction data and analytical data 

### **Skills / Know how to ...**
- Write and debug SQL select queries within a Colab notebook
- Create basic summary data from transaction data

---
## **Boilerplate: Software and Database Setup**




The code below $\downarrow$ will get you started. Rerun if your Colab session times out. 

In [None]:
# Load %%sql magic
%load_ext sql

# Standard Imports
import sqlite3
import pandas as pd

# Initialize a SQLite database connection
conn = sqlite3.connect('NBAPlayDB.db')

# Extract data from CSV files in the cloud
play_log_df = pd.read_csv("https://raw.githubusercontent.com/christopherhuntley/DATA6510/master/data/NBA/GamePlayLog_2019-10-22_NOL_TOR.csv")
play_facts_df = pd.read_csv("https://raw.githubusercontent.com/christopherhuntley/DATA6510/master/data/NBA/GamePlayFacts_2019-10-22_NOL_TOR.csv",index_col=0)

# Load data into the SQLite database
play_log_df.to_sql('PlayLog',conn,if_exists='replace',index_label="playLogID")
play_facts_df.to_sql('PlayFacts',conn,if_exists='replace',index_label="playFactID")

# Establish a %%sql magic connection to the database
%sql sqlite:///NBAPlayDB.db

'Connected: @NBAPlayDB.db'

---
## **Overview**

## **The Goal**

We will use SQL queries to reconstruct the box score shown below from the original play-by-play data. (If you don't know what a box score is then [this might help](https://jr.nba.com/how-to-read-a-box-score/).)

We will actually do this twice, each time with a different data source:
1. A statistician's log of every play, recorded in real time while the game was in progress. The log is what is known as **transaction** data; the emphasis is on recording events accurately as they are occurring.
2. A summary for each play in the log, generated after the game. The summaries are **analytical** data, crafted after the fact to simplify the work of the game analysts and fans. 

**As we shall see, there are significant differences between transaction data and analytical data. Also, we will see that the quality of the analytical data is totally dependent on the transaction data.** 

![Box Score](https://github.com/christopherhuntley/BUAN6510/raw/master/img/HW2_box_score.png)

## **What do we consider a valid box score?**

### Player Stats
At a minimum, the box score must include the following data for each player:
- `player`: Player Name
- `team`: Player Team
- `min`: Minutes Played
- `reb`: Total Rebounds
- `ast`: Assists
- `pts`: Points Scored

One can, of course, calculate lots of more advanced statistics like:
- `fta`: Free throws attempted
- `ftm`: Free throws made
- `2pa`: 2-point field goals attempted
- `2pm`: 2-point field goals made
- `3pa`: 3-point field goals attempted
- `3pm`: 3-point field goals made
- `blk`: blocked shots
- `fls`: fouls
- `+/-`: the net score difference while the player is in the game, normalized to 36 minutes of playing time. 

Note: We won't require these advanced statistics but there is no harm in trying if you are so inclined.
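
As a quick illustration of the "normalized to 36 minutes" idea behind `+/-`, here is a minimal sketch. The function name `plus_minus_per_36` and the sample numbers are hypothetical and are not taken from the game data; the sketch assumes the normalization is a simple proportional rescaling.

```python
def plus_minus_per_36(net_points: float, minutes_played: float) -> float:
    """Rescale a raw on-court point differential to a 36-minute basis.

    Assumption: normalization is proportional (net points per minute * 36).
    """
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return net_points * 36.0 / minutes_played


# Hypothetical player: team outscored the opponent by 6 in his 24 minutes.
example_value = plus_minus_per_36(6, 24)
print(example_value)  # 9.0
```

A player with very few minutes can end up with an extreme normalized value, which is one reason to report minutes alongside it.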

### Team Stats
In addition to the player data, we want the final score and the point totals for each quarter. 
 
## **Source Data**

The data is kept in Google Drive:
- https://docs.google.com/spreadsheets/d/1gVKRFBQdHOLUL5vx5GXOPi9eLqJI0uI2iAdGM2lMAIY/edit?usp=sharing
- https://docs.google.com/spreadsheets/d/1nknL5nkdvChlswGyNbNF2zSbONSY3EO8r7hB10sO2vI/edit?usp=sharing

When asked, open each file in Google Sheets. You might want to keep these spreadsheets open when you work out your SQL queries. 

In the database setup section we created the `NBAPlayDB.db` database (viewable in the file browser to your left) and then loaded each CSV as a separate table (`PlayLog` and `PlayFacts`). Rerun as needed if your session goes stale. 

The queries below show the first 10 rows of each table. 

In [None]:
%%sql
SELECT * FROM PlayLog LIMIT 10;

 * sqlite:///NBAPlayDB.db
Done.


playLogID,a1,a2,a3,a4,a5,h1,h2,h3,h4,h5,period,away_score,home_score,remaining_time,elapsed,play_length,play_id,team,player,event_type,reason,assist,away,home,block,entered,left,num,opponent,outof,points,possession,steal,shot_result,shot_distance,original_x,original_y,converted_x,converted_y,play_description
0,Jrue Holiday,Brandon Ingram,Derrick Favors,JJ Redick,Lonzo Ball,OG Anunoby,Pascal Siakam,Marc Gasol,Kyle Lowry,Fred VanVleet,1,0,0,0:12:00,0:00:00,0:00:00,2,,,start of period,,,,,,,,,,,,,,,,,,,,
1,Jrue Holiday,Brandon Ingram,Derrick Favors,JJ Redick,Lonzo Ball,OG Anunoby,Pascal Siakam,Marc Gasol,Kyle Lowry,Fred VanVleet,1,0,0,0:12:00,0:00:00,0:00:00,4,NOP,Marc Gasol,jump ball,,,Derrick Favors,Marc Gasol,,,,,,,,Lonzo Ball,,,,,,,,Jump Ball Gasol vs. Favors: Tip to Ball
2,Jrue Holiday,Brandon Ingram,Derrick Favors,JJ Redick,Lonzo Ball,OG Anunoby,Pascal Siakam,Marc Gasol,Kyle Lowry,Fred VanVleet,1,0,0,0:11:48,0:00:12,0:00:12,7,NOP,Lonzo Ball,miss,,,,,,,,,,,0.0,,,missed,11.0,2.0,114.0,24.8,16.4,MISS Ball 11' Driving Floating Jump Shot
3,Jrue Holiday,Brandon Ingram,Derrick Favors,JJ Redick,Lonzo Ball,OG Anunoby,Pascal Siakam,Marc Gasol,Kyle Lowry,Fred VanVleet,1,0,0,0:11:47,0:00:13,0:00:01,8,NOP,Derrick Favors,rebound,,,,,,,,,,,,,,,,,,,,Favors REBOUND (Off:1 Def:0)
4,Jrue Holiday,Brandon Ingram,Derrick Favors,JJ Redick,Lonzo Ball,OG Anunoby,Pascal Siakam,Marc Gasol,Kyle Lowry,Fred VanVleet,1,2,0,0:11:47,0:00:13,0:00:00,9,NOP,Derrick Favors,shot,,,,,,,,,,,2.0,,,made,1.0,0.0,-6.0,25.0,4.4,Favors 1' Tip Layup Shot (2 PTS)
5,Jrue Holiday,Brandon Ingram,Derrick Favors,JJ Redick,Lonzo Ball,OG Anunoby,Pascal Siakam,Marc Gasol,Kyle Lowry,Fred VanVleet,1,2,0,0:11:29,0:00:31,0:00:18,10,TOR,OG Anunoby,miss,,,,,,,,,,,0.0,,,missed,3.0,15.0,28.0,26.5,86.2,MISS Anunoby 3' Driving Layup
6,Jrue Holiday,Brandon Ingram,Derrick Favors,JJ Redick,Lonzo Ball,OG Anunoby,Pascal Siakam,Marc Gasol,Kyle Lowry,Fred VanVleet,1,2,0,0:11:25,0:00:35,0:00:04,11,NOP,JJ Redick,rebound,,,,,,,,,,,,,,,,,,,,Redick REBOUND (Off:0 Def:1)
7,Jrue Holiday,Brandon Ingram,Derrick Favors,JJ Redick,Lonzo Ball,OG Anunoby,Pascal Siakam,Marc Gasol,Kyle Lowry,Fred VanVleet,1,2,0,0:11:16,0:00:44,0:00:09,12,NOP,Jrue Holiday,miss,,,,,,,,,,,0.0,,,missed,8.0,81.0,-1.0,16.9,4.9,MISS Holiday 8' Driving Finger Roll Layup
8,Jrue Holiday,Brandon Ingram,Derrick Favors,JJ Redick,Lonzo Ball,OG Anunoby,Pascal Siakam,Marc Gasol,Kyle Lowry,Fred VanVleet,1,2,0,0:11:15,0:00:45,0:00:01,13,TOR,Fred VanVleet,rebound,,,,,,,,,,,,,,,,,,,,VanVleet REBOUND (Off:0 Def:1)
9,Jrue Holiday,Brandon Ingram,Derrick Favors,JJ Redick,Lonzo Ball,OG Anunoby,Pascal Siakam,Marc Gasol,Kyle Lowry,Fred VanVleet,1,2,0,0:11:11,0:00:49,0:00:04,14,TOR,Kyle Lowry,miss,,,,,,,,,,,0.0,,,missed,25.0,178.0,176.0,42.8,71.4,MISS Lowry 25' 3PT Running Pull-Up Jump Shot


In [None]:
%%sql
SELECT * FROM PlayFacts LIMIT 10;

## **Working with Transaction Data: The `PlayLog` Table**

The `PlayLog` table is designed to make it as simple and efficient as possible to *record* gameplay as a sequence of events. Each row records a *play* event logged in near-real time in the course of the game. The specifics of each event (who, what, when, where, and how) are captured in numerous columns, each with a very specific meaning: 

- `playLogID` (autogenerated by the database) and `play_id` (found in the raw source data) are unique to each play and can be treated as candidate keys or *indexes*. (We'll learn more about keys and indexes in Lesson 4.)
- `a1` - `a5` and `h1` - `h5` list which 5 players were on the court for the *away* team and the *home* team. Each player name is unique for a given season, by the way. If there are two players with the same name -- yes, this has happened! -- then the NBA stats department assigns each of the players a unique name. We'll treat the names like unique indexes for the players. 
- `event_type` indicates what kinds of statistics can be drawn from the play. A "shot" for example is a made shot, a "miss" is a missed shot, etc. 
- `player` records which player was the *subject* (initiator) of the event. If the `event_type`="rebound" then the `player` is the one credited with the rebound.  
- `opponent` is used in the event of a foul to indicate *who* on the opposing team was fouled.  
- `possession`, `steal`, and `block` work like `opponent` in that they indicate another player involved in the play. If the play is a turnover, then the player that caused the turnover would appear in one of these columns. 
- `period` (quarter), `remaining_time` (in the period), and `elapsed` (time in the period) record the approximate time of the event within the game. For example, `period`=1 and `remaining_time`="0:09:58" mean the event occurred in the first quarter with 9 minutes and 58 seconds left on the game clock. 
- `shot_result`, `shot_distance`, `original_x`, `original_y`, `converted_x`, and `converted_y` are used to "map" the location of made and missed shots throughout a game. 
- `play_description` is the text that would appear in the play-by-play log in an app like ESPN Gamecast. 
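
To see how columns like these combine in a query, here is a minimal sketch using a toy in-memory table. The table `ToyLog`, its rows, and the names `Player A` / `Player B` are all hypothetical; only the column names mirror `PlayLog`. It filters on `event_type` and shows that `SUM` skips the nulls that non-shot events leave in `points`.

```python
import sqlite3

# Toy stand-in for PlayLog (hypothetical rows, not real game data)
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE ToyLog (play_id INTEGER, team TEXT, player TEXT, "
    "event_type TEXT, points REAL)"
)
conn_demo.executemany(
    "INSERT INTO ToyLog VALUES (?, ?, ?, ?, ?)",
    [
        (1, "AAA", "Player A", "miss", 0.0),
        (2, "BBB", "Player B", "rebound", None),  # no points on a rebound
        (3, "BBB", "Player B", "shot", 2.0),
        (4, "AAA", "Player A", "rebound", None),
        (5, "AAA", "Player A", "rebound", None),
    ],
)

# Count one kind of event per player
rebounds = conn_demo.execute(
    "SELECT player, COUNT(*) FROM ToyLog "
    "WHERE event_type = 'rebound' GROUP BY player ORDER BY player"
).fetchall()
print(rebounds)  # [('Player A', 2), ('Player B', 1)]

# SUM ignores NULL values, so no filter is needed here
points = conn_demo.execute(
    "SELECT player, SUM(points) FROM ToyLog GROUP BY player ORDER BY player"
).fetchall()
print(points)  # [('Player A', 0.0), ('Player B', 2.0)]
```

The same statements work in a `%%sql` cell once the Python wrapper is removed.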





Below each task in **bold** is a code cell. Write and run the SQL query for the task. **Don't forget the `%%sql` magic at the top of the cells.** 

You can check your work by consulting [ESPN's box score](https://www.espn.com/nba/boxscore?gameId=401160623).


### **1. Calculate the total rebounds for the player "Marc Gasol".**

Hint: You will need to filter based on the `event_type`.

### **2. Calculate the total assists for the player "Marc Gasol".**
Hint: Assists are tracked in their own column. 

### **3. Calculate the total points scored for the player "Marc Gasol".**
Hint: Like assists, points are also recorded in their own column.

### **4. Calculate the total points for each player. Sort the results by team and player name.**
Hint: 
- You'll need to use `GROUP BY` this time. 

### **5. Calculate the total points for each team. List the away team (NOP) before the home team (TOR).**
Hint: 
- This one should be pretty easy, except for (possibly) the sorting. 
- (optional) For an extra challenge try to determine which team is home and which is away *just from the data*.  

### **6. Calculate the total points for each team by period.**
Hints:
- Use `IS NOT NULL` to eliminate null values (`None`).
- Don't forget to list the team and the period in the results!

### **7. (optional) Calculate the total minutes for the player "Marc Gasol".**

Hint: This one is really, really hard. You will need to consult the SQLite docs to handle elapsed time *because SQLite stores times as text.* (See section 2.2 of the [Data Types](https://www.sqlite.org/datatype3.html) docs. Then expect to spend a while in the Date and Time functions page.)
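
One possible way to turn a text clock value into a number is sketched below. It assumes the values look like `0:11:48` (a single-digit hour, as in the sample rows above), so a leading `0` is prepended to get the `HH:MM:SS` form that SQLite's time functions accept. This is only one approach, not necessarily the intended one, and it covers the conversion step only.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")

# strftime('%s', ...) returns seconds since the Unix epoch; a time-only
# string is treated as a time on 2000-01-01, so subtracting midnight of
# that same day leaves just the seconds on the clock.
query = """
SELECT strftime('%s', '0' || :clock) - strftime('%s', '00:00:00') AS secs
"""
secs = conn_demo.execute(query, {"clock": "0:11:48"}).fetchone()[0]
print(secs)  # 708
```

If a clock value ever had a two-digit hour, the padding step would produce an invalid time string and the result would be null, so check the data before relying on it.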

## **Working with Analytical Data: The `PlayFacts` Table**

`PlayFacts` summarizes the data in the game log. Each play in the log is summarized twice, once from the perspective of the away team (NOP) and then again from the perspective of the home team (TOR). Depending on the play, one row of the log may correspond to several rows of facts in the `PlayFacts` table. For example, if a player scores a basket with an assist (pass) from another player, then that is two facts that have to be recorded. 

Another key difference (that's a pun, by the way) is that the `PlayFacts` table is designed to summarize every play in every game ever played. Thus it includes contextual facts that are not in the original game log data. **The complete dataset for the 2019-20 season is over a million rows.** (However, to keep things simple, we will only include the play facts for the NOP vs TOR game.)

The summaries are organized into three sets of columns: 
- **Information about the *game***: 
  - `season`,	`year`,	`date`
- **Information about the *play***: 
  - `team`: which team gets "credit" for the stat
  - `opp_team`: the opposing team
  - `period`, `remaining_time`, `elapsed` (time), ... : same as in `PlayLog`
  - `lineup`: a list of players as a text string instead of five columns (so it can be searched); players always appear in alphabetical order
  - `segment_id`: each segment represents a sequence of plays in which there were no player substitutions for either team  
  - `event_type` and `player`: same as in `PlayLog` *except* it now includes `assist`, `block`, and `steal` events; note that we could capture more event types if we like (e.g., different kinds of fouls) without adding any columns to the table. 
- **Calculated facts (stats) that can be counted and summed**:
  - `+points`, `+assists`, ... : the stats taken from the perspective of the team in the `team` column; if `team`=`TOR` then `+points` are how many points `TOR` scored on the play
  - `-points`, `-assists`, ... : the stats for the opposing team; if `team`=`TOR` then `-points` are those scored by `NOP`
  - `play_length_secs`, `play_length_mins`: the elapsed time since the previous play in seconds and minutes; note that these columns are numeric, not text, so they can be summed. 
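
Column names that start with `+` or `-` are not ordinary identifiers, so they have to be quoted in a query. Here is a minimal sketch using a toy table; `ToyFacts` and its rows are hypothetical and far simpler than the real `PlayFacts` table.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
# Double quotes mark "+points" and "-points" as identifiers, not arithmetic
conn_demo.execute(
    'CREATE TABLE ToyFacts (team TEXT, player TEXT, "+points" REAL, "-points" REAL)'
)
conn_demo.executemany(
    "INSERT INTO ToyFacts VALUES (?, ?, ?, ?)",
    [
        ("AAA", "Player A", 2.0, 0.0),
        ("BBB", "Player B", 0.0, 2.0),
        ("AAA", "Player A", 3.0, 0.0),
        ("BBB", "Player B", 0.0, 3.0),
    ],
)

totals = conn_demo.execute(
    'SELECT team, SUM("+points") AS points, SUM("-points") AS opp_points '
    "FROM ToyFacts GROUP BY team ORDER BY team"
).fetchall()
print(totals)  # [('AAA', 5.0, 0.0), ('BBB', 0.0, 5.0)]
```

Without the double quotes, SQLite would try to parse `+points` as a unary plus applied to a column named `points`, which does not exist in this table.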




### **8. Calculate the points, rebounds, and assists for Marc Gasol.**
Hint: This is a single query! Use the '+' stats.

### **9. Calculate the total minutes for Marc Gasol.**
Hint: Look for "Marc Gasol" in the `lineup` column using `LIKE` or the `instr()` function. 
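
The two search options in the hint can be compared on a toy table. In this sketch `ToySegments`, the comma-separated lineup format, and the player names are assumptions made for illustration; check the real `lineup` values before copying the pattern.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE ToySegments (lineup TEXT, play_length_mins REAL)")
conn_demo.executemany(
    "INSERT INTO ToySegments VALUES (?, ?)",
    [
        ("Player A, Player B, Player C", 1.5),
        ("Player B, Player C, Player D", 2.0),
        ("Player A, Player C, Player D", 0.5),
    ],
)

# Option 1: pattern match with LIKE (% matches any run of characters)
like_total = conn_demo.execute(
    "SELECT SUM(play_length_mins) FROM ToySegments "
    "WHERE lineup LIKE '%Player A%'"
).fetchone()[0]

# Option 2: instr() returns the 1-based position of the substring, or 0
instr_total = conn_demo.execute(
    "SELECT SUM(play_length_mins) FROM ToySegments "
    "WHERE instr(lineup, 'Player A') > 0"
).fetchone()[0]

print(like_total, instr_total)  # 2.0 2.0
```

By default `LIKE` is case-insensitive for ASCII letters in SQLite while `instr()` is case-sensitive. Also, since each play is summarized once per team, it is worth checking that a filter on the real table does not count the same play twice.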

### **10. Calculate the points, rebounds, assists, blocks, steals, turnovers, and fouls for each team.**
Hint: It's easier than it looks.

### **11. Calculate the points, rebounds, assists, free throws attempted, free throws made, 2pt field goals attempted, 2pt field goals made, 3pt field goals attempted, 3pt field goals made for every player. Sort the results by team and player name.**
Hints: 
- For column aliases, use the column names from the table without the plus and minus signs, e.g., '+points' $\rightarrow$ 'points'

### **12. (optional; uses multiple tables/views) Calculate the minutes, points, rebounds, and assists for every player. Sort the results by team and player name.**
Hint: This one will likely require a subquery to merge the minutes with the other stats. If you come up with another way, then please post it in the class Slack channel. 
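
The general shape of "merge two summaries with subqueries" can be sketched on toy tables. Everything here (`ToyStats`, `ToyMinutes`, the rows) is hypothetical; in the homework both summaries would come from `PlayFacts`, and the minutes summary would need its own filter.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript(
    """
    CREATE TABLE ToyStats (player TEXT, points REAL);
    CREATE TABLE ToyMinutes (player TEXT, minutes REAL);
    INSERT INTO ToyStats VALUES ('Player A', 2), ('Player A', 3), ('Player B', 1);
    INSERT INTO ToyMinutes VALUES ('Player A', 10.0), ('Player A', 5.5), ('Player B', 7.0);
    """
)

# Each subquery aggregates on its own; the join then lines them up by player.
merged = conn_demo.execute(
    """
    SELECT s.player, m.minutes, s.points
    FROM (SELECT player, SUM(points) AS points
          FROM ToyStats GROUP BY player) AS s
    JOIN (SELECT player, SUM(minutes) AS minutes
          FROM ToyMinutes GROUP BY player) AS m
      ON m.player = s.player
    ORDER BY s.player
    """
).fetchall()
print(merged)  # [('Player A', 15.5, 5.0), ('Player B', 7.0, 1.0)]
```

Aggregating before joining avoids the row multiplication that happens when two detail tables are joined first and summed afterwards.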

## **Discussion: What did we learn?**

* Which queries did you find to be easier? Why?

* How are the `PlayLog` and `PlayFacts` tables different?

* Are there any assumptions about the data in the `PlayLog` table that we relied on to create the `PlayFacts` table? In other words, where might bugs in the `PlayLog` table cause bugs in the `PlayFacts` table?

* If you had to generate the `PlayFacts` table from the `PlayLog` data, how would *you* approach it?

* Have you ever had to do anything similar to the kinds of queries in this homework? If so, how did you do it?

## **On your way out ... Be sure to save your work.**
In Google Drive, drag this notebook file into your `DATA6510` folder so you can find it next time.