## Mod 6 Live Session

### H. Diana McSpadden (hdm5s)

In [1]:
import numpy as np
import pandas as pd
import psycopg2
from sqlalchemy import create_engine, text # don't need until the end of the lab
import dotenv
import os

In [2]:
nba = pd.read_csv('ASA_All_NBA_Raw_Data.csv') # load the data

In [3]:
dotenv.load_dotenv('mod6.env')
postgrespassword = os.getenv('postgrespassword')

In [4]:
nba.columns = [c.lower() for c in nba.columns] # lowercase the names: postgres requires quoting for capitals or spaces

In [5]:
pd.set_option('display.max_rows', 81)
nba.head(1).T

Unnamed: 0,0
game_id,202202170BRK
game_date,2022-02-17
ot,0
h_a,A
team_abbrev,WAS
team_score,117
team_pace,94.5
team_efg_pct,0.627
team_tov_pct,13.5
team_orb_pct,22.9


## Database Normalization
### First normal form:

1. **All tables must have a primary key**: In this table, `game_id` and `player_id` together are unique on every row, so they form the primary key.

2. **All the data must be atomic**: `inactives` is non-atomic, because a single cell lists several players.

3. **No repeating groups problem**: We can't solve the non-atomicity problem by creating separate columns if this leads to arbitrary ordering language in the column names (for example, `Inactive1`, `Inactive2`, etc.) and to a lot of missing data (there would be an `Inactive7` that is missing any time a team has fewer than 7 inactive players).
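
The primary-key claim in item 1 can be checked directly in pandas. Here is a minimal sketch on a toy frame. The rows are invented for illustration, and `is_primary_key` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

def is_primary_key(df, cols):
    """True if `cols` have no missing values and no repeated combinations."""
    no_missing = df[cols].notna().all().all()
    no_repeats = not df.duplicated(subset=cols).any()
    return bool(no_missing and no_repeats)

# Toy stand-in for the player-game data (values invented for illustration)
toy = pd.DataFrame({
    'game_id':   ['G1', 'G1', 'G2', 'G2'],
    'player_id': ['p1', 'p2', 'p1', 'p2'],
    'pts':       [10, 22, 7, 15],
})

print(is_primary_key(toy, ['game_id']))               # False: each game has many rows
print(is_primary_key(toy, ['game_id', 'player_id']))  # True: one row per player per game
```

On the real data the equivalent call would be `is_primary_key(nba, ['game_id', 'player_id'])`.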

Our solution here is to cheat: because we have `is_inactive` for each player, `inactives` is redundant. So we can just delete `inactives`.

If we did not have `is_inactive`, we would have to create a new table called `Inactive` with two columns: `game_id` and `inactive_player_id`, which contains the player ID for each inactive player for each team in each game. Each inactive player in each game would get a new row. We wouldn't need to include team in this table because we could get the information about a player's team by joining this data to the player-game table. (A player plays for one team in one game, so if we know the game and player, we can look up the player's team.)
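
The separate-table approach described above could be sketched as follows. This assumes, hypothetically, that the non-atomic column holds a comma-separated string of player IDs. The actual format of `inactives` in the raw file may differ, and the toy values are invented.

```python
import pandas as pd

# Toy data: one row per team per game, with a non-atomic 'inactives' column.
# The comma-separated format and the IDs are assumptions for illustration.
toy = pd.DataFrame({
    'game_id':   ['G1', 'G1', 'G2'],
    'inactives': ['p7, p8', 'p9', None],
})

inactive = (
    toy.assign(inactive_player_id=toy['inactives'].str.split(','))
       .explode('inactive_player_id', ignore_index=True)  # one row per listed player
       .dropna(subset=['inactive_player_id'])             # teams with nobody inactive
       .assign(inactive_player_id=lambda d: d['inactive_player_id'].str.strip())
       [['game_id', 'inactive_player_id']]
       .drop_duplicates()
       .reset_index(drop=True)
)
print(inactive)
```

Each inactive player ends up on a separate row, so there is no need for numbered columns or padding with missing values.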

In [6]:
nba = nba.drop(['inactives'], axis=1) # drop inactives BECAUSE they each have their own row with is_inactive = 1

### Functional Dependence
Let X and Y be columns in a data table. Y is functionally dependent on X if each value of X is associated with exactly one value of Y.

That's pretty abstract. So here are some guidelines that help me:

1. This use of "function" is the exact same as the concept of a function from algebra and pre-calculus. A correspondence f(x)=y is a function if each value of x has only one associated value of y.

2. X is either a primary key, or something that should be a primary key in another table.

For example, `game_date` (Y) is functionally dependent on `game_id` (X) because one `game_id` takes place on exactly one date.
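
Functional dependence can also be tested empirically: group by X and count the distinct values of Y in each group. This is a minimal sketch on invented toy rows. `depends_on` is a hypothetical helper name, and passing the check on a sample only shows the data are consistent with the dependence, not that it must hold in general.

```python
import pandas as pd

def depends_on(df, x, y):
    """True if each value (or combination) of `x` maps to at most one value of `y`."""
    return bool((df.groupby(x)[y].nunique(dropna=False) <= 1).all())

toy = pd.DataFrame({
    'game_id':   ['G1', 'G1', 'G2', 'G2'],
    'player_id': ['p1', 'p2', 'p1', 'p2'],
    'game_date': ['2022-02-17', '2022-02-17', '2022-02-18', '2022-02-18'],
    'pts':       [10, 22, 7, 15],
})

print(depends_on(toy, 'game_id', 'game_date'))           # True: one date per game
print(depends_on(toy, 'game_id', 'pts'))                 # False: pts varies within a game
print(depends_on(toy, ['game_id', 'player_id'], 'pts'))  # True: depends on the full key
```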

### Second normal form:
In this table the primary key is a composite key consisting of two columns: `game_id` and `player_id`.

2NF is violated if any column is functionally dependent on part of the primary key but not on the entire primary key. This can only happen if the primary key is composite.

Here there are three columns that depend on `game_id` but not on `player_id`: `game_date`, `ot`, and `season`. We satisfy 2NF by moving these columns to a new table.

There is also one column, `player`, that depends on `player_id` but not `game_id`. We create a new table here as well.

In [7]:
games = nba[['game_id', 'game_date', 'ot', 'season']].drop_duplicates()
players = nba[['player_id', 'player']].drop_duplicates()

In [8]:
games

Unnamed: 0,game_id,game_date,ot,season
0,202202170BRK,2022-02-17,0,2022
26,202202170CHO,2022-02-17,2,2022
48,202202170LAC,2022-02-17,0,2022
71,202202170MIL,2022-02-17,0,2022
95,202202170NOP,2022-02-17,0,2022
...,...,...,...,...
108259,202001080GSW,2020-01-08,0,2020
108887,202008020HOU,2020-08-02,0,2020
109683,201911060HOU,2019-11-06,0,2020
110125,201912250GSW,2019-12-25,0,2020


In [9]:
players

Unnamed: 0,player_id,player
0,kispeco01,Corey Kispert
1,kuzmaky01,Kyle Kuzma
2,caldwke01,Kentavious Caldwell-Pope
3,netora01,Raul Neto
4,bryanth01,Thomas Bryant
...,...,...
109702,frazimi01,Michael Frazier
110441,howarwi01,William Howard
110913,mbahalu01,Luc Mbah a Moute
111399,bowmaky01,Ky Bowman


In [10]:
nba

Unnamed: 0,game_id,game_date,ot,h_a,team_abbrev,team_score,team_pace,team_efg_pct,team_tov_pct,team_orb_pct,...,pf_per_minute,ts,last_60_minutes_per_game_starting,last_60_minutes_per_game_bench,pg%,sg%,sf%,pf%,c%,active_position_minutes
0,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.061538,9.00,31.716667,22.017778,1.0,36.0,60.0,4.0,0.0,46.253586
1,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.099119,7.44,34.324000,18.475954,0.0,0.0,4.0,85.0,11.0,52.152590
2,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.000000,7.00,29.820290,16.051693,0.0,32.0,67.0,0.0,0.0,47.021807
3,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.048387,7.88,29.920833,14.603922,90.0,10.0,0.0,0.0,0.0,27.603314
4,202202170BRK,2022-02-17,0,A,WAS,117,94.5,0.627,13.5,22.9,...,0.000000,6.88,20.095833,14.538095,0.0,0.0,0.0,0.0,100.0,36.472537
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112118,202003070GSW,2020-03-07,0,H,GSW,118,90.9,0.606,7.0,18.9,...,0.107914,13.08,33.110667,19.232562,0.0,2.0,77.0,21.0,0.0,57.207786
112119,202003070GSW,2020-03-07,0,H,GSW,118,90.9,0.606,7.0,18.9,...,0.036079,6.00,25.470833,20.228571,5.0,45.0,43.0,7.0,0.0,58.202391
112120,202003070GSW,2020-03-07,0,H,GSW,118,90.9,0.606,7.0,18.9,...,0.150943,4.00,24.083333,13.228788,0.0,0.0,0.0,9.0,91.0,49.630640
112121,202003070GSW,2020-03-07,0,H,GSW,118,90.9,0.606,7.0,18.9,...,0.094340,12.64,34.783333,27.691667,0.0,44.0,48.0,8.0,0.0,58.923515


### Third normal form:
3NF is violated if there are "transitive dependencies", that is, functional dependence between columns when neither column is part of the primary key.

In the main dataframe, `game_id` and `player_id` make up the primary key. But many of the columns are determined by `team_abbrev` together with `game_id`, and `team_abbrev` is not part of the primary key. We pull these columns out and create a new table:

In [11]:
team_game = nba[['game_id', 'team_abbrev', 'h_a','team_score','team_pace', 
                 'team_efg_pct','team_tov_pct','team_orb_pct',
                 'team_ft_rate','team_off_rtg']].drop_duplicates()
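
# Why this works, sketched on a small toy frame (invented values; the `toy` names are hypothetical):

```python
import pandas as pd

# Toy player-game rows: team_score repeats for every player on a team.
toy = pd.DataFrame({
    'game_id':     ['G1', 'G1', 'G1', 'G1'],
    'player_id':   ['p1', 'p2', 'p3', 'p4'],
    'team_abbrev': ['WAS', 'WAS', 'BRK', 'BRK'],
    'team_score':  [117, 117, 103, 103],
    'pts':         [10, 22, 7, 15],
})

# team_score is determined by (game_id, team_abbrev), not by the player.
per_team = toy.groupby(['game_id', 'team_abbrev'])['team_score'].nunique()
print((per_team == 1).all())

# Pulling the team-level columns out leaves one row per team per game.
toy_team_game = toy[['game_id', 'team_abbrev', 'team_score']].drop_duplicates()
print(toy_team_game)
```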

In [12]:
nba.columns

Index(['game_id', 'game_date', 'ot', 'h_a', 'team_abbrev', 'team_score',
       'team_pace', 'team_efg_pct', 'team_tov_pct', 'team_orb_pct',
       'team_ft_rate', 'team_off_rtg', 'opponent_abbrev', 'opponent_score',
       'opponent_pace', 'opponent_efg_pct', 'opponent_tov_pct',
       'opponent_orb_pct', 'opponent_ft_rate', 'opponent_off_rtg', 'player',
       'player_id', 'starter', 'mp', 'fg', 'fga', 'fg_pct', 'fg3', 'fg3a',
       'fg3_pct', 'ft', 'fta', 'ft_pct', 'orb', 'drb', 'trb', 'ast', 'stl',
       'blk', 'tov', 'pf', 'pts', 'plus_minus', 'did_not_play', 'is_inactive',
       'ts_pct', 'efg_pct', 'fg3a_per_fga_pct', 'fta_per_fga_pct', 'orb_pct',
       'drb_pct', 'trb_pct', 'ast_pct', 'stl_pct', 'blk_pct', 'tov_pct',
       'usg_pct', 'off_rtg', 'def_rtg', 'bpm', 'season', 'minutes',
       'double_double', 'triple_double', 'dkp', 'fdp', 'sdp', 'dkp_per_minute',
       'fdp_per_minute', 'sdp_per_minute', 'pf_per_minute', 'ts',
       'last_60_minutes_per_game_starting', 'la

In [15]:
player_game = nba[['game_id', 'player_id', 'starter', 'mp', 'fg', 'fga', 'fg_pct', 'fg3', 'fg3a',
       'fg3_pct', 'ft', 'fta', 'ft_pct', 'orb', 'drb', 'trb', 'ast', 'stl',
       'blk', 'tov', 'pf', 'pts', 'plus_minus', 'did_not_play', 'is_inactive',
       'ts_pct', 'efg_pct', 'fg3a_per_fga_pct', 'fta_per_fga_pct', 'orb_pct',
       'drb_pct', 'trb_pct', 'ast_pct', 'stl_pct', 'blk_pct', 'tov_pct',
       'usg_pct', 'off_rtg', 'def_rtg', 'bpm', 'minutes', # 'season' depends only on game_id, so it lives in games
       'double_double', 'triple_double', 'dkp', 'fdp', 'sdp', 'dkp_per_minute',
       'fdp_per_minute', 'sdp_per_minute', 'pf_per_minute', 'ts',
       'last_60_minutes_per_game_starting', 'last_60_minutes_per_game_bench',
       'pg%', 'sg%', 'sf%', 'pf%', 'c%', 'active_position_minutes']] # one row per player per game

The result is 4 tables:
* `games`
* `players`
* `team_game`
* `player_game`
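
Splitting the data does not lose the game- and player-level details, because they can be recovered with joins on the keys. This is a sketch on toy versions of three of the tables. The values and the `toy_` names are invented for illustration.

```python
import pandas as pd

# Toy versions of three of the tables (values invented for illustration)
toy_games = pd.DataFrame({'game_id': ['G1', 'G2'],
                          'game_date': ['2022-02-17', '2022-02-18']})
toy_players = pd.DataFrame({'player_id': ['p1', 'p2'],
                            'player': ['Player One', 'Player Two']})
toy_player_game = pd.DataFrame({'game_id': ['G1', 'G1', 'G2'],
                                'player_id': ['p1', 'p2', 'p1'],
                                'pts': [10, 22, 7]})

# validate='many_to_one' raises if the right-hand key is not unique,
# which doubles as a check that each lookup table has a proper primary key.
wide = (toy_player_game
        .merge(toy_games, on='game_id', how='left', validate='many_to_one')
        .merge(toy_players, on='player_id', how='left', validate='many_to_one'))
print(wide)
```

The merged frame has the same number of rows as the player-game table, with `game_date` and `player` filled back in.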

In [16]:
games

Unnamed: 0,game_id,game_date,ot,season
0,202202170BRK,2022-02-17,0,2022
26,202202170CHO,2022-02-17,2,2022
48,202202170LAC,2022-02-17,0,2022
71,202202170MIL,2022-02-17,0,2022
95,202202170NOP,2022-02-17,0,2022
...,...,...,...,...
108259,202001080GSW,2020-01-08,0,2020
108887,202008020HOU,2020-08-02,0,2020
109683,201911060HOU,2019-11-06,0,2020
110125,201912250GSW,2019-12-25,0,2020


In [17]:
players

Unnamed: 0,player_id,player
0,kispeco01,Corey Kispert
1,kuzmaky01,Kyle Kuzma
2,caldwke01,Kentavious Caldwell-Pope
3,netora01,Raul Neto
4,bryanth01,Thomas Bryant
...,...,...
109702,frazimi01,Michael Frazier
110441,howarwi01,William Howard
110913,mbahalu01,Luc Mbah a Moute
111399,bowmaky01,Ky Bowman


In [18]:
player_game

Unnamed: 0,game_id,player_id,starter,mp,fg,fga,fg_pct,fg3,fg3a,fg3_pct,...,pf_per_minute,ts,last_60_minutes_per_game_starting,last_60_minutes_per_game_bench,pg%,sg%,sf%,pf%,c%,active_position_minutes
0,202202170BRK,kispeco01,1,32:30,6,9,0.667,4,6,0.667,...,0.061538,9.00,31.716667,22.017778,1.0,36.0,60.0,4.0,0.0,46.253586
1,202202170BRK,kuzmaky01,1,30:16,2,7,0.286,0,3,0.000,...,0.099119,7.44,34.324000,18.475954,0.0,0.0,4.0,85.0,11.0,52.152590
2,202202170BRK,caldwke01,1,25:26,3,7,0.429,1,3,0.333,...,0.000000,7.00,29.820290,16.051693,0.0,32.0,67.0,0.0,0.0,47.021807
3,202202170BRK,netora01,1,20:40,5,7,0.714,1,1,1.000,...,0.048387,7.88,29.920833,14.603922,90.0,10.0,0.0,0.0,0.0,27.603314
4,202202170BRK,bryanth01,1,14:04,5,6,0.833,0,1,0.000,...,0.000000,6.88,20.095833,14.538095,0.0,0.0,0.0,0.0,100.0,36.472537
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112118,202003070GSW,wiggian01,1,37:04:00,3,10,0.300,0,0,0.000,...,0.107914,13.08,33.110667,19.232562,0.0,2.0,77.0,21.0,0.0,57.207786
112119,202003070GSW,toscaju01,1,27:43:00,3,6,0.500,0,2,0.000,...,0.036079,6.00,25.470833,20.228571,5.0,45.0,43.0,7.0,0.0,58.202391
112120,202003070GSW,bendedr01,0,13:15,4,4,1.000,2,2,1.000,...,0.150943,4.00,24.083333,13.228788,0.0,0.0,0.0,9.0,91.0,49.630640
112121,202003070GSW,muldemy01,1,31:48:00,5,10,0.500,3,7,0.429,...,0.094340,12.64,34.783333,27.691667,0.0,44.0,48.0,8.0,0.0,58.923515


In [19]:
# Need: python -m pip install --upgrade 'sqlalchemy<2.0'

In [20]:
team_game

Unnamed: 0,game_id,team_abbrev,h_a,team_score,team_pace,team_efg_pct,team_tov_pct,team_orb_pct,team_ft_rate,team_off_rtg
0,202202170BRK,WAS,A,117,94.5,0.627,13.5,22.9,0.157,123.8
13,202202170BRK,BRK,H,103,94.5,0.483,13.1,33.3,0.191,109.0
26,202202170CHO,MIA,A,111,88.8,0.471,11.1,26.8,0.147,103.4
37,202202170CHO,CHO,H,107,88.8,0.453,13.6,28.1,0.221,99.7
48,202202170LAC,HOU,A,111,103.7,0.533,15.3,24.0,0.154,107.1
...,...,...,...,...,...,...,...,...,...,...
112068,202002270GSW,GSW,H,86,104.8,0.481,23.6,12.2,0.113,82.1
112079,202002290PHO,GSW,A,115,98.6,0.523,9.0,28.9,0.276,116.6
112090,202003010GSW,GSW,H,110,100.2,0.522,17.4,38.3,0.191,109.8
112101,202003030DEN,GSW,A,116,94.4,0.622,10.7,12.8,0.171,122.9


## Postgres!

In [21]:
dbserver = psycopg2.connect(
    user='postgres', 
    password=postgrespassword, 
    host="localhost"
)
dbserver.autocommit = True

In [22]:
cursor = dbserver.cursor()

In [23]:
# Start from a fresh database: drop any existing copy, then create it
cursor.execute("DROP DATABASE IF EXISTS nbadb")
cursor.execute("CREATE DATABASE nbadb")

In [24]:
user = "postgres"
pw = postgrespassword
db = "nbadb"
engine = create_engine(f"postgresql+psycopg2://{user}:{pw}@localhost/{db}")

In [25]:
player_game.to_sql('player_game', con=engine, index=False, chunksize=1000, if_exists='replace')
team_game.to_sql('team_game', con=engine, index=False, chunksize=1000, if_exists='replace')
games.to_sql('games', con=engine, index=False, chunksize=1000, if_exists='replace')
players.to_sql('players', con=engine, index=False, chunksize=1000, if_exists='replace')

812
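
One thing `to_sql` does not do is declare primary keys, so the database will not enforce the keys identified during normalization. One way to get that enforcement is to create the table with its key first and then append the rows. The sketch below uses an in-memory SQLite database as a stand-in for the Postgres server (an assumption made so it runs without a server), and the toy rows are invented.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the Postgres server so the sketch runs anywhere.
con = sqlite3.connect(':memory:')

# Declare the table, including its primary key, before loading any rows.
con.execute("""
    CREATE TABLE players (
        player_id TEXT PRIMARY KEY,
        player    TEXT
    )
""")

toy_players = pd.DataFrame({'player_id': ['p1', 'p2'],
                            'player': ['Player One', 'Player Two']})

# if_exists='append' keeps the declared schema instead of replacing the table.
toy_players.to_sql('players', con=con, index=False, if_exists='append')

# A repeated key is now rejected by the database itself.
try:
    con.execute("INSERT INTO players VALUES ('p1', 'Duplicate')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

In Postgres, a key can also be added after loading, for example with `ALTER TABLE players ADD PRIMARY KEY (player_id)`.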

## Yay, we have a database

In [26]:
myquery = """
SELECT * FROM team_game WHERE team_abbrev = 'CLE'
"""

# getting all the Cavs games

pd.read_sql_query(myquery, con=engine)

Unnamed: 0,game_id,team_abbrev,h_a,team_score,team_pace,team_efg_pct,team_tov_pct,team_orb_pct,team_ft_rate,team_off_rtg
0,202202150ATL,CLE,A,116,91.8,0.602,12.2,33.3,0.114,126.4
1,202202120PHI,CLE,A,93,90.5,0.519,15.2,13.5,0.169,102.8
2,202202040CHO,CLE,A,102,93.6,0.478,10.8,28.8,0.178,109.0
3,202202110IND,CLE,A,120,100.7,0.586,12.1,17.1,0.309,119.1
4,202202090CLE,CLE,H,105,96.1,0.530,12.3,24.4,0.220,109.2
...,...,...,...,...,...,...,...,...,...,...
190,201912060CLE,CLE,H,87,93.7,0.482,15.5,23.3,0.084,92.9
191,201912280MIN,CLE,A,94,103.1,0.442,23.2,35.0,0.338,91.2
192,202001140LAC,CLE,A,103,99.5,0.495,11.8,17.0,0.143,103.5
193,202002120CLE,CLE,H,127,102.0,0.564,13.9,37.0,0.223,124.5


Yay!