<img src='https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fscoreboredsports.com%2Fwp-content%2Fuploads%2F2016%2F06%2FClippers-logo.png&f=1&nofb=1' style='height: 128px; float:right'/>

# LA Clippers Lineup Analysis

Author: Alex Nakagawa

Last Updated: May 18, 2020

<div class='alert alert-info'>
<b>ALERT</b>: Please ensure that you have switched to a kernel that contains the environment variables listed below.
</div>

In [120]:
import json
import glob
import re
import os
import sys

import psycopg2
import sqlalchemy
import pandas as pd
import numpy as np

assert psycopg2.__version__ == '2.8.5 (dt dec pq3 ext lo64)'
assert sqlalchemy.__version__ == '1.3.16'
assert pd.__version__ == '1.0.3'
assert np.__version__ == '1.18.2'

In [94]:
# Environment variables
DATABASE_HOST = os.environ.get('DATABASE_HOST')
POSTGRES_USER = os.environ.get('POSTGRES_USER')
POSTGRES_PASSWORD = os.environ.get('POSTGRES_PASSWORD') 
DATABASE_NAME = os.environ.get('DATABASE_NAME')

# Ensure that none of the environment variables is missing or empty
assert DATABASE_HOST and POSTGRES_USER and POSTGRES_PASSWORD and DATABASE_NAME

In [95]:
assert sys.version_info[0] == 3

In [96]:
!python3 --version

Python 3.8.1


## 1. Table Creation/Update in PostgreSQL

a. Write code to transfer the files from a directory called `dev_test_data` to a SQL database called `lac_dev_lineups` (code can be Python, SQL, etc.)

i. The tables created should be named `team`, `player`, `game_schedule`, and `lineup`

ii. Make sure your code creates tables if needed and that it can handle data reloads, merges, and/or updates


### Determining Schema

The following code builds a Python dictionary from the JSON files found in a given `directory` (here, `dev_test_data`). Each key is a file name without its `.json` extension, and each value is the parsed content of that file.

In [97]:
# Determining the schema needed to design the database

def create_data_map(directory):
    '''Map each JSON file name in `directory` (without ".json") to its parsed contents.'''
    data_file_paths = glob.glob(os.path.join(directory, "*.json"))
    data = {} # Dictionary of python-encoded json objects
    for f in data_file_paths:
        # Derive the key from the file name itself instead of a regex on the path,
        # so path separators and regex metacharacters cannot break the match
        subfile_name = os.path.splitext(os.path.basename(f))[0]
        with open(f, "r") as read_file:
            data[subfile_name] = json.load(read_file)
    return data

In [98]:
data = create_data_map('dev_test_data')
assert data # Make sure data is not empty
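
Before settling on a schema, it can help to list which keys each file carries. The following is only a sketch: `summarize_keys` is a hypothetical helper that is not part of this notebook, the sample records are placeholders shaped like the output of `create_data_map`, and it assumes every file holds a list of flat JSON objects.

```python
def summarize_keys(data):
    """Return {file_name: sorted keys of its first record} for a data map."""
    summary = {}
    for name, records in data.items():
        # Assumption: each JSON file is a list of dicts; an empty list yields no keys.
        first = records[0] if records else {}
        summary[name] = sorted(first.keys())
    return summary


# Placeholder sample shaped like the output of create_data_map()
sample = {
    "team": [{"team_id": 1, "name": "Hawks", "city": "Atlanta", "abrv": "ATL"}],
    "player": [{"player_id": 7, "first_name": "Pat", "last_name": "Example"}],
}
schema = summarize_keys(sample)
print(schema)
```

Calling `summarize_keys(data)` on the real data map would give one list of column candidates per table.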

From a quick glance at the loaded JSON, I've identified the following keys and structural notes:

<img src='./images/lineups_schema.jpg' style='height: 400px' />

The next block establishes the connection with the database. The database I've configured is a Cloud SQL Instance in the Google Cloud Platform, which is hosting our PostgreSQL 12 database named `lac_dev_lineups`. Configurations are listed as follows:

* GCP Project ID: **`clippers-test-1`**
* CloudSQL Instance ID: **`clippers-test-instance`**
* PostgreSQL Version: **12**
* \# vCPUs: **1**
* Memory: **3.75GB**
* SSD Storage: **10GB**

A snapshot of the current running GCP project (`clippers-test-1`) is attached.

<img src='./images/gcp_console.jpg' style='height: 256px' />



<div class="alert alert-info">
    You <b>must</b> run the following block to create the SQLAlchemy engine. If the connection fails, check that the environment variables are valid.

</div>

In [99]:
from sqlalchemy import create_engine

# PostgreSQL + SQLAlchemy.
ENGINE_STRING = 'postgresql+psycopg2://{}:{}@{}/{}'.format(POSTGRES_USER,
                                                           POSTGRES_PASSWORD,
                                                           DATABASE_HOST,
                                                           DATABASE_NAME)

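try:
    # Note: create_engine() is lazy and does not open a connection yet; the
    # first real connection is made when a statement is first executed.
    engine = sqlalchemy.create_engine(ENGINE_STRING, echo=True)
    print("Connection SUCCESS.")
except Exception:
    print("Connection to database failed. Check environment variables.")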


Connection SUCCESS.
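
One caveat with formatting the engine string directly: a password containing characters such as `@` or `/` would produce a URL that parses incorrectly. A minimal sketch of one way around this is to percent-encode the credentials with the standard library first. `build_engine_string` and the credentials below are hypothetical and exist only for illustration.

```python
from urllib.parse import quote_plus


def build_engine_string(user, password, host, database):
    """Build a SQLAlchemy-style URL, percent-encoding the user and password."""
    return "postgresql+psycopg2://{}:{}@{}/{}".format(
        quote_plus(user), quote_plus(password), host, database)


# Hypothetical credentials; the real values come from the environment variables
url = build_engine_string("alex", "p@ss/word", "127.0.0.1", "lac_dev_lineups")
print(url)
```

The encoded string can be passed to `create_engine` in the same way as `ENGINE_STRING`.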


The following helper checks whether every table listed in `names` exists in our `lac_dev_lineups` database.

In [100]:
# Function adapted from https://stackoverflow.com/questions/40652938/flask-sqlalchemy-check-if-table-exists-in-database
def tables_exist(names):
    all_exist = True
    for name in names:
        ret = engine.dialect.has_table(engine, name)
        print('Table "{}" exists: {}'.format(name, ret))
        all_exist = all_exist and ret
    return all_exist

### Creating Tables if they do not exist

The following code takes advantage of what is known in SQLAlchemy as **[Object Relational Mapping (ORM)](https://www.tutorialspoint.com/sqlalchemy/sqlalchemy_orm_declaring_mapping.htm)**. SQLAlchemy takes care of most of the serialization involved between Python's class/object definitions and PostgreSQL's type definitions.

In [101]:
# Declarative mapping of Objects if necessary
from sqlalchemy import Column, ForeignKey, Numeric, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship, sessionmaker

Base = declarative_base()

class Team(Base):
    ''' * = primary key
        ----------------------
       |team                  |
       |----------------------|
       |col       |type       |
       |----------------------|
       |team_id*  |INTEGER    |
       |name      |STRING(250)|
       |city      |STRING(250)|
       |abrv      |STRING(250)|
        ----------------------
    '''
    __tablename__ = 'team'
    team_id = Column(Integer, primary_key=True) # primary key
    name = Column(String(length=250), nullable=False)
    city = Column(String(length=250), nullable=False)
    abrv = Column(String(length=250), nullable=False)
    
class Player(Base):
    ''' * = primary key
        ----------------------
       |player                |
       |----------------------|
       |col       |type       |
       |----------------------|
       |player_id*|INTEGER    |
       |first_name|STRING(250)|
       |last_name |STRING(250)|
        ----------------------
    '''
    __tablename__ = 'player'
    player_id = Column(Integer, primary_key=True) # primary key
    first_name = Column(String(length=250), nullable=False)
    last_name = Column(String(length=250), nullable=False)

# TODO
class Game(Base):
    ''' *  = primary key
        ** = foreign key
        ----------------------
       |game_schedule         |
       |----------------------|
       |col       |type       |
       |----------------------|
       |game_id*  |INTEGER    |
       |home_id** |INTEGER    |
       |home_score|INTEGER    |
       |away_id** |INTEGER    |
       |away_score|INTEGER    |
       |game_date |DATETIME   |
        ----------------------
    '''
    __tablename__ = 'game_schedule'
    game_id = Column(Integer, primary_key=True) # primary key
    home_id = Column(Integer, ForeignKey('team.team_id'), nullable=False)
    home_score = Column(Integer)
    away_id = Column(Integer, ForeignKey('team.team_id'), nullable=False)
    away_score = Column(Integer)
    game_date = Column(DateTime(timezone=False))
    
# TODO
class Lineup(Base):
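    ''' *  = primary key (composite)
        ** = foreign key
        --------------------------
       |lineup                    |
       |--------------------------|
       |col           |type       |
       |--------------------------|
       |team_id* **   |INTEGER    |
       |player_id* ** |INTEGER    |
       |game_id* **   |INTEGER    |
       |lineup_num*   |INTEGER    |
       |period        |INTEGER    |
       |time_in       |NUMERIC    |
       |time_out      |NUMERIC    |
        --------------------------
    '''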
    __tablename__ = 'lineup'
    team_id = Column(Integer, ForeignKey("team.team_id"), primary_key=True)
    player_id = Column(Integer, ForeignKey("player.player_id"), primary_key=True)
    game_id = Column(Integer, ForeignKey("game_schedule.game_id"), primary_key=True)
    lineup_num = Column(Integer, primary_key=True)
    period = Column(Integer, nullable=False)
    time_in = Column(Numeric, nullable=False)
    time_out = Column(Numeric, nullable=False)
        

def create_tables(engine):
    Session = sessionmaker(bind = engine)
    session = Session()
    try:
        table_names = ['team', 'player', 'game_schedule', 'lineup']
        if not tables_exist(table_names):
            Base.metadata.create_all(engine)
        else:
            print("Table names {} all exist. No new tables were created.".format(table_names))
    except:
        session.rollback()
        raise
    finally:
        session.close()

In [102]:
create_tables(engine)

2020-05-26 14:31:51,544 INFO sqlalchemy.engine.base.Engine select version()
2020-05-26 14:31:51,547 INFO sqlalchemy.engine.base.Engine {}
2020-05-26 14:31:51,736 INFO sqlalchemy.engine.base.Engine select current_schema()
2020-05-26 14:31:51,737 INFO sqlalchemy.engine.base.Engine {}
2020-05-26 14:31:51,829 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2020-05-26 14:31:51,829 INFO sqlalchemy.engine.base.Engine {}
2020-05-26 14:31:51,899 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2020-05-26 14:31:51,900 INFO sqlalchemy.engine.base.Engine {}
2020-05-26 14:31:51,948 INFO sqlalchemy.engine.base.Engine show standard_conforming_strings
2020-05-26 14:31:51,949 INFO sqlalchemy.engine.base.Engine {}
2020-05-26 14:31:51,987 INFO sqlalchemy.engine.base.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s
20

In [131]:
from datetime import datetime
from sqlalchemy import exc

def create_teams_list(team_json):
    '''Create a list of Pythonic Team objects, prepare for insertion to table team'''
    team_list = []
    for j in team_json:
        team = Team(team_id = j["team_id"],
                 name = j["name"],
                 city = j["city"],
                 abrv = j["abrv"])
        team_list.append(team)
    return team_list

def create_players_list(player_json):
    '''Create a list of Pythonic Player objects, prepare for insertion to table player'''
    player_list = []
    for j in player_json:
        player = Player(player_id=j['player_id'],
                        first_name=j['first_name'],
                        last_name=j['last_name'])
        player_list.append(player)
    return player_list

def create_games_list(game_json):
    '''Create a list of Pythonic Game objects, prepare for insertion to table game_schedule'''
    game_list = []
    for j in game_json:
        game = Game(game_id=j['game_id'],
                    home_id=j['home_id'],
                    home_score=j['home_score'],
                    away_id=j['away_id'],
                    away_score=j['away_score'],
                    game_date=datetime.strptime(j['game_date'], "%Y-%m-%d %H:%M:%S"))
        game_list.append(game)
    return game_list

def create_lineups_list(lineup_json):
    '''Create a list of Pythonic Lineup objects, prepare for insertion to table lineup'''
    lineup_list = []
    for j in lineup_json:
        lineup = Lineup(team_id=j['team_id'],
                    player_id=j['player_id'],
                    game_id=j['game_id'],
                    lineup_num=j['lineup_num'],
                    period=j['period'],
                    time_in=j['time_in'],
                    time_out=j['time_out'])
        lineup_list.append(lineup)
    return lineup_list


def insert_records(record_list):
    '''Adds the objects in record_list to a session and commits them to the database.'''
    Session = sessionmaker(bind = engine)
    session = Session()
    
    try:
        session.add_all(record_list)
        session.commit()
    except exc.IntegrityError as error:
        print("You attempted to add a new value, but there was a duplicate key: {}".format(error))
        session.rollback()
    except:
        session.rollback()
        raise
    finally:
        session.close()
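
The prompt asks that reloads, merges, and updates be handled, while `insert_records` rolls the whole batch back as soon as one key already exists. One option is an upsert. The sketch below uses the standard-library `sqlite3` module as a stand-in for PostgreSQL so that it runs anywhere (it assumes SQLite 3.24 or newer); PostgreSQL accepts an `INSERT ... ON CONFLICT ... DO UPDATE` clause of the same shape. `load_players` and the player names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE player ("
    "player_id INTEGER PRIMARY KEY, "
    "first_name TEXT NOT NULL, "
    "last_name TEXT NOT NULL)"
)

# Insert a row, or overwrite the non-key columns if the key already exists.
UPSERT = """
INSERT INTO player (player_id, first_name, last_name)
VALUES (:player_id, :first_name, :last_name)
ON CONFLICT (player_id) DO UPDATE
SET first_name = excluded.first_name,
    last_name = excluded.last_name
"""


def load_players(rows):
    """Upsert a batch of player dicts in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(UPSERT, rows)


# First load, then a reload that updates player 1 and adds player 2
load_players([{"player_id": 1, "first_name": "Pat", "last_name": "Example"}])
load_players([
    {"player_id": 1, "first_name": "Pat", "last_name": "Sample"},
    {"player_id": 2, "first_name": "Lee", "last_name": "Placeholder"},
])
result = conn.execute(
    "SELECT player_id, first_name, last_name FROM player ORDER BY player_id"
).fetchall()
print(result)
```

Within the ORM itself, `session.merge()` is another possible route, since it looks each object up by primary key and then inserts or updates it, at the cost of one lookup per object.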
 

In [132]:
insert_records(create_teams_list(data['team']))
insert_records(create_players_list(data['player']))
insert_records(create_games_list(data['game_schedule']))
insert_records(create_lineups_list(data['lineup']))

2020-05-26 16:59:56,619 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2020-05-26 16:59:56,624 INFO sqlalchemy.engine.base.Engine INSERT INTO team (team_id, name, city, abrv) VALUES (%(team_id)s, %(name)s, %(city)s, %(abrv)s)
2020-05-26 16:59:56,626 INFO sqlalchemy.engine.base.Engine ({'team_id': 1, 'name': 'Hawks', 'city': 'Atlanta', 'abrv': 'ATL'}, {'team_id': 2, 'name': 'Celtics', 'city': 'Boston', 'abrv': 'BOS'}, {'team_id': 3, 'name': 'Nets', 'city': 'Brooklyn', 'abrv': 'BKN'}, {'team_id': 4, 'name': 'Hornets', 'city': 'Charlotte', 'abrv': 'CHA'}, {'team_id': 5, 'name': 'Bulls', 'city': 'Chicago', 'abrv': 'CHI'}, {'team_id': 6, 'name': 'Cavaliers', 'city': 'Cleveland', 'abrv': 'CLE'}, {'team_id': 7, 'name': 'Mavericks', 'city': 'Dallas', 'abrv': 'DAL'}, {'team_id': 8, 'name': 'Nuggets', 'city': 'Denver', 'abrv': 'DEN'}  ... displaying 10 of 30 total bound parameter sets ...  {'team_id': 29, 'name': 'Jazz', 'city': 'Utah', 'abrv': 'UTA'}, {'team_id': 30, 'name': 'Wizards', 'ci

2020-05-26 16:59:59,637 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2020-05-26 17:00:00,593 INFO sqlalchemy.engine.base.Engine INSERT INTO lineup (team_id, player_id, game_id, lineup_num, period, time_in, time_out) VALUES (%(team_id)s, %(player_id)s, %(game_id)s, %(lineup_num)s, %(period)s, %(time_in)s, %(time_out)s)
2020-05-26 17:00:00,593 INFO sqlalchemy.engine.base.Engine ({'team_id': 12, 'player_id': 56, 'game_id': 5, 'lineup_num': 1, 'period': 1, 'time_in': 720.0, 'time_out': 312.0}, {'team_id': 12, 'player_id': 56, 'game_id': 5, 'lineup_num': 2, 'period': 1, 'time_in': 312.0, 'time_out': 312.0}, {'team_id': 12, 'player_id': 56, 'game_id': 5, 'lineup_num': 3, 'period': 1, 'time_in': 312.0, 'time_out': 264.0}, {'team_id': 12, 'player_id': 56, 'game_id': 5, 'lineup_num': 4, 'period': 1, 'time_in': 264.0, 'time_out': 228.0}, {'team_id': 12, 'player_id': 56, 'game_id': 5, 'lineup_num': 5, 'period': 1, 'time_in': 228.0, 'time_out': 228.0}, {'team_id': 12, 'player_id': 56, 'game

## 2. Queries

The following block is a helper function to view the results of queries in a simple format.

In [136]:
def read_query(text: sqlalchemy.sql.expression.TextClause) -> pd.DataFrame:
    '''Executes a SQLAlchemy TextClause on a new connection and
       returns the result as a pandas DataFrame. If the query
       fails, prints a message and returns None.'''
    with engine.connect() as connection:
        try:
            return pd.read_sql(text, connection)
        except Exception:
            print("There was an error. Check that your query is valid.")

### Query 2a:

Write a SQL query that can calculate team win-loss records, sorted by win percentage (defined as wins divided by games played)

In [159]:
# Creating some helper views to answer the following queries.
from sqlalchemy import text
from sqlalchemy.exc import ProgrammingError

# home_games will contain the number of home games, home wins, home losses for each team.
query_view_home_games = text("\
CREATE VIEW home_games AS \
    SELECT home_id AS team_id, COUNT(*) AS total_home_games, \
           SUM(CASE WHEN home_score > away_score THEN 1 ELSE 0 END) AS home_wins, \
           SUM(CASE WHEN home_score < away_score THEN 1 ELSE 0 END) AS home_losses \
    FROM game_schedule \
    GROUP BY home_id")

# away_games will contain the number of away games, away wins, away losses for each team
query_view_away_games = text("\
CREATE VIEW away_games AS \
    SELECT away_id AS team_id, COUNT(*) AS total_away_games, \
           SUM(CASE WHEN away_score > home_score THEN 1 ELSE 0 END) away_wins, \
           SUM(CASE WHEN away_score < home_score THEN 1 ELSE 0 END) away_losses \
    FROM game_schedule \
    GROUP BY away_id")

# total_records will contain the totals for each team
query_view_total_records = text("\
CREATE VIEW total_records AS \
    SELECT home_games.team_id, home_wins + away_wins AS wins, \
                               home_losses + away_losses AS losses \
    FROM home_games JOIN away_games ON home_games.team_id = away_games.team_id")

with engine.connect() as connection:
    try:
        connection.execute(query_view_home_games)
        connection.execute(query_view_away_games)
        connection.execute(query_view_total_records)
    except ProgrammingError as e:
        print("One of the views already exists: {}".format(e))

2020-05-27 00:15:54,700 INFO sqlalchemy.engine.base.Engine CREATE VIEW home_games AS     SELECT home_id AS team_id, COUNT(*) AS total_home_games,            SUM(CASE WHEN home_score > away_score THEN 1 ELSE 0 END) AS home_wins,            SUM(CASE WHEN home_score < away_score THEN 1 ELSE 0 END) AS home_losses     FROM game_schedule     GROUP BY home_id
2020-05-27 00:15:54,702 INFO sqlalchemy.engine.base.Engine {}
2020-05-27 00:15:54,779 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-27 00:15:54,793 INFO sqlalchemy.engine.base.Engine CREATE VIEW away_games AS     SELECT away_id AS team_id, COUNT(*) AS total_away_games,            SUM(CASE WHEN away_score > home_score THEN 1 ELSE 0 END) away_wins,            SUM(CASE WHEN away_score < home_score THEN 1 ELSE 0 END) away_losses     FROM game_schedule     GROUP BY away_id
2020-05-27 00:15:54,793 INFO sqlalchemy.engine.base.Engine {}
2020-05-27 00:15:54,830 INFO sqlalchemy.engine.base.Engine COMMIT
2020-05-27 00:15:54,856 INFO sqlalchemy.

In [160]:
query_2_a = text("\
SELECT *, CAST(CAST(wins AS DECIMAL)/(wins + losses) AS DECIMAL(6,5)) AS win_percentage \
FROM team JOIN total_records ON team.team_id = total_records.team_id \
ORDER BY win_percentage ASC, team.team_id ASC")

In [161]:
read_query(query_2_a)

2020-05-27 00:16:00,904 INFO sqlalchemy.engine.base.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s
2020-05-27 00:16:00,905 INFO sqlalchemy.engine.base.Engine {'name': 'SELECT *, CAST(CAST(wins AS DECIMAL)/(wins + losses) AS DECIMAL(6,5)) AS win_percentage FROM team JOIN total_records ON team.team_id = total_records.team_id ORDER BY win_percentage ASC, team.team_id ASC'}
2020-05-27 00:16:00,944 INFO sqlalchemy.engine.base.Engine SELECT *, CAST(CAST(wins AS DECIMAL)/(wins + losses) AS DECIMAL(6,5)) AS win_percentage FROM team JOIN total_records ON team.team_id = total_records.team_id ORDER BY win_percentage ASC, team.team_id ASC
2020-05-27 00:16:00,944 INFO sqlalchemy.engine.base.Engine {}


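| | team_id | name | city | abrv | team_id.1 | wins | losses | win_percentage |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | Cavaliers | Cleveland | CLE | 6 | 1 | 6 | 0.14286 |
| 1 | 24 | Suns | Phoenix | PHX | 24 | 1 | 6 | 0.14286 |
| 2 | 30 | Wizards | Washington | WAS | 30 | 1 | 6 | 0.14286 |
| 3 | 11 | Rockets | Houston | HOU | 11 | 1 | 5 | 0.16667 |
| 4 | 5 | Bulls | Chicago | CHI | 5 | 2 | 6 | 0.25 |
| 5 | 7 | Mavericks | Dallas | DAL | 7 | 2 | 6 | 0.25 |
| 6 | 20 | Knicks | New York | NYK | 20 | 2 | 6 | 0.25 |
| 7 | 1 | Hawks | Atlanta | ATL | 1 | 2 | 5 | 0.28571 |
| 8 | 22 | Magic | Orlando | ORL | 22 | 2 | 5 | 0.28571 |
| 9 | 21 | Thunder | Oklahoma City | OKC | 21 | 2 | 4 | 0.33333 |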
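
One limitation worth noting: `total_records` uses an inner join of `home_games` and `away_games`, so a team that has so far played only at home or only away would drop out of the standings. A possible alternative, sketched here under assumptions, stacks the home and away results with `UNION ALL` and aggregates once. The sketch uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, the three-game schedule is made up for illustration, and the result is sorted best-first rather than in the ascending order used by `query_2_a`.

```python
import sqlite3

# Made-up schedule: (game_id, home_id, home_score, away_id, away_score)
games = [
    (1, 1, 100, 2, 90),  # team 1 wins at home against team 2
    (2, 2, 95, 1, 99),   # team 1 wins away at team 2
    (3, 3, 80, 1, 85),   # team 3 only appears as a home team
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game_schedule ("
    "game_id INTEGER PRIMARY KEY, home_id INTEGER, home_score INTEGER, "
    "away_id INTEGER, away_score INTEGER)"
)
conn.executemany("INSERT INTO game_schedule VALUES (?, ?, ?, ?, ?)", games)

# One row per team per game, then a single aggregation per team.
query = """
SELECT team_id,
       SUM(won) AS wins,
       SUM(lost) AS losses,
       ROUND(1.0 * SUM(won) / COUNT(*), 5) AS win_percentage
FROM (
    SELECT home_id AS team_id,
           CASE WHEN home_score > away_score THEN 1 ELSE 0 END AS won,
           CASE WHEN home_score < away_score THEN 1 ELSE 0 END AS lost
    FROM game_schedule
    UNION ALL
    SELECT away_id AS team_id,
           CASE WHEN away_score > home_score THEN 1 ELSE 0 END AS won,
           CASE WHEN away_score < home_score THEN 1 ELSE 0 END AS lost
    FROM game_schedule
) AS results
GROUP BY team_id
ORDER BY win_percentage DESC, team_id ASC
"""
records = conn.execute(query).fetchall()
for row in records:
    print(row)
```

Team 3 never plays away in this toy schedule, yet it still appears in the result with a 0-1 record.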


### Query 2b:

In the same table, show how the team ranks (highest to lowest) in terms of games played, home games,
and away games during this month of the season? Make sure your code can extend to additional months
as data is added to the data set. For each show both the number of games and the rank.

**Luckily, with the `home_games` and `away_games` views created for the previous question, this is a much simpler calculation.**


In [181]:
query_2_b = text("\
SELECT RANK () OVER ( \
       ORDER BY h.total_home_games + a.total_away_games DESC, \
                h.total_home_games DESC, \
                a.total_away_games DESC) rank, \
       h.team_id, \
       t.name, \
       h.total_home_games + a.total_away_games as total_games, \
       h.total_home_games, \
       a.total_away_games \
FROM home_games h JOIN away_games a \
        ON h.team_id = a.team_id \
    JOIN team t ON h.team_id = t.team_id \
ORDER BY total_games DESC, h.total_home_games DESC, \
    a.total_away_games DESC")

In [182]:
read_query(query_2_b)

2020-05-27 01:23:04,839 INFO sqlalchemy.engine.base.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s
2020-05-27 01:23:04,841 INFO sqlalchemy.engine.base.Engine {'name': 'SELECT RANK () OVER (        ORDER BY h.total_home_games + a.total_away_games DESC,                 h.total_home_games DESC,                 a.total_ ... (212 characters truncated) ...     ON h.team_id = a.team_id     JOIN team t ON h.team_id = t.team_id ORDER BY total_games DESC, h.total_home_games DESC,     a.total_away_games DESC'}
2020-05-27 01:23:04,886 INFO sqlalchemy.engine.base.Engine SELECT RANK () OVER (        ORDER BY h.total_home_games + a.total_away_games DESC,                 h.total_home_games DESC,                 a.total_away_games DESC) rank,        h.team_id,        t.name,        h.total_home_games + a.total_away_games as total_games,        h.total_home_games,        a.total_away_games FROM home_games

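| | rank | team_id | name | total_games | total_home_games | total_away_games |
|---|---|---|---|---|---|---|
| 0 | 1 | 10 | Warriors | 9 | 4 | 5 |
| 1 | 2 | 28 | Raptors | 8 | 6 | 2 |
| 2 | 3 | 18 | Timberwolves | 8 | 5 | 3 |
| 3 | 3 | 20 | Knicks | 8 | 5 | 3 |
| 4 | 5 | 14 | Lakers | 8 | 4 | 4 |
| 5 | 5 | 5 | Bulls | 8 | 4 | 4 |
| 6 | 5 | 23 | 76ers | 8 | 4 | 4 |
| 7 | 8 | 26 | Kings | 8 | 3 | 5 |
| 8 | 8 | 3 | Nets | 8 | 3 | 5 |
| 9 | 8 | 4 | Hornets | 8 | 3 | 5 |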
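
The ranks in this output follow SQL `RANK()` semantics: tied rows share a rank and the following rank is skipped (3, 3, 5 above). The prompt also asks for a rank for each metric, whereas `query_2_b` returns a single combined rank. As a rough illustration of per-metric ranking in plain Python, the sketch below uses only the first five rows of the output above, so its ranks are relative to those five teams and not to the full league. `competition_rank` is a hypothetical helper.

```python
def competition_rank(values):
    """Rank values from highest to lowest; ties share a rank and the next rank is skipped."""
    ordered = sorted(values, reverse=True)
    # The rank of a value is the position of its first occurrence in descending order.
    return [ordered.index(v) + 1 for v in values]


# (name, total_games, total_home_games, total_away_games), copied from the output above
rows = [
    ("Warriors", 9, 4, 5),
    ("Raptors", 8, 6, 2),
    ("Timberwolves", 8, 5, 3),
    ("Knicks", 8, 5, 3),
    ("Lakers", 8, 4, 4),
]

total_rank = competition_rank([r[1] for r in rows])
home_rank = competition_rank([r[2] for r in rows])
away_rank = competition_rank([r[3] for r in rows])

for (name, *_), t, h, a in zip(rows, total_rank, home_rank, away_rank):
    print(name, t, h, a)
```

In SQL, the same idea would presumably be three separate `RANK() OVER (ORDER BY ... DESC)` columns, one per metric, in place of the single combined ordering.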


# DANGER ZONE

In [156]:
with engine.connect() as connection:
    # The view created above is named total_records; IF EXISTS avoids an error if it is already gone
    connection.execute("DROP VIEW IF EXISTS total_records CASCADE")

In [68]:
Base.metadata.drop_all(engine)   # drops every table registered on Base.metadata

2020-05-26 00:49:28,926 INFO sqlalchemy.engine.base.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s
2020-05-26 00:49:28,928 INFO sqlalchemy.engine.base.Engine {'name': 'team'}
2020-05-26 00:49:29,002 INFO sqlalchemy.engine.base.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s
2020-05-26 00:49:29,003 INFO sqlalchemy.engine.base.Engine {'name': 'player'}
2020-05-26 00:49:29,021 INFO sqlalchemy.engine.base.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s
2020-05-26 00:49:29,022 INFO sqlalchemy.engine.base.Engine {'name': 'game_schedule'}
2020-05-26 00:49:29,041 INFO sqlalchemy.engine.base.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_v