# Analyzing NBA player and team stats with Spark/Redshift
### Data Engineering Capstone Project

#### Project Summary


The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

## Step 1: Scope the Project and Gather Data

### Scope

The goal of this capstone project is to:
* Collect NBA player data, season stats data, and team data.
* Extract data from S3 files (in CSV, JSON, and TXT format) into Spark DataFrames.
* Clean and transform the data using Spark, then load it back to S3 in Parquet format.
* Load the Parquet data into Redshift tables.
    * At a high level, the data model I have in mind looks like this:

<img src="data-model-high-lvl.png">

* Analyze the NBA dataset for more insights using SQL. I will try to write queries that answer questions such as:
  * Which team has the best winning percentage?
  * Which team has the most star players?
  * Who are the top 10 coaches in history?
  * Who is the most efficient player? The best 3-point shooter? The best defensive player in terms of blocks and steals?
  * How has the game evolved over time? For example, are teams shooting more 3-pointers, or focusing more on defense?


### Describe and Gather Data 

#### DataSet 1: NBA player and player stats per season.
https://www.kaggle.com/drgilermo/nba-players-stats

This dataset contains aggregate individual statistics for 67 NBA seasons, from basic box-score attributes such as points, assists, and rebounds to more advanced, Moneyball-like features such as Value Over Replacement.
The data was scraped from [basketball-reference](https://www.basketball-reference.com/)

* **Players.csv**: 
This file contains basic player information, e.g. weight, height, college.
Since player names in this file are unique once two duplicated rows are dropped (see Step 2: Explore and Assess the Data), I will mainly use this CSV file to create the player table. Sample data:

|Id | Player | height | weight | collage | born | birth_city | birth_state |
|:-|:-|:-|:-|:-|:-|:-|:-|
|2590 | Vince Carter | 198 | 99 | University of North Carolina | 1977 | Daytona Beach | Florida |

* **player_data.json**: 
This file contains extra player information, e.g. more accurate birth date.
This file contains duplicate player names (shown in Step 2: Explore and Assess the Data), so in this project I only use it to add birth date information to the player table.
```
    "4290": {
        "name": "Russell Westbrook",
        "year_start": "2009",
        "year_end": "2018",
        "position": "G",
        "height": "6-3",
        "weight": "200",
        "birth_date": "November 12, 1988",
        "college": "University of California, Los Angeles"
    }
```

* **Seasons_Stats.csv**: 
This file contains NBA player stats over all the seasons, from 1950 to 2015.
The column names are abbreviated, e.g. **3P%** is the 3-Point Field Goal Percentage (available since the 1979-80 season in the NBA), computed as 3P / 3PA.
More detailed column descriptions can be found in the [glossary](https://www.basketball-reference.com/about/glossary.html).
I may expand the abbreviations into more readable names when creating the tables on Redshift. Sample data:

| Id | Year | Player | Pos | Age | Tm | ... | AST | STL | BLK | TOV | PF | PTS |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| 16746 | 2004 | LeBron James | SG | 19 | CLE | ... | 465 | 130 | 58 | 273 | 149 | 1654 |
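
The 3P% definition quoted in the column description (3P / 3PA) can be recomputed from the raw counts. The following is a minimal sketch in plain Python, not part of the notebook's pipeline. The function name `three_point_pct` is hypothetical, and returning `None` when attempts are missing or zero is my assumption about how to treat seasons before 1979-80 and players without attempts.

```python
from typing import Optional


def three_point_pct(made: Optional[float], attempted: Optional[float]) -> Optional[float]:
    """Return 3P / 3PA, or None when the ratio is undefined."""
    # Missing counts (e.g. seasons before the 3-point line) or zero attempts
    # have no meaningful percentage, so report None instead of dividing.
    if made is None or attempted is None or attempted == 0:
        return None
    return made / attempted


print(three_point_pct(50, 125))
print(three_point_pct(None, None))
```

Returning `None` rather than `0.0` keeps "no attempts" distinguishable from "missed every attempt" when the values are later aggregated.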



#### DataSet 2: NBA team record per season.
https://www.kaggle.com/boonpalipatana/nba-season-records-from-every-year

This dataset contains the season record of each NBA team across 73 seasons (wins, losses, standing, playoff result, and more).

* **Team_Records.csv**:
This file contains every season record for each NBA team, from 1946 to 2017. Sample data:

| Season | Lg | Team | W | L | W/L% | Finish | ...  | Coaches | Top WS |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| 2004-05 | NBA | Boston Celtics* | 45 | 37 | 0.549 | 1 | ...       | D. Rivers (45-37) | P. Pierce (11.2) |
| 2003-04 | NBA | Boston Celtics* | 36 | 46 | 0.439 | 4 | ...       | J. O'Brien (22-24) J. Carroll (14-22) | P. Pierce (7.1) |

Multiple coaches can coach the same team in a season, so I need to parse a Coaches value such as "J. O'Brien (22-24) J. Carroll (14-22)" into a list of coaching records.
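
To make the target shape of that parsing concrete, here is a small sketch in plain Python using the standard `re` module. It is an illustration only and not the Spark code used later in the notebook. The names `COACH_RE` and `parse_coaches` are hypothetical, and the pattern assumes every entry has the form `Name (wins-losses)`.

```python
import re
from typing import List, Tuple

# One entry looks like "J. O'Brien (22-24)": a name followed by "(wins-losses)".
COACH_RE = re.compile(r"\s*(.+?)\s*\((\d+)-(\d+)\)")


def parse_coaches(coaches: str) -> List[Tuple[str, int, int]]:
    """Split a Coaches cell into (coach, wins, losses) tuples."""
    return [
        (name, int(wins), int(losses))
        for name, wins, losses in COACH_RE.findall(coaches)
    ]


print(parse_coaches("J. O'Brien (22-24) J. Carroll (14-22)"))
```

Because the hyphen is only interpreted inside the parentheses, a hyphenated coach name would not be split by this pattern.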


#### DataSet 3: NBA team timeline.
http://www.shrpsports.com/nba/explain.htm

This is a webpage that contains each team's name, abbreviation, and start and end seasons.
Dataset 1 (player stats) uses team abbreviations, while dataset 2 (team stats) uses full team names. Establishing the mapping between abbreviation and full name (e.g. GSW => Golden State Warriors) requires a lot of manual work, so I hope to automate joining the two tables using the information on this webpage.

* **team-abbrevation.txt**:
This file contains the city, abbreviation, team name, and active seasons.

```
...
Baltimore    	Bal	Baltimore Bullets [2nd team] (1963-64 - 1972-73)
Boston       	Bos	Boston Celtics (1946-47 - present)
Brooklyn     	Bkn	Brooklyn Nets (2012-13 - present)
Buffalo      	Buf	Buffalo Braves (1970-71 - 1977-78)
Capital      	Cap	Capital Bullets (1973-74)
Charlotte    	Cha	Charlotte Hornets (1988-89 - 2001-02, 2014-15 - present)
Cha Bobcats  	ChB	Charlotte Bobcats (2004-05 - 2013-14)
...
```
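
One way such lines could be turned into structured records is sketched below in plain Python. This is an assumption-laden illustration: it assumes the three fields are tab-separated as in the sample above, that the seasons always sit in the final parenthesised group, and that a single season such as `(1973-74)` is both start and end. The names `TEAM_RE` and `parse_team_line` are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Team field: a name (possibly with a "[2nd team]" note), then "(seasons)".
TEAM_RE = re.compile(r"^(?P<team>.+?)\s*\((?P<periods>[^()]*)\)\s*$")


def parse_team_line(line: str) -> Dict[str, object]:
    """Parse 'city<TAB>abbrev<TAB>team (start - end, ...)' into a dict."""
    city, abbrev, rest = [part.strip() for part in line.split("\t")]
    match = TEAM_RE.match(rest)
    if match is None:
        raise ValueError(f"unexpected format: {line!r}")
    periods: List[Tuple[str, str]] = []
    for period in match.group("periods").split(","):
        start, sep, end = period.strip().partition(" - ")
        # A lone season like "1973-74" has no separator: use it as start and end.
        periods.append((start, end if sep else start))
    return {
        "city": city,
        "abbrev": abbrev,
        "team": match.group("team"),
        "periods": periods,
    }


print(parse_team_line("Charlotte    \tCha\tCharlotte Hornets (1988-89 - 2001-02, 2014-15 - present)"))
```

Note that a bracketed note such as `[2nd team]` stays in the `team` value here; stripping it would be a separate cleaning decision.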

## Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [1]:
import boto3
import os
import configparser
from datetime import datetime
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import udf, col, isnan, when, count, trim, desc, sum, asc
from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear, date_format
from pyspark.sql.functions import countDistinct, explode, split, concat_ws, collect_list
from pyspark.sql.types import (
    StructType as R,
    StructField as Fld,
    DoubleType as Dbl,
    StringType as Str,
    IntegerType as Int,
    DateType as Date,
    TimestampType as Ts,
)

In [2]:
config = configparser.ConfigParser()

#Normally this file should be in ~/.aws/credentials
config.read_file(open('dwh.cfg'))

KEY                    = config.get('AWS','KEY')
SECRET                 = config.get('AWS','SECRET')

os.environ["AWS_ACCESS_KEY_ID"]= config['AWS']['KEY']
os.environ["AWS_SECRET_ACCESS_KEY"]= config['AWS']['SECRET']
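
For reference, the cell above expects `dwh.cfg` to have an `[AWS]` section with `KEY` and `SECRET` options. The sketch below shows that layout with placeholder values, parsed from a string so it runs without the real file. The values and the name `sample_config` are hypothetical; real credentials should never be committed.

```python
import configparser

# Placeholder layout of dwh.cfg; the values below are NOT real credentials.
SAMPLE_CFG = """
[AWS]
KEY = YOUR_ACCESS_KEY_ID
SECRET = YOUR_SECRET_ACCESS_KEY
"""

sample_config = configparser.ConfigParser()
sample_config.read_string(SAMPLE_CFG)

# Both access styles used in the notebook cell work on the same parser object.
print(sample_config.get("AWS", "KEY"))
print(sample_config["AWS"]["SECRET"])
```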

In [3]:
spark = SparkSession.builder\
                     .config("spark.jars.packages","org.apache.hadoop:hadoop-aws:2.7.0")\
                     .getOrCreate()

### Load Players.csv into the DataFrame "dfPlayer". This file contains NBA player data; this DataFrame will be used to create dfJoinPlayer

In [5]:
# load players.csv
playerSchema = R([
    Fld("id", Int()),
    Fld("name", Str()), # rename column, Player => name
    Fld("height", Int()), # cm
    Fld("weight", Int()), # kg
    Fld("collage", Str()), # I know it's misspelled (collage => college), will rename column later, when joining with dfPlayerExtra
    Fld("born", Int()),
    Fld("birth_city", Str()),
    Fld("birth_state", Str()),
])
dfPlayer = spark.read.csv("s3a://udacity-data-eng-capstone/Players.csv", header=True, schema=playerSchema)
#dfPlayer.printSchema()
dfPlayer.show(5)
print("count = ", dfPlayer.count())

+---+---------------+------+------+--------------------+----+-----------+-----------+
| id|           name|height|weight|             collage|born| birth_city|birth_state|
+---+---------------+------+------+--------------------+----+-----------+-----------+
|  0|Curly Armstrong|   180|    77|  Indiana University|1918|       null|       null|
|  1|   Cliff Barker|   188|    83|University of Ken...|1921|   Yorktown|    Indiana|
|  2|  Leo Barnhorst|   193|    86|University of Not...|1924|       null|       null|
|  3|     Ed Bartels|   196|    88|North Carolina St...|1925|       null|       null|
|  4|    Ralph Beard|   178|    79|University of Ken...|1927|Hardinsburg|   Kentucky|
+---+---------------+------+------+--------------------+----+-----------+-----------+
only showing top 5 rows

count =  3922


In [6]:
dfPlayer.select("name").where(dfPlayer.name.like('%Iverson%')).show()
dfPlayer.select("name").where(dfPlayer.name.like('%Yao Ming%')).show()

+--------------+
|          name|
+--------------+
|Allen Iverson*|
+--------------+

+---------+
|     name|
+---------+
|Yao Ming*|
+---------+



#### Need to clean up player names: some Hall of Famers have a star appended to their names, e.g. "Yao Ming*", "Allen Iverson*"

In [7]:
# strip * from name (raw string avoids Python's invalid-escape warning for "\*")
dfPlayer = dfPlayer.withColumn("name", F.regexp_replace("name", r"\*+", ""))
#dfPlayer = dfPlayer.withColumn("name", F.regexp_replace("name", "([\w+\s]+)", "$1")) #figure out capture group

In [8]:
# verify names are trimmed
dfPlayer.select("name").where(dfPlayer.name.like('%Iverson%')).show()
dfPlayer.select("name").where(dfPlayer.name.like('%Yao Ming%')).show()

+-------------+
|         name|
+-------------+
|Allen Iverson|
+-------------+

+--------+
|    name|
+--------+
|Yao Ming|
+--------+



In [9]:
# player with the same name?
dfPlayer.groupBy("name").count().filter("count > 1").show(truncate=False)

+-------------+-----+
|name         |count|
+-------------+-----+
|Patrick Ewing|2    |
|Gary Payton  |2    |
+-------------+-----+



In [10]:
# inspect player with identical names
dfPlayer.where(dfPlayer.name == 'Patrick Ewing').show(truncate=False)
dfPlayer.where(dfPlayer.name == 'Gary Payton').show(truncate=False)

+----+-------------+------+------+---------------------+----+----------+-----------+
|id  |name         |height|weight|collage              |born|birth_city|birth_state|
+----+-------------+------+------+---------------------+----+----------+-----------+
|1721|Patrick Ewing|213   |108   |Georgetown University|1962|Kingston  |Jamaica    |
|3406|Patrick Ewing|213   |108   |Georgetown University|1962|Kingston  |Jamaica    |
+----+-------------+------+------+---------------------+----+----------+-----------+

+----+-----------+------+------+-----------------------+----+----------+-----------+
|id  |name       |height|weight|collage                |born|birth_city|birth_state|
+----+-----------+------+------+-----------------------+----+----------+-----------+
|2099|Gary Payton|193   |81    |Oregon State University|1968|Oakland   |California |
|3894|Gary Payton|193   |81    |Oregon State University|1968|Oakland   |California |
+----+-----------+------+------+-----------------------+----+----------+-----------+

In [11]:
# The duplicated rows are identical except for id, so it's safe to drop the extra copies
print("before delete, num rows", dfPlayer.count())
dfPlayer = dfPlayer.dropDuplicates(["name", "born"])
print("after  delete, num rows", dfPlayer.count())

before delete, num rows 3922
after  delete, num rows 3920
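
The "identical except for id" claim can be checked mechanically before dropping rows. Below is a plain-Python sketch of that check, using a subset of the Patrick Ewing columns shown in the output above. The helper name `identical_except` is hypothetical, and this is an illustration rather than the notebook's Spark code.

```python
from typing import Dict, List


def identical_except(rows: List[Dict[str, object]], ignore: str = "id") -> bool:
    """True when all rows are equal once the `ignore` key is removed."""
    stripped = [{k: v for k, v in row.items() if k != ignore} for row in rows]
    return all(row == stripped[0] for row in stripped)


# Values copied from the In [10] output (a subset of the columns).
ewing = [
    {"id": 1721, "name": "Patrick Ewing", "height": 213, "weight": 108, "born": 1962},
    {"id": 3406, "name": "Patrick Ewing", "height": 213, "weight": 108, "born": 1962},
]
print(identical_except(ewing))
```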


#### Load player_data2.json into the DataFrame "dfPlayerExtra". This file contains duplicate player names, and its birth dates are more precise than those in Players.csv

* Both dfPlayer and dfPlayerExtra have weight and height columns, but the units differ (cm/kg vs. feet-inches/pounds), so I keep dfPlayer's height and weight.
* I will parse the birth date from dfPlayerExtra and add extra columns (birth_day, birth_month, birth_year) to dfPlayer.
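
Although the notebook keeps dfPlayer's metric values, the two unit systems can be related with a short conversion, which is handy for spot-checking one file against the other. This is a plain-Python sketch; the function names are hypothetical, and the input format "feet-inches" (e.g. "6-3") is taken from the player_data.json sample.

```python
def feet_inches_to_cm(height: str) -> float:
    """Convert a 'feet-inches' string such as '6-3' to centimetres."""
    feet, inches = height.split("-")
    return (int(feet) * 12 + int(inches)) * 2.54  # 1 inch = 2.54 cm


def pounds_to_kg(pounds: float) -> float:
    """Convert pounds to kilograms (1 lb = 0.45359237 kg)."""
    return pounds * 0.45359237


# Russell Westbrook's sample record: height "6-3", weight 200 lb.
print(feet_inches_to_cm("6-3"), pounds_to_kg(200))
```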

In [12]:
# load player_data2.json
playerExtraSchema = R([
    Fld("name", Str()),
    Fld("year_start", Int()),
    Fld("year_end", Int()),
    Fld("position", Str()),
    Fld("height", Str()), # feet-inches
    Fld("weight", Int()), # pound lbs
    Fld("birth_date", Str()),
    Fld("college", Str()),
])
# json file was generated by `df.to_json('player_data2.json', orient='records', indent=4)`
# note: playerExtraSchema is not passed to the reader below, so Spark infers the column types
dfPlayerExtra = spark.read.option("multiline", "true").json(
    "s3a://udacity-data-eng-capstone/player_data2.json"
)
#dfPlayerExtra.printSchema()
dfPlayerExtra.show(5)
print("count = ", dfPlayerExtra.count())

+----------------+--------------------+------+-------------------+--------+------+--------+----------+
|      birth_date|             college|height|               name|position|weight|year_end|year_start|
+----------------+--------------------+------+-------------------+--------+------+--------+----------+
|   June 24, 1968|     Duke University|  6-10|     Alaa Abdelnaby|     F-C| 240.0|    1995|      1991|
|   April 7, 1946|Iowa State Univer...|   6-9|    Zaid Abdul-Aziz|     C-F| 235.0|    1978|      1969|
|  April 16, 1947|University of Cal...|   7-2|Kareem Abdul-Jabbar|       C| 225.0|    1989|      1970|
|   March 9, 1969|Louisiana State U...|   6-1| Mahmoud Abdul-Rauf|       G| 162.0|    2001|      1991|
|November 3, 1974|San Jose State Un...|   6-6|  Tariq Abdul-Wahad|       F| 223.0|    2003|      1998|
+----------------+--------------------+------+-------------------+--------+------+--------+----------+
only showing top 5 rows

count =  4550


In [13]:
# some rows have missing values
dfPlayerExtra.where(dfPlayerExtra.name == "George Karl").show()

+------------+--------------------+------+-----------+--------+------+--------+----------+
|  birth_date|             college|height|       name|position|weight|year_end|year_start|
+------------+--------------------+------+-----------+--------+------+--------+----------+
|May 12, 1952|University of Nor...|  null|George Karl|    null|  null|    1978|      1974|
+------------+--------------------+------+-----------+--------+------+--------+----------+



In [14]:
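# before parsing the birth_date column, check for missing values
# caution: `birth_date == None` compiles to `birth_date = NULL`, which never matches, so this count is always 0;
# dfPlayerExtra.birth_date.isNull() is the reliable test, and dropna() in the next cell removes rows without a usable birth date
dfPlayerExtra.where(dfPlayerExtra.birth_date == None).count()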

0

In [15]:
# trim * in name column
# split and parse birth_date into 3 columns(year, month, day)

dfPlayerExtra = dfPlayerExtra.withColumn(
    "name", F.regexp_replace("name", r"\*+", "") # trim * in name
).withColumn(
    "birth_date_split", F.split(F.regexp_replace("birth_date", ",", ""), " ")
)

dfPlayerExtra = dfPlayerExtra.withColumn(
    "birth_month", dfPlayerExtra.birth_date_split.getItem(0) # need to convert Jan=>1
).withColumn(
    "birth_day",   dfPlayerExtra.birth_date_split.getItem(1).cast(Int())
).withColumn(
    "birth_year",  dfPlayerExtra.birth_date_split.getItem(2).cast(Int())
).drop(
    "birth_date_split"
).drop(
    "birth_date"
).dropna(
    subset=["birth_year", "birth_month", "birth_day"]
)

In [16]:
# verify
dfPlayerExtra.select(["name", "birth_year", "birth_month", "birth_day"]).show(5)

+-------------------+----------+-----------+---------+
|               name|birth_year|birth_month|birth_day|
+-------------------+----------+-----------+---------+
|     Alaa Abdelnaby|      1968|       June|       24|
|    Zaid Abdul-Aziz|      1946|      April|        7|
|Kareem Abdul-Jabbar|      1947|      April|       16|
| Mahmoud Abdul-Rauf|      1969|      March|        9|
|  Tariq Abdul-Wahad|      1974|   November|        3|
+-------------------+----------+-----------+---------+
only showing top 5 rows



In [17]:
# find all distinct months
dfPlayerExtra.select("birth_month").dropDuplicates().show()

+-----------+
|birth_month|
+-----------+
|       July|
|   November|
|   February|
|    January|
|      March|
|    October|
|        May|
|     August|
|      April|
|       June|
|   December|
|  September|
+-----------+



In [18]:
# convert month Str=>Int, e.g. Jan=>1
map_month = {
    "July":         7,
    "November":     11,
    "February":     2,
    "January":      1,
    "March":        3,
    "October":      10,
    "May":          5,
    "August":       8,
    "April":        4,
    "June":         6,
    "December":     12,
    "September":    9,
}

def translate(mapping):
    def translate_(value):
        # unknown keys become None (null), so the result always fits the Int() return type
        return mapping.get(value)
    return udf(translate_, Int())

dfPlayerExtra = dfPlayerExtra.withColumn("birth_month", translate(map_month)("birth_month"))

In [19]:
# check translate is successful
dfPlayerExtra.select("birth_month").dropDuplicates().show()
dfPlayerExtra.show(2)
#dfPlayerExtra.printSchema()

+-----------+
|birth_month|
+-----------+
|         12|
|          1|
|          6|
|          3|
|          5|
|          9|
|          4|
|          8|
|          7|
|         10|
|         11|
|          2|
+-----------+

+--------------------+------+---------------+--------+------+--------+----------+-----------+---------+----------+
|             college|height|           name|position|weight|year_end|year_start|birth_month|birth_day|birth_year|
+--------------------+------+---------------+--------+------+--------+----------+-----------+---------+----------+
|     Duke University|  6-10| Alaa Abdelnaby|     F-C| 240.0|    1995|      1991|          6|       24|      1968|
|Iowa State Univer...|   6-9|Zaid Abdul-Aziz|     C-F| 235.0|    1978|      1969|          4|        7|      1946|
+--------------------+------+---------------+--------+------+--------+----------+-----------+---------+----------+
only showing top 2 rows



In [20]:
# add column birth_ds (datestamp); (year, month, day, ts) will be separated into a dim table later, when loading to Redshift
from datetime import datetime

def translate():
    def translate_(y, m, d):
        return datetime(y, m, d)
    return udf(translate_, Date())

dfPlayerExtra = dfPlayerExtra.withColumn("birth_ds", translate()("birth_year", "birth_month", "birth_day"))
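
As a cross-check on the split, month lookup, and date-building steps above, the same result can be obtained in one call with the standard library. This is a plain-Python sketch, not the notebook's method; `parse_birth_date` is a hypothetical name, and `%B` assumes English month names (the default C locale).

```python
from datetime import date, datetime


def parse_birth_date(text: str) -> date:
    """Parse strings like 'November 12, 1988' into a date."""
    # %B = full month name, %d = day of month, %Y = four-digit year.
    return datetime.strptime(text, "%B %d, %Y").date()


birth = parse_birth_date("November 12, 1988")
print(birth.year, birth.month, birth.day)
```

A function like this could be wrapped in a UDF in the same way as the cells above; Spark's built-in `to_date` with a format pattern may also cover this case without a UDF.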

In [21]:
# check add column successfully
#dfPlayerExtra.printSchema()
dfPlayerExtra.show(2)

+--------------------+------+---------------+--------+------+--------+----------+-----------+---------+----------+----------+
|             college|height|           name|position|weight|year_end|year_start|birth_month|birth_day|birth_year|  birth_ds|
+--------------------+------+---------------+--------+------+--------+----------+-----------+---------+----------+----------+
|     Duke University|  6-10| Alaa Abdelnaby|     F-C| 240.0|    1995|      1991|          6|       24|      1968|1968-06-24|
|Iowa State Univer...|   6-9|Zaid Abdul-Aziz|     C-F| 235.0|    1978|      1969|          4|        7|      1946|1946-04-07|
+--------------------+------+---------------+--------+------+--------+----------+-----------+---------+----------+----------+
only showing top 2 rows



In [22]:
uniqPlayer      = [r[0] for r in dfPlayer.     select("name").dropDuplicates().collect()]
uniqPlayerExtra = [r[0] for r in dfPlayerExtra.select("name").dropDuplicates().collect()]

In [23]:
# players in extra, not in orig table
diff1 = set(uniqPlayerExtra) - set(uniqPlayer)
print(len(diff1))

585


In [24]:
# players in orig table, not in extra
# dfPlayer p1 left join dfPlayerExtra p2 on p1.name = p2.name, how many rows will have null value on p2
diff2 = set(uniqPlayer) - set(uniqPlayerExtra)
print(len(diff2))
print(diff2)

35
{'Juan Carlos', 'Whitey Von', 'Sheldon McClellan', 'Metta World', 'Vinny Del', 'Joe Barry', 'Butch Van', 'Peter John', 'Walter Tavares', 'Billy Ray', 'Dick Van', 'Eddie Lee', 'Johnny Macknowski', 'Don Bielke', 'Bob Schafer', 'Horacio Llamas', 'Jan Van', 'Luc Mbah', 'Nick Van', 'Norm Van', 'Micheal Ray', 'Keith Van', 'James Michael', 'World B.', 'Nando De', "Mike O'Neill", 'Frank Reddout', 'Wah Wah', 'Logan Vander', 'George Bon', 'Hot Rod', 'Ken McBride', 'nan', 'Jo Jo', 'Tom Van'}


In [25]:
# print the null count for each column
dfPlayer.select([count(when(col(c).isNull(), c)).alias(c) for c in dfPlayer.columns]).show()
dfPlayerExtra.select([count(when(col(c).isNull(), c)).alias(c) for c in dfPlayerExtra.columns]).show()

+---+----+------+------+-------+----+----------+-----------+
| id|name|height|weight|collage|born|birth_city|birth_state|
+---+----+------+------+-------+----+----------+-----------+
|  0|   0|     1|     1|    348|   1|       470|        483|
+---+----+------+------+-------+----+----------+-----------+

+-------+------+----+--------+------+--------+----------+-----------+---------+----------+--------+
|college|height|name|position|weight|year_end|year_start|birth_month|birth_day|birth_year|birth_ds|
+-------+------+----+--------+------+--------+----------+-----------+---------+----------+--------+
|    301|     1|   0|       1|     5|       0|         0|          0|        0|         0|       0|
+-------+------+----+--------+------+--------+----------+-----------+---------+----------+--------+



In [26]:
# join dfPlayer and dfPlayerExtra
# fill college column
dfJoinPlayer = dfPlayer.join(
    dfPlayerExtra,
    (dfPlayer.name == dfPlayerExtra.name) & (dfPlayer.born == dfPlayerExtra.birth_year),
    "left"
).withColumn(
    "college", F.coalesce(dfPlayer.collage, dfPlayerExtra.college)
).drop(
    dfPlayer.born
).drop(
    dfPlayerExtra.name
).drop(
    dfPlayerExtra.height
).drop(
    dfPlayerExtra.weight
).drop(
    dfPlayer.collage
).drop(
    dfPlayerExtra.college
)

print("dfPlayer      count = ", dfPlayer.count())
print("dfPlayerExtra count = ", dfPlayerExtra.count())
print("dfJoinPlayer  count = ", dfJoinPlayer.count())

dfPlayer      count =  3920
dfPlayerExtra count =  4519
dfJoinPlayer  count =  3920


In [27]:
dfJoinPlayer.printSchema()
dfJoinPlayer.show(2)

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- height: integer (nullable = true)
 |-- weight: integer (nullable = true)
 |-- birth_city: string (nullable = true)
 |-- birth_state: string (nullable = true)
 |-- college: string (nullable = true)
 |-- position: string (nullable = true)
 |-- year_end: long (nullable = true)
 |-- year_start: long (nullable = true)
 |-- birth_month: integer (nullable = true)
 |-- birth_day: integer (nullable = true)
 |-- birth_year: integer (nullable = true)
 |-- birth_ds: date (nullable = true)

+---+----------------+------+------+-----------+-----------+--------------------+--------+--------+----------+-----------+---------+----------+----------+
| id|            name|height|weight| birth_city|birth_state|             college|position|year_end|year_start|birth_month|birth_day|birth_year|  birth_ds|
+---+----------------+------+------+-----------+-----------+--------------------+--------+--------+----------+-----------+----

In [28]:
dfJoinPlayer.select([count(when(col(c).isNull(), c)).alias(c) for c in dfJoinPlayer.columns]).show()

+---+----+------+------+----------+-----------+-------+--------+--------+----------+-----------+---------+----------+--------+
| id|name|height|weight|birth_city|birth_state|college|position|year_end|year_start|birth_month|birth_day|birth_year|birth_ds|
+---+----+------+------+----------+-----------+-------+--------+--------+----------+-----------+---------+----------+--------+
|  0|   0|     1|     1|       470|        483|    332|     276|     275|       275|        275|      275|       275|     275|
+---+----+------+------+----------+-----------+-------+--------+--------+----------+-----------+---------+----------+--------+



In [29]:
dfJoinPlayer.where(dfJoinPlayer.college.isNull()).show(5)

+----+--------------------+------+------+----------+-----------+-------+--------+--------+----------+-----------+---------+----------+----------+
|  id|                name|height|weight|birth_city|birth_state|college|position|year_end|year_start|birth_month|birth_day|birth_year|  birth_ds|
+----+--------------------+------+------+----------+-----------+-------+--------+--------+----------+-----------+---------+----------+----------+
|3865|    Juan Hernangomez|   206|   104|    Madrid|      Spain|   null|       F|    2018|      2017|          9|       28|      1995|1995-09-28|
|3881|Timothe Luwawu-Ca...|   198|    92|    Cannes|     France|   null|       F|    2018|      2017|          5|        9|      1995|1995-05-09|
|2955|       Darko Milicic|   213|   113|  Novi Sad|     Serbia|   null|     F-C|    2013|      2004|          6|       20|      1985|1985-06-20|
|2771|       Hedo Turkoglu|   208|    99|  Istanbul|     Turkey|   null|       F|    2015|      2001|          3|       19| 

In [30]:
dfJoinPlayer.where(dfJoinPlayer.college.isNull()).count()

332

#### Create the dfBirthTime DataFrame and save it to S3

In [31]:
dfBirthTime = dfJoinPlayer.select(["birth_ds", "birth_year", "birth_month", "birth_day"]).dropDuplicates().dropna("any")
dfBirthTime.count()

3407

In [None]:
dfBirthTime.write.parquet("s3a://udacity-data-eng-capstone-parquet/dimBirthTime/", mode="overwrite")

#### Create dfTeamStats

In [32]:
teamStatsSchema = R([       # orig col  renamed     description
    Fld("season", Str()),   # Season    season      .
    Fld("league", Str()),   # Lg        league      .
    Fld("team", Str()),     # Team      team        .
    Fld("wins", Int()),     # W         wins        .
    Fld("losses", Int()),   # L         losses      .
    Fld("wl_pc", Dbl()),    # W/L%      wl_pc       Win-Loss Percentage
    Fld("finish", Int()),   # Finish    finish      standing
    Fld("srs", Dbl()),      # SRS       srs         Simple Rating System
    Fld("pace", Dbl()),     # Pace      pace        Pace Factor - the number of possessions a team uses per game
    Fld("rel_pace", Dbl()), # Rel_Pace  rel_pace    relative Pace
    Fld("ortg", Dbl()),     # ORtg      ortg        Offensive Rating
    Fld("rel_ortg", Dbl()), # Rel_ORtg  rel_ortg    relative Offensive Rating
    Fld("drtg", Dbl()),     # DRtg      drtg        Defensive Rating
    Fld("rel_drtg", Dbl()), # Rel_DRtg  rel_drtg    relative Defensive Rating
    Fld("playoffs", Str()), # Playoffs  playoffs    .
    Fld("coaches", Str()),  # Coaches   coaches     a team can have multiple coaches throughout a season
    Fld("top_ws", Str()),   # Top WS    top_ws      top Win Shares; this column appears to name the best player on the team
])
dfTeamStats = spark.read.csv(
    "s3a://udacity-data-eng-capstone/Team_Records.csv",
    header=True, schema=teamStatsSchema,
)
#dfTeamStats.printSchema()
dfTeamStats.show(10)
print("count = ", dfTeamStats.count())

+-------+------+---------------+----+------+-----+------+-----+----+--------+-----+--------+-----+--------+--------------------+------------------+----------------+
| season|league|           team|wins|losses|wl_pc|finish|  srs|pace|rel_pace| ortg|rel_ortg| drtg|rel_drtg|            playoffs|           coaches|          top_ws|
+-------+------+---------------+----+------+-----+------+-----+----+--------+-----+--------+-----+--------+--------------------+------------------+----------------+
|2017-18|   NBA| Boston Celtics|  29|    10|0.744|     1| 4.38|95.4|    -1.7|108.0|     0.2|102.8|    -5.0|                null|B. Stevens (29-10)| K. Irving (5.7)|
|2016-17|   NBA|Boston Celtics*|  53|    29|0.646|     1| 2.25|96.8|     0.4|111.2|     2.4|108.4|    -0.4|Lost E. Conf. Finals|B. Stevens (53-29)|I. Thomas (12.5)|
|2015-16|   NBA|Boston Celtics*|  48|    34|0.585|     2| 2.84|98.5|     2.7|106.8|     0.4|103.6|    -2.8|Lost E. Conf. 1st...|B. Stevens (48-34)| I. Thomas (9.7)|
|2014-15| 

In [33]:
# figure out a way to quantify playoffs column
#dfTeamStats.select(dfTeamStats.playoffs).dropDuplicates().collect()
dfTeamStats.groupBy(dfTeamStats.playoffs).count().collect()

[Row(playoffs='Lost W. Div. Semis', count=34),
 Row(playoffs='Won Finals', count=75),
 Row(playoffs='Lost E. Div. Third Place Tiebreaker', count=1),
 Row(playoffs='Lost E. Div. Semis', count=32),
 Row(playoffs='Lost W. Conf. 1st Rnd.', count=152),
 Row(playoffs='Lost W. Div. Finals', count=26),
 Row(playoffs='Lost 1st Rnd.', count=1),
 Row(playoffs=None, count=623),
 Row(playoffs='Lost W. Conf. Finals', count=47),
 Row(playoffs='Lost W. Conf. Semis', count=94),
 Row(playoffs='Lost Central Div. Finals', count=1),
 Row(playoffs='Lost Semis', count=2),
 Row(playoffs='Lost Finals', count=73),
 Row(playoffs='Lost E. Conf. Finals', count=47),
 Row(playoffs='Eliminated in E. Div. Rnd. Robin', count=1),
 Row(playoffs='Lost E. Conf. 1st Rnd.', count=152),
 Row(playoffs='Lost Central Div. Semis', count=1),
 Row(playoffs='Lost Quarterfinals', count=2),
 Row(playoffs='Lost E. Conf. Semis', count=94),
 Row(playoffs='Lost E. Div. Finals', count=22),
 Row(playoffs='Lost W. Div. Tiebreaker', count=2),

In [34]:
# convert playoff string to playoff score, championship is 6, not in playoff is 0, "???" means I'm not sure either
map_playoff_score = {
    None:                       0, # count=623

    'Won Finals':               6, # count=75
    'Lost Finals':              5, # count=73

    'Lost W. Div. Finals':      4, # count=26
    'Lost W. Conf. Finals':     4, # count=47
    'Lost Central Div. Finals': 4, # count=1
    'Lost E. Conf. Finals':     4, # count=47
    'Lost E. Div. Finals':      4, # count=22

    'Lost Quarterfinals':       3, # count=2 ???
    'Lost W. Div. Semis':       3, # count=34
    'Lost E. Div. Semis':       3, # count=32
    'Lost W. Conf. Semis':      3, # count=94
    'Lost Semis':               3, # count=2
    'Lost Central Div. Semis':  3, # count=1
    'Lost E. Conf. Semis':      3, # count=94

    'Lost E. Conf. 1st Rnd.':   2, # count=152
    'Lost W. Conf. 1st Rnd.':   2, # count=152
    'Lost 1st Rnd.':            2, # count=1

    'Lost W. Div. Tiebreaker':              1, # count=2 ???
    'Eliminated in W. Div. Rnd. Robin':     1, # count=1 ???
    'Eliminated in E. Div. Rnd. Robin':     1, # count=1 ???
    'Lost E. Div. Third Place Tiebreaker':  1, # count=1 ???
}

def translate(mapping):
    def translate_(value):
        # unknown keys become None (null), so the result always fits the Int() return type
        return mapping.get(value)
    return udf(translate_, Int())

dfTeamStats = dfTeamStats.withColumn(
    "playoff_score", translate(map_playoff_score)("playoffs")
).withColumn(
    # need to clean up * in the team column, e.g. "Boston Celtics*"; * means the team made the playoffs that year
    "team", F.regexp_replace("team", r"\*+", "")
)

dfTeamStats.where(dfTeamStats.playoffs.isNull()).show(1)

for playoffs_str in ['Won Finals','Lost Finals',
                     'Lost W. Div. Finals','Lost W. Conf. Finals','Lost Central Div. Finals','Lost E. Conf. Finals','Lost E. Div. Finals',
                     'Lost Quarterfinals',
                     'Lost W. Div. Semis','Lost E. Div. Semis','Lost W. Conf. Semis','Lost Semis','Lost Central Div. Semis','Lost E. Conf. Semis',
                     'Lost E. Conf. 1st Rnd.','Lost W. Conf. 1st Rnd.','Lost 1st Rnd.',
                     'Lost W. Div. Tiebreaker','Eliminated in W. Div. Rnd. Robin','Eliminated in E. Div. Rnd. Robin','Lost E. Div. Third Place Tiebreaker']:
    print(dfTeamStats.where(dfTeamStats.playoffs == playoffs_str).select(["season", "team", "playoffs", "playoff_score"]).limit(1).collect())

+-------+------+--------------+----+------+-----+------+----+----+--------+-----+--------+-----+--------+--------+------------------+---------------+-------------+
| season|league|          team|wins|losses|wl_pc|finish| srs|pace|rel_pace| ortg|rel_ortg| drtg|rel_drtg|playoffs|           coaches|         top_ws|playoff_score|
+-------+------+--------------+----+------+-----+------+----+----+--------+-----+--------+-----+--------+--------+------------------+---------------+-------------+
|2017-18|   NBA|Boston Celtics|  29|    10|0.744|     1|4.38|95.4|    -1.7|108.0|     0.2|102.8|    -5.0|    null|B. Stevens (29-10)|K. Irving (5.7)|            0|
+-------+------+--------------+----+------+-----+------+----+----+--------+-----+--------+-----+--------+--------+------------------+---------------+-------------+
only showing top 1 row

[Row(season='2007-08', team='Boston Celtics', playoffs='Won Finals', playoff_score=6)]
[Row(season='2009-10', team='Boston Celtics', playoffs='Lost Finals',
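
The lookup table in In [34] follows a pattern that can also be written as a few rules, which avoids listing every conference and division variant. The sketch below is a plain-Python alternative under that assumption; `playoff_score` is a hypothetical name, and the scores for tiebreaker and round-robin results carry the same uncertainty as the "???" entries in the table.

```python
from typing import Optional


def playoff_score(result: Optional[str]) -> int:
    """Map a Playoffs string to the 0-6 score used in the lookup table."""
    if result is None:
        return 0  # did not make the playoffs
    if result == "Won Finals":
        return 6
    if result == "Lost Finals":
        return 5
    if result.endswith("Finals"):
        return 4  # conference or division finals
    if result.endswith("Semis") or result == "Lost Quarterfinals":
        return 3
    if result.endswith("1st Rnd."):
        return 2
    return 1  # tiebreakers and round-robin eliminations (uncertain)


for text in [None, "Won Finals", "Lost E. Conf. Finals", "Lost W. Div. Tiebreaker"]:
    print(text, playoff_score(text))
```

The rule order matters: the two exact Finals checks must come before the generic `endswith("Finals")` test.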

#### Create dfCoache from dfTeamStats

In [35]:
# the coaches column is an unstructured string; need to explode it into multiple rows
dfTeamStats.where("season = '2003-04' and team = 'New York Knicks'").select("coaches").collect()

[Row(coaches='D. Chaney (15-24) H. Williams (1-0) L. Wilkens (23-19)')]

In [36]:
# copy to dfCoache
dfCoache = dfTeamStats.withColumn(
    'coach_and_stat', explode(F.split(dfTeamStats.coaches, r'\) '))
).select(["coach_and_stat", "season", "league", "team", "finish", "top_ws"])

# parse: "D. Chaney (15-24)" => ["D. Chaney", 15, 24]
dfCoache = dfCoache.withColumn(
    "coach_and_stat", F.split(
        F.regexp_replace(
            F.regexp_replace("coach_and_stat", r"\)$", ""),
            r" \(",
            "-"),
        "-"
    )
)

dfCoache = dfCoache.withColumn(
    "coach", dfCoache.coach_and_stat.getItem(0).cast(Str())
).withColumn(
    "wins", dfCoache.coach_and_stat.getItem(1).cast(Int())
).withColumn(
    "losses", dfCoache.coach_and_stat.getItem(2).cast(Int())
).drop(
    "coach_and_stat"
)
dfCoache.printSchema()
dfCoache.show(1)
dfCoache.where("season = '2003-04' and team = 'New York Knicks'").show()

root
 |-- season: string (nullable = true)
 |-- league: string (nullable = true)
 |-- team: string (nullable = true)
 |-- finish: integer (nullable = true)
 |-- top_ws: string (nullable = true)
 |-- coach: string (nullable = true)
 |-- wins: integer (nullable = true)
 |-- losses: integer (nullable = true)

+-------+------+--------------+------+---------------+----------+----+------+
| season|league|          team|finish|         top_ws|     coach|wins|losses|
+-------+------+--------------+------+---------------+----------+----+------+
|2017-18|   NBA|Boston Celtics|     1|K. Irving (5.7)|B. Stevens|  29|    10|
+-------+------+--------------+------+---------------+----------+----+------+
only showing top 1 row

+-------+------+---------------+------+----------------+-----------+----+------+
| season|league|           team|finish|          top_ws|      coach|wins|losses|
+-------+------+---------------+------+----------------+-----------+----+------+
|2003-04|   NBA|New York Knicks|   
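
Replacing ` (` with `-` and then splitting on `-` would also split a hyphenated coach name. As a minimal alternative sketch in plain Python (the helper name `parse_coaches` and the regex are my own, not part of the notebook), a single regex can pull out each `name (wins-losses)` entry:

```python
import re

# One coach entry looks like "D. Chaney (15-24)": a name, then "(wins-losses)".
# The lazy name group stops at the first " (digits-digits)" that follows it.
COACH_RE = re.compile(r"\s*(?P<coach>.+?) \((?P<wins>\d+)-(?P<losses>\d+)\)")


def parse_coaches(coaches):
    """Return a list of (coach, wins, losses) tuples; empty list for None or ''."""
    if not coaches:
        return []
    return [
        (m.group("coach"), int(m.group("wins")), int(m.group("losses")))
        for m in COACH_RE.finditer(coaches)
    ]


# The 2003-04 Knicks string shown earlier in this section
rows = parse_coaches("D. Chaney (15-24) H. Williams (1-0) L. Wilkens (23-19)")
for row in rows:
    print(row)
```

If this were used in the pipeline, it could be wrapped in a Spark UDF that returns an array of structs and then exploded, at the cost of UDF overhead.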

#### create dfPlayerStats

In [40]:
playerStatsSchema = R([         # orig col  renamed     description
    Fld("id",       Int()),     # Id        id          .
    Fld("year",     Int()),     # Year      year        .
    Fld("player",   Str()),     # Player    player      .
    Fld("pos",      Str()),     # Pos       pos         Position
    Fld("age",      Int()),     # Age       age         .
    Fld("team",     Str()),     # Tm        team        .
    Fld("g",        Int()),     # G         g           Games
    Fld("gs",       Int()),     # GS        gs          Games Started
    Fld("mp",       Int()),     # MP        mp          Minutes Played (available since the 1951-52 season)
    Fld("per",      Dbl()),     # PER       per         Player Efficiency Rating
    Fld("ts_p",     Dbl()),     # TS%       ts_p        True Shooting Percentage, the formula is TS% = PTS/(2*TSA)
    Fld("3par",     Dbl()),     # 3PAr      3par        3-Pointer Rate, 3-Pointer Attempts to Field Goal Attempts. 3PA/FGA.
    Fld("ftr",      Dbl()),     # FTr       ftr         Free Throw Rate, formula is FTA/FGA.
    Fld("orb_p",    Dbl()),     # ORB%      orb_p       Offensive Rebound Percentage
    Fld("drb_p",    Dbl()),     # DRB%      drb_p       Defensive Rebound Percentage
    Fld("trb_p",    Dbl()),     # TRB%      trb_p       Total Rebound Percentage
    Fld("ast_p",    Dbl()),     # AST%      ast_p       Assist Percentage
    Fld("stl_p",    Dbl()),     # STL%      stl_p       Steal Percentage
    Fld("blk_p",    Dbl()),     # BLK%      blk_p       Block Percentage
    Fld("tov_p",    Dbl()),     # TOV%      tov_p       Turnover Percentage
    Fld("usg_p",    Dbl()),     # USG%      usg_p       Usage Percentage
    Fld("blanl",    Str()),     # blanl     blanl       .
    Fld("ows",      Dbl()),     # OWS       ows         Offensive Win Shares
    Fld("dws",      Dbl()),     # DWS       dws         Defensive Win Shares
    Fld("ws",       Dbl()),     # WS        ws          Win Shares
    Fld("ws_o_48",  Dbl()),     # WS/48     ws_o_48     Win Shares Per 48 Minutes
    Fld("blank2",   Str()),     # blank2    blank2      .
    Fld("obpm",     Dbl()),     # OBPM      obpm        Offensive Box Plus Minus
    Fld("dbpm",     Dbl()),     # DBPM      dbpm        Defensive Box Plus Minus
    Fld("bpm",      Dbl()),     # BPM       bpm         Box Plus Minus
    Fld("vorp",     Dbl()),     # VORP      vorp        Value Over Replacement Player
    Fld("fg",       Int()),     # FG        fg          Field Goals
    Fld("fga",      Int()),     # FGA       fga         Field Goal Attempts (includes 2-point and 3-point field goal attempts)
    Fld("fg_p",     Dbl()),     # FG%       fg_p        Field Goal Percentage; the formula is FG/FGA.
    Fld("3p",       Int()),     # 3P        3p          3-Point Field Goals
    Fld("3pa",      Int()),     # 3PA       3pa         3-Point Field Goal Attempts
    Fld("3p_p",     Dbl()),     # 3P%       3p_p        3-Point Field Goal Percentage
    Fld("2p",       Int()),     # 2P        2p          2-Point Field Goals
    Fld("2pa",      Int()),     # 2PA       2pa         2-Point Field Goal Attempts
    Fld("2p_p",     Dbl()),     # 2P%       2p_p        2-Point Field Goal Percentage
    Fld("efg_p",    Dbl()),     # eFG%      efg_p       Effective Field Goal Percentage; the formula is (FG + 0.5 * 3P) / FGA
    Fld("ft",       Int()),     # FT        ft          Free Throws
    Fld("fta",      Int()),     # FTA       fta         Free Throw Attempts
    Fld("ft_p",     Dbl()),     # FT%       ft_p        Free Throw Percentage
    Fld("orb",      Int()),     # ORB       orb         Offensive Rebounds
    Fld("drb",      Int()),     # DRB       drb         Defensive Rebounds
    Fld("trb",      Int()),     # TRB       trb         Total Rebounds
    Fld("ast",      Int()),     # AST       ast         Assists
    Fld("stl",      Int()),     # STL       stl         Steals
    Fld("blk",      Int()),     # BLK       blk         Blocks
    Fld("tov",      Int()),     # TOV       tov         Turnovers
    Fld("pf",       Int()),     # PF        pf          Personal Fouls
    Fld("pts",      Int()),     # PTS       pts         Points
])

dfPlayerStats = spark.read.csv(
    "s3a://udacity-data-eng-capstone/Seasons_Stats.csv",
    header=True, schema=playerStatsSchema)

#dfPlayerStats.printSchema()
dfPlayerStats.show(5)
print("count = ", dfPlayerStats.count())

+---+----+---------------+---+---+----+---+----+----+----+-----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+----+----+-------+------+----+----+----+----+---+---+-----+----+----+----+---+---+-----+-----+---+---+-----+----+----+----+---+----+----+----+---+---+
| id|year|         player|pos|age|team|  g|  gs|  mp| per| ts_p|3par|  ftr|orb_p|drb_p|trb_p|ast_p|stl_p|blk_p|tov_p|usg_p|blanl| ows| dws|  ws|ws_o_48|blank2|obpm|dbpm| bpm|vorp| fg|fga| fg_p|  3p| 3pa|3p_p| 2p|2pa| 2p_p|efg_p| ft|fta| ft_p| orb| drb| trb|ast| stl| blk| tov| pf|pts|
+---+----+---------------+---+---+----+---+----+----+----+-----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+----+----+-------+------+----+----+----+----+---+---+-----+----+----+----+---+---+-----+-----+---+---+-----+----+----+----+---+----+----+----+---+---+
|  0|1950|Curly Armstrong|G-F| 31| FTW| 63|null|null|null|0.368|null|0.467| null| null| null| null| null| null| null| null| null|-0.1| 3.6| 3.5| 

#### More detailed column descriptions can be found in the [glossary](https://www.basketball-reference.com/about/glossary.html)
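
The rate formulas quoted in the schema comments can be checked by hand. The sketch below is plain Python with hypothetical helper names; the stat line is made up for illustration and is not taken from the dataset. TSA (true shooting attempts) is `FGA + 0.44 * FTA`, as defined in the glossary.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is 0 or missing."""
    if not denominator:
        return None
    return numerator / denominator


def true_shooting_pct(pts, fga, fta):
    # TS% = PTS / (2 * TSA), with TSA = FGA + 0.44 * FTA
    return _ratio(pts, 2 * (fga + 0.44 * fta))


def effective_fg_pct(fg, three_p, fga):
    # eFG% = (FG + 0.5 * 3P) / FGA
    return _ratio(fg + 0.5 * three_p, fga)


def three_point_attempt_rate(three_pa, fga):
    # 3PAr = 3PA / FGA
    return _ratio(three_pa, fga)


def free_throw_rate(fta, fga):
    # FTr = FTA / FGA
    return _ratio(fta, fga)


# Made-up season line, used only to exercise the formulas
line = {"pts": 1000, "fg": 350, "fga": 800, "3p": 100, "3pa": 300, "fta": 250}

ts = true_shooting_pct(line["pts"], line["fga"], line["fta"])
efg = effective_fg_pct(line["fg"], line["3p"], line["fga"])
par3 = three_point_attempt_rate(line["3pa"], line["fga"])
ftr = free_throw_rate(line["fta"], line["fga"])
print(round(ts, 3), efg, par3, ftr)
```

Returning `None` for a zero denominator mirrors the `null` values that the early seasons show in the loaded DataFrame.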

#### Player stats use team abbreviations while team stats use team names, so we need to establish a mapping between them

In [41]:
teamSchema = R([
    Fld("city", Str()),
    Fld("abbrev", Str()),
    Fld("description", Str()),
])
dfTeam = spark.read.option("header", "true") \
    .option("delimiter", "\t") \
    .csv("s3a://udacity-data-eng-capstone/team-abbrevation.txt",
        schema=teamSchema)

dfTeam.show(10, truncate=False)

# turn abbrev upper case Atl => ATL
# trim trailing spaces in city column
# split description to name (str) and time (str)
dfTeam = dfTeam.withColumn(
    "abbrev", F.upper(dfTeam.abbrev)
).withColumn(
    "city", F.regexp_replace("city", r"\s+$", "")
).withColumn(
    "description", F.regexp_replace("description", r"\)$", "")
).withColumn(
    "desc_split", F.split("description", r" \(")
)

dfTeam = dfTeam.withColumn(
    "name", dfTeam.desc_split.getItem(0).cast(Str())
).withColumn(
    "time", dfTeam.desc_split.getItem(1).cast(Str())
).drop(
    "desc_split"
).drop(
    "description"
)

dfTeam.show(10, truncate=False)

# eyeball if everything okay
print([r[0] for r in dfTeam.select("time").collect()])
print([r[0] for r in dfTeam.select("city").collect()])

+-------------+------+--------------------------------------------------------+
|city         |abbrev|description                                             |
+-------------+------+--------------------------------------------------------+
|Atlanta      |Atl   |Atlanta Hawks (1968-69 - present)                       |
|Anderson     |And   |Anderson Packers (1949-50)                              |
|Bal Bullets  |BlB   |Baltimore Bullets [1st team] (1947-48 - 1953-54)        |
|Baltimore    |Bal   |Baltimore Bullets [2nd team] (1963-64 - 1972-73)        |
|Boston       |Bos   |Boston Celtics (1946-47 - present)                      |
|Brooklyn     |Bkn   |Brooklyn Nets (2012-13 - present)                       |
|Buffalo      |Buf   |Buffalo Braves (1970-71 - 1977-78)                      |
|Capital      |Cap   |Capital Bullets (1973-74)                               |
|Charlotte    |Cha   |Charlotte Hornets (1988-89 - 2001-02, 2014-15 - present)|
|Cha Bobcats  |ChB   |Charlotte Bobcats 
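
To make the name/time split easier to reason about, here is a small plain-Python sketch of the same idea. The helper `split_description` and its regex are assumptions of mine, not the notebook's code. The first three inputs are descriptions printed above; the Knicks entry uses a name that appears later in this notebook with years assumed for illustration.

```python
import re

# name = everything before the first " (", time = what sits inside the outer parentheses
DESC_RE = re.compile(r"^(?P<name>.*?) \((?P<time>.*)\)$")


def split_description(description):
    """Split 'Atlanta Hawks (1968-69 - present)' into (name, time)."""
    text = description.strip()
    match = DESC_RE.match(text)
    if match is None:
        # No parenthesised time span: keep the text as the name
        return text, None
    return match.group("name"), match.group("time")


examples = [
    "Atlanta Hawks (1968-69 - present)",
    "Baltimore Bullets [1st team] (1947-48 - 1953-54)",
    "Charlotte Hornets (1988-89 - 2001-02, 2014-15 - present)",
    # assumed years; shows that a "(" without a leading space stays in the name
    "New York Knick(erbocker)s (1946-47 - present)",
]
parsed = [split_description(d) for d in examples]
for name, time_span in parsed:
    print(name, "|", time_span)
```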

### Find out the mismatched abbrevs and names

In [42]:
# team name and team abbrev, mapping player stats to team stats
abbre0 = [r[0] for r in dfTeam.select(dfTeam.abbrev).collect()]
names0 = [r[0] for r in dfTeam.select(dfTeam.name).collect()]
print(
    len(abbre0) == len(set(abbre0)), 
    len(names0) == len(set(names0)),
)
abbre0 = set(abbre0)
names0 = set(names0)
abbre1 = {r[0] for r in dfPlayerStats.select(dfPlayerStats.team).dropDuplicates().collect()}
names1 = {r[0] for r in dfTeamStats.select(dfTeamStats.team).dropDuplicates().collect()}
print()
print("abbre0\n", abbre0)
print("abbre1\n", abbre1)
print("names0\n", names0)
print("names1\n", names1)
print()
print("set(abbre0) - set(abbre1)\n", set(abbre0) - set(abbre1))
print("set(abbre1) - set(abbre0)\n", set(abbre1) - set(abbre0))
print("set(names0) - set(names1)\n", set(names0) - set(names1))
print("set(names1) - set(names0)\n", set(names1) - set(names0))

True True

abbre0
 {'MLH', 'CHZ', 'HOU', 'BAL', 'BUF', 'ATL', 'CHS', 'MEM', 'POR', 'MIN', 'NY ', 'AND', 'ROC', 'OKC', 'CHI', 'GS ', 'ORL', 'LAL', 'BKN', 'SAC', 'CHB', 'SF ', 'PIT', 'NYN', 'TRI', 'MIA', 'SHE', 'UTA', 'DTF', 'WSH', 'PHO', 'PHI', 'KCO', 'WSC', 'DEN', 'VAN', 'WAT', 'LAC', 'FTW', 'CAP', 'DAL', 'BLB', 'BOS', 'NOJ', 'STL', 'TOR', 'PHW', 'SDR', 'NO ', 'SEA', 'WSB', 'IND', 'DET', 'NJ ', 'CHP', 'NOK', 'SA ', 'SYR', 'MNL', 'SD ', 'KC ', 'NOH', 'CIN', 'CLR', 'DNN', 'MIL', 'PRO', 'CHA', 'CLE', 'INJ', 'TRH', 'INO', 'STB'}
abbre1
 {'MLH', 'CHZ', 'HOU', 'BUF', 'BAL', 'CHS', 'POR', 'MEM', 'TRI', None, 'TOR', 'AND', 'MIN', 'OKC', 'ATL', 'CHI', 'ORL', 'LAL', 'STB', 'SAC', 'NOP', 'TOT', 'NYN', 'UTA', 'BRK', 'MIA', 'PHO', 'ROC', 'WSC', 'KCO', 'CHO', 'PHI', 'VAN', 'DEN', 'WAT', 'LAC', 'FTW', 'BLB', 'SDC', 'BOS', 'GSW', 'NOJ', 'DAL', 'STL', 'PHW', 'CAP', 'SDR', 'SEA', 'KCK', 'WSB', 'IND', 'DET', 'CHP', 'NOK', 'SAS', 'SYR', 'WAS', 'MNL', 'SHE', 'NOH', 'NJN', 'CIN', 'MIL', 'DNN', 'SFW', 'CHA',

### Manually establish mapping between abbrevs and names

* Entries that need to change are colored red

<img src ='team-name-and-abbrev-need-manual-change.png'>

* Abbrevs

```
team  : player_stat
'BKN' : 'BRK'
            change team; BKN seems better, but it is easier to change team (BKN => BRK)
'CHB' : 'CHO'
            change team
            CHH Charlotte Hornets [1st team] (1988-89 - 2001-02)
            CHA Charlotte Bobcats (2004-05 - 2013-14)
            CHO Charlotte Hornets [2nd team] (2014-15 - present)
'CLR' : 
            Cleveland Rebels (1946-47), player_stats starts from 1950, no action
'DTF' : ?
            Detroit Falcons (1946-47), too early no action
'GS ' : 'GSW'
            change team
'INJ' : ?
            Indianapolis Jets (1948-49), too early no action
'KC ' : 'KCK'
            change team
'NJ ' : 'NJN'
            change team
'NO ' : 'NOP'
            change team
'NY ' : 'NYK'
            change team
'PIT' : ?
            Pittsburgh Ironmen (1946-47), too early no action
'PRO' : ?
            Providence Steamrollers (1946-47 - 1948-49), too early no action
'SA ' : 'SAS'
            change team
'SD ' : 'SDC'
            change team
'SF ' : 'SFW'
            change team
'TRH'
'WSH' : 'WAS'
            change team

?    : None
?    : 'TOT' => the player played for multiple teams that season
             => TOT stands for "total": the row holds the combined stats across those teams
```

* Names

```
team => team_stats
'Anderson Packers'              :
'Baltimore Bullets [1st team]'  : 'Baltimore Bullets'
'Baltimore Bullets [2nd team]'  : 'Baltimore Bullets'       <need to map it based on time?>
'Chicago Stags'                 :
'Cleveland Rebels'              :
?                               : 'Dallas Chaparrals'       <ABA team, should clean all ABA team records?>
    
'Denver Nuggets [1st team]'     : 'Denver Nuggets'
'Denver Nuggets [2nd team]'     : 'Denver Nuggets'          <need to map it based on time?>
?                               : 'Denver Rockets'          <ABA team>
'Detroit Falcons'               :
'Ft Wayne Pistons'              : 'Fort Wayne Pistons'      <change>
'Indianapolis Jets'             :
'Indianapolis Olympians'        :
'KC-Omaha Kings'                : 'Kansas City-Omaha Kings' <change>
?                               : 'New Jersey Americans'    <ABA team>
'New York Knick(erbocker)s'     : 'New York Knicks'         <change>
'Oklahoma City'                 : 'Oklahoma City Thunder'   <change>
'Pittsburgh Ironmen'            :
'Portland TrailBlazers'         : 'Portland Trail Blazers'  <change>
'Providence Steamrollers'       :
'Sheboygan Redskins'            :
'St Louis Bombers'              :                           <change too, all St to St.>
'St Louis Hawks'                : 'St. Louis Hawks'         <change>
'Toronto Huskies'               :
?                               : 'Texas Chaparrals'        <ABA team>
'Washington Capitols'           :
'Waterloo Hawks'                :
```
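
The notes above were applied by hand-editing the abbreviation file. As an alternative sketch, the unambiguous renames could be kept in code as lookup tables. The dictionary and function names here are hypothetical; the entries are copied from the two lists above, and the ambiguous `CHB` case is left out on purpose.

```python
# team file abbrev -> player stats abbrev (note the trailing spaces in the source values)
ABBREV_FIXES = {
    "BKN": "BRK",
    "GS ": "GSW",
    "KC ": "KCK",
    "NJ ": "NJN",
    "NO ": "NOP",
    "NY ": "NYK",
    "SA ": "SAS",
    "SD ": "SDC",
    "SF ": "SFW",
    "WSH": "WAS",
}

# team file name -> team stats name
NAME_FIXES = {
    "Ft Wayne Pistons": "Fort Wayne Pistons",
    "KC-Omaha Kings": "Kansas City-Omaha Kings",
    "New York Knick(erbocker)s": "New York Knicks",
    "Oklahoma City": "Oklahoma City Thunder",
    "Portland TrailBlazers": "Portland Trail Blazers",
    "St Louis Bombers": "St. Louis Bombers",
    "St Louis Hawks": "St. Louis Hawks",
}


def fix_abbrev(abbrev):
    """Return the player-stats spelling of an abbreviation, or the input unchanged."""
    return ABBREV_FIXES.get(abbrev, abbrev)


def fix_name(name):
    """Return the team-stats spelling of a team name, or the input unchanged."""
    return NAME_FIXES.get(name, name)


print(fix_abbrev("GS "), fix_abbrev("BOS"))
print(fix_name("Portland TrailBlazers"), "|", fix_name("Boston Celtics"))
```

Keeping the renames in code makes them reviewable and repeatable, whereas a hand-edited file has to be diffed against the original to see what changed.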

#### Filter out all non-NBA team records

In [43]:
leagues = {r[0] for r in dfTeamStats.select(dfTeamStats.league).dropDuplicates().collect()}
print(leagues)
print(dfTeamStats.count())
dfTeamStats = dfTeamStats.filter(dfTeamStats.league == "NBA")
print(dfTeamStats.count())

{'ABA', 'NBA', 'BAA'}
1483
1435


#### Manually made the corresponding changes in team-abbrevation-v2.txt and uploaded it to S3

```
...
Baltimore    	BLB	Baltimore Bullets [1st team] (1947-48 - 1953-54)
Baltimore    	BAL	Baltimore Bullets [2nd team] (1963-64 - 1972-73)
Boston       	BOS	Boston Celtics (1946-47 - present)
Brooklyn     	BRK	Brooklyn Nets (2012-13 - present)
Buffalo      	BUF	Buffalo Braves (1970-71 - 1977-78)
Capital      	CAP	Capital Bullets (1973-74)
Charlotte    	CHH	Charlotte Hornets [1st team] (1988-89 - 2001-02)
Charlotte    	CHA	Charlotte Bobcats (2004-05 - 2013-14)
Charlotte    	CHO	Charlotte Hornets [2nd team] (2014-15 - present)
Chicago      	CHI	Chicago Bulls (1966-67 - present)
Chicago      	CHP	Chicago Packers (1961-62)
Chicago      	CHS	Chicago Stags (1946-47 - 1949-50)
Chicago      	CHZ	Chicago Zephyrs (1962-63)
Cincinnati   	CIN	Cincinnati Royals (1957-58 - 1971-72)
Cleveland    	CLE	Cleveland Cavaliers (1970-71 - present)
...
```

In [44]:
dfTeam = spark.read.option("header", "true") \
    .option("delimiter", "\t") \
    .csv("s3a://udacity-data-eng-capstone/team-abbrevation-v2.txt", schema=teamSchema)

# turn abbrev upper case Atl => ATL
# trim trailing spaces in city column
# split description to name (str) and time (str)
dfTeam = dfTeam.withColumn(
    "abbrev", F.upper(dfTeam.abbrev)
).withColumn(
    "city", F.regexp_replace("city", r"\s+$", "")
).withColumn(
    "description", F.regexp_replace("description", r"\)$", "")
).withColumn(
    "desc_split", F.split("description", r" \(")
)

dfTeam = dfTeam.withColumn(
    "name", dfTeam.desc_split.getItem(0).cast(Str())
).withColumn(
    "time", dfTeam.desc_split.getItem(1).cast(Str())
).drop(
    "desc_split"
).drop(
    "description"
)

# eyeball if everything okay
print([r[0] for r in dfTeam.select("time").collect()])
print([r[0] for r in dfTeam.select("city").collect()])

['1968-69 - present', '1949-50', '1947-48 - 1953-54', '1963-64 - 1972-73', '1946-47 - present', '2012-13 - present', '1970-71 - 1977-78', '1973-74', '1988-89 - 2001-02', '2004-05 - 2013-14', '2014-15 - present', '1966-67 - present', '1961-62', '1946-47 - 1949-50', '1962-63', '1957-58 - 1971-72', '1970-71 - present', '1946-47', '1980-81 - present', '1949-50', '1976-77 - present', '1946-47', '1957-58 - present', '1948-49 - 1956-57', '1971-72 - present', '1971-72 - present', '1976-77 - present', '1948-49', '1949-50 - 1952-53', '1975-76 - 1984-85', '1972-73 - 1974-75', '1984-85 - present', '1960-61 - present', '2001-02 - present', '1988-89 - present', '1951-52 - 1954-55', '1968-69 - present', '1948-49 - 1959-60', '1989-90 - present', '1977-78 - 2011-12', '2002-03 - 2004-05, 2007-08 - 2012-13', '1974-75 - 1978-79', '2013-14 - present', '2005-06 - 2006-07', '1946-47 - present', '1976-77', '2008-09 - present', '1989-90 - present', '1946-47 - 1961-62', '1963-64 - present', '1968-69 - present',

In [45]:
# team name and team abbrev, mapping player stats to team stats
abbre0 = [r[0] for r in dfTeam.select(dfTeam.abbrev).collect()]
names0 = [r[0] for r in dfTeam.select(dfTeam.name).collect()]
print(
    len(abbre0) == len(set(abbre0)), 
    len(names0) == len(set(names0)),
)
abbre0 = set(abbre0)
names0 = set(names0)
abbre1 = {r[0] for r in dfPlayerStats.select(dfPlayerStats.team).dropDuplicates().collect()}
names1 = {r[0] for r in dfTeamStats.select(dfTeamStats.team).dropDuplicates().collect()}
print()
print("set(abbre0) - set(abbre1)\n", set(abbre0) - set(abbre1))
print("set(abbre1) - set(abbre0)\n", set(abbre1) - set(abbre0))
print("set(names0) - set(names1)\n", set(names0) - set(names1))
print("set(names1) - set(names0)\n", set(names1) - set(names0))

True True

set(abbre0) - set(abbre1)
 {'PIT', 'DTF', 'CLR', 'PRO', 'INJ', 'TRH'}
set(abbre1) - set(abbre0)
 {'TOT', None}
set(names0) - set(names1)
 {'Indianapolis Jets', 'Charlotte Hornets [1st team]', 'Waterloo Hawks', 'Toronto Huskies', 'Detroit Falcons', 'Baltimore Bullets [1st team]', 'Charlotte Hornets [2nd team]', 'Denver Nuggets [2nd team]', 'Chicago Stags', 'Baltimore Bullets [2nd team]', 'Washington Capitols', 'Cleveland Rebels', 'Denver Nuggets [1st team]', 'Pittsburgh Ironmen', 'Indianapolis Olympians', 'Providence Steamrollers', 'Sheboygan Redskins', 'Anderson Packers', 'St. Louis Bombers'}
set(names1) - set(names0)
 {'Denver Nuggets', 'Charlotte Hornets', 'Baltimore Bullets'}
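
The same set comparison is run before and after the manual fix, so it may be worth a small helper. This is only a sketch: `diff_keys` and its `ignore` parameter are hypothetical names, and the sample lists are a tiny subset chosen for illustration.

```python
def diff_keys(left, right, ignore=()):
    """Return (only_in_left, only_in_right) as sorted lists, skipping ignored keys."""
    left_set = set(left) - set(ignore)
    right_set = set(right) - set(ignore)
    # key=str keeps sorting safe even if a None slips through
    return (
        sorted(left_set - right_set, key=str),
        sorted(right_set - left_set, key=str),
    )


team_abbrevs = ["BOS", "BKN", "GS "]
player_abbrevs = ["BOS", "BRK", "GSW", "TOT", None]

# 'TOT' and None are expected in player stats and have no team counterpart
only_team, only_player = diff_keys(team_abbrevs, player_abbrevs, ignore={"TOT", None})
print(only_team)
print(only_player)
```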


In [46]:
dfTeam.where(dfTeam.name.like("% team]")).show(truncate=False)

+---------+------+----------------------------+-----------------+
|city     |abbrev|name                        |time             |
+---------+------+----------------------------+-----------------+
|Baltimore|BLB   |Baltimore Bullets [1st team]|1947-48 - 1953-54|
|Baltimore|BAL   |Baltimore Bullets [2nd team]|1963-64 - 1972-73|
|Charlotte|CHH   |Charlotte Hornets [1st team]|1988-89 - 2001-02|
|Charlotte|CHO   |Charlotte Hornets [2nd team]|2014-15 - present|
|Denver   |DNN   |Denver Nuggets [1st team]   |1949-50          |
|Denver   |DEN   |Denver Nuggets [2nd team]   |1976-77 - present|
+---------+------+----------------------------+-----------------+



#### Some teams show up multiple times in NBA history. We need to map each team name to a specific version so that we can find the correct abbreviation
|city         |abbrev|name                        |time             |
|-------------|------|----------------------------|-----------------|
|Bal Bullets  |BLB   |Baltimore Bullets [1st team]|1947-48 - 1953-54|
|Baltimore    |BAL   |Baltimore Bullets [2nd team]|1963-64 - 1972-73|
|Charlotte    |CHH   |Charlotte Hornets [1st team]|1988-89 - 2001-02|
|Charlotte    |CHO   |Charlotte Hornets [2nd team]|2014-15 - present|
|Den Nuggets  |DNN   |Denver Nuggets [1st team]   |1949-50          |
|Denver       |DEN   |Denver Nuggets [2nd team]   |1976-77 - present|

In [47]:
def extract_end_year():
    def t(col):
        return int(col[:4])+1 # start year + 1 also handles century rollovers, e.g. 1999-00 => 2000
    return udf(t, Int())

dfTeamStats = dfTeamStats.withColumn("season_end_year", extract_end_year()("season"))
dfTeamStats.where(dfTeamStats.team.isin({'Baltimore Bullets', 'Denver Nuggets', 'Charlotte Hornets'})).show(5)

def add_team_version(): # for 'Baltimore Bullets', 'Denver Nuggets', 'Charlotte Hornets'
    def t(team, year):
        if team == 'Baltimore Bullets':
            if 1948 <= year <= 1954:
                return "Baltimore Bullets [1st team]"
            if 1964 <= year <= 1973:
                return "Baltimore Bullets [2nd team]"
        if team == 'Charlotte Hornets':
            if 1989 <= year <= 2002:
                return "Charlotte Hornets [1st team]"
            if 2015 <= year <= 2018:
                return "Charlotte Hornets [2nd team]"
        if team == 'Denver Nuggets':
            if year == 1950:
                return "Denver Nuggets [1st team]"
            if 1977 <= year <= 2018:
                return "Denver Nuggets [2nd team]"
        return team
    return udf(t, Str())

dfTeamStats = dfTeamStats.withColumn("team", add_team_version()("team", "season_end_year"))

+-------+------+--------------+----+------+-----+------+-----+----+--------+-----+--------+-----+--------+--------+--------------------+---------------+-------------+---------------+
| season|league|          team|wins|losses|wl_pc|finish|  srs|pace|rel_pace| ortg|rel_ortg| drtg|rel_drtg|playoffs|             coaches|         top_ws|playoff_score|season_end_year|
+-------+------+--------------+----+------+-----+------+-----+----+--------+-----+--------+-----+--------+--------+--------------------+---------------+-------------+---------------+
|2017-18|   NBA|Denver Nuggets|  19|    16|0.543|     3| 1.53|96.3|    -0.8|110.5|     2.7|108.9|     1.1|    null|   M. Malone (19-16)| N. Jokic (3.8)|            0|           2018|
|2016-17|   NBA|Denver Nuggets|  40|    42|0.488|     4|  0.7|98.3|     1.9|113.2|     4.4|112.7|     3.9|    null|   M. Malone (40-42)| N. Jokic (9.7)|            0|           2017|
|2015-16|   NBA|Denver Nuggets|  33|    49|0.402|     4|-2.81|95.7|    -0.1|105.6|   

In [48]:
# check data after transformation
dfTeamStats.select("team", "season_end_year").where(
    dfTeamStats.team.like('Baltimore Bullets%') | dfTeamStats.team.like('Denver Nuggets%') | dfTeamStats.team.like('Charlotte Hornets%')
).orderBy(
    "team", "season_end_year"
).dropDuplicates().show(10, truncate=False)

dfTeamStats.select("team", "season_end_year").show(10)

dfTeamStats.select("team", "season_end_year").where("team is null").count()

dfTeamStats.select("team").where(
    dfTeamStats.team.like('Baltimore Bullets%') | dfTeamStats.team.like('Denver Nuggets%') | dfTeamStats.team.like('Charlotte Hornets%')
).dropDuplicates().show(10, truncate=False)

+----------------------------+---------------+
|team                        |season_end_year|
+----------------------------+---------------+
|Baltimore Bullets [2nd team]|1964           |
|Baltimore Bullets [2nd team]|1965           |
|Baltimore Bullets [2nd team]|1966           |
|Baltimore Bullets [2nd team]|1967           |
|Baltimore Bullets [2nd team]|1968           |
|Baltimore Bullets [2nd team]|1969           |
|Baltimore Bullets [2nd team]|1970           |
|Baltimore Bullets [2nd team]|1971           |
|Baltimore Bullets [2nd team]|1972           |
|Baltimore Bullets [2nd team]|1973           |
+----------------------------+---------------+
only showing top 10 rows

+--------------+---------------+
|          team|season_end_year|
+--------------+---------------+
|Boston Celtics|           2018|
|Boston Celtics|           2017|
|Boston Celtics|           2016|
|Boston Celtics|           2015|
|Boston Celtics|           2014|
|Boston Celtics|           2013|
|Boston Celtics|   
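
The chain of `if` statements in `add_team_version` can also be written as a lookup table, which is easier to extend when another franchise name is reused. The sketch below is plain Python with the year ranges copied from the cell above. Two things are my assumptions: an end year of `None` means "still active" (the notebook hard-codes 2018), and `season_end_year` returns `None` for a missing season instead of raising.

```python
def season_end_year(season):
    """'1999-00' -> 2000; None or '' -> None."""
    if not season:
        return None
    return int(season[:4]) + 1


# team name -> list of (first_end_year, last_end_year or None, version label)
TEAM_VERSIONS = {
    "Baltimore Bullets": [(1948, 1954, "[1st team]"), (1964, 1973, "[2nd team]")],
    "Charlotte Hornets": [(1989, 2002, "[1st team]"), (2015, None, "[2nd team]")],
    "Denver Nuggets": [(1950, 1950, "[1st team]"), (1977, None, "[2nd team]")],
}


def add_team_version(team, year):
    """Append the version label when the team/year falls in a known range."""
    for first, last, label in TEAM_VERSIONS.get(team, []):
        if year is not None and first <= year and (last is None or year <= last):
            return f"{team} {label}"
    # Unknown team, missing year, or a year outside every range: leave the name as-is
    return team


print(add_team_version("Denver Nuggets", season_end_year("1949-50")))
print(add_team_version("Denver Nuggets", season_end_year("2017-18")))
print(add_team_version("Boston Celtics", season_end_year("2017-18")))
```

Both functions could be registered as UDFs in the same way as the originals; for `season_end_year`, a native column expression would likely avoid the UDF overhead altogether.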

## Step 3: Define the Data Model
### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### 3.3 Create the Redshift cluster

In [None]:
import pandas as pd
import boto3
import json

In [None]:
import configparser
config = configparser.ConfigParser()
config.read_file(open('dwh.cfg'))

KEY                    = config.get('AWS','KEY')
SECRET                 = config.get('AWS','SECRET')

DWH_CLUSTER_TYPE       = config.get("DWH","DWH_CLUSTER_TYPE")
DWH_NUM_NODES          = config.get("DWH","DWH_NUM_NODES")
DWH_NODE_TYPE          = config.get("DWH","DWH_NODE_TYPE")

DWH_CLUSTER_IDENTIFIER = config.get("DWH","DWH_CLUSTER_IDENTIFIER")
DWH_DB                 = config.get("DWH","DWH_DB")
DWH_DB_USER            = config.get("DWH","DWH_DB_USER")
DWH_DB_PASSWORD        = config.get("DWH","DWH_DB_PASSWORD")
DWH_PORT               = config.get("DWH","DWH_PORT")

DWH_IAM_ROLE_NAME      = config.get("DWH", "DWH_IAM_ROLE_NAME")

(DWH_DB_USER, DWH_DB_PASSWORD, DWH_DB)

pd.DataFrame({"Param": ["DWH_CLUSTER_TYPE", "DWH_NUM_NODES", "DWH_NODE_TYPE", "DWH_CLUSTER_IDENTIFIER", 
                        "DWH_DB", "DWH_DB_USER", "DWH_DB_PASSWORD", "DWH_PORT", "DWH_IAM_ROLE_NAME"],
              "Value": [ DWH_CLUSTER_TYPE ,  DWH_NUM_NODES ,  DWH_NODE_TYPE ,  DWH_CLUSTER_IDENTIFIER , 
                         DWH_DB ,  DWH_DB_USER ,  DWH_DB_PASSWORD ,  DWH_PORT ,  DWH_IAM_ROLE_NAME ]
             })

In [None]:
args = {
    "region_name": "us-west-2",
    "aws_access_key_id": KEY,
    "aws_secret_access_key": SECRET
}

ec2 = boto3.resource('ec2', **args)
s3 = boto3.resource('s3', **args)
iam = boto3.client('iam', **args)
redshift = boto3.client('redshift', **args)

In [None]:
s3bucket =  s3.Bucket("udacity-data-eng-capstone-parquet") # private

s3_data = iter(s3bucket.objects.filter(Prefix="dimBirthTime/"))
for _ in range(5): print(next(s3_data))

In [None]:
try:
    print('1.1 Creating a new IAM Role')
    dwhRole = iam.create_role(
        Path='/',
        RoleName=DWH_IAM_ROLE_NAME,
        Description = "Allows Redshift clusters to call AWS services on your behalf.",
        AssumeRolePolicyDocument=json.dumps(
            {'Statement': [{'Action': 'sts:AssumeRole',
               'Effect': 'Allow',
               'Principal': {'Service': 'redshift.amazonaws.com'}}],
             'Version': '2012-10-17'})
    )    

except Exception as e:
    print(e)

In [None]:
print('1.2 Attaching Policy')
iam.attach_role_policy(RoleName=DWH_IAM_ROLE_NAME,
                       PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
                      )['ResponseMetadata']['HTTPStatusCode']

In [None]:
print('1.3 Get the IAM role ARN')
roleArn = iam.get_role(RoleName=DWH_IAM_ROLE_NAME)['Role']['Arn']
print(roleArn)

In [None]:
try:
    response = redshift.create_cluster(        
        # parameters for hardware
        ClusterType=DWH_CLUSTER_TYPE,
        NodeType=DWH_NODE_TYPE,
        NumberOfNodes=int(DWH_NUM_NODES),

        # parameters for identifiers & credentials
        DBName=DWH_DB,
        ClusterIdentifier=DWH_CLUSTER_IDENTIFIER,
        MasterUsername=DWH_DB_USER,
        MasterUserPassword=DWH_DB_PASSWORD,
        
        # parameter for role (to allow s3 access)
        IamRoles=[roleArn]
    )
except Exception as e:
    print(e)

In [None]:
def prettyRedshiftProps(props):
    pd.set_option('display.max_colwidth', None)  # None = no truncation; -1 is deprecated since pandas 1.0
    keysToShow = ["ClusterIdentifier", "NodeType", "ClusterStatus", "MasterUsername", "DBName", "Endpoint", "NumberOfNodes", 'VpcId']
    x = [(k, v) for k,v in props.items() if k in keysToShow]
    return pd.DataFrame(data=x, columns=["Key", "Value"])

In [None]:
# wait until the cluster status is "available"
myClusterProps = redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)['Clusters'][0]
prettyRedshiftProps(myClusterProps)
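
Rather than re-running the cell until the status changes, the wait can be scripted. The sketch below is a generic polling helper; `wait_for_status` and its parameters are hypothetical names. It takes any zero-argument callable that returns the current status, so it can be tried here with a fake status source instead of a real cluster.

```python
import time


def wait_for_status(get_status, target="available", interval=15.0,
                    timeout=1800.0, sleep=time.sleep):
    """Poll get_status() until it returns target; raise TimeoutError after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == target:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status is still {status!r} after {timeout} seconds")
        sleep(interval)


# Fake status source standing in for the cluster; no real sleeping in the demo
fake_statuses = iter(["creating", "creating", "available"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)

# Against the real cluster, the callable would be something like:
# lambda: redshift.describe_clusters(
#     ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)["Clusters"][0]["ClusterStatus"]
```

boto3 also ships a built-in waiter for this case, `redshift.get_waiter("cluster_available")`, which may be simpler when no custom timeout handling is needed.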

### Print the endpoint and role ARN and copy them to dwh.cfg; erase them before submitting or pushing to GitHub

In [None]:
DWH_ENDPOINT = myClusterProps['Endpoint']['Address']
DWH_ROLE_ARN = myClusterProps['IamRoles'][0]['IamRoleArn']
print("DWH_ENDPOINT :: ", DWH_ENDPOINT)
print("DWH_ROLE_ARN :: ", DWH_ROLE_ARN)

In [None]:
%load_ext sql
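
Once the extension is loaded, the `%sql` magic expects a connection URL. The notebook does not show that step, so the following is only a sketch of how such a URL could be assembled. The helper name, the host, and the credentials are made up; the password is percent-encoded so that characters such as `@` or `/` do not break the URL.

```python
from urllib.parse import quote


def redshift_conn_string(user, password, host, port, db):
    """Build a postgresql:// URL with percent-encoded credentials."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, port, db
    )


# Hypothetical values; the real ones would come from dwh.cfg and DWH_ENDPOINT
conn_string = redshift_conn_string(
    user="dwhuser",
    password="kx@jj5/g",
    host="example-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439,
    db="dwh",
)
print(conn_string)
```

Avoid printing the real connection string in a notebook that will be committed, since it contains the password.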

## Step 4: Run Pipelines to Model the Data 
### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
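
Redshift treats primary key and unique constraints as informational and does not enforce them, so uniqueness has to be verified with a query. Below is a minimal sketch of count, uniqueness, and null checks written against a DB-API cursor. It uses an in-memory SQLite database as a stand-in so that it runs anywhere; the table `dim_team` and its columns are hypothetical, and the function names are my own.

```python
import sqlite3


def check_row_count(cur, table, minimum=1):
    """Completeness check: the table holds at least `minimum` rows."""
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    count = cur.fetchone()[0]
    return count >= minimum, count


def check_unique_key(cur, table, key):
    """Integrity check: no value of `key` appears more than once."""
    cur.execute(
        f"SELECT COUNT(*) FROM "
        f"(SELECT {key} FROM {table} GROUP BY {key} HAVING COUNT(*) > 1) AS dup"
    )
    duplicates = cur.fetchone()[0]
    return duplicates == 0, duplicates


def check_no_nulls(cur, table, column):
    """Integrity check: `column` contains no NULL values."""
    cur.execute(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL")
    nulls = cur.fetchone()[0]
    return nulls == 0, nulls


# Stand-in database; table and column names are trusted constants, not user input
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dim_team (abbrev TEXT, name TEXT)")
cur.executemany(
    "INSERT INTO dim_team VALUES (?, ?)",
    [("BOS", "Boston Celtics"), ("BRK", "Brooklyn Nets")],
)

count_ok = check_row_count(cur, "dim_team")
unique_ok = check_unique_key(cur, "dim_team", "abbrev")
nulls_ok = check_no_nulls(cur, "dim_team", "name")

# A duplicate key makes the uniqueness check fail
cur.execute("INSERT INTO dim_team VALUES ('BRK', 'Brooklyn Nets')")
unique_after_dup = check_unique_key(cur, "dim_team", "abbrev")
print(count_ok, unique_ok, nulls_ok, unique_after_dup)
```

Each check returns a pass/fail flag together with the offending count, so a failing pipeline run can report how far off it was.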

### 4.3 Data dictionary
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

## Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.

In [None]:
redshift.delete_cluster(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER, SkipFinalClusterSnapshot=True)
iam.detach_role_policy(RoleName=DWH_IAM_ROLE_NAME, PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess")
iam.delete_role(RoleName=DWH_IAM_ROLE_NAME)