# HW 1: Review and Intro

This homework serves as a simple review of Data 8.  There are 8 parts, with the last one consisting of questions about you and your interest in the course.

#### Setup

In [None]:
%run ../../utils/notebook_setup.py

In [None]:
import numpy as np
from datascience_stats import linear_fit

from datascience import Table, make_array

#### 1. Data load
Load the file `NBAPlayerStats2017.csv` into a table.  Call the table `players`. Display the first five rows.

In [None]:
players = Table().read_table('NBAPlayerStats2017.csv')

In [None]:
players.take[:5]

#### 2. Quick Clean
The column `Player` contains strings that contain the player name as well as a unique ID for the player.  We should split these up.

Loop through the entries in the column `Player` and split each string on the backslash character `\`.  Overwrite the column `Player` with the player name and add a new column `PlayerID`.

Some players switched teams midseason.  Quincy Acy is an example.  His teams are listed as `DAL`, `BRK`, and `TOT`, where `TOT` refers to his total statistics.  For the purposes of the rest of the notebook, remove all the rows with `TOT` as the team.

In [None]:
player_names = []
player_ids = []
for player in players['Player']:
    player_name, player_id = player.split('\\')
    player_names.append(player_name)
    player_ids.append(player_id)
    
players = players.with_columns(['Player', player_names, 'PlayerID', player_ids]).\
    move_to_start('PlayerID').\
    move_to_start('Rk')
players.take[:5]

In [None]:
players = players.where(players['Tm'] != 'TOT')
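
The same split-and-filter logic can be sketched in plain Python, independent of the `datascience` table. The rows below are hypothetical stand-ins (the ID strings and the second player are made up for illustration); only the `Name\ID` layout and the `TOT` convention come from the description above.

```python
# Hypothetical rows standing in for the real table: (raw Player string, team).
# The ID values and "Example Player" are made up for illustration.
rows = [
    ("Quincy Acy\\player001", "TOT"),
    ("Quincy Acy\\player001", "DAL"),
    ("Quincy Acy\\player001", "BRK"),
    ("Example Player\\player002", "GSW"),
]

cleaned = []
for raw, team in rows:
    # A single backslash is written '\\' inside a Python string literal.
    name, player_id = raw.split("\\")
    if team != "TOT":  # drop the season-total rows
        cleaned.append((name, player_id, team))

print(cleaned)
```

Dropping the `TOT` rows matters for the later team sums: a traded player's totals would otherwise be credited to a team called `TOT` in addition to the per-team rows.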

#### 3. Group
Select the column `Tm` and the counting stats `FG`, `FGA`, `3P`, `3PA`, `2P`, `2PA`, `FT`, `FTA`, `ORB`, `DRB`, `TRB`, `AST`, `STL`, `BLK`, `TOV`, `PF`, `PTS`.  Then group by `Tm` and compute a sum and store the result as `teams`.

For each of the counting stats, the new column in `teams` will have `' sum'` appended.  Relabel each of the stat columns to remove that appended string.

In [None]:
stats = ['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 
         'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']
teams = players.select('Tm', *stats).group(['Tm'], sum)

for stat in stats:
    teams.relabel(stat + ' sum', stat)
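
As a rough illustration of what the group-and-sum step does, here is a minimal plain-Python sketch. The rows and team codes are made up, and only two stats are tracked; it is not a replacement for `Table.group`.

```python
from collections import defaultdict

# Made-up per-player rows: (team, points, assists). Not real data.
player_rows = [
    ("AAA", 10, 2),
    ("BBB", 7, 5),
    ("AAA", 4, 1),
    ("BBB", 3, 0),
]

# Accumulate one running total per team, as a group-by sum would.
totals = defaultdict(lambda: {"PTS": 0, "AST": 0})
for team, pts, ast in player_rows:
    totals[team]["PTS"] += pts
    totals[team]["AST"] += ast

# What the relabel loop does to column names: strip the trailing ' sum'.
labels = ["PTS sum", "AST sum"]
clean_labels = [label[: -len(" sum")] for label in labels]

print(dict(totals), clean_labels)
```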

#### 4. Sort
Sort teams by most 3-point attempts (`3PA`) from most to least and take the first 10 rows after sorting.  Display the result.

In [None]:
n = 10
most_3PAs = teams.sort('3PA', descending=True).take[:n]
most_3PAs

Reorder this top ten in `3PA` according to 2-point attempts from most to least.

In [None]:
most_3PAs.sort('2PA', descending=True)
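
Note that re-sorting the top ten only reorders those ten rows; it does not bring in teams that missed the `3PA` cut. A small sketch with made-up totals (hypothetical team codes and numbers) shows the effect:

```python
# Made-up team totals: (team, 3PA, 2PA). Not real data.
team_rows = [
    ("AAA", 900, 1500),
    ("BBB", 1200, 1100),
    ("CCC", 1000, 1700),
    ("DDD", 700, 1900),
]

n = 2
# Keep the n teams with the most 3-point attempts...
top_3pa = sorted(team_rows, key=lambda r: r[1], reverse=True)[:n]
# ...then reorder only those teams by 2-point attempts.
by_2pa = sorted(top_3pa, key=lambda r: r[2], reverse=True)

print([r[0] for r in top_3pa], [r[0] for r in by_2pa])
```

Here `DDD` has the most 2-point attempts overall but never appears in `by_2pa`, because it was already excluded by the `3PA` cut.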

Add a new column to `teams` called effective field goal percentage, `eFG%`:
$$
    \mathit{eFG\%} = \frac{\mathit{FG} + .5\cdot\mathit{3P}}{\mathit{FGA}}
$$
This is a nice way to account for the different values of shots: a made three-pointer counts 1.5 times as much as a made two-pointer, since $3 / 2 = 1.5$.

After adding the new column to `teams`, regenerate the table of the top ten teams in `3PA` and then sort on `eFG%` from highest to lowest.


In [None]:
teams['eFG%'] = np.round((teams['FG'] + .5 * teams['3P']) / teams['FGA'], 3)

most_3PAs = teams.sort('3PA', descending=True).take[:n]
most_3PAs.sort('eFG%', descending=True)
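
To see what the formula rewards, here is a small worked example with hypothetical shooting lines. The helper name `efg_pct` is introduced only for this sketch.

```python
def efg_pct(fg, three_p, fga):
    """Effective field goal percentage: (FG + 0.5 * 3P) / FGA."""
    return (fg + 0.5 * three_p) / fga


# Hypothetical shooting lines, both 40 makes on 100 attempts.
no_threes = efg_pct(fg=40, three_p=0, fga=100)    # every make is a two
ten_threes = efg_pct(fg=40, three_p=10, fga=100)  # 10 of the makes are threes

print(no_threes, ten_threes)
```

Both lines have the same raw field goal percentage, but the second scores 90 points from the field instead of 80, and its `eFG%` is correspondingly higher. Multiplying `eFG%` by 2 gives points per field goal attempt.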

#### 5. Plots
Generate a scatter plot from `teams` with `AST` on the $x$-axis and `eFG%` on the $y$-axis.  Include the best-fit line.

In [None]:
teams.scatter('AST', select='eFG%', fit_line=True)

Add a column, `ppg`, that gives the points per game for each team (each team plays 82 games).  Create a histogram of `ppg` and display the top 5 teams in points per game.

In [None]:
teams['ppg'] = teams['PTS'] / 82
teams.hist('ppg')
teams.sort('ppg', descending=True).take[:5]

#### 6. Linear Fit
Below is a simple scatter plot and linear fit on field goal attempts and points scored for players.

The second cell contains a scatter plot of the errors of the linear fit, i.e. the player's points minus the predicted points according to the fitted line.

What is wrong with this fitted model?  A basic assumption of simple linear modeling like this is violated, and the violation can be observed in the errors.  What is the cause of this? (Think about this: What is the range of points possible if I take 1 shot versus if I take 2000 shots?)

In [None]:
players.scatter('FGA', select='PTS', fit_line=True)

In [None]:
def linear_fit(x, y, constant=True):
    import statsmodels.api as _sm
    if constant:
        x = _sm.add_constant(x)
    fit = _sm.OLS(y, x).fit()
    out = (fit.params, fit.fittedvalues, fit.resid)
    return out

# Ignore the first two returned quantities and just get the error values
_, _, err = linear_fit(players['FGA'] , players['PTS'])
Table().with_columns(['FGA', players['FGA'], 'err', err]).scatter('FGA')
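
One way to probe the hint about 1 shot versus 2000 shots is a small simulation. The sketch below uses synthetic players, not the NBA data, and the per-shot probabilities are assumptions chosen only for illustration. It fits a line by ordinary least squares and compares the spread of the errors for low-volume and high-volume shooters.

```python
import random
import statistics

random.seed(0)

# Synthetic players (not the NBA data): each takes `k` shots, and every shot
# is worth 0, 2, or 3 points with assumed, made-up probabilities.
SHOT_VALUES = [0, 2, 3]
SHOT_WEIGHTS = [0.55, 0.33, 0.12]

fga = [random.randint(1, 2000) for _ in range(300)]
pts = [sum(random.choices(SHOT_VALUES, SHOT_WEIGHTS, k=k)) for k in fga]

# Ordinary least squares for pts = a + b * fga, written out by hand.
x_bar, y_bar = statistics.mean(fga), statistics.mean(pts)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(fga, pts)) / sum(
    (x - x_bar) ** 2 for x in fga
)
a = y_bar - b * x_bar
resid = [y - (a + b * x) for x, y in zip(fga, pts)]

# Compare the residual spread for low-volume and high-volume shooters.
cutoff = statistics.median(fga)
low = [r for x, r in zip(fga, resid) if x <= cutoff]
high = [r for x, r in zip(fga, resid) if x > cutoff]

print(statistics.pstdev(low), statistics.pstdev(high))
```

In this toy setup the errors fan out as attempts grow, because a point total built from more shots has more room to vary. That pattern conflicts with the usual assumption that the errors have roughly constant variance across the range of $x$.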

Here is an alternative fit, constructed by dividing the above quantities by minutes played.

The linear fit appears qualitatively improved.  It's not perfect, and I'd recommend not making too much of this.

The improvement can be seen from the errors, which show a more even variation across values for `FGApm`.  Why does it make sense to divide by minutes played?

In [None]:
FGApm = players['FGA'] / players['MP']
PPM = players['PTS'] / players['MP']

Table().with_columns(['FGApm', FGApm, 'PPM', PPM]).\
    scatter('FGApm', select='PPM', fit_line=True)

# Ignore the first two returned quantities and just get the error values
_, _, err = linear_fit(FGApm, PPM)
# Label the x-axis FGApm, since these are per-minute attempts rather than raw FGA
Table().with_columns(['FGApm', FGApm, 'err', err]).scatter('FGApm')

#### 7. Expected Value

Generate a list of 10 types of events labeled `'A'` through `'J'`.  Each type has an associated value, given in order as
1.42, -1.07, -0.21, -2.64, -1.57, 1.45, 0.70, 1.17, 3.09, 0.26.  The number of observations for each type of event, again in order, is 44, 759, 584, 174, 114, 28, 91, 148, 302, 148.

Build a table with this data and compute a column `p` with the frequency of occurrence of each type of event.  The frequency of a type is its number of observations divided by the total number of observations.

Display the result.

In [None]:
names = list('ABCDEFGHIJ')

n = make_array(44, 759, 584, 174, 114,  28,  91, 148, 302, 148)
vals = make_array(1.42, -1.07, -0.21 , -2.64, -1.57, 1.45,  0.70,  1.17 ,  3.09,  0.26)
t = Table().with_columns(['name', names, 'n', n, 'val', vals])
t['p'] = t['n'] / t['n'].sum()
t

We make the next observation and it ends up as type `'J'`.  Our example here is actually a game, and part of the game is that we can change the type of the event we just observed.  We can change the type from `'J'` to either `'A'` or `'C'`, but it is a 50-50 toss-up whether it changes to `'A'` or `'C'`.  The value of `'A'` is higher than that of `'J'`, but the value of `'C'` is lower.  Ideally we would like to change to `'A'`.  What is the expected value of the outcome of changing the event type?

In [None]:
val_J = t.where('name', 'J')['val'][0]
val_A = t.where('name', 'A')['val'][0]
val_C = t.where('name', 'C')['val'][0]

# 50-50 chance of landing on 'A' or 'C' after the change
p = .5
val_new = p * val_A + (1 - p) * val_C
# Compare the expected value of changing with the value of staying at 'J'
print(val_new, val_J)

Assume the probabilities of the type of our next observation are governed by the frequencies we computed.
Using the frequencies and the values, compute the expected value of the next event.

In [None]:
# Expected value: sum over types of frequency times value
ev = np.sum(t['p'] * t['val'])
print(ev)
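
As a cross-check that does not depend on the `datascience` table, both expected values in this part can be recomputed in plain Python from the counts and values listed above. The variable names here are introduced only for this sketch.

```python
names = list("ABCDEFGHIJ")
counts = [44, 759, 584, 174, 114, 28, 91, 148, 302, 148]
values = [1.42, -1.07, -0.21, -2.64, -1.57, 1.45, 0.70, 1.17, 3.09, 0.26]

total = sum(counts)
freq = {name: c / total for name, c in zip(names, counts)}
value = dict(zip(names, values))

# Expected value of the next event under the empirical frequencies.
ev_next = sum(freq[k] * value[k] for k in names)

# Expected value of the 50-50 switch from 'J' to 'A' or 'C'.
ev_switch = 0.5 * value["A"] + 0.5 * value["C"]
gain = ev_switch - value["J"]

print(total, ev_next, ev_switch, gain)
```

The switch has an expected value of 0.605 against 0.26 for staying at `'J'`, so changing is favorable on average even though it can land on the lower-valued `'C'`.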

#### 8. Questions

1. What interests you about sports and data science?  
2. Why did you opt for this course?
3. What do you hope to learn from this course?
4. Is there a particular sport/stat/concept that intrigued you and got you interested in sports stats/analytics?