# Name: Eric Ramsey
### Date: 08/30/2023

<style>
.jp-Notebook {
    padding: var(--jp-notebook-padding);
    margin-left: 160px;
    outline: none;
    overflow: auto;
    background: var(--jp-layout-color0);
}
</style>

# Introduction  

The purpose of this project is to gauge your technical skills and problem solving ability by working through something similar to a real NBA data science project. You will work your way through this jupyter notebook, answering questions as you go along. Please begin by adding your name to the top markdown chunk in this document. When you're finished with the document, come back and type your answers into the answer key at the top. Please leave all your work below and have your answers where indicated below as well. Please note that we will be reviewing your code so make it clear, concise and avoid long printouts. Feel free to add in as many new code chunks as you'd like.

Remember that we will be grading the quality of your code and visuals alongside the correctness of your answers. Please try to use packages like pandas/numpy and matplotlib/seaborn as much as possible (instead of base python data manipulations and explicit loops.)  

**WARNING:** Your project will **ONLY** be graded if it's knit to an HTML document where we can see your code. Be careful to make sure that any long lines of code visibly wrap around to the next line, as code that's cut off from the side of the document cannot be graded.  

**Note:**    

**Throughout this document, any `season` column represents the year each season started. For example, the 2015-16 season will be in the dataset as 2015. For most of the rest of the project, we will refer to a season by just this number (e.g. 2015) instead of the full text (e.g. 2015-16).** 

# Answers  

## Part 1      

**Question 1:**   

- 1st Team: XX.X points per game  
- 2nd Team: XX.X points per game  
- 3rd Team: XX.X points per game  
- All-Star: XX.X points per game   

**Question 2:** XX.X Years  

**Question 3:** 

- Elite: X players.  
- All-Star: X players.  
- Starter: X players.  
- Rotation: X players.  
- Roster: X players.  
- Out of League: X players.  

**Open Ended Modeling Question:** Please show your work and leave all responses below in the document.


## Part 2  

**Question 1:** XX.X%   
**Question 2:** Written question, put answer below in the document.    
**Question 3:** Written question, put answer below in the document.    
  


# Setup and Data    

In [1]:
import pandas as pd
import numpy as np
# Note you will likely have to change these paths.
# The paths below assume the CSV files are in the same folder as this notebook.
awards = pd.read_csv("./awards_data.csv")
player_data = pd.read_csv("./player_stats.csv")
team_data = pd.read_csv("./team_stats.csv")
rebounding_data = pd.read_csv("./team_rebounding_data_22.csv")

## Part 1 -- Awards  

In this section, you're going to work with data relating to player awards and statistics. You'll start with some data manipulation questions and work towards building a model to predict broad levels of career success.  


### Question 1  

**QUESTION:** What is the average number of points per game for players in the 2007-2021 seasons who won All NBA First, Second, and Third teams (**not** the All Defensive Teams), as well as for players who were in the All-Star Game (**not** the rookie all-star game)?


 

In [2]:
awards # display the original awards data to determine any manipulation needed

Unnamed: 0,season,nbapersonid,All NBA Defensive First Team,All NBA Defensive Second Team,All NBA First Team,All NBA Second Team,All NBA Third Team,All Rookie First Team,All Rookie Second Team,Bill Russell NBA Finals MVP,...,all_star_game,rookie_all_star_game,allstar_rk,Defensive Player Of The Year_rk,Most Improved Player_rk,Most Valuable Player_rk,Rookie Of The Year_rk,Sixth Man Of The Year_rk,all_nba_points_rk,all_rookie_points_rk
0,2007,708.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,True,False,1.0,1.0,,3.0,,,,
1,2007,947.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,True,False,2.0,,,,,,,
2,2007,948.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,3.0,2.0,,,,,,
3,2007,959.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,True,False,4.0,,,9.0,,,,
4,2007,977.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,True,False,1.0,5.0,,1.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4324,2015,1626170.0,,,,,,,,,...,,,,,,,,,,24.0
4325,2015,1626202.0,,,,,,,,,...,,,,,,,,,,24.0
4326,2015,1626273.0,,,,,,,,,...,,,,,,,,,,24.0
4327,2018,1628971.0,,,,,,,,,...,,,,,,,,,,18.0


In [3]:
awards_q1_df = awards.copy() # initialize new DataFrame for question 1 using a copy of awards data

# function used to remove columns and rows not needed for question 1 solution (ex: remove "All Defensive Team" | "rookie_all_star_game")
def clean_awards_data(df):
    # remove columns not needed for question 1 solution
    df.drop(columns=['All NBA Defensive First Team', 'All NBA Defensive Second Team', 'rookie_all_star_game', 'Rookie Of The Month', 
    'All Rookie First Team', 'All Rookie Second Team', 'allstar_rk', 'all_rookie_points_rk', 'Player Of The Month', 'Player Of The Week',
    'Bill Russell NBA Finals MVP', 'Rookie Of The Year_rk', 'Sixth Man Of The Year_rk', 'all_nba_points_rk', 
    'Defensive Player Of The Year_rk', 'Most Improved Player_rk', 'Most Valuable Player_rk'], inplace=True)
    # filter df by data values in All NBA First - Third teams to remove players who do not meet this condition
    df = df[(df['All NBA First Team'] > 0.0) | (df['All NBA Second Team'] > 0.0) | (df['All NBA Third Team'] > 0.0)]
    df = df.fillna(False) # set NaN or null values to "False" - possibly change to 0 for later steps of project for ML model section

    return df

awards_q1_df = clean_awards_data(awards_q1_df)
awards_q1_df # display awards_q1_df to verify that data have been filtered to contain only players who were awarded All NBA First - Third teams

Unnamed: 0,season,nbapersonid,All NBA First Team,All NBA Second Team,All NBA Third Team,all_star_game
0,2007,708.0,1.0,0.0,0.0,True
3,2007,959.0,0.0,1.0,0.0,True
4,2007,977.0,1.0,0.0,0.0,True
6,2007,1495.0,0.0,1.0,0.0,True
7,2007,1503.0,0.0,0.0,1.0,False
...,...,...,...,...,...,...
666,2021,1627783.0,0.0,0.0,1.0,False
668,2021,1628369.0,1.0,0.0,0.0,True
676,2021,1629027.0,0.0,0.0,1.0,True
677,2021,1629029.0,1.0,0.0,0.0,True


In [4]:
# create copy of player_data for question 1 so original df is not altered - mainly for visual purposes and for troubleshooting next step
player_data_q1_df = player_data.copy()

# function used to remove player_data columns not needed for question 1 solution (keeps player ids, team, season, games, and points)
def clean_player_data(df):
    df.drop(columns=['draftyear', 'draftpick', 'games_start', 'mins', 'fgm', 'fga', 'fgp', 'fgm3', 'fga3', 'fgp3', 'fgm2', 'fga2', 
                     'fgp2', 'efg', 'ftm', 'fta', 'ftp', 'off_reb', 'def_reb', 'tot_reb', 'off_reb_pct', 'def_reb_pct', 
                     'tot_reb_pct', 'ast', 'steals', 'blocks', 'tov', 'stl_pct', 'blk_pct', 'tov_pct', 'tot_fouls', 'PER', 'FTr', 
                     'ast_pct', 'usg', 'OWS', 'DWS', 'WS', 'OBPM', 'DBPM', 'BPM', 'VORP'], inplace=True)
    return df
# call clean_player_data function to display player_data_q1_df and verify the changes were implemented correctly
clean_player_data(player_data_q1_df)

Unnamed: 0,nbapersonid,player,season,nbateamid,team,games,points
0,2585,Zaza Pachulia,2007,1610612737,ATL,62,322
1,200780,Solomon Jones,2007,1610612737,ATL,35,35
2,2746,Josh Smith,2007,1610612737,ATL,81,1394
3,201151,Acie Law,2007,1610612737,ATL,56,235
4,101136,Salim Stoudamire,2007,1610612737,ATL,35,200
...,...,...,...,...,...,...,...
8487,1630648,Jordan Schakel,2021,1610612764,WAS,4,5
8488,1630557,Corey Kispert,2021,1610612764,WAS,77,634
8489,1628398,Kyle Kuzma,2021,1610612764,WAS,66,1130
8490,203526,Raul Neto,2021,1610612764,WAS,70,526


In [5]:
# Initialize df subsets of players by All NBA team (First, Second, Third) and All-Star selection, then use nbapersonid to retrieve ppg
all_team_first = awards_q1_df[(awards_q1_df['All NBA First Team'] == 1.0)]
all_team_second = awards_q1_df[(awards_q1_df['All NBA Second Team'] == 1.0)]
all_team_third = awards_q1_df[(awards_q1_df['All NBA Third Team'] == 1.0)]
all_star_game = awards_q1_df[(awards_q1_df['all_star_game'] == True)]
# all_team_first
# all_team_second
# all_team_third
# all_star_game

# function used to determine team averages using nbapersonid to retrieve points for a given season and total games played by the target
def get_all_team_ppg(all_team_df, player_data_q1_df):
    # locate the games played and total points for players in all_team_df and merge the needed data
    # (the merge key is nbapersonid only, so every season row of a matched player is attached)
    all_team_df = all_team_df.merge(player_data_q1_df[['nbapersonid', 'points', 'games']], on='nbapersonid', how='left')
    # determine the player ppg for matched nbapersonid (points / games = ppg) - adding ppg column for result
    all_team_df['ppg'] = all_team_df['points'] / all_team_df['games']
    # determine the overall averages of the entire all_team_df
    all_team_ppg = round(all_team_df['points'].sum() / all_team_df['games'].sum(), 1)

    return all_team_ppg

# function used to determine the averages of players located in all_star_game subset where all_star_game (column) == True for player
def get_all_star_ppg(all_star_game, player_data_q1_df):
    # locate the games played and total points for players in all_star_game df and merge the needed data
    # (the merge key is nbapersonid only, so every season row of a matched player is attached)
    all_star_game = all_star_game.merge(player_data_q1_df[['nbapersonid', 'points', 'games']], on='nbapersonid', how='left')
    # determine the all_star_game player ppg for matched nbapersonid (points / games = ppg) - adding ppg column for result
    all_star_game['ppg'] = all_star_game['points'] / all_star_game['games']
    # determine the overall averages of the entire all_star_game df
    all_star_ppg = round(all_star_game['points'].sum() / all_star_game['games'].sum(), 1)

    return all_star_ppg

# final output displaying ppg averages for the required team classifications
print("1st Team: ", get_all_team_ppg(all_team_first, player_data_q1_df), 'ppg')
print("2nd Team: ", get_all_team_ppg(all_team_second, player_data_q1_df), 'ppg')
print("3rd Team: ", get_all_team_ppg(all_team_third, player_data_q1_df), 'ppg')
print("All-Star: ", get_all_star_ppg(all_star_game, player_data_q1_df), 'ppg')

1st Team:  22.7 ppg
2nd Team:  20.2 ppg
3rd Team:  18.1 ppg
All-Star:  20.9 ppg


<strong><span style="color:red">ANSWER 1:</span></strong>   

1st Team: 22.7 points per game  
2nd Team: 20.2 points per game  
3rd Team: 18.1 points per game  
All-Star: 20.9 points per game  
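
Since `player_data` appears to hold one row per player, season, and team, a join on `nbapersonid` alone pairs each award season with every season the player appears in. The sketch below, on made-up rows, joins on both `nbapersonid` and `season` and then averages each player-season's points per game. The toy values, the `_toy` names, and the choice to sum a player's teams within a season are assumptions for illustration, not the method used to produce the numbers above.

```python
import pandas as pd

# Toy stand-ins for the awards and player stats tables (values are made up).
awards_toy = pd.DataFrame({
    "nbapersonid": [1, 2],
    "season": [2020, 2020],
    "All NBA First Team": [1.0, 0.0],
})
stats_toy = pd.DataFrame({
    "nbapersonid": [1, 1, 1, 2],
    "season": [2019, 2020, 2020, 2020],   # player 1 changed teams in 2020
    "games": [70, 40, 30, 60],
    "points": [700, 1000, 800, 900],
})

# Collapse multi-team seasons to one row per player-season.
season_totals = (
    stats_toy.groupby(["nbapersonid", "season"], as_index=False)[["games", "points"]].sum()
)

# Join on both keys so only the award season's stats are attached.
first_team = awards_toy[awards_toy["All NBA First Team"] == 1.0]
merged = first_team.merge(season_totals, on=["nbapersonid", "season"], how="left")

# Average of each player-season's points per game.
merged["ppg"] = merged["points"] / merged["games"]
first_team_ppg = round(merged["ppg"].mean(), 1)
print(first_team_ppg)
```

Averaging per-player ppg and dividing total points by total games answer slightly different questions; which one the question intends is a judgment call.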

### Question 2  

**QUESTION:** What was the average number of years of experience in the league it takes for players to make their first All NBA Selection (1st, 2nd, or 3rd team)? Please limit your sample to players drafted in 2007 or later who did eventually go on to win at least one All NBA selection. For example:

- Luka Doncic is in the dataset as 2 years. He was drafted in 2018 and won his first All NBA award in 2019 (which was his second season).  
- LeBron James is not in this dataset, as he was drafted prior to 2007.  
- Lu Dort is not in this dataset, as he has not received any All NBA honors.  



In [6]:
# Initialize copy of modified awards data specified for question 2
awards_q2_df = awards_q1_df.copy()
# create copy of player_data for question 2 so original df is not altered 
player_data_q2_df = player_data.copy()
# function used to remove columns not needed for question 2 solution - keeping 'draftyear', which was not used in question 1
def clean_player_data_q2(df):
    # remove columns not needed for question 2 results - mainly for visual and troubleshooting purposes
    df.drop(columns=['draftpick', 'games_start', 'mins', 'fgm', 'fga', 'fgp', 'fgm3', 'fga3', 'fgp3', 'fgm2', 'fga2', 
                     'fgp2', 'efg', 'ftm', 'fta', 'ftp', 'off_reb', 'def_reb', 'tot_reb', 'off_reb_pct', 'def_reb_pct', 
                     'tot_reb_pct', 'ast', 'steals', 'blocks', 'tov', 'stl_pct', 'blk_pct', 'tov_pct', 'tot_fouls', 'PER', 'FTr', 
                     'ast_pct', 'usg', 'OWS', 'DWS', 'WS', 'OBPM', 'DBPM', 'BPM', 'VORP'], inplace=True)
    return df

# function used to determine the average years it takes for a player to make their first All NBA Team Selection (First, Second, Third)
def get_avg_allteam_selection(awards_q2_df, player_data_q2_df):
    # attach each player's draft year to the awards rows (the merge key is nbapersonid only, so rows repeat once per player_data row)
    awards_q2_df = awards_q2_df.merge(player_data_q2_df[['nbapersonid', 'draftyear']], on='nbapersonid', how='left')
    # determine years until an All NBA Team selection (selection season - draftyear), stored in '1st_all_team_sel_yrs'
    awards_q2_df['1st_all_team_sel_yrs'] = awards_q2_df.apply(
        lambda row: row['season'] - row['draftyear'] if (
            row['All NBA First Team'] == 1.0 or
            row['All NBA Second Team'] == 1.0 or
            row['All NBA Third Team'] == 1.0) else None,
            axis=1
    )
    # noticed on output test there were duplicate rows - causing inaccurate calculation
    awards_q2_df = awards_q2_df.drop_duplicates()
    # determine the average years until a player is first selected for All NBA Teams
    avg_allteam_sel_yrs = round(awards_q2_df['1st_all_team_sel_yrs'].mean(), 1)
    
    return avg_allteam_sel_yrs

# call the required data cleaning and avg_allteam_selection functions for question 2 - display final output
clean_player_data_q2(player_data_q2_df)
print(get_avg_allteam_selection(awards_q2_df, player_data_q2_df), 'Years')

7.2 Years


<strong><span style="color:red">ANSWER 2:</span></strong>  

7.2 Years  
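
Following the Luka Doncic example in the question (drafted 2018, first selection in 2019, counted as 2 years), experience can be read as `season - draftyear + 1`, measured at each player's earliest All NBA season and restricted to players drafted in 2007 or later. A minimal sketch on made-up rows, assuming one `draftyear` per `nbapersonid`; the `_toy` frames and the `years_to_first` column are hypothetical names, and the result is not a recomputation of the answer above.

```python
import pandas as pd

# Made-up All NBA selections: one row per player-season with a selection.
all_nba_toy = pd.DataFrame({
    "nbapersonid": [10, 10, 20, 30],
    "season": [2019, 2020, 2015, 2010],
})
# Made-up draft years; player 30 was drafted before 2007 and is excluded.
draft_toy = pd.DataFrame({
    "nbapersonid": [10, 20, 30],
    "draftyear": [2018, 2012, 2003],
})

# Earliest All NBA season per player.
first_sel = all_nba_toy.groupby("nbapersonid", as_index=False)["season"].min()

# Attach draft year and keep players drafted in 2007 or later.
first_sel = first_sel.merge(draft_toy, on="nbapersonid", how="inner")
first_sel = first_sel[first_sel["draftyear"] >= 2007]

# Drafted 2018, first selection 2019 -> 2 years of experience.
first_sel["years_to_first"] = first_sel["season"] - first_sel["draftyear"] + 1
avg_years = round(first_sel["years_to_first"].mean(), 1)
print(avg_years)
```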

## Data Cleaning Interlude  

You're going to work to create a dataset with a "career outcome" for each player, representing the highest level of success that the player achieved for **at least two** seasons *after his first four seasons in the league* (examples to follow below!). To do this, you'll start with single season level outcomes. On a single season level, the outcomes are:  

- Elite: A player is "Elite" in a season if he won any All NBA award (1st, 2nd, or 3rd team), MVP, or DPOY in that season.    
- All-Star: A player is "All-Star" in a season if he was selected to be an All-Star that season.   
- Starter:  A player is a "Starter" in a season if he started in at least 41 games in the season OR if he played at least 2000 minutes in the season.    
- Rotation:  A player is a "Rotation" player in a season if he played at least 1000 minutes in the season.   
- Roster:  A player is a "Roster" player in a season if he played at least 1 minute for an NBA team but did not meet any of the above criteria.     
- Out of the League: A player is "Out of the League" if he is not in the NBA in that season.   

We need to make an adjustment for determining Starter/Rotation qualifications for a few seasons that didn't have 82 games per team. Assume that there were 66 possible games in the 2011 lockout season and 72 possible games in each of the 2019 and 2020 seasons that were shortened due to covid. Specifically, if a player played 900 minutes in 2011, he **would** meet the rotation criteria because his final minutes would be considered to be 900 * (82/66) = 1118. Please use this math for both minutes and games started, so a player who started 38 games in 2019 or 2020 would be considered to have started 38 * (82/72) = 43 games, and thus would qualify for starting 41. Any answers should be calculated assuming you round the multiplied values to the nearest whole number.
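
The season-level rules and the shortened-season adjustment can be sketched as below. The names `SEASON_GAMES`, `scale_to_82`, and `season_outcome` are hypothetical, and the `elite` and `all_star` flags are assumed to have been derived from the awards data beforehand.

```python
# Possible games per season; seasons not listed are assumed to have 82.
SEASON_GAMES = {2011: 66, 2019: 72, 2020: 72}

def scale_to_82(value, season):
    """Scale minutes or games started to an 82-game season, rounded to a whole number.

    Python's round() sends exact halves to the nearest even integer.
    """
    return round(value * 82 / SEASON_GAMES.get(season, 82))

def season_outcome(season, mins, games_start, elite=False, all_star=False):
    """Return the highest season-level outcome a player qualifies for."""
    adj_mins = scale_to_82(mins, season)
    adj_starts = scale_to_82(games_start, season)
    # Checks run from the highest outcome down, so the first match wins.
    if elite:
        return "Elite"
    if all_star:
        return "All-Star"
    if adj_starts >= 41 or adj_mins >= 2000:
        return "Starter"
    if adj_mins >= 1000:
        return "Rotation"
    if mins >= 1:
        return "Roster"
    return "Out of the League"

print(scale_to_82(900, 2011))   # 900 * (82/66), the rotation example above
print(scale_to_82(38, 2019))    # 38 * (82/72), the games-started example above
print(season_outcome(2011, mins=900, games_start=0))
```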

Note that on a season level, a player's outcome is the highest level of success he qualifies for in that season. Thus, since Shai Gilgeous-Alexander was both All-NBA 1st team and an All-Star last year, he would be considered to be "Elite" for the 2022 season, but would still qualify for a career outcome of All-Star if in the rest of his career he made one more All-Star game but no more All-NBA teams. Note this is a hypothetical, and Shai has not yet played enough to have a career outcome.    

Examples:  

- A player who enters the league as a rookie and has season outcomes of Roster (1), Rotation (2), Rotation (3), Roster (4), Roster (5), Out of the League (6+) would be considered "Out of the League," because after his first four seasons, he only has a single Roster year, which does not qualify him for any success outcome.  
- A player who enters the league as a rookie and has season outcomes of Roster (1), Rotation (2), Starter (3), Starter (4), Starter (5), Starter (6), All-Star (7), Elite (8), Starter (9) would be considered "All-Star," because he had at least two seasons after his first four at all-star level of production or higher.  
- A player who enters the league as a rookie and has season outcomes of Roster (1), Rotation (2), Starter (3), Starter (4), Starter (5), Starter (6), Rotation (7), Rotation (8), Roster (9) would be considered a "Starter" because he has two seasons after his first four at a starter level of production. 
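
The three examples above can be checked with a short sketch. It assumes a list with one season-level label per season, starting with the rookie year; `LEVELS`, `RANK`, and `career_outcome` are hypothetical names.

```python
# Outcome levels from lowest to highest.
LEVELS = ["Out of the League", "Roster", "Rotation", "Starter", "All-Star", "Elite"]
RANK = {name: i for i, name in enumerate(LEVELS)}

def career_outcome(season_outcomes):
    """Highest level reached in at least two seasons after the first four."""
    later = [RANK[o] for o in season_outcomes[4:]]
    # Check levels from highest to lowest; a season counts toward
    # every level at or below the one it achieved.
    for level in range(len(LEVELS) - 1, 0, -1):
        if sum(r >= level for r in later) >= 2:
            return LEVELS[level]
    return "Out of the League"

example_1 = ["Roster", "Rotation", "Rotation", "Roster", "Roster", "Out of the League"]
example_2 = ["Roster", "Rotation", "Starter", "Starter", "Starter", "Starter",
             "All-Star", "Elite", "Starter"]
example_3 = ["Roster", "Rotation", "Starter", "Starter", "Starter", "Starter",
             "Rotation", "Rotation", "Roster"]

for seasons in (example_1, example_2, example_3):
    print(career_outcome(seasons))
```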


### Question 3  

**QUESTION:** There are 73 players in the `player_data` dataset who have 2010 listed as their draft year. How many of those players have a **career** outcome in each of the 6 buckets?  

In [7]:
awards_q3_df = awards.copy()
player_data_q3_df = player_data.copy()

# function used to combine the two data frames and remove columns and rows not needed for question 3 solution 
def clean_awards_player_data(awards_q3_df, player_data_q3_df):
    # remove columns not needed for question 3 solution
    awards_q3_df.drop(columns=['All NBA Defensive First Team', 'All NBA Defensive Second Team', 'rookie_all_star_game', 'Rookie Of The Month', 
    'All Rookie First Team', 'All Rookie Second Team', 'allstar_rk', 'all_rookie_points_rk', 'Player Of The Month', 'Player Of The Week',
    'Bill Russell NBA Finals MVP', 'Rookie Of The Year_rk', 'Sixth Man Of The Year_rk', 'all_nba_points_rk', 'Most Improved Player_rk'], inplace=True)
    # filter df by data values in All NBA First - Third teams to remove players who do not meet this condition
    awards_q3_df = awards_q3_df[(awards_q3_df['All NBA First Team'] > 0.0) | (awards_q3_df['All NBA Second Team'] > 0.0) | (awards_q3_df['All NBA Third Team'] > 0.0)]
    awards_q3_df = awards_q3_df.fillna(False) # set NaN or null values to "False" - possibly change to 0 for later steps of project for ML model section
    # remove columns not needed for question 3 results - mainly for visual and troubleshooting purposes
    player_data_q3_df.drop(columns=['draftpick', 'fgm', 'fga', 'fgp', 'fgm3', 'fga3', 'fgp3', 'fgm2', 'fga2', 
                     'fgp2', 'efg', 'ftm', 'fta', 'ftp', 'off_reb', 'def_reb', 'tot_reb', 'off_reb_pct', 'def_reb_pct', 
                     'tot_reb_pct', 'ast', 'steals', 'blocks', 'tov', 'stl_pct', 'blk_pct', 'tov_pct', 'tot_fouls', 'PER', 'FTr', 
                     'ast_pct', 'usg', 'OWS', 'DWS', 'WS', 'OBPM', 'DBPM', 'BPM', 'VORP'], inplace=True)
    # sort player_data_q3_df by 'season'
    player_data_q3_df = player_data_q3_df.sort_values('season')
    # combine the data frames column-wise (pd.concat with axis=1 aligns rows by index label, not by nbapersonid/season)
    awards_player_data = pd.concat([awards_q3_df, player_data_q3_df], axis=1)
    # adjust the former NaN/null values to 0.0 to resolve InvalidIndexError
    adjust_col_false_values = ['season', 'nbapersonid', 'All NBA First Team', 'All NBA Second Team', 'All NBA Third Team', 'Defensive Player Of The Year_rk', 'Most Valuable Player_rk', 
                               'draftyear', 'season', 'nbateamid', 'team', 'games', 'games_start', 'mins', 'points'] 
    # modify columns listed in adjust_col_false_values to 0.0
    for column in adjust_col_false_values:
        awards_player_data[column] = awards_player_data[column].fillna(0.0)
    # modify NaN/null values in all_star_game column to False
    awards_player_data['all_star_game'] = awards_player_data['all_star_game'].fillna(False)

    return awards_player_data

awards_player_data_merged = clean_awards_player_data(awards_q3_df, player_data_q3_df)
awards_player_data_merged =  awards_player_data_merged[(awards_player_data_merged['draftyear'] == 2010)].copy()
awards_player_data_merged

Unnamed: 0,season,nbapersonid,All NBA First Team,All NBA Second Team,All NBA Third Team,all_star_game,Defensive Player Of The Year_rk,Most Valuable Player_rk,nbapersonid.1,player,draftyear,season.1,nbateamid,team,games,games_start,mins,points
1909,0.0,0.0,0.0,0.0,0.0,False,0.0,0.0,202332,Cole Aldrich,2010,2010,1610612760,OKC,18,0,142,18
1896,0.0,0.0,0.0,0.0,0.0,False,0.0,0.0,202360,Andy Rautins,2010,2010,1610612752,NYK,5,0,24,8
1900,0.0,0.0,0.0,0.0,0.0,False,0.0,0.0,202361,Landry Fields,2010,2010,1610612752,NYK,82,81,2541,797
1953,0.0,0.0,0.0,0.0,0.0,False,0.0,0.0,202342,Craig Brackins,2010,2010,1610612755,PHI,3,0,33,8
1955,0.0,0.0,0.0,0.0,0.0,False,0.0,0.0,202323,Evan Turner,2010,2010,1610612755,PHI,78,14,1797,565
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8052,0.0,0.0,0.0,0.0,0.0,False,0.0,0.0,202362,Lance Stephenson,2010,2021,1610612754,IND,40,1,744,373
8063,0.0,0.0,0.0,0.0,0.0,False,0.0,0.0,202331,Paul George,2010,2021,1610612746,LAC,31,31,1077,754
8102,0.0,0.0,0.0,0.0,0.0,False,0.0,0.0,202340,Avery Bradley,2010,2021,1610612747,LAL,62,45,1406,394
7961,0.0,0.0,0.0,0.0,0.0,False,0.0,0.0,202326,DeMarcus Cousins,2010,2021,1610612743,DEN,31,2,431,276
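
In the output above the award columns are all zero and `nbapersonid` and `season` each appear twice, which is consistent with a column-wise `pd.concat` that aligns rows by index label rather than by player and season. A minimal sketch of a key-based join on made-up rows follows; the `_toy` frames are hypothetical, and a left join from the stats table is one reasonable choice, not the only one.

```python
import pandas as pd

# Made-up rows standing in for the player stats and awards tables.
stats_toy = pd.DataFrame({
    "nbapersonid": [1, 1, 2],
    "season": [2015, 2016, 2015],
    "mins": [2100, 2500, 400],
})
awards_toy = pd.DataFrame({
    "nbapersonid": [1],
    "season": [2016],
    "all_star_game": [True],
})

# Left join keeps every stats row and attaches awards only where
# both the player id and the season match.
merged = stats_toy.merge(awards_toy, on=["nbapersonid", "season"], how="left")

# Player-seasons without an award row come back as NaN; treat them as "no award".
merged["all_star_game"] = merged["all_star_game"].eq(True)
print(merged)
```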


In [8]:
# function used to determine the total buckets a player with the draft year of 2010 has met to classify their standing (Elite, All-Star, Starter, Rotation, Roster, or Out of the League)
def get_career_outcome(awards_player_data_merged):
    # dict used to store career_outcome counts for the rows (player-seasons) being evaluated 
    career_outcome = {
        "elite": 0,
        "all-star": 0,
        "starter": 0,
        "rotation": 0,
        "roster": 0,
        "otl": 0
    }
    # determine the career outcomes based on the columns containing the required information for career evaluation
    for index, row in awards_player_data_merged.iterrows():
        all_team_first = row['All NBA First Team']
        all_team_second = row['All NBA Second Team']
        all_team_third = row['All NBA Third Team']
        all_star = row['all_star_game']
        dpoy = row['Defensive Player Of The Year_rk']
        mvp = row['Most Valuable Player_rk']
        games_started = row['games_start']
        mins_played = row['mins']
        # increment every bucket whose condition the row meets (the checks are independent, so one row can count toward several buckets)
        if ((all_team_first == 1.0) | (all_team_second == 1.0) | (all_team_third == 1.0) | (dpoy == 1.0) | (mvp == 1.0)):
            career_outcome["elite"] += 1
        if all_star:
            career_outcome["all-star"] += 1
        if games_started >= 41.0 or mins_played >= 2000.0:
            career_outcome["starter"] += 1
        if mins_played >= 1000.0:
            career_outcome["rotation"] += 1
        if row['games'] > 0.0:
            career_outcome["roster"] += 1
        else:
            career_outcome["otl"] += 1    
    # create a visual representation of the final results of career_outcomes to determine question 3 solution
    career_outcome_results = pd.DataFrame(career_outcome.items(), columns=['career_outcome', 'total'])

    return career_outcome_results

career_outcome_results = get_career_outcome(awards_player_data_merged)
career_outcome_results

Unnamed: 0,career_outcome,total
0,elite,0
1,all-star,0
2,starter,98
3,rotation,181
4,roster,437
5,otl,0


<strong><span style="color:red">ANSWER 3:</span></strong>  

Elite: 0 players.  
All-Star: 0 players.  
Starter: 98 players.  
Rotation: 181 players.  
Roster: 437 players.  
Out of League: 0 players.  

### Open Ended Modeling Question   

In this question, you will work to build a model to predict a player's career outcome based on information up through the first four years of his career. 

This question is intentionally left fairly open ended, but here are some notes and specifications.  

1. We know modeling questions can take a long time, and that qualified candidates will have different levels of experience with "formal" modeling. Don't be discouraged. It's not our intention to make you spend excessive time here. If you get your model to a good spot but think you could do better by spending a lot more time, you can just write a bit about your ideas for future improvement and leave it there. Further, we're more interested in your thought process and critical thinking than we are in specific modeling techniques. Using smart features is more important than using fancy mathematical machinery, and a successful candidate could use a simple regression approach. 

2. You may use any data provided in this project, but please do not bring in any external sources of data. Note that while most of the data provided goes back to 2007, All NBA and All Rookie team voting is only included back to 2011.  

3. A player needs to complete three additional seasons after their first four to be considered as having a distinct career outcome for our dataset. Because the dataset in this project ends in 2021, this means that a player would need to have had the chance to play in the '21, '20, and '19 seasons after his first four years, and thus his first four years would have been '18, '17, '16, and '15. **For this reason, limit your training data to players who were drafted in or before the 2015 season.** Karl-Anthony Towns was the #1 pick in that season.  

4. Once you build your model, predict on all players who were drafted in 2018-2021 (They have between 1 and 4 seasons of data available and have not yet started accumulating seasons that inform their career outcome).  

5. You can predict a single career outcome for each player, but it's better if you can predict the probability that each player falls into each outcome bucket.    

6. Include, as part of your answer:  
  - A brief written overview of how your model works, targeted towards a decision maker in the front office without a strong statistical background. 
  - What you view as the strengths and weaknesses of your model.  
  - How you'd address the weaknesses if you had more time and/or more data.  
  - A matplotlib or plotly visualization highlighting some part of your modeling process, the model itself, or your results.  
  - Your predictions for Shai Gilgeous-Alexander, Zion Williamson, James Wiseman, and Josh Giddey.  
  - (Bonus!) An html table (for example, see the package `reactable`) containing all predictions for the players drafted in 2019-2021.  
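
Note 5 above asks for per-bucket probabilities rather than a single label. With scikit-learn classifiers this is available through `predict_proba`. The sketch below uses synthetic features and made-up labels purely to show the shape of the output; it is not a model of the project data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic features and made-up outcome labels (illustration only).
X_train = rng.normal(size=(60, 2))
y_train = np.where(X_train[:, 0] > 0.5, "Starter",
                   np.where(X_train[:, 0] > -0.5, "Rotation", "Roster"))

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# One row of probabilities per player, one column per label in model.classes_.
X_new = rng.normal(size=(3, 2))
proba = model.predict_proba(X_new)
for row in proba:
    print(dict(zip(model.classes_, row.round(2))))
```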



In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder 

# create model training df using the function created in earlier steps and adjust as needed for ML process
awards_df = awards.copy()
player_data_df = player_data.copy()
model_df = clean_awards_player_data(awards_df, player_data_df)

#function used to modify model_df to prepare the data for ML process
def clean_model_data(model_df):

    model_df = model_df[(model_df['draftyear'] <= 2015)].copy()
    model_df['Defensive Player Of The Year_rk'] = model_df['Defensive Player Of The Year_rk'].replace(False, 0.0) # while this was part of the function's process, for this new df the column was still having False for NaN values
    model_df['all_star_game'] = model_df['all_star_game'].replace(True, 1.0)
    model_df['all_star_game'] = model_df['all_star_game'].replace(False, 0.0)
    model_df = model_df.drop(columns=['player', 'team', 'season', 'season'])

    return model_df

model_df = clean_model_data(model_df)

In [10]:
# set the career_outcome labels used as the model target - default value of career_outcome is 'otl' before the data are analyzed
# note: each .loc assignment below overwrites earlier labels wherever its condition is true
model_df['career_outcome'] = 'otl' 
# set target requirements for 'elite' career_outcome
elite = (
        (model_df['All NBA First Team'] == 1.0) | 
        (model_df['All NBA Second Team'] == 1.0) | 
        (model_df['All NBA Third Team'] == 1.0) | 
        (model_df['Defensive Player Of The Year_rk'] == 1.0) | 
        (model_df['Most Valuable Player_rk'] == 1.0)
) 
model_df.loc[elite, 'career_outcome'] = 'elite'
# set requirements for 'starter' career_outcome
starter = ((model_df['games_start'] >= 41.0) & (model_df['mins'] >= 2000.0))
model_df.loc[starter, 'career_outcome'] = 'starter'
# set requirements for 'rotation' career_outcome
rotation = (model_df['mins'] >= 1000.0) 
model_df.loc[rotation, 'career_outcome'] = 'rotation'
# set requirements for 'roster' career_outcome
roster = (model_df['games'] > 0.0) 
model_df.loc[roster, 'career_outcome'] = 'roster'

# initialize machine learning model requirements
random_forest_classifier = RandomForestClassifier(random_state=42)
scaler = MinMaxScaler()
encoder = OneHotEncoder(sparse_output=False) # sparse_output=False to remove the FutureWarning related to sparse parameter

# cast labels to strings and map any 'nan' label to 'unknown'
model_df['career_outcome'] = model_df['career_outcome'].astype(str).replace('nan', 'unknown')

# Initialize target variable (y - career_outcome) and feature set (X - contains all columns except career_outcome)
X = model_df.drop(columns=['career_outcome'])
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(model_df['career_outcome'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

random_forest_classifier.fit(X_train, y_train)

y_predict = random_forest_classifier.predict(X_test)
report = classification_report(y_test, y_predict, target_names=label_encoder.classes_)
print(report)

              precision    recall  f1-score   support

      roster       1.00      1.00      1.00      1381

    accuracy                           1.00      1381
   macro avg       1.00      1.00      1.00      1381
weighted avg       1.00      1.00      1.00      1381
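
One likely reason the report contains a single class is that sequential `.loc` assignments overwrite one another, so a final broad rule such as `games > 0` relabels every row it covers. A sketch of ordered labeling with `np.select`, where the first matching condition wins, on made-up rows; the `toy` frame and its `elite` flag are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up player-seasons.
toy = pd.DataFrame({
    "elite": [True, False, False, False],
    "games_start": [70, 50, 5, 0],
    "mins": [2600, 1500, 1200, 300],
    "games": [75, 60, 65, 20],
})

# Conditions are checked in order and the first match wins,
# so the highest outcome must come first.
conditions = [
    toy["elite"],
    (toy["games_start"] >= 41) | (toy["mins"] >= 2000),
    toy["mins"] >= 1000,
    toy["games"] > 0,
]
labels = ["elite", "starter", "rotation", "roster"]
toy["season_outcome"] = np.select(conditions, labels, default="otl")
print(toy["season_outcome"].tolist())
```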



### Model Overview

Unfortunately, I did not arrive at a working model. The report above shows 100% accuracy, but that figure is not meaningful: the test set contains only the `roster` class, so the model has nothing to distinguish. I believe the problem lies in the data the model is given, which produces a misleading picture of its accuracy. I was also unable to restrict the features to each player's first four seasons. Another issue is the merge of the awards and player data: while troubleshooting, I noticed that the combined frame contains two `nbapersonid` columns, which points to a logical error in my merge function.

My original plan was to reuse the functions from Question 3 to derive a career outcome for each player in this section. Although I was unable to build an efficient model for this problem, I plan to keep working on it after the deadline until I have a functioning model. I look forward to any feedback on the solutions presented here so that I can correct my mistakes and improve.
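
One way to restrict the features to a player's first four seasons is sketched below on made-up rows. It assumes a player's first season is his earliest `season` in the data, which may not hold for players who debuted before the data begins; the aggregate feature names are hypothetical.

```python
import pandas as pd

# Made-up player-season rows.
toy = pd.DataFrame({
    "nbapersonid": [1, 1, 1, 1, 1, 2, 2],
    "season": [2012, 2013, 2014, 2015, 2016, 2014, 2015],
    "mins": [500, 1200, 1800, 2400, 2600, 300, 900],
})

# Number each player's seasons from 1 upward, counting from his first season in the data.
first_season = toy.groupby("nbapersonid")["season"].transform("min")
toy["season_num"] = toy["season"] - first_season + 1

# Features use seasons 1-4 only; later seasons are reserved for the outcome label.
early = toy[toy["season_num"] <= 4]
features = early.groupby("nbapersonid").agg(
    seasons_played=("season", "nunique"),
    total_mins=("mins", "sum"),
    best_mins=("mins", "max"),
)
print(features)
```

Keeping the label seasons out of the feature table also removes the leakage that arises when the same columns define both the features and the target.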


## Part 2 -- Predicting Team Stats  

In this section, we're going to introduce a simple way to predict team offensive rebound percent in the next game and then discuss ways to improve those predictions.  
 
### Question 1   

Using the `rebounding_data` dataset, we'll predict a team's next game's offensive rebounding percent to be their average offensive rebounding percent in all prior games. On a single game level, offensive rebounding percent is the number of offensive rebounds divided by their number offensive rebound "chances" (essentially the team's missed shots). On a multi-game sample, it should be the total number of offensive rebounds divided by the total number of offensive rebound chances.    

Please calculate what OKC's predicted offensive rebound percent is for game 81 in the data. That is, use games 1-80 to predict game 81.  

In [11]:
rebounding_data_df = rebounding_data.copy()
rebounding_data_df


Unnamed: 0,team,opp_team,gamedate,game_number,offensive_rebounds,off_rebound_chances,oreb_pct
0,BOS,PHI,2022-10-18,1,10,39,0.256410
1,PHI,BOS,2022-10-18,1,8,42,0.190476
2,GSW,LAL,2022-10-18,1,16,57,0.280702
3,LAL,GSW,2022-10-18,1,14,57,0.245614
4,ORL,DET,2022-10-19,1,13,47,0.276596
...,...,...,...,...,...,...,...
2455,LAC,PHX,2023-04-09,82,18,56,0.321429
2456,MEM,OKC,2023-04-09,82,12,55,0.218182
2457,POR,GSW,2023-04-09,82,11,61,0.180328
2458,SAC,DEN,2023-04-09,82,12,50,0.240000


In [12]:
# Keep only OKC's games 1-80; these are the games used to predict game 81.
# `between` is inclusive on both ends by default.
okc_games_df = rebounding_data_df[
    (rebounding_data_df['team'] == 'OKC')
    & (rebounding_data_df['game_number'].between(1, 80))
]

# Multi-game offensive rebound percent: total offensive rebounds divided by
# total offensive rebound chances (not the mean of the per-game percentages).
def get_okc_off_reb_avg(games_df):
    total_off_reb = games_df['offensive_rebounds'].sum()
    total_off_reb_chances = games_df['off_rebound_chances'].sum()
    off_reb_avg = round((total_off_reb / total_off_reb_chances) * 100, 1)

    return off_reb_avg

okc_off_reb_avg = get_okc_off_reb_avg(okc_games_df)
okc_off_reb_avg

28.9

<strong><span style="color:red">ANSWER 1:</span></strong>  

28.9% 
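The same "all prior games" rule can be applied to every team and game at once instead of filtering for one team. The sketch below is one way to do that, assuming the column names of `rebounding_data`. The toy values and the helper name `add_prior_games_prediction` are made up for illustration.

```python
import pandas as pd

# Toy rows using the same column names as `rebounding_data`; the values are made up.
games = pd.DataFrame({
    "team": ["OKC", "OKC", "OKC", "MEM", "MEM"],
    "game_number": [1, 2, 3, 1, 2],
    "offensive_rebounds": [10, 12, 8, 9, 15],
    "off_rebound_chances": [40, 40, 20, 45, 50],
})


def add_prior_games_prediction(df: pd.DataFrame) -> pd.DataFrame:
    """Add `pred_oreb_pct`: total rebounds / total chances over each team's prior games."""
    out = df.sort_values(["team", "game_number"]).copy()
    grouped = out.groupby("team")
    # Cumulative totals through the current game, minus the current game itself,
    # give the totals over strictly earlier games.
    prior_reb = grouped["offensive_rebounds"].cumsum() - out["offensive_rebounds"]
    prior_chances = grouped["off_rebound_chances"].cumsum() - out["off_rebound_chances"]
    # A team's first game has no history, so its prediction is left as NaN.
    out["pred_oreb_pct"] = prior_reb / prior_chances.where(prior_chances > 0) * 100
    return out


predictions = add_prior_games_prediction(games)
```

With the real data, the row for OKC's game 81 should reproduce the answer above, since it uses the totals from games 1-80.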

### Question 2  

There are a few limitations to the method we used above. For example, if a team has a great offensive rebounder who has played in most games this season but will be out due to an injury for the next game, we might reasonably predict a lower team offensive rebound percent for the next game.  

Please discuss how you would think about changing our original model to better account for missing players. You do not have to write any code or implement any changes, and you can assume you have access to any reasonable data that isn't provided in this project. Try to be clear and concise with your answer.  

<strong><span style="color:red">ANSWER 2:</span></strong>
One change I would consider is feeding the model injury report data, or any other indication that a player who averages more than 10 minutes per game will miss the next game. This would let the model's predictions account for players who potentially have a high impact on OKC's offensive rebounding totals and percentage. Another change that could improve accuracy is giving the model smaller, more recent samples in addition to the data it already has. For example, if the model sees both the season-to-date rebounding data and the team’s rebounding averages over the past 10 games, its predictions can reflect the team's more recent rebounding performance.
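A rough sketch of both ideas, under stated assumptions. The on/off-court split in `on_off` is hypothetical data that is not part of this project, and the names `oreb_pct`, `predict_with_availability`, and `blended_oreb_pct` are made up. The `recent_weight` of 0.3 is an arbitrary placeholder, not a tuned value.

```python
import pandas as pd


def oreb_pct(df: pd.DataFrame) -> float:
    """Offensive rebound percent from summed rebounds and summed chances."""
    return 100 * df["offensive_rebounds"].sum() / df["off_rebound_chances"].sum()


def predict_with_availability(on_off: pd.DataFrame, player_out: bool) -> float:
    """If the key rebounder is out, use only the team's totals with him off the court."""
    if player_out:
        return oreb_pct(on_off[~on_off["on_court"]])
    return oreb_pct(on_off)


def blended_oreb_pct(team_games: pd.DataFrame, window: int = 10,
                     recent_weight: float = 0.3) -> float:
    """Weighted mix of the season-to-date rate and the rate over the last `window` games."""
    ordered = team_games.sort_values("game_number")
    season_rate = oreb_pct(ordered)
    recent_rate = oreb_pct(ordered.tail(window))
    return (1 - recent_weight) * season_rate + recent_weight * recent_rate


# Hypothetical team totals split by whether one key rebounder was on the court.
on_off = pd.DataFrame({
    "on_court": [True, False],
    "offensive_rebounds": [600, 300],
    "off_rebound_chances": [2000, 1200],
})
pred_if_playing = predict_with_availability(on_off, player_out=False)
pred_if_out = predict_with_availability(on_off, player_out=True)

# Toy season: two weaker games followed by ten stronger ones.
team_games = pd.DataFrame({
    "game_number": list(range(1, 13)),
    "offensive_rebounds": [10, 10] + [12] * 10,
    "off_rebound_chances": [40] * 12,
})
pred_blended = blended_oreb_pct(team_games)
```

The off-court sample for a single player can be small, so in practice that estimate would probably need to be shrunk toward the team's overall rate.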

### Question 3  

In question 2, you saw and discussed how to deal with one weakness of the model. For this question, please write about 1-3 other potential weaknesses of the simple average model you made in question 1 and discuss how you would deal with each of them. You may either explain a weakness and discuss how you'd fix that weakness, then move onto the next issue, or you can start by explaining multiple weaknesses with the original approach and discuss one overall modeling methodology you'd use that gets around most or all of them. Again, you do not need to write any code or implement any changes, and you can assume you have access to any reasonable data that isn't provided in this project. Try to be clear and concise with your answer.  


<strong><span style="color:red">ANSWER 3:</span></strong>  The first potential weakness is that the model doesn't compare OKC's rebounding performance against the opponent the team faces in the upcoming game. The predictions would benefit from a feature that captures how the team has performed against the upcoming opponent in the past. A second potential weakness is that the model does not account for the upcoming opponent missing key players who contribute heavily to rebounding. I discussed this for OKC's own players in the previous question, but the same applies to the opponent: if key rebounders on the other side are unavailable, the odds of OKC securing offensive rebounds change. Considering this would therefore be an important factor in predicting OKC's rebounding (and overall game) performance. The final potential weakness is that the model ignores the type of defense the opposing team plays. If a team tends to switch defensive schemes or favors a zone defense, this could affect OKC's offensive rebounding. Finding a way to identify the opposing team's defensive tactics and factor them in could improve the model's predictions.
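One simple way to bring the opponent into the prediction is sketched below. It assumes the `team`, `opp_team`, `offensive_rebounds`, and `off_rebound_chances` columns from `rebounding_data`, with made-up toy values. Head-to-head history covers only a few games, so this sketch swaps in the opponent's offensive rebound percent allowed against all teams, and shifts the team's own rate by how far that figure sits from the league average. The additive form and the helper names `rate_by` and `opponent_adjusted_oreb_pct` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows with the same column names as `rebounding_data`; the values are made up.
games = pd.DataFrame({
    "team": ["OKC", "MEM", "OKC", "DEN", "MEM", "DEN"],
    "opp_team": ["MEM", "OKC", "DEN", "OKC", "DEN", "MEM"],
    "offensive_rebounds": [12, 8, 15, 10, 9, 6],
    "off_rebound_chances": [40, 40, 50, 40, 30, 20],
})


def rate_by(df: pd.DataFrame, key: str) -> pd.Series:
    """Offensive rebound percent from summed totals, grouped by `key`."""
    totals = df.groupby(key)[["offensive_rebounds", "off_rebound_chances"]].sum()
    return 100 * totals["offensive_rebounds"] / totals["off_rebound_chances"]


def opponent_adjusted_oreb_pct(df: pd.DataFrame, team: str, opponent: str) -> float:
    """Team's own rate, shifted by how much the opponent allows relative to league average."""
    league_rate = 100 * df["offensive_rebounds"].sum() / df["off_rebound_chances"].sum()
    team_rate = rate_by(df, "team")[team]
    # Grouping by `opp_team` gives the rate each team allows to its opponents.
    allowed_rate = rate_by(df, "opp_team")[opponent]
    return team_rate + (allowed_rate - league_rate)


okc_vs_mem = opponent_adjusted_oreb_pct(games, team="OKC", opponent="MEM")
```

When predicting a future game, only games played before that game should be passed in, so the prediction never uses information from the game itself.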