# Mid-Semester Project:
## Predict Potential NBA Players Who Can Become Future Members of The Hall of Fame

Student: Zheng Fang

DA 210-02 / CS 181-02: Data Systems

Spring 2023

Instructor: Dr. Tanya Amert 


### The Central Question:

From the proposal, we want to answer: "Which players not inducted into the NBA Hall of Fame between 1999 and 2016 have the potential to become members of the Hall of Fame in the future?"

Pengfei and I are both NBA fans and enjoy sports in general. We are therefore curious about what makes a Hall-of-Fame-level player and which players have a chance to enter the Hall of Fame in the future. We plan to identify the players in the draft list who have already been selected into the Hall of Fame and look for the characteristics they share. Then, using NBA regular season and playoff data, we can find the players who are most likely to enter the Hall of Fame in the future.

### Part A: Importing of Packages

In [1]:
import os
import os.path
import pandas as pd
import csv

datadir = os.getcwd()

Here we import the necessary packages: `os` for building file paths, `pandas` for working with dataframes, and `csv` for reading CSV files. We set `datadir` to the current working directory, which is the location of the notebook, and build every later file path relative to it.
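
As a rough illustration of the relative-path approach, the sketch below builds a path from the working directory. The file name `example.csv` is a placeholder for illustration only, not a file from this project.

```python
import os

# Working directory of the running notebook (or script).
datadir = os.getcwd()

# "example.csv" is a hypothetical file name used only for illustration.
example_path = os.path.join(datadir, "data", "example.csv")

# The part after the working directory stays the same on every machine,
# which is what keeps the notebook portable.
print(os.path.relpath(example_path, datadir))
```

Because only the part relative to `datadir` is fixed, the notebook should keep working when the whole project folder is moved.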

### Part B: Making the Dataframe of the Information from `NBA_Rookies_by_Year_Hall_of_Fame_Class.csv` by Using DoL

We found `NBA_Rookies_by_Year_Hall_of_Fame_Class.csv` among the many NBA datasets available. This dataset records the rookie-year statistics of rookies drafted from 1980 to 2016. It also records which of these players were elected to the Hall of Fame between 1999 and 2016.

Therefore, we want to use this file to form two lists: players who were elected to the Hall of Fame between 1999 and 2016, and players who were drafted from 1980 to 2016 but have not been elected yet.

We can find the dataset from this URL: https://www.kaggle.com/datasets/thedevastator/nba-rookies-performance-statistics-and-minutes-p

From our proposal, we know some basic information about this file:

a)	In this dataset, we have two CSV files and we will use one of them named `NBA Rookies by Year_Hall of Fame Class.csv`.

b)	In this CSV file, there are 24 columns and 1537 rows. 

We will use the following columns.

`Name` will be the index of the lists we build later.

`Hall_of_Fame_Class` lets us separate players who were elected to the Hall of Fame from those who have not been elected yet.

<center>

| Variable Name   |      Variable Introduction      |  Variable Type |
|:----------:|:-------------:|:------:|
| Name |  The name of the rookie | String |
|Hall_of_Fame_Class	|The year the player was elected to the Hall of Fame|	Integer|

</center>

In [2]:
def csvtoDoL(filepath, int_columns,str_columns):
    '''
    Reads a CSV file and converts it to a Dictionary of Lists (DoL).

    Parameters:
    filepath (str): The path to the CSV file
    int_columns (list): The list of column names that should be converted to integers.
    str_columns (list): The list of column names that should be kept as strings.

    Returns:
    dict: A Dictionary of Lists with the column names as keys and the corresponding values as lists.

    '''
    with open(filepath, 'r') as f: 
        reader = csv.reader(f)
        headers = next(reader)
        DoL = {}
        # Convert the column name of the first row to the Key of the dictionary,each key include a list
        for col in headers:
            DoL[col] = []
        # Determining data types and special cases
        for row in reader:
            for i in range(len(row)):
                # Convert all empty spaces '' and '-' to 0.0
                if row[i] == '' or row[i] == '-':
                    row[i] = 0.0
                # Converts the data type of specific columns(str_columns) to string
                elif headers[i] in str_columns:
                    row[i] = str(row[i])
                # All other data are converted to Float
                else:
                    row[i] = float(row[i])
            # Add each value to the corresponding Key of the dictionary
            for i in range(len(row)):
                DoL[headers[i]].append(row[i])
        # Converts the data type of specific columns (int_columns) to integer
        for col in int_columns:
            DoL[col] = list(map(int, DoL[col]))
    return DoL

We wrote this function to convert the CSV data into a DoL. Because the file uses special values such as `''` and `'-'` to represent missing data, we replace them with `0.0` as a placeholder. We also convert each column to the appropriate data type so that later steps work with correctly typed data.
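
To make the DoL idea concrete, here is a minimal sketch that runs on a tiny in-memory CSV instead of the real file. The helper name `rows_to_dol` and the sample values are made up for illustration; it uses `csv.DictReader` rather than the index-based loop above, so it is a variation on the approach, not the function used in this notebook.

```python
import csv
import io

# Tiny in-memory CSV standing in for the real file (made-up values).
sample = io.StringIO(
    "Name,GP,PTS\n"
    "Player A,44,3.6\n"
    "Player B,-,\n"
)

def rows_to_dol(handle, str_columns, int_columns):
    """Build a dictionary of lists, using 0.0 for '' and '-' cells."""
    reader = csv.DictReader(handle)
    dol = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, cell in row.items():
            if cell in ("", "-"):
                value = 0.0          # placeholder for missing data
            elif name in str_columns:
                value = cell         # keep text as-is
            else:
                value = float(cell)  # everything else is numeric
            if name in int_columns:
                value = int(value)   # e.g. games played
            dol[name].append(value)
    return dol

dol = rows_to_dol(sample, str_columns=["Name"], int_columns=["GP"])
print(dol)
```

Each key of `dol` is a column name and each value is that column's list, which is the shape `pd.DataFrame` accepts directly.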

In [3]:
# store the dictionary of list for "NBA_Rookies_by_Year_Hall_of_Fame_Class.csv" to 'DoL_hall_of_fame'
filepath1 = os.path.join(datadir, "data", "NBA_Rookies_by_Year_Hall_of_Fame_Class.csv")

DoL_hall_of_fame = csvtoDoL(filepath1,['index','GP','Hall_of_Fame_Class','Year_Drafted'],['Name'])

#print(DoL_hall_of_fame)

We use `os.path.join` to build a relative path to `NBA_Rookies_by_Year_Hall_of_Fame_Class.csv` and convert the data into a DoL with the function above, `csvtoDoL`. We store this DoL in the variable `DoL_hall_of_fame`.

In [4]:
#convert the dictionary of lists in 'DoL_hall_of_fame' to a data frame named 'dataframe_hall_of_fame'

dataframe_hall_of_fame = pd.DataFrame(DoL_hall_of_fame)
dataframe_hall_of_fame

Unnamed: 0,index,Name,Hall_of_Fame_Class,Year_Drafted,GP,MIN,PTS,FGM,FGA,FG%,...,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV,EFF
0,0,Jeff Taylor,0,1982,44,17.6,3.6,1.5,3.6,40.0,...,1.0,65.2,0.6,1.2,1.8,2.5,0.9,0.3,1.4,5.2
1,1,Charles Smith,0,1988,71,30.4,16.3,6.1,12.4,49.5,...,5.5,72.5,2.4,4.1,6.5,1.5,1.0,1.3,2.1,16.7
2,2,Mark Davis,0,1988,33,7.8,3.8,1.5,3.1,48.0,...,1.0,82.4,0.5,0.6,1.1,0.4,0.4,0.1,0.4,3.8
3,3,Charles Smith,0,1989,60,8.7,2.9,1.0,2.2,44.4,...,1.3,69.7,0.2,0.9,1.2,1.7,0.6,0.1,0.6,4.1
4,4,Michael Smith,0,1989,65,9.5,5.0,2.1,4.4,47.6,...,1.0,82.8,0.6,0.9,1.5,1.2,0.1,0.0,0.8,4.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1532,1532,Taurean Prince,0,2016,20,10.0,3.4,1.2,3.1,39.3,...,0.7,85.7,0.2,1.8,2.0,0.7,0.3,0.3,0.7,4.0
1533,1533,Tomas Satoransky,0,2016,23,14.7,3.1,1.3,3.2,39.7,...,0.7,64.7,0.4,1.1,1.5,2.2,0.5,0.0,0.9,4.2
1534,1534,Troy Williams,0,2016,24,17.4,5.3,2.1,5.1,41.8,...,1.0,60.0,0.3,1.6,1.8,0.8,1.0,0.4,1.1,4.8
1535,1535,Wade Baldwin IV,0,2016,22,13.5,3.5,1.2,3.8,31.3,...,1.3,82.1,0.3,1.1,1.5,2.1,0.6,0.3,1.3,3.8


We convert the DoL, `DoL_hall_of_fame`, into a dataframe with `pandas` and store it in another variable, `dataframe_hall_of_fame`. The output above lets us check the data.

In the next step, we can set `Name` as the index of the dataframe, and we want to split the players in this file into two lists: players who were elected to the Hall of Fame and players who have not been elected yet.
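
The split described here can be sketched with a boolean mask. The rows below are made up, and the sketch assumes, as in the dataframe shown above, that a `Hall_of_Fame_Class` of `0` means the player has not been elected.

```python
import pandas as pd

# Made-up rows that mimic the two columns of interest.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Hall_of_Fame_Class": [0, 2009, 0],
})

# A non-zero class year means the player has been elected.
is_member = toy["Hall_of_Fame_Class"] != 0

members = toy.loc[is_member, "Name"].tolist()
non_members = toy.loc[~is_member, "Name"].tolist()
print(members, non_members)
```

The same mask, negated with `~`, yields the complementary list, so the two lists always cover every row exactly once.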

### Part C: Making the Dataframe of the Information from `playoffStats.csv` by Using LoL

In `playoffStats.csv`, we can find NBA playoff player statistics from 1980 to 2022. We will use this file to find the playoff data of each player who was selected into the Hall of Fame between 1999 and 2016. Then, based on the characteristics these Hall of Fame players share, we can find out which players are likely to be inducted into the Hall of Fame.

Therefore, we need to use this file to build a subset that records the playoff data of players who have entered the Hall of Fame. Then we will rank the playoff performance of players who are not Hall of Famers against the Hall of Fame player data identified from the previous file. The higher the ranking, the greater the chance of being selected into the Hall of Fame.

We can find the dataset from this URL: https://www.kaggle.com/datasets/robertsunderhaft/nba-playoffs

From our proposal, we know some basic information about this file:

a)	In this dataset, there is one CSV file, named `playoffStats.csv`, and we will use it.

b)	In this CSV file, there are 51 columns and 8167 rows.

We will use the following columns to evaluate a player's performance in each year's playoffs:

<center>

| Variable Name   |      Variable Introduction      |  Variable Type |
|:----------:|:-------------:|:------:|
|season	|NBA Season. 2022 would represent the 2021-2022 season.	|Integer|
|player	|Player name	|String|
|pos	|Player position	|String|
|age	|Player Age	|Integer|
|team_id	|Player team	|String|
|g	|Number of playoff games in season played	|Integer|
|gs	|Number of playoff games started in season	|Integer|
|mp_per_g	|Average minutes played	|Float|
|fg_per_g	|Average field goals made	|Float|
|fga_per_g	|Average field goals attempted	|Float|
|fg_pct	|Average field goal percentage	|Float|
|fg3_per_g	|Average three point shots made	|Float|
|fg3a_per_g	|Average three point shots attempted	|Float|
|fg3_pct	|Average three point percentage	|Float|
|fg2_per_g	|Average two point shots made	|Float|
|fg2a_per_g	|Average two point shots attempted	|Float|
|fg2_pct	|Average two point shooting percentage	|Float|
|efg_pct	|Effective shooting percentage	|Float|
|ft_per_g	|Free throws made per game	|Float|
|fta_per_g	|Free throws attempted per game	|Float|
|ft_pct	|Free throw percentage per game	|Float|
|orb_per_g	|Offensive rebounds per game	|Float|
|drb_per_g	|Defensive rebounds per game	|Float|
|trb_per_g	|Total rebounds per game	|Float|
|ast_per_g	|Assists per game	|Float|
|stl_per_g	|Steals per game	|Float|
|blk_per_g	|Blocks per game	|Float|
|tov_per_g	|Turnovers per game	|Float|
|pf_per_g	|Personal fouls per game	|Float|
|pts_per_g	|Points per game	|Float|
|ast_pct	|Assist percentage per game	|Float|
|blk_pct	|Block percentage per game	|Float|
|bpm	|Box plus minus	|Float|
|dbpm	|Defensive box plus minus	|Float|
|drb_pct	|Defensive rebounding percentage	|Float|
|dws	|Defensive win share	|Float|
|fg3a_per_fga_pct	|Three point shot attempts per field goal attempted|	Float|
|fta_per_fga_pct	|Free throw attempted per field goal attempted percentage|	Float|
|mp	|Total minutes played	|Float|
|obpm	|Offensive box plus minus	|Float|
|orb_pct	|Offensive rebounding percentage|	Float|
|ows	|Offensive win share|	Float|
|per	|Player Efficiency Rating	|Float|
|stl_pct	|Steal percentage	|Float|
|tov_pct	|Turnover percentage|	Float|
|trb_pct	|Total rebound percentage|	Float|
|ts_pct	|True shooting percentage|	Float|
|usg_pct	|Usage percentage	|Float|
|vorp	|Value Over Replacement Player	|Float|
|ws	|Win Share	|Float|


</center>


In [5]:
def readTopNamesLoL(path):
    ''' 
    reads the file and creates a LoL representation, returning both the list of column names and the list of lists structure from the function
    '''
    # read the file
    with open(path, 'r') as f:
        # make the header of LoL
        headers = f.readline().strip().split(",")

        # make the LoL
        LoL = []
        for line in f:
            LoL.append(line.strip().split(","))

        # change every blank cell into 0.0
        for x in range(len(LoL)):
            for y in range(0,51):
                if LoL[x][y] == "":
                    LoL[x][y] = 0.0

        # change data type for each column
        for i in range(len(LoL)):
            LoL[i][0] = int(LoL[i][0])
            LoL[i][3] = int(LoL[i][3])
            LoL[i][5] = int(LoL[i][5])
            LoL[i][6] = int(LoL[i][6])
            for j in range(7,51):
                LoL[i][j] = float(LoL[i][j])
            
    return headers, LoL

We wrote this function, `readTopNamesLoL`, to convert the file into a LoL. We replace every blank cell with `0.0` and convert each column to string, int, or float as appropriate.
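
For comparison, the sketch below builds a LoL with `csv.reader` instead of splitting lines on commas, so a quoted field that contains a comma stays in one cell. The helper name `read_lol`, the column positions, and the sample rows are assumptions made for illustration, and unlike the function above it leaves blank text cells as empty strings.

```python
import csv
import io

# Made-up sample; the quoted name contains a comma on purpose.
sample = io.StringIO(
    'season,player,g,pts_per_g\n'
    '2022,"Doe, Jr.",9,\n'
    '2021,Player B,12,7.5\n'
)

def read_lol(handle, int_idx, str_idx):
    """Return (headers, list of lists) with blank numeric cells set to 0.0."""
    reader = csv.reader(handle)
    headers = next(reader)
    lol = []
    for row in reader:
        converted = []
        for i, cell in enumerate(row):
            if i in str_idx:
                converted.append(cell)         # text column, kept as-is
            elif cell == "":
                converted.append(0.0)          # placeholder for a blank cell
            elif i in int_idx:
                converted.append(int(cell))    # whole-number column
            else:
                converted.append(float(cell))  # remaining numeric columns
        lol.append(converted)
    return headers, lol

headers, lol = read_lol(sample, int_idx={0, 2}, str_idx={1})
print(headers)
print(lol)
```

Passing the column positions as arguments, rather than hard-coding them, would let the same helper read files with a different layout.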

In [6]:
# Make the LoL of NBA playoff data
filepath2 = os.path.join(datadir, "data", "playoffStats.csv")
playoff = readTopNamesLoL(filepath2)

We use `os.path.join` to build a relative path to `playoffStats.csv` and convert the data into a LoL with the function above, `readTopNamesLoL`. We store the result in the variable `playoff`, a tuple that holds the list of headers and the LoL.

In [7]:
#Make the dataframe based on LoL
dataframe_playoff = pd.DataFrame(playoff[1], columns=playoff[0])
dataframe_playoff

Unnamed: 0,season,player,pos,age,team_id,g,gs,mp_per_g,fg_per_g,fga_per_g,...,ows,per,stl_pct,tov_pct,trb_pct,ts_pct,usg_pct,vorp,ws,ws_per_48
0,2022,Omer Yurtseven,C,23,MIA,9,0,4.2,1.3,2.0,...,0.1,25.8,0.0,0.0,10.8,0.647,22.6,0.1,0.2,0.228
1,2022,Kessler Edwards,SF,21,BRK,2,0,3.5,0.0,0.0,...,0.0,-2.2,7.3,100.0,0.0,0.000,6.6,0.0,0.0,-0.104
2,2022,Draymond Green,PF,31,GSW,22,22,32.0,3.1,6.5,...,0.4,12.3,1.8,26.4,12.7,0.534,14.0,0.5,1.4,0.094
3,2022,Danny Green,SF,34,PHI,12,12,26.6,3.0,7.4,...,0.0,9.9,2.0,12.7,7.1,0.576,15.1,0.3,0.3,0.049
4,2022,Devonte' Graham,PG,26,NOP,6,0,10.0,1.0,3.0,...,0.1,11.5,0.9,15.7,8.9,0.558,18.3,0.0,0.1,0.049
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8162,1980,Steve Mix,SF,32,PHI,17,0,11.8,2.6,5.6,...,0.1,16.1,1.9,12.2,8.5,0.522,26.0,0.1,0.4,0.098
8163,1980,Dave Meyers,PF,26,MIL,7,0,27.9,3.7,8.9,...,-0.2,10.1,2.3,16.6,10.1,0.439,19.4,0.0,0.1,0.026
8164,1980,Billy McKinney,PG,24,KCK,3,0,11.3,0.7,1.7,...,0.0,6.6,0.0,37.5,4.7,0.400,9.7,0.0,0.0,-0.028
8165,1980,Jim McElroy,PG,26,ATL,5,0,6.4,0.8,1.8,...,0.0,12.2,0.0,15.2,3.5,0.536,17.7,0.0,0.0,0.053


We convert the LoL stored in `playoff` into a dataframe with `pandas` and store it in another variable, `dataframe_playoff`. The output above lets us check the data.

In the next step, we need to find a way to evaluate a player's playoff performance with the data we have, and to decide how to identify potential candidates using the standard we build.

### Part D: Making the Dataframe of the Information from `Seasons_Stats.csv` Directly with a .csv File

In `Seasons_Stats.csv`, we can find regular season NBA player statistics from 1950 to 2017. We will use this file to find the regular season data of each player who was selected into the Hall of Fame between 1999 and 2016. Then, based on the characteristics these Hall of Fame players share, we can find out which players are likely to be inducted into the Hall of Fame.

Therefore, we need to use this file to build a subset that records the regular season data of players who have entered the Hall of Fame. Then we will rank the regular season performance of players who are not Hall of Famers against the Hall of Fame player data identified from the first file. The higher the ranking, the greater the chance of being selected into the Hall of Fame.

We can find the dataset from this URL: https://www.kaggle.com/datasets/drgilermo/nba-players-stats?select=Seasons_Stats.csv

From our proposal, we know some basic information about this file:

a)	In this dataset, there are three CSV files and we will use one of them named `Seasons_Stats.csv`.

b)	In this CSV file, there are 53 columns and 24691 rows.

We will use the following columns to evaluate a player's performance in each regular season:

<center>

| Variable Name   |      Variable Introduction      |  Variable Type |
|:----------:|:-------------:|:------:|
|Year	|Season	|Integer|
|Player|	Name	|String|
|Pos	|Position	|String|
|Age	|Age	|Float|
|Tm	|Team Name	|String|
|G	|The number of games played	|Integer|
|GS	|The number of games Started	|Integer|
|MP	|Minutes Played|	Float|
|PER	|Player Efficiency Rating|	Float|
|TS%	|True Shooting %	|Float|
|3PAr	|3-Point Attempt Rate|	Float|
|FTr	|Free Throw Rate	|Float|
|ORB%	|Offensive Rebound Percentage	|Float|
|DRB%	|Defensive Rebound Percentage	|Float|
|TRB%	|Total Rebound Percentage|	Float|
|AST%	|Assist Percentage|	Float|
|STL%	|Steal Percentage	|Float|
|BLK%	|Block Percentage	|Float|
|TOV%	|Turnover Percentage	|Float|
|USG%	|Usage Percentage|	Float|
|blanl	|empty	|Float|
|OWS	|Offensive Win Shares	|Float|
|DWS	|Defensive Win Shares	|Float|
|WS	|Win Shares	|Float|
|WS/48	|Win Shares Per 48 Minutes	|Float|
|blank2	|empty	|Float|
|OBPM	|Offensive Box Plus/Minus	|Float|
|DBPM	|Defensive Box Plus/Minus	|Float|
|BPM	|Box Plus/Minus|	Float|
|VORP	|Value Over Replacement|	Float|
|FG|	Field Goals	|Float|
|FGA	|Field Goal Attempts	|Float|
|FG%	|Field Goal Percentage	|Float|
|3P|	3-Point Field Goals	|Float|
|3PA	|3-Point Field Goal Attempts	|Float|
|3P%	|3-Point Field Goal Percentage	|Float|
|2P	|2-Point Field Goals	|Float|
|2PA	|2-Point Field Goal Attempts	|Float|
|2P%	|2-Point Field Goal Percentage	|Float|
|eFG%	|Effective Field Goal Percentage	|Float|
|FT|	Free Throws	|Float|
|FTA	|Free Throw Attempts	|Float|
|FT%	|Free Throw Percentage	|Float|
|ORB	|Offensive Rebounds|	Float|
|DRB	|Defensive Rebounds	|Float|
|TRB	|Total Rebounds	|Float|
|AST	|Assists	|Float|
|STL	|Steals	|Float|
|BLK	|Blocks	|Float|
|TOV	|Turnovers	|Float|
|PF	|Personal Fouls|	Float|
|PTS	|Points	|Float|

</center>


In [8]:
#Make the dataframe directly
filepath3 = os.path.join(datadir, "data", "regular_season_since_1950", "Seasons_Stats.csv")
df_season_stats = pd.read_csv(filepath3)
df_season_stats

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,...,0.705,,,,176.0,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,...,0.708,,,,109.0,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,...,0.698,,,,140.0,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,...,0.559,,,,20.0,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,...,0.548,,,,20.0,,,,27.0,59.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24686,24686,2017.0,Cody Zeller,PF,24.0,CHO,62.0,58.0,1725.0,16.7,...,0.679,135.0,270.0,405.0,99.0,62.0,58.0,65.0,189.0,639.0
24687,24687,2017.0,Tyler Zeller,C,27.0,BOS,51.0,5.0,525.0,13.0,...,0.564,43.0,81.0,124.0,42.0,7.0,21.0,20.0,61.0,178.0
24688,24688,2017.0,Stephen Zimmerman,C,20.0,ORL,19.0,0.0,108.0,7.3,...,0.600,11.0,24.0,35.0,4.0,2.0,5.0,3.0,17.0,23.0
24689,24689,2017.0,Paul Zipser,SF,22.0,CHI,44.0,18.0,843.0,6.9,...,0.775,15.0,110.0,125.0,36.0,15.0,16.0,40.0,78.0,240.0


We use `os.path.join` to build a relative path to `Seasons_Stats.csv` and read the data directly into a dataframe, `df_season_stats`, with `pandas`. The output above lets us check the data.
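
The `Unnamed: 0` column in the output suggests the file stores a saved row index as its first column. As a small sketch on made-up data, `read_csv` can be told to treat that column as the index:

```python
import io
import pandas as pd

# Made-up CSV whose first, unnamed column is a saved row index.
sample = io.StringIO(
    ",Year,Player,PTS\n"
    "0,1950.0,Player A,458.0\n"
    "1,1950.0,Player B,279.0\n"
)

# index_col=0 treats the first column as the index instead of data,
# so no "Unnamed: 0" column appears in the result.
df = pd.read_csv(sample, index_col=0)
print(list(df.columns))
```

The positional `iloc` slices later in this notebook count that extra column, so they would need to shift by one if this option were used.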

In the next step, we need to find a way to evaluate a player's regular season performance with the data we have, and to decide how to identify potential candidates using the standard we build. We also need to consider how to combine playoff and regular season performance to show a player's candidacy for the Hall of Fame.

### Part E: Make a Tidy Dataframe of Hall of Fame members

From `dataframe_hall_of_fame`, we want to know the names of the players who entered the Hall of Fame and the year each was drafted.

Therefore, we keep only the `Name` and `Year_Drafted` columns for players whose `Hall_of_Fame_Class` value is not `0`.

In this data frame, the player's `Name` is the independent variable, and the player's `Year_Drafted` is the dependent variable.

In [9]:
df_hall_of_fame_member = dataframe_hall_of_fame.loc[dataframe_hall_of_fame['Hall_of_Fame_Class']!=0,["Name","Year_Drafted"]]
df_hall_of_fame_member

Unnamed: 0,Name,Year_Drafted
34,Kevin McHale,1980
74,Isiah Thomas,1981
111,Dominique Wilkins,1982
115,James Worthy,1982
147,Clyde Drexler,1983
168,Ralph Sampson,1983
185,Charles Barkley,1984
194,Hakeem Olajuwon,1984
199,John Stockton,1984
208,Michael Jordan,1984


### Part F: Make a Tidy Dataframe of Ratings of Players Based on Playoff Data

For the first step, we use `merge` to combine the playoff data with the Hall of Fame information from the previous part.

We label Hall of Famers with `1` in a new column called `hall_of_fame` and all other players with `0` in the same column.

We drop the `Name` column, since after the left merge it only contains the names of Hall of Famers.

In this way, we have two dataframes:

`playoff_hall_of_fame` contains the playoff data of all players who have entered the Hall of Fame.

`playoff_Not_hall_of_fame` contains the playoff data of all players who have not entered the Hall of Fame yet.

In [10]:
#label the playoff data of Hall of Fame members
merged = pd.merge(dataframe_playoff, df_hall_of_fame_member, left_on='player', right_on='Name', how='left')
merged['hall_of_fame'] = merged['Name'].apply(lambda x: 1 if pd.notnull(x) else 0)
merged.drop('Name', axis=1, inplace=True)
playoff_hall_of_fame = merged.loc[merged['hall_of_fame']==1].drop(['Year_Drafted','hall_of_fame'],axis=1)
playoff_Not_hall_of_fame = merged.loc[merged['hall_of_fame']==0].drop(['Year_Drafted','hall_of_fame'],axis=1)
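
The same labeling can be sketched without a join. The version below uses `isin` on made-up data; it is an alternative to the `merge` above, not the method used in this notebook, and it leaves out `Year_Drafted`, which the cell above drops anyway.

```python
import pandas as pd

# Made-up playoff rows and Hall of Fame names.
playoff = pd.DataFrame({
    "player": ["Player A", "Player B", "Player A"],
    "season": [1990, 1990, 1991],
})
members = pd.DataFrame({"Name": ["Player A"], "Year_Drafted": [1984]})

# isin() labels rows without joining, so the row count cannot change
# even if a name were listed twice among the members.
playoff["hall_of_fame"] = playoff["player"].isin(members["Name"]).astype(int)

in_hall = playoff[playoff["hall_of_fame"] == 1]
not_in_hall = playoff[playoff["hall_of_fame"] == 0]
print(len(in_hall), len(not_in_hall))
```

Both approaches match on the name string alone, so two different players who share a name would receive the same label.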

For the second step, we need a standard that tells us what Hall of Fame level looks like.

Thus, we use `playoff_hall_of_fame` and compute the average of each statistic column in this data frame. This gives us the `mean` of each playoff statistic across the Hall of Famers' seasons.

Alongside the `mean`, we also take the `max` and `min` of each column, so that every statistic has a Hall of Fame standard made of three values.

This gives us a range for selecting potential Hall of Famers in the next step.

In [11]:
#make a standard of Hall of Famers
columns_to_calculate_mean = playoff_hall_of_fame.iloc[:, 5:]
head = list(columns_to_calculate_mean.columns)
means = list(columns_to_calculate_mean.mean().round(2))
max_point = list(columns_to_calculate_mean.max().round(2))
min_point = list(columns_to_calculate_mean.min().round(2))
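
As a compact variation, the three thresholds can be kept together in one table indexed by statistic name. The values below are made up, and `standard` is a name introduced only for this sketch.

```python
import pandas as pd

# Made-up per-season stat lines for Hall of Fame players.
hof = pd.DataFrame({
    "pts_per_g": [20.0, 30.0, 25.0],
    "ast_per_g": [5.0, 7.0, 9.0],
})

# One row per statistic, one column per threshold.
standard = hof.agg(["mean", "max", "min"]).T.round(2)
print(standard)
```

Keeping the statistic names as the index means a threshold can be looked up by name, for example `standard.loc["pts_per_g", "max"]`, instead of by list position.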

For the third step, we use the standard built from the Hall of Famers to rate each player who has not entered the Hall of Fame yet.

For each row, we check each statistic column against the range we made:

The player's total rating for the season increases by one if the value is greater than the Hall of Fame mean.

The total rating increases by a further two if the value is also greater than the Hall of Fame maximum.

The total rating decreases by one if the value is smaller than the Hall of Fame minimum.

Through this process, we obtain a total rating for every playoff season of each player and store it in a column called `rating`.

We use `playoff_rating` as a dataframe to store the data with the columns `season`, `player`, `pos`, `age`, `team_id`, and `rating`.

In [12]:
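#rate every player who has not entered the Hall of Fame yet by the standard we made
# .copy() makes an independent frame, so adding 'rating' does not raise SettingWithCopyWarning
columns_to_calculate_mean_1 = playoff_Not_hall_of_fame.iloc[:, 2:].copy()
columns_to_calculate_mean_1['rating'] = 0
# columns are paired by position: this frame starts at `pos`, the thresholds start at `g`,
# so column i here sits two columns to the right of the statistic behind means[i-5]
for i in range(5, len(head)+1):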
    columns_to_calculate_mean_1.loc[columns_to_calculate_mean_1.iloc[:, i] > means[i-5], 'rating'] += 1
    columns_to_calculate_mean_1.loc[columns_to_calculate_mean_1.iloc[:, i] > max_point[i-5], 'rating'] += 2
    columns_to_calculate_mean_1.loc[columns_to_calculate_mean_1.iloc[:, i] < min_point[i-5], 'rating'] -= 1

playoff_rating = pd.concat([playoff_Not_hall_of_fame.iloc[:, :5], columns_to_calculate_mean_1.iloc[:, -1]], axis=1)
playoff_rating

Unnamed: 0,season,player,pos,age,team_id,rating
0,2022,Omer Yurtseven,C,23,MIA,22
1,2022,Kessler Edwards,SF,21,BRK,9
2,2022,Draymond Green,PF,31,GSW,40
3,2022,Danny Green,SF,34,PHI,28
4,2022,Devonte' Graham,PG,26,NOP,22
...,...,...,...,...,...,...
8162,1980,Steve Mix,SF,32,PHI,31
8163,1980,Dave Meyers,PF,26,MIL,33
8164,1980,Billy McKinney,PG,24,KCK,9
8165,1980,Jim McElroy,PG,26,ATL,14
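
The scoring rule can also be sketched so that each statistic is matched to its thresholds by column name rather than by position. Everything below runs on made-up numbers; `standard` and `candidates` are illustrative names, and the sketch is not the cell that produced the ratings above.

```python
import pandas as pd

# Made-up thresholds for two statistics (mean / max / min).
standard = pd.DataFrame(
    {"mean": [25.0, 7.0], "max": [30.0, 9.0], "min": [20.0, 5.0]},
    index=["pts_per_g", "ast_per_g"],
)

# Made-up season lines for players outside the Hall of Fame.
candidates = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "pts_per_g": [31.0, 26.0, 10.0],
    "ast_per_g": [6.0, 8.0, 4.0],
})

stats = candidates[standard.index]  # same statistics, same order

# With axis=1 the thresholds are aligned to columns by name.
above_mean = stats.gt(standard["mean"], axis=1)
above_max = stats.gt(standard["max"], axis=1)
below_min = stats.lt(standard["min"], axis=1)

# +1 above the mean, a further +2 above the max, -1 below the min.
candidates["rating"] = (
    above_mean.sum(axis=1) + 2 * above_max.sum(axis=1) - below_min.sum(axis=1)
)
print(candidates[["player", "rating"]])
```

Matching by name keeps every comparison tied to the right statistic even when the two frames are sliced at different column offsets.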


For the fourth step, we use `groupby` to average the ratings of all of a player's seasons into a final playoff rating, keeping only the `player` and `rating` columns.

We sort the entire dataframe by `rating` and store the 100 players with the highest playoff ratings in `playoff_avrage_rate_top100`.

In this data frame, `player` is the independent variable, and the player's `rating` is the dependent variable.

In [13]:
#get the top 100 players in playoffs
playoff_avrage_rate = playoff_rating.groupby('player')['rating'].mean().reset_index().round(2)
playoff_avrage_rate_sort=playoff_avrage_rate.sort_values(by=["rating"],ascending=False)
playoff_avrage_rate_top100=playoff_avrage_rate_sort.head(100)

playoff_avrage_rate_top100

Unnamed: 0,player,rating
1185,LeBron James,54.87
0,Luka Doncic,52.33
1436,Nikola Jokic,52.25
685,Giannis Antetokounmpo,49.86
92,Anthony Davis,49.00
...,...,...
1143,Kyle Lowry,38.44
1285,Mark Price,38.43
1180,Latrell Sprewell,38.40
37,Albert King,38.40
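
The group-and-rank step can be sketched on made-up ratings as follows. `nlargest` is used here as a shorthand for sorting in descending order and taking the head; it is an equivalent formulation, not the cell above.

```python
import pandas as pd

# Made-up per-season ratings.
ratings = pd.DataFrame({
    "player": ["Player A", "Player A", "Player B", "Player C"],
    "rating": [40, 50, 30, 48],
})

# Average each player's seasons, then keep the n highest averages.
average = ratings.groupby("player")["rating"].mean().round(2)
top2 = average.nlargest(2)
print(top2)
```

Because the ratings are averaged, a player with a single strong season (Player C here) is ranked on the same scale as a player with many seasons.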


### Part G: Make a Tidy Dataframe of Ratings of Players Based on Regular Season Data 

For the first step, we keep only the rows from 1980 onward, since every player in the Hall of Fame file was drafted in 1980 or later.

We use `merge` to combine the regular season data with the Hall of Fame information from Part E.

We label Hall of Famers with `1` in a new column called `hall_of_fame` and all other players with `0` in the same column.

We drop the `Name` column, since after the left merge it only contains the names of Hall of Famers.

In this way, we have two dataframes:

`regular_hall_of_fame` contains the regular season data of all players who have entered the Hall of Fame.

`regular_Not_hall_of_fame` contains the regular season data of all players who have not entered the Hall of Fame yet.

In [14]:
df_season_stats=df_season_stats.loc[df_season_stats['Year']>=1980]
#label the regular season data of Hall of Fame members
merged = pd.merge(df_season_stats, df_hall_of_fame_member, left_on='Player', right_on='Name', how='left')
merged['hall_of_fame'] = merged['Name'].apply(lambda x: 1 if pd.notnull(x) else 0)
merged.drop('Name', axis=1, inplace=True)
regular_hall_of_fame=merged.loc[merged['hall_of_fame']==1].drop(['Year_Drafted','hall_of_fame'],axis=1)
regular_Not_hall_of_fame=merged.loc[merged['hall_of_fame']==0].drop(['Year_Drafted','hall_of_fame'],axis=1)

For the second step, we need a standard that tells us what Hall of Fame level looks like.

Thus, we use `regular_hall_of_fame` and compute the average of each statistic column in this data frame. This gives us the `mean` of each regular season statistic across the Hall of Famers' seasons.

Alongside the `mean`, we also take the `max` and `min` of each column, so that every statistic has a Hall of Fame standard made of three values.

This gives us a range for selecting potential Hall of Famers in the next step.

In [15]:
#make a standard of Hall of Famers
columns_to_calculate = regular_hall_of_fame.iloc[:, 6:]
head = list(columns_to_calculate.columns)
regular_mean = list(columns_to_calculate.mean().round(2))
regular_max = list(columns_to_calculate.max().round(2))
regular_min = list(columns_to_calculate.min().round(2))
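
The regular season file lists two columns, `blanl` and `blank2`, as empty. Assuming they really hold no values, the sketch below shows on made-up data how such a column behaves in the thresholds and how it could be dropped.

```python
import pandas as pd

# Made-up frame with an all-empty column, like `blanl` in the source file.
hof = pd.DataFrame({
    "PTS": [1500.0, 2000.0],
    "blanl": [float("nan"), float("nan")],
})

means = hof.mean()
print(means)

# Any comparison against NaN is False, so this column never adds points.
candidate_value = 5.0
print(candidate_value > means["blanl"])

# Dropping all-empty columns makes that explicit.
cleaned = hof.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Dropping such columns before computing the thresholds does not change any comparison result, but it removes thresholds that can never be met.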

For the third step, we use the standard built from the Hall of Famers to rate each player who has not entered the Hall of Fame yet.

For each row, we check each statistic column against the range we made:

The player's total rating for the season increases by one if the value is greater than the Hall of Fame mean.

The total rating increases by a further two if the value is also greater than the Hall of Fame maximum.

Through this process, we obtain a total rating for every regular season of each player and store it in a column called `rating`.

We use `regular_rating` as a dataframe to store the data with the columns `Year`, `Player`, and `rating`.

In [16]:
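#rate every player who has not entered the Hall of Fame yet by the standard we made
# .copy() makes an independent frame, so adding 'rating' does not raise SettingWithCopyWarning
columns_to_calculate_1 = regular_Not_hall_of_fame.iloc[:, 6:].copy()
columns_to_calculate_1['rating'] = 0

# columns are paired by position: column i of this frame is compared with threshold i-6,
# and the last iteration (i == len(head)) reads the 'rating' column itself
for i in range(6, len(head)+1):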
    columns_to_calculate_1.loc[columns_to_calculate_1.iloc[:, i] > regular_mean[i-6], 'rating'] += 1
    columns_to_calculate_1.loc[columns_to_calculate_1.iloc[:, i] > regular_max[i-6], 'rating'] += 2

regular_rating = pd.concat([regular_Not_hall_of_fame.iloc[:, 1:3], columns_to_calculate_1.iloc[:, -1]], axis=1)
regular_rating

Unnamed: 0,Year,Player,rating
0,1980.0,Kareem Abdul-Jabbar,42
1,1980.0,Tom Abernethy,23
2,1980.0,Alvan Adams,34
3,1980.0,Tiny Archibald,33
4,1980.0,Dennis Awtrey,27
...,...,...,...
18922,2017.0,Cody Zeller,30
18923,2017.0,Tyler Zeller,26
18924,2017.0,Stephen Zimmerman,25
18925,2017.0,Paul Zipser,31


For the fourth step, we use `groupby` to average the ratings of all of a player's seasons into a final regular season rating, keeping only the `Player` and `rating` columns.

We sort the entire dataframe by `rating` and store the 100 players with the highest regular season ratings in `regular_avrage_rate_top100`.

In this data frame, `Player` is the independent variable, and the player's `rating` is the dependent variable.

In [17]:
#get the top 100 players in regular season
regular_avrage_rate = regular_rating.groupby('Player')['rating'].mean().reset_index().round(2)
regular_avrage_rate_sort=regular_avrage_rate.sort_values(by=["rating"],ascending=False)
regular_avrage_rate_top100=regular_avrage_rate_sort.head(100)
regular_avrage_rate_top100

Unnamed: 0,Player,rating
1654,LeBron James,48.86
1627,Larry Bird,44.85
1543,Kevin Durant,44.60
1724,Magic Johnson,42.69
2327,Russell Westbrook,42.56
...,...,...
1066,Hersey Hawkins,35.46
2492,Steve Nash,35.44
2596,Tom Chambers,35.44
1744,Marc Gasol,35.44


### Part H: Conclusion

We use `merge` to combine `playoff_avrage_rate_top100` and `regular_avrage_rate_top100`.

Since the `Player` and `player` columns hold the same player names, we keep only `player`.

We use `rename` to change the rating column names to `playoff_rate` and `regular_rate`.

We add `playoff_rate` and `regular_rate` to get `sum_rate`.

We sort the data frame by `sum_rate` and drop the `playoff_rate` and `regular_rate` columns.

Finally, we keep the top 50 players and store the result in `top50_overall`.

In this data frame, `player` is the independent variable, and the player's `sum_rate` is the dependent variable.

In [18]:
#sum up the ratings of playoff and regular season together to get the overall rating value of each player
d3 = pd.merge(playoff_avrage_rate_top100, regular_avrage_rate_top100, left_on='player', right_on='Player').drop('Player',axis=1)
d3 = d3.rename(columns={'rating_x': 'playoff_rate', 'rating_y': 'regular_rate'})
d3['sum_rate']=d3['playoff_rate']+d3['regular_rate']
d3 = d3.sort_values(by=['sum_rate'],ascending=False).set_index('player').drop(['playoff_rate','regular_rate'],axis=1)
top50_overall = d3.head(50)
top50_overall

Unnamed: 0_level_0,sum_rate
player,Unnamed: 1_level_1
LeBron James,103.73
Kevin Durant,91.87
Nikola Jokic,91.25
Larry Bird,91.18
Magic Johnson,90.31
Russell Westbrook,89.74
Giannis Antetokounmpo,89.61
James Harden,88.11
Anthony Davis,88.0
Dwyane Wade,87.62


Our dataset only covers players who were elected to the Hall of Fame between 1999 and 2016, so some highly rated players who entered the Hall of Fame between 2017 and 2022, such as Kobe Bryant and Tim Duncan, are not marked as members. In addition, players drafted before 1980, like Magic Johnson and Larry Bird, are not included in `NBA_Rookies_by_Year_Hall_of_Fame_Class.csv`, because this file only records rookies drafted from 1980 to 2016. As a result, these players show up in our playoff and regular season ratings without being counted as members of the Hall of Fame. Their high ratings nevertheless suggest that our rating system is reasonably consistent with the facts.

We now have the top 50 players with the highest overall rating across the playoffs and the regular season. The larger the `sum_rate`, the greater the probability that the player becomes a member of the Hall of Fame. The table above, which orders these potential Hall of Famers, therefore answers our central question.