In [1]:
import pandas as pd


df = pd.read_csv("nba.csv")
df.head(20)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530 entries, 0 to 529
Data columns (total 32 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Rk                 529 non-null    float64
 1   Player             530 non-null    object 
 2   Age                529 non-null    float64
 3   Team               529 non-null    object 
 4   Pos                529 non-null    object 
 5   G                  529 non-null    float64
 6   GS                 529 non-null    float64
 7   MP                 529 non-null    float64
 8   FG                 529 non-null    float64
 9   FGA                529 non-null    float64
 10  FG%                527 non-null    float64
 11  3P                 529 non-null    float64
 12  3PA                529 non-null    float64
 13  3P%                493 non-null    float64
 14  2P                 529 non-null    float64
 15  2PA                529 non-null    float64
 16  2P%                519 non

I loaded the CSV into a DataFrame and ran `df.head(20)` to get a feel for the data. Next, I ran `df.info()` to check the column data types and look for null values. The `Awards` column is completely empty, and there are some missing values in `3P%`, `2P%`, and `FT%`, which are all shooting-percentage columns.
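
A per-column null count can be easier to scan than the full `df.info()` output when looking for missing values. The following is only a minimal sketch: the `sample` frame and its values are made up for illustration and are not taken from nba.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; all values are invented.
sample = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "3P%": [0.35, np.nan, 0.40],
    "FT%": [np.nan, np.nan, 0.80],
    "Awards": [np.nan, np.nan, np.nan],
})

# Count missing values per column and keep only the columns that have any.
missing = sample.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)

# List the columns that are entirely empty.
empty_cols = sample.columns[sample.isna().all()].tolist()
print(empty_cols)
```

On the real DataFrame, the same two expressions would be applied to `df` instead of `sample`.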

In [11]:
df.columns.tolist()

['Player',
 'Age',
 'Team',
 'Pos',
 'G',
 'GS',
 'MP',
 'FG',
 'FGA',
 'FG%',
 '3P',
 '3PA',
 '3P%',
 '2P',
 '2PA',
 '2P%',
 'eFG%',
 'FT',
 'FTA',
 'FT%',
 'ORB',
 'DRB',
 'TRB',
 'AST',
 'STL',
 'BLK',
 'TOV',
 'PF',
 'PTS']

When I tried to drop columns I thought were unnecessary (`Rk`, `Awards`, and `Player-additional`), I got an error. To figure out why, I printed out all the columns using `df.columns.tolist()` to see what pandas actually loaded.  

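It turns out the columns I wanted to drop are no longer in the DataFrame. The earlier `df.info()` output listed 32 columns, including `Rk`, while this list has 29, so the three columns were most likely already removed by an earlier run of the drop cell. There is nothing left to drop, and the dataset has the main player stats ready for analysis.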
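
One way to avoid this kind of error when a cell is re-run is to pass `errors="ignore"` to `DataFrame.drop`, which skips labels that are not present. A small sketch, assuming a made-up `sample` frame that contains only one of the three unwanted columns:

```python
import pandas as pd

# Hypothetical frame: only one of the three unwanted columns is present.
sample = pd.DataFrame({"Rk": [1, 2], "Player": ["A", "B"], "PTS": [10, 20]})

unwanted = ["Rk", "Awards", "Player-additional"]

# errors="ignore" skips labels that are missing instead of raising a KeyError.
cleaned = sample.drop(columns=unwanted, errors="ignore")
print(cleaned.columns.tolist())

# Running the same drop a second time is harmless.
cleaned = cleaned.drop(columns=unwanted, errors="ignore")
```

The trade-off is that a misspelled column name is also skipped silently, so printing the remaining columns afterwards is still worthwhile.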


In [22]:
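df.columns = (
    df.columns
    .str.strip()                                     # remove surrounding whitespace
    .str.lower()                                     # lowercase every name
    .str.replace('%', '_percent', regex=False)       # 'fg%' -> 'fg_percent'
    .str.replace('3p', 'three_points', regex=False)  # '3p' -> 'three_points'
    .str.replace('2p', 'two_points', regex=False)    # '2p' -> 'two_points'
    .str.replace('-', '_', regex=False)              # hyphens -> underscores
)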

df.columns

Index(['player', 'age', 'team', 'pos', 'g', 'gs', 'mp', 'fg', 'fga',
       'fg_percent', 'three_points', 'three_pointsa', 'three_points_percent',
       'two_points', 'two_pointsa', 'two_points_percent', 'efg_percent', 'ft',
       'fta', 'ft_percent', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov',
       'pf', 'pts'],
      dtype='object')

I wanted to make the column names more consistent and cleaner. Every '%' in a column name is changed to '_percent', and '3p' and '2p' are spelled out as words instead of digits. The names are also lowercased and stripped of surrounding whitespace to keep things consistent.
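
To see what each renaming step does, the same chain can be tried on a handful of names first. This is a sketch on a made-up `raw` index, using the replacement words that appear in the output above (`three_points`, `two_points`); the leading space in `" Player"` is an assumption added to show the effect of `strip`.

```python
import pandas as pd

# Hypothetical subset of raw column names, one with stray whitespace.
raw = pd.Index([" Player", "FG%", "3P", "3PA", "3P%", "2P%", "eFG%"])

cleaned = (
    raw
    .str.strip()                                     # drop surrounding whitespace
    .str.lower()                                     # lowercase before matching '3p' / '2p'
    .str.replace("%", "_percent", regex=False)
    .str.replace("3p", "three_points", regex=False)
    .str.replace("2p", "two_points", regex=False)
    .str.replace("-", "_", regex=False)
)
print(cleaned.tolist())
```

Because the replacement is substring-based, `3PA` ends up as `three_pointsa` rather than something like `three_points_attempted`. An explicit rename mapping would be needed for names like that.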

In [28]:
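# Collect every percentage column (names ending in 'percent')
percent_cols = [col for col in df.columns if col.endswith('percent')]

# Replace missing percentages with 0
df[percent_cols] = df[percent_cols].fillna(0)

# Confirm that no missing values remain in those columns
df[percent_cols].isna().sum()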

I wanted to take care of the missing values. The percentage columns (`fg_percent`, `three_points_percent`, etc.) had some missing values.

In NBA stats, this usually happens when a player attempted no shots of that type, so the percentage would be 0 divided by 0 and is left blank.

To handle this, I filled any missing or `NaN` values in the percentage columns with 0. This keeps the data clean and ready for analysis.
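
Before filling with 0, it can be worth checking that the missing percentages really do belong to players with zero attempts. The sketch below uses an invented `sample` frame; the column names follow the cleaned naming above, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; values are invented for illustration.
sample = pd.DataFrame({
    "player": ["A", "B", "C"],
    "three_pointsa": [4.0, 0.0, 2.5],
    "three_points_percent": [0.35, np.nan, 0.40],
})

# Rows where the percentage is missing.
missing_pct = sample["three_points_percent"].isna()

# True if every missing percentage belongs to a player with zero attempts.
only_zero_attempts = (sample.loc[missing_pct, "three_pointsa"] == 0).all()
print(only_zero_attempts)

# Fill the missing percentages with 0 and confirm none remain.
percent_cols = [col for col in sample.columns if col.endswith("percent")]
sample[percent_cols] = sample[percent_cols].fillna(0)
print(sample[percent_cols].isna().sum())
```

If the check comes back false for the real data, some percentages are missing for another reason, and filling them with 0 would understate those players' shooting.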
