In [1]:
import pandas as pd


df = pd.read_csv("nba.csv")
df.head(20)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530 entries, 0 to 529
Data columns (total 32 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Rk                 529 non-null    float64
 1   Player             530 non-null    object 
 2   Age                529 non-null    float64
 3   Team               529 non-null    object 
 4   Pos                529 non-null    object 
 5   G                  529 non-null    float64
 6   GS                 529 non-null    float64
 7   MP                 529 non-null    float64
 8   FG                 529 non-null    float64
 9   FGA                529 non-null    float64
 10  FG%                527 non-null    float64
 11  3P                 529 non-null    float64
 12  3PA                529 non-null    float64
 13  3P%                493 non-null    float64
 14  2P                 529 non-null    float64
 15  2PA                529 non-null    float64
 16  2P%                519 non

I loaded the CSV into a DataFrame and called `df.head(20)` to get a feel for the data. Next I ran `df.info()` to check the column data types and look for null values. The `Awards` column is completely empty, and there are some missing values in `FG%`, `3P%`, `2P%`, and `FT%`, all of which are shooting percentages.
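
As a complement to `df.info()`, missing values can also be counted per column directly. The sketch below uses a small made-up frame (the name `sample` and all values are illustrative, not taken from `nba.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; values are made up for illustration.
sample = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "3P%": [0.35, np.nan, 0.40],
    "Awards": [np.nan, np.nan, np.nan],
})

# Count missing values per column, keeping only columns that have any.
missing = sample.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Sorting the counts puts fully empty columns such as `Awards` at the top, which makes them easy to spot.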

In [11]:
df.columns.tolist()

['Player',
 'Age',
 'Team',
 'Pos',
 'G',
 'GS',
 'MP',
 'FG',
 'FGA',
 'FG%',
 '3P',
 '3PA',
 '3P%',
 '2P',
 '2PA',
 '2P%',
 'eFG%',
 'FT',
 'FTA',
 'FT%',
 'ORB',
 'DRB',
 'TRB',
 'AST',
 'STL',
 'BLK',
 'TOV',
 'PF',
 'PTS']

When I tried to drop columns I thought were unnecessary (`Rk`, `Awards`, and `Player-additional`), I got an error. To figure out why, I printed out all the columns using `df.columns.tolist()` to see what pandas actually loaded.  

It turns out the columns I wanted to drop aren’t in the DataFrame, so there’s no need to drop anything. The dataset already has the main player stats ready for analysis.
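
One way to avoid this kind of error is to check which columns exist before dropping, or to let `drop` skip missing labels with `errors="ignore"`. A minimal sketch on a made-up frame (`sample` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical frame: "Rk" is present, the other two columns are not.
sample = pd.DataFrame({"Rk": [1, 2], "Player": ["A", "B"], "PTS": [10.0, 12.5]})

to_drop = ["Rk", "Awards", "Player-additional"]

# Report which of the requested columns actually exist.
present = [c for c in to_drop if c in sample.columns]
print("present:", present)

# errors="ignore" skips labels that are not in the frame instead of raising KeyError.
sample = sample.drop(columns=to_drop, errors="ignore")
print(sample.columns.tolist())
```

With `errors="ignore"` the cell can be re-run safely even after the columns are already gone.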


In [22]:
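df.columns = (
    df.columns
    .str.strip()  # remove stray whitespace around the names
    .str.lower()
    .str.replace('%', '_percent', regex=False)
    .str.replace('3p', 'three_points', regex=False)  # matches the names in the output below
    .str.replace('2p', 'two_points', regex=False)
    .str.replace('-', '_', regex=False)
)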

df.columns

Index(['player', 'age', 'team', 'pos', 'g', 'gs', 'mp', 'fg', 'fga',
       'fg_percent', 'three_points', 'three_pointsa', 'three_points_percent',
       'two_points', 'two_pointsa', 'two_points_percent', 'efg_percent', 'ft',
       'fta', 'ft_percent', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov',
       'pf', 'pts'],
      dtype='object')

I wanted to make the column names more consistent and cleaner. Every `%` is changed to `_percent`, and `3p` and `2p` are spelled out as words instead of digits. The names are also lowercased and stripped of any surrounding whitespace.
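
The renaming steps can be tried on a handful of names before touching the real DataFrame. This is a sketch only: the `raw` names are a hypothetical subset, and the replacement strings are chosen to reproduce the names shown in the output above.

```python
import pandas as pd

# Hypothetical subset of raw column names, including one with stray whitespace.
raw = pd.Index([" Player", "FG%", "3P", "3PA", "3P%", "2P%", "eFG%"])

cleaned = (
    raw
    .str.strip()                                  # drop surrounding whitespace
    .str.lower()                                  # lowercase everything
    .str.replace("%", "_percent", regex=False)    # literal replacement, no regex
    .str.replace("3p", "three_points", regex=False)
    .str.replace("2p", "two_points", regex=False)
)
print(cleaned.tolist())
```

Passing `regex=False` makes each replacement a plain string substitution, so the result does not depend on the pandas version's default.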

In [29]:
percent_cols = [col for col in df.columns if col.endswith('percent')]
df[percent_cols] = df[percent_cols].fillna(0)

df[percent_cols].isna().sum() 

fg_percent              0
three_points_percent    0
two_points_percent      0
efg_percent             0
ft_percent              0
dtype: int64

I wanted to take care of the missing values. When looking at the data, the columns that contain percentages (`fg_percent`, `three_points_percent`, etc.) had some missing values.

Usually, in NBA stats, this happens when a player did not attempt any shots of that type, so the percentage is undefined (it would be a division by zero) and is left blank.

To handle this, I filled any missing or `NaN` values in the percentage columns with 0. This removes the nulls before analysis, although a 0 then stands for "no attempts" rather than "missed every shot", and it pulls the column averages down slightly.
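
Before filling, it may be worth confirming that the missing percentages really do line up with zero attempts. A minimal sketch on made-up rows (the frame `sample` and its values are hypothetical; the column names follow the renamed columns above):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second player attempted no three-pointers.
sample = pd.DataFrame({
    "three_points": [1.2, 0.0, 0.0],
    "three_pointsa": [3.0, 0.0, 1.5],
    "three_points_percent": [0.400, np.nan, 0.000],
})

# Rows where the percentage is missing.
missing_pct = sample["three_points_percent"].isna()

# True if every missing percentage corresponds to zero attempts.
all_zero_attempts = bool((sample.loc[missing_pct, "three_pointsa"] == 0).all())
print(all_zero_attempts)

# Fill only after the check has been made.
sample["three_points_percent"] = sample["three_points_percent"].fillna(0)
```

If the check came back `False` on the real data, some blanks would have a different cause and filling with 0 would be harder to justify.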


In [33]:
df.describe()


Unnamed: 0,age,g,gs,mp,fg,fga,fg_percent,three_points,three_pointsa,three_points_percent,...,ft_percent,orb,drb,trb,ast,stl,blk,tov,pf,pts
count,529.0,529.0,529.0,529.0,529.0,529.0,530.0,529.0,529.0,530.0,...,530.0,529.0,529.0,529.0,529.0,529.0,529.0,529.0,529.0,529.0
mean,25.994329,28.045369,12.871456,19.238941,3.252174,7.001134,0.450166,1.021928,2.910397,0.304262,...,0.69703,0.941399,2.582042,3.522495,2.07845,0.674858,0.396786,1.106049,1.676371,8.989414
std,4.290357,13.730283,15.368532,9.527224,2.428836,4.946054,0.123046,0.892801,2.29482,0.159902,...,0.247726,0.836845,1.789838,2.441591,1.870107,0.432697,0.410987,0.854379,0.810367,6.889466
min,19.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,23.0,18.0,0.0,11.4,1.3,3.0,0.4,0.3,1.1,0.25,...,0.65575,0.4,1.2,1.8,0.8,0.4,0.1,0.5,1.1,3.5
50%,25.0,31.0,4.0,19.0,2.6,5.6,0.451,0.8,2.5,0.336,...,0.767,0.7,2.2,3.0,1.5,0.6,0.3,0.9,1.7,7.1
75%,28.0,40.0,25.0,27.5,4.7,10.0,0.504,1.6,4.5,0.38675,...,0.838,1.3,3.6,4.9,2.9,0.9,0.5,1.6,2.2,13.1
max,41.0,47.0,46.0,39.5,11.1,22.6,1.0,4.6,11.6,1.0,...,1.0,4.9,9.1,12.2,11.0,2.5,2.6,4.3,4.4,33.4


In [34]:
df['team'].value_counts()

team
IND    24
LAC    20
LAL    19
MEM    19
CHO    19
ATL    19
DET    19
WAS    18
GSW    18
CHI    18
SAC    18
SAS    18
TOR    18
NOP    17
NYK    17
PHI    17
PHO    17
DAL    17
MIL    17
CLE    16
ORL    16
MIA    16
DEN    16
OKC    16
BRK    16
POR    16
UTA    16
MIN    15
BOS    15
HOU    15
2TM     5
3TM     2
Name: count, dtype: int64

In [None]:
I used `df.describe()` to get a quick overview of the numeric columns. It shows the minimum, maximum, mean, and standard deviation for each statistic.

Looking at these values gives insight into:

- The range of each stat (min and max)
- Average performance (mean)
- How spread out the stats are (std)

This is useful for spotting outliers and getting a sense of how players compare across categories.

I also used `df['team'].value_counts()` to see how many players were on each team, to get a rough idea of how the players are spread out.


In [35]:
df['points_per_game'] = df['pts']/df['g']
df['points_per_minute'] = df['pts']/df['mp']



In [36]:
high = df[df['g'] >= 30]

In [38]:
high.sort_values(
    ['points_per_game', 'efg_percent'],
    ascending = False
)[['player', 'team', 'points_per_game', 'efg_percent']].head(10)

Unnamed: 0,player,team,points_per_game,efg_percent
8,Giannis Antetokounmpo,MIL,0.933333,0.66
0,Luka Dončić,LAL,0.927778,0.543
5,Nikola Jokić,DEN,0.925,0.665
7,Kawhi Leonard,LAC,0.906452,0.571
10,Lauri Markkanen,UTA,0.845455,0.557
4,Anthony Edwards,MIN,0.822222,0.579
21,Victor Wembanyama,SAS,0.790323,0.566
11,Stephen Curry,GSW,0.761111,0.588
1,Shai Gilgeous-Alexander,OKC,0.734091,0.605
18,Michael Porter Jr.,BRK,0.725714,0.582


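I wanted to identify players who score a lot while being efficient. Simply looking at total points doesn't tell the whole story, because some players play fewer minutes or fewer games.

1. Calculate points per game and points per minute to standardize scoring:
`points_per_game` = `pts` ÷ `g` (games played)
`points_per_minute` = `pts` ÷ `mp` (minutes played)
One caveat: in the `df.describe()` output, `pts` tops out at 33.4 and `mp` at 39.5, which suggests both columns are already per-game averages rather than season totals. If so, `pts` ÷ `g` is not a true points-per-game figure (every value in the table above is below 1), and `pts` itself is the per-game scoring number.

2. Filter for serious contributors by keeping only players who played at least 30 games. This removes players with very few games that could skew results.

3. Sort by scoring and efficiency using:
`points_per_game` as the primary key, to get high-volume scorers
`efg_percent` (effective field goal percentage) as the secondary key, which only breaks ties in `points_per_game`

4. Show the top 10 players, giving a view of the highest-volume scorers this season alongside their shooting efficiency.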
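
The filter-and-sort steps above can be sketched on a tiny made-up frame. This is an illustration under an assumption, not the notebook's exact method: it assumes `pts` and `mp` are already per-game averages (as their ranges in the `df.describe()` output suggest), so it sorts on `pts` directly. All players and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical players; all numbers are made up for illustration.
sample = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "g": [40, 12, 35, 31],
    "mp": [34.0, 30.0, 28.0, 20.0],
    "pts": [27.0, 30.0, 27.0, 9.0],  # assumed to be points per game
    "efg_percent": [0.55, 0.60, 0.61, 0.50],
})

# Scoring rate per minute of playing time.
sample["points_per_minute"] = sample["pts"] / sample["mp"]

# Keep only players with at least 30 games.
regulars = sample[sample["g"] >= 30]

# Sort by scoring first; efficiency only breaks ties. Both descending.
top = regulars.sort_values(["pts", "efg_percent"], ascending=False).head(10)
print(top[["player", "pts", "efg_percent", "points_per_minute"]])
```

Players A and C tie on `pts`, so C's higher `efg_percent` places it first, which shows how the secondary sort key only matters for ties.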


In [None]:
# index=False avoids writing the row index as an extra unnamed column
df.to_csv('nba_cleaned.csv', index=False)