# NBA Analysis Project Part 1: Data Cleaning

Goal: **This project aims to predict which NBA team is most likely to win the 2020-2021 NBA championship.**

Project Planning: When starting a project, I like to outline the steps I plan to take. Breaking the project into pieces keeps the work organized and minimizes the time and effort spent debugging. The process is shown below:

1. Understand the nature of the data
2. Clean the data to obtain the most relevant information
3. Create as many plots as possible to visualize the data
4. Process missing data
5. Explore data

More steps will follow (to be continued...)
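
Step 1 (understanding the nature of the data) could look something like the sketch below. It is not the notebook's own code: `summarize` is a hypothetical helper, and the toy frame is made-up data standing in for any of the CSV files loaded in the next cell.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Toy data, made up for illustration (not taken from the real files)
toy = pd.DataFrame({
    "TEAM": ["A", "B", "B", None],
    "PTS": [10.0, None, 7.0, 3.0],
})

print(toy.shape)       # (rows, columns)
print(summarize(toy))  # per-column overview
```

Calling the same helper on `games_details`, `games`, `players`, `ranking`, and `teams` would give a quick picture of which columns are sparse before deciding what to drop.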

In [94]:
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


games_details = pd.read_csv("games_details.csv")
games = pd.read_csv("games.csv")
players = pd.read_csv("players.csv")
ranking = pd.read_csv("ranking.csv")
teams = pd.read_csv("teams.csv")

### Data Table 1: NBA Game Details

To start off, I want to drop some columns because they are redundant and don't help our analysis. I also want to remove the rows where `START_POSITION` is null, so that every remaining row carries meaningful data.
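
As a minimal sketch of this cleaning step, here is the same idea on a tiny made-up frame. The column names mirror `games_details`, but the values are invented, and the `reset_index` call is my own addition rather than something the notebook does.

```python
import pandas as pd

# Toy stand-in for games_details (values are made up for illustration)
toy = pd.DataFrame({
    "GAME_ID": [2, 1, 1],
    "PLAYER_ID": [10, 11, 12],
    "START_POSITION": ["F", None, "G"],
    "COMMENT": [None, "DNP", None],
    "PTS": [4.0, None, 9.0],
})

# How many values are missing in each column before cleaning?
print(toy.isna().sum())

# Keep rows that have a START_POSITION, drop the unneeded columns, then sort
cleaned = (
    toy[toy["START_POSITION"].notna()]
    .drop(columns=["PLAYER_ID", "COMMENT"])
    .sort_values(by="GAME_ID")
    .reset_index(drop=True)  # assumption: a fresh 0..n-1 index is wanted
)
print(cleaned)
```

Checking `isna().sum()` first makes it easier to see how many rows a filter will remove before committing to it.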

In [95]:
print("Title: Games_Details")

# Drop rows where START_POSITION is null
games_details = games_details[games_details['START_POSITION'].notna()]

# Print the lowercase column names for reference
col_list = [col.lower() for col in games_details.columns]
print(col_list)

# Drop columns that are not needed for the analysis
# (GAME_ID is kept because it is used for sorting below)
games_details = games_details.drop(columns=["PLAYER_ID", "TEAM_ID", "COMMENT"])

games_details = games_details.sort_values(by=['GAME_ID', 'TEAM_ABBREVIATION'])
games_details

Title: Games_Details
['game_id', 'team_id', 'team_abbreviation', 'team_city', 'player_id', 'player_name', 'start_position', 'comment', 'min', 'fgm', 'fga', 'fg_pct', 'fg3m', 'fg3a', 'fg3_pct', 'ftm', 'fta', 'ft_pct', 'oreb', 'dreb', 'reb', 'ast', 'stl', 'blk', 'to', 'pf', 'pts', 'plus_minus']


Unnamed: 0,GAME_ID,TEAM_ABBREVIATION,TEAM_CITY,PLAYER_NAME,START_POSITION,MIN,FGM,FGA,FG_PCT,FG3M,...,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS
543022,11400001,MIA,Miami,Luol Deng,F,24:50,2.0,7.0,0.286,0.0,...,2.0,1.0,3.0,1.0,2.0,0.0,0.0,0.0,4.0,-16.0
543023,11400001,MIA,Miami,Udonis Haslem,F,10:32,1.0,2.0,0.500,0.0,...,0.0,2.0,2.0,1.0,1.0,0.0,0.0,2.0,2.0,-8.0
543024,11400001,MIA,Miami,Chris Bosh,C,25:20,3.0,13.0,0.231,1.0,...,1.0,5.0,6.0,2.0,1.0,0.0,4.0,2.0,9.0,-10.0
543025,11400001,MIA,Miami,Dwyane Wade,G,20:31,2.0,7.0,0.286,0.0,...,0.0,2.0,2.0,3.0,0.0,0.0,1.0,1.0,6.0,-11.0
543026,11400001,MIA,Miami,Mario Chalmers,G,21:16,0.0,2.0,0.000,0.0,...,0.0,3.0,3.0,2.0,1.0,0.0,2.0,1.0,2.0,-6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
476,52000211,MEM,Memphis,Kyle Anderson,F,39:01,2.0,7.0,0.286,0.0,...,2.0,8.0,10.0,6.0,1.0,2.0,0.0,1.0,9.0,11.0
477,52000211,MEM,Memphis,Jaren Jackson Jr.,F,14:56,1.0,6.0,0.167,1.0,...,2.0,0.0,2.0,0.0,0.0,2.0,1.0,4.0,10.0,1.0
478,52000211,MEM,Memphis,Jonas Valanciunas,C,25:33,3.0,6.0,0.500,1.0,...,6.0,6.0,12.0,3.0,0.0,0.0,3.0,6.0,9.0,7.0
479,52000211,MEM,Memphis,Dillon Brooks,G,45:04,7.0,22.0,0.318,0.0,...,1.0,1.0,2.0,3.0,2.0,0.0,1.0,5.0,14.0,-1.0


As we can see, the column names can be confusing, so here are the full names behind the abbreviations:
* FGM: Field Goals Made
* FGA: Field Goals Attempted
* FG_PCT: Field Goal Percentage
* FG3M: 3-Point Field Goals Made
* FG3A: 3-Point Field Goals Attempted
* FG3_PCT: 3-Point Field Goal Percentage
* FTM: Free Throws Made
* FTA: Free Throws Attempted
* FT_PCT: Free Throw Percentage
* OREB: Offensive Rebounds
* DREB: Defensive Rebounds
* REB: Rebounds
* AST: Assists
* STL: Steals
* BLK: Blocks
* TO: Turnovers
* PF: Personal Fouls
* PTS: Points
* PLUS_MINUS: Plus/Minus (the team's point differential while the player is on the court)
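
To make the relationships between these columns concrete, the sketch below recomputes two of them from the first three rows of the `games_details` output above. It assumes the percentage columns are stored as made divided by attempted, rounded to three decimals; the `_check` column names are mine.

```python
import pandas as pd

# Three rows copied from the games_details output shown earlier
rows = pd.DataFrame({
    "PLAYER_NAME": ["Luol Deng", "Udonis Haslem", "Chris Bosh"],
    "FGM": [2.0, 1.0, 3.0],
    "FGA": [7.0, 2.0, 13.0],
    "FG_PCT": [0.286, 0.500, 0.231],
    "OREB": [2.0, 0.0, 1.0],
    "DREB": [1.0, 2.0, 5.0],
    "REB": [3.0, 2.0, 6.0],
})

# Field goal percentage = field goals made / field goals attempted
# (assumption: the dataset rounds this to three decimals)
rows["FG_PCT_check"] = (rows["FGM"] / rows["FGA"]).round(3)

# Total rebounds = offensive rebounds + defensive rebounds
rows["REB_check"] = rows["OREB"] + rows["DREB"]

print(rows[["PLAYER_NAME", "FG_PCT", "FG_PCT_check", "REB", "REB_check"]])
```

A check like this is also a cheap way to spot corrupted rows: any row where the recomputed value disagrees with the stored one deserves a closer look.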

I know, this is a lot of information, and it can be quite intimidating for someone new to the NBA. If any of these terms are unfamiliar, I highly recommend checking the Wikipedia page on the rules of basketball: https://en.wikipedia.org/wiki/Rules_of_basketball. With that said, let's keep navigating through our data.

In [100]:
print("games")
games.sort_values(by="GAME_DATE_EST")

games


Unnamed: 0,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,...,AST_home,REB_home,TEAM_ID_away,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS
17314,2003-10-05,10300001,Final,1610612762,1610612742,2003,1610612762,90.0,0.457,0.735,...,23.0,41.0,1610612742,85.0,0.447,0.500,0.250,20.0,38.0,1
17313,2003-10-06,10300002,Final,1610612763,1610612749,2003,1610612763,105.0,0.494,0.618,...,25.0,48.0,1610612749,94.0,0.427,0.700,0.154,20.0,43.0,1
17312,2003-10-07,10300009,Final,1610612758,1610612746,2003,1610612758,101.0,0.467,0.871,...,19.0,39.0,1610612746,82.0,0.368,0.609,0.364,13.0,50.0,1
17311,2003-10-07,10300005,Final,1610612757,1610612745,2003,1610612757,104.0,0.527,0.657,...,22.0,33.0,1610612745,80.0,0.470,0.667,0.333,10.0,37.0,1
17310,2003-10-07,10300007,Final,1610612748,1610612755,2003,1610612748,86.0,0.352,0.647,...,15.0,55.0,1610612755,79.0,0.329,0.897,0.143,7.0,44.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4,2021-05-25,42000152,Final,1610612756,1610612747,2020,1610612756,102.0,0.465,0.933,...,21.0,31.0,1610612747,109.0,0.450,0.871,0.303,24.0,39.0,0
3,2021-05-25,42000112,Final,1610612751,1610612738,2020,1610612751,130.0,0.523,0.955,...,31.0,46.0,1610612738,108.0,0.424,0.783,0.353,23.0,43.0,1
2,2021-05-26,42000142,Final,1610612762,1610612763,2020,1610612762,141.0,0.544,0.774,...,28.0,42.0,1610612763,129.0,0.541,0.763,0.348,20.0,33.0,1
1,2021-05-26,42000132,Final,1610612752,1610612737,2020,1610612752,101.0,0.383,0.739,...,15.0,54.0,1610612737,92.0,0.369,0.818,0.273,17.0,41.0,1


In [97]:
print("players")
players.sample(5)

players


Unnamed: 0,PLAYER_NAME,TEAM_ID,PLAYER_ID,SEASON
704,Amir Johnson,1610612755,101161,2018
281,Terence Davis,1610612761,1629056,2019
1902,Willy Hernangomez,1610612752,1626195,2017
7000,Carl Landry,1610612758,201171,2009
3643,Quincy Pondexter,1610612740,202347,2014


In [98]:
print("rankings")
ranking

rankings


Unnamed: 0,TEAM_ID,LEAGUE_ID,SEASON_ID,STANDINGSDATE,CONFERENCE,TEAM,G,W,L,W_PCT,HOME_RECORD,ROAD_RECORD,RETURNTOPLAY
0,1610612762,0,22020,2021-05-26,West,Utah,72,52,20,0.722,31-5,21-15,
1,1610612756,0,22020,2021-05-26,West,Phoenix,72,51,21,0.708,27-9,24-12,
2,1610612743,0,22020,2021-05-26,West,Denver,72,47,25,0.653,25-11,22-14,
3,1610612746,0,22020,2021-05-26,West,LA Clippers,72,47,25,0.653,26-10,21-15,
4,1610612742,0,22020,2021-05-26,West,Dallas,72,42,30,0.583,21-15,21-15,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
193087,1610612765,0,22013,2014-09-01,East,Detroit,82,29,53,0.354,17-24,12-29,
193088,1610612738,0,22013,2014-09-01,East,Boston,82,25,57,0.305,16-25,9-32,
193089,1610612753,0,22013,2014-09-01,East,Orlando,82,23,59,0.280,19-22,4-37,
193090,1610612755,0,22013,2014-09-01,East,Philadelphia,82,19,63,0.232,10-31,9-32,
