_This week, we will use basketball data downloaded from NBA.com to demonstrate how to import data into Python, how to clean data before conducting any analysis, and how to describe and summarize data._ 


## Importing data into Python
### Before we import the dataset into the Jupyter notebook, we first need to import the Python libraries that we will use to analyze the data
- pandas
- numpy



In [1]:
import pandas as pd
import numpy as np

#### In our data repository, we have a dataset that contains NBA team information. Let's import this dataset into the Jupyter notebook.

In [2]:
NBA_Teams=pd.read_csv("../../Data/Week 2/nba_teams.csv")

We can take a quick look at the data we imported by displaying the dataset. 

In [3]:
display(NBA_Teams)

Unnamed: 0.1,Unnamed: 0,ABBREVIATION,CITY,FULL_NAME,ID,NICKNAME,STATE,YEAR_FOUNDED
0,0,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949
1,1,BOS,Boston,Boston Celtics,1610612738,Celtics,Massachusetts,1946
2,2,CLE,Cleveland,Cleveland Cavaliers,1610612739,Cavaliers,Ohio,1970
3,3,NOP,New Orleans,New Orleans Pelicans,1610612740,Pelicans,Louisiana,2002
4,4,CHI,Chicago,Chicago Bulls,1610612741,Bulls,Illinois,1966
5,5,DAL,Dallas,Dallas Mavericks,1610612742,Mavericks,Texas,1980
6,6,DEN,Denver,Denver Nuggets,1610612743,Nuggets,Colorado,1976
7,7,GSW,Golden State,Golden State Warriors,1610612744,Warriors,California,1946
8,8,HOU,Houston,Houston Rockets,1610612745,Rockets,Texas,1967
9,9,LAC,Los Angeles,Los Angeles Clippers,1610612746,Clippers,California,1970


_This dataset provides some basic information of the NBA teams._ 

In a dataset, each row represents an observation (here, a team) and each column represents a variable that records one characteristic of the observation. A variable can take different values for different observations. The number of observations is the size of our sample, and the number of variables indicates how rich the information in our dataset is. 

#### We can use the “shape” attribute of a dataframe to see how many observations and variables are in our dataset.




In [4]:
NBA_Teams.shape

(30, 8)

_We can see that there are 30 observations (rows) and 8 variables (columns)._
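
As a minimal sketch of how "shape" can be used in code, the tuple it returns can be unpacked into a row count and a column count. The dataframe `toy` below is made-up illustration data, not the course dataset.

```python
import pandas as pd

# Hypothetical three-team table used only for illustration.
toy = pd.DataFrame({
    "CITY": ["Atlanta", "Boston", "Cleveland"],
    "YEAR_FOUNDED": [1949, 1946, 1970],
})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = toy.shape
print(n_rows, n_cols)  # 3 2
```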

### Renaming Variables
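#### We can rename a variable using the “rename” method in pandas.
_Inplace parameter_
- True: rename the columns in the existing dataframe; the call itself returns nothing.
- False (the default): leave the existing dataframe unchanged and return a new dataframe with the renamed columns.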
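
The behaviour of the inplace argument can be checked on a small example. This is only a sketch: `teams` is a made-up two-row table, and `renamed` and `result` are names chosen for illustration.

```python
import pandas as pd

# Hypothetical toy table standing in for the team data.
teams = pd.DataFrame({"ID": [1, 2], "CITY": ["Atlanta", "Boston"]})

# inplace=False (the default): a renamed copy is returned, `teams` is untouched.
renamed = teams.rename(columns={"ID": "TEAM_ID"})
print(list(teams.columns))    # ['ID', 'CITY']
print(list(renamed.columns))  # ['TEAM_ID', 'CITY']

# inplace=True: `teams` itself is modified and the call returns None.
result = teams.rename(columns={"ID": "TEAM_ID"}, inplace=True)
print(result)                 # None
print(list(teams.columns))    # ['TEAM_ID', 'CITY']
```

Assigning the result of an inplace=True call back to the dataframe name would therefore replace the dataframe with None.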

The first variable has no meaningful name ("Unnamed: 0"), so let's rename it "TEAM_NUMBER"; let's also rename "ID" to "TEAM_ID". 

In [7]:
NBA_Teams.rename(columns = {'Unnamed: 0':'TEAM_NUMBER', 'ID':'TEAM_ID'}, inplace = True)
display(NBA_Teams)

Unnamed: 0,TEAM_NUMBER,ABBREVIATION,CITY,FULL_NAME,TEAM_ID,NICKNAME,STATE,YEAR_FOUNDED
0,0,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949
1,1,BOS,Boston,Boston Celtics,1610612738,Celtics,Massachusetts,1946
2,2,CLE,Cleveland,Cleveland Cavaliers,1610612739,Cavaliers,Ohio,1970
3,3,NOP,New Orleans,New Orleans Pelicans,1610612740,Pelicans,Louisiana,2002
4,4,CHI,Chicago,Chicago Bulls,1610612741,Bulls,Illinois,1966
5,5,DAL,Dallas,Dallas Mavericks,1610612742,Mavericks,Texas,1980
6,6,DEN,Denver,Denver Nuggets,1610612743,Nuggets,Colorado,1976
7,7,GSW,Golden State,Golden State Warriors,1610612744,Warriors,California,1946
8,8,HOU,Houston,Houston Rockets,1610612745,Rockets,Texas,1967
9,9,LAC,Los Angeles,Los Angeles Clippers,1610612746,Clippers,California,1970


## Self Test - 1
- Rename "FULL_NAME" to "TEAM_NAME" 

In [8]:
#Your Code Here
NBA_Teams.rename(columns = {'FULL_NAME':'TEAM_NAME'}, inplace = True)
NBA_Teams

Unnamed: 0,TEAM_NUMBER,ABBREVIATION,CITY,TEAM_NAME,TEAM_ID,NICKNAME,STATE,YEAR_FOUNDED
0,0,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949
1,1,BOS,Boston,Boston Celtics,1610612738,Celtics,Massachusetts,1946
2,2,CLE,Cleveland,Cleveland Cavaliers,1610612739,Cavaliers,Ohio,1970
3,3,NOP,New Orleans,New Orleans Pelicans,1610612740,Pelicans,Louisiana,2002
4,4,CHI,Chicago,Chicago Bulls,1610612741,Bulls,Illinois,1966
5,5,DAL,Dallas,Dallas Mavericks,1610612742,Mavericks,Texas,1980
6,6,DEN,Denver,Denver Nuggets,1610612743,Nuggets,Colorado,1976
7,7,GSW,Golden State,Golden State Warriors,1610612744,Warriors,California,1946
8,8,HOU,Houston,Houston Rockets,1610612745,Rockets,Texas,1967
9,9,LAC,Los Angeles,Los Angeles Clippers,1610612746,Clippers,California,1970


### Dropping Variables (columns)

#### To drop a variable, i.e., to delete a column, we can use the “drop” method. 
- We need to provide the name of the variable;
- We also need to use the argument “axis=1”, which tells pandas that we are dropping a column, not a row. 

The variable "TEAM_NUMBER" has little meaning, so let's drop it.

In [10]:
NBA_Teams.drop(['TEAM_NUMBER'], axis = 1, inplace = True)
NBA_Teams

Unnamed: 0,ABBREVIATION,CITY,TEAM_NAME,TEAM_ID,NICKNAME,STATE,YEAR_FOUNDED
0,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949
1,BOS,Boston,Boston Celtics,1610612738,Celtics,Massachusetts,1946
2,CLE,Cleveland,Cleveland Cavaliers,1610612739,Cavaliers,Ohio,1970
3,NOP,New Orleans,New Orleans Pelicans,1610612740,Pelicans,Louisiana,2002
4,CHI,Chicago,Chicago Bulls,1610612741,Bulls,Illinois,1966
5,DAL,Dallas,Dallas Mavericks,1610612742,Mavericks,Texas,1980
6,DEN,Denver,Denver Nuggets,1610612743,Nuggets,Colorado,1976
7,GSW,Golden State,Golden State Warriors,1610612744,Warriors,California,1946
8,HOU,Houston,Houston Rockets,1610612745,Rockets,Texas,1967
9,LAC,Los Angeles,Los Angeles Clippers,1610612746,Clippers,California,1970
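
As an aside, pandas also accepts a `columns=` keyword in "drop", which avoids having to remember the axis number. The sketch below uses a made-up table `toy` and shows that both spellings give the same result.

```python
import pandas as pd

# Hypothetical toy table used only for illustration.
toy = pd.DataFrame({"TEAM_NUMBER": [0, 1], "CITY": ["Atlanta", "Boston"]})

# Both calls return a new dataframe without the TEAM_NUMBER column.
by_axis = toy.drop(["TEAM_NUMBER"], axis=1)
by_keyword = toy.drop(columns=["TEAM_NUMBER"])

print(by_axis.equals(by_keyword))  # True
print(list(toy.columns))           # ['TEAM_NUMBER', 'CITY'] (original unchanged)
```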


#### Next, we will work on game-level data.

Import the game-level dataset from our data repository. 
- We can display just the first five rows of the dataset using the "head" method.


In [11]:
Games=pd.read_csv("../../Data/Week 2/basketball_games.csv")
Games.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
0,22019,1611661322,WAS,Washington Mystics,1021900079,8/5/2019,WAS @ LVA,,101,51,...,0.75,3,10,13,14,2,4,5,7,12.2
1,22019,1611661319,LVA,Las Vegas Aces,1021900079,8/5/2019,LVA vs. WAS,,100,36,...,0.8,5,10,15,12,2,0,7,8,-11.8
2,22019,1611661320,LAS,Los Angeles Sparks,1021900124,8/1/2019,LAS vs. LVA,W,200,76,...,1.0,5,28,33,19,7,7,12,18,8.0
3,22019,1611661323,CON,Connecticut Sun,1021900122,8/1/2019,CON vs. PHO,W,200,68,...,0.786,12,28,40,18,13,0,16,15,6.0
4,22019,1611661317,PHO,Phoenix Mercury,1021900122,8/1/2019,PHO @ CON,L,199,62,...,0.778,6,23,29,15,8,9,18,11,-6.0


_Upon importing the game data, we notice that the first five rows are not NBA games; they are WNBA games. Indeed, this dataset contains NBA games, WNBA games, and NBA 2K (simulation video) games._ 

### Dropping observations (rows)

#### To drop an observation, we can use the index number on the left to specify the row we want to drop.
- The argument axis=0 specifies that we want to drop a row instead of a column.


In [12]:
Games.drop([0], axis=0, inplace=True)
Games.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
1,22019,1611661319,LVA,Las Vegas Aces,1021900079,8/5/2019,LVA vs. WAS,,100,36,...,0.8,5,10,15,12,2,0,7,8,-11.8
2,22019,1611661320,LAS,Los Angeles Sparks,1021900124,8/1/2019,LAS vs. LVA,W,200,76,...,1.0,5,28,33,19,7,7,12,18,8.0
3,22019,1611661323,CON,Connecticut Sun,1021900122,8/1/2019,CON vs. PHO,W,200,68,...,0.786,12,28,40,18,13,0,16,15,6.0
4,22019,1611661317,PHO,Phoenix Mercury,1021900122,8/1/2019,PHO @ CON,L,199,62,...,0.778,6,23,29,15,8,9,18,11,-6.0
5,22019,1611661319,LVA,Las Vegas Aces,1021900124,8/1/2019,LVA @ LAS,L,200,68,...,0.765,10,29,39,17,7,4,14,13,-8.0


#### More often, we will drop observations based on certain conditions. 

For example, the Las Vegas Aces are a women’s basketball team. If we are going to focus only on men’s basketball games, we will drop all the games played by the Las Vegas Aces. In this case, we don’t have to use the “drop” method. Instead, we can keep only the rows whose TEAM_NAME is not equal to “Las Vegas Aces”.


In [13]:
Games = Games[Games.TEAM_NAME != 'Las Vegas Aces']
Games.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
2,22019,1611661320,LAS,Los Angeles Sparks,1021900124,8/1/2019,LAS vs. LVA,W,200,76,...,1.0,5,28,33,19,7,7,12,18,8.0
3,22019,1611661323,CON,Connecticut Sun,1021900122,8/1/2019,CON vs. PHO,W,200,68,...,0.786,12,28,40,18,13,0,16,15,6.0
4,22019,1611661317,PHO,Phoenix Mercury,1021900122,8/1/2019,PHO @ CON,L,199,62,...,0.778,6,23,29,15,8,9,18,11,-6.0
6,22019,1611661313,NYL,New York Liberty,1021900123,8/1/2019,NYL @ DAL,L,200,64,...,0.8,6,21,27,15,10,2,21,26,-23.0
7,22019,1611661321,DAL,Dallas Wings,1021900123,8/1/2019,DAL vs. NYL,W,200,87,...,0.833,9,24,33,24,13,6,18,21,23.0
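
To make the mechanics of this kind of filter visible, the sketch below builds the True/False mask explicitly on a made-up table (`toy_games` and its values are illustrative only). It also shows `isin` combined with `~` (logical NOT), one possible way to exclude several teams at once.

```python
import pandas as pd

# Hypothetical toy game table used only for illustration.
toy_games = pd.DataFrame({
    "TEAM_NAME": ["Las Vegas Aces", "Atlanta Hawks", "Phoenix Mercury", "Boston Celtics"],
    "PTS": [36, 101, 62, 95],
})

# The comparison produces one True/False value per row.
mask = toy_games["TEAM_NAME"] != "Las Vegas Aces"
print(mask.tolist())  # [False, True, True, True]

# Indexing with the mask keeps only the rows marked True.
kept = toy_games[mask]

# isin() marks rows whose team is in the list; ~ inverts the mask.
unwanted = ["Las Vegas Aces", "Phoenix Mercury"]
kept_many = toy_games[~toy_games["TEAM_NAME"].isin(unwanted)]
print(kept_many["TEAM_NAME"].tolist())  # ['Atlanta Hawks', 'Boston Celtics']
```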


## Self Test - 2
- Drop all the Phoenix Mercury games

In [14]:
#Your Code Here
Games = Games[Games.TEAM_NAME != 'Phoenix Mercury']
Games

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
2,22019,1611661320,LAS,Los Angeles Sparks,1021900124,8/1/2019,LAS vs. LVA,W,200,76,...,1.000,5,28,33,19,7,7,12,18,8.0
3,22019,1611661323,CON,Connecticut Sun,1021900122,8/1/2019,CON vs. PHO,W,200,68,...,0.786,12,28,40,18,13,0,16,15,6.0
6,22019,1611661313,NYL,New York Liberty,1021900123,8/1/2019,NYL @ DAL,L,200,64,...,0.800,6,21,27,15,10,2,21,26,-23.0
7,22019,1611661321,DAL,Dallas Wings,1021900123,8/1/2019,DAL vs. NYL,W,200,87,...,0.833,9,24,33,24,13,6,18,21,23.0
8,22019,1611661325,IND,Indiana Fever,1021900121,7/31/2019,IND vs. ATL,W,200,61,...,0.731,7,36,43,14,11,4,15,17,2.0
9,22019,1611661330,ATL,Atlanta Dream,1021900121,7/31/2019,ATL @ IND,L,201,59,...,0.667,12,32,44,17,7,5,15,21,-2.0
11,22019,1611661322,WAS,Washington Mystics,1021900119,7/30/2019,WAS vs. PHO,W,200,99,...,0.800,10,23,33,21,6,4,8,13,6.0
12,22019,1611661323,CON,Connecticut Sun,1021900118,7/30/2019,CON vs. CHI,W,201,100,...,0.762,10,27,37,29,6,4,16,22,6.0
13,22019,1611661321,DAL,Dallas Wings,1021900120,7/30/2019,DAL @ LVA,L,201,54,...,0.600,10,29,39,11,4,4,18,22,-32.0
14,22019,1611661329,CHI,Chicago Sky,1021900118,7/30/2019,CHI @ CON,L,200,94,...,0.783,8,23,31,26,6,3,14,20,-6.0


## Merging Dataframes

#### We will focus only on NBA games. We can merge the NBA_Teams and Games datasets to keep only the NBA games: by default, pd.merge keeps only the rows whose keys appear in both datasets.

#### Teams are identified by TEAM_ID, so let’s merge the datasets by TEAM_ID. Since the variable “TEAM_NAME” is also present in both datasets, we can include it as a second merge criterion so that the new dataset has no duplicate variables.

In [15]:
NBA_Games = pd.merge(NBA_Teams, Games, on = ['TEAM_ID', 'TEAM_NAME'])
NBA_Games

Unnamed: 0,ABBREVIATION,CITY,TEAM_NAME,TEAM_ID,NICKNAME,STATE,YEAR_FOUNDED,SEASON_ID,TEAM_ABBREVIATION,GAME_ID,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
0,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949,22019,ATL,1521900072,...,0.850,13,23,36,14,15,3,12,24,8.0
1,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949,22019,ATL,1521900060,...,0.700,9,28,37,19,10,8,22,25,-5.0
2,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949,22019,ATL,1521900042,...,0.708,7,27,34,17,5,5,18,21,18.2
3,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949,22019,ATL,1521900023,...,0.500,1,2,3,1,0,1,4,2,0.0
4,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949,22019,ATL,1521900023,...,0.625,9,27,36,7,7,10,18,28,-24.0
5,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949,22019,ATL,1521900013,...,0.885,9,30,39,13,11,6,13,21,2.0
6,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949,22018,ATL,21801220,...,0.816,22,39,61,29,5,7,17,25,-1.0
7,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949,22018,ATL,21801202,...,0.526,9,39,48,25,2,3,11,28,-8.0
8,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949,22018,ATL,21801181,...,0.677,10,28,38,21,16,4,14,21,-36.0
9,ATL,Atlanta,Atlanta Hawks,1610612737,Hawks,Atlanta,1949,22018,ATL,21801168,...,0.786,11,33,44,29,7,7,11,26,8.0
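
The two effects of this merge (non-NBA rows disappear, and TEAM_NAME is not duplicated) can be seen on a small example. This is a sketch on made-up tables; `toy_teams`, `toy_games`, and the ID values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy tables used only for illustration.
toy_teams = pd.DataFrame({
    "TEAM_ID": [1, 2],
    "TEAM_NAME": ["Atlanta Hawks", "Boston Celtics"],
    "CITY": ["Atlanta", "Boston"],
})
toy_games = pd.DataFrame({
    "TEAM_ID": [1, 2, 99],
    "TEAM_NAME": ["Atlanta Hawks", "Boston Celtics", "Las Vegas Aces"],
    "PTS": [101, 95, 36],
})

# Default inner merge: the row with TEAM_ID 99 has no match and is dropped.
merged = pd.merge(toy_teams, toy_games, on=["TEAM_ID", "TEAM_NAME"])
print(merged.shape)  # (2, 4)

# Merging on TEAM_ID alone keeps both TEAM_NAME columns, with suffixes.
merged_one_key = pd.merge(toy_teams, toy_games, on="TEAM_ID")
print(list(merged_one_key.columns))
# ['TEAM_ID', 'TEAM_NAME_x', 'CITY', 'TEAM_NAME_y', 'PTS']
```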


### Understanding and cleaning the merged dataset

_As you can tell, the merged dataset has many more variables, and the notebook cannot fit all of them on the screen._ 

#### We can obtain the full list of variables in our dataset using the “columns” attribute.


In [16]:
NBA_Games.columns

Index(['ABBREVIATION', 'CITY', 'TEAM_NAME', 'TEAM_ID', 'NICKNAME', 'STATE',
       'YEAR_FOUNDED', 'SEASON_ID', 'TEAM_ABBREVIATION', 'GAME_ID',
       'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'PTS', 'FGM', 'FGA', 'FG_PCT',
       'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB',
       'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PLUS_MINUS'],
      dtype='object')

#### Data Cleaning
The variables “ABBREVIATION” and “TEAM_ABBREVIATION” carry the same information, so it is not necessary to keep both of them. 
- Delete "ABBREVIATION" 


In [17]:
NBA_Games.drop(['ABBREVIATION'], axis = 1, inplace = True, errors = 'ignore')

## Self Test - 3
- Find the number of observations and the number of variables in the dataset

In [18]:
#Your Code Here
NBA_Games.shape

(18421, 32)

The merged dataset keeps the order of the keys in the left dataset (NBA_Teams), which is ordered by "TEAM_ID". Thus, the NBA_Games dataset is currently ordered by "TEAM_ID". We may be interested in sorting the data by other criteria, for example, the date of the game. 

#### We can do so by using the “sort_values” method. 

In our dataset, "GAME_ID" is created based on the date of the game. We can sort the games by “GAME_ID” in descending order and display the first 20 rows, which are the 20 most recent observations.


In [19]:
NBA_Games.sort_values(by=['GAME_ID'], ascending=[False]).head(20)

Unnamed: 0,CITY,TEAM_NAME,TEAM_ID,NICKNAME,STATE,YEAR_FOUNDED,SEASON_ID,TEAM_ABBREVIATION,GAME_ID,GAME_DATE,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
13449,San Antonio,San Antonio Spurs,1610612759,Spurs,Texas,1976,22019,SAS,1621900006,7/3/2019,...,0.615,9,29,38,12,5,3,11,14,-7.2
15445,Utah,Utah Jazz,1610612762,Jazz,Utah,1974,22019,UTA,1621900006,7/3/2019,...,0.684,13,31,44,15,6,5,15,25,5.8
1304,Cleveland,Cleveland Cavaliers,1610612739,Cavaliers,Ohio,1970,22019,CLE,1621900005,7/3/2019,...,0.737,3,26,29,15,9,3,17,12,-13.0
16076,Memphis,Memphis Grizzlies,1610612763,Grizzlies,Tennessee,1995,22019,MEM,1621900005,7/3/2019,...,0.692,11,36,47,19,8,5,14,19,13.0
15446,Utah,Utah Jazz,1610612762,Jazz,Utah,1974,22019,UTA,1621900004,7/2/2019,...,0.75,11,34,45,18,3,4,14,18,14.6
1305,Cleveland,Cleveland Cavaliers,1610612739,Cavaliers,Ohio,1970,22019,CLE,1621900004,7/2/2019,...,0.682,6,27,33,13,6,0,6,14,-15.0
13450,San Antonio,San Antonio Spurs,1610612759,Spurs,Texas,1976,22019,SAS,1621900003,7/2/2019,...,0.87,10,35,45,24,10,7,18,24,16.6
16077,Memphis,Memphis Grizzlies,1610612763,Grizzlies,Tennessee,1995,22019,MEM,1621900003,7/2/2019,...,0.867,10,29,39,17,6,5,17,18,-15.8
15447,Utah,Utah Jazz,1610612762,Jazz,Utah,1974,22019,UTA,1621900002,7/1/2019,...,0.462,14,27,41,16,10,7,19,26,-6.6
16078,Memphis,Memphis Grizzlies,1610612763,Grizzlies,Tennessee,1995,22019,MEM,1621900002,7/1/2019,...,0.704,8,32,40,14,14,4,20,19,7.4
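
Sorting on the date itself is another option. The sketch below assumes that GAME_DATE is stored as month/day/year text, as the displayed rows suggest; the table `toy` and the column name GAME_DATE_PARSED are made up for illustration. Text dates do not sort chronologically, so the column is converted with pd.to_datetime first.

```python
import pandas as pd

# Hypothetical toy table; dates follow the month/day/year pattern seen in the output.
toy = pd.DataFrame({
    "GAME_ID": [101, 102, 103],
    "GAME_DATE": ["7/3/2019", "10/1/2019", "7/1/2019"],
})

# As text, "10/1/2019" sorts before "7/1/2019", which is not chronological.
text_sorted = toy.sort_values(by="GAME_DATE", ascending=False)
print(text_sorted["GAME_DATE"].tolist())   # ['7/3/2019', '7/1/2019', '10/1/2019']

# Convert to real dates, then sort with the most recent game first.
toy["GAME_DATE_PARSED"] = pd.to_datetime(toy["GAME_DATE"], format="%m/%d/%Y")
latest_first = toy.sort_values(by="GAME_DATE_PARSED", ascending=False)
print(latest_first["GAME_DATE"].tolist())  # ['10/1/2019', '7/3/2019', '7/1/2019']
```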


## Missing Values

### Before we move on to any data analysis, we usually need to check whether there are any missing values, that is, whether the source failed to collect some information. 

#### We can use the info() method, which reports the number of non-null (non-missing) values for each variable. By comparing these counts with the total number of observations, we can see which variables have missing values.


In [20]:
NBA_Games.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18421 entries, 0 to 18420
Data columns (total 32 columns):
CITY                 18421 non-null object
TEAM_NAME            18421 non-null object
TEAM_ID              18421 non-null int64
NICKNAME             18421 non-null object
STATE                18421 non-null object
YEAR_FOUNDED         18421 non-null int64
SEASON_ID            18421 non-null int64
TEAM_ABBREVIATION    18421 non-null object
GAME_ID              18421 non-null int64
GAME_DATE            18421 non-null object
MATCHUP              18421 non-null object
WL                   18414 non-null object
MIN                  18421 non-null int64
PTS                  18421 non-null int64
FGM                  18421 non-null int64
FGA                  18421 non-null int64
FG_PCT               18419 non-null float64
FG3M                 18421 non-null int64
FG3A                 18421 non-null int64
FG3_PCT              18418 non-null float64
FTM                  18421 non-null int

_The total number of rows is 18421, so there are missing values in the variables WL, FG_PCT, FG3_PCT, and FT_PCT._

#### Detecting missing values
We can use the isnull() function and the notnull() function to detect where the missing values are.


In [21]:
NBA_Games.notnull()

Unnamed: 0,CITY,TEAM_NAME,TEAM_ID,NICKNAME,STATE,YEAR_FOUNDED,SEASON_ID,TEAM_ABBREVIATION,GAME_ID,GAME_DATE,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
0,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
5,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
6,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
7,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
8,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
9,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
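
A full True/False table is hard to scan for a large dataset. One common way to summarize it, sketched here on a made-up table `toy`, is to count the missing values per column and to select the rows that contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with two missing entries.
toy = pd.DataFrame({
    "WL": ["W", None, "L"],
    "FG_PCT": [0.45, 0.50, np.nan],
    "PTS": [101, 95, 88],
})

# isnull() gives True where a value is missing; sum() counts the Trues per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column.tolist())      # [1, 1, 0]

# any(axis=1) flags rows with at least one missing value.
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing.index.tolist())  # [1, 2]
```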


### Handling Missing Values

There are two main approaches to handling missing values. 
- First, we can simply drop the observations with missing values.

#### Drop observations with a missing value in the variable "FG_PCT"

In [22]:
NBA_Games=NBA_Games[pd.notnull(NBA_Games["FG_PCT"])]
NBA_Games.shape

(18419, 32)

- Second, we can replace the missing values with valid values (imputation), such as the mean or the median.

#### We can use the fillna() method to replace missing values with the mean or the median of the variable.


In [23]:
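# numeric_only=True limits the means to numeric columns; pandas 2.0+ raises an error on text columns otherwise
NBA_Games=NBA_Games.fillna(NBA_Games.mean(numeric_only=True))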
NBA_Games.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18419 entries, 0 to 18420
Data columns (total 32 columns):
CITY                 18419 non-null object
TEAM_NAME            18419 non-null object
TEAM_ID              18419 non-null int64
NICKNAME             18419 non-null object
STATE                18419 non-null object
YEAR_FOUNDED         18419 non-null int64
SEASON_ID            18419 non-null int64
TEAM_ABBREVIATION    18419 non-null object
GAME_ID              18419 non-null int64
GAME_DATE            18419 non-null object
MATCHUP              18419 non-null object
WL                   18414 non-null object
MIN                  18419 non-null int64
PTS                  18419 non-null int64
FGM                  18419 non-null int64
FGA                  18419 non-null int64
FG_PCT               18419 non-null float64
FG3M                 18419 non-null int64
FG3A                 18419 non-null int64
FG3_PCT              18419 non-null float64
FTM                  18419 non-null int
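
The info() output above still reports 18414 non-null values for WL after the fill, because a text column has no mean to fill with. The sketch below, on a made-up table `toy`, shows this and also fills a single numeric column with its median instead of the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing value in each column.
toy = pd.DataFrame({
    "WL": ["W", None, "L", "W"],
    "FG3_PCT": [0.30, np.nan, 0.40, 0.50],
})

# Fill one numeric column with its own median (0.40 for the three known values).
toy["FG3_PCT"] = toy["FG3_PCT"].fillna(toy["FG3_PCT"].median())
print(toy["FG3_PCT"].tolist())   # [0.3, 0.4, 0.4, 0.5]

# The text column is untouched, so its missing value remains.
print(toy["WL"].isnull().sum())  # 1
```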

## Creating variables
We can create a variable equal to the total number of goals made.


In [24]:
NBA_Games['GM']=NBA_Games['FGM']+NBA_Games['FG3M']+NBA_Games['FTM']

## Self Test - 4
- Create a variable called "GA" equal to the total number of goals attempted.

In [27]:
Games.columns

Index(['SEASON_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID',
       'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'PTS', 'FGM', 'FGA', 'FG_PCT',
       'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB',
       'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PLUS_MINUS'],
      dtype='object')

In [28]:
#Your Code Here
NBA_Games['GA'] = NBA_Games['FTA'] + NBA_Games['FG3A'] + NBA_Games['FGA']

### Create variables based on conditions
- We can create a variable conditional on the value of another variable.

For example, we can create a variable "RESULT" that equals "W" if the team won the game and "L" otherwise. The result of the game can be inferred from the team's plus-minus ("PLUS_MINUS"), that is, whether it is positive or negative.


In [29]:
NBA_Games['RESULT'] = np.where(NBA_Games['PLUS_MINUS']>0, 'W', 'L')

We will now drop this newly created "RESULT" variable.


In [30]:
NBA_Games.drop(['RESULT'], axis=1, inplace=True)
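
The np.where call used above to build "RESULT" takes a condition, a value for rows where the condition is true, and a value for the remaining rows. A minimal sketch on a made-up table `toy` follows; the column name "WIN" is an illustrative choice for a numeric 1/0 version.

```python
import numpy as np
import pandas as pd

# Hypothetical plus-minus values: a win, a loss, and a zero.
toy = pd.DataFrame({"PLUS_MINUS": [8.0, -5.0, 0.0]})

# Text labels, as in the RESULT variable.
toy["RESULT"] = np.where(toy["PLUS_MINUS"] > 0, "W", "L")
print(toy["RESULT"].tolist())  # ['W', 'L', 'L']

# The same condition can produce a numeric indicator instead.
toy["WIN"] = np.where(toy["PLUS_MINUS"] > 0, 1, 0)
print(toy["WIN"].tolist())     # [1, 0, 0]
```

Note that a plus-minus of exactly zero falls into the "otherwise" branch with this condition.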

### Create a variable within group
In the dataset, each game has two observations, one represents the statistics of the home team, one represents those of the away team. Both observations have the same GAME_ID. We can create a variable "POINT_DIFF" that equals the difference between the points earned by the two teams.

We will first sort the data not only by the "GAME_ID" but also by the result "WL".


In [31]:
NBA_Games.sort_values(['GAME_ID','WL'], inplace=True)
NBA_Games["POINT_DIFF"]=NBA_Games.groupby(["GAME_ID"])["PTS"].diff()

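Within each game, "L" sorts before "W", so diff() leaves the losing team's row empty and stores the winner's points minus the loser's points in the winning team's row. The "POINT_DIFF" variable therefore only has the point difference for the winning team, so we need to impute the point difference for the losing team as well.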



In [32]:
NBA_Games['POINT_DIFF'] = NBA_Games['POINT_DIFF'].fillna(NBA_Games.groupby('GAME_ID')['POINT_DIFF'].transform('mean'))
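
The two steps above can be traced on a small example. This is a sketch on a made-up table `toy` with two games; the point values are illustrative.

```python
import pandas as pd

# Hypothetical toy data: two games, two rows (teams) per game.
toy = pd.DataFrame({
    "GAME_ID": [1, 1, 2, 2],
    "WL": ["W", "L", "L", "W"],
    "PTS": [100, 90, 80, 95],
})

# Within each game, the "L" row now comes before the "W" row.
toy = toy.sort_values(["GAME_ID", "WL"])

# diff() subtracts the previous row in the same game: empty for the loser,
# winner's points minus loser's points for the winner.
toy["POINT_DIFF"] = toy.groupby("GAME_ID")["PTS"].diff()
print(toy["POINT_DIFF"].tolist())  # [nan, 10.0, nan, 15.0]

# The group mean ignores the empty value, so it equals the winner's margin;
# fillna() copies that margin into the loser's row.
toy["POINT_DIFF"] = toy["POINT_DIFF"].fillna(
    toy.groupby("GAME_ID")["POINT_DIFF"].transform("mean")
)
print(toy["POINT_DIFF"].tolist())  # [10.0, 10.0, 15.0, 15.0]
```

Both rows of a game end up with the same positive margin, not a signed difference per team.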

- We can also drop all observations with a missing value in at least one variable using the "dropna()" method.


In [33]:
NBA_Games=NBA_Games.dropna()
NBA_Games.shape

(17779, 35)

### Creating new dataframe

#### Create a new dataframe that aggregates information by group

Sometimes we may want to work with season-level data rather than game-level data. We can create a new dataset that aggregates each team's statistics in each season.




In [34]:
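# Select several columns with a list (double brackets); a bare tuple of names is rejected by pandas 2.0+
NBA_Team_Stats=NBA_Games.groupby(['TEAM_ID', 'SEASON_ID'])[['PTS','FGM','FGA','FG_PCT','FG3M','FG3A','FG3_PCT','FTM','FTA','FT_PCT','OREB','DREB','REB','AST','STL','BLK','TOV','PF','PLUS_MINUS']].sum()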
display(NBA_Team_Stats)

TEAM_ID,SEASON_ID,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
1610612737,12013,523,193,459,2.520,34,131,1.576,103,140,4.477,44,184,228,130,61,22,118,112,-75.0
1610612737,12014,708,247,554,3.122,66,182,2.549,148,197,5.262,63,237,300,178,48,35,116,170,15.0
1610612737,12015,652,226,548,2.896,59,176,2.340,141,175,5.630,53,271,324,141,57,41,122,143,-5.0
1610612737,12016,686,261,593,3.083,50,159,2.140,114,153,5.253,71,263,334,184,52,40,112,154,41.0
1610612737,12017,480,167,410,2.045,51,156,1.611,95,125,3.798,40,169,209,113,40,15,95,91,-17.6
1610612737,12018,563,206,445,2.312,63,189,1.659,88,124,3.641,49,175,224,120,49,31,104,157,-18.0
1610612737,22012,2074,795,1701,9.824,155,454,7.206,329,444,15.640,191,681,872,506,178,68,293,367,-15.0
1610612737,22013,8272,3061,6719,37.846,747,2085,29.419,1403,1799,64.601,731,2532,3263,2005,686,327,1220,1610,-81.0
1610612737,22014,8786,3253,7017,40.399,848,2256,32.719,1432,1840,67.699,754,2728,3482,2171,786,396,1183,1577,433.0
1610612737,22015,8701,3265,7163,39.250,833,2403,29.818,1338,1716,66.851,715,2868,3583,2143,789,517,1265,1678,294.8


Notice that the newly created dataset has two index levels, "TEAM_ID" and "SEASON_ID".

#### If we want to convert these two index levels back into variables, we can use the "reset_index" method.


In [35]:
NBA_Team_Stats=NBA_Team_Stats.reset_index()
display(NBA_Team_Stats)

Unnamed: 0,TEAM_ID,SEASON_ID,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
0,1610612737,12013,523,193,459,2.520,34,131,1.576,103,...,4.477,44,184,228,130,61,22,118,112,-75.0
1,1610612737,12014,708,247,554,3.122,66,182,2.549,148,...,5.262,63,237,300,178,48,35,116,170,15.0
2,1610612737,12015,652,226,548,2.896,59,176,2.340,141,...,5.630,53,271,324,141,57,41,122,143,-5.0
3,1610612737,12016,686,261,593,3.083,50,159,2.140,114,...,5.253,71,263,334,184,52,40,112,154,41.0
4,1610612737,12017,480,167,410,2.045,51,156,1.611,95,...,3.798,40,169,209,113,40,15,95,91,-17.6
5,1610612737,12018,563,206,445,2.312,63,189,1.659,88,...,3.641,49,175,224,120,49,31,104,157,-18.0
6,1610612737,22012,2074,795,1701,9.824,155,454,7.206,329,...,15.640,191,681,872,506,178,68,293,367,-15.0
7,1610612737,22013,8272,3061,6719,37.846,747,2085,29.419,1403,...,64.601,731,2532,3263,2005,686,327,1220,1610,-81.0
8,1610612737,22014,8786,3253,7017,40.399,848,2256,32.719,1432,...,67.699,754,2728,3482,2171,786,396,1183,1577,433.0
9,1610612737,22015,8701,3265,7163,39.250,833,2403,29.818,1338,...,66.851,715,2868,3583,2143,789,517,1265,1678,294.8
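
In the table above every column is summed, including the percentage columns, whose season totals (for example an FG_PCT of 37.846) are hard to interpret. One possible variation, sketched here on a made-up table `toy`, is to give each column its own aggregation with "agg". The output names PTS_TOTAL and FG_PCT_MEAN are illustrative choices, and averaging game percentages is only one option, not necessarily the approach intended in this course.

```python
import pandas as pd

# Hypothetical toy data: two teams, two games each, one season.
toy = pd.DataFrame({
    "TEAM_ID": [1, 1, 2, 2],
    "SEASON_ID": [22018, 22018, 22018, 22018],
    "PTS": [100, 90, 80, 95],
    "FG_PCT": [0.50, 0.40, 0.45, 0.55],
})

# Named aggregation: new_column=(source_column, function).
season_stats = (
    toy.groupby(["TEAM_ID", "SEASON_ID"])
       .agg(PTS_TOTAL=("PTS", "sum"), FG_PCT_MEAN=("FG_PCT", "mean"))
       .reset_index()
)
print(season_stats["PTS_TOTAL"].tolist())  # [190, 175]
```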


### We can create a variable equal to the total number of observations within a specified group using the size() method.
- Create a variable equal to the total number of games played by a team in each season, and name this variable "GAME_COUNT".


In [36]:
NBA_Game_Count=NBA_Games.groupby(['TEAM_ID','SEASON_ID']).size().reset_index(name='GAME_COUNT')
display(NBA_Game_Count)

Unnamed: 0,TEAM_ID,SEASON_ID,GAME_COUNT
0,1610612737,12013,6
1,1610612737,12014,7
2,1610612737,12015,7
3,1610612737,12016,7
4,1610612737,12017,5
5,1610612737,12018,5
6,1610612737,22012,21
7,1610612737,22013,83
8,1610612737,22014,87
9,1610612737,22015,86
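
A game count like this is often combined with season totals. As a sketch under made-up data (the table `toy` and the column name PTS_PER_GAME are illustrative assumptions), the count can be merged with the summed points to obtain points per game.

```python
import pandas as pd

# Hypothetical toy data: team 1 plays two games in one season and one in the next.
toy = pd.DataFrame({
    "TEAM_ID": [1, 1, 1, 2],
    "SEASON_ID": [22018, 22018, 22019, 22018],
    "PTS": [100, 90, 110, 80],
})

# size() counts the rows in each team-season group.
game_count = toy.groupby(["TEAM_ID", "SEASON_ID"]).size().reset_index(name="GAME_COUNT")
print(game_count["GAME_COUNT"].tolist())  # [2, 1, 1]

# Season point totals, merged with the counts on the two group keys.
pts_total = toy.groupby(["TEAM_ID", "SEASON_ID"])["PTS"].sum().reset_index()
per_game = pd.merge(pts_total, game_count, on=["TEAM_ID", "SEASON_ID"])
per_game["PTS_PER_GAME"] = per_game["PTS"] / per_game["GAME_COUNT"]
print(per_game["PTS_PER_GAME"].tolist())  # [95.0, 110.0, 80.0]
```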


## Saving data

### We can save a dataframe by exporting it to a CSV file using the “to_csv” method.
- Save the merged data as a CSV file.

_We can use the "index=False" argument to save the data without adding the index as a column in the CSV file._

In [None]:
NBA_Games.to_csv("../../Data/Week 2/NBA_Games.csv", index=False)
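
The effect of "index=False" can be checked without touching the data repository by writing to an in-memory text buffer instead of a file. This is a sketch on a made-up table `toy`; the buffer and variable names are illustrative.

```python
import io

import pandas as pd

# Hypothetical toy table used only for illustration.
toy = pd.DataFrame({"TEAM_ID": [1, 2], "PTS": [101, 95]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)
print(cols_with_index)     # ['Unnamed: 0', 'TEAM_ID', 'PTS']

# index=False: only the actual variables are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)
print(cols_without_index)  # ['TEAM_ID', 'PTS']
```

A column named "Unnamed: 0" in an imported file is typically the saved index of an earlier export.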