# NBA Data Lab

### Introduction

In this lesson, we'll use our knowledge of pandas to coerce columns of NBA player data, originally drawn from the [sports reference package](https://sportsreference.readthedocs.io/en/stable/), into more useful datatypes.

### Loading the data

In [71]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/introductory-pandas/master/2-coercing-data/nba_players.csv?token=ANKFJMFY7KDGORDKHUCCMUK6QVGFA"
df = pd.read_csv(url, index_col = 0)

In [72]:
df[:2]

Unnamed: 0,player_id,name,weight,birth_date,height,nationality,team_abbreviation,most_recent_season,box_plus_minus,games_played,games_started,player_efficiency_rating,three_point_percentage,true_shooting_percentage,two_point_attempts,two_point_percentage,two_pointers
0,klebima01,Maxi Kleber,240lb,1992-01-29,6-10,Germany,DAL,2018-19,0.3,209.0,75.0,13.4,0.354,0.588,529.0,0.597,316.0
1,wrighde01,Delon Wright,183lb,1992-04-26,6-5,United States of America,DAL,2018-19,2.2,263.0,23.0,16.0,0.345,0.549,1086.0,0.498,541.0


### Exploring the data

Let's start off by looking at the different datatypes of the various columns.

In [73]:
df.dtypes

player_id                    object
name                         object
weight                       object
birth_date                   object
height                       object
nationality                  object
team_abbreviation            object
most_recent_season           object
box_plus_minus              float64
games_played                float64
games_started               float64
player_efficiency_rating    float64
three_point_percentage      float64
true_shooting_percentage    float64
two_point_attempts          float64
two_point_percentage        float64
two_pointers                float64
dtype: object

As we can see, the first eight columns are of type object, while the remaining ones are of type float64.  Let's select just the columns of type object, and then we can get to work coercing some of them.

In [74]:
players_object_df = df.select_dtypes('object')

In [75]:
players_object_df.columns
# Index(['player_id', 'name', 'weight', 'birth_date', 'height', 'nationality',
#        'team_abbreviation', 'most_recent_season'],
#       dtype='object')

Index(['player_id', 'name', 'weight', 'birth_date', 'height', 'nationality',
       'team_abbreviation', 'most_recent_season'],
      dtype='object')
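As a minimal sketch of how `select_dtypes` behaves, the example below builds a small toy dataframe (the values are illustrative, not the full dataset) and selects columns by datatype. The names `toy_df`, `text_cols`, and `numeric_cols` are assumptions made for this example.

```python
import pandas as pd

# Toy frame with two text columns and one numeric column.
# dtype=object is set explicitly so the text columns match the lesson's dtypes.
toy_df = pd.DataFrame({
    "name": pd.Series(["Maxi Kleber", "Delon Wright"], dtype=object),
    "weight": pd.Series(["240lb", "183lb"], dtype=object),
    "games_played": [209.0, 263.0],
})

# include= keeps only the columns whose dtype matches.
text_cols = toy_df.select_dtypes(include="object")
numeric_cols = toy_df.select_dtypes(include="number")

print(list(text_cols.columns))
print(list(numeric_cols.columns))
```

Passing `exclude=` instead of `include=` returns the complementary set of columns.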

Ok, now columns like `weight`, `birth_date`, `height`, and `most_recent_season` are candidates to coerce into different datatypes.

Let's just select those columns.

In [76]:
df_candidates = players_object_df[['weight', 'birth_date', 'height', 'most_recent_season']]
df_candidates[:2]

Unnamed: 0,weight,birth_date,height,most_recent_season
0,240lb,1992-01-29,6-10,2018-19
1,183lb,1992-04-26,6-5,2018-19


### Changing Birthdate

In [77]:
birth_date_as_dt = pd.to_datetime(df_candidates['birth_date'])



In [78]:
birth_date_as_dt.dtype
# dtype('<M8[ns]')

dtype('<M8[ns]')

In [79]:
df_candidates = df_candidates.assign(birth_date = birth_date_as_dt)

In [80]:
df_candidates.dtypes

# weight                        object
# birth_date            datetime64[ns]
# height                        object
# most_recent_season            object
# dtype: object

weight                        object
birth_date            datetime64[ns]
height                        object
most_recent_season            object
dtype: object
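If some birth dates were malformed, `pd.to_datetime` would raise an error by default. The sketch below, which uses made-up values rather than the real dataset, shows one way to turn unparseable entries into `NaT` with `errors="coerce"`, and then pull out the year through the `.dt` accessor.

```python
import pandas as pd

# Illustrative values; the last entry is deliberately not a date.
birth_dates = pd.Series(["1992-01-29", "1992-04-26", "not a date"])

# errors="coerce" replaces anything that does not match the format with NaT.
parsed = pd.to_datetime(birth_dates, format="%Y-%m-%d", errors="coerce")

# Once the column is a datetime, components are available through .dt
birth_years = parsed.dt.year

print(parsed.isna().tolist())
print(birth_years.iloc[0])
```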

### Weight

Next, let's move on to weight.  As we saw, there are a number of ways that we can change weight so that we only have digits in each entry.  Try slicing the string, and then try using the replace method, so that each entry only includes digits.

* Using slice

In [81]:
sliced_weights = df_candidates['weight'].str[:-2]
sliced_weights[:3]

0    240
1    183
2    220
Name: weight, dtype: object

* Using replace

In [82]:
replaced_weights = df_candidates['weight'].str.replace('lb', '')
replaced_weights[:2]

0    240
1    183
Name: weight, dtype: object
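Both approaches above assume every entry ends in exactly `lb`. As a hedged alternative, assuming the unit might be written inconsistently, `str.extract` with a regular expression can pull out just the leading digits. The sample values below are invented for illustration.

```python
import pandas as pd

# Illustrative weights with inconsistent unit suffixes.
weights = pd.Series(["240lb", "183 lbs", "220lb"], name="weight")

# (\d+) captures the first run of digits; expand=False returns a Series.
digits_only = weights.str.extract(r"(\d+)", expand=False)

print(digits_only.tolist())
```

The result still holds strings, so it needs a numeric conversion before doing arithmetic with it.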

Now let's change the series to type integer, and then add it to our `df_candidates` dataframe.

In [83]:
import numpy as np
weight_int = replaced_weights.astype(np.int_)

In [84]:
weight_int.dtype
# dtype('int64')

dtype('int64')
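Calling `astype` with an integer type works here because every weight is present. As a sketch of what one might do if some entries were missing, `pd.to_numeric` with `errors="coerce"` produces `NaN` for bad values, and pandas' nullable `Int64` dtype keeps the rest as integers. The sample series below is hypothetical.

```python
import pandas as pd

# Illustrative digit strings with one missing entry.
weights = pd.Series(["240", "183", None], dtype=object)

# to_numeric yields float64 here, because NaN cannot live in a plain int64 column.
as_float = pd.to_numeric(weights, errors="coerce")

# The nullable Int64 dtype (capital I) stores integers alongside missing values.
as_nullable_int = as_float.astype("Int64")

print(as_float.dtype)
print(as_nullable_int.dtype)
```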

> We'll assign the column for you.

In [85]:
df_candidates = df_candidates.assign(weight = weight_int)

In [86]:
df_candidates.dtypes
# weight                         int64
# birth_date            datetime64[ns]
# height                        object
# most_recent_season            object
# dtype: object

weight                         int64
birth_date            datetime64[ns]
height                        object
most_recent_season            object
dtype: object

### Coercing Season

Now let's change the `most_recent_season` column.  Change it so that it only lists the latter year (e.g., 2018-19 should become 2019, and 2017-18 should become 2018).

In [87]:
recent_season = df_candidates['most_recent_season'].str[-2:]
numeric_season = pd.to_numeric(recent_season)
recent_season = numeric_season + 2000

In [88]:
recent_season[:3]

# 0    2019.0
# 1    2019.0
# 2    2019.0

0    2019.0
1    2019.0
2    2019.0
Name: most_recent_season, dtype: float64

Now update `df_candidates` to use the `recent_season` column, using the `assign` method.

In [89]:
df_candidates = df_candidates.assign(most_recent_season = recent_season)

In [90]:
df_candidates.dtypes

weight                         int64
birth_date            datetime64[ns]
height                        object
most_recent_season           float64
dtype: object
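Adding 2000 to the last two digits assumes every season ends in 2000 or later; a season such as 1998-99 would come out as 2099. One alternative sketch, not the approach used above, takes the four-digit starting year and adds one. The sample seasons below are illustrative.

```python
import pandas as pd

# Illustrative seasons, including two that span the turn of the century.
seasons = pd.Series(["2018-19", "1998-99", "1999-00"], name="most_recent_season")

# The first four characters hold the starting year; the season ends one year later.
start_year = pd.to_numeric(seasons.str[:4])
end_year = start_year + 1

print(end_year.tolist())
```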

### Changing Height

In [95]:
feet = df_candidates['height'].str.split('-').str[0]
inches = df_candidates['height'].str.split('-').str[-1]

In [101]:
feet_inches = pd.to_numeric(feet) * 12

In [103]:
inches = pd.to_numeric(inches)

In [105]:
total_inches = feet_inches + inches

In [108]:
total_inches[:5]
# 0    82
# 1    77
# 2    79
# 3    77
# 4    74

0    82
1    77
2    79
3    77
4    74
Name: height, dtype: int64
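The same feet-and-inches conversion can be sketched with a single split. Passing `expand=True` to `str.split` returns a dataframe with one column per piece, so the string is only split once. The column names `feet` and `inches` are chosen for this example.

```python
import pandas as pd

# Illustrative heights in feet-inches form.
heights = pd.Series(["6-10", "6-5", "6-7"], name="height")

# expand=True gives one column per piece; astype(int) converts both at once.
parts = heights.str.split("-", expand=True).astype(int)
parts.columns = ["feet", "inches"]

# 12 inches per foot, plus the leftover inches.
height_in_inches = parts["feet"] * 12 + parts["inches"]

print(height_in_inches.tolist())
```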

Next, update `df_candidates` to use the new column.

In [111]:
df_candidates = df_candidates.assign(height = total_inches)

In [114]:
df_candidates.dtypes
# weight                         int64
# birth_date            datetime64[ns]
# height                         int64
# most_recent_season           float64
# dtype: object

weight                         int64
birth_date            datetime64[ns]
height                         int64
most_recent_season           float64
dtype: object

### Updating the original dataframe

Now that we have coerced all of our data, it's time to combine this with the data from our original dataframe.

In [115]:
df[:2]

Unnamed: 0,player_id,name,weight,birth_date,height,nationality,team_abbreviation,most_recent_season,box_plus_minus,games_played,games_started,player_efficiency_rating,three_point_percentage,true_shooting_percentage,two_point_attempts,two_point_percentage,two_pointers
0,klebima01,Maxi Kleber,240lb,1992-01-29,6-10,Germany,DAL,2018-19,0.3,209.0,75.0,13.4,0.354,0.588,529.0,0.597,316.0
1,wrighde01,Delon Wright,183lb,1992-04-26,6-5,United States of America,DAL,2018-19,2.2,263.0,23.0,16.0,0.345,0.549,1086.0,0.498,541.0


In [116]:
df_candidates.columns

Index(['weight', 'birth_date', 'height', 'most_recent_season'], dtype='object')

In [123]:
original_cols = df.loc[:, ~df.columns.isin(df_candidates.columns)]

In [125]:
original_cols[:2]

Unnamed: 0,player_id,name,nationality,team_abbreviation,box_plus_minus,games_played,games_started,player_efficiency_rating,three_point_percentage,true_shooting_percentage,two_point_attempts,two_point_percentage,two_pointers
0,klebima01,Maxi Kleber,Germany,DAL,0.3,209.0,75.0,13.4,0.354,0.588,529.0,0.597,316.0
1,wrighde01,Delon Wright,United States of America,DAL,2.2,263.0,23.0,16.0,0.345,0.549,1086.0,0.498,541.0


In [130]:
updated_players = df_candidates.merge(original_cols, left_index=True, right_index=True)

In [131]:
updated_players[:2]

Unnamed: 0,weight,birth_date,height,most_recent_season,player_id,name,nationality,team_abbreviation,box_plus_minus,games_played,games_started,player_efficiency_rating,three_point_percentage,true_shooting_percentage,two_point_attempts,two_point_percentage,two_pointers
0,240,1992-01-29,82,2019.0,klebima01,Maxi Kleber,Germany,DAL,0.3,209.0,75.0,13.4,0.354,0.588,529.0,0.597,316.0
1,183,1992-04-26,77,2019.0,wrighde01,Delon Wright,United States of America,DAL,2.2,263.0,23.0,16.0,0.345,0.549,1086.0,0.498,541.0


Let's make sure that the data lines up with our original.

In [132]:
df[:2]

Unnamed: 0,player_id,name,weight,birth_date,height,nationality,team_abbreviation,most_recent_season,box_plus_minus,games_played,games_started,player_efficiency_rating,three_point_percentage,true_shooting_percentage,two_point_attempts,two_point_percentage,two_pointers
0,klebima01,Maxi Kleber,240lb,1992-01-29,6-10,Germany,DAL,2018-19,0.3,209.0,75.0,13.4,0.354,0.588,529.0,0.597,316.0
1,wrighde01,Delon Wright,183lb,1992-04-26,6-5,United States of America,DAL,2018-19,2.2,263.0,23.0,16.0,0.345,0.549,1086.0,0.498,541.0
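Comparing the first rows by eye works for a quick check. As a small sketch of a programmatic check, assuming two toy dataframes that share an index, we can confirm that an index-based merge kept the same rows and left an untouched column unchanged.

```python
import pandas as pd

# Toy frames sharing the same index (values are illustrative).
coerced = pd.DataFrame({"weight": [240, 183]}, index=[0, 1])
untouched = pd.DataFrame({"name": ["Maxi Kleber", "Delon Wright"]}, index=[0, 1])

# Merge on the index of both frames, as in the lesson.
merged = coerced.merge(untouched, left_index=True, right_index=True)

# The merged frame should have the same index and the same names as before.
same_rows = merged.index.equals(untouched.index)
names_match = merged["name"].equals(untouched["name"])

print(same_rows, names_match)
```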


And then, let's reorder the columns of our `updated_players` dataframe.

In [133]:
updated_players.columns

Index(['weight', 'birth_date', 'height', 'most_recent_season', 'player_id',
       'name', 'nationality', 'team_abbreviation', 'box_plus_minus',
       'games_played', 'games_started', 'player_efficiency_rating',
       'three_point_percentage', 'true_shooting_percentage',
       'two_point_attempts', 'two_point_percentage', 'two_pointers'],
      dtype='object')

In [137]:
df.columns

Index(['player_id', 'name', 'weight', 'birth_date', 'height', 'nationality',
       'team_abbreviation', 'most_recent_season', 'box_plus_minus',
       'games_played', 'games_started', 'player_efficiency_rating',
       'three_point_percentage', 'true_shooting_percentage',
       'two_point_attempts', 'two_point_percentage', 'two_pointers'],
      dtype='object')

We can do so by passing the original dataframe's columns to the `loc` indexer.

In [141]:
players_df = updated_players.loc[:, df.columns]
players_df[:2]

Unnamed: 0,player_id,name,weight,birth_date,height,nationality,team_abbreviation,most_recent_season,box_plus_minus,games_played,games_started,player_efficiency_rating,three_point_percentage,true_shooting_percentage,two_point_attempts,two_point_percentage,two_pointers
0,klebima01,Maxi Kleber,240,1992-01-29,82,Germany,DAL,2019.0,0.3,209.0,75.0,13.4,0.354,0.588,529.0,0.597,316.0
1,wrighde01,Delon Wright,183,1992-04-26,77,United States of America,DAL,2019.0,2.2,263.0,23.0,16.0,0.345,0.549,1086.0,0.498,541.0
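As an alternative sketch to merging and then reordering, `assign` can overwrite existing columns in place, which keeps the original column order. The toy dataframes below are illustrative stand-ins for `df` and `df_candidates`.

```python
import pandas as pd

# Stand-in for the original dataframe (illustrative values).
original = pd.DataFrame({
    "name": ["Maxi Kleber", "Delon Wright"],
    "weight": ["240lb", "183lb"],
    "team_abbreviation": ["DAL", "DAL"],
})

# Stand-in for the coerced columns, sharing the original's index.
coerced = pd.DataFrame({"weight": [240, 183]}, index=original.index)

# assign replaces columns that already exist without moving them.
updated = original.assign(**{col: coerced[col] for col in coerced.columns})

print(list(updated.columns))
print(updated["weight"].tolist())
```

Because `assign` returns a new dataframe, `original` is left unchanged.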


### Summary

Nice job.  You are earning your pandas stripes.

### Resources

[Sports Reference API](https://sportsreference.readthedocs.io/en/stable/)

[Set with Copy Warning](https://www.dataquest.io/blog/settingwithcopywarning/)