In [30]:
import numpy as np
import pandas as pd

Change the pandas display option so that all columns of a dataframe are shown.

In [31]:
pd.set_option('display.max_columns', None)

Build a dataframe containing two columns, PLAYER_ID and NAME.

In [32]:
player_id_name = pd.read_csv('../season_counting_stats.csv')

In [33]:
# We drop duplicates because a given player could have played for
# more than one season or could have played only one season but
# for multiple teams, resulting in multiple identical rows.
player_id_name = player_id_name[['PLAYER_ID', 'NAME']].drop_duplicates()

Identify NBA players in the player_id_name dataframe who share a name with another player. For example, there are two entries for 'Patrick Ewing' (father and son). Strictly speaking, the younger one is Patrick Ewing Jr., but the source data omits the 'Jr.'.

In [34]:
nonunique_player_names = player_id_name[player_id_name.duplicated(subset='NAME', keep=False)]

In [35]:
nonunique_player_names.sort_values(by='NAME')

Unnamed: 0,PLAYER_ID,NAME
19037,78209,Bill Smith
19031,78207,Bill Smith
13004,76610,Bob Duffy
13001,76609,Bob Duffy
15305,77193,Bobby Jones
...,...,...
25697,203502,Tony Mitchell
21920,201041,Walker Russell
18371,78048,Walker Russell
15357,77203,Willie Jones

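As a complement to the listing above, the number of distinct IDs attached to each name can be counted directly. The following is only a minimal sketch: `sample` is a toy stand-in for player_id_name built from a few rows shown in the outputs of this notebook, and the variable names `ids_per_name` and `ambiguous` are introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for player_id_name, using rows shown in this notebook's outputs.
sample = pd.DataFrame({
    "PLAYER_ID": [78209, 78207, 76610, 76609, 2],
    "NAME": ["Bill Smith", "Bill Smith", "Bob Duffy", "Bob Duffy", "Byron Scott"],
})

# Number of distinct PLAYER_ID values attached to each name.
ids_per_name = sample.groupby("NAME")["PLAYER_ID"].nunique()

# Keep only the names that map to more than one ID.
ambiguous = ids_per_name[ids_per_name > 1]
print(ambiguous)
```

Applied to the full dataframe, the same two lines would give one row per shared name, which is easier to scan than the row-per-player listing.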

As we can see, some players share identical names. One thing that might help when matching players to their IDs is to include the season as well, since it is presumably less likely that two NBA players with the same name are in the league during the same season. Let's try this and see whether it eliminates the nonuniqueness.

In [36]:
player_id_season = pd.read_csv('../season_counting_stats.csv')

In [37]:
# We drop duplicates because a given player can have played
# for multiple teams in a given season, resulting in multiple identical rows.
player_id_season = player_id_season[['PLAYER_ID', 'SEASON_START', 'NAME']].drop_duplicates()

In [38]:
player_id_season

Unnamed: 0,PLAYER_ID,SEASON_START,NAME
0,2,1983,Byron Scott
1,2,1984,Byron Scott
2,2,1985,Byron Scott
3,2,1986,Byron Scott
4,2,1987,Byron Scott
...,...,...,...
29922,1641926,2023,Dexter Dennis
29923,1641931,2023,Onuralp Bitim
29924,1641970,2023,Maozinha Pereira
29925,1641998,2023,Trey Jemison


In [39]:
nonunique_player_id_season_pairs = player_id_season[player_id_season.duplicated(subset=['SEASON_START', 'NAME'], keep=False)]

In [40]:
nonunique_player_id_season_pairs.sort_values(by=['NAME', 'SEASON_START'])

Unnamed: 0,PLAYER_ID,SEASON_START,NAME
2115,279,1984,Charles Jones
15249,77178,1984,Charles Jones
2118,279,1985,Charles Jones
15250,77178,1985,Charles Jones
2120,279,1987,Charles Jones
15251,77178,1987,Charles Jones
2121,279,1988,Charles Jones
15252,77178,1988,Charles Jones
2256,293,1989,Charles Smith
18900,78179,1989,Charles Smith

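One way to use such a table for matching is a lookup that returns every candidate ID for a name and season, so that ambiguous pairs are surfaced rather than silently resolved. The sketch below is an assumption about how this could be done, not part of the original analysis: `lookup_player_ids` is a hypothetical helper, and `sample` is a toy stand-in for player_id_season built from rows shown in the outputs above.

```python
import pandas as pd

# Toy stand-in for player_id_season, using rows shown in the outputs above.
sample = pd.DataFrame({
    "PLAYER_ID": [2, 2, 279, 77178],
    "SEASON_START": [1983, 1984, 1984, 1984],
    "NAME": ["Byron Scott", "Byron Scott", "Charles Jones", "Charles Jones"],
})


def lookup_player_ids(df, name, season_start):
    """Hypothetical helper: return all PLAYER_IDs for a (name, season) pair."""
    mask = (df["NAME"] == name) & (df["SEASON_START"] == season_start)
    return sorted(df.loc[mask, "PLAYER_ID"].unique().tolist())


print(lookup_player_ids(sample, "Byron Scott", 1983))    # a single candidate
print(lookup_player_ids(sample, "Charles Jones", 1984))  # two candidates
```

Returning a list leaves the decision to the caller: an empty list means no match, one element is an unambiguous match, and more than one element needs extra information (for example, the team) to resolve.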

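Including the season reduces the ambiguity but does not eliminate it: two players named Charles Jones overlap in the 1984, 1985, 1987, and 1988 seasons, and two players named Charles Smith overlap in 1989.

Next, we write these dataframes to CSV files. To reduce clutter, the index column is not written.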

In [41]:
player_id_name.to_csv('player_id_name.csv', index=False)

In [42]:
nonunique_player_names.to_csv('nonunique_player_names.csv', index=False)

In [43]:
player_id_season.to_csv('player_id_season.csv', index=False)

In [44]:
nonunique_player_id_season_pairs.to_csv('nonunique_player_id_season_pairs.csv', index=False)

Note that different data sources format names differently. For example, some write 'AJ Green', while others write 'A.J. Green'. Some write 'Bogdan Bogdanovic', while others write 'Bogdan Bogdanović'. These formatting differences add further difficulty when associating a player ID with a player name that comes from a different data source.

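A possible way to soften these formatting differences is to compare names through a normalized key instead of the raw string. The sketch below is only one assumption about how that might look: `normalize_name` is a hypothetical helper that strips accents, periods, case, and extra whitespace using Python's standard `unicodedata` module.

```python
import unicodedata


def normalize_name(name):
    """Hypothetical helper: reduce a player name to a comparison key."""
    # Decompose accented characters into a base letter plus combining marks,
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove periods, lowercase, and collapse runs of whitespace.
    return " ".join(no_accents.replace(".", "").lower().split())


print(normalize_name("A.J. Green") == normalize_name("AJ Green"))
print(normalize_name("Bogdan Bogdanović") == normalize_name("Bogdan Bogdanovic"))
```

This handles only the two kinds of difference mentioned above. Letters that have no Unicode decomposition (such as 'ø' or 'đ') are left unchanged, and a coarser key can make more names collide, so the result should still be checked against the season information.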