This notebook gathers information from the available Kaggle datasets to build the **best_player.csv** dataframe.

The data come from Kaggle:
- NBA shot locations between 1997 and 2019: [kaggle](https://www.kaggle.com/jonathangmwl/nba-shot-locations)
- Team records between 2014 and 2018: [kaggle](https://www.kaggle.com/nathanlauga/nba-games?select=ranking.csv)
- NBA players since 1950: [kaggle](https://www.kaggle.com/drgilermo/nba-players-stats?select=Players.csv)


# Packages & data

In [None]:
import pandas as pd
import numpy as np


In [None]:
# Load data

shot_location = pd.read_csv("../data/raw/NBA_Shot_Locations_1997-2020/NBA_Shot_Locations_1997-2020.zip")
players = pd.read_csv("../data/raw/NBA_Players_stats_since_1950/Players.zip")
players_data = pd.read_csv("../data/raw/NBA_Players_stats_since_1950/player_data.zip")
seasons_stats = pd.read_csv("../data/raw/NBA_Players_stats_since_1950/Seasons_Stats.zip")


# ETL

In [None]:
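# Map each Action Type to its shot categories

action_mapping = pd.read_csv("../data/config/ActionMapping_4.csv")
# Empty cells mean "not in this category" (np.NaN no longer exists in NumPy 2.x, so use fillna)
action_mapping = action_mapping.fillna(0)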

cols = action_mapping.columns[1:]
action_mapping[cols] = action_mapping[cols].astype(int)

shot_location = shot_location.merge(action_mapping, on='Action Type', how='inner')
shot_location.drop(['Action Type'], axis=1, inplace=True)


In [None]:
# info() prints its report and returns None, so it does not need display()
shot_location.info()
display(shot_location.head(5))
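
The inner merge above silently drops every shot whose `Action Type` has no row in the mapping file. The sketch below shows one way to list those action types first. It runs on toy data, since the real content of `ActionMapping_4.csv` is not shown here; the action types and flag values are made up.

```python
import pandas as pd

# Toy stand-ins for shot_location and ActionMapping_4.csv (values are made up).
shots = pd.DataFrame({
    "Action Type": ["Jump Shot", "Slam Dunk Shot", "Hook Shot"],
    "Shot Made Flag": [0, 1, 1],
})
mapping = pd.DataFrame({
    "Action Type": ["Jump Shot", "Slam Dunk Shot"],
    "SHOT_ACTION_CATEGORY_SHOOT": [1, None],
    "SHOT_ACTION_CATEGORY_DUNK": [None, 1],
})

# Same cleaning as the cell above: empty cells become 0, flags become integers.
flag_cols = mapping.columns[1:]
mapping[flag_cols] = mapping[flag_cols].fillna(0).astype(int)

# A left merge with indicator=True marks the rows that found no mapping;
# those are the rows an inner merge would drop.
checked = shots.merge(mapping, on="Action Type", how="left", indicator=True)
unmapped = checked.loc[checked["_merge"] == "left_only", "Action Type"].unique().tolist()
print(unmapped)
```

If the list is not empty, the mapping file can be completed before running the inner merge.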

## df shot_location

- Convert the game date to datetime
- Create a column for the seconds remaining in the period

In [None]:
###### clean shot_location info from shot_location.csv ######

# Type columns
shot_location['Game Date'] = pd.to_datetime(shot_location['Game Date'], format='%Y%m%d')

# Create columns
shot_location['GAME_YEAR'] = shot_location['Game Date'].dt.year
shot_location['GAME_YEAR'] = shot_location['GAME_YEAR'].astype(int)
shot_location['GAME_PERIODE_SECOND_REMAINGING'] = shot_location['Minutes Remaining'] * 60 + shot_location['Seconds Remaining']
shot_location = shot_location.drop(['Minutes Remaining', 'Seconds Remaining'], axis=1)
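
As a quick illustration of the two transformations above, here is a minimal sketch on toy rows (the values are made up). The range check assumes regulation periods of 12 minutes, so any value above 720 seconds would point to a data problem.

```python
import pandas as pd

# Toy rows standing in for shot_location (values are made up).
shots = pd.DataFrame({
    "Game Date": [19971031, 20190410],
    "Minutes Remaining": [11, 0],
    "Seconds Remaining": [59, 3],
})

# Parse YYYYMMDD values; casting to str first keeps the format match explicit.
shots["Game Date"] = pd.to_datetime(shots["Game Date"].astype(str), format="%Y%m%d")
shots["GAME_YEAR"] = shots["Game Date"].dt.year

# Remaining time in the period, in seconds.
shots["GAME_PERIODE_SECOND_REMAINGING"] = (
    shots["Minutes Remaining"] * 60 + shots["Seconds Remaining"]
)

# Sanity check (assumption: a period never exceeds 12 minutes = 720 seconds).
out_of_range = shots[~shots["GAME_PERIODE_SECOND_REMAINGING"].between(0, 720)]
print(shots[["GAME_YEAR", "GAME_PERIODE_SECOND_REMAINGING"]])
print(len(out_of_range))
```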


## df players

- Retrieve player information (height, weight, birth year)
- Merge with shot_location and cast the new columns

In [None]:
###### Clean player info from players.csv ######
players = players[['Player', 'height', 'weight','born']]

# Merge shot location & players (add players characteristics from players.csv)
nba_data = pd.merge(shot_location, players, left_on='Player Name', right_on='Player', how='inner')

# Type columns
nba_data['height'] = nba_data['height'].astype(int)
nba_data['weight'] = nba_data['weight'].astype(int)
nba_data['born'] = pd.to_datetime(nba_data['born'], format='%Y').dt.year
nba_data['Game Date'] = pd.to_datetime(nba_data['Game Date'], format='%Y%m%d')

nba_data.head()
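
`astype(int)` raises an error as soon as a column contains a missing value. The sketch below, on made-up players, shows one possible way to list incomplete rows before casting. Dropping them is an assumption made for the example, not necessarily the right choice for the real data.

```python
import pandas as pd

# Toy stand-in for the players dataframe (names and values are made up).
players = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "height": [198.0, None, 211.0],
    "weight": [98.0, 110.0, 120.0],
    "born": [1980.0, 1975.0, None],
})
numeric_cols = ["height", "weight", "born"]

# Rows with at least one missing value would make astype(int) fail.
incomplete = players[players[numeric_cols].isna().any(axis=1)]
print(incomplete["Player"].tolist())

# Assumption for this sketch: incomplete rows are simply dropped before casting.
complete = players.dropna(subset=numeric_cols).astype({col: int for col in numeric_cols})
print(complete.dtypes)
```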

## df seasons_stats

- Retrieve player information (position per season)
- Harmonize column names


In [None]:
###### Load & clean player info from seasons_stats.csv ######
player_info = seasons_stats[['Player', 'Pos', 'Year']]
player_info = player_info.dropna()
player_info['Year'] = player_info['Year'].astype(int)
player_info = player_info.rename(columns={'Year': 'GAME_YEAR', 'Player': 'Player Name'})

# Merge nba_data & player_info(add players info)
nba_data = pd.merge(nba_data, player_info, on=['Player Name', 'GAME_YEAR'], how='inner')
nba_data.drop("Player", axis=1, inplace=True)

nba_data.head()
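
A merge on `['Player Name', 'GAME_YEAR']` multiplies shot rows whenever the season table holds several rows for the same player and year. Whether that happens in `Seasons_Stats` is an assumption here. The sketch below uses toy data to show the effect and one possible guard: deduplicate the keys first, then let `validate` confirm the many-to-one relationship.

```python
import pandas as pd

# Toy data (made up): "P1" appears three times for the same season.
stats = pd.DataFrame({
    "Player Name": ["P1", "P1", "P1", "P2"],
    "GAME_YEAR": [2010, 2010, 2010, 2010],
    "Pos": ["SG", "SG", "SG", "C"],
})
shots = pd.DataFrame({
    "Player Name": ["P1", "P2"],
    "GAME_YEAR": [2010, 2010],
    "Shot Made Flag": [1, 0],
})
keys = ["Player Name", "GAME_YEAR"]

# Naive merge: the single P1 shot is repeated once per matching season row.
naive = shots.merge(stats, on=keys, how="inner")
print(len(naive))

# Guarded merge: keep one row per key (here the first one, an arbitrary choice),
# and let pandas raise if the right side still has duplicate keys.
deduped = stats.drop_duplicates(subset=keys)
safe = shots.merge(deduped, on=keys, how="inner", validate="many_to_one")
print(len(safe))
```

In the notebook the later `drop_duplicates` removes exact copies, but it would not catch a player listed with two different positions in the same year.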


## df players_data

- Retrieve player information (first season played)
- Create the PLAYER_EXP and PLAYER_GAME_AGE columns
- Merge the dataframes
- Reorder the columns
- Harmonize column names


In [None]:
###### Load & clean player data from players_data.csv ######

players_data = players_data[['name', 'year_start']]
players_data = players_data.rename(columns={'name': 'Player Name'})

# Merge nba_data & players_data(add players year_start)
nba_data = pd.merge(nba_data, players_data, on='Player Name', how='inner')

# Create column PLAYER_EXP & drop rows where PLAYER_EXP < 0
nba_data["PLAYER_EXP"] = nba_data["GAME_YEAR"] - nba_data["year_start"]
# .copy() avoids a SettingWithCopyWarning on the assignment below
nba_data = nba_data[nba_data["PLAYER_EXP"] >= 0].copy()

# Create column PLAYER_GAME_AGE
nba_data["PLAYER_GAME_AGE"] = nba_data["GAME_YEAR"] - nba_data["born"]

nba_data.head()
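
The two derived columns can be checked by hand on a toy example (made-up values). The second row has a game year earlier than `year_start`, which is the kind of row the `PLAYER_EXP >= 0` filter removes.

```python
import pandas as pd

# Toy rows (made up): the second one has a game year before the player's first season.
df = pd.DataFrame({
    "Player Name": ["P1", "P2"],
    "GAME_YEAR": [2005, 1999],
    "year_start": [2004, 2001],
    "born": [1984, 1975],
})

# Experience in years at the time of the game; negative values are dropped.
df["PLAYER_EXP"] = df["GAME_YEAR"] - df["year_start"]
df = df[df["PLAYER_EXP"] >= 0].copy()

# Age at the time of the game, from years only (no month or day precision).
df["PLAYER_GAME_AGE"] = df["GAME_YEAR"] - df["born"]
print(df)
```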


In [None]:
###### columns transform ######
# Order columns
new_order = [
    'Player ID',
    'Player Name',
    'Pos',
    'year_start',
    'PLAYER_GAME_AGE',
    'height',
    'weight',
    'born',
    'PLAYER_EXP',
    'Game ID',
    'Game Event ID',
    'GAME_YEAR',
    'Game Date',
    'Season Type',
    'Team ID',
    'Team Name',
    'Home Team',
    'Away Team',
    'Period',
    'GAME_PERIODE_SECOND_REMAINGING',
    'SHOT_ACTION_CATEGORY_SHOOT',
    'SHOT_ACTION_CATEGORY_DUNK',
    'SHOT_ACTION_CATEGORY_LAYUP',
    'SHOT_ACTION_CATEGORY_OTHER',
    'Shot Type',
    'Shot Zone Basic',
    'Shot Zone Area',
    'Shot Zone Range',
    'Shot Distance',
    'X Location',
    'Y Location',
    'Shot Made Flag'
]
nba_data = nba_data[new_order]

# Rename columns
column_mapping = {
    'Player ID': 'PLAYER_ID',
    'Player Name': 'PLAYER_NAME',
    'Pos': 'PLAYER_POS',
    'year_start': 'PLAYER_YEAR_START',
    'PLAYER_GAME_AGE': 'PLAYER_GAME_AGE',
    'height': 'PLAYER_HEIGHT',
    'weight': 'PLAYER_WEIGHT',
    'born': 'PLAYER_BORN_YEAR',
    'PLAYER_EXP': 'PLAYER_EXP',
    'Game ID': 'GAME_ID',
    'Game Event ID': 'GAME_EVENT_ID',
    'GAME_YEAR': 'GAME_YEAR',
    'Game Date': 'GAME_DATE',
    'Season Type': 'GAME_SEASON_TYPE',
    'Team ID': 'GAME_TEAM_ID',
    'Team Name': 'GAME_TEAM_NAME',
    'Home Team': 'GAME_HOME_TEAM',
    'Away Team': 'GAME_AWAY_TEAM',
    'Period': 'GAME_PERIOD',
    'GAME_PERIODE_SECOND_REMAINGING': 'GAME_PERIODE_SECOND_REMAINGING',
    'SHOT_ACTION_CATEGORY_SHOOT': 'SHOT_ACTION_CATEGORY_SHOOT',
    'SHOT_ACTION_CATEGORY_DUNK': 'SHOT_ACTION_CATEGORY_DUNK',
    'SHOT_ACTION_CATEGORY_LAYUP': 'SHOT_ACTION_CATEGORY_LAYUP',
    'SHOT_ACTION_CATEGORY_OTHER': 'SHOT_ACTION_CATEGORY_OTHER',
    'Shot Type': 'SHOT_TYPE',
    'Shot Zone Basic': 'SHOT_ZONE_BASIC',
    'Shot Zone Area': 'SHOT_ZONE_AREA',
    'Shot Zone Range': 'SHOT_ZONE_RANGE',
    'Shot Distance': 'SHOT_DISTANCE',
    'X Location': 'SHOT_X_LOCATION',
    'Y Location': 'SHOT_Y_LOCATION',
    'Shot Made Flag': 'SHOT_MADE_FLAG'
}

nba_data.rename(columns=column_mapping, inplace=True)

# Drop na & duplicate
nba_data.dropna(inplace=True)
nba_data.drop_duplicates(inplace=True)

nba_data.head()
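
Selecting with `nba_data[new_order]` raises a `KeyError` if a listed column is missing, and silently discards any column that is not listed. A minimal sketch of a check that could run before the selection, on a toy dataframe with deliberately mismatched columns:

```python
import pandas as pd

# Toy dataframe and shortened lists (made up for the example).
df = pd.DataFrame({"Player ID": [1], "Player Name": ["Player A"], "Shot Distance": [12]})
new_order = ["Player Name", "Player ID", "Shot Type"]  # "Shot Type" is absent on purpose
column_mapping = {
    "Player ID": "PLAYER_ID",
    "Player Name": "PLAYER_NAME",
    "Shot Type": "SHOT_TYPE",
}

# Columns requested but absent (would raise KeyError) and columns that would be dropped.
missing = [col for col in new_order if col not in df.columns]
dropped = [col for col in df.columns if col not in new_order]
print(missing, dropped)

# Reorder with the columns that do exist, then rename.
available = [col for col in new_order if col in df.columns]
renamed = df[available].rename(columns=column_mapping)
print(renamed.columns.tolist())
```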

## Create best_player dataframe

- Retrieve the list of the 20 best players from the TopPlayers.csv file
- Clean up player positions (keep single positions only)

In [None]:
top_players = pd.read_csv("../data/config/TopPlayers.csv")

# Verify that every top player is in nba_data
players_in_data = nba_data["PLAYER_NAME"].unique()
players_not_in_data = [player for player in top_players['PlayerName'] if player not in players_in_data]
if not players_not_in_data:
    print("All players are in the dataframe.")
else:
    print("The following players are not in the dataframe:")
    for player in players_not_in_data:
        print(player)

# Create dataframe
best_players = nba_data[nba_data["PLAYER_NAME"].isin(top_players['PlayerName'])]

# Keep single position
single_pos = ['SG', 'SF', 'PG', 'PF', 'C']
best_players = best_players[best_players['PLAYER_POS'].isin(single_pos)]

# Export best_player.csv

best_players.to_csv("../data/preprocessed/best_player.csv", index=False)
display(best_players.head())

print(best_players.shape[0])
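
The filtering logic of this cell can be sketched on toy data (made-up names and positions). The example also shows what the single-position filter removes: rows with a combined position such as `SG-SF`, a value assumed here for illustration.

```python
import pandas as pd

# Toy stand-ins for TopPlayers.csv and nba_data (values are made up).
top = pd.DataFrame({"PlayerName": ["P1", "P3"]})
data = pd.DataFrame({
    "PLAYER_NAME": ["P1", "P1", "P2"],
    "PLAYER_POS": ["SG", "SG-SF", "C"],
})

# Top players absent from the data, via a set difference.
missing = sorted(set(top["PlayerName"]) - set(data["PLAYER_NAME"]))
print(missing)

# Keep the shots of top players, then only rows with a single position.
single_pos = ["SG", "SF", "PG", "PF", "C"]
best = data[data["PLAYER_NAME"].isin(top["PlayerName"])]
best = best[best["PLAYER_POS"].isin(single_pos)]
print(best)
```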


In [None]:
best_players.info()