# Dataset Exploration

Datasets are categorized as: Informational, Player, or Team. 
Each exploration will include the following:
- Shape
- Data Types
- Missing Data
- Duplicate Rows
- Descriptive Statistics
- Datetimes
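
The checks listed above can be sketched with plain pandas. This is a minimal illustration, not the `InitialExploration` class used in this notebook; `basic_exploration` is a hypothetical helper, and the sample rows are copied from the Player Season Info preview further down.

```python
import pandas as pd


def basic_exploration(df: pd.DataFrame) -> dict:
    """Collect the basic checks for one DataFrame (hypothetical helper)."""
    return {
        "shape": df.shape,  # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),  # column -> dtype name
        "missing_pct": (df.isna().mean() * 100).round(2).to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "describe": df.describe(include="all"),
        "datetime_columns": df.select_dtypes(include="datetime").columns.tolist(),
    }


# Three rows taken from the Player Season Info preview
sample = pd.DataFrame({
    "season": [1947, 1947, 1947],
    "player": ["Al Brightman", "Al Lujack", "Ariel Maughan"],
    "birth_year": [None, None, 1923.0],
})

report = basic_exploration(sample)
print(report["shape"])
print(report["missing_pct"])
print(report["duplicate_rows"])
```

Returning a dictionary rather than printing keeps each check available for later cells, for example to compare missing-data percentages across tables.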

*Column notes include insights from simple explorations that may be omitted from the final code.*

In [4]:
import pandas as pd
import numpy as np
import sys
import os
from pathlib import Path
# Make the project's scripts directory importable from this notebook
sys_path = os.path.abspath("../scripts")
if sys_path not in sys.path:
    sys.path.append(sys_path)

from exploration_functions import InitialExploration

## Informational Tables
- Player Season Info
- Player Career Info
- Team Abbrev

### Player Season Info

In [27]:
# Forward slashes resolve on Windows, macOS, and Linux
psi_file_path = Path("../data/raw/Player Season Info.csv")
psi = InitialExploration(psi_file_path)
psi.show_report()




Shape

There are 10 columns and 32504 rows.
------------------------------------------------------

First Five Rows:
------------------------------------------------------
   season  seas_id  player_id         player  birth_year  pos   age   lg   tm  \
0    1947        1          1   Al Brightman         NaN    F  23.0  BAA  BOS   
1    1947        2          2      Al Lujack         NaN    F  26.0  BAA  WSC   
2    1947        3          3    Al Negratti         NaN  F-C  25.0  BAA  WSC   
3    1947        4          4    Angelo Musi         NaN    G  28.0  BAA  PHW   
4    1947        5          5  Ariel Maughan      1923.0    F  23.0  BAA  DTF   

   experience  
0           1  
1           1  
2           1  
3           1  
4           1  
------------------------------------------------------

Data Types

season:     int64
seas_id:    int64
player_id:  int64
player:     object
birth_year: float64
pos:        object
age:        float64
lg:         object
tm:         object
exper

*Columns and Notes*

- season
    - year of the season
- seas_id
    - primary key
    - unique id for one player in one season on one team
    - e.g. seas_id 17000 is Kobe Bryant's 2001 season
- player_id
    - id unique to the player
- player
    - player's name
- birth_year
    - intended to hold the player's birth year
    - Mostly null values
    - will most likely drop this column in the cleaning phase
- pos
    - What position the player played in a given season
    - Can change over time
    - Most values are single positions, but combined values (e.g. F-C) may need cleaning
- age
    - age of player at the end of a given season
    - 22 Missing Values
    - Occurs in years 1947, 1968, 1969, 1970, 1971, 1973
    - Won't likely be a big issue but just something to remember
- lg
    - League played in
- tm
    - team played on
    - can have multiple for a season
    - different teams in the same season will have different seas_ids
- experience
    - number of years in the league at end of the season
    - rookie year is 1

### Player Career Info

In [33]:
# Forward slashes resolve on Windows, macOS, and Linux
pci_file_path = Path("../data/raw/Player Career Info.csv")
pci = InitialExploration(pci_file_path)
pci.show_report()



Shape

There are 7 columns and 5328 rows.
------------------------------------------------------

First Five Rows:
------------------------------------------------------
   player_id         player  birth_year    hof  num_seasons  first_seas  \
0          1   Al Brightman         NaN  False          1.0      1947.0   
1          2      Al Lujack         NaN  False          1.0      1947.0   
2          3    Al Negratti         NaN  False          1.0      1947.0   
3          4    Angelo Musi         NaN  False          3.0      1947.0   
4          5  Ariel Maughan      1923.0  False          5.0      1947.0   

   last_seas  
0     1947.0  
1     1947.0  
2     1947.0  
3     1949.0  
4     1951.0  
------------------------------------------------------

Data Types

player_id:   int64
player:      object
birth_year:  float64
hof:         bool
num_seasons: float64
first_seas:  float64
last_seas:   float64
------------------------------------------------------

Missing Data

Rows wit

*Columns and Notes*
- player_id
    - primary key
- player
    - Player name
- birth_year
    - Mostly null
    - Most likely going to be dropped
    - Doesn't add much here beyond giving the age at retirement
        - This can be found in Player Season Info if really needed
- hof
    - whether the player has been inducted into the Hall of Fame
    - accurate through 10/01/2024
- num_seasons
    - Number of seasons played
    - Needs missing values handled and conversion to int
- first_seas
    - Year of rookie season
    - Needs missing values handled and conversion to int
- last_seas
    - Year of retirement
    - Needs missing values handled and conversion to int
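
One possible way to handle the float-to-int conversion is pandas' nullable `Int64` dtype, which stores integers while keeping missing values as `<NA>` instead of forcing a fill value. This is a sketch of one option, not the cleaning step actually chosen; the rows come from the preview above, and the missing value is injected only to show the behavior.

```python
import pandas as pd

# First five rows of the preview above (subset of columns)
pci_sample = pd.DataFrame({
    "player_id": [1, 2, 3, 4, 5],
    "num_seasons": [1.0, 1.0, 1.0, 3.0, 5.0],
    "first_seas": [1947.0, 1947.0, 1947.0, 1947.0, 1947.0],
    "last_seas": [1947.0, 1947.0, 1947.0, 1949.0, 1951.0],
})

# Hypothetical gap, added only to demonstrate missing-value handling
pci_sample.loc[1, "num_seasons"] = float("nan")

# Columns that hold whole numbers but were read as float64
int_cols = ["num_seasons", "first_seas", "last_seas"]

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>
converted = pci_sample.astype({col: "Int64" for col in int_cols})

print(converted.dtypes)
print(converted)
```

If a plain `int64` column is required downstream, the missing rows would have to be dropped or filled first, since `int64` cannot represent missing values.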

### Team Abbrev

In [40]:
# Forward slashes resolve on Windows, macOS, and Linux
team_abv_file_path = Path("../data/raw/Team Abbrev.csv")
team_abv = InitialExploration(team_abv_file_path)
team_abv.show_report()



Shape

There are 5 columns and 1871 rows.
------------------------------------------------------

First Five Rows:
------------------------------------------------------
   season   lg               team  playoffs abbreviation
0    2025  NBA      Atlanta Hawks     False          ATL
1    2025  NBA     Boston Celtics     False          BOS
2    2025  NBA      Brooklyn Nets     False          BRK
3    2025  NBA  Charlotte Hornets     False          CHO
4    2025  NBA      Chicago Bulls     False          CHI
------------------------------------------------------

Data Types

season:       int64
lg:           object
team:         object
playoffs:     bool
abbreviation: object
------------------------------------------------------

Missing Data

Rows without missing data:
------------------------------------------------------
season:        100.00%
lg:            100.00%
team:          100.00%
playoffs:      100.00%


Rows with some missing data:
-----------------------------------------

*Columns and Notes*

<u>There is no primary key in this table</u>

- season 
    - Season Year
- lg 
    - League
- team 
    - Full Team Name
- playoffs
    - If the team made the playoffs
- abbreviation
    - Team abbreviation
    - Foreign key in other informational tables
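
Since the table has no single-column primary key, a combination of columns may still identify each row. The sketch below tests one candidate; the choice of `season`, `lg`, and `abbreviation` is an assumption for illustration, and the five preview rows cannot confirm that it holds for the full table.

```python
import pandas as pd

# First five rows of the preview above
team_sample = pd.DataFrame({
    "season": [2025, 2025, 2025, 2025, 2025],
    "lg": ["NBA", "NBA", "NBA", "NBA", "NBA"],
    "team": [
        "Atlanta Hawks",
        "Boston Celtics",
        "Brooklyn Nets",
        "Charlotte Hornets",
        "Chicago Bulls",
    ],
    "playoffs": [False, False, False, False, False],
    "abbreviation": ["ATL", "BOS", "BRK", "CHO", "CHI"],
})

# Candidate composite key (an assumption, not confirmed by the source)
candidate_key = ["season", "lg", "abbreviation"]

# keep=False flags every row that shares its key with another row
key_collisions = team_sample[team_sample.duplicated(subset=candidate_key, keep=False)]
n_collisions = len(key_collisions)

print(f"Rows violating the candidate key: {n_collisions}")
```

Any rows in `key_collisions` on the full table would show which seasons or abbreviations prevent the combination from acting as a key.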