# Interview Day Approach
Your Process:

1. **Clarify requirements (2 min)** - "Should I filter for minimum games played?"
2. **Explore data briefly (2 min)** - `df.shape`, `df.columns`, `df.head()`
3. **Code solution (10 min)** - Think out loud, explain basketball context
4. **Test & explain (5 min)** - Run code, interpret results

Differentiating Factors:

- **Basketball intuition** - "I'll filter for players with meaningful minutes"
- **Domain knowledge** - "True shooting is better than FG% because it includes free throws"
- **Data quality awareness** - "Let me check for division by zero in percentages"
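
The true-shooting and divide-by-zero points above can be combined in a few lines. This is a minimal sketch, assuming box-score columns named `points`, `fieldGoalsAttempted`, and `freeThrowsAttempted` (the names used in the schema later in these notes) and the commonly quoted 0.44 free-throw weight; `add_true_shooting` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def add_true_shooting(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with `ts_pct` = PTS / (2 * (FGA + 0.44 * FTA))."""
    out = df.copy()
    # True shooting attempts; 0.44 is the commonly used free-throw weight.
    tsa = out["fieldGoalsAttempted"] + 0.44 * out["freeThrowsAttempted"]
    # Replace zero attempts with NaN so the division yields NaN instead of inf.
    out["ts_pct"] = out["points"] / (2 * tsa.replace(0, np.nan))
    return out


# Tiny made-up frame: one normal row, one row with no attempts at all.
demo = pd.DataFrame(
    {
        "points": [20, 0],
        "fieldGoalsAttempted": [10, 0],
        "freeThrowsAttempted": [0, 0],
    }
)
result = add_true_shooting(demo)
```

Leaving the zero-attempt row as NaN (rather than 0) keeps it out of averages and rankings, which is usually what you want to say out loud in the interview.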

You're extremely well-prepared. The combination of your comprehensive basketball datasets, statistical knowledge, and programming skills puts you ahead of most candidates. Focus on staying calm, thinking out loud, and demonstrating your basketball understanding - that's what will set you apart from generic data science candidates.
The fact that they said "easy" suggests they're testing fundamentals and cultural fit rather than trying to stump you. Trust your preparation and let your basketball passion show through your technical solutions.




----------------------------

## Most Likely Question Types (Given 15-20 min + "Easy")

High Probability (80%+):

1. Basic ranking/filtering - "Find top N players by X metric"
2. Simple aggregation - "Which team has the highest average Y?"
3. Basketball calculation - "Calculate shooting percentage" or "Find players with double-doubles"

Medium Probability (50%+):

4. Data cleaning - Handle missing values or obvious errors
5. Correlation/relationship - "What's the relationship between minutes and points?"

Lower Probability (20%+):

6. Advanced metrics - Calculate TS% or other efficiency measures
7. Statistical testing - Compare groups or find outliers
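
The "double-doubles" calculation from the list above is a good one to rehearse, because it is easy to describe and easy to get slightly wrong. A rough sketch, assuming one row per player-game and the column names `personId`, `points`, `reboundsTotal`, `assists`, `steals`, and `blocks`; `count_double_doubles` and the sample rows are made up for illustration.

```python
import pandas as pd

# Stat categories that count toward a double-double (assumed column names).
DD_STATS = ["points", "reboundsTotal", "assists", "steals", "blocks"]


def count_double_doubles(games: pd.DataFrame) -> pd.Series:
    """Count games per player with 10+ in at least two of DD_STATS."""
    # Boolean frame: True where a stat reached double digits in that game.
    double_digit = games[DD_STATS] >= 10
    is_double_double = double_digit.sum(axis=1) >= 2
    # Summing booleans per player gives the number of qualifying games.
    return (
        is_double_double.groupby(games["personId"])
        .sum()
        .sort_values(ascending=False)
    )


# Made-up game log: player "a" has two double-doubles, player "b" has none.
demo_games = pd.DataFrame(
    {
        "personId": ["a", "a", "b"],
        "points": [22, 8, 30],
        "reboundsTotal": [11, 12, 4],
        "assists": [3, 10, 5],
        "steals": [1, 0, 2],
        "blocks": [0, 1, 0],
    }
)
dd_counts = count_double_doubles(demo_games)
```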



## Interview Best Practices

### 1. **Start with Data Exploration**
```python
# Always begin with these
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.head())
print(df.info())
```

### 2. **Handle Edge Cases**
```python
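import numpy as np

# Check for missing values
print(df.isnull().sum())

# Handle division by zero: zero attempts become NaN, so the ratio is NaN, not inf
# (PTS / FGA is points per field-goal attempt)
df['efficiency'] = df['PTS'] / df['FGA'].replace(0, np.nan)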

# Filter for meaningful sample sizes
qualified_players = df[df['games'] >= 10]  # or minutes >= 500
```

### 3. **Use Clear Variable Names**
```python
# Good
top_scorers = df.nlargest(5, 'points_per_game')

# Not ideal
ts = df.nlargest(5, 'ppg')
```

### 4. **Show Your Basketball Knowledge**
```python
# Demonstrate understanding of basketball concepts
# True Shooting %, PER, usage rate, pace-adjusted stats
```

### 5. **Think Out Loud**
- Explain your approach before coding
- Mention assumptions you're making
- Discuss potential improvements or extensions

### 6. **Common Functions to Know**
```python
# Data manipulation
df.groupby().agg()
df.merge()
df.pivot_table()
df.query()
df.nlargest() / df.nsmallest()

# Statistical
df.corr()
df.describe()
pd.cut() / pd.qcut()
```
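
To make the cheat sheet above concrete, here is a short sketch that exercises `groupby().agg()`, `pd.cut()`, `query()`, and `nlargest()` on a tiny made-up table. The column names (`team`, `minutes_per_game`, `points_per_game`), the tier edges, and the tier labels are all assumptions for illustration.

```python
import pandas as pd

players = pd.DataFrame(
    {
        "team": ["MIA", "MIA", "BOS", "BOS"],
        "minutes_per_game": [34.0, 18.0, 31.0, 12.0],
        "points_per_game": [24.0, 7.0, 20.0, 5.0],
    }
)

# groupby + named aggregation: one row per team.
team_summary = players.groupby("team").agg(
    avg_points=("points_per_game", "mean"),
    max_points=("points_per_game", "max"),
)

# pd.cut: bin a continuous column into labeled tiers (right edge inclusive).
players["minutes_tier"] = pd.cut(
    players["minutes_per_game"],
    bins=[0, 20, 30, 48],
    labels=["bench", "rotation", "starter"],
)

# query + nlargest: filter first, then rank.
top_scorer = players.query("minutes_per_game >= 15").nlargest(1, "points_per_game")
```

Named aggregation keeps the output column names readable, which ties in with the "clear variable names" advice earlier.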

---

## Likely Question Formats

1. **"Find the top N players by [metric]"** - Basic ranking
2. **"What's the relationship between X and Y?"** - Correlation analysis  
3. **"Which team/player is the most/least [characteristic]?"** - Aggregation
4. **"Calculate [basketball metric] for these players"** - Domain knowledge
5. **"Clean this data issue"** - Data manipulation
6. **"Are there any interesting patterns in this data?"** - Exploratory analysis
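
For the "relationship between X and Y" format, it helps to have one small function ready. The sketch below is one possible approach, not the only one: it drops rows where either value is missing and reports both Pearson (linear) and Spearman (rank) correlation. `relationship` and the sample columns are hypothetical names.

```python
import pandas as pd


def relationship(df: pd.DataFrame, x: str, y: str) -> dict:
    """Pearson and Spearman correlation between two columns, ignoring missing rows."""
    pair = df[[x, y]].dropna()
    return {
        "n": len(pair),
        "pearson": pair.corr(method="pearson").loc[x, y],
        "spearman": pair.corr(method="spearman").loc[x, y],
    }


# Made-up data: perfectly linear once the row with a missing value is dropped.
demo = pd.DataFrame(
    {
        "minutes": [10.0, 20.0, 30.0, None],
        "points": [5.0, 10.0, 15.0, 12.0],
    }
)
stats = relationship(demo, "minutes", "points")
```

Reporting `n` alongside the coefficients makes it easy to mention sample size when interpreting the result.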

---

## Final Tips

- **Stay calm** - 15-20 minutes is generous for most questions
- **Ask clarifying questions** if requirements are unclear
- **Test your code** with small examples
- **Handle errors gracefully** (try/except if needed)
- **Show basketball intuition** - they're hiring for basketball analytics
- **Be ready to explain** your approach and any trade-offs
- **Practice with your existing datasets** - you have everything you need!




----------------------------


Test Yourself Now:

1. Load your `player_season_master` dataset
2. Practice these 5 questions in under 4 minutes each:
   - Top 10 scorers with 40+ games
   - Teams ranked by average rebounds per game
   - Players shooting >45% FG with >15 PPG
   - Correlation between minutes and assists
   - Handle any missing values in shooting percentages
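
Two of the practice questions above are sketched below as a reference to check yourself against. The sketch assumes game-level rows with `personId`, `gameId`, `points`, `fieldGoalsMade`, `fieldGoalsAttempted`, and `fieldGoalsPercentage` columns; if `player_season_master` is already one row per player-season, the `groupby` step is unnecessary. The function names and sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def top_scorers(games: pd.DataFrame, min_games: int = 40, n: int = 10) -> pd.DataFrame:
    """Top-n players by points per game among those with at least min_games."""
    per_player = games.groupby("personId").agg(
        games_played=("gameId", "nunique"),
        points_per_game=("points", "mean"),
    )
    qualified = per_player[per_player["games_played"] >= min_games]
    return qualified.nlargest(n, "points_per_game")


def fill_fg_pct(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing FG% from makes/attempts; rows with zero attempts stay NaN."""
    out = df.copy()
    recomputed = out["fieldGoalsMade"] / out["fieldGoalsAttempted"].replace(0, np.nan)
    out["fieldGoalsPercentage"] = out["fieldGoalsPercentage"].fillna(recomputed)
    return out


# Made-up game log; thresholds are lowered so the tiny sample qualifies.
demo_games = pd.DataFrame(
    {
        "personId": ["a", "a", "b"],
        "gameId": ["g1", "g2", "g1"],
        "points": [30, 20, 40],
    }
)
top = top_scorers(demo_games, min_games=2, n=1)

demo_shooting = pd.DataFrame(
    {
        "fieldGoalsMade": [5, 0, 4],
        "fieldGoalsAttempted": [10, 0, 10],
        "fieldGoalsPercentage": [np.nan, np.nan, 0.4],
    }
)
filled = fill_fg_pct(demo_shooting)
```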


# Set Up the Problems

In [3]:
%%writefile notebooks/5080_gpu/interview_prep/data/schema.yaml
# schema.yaml
# for data preprocessing and joining:
strings: 
# player data=======
- lastName
- personId
- gameId
- gameDate
- playerteamCity
- playerteamName
- opponentteamCity
- opponentteamName
- gameType
- gameLabel
- gameSubLabel
- seriesGameNumber

# team data=======
- gameId
- gameDate
- teamCity
- teamName
- teamId
- opponentTeamCity
- opponentTeamName
- opponentTeamId
- coachId

numerics: 
# player data=======
# (text/identifier columns such as lastName, gameId and gameDate are listed
#  under `strings` only; pd.to_numeric would raise on them)
- win
- home
- numMinutes
- points
- assists
- blocks
- steals
- fieldGoalsAttempted
- fieldGoalsMade
- fieldGoalsPercentage
- threePointersAttempted
- threePointersMade
- threePointersPercentage
- freeThrowsAttempted
- freeThrowsMade
- freeThrowsPercentage
- reboundsDefensive
- reboundsOffensive
- reboundsTotal
- foulsPersonal
- turnovers
- plusMinusPoints
- season
- minutes_total

# team data=======
- home
- win
- teamScore
- opponentScore
- assists
- blocks
- steals
- fieldGoalsAttempted
- fieldGoalsMade
- fieldGoalsPercentage
- threePointersAttempted
- threePointersMade
- threePointersPercentage
- freeThrowsAttempted
- freeThrowsMade
- freeThrowsPercentage
- reboundsDefensive
- reboundsOffensive
- reboundsTotal
- foulsPersonal
- turnovers
- plusMinusPoints
- numMinutes
- q1Points
- q2Points
- q3Points
- q4Points
- benchPoints
- biggestLead
- biggestScoringRun
- leadChanges
- pointsFastBreak
- pointsFromTurnovers
- pointsInThePaint
- pointsSecondChance
- timesTied
- timeoutsRemaining
- seasonWins
- seasonLosses
- season


#---------------------------
# NBA Player Season Data Schema


# Target variable for analysis
y_variable: 

# Ordinal variables (ordered categories)
ordinal: 


# Nominal variables (unordered categories) 
nominal: 


# Numerical variables (continuous/discrete numbers)
numerical: 


# ID columns (identifiers, not used in modeling)
id_cols: 



Overwriting notebooks/5080_gpu/interview_prep/data/schema.yaml


In [None]:
#%%writefile notebooks/5080_gpu/interview_prep/main.py
"""
NBA Analysis Pipeline - Optimized for Live Coding Assessments
============================================================


# Detailed Questions for Setup

Start by:
Work through the datasets to compute VORP, EWA, PER, and PIE in basic steps, so that each one can be reproduced by hand in simple code. Then answer the questions below in order, building small, easy-to-write functions for each.

Questions:
“Given a CSV dataset, how would you explore and summarize it?”

“Given a DataFrame, how would you handle missing values?”

“How would you detect and address outliers in a dataset?”

“Perform univariate, bivariate, and multivariate analysis on given columns.”

“Given a dataset, how would you normalize or standardize its features?”

“Write a function to compute summary statistics (mean, median, std, etc.) of a column.”

“Given a dataset, how would you identify the type of each variable and choose feature encoding?”

“Given a dataset and a target variable, how would you check for relationships or correlations?”

“Write code to detect missingness patterns and decide how to impute.”

“Describe the full data-analysis pipeline: from loading to insight delivery—then code accordingly.”





interview questions:
1) "Who are the top 5 players by points per game this season?"
2) "Find all players who average more than 25 PPG and 10 RPG"
3) "What's the correlation between minutes played and points scored?"

4) "Calculate shooting efficiency - players with best True Shooting %"
5) "Find the most 'complete' players - top 10 in points + assists + rebounds"
6) "Which team has the most balanced scoring attack?"

7) "Clean this dataset - handle missing values in FG%"
8) "Group players by minutes played tiers and show average stats"

9) "Is there a significant difference in scoring between guards and forwards?"
10) "Find players who are outliers in efficiency"

11) "Who improved the most from last season to this season?"
12) "Calculate Player Impact Estimate (PIE) for top 10 players"


--
13) Who are the top 3 scorers **per team** by **points per-36** (filter to minutes ≥ 15)?

14) Build a simple **composite impact score** per player using standardized per-36 stats:
   `impact = z(points/36) + 0.7*z(assists/36) + 0.7*z(rebounds/36)`. Who are the top 10?

15) Do players **outperform their expected points** for their minutes/assists/rebounds?
   Fit a quick linear model `points ~ minutes + assists + rebounds` and list top 10 **positive residuals**.

16) Which team is **most balanced vs. star-heavy** by scoring?
   Compute **coefficient of variation** (std/mean) of `points` per team and rank.

17) Bucket players into **minutes tiers**: `[10–19, 20–29, 30–40]`.
   What are the mean/median of `points/assists/rebounds` per tier?

18) “Three-above-median” players: per team, who is **above the team median** in **points, assists, and rebounds** simultaneously?

19) Write a reusable helper `top_k(df, by, k, group=None)` and use it to return the **top 2 rebounders per team**.

20) What’s the **team effect** on scoring after controlling for other stats?
   One-hot encode `team`, fit a Ridge `points ~ minutes + assists + rebounds + team_*`, and show team coefficients.

21) Give a quick **bootstrap 95% CI** for **mean points per team** (1,000 resamples). Which teams have clearly higher means?

22) Detect potential **duplicate identity issues**: do we have any duplicate `(player_id, name)` rows? If so, keep the one with the **max minutes**.



"""

import pandas as pd
import numpy as np
from pathlib import Path
from omegaconf import OmegaConf
from typing import List
from pydantic import BaseModel


class ColumnSchema(BaseModel):
    y_variable: str
    ordinal: List[str]
    nominal: List[str]
    numerical: List[str]
    id_cols: List[str]
    strings: List[str]
    numerics: List[str]
    
    
def load_schema(yaml_path, debug=False):
    cfg = OmegaConf.load(yaml_path)
    cfg_dict = OmegaConf.to_container(cfg, resolve=True)
    if debug:
        print(f"cfg_dict============={cfg_dict}")
    return ColumnSchema(**cfg_dict)


def load_data(csv_path: str, schema: ColumnSchema, sample_size: int = 100, sample: bool = False, debug: bool = False):
    # Peek at the first rows only to learn which columns the file contains.
    dtype_data = pd.read_csv(csv_path, nrows=sample_size)
    if debug:
        print(f"data types from the dataset======={dtype_data.dtypes}")

    # Map every schema string column present in the file to pandas' "string" dtype.
    dtype_map = {}
    for col in schema.strings:
        if col in dtype_data.columns:
            dtype_map[col] = "string"

    # Apply the same dtype map whether reading a sample or the full file.
    data = pd.read_csv(csv_path, dtype=dtype_map, nrows=sample_size if sample else None)

    if debug:
        print(f"updated data types from the dataset======={data.dtypes}")

    # Coerce schema numeric columns, skipping any already read as strings.
    for col in schema.numerics:
        if col in data.columns and col not in dtype_map:
            data[col] = pd.to_numeric(data[col])

    return data

if __name__ == "__main__":
    base = Path("notebooks/5080_gpu/interview_prep/data/heat_base_data")
    player_df_csv_path = base / "player_statistics_used_Regular Season_from_2009.csv"
    team_df_csv_path = base / "team_statistics_used_Regular Season_from_2009.csv"
    schema_path = Path("notebooks/5080_gpu/interview_prep/data/schema.yaml")
    
    schema = load_schema(schema_path)
    
    player_data = load_data(player_df_csv_path, schema, debug=True)
    team_data = load_data(team_df_csv_path, schema, debug=True)
    
    
    
    



ValidationError: 5 validation errors for ColumnSchema
y_variable
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.11/v/string_type
ordinal
  Input should be a valid list [type=list_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.11/v/list_type
nominal
  Input should be a valid list [type=list_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.11/v/list_type
numerical
  Input should be a valid list [type=list_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.11/v/list_type
id_cols
  Input should be a valid list [type=list_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.11/v/list_type
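
The five errors above all have the same cause: `y_variable`, `ordinal`, `nominal`, `numerical`, and `id_cols` are left empty in `schema.yaml`, so they load as `None`, while `ColumnSchema` declares them as a required `str` and required `List[str]` fields. One possible fix, sketched here with the pydantic v2 API (the traceback links to the 2.11 docs), is to make the target optional and coerce `None` to an empty list before validation. `LenientColumnSchema` is a hypothetical name, not part of the original code.

```python
from typing import List, Optional

from pydantic import BaseModel, field_validator


class LenientColumnSchema(BaseModel):
    """Variant of ColumnSchema that tolerates empty YAML keys (loaded as None)."""

    y_variable: Optional[str] = None
    ordinal: List[str] = []
    nominal: List[str] = []
    numerical: List[str] = []
    id_cols: List[str] = []
    strings: List[str] = []
    numerics: List[str] = []

    @field_validator(
        "ordinal", "nominal", "numerical", "id_cols", "strings", "numerics",
        mode="before",
    )
    @classmethod
    def none_to_empty_list(cls, value):
        # An empty `key:` in YAML arrives here as None; treat it as "no columns".
        return [] if value is None else value


# Mimics what the YAML above produces: empty keys are None, the rest are lists.
schema = LenientColumnSchema(
    y_variable=None, ordinal=None, nominal=None, numerical=None, id_cols=None,
    strings=["lastName"], numerics=["points"],
)
```

The alternative is to leave the model strict and write `ordinal: []` (and so on) in the YAML, which is arguably clearer once the schema is filled in; the lenient model is mainly convenient while the file is still a template.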