# NBA Game Prediction - Data Preparation

This notebook prepares the data for machine learning model development. We'll:
1. Load the transformed data
2. Define our target variable
3. Select relevant features
4. Split the data into training and testing sets and save them
5. Check how each feature correlates with the target

## 1. Import Libraries
Import all necessary libraries for data preparation and machine learning.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

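# Set plot style. The 'seaborn' Matplotlib style was renamed in Matplotlib 3.6
# and later removed, so use seaborn's own theme setter instead.
sns.set_theme(style='darkgrid', palette='husl')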

## 2. Load Transformed Data
Load the final processed data from the transformation step.

In [None]:
# Load the transformed data
df = pd.read_csv('../data/transformed/transformed_games.csv')
print(f'Loaded {len(df)} games')
df.head()

## 3. Define Target Variable
Create the binary target variable from the `WL` column (game outcome: 1 for a win, 0 for a loss).

In [None]:
# Create target variable (1 for win, 0 for loss)
df['TARGET'] = (df['WL'] == 'W').astype(int)

# Display target distribution
print('Target variable distribution:')
print(df['TARGET'].value_counts(normalize=True))
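
The comparison `df['WL'] == 'W'` maps every value other than `'W'` to 0, so a missing or unexpected outcome would be counted as a loss. The sketch below shows one way to guard against that before encoding. It uses a small toy frame (`games`, hypothetical) rather than the real file, and assumes outcomes are stored as the strings `'W'` and `'L'`.

```python
import pandas as pd

# Toy frame standing in for the transformed games (hypothetical data).
# Assumption: the real 'WL' column holds the strings 'W' and 'L'.
games = pd.DataFrame({'WL': ['W', 'L', None, 'W', 'L']})

# Rows whose outcome is neither 'W' nor 'L' (for example, missing values)
# would otherwise be encoded as 0 and silently treated as losses.
valid = games['WL'].isin(['W', 'L'])
print(f'Dropping {(~valid).sum()} rows with an unknown outcome')

# Keep only rows with a known outcome, then encode the target.
games = games.loc[valid].copy()
games['TARGET'] = (games['WL'] == 'W').astype(int)
print(games['TARGET'].value_counts(normalize=True))
```

Whether to drop such rows or investigate them first depends on how they arise in the transformation step.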

## 4. Feature Selection
Select relevant features for prediction. We'll focus on:
- Team efficiency metrics
- Momentum indicators
- Rest and schedule indicators
- Rolling averages

In [None]:
# Define feature columns (adjust based on your actual column names)
feature_columns = [
    'NET_RATING',
    'OFF_RATING',
    'DEF_RATING',
    'PACE',
    'WIN_STREAK',
    'ROLL_WIN_PCT_5',
    'ROLL_WIN_PCT_10',
    'REST_DAYS',
    'IS_BACK_TO_BACK'
]

# Create feature matrix X and target vector y
X = df[feature_columns]
y = df['TARGET']

print('Feature matrix shape:', X.shape)
print('\nFeature columns:')
print(X.columns.tolist())
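
Because the column names above may need adjusting, it can help to check the list against the DataFrame before indexing, and to count missing values at the same time. This is a minimal sketch: `check_features` is a hypothetical helper, not part of the project, and the toy frame stands in for `df`.

```python
import pandas as pd

def check_features(frame, columns):
    """Return (missing_columns, null_counts) for the requested feature columns.

    Hypothetical helper for illustration only.
    """
    missing = [c for c in columns if c not in frame.columns]
    present = [c for c in columns if c in frame.columns]
    null_counts = frame[present].isna().sum()
    # Report only the columns that actually contain nulls.
    return missing, null_counts[null_counts > 0]

# Toy data: one requested column is absent and one contains a null.
toy = pd.DataFrame({
    'NET_RATING': [1.5, None, -2.0],
    'PACE': [98.0, 101.2, 99.5],
})
missing, nulls = check_features(toy, ['NET_RATING', 'PACE', 'REST_DAYS'])
print('Missing columns:', missing)
print('Columns with nulls:', nulls.to_dict())
```

Rolling features such as `ROLL_WIN_PCT_5` may well be null for a team's first few games, which is one reason a null count is worth running before the split.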

## 5. Split Data into Training and Testing Sets
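Split the data into training (80%) and testing (20%) sets. The split is stratified on the target, so both sets keep the same win/loss ratio. Note that a random split ignores game order: if the rolling features look back over earlier games, some test-set games will precede games in the training set.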

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print('Training set shape:', X_train.shape)
print('Testing set shape:', X_test.shape)

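# Save the prepared data, creating the output directory if it does not exist
from pathlib import Path
output_dir = Path('../data/processed')
output_dir.mkdir(parents=True, exist_ok=True)
X_train.to_csv(output_dir / 'X_train.csv', index=False)
X_test.to_csv(output_dir / 'X_test.csv', index=False)
y_train.to_csv(output_dir / 'y_train.csv', index=False)
y_test.to_csv(output_dir / 'y_test.csv', index=False)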

print('\nPrepared data saved to ../data/processed/')
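
As an alternative to the random split, games can be split by date so that the model is always evaluated on games played after the ones it was trained on. The sketch below is an illustration under assumptions, not the method used in this notebook: it assumes a date column named `GAME_DATE` (hypothetical, it may be named differently or absent in the transformed data), and `chronological_split` is a helper defined here only for the example.

```python
import pandas as pd

def chronological_split(frame, date_col, test_size=0.2):
    """Hold out the most recent `test_size` fraction of rows as the test set."""
    ordered = frame.sort_values(date_col, kind='stable').reset_index(drop=True)
    n_test = max(1, round(len(ordered) * test_size))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy data with an assumed 'GAME_DATE' column, deliberately out of order.
toy = pd.DataFrame({
    'GAME_DATE': pd.to_datetime([
        '2023-01-05', '2023-01-01', '2023-01-03', '2023-01-02', '2023-01-04',
    ]),
    'TARGET': [1, 0, 1, 0, 1],
})
train, test = chronological_split(toy, 'GAME_DATE')
print('Train ends:', train['GAME_DATE'].max().date())
print('Test starts:', test['GAME_DATE'].min().date())
```

A date-based split gives up stratification, so it is worth checking the target distribution in each part afterwards.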

## 6. Feature Analysis
Analyze the relationship between features and the target variable.

In [None]:
# Calculate the Pearson correlation of each feature with the target,
# using the training set only so the test set stays untouched
correlations = X_train.corrwith(y_train).sort_values(ascending=False)

# Plot feature correlations
plt.figure(figsize=(10, 6))
correlations.plot(kind='bar')
plt.title('Feature Correlations with Target Variable')
plt.ylabel('Pearson correlation')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
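
Correlation with the target says nothing about redundancy among the features themselves. If `NET_RATING` is computed as `OFF_RATING` minus `DEF_RATING`, as is conventional, the three columns carry overlapping information. A minimal sketch for flagging strongly correlated feature pairs follows. `high_correlation_pairs` and the 0.9 threshold are illustrative choices, and the toy data is made up.

```python
from itertools import combinations

import pandas as pd

def high_correlation_pairs(features, threshold=0.9):
    """List (col_a, col_b, |r|) for feature pairs whose absolute Pearson r exceeds threshold."""
    corr = features.corr().abs()
    return [
        (a, b, float(corr.loc[a, b]))
        for a, b in combinations(corr.columns, 2)
        if corr.loc[a, b] > threshold
    ]

# Toy features: the third column is an exact multiple of the first.
toy = pd.DataFrame({
    'OFF_RATING': [110.0, 105.0, 115.0, 108.0],
    'PACE': [98.0, 101.0, 100.0, 97.0],
})
toy['OFF_RATING_COPY'] = toy['OFF_RATING'] * 2

for a, b, r in high_correlation_pairs(toy):
    print(f'{a} vs {b}: |r| = {r:.2f}')
```

On the real data, the same function could be called on `X_train` to decide whether any feature should be dropped before modeling.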