# 🏓 Table Tennis Match Prediction - V6 (Competition Submission)

**Version**: V6 - Rally Context Features (Overfitted)

**Performance**: 
- Private Score: 0.3419 ❌
- Public Score: 0.3698
- Delta (private minus public): -0.0279 (overfitting to the public test set)

**Problems**:
1. ❌ `rally_serve_action/point`: Information leakage from first stroke
2. ❌ `is_deuce`, `is_server`: Training-specific patterns
3. ❌ 80-20 ensemble: Unbalanced blend
4. ❌ Sample weighting: Reduced generalization

**Note**: This version is included for educational purposes to show what NOT to do.

## 1. Configuration & Setup

In [None]:
# =========================================================
# Global Configuration
# =========================================================
USE_GPU = True  # Set to False if no GPU is available
N_FOLDS = 5
RANDOM_SEED = 42

# =========================================================
# 1. Setup & Import
# =========================================================
# !pip -q install lightgbm catboost pandas numpy scikit-learn

import pandas as pd
import numpy as np
import lightgbm as lgb
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.utils.class_weight import compute_sample_weight
import warnings
import gc

warnings.filterwarnings('ignore')

print("✅ Imports complete")
print(f"LightGBM version: {lgb.__version__}")

## 2. Load Data

In [None]:
# =========================================================
# 2. Load Data
# =========================================================
print("[2/8] Loading data...")
try:
    train_df = pd.read_csv("../data/train.csv")
    test_df = pd.read_csv("../data/test.csv")
    submission_df = pd.read_csv("../data/sample_submission.csv")
    print("✓ Loaded from ../data/")
except FileNotFoundError:
    # Fall back to a data/ directory relative to the working directory
    try:
        train_df = pd.read_csv("data/train.csv")
        test_df = pd.read_csv("data/test.csv")
        submission_df = pd.read_csv("data/sample_submission.csv")
        print("✓ Loaded from data/")
    except FileNotFoundError:
        print("❌ Error: Data files not found")
        raise

print(f"Train: {train_df.shape}, Test: {test_df.shape}")
train_df.head()

## 3. Feature Engineering (V6 - Rally Context Features)

In [None]:
# =========================================================
# 3. Feature Engineering (V6: Rally Context Features)
# =========================================================
print("[3/8] Feature Engineering (V6: Rally Context)...")

def get_rally_phase(n):
    if n == 1: return 0      # Serve
    elif n == 2: return 1    # Return
    elif n <= 4: return 2    # Early rally
    else: return 3           # Extended rally

def create_features(df):
    df_feats = df.copy()
    
    # --- Basic Features ---
    df_feats['rally_phase'] = df_feats['strickNumber'].apply(get_rally_phase)
    action_map = {
        1: 'Attack', 2: 'Attack', 3: 'Attack', 4: 'Attack', 5: 'Attack', 6: 'Attack', 7: 'Attack',
        8: 'Control', 9: 'Control', 10: 'Control', 11: 'Control',
        12: 'Defensive', 13: 'Defensive', 14: 'Defensive',
        15: 'Serve', 16: 'Serve', 17: 'Serve', 18: 'Serve',
        0: 'Zero', -1: 'Zero'
    }
    df_feats['action_type'] = df_feats['actionId'].map(action_map).fillna('Zero')
    
    if 'scoreSelf' in df_feats.columns and 'scoreOther' in df_feats.columns:
        df_feats['score_diff'] = df_feats['scoreSelf'] - df_feats['scoreOther']
        df_feats['is_deuce'] = ((df_feats['scoreSelf'] >= 9) & (df_feats['scoreOther'] >= 9) & (df_feats['score_diff'].abs() <= 2)).astype(int)
    if 'serveId' in df_feats.columns and 'gamePlayerId' in df_feats.columns:
        df_feats['is_server'] = (df_feats['serveId'] == df_feats['gamePlayerId']).astype(int)

    # --- Rally Context Features (INFORMATION LEAKAGE!) ---
    # ❌ PROBLEM: Propagates first stroke info to entire rally
    serve_info = df_feats[df_feats['strickNumber'] == 1][['rally_uid', 'actionId', 'pointId']].copy()
    serve_info.columns = ['rally_uid', 'rally_serve_action', 'rally_serve_point']
    df_feats = pd.merge(df_feats, serve_info, on='rally_uid', how='left')
    df_feats['rally_serve_action'] = df_feats['rally_serve_action'].fillna(-999).astype(int)
    df_feats['rally_serve_point'] = df_feats['rally_serve_point'].fillna(-999).astype(int)

    # --- Lag Features ---
    lag1_cols = ['strickId', 'handId', 'strengthId', 'spinId', 'pointId', 'actionId', 'positionId', 'action_type']
    for col in lag1_cols:
        df_feats[f'prev_{col}'] = df_feats.groupby('rally_uid')[col].shift(1)
    lag2_cols = ['actionId', 'pointId', 'action_type']
    for col in lag2_cols:
        df_feats[f'prev2_{col}'] = df_feats.groupby('rally_uid')[col].shift(2)

    # --- Interaction Features ---
    df_feats['prev_hand_spin'] = df_feats['prev_handId'].astype(str) + '_' + df_feats['prev_spinId'].astype(str)
    df_feats['prev_action_point'] = df_feats['prev_actionId'].astype(str) + '_' + df_feats['prev_pointId'].astype(str)

    # --- Fill Missing Values ---
    # astype(str) turns missing lags into 'nan' tokens; map them back to NaN, then fill
    for col in df_feats.columns:
        if col.startswith('prev'):
            df_feats[col] = df_feats[col].replace(['nan_nan', 'nan', '<NA>', '<NA>_<NA>'], np.nan)
            df_feats[col] = df_feats[col].fillna('None' if df_feats[col].dtype == 'object' else -999)

    return df_feats

train_feats_df = create_features(train_df)
test_feats_df = create_features(test_df)

print(f"✓ Train: {train_feats_df.shape}, Test: {test_feats_df.shape}")
print("⚠️ Warning: Includes rally_serve_action/point (INFORMATION LEAKAGE!)")

## 4. Prepare Training Data & Target Variables

In [None]:
# =========================================================
# 4. Prepare Training Data
# =========================================================
print("[4/8] Preparing Datasets...")

# Create target variables
train_feats_df['next_actionId'] = train_feats_df.groupby('rally_uid')['actionId'].shift(-1)
train_feats_df['next_pointId'] = train_feats_df.groupby('rally_uid')['pointId'].shift(-1)
train_feats_df['rally_outcome'] = train_feats_df['serverGetPoint']

train_next_df = train_feats_df.dropna(subset=['next_actionId', 'next_pointId']).copy()

# Define features
drop_cols = ['rally_uid', 'serverGetPoint', 'gamePlayerId', 'gamePlayerOtherId', 'match_id', 
             'next_actionId', 'next_pointId', 'rally_outcome', 'match', 'rally_id']
features = [col for col in train_feats_df.columns if col not in drop_cols]

# Identify categorical features
categorical_features = []
for col in features:
    if train_feats_df[col].dtype == 'object' or 'Id' in col or 'phase' in col or 'is_' in col or 'serve_' in col:
        categorical_features.append(col)

# Encode categorical features
print(f"Encoding {len(categorical_features)} categorical features...")
for col in categorical_features:
    le = LabelEncoder()
    train_feats_df[col] = train_feats_df[col].astype(str)
    test_feats_df[col] = test_feats_df[col].astype(str)
    le.fit(pd.concat([train_feats_df[col], test_feats_df[col]]))
    train_feats_df[col] = le.transform(train_feats_df[col])
    test_feats_df[col] = le.transform(test_feats_df[col])

# Prepare datasets
X_next = train_feats_df.loc[train_next_df.index, features]
groups_next = train_next_df['rally_uid']

le_action = LabelEncoder()
y_action = le_action.fit_transform(train_next_df['next_actionId'].astype(int))

le_point = LabelEncoder()
y_point = le_point.fit_transform(train_next_df['next_pointId'].astype(int))

X_outcome = train_feats_df[features]
y_outcome = train_feats_df['rally_outcome']
groups_outcome = train_feats_df['rally_uid']

# Test data (last row of each rally)
test_final_rows = test_feats_df.groupby('rally_uid').tail(1)
X_test = test_final_rows[features]
test_rally_uids = test_final_rows['rally_uid']

print(f"✓ X_next: {X_next.shape}, X_outcome: {X_outcome.shape}, X_test: {X_test.shape}")
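
The next-stroke targets built in this cell can be illustrated on a toy frame. This is a minimal sketch with made-up values; only the column names `rally_uid` and `actionId` come from the notebook.

```python
import pandas as pd

# Toy data (hypothetical values): one rally of three strokes, one of two
toy = pd.DataFrame({
    "rally_uid": [1, 1, 1, 2, 2],
    "actionId":  [15, 8, 1, 16, 9],
})

# shift(-1) within each rally pulls the following stroke's action onto the current row
toy["next_actionId"] = toy.groupby("rally_uid")["actionId"].shift(-1)

# The last stroke of each rally has no successor, so its target is NaN and the row is dropped
toy_next = toy.dropna(subset=["next_actionId"])

print(toy_next)
```

Each rally therefore contributes one fewer training row to the next-stroke models than it has strokes, while the outcome model keeps every row.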

## 5. Training Functions

In [None]:
# =========================================================
# 5. Training Functions
# =========================================================
def train_lgb(X, y, groups, X_test, params, cat_feats, n_splits=5):
    gkf = GroupKFold(n_splits=n_splits)
    num_class = params.get('num_class', 1)
    is_multiclass = params['objective'] == 'multiclass'
    oof_preds = np.zeros((len(X), num_class)) if is_multiclass else np.zeros(len(X))
    test_preds_list = []
    
    # ❌ PROBLEM: Sample weighting reduces generalization
    if is_multiclass:
        sample_weights = compute_sample_weight(class_weight='balanced', y=y)
    else:
        sample_weights = np.ones(len(y))

    for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups)):
        X_tr, y_tr = X.iloc[train_idx], y[train_idx]
        X_val, y_val = X.iloc[val_idx], y[val_idx]
        w_tr = sample_weights[train_idx]
        
        if is_multiclass:
             missing = set(np.unique(y_val)) - set(np.unique(y_tr))
             if missing:
                 add_idx = [val_idx[np.where(y_val == label)[0][0]] for label in missing]
                 X_tr = pd.concat([X_tr, X.iloc[add_idx]])
                 y_tr = np.concatenate([y_tr, y[add_idx]])
                 w_tr = np.concatenate([w_tr, sample_weights[add_idx]])
        
        dtrain = lgb.Dataset(X_tr, label=y_tr, weight=w_tr, categorical_feature=cat_feats)
        dval = lgb.Dataset(X_val, label=y_val, categorical_feature=cat_feats, reference=dtrain)
        model = lgb.train(params, dtrain, valid_sets=[dval], 
                         callbacks=[lgb.early_stopping(50, verbose=False), lgb.log_evaluation(0)])
        
        oof_preds[val_idx] = model.predict(X.iloc[val_idx])
        test_preds_list.append(model.predict(X_test))
            
    return oof_preds, np.mean(test_preds_list, axis=0)

def train_cat(X, y, groups, X_test, params, cat_indices, n_splits=5):
    gkf = GroupKFold(n_splits=n_splits)
    is_multiclass = 'MultiClass' in params.get('loss_function', '')
    num_class = int(np.max(y) + 1) if is_multiclass else 1
    oof_preds = np.zeros((len(X), num_class)) if is_multiclass else np.zeros(len(X))
    test_preds_list = []
    
    # ❌ PROBLEM: Auto class weights
    if is_multiclass:
        # Copy so the caller's params dict is not mutated
        params = {**params, 'auto_class_weights': 'Balanced'}
    
    for train_idx, val_idx in gkf.split(X, y, groups):
        X_tr, y_tr = X.iloc[train_idx], y[train_idx]
        X_val, y_val = X.iloc[val_idx], y[val_idx]
        
        if is_multiclass:
             missing = set(np.unique(y_val)) - set(np.unique(y_tr))
             if missing:
                 add_idx = [val_idx[np.where(y_val == label)[0][0]] for label in missing]
                 X_tr = pd.concat([X_tr, X.iloc[add_idx]])
                 y_tr = np.concatenate([y_tr, y[add_idx]])
                 
        train_pool = Pool(X_tr, y_tr, cat_features=cat_indices)
        val_pool = Pool(X_val, y_val, cat_features=cat_indices)
        model = CatBoostClassifier(**params)
        model.fit(train_pool, eval_set=val_pool, early_stopping_rounds=50, verbose=0)
        
        if is_multiclass:
            oof_preds[val_idx] = model.predict_proba(val_pool)
            test_preds_list.append(model.predict_proba(X_test))
        else:
            oof_preds[val_idx] = model.predict_proba(val_pool)[:, 1]
            test_preds_list.append(model.predict_proba(X_test)[:, 1])
            
    return oof_preds, np.mean(test_preds_list, axis=0)

print("✅ Training functions ready")
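
Both training functions patch folds in which a validation label never appears in the training split. The sketch below isolates that step on toy labels; the helper name `patch_missing_labels` is hypothetical and the data is made up. One side effect worth keeping in mind (an observation about the logic, not a measured result): each borrowed row ends up in both the training and validation sets of that fold, so OOF scores for rare classes may be slightly optimistic.

```python
import numpy as np

def patch_missing_labels(train_idx, val_idx, y):
    """Return train indices extended with one validation row per label unseen in training."""
    y_tr, y_val = y[train_idx], y[val_idx]
    missing = sorted(set(np.unique(y_val)) - set(np.unique(y_tr)))
    # For each missing label, take the first validation row that carries it
    add_idx = [val_idx[np.where(y_val == label)[0][0]] for label in missing]
    patched = np.concatenate([train_idx, np.array(add_idx, dtype=train_idx.dtype)])
    return patched, add_idx

# Toy labels (hypothetical): class 2 appears only in the validation fold
y = np.array([0, 1, 0, 1, 2, 0])
train_idx = np.array([0, 1, 2, 3])
val_idx = np.array([4, 5])

patched_idx, borrowed = patch_missing_labels(train_idx, val_idx, y)
print(patched_idx)  # row 4 is now part of the training indices as well
print(borrowed)
```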

## 6. Train Models (Dual-Engine: 80% LGBM + 20% CatBoost)

In [None]:
# =========================================================
# 6. Train Models (80% LGBM + 20% CatBoost)
# =========================================================
print("\n[5/8] Training Dual-Engine Models...")

# Model parameters
lgb_common = {
    'boosting_type': 'gbdt', 'n_estimators': 3000, 'learning_rate': 0.03, 
    'num_leaves': 31, 'subsample': 0.8, 'colsample_bytree': 0.8, 
    'random_state': RANDOM_SEED, 'n_jobs': -1, 'verbose': -1
}
cat_common = {
    'iterations': 2000, 'learning_rate': 0.05, 'depth': 7, 
    'random_seed': RANDOM_SEED, 'verbose': 0
}

if USE_GPU:
    lgb_common['device'] = 'gpu'
    cat_common.update({'task_type': 'GPU', 'devices': '0'})

cat_indices = [X_next.columns.get_loc(col) for col in categorical_features]

# ⚠️ PROBLEM: Unbalanced ensemble (80-20)
print("⚠️ Using 80% LGBM + 20% CatBoost (unbalanced)")

# --- Action ID ---
print("\n>> Action ID...")
lgb_p = {**lgb_common, 'objective': 'multiclass', 'num_class': len(le_action.classes_), 'metric': 'multi_logloss'}
cat_p = {**cat_common, 'loss_function': 'MultiClass', 'eval_metric': 'MultiClass'}

oof_lgb1, pred_lgb1 = train_lgb(X_next, y_action, groups_next, X_test, lgb_p, categorical_features)
oof_cat1, pred_cat1 = train_cat(X_next, y_action, groups_next, X_test, cat_p, cat_indices)

final_proba_action = 0.8 * pred_lgb1 + 0.2 * pred_cat1
oof_blended_action = 0.8 * oof_lgb1 + 0.2 * oof_cat1
print(f"   OOF F1: {f1_score(y_action, np.argmax(oof_blended_action, axis=1), average='macro'):.4f}")

# --- Point ID ---
print("\n>> Point ID...")
lgb_p['num_class'] = len(le_point.classes_)

oof_lgb2, pred_lgb2 = train_lgb(X_next, y_point, groups_next, X_test, lgb_p, categorical_features)
oof_cat2, pred_cat2 = train_cat(X_next, y_point, groups_next, X_test, cat_p, cat_indices)

final_proba_point = 0.8 * pred_lgb2 + 0.2 * pred_cat2
oof_blended_point = 0.8 * oof_lgb2 + 0.2 * oof_cat2
print(f"   OOF F1: {f1_score(y_point, np.argmax(oof_blended_point, axis=1), average='macro'):.4f}")

# --- Outcome (blended 70-30, unlike the 80-20 used for action and point) ---
print("\n>> Rally Outcome...")
lgb_p_bin = {**lgb_common, 'objective': 'binary', 'metric': 'auc'}
cat_p_bin = {**cat_common, 'loss_function': 'Logloss', 'eval_metric': 'AUC'}

oof_lgb3, pred_lgb3 = train_lgb(X_outcome, y_outcome, groups_outcome, X_test, lgb_p_bin, categorical_features)
oof_cat3, pred_cat3 = train_cat(X_outcome, y_outcome, groups_outcome, X_test, cat_p_bin, cat_indices)

final_proba_outcome = 0.7 * pred_lgb3 + 0.3 * pred_cat3
oof_blended_outcome = 0.7 * oof_lgb3 + 0.3 * oof_cat3
print(f"   OOF AUC: {roc_auc_score(y_outcome, oof_blended_outcome):.4f}")

print("\n✅ Training complete!")

## 7. Synchronize Predictions

In [None]:
# =========================================================
# 7. Synchronize Predictions
# =========================================================
print("\n[6/8] Synchronizing Predictions...")

def synchronize_endings_strict(prob_act, prob_pt, le_act, le_pt, threshold=0.5):
    """Force action and point predictions to agree on rally endings (class -1)."""
    try:
        act_neg1_idx = list(le_act.classes_).index(-1)
        pt_neg1_idx = list(le_pt.classes_).index(-1)
    except ValueError:
        # list.index raises ValueError when -1 is not among the encoded classes
        print("✗ Could not find -1 class, skipping sync")
        return le_act.inverse_transform(np.argmax(prob_act, axis=1)), \
               le_pt.inverse_transform(np.argmax(prob_pt, axis=1))

    # Average the two END probabilities into a single ending score per row
    p_end = (prob_act[:, act_neg1_idx] + prob_pt[:, pt_neg1_idx]) / 2

    # 2.0 exceeds any probability, so argmax must pick END; 0.0 rules END out
    prob_act_mod, prob_pt_mod = prob_act.copy(), prob_pt.copy()
    prob_act_mod[p_end >= threshold, act_neg1_idx] = 2.0
    prob_pt_mod[p_end >= threshold, pt_neg1_idx] = 2.0
    prob_act_mod[p_end < threshold, act_neg1_idx] = 0.0
    prob_pt_mod[p_end < threshold, pt_neg1_idx] = 0.0

    print(f"✓ Synced {(p_end >= threshold).sum()} rows to END")
    return le_act.inverse_transform(np.argmax(prob_act_mod, axis=1)), \
           le_pt.inverse_transform(np.argmax(prob_pt_mod, axis=1))

final_action, final_point = synchronize_endings_strict(
    final_proba_action, final_proba_point, le_action, le_point
)
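
The effect of the synchronization step is easiest to see on two hand-made rows. This is a minimal sketch with hypothetical probabilities and class labels; it skips the `LabelEncoder` objects and assumes column 0 is the END class (-1) in both probability matrices.

```python
import numpy as np

# Hypothetical class order: column 0 is the END class (-1)
classes = np.array([-1, 3, 7])
prob_act = np.array([[0.40, 0.35, 0.25],   # END is the top action here...
                     [0.70, 0.20, 0.10]])
prob_pt = np.array([[0.10, 0.60, 0.30],    # ...but not the top point
                    [0.60, 0.30, 0.10]])

threshold = 0.5
p_end = (prob_act[:, 0] + prob_pt[:, 0]) / 2   # roughly [0.25, 0.65]

act_mod, pt_mod = prob_act.copy(), prob_pt.copy()
for mod in (act_mod, pt_mod):
    mod[p_end >= threshold, 0] = 2.0   # larger than any probability: argmax must pick END
    mod[p_end < threshold, 0] = 0.0    # END removed from contention

pred_act = classes[np.argmax(act_mod, axis=1)]
pred_pt = classes[np.argmax(pt_mod, axis=1)]
print(pred_act, pred_pt)
```

In the first row the raw action argmax would have been END while the point argmax was not; after the override both heads agree on a non-END class. In the second row both heads are forced to END.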

## 8. Generate Submission

In [None]:
# =========================================================
# 8. Generate Submission
# =========================================================
print("\n[7/8] Generating Submission...")

submission = pd.DataFrame({
    'rally_uid': test_rally_uids, 
    'serverGetPoint': final_proba_outcome, 
    'pointId': final_point, 
    'actionId': final_action
})

final_submission = pd.merge(submission_df[['rally_uid']], submission, on='rally_uid', how='left')
final_submission.fillna({'serverGetPoint': 0.5}, inplace=True)

valid_action_mode = train_df['actionId'].mode()[0]
valid_point_mode = train_df['pointId'].mode()[0]
final_submission['actionId'] = final_submission['actionId'].fillna(valid_action_mode).astype(int)
final_submission['pointId'] = final_submission['pointId'].fillna(valid_point_mode).astype(int)

final_submission.to_csv('../submissions/submission_v6.csv', index=False)

print("✅ Submission saved to: ../submissions/submission_v6.csv")
print(f"\nShape: {final_submission.shape}")
print("\nFirst 5 rows:")
print(final_submission.head())
print("\n⚠️ Warning: This version overfits to public test!")
print("   Use main_gold.ipynb or main_gold_eda.ipynb for better generalization.")

## 9. Summary: What Went Wrong in V6?

### ❌ Problem 1: Rally Context Information Leakage
```python
# Shorthand for the strickNumber == 1 merge in Section 3
df['rally_serve_action'] = df.groupby('rally_uid')['actionId'].transform('first')
df['rally_serve_point'] = df.groupby('rally_uid')['pointId'].transform('first')
```
**Issue**: Propagates first-stroke information to every row of the rally
**Impact**: Model learns to "peek" at the rally start and doesn't generalize
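
To make the propagation concrete, here is a minimal sketch on made-up rallies. It assumes rows are ordered by stroke and that every rally contains a `strickNumber == 1` row; under those assumptions the `transform('first')` shorthand and the merge from Section 3 produce the same column.

```python
import pandas as pd

# Toy rallies (hypothetical values); column names follow the notebook
toy = pd.DataFrame({
    "rally_uid":    [1, 1, 1, 2, 2],
    "strickNumber": [1, 2, 3, 1, 2],
    "actionId":     [15, 8, 1, 16, 9],
})

# Shorthand: first value per rally, copied to every row of that rally
toy["via_transform"] = toy.groupby("rally_uid")["actionId"].transform("first")

# Merge used in Section 3: look up the strickNumber == 1 row and join it back
serve = toy.loc[toy["strickNumber"] == 1, ["rally_uid", "actionId"]]
serve = serve.rename(columns={"actionId": "via_merge"})
toy = toy.merge(serve, on="rally_uid", how="left")

print(toy)
```

Both columns are constant within a rally, so every stroke row carries the serve's `actionId` regardless of how far into the rally it is.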

### ❌ Problem 2: Training-Specific Features
```python
score_diff = df['scoreSelf'] - df['scoreOther']
df['is_deuce'] = ((df['scoreSelf'] >= 9) & (df['scoreOther'] >= 9) & (score_diff.abs() <= 2)).astype(int)
df['is_server'] = (df['serveId'] == df['gamePlayerId']).astype(int)
```
**Issue**: Patterns specific to the training set distribution
**Impact**: Overfits to training patterns that differ in the test set
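
One way to check whether such a flag behaves differently in train and test is to compare its value frequencies side by side. The sketch below uses a hypothetical helper `compare_rates` and made-up flags. Note that the notebook predicts only from the last row of each test rally while training on every row, so the two populations need not match.

```python
import pandas as pd

def compare_rates(train: pd.Series, test: pd.Series) -> pd.DataFrame:
    """Side-by-side relative frequencies of each value in train and test."""
    rates = pd.DataFrame({
        "train": train.value_counts(normalize=True),
        "test": test.value_counts(normalize=True),
    }).fillna(0.0)
    rates["abs_diff"] = (rates["train"] - rates["test"]).abs()
    return rates.sort_values("abs_diff", ascending=False)

# Hypothetical flags: 10% deuce rows in train, 40% in test
train_flag = pd.Series([0] * 9 + [1] * 1, name="is_deuce")
test_flag = pd.Series([0] * 6 + [1] * 4, name="is_deuce")

report = compare_rates(train_flag, test_flag)
print(report)
```

A large `abs_diff` does not prove a feature is harmful, but it is a cheap signal that the model may be learning a pattern the test rows do not share.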

### ❌ Problem 3: Unbalanced Ensemble
```python
final_proba_action = 0.8 * pred_lgb1 + 0.2 * pred_cat1  # 80% LGBM, 20% CatBoost
```
**Issue**: Over-relies on one model, reduces diversity
**Impact**: Less stable predictions, worse generalization
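
Rather than fixing the blend by hand, the weight can be compared on out-of-fold predictions. The following is a minimal sketch, not the notebook's procedure: the helper `sweep_blend_weights` is hypothetical, it scores with plain accuracy instead of the macro-F1 used above to stay NumPy-only, and the toy probabilities are constructed so that the two models err on different rows.

```python
import numpy as np

def sweep_blend_weights(oof_a, oof_b, y_true, weights):
    """Accuracy of w * oof_a + (1 - w) * oof_b for each weight w."""
    scores = {}
    for w in weights:
        blended = w * oof_a + (1 - w) * oof_b
        scores[w] = float(np.mean(np.argmax(blended, axis=1) == y_true))
    return scores

# Hypothetical OOF probabilities for two models on four rows, two classes
y_true = np.array([0, 1, 1, 0])
oof_a = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]])  # wrong on row 1
oof_b = np.array([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]])  # wrong on row 3

scores = sweep_blend_weights(oof_a, oof_b, y_true, weights=[0.0, 0.5, 0.8, 1.0])
print(scores)
```

On this toy data the 50-50 blend fixes both models' mistakes while the 80-20 blend inherits the first model's error. That outcome is built into the example and says nothing about the real data; the point is only that the weight is something to validate on OOF predictions.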

### ❌ Problem 4: Sample Weighting
```python
sample_weights = compute_sample_weight(class_weight='balanced', y=y)
```
**Issue**: Forces class balance when the test distribution may differ
**Impact**: Model optimizes the wrong objective
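
The `'balanced'` heuristic assigns each row the weight `n_samples / (n_classes * class_count)`. The sketch below reimplements that formula in NumPy on made-up labels to show how strongly rare classes are up-weighted; `balanced_sample_weights` is a hypothetical stand-in, not scikit-learn's function.

```python
import numpy as np

def balanced_sample_weights(y):
    """Per-row weights n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    class_weight = len(y) / (len(classes) * counts)
    # Map each label to its class weight
    lookup = dict(zip(classes.tolist(), class_weight.tolist()))
    return np.array([lookup[label] for label in y.tolist()])

# Hypothetical labels: class 0 is eight times as common as class 1
y = np.array([0] * 8 + [1] * 1)
w = balanced_sample_weights(y)
print(w[0], w[-1])  # 0.5625 4.5
```

The single rare row weighs as much as all eight common rows combined, which is the intended effect of balancing and also why it can hurt when the evaluation data keeps the original class mix.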

### 📊 Final Results
- **Public Score**: 0.3698 (looked good!)
- **Private Score**: 0.3419 (reality check)
- **Delta**: -0.0279 (a 7.5% relative drop)
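
The delta and the relative drop follow directly from the two scores:

```python
public_score = 0.3698
private_score = 0.3419

delta = private_score - public_score   # negative: the private score is lower
relative = delta / public_score        # drop as a fraction of the public score

print(f"Delta: {delta:+.4f}")          # Delta: -0.0279
print(f"Relative: {relative:+.1%}")    # Relative: -7.5%
```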

### ✅ Lessons Learned
1. **Avoid Information Leakage**: Features should only use past information
2. **Trust Cross-Validation**: OOF scores are better indicators than public LB
3. **Simple Features Win**: Complexity often hurts generalization
4. **Balanced Ensembles**: Equal weighting usually works best
5. **Distribution Awareness**: Understand train/test differences
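
Trusting cross-validation presupposes that the folds themselves do not leak. The notebook splits with `GroupKFold` on `rally_uid`; the sketch below, on made-up rally ids, checks the property this relies on: no rally appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical rally ids: six rallies, two strokes each
groups = np.repeat(np.arange(6), 2)
X = np.zeros((len(groups), 1))

overlaps = []
for train_idx, val_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    shared = set(groups[train_idx]) & set(groups[val_idx])
    overlaps.append(len(shared))

print(overlaps)  # [0, 0, 0]: no rally is split across train and validation
```

Grouping by rally keeps strokes of the same rally together, but it does not by itself guard against leakage across rallies of the same match, which would need a coarser grouping key.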

### 🔗 Better Alternatives
- **main_gold.ipynb**: Clean baseline (0.3574 private, +0.0155 over V6)
- **main_gold_eda.ipynb**: EDA-optimized (0.3596 private, +0.0177 over V6)

---

**Note**: This notebook is preserved for educational purposes to demonstrate common pitfalls in machine learning competitions.