# Student Test Scores - Fixed Pipeline + Optuna Optimization

This notebook:
1. ✅ Fixes the feature encoding issue
2. ✅ Implements proper feature engineering
3. ✅ Uses Optuna for hyperparameter optimization
4. ✅ Generates a valid submission file

---

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, make_scorer
import lightgbm as lgb

# Optuna
import optuna
from optuna.visualization import plot_optimization_history, plot_param_importances, plot_parallel_coordinate

# Settings
import warnings
warnings.filterwarnings('ignore')
optuna.logging.set_verbosity(optuna.logging.WARNING)

%matplotlib inline
sns.set_style('darkgrid')

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("✓ Libraries imported successfully")
print(f"Optuna version: {optuna.__version__}")
print(f"LightGBM version: {lgb.__version__}")

✓ Libraries imported successfully
Optuna version: 4.7.0
LightGBM version: 4.6.0


## 2. Load Data

In [None]:
# Load datasets
print("Loading data...")
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

print(f"✓ Data loaded")
print(f"  Train shape: {train.shape}")
print(f"  Test shape: {test.shape}")

print(f"\nFeatures: {train.columns.tolist()}")
print(f"\nTarget stats:")
print(train['exam_score'].describe())

Loading data...
✓ Data loaded
  Train shape: (630000, 13)
  Test shape: (270000, 12)

Features: ['id', 'age', 'gender', 'course', 'study_hours', 'class_attendance', 'internet_access', 'sleep_hours', 'sleep_quality', 'study_method', 'facility_rating', 'exam_difficulty', 'exam_score']

Target stats:
count    630000.000000
mean         62.506672
std          18.916884
min          19.599000
25%          48.800000
50%          62.600000
75%          76.300000
max         100.000000
Name: exam_score, dtype: float64


In [None]:
# Preview data
print("Training data preview:")
display(train.head())

print("\nTest data preview:")
display(test.head())

Training data preview:


|   | id | age | gender | course | study_hours | class_attendance | internet_access | sleep_hours | sleep_quality | study_method | facility_rating | exam_difficulty | exam_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 21 | female | b.sc | 7.91 | 98.8 | no | 4.9 | average | online videos | low | easy | 78.3 |
| 1 | 1 | 18 | other | diploma | 4.95 | 94.8 | yes | 4.7 | poor | self-study | medium | moderate | 46.7 |
| 2 | 2 | 20 | female | b.sc | 4.68 | 92.6 | yes | 5.8 | poor | coaching | high | moderate | 99.0 |
| 3 | 3 | 19 | male | b.sc | 2.0 | 49.5 | yes | 8.3 | average | group study | high | moderate | 63.9 |
| 4 | 4 | 23 | male | bca | 7.65 | 86.9 | yes | 9.6 | good | self-study | high | easy | 100.0 |



Test data preview:


|   | id | age | gender | course | study_hours | class_attendance | internet_access | sleep_hours | sleep_quality | study_method | facility_rating | exam_difficulty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 630000 | 24 | other | ba | 6.85 | 65.2 | yes | 5.2 | poor | group study | high | easy |
| 1 | 630001 | 18 | male | diploma | 6.61 | 45.0 | no | 9.3 | poor | coaching | low | easy |
| 2 | 630002 | 24 | female | b.tech | 6.6 | 98.5 | yes | 6.2 | good | group study | medium | moderate |
| 3 | 630003 | 24 | male | diploma | 3.03 | 66.3 | yes | 5.7 | average | mixed | medium | moderate |
| 4 | 630004 | 20 | female | b.tech | 2.03 | 42.4 | yes | 9.2 | average | coaching | low | moderate |


## 3. Feature Engineering

**CRITICAL:** The same transformations are applied to both train and test, so the two feature sets stay consistent.

In [None]:
def engineer_features(df):
    """
    Apply feature engineering transformations.
    This function is applied to both train and test to ensure consistency.
    """
    df = df.copy()
    
    # Polynomial features
    df['study_hours_sq'] = df['study_hours'] ** 2
    df['sleep_hours_sq'] = df['sleep_hours'] ** 2
    df['class_attendance_sq'] = df['class_attendance'] ** 2
    
    # Interaction features
    df['study_attendance'] = df['study_hours'] * df['class_attendance']
    df['study_sleep_ratio'] = df['study_hours'] / (df['sleep_hours'] + 1e-6)
    df['age_study_interaction'] = df['age'] * df['study_hours']
    df['attendance_sleep'] = df['class_attendance'] * df['sleep_hours']
    
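    # Categorical binning
    # include_lowest=True keeps a value equal to the lowest edge (e.g. 0 hours)
    # in the first bin instead of turning it into NaN
    df['age_group'] = pd.cut(df['age'],
                              bins=[0, 22, 28, 100],
                              labels=['young', 'middle', 'senior'],
                              include_lowest=True)
    
    df['study_intensity'] = pd.cut(df['study_hours'],
                                     bins=[0, 3, 6, 100],
                                     labels=['low', 'medium', 'high'],
                                     include_lowest=True)
    
    df['sleep_category'] = pd.cut(df['sleep_hours'],
                                    bins=[0, 6, 8, 100],
                                    labels=['insufficient', 'optimal', 'excessive'],
                                    include_lowest=True)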
    
    return df

print("✓ Feature engineering function defined")

✓ Feature engineering function defined


In [None]:
# Separate features and target
X_train_full = train.drop(['id', 'exam_score'], axis=1)
y_train = train['exam_score']
X_test = test.drop(['id'], axis=1)
test_ids = test['id']

print(f"Original train features: {X_train_full.shape}")
print(f"Original test features: {X_test.shape}")

# Apply feature engineering
print("\nApplying feature engineering...")
X_train_full = engineer_features(X_train_full)
X_test = engineer_features(X_test)

print(f"After feature engineering:")
print(f"  Train: {X_train_full.shape}")
print(f"  Test: {X_test.shape}")

Original train features: (630000, 11)
Original test features: (270000, 11)

Applying feature engineering...
After feature engineering:
  Train: (630000, 21)
  Test: (270000, 21)
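
A note on the binning step: `pd.cut` builds right-closed intervals by default, so a value sitting exactly on the lowest edge falls outside every bin and becomes `NaN` unless `include_lowest=True` is passed. The sketch below illustrates this on made-up values (they are not taken from the dataset) using the same `study_hours` bin edges as `engineer_features`.

```python
import pandas as pd

# Made-up values: 0.0 sits exactly on the lowest bin edge.
hours = pd.Series([0.0, 3.0, 6.5])
bins = [0, 3, 6, 100]
labels = ['low', 'medium', 'high']

# Default behaviour: intervals are (0, 3], (3, 6], (6, 100].
default_cut = pd.cut(hours, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge.
inclusive_cut = pd.cut(hours, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

Whether this matters here depends on whether the raw columns ever contain an exact 0, which the outputs above do not show.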


## 4. Encode Categorical Variables

**THE FIX:** Combine train and test before one-hot encoding so both end up with identical columns.

In [None]:
print("Encoding categorical variables...")

# Combine train and test with keys to track them
combined = pd.concat([X_train_full, X_test], keys=['train', 'test'], sort=False)

print(f"Combined shape: {combined.shape}")

# One-hot encode
combined_encoded = pd.get_dummies(combined, drop_first=True)

print(f"After encoding: {combined_encoded.shape}")

# Split back into train and test
X_train_encoded = combined_encoded.loc['train'].reset_index(drop=True)
X_test_encoded = combined_encoded.loc['test'].reset_index(drop=True)

print(f"\n✓ Encoding complete")
print(f"  Train shape: {X_train_encoded.shape}")
print(f"  Test shape: {X_test_encoded.shape}")
print(f"  Columns match: {X_train_encoded.shape[1] == X_test_encoded.shape[1]}")
print(f"  Number of features: {X_train_encoded.shape[1]}")

# Verify no missing values
train_nulls = X_train_encoded.isnull().sum().sum()
test_nulls = X_test_encoded.isnull().sum().sum()

print(f"\nMissing values:")
print(f"  Train: {train_nulls}")
print(f"  Test: {test_nulls}")

assert train_nulls == 0, "Train has missing values!"
assert test_nulls == 0, "Test has missing values!"

print("\n✓ All validation checks passed!")

Encoding categorical variables...
Combined shape: (900000, 21)
After encoding: (900000, 36)

✓ Encoding complete
  Train shape: (630000, 36)
  Test shape: (270000, 36)
  Columns match: True
  Number of features: 36

Missing values:
  Train: 0
  Test: 0

✓ All validation checks passed!
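
To see why the combined encoding matters, here is a minimal sketch on hypothetical toy frames (the names `toy_train` and `toy_test` are illustrative, not part of the notebook). Encoding each split separately yields different columns whenever a category is missing from one split. Reindexing the test columns to the train columns is one possible alternative to concatenation.

```python
import pandas as pd

# Hypothetical toy frames: 'mixed' appears only in the test split.
toy_train = pd.DataFrame({'study_method': ['coaching', 'self-study']})
toy_test = pd.DataFrame({'study_method': ['coaching', 'mixed']})

# Encoding each split on its own produces different column sets.
train_sep = pd.get_dummies(toy_train)
test_sep = pd.get_dummies(toy_test)
print(train_sep.columns.tolist())
print(test_sep.columns.tolist())

# Alternative to concatenating: force test onto the train columns.
# Columns missing from test are filled with 0; columns unseen in train are dropped.
test_aligned = test_sep.reindex(columns=train_sep.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

Concatenating first, as the notebook does, also keeps categories that occur only in test. The reindex approach silently drops them.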


In [None]:
# Display feature names
print(f"Feature columns ({len(X_train_encoded.columns)}):")
print(X_train_encoded.columns.tolist())

Feature columns (36):
['age', 'study_hours', 'class_attendance', 'sleep_hours', 'study_hours_sq', 'sleep_hours_sq', 'class_attendance_sq', 'study_attendance', 'study_sleep_ratio', 'age_study_interaction', 'attendance_sleep', 'gender_male', 'gender_other', 'course_b.sc', 'course_b.tech', 'course_ba', 'course_bba', 'course_bca', 'course_diploma', 'internet_access_yes', 'sleep_quality_good', 'sleep_quality_poor', 'study_method_group study', 'study_method_mixed', 'study_method_online videos', 'study_method_self-study', 'facility_rating_low', 'facility_rating_medium', 'exam_difficulty_hard', 'exam_difficulty_moderate', 'age_group_middle', 'age_group_senior', 'study_intensity_medium', 'study_intensity_high', 'sleep_category_optimal', 'sleep_category_excessive']


## 5. Train-Validation Split

In [None]:
# Split for validation
X_train, X_val, y_train_split, y_val = train_test_split(
    X_train_encoded, y_train, 
    test_size=0.2, 
    random_state=RANDOM_STATE
)

print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_val.shape}")
print(f"Test set: {X_test_encoded.shape}")

Training set: (504000, 36)
Validation set: (126000, 36)
Test set: (270000, 36)


## 6. Baseline Model (Before Optuna)

Let's establish a baseline with LightGBM's default hyperparameters.

In [None]:
print("Training baseline model...")

# Baseline LightGBM with default parameters
baseline_model = lgb.LGBMRegressor(
    n_estimators=100,
    random_state=RANDOM_STATE,
    verbose=-1
)

baseline_model.fit(X_train, y_train_split)

# Evaluate
train_preds = baseline_model.predict(X_train)
val_preds = baseline_model.predict(X_val)

train_rmse = np.sqrt(mean_squared_error(y_train_split, train_preds))
val_rmse = np.sqrt(mean_squared_error(y_val, val_preds))
val_mae = mean_absolute_error(y_val, val_preds)
val_r2 = r2_score(y_val, val_preds)

print("\n" + "="*60)
print("BASELINE MODEL RESULTS")
print("="*60)
print(f"Training RMSE:   {train_rmse:.4f}")
print(f"Validation RMSE: {val_rmse:.4f}")
print(f"Validation MAE:  {val_mae:.4f}")
print(f"Validation R²:   {val_r2:.4f}")
print("="*60)

baseline_score = val_rmse

Training baseline model...

BASELINE MODEL RESULTS
Training RMSE:   8.8028
Validation RMSE: 8.8216
Validation MAE:  7.0442
Validation R²:   0.7812
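
The two headline metrics are linked: R² = 1 − MSE / Var(y), where Var(y) is the error of always predicting the mean. Plugging in the validation RMSE and the training-set standard deviation printed earlier gives roughly 1 − 8.82² / 18.92² ≈ 0.78, in line with the reported value. The small gap is expected because the standard deviation comes from the full training set rather than the validation split. The sketch below checks the identity on synthetic numbers. The arrays are made up and only loosely mimic the scale of the real data.

```python
import numpy as np

rng = np.random.default_rng(42)
y_true = rng.normal(loc=62.5, scale=18.9, size=10_000)  # synthetic targets
y_pred = y_true + rng.normal(scale=8.8, size=10_000)    # synthetic predictions

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# R^2 compares the model's MSE with the MSE of always predicting the mean,
# which equals the population variance of the targets.
r2 = 1.0 - mse / np.var(y_true)

# Same quantity from the sums-of-squares definition.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_direct = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse:.3f}  R^2: {r2:.3f}")
```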


## 7. Optuna Hyperparameter Optimization

Now let's use Optuna to find better hyperparameters!

In [None]:
def objective(trial):
    """
    Objective function for Optuna to optimize.
    Optuna will try to minimize the returned value (RMSE).
    """
    
    # Suggest hyperparameters
    params = {
        'objective': 'regression',
        'metric': 'rmse',
        'verbosity': -1,
        'boosting_type': 'gbdt',
        'random_state': RANDOM_STATE,
        
        # Hyperparameters to optimize
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'num_leaves': trial.suggest_int('num_leaves', 20, 150),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
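        # Note: LightGBM only applies subsample (bagging) when subsample_freq > 0;
        # with the default subsample_freq=0 this value has no effect
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),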
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
    }
    
    # Train model
    model = lgb.LGBMRegressor(**params)
    model.fit(X_train, y_train_split)
    
    # Predict and evaluate
    preds = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, preds))
    
    return rmse

print("✓ Objective function defined")

✓ Objective function defined


In [None]:
# Create Optuna study
print("Creating Optuna study...\n")

study = optuna.create_study(
    direction='minimize',  # Minimize RMSE
    study_name='student_scores_optimization',
    sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE),
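    # Note: MedianPruner only acts on values passed to trial.report();
    # this objective reports none, so no trial is actually pruned
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=10)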
)

# Run optimization
print("Starting hyperparameter optimization...")
print("This may take a few minutes depending on n_trials.\n")

# Change n_trials based on your time budget:
# - 20-30 trials: Quick test (~5-10 minutes)
# - 50-100 trials: Good results (~15-30 minutes)
# - 200+ trials: Best results (longer)

N_TRIALS = 50  # Adjust this as needed

study.optimize(objective, n_trials=N_TRIALS, show_progress_bar=True)

print("\n" + "="*60)
print("OPTIMIZATION COMPLETE!")
print("="*60)

Creating Optuna study...

Starting hyperparameter optimization...
This may take a few minutes depending on n_trials.





OPTIMIZATION COMPLETE!


## 8. Optuna Results & Visualization

In [None]:
# Best trial results
print("\n" + "="*60)
print("BEST TRIAL RESULTS")
print("="*60)
print(f"Best RMSE: {study.best_value:.4f}")
print(f"Baseline RMSE: {baseline_score:.4f}")
print(f"Improvement: {baseline_score - study.best_value:.4f} ({(baseline_score - study.best_value)/baseline_score*100:.2f}%)")

print(f"\nBest hyperparameters:")
for key, value in study.best_params.items():
    print(f"  {key:20s}: {value}")

print(f"\nNumber of trials: {len(study.trials)}")
print(f"Best trial number: {study.best_trial.number}")

In [None]:
# Visualization 1: Optimization History
fig = plot_optimization_history(study)
fig.update_layout(
    title="Optimization History: How RMSE improved over trials",
    width=900,
    height=500
)
fig.show()

In [None]:
# Visualization 2: Parameter Importance
fig = plot_param_importances(study)
fig.update_layout(
    title="Hyperparameter Importance: Which parameters matter most?",
    width=900,
    height=600
)
fig.show()

In [None]:
# Visualization 3: Parallel Coordinate Plot
fig = plot_parallel_coordinate(study)
fig.update_layout(
    title="Parallel Coordinate Plot: Relationship between parameters and RMSE",
    width=1000,
    height=600
)
fig.show()

In [None]:
# Show top 10 trials
trials_df = study.trials_dataframe().sort_values('value').head(10)
print("\nTop 10 Trials:")
display(trials_df[['number', 'value', 'params_learning_rate', 'params_max_depth', 
                    'params_n_estimators', 'params_num_leaves']].style.background_gradient(cmap='RdYlGn_r', subset=['value']))

## 9. Train Final Model with Best Parameters

In [None]:
print("Training final model with best parameters on full training data...\n")

# Create final model with best parameters
final_model = lgb.LGBMRegressor(**study.best_params, random_state=RANDOM_STATE, verbose=-1)

# Train on full training data
final_model.fit(X_train_encoded, y_train)

# Sanity check only: X_val is part of the data final_model was just fit on,
# so these scores are in-sample and optimistic
final_val_preds = final_model.predict(X_val)
final_rmse = np.sqrt(mean_squared_error(y_val, final_val_preds))
final_mae = mean_absolute_error(y_val, final_val_preds)
final_r2 = r2_score(y_val, final_val_preds)

print("="*60)
print("FINAL MODEL PERFORMANCE")
print("="*60)
print(f"Validation RMSE: {final_rmse:.4f}")
print(f"Validation MAE:  {final_mae:.4f}")
print(f"Validation R²:   {final_r2:.4f}")
print("="*60)

print("\n✓ Final model trained successfully")
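
Because the final model is refit on all training rows, scoring it on `X_val` measures in-sample fit rather than generalization. The sketch below shows the effect on synthetic data, with a `DecisionTreeRegressor` as a deliberately overfitting stand-in for the tuned LightGBM model. All names in it are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 4))
y_toy = X_toy[:, 0] + rng.normal(scale=1.0, size=1000)  # signal plus noise

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Honest estimate: the validation rows are unseen during fitting.
split_model = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
held_out = rmse(y_va, split_model.predict(X_va))

# Refit on all rows, then score the same validation rows:
# they are now training data, so the error collapses.
refit_model = DecisionTreeRegressor(random_state=42).fit(X_toy, y_toy)
in_sample = rmse(y_va, refit_model.predict(X_va))

print(f"held-out RMSE: {held_out:.3f}, in-sample RMSE: {in_sample:.3f}")
```

The held-out figure from the tuning stage (`study.best_value`) is therefore the more trustworthy estimate of leaderboard performance.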

## 10. Feature Importance Analysis

In [None]:
# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X_train_encoded.columns,
    'importance': final_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 20 Most Important Features:")
display(feature_importance.head(20))

# Visualize top features
plt.figure(figsize=(12, 8))
top_features = feature_importance.head(20)
plt.barh(range(len(top_features)), top_features['importance'], color='steelblue')
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Importance', fontsize=12)
plt.title('Top 20 Feature Importances', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## 11. Generate Predictions & Create Submission

In [None]:
print("Generating predictions on test set...\n")

# Make predictions
test_predictions = final_model.predict(X_test_encoded)

print(f"✓ Predictions generated: {len(test_predictions)}")
print(f"\nPrediction statistics:")
print(f"  Min:    {test_predictions.min():.4f}")
print(f"  Max:    {test_predictions.max():.4f}")
print(f"  Mean:   {test_predictions.mean():.4f}")
print(f"  Median: {np.median(test_predictions):.4f}")
print(f"  Std:    {test_predictions.std():.4f}")

# Compare with training distribution
print(f"\nTraining target statistics:")
print(f"  Min:    {y_train.min():.4f}")
print(f"  Max:    {y_train.max():.4f}")
print(f"  Mean:   {y_train.mean():.4f}")
print(f"  Median: {y_train.median():.4f}")
print(f"  Std:    {y_train.std():.4f}")

In [None]:
# Visualize prediction distribution
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Histogram comparison
axes[0].hist(y_train, bins=50, alpha=0.7, label='Training Target', edgecolor='black', color='blue')
axes[0].hist(test_predictions, bins=50, alpha=0.7, label='Test Predictions', edgecolor='black', color='orange')
axes[0].set_xlabel('Exam Score', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Training vs Test Predictions Distribution', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Box plot comparison
data_to_plot = [y_train, test_predictions]
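axes[1].boxplot(data_to_plot)
# Set tick labels separately: boxplot's labels= argument was renamed to tick_labels= in Matplotlib 3.9
axes[1].set_xticklabels(['Training', 'Test Predictions'])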
axes[1].set_ylabel('Exam Score', fontsize=12)
axes[1].set_title('Distribution Comparison', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Create submission dataframe
submission = pd.DataFrame({
    'id': test_ids,
    'exam_score': test_predictions
})

print("Validating submission...")

# Validation checks
assert len(submission) == len(test), f"Expected {len(test)} rows, got {len(submission)}"
assert list(submission.columns) == ['id', 'exam_score'], "Wrong column names"
assert submission['exam_score'].notna().all(), "Has missing predictions"
assert submission['id'].nunique() == len(submission), "Has duplicate IDs"
assert (submission['exam_score'] >= 0).all(), "Has negative predictions"
assert (submission['exam_score'] <= 120).all(), "Has predictions > 120"

print("\n✓ All validation checks passed!")

# Save submission
submission.to_csv('submission_optuna.csv', index=False)

print(f"\n{'='*60}")
print("SUBMISSION FILE CREATED!")
print(f"{'='*60}")
print(f"Filename: submission_optuna.csv")
print(f"Shape: {submission.shape}")
print(f"In-memory size: ~{submission.memory_usage(deep=True).sum() / 1024:.1f} KB")
print(f"\nFirst 10 rows:")
display(submission.head(10))

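print(f"\n🎯 Ready to submit to Kaggle!")
# Use the held-out score from tuning; final_rmse is in-sample because final_model was fit on X_val too
print(f"Expected leaderboard RMSE: ~{study.best_value:.4f} (held-out validation estimate; may vary)")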

## 12. Summary & Next Steps

In [None]:
print("="*60)
print("NOTEBOOK SUMMARY")
print("="*60)

print(f"\n✓ Data loaded: {len(train):,} training samples, {len(test):,} test samples")
print(f"✓ Features engineered: {X_train_encoded.shape[1]} total features")
print(f"✓ Categorical encoding: Fixed and aligned between train/test")
print(f"✓ Baseline RMSE: {baseline_score:.4f}")
print(f"✓ Optuna trials: {len(study.trials)}")
print(f"✓ Best RMSE: {study.best_value:.4f}")
print(f"✓ Improvement: {(baseline_score - study.best_value)/baseline_score*100:.2f}%")
print(f"✓ Submission file: submission_optuna.csv")

print(f"\n{'='*60}")
print("NEXT STEPS TO IMPROVE YOUR SCORE")
print("="*60)

print("""
1. Run more Optuna trials (100-200) for better optimization
2. Try different feature engineering:
   - More interaction features
   - Different binning strategies
   - Log/sqrt transformations

3. Ensemble methods:
   - Train multiple models and average predictions
   - Try XGBoost or CatBoost alongside LightGBM
   - Stacking different model types

4. Cross-validation:
   - Use 5-fold or 10-fold CV in Optuna objective
   - More robust evaluation

5. Advanced techniques:
   - Target encoding for categorical features
   - Pseudo-labeling (use confident test predictions as extra training data)
   - Outlier handling
""")

print("\nGood luck on the leaderboard! 🚀")

---

## Optional: Save Best Model & Study

In [None]:
# Save the trained model
import joblib

joblib.dump(final_model, 'best_model_optuna.pkl')
print("✓ Model saved to: best_model_optuna.pkl")

# Save Optuna study for later analysis
joblib.dump(study, 'optuna_study.pkl')
print("✓ Optuna study saved to: optuna_study.pkl")

# Save best parameters as JSON
import json

with open('best_params.json', 'w') as f:
    json.dump(study.best_params, f, indent=2)
print("✓ Best parameters saved to: best_params.json")
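
To reuse the saved parameters later, the JSON file can be read back into a plain dict. The sketch below shows the round trip with hypothetical parameter values written to a temporary directory, so it does not touch the real `best_params.json`.

```python
import json
import os
import tempfile

# Hypothetical parameter values, for illustration only.
example_params = {'n_estimators': 300, 'learning_rate': 0.05, 'max_depth': 8}

# Write to a temporary location rather than the notebook's output file.
path = os.path.join(tempfile.mkdtemp(), 'best_params.json')
with open(path, 'w') as f:
    json.dump(example_params, f, indent=2)

with open(path) as f:
    loaded_params = json.load(f)

# Integers stay integers and floats stay floats, so the dict can be passed
# back as keyword arguments, e.g. lgb.LGBMRegressor(**loaded_params).
print(loaded_params)
```

The pickled model and study can be restored in the same spirit with `joblib.load`, ideally under the same library versions that wrote them.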