# Brain Tumor Classification - Cleaned Version

**Current Best:** v7 (Pseudo-label) = **0.89293**

**Top 3 Strategies:**
1. v7: Pseudo-labeling (98% confidence) - 0.89293 ✅
2. v16: Neural Network (deep learning) - TBD
3. v14: Extreme Ensemble - 0.890000

---

## 1. Setup & Data Loading

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.neural_network import MLPClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings('ignore')

# Load data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
test_ids = test_df['id']

print(f"Training: {train_df.shape}")
print(f"Test: {test_df.shape}")
print(f"\nTarget distribution:")
print(train_df['cancer_stage'].value_counts(normalize=True).sort_index())

Training: (7000, 20)
Test: (3000, 19)

Target distribution:
cancer_stage
I      0.035714
II     0.068714
III    0.219143
IV     0.676429
Name: proportion, dtype: float64
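
The distribution above is heavily skewed toward stage IV. As a minimal sketch (not part of the original pipeline), the snippet below turns the printed proportions into inverse-frequency class weights using the common `1 / (n_classes * proportion)` heuristic. The names `class_weights` and `majority_baseline` are illustrative only.

```python
import numpy as np

# Class proportions copied from the target distribution printed above.
stages = ["I", "II", "III", "IV"]
proportions = np.array([0.035714, 0.068714, 0.219143, 0.676429])

# "Balanced" heuristic: weight = 1 / (n_classes * proportion),
# so rare stages receive proportionally larger weights.
weights = 1.0 / (len(stages) * proportions)
class_weights = {s: round(float(w), 3) for s, w in zip(stages, weights)}

# Accuracy reached by always predicting the most common stage.
majority_baseline = float(proportions.max())

print(class_weights)
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Weights like these could be passed to a classifier's class-weight option (for example CatBoost's `class_weights` parameter) if recall on the rare stages matters. Whether that helps the leaderboard score here is untested.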


## 2. Feature Engineering (Proven Features Only)

In [2]:
def engineer_features(df):
    """Add only proven medical domain features"""
    df = df.copy()
    
    # Core aggressiveness score
    df['aggressiveness_score'] = df['ki67_index'] * 0.5 + df['mitotic_count'] * 2.5
    
    # Risk score
    df['risk_score'] = (
        df['necrosis'] * 3 + 
        df['hemorrhage'] * 2 + 
        df['edema'] * 1
    )
    
    # Binned categories. A bare .astype(int) on the pd.cut result failed with
    # "ValueError: Cannot convert float NaN to integer": pd.cut returns NaN for
    # values outside the bins, and its default (0, 40]-style bins exclude an exact 0.
    # include_lowest=True closes the first bin on the left; anything still
    # unbinned (missing or out of range) is coded as -1.

    # Age group
    df['age_group'] = pd.cut(df['age'], bins=[0, 40, 60, 100], labels=False, include_lowest=True).fillna(-1).astype(int)
    
    # Ki67 category
    df['ki67_category'] = pd.cut(df['ki67_index'], bins=[0, 10, 20, 100], labels=False, include_lowest=True).fillna(-1).astype(int)
    
    # Mitotic category
    df['mitotic_category'] = pd.cut(df['mitotic_count'], bins=[0, 5, 15, 100], labels=False, include_lowest=True).fillna(-1).astype(int)
    
    # Symptoms severity
    df['symptoms_severity'] = df['neurological_deficit'] + df['seizures'] + df['headache']
    
    # KPS category
    df['kps_category'] = pd.cut(df['kps_score'], bins=[0, 60, 80, 100], labels=False, include_lowest=True).fillna(-1).astype(int)
    
    # Tumor complexity
    df['tumor_complexity'] = (
        df['calcification'] + 
        df['cystic_components'] + 
        df['necrosis']
    )
    
    # Interactions
    df['ki67_mitotic_interaction'] = df['ki67_index'] * df['mitotic_count'] / 100
    df['age_ki67_interaction'] = df['age'] * df['ki67_index'] / 100
    
    return df

# Apply feature engineering
train_df_engineered = engineer_features(train_df)
test_df_engineered = engineer_features(test_df)

print(f"✅ Features engineered: {train_df_engineered.shape[1]} total columns")
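
The binned features in `engineer_features` rely on `pd.cut`, whose intervals are open on the left by default. A value equal to the lowest edge, or a missing value, therefore comes back as NaN, and a plain integer cast fails. The following is a minimal sketch on made-up values (the `values` series is illustrative, not taken from the dataset) showing the default behaviour and one possible guard.

```python
import numpy as np
import pandas as pd

# Toy values: an exact zero, two in-range values, a high value, and a missing entry.
values = pd.Series([0.0, 5.0, 10.0, 25.0, np.nan])
bins = [0, 10, 20, 100]

# Default: bins are (0, 10], (10, 20], (20, 100], so 0.0 falls outside every bin.
default_codes = pd.cut(values, bins=bins, labels=False)

# include_lowest=True makes the first bin [0, 10], so 0.0 is kept.
inclusive_codes = pd.cut(values, bins=bins, labels=False, include_lowest=True)

# Anything still unbinned (here only the NaN) gets a sentinel before the int cast.
safe_codes = inclusive_codes.fillna(-1).astype(int)

print(default_codes.isna().tolist())
print(safe_codes.tolist())
```

Using `-1` as the sentinel is a design choice for tree models, which can split it off as its own category. Imputing the raw column before binning would be an alternative.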

## 3. Data Preparation

In [None]:
# Prepare features and target
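# 'id' is an identifier, not a feature. Drop it from both frames so train and test
# share the same columns (assumes train.csv also carries 'id', as its shape suggests).
X = train_df_engineered.drop(columns=['id', 'cancer_stage'], errors='ignore')
y = train_df_engineered['cancer_stage']

X_test = test_df_engineered.drop(columns=['id'], errors='ignore')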

# Label encode target
target_encoder = LabelEncoder()
y_encoded = target_encoder.fit_transform(y)

# Encode categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
label_encoders = {}

for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    X_test[col] = le.transform(X_test[col])
    label_encoders[col] = le

# Split data
X_train, X_val, y_train, y_val = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)

print(f"Train: {X_train.shape}, Val: {X_val.shape}")
print(f"Test: {X_test.shape}")
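
The split above passes `stratify=y_encoded`, which matters with a target this imbalanced. As a rough sketch on synthetic labels (the counts are the ones implied by the printed proportions on 7000 rows, and `X_demo` is a placeholder feature), this checks that a stratified 80/20 split keeps the class shares of the validation fold close to the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic encoded labels with the class counts implied by the training set.
counts = {0: 250, 1: 481, 2: 1534, 3: 4735}
y_demo = np.repeat(list(counts.keys()), list(counts.values()))
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Same split settings as in the notebook cell.
_, _, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare class shares in the full data and in the validation fold.
full_share = np.bincount(y_demo, minlength=4) / len(y_demo)
val_share = np.bincount(y_va, minlength=4) / len(y_va)

print(np.round(full_share, 4))
print(np.round(val_share, 4))
```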

## 4. Model Training - Base CatBoost

In [None]:
# Best CatBoost parameters (from previous tuning)
best_catboost_params = {
    'random_strength': 1,
    'learning_rate': 0.07,
    'l2_leaf_reg': 5,
    'iterations': 1000,
    'depth': 4,
    'border_count': 32,
    'bagging_temperature': 0.5,
    'random_state': 42,
    'verbose': 0,
    'task_type': 'CPU'
}

# Train base model
catboost_model = CatBoostClassifier(**best_catboost_params)
catboost_model.fit(X_train, y_train)

# Evaluate
val_pred = catboost_model.predict(X_val)
val_f1 = f1_score(y_val, val_pred, average='weighted')
print(f"CatBoost Validation F1: {val_f1:.5f}")
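
The validation metric above is the weighted F1, which averages per-class F1 scores by class frequency. With roughly two thirds of the samples in stage IV, that average can look reasonable even when rare stages are ignored. The sketch below is an illustration on synthetic labels, not a result from the notebook. It scores a dummy model that always predicts stage IV and compares weighted and macro averaging. `per_class_f1` is a hypothetical helper written for this example.

```python
import numpy as np

def per_class_f1(y_true, y_pred, n_classes):
    """Return the F1 score of each class as a NumPy array."""
    scores = np.zeros(n_classes)
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom > 0 else 0.0
    return scores

# Synthetic labels with the training-set class counts.
# The dummy "model" always predicts stage IV (encoded as 3).
y_true = np.repeat([0, 1, 2, 3], [250, 481, 1534, 4735])
y_pred = np.full_like(y_true, 3)

f1 = per_class_f1(y_true, y_pred, n_classes=4)
support = np.bincount(y_true, minlength=4) / len(y_true)

weighted_f1 = float(np.sum(f1 * support))  # frequency-weighted mean, as in average='weighted'
macro_f1 = float(np.mean(f1))              # unweighted mean, as in average='macro'

print(f"weighted F1: {weighted_f1:.4f}, macro F1: {macro_f1:.4f}")
```

Reporting macro F1 or per-class F1 next to the weighted score would make it easier to see whether a model is actually learning stages I and II.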

---

## ⭐ STRATEGY #1: Pseudo-Labeling (v7) - BEST: 0.89293

In [None]:
print("=" * 70)
print("STRATEGY #1: PSEUDO-LABELING (98% CONFIDENCE)")
print("=" * 70)

# Train on full training data
pseudo_base_model = CatBoostClassifier(**best_catboost_params)
pseudo_base_model.fit(X, y_encoded)

# Get predictions on test set with confidence
test_proba = pseudo_base_model.predict_proba(X_test)
test_pred = np.argmax(test_proba, axis=1)
test_confidence = np.max(test_proba, axis=1)

# Select high-confidence predictions (≥98%)
confidence_threshold = 0.98
high_conf_mask = test_confidence >= confidence_threshold
high_conf_indices = np.where(high_conf_mask)[0]

print(f"\n📊 High-confidence samples at 98%: {len(high_conf_indices)} ({len(high_conf_indices)/len(test_pred)*100:.1f}%)")

if len(high_conf_indices) > 0:
    # Get pseudo-labeled samples
    X_pseudo = X_test.iloc[high_conf_indices].copy()
    y_pseudo = test_pred[high_conf_indices]
    
    # Combine original + pseudo-labeled data
    X_combined = pd.concat([X, X_pseudo], axis=0, ignore_index=True)
    y_combined = np.concatenate([y_encoded, y_pseudo])
    
    print(f"   Combined training: {len(X_combined)} samples (+{len(X_pseudo)/len(X)*100:.1f}%)")
    
    # Retrain model
    pseudo_model = CatBoostClassifier(**best_catboost_params)
    pseudo_model.fit(X_combined, y_combined)
    
    # Make final predictions
    pseudo_final_pred = pseudo_model.predict(X_test)
    pseudo_final_predictions = target_encoder.inverse_transform(pseudo_final_pred)
    
    # Save submission
    submission_v7 = pd.DataFrame({
        'id': test_ids,
        'cancer_stage': pseudo_final_predictions
    })
    submission_v7.to_csv('subChromium_v7_pseudo_label.csv', index=False)
    
    print(f"\n✅ v7 Submission created: subChromium_v7_pseudo_label.csv")
    print(f"🏆 Kaggle Score: 0.89293 (CURRENT BEST)")
else:
    print("⚠️  No high-confidence predictions found")
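
The selection step in the cell above keeps only test rows whose top class probability reaches the threshold. A small sketch on a made-up probability matrix shows how the threshold changes which rows become pseudo-labels. `select_confident` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np

def select_confident(proba, threshold):
    """Return (row indices, predicted labels) for rows with max probability >= threshold."""
    confidence = proba.max(axis=1)
    labels = proba.argmax(axis=1)
    keep = np.where(confidence >= threshold)[0]
    return keep, labels[keep]

# Toy class-probability matrix: 4 samples x 4 classes (values are made up).
proba = np.array([
    [0.990, 0.005, 0.003, 0.002],
    [0.400, 0.300, 0.200, 0.100],
    [0.010, 0.005, 0.005, 0.980],
    [0.020, 0.030, 0.900, 0.050],
])

for threshold in (0.90, 0.98):
    rows, labels = select_confident(proba, threshold)
    print(f"threshold={threshold:.2f}: rows {rows.tolist()}, labels {labels.tolist()}")
```

A stricter threshold adds fewer rows but with less label noise. With a majority class near 68%, it is worth checking which classes the selected pseudo-labels fall into, since they may be dominated by stage IV.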

---

## ⭐ STRATEGY #2: Deep Neural Network (v16) - Most Different!

In [None]:
print("=" * 70)
print("STRATEGY #2: ADVANCED NEURAL NETWORK")
print("=" * 70)

# Scale features for neural network
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(X_test)

# Train deep neural network
nn_model = MLPClassifier(
    hidden_layer_sizes=(256, 128, 64, 32),
    activation='relu',
    solver='adam',
    alpha=0.0001,
    batch_size=64,
    learning_rate='adaptive',
    learning_rate_init=0.001,
    max_iter=500,
    early_stopping=True,
    validation_fraction=0.15,
    n_iter_no_change=30,
    random_state=42,
    verbose=True
)

print("\n🔄 Training neural network (3-5 minutes)...\n")
nn_model.fit(X_scaled, y_encoded)

# Make predictions
nn_pred = nn_model.predict(X_test_scaled)
nn_predictions = target_encoder.inverse_transform(nn_pred)

# Save submission
submission_v16 = pd.DataFrame({
    'id': test_ids,
    'cancer_stage': nn_predictions
})
submission_v16.to_csv('subChromium_v16_neural_network.csv', index=False)

# Compare with v7
v7_sub = pd.read_csv('subChromium_v7_pseudo_label.csv')
differences = (v7_sub['cancer_stage'] != nn_predictions).sum()

print(f"\n✅ v16 Submission created: subChromium_v16_neural_network.csv")
print(f"📊 Changes from v7: {differences} predictions ({differences/len(v7_sub)*100:.1f}%)")
print(f"🎯 Expected: 0.890-0.905 (completely different learning!)")
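
The neural network cell scales features with `StandardScaler`, fitting on the training matrix and reusing the same statistics for the test matrix. A minimal NumPy sketch of that idea on synthetic data (the arrays below are random placeholders, not the notebook's features):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic "training" features
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))    # synthetic "test" features

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

# The same shift and scale are applied to both matrices.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3))
```

The scaled test columns will not have exactly zero mean, which is expected: the test set is transformed with training statistics, not its own.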

---

## ⭐ STRATEGY #3: Extreme Weighted Ensemble (v14)

In [None]:
print("=" * 70)
print("STRATEGY #3: EXTREME WEIGHTED ENSEMBLE")
print("=" * 70)

# Train additional models for ensemble
print("\n🔄 Training ensemble models...")

# XGBoost
xgb_model = XGBClassifier(
    n_estimators=1000,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='mlogloss',
    tree_method='hist'
)
xgb_model.fit(X, y_encoded)

# LightGBM
lgb_model = LGBMClassifier(
    n_estimators=1000,
    max_depth=4,
    learning_rate=0.05,
    random_state=42,
    verbose=-1
)
lgb_model.fit(X, y_encoded)

# Blending model (from v8)
X_train1, X_train2, y_train1, y_train2 = train_test_split(
    X, y_encoded, test_size=0.30, random_state=42, stratify=y_encoded
)

blend_cat = CatBoostClassifier(**best_catboost_params)
blend_cat.fit(X_train1, y_train1)

blend_xgb = XGBClassifier(n_estimators=1000, max_depth=4, learning_rate=0.05, random_state=42, eval_metric='mlogloss', tree_method='hist')
blend_xgb.fit(X_train1, y_train1)

blend_lgb = LGBMClassifier(n_estimators=1000, max_depth=4, learning_rate=0.05, random_state=42, verbose=-1)
blend_lgb.fit(X_train1, y_train1)

# Train meta-learner
train2_cat_proba = blend_cat.predict_proba(X_train2)
train2_xgb_proba = blend_xgb.predict_proba(X_train2)
train2_lgb_proba = blend_lgb.predict_proba(X_train2)
train2_meta = np.hstack([train2_cat_proba, train2_xgb_proba, train2_lgb_proba])

blend_meta = LogisticRegression(max_iter=1000, random_state=42, C=0.1)
blend_meta.fit(train2_meta, y_train2)

# Retrain on full data
blend_cat_full = CatBoostClassifier(**best_catboost_params)
blend_cat_full.fit(X, y_encoded)

blend_xgb_full = XGBClassifier(n_estimators=1000, max_depth=4, learning_rate=0.05, random_state=42, eval_metric='mlogloss', tree_method='hist')
blend_xgb_full.fit(X, y_encoded)

blend_lgb_full = LGBMClassifier(n_estimators=1000, max_depth=4, learning_rate=0.05, random_state=42, verbose=-1)
blend_lgb_full.fit(X, y_encoded)

# Get test probabilities
test_cat_proba = blend_cat_full.predict_proba(X_test)
test_xgb_proba = blend_xgb_full.predict_proba(X_test)
test_lgb_proba = blend_lgb_full.predict_proba(X_test)
test_meta_features = np.hstack([test_cat_proba, test_xgb_proba, test_lgb_proba])
blend_proba = blend_meta.predict_proba(test_meta_features)

# Get v7 probabilities
v7_proba = pseudo_model.predict_proba(X_test)

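# EXTREME weighted ensemble of four probability sources.
# Note: the weights are not equal and sum to 0.95, not 1. argmax ignores the overall
# scale, so only the relative weights matter (the blend counts slightly less).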
ensemble_proba = (
    0.25 * v7_proba +
    0.25 * test_cat_proba +
    0.25 * test_xgb_proba +
    0.20 * blend_proba
)

ensemble_pred = np.argmax(ensemble_proba, axis=1)
ensemble_predictions = target_encoder.inverse_transform(ensemble_pred)

# Save submission
submission_v14 = pd.DataFrame({
    'id': test_ids,
    'cancer_stage': ensemble_predictions
})
submission_v14.to_csv('subChromium_v14_extreme_ensemble.csv', index=False)

# Compare with v7
differences = (v7_sub['cancer_stage'] != ensemble_predictions).sum()

print(f"\n✅ v14 Submission created: subChromium_v14_extreme_ensemble.csv")
print(f"📊 Changes from v7: {differences} predictions ({differences/len(v7_sub)*100:.1f}%)")
print(f"🏆 Kaggle Score: 0.890000")
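
The ensemble in the cell above combines class-probability matrices with the fixed weights 0.25, 0.25, 0.25 and 0.20. As a sketch on random probabilities (the `sources` below are synthetic stand-ins for the four model outputs), this checks that rescaling the weights to sum to 1 leaves the predicted classes unchanged while making the blended rows valid probability distributions.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_proba(n_samples, n_classes):
    """Generate rows of random class probabilities that each sum to 1."""
    raw = rng.random((n_samples, n_classes))
    return raw / raw.sum(axis=1, keepdims=True)

# Four synthetic probability sources: 6 samples x 4 classes each.
sources = [random_proba(6, 4) for _ in range(4)]
weights = np.array([0.25, 0.25, 0.25, 0.20])  # weights used in the notebook cell

# Blend with the raw weights and with weights rescaled to sum to 1.
blend_raw = sum(w * p for w, p in zip(weights, sources))
blend_norm = sum(w * p for w, p in zip(weights / weights.sum(), sources))

print(blend_raw.sum(axis=1).round(3))   # each row sums to the total weight
print(blend_norm.sum(axis=1).round(3))  # each row sums to 1
print(blend_raw.argmax(axis=1), blend_norm.argmax(axis=1))
```

Normalizing matters only if the blended probabilities are reused later, for example for confidence thresholds. The submitted labels depend on the argmax alone.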

---

## 📊 Summary & Next Steps

### **Current Rankings:**

| Version | Strategy | Kaggle Score | Status |
|---------|----------|--------------|--------|
| **v7** | **Pseudo-label (98%)** | **0.89293** | **✅ BEST** |
| v16 | Neural Network | TBD | ⏳ Submit next! |
| v14 | Extreme Ensemble | 0.890000 | ✅ Tested |

### **Action Plan:**

1. **Submit v16** (Neural Network) - Most different from v7 (10.4% change)
2. **If v16 works:** Ensemble v7 + v16 for potential 0.895-0.905
3. **If stuck:** Check competition discussions for winning techniques
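
The action plan ranks submissions by how much they differ from v7. A small sketch of that comparison, assuming two submission frames with the `id` and `cancer_stage` columns used in this notebook (the toy rows and the `disagreement_rate` helper are illustrative):

```python
import pandas as pd

def disagreement_rate(sub_a, sub_b, id_col="id", label_col="cancer_stage"):
    """Fraction of shared ids on which two submissions predict different labels."""
    merged = sub_a.merge(sub_b, on=id_col, suffixes=("_a", "_b"))
    differs = merged[f"{label_col}_a"] != merged[f"{label_col}_b"]
    return float(differs.mean())

# Toy submissions; the second one is deliberately stored in a different row order.
sub_a = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                      "cancer_stage": ["IV", "IV", "III", "II", "IV"]})
sub_b = pd.DataFrame({"id": [5, 4, 3, 2, 1],
                      "cancer_stage": ["IV", "III", "III", "IV", "I"]})

rate = disagreement_rate(sub_a, sub_b)
print(f"Predictions that differ: {rate:.1%}")
```

Merging on `id` instead of comparing rows by position guards against files saved in different orders. A high disagreement rate only says the models differ, not which one is right.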

### **Key Learnings:**

- ✅ Pseudo-labeling with high confidence (98%) works well
- ✅ Neural networks provide different predictions than tree models
- ✅ Simple approaches often beat complex ones
- ❌ More features/data doesn't always help (noise vs signal)
- ❌ Complex ensembles can overfit

---