# 🛠️ Chapter 14 Exercise: Feature Optimization

In 3D semantic segmentation, the choice of per-point features strongly affects results. Noisy or redundant features can lower accuracy and lengthen training.

**Your Task:**
1.  Train a Random Forest classifier using **all** available features.
2.  Compute **Feature Importance**.
3.  Select the **Top 5** most important features.
4.  Retrain the classifier using *only* these 5 features.
5.  Compare the test accuracy and training time of the full model vs. the optimized model.

**Question:** Did accuracy drop significantly? Did training time improve?
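
As a warm-up for steps 2 and 3, here is a minimal sketch of ranking features by a fitted forest's `feature_importances_` attribute. It runs on synthetic data from `make_classification`, not on the urban point cloud, and the names `X_demo`, `y_demo`, `feature_names`, and `feat_i` are placeholders made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for per-point features (assumption: 12 features, 4 informative).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=12, n_informative=4, n_redundant=2,
    n_classes=3, random_state=42,
)
feature_names = [f"feat_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

clf_demo = RandomForestClassifier(n_estimators=30, random_state=42)
clf_demo.fit(X_demo, y_demo)

# Impurity-based importances: one non-negative value per feature, summing to 1.
importances = clf_demo.feature_importances_

# argsort is ascending, so reverse it to rank from most to least important.
order = np.argsort(importances)[::-1]
top_k = [feature_names[i] for i in order[:5]]

for i in order[:5]:
    print(f"{feature_names[i]}: {importances[i]:.3f}")
```

In the exercise, the same attribute is read from the classifier fitted on the training split, so the ranking never sees the test data.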

In [None]:
import numpy as np
import pandas as pd
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
# 1. Load Data & Prepare
data_path = "../DATA/3DML_urban_point_cloud.xyz"
try:
    df = pd.read_csv(data_path, delimiter=' ')
    df.dropna(inplace=True)
    
    # Define all possible feature columns (excluding XYZ and Labels)
    all_features = [c for c in df.columns if c not in ['X', 'Y', 'Z', 'Classification', 'Label']]
    X = df[all_features]
    y = df['Classification']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    print(f"Training on {len(X_train)} samples with {len(all_features)} features.")
    
    # 2. Train Full Model
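    clf_full = RandomForestClassifier(n_estimators=30, random_state=42, n_jobs=-1)
    # Time only the fit call; perf_counter is a monotonic clock meant for measuring intervals
    start = time.perf_counter()
    clf_full.fit(X_train, y_train)
    full_time = time.perf_counter() - start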
    full_acc = accuracy_score(y_test, clf_full.predict(X_test))
    
    print(f"Full Model: Accuracy = {full_acc:.4f}, Time = {full_time:.2f}s")
    
    # 3. Feature Importance
    # TODO: Get feature importances
    # importances = ...
    # indices = np.argsort(importances)[::-1]
    # top_5_indices = indices[:5]
    # top_5_features = [all_features[i] for i in top_5_indices]
    # print("Top 5 Features:", top_5_features)
    
    # 4. Train Optimized Model
    # TODO: Subset X_train and X_test to top_5_features
    # TODO: Train new CLF
    # TODO: Calculate new Accuracy & Time
    
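except (FileNotFoundError, KeyError, pd.errors.ParserError) as e:
    # Missing file, missing 'Classification' column, or unparseable rows;
    # any other error propagates with its full traceback.
    print(f"Data issue: {e}")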
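
For step 5, one way to keep the comparison fair is to route both models through the same timing and scoring code. The sketch below assumes a hypothetical helper, `fit_and_score`, and uses synthetic data with an arbitrary column subset purely to show the mechanics. In the exercise, the subset would be the top 5 features you selected.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_and_score(X_tr, X_te, y_tr, y_te, n_estimators=30, seed=42):
    """Hypothetical helper: fit a forest and return (test accuracy, fit seconds)."""
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed, n_jobs=-1)
    start = time.perf_counter()
    clf.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    acc = accuracy_score(y_te, clf.predict(X_te))
    return acc, elapsed


# Synthetic stand-in data (assumption: 20 features, 5 informative).
X_syn, y_syn = make_classification(
    n_samples=3000, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

acc_all, t_all = fit_and_score(X_tr, X_te, y_tr, y_te)

# Arbitrary 5-column subset, for illustration only (not ranked by importance).
cols = [0, 1, 2, 3, 4]
acc_sub, t_sub = fit_and_score(X_tr[:, cols], X_te[:, cols], y_tr, y_te)

print(f"All features : accuracy = {acc_all:.4f}, fit time = {t_all:.2f}s")
print(f"5 features   : accuracy = {acc_sub:.4f}, fit time = {t_sub:.2f}s")
```

A single timed run can be noisy, especially with `n_jobs=-1`, so repeating each fit a few times and comparing the median is a reasonable refinement.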