# 🔧 Feature Engineering

Learn how to generate features automatically with `FeatureFactory`, then transform, scale, encode, and select them.

## Topics Covered
1. Automatic feature generation
2. Numeric transformations
3. Feature scaling
4. Categorical encoding
5. Feature selection

**Time Required**: ~20 minutes

In [None]:
import sys
sys.path.insert(0, '../../')

from data_science_master_system import (
    DataLoader, FeatureFactory, FeatureSelector,
    StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

print("✅ Ready!")

In [None]:
# Load sample data
loader = DataLoader()
df = loader.read('../data/csv/house_prices.csv')
print(f"Dataset: {df.shape}")
df.head()

## 1. Automatic Feature Generation

FeatureFactory can automatically generate features based on data types.

In [None]:
# Initialize FeatureFactory
factory = FeatureFactory()

# Prepare data (drop ID column)
df_features = df.drop(columns=['house_id'])

# Auto-generate features
df_generated = factory.auto_generate(
    df_features,
    target='price',
    max_features=50,
    include_interactions=True,
    include_polynomials=True
)

print("\n📊 Feature Generation Results:")
print(f"  • Original features: {len(df_features.columns)}")
print(f"  • Features after generation: {len(df_generated.columns)}")
print(f"\nNew feature names:")
new_cols = [c for c in df_generated.columns if c not in df_features.columns]
print(new_cols[:20])
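
The internals of `auto_generate` are not shown in this notebook. As a rough illustration of what "interactions" and "polynomials" usually mean, here is a minimal plain-pandas sketch. The helper name `add_interactions_and_squares`, the column naming scheme, and the toy data are assumptions made for this example and are not part of the library.

```python
import itertools
import pandas as pd

def add_interactions_and_squares(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with pairwise products and squared terms added.

    Hypothetical helper for illustration; not FeatureFactory's implementation.
    """
    out = frame.copy()
    # Interaction terms: one product column per pair of input columns
    for a, b in itertools.combinations(frame.columns, 2):
        out[f"{a}_x_{b}"] = frame[a] * frame[b]
    # Degree-2 polynomial terms: one squared column per input column
    for col in frame.columns:
        out[f"{col}_squared"] = frame[col] ** 2
    return out

# Toy data (made up for the example)
toy = pd.DataFrame({"sqft_living": [1000, 1500, 2000], "bedrooms": [2, 3, 4]})
expanded = add_interactions_and_squares(toy)
print(expanded.columns.tolist())
```

The number of interaction columns grows quadratically with the number of inputs, which is one reason a cap such as `max_features` is useful.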

## 2. Numeric Feature Transformations

In [None]:
# Get numeric columns
numeric_df = df_features.select_dtypes(include=[np.number])

# Generate numeric features
numeric_features = factory.generate_numeric_features(
    numeric_df[['sqft_living', 'sqft_lot', 'bedrooms']],
    include_log=True,
    include_sqrt=True,
    include_reciprocal=True
)

print("Generated numeric features:")
display(numeric_features.head())

In [None]:
# Visualize transformations
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Original
df['sqft_living'].hist(bins=30, ax=axes[0], color='steelblue')
axes[0].set_title('Original sqft_living')

# Log transform
np.log(df['sqft_living']).hist(bins=30, ax=axes[1], color='green')
axes[1].set_title('Log Transform')

# Sqrt transform
np.sqrt(df['sqft_living']).hist(bins=30, ax=axes[2], color='coral')
axes[2].set_title('Square Root Transform')

plt.tight_layout()
plt.show()
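
The histograms show the effect visually. To quantify it, one option is to compare skewness before and after each transform. The sketch below uses synthetic log-normal values as a stand-in for a right-skewed column such as `sqft_living`; the data and parameters are assumptions, not values from the housing dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed values (assumed stand-in for a column like sqft_living)
values = pd.Series(rng.lognormal(mean=7.5, sigma=0.5, size=5000))

# Sample skewness of the raw values and of each transformed version
skewness = pd.Series({
    "original": values.skew(),
    "log": np.log(values).skew(),
    "sqrt": np.sqrt(values).skew(),
})
print(skewness.round(3))
```

A skewness near zero indicates a roughly symmetric distribution. For strictly positive, right-skewed data the log transform usually reduces skew the most, with the square root in between.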

## 3. Feature Scaling

In [None]:
# Select numeric columns
X = df[['sqft_living', 'bedrooms', 'bathrooms', 'grade']].copy()

# StandardScaler (zero mean, unit variance)
standard_scaler = StandardScaler()
X_standard = standard_scaler.fit_transform(X)

# MinMaxScaler (0-1 range)
minmax_scaler = MinMaxScaler()
X_minmax = minmax_scaler.fit_transform(X)

print("Scaling Comparison:")
comparison = pd.DataFrame({
    'Original_mean': X.mean().round(2),
    'Original_std': X.std().round(2),
    'Standard_mean': X_standard.mean().round(4),
    'Standard_std': X_standard.std().round(4),
    'MinMax_min': X_minmax.min().round(4),
    'MinMax_max': X_minmax.max().round(4),
})
display(comparison)
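
The two scalers correspond to simple formulas: standardization computes `(x - mean) / std`, and min-max scaling computes `(x - min) / (max - min)`. The NumPy sketch below applies both to made-up values. It assumes the population standard deviation (`ddof=0`), which is scikit-learn's convention; whether this library's `StandardScaler` does the same is not confirmed here.

```python
import numpy as np

# Made-up values for illustration
x = np.array([1200.0, 1800.0, 2400.0, 3000.0])

# Standardization: subtract the mean, divide by the standard deviation.
# ddof=0 (population std) is an assumption, matching scikit-learn's convention.
z = (x - x.mean()) / x.std(ddof=0)

# Min-max scaling: map the smallest value to 0 and the largest to 1.
m = (x - x.min()) / (x.max() - x.min())

print(z.round(4))
print(m.round(4))
```

Note that pandas `.std()` defaults to the sample standard deviation (`ddof=1`), so a column standardized with `ddof=0` reports a pandas standard deviation slightly above 1.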

## 4. Categorical Encoding

In [None]:
# Load data with categorical columns
df_churn = loader.read('../data/csv/customer_churn.csv')

# Get categorical columns
cat_cols = ['gender', 'contract_type', 'payment_method']
df_cat = df_churn[cat_cols]

print("Categorical columns:")
display(df_cat.head())

In [None]:
# Label Encoding (single column)
label_encoder = LabelEncoder()
gender_encoded = label_encoder.fit_transform(df_cat['gender'])
print("Label Encoded gender:")
print(pd.DataFrame({'Original': df_cat['gender'].head(), 'Encoded': gender_encoded[:5]}))

In [None]:
# One-Hot Encoding
onehot_encoder = OneHotEncoder()
contract_encoded = onehot_encoder.fit_transform(df_cat[['contract_type']])

print("\nOne-Hot Encoded contract_type:")
display(contract_encoded.head())
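
For comparison, the same two encodings can be sketched with plain pandas. This is not how the library's `LabelEncoder` or `OneHotEncoder` is implemented, only an equivalent illustration; the category values below are invented for the example and may not match `customer_churn.csv`.

```python
import pandas as pd

# Invented category values for illustration
contracts = pd.Series(
    ["monthly", "yearly", "monthly", "two_year"], name="contract_type"
)

# Label encoding: each distinct category becomes an integer code.
# sort=True assigns codes in sorted order of the category labels.
codes, categories = pd.factorize(contracts, sort=True)

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(contracts, prefix="contract_type", dtype=int)

print(dict(zip(categories, range(len(categories)))))
print(one_hot)
```

Label encoding imposes an arbitrary order on the categories, so it suits tree-based models or ordinal data. One-hot encoding avoids that implied order at the cost of one column per category.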

## 5. Feature Selection

In [None]:
# Prepare features and target
X = df.drop(columns=['house_id', 'price']).select_dtypes(include=[np.number])
y = df['price']

print(f"Features before selection: {X.shape[1]}")

In [None]:
# Filter-based selection
filter_selector = FeatureSelector(method='filter', n_features=5)
X_filtered = filter_selector.select(X, y)

print("\n🔍 Filter Selection (top 5):")
print(f"  Selected: {filter_selector.selected_features}")
print(f"\n  Feature scores:")
for feat, score in sorted(filter_selector.feature_scores.items(), key=lambda x: -x[1])[:5]:
    print(f"    {feat}: {score:.4f}")
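
Filter methods score each feature independently of any model and keep the highest-scoring ones. The scoring function used by `FeatureSelector(method='filter')` is not shown in this notebook; the sketch below assumes one common choice, the absolute Pearson correlation with the target, on synthetic data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Synthetic features (made up for the example)
features = pd.DataFrame({
    "sqft_living": rng.normal(2000, 500, n),
    "bedrooms": rng.integers(1, 6, n).astype(float),
    "noise": rng.normal(0, 1, n),
})
# Synthetic target: driven mostly by sqft_living, partly by bedrooms
price = (
    150 * features["sqft_living"]
    + 30_000 * features["bedrooms"]
    + rng.normal(0, 20_000, n)
)

# Score each feature by |Pearson correlation| with the target, keep the top k
scores = features.corrwith(price).abs().sort_values(ascending=False)
top_k = scores.head(2).index.tolist()
print(scores.round(3))
print(top_k)
```

Correlation only captures linear, one-feature-at-a-time relationships, which is why comparing the result with a model-based (embedded) ranking is worthwhile.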

In [None]:
# Embedded selection (tree-based importance)
embedded_selector = FeatureSelector(method='embedded', n_features=5)
X_embedded = embedded_selector.select(X, y)

print("\n🌳 Tree-based Selection (top 5):")
print(f"  Selected: {embedded_selector.selected_features}")

In [None]:
# Visualize feature importance
importance_df = pd.DataFrame([
    {'feature': k, 'importance': v} 
    for k, v in embedded_selector.feature_scores.items()
]).nlargest(10, 'importance')

plt.figure(figsize=(10, 5))
plt.barh(importance_df['feature'], importance_df['importance'], color='steelblue')
plt.xlabel('Importance')
plt.title('Top 10 Features by Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## 🎯 Key Takeaways

1. **FeatureFactory** - Auto-generates features based on each column's data type
2. **Scalers** - StandardScaler, MinMaxScaler, RobustScaler
3. **Encoders** - LabelEncoder, OneHotEncoder, TargetEncoder
4. **FeatureSelector** - Filter, wrapper, and embedded methods

### Next: Model Comparison →