# 🔧 Feature Engineering

**Author**: Data Science Master System  
**Difficulty**: ⭐⭐ Intermediate  
**Time**: 60 minutes  
**Prerequisites**: 07_clustering_models

## Learning Objectives
- Create numeric, categorical, and datetime features
- Handle missing values
- Scale and encode features
- Generate features automatically

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)

## 1. Sample Data

In [None]:
df = pd.DataFrame({
    'age': [25, 30, np.nan, 45, 35],
    'income': [50000, 60000, 55000, np.nan, 80000],
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA'],
    'signup_date': pd.to_datetime(['2023-01-15', '2023-03-20', '2023-06-01', '2023-09-10', '2024-01-05']),
    'description': ['loves sports', 'enjoys music', 'sports fan', 'music lover', 'sports and music']
})
df

## 2. Handle Missing Values

In [None]:
# Imputation: replace missing age and income values with each column's median
imputer = SimpleImputer(strategy='median')
df[['age', 'income']] = imputer.fit_transform(df[['age', 'income']])
print("✅ Missing values imputed")
df
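
The learning objectives list feature scaling, which typically follows imputation because `StandardScaler` needs complete numeric columns to compute a meaningful mean and variance. A minimal sketch, assuming the median-imputed values from the cell above (32.5 for `age`, 57500 for `income`); the frame `demo` and the `*_scaled` column names are hypothetical and used only so the block runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical standalone copy of the imputed numeric columns
demo = pd.DataFrame({
    'age': [25.0, 30.0, 32.5, 45.0, 35.0],
    'income': [50000.0, 60000.0, 57500.0, 57500.0, 80000.0],
})
# Restore the observed income of 55000 for the third row (only row 4 was missing)
demo.loc[2, 'income'] = 55000.0

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(demo),
    columns=['age_scaled', 'income_scaled'],
)
print(scaled.round(3))
```

In a real project the scaler would usually be fit on the training split only and then applied to validation and test data with `transform`.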

## 3. Datetime Features

In [None]:
df['year'] = df['signup_date'].dt.year
df['month'] = df['signup_date'].dt.month
df['day_of_week'] = df['signup_date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
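# Note: pd.Timestamp.now() makes this value change on every run;
# use a fixed reference date if the feature must be reproducible
df['days_since_signup'] = (pd.Timestamp.now() - df['signup_date']).dt.days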

print("📅 Datetime features:")
df[['signup_date', 'year', 'month', 'day_of_week', 'is_weekend', 'days_since_signup']]
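
Two refinements are sometimes useful for the datetime features above: computing elapsed days against a fixed reference date so results are reproducible, and encoding the month cyclically so December and January end up close together. The sketch below is one possible approach, not part of the original notebook; the reference date `2024-06-30` and the names `signup`, `reference`, and `cyc` are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

signup = pd.Series(pd.to_datetime(
    ['2023-01-15', '2023-03-20', '2023-06-01', '2023-09-10', '2024-01-05']
))

# Hypothetical fixed reference date, so the feature does not depend on run time
reference = pd.Timestamp('2024-06-30')

cyc = pd.DataFrame({'month': signup.dt.month})
cyc['days_since_signup'] = (reference - signup).dt.days

# Map months 1..12 onto a circle so month 12 sits next to month 1
angle = 2 * np.pi * (cyc['month'] - 1) / 12
cyc['month_sin'] = np.sin(angle)
cyc['month_cos'] = np.cos(angle)

print(cyc.round(3))
```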

## 4. Categorical Encoding

In [None]:
# One-hot encoding
city_dummies = pd.get_dummies(df['city'], prefix='city')
print("🏙️ One-hot encoded cities:")
display(city_dummies)

## 5. Text Features

In [None]:
# TF-IDF
tfidf = TfidfVectorizer(max_features=5)
text_features = tfidf.fit_transform(df['description'])
text_df = pd.DataFrame(text_features.toarray(), columns=tfidf.get_feature_names_out())

print("📝 Text features:")
display(text_df)

## 6. Feature Interactions

In [None]:
# Polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
numeric = df[['age', 'income']].values
poly_features = poly.fit_transform(numeric)

print(f"Original features: {numeric.shape[1]}")
print(f"Polynomial features: {poly_features.shape[1]}")
print(f"Names: {poly.get_feature_names_out(['age', 'income'])}")

## 🎯 Key Takeaways
- Impute before encoding
- Extract datetime components
- One-hot for categories, TF-IDF for text
- Polynomial for interactions
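
One way to make the ordering in these takeaways explicit is to chain the steps with scikit-learn's `Pipeline` and `ColumnTransformer`. This is a sketch under assumptions, not the notebook's own method; the names `raw`, `numeric_steps`, and `preprocess` are illustrative, and only the numeric and city columns are included:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

raw = pd.DataFrame({
    'age': [25, 30, np.nan, 45, 35],
    'income': [50000, 60000, 55000, np.nan, 80000],
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA'],
})

# Numeric columns: impute first, then scale
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Apply the numeric steps and one-hot encoding to their respective columns
preprocess = ColumnTransformer([
    ('num', numeric_steps, ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])

features = preprocess.fit_transform(raw)
print(features.shape)
```

Bundling the steps this way means the same fitted transformations can be applied to new data with a single `preprocess.transform` call.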

**Next**: 09_hyperparameter_tuning.ipynb