# Part 1.6: Feature Creation & Transformation

Feature engineering is the practice of deriving new, informative features from existing data. Well-chosen derived features often improve model performance because they expose relationships that the raw columns do not make explicit.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, FunctionTransformer

data = {
    'age': [25, 30, 35, 40],
    'income': [50000, 80000, 60000, 100000],
    'date': pd.to_datetime(['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05'])
}
df = pd.DataFrame(data)
print("Original DataFrame:")
df

Original DataFrame:


|   | age | income | date |
|---|-----|--------|------|
| 0 | 25 | 50000 | 2023-01-15 |
| 1 | 30 | 80000 | 2023-02-20 |
| 2 | 35 | 60000 | 2023-03-10 |
| 3 | 40 | 100000 | 2023-04-05 |


### Polynomial Features
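`PolynomialFeatures` generates interaction features (e.g., `age * income`) and higher-order features (e.g., `age^2`) from the input columns. With `degree=2` and `include_bias=False`, the two input columns expand to five output columns.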

In [2]:
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
poly_features = poly.fit_transform(df[['age', 'income']])

poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(['age', 'income']))
print("DataFrame with Polynomial Features:")
poly_df

DataFrame with Polynomial Features:


|   | age | income | age^2 | age income | income^2 |
|---|-----|--------|-------|------------|----------|
| 0 | 25.0 | 50000.0 | 625.0 | 1250000.0 | 2500000000.0 |
| 1 | 30.0 | 80000.0 | 900.0 | 2400000.0 | 6400000000.0 |
| 2 | 35.0 | 60000.0 | 1225.0 | 2100000.0 | 3600000000.0 |
| 3 | 40.0 | 100000.0 | 1600.0 | 4000000.0 | 10000000000.0 |


### Function Transformers for Skewed Data
Applying a monotonic function such as a logarithm compresses large values, which can reduce right skew in features like income.

In [3]:
# np.log1p computes log(1 + x), so it stays defined at x = 0 (plain log does not)
log_transformer = FunctionTransformer(np.log1p)
df['income_log'] = log_transformer.fit_transform(df[['income']])

print("DataFrame with Log-Transformed Income:")
df[['income', 'income_log']]

DataFrame with Log-Transformed Income:


|   | income | income_log |
|---|--------|------------|
| 0 | 50000 | 10.819798 |
| 1 | 80000 | 11.289794 |
| 2 | 60000 | 11.002117 |
| 3 | 100000 | 11.512935 |
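
When predictions or reports need to be expressed back in the original units, the transform has to be reversible. A possible approach, sketched here under the assumption that `np.expm1` is the intended inverse of `np.log1p`, is to pass it as `inverse_func` so the transformer can undo itself. The names `income`, `log_tf`, and `income_back` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Income column from the example DataFrame
income = pd.DataFrame({'income': [50000, 80000, 60000, 100000]})

# Pair the forward function with its inverse: expm1(log1p(x)) == x
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1)

income_log = log_tf.fit_transform(income)
income_back = log_tf.inverse_transform(income_log)

# The round trip should recover the original values up to floating-point error
print(np.allclose(income_back, income))
```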


### Date/Time Features
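Datetime columns can be decomposed into components such as day of week, month, and quarter through the pandas `.dt` accessor, which lets a model pick up calendar patterns.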

In [4]:
df['day_of_week'] = df['date'].dt.dayofweek # Monday=0, Sunday=6
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter

print("DataFrame with new Date/Time Features:")
df[['date', 'day_of_week', 'month', 'quarter']]

DataFrame with new Date/Time Features:


|   | date | day_of_week | month | quarter |
|---|------|-------------|-------|---------|
| 0 | 2023-01-15 | 6 | 1 | 1 |
| 1 | 2023-02-20 | 0 | 2 | 1 |
| 2 | 2023-03-10 | 4 | 3 | 1 |
| 3 | 2023-04-05 | 2 | 4 | 2 |
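
The same accessor can feed further derived columns. As a rough sketch, the snippet below adds a weekend flag and the number of days elapsed since the earliest date; the column names `is_weekend` and `days_since_first` are hypothetical choices for illustration, not part of the original notebook.

```python
import pandas as pd

# Date column from the example DataFrame
dates = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05'])
})

# dayofweek uses Monday=0 ... Sunday=6, so 5 and 6 are Saturday and Sunday
dates['is_weekend'] = dates['date'].dt.dayofweek >= 5

# Subtracting datetimes gives a timedelta; .dt.days extracts whole days
dates['days_since_first'] = (dates['date'] - dates['date'].min()).dt.days

print(dates)
```

Here only the first row (a Sunday) is flagged as a weekend, and the elapsed-day column turns the dates into a single numeric feature that preserves ordering and spacing.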
