# Feature Engineering

**Project:** Urban Mobility Optimization

**Objectives:**
1. Create efficiency ratios and composite indices
2. Engineer temporal features (growth rates, lags)
3. Generate sustainability scores
4. Create interaction features
5. Prepare features for machine learning models

---

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, RobustScaler
import warnings
warnings.filterwarnings('ignore')

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

pd.set_option('display.max_columns', None)

print("Libraries imported successfully")

## 1. Load Cleaned Data

In [None]:
# Load cleaned dataset
df = pd.read_csv('../data/processed/transport_data_cleaned.csv')

print(f"Dataset loaded: {df.shape}")
print(f"Economies: {df['economy'].nunique()}")
print(f"Years: {df['year'].min()} - {df['year'].max()}")
df.head()

## 2. Efficiency Ratios

Create performance metrics that normalize outputs by inputs.
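
An alternative to adding a constant to every denominator is to treat zero denominators as missing. The sketch below uses a hypothetical helper, `safe_ratio`, which is not part of this notebook, on made-up numbers. Whether NaN or an additive offset is preferable depends on how the downstream models handle missing values.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two series, returning NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


# Toy values for illustration only (not taken from the project data)
toy = pd.DataFrame({
    "gdp_per_capita_ppp": [10000.0, 20000.0, 30000.0],
    "co2_emissions_per_capita": [2.0, 0.0, 5.0],
})
toy["gdp_per_co2"] = safe_ratio(toy["gdp_per_capita_ppp"], toy["co2_emissions_per_capita"])
print(toy)
```

Rows with a nonzero denominator keep their exact ratio, and the zero-emission row is left missing rather than being assigned a very large value.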

In [None]:
# Carbon efficiency: GDP per capita (PPP) per unit of per-capita CO2 emissions.
# The +0.1 offset avoids division by zero, but it also shifts the ratio noticeably
# for economies with very low emissions.
df['gdp_per_co2'] = df['gdp_per_capita_ppp'] / (df['co2_emissions_per_capita'] + 0.1)

# Energy efficiency proxy: GDP per capita divided by energy intensity (energy use per unit of GDP)
df['gdp_per_energy'] = df['gdp_per_capita_ppp'] / (df['energy_use_per_gdp'] + 0.1)

# Infrastructure efficiency: LPI score per unit of road density (+1 offset avoids division by zero)
df['lpi_per_road_density'] = df['lpi_overall_score'] / (df['road_density_km_per_100sqkm'] + 1)

# Transport CO2 per capita: transport share of total emissions × per-capita emissions
df['transport_co2_intensity'] = (df['co2_transport_pct_total'] / 100) * df['co2_emissions_per_capita']

# Air passengers per 1,000 units of GDP per capita (no zero guard: assumes GDP per capita is never zero)
df['passengers_per_gdp'] = df['air_transport_passengers'] / (df['gdp_per_capita_ppp'] / 1000)

print("✓ Efficiency ratios created:")
print("  - GDP per CO2")
print("  - GDP per Energy")
print("  - LPI per Road Density")
print("  - Transport CO2 Intensity")
print("  - Passengers per GDP")

## 3. Composite Sustainability Index

Create a sustainability score combining multiple dimensions.
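
Written out, the index built in the next cell applies min-max scaling to each indicator $x$, reversed for indicators where lower is better, and then takes an equal-weight average. The minimum and maximum are taken over all rows of `df`, so economies and years are pooled. The symbols $s$, $s^{\mathrm{rev}}$, and $S_i$ are notation introduced here for illustration only.

```latex
s(x_i) = 100 \cdot \frac{x_i - \min(x)}{\max(x) - \min(x)},
\qquad
s^{\mathrm{rev}}(x_i) = 100 \cdot \frac{\max(x) - x_i}{\max(x) - \min(x)}

S_i = 0.25\, s^{\mathrm{rev}}(\mathrm{CO_2}_i)
    + 0.25\, s^{\mathrm{rev}}(\mathrm{energy}_i)
    + 0.25\, s(\mathrm{GDP}_i)
    + 0.25\, s(\mathrm{LPI}_i)
```

Each component lies in $[0, 100]$ and the weights sum to 1, so $S_i$ also lies in $[0, 100]$.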

In [None]:
# Normalize components to 0-100 scale
def normalize_to_100(series, reverse=False):
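    """
    Min-max normalize a series to the 0-100 scale.
    The min and max are taken over the whole series (all economies and years pooled).
    Set reverse=True for indicators where lower is better (e.g., emissions).
    """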
    min_val = series.min()
    max_val = series.max()
    
    if reverse:
        return 100 * (max_val - series) / (max_val - min_val)
    else:
        return 100 * (series - min_val) / (max_val - min_val)

# Component 1: Environmental Performance (lower emissions = better)
df['env_score'] = normalize_to_100(df['co2_emissions_per_capita'], reverse=True)

# Component 2: Energy Efficiency (lower energy use = better)
df['energy_score'] = normalize_to_100(df['energy_use_per_gdp'], reverse=True)

# Component 3: Economic Performance (higher GDP = better)
df['economic_score'] = normalize_to_100(df['gdp_per_capita_ppp'])

# Component 4: Infrastructure Quality (higher LPI = better)
df['infra_score'] = normalize_to_100(df['lpi_overall_score'])

# Composite Sustainability Index (equal weights)
df['sustainability_index'] = (
    0.25 * df['env_score'] +
    0.25 * df['energy_score'] +
    0.25 * df['economic_score'] +
    0.25 * df['infra_score']
)

print("✓ Sustainability Index created")
print(f"  Range: {df['sustainability_index'].min():.2f} - {df['sustainability_index'].max():.2f}")
print(f"  Mean: {df['sustainability_index'].mean():.2f}")
print(f"  Std: {df['sustainability_index'].std():.2f}")

In [None]:
# Visualize sustainability index by income group
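# Create the axes explicitly and pass them to boxplot; calling plt.figure() first
# would leave an extra empty figure, because boxplot(by=...) opens its own figure
fig, ax = plt.subplots(figsize=(12, 6))
df.boxplot(column='sustainability_index', by='income_group', ax=ax)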
plt.title('Sustainability Index by Income Group', fontsize=14, fontweight='bold')
plt.suptitle('')
plt.xlabel('Income Group')
plt.ylabel('Sustainability Index (0-100)')
plt.xticks(rotation=15)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('../visualizations/sustainability_index.png', dpi=300, bbox_inches='tight')
plt.show()

print("Visualization saved: visualizations/sustainability_index.png")

## 4. Temporal Features

Create year-over-year growth rates, lagged features, and rolling averages.
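
Temporal features on panel data must be computed within each economy, so that the first year of one economy never borrows a value from the last year of another. The sketch below illustrates this on a made-up two-economy panel. The column name `gdp` and all values are illustrative assumptions.

```python
import pandas as pd

# Toy two-economy panel (made-up values), already sorted by economy and year
panel = pd.DataFrame({
    "economy": ["A", "A", "A", "B", "B", "B"],
    "year": [2020, 2021, 2022, 2020, 2021, 2022],
    "gdp": [100.0, 110.0, 121.0, 50.0, 55.0, 44.0],
})

grouped = panel.groupby("economy")["gdp"]

# Year-over-year growth in percent; the first year of each economy has no predecessor
panel["gdp_growth"] = grouped.pct_change() * 100

# Previous year's value within the same economy
panel["gdp_lag1"] = grouped.shift(1)

# 3-year trailing mean; min_periods=1 averages whatever history exists so far
panel["gdp_ma3"] = grouped.transform(lambda s: s.rolling(window=3, min_periods=1).mean())

print(panel)
```

The first row of economy B is NaN for both growth and lag, which confirms that nothing leaks across the group boundary. These operations work on row positions, so they equal true one-year changes only when every economy has consecutive years.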

In [None]:
# Sort by economy and year
df = df.sort_values(['economy', 'year']).reset_index(drop=True)

# Year-over-year growth rates
growth_cols = ['gdp_per_capita_ppp', 'co2_emissions_per_capita', 'urban_population_pct', 'lpi_overall_score']

for col in growth_cols:
    # Percentage change from the previous row within each economy;
    # this is year-over-year growth only where consecutive years are present
    df[f'{col}_growth'] = df.groupby('economy')[col].pct_change() * 100
    
print("✓ Growth rate features created:")
for col in growth_cols:
    print(f"  - {col}_growth")

In [None]:
# Lagged features for investment impact analysis
# (previous observation per economy; a 1-year lag when years are consecutive)
lag_cols = ['gross_capital_formation_pct_gdp', 'road_density_km_per_100sqkm']

for col in lag_cols:
    df[f'{col}_lag1'] = df.groupby('economy')[col].shift(1)

print("✓ Lagged features created (1-year lag):")
for col in lag_cols:
    print(f"  - {col}_lag1")

In [None]:
# Rolling averages (3-year moving average for trend smoothing)
rolling_cols = ['gdp_per_capita_ppp', 'co2_emissions_per_capita', 'lpi_overall_score']

for col in rolling_cols:
    df[f'{col}_ma3'] = df.groupby('economy')[col].transform(
        lambda x: x.rolling(window=3, min_periods=1).mean()
    )

print("✓ Moving average features created (3-year):")
for col in rolling_cols:
    print(f"  - {col}_ma3")

## 5. Infrastructure Development Index
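
The index in the next cell is a weighted sum of four min-max scores (road 0.3, rail 0.2, air 0.3, port 0.2). A minimal sketch of the same idea with an explicit weight check follows. The helper `weighted_index` is hypothetical, and the scores are made up.

```python
import math

import pandas as pd


def weighted_index(scores: pd.DataFrame, weights: dict) -> pd.Series:
    """Weighted sum of score columns; raises if the weights do not sum to 1."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"Weights sum to {total}, expected 1.0")
    return sum(scores[col] * w for col, w in weights.items())


# Made-up 0-100 scores for two economies
scores = pd.DataFrame({
    "road_score": [80.0, 20.0],
    "rail_score": [50.0, 10.0],
    "air_score": [100.0, 0.0],
    "port_score": [40.0, 30.0],
})
weights = {"road_score": 0.3, "rail_score": 0.2, "air_score": 0.3, "port_score": 0.2}
index = weighted_index(scores, weights)
print(index)
```

Because pandas propagates NaN through addition, an economy missing any one component ends up with a missing index value.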

In [None]:
# Normalize infrastructure components
df['road_score'] = normalize_to_100(df['road_density_km_per_100sqkm'])
df['rail_score'] = normalize_to_100(df['rail_lines_total_km'])
df['air_score'] = normalize_to_100(df['air_transport_passengers'])
df['port_score'] = normalize_to_100(df['container_port_traffic_teu'])

# Infrastructure Development Index
df['infrastructure_index'] = (
    0.3 * df['road_score'] +
    0.2 * df['rail_score'] +
    0.3 * df['air_score'] +
    0.2 * df['port_score']
)

print("✓ Infrastructure Development Index created")
print(f"  Range: {df['infrastructure_index'].min():.2f} - {df['infrastructure_index'].max():.2f}")

## 6. Categorical Encoding
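
Income group has a natural order and is encoded as an integer rank, while region has no order and is one-hot encoded. The sketch below shows both encodings on a made-up frame. The region names are illustrative assumptions, and `dtype=int` is an optional choice that yields 0/1 columns instead of booleans.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "income_group": ["Low income", "High income", "Upper middle income"],
    "region": ["South Asia", "Europe & Central Asia", "South Asia"],
})

# Ordinal encoding keeps the natural ordering of income groups;
# labels missing from the mapping would become NaN
income_order = {"Low income": 1, "Lower middle income": 2,
                "Upper middle income": 3, "High income": 4}
toy["income_group_encoded"] = toy["income_group"].map(income_order)

# One-hot encoding for an unordered category
region_dummies = pd.get_dummies(toy["region"], prefix="region", dtype=int)
toy = pd.concat([toy, region_dummies], axis=1)

print(toy)
```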

In [None]:
# Create income group ordinal encoding
income_order = {'Low income': 1, 'Lower middle income': 2, 'Upper middle income': 3, 'High income': 4}
df['income_group_encoded'] = df['income_group'].map(income_order)

# One-hot encode regions
region_dummies = pd.get_dummies(df['region'], prefix='region')
df = pd.concat([df, region_dummies], axis=1)

print("✓ Categorical features encoded")
print(f"  Income group ordinal: {income_order}")
print(f"  Region dummies: {len(region_dummies.columns)} columns")

## 7. Interaction Features

Create features that capture relationships between variables.

In [None]:
# GDP × Urbanization (economic concentration in cities)
df['gdp_urban_interaction'] = df['gdp_per_capita_ppp'] * (df['urban_population_pct'] / 100)

# Infrastructure × Trade (connectivity benefit)
df['infra_trade_interaction'] = df['infrastructure_index'] * (df['trade_pct_gdp'] / 100)

# Energy intensity × log GDP per capita (development stage effect)
df['energy_gdp_interaction'] = df['energy_use_per_gdp'] * np.log1p(df['gdp_per_capita_ppp'])

print("✓ Interaction features created:")
print("  - GDP × Urbanization")
print("  - Infrastructure × Trade")
print("  - Energy × GDP (log)")

## 8. Log Transformations

Apply log transformations to skewed variables for modeling.
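
One way to check whether a log transform is worthwhile is to compare skewness before and after. The sketch below does this on made-up right-skewed values that include a zero, which `np.log1p` handles without error. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values, including a zero
passengers = pd.Series([0.0, 1e3, 5e3, 2e4, 1e5, 5e6, 8e7])

skew_raw = passengers.skew()
skew_log = np.log1p(passengers).skew()

print(f"Skewness before log1p: {skew_raw:.2f}")
print(f"Skewness after log1p:  {skew_log:.2f}")
```

A skewness closer to zero after the transform suggests the logged column is better suited to models that are sensitive to extreme values.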

In [None]:
# Variables with high skewness
log_transform_cols = [
    'gdp_per_capita_ppp',
    'air_transport_passengers',
    'container_port_traffic_teu',
    'rail_lines_total_km'
]

for col in log_transform_cols:
    df[f'{col}_log'] = np.log1p(df[col])  # log(1 + x) to handle zeros

print("✓ Log transformations applied:")
for col in log_transform_cols:
    print(f"  - {col}_log")

## 9. Performance Categories

Create categorical performance tiers for classification tasks.
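
The two tiering approaches used in the next cell behave differently: fixed thresholds give tiers with a stable meaning but uneven sizes, while quantile cuts give roughly equal-sized tiers whose boundaries depend on the data. The sketch below compares them on made-up scores, using `pd.cut` as a vectorized stand-in for the threshold function.

```python
import numpy as np
import pandas as pd

# Made-up LPI-like scores
scores = pd.Series([2.1, 2.8, 3.0, 3.4, 3.9, 4.2])

# Fixed thresholds: bins are [-inf, 3.0), [3.0, 4.0), [4.0, inf)
fixed = pd.cut(scores, bins=[-np.inf, 3.0, 4.0, np.inf],
               labels=["Low", "Medium", "High"], right=False)

# Quantile-based tiers: roughly equal-sized groups
terciles = pd.qcut(scores, q=3, labels=["Low", "Medium", "High"])

print(fixed.value_counts().sort_index())
print(terciles.value_counts().sort_index())
```

Both `pd.cut` and `pd.qcut` leave missing scores as NaN instead of assigning them to a tier.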

In [None]:
# LPI Performance Tiers
def categorize_lpi(score):
    if pd.isna(score):
        return np.nan  # keep missing scores missing rather than labelling them 'Low'
    if score >= 4.0:
        return 'High'
    elif score >= 3.0:
        return 'Medium'
    else:
        return 'Low'

df['lpi_category'] = df['lpi_overall_score'].apply(categorize_lpi)

# Sustainability Performance Tiers
df['sustainability_category'] = pd.qcut(df['sustainability_index'], 
                                        q=3, 
                                        labels=['Low', 'Medium', 'High'])

print("✓ Performance categories created:")
print("\nLPI Category Distribution:")
print(df['lpi_category'].value_counts())
print("\nSustainability Category Distribution:")
print(df['sustainability_category'].value_counts())

## 10. Feature Summary & Selection

In [None]:
# List all engineered features
original_features = [
    'gdp_per_capita_ppp', 'road_density_km_per_100sqkm', 'rail_lines_total_km',
    'air_transport_passengers', 'container_port_traffic_teu', 'energy_use_per_gdp',
    'co2_emissions_per_capita', 'co2_transport_pct_total', 'trade_pct_gdp',
    'urban_population_pct', 'lpi_overall_score', 'lpi_infrastructure_score',
    'gross_capital_formation_pct_gdp'
]

efficiency_features = [
    'gdp_per_co2', 'gdp_per_energy', 'lpi_per_road_density',
    'transport_co2_intensity', 'passengers_per_gdp'
]

index_features = [
    'sustainability_index', 'infrastructure_index',
    'env_score', 'energy_score', 'economic_score', 'infra_score'
]

temporal_features = [col for col in df.columns if '_growth' in col or '_lag' in col or '_ma3' in col]

interaction_features = [col for col in df.columns if 'interaction' in col]

log_features = [col for col in df.columns if col.endswith('_log')]

print("FEATURE ENGINEERING SUMMARY")
print("="*60)
print(f"Original features: {len(original_features)}")
print(f"Efficiency ratios: {len(efficiency_features)}")
print(f"Composite indices: {len(index_features)}")
print(f"Temporal features: {len(temporal_features)}")
print(f"Interaction features: {len(interaction_features)}")
print(f"Log transformations: {len(log_features)}")
print(f"\nTotal features created: {len(efficiency_features) + len(index_features) + len(temporal_features) + len(interaction_features) + len(log_features)}")
print(f"Final dataset shape: {df.shape}")

In [None]:
# Check for NaN values in engineered features.
# Growth and lag features are NaN in each economy's first year by construction.
print("\nMissing values check in key features:")
key_features = efficiency_features + index_features + temporal_features[:5]
missing_check = df[key_features].isnull().sum()
missing_check = missing_check[missing_check > 0]

if len(missing_check) > 0:
    print(missing_check)
else:
    print("✓ No missing values in key engineered features")

## 11. Correlation Analysis of New Features
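
One caveat when reading these correlations: `infra_score` is a min-max rescaling of `lpi_overall_score`, and `sustainability_index` includes it with weight 0.25, so both correlate with LPI by construction. The sketch below demonstrates the effect on made-up scores.

```python
import pandas as pd

# Made-up LPI scores
lpi = pd.Series([2.4, 2.9, 3.1, 3.6, 4.1], name="lpi_overall_score")

# Min-max rescaling to 0-100, the same transformation used for infra_score
infra_score = 100 * (lpi - lpi.min()) / (lpi.max() - lpi.min())

# A linear rescaling of the target correlates perfectly with it
print(round(infra_score.corr(lpi), 6))
```

Features derived from the target in this way should be excluded when LPI is the prediction target.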

In [None]:
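# Correlation of efficiency and index features with the target variable (LPI).
# Note: infra_score is a rescaled copy of lpi_overall_score, so its correlation is 1 by construction,
# and sustainability_index contains infra_score with weight 0.25.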
feature_cols = efficiency_features + index_features + ['lpi_overall_score']
corr_with_lpi = df[feature_cols].corr()['lpi_overall_score'].sort_values(ascending=False)

print("Correlation with Logistics Performance (LPI):")
print("="*60)
print(corr_with_lpi[corr_with_lpi.index != 'lpi_overall_score'])

In [None]:
# Visualize top engineered features
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Sustainability Index vs LPI
axes[0,0].scatter(df['sustainability_index'], df['lpi_overall_score'], alpha=0.5, s=30)
axes[0,0].set_xlabel('Sustainability Index')
axes[0,0].set_ylabel('LPI Score')
axes[0,0].set_title('Sustainability vs Logistics Performance')
axes[0,0].grid(alpha=0.3)

# GDP per CO2 vs LPI
axes[0,1].scatter(df['gdp_per_co2'], df['lpi_overall_score'], alpha=0.5, s=30)
axes[0,1].set_xlabel('GDP per CO2')
axes[0,1].set_ylabel('LPI Score')
axes[0,1].set_title('Carbon Efficiency vs Logistics Performance')
axes[0,1].grid(alpha=0.3)

# Infrastructure Index vs LPI
axes[1,0].scatter(df['infrastructure_index'], df['lpi_overall_score'], alpha=0.5, s=30)
axes[1,0].set_xlabel('Infrastructure Index')
axes[1,0].set_ylabel('LPI Score')
axes[1,0].set_title('Infrastructure Development vs Logistics Performance')
axes[1,0].grid(alpha=0.3)

# GDP per Energy vs LPI
axes[1,1].scatter(df['gdp_per_energy'], df['lpi_overall_score'], alpha=0.5, s=30)
axes[1,1].set_xlabel('GDP per Energy')
axes[1,1].set_ylabel('LPI Score')
axes[1,1].set_title('Energy Efficiency vs Logistics Performance')
axes[1,1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('../visualizations/engineered_features_vs_lpi.png', dpi=300, bbox_inches='tight')
plt.show()

print("Visualization saved: visualizations/engineered_features_vs_lpi.png")

## 12. Save Feature-Engineered Dataset

In [None]:
# Save full feature set
output_path = '../data/processed/transport_data_features.csv'
df.to_csv(output_path, index=False)

print("="*60)
print("FEATURE-ENGINEERED DATASET SAVED")
print("="*60)
print(f"File: {output_path}")
print(f"Shape: {df.shape}")
print(f"Total features: {len(df.columns)}")
print("\nDataset ready for modeling!")

In [None]:
# Save feature metadata
import json

feature_metadata = {
    'original_features': original_features,
    'efficiency_ratios': efficiency_features,
    'composite_indices': index_features,
    'temporal_features': temporal_features,
    'interaction_features': interaction_features,
    'log_transformations': log_features,
    'categorical_features': ['lpi_category', 'sustainability_category'],
    'target_variables': ['lpi_overall_score', 'co2_emissions_per_capita', 'sustainability_index']
}

with open('../data/processed/feature_metadata.json', 'w') as f:
    json.dump(feature_metadata, f, indent=4)

print("✓ Feature metadata saved")

## Summary: Key Features Created

### 1. Efficiency Metrics
- **GDP per CO2:** Economic output per unit of emissions (carbon efficiency)
- **GDP per Energy:** Economic productivity per energy consumed
- **LPI per Road Density:** Logistics performance relative to infrastructure

### 2. Composite Indices
- **Sustainability Index:** Balanced score across environment, energy, economy, infrastructure
- **Infrastructure Index:** Multi-modal transport infrastructure development

### 3. Temporal Features
- Growth rates (YoY % change)
- Lagged variables (investment impact analysis)
- Moving averages (trend smoothing)

### 4. Business Value
These features enable:
- Efficiency benchmarking across economies
- Investment impact assessment (lagged features)
- Sustainability-development trade-off analysis
- Multi-dimensional performance evaluation

---

**Next:** Machine learning modeling for prediction and segmentation