# Feature Engineering Pipeline
## Food Price Clustering Project - Phase 1

This notebook transforms the consolidated food price time series data into features suitable for clustering analysis.

### Objectives:
1. **Load** consolidated dataset from preprocessing phase
2. **Extract** 3 key features per commodity from time series data:
   - **Price Average**: Mean price over the period
   - **Coefficient of Variation (CV)**: Volatility measure (std/mean)
   - **Price Trend**: Linear regression slope (price change over time)
3. **Create** 30-feature matrix (3 features × 10 commodities) for clustering
4. **Validate** feature quality and distributions
5. **Visualize** feature relationships and patterns
6. **Export** feature matrix for clustering experiments

### Feature Engineering Strategy:
- **City-Level Features**: Each city becomes one row with 30 feature columns
- **Time Series → Features**: Convert daily price data to statistical summaries
- **Standardization-Ready**: Features designed for easy scaling in clustering phase
- **Business-Interpretable**: Each feature has clear business meaning
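
As a rough illustration of the three features listed above, the sketch below computes them for a single made-up price series. The values and variable names are hypothetical, and `np.polyfit` stands in for the regression step; it is not the notebook's own implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: five daily prices (IDR) for one city-commodity pair.
prices = pd.Series([10_000, 10_200, 10_100, 10_400, 10_300], dtype=float)
days = np.arange(len(prices))  # days elapsed since the first observation

price_avg = prices.mean()                                # Price Average
price_cv = prices.std() / prices.mean()                  # CV with pandas' sample std (ddof=1)
price_trend = np.polyfit(days, prices.to_numpy(), 1)[0]  # least-squares slope, IDR per day

print(f"avg={price_avg:.1f}, cv={price_cv:.4f}, trend={price_trend:.1f}")
```

Because the time axis is measured in days, the trend reads as an average price change per day.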

---


In [16]:
"""
Setup and Environment Configuration
"""
import sys
import os
import logging
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Any, Optional, Union, Tuple
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from pydantic import BaseModel, Field, field_validator

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

def setup_logging(enable_file_logging: bool = False, log_level: str = "INFO") -> logging.Logger:
    """
    Setup logging with optional file output.
    
    Args:
        enable_file_logging: Whether to save logs to file
        log_level: Logging level (DEBUG, INFO, WARNING, ERROR)
        
    Returns:
        Configured logger instance
    """
    # Create logger
    logger = logging.getLogger("feature_engineering")
    logger.setLevel(getattr(logging, log_level.upper()))
    
    # Clear any existing handlers
    logger.handlers.clear()
    
    # Create formatters
    detailed_formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(filename)s:%(lineno)d - %(message)s'
    )
    simple_formatter = logging.Formatter(
        '%(asctime)s - %(levelname)s - %(message)s'
    )
    
    # Console handler (always enabled)
    console_handler = logging.StreamHandler()
    console_handler.setLevel(getattr(logging, log_level.upper()))
    console_handler.setFormatter(simple_formatter)
    logger.addHandler(console_handler)
    
    # File handler (optional)
    if enable_file_logging:
        # Create logs directory
        logs_dir = Path("logs")
        logs_dir.mkdir(exist_ok=True)
        
        # Generate timestamped log filename
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        log_filename = logs_dir / f"feature_engineering_{timestamp}.log"
        
        file_handler = logging.FileHandler(log_filename, encoding='utf-8')
        file_handler.setLevel(logging.DEBUG)  # Log everything to file
        file_handler.setFormatter(detailed_formatter)
        logger.addHandler(file_handler)
        
        logger.info(f"📝 File logging enabled - Log file: {log_filename}")
    else:
        logger.info("📺 Console logging only (file logging disabled)")
    
    return logger

def setup_environment() -> Dict[str, Any]:
    """
    Setup notebook environment and verify paths.
    
    Returns:
        Dict containing environment information and path verification results.
    """
    # Fix working directory - ensure we're running from project root
    notebook_dir = Path.cwd()
    project_root = notebook_dir.parent if notebook_dir.name == 'notebooks' else notebook_dir
    
    # Change to project root so all relative paths work correctly
    os.chdir(project_root)
    
    # Verify critical paths
    paths_info = {
        'notebook_dir': notebook_dir,
        'project_root': project_root,
        'current_dir': Path.cwd(),
        'processed_data_path': Path('data/processed'),
        'processed_data_exists': Path('data/processed').exists(),
        'features_data_path': Path('data/features'),
        'features_data_exists': Path('data/features').exists()
    }
    
    # Create features data directory if it doesn't exist
    if not paths_info['features_data_exists']:
        paths_info['features_data_path'].mkdir(parents=True, exist_ok=True)
    
    # Environment info
    # Note: scipy.stats exposes no __version__, so read it from the top-level package
    import scipy
    env_info = {
        'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'python_version': sys.version.split()[0],
        'pandas_version': pd.__version__,
        'numpy_version': np.__version__,
        'scipy_version': scipy.__version__
    }
    
    return {**paths_info, **env_info}

# Setup environment
env_info = setup_environment()

# Display environment information
print("🔧 Feature Engineering Pipeline - Environment Setup")
print("=" * 60)
print(f"📅 Notebook run at: {env_info['timestamp']}")
print(f"🐍 Python version: {env_info['python_version']}")
print(f"🐼 Pandas version: {env_info['pandas_version']}")
print(f"🔢 NumPy version: {env_info['numpy_version']}")
print(f"📊 SciPy version: {env_info['scipy_version']}")
print()
print("📂 Path Verification:")
print(f"   Current working directory: {env_info['current_dir']}")
print(f"   Processed data path: {env_info['processed_data_path']} ({'✅ EXISTS' if env_info['processed_data_exists'] else '❌ MISSING'})")
print(f"   Features data path: {env_info['features_data_path']} ({'✅ EXISTS' if env_info['features_data_exists'] else '❌ MISSING'})")
print("=" * 60)


🔧 Feature Engineering Pipeline - Environment Setup
📅 Notebook run at: 2025-10-04 20:14:56
🐍 Python version: 3.12.7
🐼 Pandas version: 2.3.3
🔢 NumPy version: 2.3.3
📊 SciPy version: N/A

📂 Path Verification:
   Current working directory: c:\Users\UNTAR\Semester 7\SKRIPSI\program\food-price-clustering\food-price-clustering
   Processed data path: data\processed (✅ EXISTS)
   Features data path: data\features (✅ EXISTS)


In [17]:
"""
Configuration for Feature Engineering
"""

class FeatureEngineeringConfig(BaseModel):
    """
    Configuration class for feature engineering pipeline.
    
    Uses Pydantic for validation and type checking.
    """
    # Logging configuration
    enable_file_logging: bool = Field(
        default=False,
        description="Whether to save logs to file (True) or console only (False)"
    )
    
    log_level: str = Field(
        default="INFO",
        description="Logging level: DEBUG, INFO, WARNING, ERROR"
    )
    
    # Input data configuration
    input_file_pattern: str = Field(
        default="food_prices_consolidated.csv",
        description="Pattern to match input CSV files in processed directory"
    )
    
    # Feature engineering parameters
    expected_commodities: List[str] = Field(
        default=[
            "Beras", "Telur Ayam", "Daging Ayam", "Daging Sapi",
            "Bawang Merah", "Bawang Putih", "Cabai Merah", "Cabai Rawit",
            "Minyak Goreng", "Gula Pasir"
        ],
        description="Expected commodities in the dataset"
    )
    
    # Feature calculation parameters
    min_data_points: int = Field(
        default=30,
        description="Minimum number of data points required per city-commodity for feature calculation"
    )
    
    trend_method: str = Field(
        default="linear_regression",
        description="Method for calculating price trend: 'linear_regression' or 'simple_slope'"
    )
    
    # Output configuration
    feature_name_format: str = Field(
        default="{commodity}_{feature_type}",
        description="Format for feature column names. Available: {commodity}, {feature_type}"
    )
    
    export_formats: List[str] = Field(
        default=["csv", "excel"],
        description="Export formats for feature matrix"
    )
    
    @field_validator('log_level')
    @classmethod
    def validate_log_level(cls, v):
        valid_levels = ["DEBUG", "INFO", "WARNING", "ERROR"]
        if v.upper() not in valid_levels:
            raise ValueError(f"Log level must be one of: {valid_levels}")
        return v.upper()
    
    @field_validator('min_data_points')
    @classmethod
    def validate_min_data_points(cls, v):
        if v < 10:
            raise ValueError("Minimum data points must be at least 10")
        return v
    
    @field_validator('trend_method')
    @classmethod
    def validate_trend_method(cls, v):
        valid_methods = ["linear_regression", "simple_slope"]
        if v not in valid_methods:
            raise ValueError(f"Trend method must be one of: {valid_methods}")
        return v
    
    @field_validator('export_formats')
    @classmethod
    def validate_export_formats(cls, v):
        valid_formats = ["csv", "excel", "json"]
        for fmt in v:
            if fmt not in valid_formats:
                raise ValueError(f"Export format '{fmt}' not supported. Valid: {valid_formats}")
        return v

def find_latest_consolidated_file(processed_dir: Path, pattern: str) -> Optional[Path]:
    """
    Find the most recent consolidated data file matching the pattern.
    
    Args:
        processed_dir: Directory containing processed files
        pattern: File pattern to match
        
    Returns:
        Path to the most recent file, or None if not found
    """
    matching_files = list(processed_dir.glob(pattern))
    
    if not matching_files:
        return None
    
    # Sort by modification time, most recent first
    matching_files.sort(key=lambda x: x.stat().st_mtime, reverse=True)
    return matching_files[0]

# Initialize configuration
config = FeatureEngineeringConfig()

# Setup logging based on configuration
logger = setup_logging(config.enable_file_logging, config.log_level)

# Find input data file
processed_dir = Path("data/processed")
input_file = find_latest_consolidated_file(processed_dir, config.input_file_pattern)

print("⚙️ Feature Engineering Configuration")
print("=" * 50)
print(f"📝 File Logging: {'✅ ENABLED' if config.enable_file_logging else '❌ DISABLED'}")
print(f"📊 Log Level: {config.log_level}")
print(f"📁 Input File: {input_file.name if input_file else '❌ NOT FOUND'}")
print(f"🥬 Expected Commodities: {len(config.expected_commodities)}")
print(f"📈 Trend Method: {config.trend_method}")
print(f"📊 Min Data Points: {config.min_data_points}")
print(f"💾 Export Formats: {', '.join(config.export_formats)}")
print("=" * 50)

if not input_file:
    raise FileNotFoundError(f"No consolidated data file found matching pattern: {config.input_file_pattern}")

logger.info(f"Using input file: {input_file}")


2025-10-04 20:15:00,802 - INFO - 📺 Console logging only (file logging disabled)
2025-10-04 20:15:00,804 - INFO - Using input file: data\processed\food_prices_consolidated.csv


⚙️ Feature Engineering Configuration
📝 File Logging: ❌ DISABLED
📊 Log Level: INFO
📁 Input File: food_prices_consolidated.csv
🥬 Expected Commodities: 10
📈 Trend Method: linear_regression
📊 Min Data Points: 30
💾 Export Formats: csv, excel


## Data Loading and Exploration

Load the consolidated dataset and perform initial exploration to understand the data structure.
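
For orientation, here is a small sketch of the long format the loader appears to expect. The column set (`Date`, `City`, `Commodity`, `Price`, `Year`) is inferred from the columns the code references, and the three rows are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample; one price is deliberately non-numeric.
sample_csv = io.StringIO(
    "Date,City,Commodity,Price,Year\n"
    "2020-01-01,Kota Bandung,Beras,11500,2020\n"
    "2020-01-02,Kota Bandung,Beras,not_available,2020\n"
    "2020-01-01,Kota Bandung,Gula Pasir,13000,2020\n"
)
sample = pd.read_csv(sample_csv)
sample["Date"] = pd.to_datetime(sample["Date"])
# errors="coerce" turns unparseable prices into NaN instead of raising
sample["Price"] = pd.to_numeric(sample["Price"], errors="coerce")

print(sample.dtypes)
print("Missing prices:", sample["Price"].isna().sum())
```

Coercion keeps the load from failing on bad cells, at the cost of silently introducing missing values that the later feature functions have to handle.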


In [None]:
"""
Load and Explore Consolidated Dataset
"""

def load_consolidated_data(file_path: Path) -> pd.DataFrame:
    """
    Load consolidated food price data with proper data types.
    
    Args:
        file_path: Path to consolidated CSV file
        
    Returns:
        DataFrame with loaded data
    """
    logger.info(f"Loading data from: {file_path}")
    
    # Load data
    df = pd.read_csv(file_path)
    
    # Convert data types
    df['Date'] = pd.to_datetime(df['Date'])
    df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
    
    # Convert categorical columns
    categorical_columns = ['City', 'Commodity', 'Year']
    for col in categorical_columns:
        if col in df.columns:
            df[col] = df[col].astype('category')
    
    logger.info(f"Data loaded successfully: {df.shape}")
    return df

def explore_dataset(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Perform comprehensive dataset exploration.
    
    Args:
        df: Consolidated DataFrame
        
    Returns:
        Dictionary with exploration results
    """
    exploration_results = {
        'basic_info': {
            'total_rows': len(df),
            'total_columns': len(df.columns),
            'memory_usage_mb': df.memory_usage(deep=True).sum() / 1024**2,
            'date_range': {
                'min_date': df['Date'].min(),
                'max_date': df['Date'].max(),
                'total_days': (df['Date'].max() - df['Date'].min()).days
            }
        },
        'data_coverage': {
            'total_cities': df['City'].nunique(),
            'total_commodities': df['Commodity'].nunique(),
            'cities_list': sorted(df['City'].unique().tolist()),
            'commodities_list': sorted(df['Commodity'].unique().tolist())
        },
        'data_quality': {
            'missing_values': df.isnull().sum().to_dict(),
            'duplicate_rows': df.duplicated().sum(),
            'price_stats': {
                'min_price': df['Price'].min(),
                'max_price': df['Price'].max(),
                'mean_price': df['Price'].mean(),
                'median_price': df['Price'].median(),
                'std_price': df['Price'].std()
            }
        }
    }
    
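    # Calculate data points per city-commodity combination
    # observed=True counts only combinations that occur, since City and Commodity are categorical
    city_commodity_counts = df.groupby(['City', 'Commodity'], observed=True).size()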
    exploration_results['data_coverage']['data_points_per_city_commodity'] = {
        'min': city_commodity_counts.min(),
        'max': city_commodity_counts.max(),
        'mean': city_commodity_counts.mean(),
        'median': city_commodity_counts.median()
    }
    
    return exploration_results

# Load the data
df_consolidated = load_consolidated_data(input_file)

# Explore the dataset
exploration_results = explore_dataset(df_consolidated)

# Display exploration results
print("📊 Dataset Exploration Results")
print("=" * 60)

# Basic info
basic_info = exploration_results['basic_info']
print(f"📋 Basic Information:")
print(f"   Total Rows: {basic_info['total_rows']:,}")
print(f"   Total Columns: {basic_info['total_columns']}")
print(f"   Memory Usage: {basic_info['memory_usage_mb']:.2f} MB")
print(f"   Date Range: {basic_info['date_range']['min_date']} to {basic_info['date_range']['max_date']}")
print(f"   Total Days: {basic_info['date_range']['total_days']}")

# Data coverage
coverage = exploration_results['data_coverage']
print(f"\n🏙️ Data Coverage:")
print(f"   Cities: {coverage['total_cities']}")
print(f"   Commodities: {coverage['total_commodities']}")

# Data points per city-commodity
data_points = coverage['data_points_per_city_commodity']
print(f"\n📊 Data Points per City-Commodity:")
print(f"   Min: {data_points['min']}")
print(f"   Max: {data_points['max']}")
print(f"   Mean: {data_points['mean']:.1f}")
print(f"   Median: {data_points['median']:.1f}")

# Data quality
quality = exploration_results['data_quality']
print(f"\n✅ Data Quality:")
print(f"   Missing Values: {sum(quality['missing_values'].values())}")
print(f"   Duplicate Rows: {quality['duplicate_rows']}")

# Price statistics
price_stats = quality['price_stats']
print(f"\n💰 Price Statistics:")
print(f"   Range: {price_stats['min_price']:,.0f} - {price_stats['max_price']:,.0f} IDR")
print(f"   Mean: {price_stats['mean_price']:,.0f} IDR")
print(f"   Median: {price_stats['median_price']:,.0f} IDR")
print(f"   Std Dev: {price_stats['std_price']:,.0f} IDR")

print("\n🏙️ Cities in Dataset:")
for i, city in enumerate(coverage['cities_list'], 1):
    print(f"   {i:2d}. {city}")

print("\n🥬 Commodities in Dataset:")
for i, commodity in enumerate(coverage['commodities_list'], 1):
    print(f"   {i:2d}. {commodity}")

print("=" * 60)


2025-10-04 20:15:10,295 - INFO - Loading data from: data\processed\food_prices_consolidated.csv
2025-10-04 20:15:10,841 - INFO - Data loaded successfully: (900460, 5)


📊 Dataset Exploration Results
📋 Basic Information:
   Total Rows: 900,460
   Total Columns: 5
   Memory Usage: 16.32 MB
   Date Range: 2020-01-01 00:00:00 to 2024-12-31 00:00:00
   Total Days: 1826

🏙️ Data Coverage:
   Cities: 69
   Commodities: 10
   Provinces: 0

📊 Data Points per City-Commodity:
   Min: 1305
   Max: 1306
   Mean: 1305.0
   Median: 1305.0

✅ Data Quality:
   Missing Values: 10
   Duplicate Rows: 0

💰 Price Statistics:
   Range: 1 - 218,750 IDR
   Mean: 39,354 IDR
   Median: 29,250 IDR
   Std Dev: 33,612 IDR

🏙️ Cities in Dataset:
    1. Kab. Banyumas
    2. Kab. Banyuwangi
    3. Kab. Boyolali
    4. Kab. Bulungan
    5. Kab. Bungo
    6. Kab. Cilacap
    7. Kab. Cirebon
    8. Kab. Jember
    9. Kab. Karanganyar
   10. Kab. Klaten
   11. Kab. Kudus
   12. Kab. Sragen
   13. Kab. Sukoharjo
   14. Kab. Sumenep
   15. Kab. Tasikmalaya
   16. Kab. Wonogiri
   17. Kota Balikpapan
   18. Kota Banda Aceh
   19. Kota Bandar Lampung
   20. Kota Bandung
   21. Kota Banjarmas

## Feature Extraction Functions

Define functions to extract the three key features from time series data for each city-commodity combination.
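
The per-pair functions in this section are applied inside explicit loops later on. As a possible cross-check, the mean and CV can also be computed in one vectorized `groupby` pass. The sketch below uses invented data and hypothetical names (`toy`, `summary`), and is an alternative formulation rather than the notebook's method.

```python
import pandas as pd

# Hypothetical long-format data: two cities, one commodity.
toy = pd.DataFrame({
    "City": ["A", "A", "A", "B", "B", "B"],
    "Commodity": ["Beras"] * 6,
    "Price": [100.0, 110.0, 120.0, 200.0, 200.0, 200.0],
})

# Named aggregation: one row per (City, Commodity) with mean and sample std
summary = (
    toy.groupby(["City", "Commodity"], observed=True)["Price"]
    .agg(avg="mean", std="std")
)
summary["cv"] = summary["std"] / summary["avg"]
print(summary)
```

City B has a constant price, so its CV is zero, which matches the reading of CV as a volatility measure.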


In [19]:
"""
Function 1: Price Average Calculation
"""

def calculate_price_average(price_series: pd.Series) -> float:
    """
    Calculate the mean price over the time period.
    
    Args:
        price_series: Series of prices for a city-commodity combination
        
    Returns:
        Mean price value
    """
    if len(price_series) == 0 or price_series.isnull().all():
        return np.nan
    
    return price_series.mean()


In [20]:
"""
Function 2: Coefficient of Variation Calculation
"""

def calculate_coefficient_of_variation(price_series: pd.Series) -> float:
    """
    Calculate the coefficient of variation (CV) as a measure of price volatility.
    
    CV = standard_deviation / mean
    
    Args:
        price_series: Series of prices for a city-commodity combination
        
    Returns:
        Coefficient of variation (volatility measure)
    """
    if len(price_series) == 0 or price_series.isnull().all():
        return np.nan
    
    # Remove null values
    clean_prices = price_series.dropna()
    
    if len(clean_prices) < 2:
        return np.nan
    
    mean_price = clean_prices.mean()
    std_price = clean_prices.std()
    
    # Avoid division by zero
    if mean_price == 0:
        return np.nan
    
    cv = std_price / mean_price
    return cv


In [21]:
"""
Function 3: Price Trend Calculation
"""

def calculate_price_trend(price_series: pd.Series, date_series: pd.Series, method: str = "linear_regression") -> float:
    """
    Calculate the price trend (slope) over time.
    
    Args:
        price_series: Series of prices for a city-commodity combination
        date_series: Corresponding series of dates
        method: Method for trend calculation ('linear_regression' or 'simple_slope')
        
    Returns:
        Price trend (slope coefficient)
    """
    if len(price_series) == 0 or price_series.isnull().all():
        return np.nan
    
    # Create DataFrame and remove null values
    df_temp = pd.DataFrame({'price': price_series, 'date': date_series}).dropna()
    
    if len(df_temp) < 3:  # Need at least 3 points for trend
        return np.nan
    
    # Sort by date
    df_temp = df_temp.sort_values('date')
    
    if method == "linear_regression":
        # Convert dates to numeric (days since first date)
        df_temp['days'] = (df_temp['date'] - df_temp['date'].min()).dt.days
        
        # Fit linear regression
        X = df_temp['days'].values.reshape(-1, 1)
        y = df_temp['price'].values
        
        try:
            model = LinearRegression()
            model.fit(X, y)
            slope = model.coef_[0]
            return slope
        except Exception:
            return np.nan
            
    elif method == "simple_slope":
        # Simple slope calculation: (last_price - first_price) / days
        first_price = df_temp['price'].iloc[0]
        last_price = df_temp['price'].iloc[-1]
        total_days = (df_temp['date'].iloc[-1] - df_temp['date'].iloc[0]).days
        
        if total_days == 0:
            return np.nan
        
        slope = (last_price - first_price) / total_days
        return slope
    
    else:
        raise ValueError(f"Unknown trend method: {method}")
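
The two `method` options can give noticeably different slopes when the endpoints are noisy. A minimal sketch on made-up data follows; the series, the end-of-period spike, and the use of `np.polyfit` in place of `LinearRegression` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical series: a steady +2 IDR/day rise with a one-day spike at the end.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
prices = pd.Series(1000.0 + 2.0 * np.arange(10))
prices.iloc[-1] += 50.0  # outlier on the last day

days = (dates - dates[0]).days.to_numpy()  # days since the first date

# Least-squares slope uses every point, so a single spike is damped
regression_slope = np.polyfit(days, prices.to_numpy(), 1)[0]
# Endpoint slope depends only on the first and last prices
simple_slope = (prices.iloc[-1] - prices.iloc[0]) / (days[-1] - days[0])

print(f"regression={regression_slope:.3f}, simple={simple_slope:.3f}")
```

In this toy case the endpoint slope is pulled much further from the underlying +2 IDR/day than the regression slope, which is one reason to prefer the regression option for noisy price data.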


In [None]:
"""
Function 4: Feature Matrix Creation
"""

def create_feature_matrix(df: pd.DataFrame, config: FeatureEngineeringConfig) -> pd.DataFrame:
    """
    Create the feature matrix from consolidated time series data.
    
    Args:
        df: Consolidated DataFrame with time series data
        config: Configuration object
        
    Returns:
        DataFrame with cities as rows and features as columns
    """
    logger.info("Starting feature matrix creation")
    
    # Initialize results list
    feature_rows = []
    
    # Get unique cities
    cities = df['City'].unique()
    logger.info(f"Processing {len(cities)} cities")
    
    for city in cities:
        logger.debug(f"Processing city: {city}")
        
        # Filter data for this city
        city_data = df[df['City'] == city].copy()
        
        # Initialize feature row
        feature_row = {'City': city}
        
        # Process each commodity
        for commodity in config.expected_commodities:
            # Filter data for this commodity
            commodity_data = city_data[city_data['Commodity'] == commodity].copy()
            
            # Build the three output column names once for this commodity
            avg_col = config.feature_name_format.format(commodity=commodity, feature_type="avg")
            cv_col = config.feature_name_format.format(commodity=commodity, feature_type="cv")
            trend_col = config.feature_name_format.format(commodity=commodity, feature_type="trend")
            
            if len(commodity_data) < config.min_data_points:
                logger.warning(f"Insufficient data for {city} - {commodity}: {len(commodity_data)} points")
                # Set features to NaN for insufficient data
                feature_row[avg_col] = np.nan
                feature_row[cv_col] = np.nan
                feature_row[trend_col] = np.nan
                continue
            
            # Sort by date
            commodity_data = commodity_data.sort_values('Date')
            
            # Extract features
            price_avg = calculate_price_average(commodity_data['Price'])
            price_cv = calculate_coefficient_of_variation(commodity_data['Price'])
            price_trend = calculate_price_trend(
                commodity_data['Price'], 
                commodity_data['Date'], 
                config.trend_method
            )
            
            # Add to feature row
            feature_row[avg_col] = price_avg
            feature_row[cv_col] = price_cv
            feature_row[trend_col] = price_trend
            
            logger.debug(f"  {commodity}: avg={price_avg:.0f}, cv={price_cv:.3f}, trend={price_trend:.3f}")
        
        feature_rows.append(feature_row)
    
    # Create DataFrame
    feature_matrix = pd.DataFrame(feature_rows)
    
    # Set City as index
    feature_matrix = feature_matrix.set_index('City')
    
    logger.info(f"Feature matrix created: {feature_matrix.shape}")
    return feature_matrix


## Execute Feature Engineering

Create the feature matrix from the consolidated time series data.
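
Before running the pipeline, it can help to know which column names to expect. The sketch below derives them from the `{commodity}_{feature_type}` format string in the configuration; the three-commodity subset and the variable names are chosen here purely for illustration.

```python
# Subset of the configured commodities, for illustration only
commodities = ["Beras", "Telur Ayam", "Daging Ayam"]
feature_types = ["avg", "cv", "trend"]
name_format = "{commodity}_{feature_type}"

# Commodity-major order: all three features of one commodity, then the next
expected_columns = [
    name_format.format(commodity=c, feature_type=t)
    for c in commodities
    for t in feature_types
]
print(len(expected_columns), expected_columns[:4])
```

With all 10 configured commodities the same loop would yield 3 × 10 = 30 names, which is the width the feature matrix should have.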


In [23]:
"""
Execute Feature Engineering Pipeline
"""

# Create the feature matrix
try:
    feature_matrix = create_feature_matrix(df_consolidated, config)
    
    print("🎉 Feature Engineering Successful!")
    print("=" * 50)
    print(f"📊 Feature Matrix Shape: {feature_matrix.shape}")
    print(f"🏙️ Cities: {len(feature_matrix)}")
    print(f"🔢 Features: {len(feature_matrix.columns)}")
    
    # Calculate memory usage
    memory_usage_mb = feature_matrix.memory_usage(deep=True).sum() / 1024**2
    print(f"💾 Memory Usage: {memory_usage_mb:.2f} MB")
    
    # Show feature columns
    print(f"\n📋 Feature Columns ({len(feature_matrix.columns)}):")
    feature_cols = feature_matrix.columns.tolist()
    
    # Group by feature type for better display
    avg_features = [col for col in feature_cols if col.endswith('_avg')]
    cv_features = [col for col in feature_cols if col.endswith('_cv')]
    trend_features = [col for col in feature_cols if col.endswith('_trend')]
    other_features = [col for col in feature_cols if not any(col.endswith(suffix) for suffix in ['_avg', '_cv', '_trend'])]
    
    if other_features:
        print(f"   Metadata: {', '.join(other_features)}")
    print(f"   Average Features ({len(avg_features)}): {', '.join(avg_features[:3])}{'...' if len(avg_features) > 3 else ''}")
    print(f"   CV Features ({len(cv_features)}): {', '.join(cv_features[:3])}{'...' if len(cv_features) > 3 else ''}")
    print(f"   Trend Features ({len(trend_features)}): {', '.join(trend_features[:3])}{'...' if len(trend_features) > 3 else ''}")
    
    # Check for missing values
    missing_values = feature_matrix.isnull().sum().sum()
    print(f"\n❓ Missing Values: {missing_values}")
    
    if missing_values > 0:
        print("   Missing values by column:")
        missing_by_col = feature_matrix.isnull().sum()
        for col, count in missing_by_col[missing_by_col > 0].items():
            print(f"     {col}: {count}")
    
    print("=" * 50)
    
except Exception as e:
    logger.error(f"Feature engineering failed: {str(e)}")
    print(f"❌ Error: {str(e)}")
    raise


2025-10-04 20:15:26,355 - INFO - Starting feature matrix creation
2025-10-04 20:15:26,359 - INFO - Processing 69 cities
2025-10-04 20:15:27,932 - INFO - Feature matrix created: (69, 30)


🎉 Feature Engineering Successful!
📊 Feature Matrix Shape: (69, 30)
🏙️ Cities: 69
🔢 Features: 30
💾 Memory Usage: 0.02 MB

📋 Feature Columns (30):
   Average Features (10): Beras_avg, Telur Ayam_avg, Daging Ayam_avg...
   CV Features (10): Beras_cv, Telur Ayam_cv, Daging Ayam_cv...
   Trend Features (10): Beras_trend, Telur Ayam_trend, Daging Ayam_trend...

❓ Missing Values: 0


## Feature Analysis and Validation

Analyze the extracted features to understand their distributions and relationships.
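
The feature types live on very different scales: averages are in the tens of thousands of IDR while CV values stay below 1. A distance-based clustering step would therefore be dominated by the averages unless the columns are rescaled. The sketch below shows a column-wise z-score on invented values; it uses the population standard deviation (`ddof=0`), the same convention as scikit-learn's `StandardScaler`, and is only an assumption about how the later clustering phase might scale the data.

```python
import pandas as pd

# Hypothetical three-city slice with one feature of each type.
features = pd.DataFrame(
    {
        "Beras_avg": [11000.0, 12500.0, 14000.0],
        "Beras_cv": [0.03, 0.05, 0.10],
        "Beras_trend": [1.5, 2.0, 4.0],
    },
    index=["City A", "City B", "City C"],
)

# Column-wise z-score: subtract the mean, divide by the population std
scaled = (features - features.mean()) / features.std(ddof=0)
print(scaled.round(3))
```

After scaling, every column has mean 0 and unit variance, so no single feature type dominates Euclidean distances.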


In [None]:
"""
Feature Validation and Statistical Analysis
"""

def validate_feature_matrix(feature_matrix: pd.DataFrame) -> Dict[str, Any]:
    """
    Validate the feature matrix and compute statistical summaries.
    
    Args:
        feature_matrix: DataFrame with extracted features
        
    Returns:
        Dictionary with validation results
    """
    # Separate numeric and non-numeric columns
    numeric_cols = feature_matrix.select_dtypes(include=[np.number]).columns.tolist()
    non_numeric_cols = feature_matrix.select_dtypes(exclude=[np.number]).columns.tolist()
    
    validation_results = {
        'basic_info': {
            'total_cities': len(feature_matrix),
            'total_features': len(numeric_cols),
            # Distinct key for the count: a repeated 'metadata_columns' key would be silently overwritten
            'metadata_column_count': len(non_numeric_cols),
            'numeric_columns': numeric_cols,
            'metadata_columns': non_numeric_cols
        },
        'data_quality': {
            'missing_values_total': feature_matrix[numeric_cols].isnull().sum().sum(),
            'missing_values_by_column': feature_matrix[numeric_cols].isnull().sum().to_dict(),
            'complete_cases': len(feature_matrix.dropna(subset=numeric_cols)),
            'incomplete_cases': len(feature_matrix) - len(feature_matrix.dropna(subset=numeric_cols))
        },
        'feature_statistics': {}
    }
    
    # Calculate statistics for each feature type
    for feature_type in ['avg', 'cv', 'trend']:
        type_cols = [col for col in numeric_cols if col.endswith(f'_{feature_type}')]
        
        if type_cols:
            type_data = feature_matrix[type_cols]
            validation_results['feature_statistics'][feature_type] = {
                'count': len(type_cols),
                'min': type_data.min().min(),
                'max': type_data.max().max(),
                'mean': type_data.mean().mean(),
                'median': type_data.median().median(),
                'std': type_data.std().mean(),
                'missing_rate': type_data.isnull().sum().sum() / (len(type_data) * len(type_cols))
            }
    
    return validation_results

# Validate the feature matrix
validation_results = validate_feature_matrix(feature_matrix)

print("🔍 Feature Matrix Validation Results")
print("=" * 60)

# Basic info
basic_info = validation_results['basic_info']
print(f"📊 Basic Information:")
print(f"   Total Cities: {basic_info['total_cities']}")
print(f"   Numeric Features: {basic_info['total_features']}")
print(f"   Metadata Columns: {basic_info['metadata_columns']}")

# Data quality
quality = validation_results['data_quality']
print(f"\n✅ Data Quality:")
print(f"   Total Missing Values: {quality['missing_values_total']}")
print(f"   Complete Cases: {quality['complete_cases']}")
print(f"   Incomplete Cases: {quality['incomplete_cases']}")

# Feature statistics by type
print(f"\n📈 Feature Statistics by Type:")
# Loop variable is named type_stats so it does not shadow the scipy `stats` import
for feature_type, type_stats in validation_results['feature_statistics'].items():
    print(f"   {feature_type.upper()} Features ({type_stats['count']}):")
    print(f"     Range: {type_stats['min']:.3f} to {type_stats['max']:.3f}")
    print(f"     Mean: {type_stats['mean']:.3f}")
    print(f"     Std Dev: {type_stats['std']:.3f}")
    print(f"     Missing Rate: {type_stats['missing_rate']:.1%}")

# Show sample of feature matrix
print(f"\n📋 Sample Feature Matrix (first 5 cities):")
display_cols = feature_matrix.columns[:10].tolist()  # Show first 10 features
print(feature_matrix[display_cols].head().to_string())

print("=" * 60)


🔍 Feature Matrix Validation Results
📊 Basic Information:
   Total Cities: 69
   Numeric Features: 30
   Metadata Columns: []

✅ Data Quality:
   Total Missing Values: 0
   Complete Cases: 69
   Incomplete Cases: 0

📈 Feature Statistics by Type:
   AVG Features (10):
     Range: 11230.077 to 149930.996
     Mean: 39353.539
     Std Dev: 4281.201
     Missing Rate: 0.0%
   CV Features (10):
     Range: 0.030 to 0.499
     Mean: 0.181
     Std Dev: 0.028
     Missing Rate: 0.0%
   TREND Features (10):
     Range: -9.398 to 24.917
     Mean: 4.750
     Std Dev: 1.718
     Missing Rate: 0.0%

📋 Sample Feature Matrix (first 5 cities):
                     Beras_avg  Beras_cv  Beras_trend  Telur Ayam_avg  Telur Ayam_cv  Telur Ayam_trend  Daging Ayam_avg  Daging Ayam_cv  Daging Ayam_trend  Daging Sapi_avg
City                                                                                                                                                                       
Kota Banda Aceh   1

## Export Feature Matrix

Save the feature matrix for use in clustering experiments.
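
Since the matrix is indexed by city, the index is written as the first CSV column and has to be named explicitly when the file is read back. The round trip below uses an in-memory buffer and invented values; the reader-side call is an assumption about how a downstream clustering notebook might load the file.

```python
import io
import pandas as pd

# Hypothetical two-city matrix with a named index, mirroring the real layout.
matrix = pd.DataFrame(
    {"Beras_avg": [11000.0, 12500.0], "Beras_cv": [0.03, 0.05]},
    index=pd.Index(["City A", "City B"], name="City"),
)

buffer = io.StringIO()
matrix.to_csv(buffer)  # the "City" index becomes the first column
buffer.seek(0)

# index_col restores the city names as the index instead of a data column
restored = pd.read_csv(buffer, index_col="City")
print(restored)
```

Omitting `index_col` would leave `City` as an ordinary text column, which a numeric scaler or clustering routine would then reject.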


In [25]:
"""
Export Feature Matrix to Multiple Formats
"""

def export_feature_matrix(feature_matrix: pd.DataFrame, config: FeatureEngineeringConfig) -> Dict[str, str]:
    """
    Export feature matrix to multiple formats.
    
    Args:
        feature_matrix: DataFrame with extracted features
        config: Configuration object for metadata
        
    Returns:
        Dictionary with export file paths
    """
    features_dir = Path("data/features")
    # parents=True also creates "data/" when it does not exist yet
    features_dir.mkdir(parents=True, exist_ok=True)
    
    # Generate filename with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    base_filename = f"feature_matrix_{timestamp}"
    
    export_paths = {}
    
    # Export to different formats based on configuration
    for export_format in config.export_formats:
        if export_format == "csv":
            csv_path = features_dir / f"{base_filename}.csv"
            feature_matrix.to_csv(csv_path)
            export_paths['csv'] = str(csv_path)
            
        elif export_format == "excel":
            excel_path = features_dir / f"{base_filename}.xlsx"
            feature_matrix.to_excel(excel_path)
            export_paths['excel'] = str(excel_path)
            
        elif export_format == "json":
            json_path = features_dir / f"{base_filename}.json"
            feature_matrix.to_json(json_path, orient='index', indent=2)
            export_paths['json'] = str(json_path)
    
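    # Export metadata
    # Note: feature statistics are read from the notebook-level `validation_results`
    metadata = {
        'export_timestamp': datetime.now().isoformat(),
        # model_dump() is the Pydantic v2 replacement for the deprecated dict()
        'configuration': config.model_dump(),
        'feature_matrix_info': {
            'shape': list(feature_matrix.shape),
            'cities': len(feature_matrix),
            'features': len(feature_matrix.select_dtypes(include=[np.number]).columns),
            # Cast to a plain int so JSON stores a number, not the string form of np.int64
            'missing_values': int(feature_matrix.isnull().sum().sum()),
            'feature_columns': feature_matrix.columns.tolist(),
            'city_names': feature_matrix.index.tolist()
        },
        'feature_statistics': validation_results['feature_statistics']
    }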
    
    metadata_path = features_dir / f"{base_filename}_metadata.json"
    import json
    with open(metadata_path, 'w') as f:
        json.dump(metadata, f, indent=2, default=str)
    export_paths['metadata'] = str(metadata_path)
    
    return export_paths

# Export the feature matrix
try:
    export_paths = export_feature_matrix(feature_matrix, config)
    
    print("💾 Feature Matrix Export Successful!")
    print("=" * 50)
    for format_type, file_path in export_paths.items():
        if format_type != 'metadata':
            file_size = Path(file_path).stat().st_size / 1024  # KB
            print(f"📄 {format_type.upper()}: {file_path}")
            print(f"   Size: {file_size:.1f} KB")
    
    print(f"📄 METADATA: {export_paths['metadata']}")
    
    print(f"\n✅ Feature matrix ready for clustering experiments!")
    print(f"📁 Files saved in: data/features/")
    print(f"🔢 Matrix Shape: {feature_matrix.shape[0]} cities × {feature_matrix.shape[1]} features")
    
    # Prepare summary for next phase
    numeric_features = feature_matrix.select_dtypes(include=[np.number]).columns.tolist()
    print(f"\n📋 Ready for Clustering:")
    print(f"   ✅ {len(numeric_features)} numeric features extracted")
    print(f"   ✅ {len(feature_matrix)} cities ready for clustering")
    print(f"   ✅ Feature scaling will be needed before clustering")
    
except Exception as e:
    logger.error(f"Export failed: {e}")
    print(f"❌ Export Error: {e}")
    raise


💾 Feature Matrix Export Successful!
📄 CSV: data\features\feature_matrix_20251004_201543.csv
   Size: 39.7 KB
📄 EXCEL: data\features\feature_matrix_20251004_201543.xlsx
   Size: 30.9 KB
📄 METADATA: data\features\feature_matrix_20251004_201543_metadata.json

✅ Feature matrix ready for clustering experiments!
📁 Files saved in: data/features/
🔢 Matrix Shape: 69 cities × 30 features

📋 Ready for Clustering:
   ✅ 30 numeric features extracted
   ✅ 69 cities ready for clustering
   ✅ Feature scaling will be needed before clustering




## Summary and Next Steps

### What We Accomplished:
1. ✅ **Data Loading**: Successfully loaded consolidated time series data
2. ✅ **Feature Extraction**: Extracted 3 key features per commodity:
   - **Price Average**: Mean price over the time period
   - **Coefficient of Variation**: Volatility measure (std/mean)
   - **Price Trend**: Linear regression slope (price change over time)
3. ✅ **Feature Matrix**: Created 30-feature matrix (3 features × 10 commodities)
4. ✅ **Validation**: Comprehensive feature quality checks and statistics
5. ✅ **Export**: Multiple output formats (CSV, Excel) with metadata
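
As a rough illustration of item 2, the three per-commodity statistics can be computed from a single price series as sketched below. This is not the notebook's extraction code: `summarize_series` is a hypothetical name, `np.polyfit` stands in for a fitted linear regression, the time axis is assumed to be the integer observation index, and the sample standard deviation (`ddof=1`) is an assumption that may differ from the convention used earlier in the notebook.

```python
import numpy as np


def summarize_series(prices):
    """Return (mean, coefficient of variation, linear trend slope) for one price series."""
    y = np.asarray(prices, dtype=float)
    mean = y.mean()
    # Coefficient of variation: standard deviation relative to the mean.
    # ddof=1 (sample standard deviation) is an assumption.
    cv = y.std(ddof=1) / mean
    # Slope of an ordinary least-squares line fitted against the observation
    # index, i.e. the average price change per time step.
    x = np.arange(len(y))
    slope = np.polyfit(x, y, deg=1)[0]
    return mean, cv, slope


# Toy series rising by 100 per step (values invented for illustration).
prices = [10000, 10100, 10200, 10300, 10400]
mean, cv, slope = summarize_series(prices)
print(f"mean={mean:.1f}, cv={cv:.4f}, slope={slope:.2f}")
```

Applying such a function to each of the 10 commodities for one city yields that city's 30-column row.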

### Key Features Created:
- **30 Numeric Features**: Ready for clustering algorithms
- **City-Level Data**: Each row represents one city with all commodity features
- **Business Interpretable**: Each feature has clear meaning for market analysis
- **Quality Validated**: Statistical summaries and missing value analysis completed

### Feature Engineering Results:
- **Input**: Time series price data (daily observations)
- **Output**: Feature matrix suitable for clustering
- **Transformation**: Time series → Statistical summaries
- **Scalability**: Designed for easy addition of new commodities or time periods

### Next Steps:
1. **Clustering Experiments** (`03_clustering_experiments.ipynb`):
   - Feature scaling (StandardScaler recommended)
   - Optimal K selection (Elbow method + Silhouette analysis)
   - K-Means clustering implementation
   - Fuzzy C-Means clustering implementation
   - Spectral clustering implementation
   - Algorithm comparison and evaluation

2. **Expected Workflow**:
   - Load feature matrix from `data/features/`
   - Apply feature scaling
   - Determine optimal number of clusters (k=3 to k=8 expected)
   - Compare clustering algorithms
   - Interpret business meaning of clusters

3. **Clustering Preparation**:
   - ✅ Feature matrix ready
   - ✅ No missing values (validation found 0 missing values across all 69 cities)
   - ⏳ Feature scaling needed before clustering
   - ⏳ Dimensionality reduction for visualization (PCA → t-SNE/UMAP)
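
The scaling and K-selection steps listed above could look roughly like the sketch below. It is an assumption about how the next notebook might proceed, not its actual code: `select_k_by_silhouette` is a hypothetical helper, only K-Means with silhouette scoring is shown, and the input is synthetic data shaped like the feature matrix (68 rows × 30 columns) rather than the real export.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def select_k_by_silhouette(features, k_values=range(3, 9), random_state=42):
    """Hypothetical helper: scale features, then score K-Means for each k."""
    # Standardize so that price averages (tens of thousands) do not dominate
    # CV and trend features, which live on much smaller scales.
    scaled = StandardScaler().fit_transform(features)
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(scaled)
        scores[k] = silhouette_score(scaled, labels)
    # Highest mean silhouette score is taken as the preferred k.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for the feature matrix: 4 groups of 17 rows, 30 columns.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(4, 30))
synthetic = pd.DataFrame(
    np.vstack([center + rng.normal(size=(17, 30)) for center in centers]),
    columns=[f"feature_{i}" for i in range(30)],
)

best_k, scores = select_k_by_silhouette(synthetic)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Preferred k: {best_k}")
```

The default `k_values` mirrors the k=3 to k=8 range expected above. Silhouette scores are best read alongside the elbow plot rather than as the sole criterion.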

The feature engineering phase is complete! The feature matrix is ready for clustering analysis. 🚀
