# Model Performance Comparison: Before and After LLM Feature Engineering

This notebook demonstrates the impact of automated feature engineering using `llm-feat` on model performance. We'll:

1. Train a baseline model using all original features from the dataset
2. Use `llm-feat` to automatically generate additional meaningful features with:
   - Comprehensive metadata descriptions
   - Problem-specific context via `problem_description`
   - Detailed feature report explaining the domain understanding
3. Train a model with the engineered features
4. Compare the two models on RMSE and R²

## Setup and Data Loading


In [1]:
# Import required libraries
import os
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes  # Built-in dataset, no download needed
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

import llm_feat

print(f"llm-feat version: {llm_feat.__version__}")
print("✓ Package imported successfully!")


llm-feat version: 0.2.0
✓ Package imported successfully!


In [None]:
# Set your OpenAI API key
# Option 1: Read it from the OPENAI_API_KEY environment variable (RECOMMENDED for production)
api_key = os.environ.get("OPENAI_API_KEY")

# Option 2: Set directly (for testing only - remove before committing!)
# Uncomment and set your key here if the environment variable is not set:
# api_key = "<OPENAI_API_KEY>"

if api_key:
    llm_feat.set_api_key(api_key)
    print("✓ API key set")
else:
    print("⚠️  OPENAI_API_KEY not set. Set it using:")
    print("   export OPENAI_API_KEY='your-key-here' (before starting Jupyter)")
    print("   Or uncomment the line above to set it directly in the notebook")


✓ API key set


## Load and Prepare Dataset

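We'll use the **Diabetes dataset**, a built-in sklearn dataset that doesn't require any downloads. It contains 10 baseline variables for 442 diabetes patients; the target is a quantitative measure of disease progression one year after baseline. As returned by `load_diabetes()`, each feature is mean-centered and scaled, which is why the feature values printed below are small and signed.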


In [3]:
# Load Diabetes dataset (built-in, no download required)
diabetes = load_diabetes()

# Create DataFrame
df = pd.DataFrame(
    diabetes.data,
    columns=diabetes.feature_names
)
df['target'] = diabetes.target

print("="*70)
print("Diabetes Dataset Loaded Successfully")
print("="*70)
print(f"Shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few rows:")
print(df.head())
print(f"\nDataset statistics:")
print(df.describe())
print(f"\nTarget variable (disease progression):")
print(f"  Mean: {df['target'].mean():.2f}")
print(f"  Std: {df['target'].std():.2f}")
print(f"  Range: [{df['target'].min():.2f}, {df['target'].max():.2f}]")
print("="*70)


Diabetes Dataset Loaded Successfully
Shape: (442, 11)

Columns: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'target']

First few rows:
        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

         s4        s5        s6  target  
0 -0.002592  0.019907 -0.017646   151.0  
1 -0.039493 -0.068332 -0.092204    75.0  
2 -0.002592  0.002861 -0.025930   141.0  
3  0.034309  0.022688 -0.009362   206.0  
4 -0.002592 -0.031988 -0.046641   135.0  

Dataset statistics:
                age           sex           bmi            bp            s1  \
count  4.420000e+02  4.420000e+02  4.42000

## Create Metadata DataFrame

We'll create comprehensive metadata that describes each column in detail. This helps the LLM understand the medical/health domain context and generate more relevant features.


In [4]:
# Create comprehensive metadata DataFrame with detailed clinical context
metadata_df = pd.DataFrame({
    'column_name': [
        'age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'target'
    ],
    'description': [
        'Age in years - older patients typically have higher diabetes progression risk',
        'Gender (1 = male, 2 = female) - sex differences affect diabetes risk and progression',
        'Body mass index (weight in kg / (height in m)^2) - higher BMI is a major diabetes risk factor, strongly correlated with disease progression',
        'Average blood pressure (systolic/diastolic) - hypertension is a key comorbidity that accelerates diabetes progression',
        'Blood serum measurement 1: Total cholesterol (TC) - lipid metabolism marker, higher levels associated with diabetes complications',
        'Blood serum measurement 2: Low-density lipoproteins (LDL) - "bad cholesterol", elevated LDL increases cardiovascular risk in diabetics',
        'Blood serum measurement 3: High-density lipoproteins (HDL) - "good cholesterol", higher HDL is protective against diabetes progression',
        'Blood serum measurement 4: Total cholesterol to HDL ratio (TC/HDL) - important cardiovascular risk indicator, higher ratio = higher risk',
        'Blood serum measurement 5: Log of serum triglycerides level - elevated triglycerides are common in diabetes and predict progression',
        'Blood serum measurement 6: Blood sugar level (glucose) - direct diabetes marker, higher levels indicate worse disease control and progression',
        'Disease progression measure (quantitative measure of disease progression one year after baseline)'
    ],
    'data_type': ['numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 
                  'numeric', 'numeric', 'numeric', 'numeric', 'numeric'],
    'label_definition': [None, None, None, None, None, None, None, None, None, None,
                          'Quantitative measure of diabetes disease progression one year after baseline - higher values indicate worse progression. This is what we want to predict.']
})

print("Metadata DataFrame (Enhanced with Clinical Context):")
print(metadata_df)


Metadata DataFrame (Enhanced with Clinical Context):
   column_name                                        description data_type  \
0          age  Age in years - older patients typically have h...   numeric   
1          sex  Gender (1 = male, 2 = female) - sex difference...   numeric   
2          bmi  Body mass index (weight in kg / (height in m)^...   numeric   
3           bp  Average blood pressure (systolic/diastolic) - ...   numeric   
4           s1  Blood serum measurement 1: Total cholesterol (...   numeric   
5           s2  Blood serum measurement 2: Low-density lipopro...   numeric   
6           s3  Blood serum measurement 3: High-density lipopr...   numeric   
7           s4  Blood serum measurement 4: Total cholesterol t...   numeric   
8           s5  Blood serum measurement 5: Log of serum trigly...   numeric   
9           s6  Blood serum measurement 6: Blood sugar level (...   numeric   
10      target  Disease progression measure (quantitative meas...   numeric   
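
Because the metadata is assembled by hand, it may be worth confirming that it covers every column before passing it to the LLM. The check below is a minimal sketch: `check_metadata` is a hypothetical helper, not part of `llm-feat`, and the toy frames only stand in for `df` and `metadata_df`.

```python
import pandas as pd


def check_metadata(df: pd.DataFrame, metadata_df: pd.DataFrame) -> dict:
    """Compare DataFrame columns with the names listed in the metadata.

    Hypothetical helper for illustration; not part of llm-feat.
    """
    data_cols = set(df.columns)
    meta_cols = set(metadata_df["column_name"])
    duplicated = metadata_df.loc[
        metadata_df["column_name"].duplicated(), "column_name"
    ]
    return {
        "missing_from_metadata": sorted(data_cols - meta_cols),
        "unknown_in_metadata": sorted(meta_cols - data_cols),
        "duplicated_in_metadata": sorted(duplicated),
    }


# Toy stand-ins for the notebook's df and metadata_df
toy_df = pd.DataFrame({"age": [0.1], "bmi": [0.2], "target": [151.0]})
toy_meta = pd.DataFrame({
    "column_name": ["age", "bmi", "bp"],
    "description": ["Age", "Body mass index", "Blood pressure"],
})

report = check_metadata(toy_df, toy_meta)
print(report)
```

In the notebook itself, `check_metadata(df, metadata_df)` would be expected to return three empty lists, since the metadata names all eleven columns.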

## Baseline Model Performance (Using All Original Features)

First, let's establish a baseline by training a model on all original features from the dataset.


In [5]:
# Prepare data for baseline model - USING ALL ORIGINAL FEATURES
X_baseline = df.drop('target', axis=1)
y_baseline = df['target']

# Split into train and test sets
X_train_baseline, X_test_baseline, y_train_baseline, y_test_baseline = train_test_split(
    X_baseline, y_baseline, test_size=0.2, random_state=42
)

# Train baseline Random Forest model with same hyperparameters for fair comparison
print("Training baseline model with all original features...")
baseline_model = RandomForestRegressor(
    n_estimators=200,  # Same as engineered model for fair comparison
    max_depth=15,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)
baseline_model.fit(X_train_baseline, y_train_baseline)

# Make predictions
baseline_preds = baseline_model.predict(X_test_baseline)

# Calculate metrics
baseline_rmse = np.sqrt(mean_squared_error(y_test_baseline, baseline_preds))
baseline_r2 = r2_score(y_test_baseline, baseline_preds)

print("\n" + "="*70)
print("BASELINE MODEL PERFORMANCE (All Original Features)")
print("="*70)
print(f"RMSE: {baseline_rmse:.4f}")
print(f"R² Score: {baseline_r2:.4f}")
print(f"Number of features: {X_baseline.shape[1]}")
print("="*70)


Training baseline model with all original features...

BASELINE MODEL PERFORMANCE (All Original Features)
RMSE: 54.3267
R² Score: 0.4429
Number of features: 10
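
The score above comes from a single 80/20 split, so the test set holds only 89 of the 442 rows. One way to see how much the RMSE moves with the choice of split is K-fold cross-validation. The sketch below is an assumption-laden illustration rather than part of the original workflow: `cv_rmse` is a hypothetical helper, and it uses 100 trees with default depth instead of the notebook's hyperparameters to keep it quick.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score


def cv_rmse(X, y, n_splits=5, seed=42):
    """Return the per-fold RMSE of a Random Forest under K-fold CV.

    Hypothetical helper for illustration only.
    """
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(
        model, X, y, cv=folds, scoring="neg_root_mean_squared_error"
    )
    return -scores  # scikit-learn reports the negated RMSE


data = load_diabetes(as_frame=True)
fold_rmse = cv_rmse(data.data, data.target)
print(f"Mean RMSE: {fold_rmse.mean():.2f} (std across folds: {fold_rmse.std():.2f})")
```

The same function could be applied to the engineered feature matrix, so that both feature sets are scored on identical folds.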


## Generate Features Using LLM-Feat

Now we'll use `llm-feat` to automatically generate meaningful features. We'll provide:
- **Comprehensive metadata** describing each column
- **Problem description** with domain-specific context
- **Feature report** to understand the reasoning behind generated features


In [11]:
# Define problem description - detailed content written naturally
problem_description = """
We are predicting diabetes disease progression using baseline patient measurements.

The most important thing to understand is that lipid profile ratios are way more predictive than individual lipid values. Specifically, the LDL to HDL ratio, which is s2 divided by s3, tells us a lot about cardiovascular risk and diabetes progression. Higher ratios mean worse outcomes. The total cholesterol to HDL ratio, which is s1 divided by s3, is already captured in s4 but we can create variations of it. The triglycerides to HDL ratio is also really important as an indicator of metabolic syndrome. These ratios are clinically more meaningful than looking at the individual values separately.

BMI interactions are absolutely critical. When you multiply BMI with blood pressure, you are capturing the combined effect of obesity and hypertension, which together create very high risk. BMI times age captures how older obese patients have worse outcomes. BMI times blood sugar, which is s6, shows how obese patients with high glucose progress faster. And do not forget BMI squared because the relationship between obesity and diabetes risk is not linear. Very high BMI has an exponentially worse effect.

Blood pressure interactions matter too. Blood pressure times age shows that hypertension in older patients is more dangerous. Blood pressure times lipid ratios captures the combination of hypertension and dyslipidemia, which is classic metabolic syndrome. Blood pressure times blood sugar shows that high BP combined with high glucose indicates poor diabetes control.

Age and demographic patterns are important. Age times BMI shows how age amplifies obesity risk. Age times sex captures different risk patterns by gender and age. Age times blood sugar shows that older patients with high glucose have worse outcomes.

Blood sugar interactions are also key. s6 times BMI captures high glucose plus obesity, which suggests insulin resistance. s6 times BP shows poor glucose control combined with hypertension leads to complications. s6 times lipid ratios are indicators of metabolic syndrome.

Finally, composite risk scores would be really useful. A metabolic syndrome score that combines BMI, blood pressure, lipid ratios, and glucose. A cardiovascular risk score that combines LDL to HDL ratio, blood pressure, and age. A diabetes severity score that combines glucose, BMI, and age.

The key is to focus on creating features that capture multiplicative interactions between risk factors, not just additive ones. We want clinically meaningful ratios, especially lipid ratios. We need non-linear effects like squared terms for BMI and age. And we should create composite indicators that combine multiple risk factors. Avoid simple linear combinations or features that do not have clinical meaning.
"""

print("Problem Description:")
print(problem_description)


Problem Description:

We are predicting diabetes disease progression using baseline patient measurements.

The most important thing to understand is that lipid profile ratios are way more predictive than individual lipid values. Specifically, the LDL to HDL ratio, which is s2 divided by s3, tells us a lot about cardiovascular risk and diabetes progression. Higher ratios mean worse outcomes. The total cholesterol to HDL ratio, which is s1 divided by s3, is already captured in s4 but we can create variations of it. The triglycerides to HDL ratio is also really important as an indicator of metabolic syndrome. These ratios are clinically more meaningful than looking at the individual values separately.

BMI interactions are absolutely critical. When you multiply BMI with blood pressure, you are capturing the combined effect of obesity and hypertension, which together create very high risk. BMI times age captures how older obese patients have worse outcomes. BMI times blood sugar, which is 

In [12]:
print("="*70)
print("GENERATING FEATURES WITH LLM-FEAT")
print("="*70)
print(f"Starting with {len(df.columns) - 1} original features")
print("LLM will create additional powerful features based on clinical knowledge...")
print("Using gpt-4o model for better feature quality...")
print("This may take a minute as the LLM analyzes the data and generates features...\n")

df_feng, feature_report = llm_feat.generate_features(
    df=df,
    metadata_df=metadata_df,
    mode='direct',
    model='gpt-4o',  # Using more powerful model for better features
    problem_description=problem_description,
    return_report=True  # Get detailed feature report
)

print("\n" + "="*70)
print("FEATURE GENERATION COMPLETE")
print("="*70)
print(f"Original columns: {len(df.columns)}")
print(f"New columns: {len(df_feng.columns)}")
print(f"Features added: {len(df_feng.columns) - len(df.columns)}")
print(f"\nNew feature columns:")
new_features = [col for col in df_feng.columns if col not in df.columns]
for i, feat in enumerate(new_features, 1):
    print(f"  {i}. {feat}")
print("="*70)


GENERATING FEATURES WITH LLM-FEAT
Starting with 10 original features
LLM will create additional powerful features based on clinical knowledge...
Using gpt-4o model for better feature quality...
This may take a minute as the LLM analyzes the data and generates features...


FEATURE GENERATION COMPLETE
Original columns: 11
New columns: 21
Features added: 10

New feature columns:
  1. ldl_hdl_ratio
  2. triglycerides_hdl_ratio
  3. bmi_squared
  4. bmi_bp_interaction
  5. age_bmi_interaction
  6. glucose_bmi_interaction
  7. bp_age_interaction
  8. metabolic_syndrome_score
  9. cardiovascular_risk_score
  10. diabetes_severity_score
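
To make the feature names concrete, here is a sketch of how four of them could be computed by hand with pandas. The formulas are assumptions inferred from the feature names and the problem description; the code that `llm-feat` actually generated is not shown in this notebook and may differ. The input values are the first two rows printed in the dataset preview.

```python
import pandas as pd

# First two rows of the dataset, copied from the preview output
sample = pd.DataFrame({
    "age": [0.038076, -0.001882],
    "bmi": [0.061696, -0.051474],
    "bp": [0.021872, -0.026328],
    "s2": [-0.034821, -0.019163],  # LDL
    "s3": [-0.043401, 0.074412],   # HDL
})

# Assumed formulas, based only on the feature names
manual = pd.DataFrame({
    "ldl_hdl_ratio": sample["s2"] / sample["s3"],
    "bmi_squared": sample["bmi"] ** 2,
    "bmi_bp_interaction": sample["bmi"] * sample["bp"],
    "age_bmi_interaction": sample["age"] * sample["bmi"],
})
print(manual.round(6))
```

Up to rounding of the inputs, these values agree with the sample rows printed in the "Visualize Generated Features" section, which suggests the generated features follow the same arithmetic.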


## Feature Report

The feature report provides detailed explanations of the domain understanding and rationale for each generated feature.


In [13]:
# Display the feature report
print("="*70)
print("FEATURE ENGINEERING REPORT")
print("="*70)
print(feature_report)
print("="*70)


FEATURE ENGINEERING REPORT

1. DOMAIN UNDERSTANDING:
   - The problem domain involves predicting diabetes disease progression using baseline patient measurements. The target variable is a quantitative measure of disease progression one year after baseline.
   - The business context is healthcare, specifically focusing on diabetes management and predicting how the disease will progress based on various health indicators.
   - Key relationships include the impact of obesity (BMI), blood pressure, and lipid profiles on diabetes progression. Lipid ratios and interactions between BMI, age, and blood sugar are particularly important.

2. GENERATED FEATURES EXPLANATION:
   - Feature Name: ldl_hdl_ratio
     - Description: Ratio of low-density lipoproteins to high-density lipoproteins.
     - Rationale: Higher LDL to HDL ratios indicate worse cardiovascular health, which is linked to diabetes progression.
     - Domain Relevance: This ratio is a well-known marker for cardiovascular risk, which

## Model Performance with Engineered Features

Now let's train a model using the engineered features and compare performance.


In [14]:
# Prepare data with engineered features
X_eng = df_feng.drop('target', axis=1)
y_eng = df_feng['target']

# Split into train and test sets (same random_state for fair comparison)
X_train_eng, X_test_eng, y_train_eng, y_test_eng = train_test_split(
    X_eng, y_eng, test_size=0.2, random_state=42
)

# Train model with engineered features
# Same hyperparameters as the baseline model, so only the feature set differs
print("="*70)
print("TRAINING MODEL WITH ENGINEERED FEATURES")
print("="*70)
print(f"Using {X_eng.shape[1]} features (original + LLM-generated)")
print("Training Random Forest with optimized hyperparameters...")
model_eng = RandomForestRegressor(
    n_estimators=200,  # Same as baseline model for fair comparison
    max_depth=15,       # Limit depth to prevent overfitting
    min_samples_split=5,  # Require more samples to split
    min_samples_leaf=2,   # Minimum samples in leaf
    random_state=42,
    n_jobs=-1
)
model_eng.fit(X_train_eng, y_train_eng)

# Make predictions
eng_preds = model_eng.predict(X_test_eng)

# Calculate metrics
eng_rmse = np.sqrt(mean_squared_error(y_test_eng, eng_preds))
eng_r2 = r2_score(y_test_eng, eng_preds)

print("\n" + "="*70)
print("MODEL PERFORMANCE WITH ENGINEERED FEATURES")
print("="*70)
print(f"RMSE: {eng_rmse:.4f}")
print(f"R² Score: {eng_r2:.4f}")
print(f"Number of features: {X_eng.shape[1]}")
print("="*70)


TRAINING MODEL WITH ENGINEERED FEATURES
Using 20 features (original + LLM-generated)
Training Random Forest with optimized hyperparameters...

MODEL PERFORMANCE WITH ENGINEERED FEATURES
RMSE: 53.5280
R² Score: 0.4592
Number of features: 20
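
A quick way to see whether a fitted Random Forest actually relies on the added columns is to rank its impurity-based feature importances. The sketch below is self-contained and only illustrative: the two extra columns are hand-written stand-ins with assumed formulas, not the features produced by `llm-feat`, and the forest uses default depth.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes(as_frame=True)
X_demo = data.data.copy()
y_demo = data.target

# Hand-written stand-ins for LLM-generated features (assumed formulas)
X_demo["bmi_squared"] = X_demo["bmi"] ** 2
X_demo["bmi_bp_interaction"] = X_demo["bmi"] * X_demo["bp"]

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_demo, y_demo)

# Impurity-based importances, one per column, sorted from most to least used
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

In the notebook, the equivalent call would be `pd.Series(model_eng.feature_importances_, index=X_eng.columns)`. Impurity-based importances tend to be split among correlated columns, so a derived feature and its parents may share credit.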


## Performance Comparison

Let's compare the baseline and engineered feature models side by side.


In [15]:
# Calculate improvements
rmse_improvement = ((baseline_rmse - eng_rmse) / baseline_rmse) * 100
r2_improvement = eng_r2 - baseline_r2
rmse_reduction = baseline_rmse - eng_rmse

print("="*70)
print("PERFORMANCE COMPARISON")
print("="*70)
print(f"\n{'Metric':<25} {'Baseline':<20} {'With LLM Features':<20} {'Improvement':<15}")
print("-"*70)
print(f"{'RMSE':<25} {baseline_rmse:<20.4f} {eng_rmse:<20.4f} {rmse_improvement:>14.2f}%")
print(f"{'R² Score':<25} {baseline_r2:<20.4f} {eng_r2:<20.4f} {r2_improvement:>14.4f}")
print(f"{'Number of Features':<25} {X_baseline.shape[1]:<20} {X_eng.shape[1]:<20} {X_eng.shape[1] - X_baseline.shape[1]:>14}")
print("="*70)

# Summary
print("\n" + "="*70)
print("SUMMARY: LLM FEATURE ENGINEERING IMPACT")
print("="*70)
if rmse_improvement > 0:
    print(f"🎯 RMSE improved by {rmse_improvement:.2f}%")
    print(f"   - Baseline: {baseline_rmse:.4f}")
    print(f"   - With LLM features: {eng_rmse:.4f}")
    print(f"   - Absolute reduction: {rmse_reduction:.4f}")
    if rmse_improvement > 5:
        print(f"   ⭐ Significant improvement! LLM features made a major difference.")
    elif rmse_improvement > 1:
        print(f"   ✓ Good improvement! LLM features enhanced model performance.")
    else:
        print(f"   ✓ Modest improvement.")
else:
    print(f"⚠ RMSE increased by {abs(rmse_improvement):.2f}%")
    print("   (This can happen if features add noise or model needs tuning)")

if r2_improvement > 0:
    print(f"\n🎯 R² Score improved by {r2_improvement:.4f}")
    print(f"   - Baseline: {baseline_r2:.4f}")
    print(f"   - With LLM features: {eng_r2:.4f}")
    if r2_improvement > 0.05:
        print(f"   ⭐ Major improvement in explained variance!")
    elif r2_improvement > 0.01:
        print(f"   ✓ Good improvement in model fit.")
else:
    print(f"\n⚠ R² Score decreased by {abs(r2_improvement):.4f}")

print(f"\n📊 Feature Engineering Results:")
print(f"   - Started with: {X_baseline.shape[1]} original features")
print(f"   - Generated: {len(new_features)} new features using llm-feat")
print(f"   - Total features: {X_eng.shape[1]} (original + engineered)")
print(f"\n💡 Key Insight: LLM added {len(new_features)} domain-relevant features")
print(f"   that capture important relationships in the data!")
print("="*70)


PERFORMANCE COMPARISON

Metric                    Baseline             With LLM Features    Improvement    
----------------------------------------------------------------------
RMSE                      54.3267              53.5280                        1.47%
R² Score                  0.4429               0.4592                       0.0163
Number of Features        10                   20                               10

SUMMARY: LLM FEATURE ENGINEERING IMPACT
🎯 RMSE improved by 1.47%
   - Baseline: 54.3267
   - With LLM features: 53.5280
   - Absolute reduction: 0.7987
   ✓ Good improvement! LLM features enhanced model performance.

🎯 R² Score improved by 0.0163
   - Baseline: 0.4429
   - With LLM features: 0.4592
   ✓ Good improvement in model fit.

📊 Feature Engineering Results:
   - Started with: 10 original features
   - Generated: 10 new features using llm-feat
   - Total features: 20 (original + engineered)

💡 Key Insight: LLM added 10 domain-relevant feat

## Visualize Generated Features

Let's take a look at some of the generated features to understand what was created.


In [16]:
# Display sample of engineered features
print("Sample of engineered features (first 5 rows):")
print(df_feng[['target'] + new_features[:5]].head())

print(f"\n\nStatistics for generated features:")
print(df_feng[new_features].describe())


Sample of engineered features (first 5 rows):
   target  ldl_hdl_ratio  triglycerides_hdl_ratio  bmi_squared  \
0   151.0       0.802306               -23.504311     0.003806   
1    75.0      -0.257532                12.551151     0.002650   
2   141.0       1.056822               -30.994793     0.001976   
3   206.0      -0.693459               -28.385573     0.000134   
4   135.0       1.915497               118.952175     0.001324   

   bmi_bp_interaction  age_bmi_interaction  
0            0.001349             0.002349  
1            0.001355             0.000097  
2           -0.000252             0.003792  
3            0.000425             0.001033  
4           -0.000796            -0.000196  


Statistics for generated features:
       ldl_hdl_ratio  triglycerides_hdl_ratio   bmi_squared  \
count     442.000000               442.000000  4.420000e+02   
mean        0.273854                12.758633  2.262443e-03   
std        13.217584               203.517476  3.267015e-03  
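
The ratio features have a very wide spread (for example, a standard deviation of about 13.2 for `ldl_hdl_ratio` against a mean of about 0.27). That is the pattern one would expect when the denominator is mean-centered: `s3` takes values on both sides of zero, so dividing by it can flip signs and blow up near zero. The sketch below illustrates the effect with synthetic values only; the numbers are made up for demonstration and are not the dataset's raw measurements, and `center_and_scale` is a hypothetical helper that mimics the default scaling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, strictly positive "raw" measurements (illustrative values only)
ldl = rng.normal(115.0, 30.0, size=442).clip(min=40.0)
hdl = rng.normal(50.0, 13.0, size=442).clip(min=20.0)


def center_and_scale(x):
    """Mean-center x and scale it to unit L2 norm."""
    centered = x - x.mean()
    return centered / np.sqrt((centered ** 2).sum())


raw_ratio = ldl / hdl
scaled_ratio = center_and_scale(ldl) / center_and_scale(hdl)

print(f"raw ratio:    min {raw_ratio.min():.2f}, max {raw_ratio.max():.2f}")
print(f"scaled ratio: min {scaled_ratio.min():.2f}, max {scaled_ratio.max():.2f}")
```

If clinically interpretable ratios are the goal, one option is to build them from unscaled measurements; scikit-learn 1.1 and later accept `load_diabetes(scaled=False)` to return the raw feature values.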