# llm-feat Package Test Notebook

This notebook tests the llm-feat package for automated feature engineering using LLMs.

## Setup


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import llm_feat

print(f"llm-feat version: {llm_feat.__version__}")
print("✓ Package imported successfully!")


llm-feat version: 0.1.0
✓ Package imported successfully!


In [None]:
# Set your OpenAI API key
# Option 1: Read it from an environment variable (recommended)
import os
api_key = os.getenv("OPENAI_API_KEY")

# Option 2: Set it directly in the notebook (for testing only; remove before committing!)
# Uncomment the next line and paste your key if the environment variable is not set:
# api_key = "<OPENAI_API_KEY>"

if api_key:
    llm_feat.set_api_key(api_key)
    print("✓ API key set")
else:
    print("⚠️  OPENAI_API_KEY not set. Set it using:")
    print("   export OPENAI_API_KEY='your-key-here' (before starting Jupyter)")
    print("   Or uncomment the line above to set it directly in the notebook")

✓ API key set
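
One way to keep placeholder strings from being treated as real keys is a small lookup helper. The sketch below is illustrative only: `resolve_api_key` and the `PLACEHOLDERS` set are hypothetical names and are not part of llm-feat.

```python
import os

# Hypothetical placeholder values that should never be passed on as a real key.
PLACEHOLDERS = {"", "<OPENAI_API_KEY>", "your-key-here"}

def resolve_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or None if it is missing or a placeholder."""
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    return None if value in PLACEHOLDERS else value

# Explicit mappings are used here so the example does not depend on the real environment.
print(resolve_api_key({"OPENAI_API_KEY": "<OPENAI_API_KEY>"}))  # None
print(resolve_api_key({"OPENAI_API_KEY": "example-key"}))       # example-key
```

The result can then be checked with `if key:` before it is handed to `llm_feat.set_api_key`.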


## Test 1: Simple Numerical Dataset


In [24]:
# Create a simple numerical dataset
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 50, 28, 32, 38, 42],
    'income': [50000, 60000, 70000, 80000, 90000, 100000, 55000, 65000, 75000, 85000],
    'savings': [10000, 15000, 20000, 25000, 30000, 35000, 12000, 18000, 22000, 28000],
    'expenses': [40000, 45000, 50000, 55000, 60000, 65000, 43000, 47000, 53000, 57000]
})

print("Original DataFrame:")
print(df.head())
print(f"\nShape: {df.shape}")
print(f"Columns: {list(df.columns)}")


Original DataFrame:
   age  income  savings  expenses
0   25   50000    10000     40000
1   30   60000    15000     45000
2   35   70000    20000     50000
3   40   80000    25000     55000
4   45   90000    30000     60000

Shape: (10, 4)
Columns: ['age', 'income', 'savings', 'expenses']


In [25]:
# Create metadata DataFrame
metadata_df = pd.DataFrame({
    'column_name': ['age', 'income', 'savings', 'expenses'],
    'description': [
        'Age of the person in years',
        'Annual income in dollars',
        'Total savings in dollars',
        'Annual expenses in dollars'
    ],
    'data_type': ['numeric', 'numeric', 'numeric', 'numeric'],
    'label_definition': [None, None, None, None]
})

print("Metadata DataFrame:")
print(metadata_df)


Metadata DataFrame:
  column_name                 description data_type label_definition
0         age  Age of the person in years   numeric             None
1      income    Annual income in dollars   numeric             None
2     savings    Total savings in dollars   numeric             None
3    expenses  Annual expenses in dollars   numeric             None


### Mode 1: Generate Code

**Note:** Running the cell below places the generated code in the next cell.
- In some Jupyter environments, a new cell is created automatically.
- In others, create a new cell below and the code will appear in it.
- The code is also printed in the output so you can copy it manually.
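
If automatic cell injection does not work in your environment, the printed code can also be run by hand. The sketch below assumes that code mode returns the code as a plain string, as the printed output in this section suggests, and uses a shortened copy of that output on a three-row stand-in frame. It runs the string against a copy of the DataFrame so the original stays untouched. Read generated code before executing it.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with the same columns as the Test 1 dataset.
df = pd.DataFrame({
    "age": [25, 30, 35],
    "income": [50000, 60000, 70000],
    "savings": [10000, 15000, 20000],
    "expenses": [40000, 45000, 50000],
})

# Stand-in for the string returned in code mode (two lines taken from the generated output).
code = """
df['income_to_expense_ratio'] = np.where(df['expenses'] != 0, df['income'] / df['expenses'], 0)
df['age_squared'] = df['age'] ** 2
"""

# Execute against a copy so the original DataFrame is not modified.
namespace = {"df": df.copy(), "np": np, "pd": pd}
exec(code, namespace)
df_new = namespace["df"]

print([c for c in df_new.columns if c not in df.columns])
```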


In [26]:
# Generate feature engineering code
# In Jupyter, the code will be automatically injected into the next cell
# Using gpt-4o-mini model for cost-effective feature generation
code = llm_feat.generate_features(df, metadata_df, mode='code', model='gpt-4o-mini')
print("Generated code:")
print(code)


✓ Attempted to create new cell with code - check below
  (Also set as next input - will appear in next cell you create)
Generated code:

# Generated Feature Engineering Code
import numpy as np

df['income_to_expense_ratio'] = np.where(df['expenses'] != 0, df['income'] / df['expenses'], 0)
df['savings_to_income_ratio'] = np.where(df['income'] != 0, df['savings'] / df['income'], 0)
df['net_savings'] = df['savings'] - df['expenses']
df['age_squared'] = df['age'] ** 2
df['income_per_age'] = np.where(df['age'] != 0, df['income'] / df['age'], 0)



In [27]:
import numpy as np

df['income_to_expense_ratio'] = np.where(df['expenses'] != 0, df['income'] / df['expenses'], 0)
df['savings_to_income_ratio'] = np.where(df['income'] != 0, df['savings'] / df['income'], 0)
df['net_savings'] = df['savings'] - df['expenses']
df['age_squared'] = df['age'] ** 2
df['income_per_age'] = np.where(df['age'] != 0, df['income'] / df['age'], 0)

### Mode 2: Direct Feature Addition


In [7]:
# Directly add features to DataFrame
# Using gpt-4o-mini model for cost-effective feature generation
df_with_features = llm_feat.generate_features(df, metadata_df, mode='direct', model='gpt-4o-mini')

print("DataFrame with new features:")
print(df_with_features.head())
print(f"\nOriginal columns: {list(df.columns)}")
print(f"New columns: {[col for col in df_with_features.columns if col not in df.columns]}")
print(f"\nTotal columns: {len(df_with_features.columns)} (original: {len(df.columns)})")


DataFrame with new features:
   age  income  savings  expenses  income_to_expense_ratio  \
0   25   50000    10000     40000                 1.250000   
1   30   60000    15000     45000                 1.333333   
2   35   70000    20000     50000                 1.400000   
3   40   80000    25000     55000                 1.454545   
4   45   90000    30000     60000                 1.500000   

   savings_to_income_ratio  net_savings  age_squared  income_per_age  \
0                 0.200000       -30000          625          2000.0   
1                 0.250000       -30000          900          2000.0   
2                 0.285714       -30000         1225          2000.0   
3                 0.312500       -30000         1600          2000.0   
4                 0.333333       -30000         2025          2000.0   

   savings_to_expenses_ratio  net_savings_per_age  income_squared  \
0                   0.250000         -1200.000000      2500000000   
1                   0.33333

## Test 2: Dataset with Target Column


In [21]:
# Create dataset with target column
df = pd.DataFrame({
    'height': [170, 175, 180, 165, 185, 172, 178, 168, 182, 174],
    'weight': [70, 75, 80, 65, 85, 72, 78, 68, 83, 74],
    'bmi': [24.2, 24.5, 24.7, 23.9, 24.8, 24.3, 24.6, 24.1, 25.0, 24.4],
    'health_score': [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]  # Target: 1=healthy, 0=unhealthy
})

metadata_df2 = pd.DataFrame({
    'column_name': ['height', 'weight', 'bmi', 'health_score'],
    'description': [
        'Height in centimeters',
        'Weight in kilograms',
        'Body Mass Index',
        'Health classification score'
    ],
    'data_type': ['numeric', 'numeric', 'numeric', 'numeric'],
    'label_definition': [None, None, None, '1 if healthy, 0 if unhealthy']
})

print("Dataset with target:")
print(df.head())
print("\nMetadata:")
print(metadata_df2)


Dataset with target:
   height  weight   bmi  health_score
0     170      70  24.2             1
1     175      75  24.5             1
2     180      80  24.7             0
3     165      65  23.9             1
4     185      85  24.8             0

Metadata:
    column_name                  description data_type  \
0        height        Height in centimeters   numeric   
1        weight          Weight in kilograms   numeric   
2           bmi              Body Mass Index   numeric   
3  health_score  Health classification score   numeric   

               label_definition  
0                          None  
1                          None  
2                          None  
3  1 if healthy, 0 if unhealthy  


In [22]:
# Generate features for dataset with target
# Using gpt-4o-mini model for cost-effective feature generation
code2 = llm_feat.generate_features(df, metadata_df2, mode='code', model='gpt-4o-mini')
print("Generated feature code:")
print(code2)


✓ Attempted to create new cell with code - check below
  (Also set as next input - will appear in next cell you create)
Generated feature code:

# Generated Feature Engineering Code
import numpy as np

df['height_weight_ratio'] = df['height'] / df['weight'].replace(0, np.nan)
df['bmi_squared'] = df['bmi'] ** 2
df['health_score_bmi_interaction'] = df['health_score'] * df['bmi']
df['weight_category'] = pd.cut(df['weight'], bins=[0, 70, 80, 90], labels=['Underweight', 'Normal', 'Overweight'], right=False)
df['height_bmi_difference'] = df['height'] - (df['bmi'] * 100 / (df['weight'].replace(0, np.nan)))



In [23]:
import numpy as np

df['height_weight_ratio'] = df['height'] / df['weight'].replace(0, np.nan)
df['bmi_squared'] = df['bmi'] ** 2
df['health_score_bmi_interaction'] = df['health_score'] * df['bmi']
df['weight_category'] = pd.cut(df['weight'], bins=[0, 70, 80, 90], labels=['Underweight', 'Normal', 'Overweight'], right=False)
df['height_bmi_difference'] = df['height'] - (df['bmi'] * 100 / (df['weight'].replace(0, np.nan)))
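
When the dataset contains a target column, it is worth checking whether any generated feature is computed from the target itself, since such a feature would not be available at prediction time. In the code above, `health_score_bmi_interaction` multiplies the target `health_score` by `bmi`. A minimal sketch of an automated check follows; `find_target_leakage` is a hypothetical helper (not part of llm-feat) and only handles simple single-line `df['name'] = ...` assignments.

```python
import re

def find_target_leakage(code, target):
    """Return names of features whose right-hand side references df[target]."""
    assign = re.compile(r"""^df\[['"](?P<name>[^'"]+)['"]\]\s*=\s*(?P<expr>.+)$""")
    uses_target = re.compile(r"""df\[['"]""" + re.escape(target) + r"""['"]\]""")
    leaky = []
    for line in code.splitlines():
        match = assign.match(line.strip())
        if match and uses_target.search(match.group("expr")):
            leaky.append(match.group("name"))
    return leaky

# Three lines copied from the generated output for this test.
generated = """
df['height_weight_ratio'] = df['height'] / df['weight'].replace(0, np.nan)
df['bmi_squared'] = df['bmi'] ** 2
df['health_score_bmi_interaction'] = df['health_score'] * df['bmi']
"""

print(find_target_leakage(generated, "health_score"))
# ['health_score_bmi_interaction']
```

Features flagged this way can be dropped before training.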

## Test 3: Partial Metadata (Many Columns, Few Descriptions)

This example demonstrates that you can provide metadata for only a subset of columns. The LLM will still see all columns in the DataFrame and can generate features using all of them, but will have richer context for columns with descriptions.


In [16]:
# Create a dataset with many columns (simulating a real-world scenario)
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'age': [25, 30, 35, 40, 45, 28, 32, 38, 42, 50],
    'income': [50000, 60000, 70000, 80000, 90000, 55000, 65000, 75000, 85000, 95000],
    'credit_score': [650, 720, 680, 750, 710, 660, 730, 690, 740, 700],
    'loan_amount': [10000, 15000, 20000, 25000, 30000, 12000, 18000, 22000, 28000, 35000],
    'employment_years': [2, 5, 8, 12, 15, 3, 6, 9, 11, 18],
    'debt_to_income': [0.25, 0.30, 0.28, 0.35, 0.32, 0.27, 0.29, 0.31, 0.33, 0.36],
    'savings': [5000, 8000, 12000, 15000, 20000, 6000, 9000, 13000, 16000, 22000],
    'num_accounts': [2, 3, 2, 4, 3, 2, 3, 2, 4, 3],
    'default_status': [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]  # Target: 1=defaulted, 0=not defaulted
})

print("DataFrame with many columns:")
print(df.head())
print(f"\nTotal columns: {len(df.columns)}")
print(f"Columns: {list(df.columns)}")


DataFrame with many columns:
   customer_id  age  income  credit_score  loan_amount  employment_years  \
0            1   25   50000           650        10000                 2   
1            2   30   60000           720        15000                 5   
2            3   35   70000           680        20000                 8   
3            4   40   80000           750        25000                12   
4            5   45   90000           710        30000                15   

   debt_to_income  savings  num_accounts  default_status  
0            0.25     5000             2               0  
1            0.30     8000             3               0  
2            0.28    12000             2               1  
3            0.35    15000             4               0  
4            0.32    20000             3               0  

Total columns: 10
Columns: ['customer_id', 'age', 'income', 'credit_score', 'loan_amount', 'employment_years', 'debt_to_income', 'savings', 'num_accounts', 'de

In [17]:
# Provide metadata for only a FEW important columns (not all columns)
# This simulates a real scenario where you might have 100+ columns but only
# describe the most important ones
metadata_df3 = pd.DataFrame({
    'column_name': ['income', 'credit_score', 'loan_amount', 'default_status'],
    'description': [
        'Annual income in dollars',
        'Credit score (300-850 scale)',
        'Loan amount requested in dollars',
        'Loan default status'
    ],
    'data_type': ['numeric', 'numeric', 'numeric', 'numeric'],
    'label_definition': [None, None, None, '1 if customer defaulted on loan, 0 if not']
})

print("Metadata provided for only 4 out of 10 columns:")
print(metadata_df3)
print(f"\nColumns with metadata: {list(metadata_df3['column_name'])}")
print(f"Columns without metadata: {[col for col in df.columns if col not in metadata_df3['column_name'].values]}")


Metadata provided for only 4 out of 10 columns:
      column_name                       description data_type  \
0          income          Annual income in dollars   numeric   
1    credit_score      Credit score (300-850 scale)   numeric   
2     loan_amount  Loan amount requested in dollars   numeric   
3  default_status               Loan default status   numeric   

                            label_definition  
0                                       None  
1                                       None  
2                                       None  
3  1 if customer defaulted on loan, 0 if not  

Columns with metadata: ['income', 'credit_score', 'loan_amount', 'default_status']
Columns without metadata: ['customer_id', 'age', 'employment_years', 'debt_to_income', 'savings', 'num_accounts']


### Generate Features with Partial Metadata

Although only 4 columns have descriptions, the LLM sees all 10 columns in the DataFrame and can generate features from any of them. The described columns simply give it richer context.


In [18]:
# Generate features - the LLM can use ALL columns, not just the ones with metadata
# Using gpt-4o-mini model for cost-effective feature generation
code3 = llm_feat.generate_features(df, metadata_df3, mode='code', model='gpt-4o-mini')
print("Generated feature code (using all columns, with context from described columns):")
print(code3)


✓ Attempted to create new cell with code - check below
  (Also set as next input - will appear in next cell you create)
Generated feature code (using all columns, with context from described columns):

# Generated Feature Engineering Code
import numpy as np

df['income_to_loan_ratio'] = np.where(df['loan_amount'] != 0, df['income'] / df['loan_amount'], 0)
df['credit_score_to_income_ratio'] = np.where(df['income'] != 0, df['credit_score'] / df['income'], 0)
df['savings_to_debt_ratio'] = np.where(df['debt_to_income'] != 0, df['savings'] / df['debt_to_income'], 0)
df['employment_years_squared'] = df['employment_years'] ** 2
df['age_bins'] = pd.cut(df['age'], bins=[20, 30, 40, 50], labels=['20-30', '30-40', '40-50'], right=False)
df['high_income'] = np.where(df['income'] > df['income'].mean(), 1, 0)



In [19]:
import numpy as np

df['income_to_loan_ratio'] = np.where(df['loan_amount'] != 0, df['income'] / df['loan_amount'], 0)
df['credit_score_to_income_ratio'] = np.where(df['income'] != 0, df['credit_score'] / df['income'], 0)
df['savings_to_debt_ratio'] = np.where(df['debt_to_income'] != 0, df['savings'] / df['debt_to_income'], 0)
df['employment_years_squared'] = df['employment_years'] ** 2
df['age_bins'] = pd.cut(df['age'], bins=[20, 30, 40, 50], labels=['20-30', '30-40', '40-50'], right=False)
df['high_income'] = np.where(df['income'] > df['income'].mean(), 1, 0)
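
Generated binning code deserves a quick check at the bin edges. With `right=False`, `pd.cut` builds half-open intervals that exclude each right edge, so an age of exactly 50 falls outside `bins=[20, 30, 40, 50]` and becomes `NaN`. The sketch below reproduces this on the ages from this test and shows one possible fix; raising the last edge to 51 is an illustrative choice, not something the package does.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40, 45, 28, 32, 38, 42, 50])
labels = ["20-30", "30-40", "40-50"]

# Bins as generated: intervals are [20, 30), [30, 40), [40, 50), so 50 is not covered.
generated = pd.cut(ages, bins=[20, 30, 40, 50], labels=labels, right=False)
print(generated.isna().sum())  # 1 (the row with age 50)

# One possible fix: move the last edge past the maximum so 50 lands in the top bin.
fixed = pd.cut(ages, bins=[20, 30, 40, 51], labels=labels, right=False)
print(fixed.iloc[-1])  # 40-50
```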

### Direct Feature Addition with Partial Metadata

You can also use direct mode - the LLM will still generate features using all columns in the DataFrame.


In [20]:
# Direct feature addition with partial metadata
df3_with_features = llm_feat.generate_features(df, metadata_df3, mode='direct', model='gpt-4o-mini')

print("DataFrame with new features:")
print(df3_with_features.head())
print(f"\nOriginal columns: {len(df.columns)}")
print(f"New columns added: {len(df3_with_features.columns) - len(df.columns)}")
print(f"Total columns: {len(df3_with_features.columns)}")


DataFrame with new features:
   customer_id  age  income  credit_score  loan_amount  employment_years  \
0            1   25   50000           650        10000                 2   
1            2   30   60000           720        15000                 5   
2            3   35   70000           680        20000                 8   
3            4   40   80000           750        25000                12   
4            5   45   90000           710        30000                15   

   debt_to_income  savings  num_accounts  default_status  ...  \
0            0.25     5000             2               0  ...   
1            0.30     8000             3               0  ...   
2            0.28    12000             2               1  ...   
3            0.35    15000             4               0  ...   
4            0.32    20000             3               0  ...   

   credit_score_to_income_ratio  savings_to_debt_ratio  \
0                      0.013000           20000.000000   
1      

**Key Takeaway:** You don't need to provide metadata for every column! Provide descriptions for:
- The target column (if applicable)
- Important domain-specific columns
- Columns that need special handling

The LLM will still see all columns in your DataFrame and can generate features using all of them, but will have richer context for columns with descriptions.
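
With partial metadata it helps to confirm which columns are described and that every `column_name` entry actually exists in the DataFrame. A minimal sketch follows; `metadata_coverage` is a hypothetical helper and is not part of llm-feat, and the small frames are stand-ins for the ones used in this test.

```python
import pandas as pd

def metadata_coverage(df, metadata_df):
    """Split columns into described, undescribed, and metadata names missing from df."""
    described = set(metadata_df["column_name"])
    return {
        "described": [c for c in df.columns if c in described],
        "undescribed": [c for c in df.columns if c not in described],
        "unknown": sorted(described - set(df.columns)),
    }

df = pd.DataFrame({"customer_id": [1, 2], "income": [50000, 60000], "default_status": [0, 1]})
metadata_df = pd.DataFrame({
    "column_name": ["income", "default_status", "credit_scor"],  # deliberate typo
    "description": ["Annual income in dollars", "Loan default status", "Credit score"],
})

coverage = metadata_coverage(df, metadata_df)
print(coverage)
```

A non-empty `unknown` list usually points to a misspelled column name in the metadata.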


## Test 4: Using Problem Description for Additional Context

This example demonstrates how to use the `problem_description` parameter to provide additional business context to the LLM, which helps generate more relevant and domain-specific features.


In [7]:
# Create a dataset for e-commerce customer churn prediction
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'total_purchases': [15, 8, 25, 3, 30, 12, 20, 5, 18, 22],
    'avg_order_value': [45.50, 32.00, 78.90, 15.00, 95.20, 55.30, 67.80, 20.50, 72.40, 88.60],
    'days_since_last_purchase': [5, 45, 2, 90, 1, 30, 7, 120, 10, 3],
    'support_tickets': [0, 2, 0, 5, 1, 1, 0, 3, 0, 0],
    'account_age_days': [365, 180, 730, 90, 1095, 240, 540, 60, 450, 600],
    'churned': [0, 1, 0, 1, 0, 0, 0, 1, 0, 0]  # Target: 1=churned, 0=active
})

metadata_df4 = pd.DataFrame({
    'column_name': ['total_purchases', 'avg_order_value', 'days_since_last_purchase', 'support_tickets', 'account_age_days', 'churned'],
    'description': [
        'Total number of purchases made by customer',
        'Average value of customer orders in dollars',
        'Number of days since customer last made a purchase',
        'Number of customer support tickets opened',
        'Age of customer account in days',
        'Customer churn status'
    ],
    'data_type': ['numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric'],
    'label_definition': [None, None, None, None, None, '1 if customer churned, 0 if active']
})

print("E-commerce Customer Dataset:")
print(df.head())
print(f"\nShape: {df.shape}")


E-commerce Customer Dataset:
   customer_id  total_purchases  avg_order_value  days_since_last_purchase  \
0            1               15             45.5                         5   
1            2                8             32.0                        45   
2            3               25             78.9                         2   
3            4                3             15.0                        90   
4            5               30             95.2                         1   

   support_tickets  account_age_days  churned  
0                0               365        0  
1                2               180        1  
2                0               730        0  
3                5                90        1  
4                1              1095        0  

Shape: (10, 7)


### Without Problem Description

First, let's see features generated without additional context:


In [8]:
# Generate features without problem description
code4_no_desc = llm_feat.generate_features(df, metadata_df4, mode='code', model='gpt-4o-mini')
print("Generated code WITHOUT problem description:")
print(code4_no_desc)


✓ Attempted to create new cell with code - check below
  (Also set as next input - will appear in next cell you create)
Generated code WITHOUT problem description:

# Generated Feature Engineering Code
import numpy as np

df['total_purchases_to_account_age_ratio'] = df['total_purchases'] / (df['account_age_days'] + 1e-5)
df['avg_order_value_to_support_tickets_ratio'] = df['avg_order_value'] / (df['support_tickets'] + 1e-5)
df['days_since_last_purchase_squared'] = df['days_since_last_purchase'] ** 2
df['support_tickets_per_purchase'] = df['support_tickets'] / (df['total_purchases'] + 1e-5)
df['avg_order_value_log'] = np.log1p(df['avg_order_value'])



In [9]:
import numpy as np

df['total_purchases_to_account_age_ratio'] = df['total_purchases'] / (df['account_age_days'] + 1e-5)
df['avg_order_value_to_support_tickets_ratio'] = df['avg_order_value'] / (df['support_tickets'] + 1e-5)
df['days_since_last_purchase_squared'] = df['days_since_last_purchase'] ** 2
df['support_tickets_per_purchase'] = df['support_tickets'] / (df['total_purchases'] + 1e-5)
df['avg_order_value_log'] = np.log1p(df['avg_order_value'])

### With Problem Description

Now, let's provide additional business context using the `problem_description` parameter:


In [10]:
# Define the problem description with business context
problem_desc = """
We are an e-commerce company trying to predict customer churn. 
Key business insights:
- Customers who haven't purchased in 30+ days are at high risk of churning
- High support ticket volume often indicates dissatisfaction and leads to churn
- Customers with high lifetime value (many purchases, high order values) are less likely to churn
- New customers (account age < 90 days) with low engagement are particularly vulnerable
- We want to identify at-risk customers early to intervene with retention campaigns

Generate features that help identify these at-risk customer segments.
"""

# Generate features WITH problem description
code4_with_desc = llm_feat.generate_features(
    df, 
    metadata_df4, 
    mode='code', 
    model='gpt-4o-mini',
    problem_description=problem_desc
)
print("Generated code WITH problem description:")
print(code4_with_desc)


✓ Attempted to create new cell with code - check below
  (Also set as next input - will appear in next cell you create)
Generated code WITH problem description:

# Generated Feature Engineering Code
import numpy as np

df['days_since_last_purchase_over_30'] = (df['days_since_last_purchase'] > 30).astype(int)
df['high_support_ticket_risk'] = (df['support_tickets'] > 2).astype(int)
df['low_engagement_new_customer'] = ((df['account_age_days'] < 90) & (df['total_purchases'] < 5)).astype(int)
df['total_purchases_squared'] = df['total_purchases'] ** 2
df['avg_order_value_to_days_since_last_purchase_ratio'] = np.where(df['days_since_last_purchase'] > 0, 
                                                                    df['avg_order_value'] / df['days_since_last_purchase'], 
                                                                    0)



In [11]:
import numpy as np

df['days_since_last_purchase_over_30'] = (df['days_since_last_purchase'] > 30).astype(int)
df['high_support_ticket_risk'] = (df['support_tickets'] > 2).astype(int)
df['low_engagement_new_customer'] = ((df['account_age_days'] < 90) & (df['total_purchases'] < 5)).astype(int)
df['total_purchases_squared'] = df['total_purchases'] ** 2
df['avg_order_value_to_days_since_last_purchase_ratio'] = np.where(df['days_since_last_purchase'] > 0, 
                                                                    df['avg_order_value'] / df['days_since_last_purchase'], 
                                                                    0)
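
Threshold features derived from business rules can be checked against the sample before use. The sketch below recomputes three of the flags above on the Test 4 data and counts how often each fires. On this 10-row sample, `low_engagement_new_customer` is never set: the only account younger than 90 days has exactly 5 purchases, which `< 5` excludes, so the column is constant here. The 30-day flag, by contrast, matches `churned` on every row. With so few rows, neither observation says much about real-world usefulness.

```python
import pandas as pd

# Columns copied from the Test 4 dataset.
df = pd.DataFrame({
    "total_purchases": [15, 8, 25, 3, 30, 12, 20, 5, 18, 22],
    "days_since_last_purchase": [5, 45, 2, 90, 1, 30, 7, 120, 10, 3],
    "support_tickets": [0, 2, 0, 5, 1, 1, 0, 3, 0, 0],
    "account_age_days": [365, 180, 730, 90, 1095, 240, 540, 60, 450, 600],
    "churned": [0, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Same threshold logic as the generated code.
flags = pd.DataFrame({
    "days_since_last_purchase_over_30": (df["days_since_last_purchase"] > 30).astype(int),
    "high_support_ticket_risk": (df["support_tickets"] > 2).astype(int),
    "low_engagement_new_customer": (
        (df["account_age_days"] < 90) & (df["total_purchases"] < 5)
    ).astype(int),
})

# How many rows each flag marks; a count of 0 means the column is constant.
counts = flags.sum()
print(counts)

# Whether the 30-day flag agrees with the target on every row of this sample.
matches_target = bool((flags["days_since_last_purchase_over_30"] == df["churned"]).all())
print(matches_target)
```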

### Direct Mode with Problem Description

You can also use problem description with direct mode:


In [15]:
# Direct feature addition with problem description
df4_with_features = llm_feat.generate_features(
    df, 
    metadata_df4, 
    mode='direct', 
    model='gpt-4o-mini',
    problem_description=problem_desc
)

print("DataFrame with features generated using problem description:")
print(df4_with_features.head())
print(f"\nOriginal columns: {len(df.columns)}")
print(f"New columns added: {len(df4_with_features.columns) - len(df.columns)}")
print(f"Total columns: {len(df4_with_features.columns)}")
print(f"\nNew feature columns: {[col for col in df4_with_features.columns if col not in df.columns]}")


DataFrame with features generated using problem description:
   customer_id  total_purchases  avg_order_value  days_since_last_purchase  \
0            1               15             45.5                         5   
1            2                8             32.0                        45   
2            3               25             78.9                         2   
3            4                3             15.0                        90   
4            5               30             95.2                         1   

   support_tickets  account_age_days  churned  \
0                0               365        0   
1                2               180        1   
2                0               730        0   
3                5                90        1   
4                1              1095        0   

   total_purchases_to_account_age_ratio  \
0                              0.041096   
1                              0.044444   
2                              0.034247   
3  

## Test 5: Generating Feature Report

This example demonstrates how to get a detailed report explaining the domain understanding and rationale for each generated feature.


In [3]:
# Create a simple dataset for report demonstration
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 28, 32, 38, 42, 50],
    'income': [50000, 60000, 70000, 80000, 90000, 55000, 65000, 75000, 85000, 95000],
    'credit_score': [650, 720, 680, 750, 710, 660, 730, 690, 740, 700],
    'loan_default': [0, 0, 1, 0, 1, 0, 0, 1, 0, 1]  # Target: 1=defaulted, 0=not defaulted
})

metadata_df5 = pd.DataFrame({
    'column_name': ['age', 'income', 'credit_score', 'loan_default'],
    'description': [
        'Age of the borrower',
        'Annual income in dollars',
        'Credit score (300-850)',
        'Loan default status'
    ],
    'data_type': ['numeric', 'numeric', 'numeric', 'numeric'],
    'label_definition': [None, None, None, '1 if defaulted, 0 if not']
})

print("Dataset:")
print(df.head())


Dataset:
   age  income  credit_score  loan_default
0   25   50000           650             0
1   30   60000           720             0
2   35   70000           680             1
3   40   80000           750             0
4   45   90000           710             1


### Code Mode with Report

Generate features and get a detailed report explaining the domain understanding and each feature:


In [4]:
# Generate features with report
code5, report5 = llm_feat.generate_features(
    df, 
    metadata_df5, 
    mode='code', 
    model='gpt-4o-mini',
    return_report=True
)

print("Generated Code:")
print(code5)
print("\n" + "=" * 70)
print("FEATURE REPORT")
print("=" * 70)
print(report5)

✓ Attempted to create new cell with code - check below
  (Also set as next input - will appear in next cell you create)
Generated Code:

# Generated Feature Engineering Code
import pandas as pd
import numpy as np

# Creating new features based on the existing columns
df['income_to_age_ratio'] = df['income'] / (df['age'] + 1e-5)  # Avoid division by zero
df['credit_score_to_income_ratio'] = df['credit_score'] / (df['income'] + 1e-5)  # Avoid division by zero
df['age_squared'] = df['age'] ** 2  # Capturing non-linear relationship
df['income_binned'] = pd.cut(df['income'], bins=[0, 60000, 80000, 100000], labels=['low', 'medium', 'high'])  # Binning income
df['credit_score_binned'] = pd.cut(df['credit_score'], bins=[300, 600, 700, 850], labels=['poor', 'fair', 'good'])  # Binning credit score


FEATURE REPORT

1. DOMAIN UNDERSTANDING:
   - The problem domain revolves around predicting loan default status based on borrower characteristics such as age, income, and credit score.
   - The targ

In [5]:
import pandas as pd
import numpy as np

# Creating new features based on the existing columns
df['income_to_age_ratio'] = df['income'] / (df['age'] + 1e-5)  # Avoid division by zero
df['credit_score_to_income_ratio'] = df['credit_score'] / (df['income'] + 1e-5)  # Avoid division by zero
df['age_squared'] = df['age'] ** 2  # Capturing non-linear relationship
df['income_binned'] = pd.cut(df['income'], bins=[0, 60000, 80000, 100000], labels=['low', 'medium', 'high'])  # Binning income
df['credit_score_binned'] = pd.cut(df['credit_score'], bins=[300, 600, 700, 850], labels=['poor', 'fair', 'good'])  # Binning credit score

### Direct Mode with Report

You can also get a report when using direct mode:


In [6]:
# Direct mode with report
df5_with_features, report5_direct = llm_feat.generate_features(
    df, 
    metadata_df5, 
    mode='direct', 
    model='gpt-4o-mini',
    return_report=True
)

print("DataFrame with features:")
print(df5_with_features.head())
print("\n" + "=" * 70)
print("FEATURE REPORT")
print("=" * 70)
print(report5_direct)

DataFrame with features:
   age  income  credit_score  loan_default  income_to_age_ratio  \
0   25   50000           650             0          1999.999200   
1   30   60000           720             0          1999.999333   
2   35   70000           680             1          1999.999429   
3   40   80000           750             0          1999.999500   
4   45   90000           710             1          1999.999556   

   credit_score_to_income_ratio  age_squared income_binned  \
0                      0.013000          625           low   
1                      0.012000          900           low   
2                      0.009714         1225        medium   
3                      0.009375         1600        medium   
4                      0.007889         2025          high   

  credit_score_binned  income_to_credit_score_ratio  age_income_interaction  \
0                fair                     76.923077                 1250000   
1                good                    

**Key Takeaway:** The `return_report=True` parameter provides valuable insights:
- **Domain Understanding**: Explains the problem context and business domain
- **Feature Explanations**: For each generated feature, explains:
  - What the feature represents
  - Why it's useful for prediction
  - How it relates to the business problem

This report helps you understand the reasoning behind the generated features and can be useful for documentation, presentations, or explaining your feature engineering approach to stakeholders.
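
To keep the report alongside the code it explains, both can be written to disk. The sketch below assumes the code and the report are plain strings, as the printed output suggests, and uses short stand-in values. `save_feature_artifacts` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

def save_feature_artifacts(code, report, out_dir):
    """Write generated code and its report to out_dir and return both paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    code_path = out_dir / "generated_features.py"
    report_path = out_dir / "feature_report.txt"
    code_path.write_text(code, encoding="utf-8")
    report_path.write_text(report, encoding="utf-8")
    return code_path, report_path

# Stand-in strings; in the notebook these would be `code5` and `report5`.
code = "df['age_squared'] = df['age'] ** 2\n"
report = "1. DOMAIN UNDERSTANDING:\n   - (report text)\n"

# A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    code_path, report_path = save_feature_artifacts(code, report, tmp)
    print(code_path.name, report_path.name)
    saved_code = code_path.read_text(encoding="utf-8")
    saved_report = report_path.read_text(encoding="utf-8")
```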


## Summary

- ✓ Package imports successfully
- ✓ API key management works
- ✓ Metadata validation works
- ✓ Code generation mode works (injects into next cell in Jupyter)
- ✓ Direct feature addition mode works
- ✓ Works with datasets with and without target columns
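- ✓ Works with partial metadata (you can provide descriptions for only a subset of columns)
- ✓ Accepts a `problem_description` for additional business context
- ✓ Returns a feature report alongside the code or DataFrame when `return_report=True`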
