**Packages**

In [23]:
import pandas as pd
import re
import numpy as np
from openai import OpenAI
import os
from dotenv import load_dotenv

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain.chains import LLMChain


**Helper imports**

In [24]:
from helpers.models import adjusted_r2, run_xgboost_cv
from helpers.data_extraction import extract_section_7, load_automobile_df
from helpers.llm_helpers import get_schema_text_for_llm, make_summary
from helpers.general_helpers import get_next_run_idx, get_interesting_idxs

Parameters: API key and base URL (read from the `.env` file)

In [25]:
load_dotenv()
api_key = os.getenv("key")
base_url = os.getenv("api")
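`os.getenv` returns `None` when a variable is missing, and the error then only surfaces later, when the value is assigned to `os.environ` or passed to the client. A minimal sketch of an earlier check follows; the helper name `require_env` is hypothetical and not part of the notebook's helpers.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for illustration only.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check the .env file."
        )
    return value
```

It could be used as `api_key = require_env("key")` and `base_url = require_env("api")`, right after `load_dotenv()`.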

Get the dataset

In [26]:
data_path = "automobile/imports-85.data"
names_path = "automobile/imports-85.names"
df, informations = load_automobile_df(data_path, names_path)

In [27]:
door_map = {
    'four': 4, 
    'two': 2
}

cyl_map = {
    'eight': 8,
    'five': 5,
    'four': 4,
    'six': 6,
    'three': 3,
    'twelve': 12,
    'two': 2
}

# Convert to integer columns. Nullable Int64 is used because one of the variables has missing values.
df['num-of-doors'] = df['num-of-doors'].map(door_map).astype('Int64')
df['num-of-cylinders'] = df['num-of-cylinders'].map(cyl_map).astype('Int64')

df['curb-weight'] = df['curb-weight'].astype('float64')
df['engine-size'] = df['engine-size'].astype('float64')
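To illustrate why the nullable `Int64` dtype is needed here, the sketch below applies the same `map` + `astype` pattern to a toy Series. The values, including the `"?"` placeholder for a missing entry, are assumptions for illustration and are not taken from the dataset loader.

```python
import pandas as pd

door_map = {"four": 4, "two": 2}

# Toy input: two valid labels, one assumed placeholder, one real missing value.
raw = pd.Series(["four", "two", "?", None])

# Labels that are not keys of the dict become NaN, so the result is float64.
mapped = raw.map(door_map)

# Nullable Int64 keeps whole numbers and represents the gaps as <NA>;
# a plain int64 column cannot hold missing values.
doors = mapped.astype("Int64")

print(doors.dtype)
print(int(doors.isna().sum()), "missing values")
```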

In [28]:
df.head()

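|   | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|-----------|-------------------|------|-----------|------------|--------------|------------|--------------|-----------------|------------|-----|-------------|-------------|------|--------|-------------------|------------|----------|----------|-------------|-------|
| 0 | 3 | NaN | alfa-romero | gas | std | 2 | convertible | rwd | front | 88.6 | ... | 130.0 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 |
| 1 | 3 | NaN | alfa-romero | gas | std | 2 | convertible | rwd | front | 88.6 | ... | 130.0 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
| 2 | 1 | NaN | alfa-romero | gas | std | 2 | hatchback | rwd | front | 94.5 | ... | 152.0 | mpfi | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
| 3 | 2 | 164.0 | audi | gas | std | 4 | sedan | fwd | front | 99.8 | ... | 109.0 | mpfi | 3.19 | 3.4 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 |
| 4 | 2 | 164.0 | audi | gas | std | 4 | sedan | 4wd | front | 99.4 | ... | 136.0 | mpfi | 3.19 | 3.4 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 |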


***TWO agent chain***

**1 - Analyst Agent**

**2 - Feature Engineering Agent**

After that, we apply the suggested feature engineering to the dataset and reuse the same Explanation Agent.
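The data flow of the chain can be sketched without any LLM dependency: the Analyst Agent's text output is the only input the Feature Engineering Agent receives. In the sketch below the two agents are plain callables, and the function name `run_two_agent_chain` is hypothetical; in the notebook itself the agents are LangChain chains.

```python
from typing import Callable, Dict


def run_two_agent_chain(
    schema_summary: str,
    analyst: Callable[[str], str],
    feature_engineer: Callable[[str], str],
) -> Dict[str, str]:
    """Run the analyst first, then feed its full output to the feature engineer."""
    analyst_output = analyst(schema_summary)
    feature_output = feature_engineer(analyst_output)
    return {"analyst": analyst_output, "features": feature_output}


# Stand-in agents (assumptions for illustration; real agents would call an LLM).
def fake_analyst(summary: str) -> str:
    return f"ANALYSIS OF: {summary}"


def fake_feature_engineer(analysis: str) -> str:
    return f"FEATURES FROM: {analysis}"


result = run_two_agent_chain("26 columns, 205 rows", fake_analyst, fake_feature_engineer)
print(result["features"])
```

One consequence of this design is that anything the analyst omits from its answer, such as sample records or raw statistics, is invisible to the second agent.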

In [29]:
os.environ["OPENAI_API_KEY"] = api_key
os.environ["OPENAI_API_BASE"] = base_url

In [30]:
chat = ChatOpenAI(model_name="deepseek/deepseek-r1-0528:free", 
                  temperature=0.2)

eda_prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(
        """You are an automotive data analyst agent.

            You will receive a summary of an automobile dataset, including:
            - A list of all columns, with their data types, attribute ranges or value examples, and missing value counts
            - Sample records
            - Key statistics (e.g. mean, std, min, max, skew, etc.) for numerical and categorical variables
            - Optional EDA findings (distribution shape, outlier notes, trends, correlations)

            Your responsibilities are:
            1. For each variable/column, explain what it likely measures, its units or categories, and its practical meaning in the context of vehicle data. Be explicit and do not group variables—analyze each one individually.
            2. Detect possible semantic relationships, physical or business meanings, and likely associations with other columns (e.g. which pairs have strong correlations, which affect price/risk, etc).
            3. Call out notable trends, strong correlations, anomalies, and outliers. Reference variable names and statistics explicitly.
            4. Suggest ways the dataset could be enriched: for each, mention relevant external datasets, additional columns, or real-world data sources (e.g. safety ratings, CO2 emissions, vehicle class, market info) that could be joined or mapped by column name.

            Format your answer as:
            - A numbered list where each variable is explained with name, meaning, type, and any key distribution or outlier notes.
            - A section on key trends, correlations and anomalies
            - A section with enrichment recommendations, referencing specific variable names

            Please, focus on schema interpretation (variable meanings), EDA, and actionable suggestions for deeper analysis.
        """
    ),
    HumanMessagePromptTemplate.from_template("{schema_summary}")
])

eda_chain = LLMChain(llm=chat, prompt=eda_prompt)

feat_eng_prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(
        """
        You are an automotive feature engineering agent.

        Your input is the complete output of a Data Analyst Agent, including:
        - The dataset schema with column names, attribute ranges, data types, and missing value counts.
        - Sample records, statistics, and EDA findings.
        - Explicit interpretations of each variable and summary of trends/outliers from data analysis.

        Your tasks are:
        1. Propose new features for predictive modeling and interpretability by combining, transforming, enriching the columns provided or by suggesting realistic external enrichment.
        2. For each suggestion, specify:
            - Feature Name
            - Formula – Show how it is calculated (including mathematical formula or logic or code-like description referencing column names and attribute ranges)
            - Justification – Business or automotive domain reason for its value (e.g., why it matters for price, efficiency, safety, etc.)
            - Reasoning Chain – Step-by-step logic or EDA findings that motivate this feature (why it could improve prediction, segmentation, or insight)

        Guidelines:
            - Use only the columns in the schema/EDA unless you explicitly recommend a realistic external enrichment (note the external source in justification).
            - Cite variable names and attribute ranges where helpful.
            - Suggest both classic (e.g., ratios, interactions, indicators) and creative features, but keep formulas practical—no fictitious data or overly complex recipes.

        Output all suggestions as a markdown table:
        | Feature Name | Formula        | Justification | Reasoning Chain |
        |--------------|----------------|---------------|-----------------|

        Then, in a short section below, summarize the main rationale for your key suggestions (e.g., which features link to performance, efficiency, market segmentation, or domain knowledge).
        """
    ),
    HumanMessagePromptTemplate.from_template("{analyst_output}")
])

feat_eng_chain = LLMChain(llm=chat, prompt=feat_eng_prompt)


summary = make_summary(df, informations)

eda_out = eda_chain.run(schema_summary=summary) 
print("\n=== Analyst Agent Output ===\n", eda_out[:1200], "...")

features_out = feat_eng_chain.run(analyst_output=eda_out) 
print("\n=== Feature Engineering Agent Output ===\n", features_out[:1200], "...")



=== Analyst Agent Output ===
 ### 1. Variable Analysis (Individual Breakdown)
1. **symboling** (int64):  
   - Likely measures **insurance risk rating** (-3 = safest, 3 = riskiest).  
   - Distribution: Slightly right-skewed (skew=0.21), mode=0. Most vehicles (50th percentile=1) are low-risk.  

2. **normalized-losses** (float64):  
   - Represents **insurance loss severity** (normalized to 65–256 scale).  
   - Missing 20% data. Slight right skew (skew=0.77), mode=161.  

3. **make** (object):  
   - **Manufacturer** (e.g., Toyota, BMW). Toyota dominates (32/205), Nissan second (18).  

4. **fuel-type** (object):  
   - **Fuel type**: Gas (185), Diesel (20). Gas is predominant.  

5. **aspiration** (object):  
   - **Engine aspiration type**: Standard (168), Turbocharged (37).  

6. **num-of-doors** (Int64):  
   - **Door count**: Four (153), Two (50). Missing 2 entries.  

7. **body-style** (object):  
   - **Vehicle body type**: Sedan (96), Hatchback (70), Convertible (3).  

8. **

In [31]:
import os

def ensure_dir(path):
    if not os.path.exists(path):
        os.makedirs(path)


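def get_next_chain_run_idx(folder):
    # Create the folder on first use so os.listdir does not fail.
    ensure_dir(folder)
    existing = [fname for fname in os.listdir(folder) if fname.endswith(".txt")]
    idxs = []
    for fname in existing:
        try:
            # File names start with a zero-padded 3-digit run index, e.g. "005_analyst.txt".
            idxs.append(int(fname[:3]))
        except ValueError:
            # Skip files that do not follow the naming scheme.
            pass
    next_idx = max(idxs) + 1 if idxs else 1
    return f"{next_idx:03d}"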

def save_agent_output(agent_type, output, idx, folder="outputs/2_agents_chain/"):
    os.makedirs(folder, exist_ok=True)
    fname = os.path.join(folder, f"{idx}_{agent_type}.txt")
    with open(fname, "w", encoding="utf-8") as f:
        f.write(output)
    return fname

folder = "outputs/2_agents_chain/"
run_idx = get_next_chain_run_idx(folder)  # Call ONCE per workflow run

analyst_outfile = save_agent_output("analyst", eda_out, run_idx, folder)
feature_outfile = save_agent_output("feature", features_out, run_idx, folder)

print("Saved:", analyst_outfile, feature_outfile)

Saved: outputs/2_agents_chain/005_analyst.txt outputs/2_agents_chain/005_feature.txt


In [32]:
print(features_out)

### Feature Engineering Suggestions

| Feature Name              | Formula                                                                 | Justification                                                                 | Reasoning Chain                                                                                                                                                             |
|---------------------------|-------------------------------------------------------------------------|-------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Power-to-Weight Ratio** | `horsepower / curb-weight`                                              | Critical performance metric: Higher ratio indicates better acceleration and sportier vehicles.                                               | 

**Training XGBoost with all original features, and then with the new features suggested by the multi-agent system (specifically, by the Feature Engineering Agent)**

In [33]:
df.columns

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')

In [34]:
features = [col for col in df.columns if col != 'price']
xgboost_results_original = run_xgboost_cv(df, features, False)

Fitting 5 folds for each of 96 candidates, totalling 480 fits
Best parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 400, 'subsample': 1}
Best CV RMSE: 2616.435608016184
Test RMSE: 2724.65
Test MAE: 1810.04
Test R²: 0.905
Adjusted R²: 0.748

Top Feature Importances:
make           0.280320
engine-size    0.167140
engine-type    0.133444
curb-weight    0.121030
highway-mpg    0.055936
fuel-system    0.036141
horsepower     0.031452
city-mpg       0.028791
width          0.024451
body-style     0.023890
dtype: float32


I use one of the runs from the chain. I apply only its feature engineering suggestions, not the external enrichments.

### Feature Engineering Suggestions

| Feature Name             | Formula                                                                 | Justification                                                                 | Reasoning Chain                                                                                                                                                             |
|--------------------------|-------------------------------------------------------------------------|-------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Power-to-Weight Ratio** | `horsepower / curb-weight`                                             | Critical performance metric: Higher ratios indicate better acceleration and handling, directly influencing price and desirability.           | EDA shows strong correlation between horsepower and price. Combining with curb-weight (key efficiency/performance factor) quantifies dynamic performance beyond raw power. |
| **Fuel Efficiency Gap**   | `highway-mpg - city-mpg`                                               | Highlights drivetrain optimization: Larger gaps indicate engines optimized for highway cruising (e.g., diesels), affecting buyer preferences. | EDA reveals diesel engines (high compression) behave differently. This feature captures efficiency adaptability across driving conditions.                               |
| **Cylinder Efficiency**   | `horsepower / num-of-cylinders`                                        | Measures engineering sophistication: High values indicate advanced tech (e.g., turbocharging), justifying price premiums.                    | Turbocharged engines (aspiration=turbo) show higher horsepower. Normalizing by cylinders isolates engineering efficiency from sheer size.                              |
| **Size Index**            | `(length * width * height) / 1000`                                     | Represents interior volume and stability: Larger vehicles command higher prices and influence safety perceptions.                            | Strong correlation between dimensions/weight in EDA. This aggregate metric better captures spatial utility than individual dimensions.                                  |
| **High-Risk Safety Flag** | `1 if normalized-losses > 200 else 0`                                  | Identifies safety outliers for insurance/risk models: Vehicles with losses >200 are actuarially high-risk.                                   | EDA flags normalized-losses >200 as extreme values. This binary feature isolates high-risk vehicles for targeted analysis.                                              |
| **Performance Engine Flag**| `1 if (aspiration = 'turbo') and (horsepower > 150) else 0`            | Tags high-performance configurations: Turbocharged engines with >150HP are premium segments affecting price and insurance.                   | EDA notes turbo linkage to higher horsepower. Combining both isolates true performance models (e.g., sports cars).                                                    |
| **Diesel Compression Flag**| `1 if (compression-ratio > 15) or (fuel-type = 'diesel') else 0`       | Flags diesel/high-efficiency engines: High compression ratios (≥15) define distinct efficiency/performance characteristics.                 | Anomaly at compression-ratio=23 (diesel trait). This feature groups diesel-specific engineering for better modeling.                                                  |
| **Footprint Ratio**       | `wheel-base / (length * width)`                                        | Measures stability vs. maneuverability: Lower ratios indicate agile handling; higher ratios imply highway stability.                         | Wheel-base correlates with length/width. This ratio captures design trade-offs affecting driver experience and safety.                                                  |

### External Enrichment Suggestions
| Feature Name          | Formula/Enrichment Source                                     | Justification                                                                 | Reasoning Chain                                                                                                                               |
|-----------------------|---------------------------------------------------------------|-------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| **Brand Tier**        | Map `make` to [luxury, mainstream, economy] via KBB/Edmunds   | Luxury brands (e.g., Porsche) command price premiums; economy brands prioritize efficiency. Critical for market segmentation. | EDA shows price strongly correlates with make. External tier data contextualizes brand positioning beyond raw names.                         |
| **Safety Rating**     | NHTSA/IIHS crash-test scores joined via `make` and model year | Explains insurance metrics (`symboling`, `normalized-losses`) and consumer safety preferences.                                              | High normalized-losses values lack context. Official safety ratings provide objective validation for risk modeling.                          |
| **Vehicle Class**     | NHTSA classification using `body-style` + `curb-weight`       | Standardizes segments (e.g., SUV, sports car) affecting pricing, regulations, and consumer behavior.                                        | Current body-style lacks granularity (e.g., "sedan" spans compacts to luxury). Weight-based classes improve market trend analysis.           |


In [35]:
df_feat = df.copy()

In [36]:
df_feat['power-to-weight-ratio'] = df_feat['horsepower'] / df_feat['curb-weight']

df_feat['fuel-efficiency-gap'] = df_feat['highway-mpg'] - df_feat['city-mpg']

df_feat['cylinder-efficiency'] = df_feat['horsepower'] / df_feat['num-of-cylinders']

df_feat['size-index'] = (df_feat['length'] * df_feat['width'] * df_feat['height']) / 1000

df_feat['high-risk-safety-flag'] = (df_feat['normalized-losses'] > 200).astype(int)

df_feat['performance-engine-flag'] = ((df_feat['aspiration'] == 'turbo') & (df_feat['horsepower'] > 150)).astype(int)

df_feat['diesel-compression-flag'] = ((df_feat['compression-ratio'] > 15) | (df_feat['fuel-type'] == 'diesel')).astype(int)

df_feat['footprint-ratio'] = df_feat['wheel-base'] / (df_feat['length'] * df_feat['width'])
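One detail of the flag features is worth noting: a comparison against `NaN` evaluates to `False`, so `(col > 200).astype(int)` silently maps rows with a missing `normalized-losses` value to 0. The sketch below contrasts this with a variant that keeps the missing information; the toy values are assumptions, and the nullable variant is an alternative rather than what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy values, not taken from the dataset.
losses = pd.Series([250.0, 120.0, np.nan], name="normalized-losses")

# Same pattern as the notebook: the missing value becomes 0 ("not high risk").
flag_plain = (losses > 200).astype(int)

# Alternative: a nullable integer flag that stays <NA> where the input is missing.
flag_nullable = (losses > 200).astype("Int64").where(losses.notna())

print(flag_plain.tolist())
print(flag_nullable.isna().tolist())
```

Whether the distinction matters depends on the model. Tree boosters such as XGBoost can route missing values explicitly, so preserving them may carry some signal.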

In [37]:
features = [col for col in df_feat.columns if col != 'price']
xgboost_results_llm = run_xgboost_cv(df_feat, features, False)

Fitting 5 folds for each of 96 candidates, totalling 480 fits
Best parameters: {'learning_rate': 0.03, 'max_depth': 3, 'n_estimators': 400, 'subsample': 1}
Best CV RMSE: 2617.1461070542273
Test RMSE: 2718.37
Test MAE: 1745.95
Test R²: 0.906
Adjusted R²: 0.462

Top Feature Importances:
engine-size            0.312064
make                   0.213388
engine-type            0.096205
highway-mpg            0.069226
curb-weight            0.065309
fuel-system            0.059634
num-of-cylinders       0.029277
horsepower             0.024970
body-style             0.024140
cylinder-efficiency    0.020054
dtype: float32


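This time, the variables suggested by the Feature Engineering Agent do not rank as highly: only `cylinder-efficiency` reaches the top ten, in last place (importance 0.020). This may indicate that the Analyst Agent's output carries less useful information than was available in the separate-agents setup, where the Feature Engineering Agent received the data summary (including sample records and statistics) directly as input. The test metrics improve only marginally (RMSE 2718.37 vs. 2724.65, MAE 1745.95 vs. 1810.04, R² 0.906 vs. 0.905), the best CV RMSE is essentially unchanged (2617.15 vs. 2616.44), and the adjusted R² falls from 0.748 to 0.462, which is consistent with the penalty for the eight additional predictors.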
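The gap between R² and adjusted R² in the two runs can be checked by hand. The sketch below assumes that the `adjusted_r2` helper uses the standard formula and that the test set has 41 rows (a 20% hold-out of about 205 rows). Neither assumption is confirmed in this notebook, and the function name `adjusted_r2_sketch` is hypothetical.

```python
def adjusted_r2_sketch(r2: float, n_samples: int, n_features: int) -> float:
    """Standard adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("n_samples must be greater than n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


n_test = 41  # assumption: 20% hold-out of roughly 205 rows

# 25 original predictors vs. 25 + 8 engineered predictors.
baseline = adjusted_r2_sketch(0.905, n_test, 25)
with_llm_features = adjusted_r2_sketch(0.906, n_test, 33)

print(round(baseline, 3), round(with_llm_features, 3))
```

Under these assumptions the values come out close to the reported 0.748 and 0.462, with small differences due to the rounded R² inputs. With so few test rows, each extra predictor removes a large share of the remaining degrees of freedom, so the adjusted score drops sharply even though the raw R² barely changes.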

**Explanation agent**

In [38]:
def explain_prediction_llm(row, y_true, y_pred, importances, rmse, r2, cv_rmse, client,
                           ml_model="XGBoost", model="deepseek/deepseek-r1-0528:free",
                           txt_filename="llm_explanation.txt", temperature=0.2
):
    
    top_features = importances.head(5).index.tolist()
    feature_vals = "\n".join([f"- **{feat}**: {row[feat]}" for feat in top_features])
    prompt = f"""
    You are an AI assistant for explaining car price predictions to non-technical stakeholders

    A machine learning model ({ml_model}) was trained to predict vehicle prices based on technical and categorical features.
    Here are its key performance metrics:
    - Test RMSE: ${rmse:.2f} (the typical error in predicted prices)
    - Test R²: {r2:.3f} (explanatory power on unseen data)
    - CV RMSE: ${cv_rmse:.2f} (cross-validated training error)

    Below is one car's prediction:
    - Actual price: ${y_true:.2f}
    - Predicted price: ${y_pred:.2f}

    Key features for this car:  
    {feature_vals}

    Top overall features driving model predictions:  
    {top_features}

    Your task (for non-technical stakeholders):

    - Write a clear, plain-language explanation of why the model predicted this car’s price as it did.  
        - Reference the most important features and their values for this specific car.
        - Highlight which features make it expensive or cheap.
        - Mention if the prediction is likely to be especially reliable or uncertain (for example, maybe the car has rare features or brand).
    - Briefly explain what the overall performance metrics (RMSE, R²) mean for buyers, sellers, or decision makers.

    Don’t use technical terms unless they are explained above. Always relate your explanation to real-world or business ideas.
    """
    messages = [
        {"role": "system", "content": "You are a helpful AI analytics explainer."},
        {"role": "user", "content": prompt}
    ]
    response = client.chat.completions.create(model=model, messages=messages, temperature=temperature)

    output = response.choices[0].message.content
    with open(txt_filename, "w", encoding="utf-8") as f:
        f.write(output)
    return output

In [39]:
y_test = xgboost_results_llm['y_test']
y_pred = xgboost_results_llm['y_pred']
importances = xgboost_results_llm['importances']
X_test = xgboost_results_llm['X_test']
cv_rmse = xgboost_results_llm['cv_rmse']
rmse = xgboost_results_llm['rmse']
r2 = xgboost_results_llm['r2']

In [40]:
interesting_idxs = get_interesting_idxs(y_test, y_pred)
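The implementation of `get_interesting_idxs` lives in the helpers module and is not shown here. One plausible selection rule, sketched below purely as an assumption, is to pick the test rows with the smallest and largest absolute errors. The function name `pick_extreme_error_positions` is hypothetical.

```python
import numpy as np


def pick_extreme_error_positions(y_true, y_pred, k=1):
    """Return positional indices of the k best and k worst predictions.

    Hypothetical stand-in for illustration; not the notebook's helper.
    """
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    order = np.argsort(errors)           # ascending absolute error
    best = order[:k].tolist()            # most accurate predictions
    worst = order[-k:][::-1].tolist()    # least accurate predictions
    # Avoid duplicates when the test set is very small.
    return best + [i for i in worst if i not in best]


positions = pick_extreme_error_positions([10, 20, 30, 40], [11, 20, 25, 60])
print(positions)
```

The indices returned by this sketch are positional, which matches how `explain_interesting_cases` looks rows up with `.iloc`.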

In [41]:
def explain_interesting_cases(
    interesting_idxs, X_test, y_test, y_pred,
    importances, rmse, r2, cv_rmse, client, ml_model, next_idx,
    out_dir="outputs/explanation_agent",
    temperature=0.2
):
    """
    Loop over all interesting indices and call explain_prediction_llm for each case, using a given next_idx.

    Inputs:

    interesting_idxs : list of int
        Indices to explain in the test set.
    X_test : pandas DataFrame
        Test set features.
    y_test : pandas Series or array
        True test values (must line up with X_test index).
    y_pred : array
        Corresponding model predictions.
    importances : any
        Feature importance data to pass to explainer.
    rmse, r2, cv_rmse : float
        Metrics for the explainer.
    client : object
        LLM client object.
    ml_model : str
        Model name string.
    next_idx : str
        The zero-padded index string for this batch of explanations.
    out_dir : str
        Directory to save explanations.
    temperature : float
        LLM temperature setting.

    Output:
    
    outputs : dict
        Dictionary mapping idx to explanation string.
    """
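    # Make sure the output directory exists before any explanation file is written.
    os.makedirs(out_dir, exist_ok=True)
    outputs = {}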
    for idx in interesting_idxs:
        row = X_test.iloc[idx]
        y_true = y_test.iloc[idx]
        y_hat = y_pred[idx]
        txt_filename = os.path.join(out_dir, f"{next_idx}_two_agent_chain_explanation_{idx}.txt")
        output = explain_prediction_llm(
            row=row,
            y_true=y_true,
            y_pred=y_hat,
            importances=importances,
            rmse=rmse,
            r2=r2,
            cv_rmse=cv_rmse,
            client=client,
            ml_model=ml_model,
            txt_filename=txt_filename,
            temperature=temperature
        )
        outputs[idx] = output
        print(f"\n=== Explanation for test row {idx} ===\n")
        print(output)
    return outputs

In [42]:
out_dir = "outputs/explanation_agent"
next_idx = get_next_run_idx(out_dir, agent_name="two_agent_chain_explanation")
ml_model = "XGBoost"

In [43]:
client = OpenAI(
    base_url=base_url,
    api_key=api_key
)

In [44]:
explain_interesting_cases(interesting_idxs, X_test, y_test, y_pred, importances, rmse, r2, cv_rmse, client, ml_model, next_idx)


=== Explanation for test row 15 ===

### Explanation for This Car's Prediction
The model predicted this Toyota's price at **$6,417.97**, very close to its actual price of **$6,488.00** (only **$70.03 under**). Here's why:

1. **Key features driving the prediction**:  
   - **Engine size (92.0)**: Smaller engines typically lower costs. This modest size helped reduce the predicted price.  
   - **Make (Toyota)**: Toyota is a reliable, mass-market brand, which usually means moderate pricing (not luxury-tier).  
   - **Highway MPG (38)**: Excellent fuel efficiency makes this car economical to run, appealing to budget-conscious buyers and keeping the price competitive.  
   - **Curb weight (2015 lbs)**: Lighter than average, suggesting a compact design. This often lowers manufacturing costs and final prices.  
   - **Engine type (OHC)**: A common, standard engine type—not a high-end feature, so it doesn’t push the price up.  

   **Overall**: The combination of a small engine, efficient fu

{15: '### Explanation for This Car\'s Prediction\nThe model predicted this Toyota\'s price at **$6,417.97**, very close to its actual price of **$6,488.00** (only **$70.03 under**). Here\'s why:\n\n1. **Key features driving the prediction**:  \n   - **Engine size (92.0)**: Smaller engines typically lower costs. This modest size helped reduce the predicted price.  \n   - **Make (Toyota)**: Toyota is a reliable, mass-market brand, which usually means moderate pricing (not luxury-tier).  \n   - **Highway MPG (38)**: Excellent fuel efficiency makes this car economical to run, appealing to budget-conscious buyers and keeping the price competitive.  \n   - **Curb weight (2015 lbs)**: Lighter than average, suggesting a compact design. This often lowers manufacturing costs and final prices.  \n   - **Engine type (OHC)**: A common, standard engine type—not a high-end feature, so it doesn’t push the price up.  \n\n   **Overall**: The combination of a small engine, efficient fuel use, lightweight