# Evaluating Large Language Models for Automated Exploratory Data Analysis  
## A Statistical Perspective

This experiment compares:
- **Human statistical analysis (ground truth)**  
- **LLM-generated EDA insights (Ollama + Mistral)**

The goal is to evaluate whether LLMs can replicate statistically grounded EDA reasoning.


## Hypothesis

**H₀ (Null Hypothesis):**  
LLM-generated EDA insights are not semantically aligned with
statistically grounded human analysis.

**H₁ (Alternative Hypothesis):**  
LLM-generated EDA insights show significant semantic alignment
with human statistical analysis.
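
The alternative hypothesis speaks of *significant* alignment, which a single similarity score cannot show by itself. One possible way to make the hypothesis testable, sketched here under assumptions, is a one-sided permutation test that compares similarity scores for matched human/LLM summary pairs against scores for mismatched pairs. The function name `permutation_p_value` and all scores below are hypothetical and are not results from this experiment.

```python
import random
from statistics import mean


def permutation_p_value(matched, mismatched, n_permutations=10_000, seed=0):
    """One-sided permutation test: is mean(matched) larger than mean(mismatched)?"""
    rng = random.Random(seed)
    observed = mean(matched) - mean(mismatched)
    pooled = list(matched) + list(mismatched)
    n_matched = len(matched)

    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled scores to the two groups
        rng.shuffle(pooled)
        diff = mean(pooled[:n_matched]) - mean(pooled[n_matched:])
        if diff >= observed:
            hits += 1

    # Add-one correction keeps the estimated p-value strictly positive
    return (hits + 1) / (n_permutations + 1)


# Hypothetical similarity scores, for illustration only
matched_scores = [0.62, 0.58, 0.71, 0.66, 0.60]      # human summary i vs. LLM summary i
mismatched_scores = [0.31, 0.28, 0.40, 0.35, 0.33]   # human summary i vs. LLM summary j

p_value = permutation_p_value(matched_scores, mismatched_scores)
print(f"p-value: {p_value:.4f}")
```

A small p-value would support rejecting H₀. Running such a test would require summaries for several datasets rather than the single dataset analysed in this notebook.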


In [13]:
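import os
import sys

# Absolute path to the project root (one level above the current working directory)
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))

# Add the project root to the Python path so the `app` package can be imported
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

print("Added to sys.path:", PROJECT_ROOT)

# Import only after sys.path is updated; otherwise `app` may not be found
from app.preprocessing import clean_dataset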


Added to sys.path: /home/bane/EDA_LLM


In [2]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util

from app.eda import generate_eda_context
from app.llm_insights import generate_llm_insights
from app.stats_human_baseline import (
    human_statistical_summary,
    human_summary_text
)




In [None]:
df_raw = pd.read_csv("/home/bane/EDA_LLM/data/data.csv")
df_clean = clean_dataset(df_raw)
df_clean.head()

| Unnamed: 0 | CountryName | CountryCode | BirthRate | InternetUsers | IncomeGroup |
|---|---|---|---|---|---|
| 0 | Aruba | ABW | 10.244 | 78.9 | High income |
| 1 | Afghanistan | AFG | 35.253 | 5.9 | Low income |
| 2 | Angola | AGO | 45.985 | 19.1 | Upper middle income |
| 3 | Albania | ALB | 12.877 | 57.2 | Upper middle income |
| 4 | United Arab Emirates | ARE | 11.044 | 88.0 | High income |


In [None]:
# HUMAN EDA - RAW
human_raw = human_summary_text(
    human_statistical_summary(df_raw)
)

# HUMAN EDA - CLEANED
human_clean = human_summary_text(
    human_statistical_summary(df_clean)
)


📊 HUMAN-READABLE STATISTICAL SUMMARY

🔢 NUMERIC FEATURES:
- BirthRate: mean=21.47, median=19.68, std=10.61, min=7.9, max=49.66, missing=0
- InternetUsers: mean=42.08, median=41.0, std=29.03, min=0.9, max=96.55, missing=0

🔤 CATEGORICAL FEATURES:
- CountryName: unique_values=195, most_common='Afghanistan', missing=0
- CountryCode: unique_values=195, most_common='ABW', missing=0
- IncomeGroup: unique_values=4, most_common='High income', missing=0


In [None]:
# LLM EDA - RAW
eda_raw = generate_eda_context(df_raw)
llm_raw = generate_llm_insights(eda_raw)

# LLM EDA - CLEANED
eda_clean = generate_eda_context(df_clean)
llm_clean = generate_llm_insights(eda_clean)


{'numeric_summary': {'BirthRate': {'mean': 21.469928205128202,
   'median': 19.68,
   'std': 10.605466693579938,
   'min': 7.9,
   'max': 49.661},
  'InternetUsers': {'mean': 42.07647089194872,
   'median': 41.0,
   'std': 29.030788424830387,
   'min': 0.9,
   'max': 96.5468}},
 'correlation_matrix': {'BirthRate': {'BirthRate': 1.0,
   'InternetUsers': -0.816},
  'InternetUsers': {'BirthRate': -0.816, 'InternetUsers': 1.0}},
 'dataset_shape': {'rows': 195, 'columns': 5}}

In [None]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Stability of reasoning
human_stability = util.cos_sim(
    model.encode(human_raw, convert_to_tensor=True),
    model.encode(human_clean, convert_to_tensor=True)
).item()

llm_stability = util.cos_sim(
    model.encode(llm_raw, convert_to_tensor=True),
    model.encode(llm_clean, convert_to_tensor=True)
).item()

print("Human Raw vs Clean Similarity:", round(human_stability, 3))
print("LLM Raw vs Clean Similarity:", round(llm_stability, 3))


In [9]:
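# `eda_context` is not defined in the cells shown; the cleaned-data context is assumed here
eda_context = eda_clean
llm_text = generate_llm_insights(eda_context)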

print("LLM GENERATED SUMMARY:\n")
print(llm_text)


LLM GENERATED SUMMARY:

 In the provided dataset, we observe the following trends and patterns:

1. Birth Rate (BirthRate) has a mean of approximately 21.47, median of 19.68, and standard deviation of around 10.61. The range is quite large, from 7.9 to 49.66.

2. Internet Users per 100 people (InternetUsers) has a mean of approximately 42.08, median of 41.0, and standard deviation of around 29.03. The minimum and maximum values are 0.9 and 96.55 respectively.

3. There is a strong negative correlation (-0.816) between Birth Rate and Internet Users, indicating that as the Birth Rate increases, the number of Internet Users tends to decrease, and vice versa.

4. The dataset contains 195 observations across 5 features (columns). This suggests relative stability in terms of sample size, but the wide ranges in some variables may indicate variability in the data.

5. Statistically meaningful observations could be those with extreme values (min, max) or outliers identified through further stat

In [10]:
model = SentenceTransformer("all-MiniLM-L6-v2")

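# `human_text` is not defined in the cells shown; the cleaned-data summary is assumed here
human_text = human_clean
emb_human = model.encode(human_text, convert_to_tensor=True)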
emb_llm = model.encode(llm_text, convert_to_tensor=True)

semantic_similarity = util.cos_sim(emb_human, emb_llm).item()

semantic_similarity


0.4192032516002655
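
To make the score above easier to read, the following minimal sketch shows the cosine similarity formula that `util.cos_sim` applies to the two embeddings: the dot product divided by the product of the vector norms. The helper `cosine_similarity` and the toy vectors are illustrative assumptions; real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors, in the range [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors, for illustration only
same_direction = cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
print(same_direction, orthogonal, opposite)
```

On this scale, 1 means the embeddings point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions, so a value near 0.42 sits between unrelated and identical.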

In [11]:
human_words = set(human_text.lower().split())
llm_words = set(llm_text.lower().split())

coverage = len(human_words & llm_words) / len(human_words)
coverage


0.03125
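
The very low coverage may partly reflect tokenization rather than content. Splitting on whitespace keeps punctuation attached, so a token such as `mean=21.47,` in a summary formatted like the human baseline printed earlier cannot match `21.47,` in the LLM text. The sketch below shows one possible normalization, assuming that lowercase words and decimal numbers are the units worth comparing. The helpers `normalized_tokens` and `token_coverage` and the two sample strings are hypothetical.

```python
import re


def normalized_tokens(text):
    """Return the set of lowercase words and numbers, keeping decimals such as 21.47 intact."""
    return set(re.findall(r"[a-z]+|\d+(?:\.\d+)?", text.lower()))


def token_coverage(reference, candidate):
    """Fraction of the reference tokens that also appear in the candidate text."""
    ref = normalized_tokens(reference)
    if not ref:
        return 0.0
    return len(ref & normalized_tokens(candidate)) / len(ref)


# Sample strings modelled on the summaries above, for illustration only
reference = "- BirthRate: mean=21.47, median=19.68"
candidate = "Birth Rate (BirthRate) has a mean of approximately 21.47, median of 19.68."

normalized_coverage = token_coverage(reference, candidate)
whitespace_overlap = len(set(reference.lower().split()) & set(candidate.lower().split()))
print("Normalized coverage:", normalized_coverage)
print("Whitespace-split tokens in common:", whitespace_overlap)
```

Even with normalization, word overlap measures shared vocabulary, not shared insight, so it is best read alongside the embedding-based similarity.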

In [12]:
results = {
    "Semantic Similarity": round(semantic_similarity, 3),
    "Insight Coverage": round(coverage, 3)
}

results


{'Semantic Similarity': 0.419, 'Insight Coverage': 0.031}

## Interpretation

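- The human statistical summary remains highly stable after data cleaning.
- The LLM-generated insights are more sensitive to preprocessing, which suggests that LLM reasoning is less robust to data transformations.
- Against the human baseline, the LLM summary reaches a semantic similarity of 0.419 and an insight coverage of 0.031. These two scores come from a single dataset, so on their own they cannot establish the significant alignment stated in the alternative hypothesis.
- Together, these observations highlight the importance of structured grounding in applied systems.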
