EDAgent

Your personal Exploratory Data Analyst + AI Agent

Drop in a CSV. Get a complete EDA — automatically.

An agentic, LLM-powered Exploratory Data Analysis pipeline built with LangGraph and Ollama.
Nine specialized AI agents analyze your dataset, then generate a polished HTML report — all running 100% locally.

What it does

You give it a CSV. It spins up a 9-stage LangGraph pipeline where each node is an AI agent that analyzes a different aspect of your data, writes a summary, and passes its findings to the next stage. At the end, you get:

An interactive Streamlit dashboard with tabbed results and live progress
A rich, color-coded terminal output (if run via CLI)
A self-contained report.html — dark-themed, browser-ready, heatmap embedded inline
A correlation_heatmap.png saved to output/

Pipeline

schema → quality → stats → outliers → correlation → importance → synthesis → model_rec → feature_eng

Each node runs a Python analysis tool first, then passes the raw result to the LLM to reason over and summarize in plain English.
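
The tool-then-LLM pattern described above can be sketched in plain Python. This is a minimal illustration, not the code in pipeline.py: the real project wires its nodes together with LangGraph and calls a local model through Ollama, whereas here the names make_node, schema_tool, quality_tool, and fake_llm are hypothetical and the "LLM" is a stub that only formats a string.

```python
from typing import Any, Callable, Dict, List, Tuple

Rows = List[Dict[str, Any]]
State = Dict[str, Any]


def make_node(name: str, tool: Callable[[Rows], dict], summarize: Callable) -> Callable[[State], State]:
    """Build a stage that runs a deterministic tool, then asks the summarizer to explain it."""
    def node(state: State) -> State:
        raw = tool(state["rows"])                             # 1. Python analysis first
        summary = summarize(name, raw, state["summaries"])    # 2. LLM reasons over the raw result
        return {
            **state,
            "raw": {**state["raw"], name: raw},
            "summaries": {**state["summaries"], name: summary},
        }
    return node


def schema_tool(rows: Rows) -> dict:
    columns = list(rows[0].keys()) if rows else []
    return {"n_rows": len(rows), "n_cols": len(columns), "columns": columns}


def quality_tool(rows: Rows) -> dict:
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    missing = sum(1 for row in rows for value in row.values() if value is None)
    return {"duplicate_rows": duplicates, "missing_cells": missing}


def fake_llm(name: str, raw: dict, earlier: Dict[str, str]) -> str:
    # Stand-in for a real model call; it sees the raw result plus earlier summaries.
    return f"[{name}] {raw} (context from {len(earlier)} earlier stage(s))"


def run_pipeline(rows: Rows, stages: List[Tuple[str, Callable[[Rows], dict]]]) -> State:
    state: State = {"rows": rows, "raw": {}, "summaries": {}}
    for name, tool in stages:
        state = make_node(name, tool, fake_llm)(state)
    return state
```

Each stage adds its raw result and its summary to the shared state, so later stages can build on earlier findings.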

Architecture note: All Python computations (correlation matrices, IQR outlier bounds, descriptive stats) run on the full dataset, so the reported statistics are exact. The data passed to the LLM, however, is intentionally capped (for example, only the 20 strongest correlations or the top 15 outlier features) to keep prompts within the context window and reduce the risk of hallucination.
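
As an illustration of that capping step, the sketch below computes the Pearson matrix on the full DataFrame and then keeps only the k strongest pairs for the prompt. The function name top_correlations and the output shape are assumptions made for this example; the actual selection logic in pipeline.py may differ.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, k: int = 20) -> list:
    """Return the k column pairs with the largest absolute Pearson correlation."""
    # Full-dataset computation: nothing is sampled or truncated here.
    corr = df.select_dtypes(include="number").corr(method="pearson")

    # Keep the strictly upper triangle so each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()

    # Only this capped slice would be serialized into the LLM prompt.
    strongest = pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(k)
    return [(a, b, round(float(r), 3)) for (a, b), r in strongest.items()]
```

Ranking by absolute value keeps strong negative relationships in the list alongside strong positive ones.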

#	Agent	What it analyzes
1	Schema	Shape, column types, null counts per column
2	Quality	Duplicates, missing value %, columns with nulls
3	Statistics	Descriptive stats, skewness, categorical value counts
4	Outliers	IQR-based detection — count, bounds, example values
5	Correlation	Pearson matrix, multicollinearity flags, heatmap
6	Feature Importance	Target-aware correlation ranking when a target column is detected (multi-tiered target detection); otherwise falls back to variance (numeric) and entropy (categorical) ranking
7	Synthesis	Full EDA narrative — overview, issues, patterns, recommendations
8	Model Recommendation	Infers problem type, recommends models, flags uncertainty, suggests metrics
9	Feature Engineering	Suggests concrete new features: log transforms, bins, interactions, encodings
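
For reference, the IQR-based detection used by the Outliers agent (row 4) can be sketched as follows. The 1.5 multiplier is the conventional Tukey fence and is an assumption here, as are the function name iqr_outliers and the max_examples parameter; the README does not state which values pipeline.py uses.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5, max_examples: int = 5) -> dict:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] and report count, bounds, and examples."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    flagged = values[(values < lower) | (values > upper)]
    return {
        "count": int(flagged.size),
        "lower": float(lower),
        "upper": float(upper),
        "examples": flagged.head(max_examples).tolist(),
    }
```

Returning only a handful of example values keeps the result small enough to hand to the LLM while the count and bounds still reflect every row.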

Note: Feature engineering code is executed in a restricted sandbox (__builtins__ stripped; only pd, np, and df exposed) with a 2-attempt reflection loop that feeds any error back to the LLM for self-correction. To prevent data leakage, multi-tiered target detection identifies the target variable and hides it from both the LLM prompt and the sandbox. Newly generated features are then checked, and trivial copies of existing features (correlation > 0.999) are rejected.
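
A rough sketch of how those three safeguards might fit together is shown below. It is an illustration under stated assumptions, not the project's implementation: run_in_sandbox, is_trivial_copy, engineer_features, and stub_llm are hypothetical names, and stub_llm stands in for the real model by deliberately failing on its first attempt.

```python
import numpy as np
import pandas as pd


def run_in_sandbox(code: str, features: pd.DataFrame) -> pd.DataFrame:
    """Execute generated code with no builtins and only pd, np, and df in scope."""
    scope = {"__builtins__": {}, "pd": pd, "np": np, "df": features.copy()}
    exec(code, scope)
    return scope["df"]


def is_trivial_copy(new_col: pd.Series, existing: pd.DataFrame, threshold: float = 0.999) -> bool:
    """True if the new feature correlates above the threshold with any existing numeric feature."""
    if not pd.api.types.is_numeric_dtype(new_col):
        return False
    for name in existing.select_dtypes(include="number").columns:
        r = new_col.corr(existing[name])
        if pd.notna(r) and r > threshold:
            return True
    return False


def engineer_features(df: pd.DataFrame, target: str, propose_code, max_attempts: int = 2) -> pd.DataFrame:
    # The target is dropped up front, so neither the prompt nor the sandbox ever sees it.
    features = df.drop(columns=[target])
    error = None
    for _ in range(max_attempts):
        code = propose_code(list(features.columns), error)
        try:
            result = run_in_sandbox(code, features)
        except Exception as exc:
            # Reflection: the error text is passed to the next attempt.
            error = f"{type(exc).__name__}: {exc}"
            continue
        new_cols = [c for c in result.columns if c not in features.columns]
        kept = [c for c in new_cols if not is_trivial_copy(result[c], features)]
        return result[list(features.columns) + kept]
    return features  # every attempt failed; keep the original features


def stub_llm(columns, error):
    # Hypothetical stand-in for the model: wrong column first, corrected after feedback.
    if error is None:
        return "df['log_x'] = np.log1p(df['missing_column'])"
    return "df['log_x'] = np.log1p(df['x'])\ndf['x_double'] = df['x'] * 2"
```

With a frame containing x and a target column, the first attempt raises a KeyError, the second succeeds, log_x is kept, and x_double is rejected as a trivial copy of x. Emptying __builtins__ narrows what generated code can reach, but it is a restricted namespace rather than a hardened security boundary.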

Quickstart

1. Prerequisites

Install Ollama and pull the model:

ollama pull llama3.2

2. Install dependencies

pip install -r requirements.txt

3. Run the Dashboard

streamlit run app.py

This will open the EDAgent web dashboard in your browser. Just drag and drop your CSV into the upload area!

Running in CLI (Alternative)

If you prefer the terminal, you can run the pipeline directly:

# On your own dataset
python pipeline.py your_dataset.csv

# With built-in test data
python pipeline.py

Example terminal output

┌──────────────────────────────────────────────────────────────────┐
│ EDAgent Pipeline                                                 │
│ Dataset: Teen_Mental_Health_Dataset.csv  Rows: 1200  Cols: 13   │
└──────────────────────────────────────────────────────────────────┘

  [ schema ]       Running schema agent...
  [ quality ]      Running quality agent...
  [ stats ]        Running stats agent...
  [ outliers ]     Running outlier agent...
  [ correlation ]  Running correlation agent...
  [ importance ]   Running feature importance agent...
  [ synthesis ]    Running synthesis agent...
  [ model-rec ]    Running model recommendation agent...
  [ feature-eng ]  Running feature engineering agent...

┌─────────────────────────────────────────────────────────────────┐
│ Done!                                                           │
│ HTML Report → output/report.html                                │
│ Heatmap     → output/correlation_heatmap.png                    │
└─────────────────────────────────────────────────────────────────┘

Tech Stack

Tool	Role
Streamlit	Interactive web dashboard
LangGraph	Agent orchestration & state management
LangChain + Ollama	Local LLM inference
Llama 3.2	The underlying language model
pandas	Data analysis
seaborn / matplotlib	Correlation heatmap
scipy	Entropy calculation for feature importance
rich	Terminal UI

Project Structure

EDAgent/
├── app.py               # Streamlit web dashboard
├── pipeline.py          # Full LangGraph pipeline (all 9 agents)
├── requirements.txt
├── README.md
└── output/              # Auto-generated, git-ignored
    ├── report.html
    └── correlation_heatmap.png

Why local?

No API keys. No data sent to the cloud. Everything runs on your machine via Ollama. Swap llama3.2 for any other Ollama-compatible model by changing one line in pipeline.py:

llm = ChatOllama(model="your-model-here")
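
If you would rather not edit the file each time, one option is to read the model name from an environment variable. This is a sketch only: the EDAGENT_MODEL variable and the resolve_model_name helper are hypothetical and are not part of the project.

```python
import os

DEFAULT_MODEL = "llama3.2"


def resolve_model_name(env=None) -> str:
    """Return the model named in EDAGENT_MODEL (hypothetical), or the default if unset or blank."""
    env = os.environ if env is None else env
    name = env.get("EDAGENT_MODEL", "").strip()
    return name or DEFAULT_MODEL


# In pipeline.py, the result could replace the hard-coded string:
# llm = ChatOllama(model=resolve_model_name())
```

Whichever model you choose still has to be pulled with ollama pull before the pipeline can use it.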

Created by Varnit :)
