A full-stack NLP application that transforms scattered thoughts into organized, semantically meaningful clusters with AI-powered insights. Built with Streamlit, LangGraph agents, and Neo4j.

aevnum/braindump-sanctuary

🧠 Brain Dump Sanctuary

Capture, cluster, and reflect on your thoughts in real-time.

A full-stack NLP application that transforms scattered thoughts into organized, semantically meaningful clusters with AI-powered insights. Built with Streamlit, LangGraph agents, and Neo4j.


✨ Features

Core Capabilities

  • 💭 Brain Dump Capture: Write down thoughts, ideas, and questions instantly
  • 🔗 Semantic Clustering: Automatically group related thoughts using embeddings (sentence-transformers) and HDBSCAN
  • 📊 Interactive Knowledge Graph: Visualize all thoughts as an interactive Plotly graph with color-coded clusters
  • 🏷️ AI-Generated Labels: Auto-label clusters using Google Gemini LLM (e.g., "Personal Growth", "Technical Concepts")
  • 🤖 Multi-Agent Intelligence:
    • SearchAgent: Generate contextualized summaries with optional web search (Tavily)
    • QuestionAgent: Generate Socratic questions for deeper reflection
    • GenerationAgent: Synthesize insights across related thoughts
    • FeedAgent: Curate a blog-style feed of your 5 most recent thoughts with agent insights
  • 📝 Persistent Storage: All thoughts stored in Neo4j graph database with full history
  • 🔄 Real-time Updates: Refresh embeddings and recalculate clusters on demand

Two-View Interface

  1. Home Tab 🏠: Semantic cluster map + comprehensive table of all thoughts
  2. Feed Tab 📰: Blog-style cards with summaries and reflection questions

🏗️ Architecture

Tech Stack

Frontend:          Streamlit (single-page Python web app)
Backend:           Python 3.10+ with LangChain/LangGraph
Database:          Neo4j Aura (graph database)
NLP Pipeline:      
  - Embeddings:    sentence-transformers (all-MiniLM-L6-v2)
  - Clustering:    HDBSCAN + UMAP dimensionality reduction
  - LLM:           Google Gemini 2.5 Flash (via LangChain)
Search:            Tavily API (optional, with mock fallback)

System Architecture

┌─────────────────────────────────────────────────────────┐
│                    STREAMLIT UI                         │
│  (app.py - Home Tab | Feed Tab | Sidebar Controls)      │
└─────────────────┬───────────────────────────────────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
    ┌───▼────┐         ┌────▼───────┐
    │ AGENTS │         │ EMBEDDINGS │
    │ (Day2) │         │ & CLUSTERS │
    └───┬────┘         │ (Day 1)    │
        │              └────┬───────┘
        │                   │
    ┌───┴───────────────────▼────┐
    │    BRAINDUMP_CORE.PY       │
    │  - BrainDumpDB (Neo4j)     │
    │  - EmbeddingEngine         │
    │  - ClusterEngine           │
    │  - ClusterLabelEngine      │
    └───┬────────────────────────┘
        │
    ┌───▼──────────────────┐
    │   NEO4J GRAPH DB     │
    │  (Aura Cloud)        │
    │  - Dump Nodes        │
    │  - Cluster Nodes     │
    │  - Relationships     │
    └──────────────────────┘

Data Flow

Input → Processing → Storage → Visualization

  1. User enters brain dump → app.py sidebar
  2. Generate embedding → EmbeddingEngine (sentence-transformer)
  3. Store in Neo4j → BrainDumpDB.add_dump()
  4. Re-cluster on demand → ClusterEngine.cluster() + ClusterLabelEngine.label_clusters()
  5. Visualize clusters → Plotly interactive graph (Home tab)
  6. Enrich with AI → Agents analyze for Feed tab

Neo4j Graph Schema

Node: Dump
├── id: UUID
├── text: String (the brain dump)
├── embedding: Vector[384] (from sentence-transformer)
├── created_at: DateTime
└── cluster_id: Reference to Cluster

Node: Cluster
├── id: UUID
├── label: String (e.g., "Personal Growth")
├── description: String (optional)
└── created_at: DateTime

Relationships:
├── Dump -[:IN_CLUSTER]-> Cluster
└── Dump -[:SIMILAR_TO {weight: 0.0-1.0}]-> Dump
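
As a rough illustration of how a Dump node matching this schema might be written, here is a minimal sketch of a parameterized Cypher statement. It is not the project's actual `BrainDumpDB.add_dump()` code: the helper name `build_add_dump_query` and the parameter names are hypothetical, and only the node label and property names are taken from the schema above.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical Cypher for creating one Dump node. Property names follow the
# schema above; the statement itself is an illustration, not the project's query.
ADD_DUMP_CYPHER = (
    "CREATE (d:Dump {id: $id, text: $text, embedding: $embedding, "
    "created_at: datetime($created_at)}) RETURN d.id AS id"
)

EMBEDDING_DIM = 384  # vector size listed in the schema


def build_add_dump_query(text, embedding):
    """Return (cypher, params) for inserting a single Dump node."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} floats, got {len(embedding)}")
    params = {
        "id": str(uuid.uuid4()),
        "text": text,
        "embedding": [float(x) for x in embedding],
        # ISO 8601 string, parsed on the database side by Cypher's datetime()
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    return ADD_DUMP_CYPHER, params


cypher, params = build_add_dump_query("Why do dreams fade?", [0.0] * EMBEDDING_DIM)
print(cypher)
print(sorted(params))
```

With the official Neo4j Python driver, the pair could then be passed to `session.run(cypher, params)`. Passing values as parameters rather than formatting them into the query string keeps user text out of the Cypher itself.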

Component Breakdown

| File | Purpose | Key Classes |
| --- | --- | --- |
| app.py | Streamlit UI & session management | Main app logic, page layout |
| braindump_core.py | NLP pipeline & database | BrainDumpDB, EmbeddingEngine, ClusterEngine, ClusterLabelEngine |
| agents.py | AI agents for insights | QuestionAgent, SearchAgent, GenerationAgent, FeedAgent |
| neo4j_maintenance.py | Database utilities | Neo4j query helpers, debugging |
| cleanup_db.py | Reset database | Wipe all data (for testing) |

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • Neo4j Aura account (free tier available)
  • API keys for Google Gemini (free) and optionally Tavily

1. Setup Neo4j (Required)

Create a free Neo4j Aura instance:

  1. Go to https://console.neo4j.io/
  2. Sign up for a free account
  3. Create a new "Free" Aura instance
  4. Copy your connection details:
    • URI: neo4j+ssc://xxxxx.databases.neo4j.io
    • Username: neo4j
    • Password: (set during creation)

2. Installation

# Clone repository
git clone <repo-url>
cd braindump-sanctuary

# Install dependencies
pip install -r requirements.txt

# Create environment file
cp .env.example .env  # or create manually

# Edit .env with your credentials

.env Template:

# Neo4j (Required)
NEO4J_URI=neo4j+ssc://your-instance-id.databases.neo4j.io
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password-here

# Google Gemini (Required for AI features)
GOOGLE_API_KEY=your_google_api_key_here

# Web Search (Optional - uses mock data if not provided)
TAVILY_API_KEY=your_tavily_api_key_here

# Alternative: Perplexity (Optional)
PERPLEXITY_API_KEY=your_perplexity_key_here

Get API Keys:

3. Run the App

streamlit run app.py

App opens at http://localhost:8501


📱 User Interface Guide

Sidebar Controls

Left Sidebar - Main interaction hub:

  • 🗺️ Navigation: Radio buttons to toggle between Home and Feed tabs
  • 📝 Text Area: Input field for new brain dumps
  • ✅ Add to Sanctuary: Save the thought to Neo4j
  • 🗑️ Clear: Empty the input field
  • 🔄 Refresh Clusters: Recalculate all embeddings and clusters (computationally intensive, 10-30 seconds)
  • ⚠️ Clear All Dumps: Permanently delete all thoughts from the database
  • 📊 Total Brain Dumps: Counter showing total stored thoughts

Home Tab 🏠 - Knowledge Graph View

Top Section: Interactive Cluster Map

  • Visualization: 2D interactive Plotly graph
  • Nodes: Each point = one brain dump
  • Colors: Each color = a semantic cluster (similar ideas grouped together)
  • Layout: UMAP dimensionality reduction for spatial meaning
  • Interactions: Hover to see full text, zoom/pan to explore
  • Relationships: Edges show semantic similarity between thoughts

Bottom Section: Comprehensive Table

  • Columns: Brain Dump | Cluster Label | Created At
  • Sorting: Most recent first (newest at top)
  • Format: Plain text, easy to copy

Actions:

  • Click "🔄 Refresh Clusters" to recalculate (when you add many new thoughts)
  • Spinner animations show progress of embedding generation and labeling

Feed Tab 📰 - Blog-Style Feed

What You See:

  1. Up to 5 Recent Cards in reverse chronological order
  2. Each Card Contains:
    • 💭 Title: Your full brain dump text
    • 🏷️ Cluster Badge: Category label (generated by LLM)
    • 📄 Summary: AI-generated context
      • If Tavily API set: Real web search + synthesis
      • If not: Mock contextual data (demo mode)
    • ❓ Reflection Questions: 5 Socratic questions (from QuestionAgent)

Why Use Feed?

  • Quick morning review without table scrolling
  • AI-generated insights help you think deeper
  • Guided reflection via questions
  • Patterns become visible across multiple cards
  • Perfect for journaling workflow

🔄 Typical Workflows

Workflow 1: Daily Brain Dumping + Review

Morning:
1. Open app at http://localhost:8501
2. Sidebar: Type 3-5 quick thoughts
3. Click "Add to Sanctuary" each time
4. Switch to Feed tab
5. Read summaries and reflect on questions
6. (Takes 5-10 minutes)

Evening:
1. Go to Home tab
2. Click "Refresh Clusters"
3. Observe how new thoughts clustered with existing ones
4. Spot emerging themes and connections

Workflow 2: Deep Dive Analysis

1. Accumulate 20+ thoughts over several days
2. Go to Home tab
3. Click "Refresh Clusters"
4. Examine the visualization:
   - Which clusters are densest?
   - What themes emerge?
   - Any surprising connections?
5. Switch to Feed tab
6. Read the agent-generated insights
7. Use for next creative session

Workflow 3: Topic-Specific Exploration

1. Add 10+ thoughts about a specific topic
2. They should cluster together automatically
3. Hover over the cluster in Home tab
4. Read the AI-generated cluster label
5. Check Feed tab for synthesis across related thoughts
6. Use the questions to drill deeper

🤖 Agent Capabilities

QuestionAgent

Role: Socratic questioning for deeper reflection

Input: Brain dump text

Output: 5 open-ended questions

Example:

Brain Dump: "Why do I procrastinate on important tasks?"

Generated Questions:
1. What task are you procrastinating on right now?
2. What emotion arises when you think about starting it?
3. What would you do instead if you didn't procrastinate?
4. What is the smallest first step you could take?
5. What would happen if you started today?
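
If the LLM replies with a numbered list like the one above, the questions still need to be pulled out of the raw text before they can be shown on a card. The following is a minimal sketch of that parsing step, assuming a plain numbered-list reply; `parse_numbered_questions` is a hypothetical helper and not necessarily how QuestionAgent handles its output.

```python
import re


def parse_numbered_questions(raw, limit=5):
    """Extract items from lines such as '1. What ...?' in an LLM reply."""
    questions = []
    for line in raw.splitlines():
        # Accept "1." or "1)" followed by the question text.
        match = re.match(r"^\s*\d+[.)]\s+(.*\S)\s*$", line)
        if match:
            questions.append(match.group(1))
    return questions[:limit]


sample = """Generated Questions:
1. What task are you procrastinating on right now?
2. What emotion arises when you think about starting it?
"""
print(parse_numbered_questions(sample))
```

Lines that do not start with a number (headings, blank lines, chatty preambles) are skipped, and the `limit` argument caps the result at the 5 questions a card displays.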

SearchAgent

Role: Web search + synthesis or mock data generation

Input: Brain dump text

Output: Relevant context or summary

With Tavily API:

  • Searches the web for related information
  • Synthesizes findings with LLM
  • Provides citations and context

Without Tavily API (Demo Mode):

  • Generates plausible contextual information
  • Maintains consistency with the thought
  • Allows testing without API key
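
The two modes above amount to a fallback decision based on whether `TAVILY_API_KEY` is set. Here is a minimal sketch of that routing, under the assumption that the real search is available as a callable; `select_search_backend`, `mock_search`, and the `tavily_search` argument are hypothetical names, not the actual SearchAgent API.

```python
import os


def mock_search(query):
    """Demo-mode stand-in: returns a clearly labeled placeholder result."""
    return {"source": "mock", "summary": f"[demo] No web search performed for: {query}"}


def select_search_backend(env=None):
    """Return 'tavily' when TAVILY_API_KEY is set and non-empty, else 'mock'."""
    env = os.environ if env is None else env
    return "tavily" if env.get("TAVILY_API_KEY", "").strip() else "mock"


def search(query, env=None, tavily_search=None):
    """Use the real search callable when configured, otherwise the mock."""
    if select_search_backend(env) == "tavily" and tavily_search is not None:
        return tavily_search(query)
    return mock_search(query)


print(search("Why do I procrastinate on important tasks?", env={}))
```

Tagging each result with its `source` makes it easy for the UI to tell the reader whether a summary came from a real search or from demo mode.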

GenerationAgent

Role: Cross-thought synthesis

Input: Multiple related brain dumps (from same cluster)

Output: Synthesized insight combining themes

Example:

Related Dumps:
- "Neural networks learn patterns"
- "Our brains also learn from patterns"
- "What if consciousness is pattern recognition?"

Synthesis: These thoughts suggest that consciousness might emerge 
from the brain's pattern recognition capabilities...

FeedAgent

Role: Orchestrates other agents for feed generation

Process:

  1. Selects 5 most recent thoughts
  2. For each thought:
    • Gets cluster assignment
    • Calls SearchAgent for summary
    • Calls QuestionAgent for reflection questions
  3. Formats as blog-style cards
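
The process above can be sketched as a small orchestration function. This is an illustration under stated assumptions rather than the real FeedAgent: the `Dump` and `FeedCard` classes, `build_feed`, and the `summarize` / `ask` callables (standing in for SearchAgent and QuestionAgent) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Dump:
    text: str
    cluster_label: str
    created_at: datetime


@dataclass
class FeedCard:
    title: str
    cluster_label: str
    summary: str
    questions: List[str]


def build_feed(dumps, summarize, ask, limit=5):
    """Pick the most recent dumps and enrich each one into a card."""
    recent = sorted(dumps, key=lambda d: d.created_at, reverse=True)[:limit]
    return [
        FeedCard(d.text, d.cluster_label, summarize(d.text), ask(d.text))
        for d in recent
    ]


# Stub agents so the sketch runs without any API keys.
start = datetime(2024, 1, 1)
dumps = [Dump(f"thought {i}", "Demo", start + timedelta(days=i)) for i in range(7)]
cards = build_feed(
    dumps,
    summarize=lambda text: f"summary of {text}",
    ask=lambda text: [f"Why {text}?"],
)
print([card.title for card in cards])
```

Passing the agents in as callables keeps the selection and formatting logic testable without network access.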

⚙️ Configuration & Customization

Environment Variables (.env)

Create .env file in project root:

# ============ REQUIRED ============

# Neo4j Aura Connection
NEO4J_URI=neo4j+ssc://your-instance-id.databases.neo4j.io
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-super-secure-password

# Google Gemini LLM API
GOOGLE_API_KEY=AIzaSyDxxxxxxxxxxxxxxxxxxxxx

# ============ OPTIONAL ============

# Tavily Web Search (for better summaries)
TAVILY_API_KEY=tvly-xxxxxxxxxxxxxxxxx

# Perplexity API (alternative search)
PERPLEXITY_API_KEY=pplx-xxxxxxxxxxxxxxxxx
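
A startup check can catch a missing required variable before the app fails later with a less obvious error. The sketch below is one possible way to do it, assuming the variables are already present in the process environment (for example, loaded from `.env` by a library such as python-dotenv); `load_settings` is a hypothetical helper, not part of the project.

```python
import os

REQUIRED_KEYS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD", "GOOGLE_API_KEY")
OPTIONAL_KEYS = ("TAVILY_API_KEY", "PERPLEXITY_API_KEY")


def load_settings(env=None):
    """Validate required variables and return a plain settings dict."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    settings = {key: env[key] for key in REQUIRED_KEYS}
    # Optional keys become None when absent or empty.
    settings.update({key: env.get(key) or None for key in OPTIONAL_KEYS})
    return settings


example_env = {
    "NEO4J_URI": "neo4j+ssc://your-instance-id.databases.neo4j.io",
    "NEO4J_USER": "neo4j",
    "NEO4J_PASSWORD": "your-password-here",
    "GOOGLE_API_KEY": "your_google_api_key_here",
}
settings = load_settings(example_env)
print(settings["TAVILY_API_KEY"])
```

Listing every missing key in one error message saves a round of fix-and-restart for each variable.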

Optional: Streamlit Configuration

Create .streamlit/config.toml:

[theme]
primaryColor = "#4ECDC4"
backgroundColor = "#0d1117"
secondaryBackgroundColor = "#161b22"
textColor = "#c9d1d9"

[client]
showErrorDetails = true

[logger]
level = "info"

[server]
port = 8501

🔧 Development & Debugging

Running Locally

streamlit run app.py

Clearing Database

python cleanup_db.py

Neo4j Maintenance

python neo4j_maintenance.py

Viewing Logs

# Terminal shows Streamlit logs
# Check sidebar for spinner status during refresh

Common Issues

Issue: "Failed to connect to Neo4j"

  • Verify .env has correct URI, username, password
  • Check Neo4j Aura instance is running (console.neo4j.io)
  • Firewall: Neo4j needs outbound access on the Bolt port (7687)

Issue: "GOOGLE_API_KEY not found"

  • Ensure key is in .env file
  • Restart streamlit: streamlit run app.py

Issue: Clusters not showing

  • Click "Refresh Clusters" in sidebar
  • Wait for spinner to finish (10-30 seconds)
  • Needs at least 2-3 thoughts for clustering

Issue: Tavily search not working

  • Mock data is used if API key missing (expected)
  • Optional feature; not required for core functionality

📊 Performance Notes

Computational Complexity

  • Adding 1 thought: ~1 second
  • Clustering N thoughts: grows roughly between O(N) and O(N log N)
    • 10 thoughts: ~2 seconds
    • 100 thoughts: ~5-10 seconds
    • 1000 thoughts: ~30-60 seconds
  • LLM labeling: ~2-5 seconds per cluster (depends on Gemini API latency)

Storage

  • Neo4j Free Tier: ~5 GB storage
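  • Typical thought: ~500 bytes of text
  • Can store ~10 million thoughts in theory (5 GB / 500 bytes); the 384-dimensional embedding stored with each thought lowers this in practice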
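
The capacity figure above is plain division. The sketch below reproduces that arithmetic and shows how the estimate shifts once the embedding is counted. The 8 bytes per float is an assumption (64-bit floats), and real on-disk usage will differ because of database overhead, indexes, and relationships.

```python
def estimate_capacity(storage_bytes, bytes_per_thought):
    """Whole number of thoughts that fit in the given storage."""
    return storage_bytes // bytes_per_thought


STORAGE_BYTES = 5 * 10**9    # ~5 GB, as quoted above
TEXT_BYTES = 500             # typical thought text, as quoted above
EMBEDDING_BYTES = 384 * 8    # assumption: 384 dimensions stored as 64-bit floats

text_only = estimate_capacity(STORAGE_BYTES, TEXT_BYTES)
with_embedding = estimate_capacity(STORAGE_BYTES, TEXT_BYTES + EMBEDDING_BYTES)

print(f"text only:      {text_only:,}")
print(f"with embedding: {with_embedding:,}")
```

Under these assumptions the text-only estimate matches the 10 million figure, while counting the embedding brings it down to roughly 1.4 million.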

Scaling Recommendations

  • Local testing: ✅ Recommended
  • Shared team use: Consider dedicated Neo4j instance
  • Large scale (10k+ thoughts): May need performance tuning (indexing, batching)

🎓 Learning Resources

Concepts Explained

Semantic Embeddings

  • sentence-transformers model converts text to 384-dimensional vectors
  • Similar texts → similar vectors → can cluster together
  • Distance in vector space ≈ semantic similarity
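
A common way to measure how close two embedding vectors are is cosine similarity. The sketch below uses tiny hand-made 3-dimensional vectors in place of real 384-dimensional embeddings, so the numbers are purely illustrative; whether the project uses cosine similarity or another metric is not stated here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


# Toy stand-ins for embeddings (made up for illustration).
dreams = [0.9, 0.1, 0.0]
sleep = [0.8, 0.2, 0.1]
taxes = [0.0, 0.1, 0.9]

print(round(cosine_similarity(dreams, sleep), 3))
print(round(cosine_similarity(dreams, taxes), 3))
```

Vectors pointing in similar directions score close to 1, and unrelated ones score close to 0, which is what lets a clustering step group related thoughts.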

HDBSCAN Clustering

  • Density-based clustering (unlike K-means which needs K)
  • Automatically finds clusters of any shape
  • Robust to outliers (marks them as "noise")
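
HDBSCAN implementations report one integer label per input point and use -1 for points they consider noise. Here is a minimal sketch of turning such labels into groups of thoughts; the labels are written by hand to stand in for real clustering output, and `group_by_cluster` is a hypothetical helper.

```python
from collections import defaultdict

NOISE_LABEL = -1  # label HDBSCAN assigns to points that belong to no cluster


def group_by_cluster(texts, labels):
    """Split texts into {cluster_label: [texts]} plus a separate noise list."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    clusters = defaultdict(list)
    noise = []
    for text, label in zip(texts, labels):
        if label == NOISE_LABEL:
            noise.append(text)
        else:
            clusters[label].append(text)
    return dict(clusters), noise


texts = [
    "Neural networks learn patterns",
    "Our brains also learn from patterns",
    "Buy oat milk",
    "What if consciousness is pattern recognition?",
]
labels = [0, 0, -1, 0]  # hand-written stand-in for clustering output

clusters, noise = group_by_cluster(texts, labels)
print(clusters)
print(noise)
```

Keeping noise separate avoids forcing a one-off thought into a cluster label that does not describe it.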

Neo4j Graph Database

  • Stores relationships as first-class citizens
  • Fast for relationship queries (unlike SQL)
  • Perfect for knowledge graphs and recommendations

LLMs for Labeling

  • Google Gemini generates cluster labels from examples
  • LangChain chains handle prompt + LLM + parsing
  • Enables semantic understanding of cluster themes

📝 License

See LICENSE file for details.


🀝 Contributing

Found a bug or have a feature idea?

  1. Check existing issues
  2. Create new issue with clear description
  3. (Optional) Submit PR with fix

❓ FAQ

Q: Can I use this with a local Neo4j instance?
A: Yes! Update NEO4J_URI to neo4j://localhost:7687 in .env

Q: What if I don't have a Tavily API key?
A: That's fine! SearchAgent will generate mock data (demo mode)

Q: Can I export all my thoughts?
A: Use the table in Home tab or query Neo4j directly. Export feature coming soon.

Q: How often should I click "Refresh Clusters"?
A: After adding multiple new thoughts (5+), or when you want to see updated organization.

Q: Is my data secure?
A: Only you have your Neo4j password. Data is encrypted in transit and at rest on Neo4j Aura.


🚀 Roadmap

  • Export functionality (CSV, JSON, markdown)
  • Thought search & filtering
  • Collaborative mode (multiple users)
  • Email digest of weekly summaries
  • Browser extension for quick capture
  • Mobile app
  • Advanced analytics dashboard


---

## 🔧 Database Management

### Re-calculate Clusters
1. Go to Home tab
2. Click 🔄 **Refresh** button
3. Wait for spinner to complete

### Clear Everything and Start Fresh
1. Click 🗑️ **Clear All Dumps** in sidebar
2. Confirm action
3. Add new thoughts

### Neo4j Maintenance
Use the `neo4j_maintenance.py` script:
```bash
# Show database statistics
python neo4j_maintenance.py stats

# Remove duplicate brain dumps
python neo4j_maintenance.py dedup

# Clear all data (WARNING: Irreversible)
python neo4j_maintenance.py clear
```

📊 Performance Tips

Clustering

  • Fast: 3-20 dumps (< 5 seconds)
  • Moderate: 20-100 dumps (5-30 seconds)
  • Slow: 100+ dumps (> 30 seconds)
  • Use caching to avoid recomputing
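
One simple form of caching is to key each embedding by a hash of its text, so that unchanged thoughts are never re-embedded. The sketch below is an in-memory illustration only: `EmbeddingCache` is a hypothetical class, and the `embed_fn` passed in is a toy stand-in for a real sentence-transformers call.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings by a hash of the (stripped) text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # number of times the embedding function actually ran

    def get(self, text):
        key = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
            self.misses += 1
        return self._store[key]


# Toy embedding function: a real one would call the embedding model.
cache = EmbeddingCache(lambda text: [float(len(text))])

cache.get("Why do dreams fade?")
cache.get("Why do dreams fade?")   # served from the cache
cache.get("Buy oat milk")
print(cache.misses)
```

The same idea extends to persistent storage: if the embedding is already saved alongside the thought, a refresh only needs to compute vectors for thoughts that lack one.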

API Usage

  • Google Gemini: Free tier allows ~60 requests/min
  • Tavily Search: Free tier allows ~100 searches/month
  • Set environment variables to enable features

Neo4j Query Performance

  • Dump and cluster lookups are indexed for fast retrieval
  • Embedding vectors cached in graph nodes
  • Knowledge graph relationships enable fast similarity searches

🐛 Troubleshooting

Neo4j Connection Error

→ Verify your Aura instance is running: https://console.neo4j.io/
→ Check NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD in .env
→ Ensure your IP is allowlisted in Aura instance settings

"API Key not found" error

→ Make sure .env file exists with GOOGLE_API_KEY
→ Restart streamlit: streamlit run app.py

Clustering takes forever

→ You might have 100+ dumps
→ Only happens on first run or "Refresh"
→ Consider batching cluster operations for large datasets

Feed cards showing mock data

→ Your TAVILY_API_KEY isn't set
→ Add it to .env and restart

"ModuleNotFoundError" for imports

→ Missing dependencies: pip install -r requirements.txt
→ Wrong Python version? Try python3 -m pip install ...


📚 Architecture Overview

braindump_core.py
├── BrainDumpDB (Neo4j Graph Database)
├── EmbeddingEngine (sentence-transformers)
├── ClusterEngine (HDBSCAN + UMAP)
├── ClusterLabelEngine (Gemini cluster labels)
└── Visualization (Plotly)

agents.py
├── QuestionAgent (Socratic questions)
├── SearchAgent (Web search + synthesis)
├── GenerationAgent (Cross-thought synthesis)
└── FeedAgent (Feed orchestration)

app.py
├── Sidebar (Input + Navigation)
├── Home Tab (Knowledge Graph + Table)
├── Feed Tab (Blog-style cards)
└── Helper Functions (Rendering)

neo4j_maintenance.py
├── Database statistics
├── Deduplication
└── Data cleanup

🎯 Next Steps

  1. Set up Neo4j Aura instance at https://console.neo4j.io/
  2. Add your first thought via sidebar
  3. Switch between Home and Feed to see different views
  4. Check back daily to build your idea garden
  5. Monitor performance using neo4j_maintenance.py stats

💡 Tips

  • Best for: Capturing fleeting thoughts, finding patterns, exploring ideas
  • Start small: 5-10 thoughts to see clustering in action
  • Use specificity: "Why do dreams fade?" works better than "dreams"
  • Review regularly: Feed tab is great for morning/evening review
  • Share clusters: Screenshot Home tab to share your thinking

📝 Notes

  • All data is stored in the Neo4j database configured in .env (Neo4j Aura by default, or a local instance)
  • Embeddings cached in database (fast retrieval)
  • Cluster labels generated once and reused
  • Each brain dump timestamped automatically

Ready to catch your thoughts before they slip away! 🧠✨

For issues or questions, check REFACTOR_SUMMARY.md or ARCHITECTURE.md
