# CommunityPulse AI Agent — End-to-End Walkthrough

This notebook demonstrates the complete multi-modal pipeline described in the blog post:

1. **Load three data modalities** — Reddit text, Census ACS tabular data, and news headlines
2. **BERT sentence embeddings** (Week 1 model)
3. **BERTopic / LDA topic modeling** (Week 2 model)
4. **LangChain LLM agent synthesis** (Week 3 model)
5. **Qualitative validation** — case studies and validation table

All cells run in **offline mode** — no API keys required.

In [None]:
# Install dependencies (run once)
# !pip install -r ../requirements.txt

In [None]:
import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Ensure the repo root is on the path
sys.path.insert(0, os.path.abspath('..'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='whitegrid', palette='muted')
print('Setup complete')

---
## Step 1 — Load All Three Data Modalities

The `CommunityPulseDataLoader` orchestrates three specialised loaders:
- `RedditLoader` — text posts from city subreddits
- `CensusLoader` — ACS demographic/economic indicators
- `NewsLoader` — headlines with pre-computed sentiment scores
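
The loader's internals are not shown in this notebook. As a rough sketch of the orchestration pattern only, the toy classes below (all names hypothetical, not the ones in `src.data_loader`) show how one object can fan out to per-modality loaders and return a dict keyed by modality, which is the shape the following cells index into.

```python
# Hypothetical sketch: none of these classes exist in src.data_loader.
from dataclasses import dataclass, field


@dataclass
class ToyModalityLoader:
    """Stands in for a per-modality loader; holds bundled sample rows."""
    rows: list

    def load(self, communities):
        # Keep only rows for the requested communities.
        wanted = set(communities)
        return [row for row in self.rows if row["community"] in wanted]


@dataclass
class ToyOrchestrator:
    """Fans out to one loader per modality and collects the results."""
    communities: list
    loaders: dict = field(default_factory=dict)

    def load_all(self):
        return {name: loader.load(self.communities)
                for name, loader in self.loaders.items()}


toy = ToyOrchestrator(
    communities=["seattle", "denver"],
    loaders={
        "reddit": ToyModalityLoader([
            {"community": "seattle", "title": "placeholder title 1"},
            {"community": "portland", "title": "placeholder title 2"},
        ]),
        "news": ToyModalityLoader([
            {"community": "denver", "headline": "placeholder headline"},
        ]),
    },
)
toy_data = toy.load_all()
print({name: len(rows) for name, rows in toy_data.items()})
```

The real loader returns DataFrames rather than lists of dicts; the sketch only illustrates the fan-out and the keyed result.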

In [None]:
from src.data_loader import CommunityPulseDataLoader

loader = CommunityPulseDataLoader(
    communities=['seattle', 'portland', 'denver'],
    offline=True,  # Uses bundled sample CSVs — no API keys needed
)
data = loader.load_all()

print(f"Reddit posts loaded:    {len(data['reddit'])}")
print(f"Census rows loaded:     {len(data['census'])}")
print(f"News headlines loaded:  {len(data['news'])}")

In [None]:
# Preview Reddit data
data['reddit'][['community', 'title', 'score']].head(6)

In [None]:
# Preview Census data
data['census'][[
    'community', 'median_household_income', 'median_rent',
    'poverty_rate', 'unemployment_rate', 'bachelor_degree_pct'
]]

In [None]:
# Preview News data
data['news'][['community', 'headline', 'sentiment_score']].head(6)

---
## Figure A — Demographic Comparison (Census Data)

A grouped bar chart comparing key Census indicators across the three cities.
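
The grouped layout comes down to offset arithmetic: bar `i` of group `g` is centred at `g + i * width`, and the tick label belongs at the group centre, `g + (n_bars - 1) * width / 2`. With three bars per group that centre is `g + width`, which is why the plotting cell calls `ax.set_xticks(x + width)`. A small sketch of the arithmetic, with `bar_positions` as a made-up helper name:

```python
def bar_positions(n_groups, n_bars, width):
    """Return (bar centres per series, tick centre per group)."""
    centres = [[g + i * width for g in range(n_groups)] for i in range(n_bars)]
    ticks = [g + (n_bars - 1) * width / 2 for g in range(n_groups)]
    return centres, ticks


centres, ticks = bar_positions(n_groups=3, n_bars=3, width=0.25)
print(centres[0])  # first series:  [0.0, 1.0, 2.0]
print(centres[2])  # third series:  [0.5, 1.5, 2.5]
print(ticks)       # group centres: [0.25, 1.25, 2.25]
```

If the number of communities changes, the tick offset needs the general formula rather than a fixed `x + width`.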

In [None]:
census = data['census']
metrics = ['poverty_rate', 'unemployment_rate', 'bachelor_degree_pct']
labels  = ['Poverty Rate (%)', 'Unemployment (%)', "Bachelor's Degree+ (%)"]
communities = census['community'].tolist()

x = np.arange(len(metrics))
width = 0.25
colors = ['#4C72B0', '#DD8452', '#55A868']

fig, ax = plt.subplots(figsize=(10, 5))
for i, (community, color) in enumerate(zip(communities, colors)):
    row = census[census['community'] == community].iloc[0]
    values = [float(row[m]) for m in metrics]
    ax.bar(x + i * width, values, width, label=community, color=color, alpha=0.85)

ax.set_xticks(x + width)
ax.set_xticklabels(labels, fontsize=11)
ax.set_title('Demographic Comparison Across Communities (Census ACS)', fontsize=13)
ax.set_ylabel('Percentage')
ax.legend()
plt.tight_layout()
plt.savefig('../data/sample/fig_demographics.png', dpi=150, bbox_inches='tight')
plt.show()

---
## Figure B — News Sentiment Distribution

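Left: a box plot of raw sentiment scores per community. Right: a stacked bar chart of headline counts per sentiment category, where scores above 0.2 count as positive, scores below -0.2 as negative, and everything in between (including exactly ±0.2) as neutral.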

In [None]:
news = data['news'].copy()

def categorise(s):
    if s > 0.2: return 'positive'
    if s < -0.2: return 'negative'
    return 'neutral'

news['category'] = news['sentiment_score'].apply(categorise)

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Box plot
sns.boxplot(data=news, x='community', y='sentiment_score', palette='coolwarm', ax=axes[0])
axes[0].axhline(0, color='grey', linestyle='--', linewidth=0.8)
axes[0].set_title('News Sentiment Score by Community', fontsize=12)
axes[0].set_xlabel('Community')
axes[0].set_ylabel('Sentiment Score')

# Stacked bar
counts = news.groupby(['community', 'category']).size().unstack(fill_value=0)
counts = counts.reindex(columns=['positive', 'neutral', 'negative'], fill_value=0)
counts.plot(kind='bar', stacked=True, color=['#55A868', '#C0C0C0', '#C44E52'],
            ax=axes[1], edgecolor='white')
axes[1].set_title('Headlines by Sentiment Category', fontsize=12)
axes[1].set_xlabel('Community')
axes[1].set_ylabel('Number of Headlines')
axes[1].tick_params(axis='x', rotation=0)

plt.suptitle('Figure B — News Sentiment Analysis', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('../data/sample/fig_sentiment.png', dpi=150, bbox_inches='tight')
plt.show()

---
## Step 2 — Week 1 Model: BERT Sentence Embeddings

We use `all-MiniLM-L6-v2` to encode each Reddit post title into a 384-dimensional vector.
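
Sentence-embedding models of this family typically turn per-token vectors into one sentence vector by mean pooling over the non-padding tokens and then L2-normalising the result. `BERTEmbedder`'s internals are not shown here, so the following is only a toy sketch of that pooling step on made-up 3-dimensional token vectors; `mean_pool` is a hypothetical helper, not part of the project.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real (mask == 1) tokens, then L2-normalise."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    mean = [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


tokens = [
    [1.0, 0.0, 2.0],  # token 1
    [3.0, 0.0, 2.0],  # token 2
    [9.0, 9.0, 9.0],  # padding, ignored via the mask
]
sentence_vec = mean_pool(tokens, attention_mask=[1, 1, 0])
print(sentence_vec)
```

Because the output has unit length, the cosine similarity between two such vectors reduces to their dot product.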

In [None]:
from src.models import BERTEmbedder

embedder = BERTEmbedder()  # Loads all-MiniLM-L6-v2 (or uses offline fallback)

# Embed all Reddit posts
data['reddit'] = embedder.embed_dataframe(data['reddit'], text_col='title')

print(f"Embedding shape (first post): {data['reddit']['embedding'].iloc[0].shape}")
print(f"Sample embedding (first 5 dims): {data['reddit']['embedding'].iloc[0][:5]}")

In [None]:
# Compute community centroid cosine similarities
communities = data['reddit']['community'].unique()
centroids = {
    c: np.mean(np.vstack(data['reddit'][data['reddit']['community'] == c]['embedding'].tolist()), axis=0)
    for c in communities
}

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("Discourse similarity (cosine) between community centroids:")
for i, c1 in enumerate(communities):
    for c2 in list(communities)[i+1:]:
        sim = cosine_sim(centroids[c1], centroids[c2])
        print(f"  {c1} ↔ {c2}: {sim:.4f}")

---
## Figure C — BERT Embedding Clusters (UMAP 2-D Projection)

We reduce 384-d embeddings to 2-D using UMAP and colour-code by community.

In [None]:
try:
    import umap
    embeddings = np.vstack(data['reddit']['embedding'].tolist())
    reducer = umap.UMAP(n_components=2, random_state=42, n_neighbors=5, min_dist=0.3)
    reduced = reducer.fit_transform(embeddings)

    fig, ax = plt.subplots(figsize=(9, 7))
    cmap = plt.get_cmap('tab10')
    for i, community in enumerate(data['reddit']['community'].unique()):
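        # Convert to a plain boolean array so NumPy indexing does not depend on the DataFrame index
        mask = (data['reddit']['community'] == community).to_numpy()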
        ax.scatter(reduced[mask, 0], reduced[mask, 1], label=community,
                   s=100, alpha=0.85, color=cmap(i), edgecolors='white', linewidths=0.5)
    ax.set_title('Figure C — BERT Embedding Clusters (UMAP 2-D)', fontsize=13)
    ax.set_xlabel('UMAP Dim 1')
    ax.set_ylabel('UMAP Dim 2')
    ax.legend()
    plt.tight_layout()
    plt.savefig('../data/sample/fig_umap.png', dpi=150, bbox_inches='tight')
    plt.show()
except ImportError:
    print('umap-learn not installed — skipping UMAP plot. Install with: pip install umap-learn')

---
## Step 3 — Week 2 Model: Topic Modeling (BERTopic / LDA)

We fit a topic model on the Reddit corpus, with each post's title and body text concatenated into one document. BERTopic is tried first; LDA is the fallback.
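
How `TopicModeler` chooses its backend is not shown in this notebook. One common way to implement a "try A, fall back to B" policy is to check which optional packages are importable, in order of preference. The sketch below assumes that approach; `pick_backend` and the backend labels are hypothetical.

```python
import importlib.util


def pick_backend(candidates):
    """Return the label of the first candidate whose module is importable, else None."""
    for label, module_name in candidates:
        if importlib.util.find_spec(module_name) is not None:
            return label
    return None


# Preference order: BERTopic first, then scikit-learn's LDA implementation.
backend = pick_backend([("bertopic", "bertopic"), ("lda", "sklearn")])
print("Selected backend:", backend)
```

A real fallback might also catch failures at fit time, for example when the corpus is too small for clustering, rather than only checking imports.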

In [None]:
from src.models import TopicModeler

topic_modeler = TopicModeler(n_topics=8, min_topic_size=2)

docs = (
    data['reddit']['title'].fillna('') + ' ' +
    data['reddit']['selftext'].fillna('')
).tolist()

topic_modeler.fit(docs)

# Assign topics to each post
data['reddit']['topic'] = topic_modeler.assign_topics(docs)

print('Topic model backend:', topic_modeler._backend)
print()
print(topic_modeler.topic_summary_table().to_string(index=False))

---
## Figure D — Topic Distribution per Community

In [None]:
from src.analysis import plot_topic_distribution
plot_topic_distribution(data['reddit'], topic_modeler, save_path='../data/sample/fig_topics.png')

---
## Step 4 — Week 3 Model: LangChain Agent Synthesis

The LangChain agent wraps the upstream models as tools and answers natural-language queries.
Without an `OPENAI_API_KEY`, it uses a deterministic offline mode that assembles answers from tool outputs.
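
To make "deterministic offline mode" concrete, here is a minimal sketch of the general idea: call every tool, then format the results with a fixed template and a fixed ordering so the same inputs always give the same text. The tools, city names, and values are invented placeholders, and this is not the code in `CommunityPulseAgent`.

```python
# Hypothetical sketch: toy tools with placeholder values, no LLM involved.
def toy_sentiment(city):
    scores = {"alpha": 0.10, "beta": -0.30}
    return scores[city.lower()]


def toy_top_topic(city):
    topics = {"alpha": "topic_1", "beta": "topic_2"}
    return topics[city.lower()]


TOOLS = {"sentiment": toy_sentiment, "top_topic": toy_top_topic}


def offline_answer(cities):
    """Call every tool for every city and format the results in a fixed order."""
    lines = []
    for city in sorted(cities):
        evidence = {name: tool(city) for name, tool in sorted(TOOLS.items())}
        lines.append(f"{city}: sentiment={evidence['sentiment']:+.2f}, "
                     f"top topic={evidence['top_topic']}")
    return "\n".join(lines)


print(offline_answer(["Beta", "Alpha"]))
```

Sorting both the cities and the tool names is what keeps the output independent of input order.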

In [None]:
from src.models import CommunityPulseAgent
import os

agent = CommunityPulseAgent(
    embedder=embedder,
    topic_modeler=topic_modeler,
    data=data,
    openai_api_key=os.getenv('OPENAI_API_KEY', ''),  # empty = offline mode
)

result = agent.run(
    "Compare Seattle, Portland, and Denver: which city faces the most acute social challenges "
    "based on news sentiment, Reddit discussion topics, and demographic data?"
)
print(result)

In [None]:
# Individual tool calls — inspect the raw evidence
for city in ['Seattle', 'Portland', 'Denver']:
    print(f"\n{'='*50}")
    print(f"  {city}")
    print('='*50)
    print('Sentiment:', agent.get_community_sentiment(city))
    print('Topics:   ', agent.get_top_topics(city))
    print('Demographics:', agent.get_demographic_profile(city)['profile'])

---
## Step 5 — Qualitative Validation

`CommunityPulseAnalyser` joins all three modalities into a scorecard and generates narrative case studies.
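
Conceptually, a scorecard like this aggregates each modality per community and then joins the aggregates on the community key. The sketch below shows that shape with plain Python and made-up rows; the field names are illustrative and `build_scorecard` is hypothetical, not the analyser's actual implementation.

```python
from statistics import mean

# Made-up rows; the field names are illustrative, not the project's schema.
news = [
    {"community": "a", "sentiment_score": 0.4},
    {"community": "a", "sentiment_score": -0.2},
    {"community": "b", "sentiment_score": -0.5},
]
reddit = [
    {"community": "a", "score": 10},
    {"community": "b", "score": 30},
    {"community": "b", "score": 50},
]
census = {"a": {"poverty_rate": 9.5}, "b": {"poverty_rate": 12.0}}


def build_scorecard(news, reddit, census):
    """Aggregate each modality per community, then join on the community key."""
    rows = []
    for community in sorted(census):
        rows.append({
            "community": community,
            "mean_news_sentiment": mean(
                r["sentiment_score"] for r in news if r["community"] == community),
            "avg_post_score": mean(
                r["score"] for r in reddit if r["community"] == community),
            **census[community],
        })
    return rows


for row in build_scorecard(news, reddit, census):
    print(row)
```

With DataFrames the same result would come from a `groupby` per modality followed by merges on `community`.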

In [None]:
from src.analysis import CommunityPulseAnalyser

analyser = CommunityPulseAnalyser(data=data, topic_modeler=topic_modeler)

# Numeric scorecard
scorecard = analyser.community_summary_table()
display(scorecard)

In [None]:
# Qualitative case studies
for city in ['Seattle', 'Portland', 'Denver']:
    print(f"\n{'='*60}")
    print(f"  CASE STUDY: {city}")
    print('='*60)
    print(analyser.generate_case_study(city))

---
## Figure E — Multi-Modal Scorecard Heatmap
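
The heatmap cell rescales each metric column with min-max normalisation so that metrics on different scales share one colour range. The small `1e-9` term in the denominator guards against division by zero when a column is constant. A worked example in plain Python, where `min_max` is an illustrative helper rather than part of the project:

```python
def min_max(values, eps=1e-9):
    """Rescale values to [0, 1]; eps keeps a constant column from dividing by zero."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]


print([round(v, 3) for v in min_max([10.0, 15.0, 20.0])])  # [0.0, 0.5, 1.0]
print(min_max([7.0, 7.0, 7.0]))                            # [0.0, 0.0, 0.0]
```

With only three communities, every non-constant column will contain exactly one 0 and one 1, so the colours show rank within this sample rather than absolute differences.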

In [None]:
# Normalise key metrics for side-by-side heatmap
scorecard_plot = scorecard.set_index('community')[[
    'mean_news_sentiment', 'poverty_rate_pct', 'unemployment_rate_pct',
    'bachelor_degree_pct', 'avg_post_score'
]].copy()

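# Normalise each column to [0, 1]
scorecard_norm = (scorecard_plot - scorecard_plot.min()) / (scorecard_plot.max() - scorecard_plot.min() + 1e-9)

# Poverty and unemployment are lower-is-better, so flip them to match the colour bar (0=worst, 1=best)
lower_is_better = ['poverty_rate_pct', 'unemployment_rate_pct']
scorecard_norm[lower_is_better] = 1 - scorecard_norm[lower_is_better]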

fig, ax = plt.subplots(figsize=(11, 4))
sns.heatmap(
    scorecard_norm.T,
    annot=scorecard_plot.T.round(2),
    fmt='g',
    cmap='RdYlGn',
    linewidths=0.5,
    ax=ax,
    cbar_kws={'label': 'Normalised score (0=worst, 1=best)'}
)
ax.set_title('Figure E — Multi-Modal Community Scorecard Heatmap', fontsize=13)
ax.set_xlabel('Community')
ax.set_yticklabels([
    'News Sentiment', 'Poverty Rate', 'Unemployment',
    "Bachelor's Degree", 'Reddit Engagement'
], rotation=0)
plt.tight_layout()
plt.savefig('../data/sample/fig_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()

---
## Summary

This notebook has demonstrated the complete CommunityPulse pipeline:

| Step | Component | Model Family |
|---|---|---|
| 1 | `CommunityPulseDataLoader` | — (data ingestion) |
| 2 | `BERTEmbedder` | Week 1: Transformer NLP |
| 3 | `TopicModeler` | Week 2: Topic Modeling (BERTopic, LDA fallback) |
| 4 | `CommunityPulseAgent` | Week 3: LangChain Tool-Calling Agent |
| 5 | `CommunityPulseAnalyser` | Qualitative Validation |

**Key findings:**
- Portland shows the most consistently negative signals across all three modalities
- Seattle shows a prosperity paradox: high income alongside high housing anxiety on Reddit
- Denver's challenges are less visible in news sentiment but clear in topic modeling and Census data

See `blog_post.md` for the full write-up.