# Waywo - Getting Started

This notebook demonstrates how to interact with the Waywo project's core components:

1. **Database Access** - Query posts, comments, and projects using SQLAlchemy
2. **Embedding Client** - Generate embeddings for text
3. **Semantic Search** - Find similar projects
4. **LLM Client** - Make LLM calls using the configured endpoint
5. **Workflow Testing** - Run workflow components interactively
6. **RAG Chatbot** - Answer questions using project data, with reranked retrieval
7. **Raw SQL Queries** - Run ad hoc analysis directly against the database

## Setup

First, let's set up the path and import our modules.

In [3]:
import sys
sys.path.insert(0, '/app')

# Enable async support in Jupyter
import nest_asyncio
nest_asyncio.apply()
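
The cells in this notebook drive coroutines with `asyncio.get_event_loop().run_until_complete(...)`, which relies on `nest_asyncio` to re-enter the loop Jupyter is already running. As a rough alternative, the sketch below runs a coroutine on a separate thread when a loop is already active. `run_sync` is a hypothetical helper for illustration, not part of the Waywo codebase.

```python
import asyncio
import concurrent.futures


def run_sync(coro):
    """Run a coroutine to completion from synchronous code (hypothetical helper)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running (e.g. inside Jupyter): give the coroutine
    # its own event loop on a worker thread and wait for the result.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def add(a, b):
    await asyncio.sleep(0)
    return a + b


total = run_sync(add(2, 3))
print(total)
```

In recent IPython and Jupyter versions, a cell can also `await` a coroutine directly at the top level, which avoids the helper entirely.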

## 1. Database Access

The project uses SQLite with SQLAlchemy. Let's explore the database.

In [4]:
from src.db.database import SessionLocal
from src.db.models import WaywoPostDB, WaywoCommentDB, WaywoProjectDB
from src.db import client as db_client

In [5]:
# Get database statistics
stats = db_client.get_database_stats()
print("Database Statistics:")
print(f"  Posts: {stats['posts_count']}")
print(f"  Comments: {stats['comments_count']}")
print(f"  Projects: {stats['projects_count']}")
print(f"  Processed comments: {stats['processed_comments_count']}")
print(f"  Valid projects: {stats['valid_projects_count']}")
print(f"  Projects with embeddings: {stats['projects_with_embeddings_count']}")

Database Statistics:
  Posts: 39
  Comments: 8708
  Projects: 444
  Processed comments: 385
  Valid projects: 443
  Projects with embeddings: 443


In [6]:
# Query the first five posts returned by get_all_post_ids()
post_ids = db_client.get_all_post_ids()[:5]
print(f"Found {len(post_ids)} posts:\n")
for post_id in post_ids:
    post = db_client.get_post(post_id)
    if post:
        comment_count = db_client.get_comment_count_for_post(post.id)
        print(f"  [{post.id}] {post.title} - {comment_count} comments")

Found 5 posts:

  [30516198] Ask HN: What are you working on? (March 2022) - 1 comments
  [30883352] Ask HN: What Are You Up To? (April 2022) - 24 comments
  [31236618] Ask HN: What Are You Working On? (May 2022) - 10 comments
  [31949485] Ask HN: What Are You Working On? (July 2022) - 2 comments
  [32309210] Ask HN: What Are You Working On? (August 2022) - 41 comments


In [7]:
# Query comments (with optional post_id filter)
comments = db_client.get_all_comments(limit=5)
print(f"Found {len(comments)} comments:\n")
for comment in comments:
    text_preview = (comment.text[:100] + "...") if len(comment.text) > 100 else comment.text
    print(f"  [{comment.id}] by {comment.by}: {text_preview}\n")

Found 5 comments:

  [46378010] by Colin_S: I&#x27;m working on Lumi AI (<a href="https:&#x2F;&#x2F;apps.shopify.com&#x2F;lumi-ai-seo-alt-text" ...

  [46374571] by flavioaiello: <a href="https:&#x2F;&#x2F;github.com&#x2F;magikrun" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2...

  [46370640] by stevenicr: working on a hybrid done with you quick website launches ( <a href="https:&#x2F;&#x2F;betterwebgroup...

  [46333890] by bitstrategist: heyy i am rachel Joseph , i am bit strategist 
A seasoned blockchain strategist with a decade of exp...

  [46326590] by tungnt620: <a href="https:&#x2F;&#x2F;pocketbasecloud.com&#x2F;" rel="nofollow">https:&#x2F;&#x2F;pocketbaseclo...
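
The previews above still contain HTML entities (`&#x27;`, `&#x2F;`) and anchor tags, because the comment text is stored as it arrived. A minimal sketch of turning such text into plain text with the standard library follows. `clean_comment_text` is a hypothetical helper, and treating `<p>` as a paragraph break is an assumption about the markup.

```python
import html
import re

# Matches any HTML tag, e.g. <a href="..."> or </a>.
_TAG_RE = re.compile(r"<[^>]+>")


def clean_comment_text(raw: str) -> str:
    """Strip tags and decode entities in a stored comment (hypothetical helper)."""
    # Turn paragraph tags into blank lines before removing the remaining tags.
    text = re.sub(r"<p>", "\n\n", raw, flags=re.IGNORECASE)
    text = _TAG_RE.sub("", text)
    # Decode entities last so an escaped "<" is never mistaken for a tag.
    return html.unescape(text).strip()


sample = (
    'I&#x27;m working on <a href="https:&#x2F;&#x2F;example.com" '
    'rel="nofollow">https:&#x2F;&#x2F;example.com</a><p>More soon'
)
cleaned = clean_comment_text(sample)
print(cleaned)
```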



In [8]:
# Query projects
projects = db_client.get_all_projects(limit=5)
print(f"Found {len(projects)} projects:\n")
for project in projects:
    print(f"  [{project.id}] {project.title}")
    print(f"      Scores: idea={project.idea_score}, complexity={project.complexity_score}")
    print(f"      Tags: {', '.join(project.hashtags) if project.hashtags else 'None'}")
    print()

Found 5 projects:

  [451] Polingo Interactive Language Learning
      Scores: idea=8, complexity=7
      Tags: language, learning, ai, education, app

  [450] Tinqer: TypeScript LINQ-to-SQL Port
      Scores: idea=6, complexity=7
      Tags: typescript, linq, database, open-source, query

  [449] Decentralized Global Social Feed
      Scores: idea=8, complexity=9
      Tags: web3, decentralized, social, blockchain, p2p

  [448] Credit Card Rewards Optimizer
      Scores: idea=8, complexity=6
      Tags: fintech, rewards, optimization, personal finance

  [447] DB Pro Desktop Database Workbench
      Scores: idea=8, complexity=7
      Tags: database, electron, react, ai, productivity



## 2. Embedding Client

The embedding client connects to an external embedding service to generate vector embeddings for text.

In [9]:
from src.clients.embedding import (
    get_single_embedding, 
    get_embeddings, 
    check_embedding_service_health,
    DEFAULT_EMBEDDING_URL
)
import asyncio

# Check health
async def check_health():
    return await check_embedding_service_health()

is_healthy = asyncio.get_event_loop().run_until_complete(check_health())
print(f"Embedding service URL: {DEFAULT_EMBEDDING_URL}")
print(f"Embedding service healthy: {is_healthy}")

Embedding service URL: http://192.168.5.96:8000
Embedding service healthy: True


In [10]:
# Generate an embedding for sample text
sample_text = "A machine learning tool for analyzing code repositories"

async def generate_embedding():
    try:
        return await get_single_embedding(sample_text)
    except Exception as e:
        print(f"Error: {e}")
        return None

embedding = asyncio.get_event_loop().run_until_complete(generate_embedding())

if embedding:
    print(f"Generated embedding for: '{sample_text}'")
    print(f"Embedding dimension: {len(embedding)}")
    print(f"First 10 values: {embedding[:10]}")
else:
    print("Failed to generate embedding (service may be unavailable)")

Generated embedding for: 'A machine learning tool for analyzing code repositories'
Embedding dimension: 4096
First 10 values: [0.00121307373046875, -0.0030517578125, 0.01434326171875, -0.001983642578125, -0.00084686279296875, -0.000885009765625, 0.0169677734375, -0.004364013671875, 0.0277099609375, -0.00848388671875]


## 3. Semantic Search

Use vector embeddings to find semantically similar projects.

In [11]:
# Check embedding coverage
total_projects = db_client.get_total_project_count()
projects_with_embeddings = db_client.get_projects_with_embeddings_count()

print(f"Total projects: {total_projects}")
print(f"Projects with embeddings: {projects_with_embeddings}")
if total_projects > 0:
    print(f"Coverage: {projects_with_embeddings / total_projects * 100:.1f}%")

Total projects: 444
Projects with embeddings: 443
Coverage: 99.8%


In [12]:
# Perform semantic search
query = "AI tools for developers"

async def get_query_embedding():
    try:
        return await get_single_embedding(query)
    except Exception as e:
        print(f"Error getting embedding: {e}")
        return None

query_embedding = asyncio.get_event_loop().run_until_complete(get_query_embedding())

if query_embedding:
    results = db_client.semantic_search(query_embedding, limit=5)
    print(f"Semantic search results for: '{query}'\n")
    for project, similarity in results:
        print(f"  [{similarity:.3f}] {project.title}")
        print(f"           {project.short_description}\n")
else:
    print("Could not generate query embedding")

Semantic search results for: 'AI tools for developers'

  [0.844] AI Hacker Builder
           Autonomous AI that generates secure code

  [0.815] Non-LLM AI Modeling Paradigm
           AI approach focusing on modeling rather than scaling text prediction

  [0.812] AI Study Buddy
           Peer-level AI study companion for high school students

  [0.811] Personal AI Photography Hub
           AI-driven site for automatic photo tagging and color discovery

  [0.810] AI Human Interaction Research Robot
           A robot studying AI-human interaction dynamics
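
The bracketed numbers above are similarity scores between the query embedding and each project embedding. How `db_client.semantic_search` computes them is not shown in this notebook; the sketch below assumes cosine similarity with a brute-force scan, which is one common approach. The function names and the toy three-dimensional vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction, so treat it as unrelated.
    return dot / (norm_a * norm_b)


def top_k(query_vec, items, k=5):
    """Return the k (name, similarity) pairs most similar to query_vec."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: (name, embedding) pairs standing in for stored projects.
toy_index = [
    ("code assistant", [0.9, 0.1, 0.0]),
    ("recipe app", [0.0, 0.2, 0.9]),
    ("ml framework", [0.7, 0.6, 0.1]),
]
search_results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
for name, similarity in search_results:
    print(f"[{similarity:.3f}] {name}")
```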



## 3.5 Rerank Service

The rerank service uses a cross-encoder model to improve retrieval quality by reranking candidates based on query-document relevance.

In [13]:
from src.clients.rerank import (
    rerank_documents,
    check_rerank_service_health,
    DEFAULT_RERANK_URL
)

# Check rerank service health
async def check_rerank_health():
    return await check_rerank_service_health()

is_healthy = asyncio.get_event_loop().run_until_complete(check_rerank_health())
print(f"Rerank service URL: {DEFAULT_RERANK_URL}")
print(f"Rerank service healthy: {is_healthy}")

Exception in callback Task.__step()
handle: <Handle Task.__step()>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
RuntimeError: cannot enter context: <_contextvars.Context object at 0xffffb1dc1940> is already entered
Rerank service health check failed: All connection attempts failed


Rerank service URL: http://192.168.5.173:8111
Rerank service healthy: False


In [None]:
# Test reranking with sample documents
query = "AI tools for developers"
documents = [
    "A machine learning framework for building neural networks",
    "A recipe app for cooking enthusiasts",
    "An AI-powered code review tool that suggests improvements",
    "A weather forecasting service for farmers",
    "A developer productivity tool using LLMs for code generation"
]

async def test_rerank():
    try:
        result = await rerank_documents(query, documents)
        return result
    except Exception as e:
        print(f"Error: {e}")
        return None

result = asyncio.get_event_loop().run_until_complete(test_rerank())

if result:
    print(f"Query: '{query}'\n")
    print("Reranked documents (by relevance):")
    for i, idx in enumerate(result.ranked_indices):
        score = result.scores[idx]
        doc = (documents[idx][:60] + "...") if len(documents[idx]) > 60 else documents[idx]
        print(f"  {i+1}. [{score:+.2f}] {doc}")
else:
    print("Reranking failed (service may be unavailable)")
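
The cell above reads `result.ranked_indices` and `result.scores`, with scores indexed by original document position. The sketch below shows one way those two fields can relate. `RerankResult` is a hypothetical stand-in rather than the class returned by `rerank_documents`, and the word-overlap scorer is a toy replacement for a cross-encoder.

```python
from dataclasses import dataclass


@dataclass
class RerankResult:
    """Hypothetical result shape: scores by original index, plus a ranking."""
    scores: list[float]
    ranked_indices: list[int]


def overlap_score(query: str, document: str) -> float:
    """Toy relevance score: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


def toy_rerank(query, documents):
    scores = [overlap_score(query, doc) for doc in documents]
    # Indices of the documents, best score first; ties keep their input order.
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return RerankResult(scores=scores, ranked_indices=ranked)


docs = [
    "A recipe app for cooking enthusiasts",
    "AI tools that help developers review code",
    "A weather service for farmers",
]
toy_result = toy_rerank("AI tools for developers", docs)
for position, idx in enumerate(toy_result.ranked_indices, start=1):
    print(f"{position}. [{toy_result.scores[idx]:.2f}] {docs[idx]}")
```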

## 4. LLM Client

The project uses an OpenAI-compatible LLM endpoint (Nemotron).

In [77]:
from src.llm_config import get_llm

llm = get_llm()
print("LLM configured:")
print(f"  Model: {llm.model}")
print(f"  Base URL: {llm.api_base}")

LLM configured:
  Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
  Base URL: http://192.168.6.19:8002/v1


In [None]:
import asyncio

# Make a simple LLM call
async def test_llm():
    prompt = "What are three key features of a good developer tool? Be brief."
    response = await llm.acomplete(prompt)
    return response.text

result = asyncio.get_event_loop().run_until_complete(test_llm())
print("LLM Response:")
print(result)

## 5. Workflow Testing

Test the LlamaIndex workflow components interactively.

In [78]:
from src.workflows.waywo_project_workflow import WaywoProjectWorkflow

# Create workflow instance
workflow = WaywoProjectWorkflow(timeout=300)

In [79]:
# Test with a sample comment
sample_comment = """
I've been building a CLI tool called 'codestat' that analyzes your git repositories 
and generates insights about your coding patterns. It tracks things like:
- Most active times of day
- Language usage over time
- Commit frequency patterns

Check it out: https://github.com/example/codestat
"""

async def run_workflow():
    result = await workflow.run(comment_text=sample_comment, comment_id=99999)
    return result

# Note: This will make multiple LLM calls and may take a minute
# Uncomment the following lines to run:
# result = asyncio.get_event_loop().run_until_complete(run_workflow())
# print(f"Workflow completed. Projects extracted: {len(result) if result else 0}")

## 6. RAG Chatbot (with Reranking)

Test the RAG chatbot workflow, which answers questions using project data. The chatbot combines semantic search with reranking to improve retrieval quality.
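
In a retrieval-augmented setup, the retrieved projects are usually formatted into a context block that is placed in the LLM prompt. How `WaywoChatbotWorkflow` does this is not shown here. The sketch below is one plausible shape, with hypothetical function names and prompt wording, using project titles and descriptions from the semantic search results earlier in this notebook.

```python
def build_context(projects, max_chars=2000):
    """Format (title, description, similarity) tuples as a numbered context block."""
    lines = []
    used = 0
    for i, (title, description, similarity) in enumerate(projects, start=1):
        entry = f"{i}. {title} (similarity {similarity:.2f}): {description}"
        if used + len(entry) > max_chars:
            break  # Stop before the context grows past the character budget.
        lines.append(entry)
        used += len(entry)
    return "\n".join(lines)


def build_prompt(question, projects):
    """Combine the retrieved context and the user question into one prompt."""
    context = build_context(projects)
    return (
        "Answer the question using only the projects listed below.\n\n"
        f"Projects:\n{context}\n\n"
        f"Question: {question}"
    )


retrieved = [
    ("AI Hacker Builder", "Autonomous AI that generates secure code", 0.844),
    ("AI Study Buddy", "Peer-level AI study companion for high school students", 0.812),
]
prompt = build_prompt("What AI projects are people working on?", retrieved)
print(prompt)
```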

In [80]:
from src.workflows.waywo_chatbot_workflow import WaywoChatbotWorkflow

chatbot = WaywoChatbotWorkflow(top_k=5)

Embeddings have been explicitly disabled. Using MockEmbedding.


In [81]:
# Example query
query = "What AI or machine learning projects are people working on?"

async def run_chat():
    result = await chatbot.chat(query)
    return result

response = asyncio.get_event_loop().run_until_complete(run_chat())

print(f"Query: {query}\n")
print(f"Response:\n{response.response}")
print(f"\nSource projects: {len(response.source_projects)}")

Query: What AI or machine learning projects are people working on?

Response:

Here are some of the AI‑ and ML‑focused projects that have been shared in the recent “What are you working on?” threads on Hacker News:

| # | Project (as posted) | Relevance | Idea Score | Complexity | Brief description | Tags |
|---|----------------------|-----------|------------|------------|-------------------|------|
| 1 | **Non‑LLM AI Modeling Paradigm** | 81 % | 7/10 | 6/10 | A new way to think about AI that treats it as a **modeling problem** rather than simply scaling up text‑prediction models. It explores alternative architectures and training strategies that could be more efficient or specialized. | #ai, #modeling, #ml, #paradigm |
| 2 | **AI Hacker Builder** | 79 % | 5/10 | 6/10 | An **autonomous AI system** that generates secure code, aiming to close the gap between the high demand for secure software and the limited supply of developers who can write it safely. | #ai, #security, #cybersecurity,

## 7. Raw SQL Queries

For more complex analysis, you can run raw SQL queries directly.

In [None]:
from sqlalchemy import text
from src.db.database import SessionLocal

with SessionLocal() as session:
    # Example: Get the most frequent hashtag combinations
    # (this groups by the whole stored hashtag list, not by individual tag)
    result = session.execute(text("""
        SELECT hashtags, COUNT(*) as count 
        FROM waywo_projects 
        WHERE hashtags IS NOT NULL AND hashtags != '[]'
        GROUP BY hashtags 
        ORDER BY count DESC 
        LIMIT 10
    """))
    
    print("Top hashtag combinations:")
    for row in result:
        print(f"  {row.hashtags}: {row.count}")
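
The query above counts identical hashtag lists. To count individual tags, the lists can be decoded in Python. The sketch below assumes the `hashtags` column holds a JSON-encoded list (suggested by the `!= '[]'` filter, but not confirmed here) and uses an in-memory SQLite table with made-up rows in place of the real database.

```python
import json
import sqlite3
from collections import Counter


def count_hashtags(rows):
    """Count individual tags across rows of JSON-encoded hashtag lists."""
    counts = Counter()
    for (raw,) in rows:
        if not raw:
            continue  # Skip NULL or empty values.
        counts.update(json.loads(raw))
    return counts


# Toy stand-in for the waywo_projects table (assumed schema, made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE waywo_projects (id INTEGER PRIMARY KEY, hashtags TEXT)")
conn.executemany(
    "INSERT INTO waywo_projects (hashtags) VALUES (?)",
    [('["ai", "database"]',), ('["ai", "education"]',), ("[]",), (None,)],
)

rows = conn.execute("SELECT hashtags FROM waywo_projects").fetchall()
hashtag_counts = count_hashtags(rows)
print(hashtag_counts.most_common(3))
conn.close()
```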

In [83]:
with SessionLocal() as session:
    # Example: Distribution of idea scores
    result = session.execute(text("""
        SELECT idea_score, COUNT(*) as count 
        FROM waywo_projects 
        WHERE idea_score IS NOT NULL
        GROUP BY idea_score 
        ORDER BY idea_score
    """))
    
    print("Idea score distribution:")
    for row in result:
        bar = "#" * row.count
        print(f"  {row.idea_score}: {bar} ({row.count})")

Idea score distribution:
  1: ## (2)
  5: ### (3)
  6: ####################### (23)
  7: ############ (12)
  8: ##### (5)


In [86]:
# Count "dead" comments
with SessionLocal() as session:
    total_comments = session.execute(text("SELECT COUNT(*) FROM waywo_comments")).scalar()
    dead_comments = session.execute(text("SELECT COUNT(*) FROM waywo_comments WHERE text = '[dead]'")).scalar()
    
    print(f"Total comments: {total_comments}")
    print(f"Dead comments: {dead_comments}")
    if total_comments:  # Guard against dividing by zero on an empty table
        print(f"Dead percentage: {dead_comments / total_comments * 100:.1f}%")
    print(f"Live comments: {total_comments - dead_comments}")

Total comments: 8705
Dead comments: 153
Dead percentage: 1.8%
Live comments: 8552


## Next Steps

- Explore the `src/` directory for more modules
- Check `src/db/client.py` for all available database operations
- See `src/workflows/` for workflow implementations
- See `src/clients/` for embedding, rerank, and other service clients
- Use `src/main.py` as a reference for API endpoints