# Preview from LangChain

## Step by step code

1. Get the LangSmith API key from the environment
2. Set up the Anthropic key for the chat model
3. Set up the embedding model
4. Select a vector database
5. Set up the document loader and split the documents

## Tutorials

### LangChain RAG tutorial document

#### Part 1

- https://python.langchain.com/docs/tutorials/rag

#### Part 2

Extends the implementation to accommodate conversation-style interactions and multi-step retrieval processes.

- https://python.langchain.com/docs/tutorials/qa_chat_history/

LangChain document loader for GitHub Repo

- https://python.langchain.com/docs/integrations/document_loaders/github/

LangChain document loader for Git Repository

- https://python.langchain.com/docs/integrations/document_loaders/git/

LangChain document loader for Source Code (e.g. Python)

- https://python.langchain.com/docs/integrations/document_loaders/source_code/

LangSmith evaluation for a chatbot

- https://docs.smith.langchain.com/evaluation/tutorials/evaluation

LangSmith evaluation for RAG

- https://docs.smith.langchain.com/evaluation/tutorials/rag

In [1]:
import sys
import os

# Get the path to the parent directory of the notebook
parent_dir = os.path.dirname(os.path.abspath(""))

# Add the parent directory to sys.path
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

# Now do the import
from helper.secrets_loader import SecretsLoader

envfile = os.path.join(parent_dir, ".env")
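
The `helper.secrets_loader` module is not shown in this notebook. As a rough sketch only, assuming the `.env` file holds plain `KEY=VALUE` lines, a loader with the same `get_token(name, envfile)` call shape might look like the following. The class name `SketchSecretsLoader`, its parsing rules, and the demo key are hypothetical, not the project's actual helper.

```python
import os
import tempfile


class SketchSecretsLoader:
    """Hypothetical stand-in for helper.secrets_loader.SecretsLoader."""

    @staticmethod
    def get_token(name: str, envfile: str) -> str:
        # Scan the file for the first NAME=value line and return the value.
        with open(envfile, "r", encoding="utf-8") as f:
            for raw in f:
                line = raw.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue  # skip blanks, comments, and malformed lines
                key, value = line.split("=", 1)
                if key.strip() == name:
                    # Drop optional surrounding quotes
                    return value.strip().strip('"').strip("'")
        raise KeyError(f"{name} not found in {envfile}")


# Usage with a throwaway file so no real secret is involved
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as tmp:
    tmp.write("# demo file\nDEMO_API_KEY=\"abc123\"\n")
demo_token = SketchSecretsLoader.get_token("DEMO_API_KEY", tmp.name)
os.remove(tmp.name)
```

Raising `KeyError` for a missing name is a design choice of this sketch; the real helper may behave differently.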

### 1000 Imports

* [x] set up LangSmith key
* [x] set up Claude key (Anthropic)
* [x] set up Gemini key (Google)
* [x] set up Github personal access token

In [4]:
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = SecretsLoader.get_token("LANGSMITH_API_KEY", envfile)
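os.environ["CLAUDE_API_KEY"] = SecretsLoader.get_token("CLAUDE_API_KEY", envfile)
# The Anthropic integration reads ANTHROPIC_API_KEY, so mirror the Claude key under that name
os.environ["ANTHROPIC_API_KEY"] = os.environ["CLAUDE_API_KEY"]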
os.environ["GOOGLE_API_KEY"] = SecretsLoader.get_token("GOOGLE_API_KEY", envfile)
os.environ["GITHUB_PA_TOKEN"] = SecretsLoader.get_token("GITHUB_PA_TOKEN", envfile)

Loaded LANGSMITH_API_KEY from token file: /Users/ejacquin/Desktop/Northeastern/School_Work/Summer_II_2025/CS4973/techcredit/.env
Loaded CLAUDE_API_KEY from token file: /Users/ejacquin/Desktop/Northeastern/School_Work/Summer_II_2025/CS4973/techcredit/.env
Loaded GOOGLE_API_KEY from token file: /Users/ejacquin/Desktop/Northeastern/School_Work/Summer_II_2025/CS4973/techcredit/.env
Loaded GITHUB_PA_TOKEN from token file: /Users/ejacquin/Desktop/Northeastern/School_Work/Summer_II_2025/CS4973/techcredit/.env


* [x] set up Claude (Anthropic) llm
* [x] set up Google Gemini as embedding model

In [5]:

from langchain.chat_models import init_chat_model
from langchain_google_genai import GoogleGenerativeAIEmbeddings


llm = init_chat_model("claude-3-5-haiku-latest", model_provider="anthropic")
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

# Choosing database

Load, chunk, split, embed and vectorize code data and document data into database

## Candidates

1. Cassandra
2. OpenSearch https://opensearch.org/platform/os-search/vector-database/
3. Pinecone
4. MongoDB
5. PostgreSQL
6. [x] Chroma, locally hosted with SQLite --> **using this one**

In [6]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="python_tech_credit",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

document_store = Chroma(
    collection_name="document_tech_credit",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",
)

* [ ] Import and load a GitHub Repo as a document

In [7]:
from langchain_community.document_loaders import GithubFileLoader
from langchain_core.documents import Document
from langgraph.graph import START, StateGraph

# Load and chunk contents of the github repo
loader = GithubFileLoader(
    repo="ameliarogerscodes/TC-Examples",  # the repo name
    access_token=os.environ["GITHUB_PA_TOKEN"],
    branch="main",  # the branch name
    github_api_url="https://api.github.com",
    file_filter=lambda file_path: file_path.endswith(
        ".py"
    ),  # load all python files.
)
documents = loader.load()

* [ ] test documents content

In [None]:
print(documents[7].metadata)

* [ ] map metadata

In [None]:
import json

# Step 1: Load the JSON metadata from a file (adjust path accordingly)
with open('./repo_metadata.json', 'r', encoding='utf-8') as f:
    json_metadata_list = json.load(f)

# The sample JSON will be like a list of dicts, for example:
# [
#   {"path": "src/pybreaker.py", "type": "source", "tech_credit": "Circuit Breaker", "tech_credit_description": "good design"},
#   {"path": "test/unitest_pybreaker.py", "type": "test", "tech_credit": "Circuit Breaker", "tech_credit_description": "good design"}
# ]

# Step 2: Create a dictionary mapping from path to metadata (excluding 'path' key)
metadata_map = {
    entry['path']: {k: v for k, v in entry.items() if k != 'path'}
    for entry in json_metadata_list
}
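
To make the mapping step concrete, here is a small worked example. The entries are modeled on the sample in the comment above and use the `tech_credit` key that `collect_unique_pairs` reads later in the notebook; the values and the `docs/readme.py` path are illustrative only.

```python
# Illustrative entries; not the contents of the real repo_metadata.json
sample_entries = [
    {"path": "src/pybreaker.py", "type": "source",
     "tech_credit": "Circuit Breaker", "tech_credit_description": "good design"},
    {"path": "test/unitest_pybreaker.py", "type": "test",
     "tech_credit": "Circuit Breaker", "tech_credit_description": "good design"},
]

# Same comprehension as above: key by path, keep every other field as metadata
sample_map = {
    entry["path"]: {k: v for k, v in entry.items() if k != "path"}
    for entry in sample_entries
}

found = sample_map.get("src/pybreaker.py", {})    # metadata for a known file
missing = sample_map.get("docs/readme.py", {})    # unknown path falls back to {}
```

Files that are absent from the JSON end up with empty metadata, so they contribute no tech credit description when the prompt is built.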

* [X] use code splitter

In [None]:
# Use code-splitter
# https://pypi.org/project/code-splitter/
from code_splitter import Language, TiktokenSplitter

python_splitter = TiktokenSplitter(Language.Python, max_size=200)

all_splits = [
    Document(
        page_content=(
            # "# ===== code structure =====\n"
            # + "\n".join(f"# {line}" for line in splits.subtree.splitlines())
            # + "\n\n"
            splits.text
        ),
        metadata=metadata_map.get(doc.metadata.get('path'), {})
    )
    for doc in documents 
    for splits in python_splitter.split(doc.page_content.encode("utf-8"))
]

* [ ] print page_content and metadata for a split

In [None]:
print(all_splits[34].page_content)
print(all_splits[34].metadata)

* [ ] load, split and embed documents into vector database

* [x] index chunks for code vector db, DO NOT LOAD TWICE!

In [None]:
# Index chunks
code_embed_index = vector_store.add_documents(documents=all_splits)
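
One way to make the "do not load twice" rule less fragile is to pass deterministic IDs, so that re-running the cell targets the same records instead of appending duplicates. The sketch below derives an ID from each chunk's text. The helper `stable_ids` is hypothetical, and the approach assumes the vector store overwrites entries whose IDs already exist, which is worth verifying against the installed `langchain_chroma` version.

```python
import hashlib


def stable_ids(texts: list[str]) -> list[str]:
    """Return one SHA-256 hex digest per text; equal text gives an equal ID."""
    return [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]


chunks = ["def a():\n    return 1\n", "def b():\n    return 2\n"]
ids_first_run = stable_ids(chunks)
ids_second_run = stable_ids(chunks)  # identical to the first run

# Hypothetical usage with the store defined earlier (not executed here).
# The dict drops chunks with identical text, since duplicate IDs in one batch may be rejected.
# unique = dict(zip(stable_ids([d.page_content for d in all_splits]), all_splits))
# vector_store.add_documents(documents=list(unique.values()), ids=list(unique.keys()))
```

Hashing only the text means two files with an identical chunk collapse into one record; including the file path in the hash input would keep them separate.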

* [ ] load, chunk, split and embed documents about technical credit

In [None]:
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the contents of the article, keeping only the listed CSS classes
bs4_strainer = bs4.SoupStrainer(
    class_=("article-header__section", "article-header__topic-and-issue-section",
            "article-header article-header__title", "article-header__subtitle",
            "article-header__meta",
            "article-table-of-contents",
            "article-contents",
            "article-footer")
)

doc_loader = WebBaseLoader(
    web_paths=("https://cacm.acm.org/opinion/technical-credit/",),
    bs_kwargs=dict(
        parse_only=bs4_strainer
    ),
)

text_documents = doc_loader.load()
# recursive splitter, 7, all splits

* [X] test the web page document

In [None]:
assert len(text_documents) == 1
print(f"Total characters: {len(text_documents[0].page_content)}")
print(text_documents[0].page_content[:500])

* [ ] split the document

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
text_splits = text_splitter.split_documents(text_documents)

print(f"Split article into {len(text_splits)} sub-documents.")
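
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window illustration. It is not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to break on separators such as paragraphs and lines. The function `window_chunks` is a hypothetical helper for intuition only; the start offset it returns plays the role of `add_start_index`.

```python
def window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Cut text into fixed windows; each starts (chunk_size - chunk_overlap) after the last."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


demo = window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
# demo == [(0, "abcdefghij"), (6, "ghijklmnop"), (12, "mnopqrstuv"), (18, "stuvwxyz")]
```

Each window repeats the last four characters of the previous one, which is the effect the overlap setting is meant to have on sentences that straddle a chunk boundary.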

* [X] embed the documents into vector database, DO NOT LOAD TWICE!

In [None]:
document_ids = document_store.add_documents(documents=text_splits)

# RAG System Part
## Customize Prompt
## Define nodes and graphs in the RAG system
1. [X] retrieve code and metadata
2. [X] retrieve academic documents
3. [X] send message to LLM

* [ ] design prompt template.

In [None]:
# Define the prompt for question answering
from langchain_core.prompts import ChatPromptTemplate
from jinja2 import Environment, BaseLoader
import textwrap

# Jinja2 template that renders one block per retrieved part
jinja2_prompt = """\
{% for part in parts %}
Here is part No. {{ part.ordinal }} of a tech credit
Description:
{{ part.tech_credit }}

Example code for that tech credit:
{{ part.context_code }}

Here is the code from the user:
{{ part.user_code }}
{% endfor %}
"""

user_prompt_template = Environment(loader=BaseLoader()).from_string(jinja2_prompt)

prompt = ChatPromptTemplate([
    # Adjacent string literals are concatenated, which keeps stray indentation out of the prompt
    ("system",
     "You are an assistant for identifying technical credit. Use the following pieces "
     "of retrieved context to answer the question. If you don't know the answer, just "
     "say that you don't know. For each code snippet, use three sentences maximum and "
     "keep the answer concise."),
    ("user", textwrap.dedent("""\
                Some documentation about tech credit:
                {context_doc}

                The following code snippets are the ones most similar to example code for
                tech credits.
                {rendered}

                Question: {question}
                Answer:
                """))
    ])

* [ ] collect metadata from code

In [None]:
def collect_unique_pairs(documents):
    """
    Collect unique concatenated 'tech_credit: description' strings from document metadata.

    Args:
        documents (list[Document]): A list of Document objects, each with a 'metadata' dict.

    Returns:
        list[str]: A sorted list of unique 'tech_credit: description' strings.
    """
    seen = set()

    for doc in documents:
        metadata = doc.metadata
        credit = metadata.get("tech_credit")
        description = metadata.get("tech_credit_description")
        if credit and description:
            combined = f"{credit}: {description}"
            seen.add(combined)

    # Sort so the prompt text is stable across runs (set order is not guaranteed)
    return sorted(seen)
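
The following sketch shows the deduplication on toy inputs, and also what happens when a metadata key is spelled with a hyphen instead of an underscore. `unique_pairs` is a compact restatement written for this demo (it sorts its output so the result is stable), `SimpleNamespace` stands in for `Document`, and all metadata values are made up.

```python
from types import SimpleNamespace


def unique_pairs(docs):
    """Compact restatement of collect_unique_pairs, for demonstration only."""
    return sorted({
        f"{d.metadata['tech_credit']}: {d.metadata['tech_credit_description']}"
        for d in docs
        if d.metadata.get("tech_credit") and d.metadata.get("tech_credit_description")
    })


# SimpleNamespace stands in for langchain_core.documents.Document here
docs = [
    SimpleNamespace(metadata={"tech_credit": "Circuit Breaker", "tech_credit_description": "good design"}),
    SimpleNamespace(metadata={"tech_credit": "Circuit Breaker", "tech_credit_description": "good design"}),
    # Hyphenated key: silently ignored because only "tech_credit" is looked up
    SimpleNamespace(metadata={"tech-credit": "Iterator", "tech_credit_description": "hyphenated key"}),
    SimpleNamespace(metadata={}),
]
pairs = unique_pairs(docs)
```

Only one string survives: the two identical entries collapse, and the hyphenated and empty entries are skipped without any error, so a key mismatch in the metadata file would show up as missing descriptions in the prompt rather than as an exception.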

In [None]:
from urllib.parse import urlparse
from typing import List

def load_repo(url: str, branch: str) -> List[Document]:
    """
    Loads all Python files from a GitHub repository using the repository URL.

    Args:
        url (str): The full URL of the GitHub repository (e.g., "https://github.com/org/project").
        branch (str): The branch to load files from (e.g., "main").

    Returns:
        List[Document]: A list of loaded document objects (the format depends on GithubFileLoader).

    Raises:
        ValueError: If the URL is not a valid GitHub repo URL.
    """
    parsed = urlparse(url)
    if parsed.netloc != "github.com":
        raise ValueError(f"URL is not a github.com repo: {url}")

    # The path is like '/org/project' or '/org/project/'
    path_parts = parsed.path.strip('/').split('/')
    if len(path_parts) < 2:
        raise ValueError(f"Invalid GitHub repository URL: {url}")
    repo_name = '/'.join(path_parts[:2])  # Only org/project, ignore any deeper paths

    loader = GithubFileLoader(
        repo=repo_name,
        branch=branch,
        access_token=os.environ["GITHUB_PA_TOKEN"],  # same token as the example loader above
        github_api_url="https://api.github.com",
        file_filter=lambda file_path: not file_path.startswith("tests/") and file_path.endswith(".py"),
    )
    documents = loader.load()
    return documents

def split_documents(documents: List[Document]) -> List[str]:
    """
    Splits Python code documents into code snippets.

    The code-structure comment header is currently disabled, so each snippet is the raw source text.

    Args:
        documents (List[Document]): List of Document objects containing Python code.

    Returns:
        List[str]: The text of each code snippet.
    """
    python_splitter = TiktokenSplitter(Language.Python, max_size=200)

    all_splits = []
    for doc in documents:
        splits = python_splitter.split(doc.page_content.encode("utf-8"))
        for snippet in splits:
            # delete the header for now, only splitting the literal source code
            # header = (
            #    "# ===== code structure =====\n" +
            #    "\n".join(f"# {line}" for line in snippet.subtree.splitlines()) +
            #    "\n\n"
            #)
            all_splits.append(snippet.text)
    return all_splits
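
The URL handling in `load_repo` can be tried without network access or a token. This sketch isolates that step in a hypothetical helper, `repo_slug`, using the same checks; the deeper `tree/main/src` URL is an illustrative input.

```python
from urllib.parse import urlparse


def repo_slug(url: str) -> str:
    """Return 'org/project' from a github.com URL, ignoring deeper path segments."""
    parsed = urlparse(url)
    if parsed.netloc != "github.com":
        raise ValueError(f"URL is not a github.com repo: {url}")
    parts = parsed.path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"Invalid GitHub repository URL: {url}")
    return "/".join(parts[:2])


slug_plain = repo_slug("https://github.com/danielfm/pybreaker")
slug_deep = repo_slug("https://github.com/danielfm/pybreaker/tree/main/src")
```

Both calls give the same slug. Note that the host check is exact, so `www.github.com` is rejected, and a trailing `.git` in the URL is kept as part of the project name.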

In [None]:
import heapq
from typing import Callable
from statistics import mean, median # for later usage of different similarity score

def default_min_score_fn(results: list[tuple[Document, float]]) -> float:
    """Default scoring function: returns the minimum score."""
    return min(score for _, score in results)

def top_k_similar_queries(
    queries: list[str],
    vectorstore,
    k: int = 3,
    scoring_fn: Callable[[list[tuple[Document, float]]], float] = default_min_score_fn,
    top_docs_per_query: int = 4
) -> list[tuple[str, list[Document], float]]:
    """
    Executes similarity_search_with_score for each query and returns the k queries
    with the lowest aggregated score under a user-defined scoring function.

    Chroma reports a distance, so a lower score means a closer match. For a store
    that reports similarity (higher is better), the selection would need to be inverted.

    Args:
        queries: List of query strings.
        vectorstore: A LangChain-compatible vector store.
        k: Number of top results to return.
        scoring_fn: Function that maps a list of (Document, score) to a float score.
        top_docs_per_query: Number of documents to retrieve for each query.

    Returns:
        A list of (query, documents, aggregated_score) sorted by aggregated_score
        ascending (closest match first).
    """
    heap = []

    for index, query in enumerate(queries):
        results = vectorstore.similarity_search_with_score(query, k=top_docs_per_query)
        if not results:
            continue

        agg_score = scoring_fn(results)

        # if agg_score < 0.65:
        #    continue

        docs = [doc for doc, _ in results]

        # Negate the score so heapq (a min-heap) pops the largest score first, which
        # keeps the k lowest scores. The index breaks ties so that lists of Document
        # objects are never compared.
        heapq.heappush(heap, (-agg_score, index, query, docs))
        if len(heap) > k:
            heapq.heappop(heap)

    # Return sorted results: lowest score (closest match) first
    top_k = sorted((-neg_score, index, query, docs) for neg_score, index, query, docs in heap)
    return [(query, docs, score) for score, _, query, docs in top_k]
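
`mean` and `median` are imported above for alternative scoring. A sketch of how such functions could be written to match the `scoring_fn` signature follows. The names `mean_score_fn` and `median_score_fn` and the sample scores are hypothetical, and plain strings stand in for `Document` objects.

```python
from statistics import mean, median


def mean_score_fn(results):
    """Average score across all retrieved (document, score) pairs."""
    return mean(score for _, score in results)


def median_score_fn(results):
    """Median score; less sensitive to one unusually close or distant match."""
    return median(score for _, score in results)


# Made-up scores for four retrieved documents
sample_results = [("doc_a", 0.2), ("doc_b", 0.4), ("doc_c", 0.9), ("doc_d", 0.5)]
mean_value = mean_score_fn(sample_results)      # about 0.5
median_value = median_score_fn(sample_results)  # about 0.45
min_value = min(score for _, score in sample_results)  # 0.2, what the default returns
```

Either function could be passed as `scoring_fn=mean_score_fn` when calling `top_k_similar_queries`. With Chroma's distance scores, lower values indicate closer matches, so the minimum rewards a single strong hit while the mean and median reward snippets whose retrieved examples are all reasonably close.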


In [None]:
from typing_extensions import List, TypedDict

# Define state for application
class State(TypedDict):
    question: str
    context_doc: List[Document]
    # each part holds the example code, user code, tech credit text, and ordinal
    parts: list[dict]
    url: str  # URL of the user's GitHub repository
    branch: str
    answer: str

# Define application steps
def retrieve(state: State):
    repo_splits = split_documents(load_repo(state["url"], state["branch"]))
    retrieved_docs = top_k_similar_queries(repo_splits, vector_store)
    parts = [
        {
            "ordinal": i + 1,
            "tech_credit": "\n".join(collect_unique_pairs(context_code)),
            "user_code": user_code,
            "context_code": "\n\n".join(doc.page_content for doc in context_code)
        }
        for i, (user_code, context_code, _) in enumerate(retrieved_docs)
    ]
    return {"parts": parts}

def retrieve_doc(state: State):
    retrieved_doc = document_store.similarity_search(state["question"])
    return {"context_doc": retrieved_doc}
    
def generate(state: State):
    doc_content = "\n\n".join(doc.page_content for doc in state["context_doc"])
    user_prompt = user_prompt_template.render(parts=state["parts"])
    messages = prompt.invoke({"question": state["question"], "rendered": user_prompt, 
                              "context_doc": doc_content})
    response = llm.invoke(messages)
    return {"answer": response.content}

# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_node(retrieve_doc)
graph_builder.add_edge(START, "retrieve")
graph_builder.add_edge(START, "retrieve_doc")
graph = graph_builder.compile()

# Ask Question Part
## Circuit Breaker
## MVC model
## Iterator pattern


* [ ] Ask questions and test RAG

In [None]:
response = graph.invoke({"question": "Tell me what tech credits does the repo possibly use?",
"url": "https://github.com/danielfm/pybreaker", "branch": "main"})
print(response["answer"])

In [None]:
response = graph.invoke({"question": "Tell me what tech credits does the repo possibly use?",
"url": "https://github.com/zacernst/circuit_breaker", "branch": "master"})
print(response["answer"])

In [None]:
response = graph.invoke({"question": "Tell me what tech credits does the repo possibly use?",
"url": "https://github.com/fabfuel/circuitbreaker", "branch": "develop"})
print(response["answer"])

In [None]:
response = graph.invoke({"question": "Tell me what tech credits does the repo possibly use?",
"url": "https://github.com/gmargari/pymvc", "branch": "main"})
print(response["answer"])

In [None]:
response = graph.invoke({"question": "Tell me what tech credits does the repo possibly use?",
"url": "https://github.com/etimberg/pycircuitbreaker", "branch": "master"})
print(response["answer"])

In [None]:
# response = graph.invoke({"question": "Tell me if the following is a tech credit and what do these 3 states do?",
# "url": "https://github.com/pallets/jinja", "branch": "main"})
# print(response["answer"])

# Roadmap

1. Use a text embedding model.
   + [X] Gemini text-embedding-004
2. Code splitter may not work properly.
   + [X] change to code-splitter, which works better.
3. [X] Switch from in-memory vector storage to a vector database
   + [X] code vector database with additional metadata for tech credit
   + [X] document vector database with academic context about tech credit
4. [ ] Load more standard code examples for tech credit (~20 more examples)
5. [ ] Load more documents (academic context) for technical credit (3 to 5 related academic papers)
6. [X] Allow the user to ask about a repository, instead of small code snippets
   * [X] load, chunk, split, embed and vectorize a repo
   * [X] similarity search and filter user code by similarity score compared with example code
   * [ ] batch process to LLM
   * [ ] organize response and answers
7. [ ] validate the results
   * [ ] Test our trained data in TC-Examples
   * [ ] use different embedding models and LLMs to test five example repos
     * Gemini
     * claude-3.5-haiku
8. Jun 6:
   1. [ ] choose five test repos
   2. [ ] two more LLMs and one more embedding model
   3. [ ] expected output:
      1. [ ] human expert
      2. [ ] LLM as judge