# LangChain RAG system for detecting Tech Credit

## Step by step code

1. get LangSmith API Key from environment
2. set up anthropic key for chat model
3. set up embedding model for embeddings
4. select vector database
5. set up document loader and split up

## Set up API keys

## Tutorials
LangChain RAG tutorial document
Part 1
https://python.langchain.com/docs/tutorials/rag

Part 2
extends the implementation to accommodate conversation-style interactions and
multi-step retrieval processes.
https://python.langchain.com/docs/tutorials/qa_chat_history/

LangChain document loader for GitHub Repo
https://python.langchain.com/docs/integrations/document_loaders/github/

LangChain document loader for Git Repository
https://python.langchain.com/docs/integrations/document_loaders/git/

LangChain document loader for Source Code (e.g. Python)
https://python.langchain.com/docs/integrations/document_loaders/source_code/

LangSmith evaluation for a chatbot
https://docs.smith.langchain.com/evaluation/tutorials/evaluation

LangSmith evaluation for a rag
https://docs.smith.langchain.com/evaluation/tutorials/rag

* [X] set up LangSmith key

In [15]:
# use dotenv to store secrets
%reload_ext dotenv
%dotenv
import os

In [8]:
import getpass

if not os.environ.get("LANGSMITH_API_KEY"):
  os.environ["LANGSMITH_TRACING"] = "true"
  os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter API key for LangSmith to enable tracing: ")


* [X] set up Anthropic key

In [12]:
if not os.environ.get("CLAUDE_API_KEY"):
  os.environ["CLAUDE_API_KEY"] = getpass.getpass("Enter API key for Anthropic: ")

from langchain.chat_models import init_chat_model

claude_llm = init_chat_model("claude-sonnet-4-20250514", model_provider="anthropic")
claude_llm_haiku = init_chat_model("claude-3-5-haiku-latest", model_provider="anthropic")

In [4]:
if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

openai_llm = init_chat_model("gpt-4o-mini", model_provider="openai")

* [X] set up Google gemini as embedding model

In [17]:
if not os.environ.get("GOOGLE_API_KEY"):
   os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

from langchain_google_genai import GoogleGenerativeAIEmbeddings

gemini_embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

In [6]:
if not os.environ.get("HF_TOKEN"):
  os.environ["HF_TOKEN"] = getpass.getpass("Enter API key for HuggingFace Hub Token: ")

from langchain_huggingface import HuggingFaceEmbeddings

hugging_face_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

In [8]:
from langchain_openai import OpenAIEmbeddings

openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

* [X] set up vector database

# Choosing database

Load, chunk, split, embed and vectorize code data and document data into database

## Candidates

1. cassandra
2. open search https://opensearch.org/platform/os-search/vector-database/
3. Pinecone
4. MongoDB
5. PostgreSQL
6. [X] Chroma, locally hosted with sqlite

In [18]:
from langchain_chroma import Chroma

# huggingface_store = Chroma(
#     collection_name="hf_python_tech_credit",
#     embedding_function=hugging_face_embeddings,
#     persist_directory="./chroma_langchain_db",
# )

# openai_store = Chroma(
#     collection_name="openai_python_tech_credit",
#     embedding_function=openai_embeddings,
#     persist_directory="./chroma_langchain_db",
# )

vector_store = Chroma(
    collection_name="python_tech_credit",
    embedding_function=gemini_embeddings,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

document_store = Chroma(
    collection_name="document_tech_credit",
    embedding_function=gemini_embeddings,
    persist_directory="./chroma_langchain_db",
)

* [ ] Import and load a GitHub Repo as a document

In [39]:
from langchain_community.document_loaders import GithubFileLoader
from langchain_community.document_loaders import JSONLoader
from langchain_core.documents import Document
from langgraph.graph import START, StateGraph
from code_splitter import Language, TiktokenSplitter

if not os.environ.get("GITHUB_PERSONAL_ACCESS_TOKEN"):
   os.environ["GITHUB_PERSONAL_ACCESS_TOKEN"] = getpass.getpass("Enter ACCESS_TOKEN for GitHub: ")

# Load and chunk contents of the github repo
loader = GithubFileLoader(
    repo="ameliarogerscodes/TC-Examples",  # the repo name
    branch="main",  # the branch name
    # access_token=ACCESS_TOKEN, # delete/comment out this argument if you've set the access token as an env var.
    github_api_url="https://api.github.com",
    # parser=LanguageParser(language=Language.PYTHON, parser_threshold=200),
    file_filter=lambda file_path: file_path.endswith(
        ".py"
    ),  # load all python files.
)
documents = loader.load()

* [ ] test documents content

In [20]:
print(documents[7].metadata)

{'path': 'MVC/view.py', 'sha': '906c1ec1fe1d4d79a7e8b9fbdf80ab20e17236a0', 'source': 'https://api.github.com/ameliarogerscodes/TC-Examples/blob/main/MVC/view.py'}


* [ ] map metadata

In [21]:
import json

# Step 1: Load the JSON metadata from a file (adjust path accordingly)
with open('./repo_metadata.json', 'r', encoding='utf-8') as f:
    json_metadata_list = json.load(f)

# The sample JSON will be like a list of dicts, for example:
# [
#   {"path": "src/pybreaker.py", "type": "source", "tech-credit": "Circuit Breaker", "tech_credit_description": "good design"},
#   {"path": "test/unitest_pybreaker.py", "type": "test", "tech-credit": "Circuit Breaker", "tech_credit_description": "good design"}
# ]

# Step 2: Create a dictionary mapping from path to metadata (excluding 'path' key)
metadata_map = {
    entry['path']: {k: v for k, v in entry.items() if k != 'path'}
    for entry in json_metadata_list
}

* [X] use code splitter

In [22]:
# use code-splitter
# https://pypi.org/project/code-splitter/
python_splitter = TiktokenSplitter(Language.Python, max_size=200)

all_splits = [
    Document(
        page_content=(
            # "# ===== code structure =====\n"
            # + "\n".join(f"# {line}" for line in splits.subtree.splitlines())
            # + "\n\n"
            splits.text
        ),
        metadata=metadata_map.get(doc.metadata.get('path'), {})
    )
    for doc in documents
    for splits in python_splitter.split(doc.page_content.encode("utf-8"))
]

* [ ] print page_content and metadata for a split

In [23]:
print(all_splits[34].page_content)
print(all_splits[34].metadata)

class ConcreteComponentA(Component):
    """
    Each Concrete Component must implement the `accept` method in such a way
    that it calls the visitor's method corresponding to the component's class.
    """

    def accept(self, visitor: Visitor) -> None:
        """
        Note that we're calling `visitConcreteComponentA`, which matches the
        current class name. This way we let the visitor know the class of the
        component it works with.
        """

        visitor.visit_concrete_component_a(self)

    def exclusive_method_of_concrete_component_a(self) -> str:
        """
        Concrete Components may have special methods that don't exist in their
        base class or interface. The Visitor is still able to use these methods
        since it's aware of the component's concrete class.
        """

        return "A"
{'type': 'source', 'tech_credit': 'Visitor Pattern', 'tech_credit_description': 'Represent an operation to be performed on instances of a set of classes.

* [ ] load, split and embed documents into vector database

* [x] index chunks for code vector db, DO NOT LOAD TWICE!

In [24]:
# Index chunks
code_embed_index_gemini = vector_store.add_documents(documents=all_splits)
# code_embed_index_hf = huggingface_store.add_documents(documents=all_splits)
# code_embed_index_openai = openai_store.add_documents(documents=all_splits)

* [ ] load, chunk, split and embed documents about technical credit

In [25]:
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and chunk contents of the blog
bs4_strainer = bs4.SoupStrainer(
    class_=("article-header__section", "article-header__topic-and-issue-section",
            "article-header article-header__title", "article-header__subtitle",
        	"article-header__meta",
        	"article-table-of-contents",
        	"article-contents",
        	"article-footer")
)

doc_loader = WebBaseLoader(
    web_paths=("https://cacm.acm.org/opinion/technical-credit/",),
    bs_kwargs=dict(
        parse_only=bs4_strainer
    ),
)

text_documents = doc_loader.load()
# recurisive splitter , 7 , all splits

USER_AGENT environment variable not set, consider setting it to identify your requests.


* [X] test the web page document

In [26]:
assert len(text_documents) == 1
print(f"Total characters: {len(text_documents[0].page_content)}")
print(text_documents[0].page_content[:500])

Total characters: 14484
Opinion

Computing Profession 


Balancing initial investment and long-term results in the software development process.


				By Ian Gorton, Alessio Bucaioni, and Patrizio Pelliccione 

Posted Dec 26 2024 



What Is Technical Credit?
Technical Credit in Practice
A Research Agenda for Technical Credit
Conclusion
References
Footnotes




Technical debt (TD) is an established concept in software engineering encompassing an unavoidable side effect of software development.3 It arises due to tight s


* [ ] split the document

In [27]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
text_splits = text_splitter.split_documents(text_documents)

print(f"Split article into {len(text_splits)} sub-documents.")

Split article into 20 sub-documents.


* [X] embed the documents into vector database, DO NOT LOAD TWICE!

In [28]:
document_ids = document_store.add_documents(documents=text_splits)

# RAG System Part
## Customize Prompt
## Define nodes and graphs in the rag system
1. [X] retrieve code and metadata
2. [X] retrieve academic documents
3. [X] send message to LLM

* [ ] design prompt template.

In [29]:
# Define prompt for question-answering
# N.B. for non-US LangSmith endpoints, you may need to specify
# api_url="https://api.smith.langchain.com" in hub.pull.
from langchain_core.prompts import ChatPromptTemplate
from jinja2 import Environment, BaseLoader
import textwrap

jinja2_prompt = """\
{% for part in parts %}
Here is the No. {{ part.ordinal }} part of 
Descirption:
{{ part.tech_credit }}

Example code for that tech credit:
{{ part.context_code }}

Here is the code from user:
{{ part.user_code }}
{% endfor %}
"""

user_prompt_template = Environment(loader=BaseLoader).from_string(jinja2_prompt)

prompt = ChatPromptTemplate([
    ("system", "You are an assistant for identifying technical credit. Use the following pieces \
                of retrieved context to answer the question. The context are code snippets If you don't know the answer, just \
                say that you don't know. For each code snippet, use three sentences maximum and \
                keep the answer concise."),
    ("user", textwrap.dedent("""\
                Some documentation about tech credit:
                {context_doc}

                The following are snippets of codes that are most similar to example codes of
                tech credits.
                {rendered}

                Question: {question}
                Answer:
                """))
    ])

* [ ] collect metadata from code

In [30]:
def collect_unique_pairs(documents):
    """
    Collect unique concatenated 'tech_credit: description' strings from document metadata.

    Args:
        documents (list[dict]): A list of Document objects, each with a 'metadata' field.

    Returns:
        list[str]: A list of unique 'tech_credit: description' strings.
    """
    seen = set()

    for doc in documents:
        metadata = doc.metadata
        credit = metadata.get("tech_credit")
        description = metadata.get("tech_credit_description")
        if credit and description:
            combined = f"{credit}: {description}"
            seen.add(combined)

    return list(seen)

In [31]:
from urllib.parse import urlparse
from typing import List

def load_repo(url: str, branch: str) -> List[Document]:
    """
    Loads all Python files from a GitHub repository using the repository URL.

    Args:
        url (str): The full URL of the GitHub repository (e.g., "https://github.com/org/project").

    Returns:
        List[Document]: A list of loaded document objects (the format depends on GithubFileLoader).

    Raises:
        ValueError: If the URL is not a valid GitHub repo URL.
    """
    parsed = urlparse(url)
    if parsed.netloc != "github.com":
        raise ValueError(f"URL is not a github.com repo: {url}")

    # The path is like '/org/project' or '/org/project/'
    path_parts = parsed.path.strip('/').split('/')
    if len(path_parts) < 2:
        raise ValueError(f"Invalid GitHub repository URL: {url}")
    repo_name = '/'.join(path_parts[:2])  # Only org/project, ignore any deeper paths

    loader = GithubFileLoader(
        repo=repo_name,
        branch=branch,
        # access_token=ACCESS_TOKEN,
        github_api_url="https://api.github.com",
        file_filter=lambda file_path: not file_path.startswith("tests/") and file_path.endswith(".py"),
    )
    documents = loader.load()
    return documents

def split_documents(documents: List[Document]) -> List[str]:
    """
    Splits Python code documents into code snippets and prepends the code structure as a comment header.

    Args:
        documents (List[Document]): List of Document objects containing Python code.

    Returns:
        List[Document]: List of new strings, each with a code structure comment followed by the code snippet.
    """
    python_splitter = TiktokenSplitter(Language.Python, max_size=200)

    all_splits = []
    for doc in documents:
        splits = python_splitter.split(doc.page_content.encode("utf-8"))
        for snippet in splits:
            # delete the header for now, only splitting the literal source code
            # header = (
            #    "# ===== code structure =====\n" +
            #    "\n".join(f"# {line}" for line in snippet.subtree.splitlines()) +
            #    "\n\n"
            #)
            all_splits.append(snippet.text)
    return all_splits

In [32]:
import heapq
from typing import Callable
from statistics import mean, median # for later usage of different similarity score

def default_min_score_fn(results: list[tuple[Document, float]]) -> float:
    """Default scoring function: returns the minimum score."""
    return min(score for _, score in results)

def top_k_similar_queries(
    queries: list[str],
    vectorstore,
    k: int = 3,
    scoring_fn: Callable[[list[tuple[Document, float]]], float] = default_min_score_fn,
    top_docs_per_query: int = 4
) -> list[tuple[str, list[Document], float]]:
    """
    Executes similarity_search_with_score for each query and returns top-k queries
    sorted by a user-defined scoring function.

    Args:
        queries: List of query strings.
        vectorstore: A LangChain-compatible vector store.
        k: Number of top results to return.
        scoring_fn: Function that maps a list of (Document, score) to a float score.
        top_docs_per_query: Number of documents to retrieve for each query.

    Returns:
        A list of (query, documents, aggregated_score) sorted by aggregated_score descending.
    """
    heap = []

    for query in queries:
        results = vectorstore.similarity_search_with_score(query, k=top_docs_per_query)
        if not results:
            continue

        agg_score = scoring_fn(results)

        # if agg_score < 0.65:
        #    continue

        docs = [doc for doc, _ in results]

        # Use negative score to simulate a max-heap
        heapq.heappush(heap, (-agg_score, query, docs))
        if len(heap) > k:
            heapq.heappop(heap)

    # Return sorted results: highest score first
    top_k = sorted([(-score, query, docs) for score, query, docs in heap], reverse=True)
    return [(query, docs, score) for score, query, docs in top_k]


In [33]:
from typing_extensions import List, TypedDict
from langchain_core.vectorstores.base import VectorStore
from langchain_core.language_models.chat_models import BaseChatModel

# Define state for application
class State(TypedDict):
    question: str
    context_doc: List[Document]
    # a part that contains example codes, user code, tech credit and index order
    parts: list[dict]
    url: str # user code
    branch: str
    llm: BaseChatModel
    vector_store: VectorStore
    answer: str

# Define application steps
def retrieve(state: State):
    repo_splits = split_documents(load_repo(state["url"], state["branch"]))
    retrieved_docs = top_k_similar_queries(repo_splits, state["vector_store"])
    parts = [
        {
            "ordinal": i + 1,
            "tech_credit": "\n".join(collect_unique_pairs(context_code)),
            "user_code": user_code,
            "context_code": "\n\n".join(doc.page_content for doc in context_code)
        }
        for i, (user_code, context_code, _) in enumerate(retrieved_docs)
    ]
    return {"parts": parts}

def retrieve_doc(state: State):
    retrieved_doc = document_store.similarity_search(state["question"], k = 3)
    return {"context_doc": retrieved_doc}

def generate(state: State):
    doc_content = "\n\n".join(doc.page_content for doc in state["context_doc"])
    user_prompt = user_prompt_template.render(parts=state["parts"])
    messages = prompt.invoke({"question": state["question"], "rendered": user_prompt,
                              "context_doc": doc_content})
    response = state["llm"].invoke(messages)
    return {"answer": response.content}

# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_node(retrieve_doc)
graph_builder.add_edge(START, "retrieve")
graph_builder.add_edge(START, "retrieve_doc")
graph = graph_builder.compile()

In [37]:
def rag_state_wrapper(user_input: dict, llm, vector_store) -> dict:
    """
    get user input and llm and vector_store to compose the final state
    """
    state = dict(user_input)
    state.setdefault("context_doc", [])
    state.setdefault("parts", [])
    state["llm"] = llm
    state["vector_store"] = vector_store
    return state

# def openai_and_chroma_state(user_input):
#     return rag_state_wrapper(user_input,
#                              llm=openai_llm, vector_store=vector_store)

def anthropic_and_chroma_state(user_input):
    return rag_state_wrapper(user_input,
                             llm=claude_llm, vector_store=vector_store)

# def anthropic_3_5_state(user_input):
#     return rag_state_wrapper(user_input,
#                              llm=claude_llm_3_5, vector_store=openai_store)

# def openai_and_hf_state(user_input):
#     return rag_state_wrapper(user_input,
#                              llm=openai_llm, vector_store=huggingface_store)

# def anthropic_and_hf_state(user_input):
#     return rag_state_wrapper(user_input,
#                              llm=claude_llm, vector_store=huggingface_store)

# Ask Question Part
Use example code repo

1. test for different LLM model and embedding model


* [ ] Ask questions and test RAG

In [35]:
response1 = graph.invoke(openai_and_chroma_state({
    "question": "What tech credits can you identify in this repo?",
    "url": "https://github.com/praypratyay/TicTacToe",
    "branch": "main"}))
print("---Response from Openai with gemini embeddings")
print(response1["answer"])

NameError: name 'openai_llm' is not defined

In [38]:
response2 = graph.invoke(anthropic_and_chroma_state({
    "question": "What tech credits can you identify in this repo",
    "url": "https://github.com/praypratyay/TicTacToe",
    "branch": "main"}))
print("---Response from Anthropic with gemini embeddings---")
print(response2["answer"])

TypeError: "Could not resolve authentication method. Expected either api_key or auth_token to be set. Or for one of the `X-Api-Key` or `Authorization` headers to be explicitly omitted"

In [39]:
response4 = graph.invoke(openai_and_hf_state({
    "question": "What tech credits can you identify from this repo?",
    "url": "https://github.com/praypratyay/TicTacToe",
    "branch": "main"}))
print("---Response from Openai with hugging face embeddings---")
print(response4["answer"])

---Response from Openai with hugging face embeddings---
The identified tech credits from the provided code snippets include the Strategy Pattern, which enables interchangeable algorithms through encapsulation, as evidenced by the implementations of different bot strategies in the user's code. Additionally, a Circuit Breaker pattern can be observed that enhances system resilience by detecting service failures, which is illustrated in the second and third code snippets discussing failure management. Overall, both design patterns contribute to modularity and maintainability in software engineering practices.


In [40]:
response3 = graph.invoke(anthropic_and_hf_state({
    "question": "What tech credits can you identify from this repo?",
    "url": "https://github.com/praypratyay/TicTacToe",
    "branch": "main"}))
print("---Response from Anthropic with hugging face embeddings---")
print(response3["answer"])

---Response from Anthropic with hugging face embeddings---
Based on the provided code snippets, I can identify the following technical credits:

**Strategy Pattern Implementation**: The `BotPlayingStrategy` code demonstrates a clear implementation of the Strategy pattern with an abstract base class and multiple concrete strategies (EASY, MEDIUM, HARD). This creates technical credit by allowing different bot playing algorithms to be easily swapped and extended without modifying existing code. The pattern enables future modifications to bot behavior through simple strategy replacement.

**Builder Pattern**: The `GameBuilder` class implements the Builder pattern with method chaining, providing a flexible way to construct game objects with different configurations. This creates technical credit by making game creation more maintainable and allowing easy addition of new configuration parameters. The fluent interface design facilitates future extensions to game setup requirements.

**Encapsu

In [41]:
response_one_tc = graph.invoke(anthropic_and_chroma_state({
    "question": "Can you identify an adapter pattern as tech credit in this repo?",
    "url": "https://github.com/praypratyay/TicTacToe",
    "branch": "main"}))
print("---Response from Anthropic with gemini embeddings---")
print(response_one_tc["answer"])

---Response from Anthropic with gemini embeddings---
Based on the code snippets provided, I cannot identify a clear adapter pattern implementation in this repository. The code shows examples of the Strategy pattern with `BotPlayingStrategy` and its concrete implementations (`EASYBotPlayingStrategy`, `MEDIUMBotPlayingStrategy`, `HARDBotPlayingStrategy`), and a Factory pattern with `BotPlayingStrategyFactory`. 

The adapter pattern would typically involve wrapping an existing class with an incompatible interface to make it work with another class, but none of the provided code demonstrates this structural pattern. The examples focus on behavioral patterns like Strategy and Template Method rather than adapter-style interface compatibility solutions.


# Roadmap:

1. Use a text embedding model.
   + [X] gemini text-embedding-004
   + [X] huggingface sentence-transformers/all-mpnet-base-v2
   + [ ] OpenAI text embedding large 
2. Code splitter may not work properly.
   + [X] change to code-splitter that works better.
3. [X] Switch in memory vector storage to a vector database
   + [X] code vector database with additional metadata for tech credit
   + [X] document vector database with academic context about tech credit
4. [ ] Load more standard example codes for tech credit (~20 more examples)
5. [ ] Load more documents (academic contexts) for technical credit (3~5 related academic paper)
6. [X] Allow user to ask about a repository, instead of small snippets of codes
   * [X] load, chunk, split, embed and vectorize a repo
   * [X] similarity search and filter user code by similarity score compared with example code
   * [ ] batch process to LLM
   * [ ] organize response and answers
7. [ ] validate the results
   * [ ] Test our trained data in TC-Examples
   * [ ] use different embedding and LLM to test five example repos
   * Gemini,
   * claude-3.5-haiku, claude-sonnet-4
   * OpenAI gpt-4o-mini