# Preview from LangChain

## Step by step code

1. Get the LangSmith API key from the environment.
2. Set up the Anthropic key for the chat model.
3. Set up the embedding model.
4. Select a vector database.
5. Set up the document loaders and split the documents.

## Set up API keys
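
The key-setup cells further down all follow the same check-then-prompt pattern. A minimal sketch of that pattern as a single helper is given here. `ensure_env_key` is a hypothetical name, not a LangChain function, and the `prompt_fn` parameter is an assumption added only so the helper can be exercised without a terminal.

```python
import getpass
import os
from typing import Callable


def ensure_env_key(name: str, prompt: str,
                   prompt_fn: Callable[[str], str] = getpass.getpass) -> str:
    """Return os.environ[name], prompting for it only when it is unset or empty.

    Hypothetical helper: it wraps the check-then-prompt pattern used in this
    notebook and is not part of LangChain.
    """
    if not os.environ.get(name):
        os.environ[name] = prompt_fn(prompt)
    return os.environ[name]
```

A call such as `ensure_env_key("ANTHROPIC_API_KEY", "Enter API key for Anthropic: ")` would then prompt only on the first run of a session.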

## Tutorials
LangChain RAG tutorial document
Part 1
https://python.langchain.com/docs/tutorials/rag

Part 2
extends the implementation to accommodate conversation-style interactions and
multi-step retrieval processes.
https://python.langchain.com/docs/tutorials/qa_chat_history/

LangChain document loader for GitHub Repo
https://python.langchain.com/docs/integrations/document_loaders/github/

LangChain document loader for Git Repository
https://python.langchain.com/docs/integrations/document_loaders/git/

LangChain document loader for Source Code (e.g. Python)
https://python.langchain.com/docs/integrations/document_loaders/source_code/

LangSmith evaluation for a chatbot
https://docs.smith.langchain.com/evaluation/tutorials/evaluation

* [X] set up LangSmith key

In [1]:
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter API key for LangSmith to enable tracing: ")


Enter API key for LangSmith to enable tracing:  ········


* [X] set up Anthropic key

In [2]:
if not os.environ.get("ANTHROPIC_API_KEY"):
    os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Enter API key for Anthropic: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("claude-3-5-haiku-latest", model_provider="anthropic")

Enter API key for Anthropic:  ········


* [X] set up Google Gemini as the embedding model

In [3]:
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

Enter API key for Google Gemini:  ········


* [X] set up vector database 

# Choosing database

Load, split, and embed the code and document data, then store the vectors in the database.

## Candidates

1. Cassandra
2. OpenSearch https://opensearch.org/platform/os-search/vector-database/
3. Pinecone
4. MongoDB
5. PostgreSQL
6. [X] Chroma, locally hosted with sqlite

In [4]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="python_tech_credit",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

document_store = Chroma(
    collection_name="document_tech_credit",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",
)

* [ ] Import and load a GitHub Repo as a document

In [5]:
from langchain import hub
from langchain_community.document_loaders import GithubFileLoader
from langchain_community.document_loaders import JSONLoader
from langchain_core.documents import Document
from langgraph.graph import START, StateGraph
from code_splitter import Language, TiktokenSplitter

if not os.environ.get("GITHUB_PERSONAL_ACCESS_TOKEN"):
    os.environ["GITHUB_PERSONAL_ACCESS_TOKEN"] = getpass.getpass("Enter ACCESS_TOKEN for GitHub: ")

# Load and chunk contents of the github repo
loader = GithubFileLoader(
    repo="ameliarogerscodes/TC-Examples",  # the repo name
    branch="main",  # the branch name
    # access_token=ACCESS_TOKEN, # delete/comment out this argument if you've set the access token as an env var.
    github_api_url="https://api.github.com",
    # parser=LanguageParser(language=Language.PYTHON, parser_threshold=200),
    file_filter=lambda file_path: file_path.endswith(
        ".py"
    ),  # load all python files.
)
documents = loader.load()

Enter ACCESS_TOKEN for GitHub:  ········
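
The `file_filter` argument passed to the loader is a plain predicate on the file path, so it can be built and checked separately from any GitHub call. The following is a sketch under the assumption that several extensions or excluded directories might be wanted later. `make_file_filter` and the `tests/` exclusion are illustrative and not part of `GithubFileLoader`.

```python
from typing import Callable, Iterable


def make_file_filter(extensions: Iterable[str],
                     exclude_prefixes: Iterable[str] = ()) -> Callable[[str], bool]:
    """Build a path predicate: keep files with a wanted extension,
    drop files under an excluded directory prefix (hypothetical helper)."""
    exts = tuple(extensions)
    excluded = tuple(exclude_prefixes)

    def file_filter(file_path: str) -> bool:
        if excluded and file_path.startswith(excluded):
            return False
        return file_path.endswith(exts)

    return file_filter


# Same behaviour as the lambda in the loader cell, plus an example exclusion.
python_only = make_file_filter([".py"], exclude_prefixes=["tests/"])
```

The resulting `python_only` function could be passed as `file_filter=python_only` in place of the inline lambda.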


* [ ] test documents content

In [6]:
print(documents[0].metadata)

{'path': 'CircuitBreaker/CircuitBreaker.py', 'sha': 'fcfed0fc14e5ae96bf700bf02546ddd55f7a4b8a', 'source': 'https://api.github.com/ameliarogerscodes/TC-Examples/blob/main/CircuitBreaker/CircuitBreaker.py'}


* [ ] map metadata

In [7]:
import json

# Step 1: Load the JSON metadata from a file (adjust path accordingly)
with open('./repo_metadata.json', 'r', encoding='utf-8') as f:
    json_metadata_list = json.load(f)

# The sample JSON will be like a list of dicts, for example:
# [
#   {"path": "src/pybreaker.py", "type": "source", "tech_credit": "Circuit Breaker", "tech_credit_description": "good design"},
#   {"path": "test/unitest_pybreaker.py", "type": "test", "tech_credit": "Circuit Breaker", "tech_credit_description": "good design"}
# ]

# Step 2: Create a dictionary mapping from path to metadata (excluding 'path' key)
metadata_map = {
    entry['path']: {k: v for k, v in entry.items() if k != 'path'}
    for entry in json_metadata_list
}
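
To make the mapping step concrete, here is a small self-contained run on sample entries shaped like the ones in the comment, using the `tech_credit` key that the later cells read. The entries and the `docs/readme.py` path are made up for illustration. The point is that a path missing from the JSON falls back to an empty dict rather than raising.

```python
# Illustrative entries only; the real list comes from repo_metadata.json.
sample_metadata = [
    {"path": "src/pybreaker.py", "type": "source",
     "tech_credit": "Circuit Breaker", "tech_credit_description": "good design"},
    {"path": "test/unitest_pybreaker.py", "type": "test",
     "tech_credit": "Circuit Breaker", "tech_credit_description": "good design"},
]

# Same comprehension as above: key by path, keep every other field.
sample_map = {
    entry["path"]: {k: v for k, v in entry.items() if k != "path"}
    for entry in sample_metadata
}

known = sample_map.get("src/pybreaker.py", {})    # metadata without the 'path' key
unknown = sample_map.get("docs/readme.py", {})    # not in the JSON, so empty metadata
```

Chunks from files that are absent from the JSON therefore carry empty metadata and contribute nothing to the tech credit description later on.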

* [X] use code splitter

In [8]:
# Split each file with code-splitter (https://pypi.org/project/code-splitter/)
# and attach the mapped metadata for the file's path to every chunk.
python_splitter = TiktokenSplitter(Language.Python, max_size=100)
all_splits = [
    Document(
        page_content=chunk.text,
        metadata=metadata_map.get(doc.metadata.get('path'), {})
    )
    for doc in documents
    for chunk in python_splitter.split(doc.page_content.encode("utf-8"))
]

* [ ] print page_content and metadata for a split

In [9]:
print(all_splits[3].page_content)
print(all_splits[3].metadata)

class CircuitBreakerState:
    def __init__(self, breaker, name):
        self._breaker = breaker
        self._name = name

    @property
    def name(self):
        return self._name
{'type': 'source', 'tech_credit': 'Circuit Breaker', 'tech_credit_description': 'Enhance system resilience by dynamically detecting service failures and preventing cascading issues, especially in distributed systems.'}


* [ ] load, split and embed documents into vector database

* [x] index chunks for code vector db
* [ ] index chunks for documentation vector db

In [10]:
# Index chunks
code_embed_index = vector_store.add_documents(documents=all_splits)
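
Because the collection is persisted on disk, running the indexing cell a second time is likely to store the same chunks again under new auto-generated IDs. One common remedy is to derive a deterministic ID for each chunk and pass the list through the `ids` argument of `add_documents`. The sketch below covers only the ID computation. `stable_chunk_id` is a hypothetical helper, and it assumes the file path and chunk position are still at hand when the chunks are created.

```python
import hashlib


def stable_chunk_id(path: str, index: int, text: str) -> str:
    """Hash the file path, chunk position and chunk text into a repeatable ID."""
    payload = f"{path}\x00{index}\x00{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Illustrative (path, chunk text) pairs; identical text at different positions
# still gets distinct IDs because the position is part of the hash.
chunks = [
    ("CircuitBreaker/CircuitBreaker.py", "class CircuitBreakerState:"),
    ("CircuitBreaker/CircuitBreaker.py", "class CircuitBreakerState:"),
]
ids = [stable_chunk_id(path, i, text) for i, (path, text) in enumerate(chunks)]

# Assumed usage, not run here:
# vector_store.add_documents(documents=all_splits, ids=ids)
```

With repeatable IDs, re-indexing unchanged files should overwrite existing entries instead of adding duplicates.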

* [ ] load, chunk, split and embed documents about technical credit

In [10]:
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and chunk contents of the blog
bs4_strainer = bs4.SoupStrainer(
    class_=("article-header__section", "article-header__topic-and-issue-section",
            "article-header article-header__title", "article-header__subtitle",
            "article-header__meta",
            "article-table-of-contents",
            "article-contents",
            "article-footer")
)

doc_loader = WebBaseLoader(
    web_paths=("https://cacm.acm.org/opinion/technical-credit/",),
    bs_kwargs=dict(
        parse_only=bs4_strainer
    ),
)

text_documents = doc_loader.load()
# recursive splitter, 7, all splits

USER_AGENT environment variable not set, consider setting it to identify your requests.


* [ ] test the web page document

In [11]:
assert len(text_documents) == 1
print(f"Total characters: {len(text_documents[0].page_content)}")
print(text_documents[0].page_content[:500])

Total characters: 14484
Opinion

Computing Profession 


Balancing initial investment and long-term results in the software development process.


				By Ian Gorton, Alessio Bucaioni, and Patrizio Pelliccione 

Posted Dec 26 2024 



What Is Technical Credit?
Technical Credit in Practice
A Research Agenda for Technical Credit
Conclusion
References
Footnotes




Technical debt (TD) is an established concept in software engineering encompassing an unavoidable side effect of software development.3 It arises due to tight s


* [ ] split the document

In [12]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
text_splits = text_splitter.split_documents(text_documents)

print(f"Split article into {len(text_splits)} sub-documents.")

Split article into 20 sub-documents.
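
To show what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-window model. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs and lines, so its chunk boundaries and chunk count will generally differ. `window_chunks` is a hypothetical helper.

```python
def window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Cut text into fixed windows; each window starts chunk_size - chunk_overlap
    characters after the previous one. Returns (start_index, chunk) pairs."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


demo = window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
# demo == [(0, "abcd"), (2, "cdef"), (4, "efgh"), (6, "ghij")]
```

Each chunk repeats the last two characters of the previous one, which is the role the 200-character overlap plays in the cell above. The start offsets correspond to what `add_start_index=True` records in the metadata.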


* [ ] embed the documents into vector database

In [None]:
document_ids = document_store.add_documents(documents=text_splits)

# RAG System Part
## Customize Prompt
## Define nodes and graphs in the rag system
1. [X] retrieve code and metadata
2. [X] retrieve academic documents
3. [X] send message to LLM

* [ ] design prompt template.

In [13]:
# Define the prompt for question answering.
# It is built locally with ChatPromptTemplate rather than pulled from the LangChain hub.
from langchain_core.prompts import ChatPromptTemplate
import textwrap

prompt = ChatPromptTemplate([
    ("system", "You are an assistant for identifying technical credit. Use the following pieces "
               "of retrieved context to answer the question. If you don't know the answer, just "
               "say that you don't know. Use three sentences maximum and keep the answer concise."),
    ("user", textwrap.dedent("""\
                Here is the description for the tech credit:
                {tech_credit}

                Some documentation about tech credit:
                {context_doc}

                Here is an example code for that tech credit:
                {context_code}

                Here is the code from user:
                {code}

                Question: {question}
                Answer:
                """))
])

* [ ] collect metadata from code

In [14]:
def collect_unique_pairs(documents):
    """
    Collect unique concatenated 'tech_credit: description' strings from document metadata.

    Args:
        documents (list[Document]): A list of Document objects, each with a 'metadata' attribute.

    Returns:
        list[str]: A list of unique 'tech_credit: description' strings.
    """
    seen = set()

    for doc in documents:
        metadata = doc.metadata
        credit = metadata.get("tech_credit")
        description = metadata.get("tech_credit_description")
        if credit and description:
            combined = f"{credit}: {description}"
            seen.add(combined)

    return list(seen)
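
`collect_unique_pairs` returns `list(seen)`, and the iteration order of a set of strings can differ between interpreter runs, so the tech credit lines may reach the prompt in a different order each time. If a stable order matters, an order-preserving variant is one option. The sketch below uses a dict for deduplication. `FakeDoc` is a stand-in for LangChain's `Document` so the example runs on its own, and `collect_unique_pairs_ordered` is a hypothetical name.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for langchain_core.documents.Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def collect_unique_pairs_ordered(documents):
    """Like collect_unique_pairs, but keeps first-seen order."""
    pairs = {}  # dicts preserve insertion order, so keys act as an ordered set
    for doc in documents:
        credit = doc.metadata.get("tech_credit")
        description = doc.metadata.get("tech_credit_description")
        if credit and description:
            pairs.setdefault(f"{credit}: {description}", None)
    return list(pairs)


docs = [
    FakeDoc("a", {"tech_credit": "MVC", "tech_credit_description": "separation of concerns"}),
    FakeDoc("b", {"tech_credit": "Circuit Breaker", "tech_credit_description": "resilience"}),
    FakeDoc("c", {"tech_credit": "MVC", "tech_credit_description": "separation of concerns"}),
    FakeDoc("d", {}),  # no metadata: skipped
]
ordered = collect_unique_pairs_ordered(docs)
```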

In [15]:
from typing_extensions import List, TypedDict

# Define state for application
class State(TypedDict):
    question: str
    context_code: List[Document]
    context_doc: List[Document]
    code: str # user code
    tech_credit: List[str] # metadata of tech_credit and description fetched from code db
    answer: str

# Define application steps
def retrieve(state: State):
    retrieved_codes = vector_store.search(
        state["code"], search_type="similarity_score_threshold", 
        score_threshold=0.5 # set a threshold for similarity on code
    )
    retrieved_metadata = collect_unique_pairs(retrieved_codes)
    return {"context_code": retrieved_codes, "tech_credit": retrieved_metadata}

def retrieve_doc(state: State):
    retrieved_doc = document_store.similarity_search(state["question"])
    return {"context_doc": retrieved_doc}
    
def generate(state: State):
    code_content = "\n\n".join(doc.page_content for doc in state["context_code"])
    tech_credit_content = "\n".join(state["tech_credit"])
    # print("Tech credit description: ", tech_credit_content)
    doc_content = "\n\n".join(doc.page_content for doc in state["context_doc"])
    #print(docs_content)
    messages = prompt.invoke({"question": state["question"], "code": state["code"], 
                              "context_code": code_content, "tech_credit": tech_credit_content, 
                              "context_doc": doc_content})
    response = llm.invoke(messages)
    return {"answer": response.content}
    # return {"answer": "No answer"}


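# Compile application and test
# Both retrieval nodes start from START; generate runs after retrieve.
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_node(retrieve_doc)
graph_builder.add_edge(START, "retrieve")
graph_builder.add_edge(START, "retrieve_doc")
# Make generate explicitly wait for the document retrieval as well.
graph_builder.add_edge("retrieve_doc", "generate")
graph = graph_builder.compile()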
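
The data flow of this graph can be modelled in plain Python: both retrieval nodes read the same input state, their partial updates are merged, and the generate node then reads the merged state. The sketch below shows only that flow with stub nodes. It is not how LangGraph schedules nodes internally, and `run_pipeline` and the `stub_*` functions are hypothetical.

```python
def run_pipeline(state, retrieval_nodes, generate_node):
    """Run all retrieval nodes on the same input, merge their updates, then generate."""
    updates = [node(state) for node in retrieval_nodes]
    merged = dict(state)
    for update in updates:
        merged.update(update)
    merged.update(generate_node(merged))
    return merged


# Stub nodes standing in for retrieve, retrieve_doc and generate.
def stub_retrieve(state):
    return {"context_code": ["class CircuitBreakerState: pass"],
            "tech_credit": ["Circuit Breaker: resilience pattern"]}


def stub_retrieve_doc(state):
    return {"context_doc": ["Technical credit article excerpt"]}


def stub_generate(state):
    found = len(state["context_code"]) + len(state["context_doc"])
    return {"answer": f"{found} context items for: {state['question']}"}


result = run_pipeline({"question": "Is this a tech credit?", "code": "x = 1"},
                      [stub_retrieve, stub_retrieve_doc], stub_generate)
```

Each node returns only the keys it owns, which is why the two retrieval steps can write to the shared state without overwriting each other.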

# Ask Question Part
## Circuit Breaker
## MVC model
## Iterator pattern


* [ ] Ask questions and test RAG

In [18]:
response = graph.invoke({"question": 
                         "Tell me if the following is a tech credit and what do these 3 states do?",
"code": """class CircuitBreakerState(Enum):
    CLOSED = 'CLOSED'
    OPEN = 'OPEN'
    HALF_OPEN = 'HALF_OPEN'
"""})
print(response["answer"])

Yes, this is a technical credit implementation. The three states of the Circuit Breaker represent different stages of system resilience: 

1. CLOSED state: The service is functioning normally and allowing requests to pass through.
2. OPEN state: When failures exceed a threshold, the circuit "breaks" and prevents further requests to avoid cascading failures.
3. HALF_OPEN state: A transitional state where the system tentatively allows a few requests to test if the underlying service has recovered.

This pattern helps prevent system overload and provides a mechanism for graceful degradation during service disruptions, which aligns with the technical credit description of enhancing system resilience in distributed systems.


In [11]:
response = graph.invoke({"question": "Tell me if the following is a tech credit and what do these 3 states do?",
"code": """class myBreaker(Enum):
    ON = 'CLOSED'
    OFF = 'OPEN'
    HALF_ON = 'HALF_OPEN'"""})
print(response["answer"])

Yes, this is a technical credit implementation of a Circuit Breaker pattern. The three states represent different stages of service resilience:

1. ON (CLOSED): Normal operation where the service is functioning correctly and requests are allowed.
2. OFF (OPEN): Service is temporarily disabled after repeated failures to prevent further system damage.
3. HALF_ON (HALF_OPEN): A transitional state where the system cautiously allows a few requests to test if the underlying issue has been resolved.

These states help manage system resilience by dynamically detecting and responding to service failures, preventing cascading issues in distributed systems.


In [20]:
response = graph.invoke({"question": "Tell me if the following is a tech credit?",
"code": """class UserModel:
    def __init__(self):
        self.users = []
    
    def add_user(self, name, age):
        user = {"name": name, "age": age}
        self.users.append(user)
        return user
    
    def get_all_users(self):
        return self.users

class UserView:
    def show_users(self, users):
        print("\n=== User List ===")
        if not users:
            print("No user")
        else:
            for i, user in enumerate(users, 1):
                print(f"{i}. Name: {user['name']}, Age: {user['age']}")
    
    def show_menu(self):
        print("\n=== Menu ===")
        print("1. Add user")
        print("2. List all users")
        print("3. Exit")
        return input("Your option: ")
    
    def get_user_input(self):
        name = input("Name: ")
        age = input("Age: ")
        return name, age
    
    def show_message(self, message):
        print(message)"""})
print(response["answer"])

Yes, this is an implementation of the Model-View-Controller (MVC) technical credit pattern. The code separates concerns into three distinct components: UserModel (data and business logic), UserView (user interface), and the implied UserController (not shown, but suggested by the MVC structure). This approach promotes modularity, makes the code more maintainable, and allows independent development and testing of each component, which aligns perfectly with the MVC technical credit description provided earlier.


In [21]:
response = graph.invoke({"question": "Tell me if the following is a tech credit?",
"code": """# This program prints Hello, world!

print('Hello, world!')"""})
print(response["answer"])

No relevant docs were retrieved using the relevance score threshold 0.5


Based on the provided context about technical credit, the simple "Hello, world!" program does not represent technical credit. Technical credit involves strategic design decisions that provide long-term benefits, such as creating abstraction layers, implementing circuit breakers, or developing reference architectures. This basic print statement does not demonstrate any characteristics of technical credit that would ease future modifications or provide systemic advantages.
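
The warning in this output appears when no stored code chunk reaches the 0.5 relevance threshold, so `context_code` is empty and the answer rests on the document context alone. The filtering idea can be sketched with toy vectors. Cosine similarity is used here as an assumption for illustration. The relevance score Chroma reports is derived from its configured distance metric and may be computed differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_threshold(query_vec, candidates, threshold):
    """Keep (name, score) pairs whose similarity to the query meets the threshold."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in candidates]
    return [(name, score) for name, score in scored if score >= threshold]


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
candidates = [("circuit_breaker", [1.0, 0.0]), ("unrelated", [0.0, 1.0])]
hits = filter_by_threshold([1.0, 0.1], candidates, 0.5)     # close to circuit_breaker
misses = filter_by_threshold([-1.0, 0.0], candidates, 0.5)  # close to nothing
```

The second query mirrors the "Hello, world!" case: every candidate falls below the threshold and the result list is empty.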


# Roadmap:

1. Use a text embedding model.
   + [X] gemini text-embedding-004
2. Code splitter may not work properly.
   + [X] change to code-splitter that works better.
3. [X] Switch from in-memory vector storage to a vector database
   + [X] code vector database with additional metadata for tech credit
   + [X] document vector database with academic context about tech credit
4. [ ] Load more standard code examples for tech credit (~20 more examples)
5. [ ] Load more documents (academic context) for technical credit (3-5 related academic papers)
6. [ ] Allow the user to ask about a whole repository instead of small code snippets
   * [ ] load, chunk, split, embed and vectorize a repo
   * [ ] similarity search and filter user code by similarity score compared with example code
   * [ ] batch process to LLM
   * [ ] organize response and answers
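
For the batch-processing item in the roadmap, repository chunks would first need to be grouped into fixed-size batches before anything is sent to the model. The following sketch covers the grouping step only. `batched` is a hypothetical helper and the batch size of 2 is arbitrary.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Illustrative chunk names; in practice these would be the repository's code chunks.
batches = list(batched(["chunk_a", "chunk_b", "chunk_c", "chunk_d", "chunk_e"], 2))
```

Each batch could then be turned into one input per chunk for the graph, with the answers collected per file afterwards.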