# Preview from LangChain

## Step-by-step code

1. Get the LangSmith API key from the environment.
2. Set up the Anthropic key for the chat model.
3. Set up the embedding model.
4. Select a vector database.
5. Set up the document loader and split the documents.

## Tutorials

LangChain RAG tutorial:

* Part 1: https://python.langchain.com/docs/tutorials/rag
* Part 2 (extends the implementation to accommodate conversation-style interactions and multi-step retrieval processes): https://python.langchain.com/docs/tutorials/qa_chat_history/

LangChain document loaders:

* GitHub repo: https://python.langchain.com/docs/integrations/document_loaders/github/
* Git repository: https://python.langchain.com/docs/integrations/document_loaders/git/
* Source code (e.g. Python): https://python.langchain.com/docs/integrations/document_loaders/source_code/

LangSmith evaluation for a chatbot:

* https://docs.smith.langchain.com/evaluation/tutorials/evaluation

* [X] set up LangSmith key

In [2]:
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter API key for LangSmith to enable tracing: ")


Enter API key for LangSmith to enable tracing:  ········


* [X] set up Anthropic key

In [3]:
if not os.environ.get("ANTHROPIC_API_KEY"):
    os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Enter API key for Anthropic: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("claude-3-5-haiku-latest", model_provider="anthropic")

Enter API key for Anthropic:  ········


* [X] set up Google Gemini as embedding model

In [4]:
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

Enter API key for Google Gemini:  ········
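
The three key-setup cells above repeat the same "prompt only if missing" pattern. A minimal sketch of a shared helper follows; `ensure_env` and its `ask` parameter are hypothetical names introduced here for illustration, not part of LangChain or the notebook.

```python
import getpass
import os
from typing import Callable


def ensure_env(name: str, prompt: str, ask: Callable[[str], str] = getpass.getpass) -> str:
    """Return os.environ[name], prompting for it only when it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        # Only ask the user when the variable is not already set.
        value = ask(prompt)
        os.environ[name] = value
    return value
```

It could then be called as `ensure_env("ANTHROPIC_API_KEY", "Enter API key for Anthropic: ")`. Passing `ask` explicitly makes the helper easy to exercise without an interactive prompt.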


* [X] set up vector database 

# Choosing database

## Candidates

1. Cassandra
2. OpenSearch https://opensearch.org/platform/os-search/vector-database/
3. Pinecone
4. MongoDB
5. PostgreSQL
6. [X] Chroma

In [4]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="python_tech_credit",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

document_store = Chroma(
    collection_name="document_tech_credit",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",
)

* [ ] Import and load a GitHub Repo as a document

In [5]:
import bs4
from langchain import hub
from langchain_community.document_loaders import GithubFileLoader
from langchain_community.document_loaders import JSONLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from code_splitter import Language, TiktokenSplitter
from typing_extensions import List, TypedDict

if not os.environ.get("GITHUB_PERSONAL_ACCESS_TOKEN"):
    os.environ["GITHUB_PERSONAL_ACCESS_TOKEN"] = getpass.getpass("Enter ACCESS_TOKEN for GitHub: ")

# Load and chunk contents of the github repo
loader = GithubFileLoader(
    repo="danielfm/pybreaker",  # the repo name
    branch="main",  # the branch name
    # access_token=ACCESS_TOKEN, # delete/comment out this argument if you've set the access token as an env var.
    github_api_url="https://api.github.com",
    # parser=LanguageParser(language=Language.PYTHON, parser_threshold=200),
    file_filter=lambda file_path: file_path.endswith(
        ".py"
    ),  # load all python files.
)
documents = loader.load()

Enter ACCESS_TOKEN for GitHub:  ········


* [ ] test documents content

In [6]:
print(documents[0].metadata)

{'path': 'src/pybreaker/__init__.py', 'sha': 'c7fa085ff4bd506de069f999bbcdeada74aff4bd', 'source': 'https://api.github.com/danielfm/pybreaker/blob/main/src/pybreaker/__init__.py'}


* [ ] map metadata

In [7]:
import json

# Step 1: Load the JSON metadata from a file (adjust path accordingly)
with open('./repo_metadata.json', 'r', encoding='utf-8') as f:
    json_metadata_list = json.load(f)

# The sample JSON is a list of dicts. The key names must match the ones read
# later from the metadata ("tech_credit", "tech_credit_description"), for example:
# [
#   {"path": "src/pybreaker.py", "type": "source", "tech_credit": "Circuit Breaker", "tech_credit_description": "good design"},
#   {"path": "test/unitest_pybreaker.py", "type": "test", "tech_credit": "Circuit Breaker", "tech_credit_description": "good design"}
# ]

# Step 2: Create a dictionary mapping from path to metadata (excluding 'path' key)
metadata_map = {
    entry['path']: {k: v for k, v in entry.items() if k != 'path'}
    for entry in json_metadata_list
}
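
A small self-contained run of the mapping step may make its behaviour easier to see. The JSON below is a hypothetical sample in the shape described above (the key names are assumptions); it is not the real `repo_metadata.json`.

```python
import json

# Hypothetical sample with the same shape as repo_metadata.json.
sample_json = """
[
  {"path": "src/pybreaker.py", "type": "source",
   "tech_credit": "Circuit Breaker", "tech_credit_description": "good design"},
  {"path": "test/unitest_pybreaker.py", "type": "test",
   "tech_credit": "Circuit Breaker", "tech_credit_description": "good design"}
]
"""

entries = json.loads(sample_json)

# Same comprehension as above: key by path, drop the 'path' key from the value.
sample_map = {
    entry["path"]: {k: v for k, v in entry.items() if k != "path"}
    for entry in entries
}

# A known path returns its metadata without the 'path' key.
print(sample_map["src/pybreaker.py"]["type"])
# An unknown path falls back to an empty dict when looked up with .get().
print(sample_map.get("src/pybreaker/__init__.py", {}))
```

The lookup is an exact string match, so a loaded file whose path is `src/pybreaker/__init__.py` only picks up metadata if the JSON uses that same path.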

* [X] use code splitter

In [8]:
# Use code-splitter to chunk each file, attaching the mapped metadata to every chunk
# https://pypi.org/project/code-splitter/
python_splitter = TiktokenSplitter(Language.Python, max_size=100)
all_splits = [
    Document(
        page_content=chunk.text,
        metadata=metadata_map.get(doc.metadata.get('path'), {})
    )
    for doc in documents
    for chunk in python_splitter.split(doc.page_content.encode("utf-8"))
]
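
Because the metadata lookup falls back to `{}`, a path mismatch between the loader and the JSON file goes unnoticed. A minimal sketch of a check for that is below; `paths_without_metadata` is a hypothetical helper, and it assumes paths are plain strings.

```python
def paths_without_metadata(doc_paths, metadata_map):
    """Return the sorted, de-duplicated paths that have no entry in metadata_map.

    Entries that are None (documents without a 'path') are skipped.
    """
    return sorted({p for p in doc_paths if p is not None and p not in metadata_map})


# Toy data standing in for the loaded documents and the JSON mapping.
toy_map = {"src/pybreaker.py": {"type": "source"}}
toy_paths = ["src/pybreaker.py", "src/pybreaker/__init__.py", "src/pybreaker/__init__.py", None]
print(paths_without_metadata(toy_paths, toy_map))  # ['src/pybreaker/__init__.py']
```

In the notebook this could be called with `[d.metadata.get("path") for d in documents]` before splitting, to see which files will end up with empty metadata.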

* [ ] print page_content and metadata for a split

In [9]:
print(all_splits[3].page_content)
print(all_splits[3].metadata)

__all__ = (
    "CircuitBreaker",
    "CircuitBreakerListener",
    "CircuitBreakerError",
    "CircuitMemoryStorage",
    "CircuitRedisStorage",
    "STATE_OPEN",
    "STATE_CLOSED",
    "STATE_HALF_OPEN",
)

STATE_OPEN = "open"
STATE_CLOSED = "closed"
STATE_HALF_OPEN = "half-open"

T = TypeVar("T")
ExceptionType = TypeVar("ExceptionType", bound=BaseException)


* [ ] load, split and embed documents into vector database

* [x] index chunks for code vector db
* [ ] index chunks for documentation vector db

In [None]:
# Index chunks
_ = vector_store.add_documents(documents=all_splits)
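
Since the store is persisted to disk, re-running the indexing cell without explicit IDs is likely to add the same chunks again. One possible remedy, sketched under the assumption that the store accepts an `ids` argument alongside the documents, is to derive deterministic IDs from the chunk text. `stable_ids` is a hypothetical helper, not a LangChain or Chroma function.

```python
import hashlib
from collections import Counter


def stable_ids(texts):
    """Return one deterministic ID per text; repeated texts get a numeric suffix."""
    seen = Counter()
    ids = []
    for text in texts:
        # Short content hash, identical across runs for identical text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        # The suffix keeps IDs unique when the same text occurs more than once.
        ids.append(f"{digest}-{seen[digest]}")
        seen[digest] += 1
    return ids
```

A call might then look like `vector_store.add_documents(documents=all_splits, ids=stable_ids([d.page_content for d in all_splits]))`. The IDs stay stable only as long as the chunk text and the order of repeated chunks do not change.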

* [ ] load, chunk, split and embed documents about technical credit

In [None]:
# Load and chunk contents of the blog
from langchain_community.document_loaders import WebBaseLoader

doc_loader = WebBaseLoader(
    web_paths=("https://cacm.acm.org/opinion/technical-credit/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=()  # CSS class names to keep (left empty here)
        )
    ),
)

# TODO: recursive splitter, 7, all splits

* [ ] design prompt template.

In [None]:
# Define prompt for question-answering
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate([
    ("system", "You are an assistant for identifying technical credit. Use the following pieces "
               "of retrieved context to answer the question. If you don't know the answer, just "
               "say that you don't know. Use three sentences maximum and keep the answer concise."),
    ("user", """Here is the description for the tech credit:
                {tech_credit}

                Some documentation about tech credit:
                {context_doc}

                Here is an example code for that tech credit:
                {context_code}

                Here is the code from user:
                {code}

                Question: {question}
                Answer:""")
])
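
To see what the user message looks like once the five placeholders are filled, here is a rough stand-in that uses plain `str.format` instead of `ChatPromptTemplate`. The template text mirrors the one above; `USER_TEMPLATE`, `render_user_message` and the sample values are hypothetical.

```python
from textwrap import dedent

# Plain-string copy of the user message, for illustration only.
USER_TEMPLATE = dedent("""\
    Here is the description for the tech credit:
    {tech_credit}

    Some documentation about tech credit:
    {context_doc}

    Here is an example code for that tech credit:
    {context_code}

    Here is the code from user:
    {code}

    Question: {question}
    Answer:""")


def render_user_message(**values):
    """Fill every placeholder; a missing value raises KeyError."""
    return USER_TEMPLATE.format(**values)


message = render_user_message(
    tech_credit="Circuit Breaker: good design",
    context_doc="(retrieved documentation text)",
    context_code="(retrieved example code)",
    code="class CircuitBreakerState(Enum): ...",
    question="Is this a circuit breaker component?",
)
print(message)
```

All five values are required, which matches the five keys passed to `prompt.invoke` in the generate step.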

* [ ] collect metadata from code

In [2]:
def collect_unique_pairs(documents):
    """
    Collect unique concatenated 'tech_credit: description' strings from document metadata.

    Args:
        documents (list[Document]): A list of Document objects, each with a 'metadata' attribute.

    Returns:
        list[str]: A list of unique 'tech_credit: description' strings.
    """
    seen = set()

    for doc in documents:
        metadata = doc.metadata or {}
        credit = metadata.get("tech_credit")
        description = metadata.get("tech_credit_description")
        if credit and description:
            combined = f"{credit}: {description}"
            seen.add(combined)

    return list(seen)
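
The de-duplication can be tried without a vector store. The sketch below is a variant that reads `metadata` as an attribute (as on LangChain `Document` objects) and keeps first-seen order; `FakeDocument` and `unique_credit_pairs` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the same two attributes as a Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_credit_pairs(documents):
    """Collect 'tech_credit: description' strings, keeping first-seen order."""
    pairs = {}  # dict keys preserve insertion order and drop duplicates
    for doc in documents:
        credit = doc.metadata.get("tech_credit")
        description = doc.metadata.get("tech_credit_description")
        if credit and description:
            pairs[f"{credit}: {description}"] = None
    return list(pairs)


docs = [
    FakeDocument("a", {"tech_credit": "Circuit Breaker", "tech_credit_description": "good design"}),
    FakeDocument("b", {"tech_credit": "Circuit Breaker", "tech_credit_description": "good design"}),
    FakeDocument("c", {}),  # chunks without metadata are ignored
]
print(unique_credit_pairs(docs))  # ['Circuit Breaker: good design']
```

Keeping insertion order makes the resulting prompt text reproducible between runs, which a plain `set` does not guarantee.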

In [1]:
# Define state for application
class State(TypedDict):
    question: str
    context_code: List[Document]
    context_doc: List[Document]
    code: str # user code
    tech_credit: List[str] # metadata of tech_credit and description fetched from code db
    answer: str

# Define application steps
def retrieve(state: State):
    retrieved_codes = vector_store.similarity_search(state["code"])
    retrieved_metadata = collect_unique_pairs(retrieved_codes)
    return {"context_code": retrieved_codes, "tech_credit": retrieved_metadata}

def retrieve_doc(state: State):
    retrieved_doc = document_store.similarity_search(state["question"])
    return {"context_doc": retrieved_doc}
    
def generate(state: State):
    code_content = "\n\n".join(doc.page_content for doc in state["context_code"])
    tech_credit_content = "\n".join(state["tech_credit"])
    doc_content = "\n\n".join(doc.page_content for doc in state["context_doc"])
    #print(docs_content)
    messages = prompt.invoke({"question": state["question"], "code": state["code"], 
                              "context_code": code_content, "tech_credit": tech_credit_content, 
                              "context_doc": doc_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_node(retrieve_doc)
graph_builder.add_edge(START, "retrieve")
graph_builder.add_edge(START, "retrieve_doc")
graph = graph_builder.compile()

NameError: name 'TypedDict' is not defined
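
The steps above each return a partial dict that is merged into the shared state. The following plain-Python sketch illustrates only that merge idea; it is not LangGraph, it runs the steps sequentially, and all `fake_*` functions are hypothetical stand-ins for the real retrieval and generation steps.

```python
def run_steps(state, steps):
    """Apply each step in order, merging the partial dict it returns into the state."""
    state = dict(state)  # work on a copy so the caller's input is untouched
    for step in steps:
        state.update(step(state))
    return state


def fake_retrieve(state):
    # Reads "code", like the retrieve step.
    return {"context_code": [f"code similar to: {state['code']}"]}


def fake_retrieve_doc(state):
    # Reads "question", like the retrieve_doc step.
    return {"context_doc": [f"docs about: {state['question']}"]}


def fake_generate(state):
    # Needs the output of both retrieval steps.
    return {
        "answer": f"{len(state['context_code'])} code hit(s), "
                  f"{len(state['context_doc'])} doc hit(s)"
    }


final = run_steps(
    {"question": "q", "code": "c"},
    [fake_retrieve, fake_retrieve_doc, fake_generate],
)
print(final["answer"])  # 1 code hit(s), 1 doc hit(s)
```

Because the code retrieval step reads `state["code"]`, the input needs both a `"question"` and a `"code"` key; leaving `"code"` out raises a `KeyError` in this sketch.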

* [ ] Ask questions and test RAG

In [11]:
response = graph.invoke({"question": "Explain the Class CircuitBreaker"})
print(response["answer"])

The CircuitBreaker is a design pattern that prevents a system from repeatedly trying operations that are likely to fail. In its "closed" state, the circuit breaker executes operations normally, tracking failures and tripping (opening the circuit) when a failure threshold is exceeded. When the circuit is open, it prevents further attempts to execute the operation, helping to protect the system from repeated failures.


In [21]:
response = graph.invoke({"question": "Tell me if the following is a component of circuit breaker and what does these three state do?",
"code": """
class CircuitBreakerState(Enum):
    CLOSED = 'CLOSED'
    OPEN = 'OPEN'
    HALF_OPEN = 'HALF_OPEN'
"""})
print(response["answer"])

Yes, these are standard states of a circuit breaker implementation. CLOSED represents the normal operating state where requests are allowed. OPEN represents the state where requests are blocked due to failures, and HALF_OPEN is a transitional state where a single request is allowed to test if the system has recovered.


In [19]:
response = graph.invoke({"question": """Tell me if the following is a component of circuit breaker and what does 
these three state do?
class CircuitBreakerState(Enum):
    CLOSED = 'CLOSED'
    OPEN = 'OPEN'
    HALF_OPEN = 'HALF_OPEN'
"""})
print(response["answer"])

Yes, this is a component of the circuit breaker pattern defining the possible states of a circuit breaker. The states represent different conditions of the circuit breaker: CLOSED means the circuit is functioning normally and allowing requests, OPEN indicates the circuit is blocked and preventing requests after repeated failures, and HALF_OPEN represents a transitional state where the circuit allows a limited number of requests to test if the underlying issue has been resolved.


## Known problems

1. The Gemini text embedding model may not be optimal.
2. The code splitter may not work properly.
   + [X] change to code-splitter that works better.
3. Switch from in-memory vector storage to a vector database.
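
One way to put a number on the first problem is to compare embedding models on a handful of hand-labelled queries. The sketch below computes a hit rate at k from precomputed vectors; the function names and the toy vectors are hypothetical, and real vectors would come from whichever embedding model is being compared.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hit_rate_at_k(query_vecs, chunk_vecs, expected, k=1):
    """Fraction of queries whose expected chunk index is among the k most similar chunks."""
    hits = 0
    for query, want in zip(query_vecs, expected):
        ranked = sorted(
            range(len(chunk_vecs)),
            key=lambda i: cosine(query, chunk_vecs[i]),
            reverse=True,
        )
        hits += want in ranked[:k]
    return hits / len(expected)


# Toy vectors: two chunks and two queries, each query closest to one chunk.
chunks = [[1.0, 0.0], [0.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8]]
print(hit_rate_at_k(queries, chunks, expected=[0, 1], k=1))  # 1.0
```

With LangChain embeddings, the chunk vectors could be obtained with `embeddings.embed_documents(...)` and each query vector with `embeddings.embed_query(...)`; running the same labelled queries against two models gives a rough comparison.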