# Preview from LangChain

## Step-by-step code

1. Set the LangSmith API key in the environment to enable tracing.
2. Set the Anthropic API key for the chat model.
3. Set up the embedding model.
4. Select a vector database.
5. Set up the document loader and split the documents into chunks.

## Tutorials
LangChain RAG tutorial document
Part 1
https://python.langchain.com/docs/tutorials/rag

Part 2
extends the implementation to accommodate conversation-style interactions and
multi-step retrieval processes.
https://python.langchain.com/docs/tutorials/qa_chat_history/

LangChain document loader for GitHub Repo
https://python.langchain.com/docs/integrations/document_loaders/github/

LangChain document loader for Git Repository
https://python.langchain.com/docs/integrations/document_loaders/git/

LangChain document loader for Source Code (e.g. Python)
https://python.langchain.com/docs/integrations/document_loaders/source_code/

LangSmith evaluation for a chatbot
https://docs.smith.langchain.com/evaluation/tutorials/evaluation

* [X] set up LangSmith key

In [1]:
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter API key for LangSmith to enable tracing: ")


Enter API key for LangSmith to enable tracing:  ········


* [X] set up Anthropic key

In [2]:
if not os.environ.get("ANTHROPIC_API_KEY"):
    os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Enter API key for Anthropic: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("claude-3-5-haiku-latest", model_provider="anthropic")

Enter API key for Anthropic:  ········


* [X] set up Google Gemini as embedding model

In [3]:
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

Enter API key for Google Gemini:  ········
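
The three cells above repeat the same pattern: use the key already in the environment, otherwise prompt for it. A minimal sketch of that pattern as one helper follows. The name `ensure_env` and its `environ` and `ask` parameters are hypothetical (they are not part of LangChain) and exist only so the helper can be exercised without a real prompt.

```python
import getpass
import os


def ensure_env(name, prompt, environ=os.environ, ask=getpass.getpass):
    """Return environ[name], prompting for a value only if it is missing or empty."""
    value = environ.get(name)
    if not value:
        # Hypothetical helper: store the prompted secret so later cells can read it.
        value = ask(prompt)
        environ[name] = value
    return value


# Possible usage in the notebook (prompts only when the variable is unset):
# ensure_env("ANTHROPIC_API_KEY", "Enter API key for Anthropic: ")
# ensure_env("GOOGLE_API_KEY", "Enter API key for Google Gemini: ")
```

Passing a plain dict as `environ` and a stub as `ask` makes it possible to check the behavior without touching the real environment.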


* [X] set up vector database 

# Choosing database

## Candidates

1. Cassandra
2. OpenSearch https://opensearch.org/platform/os-search/vector-database/
3. Pinecone
4. MongoDB
5. PostgreSQL
6. [X] Chroma

In [4]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="python_tech_credit",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

* [ ] Import and load a GitHub Repo as a document

In [5]:
from langchain import hub
from langchain_community.document_loaders import GithubFileLoader
from langchain_core.documents import Document
from langchain_community.document_loaders.parsers import LanguageParser
from langgraph.graph import START, StateGraph
from code_splitter import Language, TiktokenSplitter
from typing_extensions import List, TypedDict

if not os.environ.get("GITHUB_PERSONAL_ACCESS_TOKEN"):
    os.environ["GITHUB_PERSONAL_ACCESS_TOKEN"] = getpass.getpass("Enter ACCESS_TOKEN for GitHub: ")

# Load and chunk contents of the github repo
loader = GithubFileLoader(
    repo="danielfm/pybreaker",  # the repo name
    branch="main",  # the branch name
    # access_token=ACCESS_TOKEN, # delete/comment out this argument if you've set the access token as an env var.
    github_api_url="https://api.github.com",
    # parser=LanguageParser(language=Language.PYTHON, parser_threshold=200),
    file_filter=lambda file_path: file_path.endswith(".py"),  # load all Python files
)
documents = loader.load()

Enter ACCESS_TOKEN for GitHub:  ········


* [ ] test documents content

In [11]:
print(documents[0].metadata)

{'path': 'src/pybreaker/__init__.py', 'sha': 'c7fa085ff4bd506de069f999bbcdeada74aff4bd', 'source': 'https://api.github.com/danielfm/pybreaker/blob/main/src/pybreaker/__init__.py'}


* [ ] map metadata

In [7]:
import json

# Step 1: Load the JSON metadata from a file (adjust path accordingly)
with open('./repo_metadata.json', 'r', encoding='utf-8') as f:
    json_metadata_list = json.load(f)

# The JSON file is expected to hold a list of dicts, for example:
# [
#   {"path": "src/pybreaker.py", "type": "source", "tech-credit": "Circuit Breaker"},
#   {"path": "test/unitest_pybreaker.py", "type": "test", "tech-credit": "Circuit Breaker"}
# ]

# Step 2: Create a dictionary mapping from path to metadata (excluding 'path' key)
metadata_map = {
    entry['path']: {k: v for k, v in entry.items() if k != 'path'}
    for entry in json_metadata_list
}
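
The mapping step above can be tried without the JSON file. The sketch below rebuilds the map from the sample entries shown in the comment and lists loaded paths that have no entry; such paths would later fall back to empty metadata. The names `build_metadata_map`, `unmapped_paths`, `metadata_map_example`, and `loaded_paths` are illustrative assumptions, not part of the notebook.

```python
import json

# Sample entries copied from the comment in the cell above.
sample_json = """
[
  {"path": "src/pybreaker.py", "type": "source", "tech-credit": "Circuit Breaker"},
  {"path": "test/unitest_pybreaker.py", "type": "test", "tech-credit": "Circuit Breaker"}
]
"""


def build_metadata_map(entries):
    """Map each entry's path to its remaining fields (everything except 'path')."""
    return {
        entry["path"]: {k: v for k, v in entry.items() if k != "path"}
        for entry in entries
    }


def unmapped_paths(paths, metadata_map):
    """Return the paths that have no entry in metadata_map, in input order."""
    return [p for p in paths if p not in metadata_map]


metadata_map_example = build_metadata_map(json.loads(sample_json))

# The second path is the one printed from documents[0].metadata earlier.
loaded_paths = ["src/pybreaker.py", "src/pybreaker/__init__.py"]
print(unmapped_paths(loaded_paths, metadata_map_example))
# prints ['src/pybreaker/__init__.py']
```

Running a check like this against `[doc.metadata["path"] for doc in documents]` before splitting would show which files need an entry in `repo_metadata.json`.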

* [X] use code splitter

In [9]:
# Use code-splitter to split each file along syntax boundaries
# https://pypi.org/project/code-splitter/
python_splitter = TiktokenSplitter(Language.Python, max_size=100)
all_splits = [
    Document(
        page_content=chunk.text,
        # Attach the mapped metadata for this file; empty dict if the path is unmapped
        metadata=metadata_map.get(doc.metadata.get("path"), {}),
    )
    for doc in documents
    # The splitter takes the source as bytes
    for chunk in python_splitter.split(doc.page_content.encode("utf-8"))
]

In [14]:
print(all_splits[3].metadata)

{'type': 'source', 'tech-credit': 'Circuit Breaker'}
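
As the output shows, each chunk carries only the mapped fields, so the file it came from is no longer recorded. One possible variation, sketched here with plain dicts, merges the loader's metadata with the mapped fields. The helper name `merge_metadata` is hypothetical and the sample values are copied from the outputs above.

```python
def merge_metadata(loader_metadata, mapped_metadata):
    """Combine loader fields with mapped fields; mapped values win on key clashes."""
    return {**loader_metadata, **mapped_metadata}


# Values taken from the printed outputs earlier in the notebook.
loader_metadata = {
    "path": "src/pybreaker/__init__.py",
    "sha": "c7fa085ff4bd506de069f999bbcdeada74aff4bd",
}
mapped_metadata = {"type": "source", "tech-credit": "Circuit Breaker"}

merged = merge_metadata(loader_metadata, mapped_metadata)
print(merged["path"], merged["type"])
```

In the splitting cell this would amount to passing `merge_metadata(doc.metadata, metadata_map.get(doc.metadata.get("path"), {}))` as the chunk metadata, assuming the extra fields are wanted in the vector store.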


In [10]:
# Index chunks
_ = vector_store.add_documents(documents=all_splits)

# Define prompt for question-answering
# N.B. for non-US LangSmith endpoints, you may need to specify
# api_url="https://api.smith.langchain.com" in hub.pull.
prompt = hub.pull("rlm/rag-prompt")

# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# Define application steps
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    # print([doc.metadata for doc in retrieved_docs[:8]])
    # print(retrieved_docs[:6])
    return {"context": retrieved_docs}

def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
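
To make the data flow of the two steps easier to follow, here is a plain-Python sketch with no LangGraph, vector store, or LLM involved. It assumes that each step returns a partial state update that is merged into the running state, which is a simplification of what the compiled graph does. `run_sequence`, `fake_retrieve`, `fake_generate`, and the tiny corpus are stand-ins invented for illustration.

```python
def run_sequence(steps, state):
    """Run steps in order, merging each step's returned dict into the state."""
    state = dict(state)  # copy so the caller's dict is not mutated
    for step in steps:
        state.update(step(state))
    return state


def fake_retrieve(state):
    """Stand-in for similarity search: keep texts containing any question word."""
    corpus = ["class CircuitBreaker: ...", "def helper(): pass"]
    words = state["question"].lower().split()
    hits = [text for text in corpus if any(w in text.lower() for w in words)]
    return {"context": hits}


def fake_generate(state):
    """Stand-in for the LLM call: report how many chunks were used."""
    return {"answer": f"Answer based on {len(state['context'])} chunk(s)"}


result = run_sequence([fake_retrieve, fake_generate], {"question": "explain CircuitBreaker"})
print(result["answer"])
# prints: Answer based on 1 chunk(s)
```

The real `retrieve` and `generate` functions follow the same contract: read what they need from the state and return only the keys they produce.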

Ask questions and test RAG

In [11]:
response = graph.invoke({"question": "Explain the Class CircuitBreaker"})
print(response["answer"])

[Document(id='99c5b2ab-ccf9-44dd-8b74-8af121ae67a0', metadata={'tech-credit': 'Circuit Breaker'}, page_content='class CircuitClosedState(CircuitBreakerState):\n    """In the normal "closed" state, the circuit breaker executes operations as\n    usual. If the call succeeds, nothing happens. If it fails, however, the\n    circuit breaker makes a note of the failure.\n\n    Once the number of failures exceeds a threshold, the circuit breaker trips\n    and "opens" the circuit.\n    """'), Document(id='f1986f2c-590e-4521-bb57-392ce8596e82', metadata={'tech-credit': 'Circuit Breaker'}, page_content='class CircuitBreakerListener:\n    """Listener class used to plug code to a ``CircuitBreaker`` instance when certain events happen."""\n\n    def before_call(self, cb: CircuitBreaker, func: Callable[..., T], *args: Any, **kwargs: Any) -> None:\n        """This callback function is called before the circuit breaker `cb` calls `fn`."""'), Document(id='5906d166-22ce-40b7-9ec7-915f7cc2a5a3', metadat

## Known problems:

1. The Gemini text embedding model may not be optimal.
2. The code splitter may not work properly.
   + [X] Change to code-splitter, which works better.
3. Switch from in-memory vector storage to a vector database.