## Simple Gen AI app using LangChain.

### Load all the required keys

We need the following keys and settings:
- `OPENAI_API_KEY` - used to call the OpenAI LLM that generates the response
- `LANGCHAIN_API_KEY` - used to send trace data to LangSmith
- `LANGCHAIN_TRACING_V2` - enables LangSmith tracing
- `LANGCHAIN_PROJECT` - name of the LangSmith project that collects the traces

In [1]:
import os
from dotenv import load_dotenv

# Load the environment variables in .env file
load_dotenv()

OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
print(f'OPENAI_API_KEY: {OPENAI_API_KEY[:10]}***{OPENAI_API_KEY[-3:]}')

# We will use LANGCHAIN_API_KEY for LangSmith tracking
LANGCHAIN_API_KEY = os.environ['LANGCHAIN_API_KEY']
print(f'LANGCHAIN_API_KEY: {LANGCHAIN_API_KEY[:10]}***{LANGCHAIN_API_KEY[-3:]}')

# Enable LangSmith tracing and read the project name
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
LANGCHAIN_PROJECT = os.environ['LANGCHAIN_PROJECT']
print(f'LANGCHAIN_PROJECT: {LANGCHAIN_PROJECT}')

OPENAI_API_KEY: sk-proj-ID***tIA
LANGCHAIN_API_KEY: lsv2_pt_64***43b
LANGCHAIN_PROJECT: LangChain tutorial get started
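
The cell above raises a bare `KeyError` when a variable is missing and repeats the masking logic inline. A minimal sketch of two reusable helpers follows. The names `mask_secret` and `require_env` are hypothetical and are not part of LangChain or python-dotenv.

```python
import os


def mask_secret(value: str, head: int = 10, tail: int = 3) -> str:
    """Show only the first `head` and last `tail` characters of a secret."""
    if len(value) <= head + tail:
        # Too short to mask safely: hide everything.
        return '***'
    return f'{value[:head]}***{value[-tail:]}'


def require_env(name: str) -> str:
    """Return the variable's value or fail with a readable message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f'Missing environment variable: {name}')
    return value
```

With these helpers, a line such as `print(mask_secret(require_env('OPENAI_API_KEY')))` could replace the inline slicing.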


### Retrieve information from a website

We will read the entire content of a web page. To do that we will use the `beautifulsoup4` library, which helps us **scrape** the page's data.

#### Data Ingestion

Data ingestion means **scraping** the data from the web page and loading it into LangChain documents.
For this we will use `WebBaseLoader`, which takes a URL and uses `beautifulsoup4` to retrieve its content as LangChain documents.

In [2]:
from langchain_community.document_loaders import WebBaseLoader

url = 'https://docs.smith.langchain.com/prompt_engineering/tutorials/prompt_commit'

web_loader = WebBaseLoader(web_path=url)
documents = web_loader.load()
documents

USER_AGENT environment variable not set, consider setting it to identify your requests.


[Document(metadata={'source': 'https://docs.smith.langchain.com/prompt_engineering/tutorials/prompt_commit', 'title': 'How to Sync Prompts with GitHub | 🦜️🛠️ LangSmith', 'description': 'LangSmith provides a collaborative interface to create, test, and iterate on prompts.', 'language': 'en'}, page_content='\n\n\n\n\nHow to Sync Prompts with GitHub | 🦜️🛠️ LangSmith\n\n\n\n\n\n\n\n\nSkip to main contentOur new LangChain Academy Course Deep Research with LangGraph is now live! Enroll for free.API ReferenceRESTPythonJS/TSSearchRegionUSEUGo to AppGet StartedObservabilityEvaluationPrompt EngineeringQuickstartsTutorialsOptimize a classifierSync Prompts with GitHubHow-to GuidesCreate a promptRun the playground against a custom LangServe model serverRun the playground against an OpenAI-compliant model provider/proxyUpdate a promptManage prompts programmaticallyManaging Prompt SettingsCommit TagsOpen a prompt from a tracePublic prompt hubPrompt CanvasInclude multimodal content in a promptTrigger 

We loaded the document, but its content is large. We cannot pass all of it to the LLM because the model's context window is limited, so we need to split it into **chunks**.

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
splitted_documents = text_splitter.split_documents(documents)

print(f'Number of chunks: {len(splitted_documents)}')
chunk_sizes = [len(s.page_content) for s in splitted_documents]
min_chunk_size = min(chunk_sizes)
max_chunk_size = max(chunk_sizes)

print(f'Minimum chunk size: {min_chunk_size}')
print(f'Maximum chunk size: {max_chunk_size}')

Number of chunks: 50
Minimum chunk size: 48
Maximum chunk size: 498
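
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of a fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses: that splitter breaks text on separators such as paragraphs, lines and spaces and merges the pieces up to `chunk_size`, which is why the real chunks above vary in length. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # The current window already reaches the end of the text.
            break
    return chunks


demo_chunks = split_with_overlap('abcdefghij', chunk_size=4, chunk_overlap=2)
print(demo_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk, at the cost of storing some text twice.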


### Converting documents into vectors

In a Q&A chatbot, document Q&A, or RAG application, we find the chunks most relevant to a question by comparing embedding vectors, typically with **cosine similarity**, and pass those chunks to the LLM as **context**.
To convert our documents into vectors we need an embedding model. For this tutorial we will use OpenAI embeddings.

After we embed all the documents we will store them in a **FAISS** vector store.
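
As a small illustration of the similarity measure, the sketch below computes cosine similarity on toy 2-dimensional vectors in plain Python. It also shows how cosine similarity relates to the squared L2 distance that a flat L2 FAISS index (such as the `IndexFlatL2` created later in this tutorial) returns: for unit-length vectors, squared L2 distance equals `2 - 2 * cosine`, so both give the same ranking. The vectors are made up for the example, and the unit-length assumption is stated in the comments.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance (0.0 means identical vectors)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy unit-length vectors (assumption: real embeddings are also unit length).
query_vec = [1.0, 0.0]
close_doc = [0.6, 0.8]
far_doc = [0.0, 1.0]

for name, vec in [('close_doc', close_doc), ('far_doc', far_doc)]:
    cos = cosine_similarity(query_vec, vec)
    dist = squared_l2(query_vec, vec)
    # For unit vectors: dist == 2 - 2 * cos
    print(f'{name}: cosine={cos:.2f}, squared L2={dist:.2f}')
```

A higher cosine similarity corresponds to a lower distance, so a distance score should be read as "lower is more similar".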

In [4]:
from langchain_openai.embeddings import OpenAIEmbeddings

model_name = 'text-embedding-3-small'
embeddings = OpenAIEmbeddings(model=model_name)

In [5]:
import os
import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores.utils import DistanceStrategy

db_folder = './db/faiss'
faiss_db_index = 'simple-app-faiss-store'

index_path = os.path.join(db_folder, f'{faiss_db_index}.faiss')

if not os.path.exists(index_path):
    print(f'Creating a FAISS database in dir {db_folder} and index name {faiss_db_index}')
    # Flat (exact) L2 index sized to the embedding dimension.
    # Note: it returns squared L2 distances; for unit-length embeddings
    # the resulting ranking matches the cosine-similarity ranking.
    index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))

    faiss_db = FAISS(embeddings,
                     index=index,
                     docstore=InMemoryDocstore(),
                     index_to_docstore_id={},
                     distance_strategy=DistanceStrategy.COSINE)
    # Add and persist the chunks only when the index is first created;
    # adding them again after every load would store duplicate vectors.
    faiss_db.add_documents(splitted_documents)
    faiss_db.save_local(folder_path=db_folder, index_name=faiss_db_index)
else:
    print(f'Loading FAISS database from {db_folder} with index name {faiss_db_index}')
    faiss_db = FAISS.load_local(folder_path=db_folder, embeddings=embeddings,
                                index_name=faiss_db_index,
                                allow_dangerous_deserialization=True)

Loading FAISS database from ./db/faiss with index name simple-app-faiss-store


### Query the **FAISS** database

In [6]:
query = 'How LangSmith webhooks interact with GitHub?'

K = 4

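result = faiss_db.similarity_search_with_score(query=query, k=K)
# Each item is a (document, score) pair; with an L2 index the score is a
# distance, so a lower score means a more similar document.
for i, (doc, score) in enumerate(result):
    print(f'Similarity {i + 1}: {doc}')
    print(f'Score: {score}')
    print('\n')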

Similarity 1: page_content='LangSmith webhooks don't directly interact with GitHub—they call an intermediary server that you create.
This server requires a GitHub PAT to authenticate and make commits to your repository.
Must include the repo scope (public_repo is sufficient for public repositories).
Go to GitHub > Settings > Developer settings > Personal access tokens > Tokens (classic).
Click Generate new token (classic).
Name it (e.g., "LangSmith Prompt Sync"), set an expiration, and select the required scopes.' metadata={'source': 'https://docs.smith.langchain.com/prompt_engineering/tutorials/prompt_commit', 'title': 'How to Sync Prompts with GitHub | 🦜️🛠️ LangSmith', 'description': 'LangSmith provides a collaborative interface to create, test, and iterate on prompts.', 'language': 'en'}
Score: 0.4748978614807129


Similarity 2: page_content='LangSmith webhooks don't directly interact with GitHub—they call an intermediary server that you create.
This server requires a GitHub PAT to 

### Create LLM

In [7]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model='gpt-4o', api_key=OPENAI_API_KEY)
print(llm)

client=<openai.resources.chat.completions.completions.Completions object at 0x10f3be120> async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x10f3be510> root_client=<openai.OpenAI object at 0x10f2796d0> root_async_client=<openai.AsyncOpenAI object at 0x10f27afd0> model_name='gpt-4o' model_kwargs={} openai_api_key=SecretStr('**********')


### Create a **Retrieval chain**

To answer a question, the LLM needs the retrieved documents as part of its prompt. For that purpose we will use the `create_stuff_documents_chain` function from the `langchain.chains.combine_documents` module.

The documents chain formats the documents it receives and fills them into the `{context}` input variable of the prompt template.
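
Conceptually, "stuffing" means concatenating the text of every document into one string that becomes `{context}`. The sketch below shows the idea in plain Python. `Doc` is a hypothetical stand-in for LangChain's `Document`, and the `'\n\n'` separator is an assumption chosen to mirror what is, to my knowledge, the default of `create_stuff_documents_chain`.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def stuff_documents(docs: list[Doc], separator: str = '\n\n') -> str:
    """Join the text of all documents into a single context string."""
    return separator.join(doc.page_content for doc in docs)


demo_docs = [
    Doc('LangSmith webhooks call an intermediary server.'),
    Doc('The server requires a GitHub PAT.'),
]
demo_context = stuff_documents(demo_docs)
print(demo_context)
```

Because everything is placed in one prompt, this approach only works while the combined chunks fit in the model's context window.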

In [8]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import ChatPromptTemplate

prompt_template = """
Answer the following question based only on the provided context:
<context> {context} </context>
"""

prompt = ChatPromptTemplate.from_template(template=prompt_template)

documents_chain = create_stuff_documents_chain(llm=llm, prompt=prompt)
print(f'Documents chain: {documents_chain}')

Documents chain: bound=RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableLambda(format_docs)
}), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
| ChatPromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template='\nAnswer the following question based only on the provided context:\n<context> {context} </context>\n'), additional_kwargs={})])
| ChatOpenAI(client=<openai.resources.chat.completions.completions.Completions object at 0x10f3be120>, async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x10f3be510>, root_client=<openai.OpenAI object at 0x10f2796d0>, root_async_client=<openai.AsyncOpenAI object at 0x10f27afd0>, model_name='gpt-4o', model_kwargs={}, openai_api_key=SecretStr('**********'))
| StrOutputParser() kwargs={} config={'run_name': 'stuff_documents_cha

In [9]:
from langchain_core.documents import Document

print(f'Calling LLM with query: "{query}" ...')
result = documents_chain.invoke(input={
    'input': query,
    'context': [Document(page_content="""LangSmith webhooks don't directly interact with GitHub—they call an intermediary server that you create.
This server requires a GitHub PAT to authenticate and make commits to your repository.
Must include the repo scope (public_repo is sufficient for public repositories).""")]
})

print('Result:')
print(result)

Calling LLM with query: "How LangSmith webhooks interact with GitHub?" ...
Result:
How do LangSmith webhooks interact with GitHub repositories according to the provided context?

LangSmith webhooks do not interact directly with GitHub repositories. Instead, they call an intermediary server, which you must create. This server uses a GitHub Personal Access Token (PAT) with at least the `public_repo` scope to authenticate and make commits to your repository.
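
Note that the result above begins with a restated question. The prompt template only contains a `{context}` placeholder, so the `input` value passed to `invoke` does not appear to reach the model, which then has to guess the question from the context. A hedged sketch of a template that carries both values follows. It uses plain `str.format` for illustration; `ChatPromptTemplate.from_template` uses the same `{name}` placeholder syntax, so the same string should be usable there. The name `qa_prompt_template` is hypothetical.

```python
qa_prompt_template = """
Answer the following question based only on the provided context:
<context> {context} </context>

Question: {input}
"""

rendered_prompt = qa_prompt_template.format(
    context="LangSmith webhooks don't directly interact with GitHub.",
    input='How LangSmith webhooks interact with GitHub?',
)
print(rendered_prompt)
```

With an `{input}` placeholder in the template, the model answers the user's actual question instead of inventing one.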


However, we want the documents to come from a retriever rather than being passed in by hand. That way, the retriever can dynamically select the most relevant documents for a given question and pass them in.

But what exactly is a **Retriever**?

We have a vector store that holds the embeddings of all our chunks. A retriever is an interface on top of it: given an input query, it returns the most relevant documents from the vector store.

Each vector store has an `as_retriever()` method, which returns the store wrapped in a retriever interface.

From the documentation: `Return VectorStoreRetriever initialized from this VectorStore.`
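
To make the idea concrete, here is a toy sketch of the retriever pattern in plain Python. None of these classes are LangChain's: `ToyVectorStore`, `ToyRetriever` and `toy_embed` are hypothetical, and the "embedding" is a crude keyword check used only so the example runs without an API key.

```python
from dataclasses import dataclass
from typing import Callable


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


@dataclass
class ToyVectorStore:
    texts: list[str]
    vectors: list[list[float]]

    def similarity_search(self, query_vector: list[float], k: int) -> list[str]:
        # Rank stored texts by dot product with the query vector.
        ranked = sorted(
            zip(self.texts, self.vectors),
            key=lambda pair: dot(query_vector, pair[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

    def as_retriever(self, embed: Callable[[str], list[float]], k: int = 4):
        # Wrap the store in an object that only exposes "query in, docs out".
        return ToyRetriever(store=self, embed=embed, k=k)


@dataclass
class ToyRetriever:
    store: ToyVectorStore
    embed: Callable[[str], list[float]]
    k: int

    def invoke(self, query: str) -> list[str]:
        return self.store.similarity_search(self.embed(query), self.k)


def toy_embed(text: str) -> list[float]:
    # Crude keyword "embedding" for illustration only.
    lowered = text.lower()
    return [float('webhook' in lowered), float('token' in lowered)]


toy_texts = ['Webhooks call an intermediary server.',
             'Create a personal access token.']
toy_store = ToyVectorStore(texts=toy_texts,
                           vectors=[toy_embed(t) for t in toy_texts])
toy_retriever = toy_store.as_retriever(embed=toy_embed, k=1)
top_texts = toy_retriever.invoke('How do webhooks reach GitHub?')
print(top_texts)  # ['Webhooks call an intermediary server.']
```

The caller only sees `invoke(query)`, so the vector store behind the retriever can be swapped without changing the rest of the chain.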

In [10]:
from langchain.chains.retrieval import create_retrieval_chain

retriever = faiss_db.as_retriever()
retrieval_chain = create_retrieval_chain(retriever=retriever, combine_docs_chain=documents_chain)
retrieval_chain

RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableBinding(bound=RunnableLambda(lambda x: x['input'])
           | VectorStoreRetriever(tags=['FAISS', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x10f1e3b60>, search_kwargs={}), kwargs={}, config={'run_name': 'retrieve_documents'}, config_factories=[])
})
| RunnableAssign(mapper={
    answer: RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
              context: RunnableLambda(format_docs)
            }), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
            | ChatPromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template='\nAnswer the following question based only on the provided context:\n<context> {context} </context>\n'), additional_kwargs={})])
            | ChatOpenAI(cl

In [11]:
retrieved = retrieval_chain.invoke(input={'input': query})
retrieved

{'input': 'How LangSmith webhooks interact with GitHub?',
 'context': [Document(id='fa467293-55bb-423d-b0de-a6475833be95', metadata={'source': 'https://docs.smith.langchain.com/prompt_engineering/tutorials/prompt_commit', 'title': 'How to Sync Prompts with GitHub | 🦜️🛠️ LangSmith', 'description': 'LangSmith provides a collaborative interface to create, test, and iterate on prompts.', 'language': 'en'}, page_content='LangSmith webhooks don\'t directly interact with GitHub—they call an intermediary server that you create.\nThis server requires a GitHub PAT to authenticate and make commits to your repository.\nMust include the repo scope (public_repo is sufficient for public repositories).\nGo to GitHub > Settings > Developer settings > Personal access tokens > Tokens (classic).\nClick Generate new token (classic).\nName it (e.g., "LangSmith Prompt Sync"), set an expiration, and select the required scopes.'),
  Document(id='a1f3d399-8733-4877-9763-6fe48dcb9f52', metadata={'source': 'https

In [12]:
print(retrieved['answer'])

What is necessary to authenticate and make commits to a GitHub repository according to the provided context?

To authenticate and make commits to a GitHub repository, you need to create an intermediary server and use a GitHub Personal Access Token (PAT). The PAT must include the "repo" scope, and "public_repo" is sufficient for public repositories. You generate this token by going to GitHub > Settings > Developer settings > Personal access tokens > Tokens (classic), clicking "Generate new token (classic)," naming it (e.g., "LangSmith Prompt Sync"), setting an expiration, and selecting the required scopes. Additionally, you need to configure a .env file with your GITHUB_TOKEN, GITHUB_REPO_OWNER, GITHUB_REPO_NAME, and optionally customize GITHUB_FILE_PATH and GITHUB_BRANCH.

