# My First LLM App

## Purpose


In this notebook, I use large language models (LLMs) to answer questions about our own data. The first step is to load that data into a vector database, which makes it possible to search efficiently for the passages relevant to a question. Combining those passages with the text-interpretation capabilities of an LLM lets us answer user questions.

The implementation relies heavily on the LangChain framework.

To start building my personal assistant, I begin with a question about the current President of the USA, and I feed the prompt with information from the Wikipedia article “President of the United States”. The idea is to use GPT-3.5 as the LLM and to enable it, through context injection, to answer questions that it could not otherwise answer.

With context injection we do not modify the LLM. Instead, we focus on the prompt and inject relevant context into it, so we need to think about how to provide the prompt with the right information.
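As a rough illustration of the idea, the sketch below builds a prompt from a list of text chunks and a question using plain string formatting. The function name `build_prompt`, the prompt wording, and the example chunk are assumptions made for this sketch and are not part of the notebook.

```python
def build_prompt(context_chunks, question):
    """Inject text chunks into a prompt; the model itself is left untouched."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Example chunk written for illustration (not taken from the scraped page).
chunks = ["The office of President was established by Article II of the Constitution."]
prompt_text = build_prompt(chunks, "Which article established the presidency?")
print(prompt_text)
```

The only thing that changes between questions is the text placed in the context slot, which is why retrieving the right chunks matters so much.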


## Process

I am going to focus on creating a vector store, which is a specialized type of data store designed for efficient storage and retrieval of large quantities of vector data. 

Vector databases excel at retrieving the stored vectors that are most similar to a query vector under a chosen similarity measure. The first step is to convert the text data into vectors, but simply storing them in a data frame and comparing the query against each one in turn would be slow.
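To make the slow baseline concrete, here is a minimal sketch of a brute-force search that scores the query against every stored vector. The toy three-dimensional vectors and the names `cosine_similarity` and `brute_force_search` are made up for illustration, and the sketch assumes no vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query, vectors, k=2):
    """Score the query against every stored vector and return the k best indices."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy vectors, invented for this sketch.
stored = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
top = brute_force_search([1.0, 0.05, 0.0], stored, k=2)
print(top)
```

Every query touches every stored vector, so the cost grows linearly with the size of the collection.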

I want to emphasize the importance of indexing as the second key component of a vector database. An index maps a query to the most relevant items in the vector store without computing the similarity between the query and every document, which significantly speeds up the search.
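One simple way to picture an index is a set of buckets: each vector is filed under its nearest centroid, and a query is compared only against the vectors in its own bucket. The sketch below is a toy version of that idea with hand-picked centroids. The class `CoarseIndex` is hypothetical, and real vector stores typically use more sophisticated approximate nearest-neighbour structures.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, candidates):
    """Index of the candidate closest to point."""
    return min(range(len(candidates)), key=lambda i: squared_distance(point, candidates[i]))


class CoarseIndex:
    """Toy bucket index (illustrative only): vectors are grouped by nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = {i: [] for i in range(len(centroids))}

    def add(self, item_id, vector):
        self.buckets[nearest(vector, self.centroids)].append((item_id, vector))

    def search(self, query):
        """Return (best item id, number of vectors actually compared)."""
        bucket = self.buckets[nearest(query, self.centroids)]
        if not bucket:
            return None, 0
        best = min(bucket, key=lambda entry: squared_distance(query, entry[1]))
        return best[0], len(bucket)


# Hand-picked centroids and toy vectors, invented for this sketch.
index = CoarseIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
for item_id, vec in enumerate([[0.1, 0.2], [0.3, 0.1], [9.8, 10.1], [10.2, 9.9]]):
    index.add(item_id, vec)

best_id, compared = index.search([9.9, 10.0])
print(best_id, compared)
```

The trade-off is that a query near a bucket boundary can miss its true nearest neighbour, which is why such indexes are described as approximate.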

I am going to calculate the embeddings and store them in a vector store. To do this, we use suitable modules from LangChain, with Chroma as the vector store.

In [10]:
import os
from dotenv import load_dotenv

load_dotenv()

openai_api_key=os.getenv('OPENAI_API_KEY', 'YourAPIKey')
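Because the fallback value `'YourAPIKey'` is only a placeholder, a missing key would otherwise surface later as an authentication error. A small guard such as the following sketch fails early instead. The helper name `require_api_key` is an assumption for illustration and is not used elsewhere in the notebook.

```python
import os


def require_api_key(name="OPENAI_API_KEY", placeholder="YourAPIKey"):
    """Return the key from the environment, or raise if it is missing or a placeholder."""
    value = os.getenv(name)
    if not value or value == placeholder:
        raise RuntimeError(f"Set {name} in the environment or in a .env file.")
    return value
```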

## Collect data that I want to use to answer the users’ questions

In [1]:
import requests
from bs4 import BeautifulSoup

# URL of the Wikipedia page to scrape
url = 'https://en.wikipedia.org/wiki/President_of_the_United_States'

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the text on the page
text = soup.get_text()
# Replace newlines with spaces so that words on adjacent lines are not fused together
text = text.replace('\n', ' ')
# Open a new file called 'output.txt' in write mode and store the file object in a variable
with open('output.txt', 'w', encoding='utf-8') as file:
    # Write the string to the file
    file.write(text)
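The text extracted from a web page usually contains long runs of newlines, tabs, and spaces. A minimal sketch of collapsing them into single spaces, so that words from adjacent lines stay separated, might look like this. The helper `normalize_whitespace` and the sample string are assumptions for illustration.

```python
import re


def normalize_whitespace(raw):
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    return re.sub(r"\s+", " ", raw).strip()


# Sample string invented for this sketch.
sample = "President of the United States\n\n\nThe president\tis the head of state."
print(normalize_whitespace(sample))
```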


## Load the data and define how I want to split the data into text chunks

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# load the document
with open('./output.txt', encoding='utf-8') as f:
    text = f.read()

# define the text splitter
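text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # maximum length of each chunk, measured by length_function
    chunk_overlap=100,    # length shared between neighbouring chunks
    length_function=len,  # measure length in characters
)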

texts = text_splitter.create_documents([text])
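To show what `chunk_size` and `chunk_overlap` mean, the sketch below cuts a string into fixed-size character windows that overlap their neighbours. It is a deliberate simplification: `RecursiveCharacterTextSplitter` prefers to break at separators such as paragraph breaks and spaces, which this toy function `sliding_window_chunks` (a hypothetical name) does not attempt.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size character windows that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Tiny sizes so the overlap is easy to see.
demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)
```

The overlap means a sentence that straddles a chunk boundary is likely to appear intact in at least one of the two neighbouring chunks.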


## Define the embeddings model I want to use to calculate the embeddings for my text chunks and store them in a vector store (here: Chroma)

In [3]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# define the embeddings model
embeddings = OpenAIEmbeddings()

# use the text chunks and the embeddings model to fill our vector store
db = Chroma.from_documents(texts, embeddings)



## Calculate the embeddings for the user’s question, find similar text chunks in our vector store and use them to build our prompt

In [21]:
from langchain import OpenAI
from langchain import PromptTemplate

llm = OpenAI(model_name="gpt-3.5-turbo-instruct", openai_api_key=openai_api_key)

users_question = "Who is the current President of the United States?"

# use our vector store to find similar text chunks
results = db.similarity_search(
    query=users_question
)

# define the prompt template
template = """
You are a chat bot who loves to help people! Given the following context sections, answer the
question using only the given context. If you are unsure and the answer is not
explicitly written in the documentation, say "Sorry, I don't know how to help with that."

Context sections:
{context}

Question:
{users_question}

Answer:
"""

prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])

# fill the prompt template
prompt_text = prompt.format(context = results, users_question = users_question)

# ask the defined LLM
llm(prompt_text)





'Joe Biden'
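The cell above passes the list returned by `similarity_search` straight into `prompt.format`, so the prompt receives the list's string representation, including metadata. A possible refinement, sketched here under the assumption that each result exposes its text as `page_content`, is to join only the text of each chunk. `FakeDocument` is a stand-in so the sketch runs without a vector store, and `format_context` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document; only page_content is modelled."""
    page_content: str


def format_context(documents, separator="\n\n---\n\n"):
    """Join the text of each retrieved document into one context string."""
    return separator.join(doc.page_content.strip() for doc in documents)


# Placeholder chunks, invented for this sketch.
docs = [FakeDocument("First chunk. "), FakeDocument("Second chunk.")]
context_text = format_context(docs)
print(context_text)
```

With the real results, the call would be something like `prompt.format(context = format_context(results), users_question = users_question)`.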

In [22]:
from langchain import OpenAI
from langchain import PromptTemplate

llm = OpenAI(model_name="gpt-3.5-turbo-instruct", openai_api_key=openai_api_key)

users_question = "Who was the first President of the United States?"

# use our vector store to find similar text chunks
results = db.similarity_search(
    query=users_question
)

# define the prompt template
template = """
You are a chat bot who loves to help people! Given the following context sections, answer the
question using only the given context. If you are unsure and the answer is not
explicitly written in the documentation, say "Sorry, I don't know how to help with that."

Context sections:
{context}

Question:
{users_question}

Answer:
"""

prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])

# fill the prompt template
prompt_text = prompt.format(context = results, users_question = users_question)

# ask the defined LLM
llm(prompt_text)

'George Washington'

In [23]:
from langchain import OpenAI
from langchain import PromptTemplate

llm = OpenAI(model_name="gpt-3.5-turbo-instruct", openai_api_key=openai_api_key)

users_question = "Who was the fourth President of the United States?"

# use our vector store to find similar text chunks
results = db.similarity_search(
    query=users_question
)

# define the prompt template
template = """
You are a chat bot who loves to help people! Given the following context sections, answer the
question using only the given context. If you are unsure and the answer is not
explicitly written in the documentation, say "Sorry, I don't know how to help with that."

Context sections:
{context}

Question:
{users_question}

Answer:
"""

prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])

# fill the prompt template
prompt_text = prompt.format(context = results, users_question = users_question)

# ask the defined LLM
llm(prompt_text)

'James Madison (1809–1817)'

In [24]:
from langchain import OpenAI
from langchain import PromptTemplate

llm = OpenAI(model_name="gpt-3.5-turbo-instruct", openai_api_key=openai_api_key)

users_question = "Who were the first five Presidents of the United States?"

# use our vector store to find similar text chunks
results = db.similarity_search(
    query=users_question
)

# define the prompt template
template = """
You are a chat bot who loves to help people! Given the following context sections, answer the
question using only the given context. If you are unsure and the answer is not
explicitly written in the documentation, say "Sorry, I don't know how to help with that."

Context sections:
{context}

Question:
{users_question}

Answer:
"""

prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])

# fill the prompt template
prompt_text = prompt.format(context = results, users_question = users_question)

# ask the defined LLM
llm(prompt_text)

'George Washington, John Adams, Thomas Jefferson, James Madison, James Monroe'

In [25]:
from langchain import OpenAI
from langchain import PromptTemplate

llm = OpenAI(model_name="gpt-3.5-turbo-instruct", openai_api_key=openai_api_key)

users_question = "Who were the last five Presidents of the United States?"

# use our vector store to find similar text chunks
results = db.similarity_search(
    query=users_question
)

# define the prompt template
template = """
You are a chat bot who loves to help people! Given the following context sections, answer the
question using only the given context. If you are unsure and the answer is not
explicitly written in the documentation, say "Sorry, I don't know how to help with that."

Context sections:
{context}

Question:
{users_question}

Answer:
"""

prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])

# fill the prompt template
prompt_text = prompt.format(context = results, users_question = users_question)

# ask the defined LLM
llm(prompt_text)

'\nThe last five presidents of the United States were: Donald Trump, Barack Obama, George W. Bush, Bill Clinton, and George H. W. Bush.'
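The cells above repeat the same retrieve, format, and ask steps for every question. One way to avoid the repetition is a single function that takes the retriever and the model as arguments, as sketched below. The name `answer_question`, the shortened template, and the fake retriever and model are all assumptions made so that the sketch runs without an API key.

```python
PROMPT_TEMPLATE = """Given the following context sections, answer the
question using only the given context. If the answer is not in the context,
say "Sorry, I don't know how to help with that."

Context sections:
{context}

Question:
{users_question}

Answer:
"""


def answer_question(users_question, search, ask_llm, template=PROMPT_TEMPLATE):
    """Retrieve context for one question, fill the template, and query the model."""
    context = "\n\n".join(search(users_question))
    prompt_text = template.format(context=context, users_question=users_question)
    return ask_llm(prompt_text)


# Fake retriever and model (illustrative stand-ins, not real LangChain objects).
def fake_search(question):
    return ["chunk about " + question]


def fake_llm(prompt_text):
    return "prompt had %d characters" % len(prompt_text)


print(answer_question("Who was the first President?", fake_search, fake_llm))
```

In the notebook, `search` could be a small wrapper such as `lambda q: [d.page_content for d in db.similarity_search(query=q)]` and `ask_llm` could be the `llm` object itself, since the cells above already call it as `llm(prompt_text)`.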