# **[All You Need to Know to Build Your First LLM App](https://medium.com/towards-data-science/all-you-need-to-know-to-build-your-first-llm-app-eb982c78ffac)**

<img src ='https://miro.medium.com/v2/resize:fit:1400/format:webp/1*MKlUfYZdwSWpEulibj6S_g.png'>

### **1. Load documents using Langchain**

In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/GPT-4"
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

# find all the text on the page
text = soup.get_text()

# find the content div
content_div = soup.find('div', {'class': 'mw-parser-output'})

# remove unwanted elements from div
unwanted_tags = ['sup', 'span', 'table', 'ul', 'ol']
for tag in unwanted_tags:
    for match in content_div.find_all(tag):
        match.extract()

print(content_div.get_text())

2023 text-generating language model



Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model created by OpenAI, and the fourth in its series of GPT foundation models. It was launched on March 14, 2023, and made publicly available via the paid chatbot product ChatGPT Plus, via OpenAI's API, and via the free chatbot Microsoft Copilot.  As a transformer-based model, GPT-4 uses a paradigm where pre-training using both public data and "data licensed from third-party providers" is used to predict the next token. After this step, the model was then fine-tuned with reinforcement learning feedback from humans and AI for human alignment and policy compliance.
Observers reported that the iteration of ChatGPT using GPT-4 was an improvement on the previous iteration based on GPT-3.5, with the caveat that GPT-4 retains some of the problems with earlier revisions. GPT-4, equipped with vision capabilities (GPT-4V), is capable of taking images as input on ChatGPT. OpenAI has

### **2. Split our document into text fragments**
- **Next, we must divide the text into smaller sections called text chunks.**
- **Each text chunk represents a data point in the embedding space, allowing the computer to determine the similarity between these chunks.**
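To make the role of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. The helper `split_into_chunks` is a hypothetical name introduced for illustration. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally: the LangChain splitter also tries to break on separators such as paragraphs and spaces, which this sketch ignores.

```python
def split_into_chunks(text, chunk_size=100, chunk_overlap=20):
    """Cut text into character windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_into_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a chunk boundary still appears in full context in at least one chunk.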

In [None]:
!pip install -q langchain

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


article_text = content_div.get_text()


text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 100,
    chunk_overlap  = 20,
    length_function = len,
)

texts = text_splitter.create_documents([article_text])
print(len(texts))
print(texts[0])
print(texts[1])
print(texts[2])

245
page_content='2023 text-generating language model'
page_content='Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model created by'
page_content='model created by OpenAI, and the fourth in its series of GPT foundation models. It was launched on'


<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*YcleaA2sDs_IyjmrNzIajQ.png'>

### **3. From Text Chunks to Embeddings**
- The `openai` package must be pinned below version 1.0 for the embedding call used here (`openai.Embedding.create`) to run.
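A small guard can make the version requirement explicit before any embedding call is attempted. This is a sketch only: `uses_legacy_openai_api` is a hypothetical helper name, and it assumes the installed version string starts with a numeric major version.

```python
from importlib import metadata


def uses_legacy_openai_api(version):
    """Return True when the version string denotes a pre-1.0 release."""
    major = int(version.split(".")[0])
    return major < 1


try:
    installed = metadata.version("openai")
    if uses_legacy_openai_api(installed):
        print(f"openai {installed}: openai.Embedding.create is available")
    else:
        print(f"openai {installed}: pin openai==0.28 to run this notebook as written")
except metadata.PackageNotFoundError:
    print("openai is not installed")
```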


In [None]:
! pip install openai==0.28



In [None]:
texts[0]

Document(page_content='2023 text-generating language model')

In [None]:
import openai
from google.colab import userdata
my_api_key = userdata.get('openai-api-key')

# Set the OpenAI API key
openai.api_key = my_api_key

print(texts[0])

embedding = openai.Embedding.create(
    input=texts[0].page_content, model="text-embedding-ada-002"
)["data"][0]["embedding"]

len(embedding)

page_content='2023 text-generating language model'


1536

In [None]:
embedding

[-0.03262288123369217,
 0.00018029265629593283,
 -0.005053458269685507,
 0.020396320149302483,
 0.010556112974882126,
 0.03301592916250229,
 -0.03691831976175308,
 0.0009545421344228089,
 -0.027036001905798912,
 -0.020705141127109528,
 0.027639608830213547,
 0.020957814529538155,
 -0.020536692813038826,
 0.0031426192726939917,
 -0.003091733902692795,
 0.009320823475718498,
 0.017448468133807182,
 -0.006576514337211847,
 0.01151767373085022,
 0.00137829571031034,
 0.015062113292515278,
 0.002530238591134548,
 0.0004711296933237463,
 -0.007818822748959064,
 -0.00021626344823744148,
 0.009903374128043652,
 0.019722525030374527,
 -0.027822095900774002,
 0.0014818214112892747,
 -0.011040402576327324,
 0.018711833283305168,
 -0.006565986666828394,
 -0.012444141320884228,
 -0.00635542580857873,
 -0.01629740372300148,
 -0.019806748256087303,
 0.0004943791427649558,
 -0.021224524825811386,
 0.015553421340882778,
 -0.013560113497078419,
 0.02153334766626358,
 0.027864208444952965,
 0.01781344041

In [None]:
import numpy as np

np.array(embedding).shape

(1536,)

- We convert our text, such as the first text chunk containing “2023 text-generating language model,” into a vector with 1536 dimensions. By doing this for each text chunk, we can observe in a 1536-dimensional space which text chunks are closer and more similar to each other.

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*lssUQDyZfz3MZCpCxIh-bw.png'>

#### **A commonly used similarity measure is cosine similarity. So let’s calculate the cosine similarity between our question and the text chunks:**
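Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. The sketch below shows this on three-dimensional toy vectors in plain Python. The vectors are made up for illustration; real `text-embedding-ada-002` vectors have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


question = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # longer, but points the same way
orthogonal = [0.0, 1.0, 0.0]       # shares no component with the question
partial = [1.0, 1.0, 0.0]          # shares one component

print(round(cosine_similarity(question, same_direction), 3))  # 1.0
print(round(cosine_similarity(question, orthogonal), 3))      # 0.0
print(round(cosine_similarity(question, partial), 3))         # 0.5
```

A value near 1 means the two texts point in almost the same direction in embedding space; a value near 0 means they are unrelated.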

In [None]:
import numpy as np
from numpy.linalg import norm
from langchain.text_splitter import RecursiveCharacterTextSplitter
import requests
from bs4 import BeautifulSoup
import pandas as pd
import openai

####################################################################
# load documents
####################################################################
# URL of the Wikipedia page to scrape
url = 'https://en.wikipedia.org/wiki/Prime_Minister_of_the_United_Kingdom'

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the text on the page
text = soup.get_text()
print(len(text))
text[:1000]

60038


"\n\n\n\nPrime Minister of the United Kingdom - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload file\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAppearance\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreate account\n\nLog in\n\n\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\n Create account Log in\n\n\n\n\n\n\t\tPages for logged out editors learn more\n\n\n\nContributionsTalk\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1\nHistory\n\n\n\n\n\n\n\n\n2\nAuthority, powers and constraints\n\n\n\n\n\n\n\n\n3\nConstitutional background\n\n\n\n\n\n\n\n\n4\nMo

In [None]:


####################################################################
# split text
####################################################################
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 100,
    chunk_overlap  = 20,
    length_function = len,
)

texts = text_splitter.create_documents([text[:1000]])

####################################################################
# calculate embeddings
####################################################################
# create new list with all text chunks
text_chunks = []

# use `doc` as the loop variable so the page text stored in `text` is not overwritten
for doc in texts:
    text_chunks.append(doc.page_content)

df = pd.DataFrame({'text_chunks': text_chunks})

####################################################################
# get embeddings from text-embedding-ada model
####################################################################
def get_embedding(text, model="text-embedding-ada-002"):
   text = text.replace("\n", " ")
   return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding']

df['ada_embedding'] = df.text_chunks.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))

####################################################################
# calculate the embeddings for the user's question
####################################################################
users_question = "What is GPT-4?"

question_embedding = get_embedding(text=users_question, model="text-embedding-ada-002")

# create a list to store the calculated cosine similarity
cos_sim = []

for index, row in df.iterrows():
   A = row.ada_embedding
   B = question_embedding

   # calculate the cosine similarity
   cosine = np.dot(A,B)/(norm(A)*norm(B))

   cos_sim.append(cosine)

df["cos_sim"] = cos_sim
df.sort_values(by=["cos_sim"], ascending=False)

| | text_chunks | ada_embedding | cos_sim |
|---|---|---|---|
| 9 | 4\nModern premiership\n\n\n\n\nToggle Modern p... | [-0.001666662865318358, 0.00393084017559886, -... | 0.770997 |
| 10 | 4.2\nPrime Minister's Office\n\n\n\n\n\n\n\n\n... | [0.014478943310678005, 0.0020605376921594143, ... | 0.754074 |
| 0 | Prime Minister of the United Kingdom - Wikipedia | [0.004399775993078947, -0.01084962673485279, -... | 0.739513 |
| 11 | 4.4\nSecurity and transport\n\n\n\n\n\n\n\n\n4... | [0.02221507392823696, -0.020824098959565163, 0... | 0.732841 |
| 6 | Pages for logged out editors learn more\n\n\n\... | [0.004999903962016106, 0.003818664001300931, 0... | 0.730828 |
| 12 | 4.6\nDeputy\n\n\n\n\n\n\n4.6.1\nSucc | [0.009663875214755535, -0.003192658070474863, ... | 0.723244 |
| 2 | Navigation\n\t\n\n\nMain pageContentsCurrent e... | [0.0052303746342659, -0.002746198559179902, -0... | 0.718181 |
| 4 | Search\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n... | [0.004861578810960054, 0.01493444200605154, 0.... | 0.717354 |
| 3 | Contribute\n\t\n\n\nHelpLearn to editCommunity... | [0.008618349209427834, 0.009969587437808514, 0... | 0.715512 |
| 7 | Contents\nmove to sidebar\nhide\n\n\n\n\n(Top)... | [0.01620987057685852, -0.0002970777277369052, ... | 0.713264 |


### **4. Define the model you want to use**

In [None]:
!pip install -q langchain_community # Install the missing module

In [None]:
my_api_key

'sk-proj-[REDACTED]'

In [None]:
from langchain.llms import OpenAI

llm = OpenAI(openai_api_key= my_api_key, temperature=0.7)

# Check the default model
print(llm.model_name)

gpt-3.5-turbo-instruct




In [None]:
models = openai.Model.list()
print([model['id'] for model in models['data']])

['dall-e-3', 'gpt-4-1106-preview', 'dall-e-2', 'tts-1-hd-1106', 'tts-1-hd', 'gpt-4o-mini-2024-07-18', 'gpt-4-0125-preview', 'babbage-002', 'gpt-4-turbo-preview', 'text-embedding-3-small', 'text-embedding-3-large', 'tts-1', 'gpt-3.5-turbo', 'whisper-1', 'gpt-4o-2024-05-13', 'text-embedding-ada-002', 'gpt-3.5-turbo-16k', 'davinci-002', 'gpt-4-turbo-2024-04-09', 'tts-1-1106', 'gpt-3.5-turbo-0125', 'gpt-4-turbo', 'gpt-3.5-turbo-1106', 'gpt-4o-mini', 'gpt-4o', 'gpt-3.5-turbo-instruct-0914', 'gpt-3.5-turbo-instruct', 'gpt-4-0613', 'gpt-4']


### **5. Define our Prompt Template**

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

# Initialize the OpenAI chat model with a specific model name
llm = ChatOpenAI(openai_api_key=my_api_key,
                 model="gpt-4o-mini-2024-07-18",
                 temperature=0.7)

# Test the chat model with a simple message
response = llm([HumanMessage(content="Hello, how are you?")])
print(response.content)



Hello! I'm just a computer program, so I don't have feelings, but I'm here and ready to help you. How can I assist you today?


In [None]:
users_question = "Who was the first Prime Minister of the UK?"

# Ask the chat model the user's question directly, without any context
response = llm([HumanMessage(content=users_question)])
print(response.content)

The first Prime Minister of the United Kingdom is generally considered to be Sir Robert Walpole. He served as First Lord of the Treasury from 1721 to 1742 and is often regarded as the de facto leader of the government during that time, although the title "Prime Minister" was not officially used at the time. Walpole is recognized for his significant influence over the cabinet and Parliament, effectively laying the groundwork for the modern role of the Prime Minister.


### **6. Creating a vector store (vector database)**
#### **Hallucination!**

In [None]:
users_question = input("Enter your question: ")

# Ask the chat model the user's question directly, without any context
response = llm([HumanMessage(content=users_question)])
print(response.content)

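Enter your question: Tell me about Jo Sang-gu
Jo Sang-gu (趙相九, born January 30, 1940) is a South Korean politician who was active mainly in the 1990s and early 2000s. He was elected to the 15th National Assembly in 1996 and went on to serve several more terms. Jo Sang-gu belongs to the Grand National Party (now the People Power Party) and holds conservative political views.

Over his political career he took part in a range of policies and bills and worked to contribute to the country's political and economic development. He also made an effort to stay in touch with the local community.

If you have questions about more specific details or particular events involving Jo Sang-gu, please let me know.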


### **6.1. Collect data that we want to use to answer the users’ questions:**

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*ApbpqcZUMF-YaA6DbnVGww.png'>

In [None]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader


text = '''
Jo Sang-gu was a renowned general of the Goryeo era, a hero who defeated the million-strong army of China's Sui dynasty.
'''

# Open a new file called 'output.txt' in write mode and store the file object in a variable
with open('output.txt', 'w', encoding='utf-8') as file:
    # Write the string to the file
    file.write(text)

### **6.2. Load the data and define how you want to split the data into text chunks**

### **Load PDF, Excel, and other formats with the various loaders; the result resembles MongoDB Documents (made up of pages and metadata)**

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*f5mcqjHkiz9QX63dpQnzxg.png'>

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# load the document
with open('./output.txt', encoding='utf-8') as f:
    text = f.read()

# define the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap  = 100,
    length_function = len,
)

texts = text_splitter.create_documents([text])
texts

[Document(page_content="Jo Sang-gu was a renowned general of the Goryeo era, a hero who defeated the million-strong army of China's Sui dynasty.")]

### **6.3. Define the Embeddings Model you want to use to calculate the embeddings for your text chunks and store them in a vector store (here: Chroma)**

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*ydbixXRwfgMYVdpctYTdew.png'>

- **Recently, graph databases are being applied alongside vector databases.**
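Conceptually, a vector store keeps each text chunk next to its embedding and ranks the chunks against a query vector. The sketch below illustrates that idea in memory with plain Python. It makes several assumptions: `TinyVectorStore`, `toy_embed`, and `VOCAB` are hypothetical names, the word-count "embedding" merely stands in for a real embedding model, and none of this reflects how Chroma is implemented.

```python
import math

# Hypothetical four-term vocabulary; a real embedding model learns its own features.
VOCAB = ["gpt", "model", "minister", "parliament"]


def toy_embed(text):
    """Stand-in for an embedding model: count how often each vocabulary term occurs."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


class TinyVectorStore:
    """Keeps (text, vector) pairs in memory and ranks them against a query."""

    def __init__(self):
        self.entries = []

    def add_texts(self, texts):
        for text in texts:
            self.entries.append((text, toy_embed(text)))

    def similarity_search(self, query, k=2):
        query_vector = toy_embed(query)
        scored = [(text, cosine(query_vector, vector)) for text, vector in self.entries]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
        return scored[:k]


store = TinyVectorStore()
store.add_texts([
    "GPT is a language model",
    "The minister spoke in parliament",
    "A minister compared each model",
])
top = store.similarity_search("Which model is GPT", k=2)
for text, score in top:
    print(round(score, 2), text)
```

A production vector store adds persistence and approximate nearest-neighbour indexes so the search stays fast with millions of chunks, but the contract is the same: texts in, the `k` closest texts out.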

In [None]:
!pip install -q chromadb

In [None]:
!pip install -q tiktoken

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# define the embeddings model
embeddings = OpenAIEmbeddings(openai_api_key=my_api_key)

# use the text chunks and the embeddings model to fill our vector store
db = Chroma.from_documents(texts, embeddings)



In [None]:
db

<langchain_community.vectorstores.chroma.Chroma at 0x7a42f9d15d20>

### **6.4. Calculate the embeddings for the user’s question, find similar text chunks in our vector store and use them to build our prompt**

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/1*r2n4uA-ZlxZatnlhTVwv5Q.png'>
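The prompt-building part of this step is plain string formatting: the retrieved chunks are joined into one context string and placed in a template together with the question. The sketch below shows that part in isolation, without any LLM or vector store. `build_prompt`, `PROMPT_TEMPLATE`, and `max_chunks` are hypothetical names, and the example chunks are short illustrative strings rather than real retrieval results.

```python
PROMPT_TEMPLATE = """Answer the question using only the context sections below.
If the answer is not in the context, say "Sorry, I don't know how to help with that."

Context sections:
{context}

Question:
{question}

Answer:"""


def build_prompt(retrieved_chunks, question, max_chunks=3):
    """Number the best-ranked chunks, join them, and fill the template."""
    selected = retrieved_chunks[:max_chunks]
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(selected, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Chunks are assumed to arrive sorted from most to least similar.
retrieved = [
    "Walpole served as First Lord of the Treasury from 1721 to 1742.",
    "The title Prime Minister was not officially used at the time.",
    "Main menu move to sidebar hide",
]
prompt = build_prompt(
    retrieved,
    "Who was the first Prime Minister of the UK?",
    max_chunks=2,
)
print(prompt)
```

Joining the chunks into a single string, instead of passing a Python list, keeps list brackets and quote characters out of the prompt. Capping the number of chunks keeps low-ranked noise such as navigation text out of the context.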

In [None]:
from langchain.llms import OpenAI
from langchain import PromptTemplate, LLMChain
from langchain.schema import HumanMessage

user_question = "Tell me about Jo Sang-gu"

# use our vector store to find similar text chunks
results = db.similarity_search_with_score( # Use similarity_search_with_score directly
    query=user_question,
    k=5 # Pass the number of results using 'k'
)

# define the prompt template
template = """
You are a chat bot who loves to help people! Given the following context sections, answer the
question using only the given context. If you are unsure and the answer is not
explicitly written in the documentation, say "Sorry, I don't know how to help with that."

Context sections:
{context}

Question:
{users_question}

Answer:
"""

prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])
# Create an LLMChain for easier prompt management
llm_chain = LLMChain(llm=llm, prompt=prompt)

# fill the prompt template
# Extract the document contents from the results
context_contents = [doc.page_content for doc, _ in results]
# Run the LLMChain with the formatted prompt
# Join the chunks into one string so the prompt does not contain a Python list repr
response = llm_chain.run(context="\n\n".join(context_contents), users_question=user_question)

# Print the LLM's response
print(response)



Jo Sang-gu was a renowned general of the Goryeo era and a hero who defeated the million-strong army of China's Sui dynasty.


## **Summary**

### - To enable our LLM to analyze and answer questions about our data, we usually don’t fine-tune the model. **Fine-tuning aims to improve the model’s ability to respond effectively to a specific task, rather than to teach it new information.**

### - In the case of Alpaca 7B, the LLM (LLaMA) was fine-tuned to behave and interact like a chatbot. The focus was on refining the model’s responses, rather than teaching it completely new information.

### - So **to be able to answer questions about our own data, we use the Context Injection approach.** Creating an LLM app with Context Injection is a relatively simple process. **The main challenge lies in organizing and formatting the data to be stored in a vector database. This step is crucial for efficiently retrieving contextually similar information and ensuring reliable results.**

### - The goal of the article was **to demonstrate a minimalist approach to using embedding models, vector stores, and LLMs** to process user queries. It shows how these technologies can work together to provide relevant and accurate answers, even when the underlying facts change constantly.