# Unlocking the Potential of Large Language Models with RAG
## Setting Up **LangChain** Configurations and Pipeline
<img src='../cover.png' width='300'>

* Authors:  
    - [Lior Gazit](https://www.linkedin.com/in/liorgazit).  
    - [Meysam Ghaffari](https://www.linkedin.com/in/meysam-ghaffari-ph-d-a2553088/).
* Translator:
    - [박조은](https://github.com/corazzon)
* This notebook accompanies the following book:
    - Korean edition: NLP와 LLM 실전 가이드 (Hanbit Media)
    - Original edition: [Mastering NLP from Foundations to LLMs](https://www.amazon.com/dp/1804619183)

Colab notebooks (Korean edition):
https://github.com/corazzon/Mastering-NLP-from-Foundations-to-LLMs

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/corazzon/Mastering-NLP-from-Foundations-to-LLMs/blob/main/Chapter8_notebooks/Ch8_Setting_Up_LangChain_Configurations_and_Pipeline.ipynb)  


Colab notebooks (original edition):
https://github.com/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs   


<a target="_blank" href="https://colab.research.google.com/github/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/blob/liors_branch/Chapter8_notebooks/Ch8_Setting_Up_LangChain_Configurations_and_Pipeline.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

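**Purpose of this notebook:**  
This notebook focuses on building a semantic search pipeline with **LangChain**.  
It shows how a physician can search patient records and find patients by asking a question such as:  

`"Which patients have travelled recently?"`  

To enable internal search over the physician's records, we set up a complete **RAG** pipeline that creates **embeddings** and stores them in a **vector database**.  
We then show that such a pipeline, which relies on similarity search alone without an LLM, is not optimal.  
The next notebook covers how to integrate an LLM to improve the search:  
**Ch9_Advanced_LangChain_Configurations_and_Pipeline.ipynb**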

**Requirements:**  
* When running in Google Colab, use the following notebook runtime settings: `Python3, CPU`

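>*```Disclaimer: The content and ideas presented in this notebook are solely those of the authors and do not represent the views or intellectual property of the authors' employers.```*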

Setup:

In [1]:
# Note:
# If the code below fails because of Python package mismatches, newer package versions may be the cause.
# In that case, set "default_installations" to False to fall back to the original pinned environment:
default_installations = True
if default_installations:
    !pip -q install langchain langchain-community
    !pip -q install sentence_transformers
    !pip -q install faiss-cpu
else:
    import requests
    text_file_path = "requirements__Ch8_Setting_Up_LangChain_Configurations_and_Pipeline.txt"
    url = "https://raw.githubusercontent.com/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/main/Chapter8_notebooks/" + text_file_path
    res = requests.get(url)
    with open(text_file_path, "w") as f:
        f.write(res.text)

    !pip install -r requirements__Ch8_Setting_Up_LangChain_Configurations_and_Pipeline.txt

Imports:

In [2]:
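import requests
import textwrap

# In recent LangChain releases the loader, embedding, and vector store
# integrations are imported from langchain_community (installed above).
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS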

Code settings:

In [3]:
# In the data file we use, this short string is the delimiter that separates the individual clinical reports:
split_text_by = '"Title: Mocked up record'
chunk_size = 2000
chunk_overlap = 0
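
To give a feel for what `chunk_size` and `chunk_overlap` control, here is a minimal plain-Python sketch of fixed-size chunking with overlap. It only illustrates the concept and is not how LangChain's `CharacterTextSplitter` works internally; the function name `chunk_with_overlap` is a hypothetical helper introduced for this example.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    (Hypothetical helper for illustration only.)
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# With no overlap, the windows simply tile the text.
print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=0))
# With an overlap of 2, each window repeats the last 2 characters of the previous one.
print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
```

In this notebook the overlap is set to 0, presumably because each clinical report is already a self-contained unit separated by an explicit delimiter.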

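### Load the text file containing the mocked-up physician records
This file holds the information we want to make searchable.  
To keep loading simple and fast in this example, all the mocked-up reports have been combined into a single CSV file.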

In [4]:
text_file_path = "mocked_up_physician_records.csv"
url = "https://raw.githubusercontent.com/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/main/Chapter8_notebooks/" + text_file_path
res = requests.get(url)
res.raise_for_status()  # stop here if the download failed
# Save a local copy so that TextLoader can read it from disk
with open(text_file_path, "w") as f:
    f.write(res.text)

In [6]:
print(res.text[:2000])

"Title: Mocked up record
Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain

History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old male, presents today with a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr. Anderson recently returned from a business trip to Europe about two weeks ago. He denies any respiratory symptoms or exposure to sick individuals during his travel.

During the evaluation, Mr. Anderson revealed a pertinent family history of cardiovascular disease, with his father having suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type 2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or hospitalizations.

Regarding his chief complaint, Mr. Anderson describes the abdominal pain as a dull, intermittent ache located in the lower right quadrant. He rates the pain as 5 out of 10 in severi

Load the text content of the file:

In [7]:
# Load the documents
text_loader = TextLoader(text_file_path)
documents = text_loader.load()
documents

[Document(metadata={'source': 'mocked_up_physician_records.csv'}, page_content='"Title: Mocked up record\nPhysician Name: Dr. ABC\nDate: June 25, 2099\nPatient ID: 987654321\nChief Complaint: Abdominal pain\n\nHistory of Present Illness:\nThe patient, Mr. John Anderson, a 42-year-old male, presents today with a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr. Anderson recently returned from a business trip to Europe about two weeks ago. He denies any respiratory symptoms or exposure to sick individuals during his travel.\n\nDuring the evaluation, Mr. Anderson revealed a pertinent family history of cardiovascular disease, with his father having suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type 2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or hospitalizations.\n\nRegarding his chief complaint, Mr. Anderson describes the abdominal pain as a dull, interm

Inspect the LangChain variable type (useful for knowing how to manipulate the data):

In [8]:
print(type(documents[0]))

<class 'langchain_core.documents.base.Document'>


Number of documents:

In [9]:
len(documents)

1

Here is an example of accessing the raw text:

In [10]:
print(documents[0].page_content[0:200])

"Title: Mocked up record
Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain

History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old ma


In [12]:
print(documents[0].page_content[200:400])

le, presents today with a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr. Anderson recently returned from a business trip to Europe about two weeks ago


### Preprocess the data for embedding

In [13]:
# Split the text into one chunk per clinical report
text_splitter = CharacterTextSplitter(chunk_size=chunk_size,
                                      chunk_overlap=chunk_overlap,
                                      separator=split_text_by)
splitted_docs = text_splitter.split_documents(documents)

In [14]:
len(splitted_docs)

4
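
To make the splitting step concrete, here is a minimal plain-Python sketch that approximates separator-based splitting on a toy string. It is not the LangChain implementation: `CharacterTextSplitter` additionally tries to merge small pieces up to `chunk_size`, which this sketch omits. The names `split_records` and `toy_text`, and the toy record contents, are made up for this illustration.

```python
def split_records(text: str, separator: str) -> list[str]:
    """Split text on a literal separator, trim whitespace, drop empty pieces.

    (Hypothetical helper for illustration only.)
    """
    pieces = [piece.strip() for piece in text.split(separator)]
    return [piece for piece in pieces if piece]


# Toy data that mimics the layout of the records file (contents are invented).
toy_text = (
    '"Title: Mocked up record\nPatient ID: 1\nChief Complaint: Cough\n'
    '"Title: Mocked up record\nPatient ID: 2\nChief Complaint: Fever\n'
)

records = split_records(toy_text, '"Title: Mocked up record')
print(len(records))
print(records[0])
```

Because the separator itself is dropped, each resulting piece starts right after the title line, which matches the chunks printed in this notebook starting with `Physician Name:`.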

In [15]:
print(splitted_docs[0].page_content)

Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain

History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old male, presents today with a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr. Anderson recently returned from a business trip to Europe about two weeks ago. He denies any respiratory symptoms or exposure to sick individuals during his travel.

During the evaluation, Mr. Anderson revealed a pertinent family history of cardiovascular disease, with his father having suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type 2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or hospitalizations.

Regarding his chief complaint, Mr. Anderson describes the abdominal pain as a dull, intermittent ache located in the lower right quadrant. He rates the pain as 5 out of 10 in severity. The pain is exacerbat

In [16]:
print(splitted_docs[1].page_content)

Physician Name: Dr. ABC
Date: November 15, 2099
Patient ID: 123456789
Chief Complaint: Fatigue and joint pain

History of Present Illness:
The patient, Ms. Sarah Thompson, a 57-year-old female, presents today with complaints of fatigue and joint pain. Ms. Thompson is widowed and lives alone. She has no recent history of travel outside the country.

During the evaluation, Ms. Thompson reveals a pertinent family history of autoimmune diseases, with her sister being diagnosed with rheumatoid arthritis. She also reports a personal history of hypothyroidism, which is being managed with thyroid hormone replacement therapy.

Regarding her chief complaint, Ms. Thompson describes the fatigue as persistent and overwhelming, affecting her ability to perform daily activities. She rates her fatigue as 8 out of 10 in severity. Additionally, she reports joint pain primarily in her knees and wrists, which is worse in the morning and improves with movement throughout the day. She denies any swelling or

In [17]:
print(splitted_docs[2].page_content)

Physician Name: Dr. ABC
Date: November 28, 2099
Patient ID: 987654321
Chief Complaint: Migraine Headaches

History of Present Illness:
Title: Mocked up record
The patient, Mr. Michael Johnson, a 40-year-old male, presents today with a chief complaint of recurring migraine headaches. He is married and lives with his spouse and two children. Mr. Johnson has not traveled recently outside of his local area.

During the evaluation, Mr. Johnson reports a family history of migraine headaches, with his mother and sister both experiencing similar symptoms. He denies any significant past medical conditions, surgeries, or hospitalizations. He mentions occasional stress and irregular sleep patterns due to his demanding work schedule.

Regarding his chief complaint, Mr. Johnson describes his headaches as recurrent episodes of moderate to severe throbbing pain, usually localized to one side of his head. He experiences associated symptoms such as sensitivity to light and sound, as well as nausea and 

In [18]:
print(splitted_docs[3].page_content)

Physician Name: Dr. ABC
Date: July 10, 2099
Patient ID: 246813579
Chief Complaint: Pregnancy Follow-up

History of Present Illness:
The patient, Mrs. Emily Adams, a 30-year-old female, presents today for a routine pregnancy follow-up. She is currently 32 weeks pregnant, with a due date of August 27th, 2099. Mrs. Adams is married and lives with her husband.

During the evaluation, Mrs. Adams reveals a family history of gestational diabetes, with her mother having developed the condition during her own pregnancies. She mentions no personal history of significant medical conditions, surgeries, or complications in previous pregnancies.

Regarding her chief complaint, Mrs. Adams reports typical discomforts associated with the third trimester of pregnancy, including backache, frequent urination, and occasional heartburn. She denies any vaginal bleeding, severe abdominal pain, or significant changes in fetal movements. Mrs. Adams mentions adhering to a well-balanced diet and regular exercise 

### Create the embeddings to store in the vector database  
We use an open-source model from Hugging Face.

In [19]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

### Create the vector database

We chose FAISS (Facebook AI Similarity Search) as the vector database.

For details, see the following links:

* https://python.langchain.com/docs/integrations/vectorstores/
* https://python.langchain.com/v0.1/docs/integrations/vectorstores/faiss/#:~:text=Now%2C%20we%20can%20query%20the,similarity_search

In [20]:
vector_db = FAISS.from_documents(splitted_docs, embeddings)
vector_db

<langchain_community.vectorstores.faiss.FAISS at 0x7c27cf059fd0>
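
Conceptually, a flat (brute-force) vector index compares the query vector with every stored vector and returns the closest ones. The sketch below illustrates that idea in plain Python with Euclidean distance. It is not FAISS's implementation, and the three-dimensional vectors and record names are invented for the example (real sentence embeddings have hundreds of dimensions).

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int = 1):
    """Return the k stored (name, vector) pairs closest to query_vec."""
    ranked = sorted(stored, key=lambda item: l2_distance(query_vec, item[1]))
    return ranked[:k]


# Invented toy "embeddings" for three records.
toy_store = [
    ("record A", [0.9, 0.1, 0.0]),
    ("record B", [0.1, 0.8, 0.1]),
    ("record C", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]

for name, vec in nearest(toy_query, toy_store, k=2):
    print(name, round(l2_distance(toy_query, vec), 3))
```

Note that the search always returns the `k` closest vectors, however far away they are. That property matters for the error example later in this notebook.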

### Run similarity searches over the internal documents

**Question #1: Are there any pregnant patients who are due to deliver in August?**  

In [21]:
query1 = "Are there any pregnant patients who are due to deliver in August?"
docs = vector_db.similarity_search(query1)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: July 10, 2099
Patient ID: 246813579
Chief Complaint: Pregnancy Follow-
up

History of Present Illness:
The patient, Mrs. Emily Adams, a 30-year-old female, presents today
for a routine pregnancy follow-up. She is currently 32 weeks pregnant, with a due date of August
27th, 2099. Mrs. Adams is married and lives with her husband.

During the evaluation, Mrs. Adams
reveals a family history of gestational diabetes, with her mother having developed the condition
during her own pregnancies. She mentions no personal history of significant medical conditions,
surgeries, or complications in previous pregnancies.

Regarding her chief complaint, Mrs. Adams
reports typical discomforts associated with the third trimester of pregnancy, including backache,
frequent urination, and occasional heartburn. She denies any vaginal bleeding, severe abdominal
pain, or significant changes in fetal movements. Mrs. Adams mentions adhering to a well-balanced
diet and regular exercise
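
The odd line breaks in the printed output (for example `Follow-` / `up`) come from `textwrap.fill`: with `replace_whitespace=False` the embedded newlines are kept but still counted as part of the running line, so wraps land in unexpected places. The Python documentation suggests wrapping each paragraph separately instead. A minimal sketch of that approach follows; `wrap_preserving_newlines` and the `sample` string are hypothetical names introduced here.

```python
import textwrap


def wrap_preserving_newlines(text: str, width: int = 100) -> str:
    """Wrap every original line on its own so existing line breaks survive.

    (Hypothetical helper for illustration only.)
    """
    wrapped_lines = []
    for line in text.splitlines():
        # Blank lines are kept as-is so paragraph breaks are preserved.
        wrapped_lines.append(textwrap.fill(line, width=width) if line.strip() else "")
    return "\n".join(wrapped_lines)


sample = (
    "Chief Complaint: Pregnancy Follow-up\n\n"
    "History of Present Illness:\n" + "word " * 30
)
print(wrap_preserving_newlines(sample, width=40))
```

In the notebook, the same helper could be applied to `docs[0].page_content` in place of the direct `textwrap.fill` call.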

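**[Example of an error!] Question #2: Are there any pregnant patients who are due to deliver in September?**  
This is an example where similarity search returns a **wrong result**.  
The search does return text that is similar to the question, but the returned record lists a due date of August 27th, not September. The case shows that being similar to the question is not the same as answering it correctly.  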

In [22]:
# Same question as query1, but asking about September instead of August
query2 = "Are there any pregnant patients who are due to deliver in September?"
docs = vector_db.similarity_search(query2)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: July 10, 2099
Patient ID: 246813579
Chief Complaint: Pregnancy Follow-
up

History of Present Illness:
The patient, Mrs. Emily Adams, a 30-year-old female, presents today
for a routine pregnancy follow-up. She is currently 32 weeks pregnant, with a due date of August
27th, 2099. Mrs. Adams is married and lives with her husband.

During the evaluation, Mrs. Adams
reveals a family history of gestational diabetes, with her mother having developed the condition
during her own pregnancies. She mentions no personal history of significant medical conditions,
surgeries, or complications in previous pregnancies.

Regarding her chief complaint, Mrs. Adams
reports typical discomforts associated with the third trimester of pregnancy, including backache,
frequent urination, and occasional heartburn. She denies any vaginal bleeding, severe abdominal
pain, or significant changes in fetal movements. Mrs. Adams mentions adhering to a well-balanced
diet and regular exercise
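
The failure above follows from how nearest-neighbour retrieval works: it ranks records by closeness to the query and returns the top one, without checking whether that record actually satisfies the question. The sketch below reproduces the effect with a deliberately crude stand-in for the embedding model, namely bag-of-words vectors compared by cosine similarity. This is an assumption made for illustration and not what `all-mpnet-base-v2` does; the toy records and helper names are invented.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Very rough bag-of-words 'embedding' (stand-in for a real model)."""
    cleaned = text.lower().replace("?", "").replace(".", "").replace(",", "")
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented one-sentence records, loosely modelled on the notebook's data.
toy_records = {
    "pregnancy": "She is currently 32 weeks pregnant, with a due date of August 27th.",
    "migraine": "He presents with recurring migraine headaches and has not traveled recently.",
}


def top_match(query: str) -> str:
    """Return the name of the record most similar to the query."""
    q = bow(query)
    return max(toy_records, key=lambda name: cosine(q, bow(toy_records[name])))


for month in ("August", "September"):
    question = f"Are there any pregnant patients who are due to deliver in {month}?"
    print(month, "->", top_match(question))
```

Both questions retrieve the pregnancy record, because it remains the closest match even when the month does not agree. Deciding whether the retrieved text really answers the question needs a further reasoning step, which is where the LLM in the next notebook comes in.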

**Question #3: Which patients have travelled recently?**

In [23]:
query3 = "Which patients have travelled recently?"
docs = vector_db.similarity_search(query3)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain
History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old male, presents today with
a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr.
Anderson recently returned from a business trip to Europe about two weeks ago. He denies any
respiratory symptoms or exposure to sick individuals during his travel.

During the evaluation, Mr.
Anderson revealed a pertinent family history of cardiovascular disease, with his father having
suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type
2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or
hospitalizations.

Regarding his chief complaint, Mr. Anderson describes the abdominal pain as a
dull, intermittent ache located in the lower right quadrant. He rates the pain as 5 out of 10 in
severity. The pain is exacerbate

**Question #4: Which patients require lab work?**

In [24]:
query4 = "Which patients require lab work?"
docs = vector_db.similarity_search(query4)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain
History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old male, presents today with
a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr.
Anderson recently returned from a business trip to Europe about two weeks ago. He denies any
respiratory symptoms or exposure to sick individuals during his travel.

During the evaluation, Mr.
Anderson revealed a pertinent family history of cardiovascular disease, with his father having
suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type
2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or
hospitalizations.

Regarding his chief complaint, Mr. Anderson describes the abdominal pain as a
dull, intermittent ache located in the lower right quadrant. He rates the pain as 5 out of 10 in
severity. The pain is exacerbate