# Applying Advanced LangChain Configurations and Pipelines
* Book pages 303-306

<img src='https://raw.githubusercontent.com/corazzon/Mastering-NLP-from-Foundations-to-LLMs/refs/heads/main/cover.png'
     alt="NLP와 LLM 실전 가이드(한빛미디어)"
     style="border: 3px solid gray; box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.3); border-radius: 10px; width: 300px;"   width='300'>


* Authors:  
    - [Lior Gazit](https://www.linkedin.com/in/liorgazit).  
    - [Meysam Ghaffari](https://www.linkedin.com/in/meysam-ghaffari-ph-d-a2553088/).
* Translator:
    - [박조은](https://github.com/corazzon)
* This notebook accompanies the following book:
    - Korean edition: NLP와 LLM 실전 가이드(한빛미디어)
    - Original edition: [Mastering NLP from Foundations to LLMs](https://www.amazon.com/dp/1804619183)

Colab notebook:
https://github.com/corazzon/Mastering-NLP-from-Foundations-to-LLMs

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/corazzon/Mastering-NLP-from-Foundations-to-LLMs/blob/main/Chapter9_notebooks/Ch9_Advanced_LangChain_Configurations_and_Pipeline.ipynb)  


Colab notebook for the original book:
https://github.com/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs   
  
<a target="_blank" href="https://colab.research.google.com/github/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/blob/liors_branch/Chapter9_notebooks/Ch9_Advanced_LangChain_Configurations_and_Pipeline.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

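**Purpose of this notebook:**  
This notebook is an improved version of the pipeline covered in Chapter 8 (**Ch8_Setting_Up_LangChain_Configurations_and_Pipeline.ipynb**).  

We complete a **RAG** pipeline: we generate **embeddings** and store them in a **vector DB** to implement an "internal search" over physicians' notes.  
Unlike the previous notebook, here an LLM reviews the retrieved results, which guards against errors caused by incorrect search hits.  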

**Requirements:**  
* When running in Colab, set the notebook runtime to `Python 3, CPU`.  
* This code uses OpenAI's API as the LLM, so a paid **API key** is required.  

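>*```Disclaimer: The content and ideas presented in this notebook are solely those of the authors and do not represent the views or intellectual property of the authors' employers.```*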

Installs:

In [11]:
# Note:
# If the code below fails because of mismatched Python package versions, newer releases may be the cause.
# In that case, set "default_installations" to False to revert to the original image:
default_installations = True
if default_installations:
    !pip -q install langchain langchain_community
    !pip -q install sentence_transformers
    !pip -q install faiss-cpu
    !pip -q install openai==0.28.1
else:
    import requests
    text_file_path = "requirements__Ch9_Advanced_LangChain_Configurations_and_Pipeline.txt"
    url = "https://raw.githubusercontent.com/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/main/Chapter9_notebooks/" + text_file_path
    res = requests.get(url)
    with open(text_file_path, "w") as f:
        f.write(res.text)

    !pip install -r requirements__Ch9_Advanced_LangChain_Configurations_and_Pipeline.txt

Imports:

In [12]:
import requests
from langchain.document_loaders import TextLoader
import textwrap
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

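Code settings:

### Load the text file containing the mocked-up physician notes  
This file contains the information we want to search.  
In this example, all the mocked-up reports are combined into a single .CSV table to keep loading short and simple.  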

In [13]:
# # Original book's version
# text_file_path = "mocked_up_physician_records.csv"
# url = "https://raw.githubusercontent.com/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/main/Chapter8_notebooks/" + text_file_path
# res = requests.get(url)
# with open(text_file_path, "w") as f:
#     f.write(res.text)

### Korean version

In [14]:
text_file_path = "mocked_up_physician_records_ko.csv"
url = "https://raw.githubusercontent.com/corazzon/Mastering-NLP-from-Foundations-to-LLMs/refs/heads/main/Chapter8_notebooks/" + text_file_path
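res = requests.get(url)
res.raise_for_status()  # fail early if the download did not succeed
# Write as UTF-8 explicitly so the Korean text is stored correctly on any platform
with open(text_file_path, "w", encoding="utf-8") as f:
    f.write(res.text)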

Load the text content of the file:

In [15]:
# Load the document
text_loader = TextLoader(text_file_path)
documents = text_loader.load()

Check the LangChain variable type (useful for understanding how to work with it):

In [16]:
print(type(documents[0]))

<class 'langchain_core.documents.base.Document'>


Check the number of documents:

In [17]:
len(documents)

1

Inspect the text that will be passed to the model.

In [18]:
print(documents[0].page_content[0:2000])

"Title: Mocked up record
Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain

History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old male, presents today with a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr. Anderson recently returned from a business trip to Europe about two weeks ago. He denies any respiratory symptoms or exposure to sick individuals during his travel.

During the evaluation, Mr. Anderson revealed a pertinent family history of cardiovascular disease, with his father having suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type 2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or hospitalizations.

Regarding his chief complaint, Mr. Anderson describes the abdominal pain as a dull, intermittent ache located in the lower right quadrant. He rates the pain as 5 out of 10 in severi

### Split the text data into individual documents for embedding

In [25]:
# In our data file, this short string is the delimiter that separates the different clinical reports:
split_text_by = '"Title: Mocked up record'
chunk_size = 2000
chunk_overlap = 0

In [26]:
# Split the text
text_splitter = CharacterTextSplitter(chunk_size=chunk_size,
                                      chunk_overlap=chunk_overlap,
                                      separator=split_text_by)
splitted_docs = text_splitter.split_documents(documents)

In [27]:
len(splitted_docs)

4
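To make the effect of the separator concrete, here is a minimal pure-Python sketch of separator-based splitting. It is not LangChain's implementation: `split_records` and the two-record `sample` string are hypothetical, and the sketch ignores `chunk_size` and `chunk_overlap`. It only illustrates why the separator text itself does not appear in the resulting chunks.

```python
def split_records(text: str, separator: str) -> list[str]:
    """Split `text` on `separator` and drop empty or whitespace-only pieces."""
    pieces = (piece.strip() for piece in text.split(separator))
    return [piece for piece in pieces if piece]


# Hypothetical miniature of the file contents (not the real data).
sample = (
    '"Title: Mocked up record\nPatient ID: 1\nChief Complaint: Abdominal pain"\n'
    '"Title: Mocked up record\nPatient ID: 2\nChief Complaint: Pregnancy follow-up"\n'
)

records = split_records(sample, '"Title: Mocked up record')
print(len(records))
for record in records:
    print(repr(record))
```

Each record starts right after the delimiter, which matches the way the first chunk printed in the notebook begins with `Physician Name` rather than with the title line.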

In [28]:
print(splitted_docs[0].page_content)

Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain

History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old male, presents today with a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr. Anderson recently returned from a business trip to Europe about two weeks ago. He denies any respiratory symptoms or exposure to sick individuals during his travel.

During the evaluation, Mr. Anderson revealed a pertinent family history of cardiovascular disease, with his father having suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type 2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or hospitalizations.

Regarding his chief complaint, Mr. Anderson describes the abdominal pain as a dull, intermittent ache located in the lower right quadrant. He rates the pain as 5 out of 10 in severity. The pain is exacerbat

### Create embeddings to store in the vector database  
We use an open-source model from Hugging Face.
* https://huggingface.co/sentence-transformers/all-mpnet-base-v2
* MPNet is a Transformer model that combines strengths of BERT, RoBERTa, and XLNet.
* For Korean text, we try https://huggingface.co/upskyy/e5-small-korean instead.

In [29]:
# embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
embeddings = HuggingFaceEmbeddings(model_name="upskyy/e5-small-korean")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


### Create the vector database

<img src="https://python.langchain.com/assets/images/vectorstores-2540b4bc355b966c99b0f02cfdddb273.png">

* Source: https://python.langchain.com/docs/concepts/vectorstores/

We chose FAISS (Facebook AI Similarity Search) as the vector database.

For more details, see the following links:

* https://python.langchain.com/docs/integrations/vectorstores/
* https://python.langchain.com/v0.1/docs/integrations/vectorstores/faiss/#:~:text=Now%2C%20we%20can%20query%20the,similarity_search

In [30]:
vector_db = FAISS.from_documents(splitted_docs, embeddings)
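Conceptually, a vector store keeps one vector per chunk and, at query time, returns the chunks whose vectors lie closest to the query vector. The toy class below sketches that idea with the standard library only. `TinyVectorStore`, the three-dimensional vectors, and the note labels are all made up for illustration, and Euclidean distance is an assumption here; FAISS uses far more efficient index structures.

```python
import math


class TinyVectorStore:
    """Toy in-memory store: keeps (vector, text) pairs and ranks by Euclidean distance."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._items.append((list(vector), text))

    def similarity_search(self, query_vector: list[float], k: int = 4) -> list[str]:
        # Smaller distance means "more similar"; return the k closest texts.
        ranked = sorted(self._items, key=lambda item: math.dist(item[0], query_vector))
        return [text for _, text in ranked[:k]]


store = TinyVectorStore()
store.add([0.9, 0.1, 0.0], "pregnancy follow-up note")
store.add([0.1, 0.9, 0.0], "abdominal pain note")
store.add([0.0, 0.2, 0.9], "fatigue and joint pain note")

top_two = store.similarity_search([0.8, 0.2, 0.1], k=2)
print(top_two)
```

Note that the search always returns the `k` nearest items, however far away they are; nothing in this step checks whether a returned note actually answers the question.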

### Run a similarity search over the "internal" documents  

**Question #1: Are there any pregnant patients due to deliver in August?**  

In [31]:
query1 = "8월에 출산 예정인 임산부 환자가 있나요?"
docs = vector_db.similarity_search(query1)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: July 10, 2099
Patient ID: 246813579
Chief Complaint: Pregnancy Follow-
up

History of Present Illness:
The patient, Mrs. Emily Adams, a 30-year-old female, presents today
for a routine pregnancy follow-up. She is currently 32 weeks pregnant, with a due date of August
27th, 2099. Mrs. Adams is married and lives with her husband.

During the evaluation, Mrs. Adams
reveals a family history of gestational diabetes, with her mother having developed the condition
during her own pregnancies. She mentions no personal history of significant medical conditions,
surgeries, or complications in previous pregnancies.

Regarding her chief complaint, Mrs. Adams
reports typical discomforts associated with the third trimester of pregnancy, including backache,
frequent urination, and occasional heartburn. She denies any vaginal bleeding, severe abdominal
pain, or significant changes in fetal movements. Mrs. Adams mentions adhering to a well-balanced
diet and regular exercise

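**[Example of an error!] Question #2: Are there any pregnant patients whose due date is in September?**  
This is an example where similarity search returns a **wrong result**.  
It returns text that is similar to the question, but the case shows that being similar is not the same as being the correct answer.  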

In [32]:
query2 = "출산 예정일이 9월인 임산부가 있나요?"
docs = vector_db.similarity_search(query2)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: July 10, 2099
Patient ID: 246813579
Chief Complaint: Pregnancy Follow-
up

History of Present Illness:
The patient, Mrs. Emily Adams, a 30-year-old female, presents today
for a routine pregnancy follow-up. She is currently 32 weeks pregnant, with a due date of August
27th, 2099. Mrs. Adams is married and lives with her husband.

During the evaluation, Mrs. Adams
reveals a family history of gestational diabetes, with her mother having developed the condition
during her own pregnancies. She mentions no personal history of significant medical conditions,
surgeries, or complications in previous pregnancies.

Regarding her chief complaint, Mrs. Adams
reports typical discomforts associated with the third trimester of pregnancy, including backache,
frequent urination, and occasional heartburn. She denies any vaginal bleeding, severe abdominal
pain, or significant changes in fetal movements. Mrs. Adams mentions adhering to a well-balanced
diet and regular exercise
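The failure above can be reproduced in miniature without any embedding model. The sketch below scores two made-up one-line notes against a "September" question using bag-of-words cosine similarity. This is a deliberately crude stand-in for the real embeddings, and `bag`, `cosine`, and the `notes` dictionary are hypothetical. The point is that the top match can score highly while still contradicting the question.

```python
import math
from collections import Counter


def bag(text: str) -> Counter:
    """Lower-cased bag-of-words representation of a text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Made-up one-line stand-ins for the physician notes.
notes = {
    "pregnancy": "pregnant patient with a due date in august",
    "abdominal": "patient with abdominal pain after travel to europe",
}
query = "pregnant patient with a due date in september"

scores = {name: cosine(bag(query), bag(text)) for name, text in notes.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

The pregnancy note wins by a wide margin even though it mentions August, not September, which is why a second step that actually reads the retrieved text is needed.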

**Question #3: Which patients have traveled recently?**

In [33]:
query3 = "최근에 여행을 다녀온 환자는 누구인가요?"
docs = vector_db.similarity_search(query3)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: June 25, 2099
Patient ID: 987654321
Chief Complaint: Abdominal pain
History of Present Illness:
The patient, Mr. John Anderson, a 42-year-old male, presents today with
a chief complaint of abdominal pain. He is married and resides with his wife and two children. Mr.
Anderson recently returned from a business trip to Europe about two weeks ago. He denies any
respiratory symptoms or exposure to sick individuals during his travel.

During the evaluation, Mr.
Anderson revealed a pertinent family history of cardiovascular disease, with his father having
suffered a myocardial infarction in his 60s. He also reports that his maternal grandmother had type
2 diabetes. Mr. Anderson denies any personal history of chronic illnesses, surgeries, or
hospitalizations.

Regarding his chief complaint, Mr. Anderson describes the abdominal pain as a
dull, intermittent ache located in the lower right quadrant. He rates the pain as 5 out of 10 in
severity. The pain is exacerbate

**Question #4: Which patients need lab tests?**

In [34]:
query4 = "검사실 검사가 필요한 환자는 누구인가요?"
docs = vector_db.similarity_search(query4)
print(textwrap.fill(str(docs[0].page_content), width=100, replace_whitespace=False))

Physician Name: Dr. ABC
Date: November 15, 2099
Patient ID: 123456789
Chief Complaint: Fatigue and
joint pain

History of Present Illness:
The patient, Ms. Sarah Thompson, a 57-year-old female,
presents today with complaints of fatigue and joint pain. Ms. Thompson is widowed and lives alone.
She has no recent history of travel outside the country.

During the evaluation, Ms. Thompson
reveals a pertinent family history of autoimmune diseases, with her sister being diagnosed with
rheumatoid arthritis. She also reports a personal history of hypothyroidism, which is being managed
with thyroid hormone replacement therapy.

Regarding her chief complaint, Ms. Thompson describes the
fatigue as persistent and overwhelming, affecting her ability to perform daily activities. She rates
her fatigue as 8 out of 10 in severity. Additionally, she reports joint pain primarily in her knees
and wrists, which is worse in the morning and improves with movement throughout the day. She denies
any swelling or

****
Before the next step, free up memory (useful if you choose a locally hosted LLM):

In [35]:
import sys

# List the names in the current namespace whose shallow size exceeds 999 bytes
local_vars = list(locals().items())
for var, obj in local_vars:
    if sys.getsizeof(obj) > 999:
        print(var, sys.getsizeof(obj))

HuggingFaceHub 1688
ChatOpenAI 1688
GPT4All 1688
_i11 1650
TextLoader 1688
CharacterTextSplitter 1688
HuggingFaceEmbeddings 1688
FAISS 1688
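One caveat when reading this listing: on CPython, `sys.getsizeof` reports only the shallow size of an object, not the memory held by the objects it refers to. That is probably why every imported class above shows the same 1688 bytes. The sketch below illustrates the difference for a plain list; `deep_size_of_list` is a hypothetical helper that goes only one level deep.

```python
import sys


def deep_size_of_list(items: list) -> int:
    """Shallow size of the list plus the shallow size of each element (one level only)."""
    return sys.getsizeof(items) + sum(sys.getsizeof(item) for item in items)


big_strings = ["x" * 10_000 for _ in range(100)]

shallow = sys.getsizeof(big_strings)      # size of the list object and its pointers
deeper = deep_size_of_list(big_strings)   # also counts the strings it points to
print(shallow, deeper)
```

A large object such as a vector index can therefore look small in a `getsizeof`-based listing while still holding most of the memory.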


In [36]:
import gc
del CharacterTextSplitter
del HuggingFaceEmbeddings
del TextLoader
del FAISS
gc.collect()

369

****

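# Improving on the Chapter 8 notebook: setting up an LLM to process requests  
We now improve the pipeline. Instead of simply handing the similarity search results to the physician, we pass the results deemed similar to the request to an LLM, which reviews them and determines which ones actually answer the physician's question.  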

OpenAI API key:  
**Enter the key issued by OpenAI as a string in place of "..." below!**  


To open Colab's Secrets settings, click the key icon in the left sidebar.
<img src="https://i.imgur.com/7P383n4.png" width="500">




In [72]:
openai_api_key = "..."
# To use Colab Secrets, uncomment the lines below
# from google.colab import userdata
# openai_api_key = userdata.get('OPENAI_API_KEY')
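Hard-coding a key in a notebook makes it easy to leak by accident. As one possible alternative, the sketch below looks for the key in an environment variable and only then falls back to an interactive prompt. `resolve_api_key` and the `DEMO_LLM_API_KEY` variable are hypothetical names used for illustration; only `os.environ` and `getpass` from the standard library are involved.

```python
import os
from getpass import getpass


def resolve_api_key(explicit_key: str = "...", env_var: str = "OPENAI_API_KEY") -> str:
    """Return the first available key: explicit value, environment variable, then prompt."""
    if explicit_key and explicit_key != "...":
        return explicit_key                  # a real key was typed into the notebook
    from_env = os.environ.get(env_var)
    if from_env:
        return from_env                      # key provided through the environment
    return getpass(f"Enter {env_var}: ")     # last resort: ask without echoing input


# Demo with a placeholder value in a made-up variable, so no prompt is triggered.
os.environ.setdefault("DEMO_LLM_API_KEY", "not-a-real-key")
print(resolve_api_key(env_var="DEMO_LLM_API_KEY"))
```

In this notebook it could be used as `openai_api_key = resolve_api_key(openai_api_key)` right after the cell above.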

In [73]:
!pip -q install openai gpt4all==1.0.12

In [74]:
!wget https://huggingface.co/TheBloke/Nous-Hermes-13B-GGML/resolve/main/nous-hermes-13b.ggmlv3.q4_0.bin


In [4]:
import os
import langchain
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub
from langchain.chat_models import ChatOpenAI
from langchain.llms import GPT4All

### LLM setup: choose between a paid LLM (OpenAI's GPT) and a free LLM (a Hugging Face model)  

In [2]:
paid_vs_free = "paid"

# Path to the GPT4All .bin file (suitable for running in Google Colab):
path_to_bin = "./nous-hermes-13b.ggmlv3.q4_0.bin"

# Backend LLM:
# Choose among models such as "gptj", "llama", and others.
backend_llm = "llama"

In [5]:
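if paid_vs_free == "paid":
    os.environ["OPENAI_API_KEY"] = openai_api_key
    llm = ChatOpenAI()
elif paid_vs_free == "free":
    llm = GPT4All(
        model=path_to_bin,
        max_tokens=1000,
        # backend=backend_llm,
        verbose=False)
else:
    # Fail loudly instead of leaving `llm` undefined
    raise ValueError(f'paid_vs_free must be "paid" or "free", got {paid_vs_free!r}')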

Found model file at  ./nous-hermes-13b.ggmlv3.q4_0.bin


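### Create the QA chain  
We use `load_qa_chain()` to build the generation step of the RAG framework. With `chain_type="stuff"`, the chain takes the documents passed to it, inserts ("stuffs") all of them into a single prompt together with the user's question, and sends that prompt to the selected LLM so it can answer with the appropriate context. Retrieval itself is not done by the chain; the relevant text is found beforehand with `vector_db.similarity_search()`.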

In [37]:
chain = load_qa_chain(llm, chain_type="stuff")
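The "stuff" strategy is simple enough to sketch by hand: join the retrieved texts and place them, with the question, into one prompt. The template below is illustrative only and is not the prompt LangChain actually uses; `ILLUSTRATIVE_TEMPLATE` and `build_stuff_prompt` are hypothetical names.

```python
# Illustrative wording only; the real chain ships its own prompt template.
ILLUSTRATIVE_TEMPLATE = (
    "Use the following context to answer the question. "
    "If the context does not contain the answer, say so.\n\n"
    "{context}\n\nQuestion: {question}\nAnswer:"
)


def build_stuff_prompt(doc_texts: list[str], question: str, separator: str = "\n\n") -> str:
    """Concatenate all document texts into one context block and fill the template."""
    context = separator.join(doc_texts)
    return ILLUSTRATIVE_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    ["Note A: due date of August 27th.", "Note B: abdominal pain after travel."],
    "Is any patient due in September?",
)
print(prompt)
```

Because every retrieved document goes into a single prompt, this approach fits best when `k` is small and the chunks are short enough for the model's context window.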

### Search with the same requests, but use the LLM as the "brain" instead of relying on embedding similarity alone

**Question #1: Are there any pregnant patients due to deliver in August?**  

In [38]:
import langchain
langchain.debug = True

In [None]:
current_query = query1
print(current_query)
docs = vector_db.similarity_search(current_query, k=2)
# print(chain.run(input_documents=docs, question=current_query))
response = chain.invoke({"input_documents": docs, "question": current_query})
response

8월에 출산 예정인 임산부 환자가 있나요?
[chain/start] [chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
{
  "question": "8월에 출산 예정인 임산부 환자가 있나요?",
  "context": "Physician Name: Dr. ABC\nDate: July 10, 2099\nPatient ID: 246813579\nChief Complaint: Pregnancy Follow-up\n\nHistory of Present Illness:\nThe patient, Mrs. Emily Adams, a 30-year-old female, presents today for a routine pregnancy follow-up. She is currently 32 weeks pregnant, with a due date of August 27th, 2099. Mrs. Adams is married and lives with her husband.\n\nDuring the evaluation, Mrs. Adams reveals a family history of gestational diabetes, with her mother having developed the condition during her own pregnancies. She mentions no personal history of significant medical conditions, surgeries, or complications in previous pregnancies.\n\nRegarding her chief complaint, Mrs. Adams repo

In [None]:
docs[0]
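`chain.invoke()` returns a dictionary rather than a plain string. For chains built with `load_qa_chain`, the answer is, as far as we know, stored under the `output_text` key next to the inputs. The sketch below pulls the answer out and wraps it for reading. `format_answer` is a hypothetical helper, and `mock_response` is a placeholder dictionary, not real model output.

```python
import textwrap


def format_answer(response: dict, width: int = 100) -> str:
    """Extract the answer text from a chain response and wrap it to `width` columns."""
    answer = response.get("output_text", "")
    return textwrap.fill(answer, width=width)


# Placeholder structure standing in for a real chain response.
mock_response = {
    "input_documents": [],
    "question": "placeholder question",
    "output_text": "placeholder answer " * 10,
}

wrapped = format_answer(mock_response, width=40)
print(wrapped)
```

With the notebook's objects this would be called as `print(format_answer(response))` after the `chain.invoke(...)` cell.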

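**[OpenAI's LLM detected and avoided the error!] Question #2: Are there any pregnant patients due to deliver in September?**  
However, some free LLMs that are quantized, and thus "degraded" in quality, may fail here and offer the August due date as the answer even though the question asked about a September due date.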

In [None]:
current_query = query2
print(current_query)
docs = vector_db.similarity_search(current_query, k=2)
# print(chain.run(input_documents=docs, question=current_query))
response = chain.invoke({"input_documents": docs, "question": current_query})
response

출산 예정일이 9월인 임산부가 있나요?
[chain/start] [chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
{
  "question": "출산 예정일이 9월인 임산부가 있나요?",
  "context": "Physician Name: Dr. ABC\nDate: July 10, 2099\nPatient ID: 246813579\nChief Complaint: Pregnancy Follow-up\n\nHistory of Present Illness:\nThe patient, Mrs. Emily Adams, a 30-year-old female, presents today for a routine pregnancy follow-up. She is currently 32 weeks pregnant, with a due date of August 27th, 2099. Mrs. Adams is married and lives with her husband.\n\nDuring the evaluation, Mrs. Adams reveals a family history of gestational diabetes, with her mother having developed the condition during her own pregnancies. She mentions no personal history of significant medical conditions, surgeries, or complications in previous pregnancies.\n\nRegarding her chief complaint, Mrs. Adams reports 

**Question #3: Which patients have traveled recently?**

In [None]:
current_query = query3
print(current_query)
docs = vector_db.similarity_search(current_query)
# print(chain.run(input_documents=docs, question=current_query))
response = chain.invoke({"input_documents": docs, "question": current_query})
response

In [None]:
docs[1]

**Question #4: Which patients need lab tests?**

In [None]:
current_query = query4
print(current_query)
docs = vector_db.similarity_search(current_query)
# print(chain.run(input_documents=docs, question=current_query))
response = chain.invoke({"input_documents": docs, "question": current_query})
response

In [None]:
docs[0]

**Question #4, revised: Which patients *definitely* need lab tests?**

In [None]:
current_query = "확실하게 검사실 검사를 필요로 하는 환자는 누구인가요?"
print(current_query)
docs = vector_db.similarity_search(current_query)
# print(chain.run(input_documents=docs, question=current_query))
response = chain.invoke({"input_documents": docs, "question": current_query})
response
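The cells in this section all repeat the same two steps: retrieve candidate notes, then ask the LLM to answer from them. One way to avoid the repetition is a small helper that receives both steps as callables. The sketch below uses stub functions so it runs on its own; `answer_query`, `fake_search`, and `fake_ask` are hypothetical names, not part of LangChain.

```python
from typing import Callable, Sequence


def answer_query(
    query: str,
    search: Callable[..., Sequence],
    ask: Callable[[Sequence, str], str],
    k: int = 4,
) -> str:
    """Retrieve up to `k` candidate documents, then let `ask` answer from them."""
    docs = search(query, k=k)
    return ask(docs, query)


# Stubs standing in for the vector store and the QA chain.
def fake_search(query: str, k: int = 4) -> list[str]:
    return [f"doc {i} for {query!r}" for i in range(k)]


def fake_ask(docs: Sequence, question: str) -> str:
    return f"{len(docs)} documents reviewed for: {question}"


result = answer_query("example question", fake_search, fake_ask, k=2)
print(result)
```

With the objects defined in this notebook, `search` could be `vector_db.similarity_search` and `ask` could be `lambda docs, q: chain.invoke({"input_documents": docs, "question": q})["output_text"]`.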