<a href="https://colab.research.google.com/github/appletreeleaf/NLP_Projects/blob/main/LangChain%EC%9D%84_%ED%99%9C%EC%9A%A9%ED%95%9C_QA_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Chatbot That Answers from Multiple Documents

> YouTube: [빵형의 개발도상국](https://www.youtube.com/@bbanghyong)

- QA ChatBot
- LangChain
- ChatGPT (gpt-3.5-turbo)
- ChromaDB

> Reference: https://youtu.be/3yPBVii7Ct0

In [None]:
!pip install -q langchain openai tiktoken chromadb

## Multiple documents

> 21 TechCrunch articles

In [2]:
!wget -q https://github.com/kairess/toy-datasets/raw/master/techcrunch-articles.zip
!unzip -q techcrunch-articles.zip -d articles

```
We download the 21 articles and save them in the articles folder.
```

## Setting up LangChain

OpenAI API Key

https://platform.openai.com/account/api-keys

In [3]:
import os

# Never commit a real key; substitute your own key here or set it outside the notebook.
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

In [4]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader

## Load multiple and process documents

In [5]:
# loader = TextLoader('single_text_file.txt')
loader = DirectoryLoader('./articles', glob="*.txt", loader_cls=TextLoader)

documents = loader.load()

len(documents)

21

```
We use DirectoryLoader to load every article in the folder.
To load a single .txt file, use TextLoader('single_text_file.txt') instead.
```

## Split texts

In [6]:
# at most 1000 characters per chunk, with 200 characters shared between neighboring chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

len(texts)

233

```
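ChatGPT limits the number of input tokens,
so we split each article into chunks of at most 1,000 characters.
chunk_overlap=200 means that up to 200 characters from the end of
the previous chunk are repeated at the start of the next chunk.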
```
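
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter. It is a simplified illustration, not the algorithm `RecursiveCharacterTextSplitter` uses (the library also tries to break on separators such as paragraphs and sentences), and the name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size character windows that overlap.

    Simplified illustration only: the real splitter also looks for
    natural break points such as blank lines and sentence ends.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 2,500-character dummy string stands in for an article.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

With these settings, the last 200 characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.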

In [14]:
texts[2:4]

[Document(page_content='What the memo points out is that in March, a leaked foundation language model from Meta, called LLaMA, was leaked in fairly rough form. Within weeks, people tinkering around on laptops and penny-a-minute servers had added core features like instruction tuning, multiple modalities and reinforcement learning from human feedback. OpenAI and Google were probably poking around the code, too, but they didn’t — couldn’t — replicate the level of collaboration and experimentation occurring in subreddits and Discords.\n\nCould it really be that the titanic computation problem that seemed to pose an insurmountable obstacle — a moat — to challengers is already a relic of a different era of AI development?\n\nSam Altman already noted that we should expect diminishing returns when throwing parameters at the problem. Bigger isn’t always better, sure — but few would have guessed that smaller was instead.\n\nGPT-4 is a Walmart, and nobody actually likes Walmart', metadata={'sour

## Create Chroma DB

1. Text -> Embeddings
2. Save the data to the `db` folder
3. Reset the DB
4. Load the DB from the `db` folder

In [15]:
persist_directory = 'db'

embedding = OpenAIEmbeddings()

vectordb = Chroma.from_documents(
    documents=texts,
    embedding=embedding,
    persist_directory=persist_directory)


In [16]:
vectordb.persist()
vectordb = None

```
We reset vectordb to None so that it can be reloaded from the `db` folder below.
```

In [18]:
# load vector db
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding)
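
The embed, save, reset, and reload cycle can be tried without Chroma or the OpenAI API. The sketch below assumes a toy letter-frequency "embedding" and a JSON file as storage; `toy_embed`, `save_store`, and `load_store` are hypothetical names, and Chroma's real on-disk format and embeddings are different.

```python
import json
import os
import string
import tempfile
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for OpenAIEmbeddings: a 26-dimensional letter-count vector."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return [counts.get(ch, 0) for ch in string.ascii_lowercase]


def save_store(texts, path):
    """Embed each text and write text/vector pairs to a JSON file."""
    records = [{"text": t, "vector": toy_embed(t)} for t in texts]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)


def load_store(path):
    """Read the text/vector pairs back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.json")
    save_store(["first chunk", "second chunk"], path)  # steps 1-2: embed and save
    store = None                                       # step 3: drop the in-memory copy
    store = load_store(path)                           # step 4: reload from disk

print(len(store), len(store[0]["vector"]))
```

The point of the round trip is that the vectors are computed once and reused, which matters when each embedding costs an API call.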

## Make a retriever

In [19]:
retriever = vectordb.as_retriever()

In [28]:
# return relevant documents about query
docs = retriever.get_relevant_documents("What is Generative AI?")

# contents
print('**contents** \n')
for i, doc in enumerate(docs):
    print(f"{i+1}. {doc.page_content}")
    print('-'*100)

# file_source
print('**file_source** \n')
for i, doc in enumerate(docs):
    print(f"{i+1}. {doc.metadata['source']}")

**contents** 

1. Developers can get in on the action too, building AI steps into workflows, giving them the option of tapping into external apps and large language models to build generative AI experiences themselves. Just last week the company made its updated developer experience generally available, and this should make it easier to incorporate generative AI into the platform in customized ways, Seaman says.

“So this gives us the foundation to give users choice and flexibility to bring AI into their work in their business whenever they’re ready, and however they like. We’ve got 2,600 apps in the ecosystem right now, which includes a lot of the leading LLMs, and we see a lot of customers already choosing to integrate generative AI into Slack themselves,” he said.
----------------------------------------------------------------------------------------------------
2. As brands incorporate generative AI into their creative workflows to generate new content associated with the company,

### Return k results

In [30]:
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

In [31]:
docs = retriever.get_relevant_documents("What is Generative AI?")

# k = 3
for i, doc in enumerate(docs):
    print(f"{i+1}. {doc.metadata['source']}")

1. articles/05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt
2. articles/05-03-nova-is-building-guardrails-for-generative-ai-content-to-protect-brand-integrity.txt
3. articles/05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt
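
`search_kwargs={"k": 3}` limits the retriever to the three chunks most similar to the query. A minimal sketch of that top-k selection follows, assuming cosine similarity over made-up 3-dimensional vectors; the distance metric and index Chroma actually uses are not shown in this notebook and may differ, and `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k=3):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        indexed_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up vectors; real embeddings have far more dimensions.
index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [0.5, 0.5, 0.0],
    "chunk_e": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.0, 0.0], index, k=3))
```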


## Make a chain

In [33]:
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True)

In [40]:
def process_llm_response(llm_response):
    print('answer:')
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

## Query

In [41]:
query = "How much money did Pando raise?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

answer:
 Pando raised $30 million in a Series B round, bringing its total raised to $45 million.


Sources:
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt


```
After declaring qa_chain,
pass it a query,
and it consults the documents stored in the vector DB
to return an answer that fits the question.

Next, let's look at the chain's output in detail.
```
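
With `chain_type="stuff"`, the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question, and that prompt is sent to the LLM in one call. The sketch below shows the idea only: the template wording is hypothetical rather than LangChain's exact prompt, and `build_stuff_prompt` is a made-up helper.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "How much money did Pando raise?",
    ["<text of retrieved chunk 1>", "<text of retrieved chunk 2>"],
)
print(prompt)
```

Because every chunk goes into one prompt, the combined length of the k retrieved chunks plus the question has to fit within the model's input limit, which is one reason to keep both `chunk_size` and `k` small.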

In [None]:
llm_response

```
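The output of qa_chain is a dictionary
with the keys query, result, and source_documents.

source_documents shows which documents the answer was drawn from.
In the example above only one article was referenced,
but the chain can also combine content from several documents.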
```
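
Since several chunks can come from the same article, the same file name may be printed more than once under Sources. Here is a sketch of collecting each source only once while keeping the retrieval order. `FakeDocument` is a hypothetical stand-in for LangChain's `Document` so that the example runs without an API call, and `unique_sources` is not part of the original notebook.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in with the two attributes the notebook reads."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(llm_response):
    """Return each source path once, in the order it was first retrieved."""
    seen = []
    for doc in llm_response["source_documents"]:
        source = doc.metadata["source"]
        if source not in seen:
            seen.append(source)
    return seen


# A made-up response with the same keys as the chain output described above.
fake_response = {
    "query": "example question",
    "result": "example answer",
    "source_documents": [
        FakeDocument("chunk 1", {"source": "articles/a.txt"}),
        FakeDocument("chunk 2", {"source": "articles/a.txt"}),
        FakeDocument("chunk 3", {"source": "articles/b.txt"}),
    ],
}
print(unique_sources(fake_response))
```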

In [45]:
query = "Who led the round in Pando?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

answer:
 Iron Pillar and Uncorrelated Ventures.


Sources:
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt


In [27]:
query = "What did Databricks acquire?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Databricks acquired Okera, a data governance platform with a focus on AI.


Sources:
articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt


In [26]:
query = "What is Generative AI?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Generative AI is a type of artificial intelligence that uses algorithms and data to create new and unique content, such as text, images, or code. It can be incorporated into workflows and platforms to automate and enhance creative processes, and can be customized to adhere to a brand's style and guidelines. 


Sources:
articles/05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt
articles/05-03-nova-is-building-guardrails-for-generative-ai-content-to-protect-brand-integrity.txt
articles/05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt


In [28]:
query = "Who is CMA?"
llm_response = qa_chain(query)
process_llm_response(llm_response)


CMA stands for Competition and Markets Authority. It is a UK government agency that is responsible for promoting competition and fair markets for consumers and businesses.


Sources:
articles/05-04-cma-generative-ai-review.txt
articles/05-04-cma-generative-ai-review.txt
articles/05-04-cma-generative-ai-review.txt
