<a href="https://colab.research.google.com/github/chlwldns00/NLP/blob/main/%EB%AC%B8%EC%84%9C%EA%B8%B0%EB%B0%98_%EB%8B%B5%EB%B3%80%EC%9D%84_%ED%95%B4%EC%A3%BC%EB%8A%94_QA_%EC%B1%97%EB%B4%87_Langchain%EC%82%AC%EC%9A%A9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Goal: build a chatbot that answers questions from manuals (document-based QA) for organizations such as companies and educational institutions


------------------
### Technologies used
- QA ChatBot
- LangChain
- ChatGPT (gpt-3.5-turbo)
- ChromaDB (vector database)



In [1]:
%pip install -q langchain openai tiktoken chromadb

Note: you may need to restart the kernel to use updated packages.


## Multiple documents

> 21 TechCrunch news articles (txt files)

In [2]:
!wget -q https://github.com/kairess/toy-datasets/raw/master/techcrunch-articles.zip
!unzip -q techcrunch-articles.zip -d articles

## LangChain

OpenAI API Key

https://platform.openai.com/account/api-keys

In [2]:
import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"  # placeholder: set your own key here and never commit a real one
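
Instead of writing the key into the notebook, one option is to read it from the environment and prompt for it only when it is missing. This is a minimal sketch; the helper name `ensure_api_key` and its parameters are hypothetical and not part of LangChain or the OpenAI client.

```python
import os
from getpass import getpass


def ensure_api_key(env=os.environ, prompt=getpass, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, asking for it only if it is missing."""
    key = env.get(name)
    if not key:
        # getpass hides the typed value, so the key never appears in cell output.
        key = prompt(f"Enter {name}: ")
        env[name] = key
    return key
```

Calling `ensure_api_key()` in a cell before the LangChain imports would keep the key out of the saved notebook.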

In [3]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader

## Data loading
LangChain's document loaders

In [5]:
# loader = TextLoader('single_text_file.txt')  # for a single document
loader = DirectoryLoader('./articles', glob="*.txt", loader_cls=TextLoader)

documents = loader.load()

len(documents)

21

## Split the text into 1,000-character chunks with a 200-character overlap so the context does not exceed the GPT API's input token limit


In [7]:
# Split each article into chunks of about 1,000 characters that overlap by 200 characters
# (a whole article passed as context could exceed the model's context length).
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

len(texts)

233
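
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that uses a fixed-width sliding window. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter tries to break on separators such as paragraphs and sentences first); the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Small numbers make the overlap easy to see.
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.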

In [10]:
texts[2:4]  # check that the chunks look right

[Document(page_content='Go right ahead.\n\nSo as you might have noticed, Sergey has come back to do a little bit more on the artificial intelligence side of things, which is something he’s always been interested in; I would say historically, we’ve always had an interest in artificial intelligence. But that has escalated significantly over the past decade or so. The acquisition of DeepMind was a brilliant choice. And you can see some of the outcomes first of the spectacular stuff, like playing Go and winning. And then the more productive stuff, like figuring out how 200 million proteins are folded up.', metadata={'source': 'articles/05-05-vint-cerf-on-the-exhilarating-mix-of-thrill-and-hazard-at-the-frontiers-of-tech.txt'}),
 Document(page_content='Then there’s the large language models and the chatbots. And I think we’re still in a very peculiar period of time, where we’re trying to characterize what these things can and can’t do, and how they go off the rails, and how do you take adva

## Create Chroma DB

1. Embed the split text
2. Save the data to the `db` folder
3. Reset the DB object
4. Load the DB from the `db` folder

In [9]:
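persist_directory = 'db'  # directory where the DB is stored

embedding = OpenAIEmbeddings()  # use OpenAI embeddings; the aim here is only to check that sources are cited correctly (a hallucination test)

vectordb = Chroma.from_documents(  # configure the vector DB: documents, embedding method, storage location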
    documents=texts,
    embedding=embedding,
    persist_directory=persist_directory)

In [None]:
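vectordb.persist()  # write the vector DB to disk in persist_directory
vectordb = None  # drop the in-memory reference so the DB can be reloaded from disk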

In [11]:
vectordb = Chroma(
    persist_directory=persist_directory,  # load from the db folder
    embedding_function=embedding)
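
The embed, save, reset, and load cycle can be illustrated without any external service. The sketch below is only an analogy: `toy_embed` is a hypothetical stand-in for a real embedding model, and a JSON file stands in for Chroma's on-disk format, which is different.

```python
import json
import tempfile
from pathlib import Path


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: one vowel count per dimension."""
    return [float(text.lower().count(vowel)) for vowel in "aeiou"]


def save_store(records, directory):
    """Write the records to store.json inside the given directory."""
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    (path / "store.json").write_text(json.dumps(records), encoding="utf-8")


def load_store(directory):
    """Read the records back from store.json."""
    return json.loads((Path(directory) / "store.json").read_text(encoding="utf-8"))


chunks = ["Pando raised $30 million", "Databricks acquired Okera"]
with tempfile.TemporaryDirectory() as tmp:
    store = [{"text": c, "vector": toy_embed(c)} for c in chunks]  # 1. embed
    save_store(store, tmp)                                         # 2. save
    store = None                                                   # 3. reset
    store = load_store(tmp)                                        # 4. load

reloaded_texts = [record["text"] for record in store]
print(reloaded_texts)
```

The point of the round trip is that the embeddings are computed once and reused after a restart, instead of being recomputed (and billed) on every run.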

## Make a retriever

In [15]:
retriever = vectordb.as_retriever()  # define a retriever over the vector DB
retriever

VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7e76e4aad600>)

In [14]:
docs = retriever.get_relevant_documents("What is Generative AI?")  # find the documents in the DB most relevant to this question; each document's source is in its metadata

print(docs[0])
for doc in docs:
    print(doc.metadata["source"])

page_content='Developers can get in on the action too, building AI steps into workflows, giving them the option of tapping into external apps and large language models to build generative AI experiences themselves. Just last week the company made its updated developer experience generally available, and this should make it easier to incorporate generative AI into the platform in customized ways, Seaman says.\n\n“So this gives us the foundation to give users choice and flexibility to bring AI into their work in their business whenever they’re ready, and however they like. We’ve got 2,600 apps in the ecosystem right now, which includes a lot of the leading LLMs, and we see a lot of customers already choosing to integrate generative AI into Slack themselves,” he said.' metadata={'source': 'articles/05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt'}
articles/05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt
articles/05-03-nova-is-building-

## Return k results

In [16]:
retriever = vectordb.as_retriever(search_kwargs={"k": 3})


In [17]:
docs = retriever.get_relevant_documents("What is Generative AI?")

for doc in docs:
    print(doc.metadata["source"])

articles/05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt
articles/05-03-nova-is-building-guardrails-for-generative-ai-content-to-protect-brand-integrity.txt
articles/05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt
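
Conceptually, a retriever with `k=3` scores every stored vector against the query vector and keeps the three best matches. The sketch below uses cosine similarity as an assumption; the metric Chroma actually applies depends on how the collection is configured, and the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.1], doc_vecs, k=3))
# → [0, 2, 1]
```

A smaller `k` keeps the prompt short; a larger `k` gives the model more context at the cost of more tokens.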


## Build the retrieval QA chain

In [18]:
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True)
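
With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The sketch below shows that idea only; the template wording and the function name `build_stuff_prompt` are hypothetical and are not LangChain's actual prompt.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "How much money did Pando raise?",
    ["Pando raised $30 million in a Series B round.", "Iron Pillar led the round."],
)
print(prompt)
```

Because everything goes into one prompt, this approach only works while `k` chunks of `chunk_size` characters fit within the model's context window, which is why the documents were split earlier.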


In [19]:
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

## Test: first, ask questions whose answers are in the DB text, phrased almost the same way as the text


In [20]:
query = "How much money did Pando raise?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Pando raised $30 million in a Series B round, bringing its total raised to $45 million.


Sources:
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
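
The same file is listed three times because all three retrieved chunks came from one article. If one line per article is preferred, the sources can be de-duplicated while keeping their order. This is a minimal sketch; `unique_sources` is a hypothetical helper, and the `SimpleNamespace` objects and file names below only imitate the `.metadata` shape of the documents for the demo.

```python
from types import SimpleNamespace


def unique_sources(source_documents):
    """Return each source path once, in the order it first appears."""
    # dict.fromkeys keeps insertion order and drops repeated keys.
    return list(dict.fromkeys(doc.metadata["source"] for doc in source_documents))


# Stand-in objects with illustrative file names (not real articles).
fake_docs = [
    SimpleNamespace(metadata={"source": "articles/a.txt"}),
    SimpleNamespace(metadata={"source": "articles/a.txt"}),
    SimpleNamespace(metadata={"source": "articles/b.txt"}),
]
print(unique_sources(fake_docs))
# → ['articles/a.txt', 'articles/b.txt']
```

In `process_llm_response`, looping over `unique_sources(llm_response["source_documents"])` instead of the raw list would print each article once.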


## Inspect the LLM response structure

In [21]:
llm_response

{'query': 'How much money did Pando raise?',
 'result': ' Pando raised $30 million in a Series B round, bringing its total raised to $45 million.',
 'source_documents': [Document(page_content='Signaling that investments in the supply chain sector remain robust, Pando, a startup developing fulfillment management technologies, today announced that it raised $30 million in a Series B round, bringing its total raised to $45 million.\n\nIron Pillar and Uncorrelated Ventures led the round, with participation from existing investors Nexus Venture Partners, Chiratae Ventures and Next47. CEO and founder Nitin Jayakrishnan says that the new capital will be put toward expanding Pando’s global sales, marketing and delivery capabilities.\n\n“We will not expand into new industries or adjacent product areas,” he told TechCrunch in an email interview. “Great talent is the foundation of the business — we will continue to augment our teams at all levels of the organization. Pando is also open to explori

In [1]:
llm_response['source_documents']

NameError: ignored

In [None]:
query = "Who led the round in Pando?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Iron Pillar and Uncorrelated Ventures led the round.


Sources:
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt


In [None]:
query = "What did Databricks acquire?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Databricks acquired Okera, a data governance platform with a focus on AI.


Sources:
articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt


In [None]:
query = "What is Generative AI?"
llm_response = qa_chain(query)
process_llm_response(llm_response)


 Generative AI is a type of Artificial Intelligence that is used to create new content from existing data. It can be used to automate tasks, generate content, and create new ideas.


Sources:
articles/05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt
articles/05-03-nova-is-building-guardrails-for-generative-ai-content-to-protect-brand-integrity.txt
articles/05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt


In [None]:
query = "Who is CMA?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 The CMA is the Competition and Markets Authority, a body in the U.K. responsible for enforcing competition and consumer protection law.


Sources:
articles/05-04-cma-generative-ai-review.txt
articles/05-04-cma-generative-ai-review.txt
articles/05-04-cma-generative-ai-review.txt


In [None]:
!git init

Reinitialized existing Git repository in /content/drive/MyDrive/Colab Notebooks/.git/


In [None]:
!git config --global user.name 'chlwldns00'
!git config --global user.email 'chlwldns00@naver.com'

In [None]:
cd /content/drive/MyDrive/Colab Notebooks/Riffusion모델을 사용한 음악생성 인공지능 모델 테스트해보기

[Errno 20] Not a directory: '/content/drive/MyDrive/Colab Notebooks/Riffusion모델을 사용한 음악생성 인공지능 모델 테스트해보기'
/content/drive/MyDrive/Colab Notebooks


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!git add 문서기반 답변을 해주는 QA 챗봇-Langchain사용

fatal: pathspec '문서기반' did not match any files


In [None]:

!git remote add origin https://YOUR_GITHUB_TOKEN@github.com/TmaxSoftProject/Basic_ChatBot_DBreference


error: remote origin already exists.


In [None]:
!git remote set-url origin https://YOUR_GITHUB_TOKEN@github.com/TmaxSoftProject/Basic_ChatBot_DBreference

In [None]:
!git remote -v


origin	https://YOUR_GITHUB_TOKEN@github.com/TmaxSoftProject/Basic_ChatBot_DBreference (fetch)
origin	https://YOUR_GITHUB_TOKEN@github.com/TmaxSoftProject/Basic_ChatBot_DBreference (push)


In [None]:
!git commit -m "firstcommit"

[master 8e02ed1] firstcommit
 1 file changed, 1 insertion(+), 1 deletion(-)


In [None]:
!git push origin master



Enumerating objects: 98, done.

In [None]:
!git config --global --list

user.name=chlwldns00
user.email=chlwldns00@naver.com
