# LlamaIndex practice for the fine-tuning session of the SOPT week 3 study group
## Author: Sigrid Jin (twitter.com/sigridjin_eth)
### Links
* https://github.com/jerryjliu/llama_index
* https://gpt-index.readthedocs.io/
* https://medium.com/@jerryjliu98/how-unstructured-and-llamaindex-can-help-bring-the-power-of-llms-to-your-own-data-3657d063e30d
* https://github.com/jerryjliu/llama_index/blob/main/examples/chatbot/Chatbot_SEC.ipynb
* For a Colab exercise on version 0.5.0 or earlier, see the following link: https://colab.research.google.com/drive/1uL1TdMbR4kqa0Ksrd_Of_jWSxWt1ia7o?usp=sharing
* https://gpt-index.readthedocs.io/en/latest/guides/tutorials/building_a_chatbot.html

## What is LlamaIndex?
LlamaIndex is an interface between your data and LLMs; it offers a toolkit for setting up a query interface around your data for any downstream task, whether that is question answering, summarization, or something else.

## What this tutorial covers
* How to build a context augmented chatbot.
* We use Langchain for the underlying Agent/Chatbot abstractions, and we use LlamaIndex for the data retrieval/lookup/querying.
* The result is a chatbot agent that has access to a rich set of “data interface” Tools that LlamaIndex provides to answer queries over your data.

In [7]:
!pip install llama-index # this exercise is based on version 0.6.0.alpha3
# pip install --upgrade llama-index==0.6.0.alpha3
!pip install pdfminer.six # note: the package is pdfminer.six, not pdfminer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# download files
!mkdir data
!wget https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1 -O data/UBER.zip
!unzip data/UBER.zip -d data

mkdir: cannot create directory ‘data’: File exists
--2023-05-02 04:27:45--  https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.18, 2620:100:6018:18::a27d:312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/948jr9cfs7fgj99/UBER.zip [following]
--2023-05-02 04:27:46--  https://www.dropbox.com/s/dl/948jr9cfs7fgj99/UBER.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc31d876bcefc9c06ed1a4e69cc8.dl.dropboxusercontent.com/cd/0/get/B7T9mGW-wE-m2O1KPvAarYQ-L6FuTwNeMZptd1tH_i0DvLGthb8f-0PP20pB4bNO2fE8F1CH3UBn2-YgiyOLp_jdJbc47JasLun2m8_oyXk0I6q2cltAS5lpbwapmvonPlsrPzekGy7RbiK7wxbfZnbQFBCNfTXHj4KLVWNEKJKZ9w/file?dl=1# [following]
--2023-05-02 04:27:46--  https://uc31d876bcefc9c06ed1a4e69cc8.dl.dropboxusercontent.com/cd/0/get/B7T9mGW-wE-m2O1KPvAarYQ-L6FuTwNeMZptd1t

In [4]:
# set text wrapping
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [9]:
import os
os.environ['OPENAI_API_KEY'] = "sk-your-key"  # placeholder: replace with your own key
!pip show llama-index

Name: llama-index
Version: 0.6.0a3
Summary: Interface between LLMs and your data.
Home-page: https://github.com/jerryjliu/gpt_index
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: dataclasses-json, langchain, numpy, openai, pandas, tenacity, tiktoken
Required-by: 
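
Hard-coding the key in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key is either already exported in the environment or typed in interactively; `resolve_api_key` is a hypothetical helper for this notebook, not part of LlamaIndex or the OpenAI client:

```python
import os
from getpass import getpass


def resolve_api_key(env=None, prompt=getpass):
    """Return the OpenAI key from `env`, asking interactively only if it is missing.

    `resolve_api_key` is a hypothetical helper, not a library function.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so the key never lands in cell output.
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("OPENAI_API_KEY is empty")
    return key
```

In the notebook this could be used as `os.environ["OPENAI_API_KEY"] = resolve_api_key()`.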


In [11]:
# from llama_index import download_loader, GPTSimpleVectorIndex
from llama_index import download_loader, GPTVectorStoreIndex, ServiceContext, StorageContext, load_index_from_storage
from pathlib import Path

### Loading unstructured data with the Unstructured.io reader

We use the Unstructured.io HTML parser.
* See also: LlamaHub https://llamahub.ai/

In [12]:
UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)

In [13]:
loader = UnstructuredReader()
doc_set = {}
all_docs = []
years = [2022, 2021, 2020, 2019]
for year in years:
    year_docs = loader.load_data(file=Path(f'./data/UBER/UBER_{year}.html'), split_documents=False)
    # attach the year as metadata to each document
    for d in year_docs:
        d.extra_info = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Setup Service Context

In [14]:
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(chunk_size_limit=512)
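
`chunk_size_limit=512` caps how much text goes into each chunk before it is embedded. The idea can be sketched as follows, assuming whitespace-separated words stand in for tokens (the library counts tokens with a real tokenizer, which this toy does not model); `split_into_chunks` is a hypothetical name:

```python
def split_into_chunks(text, chunk_size_limit=512):
    """Split `text` into pieces of at most `chunk_size_limit` whitespace-separated words.

    Words are a stand-in for tokens here; a real tokenizer would count differently.
    """
    if chunk_size_limit <= 0:
        raise ValueError("chunk_size_limit must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size_limit])
        for start in range(0, len(words), chunk_size_limit)
    ]


# 1200 words split with a limit of 512 give two full chunks and a remainder.
chunks = split_into_chunks("risk " * 1200, chunk_size_limit=512)
print([len(c.split()) for c in chunks])  # [512, 512, 176]
```

A smaller limit yields more, finer-grained chunks, which tends to make retrieval more precise at the cost of more embedding calls.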

### Converting each year's SEC filing into a vector index

We convert Uber's SEC filing for each year from 2019 through 2022 into its own vector index. Afterwards we build a single global index that ties the per-year vector indices together.

We first set up a vector index for each year. Each vector index allows us to ask questions about the 10-K filing of a given year.

We build each index and save it to disk.

In [16]:
# # initialize simple vector indices + global vector index
# # NOTE: don't run this cell if the indices are already loaded! 
# index_set = {}
# for year in years:
#     cur_index = GPTSimpleVectorIndex.from_documents(doc_set[year], service_context=service_context)
#     index_set[year] = cur_index
#     cur_index.save_to_disk(f'index_{year}.json')

service_context = ServiceContext.from_defaults(chunk_size_limit=512)
index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults()
    cur_index = GPTVectorStoreIndex.from_documents(
        doc_set[year], 
        service_context=service_context,
        storage_context=storage_context,
    )
    index_set[year] = cur_index
    storage_context.persist(persist_dir=f'./storage/{year}')
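
Conceptually, each per-year index stores one vector per chunk and answers a query by returning the chunks whose vectors are closest to the query vector. A toy sketch of that lookup, assuming word-count vectors and cosine similarity in place of the learned embeddings a real index relies on; `embed`, `cosine`, and `ToyVectorIndex` are hypothetical names and the sample sentences are made up:

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a learned embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    """Stores (vector, chunk) pairs and returns the chunks closest to a query."""

    def __init__(self, chunks):
        self.entries = [(embed(c), c) for c in chunks]

    def retrieve(self, query, similarity_top_k=1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [chunk for _, chunk in ranked[:similarity_top_k]]


index = ToyVectorIndex([
    "drivers may be classified as employees",
    "the pandemic reduced demand for rides",
    "competition in delivery is intense",
])
print(index.retrieve("pandemic demand"))  # ['the pandemic reduced demand for rides']
```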

In [17]:
# To load an index from disk, do the following...
# Load indices from disk
index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults(persist_dir=f'./storage/{year}')
    cur_index = load_index_from_storage(storage_context=storage_context)
    index_set[year] = cur_index

## Building a composed graph that ties multiple indices (one per HTML file) together
* Since we have access to documents of 4 years, we may not only want to ask questions regarding the 10-K document of a given year, but ask questions that require analysis over all 10-K filings.

* To address this, we compose a “graph” which consists of a list index defined over the 4 vector indices. Querying this graph would first retrieve information from each vector index, and combine information together via the list index.
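
The two-step flow described above (ask every sub-index, then combine) can be sketched in plain Python. This is only an illustration under stated assumptions: each sub-index is modeled as a callable, the answers are placeholders, and joining strings stands in for the LLM-based synthesis the list index performs; `query_graph` is a hypothetical name.

```python
def query_graph(question, sub_indices, combine="\n".join):
    """Ask every sub-index the same question, then merge the labeled answers.

    `sub_indices` maps a summary string to a callable that answers a question;
    both are stand-ins for the per-year vector indices. `combine` stands in for
    the list index's synthesis step.
    """
    partial = [f"{summary}: {answer(question)}" for summary, answer in sub_indices.items()]
    return combine(partial)


# Placeholder sub-indices; real ones would run a vector lookup plus an LLM call.
sub_indices = {
    "2019 filing": lambda q: "answer from the 2019 index",
    "2020 filing": lambda q: "answer from the 2020 index",
}
answer = query_graph("What are the main risks?", sub_indices)
print(answer)
```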

In [24]:
from llama_index import GPTListIndex, LLMPredictor, ServiceContext, load_graph_from_storage
from langchain import OpenAI
from llama_index.indices.composability import ComposableGraph

In [25]:
# describe each index to help traversal of composed graph
index_summaries = [f"UBER 10-k Filing for {year} fiscal year" for year in years]

# define an LLMPredictor and set the number of output tokens
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, max_tokens=512))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
storage_context = StorageContext.from_defaults()

# define a list index over the vector indices
# allows us to synthesize information across each index
graph = ComposableGraph.from_indices(
    GPTListIndex,
    [index_set[y] for y in years], 
    index_summaries=index_summaries,
    service_context=service_context,
    storage_context=storage_context,
)
root_id = graph.root_id

# [optional] save to disk
storage_context.persist(persist_dir='./storage/root')

# [optional] load from disk, so you don't need to build graph from scratch
graph = load_graph_from_storage(
    root_id=root_id, 
    service_context=service_context,
    storage_context=storage_context,
)

## Setting up the Chatbot Agent (using Langchain)
* Langchain provides the interface for interacting with the outer chatbot, and we connect Langchain to LlamaIndex.
* We use Langchain to set up the outer chatbot agent, which has access to a set of Tools. LlamaIndex provides some wrappers around indices and graphs so that they can be easily used within a Tool interface.
* We want to define a separate Tool for each index (corresponding to a given year), as well as for the graph. We can define all tools under a central LlamaToolkit interface.
* Below, we define an IndexToolConfig for our graph. Note that we also import a DecomposeQueryTransform module for use within each vector index within the graph - this allows us to “decompose” the overall query into a query that can be answered from each subindex.

In [26]:
from langchain.agents import Tool
from langchain.chains.conversation.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent

from llama_index.langchain_helpers.agents import LlamaToolkit, create_llama_chat_agent, IndexToolConfig

In [27]:
# Note that we also import a DecomposeQueryTransform module for use within each vector index
# within the graph - this allows us to “decompose” the overall query
# into a query that can be answered from each subindex.

# In other words, DecomposeQueryTransform splits the overall query into subqueries,
# rewriting it into a form that each subindex can answer on its own.
# define a decompose transform
from llama_index.indices.query.query_transform.base import DecomposeQueryTransform
decompose_transform = DecomposeQueryTransform(
    llm_predictor, verbose=True
)
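
To make the decomposition step concrete, here is a rough sketch of its shape: the overall question and one sub-index summary go into a prompt, and the model returns a narrower question for that sub-index. The prompt wording, `decompose_query`, and `fake_llm` are all assumptions for illustration and are not LlamaIndex's actual template or API.

```python
def decompose_query(query, index_summary, llm):
    """Rewrite `query` into a question answerable from one sub-index.

    `llm` is any callable mapping a prompt string to a completion string;
    the prompt wording below is illustrative only.
    """
    prompt = (
        "Original question: " + query + "\n"
        "Knowledge source: " + index_summary + "\n"
        "Rewrite the question so it can be answered from this source alone.\n"
        "New question:"
    )
    return llm(prompt).strip()


def fake_llm(prompt):
    # Stand-in for a real model call: builds a fixed-pattern question from the source line.
    source = prompt.split("Knowledge source: ")[1].splitlines()[0]
    return f" What risk factors does the {source} describe? "


sub_question = decompose_query(
    "Compare risk factors across years",
    "UBER 10-k Filing for 2020 fiscal year",
    fake_llm,
)
print(sub_question)
```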

In [28]:
# define custom query engines
# !pip install --upgrade llama-index==0.6.0.alpha3
from llama_index.query_engine.transform_query_engine import TransformQueryEngine

In [29]:
custom_query_engines = {}
for index in index_set.values():
    query_engine = index.as_query_engine()
    query_engine = TransformQueryEngine(
        query_engine,
        query_transform=decompose_transform,
        transform_extra_info={'index_summary': index.index_struct.summary},
    )
    custom_query_engines[index.index_id] = query_engine
custom_query_engines[graph.root_id] = graph.root_index.as_query_engine(
    response_mode='tree_summarize',
    verbose=True,
)

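# construct the graph query engine from the per-index engines defined above;
# without this step `query_engine` would still point at the last per-year engine from the loop
graph_query_engine = graph.as_query_engine(custom_query_engines=custom_query_engines)

# tool config
graph_config = IndexToolConfig(
    query_engine=graph_query_engine,
    name="Graph Index",
    description="useful for when you want to answer queries that require analyzing multiple SEC 10-K documents for Uber.",
    tool_kwargs={"return_direct": True}
)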

* Besides the IndexToolConfig for the graph, we also define an IndexToolConfig corresponding to each index.

In [30]:
# define toolkit
index_configs = []
for y in range(2019, 2023):
    query_engine = index_set[y].as_query_engine(
        similarity_top_k=3,
    )
    tool_config = IndexToolConfig(
        query_engine=query_engine, 
        name=f"Vector Index {y}",
        description=f"useful for when you want to answer queries about the {y} SEC 10-K for Uber",
        tool_kwargs={"return_direct": True}
    )
    index_configs.append(tool_config)

* Next, we combine these configs into a LlamaToolkit.

In [31]:
toolkit = LlamaToolkit(
    index_configs=index_configs + [graph_config],
)

* Finally, we call create_llama_chat_agent to create our Langchain chatbot agent, which has access to the 5 Tools we defined above.

In [32]:
memory = ConversationBufferMemory(memory_key="chat_history")
llm = OpenAI(temperature=0)
agent_chain = create_llama_chat_agent(
    toolkit,
    llm,
    memory=memory,
    verbose=True
)

## Testing the Agent

In [34]:
agent_chain.run(input="Hi I am Sigrid!")



> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? No
AI: Hi Sigrid, nice to meet you! How can I help you today?

> Finished chain.


'Hi Sigrid, nice to meet you! How can I help you today?'

In [35]:
agent_chain.run(input="What were some of the biggest risk factors in 2020 for Uber?")



> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: Vector Index 2020
Action Input: Risk Factors
Observation:
The following are some of these risks, any of which could have an adverse effect on our business financial condition, operating results, or prospects:

1. The COVID-19 pandemic and the impact of actions to mitigate the pandemic has adversely affected and may continue to adversely affect parts of our business.

2. Our business would be adversely affected if Drivers were classified as employees, workers or quasi-employees instead of independent contractors.

3. The mobility, delivery, and logistics industries are highly competitive, with well-established and low-cost alternatives that have been available for decades, low barriers to entry, low switching costs, and well-capitalized competitors in nearly every major geographic region.

4. To remain competitive in certain markets, we have in the past lowered, 

'\nThe following are some of these risks, any of which could have an adverse effect on our business financial condition, operating results, or prospects:\n\n1. The COVID-19 pandemic and the impact of actions to mitigate the pandemic has adversely affected and may continue to adversely affect parts of our business.\n\n2. Our business would be adversely affected if Drivers were classified as employees, workers or quasi-employees instead of independent contractors.\n\n3. The mobility, delivery, and logistics industries are highly competitive, with well-established and low-cost alternatives that have been available for decades, low barriers to entry, low switching costs, and well-capitalized competitors in nearly every major geographic region.\n\n4. To remain competitive in certain markets, we have in the past lowered, and may continue to lower, fares or service fees, and we have in the past offered, and may continue to offer, significant Driver incentives and consumer discounts and promot

In [36]:
cross_query_str = (
    "Compare/contrast the risk factors described in the Uber 10-K across years. Give answer in bullet points."
)
agent_chain.run(input=cross_query_str)



> Entering new AgentExecutor chain...
AI: Sure, here are the key risk factors for Uber across the years:

2019:
• The mobility, delivery, and logistics industries are highly competitive, with well-established and low-cost alternatives that have been available for decades, low barriers to entry, low switching costs, and well-capitalized competitors in nearly every major geographic region.
• We have incurred significant losses since inception, including in the United States and other major markets. We expect our operating expenses to increase significantly in the foreseeable future, and we may not achieve profitability.

2020:
• The COVID-19 pandemic and the impact of actions to mitigate the pandemic has adversely affected and may continue to adversely affect parts of our business.
• Our business would be adversely affected if Drivers were classified as employees, workers or quasi-employees instead of independent contractors.
• To remain competitive in certain mar

'Sure, here are the key risk factors for Uber across the years:\n\n2019:\n• The mobility, delivery, and logistics industries are highly competitive, with well-established and low-cost alternatives that have been available for decades, low barriers to entry, low switching costs, and well-capitalized competitors in nearly every major geographic region.\n• We have incurred significant losses since inception, including in the United States and other major markets. We expect our operating expenses to increase significantly in the foreseeable future, and we may not achieve profitability.\n\n2020:\n• The COVID-19 pandemic and the impact of actions to mitigate the pandemic has adversely affected and may continue to adversely affect parts of our business.\n• Our business would be adversely affected if Drivers were classified as employees, workers or quasi-employees instead of independent contractors.\n• To remain competitive in certain markets, we have in the past lowered, and may continue to l
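
In the transcript above the cross-year question produced an `AI:` answer with no `Action:` line, which suggests the agent replied without calling the Graph Index tool. Tool choice is left to the LLM, so this can happen. One way to make routing predictable is a small deterministic rule in front of the agent. A minimal sketch, assuming the tool names defined earlier in this notebook; `pick_tool` is a hypothetical helper, not a Langchain or LlamaIndex API:

```python
import re


def pick_tool(question, years=(2019, 2020, 2021, 2022), fallback="Graph Index"):
    """Route to one year's vector index if exactly one known year is mentioned.

    Questions naming no year, or several, go to the graph tool instead.
    """
    mentioned = {int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", question)} & set(years)
    if len(mentioned) == 1:
        return f"Vector Index {mentioned.pop()}"
    return fallback


print(pick_tool("What were some of the biggest risk factors in 2020 for Uber?"))
print(pick_tool("Compare/contrast the risk factors described in the Uber 10-K across years."))
```

The returned name could then be used to call the matching query engine directly instead of leaving the choice to the agent.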

In [44]:
while True:
    text_input = input("User: Question is ... ")
    response = agent_chain.run(input=text_input)
    print(f'Agent: {response}')

User: Question is ... What were some of the legal proceedings against Uber in 2022?


> Entering new AgentExecutor chain...
AI: In 2022, Uber faced a number of legal proceedings, including:

• A class action lawsuit alleging that Uber misclassified its drivers as independent contractors instead of employees.

• A lawsuit alleging that Uber violated the California Consumer Privacy Act by collecting and sharing personal information without consent.

• A lawsuit alleging that Uber violated the Telephone Consumer Protection Act by sending unsolicited text messages.

• A lawsuit alleging that Uber violated the Americans with Disabilities Act by failing to provide accessible transportation services.

• A lawsuit alleging that Uber violated the Fair Credit Reporting Act by using background checks to discriminate against certain applicants.

> Finished chain.
Agent: In 2022, Uber faced a number of legal proceedings, including:

• A class action lawsuit alleging

KeyboardInterrupt: ignored
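
The loop above only stops on a keyboard interrupt. A possible variant with an explicit exit is sketched below, assuming the agent is wrapped in a plain callable; `chat_loop` and its parameters are hypothetical names introduced for illustration:

```python
def chat_loop(run, read=input, write=print, quit_words=("quit", "exit")):
    """Read questions until the user types a quit word or sends EOF/Ctrl-C.

    `run` is any callable mapping a question string to an answer string,
    for example `lambda q: agent_chain.run(input=q)`.
    Returns the number of questions answered.
    """
    turns = 0
    while True:
        try:
            text = read("User: ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if text.lower() in quit_words:
            break
        if not text:
            continue  # ignore empty input instead of sending it to the agent
        write(f"Agent: {run(text)}")
        turns += 1
    return turns
```

Passing `read` and `write` as parameters keeps the loop testable without a live agent or a terminal.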