# Indexes

**Indexes** are ways of structuring documents so that LLMs can best interact with them.

## Building a demo question-answering application

The application is built in four steps:

1. Build the index
2. Create a retriever from the index
3. Create a question answering chain
4. Ask questions

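By default, LangChain uses `Chroma` as the vector store (vectorstore) to index and search embeddings. The data is held in memory, so it is not persisted between sessions. To follow this tutorial, first install `chromadb`.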

In [1]:
# !pip install chromadb

## Prepare data

In [3]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

In [32]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI # defaults to the `text-davinci-003` model
from langchain.chat_models import ChatOpenAI # uses the `gpt-3.5-turbo` model

In [5]:
from langchain.document_loaders import TextLoader
loader = TextLoader('./data/state_of_the_union.txt', encoding='utf8')

### Build the index

In [10]:
from langchain.indexes import VectorstoreIndexCreator
index = VectorstoreIndexCreator().from_loaders([loader]) # returns a `VectorStoreIndexWrapper` object, which provides the `query` and `query_with_sources` methods

`VectorstoreIndexCreator` does three main things:
1. Splits the documents into chunks
2. Creates embeddings for each chunk
3. Stores the chunks and their embeddings in a vectorstore

![VectorstoreIndexCreator](./img/embeddings.png)
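
The three steps above can be sketched in plain Python. This is a toy illustration, not LangChain's implementation: `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts words into a fixed number of hash buckets), and the "vector store" here is an ordinary list.

```python
import zlib

def split_into_chunks(text, chunk_size):
    """Step 1: cut the text into pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def toy_embed(text, dims=8):
    """Step 2 (hypothetical stand-in for an embedding model):
    count words into `dims` buckets chosen by a stable CRC32 hash."""
    vector = [0.0] * dims
    for word in text.lower().split():
        vector[zlib.crc32(word.encode("utf-8")) % dims] += 1.0
    return vector

def build_index(text, chunk_size=40):
    """Step 3: store each chunk next to its embedding."""
    return [
        {"text": chunk, "embedding": toy_embed(chunk)}
        for chunk in split_into_chunks(text, chunk_size)
    ]

sample = "The president nominated a judge. " * 3
index_entries = build_index(sample)
for entry in index_entries:
    print(repr(entry["text"]), entry["embedding"])
```

A real embedding model returns dense vectors that capture meaning rather than word counts, but the stored structure (text paired with a vector) is the same idea.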

In [11]:
# Once the index is built, it can be queried directly
query = "What did the president say about Ketanji Brown Jackson"
index.query(query)

" The president said that Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He also said that she is a consensus builder and has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans."

In [12]:
query = "What did the president say about Ketanji Brown Jackson"
index.query_with_sources(query)

{'question': 'What did the president say about Ketanji Brown Jackson',
 'answer': " The president said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson, one of the nation's top legal minds, to continue Justice Breyer's legacy of excellence. He also said that she has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.\n",
 'sources': './data/state_of_the_union.txt'}

In [14]:
index.vectorstore # inspect the vectorstore object directly

<langchain.vectorstores.chroma.Chroma at 0x12ff0d930>

In [15]:
index.vectorstore.as_retriever() # get a retriever for the vector store

VectorStoreRetriever(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x12ff0d930>, search_type='similarity', search_kwargs={})

### Breaking down what VectorstoreIndexCreator does
1. Load the documents

In [18]:
# from langchain.document_loaders import TextLoader
# loader = TextLoader('./data/state_of_the_union.txt', encoding='utf8')
documents = loader.load()

2. Split the documents into chunks

In [19]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
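
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window splitter. It is only a sketch: to my understanding the real `CharacterTextSplitter` first splits on a separator and then merges the pieces, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap=0):
    """Return chunks of at most chunk_size characters, where each chunk
    repeats the last chunk_overlap characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=0))
# ['abcd', 'efgh', 'ij']
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With `chunk_overlap=0`, as in the cell above, no text is repeated between chunks; a positive overlap trades extra storage for less risk of cutting a relevant passage in half.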

3. Choose the embeddings to use (here, OpenAI's)

In [20]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

4. Create the vectorstore to use as the index.

In [21]:
from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)

5. Expose this index through a retriever interface.

In [22]:
retriever = db.as_retriever()
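
Conceptually, a similarity retriever embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows only that ranking step, using cosine similarity. The vectors are made up by hand, and `ToyRetriever`, `get_relevant`, and the `k` parameter are hypothetical names, not part of LangChain.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class ToyRetriever:
    """Hypothetical retriever over a list of (text, vector) pairs."""

    def __init__(self, entries, k=2):
        self.entries = entries
        self.k = k

    def get_relevant(self, query_vector):
        # Rank every stored chunk by similarity to the query, best first.
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[: self.k]]

# Hand-made 2-D "embeddings", purely for illustration.
entries = [
    ("chunk about the Supreme Court", [1.0, 0.0]),
    ("chunk about the economy", [0.0, 1.0]),
    ("chunk about judges and the economy", [0.7, 0.7]),
]
toy_retriever = ToyRetriever(entries, k=2)
print(toy_retriever.get_relevant([0.9, 0.1]))
# ['chunk about the Supreme Court', 'chunk about judges and the economy']
```

A production vector store typically uses an approximate nearest-neighbour index instead of sorting every entry, but the result it aims for is the same top-`k` list.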

6. Create a chain and use it to answer questions!

In [30]:
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0.0), chain_type="stuff", retriever=retriever)
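
The `"stuff"` chain type puts all retrieved chunks into a single prompt and makes one LLM call. A rough sketch of that idea follows. The prompt wording, `build_stuff_prompt`, `fake_llm`, and the placeholder chunks are all hypothetical; the actual prompt used by `RetrievalQA` differs.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

def fake_llm(prompt):
    """Hypothetical stand-in for a model call; it only reports the prompt size."""
    return f"(model would answer here; prompt has {len(prompt)} characters)"

# Placeholder chunks standing in for whatever the retriever returned.
retrieved_chunks = [
    "(text of the first retrieved chunk)",
    "(text of the second retrieved chunk)",
]
prompt = build_stuff_prompt(
    "What did the president say about Ketanji Brown Jackson", retrieved_chunks
)
print(prompt)
print(fake_llm(prompt))
```

Because everything goes into one prompt, this approach only works when the retrieved chunks fit within the model's context window.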

In [31]:
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)

"The President nominated Circuit Court of Appeals Judge Ketanji Brown Jackson to serve on the United States Supreme Court. He described her as one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, and a consensus builder. She has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans."

In [29]:
ChatOpenAI(temperature=0.0)

ChatOpenAI(verbose=False, callbacks=None, callback_manager=None, client=<class 'openai.api_resources.chat_completion.ChatCompletion'>, model_name='gpt-3.5-turbo', temperature=0.0, model_kwargs={}, openai_api_key='sk-***REDACTED***', openai_api_base='', openai_organization='', openai_proxy='', request_timeout=None, max_retries=6, streaming=False, n=1, max_tokens=None)

In [33]:
OpenAI(temperature=0.0)

OpenAI(cache=None, verbose=False, callbacks=None, callback_manager=None, client=<class 'openai.api_resources.completion.Completion'>, model_name='text-davinci-003', temperature=0.0, max_tokens=256, top_p=1, frequency_penalty=0, presence_penalty=0, n=1, best_of=1, model_kwargs={}, openai_api_key='sk-***REDACTED***', openai_api_base='', openai_organization='', openai_proxy='', batch_size=20, request_timeout=None, logit_bias={}, max_retries=6, streaming=False, allowed_special=set(), disallowed_special='all')

> The steps above are the logic that `VectorstoreIndexCreator` wraps.
> 
> You can also override the default configuration yourself, for example:


In [35]:
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Chroma, 
    embedding=OpenAIEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
)