Following this tutorial: https://python.langchain.com/en/latest/modules/indexes/getting_started.html

# Getting started

In [1]:
# download state_of_the_union.txt
!curl https://raw.githubusercontent.com/hwchase17/langchain/master/docs/modules/state_of_the_union.txt -o assets/state_of_the_union.txt
!echo ""
!head assets/state_of_the_union.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39027  100 39027    0     0   123k      0 --:--:-- --:--:-- --:--:--  125k

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny. 



In [2]:
# imports
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator

The tutorial uses OpenAI, but I want to use Hugging Face models instead.

In [3]:
from langchain.embeddings import HuggingFaceEmbeddings
hf_embeddings = HuggingFaceEmbeddings()



In [4]:
# loader and index
loader = TextLoader('assets/state_of_the_union.txt')
index = VectorstoreIndexCreator(embedding=hf_embeddings).from_loaders([loader])

Using embedded DuckDB without persistence: data will be transient


In [8]:
# init a HF llm (by default OpenAI is used)
from langchain.llms import HuggingFaceHub
llm=HuggingFaceHub(repo_id="google/flan-ul2")

# query
query = "What did the president say about Ketanji Brown Jackson"
index.query(query, llm=llm)

'One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of'

## Walkthrough

VectorstoreIndexCreator does three things:
- Splits documents into chunks
- Creates embeddings for each chunk
- Stores the chunks and their embeddings in a vectorstore
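
To make the three steps concrete, here is a toy, dependency-free sketch. It is not how LangChain or Chroma work internally: `embed`, `cosine`, and `ToyVectorStore` are hypothetical names, and plain word counts stand in for a real embedding model.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a real model returns dense vectors)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store holding (chunk, vector) pairs."""

    def __init__(self):
        self.items = []

    def add(self, chunks):
        # Steps 2 and 3: embed each chunk and store it next to its vector.
        for chunk in chunks:
            self.items.append((chunk, embed(chunk)))

    def search(self, query, k=1):
        # Rank stored chunks by similarity to the query embedding.
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


document = "Last year COVID-19 kept us apart.\n\nFreedom will always triumph over tyranny."
chunks = document.split("\n\n")  # Step 1: split into chunks
store = ToyVectorStore()
store.add(chunks)
top = store.search("what kept us apart last year", k=1)
print(top)
```

A real vectorstore does the same thing at scale, with learned embeddings and an index that avoids comparing the query against every stored vector.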

In [10]:
# load
documents = loader.load()

# and split into chunks
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

In [11]:
print(f"{type(texts) = }")
print(f"{len(texts)  = }")

type(texts) = <class 'list'>
len(texts)  = 42
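
As far as I understand it, `CharacterTextSplitter` breaks the text on a separator (blank lines by default) and then packs the pieces into chunks of at most `chunk_size` characters. A rough pure-Python sketch of that packing, assuming no overlap; `pack_chunks` is a hypothetical helper, not the library's code:

```python
def pack_chunks(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of <= chunk_size chars."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        # A piece longer than chunk_size is kept whole rather than cut mid-text.
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
chunks = pack_chunks(sample, chunk_size=10)
print(chunks)  # ['aaaa\n\nbbbb', 'cccc\n\ndddd']
```

With `chunk_size=1000`, several short paragraphs of the speech end up in each chunk, which is why a 39 kB file gives only a few dozen chunks.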


In [12]:
# create embeddings, just as before
hf_embeddings = HuggingFaceEmbeddings()

In [14]:
# create index (vectorstore)
from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, hf_embeddings)

# expose index (retriever)
retriever = db.as_retriever()

# create a chain to answer the questions and pass the retriever
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

Using embedded DuckDB without persistence: data will be transient
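
The `"stuff"` chain type simply stuffs all retrieved chunks into a single prompt, followed by the question. A minimal sketch of that idea; `build_stuff_prompt` is a hypothetical helper, and the prompt wording only approximates what I believe the default template says:

```python
def build_stuff_prompt(chunks, question):
    """Concatenate retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


retrieved = [
    "Last year COVID-19 kept us apart.",
    "This year we are finally together again.",
]
prompt = build_stuff_prompt(retrieved, "Was the president concerned with any disease")
print(prompt)
```

This keeps things simple, but every retrieved chunk has to fit in the model's context window at once. An instruction like the one in the first sentence is also what lets the model decline when the context does not contain the answer.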


In [15]:
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)

'One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of'

In [16]:
qa.run("Was the president concerned with any disease")

'COVID-19'

Now a question that shouldn't have an answer in the document:

In [17]:
qa.run("What did the president say about the last episode of Friends")

"I don't know"

nice :)