# TextLoader

The goal of this project is to make an LLM aware of the contents of a document so that 
- a user can ask questions relevant to that document and 
- the LLM can respond in terms of the content of that document. 

In [1]:
# pip install langchain
# pip install openai

## Load and Split a Text file 

LangChain has several document loaders. Our file, `facts.txt`, is a plain text file, so we use `TextLoader`.

In [2]:
from langchain.document_loaders import TextLoader
from dotenv import load_dotenv 

load_dotenv()

loader = TextLoader("facts.txt") # an instance of a TextLoader object
docs = loader.load() # A list with a single `Document` (content plus metadata)
print(type(loader))
print(type(docs))
print(type(docs[0]))

<class 'langchain_community.document_loaders.text.TextLoader'>
<class 'list'>
<class 'langchain_core.documents.base.Document'>


The type of `docs` is `list`; however, its elements are of a LangChain-specific type, `Document`, with two properties:
- `page_content` is the content
- `metadata` is information about the content.

These two properties alone constitute a document. 

In [3]:
n = 80
print(f"{docs[0].page_content[:n]}...") # First n characters
print(docs[0].metadata)

1. "Dreamt" is the only English word that ends with the letters "mt."
2. An ostr...
{'source': 'facts.txt'}


There is one element in this list, and the `page_content` of that item is the entirety of the contents of the text file. This `page_content` is too long to feed to an LLM; we need to split the text into many documents. 

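LangChain has a text splitter, `CharacterTextSplitter`. Beware, its behavior is a little less than intuitive:
- it first splits the text at every occurrence of the separator,
- then it merges consecutive pieces into a chunk for as long as the chunk stays within `chunk_size`,
- so each chunk ends at the last separator that still fits within `chunk_size`.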
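
The net effect can be sketched in plain Python. This is a minimal illustration, assuming a greedy split-then-merge strategy with no overlap; `greedy_chunks` is a hypothetical name and this is not LangChain's implementation.

```python
def greedy_chunks(text, chunk_size, separator="\n"):
    """Split at every separator, then pack whole pieces into chunks."""
    chunks, current = [], []
    for piece in text.split(separator):
        candidate = separator.join(current + [piece])
        # Start a new chunk when adding this piece would exceed chunk_size.
        if current and len(candidate) > chunk_size:
            chunks.append(separator.join(current))
            current = []
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aa\nbbb\ncc\ndddddddd"
print(greedy_chunks(sample, chunk_size=6))
```

Every chunk ends at a separator, and a piece that is longer than `chunk_size` on its own (the last one in `sample`) is kept whole rather than cut mid-line.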

In [4]:
from langchain.text_splitter import CharacterTextSplitter 

# Create a text splitter to give to the load_and_split function.
text_splitter = CharacterTextSplitter(
    is_separator_regex=False, 
    # Maximum number of characters per chunk.
    chunk_size = 200, 
    # Chunks end at the separator character that is 
    # closest to the end of the chunk.
    separator = "\n", 
    # Allow consecutive chunks to overlap by up to this many characters; 0 keeps them mutually disjoint. 
    chunk_overlap = 0  
)

loader = TextLoader("facts.txt")
docs = loader.load_and_split(
    text_splitter=text_splitter,
)
for doc in docs[:5]:
    print(doc.metadata)
    print(doc.page_content)
    print("\n")

{'source': 'facts.txt'}
1. "Dreamt" is the only English word that ends with the letters "mt."
2. An ostrich's eye is bigger than its brain.
3. Honey is the only natural food that is made without destroying any kind of life.


{'source': 'facts.txt'}
4. A snail can sleep for three years.
5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'
6. The elephant is the only mammal that can't jump.


{'source': 'facts.txt'}
7. The letter 'Q' is the only letter not appearing in any U.S. state name.
8. The heart of a shrimp is located in its head.
9. Australia is the only continent covered by a single country.


{'source': 'facts.txt'}
10. The Great Wall of China is approximately 13,171 miles long.
11. Bananas are berries, but strawberries aren't.
12. The Sphinx of Giza has the body of a lion and the head of a human.


{'source': 'facts.txt'}
13. The first computer bug was an actual bug trapped in a computer.
14. Neil Armstrong was the first man to walk 

Another note on chunking: if the chunk size is set so low that a single piece between two split characters does not fit in a chunk, then a longer chunk is used. A message informs you when this happens. 
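
A quick way to predict which lines would trigger that message is to look for lines that are longer than the chunk size on their own. A minimal sketch; `oversized_pieces` is a hypothetical helper, not part of LangChain.

```python
def oversized_pieces(text, chunk_size, separator="\n"):
    """Return the pieces that cannot fit in a chunk on their own."""
    return [p for p in text.split(separator) if len(p) > chunk_size]


sample = "short line\n" + "x" * 12
for piece in oversized_pieces(sample, chunk_size=10):
    print(f"{len(piece)} characters: {piece}")
```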

In [5]:
text_splitter = CharacterTextSplitter(
    chunk_size = 130, 
    separator = "\n", 
    chunk_overlap = 0 
)

loader = TextLoader("facts.txt")
docs = loader.load_and_split(
    text_splitter=text_splitter,
)
for doc in docs[:5]:
    print(doc.metadata)
    print(doc.page_content)
    print("\n")

Created a chunk of size 143, which is longer than the specified 130
Created a chunk of size 135, which is longer than the specified 130
Created a chunk of size 148, which is longer than the specified 130
Created a chunk of size 147, which is longer than the specified 130
Created a chunk of size 137, which is longer than the specified 130


{'source': 'facts.txt'}
1. "Dreamt" is the only English word that ends with the letters "mt."
2. An ostrich's eye is bigger than its brain.


{'source': 'facts.txt'}
3. Honey is the only natural food that is made without destroying any kind of life.
4. A snail can sleep for three years.


{'source': 'facts.txt'}
5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'


{'source': 'facts.txt'}
6. The elephant is the only mammal that can't jump.
7. The letter 'Q' is the only letter not appearing in any U.S. state name.


{'source': 'facts.txt'}
8. The heart of a shrimp is located in its head.
9. Australia is the only continent covered by a single country.




### Chunk overlap

Returning to a reasonable chunk size, let's examine the chunk overlap.

In [6]:
from langchain.text_splitter import CharacterTextSplitter 

text_splitter = CharacterTextSplitter(
    is_separator_regex=False,
    chunk_size = 200, 
    separator = "\n", 
    chunk_overlap = 100 # If this is big enough to capture a separator...
)

loader = TextLoader("facts.txt")
docs = loader.load_and_split(
    text_splitter=text_splitter,
)
for doc in docs[:2]:
    print(doc.metadata)
    print(doc.page_content)
    print("\n")

{'source': 'facts.txt'}
1. "Dreamt" is the only English word that ends with the letters "mt."
2. An ostrich's eye is bigger than its brain.
3. Honey is the only natural food that is made without destroying any kind of life.


{'source': 'facts.txt'}
3. Honey is the only natural food that is made without destroying any kind of life.
4. A snail can sleep for three years.




We will use a chunk overlap of 0 in this project. 
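
The overlap seen above, where fact 3 closes the first chunk and opens the second, is consistent with carrying over whole trailing pieces as long as they fit within `chunk_overlap`. A minimal sketch under that assumption; `chunks_with_overlap` is a hypothetical name and this is not LangChain's implementation.

```python
def chunks_with_overlap(text, chunk_size, chunk_overlap, separator="\n"):
    """Greedy split-then-merge that carries trailing pieces into the next chunk."""
    chunks, current = [], []
    for piece in text.split(separator):
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until what remains fits in the
            # overlap budget and leaves room for the incoming piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aaaa\nbb\ncc\ndd"
print(chunks_with_overlap(sample, chunk_size=7, chunk_overlap=2))
print(chunks_with_overlap(sample, chunk_size=7, chunk_overlap=0))
```

Because only whole pieces are carried over, an overlap smaller than the last line of a chunk produces no overlap at all.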

## Embedding 



SentenceTransformer runs locally and, with the default `all-mpnet-base-v2` model, embeds into $R^{768}$. However, the model is downloaded from HuggingFace, so it is blocked by DISH. 

OpenAI Embeddings is an API and embeds into $R^{1536}$. You have to pay for it. 

In [7]:
# pip install sentence-transformers

In [8]:
# pip install tiktoken # OpenAIEmbeddings uses it to count tokens
import tiktoken
# from langchain.embeddings import OpenAIEmbeddings # deprecated
from langchain_openai import OpenAIEmbeddings
from langchain.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings
    )

embeddings = OpenAIEmbeddings()
query_string = "hello there weary traveler"
query_vector = embeddings.embed_query(query_string)
print("The image of the query under the OpenAI embedding is of length "
      f" {len(query_vector)} and is \n {query_vector[:3]}...")


The image of the query under the OpenAI embedding is of length  1536 and is 
 [0.0017633300633127511, -0.004282372977640875, -0.010506087850271642]...


## Vector Store

A vector store is a database of strings and their images; roughly, it is the graph of the embedding. 
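
To make "the graph of the embedding" concrete, here is a toy in-memory vector store. It is only a sketch: `toy_embed` is a hypothetical stand-in that counts vowels, nothing like a real embedding model, and the three strings are made up.

```python
import math


def toy_embed(text):
    """Hypothetical stand-in for a real embedding: counts of a, e, i, o, u."""
    lowered = text.lower()
    return [float(lowered.count(v)) for v in "aeiou"]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# The "graph of the embedding": each string paired with its image.
store = {s: toy_embed(s) for s in ["banana", "igloo", "queue"]}


def search(query, k=1):
    """Return the k stored strings whose images are closest to the query's."""
    q = toy_embed(query)
    ranked = sorted(store, key=lambda s: cosine(store[s], q), reverse=True)
    return ranked[:k]


print(search("bandana"))
```

A real vector store does the same two jobs, storing (string, vector) pairs and ranking them against a query vector, with a much better embedding and an index built for speed.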

Time to get a database to use as a vector store. LangChain has a wrapper for ChromaDB.

In [59]:
##### If ya need it
import os 
os.system("rm -r vector-store")

0

In [62]:
# pip install -U langchain chromadb
from langchain.vectorstores.chroma import Chroma
from langchain.text_splitter import CharacterTextSplitter 
# from langchain.embeddings import OpenAIEmbeddings # deprecated
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

text_splitter = CharacterTextSplitter(
    is_separator_regex=False, 
    chunk_size = 200, 
    separator = "\n", 
    chunk_overlap = 0  
)

loader = TextLoader("facts.txt")
docs = loader.load_and_split(
    text_splitter=text_splitter,
)
db = Chroma.from_documents(
    documents = docs,
    embedding=embeddings,
    # Name the directory for the vector store.
    persist_directory="vector-store3"
)
print(f"The database is of type {type(db)} \n" 
      f"and has {db._collection.count()} entries.")

The database is of type <class 'langchain_community.vectorstores.chroma.Chroma'> 
and has 54 entries.


In [63]:
db = Chroma.from_documents(
    documents = docs,
    embedding=embeddings,
    # Name the directory for the vector store.
    persist_directory="vector-store3"
)
print(f"The database is of type {type(db)} \n" 
      f"and has {db._collection.count()} entries.")

The database is of type <class 'langchain_community.vectorstores.chroma.Chroma'> 
and has 108 entries.


NB: The database is now populated. If you run the `.from_documents` method twice without deleting the directory specified by `persist_directory` then you get duplicate entries.
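
One way to make a rerun harmless is to give each chunk an id derived from its content, so the same chunk always gets the same id. A minimal sketch, assuming the vector store overwrites an entry whose id already exists instead of adding a second copy (worth verifying for your Chroma version); `content_ids` is a hypothetical helper, not part of LangChain.

```python
import hashlib


def content_ids(texts):
    """Derive a stable id from each chunk's text (hypothetical helper)."""
    return [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]


texts = ["fact one", "fact two", "fact one"]
ids = content_ids(texts)
# Identical text gives an identical id; different text gives a different id.
print(ids[0] == ids[2], ids[0] == ids[1])
```

The resulting list could then be passed as the `ids` argument of `Chroma.from_documents`. Identical chunks within a single batch would share an id, so they may need to be removed first.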

To view individual entries, use the `get` method.

In [43]:
# Retrieve two entries with more than the default info included.
# Note that `id` is always included. 
gots = db.get(limit=2, include=['embeddings',"metadatas", "documents",
                        #  "uris","data" # we don't use these for local files.
                         ])

print(f"ids: {gots['ids']}\n")
print(f"embedding: {gots['embeddings'][0][0:3]}\n")
print(f"Metadatas: {gots['metadatas']}\n")
printable_docs = '\n\n'.join(gots['documents'][0:2] )
print(f"documents: \n\n{printable_docs}\n")
print(f"uris: {gots['uris']}\n")
print(f"data: {gots['data']}")

ids: ['05b8dcae-b61a-11ee-b6ca-acde48001122', '05b8dd1c-b61a-11ee-b6ca-acde48001122']

embedding: [-0.0012050693621858954, 0.0028918308671563864, 0.008143449202179909]

Metadatas: [{'source': 'facts.txt'}, {'source': 'facts.txt'}]

documents: 

1. "Dreamt" is the only English word that ends with the letters "mt."
2. An ostrich's eye is bigger than its brain.
3. Honey is the only natural food that is made without destroying any kind of life.

4. A snail can sleep for three years.
5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'
6. The elephant is the only mammal that can't jump.

uris: None

data: None


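One can also retrieve entries by their `ids`. Note that the call below also asks for `uris` and `data`, which raises an error because no data loader is set on the collection.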

In [45]:
ids = gots['ids']
ids

['05b8dcae-b61a-11ee-b6ca-acde48001122',
 '05b8dd1c-b61a-11ee-b6ca-acde48001122']

In [46]:
db.get(ids=ids, 
    include=['embeddings',"metadatas", "documents","uris","data"]
    ) # set a data loader? 

ValueError: You must set a data loader on the collection if loading from URIs.

## Use custom ids


In [47]:
import os

"""Delete existing database, both the object and the directory."""
# try:
#     os.system("rm -r vectorstore")
#     del db 
# except:
#     None


ids = [str(i) for i in range(1, len(docs) + 1)]


db_with_custom_ids = Chroma.from_documents(
    documents = docs,
    embedding=embeddings,
    # persist_directory="vstore",
    ids=ids
) 

db_with_custom_ids.get(limit=2) 

{'ids': ['1', '10'],
 'embeddings': None,
 'metadatas': [{'source': 'facts.txt'}, {'source': 'facts.txt'}],
 'documents': ['1. "Dreamt" is the only English word that ends with the letters "mt."\n2. An ostrich\'s eye is bigger than its brain.\n3. Honey is the only natural food that is made without destroying any kind of life.',
  '25. There are about 7,000 feathers on an eagle.\n26. Marie Curie was the first woman to win a Nobel Prize and remains the only person to have won in two different fields—Physics and Chemistry.'],
 'uris': None,
 'data': None}

In [48]:
# and we may delete thusly
print(f"there were {db_with_custom_ids._collection.count()}")
db_with_custom_ids._collection.delete(ids=[ids[2]])
print(f"there are {db_with_custom_ids._collection.count()}")


there were 54
there are 53


## Search 

Search through the documents and find the `k` that are most similar to a query. 

In [49]:
query = "What is an interesting fact about the english language?"

results = db.similarity_search(
    query = query,
    # k = 2 #Number of results to display. Default 4
    )

for result in results:
    print(result)

page_content='1. "Dreamt" is the only English word that ends with the letters "mt."\n2. An ostrich\'s eye is bigger than its brain.\n3. Honey is the only natural food that is made without destroying any kind of life.' metadata={'source': 'facts.txt'}
page_content='1. "Dreamt" is the only English word that ends with the letters "mt."\n2. An ostrich\'s eye is bigger than its brain.\n3. Honey is the only natural food that is made without destroying any kind of life.' metadata={'source': 'facts.txt'}
page_content="4. A snail can sleep for three years.\n5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'\n6. The elephant is the only mammal that can't jump." metadata={'source': 'facts.txt'}
page_content="4. A snail can sleep for three years.\n5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'\n6. The elephant is the only mammal that can't jump." metadata={'source': 'facts.txt'}


In [50]:
for result in results:
    print("\n")
    print(result.metadata) 
    print(result.page_content) 




{'source': 'facts.txt'}
1. "Dreamt" is the only English word that ends with the letters "mt."
2. An ostrich's eye is bigger than its brain.
3. Honey is the only natural food that is made without destroying any kind of life.


{'source': 'facts.txt'}
1. "Dreamt" is the only English word that ends with the letters "mt."
2. An ostrich's eye is bigger than its brain.
3. Honey is the only natural food that is made without destroying any kind of life.


{'source': 'facts.txt'}
4. A snail can sleep for three years.
5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'
6. The elephant is the only mammal that can't jump.


{'source': 'facts.txt'}
4. A snail can sleep for three years.
5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'
6. The elephant is the only mammal that can't jump.


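A separate method also returns a score with each of the similar documents. Note that the score is a distance rather than a cosine similarity (with Chroma's default settings, a squared L2 distance), so a lower score means a more similar document. 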

In [51]:
results = db.similarity_search_with_score(
    "What is an interesting fact about the english language?",
    # k = 2 #Number of results to display. Default 4
    )

for result in results:
    print("\n")
    print(result[0].metadata) 
    print(f"Search score: {result[1]}") #search score
    print(result[0].page_content) 
    



{'source': 'facts.txt'}
Search score: 0.349289208650589
1. "Dreamt" is the only English word that ends with the letters "mt."
2. An ostrich's eye is bigger than its brain.
3. Honey is the only natural food that is made without destroying any kind of life.


{'source': 'facts.txt'}
Search score: 0.35016560554504395
1. "Dreamt" is the only English word that ends with the letters "mt."
2. An ostrich's eye is bigger than its brain.
3. Honey is the only natural food that is made without destroying any kind of life.


{'source': 'facts.txt'}
Search score: 0.3518882989883423
4. A snail can sleep for three years.
5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'
6. The elephant is the only mammal that can't jump.


{'source': 'facts.txt'}
Search score: 0.35238325595855713
4. A snail can sleep for three years.
5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'
6. The elephant is the only mammal that can
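
Note that the scores above increase down the list, so the most similar documents have the lowest scores: the score behaves like a distance. A minimal sketch of how such a distance relates to cosine similarity, assuming (this is not stated above) that the score is a squared Euclidean distance and that the embedding vectors have unit length. The vectors here are toy values.

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine_unit(u, v):
    # For unit vectors the dot product is the cosine similarity.
    return sum(a * b for a, b in zip(u, v))


u, v = normalize([1.0, 2.0, 2.0]), normalize([2.0, 1.0, 2.0])
d = squared_l2(u, v)
# For unit vectors: ||u - v||^2 = 2 - 2 cos(u, v), so cos = 1 - d / 2.
print(abs(cosine_unit(u, v) - (1 - d / 2)) < 1e-12)
```

Under those assumptions, a score of about 0.349 would correspond to a cosine similarity of about $1 - 0.349/2 \approx 0.825$.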

# Prompt 

We returned raw chunks previously. That was awkward because a chunk is not a natural-language answer.

## RetrievalQA 

`RetrievalQA` is a chain for searching a vector store for relevant entries, and presenting results with a string generated by an LLM. 

It requires a retriever object. Retriever objects deserve some explaining. 

Our vector store `db` has the method `.similarity_search`, which you might think `RetrievalQA` could call to get relevant documents from the vector store before the LLM decides how to present the result. But `RetrievalQA` should not depend on the search methods of any particular vector store; LangChain's devs realized that there would be a huge number of vector store options some day. So, instead of having a `RetrievalQA` chain that has code for every known vector store's version of similarity search, they decided that devs of vector store integrations, like Chroma's, need to provide an `.as_retriever` method on their vector store. It returns a retriever object with the method 
```python
retriever.get_relevant_documents(query: str) -> list[Document]
```
to allow the vector store and `RetrievalQA` to "talk".
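
The idea can be sketched without LangChain: the chain only needs an object with a `get_relevant_documents` method and never looks at how that method finds its results. Everything below is hypothetical illustration (`KeywordRetriever`, `answer`, and plain strings standing in for `Document` objects).

```python
from typing import List, Protocol


class Retriever(Protocol):
    def get_relevant_documents(self, query: str) -> List[str]: ...


class KeywordRetriever:
    """Hypothetical retriever: returns texts sharing a word with the query."""

    def __init__(self, texts: List[str]):
        self.texts = texts

    def get_relevant_documents(self, query: str) -> List[str]:
        words = set(query.lower().split())
        return [t for t in self.texts if words & set(t.lower().split())]


def answer(retriever: Retriever, query: str) -> str:
    """Stand-in for a chain: it only knows the retriever interface."""
    docs = retriever.get_relevant_documents(query)
    return f"{len(docs)} document(s) found for: {query}"


retriever = KeywordRetriever(["A snail can sleep", "Bananas are berries"])
print(answer(retriever, "snail facts"))
```

Swapping `KeywordRetriever` for any other class with the same method leaves `answer` unchanged, which is the point of the retriever abstraction.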


In [53]:
from langchain.chains import RetrievalQA 
# from langchain.chat_models import ChatOpenAI # deprecated 
from langchain_openai import ChatOpenAI

"""For debugging, use these two lines. 
Warning: verbose and hard to read."""
# import langchain
# langchain.debug = True 

load_dotenv() 
chat = ChatOpenAI()
embeddings = OpenAIEmbeddings()

db = Chroma(
    persist_directory = "emb",
    # embedding # don't use this so that new vectors are not put into the database. 
    embedding_function = embeddings, # use this 
)

# Chroma has complied and has a .as_retriever method on its database objects.
retriever = db.as_retriever()

chain = RetrievalQA.from_chain_type(
    llm=chat,
    retriever=retriever,
    # `chain_type` is a loaded topic. More later.
    # The argument `stuff` just stuffs all of the retrieved documents 
    # into a SystemMessagePromptTemplate. 
    # This is the most basic chain type. 
    chain_type="stuff",
    # Other chain_type argument options:
    # "map_reduce": takes the top 4 similar documents and individually stuffs each into a SystemMessagePromptTemplate, then feeds all of the results into another template for ChatGPT.
    # "map_rerank": for each of the top 4 similar documents, asks the LLM to generate an answer and a score for how completely that answer addresses the user query.
    # "refine": goes through the 4 similar documents in series (the two above run in parallel), asking the LLM to refine its answer by considering the document and its previous answer.
)

query = "Give me an interesting fact about the english language."
result = chain.run(query)

In [55]:
print(f"The result of running the chain with the query is as follows:\n---\n{result}\n---")


The result of running the chain with the query is as follows:
---
An interesting fact about the English language is that it is constantly evolving and adding new words. On average, around 1,000 new words are added to the English dictionary each year.
---


ME: For fun, here is a function to wrap text. I discovered later that there is such a thing in the Python standard library. 
```python
from textwrap import wrap 
for line in wrap(result, width=75):
    print(line)
```

In [65]:
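def wrap(s, max_char = 60):
    """Print s in lines of at most max_char characters, breaking at spaces."""
    lines = []
    position = 0
    while position < len(s):
        candidate_line = s[position:position + max_char]
        # Index of the last space in the candidate, or -1 if there is none.
        last_space = candidate_line.rfind(" ")
        # Keep the whole candidate if it is the tail of s or has no space.
        if position + max_char >= len(s) or last_space == -1:
            last_space = len(candidate_line) - 1
        position += last_space + 1
        line = candidate_line[:last_space + 1]
        lines.append(line)
    for line in lines:
        print(line)
    return None 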

In [66]:
s= """
1. "Dreamt" is the only English word that ends with the letters "mt."
2. An ostrich's eye is bigger than its brain.
3. Honey is the only natural food that is made without destroying any kind of life.
4. A snail can sleep for three years.
5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'
6. The elephant is the only mammal that can't jump.
7. The letter 'Q' is the only letter not appearing in any U.S. state name.
8. The heart of a shrimp is located in its head.
9. Australia is the only continent covered by a single country.
10. The Great Wall of China is approximately 13,171 miles long.
11. Bananas are berries, but strawberries aren't.
12. The Sphinx of Giza has the body of a lion and the head of a human.
13. The first computer bug was an actual bug trapped in a computer.
14. Neil Armstrong was the first man to walk on the moon.
15. The Eiffel Tower in Paris leans slightly in the sun due to thermal expansion.
16. Queen Elizabeth II is the longest-reigning current monarch.
17. The Leaning Tower of Pisa took 200 years to construct.
18. Angel Falls is the highest waterfall in the world, located in Venezuela.
19. Sword swallowing is a skill that takes 3-10 years to learn.
20. Isaac Newton invented the cat flap.
21. Earth, Texas, is the only place on Earth named 'Earth.'
22. Thomas Edison, who invented the lightbulb, was afraid of the dark.
23. The Pacific Ocean is the largest ocean on Earth, covering more than 60 million square miles.
24. Zeus was the king of the Greek gods according to ancient Greek myth.
25. There are about 7,000 feathers on an eagle.
26. Marie Curie was the first woman to win a Nobel Prize and remains the only person to have won in two different fields—Physics and Chemistry.
27. The Sahara Desert is the largest hot desert in the world.
28. There are 86,400 seconds in a day."""
wrap(s)


1. "Dreamt" is the only English word that ends with the 
letters "mt."
2. An ostrich's eye is bigger than its 
brain.
3. Honey is the only natural food that is made 
without destroying any kind of life.
4. A snail can sleep 
for three years.
5. The longest word in the English 
language is 
'pneumonoultramicroscopicsilicovolcanoconiosis.'
6. The 
elephant is the only mammal that can't jump.
7. The letter 
'Q' is the only letter not appearing in any U.S. state 
name.
8. The heart of a shrimp is located in its head.
9. 
Australia is the only continent covered by a single 
country.
10. The Great Wall of China is approximately 
13,171 miles long.
11. Bananas are berries, but 
strawberries aren't.
12. The Sphinx of Giza has the body of 
a lion and the head of a human.
13. The first computer bug 
was an actual bug trapped in a computer.
14. Neil Armstrong 
was the first man to walk on the moon.
15. The Eiffel Tower 
in Paris leans slightly in the sun due to thermal 
expansion.
16. Queen Eliz

## Other Chain Types

- `map_reduce` takes the top 4 similar documents and individually stuffs each into a SystemMessagePromptTemplate, then feeds all of the results into another template for ChatGPT.
- `map_rerank`, for each of the top 4 similar documents, asks the LLM to generate an answer and a score for how completely that answer addresses the user query.
- `refine` goes through the 4 similar documents in series (the two above run in parallel) and asks the LLM to refine its answer by considering the document and its previous answer. 
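
The difference in control flow between these chain types can be sketched with a stand-in for the LLM. This is an illustration only: the function bodies are hypothetical simplifications rather than LangChain's prompts or implementation, and `map_rerank` is left out.

```python
def stuff(docs, question, llm):
    # One call: all documents go into a single prompt.
    return llm("\n".join(docs) + "\n" + question)


def map_reduce(docs, question, llm):
    # One call per document, then one call to combine the partial answers.
    partials = [llm(d + "\n" + question) for d in docs]
    return llm("\n".join(partials) + "\n" + question)


def refine(docs, question, llm):
    # Sequential calls: each one sees a document and the previous answer.
    answer = ""
    for d in docs:
        answer = llm(answer + "\n" + d + "\n" + question)
    return answer


def count_calls(chain, docs, question):
    """Run a chain with a fake LLM and report how many calls it made."""
    calls = []

    def llm(prompt):
        calls.append(prompt)
        return "partial answer"

    chain(docs, question, llm)
    return len(calls)


docs = ["doc 1", "doc 2", "doc 3", "doc 4"]
for chain in (stuff, map_reduce, refine):
    print(chain.__name__, count_calls(chain, docs, "question?"))
```

With 4 documents, `stuff` makes 1 call, `map_reduce` makes 5, and `refine` makes 4. That is the trade-off between cost and how much text each prompt has to hold.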

## Redundant Files 

A vector store can contain duplicate entries; two entries with different ids can have the same string AND the same image of the string under the embedding. 

You might say that the vector store is then an indexed graph of the embedding with possible duplicate entries. 

Returning the same entry twice is not desirable because the retriever will retrieve multiple copies of the most relevant documents, squeezing out the next most relevant unique documents. 
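
To see what is lost, consider a ranked result list of length 4 in which every entry appears twice. A minimal sketch with made-up strings standing in for documents; `unique_by_content` is a hypothetical helper.

```python
def unique_by_content(texts):
    """Keep the first occurrence of each text, preserving rank order."""
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


ranked = ["facts 1-3", "facts 1-3", "facts 4-6", "facts 4-6"]
print(unique_by_content(ranked))
```

Only two distinct documents survive, so half of the 4 result slots carried no new information.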


We can force having duplicate entries by running this code two or more times. 

```python
db = Chroma.from_documents(
    documents = docs,
    embedding=embeddings,
    # Name the directory for the vector store.
    persist_directory="vector-store"
)
```

Check to make sure there are more than 54 documents.

In [64]:
db._collection.count()

108

Observe that similarity search returns redundancies.

In [65]:
results = db.similarity_search_with_score(
    "What is an interesting fact about the english language?",
    # k = 2 #Number of results to display. Default 4
    )
results

[(Document(page_content='1. "Dreamt" is the only English word that ends with the letters "mt."\n2. An ostrich\'s eye is bigger than its brain.\n3. Honey is the only natural food that is made without destroying any kind of life.', metadata={'source': 'facts.txt'}),
  0.34984490275382996),
 (Document(page_content='1. "Dreamt" is the only English word that ends with the letters "mt."\n2. An ostrich\'s eye is bigger than its brain.\n3. Honey is the only natural food that is made without destroying any kind of life.', metadata={'source': 'facts.txt'}),
  0.3509979844093323),
 (Document(page_content="4. A snail can sleep for three years.\n5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'\n6. The elephant is the only mammal that can't jump.", metadata={'source': 'facts.txt'}),
  0.3523455858230591),
 (Document(page_content="4. A snail can sleep for three years.\n5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoc

### Max Marginal Relevance Searches

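There are two methods on `db` that we can use to remove redundancies. Both return 4 documents chosen to be relevant to the query yet dissimilar to each other; `lambda_mult` (between 0 and 1) sets the trade-off, with values near 1 favoring relevance and values near 0 favoring diversity. 
- One takes in a string.
- The other takes in the vector that is the image of that string under the embedding. 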
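
The selection rule can be sketched in plain Python. This is a minimal illustration of greedy maximal marginal relevance, assuming cosine similarity and the scoring rule `lambda_mult * relevance - (1 - lambda_mult) * redundancy`. It is not LangChain's implementation; `mmr` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen greedily by marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest document that was already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Documents 0 and 1 are duplicates; document 2 is equally relevant but different.
doc_vecs = [[0.8, 0.6], [0.8, 0.6], [0.8, -0.6]]
print(mmr([1.0, 0.0], doc_vecs, k=2, lambda_mult=0.5))
print(mmr([1.0, 0.0], doc_vecs, k=2, lambda_mult=1.0))
```

With `lambda_mult=0.5` the duplicate is skipped in favor of document 2. With `lambda_mult=1.0` redundancy is ignored and the duplicate is returned.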

In [66]:
query_string = "what is an interesting fact about the english language?"

db.max_marginal_relevance_search(
    query = query_string,
    lambda_mult=0.8
    )

[Document(page_content='1. "Dreamt" is the only English word that ends with the letters "mt."\n2. An ostrich\'s eye is bigger than its brain.\n3. Honey is the only natural food that is made without destroying any kind of life.', metadata={'source': 'facts.txt'}),
 Document(page_content="4. A snail can sleep for three years.\n5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'\n6. The elephant is the only mammal that can't jump.", metadata={'source': 'facts.txt'}),
 Document(page_content="86. Broccoli and cauliflower are the only vegetables that are flowers.\n87. The dot over an 'i' or 'j' is called a tittle.\n88. A group of owls is called a parliament.", metadata={'source': 'facts.txt'}),
 Document(page_content='118. The original Star-Spangled Banner was sewn in Baltimore.\n119. The average adult spends more time on the toilet than they do exercising.', metadata={'source': 'facts.txt'})]

In [67]:
query_vector = embeddings.embed_query(query_string)

db.max_marginal_relevance_search_by_vector(
    embedding = query_vector,
    lambda_mult=0.8
    )

[Document(page_content='1. "Dreamt" is the only English word that ends with the letters "mt."\n2. An ostrich\'s eye is bigger than its brain.\n3. Honey is the only natural food that is made without destroying any kind of life.', metadata={'source': 'facts.txt'}),
 Document(page_content="4. A snail can sleep for three years.\n5. The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'\n6. The elephant is the only mammal that can't jump.", metadata={'source': 'facts.txt'}),
 Document(page_content="86. Broccoli and cauliflower are the only vegetables that are flowers.\n87. The dot over an 'i' or 'j' is called a tittle.\n88. A group of owls is called a parliament.", metadata={'source': 'facts.txt'}),
 Document(page_content='118. The original Star-Spangled Banner was sewn in Baltimore.\n119. The average adult spends more time on the toilet than they do exercising.', metadata={'source': 'facts.txt'})]

Again, this is not natural language. Thus, we would like to combine `RetrievalQA` with this kind of search, thereby 
- reducing the number of tokens sent by reducing redundancy
- having the LLM consider 4 distinct documents instead of repeat documents. 

### Custom Filter Retriever

To combine these ingredients we create our own retriever. In the file `redundant_filter_retriever.py` we specify a retriever object with the following:
```python 
from langchain.embeddings.base import Embeddings
from langchain.vectorstores.chroma import Chroma
from langchain.schema import BaseRetriever

class RedundantFilterRetriever(BaseRetriever):
    """
    Vector stores are required by LangChain to have an object that functions 
    as a retriever and has two methods 
    (get_relevant_documents and aget_relevant_documents).
    Instead of using the built-in object for Chroma, we are making a custom 
    retriever object.
    """
    #    Require the instantiator to 
    # 1. specify an embedding (named `embeddings`) that is the class Embeddings
    # 2. specify a vector store (named `chroma`) that is in the class Chroma
    embeddings: Embeddings 
    chroma: Chroma

    def get_relevant_documents(self, query_string ):
        """
        This method is required for any retriever object 
        """
        # Calculate image of query.
        query_vector = self.embeddings.embed_query(query_string)
        # Feed image to max_marginal_relevance_search_by_vector 
        results = self.chroma.max_marginal_relevance_search_by_vector(
            embedding = query_vector, # the parameter name is LangChain's fault. 
            lambda_mult=0.8 
            )
        return results
    
    async def aget_relevant_documents(self, query_string):
        # Async retrieval is not used in this project, so return nothing.
        return []
```

### Call Custom Filter Retriever

We have to instantiate the object before calling a method on it. 

In [68]:
from redundant_filter_retriever import RedundantFilterRetriever

chat = ChatOpenAI()
embeddings = OpenAIEmbeddings() #Remember that this is what we are using.

# create a db without adding entries. 
db_from_file = Chroma(
    persist_directory = "vector-store",
    # embedding # don't use this so that new vectors are not put into the database. 
    embedding_function = embeddings, # use this; the name sucks IMO. 
)

# Instantiate our custom retriever. 
retriever = RedundantFilterRetriever(
    embeddings = embeddings,
    chroma = db_from_file
)

chain = RetrievalQA.from_chain_type(
    llm=chat,
    retriever=retriever,
    chain_type="stuff" 
)

query = "Give me an interesting fact about the english language."
result = chain.run(query)
print(result)

The longest word in the English language is 'pneumonoultramicroscopicsilicovolcanoconiosis.'


Indeed, the LLM has given us a single fact in natural language. 