# LangChain Demo - Large documents

## This notebook

This notebook collects Python examples using LangChain on large documents, especially those that exceed the model's token limit.

Some changes though:
- use Annoy instead of FAISS as a vector database
- use Google Search API instead of SerpAPI
- change in examples and additional examples 
- change in API keys setup

This notebook was tested in June 2023 on AWS SageMaker using the Data Science 3.0 image.

Test environment:
> - AWS SageMaker Studio's notebook 
>> - Kernel image Data Science 3.0
>> - t3.medium 2CPU - 4GB
>> - Python 3.9.15
>> - Linux default 4.14.304-226.531.amzn2.x86_64

More information about LangChain and dedicated examples can be found in the other notebooks in the same folder.

---
<div style="background-color:green;color:black;text-align:center;padding:1rem;font-size:1.5rem;">NOTEBOOK SETUP</div>



**Instructions**

All setup steps are at the top of the notebook so that you can run this whole section to initialize the notebook.

Before running the setup you may need to create the following resources:
- request an OpenAI API key. OpenAI APIs are not free.

Refer to the setup sections for instructions on how to create those resources.

---
## API keys and environment

LangChain gets the API keys from environment variables or function parameters.

**Instructions**

- Never show the keys in shared notebooks, whether as part of the code or in a log. A simple way to avoid key leakage is to use environment variables. Set the environment variable in the terminal or in some local configuration; you then do not have to set the key here.

- If it is easier for you to set the key here by assigning the value, do not forget to empty the string right after you run this block. The environment is kept in memory as long as the kernel runs.

- Be careful when printing the keys. Ensure that you remove the outputs.

- Before sharing, check that the keys are not printed out by some feature of the libraries. Avoid printing library objects: they often hold the API key as a property and may disclose its value.
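
As a small illustration of the advice above, the sketch below masks a key before it is printed. The helper name `mask_secret` and the sample value are hypothetical and not part of the original notebook; this is only one possible way to keep outputs safe.

```python
def mask_secret(value, visible=4):
    """Return the value with everything but the last `visible` characters masked."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# a made-up value, used only for the demonstration
fake_key = "sk-example-1234"
masked = mask_secret(fake_key)
print(masked)
```

Printing the masked value is enough to confirm which key is loaded without disclosing it in a shared notebook.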


I store API keys and configuration information in AWS Secrets Manager. The code below retrieves the secret holding the keys. The secret is a JSON string consisting of key/value pairs. It is used later to set various environment variables.

When using notebooks in SageMaker, do not forget to give the SageMaker execution role permission to read this secret.

In [7]:
!apt-get update && apt-get install -y jq 1>/dev/null

Hit:1 http://deb.debian.org/debian bullseye InRelease
Hit:2 http://deb.debian.org/debian-security bullseye-security InRelease
Get:3 http://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Fetched 44.1 kB in 1s (61.8 kB/s)
Reading package lists... Done


In [8]:
%%bash --out secrets 
# using AWS Secrets Manager to store keys
# grab the keys and store them in a Python variable
export RESPONSE=$(aws secretsmanager get-secret-value --secret-id 'salvia/labbench/tests' )
export SECRETS=$( echo $RESPONSE | jq '.SecretString | fromjson')

echo $SECRETS
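
The secret is a JSON string of key/value pairs, so it can be parsed and copied into environment variables without ever printing the values. The following is a minimal sketch under that assumption; the function name `export_secrets` and the sample secret are hypothetical.

```python
import json
import os


def export_secrets(secret_string, names):
    """Parse a JSON secret string and copy the selected keys into os.environ."""
    values = json.loads(secret_string)
    exported = []
    for name in names:
        if name in values:
            os.environ[name] = str(values[name])
            exported.append(name)
    # return the names only, never the values
    return exported


# a made-up secret, standing in for the string returned by Secrets Manager
fake_secret = '{"OPENAI_API_KEY": "sk-not-a-real-key", "OTHER": "x"}'
done = export_secrets(fake_secret, ["OPENAI_API_KEY"])
print(done)
```

Returning only the key names gives a safe confirmation message for the cell output.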

---
## pip upgrade

In [9]:
!pip install --upgrade pip  1>/dev/null


---
## LangChain Setup

**Resources**
> - [LangChain GetStarted](https://python.langchain.com/docs/get_started/quickstart)

In [10]:
!pip install langchain==0.0.230 1>/dev/null


---
## OpenAI Setup

**Resources**
> - [OpenAI tutorial on API keys](https://platform.openai.com/docs/quickstart)
> - [OpenAI package on Pypi](https://pypi.org/project/openai/)

In [11]:
import json
import os

# the secret is a JSON string: parse it with json.loads rather than eval
os.environ["OPENAI_API_KEY"] = json.loads(secrets)["OPENAI_API_KEY"]


In [12]:
!pip install openai==0.27.8 1>/dev/null


---
## Setup Annoy as a vector database 

Some examples require a vector database (document selector, document retrieval).

LangChain uses ChromaDB by default. It failed to install in this environment, so Annoy is used instead. An alternative is FAISS. You may also want to use an online vector database such as Pinecone or Weaviate.

Most of these packages include C++ code and require GCC at install time. GCC is not included in the SageMaker Data Science 3.0 image, so the first step is installing it.

NOTE: Annoy is read-only: once the index is built you cannot add any more embeddings.

<br/>

**Resources**
> - [Annoy package on Pypi](https://pypi.org/project/annoy/)

***Note***

There were some issues with ChromaDB: LangChain and ChromaDB seem to require different versions of pydantic.

Install GCC C++ compiler

In [13]:
!apt-get update && apt-get install -y build-essential 1>/dev/null

Hit:1 http://deb.debian.org/debian bullseye InRelease
Hit:2 http://deb.debian.org/debian-security bullseye-security InRelease
Hit:3 http://deb.debian.org/debian bullseye-updates InRelease
Reading package lists... Done


Install Annoy

In [14]:
pip install annoy==1.17.3 1>/dev/null

Note: you may need to restart the kernel to use updated packages.


## Setup additional text management tools

When working with embeddings, additional packages are required.

- tiktoken, as an encoder and tokenizer

**Resources**
> - [Tiktoken package on Pypi](https://pypi.org/project/tiktoken/)

 

In [15]:
!pip install tiktoken==0.4.0 1>/dev/null


---
<div style="background-color:green;color:black;text-align:center;padding:1rem;font-size:1.5rem;">
BASIC CONCEPTS
</div>

https://archive.org/stream/alicesadventur00carr/alicesadventur00carr_djvu.txt

https://archive.org/download/alicesadventur00carr/alicesadventur00carr.pdf

https://archive.org/download/alicesadventur00carr/alicesadventur00carr_meta.xml

---
## Download example documents once

Download a book from the Internet Archive

In [23]:
!mkdir -p work/largedoc/data

In [5]:
!curl https://archive.org/stream/alicesadventur00carr/alicesadventur00carr_djvu.txt \
 -o work/largedoc/data/alice_in_wonderland.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  282k    0  282k    0     0   210k      0 --:--:--  0:00:01 --:--:--  210k


In [6]:
!ls work/largedoc/data

alice_in_wonderland.txt


---
## Common initializations

In [25]:
notebook_folder = "work/largedoc"
documents_folder = f"{notebook_folder}/data"

print(documents_folder)

work/largedoc/data


---
# Document loaders

Loaders are an easy way to import documents from other sources 
and make them available for use in your language models. There are many loader types.

**Resources**
> - Document Loaders: https://python.langchain.com/docs/modules/data_connection/document_loaders
> - List of loaders: https://github.com/hwchase17/langchain/tree/master/langchain/document_loaders

In [29]:
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader

# Note, the default model is already 'text-davinci-003' 
# temperature 0 means no randomness
llm = OpenAI(temperature=0, model_name='text-davinci-003')

# This is the source document.    
document_path = f"{documents_folder}/alice_in_wonderland.txt"
 
# Setup a text loader
loader = TextLoader(document_path)
alice_documents = loader.load()

text = alice_documents[0].page_content

# check the length
length = len(text)
print(f"{length=}")

# check the number of tokens
num_tokens = llm.get_num_tokens(text)
print(f"{num_tokens=}")

length=288447
num_tokens=100995
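
The two figures above can be turned into a rough size estimate. The sketch below uses the measured length and token count, and assumes a context limit of 4,097 tokens for `text-davinci-003` (an assumption to check against the OpenAI documentation). The 2000-character chunk size is the one used by the splitter later in this notebook.

```python
length = 288447        # characters, from the output above
num_tokens = 100995    # tokens, from the output above
context_limit = 4097   # assumption: context window of text-davinci-003
chunk_size = 2000      # characters per chunk, as used later in this notebook

# average number of characters per token in this document
chars_per_token = length / num_tokens

# approximate number of tokens in one chunk
tokens_per_chunk = chunk_size / chars_per_token

# does the whole document fit in a single call?
fits_in_one_call = num_tokens <= context_limit

# lower bound on the number of calls (ceiling division),
# ignoring the tokens used by the prompt and the completion
min_calls = -(-num_tokens // context_limit)

print(f"{chars_per_token=:.2f}")
print(f"{tokens_per_chunk=:.0f}")
print(f"{fits_in_one_call=}")
print(f"{min_calls=}")
```

The document is far larger than a single call allows, which is why the following sections split it into chunks.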


---
# Summaries of long text
If the text is longer than the token limit, it must be split into chunks. 
LangChain components take care of splitting and chaining the summarization tasks.

The summarization chain breaks the text into smaller chunks, summarizes each chunk, and creates a final summary based on the individual summaries.

In this example, the text is first split into chunks of 2000 characters. The chain then generates a summary for each chunk and creates a final concise summary based on these individual summaries.
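
The map and reduce stages described above can be sketched in plain Python. The function `toy_summarize` is a hypothetical stand-in for the model call (it only keeps the first few words), so this illustrates the control flow and not LangChain's implementation.

```python
def split_into_chunks(text, chunk_size):
    """Cut the text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def toy_summarize(text, max_words=5):
    """Hypothetical stand-in for an LLM call: keep the first few words."""
    return " ".join(text.split()[:max_words])


def map_reduce_summary(text, chunk_size=40):
    chunks = split_into_chunks(text, chunk_size)
    # map stage: summarize every chunk independently
    partial_summaries = [toy_summarize(chunk) for chunk in chunks]
    # reduce stage: summarize the concatenated partial summaries
    combined = "\n".join(partial_summaries)
    return toy_summarize(combined, max_words=12)


sample = (
    "Alice follows the White Rabbit down a hole. "
    "She meets a Caterpillar and a Cheshire Cat. "
    "The Queen of Hearts holds a trial."
)
final = map_reduce_summary(sample)
print(final)
```

Because each chunk is summarized without knowledge of the others, irrelevant chunks can pollute the final summary, which is what the first run below shows.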

<br/>
**Resources**
> - Summarization quickstart: https://python.langchain.com/docs/modules/chains/popular/summarize

## Summary with default prompt

In [20]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
 
# Split your docs into texts
texts = text_splitter.split_documents(alice_documents)

# check texts
num_texts = len(texts)
print(f"{num_texts=}")


num_texts=171
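
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is an assumption-laden sketch: `RecursiveCharacterTextSplitter` also tries to cut on separators such as paragraph breaks, so its chunk boundaries and chunk count differ from this version.

```python
def split_with_overlap(text, chunk_size=2000, chunk_overlap=100):
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# a synthetic 5000-character text, used only for the demonstration
sample = "".join(str(i % 10) for i in range(5000))
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the end of one chunk is repeated at the start of the next, which helps keep sentences that straddle a boundary readable in at least one chunk.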


In [27]:
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

# chain_type="map_reduce" instructs the chain to 
# - first apply the model to each chunk (map stage) 
# - then combine all map results and apply the model again (reduce stage)
chain = load_summarize_chain(
    llm, 
    chain_type="map_reduce", 
    verbose=False)

# run the chain against all the document chunks
summary = chain.run(texts)

print(summary)

 This code snippet is used to track analytics for a website, sending pageview and process URL events, setting up a cache bust and server name for the data packets, and creating a tracking image for in-page executions. It includes HTML, JavaScript, and SVG elements such as a viewport meta tag, two Google site verification meta tags, a license notice, a try/catch statement, a banner, a login button, a search bar, a media button, a primary navigation menu, a search menu, a wayback search, a save-page-form, a wayback slider, a media subnav, a media slider, an information menu, an info box, a desktop subnav, a dropdown menu, an ia-topnav, a signed-out-dropdown, a logo, a link to the homepage, a waybackpagesarchived count, a search form, a search icon, a media-button, a menu-item, a donate link, a hamburger icon, a main menu, audio collections, image collections, software libraries, CD-ROM images, ZX Spectrum, DOOM Level CD software, books to borrow, Open Library books, texts, and libraries 

## Summary with prompt engineering

In [22]:
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

# set up a custom prompt
# the summarization chain provides a default prompt: write a concise summary.
prompt_template = """Write a concise summary of the following text. 
Focus on the story and ignore details of Project Gutenberg. 

% TEXT:

{text}
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

# chain_type="map_reduce" instructs the chain to 
# - first apply the model to each chunk (map stage) 
# - then combine all map results and apply the model again (reduce stage)
chain = load_summarize_chain(
    llm, 
    chain_type="map_reduce", 
    map_prompt=prompt, 
    combine_prompt=prompt, 
    verbose=False)

# run the chain against all the document chunks
summary = chain.run(texts)

print(summary)


Alice goes on a dream-like adventure in Wonderland, encountering a Queen who orders her beheaded. Alice stands up to the Queen and calls her a pack of cards, at which point the whole pack rises up and Alice wakes up to find herself in the lap of her sister. Along the way, Alice meets a caterpillar who advises her to eat a mushroom to change her size, and attends a strange croquet game with live hedgehogs and flamingos as the balls and mallets. At the end of her adventure, Alice imagines her sister growing up and keeping a loving and simple heart, telling stories to other children and remembering her own childhood.


---
# Complex search on large document


## List characters using summary and a prompt

In [33]:
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

# set up a custom prompt
# the summarization chain provides a default prompt: write a concise summary.
prompt_template = """
Focus on the story and ignore details of Project Gutenberg.
Output a list of all characters. 

% TEXT:

{text}
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

# chain_type="map_reduce" instructs the chain to 
# - first apply the model to each chunk (map stage) 
# - then combine all map results and apply the model again (reduce stage)
chain = load_summarize_chain(
    llm, 
    chain_type="map_reduce", 
    map_prompt=prompt, 
    combine_prompt=prompt, 
    verbose=False)

# run the chain against all the document chunks
summary = chain.run(texts)

print(summary)

7. Eaglet 
8. Duck 
9. Dodo 
10. Bill the Lizard 
11. Caterpillar 
12. Cheshire Cat 
13. Queen of Hearts 
14. King of Hearts 
15. Knave of Hearts 
16. White Rabbit 
17. Mad Hatter 
18. March Hare 
19. Dormouse 
20. Gryphon 
21. Mock Turtle


**OUTPUT**

Unexpected response: the numbered list starts at item 7 and Alice is missing.


## List characters using summary and two prompts

In [35]:
from langchain.llms import OpenAI
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
 
# Note, the default model is already 'text-davinci-003' 
# temperature 0.3 allows some randomness (0 would mean no randomness)
llm = OpenAI(temperature=0.3, model_name='text-davinci-003')

# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
 
# Split your docs into texts
texts = text_splitter.split_documents(alice_documents)
print(f"\nFound {len(texts)} part(s)")


# set up a custom prompt for the map stage
# the summarization chain provides a default prompt: write a concise summary.
map_prompt_template = """
Focus on the story and ignore details of Project Gutenberg.
List all the characters.
Output the list of characters as a bullet points list which shows the name and description of the characters. 

% TEXT:

{text}
"""

map_prompt = PromptTemplate(template=map_prompt_template, input_variables=["text"])


# set up a custom prompt for the combine stage
# the summarization chain provides a default prompt: write a concise summary.
# note: some characters may disappear from the merged list
combine_prompt_template = """
Merge all characters lists.
Output the list of characters as a bullet points list which shows the name and description of the characters. 

% TEXT:

{text}
"""

combine_prompt = PromptTemplate(template=combine_prompt_template, input_variables=["text"])


# chain_type="map_reduce" instructs the chain to 
# - first apply the model to each chunk (map stage) 
# - then combine all map results and apply the model again (reduce stage)
chain = load_summarize_chain(
    llm, 
    chain_type="map_reduce", 
    map_prompt=map_prompt, 
    combine_prompt=combine_prompt, 
    verbose=False)

# run the chain against all the document chunks
summary = chain.run(texts)

print(summary)


Found 171 part(s)
- The Duke and the King: Two con artists who Huck meets on his journey.
- Pap Finn: Huck's drunken and abusive father.
- Widow Douglas and Miss Watson: Two kind-hearted women who take Huck in.
- Tom Sawyer: Huck's best friend and a mischievous young boy.
- The Shepherdson Family: A rival family of the Grangerfords.
- The Judge: A kind-hearted judge who helps Huck and Jim.
- The Grangerford Children: A large family of children who Huck meets on his journey.
- The Wilks Sisters: Three young girls who Huck meets on his journey.
- The Grangerford Servants: A group of servants who work for the Grangerfords.
- The Grangerford Dogs: A pack of dogs who accompany the Grangerfords on their travels.
- The Boggs Family: A family of farmers who Huck meets on his journey.
- The Phelps Family: A family of farmers who Huck meets on his journey.


**OUTPUT**

The chain may return unrelated answers. Another run returned:

```
Found 171 part(s)
- Muff Potter: An alcoholic who is falsely accused of murder.
- Widow Douglas: A kind woman who takes in Huckleberry Finn and tries to civilize him.

Merged Character List:
- Wayback Machine: a digital archive of the World Wide Web and other information on the Internet. 
- Internet Archive: a non-profit organization that maintains the Wayback Machine. 
- Safari: a web archiving service. 
- Edge: a web archiving service. 
- Archive-It: a subscription service for archiving websites. 
- Michael Hart: Founder of Project Gutenberg
- Volunteers: People from around the world who contribute to the project
- Tom Sawyer: A young, mischievous boy growing up in the fictional town of St. Petersburg.
- Joe Harper: Tom's best friend.
- Huckleberry Finn: Tom's other best friend.
- Becky Thatcher: A girl Tom has a crush on.
- Aunt Polly: Tom's strict but loving guardian.
- Injun Joe: The villain of the novel, Injun Joe is a dangerous and cruel man.
- Alice: protagonist of the story; she is a young girl who falls down a rabbit-
```


## List characters using formatted output

In [39]:
from langchain.llms import OpenAI
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.output_parsers import CommaSeparatedListOutputParser


# Note, the default model is already 'text-davinci-003' 
# temperature 0.3 allows some randomness (0 would mean no randomness)
llm = OpenAI(temperature=0.3, model_name='text-davinci-003')

# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
 
# Split your docs into texts
texts = text_splitter.split_documents(alice_documents)
print(f"\nFound {len(texts)} part(s)")

# set up a parser
output_parser = CommaSeparatedListOutputParser()
format_instructions = output_parser.get_format_instructions()

# set up a custom prompt for the map stage
# the summarization chain provides a default prompt: write a concise summary.
map_prompt_template = """
Focus on the story and ignore details of Project Gutenberg.
List all the characters.
Output the list of characters as a bullet points list which shows the name and description of the characters. 

{format_instructions}

% TEXT:

{text}
"""

map_prompt = PromptTemplate(template=map_prompt_template, 
                        input_variables=["text"],
                        partial_variables={"format_instructions": format_instructions}
                       )

# set up a custom prompt for the combine stage
# the summarization chain provides a default prompt: write a concise summary.
# note: some characters may disappear from the merged list
combine_prompt_template = """
Merge all characters lists.
Output the list of characters as a bullet points list which shows the name and description of the characters. 

% TEXT:

{text}
"""

combine_prompt = PromptTemplate(template=combine_prompt_template, 
                        input_variables=["text"],
                        #partial_variables={"format_instructions": format_instructions}
                       )


# chain_type="map_reduce" instructs the chain to 
# - first apply the model to each chunk (map stage) 
# - then combine all map results and apply the model again (reduce stage)
chain = load_summarize_chain(
    llm, 
    chain_type="map_reduce", 
    map_prompt=map_prompt, 
    combine_prompt=combine_prompt, 
    verbose=False)

# run the chain against all the document chunks
response = chain.run(texts)


print("\nResponse")
print(response)

#characters = output_parser.parse(response)

#print("\nCharacters")
#print(characters)


Found 171 part(s)

Response

- Avould: A mysterious figure who appears to John in his dreams, offering him advice and guidance.
- Other Little Children: A group of children who John meets in his travels, who help him on his journey.


## List characters using the vector DB
- first get a list
- then query each character

In [45]:
from langchain.llms import OpenAI
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.output_parsers import CommaSeparatedListOutputParser
from langchain.schema import Document
from pprint import pprint

# Note, the default model is already 'text-davinci-003' 
# temperature 0.3 allows some randomness (0 would mean no randomness)
#model_name="gpt-3.5-turbo" # does not work with map reduce
model_name='text-davinci-003'
llm = OpenAI(temperature=0.3, model_name=model_name)

# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
 
# Split your docs into texts
texts = text_splitter.split_documents(alice_documents)
print(f"\nFound {len(texts)} part(s)")

output_parser = CommaSeparatedListOutputParser()
format_instructions = output_parser.get_format_instructions()

# set up a custom prompt
# the summarization chain provides a default prompt: write a concise summary.
map_prompt_template = """
You will be given a text.
Extract the characters's names.
Ignore details of Project Gutenberg.

{format_instructions}. Add "characters:" in front of the list.

% TEXT:

{text}
"""

map_prompt = PromptTemplate(template=map_prompt_template, 
                        input_variables=["text"],
                        partial_variables={"format_instructions": format_instructions}
                       )

print("\nMap Prompt")
print(map_prompt)


# chain_type="stuff" puts all the given documents into a single prompt;
# the chain is called once per chunk in the loop below
chain = load_summarize_chain(
    llm, 
    chain_type="stuff", 
    prompt=map_prompt,  
    verbose=False)

i = 0
characters = {}
prefix = "Characters:"
for text in texts:
    i += 1
    
    # run the chain against all the document chunks
    # chain expect a list of documents
    response = chain.run([text])

    #print(f"\nResponse {i}")
    #print(response)

    lines = response.split('\n')
    #print(f"\nlines {i}")
    #print(lines)
    for line in lines:
        #print(f"\nline {i}")
        #print(line)
        # match the prefix case-insensitively: the prompt asks for "characters:"
        if line.lower().startswith(prefix.lower()):
            part_characters = output_parser.parse(line[len(prefix):])

            print(f"\nPart {i} Characters: {part_characters}")
 
            # count in how many chunks each character name appears
            for character in part_characters:
                characters[character] = characters.get(character, 0) + 1




Found 171 part(s)

Map Prompt
input_variables=['text'] output_parser=None partial_variables={'format_instructions': 'Your response should be a list of comma separated values, eg: `foo, bar, baz`'} template='\nYou will be given a text.\nExtract the characters\'s names.\nIgnore details of Project Gutenberg.\n\n{format_instructions}. Add "characters:" in front of the list.\n\n% TEXT:\n\n{text}\n' template_format='f-string' validate_template=True

Part 1 Characters: ['Alice', 'White Rabbit', 'Caterpillar', 'Cheshire Cat', 'Mad Hatter', 'March Hare', 'Dormouse', 'Queen of Hearts.']

Part 2 Characters: ['']

Part 3 Characters: ['']

Part 7 Characters: ['']

Part 8 Characters: ['']

Part 9 Characters: ['']

Part 10 Characters: ['']

Part 11 Characters: ['']

Part 12 Characters: ['']

Part 13 Characters: ['']

Part 14 Characters: ['']

Part 16 Characters: ['']

Part 17 Characters: ['']

Part 18 Characters: ['']

Part 19 Characters: ['']

Part 21 Characters: ['']

Part 22 Characters: ['']

Par
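
Many parts above yield `['']` because an empty list still produces one empty string when split. The sketch below shows one way to parse a prefixed, comma-separated line while dropping empty entries. The function `parse_character_line` is hypothetical and only mimics the idea of a comma-separated parser; it is not LangChain's implementation.

```python
def parse_character_line(line, prefix="characters:"):
    """Return the names found after the prefix, or [] if the line does not match."""
    stripped = line.strip()
    if not stripped.lower().startswith(prefix):
        return []
    body = stripped[len(prefix):]
    # split on commas and drop empty entries
    return [name.strip() for name in body.split(",") if name.strip()]


names = parse_character_line("Characters: Alice, White Rabbit, Queen of Hearts.")
empty = parse_character_line("Characters:")
other = parse_character_line("- Alice: a young girl")
print(names, empty, other)
```

Note that trailing punctuation such as the final period is kept here and would still need to be cleaned before counting.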

In [41]:
print(texts[25])

page_content='</mask>\n      <use fill="#FFFFFF" xlink:href="#path-1" class="style-scope primary-nav"></use>\n      <g mask="url(#mask-2)" fill="#FFFFFF" class="style-scope primary-nav">\n        <path d="M0,0 L26.6666667,0 L26.6666667,30 L0,30 L0,0 Z" id="swatch" class="style-scope primary-nav"></path>\n      </g>\n    </g>\n  </svg>\n<!--?lit$4992531951$-->\n  <svg class="ia-wordmark stacked style-scope primary-nav" height="30" viewBox="0 0 95 30" width="95" xmlns="http://www.w3.org/2000/svg">\n    <g fill="#fff" fill-rule="evenodd" class="style-scope primary-nav">\n      <g transform="translate(0 17)" class="style-scope primary-nav">' metadata={'source': 'work/largedoc/data/alice_in_wonderland.txt'}


In [42]:
print(texts[44])

page_content='</svg>' metadata={'source': 'work/largedoc/data/alice_in_wonderland.txt'}
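
The two chunks above contain SVG and HTML markup, which suggests that the downloaded file is an HTML page and not plain text. This would explain the off-topic summaries and character lists. A possible cleanup, sketched below with the standard library, keeps only the text found inside `<pre>` elements. That the book text sits in a `<pre>` element is an assumption to verify against the downloaded file; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect only the text found inside <pre> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._depth = 0   # greater than 0 while inside a <pre> element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.parts.append(data)


def extract_pre_text(html):
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# a tiny made-up page, standing in for the downloaded file
sample_html = """<html><body>
<svg><path d="M0,0 L26,0"></path></svg>
<pre>CHAPTER I.
Down the Rabbit-Hole.
Alice was beginning to get very tired</pre>
</body></html>"""

clean_text = extract_pre_text(sample_html)
print(clean_text)
```

Running the splitter on the cleaned text instead of the raw file should reduce the number of chunks that contain no story content at all.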


In [46]:
from pprint import pformat, pprint

print("\nAll Characters")
pprint(characters)


All Characters
{'': 67,
 '1.3051717-3.1166422': 1,
 '1.3357486-1.3354261': 1,
 '28.932117': 1,
 '29.0339092': 1,
 '3.1454001-1.1273669': 1,
 '3.1454001-1.1273669h2.3140896': 1,
 '33.13115253': 1,
 '4.5105': 1,
 'Ada': 1,
 'Alice': 81,
 "Alice's Sister": 1,
 'Alice,': 1,
 'Alice.': 1,
 'Anti-pathies': 1,
 'Arthur Conan Doyle': 1,
 'Aunt Polly': 1,
 'Avould': 1,
 'Baby': 3,
 'Becky Thatcher': 1,
 'Bill': 4,
 'Bill the Lizard': 1,
 'Birds': 1,
 'Canary': 1,
 'Cat': 5,
 'Caterpillar': 8,
 'Charlotte Lucas': 1,
 'Cheshire Cat': 3,
 'Cheshire Puss': 1,
 'Classical master': 1,
 'Colonel Fitzwilliam': 1,
 'Conger-eel': 1,
 'Cook': 3,
 'Croquet': 1,
 'Dinah': 6,
 'Dodo': 3,
 'Dormouse': 10,
 'Dormouse.': 2,
 'Drawling-master.': 1,
 'Duchess': 10,
 "Duchess's Cook": 1,
 'Duck': 2,
 'Eabbit': 1,
 'Eaglet': 2,
 'Edgar Athelino': 1,
 'Edwin': 1,
 'Elizabeth Bennet': 1,
 'Elsie': 1,
 'Executioner': 1,
 'Executioner.': 1,
 'Farmer.': 1,
 'Father William': 1,
 'Feet': 1,
 'Fish-Footman': 1,
 'Five': 

In [57]:
characters_list = [ c for (c,v) in characters.items()
                   if c != '' and c != 'None' and c != 'none' 
                   and v > 1]

print(characters_list)

['Alice', 'White Rabbit', 'Caterpillar', 'Cheshire Cat', 'March Hare', 'Dormouse', 'Tom Sawyer', 'Joe Harper', 'Huckleberry Finn', 'Dinah', 'Mabel', 'Rabbit', 'Mouse', 'William the Conqueror', 'Duck', 'Dodo', 'Lory', 'Eaglet', 'Mary Ann', 'Bill', 'Lizard', 'Guinea-Pigs', 'Puppy', 'Pigeon', 'Duchess', 'Footman', 'Cook', 'Baby', 'Cat', 'Hatter', 'Queen of Hearts', 'Five', 'Seven', 'Two', 'Queen', 'King', 'Knave', 'Gryphon', 'Mock Turtle', 'Whiting', 'King of Hearts', 'Dormouse.']
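
The filtered list still contains variants such as `'Dormouse.'` next to `'Dormouse'`. A possible refinement is to normalize the names and merge their counts before filtering. The sketch below uses a few counts taken from the dictionary printed above; the helper names are hypothetical.

```python
from collections import Counter


def normalize_name(name):
    """Strip surrounding whitespace and trailing punctuation from a name."""
    return name.strip().strip(".,;:").strip()


def merge_counts(raw_counts, min_count=2):
    """Merge the counts of name variants and keep names seen at least min_count times."""
    merged = Counter()
    for name, count in raw_counts.items():
        cleaned = normalize_name(name)
        if cleaned and cleaned.lower() != "none":
            merged[cleaned] += count
    return {name: count for name, count in merged.items() if count >= min_count}


# a few entries copied from the output above
sample_counts = {
    "": 67,
    "Alice": 81,
    "Alice,": 1,
    "Alice.": 1,
    "Dormouse": 10,
    "Dormouse.": 2,
    "Ada": 1,
}
merged = merge_counts(sample_counts)
print(merged)
```

The threshold of 2 matches the `v > 1` filter used in the cell above.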


In [60]:
from langchain.chains import RetrievalQA
from langchain.vectorstores import Annoy
from langchain.embeddings import OpenAIEmbeddings

# Get embedding engine ready
embeddings = OpenAIEmbeddings()
 
# Embed your texts and store them in the vector database
# the database is in memory; it might be saved to a file and loaded later on.
db = Annoy.from_documents(texts, embeddings)

# Init a retriever for this db
# to look up the relevant parts
retriever = db.as_retriever(search_type="similarity", 
                            search_kwargs={"k":5,
                                           "score_threshold": 0.9
                                          })
 
instructions = ". Give a funny answer 30 words long."

summary = {}
i = 0


# set() removes duplicate names
for character in set(characters_list):
    i += 1

    # build the query
    query = f"Who is {character}? {instructions}"

    # retrieve and count indexed documents relevant for the query
    docs = retriever.get_relevant_documents(query)
    print(f"({i}) - Found {len(docs)} relevant documen(s) for {character}")

    # NOTE: score threshold is not implemented in Annoy
    # if len(docs) < 10:
    #     continue
                                           

    #samples = "\n\n".join([x.page_content[:200] for x in docs[:5]])
    #print(samples)
    

    # create a chain to answer questions 
    qa = RetrievalQA.from_chain_type(
        llm=OpenAI(), 
        chain_type="stuff", 
        retriever=retriever, 
        return_source_documents=False)

    response = qa({"query": query})
    
    summary[character] = response

with open('alice_characters_summary.txt', 'w') as file:
    file.write(pformat(summary))
    


(1) - Found 5 relevant documen(s) for Dinah
(2) - Found 5 relevant documen(s) for Duck
(3) - Found 5 relevant documen(s) for White Rabbit
(4) - Found 5 relevant documen(s) for Rabbit
(5) - Found 5 relevant documen(s) for Dormouse
(6) - Found 5 relevant documen(s) for Duchess
(7) - Found 5 relevant documen(s) for Mabel
(8) - Found 5 relevant documen(s) for Bill
(9) - Found 5 relevant documen(s) for Gryphon
(10) - Found 5 relevant documen(s) for Huckleberry Finn
(11) - Found 5 relevant documen(s) for Baby
(12) - Found 5 relevant documen(s) for Eaglet
(13) - Found 5 relevant documen(s) for Five
(14) - Found 5 relevant documen(s) for Knave
(15) - Found 5 relevant documen(s) for Mouse
(16) - Found 5 relevant documen(s) for King
(17) - Found 5 relevant documen(s) for Mary Ann
(18) - Found 5 relevant documen(s) for Two
(19) - Found 5 relevant documen(s) for Lory
(20) - Found 5 relevant documen(s) for Alice
(21) - Found 5 relevant documen(s) for Cook
(22) - Found 5 relevant documen(s) for Foot
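
Every query above returns exactly 5 documents, even for names that do not belong to the book, because a top-k retriever always returns the k nearest chunks when no score threshold is applied. The sketch below illustrates the difference on toy vectors. Cosine similarity and the function names are used for illustration only and are not a description of how Annoy or LangChain compute scores.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=5, score_threshold=None):
    """Return up to k (score, doc_id) pairs, best first, optionally filtered by score."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    hits = scored[:k]
    if score_threshold is not None:
        hits = [(score, doc_id) for score, doc_id in hits if score >= score_threshold]
    return hits


# toy 2-dimensional "embeddings", made up for the demonstration
toy_index = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.9, 0.1],
    "chunk-c": [0.0, 1.0],
}
query_vec = [1.0, 0.0]

without_threshold = top_k(query_vec, toy_index, k=3)
with_threshold = top_k(query_vec, toy_index, k=3, score_threshold=0.9)
print([doc_id for _, doc_id in without_threshold])
print([doc_id for _, doc_id in with_threshold])
```

Without the threshold the unrelated `chunk-c` is still returned, which is how an unrelated name can still receive 5 "relevant" documents.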

In [61]:
pprint(summary)

{'Alice': {'query': 'Who is Alice? . Give a funny answer 30 words long.',
           'result': ' Alice is a brave and adventurous young girl who is not '
                     'afraid to speak her mind and stand up for what she '
                     'believes in. She is also very kind and loves animals, '
                     "even if they don't always love her back. Despite all of "
                     'her adventures, she still finds time for Tea and '
                     'biscuits.'},
 'Baby': {'query': 'Who is Baby? . Give a funny answer 30 words long.',
          'result': ' Baby is the starfish-shaped creature Alice caught who '
                    'was snorting like a steam-engine and kept doubling itself '
                    'up and straightening itself out again. It was so strange '
                    'that Alice thought it might be a magical creature, but it '
                    'was really just a very confused and agitated baby.'},
 'Bill': {'query': 'Who is Bill? . Giv