# LangChain Demo - Large documents

## This notebook

This notebook collects Python examples that use LangChain on large documents, especially documents that exceed the model's token limit.

Some changes though:
- use Annoy instead of FAISS as a vector database
- use Google Search API instead of SerpAPI
- change in examples and additional examples 
- change in API keys setup

This notebook has been tested in June 2023 on AWS SageMaker using DataScience 3.0 image.

Test environment:
> - AWS SageMaker Studio's notebook 
>> - Kernel image Data Science 3.0
>> - t3.medium 2CPU - 4GB
>> - Python 3.9.15
>> - Linux default 4.14.304-226.531.amzn2.x86_64

More information about LangChain and dedicated examples is available in the other notebooks of the same folder.

---
<div style="background-color:green;color:black;text-align:center;padding:1rem;font-size:1.5rem;">NOTEBOOK SETUP</div>



**Instructions**

All setup steps are at the top of the notebook so that you can run this whole section to initialize the notebook.

Before running the setup, you may need to create the following resources:
- Request an OpenAI API key. OpenAI APIs are not free.

Refer to the setup sections for instructions on how to create those resources.

---
## API keys and environment

LangChain gets the API keys from environment variables or function parameters.

**Instructions**

- Never show the keys in shared notebooks, whether they are part of the code or of a log. A simple way to avoid key leakage is to use environment variables. Set the environment variable in the terminal or in some local configuration; in that case you do not have to set the key here.

- If it is easier for you to set the key here by assigning the value, do not forget to empty the string right after you run this block. The environment is kept in memory as long as the kernel runs.

- Be careful when printing the keys. Ensure that you remove the outputs.

- Before sharing, check that the keys are not printed out by some feature of the libraries. Avoid printing library objects: they often hold the API key as a property and may disclose its value.
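
When a key does need to appear in an output, for instance to check which one is loaded, masking it first limits the damage. The sketch below is one possible way to do that; `mask_secret` and the `EXAMPLE_API_KEY` variable are hypothetical names used only for illustration.

```python
import os


def mask_secret(value: str, visible: int = 4) -> str:
    """Return a display-safe form of a secret: only the last characters stay visible."""
    if len(value) <= visible:
        # too short to reveal anything safely
        return "*" * len(value)
    hidden = len(value) - visible
    return "*" * hidden + value[hidden:]


# Hypothetical variable name and placeholder value, not a real key.
os.environ.setdefault("EXAMPLE_API_KEY", "sk-not-a-real-key-1234")

# Print the masked form, never the raw value.
print(mask_secret(os.environ["EXAMPLE_API_KEY"]))
```

The output of such a cell can stay in a shared notebook, since it only shows the last few characters.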


I store API keys and configuration information in AWS Secrets Manager. The code below retrieves the secret holding the keys. The secret is a JSON string consisting of key/value pairs. It is used later to set various environment variables.

When using notebooks on SageMaker, do not forget to give the SageMaker execution role permission to read this secret.

In [4]:
!apt-get update && apt-get install -y jq 1>/dev/null

Get:1 http://deb.debian.org/debian bullseye InRelease [116 kB]
Get:2 http://deb.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Get:3 http://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Get:4 http://deb.debian.org/debian bullseye/main amd64 Packages [8183 kB]
Get:5 http://deb.debian.org/debian-security bullseye-security/main amd64 Packages [252 kB]
Get:6 http://deb.debian.org/debian bullseye-updates/main amd64 Packages [14.8 kB]
Fetched 8658 kB in 2s (4805 kB/s)
Reading package lists... Done
debconf: delaying package configuration, since apt-utils is not installed


In [5]:
%%bash --out secrets 
# using AWS Secrets Manager to store keys
# grab the keys and store them in a Python variable
export RESPONSE=$(aws secretsmanager get-secret-value --secret-id 'salvia/labbench/tests' )
export SECRETS=$( echo $RESPONSE | jq '.SecretString | fromjson')

echo $SECRETS
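
Since the secret is a JSON string, it can be parsed with the standard `json` module and copied into environment variables. The sketch below shows the idea on a placeholder string; `export_secrets` and `EXAMPLE_API_KEY` are hypothetical names, and the sample value is not a real key.

```python
import json
import os


def export_secrets(secret_string, names):
    """Copy the requested entries of a JSON secret into environment variables.

    Returns the names that were actually found and exported.
    """
    # json.loads only parses data; it never executes code
    values = json.loads(secret_string)
    exported = []
    for name in names:
        if name in values:
            os.environ[name] = str(values[name])
            exported.append(name)
    return exported


# Placeholder secret with a hypothetical key name, for illustration only.
sample_secret = '{"EXAMPLE_API_KEY": "placeholder-value", "OTHER_SETTING": "value"}'

# Print the exported names only, not the values.
print(export_secrets(sample_secret, ["EXAMPLE_API_KEY", "MISSING_KEY"]))
```

Returning only the names keeps the values out of the cell output.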

---
## pip upgrade

In [6]:
!pip install --upgrade pip  1>/dev/null


---
## LangChain Setup

**Resources**
> - [LangChain GetStarted](https://python.langchain.com/docs/get_started/quickstart)

In [7]:
!pip install langchain==0.0.230 1>/dev/null


---
## OpenAI Setup

**Resources**
> - [OpenAI tutorial on API keys](https://platform.openai.com/docs/quickstart)
> - [OpenAI package on Pypi](https://pypi.org/project/openai/)

In [8]:
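import json
import os

# parse the JSON secret with json.loads rather than eval, which would execute arbitrary code
os.environ["OPENAI_API_KEY"] = json.loads(secrets)["OPENAI_API_KEY"]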


In [9]:
!pip install openai==0.27.8 1>/dev/null


---
## Setup Annoy as a vector database 

Some examples require a vector database (document selector, document retrieval).

LangChain uses ChromaDB by default. It failed to install in this environment, so Annoy is used instead. An alternative is FAISS. You may also want to use an online vector database such as Pinecone or Weaviate.

Most of these packages include C++ code and require GCC at install time. GCC is not included in the SageMaker Data Science 3.0 image, so the first step is to install it.

NOTE: Annoy is read-only: once the index is built, you cannot add any more embeddings.

<br/>

**Resources**
> - [Annoy package on Pypi](https://pypi.org/project/annoy/)

***Note***

I had some issues with ChromaDB: LangChain and ChromaDB seem to require different versions of pydantic.

Install GCC C++ compiler

In [10]:
!apt-get update && apt-get install -y build-essential 1>/dev/null

Hit:1 http://deb.debian.org/debian bullseye InRelease
Hit:2 http://deb.debian.org/debian-security bullseye-security InRelease
Hit:3 http://deb.debian.org/debian bullseye-updates InRelease
Reading package lists... Done
debconf: delaying package configuration, since apt-utils is not installed


Install Annoy

In [11]:
pip install annoy==1.17.3 1>/dev/null

Note: you may need to restart the kernel to use updated packages.


---
## Setup ChromaDB as a vector database 

Pin the version to work around the pydantic version conflict.

In [12]:
pip install chromadb==0.3.26 1>/dev/null

Note: you may need to restart the kernel to use updated packages.


## Setup additional text management tools

When working with embeddings, additional packages are required.

- tiktoken, as an encoder and tokenizer

**Resources**
> - [Tiktoken package on Pypi](https://pypi.org/project/tiktoken/)

 

In [13]:
!pip install tiktoken==0.4.0 1>/dev/null


---
<div style="background-color:green;color:black;text-align:center;padding:1rem;font-size:1.5rem;">
BASIC CONCEPTS
</div>

https://archive.org/stream/alicesadventur00carr/alicesadventur00carr_djvu.txt

https://archive.org/download/alicesadventur00carr/alicesadventur00carr.pdf

https://archive.org/download/alicesadventur00carr/alicesadventur00carr_meta.xml

---
## Download example documents once

Download a book from the Internet Archive

In [23]:
!mkdir -p work/largedoc/data

In [20]:
!curl  https://ia800307.us.archive.org/17/items/aliceinwonderlan00carriala/aliceinwonderlan00carriala_djvu.txt \
 -o work/largedoc/data/alice_in_wonderland.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 29923  100 29923    0     0  29509      0  0:00:01  0:00:01 --:--:-- 29509


In [6]:
!ls work/largedoc/data

alice_in_wonderland.txt


---
## Common initializations

In [14]:
notebook_folder = "work/largedoc"
documents_folder = f"{notebook_folder}/data"

print(documents_folder)

work/largedoc/data


---
# Document loaders

Loaders are an easy way to import documents from other sources
and make them available to your language models. There are many types of loaders.
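
To make the idea concrete, here is a plain-Python sketch of what a text loader roughly does: read a file and wrap its content, together with its origin, in a document object. `SimpleDocument` and `load_text_file` are hypothetical stand-ins, not LangChain classes; the field names mirror the `page_content` attribute used in this notebook.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's document object."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read a text file and return it as a one-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Demonstration on a temporary file so that the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "sample.txt"
    sample_path.write_text("Down the Rabbit-Hole", encoding="utf-8")
    documents = load_text_file(sample_path)

print(len(documents), documents[0].page_content)
```

Returning a list, even for a single file, keeps the interface the same for loaders that produce several documents.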

**Resources**
> - Document Loaders: https://python.langchain.com/docs/modules/data_connection/document_loaders
> - List of loaders: https://github.com/hwchase17/langchain/tree/master/langchain/document_loaders

In [22]:
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader

# Note, the default model is already 'text-davinci-003' 
# temperature 0 means no randomness
llm = OpenAI(temperature=0, model_name='text-davinci-003')

# This is the source document.    
document_path = f"{documents_folder}/alice_in_wonderland.txt"
 
# Setup a text loader
loader = TextLoader(document_path)
alice_documents = loader.load()

text = alice_documents[0].page_content

# check the length
length = len(text)
print(f"{length=}")

# check the number of tokens
num_tokens = llm.get_num_tokens(text)
print(f"{num_tokens=}")

length=29258
num_tokens=20373
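
The token count above is far larger than what a completion model accepts in a single request. The sketch below does the arithmetic, assuming a 4,097-token context window for `text-davinci-003` and 256 tokens reserved for the completion. Both numbers are assumptions to adjust for your model and settings, and the helper names are hypothetical.

```python
import math

CONTEXT_WINDOW = 4097          # assumed context window for text-davinci-003
RESERVED_FOR_COMPLETION = 256  # assumed room kept for the model's answer


def needs_splitting(num_tokens, context_window=CONTEXT_WINDOW,
                    reserved=RESERVED_FOR_COMPLETION):
    """True when the text cannot fit in one request alongside the completion."""
    return num_tokens > context_window - reserved


def minimum_chunks(num_tokens, context_window=CONTEXT_WINDOW,
                   reserved=RESERVED_FOR_COMPLETION):
    """Lower bound on the number of chunks; it ignores the prompt template's own tokens."""
    return math.ceil(num_tokens / (context_window - reserved))


print(needs_splitting(20373), minimum_chunks(20373))
```

This is only a lower bound: the instructions in the prompt also consume tokens, so practical chunk sizes are smaller.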


---
# Summaries of long text
If the text is longer than the token limit, it must be split into chunks.
LangChain components take care of splitting the text and chaining the summarization tasks.

The summarization chain breaks the text into smaller chunks, summarizes each chunk, and then creates a final summary based on the individual summaries.

In this example, the book is first split into chunks of 2000 characters. The chain then generates a summary for each chunk and creates a final concise summary based on these individual summaries.
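
As a rough illustration of the map and reduce stages described above, the sketch below uses plain Python with a stand-in `first_sentence` function in place of a model call. The function names are hypothetical and this is not LangChain's implementation; the real chain builds prompts and may handle long intermediate summaries differently.

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    # map stage: one call per chunk
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # reduce stage: one call on the combined partial summaries
    return summarize("\n\n".join(partial_summaries))


def first_sentence(text: str) -> str:
    """Stand-in for a model call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


chunks = [
    "Alice falls down a rabbit hole. She lands in a hall.",
    "Alice meets the Caterpillar. He asks who she is.",
]
print(map_reduce_summarize(chunks, first_sentence))
```

With N chunks, this pattern costs N + 1 model calls, which is worth keeping in mind since each call is billed.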

<br/>
**Resources**

> - Summarization quickstart: https://python.langchain.com/docs/modules/chains/popular/summarize

## Summary with default prompt

In [24]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
 
# Split your docs into texts
texts = text_splitter.split_documents(alice_documents)

# check texts
num_texts = len(texts)
print(f"{num_texts=}")


num_texts=16
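
The effect of `chunk_size` and `chunk_overlap` can be pictured with a simplified fixed-width splitter. This is only a sketch: `RecursiveCharacterTextSplitter` tries to cut on separators such as paragraph and line breaks, which the hypothetical `split_with_overlap` helper below does not do.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width chunks; consecutive chunks share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("expected 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap repeats the end of one chunk at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.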


In [25]:
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

# the chain type map_reduce instructs the chain to
# - first apply the model to each chunk (map stage)
# - then combine all map results and apply the model again (reduce stage)
chain = load_summarize_chain(
    llm, 
    chain_type="map_reduce", 
    verbose=False)

# run the chain against all the document chunks
summary = chain.run(texts)

print(summary)

 Alice in Wonderland is a classic novel by Lewis Carroll about a girl named Alice who falls down a rabbit hole and embarks on a journey of self-discovery. Along the way, she meets a variety of strange creatures and experiences surreal events. The book has been digitized by the University of California Los Angeles and contains a poem, "Down the Rabbit-Hole," and other stories such as "The Pool of Tears," "A Caucus Race and a Long Tale," and "Alice's Evidence." Alice must use her wit and courage to make it through the strange and mysterious world of Wonderland.


**OUTPUT**

Typical response

 Alice in Wonderland is a classic novel by Lewis Carroll about a girl named Alice who falls down a rabbit hole and embarks on a journey of self-discovery. Along the way, she meets a variety of strange creatures and experiences surreal events. The book has been digitized by the University of California Los Angeles and contains a poem, "Down the Rabbit-Hole," and other stories such as "The Pool of Tears," "A Caucus Race and a Long Tale," and "Alice's Evidence." Alice must use her wit and courage to make it through the strange and mysterious world of Wonderland.

## Summary with prompt engineering

In [28]:
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

# set up a custom prompt
# the summarization chain provides a default prompt: write a concise summary.
prompt_template = """Write a pedantic summary of the following text. 
Describe characters.

% TEXT:

{text}
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

# the chain type map_reduce instructs the chain to
# - first apply the model to each chunk (map stage)
# - then combine all map results and apply the model again (reduce stage)
chain = load_summarize_chain(
    llm, 
    chain_type="map_reduce", 
    map_prompt=prompt, 
    combine_prompt=prompt, 
    verbose=False)

# run the chain against all the document chunks
summary = chain.run(texts)

print(summary)


Alice in Wonderland is a classic novel by Lewis Carroll that follows the story of Alice, a young girl who falls down a rabbit hole and finds herself in a strange and magical world. Along the way, she meets a variety of characters, including the White Rabbit, the Caterpillar, the Cheshire Cat, the Mad Hatter, the Queen of Hearts, and the Mock Turtle. Through her adventures, Alice learns valuable lessons about life and the importance of being true to oneself. The characters in the story are Alice, the White Rabbit, the Caterpillar, the Cheshire Cat, the Mad Hatter, the Queen of Hearts, and the Mock Turtle. Each character has their own unique personality and provides Alice with advice and guidance as she navigates her way through Wonderland.


**OUTPUT**

Typical response


Alice in Wonderland is a classic novel by Lewis Carroll that follows the story of Alice, a young girl who falls down a rabbit hole and finds herself in a strange and magical world. Along the way, she meets a variety of characters, including the White Rabbit, the Caterpillar, the Cheshire Cat, the Mad Hatter, the Queen of Hearts, and the Mock Turtle. Through her adventures, Alice learns valuable lessons about life and the importance of being true to oneself. The characters in the story are Alice, the White Rabbit, the Caterpillar, the Cheshire Cat, the Mad Hatter, the Queen of Hearts, and the Mock Turtle. Each character has their own unique personality and provides Alice with advice and guidance as she navigates her way through Wonderland.

---
# Complex search on large document


## List characters using summary and a prompt

In [39]:
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

# set up a custom prompt
# the summarization chain provides a default prompt: write a concise summary.
prompt_template = """
Output a list of all characters. Describe each character. 

Expected response format:
- character name: character description

% TEXT:

{text}
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

# the chain type map_reduce instructs the chain to
# - first apply the model to each chunk (map stage)
# - then combine all map results and apply the model again (reduce stage)
chain = load_summarize_chain(
    llm, 
    chain_type="map_reduce", 
    map_prompt=prompt, 
    combine_prompt=prompt, 
    verbose=False)

# run the chain against all the document chunks
summary = chain.run(texts)

print(summary)


- The Cheshire Cat: A mischievous cat who appears and disappears at will. He is often seen grinning and is known for his clever riddles. 

- The Queen of Hearts: The Queen of Hearts is the ruler of Wonderland. She is a fierce and powerful ruler who is often seen as a tyrant. 

- The White Rabbit: The White Rabbit is a frantic creature who is always running late. He is often seen carrying a pocket watch and is known for his nervousness.


**OUTPUT**

Weird response. Alice is missing.

```
- The White Rabbit: The White Rabbit is a character who Alice meets in the woods. He is known for his frantic behavior and his tendency to be late. 

- The Cheshire Cat: The Cheshire Cat is a mysterious creature who Alice meets in the woods. He is known for his mischievous behavior and his ability to disappear and reappear at will.
```

## List characters using summary and two prompts

In [41]:
from langchain.llms import OpenAI
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
 
# Note: the default model is already 'text-davinci-003'
# temperature 0 means no randomness; 0.3 allows some variation
llm = OpenAI(temperature=0.3, model_name='text-davinci-003')

# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
 
# Split your docs into texts
texts = text_splitter.split_documents(alice_documents)
print(f"\nFound {len(texts)} part(s)")


# set up a custom prompt for the map stage
# the summarization chain provides a default prompt: write a concise summary.
map_prompt_template = """
List all the characters.
Output the list of characters as a bullet points list which shows the name and description of the characters. 

% TEXT:

{text}
"""

map_prompt = PromptTemplate(template=map_prompt_template, input_variables=["text"])


# set up a custom prompt for the reduce stage
# the summarization chain provides a default prompt: write a concise summary.
# note: some elements of the partial lists may disappear from the merged list
combine_prompt_template = """
Merge all characters lists.
Output the list of characters as a bullet points list which shows the name and description of the characters. 

% TEXT:

{text}
"""

combine_prompt = PromptTemplate(template=combine_prompt_template, input_variables=["text"])


# the chain type map_reduce instructs the chain to
# - first apply the model to each chunk (map stage)
# - then combine all map results and apply the model again (reduce stage)
chain = load_summarize_chain(
    llm, 
    chain_type="map_reduce", 
    map_prompt=map_prompt, 
    combine_prompt=combine_prompt, 
    verbose=False)

# run the chain against all the document chunks
summary = chain.run(texts)

print(summary)


Found 16 part(s)

• The Queen of Hearts - The Queen of Hearts is the ruler of Wonderland. She is known for her temper and her love of executions. 

• The White Rabbit - The White Rabbit is a talking rabbit who Alice follows into Wonderland. He is known for his frantic behavior and his tendency to be late. 

• The Mad Hatter - The Mad Hatter is a strange character who hosts a tea party with the March Hare and the Dormouse. He is known for his strange behavior and his nonsensical riddles. 

• The March Hare - The March Hare is a character who attends the Mad Hatter's tea party. He is known for his mischievous behavior and his tendency to play tricks on Alice. 

• Tweedledee and Tweedledum - Tweedledee and Tweedledum are two strange characters who Alice meets in the woods. They are known for their nonsensical conversations and their tendency to argue with each other. 

• The Cheshire Cat - The Cheshire Cat is a mysterious cat who appears and disappears at will. He is known for his enigma

In [None]:
**OUTPUT**

May give random answers

```
- Muff Potter: An alcoholic who is falsely accused of murder.
- Widow Douglas: A kind woman who takes in Huckleberry Finn and tries to civilize him.

Merged Character List:
- Wayback Machine: a digital archive of the World Wide Web and other information on the Internet. 
- Internet Archive: a non-profit organization that maintains the Wayback Machine. 
- Safari: a web archiving service. 
- Edge: a web archiving service. 
- Archive-It: a subscription service for archiving websites. 
- Michael Hart: Founder of Project Gutenberg
- Volunteers: People from around the world who contribute to the project
- Tom Sawyer: A young, mischievous boy growing up in the fictional town of St. Petersburg.
- Joe Harper: Tom's best friend.
- Huckleberry Finn: Tom's other best friend.
- Becky Thatcher: A girl Tom has a crush on.
- Aunt Polly: Tom's strict but loving guardian.
- Injun Joe: The villain of the novel, Injun Joe is a dangerous and cruel man.
- Alice: protagonist of the story; she is a young girl who falls down a rabbit-
```

```
- The Duke and the King: Two con artists who Huck meets on his journey.
- Pap Finn: Huck's drunken and abusive father.
- Widow Douglas and Miss Watson: Two kind-hearted women who take Huck in.
- Tom Sawyer: Huck's best friend and a mischievous young boy.
- The Shepherdson Family: A rival family of the Grangerfords.
- The Judge: A kind-hearted judge who helps Huck and Jim.
- The Grangerford Children: A large family of children who Huck meets on his journey.
- The Wilks Sisters: Three young girls who Huck meets on his journey.
- The Grangerford Servants: A group of servants who work for the Grangerfords.
- The Grangerford Dogs: A pack of dogs who accompany the Grangerfords on their travels.
- The Boggs Family: A family of farmers who Huck meets on his journey.
- The Phelps Family: A family of farmers who Huck meets on his journey.
```

## List characters using formatted output

In [43]:
from langchain.llms import OpenAI
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.output_parsers import CommaSeparatedListOutputParser


# Note: the default model is already 'text-davinci-003'
# temperature 0 means no randomness; 0.3 allows some variation
llm = OpenAI(temperature=0.3, model_name='text-davinci-003')

# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
 
# Split your docs into texts
texts = text_splitter.split_documents(alice_documents)
print(f"\nFound {len(texts)} part(s)")

# set up a parser
output_parser = CommaSeparatedListOutputParser()
format_instructions = output_parser.get_format_instructions()

# set up a custom prompt for the map stage
# the summarization chain provides a default prompt: write a concise summary.
map_prompt_template = """
List all the characters.
Output the name and description of the characters. 

{format_instructions}

% TEXT:

{text}
"""

map_prompt = PromptTemplate(template=map_prompt_template, 
                        input_variables=["text"],
                        partial_variables={"format_instructions": format_instructions}
                       )

# set up a custom prompt for the reduce stage
# the summarization chain provides a default prompt: write a concise summary.
# note: some elements of the partial lists may disappear from the merged list
combine_prompt_template = """
Merge all characters lists.
Output the list of characters as a bullet points list which shows the name and description of the characters. 

{format_instructions}

% TEXT:

{text}
"""

combine_prompt = PromptTemplate(template=combine_prompt_template, 
                        input_variables=["text"],
                        partial_variables={"format_instructions": format_instructions}
                       )


# the chain type map_reduce instructs the chain to
# - first apply the model to each chunk (map stage)
# - then combine all map results and apply the model again (reduce stage)
chain = load_summarize_chain(
    llm, 
    chain_type="map_reduce", 
    map_prompt=map_prompt, 
    combine_prompt=combine_prompt, 
    verbose=False)

# run the chain against all the document chunks
response = chain.run(texts)


print("\nResponse")
print(response)

characters = output_parser.parse(response)

print("\nCharacters")
print(characters)


Found 16 part(s)

Response

Answer: Alice, a young girl who falls down a rabbit hole and discovers a magical world; The White Rabbit, a talking rabbit who wears a waistcoat and carries a pocket watch; The Caterpillar, a large blue caterpillar who smokes a hookah and gives Alice advice; The Cheshire Cat, a mysterious cat with a wide grin who can disappear and reappear at will; The Mad Hatter, a mad tea party host who wears a top hat and speaks in riddles; The March Hare, a mad tea party guest who is always late; The Queen of Hearts, a tyrannical ruler who is obsessed with beheading people; The King of Hearts, the Queen's husband who is easily manipulated; The Duchess, a rude and unpleasant woman who is the Queen's advisor; The Gryphon, a strange creature who takes Alice to the Mock Turtle; The Mock Turtle, a sad creature who tells Alice stories of his past; The Jabberwocky, a terrifying creature that Alice must face in order to escape Wonderland; Tweedledum and Tweedledee, two characte
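
In the response above, entries are separated by semicolons while commas appear inside each entry, between the name and its description. A parser that splits on commas would therefore cut the descriptions apart. The sketch below shows one way to parse this particular shape; `parse_characters` is a hypothetical helper and the format is only what this run happened to produce, so other runs may need different rules.

```python
import re


def parse_characters(response):
    """Parse 'Answer: Name, description; Name, description' into a dict."""
    # drop an optional leading "Answer:" label
    text = re.sub(r"^\s*answer\s*:\s*", "", response, flags=re.IGNORECASE)
    characters = {}
    for entry in text.split(";"):
        # the first comma separates the name from the description
        name, _, description = entry.strip().partition(",")
        if name:
            characters[name.strip()] = description.strip()
    return characters


sample = (
    "Answer: Alice, a young girl who falls down a rabbit hole; "
    "The White Rabbit, a talking rabbit who wears a waistcoat, and carries a watch"
)
for name, description in parse_characters(sample).items():
    print(f"{name}: {description}")
```

Splitting on the first comma only keeps any further commas inside the description intact.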

## List characters using the vector DB
- first get a list
- then query each character

In [49]:
from langchain.llms import OpenAI
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.output_parsers import CommaSeparatedListOutputParser
from langchain.schema import Document
from pprint import pprint

# Note: the default model is already 'text-davinci-003'
# temperature 0 means no randomness; 0.3 allows some variation
#model_name="gpt-3.5-turbo" # does not work with map reduce
model_name='text-davinci-003'
llm = OpenAI(temperature=0.3, model_name=model_name)

# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100)
 
# Split your docs into texts
texts = text_splitter.split_documents(alice_documents)
print(f"\nFound {len(texts)} part(s)")

output_parser = CommaSeparatedListOutputParser()
format_instructions = output_parser.get_format_instructions()

# set up a custom prompt
# the summarization chain provides a default prompt: write a concise summary.
map_prompt_template = """
You will be given a text.
Extract the characters's names.
Ignore details of Project Gutenberg.

{format_instructions}. Add "characters:" in front of the list.

% TEXT:

{text}
"""

map_prompt = PromptTemplate(template=map_prompt_template, 
                        input_variables=["text"],
                        partial_variables={"format_instructions": format_instructions}
                       )

print("\nMap Prompt")
print(map_prompt)


# the chain type "stuff" sends all the documents it receives in a single prompt
# here the chain is called once per chunk, so each call processes one chunk
# (there are no map and reduce stages)
chain = load_summarize_chain(
    llm, 
    chain_type="stuff", 
    prompt=map_prompt,  
    verbose=False)

# number of chunks in which each name was reported
characters = {}
# the prompt asks for "characters:" but the model may capitalize it,
# so the prefix is matched case-insensitively
prefix = "characters:"
for i, text in enumerate(texts, start=1):
    # run the chain against a single chunk
    # the chain expects a list of documents
    response = chain.run([text])

    for line in response.split('\n'):
        if line.lower().startswith(prefix):
            # drop the prefix and parse the comma-separated names
            part_characters = output_parser.parse(line[len(prefix):])

            print(f"\nPart {i} Characters: {part_characters}")

            for character in part_characters:
                characters[character] = characters.get(character, 0) + 1




Found 49 part(s)

Map Prompt
input_variables=['text'] output_parser=None partial_variables={'format_instructions': 'Your response should be a list of comma separated values, eg: `foo, bar, baz`'} template='\nYou will be given a text.\nExtract the characters\'s names.\nIgnore details of Project Gutenberg.\n\n{format_instructions}. Add "characters:" in front of the list.\n\n% TEXT:\n\n{text}\n' template_format='f-string' validate_template=True

Part 1 Characters: ['Alice', 'Lewis Carroll', 'Georgie Gregg']

Part 2 Characters: ['Alice', 'Caterpillar', 'Pig', 'Pepper', 'Queen', 'Mock Turtle', 'Lobster', 'Tarts Stealer.']

Part 3 Characters: ['Alice', 'Rabbit-Hole']

Part 4 Characters: ['Alice', 'Rabbit-Hole', 'Alice in Wonderland', 'White Rabbit', 'Dodo', 'Lory', 'Eaglet', 'Mock Turtle', 'Gryphon', 'Queen of Hearts']

Part 5 Characters: ['Alice', 'Rabbit-Hole']

Part 6 Characters: ['Alice']

Part 7 Characters: ['Alice,']

Part 8 Characters: ['Alice', 'The Pool of Tears']

Part 9 Character

In [50]:
from pprint import pformat

print(f"\nAll Characters")
pprint(characters)


All Characters
{'': 1,
 'Alice': 45,
 'Alice in Wonderland': 1,
 "Alice's Evidence": 2,
 'Alice,': 3,
 'Bill': 2,
 'Caterpillar': 7,
 'Caucus Race': 1,
 'Caucus-Race': 1,
 'Cheshire Cat': 1,
 'Croquet-Ground': 2,
 'Dodo': 2,
 'Dormouse': 3,
 'Dormouse.': 1,
 'Duck': 1,
 'Eaglet': 2,
 'Evidence': 2,
 'Georgie Gregg': 1,
 'Gryphon': 1,
 'King of Hearts': 1,
 'King of Hearts.': 1,
 'Lewis Carroll': 1,
 'Lobster': 3,
 'Lobster Quadrille': 2,
 'Long Tale': 1,
 'Long-Tale': 1,
 'Lory': 2,
 'Mad Hatter': 4,
 'March Hare': 4,
 'Mock Turtle': 5,
 'Mouse': 1,
 'Pepper': 7,
 'Pig': 7,
 'Queen': 3,
 'Queen of Hearts': 3,
 'Rabbit': 5,
 'Rabbit-Hole': 3,
 'Tarts Stealer.': 1,
 'The Caterpillar': 1,
 'The Cheshire Cat.': 1,
 'The Dormouse': 1,
 'The Duchess': 1,
 'The Gryphon': 1,
 'The King': 1,
 'The King of Hearts': 1,
 'The Knave of Hearts': 1,
 'The Mad Hatter': 1,
 'The March Hare': 1,
 'The Mock Turtle': 2,
 'The Pool of Tears': 1,
 'The Queen': 3,
 'The Queen of Hearts': 1,
 'The Tarts': 1,

In [57]:
# filter out empty values, weird characters and less frequent names
# the result is sensitive to the chunk size, as larger chunks reduce the number of occurrences

characters_list = [ c for (c,v) in characters.items()
                   if c != '' and c != 'None' and c != 'none' 
                   and v > 1 and len(c) > 2]

print(characters_list)

['Alice', 'Caterpillar', 'Pig', 'Pepper', 'Queen', 'Mock Turtle', 'Lobster', 'Rabbit-Hole', 'White Rabbit', 'Dodo', 'Lory', 'Eaglet', 'Queen of Hearts', 'Alice,', 'Bill', 'Mad Hatter', 'March Hare', 'Dormouse', 'Rabbit', 'The Queen', 'Croquet-Ground', 'The Mock Turtle', 'Lobster Quadrille', "Alice's Evidence", 'Evidence']
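
The list above still contains near-duplicates such as 'Alice,' next to 'Alice' and 'The Queen' next to 'Queen'. One possible cleanup, sketched below with hypothetical helpers `normalize_name` and `merge_counts`, is to normalize the names before counting. The rules (strip punctuation, drop a leading "The") are assumptions chosen for this text and would not remove chapter titles such as 'Rabbit-Hole'.

```python
from collections import Counter


def normalize_name(name):
    """Strip surrounding spaces and punctuation, and drop a leading 'The'."""
    cleaned = name.strip(" .,;:")
    if cleaned.lower().startswith("the "):
        cleaned = cleaned[4:].strip()
    return cleaned


def merge_counts(counts):
    """Merge the counts of names that normalize to the same value."""
    merged = Counter()
    for name, occurrences in counts.items():
        key = normalize_name(name)
        if key:  # skip empty names
            merged[key] += occurrences
    return merged


sample_counts = {"Alice": 45, "Alice,": 3, "The Queen": 3, "Queen": 3,
                 "Dormouse.": 1, "Dormouse": 3, "": 1}
print(merge_counts(sample_counts).most_common())
```

Filtering on frequency after merging, rather than before, avoids dropping a character whose mentions were spread across several spellings.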


In [52]:
from langchain.chains import RetrievalQA
from langchain.vectorstores import Annoy
from langchain.embeddings import OpenAIEmbeddings

# Get embedding engine ready
embeddings = OpenAIEmbeddings()
 
# Embed your texts and store them in the vector database
# the database is in memory; it can be saved to a file and loaded later on
db = Annoy.from_documents(texts, embeddings)

# Init a retriever for this db
# look up the relevant parts
retriever = db.as_retriever(search_type="similarity", 
                            search_kwargs={"k":5,
                                           "score_threshold": 0.9
                                          })
 
instructions = "Give a funny answer 30 words long."

summary = {}
i = 0


# set() removes duplicate strings
for character in set(characters_list):
    i += 1

    # build the query
    query = f"Who is {character}? {instructions}"

    # retrieve and count indexed documents relevant for the query
    docs = retriever.get_relevant_documents(query)
    print(f"({i}) - Found {len(docs)} relevant documen(s) for {character}")

    # NOTE: the score threshold is not implemented in Annoy
    # if len(docs) < 10:
    #     continue
                                           

    #samples = "\n\n".join([x.page_content[:200] for x in docs[:5]])
    #print(samples)
    

    # create a chain to answer questions 
    qa = RetrievalQA.from_chain_type(
        llm=OpenAI(), 
        chain_type="stuff", 
        retriever=retriever, 
        return_source_documents=False)

    response = qa({"query": query})
    
    summary[character] = response

with open(f'{notebook_folder}/alice_characters_summary_annoy.txt', 'w') as file:
    file.write(pformat(summary))

print('\n')
pprint(summary)

(1) - Found 5 relevant documen(s) for Mad Hatter
(2) - Found 5 relevant documen(s) for Alice,
(3) - Found 5 relevant documen(s) for Queen
(4) - Found 5 relevant documen(s) for Pepper
(5) - Found 5 relevant documen(s) for Eaglet
(6) - Found 5 relevant documen(s) for Caterpillar
(7) - Found 5 relevant documen(s) for White Rabbit
(8) - Found 5 relevant documen(s) for Pig
(9) - Found 5 relevant documen(s) for Lobster
(10) - Found 5 relevant documen(s) for Dodo
(11) - Found 5 relevant documen(s) for Alice
(12) - Found 5 relevant documen(s) for The Mock Turtle
(13) - Found 5 relevant documen(s) for Lory
(14) - Found 5 relevant documen(s) for Queen of Hearts
(15) - Found 5 relevant documen(s) for Rabbit-Hole
(16) - Found 5 relevant documen(s) for Rabbit
(17) - Found 5 relevant documen(s) for March Hare
(18) - Found 5 relevant documen(s) for Dormouse
(19) - Found 5 relevant documen(s) for Mock Turtle
(20) - Found 5 relevant documen(s) for Bill
(21) - Found 5 relevant documen(s) for The Queen
(

In [None]:
# TODO make it readable
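
One way to make the summary readable is to print each character followed by its wrapped answer instead of dumping the raw dictionary. The sketch below assumes that each stored response is a dict holding the answer under a `"result"` key, which is my understanding of what `RetrievalQA` returns; `format_summary` is a hypothetical helper and the sample data is made up.

```python
import textwrap


def format_summary(summary, width=80):
    """Render {character: response} as an indented, wrapped text report."""
    lines = []
    for character in sorted(summary):
        response = summary[character]
        # assumed shape: {"query": ..., "result": ...}; fall back to str() otherwise
        if isinstance(response, dict):
            answer = str(response.get("result", ""))
        else:
            answer = str(response)
        lines.append(character)
        lines.append(textwrap.fill(answer.strip(), width=width,
                                   initial_indent="  ", subsequent_indent="  "))
        lines.append("")
    return "\n".join(lines)


# Made-up sample, for illustration only.
sample_summary = {
    "Alice": {"query": "Who is Alice?", "result": " A curious girl."},
}
print(format_summary(sample_summary))
```

The same string can be written to the output file in place of the `pformat` dump.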

## Using chroma and similarity threshold

**Resources**
> - https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/chroma

In [63]:
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings


# Get embedding engine ready
embeddings = OpenAIEmbeddings()
 
# Embed your texts and store them in the vector database
# the database is in memory; it can be saved to a file and loaded later on
chroma_db = Chroma.from_documents(texts, embeddings)

# Init a retriever for this db
# look up the relevant parts
retriever = chroma_db.as_retriever(search_type="similarity_score_threshold", 
                            search_kwargs={"k":5,
                                           "score_threshold": 0.7
                                          })
     
instructions = ". Give a funny answer 30 words long."

summary = {}
i = 0


# set() removes duplicate strings
for character in set(characters_list):
    i += 1

    # build the query
    query = f"Who is {character}? {instructions}"

    # retrieve and count indexed documents relevant for the query
    docs = retriever.get_relevant_documents(query)
    print(f"({i}) - Found {len(docs)} relevant document(s) for {character}")

    # NOTE: the score threshold is not implemented in Annoy; with Chroma a query may return no documents
    if len(docs) < 1:
        continue
                                           

    #samples = "\n\n".join([x.page_content[:200] for x in docs[:5]])
    #print(samples)
    

    # create a chain to answer questions 
    qa = RetrievalQA.from_chain_type(
        llm=OpenAI(), 
        chain_type="stuff", 
        retriever=retriever, 
        return_source_documents=False)

    response = qa({"query": query})
    
    summary[character] = response

with open(f'{notebook_folder}/alice_characters_summary_chroma.txt', 'w') as file:
    file.write(pformat(summary))

print('\n')
pprint(summary)

(1) - Found 5 relevant document(s) for Mad Hatter
(2) - Found 5 relevant document(s) for Alice,
(3) - Found 0 relevant document(s) for Queen
(4) - Found 0 relevant document(s) for Pepper
(5) - Found 0 relevant document(s) for Eaglet
(6) - Found 1 relevant document(s) for Caterpillar
(7) - Found 5 relevant document(s) for White Rabbit
(8) - Found 1 relevant document(s) for Pig
(9) - Found 0 relevant document(s) for Lobster
(10) - Found 0 relevant document(s) for Dodo
(11) - Found 5 relevant document(s) for Alice
(12) - Found 3 relevant document(s) for The Mock Turtle
(13) - Found 0 relevant document(s) for Lory
(14) - Found 0 relevant document(s) for Queen of Hearts
(15) - Found 5 relevant document(s) for Rabbit-Hole
(16) - Found 5 relevant document(s) for Rabbit
(17) - Found 1 relevant document(s) for March Hare
(18) - Found 0 relevant document(s) for Dormouse
(19) - Found 3 relevant document(s) for Mock Turtle
(20) - Found 0 relevant document(s) for Bill
(21) - Found 0 relevant document(s) for The Queen
(22) - Found 0 relevant document(s) for Evidence
(23) - Found 1 relevant doc

## Turn into a readable response using summarization 

Wrap the response into a document and display the result.

It actually summarizes the dict response. 
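
For comparison, the same dictionary can be turned into a list without any model call. This is a minimal sketch, assuming each value in `summary` is the dict returned by the QA chain, with `query` and `result` keys. The `summary` content below is a hypothetical stand-in with placeholder descriptions, and `to_readable_list` is an illustrative helper, not part of LangChain.

```python
from pprint import pformat

# Hypothetical stand-in for the `summary` built earlier: one entry per character,
# each holding a dict with "query" and "result" keys.
# The descriptions are placeholders, not real model output.
summary = {
    "Dodo": {"query": "Who is Dodo?", "result": " placeholder description A"},
    "Alice": {"query": "Who is Alice?", "result": " placeholder description B"},
}

# This is the text that gets wrapped into the Document: the repr of the whole dict
as_text = pformat(summary)


def to_readable_list(summary):
    """Format {name: {"result": description}} as '- name: description' lines."""
    lines = []
    for name in sorted(summary):
        description = summary[name]["result"].strip()
        lines.append(f"- {name}: {description}")
    return "\n".join(lines)


readable = to_readable_list(summary)
print(readable)
```

The plain loop keeps the descriptions exactly as they were generated, whereas the summarization chain lets the model rewrite or shorten them.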

In [66]:
from langchain.schema import Document
from pprint import pformat

document = Document(
    page_content=pformat(summary)
)

response = [document]

In [77]:
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

# set up a custom prompt
# the summarization chain provides a default prompt: write a concise summary.
prompt_template = """
Return a pretty list of names and description.

Expected response format:
- character name: character description

% TEXT:

{text}
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

# chain_type="stuff" puts all documents into a single prompt and calls the model once.
# The alternative "map_reduce" would:
# - first apply the model to each chunk (map stage)
# - then combine all map results and apply the model again (reduce stage)
chain = load_summarize_chain(
    llm, 
    chain_type="stuff", 
    prompt=prompt, 
    verbose=False)

# run the chain against all the document chunks
output = chain.run(response)

print(output)


- Alice: A curious girl who loves to explore and has a wild imagination.
- Alice's Evidence: The jury of her peers made up of a curious assortment of characters from Wonderland.
- Caterpillar: A wise and witty creature who has an affinity for smoking hookah.
- Croquet-Ground: The Queen's loyal butler and faithful servant.
- Lobster Quadrille: A famous lobster who loves to dance.
- Mad Hatter: A wild and eccentric character who loves to throw wild tea parties and ask nonsensical questions.
- March Hare: A wild hare who is always late and known for his wild antics and silly stories.
- Mock Turtle: An incredibly mysterious creature that has been known to make people laugh uncontrollably.
- Pig: A very silly character who loves to have fun and make people laugh.
- Rabbit: A wise little creature who is always on the lookout for Alice's shenanigans.
- Rabbit-Hole: An adventurous bunny who loves to explore and have fun.
- The Mock Turtle: A mysterious creature who is said to be half-turtle, 


- Alice: A curious girl who loves to explore and has a wild imagination. 
- Alice's Evidence: A jury of Alice's peers from Wonderland, including the White Rabbit, the Cheshire Cat, the Mad Hatter, and the Caterpillar. 
- Caterpillar: A wise and witty creature who has an affinity for smoking hookah. 
- Croquet-Ground: The Queen's loyal butler and faithful servant, made of croquet hoops, mallets, and balls. 
- Lobster Quadrille: A famous lobster who loves to dance and show off his moves. 
- Mad Hatter: A wild and eccentric character who loves to throw wild tea parties and ask nonsensical questions. 
- March Hare: A wild hare who is always late and loves to tell silly stories. 
- Mock Turtle: An incredibly mysterious creature that has been known to make people laugh uncontrollably. 
- Pig: A very silly character who loves to have fun and make people laugh. 
- Rabbit: A wise little creature who is always on the lookout for Alice's shenanigans. 
- Rabbit-Hole: An adventurous bunny who loves to explore and have a good time. 
- The Mock
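
The chain above is built with `chain_type="stuff"`, which sends all documents to the model in one prompt. The other common option, `map_reduce`, calls the model once per chunk and then once more on the partial results. A rough sketch of the difference, using a hypothetical `fake_llm` stand-in instead of a real model (all function names here are illustrative, not LangChain APIs):

```python
calls = []


def fake_llm(prompt):
    """Hypothetical stand-in for a model call: records the prompt, returns a marker."""
    calls.append(prompt)
    return f"<summary of {len(prompt)} chars>"


def summarize_stuff(chunks, llm):
    """'stuff': put every chunk into one prompt, then make a single model call."""
    return llm("\n\n".join(chunks))


def summarize_map_reduce(chunks, llm):
    """'map_reduce': one call per chunk (map), then one call on the partial results (reduce)."""
    partial_summaries = [llm(chunk) for chunk in chunks]
    return llm("\n\n".join(partial_summaries))


chunks = ["chunk one", "chunk two", "chunk three"]

summarize_stuff(chunks, fake_llm)
stuff_calls = len(calls)  # a single call for all chunks

calls.clear()
summarize_map_reduce(chunks, fake_llm)
map_reduce_calls = len(calls)  # three map calls plus one reduce call

print(stuff_calls, map_reduce_calls)
```

With a single short document, as in this section, one prompt is enough. `map_reduce` becomes useful when the chunks together no longer fit in one prompt, at the cost of more model calls.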

## Turn into a readable response using the chat model

In [67]:
# To help construct our Chat Messages
from langchain.schema import HumanMessage
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate

# We will be using a chat model, defaults to gpt-3.5-turbo
from langchain.chat_models import ChatOpenAI

chat_model = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')
instructions = """
You will be given a Python dictionary with character names as keys and the description in the field `result`.
Return a pretty list of names and descriptions.
"""

characters = pformat(summary)

# Build the prompt by combining the instructions with the text
prompt = (instructions + characters)

# Call the LLM
output = chat_model([HumanMessage(content=prompt)])

print(output.content)


- Alice: Alice is a curious girl who loves to explore. She often finds herself in interesting and strange situations, like talking to caterpillars and playing a caucus race with a variety of animals. She has a wild imagination and is always ready for an adventure. With her courage and wit, there's no telling what she'll do next!
- Alice's Evidence: Alice's Evidence is the jury of her peers, made up of a curious assortment of characters from Wonderland, including the White Rabbit, the Cheshire Cat, the Mad Hatter, and the Caterpillar. They all come together to listen to her case and provide a verdict based on her unique perspective. It's sure to be an entertaining trial!
- Alice,: Alice is a young girl who finds herself in strange and wonderful places. She often meets talking animals, attends mysterious tea parties, and embarks on wild adventures, all while learning valuable lessons about life. She is a brave, curious, and imaginative explorer who never ceases to amaze!
- Caterpillar: C

---
## Chroma can use metadata

<div class="alert alert-block alert-warning"> 
    TODO
</div>


---
## Chroma can persist the database
<div class="alert alert-block alert-warning"> 
    TODO 
</div>

```python
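from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Assumption: `texts` and `query` come from the cells above
embedding_function = OpenAIEmbeddings()

# save to disk
db2 = Chroma.from_documents(texts, embedding_function, persist_directory="./chroma_db")
db2.persist()
docs = db2.similarity_search(query)

# load from disk
db3 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
docs = db3.similarity_search(query)
print(docs[0].page_content)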
```

---
## Test GPT-3.5 or GPT-4 in chat

<div class="alert alert-block alert-warning"> 
    TODO
</div>
