# Build an S3 bucket plugin for ChatGPT

- https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/s3_directory.html
- https://github.com/pinecone-io/examples/blob/master/generation/chatgpt/plugins/langchain-docs-plugin.ipynb

## S3 document loader

In [2]:
from langchain.document_loaders import S3DirectoryLoader

In [3]:
loader = S3DirectoryLoader("chatgpt-plugin",prefix="PaulGrahamEssays")

In [4]:
docs = loader.load()

In [40]:
import pickle
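# Save (uncomment on the first run to cache the loaded documents locally)
# with open("s3docs.pkl","wb") as f:
#     pickle.dump(docs,f)
# Load the cached documents instead of reading from S3 again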
with open("s3docs.pkl","rb") as f:
    docs = pickle.load(f)
# docs

In [41]:
len(docs)

216
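The save/load cell above can be folded into a small caching helper so the S3 download only happens when no local pickle exists. This is a minimal sketch using only the standard library; the name `load_or_build` is hypothetical and not part of LangChain.

```python
import os
import pickle


def load_or_build(cache_path, build):
    """Return the pickled value at cache_path, or call build() and cache its result."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    value = build()
    with open(cache_path, "wb") as f:
        pickle.dump(value, f)
    return value


# Assumed usage in this notebook (requires the loader defined earlier):
# docs = load_or_build("s3docs.pkl", loader.load)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.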

In [7]:
docs[0]

Document(page_content='Want to start a startup?  Get funded by\n\nY Combinator.\n\nWatch how this essay was\n\nwritten.\n\nFebruary 2009One of the things I always tell startups is a principle I learned\n\nfrom Paul Buchheit: it\'s better to make a few people really happy\n\nthan to make a lot of people semi-happy.  I was saying recently to\n\na reporter that if I could only tell startups 10 things, this would\n\nbe one of them.  Then I thought: what would the other 9 be?When I made the list there turned out to be 13:\n\n1. Pick good cofounders.Cofounders are for a startup what location is for real estate.  You\n\ncan change anything about a house except where it is.  In a startup\n\nyou can change your idea easily, but changing your cofounders is\n\nhard.\n\n[1]\n\nAnd the success of a startup is almost always a function\n\nof its founders.2. Launch fast.The reason to launch fast is not so much that it\'s critical to get\n\nyour product to market early, but that you haven\'t really sta

In [8]:
print(docs[0].page_content)

Want to start a startup?  Get funded by

Y Combinator.

Watch how this essay was

written.

February 2009One of the things I always tell startups is a principle I learned

from Paul Buchheit: it's better to make a few people really happy

than to make a lot of people semi-happy.  I was saying recently to

a reporter that if I could only tell startups 10 things, this would

be one of them.  Then I thought: what would the other 9 be?When I made the list there turned out to be 13:

1. Pick good cofounders.Cofounders are for a startup what location is for real estate.  You

can change anything about a house except where it is.  In a startup

you can change your idea easily, but changing your cofounders is

hard.

[1]

And the success of a startup is almost always a function

of its founders.2. Launch fast.The reason to launch fast is not so much that it's critical to get

your product to market early, but that you haven't really started

working on it till you've launched.  Launching teach

In [14]:
docs[0].metadata['source']

'C:\\Users\\ydebray\\AppData\\Local\\Temp\\tmpqaz2su57/PaulGrahamEssays/13sentences.txt'

## Split documents into chunks

In [10]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,  # maximum chunk length, measured by length_function (tokens)
    chunk_overlap=20,  # number of tokens shared between consecutive chunks
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)

In [12]:
chunks = text_splitter.split_text(docs[5].page_content)
len(chunks)

5

In [19]:
chunks[0]

'July 2010What hard liquor, cigarettes, heroin, and crack have in common is\n\nthat they\'re all more concentrated forms of less addictive predecessors.\n\nMost if not all the things we describe as addictive are.  And the\n\nscary thing is, the process that created them is accelerating.We wouldn\'t want to stop it.  It\'s the same process that cures\n\ndiseases: technological progress.  Technological progress means\n\nmaking things do more of what we want.  When the thing we want is\n\nsomething we want to want, we consider technological progress good.\n\nIf some new technique makes solar cells x% more efficient, that\n\nseems strictly better.  When progress concentrates something we\n\ndon\'t want to want—when it transforms opium into heroin—it seems\n\nbad.  But it\'s the same process at work.\n\n[1]No one doubts this process is accelerating, which means increasing\n\nnumbers of things we like will be transformed into things we like\n\ntoo much.\n\n[2]As far as I know there\'s no wor

In [13]:
tiktoken_len(chunks[0]), tiktoken_len(chunks[1])

(389, 392)
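To make the idea of the separator list concrete, here is a simplified sketch of recursive splitting: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. It is not LangChain's implementation: it ignores `chunk_overlap`, and the name `recursive_split` is hypothetical. Length is measured in characters by default; a token counter such as `tiktoken_len` could be passed as `length` instead.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", ""), length=len):
    """Split text into chunks of at most max_len, preferring coarse separators."""
    if length(text) <= max_len:
        return [text] if text else []
    if not separators:
        # Nothing finer to split on: return the oversized piece as-is.
        return [text]

    sep, finer = separators[0], separators[1:]
    parts = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if length(candidate) <= max_len:
            current = candidate  # keep merging small parts into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if length(part) <= max_len:
            current = part
        else:
            # The part alone is too long: split it with the finer separators.
            chunks.extend(recursive_split(part, max_len, finer, length))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee\n\nfff"
print(recursive_split(sample, max_len=7))
# → ['aaa bbb', 'ccc ddd', 'eee', 'fff']
```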

In [18]:
import hashlib
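m = hashlib.md5()  # MD5 is used here only to derive a short, stable ID

# strip the local temp directory prefix and keep just the file name
url = docs[5].metadata['source'].replace('C:\\Users\\ydebray\\AppData\\Local\\Temp\\', '').split('/')[2]
print(url)

# hash the file name; the first 12 hex characters serve as the ID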
m.update(url.encode('utf-8'))
uid = m.hexdigest()[:12]
print(uid)

addiction.txt
41af49a835d4
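The cell above hard-codes one user's temp directory. A more portable variant, sketched below, extracts the file name with `pathlib.PureWindowsPath`, which accepts both `\` and `/` as separators on any operating system. The helper name `source_to_uid` and the example path are hypothetical.

```python
import hashlib
from pathlib import PureWindowsPath


def source_to_uid(source, length=12):
    """Return (file name, short hex ID) for a document's source path."""
    name = PureWindowsPath(source).name
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return name, digest[:length]


# Hypothetical path with mixed separators, shaped like the metadata above.
example = r"C:\Users\me\AppData\Local\Temp\tmp123/PaulGrahamEssays/addiction.txt"
name, uid = source_to_uid(example)
print(name, uid)
```

Because the ID depends only on the file name, two files with the same name under different prefixes would collide; hashing the full S3 key would avoid that.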


In [24]:
texts = text_splitter.split_documents(docs)
len(texts)

1882

## Store doc chunks in VectorDB as embeddings

In [20]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

In [25]:
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.


In [26]:
docsearch

<langchain.vectorstores.chroma.Chroma at 0x1e6000fea70>
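Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings are closest to the query's embedding. The toy sketch below illustrates that idea with word-count vectors and cosine similarity. It is an assumption-laden stand-in: `embed`, `cosine`, and `top_k` are hypothetical names, and neither OpenAI embeddings nor Chroma's index work this way internally.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


sample_chunks = [
    "McCarthy showed how to build a language from a handful of operators",
    "Launch fast and pick good cofounders",
    "Technological progress makes things more concentrated",
]
best = top_k("what did mccarthy build", sample_chunks, k=1)
print(best)
```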

## Query on Documents

In [27]:
from langchain import OpenAI, VectorDBQA

In [28]:
qa = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=docsearch)

In [29]:
query = "What did McCarthy discover?"
qa.run(query)

' McCarthy discovered a way to create a programming language using a handful of simple operators and a notation for functions. He called this language Lisp, for “List Processing,” because one of his key ideas was to use a simple data structure called a list for both code and data.'
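With `chain_type="stuff"`, the retrieved chunks are all placed into a single prompt that is sent to the LLM together with the question. The sketch below shows the general shape of such a prompt. The function name `build_stuff_prompt` and the prompt wording are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks into one context block followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What did McCarthy discover?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(prompt)
```

Since every retrieved chunk goes into one prompt, the chunk size and the number of retrieved chunks together must stay within the model's context window.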

In [33]:
qa = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=docsearch, return_source_documents=True)
query = "What did McCarthy discover?"
result = qa({"query": query})
result

{'query': 'What did McCarthy discover?',
 'result': ' McCarthy discovered a formal model of computation, which he called Lisp, that could be written in itself. It was a set of predefined operators and a notation for functions that could be used to create a programming language.',
 'source_documents': [Document(page_content='May 2001\n\n(I wrote this article to help myself understand exactly\n\nwhat McCarthy discovered.  You don\'t need to know this stuff\n\nto program in Lisp, but it should be helpful to\n\nanyone who wants to\n\nunderstand the essence of Lisp \x97 both in the sense of its\n\norigins and its semantic core.  The fact that it has such a core\n\nis one of Lisp\'s distinguishing features, and the reason why,\n\nunlike other languages, Lisp has dialects.)In 1960, John\n\nMcCarthy published a remarkable paper in\n\nwhich he did for programming something like what Euclid did for\n\ngeometry. He showed how, given a handful of simple\n\noperators and a notation for functions, y

In [37]:
# result['result']
result['source_documents'][0].dict()['metadata']['source'].replace('C:\\Users\\ydebray\\AppData\\Local\\Temp\\', '').split('/')[2]

'rootsoflisp.txt'

## TODO: Do the same from the S3 Plugin