In [None]:
# pip install langchain --upgrade
# Version: 0.0.149

In [26]:
#!pip install openai
#!pip install unstructured
#!pip install pinecone-client
#!pip install tiktoken

OpenAI provides an API to its pre-trained large language models (LLMs). Pinecone provides an API to its vector database, which is where the vector embeddings of the PDF document are stored and queried. The LangChain framework ties it all together: the PDF text is run through an embedding model, the resulting vectors are stored in Pinecone, and the passages retrieved from Pinecone are then used in conjunction with the existing LLM to answer questions.

You will need to register an account at OpenAI to get an API key. (You will also need to give them a credit card number before you can use the API. They won't charge the card until you've exhausted your free $18 credit. Just set the usage limits they provide to make sure you don't get any unexpected charges, and don't share your API key.)

You'll also need to register for an account with Pinecone and get an API key there as well. Just follow the instructions on their "Getting started" page.

In [1]:
import os
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

### Load your data

In [43]:
#loader = UnstructuredPDFLoader("pythondatasciencehandbook.pdf")
loader = UnstructuredPDFLoader("Hands-On-Machine-Learning-with-Scikit-Learn-Keras-and-Tensorflow_-Concepts-Tools-and-Techniques-to-Build-Intelligent-Systems-O’Reilly-Media-2019.pdf")

# I originally planned to use pythondatasciencehandbook.pdf, but could not get it to load for some reason.
# loader = OnlinePDFLoader("https://wolfpaulus.com/wp-content/uploads/2017/05/field-guide-to-data-science.pdf")

In [3]:
data = loader.load()

detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with the fast strategy.


In [4]:
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your document')

You have 1 document(s) in your data
There are 946237 characters in your document


### Chunk your data up into smaller documents

In [20]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(data)
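
To make the chunking step concrete, here is a minimal sketch of fixed-window splitting in plain Python. It is a simplification, not what RecursiveCharacterTextSplitter actually does: the real splitter prefers to break on paragraph, line, and word boundaries, so its chunks are often shorter than chunk_size (1199 chunks for 946,237 characters works out to roughly 790 characters each). The function name chunk_text and the sample string are made up for illustration.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into fixed-size character windows (simplified illustration)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Stand-in for the book text: 2,500 characters.
sample = "abcdefghij" * 250

chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=0)
print(len(chunks), [len(c) for c in chunks])

# With an overlap, consecutive chunks share their boundary characters.
overlapping = chunk_text(sample, chunk_size=1000, chunk_overlap=200)
print(len(overlapping), [len(c) for c in overlapping])
```

A non-zero chunk_overlap repeats the tail of one chunk at the head of the next, which can help when an answer straddles a chunk boundary, at the cost of storing more vectors.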

In [47]:
print(f'Now you have {len(texts)} documents')

Now you have 1199 documents


### Create embeddings of your documents to get ready for semantic search

In [48]:
from langchain.vectorstores import Chroma, Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone

In [49]:
# Retrieve the API keys from the environment variables
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')
PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV')
#print(PINECONE_API_KEY, PINECONE_API_ENV)
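
Because os.environ.get returns None for a missing variable, a forgotten key only shows up later as a confusing API error. A small helper that fails early is one way to guard against that. This is a sketch: require_env is a hypothetical name, and DEMO_API_KEY is a stand-in variable set here only so the example runs without real credentials.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Stand-in variable so the example runs without real credentials.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
demo_key = require_env("DEMO_API_KEY")
print("Loaded a key of length", len(demo_key))
```

In this notebook the same helper could be called with 'OPENAI_API_KEY', 'PINECONE_API_KEY', and 'PINECONE_API_ENV' in place of the os.environ.get calls.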

In [50]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [51]:
# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to api key in console
)
index_name = "langchaintest" # put in the name of your pinecone index here

Before the next cell will run without errors, you must go to your Pinecone account and create a new index called "langchaintest". The index needs 1536 dimensions (an error message on my first attempt pointed this out). That number is the length of the vectors returned by OpenAI's text-embedding-ada-002 embedding model, so it does not depend on the book. For the rest of the settings I chose the defaults. Loading the vector embeddings into Pinecone will take a while.
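
Rather than learning the dimension from an error message, you can measure it by embedding one short string and taking the length of the result. The sketch below assumes only that the embedding function takes a string and returns a list of floats; probe_dimension and fake_embed are hypothetical names, and fake_embed is a stand-in so the example runs without an API call.

```python
def probe_dimension(embed_fn, sample_text="dimension probe"):
    """Embed one short string and report the length of the resulting vector."""
    vector = embed_fn(sample_text)
    return len(vector)


def fake_embed(text):
    # Stand-in for a real embedding call: always returns a 1536-element vector.
    return [0.0] * 1536


dimension = probe_dimension(fake_embed)
print(f"Create the index with {dimension} dimensions")
```

With the objects in this notebook, passing embeddings.embed_query instead of fake_embed should give the real value, at the cost of one small API request.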

In [52]:
docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)

In [53]:
# Query the vector store directly, without OpenAI's LLM. This only returns passages from the book.
query = "Which machine learning algorithms can be trained incrementally?"
docs = docsearch.similarity_search(query, include_metadata=True)
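
Conceptually, the similarity search embeds the query and returns the stored chunks whose vectors are closest to it. Here is a minimal sketch of that idea in plain Python, assuming cosine similarity as the distance measure (check the metric your index was actually created with). The three-element vectors are toy data, not real embeddings, and Pinecone uses an index structure rather than the brute-force scan shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings" for four chunks and one query.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.0, 0.0]

best = top_k(query_vec, doc_vecs, k=2)
print(best)
```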

In [54]:
# Here's an example of the first document that was returned
docs[0].page_content[:250]

'Finally, if your system needs to be able to learn autonomously and it has limited resources (e.g., a smartphone application or a rover on Mars), then carrying around large amounts of training data and taking up a lot of resources to train for hours e'

In [55]:
docs[0].page_content

'Finally, if your system needs to be able to learn autonomously and it has limited resources (e.g., a smartphone application or a rover on Mars), then carrying around large amounts of training data and taking up a lot of resources to train for hours every day is a showstopper.\n\nFortunately, a better option in all these cases is to use algorithms that are capable of learning incrementally.\n\nOnline learning\n\nIn online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives (see Figure 1-13).\n\nFigure 1\n\n\n\n13. Online learning\n\nOnline learning is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option\n\n16\n\n|\n\nChapter 1: The Machine Learning Landscape'

### Query those docs to get your answer back

In [56]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

In [57]:
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")
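
The "stuff" chain type places all of the retrieved chunks into a single prompt and sends that prompt to the LLM once. The sketch below shows the general shape of that step. The prompt wording and the name build_stuff_prompt are my own illustration, not LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into one prompt (illustrative wording)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Two short stand-ins for retrieved passages.
retrieved = [
    "In online learning, you train the system incrementally.",
    "Each learning step is fast and cheap.",
]
prompt = build_stuff_prompt(retrieved, "What is online learning?")
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks together fit within the model's context window, which is one reason the chunks are kept to about 1000 characters.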

In [58]:
query = "Which machine learning algorithms can be trained incrementally?"
docs = docsearch.similarity_search(query, include_metadata=True)

In [59]:
chain.run(input_documents=docs, question=query)

' Online learning algorithms can be trained incrementally. Examples include Reinforcement Learning algorithms and Support Vector Machines.'

In [60]:
query = "List which machine learning algorithms do not require one-hot encoded data?"
docs = docsearch.similarity_search(query, include_metadata=True)

In [61]:
chain.run(input_documents=docs, question=query)

' k-Nearest Neighbors, Linear Regression, Logistic Regression, Support Vector Machines (SVMs), Decision Trees and Random Forests, and Neural networks.'