## BluetickPDF - PDF Analyzer 

<img src="bluetick-logo.jpg" />

In [19]:
import os
os.environ["OPENAI_API_KEY"] = "add your openai api key"
os.environ["PINECONE_API_KEY"] = "add your pinecone api key"
os.environ["PINECONE_API_ENV"] = "add your pinecone api env"
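
Before running the rest of the notebook, it can help to confirm that the three keys were actually replaced with real values. The sketch below is one possible check; `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration and are not part of LangChain, OpenAI, or Pinecone.

```python
import os

# Hypothetical helper (not part of the original notebook or any library):
# the environment variables this notebook reads.
REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_API_ENV")


def missing_keys(env=os.environ, required=REQUIRED_KEYS, placeholder_prefix="add you"):
    """Return the required keys that are unset, empty, or still placeholder text."""
    missing = []
    for name in required:
        value = env.get(name, "").strip()
        # Treat the "add your ... key" placeholders from the cell above as missing.
        if not value or value.lower().startswith(placeholder_prefix):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_keys()
    if problems:
        print("Set these variables before continuing:", ", ".join(problems))
```

An empty list means every key holds a non-placeholder value; it does not prove the keys are valid.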

In [None]:
pip install openai langchain tiktoken pinecone-client pypdf

## Loading the PDF: 
To begin, we need to load the PDF. LangChain's PyPDFLoader class, which relies on the pypdf package installed above, handles this. Here's an example of how to load the PDF:


In [14]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("The-Book-Thief.pdf")
pages = loader.load()

## Splitting the Text: 
PyPDFLoader returns one document per PDF page. We pass these documents through a text splitter so that no chunk exceeds a size we choose, which keeps each chunk manageable for embedding and querying. LangChain provides RecursiveCharacterTextSplitter for this task; chunk_size and chunk_overlap are measured in characters by default and can be tuned to our requirements. Here's an example of splitting the book into smaller texts:


In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", "\t"], chunk_size=5000, chunk_overlap=200)
texts = text_splitter.split_documents(pages)

num_documents = len(texts)
print(f"Now our book is split up into {num_documents} documents")

Now our book is split up into 353 documents
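
To see how chunk size and overlap interact, here is a deliberately simplified sliding-window splitter in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that class splits on the listed separators and then merges the pieces); `split_with_overlap` is a hypothetical function used only to illustrate the two parameters.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Simplified illustration only: it ignores separators such as newlines.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(chunk)
```

Each chunk repeats the last character of the previous one, which is the role chunk_overlap plays above: text near a boundary appears in both neighbouring chunks.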


## Generating Embeddings:
To perform efficient queries on the text, we need an embedding for each chunk. An embedding is a vector that captures the semantic meaning of the text, so chunks relevant to a query can be found by comparing vectors. We can use LangChain's OpenAIEmbeddings class for this. The cell below only creates the embedding client; the vectors themselves are computed later, when the chunks are uploaded to the index:


In [21]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(openai_api_key=os.environ.get("OPENAI_API_KEY"))
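
Comparing embeddings usually comes down to a similarity measure between two vectors. The sketch below shows cosine similarity on tiny hand-written vectors; the numbers are made up for illustration and are not real OpenAI embeddings, and the metric your Pinecone index uses depends on how the index was configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings" (invented values).
    query = [0.9, 0.1, 0.0]
    chunk_about_same_topic = [0.8, 0.2, 0.1]
    chunk_about_other_topic = [0.0, 0.1, 0.9]
    print(cosine_similarity(query, chunk_about_same_topic))
    print(cosine_similarity(query, chunk_about_other_topic))
```

With real embeddings, `embeddings.embed_query(text)` would supply the vectors instead of the hand-written lists.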

## Creating a Knowledge Base Index: 
To enable fast and accurate querying, we store the chunk embeddings in Pinecone, a vector search service that stores and retrieves embeddings efficiently. The code below assumes an index named the-book-thief already exists in your Pinecone project, with a dimension matching the embedding model. Here's an example of how to populate the index:


In [23]:
from langchain.vectorstores import Pinecone
import pinecone

# Initialize Pinecone
pinecone.init(api_key=os.environ.get("PINECONE_API_KEY"), environment=os.environ.get("PINECONE_API_ENV"))
index_name = "the-book-thief"

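# Embed each chunk and upload it to the existing index named above
# (from_texts writes to the index; it does not create it)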
docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)
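
Conceptually, a vector index stores each chunk next to its embedding and answers a query by returning the k closest entries. The brute-force sketch below illustrates that idea in memory; `ToyVectorIndex` is a hypothetical class written for this notebook, and a hosted service such as Pinecone relies on far more scalable machinery than a linear scan.

```python
import math


class ToyVectorIndex:
    """In-memory stand-in for a vector index (illustration only)."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def query(self, vector, k=3):
        """Return up to k (score, text) pairs, most similar first."""
        scored = [(self._cosine(vector, v), t) for v, t in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0


if __name__ == "__main__":
    index = ToyVectorIndex()
    # Invented 2-dimensional vectors standing in for chunk embeddings.
    index.add([1.0, 0.0], "chunk about Liesel")
    index.add([0.0, 1.0], "chunk about the weather")
    index.add([1.0, 1.0], "chunk mentioning both")
    for score, text in index.query([1.0, 0.1], k=2):
        print(round(score, 3), text)
```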

## Querying the Knowledge Base:
Now that our knowledge base is ready, we can query it with a generative AI model. We use LangChain's ChatOpenAI class, configured here with OpenAI's gpt-3.5-turbo model. Here's an example of how to query the knowledge base:


In [26]:
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0, max_tokens=1000, model_name='gpt-3.5-turbo', openai_api_key=os.environ.get("OPENAI_API_KEY"))

In [27]:
from langchain.chains import RetrievalQA

index_name = "the-book-thief"
text_field = "text"
index = pinecone.Index(index_name)
vectorstore = Pinecone(
    index, embeddings.embed_query, text_field
)

query = "Describe the scenarios when the Death met the book thief"

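# Optional: fetch the 3 chunks most similar to the query for inspection
# (the chain below performs its own retrieval and does not use `docs`)
docs = vectorstore.similarity_search(query, k=3)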

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

output = qa.run(query)
print(output)

Death met the book thief, Liesel Meminger, on two occasions. The first time was when Liesel's younger brother died on a train journey to Munich, and Death came to collect his soul. The second time was when Death took Liesel away from Sydney after she had grown old and died. Death gave Liesel a dusty black book, which turned out to be the book she had written, The Book Thief. Death and Liesel sat down on the curb, and Liesel read her words. Death wanted to tell Liesel many things about beauty and brutality, but all he could do was turn to her and tell her the only truth he truly knows: "I am haunted by humans."
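
The "stuff" chain type used above places all retrieved chunks into a single prompt together with the question. The sketch below shows the general shape of that step; `build_stuff_prompt` and the template wording are hypothetical and are not LangChain's actual prompt.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks and the question into one prompt string.

    Illustrative template only; the wording LangChain uses may differ.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Placeholder chunk texts standing in for retrieved passages.
    retrieved = ["<text of chunk 1>", "<text of chunk 2>"]
    print(build_stuff_prompt("Who narrates the book?", retrieved))
```

Because every chunk goes into one prompt, the combined length of the retrieved chunks plus the answer has to fit within the model's context window, which is one reason chunk_size and the number of retrieved chunks matter.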
