## Performing Q&A on a Huge Book - Pinecone Vector Database

In [81]:
import os  # Access environment variables
from langchain_groq import ChatGroq  # Inference
from langchain.document_loaders import UnstructuredPDFLoader, PDFMinerLoader, PyPDFLoader  # For loading the book
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OllamaEmbeddings
from dotenv import load_dotenv
from langchain_pinecone import PineconeVectorStore  # Vector DB
from langchain.chains.summarize import load_summarize_chain  # To perform summarization
from langchain.chains.question_answering import load_qa_chain  # To perform Q&A

In [62]:
load_dotenv()  # Load environment variables from the .env file

True

In [63]:
os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY")  # Set the Pinecone API key
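
If `PINECONE_API_KEY` is missing from the environment, `os.getenv` returns `None` and the assignment above raises a `TypeError`. A minimal sketch of a clearer check follows; `require_env` is a hypothetical helper name, not part of any library, and the variable name and value in the example are placeholders.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Example with a placeholder variable (hypothetical name and value, not a real key).
os.environ["DEMO_API_KEY"] = "placeholder-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook this could be used as `os.environ["PINECONE_API_KEY"] = require_env("PINECONE_API_KEY")`.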

### Loading the Book

In [64]:
large_data = PyPDFLoader("C:\\Users\\USER\\Downloads\\field-guide-to-data-science.pdf") # Create the PDF loader for the book

In [65]:
loader = large_data.load() # Load the book: one Document per page

Ignoring wrong pointing object 1221 0 (offset 0)
Ignoring wrong pointing object 1309 0 (offset 0)
Ignoring wrong pointing object 1388 0 (offset 0)
Ignoring wrong pointing object 1412 0 (offset 0)
Ignoring wrong pointing object 2082 0 (offset 0)
Ignoring wrong pointing object 2429 0 (offset 0)


In [66]:
loader

[Document(page_content='', metadata={'source': 'C:\\Users\\USER\\Downloads\\field-guide-to-data-science.pdf', 'page': 0}),
 Document(page_content='DATA SCIENCEtoTHE\nFIEL D GUIDE\n    \n \nSECOND  \nEDITION\n© COPYRIGHT 2015 BOOZ ALLEN HAMILTON INC. ALL RIGHTS RESERVED.', metadata={'source': 'C:\\Users\\USER\\Downloads\\field-guide-to-data-science.pdf', 'page': 1}),
 Document(page_content='', metadata={'source': 'C:\\Users\\USER\\Downloads\\field-guide-to-data-science.pdf', 'page': 2}),
 Document(page_content='FOREWORD\nData Science touches every aspect of our lives on a \ndaily basis. When we visit the doctor, drive our cars, \nget on an airplane, or shop for services, Data Science \nis changing the way we interact with and explore  \nour world.  \nOur world is now measured, \nmapped, and recorded in digital \nbits. Entire lives, from birth to \ndeath, are now catalogued in \nthe digital realm. These data, \noriginating from such diverse \nsources as connected vehicles, \nunderwater m

In [67]:
len(loader)

126
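
Some of the 126 loaded pages have an empty `page_content` (pages 0 and 2 in the output above). Below is a small sketch for filtering them out before chunking. `drop_empty_pages` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for LangChain `Document` objects so the snippet runs on its own.

```python
from types import SimpleNamespace


def drop_empty_pages(docs):
    """Keep only documents whose page_content has non-whitespace text."""
    return [doc for doc in docs if doc.page_content.strip()]


# Stand-in documents (assumption: real Document objects expose .page_content).
sample_docs = [
    SimpleNamespace(page_content="", metadata={"page": 0}),
    SimpleNamespace(page_content="FOREWORD", metadata={"page": 3}),
    SimpleNamespace(page_content="   \n", metadata={"page": 2}),
]
kept = drop_empty_pages(sample_docs)
print(len(sample_docs) - len(kept), "empty pages dropped")
```

In the notebook the same helper could be applied to the loaded pages, for example `drop_empty_pages(loader)`.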

### Performing Chunking

In [68]:
chunkdata = RecursiveCharacterTextSplitter(chunk_size=5000, chunk_overlap=0) # Chunks of up to 5000 characters, no overlap

In [69]:
splitdocs = chunkdata.split_documents(loader)

In [70]:
len(splitdocs) # Number of chunks after splitting

104

In [71]:
len(splitdocs[0].page_content) # Number of characters in the first chunk

117

In [91]:
splitdocs[5]

Document(page_content='ACKNOWLEDGEMENTS\nWe  w o u l d  l i k e  to  e x p r e s s  o u r  s i n c e r e s t  g ra t i t u d e  to  \nall those who have made The Field Guide to Data \nScience  such a success. \nThank you to the nearly 15,000 \npeople who have downloaded \nthe digital copy from our website \nand the 100+ people who have \nconnected with The Field Guide \non our GitHub page . We have \nbeen overwhelmed by the \npopularity of the work within the \nData Science community. \nThank you to all of the \npractitioners who are using The \nField Guide as a resource. We are \nexcited to know that the work has \nhad such a strong influence, from \nshaping technical approaches to \nserving as the foundation for the \nvery definition and role of Data \nScience within major government \nand commercial organizations. Thank you to the educators and \nacademics who have incorporated \nThe Field Guide into your course \nwork. We appreciate your trusting \nthis guide as a way to introduce 
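
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size character chunking. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks); `chunk_text` is a hypothetical helper for illustration only.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0))
# ['abcd', 'efgh', 'ij']
```

With `chunk_overlap=0`, as in this notebook, adjacent chunks share no text, so a sentence that straddles a boundary is cut in two. Text shorter than `chunk_size` stays in a single chunk, which is consistent with the first chunk above having only 117 characters.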

### Initializing the Pinecone Vector Database

In [73]:
index_name = "langchain2" # Name of the index already created in Pinecone

### Vector Embeddings

In [75]:
vembeddings = PineconeVectorStore.from_documents(splitdocs[:15], OllamaEmbeddings(model='mxbai-embed-large'), index_name=index_name) # Embed the first 15 chunks with Ollama and store them in Pinecone

The chunks are embedded with Ollama's `mxbai-embed-large` model and the resulting vectors are stored in the Pinecone index. Only the first 15 chunks are embedded here because the book has many pages.
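
Only `splitdocs[:15]` is embedded above. One way to index the rest of the book is to upload the chunks in small batches. The sketch below shows only the batching logic: `batched` is a hypothetical helper, the batch size of 25 is an arbitrary assumption, and the `add_documents` call in the comment is a suggestion that was not run in this notebook.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the 104 chunks produced earlier.
chunks = [f"chunk-{i}" for i in range(104)]
for batch in batched(chunks, batch_size=25):
    # In the notebook this could become: vembeddings.add_documents(batch)
    print(len(batch))
```

Passing `splitdocs[15:]` instead of the full list would avoid uploading the first 15 chunks a second time.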

### Instantiating the Model

In [83]:
groqllm = ChatGroq(model='llama3-8b-8192',temperature=0) # Groq Inference

### Summarization

In [87]:
chain = load_summarize_chain(groqllm,chain_type="map_reduce",verbose=True)  # Summarization

In [92]:
chain.run(splitdocs[:20])



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Write a concise summary of the following:


"DATA SCIENCEtoTHE
FIEL D GUIDE
    
 
SECOND  
EDITION
© COPYRIGHT 2015 BOOZ ALLEN HAMILTON INC. ALL RIGHTS RESERVED."


CONCISE SUMMARY:
Prompt after formatting:
Write a concise summary of the following:


"FOREWORD
Data Science touches every aspect of our lives on a 
daily basis. When we visit the doctor, drive our cars, 
get on an airplane, or shop for services, Data Science 
is changing the way we interact with and explore  
our world.  
Our world is now measured, 
mapped, and recorded in digital 
bits. Entire lives, from birth to 
death, are now catalogued in 
the digital realm. These data, 
originating from such diverse 
sources as connected vehicles, 
underwater microscopic cameras, 
and photos we post to social 
media, have propelled us into 
the greatest age of discovery 
humanit

'Here is a concise summary of the text:\n\nThe "Field Guide to Data Science" is a comprehensive guide published by Booz Allen Hamilton in 2015. The guide aims to help organizations understand and apply data science concepts, covering its core concepts, principles, and applications. The guide is divided into four main sections, including an introduction to data science, a practitioner\'s guide, navigating the trenches of data science, and case studies. The guide emphasizes the importance of humility, openness to learning, and collaboration in data science. It also highlights the need for data science to be a team sport that evolves rapidly, and encourages contributions from the community. The guide concludes with a look at the future of data science and parting thoughts.'

The map-reduce chain produced a good summary of the first 20 chunks of this large data science book.
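
The `map_reduce` chain type summarizes each chunk separately and then summarizes the combined partial summaries; the verbose log above shows the per-chunk prompts of the map step. Below is a toy sketch of that control flow. `fake_summarize` is a placeholder standing in for an LLM call (it just keeps the first few words), and `map_reduce_summary` is a hypothetical name, not LangChain's implementation.

```python
def fake_summarize(text: str, max_words: int = 5) -> str:
    """Placeholder for an LLM call: keep only the first max_words words."""
    return " ".join(text.split()[:max_words])


def map_reduce_summary(chunks, summarize=fake_summarize):
    """Summarize every chunk (map), then summarize the joined partial summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    combined = "\n".join(partial_summaries)
    return summarize(combined)  # reduce step


chunks = [
    "Data Science touches every aspect of our lives on a daily basis.",
    "Our world is now measured, mapped, and recorded in digital bits.",
]
print(map_reduce_summary(chunks))
```

The sketch makes one call per chunk plus one final call. The real chain may also collapse intermediate summaries in several rounds when they are too long for the model; that part is omitted here.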

### Fetching Relevant Results

In [115]:
query = "What are the good characteristics for a data science teams"

In [116]:
relevantsearch = vembeddings.similarity_search(query)

In [117]:
relevantsearch

[Document(page_content='AN INTRODUCTION TO DATA SCIENCE\nIf you haven’t heard of Data Science, you’re behind the \ntimes. Just renaming your Business Intelligence group \nthe Data Science group is not the solution. START HERE    for   THE BASICS', metadata={'page': 19.0, 'source': 'C:\\Users\\USER\\Downloads\\field-guide-to-data-science.pdf'}),
 Document(page_content='DATA SCIENCEtoTHE\nFIEL D GUIDE\n    \n \nSECOND  \nEDITION\n© COPYRIGHT 2015 BOOZ ALLEN HAMILTON INC. ALL RIGHTS RESERVED.', metadata={'page': 1.0, 'source': 'C:\\Users\\USER\\Downloads\\field-guide-to-data-science.pdf'}),
 Document(page_content='DATA SCIENCEtoTHE\nFIEL D GUIDE\n    \n \nSECOND  \nEDITION\n© COPYRIGHT 2015 BOOZ ALLEN HAMILTON INC. ALL RIGHTS RESERVED.', metadata={'page': 1.0, 'source': 'C:\\Users\\USER\\Downloads\\field-guide-to-data-science.pdf'}),
 Document(page_content='What do We Mean by  \nData Science?\nDescribing Data Science is like trying to describe a sunset – it \nshould be easy, but somehow c

Because I embedded only the first 15 chunks, the results are not very relevant to the query (the title page even appears twice). Once the full document is embedded, the similarity search should return passages that match the query more closely.
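
Similarity search ranks stored vectors by how close they are to the query vector. Here is a self-contained sketch using cosine similarity on made-up 3-dimensional vectors. The vectors and labels are invented for illustration, `top_k` is a hypothetical helper, and the metric Pinecone actually applies depends on how the index was created.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k labels whose vectors are most similar to query_vec."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Invented vectors: real embeddings have far more dimensions.
stored = [
    ("title page", [1.0, 0.0, 0.0]),
    ("team characteristics", [0.0, 1.0, 0.2]),
    ("introduction", [0.5, 0.5, 0.0]),
]
print(top_k([0.0, 1.0, 0.0], stored))
# ['team characteristics', 'introduction']
```

If the chunk that actually answers the query was never embedded, it cannot appear in this ranking, which matches the weak results seen above.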

### Q&A

In [122]:
queschain = load_qa_chain(groqllm,chain_type="map_reduce",verbose=True) 

In [123]:
queschain.run(input_documents=splitdocs, question=query)



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following portion of a long document to see if any of the text is relevant to answer the question.
Return any relevant text verbatim.
______________________
DATA SCIENCEtoTHE
FIEL D GUIDE
    
 
SECOND  
EDITION
© COPYRIGHT 2015 BOOZ ALLEN HAMILTON INC. ALL RIGHTS RESERVED.
Human: What are the good characteristics for a data science teams
Prompt after formatting:
System: Use the following portion of a long document to see if any of the text is relevant to answer the question.
Return any relevant text verbatim.
______________________
FOREWORD
Data Science touches every aspect of our lives on a 
daily basis. When we visit the doctor, drive our cars, 
get on an airplane, or shop for services, Data Science 
is changing the way we interact with and explore  
our world.  
Our world is now measured, 
mapped, and recorded 

'Based on the provided text, the good characteristics for a data science team are:\n\n1. **Collaboration**: The team should consist of individuals with diverse backgrounds, including computer scientists, mathematicians, and domain experts.\n2. **Ability to produce new insights and analysis paths**: A good data science team should be able to generate new and valuable insights from data.\n3. **Breadth of experience**: A team with experience working with clients across various domains can provide a unique perspective on data science.\n4. **Multidisciplinary team**: The team should consist of individuals with diverse backgrounds, including computer scientists, mathematicians, and domain experts.\n5. **Ability to shift between deductive (hypothesis-based) and inductive (pattern-based) reasoning**: A good data science team should be able to combine the ability to reason deductively and inductively, creating an environment where models of reality are constantly tested, updated, and improved u

The chain gave a relevant answer to the query, although items 1 and 4 of the answer make the same point.
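
The Q&A cell passes all of `splitdocs` to the chain, so every chunk goes through the map step. An alternative is to retrieve a few relevant chunks first and pass only those. The sketch below assumes the vector store exposes `similarity_search(query, k=...)` and the chain exposes `run(input_documents=..., question=...)`, as used in this notebook. `answer_from_retrieved` is a hypothetical helper, and `FakeStore` and `FakeChain` are stand-ins so the snippet runs without any services.

```python
def answer_from_retrieved(store, chain, question, k=4):
    """Retrieve the k most similar chunks and run the QA chain on only those."""
    docs = store.similarity_search(question, k=k)
    return chain.run(input_documents=docs, question=question)


class FakeStore:
    """Stand-in vector store that returns the first k stored texts."""

    def __init__(self, texts):
        self.texts = texts

    def similarity_search(self, query, k=4):
        return self.texts[:k]


class FakeChain:
    """Stand-in chain that reports how many documents it received."""

    def run(self, input_documents, question):
        return f"{len(input_documents)} documents used for: {question}"


store = FakeStore([f"chunk-{i}" for i in range(104)])
print(answer_from_retrieved(store, FakeChain(), "What makes a good team?", k=4))
# 4 documents used for: What makes a good team?
```

In the notebook the call could look like `answer_from_retrieved(vembeddings, queschain, query)`, which is only useful once the full book has been embedded.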

In summary, the 126-page book was passed to the LLM. We loaded the data, split it into chunks for smoother processing, computed vector embeddings, and stored those embeddings in the Pinecone vector database, which runs on AWS. Finally, we performed summarization and question answering based on the user's query.