## Performing Q&A on a Huge Book - Pinecone Vector Database

In [4]:
from langchain_groq import ChatGroq # Inference
from langchain.document_loaders import UnstructuredPDFLoader, PDFMinerLoader, PyPDFLoader # For Loading the book
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OllamaEmbeddings
from dotenv import load_dotenv
from langchain_pinecone import PineconeVectorStore # Vector DB
from langchain.chains.summarize import load_summarize_chain # To Perform Summarization
from langchain.chains.question_answering import load_qa_chain # To Perform Q&A

In [2]:
load_dotenv()  # Load variables from a local .env file into the environment

True

In [5]:
os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY")  # Re-export the Pinecone API key; raises TypeError if the variable is unset
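Because `os.getenv` returns `None` for a missing variable, the assignment above fails with a fairly cryptic `TypeError` when the key is absent. A minimal sketch of a guard that fails with a clearer message is shown below. The helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and not part of python-dotenv or Pinecone.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of any library used here.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value


# Demo with a placeholder variable so the sketch runs without a real key.
os.environ["DEMO_API_KEY"] = "placeholder-value"
demo_value = require_env("DEMO_API_KEY")
print(demo_value)
```

In the notebook, the same call could be used with `"PINECONE_API_KEY"` right after `load_dotenv()`.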

### Loading the Book

In [13]:
# large_data = PyPDFLoader("C:\\Users\\USER\\Downloads\\field-guide-to-data-science.pdf") # Loading the book
large_data = PyPDFLoader("C:\\Users\\USER\\Downloads\\CASE _1.pdf")

In [14]:
loader = large_data.load()

In [15]:
loader

[Document(page_content=' \n5-311-500 \nRevised April 5, 2012 \n©2012 by the Kellogg School of Management at  Northwestern University. This case was prepared by Sachin Waikar under the \nsupervision of Professor Kent Grayson, and it was written with assistance from Mike Daudet at Sony Corporation. To protect \ncompany confidentiality, some information has been disguised or fictionalized. Cases are develope d solely as the basis for clas s \ndiscussion. Cases are not intended to serve as endorsements, sour ces of primary data, or illustra tions of effective or ineffect ive \nmanagement. To order copies or request permission to reproduce mate rials, call 847.491.5400 or e-mail \ncases@kellogg.northwestern.edu. No part of this publication may be  reproduced, stored in a retrieval system, used in a spreadsh eet, or \ntransmitted in any form or by any means—el ectronic, mechanical, photocopying, recording,  or otherwise—without the permission of  \nthe Kellogg School of Management. \nKENT GR

In [16]:
len(loader)

16

### Performing Chunking

In [17]:
chunkdata = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)  # Chunks of at most 500 characters, no overlap

In [18]:
splitdocs = chunkdata.split_documents(loader)

In [19]:
len(splitdocs)  # Number of chunks after splitting

82

In [20]:
len(splitdocs[0].page_content)  # Number of characters in the first chunk

410

In [21]:
splitdocs[5]

Document(page_content='a manufacturer of telecommunications and measurement equipment. Over the following decades, Sony (the company’s official name as of 1958) became a major player in multiple electronics \nmarkets, including audio, video, communications , and information technology products for home \nand professional use. By 2010 the company’s motion picture, television, computer entertainment, music, and online businesses were operated by  more than 171,000 employees worldwide,', metadata={'source': 'C:\\Users\\USER\\Downloads\\CASE _1.pdf', 'page': 0})
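To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size character chunking. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line boundaries (one reason the first chunk above has 410 characters rather than 500). The function name `split_text` and the sample text are assumptions made for illustration.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters of placeholder text
no_overlap = split_text(sample, chunk_size=20, chunk_overlap=0)
with_overlap = split_text(sample, chunk_size=20, chunk_overlap=5)
print([len(c) for c in no_overlap])
print([len(c) for c in with_overlap])
```

With an overlap, the end of one chunk is repeated at the start of the next, which can help when a sentence would otherwise be cut at a chunk boundary.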

### Initializing the Pinecone Vector Database

In [22]:
index_name = "langchain2"  # Name of an index that was already created in Pinecone

### Vector Embeddings

In [24]:
vembeddings = PineconeVectorStore.from_documents(splitdocs[:15],OllamaEmbeddings(model='mxbai-embed-large'),index_name=index_name) # Ollama Embeddings

The first 15 chunks are embedded with the Ollama model `mxbai-embed-large` and stored in the Pinecone index. Only a slice is embedded here because the full document produces many chunks.
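Instead of embedding only a slice, the full list of chunks could be processed in small batches. The sketch below shows only the batching logic, with plain strings standing in for `splitdocs`. The helper name `batched` and the batch size of 15 are assumptions, and the actual upload call is left as a comment because it depends on the installed library version.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunks = [f"chunk-{i}" for i in range(82)]  # stand-in for the 82 split documents
batch_sizes = []
for batch in batched(chunks, 15):
    # An embedding and upload call for this batch would go here.
    batch_sizes.append(len(batch))
print(batch_sizes)
```

Batching keeps each request small and makes it easier to resume if one upload fails partway through.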

### Instantiating the Model

In [25]:
groqllm = ChatGroq(model='llama3-8b-8192',temperature=0) # Groq Inference

## Summarization

In [26]:
chain = load_summarize_chain(groqllm,chain_type="map_reduce",verbose=True)  # Summarization

In [27]:
chain.run(splitdocs[:20])

> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Write a concise summary of the following:


"5-311-500 
Revised April 5, 2012 
©2012 by the Kellogg School of Management at  Northwestern University. This case was prepared by Sachin Waikar under the 
supervision of Professor Kent Grayson, and it was written with assistance from Mike Daudet at Sony Corporation. To protect 
company confidentiality, some information has been disguised or fictionalized. Cases are develope d solely as the basis for clas s"


CONCISE SUMMARY:
Prompt after formatting:
Write a concise summary of the following:


"discussion. Cases are not intended to serve as endorsements, sour ces of primary data, or illustra tions of effective or ineffect ive 
management. To order copies or request permission to reproduce mate rials, call 847.491.5400 or e-mail 
cases@kellogg.northwestern.edu. No part of this publication

"Here is a concise summary:\n\nSony's VAIO product manager, Richard Lopez, has one week to finalize a marketing plan for VAIO laptops in China. The company is struggling to decide which market segment to target and how to position the brand. VAIO's unique appeal in China lies in its intangible benefits, such as its statement computer status and association with high-quality Japanese electronics brands, which cannot be replicated by knockoff products."

The chain produced a reasonable summary of the first 20 chunks of the case study.
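The `map_reduce` chain type can be pictured as two stages: summarize every chunk on its own, then summarize the combined partial summaries. The sketch below illustrates that idea with a toy function in place of the LLM. It is not LangChain's implementation, and the names `map_reduce_summarize` and `first_sentence` as well as the sample chunks are assumptions for illustration.

```python
from typing import Callable, Sequence


def map_reduce_summarize(chunks: Sequence[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    if not chunks:
        return ""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    return summarize("\n".join(partial_summaries))  # reduce step


def first_sentence(text: str) -> str:
    """Toy stand-in for an LLM call: keep only the text up to the first period."""
    head, _, _ = text.partition(".")
    return head.strip() + "."


sample_chunks = [
    "Alpha is the first topic. More detail about alpha follows.",
    "Beta is the second topic. More detail about beta follows.",
]
result = map_reduce_summarize(sample_chunks, first_sentence)
print(result)
```

Each chunk costs one model call in the map step, so the number of requests grows with the number of chunks passed to the chain.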

### Fetching Relevant Results

In [31]:
# query = "What are the good characteristics for a data science teams"
query = "Compare the consumer interview results with the CLUES report data. Why there is so much redundancy between the two studies? How do the clues segments relate to the three identities from the interviews? How are the four key consumer goals identified in the interviews related to CLUE segments?"

In [116]:
relevantsearch = vembeddings.similarity_search(query)

In [117]:
relevantsearch

[Document(page_content='AN INTRODUCTION TO DATA SCIENCE\nIf you haven’t heard of Data Science, you’re behind the \ntimes. Just renaming your Business Intelligence group \nthe Data Science group is not the solution. START HERE    for   THE BASICS', metadata={'page': 19.0, 'source': 'C:\\Users\\USER\\Downloads\\field-guide-to-data-science.pdf'}),
 Document(page_content='DATA SCIENCEtoTHE\nFIEL D GUIDE\n    \n \nSECOND  \nEDITION\n© COPYRIGHT 2015 BOOZ ALLEN HAMILTON INC. ALL RIGHTS RESERVED.', metadata={'page': 1.0, 'source': 'C:\\Users\\USER\\Downloads\\field-guide-to-data-science.pdf'}),
 Document(page_content='DATA SCIENCEtoTHE\nFIEL D GUIDE\n    \n \nSECOND  \nEDITION\n© COPYRIGHT 2015 BOOZ ALLEN HAMILTON INC. ALL RIGHTS RESERVED.', metadata={'page': 1.0, 'source': 'C:\\Users\\USER\\Downloads\\field-guide-to-data-science.pdf'}),
 Document(page_content='What do We Mean by  \nData Science?\nDescribing Data Science is like trying to describe a sunset – it \nshould be easy, but somehow c

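The returned chunks come from `field-guide-to-data-science.pdf` rather than from the case study loaded above, which suggests that the index still contains vectors from an earlier run with that book. In addition, only the first 15 chunks of the current document were embedded, so most of it is not searchable yet. Once the full document is embedded, the search should return passages that are more relevant to the query.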
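A similarity search ranks the stored vectors by how close they are to the query vector, so it can only return what is actually in the index. The following sketch shows the idea with cosine similarity over a tiny in-memory list. The two-dimensional vectors, the texts, and the function names are made up for illustration, and the metric a real Pinecone index uses is chosen when the index is created.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], stored: list[tuple[str, list[float]]], k: int = 4) -> list[str]:
    """Return the texts of the k stored vectors most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up two-dimensional "embeddings" for three stored texts.
stored_vectors = [
    ("about laptops", [1.0, 0.0]),
    ("about data science", [0.0, 1.0]),
    ("laptops and data", [1.0, 1.0]),
]
hits = top_k([1.0, 0.1], stored_vectors, k=2)
print(hits)
```

If the stored list holds only vectors from another document, even the best-scoring hits will be off topic, which matches the behavior seen above.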

### Q&A

In [28]:
queschain = load_qa_chain(groqllm,chain_type="map_reduce",verbose=True) 

In [32]:
queschain.run(input_documents=splitdocs, question=query)

> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following portion of a long document to see if any of the text is relevant to answer the question. 
Return any relevant text verbatim.
______________________
5-311-500 
Revised April 5, 2012 
©2012 by the Kellogg School of Management at  Northwestern University. This case was prepared by Sachin Waikar under the 
supervision of Professor Kent Grayson, and it was written with assistance from Mike Daudet at Sony Corporation. To protect 
company confidentiality, some information has been disguised or fictionalized. Cases are develope d solely as the basis for clas s
Human: Compare the consumer interview results with the CLUES report data. Why there is so much redundancy between the two studies? How do the clues segments relate to the three identities from the interviews? How are the four key consumer goals identified in the interviews re

'I apologize, but since the provided text does not contain any information about consumer interview results, CLUES report data, or the four key consumer goals identified in the interviews, I do not have enough information to provide a direct answer to the questions.\n\nHowever, based on the provided text, it appears that the CLUES report data is a general overview of consumer segments, while the consumer interview results are more specific and detailed information about individual consumers. The redundancy between the two studies may be due to the fact that both studies are exploring the same general topic (consumer behavior and technology use) and are using similar methods (interviews and surveys) to gather data.\n\nWithout more information, it is difficult to make a direct connection between the CLUES segments and the three identities from the interviews, or between the four key consumer goals identified in the interviews and the CLUES segments.\n\nHowever, based on the CLUES report 

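The chain returned an answer for the query. In the visible part of the output, the model notes that the text it was given lacks some of the requested information, so the answer stays fairly general.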
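The Q&A call above sends every chunk through the model. One possible alternative is to retrieve only the most relevant chunks first and pass just those to the answering step. The sketch below shows that control flow with toy stand-ins: a keyword-overlap retriever instead of a vector search and a placeholder answer function instead of the LLM chain. All names and sample texts here are hypothetical.

```python
from typing import Callable, Sequence


def keyword_retriever(corpus: Sequence[str]) -> Callable[[str, int], list[str]]:
    """Build a toy retriever that ranks chunks by words shared with the question."""
    def retrieve(question: str, k: int) -> list[str]:
        question_words = set(question.lower().split())
        ranked = sorted(
            corpus,
            key=lambda chunk: len(question_words & set(chunk.lower().split())),
            reverse=True,
        )
        return ranked[:k]
    return retrieve


def answer_from_retrieved(question, retrieve, answer, k=4):
    """Fetch the k most relevant chunks, then answer using only those chunks."""
    context = retrieve(question, k)
    return answer(context, question)


corpus = [
    "segments in the report",
    "copyright notice",
    "goals from the interviews",
]
retrieve = keyword_retriever(corpus)
retrieved = retrieve("how do report segments relate to interviews goals", 2)
reply = answer_from_retrieved(
    "how do report segments relate to interviews goals",
    retrieve,
    lambda context, question: f"{len(context)} chunks used",  # placeholder for an LLM call
    k=2,
)
print(retrieved, reply)
```

This keeps boilerplate chunks such as copyright notices out of the prompt and reduces the number of model calls.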

To recap, this workflow is aimed at a large book of 126 pages (the run shown above uses a 16-page case study). We load the document, split it into chunks for smoother processing, compute vector embeddings, and store them in the Pinecone vector database, which runs in the AWS cloud. Finally, we perform summarization and question answering based on the user query.