# LangChain: Q&A over Documents

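Question answering over documents lets an LLM answer questions about content it was not trained on. A typical example is a tool that lets you query a product catalog for items of interest; this notebook applies the same approach to a PDF guide for new MS BAIS students.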

In [1]:
#pip install --upgrade langchain

import os
import openai
import tiktoken
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file where you can have your own API keys

openai.api_key = os.environ['OPENAI_API_KEY']

In [2]:
# pip install pypdf

In [3]:
from langchain.chains import RetrievalQA  # question answering over retrieved documents
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader  # document loader for CSV files
from langchain.document_loaders import PyPDFLoader  # document loader for PDF files
from langchain.vectorstores import DocArrayInMemorySearch  # in-memory vector store
from IPython.display import display, Markdown

In [4]:
file = "./Data/MS_BAIS_New_Students.pdf"
loader = PyPDFLoader(file_path=file)

In [5]:
# loader = PyPDFLoader("./Data/MS_BAIS_New_Students.pdf")

In [6]:
# pip install --upgrade sqlalchemy

In [7]:
# pip install langchain sqlalchemy

In [8]:
from langchain.indexes import VectorstoreIndexCreator

In [9]:
# pip install docarray

In [10]:
# pip install --upgrade pip setuptools wheel


In [11]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [12]:
query = "Information about clear your immunization and insurance holds"

In [13]:
response1 = index.query(query)

In [14]:
display(Markdown(response1))

 Student Health Service information can be found at http://www.usf.edu/student-affairs/student-health-services. Immunization information is at http://www.usf.edu/student-affairs/student-health-services/immunizations, there are phone numbers and an email address available at this site.

In [15]:
loader = PyPDFLoader(file_path=file)

In [16]:
docs1 = loader.load()

In [17]:
len(docs1)

7

In [18]:
docs1[1]  # look at an individual document (one per PDF page)

Document(page_content='2 Review the BAIS  Program : The MS BAIS  Program curriculum includes  6 required courses and 6 electives.  Plan included below.  Some of \nthese courses have prerequisites.  See https://www.usf.edu/business/graduate/masters/bais/coursework.aspx  for more on the \ncoursework.  \nMS BAIS  Program Technical Core  Classes  - (required ): \n• ISM 6 124 – Advanced Systems Analysis and Design  (requires a systems analysis or software engineering prerequisite)  \n• ISM 6 218 – Advanced Database Administration  (requires a database prerequisite)  \n• ISM 6 225 – Distributed Information Systems  (should have completed  the Object -Orien ted programming prerequisite)  \n• QMB 6 304 – Analytical Methods for Business  \n• ISM 6 155 – Enterprise Information Systems Management  \n• ISM 6945 – BAIS Internship  - required prerequisites: Analytical Methods for Business (QMB 6304), Advanced Database (ISM \n6218) and  Data Mining (ISM 6136)  \nMS BAIS  Program Electives  – (17 cred

In [19]:
from langchain.embeddings import OpenAIEmbeddings  # embed directly; the document is not large enough to require chunking first
embeddings = OpenAIEmbeddings()

In [20]:
embed = embeddings.embed_query("Information about clear your immunization and insurance holds")  # see what an embedding looks like

In [21]:
print(len(embed))  # the embedding vector has 1536 elements

1536


In [22]:
print(embed[:5])  # each element has its own numerical value

[-0.007898134799456781, 0.007864324066040481, 0.002782604780276943, -0.04316925459835935, -0.03216055653898886]
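
The numbers in an embedding are only useful when two vectors are compared. A common comparison is cosine similarity; the sketch below uses made-up 3-dimensional vectors (the names `query_vec`, `similar_vec`, and `unrelated_vec` are hypothetical) rather than real 1536-dimensional embeddings, and it is not necessarily the exact metric the vector store applies.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones in this notebook have 1536 values.
query_vec = [1.0, 0.0, 1.0]
similar_vec = [2.0, 0.0, 2.0]    # points in the same direction as the query
unrelated_vec = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query_vec, similar_vec))    # close to 1: same direction
print(cosine_similarity(query_vec, unrelated_vec))  # 0: nothing in common
```

Texts with similar meaning should produce vectors with a score near 1, which is what makes semantic search over the document possible.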


In [23]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
docs = text_splitter.split_documents(docs1)
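
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-window chunker. It is an illustration only: `chunk_text` is a hypothetical helper, and LangChain's `CharacterTextSplitter` works differently (it splits on a separator and then merges the pieces up to the size limit).

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghij"
no_overlap = chunk_text(sample, chunk_size=4)
with_overlap = chunk_text(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With no overlap, a sentence that straddles a boundary is cut in two; a non-zero overlap repeats some text so that each chunk keeps a little of its neighbor's context.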

In [24]:
db = DocArrayInMemorySearch.from_documents(
    docs,
    embeddings
)  # embed every chunk of text and store the vectors in the vector store

In [26]:
query = "Information about clear your immunization and insurance holds"

In [27]:
docs = db.similarity_search(query)

In [28]:
len(docs)

4
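
The search returns four documents, which is consistent with `similarity_search` taking a `k` argument that defaults to 4. Conceptually, the store scores every stored vector against the query vector and keeps the best `k`. The sketch below shows that idea with a hypothetical `top_k` helper and made-up 2-dimensional vectors; the real store's internals and scoring may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, store, k=4):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return scored[:k]

# Hypothetical chunks with toy 2-dimensional vectors (not real embeddings).
toy_store = [
    ("immunization holds", [0.9, 0.1]),
    ("course prerequisites", [0.1, 0.9]),
    ("insurance holds", [0.8, 0.3]),
]
toy_query = [1.0, 0.0]

results = top_k(toy_query, toy_store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

Passing a different `k` to `db.similarity_search(query, k=...)` changes how many chunks come back in the same way.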

In [29]:
docs[2]

Document(page_content='3 ▪ QMB 6304 – Analytical Methods for Business (if you have met the statistics prerequisite ) \n▪ ISM 6136 – Data Mining  \n▪ ISM 6218 – Advanced Database Administration (if you have met the database prerequisite ) \no Additional Courses to Consider:  \n▪ ISM 61 24 – Systems Analysis and Design ( if you have met the systems analysis prerequisite )  \n▪ ISM 6155 – Enterprise Information Systems Management  \n▪ ISM 6225 – Distributed Information Systems ( if you have met the object -oriented programming  prerequisite ) \n• Related to the above, s ome classes may fill up before you can register.  If this happens, please take advantage of the \n‘waitlist’ option.  This helps us know how many studen ts need to enroll and we can often adjust class sizes or suggest \nalternative classes.  \n• Your  last semester must include at least 2 or more credits .  International students may have to comply with policies \nregarding on -campus vs. online courses taken in the p rogr

In [30]:
retriever = db.as_retriever()

In [31]:
llm = ChatOpenAI(temperature = 0.0)


In [32]:
qdocs = "".join(doc.page_content for doc in docs)  # combine the documents' text into a single string


In [33]:
qdocs

"USF MS BAIS  (on-campus) Program \nPlanning / Course Selection  \nFor Incoming Students  \n \n \nGetting started  - Fall 2022  \nWelcome to USF’s MS Business Analytics and Information Systems  program  (BAIS) .  This document is intended to help you get ready to \njoin us.  Please contact Prof. Barbara Warner ( bwarner@usf.edu ) with any questions.  \nGetting Started  - New Student Checklist:  \n1) Review our New Student PDF at http://www.usf.edu/business/graduate/masters/bais/new -student.aspx  \n2) Register for our department’s orientation via http://usfweb.usf.edu/business/forms/bais -orientation -register.aspx  by July 15 . It \nis tentatively scheduled for Thursday, August 18 , at 8:30 am.  At this time, orientation is planned to be on -campus. Registering for \norientation  lets us know of your intent to attend so we can send you notifications relating to your start at USF. NOTE :  if you are \nplanning to attend USF in the upcoming semester, but cannot attend orientation, pleas

In [34]:
response_qdocs = llm.call_as_llm(f"{qdocs} Question: Information about prerequisite")


In [35]:
display(Markdown(response_qdocs))

The document provides information and instructions for incoming students of the MS Business Analytics and Information Systems (BAIS) program at USF. It includes a checklist of tasks to complete before starting the program, such as reviewing the new student PDF, registering for orientation, completing the academic integrity course, and clearing immunization and insurance holds. It also provides information on course selection and prerequisites, including a list of recommended pre-start courses and elective options. The document also mentions funding opportunities and provides contact information for various departments and services at USF.

In [36]:
response_docs = llm.call_as_llm(f"{response1} Question: Information about clear your immunization and insurance holds") 

In [44]:
display(Markdown(response_docs))

To find information about clearing your immunization and insurance holds at USF Student Health Services, you can visit their website at http://www.usf.edu/student-affairs/student-health-services/immunizations. This page should provide you with all the necessary information and instructions on how to clear these holds. Additionally, you can find phone numbers and an email address on the website to contact the Student Health Services directly for any specific questions or concerns you may have.

In [None]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
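
The `"stuff"` chain type places all retrieved chunks into a single prompt and makes one LLM call. The sketch below illustrates only that idea: `build_stuff_prompt`, the prompt wording, and the sample chunks are hypothetical and do not reproduce LangChain's actual prompt template.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate every retrieved chunk into one prompt for a single LLM call."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved chunks, standing in for the retriever's results.
retrieved_chunks = [
    "Chunk A: text about clearing immunization holds.",
    "Chunk B: text about clearing insurance holds.",
]

prompt = build_stuff_prompt(retrieved_chunks, "How do I clear my holds?")
print(prompt)
```

This approach is simple and needs only one call, but the combined text has to fit in the model's context window, so it suits a short document like this one better than a large corpus.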

In [None]:
query = "Please list all the required MS BAIS technical core classes in a table \
in markdown and summarize each one."

In [None]:
response = qa_stuff.run(query)

In [None]:
display(Markdown(response))

In [None]:
response = index.query(query, llm=llm)

In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])