This notebook builds a question-answering chain that uses content scraped from a specified website as its context data.

# Setting up

Install dependencies

In [3]:
# %pip install langchain==0.0.189
%pip install pinecone-client
# %pip install openai
# %pip install tiktoken
# %pip install nest_asyncio

Note: you may need to restart the kernel to use updated packages.


Set up OpenAI API key

In [1]:
import os
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"  # placeholder; never commit a real key

Set up Pinecone API keys

In [2]:
import pinecone

# initialize pinecone
pinecone.init(
    api_key="<your-pinecone-api-key>",  # find at app.pinecone.io; never commit a real key
    environment="gcp-starter"  # next to api key in console
)


**Load data from Web**

`SitemapLoader` extends `WebBaseLoader`: it loads a sitemap from a given URL, then scrapes and loads every page listed in the sitemap, returning each page as a document.

Scraping is done concurrently, with requests throttled to a default of 2 per second. The cell below does not scrape; it loads text that was already scraped and saved to a local file, using `TextLoader`.

Link to the [documentation](https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/sitemap.html)
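The 2-requests-per-second limit mentioned above can be illustrated without LangChain. This is a minimal sketch of throttled concurrent fetching, not the loader's actual implementation: `fetch_all`, `fake_fetch`, and the example URLs are hypothetical, and the throttle simply staggers the start time of each request.

```python
import asyncio
import time


async def fetch_all(urls, fetch, requests_per_second=2.0):
    """Fetch every URL concurrently, starting at most `requests_per_second` requests each second."""
    interval = 1.0 / requests_per_second

    async def run_one(position, url):
        # Stagger start times: request i begins i * interval seconds after the first.
        await asyncio.sleep(position * interval)
        return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(run_one(i, url) for i, url in enumerate(urls)))


async def fake_fetch(url):
    """Hypothetical stand-in for a real HTTP request."""
    await asyncio.sleep(0.01)
    return f"contents of {url}"


urls = [f"https://example.com/page{i}" for i in range(4)]  # placeholder URLs
started = time.perf_counter()
# A higher rate than the default is used here only to keep the demo short.
pages = asyncio.run(fetch_all(urls, fake_fetch, requests_per_second=20))
elapsed = time.perf_counter() - started
print(len(pages), "pages fetched")
```

Inside a notebook an event loop is already running, so `pages = await fetch_all(...)` would replace the `asyncio.run(...)` call.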

In [3]:
from langchain.document_loaders import TextLoader
loader = TextLoader("scraped data.txt")
docs = loader.load()

In [4]:
docs

In [18]:
len(docs)

788

**Split the text from docs into smaller chunks**

There are many ways to split text. We use the text splitter recommended for generic text. For other ways to split text, see the [documentation](https://python.langchain.com/en/latest/modules/indexes/text_splitters.html)

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

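text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,      # maximum characters per chunk
    chunk_overlap=200,    # characters shared between neighbouring chunks
    length_function=len,  # measure length in characters
)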

docs_chunks = text_splitter.split_documents(docs)
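To see what `chunk_size` and `chunk_overlap` mean, here is a minimal sliding-window sketch. It is a simplification, not the `RecursiveCharacterTextSplitter` algorithm: the real splitter first tries to break on separators such as paragraph and line breaks, which this version ignores. `split_with_overlap` is a hypothetical helper name.

```python
def split_with_overlap(text, chunk_size=1200, chunk_overlap=200):
    """Cut text into fixed-size chunks where each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Small numbers make the overlap easy to see.
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.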

Create embeddings

In [4]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

In [5]:
from langchain.vectorstores import Pinecone

index_name = "ind"

# to create a new index instead, uncomment the next line
# docsearch = Pinecone.from_documents(docs_chunks, embeddings, index_name=index_name)

# if you already have an index, you can load it like this
docsearch = Pinecone.from_existing_index(index_name, embeddings)


The vector store is ready. Let's query `docsearch` with a similarity search.

In [9]:
query = "How to log in to my account"
docs = docsearch.similarity_search(query)
print(docs[0])

page_content='login\nhttps://infinity.icicibank.com/corp/AuthenticationController?\nFORMSGROUP_ID__=AuthenticationFG&__START_TRAN_FLAG__=Y&FG_BUTTONS__=LOAD&ACTION.LOAD=Y&AuthenticationFG.LOGIN_FLAG=1&BANK_ID=ICI&ITM=nli_personalb_personal_login_btn&_gl=1*30xkeg*_ga*MTgzMDcxOTY5Ni4xNjIwMDM5NDU0*_ga_SKB78GHTFV*MTYyODIzNDM4NC43Ny4xLjE2MjgyMzQ1MDQuMjc.&_ga=2.15973366.1179124605.1628150213-1830719696.1620039454' metadata={'page': 174.0, 'source': 'data/outputs.pdf'}
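Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows this with toy 3-dimensional vectors and cosine similarity. These are assumptions for illustration: real embeddings have many more dimensions, the metric actually used depends on how the Pinecone index was configured, and `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, index, k=2):
    """Return the k (score, text) pairs with the highest similarity to the query."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "vector store": (text, embedding) pairs with made-up vectors.
toy_index = [
    ("login page", [1.0, 0.0, 0.0]),
    ("credit cards", [0.0, 1.0, 0.0]),
    ("NRI accounts", [0.0, 0.0, 1.0]),
]

results = top_k([0.9, 0.1, 0.0], toy_index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```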


In [8]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
llm = OpenAI()

# "stuff" puts all retrieved documents into a single prompt
qa_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)

query = "How to get a visa credit card."
result = qa_with_sources({"query": query})
result["result"]

' Customers can apply for an ICICI Bank Credit Card by logging into their internet banking account and selecting from a range of cards including the Bank’s gemstone collection of Coral, Rubyx & Sapphiro; Ferrari Signature & Platinum cards and Unifare cards. They can then confirm their details on a pre-populated personal information page and submit the application by following the given steps.'
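With `chain_type="stuff"`, the retrieved chunks are all placed into one prompt together with the question. A rough sketch of that idea follows. The prompt wording and the `build_stuff_prompt` helper are hypothetical and only approximate what the chain does internally.

```python
def build_stuff_prompt(documents, question):
    """Concatenate all retrieved texts into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder texts standing in for retrieved chunks.
retrieved = ["Text of the first retrieved chunk.", "Text of the second retrieved chunk."]
prompt = build_stuff_prompt(retrieved, "How to get a visa credit card.")
print(prompt)
```

Because every chunk goes into a single prompt, this approach only works while the combined text fits in the model's context window.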

Output source documents that were found for the query
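Because the chain was built with `return_source_documents=True`, the result dictionary also carries the retrieved documents under the `"source_documents"` key. The sketch below formats them for display. To stay runnable on its own it uses a stand-in `Document` class and a fabricated `fake_result`; both are assumptions, as is the `summarize_sources` helper. In the notebook, `summarize_sources(result)` could be called on the real result instead.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in with the two attributes seen in the similarity-search output above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(result, preview_chars=80):
    """Return one line per source document: its origin and a short text preview."""
    lines = []
    for i, doc in enumerate(result["source_documents"], start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page")
        location = source if page is None else f"{source}, page {int(page)}"
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"[{i}] {location}: {preview}")
    return lines


# Fabricated example shaped like the chain's output; not real results.
fake_result = {
    "result": "placeholder answer",
    "source_documents": [
        Document("login\nhttps://example.com", {"page": 174.0, "source": "data/outputs.pdf"}),
        Document("no metadata here"),
    ],
}

source_lines = summarize_sources(fake_result)
for line in source_lines:
    print(line)
```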

In [10]:
from langchain.chains import ConversationalRetrievalChain
from IPython.display import display
import ipywidgets as widgets

# Create a conversation chain that uses our vector store as the retriever; it also handles chat history
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.1), docsearch.as_retriever())
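A conversational retrieval chain first rewrites a follow-up question into a standalone question using the chat history, then retrieves documents for that rewritten question. The sketch below shows how a history of `(question, answer)` tuples might be turned into such a rewriting prompt. The prompt wording and the helper names `format_history` and `build_condense_prompt` are hypothetical, not the chain's actual internals.

```python
def format_history(chat_history):
    """Render (question, answer) tuples as alternating Human/Assistant lines."""
    lines = []
    for question, answer in chat_history:
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer}")
    return "\n".join(lines)


def build_condense_prompt(chat_history, follow_up):
    """Build a prompt asking the model to make the follow-up question self-contained."""
    return (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n\n"
        f"Chat history:\n{format_history(chat_history)}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


# Placeholder history; the answer text is not a real model output.
history = [("what NRI savings accounts are available", "placeholder answer")]
condense_prompt = build_condense_prompt(history, "tell me more about each")
print(condense_prompt)
```

This is why a vague follow-up such as "tell me more about each" can still retrieve relevant documents: the retriever sees the rewritten question, not the raw follow-up.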

In [None]:
chat_history = []

def on_submit(_):
    query = input_box.value
    input_box.value = ""

    if query.lower() == 'exit':
        print("I hope I was able to answer your queries!")
        return

    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result['answer']))

    display(widgets.HTML(f'<b>User:</b> {query}'))
    display(widgets.HTML(f'<b><font color="blue">Chatbot:</font></b> {result["answer"]}'))

print("Welcome to the icici helpbot! Type 'exit' to stop.")

input_box = widgets.Text(placeholder='Please enter your question:')
input_box.on_submit(on_submit)

display(input_box)

Welcome to the icici helpbot! Type 'exit' to stop.



Text(value='', placeholder='Please enter your question:')

HTML(value='<b>User:</b> can you tell me about the board of directors')

HTML(value='<b><font color="blue">Chatbot:</font></b>  The ICICI Bank Board of Directors consists of Mr. Giris…

HTML(value='<b>User:</b> where can i find more information on this')

HTML(value='<b><font color="blue">Chatbot:</font></b>  You can find more information on the ICICI Bank Board o…

HTML(value='<b>User:</b> what NRI savings accounts are available')

HTML(value='<b><font color="blue">Chatbot:</font></b>  ICICI Bank offers three types of NRI savings accounts: …

HTML(value='<b>User:</b> list the above accounts')

HTML(value='<b><font color="blue">Chatbot:</font></b>  The three types of NRI savings accounts offered by ICIC…

HTML(value='<b>User:</b> tell me more about each ')

HTML(value='<b><font color="blue">Chatbot:</font></b>  The three types of NRI savings accounts offered by ICIC…

HTML(value='<b>User:</b> NRI accounts')

HTML(value='<b><font color="blue">Chatbot:</font></b>  The three types of NRI savings accounts offered by ICIC…