#### Document Loaders In LangChain

# 1) TextLoader

In [1]:
!pip install langchain==0.0.316
!pip install openai==0.28.1



In [2]:
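# With the langchain==0.0.316 pin above, the Ollama wrapper is imported from langchain.llms
# (langchain_community is a separate package that this pin does not install)
from langchain.llms import Ollama
llm = Ollama(model="llama2")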

In [3]:
from langchain.document_loaders import TextLoader

loader = TextLoader("nvda_news_1.txt")
loader.load()



In [4]:
type(loader)

langchain.document_loaders.text.TextLoader

In [5]:
loader.file_path

'nvda_news_1.txt'

# 2) CSV LOADER

In [6]:
from langchain.document_loaders.csv_loader import CSVLoader

In [None]:
loader = CSVLoader(file_path="Datasets/movies.csv")  # forward slash avoids the invalid "\m" escape and works on every OS
data = loader.load()
data

[Document(page_content='movie_id: 101\ntitle: Jailer\nindustry: Kollywood\nrelease_year: 2023\nimdb_rating: 7.1\nstudio: Sun Pictures\nlanguage_id: 1\nbudget: 200\nrevenue: 600\nunit: Millions\ncurrency: INR', metadata={'source': 'movies.csv', 'row': 0}),
 Document(page_content='movie_id: 102\ntitle: Doctor Strange in the Multiverse of Madness\nindustry: Hollywood\nrelease_year: 2022\nimdb_rating: 7\nstudio: Marvel Studios\nlanguage_id: 5\nbudget: 200\nrevenue: 954.8\nunit: Millions\ncurrency: USD', metadata={'source': 'movies.csv', 'row': 1}),
 Document(page_content='movie_id: 103\ntitle: Thor: The Dark World\nindustry: Hollywood\nrelease_year: 2013\nimdb_rating: 6.8\nstudio: Marvel Studios\nlanguage_id: 5\nbudget: 165\nrevenue: 644.8\nunit: Millions\ncurrency: USD', metadata={'source': 'movies.csv', 'row': 2}),
 Document(page_content='movie_id: 104\ntitle: Pushpa:The Rise\nindustry: Tollywood\nrelease_year: 2021\nimdb_rating: 7.6\nstudio: Mythri Movie Makers\nlanguage_id: 5\nbudget: 

In [8]:
data[0]

Document(page_content='movie_id: 101\ntitle: Jailer\nindustry: Kollywood\nrelease_year: 2023\nimdb_rating: 7.1\nstudio: Sun Pictures\nlanguage_id: 1\nbudget: 200\nrevenue: 600\nunit: Millions\ncurrency: INR', metadata={'source': 'movies.csv', 'row': 0})

In [None]:
loader = CSVLoader(file_path="Datasets/movies.csv", source_column="title")  # use the title column as each document's source
data = loader.load()
data

[Document(page_content='movie_id: 101\ntitle: Jailer\nindustry: Kollywood\nrelease_year: 2023\nimdb_rating: 7.1\nstudio: Sun Pictures\nlanguage_id: 1\nbudget: 200\nrevenue: 600\nunit: Millions\ncurrency: INR', metadata={'source': 'Jailer', 'row': 0}),
 Document(page_content='movie_id: 102\ntitle: Doctor Strange in the Multiverse of Madness\nindustry: Hollywood\nrelease_year: 2022\nimdb_rating: 7\nstudio: Marvel Studios\nlanguage_id: 5\nbudget: 200\nrevenue: 954.8\nunit: Millions\ncurrency: USD', metadata={'source': 'Doctor Strange in the Multiverse of Madness', 'row': 1}),
 Document(page_content='movie_id: 103\ntitle: Thor: The Dark World\nindustry: Hollywood\nrelease_year: 2013\nimdb_rating: 6.8\nstudio: Marvel Studios\nlanguage_id: 5\nbudget: 165\nrevenue: 644.8\nunit: Millions\ncurrency: USD', metadata={'source': 'Thor: The Dark World', 'row': 2}),
 Document(page_content='movie_id: 104\ntitle: Pushpa:The Rise\nindustry: Tollywood\nrelease_year: 2021\nimdb_rating: 7.6\nstudio: Mythri

In [10]:
data[0].page_content

'movie_id: 101\ntitle: Jailer\nindustry: Kollywood\nrelease_year: 2023\nimdb_rating: 7.1\nstudio: Sun Pictures\nlanguage_id: 1\nbudget: 200\nrevenue: 600\nunit: Millions\ncurrency: INR'

In [11]:
data[0].metadata

{'source': 'Jailer', 'row': 0}
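
The outputs above suggest how each CSV row becomes one document: every column is written as a `column: value` line, and the metadata holds the source and the row number. Here is a minimal standard-library sketch of that idea. It is an approximation for illustration, not LangChain's implementation; `rows_to_documents` and the three-column sample are made up for this example.

```python
import csv
import io

def rows_to_documents(csv_text, source="movies.csv", source_column=None):
    """Turn each CSV row into a dict with 'page_content' and 'metadata' keys."""
    documents = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # One "column: value" line per field, as in the outputs shown above
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # With source_column set, that column's value replaces the default source
        src = row[source_column] if source_column else source
        documents.append({"page_content": content, "metadata": {"source": src, "row": i}})
    return documents

sample = (
    "movie_id,title,industry\n"
    "101,Jailer,Kollywood\n"
    "102,Doctor Strange in the Multiverse of Madness,Hollywood\n"
)
docs_default = rows_to_documents(sample)
docs_by_title = rows_to_documents(sample, source_column="title")
print(docs_default[0]["metadata"])   # {'source': 'movies.csv', 'row': 0}
print(docs_by_title[0]["metadata"])  # {'source': 'Jailer', 'row': 0}
```

Setting the source to a meaningful column such as the title makes it easier to trace an answer back to the row it came from.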

# 3) UnstructuredURLLoader
LangChain's UnstructuredURLLoader internally uses the unstructured Python library to load content from URLs.

https://sj-langchain.readthedocs.io/en/latest/document_loaders/langchain.document_loaders.unstructured.UnstructuredFileLoader.html

https://unstructured-io.github.io/unstructured/core/partition.html

In [12]:
# Install the necessary libraries; libmagic is used for file type detection
!pip3 install unstructured libmagic python-magic python-magic-bin



In [13]:
from langchain.document_loaders import UnstructuredURLLoader

In [14]:
loader = UnstructuredURLLoader(
    urls = [
        "https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html",
        "https://www.moneycontrol.com/news/business/technology/facebook-metaverse-strategy-raise-questions-11782361.html",
        "https://www.moneycontrol.com/news/business/markets/market-corrects-post-rbi-ups-inflation-forecast-icrr-bet-on-these-top-10-rate-sensitive-stocks-ideas-11142611.html"
    ]
)

In [15]:
data = loader.load()
len(data)

3

In [16]:
data[0].page_content[10:100]

'indi\n\nGujarati\n\nSpecials\n\nMoneycontrol Trending Stock\n\nInfosys\xa0INE009A01021, INFY, 500209\n'

In [17]:
data[0].metadata

{'source': 'https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html'}

# Text Splitters
Why do we need text splitters in the first place?

LLMs have token limits. Hence we need to split text, which can be large, into smaller chunks so that each chunk stays under the token limit. LangChain provides various text splitter classes that allow us to do this.

In [18]:
text = """On 22 July 2019, ISRO launched Chandrayaan-2 on board a Launch Vehicle Mark-3 (LVM3) launch vehicle consisting of an orbiter, a lander and a rover. The lander was scheduled to touch down on the lunar surface on 6 September 2019 to deploy the Pragyan rover. 
The lander lost contact with mission control, deviated from its intended trajectory while attempting to land near the lunar south pole, and crashed.

The lunar south pole region holds particular interest for scientific exploration. Studies show large amounts of ice there. The ice could contain solid-state compounds that would normally melt under warmer conditions elsewhere on the Moon—compounds which could provide insight into lunar, Earth, and Solar System history. 
Mountains and craters create unpredictable lighting that protect the ice from melting, but they also make landing there a challenging undertaking for scientific probes. 
For future crewed missions and outposts, the ice could also be a source of oxygen, of drinking water as well as of fuel due to its hydrogen content.

The European Space Tracking network (ESTRACK), operated by the European Space Agency (ESA), and Deep Space Network operated by Jet Propulsion Laboratory (JPL) of NASA are supporting the mission. 
Under a new cross-support arrangement, ESA tracking support could be provided for upcoming ISRO missions such as those of India's first human spaceflight programme, Gaganyaan, and the Aditya-L1 solar research mission. 
In return, future ESA missions will receive similar support from ISRO's own tracking stations.

For the first time on the lunar surface, a laser beam from NASA's Lunar Reconnaissance Orbiter was broadcast on 12 December 2023, and it was reflected back by a tiny NASA retroreflector on board the Vikram lander. The purpose of the experiment was to determine the retroreflector's surface location from the moon's orbit. The Chandrayaan-3 lander's Laser Retroreflector Array (LRA) instrument began acting as a location marker close to the lunar south pole. 
Through multinational cooperation, the LRA was housed on the Vikram lander. On a hemispherical support framework, it consists of eight corner-cube retroreflectors. 
This array enables any orbiting spacecraft equipped with appropriate instruments to use lasers ranging from different directions. 
The 20 gram passive optical instrument is intended to survive for several decades on the lunar surface."""

# Manual approach of splitting the text into chunks

In [19]:
# Say the LLM limit is 100 (counted in characters here for simplicity); in that case we can do a simple thing such as this
text[0:100]

'On 22 July 2019, ISRO launched Chandrayaan-2 on board a Launch Vehicle Mark-3 (LVM3) launch vehicle '

In [20]:
words = text.split(" ")
len(words)

366

In [21]:
chunks = []

s = ""
for word in words:
    s += word + " "
    if len(s)>200:
        chunks.append(s)
        s = ""
        
chunks.append(s)

In [22]:
chunks[:3]

['On 22 July 2019, ISRO launched Chandrayaan-2 on board a Launch Vehicle Mark-3 (LVM3) launch vehicle consisting of an orbiter, a lander and a rover. The lander was scheduled to touch down on the lunar surface ',
 'on 6 September 2019 to deploy the Pragyan rover. \nThe lander lost contact with mission control, deviated from its intended trajectory while attempting to land near the lunar south pole, and crashed.\n\nThe ',
 'lunar south pole region holds particular interest for scientific exploration. Studies show large amounts of ice there. The ice could contain solid-state compounds that would normally melt under warmer ']

Splitting data into chunks can be done in native Python, but it is a tedious process. You may also need to experiment with various delimiters iteratively to ensure that each chunk does not exceed the token limit of the respective LLM.

LangChain provides a better way through text splitter classes.
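
To see why the manual route gets fiddly: the loop above appends a chunk only after it has already grown past 200 characters, so every chunk overshoots the limit. A stricter variant, sketched below under the assumption that the limit is counted in characters, checks the length before adding each word. `chunk_words` and the sample sentence are illustrative names, not part of LangChain.

```python
def chunk_words(text, limit):
    """Greedily pack space-separated words into chunks of at most `limit` characters."""
    chunks, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit or not current:
            # Keep growing; a single word longer than `limit` is kept whole
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

sample_text = "The lander lost contact with mission control and crashed near the lunar south pole."
word_chunks = chunk_words(sample_text, limit=30)
print(word_chunks)
print([len(c) for c in word_chunks])
```

Even this small helper has to handle edge cases such as overlong words, which is part of what the splitter classes take care of.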

# Using Text Splitter Classes from Langchain

# CharacterTextSplitter

In [23]:
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size=200,
    chunk_overlap=0
)

In [24]:
chunks = splitter.split_text(text)
len(chunks)

Created a chunk of size 257, which is longer than the specified 200
Created a chunk of size 321, which is longer than the specified 200
Created a chunk of size 218, which is longer than the specified 200
Created a chunk of size 458, which is longer than the specified 200


12

In [25]:
for chunk in chunks:
    print(len(chunk))

256
148
320
168
148
194
217
94
457
163
129
103


As you can see, although we gave 200 as the chunk size, the split was based on \n, so it ended up creating chunks that are bigger than 200.
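
Why does a single separator produce oversized chunks? A rough sketch of the split-then-merge idea makes it visible. This is a simplified approximation and not LangChain's code; `split_on_separator` is a hypothetical helper. Pieces are merged while they fit, but a piece with no separator inside it cannot be broken any further.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then merge neighbouring pieces while they fit."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A piece longer than chunk_size contains no separator,
            # so it is emitted as-is even though it is oversized
            current = piece
    if current:
        chunks.append(current)
    return chunks

demo = "short line\n" + "x" * 30 + "\nanother short line"
line_chunks = split_on_separator(demo, "\n", chunk_size=20)
print([len(c) for c in line_chunks])  # [10, 30, 18]
```

The middle chunk has 30 characters even though the requested size is 20, which mirrors the warnings printed above.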

Another class from LangChain can be used to recursively split the text based on a list of separators. This class is RecursiveCharacterTextSplitter.

# RecursiveCharacterTextSplitter

In [26]:
text

"On 22 July 2019, ISRO launched Chandrayaan-2 on board a Launch Vehicle Mark-3 (LVM3) launch vehicle consisting of an orbiter, a lander and a rover. The lander was scheduled to touch down on the lunar surface on 6 September 2019 to deploy the Pragyan rover. \nThe lander lost contact with mission control, deviated from its intended trajectory while attempting to land near the lunar south pole, and crashed.\n\nThe lunar south pole region holds particular interest for scientific exploration. Studies show large amounts of ice there. The ice could contain solid-state compounds that would normally melt under warmer conditions elsewhere on the Moon—compounds which could provide insight into lunar, Earth, and Solar System history. \nMountains and craters create unpredictable lighting that protect the ice from melting, but they also make landing there a challenging undertaking for scientific probes. \nFor future crewed missions and outposts, the ice could also be a source of oxygen, of drinking

In [27]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

r_splitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n", "\n", " "],  # List of separators based on requirement (defaults to ["\n\n", "\n", " ", ""])
    chunk_size = 200,  # maximum size of each chunk
    chunk_overlap = 0,  # size of overlap between consecutive chunks, used to maintain context
    length_function = len  # Function to calculate size; "len" counts characters, but you can pass any token counter
)

In [28]:
chunks = r_splitter.split_text(text)

for chunk in chunks:
    print(len(chunk))

199
56
148
197
122
168
148
194
199
17
94
198
199
58
163
129
103


In [29]:
#Let's understand how exactly it formed these chunks
first_split = text.split("\n\n")[0]
first_split

'On 22 July 2019, ISRO launched Chandrayaan-2 on board a Launch Vehicle Mark-3 (LVM3) launch vehicle consisting of an orbiter, a lander and a rover. The lander was scheduled to touch down on the lunar surface on 6 September 2019 to deploy the Pragyan rover. \nThe lander lost contact with mission control, deviated from its intended trajectory while attempting to land near the lunar south pole, and crashed.'

In [30]:
len(first_split)

406

The recursive text splitter uses a list of separators, here separators = ["\n\n", "\n", " "]

It first splits using \n\n. If a resulting piece is larger than the chunk_size parameter (200 in our case), it splits that piece again using the next separator, which is \n.

In [31]:
second_split = first_split.split("\n")
second_split

['On 22 July 2019, ISRO launched Chandrayaan-2 on board a Launch Vehicle Mark-3 (LVM3) launch vehicle consisting of an orbiter, a lander and a rover. The lander was scheduled to touch down on the lunar surface on 6 September 2019 to deploy the Pragyan rover. ',
 'The lander lost contact with mission control, deviated from its intended trajectory while attempting to land near the lunar south pole, and crashed.']

In [32]:
for split in second_split:
    print(len(split))

257
148


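The first piece (257 characters) exceeds the chunk size of 200, so the splitter splits it further using the third separator, which is ' ' (space). The second piece (148 characters) already fits and is kept as is.

In [33]:
second_split[0]

'On 22 July 2019, ISRO launched Chandrayaan-2 on board a Launch Vehicle Mark-3 (LVM3) launch vehicle consisting of an orbiter, a lander and a rover. The lander was scheduled to touch down on the lunar surface on 6 September 2019 to deploy the Pragyan rover. '

When you split this using space (i.e. second_split[0].split(" ")), it separates out each word and then merges the words back together so that each chunk is as close to 200 characters as possible without exceeding it. That is where the first two chunk lengths, 199 and 56, come from.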

In [34]:
chunks = r_splitter.split_text(text)

for chunk in chunks:
    print(len(chunk))

199
56
148
197
122
168
148
194
199
17
94
198
199
58
163
129
103
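
The recursive idea walked through above can be sketched in plain Python. This is a simplified illustration, not the RecursiveCharacterTextSplitter implementation: it ignores overlap and whitespace stripping, and `recursive_split` is a hypothetical name. Pieces that fit are merged, and pieces that are too large are split again with the next separator.

```python
def recursive_split(text, separators, chunk_size):
    """Split with the first separator; re-split oversized pieces with the rest."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        if len(piece) > chunk_size:
            # Flush what we have, then recurse into the oversized piece
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

demo_text = "alpha beta gamma delta\nepsilon\n\nzeta eta theta iota kappa lambda"
pieces = recursive_split(demo_text, ["\n\n", "\n", " "], chunk_size=16)
print(pieces)
print([len(p) for p in pieces])
```

Unlike the single-separator approach, no chunk here exceeds the requested size, because there is always a finer separator to fall back on.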


# FAISS (Semantic Search)
Facebook AI Similarity Search

In [35]:
!pip install faiss-cpu
!pip install sentence-transformers



In [36]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)

In [None]:
df = pd.read_csv("Datasets/sample_text.csv")  # forward slash avoids the invalid "\s" escape
df.shape

(8, 2)

In [38]:
df

Unnamed: 0,text,category
0,Meditation and yoga can improve mental health,Health
1,"Fruits, whole grains and vegetables helps control blood pressure",Health
2,These are the latest fashion trends for this week,Fashion
3,Vibrant color jeans for male are becoming a trend,Fashion
4,The concert starts at 7 PM tonight,Event
5,Navaratri dandiya program at Expo center in Mumbai this october,Event
6,Exciting vacation destinations for your next trip,Travel
7,Maldives and Srilanka are gaining popularity in terms of low budget vacation places,Travel


# Step 1 : Create source embeddings for the text column

In [39]:
from sentence_transformers import SentenceTransformer

In [40]:
encoder = SentenceTransformer("all-mpnet-base-v2")
vectors = encoder.encode(df.text)

In [41]:
vectors.shape

(8, 768)

In [42]:
dim = vectors.shape[1]
dim

768

# Step 2 : Build a FAISS Index for vectors

In [43]:
import faiss
index = faiss.IndexFlatL2(dim)

# Step 3 : Add the source vectors to the index (the cell below adds them as-is, without an explicit normalization call)

In [44]:
index.add(vectors)
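
IndexFlatL2 performs an exact search: it compares the query against every stored vector and reports squared Euclidean distances. A tiny pure-Python sketch of that brute-force search follows; `flat_l2_search` and the 2-dimensional toy vectors are made up for illustration and say nothing about how FAISS is implemented internally.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared_distances, indices) of the k stored vectors nearest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance; no square root is taken
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

stored = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
toy_distances, toy_indices = flat_l2_search(stored, [1.0, 1.0], k=2)
print(toy_distances, toy_indices)  # [1.0, 2.0] [1, 0]
```

The return shape mirrors index.search: distances first, then the positions of the matching rows.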

In [45]:
index

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x000002C8141EF7E0> >

# Step 4 : Encode the search text using the same encoder and normalize the output vector

In [46]:
#search_query = "I want to buy a laptop"
#search_query = "looking for adventurous places to visit during the holidays"
search_query = "An apple a day keeps the doctor away"
vec = encoder.encode(search_query)
vec.shape

(768,)

In [47]:
import numpy as np
svec = np.array(vec).reshape(1,-1)
svec.shape

(1, 768)

In [48]:
svec

array([[ 1.77042987e-02,  7.70549327e-02, -1.25192860e-02,
        -1.39330281e-03,  5.92104625e-03, -1.42679466e-02,
        -4.72583883e-02,  3.91951529e-03,  4.41600755e-02,
         3.09807751e-02,  9.15983319e-02,  2.02603173e-02,
        -2.16004848e-02,  1.33208251e-02, -4.14275751e-02,
         4.16395329e-02,  4.37019058e-02,  6.06421428e-03,
         2.21781358e-02, -1.94715448e-02, -3.44190598e-02,
         3.79219651e-02,  5.27903205e-03,  2.22011227e-02,
        -1.11636510e-02, -2.53351014e-02,  7.79388323e-02,
         3.47348116e-02,  1.20056737e-02, -6.52654245e-02,
        -4.83940803e-02, -3.03445589e-02, -6.99562803e-02,
        -2.53461506e-02,  1.78119126e-06,  1.50222108e-02,
         2.15191785e-02,  2.21124627e-02, -8.38034824e-02,
         3.70961474e-03, -2.21118201e-02, -8.53062198e-02,
         6.44917018e-04,  1.62597895e-02, -1.88165400e-02,
         3.99217904e-02, -3.13138999e-02,  2.47449577e-02,
        -7.98757188e-03,  6.11144528e-02, -1.47846481e-0

In [49]:
faiss.normalize_L2(svec)
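
faiss.normalize_L2 rescales each row to unit length in place. The reason normalization matters with L2 distance: when both vectors have unit length, the squared L2 distance equals 2 - 2 * cosine similarity, so ranking by L2 distance gives the same order as ranking by cosine similarity. A small check of that identity, using only the standard library and made-up 2-dimensional vectors:

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = normalize([3.0, 4.0])
v = normalize([1.0, 0.0])
# Both values should agree up to floating point error
print(squared_l2(u, v), 2 - 2 * cosine_similarity(u, v))
```

The identity only holds when both the stored vectors and the query have unit length.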

# Step 5: Search for similar vector in the FAISS index created

In [50]:
distances, I = index.search(svec, k=2)
distances

array([[1.3433161, 1.7125274]], dtype=float32)

In [51]:
I

array([[1, 0]], dtype=int64)

In [52]:
I.tolist()

[[1, 0]]

In [53]:
row_indices = I.tolist()[0]
row_indices

[1, 0]

In [54]:
df.loc[row_indices]

Unnamed: 0,text,category
1,"Fruits, whole grains and vegetables helps control blood pressure",Health
0,Meditation and yoga can improve mental health,Health


In [55]:
search_query

'An apple a day keeps the doctor away'

# OPENAI_API

In [56]:
!pip install streamlit



In [57]:
import os
import streamlit as st
import pickle
import time
import langchain
from langchain import OpenAI
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chains.qa_with_sources.loading import load_qa_with_sources_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredURLLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

In [None]:
# Load the OpenAI API key (replace the placeholder string with your own key)
os.environ['OPENAI_API_KEY'] = 'OPENAI_API_KEY'

In [None]:
import openai

# Ensure the API key is set in the OpenAI module (read from the environment variable)
openai.api_key = os.getenv('OPENAI_API_KEY')
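
Since the key is a placeholder in this notebook, it can help to fail early when it has not been replaced. The helper below is a hypothetical convenience and not part of the openai or langchain packages. The demo uses a made-up variable name, DEMO_API_KEY, so it does not touch a real key.

```python
import os

def looks_like_real_key(value, placeholder="OPENAI_API_KEY"):
    """True when a value is non-empty and is not the literal placeholder text."""
    return bool(value) and value != placeholder

def require_api_key(name="OPENAI_API_KEY"):
    """Return the key stored in the environment, or raise if it is missing or a placeholder."""
    value = os.environ.get(name, "")
    if not looks_like_real_key(value):
        raise RuntimeError(f"Set the {name} environment variable to a real key before running.")
    return value

# Demo with a made-up variable and a fake value
os.environ["DEMO_API_KEY"] = "sk-demo-not-a-real-key"
print(require_api_key("DEMO_API_KEY")[:7])  # sk-demo
```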


In [None]:
from langchain.llms import OpenAI

# Initialise LLM with required params and API key
llm = OpenAI(temperature=0.9, max_tokens=500, openai_api_key=os.getenv('OPENAI_API_KEY'))


In [61]:
llm = OpenAI(temperature=0.9, max_tokens=500) 

# 1) Load data

In [62]:
loaders = UnstructuredURLLoader(urls=[
    "https://www.moneycontrol.com/news/business/markets/wall-street-rises-as-tesla-soars-on-ai-optimism-11351111.html",
    "https://www.moneycontrol.com/news/business/tata-motors-launches-punch-icng-price-starts-at-rs-7-1-lakh-11098751.html",
])
data = loaders.load() 
len(data)

2

# 2) Split data to create chunks

In [63]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

# As data is a list of Document objects, we can use split_documents instead of split_text to get the chunks.
docs = text_splitter.split_documents(data)

In [64]:
len(docs)

34
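
The chunk_overlap setting means consecutive chunks share some text, so context that straddles a boundary is not lost. A minimal sketch of the idea with fixed-size character windows follows. Note the assumption: RecursiveCharacterTextSplitter builds its overlap from separator-delimited pieces rather than fixed windows, so this only illustrates the concept. `windows_with_overlap` is a hypothetical helper.

```python
def windows_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]

overlap_chunks = windows_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(overlap_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap preserves more context per chunk but also produces more chunks to embed.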

In [65]:
docs[0]

Document(page_content='English\n\nHindi\n\nGujarati\n\nSpecials\n\nMoneycontrol Trending Stock\n\nInfosys\xa0INE009A01021, INFY, 500209\n\nState Bank of India\xa0INE062A01020, SBIN, 500112\n\nYes Bank\xa0INE528G01027, YESBANK, 532648\n\nBank Nifty\n\nNifty 500\n\nQuotes\n\nMutual Funds\n\nCommodities\n\nFutures & Options\n\nCurrency\n\nNews\n\nCryptocurrency\n\nForum\n\nNotices\n\nVideos\n\nGlossary\n\nAll\n\nHello, Login Hello, LoginLog-inor Sign-UpMy AccountMy Profile My PortfolioMy WatchlistFREE Credit Score₹100 Cash RewardMy AlertsMy MessagesPrice AlertsMy Profile My PROMy PortfolioMy WatchlistFREE Credit Score₹100 Cash RewardMy AlertsMy MessagesPrice AlertsLogoutChat with UsDownload AppFollow us on:\n\nGo Ad-Free\n\nMy Alerts', metadata={'source': 'https://www.moneycontrol.com/news/business/markets/wall-street-rises-as-tesla-soars-on-ai-optimism-11351111.html'})

# 3) Create embeddings for these chunks and save them to FAISS index

In [66]:
!pip install tiktoken



In [67]:
!pip install langchain==0.0.316
!pip install openai==0.28.1



In [68]:
# Create the embeddings of the chunks using OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

# Pass the documents and embeddings in order to create the FAISS vector index
vectorindex_openai = FAISS.from_documents(docs, embeddings)

In [69]:
!pip install dill



In [70]:
# Persist the FAISS vector store with its own save_local / load_local methods.
# Pickling the store object directly can fail because it may hold a thread lock.
index_dir = "faiss_index"  # folder name chosen for illustration
vectorindex_openai.save_local(index_dir)

In [71]:
if os.path.exists(index_dir):
    # Reload the saved index; the same embeddings object is needed to embed future queries
    vectorindex_openai = FAISS.load_local(index_dir, embeddings)
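
Objects that hold a threading lock cannot be pickled directly. A common workaround drops the lock in `__getstate__` and recreates it in `__setstate__`. Here is a self-contained sketch of that pattern using a made-up `LockedStore` class. It is shown only to illustrate the mechanism and makes no claim about how the FAISS vector store is structured.

```python
import pickle
import threading

class LockedStore:
    """Hypothetical container holding data plus a lock that must not be pickled."""

    def __init__(self, items):
        self.items = list(items)
        self.lock = threading.RLock()

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("lock", None)  # Leave the lock out of the serialized state
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.lock = threading.RLock()  # Recreate a fresh lock after loading

store = LockedStore(["chunk one", "chunk two"])
restored = pickle.loads(pickle.dumps(store))
print(restored.items)  # ['chunk one', 'chunk two']
```

The key point is that the object being dumped must be the one that actually holds the data, with only the lock left out.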

# 4) Retrieve similar embeddings for a given question and call LLM to retrieve final answer

In [72]:
!pip install faiss-cpu langchain openai



In [None]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document

embeddings = OpenAIEmbeddings()

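# Note: each Document below contains only the URL string as its page_content, not the article text,
# so the index and the LLM see just the URLs. To index the article text instead, use the chunks
# created earlier with text_splitter.split_documents(data).
doc_urls = [
    "https://www.moneycontrol.com/news/business/markets/wall-street-rises-as-tesla-soars-on-ai-optimism-11351111.html",
    "https://www.moneycontrol.com/news/business/tata-motors-launches-punch-icng-price-starts-at-rs-7-1-lakh-11098751.html",
]

# RetrievalQAWithSourcesChain expects a "source" entry in each document's metadata
docs = [Document(page_content=url, metadata={"source": url}) for url in doc_urls]

# Create the FAISS index from the documents
vectorindex_openai = FAISS.from_documents(docs, embeddings)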

# Initialize the RetrievalQAWithSourcesChain with the FAISS vector store
chain = RetrievalQAWithSourcesChain.from_llm(llm=llm, retriever=vectorindex_openai.as_retriever())
chain


In [74]:
from langchain.docstore.document import Document

documents = [
    Document(page_content="https://www.moneycontrol.com/news/business/tata-motors-launches-punch-icng-price-starts-at-rs-7-1-lakh-11098751.html", metadata={"source": "https://www.moneycontrol.com/news/business/tata-motors-launches-punch-icng-price-starts-at-rs-7-1-lakh-11098751.html"}),
    Document(page_content="https://www.moneycontrol.com/news/business/markets/wall-street-rises-as-tesla-soars-on-ai-optimism-11351111.html", metadata={"source": "https://www.moneycontrol.com/news/business/markets/wall-street-rises-as-tesla-soars-on-ai-optimism-11351111.html"})
]

# Create a FAISS index from these documents
from langchain.vectorstores.faiss import FAISS
faiss_index = FAISS.from_documents(documents, embeddings)


In [75]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import ChatOpenAI
from langchain.vectorstores import FAISS

vector_index = faiss_index.as_retriever()

llm = OpenAI(model_name="gpt-3.5-turbo-instruct")

# Initialize the RetrievalQAWithSourcesChain with the FAISS vector store
chain = RetrievalQAWithSourcesChain.from_llm(llm=llm, retriever=vector_index)
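
The debug trace further down shows a MapReduceDocumentsChain at work: one LLM call per retrieved document to pull out relevant text, then a final call that combines those extracts into an answer with sources. The flow can be sketched offline with stand-in functions. Everything here (`map_reduce_answer`, `toy_extract`, `toy_combine`, the toy documents) is hypothetical and only mimics the shape of the process, not LangChain's prompts or code.

```python
def map_reduce_answer(question, documents, extract, combine):
    """Run `extract` on each document, then `combine` on the collected extracts."""
    extracts = []
    for doc in documents:
        # Map step: one call per retrieved document
        extracts.append({"source": doc["source"], "text": extract(question, doc["content"])})
    # Reduce step: a single call over all per-document extracts
    return combine(question, extracts)

# Stand-ins for LLM calls so the flow runs without an API key
def toy_extract(question, content):
    keyword = question.split()[-1].strip("?").lower()
    return content if keyword in content.lower() else "No relevant text found."

def toy_combine(question, extracts):
    relevant = [e for e in extracts if e["text"] != "No relevant text found."]
    return {
        "answer": " ".join(e["text"] for e in relevant),
        "sources": ", ".join(e["source"] for e in relevant),
    }

toy_docs = [
    {"source": "doc-a", "content": "Alpha costs 10 units."},
    {"source": "doc-b", "content": "Beta was released in spring."},
]
toy_result = map_reduce_answer("what is the price of alpha?", toy_docs, toy_extract, toy_combine)
print(toy_result)  # {'answer': 'Alpha costs 10 units.', 'sources': 'doc-a'}
```

Because the map step runs once per document, the number of LLM calls grows with the number of retrieved chunks.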


In [79]:
query = "what is the price of Tiago iCNG?"
result = chain({"question": query}, return_only_outputs=True)
print(result)

DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/embeddings
DEBUG:openai:api_version=None data='{"input": [[12840, 374, 279, 3430, 315, 23126, 6438, 602, 34, 6269, 30]], "model": "text-embedding-ada-002", "encoding_format": "base64"}' message='Post details'


[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQAWithSourcesChain] Entering Chain run with input:
[0m{
  "question": "what is the price of Tiago iCNG?"
}


DEBUG:urllib3.connectionpool:https://api.openai.com:443 "POST /v1/embeddings HTTP/1.1" 200 None
DEBUG:openai:message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=24 request_id=req_b09ea6fe54597eea71ab1070a8bb031f response_code=200
DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/completions
DEBUG:openai:api_version=None data='{"prompt": ["Use the following portion of a long document to see if any of the text is relevant to answer the question. \\nReturn any relevant text verbatim.\\nhttps://www.moneycontrol.com/news/business/tata-motors-launches-punch-icng-price-starts-at-rs-7-1-lakh-11098751.html\\nQuestion: what is the price of Tiago iCNG?\\nRelevant text, if any:", "Use the following portion of a long document to see if any of the text is relevant to answer the question. \\nReturn any relevant text verbatim.\\nhttps://www.moneycontrol.com/news/business/markets/wall-street-rises-as-tesla-soars-on-ai-optimism-1135

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "input_list": [
    {
      "context": "https://www.moneycontrol.com/news/business/tata-motors-launches-punch-icng-price-starts-at-rs-7-1-lakh-11098751.html",
      "question": "what is the price of Tiago iCNG?"
    },
    {
      "context": "https://www.moneycontrol.com/news/business/markets/wall-street-rises-as-tesla-soars-on-ai-optimism-11351111.html",
      "question": "what is the price of Tiago iCNG?"
    }
  ]
}
[32;1m[1;3m[llm/start][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain > 4:chain:LLMChain > 5:llm:OpenAI] Entering LLM run with input:
[0m{
  "prompts": [
    "Use the following portion of a long document to see if any of the t

DEBUG:urllib3.connectionpool:https://api.openai.com:443 "POST /v1/completions HTTP/1.1" 200 None
DEBUG:openai:message='OpenAI API response' path=https://api.openai.com/v1/completions processing_ms=382 request_id=req_64a93644813d11859ff7483a2cb65d64 response_code=200
DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/completions


[36;1m[1;3m[llm/end][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain > 4:chain:LLMChain > 5:llm:OpenAI] [921ms] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "\nThe price of Tata Motors' Tiago iCNG starts at Rs. 7.1 lakh.",
        "generation_info": {
          "finish_reason": "stop",
          "logprobs": null
        }
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "prompt_tokens": 172,
      "completion_tokens": 27,
      "total_tokens": 199
    },
    "model_name": "gpt-3.5-turbo-instruct"
  },
  "run": null
}
[36;1m[1;3m[llm/end][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain > 4:chain:LLMChain > 6:llm:OpenAI] [921ms] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": " No relevant text found.",
        "generation_info": {
          "finish_reason": "stop",
          "logprobs": null
        }
      }
    ]
  ],
  "llm_output

DEBUG:urllib3.connectionpool:https://api.openai.com:443 "POST /v1/completions HTTP/1.1" 200 None
DEBUG:openai:message='OpenAI API response' path=https://api.openai.com/v1/completions processing_ms=1073 request_id=req_f08911903df49ad9254108db378aac6e response_code=200


[36;1m[1;3m[llm/end][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain > 7:chain:LLMChain > 8:llm:OpenAI] [1.53s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": " The price of Tata Motors' Tiago iCNG starts at Rs. 7.1 lakh.\nSOURCES: https://www.moneycontrol.com/news/business/tata-motors-launches-punch-icng-price-starts-at-rs-7-1-lakh-11098751.html",
        "generation_info": {
          "finish_reason": "stop",
          "logprobs": null
        }
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "prompt_tokens": 1460,
      "completion_tokens": 61,
      "total_tokens": 1521
    },
    "model_name": "gpt-3.5-turbo-instruct"
  },
  "run": null
}
[36;1m[1;3m[chain/end][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain > 7:chain:LLMChain] [1.53s] Exiting Chain run with output:
[0m{
  "text": " The price of Tata Motors' Tiago iCNG starts at Rs. 7.1 lakh.\nSOURCES: https://ww

In [81]:
import logging
logging.basicConfig(level=logging.DEBUG)
langchain.debug = True

result = chain({"question": query}, return_only_outputs=True)
print(result)


DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/embeddings
DEBUG:openai:api_version=None data='{"input": [[12840, 374, 279, 3430, 315, 23126, 6438, 602, 34, 6269, 30]], "model": "text-embedding-ada-002", "encoding_format": "base64"}' message='Post details'


[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQAWithSourcesChain] Entering Chain run with input:
[0m{
  "question": "what is the price of Tiago iCNG?"
}


DEBUG:urllib3.connectionpool:https://api.openai.com:443 "POST /v1/embeddings HTTP/1.1" 200 None
DEBUG:openai:message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=24 request_id=req_bd233d209af25e543234bb59b228e03f response_code=200
DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/completions
DEBUG:openai:api_version=None data='{"prompt": ["Use the following portion of a long document to see if any of the text is relevant to answer the question. \\nReturn any relevant text verbatim.\\nhttps://www.moneycontrol.com/news/business/tata-motors-launches-punch-icng-price-starts-at-rs-7-1-lakh-11098751.html\\nQuestion: what is the price of Tiago iCNG?\\nRelevant text, if any:", "Use the following portion of a long document to see if any of the text is relevant to answer the question. \\nReturn any relevant text verbatim.\\nhttps://www.moneycontrol.com/news/business/markets/wall-street-rises-as-tesla-soars-on-ai-optimism-1135

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "input_list": [
    {
      "context": "https://www.moneycontrol.com/news/business/tata-motors-launches-punch-icng-price-starts-at-rs-7-1-lakh-11098751.html",
      "question": "what is the price of Tiago iCNG?"
    },
    {
      "context": "https://www.moneycontrol.com/news/business/markets/wall-street-rises-as-tesla-soars-on-ai-optimism-11351111.html",
      "question": "what is the price of Tiago iCNG?"
    }
  ]
}
[32;1m[1;3m[llm/start][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain > 4:chain:LLMChain > 5:llm:OpenAI] Entering LLM run with input:
[0m{
  "prompts": [
    "Use the following portion of a long document to see if any of the t

DEBUG:urllib3.connectionpool:https://api.openai.com:443 "POST /v1/completions HTTP/1.1" 200 None
DEBUG:openai:message='OpenAI API response' path=https://api.openai.com/v1/completions processing_ms=678 request_id=req_a4aa8020621f6a61d3a4f90c124d42ce response_code=200
DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/completions


[36;1m[1;3m[llm/end][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain > 4:chain:LLMChain > 5:llm:OpenAI] [1.15s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "\nTata Motors has launched the Tiago iCNG with prices starting at Rs 7.1 lakh (ex-showroom Delhi).",
        "generation_info": {
          "finish_reason": "stop",
          "logprobs": null
        }
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "prompt_tokens": 172,
      "completion_tokens": 64,
      "total_tokens": 236
    },
    "model_name": "gpt-3.5-turbo-instruct"
  },
  "run": null
}
[36;1m[1;3m[llm/end][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain > 4:chain:LLMChain > 6:llm:OpenAI] [1.15s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": " This article does not contain any information about the price of Tiago iCNG. It primarily discusses the rise of Wall Street

DEBUG:urllib3.connectionpool:https://api.openai.com:443 "POST /v1/completions HTTP/1.1" 200 None
DEBUG:openai:message='OpenAI API response' path=https://api.openai.com/v1/completions processing_ms=822 request_id=req_e51948fce1bf944ab1f47cc9d44afbae response_code=200


[36;1m[1;3m[llm/end][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain > 7:chain:LLMChain > 8:llm:OpenAI] [1.29s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": " The price of Tiago iCNG is Rs 7.1 lakh (ex-showroom Delhi).\nSOURCES: https://www.moneycontrol.com/news/business/tata-motors-launches-punch-icng-price-starts-at-rs-7-1-lakh-11098751.html",
        "generation_info": {
          "finish_reason": "stop",
          "logprobs": null
        }
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "prompt_tokens": 1497,
      "completion_tokens": 61,
      "total_tokens": 1558
    },
    "model_name": "gpt-3.5-turbo-instruct"
  },
  "run": null
}
[36;1m[1;3m[chain/end][0m [1m[1:chain:RetrievalQAWithSourcesChain > 3:chain:MapReduceDocumentsChain > 7:chain:LLMChain] [1.29s] Exiting Chain run with output:
[0m{
  "text": " The price of Tiago iCNG is Rs 7.1 lakh (ex-showroom Delhi).\nSOURCES: https://www.