In [10]:
from langchain_community.document_loaders import TextLoader

In [11]:
# Data ingestion: load a local text file as a list of LangChain documents

loader = TextLoader("speech.txt")
text_documents = loader.load()
text_documents

[Document(page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness because we act without animus, not 

In [None]:
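import os
from dotenv import load_dotenv

# Read variables from a local .env file into the process environment
load_dotenv()

# Fail early if the key is missing: assigning None to os.environ raises a TypeError
openai_api_key = os.getenv("OPENAI_API_KEY")
if openai_api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set in the environment or the .env file")
os.environ['OPEN_AI_KEY'] = openai_api_key  # name read by the Chroma cell further down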

In [13]:
# Web-based loader
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Load the page, keeping only the post header and post body;
# chunking and indexing happen in later cells
loader = WebBaseLoader(
    web_path=("https://lilianweng.github.io/posts/2020-10-29-odqa/",),
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-header", "post-content"))),
)

webpagedoc = loader.load()

In [14]:
webpagedoc

[Document(page_content='\n\n      How to Build an Open-Domain Question Answering System?\n    \nDate: October 29, 2020  |  Estimated Reading Time: 33 min  |  Author: Lilian Weng\n\n\n\n[Updated on 2020-11-12: add an example on closed-book factual QA using OpenAI API (beta).\nA model that can answer any question with regard to factual knowledge can lead to many useful and practical applications, such as working as a chatbot or an AI assistant🤖. In this post, we will review several common approaches for building such an open-domain question answering system.\nDisclaimers given so many papers in the wild:\n\nAssume we have access to a powerful pretrained language model.\nWe do not cover how to use structured knowledge base (e.g. Freebase, WikiData) here.\nWe only focus on a single-turn QA instead of a multi-turn conversation style QA.\nWe mostly focus on QA models that contain neural networks, specially Transformer-based language models.\nI admit that I missed a lot of papers with archite

In [15]:
#PDF
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('paper.pdf')
docs = loader.load()

In [16]:
docs

[Document(page_content='PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation\nof Large Language Models\nWei Zou∗1, Runpeng Geng∗2, Binghui Wang3, Jinyuan Jia1\n1Pennsylvania State University,2Wuhan University,3Illinois Institute of Technology\n1{weizou, jinyuan}@psu.edu,2kevingeng@whu.edu.cn,3bwang70@iit.edu\nAbstract\nLarge language models (LLMs) have achieved remarkable success due to their exceptional generative capabilities. Despite their\nsuccess, they also have inherent limitations such as a lack of up-to-date knowledge and hallucination. Retrieval-Augmented\nGeneration (RAG) is a state-of-the-art technique to mitigate those limitations. In particular, given a question, RAG retrieves\nrelevant knowledge from a knowledge database to augment the input of the LLM. For instance, the retrieved knowledge could\nbe a set of top- ktexts that are most semantically similar to the given question when the knowledge database contains millions\nof texts collected from Wik

In [17]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the web page into chunks of at most 1000 characters, with up to 200 characters shared between neighbours
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(webpagedoc)
documents[:5]

[Document(page_content='How to Build an Open-Domain Question Answering System?\n    \nDate: October 29, 2020  |  Estimated Reading Time: 33 min  |  Author: Lilian Weng\n\n\n\n[Updated on 2020-11-12: add an example on closed-book factual QA using OpenAI API (beta).\nA model that can answer any question with regard to factual knowledge can lead to many useful and practical applications, such as working as a chatbot or an AI assistant🤖. In this post, we will review several common approaches for building such an open-domain question answering system.\nDisclaimers given so many papers in the wild:', metadata={'source': 'https://lilianweng.github.io/posts/2020-10-29-odqa/'}),
 Document(page_content='Assume we have access to a powerful pretrained language model.\nWe do not cover how to use structured knowledge base (e.g. Freebase, WikiData) here.\nWe only focus on a single-turn QA instead of a multi-turn conversation style QA.\nWe mostly focus on QA models that contain neural networks, specia
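
The effect of chunk_size and chunk_overlap can be illustrated with a much simpler splitter. The sketch below is not how RecursiveCharacterTextSplitter works internally (that class splits on separators such as blank lines and spaces before merging pieces, so its boundaries differ); it only shows the sliding-window idea behind the two parameters. The function name split_with_overlap and the sample string are hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Hypothetical toy input, small enough to check by eye
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

With chunk_size=10 and chunk_overlap=4 the window advances 6 characters at a time, so the last 4 characters of each chunk reappear at the start of the next one. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.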

In [None]:
#vector embedding and storage
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

db = Chroma.from_documents(documents[:15], OpenAIEmbeddings(openai_api_key = os.environ['OPEN_AI_KEY']))

In [None]:
# Query the Chroma vector store; similarity_search returns the closest chunks first

query = "Whats the conclusion in 50 words?"
result = db.similarity_search(query)
result[0].page_content

'Lewis, et al., (2020) (code) found that 58-71% of test-time answers are also present somewhere in the training sets and 28-34% of test-set questions have a near-duplicate paraphrase in their corresponding training sets. In their experiments, several models performed notably worse when duplicated or paraphrased questions were removed from the training set.\nOpen-book QA: Retriever-Reader#\nGiven a factoid question, if a language model has no context or is not big enough to memorize the context which exists in the training dataset, it is unlikely to guess the correct answer. In an open-book exam, students are allowed to refer to external resources like notes and books while answering test questions. Similarly, a ODQA system can be paired with a rich knowledge base to identify relevant documents as evidence of answers.\nWe can decompose the process of finding answers to given questions into two stages,'
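
Conceptually, a similarity search embeds the query, scores it against every stored chunk vector, and returns the best-scoring chunks. The sketch below shows that ranking step with plain Python lists and cosine similarity. It is an illustration under assumptions, not the Chroma implementation: the metric a real store uses depends on its configuration, the toy vectors are made up, and the helper names cosine_similarity and top_k are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Made-up 3-dimensional "embeddings"; real embeddings have hundreds of dimensions or more
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [0.8, 0.6, 0.0]

ranking = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(ranking)
```

The search ranks chunks by vector closeness only; it does not summarize or answer the question, which is why a query asking for a conclusion can return a passage that merely sits near it in embedding space.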

In [None]:
# Same workflow with a FAISS index instead of Chroma

from langchain_community.vectorstores import FAISS
faissdb = FAISS.from_documents(documents[:15], OpenAIEmbeddings())
query = "What is Open-Domain Question Answering?"
res = faissdb.similarity_search(query)
res[0].page_content

'What is Open-Domain Question Answering?#\nOpen-domain Question Answering (ODQA) is a type of language tasks, asking a model to produce answers to factoid questions in natural language. The true answer is objective, so it is simple to evaluate model performance.\nFor example,\nQuestion: What did Albert Einstein win the Nobel Prize for?\nAnswer: The law of the photoelectric effect.\nThe “open-domain” part refers to the lack of the relevant context for any arbitrarily asked factual question. In the above case, the model only takes as the input the question but no article about “why Einstein didn’t win a Nobel Prize for the theory of relativity” is provided, where the term “the law of the photoelectric effect” is likely mentioned. In the case when both the question and the context are provided, the task is known as Reading comprehension (RC).'