<a href="https://colab.research.google.com/github/ad71/ragbot/blob/master/Implementing_RAG_with_LangChain_and_HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### RAG combines pretrained dense retrieval with sequence-to-sequence models. A RAG model retrieves documents, passes them to a seq2seq model, then marginalizes over the retrieved documents to generate outputs.
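
As a toy illustration of the marginalization step, the sketch below weights each document's answer probabilities by that document's retrieval probability, p(y|x) = Σ_z p(z|x) · p(y|x, z). The function name `marginalize` and all numbers are made up for illustration; a real RAG model computes these quantities from its retriever and generator.

```python
def marginalize(retrieval_probs, answer_probs):
    """Return p(y|x) = sum over documents z of p(z|x) * p(y|x, z)."""
    combined = {}
    for p_doc, per_doc in zip(retrieval_probs, answer_probs):
        for answer, p_answer in per_doc.items():
            combined[answer] = combined.get(answer, 0.0) + p_doc * p_answer
    return combined

# Hypothetical retrieval probabilities for two retrieved documents.
retrieval_probs = [0.7, 0.3]
# Hypothetical answer distributions, one per retrieved document.
answer_probs = [
    {"2000": 0.9, "2001": 0.1},
    {"2000": 0.4, "2001": 0.6},
]

marginal = marginalize(retrieval_probs, answer_probs)
best_answer = max(marginal, key=marginal.get)
print(best_answer, round(marginal[best_answer], 2))
```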

In [None]:
# %pip install -q langchain
%pip install -q torch
%pip install -q transformers
%pip install -q sentence-transformers
%pip install -q datasets
%pip install -q faiss-cpu

In [None]:
%pip install langchain==0.1.7



In [None]:
from langchain.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

Use the Hugging Face dataset databricks-dolly-15k. This is an open-source dataset of instruction-following records generated by Databricks, covering brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

In [None]:
dataset_name = 'databricks/databricks-dolly-15k'
page_content_column = 'context'

loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)

data = loader.load()

data[:5]



[Document(page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\'s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."', metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}),
 Document(page_content='""', metadata={'instruction': 'Which is a species of fish? Tope or Rope', 'response': 'Tope', 'category': 'classification'}),
 Document(page_content='""', metadata={'instruction': 'Why can camels survive for long wit
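
The doubled quotes in `page_content` (for example `'""'` for rows with no context) are consistent with the loader JSON-encoding the chosen column and moving the remaining columns into `metadata`. The sketch below mimics that mapping with plain dictionaries so the effect is easy to see. `record_to_document` and `has_context` are hypothetical helpers, not part of LangChain, the records are abbreviated, and the real loader returns `Document` objects.

```python
import json

def record_to_document(record, page_content_column="context"):
    """Mimic the loader: JSON-encode one column, keep the rest as metadata."""
    row = dict(record)  # copy so the caller's record is not modified
    page_content = json.dumps(row.pop(page_content_column))
    return {"page_content": page_content, "metadata": row}

def has_context(document):
    """True when the JSON-encoded page content is not an empty string."""
    return bool(json.loads(document["page_content"]).strip())

# Abbreviated records in the shape of the dataset rows shown above.
records = [
    {"instruction": "When did Virgin Australia start operating?",
     "context": "It commenced services on 31 August 2000 as Virgin Blue.",
     "category": "closed_qa"},
    {"instruction": "Which is a species of fish? Tope or Rope",
     "context": "",
     "category": "classification"},
]

documents = [record_to_document(r) for r in records]
non_empty = [d for d in documents if has_context(d)]
print(documents[1]["page_content"])  # prints ""
print(len(non_empty))                # prints 1
```

Dropping the empty-context rows before embedding is optional, but it may keep documents with no text out of the index.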

In [None]:
# RecursiveCharacterTextSplitter is recommended for generic text, chunks of 1000 characters with a 150 character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)

In [None]:
docs[0]

Document(page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\'s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."', metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'})
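
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks before falling back to characters). The function name `split_with_overlap` is hypothetical and only illustrates how consecutive chunks share text.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=150):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# A synthetic 2500-character string stands in for a long document.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])           # prints [1000, 1000, 800]
print(chunks[0][-150:] == chunks[1][:150])  # prints True
```

Texts shorter than `chunk_size` come back as a single chunk.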

In [None]:
# sentence embeddings
modelPath = 'sentence-transformers/all-MiniLM-l6-v2'

# create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)



In [None]:
text = 'This is a test document.'
query_result = embeddings.embed_query(text)
query_result[:3]

[-0.03833853080868721, 0.12346463650465012, -0.028642989695072174]
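
With `normalize_embeddings` set to `False`, the vectors are returned at whatever length the model produces. The sketch below uses made-up three-dimensional vectors rather than real model output to show what normalization would change: after scaling to unit length, the dot product of two vectors equals their cosine similarity.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two invented vectors pointing in the same direction with different lengths.
a = [3.0, 4.0, 0.0]
b = [6.0, 8.0, 0.0]

raw_score = dot(a, b)                                      # depends on vector length
normalized_score = dot(l2_normalize(a), l2_normalize(b))   # length-independent
print(raw_score)
print(math.isclose(normalized_score, cosine_similarity(a, b)))
```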

In [None]:
db = FAISS.from_documents(docs, embeddings)

In [None]:
question = 'What is cheesemaking?'
searchDocs = db.similarity_search(question)
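
Conceptually, `similarity_search` embeds the question and returns the stored documents whose vectors lie closest to it. The brute-force sketch below shows that idea with hand-written two-dimensional vectors and Euclidean distance. The vectors, texts, and the `top_k` helper are invented for illustration, and FAISS uses optimized index structures rather than this loop.

```python
import heapq
import math

def top_k(query_vector, indexed, k=4):
    """Return the texts of the k entries whose vectors are nearest to query_vector."""
    def distance(item):
        _, vector = item
        return math.dist(query_vector, vector)
    return [text for text, _ in heapq.nsmallest(k, indexed, key=distance)]

# (text, vector) pairs standing in for embedded document chunks.
indexed = [
    ("cheese making", [0.9, 0.1]),
    ("airlines", [0.1, 0.9]),
    ("milk fermentation", [0.8, 0.3]),
    ("presidents", [0.0, 1.0]),
]

query_vector = [1.0, 0.0]  # pretend embedding of "What is cheesemaking?"
nearest = top_k(query_vector, indexed, k=2)
print(nearest)  # prints ['cheese making', 'milk fermentation']
```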

In [None]:
print(searchDocs[0].page_content)

"The goal of cheese making is to control the spoiling of milk into cheese. The milk is traditionally from a cow, goat, sheep or buffalo, although, in theory, cheese could be made from the milk of any mammal. Cow's milk is most commonly used worldwide. The cheesemaker's goal is a consistent product with specific characteristics (appearance, aroma, taste, texture). The process used to make a Camembert will be similar to, but not quite the same as, that used to make Cheddar.\n\nSome cheeses may be deliberately left to ferment from naturally airborne spores and bacteria; this approach generally leads to a less consistent product but one that is valuable in a niche market.\n\nCulturing\nCheese is made by bringing milk (possibly pasteurised) in the cheese vat to a temperature required to promote the growth of the bacteria that feed on lactose and thus ferment the lactose into lactic acid. These bacteria in the milk may be wild, as is the case with unpasteurised milk, added from a culture,


In [None]:
# Intel/dynamic_tinybert is a model fine-tuned for question answering
tokenizer = AutoTokenizer.from_pretrained('Intel/dynamic_tinybert')
model = AutoModelForQuestionAnswering.from_pretrained('Intel/dynamic_tinybert')



In [None]:
# build a question-answering pipeline from the pretrained model and tokenizer,
# then wrap it in a LangChain HuggingFacePipeline with additional model-specific arguments
model_name = 'Intel/dynamic_tinybert'

tokenizer = AutoTokenizer.from_pretrained(model_name, padding=True, truncation=True, max_length=512)

# define a question-answering pipeline using the model and tokenizer
question_answerer = pipeline(
    'question-answering',
    model=model_name,
    tokenizer=tokenizer,
    return_tensors='pt'
)

# wrap the question-answering pipeline in LangChain's HuggingFacePipeline,
# passing additional model-specific arguments (temperature and max_length)
llm = HuggingFacePipeline(
    pipeline=question_answerer,
    model_kwargs={'temperature': 0.7, 'max_length': 512}
)

In [None]:
# a retriever is an interface that returns documents for a query
# it does not store documents itself; it only retrieves them
retriever = db.as_retriever()

In [None]:
docs = retriever.get_relevant_documents(question)
print(docs[0].page_content)
# same result as before

"The goal of cheese making is to control the spoiling of milk into cheese. The milk is traditionally from a cow, goat, sheep or buffalo, although, in theory, cheese could be made from the milk of any mammal. Cow's milk is most commonly used worldwide. The cheesemaker's goal is a consistent product with specific characteristics (appearance, aroma, taste, texture). The process used to make a Camembert will be similar to, but not quite the same as, that used to make Cheddar.\n\nSome cheeses may be deliberately left to ferment from naturally airborne spores and bacteria; this approach generally leads to a less consistent product but one that is valuable in a niche market.\n\nCulturing\nCheese is made by bringing milk (possibly pasteurised) in the cheese vat to a temperature required to promote the growth of the bacteria that feed on lactose and thus ferment the lactose into lactic acid. These bacteria in the milk may be wild, as is the case with unpasteurised milk, added from a culture,


In [None]:
# RetrievalQA chain
# The RetrievalQA chain combines a retrieval step with question answering.
# The default chain type, 'stuff', places all retrieved documents into a single prompt for the language model.
# When the retrieved text does not fit in one prompt, alternatives such as 'map_reduce', 'refine' and 'map_rerank' process the documents over several calls.

retriever = db.as_retriever(search_kwargs={'k': 4})

# create a question-answering chain (qa) with RetrievalQA.from_chain_type,
# configured with the language model, the 'refine' chain type, the retriever created above, and source documents not returned
qa = RetrievalQA.from_chain_type(llm=llm, chain_type='refine', retriever=retriever, return_source_documents=False)
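
The difference between the 'stuff' and 'refine' strategies can be sketched without any model. In the toy code below, `fake_llm` is a stand-in callable that only records the prompts it receives. The prompt wording and helper names are assumptions for illustration, not LangChain's actual templates. 'Stuff' makes one call with all documents, while 'refine' makes one call per document and carries the previous answer forward.

```python
def stuff_chain(llm, question, documents):
    """Put every document into one prompt and call the model once."""
    context = "\n\n".join(documents)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")

def refine_chain(llm, question, documents):
    """Call the model once per document, refining the running answer each time."""
    answer = ""
    for document in documents:
        if not answer:
            prompt = f"Context:\n{document}\n\nQuestion: {question}"
        else:
            prompt = (f"Existing answer: {answer}\n"
                      f"New context:\n{document}\n\n"
                      f"Refine the answer to: {question}")
        answer = llm(prompt)
    return answer

calls = []

def fake_llm(prompt):
    """Stand-in for a language model: records the prompt and returns a label."""
    calls.append(prompt)
    return f"answer after call {len(calls)}"

documents = ["doc one", "doc two", "doc three", "doc four"]

stuff_answer = stuff_chain(fake_llm, "Who is Thomas Jefferson?", documents)
stuff_calls = len(calls)

calls.clear()
refine_answer = refine_chain(fake_llm, "Who is Thomas Jefferson?", documents)
refine_calls = len(calls)

print(stuff_calls, refine_calls)  # prints 1 4
```

With `k=4` retrieved documents, 'refine' therefore costs four model calls per question in this sketch, in exchange for never needing all four documents in a single prompt.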

In [None]:
question = 'Who is Thomas Jefferson?'
# invoke returns a dict with a 'result' key; run would return a plain string
result = qa.invoke({'query': question})
print(result['result'])

  warn_deprecated(


ValueError: Context information is below. 
------------
"Thomas Jefferson (April 13, 1743 \u2013 July 4, 1826) was an American statesman, diplomat, lawyer, architect, philosopher, and Founding Father who served as the third president of the United States from 1801 to 1809. Among the Committee of Five charged by the Second Continental Congress with authoring the Declaration of Independence, Jefferson was the Declaration's primary author. Following the American Revolutionary War and prior to becoming the nation's third president in 1801, Jefferson was the first United States secretary of state under George Washington and then the nation's second vice president under John Adams."
------------
Given the context information and not prior knowledge, answer the question: Who is Thomas Jefferson?
 argument needs to be of type (SquadExample, dict)

Answer the question(s) given as inputs by using the context(s).

create_sample(question: Union[str, List[str]], context: Union[str, List[str]]) → One or a list of SquadExample

Parameters

question (str or List[str]) — The question(s) asked.
context (str or List[str]) — The context(s) in which we will look for the answer.
Returns

One or a list of SquadExample

The corresponding SquadExample grouping question and context.

QuestionAnsweringPipeline leverages the SquadExample internally. This helper method encapsulates all the logic for converting question(s) and context(s) to SquadExample.

We currently support extractive question answering.

In [None]:
qa_verbose = RetrievalQA.from_chain_type(llm=llm, chain_type='refine', retriever=retriever, return_source_documents=True)
print(qa_verbose.invoke({'query': question})['result'])

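>Running the query retrieves the appropriate context, but the call then fails with a ValueError saying the argument needs to be of type (SquadExample, dict), even though the query was passed to the chain as a dictionary. As the error text shows, what reaches the pipeline is not that dictionary but a single prompt string that the chain built from the question and the retrieved context, whereas the question-answering pipeline expects the question and the context as separate inputs.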
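
One possible workaround, sketched below, is to skip the chain and pass the retrieved text to the extractive pipeline directly, with the question and context as separate arguments. The helper `answer_from_documents` is hypothetical, and `stub_pipeline` only stands in for `question_answerer` so the sketch runs without downloading a model. The sketch assumes the pipeline returns a dict with `answer` and `score` keys for a single question and context.

```python
def answer_from_documents(qa_pipeline, question, contexts):
    """Run an extractive QA callable on each context and keep the best-scoring result."""
    best = None
    for context in contexts:
        if not context.strip():
            continue  # skip empty contexts, which cannot contain an answer
        result = qa_pipeline(question=question, context=context)
        if best is None or result["score"] > best["score"]:
            best = result
    return best

def stub_pipeline(question, context):
    """Stand-in for a Hugging Face question-answering pipeline (illustration only)."""
    found = "third president" in context
    return {
        "answer": "third president of the United States" if found else "",
        "score": 0.9 if found else 0.1,
    }

# Invented contexts standing in for the page_content of retrieved documents.
contexts = [
    "Virgin Australia commenced services on 31 August 2000.",
    "Thomas Jefferson served as the third president of the United States.",
]

best = answer_from_documents(stub_pipeline, "Who is Thomas Jefferson?", contexts)
print(best["answer"])  # prints third president of the United States
```

With the notebook's own objects, the equivalent call would be `answer_from_documents(question_answerer, question, [d.page_content for d in retriever.get_relevant_documents(question)])`.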