# Improved Question-Answering Search Engines with Transformers and Large Language Models (LLMs)


![](https://i.imgur.com/7SXKckD.png)

Transfer learning is the practice of taking models that have already been trained and tuning or adapting them to our own downstream tasks.

# QA Search Engine using Transformers

## Retrieval and Re-ranking

In Semantic Search we have shown how to use SentenceTransformer to compute embeddings for queries, sentences, and paragraphs and how to use this for semantic search.

For complex search tasks such as question answering retrieval, search quality can be improved significantly by using Retrieve & Re-Rank.


## Retrieve & Re-Rank Pipeline

A pipeline for information retrieval / question answering retrieval that works well is the following. All components are provided and explained in this notebook:

![](https://i.imgur.com/yIXJRSo.png)


Given a search query, we first use a retrieval system that retrieves a large list of e.g. 100 possible hits which are potentially relevant for the query.
For the retrieval, we can use either lexical search, e.g. with ElasticSearch, or dense retrieval with a bi-encoder. Simple lexical search can be based on term-weighting schemes such as TF-IDF or BM25.
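To make the lexical option concrete, here is a minimal sketch of BM25 scoring in plain Python. It is not what ElasticSearch runs internally; the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the three toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # only literal matches contribute
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy corpus (made up for this sketch)
docs = [
    "berlin is the capital city of germany",
    "munich is the capital of bavaria",
    "flying fish can glide above the water",
]
scores = bm25_scores("capital of germany", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Because only literal term matches contribute, the third document scores zero no matter how relevant it might be semantically, which is the limitation dense retrieval is meant to address.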


However, the retrieval system might retrieve documents that are not that relevant for the search query.
Hence, in a second stage, we use a re-ranker based on a cross-encoder that scores the relevancy of all candidates for the given search query.

The output will be a ranked list of hits we can present to the user.
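The two stages described above can be sketched as one small function with pluggable scorers. This is only an outline of the control flow: `retrieve_and_rerank`, `word_overlap`, and `jaccard` are hypothetical names, and the two toy scorers merely stand in for a real retriever and a real cross-encoder.

```python
from typing import Callable, List, Tuple


def retrieve_and_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    n_candidates: int = 100,
    n_results: int = 5,
) -> List[Tuple[float, str]]:
    # Stage 1: rank the whole corpus with the cheap scorer and keep a shortlist
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: re-score only the shortlist with the expensive scorer
    reranked = sorted(((expensive_score(query, d), d) for d in candidates), reverse=True)
    return reranked[:n_results]


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for the retriever: number of shared words."""
    return float(len(set(query.split()) & set(doc.split())))


def jaccard(query: str, doc: str) -> float:
    """Toy stand-in for the re-ranker: shared words relative to all words."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)


corpus = [
    "the capital of germany is berlin",
    "germany is a country in europe with many cities and a capital",
    "fish live in water",
]
results = retrieve_and_rerank(
    "capital of germany", corpus, word_overlap, jaccard, n_candidates=2, n_results=1
)
print(results)
```

The point of the design is that the expensive scorer never sees more than `n_candidates` documents, regardless of corpus size.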


## Retrieval: Bi-Encoder

For the retrieval of the candidate set, we can either use lexical search (e.g. ElasticSearch), or we can use a bi-encoder (semantic search) which is implemented in this repository.

Lexical search looks for literal matches of the query words in your document collection. It will not recognize synonyms, acronyms or spelling variations.

In contrast, semantic search (or dense retrieval) encodes the search query into vector space and retrieves the document embeddings that are close in vector space.

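A bi-encoder produces an embedding for a given sentence or document independently of the query, so the document embeddings can be computed once and reused for every search.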
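As a rough illustration of "close in vector space", the sketch below ranks documents by cosine similarity to a query vector. The three-dimensional vectors are invented for the example; real bi-encoder embeddings have hundreds of dimensions and come from the model, not from hand-written lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, doc_vecs, k=2):
    """Return the k (similarity, document index) pairs with the highest similarity."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return sorted(scored, reverse=True)[:k]


# Hypothetical embeddings, precomputed once for the corpus
doc_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.8, 0.2, 0.1]]
# Hypothetical embedding of the incoming query
query_vec = [1.0, 0.0, 0.0]

hits = top_k_by_cosine(query_vec, doc_vecs, k=2)
print(hits)
```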


## Re-Ranker: Cross-Encoder

The retriever has to be efficient for large document collections with millions of entries. However, it might return irrelevant candidates.

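A re-ranker based on a cross-encoder can substantially improve the final results for the user. The query and a candidate document are passed simultaneously to the transformer network, which then outputs a single score indicating how relevant the document is for the given query. Depending on the model, this score is either squashed into the range 0 to 1 or returned as an unbounded raw value; the MS MARCO model used below returns raw scores, which is why its outputs can exceed 1.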
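If a bounded relevance value is preferred, one common option is to pass a raw score through a sigmoid. The sketch below assumes the scores are raw logits; the first two values are taken from the re-ranker output shown later in this notebook, and the third is made up to show a negative case.

```python
import math


def sigmoid(x):
    """Map any real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


raw_scores = [8.906, 3.755, -2.0]  # last value is hypothetical
probs = [sigmoid(s) for s in raw_scores]
print([round(p, 4) for p in probs])
```

The sigmoid is monotonic, so it changes the scale of the scores but never the ranking.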

![](https://i.imgur.com/PFgkrcI.png)

The advantage of Cross-Encoders is the higher performance, as they perform attention across the query and the document.

Scoring thousands or millions of (query, document)-pairs would be rather slow. Hence, we use the retriever to create a set of e.g. 100 possible candidates which are then re-ranked by the Cross-Encoder.

First, you use an efficient Bi-Encoder to retrieve e.g. the top-100 most similar sentences for a query. Then, you use a Cross-Encoder to re-rank these 100 hits by computing the score for every (query, hit) combination.
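A back-of-the-envelope count shows why the two-stage split matters. The numbers are the ones used later in this notebook (169,597 passages and `top_k = 32` rather than 100); the count only covers transformer forward passes per query and ignores the one-time cost of embedding the corpus and the comparatively cheap vector comparisons.

```python
corpus_size = 169_597  # passages in the Simple Wikipedia subset used in this notebook
top_k = 32             # candidates kept by the bi-encoder in this notebook

# Cross-encoder alone: one forward pass per (query, passage) pair
cross_encoder_only = corpus_size

# Retrieve & re-rank: one pass to embed the query, then top_k cross-encoder passes
retrieve_then_rerank = 1 + top_k

speedup = cross_encoder_only / retrieve_then_rerank
print(f"{cross_encoder_only} vs {retrieve_then_rerank} forward passes per query "
      f"(about {speedup:.0f}x fewer)")
```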





## Retrieve & Re-Rank Search Engine over Simple Wikipedia

This example demonstrates the Retrieve & Re-Rank setup and lets you search over Simple Wikipedia.

You can input a query or a question. The script then uses semantic search to find relevant passages in Simple English Wikipedia.

### Install Dependencies

In [1]:
!pip install -U sentence-transformers


### Load Transformer Models, Wikipedia Data and Generate Embeddings

For semantic search, we use `SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')` and retrieve 32 potentially relevant passages that answer the input query.

Next, we use a more powerful cross-encoder, `CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')`, that scores the query against every retrieved passage for relevancy. The cross-encoder further boosts the ranking quality.

MS MARCO is a large-scale information retrieval corpus that was created from real user search queries on the Bing search engine.

The provided models can be used for semantic search, i.e., given keywords / a search phrase / a question, the model will find passages that are relevant for the search query.

In [2]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")


#We use the Bi-Encoder to encode all passages, so that we can use it with semantic search
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256     #Truncate long passages to 256 tokens
top_k = 32                          #Number of passages we want to retrieve with the bi-encoder

#The bi-encoder retrieves the top_k (here 32) passages. We use a cross-encoder to re-rank this result list and improve its quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


# As dataset, we use Simple English Wikipedia. Compared to the full English wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder

wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())

        #Add all paragraphs
        #passages.extend(data['paragraphs'])

        #Only add the first paragraph
        passages.append(data['paragraphs'][0])

print("Passages:", len(passages))

# We encode all passages into our vector space. This takes about 5 minutes (depends on your GPU speed)
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)

Passages: 169597



In [4]:
passages[0]

'Ted Cassidy (July 31, 1932 - January 16, 1979) was an American actor. He was best known for his roles as Lurch and Thing on "The Addams Family".'

In [5]:
corpus_embeddings[0], corpus_embeddings[0].shape

(tensor([-1.0502e-01, -6.6984e-02,  6.5590e-03, -7.5239e-02, -2.9778e-02,
          3.2903e-02,  4.1868e-02,  9.6896e-02, -3.2055e-02, -3.0800e-02,
          1.3767e-02,  5.3367e-02, -4.6852e-02,  1.4192e-02,  6.8777e-02,
          3.0876e-02,  5.0982e-03,  4.0002e-02, -7.3914e-02, -6.9276e-02,
          1.2249e-02, -5.3207e-02,  3.7148e-02, -2.8915e-02, -5.0157e-04,
         -3.7280e-02,  6.0539e-02,  4.8168e-02, -8.6899e-03,  2.2352e-02,
          1.0071e-01, -2.1343e-02,  3.9886e-02, -5.0790e-03, -1.9516e-02,
         -8.7064e-02,  4.3888e-02,  3.3809e-02,  5.3262e-02,  3.9972e-02,
          4.4025e-02,  2.5665e-02, -2.2285e-03, -5.6762e-03, -8.8039e-03,
         -6.2691e-02,  2.6264e-02,  9.7448e-03, -9.1535e-03,  1.0132e-01,
          1.0190e-01,  3.6445e-02,  2.0244e-02,  1.4598e-03, -2.8108e-02,
          2.3936e-02, -3.3839e-02,  9.8303e-02, -3.6032e-02, -9.6148e-02,
         -1.4320e-02, -6.4947e-03,  1.1665e-02, -1.6429e-03, -9.1396e-02,
          1.0647e-01, -5.7154e-02, -2.

This function searches all Wikipedia passages for ones that answer the query.

In [6]:
def search(query):
    print("Input question:", query)

    ##### Bi-Encoder: Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    # Move the query to the same device as the corpus embeddings (works on CPU or GPU)
    question_embedding = question_embedding.to(corpus_embeddings.device)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    ##### Cross-Encoder: Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder score to each hit (the sorting happens further down)
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output of top-2 hits from bi-encoder
    print("\n-------------------------\n")
    print("Top-2 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:2]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    # Output of top-2 hits from re-ranker
    print("\n-------------------------\n")
    print("Top-2 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:2]:
        print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))

In [7]:
search(query = "What is the capital of the United States?")

Input question: What is the capital of the United States?

-------------------------

Top-2 Bi-Encoder Retrieval hits
	0.622	Cities in the United States:
	0.597	The United States Capitol is the building where the United States Congress meets. It is the center of the legislative branch of the U.S. federal government. It is in Washington, D.C., on top of Capitol Hill at the east end of the National Mall.

-------------------------

Top-2 Cross-Encoder Re-ranker hits
	8.906	Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. The President of the USA and many major national government offices are in the territory. This makes it the political center of the United States of America.
	3.755	A capital city (or capital town or just capital) is a city or town, specified by law or constitution, by the government of a country, or part of a country, such as a state, province or county. 

In [8]:
search(query = "What is the capital of Germany?")

Input question: What is the capital of Germany?

-------------------------

Top-2 Bi-Encoder Retrieval hits
	0.647	Berlin is the capital city of Germany. It is also the biggest city in Germany. About 3,700,000 people live there.
	0.625	Bavaria () is a State ("Bundesland") of Germany. The territory of this state is the largest of the 16 German states. The state capital is Munich with 1.3 million people. About 12.5 million people live in Bavaria. Like many German states, Bavaria was once independent. Ludwig II of Bavaria was its last independent king.

-------------------------

Top-2 Cross-Encoder Re-ranker hits
	7.922	Berlin is the capital city of Germany. It is also the biggest city in Germany. About 3,700,000 people live there.
	4.175	Sangerhausen is the capital of the Mansfeld-Südharz rural district in Germany. It is between Magdeburg and Erfurt. The river Gonna flows through it.


In [9]:
search(query = "What is the capital of India?")

Input question: What is the capital of India?

-------------------------

Top-2 Bi-Encoder Retrieval hits
	0.598	Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people.
	0.594	Kolkata (spelled Calcutta before 1 January 2001) is the capital city of the Indian state of West Bengal. It is the second largest city in India after Mumbai. It is on the east bank of the River Hooghly. When it is called Calcutta, it includes the suburbs. This makes it the third largest city of India. This also makes it the world's 8th largest metropolitan area as defined by the Uni

In [10]:
search(query = "Coldest place on earth?")

Input question: Coldest place on earth?

-------------------------

Top-2 Bi-Encoder Retrieval hits
	0.598	East Antarctica, also called Greater Antarctica, is the largest part (two-thirds) of the Antarctic continent. It is on the Indian Ocean side of the Transantarctic Mountains. It is the coldest, windiest, and driest part of Earth. East Antarctica holds the record as the coldest place on earth.
	0.535	The North Pole is the point that is farthest north on Earth. It is the point on which axis of Earth turns. It is in the Arctic Ocean and it is cold there because the sun does not shine there for about half a year and never rises very high. The ocean around the pole is always very cold and it is covered by a thick sheet of ice.

-------------------------

Top-2 Cross-Encoder Re-ranker hits
	7.080	East Antarctica, also called Greater Antarctica, is the largest part (two-thirds) of the Antarctic continent. It is on the Indian Ocean side of the Transantarctic Mountains. It is the coldest, w

In [11]:
search(query = "What is natural language processing?")

Input question: What is natural language processing?

-------------------------

Top-2 Bi-Encoder Retrieval hits
	0.773	Natural Language Processing (NLP) is a field in Artificial Intelligence, and is also related to linguistics. On a high level, the goal of NLP is to program computers to automatically understand human languages, and also to automatically write/speak in human languages. We say "Natural Language" to mean human language, and to indicate that we are not talking about computer (programming) languages.
	0.631	A natural language is the kind which we use in everyday conversation and writing. For example English, Hindi, Chinese. Natural languages are always very flexible, and people speak them in slightly different ways.

-------------------------

Top-2 Cross-Encoder Re-ranker hits
	10.696	Natural Language Processing (NLP) is a field in Artificial Intelligence, and is also related to linguistics. On a high level, the goal of NLP is to program computers to automatically underst

In [12]:
search(query = "Can fish fly?")

Input question: Can fish fly?

-------------------------

Top-2 Bi-Encoder Retrieval hits
	0.748	Flying fish are marine oceanic fishes of the family Exocoetidae. They are about 50 species, and they live worldwide in warm waters. They are noted for their ability to glide. They are all small, with a maximum length of about 45 cm (18 inches), and have winglike, rigid fins and an unevenly forked tail.
	0.633	Pilot fish ("Naucrates ductor") are fish that live in many places of the world. They live in warm water. They eat parasites on larger fish.

-------------------------

Top-2 Cross-Encoder Re-ranker hits
	6.971	Flying fish are marine oceanic fishes of the family Exocoetidae. They are about 50 species, and they live worldwide in warm waters. They are noted for their ability to glide. They are all small, with a maximum length of about 45 cm (18 inches), and have winglike, rigid fins and an unevenly forked tail.
	5.216	Fly fishing is a method of sport fishing. An artificial fly is cast int

In [13]:
search(query = "How do you train a machine learning model?")

Input question: How do you train a machine learning model?

-------------------------

Top-2 Bi-Encoder Retrieval hits
	0.593	In machine learning, supervised learning is the task of inferring a function from labelled training data. The results of the training are known beforehand, the system simply learns how to get to these results correctly. Usually, such systems work with vectors. They get the training data and the result of the training as two vectors and produce a "classifier". Usually, the system uses inductive reasoning to generalize the training data.
	0.535	Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science.

-------------------------

Top-2 Cross-Encoder Re-ranker hits
	3.050	In machine learning, supervised learning is the task of inferring a function from labelled training data. The results of the training are known beforehand, the system simply learns how to get to these resul

# QA Search Engine using Large Language Models - ChatGPT

Here we use an OpenAI embedding model to generate contextual embeddings for each Wikipedia article.

We then use ChatGPT (GPT-3.5) to answer questions much as a human would: it first retrieves the articles most similar to the input query and then writes an answer based on them.

The new model, `text-embedding-ada-002`, replaces five separate models for text search, text similarity, and code search, and outperforms OpenAI's previous most capable model, Davinci, at most tasks, while being priced 99.8% lower.

GPT-3.5 models can understand and generate natural language or code. The most capable and cost-effective model in the GPT-3.5 family is `gpt-3.5-turbo`, which has been optimized for chat using the Chat Completions API but also works well for traditional completion tasks.
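The retrieve-then-answer idea can be sketched without calling any API: the retrieved passages are pasted into a prompt together with the question, and that prompt is what the chat model would receive. The function name `build_prompt` and the prompt wording are hypothetical and not taken from any library; the passage is the one returned for this question in the first half of the notebook.

```python
def build_prompt(question, retrieved_passages):
    """Combine retrieved passages and the user question into a single prompt string."""
    context = "\n\n".join(retrieved_passages)
    return (
        "Use the following pieces of context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


retrieved = [
    "Washington, D.C. (also known as simply Washington or D.C., and officially as "
    "the District of Columbia) is the capital of the United States.",
]
prompt = build_prompt("What is the capital of the United States?", retrieved)
print(prompt)
```

Grounding the model in retrieved text this way means the answer can draw on the indexed passages rather than only on what the model memorized during training.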

### Load Dependencies

In [16]:
!pip install sentence_transformers
!pip install langchain
!pip install openai
!pip install tiktoken
!pip install faiss-gpu



### Load Wikipedia Data

In [1]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch

wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())

        #Add all paragraphs
        #passages.extend(data['paragraphs'])

        #Only add the first paragraph
        passages.append(data['paragraphs'][0])

print("Passages:", len(passages))

Passages: 169597


### Load OpenAI LLMs

In [2]:
import os
os.environ['OPENAI_API_KEY'] = 'sk-xxxx'

In [20]:
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model='gpt-3.5-turbo',temperature=0.0)

### Generate LLM Embeddings and Store Them in a FAISS Index

Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
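To give a feel for what a vector index does, here is a toy exhaustive ("flat") index in plain Python that stores vectors and returns the nearest ones by squared L2 distance. It is a conceptual sketch only: the class `FlatIndex` is hypothetical, and Faiss itself works on NumPy arrays, handles batched queries, and offers approximate index types that avoid scanning every vector.

```python
import heapq


class FlatIndex:
    """Toy exact nearest-neighbour index using squared L2 distance."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []

    def add(self, vectors):
        """Append vectors to the index; their position becomes their id."""
        for vec in vectors:
            if len(vec) != self.dim:
                raise ValueError(f"expected dimension {self.dim}, got {len(vec)}")
            self.vectors.append(list(vec))

    def search(self, query, k):
        """Return the k closest (distance, id) pairs, nearest first."""
        distances = (
            (sum((q - x) ** 2 for q, x in zip(query, vec)), idx)
            for idx, vec in enumerate(self.vectors)
        )
        return heapq.nsmallest(k, distances)


index = FlatIndex(dim=2)
index.add([[0.0, 0.0], [5.0, 5.0], [1.0, 1.0]])  # made-up 2-d "embeddings"
neighbours = index.search([0.9, 0.9], k=2)
ids = [idx for _, idx in neighbours]
print(ids)
```

An exhaustive scan like this costs time proportional to the number of stored vectors per query, which is the cost that approximate indexes trade a little accuracy to reduce.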

In [4]:
passages[-3:]

['Internet Movie Cars Database or IMCDb is a online database of cars, bikes, trucks, buses and other motor vehicle appearances in movies. It was founded in 2004.',
 "Kevin Dunn is an American actor who has played in many supporting roles. He is 64 years of age. He was born in Chicago Illinois. He has been in films since the 1980's.",
 'The Seminole bat ("Lasiurus seminolus") is a type of bat in the family Vespertilionidae.']

In [5]:
# The vectorstore we'll be using
from langchain.vectorstores import FAISS

# The embedding engine that will convert our text to vectors
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [6]:
from langchain.docstore.document import Document

docs = [Document(page_content=doc) for doc in passages]

In [7]:
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunked_docs = splitter.split_documents(docs)

In [8]:
chunked_docs[-3:]

[Document(page_content='Internet Movie Cars Database or IMCDb is a online database of cars, bikes, trucks, buses and other motor vehicle appearances in movies. It was founded in 2004.', metadata={}),
 Document(page_content="Kevin Dunn is an American actor who has played in many supporting roles. He is 64 years of age. He was born in Chicago Illinois. He has been in films since the 1980's.", metadata={}),
 Document(page_content='The Seminole bat ("Lasiurus seminolus") is a type of bat in the family Vespertilionidae.', metadata={})]

In [9]:
# Get your embeddings engine ready
gpt_embedding_model = OpenAIEmbeddings()

# Embed your documents and combine with the raw text in a db.
docsearch = FAISS.from_documents(chunked_docs, gpt_embedding_model)

### Build a QA Retriever Engine

In [10]:
from langchain.chains import RetrievalQA

In [22]:
qa_engine = RetrievalQA.from_chain_type(llm=llm,
                                        chain_type="stuff",
                                        retriever=docsearch.as_retriever())

In [23]:
query = "What is the capital of the United States?"
qa_engine.run(query)

'The capital of the United States is Washington, D.C.'

In [24]:
query = "What is the capital of Germany?"
qa_engine.run(query)

'The capital of Germany is Berlin.'

In [26]:
query = "What is the capital of India?"
qa_engine.run(query)

'The capital of India is New Delhi.'

In [27]:
query = "Tell me some facts about the capital of India?"
qa_engine.run(query)

'The capital of India is New Delhi. It is a union territory and also a part of the megacity of Delhi. New Delhi has a rich history and is known for its numerous monuments. It is an expensive city to live in. The city covers an area of about 42.7 square kilometers and has a population of approximately 9.4 million people.'

In [29]:
query = "Can fish fly?"
qa_engine.run(query)

"No, fish cannot fly in the same way that birds or insects can. While there are some fish, like flying fish, that have the ability to glide above the water's surface for short distances, they do not have the ability to truly fly through the air."