# Question-Answering using Simple Wikipedia Index

This example demonstrates the setup for query / question-answer retrieval.

You can input a query or a question. The script then uses semantic search
to find relevant passages in Simple English Wikipedia (as it is smaller and fits better in RAM).

For semantic search, we use SentenceTransformer('msmarco-MiniLM-L-6-v3') and retrieve
100 passages that potentially answer the input query.

Next, we use a more powerful CrossEncoder (cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')) that
scores the query against each retrieved passage for relevance. The cross-encoder is necessary to filter out noise
that the semantic search step might retrieve.
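
The two stages described above can be sketched without downloading any model. The following is a minimal illustration of the control flow only: `cheap_score` and `careful_score` are hypothetical stand-ins based on word overlap, not the bi-encoder or cross-encoder used in this notebook, and the toy corpus is made up.

```python
def tokens(text):
    """Lower-case, split on whitespace and strip basic punctuation."""
    return [t.strip(".,?!") for t in text.lower().split()]


def cheap_score(query, passage):
    """Stand-in for the bi-encoder: fraction of query words found in the passage."""
    q, p = set(tokens(query)), set(tokens(passage))
    return len(q & p) / len(q) if q else 0.0


def careful_score(query, passage):
    """Stand-in for the cross-encoder: Jaccard overlap of query and passage words."""
    q, p = set(tokens(query)), set(tokens(passage))
    return len(q & p) / len(q | p) if q | p else 0.0


def retrieve_and_rerank(query, corpus, top_k=3, final_k=1):
    # Stage 1: score every passage with the cheap function and keep top_k candidates.
    candidates = sorted(corpus, key=lambda p: cheap_score(query, p), reverse=True)[:top_k]
    # Stage 2: re-score only those candidates with the more expensive function.
    reranked = sorted(candidates, key=lambda p: careful_score(query, p), reverse=True)
    return reranked[:final_k]


toy_corpus = [
    "Cats usually live for 13 to 20 years.",
    "Cats are kept on farms to keep rodents away and they live near humans for a long time.",
    "Toronto is a city in Canada.",
    "Dogs live with people.",
]
best = retrieve_and_rerank("How long do cats live?", toy_corpus)
print(best)
```

In this toy setup the long farm passage wins the first stage because it shares the most query words, while the second stage prefers the short passage that actually states a lifespan. The real pipeline has the same shape, with learned models in place of the overlap heuristics.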


In [None]:
!pip install -U sentence-transformers rank_bm25

In [1]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import time
import gzip
import os
import torch

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")

# We use the bi-encoder to encode all passages, so that we can use them for semantic search
model_name = 'msmarco-MiniLM-L-6-v3'
bi_encoder = SentenceTransformer(model_name)
top_k = 100  # Number of passages we want to retrieve with the bi-encoder

# The bi-encoder will retrieve 100 documents. We then use a cross-encoder to re-rank this list and improve its quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# As dataset, we use Simple English Wikipedia. Compared to the full English wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder
wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        passages.extend(data['paragraphs'])

print("Passages:", len(passages))

# To speed things up, pre-computed embeddings are downloaded.
# The provided file contains the passages encoded with the model 'msmarco-MiniLM-L-6-v3'
if model_name == 'msmarco-MiniLM-L-6-v3':
    embeddings_filepath = f'simplewiki-2020-11-01-{model_name}.pt'
    if not os.path.exists(embeddings_filepath):
        util.http_get(f'http://sbert.net/datasets/simplewiki-2020-11-01-{model_name}.pt', embeddings_filepath)

    corpus_embeddings = torch.load(embeddings_filepath)
    if torch.cuda.is_available():
        corpus_embeddings = corpus_embeddings.to('cuda')
else:  # Here, we compute the corpus_embeddings from scratch (which can take a while depending on the GPU)
    corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)


Passages: 509663
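
The loading loop above implies that each line of the `.jsonl.gz` file is a JSON object with a `paragraphs` list. The sketch below writes a tiny file in that shape and reads it back the same way. The sample records and the `title` field are made up for illustration; only the `paragraphs` key is taken from the loading code, and the real file may carry other fields.

```python
import gzip
import json
import os
import tempfile

# Hypothetical sample records: only the 'paragraphs' key comes from the notebook's loading code.
sample_articles = [
    {"title": "Cat", "paragraphs": ["Cats are small mammals.", "Cats usually live for 13 to 20 years."]},
    {"title": "Toronto", "paragraphs": ["Toronto is a city in Canada."]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl.gz")

    # Write one JSON object per line, gzip-compressed.
    with gzip.open(sample_path, "wt", encoding="utf8") as f_out:
        for article in sample_articles:
            f_out.write(json.dumps(article) + "\n")

    # Read it back like the notebook does: flatten all paragraphs into one list.
    sample_passages = []
    with gzip.open(sample_path, "rt", encoding="utf8") as f_in:
        for line in f_in:
            data = json.loads(line.strip())
            sample_passages.extend(data["paragraphs"])

print("Passages:", len(sample_passages))
```

Because the paragraphs are flattened into one list, a passage's position in that list is the only identifier the later search steps use (the `corpus_id` of a hit).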


In [4]:
# We also compare the results to lexical search (keyword search). Here, we use 
# the BM25 algorithm which is implemented in the rank_bm25 package.

from rank_bm25 import BM25Okapi
# Note: the old `sklearn.feature_extraction.stop_words` module is no longer available in
# recent scikit-learn releases; the stop-word list is imported from the public `text` module instead.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import string
from tqdm.autonotebook import tqdm
import numpy as np


# We lower-case the text and remove stop words before indexing
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc


tokenized_corpus = []
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)




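
To make the lexical baseline less of a black box, here is a small from-scratch sketch of Okapi BM25 scoring. It uses a common textbook form with `k1 = 1.5`, `b = 0.75` and a smoothed IDF. These choices are assumptions for illustration, and `rank_bm25` may differ in details such as how it treats very frequent terms, so the scores will not necessarily match `BM25Okapi`.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 variant."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF (always positive): rare terms weigh more.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalisation.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents, already tokenized and stop-word filtered (made up for illustration).
toy_docs = [
    ["cats", "usually", "live", "13", "20", "years"],
    ["farm", "cats", "kept", "farms", "rodents"],
    ["toronto", "city", "canada"],
]
toy_scores = bm25_scores(["long", "cats", "live"], toy_docs)
best_doc = max(range(len(toy_docs)), key=lambda i: toy_scores[i])
print(best_doc, [round(s, 3) for s in toy_scores])
```

A document that shares no token with the query scores exactly zero. This is why purely lexical search can miss passages that answer a question in different words.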

In [5]:
# This function will search all wikipedia articles for passages that
# answer the query
def search(query):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -5)[-5:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)
    
    print("Top-3 lexical search (BM25) hits")
    for hit in bm25_hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

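    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    # Move the query to the same device as the corpus embeddings (works with or without a GPU)
    question_embedding = question_embedding.to(corpus_embeddings.device)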
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder score to each hit so the hits can be sorted by it later
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output of top-3 hits from bi-encoder
    print("\n-------------------------\n")
    print("Top-3 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    # Output of top-3 hits from re-ranker
    print("\n-------------------------\n")
    print("Top-3 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))
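
Conceptually, the bi-encoder step ranks passages by how similar their embeddings are to the query embedding. The sketch below shows that idea with cosine similarity on hand-written 3-dimensional vectors. The vectors and the helper `top_k_by_cosine` are made up for illustration. It returns hits in the same `corpus_id` / `score` shape that the code above reads from `util.semantic_search`, but it is not that function's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_by_cosine(query_vec, corpus_vecs, k):
    """Return the k most similar corpus entries as {'corpus_id', 'score'} dicts."""
    hits = [{"corpus_id": i, "score": cosine(query_vec, vec)} for i, vec in enumerate(corpus_vecs)]
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:k]


# Hypothetical embeddings; real ones have hundreds of dimensions.
toy_embeddings = [
    [1.0, 0.0, 0.0],  # passage 0
    [0.8, 0.6, 0.0],  # passage 1
    [0.0, 0.0, 1.0],  # passage 2
]
toy_hits = top_k_by_cosine([1.0, 0.1, 0.0], toy_embeddings, k=2)
print([hit["corpus_id"] for hit in toy_hits])
```

The brute-force loop is fine for a toy corpus. For roughly half a million passages, the notebook instead relies on batched tensor operations, ideally on a GPU.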


In [6]:
search(query = "What is the capital of the United States?")

Input question: What is the capital of the United States?
Top-3 lexical search (BM25) hits
	16.264	Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states. The federal government (including the United States military) also uses capital punishment.
	15.124	In 1783, it was the capital of the United States for a few months.
	14.476	New York was the capital of the United States under the Articles of Confederation from 1785 to 1788. When the US Constitution was made, it stayed as the capital from 1789 until 1790. In 1789, the first President of the United States, George Washington, was inaugurated; the first United States Congress and the Supreme Court of the United States each met for the first time, and the United States Bill of Rights was written, all at Federal Hall on Wall Street. By 1790, New York grew bigger than Philadelphia, so it become the biggest city in t

In [7]:
search(query = "What is the best orchestra in the world?")

Input question: What is the best orchestra in the world?
Top-3 lexical search (BM25) hits
	18.385	In the December 2008 issue of Gramophone Magazine the Royal Concertgebouw Orchestra was ranked as the best symphony orchestra in the world.
	15.866	He was music director of the Boston Symphony Orchestra longer than any other conductor. Under Ozawa the orchestra remained one of the best in the world and often performed new musical compositions by living composers. They recorded more than 140 works together.
	15.610	The orchestra was the best orchestra in Yugoslavia. By the 1960s it had become one of the best orchestras in Europe. Then, during the 1990s there were wars in the areas which made up Yugoslavia. For a time the orchestra were not allowed to play abroad. It was a difficult time for them, and many players left.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.662	In the December 2008 issue of Gramophone Magazine the Royal Concertgebouw Orchestra was ranked as the best 

In [8]:
search(query = "Number countries Europe")

Input question: Number countries Europe
Top-3 lexical search (BM25) hits
	16.963	ECoHR' has a number of judges. The number of judges is seven normally but at the case of dealing a great issue, the number will be 21 and the judges are equally from member countries of the Council of Europe. At present, there are forty seven member countries of the Council of Europe. Each country may have one judge in the ECoHR. But, judges work independently for the ECoHR, and not for their country.
	14.560	Most countries in Europe, and a few countries in Asia, have made some or all synthetic cannabinoids illegal.
	14.165	Many of these countries were members of the Western European Union. Many, such as Norway, are also in Northern Europe or in Central Europe or Southern Europe.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.733	Europe signed a new treaty of union, which included 27 European countries in 2007.
	0.719	There are at least 43 countries in Europe (the European identities of 5 t

In [9]:
search(query = "When did the cold war end?")

Input question: When did the cold war end?
Top-3 lexical search (BM25) hits
	18.191	During the summit, Bush and Gorbachev would declare an end to the Cold War, although whether it was truly such is a matter of debate.
	16.865	During the near end of the Cold War, Thatcher became one of the closest friends of Ronald Reagan, the 40th President of the United States.
	16.571	This decade also saw the Soviet Union fight a war that seemed endless in Afghanistan, civil war in Ethiopia, and the fall of the Berlin Wall which started the end of the Cold War and of Communism in Eastern Europe.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.705	Not all historians agree on when the Cold War ended. Some think it ended when the Berlin Wall fell. Others think it ended when the Soviet Union collapsed in 1991.
	0.673	The war ended on Saturday 16th, 2008.
	0.636	The Cold War ended in 1991. Since Bond almost always fought Communists, many now thought that the Bond series of movies was finall

In [10]:
search(query = "How long do cats live?")

Input question: How long do cats live?
Top-3 lexical search (BM25) hits
	21.799	Reliable information on the lifespans of house cats is hard to find. However, research has been done to get an estimate (an educated guess) on how long cats usually live. Cats usually live for 13 to 20 years. Sometimes cats can live for 22 to 30 years but there are claims of cats dying at ages of more than 30 years old.
	18.878	The "Guinness World Record" for the oldest cat was for a cat named Creme Puff, who was 38 years old. Female cats seem to live longer than male cats. Neutered cats live longer than cats that have not been neutered. Mixed breed cats also appear to live longer than purebred cats. Researchers have also found that cats that weigh more have shorter lifespans.
	17.810	There are also farm cats, which are kept on farms to keep rodents away; and feral cats, which are domestic cats that live away from humans.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.883	Reliable informatio

In [11]:
search(query = "How many people live in Toronto?")

Input question: How many people live in Toronto?
Top-3 lexical search (BM25) hits
	17.340	The City of Toronto has a population of over 3 million people. Even more people live in the regions around it. All together, the Greater Toronto Area is home to over 6 million people. That makes it the biggest metropolitan area in Canada.
	16.631	Markham, Ontario is a city in Regional Municipality of York, in the Greater Toronto Area of Southern Ontario, Canada. There are twice as many people there as in 1990. 261,573 people live in Markham. It is the 4th largest town in the Greater Toronto Area, after Toronto, Mississauga, and Brampton.
	14.085	Ontario is very big, so sometimes people break it into two. The two parts are called Northern Ontario and Southern Ontario. Most of the people in Ontario live in the south, and that is where the big cities are. The big cities in Southern Ontario are Toronto and the rest of the Greater Toronto Area, Ottawa and the National Capital Region, Hamilton, London, 

In [12]:
search(query = "Oldest US president")

Input question: Oldest US president
Top-3 lexical search (BM25) hits
	14.599	Its first president was Kiro Gligorov, the oldest president in the world until his resignation in 1999.
	12.549	On June 14, 2007, Waldheim died of heart failure. At the time of his death he was the oldest living former Secretary-General of the United Nations and the oldest living former President of Austria.
	12.123	Currently, is the oldest living Mormon apostle.. In July 3, 2015, following the death of President Boyd K. Packer, Nelson became President of the Quorum.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.684	Ford died in his home in California on December 26, 2006 from cardiac arrest caused by cerebrovascular disease and coronary artery disease at the age of 93 years and 165 days. Until then, no other president had lived to be that old since Ronald Reagan in 2004. George H. W. Bush became the oldest living former president in November 2017. On March 22, 2019, Jimmy Carter gained the di

In [13]:
search(query = "Coldest place earth")

Input question: Coldest place earth
Top-3 lexical search (BM25) hits
	23.094	East Antarctica, also called Greater Antarctica, is the largest part (two-thirds) of the Antarctic continent. It is on the Indian Ocean side of the Transantarctic Mountains. It is the coldest, windiest, and driest part of Earth. East Antarctica holds the record as the coldest place on earth.
	16.377	The slogan of the province is "city of the sea of mountains, coldest place in Siam, with beautiful flowers of three seasons."
	15.459	The extinctions may have been caused by an ice age that occurred at the end of the Ordovician period: the end of the Ordovician was one of the coldest times in the last 600 million years of earth history.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.593	In physics, absolute zero (0 K) is the coldest temperature. At that point, subatomic particles stop moving (entropy is at its minimum). Certain things can reach temperatures below absolute zero, known as negative tem

In [14]:
search(query = "Elon Musk year birth")

Input question: Elon Musk year birth
Top-3 lexical search (BM25) hits
	22.568	Tesla, Inc. is a company based in Palo Alto, California which makes electric cars. It was started in 2003 by Martin Eberhard, Dylan Stott, and Elon Musk (who also co-founded PayPal and SpaceX and is the CEO of SpaceX). Eberhard no longer works there. Today, Elon Musk is the Chief Executive Officer (CEO). It started selling its first car, the Roadster in 2008.
	20.492	Elon Musk complained via Twitter about Los Angeles traffic and the same day, December 17, 2016, founded the company. It built a short test tunnel in Los Angeles.
	20.448	At the end of 2016, Musk founded The Boring Company which focuses on tunnelling and infrastructure. He mentioned Los Angeles traffic as the reason for starting this company. In March 2017 Elon Musk announced he has started another company which aims to merge human brains and computers, it is called Neuralink.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.593	Elon

In [15]:
search(query = "Paris eiffel tower")

Input question: Paris eiffel tower
Top-3 lexical search (BM25) hits
	31.008	For example, The Eiffel Tower in Paris, France is tall. That is, the distance from the top to the bottom of the Eiffel Tower is 300 metres. The property of the Eiffel Tower being measured is a distance. The number measured is 300. 300 of what? The unit of measurement is the metre.
	30.044	Afterwards, George and Nico return to Paris and go on their first date on the Eiffel Tower.
	26.973	Some of the most famous attractions in Paris, are the Eiffel Tower and the Arc de Triomphe. Another one is Mont Saint Michel, in Normandy.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.855	The Eiffel Tower (French: La Tour Eiffel, ], IPA pronunciation: "EYE-full" English; "eh-FEHL" French) is a landmark in Paris. It was built between 1887 and 1889 for the Exposition Universelle (World Fair). The Tower was the Exposition's main attraction.
	0.663	The first digging for the foundations began on January 28, 1887 and

In [16]:
search(query = "Which US president was killed?")

Input question: Which US president was killed?
Top-3 lexical search (BM25) hits
	11.966	He came into office when the previous president, Cyprien Ntaryamira, was killed in a plane crash. It was an assassination in which the Rwandan president Juvénal Habyarimana was also killed. Ntibantunganya left office when he was deposed by Pierre Buyoya in a military coup of 1996.
	11.697	Burr killed Alexander Hamilton in a duel in 1804, when Burr was still Vice President.
	11.482	After President James A. Garfield died, vice-president Chester Arthur replaced him. The man who killed him expected the new President to pardon him. This did not happen.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.647	Abraham Lincoln, James A. Garfield, William McKinley, and John F. Kennedy were assassinated while in office. William Henry Harrison, Zachary Taylor, Warren G. Harding, and Franklin Roosevelt died from illness while president. John Tyler was the first Vice President of the United States to b

In [17]:
search(query="When is Chinese New Year")

Input question: When is Chinese New Year
Top-3 lexical search (BM25) hits
	18.606	Today in China the Gregorian calendar is used for most activities. At the same time, the Chinese calendar is still used for traditional Chinese holidays like Chinese New Year or Lunar New Year.
	18.151	Before that, the holiday was usually just called the "NewYear". Because the traditional Chinese calendar is mostly based on the changes in the moon, the Chinese New Year is also known in English as the "Lunar New Year" or "Chinese Lunar New Year". This name comes from "Luna", an old Latin name for the moon. The Indonesian name for the holiday is Imlek, which comes from the Hokkien word for the old Chinese calendar and is therefore also like saying "Lunar New Year".
	18.011	Spring Festival is the Chinese New Year.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.739	Chinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new mo