# Question answering based on corpus
> "We create a corpus of summaries from Wikipedia for certain topics, then do question answering over that corpus."

- toc: true
- branch: master
- badges: true
- comments: true
- author: Ashish Kashav
- categories: [deep learning, NLP,jupyter]





You can input a query or a question. The script then uses semantic search
to find relevant passages.

For semantic search, we use `SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')` to retrieve passages that potentially answer the input query.

Next, we use a more powerful CrossEncoder (`cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')`) that
scores the query against each retrieved passage for relevance. The cross-encoder further boosts performance,
especially when you search over a corpus that the bi-encoder was not trained on.
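
To make the retrieval step concrete, here is a minimal sketch of what semantic search does conceptually: represent the query and the passages as vectors, score each passage by cosine similarity to the query, and keep the top k. The toy 3-dimensional vectors and the helper names `cosine_similarity` and `top_k_by_cosine` are made up for illustration; the real pipeline gets its embeddings from the bi-encoder and relies on the library's own search utility.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query_vec, corpus_vecs, k):
    """Return the k best (corpus_id, score) pairs, highest score first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
toy_corpus = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.1, 0.0]

toy_hits = top_k_by_cosine(toy_query, toy_corpus, k=2)
print(toy_hits)
```

The first toy passage points almost in the same direction as the query, so it ranks first; the third is orthogonal to the query and is dropped by the top-k cut.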


# Installations and imports

In [1]:
!pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.21.1-py3-none-any.whl (4.7 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)

We use the bi-encoder to encode all passages so that we can run semantic search over them. The bi-encoder retrieves candidate passages, and a cross-encoder then re-ranks the result list to improve its quality.


In [44]:
import torch
from sentence_transformers import SentenceTransformer, CrossEncoder, util

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")

# Bi-encoder: embeds the query and each passage independently
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 128  # truncate long passages to 128 tokens
top_k = 3                        # number of passages to retrieve with the bi-encoder

# Cross-encoder: scores each (query, passage) pair jointly for re-ranking
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

We look up each topic on Wikipedia through the `wikipedia-api` package and store every sentence of its summary in a list.

In [15]:

!pip3 install wikipedia-api


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wikipedia-api
  Downloading Wikipedia-API-0.5.4.tar.gz (18 kB)
Building wheels for collected packages: wikipedia-api
  Building wheel for wikipedia-api (setup.py) ... done
  Created wheel for wikipedia-api: filename=Wikipedia_API-0.5.4-py3-none-any.whl size=13477 sha256=e9ff293b1a50e1c0c1805f6ed27456515c407462f3cb6a10aea8766b22519539
  Stored in directory: /root/.cache/pip/wheels/d3/24/56/58ba93cf78be162451144e7a9889603f437976ef1ae7013d04
Successfully built wikipedia-api
Installing collected packages: wikipedia-api
Successfully installed wikipedia-api-0.5.4


In [58]:
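import wikipediaapi

passages = []

# wikipedia-api 0.5.4 (installed above) takes just the language code;
# newer releases may also require a user agent string
wiki_wiki = wikipediaapi.Wikipedia('en')
for p in ['Krishna', 'Shiva', 'Vishnu']:
    page_py = wiki_wiki.page(p)
    # Naive sentence split: break the page summary at every period
    passages.extend(page_py.summary.split("."))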
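
Splitting on every period is simple, but it also cuts at abbreviations and leaves empty or whitespace-padded fragments in the list. As a hedged alternative, the sketch below splits only where sentence-ending punctuation is followed by whitespace and an uppercase letter, then strips each piece and drops empties. The helper name `split_sentences` and the sample text are made up for illustration, and the rule is still a heuristic: an abbreviation followed by a capitalized word (for example a title before a name) would still be split.

```python
import re

def split_sentences(text):
    """Split text into sentences, dropping empty fragments."""
    # Break only where ., ! or ? is followed by whitespace and an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    # Strip surrounding whitespace and discard fragments that are empty.
    return [p.strip() for p in parts if p.strip()]

# Hypothetical sample text, not taken from the Wikipedia summaries.
sample = "Shiva is a deity, lit. 'The Auspicious One'. He is revered widely. "
sentences = split_sentences(sample)
print(sentences)
```

With a helper like this, the loop could call `passages.extend(split_sentences(page_py.summary))` instead of splitting on `"."`; the retrieved passages and their scores would then differ from the outputs shown in this post.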

# Embedding conversion

In [59]:
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)


We encode the query with the bi-encoder and find potentially relevant passages. We then score every retrieved passage with the cross-encoder and sort the results.


# Searching

In [60]:

def search(query):
    print("Input question:", query)

    # Encode the query and move it to the same device as the corpus embeddings,
    # so the search also works when no GPU is available
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    question_embedding = question_embedding.to(corpus_embeddings.device)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # hits for the first (and only) query

    # Re-score every retrieved passage with the cross-encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Ranking by bi-encoder cosine similarity
    print("\n-------------------------\n")
    print("Top-{} Bi-Encoder Retrieval hits".format(top_k))
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:top_k]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    # Ranking by cross-encoder relevance score
    print("\n-------------------------\n")
    print("Top-{} Cross-Encoder Re-ranker hits".format(top_k))
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:top_k]:
        print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))


Now we ask questions to find the most relevant answer sentences in the corpus.

In [61]:
search(query = "Who worships Krishna?")

Input question: Who worships Krishna?

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.710	 In some sub-traditions, Krishna is worshipped as Svayam Bhagavan (the Supreme God), and it is sometimes known as Krishnaism
	0.674	Krishna (, pronounced [ˈkr̩ʂɳɐ] (listen); Sanskrit: कृष्ण, IAST: Kṛṣṇa)  is a major deity in Hinduism
	0.610	 Since the 1960s, the worship of Krishna has also spread to the Western world and to Africa, largely due to the work of the International Society for Krishna Consciousness (ISKCON)

-------------------------

Top-3 Cross-Encoder Re-ranker hits
	8.215	 In some sub-traditions, Krishna is worshipped as Svayam Bhagavan (the Supreme God), and it is sometimes known as Krishnaism
	6.475	 Since the 1960s, the worship of Krishna has also spread to the Western world and to Africa, largely due to the work of the International Society for Krishna Consciousness (ISKCON)
	5.061	Krishna (, pronounced [ˈkr̩ʂɳɐ] (listen); Sanskrit: कृष्ण, IAST: Kṛṣṇa)  is a major deity in Hinduism

In [62]:
search(query = "Who is Shiva?")

Input question: Who is Shiva?

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.710	 Shiva is also known as Adiyogi Shiva, regarded as the patron god of yoga, meditation and the arts
	0.707	 Shiva is a pan-Hindu deity, revered widely by Hindus in India, Nepal, Sri Lanka and Indonesia (especially in Java and Bali)
	0.706	Shiva  (; Sanskrit: शिव, romanized: Śiva, lit

-------------------------

Top-3 Cross-Encoder Re-ranker hits
	9.244	 Shiva is a pan-Hindu deity, revered widely by Hindus in India, Nepal, Sri Lanka and Indonesia (especially in Java and Bali)
	8.796	 Shiva is also known as Adiyogi Shiva, regarded as the patron god of yoga, meditation and the arts
	5.118	Shiva  (; Sanskrit: शिव, romanized: Śiva, lit


In [63]:
search(query = "Who is the preserver?")

Input question: Who is the preserver?

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.329	Vishnu is known as "The Preserver" within the Trimurti, the triple deity of supreme divinity that includes Brahma and Shiva
	0.292	  In Vaishnavism, Vishnu is the supreme being who creates, protects, and transforms the universe
	0.253	 He is the god of protection, compassion, tenderness, and love; and is one of the most popular and widely revered among Indian divinities

-------------------------

Top-3 Cross-Encoder Re-ranker hits
	6.345	Vishnu is known as "The Preserver" within the Trimurti, the triple deity of supreme divinity that includes Brahma and Shiva
	-7.569	 He is the god of protection, compassion, tenderness, and love; and is one of the most popular and widely revered among Indian divinities
	-9.610	  In Vaishnavism, Vishnu is the supreme being who creates, protects, and transforms the universe
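
The cross-encoder scores above are not confined to the 0 to 1 range of the bi-encoder similarities; they run from about -9.6 to 9.2, which suggests unnormalized model outputs. As a sketch, assuming you want values that are easier to compare at a glance, a sigmoid squashes them into (0, 1) without changing the ranking. The helper name `sigmoid` is introduced here for illustration, and treating the result as a calibrated probability is an assumption, not something the model guarantees.

```python
import math

def sigmoid(x):
    """Map a real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Cross-encoder scores printed for "Who is the preserver?" above.
cross_scores = [6.345, -7.569, -9.610]

squashed = [sigmoid(s) for s in cross_scores]
for raw, value in zip(cross_scores, squashed):
    print("{:.3f} -> {:.4f}".format(raw, value))
```

Because the sigmoid is monotonic, the order of the hits stays the same; only the scale changes, which makes the gap between the one strongly relevant passage and the two weak ones easier to see.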
