<a href="https://colab.research.google.com/github/andrePankraz/qa_service/blob/main/notebooks/Website_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Question Answering (QA) from a given Website
This notebook answers questions from the content of a given website.
The examples focus on German, but the overall method also applies to other languages.

# Set-up environment

We need the following packages:

*   [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) (bs) for HTML parsing
*   [sentence_cleaner_splitter](https://github.com/facebookresearch/LASER/tree/main/utils) from project [LASER](https://github.com/facebookresearch/LASER) for sentence splitting
*   [sentence-transformers](https://www.sbert.net/) for sentence embedding
    * Also installs [transformers](https://pypi.org/project/transformers/) for question answering etc.

In [1]:
!pip install --quiet bs4 sentence_cleaner_splitter@git+https://github.com/facebookresearch/LASER.git#subdirectory=utils sentence-transformers

# Check GPU
The following experiments can be run without a GPU, but they will take much longer, especially the QA model.

See the Colab menu: Runtime / Change runtime type.

Check whether a GPU is available:

In [2]:
!nvidia-smi

Mon Mar 13 07:39:14 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P0    28W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Prepare question & source website
First we need a question and the URL of a website that contains the answer.

In [3]:
question = 'Wer ist der Bürgermeister von Dresden?'
title = 'Dresden'
website_url = 'https://de.wikipedia.org/wiki/' + title

Fetch website:

In [4]:
import requests

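res = requests.get(website_url, timeout=30)
res.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
page_html = res.content

# page_html  # uncomment to inspect the raw HTML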

Extract text from website:

In [5]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page_html, features="html.parser")
page_text = soup.get_text()

print(page_text)





Dresden – Wikipedia
document.documentElement.className="client-js";(function(){var cookie=document.cookie.match(/(?:^|; )dewikimwclientprefs=([^;]+)/);if(cookie){var featureName=cookie[1];document.documentElement.className=document.documentElement.className.replace(featureName+'-enabled',featureName+'-disabled');}}());RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":[",\t.",".\t,"],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","Januar","Februar","März","April","Mai","Juni","Juli","August","September","Oktober","November","Dezember"],"wgRequestId":"60c3d434-5cec-4830-aa4f-da0e31b1e8cf","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Dresden","wgTitle":"Dresden","wgCurRevisionId":231749772,"wgRevisionId":231749772,"wgArticleId":1084,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia:Überarbeiten",
"Wikip

Alternative extraction mechanism:

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page_html, features="html.parser")
page_lines = []
line = ''
for e in soup.descendants:
    if isinstance(e, str):
        # Text node: append it, replacing non-breaking spaces
        line += e.replace('\xa0', ' ')
    elif e.name in ['br', 'p', 'h1', 'h2', 'h3', 'h4', 'tr', 'th']:
        # A block-level tag begins: flush the text collected so far, dropping short lines
        if len(line) > 50:
            page_lines.append(line)
        line = ''
    elif e.name == 'li':
        # A list item begins: flush the collected text with a '- ' prefix
        if len(line) > 50:
            page_lines.append('- ' + line)
        line = ''
page_text = '\n'.join(page_lines)
print(page_text)

html



Dresden – Wikipedia
document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":[",\t.",".\t,"],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","Januar","Februar","März","April","Mai","Juni","Juli","August","September","Oktober","November","Dezember"],"wgRequestId":"ffe9249e-c5fd-4d25-9913-01ed829a3e93","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Dresden","wgTitle":"Dresden","wgCurRevisionId":229911472,"wgRevisionId":229911472,"wgArticleId":1084,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia:Defekte Weblinks/Ungeprüfte Archivlinks 2022-10","Wikipedia:Überarbeiten","Wikipedia:Veraltet nach Jahr 2023","Dresden","Gemeinde in Sachsen","Wikipedia:Gesprochener Artikel","Wikipedia:Exzellent","Kreisfreie Stadt in Sachsen","Deutsche Landeshauptstadt","Ort mit Binnen
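
Both extraction outputs above still contain inline JavaScript (the `RLCONF=...` block), apparently because the text inside `<script>` elements is collected like any other text. The following is a minimal sketch of one possible cleanup, assuming it is acceptable to drop `<script>`, `<style>` and `<noscript>` elements entirely. The helper name `extract_visible_text` and the sample HTML are made up for illustration and are not part of the original notebook.

```python
from bs4 import BeautifulSoup


def extract_visible_text(html: str) -> str:
    """Return the text of an HTML document without script/style content."""
    soup = BeautifulSoup(html, features="html.parser")
    # Remove elements whose text is never rendered to the reader.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # Strip each text fragment and join the fragments with newlines.
    return soup.get_text(separator="\n", strip=True)


# Hypothetical sample page, used only to demonstrate the helper.
sample_html = (
    "<html><head><title>Dresden</title>"
    "<script>RLCONF={};</script><style>p {color: red}</style></head>"
    "<body><p>Dresden ist die Landeshauptstadt von Sachsen.</p></body></html>"
)
print(extract_visible_text(sample_html))
```

In the notebook this could be tried as `page_text = extract_visible_text(page_html)`; whether it improves the later retrieval step would have to be checked on the actual page.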

# Set-up an AI-model for question answering (QA)


*   [German AI models for QA on Hugging Face](https://huggingface.co/models?language=de&pipeline_tag=question-answering&sort=downloads)
*   Here we use [deepset/gelectra-large-germanquad](https://huggingface.co/deepset/gelectra-large-germanquad)

In [6]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

qa_model_name = "deepset/gelectra-large-germanquad"
qa_pipeline = pipeline('question-answering', model=qa_model_name, tokenizer=qa_model_name, device=device)

qa_pipeline.device

device(type='cuda', index=0)

# Test QA-model on small example

In [7]:
qa_pipeline({
    'question': 'Wie heißt die Hauptstadt von Deutschland?',
    'context': 'Die Hauptstadt von Deutschland ist Berlin. In London, der Hauptstadt von GB, wohnen auch viele Deutsche. Da beißt die Maus keinen Faden ab.'
}, top_k=5)

[{'score': 0.9851893186569214, 'start': 35, 'end': 41, 'answer': 'Berlin'},
 {'score': 0.006777369882911444,
  'start': 31,
  'end': 41,
  'answer': 'ist Berlin'},
 {'score': 0.0039743697270751,
  'start': 0,
  'end': 41,
  'answer': 'Die Hauptstadt von Deutschland ist Berlin'},
 {'score': 0.0016199437668547034, 'start': 35, 'end': 42, 'answer': 'Berlin.'},
 {'score': 0.001305743819102645,
  'start': 35,
  'end': 52,
  'answer': 'Berlin. In London'}]
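
The `start` and `end` values in the result are character offsets into the context string, with `end` exclusive. A short check, using only the context and the spans printed above, makes this concrete:

```python
context = (
    "Die Hauptstadt von Deutschland ist Berlin. "
    "In London, der Hauptstadt von GB, wohnen auch viele Deutsche. "
    "Da beißt die Maus keinen Faden ab."
)

# (start, end, answer) triples copied from the pipeline output above.
spans = [
    (35, 41, "Berlin"),
    (31, 41, "ist Berlin"),
    (0, 41, "Die Hauptstadt von Deutschland ist Berlin"),
    (35, 42, "Berlin."),
    (35, 52, "Berlin. In London"),
]

for start, end, answer in spans:
    # 'end' is exclusive, exactly like the upper bound of a Python slice.
    assert context[start:end] == answer
    print(f"{start:>2}-{end:<2} {context[start:end]!r}")
```

The same slicing works on `page_text`, which is handy for looking at the text surrounding an answer found on the full website.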

# Use QA-model on entire website
This can be quite slow (more than 5 minutes on CPU, about 1 minute on GPU) if the page contains a lot of text.

The answer quality is pretty good, but this method doesn't scale to more text (bigger pages or even many pages).

In [8]:
qa_pipeline({
    'question': question,
    'context': page_text  # or '\n'.join(page_lines)
}, top_k=5)

[{'score': 0.9691397547721863,
  'start': 75842,
  'end': 75860,
  'answer': 'Wolfgang Berghofer'},
 {'score': 0.9690524935722351,
  'start': 75842,
  'end': 75860,
  'answer': 'Wolfgang Berghofer'},
 {'score': 0.9626346826553345,
  'start': 5282,
  'end': 5294,
  'answer': 'Dirk Hilbert'},
 {'score': 0.9378310441970825,
  'start': 46363,
  'end': 46381,
  'answer': 'Wolfgang Berghofer'},
 {'score': 0.8856545686721802,
  'start': 5282,
  'end': 5294,
  'answer': 'Dirk Hilbert'}]
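
The top-5 list above contains the same span more than once (for example, the first two entries both cover characters 75842 to 75860). A plausible explanation, stated here as an assumption rather than a verified fact, is that the pipeline processes a long context in overlapping windows, so one span can be scored in several windows. The sketch below collapses such duplicates and keeps the best score per span. The helper name `dedupe_answers` is hypothetical, and the scores are rounded from the output above.

```python
def dedupe_answers(results):
    """Keep only the highest-scoring entry for each (start, end) span."""
    best = {}
    for r in results:
        key = (r["start"], r["end"])
        if key not in best or r["score"] > best[key]["score"]:
            best[key] = r
    # Return the surviving entries, best score first.
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


# Scores rounded from the pipeline output above.
results = [
    {"score": 0.96914, "start": 75842, "end": 75860, "answer": "Wolfgang Berghofer"},
    {"score": 0.96905, "start": 75842, "end": 75860, "answer": "Wolfgang Berghofer"},
    {"score": 0.96263, "start": 5282, "end": 5294, "answer": "Dirk Hilbert"},
    {"score": 0.93783, "start": 46363, "end": 46381, "answer": "Wolfgang Berghofer"},
    {"score": 0.88565, "start": 5282, "end": 5294, "answer": "Dirk Hilbert"},
]

for r in dedupe_answers(results):
    print(f"{r['score']:.5f}  {r['start']}-{r['end']}  {r['answer']}")
```

For this output, three distinct spans remain. Note that the same answer text can still appear at different positions of the page, which this sketch deliberately keeps apart.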

# Set-up an AI-model for sentence embedding
Instead of applying the QA-model to the entire website, we split the website into sentences and try to find the sentences relevant to the question. Then we can apply the expensive QA-model to just these sentences.

To find relevant sentences we use an AI-model that can evaluate sentence similarity. We create a sentence embedding for each sentence. We also create a sentence embedding for the question and compare these embeddings ([Bi-Encoder](https://www.sbert.net/examples/applications/cross-encoder/README.html)).

The sentence embeddings for the website(s) can later be stored in a database with a similarity index (FAISS, Elasticsearch etc.).
*   [German AI models for sentence similarity on Hugging Face](https://huggingface.co/models?language=de&pipeline_tag=sentence-similarity&sort=downloads)

In [9]:
from sentence_transformers import SentenceTransformer, util

s_model = SentenceTransformer('Sahajtomar/German-semantic', device=device)

s_model.device, s_model

(device(type='cpu'), SentenceTransformer(
   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
 ))
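
To make the idea of a similarity index more concrete: conceptually, such an index answers the question "which stored vectors score highest against this query vector?". The following dependency-free sketch does this by brute force on toy 2-dimensional vectors. The function name `top_k` and the vectors are invented for illustration; real indexes such as FAISS use far more efficient data structures and are not shown here.

```python
import heapq


def top_k(query, vectors, k=2):
    """Brute-force search: score every stored vector against the query."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    # Indices of the k highest dot-product scores, best first.
    return heapq.nlargest(k, range(len(vectors)), key=scores.__getitem__)


# Toy "embeddings"; real sentence embeddings here have 1024 dimensions.
stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.2], stored, k=2))
```

For the toy data this selects the vectors at positions 0 and 2, the two that point most nearly in the direction of the query.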

# Test Embedding-model on small example
There are two methods for comparing embeddings: [Dot-Product or Cosine-Similarity](https://www.baeldung.com/cs/euclidean-distance-vs-cosine-similarity).

For retrieving the most similar texts from embeddings, Cosine-Similarity seems to work better...

In [10]:
query_embedding = s_model.encode('Wie heißt die Hauptstadt von Deutschland?')
passage_embedding = s_model.encode(['Die Hauptstadt von Deutschland ist Berlin.',
                                    'In London, der Hauptstadt von GB, wohnen auch viele Deutsche.',
                                    'Da beißt die Maus keinen Faden ab.'])

print(f"Type: {type(query_embedding)}, Shape: {query_embedding.shape}")
print("Dot-Similarity:", util.dot_score(query_embedding, passage_embedding))
print("Cos-Similarity:", util.cos_sim(query_embedding, passage_embedding))

Type: <class 'numpy.ndarray'>, Shape: (1024,)
Dot-Similarity: tensor([[318.2937, 186.7843,   0.6361]])
Cos-Similarity: tensor([[0.6410, 0.3833, 0.0013]])
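
In the output above the dot-product scores are several hundred times larger than the cosine scores, which shows that these embeddings are not normalized to unit length. Cosine similarity is the dot product divided by the product of both vector lengths, so it ignores length and only compares direction. The sketch below illustrates the difference on toy vectors using only the standard library; all names and numbers are made up for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cos_sim(a, b):
    # Dot product divided by the product of both vector lengths.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(a):
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


query = [3.0, 4.0]
same_direction = [0.6, 0.8]    # parallel to the query, but short
other_direction = [30.0, 0.0]  # points elsewhere, but long

# The dot product prefers the long vector, cosine prefers the parallel one.
print("dot:", dot(query, same_direction), dot(query, other_direction))
print("cos:", cos_sim(query, same_direction), cos_sim(query, other_direction))

# After normalizing both vectors, the dot product equals the cosine similarity.
print("normalized dot:", dot(normalize(query), normalize(other_direction)))
```

If embeddings are normalized once before being stored, a plain dot-product search therefore yields the same ranking as cosine similarity.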


# Create sentence embeddings for website
*   Split website text into sentences
*   We use an AI-model for this - what else (from project [LASER](https://github.com/facebookresearch/LASER), also used for [NLLB](https://ai.facebook.com/research/no-language-left-behind/))
*   Ignore very short lines
*   Create sentence embeddings with Embedding-model

In [11]:
from sentence_cleaner_splitter.cleaner_splitter import SentenceSplitClean

sentence_splitter = SentenceSplitClean('de', 'default')
sentences = []
for _, _, line in sentence_splitter(page_text):
    # Ignore very short lines
    if len(line) > 50:
        # Prefix the page title unless the sentence already mentions it
        sentences.append(f"{title}: {line.strip()}" if title not in line else line.strip())
embeddings = s_model.encode(sentences)

print(len(embeddings))

1392


# Find top similar sentences
1.   Create a sentence embedding for the question
2.   Find the indices of the top-k most similar website sentence embeddings
3.   Create a context from the sentences found
4.   Include some surrounding sentences for each sentence found

In [12]:
query_embedding = s_model.encode(question)

# print("Similarity:", util.dot_score(query_embedding, embeddings))
sim_scores = util.dot_score(query_embedding, embeddings)
top_results = torch.topk(sim_scores[0], k=20)
context = ''
for score, idx in zip(top_results[0], top_results[1]):
    print(f"(Score: {score:.4f})  {sentences[idx]}")
    # Add up to two preceding sentences (>= 0 so that sentence 0 can be included)
    if idx - 2 >= 0:
        context += (sentences[idx - 2] + ' ')
    if idx - 1 >= 0:
        context += (sentences[idx - 1] + ' ')
    # Add the matching sentence itself
    context += (sentences[idx] + ' ')
    # Add up to two following sentences
    if idx + 1 < len(sentences):
        context += (sentences[idx + 1] + ' ')
    if idx + 2 < len(sentences):
        context += (sentences[idx + 2] + ' ')

context

(Score: 361.4486)  Siehe auch: Liste der Oberbürgermeister von Dresden
(Score: 350.8103)  Dresden: Sie führen den Titel "Bürgermeister," wobei der "Erste Bürgermeister" den Oberbürgermeister ständig vertritt.
(Score: 331.3880)  Dresden: Er leitet die Stadtverwaltung, verantwortet laufende Tagesgeschäfte und repräsentiert die Stadt.
(Score: 324.5040)  Dresden: Nach Einführung der Süddeutschen Ratsverfassung in Sachsen ist seit 1994 der nunmehr direkt vom Volk gewählte Oberbürgermeister zugleich Vorsitzender des Stadtrates.
(Score: 323.9839)  → Hauptartikel: Ergebnisse der Kommunalwahlen in Dresden
(Score: 323.0088)  Dresden: Als Sitz der Sächsischen Staatsregierung und des Sächsischen Landtags sowie zahlreicher Landesbehörden ist die Großstadt das politische Zentrum Sachsens.
(Score: 318.4113)  Dresden: Vorsitzender des Stadtbezirksbeirats ist der Oberbürgermeister oder eine von ihm beauftragte Person.
(Score: 313.9804)  Dresden: ↑ Quelle: Statistisches Landesamt des Freistaates Sachsen

'Dresden: Die beiden Wahlgänge der Oberbürgermeisterwahl 2022 ergaben folgende Ergebnisse:[64][65] Hilbert (Bürgerfür Dresden) Jähnigen (Grüne) Krah (AfD) Pöhnisch (PARTEI) Fuchs (Einzel bewerber) Vorlage:Wahldiagramm/Wartung/TITEL zu lang Siehe auch: Liste der Oberbürgermeister von Dresden → Hauptartikel: Ergebnisse der Kommunalwahlen in Dresden Dresden: Bei der letzten Kommunalwahl am 26. Mai 2019 wurde folgender Stadtrat gewählt:[66] Dresden: Entsprechend den Regelungen der Sächsischen Gemeindeordnung (SächsGemO) wird er für eine Amtszeit von sieben Jahren direkt von den Bürgern gewählt. Dresden: Ihm zur Seite gestellt sind sieben Beigeordnete, die für einzelne Geschäftskreise zuständig sind und diese eigenverantwortlich leiten. Dresden: Sie führen den Titel "Bürgermeister," wobei der "Erste Bürgermeister" den Oberbürgermeister ständig vertritt. Dresden: Dies kam Ende 2014 bis Mitte 2015 voll zum Tragen, da die Oberbürgermeisterin Helma Orosz aus gesundheitlichen Gründen vorzeitig i
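
Because neighbouring hits have overlapping windows, the loop above can add the same sentence to the context several times. The following sketch shows one possible variant that collects the window indices in a set and emits each sentence once, in document order. The function name `build_context` and the `window` parameter are hypothetical, and the toy sentences stand in for the real `sentences` list.

```python
def build_context(sentences, hit_indices, window=2):
    """Join each hit with `window` neighbours on both sides, without repeats."""
    selected = set()
    for idx in hit_indices:
        lo = max(0, idx - window)
        hi = min(len(sentences), idx + window + 1)
        selected.update(range(lo, hi))
    # Emit in document order so that neighbouring sentences stay adjacent.
    return " ".join(sentences[i] for i in sorted(selected))


# Toy data: ten numbered sentences and three hits.
toy_sentences = [f"Satz {i}." for i in range(10)]
print(build_context(toy_sentences, hit_indices=[1, 3, 9]))
```

In the notebook this could be called as `build_context(sentences, top_results[1].tolist())`. The trade-off is that document order replaces the ranking by similarity score; whether that changes the answers of the QA-model would need to be tested.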

# Use QA-model on created context
Now we use the QA-model on the much smaller context extracted from the original website(s).

The hope is that the sentence similarity mechanism keeps the information relevant to the question in this context. Irrelevant sentences are excluded by the much cheaper similarity method, which can be backed by a database index.

In [13]:
qa_pipeline({
    'question': question,
    'context': context
}, top_k=5)

[{'score': 0.8778463006019592,
  'start': 2683,
  'end': 2695,
  'answer': 'Dirk Hilbert'},
 {'score': 0.7034727931022644,
  'start': 2683,
  'end': 2695,
  'answer': 'Dirk Hilbert'},
 {'score': 0.17102676630020142,
  'start': 1092,
  'end': 1104,
  'answer': 'Dirk Hilbert'},
 {'score': 0.12940539419651031,
  'start': 1072,
  'end': 1104,
  'answer': 'Erste Bürgermeister Dirk Hilbert'},
 {'score': 0.11945375055074692,
  'start': 2683,
  'end': 2707,
  'answer': 'Dirk Hilbert von der FDP'}]