### Building the Question Answering System

For this work, we will build our question answering system with the [deepset framework](), which provides the components we need.

The system has the following components:
- The index store
- The document store
- The search pipeline, with a Retriever and a Reader model.

We will rely on the tutorials provided by deepset to build our system.

For this work, we will follow [this tutorial](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb).

### Building the dense passage retrieval

To build the dataset for dense passage retrieval, we will use the approach described in [this model card](https://huggingface.co/etalab-ia/dpr-question_encoder-fr_qa-camembert), which uses the CamemBERT model.

For each question, we have a single positive context, the paragraph where the answer to the question is located, and n hard negative contexts, which are the top-k candidates that do not contain the answer to the question. To retrieve the negative contexts, they use the BM25 retrieval model.
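To make the structure concrete, here is a minimal sketch of what one training sample could look like. The example texts are invented for illustration, and the helper `is_valid_dpr_sample` is a hypothetical name, not part of any library.

```python
# Hypothetical sample in the DPR training layout; the passages are invented.
dpr_sample = {
    "question": "Qui est le president de la Republique democratique du Congo ?",
    "answers": ["Felix Tshisekedi"],
    "positive_ctxs": [
        {"title": "", "text": "Le president Felix Tshisekedi a preside le conseil des ministres."}
    ],
    "negative_ctxs": [],
    "hard_negative_ctxs": [
        {"title": "", "text": "Le president de l'Assemblee nationale a ouvert la session."}
    ],
}


def is_valid_dpr_sample(sample):
    """Return True if every positive context contains an answer and no hard negative does."""
    answers = [answer.lower() for answer in sample["answers"]]

    def contains_answer(ctx):
        return any(answer in ctx["text"].lower() for answer in answers)

    positives_ok = all(contains_answer(ctx) for ctx in sample["positive_ctxs"])
    negatives_ok = not any(contains_answer(ctx) for ctx in sample["hard_negative_ctxs"])
    return positives_ok and negatives_ok
```

A check like this can be run over the generated dataset to catch hard negatives that accidentally contain the answer.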

Before training the retrieval model, we have to build the document store and save the documents to it as units of retrieval.
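As an illustration of what a "unit of retrieval" means here, the sketch below packs whole sentences into passages of at most `max_words` words. It is a simplified stand-in, not the Haystack `PreProcessor` implementation: the function name `split_into_passages` and the naive regex sentence splitter are assumptions made for this example.

```python
import re


def split_into_passages(text, max_words=100):
    """Greedily pack whole sentences into passages of at most `max_words` words.

    A single sentence longer than `max_words` is kept whole as its own passage.
    """
    # Naive sentence splitting: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    passages, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current passage if adding this sentence would exceed the limit.
        if current and current_len + n_words > max_words:
            passages.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        passages.append(" ".join(current))
    return passages
```

Keeping sentences intact matters later on, because the questions are generated by masking entities inside each passage.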

### Building the dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
from pathlib import Path
DATA_PATH = Path.cwd().joinpath("data")
assert DATA_PATH.exists(), "the data path does not exist"
TEXT_DATA_FOLDER = DATA_PATH.joinpath("corpus", "drc-news-txt")
assert TEXT_DATA_FOLDER.exists(), "the text data folder does not exist"

In [3]:
data_file_path = DATA_PATH.joinpath("corpus", "raw", 'drc-news-raws.csv')

In [4]:
data = pd.read_csv(data_file_path, names=["content", "posted_at"])

In [5]:
data.shape

(140638, 2)

In [6]:
data = data.fillna(value="")
data.head()

Unnamed: 0,content,posted_at
0,Les membres de la Commission tarifaire viennen...,2022-09-05 00:00:00
1,Les membres de la Commission tarifaire so...,2022-04-05 00:00:00
2,Vodacom Congo vient de signer un partenariat a...,2022-04-23 00:00:00
3,"Le sélectionneur des Léopards de la RDC, Hectó...",2022-03-05 00:00:00
4,Le protocole d’accord était déjà signé entre l...,2022-11-05 00:00:00


In [7]:
from haystack.nodes import TextConverter

  from .autonotebook import tqdm as notebook_tqdm
INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/
ERROR - root -  Failed to import 'magic' (from 'python-magic' and 'python-magic-bin' on Windows). FileTypeClassifier will not perform mimetype detection on extensionless files. Please make sure the necessary OS libraries are installed if you need this functionality.


In [9]:
from haystack.schema import Document

# TODO: this was supposed to save the documents to a dataframe; it is not used in the rest of the notebook.
def get_document_from_text(row):
    """Yield one Document per non-empty paragraph of a dataframe row.

    Args:
        row: a numpy row holding the text at index 0 and the post date at index 1.

    Yields:
        Document: one document per paragraph, with the post date in its metadata.
    """
    text = row[0].replace("\xa0", " ")  # replace non-breaking spaces
    for paragraph in text.split("   "):
        if not paragraph.strip():  # skip empty paragraphs
            continue
        # yield instead of return, so that every paragraph is emitted and not only the first one
        yield Document(content=paragraph, meta={"posted_at": row[1] if row[1] else ""})

In [10]:
# Write a random sample of 1000 articles to disk, one text file per article.
for index, row in enumerate(data.sample(1000).loc[:, ["content"]].values):
    with open(TEXT_DATA_FOLDER.joinpath(f"{index}.txt"), "w", encoding="utf-8") as f:
        f.write(row[0])

In [11]:
from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor
from haystack.utils import convert_files_to_docs

In [12]:
all_docs = convert_files_to_docs(dir_path=TEXT_DATA_FOLDER)

INFO - haystack.utils.preprocessing -  Converting /Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/corpus/drc-news-txt/289.txt
INFO - haystack.utils.preprocessing -  Converting /Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/corpus/drc-news-txt/504.txt
INFO - haystack.utils.preprocessing -  Converting /Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/corpus/drc-news-txt/262.txt
INFO - haystack.utils.preprocessing -  Converting /Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/corpus/drc-news-txt/276.txt
INFO - haystack.utils.preprocessing -  Converting /Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/corpus/drc-news-txt/510.txt
INFO - haystack.utils.preprocessing -  Converting /Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/corpus/drc-news-txt/538.txt
INFO - haystack.utils.preprocessing -  Converting /Users/es.py/Projects/Personal/multilingual-drc-news-chatbot/data/corpus/drc-news-txt/

In [13]:
import re  # used by clean() below
from copy import deepcopy
from typing import List, Optional, Union

from haystack.errors import HaystackError
from haystack.nodes import PreProcessor
from haystack.schema import Document

class CustomPreProcessor(PreProcessor):
    def __init__(self, custom_preprocessor=None, **kwargs):
        super().__init__(**kwargs)
        self.custom_preprocessor = custom_preprocessor
    def clean(
        self,
        document: Union[dict, Document],
        clean_whitespace: bool,
        clean_header_footer: bool,
        clean_empty_lines: bool,
        remove_substrings: List[str],
        id_hash_keys: Optional[List[str]] = None,
    ) -> Document:
        """
        
        Perform document cleaning on a single document and return a single document. This method will deal with whitespaces, headers, footers
        and empty lines. Its exact functionality is defined by the parameters passed into PreProcessor.__init__().
        """
        if id_hash_keys is None:
            id_hash_keys = self.id_hash_keys

        if isinstance(document, dict):
            document = Document.from_dict(document, id_hash_keys=id_hash_keys)

        # Mainly needed for type checking
        if not isinstance(document, Document):
            raise HaystackError("Document must not be of type 'dict' but of type 'Document'.")
        text = document.content
        # Apply the user-supplied cleaning function, if any, before the built-in steps
        if self.custom_preprocessor is not None:
            text = self.custom_preprocessor(text)
        if clean_header_footer:
            text = self._find_and_remove_header_footer(
                text, n_chars=300, n_first_pages_to_ignore=1, n_last_pages_to_ignore=1
            )

        if clean_whitespace:
            lines = text.splitlines()

            cleaned_lines = []
            for line in lines:
                line = line.strip()
                cleaned_lines.append(line)
            text = "\n".join(cleaned_lines)

        if clean_empty_lines:
            text = re.sub(r"\n\n+", "\n\n", text)

        for substring in remove_substrings:
            text = text.replace(substring, "")

        if text != document.content:
            document = deepcopy(document)
            document.content = text

        return document
    
    

In [14]:
import re
from gensim.utils import deaccent
from unicodedata import normalize as unicode_normalize

In [15]:
def replace_point(document):
    """Put spaces around a period that sits between two non-space characters
    (for example "fin.Le" becomes "fin . Le") before tokenizing the document.

    TODO: this has a downside when the period is in the middle of a word.

    Args:
        document (str): the text to process.
    """
    result = re.sub(r"(\S)\.(\S)", r"\1 . \2", document)
    return result

def remove_accents(document):
    """Strip the accents from the text."""
    input_without_accent = deaccent(document)
    return input_without_accent

def pre_clean_document(document):
    """Pre-clean the document: remove the accents, put spaces around periods
    glued between two words, drop the read-counter banner and normalize the unicode.

    TODO: replace_point has a downside when the period is in the middle of a word.
    Any other cleaning we want to do can be added here.

    Args:
        document (str): the text to clean.
    """
    result = remove_accents(document)
    result = replace_point(result)
    result = re.sub(r"This post has already been read \d+ times!", "", result)  # remove unwanted text
    result = unicode_normalize("NFKD", result)
    return result

In [17]:
all_docs[222]

<Document: {'content': 'This post has already been read 649 times!Le 25 février 2022, le conseiller d’État et ministre des Affaires étrangères Wang Yi a eu des conversations téléphoniques respectivement avec la ministre britannique des Affaires étrangères Liz Truss, le haut représentant de l’Union européenne pour les affaires étrangères et la politique de sécurité Josep Borrell, et le conseiller diplomatique du président français Emmanuel Bonne. Ils ont eu des échanges de vues approfondis principalement sur la situation en Ukraine.Wang Yi a exposé la position fondamentale de la Chine sur la question de l’Ukraine, qui se résume aux cinq points suivants :«Premièrement, la Chine est fermement d’avis que la souveraineté et l’intégrité territoriale de tous les pays doivent être respectées et protégées et que les buts et principes de la Charte des Nations unies doivent être respectés avec sérieux. Cette position de la Chine est cohérente et claire, et s’applique également à la question de l’

In [18]:
preprocessor = CustomPreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
    language="fr",
    custom_preprocessor=pre_clean_document,
)


docs = preprocessor.process(all_docs)

print(f"n_files_input: {len(all_docs)}\nn_docs_output: {len(docs)}")

HaystackError: Document must not be of type 'dict' but of type 'Document'.

In [19]:
docs[0]

NameError: name 'docs' is not defined

In [32]:
from haystack.document_stores import ElasticsearchDocumentStore



In [33]:
document_store = ElasticsearchDocumentStore(index="drc-news", recreate_index=True, analyzer="french")

INFO - haystack.document_stores.elasticsearch -  Index 'drc-news' deleted.
INFO - haystack.document_stores.elasticsearch -  Index 'label' deleted.


In [34]:
document_store.write_documents(docs)

### Retrieval

With the document store in place and all the documents written to it, let us build a retriever that uses BM25 to retrieve documents.
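For intuition about what BM25 computes, here is a small self-contained sketch of the Okapi BM25 formula over whitespace-tokenized text. It is only an illustration: Elasticsearch does the real scoring with its own analyzers, and the function name `bm25_scores` as well as the defaults `k1=1.2` and `b=0.75` are choices made for this example. The sketch assumes a non-empty list of non-empty documents.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length with b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores
```

Because the score depends only on term overlap, passages that share many words with the question but miss the answer rank high, which is what makes them useful hard negatives.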

In [37]:
custom_query_template = """
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "content": ${query}
        }
      },
      "negative": {
        "match": {
          "content": ${name_to_not_match}
        }
      },
      "negative_boost": 0.5
    }
  }
}
"""

In [21]:
print(custom_query_template)


{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "content": ${query}
        }
      },
      "negative": {
        "match": {
          "content": ${name_to_not_match}
        }
      },
      "negative_boost": 0.5
    }
  }
}



In [35]:
from haystack.nodes import BM25Retriever

In [38]:
bm25_retriever = BM25Retriever(document_store=document_store, all_terms_must_match=True, custom_query=custom_query_template)

In [42]:
question = "Qui est le president de la Republique democratique du congo?"
answer = "Felix Tshisekedi"

In [40]:
def get_hard_negative_context(
    retriever: BM25Retriever, question: str, answer: str, n_ctxs: int = 10
):
    """
    Given the question and the answer, query the Elasticsearch document store
    and return the hard negative contexts for the question.
    """
    # Use the retriever and n_ctxs passed as arguments rather than the global retriever and a fixed top_k.
    documents = retriever.retrieve(query=question, top_k=n_ctxs, filters={"name_to_not_match": answer})
    return documents

In [43]:
get_hard_negative_context(bm25_retriever, question, answer)



[<Document: {'content': "Le coordonateur provincial de la jeunesse Kabiliste section du Lualaba appelle les acteurs politiques a lutter contre le tribalisme et a privilegier l'unite de la Republique Democratique du Congo . Bouphon Fangana a lance cet appel, le samedi 16 janvier 2021, a l'occasion de la commemoration de 20 ans de la disparition de l'ancien president de la Republique democratique du Congo, Mzee Laurent Desire Kabila . Les jeunes Kabilistes entendent voir les acteurs politiques regarder tous dans une meme direction pour pereniser les valeurs pronees par cet ancien president de la republique qu'il qualifie de « soldat du peuple ».", 'content_type': 'text', 'score': 0.7885887221778072, 'meta': {'posted_at': '', '_split_id': 0}, 'embedding': None, 'id': 'c36f00bf0e109a8193998ab6b565e4f'}>,
 <Document: {'content': 'Selon les observateurs, le president de la Republique Democratique du Congo, Joseph Kabila Kabange, et le 1er ministre Augustin Matata Ponyo ont interet a preter u

The next step is to build the dense passage retrieval dataset: for each sentence, we will find the named entities, mask them, and query the document store to find hard negatives.

### Building the Dense Passage Retrieval Dataset

With the documents added to the document store, the next step is to build the dense passage retrieval dataset.

We will consider each paragraph as the answer, and we will generate different questions from the paragraph by masking its named entities, which yields a better score.
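The masking idea can be sketched without loading the NER model. In the example below the entity spans are written by hand in the shape a token-classification pipeline returns (`entity_group`, `word`, `start`, `end`); the sentence and the helper name `mask_entity` are invented for illustration.

```python
def mask_entity(text, start, end, mask="<MASK>"):
    """Replace the characters text[start:end] with the mask token."""
    return f"{text[:start]}{mask}{text[end:]}"


# Invented example sentence; the offsets are computed rather than hard-coded.
sentence = "Felix Tshisekedi est arrive a Goma."
entities = []
for group, word in [("PER", "Felix Tshisekedi"), ("LOC", "Goma")]:
    start = sentence.index(word)
    entities.append({"entity_group": group, "word": word, "start": start, "end": start + len(word)})

# One (question, answer) pair per entity: the masked sentence and the masked-out word.
pairs = [(mask_entity(sentence, e["start"], e["end"]), e["word"]) for e in entities]
```

Each paragraph therefore produces as many questions as it has retained entities, all sharing the same positive context.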

Once we have a question and its answer paragraph, we will retrieve the negative contexts with the code we wrote above.

#### NER on the Text

In [28]:
from transformers import AutoTokenizer, AutoModelForTokenClassification


# This model is good, but it does not classify roles accurately: it confuses "ministre" and "ministere". We need to improve that.
tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner-with-dates")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner-with-dates")

In [30]:
from transformers import pipeline

In [31]:
nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")

In [153]:
def replace_between(text, begin, end, word_to_replace, alternative='<MASK>'):
    """Replace text[begin:end] with the alternative token, padded with spaces."""
    # `end` is exclusive, so the remainder starts at text[end:];
    # text[end+1:] would drop the character that follows the entity.
    return f"{text[:begin]} {alternative} {text[end:]}"

def filter_entities(entities):
    """Keep only the PER, ORG, LOC and DATE entities with a score of at least 85%.

    Args:
        entities (list[dict]): the entities returned by the NER pipeline.
    """
    return [entity for entity in entities if entity.get("entity_group") in ["PER", "ORG", "LOC", "DATE"] and entity.get("score") >= 0.85]

In [154]:
def build_question_answers_from_sentences(sentence, nlp):
    """Given a sentence, build the questions and the answers from the sentence.

    Args:
        sentence (str): the sentence to run the NER on.
        nlp: the NLP pipeline that performs the NER.

    Yields:
        dict: the entity details, with the masked sentence under "sentence_with_mask".
    """
    entities = nlp(sentence)
    filtered_entities = filter_entities(entities)
    for entity_dict in filtered_entities:
        start_index = entity_dict.get("start")
        end_index = entity_dict.get("end")
        token = entity_dict.get("word")
        sentence_with_mask = replace_between(sentence, start_index, end_index, token, alternative=' <MASK> ')
        entity_dict["sentence_with_mask"] = sentence_with_mask
        yield entity_dict

In [47]:
sample_document = docs[6]

In [50]:
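def get_hard_negative_context(
    retriever: BM25Retriever, question: str, answer: str, n_ctxs: int = 15
):
    """
    Query the document store through the retriever and return the retrieved passages
    that do not contain the answer, formatted as DPR hard negative contexts.
    """
    list_hard_neg_ctxs = []
    # Query the retriever directly: calling get_hard_negative_context here would recurse forever.
    retrieved_docs = retriever.retrieve(
        query=question, top_k=n_ctxs, filters={"name_to_not_match": answer}
    )
    for index, retrieved_doc in enumerate(retrieved_docs):
        retrieved_doc_text = retrieved_doc.content  # Documents expose their text as `content`
        if answer.lower() in retrieved_doc_text.lower():
            continue  # the passage contains the answer, so it is not a negative
        list_hard_neg_ctxs.append(
            {"title": f"document_{index}", "text": retrieved_doc_text}
        )
    return list_hard_neg_ctxs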

<Document: {'content': 'Ne en 1945, Kitenge Yesu est decede le 31 mai dernier a l’age de 76 ans et son enterrement se deroulera dans un cadre strictement prive, Selon plusieurs sources a la presidence de la Republique.', 'content_type': 'text', 'score': None, 'meta': {'name': '510.txt', '_split_id': 2}, 'embedding': None, 'id': '65fba14bdc58dee371d1cd95435bf635'}>

In [98]:
context =  build_question_answers_from_sentences(sample_document.content, nlp)

In [99]:
list(context)[0]

{'entity_group': 'LOC',
 'score': 0.9889691,
 'word': 'Rwanda',
 'start': 118,
 'end': 125,
 'sentence_with_mask': 'Cette carte indique egalement les parties cedees par chacune des puissances en presence. Ce qui a permis d’une part au  <MASK>  d’avoir acces sur les eaux du Lac Kivu, et d’autre part a l’Ouganda d’avoir acces sur les eaux du lac Edouard. L’ordonnance n°21/12 du 12 janvier 1953 modifiant l’ordonnance n021/258 du 14 aout 1949 fixant l’organisation territoriale du Rwanda -Urundi, fixe a neuf (9) le nombre de territoires pour le Rwanda et leurs delimitations. Parmi ces neuf territoires, trois (3) ont des frontieres communes avec la Republique Democratique du Congo.'}

In [113]:
from tqdm import tqdm

In [155]:
# https://github.com/philipperemy/Stanford-OpenIE-Python
# Check this out to build better questions.
def create_dpr_training_dataset(
    docs, retriever: BM25Retriever, num_hard_negative_ctxs: int = 30
):
    """Yield one DPR training sample per named entity found in each document."""
    n_non_added_questions = 0
    n_questions = 0
    for doc in tqdm(docs):
        # `nlp` is the NER pipeline defined above.
        entities_details = build_question_answers_from_sentences(doc.content, nlp)
        for entity_detail in entities_details:
            context = doc.content
            # The question is the paragraph with the entity removed.
            question = entity_detail.get("sentence_with_mask").replace("<MASK>", "")
            answer = entity_detail.get("word")
            hard_negative_contexts = get_hard_negative_context(
                retriever=retriever,
                question=question,
                answer=answer,
                n_ctxs=num_hard_negative_ctxs,
            )
            positive_context = [{"text": context}]
            if not hard_negative_contexts or not positive_context:
                print(f"No retrieved candidates for question: {question}")
                n_non_added_questions += 1
                continue
            dict_DPR = {
                "question": question,
                "answers": [answer],  # the DPR format expects a list of answers
                "positive_ctxs": positive_context,
                "negative_ctxs": [],
                "hard_negative_ctxs": hard_negative_contexts,
            }
            n_questions += 1
            yield dict_DPR

In [105]:
qa_dpr_path = DATA_PATH.joinpath("raw", "french-qa", "DPR-news-with-mast.json")

In [106]:
assert qa_dpr_path.parent.exists()

In [111]:
import json

In [118]:
len(docs)

4719

In [156]:
dpr_results = create_dpr_training_dataset(docs, bm25_retriever, num_hard_negative_ctxs=30)

In [157]:

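def write_retrieves_to_json(dpr_results, path):
    # Consume the generator and write all the samples to a single JSON file.
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(list(dpr_results), json_file, indent=4, ensure_ascii=False)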

In [158]:
write_retrieves_to_json(dpr_results, qa_dpr_path)

  0%|          | 10/4719 [00:07<57:42,  1.36it/s]  


JSONDecodeError: Expecting ',' delimiter: line 7 column 321 (char 393)

In [1]:
all_docs[0]

NameError: name 'all_docs' is not defined