### Building a question answering system

For this work, we will build our question answering system with the [deepset framework](), which provides the components we need.

The system has the following components:
- The index store
- The document store
- The search pipeline, with a retriever and a reader model.

We will rely on the tutorials provided by deepset to build our system.

For this work, we will follow [this tutorial](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb).

### Building the dense passage retrieval

To build the dataset for dense passage retrieval, we will use the approach suggested by [this tutorial](https://huggingface.co/etalab-ia/dpr-question_encoder-fr_qa-camembert), which uses the CamemBERT model.

For each question, we have a single positive context (the paragraph that contains the answer to the question) and n hard negative contexts, which are the top-k candidates that do not contain the answer. To retrieve the negative contexts, they use the BM25 retrieval model.
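
As a rough illustration of the sample layout described above, the sketch below assembles one training record using the field names of the JSON layout popularised by the original DPR code base (`question`, `answers`, `positive_ctxs`, `hard_negative_ctxs`). The helper name `build_dpr_sample` and the toy passages are assumptions made for this example, and plain strings stand in for the BM25 results.

```python
from typing import Dict, List


def build_dpr_sample(
    question: str,
    answer: str,
    positive_context: str,
    bm25_candidates: List[str],
    n_hard_negatives: int = 2,
) -> Dict:
    """Assemble one DPR-style training record (hypothetical helper)."""
    # A hard negative is a highly ranked candidate that does NOT contain the answer.
    hard_negatives = [
        passage
        for passage in bm25_candidates
        if answer.lower() not in passage.lower() and passage != positive_context
    ][:n_hard_negatives]
    return {
        "question": question,
        "answers": [answer],
        "positive_ctxs": [{"title": "", "text": positive_context}],
        "negative_ctxs": [],
        "hard_negative_ctxs": [{"title": "", "text": p} for p in hard_negatives],
    }


sample = build_dpr_sample(
    question="Qui est le president de la RDC ?",
    answer="Felix Tshisekedi",
    positive_context="Felix Tshisekedi est le president de la RDC.",
    bm25_candidates=[
        "Felix Tshisekedi est le president de la RDC.",
        "Le president de l'Assemblee provinciale a recu la motion.",
        "Le president Felix Tshisekedi a visite Goma.",
        "Le president du club a demissionne.",
    ],
)
```

Candidates that mention the answer are skipped even when they rank highly, which is what makes the remaining passages "hard" negatives: they look relevant to BM25 but cannot answer the question.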

Before training the retrieval model, we have to build the document store and save the documents to it as retrieval units.

### Building the dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
from pathlib import Path
DATA_PATH = Path.cwd().joinpath("data")
assert DATA_PATH.exists(), "the data path does not exist"
TEXT_DATA_FOLDER = DATA_PATH.joinpath("corpus", "drc-news-txt")
assert TEXT_DATA_FOLDER.exists(), "the text data folder does not exist"

In [3]:
data_file_path = DATA_PATH.joinpath("corpus", "raw", 'drc-news-raws.csv')

In [4]:
data = pd.read_csv(data_file_path, names=["content", "posted_at"])

In [5]:
data.shape

(140638, 2)

In [5]:
data = data.fillna(value="")
data.head()

Unnamed: 0,content,posted_at
0,Les membres de la Commission tarifaire viennen...,2022-09-05 00:00:00
1,Les membres de la Commission tarifaire so...,2022-04-05 00:00:00
2,Vodacom Congo vient de signer un partenariat a...,2022-04-23 00:00:00
3,"Le sélectionneur des Léopards de la RDC, Hectó...",2022-03-05 00:00:00
4,Le protocole d’accord était déjà signé entre l...,2022-11-05 00:00:00


In [6]:
from haystack.nodes import TextConverter

  from .autonotebook import tqdm as notebook_tqdm
INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/
ERROR - root -  Failed to import 'magic' (from 'python-magic' and 'python-magic-bin' on Windows). FileTypeClassifier will not perform mimetype detection on extensionless files. Please make sure the necessary OS libraries are installed if you need this functionality.


In [7]:
from haystack.schema import Document
from secrets import token_hex

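# TODO: this is not working yet; it was supposed to save the documents to a dataframe.
def get_document_from_text(row):
    """Build a Document from the first non-empty paragraph of a row.

    Args:
        row (pd.Series): a row holding the article text ("content") and the
            date of the post ("posted_at").

    Returns:
        Optional[Document]: the first non-empty paragraph wrapped in a
            Document, or None when the text has no non-empty paragraph.
    """
    text = row["content"].replace(u'\xa0', u' ')  # normalise non-breaking spaces
    for paragraph in text.split("   "):
        if not paragraph.strip():  # skip empty paragraphs
            continue
        # Note: returning here means only the first non-empty paragraph is kept.
        return Document(content=paragraph, meta={"posted_at": row["posted_at"] if row["posted_at"] else ""})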

In [8]:
from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor
from haystack.utils import convert_files_to_docs

In [9]:
all_docs = data.sample(1000).apply(get_document_from_text, axis="columns")

In [14]:
all_docs.shape

(1000,)

In [10]:
all_docs = all_docs.dropna().to_list()
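
The function above keeps only the first non-empty paragraph of each article. If every paragraph should become its own retrieval unit, one possible variant is sketched below. It uses plain dictionaries instead of Haystack `Document` objects so that it runs on its own; the name `split_into_paragraph_records` is hypothetical, and the three-space separator is taken from the cell above.

```python
from typing import Dict, List


def split_into_paragraph_records(content: str, posted_at: str = "") -> List[Dict]:
    """Return one record per non-empty paragraph (hypothetical helper)."""
    text = content.replace("\xa0", " ")  # normalise non-breaking spaces
    records = []
    for paragraph in text.split("   "):  # paragraphs are separated by three spaces
        paragraph = paragraph.strip()
        if not paragraph:  # skip empty paragraphs
            continue
        records.append({"content": paragraph, "meta": {"posted_at": posted_at}})
    return records


records = split_into_paragraph_records(
    "Premier paragraphe.   Deuxieme\xa0paragraphe.      ", "2022-04-05"
)
```

Each record could then be converted with `Document.from_dict`, which the preprocessing code in this notebook already relies on.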

In [11]:
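import re  # used by clean() below
from copy import deepcopy
from typing import Generator, List, Optional, Set, Union

from haystack.errors import HaystackError
from haystack.nodes import PreProcessor
from haystack.schema import Document


class CustomPreProcessor(PreProcessor):
    """PreProcessor that runs an optional custom cleaning function before the standard cleaning."""

    def __init__(self, custom_preprocessor=None, **kwargs):
        super().__init__(**kwargs)
        # Callable that takes and returns a string; applied to each document's text first.
        self.custom_preprocessor = custom_preprocessor
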
    def clean(
        self,
        document: Union[dict, Document],
        clean_whitespace: bool,
        clean_header_footer: bool,
        clean_empty_lines: bool,
        remove_substrings: List[str],
        id_hash_keys: Optional[List[str]] = None,
    ) -> Document:
        """
        Perform document cleaning on a single document and return a single document. The custom preprocessor is applied first;
        this method then deals with whitespaces, headers, footers and empty lines. Its exact functionality is defined by the
        parameters passed into PreProcessor.__init__().
        """
        if id_hash_keys is None:
            id_hash_keys = self.id_hash_keys

        if isinstance(document, dict):
            document = Document.from_dict(document, id_hash_keys=id_hash_keys)

        # Mainly needed for type checking
        if not isinstance(document, Document):
            raise HaystackError("Document must not be of type 'dict' but of type 'Document'.")
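        text = document.content
        # Apply the custom cleaning function first, if one was provided.
        if self.custom_preprocessor is not None:
            text = self.custom_preprocessor(text)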
        if clean_header_footer:
            text = self._find_and_remove_header_footer(
                text, n_chars=300, n_first_pages_to_ignore=1, n_last_pages_to_ignore=1
            )

        if clean_whitespace:
            lines = text.splitlines()

            cleaned_lines = []
            for line in lines:
                line = line.strip()
                cleaned_lines.append(line)
            text = "\n".join(cleaned_lines)

        if clean_empty_lines:
            text = re.sub(r"\n\n+", "\n\n", text)

        for substring in remove_substrings:
            text = text.replace(substring, "")

        if text != document.content:
            document = deepcopy(document)
            document.content = text

        return document
    
    

In [12]:
import re
from gensim.utils import deaccent
from unicodedata import normalize as unicode_normalize

In [13]:
def replace_point(document):
    """Put spaces around a period that sits between two non-space characters
    (e.g. "dernier.16" becomes "dernier . 16") before tokenizing the document.

    TODO: this has a downside when the period is in the middle of a word.

    Args:
        document (str): the text to process.

    Returns:
        str: the text with such periods surrounded by spaces.
    """
    result = re.sub(r"(\S)\.(\S)", r"\1 . \2", document)
    return result

def replace_website_name(document):
    """Replace website names such as politico.cd, 7sur7.cd, actualite.cd or
    mediacongo.net with the placeholder SITE_WEB, before the proper cleaning.

    Args:
        document (str): the text to process.

    Returns:
        str: the text with the website names replaced by SITE_WEB.
    """
    # TODO: not sure this will work; it may be much better to replace with the first line of the match.
    # The dots are escaped so that they match a literal "." only, not any character.
    result = re.sub(r"7SUR7\.CD|politico\.cd|actualite\.cd|mediacongo\.net", r"SITE_WEB", document, flags=re.IGNORECASE)
    return result

def remove_accents(document):
    input_without_accent = deaccent(document)
    return input_without_accent

def pre_clean_document(document):
    """Pre-clean the document: remove accents, replace website names, put spaces
    around periods glued between two tokens, and strip unwanted text.

    TODO: splitting on periods has a downside when the period is in the middle of a word.

    Args:
        document (str): the raw text.

    Returns:
        str: the cleaned text.
    """
    result = remove_accents(document)
    result = replace_website_name(result)
    result = replace_point(result)
    result = re.sub(r"This post has already been read \d+ times!", "", result)  # remove unwanted text
    result = unicode_normalize("NFKD", result)
    return result
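
The TODO in `replace_point` notes that splitting on every period has a downside. One possible way to narrow the rule is sketched below: it only splits when a period glues the end of one sentence to the start of the next. The name `split_glued_sentences` and the exact pattern are assumptions for illustration, and the sketch presumes that accents were already removed and website names already replaced, as in `pre_clean_document`.

```python
import re

# Split only when a period glues a word or number to the start of a new sentence:
#   lowercase letter + "." + uppercase letter or digit   (e.g. "dernier.16")
#   digit + "." + uppercase letter                       (e.g. "2021.Selon")
GLUED_SENTENCE = re.compile(r"(?<=[a-z])\.(?=[A-Z0-9])|(?<=[0-9])\.(?=[A-Z])")


def split_glued_sentences(text: str) -> str:
    """Insert ' . ' between sentences that were glued together (hypothetical helper)."""
    return GLUED_SENTENCE.sub(" . ", text)


fixed = split_glued_sentences("le mardi 30 mars dernier.16 parmi les 17 deputes")
```

With this pattern, decimal numbers such as `3.5` and lowercase domain names such as `actualite.cd` are left untouched, at the cost of missing sentences that start with a lowercase letter.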

In [14]:
text_doc = """
Une motion de defiance a ete deposee au cabinet de la presidente de l'Assemblee provinciale du Maniema contre le vice-gouverneur Jean-Pierre Amadi, le mardi 30 mars dernier.16 parmi les 17 deputes provinciaux presents a Kindu, chef-lieu de la province du Maniema, ont appose leurs signatures sur ladite motion depuis le 27 mars 2021.Selon ce document consulte par 7SUR7.CD, Jean-Pierre Amadi Lubenga est reproche de plusieurs griefs dont le « refus d'obtemperer aux instructions de la hierarchie » pendant qu'il etait gouverneur de province a l'interim et le detournement des deniers publics.
comme signale a politico.cd sur notre site POLITICO.CD et puis ensuite sur actualite.cd et sur notre site mediacongo.net
"""

In [15]:
replace_website_name(text_doc)

"\nUne motion de defiance a ete deposee au cabinet de la presidente de l'Assemblee provinciale du Maniema contre le vice-gouverneur Jean-Pierre Amadi, le mardi 30 mars dernier.16 parmi les 17 deputes provinciaux presents a Kindu, chef-lieu de la province du Maniema, ont appose leurs signatures sur ladite motion depuis le 27 mars 2021.Selon ce document consulte par SITE_WEB, Jean-Pierre Amadi Lubenga est reproche de plusieurs griefs dont le « refus d'obtemperer aux instructions de la hierarchie » pendant qu'il etait gouverneur de province a l'interim et le detournement des deniers publics.\ncomme signale a SITE_WEB sur notre site SITE_WEB et puis ensuite sur SITE_WEB et sur notre site SITE_WEB\n"

In [16]:
preprocessor = CustomPreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=200,
    split_respect_sentence_boundary=True,
    language="fr",
    custom_preprocessor=pre_clean_document,
)


docs = preprocessor.process(all_docs)

print(f"n_files_input: {len(all_docs)}\nn_docs_output: {len(docs)}")

100%|██████████| 998/998 [00:01<00:00, 729.68docs/s]

n_files_input: 998
n_docs_output: 2160
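
To make the effect of `split_by="word"`, `split_length=200` and `split_respect_sentence_boundary=True` more concrete, here is a simplified sketch of the idea: whole sentences are packed greedily into chunks that stay within a word budget. This is not Haystack's implementation, which differs in its details (for example in how sentences are detected); the function name `chunk_by_words` and the naive sentence splitter are assumptions for this example.

```python
import re
from typing import List


def chunk_by_words(text: str, split_length: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `split_length` words.

    A single sentence longer than `split_length` is kept whole in its own chunk.
    """
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_len + n_words > split_length:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = chunk_by_words("Un deux trois. Quatre cinq. Six sept huit neuf.", split_length=5)
```

This is why the number of output documents is larger than the number of input files: every article longer than the word budget is split into several retrieval units.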





In [17]:
docs[0]

<Document: {'content': 'A l\'occasion de la ceremonie de remise et reprise avec le ministre sortant Emery Okundi, le nouveau ministre des Postes, Telecommunications et nouvelles technologies de l\'information et de la communication, a promis de defendre l\'interet de la Republique tout en rappelant que la Republique passe avant les interets individuels . Augustin Kibassa Maliba a dans la foulee rassure qu\'il ne cedera pas aux chantages. "Les chantages ne passeront pas, parce que je sais que la Republique passe avant nos individus ( . ..). Je defendrais avec toutes mes forces l\'interet de la Republique", a-t-il martele . Je suis conscient du fait que poursuit-il, sur le plan de la legislation, nous devons fournir enormement ( . ..) et au-dela de ca je sais que la legislation reflechit sur une legislation qui va permettre au pays de s\'adapter au standard international" . S\'agissant de la fibre optique, Augustin Kibassa a souligne sa determination a voir le pays etre relie. Mais d\'ap

In [18]:
from haystack.document_stores import ElasticsearchDocumentStore



In [19]:
document_store = ElasticsearchDocumentStore(index="drc-news", recreate_index=True, analyzer="french")

INFO - haystack.document_stores.elasticsearch -  Index 'drc-news' deleted.
INFO - haystack.document_stores.elasticsearch -  Index 'label' deleted.


In [20]:
document_store.write_documents(docs)

### Retrieval

With the document store in place and all the documents written to it, let's build a retriever that uses BM25 to retrieve documents.
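
For intuition about what BM25 computes, the following self-contained sketch scores tokenised documents against a query using a common formulation of BM25. Elasticsearch does the real work in this notebook and its implementation may differ in details; the function name `bm25_scores`, the toy corpus, and the parameter values `k1=1.2` and `b=0.75` are assumptions for this example.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], corpus: List[List[str]], k1: float = 1.2, b: float = 0.75
) -> List[float]:
    """Score every tokenised document in `corpus` against the tokenised `query`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalised by document length with b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "le president de la republique a parle".split(),
    "le match de football a ete reporte".split(),
    "la republique celebre son independance".split(),
]
scores = bm25_scores("president republique".split(), corpus)
```

The first document matches both query terms and scores highest, the third matches one term, and the second matches none and scores zero.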

In [21]:
custom_query_template = """
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "content": ${query}
        }
      },
      "negative": {
        "match": {
          "content": ${name_to_not_match}
        }
      },
      "negative_boost": 0.5
    }
  }
}
"""
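
The boosting query in the template above does not exclude documents that match the negative clause; it multiplies their relevance score by `negative_boost`. The sketch below mimics that behaviour on a list of `(text, score)` pairs. It is an illustration only: the function name `apply_negative_boost` is hypothetical, and a simple substring test stands in for Elasticsearch's analysed `match` query.

```python
from typing import List, Tuple


def apply_negative_boost(
    hits: List[Tuple[str, float]], unwanted: str, negative_boost: float = 0.5
) -> List[Tuple[str, float]]:
    """Mimic a boosting query: demote, but do not drop, hits that mention `unwanted`."""
    rescored = [
        (text, score * negative_boost if unwanted.lower() in text.lower() else score)
        for text, score in hits
    ]
    # Re-rank by the adjusted score, best first.
    return sorted(rescored, key=lambda hit: hit[1], reverse=True)


hits = [
    ("Le president Felix Tshisekedi a parle", 4.0),
    ("Le president du senat a parle", 3.0),
]
reranked = apply_negative_boost(hits, unwanted="Felix Tshisekedi")
```

Because demoted passages can still appear in the results, a final check that the answer string is absent may be needed before treating the retrieved passages as hard negatives.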

In [22]:
from haystack.nodes import BM25Retriever

In [23]:
bm25_retriever = BM25Retriever(document_store=document_store, all_terms_must_match=True, custom_query=custom_query_template)
bm25_retriever_positive = BM25Retriever(document_store=document_store, all_terms_must_match=True)

In [24]:
question = "le president de la Republique democratique du congo?"
answer = "Felix Tshisekedi"

In [25]:
def get_hard_negative_context(
    retriever: BM25Retriever, question: str, answer: str, n_ctxs: int = 10
):
    """
    Given the question and the answer, query the Elasticsearch document store and return the hard negative contexts for the question.
    """
    # Use the retriever and n_ctxs passed in, rather than the global retriever and a fixed top_k.
    documents = retriever.retrieve(query=question, top_k=n_ctxs, filters={"name_to_not_match": answer})
    return documents

In [26]:
get_hard_negative_context(bm25_retriever, question, answer)

[<Document: {'content': 'A Son Excellence Monsieur le President  de la Republique Democratique du Congo, Joseph KABILA KABANGE\nLa societe Tenke Fungurume Mining (TFM), sa direction et  tous ses employes presentent, a l’occasion du cinquante-troisieme anniversaire  de l’accession de la Republique Democratique du Congo a la souverainete  nationale et internationale, toutes leurs felicitations ainsi que leurs vœux de  bien-etre et de prosperite au pays et a toute la population congolaise. Tenke Fungurume est fiere de  reiterer son soutien indefectible a la revolution de la modernite, pronee par  le Chef de l’Etat, et sa contribution au developpement economique au travers  des projets dans les domaines des infrastructures, de l’eau, de l’emploi, de l’education,  de la sante et de la protection de l’environnement en RDC. Cette Journee de l’Independence commemore pour la  Republique Democratique du Congo une ere de paix, de prosperite et de reussite  socio-economique. Bonne et heureuse fete

The next step is to build the dense passage retrieval dataset: for each sentence, we will find the named entities, mask them, and query the database to find hard negatives.

### Building the Reader Dataset

With the documents added to the document store, the next step is to build the dense passage retrieval dataset.

We will treat each paragraph as the answer and generate different questions from the paragraph by masking its named entities, which yields a better score.

Once we have a question and its answer paragraph, we will retrieve the negative contexts with the code we wrote above.
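
The masking idea can be sketched as follows. The entity dictionaries use the keys returned by a Hugging Face token-classification pipeline with an aggregation strategy (`entity_group`, `score`, `word`, `start`, `end`), but the values here are written by hand for illustration. The function name `make_cloze_questions`, the `[MASK]` token, and the `min_score` threshold are assumptions, not the notebook's actual implementation.

```python
from typing import Dict, List, Tuple


def make_cloze_questions(
    text: str, entities: List[Dict], mask_token: str = "[MASK]", min_score: float = 0.9
) -> List[Tuple[str, str]]:
    """Return (masked_text, answer) pairs, one per sufficiently confident entity."""
    pairs = []
    for entity in entities:
        if entity["score"] < min_score:  # skip low-confidence entities
            continue
        start, end = entity["start"], entity["end"]
        # Replace the entity span with the mask; the span itself becomes the answer.
        masked = text[:start] + mask_token + text[end:]
        pairs.append((masked, text[start:end]))
    return pairs


text = "Augustin Kibassa est le nouveau ministre des Postes."
# Hand-written entities in the pipeline's output format (not real model output).
entities = [
    {"entity_group": "PER", "score": 0.99, "word": "Augustin Kibassa", "start": 0, "end": 16},
    {"entity_group": "ORG", "score": 0.55, "word": "Postes", "start": 45, "end": 51},
]
pairs = make_cloze_questions(text, entities)
```

Each masked sentence plays the role of the question, and the masked entity is the answer to look for when separating positive contexts from hard negatives.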

#### NER on the Text

In [27]:
from transformers import AutoTokenizer, AutoModelForTokenClassification


# This model is good, but it does not classify roles precisely (it confuses "ministre" and "ministere"); we need to improve that.
tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner-with-dates")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner-with-dates")

In [28]:
from transformers import pipeline

In [29]:
transformer_ner_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")

In [30]:
from tqdm import tqdm
import json

In [31]:
import spacy

spacy_pipeline = spacy.load("fr_dep_news_trf")



In [32]:

random_id = np.random.randint(0, len(docs))
sample_document = docs[random_id]

In [33]:
sample_document

<Document: {'content': 'A cet effet, il visera a encourager les efforts continentaux et mondiaux pour renforcer la cooperation et l’integration regionales en tant que catalyseur pour la realisation des projets d’hydroelectricite renouvelable et des investissements, de l’innovation et de la mise en œuvre de l’efficacite energetique sur le continent . En effet, une cooperation regionale ciblee peut relever certains des defis et des obstacles a l’exploitation durable des marches de l’energie et des technologies climatiques, en creant les economies d’echelle necessaires et en permettant des progres plus equilibres avec des effets d’entrainement entre les pays. Ayant reconnu l’importance des economies d’echelle dans la production ancree sur des marches bien etablis, les gouvernements africains ont charge l’AUDA-NEPAD, la CUA, la Banque Africaine de Developpement (BAD), la Commission Economique pour l’Afrique (CEA) et les partenaires au developpement d’elaborer conjointement un schema direct

In [45]:
from src.data.corpus_builder_utils import DocumentContext, AllCorpusBuilder, Sentence

In [39]:
sample_document_context = DocumentContext(sample_document.content, 
                                          spacy_pipeline=spacy_pipeline,
                                          ner_pipeline=transformer_ner_pipeline,
                                          bm25_retriever_positive=bm25_retriever_positive,)



In [40]:
sample_document_context.generate_sentences()

In [41]:
BASE_QA_PATH = DATA_PATH.joinpath("processed", "DRC-News-UQA")
assert BASE_QA_PATH.exists()

In [46]:
corpus_builder = AllCorpusBuilder(
    ner_pipeline=spacy_pipeline,
    transformer_pipeline=transformer_ner_pipeline,
    retriever=bm25_retriever_positive,
    all_docs=all_docs[0:5],
    base_folder=BASE_QA_PATH,
)

In [47]:
corpus_builder.build_corpus()

generating dataset: 100%|██████████| 5/5 [00:17<00:00,  3.53s/it]


With all the utility functions in place, we can go two ways from here:
- train a dense passage retriever
- train a reader, leveraging our BM25 model and following the paper behind [span selection pretraining](https://github.com/IBM/span-selection-pretraining/blob/master/sspt/sspt_gen_async.py).

#### Implementation of the Training

Retriever details

For the retriever, we will use BM25 to retrieve the positive and the negative contexts for each question.

The positive contexts are the paragraphs where the answer is located, and the hard negative contexts are retrieved contexts that do not contain the answer.

We will use only the positive contexts to train our model; for fine-tuning the dense passage retriever, we will use both the positive and the negative contexts.
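
To show how positive and negative contexts enter dense passage retrieval fine-tuning, the sketch below computes the usual DPR-style objective: the negative log-likelihood of the positive passage under a softmax over the dot-product scores of all candidate passages. It works on plain lists of floats standing in for encoder outputs; the function names and the toy vectors are assumptions for this example, not the training code that Haystack runs.

```python
import math
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    """Dot-product similarity between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def dpr_nll_loss(
    question: List[float], positive: List[float], negatives: List[List[float]]
) -> float:
    """Negative log-likelihood of the positive passage among all candidates."""
    scores = [dot(question, positive)] + [dot(question, neg) for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[0]


question_vec = [1.0, 0.0]
negative_vecs = [[0.0, 1.0], [-1.0, 0.0]]
loss_good = dpr_nll_loss(question_vec, [2.0, 0.0], negative_vecs)   # positive is close to the question
loss_bad = dpr_nll_loss(question_vec, [-1.0, 0.0], negative_vecs)   # positive is far from the question
```

The loss is small when the question embedding is closer to the positive passage than to the hard negatives, which is the behaviour fine-tuning tries to produce.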