#### Train Embedding on Congo News Dataset

In this notebook I will fine-tune the CamemBERT model on the Congo news dataset. The end goal is a model that is fine-tuned on this dataset and can be used for various NLP tasks, with a focus on topic modeling.

The first step consists of reading the data from the database where it is saved and preprocessing it into paragraphs.


In [3]:
from os import getenv
from dotenv import load_dotenv, find_dotenv
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from urllib.parse import quote


In [4]:
load_dotenv()
database_url = 'postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}'\
        .format(user=getenv('POSTGRES_USER'),
                password=quote(getenv('POSTGRES_PASSWORD')),
                host=getenv('POSTGRES_HOST'),
                database=getenv('POSTGRES_DB'),
                port=getenv('POSTGRES_PORT'))
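
The `quote` call in the cell above matters whenever the password contains characters that are reserved in a URL. The sketch below uses only the standard library and made-up values (`user`, `p@ss:w/rd`, `localhost`, `5432` and `congo_news` are all hypothetical) to show how an unescaped password shifts the host boundary. Note that `quote` leaves `/` unescaped by default, so the sketch passes `safe=""`; whether that is needed depends on the passwords actually in use.

```python
from urllib.parse import quote, unquote, urlsplit

TEMPLATE = "postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"


def build_url(password):
    # Hypothetical connection values, for illustration only.
    return TEMPLATE.format(user="user", password=password,
                           host="localhost", port=5432, database="congo_news")


raw_password = "p@ss:w/rd"  # made-up password containing reserved characters

# Without escaping, the '@' and '/' inside the password move the host boundary.
broken = urlsplit(build_url(raw_password))
print(broken.hostname)

# With escaping, the host is parsed as intended and the password round-trips.
escaped = urlsplit(build_url(quote(raw_password, safe="")))
print(escaped.hostname, escaped.port)
print(unquote(escaped.password) == raw_password)
```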

In [5]:
engine = create_engine(database_url)
Session = sessionmaker(bind=engine)
session = Session()

In [8]:
query = "select title, content, summary, posted_at, website_origin from article"

After collecting the data from our database, let's process it. We will leverage the processing functions defined in Haystack.

In [6]:
import pandas as pd

In [9]:
with engine.connect() as connection:
    data = pd.read_sql_query(sql=query, con=connection, parse_dates=['posted_at'])

In [10]:
data.head()

Unnamed: 0,title,content,summary,posted_at,website_origin
0,RDC : Les Jeux de la Francophonie ont stimulé ...,« En plus des résultats plus que satisfaisants...,,NaT,https://www.7sur7.cd
1,Ituri : 3 civils tués par des miliciens CODECO...,Trois (3) civils ont été tués et 2 autres griè...,,NaT,https://www.7sur7.cd
2,RDC : Le succès des Jeux de la Francophonie pr...,Cette Organisation de la société civile félici...,,NaT,https://www.7sur7.cd
3,Beni : 2 ADF de nationalité ougandaise tués pa...,Selon le porte-parole du secteur opérationnel ...,,NaT,https://www.7sur7.cd
4,Kinshasa : 5 morts dans l’incendie provoqué pa...,C'est ce qu'a affirmé le vice-gouverneur de la...,,NaT,https://www.7sur7.cd


In [11]:
data.content = data.content.str.replace(u'\xa0', u' ').str.split("  ")

In [12]:
data.shape

(66570, 5)

In [13]:
data = data.explode("content")

In [14]:
data.shape

(182985, 5)
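
To make the jump from 66,570 to 182,985 rows concrete, here is a minimal pure-Python sketch of what the split and `explode` cells do: each article row is repeated once per paragraph. The two sample rows and the helper names `split_paragraphs` and `explode` are invented for illustration; the double-space separator is taken from the cell above.

```python
def split_paragraphs(content):
    # Replace non-breaking spaces with plain spaces, then split on double spaces.
    return content.replace("\xa0", " ").split("  ")


def explode(rows, column):
    # Emit one output row per list element; the other fields are copied as-is.
    out = []
    for row in rows:
        for item in row[column]:
            out.append({**row, column: item})
    return out


rows = [
    {"title": "A", "content": "Premier paragraphe.  Deuxième\xa0paragraphe."},
    {"title": "B", "content": "Un seul paragraphe."},
]
for row in rows:
    row["content"] = split_paragraphs(row["content"])

exploded = explode(rows, "content")
print(len(rows), "articles ->", len(exploded), "paragraph rows")
```

Because the non-breaking space is replaced first, a non-breaking space next to a regular space also produces a double space and therefore a paragraph break.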

In [15]:
import re
from unicodedata import normalize as unicode_normalize, category as unicode_category


In [16]:
def deaccent(text):
    """
    Remove accentuation from the given string. Input text is either a unicode string or utf8 encoded bytestring.

    Return input string with accents removed, as unicode.

    >>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
    'Sef chomutovskych komunistu dostal postou bily prasek'

    """
    if not isinstance(text, str):
        # assume utf8 for byte strings, use default (strict) error handling
        text = text.decode('utf8')
    norm = unicode_normalize("NFD", text)
    result = ''.join(ch for ch in norm if unicode_category(ch) != 'Mn')
    return unicode_normalize("NFC", result)
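
As a quick check on French text, here is a compact sketch of the same decompose, filter, recompose idea. The name `strip_accents` and the sample sentence are hypothetical; the logic mirrors `deaccent` above.

```python
from unicodedata import category, normalize


def strip_accents(text):
    # Decompose (NFD), drop combining marks (category "Mn"), then recompose (NFC).
    decomposed = normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if category(ch) != "Mn")
    return normalize("NFC", kept)


sample = "Trois civils ont été tués à Kinshasa"
print(strip_accents(sample))
print(strip_accents("l’incendie"), strip_accents("œuvre"))
```

Only combining marks are removed, so the typographic apostrophe and ligatures such as `œ` are left untouched.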


In [60]:
def replace_point(document):
    """Put spaces around a period that sits between two non-space characters
    (for example "mot.Mot" becomes "mot . Mot") before tokenizing the document.

    TODO: this also splits a period in the middle of a word, a number or a domain name.

    Args:
        document (str): the text to clean.

    Returns:
        str: the text with such periods surrounded by spaces.
    """
    result = re.sub(r"(\S)\.(\S)", r"\1 . \2", document)
    return result

def replace_website_name(document):
    """Sometimes the document contains a site name such as politico.cd, 7sur7.cd,
    actualite.cd or mediacongo.net. Replace it with the SITE_WEB placeholder before proper cleaning.

    Args:
        document (str): the text to clean.
    """
    # TODO: not sure this will work; it may be better to replace by the first line of the match.
    # The dots are escaped so that they match a literal period only.
    result = re.sub(r"7SUR7\.CD|politico\.cd|actualite\.cd|mediacongo\.net", "SITE_WEB", document, flags=re.IGNORECASE)
    return result

def remove_accents(document):
    input_without_accent = deaccent(document)
    return input_without_accent

def pre_clean_document(document):
    """Pre-clean the document before tokenizing: remove accents, replace site names,
    put spaces around periods that sit between two non-space characters, and strip leftover boilerplate.

    TODO: splitting periods may have a downside when the period is in the middle of a word.

    Args:
        document (str): the text to clean.
    """
    result = remove_accents(document)
    # Site names must be replaced before the periods inside them are split.
    result = replace_website_name(result)
    result = replace_point(result)
    result = re.sub(r"This post has already been read \d+ times!", "", result) # remove unwanted text
    result = unicode_normalize("NFKD", result)
    return result
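
The downside mentioned in the TODO can be seen on a few invented strings. Because the pattern consumes the character on each side of the period, back-to-back periods such as `a.b.c` are only partly split. The lookaround variant below (`replace_point_lookaround`, a hypothetical name) is one possible alternative, shown here as a sketch rather than as the notebook's method.

```python
import re


def replace_point(document):
    # Pattern from the notebook: the characters on both sides of the period are consumed.
    return re.sub(r"(\S)\.(\S)", r"\1 . \2", document)


def replace_point_lookaround(document):
    # Hypothetical variant: lookarounds inspect the neighbours without consuming them.
    return re.sub(r"(?<=\S)\.(?=\S)", " . ", document)


print(replace_point("fin.Debut"))
print(replace_point("a.b.c"))
print(replace_point_lookaround("a.b.c"))
print(replace_point("3.5 millions"))
```

Both versions split decimals such as `3.5`, which may or may not be acceptable for the downstream tokenizer.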

In [61]:
import numpy as np

In [2]:
data.shape  # shape is an attribute, not a method; run this after the cell that loads data

NameError: name 'data' is not defined

In [21]:
# this can be improved.
# Site names are replaced first: once " . " has been inserted, the site-name pattern no longer matches.
data.content = data.content\
    .str.replace(r"7SUR7\.CD|politico\.cd|actualite\.cd|mediacongo\.net", "SITE_WEB", flags=re.IGNORECASE, regex=True)\
    .str.replace(r"(\S)\.(\S)", r"\1 . \2", regex=True)\
    .str.replace(r"This post has already been read \d+ times!", "", regex=True)
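
The order of the chained substitutions matters, and so does escaping the dots in the site-name pattern. The following standard-library sketch applies the same patterns to an invented sentence; the helper names `sites_then_points` and `points_then_sites` are hypothetical.

```python
import re

SITE = r"7SUR7\.CD|politico\.cd|actualite\.cd|mediacongo\.net"
POINT = r"(\S)\.(\S)"


def sites_then_points(text):
    # Replace site names while they are still intact, then split the periods.
    text = re.sub(SITE, "SITE_WEB", text, flags=re.IGNORECASE)
    return re.sub(POINT, r"\1 . \2", text)


def points_then_sites(text):
    # Splitting the periods first turns "7sur7.cd" into "7sur7 . cd".
    text = re.sub(POINT, r"\1 . \2", text)
    return re.sub(SITE, "SITE_WEB", text, flags=re.IGNORECASE)


sample = "Lire la suite sur 7sur7.cd et actualite.cd"
print(sites_then_points(sample))
print(points_then_sites(sample))

# An unescaped dot matches any character, not only a literal period.
unescaped_hit = re.search(r"7SUR7.CD", "7sur7xcd", flags=re.IGNORECASE)
print(unescaped_hit is not None)
```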

In [22]:
from datasets import Dataset

In [23]:
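# explode() leaves a non-default index, which from_pandas keeps as an extra column (hence 6 columns below).
dataset = Dataset.from_pandas(data)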

In [24]:
dataset.shape

(182985, 6)