# 4. BERTopic Modeling

Below is the BERTopic implementation used to run topic modeling on the WikiNews articles.

In [None]:
%pip install bertopic
from bertopic import BERTopic

import pandas as pd
import numpy as np
import json
import os

## Step 1. Training BERT model

We initially implemented topic modeling by prefitting the BERTopic model to a sample (in this case one percent) of a [pre-existing dataset of all Wikipedia articles](https://www.kaggle.com/datasets/jjinho/wikipedia-20230701). The goal was a model trained specifically on wikis, which could potentially outperform a general BERT model. However, compute and memory limitations restricted us to a small sample of Wikipedia articles, and the resulting model performed worse than the pretrained one: it labeled most articles as "unassigned".

In [None]:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

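# Named data_dir rather than dir to avoid shadowing the Python builtin
data_dir = '/kaggle/input/wikipedia-20230701'

# Collect every parquet shard except the index file
file_paths = sorted(os.listdir(data_dir))
file_paths.remove('wiki_2023_index.parquet')
file_paths = [os.path.join(data_dir, path) for path in file_paths]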

fraction = 0.01
big_data = dd.read_parquet(file_paths[0]).sample(frac=fraction, random_state=42)

for file in file_paths[1:]:
    curr = dd.read_parquet(file).sample(frac=fraction, random_state=42)
    big_data = dd.concat([big_data, curr], ignore_index=True)
    del curr

In [None]:
# Attempted to prefit the topic model on a subset of Wikipedia.
# Memory constraints kept the sample small, and the resulting model
# produced more topic outliers than the pretrained one.
big_data_pd = big_data.compute()
fit_docs = big_data_pd['text'].tolist()
model = BERTopic()
_ = model.fit_transform(fit_docs)

The resulting model marked around 11,000 articles as topic -1 (unassigned), which was worse than the pretrained model, which left only 6,900 articles unassigned. Since we only had around 20,000 articles, keeping this number low was important for getting good results.
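To put those counts in proportion, the sketch below computes the share of unassigned articles for each model. It uses only the approximate figures quoted above; the helper name `outlier_share` is introduced here for illustration and is not part of the notebook.

```python
def outlier_share(topics):
    """Return the fraction of documents assigned to topic -1."""
    if not topics:
        return 0.0
    return sum(1 for topic in topics if topic == -1) / len(topics)


# Approximate counts quoted in the text above
total_articles = 20_000
prefit_unassigned = 11_000
pretrained_unassigned = 6_900

prefit_share = prefit_unassigned / total_articles
pretrained_share = pretrained_unassigned / total_articles

print(f"prefit model:     {prefit_share:.1%} unassigned")
print(f"pretrained model: {pretrained_share:.1%} unassigned")
```

Given a real list of topic assignments, `outlier_share(topics)` would give the same kind of figure directly.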

## Step 2. Initializing Pretrained BERT Model and running directly on news articles

In [None]:
file_path = '/kaggle/input/wikinews-data-converter-2-final-stage-3/enwikinews-processed.parquet'

w_data = pd.read_parquet(file_path)
w_data.drop(columns=['page_namespace'], inplace=True)
w_data.dropna(inplace=True)

docs = w_data['page_text_extract_result'].tolist()

In [None]:
model2 = BERTopic()
topics, probs = model2.fit_transform(docs)

`topics` holds the assigned topic number for each document, while `probs` holds the probability that each document belongs to its assigned topic.
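As a rough illustration of how the two outputs line up with the input documents, the sketch below pairs them up on made-up values. It assumes one topic number and one probability per document, which is how the results are used in this notebook; the names `docs_demo`, `topics_demo`, `probs_demo`, and `pair_assignments` are hypothetical.

```python
# Made-up stand-ins for docs, topics, and probs (not real model output)
docs_demo = ["article a", "article b", "article c"]
topics_demo = [3, -1, 3]
probs_demo = [0.92, 0.0, 0.41]


def pair_assignments(docs, topics, probs):
    """Zip documents with their topic number and topic probability."""
    return [
        {"doc": doc, "topic": topic, "prob": prob}
        for doc, topic, prob in zip(docs, topics, probs)
    ]


rows = pair_assignments(docs_demo, topics_demo, probs_demo)

# Topic -1 marks documents that were not assigned to any topic
assigned = [row for row in rows if row["topic"] != -1]
```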

In [None]:
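# pd.value_counts is deprecated in recent pandas; count via a Series instead.
# The count for topic -1 is the number of unassigned articles.
pd.Series(topics).value_counts()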

In [None]:
# add back to dataframe

w_data['assigned_topic_num'] = topics
w_data['topic_probability'] = probs

## Step 3. Parsing Dates and generating output files

Below I cleaned the date-related data in the dataframe so it could be written to a JSON file for the website. I also wrote the topic assignments to a parquet file for data analysis.
The output JSON/dataframe contains one row/entry per document, with the document's assigned topic, topic probability, date, and article title.
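For orientation, a single entry in the output might look roughly like the sketch below. All values are made up, the `title` key is a hypothetical name (the actual title column is not shown in this notebook), the date fields are assumed to be numeric, and the remaining date columns are omitted.

```python
import json

# Hypothetical record: values and the "title" key are illustrative only
record = {
    "title": "Example article title",
    "assigned_topic_num": 12,
    "topic_probability": 0.87,
    "article_date": {
        "Year": 2010, "Month": 3, "Day": 14,
        "Hour": 0, "Minute": 0, "Second": 0,
    },
}

# orient='records' produces a JSON array with one object per row
serialized = json.dumps([record])
restored = json.loads(serialized)
```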

In [None]:
# Copy the filtered rows so the in-place edits below do not act on a view of w_data
topical = w_data[w_data['assigned_topic_num'] != -1].copy()
topical.drop(columns=['revision_id', 'page_id', 'page_text'], inplace=True)

import re

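MONTHS = ['january', 'february', 'march', 'april', 'may', 'june', 'july',
          'august', 'september', 'october', 'november', 'december']

def clean_dates(text):
    # Find the first "Month D, YYYY" date in the first entry of page_dates.
    # The text is lowercased first, so the pattern only needs lowercase names.
    pattern = r'\b(' + '|'.join(MONTHS) + r') (\d{1,2}), (\d{4})\b'
    match = re.search(pattern, text[0].lower())
    if match:
        # Groups are captured in (month, day, year) order
        month, day, year = match.groups()
        return {
            "Year": int(year),
            "Month": MONTHS.index(month) + 1,  # numeric, matching parsedatetodict
            "Day": int(day),
            "Hour": 0,
            "Minute": 0,
            "Second": 0
            }
    return np.nan  # np.NaN was removed in NumPy 2.0

topical['article_date'] = topical['page_dates'].apply(clean_dates)
topical.dropna(inplace=True)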

from dateutil.parser import parse

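def parsedatetodict(timestamp):
    # Parse a timestamp string into a dict of date parts;
    # return NaN on failure so the row is removed by dropna below
    try:
        dt = parse(timestamp)
        return {
            "Year": dt.year,
            "Month": dt.month,
            "Day": dt.day,
            "Hour": dt.hour,
            "Minute": dt.minute,
            "Second": dt.second
            }
    except Exception:
        return np.nan  # np.NaN was removed in NumPy 2.0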


topical['page_dates_parsed'] = topical['page_dates_parsed'].apply(lambda x: parsedatetodict(x[0]))
topical['last_update_timestamp'] = topical['last_update_timestamp'].apply(lambda x: parsedatetodict(x[0]))
topical.dropna(inplace=True)

topical.drop(columns=['page_dates', 'page_text_extract_result'], inplace=True)

topical.reset_index(drop=True, inplace=True)

topical.to_json('topical_output.json', orient='records')
topical.to_parquet('topical_output.parquet', index=False)

Here I produced the dict that maps topic numbers to actual words, opting to simply choose the highest-scoring word for each topic. This means some topics end up with the same label, but we can collate those articles on the website.

In [None]:
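import pickle

# Use a new name so the per-document `topics` list from fit_transform is not overwritten
topic_words = model2.get_topics()

# For each topic, keep only the (word, score) pair with the highest score;
# float() guards against NumPy scalar types that json cannot serialize
topic_labels = {}
for key, pairs in topic_words.items():
    word, score = max(pairs, key=lambda pair: pair[1])
    topic_labels[key] = (word, float(score))

with open("topics.json", "w") as file:
    file.write(json.dumps(topic_labels))
with open("topics.pkl", "wb") as file:
    pickle.dump(topic_labels, file)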
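Because several topic numbers can end up with the same top word, the collation mentioned above could be sketched as follows. This is a minimal illustration on made-up data, not the website's actual logic; `top_words` and `group_topics_by_label` are hypothetical names.

```python
from collections import defaultdict

# Made-up stand-in for the topic number -> (word, score) mapping saved above
top_words = {
    0: ("election", 0.05),
    1: ("football", 0.04),
    2: ("election", 0.03),
}


def group_topics_by_label(top_words):
    """Map each label word to the list of topic numbers that share it."""
    groups = defaultdict(list)
    for topic_num, (word, _score) in top_words.items():
        groups[word].append(topic_num)
    return dict(groups)


label_to_topics = group_topics_by_label(top_words)
```

Articles whose `assigned_topic_num` falls in the same group could then be shown under a single label.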