# 4. BERTopic Modeling

Below is the implementation of BERTopic used to run topic modeling on the WikiNews articles.

In [1]:
%pip install bertopic
from bertopic import BERTopic

import pandas as pd
import numpy as np
import json
import os

Collecting bertopic
  Downloading bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.40-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence_transformers-3.3.1-py3-none-any.whl.metadata (10 kB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn>=0.5.0->bertopic)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading bertopic-0.16.4-py3-none-any.whl (143 kB)
Downloading hdbscan-0.8.40-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.2 MB)

## Step 1. Training BERT model

We first tried prefitting the BERTopic model on a sample (one percent) of a [pre-existing dataset of all Wikipedia articles](https://www.kaggle.com/datasets/jjinho/wikipedia-20230701), hoping that a model fitted specifically to wiki text would perform better than a general BERT model. However, compute and memory limitations restricted us to a small sample of Wikipedia articles, and the resulting model performed worse than the pretrained one: it labeled most articles as "unassigned".

In [2]:
# import dask.dataframe as dd
# from dask.diagnostics import ProgressBar

# dir = '/kaggle/input/wikipedia-20230701'

# file_paths = sorted(os.listdir(dir))
# file_paths.remove('wiki_2023_index.parquet')
# file_paths = [os.path.join(dir, path) for path in file_paths]

# fraction = 0.01
# big_data = dd.read_parquet(file_paths[0]).sample(frac=fraction, random_state=42)

# for file in file_paths[1:]:
#     curr = dd.read_parquet(file).sample(frac=fraction, random_state=42)
#     big_data = dd.concat([big_data, curr], ignore_index=True)
#     del curr

In [3]:
# Attempted to topic model by first prefitting topics on a subset of Wikipedia.
# However, we could not sample a particularly large portion of Wikipedia due to memory constraints,
# and ended up with more topic outliers.
# fit_docs = big_data['text'].compute().tolist()
# model = BERTopic()
# _ = model.fit(fit_docs)

The resulting model marked around 11,000 articles with topic -1 (unassigned), which was worse than the pretrained model, which labeled only around 6,900 articles as unassigned (the exact count varies between runs because BERTopic's default UMAP step is stochastic). Since we had only around 20,000 articles, minimizing this number was important for getting good results.
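
To put these counts in perspective, the share of unassigned articles can be computed directly from a list of topic numbers. The following is a minimal sketch: the helper name `outlier_share` is hypothetical, and the figures are the rounded counts quoted above, not exact results.

```python
from collections import Counter

def outlier_share(topic_ids):
    """Fraction of documents labeled with the outlier topic (-1)."""
    counts = Counter(topic_ids)
    total = sum(counts.values())
    return counts[-1] / total if total else 0.0

# Toy list of topic numbers: two of the four documents are outliers
toy_share = outlier_share([-1, 0, 1, -1])
print(toy_share)

# Rounded counts from the text: about 20,000 articles in total
total_articles = 20_000
prefit_share = 11_000 / total_articles      # model prefit on the Wikipedia sample
pretrained_share = 6_900 / total_articles   # model fit directly on the news articles
print(f"prefit: {prefit_share:.1%}, pretrained: {pretrained_share:.1%}")
```

With these rounded figures, the prefit model leaves more than half of the articles unassigned, while the pretrained model leaves roughly a third.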

## Step 2. Initializing Pretrained BERT Model and running directly on news articles

In [4]:
file_path = '/kaggle/input/wikinews-data-converter-2-final-stage-3/enwikinews-processed.parquet'

w_data = pd.read_parquet(file_path)
w_data.drop(columns=['page_namespace'], inplace=True)
w_data.dropna(inplace=True)

docs = w_data['page_text_extract_result'].tolist()

In [5]:
model2 = BERTopic()
topics, probs = model2.fit_transform(docs)

  pid = os.fork()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

`topics` holds the assigned topic number for each document, while `probs` holds the probability of each document belonging to its assigned topic.
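
As a rough illustration of how the two return values line up, the sketch below pairs each document with its topic number and probability. The values are made up for illustration and are not output from the model. It assumes the default BERTopic settings, where `probs` has one entry per document.

```python
# Toy stand-ins for the values returned by fit_transform (not real model output)
toy_docs = ["article a", "article b", "article c"]
toy_topics = [0, -1, 2]          # -1 marks an unassigned (outlier) document
toy_probs = [0.91, 0.0, 0.64]    # probability of the assigned topic

# Pair each document with its topic and probability, skipping outliers
assigned = [
    (doc, topic, prob)
    for doc, topic, prob in zip(toy_docs, toy_topics, toy_probs)
    if topic != -1
]
print(assigned)
```

Because the three sequences share the same order, they can be added to the dataframe as columns without any further alignment.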

In [6]:
pd.Series(topics).value_counts()  # shows the number of articles assigned to topic -1 (pd.value_counts is deprecated)

-1      7190
 0       404
 1       347
 2       202
 3       197
        ... 
 374      10
 375      10
 376      10
 377      10
 378      10
Name: count, Length: 380, dtype: int64

In [7]:
# add back to dataframe

w_data['assigned_topic_num'] = topics
w_data['topic_probability'] = probs

In [8]:
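# keep only articles that were assigned a topic; drop() returns a new frame,
# so later column assignments do not modify a slice of w_data
topical = w_data[w_data['assigned_topic_num'] != -1].drop(columns=['revision_id', 'page_id', 'page_text'])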

## Step 3. Parsing Dates and generating output files

Below I cleaned the date-related data in the dataframe before writing it to a JSON file for the website. I also wrote the topic-model output to a parquet file for data analysis.
The output JSON/DataFrame contains one row/entry per document, with the document's assigned topic, topic probability, date, and article title.

In [9]:
from dateutil.parser import isoparse

def parsedatetodict(timestamps):
    """Convert a single-element list of ISO 8601 timestamps to a dict of date parts.

    Returns NaN when the date is marked ambiguous or there is not exactly one timestamp.
    """
    if "ambiguous" in timestamps:
        return np.nan  # np.NaN was removed in NumPy 2.0
    if len(timestamps) != 1:
        return np.nan

    # exactly one timestamp remains at this point
    dt = isoparse(timestamps[0])

    return {
        "Year": dt.year,
        "Month": dt.month,
        "Day": dt.day,
        "Hour": dt.hour,
        "Minute": dt.minute,
        "Second": dt.second
    }

topical['page_dates_parsed_obj'] = topical['page_dates_parsed'].apply(parsedatetodict)
topical['last_update_timestamp_obj'] = topical['last_update_timestamp'].apply(lambda x: parsedatetodict([x]))
topical.dropna(inplace=True)

topical.reset_index(drop=True, inplace=True)

topical.to_json('topical_output.json', orient='records')

# sanity-check the parsed page dates (the time parts are expected to be zero)
for x in topical['page_dates_parsed_obj']:
    try:
        assert x["Year"] > 1000
        assert 1 <= x["Month"] <= 12
        assert 1 <= x["Day"] <= 31
        assert x["Hour"] == 0
        assert x["Minute"] == 0
        assert x["Second"] == 0
        assert all(v >= 0 for v in x.values())
    except AssertionError:
        print(x)  # show the offending record before failing
        raise

# sanity-check the last-update timestamps (full date and time)
for x in topical['last_update_timestamp_obj']:
    try:
        assert x["Year"] > 1000
        assert 1 <= x["Month"] <= 12
        assert 1 <= x["Day"] <= 31
        assert x["Hour"] < 24
        assert x["Minute"] < 60
        assert x["Second"] < 60
        assert all(v >= 0 for v in x.values())
    except AssertionError:
        print(x)  # show the offending record before failing
        raise

topical['page_dates_parsed_obj'] = topical['page_dates_parsed_obj'].apply(json.dumps)
topical['last_update_timestamp_obj'] = topical['last_update_timestamp_obj'].apply(json.dumps)
topical.to_parquet('topical_output.parquet', 
    engine='pyarrow', 
    compression='zstd', 
    compression_level=22  # 22 is the highest compression level zstd defines
)
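
To make the shape of the parsed date objects concrete, here is a small self-contained sketch of the same conversion. It uses the standard library's `datetime.fromisoformat` in place of `dateutil`'s `isoparse` so that it runs without extra dependencies. The function name `to_date_parts` and the sample timestamp are illustrative assumptions, not part of the notebook.

```python
import json
from datetime import datetime

def to_date_parts(timestamp):
    """Split an ISO 8601 timestamp string into a dict of date parts."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "Year": dt.year, "Month": dt.month, "Day": dt.day,
        "Hour": dt.hour, "Minute": dt.minute, "Second": dt.second,
    }

parts = to_date_parts("2005-03-01T14:30:05")  # made-up timestamp
print(parts)

# The dict is serialized to a JSON string before being written to parquet
encoded = json.dumps(parts)
print(encoded)
```

Serializing the dicts to strings keeps the parquet column a plain string column, at the cost of a `json.loads` call when the file is read back.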

Here I produced a dictionary mapping the topic numbers generated by the model to actual words. For each topic, I chose as its label the word with the highest c-TF-IDF score among the words the model associates with that topic.

In [10]:
import pickle

# for each topic, keep only the (word, score) pair with the highest score;
# build a new dict so the model's topic representations and the earlier `topics` list stay untouched
topic_labels = {
    key: max(pairs, key=lambda pair: pair[1])
    for key, pairs in model2.get_topics().items()
}

with open("topics.json", "w") as file:
    json.dump(topic_labels, file)
with open("topics.pkl", "wb") as file:
    pickle.dump(topic_labels, file)
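
The label selection can be illustrated on a toy dictionary shaped like the output of `get_topics()`, which maps each topic number to a list of (word, score) pairs. The words and scores below are invented for illustration, and `pick_labels` is a hypothetical helper name.

```python
import json

def pick_labels(topic_words):
    """Map each topic number to its highest-scoring (word, score) pair."""
    return {
        topic: max(pairs, key=lambda pair: pair[1])
        for topic, pairs in topic_words.items()
    }

# Invented example data, not real model output
toy_topics = {
    0: [("election", 0.12), ("vote", 0.09), ("party", 0.07)],
    1: [("match", 0.08), ("football", 0.15), ("goal", 0.11)],
}

labels = pick_labels(toy_topics)
print(labels)

# JSON turns the integer keys into strings and the tuples into lists
round_tripped = json.loads(json.dumps(labels))
print(round_tripped)
```

Anything that reads `topics.json` back therefore needs to convert the keys with `int()` before looking up a topic number, whereas the pickle file preserves the original key and tuple types.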