# 4. BERTopic Modeling

Below is the BERTopic implementation used to run topic modeling on the WikiNews articles.

In [1]:
%pip install bertopic dateparser datefinder
from bertopic import BERTopic

import pandas as pd
import numpy as np
import json
import os

Collecting bertopic
  Downloading bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting dateparser
  Downloading dateparser-1.2.0-py2.py3-none-any.whl.metadata (28 kB)
Collecting datefinder
  Downloading datefinder-0.7.3-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.40-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence_transformers-3.3.1-py3-none-any.whl.metadata (10 kB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting tzlocal (from dateparser)
  Downloading tzlocal-5.2-py3-none-any.whl.metadata (7.8 kB)
Collecting pynndescent>=0.5 (from umap-learn>=0.5.0->bertopic)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading bertopic-0.16.4-py3-none-any.whl (143 kB)

## Step 1. Training BERT model

We initially tried prefitting the BERTopic model on a sample (in this case one percent) of a [pre-existing dataset of all Wikipedia articles](https://www.kaggle.com/datasets/jjinho/wikipedia-20230701), hoping that a model fitted specifically on wiki text would perform better than a general pretrained model. However, compute and memory limitations restricted us to a small sample of Wikipedia articles, and the resulting model performed worse than the pretrained one: it labeled most articles as "unassigned".

In [2]:
# import dask.dataframe as dd
# from dask.diagnostics import ProgressBar

# dir = '/kaggle/input/wikipedia-20230701'

# file_paths = sorted(os.listdir(dir))
# file_paths.remove('wiki_2023_index.parquet')
# file_paths = [os.path.join(dir, path) for path in file_paths]

# fraction = 0.01
# big_data = dd.read_parquet(file_paths[0]).sample(frac=fraction, random_state=42)

# for file in file_paths[1:]:
#     curr = dd.read_parquet(file).sample(frac=fraction, random_state=42)
#     big_data = dd.concat([big_data, curr], ignore_index=True)
#     del curr

In [3]:
# Attempted topic modeling by first prefitting topics on a subset of Wikipedia.
# However, we could not sample a particularly large portion of Wikipedia due to memory constraints,
# and ended up with more topic outliers.
# fit_docs = big_data['text'].compute().tolist()
# model = BERTopic()
# _ = model.fit(fit_docs)

The resulting model marked around 11,000 articles with topic -1 (unassigned), which was worse than the pretrained model, which labeled only about 6,900 articles as unassigned. Since we had only around 20,000 articles, keeping this number low was important for getting good results.
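To make this comparison concrete, the outlier share can be computed directly from a list of topic assignments. The sketch below is illustrative only: `outlier_fraction` is a hypothetical helper that is not part of the notebook, and the toy list stands in for real model output.

```python
def outlier_fraction(topic_ids):
    """Return the share of documents assigned to BERTopic's outlier topic (-1)."""
    if len(topic_ids) == 0:
        return 0.0
    return sum(1 for t in topic_ids if t == -1) / len(topic_ids)


# Toy assignments (made up for illustration): 2 of 5 documents are outliers
toy_topics = [-1, 0, 3, -1, 7]
print(outlier_fraction(toy_topics))
```

Applied to the output of each model, this gives a single number that can be compared across runs regardless of corpus size.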

## Step 2. Initializing Pretrained BERT Model and running directly on news articles

In [4]:
file_path = '/kaggle/input/wikinews-data-converter-2-final-stage-3/enwikinews-processed.parquet'

w_data = pd.read_parquet(file_path)
w_data.drop(columns=['page_namespace'], inplace=True)
w_data.dropna(inplace=True)

docs = w_data['page_text_extract_result'].tolist()

In [5]:
model2 = BERTopic()
topics, probs = model2.fit_transform(docs)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

  pid = os.fork()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

`topics` holds the assigned topic number for each document, while `probs` holds the probability that each document belongs to its assigned topic.
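As a rough illustration of how the two outputs line up, the sketch below pairs each topic number with its probability and tallies topic sizes. The values are invented, the `MIN_PROB` cutoff is hypothetical, and the sketch assumes `probs` is a flat sequence with one value per document (which the later assignment to a single dataframe column implies).

```python
from collections import Counter

# Made-up stand-ins for the two outputs of fit_transform
toy_topics = [80, 230, 80, 1, -1]
toy_probs = [1.0, 0.92, 0.67, 0.88, 0.0]

# Tally how many documents fall in each topic; -1 is the outlier bucket
topic_sizes = Counter(toy_topics)
print(topic_sizes.most_common(1))

# Keep (index, topic, prob) for assigned documents above a hypothetical cutoff
MIN_PROB = 0.7
confident = [
    (i, t, p)
    for i, (t, p) in enumerate(zip(toy_topics, toy_probs))
    if t != -1 and p >= MIN_PROB
]
print(confident)
```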

In [6]:
pd.Series(topics).value_counts() # the count for topic -1 is the number of unassigned articles

-1      7140
 0       227
 1       227
 2       176
 3       172
        ... 
 374      10
 375      10
 376      10
 377      10
 378      10
Name: count, Length: 380, dtype: int64

In [7]:
# add back to dataframe

w_data['assigned_topic_num'] = topics
w_data['topic_probability'] = probs

In [8]:
w_data

Unnamed: 0,revision_id,page_id,page_title,page_text,last_update_timestamp,page_dates,page_text_extract_result,page_dates_parsed,assigned_topic_num,topic_probability
0,4516743,736,President of China lunches with Brazilian Pres...,"{{date|November 13, 2004}}\n{{Brazil}}\n\n{{w|...",2019-09-28T09:51:53Z,"[{{date|november 13, 2004}}]","Saturday, November 13, 2004 \n\nHu Jintao, the...",[2024-11-13T00:00:00],80,1.000000
1,4516759,741,Palestinians to elect new president on January 9,[[File:Mahmoud abbas.jpg|frame|left|Mahmoud Ab...,2019-09-28T10:45:51Z,"[{{byline|date=november 14, 2004|location=[[w:...","Sunday, November 14, 2004 \nRAMALLAH — Acting ...",[2024-11-14T00:00:00],230,0.915818
2,2280888,743,Brazilian delegation returns from Arafat funeral,"{{date|November 13, 2004}}\n{{Palestine}}The d...",2014-01-02T19:36:05Z,"[{{date|november 13, 2004}}]","Saturday, November 13, 2004 \n\nThe delegation...",[2024-11-13T00:00:00],80,0.668489
3,4516758,764,Hearing begins over David Hookes death,"{{Crime and law}}{{byline|date=November 15, 20...",2019-09-28T10:39:36Z,"[{{byline|date=november 15, 2004|location=[[me...","Monday, November 15, 2004 \nMELBOURNE, Victori...",[2024-11-15T00:00:00],1,0.875391
4,1973838,779,Iran close to decision on nuclear program,"{{date|November 13, 2004}}\n{{Iran nuclear pro...",2013-08-21T16:07:41Z,"[{{date|november 13, 2004}}]","Saturday, November 13, 2004 \n\nIranian repres...",[2024-11-13T00:00:00],9,0.782153
...,...,...,...,...,...,...,...,...,...,...
21651,4804553,3003615,"Trump wins 2024 U.S. Presidential Election, se...","{{date|November 12, 2024}} <!--leave this as i...",2024-11-12T18:11:08Z,"[{{date|november 12, 2024}}]","Tuesday, November 12, 2024 \n\nDonald Trump w...",[2024-11-12T00:00:00],81,0.988270
21653,4805088,3003768,"Prison riot in Ecuador, at least 17 killed",{{tasks|src|npov|mos|re-review}}\n{{date|Novem...,2024-11-17T16:35:30Z,"[{{date|november 13, 2024}}]","Wednesday, November 13, 2024 \n\nAt least sev...",[2024-11-13T00:00:00],79,0.890278
21654,4805240,3003827,2024 ARPS Conference,"{{develop}}\n{{date|November 12, 2024}}\n\nThe...",2024-11-19T12:19:00Z,"[{{date|november 12, 2024}}]","Tuesday, November 12, 2024 \nThe Australasian ...",[2024-11-12T00:00:00],217,0.753780
21655,4805272,3003842,Japan's oldest Princess Yuriko the Princess Mi...,"{{tasks|src|re-review}}{{date|November 15, 202...",2024-11-19T18:23:32Z,"[{{date|november 15, 2024}}]","Wednesday, November 20, 2024 \n\nJapan's Yurik...",[2024-11-15T00:00:00],-1,0.000000


In [9]:
# Keep only articles that were assigned a topic; copy to avoid modifying a view of w_data
topical = w_data[w_data['assigned_topic_num'] != -1].copy()
topical.drop(columns=['revision_id', 'page_id', 'page_text'], inplace=True)

## Step 3. Date fixes and generating output files

Below, I clean the date-related data in the dataframe before writing it to a JSON file for the website, and also write the topic-modeled data to a parquet file for data analysis. The original date parsing used a slightly too permissive library, which led to incorrectly parsed dates (for example, 2004 dates parsed as 2024 in the `page_dates_parsed` column above), so the dates produced here should be more accurate.
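To see why a stricter parser helps: the raw templates look like `{{date|november 13, 2004}}`, so the date can be pulled out with an explicit pattern and parsed against a fixed format, which fails loudly instead of guessing. This is a sketch of that idea using only the standard library, not the approach taken in this notebook. `parse_template_date` and its regex are hypothetical, and `%B` assumes an English locale.

```python
import re
from datetime import datetime

# Matches "month day, year" as it appears in templates such as {{date|november 13, 2004}}
DATE_PATTERN = re.compile(r"([A-Za-z]+) (\d{1,2}), (\d{4})")


def parse_template_date(template):
    """Extract a single 'Month D, YYYY' date from a wiki date template."""
    matches = DATE_PATTERN.findall(template)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one date in {template!r}")
    month, day, year = matches[0]
    # strptime rejects anything that is not a full month name, a valid day, and a 4-digit year
    return datetime.strptime(f"{month} {day} {year}", "%B %d %Y")


print(parse_template_date("{{date|november 13, 2004}}"))
```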

The output JSON/dataframe contains one row/entry per document, with the document's assigned topic, topic probability, dates, and article title.
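For illustration, a single record in the JSON output would look roughly like the dictionary below, and its date object can be turned back into a `datetime` for analysis. The field names and values are taken from the dataframe shown in this notebook, and the text and `page_dates` fields are left out for brevity. `to_datetime` is a hypothetical helper.

```python
import json
from datetime import datetime

record = {
    "page_title": "Iran close to decision on nuclear program",
    "assigned_topic_num": 9,
    "topic_probability": 0.782153,
    "page_dates_parsed_obj": {
        "Year": 2004, "Month": 11, "Day": 13, "Hour": 0, "Minute": 0, "Second": 0,
    },
    "last_update_timestamp_obj": {
        "Year": 2013, "Month": 8, "Day": 21, "Hour": 16, "Minute": 7, "Second": 41,
    },
}


def to_datetime(date_obj):
    """Rebuild a datetime from a Year/Month/Day/Hour/Minute/Second dict."""
    return datetime(
        date_obj["Year"], date_obj["Month"], date_obj["Day"],
        date_obj["Hour"], date_obj["Minute"], date_obj["Second"],
    )


# Round-trip through JSON the way a consumer of the records file would
loaded = json.loads(json.dumps([record]))[0]
print(to_datetime(loaded["page_dates_parsed_obj"]).date())
```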

In [10]:
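import re
from dateutil.parser import isoparse
import datefinder

def parse_date(page_dates):
    """Parse every raw date template for a page and return the earliest date as a dict."""
    page_dates_parsed = []
    for date_str in page_dates:
        # Replace commas with spaces and collapse repeated spaces before searching for dates
        date_str = re.sub(" +", " ", date_str.replace(",", " "))
        
        dates_parsed = list(datefinder.find_dates(date_str))
        # Each template must yield exactly one date; anything else is a parsing failure
        assert len(dates_parsed) == 1, f"expected one date in {date_str!r}, found {len(dates_parsed)}"
        page_dates_parsed.extend(dates_parsed)
    page_date = min(page_dates_parsed)

    # Sanity check: a page dated 2024 must not mention an earlier year (2004 through 2022) in its raw dates
    if page_date.year == 2024:
        for i in range(2004, 2023):
            assert str(i) not in str(page_dates), f"suspicious 2024 date parsed from {page_dates}"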
    
    return {
        "Year": page_date.year,
        "Month": page_date.month,
        "Day": page_date.day,
        "Hour": page_date.hour,
        "Minute": page_date.minute,
        "Second": page_date.second
    }

def page_update_date_isoparse(isodate):
    isodate = isoparse(isodate)
    
    return {
        "Year": isodate.year,
        "Month": isodate.month,
        "Day": isodate.day,
        "Hour": isodate.hour,
        "Minute": isodate.minute,
        "Second": isodate.second
    }
    

topical['page_dates_parsed_obj'] = topical['page_dates'].apply(parse_date)
topical['last_update_timestamp_obj'] = topical['last_update_timestamp'].apply(page_update_date_isoparse)
del topical['page_dates_parsed']
del topical['last_update_timestamp']
topical.dropna(inplace=True)
topical.reset_index(drop=True, inplace=True)

topical.to_json('topical_output.json', orient='records')

for x in topical['page_dates_parsed_obj']:
    try:
        assert x["Year"] > 1000
        assert x["Month"] < 13
        assert x["Day"] <= 31
        assert x["Hour"] == 0
        assert x["Minute"] == 0
        assert x["Second"] == 0
        assert all([(v >= 0) for v in x.values()])
    except AssertionError:
        print(x)
        assert False

for x in topical['last_update_timestamp_obj']:
    try:
        assert x["Year"] > 1000
        assert x["Month"] < 13
        assert x["Day"] <= 31
        assert x["Hour"] < 24
        assert x["Minute"] < 60
        assert x["Second"] < 60
        assert all([(v >= 0) for v in x.values()])
    except AssertionError:
        print(x)
        assert False

topical['page_dates_parsed_obj'] = topical['page_dates_parsed_obj'].apply(json.dumps)
topical['last_update_timestamp_obj'] = topical['last_update_timestamp_obj'].apply(json.dumps)
topical.to_parquet('topical_output.parquet', 
    engine='pyarrow', 
    compression='zstd', 
    compression_level=23
)

In [11]:
topical

Unnamed: 0,page_title,page_dates,page_text_extract_result,assigned_topic_num,topic_probability,page_dates_parsed_obj,last_update_timestamp_obj
0,President of China lunches with Brazilian Pres...,"[{{date|november 13, 2004}}]","Saturday, November 13, 2004 \n\nHu Jintao, the...",80,1.000000,"{""Year"": 2004, ""Month"": 11, ""Day"": 13, ""Hour"":...","{""Year"": 2019, ""Month"": 9, ""Day"": 28, ""Hour"": ..."
1,Palestinians to elect new president on January 9,"[{{byline|date=november 14, 2004|location=[[w:...","Sunday, November 14, 2004 \nRAMALLAH — Acting ...",230,0.915818,"{""Year"": 2004, ""Month"": 11, ""Day"": 14, ""Hour"":...","{""Year"": 2019, ""Month"": 9, ""Day"": 28, ""Hour"": ..."
2,Brazilian delegation returns from Arafat funeral,"[{{date|november 13, 2004}}]","Saturday, November 13, 2004 \n\nThe delegation...",80,0.668489,"{""Year"": 2004, ""Month"": 11, ""Day"": 13, ""Hour"":...","{""Year"": 2014, ""Month"": 1, ""Day"": 2, ""Hour"": 1..."
3,Hearing begins over David Hookes death,"[{{byline|date=november 15, 2004|location=[[me...","Monday, November 15, 2004 \nMELBOURNE, Victori...",1,0.875391,"{""Year"": 2004, ""Month"": 11, ""Day"": 15, ""Hour"":...","{""Year"": 2019, ""Month"": 9, ""Day"": 28, ""Hour"": ..."
4,Iran close to decision on nuclear program,"[{{date|november 13, 2004}}]","Saturday, November 13, 2004 \n\nIranian repres...",9,0.782153,"{""Year"": 2004, ""Month"": 11, ""Day"": 13, ""Hour"":...","{""Year"": 2013, ""Month"": 8, ""Day"": 21, ""Hour"": ..."
...,...,...,...,...,...,...,...
14502,Smithsonian National Zoo euthanizes elderly As...,"[{{date|november 7, 2024}}]","Thursday, November 7, 2024 \n\nKamala, an Asia...",258,1.000000,"{""Year"": 2024, ""Month"": 11, ""Day"": 7, ""Hour"": ...","{""Year"": 2024, ""Month"": 11, ""Day"": 17, ""Hour"":..."
14503,Trump declares victory,"[{{date|november 6, 2024}}]","Wednesday, November 6, 2024 \n\nDonald Trump i...",81,0.576288,"{""Year"": 2024, ""Month"": 11, ""Day"": 6, ""Hour"": ...","{""Year"": 2024, ""Month"": 11, ""Day"": 16, ""Hour"":..."
14504,"Trump wins 2024 U.S. Presidential Election, se...","[{{date|november 12, 2024}}]","Tuesday, November 12, 2024 \n\nDonald Trump w...",81,0.988270,"{""Year"": 2024, ""Month"": 11, ""Day"": 12, ""Hour"":...","{""Year"": 2024, ""Month"": 11, ""Day"": 12, ""Hour"":..."
14505,"Prison riot in Ecuador, at least 17 killed","[{{date|november 13, 2024}}]","Wednesday, November 13, 2024 \n\nAt least sev...",79,0.890278,"{""Year"": 2024, ""Month"": 11, ""Day"": 13, ""Hour"":...","{""Year"": 2024, ""Month"": 11, ""Day"": 17, ""Hour"":..."


In [12]:
pd.read_parquet('topical_output.parquet')

Unnamed: 0,page_title,page_dates,page_text_extract_result,assigned_topic_num,topic_probability,page_dates_parsed_obj,last_update_timestamp_obj
0,President of China lunches with Brazilian Pres...,"[{{date|november 13, 2004}}]","Saturday, November 13, 2004 \n\nHu Jintao, the...",80,1.000000,"{""Year"": 2004, ""Month"": 11, ""Day"": 13, ""Hour"":...","{""Year"": 2019, ""Month"": 9, ""Day"": 28, ""Hour"": ..."
1,Palestinians to elect new president on January 9,"[{{byline|date=november 14, 2004|location=[[w:...","Sunday, November 14, 2004 \nRAMALLAH — Acting ...",230,0.915818,"{""Year"": 2004, ""Month"": 11, ""Day"": 14, ""Hour"":...","{""Year"": 2019, ""Month"": 9, ""Day"": 28, ""Hour"": ..."
2,Brazilian delegation returns from Arafat funeral,"[{{date|november 13, 2004}}]","Saturday, November 13, 2004 \n\nThe delegation...",80,0.668489,"{""Year"": 2004, ""Month"": 11, ""Day"": 13, ""Hour"":...","{""Year"": 2014, ""Month"": 1, ""Day"": 2, ""Hour"": 1..."
3,Hearing begins over David Hookes death,"[{{byline|date=november 15, 2004|location=[[me...","Monday, November 15, 2004 \nMELBOURNE, Victori...",1,0.875391,"{""Year"": 2004, ""Month"": 11, ""Day"": 15, ""Hour"":...","{""Year"": 2019, ""Month"": 9, ""Day"": 28, ""Hour"": ..."
4,Iran close to decision on nuclear program,"[{{date|november 13, 2004}}]","Saturday, November 13, 2004 \n\nIranian repres...",9,0.782153,"{""Year"": 2004, ""Month"": 11, ""Day"": 13, ""Hour"":...","{""Year"": 2013, ""Month"": 8, ""Day"": 21, ""Hour"": ..."
...,...,...,...,...,...,...,...
14502,Smithsonian National Zoo euthanizes elderly As...,"[{{date|november 7, 2024}}]","Thursday, November 7, 2024 \n\nKamala, an Asia...",258,1.000000,"{""Year"": 2024, ""Month"": 11, ""Day"": 7, ""Hour"": ...","{""Year"": 2024, ""Month"": 11, ""Day"": 17, ""Hour"":..."
14503,Trump declares victory,"[{{date|november 6, 2024}}]","Wednesday, November 6, 2024 \n\nDonald Trump i...",81,0.576288,"{""Year"": 2024, ""Month"": 11, ""Day"": 6, ""Hour"": ...","{""Year"": 2024, ""Month"": 11, ""Day"": 16, ""Hour"":..."
14504,"Trump wins 2024 U.S. Presidential Election, se...","[{{date|november 12, 2024}}]","Tuesday, November 12, 2024 \n\nDonald Trump w...",81,0.988270,"{""Year"": 2024, ""Month"": 11, ""Day"": 12, ""Hour"":...","{""Year"": 2024, ""Month"": 11, ""Day"": 12, ""Hour"":..."
14505,"Prison riot in Ecuador, at least 17 killed","[{{date|november 13, 2024}}]","Wednesday, November 13, 2024 \n\nAt least sev...",79,0.890278,"{""Year"": 2024, ""Month"": 11, ""Day"": 13, ""Hour"":...","{""Year"": 2024, ""Month"": 11, ""Day"": 17, ""Hour"":..."


Here I produce a dictionary mapping the topic numbers generated by the model to actual words. Each topic is labeled with the word that has the highest score (BERTopic's c-TF-IDF weight) among the words representing that topic.

In [13]:
import pickle

topics = model2.get_topics()

# For each topic, keep only the (word, score) pair with the highest score
topic_labels = {key: max(words, key=lambda d: d[1]) for key, words in topics.items()}

# Write a JSON copy (topic numbers become string keys) and a pickle copy (keeps the original types)
with open("topics.json", "w") as file:
    json.dump(topic_labels, file)
with open("topics.pkl", "wb") as file:
    pickle.dump(topic_labels, file)
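
A caveat when reading these files back: JSON object keys are always strings and tuples become lists, so the JSON copy needs a small conversion step, whereas the pickle preserves the original types. The sketch below shows that round-trip on a toy mapping with made-up words and scores. `restore_topic_labels` is a hypothetical helper.

```python
import json


def restore_topic_labels(raw):
    """Convert JSON-decoded labels back to {int topic: (word, score)}."""
    return {int(key): (word, score) for key, (word, score) in raw.items()}


# Toy stand-in for the mapping written to topics.json (values are invented)
toy_labels = {-1: ("said", 0.01), 0: ("election", 0.05)}

decoded = json.loads(json.dumps(toy_labels))
print(decoded)                        # keys are now strings, values are lists
print(restore_topic_labels(decoded))  # back to int keys and tuples
```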