# DATA PREPARATION FOR BERTOPIC
### Purpose: Prepare minimally processed text for BERTopic

The initial baseline run involved minimal data cleaning and revealed that several topics were dominated by structural and boilerplate terms (e.g. “html”, “page”, “information”). This indicated the need for additional document-level cleaning and a re-run of the baseline model.

## Imports and Data

In [1]:
# imports
import pandas as pd
import numpy as np
import re
from pathlib import Path
import spacy
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load spaCy model
nlp = spacy.load('en_core_web_sm')

In [3]:
# Load your scraped corpus (after structural cleaning only)
df = pd.read_csv('/workspaces/AM1_topic_modelling_BERTopic/data/full_retro.csv')

In [4]:
print(f"Loaded {len(df)} documents")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nOrganisations: {df['source'].value_counts()}")
print(f"\nDate range: {df['date'].min()} to {df['date'].max()}")

Loaded 3972 documents

Columns: ['url', 'title', 'date', 'text', 'source', 'type']

Organisations: source
schoolsweek    2742
gov             702
fft             202
epi             115
nuffield        107
fed             104
Name: count, dtype: int64

Date range: 2023-01-03 to 2026-01-08


In [5]:
df.head()

|  | url | title | date | text | source | type |
|---|---|---|---|---|---|---|
| 0 | https://epi.org.uk/publications-and-research/w... | Early education and care and the private secto... | 2025-12-11 | There have been significant shifts in the shap... | epi | think_tank |
| 1 | https://epi.org.uk/publications-and-research/e... | Edtech decision-making and inclusive practice:... | 2025-12-04 | Education technology (edtech) is rising on the... | epi | think_tank |
| 2 | https://epi.org.uk/publications-and-research/w... | What you learn and what you earn: educational ... | 2025-11-20 | Executive summary\nDrawing on the Longitudinal... | epi | think_tank |
| 3 | https://epi.org.uk/publications-and-research/a... | A decade of degree apprenticeships | 2025-11-12 | Ten years on from their launch, degree apprent... | epi | think_tank |
| 4 | https://epi.org.uk/publications-and-research/y... | Youth degree apprenticeships: An alternative t... | 2025-11-12 | “Youth degree appr enticeships: An alternative... | epi | think_tank |


## Minimal Cleaning

In [6]:
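# Boilerplate terms that dominated topics in the initial baseline run
BOILERPLATE_TERMS = [
    "html", "page", "pages", "items",
    "latest", "further", "information",
    "related", "read", "more"
]

# Compile once: whole-word, case-insensitive match on any boilerplate term
BOILERPLATE_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BOILERPLATE_TERMS)) + r")\b",
    flags=re.IGNORECASE,
)

def remove_boilerplate_terms(text):
    """Remove boilerplate terms, then collapse the whitespace left behind."""
    # Note: common words such as "more" and "read" are also removed from ordinary sentences
    text = BOILERPLATE_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()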
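
Because some of these terms (for example "more" and "read") also occur in ordinary sentences, it may help to check how many documents each term touches before removing it. The following is a minimal sketch of that check, not part of the original pipeline; the function name `boilerplate_document_frequency` and the toy documents are illustrative assumptions.

```python
import re
from collections import Counter

BOILERPLATE_TERMS = [
    "html", "page", "pages", "items",
    "latest", "further", "information",
    "related", "read", "more",
]

def boilerplate_document_frequency(docs, terms):
    """Count how many documents contain each term at least once
    (whole-word, case-insensitive matching)."""
    patterns = {
        term: re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        for term in terms
    }
    counts = Counter()
    for doc in docs:
        for term, pattern in patterns.items():
            if pattern.search(doc):
                counts[term] += 1
    return counts

# Toy documents (illustrative only, not taken from the corpus)
toy_docs = [
    "Read more on this page.",
    "There will be more pupils in schools than previously predicted.",
    "Further information is available as an HTML page.",
]

term_counts = boilerplate_document_frequency(toy_docs, BOILERPLATE_TERMS)
for term, n_docs in term_counts.most_common():
    print(f"{term}: {n_docs}")
```

On the real corpus, `toy_docs` would be replaced by `df['text']`. Terms that appear in a large share of documents as ordinary vocabulary are candidates for dropping from the list or handling at the vectoriser stage instead.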


In [7]:
def minimal_clean_for_bertopic(text):
    """
    Light cleaning that preserves semantic context.
    BERTopic works better with natural language, not heavily preprocessed text.
    """
    if pd.isna(text):
        return ""
    
    # Convert to string
    text = str(text)
    
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    
    # Remove URLs (optional; decide based on your needs)
    text = re.sub(r'http\S+|www\.\S+', '', text)
    
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
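    # Collapse runs of sentence-ending punctuation (e.g. "..." or "?!") into a single full stop
    text = re.sub(r'[.!?]{2,}', '.', text)
    
    # Replace characters outside the whitelist with a space; note that this
    # also strips currency symbols (e.g. "£") and curly quotes/apostrophes
    text = re.sub(r'[^\w\s.!?,;:\'-]', ' ', text)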
    
    # Clean up spacing
    text = re.sub(r'\s+', ' ', text).strip()

    text = remove_boilerplate_terms(text)
    
    return text

In [8]:
# apply minimal cleaning
df['text_bertopic'] = df['text'].apply(minimal_clean_for_bertopic)
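
The special-character step uses a whitelist, so anything outside it is replaced by a space. The sketch below compares the notebook's pattern with a hypothetical extended whitelist that also keeps curly quotes and a few currency/percent symbols. `EXTENDED_PATTERN`, `strip_special` and the sample sentence are assumptions for illustration, not part of the original pipeline.

```python
import re

# Pattern used in minimal_clean_for_bertopic
ORIGINAL_PATTERN = r"[^\w\s.!?,;:\'-]"

# Assumption: a wider whitelist that also keeps curly quotes and
# common currency/percent symbols
EXTENDED_PATTERN = r"[^\w\s.!?,;:\'\-’‘“”£$€%]"

def strip_special(text, pattern):
    """Replace characters matched by `pattern` with a space, then tidy whitespace."""
    text = re.sub(pattern, " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Illustrative sentence, not taken from the corpus
sample = "Last year’s funding gap was £3.6 billion."

original_result = strip_special(sample, ORIGINAL_PATTERN)
extended_result = strip_special(sample, EXTENDED_PATTERN)

print(original_result)
print(extended_result)
```

With the original pattern the apostrophe and the pound sign are lost, which changes how the sentence reads. Whether that matters for the embeddings is an empirical question, so the extended whitelist is offered only as an option to test.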

## Remove Short Documents (Fewer Than 20 Words)

In [9]:
# remove very short documents
min_words = 20
df['word_count'] = df['text_bertopic'].apply(lambda x: len(str(x).split()))

print(f"Documents before filtering: {len(df)}")
df = df[df['word_count'] >= min_words].copy()
print(f"Documents after filtering (>={min_words} words): {len(df)}")

Documents before filtering: 3972
Documents after filtering (>=20 words): 3970
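
Two documents fall below the threshold. Before discarding rows it can be useful to list exactly what is dropped and why. The following is a dependency-free sketch of the same inclusive (`>=`) filter on toy records; the records and the `word_count` helper are illustrative assumptions that mirror `len(str(x).split())` in the cell above.

```python
MIN_WORDS = 20  # same threshold as the notebook

def word_count(text):
    """Count whitespace-separated tokens, matching len(str(x).split())."""
    return len(str(text).split())

# Toy records standing in for DataFrame rows (illustrative only)
records = [
    {"doc_id": "a", "text": "Too short to keep."},
    {"doc_id": "b", "text": " ".join(["word"] * 20)},  # exactly 20 words: kept
    {"doc_id": "c", "text": ""},                       # empty after cleaning
]

kept = [r for r in records if word_count(r["text"]) >= MIN_WORDS]
dropped = [r for r in records if word_count(r["text"]) < MIN_WORDS]

print("kept:", [r["doc_id"] for r in kept])
print("dropped:", [(r["doc_id"], word_count(r["text"])) for r in dropped])
```

In the notebook itself, the equivalent inspection would be to view `df[df['word_count'] < min_words]` before the filter is applied.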


## Data Quality Checks

In [10]:
# Check for empty documents
empty_docs = df['text_bertopic'].isna() | (df['text_bertopic'] == '')
print(f"Empty documents: {empty_docs.sum()}")

if empty_docs.sum() > 0:
    print("Removing empty documents...")
    df = df[~empty_docs].copy()

Empty documents: 0


In [11]:
# Check text length distribution
print("\nText length statistics (character count):")
df['char_count'] = df['text_bertopic'].str.len()
print(df['char_count'].describe())

print("\nWord count statistics:")
df['final_word_count'] = df['text_bertopic'].apply(lambda x: len(str(x).split()))
print(df['final_word_count'].describe())


Text length statistics (character count):
count     3970.000000
mean      4144.261461
std       2862.053007
min        134.000000
25%       2457.000000
50%       3764.000000
75%       4964.000000
max      33457.000000
Name: char_count, dtype: float64

Word count statistics:
count    3970.000000
mean      682.335013
std       482.220701
min        20.000000
25%       401.000000
50%       615.000000
75%       818.000000
max      5574.000000
Name: final_word_count, dtype: float64


In [12]:
print("\nRandom sample of cleaned documents:")
for idx in df.sample(3).index:
    print(f"\nOrganisation: {df.loc[idx, 'source']}")
    print(f"Date: {df.loc[idx, 'date']}")
    print(f"Text (first 200 chars): {df.loc[idx, 'text_bertopic'][:200]}...")
    print("-" * 80)


Random sample of cleaned documents:

Organisation: schoolsweek
Date: 2025-03-26
Text (first 200 chars): The Treasury must restore 3.6 billion in lost capital funding, leaders have said, as teachers reported widespread problems with their school buildings. The chancellor Rachel Reeves will deliver her sp...
--------------------------------------------------------------------------------

Organisation: schoolsweek
Date: 2024-02-06
Text (first 200 chars): Ofsted has rejected calls to automatically exempt schools with RAAC from inspection, but urged leaders to use its deferral policy if they get the call. In the autumn term, the watchdogremoved all scho...
--------------------------------------------------------------------------------

Organisation: schoolsweek
Date: 2024-07-18
Text (first 200 chars): There will be 260,000 pupils in schools than previously predicted by 2028, the Department for Education has estimated, revising up its projections in response to new data. Last year s pupil

## Prepare Final Dataset

In [13]:
# Create final clean dataset
df_final = df[[
    'text_bertopic',  # Main text for BERTopic
    'date',
    'source',
    'type'
]].copy()

df_final.rename(columns={'text_bertopic': 'text'}, inplace=True)


In [14]:
# Convert date to datetime
df_final['date'] = pd.to_datetime(df_final['date'])

# Sort by date
df_final = df_final.sort_values('date').reset_index(drop=True)

In [15]:
# Add document ID
df_final['doc_id'] = range(len(df_final))

In [16]:
# save cleaned data
output_path = '/workspaces/AM1_topic_modelling_BERTopic/data/bertopic_model_input.csv'
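# Create the directory that output_path actually points to (not ./data relative to the working directory)
Path(output_path).parent.mkdir(parents=True, exist_ok=True)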

df_final.to_csv(output_path, index=False)
print(f"\nSaved cleaned corpus to: {output_path}")


Saved cleaned corpus to: /workspaces/AM1_topic_modelling_BERTopic/data/bertopic_model_input.csv


In [17]:
metadata_cols = ['doc_id', 'date', 'source', 'type']
df_final[metadata_cols].to_csv('/workspaces/AM1_topic_modelling_BERTopic/data/bertopic_metadata.csv', index=False)
print("Saved metadata to: data/bertopic_metadata.csv")

Saved metadata to: data/bertopic_metadata.csv


In [18]:
print(f"\nTotal documents: {len(df_final)}")
print(f"Date range: {df_final['date'].min()} to {df_final['date'].max()}")
print("\nDocuments by organisation:")
print(df_final['source'].value_counts())
print("\nDocuments by organisation type:")
print(df_final['type'].value_counts())
print(f"\nAverage document length: {df_final['text'].str.len().mean():.0f} characters")
print(f"Median document length: {df_final['text'].str.len().median():.0f} characters")


Total documents: 3970
Date range: 2023-01-03 00:00:00 to 2026-01-08 00:00:00

Documents by organisation:
source
schoolsweek    2742
gov             702
fft             202
epi             115
nuffield        107
fed             102
Name: count, dtype: int64

Documents by organisation type:
type
ed_journalism    2742
gov_inst          702
think_tank        222
ed_res_org        202
prof_body         102
Name: count, dtype: int64

Average document length: 4144 characters
Median document length: 3764 characters


## Comparison Between NMF and BERTopic Cleaning
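
As a rough illustration of the contrast this section draws, the sketch below applies a simplified "heavy" pipeline and the minimal approach to the same sentence. The stopword set and function names are hypothetical. Lemmatisation and POS filtering are omitted because they need an NLP library such as spaCy, so this only approximates the NMF-style pipeline.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only
TOY_STOPWORDS = {"the", "in", "across", "has"}

def heavy_clean(text):
    """Approximate bag-of-words cleaning: lowercase, strip punctuation, drop stopwords."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(tok for tok in text.split() if tok not in TOY_STOPWORDS)

def light_clean(text):
    """Minimal cleaning: normalise whitespace only, keeping case and punctuation."""
    return re.sub(r"\s+", " ", text).strip()

sentence = "The teacher recruitment crisis in schools across England  has affected pupil outcomes."

heavy_result = heavy_clean(sentence)
light_result = light_clean(sentence)

print(heavy_result)
print(light_result)
```

The heavy output is a list of content words suited to TF-IDF, while the light output remains a grammatical sentence that an embedding model can encode in context.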

In [19]:
print("\n" + "="*80)
print("BERTOPIC vs NMF PREPROCESSING COMPARISON")
print("="*80)

comparison_text = """
KEY DIFFERENCES BETWEEN BERTOPIC AND NMF PREPROCESSING:

NMF Preprocessing (Heavy):
- Lowercase all text
- Remove all punctuation
- Tokenisation and lemmatisation
- Aggressive stopword removal
- POS filtering (only nouns, proper nouns, adjectives)
- Domain-specific stopword lists
→ Result: "teacher recruitment school pupil england"

BERTopic Preprocessing (Minimal):
- Preserve sentence structure
- Keep punctuation and capitalisation
- Keep person entities (for baseline model- revisit this at a later stage depending on model output)
- No lemmatisation or stemming
- No explicit stopword removal
→ Result: "The teacher recruitment crisis in schools across England 
   has affected pupil outcomes."

WHY THE DIFFERENCE?
- NMF uses bag-of-words (TF-IDF) → needs clean tokens
- BERTopic uses sentence embeddings → needs semantic context
- Transformers understand grammar, syntax, and context
- Over-cleaning removes information that embeddings use

TRADE-OFFS:
- BERTopic: Better semantic understanding, more contextual topics
- NMF: More interpretable individual words, faster processing
"""

print(comparison_text)


BERTOPIC vs NMF PREPROCESSING COMPARISON

KEY DIFFERENCES BETWEEN BERTOPIC AND NMF PREPROCESSING:

NMF Preprocessing (Heavy):
- Lowercase all text
- Remove all punctuation
- Tokenisation and lemmatisation
- Aggressive stopword removal
- POS filtering (only nouns, proper nouns, adjectives)
- Domain-specific stopword lists
→ Result: "teacher recruitment school pupil england"

BERTopic Preprocessing (Minimal):
- Preserve sentence structure
- Keep punctuation and capitalisation
- Keep person entities (for baseline model- revisit this at a later stage depending on model output)
- No lemmatisation or stemming
- No explicit stopword removal
→ Result: "The teacher recruitment crisis in schools across England 
   has affected pupil outcomes."

WHY THE DIFFERENCE?
- NMF uses bag-of-words (TF-IDF) → needs clean tokens
- BERTopic uses sentence embeddings → needs semantic context
- Transformers understand grammar, syntax, and context
- Over-cleaning removes information that embeddings use

TRADE-O