# Lecture 3 - Topic Modeling with BERTopic
* Text Retrieval and Mining, BSc BAN, 2023-2024
* Author: [Julien Rossi](mailto:j.rossi@uva.nl)



# Pre-Requisites

* For this demo you need an OpenAI API key
* Store it as a Notebook Secret under the name "OPENAI_KEY"
* If no key is given, the demo will not use ChatGPT and falls back to a text-to-text model from Hugging Face Transformers
* ChatGPT is used to create labels for the topics

# Resources

* Official BERTopic [webpage](https://maartengr.github.io/BERTopic/index.html)

# Application

We will use a dataset of BBC articles, see the [BBC Page](http://mlg.ucd.ie/datasets/bbc.html)


In [None]:
!pip install bertopic openai typing-extensions=="4.5.0"

## Prepare Corpus

In [None]:
import requests

# Download the BBC full-text archive
r = requests.get('http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip', timeout=60)

# Stop early if the download failed
r.raise_for_status()

with open('bbc-fulltext.zip', 'wb') as out:
    out.write(r.content)


In [None]:
from zipfile import ZipFile

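texts = []
with ZipFile('bbc-fulltext.zip') as zf:
    # Keep only the article files (lower-case .txt extension)
    txtfiles = [name for name in zf.namelist() if name.endswith('.txt')]
    for txtf in txtfiles:
        with zf.open(txtf, 'r') as txt:
            # Decode, dropping undecodable bytes, and flatten each article to a single line
            texts.append(txt.read().decode('utf-8', 'ignore').replace('\n', ' '))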

print(f'Collected {len(texts)} articles.')
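
As a quick sanity check on the corpus, the articles can be counted per category. This is a minimal sketch that assumes the archive stores articles as `bbc/<category>/<id>.txt`; that layout is an assumption, and the member names below are made up. In the notebook, `names` would come from `zf.namelist()`.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical archive member names, mimicking the assumed bbc/<category>/<id>.txt layout
names = [
    "bbc/README.TXT",
    "bbc/business/001.txt",
    "bbc/business/002.txt",
    "bbc/sport/001.txt",
    "bbc/tech/001.txt",
]


def category_of(name):
    """Return the name of the folder that directly contains the file."""
    return PurePosixPath(name).parent.name


# Same filter as in the cell above: only lower-case .txt files are kept
txt_names = [n for n in names if n.endswith(".txt")]

category_counts = Counter(category_of(n) for n in txt_names)
print(category_counts)
```

Note that the `.endswith('.txt')` filter is case-sensitive, so a file such as the hypothetical `README.TXT` above is skipped.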

In [None]:
texts[0]

# Topic Modeling

* We will prepare all the elements of the topic modeling pipeline:
* **EMBEDDING MODEL**, which transforms a document into a vector (more about this in Week 3). These vectors are called "embeddings"
* **DIMENSIONALITY REDUCTION** model, which reduces the number of dimensions of the embeddings
* **CLUSTERING ALGORITHM**, which groups documents on similar topics into "semantic" clusters
* **VECTORIZER**, which extracts the vocabulary of the corpus and of each cluster
* **REPRESENTATION MODEL**, which creates labels from the top words in each topic
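
To make "top words in each topic" concrete: BERTopic's documentation describes a class-based TF-IDF (c-TF-IDF) that scores a word highly when it is frequent in one cluster but rare across the others. The sketch below is a simplified pure-Python version on made-up text. It skips the normalisation and n-gram handling the library applies, and the names `clusters`, `ctfidf` and `top_words` are illustrative only.

```python
import math
from collections import Counter

# Toy clusters: each cluster is treated as one long document (made-up text)
clusters = {
    0: "match goal team goal season",
    1: "market shares profit market bank",
}

# Term frequencies inside each cluster
tf = {c: Counter(text.split()) for c, text in clusters.items()}

# Frequency of each term across all clusters
total_tf = Counter()
for counts in tf.values():
    total_tf.update(counts)

# Average number of words per cluster
avg_words = sum(total_tf.values()) / len(clusters)


def ctfidf(cluster_id):
    """Score each term of a cluster as tf * log(1 + avg_words / total frequency)."""
    return {
        term: count * math.log(1 + avg_words / total_tf[term])
        for term, count in tf[cluster_id].items()
    }


def top_words(cluster_id, n=3):
    """Return the n highest-scoring terms of a cluster."""
    scores = ctfidf(cluster_id)
    return sorted(scores, key=scores.get, reverse=True)[:n]


for c in clusters:
    print(c, top_words(c))
```

The representation model then turns such a list of top words into a short, readable label.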

In [None]:
# Using Sentence Transformers to create document embeddings

from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(texts, show_progress_bar=True)



In [None]:
# Dimensionality Reduction: UMAP
from umap import UMAP
umap_model = UMAP(n_neighbors=15, n_components=10, min_dist=0.0, metric='cosine', random_state=42)

# Clustering: HDBSCAN
from hdbscan import HDBSCAN
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Vectorizer: CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))

# Representation Model
# ChatGPT if OPENAI API KEY is given
# Otherwise Text-2-Text Generation Pipeline from Huggingface Transformers
representation_model = None

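# Read the key from the Colab Notebook Secrets.
# Falls back to None outside Colab or when the secret is missing.
try:
    from google.colab import userdata
    openai_key = userdata.get('OPENAI_KEY')
except Exception:
    openai_key = None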

if openai_key is not None:
    # Use ChatGPT
    import openai
    from bertopic.representation import OpenAI

    # GPT-3.5
    prompt = """
    I have a topic that contains the following documents:
    [DOCUMENTS]
    The topic is described by the following keywords: [KEYWORDS]

    Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
    topic: <topic label>
    """
    client = openai.OpenAI(api_key=openai_key)
    representation_model = {"ChatGPT": OpenAI(client, model="gpt-3.5-turbo", exponential_backoff=True, chat=True, prompt=prompt, doc_length=100, tokenizer="whitespace")}
else:
    # Use a Text-2-Text Model from Transformers
    from transformers import pipeline
    from bertopic.representation import TextGeneration

    prompt = "I have a topic described by the following keywords: [KEYWORDS]. [DOCUMENTS] Describe this topic in less than 4 words, topic:"

    # Create your representation model
    generator = pipeline('text2text-generation', model='google/flan-t5-base')
    representation_model = {"Flan T5": TextGeneration(generator, prompt=prompt, nr_docs=1, doc_length=0, tokenizer="whitespace")}
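
Both prompts contain the tags `[KEYWORDS]` and `[DOCUMENTS]`, which BERTopic replaces with each topic's top words and representative documents before calling the language model. The sketch below only approximates that substitution. `fill_prompt` is a hypothetical helper, not a BERTopic function, and the exact formatting used by the library may differ.

```python
prompt = (
    "I have a topic described by the following keywords: [KEYWORDS]. "
    "[DOCUMENTS] Describe this topic in less than 4 words, topic:"
)


def fill_prompt(template, keywords, documents):
    """Replace the [KEYWORDS] and [DOCUMENTS] tags with concrete text (illustrative only)."""
    filled = template.replace("[KEYWORDS]", ", ".join(keywords))
    filled = filled.replace("[DOCUMENTS]", " ".join(documents))
    return filled


# Made-up keywords and document for a single topic
filled_prompt = fill_prompt(
    prompt,
    keywords=["goal", "match", "league"],
    documents=["The striker scored twice in the final."],
)
print(filled_prompt)
```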


In [None]:
from bertopic import BERTopic

topic_model = BERTopic(
    # Pipeline models
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,

    # Hyperparameters
    top_n_words=10,

    # Running parameters
    verbose=True,
    calculate_probabilities=True,
)



In [None]:
topics, probs = topic_model.fit_transform(documents=texts, embeddings=embeddings)
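
`fit_transform` returns one topic id per document in `topics`. In BERTopic, the id `-1` collects outliers, i.e. documents that HDBSCAN did not assign to any cluster. The sketch below uses a made-up list `toy_topics` in place of the real `topics` to show how to count documents per topic and measure the share of outliers.

```python
from collections import Counter

# Made-up topic assignments, one per document; -1 is the outlier label
toy_topics = [0, 1, -1, 0, 2, -1, 0, 1]

# Number of documents per topic id
counts = Counter(toy_topics)

# Fraction of documents left without a topic
outlier_share = counts[-1] / len(toy_topics)

for topic_id, n_docs in sorted(counts.items()):
    label = "outliers" if topic_id == -1 else f"topic {topic_id}"
    print(f"{label}: {n_docs} documents")

print(f"Outlier share: {outlier_share:.0%}")
```

A large outlier share is often a hint to revisit `min_cluster_size` or the UMAP settings.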

In [None]:
topic_model.get_topic_info()

In [None]:
# Topic probability distribution for the document at index 0

topic_model.visualize_distribution(probs[0])


In [None]:
# Similarity between topics, shown as a heatmap

topic_model.visualize_heatmap()
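
According to the BERTopic documentation, the heatmap is based on the cosine similarity between topic embeddings. The sketch below computes such a similarity matrix in pure Python for three made-up topic vectors. The vectors and topic names are illustrative only and are not taken from the fitted model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional topic vectors
topic_vectors = {
    "sport": [1.0, 0.0, 0.2],
    "football": [0.9, 0.1, 0.3],
    "finance": [0.0, 1.0, 0.1],
}

names = list(topic_vectors)

# Pairwise similarity matrix, rows and columns in the order of `names`
matrix = [
    [cosine_similarity(topic_vectors[a], topic_vectors[b]) for b in names]
    for a in names
]

for name, row in zip(names, matrix):
    print(name, [round(x, 2) for x in row])
```

Values close to 1 indicate topics that point in nearly the same direction, which is what the bright cells of the heatmap show.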
