<a href="https://colab.research.google.com/github/byeungchun/cbspeech_topicmodelling/blob/main/topicmodelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code example: Large language models: a primer for economists

In this notebook, we analyze the factors influencing stock market sentiment using Central Bank speeches from the Bank for International Settlements website, spanning 2014 to 2022.

The data processing pipeline includes:
- <b>Segmentation</b>: Dividing speeches into smaller, manageable chunks.

- <b>Tokenization</b>: Converting text into tokens for each chunk.

- <b>Embedding Transformation</b>: Mapping each chunk into numerical vectors (embeddings) to capture semantic meaning.

- <b>Clustering and Topic Extraction</b>: Grouping embeddings to identify thematic clusters, with topics based on frequently co-occurring terms.

## Setup

In [3]:
# pip install bertopic
!pip install bertopic accelerate --quiet




In [3]:
import re
import json
import requests
import pandas as pd
import numpy as np
import torch

from tqdm import tqdm
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

from transformers import AutoTokenizer, AutoModel, BitsAndBytesConfig
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired

### Topic Modeling Parameters

- **speech_jsonfile**: URL of the JSON input file used for topic modeling. The default below points to a sample file, `wikipedia_10pages.json`, hosted on GitHub.
  
- **word_chunk_size**: Defines the size of word chunks to be used in the topic model. Smaller values may capture more fine-grained themes, while larger values may generalize more.

- **bertopic_nr_topics**: Specifies the number of topics to generate in BERTopic modeling. This parameter controls the level of granularity in the topic clusters.

- **embeddingmodel**: Specifies the Hugging Face embedding model to be used for text representations. `'microsoft/Phi-3-mini-4k-instruct'` is a pre-trained model suitable for NLP tasks, but feel free to replace this with any other 'text generation' model available on Hugging Face.


In [4]:
speech_jsonfile = r'https://github.com/byeungchun/cbspeech_topicmodelling/blob/main/wikipedia_10pages.json?raw=true'
word_chunk_size = 100
bertopic_nr_topics = 7
embeddingmodel = 'microsoft/Phi-3-mini-4k-instruct'


## Data Loading
Central Bank Speech Dataset
- **Source:** [Bank for International Settlements (BIS) Central Bank Speeches](https://www.bis.org/cbspeeches/index.htm?m=60)
- **Number of Speeches:** 125 speeches, published as of October 2019

In [6]:
# Fetch the data from the URL
response = requests.get(speech_jsonfile)
response.raise_for_status()  # Raise an exception for bad status codes

# Load the JSON data
data = json.loads(response.text)
df = pd.DataFrame(data)

In [7]:
df.head()

| | title | text | url |
|---|---|---|---|
| 0 | Bank for International Settlements | The Bank for International Settlements (BIS) i... | https://en.wikipedia.org/wiki/Bank_for_Interna... |
| 1 | United States | The United States of America (USA), commonly k... | https://en.wikipedia.org/wiki/United_States |
| 2 | Machine Learning | Machine learning (ML) is a field of study in a... | https://en.wikipedia.org/wiki/Machine_learning |
| 3 | Data Science | Data science is an interdisciplinary academic ... | https://en.wikipedia.org/wiki/Data_science |
| 4 | Artificial Intelligence | Artificial intelligence (AI), in its broadest ... | https://en.wikipedia.org/wiki/Artificial_intel... |


## Data Processing
- **Chunking:** Clean each document's text and split it into chunks of `word_chunk_size` (100) words; a trailing chunk shorter than that, or a chunk already seen, is dropped.
- **Embedding Generation:** Create an embedding array for each chunk.
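
As a rough illustration of the chunking step, the sketch below applies the same three regular expressions as the notebook and then cuts the cleaned text into fixed-size word chunks. The function names `clean_text` and `chunk_words` are hypothetical helpers introduced here, not part of the notebook, and the sketch omits the notebook's de-duplication of chunks.

```python
import re


def clean_text(text):
    """Apply the notebook's three cleaning steps (hypothetical helper)."""
    text = re.sub(r'[\.\n]', ' ', text)          # periods and newlines -> space
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    return re.sub(r'[^a-zA-Z0-9 ]', '', text).strip()  # drop other punctuation


def chunk_words(words, size):
    """Join consecutive words into chunks of exactly `size` words.

    A trailing piece shorter than `size` is discarded, as in the notebook.
    """
    chunks = []
    for start in range(0, len(words), size):
        piece = words[start:start + size]
        if len(piece) == size:
            chunks.append(' '.join(piece))
    return chunks


cleaned = clean_text("Climate-related risks.\nThe ECB's view")
print(cleaned)
print(chunk_words(cleaned.split(), 2))
```

Because hyphens and apostrophes are deleted rather than replaced by spaces, a token such as `Climate-related` becomes `Climaterelated`. This is why merged words can show up among the topic keywords later on.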

### Chunking

In [8]:
# Preprocess document content and generate text chunks for analysis
# - docs: List to store unique text chunks extracted from the documents
# - df_docs_mapper: List to map the original text ID to generated document chunks
# - docs_cnt: Counter to assign unique IDs to document chunks

# Initialize variables for document processing
docs = []
df_docs_mapper = []
docs_cnt = 0

# Iterate through each record in the DataFrame and generate text chunks
for i, rec in tqdm(df.iterrows(), total=len(df)):
    text = rec['text']
    text2 = re.sub(r'[\.\n]', ' ', text)  # Replace periods and newlines with space
    text2 = re.sub(r'\s+', ' ', text2)  # Replace multiple spaces with a single space
    text2 = re.sub(r'[^a-zA-Z0-9 ]', '', text2).strip()  # Remove non-alphanumeric characters
    words = text2.split()

    # Create word chunks from the cleaned text
    word_idx = 0
    while word_idx < len(words):
        word_piece = words[word_idx: word_idx + word_chunk_size]
        is_piece_chunksize = len(word_piece) == word_chunk_size
        chunk_text = ' '.join(word_piece)

        # Add unique chunks to docs and update the document mapper
        if is_piece_chunksize and chunk_text not in docs:
            docs.append(chunk_text)
            df_docs_mapper.append((i, docs_cnt))
            docs_cnt += 1
        word_idx = word_idx + word_chunk_size

# Convert document mapper list to DataFrame and merge with document chunks
df_docs_mapper = pd.DataFrame(df_docs_mapper, columns=['text_id', 'doc_id'])
df_docs = df_docs_mapper.merge(pd.DataFrame(docs), left_index=True, right_index=True)

100%|██████████| 10/10 [00:00<00:00, 280.89it/s]


In [9]:
print(f'Total number of pages: {len(df_docs_mapper.text_id.drop_duplicates())}, total number of chunks (each chunk contains {word_chunk_size} words): {len(docs)}')


Total number of pages: 10, total number of chunks (each chunk contains 100 words): 768


In [10]:
# Summary statistics of the number of chunks per page (look up by label rather than by position)
chunk_stat = df_docs_mapper.groupby('text_id')['doc_id'].count().describe()
print(f'For each page, the average number of chunks is: {chunk_stat["mean"]:.2f}, with a minimum of {chunk_stat["min"]} chunks and a maximum of {chunk_stat["max"]} chunks.')


For each page, the average number of chunks is: 76.80, with a minimum of 10.0 chunks and a maximum of 132.0 chunks.


In [14]:
# Merge document chunks with the mapper
# - df_docs: DataFrame that contains the source text ID, the chunk ID, and the chunk text

# Merge document chunks with document mapper DataFrame
df_docs = df_docs_mapper.merge(pd.DataFrame(docs), how='inner', left_on='doc_id', right_index=True)
df_docs.columns = ['text_id', 'doc_id', 'chunk']

### Embedding

#### Embedding Generation Process

- **Purpose**: Generate one embedding vector per chunk using a pre-trained language model.
- **CPU vs. GPU**:
  - On a **CPU**, the notebook loads precomputed embeddings from an `.npz` file hosted on GitHub, because computing them on a CPU is slow.
  - On a **GPU**, it computes the embeddings with the specified model and tokenizer.

- **`get_sentence_embedding`**:
  - Takes a sentence as input, tokenizes it, and passes it through the model.
  - Averages the last hidden state over the token dimension to produce the sentence embedding.

- **`embeddings`**: Holds the embeddings for all chunks; it is converted to a NumPy array for the topic model.
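
The mean-pooling step can be illustrated without loading a model. The sketch below is a plain-Python stand-in that averages per-token vectors into one sentence vector. The name `mean_pool` and the optional `attention_mask` argument are assumptions added for illustration. The notebook's function encodes one sentence at a time, so it has no padding to mask out and simply averages over all tokens.

```python
def mean_pool(token_vectors, attention_mask=None):
    """Average per-token vectors into a single sentence vector.

    token_vectors: list of equal-length lists, one per token.
    attention_mask: optional list of 0/1 flags; tokens flagged 0
    (for example padding) are left out of the average.
    """
    if attention_mask is None:
        attention_mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    # zip(*kept) iterates over embedding dimensions (columns)
    return [sum(column) / len(kept) for column in zip(*kept)]


# Two tokens with two-dimensional hidden states
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```

A mask would only matter if several sentences were padded to a common length and encoded as one batch.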


In [16]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if device.type == 'cpu':  # compare the device type; a torch.device is not a plain string
  import io
  npzfileurl = 'https://github.com/byeungchun/cbspeech_topicmodelling/blob/main/embeddings_msphi3.npz?raw=true'
  response = requests.get(npzfileurl)
  response.raise_for_status()  # Check for errors
  with np.load(io.BytesIO(response.content)) as data:
    embeddings = data['array']
else:
  llm_engine = AutoModel.from_pretrained(
      embeddingmodel,
      device_map= 'auto',
      torch_dtype=torch.float16,
  ).to(device)
  tokenizer = AutoTokenizer.from_pretrained(embeddingmodel)

  def get_sentence_embedding(sentence):
      inputs = tokenizer(sentence, return_tensors='pt').to(device)
      with torch.no_grad():
          outputs = llm_engine(**inputs)
      last_hidden_state = outputs.last_hidden_state
      sentence_embedding = torch.mean(last_hidden_state, dim=1).squeeze().tolist()
      return sentence_embedding

  embeddings = [get_sentence_embedding(doc) for doc in tqdm(docs)]
  embeddings = np.array(embeddings)

## Topic Modeling

We fit a BERTopic model to identify key themes in the dataset. Key parameters include:

- **Top Words (`num_top_words`):** Number of key terms per topic.
- **KeyBERT Words (`num_keybert_words`):** Words representing each topic using KeyBERT.
- **Topic Limit (`topic_limit`):** Maximum number of topics.



In [17]:
num_top_words = 30
num_keybert_words = 10
topic_limit = 20

representation_model = {"KeyBERT": KeyBERTInspired(top_n_words=num_keybert_words, nr_repr_docs=20)}
vectorizer_model = CountVectorizer(stop_words="english")

bertopic_model = BERTopic(
    embedding_model='sentence-transformers/all-mpnet-base-v2',
    vectorizer_model=vectorizer_model,
    top_n_words=num_top_words,
    nr_topics=topic_limit,
    representation_model=representation_model,
    verbose=True
)

# Fit and transform the model on the documents and embeddings
topics, initial_probabilities = bertopic_model.fit_transform(docs, embeddings)

# Create a DataFrame with the topics and their corresponding probabilities
df_results = pd.DataFrame({'Topic': topics, 'Probability': initial_probabilities})

# Get detailed topic information from the BERTopic model and merge with the results DataFrame
df_topic_info = bertopic_model.get_topic_info()
df_results = df_results.merge(df_topic_info, how='left', on='Topic')

# Merge document mapper with results DataFrame and drop outlier chunks (BERTopic assigns them topic -1)
df_results = df_docs_mapper.merge(df_results, how='left', left_on='doc_id', right_index=True)
df_bertopic_results = df_results[df_results.Topic != -1.0].dropna(how='any').reset_index(drop=True)

2024-11-05 16:45:17,436 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-11-05 16:45:29,837 - BERTopic - Dimensionality - Completed ✓
2024-11-05 16:45:29,839 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-11-05 16:45:29,893 - BERTopic - Cluster - Completed ✓
2024-11-05 16:45:29,895 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-11-05 16:46:04,284 - BERTopic - Representation - Completed ✓
2024-11-05 16:46:04,285 - BERTopic - Topic reduction - Reducing number of topics
2024-11-05 16:46:21,497 - BERTopic - Topic reduction - Reduced number of topics from 34 to 20
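
BERTopic labels chunks that it cannot assign to any cluster with topic `-1`, and the cell above removes them before further analysis. Before discarding them it can help to check how large that group is. The sketch below uses a hypothetical helper, `summarize_topics`, which is not part of BERTopic, to count chunks per topic and report the outlier share from a plain list of topic labels.

```python
from collections import Counter


def summarize_topics(topics):
    """Return (chunks per topic, number of outliers, outlier share).

    `topics` is a list of integer labels such as the first value
    returned by fit_transform; -1 marks outliers.
    """
    counts = Counter(topics)
    n_outliers = counts.pop(-1, 0)  # remove the outlier bucket from the counts
    share = n_outliers / len(topics) if topics else 0.0
    return dict(counts), n_outliers, share


per_topic, n_outliers, share = summarize_topics([0, 0, 1, -1])
print(per_topic, n_outliers, f"{share:.0%}")
```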


In [18]:
df_results['KeyBERT'] = df_results['KeyBERT'].apply(tuple)
topic_keybert_counts = df_results.groupby(['Topic', 'KeyBERT']).size().reset_index(name='counts')
topic_keybert_counts

| | Topic | KeyBERT | counts |
|---|---|---|---|
| 0 | -1 | (economy, inflation, economies, economic, bank... | 285 |
| 1 | 0 | (finance, climate, climaterelated, banks, econ... | 334 |
| 2 | 1 | (fintech, banking, financial, finance, banks, ... | 148 |
| 3 | 2 | (macroeconomic, stabilisation, banking, banks,... | 137 |
| 4 | 3 | (eurozone, euro, eu, economy, ecb, economies, ... | 107 |
| 5 | 4 | (inflation, economy, policy, forecast, uncerta... | 105 |
| 6 | 5 | (banks, banking, finance, economics, nationalb... | 59 |
| 7 | 6 | (banks, economy, lending, inflation, economic,... | 48 |
| 8 | 7 | (competencies, organisations, policymakers, ec... | 48 |
| 9 | 8 | (banking, banks, hkma, bank, hong, hkmas, curr... | 39 |


### (Optional) LDA Topic Modeling

This optional step fits an LDA model to the KeyBERT keyword lists that BERTopic attached to each chunk, with outlier chunks excluded. While human labeling was used for topic identification in the main analysis, LDA provides a complementary, automated way to confirm and explore additional themes. Key parameters and components include:

- **LDA Topic Results (`lda_topic_results`):** Stores LDA topic modeling outcomes for each document set.
- **Topics to Extract (`num_topics_to_extract`):** Specifies the number of topics to identify in each LDA model.

In [25]:
lda_topic_results = []
num_topics_to_extract = 5

# Prepare the documents and dictionary for LDA modeling
document_texts = df_bertopic_results.KeyBERT.to_list()
doc_dictionary = corpora.Dictionary(document_texts)
bow_corpus = [doc_dictionary.doc2bow(doc) for doc in document_texts]

# Train the LDA model with the specified number of topics
lda_model = LdaModel(corpus=bow_corpus, id2word=doc_dictionary, num_topics=num_topics_to_extract, passes=10)

# Extract topics for each document and store the results
document_topic_info = []
for idx, doc_bow in enumerate(bow_corpus):
    doc_topics = lda_model.get_document_topics(doc_bow, minimum_probability=0.0)
    doc_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)
    dominant_topic = doc_topics[0][0]
    document_topic_info.append({
        'lda_topic': dominant_topic,
        'lda_probability': doc_topics[0][1],
        'lda_words': ' '.join(sorted(word for word, _ in lda_model.show_topic(dominant_topic)))  # top words, alphabetical
    })
lda_topic_results.append(document_topic_info)

# Print summary of LDA modeling results
print(f'The LDA model was trained with {num_topics_to_extract} topics.')


The LDA model was trained with 5 topics.
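
The loop above keeps only the most probable topic for each document. With `minimum_probability=0.0`, gensim's `get_document_topics` returns one `(topic_id, probability)` pair per topic, so the selection amounts to taking the pair with the largest probability. The sketch below shows that step on a hand-written list. The helper name `dominant_topic` and the numbers are illustrative only.

```python
def dominant_topic(doc_topics):
    """Pick the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


# Illustrative distribution over three topics for one document
example = [(0, 0.1), (1, 0.7), (2, 0.2)]
print(dominant_topic(example))
```

Using `max` avoids sorting the full list when only the top entry is needed. On ties it returns the first pair encountered.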


In [26]:
df_bertopic_results.head()

| | text_id | doc_id | Topic | Probability | Count | Name | Representation | KeyBERT | Representative_Docs |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 10 | 0.914676 | 24 | 10_inflation_governing_council_monetary | [inflation, governing, council, monetary, area... | [inflation, euro, deflationary, economy, ecb, ... | [of inflation to the Governing Councils medium... |
| 1 | 0 | 2 | 2 | 0.937807 | 137 | 2_financial_macroprudential_stability_area | [financial, macroprudential, stability, area, ... | [macroeconomic, stabilisation, banking, banks,... | [it nThis is why since 2014 the ECB has gradua... |
| 2 | 0 | 3 | 2 | 0.475947 | 137 | 2_financial_macroprudential_stability_area | [financial, macroprudential, stability, area, ... | [macroeconomic, stabilisation, banking, banks,... | [it nThis is why since 2014 the ECB has gradua... |
| 3 | 0 | 4 | 2 | 0.819463 | 137 | 2_financial_macroprudential_stability_area | [financial, macroprudential, stability, area, ... | [macroeconomic, stabilisation, banking, banks,... | [it nThis is why since 2014 the ECB has gradua... |
| 4 | 0 | 5 | 2 | 0.827686 | 137 | 2_financial_macroprudential_stability_area | [financial, macroprudential, stability, area, ... | [macroeconomic, stabilisation, banking, banks,... | [it nThis is why since 2014 the ECB has gradua... |


In [27]:
lda_topic_results = pd.DataFrame(document_topic_info)

# Print summary of LDA modeling results
print(f'The LDA model was trained with {num_topics_to_extract} topics. The number of documents processed is {len(lda_topic_results)}.')

# Merge BERTopic results and LDA topic results into a single DataFrame
merged_results = df_bertopic_results.merge(lda_topic_results, how='left', left_index=True, right_index=True)

The LDA model was trained with 5 topics. The number of documents processed is 1197.


In [28]:
merged_results.head()

| | text_id | doc_id | Topic | Probability | Count | Name | Representation | KeyBERT | Representative_Docs | lda_topic | lda_probability | lda_words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 10 | 0.914676 | 24 | 10_inflation_governing_council_monetary | [inflation, governing, council, monetary, area... | [inflation, euro, deflationary, economy, ecb, ... | [of inflation to the Governing Councils medium... | 1 | 0.926543 | banks ecb economy euro fiscal inflation liquid... |
| 1 | 0 | 2 | 2 | 0.937807 | 137 | 2_financial_macroprudential_stability_area | [financial, macroprudential, stability, area, ... | [macroeconomic, stabilisation, banking, banks,... | [it nThis is why since 2014 the ECB has gradua... | 1 | 0.926663 | banks ecb economy euro fiscal inflation liquid... |
| 2 | 0 | 3 | 2 | 0.475947 | 137 | 2_financial_macroprudential_stability_area | [financial, macroprudential, stability, area, ... | [macroeconomic, stabilisation, banking, banks,... | [it nThis is why since 2014 the ECB has gradua... | 1 | 0.926659 | banks ecb economy euro fiscal inflation liquid... |
| 3 | 0 | 4 | 2 | 0.819463 | 137 | 2_financial_macroprudential_stability_area | [financial, macroprudential, stability, area, ... | [macroeconomic, stabilisation, banking, banks,... | [it nThis is why since 2014 the ECB has gradua... | 1 | 0.92666 | banks ecb economy euro fiscal inflation liquid... |
| 4 | 0 | 5 | 2 | 0.827686 | 137 | 2_financial_macroprudential_stability_area | [financial, macroprudential, stability, area, ... | [macroeconomic, stabilisation, banking, banks,... | [it nThis is why since 2014 the ECB has gradua... | 1 | 0.926658 | banks ecb economy euro fiscal inflation liquid... |
