<a href="https://colab.research.google.com/github/byeungchun/cbspeech_topicmodelling/blob/main/topicmodelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code example: Large language models: a primer for economists

In this notebook, we analyze the factors influencing stock market sentiment using Central Bank speeches from the Bank for International Settlements website, spanning 2014 to 2022.

The data processing pipeline includes:
- <b>Segmentation</b>: Dividing speeches into smaller, manageable chunks.

- <b>Tokenization</b>: Converting text into tokens for each chunk.

- <b>Embedding Transformation</b>: Mapping each chunk into numerical vectors (embeddings) to capture semantic meaning.

- <b>Clustering and Topic Extraction</b>: Grouping embeddings to identify thematic clusters, with topics based on frequently co-occurring terms.

## Setup

In [None]:
# Install BERTopic and its dependencies (quiet mode suppresses the pip log)
!pip install bertopic --quiet

In [None]:
import re
import json
import requests
import pandas as pd
import numpy as np
import torch

from tqdm import tqdm
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

from transformers import AutoTokenizer, AutoModel, BitsAndBytesConfig
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired

### Topic Modeling Parameters

- **speech_jsonfile**: URL to the JSON file containing central bank speech data. This file is hosted on GitHub and will be used as input for topic modeling.
  
- **word_chunk_size**: Defines the size of word chunks to be used in the topic model. Smaller values may capture more fine-grained themes, while larger values may generalize more.

- **bertopic_nr_topics**: Specifies the number of topics to generate in BERTopic modeling. This parameter controls the level of granularity in the topic clusters.

- **embeddingmodel**: Specifies the Hugging Face embedding model to be used for text representations. `'microsoft/Phi-3-mini-4k-instruct'` is a pre-trained model suitable for NLP tasks, but feel free to replace this with any other 'text generation' model available on Hugging Face.
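
To make the effect of `word_chunk_size` concrete, the sketch below counts how many full chunks a speech of a given length yields and how many trailing words are left over. The function name `chunk_budget` and the 1,250-word speech length are illustrative assumptions, not part of the notebook. Since the chunking loop later in the notebook keeps only full-size chunks, the leftover words are discarded.

```python
def chunk_budget(n_words: int, chunk_size: int) -> tuple[int, int]:
    """Return (number of full chunks, leftover words that do not fill a chunk)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return divmod(n_words, chunk_size)

# Hypothetical speech length, for illustration only
n_words = 1250
for size in (50, 100, 200):
    full_chunks, leftover = chunk_budget(n_words, size)
    print(f"chunk size {size}: {full_chunks} chunks, {leftover} words left over")
```

Smaller chunk sizes produce more (and shorter) documents for the topic model, while larger sizes produce fewer documents and can discard a longer tail of each speech.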


In [None]:
speech_jsonfile = r'https://github.com/byeungchun/cbspeech_topicmodelling/blob/main/centralbank_speech_201910.json?raw=true'
word_chunk_size = 100
bertopic_nr_topics = 7
embeddingmodel = 'microsoft/Phi-3-mini-4k-instruct'


## Data Loading
Central Bank Speech Dataset
- **Source:** [Bank for International Settlements (BIS) Central Bank Speeches](https://www.bis.org/cbspeeches/index.htm?m=60)
- **Number of Speeches:** 125 speeches, published as of October 2019

In [None]:
# Fetch the data from the URL
response = requests.get(speech_jsonfile)
response.raise_for_status()  # Raise an exception for bad status codes

# Load the JSON data
data = json.loads(response.text)
df = pd.DataFrame(data)

In [None]:
df.head()

## Data Processing
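- **Chunking:** Split each central bank speech into chunks of `word_chunk_size` (100) words. Trailing words that do not fill a complete chunk are dropped, as are duplicate chunks.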
- **Embedding Generation:** Create an embedding array for each chunk.

### Chunking

In [None]:
# Preprocess document content and generate text chunks for analysis
# - docs: List to store unique text chunks extracted from the documents
# - df_docs_mapper: List to map the original text ID to generated document chunks
# - docs_cnt: Counter to assign unique IDs to document chunks

# Initialize variables for document processing
docs = []
df_docs_mapper = []
docs_cnt = 0

# Iterate through each record in the DataFrame and generate text chunks
for i, rec in tqdm(df.iterrows(), total=len(df)):
    text = rec['content']
    text2 = re.sub(r'[\.\n]', ' ', text)  # Replace periods and newlines with space
    text2 = re.sub(r'\s+', ' ', text2)  # Replace multiple spaces with a single space
    text2 = re.sub(r'[^a-zA-Z0-9 ]', '', text2).strip()  # Remove non-alphanumeric characters
    words = text2.split()

    # Create word chunks from the cleaned text
    word_idx = 0
    while word_idx < len(words):
        word_piece = words[word_idx: word_idx + word_chunk_size]
        is_piece_chunksize = len(word_piece) == word_chunk_size
        chunk_text = ' '.join(word_piece)

        # Add unique chunks to docs and update the document mapper
        if is_piece_chunksize and chunk_text not in docs:
            docs.append(chunk_text)
            df_docs_mapper.append((i, docs_cnt))
            docs_cnt += 1
        word_idx = word_idx + word_chunk_size

# Convert document mapper list to DataFrame and merge with document chunks
df_docs_mapper = pd.DataFrame(df_docs_mapper, columns=['text_id', 'doc_id'])
df_docs = df_docs_mapper.merge(pd.DataFrame(docs), left_index=True, right_index=True)
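
The cleaning and chunking logic above can be isolated into two small functions, which makes its behaviour easy to check on a short string. This is a minimal sketch: `clean_text` and `chunk_unique` are hypothetical helper names, the sample text is invented, and the sketch de-duplicates within a single text, whereas the notebook loop de-duplicates across all speeches.

```python
import re

def clean_text(text: str) -> str:
    """Apply the same three regex steps as the notebook loop."""
    text = re.sub(r'[\.\n]', ' ', text)                 # periods and newlines -> space
    text = re.sub(r'\s+', ' ', text)                    # collapse whitespace runs
    return re.sub(r'[^a-zA-Z0-9 ]', '', text).strip()   # drop remaining punctuation

def chunk_unique(text: str, chunk_size: int) -> list[str]:
    """Split cleaned text into full-size, de-duplicated word chunks."""
    words = clean_text(text).split()
    seen, chunks = set(), []
    for start in range(0, len(words), chunk_size):
        piece = words[start:start + chunk_size]
        chunk = ' '.join(piece)
        # Keep only full-size chunks that have not been seen before
        if len(piece) == chunk_size and chunk not in seen:
            seen.add(chunk)
            chunks.append(chunk)
    return chunks

sample = "Inflation is low. Rates stay low.\nInflation is low. Growth slowed, a bit."
print(chunk_unique(sample, chunk_size=3))
```

Using a set for the membership test avoids rescanning the full list of chunks on every iteration. Note that the cleaning step also splits decimals at the period ("2.5" becomes "2 5") and joins hyphenated words ("long-term" becomes "longterm").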

In [None]:
print(f'Total number of speeches: {len(df_docs_mapper.text_id.drop_duplicates())}, total number of chunks (each chunk contains {word_chunk_size} words): {len(docs)}')


In [None]:
# Summary statistics of the number of chunks per speech, accessed by label rather than position
chunk_stat = df_docs_mapper.groupby('text_id')['doc_id'].count().describe()
print(f'For each speech, the average number of chunks is: {chunk_stat["mean"]:.2f}, with a minimum of {chunk_stat["min"]:.0f} chunks and a maximum of {chunk_stat["max"]:.0f} chunks.')


In [None]:
# Merge document chunks with mapper and publication dates
# - df_docs: DataFrame that contains document chunks, their corresponding IDs, and publication dates

# Merge document chunks with document mapper DataFrame
df_docs = df_docs_mapper.merge(pd.DataFrame(docs), how='inner', left_on='doc_id', right_index=True)
df_docs.columns = ['text_id', 'doc_id', 'chunk']

# Merge publication dates with document chunks
df_docs = df_docs.merge(df[['pub_date']], how='left', left_on='text_id', right_index=True)

### Embedding

In [None]:
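device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Use half precision on GPU and float32 on CPU, then move the model to the selected device
llm_engine = AutoModel.from_pretrained(
    embeddingmodel,
    torch_dtype=torch.float16 if device.type == "cuda" else torch.float32,
).to(device)
llm_engine.eval()  # inference only: disable dropout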

tokenizer = AutoTokenizer.from_pretrained(embeddingmodel)

In [None]:
# Generate sentence embeddings using a pre-trained language model
# - get_sentence_embedding: Function to get the embedding for a given sentence using a tokenizer and model
# - embeddings: List to store the embeddings for all documents in the dataset

# Function to get sentence embedding from the model
def get_sentence_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = llm_engine(**inputs)
    last_hidden_state = outputs.last_hidden_state
    sentence_embedding = torch.mean(last_hidden_state, dim=1).squeeze().tolist()
    return sentence_embedding

# Create embeddings for all documents in the dataset
embeddings = [get_sentence_embedding(doc) for doc in tqdm(docs)]

# Convert embeddings to a numpy array for further processing
embeddings = np.array(embeddings)
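
`get_sentence_embedding` averages the last hidden state over all tokens of one chunk. If chunks were instead embedded in padded batches (an assumption; the notebook processes one chunk at a time), padding positions would need to be excluded from the average. The pure-Python sketch below shows that arithmetic on toy vectors. `masked_mean_pool` is a hypothetical helper, and in practice the same computation would be done on tensors.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy example: three real tokens followed by one padding token
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

print(masked_mean_pool(token_vectors, attention_mask))  # mean over real tokens only
print(masked_mean_pool(token_vectors, [1, 1, 1, 1]))    # padding dilutes the mean
```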


## Topic Modeling

We fit a BERTopic model to identify key themes in the dataset. Key parameters include:

- **Top Words (`num_top_words`):** Number of key terms per topic.
- **KeyBERT Words (`num_keybert_words`):** Words representing each topic using KeyBERT.
- **Topic Limit (`topic_limit`):** Maximum number of topics.



In [None]:
num_top_words = 30
num_keybert_words = 10
topic_limit = 20

representation_model = {"KeyBERT": KeyBERTInspired(top_n_words=num_keybert_words, nr_repr_docs=20)}
vectorizer_model = CountVectorizer(stop_words="english")

bertopic_model = BERTopic(
    embedding_model='sentence-transformers/all-mpnet-base-v2',
    vectorizer_model=vectorizer_model,
    top_n_words=num_top_words,
    nr_topics=topic_limit,
    representation_model=representation_model,
    verbose=True
)

# Fit and transform the model on the documents and embeddings
topics, initial_probabilities = bertopic_model.fit_transform(docs, embeddings)

# Create a DataFrame with the topics and their corresponding probabilities
df_results = pd.DataFrame({'Topic': topics, 'Probability': initial_probabilities})

# Get detailed topic information from the BERTopic model and merge with the results DataFrame
df_topic_info = bertopic_model.get_topic_info()
df_results = df_results.merge(df_topic_info, how='left', on='Topic')

# Merge document mapper with results DataFrame and filter out documents without topics
embedding_list = [list(val) for val in embeddings]
df_results = df_docs_mapper.merge(df_results, how='left', left_on='doc_id', right_index=True)
df_bertopic_results = df_results[df_results.Topic != -1.0].dropna(how='any').reset_index(drop=True)
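
BERTopic labels chunks that it cannot assign to any cluster with topic `-1`, which is why the last line above filters on `Topic != -1`. A minimal sketch of that filtering step on a plain list follows. The topic labels are made up for illustration and are not results from this dataset.

```python
from collections import Counter

# Hypothetical topic assignments for eight chunks; -1 marks outliers
topics = [0, 1, -1, 0, 2, -1, 1, 0]

assigned = [t for t in topics if t != -1]
outlier_share = 1 - len(assigned) / len(topics)
topic_sizes = Counter(assigned)

print(f"outlier share: {outlier_share:.0%}")
print(topic_sizes.most_common())
```

Checking the outlier share before filtering is a quick way to see how much of the corpus the downstream analysis actually covers.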


In [None]:
df_results['KeyBERT'] = df_results['KeyBERT'].apply(tuple)
topic_keybert_counts = df_results.groupby(['Topic', 'KeyBERT']).size().reset_index(name='counts')
topic_keybert_counts

### (Optional) LDA Topic Modeling

This optional step applies LDA to identify topics across document chunks. While human labeling was used for topic identification in the main analysis, LDA provides a complementary, automated approach to confirm and explore additional themes. Key parameters and components include:

- **LDA Topic Results (`lda_topic_results`):** Stores LDA topic modeling outcomes for each document set.
- **Topics to Extract (`num_topics_to_extract`):** Specifies the number of topics to identify in each LDA model.
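
LDA operates on bag-of-words counts rather than raw text. The notebook builds these with gensim's `corpora.Dictionary` and `doc2bow`. The standard-library sketch here mimics that idea on toy keyword lists to show the data shape: a list of `(token_id, count)` pairs per document. The helper names and the sample keywords are illustrative assumptions, and the id assignment may differ from gensim's.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Toy keyword lists standing in for the KeyBERT keywords of two chunks
docs = [["inflation", "policy", "inflation"], ["policy", "banks"]]
vocab = build_vocab(docs)
print(vocab)
print([to_bow(doc, vocab) for doc in docs])
```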

In [None]:
lda_topic_results = []
num_topics_to_extract = 5

# Prepare the documents and dictionary for LDA modeling
document_texts = df_bertopic_results.KeyBERT.to_list()
doc_dictionary = corpora.Dictionary(document_texts)
bow_corpus = [doc_dictionary.doc2bow(doc) for doc in document_texts]

# Train the LDA model with the specified number of topics
lda_model = LdaModel(corpus=bow_corpus, id2word=doc_dictionary, num_topics=num_topics_to_extract, passes=10)

# Extract topics for each document and store the results
document_topic_info = []
for idx, doc_bow in enumerate(bow_corpus):
    doc_topics = lda_model.get_document_topics(doc_bow, minimum_probability=0.0)
    doc_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)
    dominant_topic = doc_topics[0][0]
    document_topic_info.append({
        'num_topics': num_topics_to_extract,
        'lda_topic': dominant_topic,
        'lda_probability': doc_topics[0][1],
        'lda_words': ' '.join([x[0] for x in sorted(lda_model.show_topic(dominant_topic), reverse=False)])
    })
lda_topic_results.append(document_topic_info)

# Print summary of LDA modeling results (one entry in document_topic_info per document)
print(f'The LDA model was trained with {num_topics_to_extract} topics. The number of documents processed is {len(document_topic_info)}.')
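
The loop above picks each document's dominant topic by sorting its `(topic_id, probability)` pairs. An equivalent selection with `max` is sketched below on a hand-written distribution. The numbers are made up for illustration.

```python
# Hypothetical (topic_id, probability) pairs for one document
doc_topics = [(0, 0.10), (1, 0.65), (2, 0.05), (3, 0.15), (4, 0.05)]

# max() finds the most probable topic without sorting the whole list
dominant_topic, dominant_prob = max(doc_topics, key=lambda pair: pair[1])
print(dominant_topic, dominant_prob)
```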


In [None]:
df_bertopic_results.head()

In [None]:
# Convert the per-document LDA results to a DataFrame so they can be inspected and merged
lda_topic_results = pd.DataFrame(document_topic_info)
lda_topic_results.head()

In [None]:
# Print summary of LDA modeling results
print(f'The LDA model was trained with {num_topics_to_extract} topics. The number of documents processed is {len(lda_topic_results)}.')

# Merge BERTopic results and LDA topic results into a single DataFrame
merged_results = df_bertopic_results.merge(lda_topic_results, how='left', left_index=True, right_index=True)

In [None]:
merged_results.head()