# Contents

2. [Topic Extraction](#2.-Topic-Extraction)  
    2.1. [Merge years](#2.1.-Merge-years)   
    2.2. [Create embeddings](#2.2.-Create-embeddings)  
    2.3. [Extract topics](#2.3.-Extract-topics)  
    2.4. [Predict topics](#2.4.-Predict-topics)

# **2. Topic Extraction**

The goal of Topic Extraction (or Topic Modeling) is, as the name suggests, to automatically identify the most meaningful topics that describe the content of a document, together with the most representative keywords for each topic. This is a crucial task in our pipeline because we want to know what politicians talk about, i.e. which axes an opinion can be built along.

After a comparison with LDA (cf. `Milestone2/TopicExtraction_exploration`), we decided to use **BERTopic** for this purpose. BERTopic is a topic extraction algorithm that builds on [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)), a pretrained, transformer-based language representation model developed by Google that quickly became extremely popular in the NLP community [[1]](https://arxiv.org/abs/1810.04805).

BERTopic uses BERT to embed documents and then applies sequentially [UMAP](https://umap-learn.readthedocs.io/en/latest/), to reduce the dimensionality, and [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/), to cluster semantically similar documents (i.e. create topics). Finally, topic representations are constructed with [c-TF-IDF](https://github.com/MaartenGr/cTFIDF).
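
To make the last step more concrete, here is a minimal pure-Python sketch of a class-based TF-IDF score. It assumes the weighting described in the BERTopic documentation (term frequency within a topic, normalized by the topic's word count, times `log(1 + A / f_x)`, where `A` is the average number of words per topic and `f_x` the frequency of word `x` across all topics). The function `c_tf_idf` and the toy quotes are hypothetical and only illustrate the idea; they are not the library's implementation.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic from a mapping {topic_id: [documents]}."""
    # Treat all documents of a topic as one bag of words
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # f_x: frequency of each word across all topics
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)

    # A: average number of words per topic
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            word: (tf / n_words) * math.log(1 + avg_words / total[word])
            for word, tf in topic_counts.items()
        }
    return scores


# Toy input (assumption): two already-clustered groups of quotes
toy_topics = {
    0: ["taxes must go down", "lower taxes create jobs"],
    1: ["climate change is real", "we must act on climate"],
}
scores = c_tf_idf(toy_topics)
top_words = {topic: max(s, key=s.get) for topic, s in scores.items()}
print(top_words)
```

On this toy input the highest-scoring word is `taxes` for topic 0 and `climate` for topic 1, while `must`, which occurs in both topics, is ranked below the words that are specific to one topic.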

The algorithm pipeline is shown schematically below [[2]](https://maartengr.github.io/BERTopic/tutorial/algorithm/algorithm.html).

<img src="https://maartengr.github.io/BERTopic/img/algorithm.png" width="700" height="600"/>

© Maarten Grootendorst 2021

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Needed as a workaround for a bug (ref: https://github.com/scikit-learn-contrib/hdbscan/pull/495)

!pip install --upgrade tbb
!pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan
!pip install bertopic

Collecting git+https://github.com/scikit-learn-contrib/hdbscan
  Cloning https://github.com/scikit-learn-contrib/hdbscan to /tmp/pip-req-build-z1vwo3xq
  Running command git clone -q https://github.com/scikit-learn-contrib/hdbscan /tmp/pip-req-build-z1vwo3xq
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done


In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pyarrow.parquet as pq
import pyarrow as pa
import gc
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
os.environ['NUMBA_THREADING_LAYER'] = 'omp'

In [None]:
preprocess_folder = '/content/drive/MyDrive/ADA/Processed/'
topics_folder = '/content/drive/MyDrive/ADA/Topics/'

In [None]:
import datetime
import pytz
def printts(*objects):
    print(datetime.datetime.now(pytz.timezone('Europe/Zurich')).strftime("%d %b %Y %H:%M:%S"), ":", *objects)

## 2.1. Merge years

After preprocessing each year separately, we now need to merge them into a single DataFrame, from which we will learn and predict the topics.
For convenience, the quotes are shuffled while merging and each one is assigned a unique id, corresponding to its position after shuffling.

In [None]:
def merge_df():
  '''
  Loads the preprocessed DataFrame for each year from 2015 to 2020 and merges them
  into a single DataFrame.
  '''

  # Create list of preprocessed DataFrames per year
  df_years = []
  for filename in sorted(os.listdir(preprocess_folder), reverse=True):
    processpath = os.path.join(preprocess_folder, filename)
    printts(f'Reading {filename}...')
    df_year = pd.read_parquet(processpath)
    df_years.append(df_year)

  # Concatenate the processed years into one single dataframe
  printts(f'Combining years...')
  df = pd.concat(df_years)
  del df_year
  del df_years

  # Shuffle dataframe
  df = df.sample(frac=1, random_state=42)

  # Set index
  index = np.array(list(map(lambda x: 'q' + x, np.arange(len(df)).astype(str))))
  df = df.set_index(index)
  # df = df.reset_index(drop=True)

  printts('Merging done')
  return df
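
Because `merge_df` is called again for every chunk and the ids are later rebuilt from row positions, the scheme relies on the shuffle being reproducible. The sketch below is a pure-Python analogue of that idea, not the pandas code above: `shuffled_ids` is a hypothetical helper that shuffles with a fixed seed and labels each row `q<position>`.

```python
import random


def shuffled_ids(rows, seed=42):
    """Shuffle rows reproducibly and key each one by 'q<position>'."""
    shuffled = list(rows)
    # A dedicated generator with a fixed seed gives the same order every call
    random.Random(seed).shuffle(shuffled)
    return {f"q{pos}": row for pos, row in enumerate(shuffled)}


rows = ["quote a", "quote b", "quote c", "quote d", "quote e"]
first = shuffled_ids(rows)
second = shuffled_ids(rows)
print(first == second)
```

With a fixed seed both calls return the same id-to-quote mapping, which is what allows the embeddings and the predicted topics to be matched back to quotes by position.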

## 2.2. Create embeddings

The first step in BERTopic is to create embeddings for each quote.
Since this step is computationally expensive and benefits greatly from a GPU, we run it separately from the other steps of the algorithm and save the results to a Drive folder, one file per chunk of the merged DataFrame.
With the resources available to us, we can split the DataFrame into 3 chunks of approximately 3M rows each.

In [None]:
CHUNK_NB = 3
CHUNK_SIZE = 9458185 // CHUNK_NB
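
As a quick check of the arithmetic, the sketch below assumes that 9458185 is the number of rows in the merged DataFrame. With floor division, fixed-size slices `[i*CHUNK_SIZE:(i+1)*CHUNK_SIZE]` leave out `N_ROWS % CHUNK_NB` rows at the end. The helper `chunk_bounds` is hypothetical and shows one possible way to let the last chunk absorb that remainder.

```python
N_ROWS = 9458185  # assumed total number of rows in the merged DataFrame
CHUNK_NB = 3
CHUNK_SIZE = N_ROWS // CHUNK_NB


def chunk_bounds(n_rows, chunk_nb):
    """Return (start, stop) pairs; the last chunk absorbs the remainder."""
    size = n_rows // chunk_nb
    bounds = [(i * size, (i + 1) * size) for i in range(chunk_nb)]
    bounds[-1] = (bounds[-1][0], n_rows)
    return bounds


bounds = chunk_bounds(N_ROWS, CHUNK_NB)
# Rows not reached by CHUNK_NB slices of exactly CHUNK_SIZE rows
uncovered = N_ROWS - CHUNK_NB * CHUNK_SIZE
print(CHUNK_SIZE, uncovered, bounds)
```

Here `CHUNK_SIZE` is 3152728 and a single row falls outside the three fixed-size slices; the bounds returned by `chunk_bounds` cover all rows.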

In [None]:
# Pretrained model using SentenceTransformers that maps sentences and paragraphs
# to a 384-dimensional dense vector space of embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
for i in range(CHUNK_NB):
  df = merge_df()
  quotes = df.quotation.values[i*CHUNK_SIZE:(i+1)*CHUNK_SIZE]

  # Delete DataFrame to free memory
  del df
  gc.collect()

  # Create and save embeddings
  printts(f'Creating embeddings for chunk number {i}...')
  embeddings = sentence_model.encode(quotes, show_progress_bar=True)
  printts(f'Saving embeddings...')
  np.save(os.path.join(topics_folder, f'quotes{i}.npy'), quotes)
  np.save(os.path.join(topics_folder, f'embeddings{i}.npy'), embeddings)

  # Delete embeddings to free memory
  del quotes
  del embeddings
  gc.collect()
  print('-----------------------------------------------------------')

## 2.3. Extract topics

Next, we apply the model to our dataset to learn relevant topics from the quotes. Unfortunately, UMAP is very expensive in terms of memory usage, so we can only afford to fit the model on a subset of 600k quotes, corresponding to about 6% of the filtered DataFrame. After that, we hierarchically reduce the number of topics to 1000 and create a representation for each of them using only words that have a minimum document frequency of 10.
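
To illustrate what a minimum document frequency means, here is a small pure-Python sketch. It mimics the semantics of an integer `min_df` in scikit-learn's `CountVectorizer` (a word is kept only if it occurs in at least `min_df` distinct documents) on a toy corpus. The helper `filter_by_min_df` and the toy documents are hypothetical and use simple whitespace tokenization, unlike the real vectorizer.

```python
from collections import Counter


def filter_by_min_df(documents, min_df):
    """Keep words that occur in at least `min_df` distinct documents."""
    doc_freq = Counter()
    for doc in documents:
        # A set counts each word at most once per document
        doc_freq.update(set(doc.lower().split()))
    return sorted(word for word, df in doc_freq.items() if df >= min_df)


toy_docs = ["taxes taxes taxes", "lower taxes", "climate policy", "climate and taxes"]
vocab = filter_by_min_df(toy_docs, min_df=2)
print(vocab)
```

Repeating a word inside one document does not raise its document frequency: `taxes` is kept because it appears in three documents, not because it appears three times in the first one.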

In [None]:
MIN_DF = 10
NUM_QUOTES = int(6e5)
NUM_TOPICS = 1000

In [None]:
# Creating BERTopic model
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=MIN_DF)
topic_model = BERTopic(verbose=True, 
                       calculate_probabilities=False, 
                       low_memory=True, 
                       vectorizer_model=vectorizer_model)

In [None]:
# Loading the first chunk of quotes, with their embeddings
quotes = np.load(os.path.join(topics_folder, f'quotes0.npy'), allow_pickle=True)
embeddings = np.load(os.path.join(topics_folder, f'embeddings0.npy'), allow_pickle=True)

In [None]:
# Keep only the first NUM_QUOTES quotes
reduced_quotes = quotes[0:NUM_QUOTES]
reduced_embeddings = embeddings[0:NUM_QUOTES]

In [None]:
# Delete quotes and embeddings to free RAM
del quotes
del embeddings
gc.collect() 

151

In [None]:
reduced_quotes.shape

(600000,)

In [None]:
# fit BERTopic model, i.e. extract topics
reduced_topics, _ = topic_model.fit_transform(reduced_quotes, reduced_embeddings)

2021-12-08 13:51:49,253 - BERTopic - Reduced dimensionality with UMAP
2021-12-08 13:54:26,857 - BERTopic - Clustered UMAP embeddings with HDBSCAN


In [None]:
new_topics, _ = topic_model.reduce_topics(reduced_quotes, reduced_topics, nr_topics=NUM_TOPICS)

2021-12-08 14:26:45,047 - BERTopic - Reduced number of topics from 3476 to 1001


In [None]:
topics_name = f'topics_{NUM_QUOTES:.1e}_{MIN_DF}df_{NUM_TOPICS}'
topics_name

'topics_6.0e+05_10df_1000'

In [None]:
# Save BERTopic model to Drive
topic_model.save(os.path.join(topics_folder, topics_name))

## 2.4. Predict topics

Once the topics are learned, we need to predict one topic for each quotation in the preprocessed DataFrame. Again, for convenience we perform this operation in chunks, reusing the embeddings already created.

In [None]:
model_name = 'topics_6.0e+05_10df_1000'

In [None]:
# Load BERTopic model
topic_model = BERTopic.load(os.path.join(topics_folder, model_name), 
                            embedding_model="all-MiniLM-L6-v2")

In [None]:
ichunks_nb = 10
ichunks_size = CHUNK_SIZE // ichunks_nb
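
The prediction loop in this section splits every chunk into internal chunks and rebuilds the global row positions used for the `q<position>` ids. The sketch below reproduces that index arithmetic in isolation, so that it can be checked without loading any data. `internal_ranges` is a hypothetical helper, and the chunk size 3152728 is the value of `9458185 // 3`.

```python
def internal_ranges(chunk_index, chunk_size, ichunks_nb):
    """Global (start, stop) row ranges of the internal chunks of one chunk."""
    ichunks_size = chunk_size // ichunks_nb
    offset = chunk_index * chunk_size
    ranges = []
    for j in range(ichunks_nb):
        start = offset + j * ichunks_size
        if j != ichunks_nb - 1:
            stop = offset + (j + 1) * ichunks_size
        else:
            # The last internal chunk runs to the end of the outer chunk
            stop = offset + chunk_size
        ranges.append((start, stop))
    return ranges


ranges = internal_ranges(chunk_index=2, chunk_size=3152728, ichunks_nb=10)
print(ranges[0], ranges[-1])
```

The ranges are contiguous and together span exactly one outer chunk; the last internal chunk is slightly longer than the others because it absorbs the remainder of the floor division.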

In [None]:
# Process every chunk (start the range at a later index to resume an interrupted run)
for i in range(CHUNK_NB):
  printts(f'Loading embeddings for chunk number {i}...')
  quotes = np.load(os.path.join(topics_folder, f'quotes{i}.npy'), allow_pickle=True)
  embeddings = np.load(os.path.join(topics_folder, f'embeddings{i}.npy'), allow_pickle=True)

  for j in range(ichunks_nb):
    if(j != ichunks_nb - 1):
      quotes_chunk = quotes[j*ichunks_size: (j+1)*ichunks_size]
      embeddings_chunk = embeddings[j*ichunks_size: (j+1)*ichunks_size]
    else:
      quotes_chunk = quotes[j*ichunks_size:]
      embeddings_chunk = embeddings[j*ichunks_size:]

    # Predict topics
    printts(f'Predicting topics for chunk number {i}, internal chunk number {j}...')
    topics, _ = topic_model.transform(quotes_chunk, embeddings_chunk)

    # Append topics to parquet
    printts('Writing topics to parquet...')
    if(j != ichunks_nb - 1):
      indrang = np.arange(i*CHUNK_SIZE + j*ichunks_size, i*CHUNK_SIZE + (j+1)*ichunks_size, dtype=int)
    else:
      indrang = np.arange(i*CHUNK_SIZE + j*ichunks_size, (i+1)*CHUNK_SIZE, dtype=int)

    index = np.array(list(map(lambda x: 'q' + x, indrang.astype(str))))
    table = pa.Table.from_pandas(pd.DataFrame(topics, index=index, columns=['topic']))
    pq.write_to_dataset(table, root_path=os.path.join(topics_folder, 'topics.parquet'))

    del topics
    gc.collect()
    print('-----------------------------')
  
  del quotes
  del embeddings
  gc.collect()
  print('*-----------------------------------------------------------*')