<a href="https://colab.research.google.com/github/daniel-hain/bibliometrics_EIST_2021/blob/master/python/BERTopic_ST.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERTopic EIST 



## Setup

### Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit → Notebook Settings
- Select GPU from the Hardware Accelerator drop-down
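
Once the runtime has been switched, a quick check can confirm that a GPU is actually visible. The sketch below is not part of the original notebook: it assumes the `nvidia-smi` tool is on the path (as it normally is on GPU runtimes), and the helper name `gpu_visible` is hypothetical.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` exists and reports at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tools on the PATH: most likely a CPU-only runtime.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"],  # "-L" lists the available GPUs, one per line
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, the embedding step still runs, only more slowly on the CPU.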

### Installing BERTopic

In [7]:
%%capture
!pip install bertopic

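### Restart the Notebook
After the installation finishes, restart the runtime so that the newly installed packages are loaded: Runtime → Restart Runtime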
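
After the restart, it can be worth confirming that the package is installed before moving on. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("bertopic")
print("bertopic version:", version if version else "not installed")
```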

## Data

In [1]:
import pandas as pd

In [2]:
docs = pd.read_csv("https://raw.githubusercontent.com/daniel-hain/bibliometrics_EIST_2021/master/data/data_text.csv")

In [3]:
docs.head()

Unnamed: 0,XX,text
0,GEELS_FW_2011_THE_MULTI_LEVEL_PERSPECTI,the multi-level perspective (mlp) has emerged ...
1,FRENKEN_K_2017_PUTTING_THE_SHARING_ECONO,we develop a conceptual framework that allows ...
2,KHLER_J_2019_AN_AGENDA_FOR_SUSTAINABIL,research on sustainability transitions has exp...
3,HANSEN_T_2015_THE_GEOGRAPHY_OF_SUSTAINA,this review covers the recent literature on th...
4,MEADOWCROFT_J_2011_ENGAGING_WITH_THE_POLITIC,although recent scholarship has contributed to...
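
BERTopic expects the documents as a list of strings, so missing or empty abstracts are best handled before training. The following is a sketch under that assumption; `clean_texts` is a hypothetical helper, and the sample values are made up to illustrate the edge cases rather than taken from the data set.

```python
import math
from typing import Iterable, List


def clean_texts(texts: Iterable[object]) -> List[str]:
    """Keep only non-empty strings, with surrounding whitespace stripped."""
    cleaned = []
    for text in texts:
        # pandas represents missing values as float NaN; skip those and None.
        if text is None or (isinstance(text, float) and math.isnan(text)):
            continue
        stripped = str(text).strip()
        if stripped:
            cleaned.append(stripped)
    return cleaned


# Illustrative input, not rows from data_text.csv.
sample = ["  the multi-level perspective ...  ", None, float("nan"), "", "research on transitions ..."]
print(clean_texts(sample))
```

In the notebook this could be called as `clean_texts(docs["text"])`. Note that dropping rows changes the alignment with `docs`, so the data frame would need the same filtering if topics are later joined back to it.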


# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model. 




## Training

* We start by instantiating BERTopic.
* We set `language` to `english` since our documents are in English.
* If you would like to use a multilingual model, use `language="multilingual"` instead.
* We also calculate the topic probabilities with `calculate_probabilities=True`.
* However, this can slow down BERTopic significantly with large numbers of documents (>100,000). Turn it off if you want to speed up the model.


In [4]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

from hdbscan import HDBSCAN
from umap import UMAP

from sklearn.feature_extraction.text import CountVectorizer

In [25]:
# custom vectorizer to get rid of stopwords
vectorizer_model = CountVectorizer(stop_words="english")

# smaller n_neighbors (5) than BERTopic's default of 15; n_components stays at the default of 5
umap_model = UMAP(n_neighbors=5, 
                  n_components=5, 
                  min_dist=0.0, 
                  metric='cosine', 
                  random_state=1337)

# reduce min_cluster_size and min_samples so that smaller topics can form
hdbscan_model = HDBSCAN(min_cluster_size=5, 
                        metric='euclidean', 
                        cluster_selection_method='eom', 
                        prediction_data=True, 
                        min_samples=2)

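# specify all custom models
# note: BERTopic ignores `language` when `embedding_model` is given, and ignores
# `n_gram_range` when a custom `vectorizer_model` is passed; to get n-grams,
# set `ngram_range=(1, 3)` on the CountVectorizer above instead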
topic_model = BERTopic(language="english", 
                       calculate_probabilities=True,
                       verbose=True, 
                       embedding_model="allenai-specter", 
                       n_gram_range=(1, 3), 
                       hdbscan_model=hdbscan_model, 
                       umap_model=umap_model,
                       vectorizer_model=vectorizer_model)

## Old simple version
# sentence_model = SentenceTransformer("allenai-specter", device="cpu")
# topic_model = BERTopic(embedding_model=sentence_model, verbose=True)

In [26]:
#topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
# fit_transform expects a list of strings, so convert the column explicitly
topics, probs = topic_model.fit_transform(docs["text"].tolist())

Batches:   0%|          | 0/15 [00:00<?, ?it/s]

2023-02-13 11:29:07,593 - BERTopic - Transformed documents to Embeddings
2023-02-13 11:29:09,158 - BERTopic - Reduced dimensionality
2023-02-13 11:29:09,244 - BERTopic - Clustered reduced embeddings


**NOTE**: Use `language="multilingual"` to select a model that supports 50+ languages.

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [27]:
freq = topic_model.get_topic_info()
freq.head(50)

Unnamed: 0,Topic,Count,Name
0,-1,40,-1_modeling_environmental_transitions_sustaina...
1,0,50,0_electric_car_vehicles_mobility
2,1,30,1_energy_renewable_pv_electricity
3,2,24,2_sustainability_transitions_research_justice
4,3,20,3_transition_societal_democracy_development
5,4,17,4_growth_economic_economy_efficiency
6,5,17,5_intermediaries_intermediation_intermediary_s...
7,6,16,6_tis_technological_innovation_context
8,7,16,7_urban_city_planning_infrastructure
9,8,15,8_niche_construction_topics_geographical


Topic -1 collects all outliers and should typically be ignored. Next, let's take a look at one of the frequent topics that were generated:

In [28]:
topic_model.get_topic(1)  # Inspect topic 1, the second most frequent topic

[('energy', 0.04515333917485621),
 ('renewable', 0.03584156611817029),
 ('pv', 0.028369537963538076),
 ('electricity', 0.022196905802558228),
 ('risk', 0.01942606257644722),
 ('political', 0.018896117535697605),
 ('storage', 0.016350942705969124),
 ('deployment', 0.015266956735359624),
 ('policy', 0.015216089075229925),
 ('carbon', 0.015116894028558083)]
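
The `Name` column shown earlier appears to be built from these terms: joining the topic number with its four highest-scoring words reproduces `1_energy_renewable_pv_electricity`. The sketch below mirrors that pattern on the output above; `label_from_terms` is a hypothetical helper, not a BERTopic function.

```python
from typing import List, Tuple


def label_from_terms(topic_id: int, terms: List[Tuple[str, float]], n_words: int = 4) -> str:
    """Join the topic id and its top `n_words` terms with underscores."""
    ranked = sorted(terms, key=lambda pair: pair[1], reverse=True)
    return "_".join([str(topic_id)] + [word for word, _ in ranked[:n_words]])


# First five (term, c-TF-IDF score) pairs from the output above, rounded.
topic_1 = [
    ("energy", 0.0452),
    ("renewable", 0.0358),
    ("pv", 0.0284),
    ("electricity", 0.0222),
    ("risk", 0.0194),
]
print(label_from_terms(1, topic_1))  # prints: 1_energy_renewable_pv_electricity
```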

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP; fixing `random_state` in the UMAP model, as done above, helps make runs reproducible.

## Attributes

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_               | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic                                                                      |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                             |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                                                               |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                                              |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                                                |

For example, to access the predicted topics for the first 10 documents, we simply run the following:

In [29]:
topic_model.topics_[:10]

[2, 14, 2, 18, 34, -1, 8, 14, 18, 21]
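
Because `topics_` is ordered like the input documents, the assignments can be paired with document identifiers or summarized directly. A small sketch using the ten assignments above and the five identifiers shown by `docs.head()`; the variable names are illustrative.

```python
from collections import Counter, defaultdict

# The ten assignments printed above.
first_topics = [2, 14, 2, 18, 34, -1, 8, 14, 18, 21]

# Identifiers of the first five documents, as shown by docs.head().
first_ids = [
    "GEELS_FW_2011_THE_MULTI_LEVEL_PERSPECTI",
    "FRENKEN_K_2017_PUTTING_THE_SHARING_ECONO",
    "KHLER_J_2019_AN_AGENDA_FOR_SUSTAINABIL",
    "HANSEN_T_2015_THE_GEOGRAPHY_OF_SUSTAINA",
    "MEADOWCROFT_J_2011_ENGAGING_WITH_THE_POLITIC",
]

# Group document identifiers by assigned topic (zip stops at the shorter list).
docs_by_topic = defaultdict(list)
for doc_id, topic in zip(first_ids, first_topics):
    docs_by_topic[topic].append(doc_id)

# Share of documents labelled as outliers (topic -1).
counts = Counter(first_topics)
outlier_share = counts[-1] / len(first_topics)

print(docs_by_topic[2])
print(f"outlier share: {outlier_share:.0%}")  # prints: outlier share: 10%
```

On the full model, the same pattern would use `topic_model.topics_` together with the identifier column of `docs`.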

# **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created. 

## Visualize Topics
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topics to get a good 
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [30]:
topic_model.visualize_topics()

## Visualize Topic Probabilities

The probabilities returned by `transform()` or `fit_transform()` (stored in `probs` above) can 
be used to understand how confident BERTopic is that certain topics can be found in a document. 

To visualize the distribution for a single document (here the document at index 200), we call:

In [31]:
topic_model.visualize_distribution(probs[200], min_probability=0.015)
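
With `calculate_probabilities=True`, each row of `probs` holds one probability per topic, so the same information can be read off numerically. A sketch with a made-up probability vector; `top_topics` is a hypothetical helper, and the values are illustrative rather than taken from the model.

```python
from typing import List, Sequence, Tuple


def top_topics(
    doc_probs: Sequence[float], min_probability: float = 0.015, k: int = 3
) -> List[Tuple[int, float]]:
    """Return up to `k` (topic, probability) pairs above the threshold, highest first."""
    candidates = [(topic, p) for topic, p in enumerate(doc_probs) if p >= min_probability]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]


# Made-up probabilities for a model with six topics (index = topic number).
example_probs = [0.01, 0.40, 0.02, 0.005, 0.30, 0.10]
print(top_topics(example_probs))  # prints: [(1, 0.4), (4, 0.3), (5, 0.1)]
```

In the notebook, `top_topics(probs[200])` would list the strongest topics for the document plotted above.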

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. To understand the potential hierarchical structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how they relate to one another. This can help in selecting an appropriate `nr_topics` when reducing the number of topics that you have created.

In [32]:
topic_model.visualize_hierarchy(top_n_topics=50)

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [34]:
topic_model.visualize_barchart(top_n_topics=50, n_words=10)
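
The relative scores within a topic can also be inspected without a plot. As a rough sketch, the snippet below rescales the scores of topic 1 (rounded from the `get_topic(1)` output shown earlier) to the top term and draws a text bar for each; `relative_scores` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def relative_scores(terms: List[Tuple[str, float]]) -> Dict[str, float]:
    """Scale each c-TF-IDF score by the highest score in the topic."""
    top = max(score for _, score in terms)
    return {word: score / top for word, score in terms}


# Top terms of topic 1, rounded from the get_topic(1) output.
topic_1_terms = [("energy", 0.0452), ("renewable", 0.0358), ("pv", 0.0284), ("electricity", 0.0222)]

for word, rel in relative_scores(topic_1_terms).items():
    bar = "#" * round(rel * 20)  # 20 characters correspond to the top term
    print(f"{word:<12}{bar} {rel:.2f}")
```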

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by applying cosine similarity to those topic embeddings. The result is a matrix indicating how similar the topics are to each other.

In [36]:
topic_model.visualize_heatmap(n_clusters=5, width=1000, height=1000)
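
For intuition, the matrix behind the heatmap can be sketched in a few lines. The vectors below are made up and far shorter than real topic embeddings, and `cosine_matrix` is a hypothetical helper rather than BERTopic's own implementation.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_matrix(vectors: List[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarities; entry [i][j] compares topics i and j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Made-up 3-dimensional "topic embeddings" for three topics.
topic_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
for row in cosine_matrix(topic_vectors):
    print([round(value, 2) for value in row])
```

The diagonal is 1 because every topic is identical to itself, and the matrix is symmetric, which is why the heatmap mirrors across its diagonal.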