# Using BERTopic to perform topic modeling

In this lab, we are going to extract topics from a French online citizen consultation called *République Numérique*. This consultation, held in 2015, aimed to enrich, critique and extend the *République Numérique* bill before the French parliament adopted it in 2016.

## Importing the dependencies

First, we are going to import all the dependencies that we will need for this lab. If you cannot run the following code cell, remember to [create an environment](https://www.freecodecamp.org/news/how-to-setup-virtual-environments-in-python/), install the dependencies inside it (using the command `pip install -r requirements.txt`) and use it as your Jupyter kernel.

In [None]:
import os
import re
import requests
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import torch
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Downloading the dataset

In [None]:
dataset_path = "./projet-de-loi-numerique-consultation-anonyme.csv"
dataset_URL = "https://www.data.gouv.fr/fr/datasets/r/891bca8a-d9c1-4250-bfb2-3d13bf595813"

if not os.path.exists(dataset_path):
    # Download first, so that a failed request does not leave an empty file behind.
    r = requests.get(dataset_URL, allow_redirects=True)
    r.raise_for_status()
    with open(dataset_path, "wb") as f:
        f.write(r.content)

    print("Downloaded successfully!")
else:
    print("Dataset already downloaded!")

## Loading the dataset using `pandas`

In [None]:
consultation = pd.read_csv("./projet-de-loi-numerique-consultation-anonyme.csv",
                               parse_dates=["Création", "Modification"], index_col=0,
                               dtype={"Identifiant": str, "Titre": str, "Lié.à..": str, "Contenu": str, "Lien": str})

## Cleaning the dataset

Now that our dataset is loaded as a `pandas` `DataFrame`, we are going to clean it by filling some empty cells, removing a formatting issue in the content of our proposals and creating a column that aggregates all the content we want to use.

In [None]:
consultation["Lié.à.."] = consultation["Lié.à.."].fillna("Unknown")
consultation["Type.de.profil"] = consultation["Type.de.profil"].fillna("Unknown")

In [None]:
proposals = consultation.loc[consultation["Type.de.contenu"] == "Proposition"].copy()
proposals["Contenu"] = proposals["Contenu"].apply(lambda proposal: re.sub("Éléments de contexte\r?\nExplication de l'article :\r?\n", "", re.sub("(\r?\n)+", "\n", proposal)))

In [None]:
proposals["full_contribution"] = proposals[["Titre", "Contenu"]].agg(". \n\n".join, axis=1)
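To make the cleanup step more concrete, here is a minimal sketch that applies the same two regular expressions to a single string. The `raw` text is a made-up example (an assumption, not an actual row of the dataset) that mimics the boilerplate header and the repeated line breaks we remove from the `Contenu` column.

```python
import re

# Hypothetical raw proposal text, mimicking the boilerplate header and
# the repeated line breaks found in the "Contenu" column.
raw = (
    "Éléments de contexte\r\n"
    "Explication de l'article :\r\n"
    "Premier paragraphe.\r\n\r\n\r\n"
    "Second paragraphe."
)

# Step 1: collapse any run of line breaks (\n or \r\n) into a single \n.
collapsed = re.sub(r"(\r?\n)+", "\n", raw)

# Step 2: drop the boilerplate header that introduces the proposal.
cleaned = re.sub(
    r"Éléments de contexte\r?\nExplication de l'article :\r?\n", "", collapsed
)

print(repr(cleaned))
```

The order matters: the line breaks are normalized first, so the header pattern only has to match one line break after each header line.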

## Producing proposals embeddings

Here, we will transform our proposals into embedding vectors using a language model. The embedding of a text represents its position in an n-dimensional vector space, with a simple premise: texts with similar meanings should have similar vectors (even if they do not share common words).
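A common way to measure how similar two embeddings are is the cosine similarity. The sketch below uses made-up 3-dimensional vectors (an assumption for illustration only; real embeddings come from the model and have hundreds of dimensions) to show that vectors pointing in nearly the same direction get a score close to 1.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings", not produced by any real model.
vote_en_ligne = [0.9, 0.1, 0.2]
scrutin_electronique = [0.8, 0.2, 0.3]  # similar meaning, different words
logiciel_libre = [0.1, 0.9, 0.1]        # unrelated meaning

close = cosine_similarity(vote_en_ligne, scrutin_electronique)
far = cosine_similarity(vote_en_ligne, logiciel_libre)
print(f"similar texts: {close:.2f}, unrelated texts: {far:.2f}")
```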

In this example, we will use a model called [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2). This model is a "sentence transformer": it is specifically trained to embed sentences or paragraphs into a vector space. It is also multilingual, meaning that the same sentence in different languages should produce almost the same embedding.

But first, since embedding is a compute-intensive task, we must identify the most efficient device available to run it. We do so with PyTorch, the back end used in this lab. We prioritize NVIDIA GPUs with CUDA support, then Apple Silicon GPUs, and finally fall back to the CPU if neither is found.

If you need help installing the relevant version of PyTorch: https://pytorch.org/get-started/locally/

If you have an NVIDIA GPU but you don't know whether you have CUDA installed or not, type the following command:

```bash
nvcc --version
```

If you have it installed, you should see the CUDA version installed on your computer. Otherwise, you should install a PyTorch-compatible version (as listed [here](https://pytorch.org/get-started/locally/), row "Stable CUDA").

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
    print("GPU not found.")

print(device)

Now let's download and prepare the language model that we will use!

In [None]:
sentence_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

We will now encode our proposals into a 768-dimensional vector space.

In [None]:
proposals.full_contribution

In [None]:
proposals_embeddings = sentence_model.encode(proposals.full_contribution.tolist(), show_progress_bar=True, device=device)

Let's see what the embedding of a proposal looks like.

In [None]:
proposals_embeddings[0]

## Loading vectorizers for statistical representation

Now that we have our embeddings, we prepare two additional vectorizers whose job is to produce a statistical representation of the terms in the documents. We use both a simple counter, with a list of stopwords as a filter, and a more complex one based on a formula called TF-IDF. For each term (or n-gram) in a given document, the TF-IDF score is the frequency of the term in that document, weighted down by how common the term is across the whole corpus.

The objective of this formula is to give more weight to terms that appear in only part of the corpus than to terms that are very common but not distinctive of any topic. For example, in this dataset, the words "République" or "Numérique" would have a high term frequency but would not help distinguish one category of proposals from another.
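As a worked example, here is a minimal sketch of the classic TF-IDF formula, `tf * log(N / df)`, on a made-up tokenized corpus (the documents below are assumptions for illustration, not taken from the dataset). The helper assumes the term appears in at least one document.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized and stripped of stopwords.
corpus = [
    ["république", "numérique", "vote", "ligne"],
    ["république", "numérique", "logiciel", "libre"],
    ["république", "numérique", "données", "ouvertes"],
]

def tf_idf(term, document, corpus):
    """Classic TF-IDF: term frequency times log(N / document frequency)."""
    tf = Counter(document)[term] / len(document)
    df = sum(1 for doc in corpus if term in doc)  # assumed to be > 0
    return tf * math.log(len(corpus) / df)

common = tf_idf("numérique", corpus[0], corpus)  # appears in every document
specific = tf_idf("vote", corpus[0], corpus)     # appears in one document only
print(f"numérique: {common:.3f}, vote: {specific:.3f}")
# prints "numérique: 0.000, vote: 0.275"
```

Both words occur once in the first document, yet "numérique" scores 0 because it appears in every document, while "vote" keeps a positive score.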

*N.B.*: Here, we use a slightly modified version of the TF-IDF formula implemented by the BERTopic library to suit the needs of topic modeling tasks. However, the principles remain similar.
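To give an intuition of this class-based variant, here is a rough sketch based on our reading of the BERTopic documentation: all the documents of a topic are treated as one "class", and each term is weighted by its normalized frequency in the class times `log(1 + A / f)`, where `A` is the average number of words per class and `f` the frequency of the term across all classes. The toy classes and the helper `c_tf_idf` are assumptions for illustration; the library's actual implementation works on sparse matrices and may differ in its details.

```python
import math
from collections import Counter

# Hypothetical topics: each class is the concatenation of its documents' tokens.
classes = {
    "vote": ["vote", "vote", "ligne", "numérique", "numérique"],
    "logiciel": ["logiciel", "logiciel", "libre", "numérique", "numérique"],
}

counts = {name: Counter(tokens) for name, tokens in classes.items()}
total = Counter()
for class_counts in counts.values():
    total.update(class_counts)  # frequency of each term across all classes
avg_words_per_class = sum(len(t) for t in classes.values()) / len(classes)

def c_tf_idf(term, name, reduce_frequent_words=True):
    """Sketch of a class-based TF-IDF score for `term` in class `name`."""
    tf = counts[name][term] / len(classes[name])
    if reduce_frequent_words:
        tf = math.sqrt(tf)  # dampens the weight of very frequent terms
    return tf * math.log(1 + avg_words_per_class / total[term])

scores = {term: c_tf_idf(term, "vote") for term in counts["vote"]}
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.3f}")
```

In this toy example, "vote" and "numérique" are equally frequent inside the "vote" class, but "numérique" gets a lower score because it is also frequent in the other class.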

In [None]:
french_stopwords = requests.get("https://raw.githubusercontent.com/stopwords-iso/stopwords-fr/master/stopwords-fr.txt").text.splitlines()
vectorizer_model = CountVectorizer(max_df=0.80, min_df=0.20, stop_words=french_stopwords, ngram_range=(1, 2))
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

## Loading the model

Here, we instantiate a BERTopic model based on the model we used to produce our embeddings and our two vectorizers. We impose a minimum of 5 documents per topic (`min_topic_size=5`), but this value can be modified depending on two main factors:
- The size of our dataset: Here, with only a few hundred texts, we cannot increase this value too much. But for datasets of hundreds of thousands of texts, it may not be relevant to capture a topic specific to only 5 texts.
- Whether we want to identify broad topics covering a large quantity of documents or more fine-grained ones specific to a small subset of the corpus.

*N.B.*: we use the `low_memory=True` parameter here as there is a known bug in BERTopic specific to Apple Silicon chips which can lead to potential crashes of the computer, and using this parameter reduces this risk. Please remove it **only if you know what you are doing**. 

In [None]:
model = BERTopic(verbose=True, min_topic_size=5, ctfidf_model=ctfidf_model, vectorizer_model=vectorizer_model,
                 embedding_model=sentence_model, low_memory=True)  # reuse the model that is already loaded

## Producing the topic model

In [None]:
topics, probs = model.fit_transform(proposals.full_contribution.tolist(), proposals_embeddings)

In [None]:
freq = model.get_topic_info()
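# The table may include topic -1, which gathers the documents considered outliers.
print(f"Number of topics (including the outlier topic -1, if any): {len(freq)}")
freq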

## Visualizing the topic model

In [None]:
model.visualize_barchart(top_n_topics=8)

In [None]:
df_docs = model.get_document_info(proposals.full_contribution.tolist())
df_docs

## Identifying topics similar to a concept

In [None]:
concept_to_test = "vote en ligne"

In [None]:
similar_topics, similarity = model.find_topics(concept_to_test, top_n=3)

for t, s in zip(similar_topics, similarity):
    print(f"For topic {t}:")
    print(f"\tSimilarity with the concept '{concept_to_test}': {s}")
    # Each topic is described by (term, score) pairs; keep only the terms.
    defining_terms = [term for term, _ in model.get_topic(t)]
    print(f"\tMost relevant terms: {', '.join(defining_terms)}")

## Visualizing the topic hierarchy

As you can see, some topics are very close. A natural question is: how can we reduce the number of topics? The good news is that topics can be organized hierarchically, which helps us select an appropriate number of topics.

In [None]:
model.visualize_hierarchy(top_n_topics=30)
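To illustrate the idea of merging close topics, here is a simplified sketch that greedily merges the two most similar topic vectors until a target number of topics remains. This is only an illustration of the principle, not how BERTopic computes its hierarchy; the topic vectors and the helper `merge_closest_topics` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def merge_closest_topics(topic_vectors, target):
    """Greedily merge the two most similar topics until `target` remain (target >= 1)."""
    groups = {(topic,): list(vec) for topic, vec in topic_vectors.items()}
    while len(groups) > target:
        keys = list(groups)
        # Find the pair of groups whose vectors are the most similar.
        a, b = max(
            ((x, y) for i, x in enumerate(keys) for y in keys[i + 1:]),
            key=lambda pair: cosine(groups[pair[0]], groups[pair[1]]),
        )
        # Replace the pair with one group whose vector is their average.
        merged = [(p + q) / 2 for p, q in zip(groups.pop(a), groups.pop(b))]
        groups[a + b] = merged
    return sorted(groups)

# Hypothetical 2-dimensional topic vectors.
topic_vectors = {
    0: [0.9, 0.1],  # e.g. online voting
    1: [0.8, 0.2],  # e.g. electronic ballots
    2: [0.1, 0.9],  # e.g. free software
}
print(merge_closest_topics(topic_vectors, target=2))
# prints [(0, 1), (2,)]
```

Topics 0 and 1 point in almost the same direction, so they are merged first, while topic 2 stays on its own.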

## Visualizing the documents

Using the previous method, we can visualize the topics and get insight into their relationships. However, you might want a more fine-grained view that shows the documents inside each topic, to check whether they were assigned correctly and whether the topics make sense. To do so, we can use the `model.visualize_documents()` function. Unless precomputed embeddings are passed through its `embeddings` parameter, this function recalculates the document embeddings; it then reduces them to a 2-dimensional space for easier visualization.

*N.B.*: This process can be very expensive, especially if you want to be able to interact with individual documents (`hide_document_hover=False`).

In [None]:
# Reuse the embeddings computed earlier to avoid encoding the documents a second time.
model.visualize_documents(proposals.full_contribution.to_list(), embeddings=proposals_embeddings, hide_document_hover=False)