# Using LDA to perform topic modeling

In this lab, we are going to extract topics from a French online citizen consultation called *République Numérique*. This consultation, held in 2015, aimed to enrich, critique and extend the *République Numérique* bill before it was adopted by the French parliament in 2016.

The work presented here is based on the following paper:

> William Aboucaya, Sonia Guehis, Rafael Angarita. Building Online Public Consultation Knowledge Graphs. Text2KG 2023: International Workshop on Knowledge Graph Generation from Text, Co-located with the ESWC 2023, May 2023, Hersonissos, Greece.

## Importing the dependencies

First, we are going to import all the dependencies that we will need for this lab. If you cannot run the following code cell, make sure that you have [created an environment](https://www.freecodecamp.org/news/how-to-setup-virtual-environments-in-python/), installed the dependencies inside it (using the command `pip install -r requirements.txt`) and selected it as your Jupyter kernel.

In [None]:
import os
import re
import math
import requests
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

## Downloading the dataset

In [None]:
dataset_path = "./projet-de-loi-numerique-consultation-anonyme.csv"
dataset_URL = "https://www.data.gouv.fr/fr/datasets/r/891bca8a-d9c1-4250-bfb2-3d13bf595813"

if not os.path.exists(dataset_path):
    # Download before opening the file, so that a failed request
    # does not leave an empty CSV behind
    r = requests.get(dataset_URL, allow_redirects=True)
    r.raise_for_status()
    with open(dataset_path, "wb") as f:
        f.write(r.content)

    print("Downloaded successfully!")
else:
    print("Dataset already downloaded!")

## Loading the dataset using `pandas`

In [None]:
consultation = pd.read_csv(
    "./projet-de-loi-numerique-consultation-anonyme.csv",
    parse_dates=["Création", "Modification"],
    index_col=0,
    dtype={"Identifiant": str, "Titre": str, "Lié.à..": str, "Contenu": str, "Lien": str},
)

## Cleaning the dataset

Now that our dataset is loaded as a `pandas` `DataFrame`, we are going to clean it by filling some empty cells, removing a formatting artifact from the content of our proposals and creating a column that aggregates all the content we want to use.

In [None]:
consultation["Lié.à.."] = consultation["Lié.à.."].fillna("Unknown")
consultation["Type.de.profil"] = consultation["Type.de.profil"].fillna("Unknown")

In [None]:
proposals = consultation.loc[consultation["Type.de.contenu"] == "Proposition"].reset_index().copy()
proposals["Contenu"] = proposals["Contenu"].apply(lambda proposal: re.sub("Éléments de contexte\r?\nExplication de l'article :\r?\n", "", re.sub("(\r?\n)+", "\n", proposal)))
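
To make the two nested substitutions above easier to follow, here is a minimal sketch that applies them step by step to a made-up string. The text in `toy_content` is an assumption written for illustration, not an actual proposal from the dataset.

```python
import re

# Hypothetical proposal body, mimicking the header and the repeated
# line breaks that the cleaning step targets
toy_content = (
    "Éléments de contexte\r\n"
    "Explication de l'article :\r\n"
    "\r\n"
    "Cet article ouvre les données publiques.\n\n\n"
    "Il s'applique aux administrations."
)

# Step 1 (inner re.sub): collapse any run of line breaks into a single "\n"
collapsed = re.sub("(\r?\n)+", "\n", toy_content)

# Step 2 (outer re.sub): drop the boilerplate header added by the platform
cleaned = re.sub("Éléments de contexte\r?\nExplication de l'article :\r?\n", "", collapsed)

print(repr(cleaned))
```

The order matters: the header pattern expects exactly one line break after each header line, which is only guaranteed once the runs of line breaks have been collapsed.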

In [None]:
proposals["full_contribution"] = proposals[["Titre", "Contenu"]].agg(". \n\n".join, axis=1)
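
As a small illustration of the aggregation above, the sketch below joins a title and a body on a toy `DataFrame`. The two rows are invented for the example; only the column names and the separator come from the cell above.

```python
import pandas as pd

# Toy stand-in for the `proposals` DataFrame (invented content)
toy_proposals = pd.DataFrame(
    {
        "Titre": ["Open data par défaut", "Neutralité du net"],
        "Contenu": [
            "Les données publiques doivent être ouvertes.",
            "Les opérateurs ne doivent pas discriminer le trafic.",
        ],
    }
)

# With axis=1, the join function receives each row and concatenates
# its values, here the title followed by the body
toy_proposals["full_contribution"] = toy_proposals[["Titre", "Contenu"]].agg(". \n\n".join, axis=1)

print(toy_proposals.loc[0, "full_contribution"])
```

Note that `str.join` only accepts strings, so a row with a missing title or body would raise a `TypeError` unless the empty cells are filled first.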

## Vectorizing the proposals

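Here, we will transform our proposals into numeric vectors. Each dimension of a vector corresponds to a term of the vocabulary, and its value is the raw number of occurrences of that term in the document. With the default settings of `CountVectorizer`, terms are single words (unigrams).

With raw counts, the most common terms of our corpus carry the most weight. However, some of the most common words in any text cannot be representative of any topic (e.g., "the", "a", etc.). These stopwords are filtered out using an existing list available online. We also keep only the terms that appear in at least 2 documents (`min_df=2`) and limit the vocabulary to the 1,000 most frequent terms (`max_features=1000`).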

In [None]:
n_samples = len(proposals.index)

french_stopwords = requests.get("https://raw.githubusercontent.com/stopwords-iso/stopwords-fr/master/stopwords-fr.txt").text.splitlines()
tf_vectorizer = CountVectorizer(min_df=2, max_features=1000, stop_words=french_stopwords)
tf = tf_vectorizer.fit_transform(proposals["full_contribution"])
tf_feature_names = tf_vectorizer.get_feature_names_out()

Let's take a look at a document vector! Or more specifically, let's look at the first 50 features of a document vector.

In [None]:
tf[0,:50].todense()

These values are the number of occurrences of each vocabulary term in the document. The terms associated with these first 50 features are:

In [None]:
print(tf_feature_names[0:50])
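
Since most entries of a document vector are zeros, it can be more readable to list only the terms that actually occur. The following is a minimal sketch on a hand-built sparse matrix; `nonzero_terms`, `toy_tf` and `toy_feature_names` are hypothetical names introduced for this example, and it assumes the matrix is in SciPy's CSR format, which is what `CountVectorizer` returns.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical vocabulary and document-term counts (2 documents, 5 terms)
toy_feature_names = np.array(["administration", "donnees", "libre", "logiciel", "vie"])
toy_tf = csr_matrix(np.array([
    [0, 2, 1, 1, 0],
    [0, 1, 0, 0, 3],
]))


def nonzero_terms(tf_matrix, feature_names, doc_idx):
    """Return a {term: count} mapping for the terms present in one document."""
    row = tf_matrix[doc_idx]  # 1 x n_features sparse row
    # `indices` holds the column positions of the stored counts in `data`
    return {str(feature_names[j]): int(c) for j, c in zip(row.indices, row.data)}


print(nonzero_terms(toy_tf, toy_feature_names, 0))
```

Under the same assumption, calling `nonzero_terms(tf, tf_feature_names, 0)` should work on the real matrix built above.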

## Performing LDA

Here, we will perform LDA and plot the results for 5, 7 and 10 expected topics.

In [None]:
def plot_top_words(model, feature_names, n_top_words):
    row_size = 5

    fig, axes = plt.subplots(math.ceil(len(model.components_) / row_size), row_size, figsize=(30, 15), sharex=True)
    axes = axes.flatten()

    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f"Topic {topic_idx +1}", fontdict={"fontsize": 30})
        
        ax.invert_yaxis()
        ax.tick_params(axis="both", which="major", labelsize=20)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)

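    # Hide the unused axes of the grid (e.g., 3 empty panels for 7 topics on a 2x5 grid)
    for unused_ax in axes[len(model.components_):]:
        unused_ax.set_visible(False)

    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)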
    plt.tight_layout()
    plt.show()
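
The slice `topic.argsort()[: -n_top_words - 1 : -1]` used in `plot_top_words` is compact but not obvious. Here is a minimal sketch on a made-up topic with four terms (the names and weights are invented) showing that it returns the indices of the highest weights, in decreasing order.

```python
import numpy as np

# Invented topic-term weights for a vocabulary of four terms
toy_names = ["vie", "donnees", "libre", "logiciel"]
toy_topic = np.array([0.1, 2.0, 0.5, 3.0])
n_top_words = 2

# argsort() gives the indices that sort the weights in ascending order
ascending = toy_topic.argsort()

# Walking backwards from the end and stopping after n_top_words elements
# yields the indices of the largest weights, largest first
top_idx = ascending[: -n_top_words - 1 : -1]
top_words = [toy_names[i] for i in top_idx]

print(ascending.tolist(), top_idx.tolist(), top_words)
```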

In [None]:
# For 5 topics
lda_5_topics = LatentDirichletAllocation(
    n_components=5,
    max_iter=5,
    learning_method="online",
    learning_offset=50.0,
    random_state=0,
)
doc_topic_5_topics = lda_5_topics.fit_transform(tf)
    
plot_top_words(lda_5_topics, tf_feature_names, 20)

In [None]:
# For 7 topics
lda_7_topics = LatentDirichletAllocation(
    n_components=7,
    max_iter=5,
    learning_method="online",
    learning_offset=50.0,
    random_state=0,
)
doc_topic_7_topics = lda_7_topics.fit_transform(tf)
    
plot_top_words(lda_7_topics, tf_feature_names, 20)

In [None]:
# For 10 topics
lda_10_topics = LatentDirichletAllocation(
    n_components=10,
    max_iter=5,
    learning_method="online",
    learning_offset=50.0,
    random_state=0,
)
doc_topic_10_topics = lda_10_topics.fit_transform(tf)
    
plot_top_words(lda_10_topics, tf_feature_names, 20)
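
To clarify what the `doc_topic_*` arrays returned by `fit_transform` contain, here is a minimal sketch that fits the same kind of model on random counts. The matrix `toy_tf` is synthetic (an assumption for illustration, with no real topic structure), so only the shape and normalization of the output are meaningful here.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic document-term counts: 40 documents, 30 terms
rng = np.random.default_rng(0)
toy_tf = rng.poisson(lam=1.0, size=(40, 30))

toy_doc_topic = {}
for n_topics in (5, 7, 10):
    toy_lda = LatentDirichletAllocation(
        n_components=n_topics,
        max_iter=5,
        learning_method="online",
        learning_offset=50.0,
        random_state=0,
    )
    toy_doc_topic[n_topics] = toy_lda.fit_transform(toy_tf)

# One row per document, one column per topic; each row sums to 1
for n_topics, doc_topic in toy_doc_topic.items():
    print(n_topics, doc_topic.shape, doc_topic.sum(axis=1)[:3])
```

Each row can therefore be read as the distribution of one document over the topics.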

## Viewing the results for a given document

In [None]:
proposal_idx = 111

topic_scores = doc_topic_5_topics[proposal_idx]
print("Proposal:")
print(proposals.loc[proposal_idx, "full_contribution"])
print("\n")
print("Topic scores:")
for i, score in enumerate(topic_scores):
    print(f"\tTopic {i + 1}: {score}")
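
Beyond printing all the scores of one document, a common follow-up is to keep only the highest-scoring topic of each document. The sketch below does this on a hand-written document-topic matrix; the values in `toy_doc_topic_scores` are invented, and on real data one of the `doc_topic_*` arrays computed above would take its place.

```python
import numpy as np

# Invented document-topic scores: 3 documents, 3 topics, rows sum to 1
toy_doc_topic_scores = np.array([
    [0.70, 0.10, 0.20],
    [0.05, 0.15, 0.80],
    [0.30, 0.45, 0.25],
])

# argmax gives a 0-based column index; add 1 to match the "Topic 1..n" labels
dominant_topic = toy_doc_topic_scores.argmax(axis=1) + 1
dominant_score = toy_doc_topic_scores.max(axis=1)

for doc_idx, (topic, score) in enumerate(zip(dominant_topic, dominant_score)):
    print(f"Document {doc_idx}: topic {topic} (score {score:.2f})")
```

A document whose best score is low (such as the third one here) mixes several topics, so its dominant topic should be interpreted with care.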