# Topic Modeling BibTeX Entries

<a href="https://github.com/MaartenGr/BERTopic"><img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="20%" height="20%" align="right" /></a>

In this notebook, we apply advanced topic modeling techniques with [BERTopic](https://github.com/MaartenGr/BERTopic) to the entries of a BibTeX file. Topic modeling is a technique that lets users gain insight into large amounts of textual data without having to read each document individually.

The notebook also explores other techniques, such as keyword extraction with RAKE and simple frequency counts of authors, years, and keywords.

## Install dependencies

In [1]:
%%capture
#!pip install bertopic
#!pip install pybtex
#!pip install nltk
#!pip install rake_nltk

**NOTE**: After installing, make sure to restart the notebook to re-import the correct versions of packages that were previously imported. 

In [2]:
from plotly.offline import init_notebook_mode
import json
import nltk
from pybtex.database import parse_file
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from rake_nltk import Rake
from collections import Counter
from nltk.stem import WordNetLemmatizer

init_notebook_mode(connected=True)
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
nltk.download("wordnet")
nltk.download("omw-1.4")

[nltk_data] Downloading package wordnet to /home/adsantos/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/adsantos/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Prepare Data

In [3]:
# https://arxiv.org/help/api/user-manual
category_map = {'astro-ph': 'Astrophysics',
'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics',
'astro-ph.EP': 'Earth and Planetary Astrophysics',
'astro-ph.GA': 'Astrophysics of Galaxies',
'astro-ph.HE': 'High Energy Astrophysical Phenomena',
'astro-ph.IM': 'Instrumentation and Methods for Astrophysics',
'astro-ph.SR': 'Solar and Stellar Astrophysics',
'cond-mat.dis-nn': 'Disordered Systems and Neural Networks',
'cond-mat.mes-hall': 'Mesoscale and Nanoscale Physics',
'cond-mat.mtrl-sci': 'Materials Science',
'cond-mat.other': 'Other Condensed Matter',
'cond-mat.quant-gas': 'Quantum Gases',
'cond-mat.soft': 'Soft Condensed Matter',
'cond-mat.stat-mech': 'Statistical Mechanics',
'cond-mat.str-el': 'Strongly Correlated Electrons',
'cond-mat.supr-con': 'Superconductivity',
'cs.AI': 'Artificial Intelligence',
'cs.AR': 'Hardware Architecture',
'cs.CC': 'Computational Complexity',
'cs.CE': 'Computational Engineering, Finance, and Science',
'cs.CG': 'Computational Geometry',
'cs.CL': 'Computation and Language',
'cs.CR': 'Cryptography and Security',
'cs.CV': 'Computer Vision and Pattern Recognition',
'cs.CY': 'Computers and Society',
'cs.DB': 'Databases',
'cs.DC': 'Distributed, Parallel, and Cluster Computing',
'cs.DL': 'Digital Libraries',
'cs.DM': 'Discrete Mathematics',
'cs.DS': 'Data Structures and Algorithms',
'cs.ET': 'Emerging Technologies',
'cs.FL': 'Formal Languages and Automata Theory',
'cs.GL': 'General Literature',
'cs.GR': 'Graphics',
'cs.GT': 'Computer Science and Game Theory',
'cs.HC': 'Human-Computer Interaction',
'cs.IR': 'Information Retrieval',
'cs.IT': 'Information Theory',
'cs.LG': 'Machine Learning',
'cs.LO': 'Logic in Computer Science',
'cs.MA': 'Multiagent Systems',
'cs.MM': 'Multimedia',
'cs.MS': 'Mathematical Software',
'cs.NA': 'Numerical Analysis',
'cs.NE': 'Neural and Evolutionary Computing',
'cs.NI': 'Networking and Internet Architecture',
'cs.OH': 'Other Computer Science',
'cs.OS': 'Operating Systems',
'cs.PF': 'Performance',
'cs.PL': 'Programming Languages',
'cs.RO': 'Robotics',
'cs.SC': 'Symbolic Computation',
'cs.SD': 'Sound',
'cs.SE': 'Software Engineering',
'cs.SI': 'Social and Information Networks',
'cs.SY': 'Systems and Control',
'econ.EM': 'Econometrics',
'eess.AS': 'Audio and Speech Processing',
'eess.IV': 'Image and Video Processing',
'eess.SP': 'Signal Processing',
'gr-qc': 'General Relativity and Quantum Cosmology',
'hep-ex': 'High Energy Physics - Experiment',
'hep-lat': 'High Energy Physics - Lattice',
'hep-ph': 'High Energy Physics - Phenomenology',
'hep-th': 'High Energy Physics - Theory',
'math.AC': 'Commutative Algebra',
'math.AG': 'Algebraic Geometry',
'math.AP': 'Analysis of PDEs',
'math.AT': 'Algebraic Topology',
'math.CA': 'Classical Analysis and ODEs',
'math.CO': 'Combinatorics',
'math.CT': 'Category Theory',
'math.CV': 'Complex Variables',
'math.DG': 'Differential Geometry',
'math.DS': 'Dynamical Systems',
'math.FA': 'Functional Analysis',
'math.GM': 'General Mathematics',
'math.GN': 'General Topology',
'math.GR': 'Group Theory',
'math.GT': 'Geometric Topology',
'math.HO': 'History and Overview',
'math.IT': 'Information Theory',
'math.KT': 'K-Theory and Homology',
'math.LO': 'Logic',
'math.MG': 'Metric Geometry',
'math.MP': 'Mathematical Physics',
'math.NA': 'Numerical Analysis',
'math.NT': 'Number Theory',
'math.OA': 'Operator Algebras',
'math.OC': 'Optimization and Control',
'math.PR': 'Probability',
'math.QA': 'Quantum Algebra',
'math.RA': 'Rings and Algebras',
'math.RT': 'Representation Theory',
'math.SG': 'Symplectic Geometry',
'math.SP': 'Spectral Theory',
'math.ST': 'Statistics Theory',
'math-ph': 'Mathematical Physics',
'nlin.AO': 'Adaptation and Self-Organizing Systems',
'nlin.CD': 'Chaotic Dynamics',
'nlin.CG': 'Cellular Automata and Lattice Gases',
'nlin.PS': 'Pattern Formation and Solitons',
'nlin.SI': 'Exactly Solvable and Integrable Systems',
'nucl-ex': 'Nuclear Experiment',
'nucl-th': 'Nuclear Theory',
'physics.acc-ph': 'Accelerator Physics',
'physics.ao-ph': 'Atmospheric and Oceanic Physics',
'physics.app-ph': 'Applied Physics',
'physics.atm-clus': 'Atomic and Molecular Clusters',
'physics.atom-ph': 'Atomic Physics',
'physics.bio-ph': 'Biological Physics',
'physics.chem-ph': 'Chemical Physics',
'physics.class-ph': 'Classical Physics',
'physics.comp-ph': 'Computational Physics',
'physics.data-an': 'Data Analysis, Statistics and Probability',
'physics.ed-ph': 'Physics Education',
'physics.flu-dyn': 'Fluid Dynamics',
'physics.gen-ph': 'General Physics',
'physics.geo-ph': 'Geophysics',
'physics.hist-ph': 'History and Philosophy of Physics',
'physics.ins-det': 'Instrumentation and Detectors',
'physics.med-ph': 'Medical Physics',
'physics.optics': 'Optics',
'physics.plasm-ph': 'Plasma Physics',
'physics.pop-ph': 'Popular Physics',
'physics.soc-ph': 'Physics and Society',
'physics.space-ph': 'Space Physics',
'q-bio.BM': 'Biomolecules',
'q-bio.CB': 'Cell Behavior',
'q-bio.GN': 'Genomics',
'q-bio.MN': 'Molecular Networks',
'q-bio.NC': 'Neurons and Cognition',
'q-bio.OT': 'Other Quantitative Biology',
'q-bio.PE': 'Populations and Evolution',
'q-bio.QM': 'Quantitative Methods',
'q-bio.SC': 'Subcellular Processes',
'q-bio.TO': 'Tissues and Organs',
'q-fin.CP': 'Computational Finance',
'q-fin.EC': 'Economics',
'q-fin.GN': 'General Finance',
'q-fin.MF': 'Mathematical Finance',
'q-fin.PM': 'Portfolio Management',
'q-fin.PR': 'Pricing of Securities',
'q-fin.RM': 'Risk Management',
'q-fin.ST': 'Statistical Finance',
'q-fin.TR': 'Trading and Market Microstructure',
'quant-ph': 'Quantum Physics',
'stat.AP': 'Applications',
'stat.CO': 'Computation',
'stat.ME': 'Methodology',
'stat.ML': 'Machine Learning',
'stat.OT': 'Other Statistics',
'stat.TH': 'Statistics Theory'}

In [25]:
# Initialize wordnet lemmatizer
wnl = WordNetLemmatizer()

def remove_stop_words(sentence):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(sentence)

    # NOTE: the comparison is case-sensitive, so capitalized stop words such as
    # "The" or "We" are kept. Compare w.lower() instead to drop those as well.
    filtered_sentence = [w for w in word_tokens if w not in stop_words]

    return ' '.join(filtered_sentence).replace(',', '')


def extract_keyword(text):
    r = Rake()
    r.extract_keywords_from_text(text)
    keywordList = []
    rankedList = r.get_ranked_phrases_with_scores()
    for keyword in rankedList:
        keyword_updated = keyword[1].split()
        keyword_updated_string = " ".join(keyword_updated[:2])
        keywordList.append(keyword_updated_string)
        if len(keywordList) > 9:
            break

    return keywordList


def lemmatizer_word(word):
    return wnl.lemmatize(word)


def lemmatizer(sentence):
    tokens = nltk.word_tokenize(sentence)
    lemmatized_tokens = [lemmatizer_word(token) for token in tokens]
    return " ".join(lemmatized_tokens)


def most_common(items, num):
    return Counter(items).most_common(num)


def process_entries(bib_file):
    bib_data = parse_file(bib_file)
    all_years = []
    all_authors = []
    all_keywords = []
    all_titles = []
    all_abstract = []
    all_categories = []
    all_keywords_title = []
    all_keywords_abstract = []

    for key, value in bib_data.entries.items():
        if 'abstract' in value.fields:
            # Years
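            try:
                year = value.fields["year"]
                all_years.append(year)
            except KeyError:
                # NOTE: an entry without a year is skipped here, which would leave
                # all_years shorter than all_abstract.
                pass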

            # Authors
            authors = []
            try:
                for author in value.persons["author"]:
                    authors.append(f"{author.first_names[0]}, {author.last_names[0]}")
                # Extend once per entry; extending inside the loop counted earlier authors repeatedly.
                all_authors.extend(authors)
            except KeyError:
                print(f"{key}: No authors found")
                authors = None
            except IndexError:
                print(f"{key}: Wrong format for authors", value.persons)
                authors = None

            # Keywords
            try:
                keywords = value.fields["keywords"].split(",")
                all_keywords.extend(keyword.lower() for keyword in keywords)
            except KeyError:
                print(f"{key}: No keywords found")
                keywords = None

            # Title
            title = lemmatizer(remove_stop_words(value.fields["title"]))
            all_titles.append(title)
            all_keywords_title.extend([keyword.lower() for keyword in extract_keyword(title)])

            # Abstract
            abstract = lemmatizer(remove_stop_words(value.fields["abstract"]))
            all_abstract.append(abstract)
            all_keywords_abstract.extend(keyword.lower() for keyword in extract_keyword(abstract))            

            # Categories
            all_categories.append('stat.CO')
    return (
        all_years,
        all_authors,
        all_keywords,
        all_titles,
        all_abstract,
        all_categories,
        all_keywords_title,
        all_keywords_abstract,
    )
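
The topic representations further down still contain words such as "the" and "we". A likely cause is that `remove_stop_words` compares tokens against a lowercase stop-word list without lowercasing them first. The following is a minimal sketch of a case-insensitive variant. It uses plain whitespace tokenization and a tiny hand-written stop-word set, both of which are simplifying assumptions (the notebook itself uses NLTK), and the name `remove_stop_words_ci` is hypothetical.

```python
# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's English list).
STOP_WORDS = {"the", "a", "an", "of", "in", "for", "we", "and", "between"}


def remove_stop_words_ci(sentence, stop_words=STOP_WORDS):
    """Drop stop words regardless of capitalization (whitespace tokenization only)."""
    tokens = sentence.replace(",", "").split()
    # Lowercase each token only for the membership test, so the original casing is kept.
    return " ".join(t for t in tokens if t.lower() not in stop_words)


print(remove_stop_words_ci("A Review of Contract Entity Extraction"))
```

Applying the same `w.lower()` comparison inside `remove_stop_words` would change the cleaned titles and abstracts, and therefore the topics found below.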

In [26]:
bib_file = './data/many_entries.bib'
num = 10
titles = []
abstracts = []
years = []
categories = []
keywords_title = []
keywords_abstract = []

(
    years,
    authors,
    keywords,
    titles,
    abstracts,
    categories,
    keywords_title,
    keywords_abstract,
) = process_entries(bib_file)

print(len(titles), len(abstracts), len(years), len(categories), len(keywords), len(keywords_title), len(keywords_abstract))

print(f"Keywords in abstract: {most_common(keywords_abstract, num)}")

print(f"Keywords in title: {most_common(keywords_title, num)}")

print(f"Keywords in keyword: {most_common(keywords, num)}")

print(f"Authors: {most_common(authors, num)}")

ret = most_common(years, num), min(years), max(years)
print(f"Years: {ret[0]}")
print(f"The oldest: {ret[1]} - Newest: {ret[2]}")


Andrew2018: No keywords found
Zhang2021: No keywords found
Khoja2022: No keywords found
Boley2010: No keywords found
Nay2016: No keywords found
Giri2017: No keywords found
Gangemi2005: No keywords found
Beltagy2020: No keywords found
Chalkidis2018: No keywords found
Prince-Tritto2024: No keywords found
Kriz2014: No keywords found
Monroy2009: No keywords found
Kaltenboeck2022: No keywords found
Zhong2020: No keywords found
Vogel2018: No keywords found
Amato2008: No keywords found
Eidelman2019: No keywords found
Siena2009: No keywords found
Mahabal2019: No keywords found
Katz2023: No keywords found
Chalkidis2020: No keywords found
Giaoui2023: No keywords found
Savelka2023: No keywords found
Zhang2023: No keywords found
Huang2023b: No keywords found
Kang2021: No keywords found
Boella2013: No keywords found
Dinesh2008: No keywords found
Tan2023: No keywords found
Alchourron1981: No keywords found
Shelton2006: No keywords found
Li2023: No keywords found
Ostling2023: No keywords found
Ramesh

In [24]:
titles

['ContractFrames : Bridging Gap Between Natural Language Logics Contract Law',
 'Combining expert knowledge NLP specialised application',
 'ContracT - Legal Contracts Formal Specifications : Preliminary Results',
 'Automatically Extracting Insurance Contract Knowledge Using NLP',
 'Semantically Enriched Obligation Management : An Approach Improving Handling Obligations Represented Contracts',
 'Automatic Extraction Entities Relation Legal Documents',
 'A Review Contract Entity Extraction',
 'Extracting definition Brazilian legal text',
 'A study legal knowledge base creation using artificial intelligence ontology',
 'A Review Application Deep Learning Legal Domain',
 'Temporal dependence legal document',
 'TimeLex : A Suite Tools Processing Temporal Information Legal Texts',
 'Named Entity Recognition Rental Documents Using NLP',
 'Learning Map GDPR Logic Representation DAPRECO-KB',
 'Banking Regulation Classification Portuguese',
 'Explainable Artificial Intelligence Technology Policy

## Train Topic Model
Next, we train our BERTopic model using just a few changes to the default parameters. 

Starting off is the embedding model. The string `"paraphrase-MiniLM-L12-v2"` references the embedding model that can be found [here](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models) and is a great sentence-transformer model that balances performance with speed.

Next, we make sure that the minimum size of our topics is 20. We do this to limit the number of topics that could be generated. For example, if the minimum were 10, many more topics could be generated, most of which are likely to be of little interest. Since we want large topics, we set it to 20.

In [31]:
from bertopic import BERTopic

topic_model = BERTopic(verbose=True, embedding_model="paraphrase-MiniLM-L12-v2", min_topic_size=20)
topics, _ = topic_model.fit_transform(abstracts)
len(topic_model.get_topic_info())

2023-12-30 16:28:13,161 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/11 [00:00<?, ?it/s]

2023-12-30 16:28:14,664 - BERTopic - Embedding - Completed ✓
2023-12-30 16:28:14,665 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-12-30 16:28:16,718 - BERTopic - Dimensionality - Completed ✓
2023-12-30 16:28:16,719 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-12-30 16:28:16,730 - BERTopic - Cluster - Completed ✓
2023-12-30 16:28:16,733 - BERTopic - Representation - Extracting topics from clusters using representation models.
2023-12-30 16:28:16,803 - BERTopic - Representation - Completed ✓


6

**NOTE**: BERTopic is stochastic because UMAP, which it uses for dimensionality reduction, is stochastic, so the results may differ between runs.
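
If repeatable results matter, one common approach is to hand BERTopic a UMAP instance with a fixed `random_state`. The sketch below assumes that the UMAP settings shown mirror BERTopic's default configuration. The seed value `42` is arbitrary and the helper name `build_seeded_topic_model` is hypothetical. The imports are done lazily inside the function so the settings can be inspected without loading the heavy dependencies.

```python
# UMAP settings assumed to mirror BERTopic's defaults, plus a fixed seed.
UMAP_PARAMS = {
    "n_neighbors": 15,
    "n_components": 5,
    "min_dist": 0.0,
    "metric": "cosine",
    "random_state": 42,  # arbitrary seed, chosen only for illustration
}


def build_seeded_topic_model(min_topic_size=20):
    """Return a BERTopic model whose UMAP step uses a fixed random_state."""
    from bertopic import BERTopic
    from umap import UMAP

    umap_model = UMAP(**UMAP_PARAMS)
    return BERTopic(
        verbose=True,
        embedding_model="paraphrase-MiniLM-L12-v2",
        min_topic_size=min_topic_size,
        umap_model=umap_model,
    )


# Usage (assumes `abstracts` from the cells above):
# topic_model = build_seeded_topic_model()
# topics, _ = topic_model.fit_transform(abstracts)
```

Even with a fixed seed, results may still vary across library versions or hardware, so this should be read as reducing run-to-run variation rather than guaranteeing identical topics.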

## Topic Representation

We can see that 6 topics were generated from our abstracts, counting the outlier topic -1. Next, let's see what the major topics are:



In [85]:
topics=topic_model.get_topic_info()

topics

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 121 | -1_legal_the_data_compliance | [legal, the, data, compliance, we, knowledge, ... | [Today legal document system increasingly stri... |
| 1 | 0 | 65 | 0_legal_document_text_contract | [legal, document, text, contract, extraction, ... | [Due imbalance large number litigation case nu... |
| 2 | 1 | 56 | 1_model_graph_llms_knowledge | [model, graph, llms, knowledge, task, language... | [Large pre-trained language model shown store ... |
| 3 | 2 | 34 | 2_business_rule_sbvr_vocabulary | [business, rule, sbvr, vocabulary, model, tran... | [Business rule generally captured natural lang... |
| 4 | 3 | 30 | 3_logic_deontic_agent_obligation | [logic, deontic, agent, obligation, law, syste... | [What happens way handle genuine deontic confl... |
| 5 | 4 | 27 | 4_ontology_semantic_web_legal | [ontology, semantic, web, legal, domain, the, ... | [Currently technique content description query... |


In [88]:
print(*topics['Representation'] , sep='\n')

['legal', 'the', 'data', 'compliance', 'we', 'knowledge', 'domain', 'model', 'language', 'approach']
['legal', 'document', 'text', 'contract', 'extraction', 'language', 'the', 'information', 'we', 'task']
['model', 'graph', 'llms', 'knowledge', 'task', 'language', 'performance', 'in', 'the', 'question']
['business', 'rule', 'sbvr', 'vocabulary', 'model', 'transformation', 'language', 'process', 'rules', 'the']
['logic', 'deontic', 'agent', 'obligation', 'law', 'system', 'ambiguity', 'norm', 'we', 'legal']
['ontology', 'semantic', 'web', 'legal', 'domain', 'the', 'concept', 'search', 'based', 'paper']


In [90]:
# Defining stop words set
stop_words = set(stopwords.words('english'))

# Using nested list comprehension to filter out stop words and flatten the list
topics_words = [word for sentence in topics['Representation'] for word in sentence if word not in stop_words]

# Converting to a set to remove duplicates, then back to a list
topic_words = list(set(topics_words))

print(topic_words)


['language', 'graph', 'agent', 'paper', 'norm', 'law', 'semantic', 'information', 'legal', 'ontology', 'concept', 'llms', 'web', 'rule', 'compliance', 'logic', 'vocabulary', 'model', 'transformation', 'performance', 'ambiguity', 'approach', 'obligation', 'text', 'data', 'sbvr', 'knowledge', 'based', 'search', 'domain', 'deontic', 'business', 'task', 'question', 'system', 'contract', 'process', 'rules', 'document', 'extraction']


In [33]:
num_topics=len(topic_model.get_topic_info())

In [34]:
topic_model.visualize_barchart(top_n_topics=num_topics, height=700)

At a glance, the most frequent topics seem to have coherent and clear topic representations. Interpretation of these clusters is much easier if you are familiar with the content of the documents.

In [35]:
topic_model.visualize_term_rank()

In [36]:
topic_model.visualize_term_rank(log_scale=True)

Using the elbow method, it seems that 3 words per topic are sufficient in representing the topic well. Any words that we add after that have seemingly little effect. 
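
The elbow can also be located numerically instead of by eye. The sketch below shows one simple heuristic: pick the point farthest from the straight line that joins the first and last scores. The scores are made-up values for illustration, not output from this model, and the function name `elbow_index` is hypothetical.

```python
def elbow_index(scores):
    """Index of the point farthest from the line joining the first and last scores."""
    n = len(scores)
    if n < 3:
        return 0
    x1, y1, x2, y2 = 0, scores[0], n - 1, scores[-1]
    best_i, best_d = 0, -1.0
    for i, y in enumerate(scores):
        # Perpendicular distance to the chord, up to a constant factor.
        d = abs((y2 - y1) * i - (x2 - x1) * y + x2 * y1 - y2 * x1)
        if d > best_d:
            best_i, best_d = i, d
    return best_i


# Hypothetical term scores for one topic, sorted from highest to lowest rank.
scores = [0.90, 0.50, 0.20, 0.15, 0.12, 0.10, 0.09, 0.08, 0.07, 0.06]
print(elbow_index(scores) + 1, "words before the curve flattens")
```

The result depends on how the two axes are scaled, so it is best treated as a rough cross-check of the visual reading.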

**NOTE**: All visualizations are created with Plotly and are as such interactive!

## Topic Relationships
Having extracted the topics and their representations it might be helpful to check out the uniqueness of each topic. Some topics might be quite similar and could be merged or are simply interesting to research further. 

To do this, we start off by mapping our topics to a 2D representation by reducing the topic vectors with UMAP:

In [37]:
topic_model.visualize_topics(top_n_topics=num_topics)

The distance between topics shows, to a certain extent, the similarities between topics. However, to better visualize and understand the similarity between topics, we can use two plots that give us further insight. Namely, visualizing the possible topic hierarchy and its similarity matrix:

In [38]:
topic_model.visualize_hierarchy(top_n_topics=num_topics, width=800)

In [50]:
topic_model.visualize_heatmap(n_clusters=3, top_n_topics=num_topics)

The blocky structure in the heatmap shows that there are some clusters of topics to be found that are somewhat similar to each other. Zooming into these topics helps us understand why they are similar. If you hover over the topics, you can see the topic ID and representation. 
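
To make the idea of a similarity matrix concrete, here is a minimal sketch that assumes the heatmap is based on cosine similarity between topic vectors. The three vectors are made up for illustration and the function names are hypothetical; real topic vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities, one row and one column per topic."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical topic vectors: the first two point in nearly the same direction.
topic_vectors = [[1.0, 0.0, 1.0], [1.0, 0.1, 0.9], [0.0, 1.0, 0.0]]
matrix = similarity_matrix(topic_vectors)
for row in matrix:
    print([round(value, 2) for value in row])
```

Topics whose vectors point in similar directions get values close to 1, which is what produces the blocks in the heatmap.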

## Topics over Time
Although extracting the topics and their representation is interesting, we are missing the temporal dimension. For example, some topics might no longer be relevant, while others may have gained traction in recent years. That can be vital information when making decisions.

Here, we will model the topics over the years. For each topic and timestep, we calculate the c-TF-IDF representation. This will result in a specific topic representation at each timestep without the need to create clusters from embeddings as they were already created. However, topics can be regarded as evolutionary entities that evolve and change over time. This means that a topic representation at timestep *t* should be related to its representation at timesteps *t-1* and *t+1*. To model this evolutionary trend, we average its c-TF-IDF representation with that of the c-TF-IDF representation at timestep *t-1*. This is done for each topic representation allowing for the representations to evolve over time.
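
The averaging step can be sketched with plain lists. The example below rests on two assumptions: each vector is L1-normalized before averaging, and the previous vector used in the average is the already smoothed one. The toy vectors stand in for the c-TF-IDF representation of a single topic at three timesteps, and the function names are hypothetical.

```python
def l1_normalize(vec):
    """Scale a vector so that its absolute values sum to 1."""
    total = sum(abs(x) for x in vec)
    return [x / total for x in vec] if total else list(vec)


def smooth_over_time(vectors_by_timestep):
    """Average each timestep's normalized vector with the previous smoothed one."""
    smoothed = []
    for vec in vectors_by_timestep:
        current = l1_normalize(vec)
        if smoothed:
            previous = smoothed[-1]
            current = [(c + p) / 2.0 for c, p in zip(current, previous)]
        smoothed.append(current)
    return smoothed


# Toy term weights for one topic at timesteps t=0, 1, 2 (three-term vocabulary).
raw = [[2.0, 2.0, 0.0], [0.0, 4.0, 4.0], [0.0, 0.0, 3.0]]
result = smooth_over_time(raw)
for t, vec in enumerate(result):
    print(t, vec)
```

A term that disappears at timestep *t* still keeps part of its weight from *t-1*, which is what lets the representation change gradually rather than abruptly.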

In [40]:
len(abstracts), len(years)

(333, 333)

In [41]:
topics_over_time = topic_model.topics_over_time(abstracts, years)

25it [00:00, 73.87it/s]


In [42]:
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20, width=900, height=500)

From the visualization above, we can see some interesting patterns appearing. Namely, around 2012 a lot of topics seem to become less and less frequent, with several topics taking over. For example, topics 2 and 3 seem to be popular until 2012, after which their popularity decreased significantly.

## Topics per Class
Lastly, let's focus on the given categories for each paper. Can we find out which topics are frequently found in certain categories? Typically, topic modeling tends to find more topics than the categories that were previously defined. This helps us find and understand certain subcategories that might exist in the data. 
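
The frequency part of that question can be sketched as a simple cross-tabulation of topic assignments against class labels. The topic ids and category labels below are made up for illustration, and the function name `topic_counts_per_class` is hypothetical. This covers only the counting, not the per-class topic representations.

```python
from collections import Counter, defaultdict


def topic_counts_per_class(doc_topics, doc_classes):
    """Count how often each topic occurs within each class."""
    counts = defaultdict(Counter)
    for topic, cls in zip(doc_topics, doc_classes):
        counts[cls][topic] += 1
    return counts


# Hypothetical per-document topic ids and arXiv-style category labels.
doc_topics = [0, 0, 1, 2, 1, 0]
doc_classes = ["cs.CL", "cs.CL", "cs.AI", "cs.LO", "cs.AI", "cs.AI"]

counts = topic_counts_per_class(doc_topics, doc_classes)
for cls, topic_counter in counts.items():
    print(cls, topic_counter.most_common())
```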

In [43]:
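# `topics` was reassigned to the topic-info DataFrame above, so pass the
# per-document topic assignments stored on the fitted model instead.
topics_per_class = topic_model.topics_per_class(abstracts, classes=topic_model.topics_)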

6it [00:00, 87.69it/s]


In [45]:
topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=num_topics, width=900)