<a href="https://colab.research.google.com/github/ceydab/NLP_Projects/blob/main/TopicModellingwithLDAandCTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning for Natural Language Processing: Topic Modelling with LDA

This notebook gives an example of topic modelling with LDA and CTM and compares the topics the two models produce.

We start by loading the dataset in three groups: titles published before 1990, from 1990 to 2009, and from 2010 onward.

Then we define four functions: `preprocess_text`, `vectorize`, `lda_model`, and `ctm_model`.

We run these functions on each of the three sets. LDA uses both the preprocessed and the vectorized titles. CTM takes the raw titles together with the preprocessed ones, so it does not need `vectorize`.

Once we have the top words per topic, we can compare LDA with CTM and see how the topics differ between the time periods.

We work on only 5 rows, but this can be extended according to the desired research topic.

In [1]:
!pip install contextualized-topic-models==2.4.1



In [2]:
import torch
import re
import urllib.request
import gzip
import io
import csv
import random
from collections import defaultdict
from tqdm import tqdm
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
path_before_1990 = '/content/drive/My Drive/titles_before_1990.txt'
path_from_1990_to_2009 = '/content/drive/My Drive/titles_from_1990_to_2009.txt'
path_from_2010 = '/content/drive/My Drive/titles_from_2010.txt'

In [4]:
from google.colab import drive
drive.mount('/content/drive')

# to download the data manually or get more information, go to: https://dblp.org/faq/How+can+I+download+the+whole+dblp+dataset.html
url = 'https://dblp.uni-trier.de/xml/dblp.xml.gz'
# num_titles = 500000  # the (max)number of titles to load


def load_gzip_file(url):
    """Download Gzip-file."""
    response = urllib.request.urlopen(url)
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
    return decompressed_file

def extract_titles(input_file, max_num=5000):
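    """Extract title and publication year of dblp papers, given as input file.

    Divide the papers into 3 time periods.

    Stop once every time period holds at least max_num papers. A period that
    fills up early keeps collecting, so it can end up with more than max_num.
    """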
    pairs_before_1990 = []
    count_before_1990 = 0
    pairs_from_1990_to_2009 = []
    count_from_1990_to_2009 = 0
    pairs_from_2010 = []
    count_from_2010 = 0
    got_title = False
    for line in tqdm(input_file):
        line_str = line.decode('utf-8')
        if got_title:
            # we have a title and check for the corresponding year
            year_result = re.search(r'<year>(.*)</year>', line_str)
            if year_result:
                # we also have the year and thus save the title-year pair
                year = int(year_result.group(1))
                if year < 1990:
                    pairs_before_1990.append((title, year))
                    count_before_1990 += 1
                elif year < 2010:
                    pairs_from_1990_to_2009.append((title, year))
                    count_from_1990_to_2009 += 1
                else:
                    pairs_from_2010.append((title, year))
                    count_from_2010 += 1
                got_title = False
        else:
            # we have no title and search for title
            result = re.search(r'<title>(.*)</title>', line_str)
            if result:
                title = result.group(1)
                if len(title.split(' ')) < 3:
                    # skip titles with fewer than three words
                    continue
                got_title = True

        if count_before_1990 >= max_num and count_from_1990_to_2009 >= max_num and count_from_2010 >= max_num:
            return pairs_before_1990, pairs_from_1990_to_2009, pairs_from_2010

    return pairs_before_1990, pairs_from_1990_to_2009, pairs_from_2010

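def save_data(pairs, file_path):
    """Write (title, year) pairs to file_path as CSV rows."""
    # newline='' is what the csv module recommends for files handed to csv.writer
    with open(file_path, 'w', newline='') as fout:
        writer = csv.writer(fout)
        writer.writerows(pairs)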

in_file = load_gzip_file(url)
pairs_before_1990, pairs_from_1990_to_2009, pairs_from_2010 = extract_titles(in_file)
save_data(pairs_before_1990, path_before_1990)
save_data(pairs_from_1990_to_2009, path_from_1990_to_2009)
save_data(pairs_from_2010, path_from_2010)

Mounted at /content/drive


3192750it [00:09, 322059.17it/s]
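
The loop in `extract_titles` keeps appending to a period even after that period has reached `max_num`. As a minimal sketch of the title/year matching and of one way to cap each period, the snippet below runs a hypothetical variant, `extract_capped`, on a few made-up records. The sample lines, the helper `period_of`, and the assumption that each tag sits on its own line are illustrative only and not taken from the dblp dump.

```python
import re

# Made-up records, one tag per line (an assumption about the layout)
sample_lines = [
    b'<article key="x/1">',
    b'<title>Structured Programming in Pascal</title>',
    b'<year>1984</year>',
    b'</article>',
    b'<article key="x/2">',
    b'<title>Short Title</title>',
    b'<year>1999</year>',
    b'</article>',
    b'<inproceedings key="x/3">',
    b'<title>Deep Learning for Image Segmentation</title>',
    b'<year>2016</year>',
    b'</inproceedings>',
]

def period_of(year):
    """Map a publication year to one of the three time periods."""
    if year < 1990:
        return 'before_1990'
    if year < 2010:
        return 'from_1990_to_2009'
    return 'from_2010'

def extract_capped(lines, max_num=5000):
    """Hypothetical variant: keep at most max_num (title, year) pairs per period."""
    buckets = {'before_1990': [], 'from_1990_to_2009': [], 'from_2010': []}
    title = None
    for raw in lines:
        line = raw.decode('utf-8')
        if title is None:
            # look for a title with at least three words
            match = re.search(r'<title>(.*)</title>', line)
            if match and len(match.group(1).split(' ')) >= 3:
                title = match.group(1)
        else:
            # a title is pending, so look for its year
            match = re.search(r'<year>(.*)</year>', line)
            if match:
                year = int(match.group(1))
                bucket = buckets[period_of(year)]
                if len(bucket) < max_num:
                    bucket.append((title, year))
                title = None
        if all(len(b) >= max_num for b in buckets.values()):
            break
    return buckets

result = extract_capped(sample_lines)
for period, pairs in result.items():
    print(period, pairs)
```

The two-word title is skipped, so its year line is never paired with anything, which mirrors the `continue` branch in the original loop.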


In [5]:
def preprocess_text(text):
    text = re.sub(r'[^a-zA-Z ]', '', text)
    text = text.lower()
    return text
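
As a quick check of what `preprocess_text` does, the sketch below applies the same two steps to a made-up title (the example string is an assumption, not a dblp entry). Because the pattern deletes every non-letter instead of replacing it with a space, hyphenated words are fused, which is where tokens such as `objectoriented` and `realtime` in the topics further down come from. The second function, `preprocess_text_spaced`, is a hypothetical alternative shown only for contrast.

```python
import re

def preprocess_text(text):
    """Same steps as the notebook function: drop non-letters, then lowercase."""
    text = re.sub(r'[^a-zA-Z ]', '', text)
    return text.lower()

def preprocess_text_spaced(text):
    """Hypothetical variant: turn non-letters into spaces so hyphenated words stay split."""
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    return re.sub(r' +', ' ', text).strip().lower()

example = "Object-Oriented Design: A Real-Time Approach (2nd ed.)"  # made-up title
print(preprocess_text(example))         # objectoriented design a realtime approach nd ed
print(preprocess_text_spaced(example))  # object oriented design a real time approach nd ed
```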

In [6]:
def vectorize(prepro_titles):
    num_features = 10000
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=num_features, stop_words='english')
    tf = tf_vectorizer.fit_transform(prepro_titles)
    tf_feature_names = tf_vectorizer.get_feature_names_out()
    return tf, tf_feature_names

In [7]:
def lda_model(tf, tf_feature_names, num_top=5):
    """Fit LDA with num_top topics, print the top 12 words per topic, and return them."""
    lda = LatentDirichletAllocation(n_components=num_top, max_iter=5, learning_method='online', random_state=42).fit(tf)
    topics = []
    for topic_idx, topic in enumerate(lda.components_):
        # argsort is ascending, so slice from the end to get the 12 highest-weighted words first
        topic_words = ' '.join(tf_feature_names[i] for i in topic.argsort()[:-12 - 1:-1])
        topics.append(f'Topic {topic_idx}: {topic_words}')
        print(topics[-1])
    return topics
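
The slice `[:-12 - 1:-1]` in `lda_model` is compact but easy to misread. The sketch below shows the same pattern on a made-up weight vector with three top words instead of twelve; the weights and the vocabulary are assumptions for illustration.

```python
import numpy as np

weights = np.array([0.1, 2.5, 0.7, 1.8, 0.05])                  # made-up topic-word weights
vocab = np.array(['data', 'fuzzy', 'graph', 'neural', 'note'])  # made-up vocabulary

n_top = 3
# argsort returns indices in ascending order of weight;
# walking backwards from the end yields the n_top largest, biggest first
top_idx = weights.argsort()[:-n_top - 1:-1]
print(top_idx)         # [1 3 2]
print(vocab[top_idx])  # ['fuzzy' 'neural' 'graph']

# An equivalent and arguably more readable form
alt_idx = np.argsort(weights)[::-1][:n_top]
print(alt_idx)         # [1 3 2]
```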

In [8]:
tp = TopicModelDataPreparation("all-mpnet-base-v2")
def ctm_model(unprocessed, processed, num_top=5):
    """Fit a CombinedTM with num_top topics, print the top 5 words per topic, and return them."""
    # the raw titles feed the sentence embeddings, the preprocessed titles feed the bag-of-words
    training_dataset = tp.fit(text_for_contextual=unprocessed, text_for_bow=processed)
    # contextual_size=768 matches the embedding size of all-mpnet-base-v2
    ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=num_top, num_epochs=5)
    ctm.fit(training_dataset)
    topics = ctm.get_topic_lists(5)
    print(topics)
    return topics

In [9]:
with open(path_before_1990) as fin:
    reader = csv.reader(fin)
    titles1 = [row[0] for row in reader]

titles_pre90 = [preprocess_text(title) for title in titles1]
tf_pre90, tf_feature_names_pre90 = vectorize(titles_pre90)
lda_topics_pre90 = lda_model(tf_pre90, tf_feature_names_pre90,5)
ctm_topics_pre90 = ctm_model(titles1, titles_pre90, 5)

Topic 0: information analysis science structure data development problem objectoriented prolog editor models automatic
Topic 1: programs software computer model program management architecture note concurrent compiler based object
Topic 2: systems programming data language design structured database retrieval research implementation using approach
Topic 3: programming languages language simulation modula machine ada environment letter cipher theory extensible
Topic 4: pascal information review library use book comments structures new study scientific pp


Epoch: [5/5]	 Seen Samples: [24960/25000]	Train Loss: 65.0618080236973	Time: 0:00:02.302592: : 5it [00:10,  2.05s/it]
Sampling: [20/20]: : 20it [00:37,  1.89s/it]

[['statistical', 'on', 'performance', 'issues', 'methods'], ['premises', 'lawrence', 'aspen', 'including', 'directing'], ['retrieval', 'document', 'online', 'model', 'distributed'], ['allunion', 'adabas', 'languages', 'programming', 'language'], ['information', 'and', 'library', 'of', 'literature']]





In [10]:
print("LDA topics before 1990: ", lda_topics_pre90)
print("CTM topics before 1990: ", ctm_topics_pre90)

LDA topics before 1990:  ['Topic 0: information analysis science structure data development problem objectoriented prolog editor models automatic', 'Topic 1: programs software computer model program management architecture note concurrent compiler based object', 'Topic 2: systems programming data language design structured database retrieval research implementation using approach', 'Topic 3: programming languages language simulation modula machine ada environment letter cipher theory extensible', 'Topic 4: pascal information review library use book comments structures new study scientific pp']
CTM topics before 1990:  [['statistical', 'on', 'performance', 'issues', 'methods'], ['premises', 'lawrence', 'aspen', 'including', 'directing'], ['retrieval', 'document', 'online', 'model', 'distributed'], ['allunion', 'adabas', 'languages', 'programming', 'language'], ['information', 'and', 'library', 'of', 'literature']]


In [11]:
with open(path_from_1990_to_2009) as fin:
    reader = csv.reader(fin)
    titles2 = [row[0] for row in reader]

titles_90to10 = [preprocess_text(title) for title in titles2]
tf_90to10, tf_feature_names_90to10 = vectorize(titles_90to10)
lda_topics_90to10 = lda_model(tf_90to10, tf_feature_names_90to10,5)
ctm_topics_90to10 = ctm_model(titles2, titles_90to10, 5)

Topic 0: data network neural using web software support new distributed systems learning development
Topic 1: dynamic method model detection using evaluation architecture efficient image case realtime finite
Topic 2: control using based fuzzy systems modeling simulation study adaptive logic approach optical
Topic 3: networks algorithm models management problems optimization linear graphs process scheduling algorithms images
Topic 4: analysis design information cmos digital systems programming performance problem power wireless mobile





Epoch: [5/5]	 Seen Samples: [319040/319145]	Train Loss: 89.0145995160163	Time: 0:00:45.730705: : 5it [03:45, 45.16s/it]
Sampling: [20/20]: : 20it [12:27, 37.36s/it]

[['complexity', 'intersection', 'note', 'letter', 'minimal'], ['information', 'development', 'knowledge', 'the', 'research'], ['recovery', 'channels', 'circuits', 'optical', 'clock'], ['sound', 'stepper', 'tapestry', 'trained', 'renyi'], ['fuzzy', 'algorithm', 'optimization', 'reconstruction', 'genetic']]





In [12]:
print("LDA topics 1990-2010: ", lda_topics_90to10)
print("CTM topics 1990-2010: ", ctm_topics_90to10)

LDA topics 1990-2010:  ['Topic 0: data network neural using web software support new distributed systems learning development', 'Topic 1: dynamic method model detection using evaluation architecture efficient image case realtime finite', 'Topic 2: control using based fuzzy systems modeling simulation study adaptive logic approach optical', 'Topic 3: networks algorithm models management problems optimization linear graphs process scheduling algorithms images', 'Topic 4: analysis design information cmos digital systems programming performance problem power wireless mobile']
CTM topics 1990-2010:  [['complexity', 'intersection', 'note', 'letter', 'minimal'], ['information', 'development', 'knowledge', 'the', 'research'], ['recovery', 'channels', 'circuits', 'optical', 'clock'], ['sound', 'stepper', 'tapestry', 'trained', 'renyi'], ['fuzzy', 'algorithm', 'optimization', 'reconstruction', 'genetic']]


In [13]:
with open(path_from_2010) as fin:
    reader = csv.reader(fin)
    titles3 = [row[0] for row in reader]

titles_post10 = [preprocess_text(title) for title in titles3]
tf_post10, tf_feature_names_post10 = vectorize(titles_post10)
lda_topics_post10 = lda_model(tf_post10, tf_feature_names_post10,5)
ctm_topics_post10 = ctm_model(titles3, titles_post10, 5)

Topic 0: algorithm prediction estimation optimal recognition time images segmentation design new integrated distributed
Topic 1: based analysis data model power study performance dynamic information evaluation social multiple
Topic 2: learning network deep classification image neural data problem approach management framework selection
Topic 3: using control fuzzy method optimization detection model application hybrid models applications adaptive
Topic 4: networks systems machine learning supply design wireless chain algorithms quality detection support



Epoch: [5/5]	 Seen Samples: [907840/908120]	Train Loss: 116.21596893375089	Time: 0:03:27.674887: : 5it [17:11, 206.31s/it]
Sampling: [20/20]: : 20it [58:10, 174.52s/it]


[['bioethanol', 'differentiator', 'valueadded', 'containment', 'discharges'], ['stochastic', 'intuitionistic', 'control', 'controller', 'solving'], ['cellular', 'diversity', 'massive', 'spectrum', 'array'], ['classifier', 'features', 'fusion', 'segmentation', 'multimodal'], ['research', 'customer', 'perspective', 'use', 'factors']]


In [14]:
print("LDA topics after 2010: ", lda_topics_post10)
print("CTM topics after 2010: ", ctm_topics_post10)

LDA topics after 2010:  ['Topic 0: algorithm prediction estimation optimal recognition time images segmentation design new integrated distributed', 'Topic 1: based analysis data model power study performance dynamic information evaluation social multiple', 'Topic 2: learning network deep classification image neural data problem approach management framework selection', 'Topic 3: using control fuzzy method optimization detection model application hybrid models applications adaptive', 'Topic 4: networks systems machine learning supply design wireless chain algorithms quality detection support']
CTM topics after 2010:  [['bioethanol', 'differentiator', 'valueadded', 'containment', 'discharges'], ['stochastic', 'intuitionistic', 'control', 'controller', 'solving'], ['cellular', 'diversity', 'massive', 'spectrum', 'array'], ['classifier', 'features', 'fusion', 'segmentation', 'multimodal'], ['research', 'customer', 'perspective', 'use', 'factors']]


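We observe some differences between the topics of the two models in each time period. More information about the underlying texts would help to decide which model performs better, but the LDA topics seem more elaborate than the CTM topics. Part of this impression comes from the setup: LDA prints 12 words per topic and CTM only 5, and the titles passed to CTM are not stop-word filtered, which is why words such as "on", "and", and "of" show up in its topics.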
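
One way to put numbers on this comparison is sketched below, using the pre-1990 topics printed above. The two measures, Jaccard overlap between topic word sets and a simple topic-diversity ratio (share of unique words among all top words), are assumptions chosen for illustration and are not part of the notebook. The helper names are hypothetical, and the LDA topics are cut to their first 5 words so both models are compared on equal footing.

```python
# Pre-1990 topics copied from the outputs above
lda_topics = [
    'Topic 0: information analysis science structure data development problem objectoriented prolog editor models automatic',
    'Topic 1: programs software computer model program management architecture note concurrent compiler based object',
    'Topic 2: systems programming data language design structured database retrieval research implementation using approach',
    'Topic 3: programming languages language simulation modula machine ada environment letter cipher theory extensible',
    'Topic 4: pascal information review library use book comments structures new study scientific pp',
]
ctm_topics = [
    ['statistical', 'on', 'performance', 'issues', 'methods'],
    ['premises', 'lawrence', 'aspen', 'including', 'directing'],
    ['retrieval', 'document', 'online', 'model', 'distributed'],
    ['allunion', 'adabas', 'languages', 'programming', 'language'],
    ['information', 'and', 'library', 'of', 'literature'],
]

def lda_words(topic_string, k=5):
    """Turn 'Topic i: w1 w2 ...' into the list of its first k words."""
    return topic_string.split(': ', 1)[1].split(' ')[:k]

def jaccard(words_a, words_b):
    """Size of the intersection divided by size of the union of two word lists."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

def topic_diversity(topics):
    """Share of unique words among all top words of a model."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)

lda_top5 = [lda_words(t) for t in lda_topics]

# For every CTM topic, find the most similar LDA topic
for i, ctm_topic in enumerate(ctm_topics):
    scores = [jaccard(ctm_topic, lda_topic) for lda_topic in lda_top5]
    best = max(range(len(scores)), key=scores.__getitem__)
    print(f'CTM topic {i} best matches LDA topic {best} (Jaccard {scores[best]:.2f})')

print('LDA diversity:', topic_diversity(lda_top5))
print('CTM diversity:', topic_diversity(ctm_topics))
```

Low overlap only says that the two models pick different words. It does not say which topics are better, so a check against the underlying titles is still needed.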