# Topic Modelling

Topic modelling is an unsupervised machine learning technique that finds latent topics in a collection of documents. Each topic is a probability distribution over words, and each document can be attributed to one or more topics based on the words it contains. Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modelling.
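
To make the idea of topics as word distributions concrete, the sketch below walks through the generative story that LDA assumes: draw a topic mixture for a document, then draw each word from one of the topics. The two topics, their word probabilities and the helper names are made up for illustration; they are not learned from the arXiv data.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate a toy document following LDA's generative story."""
    # 1. Draw the document's topic mixture
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position
        topic = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution
        vocab, probs = zip(*topics[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Two hypothetical topics (illustrative probabilities only)
toy_topics = [
    {"channel": 0.5, "code": 0.3, "rate": 0.2},
    {"image": 0.5, "object": 0.3, "video": 0.2},
]
rng = random.Random(0)
theta, doc = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print([round(t, 3) for t in theta], doc)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topics and each document's mixture.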

The content has been gathered from the popular academic website arXiv.org for articles tagged as computer science content (though some of these are in mathematics or physics categories). The fields are:

• Title: the full title <br>
• Abstract: the full abstract <br>
• InformationTheory: a "1" if it is classified as an Information Theory article, otherwise "0". <br>
• ComputerVision: a "1" if it is classified as a Computer Vision article, otherwise "0". <br>
• ComputationalLinguistics: a "1" if it is classified as a Computational Linguistics article, otherwise "0".

We will use the titles and abstracts to create LDA (Latent Dirichlet allocation) models, to find informative topics that can be used to group the papers.

## Import libraries

In [None]:
# For pre-processing
import spacy
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

# scikit-learn libraries
from sklearn.exceptions import ConvergenceWarning

# general use
import pandas as pd
import numpy as np
import time
from warnings import simplefilter, filterwarnings
import random
import pickle
from itertools import product, cycle

# For topic modelling
!pip3 install gensim # if running on Colab
!pip3 install --upgrade gensim # if running on Colab. # Optional, removes some deprecation warnings
from gensim.models import LdaModel, Phrases
from gensim.corpora import Dictionary
!pip install pyLDAvis==2.1.2 # if running on Colab
import pyLDAvis.gensim

### Define functions

In [9]:
# Function to show which topic contributes the most to each document
def get_document_topics(ldamodel, corpus, texts):
    # Initialise output
    document_topics_df = pd.DataFrame()

    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        # j is index position of tuple in row, other vals are tuple vals
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # i.e., the dominant topic (because it was sorted)
                # top tokens from the topic
                wp = ldamodel.show_topic(topic_num)
                # join topic words together with comma space separating each
                topic_keywords = ", ".join([word for word, prop in wp])
                # Populate dataframe (DataFrame.append was removed in pandas 2.0,
                # so this call needs an older pandas release)
                document_topics_df = document_topics_df.append(pd.Series([int(topic_num),
                                                                          round(prop_topic,4),
                                                                          topic_keywords]), 
                                                               ignore_index=True)
            else:
                break
    document_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

    # Add original text to the end of the output
    document_topics_df = pd.concat([document_topics_df, texts], axis=1)

    document_topics_df.columns = ['Dominant_Topic',
                                  'Perc_Contribution',
                                  'Topic_Keywords',
                                  'Title',
                                  'Original_Text']

    return document_topics_df
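
The core step of the function above, picking the topic with the largest share for a document, can be sketched on its own. The `(topic_id, probability)` pairs below are hypothetical values in the same form the function iterates over, and `dominant_topic` is an illustrative helper, not part of gensim.

```python
def dominant_topic(topic_distribution):
    """Return the (topic_id, probability) pair with the largest probability."""
    return max(topic_distribution, key=lambda pair: pair[1])

# Hypothetical topic distribution for a single document
doc_topics = [(1, 0.2456), (5, 0.1103), (8, 0.2011)]
topic_id, share = dominant_topic(doc_topics)
print(topic_id, round(share, 4))  # 1 0.2456
```

Taking the maximum directly avoids sorting the whole list when only the top entry is needed.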

In [10]:
# Function to find k most representative documents for each topic
def most_representative_docs(document_topics, k=1):

    doc_topics_sorted_df = pd.DataFrame()

    # Group articles together that have the same most dominant topic
    doc_topic_df_grpd = document_topics.groupby('Dominant_Topic')

    # Sort and keep the articles that each topic contributes the most to
    for i, grp in doc_topic_df_grpd:
        doc_topics_sorted_df = pd.concat([doc_topics_sorted_df, 
                                          grp.sort_values('Perc_Contribution', 
                                                          ascending=False).head(k)])
    
    # Tidy up dataframe
    doc_topics_sorted_df.reset_index(inplace=True)
    doc_topics_sorted_df.columns = ['Doc_ID',
                                    'Topic_Num',
                                    "Topic_Perc_Contrib",
                                    "Keywords",
                                    "Title",
                                    "Text"]
                                    
    return doc_topics_sorted_df
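
The group-then-rank logic above can also be written without pandas, which may help when checking the result by hand. This is a minimal sketch: the rows are hypothetical `(doc_id, topic_id, contribution)` triples and `top_k_per_topic` is an illustrative name.

```python
from collections import defaultdict

def top_k_per_topic(rows, k=1):
    """Return the k documents with the highest contribution for each topic."""
    by_topic = defaultdict(list)
    for doc_id, topic_id, contribution in rows:
        by_topic[topic_id].append((doc_id, contribution))
    result = {}
    for topic_id in sorted(by_topic):
        # Rank the topic's documents by contribution, largest first
        ranked = sorted(by_topic[topic_id], key=lambda item: item[1], reverse=True)
        result[topic_id] = ranked[:k]
    return result

# Hypothetical rows: (doc_id, dominant topic, contribution)
rows = [(0, 9, 0.4887), (1, 5, 0.7778), (2, 5, 0.8799), (3, 9, 0.8958)]
best = top_k_per_topic(rows, k=1)
print(best)  # {5: [(2, 0.8799)], 9: [(3, 0.8958)]}
```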

In [11]:
# Function to print out details of a topic's most representative document
def display_rep_doc(rep_docs_df, topic_no):

    for i, row in rep_docs_df[rep_docs_df['Topic_Num']==topic_no].iterrows():
        print(f"Topic {topic_no} \n\nTopic keywords: \n{row['Keywords']}\n")
        print("Document Title: \n" + row["Title"] + "\n")
        print(f"Document Contribution is {row['Topic_Perc_Contrib']*100:.2f} % \n")
        print(row["Text"] + "\n-------------------------------------------------")

### Pre-processing

Both pre-processing methods share some steps: punctuation and purely numeric tokens are removed in the same way, and stop words are removed using NLTK’s stop words list.

The following code only needs to run if part 1 hasn't been run:

In [12]:
# for colab
nltk.download('stopwords')

# Get English stopwords as a list for iteration
stopwordsEnglish = stopwords.words("english")
# Add the en dash to the punctuation characters
# (the functions below filter on string.punctuation directly)
punctuation = string.punctuation + '–'

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
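
The steps shared by both methods can be sketched with the standard library alone. This is a simplified illustration: it uses a tiny hypothetical stop list instead of NLTK's, splits on whitespace, and leaves out lemmatization, which the two methods handle differently.

```python
import string

# Small hypothetical stop list; the notebook uses NLTK's full English list
TOY_STOPWORDS = {"the", "a", "of", "we", "to", "in", "for"}

def basic_clean(text, stopwords=TOY_STOPWORDS, min_len=2):
    # Replace newlines with spaces and drop punctuation except apostrophes
    text = text.replace("\n", " ")
    text = "".join(ch for ch in text if ch not in string.punctuation or ch == "'")
    tokens = text.lower().split()
    # Drop purely numeric tokens, short tokens and stop words
    return [t for t in tokens
            if not t.isnumeric() and len(t) >= min_len and t not in stopwords]

print(basic_clean("We propose a 3D model\nfor state-of-the-art tracking in 2021."))
# ['propose', '3d', 'model', 'stateoftheart', 'tracking']
```

Because hyphens count as punctuation, "state-of-the-art" collapses into the single token "stateoftheart", which is why that token shows up among the topic keywords.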


#### Pre-processing 1

Method 1 lemmatizes with spaCy, removes single-character tokens, adds bigrams, and filters the dictionary to drop tokens that appear in fewer than 20 documents or in more than 25% of documents. Models built on pre-processing method 1 use 10 topics.

In [13]:
# If not already loaded
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Preprocessing method 1
def preprocessing_1(collection):

    collection_processed = []

    for idx, abstract in enumerate(collection):

        # Preprocess over characters in each text document    
        # Replace newlines with spaces
        abstract = abstract.replace('\n', ' ')
        # Remove punctuation (apostrophes are kept)
        abstract = "".join(
            [char for char in abstract if char not in string.punctuation or char=='\''])

        # Preprocess over tokens in each text document
        # Extract the lemma for each token / tokenize text using spacy
        abstract = [token.lemma_ for token in nlp(abstract)]
        # Remove pure numeric tokens
        abstract = [token for token in abstract if not token.isnumeric()]
        # Remove single character tokens
        abstract = [token for token in abstract if len(token) > 1]
        # Remove stop words    
        abstract = [token for token in abstract if token not in stopwordsEnglish]

        # Update the collection
        collection_processed.append(abstract)

    # Add bigrams to each article's tokens
    bigram = Phrases(collection_processed, min_count=20)
    for idx, abstract in enumerate(collection_processed):
        for token in bigram[abstract]:
            if '_' in token:
                # Token is a bigram, add to document.
                abstract.append(token)

    # Dictionary representation of the documents.
    dictionary = Dictionary(collection_processed)

    # Filter out words that occur in fewer than 20 documents, or in more than 25% of the documents.
    dictionary.filter_extremes(no_below=20, no_above=.25)

    # Bag-of-words representation
    corpus = [dictionary.doc2bow(abstract) for abstract in collection_processed]

    # Index to word dictionary
    _ = dictionary[0]  # Accessing an item makes gensim build the id2token mapping
    id2word = dictionary.id2token

    return collection_processed, corpus, id2word, dictionary
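
To give a feel for how bigrams such as `et_al` or `deep_neural` get picked, the sketch below scores adjacent word pairs in the spirit of gensim's default `Phrases` scorer: pairs that co-occur more often than their individual counts would suggest score higher. It is a simplified approximation; gensim's exact vocabulary bookkeeping and default threshold differ, and the corpus, `min_count` and `threshold` values here are illustrative assumptions.

```python
from collections import Counter

def score_bigrams(docs, min_count=1, threshold=0.5):
    """Score adjacent token pairs and keep those above the threshold."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        # Discounted co-occurrence count relative to the individual counts
        score = (n_ab - min_count) / unigrams[a] / unigrams[b] * vocab_size
        if score > threshold:
            scores[f"{a}_{b}"] = score
    return scores

# Hypothetical tokenised documents
toy_docs = [
    ["neural", "network", "model"],
    ["neural", "network", "training"],
    ["deep", "neural", "network"],
    ["network", "model", "training"],
]
print(score_bigrams(toy_docs))
```

Pairs seen only `min_count` times score zero, so rare combinations such as "deep neural" in this toy corpus are not promoted to bigrams.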

#### Pre-processing 2

Method 2 lemmatizes with NLTK’s WordNet lemmatizer, removes tokens of two characters or fewer, adds both bigrams and trigrams, and filters the dictionary to drop tokens that appear in fewer than 5 documents or in more than 10% of documents. Models built on pre-processing method 2 use 40 topics.

In [14]:
# Create lemmatizer object
nltk.download('wordnet') # if Colab
lemmatizer = WordNetLemmatizer()

# Preprocessing method 2
def preprocessing_2(collection):

    collection_processed = []

    for idx, abstract in enumerate(collection):

        # Preprocess over characters in each text document    
        # Replace newlines with spaces
        abstract = abstract.replace('\n', ' ')
        # Remove punctuation (apostrophes are kept)
        abstract = "".join(
            [char for char in abstract if char not in string.punctuation or char=='\''])
        # Tokenize and make lowercase
        abstract = abstract.lower().split(" ")

        # Preprocess over tokens in each text document
        # Extract the lemma for each token / tokenize text using nltk
        abstract = [lemmatizer.lemmatize(token) for token in abstract]
        # Remove pure numeric tokens
        abstract = [token for token in abstract if not token.isnumeric()]
        # Remove tokens of two characters or fewer
        abstract = [token for token in abstract if len(token) > 2]
        # Remove stop words    
        abstract = [token for token in abstract if token not in stopwordsEnglish]

        # Update the collection
        collection_processed.append(abstract)

    # Add bigrams/trigrams (and possibly some 4grams) to each article's tokens
    bigram = Phrases(collection_processed, min_count=5)
    trigram = Phrases([bigram[abstract] for abstract in collection_processed], min_count=5)
    for idx, abstract in enumerate(collection_processed):
        for token in bigram[abstract]:
            if '_' in token:
                # Token is a bigram, add to document.
                abstract.append(token)
        # Note: due to the way gensim implements ngrams, some trigrams might be missed
        for token in trigram[abstract]:
            if '_' in token and token not in bigram[abstract]:
                # Token is a trigram (or possibly a 4gram), add to document:
                abstract.append(token)

    # Dictionary representation of the documents.
    dictionary = Dictionary(collection_processed)

    # Filter out words that occur in fewer than 5 documents, or in more than 10% of the documents.
    dictionary.filter_extremes(no_below=5, no_above=.1)

    # Bag-of-words representation
    corpus = [dictionary.doc2bow(abstract) for abstract in collection_processed]

    # Index to word dictionary
    _ = dictionary[0]  # Accessing an item makes gensim build the id2token mapping
    id2word = dictionary.id2token

    return collection_processed, corpus, id2word, dictionary

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
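
Both methods prune the dictionary by document frequency, just with different thresholds. The sketch below shows the idea on a toy corpus. It is an approximation of what `filter_extremes` does, not gensim's implementation, and the function name, documents and thresholds are illustrative.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in no more
    than the fraction `no_above` of all documents."""
    # Count each token once per document
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = int(no_above * len(docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Hypothetical tokenised documents
toy_docs = [
    ["code", "channel", "model"],
    ["code", "model"],
    ["image", "model"],
    ["code", "image", "model"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['code', 'image']
```

Here "model" is dropped for appearing in every document and "channel" for appearing in only one, which mirrors how very common and very rare tokens are removed before training.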


### Processing data

In [18]:
# Read in the data and create datasets
text_df = pd.read_csv("./drive/My Drive/Colab Notebooks/train.csv").head(20000)
docs = text_df['abstract'].tolist()

In [19]:
# Process docs 2 ways by 2 sizes
processed_1000_1 = preprocessing_1(docs[:1000])
processed_20000_1 = preprocessing_1(docs)
processed_1000_2 = preprocessing_2(docs[:1000])
processed_20000_2 = preprocessing_2(docs)

In [None]:
# Save results above
for x, y in product(['1000', '20000'], ['1','2']):
    path = f"./drive/My Drive/Colab Notebooks/processed_{x}_{y}.pickle"
    # Look the variable up by name and let the context manager close the file
    with open(path, "wb") as pickle_out:
        pickle.dump(globals()[f'processed_{x}_{y}'], pickle_out)

### Train models

In [None]:
# Load processed data
def load_processed(name):
    # Open each pickle in a context manager so every file handle is closed
    with open(f"./drive/My Drive/Colab Notebooks/{name}.pickle", "rb") as pickle_in:
        return pickle.load(pickle_in)

processed_1000_1 = load_processed("processed_1000_1")
processed_20000_1 = load_processed("processed_20000_1")
processed_1000_2 = load_processed("processed_1000_2")
processed_20000_2 = load_processed("processed_20000_2")

#### Model 1

Size: 1,000 <br>
Preprocessing: 1

In [None]:
# Train LDA model on first 1,000 rows of the data with preprocessing method 1.

# Set training parameters.
NUM_TOPICS = 10 # Use 10 topics for preprocessing 1
chunksize = 100
passes = 5
iterations = 200
eval_every = 5
corpus = processed_1000_1[1]
id2word = processed_1000_1[2]
dictionary = processed_1000_1[3]

model_1000_1 = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=NUM_TOPICS,
    passes=passes,
    eval_every=eval_every
)

In [None]:
# Get top words per topic
top_topics = model_1000_1.top_topics(corpus)
model_1000_1.num_topics

# Average topic coherence is the sum of topic coherences of all topics, divided by number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / NUM_TOPICS # t[1] is topic coherence
print('Average topic coherence: %.4f.' % avg_topic_coherence)

# Display top words by frequency per topic
model_1000_1.print_topics(num_words=8)

Average topic coherence: -3.2355.


[(0,
  '0.056*"system" + 0.055*"language" + 0.042*"classifier" + 0.039*"translation" + 0.037*"feedback" + 0.033*"al" + 0.032*"et_al" + 0.032*"et"'),
 (1,
  '0.012*"framework" + 0.011*"well" + 0.011*"sequence" + 0.010*"present" + 0.010*"high" + 0.010*"point" + 0.010*"3d" + 0.010*"term"'),
 (2,
  '0.050*"feature" + 0.026*"train" + 0.026*"stateoftheart" + 0.025*"neural" + 0.025*"training" + 0.024*"label" + 0.020*"convolutional" + 0.020*"input"'),
 (3,
  '0.153*"video" + 0.127*"object" + 0.086*"frame" + 0.045*"motion" + 0.031*"deep_neural" + 0.028*"style" + 0.026*"latent" + 0.023*"collect"'),
 (4,
  '0.071*"user" + 0.067*"system" + 0.046*"communication" + 0.040*"design" + 0.031*"distance" + 0.031*"computation" + 0.029*"interference" + 0.020*"distribute"'),
 (5,
  '0.026*"task" + 0.022*"information" + 0.018*"semantic" + 0.018*"text" + 0.017*"representation" + 0.016*"new" + 0.015*"knowledge" + 0.014*"set"'),
 (6,
  '0.100*"code" + 0.060*"channel" + 0.032*"scheme" + 0.028*"rate" + 0.027*"opti

In [None]:
# Visualise
lda_display = pyLDAvis.gensim.prepare(model_1000_1, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

2022-04-06 15:09:17,291 : DEBUG : performing inference on a chunk of 1000 documents
2022-04-06 15:09:17,587 : DEBUG : 1000/1000 documents converged within 200 iterations


#### Model 2

Size: 20,000 <br>
Preprocessing: 1

In [None]:
# Train LDA model on first 20,000 rows of the data with preprocessing method 1

# Set training parameters.
NUM_TOPICS = 10 # Use 10 topics for preprocessing 1
chunksize = 2000
passes = 5
iterations = 300
eval_every = 5
corpus = processed_20000_1[1]
id2word = processed_20000_1[2]
dictionary = processed_20000_1[3]

model_20000_1 = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=NUM_TOPICS,
    passes=passes,
    eval_every=eval_every
)

In [None]:
# Get top words per topic
top_topics = model_20000_1.top_topics(corpus)
model_20000_1.num_topics

# Average topic coherence is the sum of topic coherences of all topics, divided by number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / NUM_TOPICS # t[1] is topic coherence
print('Average topic coherence: %.4f.' % avg_topic_coherence)

# Display top words by frequency per topic
model_20000_1.print_topics(num_words=8)

Average topic coherence: -2.0173.


[(0,
  '0.014*"face" + 0.011*"adversarial" + 0.011*"generate" + 0.008*"pose" + 0.007*"training" + 0.007*"attack" + 0.007*"quality" + 0.006*"input"'),
 (1,
  '0.026*"algorithm" + 0.008*"signal" + 0.006*"matrix" + 0.006*"sparse" + 0.005*"point" + 0.005*"measurement" + 0.005*"space" + 0.005*"optimization"'),
 (2,
  '0.019*"object" + 0.017*"detection" + 0.012*"system" + 0.008*"research" + 0.007*"provide" + 0.007*"present" + 0.007*"visual" + 0.006*"application"'),
 (3,
  '0.019*"deep" + 0.018*"neural" + 0.013*"classification" + 0.011*"architecture" + 0.011*"accuracy" + 0.010*"convolutional" + 0.010*"learning" + 0.010*"layer"'),
 (4,
  '0.029*"language" + 0.017*"text" + 0.017*"word" + 0.010*"system" + 0.009*"sentence" + 0.008*"question" + 0.008*"translation" + 0.007*"embedding"'),
 (5,
  '0.021*"channel" + 0.017*"user" + 0.016*"system" + 0.014*"scheme" + 0.012*"power" + 0.010*"rate" + 0.009*"communication" + 0.008*"transmission"'),
 (6,
  '0.082*"code" + 0.015*"decode" + 0.012*"graph" + 0.01

In [None]:
# Visualise
lda_display = pyLDAvis.gensim.prepare(model_20000_1, processed_20000_1[1], processed_20000_1[3], sort_topics=False)
pyLDAvis.display(lda_display)



#### Model 3

Size: 1,000 <br>
Preprocessing: 2

In [None]:
# Train LDA model on first 1,000 rows of the data with preprocessing method 2

# Set training parameters.
NUM_TOPICS = 40 # Use 40 topics for preprocessing 2
chunksize = 100
passes = 5
iterations = 200
eval_every = 5
corpus = processed_1000_2[1]
id2word = processed_1000_2[2]
dictionary = processed_1000_2[3]

model_1000_2 = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=NUM_TOPICS,
    passes=passes,
    eval_every=eval_every
)

In [None]:
# Get top words per topic
top_topics = model_1000_2.top_topics(corpus)
model_1000_2.num_topics

# Average topic coherence is the sum of topic coherences of all topics, divided by number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / NUM_TOPICS # t[1] is topic coherence
print('Average topic coherence: %.4f.' % avg_topic_coherence)

# Display top words by frequency per topic
model_1000_2.print_topics(num_words=8)

Average topic coherence: -5.9585.


[(2,
  '0.000*"section" + 0.000*"pretrained_language" + 0.000*"closed" + 0.000*"summary" + 0.000*"graphical" + 0.000*"introducing" + 0.000*"multivariate" + 0.000*"consisting"'),
 (20,
  '0.103*"measurement" + 0.090*"numerical" + 0.075*"probability" + 0.055*"fading" + 0.054*"numerical_result" + 0.051*"analyze" + 0.050*"physical" + 0.047*"correspondence"'),
 (4,
  '0.117*"environment" + 0.112*"policy" + 0.064*"world" + 0.055*"adaptation" + 0.050*"robot" + 0.044*"evidence" + 0.039*"reinforcement" + 0.036*"reinforcement_learning"'),
 (0,
  '0.247*"research" + 0.083*"topic" + 0.051*"protocol" + 0.049*"basis" + 0.048*"obtaining" + 0.041*"active" + 0.037*"acquisition" + 0.034*"community"'),
 (19,
  '0.225*"distribution" + 0.122*"attribute" + 0.093*"inference" + 0.083*"latent" + 0.049*"perform" + 0.047*"continuous" + 0.039*"variable" + 0.037*"infer"'),
 (17,
  '0.107*"value" + 0.093*"correlation" + 0.061*"access" + 0.061*"search" + 0.055*"constraint" + 0.052*"upper_bound" + 0.045*"orthogonal" 

In [None]:
# Visualise
lda_display = pyLDAvis.gensim.prepare(model_1000_2, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

2022-04-06 16:01:45,287 : DEBUG : performing inference on a chunk of 1000 documents
2022-04-06 16:01:45,576 : DEBUG : 1000/1000 documents converged within 200 iterations


#### Model 4

Size: 20,000 <br>
Preprocessing: 2

In [None]:
# Train LDA model on first 20,000 rows of the data with preprocessing method 2

# Set training parameters.
NUM_TOPICS = 40 # Use 40 topics for preprocessing 2
chunksize = 2000
passes = 5
iterations = 200
eval_every = 5
corpus = processed_20000_2[1]
id2word = processed_20000_2[2]
dictionary = processed_20000_2[3]

model_20000_2 = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=NUM_TOPICS,
    passes=passes,
    eval_every=eval_every
)

In [None]:
# Get top words per topic
top_topics = model_20000_2.top_topics(corpus)
model_20000_2.num_topics

# Average topic coherence is the sum of topic coherences of all topics, divided by number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / NUM_TOPICS # t[1] is topic coherence
print('Average topic coherence: %.4f.' % avg_topic_coherence)

# Display top words by frequency per topic
model_20000_2.print_topics(num_words=8)

Average topic coherence: -4.2755.


[(0,
  '0.124*"face" + 0.054*"recognition" + 0.043*"person" + 0.026*"facial" + 0.024*"character" + 0.023*"face_recognition" + 0.020*"identity" + 0.018*"reid"'),
 (39,
  '0.122*"visual" + 0.055*"tracking" + 0.016*"explanation" + 0.014*"tracker" + 0.012*"track" + 0.012*"vqa" + 0.011*"target" + 0.011*"cluster"'),
 (24,
  '0.039*"style" + 0.029*"region" + 0.028*"resolution" + 0.015*"transfer" + 0.011*"line" + 0.010*"breast" + 0.009*"prototype" + 0.009*"style_transfer"'),
 (2,
  '0.059*"quality" + 0.046*"metric" + 0.023*"fewshot" + 0.020*"enhancement" + 0.017*"database" + 0.014*"evaluation" + 0.012*"measure" + 0.011*"assessment"'),
 (19,
  '0.047*"retrieval" + 0.047*"similarity" + 0.042*"query" + 0.038*"matching" + 0.038*"document" + 0.027*"descriptor" + 0.025*"local" + 0.025*"topic"'),
 (3,
  '0.062*"adversarial" + 0.051*"attack" + 0.029*"robustness" + 0.029*"example" + 0.020*"perturbation" + 0.018*"feedback" + 0.018*"robust" + 0.012*"classifier"'),
 (32,
  '0.041*"cell" + 0.012*"frequency

In [None]:
# Visualise
lda_display = pyLDAvis.gensim.prepare(model_20000_2, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

2022-04-06 16:10:23,746 : DEBUG : performing inference on a chunk of 20000 documents
2022-04-06 16:10:32,985 : DEBUG : 20000/20000 documents converged within 200 iterations


In [None]:
# Save all models
for x, y in product(['1000', '20000'], ['1','2']):
    output = f'./drive/My Drive/Colab Notebooks/model_{x}_{y}.gensim'
    # Look the model up by name instead of building code for eval
    globals()[f'model_{x}_{y}'].save(output)
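
The average topic coherence printed for each model comes from `top_topics`, which by default scores topics with the UMass measure: a topic is coherent when its top words tend to appear in the same documents. The sketch below is a simplified, UMass-style score averaged over word pairs. gensim's aggregation details may differ, the toy documents are hypothetical, and the function assumes every top word occurs in at least one document.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs, eps=1.0):
    """Average log ratio of co-document counts for a topic's top words."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    # Pair each word with every lower-ranked word in the topic
    for earlier, later in combinations(top_words, 2):
        scores.append(math.log((doc_count(earlier, later) + eps) / doc_count(earlier)))
    return sum(scores) / len(scores)

# Hypothetical tokenised documents
toy_docs = [
    ["code", "channel", "rate"],
    ["code", "channel"],
    ["image", "object"],
    ["image", "video", "object"],
]
coherent = umass_coherence(["code", "channel"], toy_docs)
mixed = umass_coherence(["code", "image"], toy_docs)
print(round(coherent, 3), round(mixed, 3))
```

Scores closer to zero (or above it) indicate words that co-occur often, which is why the less negative averages above are read as more coherent models. Scores are only comparable between models built on the same corpus and with a similar number of top words.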

### Most dominant topic per article

In [None]:
# Load processed data (in case not already loaded previously)
def load_processed(name):
    # Open each pickle in a context manager so every file handle is closed
    with open(f"./drive/My Drive/Colab Notebooks/{name}.pickle", "rb") as pickle_in:
        return pickle.load(pickle_in)

processed_1000_1 = load_processed("processed_1000_1")
processed_20000_1 = load_processed("processed_20000_1")
processed_1000_2 = load_processed("processed_1000_2")
processed_20000_2 = load_processed("processed_20000_2")

In [None]:
# Load models (in case not already loaded)
model_1000_1 = LdaModel.load('./drive/My Drive/Colab Notebooks/model_1000_1.gensim')
model_20000_1 = LdaModel.load('./drive/My Drive/Colab Notebooks/model_20000_1.gensim')
model_1000_2 = LdaModel.load('./drive/My Drive/Colab Notebooks/model_1000_2.gensim')
model_20000_2 = LdaModel.load('./drive/My Drive/Colab Notebooks/model_20000_2.gensim')

Apply the dominant-topic function to each model and preview the first 5 documents:

In [None]:
doc_topic_df_1000_1 = get_document_topics(ldamodel=model_1000_1, 
                                          corpus=processed_1000_1[1], 
                                          texts=text_df[['title', 'abstract']])
doc_topic_df_1000_1.head(5)

Unnamed: 0,Dominant_Topic,Perc_Contribution,Topic_Keywords,Title,Original_Text
0,9.0,0.4887,"deep, detection, classification, learning, seg...",Objective-Dependent Uncertainty Driven Retinal...,From diagnosing neovascular diseases to dete...
1,5.0,0.7778,"task, information, semantic, text, representat...",SMARTies: Sentiment Models for Arabic Target E...,We consider entity-level sentiment analysis ...
2,1.0,0.2456,"framework, well, sequence, present, high, poin...",State-Aware Tracker for Real-Time Video Object...,In this work we address the task of semi-sup...
3,8.0,0.3845,"algorithm, distribution, parameter, complexity...",On the Performance of Optimized Dense Device-t...,We consider a D2D wireless network where $n$...
4,6.0,0.437,"code, channel, scheme, rate, optimal, error, i...",Design of Minimum Correlated Maximal Clique Se...,This paper proposes an algorithm to search a...


In [None]:
doc_topic_df_20000_1 = get_document_topics(ldamodel=model_20000_1, 
                                           corpus=processed_20000_1[1], 
                                           texts=text_df[['title', 'abstract']])
doc_topic_df_20000_1.head(5)

Unnamed: 0,Dominant_Topic,Perc_Contribution,Topic_Keywords,Title,Original_Text
0,8.0,0.7431,"segmentation, 3d, domain, label, object, train...",Objective-Dependent Uncertainty Driven Retinal...,From diagnosing neovascular diseases to dete...
1,4.0,0.5771,"language, text, word, system, sentence, questi...",SMARTies: Sentiment Models for Arabic Target E...,We consider entity-level sentiment analysis ...
2,0.0,0.2511,"face, adversarial, generate, pose, training, a...",State-Aware Tracker for Real-Time Video Object...,In this work we address the task of semi-sup...
3,5.0,0.5305,"channel, user, system, scheme, power, rate, co...",On the Performance of Optimized Dense Device-t...,We consider a D2D wireless network where $n$...
4,6.0,0.5776,"code, decode, graph, node, error, linear, deco...",Design of Minimum Correlated Maximal Clique Se...,This paper proposes an algorithm to search a...


In [None]:
doc_topic_df_1000_2 = get_document_topics(ldamodel=model_1000_2, 
                                          corpus=processed_1000_2[1], 
                                          texts=text_df[['title', 'abstract']])
doc_topic_df_1000_2.head(5)

Unnamed: 0,Dominant_Topic,Perc_Contribution,Topic_Keywords,Title,Original_Text
0,5.0,0.8392,"segmentation, parameter, map, neural_network, ...",Objective-Dependent Uncertainty Driven Retinal...,From diagnosing neovascular diseases to dete...
1,32.0,0.6183,"face, cluster, response, person, sentiment, ap...",SMARTies: Sentiment Models for Arabic Target E...,We consider entity-level sentiment analysis ...
2,13.0,0.4932,"robust, annotation, instance, property, make, ...",State-Aware Tracker for Real-Time Video Object...,In this work we address the task of semi-sup...
3,27.0,0.4668,"scheme, binary, interference, user, feedback, ...",On the Performance of Optimized Dense Device-t...,We consider a D2D wireless network where $n$...
4,17.0,0.5772,"value, correlation, access, search, constraint...",Design of Minimum Correlated Maximal Clique Se...,This paper proposes an algorithm to search a...


In [None]:
doc_topic_df_20000_2 = get_document_topics(ldamodel=model_20000_2, 
                                           corpus=processed_20000_2[1], 
                                           texts=text_df[['title', 'abstract']])
doc_topic_df_20000_2.head(5)

Unnamed: 0,Dominant_Topic,Perc_Contribution,Topic_Keywords,Title,Original_Text
0,8.0,0.8163,"segmentation, semantic_segmentation, segment, ...",Objective-Dependent Uncertainty Driven Retinal...,From diagnosing neovascular diseases to dete...
1,4.0,0.3647,"translation, sentence, natural, natural_langua...",SMARTies: Sentiment Models for Arabic Target E...,We consider entity-level sentiment analysis ...
2,27.0,0.1619,"distribution, probability, random, entropy, me...",State-Aware Tracker for Real-Time Video Object...,In this work we address the task of semi-sup...
3,25.0,0.2615,"channel, scheme, capacity, receiver, interfere...",On the Performance of Optimized Dense Device-t...,We consider a D2D wireless network where $n$...
4,14.0,0.435,"bound, coding, decoding, linear, error, length...",Design of Minimum Correlated Maximal Clique Se...,This paper proposes an algorithm to search a...


### Most representative document per topic

Apply the most-representative-documents function to each model and display the most representative document for each topic:

In [None]:
rep_docs_1000_1 = most_representative_docs(doc_topic_df_1000_1)
rep_docs_1000_1

Unnamed: 0,Doc_ID,Topic_Num,Topic_Perc_Contrib,Keywords,Title,Text
0,95,0.0,0.5318,"system, language, classifier, translation, fee...",Downlink Interference Alignment,We develop an interference alignment (IA) te...
1,47,1.0,0.8074,"framework, well, sequence, present, high, poin...",Wav2vec-Switch: Contrastive Learning from Orig...,The goal of self-supervised learning (SSL) f...
2,23,2.0,0.73,"feature, train, stateoftheart, neural, trainin...",Gated Fusion Network for Degraded Image Super ...,Single image super resolution aims to enhanc...
3,441,3.0,0.4083,"video, object, frame, motion, deep_neural, sty...",One-Step Time-Dependent Future Video Frame Pre...,There is an inherent need for autonomous car...
4,58,4.0,0.6215,"user, system, communication, design, distance,...",UAV-Sensing-Assisted Cellular Interference Coo...,Aerial-ground interference mitigation has be...
5,53,5.0,0.8799,"task, information, semantic, text, representat...",Relationship-Embedded Representation Learning ...,Grounding referring expressions in images ai...
6,96,6.0,0.8581,"code, channel, scheme, rate, optimal, error, i...",Transmit Power Minimization for MIMO Systems o...,This paper is concerned with a wireless syst...
7,621,7.0,0.5025,"face, sentence, function, recognition, attribu...",Dialogue Generation on Infrequent Sentence Fun...,Sentence function is an important linguistic...
8,41,8.0,0.5701,"algorithm, distribution, parameter, complexity...",Plan Optimization to Bilingual Dictionary Indu...,Creating bilingual dictionary is the first c...
9,99,9.0,0.8958,"deep, detection, classification, learning, seg...",Deep learning for image segmentation: veritabl...,Deep learning has achieved great success as ...


In [None]:
rep_docs_20000_1 = most_representative_docs(doc_topic_df_20000_1)
rep_docs_20000_1

Unnamed: 0,Doc_ID,Topic_Num,Topic_Perc_Contrib,Keywords,Title,Text
0,8715,0.0,0.9806,"face, adversarial, generate, pose, training, a...",Generating Photo-Realistic Training Data to Im...,In this paper we investigate the feasibility...
1,588,1.0,0.9898,"algorithm, signal, matrix, sparse, point, meas...",On DC based Methods for Phase Retrieval,In this paper we develop a new computational...
2,1612,2.0,0.9397,"object, detection, system, research, provide, ...",A Compact Survey on Event Extraction: Approach...,Event extraction is a critical technique to ...
3,1732,3.0,0.989,"deep, neural, classification, architecture, ac...",Surrogate-assisted Particle Swarm Optimisation...,Deep convolutional neural networks have demo...
4,1159,4.0,0.9762,"language, text, word, system, sentence, questi...",Improving Target-side Lexical Transfer in Mult...,To improve the performance of Neural Machine...
5,186,5.0,0.9886,"channel, user, system, scheme, power, rate, co...",User Cooperation for Enhanced Throughput Fairn...,This paper studies a novel user cooperation ...
6,1675,6.0,0.9761,"code, decode, graph, node, error, linear, deco...",Equivalences Between Network Codes With Link E...,In this paper new equivalence relationships ...
7,10265,7.0,0.9862,"feature, video, information, representation, r...",Multi-modal Representation Learning for Video ...,Video advertisement content structuring aims...
8,1252,8.0,0.9471,"segmentation, 3d, domain, label, object, train...",Semantic Scene Completion from a Single Depth ...,This paper focuses on semantic scene complet...
9,803,9.0,0.9794,"information, channel, distribution, function, ...",Relations between Information and Estimation i...,Fundamental relations between information an...


In [None]:
rep_docs_1000_2 = most_representative_docs(doc_topic_df_1000_2)
rep_docs_1000_2

Unnamed: 0,Doc_ID,Topic_Num,Topic_Perc_Contrib,Keywords,Title,Text
0,86,0.0,0.6732,"research, topic, protocol, basis, obtaining, a...",Unconstrained Biometric Recognition: Summary o...,The development of biometric recognition sol...
1,72,1.0,0.7989,"size, convolution, action, temporal, cnns, imp...",Rethinking Spatiotemporal Feature Learning: Sp...,Despite the steady progress in video analysi...
2,40,3.0,0.7315,"word, prediction, translation, classifier, mac...",Why ReLU networks yield high-confidence predic...,Classifiers used in the wild in particular f...
3,17,4.0,0.4733,"environment, policy, world, adaptation, robot,...",Never Stop Learning: The Effectiveness of Fine...,One of the great promises of robot learning ...
4,0,5.0,0.8392,"segmentation, parameter, map, neural_network, ...",Objective-Dependent Uncertainty Driven Retinal...,From diagnosing neovascular diseases to dete...
5,91,6.0,0.6249,"computation, mean, presented, computing, gener...",On the Cohomology of 3D Digital Images,We propose a method for computing the cohomo...
6,70,7.0,0.7376,"communication, power, state, energy, receiver,...",Layered Coding for Energy Harvesting Communica...,Due to stringent constraints on resources it...
7,55,8.0,0.5631,"term, processing, future, speech, output, fiel...",A systematic review of Hate Speech automatic d...,With the multiplication of social media plat...
8,99,9.0,0.802,"region, cnn, semantic, deep_learning, trained,...",Deep learning for image segmentation: veritabl...,Deep learning has achieved great success as ...
9,58,10.0,0.5998,"signal, observation, unsupervised, resource, m...",UAV-Sensing-Assisted Cellular Interference Coo...,Aerial-ground interference mitigation has be...


In [None]:
rep_docs_20000_2 = most_representative_docs(doc_topic_df_20000_2)
rep_docs_20000_2

Unnamed: 0,Doc_ID,Topic_Num,Topic_Perc_Contrib,Keywords,Title,Text
0,1819,0.0,0.5263,"face, recognition, person, facial, character, ...",Multi-Attributed and Structured Text-to-Face S...,Generative Adversarial Networks (GANs) have ...
1,1519,1.0,0.8175,"speech, social, medium, recognition, social_me...",WLV-RIT at SemEval-2021 Task 5: A Neural Trans...,In recent years the widespread use of social...
2,982,2.0,0.5747,"quality, metric, fewshot, enhancement, databas...",Learning to Compare: Relation Network for Few-...,We present a conceptually simple flexible an...
3,1422,3.0,0.7383,"adversarial, attack, robustness, example, pert...",Generating Unrestricted Adversarial Examples v...,Deep neural networks have been shown to be v...
4,27,4.0,0.9092,"translation, sentence, natural, natural_langua...",NICT's Neural and Statistical Machine Translat...,This paper presents the NICT's participation...
5,15592,5.0,0.7118,"attention, latent, generation, dialogue, mecha...",Knowledge-Grounded Response Generation with De...,End-to-end dialogue generation has achieved ...
6,629,6.0,0.611,"temporal, interaction, prediction, hierarchica...",Deep Multi-Shot Network for modelling Appearan...,The automatization of Multi-Object Tracking ...
7,1783,7.0,0.5729,"environment, vehicle, robot, autonomous, drivi...",Writer Identification Using Inexpensive Signal...,We propose to use novel and classical audio ...
8,0,8.0,0.8163,"segmentation, semantic_segmentation, segment, ...",Objective-Dependent Uncertainty Driven Retinal...,From diagnosing neovascular diseases to dete...
9,793,9.0,0.7904,"word, semantic, class, embedding, embeddings, ...",Learning Semantic Sentence Embeddings using Se...,In this paper we propose a method for obtain...


To see more detail of the documents themselves:

In [None]:
# Display details of the top articles whose dominant topic contributes at least 90% (model 2: 20,000 documents, pre-processing method 1)
for t in rep_docs_20000_1[rep_docs_20000_1['Topic_Perc_Contrib']>=0.9]['Topic_Num'].tolist():
    display_rep_doc(rep_docs_20000_1, t)

Topic 0.0 

Topic keywords: 
face, adversarial, generate, pose, training, attack, quality, input, train, motion

Document Title: 
Generating Photo-Realistic Training Data to Improve Face Recognition
  Accuracy

Document Contribution is 98.06 % 

  In this paper we investigate the feasibility of using synthetic data to
augment face datasets. In particular we propose a novel generative adversarial
network (GAN) that can disentangle identity-related attributes from
non-identity-related attributes. This is done by training an embedding network
that maps discrete identity labels to an identity latent space that follows a
simple prior distribution and training a GAN conditioned on samples from that
distribution. Our proposed GAN allows us to augment face datasets by generating
both synthetic images of subjects in the training set and synthetic images of
new subjects not in the training set. By using recent advances in GAN training
we show that the synthetic images generated by our model are 

In [None]:
# Display details of the top articles whose dominant topic contributes at least 90% (model 4: 20,000 documents, pre-processing method 2)
for t in rep_docs_20000_2[rep_docs_20000_2['Topic_Perc_Contrib']>=0.9]['Topic_Num'].tolist():
    display_rep_doc(rep_docs_20000_2, t) 

Topic 4.0 

Topic keywords: 
translation, sentence, natural, natural_language, text, corpus, entity, word, pretrained, transformer

Document Title: 
NICT's Neural and Statistical Machine Translation Systems for the WMT18
  News Translation Task

Document Contribution is 90.92 % 

  This paper presents the NICT's participation to the WMT18 shared news
translation task. We participated in the eight translation directions of four
language pairs: Estonian-English Finnish-English Turkish-English and
Chinese-English. For each translation direction we prepared state-of-the-art
statistical (SMT) and neural (NMT) machine translation systems. Our NMT systems
were trained with the transformer architecture using the provided parallel data
enlarged with a large quantity of back-translated monolingual data that we
generated with a new incremental training framework. Our primary submissions to
the task are the result of a simple combination of our SMT and NMT systems. Our
systems are ranked first for

In [None]:
# Display details of some top articles from NLP topics
for t in [4, 9, 19]:
    display_rep_doc(rep_docs_20000_2, t) 

Topic 4 

Topic keywords: 
translation, sentence, natural, natural_language, text, corpus, entity, word, pretrained, transformer

Document Title: 
NICT's Neural and Statistical Machine Translation Systems for the WMT18
  News Translation Task

Document Contribution is 90.92 % 

  This paper presents the NICT's participation to the WMT18 shared news
translation task. We participated in the eight translation directions of four
language pairs: Estonian-English Finnish-English Turkish-English and
Chinese-English. For each translation direction we prepared state-of-the-art
statistical (SMT) and neural (NMT) machine translation systems. Our NMT systems
were trained with the transformer architecture using the provided parallel data
enlarged with a large quantity of back-translated monolingual data that we
generated with a new incremental training framework. Our primary submissions to
the task are the result of a simple combination of our SMT and NMT systems. Our
systems are ranked first for t

### Topic discovery

From the intertopic distance maps, some logical clusters could be identified. With the first pre-processing method (10 topics) and the 1,000-document sample, the topics that emerged covered motion capture, communication systems, natural language processing, facial recognition and deep learning. In general, though, it was difficult to pinpoint a theme for each topic: most appeared to be an amalgamation of terms from different domains or of general machine learning terms such as ‘classifier’, ‘training’ or ‘feature’.

Using the larger set of 20,000 abstracts resulted in more coherent topics. A possible set of topic labels is: biometrics or generative adversarial models, signals, object detection systems, deep convolutional neural networks, natural language processing, communication networks, cybersecurity, computer vision – video, computer vision – image segmentation, and information & probability theory. Several of these topics overlapped on the intertopic distance map, namely those that closely relate to, or are a subset of, computer vision. This is consistent with the labels in the original training data. There were, however, some limitations. First, the topics were somewhat inconsistent in kind: some describe application areas (e.g., object detection) whereas others describe algorithms (e.g., deep CNNs). Second, this run used the same filtering thresholds to build the vocabulary dictionary as the LDA run on the smaller dataset. The upper threshold may have been set too high, letting a few extra generic machine learning terms through, whereas method 2, discussed below, had a stricter upper threshold. Third, the number of topics may have been set too low. For example, health-related words such as “clinical”, “patient”, “disease” and “diagnosis” were part of the same topic as deep CNNs; they were highly exclusive to this topic but not as likely to appear in a given document from it. They could easily have been split off into an additional topic.
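To make the effect of the upper filtering threshold concrete, here is a minimal sketch that uses only the standard library. The parameter names `no_below` and `no_above` are modelled on gensim's `Dictionary.filter_extremes`, but the function `filter_vocabulary`, the toy documents and the threshold values are assumptions for illustration and are not the settings used in this notebook.

```python
from collections import Counter

def filter_vocabulary(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: each token is counted once per document.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    keep = {tok for tok, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}
    filtered = [[tok for tok in doc if tok in keep] for doc in docs]
    return filtered, keep

# Toy tokenised abstracts, made up for illustration.
docs = [
    ["model", "training", "face", "gan"],
    ["model", "training", "channel", "power"],
    ["model", "translation", "sentence", "training"],
    ["model", "face", "channel", "translation"],
]

_, loose = filter_vocabulary(docs, no_below=2, no_above=0.8)
_, strict = filter_vocabulary(docs, no_below=2, no_above=0.5)
print(sorted(loose))   # the generic term "training" survives the looser threshold
print(sorted(strict))  # the stricter threshold removes it
```

In this toy corpus, lowering the upper threshold removes a generic term that appears in most documents while leaving the domain-specific terms in place.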

With pre-processing method 2 and the first 1,000 documents, each topic is rather sparse. Most of the tokens do not cluster into obviously comprehensible topics; instead, the topics appear to contain an amalgamation of unrelated machine learning terms and other computer science terms. There is some semblance of topics such as NLP (split across a few topics), communications, motion tracking, question answering, computer vision, adversarial models and signal processing. For the most part, however, the topics show little coherence.

Using pre-processing method 2 with 20,000 documents provides a higher-quality set of topics than using only 1,000 documents. Compared to pre-processing method 1, where only 10 topics were specified, pre-processing method 2 provides a more granular level of detail, resulting in some distinct subtopics. For example, within NLP there appear to be subtopics for speech recognition, machine translation, word embeddings, sentiment analysis combined with text classification, question answering and document similarity. There were also separate topics for object detection and autonomous vehicles rather than a single topic grouping them together. One topic consisted predominantly of medical terms, while a few additional medical terms were spread over other topics. Many of the topics identified in the other models can also be identified in this model, such as computer vision, biometric identification, GANs, CNNs, signals, communications and information & probability theory, with some of them split over multiple topics. There are also additional topics such as domain adaptation and graph models. There were still some low-quality topics that did not consist of particularly comprehensible sets of words. This was likely due to the number of topics being set too high.

The topics identified by the larger model under pre-processing method 1 were generally present in the model that used method 2, but in a number of cases they were broken down in what appeared to be a somewhat arbitrary manner. This, along with the fact that some of the method 2 topics were not comprehensible, suggests that 40 topics was too many. At the same time, some of the additional topics from method 2 were logical subtopics of those discovered under method 1, and the model also produced some new logical topics. This suggests that the optimal number of topics is more than 10, although it depends on the level of granularity sought. Neither model trained on only 1,000 documents provided a solid set of insightful topics, especially the model trained to find 40 topics, although some of the topics discovered were at least partly comprehensible.
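The comparison above relies on human judgement. One common quantitative complement, which is not necessarily what this notebook used, is a topic coherence score such as UMass coherence, which rewards topics whose top words tend to co-occur in the same documents. The sketch below is a plain-Python illustration under that assumption; the function `umass_coherence` and the toy documents are hypothetical, and gensim's `CoherenceModel` provides production implementations of this and other coherence measures.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass coherence for one topic (hypothetical helper).

    `top_words` must be ordered from most to least probable, and every
    word must occur in at least one document to avoid division by zero.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # For each pair, w_j is the higher-ranked word and w_i the lower-ranked one.
    for j, i in combinations(range(len(top_words)), 2):
        w_j, w_i = top_words[j], top_words[i]
        score += math.log((doc_count(w_i, w_j) + 1) / doc_count(w_j))
    return score

# Toy tokenised documents, made up for illustration.
docs = [
    ["translation", "sentence", "corpus"],
    ["translation", "sentence", "word"],
    ["channel", "power", "rate"],
    ["channel", "power", "translation"],
]

coherent_topic = ["translation", "sentence", "corpus"]
mixed_topic = ["translation", "power", "corpus"]
print(umass_coherence(coherent_topic, docs))
print(umass_coherence(mixed_topic, docs))
```

Scores are at most slightly above zero and become more negative as the top words co-occur less often, so the topic that mixes terms from different domains scores lower here. Averaging such a score over all topics for several candidate topic counts is one way to narrow down a range between 10 and 40 before inspecting the topics by hand.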

### Example documents

Below are some example documents with a high contribution from a single topic, using the LDA model trained on 20,000 rows with pre-processing method 1 (10 topics). Every topic in this model had at least one document for which it alone was highly representative.

$$
\begin{aligned}
&\begin{array}{ccc}
\hline \hline \text { Proposed topic } & \text { Topic keywords } & \text { Document title} \\
\hline \text{Biometrics} & \text{face, adversarial, generate, pose, training} & \text{Generating Photo-Realistic Training Data to Improve Face Recognition Accuracy} \\
\text{NLP} & \text{language, text, word, system, sentence} & \text{Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation}\\
\text{Network communications} & \text{channel, user, system, rate, communication, transmission} & \text{User Cooperation for Enhanced Throughput Fairness in Wireless Powered Communication Networks} \\
\text{Information and probability theory} & \text{information, distribution, function, probability} & \text{Relations between Information and Estimation in Discrete-Time Lévy Channels}\\
\hline
\end{array}
\end{aligned}
$$

On the other hand, the following three NLP documents were each the most representative document for a different topic under the LDA model trained on 20,000 rows with pre-processing method 2 (40 topics), which demonstrates the increased granularity obtained by specifying a greater number of topics.

$$
\begin{aligned}
&\begin{array}{ccc}
\hline \hline \text { Proposed topic } & \text { Topic keywords } & \text { Document title} \\
\hline \text{Machine translation} & \text{translation, sentence, natural_language, text} & \text{NICT's Neural and Statistical Machine Translation Systems for the WMT18 News Translation Task} \\
\text{Word embeddings} & \text{word, semantic, class, embedding} & \text{Learning Semantic Sentence Embeddings using Sequential Pair-wise Discriminator}\\
\text{Document similarity (or topic modelling)} & \text{retrieval, similarity, matching, document} & \text{FLATM: A Fuzzy Logic Approach Topic Model for Medical Documents} \\
\hline
\end{array}
\end{aligned}
$$
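The selection behind both tables (one highest-contribution document per dominant topic, optionally restricted to contributions of at least 90%) can be sketched in plain Python. The column names follow the output tables above; the function `top_doc_per_topic` and the row values are made up for illustration, and the notebook's own `most_representative_docs` may be implemented differently.

```python
def top_doc_per_topic(rows):
    """Return, for each dominant topic, the row with the highest contribution
    (hypothetical helper)."""
    best = {}
    for row in rows:
        topic = row["Topic_Num"]
        if (topic not in best
                or row["Topic_Perc_Contrib"] > best[topic]["Topic_Perc_Contrib"]):
            best[topic] = row
    # A topic that is never dominant for any document simply does not appear.
    return [best[topic] for topic in sorted(best)]

# One row per document: its dominant topic and that topic's contribution.
rows = [
    {"Doc_ID": 10, "Topic_Num": 0, "Topic_Perc_Contrib": 0.62},
    {"Doc_ID": 11, "Topic_Num": 0, "Topic_Perc_Contrib": 0.98},
    {"Doc_ID": 12, "Topic_Num": 1, "Topic_Perc_Contrib": 0.55},
    {"Doc_ID": 13, "Topic_Num": 1, "Topic_Perc_Contrib": 0.71},
    {"Doc_ID": 14, "Topic_Num": 2, "Topic_Perc_Contrib": 0.93},
]

rep = top_doc_per_topic(rows)
confident = [r for r in rep if r["Topic_Perc_Contrib"] >= 0.9]
print([r["Doc_ID"] for r in rep])
print([r["Doc_ID"] for r in confident])
```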

### Summary
Topic modelling enabled us to semantically interpret clusters of terms as topics applicable to the documents in our dataset, which is a useful way of understanding data. However, it is up to us to set the number of topics for the models to find, and this is not easy to determine. With too few topics, multiple themes get clumped together; with too many, some topics are semantically meaningless or seem arbitrarily separated from each other. Topic modelling also requires a human to interpret the terms in each topic and create appropriate labels, which proved tricky and requires a certain degree of familiarity with the domains. The models are further limited by the fact that they disregard the order of words in each sentence and instead treat each document as a bag of words.
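The bag-of-words limitation is easy to see in a small example. The sketch below, with made-up sentences, shows that two documents with different word order produce identical bag-of-words counts, so a model built on such counts cannot tell them apart.

```python
from collections import Counter

# Two toy "documents" with the same words in a different order.
doc_a = "the attack fooled the classifier".split()
doc_b = "the classifier fooled the attack".split()

bow_a = Counter(doc_a)
bow_b = Counter(doc_b)

print(doc_a == doc_b)  # False: the token sequences differ
print(bow_a == bow_b)  # True: the bag-of-words counts are identical
```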