# Latent Dirichlet Allocation

This is the takehome notebook for the NLP engineer position at Contenda. 



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install boto3
!pip install pyLDAvis

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from itertools import chain
import numpy as np

from gensim.models import Phrases
from gensim import corpora, models

import nltk

import boto3
import json



In [4]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [5]:
import os

def file_to_list(dir):
  list_text = []
  for filename in os.listdir(dir):
    with open(dir + filename, 'r') as f:
      text = f.read()
      list_text.append(text)
  return list_text

In [6]:
path = '/content/drive/My Drive/nlp_take_home/'

train_dir = 'training_transcriptions/'

train_text_list = file_to_list(path+train_dir)

testing_dir = 'testing_transcriptions/'

testing_text_list = file_to_list(path+testing_dir)

In [7]:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [8]:
def build_lda_model(list_text, num_topics, alpha, beta):
    df = pd.DataFrame(list_text)
    df.columns = ["documents"]
    df['sentences'] = df.documents.map(sent_tokenize)
    df['tokens_sentences'] = df['sentences'].map(lambda sentences: [word_tokenize(sentence) for sentence in sentences])
    df['POS_tokens'] = df['tokens_sentences'].map(lambda tokens_sentences: [pos_tag(tokens) for tokens in tokens_sentences])
    df['tokens_sentences_lemmatized'] = df['POS_tokens'].map(
        lambda list_tokens_POS: [
            [
                lemmatizer.lemmatize(el[0], get_wordnet_pos(el[1])) 
                if get_wordnet_pos(el[1]) != '' else el[0] for el in tokens_POS
            ] 
            for tokens_POS in list_tokens_POS
        ]
    )
    
    df['tokens'] = df['tokens_sentences_lemmatized'].map(lambda sentences: list(chain.from_iterable(sentences)))
    df['tokens'] = df['tokens'].map(lambda tokens: [token.lower() for token in tokens if token.isalpha()])
    stop_words = set(stopwords.words('english'))
    df['tokens'] = df['tokens'].map(lambda tokens: [token for token in tokens if not token.lower() in stop_words])


    tokens = df['tokens'].tolist()
    bigram_model = Phrases(tokens)
    trigram_model = Phrases(bigram_model[tokens], min_count=1)
    tokens = list(trigram_model[bigram_model[tokens]])
        
    dictionary_LDA = corpora.Dictionary(tokens)
    dictionary_LDA.filter_extremes(no_below=0)
    corpus = [dictionary_LDA.doc2bow(tok) for tok in tokens]
        
    lda_model = models.LdaModel(corpus, num_topics=num_topics, 
                                      id2word=dictionary_LDA, 
                                      passes=4,
                                      alpha = alpha,
                                      eta = beta,
                                      random_state = 123)
    
    return tokens, dictionary_LDA, corpus, lda_model

In [9]:
from gensim.models import CoherenceModel

def get_coherence_metrics(tokens, dictionary_LDA, corpus, lda_model):
  c_v_model = CoherenceModel(model=lda_model, texts=tokens, dictionary=dictionary_LDA, coherence='c_v')
  c_v = c_v_model.get_coherence()
  c_npmi_model = CoherenceModel(model=lda_model, texts=tokens, dictionary=dictionary_LDA, coherence='c_npmi')
  c_npmi = c_npmi_model.get_coherence()
  return c_v, c_npmi

In [12]:
min_topics = 30
max_topics = 45
step_size = 5
topics_list = list(range(min_topics, max_topics, step_size))

alpha_list = [0.25]
alpha_list.append('symmetric')
alpha_list.append('asymmetric')

beta_list = [0.25]
beta_list.append('symmetric')

In [13]:
result_dict = {
    'topics': [],
    'alpha': [],
    'beta': [],
    'c_v': [],
    'c_npmi' : []
}

for topic in topics_list:
  for alpha in alpha_list:
    for beta in beta_list:
      tokens, dictionary_LDA, corpus, lda_model = build_lda_model(train_text_list, topic, alpha, beta)
      c_v, c_npmi = get_coherence_metrics(tokens, dictionary_LDA, corpus, lda_model)
      result_dict['topics'].append(topic)
      result_dict['alpha'].append(alpha)
      result_dict['beta'].append(beta)
      result_dict['c_v'].append(c_v)
      result_dict['c_npmi'].append(c_npmi)



In [14]:
df_result = pd.DataFrame(result_dict)
df_result

Unnamed: 0,topics,alpha,beta,c_v,c_npmi
0,30,0.25,0.25,0.369124,-0.247586
1,30,0.25,symmetric,0.357223,-0.261564
2,30,symmetric,0.25,0.361841,-0.245502
3,30,symmetric,symmetric,0.358371,-0.263417
4,30,asymmetric,0.25,0.367131,-0.250243
5,30,asymmetric,symmetric,0.352962,-0.264245
6,35,0.25,0.25,0.342244,-0.262798
7,35,0.25,symmetric,0.385175,-0.265109
8,35,symmetric,0.25,0.35085,-0.264357
9,35,symmetric,symmetric,0.402661,-0.271158


In [15]:
df_result[df_result.c_v == df_result.c_v.max()]

Unnamed: 0,topics,alpha,beta,c_v,c_npmi
9,35,symmetric,symmetric,0.402661,-0.271158


In [16]:
best_tokens, best_dictionary_LDA, best_corpus, best_lda_model = build_lda_model(train_text_list, 
                                                                                df_result[df_result.c_v == df_result.c_v.max()].topics.item(), 
                                                                                df_result[df_result.c_v == df_result.c_v.max()].alpha.item(), 
                                                                                df_result[df_result.c_v == df_result.c_v.max()].beta.item())



In [17]:
best_lda_model.print_topics(num_topics=30)

[(23,
  '0.002*"gon_na" + 0.001*"image" + 0.001*"laughter" + 0.001*"button" + 0.001*"rust" + 0.001*"documentation" + 0.001*"site" + 0.001*"design" + 0.001*"chat" + 0.001*"version"'),
 (18,
  '0.010*"sveltekit" + 0.010*"svelte" + 0.009*"transition" + 0.007*"counter" + 0.005*"scale" + 0.005*"site" + 0.005*"fade" + 0.004*"component" + 0.004*"adapter" + 0.004*"framework"'),
 (19,
  '0.013*"swag" + 0.007*"persona" + 0.006*"design_thinking" + 0.005*"pick" + 0.005*"wonderful" + 0.005*"valujet" + 0.004*"photo" + 0.004*"tweet" + 0.004*"chad" + 0.004*"booth"'),
 (12,
  '0.007*"gon_na" + 0.004*"ably" + 0.003*"realtime" + 0.003*"ably_account" + 0.002*"cycle" + 0.002*"event" + 0.002*"chat" + 0.002*"authentication" + 0.002*"click" + 0.002*"message"'),
 (24,
  '0.010*"lab" + 0.009*"open_source" + 0.008*"application" + 0.007*"kubernetes" + 0.006*"java" + 0.006*"openshift" + 0.006*"laughter" + 0.005*"cluster" + 0.005*"image" + 0.005*"deploy"'),
 (25,
  '0.009*"react" + 0.008*"developer_relations" + 0.0

In [18]:
def test_corpus(list_text):
  df = pd.DataFrame(list_text)
  df.columns = ["documents"]
  df['sentences'] = df.documents.map(sent_tokenize)
  df['tokens_sentences'] = df['sentences'].map(lambda sentences: [word_tokenize(sentence) for sentence in sentences])
  df['POS_tokens'] = df['tokens_sentences'].map(lambda tokens_sentences: [pos_tag(tokens) for tokens in tokens_sentences])
  df['tokens_sentences_lemmatized'] = df['POS_tokens'].map(
      lambda list_tokens_POS: [
          [
              lemmatizer.lemmatize(el[0], get_wordnet_pos(el[1])) 
              if get_wordnet_pos(el[1]) != '' else el[0] for el in tokens_POS
          ] 
          for tokens_POS in list_tokens_POS
      ]
  )

  df['tokens'] = df['tokens_sentences_lemmatized'].map(lambda sentences: list(chain.from_iterable(sentences)))
  df['tokens'] = df['tokens'].map(lambda tokens: [token.lower() for token in tokens if token.isalpha()])
  stop_words = set(stopwords.words('english'))
  df['tokens'] = df['tokens'].map(lambda tokens: [token for token in tokens if not token.lower() in stop_words])

  tokens = df['tokens'].tolist()
  bigram_model = Phrases(tokens)
  trigram_model = Phrases(bigram_model[tokens], min_count=1)
  tokens = list(trigram_model[bigram_model[tokens]])

  dictionary_LDA = corpora.Dictionary(tokens)
  dictionary_LDA.filter_extremes(no_below=0)    
  corpus = [dictionary_LDA.doc2bow(tok) for tok in tokens]
  return corpus

In [19]:
testing_corpus = test_corpus(testing_text_list)



In [20]:
def calculate_test_probabilities(lda_model, corpus):
  result_list = []
  for doc_idx in np.arange(len(corpus)):
    current_doc = corpus[doc_idx]
    result_list.append(lda_model[current_doc])
  return result_list

In [21]:
test_probabilities = calculate_test_probabilities(best_lda_model, testing_corpus)
test_probabilities[1]

[(1, 0.07502441),
 (6, 0.08030472),
 (20, 0.0739318),
 (22, 0.70127267),
 (26, 0.06890067)]

In [22]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
vis = gensimvis.prepare(best_lda_model, best_corpus, best_dictionary_LDA)
vis

  from collections import Iterable
  by='saliency', ascending=False).head(R).drop('saliency', 1)


1.

I spent roughly three hours on the task. 

The first thirty minutes was understanding the pre-existing code and what I have to work with. First off, I noticed there already exists a structure to create the corpus and call the model building from gensim. I double checked what text cleaning techniques already existed and started to think about additional components that could be added on, such as removing stop words so the BOW method is not cluttered with irrelevant information. I looked at the BOW output as EDA to see if there were any keywords that I should be looking out for when we output the topics. Since I only studied LDA in an academic setting, I needed to read up on the documentation of the model function. After getting an understanding of the pieces, the next step is to start thinking about the additional parts of the pipeline.

I brainstormed for the next thirty minutes. I thought about what parts of a pipeline needed to be included in order to generate a model that is tuned, deployable, and ready for production. I needed to train the model, tune the hyperparameters, and evaluate the clustering task before calculating probabilities of the learned topics on the test set. I wanted to implement an objective evaluation metric for the unsupervised task, despite the subjectivity of selection, in the case where there is massive amounts of data being read in. I also thought visualization can help supplement the evaluation.

The coding and debugging took up the next hour. The main task was grid searching the number of topics, prior of the document-topic distribution and the prior of the topic-word distribution. I ranged the number of topics from 10 to 40 and ran the prior beliefs for alpha and beta in [0.01, 0.26, 0.51, 0.76]. I also tried the existing “symmetrical” and “asymmetrical” parameters for alpha and beta. I was optimizing for the c_v coherence. Trying the different parameters took the bulk of the time. Afterwards, I created a function to evaluate the probabilities of the training topics existing in the testing set. Finally, I also found a great interactive LDA visualization package that shows the clusters and term frequencies within topics.

The last chunk of time is writing this and making sure my words don’t sound too dumb.

2.

Transcripts and articles have very different styles. The pros and cons of training a model on either really depends on the goal of the task. Transcripts are based on human speech, where there is less structure and rules. With that comes more noise and variations in speech pattern. A lot of conversation is based around context, meaning it is possible to discuss a subject without explicitly mentioning the topic. There also exists many filler words, such as stutters and extraneous sounds, that could make it through transcription (in the case we use a machine instead of human transcription). A topic model performs sub-optimally for casual conversation in that case.

As for written articles, there are established rules and styles that exist when creating them. The whole goal of writing is to be concise, and therefore, there is a reduction of noise and more indicators of the topics. Topic modeling would be easier under these assumptions.

3.

In the case I had 8 more hours on the task, I would like to ask more questions to get better data understanding. The first pass only included me going through the BOW words, but not an in-depth understanding of the data. I would also like to implement something more baseline to compare to LDA, such as K-means clustering. Towards the end I would like to swap out the different ways the text can be processed with TF-IDF, word2vec, and Spacy. Spacy is close to SOTA and already has methods to tag POS, lemmatize, and detect the language. Having NER would also improve the model, since I saw that some proper nouns were getting lemmatized.

Given 3 days of time, I would start running the experiments from everything I have done up to this point. We can compare the model against other popular ones such as NMF. I can also implement more evaluation metrics such as perplexity. I fixed the LDA with a random state, but later I could experiment to see variance by running on different splits of test sets. I would also like to gridsearch through all the settings of LDA and optimize the model completely. 

Given a week, I would have started iterating towards some of the more advanced techniques. We can use sentence transformers embeddings instead of count vectors as features to capture relation between words/tokens better. We could also try to add on extra data, such as scraping Youtube video transcriptions since I believe that there may not be enough keywords in a typical transcription. At this point, I might also try to start increasing the complexity of the model, such as using BERT for topic modeling instead. We can also try to shift the task towards a more supervised task after giving a true topic tag to a small subset of the data. If we can optimize another model towards precision, we can slowly iterate and increase our labeled data.

4.

LDA is fast and relatively computationally efficient since it is an inference technique and not a deep neural net. There could still be some scaling issues when the data gets too large. It’s good for quick 3 hour take home interviews. The great thing about the model is that it works for any type of text data set and we don’t need to assume anything about the topics. However, the model is hard to evaluate since the topics are clustered. The model also includes gaussian assumptions.  Even though I optimized for coherence, the topics themselves still might not make sense as I reach a higher coherence score. What is learned from LDA can also be affected by the order the data is loaded in, leading to another source of variance. 

Another popular method includes Non-negative Matrix Factorization that is faster than LDA. The method is based on linear algebra dimensionality reduction. However the method is better for shorter text since larger text would lose more information from the dimensionality reduction.

We could also be running a large language model such as BERT or GPT3 as a lot of these models have a lot of evidence of performing better and ability to capture more long range dependencies and relations. The biggest problem would be cost and resources. 

5.

We could fix our machine transcriptions either by normalizing the data or deforming other datasets so the model can account for the mistakes. Normalization would correct the data to match previously seen good data. Popular methods include lemmatizing or stemming, which we already implemented in the simple model. We could mask some of the mistakes and use an MLM to predict the words that should be in the text. We could also apply deformations on the training set in order to train the model with noise. Some ways to deform the data could be masking, scrambling the words to replicate errors, and replacing words by using a dictionary of homonyms. 

Some of my other ideas include using more background after normalization or generating more data. If we have an idea of the topics in the corpus, we can adjust our variables to guide the model towards what we would consider to be the truth. If there are multiple genres of topics, we could model each genre separately to get specific topics within eached genre bins. We can then reduce the randomness of the model as well. As for methods to increase the amount of workable data, we could use human judgment to clean through the documents and label a document as good enough or bad for topic modeling. We then run a binary classification of whether we want to topic model the document or not. Essentially, we have created a filter. A filter could be very useful when we understand the reason why transcriptions are bad. For instance if we scraped Youtube data and  songs, different languages, and ASMR sounds might be transcribed incorrectly and .
