# Topic Modelling in Feedbacks Domain Using Latent Dirichlet Allocation



Partner: Michael Hewlett, Canada Digital Analytics Team

Mentor: Jungyeul Park, the University of British Columbia (UBC)

Author: Alex Chen, Student at the UBC Master of Data Science Computational Linguistics Program

Version: 2021.06.26




## Overview

This notebook applies unsupervised learning on tens of thousands comments to build a topic model and discover emerging clusters/topics that might deserve new tags of their own. The model also generates candidate new tags for domain experts to use. 

## Set-up

### Declare Constant Variables


In [1]:
 
 id_file_path = 'ids.csv'
csv_file_path = '../../../data/vaccine_full.csv'


### Trim Excessive Printout

In [2]:
from IPython.display import clear_output
import warnings
warnings.filterwarnings('ignore')

### Installs

In [3]:
!pip install pyLDAvis==3.2.2
!pip install gensim

clear_output()

### Imports

In [4]:
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
import spacy
nlp = spacy.load('en')
from spacy.tokenizer import Tokenizer
tokenizer = Tokenizer(nlp.vocab)
import nltk
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import gensim
from gensim.parsing.preprocessing import strip_punctuation
from gensim.parsing.preprocessing import remove_stopwords
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim import models
import pyLDAvis.gensim
clear_output()

## Load Data

In [5]:
df = pd.read_csv(csv_file_path)
cols = ['Unique ID', 'Comment', 'Tags']
df = df[cols]
df.columns = ['id', 'comment', 'tag']
display(df.head(0))
print()
df.shape


Unnamed: 0,id,comment,tag





(29292, 3)

In [7]:
iddf = pd.read_csv(id_file_path)
merged = pd.merge(iddf,df,on='id')

In [8]:
 columns = ['comment', 'tag']
 df = merged[columns]
 

In [9]:
docs = df['comment'].tolist()


## Build a topic model

In [10]:
def show_pyldavis(docs, num_topics):

#docs is a list of strings
#num_topics for the LDA model

  docs = [remove_stopwords(doc.lower()) for doc in docs]

  token_ = [strip_punctuation(' '.join([str(x) for x in nlp(doc)])) for doc in docs]

  token_ = [x.split(" ") for x in token_]

  lmtzr = WordNetLemmatizer()

  for token in token_:
      token = [lmtzr.lemmatize(x) for x in token]
      token = [x for x in token if x not in set(stopwords.words('english'))]

  bigram = Phrases(token_, min_count=5, threshold=2,delimiter=b' ')

  bigram_phraser = Phraser(bigram)

  bigram_token = []
  for sent in token_:
      bigram_token.append(bigram_phraser[sent])

  #now you can make dictionary of bigram token 
  dictionary = gensim.corpora.Dictionary(bigram_token)

  corpus = [dictionary.doc2bow(text) for text in bigram_token]

  lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=1)

  viz = pyLDAvis.gensim.prepare(lda, corpus, dictionary)

  return (viz,lda, dictionary, corpus)
clear_output()

In [11]:
viz,lda, dictionary, corpus = show_pyldavis(docs, 11)
clear_output()

In [12]:
dominant_topics = []
for doc in corpus:
    row = lda.get_document_topics(bow=doc)
    dom_topic = sorted(row, key = lambda x: x[1], reverse=True)[0][0]
    dominant_topics.append(dom_topic)

In [13]:
df['topic'] = dominant_topics

In [14]:
pyLDAvis.save_html(viz, 'topic_model_visualization.html')

clear_output()

pyLDAvis.enable_notebook()
viz

## Next steps

Now that we have machine-generated topic and expert tag pinned against each other:

In [15]:
df.head(0)

Unnamed: 0,comment,tag,topic


The user may use their favorate data analysis tool (e.g. Excel) to find out which topic consists of new information that the domain expert may find interesting.

In [16]:
topic_of_interest = 0


Once the user decides which topic is of interest, they can use the following line to get top 10 terms in the topic. These terms can be used as candidates or parts of the new tag.

In [17]:
[dictionary[tpl[0]] for tpl in lda.get_topic_terms(topic_of_interest)]

['want know',
 'work',
 'need know',
 'vaccine',
 'virus',
 'vaccinated',
 'covid',
 'ingredients',
 'canada',
 'trying']


Reference: 

[vtonmail's notebook on kaggle.com](https://www.kaggle.com/vtonmail/pyldavis-on-employee-reviews)