<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Topic Modeling with Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA)
Topic modeling identifies the primary topic(s) of a single text document or of a set of documents.  Some topic modeling approaches use supervised learning to perform multi-class multi-label classification of documents, but they require a large labeled training dataset, which is not always available.  In this notebook we explore unsupervised topic modeling with Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA).  The "topics" produced by unsupervised topic modeling techniques are actually clusters of words that tend to occur together in the document(s).  The topic model discovers these word clusters from the frequency of the words in each document, then estimates both what the topics are and how much of each topic every document contains.

Topic modeling can be used to extract the main topic(s) from a single document or a collection of documents - in this notebook we will demonstrate topic modeling for a small collection of articles from the web.
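
To make the idea of a per-document "balance of topics" concrete, here is a minimal sketch on a made-up four-sentence corpus. The sentences, the variable names, and the choice of two topics are illustrative assumptions and are not part of the dataset used later in this notebook.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): two sports sentences, two health sentences
toy_docs = [
    "the team won the basketball game",
    "the coach praised the basketball team",
    "the virus causes fever and cough symptoms",
    "vaccine protects against the virus symptoms",
]

# Convert each sentence into a vector of word counts
toy_counts = CountVectorizer(stop_words="english").fit_transform(toy_docs)

# Fit LDA with two topics; fit_transform returns one row per document
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(toy_counts)

# Each row is that document's mixture over the two topics
print(doc_topics.round(2))
# components_ holds one row of word weights per topic
print(toy_lda.components_.shape)
```

Each row of `doc_topics` sums to 1, so it can be read as the share of each topic in that document. With a corpus this small the split between topics is not guaranteed to be clean.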

In [1]:
import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import spacy
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import itertools
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings("ignore")



## Get documents to tag with topics
We will use BeautifulSoup to fetch a few articles from the web and extract the text content from the HTML.  The articles in this example are news articles, each relating to one or both of two primary themes: COVID-19 and Duke basketball.  We would therefore expect the topics we identify to relate to these two themes.

In [19]:
# Get article
article_urls = ['https://www.cbssports.com/college-basketball/news/duke-basketballs-game-vs-clemson-postponed-due-to-positive-covid-19-tests-in-blue-devils-program/',
                'https://www.usatoday.com/story/news/health/2021/12/21/covid-holiday-safety-need-to-know/8968198002/',
                'https://www.fayobserver.com/story/sports/college/basketball/2021/12/29/duke-blue-devils-basketball-recruiting-jon-scheyer-commits/9032663002/',
                'https://www.today.com/health/health/covid-19-cold-flu-tell-difference-rcna10114',
                'https://www.dukechronicle.com/article/2021/06/duke-mens-basketball-head-coach-jon-scheyer-mike-krzyzewski',
                'https://www.hopkinsmedicine.org/health/conditions-and-diseases/coronavirus']
article_text = []
titles = []
for url in article_urls:
    # Fetch the page; the timeout keeps a slow site from hanging the notebook
    page = requests.get(url, timeout=30)
    soup = BeautifulSoup(page.content, 'html.parser')
    # Extract body text from article by joining the text of all <p> tags
    bodytext = ' '.join(p.text for p in soup.find_all('p'))
    article_text.append(bodytext)
    # Extract the article title from the first <h1>; fall back to the URL if none exists
    h1 = soup.find('h1')
    titles.append(h1.text.strip() if h1 else url)



In [2]:
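import pandas as pd
# Load the TripAdvisor hotel reviews and keep the first 10 as our documents.
# Note: this overwrites the article_text list built from the scraped articles above.
data = pd.read_csv("../data/tripadvisor_hotel_reviews.csv")
article_text = list(data.Review)[:10]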

## Create features using Word Counts or TF-IDF
Before we can apply a topic model on our set of documents, we must first convert each document into a numeric feature vector.  We can use count vectorization or TF-IDF to do this.

In [3]:
def vectorize(documents, vectorizer_type='count'):
    # Use both 1-grams and 2-grams
    n_gram_range = (1, 2)

    # Pick raw word counts or TF-IDF weights; both use the same settings.
    # max_df=0.6 drops terms that appear in more than 60% of the documents.
    # Note: stopwords.words('english') needs the NLTK stopwords corpus
    # (nltk.download('stopwords')).
    vectorizer_cls = CountVectorizer if vectorizer_type == 'count' else TfidfVectorizer
    vectorizer = vectorizer_cls(max_df=0.6,ngram_range=n_gram_range,
                                stop_words=stopwords.words('english'))
    # Convert the sparse matrix to a dense list of lists (fine for a small corpus)
    feature_vecs = vectorizer.fit_transform(documents).toarray().tolist()

    return feature_vecs, vectorizer

## Model topics using LDA or LSA
Now that our documents are represented by numeric feature vectors, we can apply our topic model.  When fitting the model, we need to select the number of topics we wish to include in our final list, which will correspond to the top n components extracted by LDA or LSA.

Each topic is represented by a cluster of similar keywords.  When printing our topic we can also select how many keywords we want to include to represent each topic.
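
The keyword selection described above comes down to sorting each topic's word weights and keeping the largest ones. The sketch below shows the idea with NumPy only; the vocabulary and the weight matrix are made-up stand-ins for a fitted model's `components_` attribute.

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (2 topics x 5 words)
toy_vocab = np.array(["duke", "coach", "virus", "flu", "game"])
toy_components = np.array([
    [0.90, 0.70, 0.10, 0.05, 0.60],
    [0.10, 0.05, 0.80, 0.75, 0.02],
])
n_words = 3

top_keywords = []
for weights in toy_components:
    # argsort is ascending, so reverse it and keep the first n_words indices
    top_idx = np.argsort(weights)[::-1][:n_words]
    top_keywords.append(toy_vocab[top_idx].tolist())

for topic_idx, keywords in enumerate(top_keywords):
    print('Topic {} keywords: {}'.format(topic_idx, keywords))
```

Raising `n_words` gives a fuller description of each topic, at the cost of including lower-weighted and often less informative terms.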

In [4]:
def model_topics(vectorized_text,vectorizer,n_topics,n_words,model_type):
    if model_type=='lda':
        # Perform LDA 
        model = LatentDirichletAllocation(n_components=n_topics,learning_method='online', random_state=1)
        topic_assignments = model.fit_transform(vectorized_text)
    else:
        # Use LSA
        model = TruncatedSVD(n_components=n_topics,n_iter=500,random_state=0)
        topic_assignments = model.fit_transform(vectorized_text)

    # Get the main keywords and scores corresponding to each topic
    vocab = vectorizer.get_feature_names_out()
    topics = []
    for comp in model.components_:
        # Indices of the n_words highest-weighted terms, in descending order
        top_idx = np.argsort(comp)[::-1][:n_words]
        # Pair each top keyword with its score; build a list rather than a bare
        # zip iterator so that each topic can be read more than once
        words_scores = [(vocab[idx], comp[idx]) for idx in top_idx]
        topics.append(words_scores)

    return topics, topic_assignments
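
The second return value holds one row of topic weights per document. One simple way to summarize it, shown here as a sketch with made-up weights rather than real model output, is to tag each document with its highest-weighted topic.

```python
import numpy as np

# Hypothetical document-topic weights (3 documents x 2 topics)
toy_assignments = np.array([
    [0.92, 0.08],
    [0.15, 0.85],
    [0.55, 0.45],
])

# Index of the highest-weighted topic in each row
dominant_topic = toy_assignments.argmax(axis=1)

for doc_idx, topic_idx in enumerate(dominant_topic):
    weight = toy_assignments[doc_idx, topic_idx]
    print('Document {}: topic {} (weight {:.2f})'.format(doc_idx, topic_idx, weight))
```

A document such as the third one, with nearly equal weights, is better described as a mix of both topics than by a single label.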

In [8]:
feature_vecs, vectorizer = vectorize(article_text,vectorizer_type='tfidf')
topics,topic_assignments = model_topics(feature_vecs,vectorizer,n_topics=10,n_words=5,model_type='lda')
for i,topic in enumerate(topics):
    print('Topic {} keywords: {}'.format(i,[word[0] for word in topic]))

print('\nArticle assignments:')
for i, text in enumerate(article_text):
    print(text)
    # Only the weights of the first two topics are printed here
    print("Topic 0: {:.3f}, Topic 1: {:.3f}".format(topic_assignments[i][0],topic_assignments[i][1]))
    print()

Topic 0 keywords: ['clean small', 'suite laptops', 'valet parking', 'besite contains', 'hotel worked']
Topic 1 keywords: ['girlfriend', 'taxi', 'deals', 'funny', 'speak']
Topic 2 keywords: ['curtain', 'remodeled', 'bell', 'couch separated', 'double bed']
Topic 3 keywords: ['husband', 'fan', 'husband 175', 'restaurant', 'glass']
Topic 4 keywords: ['sheets', 'soiled sheets', 'feel home', 'feel', 'make feel']
Topic 5 keywords: ['reception', 'smart', '1st', 'wakeup', 'bed']
Topic 6 keywords: ['desk manager', 'lbs husband', 'website', '1st', 'staffnegatives ac']
Topic 7 keywords: ['nice', 'parking', 'music room', 'nice goldfish', 'valet']
Topic 8 keywords: ['website', 'guest', 'email', 'description', 'told']
Topic 9 keywords: ['building', 'street', 'plenty', 'husband spent', 'dining']

Article assignments:
nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous reviews did valet parking, check quick easy, little disappointed non-existent 

## Post-process topics
We may want to post-process the topics that our topic model identifies.  For example, a 1-gram may appear twice in a topic's keyword list: once by itself and again as part of a 2-gram.  We may choose to remove the 1-gram in that case, since the 2-gram is often more descriptive (e.g. "random" and "random forest").  We may also want to make sure that no numeric-only keywords are listed.
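
The two rules can be tried out on a single keyword list. The helper name `filter_keywords` and the sample keywords below are hypothetical and serve only to illustrate the filtering logic.

```python
def filter_keywords(keywords):
    """Drop digit-only keywords and keywords contained in another keyword."""
    kept = []
    for word in keywords:
        # Rule 1: skip keywords made up only of digits
        if word.isdigit():
            continue
        # Rule 2: a keyword always matches itself once; a count above 1
        # means it also appears inside another keyword (e.g. a 2-gram)
        if sum(word in other for other in keywords) == 1:
            kept.append(word)
    return kept


sample_keywords = ["random", "random forest", "2021", "tree", "blue devils", "blue"]
filtered = filter_keywords(sample_keywords)
print(filtered)
```

Because the check is a plain substring match, it also drops a 1-gram that merely occurs inside a longer word (for example "art" inside "party"). Matching on whole tokens would avoid that if it becomes a problem.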

In [14]:
def dedupe_topics(topics,n_words):
    newtopics = []
    for topic in topics:
        newkeywords = []
        keywords = [word[0] for word in topic]
        for word in keywords:
            # Remove words that contain only digits
            if word.isdigit():
                continue
            # Keep a keyword only if it is not contained in any other keyword
            # (it always matches itself once); this removes a 1-gram that is
            # included in a 2-gram
            elif sum([word in x for x in keywords])==1:
                newkeywords.append(word)
        newtopics.append(newkeywords[:n_words])
    return newtopics

In [15]:
feature_vecs, vectorizer = vectorize(article_text,vectorizer_type='tfidf')
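# model_topics returns (topics, topic_assignments); unpack both so that
# dedupe_topics receives only the list of topics
topics, topic_assignments = model_topics(feature_vecs,vectorizer,n_topics=2,n_words=15,model_type='lda')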
deduped_topics = dedupe_topics(topics,n_words=5)
for i,keywords in enumerate(deduped_topics):
    print('Topic {} keywords: {}'.format(i,keywords))


Topic 0 keywords: ['duke', 'scheyer', 'cbs', 'coach', 'blue devils']
Topic 1 keywords: ['symptoms', 'coronavirus', 'flu', 'people', 'torres']
