<div style="text-align: center;"> <!-- This div will center all its contents -->
  <img src="https://scontent.fopo6-1.fna.fbcdn.net/v/t39.30808-6/327345211_708012977623591_5371889953719216000_n.png?_nc_cat=104&ccb=1-7&_nc_sid=5f2048&_nc_eui2=AeGA4Epi5DPgQWGmwJnzDzYwlTHqnE4dPp2VMeqcTh0-ndnVzTPGmZ1C7LYJvEsh0wc&_nc_ohc=oHf3AV_aUB0AX_auBWi&_nc_ht=scontent.fopo6-1.fna&oh=00_AfCTA0yaHCQugeMu_44t-6cLSKGa53d67a0DpQQ-fVTGYg&oe=654F295F" width="570" height="250" style="display: block; margin: auto;"/> <!-- This will center the image -->
  <div><strong style="color: #4F5B63;">Master in Data Science for Social Sciences</strong></div>
  <div><strong style="color: #4F5B63;">University of Aveiro</strong></div>
</div>


<div style="display: flex; justify-content: space-around; align-items: flex-start;">
  <div style="width: 100%; padding: 10px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); margin: 10px;">
    <h2><h1 style="text-align: center; font-size: 4em; color: #46627F; margin-top: 0; margin-bottom: 0; line-height: 1;">Topic Modeling using Latent Semantic Analysis</h1>
<h1 style="text-align: center; color: #B1C0CF; margin-top: 0; margin-bottom: 0; line-height: 1;"> -Deduce the hidden topic from the document- </h1></h2>
      </div>
</div>


*Note: Highly recommend going through [this article](https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/) to understand terms like SVD and UMAP. They are leveraged in this article so having a basic understanding of them will help solidify these concepts.*

## Table of Contents

1. What is a Topic Model?

1. When is Topic Modeling used?

1. Overview of Latent Semantic Analysis (LSA)

1. Implementation of LSA in Python

    * Data Reading and Inspection

    * Data Preprocessing

    * Document-Term Matrix

    * Topic Modeling

5. Pros and Cons of LSA

## What is a topic model?

A Topic Model can be defined as an unsupervised technique to discover topics across various text documents. These topics are abstract in nature, i.e., words which are related to each other form a topic. Similarly, there can be multiple topics in an individual document. For the time being, let’s understand a topic model as a black box, as illustrated in the below figure:

![](https://cdn-images-1.medium.com/max/2000/0*o4QTTXD7nHjsYQsK.png)

This black box (topic model) forms clusters of similar and related words which are called topics. These topics have a certain distribution in a document, and every topic is defined by the proportion of different words it contains.

## When is Topic Modeling used?

Recall the example we saw earlier of arranging similar books together. Now suppose you have to perform a similar task with a few digital text documents. You would be able to manually accomplish this, as long as the number of documents is manageable (aka not too many of them). But what happens when there’s an impossible number of these digital text documents?

That’s where NLP techniques come to the fore. And for this particular task, topic modeling is the technique we will turn to.

![*Source: topix.io/tutorial/tutorial.html*](https://cdn-images-1.medium.com/max/2000/0*E_P2Vkt9ZxrgG1gw.png)**Source: topix.io/tutorial/tutorial.html**

Topic modeling helps in exploring large amounts of text data, finding clusters of words, the similarity between documents, and discovering abstract topics. As if these reasons weren’t compelling enough, topic modeling is also used in search engines wherein the search string is matched with the results. Getting interesting, isn’t it? Well, read on then!

## Overview of Latent Semantic Analysis (LSA)

All languages have their own intricacies and nuances which are quite difficult for a machine to capture (sometimes they’re even misunderstood by us humans!). This can include different words that mean the same thing, and also the words which have the same spelling but different meanings.

For example, consider the following two sentences:

1. I liked his last **novel** quite a lot.

1. We would like to go for a **novel** marketing campaign.

In the first sentence, the word ‘novel’ refers to a book, and in the second sentence it means new or fresh.

We can easily distinguish between these words because we are able to understand the context behind these words. However, a machine would not be able to capture this concept as it cannot understand the context in which the words have been used. This is where Latent Semantic Analysis (LSA) comes into play as it attempts to leverage the context around the words to capture the hidden concepts, also known as topics.

So, simply mapping words to documents won’t really help. What we really need is to figure out the hidden concepts or topics behind the words. LSA is one such technique that can find these hidden topics. Let’s now deep dive into the inner workings of LSA.

## Steps involved in the implementation of LSA

Let’s say we have **m **number of text documents with **n** number of total unique terms (words). We wish to extract **k** topics from all the text data in the documents. The number of topics, k, has to be specified by the user.

* Generate a document-term matrix of shape **m x n **having TF-IDF scores**.**

![](https://cdn-images-1.medium.com/max/2000/0*Gq8AgWwkpQ-pvfaD.png)

* Then, we will reduce the dimensions of the above matrix to **k **(no. of desired topics) dimensions, using singular-value decomposition (SVD).

* SVD decomposes a matrix into three other matrices. Suppose we want to decompose a matrix A using SVD. It will be decomposed into matrix U, matrix S, and VT (transpose of matrix V).

![](https://cdn-images-1.medium.com/max/2000/0*ZRAsqV_YN0OOl-iC.png)

![](https://cdn-images-1.medium.com/max/2000/0*RQ7tGRBA6wiMeGFD.png)

Each row in the matrix **Uk (document-term matrix)** is a vector representation of the corresponding document. The length of these vectors is k, which is the number of desired topics. Vector representation for the terms in our data can be found in the matrix **Vk (term-topic matrix)**.

* So, SVD gives us vectors for every document and term in our data. The length of each vector would be **k**. We can then use these vectors to find similar words and similar documents using the cosine similarity method.

## Implementation of LSA in Python

It’s time to power up Python and understand how to implement LSA in a topic modeling problem. Once your Python environment is open, follow the steps I have mentioned below.

### Data reading and inspection

Let’s load the required libraries before proceeding with anything else.

In [1]:
    import numpy as np
    import pandas as pd 
    import matplotlib.pyplot as plt 
    pd.set_option("display.max_colwidth", 200)

In [2]:
# Load the CSV file into a DataFrame
df = pd.read_csv('scopus.csv')

In [3]:
document=df['Title']
document

0      A perspective on computational research support programs in the library: More than 20 years of data from Stanford University Libraries
1                                             Handbook of research on artificial intelligence applications in literary works and social media
2                       Teaching beginner-level computational social science: interactive open education resources with learnr and shiny apps
3                                   Early prediction of student engagement in virtual learning environments using machine learning techniques
4                                            mdx: A Cloud Platform for Supporting Data Science and Cross-Disciplinary Research Collaborations
                                                                        ...                                                                  
352                                                                                                                    One health informatics
353   

### Data Preprocessing

To start with, we will try to clean our text data as much as possible. The idea is to remove the punctuation, numbers, and special characters all in one step using the regex *replace(“[^a-zA-Z#]”, “ ”)*, which will replace everything, except alphabets with space. Then we will remove shorter words because they usually don’t contain useful information. Finally, we will make all the text lowercase to nullify case sensitivity.

In [4]:
    news_df = pd.DataFrame({'document':document}) 

    # remove everything except alphabets` 
    news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z]", " ") 

    # remove short words 
    news_df['clean_doc']=news_df['clean_doc'].apply(lambda x:' '.join([w for w in x.split() if len(w)>3])) 

    # make all text lowercase 
    news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())

It’s good practice to remove the stop-words from the text data as they are mostly clutter and hardly carry any information. Stop-words are terms like ‘it’, ‘they’, ‘am’, ‘been’, ‘about’, ‘because’, ‘while’, etc.

To remove stop-words from the documents, we will have to tokenize the text, i.e., split the string of text into individual tokens or words. We will stitch the tokens back together once we have removed the stop-words.


In [5]:
    from nltk.corpus import stopwords 
    stop_words = stopwords.words('english') 

    # tokenization 
    tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split()) 

    # remove stop-words 
    tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words]) 

    # de-tokenization 
    detokenized_doc = [] 
    for i in range(len(news_df)): 
        t = ' '.join(tokenized_doc[i]) 
        detokenized_doc.append(t) 

    news_df['clean_doc'] = detokenized_doc

## Document-Term Matrix

This is the first step towards topic modeling. We will use sklearn’s *TfidfVectorizer* to create a document-term matrix with 1,000 terms.

In [6]:
    from sklearn.feature_extraction.text import TfidfVectorizer 

    vectorizer = TfidfVectorizer(stop_words='english', max_features= 1000, max_df = 0.5, smooth_idf=True) 

    X = vectorizer.fit_transform(news_df['clean_doc']) 

    X.shape

(357, 1000)

We could have used all the terms to create this matrix but that would need quite a lot of computation time and resources. Hence, we have restricted the number of features to 1,000. If you have the computational power, I suggest trying out all the terms.

### Topic Modeling

The next step is to represent each and every term and document as a vector. We will use the document-term matrix and decompose it into multiple matrices. We will use sklearn’s *TruncatedSVD* to perform the task of matrix decomposition.

Since the data comes from 20 different newsgroups, let’s try to have 20 topics for our text data. The number of topics can be specified by using the *n_components* parameter.

In [7]:
    from sklearn.decomposition import TruncatedSVD 

    # SVD represent documents and terms in vectors 
    svd_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=100, random_state=122) 

    svd_model.fit(X) 

    len(svd_model.components_)

20

The components of *svd_model* are our topics, and we can access them using *svd_model.components_*. Finally, let’s print a few most important words in each of the 20 topics and see how our model has done.


In [8]:
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(svd_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
    print("Topic " + str(i) + ": ", end="")
    for t in sorted_terms:
        print(t[0], end=" ")
    print("\n")


Topic 0: data science social research sciences computational analytics 

Topic 1: data science analytics visualization survey approach management 

Topic 2: learning machine analytics analysis applications based using 

Topic 3: science learning computational machine artificial intelligence opportunities 

Topic 4: using based study twitter case networks approach 

Topic 5: artificial intelligence research media responsible digital using 

Topic 6: analysis networks network artificial intelligence analytics approaches 

Topic 7: research digital analysis education methods applications based 

Topic 8: networks social complex mining learning framework platform 

Topic 9: twitter networks sciences 19 covid education dataset 

Topic 10: networks education computational human analytics digital approach 

Topic 11: twitter 19 covid analytics research dataset computational 

Topic 12: case education applications study academic performance challenges 

Topic 13: challenges opportunities compu

- Topic 0: data science social research sciences computational analytics 
- Topic 1: data science analytics visualization survey approach management 
- Topic 2: learning machine analytics analysis applications based using 
- Topic 3: science learning computational machine artificial intelligence opportunities 
- Topic 4: using based study twitter case networks approach 
- Topic 5: artificial intelligence research media responsible digital using 
- Topic 6: analysis networks network artificial intelligence analytics approaches 
- Topic 7: research digital analysis education methods applications based 
- Topic 8: networks social complex mining learning framework platform 
- Topic 9: twitter networks sciences 19 covid education dataset 
- Topic 10: networks education computational human analytics digital approach 
- Topic 11: twitter 19 covid analytics research dataset computational 
- Topic 12: case education applications study academic performance challenges 
- Topic 13: challenges opportunities computational applications 19 covid based 
- Topic 14: based analytics case internet sciences business interdisciplinary 
- Topic 15: methods media digital communication based making survey 
- Topic 16: cloud using open digital analysis language platform 
- Topic 17: language natural processing network framework workshop international 
- Topic 18: internet patterns challenges things processing predicting human 
- Topic 19: framework education twitter social gender media learning 


## Pros and Cons of LSA

Latent Semantic Analysis can be very useful as we saw above, but it does have its limitations. It’s important to understand both the sides of LSA so you have an idea of when to leverage it and when to try something else.

**Pros:**

* LSA is fast and easy to implement.

* It gives decent results, much better than a plain vector space model.

**Cons:**

* Since it is a linear model, it might not do well on datasets with non-linear dependencies.

* LSA assumes a Gaussian distribution of the terms in the documents, which may not be true for all problems.

* LSA involves SVD, which is computationally intensive and hard to update as new data comes up.


References: https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/
