# Introduction to Topic Modeling 

One of the primary applications of natural language processing is automatically extracting the topics people are discussing from large volumes of text. Examples of such text include social media feeds, customer reviews of hotels and movies, user feedback, news stories, and e-mails of customer complaints.

Knowing what people are talking about and understanding their problems and opinions is highly valuable to businesses, administrators, and political campaigns. Manually reading through such large volumes and compiling the topics is rarely practical.

What is needed is an automated algorithm that can read through the text documents and output the topics discussed.

**Definitions:**

- Topic Modeling is a technique to extract the hidden topics from large volumes of text. 

- Topic modeling is the task of using unsupervised learning to extract the main topics (each represented as a set of words) that occur in a collection of documents.

- Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents.

Topic models, in a nutshell, are a type of statistical language model used for uncovering hidden structure in a collection of texts. More practically and intuitively, you can think of topic modeling as a task of:

1. **Dimensionality Reduction:** where rather than representing a text T in its feature space as {Word_i: count(Word_i, T) for Word_i in Vocabulary}, you can represent it in a topic space as {Topic_i: Weight(Topic_i, T) for Topic_i in Topics}

2. **Unsupervised Learning:** it can be compared to clustering: the number of topics, like the number of clusters, is a parameter you choose up front. With topic modeling, we build clusters of words rather than clusters of texts. A text is thus a mixture of all the topics, each with a specific weight.

3. **Tagging:** assigning to each document the abstract “topics” that best represent the information in it.
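To make the dimensionality-reduction view concrete, here is a minimal sketch that represents one text first in word space and then in a much smaller topic space. The two topics and their keyword sets are hand-written, hypothetical examples, and the weighting (share of tokens that fall in each keyword set) is a crude stand-in for what a trained topic model would compute.

```python
from collections import Counter


def bag_of_words(text):
    """Word-space representation: {word: count(word, text)}."""
    return dict(Counter(text.lower().split()))


def to_topic_space(bow, topic_words):
    """Topic-space representation: {topic: share of tokens in that topic's keyword set}.

    This is only an illustration, not how LDA assigns weights.
    """
    total = sum(bow.values())
    return {
        topic: sum(count for word, count in bow.items() if word in words) / total
        for topic, words in topic_words.items()
    }


# Hypothetical, hand-written topics (a real model would learn these).
topic_words = {
    "food": {"pizza", "pasta", "tasty"},
    "service": {"waiter", "slow", "friendly"},
}

text = "tasty pizza tasty pasta slow waiter"
bow = bag_of_words(text)
weights = to_topic_space(bow, topic_words)

print(bow)      # one entry per distinct word
print(weights)  # one entry per topic
```

The word-space dictionary grows with the vocabulary, while the topic-space dictionary always has one entry per topic.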

There are several existing algorithms you can use to perform topic modeling. The most common are:

1. **Latent Semantic Analysis (LSA/LSI)**, 
2. **Probabilistic Latent Semantic Analysis (pLSA)**, and 
3. **Latent Dirichlet Allocation (LDA)**

Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modeling with excellent implementations in Python’s Gensim package. It is used to assign the text in a document to particular topics. It builds a topics-per-document model and a words-per-topic model, both with Dirichlet priors.

**The challenge, however, is how to extract good-quality topics that are clear, segregated and meaningful.** This depends heavily on the quality of text preprocessing and the strategy for finding the optimal number of topics. This tutorial attempts to tackle both of these problems.


# LDA

Using LDA (Latent Dirichlet Allocation) for topic extraction from a corpus of documents

A recurring task in NLP is to understand a large corpus of texts through topic extraction. Whether you analyze users’ online reviews, product descriptions, or text entered in search bars, understanding key topics will always come in handy.

LDA’s approach to topic modeling is to consider each document as a collection of topics in certain proportions, and each topic as a collection of keywords, again in certain proportions.

Once you provide the algorithm with the number of topics, all it does is rearrange the topic distribution within the documents and the keyword distribution within the topics until it obtains a good composition of both.
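The "rearranging" described above can be sketched with collapsed Gibbs sampling, one possible inference method for LDA. This is a toy illustration under stated assumptions: the function name `gibbs_lda`, its default values, and the tiny corpus are made up for this example, and Gensim's own LDA uses a different inference method (online variational Bayes).

```python
import random


def gibbs_lda(docs, num_topics, alpha=0.1, eta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables updated as words move between topics.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every word position.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Weight each topic by how much the document uses it
                # and how much the topic uses this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + eta)
                    / (topic_total[k] + vocab_size * eta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions per document.
    doc_mix = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return doc_mix, topic_word


docs = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "vet"],
    ["stock", "market", "trade"],
    ["market", "trade", "bank"],
]
doc_mix, topic_word = gibbs_lda(docs, num_topics=2, iterations=50)
for mix in doc_mix:
    print([round(p, 2) for p in mix])
```

Each pass re-assigns every word to a topic, which simultaneously adjusts the topic mix of its document and the keyword mix of the topics.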


## Intuition

LDA (short for Latent Dirichlet Allocation) is an unsupervised machine-learning model that takes documents as input and finds topics as output. The model also says in what percentage each document talks about each topic.

A topic is represented as a weighted list of words. An example of a topic is shown below:

![image](https://user-images.githubusercontent.com/28102493/172018149-386a4ee8-bd83-4175-9f50-4cab6cd1e714.png)

LDA is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topic probabilities.

**LDA is used to assign the text in a document to particular topics. It builds a topics-per-document model and a words-per-topic model, both with Dirichlet priors.**

1. Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.

1. LDA assumes that every chunk of text we feed into it contains words that are somehow related. Choosing the right corpus of data is therefore crucial.

1. It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.
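The assumptions above amount to a recipe for generating a document, which can be sketched in a few lines. The two topics, their word probabilities, and the helper names (`sample_dirichlet`, `generate_document`) are hypothetical, chosen only to illustrate the idea; a real model learns the topics from data.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Pick a topic mixture for the document, then a topic and a word per position."""
    mixture = sample_dirichlet(alpha, len(topics), rng)
    words = []
    for _ in range(length):
        # Choose which topic this word comes from.
        topic = rng.choices(topics, weights=mixture)[0]
        # Choose the word according to that topic's word distribution.
        vocab, probs = zip(*topic.items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return mixture, words


# Hypothetical topics: each one is a probability distribution over words.
topics = [
    {"goal": 0.5, "match": 0.3, "team": 0.2},
    {"vote": 0.5, "party": 0.3, "law": 0.2},
]

rng = random.Random(42)
mixture, words = generate_document(topics, alpha=0.5, length=8, rng=rng)
print([round(p, 2) for p in mixture])
print(words)
```

Training LDA runs this story in reverse: given only the words, it infers topics and mixtures that could plausibly have produced them.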

**When I say topic, what is it actually, and how is it represented?**


A topic is nothing but a collection of dominant keywords that are typical representatives. Just by looking at the keywords, you can identify what the topic is all about.

The following are key factors in obtaining well-segregated topics:

1. **The quality of text processing.**
2. **The variety of topics the text talks about.**
3. **The choice of topic modeling algorithm.**
4. **The number of topics fed to the algorithm.**
5. **The algorithm’s tuning parameters.**

## Parameters

There are 3 main parameters of the model:

1. the number of topics

1. the number of words per topic: the [distribution of the] number of words per topic is handled by **eta** (often called beta). Eta is the Dirichlet prior concentration parameter that represents topic-word density. With a high eta, topics are assumed to be made up of most of the words in the vocabulary, which gives a less specific word distribution per topic. With a low eta, each topic concentrates on a few words.

1. the number of topics per document: the [distribution of the] number of topics per document is handled by **alpha**. Alpha is the Dirichlet prior concentration parameter that represents document-topic density. With a higher alpha, documents are assumed to be made up of more topics, which gives a less specific topic distribution per document. With a low alpha, each document concentrates on a few topics.


In reality, the last two parameters are not exactly designed like this in the algorithm, but I prefer to stick to these simplified versions which are easier to understand.
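To get a feel for what a concentration parameter does, the sketch below samples many document-topic mixtures from a symmetric Dirichlet and measures how dominant the largest topic is on average. The values 0.1 and 10.0, the five topics, and the helper names are arbitrary choices for illustration; the same reasoning applies to eta and topic-word distributions.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def average_dominant_share(concentration, num_topics=5, trials=2000, seed=0):
    """Average weight of the largest topic across many sampled documents."""
    rng = random.Random(seed)
    shares = (
        max(sample_dirichlet(concentration, num_topics, rng))
        for _ in range(trials)
    )
    return sum(shares) / trials


low = average_dominant_share(0.1)    # low alpha: a few topics per document
high = average_dominant_share(10.0)  # high alpha: many topics per document

print(f"alpha=0.1  -> dominant topic share ~ {low:.2f}")
print(f"alpha=10.0 -> dominant topic share ~ {high:.2f}")
```

With a low concentration, one topic tends to take most of the weight; with a high concentration, the weight is spread almost evenly across topics.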

<img width="942" alt="image" src="https://user-images.githubusercontent.com/28102493/172224896-42f40185-c010-4dcc-851c-535bbb8382f8.png">


We can describe the generative process of LDA as follows: given M documents, N words, and a prior number of topics K, the model trains to output:

- psi, **the distribution of words for each topic K**

- phi, **the distribution of topics for each document i**
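In symbols, the generative process can be written as below. This is the standard LDA formulation restated with the names used here (psi for topic-word distributions, phi for document-topic distributions); many references use other letters for these. The indices i, j, k and the latent assignment z are notation introduced for this sketch.

```latex
\begin{aligned}
\psi_k &\sim \operatorname{Dirichlet}(\eta),            & k &= 1, \dots, K \\
\phi_i &\sim \operatorname{Dirichlet}(\alpha),          & i &= 1, \dots, M \\
z_{ij} &\sim \operatorname{Categorical}(\phi_i),        & j &= 1, \dots, N \\
w_{ij} &\sim \operatorname{Categorical}(\psi_{z_{ij}})  &   &
\end{aligned}
```

Here z_ij is the topic assigned to the j-th word of document i, and w_ij is the word that is actually observed.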


## How to successfully implement LDA

LDA is a complex algorithm which is generally perceived as hard to fine-tune and interpret. Indeed, getting relevant results with LDA requires a strong knowledge of how it works.

### Data cleaning
A common thing you will encounter with LDA is that words appear in multiple topics. One way to cope with this is to add these words to your stopwords list.

**Another issue is plural and singular forms. I would recommend lemmatizing, or stemming if you cannot lemmatize, although stems in your topics are harder to read.**

Removing words with digits in them will also clean the words in your topics. Keeping years (2006, 1981) can be relevant if you believe they are meaningful in your topics.

Keeping only words that appear in at least 3 (or more) documents is a good way to remove rare words that will not be relevant in topics.
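The digit and rare-word filters described above could look like the following sketch on already tokenized documents. The function name `clean_corpus`, its parameters, and the rule that a "year" is a four-digit token starting with 19 or 20 are assumptions made for this example.

```python
import re
from collections import Counter


def clean_corpus(docs, min_docs=3, keep_years=True):
    """Drop tokens containing digits (optionally keeping years) and rare tokens."""
    year = re.compile(r"^(19|20)\d{2}$")  # assumed definition of a year

    def keep(token):
        if any(ch.isdigit() for ch in token):
            return keep_years and bool(year.match(token))
        return True

    filtered = [[t for t in doc if keep(t)] for doc in docs]

    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(t for doc in filtered for t in set(doc))
    return [[t for t in doc if doc_freq[t] >= min_docs] for doc in filtered]


docs = [
    ["hotel", "room", "clean", "2006"],
    ["hotel", "room", "b2b"],
    ["hotel", "room", "noisy", "2006"],
    ["hotel", "pool", "2006"],
]
cleaned = clean_corpus(docs)
print(cleaned)
# [['hotel', 'room', '2006'], ['hotel', 'room'], ['hotel', 'room', '2006'], ['hotel', '2006']]
```

Set `keep_years=False` if years carry no meaning in your corpus.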

### Data preparation

- **Include bi- and tri-grams to grasp more relevant information.**

- Another classic preparation step is to use only nouns and verbs using **POS tagging** (POS: Part-Of-Speech).
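As a rough illustration of the bi-gram step, the sketch below joins adjacent tokens that co-occur often enough into a single token. The raw-count threshold and the function name `merge_frequent_bigrams` are simplifications chosen for this example; phrase-detection tools such as Gensim's `Phrases` model use a statistical score rather than a plain count.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times in the corpus."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            is_frequent_pair = (
                i + 1 < len(doc)
                and pair_counts[(doc[i], doc[i + 1])] >= min_count
            )
            if is_frequent_pair:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # both tokens are consumed by the bi-gram
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


docs = [
    ["new", "york", "hotel"],
    ["visit", "new", "york"],
    ["new", "hotel"],
]
bigram_docs = merge_frequent_bigrams(docs)
print(bigram_docs)
# [['new_york', 'hotel'], ['visit', 'new_york'], ['new', 'hotel']]
```

Running the same function again on its own output would produce tri-grams from frequent bi-gram and word pairs.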

### Fine-tuning

1. Number of topics: try out several numbers of topics to understand which number makes sense. You actually need to see the topics to know if your model makes sense or not. As with K-Means, LDA converges **and the model makes sense at a mathematical level, but it does not mean it makes sense at a human level.**

2. Cleaning your data: adding stop words that are too frequent in your topics and re-running your model is a common step. Keeping only nouns and verbs, removing templates from texts, testing different cleaning methods iteratively will improve your topics. Be prepared to spend some time here.

3. **Alpha, Eta.** If you’re not into technical stuff, forget about these. Otherwise, you can tweak alpha and eta to adjust your topics. Start with ‘auto’, and if the topics are not relevant, try other values. I recommend using low values of Alpha and Eta to have a small number of topics in each document and a small number of relevant words in each topic.

4. Increase the **number of passes** to have a better model. **3 or 4** is a good number, but you can go higher.


### Assessing results

- Are your topics interpretable?
- Are your topics unique? (two different topics have different words)
- Are your topics exhaustive? (are all your documents well represented by these topics?)

If your model follows these 3 criteria, it looks like a good model :)
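The uniqueness criterion can be checked roughly by measuring how much the top words of each pair of topics overlap. The sketch below uses Jaccard overlap, which is one possible choice and not a standard LDA metric; the topics are hypothetical.

```python
from itertools import combinations


def topic_overlap(topics):
    """Jaccard overlap between the top-word sets of every pair of topics."""
    return {
        (a, b): len(topics[a] & topics[b]) / len(topics[a] | topics[b])
        for a, b in combinations(sorted(topics), 2)
    }


# Hypothetical top words per topic, e.g. read off a trained model.
topics = {
    "sports": {"team", "game", "season", "coach"},
    "politics": {"vote", "party", "election", "season"},
    "finance": {"market", "stock", "bank", "trade"},
}

overlaps = topic_overlap(topics)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

Pairs with a high overlap are candidates for merging, or a sign that shared words should go on the stopword list.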


Once you have a trained model, you can visualize the topics for interpretability. A popular visualization package for this is pyLDAvis, which is designed to help interactively with:

1. Better understanding and interpreting individual topics, and

2. Better understanding the relationships between the topics.

For (1), you can manually select each topic to view its most frequent and/or “relevant” terms, using different values of the λ parameter. This can help when you’re trying to assign a human-interpretable name or “meaning” to each topic.

For (2), exploring the **Intertopic Distance Plot** can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.
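The λ parameter mentioned for (1) weights two ways of ranking terms: by their probability within the topic, and by their lift (probability within the topic divided by probability in the whole corpus). The sketch below follows the relevance definition from the LDAvis method that pyLDAvis is based on; the probabilities are made-up numbers for illustration.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(
        p_word_given_topic / p_word
    )


def rank_terms(topic_probs, corpus_probs, lam):
    """Terms of one topic sorted from most to least relevant."""
    return sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
        reverse=True,
    )


# Hypothetical probabilities for three terms.
topic_probs = {"the": 0.10, "goal": 0.05, "offside": 0.02}    # p(w | topic)
corpus_probs = {"the": 0.08, "goal": 0.01, "offside": 0.002}  # p(w)

by_frequency = rank_terms(topic_probs, corpus_probs, lam=1.0)
by_lift = rank_terms(topic_probs, corpus_probs, lam=0.0)
print(by_frequency)  # ['the', 'goal', 'offside']
print(by_lift)       # ['offside', 'goal', 'the']
```

With λ = 1 common words dominate the ranking; lowering λ pushes up terms that are rare overall but specific to the topic.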



## Main advantages of LDA

1. **It’s fast:** The model is usually fast to run. Of course, it depends on your data. Several factors can slow down the model:
    - Long documents
    - Large number of documents
    - Large vocabulary size (especially if you use n-grams with a large n)

2. **It’s intuitive:** Modeling topics as weighted lists of words is a simple approximation yet a very intuitive approach if you need to interpret it. No embeddings or hidden dimensions, just bags of words with weights.

3. **It can predict topics for new unseen documents:** Once the model has run, it is ready to allocate topics to any document. Of course, if your training dataset is in English and you want to predict the topics of a Chinese document it won’t work. But if the new documents have the same structure and should have more or less the same topics, it will work.


## Main disadvantages of LDA

1. **Lots of fine-tuning:** Although LDA is fast to run, getting good results with it takes some effort. That’s why knowing in advance how to fine-tune it will really help you.

2. **It needs human interpretation:** Topics are found by a machine. A human needs to label them in order to present the results to non-experts.

3. **You cannot influence topics:** Knowing that some of your documents talk about a topic you know, and not finding it in the topics found by LDA, will definitely be frustrating. And there’s no way to tell the model that some words should belong together. You have to sit and wait for LDA to give you what you want.



# Exploration - pyLDAvis


The pyLDAvis package offers a nice way to visualize the LDA model you built.

This visualization allows you to compare topics on two reduced dimensions and observe the distribution of words in topics.



# Implementation Methods

## Topic Modeling

1. Gensim package
2. Gensim Mallet package

## Text Cleaning

1. nltk
2. spacy
3. gensim