# 1. Topic modeling: a way to navigate through text collections

This week, we have explored a lot of ways to build vector representations for words or for some pieces of text. This lesson is about **topic modeling**. Topic modeling is an alternative way to build **vector representations** for your document collections. So, let us start with a brief introduction to the task. You are given a text collection and you want to build some hidden representation. 
![tm_1.png](pics/tm_1.png)

So, you want to say that there are some topics here, and every document is described by the topics that are discussed in it. Now, what is a topic? Well, you can imagine that you can describe a topic with some words. For example, a topic such as weather is described with sky, rain, sun, and so on, and a topic such as mathematics is described with some mathematical terms, and probably the two sets of words do not even overlap at all. So, you can think about it as soft biclustering.

**Why soft biclustering?**
1. First, it is biclustering because you cluster both words and documents.
2. Second, it is soft because we are going to build probability distributions that softly assign words and documents to topics.

![tm_2.png](pics/tm_2.png)

This is the formal way of saying the same thing. You are given a text collection, that is, you are given the counts of how many times every word occurs in every document. What you need to find is two kinds of probability distributions: first, the probability distribution over words for each topic, and second, the probability distribution over topics for each document. Importantly, this is just the definition of a topic, so you should not think that a topic is something complicated, as it may be in real life or as linguists might define it. For us, for this whole lesson, **a topic is just a probability distribution**. That's it.

![tm_3.png](pics/tm_3.png)

Where do we need this kind of model in real life? Well, actually everywhere, because everywhere you have big collections of documents. It can be news flows, or social media messages, or maybe some data from your domain, for example research papers that you do not want to read, but you want to know that there are some papers about this and that, and that they are connected in this way. **So, you want a nice overview of the area, you want to build it automatically, and topic models can do it for you**.

![tm_4.png](pics/tm_4.png)

Some other applications would be social network analysis or even dialog systems because you can imagine that you would generate some text. You know how to generate text, right, from the previous week, but now you can do this text generation dependent on the topics that you want to mention. 

![tm_5.png](pics/tm_5.png)

So, there are many, many other applications. One example is aggregation of news flows, when you have some news about politics, for example, and you want to say that this topic is becoming popular nowadays. One other important application that I want to mention is exploratory search, which means that you want to say: this is a document that I am interested in, could you please find some similar documents and tell me how they are interconnected?

![tm_6.png](pics/tm_6.png)

Now, let us do some math, so let us discuss probabilistic latent semantic analysis, PLSA. This is a topic model proposed by Thomas Hofmann in 1999. **This is a very basic model that tries to predict words in documents and it does so by a mixture of topics**. 

So, do you understand what happens for the first equation here in this formula? Well, this is a **law of total probability**. So, if you just don't care about documents in the formulas for now, about D, you can notice that this is the law of total probability applied here. Just take a moment to understand this. 

Now what about the second equation here? Well, this one does not hold in general; it is just our assumption. So, just for simplicity, **we assume that the probability of a word given the topic does not depend on the document anymore**. This is a conditional independence assumption. This is all that we need to introduce the PLSA model.
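
For reference, the two equalities discussed above can be written out as follows. This is a sketch in notation that is an assumption here and may differ from the slide: $w$ is a word, $d$ a document, $t$ a topic from the topic set $T$, and $\phi_{wt}$, $\theta_{td}$ are shorthand for the two families of parameters.

$$
p(w \mid d) \;=\; \sum_{t \in T} p(w \mid t, d)\, p(t \mid d) \;=\; \sum_{t \in T} p(w \mid t)\, p(t \mid d)
$$

$$
\phi_{wt} = p(w \mid t), \qquad \theta_{td} = p(t \mid d)
$$

The first equality is the law of total probability (conditioned on $d$ throughout), and the second one replaces $p(w \mid t, d)$ with $p(w \mid t)$, which is exactly the conditional independence assumption.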

![tm_7.png](pics/tm_7.png)

Now I just want to give you some intuition on how that works. So, this is a generative story. This is a story of how the data is generated by our model.
1. You have some probability distribution over topics for the document.
2. First, you decide what the topic for the next word will be.
3. Then, once you have decided on that, you draw a word from the probability distribution for this topic.

So, this model just assumes that the text is generated not by authors, not by handwriting, but by some probabilistic procedure. First we toss a coin and decide what topic will be next, then we toss a coin again and decide what the exact word will be, and we go on like this through the whole text.
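
The generative story above can be sketched in a few lines of Python. This is only an illustration: the topic names, the words, and all the probabilities are made-up toy values, and `sample` and `generate_document` are hypothetical helper names, not part of any library.

```python
import random

# Toy parameters (made-up numbers, for illustration only).
# theta_d: p(topic | document) for a single document.
theta_d = {"weather": 0.7, "mathematics": 0.3}
# phi: p(word | topic) for each topic.
phi = {
    "weather": {"sky": 0.4, "rain": 0.35, "sun": 0.25},
    "mathematics": {"theorem": 0.5, "proof": 0.3, "matrix": 0.2},
}


def sample(dist, rng):
    """Draw one key from a dict that maps outcomes to probabilities."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate_document(theta_d, phi, length, rng):
    """Generate (topic, word) pairs following the PLSA generative story."""
    pairs = []
    for _ in range(length):
        topic = sample(theta_d, rng)    # first "coin toss": the topic of the next word
        word = sample(phi[topic], rng)  # second "coin toss": the word from that topic
        pairs.append((topic, word))
    return pairs


rng = random.Random(0)
document = generate_document(theta_d, phi, length=10, rng=rng)
print(" ".join(word for _, word in document))
```

Note that the word order carries no meaning here: every position is generated independently, which is the bag-of-words view of a document.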

Well, this is just one way to think about it. If you do not feel very comfortable with this way, I will provide for you another way of thinking. 

![tm_8.png](pics/tm_8.png)

So, this is a matrix way of thinking about the same model. You can imagine that you have some data, which is just word-document co-occurrences: you know how many times each word occurs in each document. That's why you can compute probabilities of words in documents: you just normalize those counts, and that's it. Now you need to factorize this matrix of observed probabilities into two matrices of your parameters, $\Phi$ and $\Theta$. The $\Phi$ matrix contains probability distributions over words, and the $\Theta$ matrix contains probability distributions over topics. Actually, every column in each of these matrices is a probability distribution. So, this is just a matrix form of the same formula at the top of the slide: the formula is written for one element, and it holds for every element. So, this is the introduction of the model, and in the next video we will figure out how to train it.
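
A small sketch may make the matrix view more concrete. Everything below is a toy assumption: the counts, the two hand-picked topics, and the helper names `normalize_columns` and `matmul` are invented for the illustration. The point is only that the product of two column-stochastic matrices is again column-stochastic and approximates the normalized counts.

```python
# Toy word-by-document count matrix: counts[w][d] (made-up numbers).
words = ["sky", "rain", "theorem", "proof"]
counts = [
    [4, 0],  # sky
    [6, 1],  # rain
    [0, 5],  # theorem
    [0, 4],  # proof
]


def normalize_columns(matrix):
    """Divide every column by its sum so that each column is a distribution."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    col_sums = [sum(matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [[matrix[r][c] / col_sums[c] for c in range(n_cols)] for r in range(n_rows)]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Observed p(w | d): normalized counts, one column per document.
p_wd = normalize_columns(counts)

# Hand-picked parameters: Phi is words x topics, Theta is topics x documents.
phi = [
    [0.4, 0.0],  # p(sky | t)
    [0.6, 0.0],  # p(rain | t)
    [0.0, 0.5],  # p(theorem | t)
    [0.0, 0.5],  # p(proof | t)
]
theta = [
    [1.0, 0.1],  # p(topic 0 | d)
    [0.0, 0.9],  # p(topic 1 | d)
]

approx = matmul(phi, theta)  # model's p(w | d), one column per document

for word, observed_row, model_row in zip(words, p_wd, approx):
    print(word, [round(x, 2) for x in observed_row], [round(x, 2) for x in model_row])
```

With these numbers the first document is reproduced exactly and the second one only approximately, which is what you would expect from a low-rank factorization.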

# 2. How to train PLSA

Let us understand how to train the PLSA model. Just to recap, this is a topic model that predicts words in documents by a mixture of topics.

![tm_9.png](pics/tm_9.png)

So we have some parameters in this model. We have two kinds of probability distributions: phi parameters stand for probabilities of words in topics, and theta parameters stand for probabilities of topics in documents. Now, you have your probabilistic model of data, and you have your data. How do you train your model? That is, how do you estimate the parameters?

![tm_10.png](pics/tm_10.png)

**Likelihood maximization** is something that always helps us. The top line of this slide is the log-likelihood of our model, and we need to maximize it with respect to our parameters. Now, let us modify this formula a little.

So first, let us apply the logarithm, and we will have the sum of logarithms instead of the logarithm of the product. Then, let us just get rid of the probability of the document, because it does not depend on our parameters; we do not even try to model it. So we just forget about it. What we care about is the probabilities of words in documents.

So we substitute them with the sum over our topics, because this is what our model says. Great, that's it. We want to maximize this likelihood, and we need to remember about **constraints**. Our parameters are probabilities.
1. That's why they need to be non-negative,
2. and they need to be normalized.

Now, you can notice that this term that we need to maximize is not very nice. We have a logarithm of a sum, and this is something ugly: it is not really clear how to maximize it.
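
Written out, the problem described above looks roughly like this. The notation is an assumption and may differ from the slide: $n_{wd}$ is the number of times word $w$ occurs in document $d$, $\phi_{wt} = p(w \mid t)$, and $\theta_{td} = p(t \mid d)$.

$$
\log \prod_{d \in D} \prod_{w \in d} p(d, w)^{n_{wd}}
= \sum_{d \in D} \sum_{w \in d} n_{wd} \log \bigl( p(w \mid d)\, p(d) \bigr)
$$

Dropping the $p(d)$ terms, which do not depend on the parameters, and substituting the model for $p(w \mid d)$ gives

$$
\sum_{d \in D} \sum_{w \in d} n_{wd} \log \sum_{t \in T} \phi_{wt}\, \theta_{td} \;\to\; \max_{\Phi, \Theta}
$$

$$
\text{subject to} \quad \phi_{wt} \ge 0, \quad \sum_{w} \phi_{wt} = 1, \qquad \theta_{td} \ge 0, \quad \sum_{t \in T} \theta_{td} = 1 .
$$

The sum over topics sits inside the logarithm, which is why the maximum cannot be found in closed form.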

But fortunately, we have the EM algorithm. You may have heard about this algorithm in another course of our Specialization. But now, I just want to arrive at this algorithm intuitively. So let us start with some data.

![tm_11.png](pics/tm_11.png)

So we are going to train our model on plain text. This is everything we have. Now, let us remember that we know the generative model: we assume that every word in this text has one topic, which was generated when we tossed a coin to decide what would be next. So let us pretend, just for a moment, just for one slide, that we know these topics. Let us pretend that we know that the words sky, raining, and clear up come from topic number 22, and that's it. So we know these assignments. How would you then calculate the probabilities of words in topics? You know you have four words for this topic, and you want to calculate the probability of sky, let's say. This is how you do it. You just say, "Well, I have one word out of these four words, so the probability will be one divided by four."

![tm_12.png](pics/tm_12.png)

By $n_{wt}$ here, I denote the count of how many times this particular word was connected to this particular topic. So, can you imagine how we would evaluate the probability of topics in this document for this colorful case?



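```python
# The example passage from the slide.
text = "Pooh rubbed his nose again, and said that he hadn't thought of that. And then he brightened again, and said that, if it were raining already, the Heffalump would be looking at the sky wondering if it would clear up, and so he wouldn't see the Very Deep Pit until he was half-way down"
# Total number of whitespace-separated words in the document.
n_words = len(text.split())
# Four words are assigned to the red topic, so p(topic | document) = 4 / n_words.
print(4 / n_words)
```

```
0.07407407407407407
```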


Well, it's just the same. So we know that we have four words about this red topic, and we have 54 words in our document, that's why we have this probability for this example. 
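
The two counting estimates can be sketched together as follows. This is a toy illustration: the labelled words and the topic names `story` and `weather` are made up, and the helper names are hypothetical.

```python
from collections import Counter

# A toy document whose words already carry topic labels (labels are made up).
# Each item is (word, topic).
labelled_words = [
    ("pooh", "story"), ("rubbed", "story"), ("nose", "story"),
    ("raining", "weather"), ("sky", "weather"), ("clear", "weather"), ("up", "weather"),
    ("heffalump", "story"), ("deep", "story"), ("pit", "story"),
]

n_wt = Counter(labelled_words)                       # word-topic counts
n_t = Counter(topic for _, topic in labelled_words)  # number of words per topic
n_d = len(labelled_words)                            # document length


def p_word_given_topic(word, topic):
    """Estimate p(word | topic) as n_wt / n_t."""
    return n_wt[(word, topic)] / n_t[topic]


def p_topic_given_doc(topic):
    """Estimate p(topic | document) as n_t / n_d for this single document."""
    return n_t[topic] / n_d


print(p_word_given_topic("sky", "weather"))  # one word out of four
print(p_topic_given_doc("weather"))          # four words out of ten
```

Both estimates are just normalized counts, which is what makes the known-assignment case so easy.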

Well, unfortunately, life is not like this. We do not know these colorful topic assignments. What we have is just plain text, and that's a problem. But can we somehow estimate those assignments? Can we somehow estimate the probabilities of the colors for every word?

![tm_13.png](pics/tm_13.png)

Yes, we can. Bayes' rule helps us here. We can say that we need the probabilities of topics for each word in each document, and apply Bayes' rule and the product rule. To understand this, I just advise you to forget about D in all these formulas, and then everything will be very clear. So we just apply these two rules, and we get estimates for the probabilities of our hidden variables, the probabilities of topics.
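
For reference, this step can be written out as below. The notation is an assumption that may differ from the slide: $\phi_{wt} = p(w \mid t)$ and $\theta_{td} = p(t \mid d)$, and $s$ is just a second index that runs over topics.

$$
p(t \mid d, w) \;=\; \frac{p(w \mid t, d)\, p(t \mid d)}{p(w \mid d)}
\;=\; \frac{p(w \mid t)\, p(t \mid d)}{\sum_{s \in T} p(w \mid s)\, p(s \mid d)}
\;=\; \frac{\phi_{wt}\, \theta_{td}}{\sum_{s \in T} \phi_{ws}\, \theta_{sd}}
$$

The first equality is Bayes' rule conditioned on $d$, and the second one uses the conditional independence assumption of PLSA together with the law of total probability in the denominator.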

![tm_14.png](pics/tm_14.png)

Now, it's time to put everything together. We have the EM algorithm, which has two steps, the E-step and the M-step.
1. The E-step is about estimating the probabilities of hidden variables, and this is what we have just discussed.
2. The M-step is about those updates for parameters.

So we have discussed it for the simple case when we know the topic assignments exactly. Now, we do not know them exactly, so it is a bit more complicated to compute the $n_{wt}$ counts. This is not just how many times the word is connected with this topic, but it's still doable.

So, we just take the words, we take the **counts of the words**, and we **weight** them with the probabilities that we know from the E-step. And that's how we get **estimates for $n_{wt}$**. So this is not an integer counter anymore. It is a floating-point value, but it still has the same meaning and the same intuition.

![tm_14_a.png](pics/tm_14_a.png)
![tm_14_b.png](pics/tm_14_b.png)

So, the EM algorithm is a super powerful technique, and it can be used any time you have your model, your observable data, and some hidden variables. These are all the formulas that we need for now. You just need to understand that to build your topic model, you repeat the E-step and the M-step iteratively.

So, you scan your data, you compute probabilities of topics using your current parameters, then you update the parameters using your current probabilities of topics, and you repeat this again and again. This iterative process converges, and hopefully you will get your nice topic model trained.
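
The whole loop can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `plsa_em`, the random initialization, the fixed number of iterations, and the toy count matrix are all choices made for this example. Note that here `counts[d][w]` is indexed by document first.

```python
import math
import random


def normalize(values):
    """Scale non-negative values to sum to one (uniform if they are all zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa_em(counts, n_topics, n_iters=30, seed=0):
    """Fit PLSA with EM. counts[d][w] is how often word w occurs in document d.

    Returns phi[t][w] = p(w | t), theta[d][t] = p(t | d), and the log-likelihood
    measured at the start of every iteration.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    phi = [normalize([rng.random() + 0.1 for _ in range(n_words)]) for _ in range(n_topics)]
    theta = [normalize([rng.random() + 0.1 for _ in range(n_topics)]) for _ in range(n_docs)]
    history = []
    for _ in range(n_iters):
        n_wt = [[0.0] * n_words for _ in range(n_topics)]  # expected word-topic counts
        n_td = [[0.0] * n_topics for _ in range(n_docs)]   # expected topic-document counts
        log_likelihood = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: p(t | d, w) is proportional to phi[t][w] * theta[d][t].
                joint = [phi[t][w] * theta[d][t] for t in range(n_topics)]
                p_wd = sum(joint)
                log_likelihood += counts[d][w] * math.log(p_wd)
                for t in range(n_topics):
                    weight = counts[d][w] * joint[t] / p_wd
                    n_wt[t][w] += weight
                    n_td[d][t] += weight
        history.append(log_likelihood)
        # M-step: normalize the expected counts into probabilities.
        phi = [normalize(row) for row in n_wt]
        theta = [normalize(row) for row in n_td]
    return phi, theta, history


# Toy collection: documents 0-1 mostly use words 0-1, documents 2-3 mostly use words 2-3.
counts = [
    [5, 4, 0, 0],
    [4, 6, 1, 0],
    [0, 1, 5, 4],
    [0, 0, 6, 5],
]
phi, theta, history = plsa_em(counts, n_topics=2)
for d, row in enumerate(theta):
    print("document", d, [round(p, 2) for p in row])
```

The E-step and the counting part of the M-step are fused into one pass over the data, so the per-word topic probabilities never need to be stored. On real collections you would also want sparse storage and a stopping criterion based on the log-likelihood.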



# 3. The zoo of topic models

You know the basic topic model which is called **PLSA**, and now you know how to train it. Now, what are some other topic models in this world? What are some other applications that we can solve with the topic modeling? I want to start with a nice application. 

![tm_15.png](pics/tm_15.png)
It is about the diary of Martha Ballard. This is a big diary: she was writing it for 27 years. This is why it is rather complicated for people to read this diary and to analyze it. So, some people decided to apply topic modeling to it and see what topics are revealed in this diary. These are some examples of the topics, and you can see just the most probable words. You remember that you have your Phi matrix, which stands for probabilities of words in topics. These are exactly the words with the highest probabilities. And actually you can see that the topics are rather intuitively interpretable. There is something about gardens, and potatoes, and work in these gardens. There is something about shopping, like sugar, or flour, or something else. So, you can look through these top words, and you can name the topics, and that's nice.
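
Reading off the top words is a one-liner once you have the Phi matrix. The sketch below uses a made-up Phi stored as nested dictionaries; the words and probabilities are toy values chosen for illustration, not the actual diary topics, and `top_words` is a hypothetical helper name.

```python
# Toy Phi matrix stored as {topic: {word: p(word | topic)}} (made-up numbers).
phi = {
    "gardening": {"garden": 0.30, "potatoes": 0.25, "beans": 0.20, "worked": 0.15, "sugar": 0.10},
    "shopping": {"sugar": 0.35, "flour": 0.30, "tea": 0.20, "bought": 0.10, "garden": 0.05},
}


def top_words(phi, k=3):
    """Return the k most probable words for every topic."""
    return {
        topic: [word for word, _ in sorted(dist.items(), key=lambda item: item[1], reverse=True)[:k]]
        for topic, dist in phi.items()
    }


for topic, words in top_words(phi).items():
    print(topic, words)
```

Naming the topics (here `gardening` and `shopping`) is still a manual step: the model only gives you the distributions.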

![tm_16.png](pics/tm_16.png)

What's nicer, you can look into how these topics change over time. For example, the gardening topic is very popular during summer in her diary, and it's not very popular during winter, and it makes perfect sense, right? Another topic, which is about emotions, has high probabilities during those periods of her life when she had some emotional events. For example, one moment of high probability there corresponds to the moment when her husband was put in prison, and somebody else died, and something else happened. So, the historians can say, ''OK, this is interpretable. We understand why this topic has high probability there.''

Now, to feel flexible and to apply your topics in many applications, we need to do a little bit more math. 

![tm_17.png](pics/tm_17.png)

So, first, this is the model called Latent Dirichlet Allocation (LDA), and I guess this is the most popular topic model ever. It was proposed in 2003 by David Blei and coauthors, and practically any paper about topic models now cites this work. But, you know, it is not very different from the PLSA model.

So, everything that it says is, ''OK, we will still have Phi and Theta parameters, but we are going to have Dirichlet priors for them.'' The Dirichlet distribution has a rather ugly form, and you do not need to memorize it; you can always Google it. But the important thing here is that our parameters are not just fixed values, they have some distribution. That's why, as the output of our model, we are also going to have a distribution over parameters. So, not just two matrices of values, but a distribution over them. This is called the posterior distribution, and it will also be Dirichlet, but with some other hyperparameters.
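
To get a feeling for what a Dirichlet prior means, here is a small sketch that draws samples from it. For reference, the density is $\mathrm{Dir}(\theta; \alpha) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \prod_i \theta_i^{\alpha_i - 1}$ on the probability simplex. The code uses the standard construction of normalizing independent Gamma draws; the helper name `sample_dirichlet` and the hyperparameter values are assumptions made for this illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


rng = random.Random(0)

# Small hyperparameters tend to give sparse vectors (a few large entries),
# large hyperparameters tend to give vectors close to uniform.
sparse = sample_dirichlet([0.1] * 5, rng)
smooth = sample_dirichlet([10.0] * 5, rng)

print([round(x, 3) for x in sparse])
print([round(x, 3) for x in smooth])
```

Every sample is itself a probability distribution, which is why the Dirichlet is a natural prior for the columns of Phi and Theta.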

In another course of our Specialization, devoted to Bayesian methods, you can learn about lots of ways to estimate this model and to train it. Here I will just name a few. One way would be Variational Bayes. Another way would be Gibbs sampling.

![tm_18.png](pics/tm_18.png)


All of them involve lots of complicated math, so we are not going into these details right now. Instead, I'm just going to show you the main path for developing new topic models. Usually people use probabilistic graphical models and Bayesian inference to propose new topic models, and they say, ''OK, we will have more parameters, we will have more priors, and they will be connected in this and that way.'' So people draw these nice pictures about what happens in the models. And again, let us not go into the math details, but instead let us look at how these models can be applied.

![tm_19.png](pics/tm_19.png)


Well, one extension of the LDA model would be a hierarchical topic model. You can imagine that you want your topics to build some hierarchy. For example, the topic about speech recognition would be a subtopic of the topic about algorithms. And you see that the root topic has some very general lexis, that is, general vocabulary, and this is actually not surprising. Unfortunately, general lexis is always something that we see with high probabilities, especially for root topics. In some models, you can try to distill your topics and say, well, maybe we should have some separate topics for the stop words, because we don't want to see them in our main topics. So we can also play with that.

![tm_20.png](pics/tm_20.png)

Now, another important extension is dynamic topic models. These are models that say that topics can evolve over time. So, you have some keywords for the topic in one year, and they change for another year. Or you can see how the probability of a topic changes.

![tm_21.png](pics/tm_21.png)

For example, you have some news flow and you know that some topic about bank-related stuff is super popular in this month but not that popular later. OK? 

![tm_22.png](pics/tm_22.png)

One more extension, multilingual topic models. So, topic is something that is not really dependent on the language because mathematics exists everywhere, right? So, we can just express it with different terms in English, in Italian, in Russian, and in any other language. And this model captures this intuition. 

So, we have some topics that are just the same for every language but they are expressed with different terms. You usually train this model on parallel data so you have two Wikipedia articles for the same topic, or let's better say for the same particular concept, and you know that the topics of these articles should be similar, but expressed with different terms, and that's okay. 

So, we have covered some extensions of topic models, and believe me, there are many more in the literature.

![tm_23.png](pics/tm_23.png)

So, one natural question that you might have now is whether there is a way to combine all those requirements in one topic model. There might be different approaches here, and one approach, which we develop here in our NLP Lab, is called Additive Regularization of Topic Models (ARTM). The idea is super simple. We have the likelihood for the PLSA model. Now, let us have some additional regularizers, and let us add them to the likelihood with some coefficients. So, all we need is to formalize our requirements as regularizers, and then tune those tau coefficients to say that, for example, we need a better hierarchy rather than better dynamics in the model.

So, just to provide one example of what those regularizers can look like, imagine that we need different topics in our model, so it would be great to have topics that are as different as possible. To do this, we can try to minimize the pairwise correlations between the topics, which is the same as maximizing them taken with a negative sign. This is exactly what is written down in the bottom formula. You have your pairs of topics, and you try to make them as different as possible.
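
As a sketch, the regularized criterion and a decorrelation regularizer can be written as below. The notation is an assumption and the slide may differ, for instance by a constant factor: $n_{wd}$ is the count of word $w$ in document $d$, $\phi_{wt} = p(w \mid t)$, $\theta_{td} = p(t \mid d)$, $R_i$ are the regularizers, and $\tau_i$ are their coefficients.

$$
\sum_{d \in D} \sum_{w \in d} n_{wd} \log \sum_{t \in T} \phi_{wt}\, \theta_{td}
\;+\; \sum_{i} \tau_i R_i(\Phi, \Theta) \;\to\; \max_{\Phi, \Theta}
$$

$$
R_{\text{decorr}}(\Phi) \;=\; -\frac{1}{2} \sum_{t \in T} \sum_{s \in T \setminus \{t\}} \sum_{w} \phi_{wt}\, \phi_{ws}
$$

The inner sum $\sum_{w} \phi_{wt}\phi_{ws}$ is large when topics $t$ and $s$ put high probability on the same words, so maximizing $R_{\text{decorr}}$ pushes the topics apart.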

![tm_24.png](pics/tm_24.png)

Now, how can you train this model? Well, you can still use the EM algorithm. The E-step stays exactly the same as it was for the PLSA topic model. The M-step changes, but only slightly. The only thing that is new here is shown in green: these are the derivatives of the regularizers with respect to your parameters. So, you need to add these terms here to get the regularized maximum likelihood estimates of the parameters in the M-step. And this is pretty straightforward: you just formalize your criteria, you take the derivatives, and you build this into your model.
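
Here is a sketch of what such a modified M-step could look like for $\Phi$ with a decorrelation regularizer. It assumes the update has the form $\phi_{wt} \propto \max\bigl(0,\; n_{wt} + \phi_{wt} \frac{\partial R}{\partial \phi_{wt}}\bigr)$ and that $R(\Phi) = -\frac{\tau}{2} \sum_{t} \sum_{s \neq t} \sum_{w} \phi_{wt}\phi_{ws}$, so that $\frac{\partial R}{\partial \phi_{wt}} = -\tau \sum_{s \neq t} \phi_{ws}$. The function name and the toy numbers are hypothetical.

```python
def regularized_m_step(n_wt, phi, tau):
    """One M-step update of Phi with a decorrelation regularizer.

    n_wt[t][w] are the expected word-topic counts from the E-step,
    phi[t][w] is the current p(w | t), and tau is the regularization coefficient.
    """
    n_topics, n_words = len(phi), len(phi[0])
    new_phi = []
    for t in range(n_topics):
        row = []
        for w in range(n_words):
            # Probability mass that the other topics put on the same word.
            others = sum(phi[s][w] for s in range(n_topics) if s != t)
            # phi_wt * dR/dphi_wt for the decorrelation regularizer.
            correction = -tau * phi[t][w] * others
            # Negative values are truncated to zero before normalization.
            row.append(max(0.0, n_wt[t][w] + correction))
        total = sum(row)
        # If everything was truncated, the row is left as all zeros.
        new_phi.append([v / total for v in row] if total > 0 else row)
    return new_phi


def overlap(phi):
    """Sum over words of phi[0][w] * phi[1][w], a simple similarity of two topics."""
    return sum(a * b for a, b in zip(phi[0], phi[1]))


# Toy values (made up): two overlapping topics over three words.
phi = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
n_wt = [[5.0, 3.0, 2.0], [2.0, 3.0, 5.0]]

plain = regularized_m_step(n_wt, phi, tau=0.0)        # ordinary PLSA M-step
decorrelated = regularized_m_step(n_wt, phi, tau=10.0)

print(round(overlap(plain), 3), round(overlap(decorrelated), 3))
```

With `tau=0.0` the function reduces to the usual normalization of counts, and with a positive `tau` the words shared by both topics lose probability, so the overlap between the topics goes down.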

![tm_26.png](pics/tm_26.png)

Now, I will just show you one more example for this. So, in many applications we need to model not only words in the texts but some additional modalities. What I mean is some metadata, some users, maybe authors of the papers, time stamps, and categories, and many other things that can go with the documents but that are not just words. 

Can we somehow build them into our model?

![tm_27.png](pics/tm_27.png)

We can actually use absolutely the same intuition. Instead of one likelihood, let us have a weighted sum of likelihoods. So, let us have a likelihood for every modality, and let us weight them with some modality coefficients. Now, what do we have for every modality? Actually, we have different vocabularies. We treat the tokens of the authors modality as a separate vocabulary, so every topic will now be not only a distribution over words but a distribution over authors as well. Or, if we have five modalities, every topic will be represented by five distinct distributions. One cool thing about multimodal topic models is that you can represent any entity in this hidden space of topics. So, this is a way to unify all the information in your model.

For example, you can find what are the most probable topics for words and what are the most probable topics for time stamps, let's say. And then you can compare some time stamps and words and say, ''What are the most similar words for this day?'' And this is an example that does exactly this. 

![tm_28.png](pics/tm_28.png)

So, we had a corpus with time stamps for the documents, and we modeled the topics both for words and for time stamps. We found out that the closest words for the time stamp that corresponds to the Oscars date would be Oscar, Birdman, and some other words that are really related to this date. So, once again, this is a way to embed all your different modalities into one space and to find a way to build similarities between them.
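
The comparison between a time stamp and words can be sketched as a nearest-neighbor search in the topic space. Everything below is an assumption made for illustration: the topic profiles are made-up numbers, the token names are hypothetical, and cosine similarity is just one possible choice of similarity measure.

```python
import math

# Made-up topic profiles p(topic | token) over three topics, for illustration only.
word_topics = {
    "oscar": [0.80, 0.15, 0.05],
    "birdman": [0.75, 0.20, 0.05],
    "bank": [0.10, 0.70, 0.20],
    "election": [0.05, 0.15, 0.80],
}
timestamp_topics = {
    "oscars_day": [0.85, 0.10, 0.05],  # hypothetical profile of one time stamp
}


def cosine(u, v):
    """Cosine similarity between two topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_words(timestamp, k=2):
    """Return the k words whose topic profiles are most similar to the time stamp's."""
    target = timestamp_topics[timestamp]
    ranked = sorted(word_topics, key=lambda w: cosine(word_topics[w], target), reverse=True)
    return ranked[:k]


print(closest_words("oscars_day"))
```

Because words and time stamps live in the same topic space, the same function would work for any pair of modalities, for example authors and words.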

![tm_29.png](pics/tm_29.png)

OK. Now, what would you do if you wanted to build your own topic models? Well, you would probably need some libraries. The BigARTM library is the implementation of the last approach that I mentioned. Gensim and MALLET implement the LDA topic model: Gensim is built for Python, and MALLET is built for Java. Vowpal Wabbit also implements the online LDA topic model, and it is known to be super fast. So, maybe it's also a good idea to check it out.

![tm_30.png](pics/tm_30.png)

Now, finally, just a few words about visualization of topic models. You will never read through a large collection yourself, and it is not so easy to represent the output of your model, those probability distributions, in such a way that people can understand it.

So, this is an example of how to visualize the Phi matrix. We have a words-by-topics matrix here, and you can see that we group together the words that correspond to each topic, so that we can see that this blue topic is about these terms, and the other one is about social networks, and so on.

![tm_31.png](pics/tm_31.png)

But actually, the visualization of topic models is the whole world. So this website contains 380 ways to visualize your topic models. So, I want to end this video and ask you to just explore them maybe for a few moments, and you will get to know that topic models can build very different and colorful representations of your data.