# 1. State tracking in DM

In this chapter, we will talk about state tracking in the dialog manager.

![st_1.png](pics/st_1.png)

Let me remind you that dialog managers are responsible for two tasks. 

The first one is state tracking, and it requires some hand-crafted states. The state tracker can query an external database or knowledge base for additional information. It tracks the evolving state of the dialog and constructs a state estimate after every utterance from the user.

The other part is the policy learner, which takes the state estimate as input and chooses the next best action for the dialog system, the agent.

![st_2.png](pics/st_2.png)

You can think of a dialog as follows. We have dialog turns: the system and the user take turns providing input, and when we get input from the user, we get some observations. When we hear something from the user, we update the state of the dialog, and the dialog manager is responsible for tracking that state. With every new utterance, the user can specify more details or change their intent, and all of that affects our state. You can think of the state as something describing what the user ultimately wants.

Then, when we have the state, we have to react, so we need to learn a policy, and that is a part of the dialog manager as well. A policy is essentially a rule: what do we need to say when we have a certain state?

So next, we will overview the state tracking part of the dialog manager, the one with the red border. For that, we need to introduce the DSTC 2 dataset.

![st_3.png](pics/st_3.png)

DSTC 2 stands for Dialog State Tracking Challenge 2. It was collected in 2013. It consists of human-computer dialogs about finding a restaurant in Cambridge. It contains 3,000 telephone-based dialogs, and people were recruited for this using Amazon Mechanical Turk. So this collection didn't require experts in the field; these are regular users of the system. They used several dialog systems, with a Markov decision process or a partially observable Markov decision process for tracking the dialog state, and a hand-crafted policy or a policy learned with reinforcement learning.

So, that is the computer side of the dialog collection.

The labeling procedure then followed these principles.

1. First, the utterances from the users, which were audio, were transcribed, also using Amazon Mechanical Turk.
2. Then, these transcriptions were annotated by heuristics, such as regular expressions.
3. Finally, the annotations were checked by experts and corrected by hand.

That's how this dataset came into being.

So, how is the dialog state defined in this dataset? The dialog state consists of three things.

1. The first is goals: a distribution over the values of each informable slot in our task. A slot is informable if the user can give it in the utterance as a constraint for our search.
2. The second part of the state is the method: a distribution over the methods, namely by name, by constraints, by alternatives, or finished. These are the methods that we need to track.
3. The user can also request some slots from the system, and this is a part of the dialog state as well. The requested slots are the ones the user needs: for each requestable slot, we keep the probability that it has been requested by the user and that the system should inform it.

So, the dataset was marked up in terms of user dialog acts and slots. An utterance like "What part of town is it?" becomes request(area). The request is the act that the user makes, and you can think of it as an intent, while the area slot tells us that the user wants to get the area. Then, we can infer the method from the act and the goals. If we have inform(food=chinese), then it is clear that we need to search by constraints.
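To make the three parts of the state more concrete, here is a minimal sketch of how they could be stored, together with a rule that guesses the method from a user act. The class name `DialogState`, the function `infer_method`, and the act names `reqalts` and `bye` are assumptions made for this illustration; the real annotation scheme may differ.

```python
from dataclasses import dataclass, field


@dataclass
class DialogState:
    # Informable slot -> {value: probability}
    goals: dict = field(default_factory=dict)
    # Method name -> probability
    method: dict = field(default_factory=dict)
    # Requestable slot -> probability that the user has requested it
    requested: dict = field(default_factory=dict)


def infer_method(act, slots):
    """Hypothetical rule of thumb: guess the method from a user act and its slots."""
    if act == "inform" and "name" in slots:
        return "by name"
    if act == "inform" and slots:
        return "by constraints"
    if act == "reqalts":  # assumed act name for "show me alternatives"
        return "by alternatives"
    if act == "bye":  # assumed act name for ending the dialog
        return "finished"
    return None  # the act tells us nothing about the method


state = DialogState(
    goals={"food": {"chinese": 0.9, "thai": 0.1}},
    method={infer_method("inform", {"food": "chinese"}): 1.0},
    requested={"area": 0.0},
)
print(state.method)  # {'by constraints': 1.0}
```

A real tracker would output soft probabilities for the method as well; the hard 1.0 here only keeps the example short.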

![st_4.png](pics/st_4.png)

Let's look at the dialog example. The user says, "I'm looking for an expensive restaurant with Venetian food." From this, we need to understand that our state becomes food=Venetian, price range=expensive, the method is by constraints, and no slots were requested.

Then, as the dialog progresses, the user says, "Is there one with Thai food?" We need to change our state: all the rest is the same, but food is now Thai.

And then, when the user is okay with the options that we have provided, they ask, "Can I have the address?" That means that the state of our dialog is the same, but this time the requested slot is the address.

So, these three components, goals, method, and requested slots, are the context that we need to track after every utterance from the user.

![st_5.png](pics/st_5.png)

So, let's look at the results of the competition that was held after this dataset was collected. For goals, the best solution got 65 percent of the combinations correct, which means that every slot and every value was guessed correctly in 65 percent of cases. The method has 97 percent accuracy, and requested slots have 95 percent accuracy. So, it looks like slot tagging is still the most difficult part.
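The goal metric is strict: a turn counts as correct only when every slot and every value matches. A minimal sketch of that computation, assuming predicted and gold goals are given as one dictionary per turn (the function name and the data are made up for illustration):

```python
def joint_goal_accuracy(predicted, gold):
    """Fraction of turns where the whole set of slot-value pairs is correct."""
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


gold = [
    {"food": "venetian", "price range": "expensive"},
    {"food": "thai", "price range": "expensive"},
]
predicted = [
    {"food": "venetian", "price range": "expensive"},
    {"food": "thai", "price range": "cheap"},  # one wrong value spoils the turn
]
print(joint_goal_accuracy(predicted, gold))  # 0.5
```

This is why the goal score is much lower than the method and requested-slot scores: a single wrong value makes the whole turn count as an error.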

How can you do that state tracking? 

![st_6.png](pics/st_6.png)

When you looked at that example, it was pretty clear how to change the state of our dialog after each utterance. So maybe, if you train a good NLU, which gives you intents and slots, you can come up with some hand-crafted rules for changing the dialog state. If the user mentions a new slot, you just add it to the state; a mention can also override an existing slot or start filling a new form. You can come up with such rules to track the state, but you can do better with neural networks.
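Such hand-crafted rules can be very short. The sketch below replays the three user turns from the restaurant example, assuming the NLU already returns an act and its slots; the dictionary layout and the function name `update_state` are assumptions for this illustration, not part of the dataset.

```python
def update_state(state, nlu):
    """Return a new dialog state after applying one NLU result."""
    new_state = {
        "goals": dict(state["goals"]),
        "method": state["method"],
        "requested": [],  # requests only apply to the current turn
    }
    if nlu["act"] == "inform":
        # A new slot is added; an already filled slot is overridden.
        new_state["goals"].update(nlu["slots"])
        new_state["method"] = "by constraints"
    elif nlu["act"] == "request":
        new_state["requested"] = list(nlu["slots"])
    return new_state


state = {"goals": {}, "method": None, "requested": []}
turns = [
    {"act": "inform", "slots": {"food": "venetian", "price range": "expensive"}},
    {"act": "inform", "slots": {"food": "thai"}},
    {"act": "request", "slots": ["address"]},
]
for nlu in turns:
    state = update_state(state, nlu)
    print(state)
```

After the last turn, the goals are still Thai and expensive, the method is still by constraints, and only the requested slots have changed.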

![st_7.png](pics/st_7.png)

This is an example of an architecture that does the following. It uses the previous system output, for example, "Would you like some Indian food?" Then, it takes the current utterance from the user, like, "No, how about Farsi food?" We need to parse that system output and the user utterance and come up with the current state of our dialog.

And this is done in the following way. 

1. First, we embed the context, that is, the system output from the previous turn. Then we embed the user utterance, and we also embed the candidate slot-value pairs, like food-Indian, food-Persian, or any other.
2. Then, we have a context modelling network that takes the information about the system output and the candidate pairs, applies some type of gating, and uses the user utterance to decide whether this utterance affects the context or not. There is also a second part, which does semantic decoding: it takes the user utterance and the candidate slot-value pairs and decides whether they match or not.
3. Finally, we make a binary decision on whether each candidate pair matches the user utterance, given the previous system output.

So in this way, we actually solve NLU and dialog state tracking simultaneously in a joint model. 

So, this is pretty cool. 

![st_8.png](pics/st_8.png)

Let's see, for example, how one part of that model can work, and let's look at the utterance representation. We take our utterance, split it into tokens, and take Word2Vec embeddings, or any other embeddings you like. Then, we apply the 1D convolutions that we investigated in week one. You can use bigram windows, trigram windows, and so forth. Then, you just sum up the resulting vectors, and that's how we get the representation for the utterance. That is a small part of our architecture, and we don't have time to overview all of the other parts.
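Here is a rough pure-Python sketch of that utterance representation. It assumes one filter bank per window width, a ReLU followed by max-pooling over positions (the pooling step is an assumption, it is not spelled out above), and a final sum over the window widths. The toy embeddings and filters are made up so that the result is easy to check by hand.

```python
def conv1d_max(embeddings, filters):
    """Slide a filter bank over n-gram windows, apply ReLU, max-pool over positions.

    embeddings: list of token vectors, each of length d_in
    filters: list of length n (the window width); filters[k][i][j] is the weight
             from input dimension i of the k-th token in the window to output j
    """
    n = len(filters)
    d_out = len(filters[0][0])
    pooled = [0.0] * d_out  # ReLU outputs are >= 0, so 0 is a safe initial maximum
    for start in range(len(embeddings) - n + 1):
        window = embeddings[start:start + n]
        for j in range(d_out):
            total = sum(
                window[k][i] * filters[k][i][j]
                for k in range(n)
                for i in range(len(window[k]))
            )
            pooled[j] = max(pooled[j], max(total, 0.0))
    return pooled


def utterance_representation(embeddings, filter_banks):
    """Sum the pooled outputs of the unigram, bigram, ... filter banks."""
    d_out = len(filter_banks[0][0][0])
    result = [0.0] * d_out
    for filters in filter_banks:
        pooled = conv1d_max(embeddings, filters)
        result = [r + p for r, p in zip(result, pooled)]
    return result


# Two tokens with 2-dimensional toy embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0]]
unigram_bank = [[[1.0, 0.0], [0.0, 1.0]]]
bigram_bank = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
print(utterance_representation(tokens, [unigram_bank, bigram_bank]))  # [2.0, 2.0]
```

In a trained model the filters are learned and the embeddings come from Word2Vec or a similar model; only the shape of the computation is meant to carry over.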

Let's go to the results. 

![st_9.png](pics/st_9.png)

If we look at how good that network is, you can see that the neural belief tracker architecture with convolutional neural networks gets 73 percent accuracy for goals, which is a pretty big improvement. It also improves request accuracy on our dialog state tracking challenge dataset.

We can see that when we solve the tasks of NLU and dialog state tracking simultaneously, we can get better results.

![st_10.png](pics/st_10.png)

Another dataset worth mentioning is the Frames dataset. It is a pretty recent dataset, collected in 2016. It is a human-human goal-oriented dataset, all about booking flights and hotels. It involved 12 participants over 20 days, and they collected 1,400 dialogs. The dialogs were collected in human-human interaction, which means that two humans talked to each other via a Slack chat. One of them was the user, who was given a task by the system: find a vacation between certain dates, with a given destination and a given city of departure; the dates are not flexible, and if nothing is available, end the conversation. The other participant was the wizard, who had access to a searchable database of packages with a hotel and round-trip flights. The wizard's task was to help, via the chat interface, the user who was searching for something.

![st_11.png](pics/st_11.png)

So, this dataset introduces a new task called frame tracking, which extends state tracking to a setting where several states are tracked simultaneously, and users can go back and forth between them and compare results. For example, I want to compare a flight from Atlanta to Caprica with, let's say, one from Chicago to New York. I investigate these two options, they are different frames, and I can compare them. So, this is a pretty difficult task.

![st_12.png](pics/st_12.png)

How is it annotated? It is annotated with dialog acts, slot types, slot values, and one more thing: references to other frames for each utterance. We also know the currently active frame for each utterance. Let's see how it might work. The user says, "2.5 stars will do." What the user does here is inform the system that a category equal to 2.5 is okay. Then, the system might make an offer to the user, like offering, in frame six, business class for the price of 1,000 dollars, and it will be converted into the following utterance from the system: "What about a 1,000 dollar business class ticket to San Francisco?" We know that it is to San Francisco because we have the ID of the frame, so we have all the information for that frame.
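The frame reference can be pictured as a lookup by frame ID. In the sketch below, the slot names (`dst_city`, `seat`, `price`), the contents of the frames, and the template are all invented for illustration; only the idea that an act points to a frame ID comes from the annotation scheme described above.

```python
# Frame id -> slot values collected for that frame (all values are made up).
frames = {
    5: {"dst_city": "New York"},
    6: {"dst_city": "San Francisco"},
}

# A system act that refers to frame 6 instead of repeating its contents.
offer = {"act": "offer", "ref": 6, "slots": {"seat": "business", "price": "1,000"}}


def realize_offer(act, frames):
    """Fill a text template using the act's own slots plus the referenced frame."""
    values = {**frames[act["ref"]], **act["slots"]}
    template = "What about a {price} dollar {seat} class ticket to {dst_city}?"
    return template.format(**values)


print(realize_offer(offer, frames))
# What about a 1,000 dollar business class ticket to San Francisco?
```

The destination never appears in the act itself; it is recovered from the frame that the act refers to.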

![st_13.png](pics/st_13.png)

Let's summarize. We have overviewed the state tracker of a dialog manager. We have discussed the datasets for dialog manager training: the Dialog State Tracking Challenge and the Frames dataset. State tracking can be done by hand if you have a good NLU, or you can do better with neural network approaches, like a joint NLU and dialog manager.

# 2. Policy optimisation in DM 

In this chapter, we will talk about Policy Learner in Dialogue Manager. 

![po_1.png](pics/po_1.png)

Okay, let me remind you what policy learning is. We have a dialogue that progresses with time, and after every turn, after every observation from the user, we somehow update our state of the dialogue, and the state tracker is responsible for that. Then, once we have a certain state, we have to take some action. We need to figure out the policy that tells us: if you have a certain state, then this is the action that you must take, and this is what we then tell the user.

So let's look at what dialog policy actually is. 

![po_2.png](pics/po_2.png)

It is actually a mapping from dialog state to agent act. Imagine that we have a conversation with the user. We collect some information from him or her, and we have that internal state that tells us what the user essentially wants, and we need to take some action to continue the dialog. And we need that mapping from dialog state to agent act, and this is what dialog policy essentially is. 

Let's look at some policy execution examples. 

A system might inform the user that the location is 780 Market Street. The user will hear it as follows: "The nearest one is at 780 Market Street." Another example is that the system might request the location of the user, and the user will see it as, "What is the delivery address?" We have to train a model to give us an act from a dialog state, or we can do that with hand-crafted rules, which is my favorite.
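The step from an agent act to the sentence the user sees can be sketched with plain templates. The act and slot names and the `TEMPLATES` table are assumptions for this example; the two sentences are the ones quoted above.

```python
# (act, slot) -> sentence template; an assumed, hand-written lookup table.
TEMPLATES = {
    ("inform", "location"): "The nearest one is at {value}.",
    ("request", "location"): "What is the delivery address?",
}


def realize(act, slot, value=None):
    """Turn an agent act into the sentence shown to the user."""
    return TEMPLATES[(act, slot)].format(value=value)


print(realize("inform", "location", "780 Market Street"))
# The nearest one is at 780 Market Street.
print(realize("request", "location"))
# What is the delivery address?
```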

Okay, so let's look at the simple approach: hand-crafted rules.

![po_3.png](pics/po_3.png)

You have an NLU and a state tracker, and you can come up with hand-crafted rules for the policy. If you have a state tracker, you have a state, and if you remember the dialog state tracking challenge dataset, the state there contains the requested slots. We can use that information to understand what to do next: whether we need to tell the user the value of a particular slot, or search the database, or something else. So, it should be pretty easy to come up with hand-crafted rules for the policy.
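A hand-crafted policy of this kind can be a handful of if-statements over the tracked state. The sketch below is only one possible set of rules; the state layout, the `REQUIRED_SLOTS` tuple, and the act names are assumptions made for illustration.

```python
REQUIRED_SLOTS = ("food", "price range")  # assumed slots needed before searching


def choose_action(state, found=None):
    """Map a dialog state to an agent act using hand-written rules.

    `found` is the restaurant currently offered to the user, if any.
    """
    # Rule 1: answer a requested slot when we already have a restaurant.
    if state["requested"] and found is not None:
        slot = state["requested"][0]
        return {"act": "inform", "slot": slot, "value": found.get(slot)}
    # Rule 2: ask for the first constraint that is still missing.
    for slot in REQUIRED_SLOTS:
        if slot not in state["goals"]:
            return {"act": "request", "slot": slot}
    # Rule 3: all constraints are known, so search the database.
    return {"act": "search", "constraints": dict(state["goals"])}


state = {"goals": {"food": "thai"}, "requested": []}
print(choose_action(state))  # {'act': 'request', 'slot': 'price range'}
```

The order of the rules matters here: answering a request comes before asking for more constraints, which is itself a design choice rather than something the dataset prescribes.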

![po_4.png](pics/po_4.png)

But it turns out that you can make it better if you do it with machine learning. And there are two ways to do that, to optimize dialog policies with machine learning. 

The first one is supervised learning. In this setting, you train the model to imitate the observed actions of an expert. We have some human-human interactions where one of the participants is an expert, and you use those observations to try to imitate the actions of the expert. This often requires a large amount of expert-labeled data, and as you know, it is pretty expensive to collect, because you cannot use crowdsourcing platforms like Amazon Mechanical Turk.

But even with a large amount of training data, parts of the dialog state space may not be well covered, and our system will be blind there. So, there is a different approach called reinforcement learning. This is a huge field and it is out of our scope, but it deserves an honorable mention. Given only a reward signal, the agent can optimize a dialog policy through interaction with users. Reinforcement learning can require many samples from an environment, which makes learning from scratch with real users impractical: we would just waste the time of our experts. That's why we need simulated users, built from the supervised data, for reinforcement learning. This field is gaining popularity in dialog policy optimization.

![po_5.png](pics/po_5.png)

Let's look at how the supervised approach might work. Here is an example of another model that does joint NLU and dialog policy optimization. We have four utterances, which are all the utterances that we have got from the user so far. We pass each of them through the NLU, which gives us intents and slot tags. We can also take the hidden vector, the hidden representation of that phrase from the NLU, and feed it to a subsequent LSTM that decides which system action to execute. So, we've got several utterances and NLU results, and then the LSTM reads those utterances in the latent space of the NLU and decides what to do next.

So this is pretty cool because, here, we don't need dialog state tracking; we don't have an explicit state. The state here is replaced with the state of the LSTM, that is, some latent variables, let's say 300 of them. **So our state is no longer hand-crafted; it becomes a real-valued vector.** Then we can learn a classifier on top of that LSTM, and it will output the probability of the next system action.
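To show the shape of that computation without any deep learning library, the sketch below uses a plain tanh recurrent cell as a stand-in for the LSTM, and random untrained weights. It reads one vector per utterance, keeps a real-valued hidden state, and puts a softmax classifier on top. All sizes, names, and input vectors are made up for illustration.

```python
import math
import random


def rnn_step(h, x, W_h, W_x):
    """One step of a plain tanh recurrent cell (a simplified stand-in for an LSTM)."""
    return [
        math.tanh(
            sum(W_h[j][i] * h[i] for i in range(len(h)))
            + sum(W_x[j][i] * x[i] for i in range(len(x)))
        )
        for j in range(len(h))
    ]


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_action_probs(utterance_vectors, W_h, W_x, W_out):
    """Read the utterance vectors in order and classify the next system action."""
    h = [0.0] * len(W_h)  # the latent "dialog state"
    for x in utterance_vectors:
        h = rnn_step(h, x, W_h, W_x)
    scores = [sum(w * v for w, v in zip(row, h)) for row in W_out]
    return softmax(scores)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
hidden_size, input_size, n_actions = 4, 3, 2
W_h = rand_matrix(hidden_size, hidden_size)
W_x = rand_matrix(hidden_size, input_size)
W_out = rand_matrix(n_actions, hidden_size)

# One hidden vector per user utterance, as if produced by the NLU.
utterances = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5]]
print(next_action_probs(utterances, W_h, W_x, W_out))
```

With trained weights, the largest probability would pick the next system action; here the numbers only demonstrate that the output is a distribution over actions.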

![po_6.png](pics/po_6.png)

Let's see how it actually works. If we look at the results, there are three models that we compare here. 

![po_7.png](pics/po_7.png)

The first one is the baseline, which is a classical approach to this problem: a conditional random field for slot tagging and an SVM for action classification. The metric is frame-level accuracy, which means that we need to be accurate about everything in the current frame that we have after every utterance. You can see that the accuracy for the dialog manager is pretty bad here, but for NLU it's okay. Another model is Pipeline-BLSTM: it trains the NLU separately and then trains a bidirectional LSTM for dialog policy optimization on top of that model. But these models are trained separately. The third option is when these two models, the NLU and the bidirectional LSTM, which was shown in blue on the previous slides, are trained end to end, jointly. This increases the dialog manager accuracy by a huge margin, and it improves NLU as well. So we have seen that effect of joint training before, and it continues to hold.

![po_8.png](pics/po_8.png)

Okay, so what have we looked at? Dialog policy can be done with hand-crafted rules if you have a good NLU and a good state tracker. Or it can be done in a supervised way, where you learn it from data, and you can learn it jointly with the NLU; this way, you will not need a state tracker, for example. Or you can go the reinforcement learning way, but that is a story for a different course.

# 3. Final remarks

In this chapter, I want to overview what we have done in notebooks 10 and 11.

![fr_1.png](pics/fr_1.png)

We have overviewed so-called task-oriented dialog systems. And our dialog system looks like the following. We get the speech from the user and we can convert it to text using ASR. Or we can get text like in chat bots. 

Then comes Natural Language Understanding, which gives us intents and slots from that natural language. Then there is a magic box called the Dialog Manager, and it does two things. It tracks the dialog state, what the user actually wants, and it learns the dialog policy, what should be done. The Dialog Manager can query a backend like Google Maps or Yelp or any other. And then it has to say something to the user, so we need to convert the output of the Dialog Manager to speech with some Natural Language Generation.

The red boxes here are the parts of the system that we don't overview, because that would take a lot of time. The system can work without those parts. It can take the user input as text, so you will not need ASR. You can output your response to the user as text as well, so you don't need Natural Language Generation. And sometimes you don't need a backend action to solve the user's task.

We have overviewed Natural Language Understanding and the Dialog Manager in detail.

![fr_2.png](pics/fr_2.png)

And let me remind you, you can train a slot tagger and an intent classifier, which together are basically the NLU. You can train them separately or jointly, and training them jointly yields better results. You can also train the NLU and the Dialogue Manager separately or jointly, and joint training gives better results there as well. You can sometimes use hand-crafted rules, for example, for dialog policy or state tracking. But learning from data works better if you have time for that.

![fr_3.png](pics/fr_3.png)

Let me remind you how we evaluate NLU and the Dialog Manager. For NLU, we use turn-level metrics like intent accuracy and slot F1. For the Dialogue Manager, there are two kinds of metrics. The first is turn-level metrics: after every turn in the dialogue, we track, let's say, state accuracy or policy accuracy. And there are dialog-level metrics, like success rate, that is, whether this dialog solved the user's problem or not, or the reward we got when we solved it. The reward could be based on the number of turns: we want to minimize the number of turns, so that we solve the task for the user faster.
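The difference between the two kinds of metrics is easy to see in code. This is a small sketch with made-up data, where each dialog is summarized by whether it succeeded and how many turns it took; the function names are assumptions for this example.

```python
def turn_accuracy(predicted, gold):
    """Turn-level metric: share of turns where the prediction matches the label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def dialog_metrics(dialogs):
    """Dialog-level metrics: success rate and average number of turns."""
    success_rate = sum(d["success"] for d in dialogs) / len(dialogs)
    avg_turns = sum(d["turns"] for d in dialogs) / len(dialogs)
    return success_rate, avg_turns


dialogs = [
    {"success": True, "turns": 4},
    {"success": True, "turns": 6},
    {"success": False, "turns": 10},
    {"success": True, "turns": 8},
]
print(dialog_metrics(dialogs))  # (0.75, 7.0)

# Policy accuracy over three turns of one dialog: two of three acts are right.
print(turn_accuracy(["inform", "request", "inform"], ["inform", "request", "offer"]))
```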

![fr_4.png](pics/fr_4.png)

And here is the question. We have the NLU and the Dialogue Manager, and if we train them separately, we want to understand how the errors of the NLU affect the final quality of our Dialog Manager. On the left, the vertical axis is the success rate. On the right, the vertical axis is the average number of turns in the dialogue. We have three colors in the legend. The blue one is when we don't have any NLU errors. The green one is when we have a 10% error rate in NLU, and the red one is when we have a 20% error rate. And you can see what happens. When you have a large error rate in NLU, the success rate of your task decreases, and the number of turns needed to solve the task, in cases where there was a success, increases. So it takes more time for the user to solve the task, and the chance of solving it is lower.


But NLU consists of an intent classifier and a slot tagger. So let's see which one is more important.

![fr_5.png](pics/fr_5.png)

Let's look at what happens when we change the intent error rate. It looks like it doesn't affect the success rate of our dialogue that much, and the dialogues don't become that much longer. So it looks like intent error is not as important as slot tagging error, and we will now see why.

![fr_6.png](pics/fr_6.png)

Because when you introduce the same amount of error in slot tagging, it decreases the success rate of the dialogue dramatically. It seems that slot tagging error is the main problem for our success rate.

So it looks like we need to concentrate on the slot tagger. That can give you some insight for training a joint model, where you have a loss for intent and a loss for slot tagging. You can come up with some weights for them, and the intuition is the following: the slot tagging loss should have a bigger weight, because it is more important for the success of the whole dialogue.
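A weighted joint loss of this kind could look as follows. This is a sketch under assumptions: both losses are cross-entropies, the slot loss is averaged over tokens, and the weights 1.0 and 2.0 are arbitrary placeholders that you would tune, not values from the experiments above.

```python
import math


def cross_entropy(probs, gold_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[gold_index])


def joint_loss(intent_probs, gold_intent, slot_probs, gold_slots,
               w_intent=1.0, w_slot=2.0):
    """Weighted sum of the intent loss and the average per-token slot loss."""
    intent_loss = cross_entropy(intent_probs, gold_intent)
    slot_loss = sum(
        cross_entropy(p, g) for p, g in zip(slot_probs, gold_slots)
    ) / len(gold_slots)
    return w_intent * intent_loss + w_slot * slot_loss


# One utterance with two tokens, two intents, and two slot tags (toy numbers).
loss = joint_loss(
    intent_probs=[0.5, 0.5], gold_intent=0,
    slot_probs=[[0.5, 0.5], [0.5, 0.5]], gold_slots=[1, 0],
)
print(loss)  # equals 3 * ln(2), since every individual loss is ln(2)
```

Raising `w_slot` relative to `w_intent` pushes the optimizer to spend more of its effort on slot tagging mistakes.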

Let me summarize. We have overviewed what a task-oriented dialogue system looks like, and we have overviewed the NLU component and the Dialog Manager component in depth.

So this is the basic knowledge that you will need to build your own task-oriented dialog system. 

![to_quiz_1.png](pics/to_quiz_1.png)
![to_quiz_2.png](pics/to_quiz_2.png)
![to_quiz_3.png](pics/to_quiz_3.png)
