
# 1. Task-oriented dialog systems

In this chapter, we will talk about task-oriented dialog systems. 

![to_1.png](pics/to_1.png)
Task-oriented dialog systems are all around you: you can talk to a personal assistant like Apple Siri, Google Assistant, Microsoft Cortana, or Amazon Alexa. You can ask it to solve tasks such as setting up a reminder, finding photos of your pet, finding a good restaurant, or anything else. People are already familiar with these personal assistants, and here we will overview how you can make your own.

![to_2.png](pics/to_2.png)

Okay. You can also write to a chat bot for different reasons: to book tickets, to order food, or to contest a parking ticket, for example. This time you don't use your voice; you type your question to the bot, and you expect the result to come back instantaneously.

![to_2.png](pics/to_3.png)

What we actually get from the user is either speech or text. If it is speech, we can run it through automatic speech recognition and get text as the result. What we get in the end is the utterance, and from now on we will assume that the utterance is text. We won't deal with speech, because it is out of the scope of this chapter.

![to_4.png](pics/to_4.png)

The first thing you need to do when you get an utterance from the user is to understand what the user wants. This is the intent classification problem.

You should think of it as the following question: which predefined scenario is the user trying to execute? Let's look at this Siri example. I asked Siri, "How long to drive to the nearest Starbucks?", and Siri told me the result: the traffic to Starbucks is about average, so it should take approximately ten minutes. I had an intent here: I wanted to know how long it takes to drive to the nearest Starbucks, and we can mark it up as the intent navigation.time.closest. That means that I am interested in the navigation time to the closest thing. I can ask the same thing in many other ways, because natural language has a lot of options for that, but the system still needs to understand that this is the same intent.

![to_4.png](pics/to_5.png)

Okay. I can also ask Siri a different question: "Give me directions to nearest Starbucks". This time I don't care how long it takes, I just need the directions, so Siri shows me the directions on a map.

Let's say that this is a different intent, navigation.directions.closest. You need to distinguish between different intents, so this is a classification task, and you can measure accuracy here.

![to_4.png](pics/to_6.png)

And one more example: "Give me directions to Starbucks." This time I don't say that I need the time or the nearest Starbucks, so the system doesn't know which Starbucks I want. That's when the system initiates a dialog with me, because it needs additional information, like which Starbucks. This is the intent navigation.directions.
![to_4.png](pics/to_7.png)

How should we think about this dialog, and how does our chat bot or personal assistant track what we are saying to it? You should think of an intent as a form that the user needs to fill in. Each intent has a set of fields, or so-called slots, that must be filled in to execute the user's request.

Let's look at an example intent like navigation.directions. To build the directions for us, the system needs to know where we want to go and where we start from. So, let's say we have two slots here, FROM and TO. The FROM slot is optional, because it can default to the current geolocation of the user. The TO slot is required: we cannot build directions for you if you don't say where you want to go. We need a slot tagger to extract slots from the user's utterance. Whenever we get an utterance from the user, we need to know which slots are there and which intent is there.
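To make the "intent as a form" idea concrete, here is a minimal sketch. The class name `IntentForm` and its fields are hypothetical names chosen for illustration, not part of any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class IntentForm:
    """An intent viewed as a form: slot names plus the values filled so far."""
    name: str
    required: tuple
    optional: tuple = ()
    values: dict = field(default_factory=dict)

    def missing_required(self):
        """Return the required slots that the user has not filled yet."""
        return [slot for slot in self.required if slot not in self.values]


# navigation.directions: TO is required, FROM is optional
# (FROM could default to the user's current geolocation).
form = IntentForm("navigation.directions", required=("TO",), optional=("FROM",))
form.values["FROM"] = "L.A."
print(form.missing_required())  # ['TO']
```

As long as `missing_required()` is not empty, the system cannot execute the request and has to ask the user for more information.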
![to_4.png](pics/to_8.png)

Let's look at a slot filling example. The user says, "Show me the way to History Museum." What we expect from our slot tagger is to highlight the "History Museum" part and tell us that "History Museum" is the value of the TO slot in our form. You should think of it as sequence tagging, and let me remind you that we solve sequence tagging tasks using BIO scheme coding. Here B corresponds to the word at the beginning of a slot, I corresponds to a word inside a slot, and O corresponds to all other words that are outside of slots.

If we look at the example "Show me the way to History Museum.", the tags that we want to produce for each token are the following: "Show me the way to" are outside of any slots, so they get O; "History" is the beginning of the TO slot, so it gets B-TO; and "Museum" is a token inside the TO slot, so it gets I-TO.
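The BIO markup above can be sketched as a small helper that turns slot spans into per-token tags. The function name `bio_tags` and the span format (start index, exclusive end index, slot name) are assumptions made for this illustration.

```python
def bio_tags(tokens, slots):
    """Build BIO tags from slot spans.

    slots is a list of (start, end, slot_name) token spans, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, name in slots:
        tags[start] = "B-" + name          # first token of the slot
        for i in range(start + 1, end):
            tags[i] = "I-" + name          # remaining tokens of the slot
    return tags


tokens = "Show me the way to History Museum".split()
print(bio_tags(tokens, [(5, 7, "TO")]))
# ['O', 'O', 'O', 'O', 'O', 'B-TO', 'I-TO']
```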

![to_4.png](pics/to_9.png)

You train the slot tagger as a sequence tagging task in the BIO scheme; we overviewed sequence tagging in a previous week. Let's say that a slot is considered correct if its range and type are both correct. Then we can calculate the following metrics. For recall, we take all the true slots and find out which of them are correctly found by our system. For precision, we take all the found slots and find out which of them are correct. You can then evaluate your slot tagger with the F1 measure, which is the harmonic mean of the precision and recall that we have just defined.
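Here is a minimal sketch of that evaluation, assuming tags are given in the BIO scheme and that a slot counts as correct only when both its range and its type match. The helper names `extract_slots` and `slot_prf` are hypothetical.

```python
def extract_slots(tags):
    """Turn a BIO tag sequence into a set of (start, end, slot_type) spans."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        if tag.startswith("I-") and start is not None and tag[2:] == kind:
            continue  # still inside the current slot
        if start is not None:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def slot_prf(true_tags, pred_tags):
    """Span-level precision, recall and F1 for one utterance."""
    true_spans, pred_spans = extract_slots(true_tags), extract_slots(pred_tags)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


true_tags = ["O", "B-FROM", "O", "B-TO", "I-TO"]
pred_tags = ["O", "B-FROM", "O", "B-TO", "O"]   # TO slot has the wrong range
print(slot_prf(true_tags, pred_tags))  # (0.5, 0.5, 0.5)
```

In the example, the predicted TO slot covers only one of the two tokens, so it does not count as correct even though its type is right.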

![to_4.png](pics/to_10.png)

Okay. So, let's see how a form-filling dialog manager can work in a single-turn scenario. That means that we give a single utterance to the system and it outputs the result right away.

Okay, the user says "Give me directions to San Francisco." We run our intent classifier and it says, "This is a navigation.directions intent." Then we run the slot tagger and it says, "San Francisco seems to be the value of the TO slot." Then our dialog manager needs to decide what to do with that information. It seems that all required slots are filled, so we can ask for the route. We can query Google Maps or any other service that will give us the route, and we can output it to the user and say, "Here is your route." Okay, that was the simple case, a single-turn dialog.

![to_4.png](pics/to_11.png)
Let's look at a more difficult example. This time the user starts the conversation like this: "Give me directions from L.A." We run the intent classifier and it says, "navigation.directions". We run the slot tagger and it says that Los Angeles fills the FROM slot. This time the dialog manager looks at this and says, "Okay, a required slot is missing, I don't know where to go. Please ask the user where to go." The system asks the user, "Where do you want to go?", and this is where the second turn of the dialog happens: the user says "San Francisco". We run our intent classifier and slot tagger and, hopefully, they will give us the values on the slide. The slot tagger will tell us that the "San Francisco" part is the TO slot.

This time the dialog manager knows, "Okay. I have all the information I need. I can query Google Maps and give you the route." And the assistant outputs, "Here is your route." The problem is that during the second turn, if we don't know the history of our conversation and just see the utterance "San Francisco", it's really hard to guess that this is a navigation.directions intent and that San Francisco fills the TO slot. So, here we need to add context to our intent classifier and slot tagger, and that context is some information about what happened previously.

![to_12.png](pics/to_12.png)

Let's see how you can track context in an easy way. We already understand that both the intent classifier and the slot tagger need it, so let's add simple features to both of them. The first feature is the intent of the previous utterance as a categorical feature. This way we know what the user wanted in the previous turn, and that information can be valuable for deciding what intent the user has now. Then we also add the slots that are filled in so far, with a binary feature for each possible slot. During slot tagging, the system already knows which slots were filled by the user previously and which were not, and that helps it decide which slot is correct in the utterance it sees.

This simple procedure improves slot tagger F1 by 0.5% and reduces intent classifier error by 6.7%. So, this is pretty cool: these are pretty easy features and they reduce your error. We will review a better way to do that, memory networks, but that will happen later.
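The two context features can be sketched as follows. The intent and slot inventories below are made up for this example, and `context_features` is a hypothetical helper; in a real system the resulting vector would be concatenated with the usual text features of the classifier and the tagger.

```python
# Assumed inventories for illustration only.
INTENTS = ["navigation.directions", "navigation.find", "navigation.time.closest"]
SLOTS = ["FROM", "TO", "CATEGORY"]


def context_features(prev_intent, filled_slots):
    """One-hot previous intent followed by one binary flag per known slot."""
    intent_part = [1.0 if prev_intent == name else 0.0 for name in INTENTS]
    slot_part = [1.0 if slot in filled_slots else 0.0 for slot in SLOTS]
    return intent_part + slot_part


# Second turn of the "Give me directions from L.A." dialog:
print(context_features("navigation.directions", {"FROM": "L.A."}))
# [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

On the first turn there is no previous intent and no filled slots, so the vector is all zeros.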

![to_12.png](pics/to_13.png)

Okay. But how do we track a form switch? Imagine that at first the user says, "Give me directions from L.A.", and then we ask, "Where do you want to go?" This time the user says, "Forget about it, let's eat some sushi first." This is where we need to understand that the intent has changed, and we should forget about all the previous slots and all the previous information that we had, because we don't need it anymore. The intent classifier gives us navigation.find, and the slot tagger gives us a category slot with the value "sushi". The dialog manager understands, "Okay, let's start a new form and find some sushi." Then we make a query to a database or knowledge base like Yelp, and the assistant outputs, "Okay, here are nearby sushi places."

So we can track the form switch by noticing when the intent switches, let's say, from navigation.directions to navigation.find.
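The multi-turn behaviour and the form switch can be sketched with a tiny state tracker. This is only an illustration under simplifying assumptions: the NLU outputs (intent and slots) are passed in by hand, the `REQUIRED` table and the `DialogManager` class are hypothetical, and the backend query is represented by a returned action instead of a real call to a service.

```python
# Required slots per intent (assumed for this example).
REQUIRED = {"navigation.directions": ["TO"], "navigation.find": ["CATEGORY"]}


class DialogManager:
    """Keeps the current form and decides the next action."""

    def __init__(self):
        self.intent = None
        self.slots = {}

    def step(self, intent, slots):
        if intent != self.intent:
            # Form switch: start a new form and forget the previous slots.
            self.intent, self.slots = intent, {}
        self.slots.update(slots)
        missing = [s for s in REQUIRED[intent] if s not in self.slots]
        if missing:
            return ("request", missing[0])       # ask the user for the slot
        return ("query_backend", dict(self.slots))


dm = DialogManager()
print(dm.step("navigation.directions", {"FROM": "L.A."}))
# ('request', 'TO')
print(dm.step("navigation.find", {"CATEGORY": "sushi"}))
# ('query_backend', {'CATEGORY': 'sushi'})
```

Note that after the switch to navigation.find the FROM slot from the first turn is gone, which is exactly the "forget the previous form" behaviour.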

![to_12.png](pics/to_14.png)

If we overview the whole system, it looks like the following. We have a user, and we get speech or text from him or her. Then we have a natural language understanding module that outputs intents and slots for the utterance. Then we have that magic box called the dialog manager, which is responsible for two tasks:
1. The first one is dialog state tracking. We need to understand what the user wanted throughout the conversation and track that state.
2. The second one is dialog policy management. There is a certain policy that says: if the state is such and such, then we need to request some information from the user, or we just inform the user about something.

The dialog manager can also query backend services like Google Maps or Yelp. When we are ready to give the user some information, we use a natural language generation box that outputs the speech for the user, so that this is a conversation.

![to_15.png](pics/to_15.png)

Okay, so let's summarize. We have overviewed a task-oriented dialog system with form filling. How do we evaluate form filling? We measure accuracy for the intent classifier and the F1 measure for the slot tagger. In the next chapter, we will take a closer look at the intent classifier and slot tagger.

# 2. Intent classifier and slot tagger (NLU)

In this chapter, we will talk about the intent classifier and slot tagger in depth.
![ic_1.png](pics/ic_1.png)
Let's start with the intent classifier. How can we build it? You can use any model on bag-of-words features with n-grams and TF-IDF, that is, classical approaches of text mining. Or you can use some recurrent architecture with LSTM cells, GRU cells, or any other. You can also use convolutional networks, with the 1D convolutions that we overviewed in week one.

Studies show that CNNs can perform better on datasets where the task is essentially a **key phrase recognition task**, which can happen in some sentiment detection datasets, for example. So, it makes sense to try an RNN, a CNN, or any classical approach as a baseline and choose what works best.

![ic_2.png](pics/ic_2.png)

Then comes the slot tagger, and this is a somewhat more difficult task. You can use handcrafted rules like regular expressions. For example, when I say "take me to Starbucks", you know that whatever comes after the phrase "take me to" is most definitely the TO slot, or some other slot of your intent. But that approach doesn't scale, because natural language has a huge variation in how we can express the same thing. So, it makes sense to do something data-driven here.

You can use conditional random fields, which is a rather classical approach, or you can use an RNN sequence-to-sequence model with an encoder and a decoder. A fun fact is that you can use convolutional networks for a sequence-to-sequence task as well, and you can add attention to any sequence-to-sequence model.

Next, I want to overview the convolutional sequence-to-sequence model, because it is gaining popularity: it works faster, and sometimes it even beats RNNs on some tasks.
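A handcrafted rule of this kind might look like the sketch below, which uses Python's standard `re` module. The pattern and the helper name `tag_to_slot` are made up for this example; it only knows the phrasing "take me to ...", which is exactly why such rules do not scale.

```python
import re

# Everything after "take me to" (minus trailing punctuation) becomes the TO slot.
TO_PATTERN = re.compile(r"\btake me to (?P<TO>.+?)[.!?]?$", re.IGNORECASE)


def tag_to_slot(utterance):
    """Return {'TO': value} if the rule fires, otherwise an empty dict."""
    match = TO_PATTERN.search(utterance.strip())
    return {"TO": match.group("TO")} if match else {}


print(tag_to_slot("Take me to Starbucks"))        # {'TO': 'Starbucks'}
print(tag_to_slot("I need a ride to Starbucks"))  # {}
```

The second utterance expresses the same request, but the rule misses it because the wording is different.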

![ic_3.png](pics/ic_3.png)

Okay, let's see how convolutional networks can be used to model sequences. Let's say we have an input sequence that is padding, padding, then start of sequence, and three German words. Let's say we want to solve the task of language modeling: when we see each new token, we need to predict which token comes next. Usually we use recurrent architectures for this, but let's see how we can use convolutions.

Let's say that when we generate the next token, we care only about the last three tokens of the sequence that we have seen. If we assume that, then we can use a convolution to aggregate the information about the last three tokens, which is the blue triangle here, and we get some filters in the output.

Let's take half of those filters as is, pass the second half through a sigmoid activation function, and then take the element-wise multiplication of these two halves. What we get is a Gated Linear Unit: the sigmoid gate is the non-linear part, so the whole unit becomes non-linear.

So, this is how we actually look at the context that we had before and we predict some hidden state or let's say, next token and you can use convolutions for that, and then, that triangle is actually convolutional filter and you can slide it across the sequence and use the same weights, the same learned filters, and it will work the same on every iteration on that sequence. 

So, it is pretty similar to RNN, but in this way, we actually don't have a hidden state that we need to change. We actually only look at the context that we had before, and some intermediate representation. 
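The convolution over the last few tokens followed by the gating can be sketched in plain Python. This is an illustration with made-up weights, not the architecture from the paper: `causal_conv_glu` is a hypothetical helper that pads the sequence on the left, applies the same filters at every position, and gates one half of the outputs with the sigmoid of the other half.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def causal_conv_glu(inputs, weights, kernel_size):
    """One causal convolution layer followed by a gated linear unit.

    inputs:  list of T vectors, each of length d_in.
    weights: list of 2 * d_out filters, each a flat list of
             kernel_size * d_in numbers (first half linear, second half gate).
    Returns T vectors of length d_out.
    """
    d_in = len(inputs[0])
    d_out = len(weights) // 2
    # Left padding so that position t only sees tokens t-k+1 .. t.
    padded = [[0.0] * d_in] * (kernel_size - 1) + inputs
    outputs = []
    for t in range(len(inputs)):
        window = [x for vec in padded[t:t + kernel_size] for x in vec]
        conv = [sum(w * x for w, x in zip(filt, window)) for filt in weights]
        linear, gate = conv[:d_out], conv[d_out:]
        outputs.append([a * sigmoid(b) for a, b in zip(linear, gate)])
    return outputs


inputs = [[1.0], [2.0], [3.0]]
weights = [[1.0, 1.0, 1.0],   # linear half: sums the last three tokens
           [0.0, 0.0, 0.0]]   # gate half: always 0, so sigmoid gives 0.5
print(causal_conv_glu(inputs, weights, kernel_size=3))
# [[0.5], [1.5], [3.0]]
```

Every position uses the same weights and depends only on the input tokens, not on a hidden state carried over from the previous position.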

But you can see that we look at only the last three tokens, and that is not very good. Maybe we need to look at the last 10 tokens or so, because RNNs, for example with LSTM cells, can have a very long short-term memory.

Okay. We know from convolutional neural networks how to increase the receptive field: we stack convolutional layers.

Let's stack six layers here with kernel size five. That results in a receptive field of 25 elements, and the experiments show that 25 elements in the receptive field might be enough to model your sequences.
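The number 25 follows from a simple rule: with stride 1 and no dilation (an assumption of this sketch), every extra layer with kernel size k widens the receptive field by k - 1 positions. The helper name `receptive_field` is hypothetical.

```python
def receptive_field(num_layers, kernel_size):
    """Receptive field of stacked stride-1, undilated 1D convolutions."""
    return 1 + num_layers * (kernel_size - 1)


print(receptive_field(1, 3))  # 3: a single layer that sees three tokens
print(receptive_field(6, 5))  # 25: six layers with kernel size five
```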

Let's see how CNNs work for sequences. 

![ic_4.png](pics/ic_4.png)

The authors provided results on a language modeling dataset, WikiText-103, and you can see that this CNN architecture beats the LSTM: it has lower perplexity, and it runs faster. We will go into that a little bit later.

Another example is a machine translation dataset, from English to French, let's say. There they have a metric called BLEU, and the higher that metric, the better. You can see that convolutional sequence-to-sequence beats the LSTM here as well, and this is pretty surprising.

A good thing about CNNs is the speed benefit.

![ic_5.png](pics/ic_5.png)

If you compare CNNs with RNNs, the problem with an RNN is that it has a hidden state that we change through iterations, so we cannot do our calculations in parallel, because every step depends on the previous one. We can overcome that with convolutional networks: during training, we can process all time steps in parallel. We apply the same convolutional filters at each time step, the time steps are independent, and so we can compute them in parallel.

During testing, in a sequence-to-sequence setting, our encoder can do the same, because there is no dependence on the previous outputs. We use only our input tokens, so we can apply the convolutions and get our hidden states in parallel. One more good thing is that GPUs are highly optimized for convolutions, so we can get a higher throughput by using convolutions instead of RNNs.

You can see a table here that compares a model based on LSTM with a model based on convolutional sequence-to-sequence. The convolutional model provides a better score in terms of translation quality, and it also works 10 times faster. That is a pretty good thing, because real-world systems like, let's say, Facebook need to translate a post whenever you ask, and they need to translate it fast. **So, in order to implement machine translation in a production environment, a CNN may be a very good choice**. By the way, this paper is by the folks from Facebook.

![ic_6.png](pics/ic_6.png)

So, let's look at one more thing. You know that when you do a sequence-to-sequence task, you want your encoder to be bidirectional, so that you look at the sequence from left to right and from right to left. The good thing about convolutions is that you can make the convolutional filters symmetric, so you can look at the context on the left and on the right at the same time.

So, it is very easy to make a bidirectional encoder with CNNs. And it still works in parallel: there is no dependence on a hidden state here, it just applies all of those multiplications in parallel.

![ic_7.png](pics/ic_7.png)

Let me remind you that we are reviewing the intent classifier and slot tagger. To move further, we need some dataset that we can use for our overview. Let's take the ATIS dataset, which stands for Airline Travel Information System. It was collected back in the 90s, and it has roughly 5,000 context-independent utterances, and that is important: it means that we have single-turn dialogs and we don't need a fancy dialog manager here. It has 17 intents and 127 slot labels, like from location, to location, departure time, and so forth. The utterances look like this: "show me flights from Seattle to San Diego tomorrow".

The state of the art for this task is the following: 1.7% intent error and 95.9 slot F1.

So, this is pretty cool. 

![ic_8.png](pics/ic_8.png)

Another thing is that you can learn your intent classifier and slot tagger jointly. You don't need to train two separate models; you can train one joint task, because the model can learn representations that are suitable for both tasks. This way we provide more supervision for our training, and we get higher quality as a result.

![ic_9.png](pics/ic_9.png)

Let's see how this joint model might work. It is still a sequence-to-sequence model, but this time we use, let's say, a bidirectional encoder, and we use its last hidden state both for decoding the slot tags and for decoding the intent. If we train this end-to-end for the two tasks, we can get higher quality. Notice that in the decoder we have hidden states from the encoder passed just as is, which is called aligned inputs, and we also have c vectors, which are attention.

![ic_10.png](pics/ic_10.png)

Let's see how attention works in the decoder. Let's say that we are at time step i, and we have to output our new decoder hidden state s_i. It is a function of the previous hidden state, which is in blue, the previous output, which is in red, the hidden state from the encoder, and a vector c_i, which is attention. The attention vector c_i is a weighted sum of the hidden vectors from the encoder, and we need to come up with weights for these vectors. We train the system to learn these weights in such a way that it makes sense to pay attention to those vectors. The coefficient that defines the weight of a particular encoder vector is modeled as a feed-forward network that takes our previous decoder hidden state and the encoder state, and it needs to figure out whether we need that encoder state or not.

You can also see an example of the attention distribution when we predict the label for the last word. When we predict a label like departure time, our model looks at phrases like "from city", or the city name, or something like that.
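The attention vector can be sketched as follows. The exact form of the feed-forward scorer is not specified above, so this sketch assumes a common one-hidden-layer form, score = v · tanh(W s + U h); the names `attention_context`, `W`, `U` and `v` are assumptions for illustration, and the toy weights are made up.

```python
import math


def softmax(scores):
    peak = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def attention_context(prev_state, encoder_states, W, U, v):
    """Return attention weights and the context vector (weighted sum of h_j)."""
    projected_state = matvec(W, prev_state)
    scores = []
    for h in encoder_states:
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, matvec(U, h))]
        scores.append(sum(vi * x for vi, x in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0]]
prev_state = [0.5, -0.5]
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[2.0, 0.0], [0.0, 2.0]]
v = [1.0, -1.0]
weights, context = attention_context(prev_state, encoder_states, W, U, v)
print(weights, context)
```

Because the weights sum to one, the context vector always lies between the encoder states; training adjusts `W`, `U` and `v` so that the useful states get the larger weights.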

![ic_11.png](pics/ic_11.png)

Okay. So, we can also see how our two losses decrease during training. During training we use two losses, and we optimize their sum. The green curve here is the intent loss, and the blue one is the slot loss. You can see that the intent loss saturates and doesn't change, but the blue slot curve continues to decrease, so our model continues to train, because slot tagging is a harder task than intent classification.
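The sum of the two losses can be sketched like this, assuming both are cross-entropy losses and that the slot loss is summed over tokens (the text above only says that the two losses are added). The function names are hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-probability of the correct class."""
    return -math.log(probs[target_index])


def joint_loss(intent_probs, intent_target, slot_probs, slot_targets):
    """Intent loss plus the slot tagging loss over all tokens."""
    intent_loss = cross_entropy(intent_probs, intent_target)
    slot_loss = sum(cross_entropy(p, t) for p, t in zip(slot_probs, slot_targets))
    return intent_loss + slot_loss


# One utterance with two tokens, two intents and two slot tags (toy numbers).
loss = joint_loss([0.5, 0.5], 0, [[0.5, 0.5], [1.0, 0.0]], [0, 0])
print(loss)
```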

![ic_11.png](pics/ic_12.png)

Okay. Let's look at the joint training results on the ATIS dataset. If we train slot filling independently, we get a slot F1 of 95.7, and if we train our intent classifier independently, we get an intent error of about two percent. But if we train those two tasks jointly, using the architecture that we have overviewed, we can get a higher slot F1 and a lower intent error.

And a good thing also is that this joint model works faster if you use it on mobile phone, or any other embedded system because you have only one encoder and you reuse that information for two tasks. 

![ic_11.png](pics/ic_13.png)

Okay. Let's summarize what we have overviewed. We have looked at different options for the intent classifier and slot tagger: you can start from classical approaches and go all the way to deep approaches. People have started to use CNNs for sequence modeling, and sometimes they get better results than with RNNs, which is a pretty surprising fact. You can also use joint training, and it can be beneficial in terms of speed and performance for your slot tagger and intent classifier. In the next chapter, we will take a look at context utilization in our NLU, that is, in our intent classifier and slot tagger.

# 3. Adding context to NLU

In this chapter, we'll talk about context utilization in our NLU. Let me remind you why we need context. We can have a dialog like this.

![ic_15.png](pics/ic_15.png)

The user says, "Give me directions from LA," and we understand that we have a missing slot, so we ask, "Where do you want to go?" Then the user says, "San Francisco." When we get this next utterance, it would be very nice if the intent classifier and slot tagger could use the previous context and understand that San Francisco is the TO slot that we are waiting for, and that the intent didn't change.

A proper way to do this is called **memory networks**. Let's see how it might work. 

![ic_16.png](pics/ic_16.png)

We have a history of utterances; let's call them x's. We pass them through a special RNN that encodes them into memory vectors. Say we take two utterances and pass them through this RNN, and we get two memory vectors. These are dense vectors, like the ones neural networks usually work with.

Okay. So we can encode all the utterances we had before into the memory. Let's see how we can use that memory. 

![ic_17.png](pics/ic_17.png)

Then, when a new utterance comes, and this is utterance c in the lower left corner, we encode it into a vector of the same size as our memory vectors, and we use a special RNN for that, called the RNN for input. The orange "u" vector that we get is the representation of our current utterance, and what we need to do is match this current utterance with all the utterances that we have in the memory.

For that, we use a dot product with the representations of the utterances we had before, and after applying softmax, this gives us a knowledge attention distribution. So we know which previous knowledge is relevant to our current utterance and which is not.

We can then take all the memory vectors with the weights of this attention distribution and get a final vector, which is a weighted sum. We can add it to the representation of our utterance, which is the orange vector, pass it through some fully connected layers, and get the final vector "o", which is the knowledge encoding of our current utterance and the knowledge that we had before.

What do we do with that vector? That vector accumulates all the context of the dialog that we had before. So we can use it, let's say, in our RNN for tagging.
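The memory read described above (dot product, softmax, weighted sum, addition to the utterance vector) can be sketched in a few lines. This is a simplification: the memory and utterance vectors below are made up instead of coming from RNN encoders, the final fully connected layers are omitted, and `read_memory` is a hypothetical name.

```python
import math


def softmax(scores):
    peak = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(u, memory):
    """Match the current utterance vector u against all memory vectors."""
    scores = [sum(a * b for a, b in zip(u, m)) for m in memory]   # dot products
    attention = softmax(scores)                                   # knowledge attention
    summary = [sum(p * m[d] for p, m in zip(attention, memory))
               for d in range(len(u))]                            # weighted sum
    combined = [s + x for s, x in zip(summary, u)]                # add to u
    return attention, combined


memory = [[1.0, 0.0], [0.0, 1.0]]   # two previous utterances (toy vectors)
u = [1.0, 0.0]                      # current utterance (toy vector)
attention, combined = read_memory(u, memory)
print(attention, combined)
```

The first memory vector points in the same direction as `u`, so it gets the larger attention weight and dominates the weighted sum.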

![ic_18.png](pics/ic_18.png)

Now, let's see how we can feed that knowledge vector into the tagging RNN. We can add it as an input on every step of our RNN tagger; it is a memory vector that doesn't change between steps. If we train the whole thing end to end, then we might get better quality, because we use context here.

![ic19.png](pics/ic19.png)

Okay. So this is an overview of the whole architecture. We have historical utterances, and we use a special RNN to turn them into memory vectors. Then we use attention mechanism when a new utterance comes, and we actually know which prior knowledge is relevant to us at the current stage and which is not. And we use that information in the RNN tagger that gives us slot tagging sequence. 

![ic_18.png](pics/ic_20.png)

Let's see how it actually works. We evaluate the slot tagger on a multi-turn dataset, where the dialogs are long, and we measure the F1 measure. Let's compare an RNN tagger without context and this memory network architecture. We can see that the memory network model performs better, not only on the first turn but on the consecutive turns as well. Overall, it gives a significant improvement to the F1 score, roughly 67 compared with 47. So, let me summarize. You can make your NLU context-aware with memory networks.

![ic_21.png](pics/ic_21.png)

In the previous chapters, we overviewed how you can do that in a simple manner, but memory networks seem to be the right approach to this. In the next chapter, we will take a look at lexicon utilization in our NLU. You can think of a lexicon as, let's say, a list of all music artists. We already know that this is a knowledge base, so let's try to use it in our intent classifier and slot tagger.

# 4. Adding lexicon to NLU


In this chapter, we will talk about lexicon utilization in our NLU. 

Why do we want to utilize lexicon?

![al_1.png](pics/al_1.png)

Let's take the ATIS dataset, for example. The problem with this dataset is that it has a finite set of cities in training, and what we don't know is whether the model will work for a new city during testing. The good news is that we have a list of all cities, from Wikipedia or any other source, and we can use it somehow to help our model detect new cities.

Another example: imagine you need to fill a slot like "music artist", and we have all music artists in a database like musicbrainz.org. You can download it, parse it, and use it for your NLU.

But how can we use it? 

![al_2.png](pics/al_2.png)

Let's add lexicon features to our input words. We will overview an approach from the paper that you can see in the lower left corner. Let's match every n-gram of the input text against the entries in our lexicon. Let's take the n-grams "Take me," "me to," "San," "San Francisco," and all the other possible ones, and let's match them against the lexicon, that is, the dictionary that we have for, let's say, cities. We will say that a match is successful when the n-gram matches either a prefix or a postfix of an entry from the dictionary and is at least half the length of the entry, so that we don't get a lot of spurious matches.

Let's see the matches we might have. "San" might match San Antonio and San Francisco, and the n-gram "San Francisco" can match the San Francisco entry. So, we get these matches and we need to decide which one of them is best.

When we have overlapping matches, which means that one word can be used in different n-grams, we need to decide which one is better, and we will prefer them in the following order:
1. First, we will prefer exact matches over partial ones. So, if the word San is used in San Francisco and that is an exact match, then it is preferable to, let's say, the match of San with San Antonio.
2. Then we will prefer longer matches over shorter ones.
3. And finally we will prefer earlier matches in the sequence over later ones.

These three rules actually give us a unique distribution of our words in the non-overlapping matches with our lexicon. So, let's see how we can use that information, that lexicon matching information in our model. 
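The matching and the three preference rules can be sketched as below. This is an illustration, not the paper's code: length is measured in tokens (an assumption), matching is case-insensitive (also an assumption), and the helper names `find_matches` and `resolve` are hypothetical.

```python
def find_matches(tokens, lexicon):
    """Return candidate matches as (start, end, kind), end exclusive.

    kind is 'exact', 'prefix' or 'postfix'. An n-gram must cover at least
    half of the entry's tokens to count as a match.
    """
    entries = [tuple(e.lower().split()) for e in lexicon]
    words = [t.lower() for t in tokens]
    found = set()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            ngram = tuple(words[start:end])
            n = len(ngram)
            for entry in entries:
                if n > len(entry) or 2 * n < len(entry):
                    continue
                if ngram == entry:
                    found.add((start, end, "exact"))
                elif entry[:n] == ngram:
                    found.add((start, end, "prefix"))
                elif entry[-n:] == ngram:
                    found.add((start, end, "postfix"))
    return found


def resolve(matches):
    """Keep non-overlapping matches: exact first, then longer, then earlier."""
    ranked = sorted(matches, key=lambda m: (m[2] != "exact", -(m[1] - m[0]), m[0]))
    taken, used = [], set()
    for start, end, kind in ranked:
        span = set(range(start, end))
        if not span & used:
            taken.append((start, end, kind))
            used |= span
    return sorted(taken)


tokens = "Take me to San Francisco".split()
cities = ["San Francisco", "San Antonio"]
print(resolve(find_matches(tokens, cities)))  # [(3, 5, 'exact')]
```

"San" alone also matches as a prefix of both entries, and "Francisco" matches as a postfix, but the exact match of "San Francisco" wins and the overlapping partial matches are dropped.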

![al_3.png](pics/al_3.png)

We will use a so-called BIOES coding, which stands for Begin, Inside, Outside, End, and Single. We will mark a token with B if it matches the beginning of some entity. We will use B and I if tokens match as a prefix, and I and E if tokens match as a postfix, that is, some token in the middle and some token at the end of the entity. And we will use S when a single token matches a whole entity.

Let's see an example of such coding for four lexicon dictionaries: location, miscellaneous, organization, and person. We have an utterance like "Hayao Tada, commander of the Japanese North China Area Army." You can see that we have a match in the persons lexicon that gives us B and E, so we know that this is an entity. We also have a full match for "North China Area Army" in the organization lexicon, and it has the encoding B, I, I, E.

We can actually get such a full match even if we don't have the entity in our lexicon. Let's say we have "North China History Museum" and, I don't know, some other country's "Area Army" entities. With those two entries, we can take the prefix from the first match and the postfix from the second one, and it will still give us the same BIOES encoding. So, this is pretty cool: we can detect new entities that we haven't seen before.

Okay, so what we do next is encode these letters as one-hot vectors.
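The way a prefix match and a postfix match combine into the encoding of a full entity can be sketched as follows, assuming the four-token example "North China Area Army". The helper `bioes_for_match` is hypothetical and only covers the three kinds of matches described above.

```python
def bioes_for_match(length, kind):
    """BIOES letters for a matched span of `length` tokens.

    kind is 'exact', 'prefix' or 'postfix'.
    """
    if kind == "exact":
        return ["S"] if length == 1 else ["B"] + ["I"] * (length - 2) + ["E"]
    if kind == "prefix":
        return ["B"] + ["I"] * (length - 1)
    return ["I"] * (length - 1) + ["E"]      # postfix


# "North China" as a prefix of one entry + "Area Army" as a postfix of another
combined = bioes_for_match(2, "prefix") + bioes_for_match(2, "postfix")
print(combined)                      # ['B', 'I', 'I', 'E']
print(bioes_for_match(4, "exact"))   # ['B', 'I', 'I', 'E']
```

Both routes give the same letters, which is why the tagger can recognize an entity that is not itself in the lexicon.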

![al_3.png](pics/al_4.png)

Let's see how we can add that lexicon information to our model. Let's say we have an utterance, "We saw paintings of Picasso," and we have a word embedding for every token. To that word embedding, we can add some lexicon information, and we do it in the following way. Remember the table that we had on the previous slide? Let's take a word and the column of that table that corresponds to it, and let's use one-hot encoding to turn the BIOES letters into numbers. We concatenate that vector with the embedding vector for the word and use the result as an input for, let's say, our bidirectional LSTM, which will predict tags for our slot tagger. So, this is a pretty easy approach to embed lexicon information into your model.
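The concatenation of word embeddings with one-hot lexicon features can be sketched like this. The embeddings are made-up two-dimensional vectors, only a single "person" lexicon is used, and `tagger_inputs` is a hypothetical helper; the resulting rows are what would be fed to the tagger.

```python
BIOES = ["B", "I", "O", "E", "S"]


def one_hot(letter):
    """One-hot vector for a BIOES letter."""
    return [1.0 if letter == t else 0.0 for t in BIOES]


def tagger_inputs(embeddings, lexicon_tags):
    """Concatenate each word embedding with its lexicon features.

    embeddings:   list of T word vectors.
    lexicon_tags: dict lexicon_name -> list of T BIOES letters.
    """
    rows = []
    for t, emb in enumerate(embeddings):
        features = list(emb)
        for name in sorted(lexicon_tags):        # fixed order of lexicons
            features += one_hot(lexicon_tags[name][t])
        rows.append(features)
    return rows


tokens = "We saw paintings of Picasso".split()
embeddings = [[0.1, 0.2]] * len(tokens)          # toy embeddings
lexicon_tags = {"person": ["O", "O", "O", "O", "S"]}
rows = tagger_inputs(embeddings, lexicon_tags)
print(rows[-1])  # [0.1, 0.2, 0.0, 0.0, 0.0, 0.0, 1.0]
```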

![al_3.png](pics/al_5.png)

Let's see how it works. It was benchmarked on a dataset for Named Entity Recognition, and you can see that adding a lexicon improves your precision, recall, and F1 measure a little bit, by about one percent or so. So, it seems to work, and it seems that it will be helpful to implement these lexicon features in your real-world dialog system.

![al_3.png](pics/al_6.png)

Let's look into some training details. You can sample your lexicon dictionaries so that your model learns not only the lexicon features but also the context of the words. Let's say, when I say, "Take me to San Francisco," that means that the word that comes after the phrase "take me to" is most likely a TO slot. We want the model to learn those features as well, because in the real world we can see entities that were not in our vocabulary before, and then our lexicon features will not work. So, this sampling procedure gives you the ability to detect unknown entities during testing. This is a pretty cool approach.

When you have the lexicon dictionaries, you can also augment your data set because you can replace the slot values by some other values from the same lexicon. Let's say, "Take me to San Francisco," becomes "Take me to Washington," because you can easily replace San Francisco's slot value with Washington because you have the lexicon dictionaries. 
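This augmentation can be sketched as a small function over BIO-tagged training examples. The sketch assumes that the slot occurs once and contiguously in the utterance; the helper name `augment` is hypothetical.

```python
import random


def augment(tokens, tags, slot, lexicon, rng):
    """Replace the value of `slot` by a random entry from the lexicon."""
    positions = [i for i, tag in enumerate(tags)
                 if tag in ("B-" + slot, "I-" + slot)]
    if not positions:
        return tokens, tags                      # nothing to replace
    start, end = positions[0], positions[-1] + 1
    new_value = rng.choice(lexicon).split()
    new_tags = ["B-" + slot] + ["I-" + slot] * (len(new_value) - 1)
    return (tokens[:start] + new_value + tokens[end:],
            tags[:start] + new_tags + tags[end:])


tokens = "Take me to San Francisco".split()
tags = ["O", "O", "O", "B-TO", "I-TO"]
new_tokens, new_tags = augment(tokens, tags, "TO", ["Washington"], random.Random(0))
print(new_tokens, new_tags)
# ['Take', 'me', 'to', 'Washington'] ['O', 'O', 'O', 'B-TO']
```

With a larger lexicon, calling `augment` repeatedly produces many new training utterances from a single annotated one.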

![al_3.png](pics/al_7.png)

So, let me summarize. You can add lexicon features to further improve your NLU, because they will help you detect the entities that the user mentions, including some unknown and long entities like "North China Area Army".
