
#Intent Classification using LSTM 
This use case demonstrates how an LSTM can be used for intent classification of text.

##Workflow:

1.   Understanding the problem
2.   Reading the data and understanding it
3.   Data Preprocessing
4.   Build LSTM model
5.   Train & Evaluate the model

##1. Understanding the problem
Intent Classification is the automated association of text to a specific intention. For example: Let's say you are writing an email to one of the Airlines and the text of the same is 'Can you please cancel my ticket with PNR 123456'. The intent of the customer here is 'Cancellation of Air Ticket'.

The idea of this use case is to introduce the concept of intent classification and show how an LSTM can be used to solve it.

###Import the necessary libraries
Please load the following packages before you proceed further. 


In [None]:
import numpy as np 
import pandas as pd 
import nltk

from sklearn.preprocessing import OneHotEncoder as oneHot
from nltk.corpus import stopwords
from nltk import word_tokenize
from string import punctuation
from nltk.stem import PorterStemmer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.layers import BatchNormalization, Dropout, Input, Embedding, Activation
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import CategoricalCrossentropy as cce
from tensorflow.keras.activations import relu, softmax
from tensorflow.keras.initializers import he_uniform, glorot_uniform
from tensorflow.keras.metrics import AUC
from tensorflow.keras import Model
from tensorflow.keras.regularizers import l2
from sklearn.metrics import classification_report

nltk.download('punkt')
nltk.download('stopwords')  

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
import tensorflow

In [None]:
tensorflow.__version__

'2.2.0'

##2. Collecting the Data
The ATIS (Airline Travel Information System) data is a rich corpus of natural language text used by the general public to book flight tickets, enquire about flight timings, prices, etc.

The Train and test data can be downloaded from https://www.kaggle.com/hassanamin/atis-airlinetravelinformationsystem

There are 2 columns in each of the above datasets. First column is 'target' which is the output we will be classifying and second column is 'text' which is the user input asking for queries related to flights. 

Basically 'target' is the intent of the customer. 

In [None]:
#Read the train and test datasets with column names as target and text
train= pd.read_csv('/content/datasets_284285_585165_atis_intents_train.csv',
                       names= ["target", "text"])

test= pd.read_csv('/content/datasets_284285_585165_atis_intents_test.csv',
                       names= ["target", "text"])

####This is what the data looks like

In [None]:
train.head(10) #Get Top 10 rows from train dataset

Unnamed: 0,target,text
0,atis_flight,i want to fly from boston at 838 am and arriv...
1,atis_flight,what flights are available from pittsburgh to...
2,atis_flight_time,what is the arrival time in san francisco for...
3,atis_airfare,cheapest airfare from tacoma to orlando
4,atis_airfare,round trip fares from pittsburgh to philadelp...
5,atis_flight,i need a flight tomorrow from columbus to min...
6,atis_aircraft,what kind of aircraft is used on a flight fro...
7,atis_flight,show me the flights from pittsburgh to los an...
8,atis_flight,all flights from boston to washington
9,atis_ground_service,what kind of ground transportation is availab...


In [None]:
train.shape

(4834, 2)

 Check the number of intents.

In [None]:
train['target'].value_counts() #Get counts of different types of target variable in train data. We will not be using this anywhere but it is just for the overview of the data.  

atis_flight            3666
atis_airfare            423
atis_ground_service     255
atis_airline            157
atis_abbreviation       147
atis_aircraft            81
atis_flight_time         54
atis_quantity            51
Name: target, dtype: int64
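
The counts above show a strongly imbalanced target. As a rough sketch (the counts are copied from the output above; `majority_share` is only an illustrative name), the share of the largest intent is also the accuracy that a trivial classifier would reach on the training data by always predicting that intent:

```python
# Intent counts copied from the value_counts() output above.
intent_counts = {
    "atis_flight": 3666,
    "atis_airfare": 423,
    "atis_ground_service": 255,
    "atis_airline": 157,
    "atis_abbreviation": 147,
    "atis_aircraft": 81,
    "atis_flight_time": 54,
    "atis_quantity": 51,
}

total = sum(intent_counts.values())
majority_intent = max(intent_counts, key=intent_counts.get)
majority_share = intent_counts[majority_intent] / total

print(total, majority_intent, round(majority_share, 3))
# 4834 atis_flight 0.758
```

Any accuracy reported later is best read against this baseline of roughly 76%.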

In [None]:
#Example one-hot vectors for 8 intents: [0,0,0,0,1,0,0,0] and [1,0,0,0,0,0,0,0]

In [None]:
train.target.shape

(4834,)

### 3. Preprocessing the Data
We will be doing the following preprocessing steps to get the desired format of the data. 

1. Perform One Hot Encoding on the target variable of both train & test datasets.
2. Convert the text into lower case.
3. Tokenize the words.
4. Remove stop words.
5. Perform stemming & normalization.
6. Convert texts into sequences.
7. Pad the sequences.


In [None]:
encode_target= oneHot().fit(np.array(train.target).reshape(-1,1)) #We perform one hot encoding on the target variable to convert into a matrix of 0s and 1s. 

In [None]:
encode_target.get_feature_names()

array(['x0_atis_abbreviation', 'x0_atis_aircraft', 'x0_atis_airfare',
       'x0_atis_airline', 'x0_atis_flight', 'x0_atis_flight_time',
       'x0_atis_ground_service', 'x0_atis_quantity'], dtype=object)

Perform One Hot Encoding on the target variable. The output of this step would be an array with 0s and 1s. 

In [None]:
train_target_encoded= encode_target.transform(np.array(train.target).reshape(-1,1)).toarray()
test_target_encoded= encode_target.transform(np.array(test.target).reshape(-1,1)).toarray()

In [None]:
train_target_encoded

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
test_target_encoded.shape

(800, 8)
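
To make the encoding step concrete, here is a minimal NumPy sketch of one-hot encoding and decoding. It is not the scikit-learn implementation; it only imitates the alphabetical column order visible in the feature names above, and the four labels are a made-up sample.

```python
import numpy as np

# A made-up sample of intent labels.
labels = np.array(["atis_flight", "atis_airfare", "atis_flight", "atis_quantity"])

# np.unique returns the sorted unique labels, so each label gets a stable column.
categories = np.unique(labels)
col_index = np.searchsorted(categories, labels)

# One row per sample, a single 1 in the column of its label.
one_hot = np.zeros((len(labels), len(categories)))
one_hot[np.arange(len(labels)), col_index] = 1.0

# Decoding: the position of the 1 in each row selects the category.
decoded = categories[one_hot.argmax(axis=1)]

print(one_hot.shape)  # (4, 3)
print(decoded)
```

Decoding by the position of the largest value in each row is also what makes it possible to map the model's softmax output back to intent names.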

Convert text to lower case

In [None]:
train["text"]= train.text.map(lambda l: l.lower())
test["text"]= test.text.map(lambda l: l.lower())

In [None]:
train.text

0        i want to fly from boston at 838 am and arriv...
1        what flights are available from pittsburgh to...
2        what is the arrival time in san francisco for...
3                 cheapest airfare from tacoma to orlando
4        round trip fares from pittsburgh to philadelp...
                              ...                        
4829     what is the airfare for flights from denver t...
4830     do you have any flights from denver to baltim...
4831            which airlines fly into and out of denver
4832     does continental fly from boston to san franc...
4833     is there a delta flight from denver to san fr...
Name: text, Length: 4834, dtype: object

The next step is to tokenize the text. We use the word_tokenize function from the nltk library for this purpose.

In [None]:
train["text"]= train.text.map(word_tokenize)
test["text"]= test.text.map(word_tokenize)

In [None]:
#Output of the above exercise looks like this
train["text"]

0       [i, want, to, fly, from, boston, at, 838, am, ...
1       [what, flights, are, available, from, pittsbur...
2       [what, is, the, arrival, time, in, san, franci...
3          [cheapest, airfare, from, tacoma, to, orlando]
4       [round, trip, fares, from, pittsburgh, to, phi...
                              ...                        
4829    [what, is, the, airfare, for, flights, from, d...
4830    [do, you, have, any, flights, from, denver, to...
4831    [which, airlines, fly, into, and, out, of, den...
4832    [does, continental, fly, from, boston, to, san...
4833    [is, there, a, delta, flight, from, denver, to...
Name: text, Length: 4834, dtype: object

Eliminate stop words. The English stop word list from nltk.corpus is used for this purpose. Punctuation tokens are removed in the same pass.

In [None]:
def clean_data_rm_stop(tokens, stop_list):
    #Keep only the tokens that are not in stop_list
    return [tok for tok in tokens if tok not in stop_list]

stop_words= stopwords.words("english")
rm_punc_stop= set(punctuation) | set(stop_words)  #Punctuation and stop words to remove

#Each row of the text column is a list of tokens at this point
train["text"]= train.text.map(lambda tokens: clean_data_rm_stop(tokens, rm_punc_stop))
test["text"]= test.text.map(lambda tokens: clean_data_rm_stop(tokens, rm_punc_stop))

In [None]:
train.text

0       [want, fly, boston, 838, arrive, denver, 1110,...
1       [flights, available, pittsburgh, baltimore, th...
2       [arrival, time, san, francisco, 755, flight, l...
3                    [cheapest, airfare, tacoma, orlando]
4       [round, trip, fares, pittsburgh, philadelphia,...
                              ...                        
4829    [airfare, flights, denver, pittsburgh, delta, ...
4830            [flights, denver, baltimore, via, dallas]
4831                              [airlines, fly, denver]
4832    [continental, fly, boston, san, francisco, sto...
4833              [delta, flight, denver, san, francisco]
Name: text, Length: 4834, dtype: object
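
The filtering step can be tried without downloading any NLTK data. The sketch below uses a small hypothetical stop list (`demo_stop_words`) instead of NLTK's full English list, and a made-up query in the style of the first training row.

```python
from string import punctuation

# Hypothetical mini stop list; the notebook uses NLTK's full English list.
demo_stop_words = {"i", "to", "from", "at", "am", "and", "in", "the"}
remove = demo_stop_words | set(punctuation)

# A made-up, already tokenized query.
tokens = ["i", "want", "to", "fly", "from", "boston", "at", "838", "am",
          "and", "arrive", "in", "denver", "?"]

# Keep every token that is neither a stop word nor a punctuation mark.
kept = [tok for tok in tokens if tok not in remove]
print(kept)
# ['want', 'fly', 'boston', '838', 'arrive', 'denver']
```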

Stemming & Normalizing


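*   Stemming reduces a word to its root form, e.g. flights becomes flight and arrive becomes arriv.
*   Normalizing is the process of transforming text into a standard form, e.g. Gud would be converted to good. In this notebook the normalize function only joins the stemmed tokens back into a single space-separated string; it does not correct spelling.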




In [None]:
def normalize(text):
    return " ".join(text)

#We use the PorterStemmer class from the nltk.stem library.
stem_func= PorterStemmer()

train["text"]= train.text.map(lambda s: [stem_func.stem(x) for x in s])
train["text"]= train.text.apply(normalize)

test["text"]= test.text.map(lambda s: [stem_func.stem(x) for x in s])
test["text"]= test.text.apply(normalize)

In [None]:
train.text

0              want fli boston 838 arriv denver 1110 morn
1          flight avail pittsburgh baltimor thursday morn
2       arriv time san francisco 755 flight leav washi...
3                          cheapest airfar tacoma orlando
4       round trip fare pittsburgh philadelphia 1000 d...
                              ...                        
4829         airfar flight denver pittsburgh delta airlin
4830                     flight denver baltimor via dalla
4831                                    airlin fli denver
4832       continent fli boston san francisco stop denver
4833                    delta flight denver san francisco
Name: text, Length: 4834, dtype: object

Tokenize

In [None]:
# We use Tokenizer from tensorflow.keras.preprocessing.text library
num_words=10000
text_tokenizer= Tokenizer(num_words)
text_tokenizer.fit_on_texts(train.text) #fit_on_texts - creates the vocabulary index based on word frequency.

tokenized_train_data= text_tokenizer.texts_to_sequences(train.text) #Converting texts to sequences
tokenized_test_data= text_tokenizer.texts_to_sequences(test.text)
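
As a rough illustration of what the two calls above do, the following pure-Python sketch ranks words by frequency and then replaces words by their rank. It is a simplification, not the Keras implementation: it ignores lowercasing, character filtering and the num_words cap, and the helper names `fit_word_index` and `to_sequences` are made up for this example.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

train_texts = ["flight boston denver", "flight denver", "cheapest flight"]
word_index = fit_word_index(train_texts)
print(word_index)
# {'flight': 1, 'denver': 2, 'boston': 3, 'cheapest': 4}

# 'tacoma' was never seen during fitting, so it disappears from the sequence.
print(to_sequences(["flight tacoma denver"], word_index))
# [[1, 2]]
```

Index 0 is never assigned to a word, which is why it can serve as the padding value later on.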

In [None]:
vocab_len = len(text_tokenizer.index_word)

In [None]:
text_tokenizer.index_word

{1: 'flight',
 2: 'boston',
 3: 'show',
 4: 'san',
 5: 'denver',
 6: 'francisco',
 7: 'atlanta',
 8: 'pittsburgh',
 9: 'dalla',
 10: 'baltimor',
 11: 'philadelphia',
 12: 'leav',
 13: 'airlin',
 14: 'like',
 15: 'list',
 16: 'fare',
 17: 'arriv',
 18: 'washington',
 19: 'fli',
 20: 'pleas',
 21: 'morn',
 22: 'pm',
 23: 'would',
 24: 'first',
 25: 'wednesday',
 26: 'oakland',
 27: 'ground',
 28: "'d",
 29: 'transport',
 30: 'trip',
 31: 'class',
 32: 'cheapest',
 33: 'need',
 34: 'citi',
 35: 'go',
 36: 'round',
 37: 'avail',
 38: 'afternoon',
 39: 'american',
 40: 'one',
 41: 'give',
 42: 'want',
 43: 'way',
 44: 'new',
 45: 'thursday',
 46: 'york',
 47: 'earliest',
 48: 'nonstop',
 49: 'monday',
 50: 'dc',
 51: 'stop',
 52: 'tuesday',
 53: 'unit',
 54: 'la',
 55: 'inform',
 56: 'st',
 57: 'milwauke',
 58: 'find',
 59: 'airport',
 60: 'sunday',
 61: 'twenti',
 62: 'miami',
 63: 'even',
 64: 'vega',
 65: 'delta',
 66: 'noon',
 67: 'newark',
 68: 'chicago',
 69: "o'clock",
 70: 'charlott

Then, we pad the sequences

In [None]:
len(tokenized_train_data), len(tokenized_test_data)

(4834, 800)

In [None]:
#Number of tokens in each training sequence
len_list = [len(sent) for sent in tokenized_train_data]


In [None]:
np.percentile(np.array(len_list),99.5)

16.0

In [None]:
#We use pad_sequences from tensorflow.keras.preprocessing.sequence library
train_data= pad_sequences(tokenized_train_data, maxlen= 20, padding= "pre")
test_data= pad_sequences(tokenized_test_data, maxlen= 20, padding= "pre")

In [None]:
train_data.shape, test_data.shape

((4834, 20), (800, 20))
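
Since the 99.5th percentile of the training sequence lengths is 16, a maxlen of 20 truncates at most 0.5% of the training sequences. The sketch below shows what "pre" padding means for a single sequence; `pad_pre` is a hypothetical helper, not a Keras function, and it also imitates the default behaviour of dropping items from the front when a sequence is too long.

```python
def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items if too long.

    Assumes maxlen is a positive integer.
    """
    trimmed = seq[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([42, 19, 2], 5))         # [0, 0, 42, 19, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

With "pre" padding the real tokens sit at the end of the sequence, directly before the step whose LSTM output is used.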

In [None]:
max_len = 20

In [None]:
vocab_len

654

In [None]:
(vocab_len+1)*300 #655*300 = 196500 weights in the embedding layer

In [None]:
##Learn your own embeddings
inputs = Input(name='inputs',shape=[max_len]) #(20,)

layer1 = Embedding(vocab_len+1,300,input_length=max_len,
                  mask_zero=True)(inputs) #(20,300)
layer2 = LSTM(64)(layer1) #(64,)
layer3 = Dense(256,name='FC1')(layer2) #(256,)
layer4 = Activation('relu')(layer3)
layer5 = Dropout(0.5)(layer4)
layer6 = Dense(8,name='out_layer')(layer5) #(8,)
layer7 = Activation('softmax')(layer6)
model = Model(inputs=inputs,outputs=layer7)

In [None]:
model.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          [(None, 20)]              0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 20, 300)           196500    
_________________________________________________________________
lstm_3 (LSTM)                (None, 64)                93440     
_________________________________________________________________
FC1 (Dense)                  (None, 256)               16640     
_________________________________________________________________
activation_4 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
out_layer (Dense)            (None, 8)                 2056
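
The parameter counts in the summary can be reproduced by hand. The sketch below uses the layer sizes from the model definition and the standard LSTM count of four gates, each with input weights, recurrent weights and a bias; the variable names are illustrative only.

```python
vocab_rows = 654 + 1   # vocab_len + 1, index 0 is reserved for padding
embed_dim = 300
lstm_units = 64
dense_units = 256
n_classes = 8

# One embed_dim-sized vector per vocabulary row.
embedding_params = vocab_rows * embed_dim

# Four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense layers: weight matrix plus one bias per output unit.
fc1_params = lstm_units * dense_units + dense_units
out_params = dense_units * n_classes + n_classes

print(embedding_params, lstm_params, fc1_params, out_params)
# 196500 93440 16640 2056
```

The embedding layer holds well over half of all trainable weights, which is one reason to consider initializing it from pretrained vectors.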

In [None]:
##Use pretrained embeddings
!wget http://vectors.nlpl.eu/repository/20/0.zip




--2020-07-25 16:34:28--  http://vectors.nlpl.eu/repository/20/0.zip
Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.225
Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.225|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 344050746 (328M) [application/zip]
Saving to: ‘0.zip’


2020-07-25 16:34:33 (63.2 MB/s) - ‘0.zip’ saved [344050746/344050746]



In [None]:
!unzip 0.zip

Archive:  0.zip
  inflating: meta.json               
  inflating: model.bin               
  inflating: model.txt               
  inflating: README                  


In [None]:
!head -10 model.txt

163473 300
say_VERB -0.008861 0.097097 0.100236 0.070044 -0.079279 0.000923 -0.012829 0.064301 -0.029405 -0.009858 -0.017753 0.063115 0.033623 0.019805 0.052704 -0.100458 0.089387 -0.040792 -0.088936 0.110212 -0.044749 0.077675 -0.017062 -0.063745 -0.009502 -0.079371 0.066952 -0.070209 0.063761 -0.038194 -0.046252 0.049983 -0.094985 -0.086341 0.024665 -0.112857 -0.038358 -0.007008 -0.010063 -0.000183 0.068841 0.024942 -0.042561 -0.044576 0.010776 0.006323 0.088285 -0.062522 0.028216 0.088291 0.033231 -0.033732 -0.002995 0.118994 0.000453 0.158588 -0.044475 -0.137629 0.066080 0.062824 -0.128369 -0.087959 0.028080 0.070063 0.046700 -0.083278 -0.118428 0.071118 0.100757 0.017944 0.026296 0.017282 -0.082127 -0.006148 0.002967 -0.032857 -0.076493 -0.072842 -0.055179 -0.081703 0.011437 -0.038698 -0.062540 -0.027899 0.087635 0.031870 0.029164 0.000524 -0.039895 -0.055559 0.024582 -0.030595 0.003942 -0.034500 0.003012 -0.023863 0.033831 0.061476 -0.090183 -0.039206 -0.026586 -0.042763 0.049835

In [None]:
!wc -l model.txt

163474 model.txt


In [None]:
embeding_index={}

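#Skip the header line, then map each word (POS tag stripped) to its vector.
#A word listed under several POS tags keeps only the last vector read.
with open('/content/model.txt',encoding='utf-8') as f:
    for i,line in enumerate(f):
        if i==0:continue
        values=line.split()
        word=values[0].split('_')[0]
        coefs=np.asarray(values[1:],dtype='float32')
        embeding_index[word]=coefs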

In [None]:
len(embeding_index['say'])

300
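
The parsing logic can be checked on a tiny in-memory file before running it on the 328 MB download. This is a sketch under assumptions: `load_text_vectors` is a hypothetical helper, and the two 3-dimensional vectors are made up.

```python
import io
import numpy as np

def load_text_vectors(handle):
    """Parse 'word_POS v1 v2 ...' lines, skipping the 'count dim' header."""
    index = {}
    next(handle)  # header line, e.g. "163473 300"
    for line in handle:
        parts = line.split()
        word = parts[0].split("_")[0]  # drop the POS suffix
        index[word] = np.asarray(parts[1:], dtype="float32")
    return index

# Tiny made-up file with 3-dimensional vectors, for illustration only.
demo = io.StringIO("2 3\nsay_VERB 0.1 0.2 0.3\nflight_NOUN 0.4 0.5 0.6\n")
vectors = load_text_vectors(demo)

print(sorted(vectors), vectors["say"].shape)
# ['flight', 'say'] (3,)
```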

In [None]:
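#text_tokenizer.index_word -> key: number, value: word (from your vocab) -> 654 entries
#embeding_index -> key: word, value: vector (pretrained) -> at most 163473 entries (the first line of model.txt is a header)

#embedding_matrix -> row number: word index, row content: vector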

In [None]:
embedding_matrix=np.zeros((vocab_len+1,300)) #655*300 <zeros>
words_not_available=0
for word,i in text_tokenizer.word_index.items():
    embed_vector=embeding_index.get(word)
    if embed_vector is not None:
        embedding_matrix[i]=embed_vector
    else:
        words_not_available+=1

In [None]:
embedding_matrix.shape

(655, 300)
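
It is worth checking how many vocabulary words actually receive a pretrained vector, because the vocabulary here consists of stems such as fli or baltimor, which may well be absent from a pretrained vocabulary. The sketch below is self-contained and uses made-up 2-dimensional vectors; `build_embedding_matrix` is a hypothetical helper that mirrors the loop above and additionally collects the missing words.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 stays zero."""
    matrix = np.zeros((len(word_index) + 1, dim))
    missing = []
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is None:
            missing.append(word)  # row stays all zeros
        else:
            matrix[i] = vec
    return matrix, missing

# Made-up vectors; 'fli' is a stem and has no pretrained entry.
demo_vectors = {"flight": np.array([1.0, 2.0]), "denver": np.array([3.0, 4.0])}
demo_word_index = {"flight": 1, "denver": 2, "fli": 3}

matrix, missing = build_embedding_matrix(demo_word_index, demo_vectors, dim=2)
coverage = 1 - len(missing) / len(demo_word_index)

print(matrix.shape, missing, round(coverage, 2))
# (4, 2) ['fli'] 0.67
```

In the notebook, the corresponding number is `words_not_available` divided by `vocab_len`.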

In [None]:
inputs = Input(name='inputs',shape=[max_len]) #(20,)

layer1 = Embedding(vocab_len+1,300,input_length=max_len,weights=[embedding_matrix],trainable=True,
                  mask_zero=True)(inputs) #(20,300)
layer2 = LSTM(64)(layer1) #(64,)
layer3 = Dense(256,name='FC1')(layer2) #(256,)
layer4 = Activation('relu')(layer3)
layer5 = Dropout(0.5)(layer4)
layer6 = Dense(8,name='out_layer')(layer5) #(8,)
layer7 = Activation('softmax')(layer6)
model2 = Model(inputs=inputs,outputs=layer7)

In [None]:
model2.summary()

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          [(None, 20)]              0         
_________________________________________________________________
embedding_6 (Embedding)      (None, 20, 300)           196500    
_________________________________________________________________
lstm_4 (LSTM)                (None, 64)                93440     
_________________________________________________________________
FC1 (Dense)                  (None, 256)               16640     
_________________________________________________________________
activation_6 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0         
_________________________________________________________________
out_layer (Dense)            (None, 8)                 2056

Let's build a 3-dimensional array. The dimensions are samples, steps and unique words.

### 4. Build LSTM Model

Build a class lstm_model_class with three methods (methods are functions that belong to a class):



1.   Build a lstm model 
2.   Train the created lstm model on the train data
3.   Predict the output on the train data

Note: Grouping these three methods in one class ties them to the same object, so the model that is built is also the one that is trained and used for prediction.

Process of Building a LSTM Model:



1.   Build an embedding layer whose dimensions are given by the number of steps and the input dimension.
2.   Build an LSTM layer with the chosen number of memory units.
3.   Then build a dense layer, i.e. a fully connected layer that performs a matrix-vector multiplication.
4.   Apply the ReLU function after normalization and scaling of the activations. ReLU is the most commonly used activation function for hidden layers.
5.   Finally, build the output layer.

Note: 
1. Activation function for the multi-class output layer: softmax.
2. Loss function: categorical cross entropy.
3. Performance metric: area under the curve (AUC).
4. Optimizer: Adam.
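
To show how softmax and categorical cross entropy fit together, here is a small NumPy sketch. It is for illustration and is not the Keras implementation; the logits and targets are made up, and the `eps` clipping value is an assumption chosen to avoid log(0).

```python
import numpy as np

def softmax(logits):
    """Turn each row of scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=-1).mean())

# Two made-up samples with three classes each.
logits = np.array([[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]])
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

probs = softmax(logits)
loss = categorical_cross_entropy(y_true, probs)

print(probs.sum(axis=1))  # each row sums to 1
print(loss)
```

The loss shrinks towards 0 as the probability assigned to the true class approaches 1, which is what the optimizer works towards during training.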





Build the model on the necessary inputs.
We define the number of steps, the output shape and the input dimension appropriately.

### 5. Train & Evaluate the model
In our last step, we will train and evaluate the model and check the performance metrics. 

In [None]:
final_model.train_lstm_model(trans_matrix_train, train_target_encoded,
           0.2, 60) #Model takes train data, train target variable, validation split(here it is 80:20) and number of epochs. 

Epoch 1/60
...
Epoch 60/60


In [None]:
pred_train= encode_target.inverse_transform(final_model.predict_lstm_model(trans_matrix_train)) #Predict on the train matrix and look at the performance
print(classification_report(train.target, pred_train)) #Print the classification report

                     precision    recall  f1-score   support

  atis_abbreviation       0.96      1.00      0.98       147
      atis_aircraft       0.97      0.96      0.97        81
       atis_airfare       0.99      0.99      0.99       423
       atis_airline       0.98      0.96      0.97       157
        atis_flight       1.00      1.00      1.00      3666
   atis_flight_time       0.98      0.94      0.96        54
atis_ground_service       1.00      1.00      1.00       255
      atis_quantity       1.00      1.00      1.00        51

           accuracy                           0.99      4834
          macro avg       0.98      0.98      0.98      4834
       weighted avg       0.99      0.99      0.99      4834



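The per-class F1 scores (0.96 or higher) and the weighted averages on the training data are excellent. Since this is the data the model was fitted on, we now apply the model to the test data to see how it performs on unseen queries.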

In [None]:
pred_test= encode_target.inverse_transform(final_model.predict_lstm_model(trans_matrix_test)) #Predict on the test data
print(classification_report(test.target, pred_test)) #Print the classification report

                     precision    recall  f1-score   support

  atis_abbreviation       0.80      1.00      0.89        33
      atis_aircraft       0.83      0.56      0.67         9
       atis_airfare       0.98      1.00      0.99        48
       atis_airline       0.97      0.76      0.85        38
        atis_flight       0.99      0.99      0.99       632
   atis_flight_time       0.50      1.00      0.67         1
atis_ground_service       1.00      0.94      0.97        36
      atis_quantity       0.43      1.00      0.60         3

           accuracy                           0.97       800
          macro avg       0.81      0.91      0.83       800
       weighted avg       0.98      0.97      0.97       800



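Accuracy and weighted-average F1 are excellent (0.97), but the macro-average F1 is only 0.83: rare intents such as atis_quantity (F1 0.60, 3 samples), atis_flight_time (0.67, 1 sample) and atis_aircraft (0.67, 9 samples) are classified much less reliably, and their small support makes these estimates noisy. For the purpose of this demo we can settle with this model.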
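
The gap between the macro and weighted averages in the test report comes purely from how the per-class scores are combined. The sketch below recomputes both from the F1 scores and supports printed in the report; because the inputs are rounded to two decimals, it only aims to match the report to two decimals.

```python
# (f1-score, support) per intent, copied from the test report above.
report = {
    "atis_abbreviation": (0.89, 33),
    "atis_aircraft": (0.67, 9),
    "atis_airfare": (0.99, 48),
    "atis_airline": (0.85, 38),
    "atis_flight": (0.99, 632),
    "atis_flight_time": (0.67, 1),
    "atis_ground_service": (0.97, 36),
    "atis_quantity": (0.60, 3),
}

f1_scores = [f1 for f1, _ in report.values()]
supports = [n for _, n in report.values()]

# Macro: every class counts equally. Weighted: classes count by support.
macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f1 * n for f1, n in report.values()) / sum(supports)

print(round(macro_f1, 2), round(weighted_f1, 2))
# 0.83 0.97
```

Because atis_flight makes up 632 of the 800 test samples, the weighted average is dominated by that single class.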

### Next Steps


*   Try changing some of the hyperparameters (for example the number of LSTM units, the dropout rate or the number of epochs) and see how the performance metrics change.

