###### About the dataset
The chosen dataset contains airline travel data. It pairs intents with raw user queries about flight availability, flight fares, airline details and flight timings.

#### Web Service Model for the Chatbot

 

To host our NLP models for intent classification, named entity recognition (NER) and the dialogue flow manager, we chose the Flask framework. Flask is a Python web development framework with a built-in development server, and APIs built with it can call machine learning libraries such as tensorflow, nltk, sklearn, keras and spacy. We chose Flask as our serving layer because it integrates easily with web pages, takes little effort to get an application running with API endpoints, and we found it more flexible and readable than the other deployment options we looked at. We did explore other machine learning serving options such as FastAPI, Seldon Core and DeepDetect; some of them focus on serving a single component and others were complex to understand and implement. FastAPI is known for its performance, but it is relatively new and we found few resources for achieving our goal with it.

Because our goal was a custom UI for the chatbot model, Flask was the right choice: it makes it easy to work with HTML pages and gives a smooth data flow between the UI and the backend. Application deployment is also easy and fast with Flask.

Since we deploy with Docker, base images for building a Flask application (e.g. python:3.8-slim-buster) are readily available on Docker Hub, and installing the Python libraries and running the application as a container is easily achievable.

In [2]:
import pandas as pd

flight_details = pd.read_csv('atis_data.csv')

df = pd.DataFrame(flight_details)


flight_details.head()

Unnamed: 0,Intents,Input_Queries
0,atis_flight,i want to fly from boston at 838 am and arriv...
1,atis_flight,what flights are available from pittsburgh to...
2,atis_flight_time,what's the arrival time in sanfrancisco for t...
3,atis_airfare,cheapest airfare from tacoma to orlando
4,atis_airfare,round trip fares from pittsburgh to philadelp...


#### Logging and monitoring
We use the Python logging library to track events in the application. It records each message with a severity level: debug, info, warning, error or critical.

The log format places the current date and time before each message, so every entry records exactly when the event happened.

The log is saved to the file 'atis_ChatBot.log' in the same directory as the code.


In [66]:
import logging

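# basicConfig's encoding argument needs Python 3.9+; a FileHandler also works on Python 3.8
logging.basicConfig(handlers=[logging.FileHandler('atis_ChatBot.log', encoding='utf-8')], level=logging.DEBUG, format='%(asctime)s %(message)s')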
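
As a minimal sketch of the setup described above, the snippet below shows the timestamped format and how the severity level filters messages. The logger name `atis_demo` and the in-memory stream are assumptions made only to keep the example self-contained; they are not part of the application. Values are passed as `%s` arguments, which is the form the `logging` module expects.

```python
import io
import logging

# Hypothetical logger name; an in-memory stream stands in for the log file.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))

demo_logger = logging.getLogger('atis_demo')
demo_logger.setLevel(logging.INFO)   # DEBUG records are dropped at this level
demo_logger.propagate = False        # keep the demo out of the root logger
demo_logger.addHandler(handler)

demo_logger.debug('not written: below the INFO threshold')
demo_logger.info('Obtained NERs: %s', {'source': 'boston', 'dest': 'denver'})
demo_logger.error('Could not load model from %s', 'source')

log_lines = stream.getvalue().splitlines()
```

Passing the value as a second argument without a `%s` placeholder in the message makes the handler report a formatting error instead of writing the record.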

### NER implementation

To get the named entities for a given query, we initially used the spaCy pretrained pipeline. It has limitations when extracting some entities for this dataset: GPE is assigned to both the source and the destination location, so the label does not say whether a location is the source or the destination. Hence, we used the spaCy rule-based entity recognizer to extract the exact source and destination locations.

The entities obtained this way were then used to train a spaCy custom NER model with a 70-30 and a 50-50 train-test split. We checked metrics such as accuracy, F1 score, recall and precision, which were all nearly 100%.

With this approach we retrieve both the default entity labels from the spaCy pretrained pipeline and the source and destination entities from the given text. These entities can be combined and used in the conversation to produce the output for the given query.
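
For reference, the precision, recall and F1 figures mentioned above can be computed from exact-match entity spans as in the sketch below. This is an illustration only and not necessarily how the reported numbers were produced; the function name `span_prf` and the example spans are assumptions. The offsets correspond to the query "cheapest airfare from tacoma to orlando", with one deliberately wrong prediction.

```python
def span_prf(gold_spans, predicted_spans):
    """Return (precision, recall, f1) for exact-match entity spans."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Spans are (start_char, end_char, label); the second prediction is off by one.
gold = [(22, 28, 'SOURCE_LOC'), (32, 39, 'DESTINATION_LOC')]
predicted = [(22, 28, 'SOURCE_LOC'), (32, 38, 'DESTINATION_LOC')]
precision, recall, f1 = span_prf(gold, predicted)
```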

#### Data preprocessing for NER

The data is already clean and doesn't contain nulls. Stop word removal cannot be applied, since it would remove the words the rule-based entity recognition relies on ('from', 'in', 'to' etc.).

Hence, we apply only basic preprocessing: removing punctuation, converting the words to lower case and performing lemmatization with NLTK.

In [67]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
import re
import spacy
from spacy.lang.en import English
from spacy import displacy


nlppipeline = spacy.load('en_core_web_sm')

# Getting the pipeline component
ner = nlppipeline.get_pipe("ner")
nlp = spacy.load("en_core_web_sm")

optimizer = nlp.resume_training()


[nltk_data] Downloading package wordnet to C:\Users\Dharani
[nltk_data]     Rayadurgam\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [68]:
def data_preprocessing(df):
    logging.info('Preprocessing for NER')
    input_convos = df['Input_Queries']

    # Removing punctuation
    df['final_processed_data'] = input_convos.map(lambda x: re.sub(r'[,.!?]', '', x))

    # Converting the punctuation-free text to lowercase (chained on the previous step)
    df['final_processed_data'] = df['final_processed_data'].map(lambda x: x.lower())
    lemmatizer = WordNetLemmatizer()
    final_processed_data = []
    for data in df['final_processed_data']:
        # Note: lemmatize() treats its argument as a single word, so a whole
        # query passed in one call is returned essentially unchanged
        lemmatized_data = lemmatizer.lemmatize(data)
        final_processed_data.append(lemmatized_data)

    df['final_processed_data'] = final_processed_data
    #df.head()
    return df
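
Each cleaning step has to take the previous step's output as its input, otherwise the last assignment silently discards the earlier ones. The sketch below shows the chaining on a single string, using only the standard library; the function name `normalize_query` is hypothetical, and lemmatization is left out so the example runs without NLTK data.

```python
import re


def normalize_query(text):
    """Remove basic punctuation, then lowercase, keeping stop words."""
    without_punctuation = re.sub(r'[,.!?]', '', text)
    return without_punctuation.lower()


cleaned = normalize_query('What flights are available, from Pittsburgh to Baltimore?')
```

Stop words such as 'from' and 'to' are kept on purpose, since the rule-based recognizer depends on them.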

##### NER using spacy pipeline
Getting the named entities with the spaCy pipeline 'en_core_web_sm' for all entity labels other than GPE (location)

In [69]:
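def nlp_ner(query):
    logging.info('Getting NER using spacy pipeline')
    doc = nlp(query)
    ners = {}
    for ent in doc.ents:
        # GPE (location) entities are handled by the source/destination models
        if ent.label_ != 'GPE':
            ners[ent.label_] = ent.text
    # Pass the dict as a %s argument so the logging module can format it
    logging.info('Obtained NERs from spacy pipeline: %s', ners)
    return ners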
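
The function above keeps one value per label, so a later entity replaces an earlier one with the same label (for example two TIME entities in one query). If all values are needed, they could be collected in lists, as in this sketch. It works on plain `(label, text)` pairs such as those obtained from `doc.ents`; the helper name `group_entities` and its `skip_labels` parameter are hypothetical.

```python
from collections import defaultdict


def group_entities(labelled_spans, skip_labels=('GPE',)):
    """Group entity texts by label, keeping every occurrence."""
    grouped = defaultdict(list)
    for label, text in labelled_spans:
        if label not in skip_labels:
            grouped[label].append(text)
    return dict(grouped)


# Made-up entity pairs standing in for [(ent.label_, ent.text) for ent in doc.ents]
spans = [('GPE', 'boston'), ('TIME', '838 am'),
         ('GPE', 'denver'), ('TIME', '1110 in the morning')]
entities = group_entities(spans)
```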

##### Model 1  Rule based entity recognizer using spacy
Using these rules to distinguish the source and destination locations and adding them to the dataframe

In [71]:

def rule_based_ner(df):
    logging.debug('Entering NER rule based training ')
    data = df['final_processed_data']
    sourceloc = ''
    destloc = ''
    destination = []
    source = []
    source_data = []
    destination_data = []
    for doc in data:
        words_list = doc.split()

        if ' from ' in doc:
            sourceloc = words_list[words_list.index('from') + 1]
        elif ' leaving ' in doc:
            sourceloc = words_list[words_list.index('leaving') + 1]
        else:
            doc = doc + ' na'
            sourceloc = 'na'
        nlpner = English()
        ruler = nlpner.add_pipe("entity_ruler")
        source_rules = [{"label": "source", "pattern": [{"LOWER": sourceloc}]}]
        sourceloc = ''
        ruler.add_patterns(source_rules)

        if ' in ' in doc:
            destloc = words_list[words_list.index('in') + 1]
        elif ' to ' in doc:
            destloc = words_list[words_list.index('to') + 1]
        else:
            destloc = 'na'
            doc = doc + ' na'
        dest_rules = [{"label": "destination", "pattern": [{"LOWER": destloc}]}]
        ruler.add_patterns(dest_rules)
        destloc = ''
        sourceloc = ''
        doc1 = nlpner(doc)
        for entity in doc1.ents:
            if entity.label_ == 'source':
                source.append(entity.text)
                # Use spaCy's character offsets; doc.index() returns the first
                # matching substring, which may not be this entity
                source_data.append((doc, {'entities': [(entity.start_char,
                                                        entity.end_char,
                                                        'SOURCE_LOC')]}))
                break
        for entity in doc1.ents:
            if entity.label_ == 'destination':
                destination.append(entity.text)
                destination_data.append((doc, {'entities': [(entity.start_char,
                                                             entity.end_char,
                                                             'DESTINATION_LOC')]}))
                break


    df['source'] = source
    df['destination'] = destination
    logging.debug('Completed NER rule based entity recognition')
    return source_data,destination_data
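
The keyword rules used above can be sketched without spaCy as follows: the word after 'from' or 'leaving' is taken as the source, the word after 'in' or 'to' as the destination, and 'na' is used when no keyword is found. This is a simplified illustration under those assumptions, not the function above; the name `find_locations` is hypothetical.

```python
def find_locations(query):
    """Return (source, destination) from keyword rules; 'na' when absent."""
    words = query.lower().split()

    def word_after(keywords):
        # Keywords are tried in order; the first one present in the query wins.
        for keyword in keywords:
            if keyword in words:
                position = words.index(keyword) + 1
                if position < len(words):
                    return words[position]
        return 'na'

    source = word_after(['from', 'leaving'])
    destination = word_after(['in', 'to'])
    return source, destination


simple = find_locations('cheapest airfare from tacoma to orlando')
missing = find_locations('show me the fares')
ambiguous = find_locations('i want to fly from boston to denver')
```

The last call shows a limit of the heuristic: only the first occurrence of a keyword is used, so the infinitive 'to fly' is picked up before 'to denver'.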


##### Inference from the above output
Executing the above methods properly distinguishes the source and destination of each input query, labels them with the rule-defined labels source and destination, and produces the training data required for the custom NER models.

Printing the dataset with the source and destination appended

In [72]:
logging.info('Performing NER preprocessing and rule based')
data_preprocessing(df)
source_data, destination_data = rule_based_ner(df)
df.head()

Unnamed: 0,Intents,Input_Queries,final_processed_data,source,destination
0,atis_flight,i want to fly from boston at 838 am and arriv...,i want to fly from boston at 838 am and arriv...,boston,denver
1,atis_flight,what flights are available from pittsburgh to...,what flights are available from pittsburgh to...,pittsburgh,baltimore
2,atis_flight_time,what's the arrival time in sanfrancisco for t...,what's the arrival time in sanfrancisco for t...,washington,sanfrancisco
3,atis_airfare,cheapest airfare from tacoma to orlando,cheapest airfare from tacoma to orlando,tacoma,orlando
4,atis_airfare,round trip fares from pittsburgh to philadelp...,round trip fares from pittsburgh to philadelp...,pittsburgh,philadelphia


##### Model 2 Custom NER modelling
Using the labels derived above, we train spaCy custom NER models with the minibatch approach and track the loss over 30 iterations.
The training data is stored in the format spaCy custom NER requires: source_train_data with the custom entity 'SOURCE_LOC' and destination_train_data with the custom entity 'DESTINATION_LOC'.
In each of the 30 iterations the data is shuffled and trained in batches, so that the model doesn't memorize the labels as they are.
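
To make the batching concrete, here is a pure-Python approximation of a compounding batch-size schedule and the batching it drives. It is a sketch modelled on the documented behaviour of `spacy.util.compounding` and `spacy.util.minibatch`, not spaCy's own code; `compounding_sizes` and `make_batches` are hypothetical names, and the parameters (5.0, 30.0, 1.001) and the 139 examples mirror the values used in this notebook.

```python
import itertools


def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def make_batches(items, sizes):
    """Cut items into consecutive batches whose lengths follow sizes."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, int(next(sizes))))
        if not batch:
            return
        yield batch


examples = list(range(139))  # stand-in for 139 training examples
batches = list(make_batches(examples, compounding_sizes(5.0, 30.0, 1.001)))
batch_lengths = [len(batch) for batch in batches]
```

With a rate of 1.001 the size grows so slowly that every batch in a 139-example pass still holds 5 examples (the last one holds the remaining 4); a larger rate would be needed for the batch size to approach 30.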


In [73]:
import spacy
nlppipeline=spacy.load('en_core_web_sm')

# Getting the pipeline component
ner=nlppipeline.get_pipe("ner")
import random
from spacy.util import minibatch, compounding
from pathlib import Path
from spacy.training import Example

In [74]:

def ner_train_test_split(source_data,destination_data):
    logging.info('Splitting the data to train and test')
    n = len(source_data)
    print('Total data length: ', n)
    train_data_size = n * 0.7
    test_data_size = n * 0.3
    source_train_data = source_data[0:int(train_data_size)]
    source_test_data = source_data[int(train_data_size):]
    destination_train_data = destination_data[0:int(train_data_size)]
    destination_test_data = destination_data[int(train_data_size):]
    print('source split: ', len(source_train_data), len(source_test_data))
    print('destination split: ', len(destination_train_data), len(destination_test_data))
    return source_train_data,source_test_data,destination_train_data,destination_test_data
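
The same slicing can be written with the training fraction as a parameter, which covers both the 70-30 and the 50-50 split mentioned earlier. This is a sketch; `split_examples` and its optional `seed` parameter (for shuffling before the cut) are assumptions, whereas the function above slices the data in file order.

```python
import random


def split_examples(examples, train_fraction=0.7, seed=None):
    """Return (train, test); shuffle first only when a seed is given."""
    items = list(examples)
    if seed is not None:
        random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


train_70, test_70 = split_examples(range(199), 0.7)
train_50, test_50 = split_examples(range(199), 0.5)
```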

In [85]:

def custom_modelling(label, traindata):
    logging.debug('Entering NER custom training model')

    output_path = ''
    for _, annotates in traindata:
        for ent in annotates.get("entities"):
            ner.add_label(ent[2])
    # Disabling the components other than the required ones
    unaffected_pipelines = [pipeline for pipeline in nlppipeline.pipe_names if
                            pipeline not in ["ner", "trf_wordpiecer", "trf_tok2vec"]]

    # Model training with 30 iterations, reshuffling each time so the model doesn't memorize the data
    with nlppipeline.disable_pipes(*unaffected_pipelines):

        for iteration in range(30):
            random.shuffle(traindata)
            losses = {}
            #  using spaCy's minibatch to batch up the train data
            allbatches = minibatch(traindata, size=compounding(5.0, 30.0, 1.001))
            for eachbatch in allbatches:
                for txt, annotates in eachbatch:
                    doc = nlppipeline.make_doc(txt)
                    example = Example.from_dict(doc, annotates)
                    # Running nlppipeline.update to adjust the weights
                    nlppipeline.update([example], losses=losses, drop=0.3)
                    # print(losses)

    # Saving the model to a path named after the label so that it can be loaded from the same path again

    output_path = Path(label)
    # Pass the path as a %s argument so the logging module can format it
    logging.info("Saving the model to %s", output_path)
    nlppipeline.to_disk(output_path)
    logging.debug('Leaving the NER custom model')


In [86]:

def custom_spacy_ner(source_train_data,destination_train_data):
    logging.info('Training NER custom model for source entity')
    custom_modelling('source', source_train_data)
    logging.info('Training NER custom model for destination entity')
    custom_modelling('destination', destination_train_data)


In [81]:
source_train_data, source_test_data, destination_train_data, destination_test_data = ner_train_test_split(source_data, destination_data)

Total data length:  199
source split:  139 60
destination split:  139 60


In [87]:
custom_spacy_ner(source_train_data, destination_train_data)


#### NER inference
Using the spaCy NER pipeline, rule-based entity recognition and custom NER modelling, we were able to obtain the required named entities for the given query

In [28]:
def get_ner(query):
    logging.debug('Entering get_ner')
    source = ''
    dest = ''
    all_entities = nlp_ner(query)
    # Each custom model was saved to a directory named after its entity type
    for output_dir in ['source', 'destination']:
        logging.info('Loading from %s', output_dir)
        nlp2 = spacy.load(output_dir)
        logging.info(query)
        doc2 = nlp2(query)

        for ent in doc2.ents:
            print(ent.label_, ent.text)
            if ent.label_ == 'SOURCE_LOC':
                source = ent.text
            if ent.label_ == 'DESTINATION_LOC':
                dest = ent.text

    location_entities = {'source': source, 'dest': dest}
    # Dict unpacking instead of the | operator, which needs Python 3.9+
    result = {**all_entities, **location_entities}
    logging.info(result)
    return result

### Using Heuristics for Dialogue flow Manager

For the dialogue flow we experimented with several models, namely an RNN (Recurrent Neural Network) and an LSTM (Long Short-Term Memory) network, including an encoder/decoder setup. Our dataset has no responses for the questions being asked, so it was difficult to generate text with these models: without predefined responses and enough data, the RNN/LSTM models could not be trained properly.

Hence we decided to proceed with a rule-based (heuristic) approach for the dialogue manager. It takes the intent and entities of the given query and uses heuristics to generate the response.

Since our dataset has no outputs for the input queries, we defined a reply format based on the intents and entities and fill in the other values, such as flight names, flight fares and aircraft types, with random values.

Normally, the exact flight names, timings, fares etc. would be kept in a database; using our intents and entities, the exact values could be fetched with queries and used in place of the random values we display.
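
A minimal sketch of such a lookup, using an in-memory SQLite database from the standard library, is shown here. The table name `flights`, its columns and the rows are all made up for illustration; no such database exists in this project.

```python
import sqlite3


def find_flights(connection, source, dest):
    """Return (airline, departure_hour, fare_usd) rows for a route."""
    return connection.execute(
        'SELECT airline, departure_hour, fare_usd FROM flights '
        'WHERE source = ? AND dest = ? ORDER BY departure_hour',
        (source, dest),
    ).fetchall()


# Hypothetical schema and sample rows, for illustration only.
connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE flights (airline TEXT, source TEXT, dest TEXT, '
    'departure_hour INTEGER, fare_usd INTEGER)'
)
connection.executemany(
    'INSERT INTO flights VALUES (?, ?, ?, ?, ?)',
    [
        ('United Airlines', 'pittsburgh', 'baltimore', 18, 250),
        ('Alaska Airlines', 'pittsburgh', 'baltimore', 8, 300),
        ('JetBlue', 'tacoma', 'orlando', 9, 450),
    ],
)
matches = find_flights(connection, 'pittsburgh', 'baltimore')
```

The source and destination entities returned by the NER step would be passed as the query parameters, and the rows would replace the randomly sampled values in the reply.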

As it is not a model, it doesn't keep any history of previous conversations yet. We could, however, add a mechanism that keeps the conversation history and checks whether the user steers the conversation to a topic the bot is not trained for.
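
One possible shape for such a mechanism is sketched below: a bounded list of past turns from which a missing entity can be recovered, plus a simple scope check on the intent. This is an assumption about how it could be done, not something the bot implements; `ConversationHistory`, `is_in_scope` and `SUPPORTED_INTENTS` are hypothetical names.

```python
from collections import deque

# Intents the dialogue manager has replies for.
SUPPORTED_INTENTS = {'greetings', 'farewell', 'atis_flight', 'atis_flight_time',
                     'atis_airline', 'atis_airfare', 'atis_aircraft'}


class ConversationHistory:
    """Keep the most recent turns of one conversation."""

    def __init__(self, max_turns=10):
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, query, intent, entities):
        self.turns.append({'query': query, 'intent': intent,
                           'entities': dict(entities)})

    def last_known(self, key):
        """Return the most recent non-empty value of an entity, or None."""
        for turn in reversed(self.turns):
            value = turn['entities'].get(key)
            if value:
                return value
        return None


def is_in_scope(intent):
    return intent in SUPPORTED_INTENTS


history = ConversationHistory()
history.add_turn('flights from boston to denver', 'atis_flight',
                 {'source': 'boston', 'dest': 'denver'})
history.add_turn('and the fares', 'atis_airfare', {'source': '', 'dest': ''})
remembered_source = history.last_known('source')
```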

#### Including greetings and farewell in the dialogue flow

In [101]:
import json
from flask import Flask,request,jsonify
from werkzeug.wrappers import Request, Response
from werkzeug.serving import run_simple

app = Flask(__name__)


In [102]:
greetings= ["hi", "hello", "hey", "helloo", "hellooo", "g morining", "gmorning", "good morning",
            "morning", "good day", "good afternoon", "good evening", "greetings",
            "greeting", "good to see you", "its good seeing you", "how are you",
            "how're you", "how are you doing", "how ya doin'", "how ya doin",
             "how is you", "how's you", "how is it going", "how's it going", "how's it goin'",
            "how's it goin", "what is up",  "g’day", "howdy"]
# Entries are lowercase because the input is lowercased before the lookup
farewell=["thank you","bye", "good day","good bye","thanks","have a good day","cheers","thank you so much for your help","see you",
         "bye-bye"]

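An exact-match lookup in these lists is sensitive to case, punctuation and apostrophe style. The sketch below normalizes the input before the lookup; the short sets are a subset chosen for illustration, and `normalize` and `classify_smalltalk` are hypothetical helpers rather than part of the bot.

```python
import re

# Small illustrative subsets of the greeting and farewell phrases.
GREETINGS = {'hi', 'hello', 'good morning', "g'day"}
FAREWELLS = {'bye', 'thank you', 'thanks', 'see you'}


def normalize(text):
    """Lowercase, unify apostrophes and drop other punctuation."""
    text = text.replace('\u2019', "'").lower()
    text = re.sub(r"[^a-z' ]", ' ', text)
    return ' '.join(text.split())


def classify_smalltalk(text):
    """Return 'greetings', 'farewell' or None for other queries."""
    cleaned = normalize(text)
    if cleaned in GREETINGS:
        return 'greetings'
    if cleaned in FAREWELLS:
        return 'farewell'
    return None
```

For example, `classify_smalltalk('Thank you.')` is matched as a farewell even though the stored phrase has no capital letter or full stop.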

In [103]:
import random

data = {"text": "Hi",
    "Intents": "atis_flight",
        "Entities": {"source":"buffalo", "dest":"oakland"}}
logging.info('Collecting data to display in the bot output')


flights = ['Panam Airlines', 'American Airlines', 'Virgin Airlines', 'United Airlines', 
                'Breeze Airlines','Alaska Airlines','Frontier Airlines', 'JetBlue']

aircraft_types = ['Airbus A320 family', 'Boeing 737 NG', 'Boeing 777','Airbus A330','Boeing 747','Airbus A319']

flight_time = list(range(1,13))

flight_fares = list(range(100,1000,50))


    
@app.route('/chat',methods=['POST'])
def get_dialogue(input_query=None):
    # Flask calls this view without arguments, so the query is read from the
    # posted form field 'msg'; a direct call can pass input_query instead
    if input_query is None:
        input_query = request.form.get('msg', '')
    logging.debug('Entering dialogue flow manager')
    logging.info('Checking greetings and farewell intents')
    if input_query.lower() in greetings:
        intents='greetings'
        entities={}
    elif input_query.lower() in farewell:
        intents='farewell'
        entities={}
    else:
        logging.info('Checking other intents by calling get_intents and fetching the NERs from get_ner method')
        intents = 'atis_flight'#get_intents(input_query)
        entities= get_ner(input_query)

    # get_ner returns '' for a location it could not find; treat that as missing
    source = entities.get('source') or None
    dest = entities.get('dest') or None
    has_route = source is not None and dest is not None

    if intents == 'greetings':
        output='Hello! Nice meeting you! How can I help'
    elif intents== 'farewell':
        output='Thank you for using ATIS chat bot! Have a good day'

    elif intents == 'atis_flight':
        if has_route:
            random_flights = random.sample(flights, 3)
            random_time = random.sample(flight_time, 3)
            output = "Below are the flights from "+source+" to "+dest+" :"+random_flights[0]+" at "+str(random_time[0])+"PM, "+random_flights[1]+" at "+str(random_time[1])+"PM, "+random_flights[2]+" at "+str(random_time[2])+"PM."
        else:
            output = "Sorry!! Please enter the source and destination in your details"

    elif intents == 'atis_flight_time':
        if has_route:
            x = "Here are the timings and flights available for today "
            random_flights = random.sample(flights, 3)
            random_time = random.sample(flight_time, 3)
            random_time.sort()
            output = x+"from "+source+" to "+dest+": "+random_flights[0]+" at "+str(random_time[0])+"PM, "+random_flights[1]+" at "+str(random_time[1])+"PM, "+random_flights[2]+" at "+str(random_time[2])+"AM."
        else:
            output = "Sorry!! Please enter the source and destination in your details"

    elif intents == 'atis_airline':
        if has_route:
            x = "Here are the flights available"
            random_flights = random.sample(flights, 4)
            output = x+" from "+source+" to "+dest+": "+random_flights[0]+", "+random_flights[1]+", "+random_flights[2]+", "+random_flights[3]+"."
        else:
            output = "Sorry!! Please enter the source and destination in your details"

    elif intents == 'atis_airfare':
        if has_route:
            # Trailing space added so the message does not run into "from"
            x = "These are the flights and prices I could find "
            random_flights = random.sample(flights, 4)
            random_fares = random.sample(flight_fares, 4)
            output = x+"from "+source+" to "+dest+": "+random_flights[0]+": "+str(random_fares[0])+" USD, "+random_flights[1]+": "+str(random_fares[1])+" USD, "+random_flights[2]+": "+str(random_fares[2])+" USD, "+random_flights[3]+": "+str(random_fares[3])+" USD."
        else:
            output = "Sorry!! Will you please enter source and destination in the details?"

    elif intents == 'atis_aircraft':
        if has_route:
            x = "Here is the list of aircraft types used "
            random_types = random.sample(aircraft_types, 4)
            # Each sampled type is listed once
            output = x+"from "+source+" to "+dest+": "+random_types[0]+", "+random_types[1]+", "+random_types[2]+", "+random_types[3]+"."
        else:
            output = "Sorry!! Can you please enter source and destination in the details?"

    else:
        output = "Sorry, I didn't understand. I can only help you with Flight Details, Flight Fares, Flight timings and types of Flights. Please enter the correct information."
    logging.info(output)
    return output
    

In [None]:
if __name__ == '__main__':
    run_simple('127.0.0.1', 8000, app)

#### CI/CD implementation

##### Continuous Integration using GIT

Our GitHub URL: https://github.com/DharaniRK/NLP_group (a private repository; only our team members are collaborators)

We used GitHub as our continuous integration tool since we are a team of 5 and each team member changes the code frequently. We committed the initial version of the code to the default (master) branch, from which we created individual branches per component or per person.

Since GitHub keeps track of the commit history, it was extremely useful when we needed to rebase or revert a commit or merge code among us.

We raised pull requests and merged the code into the master branch regularly to avoid merge conflicts and missing changes in the codebase.


##### Continuous deployment using Docker

We used Docker to deploy our web application, since Docker is a widely used deployment technique for microservice web application architectures.

The base image we used is python:3.8-slim-buster, which comes with a Linux operating system, Python 3.8 and pip3 already installed.

The Docker image for our application is built with the whole codebase copied to the /app path in the container, and the flask run command is executed.

On docker run, the container is created and acts as the chatbot. It can be accessed with curl commands to localhost:8000.
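
As an alternative to curl, the running container could be queried from Python. The sketch below builds a form-encoded POST to the `/chat` route with the `msg` field that the route reads; it assumes the container is reachable at `http://localhost:8000`, and the helper names `build_chat_request` and `ask_bot` are hypothetical.

```python
import urllib.parse
import urllib.request


def build_chat_request(message, base_url='http://localhost:8000'):
    """Build a form-encoded POST request for the /chat route."""
    body = urllib.parse.urlencode({'msg': message}).encode('utf-8')
    return urllib.request.Request(base_url + '/chat', data=body, method='POST')


def ask_bot(message, base_url='http://localhost:8000'):
    """Send the message to a running chatbot container and return its reply."""
    with urllib.request.urlopen(build_chat_request(message, base_url)) as response:
        return response.read().decode('utf-8')


# Building the request does not need a running server; ask_bot() does.
chat_request = build_chat_request('flights from boston to denver')
```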

In [38]:
get_dialogue('hello')

'Hello!!! Nice meeting you! How can I help'

In [39]:
get_dialogue('thanks')

'Thank you for using ATIS chat bot! Have a good day'

In [47]:
get_dialogue(' whats the smallest plane that flies from pittsburgh to baltimore on eight sixteen')

Loading from source
 whats the smallest plane that flies from pittsburgh to baltimore on eight sixteen
SOURCE_LOC pittsburgh
Loading from destination
 whats the smallest plane that flies from pittsburgh to baltimore on eight sixteen
DESTINATION_LOC baltimore
{'TIME': 'eight sixteen', 'source': 'pittsburgh', 'dest': 'baltimore'}


'Below are the flights from pittsburgh to baltimore :Alaska Airlines at 8PM, Virgin Airlines at 9PM, United Airlines at 6PM.'