# 1. Intro

**FAQ-style retrieval-based chat bot**
- Train three model types (in different configurations) to classify user input and map it to a response class, then compare how well each one does. The models are:
    - a TF-IDF similarity-measure document classifier
    - a TF-IDF-based n-gram MLP multi-class classifier (supervised)
    - an RNN classifier (supervised)

**The Data**
- Pull data from known disease/pandemic authorities such as the CDC and WHO

- Also pull KE national government content. These are static data: knowledge already in place. TODO: a channel for news updates

- The data is maintained in a Gsheet, so updates, additions, etc. can be made from there

- Clean and classify the above data into two datasets
    - FAQ_db: the knowledge base. A one-to-one mapping of class categories to response paragraphs. Two main fields: class_category, response_p. Additional fields: src, src_link
    - Phrases_db: the training set of questions/inputs that users may present to the bot. Two main fields: input_phrase, class_category
    
**Approach**
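- Retrieval-based chat bot: classify the user's input into a class_category, then return that category's response_p from FAQ_db.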
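
A minimal sketch of the lookup half of this approach, assuming a classifier has already produced a `class_category`. The dictionary layout, the placeholder strings, and the `respond` helper are illustrative assumptions, not the project's actual code:

```python
# Toy stand-in for FAQ_db: one entry per class_category.
# The field values are placeholders, not real knowledge-base content.
faq_db = {
    "pandemic_define": {
        "response_p": "<response paragraph for pandemic_define>",
        "src": "<source>",
        "src_link": "<link>",
    },
    "pets_infection_cdc": {
        "response_p": "<response paragraph for pets_infection_cdc>",
        "src": "<source>",
        "src_link": "<link>",
    },
}

# Toy stand-in for Phrases_db: (input_phrase, class_category) training pairs.
phrases_db = [
    ("pandemic", "pandemic_define"),
    ("make cat dog pet sick", "pets_infection_cdc"),
]

FALLBACK = "Sorry, I do not have an answer for that yet."


def respond(class_category, knowledge_base, fallback=FALLBACK):
    """Return the response paragraph for a predicted class, or a fallback."""
    entry = knowledge_base.get(class_category)
    return entry["response_p"] if entry else fallback


answer = respond("pandemic_define", faq_db)
unknown = respond("not_a_known_class", faq_db)
```

Keeping the response text in the knowledge base, and only class labels in the training set, means a response can be reworded without retraining the classifier.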


# 2. Corona Dashboard by Johns Hopkins University

[Link to map FAQ](https://coronavirus.jhu.edu/map-faq)

In [1]:
### Johns Hopkins Dashboard - https://coronavirus.jhu.edu/map.html
from IPython.display import IFrame
## default 77.3846,11.535 
start_coordz = "77.3846,11.535"  # rabat, morocco"33.9693414,-6.9273026"
center_coordz = "28.8189834,-2.5117154" #center Bukavu, DRC "-2.5117154,28.8189834"

IFrame(src="//arcgis.com/apps/Embed/index.html?webmap=14aa9e5660cf42b5b4b546dec6ceec7c&extent="+start_coordz+",163.5174,52.8632"+
       "&center="+center_coordz+
       "&zoom=true&previewImage=false&scale=true&disable_scroll=true&theme=light", 
    width="650", height="400", frameborder="0", scrolling="no", marginheight="0", marginwidth="0", title="2019-nCoV" )

# 3. FAQ Chat bot - Part 2

- Try different models
- Fine-tune hyperparameters
- Save the best fit

**Recall:** files written by the earlier preprocessing run
- INFOR   : 2020-04-06 17:38:00.251162 [dataSource.writeTo] dpath = cleaned_phrases.xftz
- INFOR   : 2020-04-06 17:38:00.252160 [dataSource.writeTo] dpath = cleaned_phrases.ylbz
- INFOR   : 2020-04-06 17:38:00.253182 [dataSource.writeTo] dpath = cleaned_phrases.faqdb
- INFOR   : 2020-04-06 17:38:00.253182 [dataSource.writeTo] dpath = cleaned_phrases.phrdb

In [2]:
# TODO: 
## Don't know why the stop words are failing and yet the last stop has the correct list :( 
## set a seed at shuffle dataset for train ,test split 
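
For the seeded split noted in the TODO, one possible sketch uses a local random generator so the shuffle is reproducible. `train_test_split_seeded` and the toy phrases are hypothetical and not part of `zdataset`:

```python
import random


def train_test_split_seeded(features, labels, test_fraction=0.2, seed=42):
    """Shuffle paired features/labels reproducibly, then split them."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    # A local Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data for illustration only.
phrases = ["pandemic", "cause pandemic", "mean declared pandemic",
           "make cat dog pet sick", "pet get tested"]
classes = ["pandemic_define", "pandemic_causes", "pandemic_WHO",
           "pets_infection_cdc", "pets_testing"]

x_train, x_test, y_train, y_test = train_test_split_seeded(
    phrases, classes, test_fraction=0.2, seed=7)
```

Shuffling indices rather than the lists themselves keeps each phrase aligned with its label, which sidesteps the index-reset problem when a new dataset object is built from the shuffled data.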

## 3.1. Vectorize
- Do TF-IDF vectorization. The result can be used with both similarity-based document classification and the n-gram MLP.


In [3]:
import numpy as np
import pandas as pd

import nltk

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [4]:
import sys
sys.path.append("../../../shared") 
import zdataset, zdata_source
from zdataset import ZGsheetFaqDataSet


import warnings
warnings.filterwarnings('ignore')

In [5]:
## plot settings
params = {
    'font.size' : 14.0,
    'figure.figsize': (20.0, 12.0),
    'figure.dpi' : 40
}
plt.rcParams.update(params)
plt.style.use('fivethirtyeight') #tableau-colorblind10 ggplot

In [6]:
### Load the preprocessed data 
dpath = "cleaned_phrases"
dset = ZGsheetFaqDataSet()
dset.dumpLoad( dpath, zdata_source.zSERIALIZED)


INFOR   : 2020-04-06 21:03:16.115361 [zdataset.dumpLoad] Loaded cleaned_phrases.xftz of size 141
INFOR   : 2020-04-06 21:03:16.118353 [zdataset.dumpLoad] Loaded cleaned_phrases.xidx of size 141
INFOR   : 2020-04-06 21:03:16.119349 [zdataset.dumpLoad] Loaded cleaned_phrases.ylbz of size 141
INFOR   : 2020-04-06 21:03:16.120346 [zdataset.dumpLoad] Not Found: cleaned_phrases.argz
INFOR   : 2020-04-06 21:03:16.121344 [zdataset.dumpLoad] Loaded cleaned_phrases.faqdb of size 92
INFOR   : 2020-04-06 21:03:16.122341 [zdataset.dumpLoad] Loaded cleaned_phrases.phrdb of size 141
INFOR   : 2020-04-06 21:03:16.122341 [zdataset.dumLoad] FINISHED: clean data size 141


In [7]:
print("Example Featurez: {}".format( dset.clean_data[:3] ) )
print("Example Label: {}".format( dset.y_labelz[:3] ) )

print("Example Featurez: {}".format( dset.clean_data[-1:] ) )
print("Example Label: {}".format( dset.y_labelz[-1:] ) )

Example Featurez: ['pandemic' 'cause pandemic' 'mean declared pandemic']
Example Label: ['pandemic_define' 'pandemic_causes' 'pandemic_WHO']
Example Featurez: ['make cat dog pet sick']
Example Label: ['pets_infection_cdc']


In [8]:
#### TODO: Split training and validation <<< b/c @ shuffling and resetting index at new dset obje
#train

In [9]:
## TFIDF encode 
# ZENC_TFIDF is default 
encoder_tfidf, encoded_featurez = dset.encodeTrain(enc_type=zdataset.ZENC_TFIDF, ngramz=(2,2)) 
print("Example Encoding: {}".format( encoded_featurez[: 3] ) ) 
print("Example Encoding: {}".format( encoded_featurez[-1:] ) ) 

Example Encoding:   (1, 24)	1.0
  (2, 68)	0.7071067811865476
  (2, 173)	0.7071067811865476
Example Encoding:   (0, 205)	0.533923657905307
  (0, 167)	0.533923657905307
  (0, 75)	0.46360061208869924
  (0, 23)	0.46360061208869924
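
The printed encoding has no entries for row 0, which is consistent with a bigram-only setting: the single-word phrase 'pandemic' contains no bigrams. The weights 1.0 and 0.7071 are also what L2-normalised rows with one and two equally weighted bigrams would give. The sketch below reproduces that pattern in plain Python. It assumes `ngramz=(2,2)` means word bigrams and uses the smoothed IDF formula that scikit-learn applies by default; `bigrams` and `tfidf_bigrams` are hypothetical helpers, not the `zdataset` encoder:

```python
import math
from collections import Counter


def bigrams(text):
    """Return the word bigrams of a whitespace-tokenised phrase."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_bigrams(docs):
    """L2-normalised TF-IDF over word bigrams, one {bigram: weight} dict per doc."""
    grams = [bigrams(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each bigram occurs in.
    doc_freq = Counter(g for doc in grams for g in set(doc))
    rows = []
    for doc in grams:
        term_freq = Counter(doc)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1 (assumed, see lead-in).
        weights = {
            g: count * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)
            for g, count in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({g: w / norm for g, w in weights.items()} if norm else {})
    return rows


docs = ["pandemic", "cause pandemic", "mean declared pandemic"]
rows = tfidf_bigrams(docs)
# rows[0] is empty, rows[1] holds one bigram, rows[2] holds two.
```

One consequence for prediction: any cleaned input that reduces to a single word, such as 'corana' or 'vaccine', would encode to an all-zero vector under a bigram-only scheme and could not match anything.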


## 3.2. Cosine Similarity Model

In [10]:
from zmodel_cosine_similarity import ZCosineSimilarity

In [11]:
## setup 
model = ZCosineSimilarity('TFIDF_Cosine')
model.build( dset.context )
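
As a rough illustration of what a cosine-similarity classifier does, the sketch below scores a query vector against every stored phrase vector and picks the closest one. The sparse dict vectors, their unit weights, and the `cosine` and `most_similar` helpers are assumptions for illustration; the internals of `ZCosineSimilarity` are not shown in this notebook:

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(weight * b.get(feature, 0.0) for feature, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    # An all-zero vector has no direction, so treat its similarity as 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, corpus):
    """Return (index, score) of the corpus vector closest to the query."""
    scores = [cosine(query, doc) for doc in corpus]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy vectors with unit weights; real TF-IDF weights would differ.
corpus = [
    {"pet": 1.0, "sick": 1.0},
    {"mosquito": 1.0, "bite": 1.0},
    {"vaccine": 1.0},
]
labels = ["pets_infection_cdc", "covid19_spread_insects", "covid19_cure"]

query = {"mosquito": 1.0, "bite": 1.0, "infect": 1.0}
idx, score = most_similar(query, corpus)
predicted_label = labels[idx]
```

Returning the index rather than the label mirrors the way the validation cell looks up the category and response from a predicted index.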

In [12]:
## TODO: train_test_split. For now use below
val_txt = ['Is corana deadly',
           'What is corana',
           "Should my pet get tested",
           "Will mosquito bite infect me",
           "Can I vist my elderly parents",
           "Is there a vaccineS"]

val_ylabz = ['covid19_define',
             'covid19_define',
             'pets_testing',
             'covid19_spread_insects',
             'covid19_at_risk',
             'covid19_cure']


cleaned_txt = dset.preprocessPredict( val_txt)

for itx, otx in zip(val_txt, cleaned_txt):
    print("CLEANED: {} ===> {}".format(itx, otx))

CLEANED: Is corana deadly ===> corana deadly
CLEANED: What is corana ===> corana
CLEANED: Should my pet get tested ===> pet get tested
CLEANED: Will mosquito bite infect me ===> mosquito bite infect
CLEANED: Can I vist my elderly parents ===> vist elderly parent
CLEANED: Is there a vaccineS ===> vaccine


In [13]:
acc, predicted_yz = model.validate( cleaned_txt, val_ylabz)
print("Predicted Accuracy: {}".format( acc) ) 

for idx, txt, y in zip(predicted_yz, val_txt, val_ylabz):
    # unpack into `resp` so the second print below has a defined name
    cat, resp = dset.getPredictedAtIndex(idx)
    print("PREDICTED: {}:{} ===> {} for '{}' ".format(idx, y, cat, txt))
    print("\t{}\n".format(resp))

INFOR   : 2020-04-06 21:03:17.410896 [cosine.predict] IN: corana deadly


ValueError: Iterable over raw text documents expected, string object received.
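
The traceback is cut off, but this message matches the one scikit-learn's text vectorizers raise when they receive a bare string instead of a list of strings. The log line suggests `cosine.predict` is handed one cleaned phrase at a time, so a likely fix is to wrap that phrase in a list before it reaches the vectorizer. This is an assumption about code not shown here, and `as_documents` is a hypothetical helper:

```python
def as_documents(text_or_docs):
    """Return a list of documents, wrapping a single string in a list."""
    if isinstance(text_or_docs, str):
        return [text_or_docs]
    return list(text_or_docs)


single = as_documents("corana deadly")
many = as_documents(["corana deadly", "pet get tested"])
```

Normalising the input at the top of the predict method would let it accept either a single phrase or a batch.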

## 3.3. Multi-class MLP 

In [None]:
# mean of element-wise equality: the fraction of matching entries
(np.array([1, 2, 3]) == np.array([1, 2.0, 3.000])).mean()
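
The same mean-of-matches idea gives classification accuracy when applied to predicted and expected labels. A small plain-Python sketch follows; the `accuracy` helper and the predicted labels are made up for illustration:

```python
def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


expected = ["covid19_define", "pets_testing", "covid19_cure"]
predicted = ["covid19_define", "pets_infection_cdc", "covid19_cure"]  # illustrative
score = accuracy(predicted, expected)  # two of the three labels match
```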