In [2]:
from nlu_engine import EntityExtractor, crf
from nlu_engine import Analytics
from nlu_engine import RenderJSON
from nlu_engine import DataUtils
from nlu_engine import MacroEntityRefinement
from nlu_engine import MacroDataRefinement

from nlu_engine import NLUEngine
import nltk
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Macro NLU Data Refinement

It's a bit like the TV show [Severance](https://www.imdb.com/title/tt11280740/).

![Helly R and Mark S](https://media.npr.org/assets/img/2022/02/15/atv_severance_photo_010103-5f8033cc2b219ba64fe265ce893eae4c90e83896-s1100-c50.jpg "Helly R and Mark S")

*Helly R*: `My job is to scroll through the spreadsheet and look for the numbers that feel scary?`

*Mark S*: `I told you, you’ll understand when you see it, so just be patient.`

![MDR](https://www.imore.com/sites/imore.com/files/styles/large/public/field/image/2022/03/refinement-software-severance-apple-tv.jpg "Severance macrodata refinement")

*Helly R*: `That was scary. The numbers were scary.`

Hopefully the intents and entities that are wrong aren't scary, just a bit frustrating. Let's see if we can find the right ones.

NOTE: We will use logistic regression with TF-IDF features to train our intent models and CRFs for entity extraction. Why? They are very fast, and neither method is state-of-the-art. That is useful here: with weaker models it is easier to find the problems in the dataset that need refining than it would be with a proper NLU engine like Snips or something SOTA like BERT. It is very important to note that some of the problems we pick up on might not be actual issues in the data, but might be due to the limitations of the models. Refining the real problems and ignoring the limitations of the models is a good way to improve the models. Then, when the dataset is ready, we can use a more advanced NLU engine and get the best performance possible.

* Macro NLU Data Refinement: Intent
* Macro NLU Data Refinement: Entity
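
As a rough illustration of the kind of intent baseline the NOTE above describes, here is a minimal scikit-learn sketch of TF-IDF features feeding a logistic regression. It is not the `NLUEngine` implementation: the toy utterances, the intent labels, and the name `intent_pipeline` are all invented for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy utterances and intent labels, invented for illustration only.
utterances = [
    "wake me up at five am",
    "set an alarm for nine am",
    "play some music",
    "play that song again",
]
intents = ["alarm_set", "alarm_set", "music", "music"]

# TF-IDF turns each utterance into a sparse vector; logistic regression
# then learns one linear decision boundary per intent.
intent_pipeline = make_pipeline(
    TfidfVectorizer(), LogisticRegression(max_iter=1000))
intent_pipeline.fit(utterances, intents)

predictions = intent_pipeline.predict(
    ["set an alarm for five am", "play music"])
print(list(predictions))
```

A model this simple trains in seconds, which is what makes it practical to retrain after every round of refinement.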


Load the dataset

In [3]:
try:
    nlu_data_df = pd.read_csv(
        'data/refined/nlu_data_refined_df.csv', sep=',', index_col=0)
    print('Successfully loaded nlu_data_refined_df.csv')
except FileNotFoundError:
    # No refined copy yet: fall back to the original cleaned dataset
    data = 'data/NLU-Data-Home-Domain-Annotated-All-Cleaned.csv'
    nlu_data_df = DataUtils.load_data(data)

Successfully loaded nlu_data_refined_df.csv


Make sure `nlu_data_df['answer_normalised']` is derived from `nlu_data_df['answer_annotation']`.

In [4]:
nlu_data_df = DataUtils.convert_annotated_utterances_to_normalised_utterances(
    nlu_data_df)
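
To make the relationship between the two columns concrete, here is a small sketch of how an annotated utterance in the `[entity_type : words]` format seen in this dataset could be turned back into a plain utterance. The helper `strip_annotations` is hypothetical and is not the code behind `DataUtils.convert_annotated_utterances_to_normalised_utterances`.

```python
import re

# Matches "[entity_type : some words]" and captures only the words.
ANNOTATION_PATTERN = re.compile(r'\[\s*[^\]:]+?\s*:\s*([^\]]+?)\s*\]')


def strip_annotations(annotated_utterance):
    """Replace every [type : words] span with just its words."""
    return ANNOTATION_PATTERN.sub(r'\1', annotated_utterance)


annotated = 'i want to play that [media_type : music] one again'
normalised = strip_annotations(annotated)
print(normalised)
```

Utterances without any annotation pass through unchanged, so the helper is safe to apply to a whole column.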


For the next few steps, we drop the entries that were marked for removal (`remove == True`).

In [5]:
removed_nlu_data_refined_df = nlu_data_df[nlu_data_df['remove'] != True]

# Entity extraction report

The entity extraction could be greatly improved with better features. It would be great if someone would take a look at this. CRF features similar to the ones Snips uses, such as Brown clustering, would probably work better.
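
For readers unfamiliar with CRF features, the sketch below shows the general shape of a per-token feature function of the kind commonly fed to CRF taggers: one dictionary per token, describing the token and its neighbours. The feature names and the functions `token_features` and `utterance_features` are assumptions made for this example; they are not the features `nlu_engine` actually uses.

```python
def token_features(tokens, i):
    """Return a feature dict for tokens[i] (hypothetical feature set)."""
    word = tokens[i]
    features = {
        'bias': 1.0,
        'word.lower': word.lower(),
        'suffix3': word[-3:],
        'is_digit': word.isdigit(),
        'BOS': i == 0,                 # beginning of sentence
        'EOS': i == len(tokens) - 1,   # end of sentence
    }
    # Context features: the neighbouring words, when they exist
    if i > 0:
        features['prev.lower'] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        features['next.lower'] = tokens[i + 1].lower()
    return features


def utterance_features(tokens):
    """Build the feature sequence for one tokenised utterance."""
    return [token_features(tokens, i) for i in range(len(tokens))]


tokens = ['wake', 'me', 'up', 'at', 'five', 'am']
features = utterance_features(tokens)
print(features[4])
```

A richer feature such as a Brown cluster ID would simply be one more key in this dictionary, assuming a word-to-cluster mapping is available.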

The NLTK `punkt` tokenizer is needed to extract entities, so we download it if it is missing.

In [None]:
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

Because of the error described in [this GitHub issue](https://github.com/TeamHG-Memex/sklearn-crfsuite/issues/60), we have to use an older version of scikit-learn (`scikit-learn<0.24`); otherwise the latest version would work. Hopefully this gets fixed one day.

In [6]:
entity_report_df = NLUEngine.evaluate_entity_classifier(
    data_df=removed_nlu_data_refined_df, cv=4)

Evaluating entity classifier
Cross validating with CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100)




Time it took to cross validate CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100): 330.08361625671387


  _warn_prf(average, modifier, msg_start, len(result))


In [7]:
entity_report_df

|   | entity-type | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0.875369 | 0.953684 | 0.91285 | 64449.0 |
| 1 | alarm_type | 0.0 | 0.0 | 0.0 | 26.0 |
| 2 | app_name | 0.878788 | 0.460317 | 0.604167 | 63.0 |
| 3 | artist_name | 0.350877 | 0.285714 | 0.314961 | 560.0 |
| 4 | audiobook_author | 0.0 | 0.0 | 0.0 | 22.0 |
| 5 | audiobook_name | 0.0 | 0.0 | 0.0 | 235.0 |
| 6 | business_name | 0.319149 | 0.153374 | 0.207182 | 489.0 |
| 7 | business_type | 0.605714 | 0.305476 | 0.40613 | 347.0 |
| 8 | change_amount | 0.0 | 0.0 | 0.0 | 150.0 |
| 9 | coffee_type | 1.0 | 0.155556 | 0.269231 | 45.0 |
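
One way to read a report like this is to start with the entity types the tagger never gets right and rank them by how often they occur. The sketch below does that on a hand-copied subset of the rows shown above; `report_df` is a stand-in for `entity_report_df`, and the column names are assumed to match the output.

```python
import pandas as pd

# A few rows copied from the report above, used as a stand-in.
report_df = pd.DataFrame({
    'entity-type': ['alarm_type', 'app_name', 'artist_name',
                    'audiobook_author', 'audiobook_name', 'business_name',
                    'business_type', 'change_amount', 'coffee_type'],
    'f1-score': [0.0, 0.604167, 0.314961, 0.0, 0.0,
                 0.207182, 0.40613, 0.0, 0.269231],
    'support': [26, 63, 560, 22, 235, 489, 347, 150, 45],
})

# Entity types with an F1 of zero, most frequent first
zero_f1_df = report_df[report_df['f1-score'] == 0.0].sort_values(
    'support', ascending=False)
worst_first = zero_f1_df['entity-type'].tolist()
print(worst_first)
```

Types with both zero F1 and high support, such as `audiobook_name` here, are likely the most rewarding places to start refining, although a zero score can also reflect the model's limits rather than bad annotations.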


Optional: we can train a classifier and save the model to disk. For the default intent-refined dataset, a model has already been saved, so this training cell can be skipped and the saved model loaded in the cell after it.

In [None]:
crf_model = NLUEngine.train_entity_classifier(removed_nlu_data_refined_df)
model_path = 'models/analytics/entity_tagger.sav'
DataUtils.pickle_model(classifier=crf_model, model_path=model_path)

Let's load the saved model.

In [8]:
model_path = 'models/analytics/entity_tagger.sav'
crf_model = DataUtils.import_pickled_model(model_path)
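
The save and load helpers above presumably wrap Python's `pickle` module. The sketch below shows that round trip under that assumption: `save_model` and `load_model` are hypothetical names, and a plain dictionary stands in for the trained CRF.

```python
import os
import pickle
import tempfile


def save_model(model, model_path):
    """Serialise a model object to disk (hypothetical helper)."""
    with open(model_path, 'wb') as model_file:
        pickle.dump(model, model_file)


def load_model(model_path):
    """Load a previously pickled model (hypothetical helper)."""
    with open(model_path, 'rb') as model_file:
        return pickle.load(model_file)


# A dictionary stands in for the trained CRF in this sketch.
toy_model = {'algorithm': 'lbfgs', 'c1': 0.1, 'c2': 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'entity_tagger.sav')
    save_model(toy_model, toy_path)
    restored_model = load_model(toy_path)

print(restored_model == toy_model)
```

Pickle files can execute arbitrary code when loaded, so only load model files that come from a trusted source.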

In [10]:
removed_nlu_data_refined_df['predicted_tagging'] = removed_nlu_data_refined_df['answer_normalised'].apply(
    lambda x: NLUEngine.create_entity_tagged_utterance(x, crf_model))


4, failed for utterance: ['wake', 'me', 'up', 'at', 'five', 'am', 'this', 'week']
5, failed for utterance: ['wake', 'me', 'up', 'at', 'five', 'am', 'this', 'week']
6, failed for utterance: ['wake', 'me', 'up', 'at', 'five', 'am', 'this', 'week']
7, failed for utterance: ['wake', 'me', 'up', 'at', 'five', 'am', 'this', 'week']
4, failed for utterance: ['wake', 'me', 'up', 'at', 'nine', 'am', 'on', 'friday']
5, failed for utterance: ['wake', 'me', 'up', 'at', 'nine', 'am', 'on', 'friday']
7, failed for utterance: ['wake', 'me', 'up', 'at', 'nine', 'am', 'on', 'friday']
4, failed for utterance: ['set', 'an', 'alarm', 'for', 'two', 'hours', 'from', 'now']
5, failed for utterance: ['set', 'an', 'alarm', 'for', 'two', 'hours', 'from', 'now']
6, failed for utterance: ['set', 'an', 'alarm', 'for', 'two', 'hours', 'from', 'now']
7, failed for utterance: ['set', 'an', 'alarm', 'for', 'two', 'hours', 'from', 'now']
6, failed for utterance: ['make', 'the', 'lighting', 'a', 'bit', 'more', 'warm', '

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
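
The message above is pandas' `SettingWithCopyWarning`. It appears because the filtered DataFrame may be a view of `nlu_data_df`, and a new column is then assigned to it. A minimal sketch of one common way to avoid it is to take an explicit `.copy()` when filtering; the toy data and the upper-casing stand-in for the tagger are invented for this example.

```python
import pandas as pd

# Toy stand-in for nlu_data_df
toy_df = pd.DataFrame({
    'answer_normalised': ['wake me up at five am', 'play some music', 'bad row'],
    'remove': [False, False, True],
})

# .copy() makes the filtered frame independent of toy_df, so adding
# a column afterwards is unambiguous and raises no warning.
kept_df = toy_df[toy_df['remove'] != True].copy()

# Stand-in for the real tagging function
kept_df['predicted_tagging'] = kept_df['answer_normalised'].apply(
    lambda utterance: utterance.upper())

print(kept_df)
```

The original `toy_df` is left untouched, which is usually what is wanted when the filtered frame is only used for analysis.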
  


In [33]:
# Keep the rows where the predicted tagging differs from the annotation
incorrect_tagged_df = removed_nlu_data_refined_df[
    removed_nlu_data_refined_df['answer_annotation']
    != removed_nlu_data_refined_df['predicted_tagging']]

# Count the mismatches per scenario (domain), largest first
incorrect_tagged_domain_count_df = (
    incorrect_tagged_df.groupby('scenario').size()
    .reset_index(name='count')
    .sort_values(by='count', ascending=False))


In [34]:
incorrect_tagged_domain_count_df

|   | scenario | count |
| --- | --- | --- |
| 2 | calendar | 681 |
| 12 | qa | 523 |
| 11 | play | 424 |
| 6 | general | 321 |
| 5 | email | 315 |
| 7 | iot | 198 |
| 4 | datetime | 192 |
| 16 | weather | 153 |
| 10 | news | 140 |
| 8 | lists | 132 |


In [36]:
incorrect_tagged_df['entity_types'] = incorrect_tagged_df['answer_annotation'].apply(
    EntityExtractor.extract_entities)

#TODO: get the entities and find the counts for each entity type!

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
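
One possible way to tackle the TODO about counting entity types is sketched below. It assumes each `entity_types` value is a list of dictionaries with `type` and `words` keys, as in the output of `EntityExtractor.extract_entities` shown in this notebook. The list `entity_lists` is a stand-in for that column, and its last row is invented.

```python
from collections import Counter

# Stand-in for incorrect_tagged_df['entity_types']
entity_lists = [
    [{'type': 'house_place', 'words': ['carpet']}],
    [{'type': 'media_type', 'words': ['music']}],
    [],  # utterances without annotated entities give an empty list
    # Invented row with two entities, for illustration
    [{'type': 'media_type', 'words': ['song']},
     {'type': 'date', 'words': ['today']}],
]

# Flatten the per-utterance lists and count each entity type
entity_type_counts = Counter(
    entity['type'] for entities in entity_lists for entity in entities)

for entity_type, count in entity_type_counts.most_common():
    print(entity_type, count)
```

On the real DataFrame, the same counter could be fed from `incorrect_tagged_df['entity_types']` directly, since iterating over a pandas Series yields its values.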
  


In [None]:
#TODO: load domain input and get analytics of that domain
#TODO: refine by that domain for each intent (?)

In [37]:
#TODO: extract entity types and join them together?
incorrect_tagged_df

Unnamed: 0,userid,answerid,notes,question,suggested_entities,answer,answer_normalised,scenario,intent,predicted_label,intent_refined,entity_refined,remove,status,answer_annotation,predicted_tagging,entity_types
21,1.0,52.0,,How would you ask your PDA to activate your ro...,,"cleaning is good, dust is so bad do now your m...",cleaning is good dust is so bad do now your ma...,iot,cleaning,cleaning,,,,,cleaning is good dust is so bad do now your ma...,cleaning is good dust is so bad do now your ma...,"[{'type': 'house_place', 'words': ['carpet']}]"
26,181.0,5063.0,,Write what you would tell your PDA in the foll...,,I WANT TO PLAY THAT MUSIC ONE AGAIN.,i want to play that music one again,play,music,music,,,,,i want to play that [media_type : music] one a...,i want to play that music one again,"[{'type': 'media_type', 'words': ['music']}]"
29,224.0,6242.0,,How would you tell your PDA to confirm a setting,,Is the brightness of my screen running low?,is the brightness of my screen running low,general,quirky,quirky,,,,,is the brightness of my screen running low,is the brightness of my [device_type : screen]...,[]
32,1003.0,27047.0,,How would you tell your PDA to confirm a setting,,I want the status on my screen brightness,i want the status on my screen brightness,general,quirky,quirky,,,,,i want the status on my screen brightness,i want the status on my [device_type : screen]...,[]
50,2.0,584.0,,How would you ask your PDA for news from a par...,news_source,Whats happening in football today?,whats happening in football today,news,news_query,news_query,,,,,whats happening in [news_topic : football] [da...,whats happening in football [date : today],"[{'type': 'news_topic', 'words': ['football']}..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24745,,499.0,,,,clarify more on your answers please.,clarify more on your answers please.,general,explain,explain,,,,,clarify more on your answers please.,clarify more on your [relation : answers] please.,[]
24786,,540.0,,,,"s1, again clarify your answers please.","s1, again clarify your answers please.",general,explain,explain,,,,,"s1, again clarify your answers please.","s1, again clarify your [relation : answers] pl...",[]
24824,,578.0,,,,once again clarify me about your answers please.,once again clarify me about your answers please.,general,explain,explain,,,,,once again clarify me about your answers please.,once again clarify me about your [relation : a...,[]
24831,,585.0,,,,"s1, further elaborate your answers please.","s1, further elaborate your answers please.",general,explain,explain,,,,,"s1, further elaborate your answers please.","s1, further elaborate your [relation : answers...",[]
