In [1]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import pandas as pd
import nltk

from nlu_engine import NLUEngine
from nlu_engine import MacroDataRefinement
from nlu_engine import MacroEntityRefinement

from nlu_engine import DataUtils
from nlu_engine import RenderJSON

from nlu_engine import Analytics

from nlu_engine import EntityExtractor, crf

# Macro NLU Data Refinement

It's a bit like the TV show [Severance](https://www.imdb.com/title/tt11280740/).

![Helly R and Mark S](https://media.npr.org/assets/img/2022/02/15/atv_severance_photo_010103-5f8033cc2b219ba64fe265ce893eae4c90e83896-s1100-c50.jpg "Helly R and Mark S")

*Helly R*: `My job is to scroll through the spreadsheet and look for the numbers that feel scary?`

*Mark S*: `I told you, you’ll understand when you see it, so just be patient.`

![MDR](https://www.imore.com/sites/imore.com/files/styles/large/public/field/image/2022/03/refinement-software-severance-apple-tv.jpg "Severance macrodata refinement")

*Helly R*: `That was scary. The numbers were scary.`

Hopefully the intents and entities that are wrong aren't scary, just a bit frustrating. Let's see if we can find the right ones.

NOTE: We will use logistic regression with TF-IDF features to train the intent models and CRFs for entity extraction. Why? They are very fast, and neither method is state of the art. That is a good thing here: their mistakes make problems in the dataset easier to spot than they would be with a proper NLU engine like Snips or something SOTA like BERT. Note that some of the problems we pick up on might not be actual issues in the data; they might be due to the limitations of the models. Refining the real problems while ignoring the model limitations is a good way to improve the dataset, and therefore the models trained on it. Once the dataset is ready, we can switch to a more advanced NLU engine and get the best performance possible.

* Macro NLU Data Refinement: Intent
* Macro NLU Data Refinement: Entity


Load the dataset

In [2]:
try:
    nlu_data_df = pd.read_csv(
        'data/refined/nlu_data_refined_df.csv', sep=',', index_col=0)
    print('Successfully loaded nlu_data_refined_df.csv')
except FileNotFoundError:
    # Fall back to the original cleaned dataset if no refined copy exists yet.
    data = 'data/NLU-Data-Home-Domain-Annotated-All-Cleaned.csv'
    nlu_data_df = DataUtils.load_data(data)

Successfully loaded nlu_data_refined_df.csv


In [None]:
# TODO: Remove this when done. It's just for testing!
data = 'data/NLU-Data-Home-Domain-Annotated-All-Cleaned.csv'
nlu_data_df = DataUtils.load_data(
    data
)

In [3]:
# Make sure nlu_data_df['answer_normalised'] is taken from nlu_data_df['answer_annotation']
nlu_data_df = DataUtils.convert_annotated_utterances_to_normalised_utterances(
    nlu_data_df)


# Entity extraction report

The entity extraction could be greatly improved by improving the features it uses, and it would be great if someone took a look at this. CRF features similar to those Snips uses, such as Brown clustering, would probably work better.
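To make the idea of "features" concrete, here is a minimal sketch of a per-token feature function in the style of the sklearn-crfsuite tutorial. It is not the feature set from `entity_extractor.py`; the function names, the feature keys, and the `word_clusters` lookup (a word to Brown-cluster bit-string mapping that would have to come from a separate clustering run) are all assumptions for illustration.

```python
def token_to_features(tokens, index, word_clusters=None):
    """Build a feature dict for one token (hypothetical feature names)."""
    token = tokens[index]
    features = {
        'bias': 1.0,
        'token.lower': token.lower(),
        'token.suffix3': token[-3:],
        'token.isdigit': token.isdigit(),
    }
    # Optional Brown-cluster feature: a prefix of the cluster bit string
    # groups distributionally similar words together.
    if word_clusters and token.lower() in word_clusters:
        features['cluster.prefix4'] = word_clusters[token.lower()][:4]
    # Context features from the neighboring tokens.
    if index > 0:
        features['prev.lower'] = tokens[index - 1].lower()
    else:
        features['BOS'] = True  # beginning of the utterance
    if index < len(tokens) - 1:
        features['next.lower'] = tokens[index + 1].lower()
    else:
        features['EOS'] = True  # end of the utterance
    return features


def utterance_to_features(utterance, word_clusters=None):
    """Split on whitespace and featurize every token."""
    tokens = utterance.split()
    return [token_to_features(tokens, i, word_clusters)
            for i in range(len(tokens))]


# The cluster bit string below is made up for the example.
example_features = utterance_to_features(
    'set an alarm for 7', word_clusters={'alarm': '110100'})
```

A list of such dicts per utterance is the input shape a CRF from sklearn-crfsuite expects, so a cluster feature could be added without changing anything else in the pipeline.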

In [None]:
#TODO: implement brown clustering to improve entity extraction (see entity_extractor.py)

The NLTK tokenizer is required for entity extraction, so download it if it is missing.

In [None]:
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

We should remove the unwanted entries for the next few steps.

In [4]:
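# Copy the filtered rows so that columns can be added later without a SettingWithCopyWarning.
removed_nlu_data_refined_df = nlu_data_df[nlu_data_df['remove'] != True].copy()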

Due to the error described in [this GitHub issue](https://github.com/TeamHG-Memex/sklearn-crfsuite/issues/60), we have to use an older version of scikit-learn (sklearn<0.24). Hopefully this gets fixed one day.

In [7]:
entity_report_df = NLUEngine.evaluate_entity_classifier(
    data_df=removed_nlu_data_refined_df, cv=4)

Evaluating entity classifier
Cross validating with CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100)




Time it took to cross validate CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100): 248.99148607254028


  _warn_prf(average, modifier, msg_start, len(result))


In [8]:
entity_report_df

Unnamed: 0,entity-type,precision,recall,f1-score,support
0,0,0.875369,0.953684,0.91285,64449.0
1,alarm_type,0.0,0.0,0.0,26.0
2,app_name,0.878788,0.460317,0.604167,63.0
3,artist_name,0.350877,0.285714,0.314961,560.0
4,audiobook_author,0.0,0.0,0.0,22.0
5,audiobook_name,0.0,0.0,0.0,235.0
6,business_name,0.319149,0.153374,0.207182,489.0
7,business_type,0.605714,0.305476,0.40613,347.0
8,change_amount,0.0,0.0,0.0,150.0
9,coffee_type,1.0,0.155556,0.269231,45.0
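To read the report above, it helps to recompute one row by hand. The sketch below derives precision, recall, and F1 from raw counts. The counts for `app_name` (29 correct out of 33 predicted, with a support of 63) are inferred from the rounded scores in the table, so treat them as an assumption; the function name is made up for this example.

```python
def scores_from_counts(true_positives, predicted, support):
    """Return (precision, recall, f1) for one entity type.

    predicted: how many tokens the model tagged with this type
    support:   how many tokens really carry this type
    """
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / support if support else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# app_name: high precision, low recall (counts inferred, see above).
app_name_scores = scores_from_counts(29, 33, 63)

# alarm_type: the model never predicts it, so everything is 0.0.
alarm_type_scores = scores_from_counts(0, 0, 26)
```

A row like `coffee_type` (precision 1.0, recall 0.16) means the model is always right when it predicts the type but rarely predicts it, while a row of zeros like `alarm_type` means it never predicts the type correctly at all.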


In [None]:
entity_type = 'alarm_type'

In [None]:
removed_nlu_data_refined_df[removed_nlu_data_refined_df['answer_annotation'].str.contains(
    entity_type)]['scenario'].unique().tolist()


In [5]:
# TODO: instead of getting domains for each entity type,
# get entries for each domain and create classification report for each domain
domain = 'general'
# Copy the slice so that adding columns below does not raise a SettingWithCopyWarning.
domain_df = removed_nlu_data_refined_df[removed_nlu_data_refined_df['scenario'] == domain].copy()
#domain_entity_report_df = NLUEngine.evaluate_entity_classifier(data_df=domain_df, cv=4)

In [12]:
domain_df['entity_types'] = domain_df['answer_annotation'].apply(EntityExtractor.extract_entities)

In [15]:
import regex as re

In [16]:
# Capture everything between square brackets, e.g. 'joke_type : good'.
domain_df['entity_types'] = domain_df['answer_annotation'].apply(
    lambda utterance: re.findall(r'\[(.*?)\]', utterance))
#TODO: have to find a way to get the entity type from the utterance


In [18]:
domain_df.head(50)

Unnamed: 0,userid,answerid,notes,question,suggested_entities,answer,answer_normalised,scenario,intent,predicted_label,intent_refined,entity_refined,remove,status,answer_annotation,entity_types
27,201.0,5488.0,,How would you tell your PDA to confirm a setting,,check my car is ready,check my car is ready,general,quirky,quirky,,,,,check my car is ready,[]
28,201.0,5490.0,,How would you tell your PDA to confirm a setting,,check my laptop is working,check my laptop is working,general,quirky,quirky,,,,,check my laptop is working,[]
29,224.0,6242.0,,How would you tell your PDA to confirm a setting,,Is the brightness of my screen running low?,is the brightness of my screen running low,general,quirky,quirky,,,,,is the brightness of my screen running low,[]
32,1003.0,27047.0,,How would you tell your PDA to confirm a setting,,I want the status on my screen brightness,i want the status on my screen brightness,general,quirky,quirky,,,,,i want the status on my screen brightness,[]
34,1.0,60.0,,Write what you would tell your PDA in the foll...,,"Olly, I am not tired, I am actually happy.",i am not tired i am actually happy,general,quirky,quirky,,,,,i am not tired i am actually happy,[]
35,1.0,64.0,,Write what you would tell your PDA in the foll...,,what's up olly,what's up,general,quirky,definition,True,,False,,what's up,[]
105,4.0,684.0,,How would you ask for a joke,joke_type,make me laugh,make me laugh,general,joke,recipe,True,,False,,make me laugh,[]
106,4.0,685.0,,How would you ask for a joke,joke_type,tell me a good joke,tell me a good joke,general,joke,joke,,,,,tell me a [joke_type : good] joke,[joke_type : good]
108,4.0,687.0,,Write what you would tell your PDA in the foll...,joke_type,tell me a joke.,tell me a joke,general,joke,joke,,,,,tell me a joke,[]
112,4.0,704.0,,How would you ask your PDA about the current d...,date,tell me about today,tell me about today,general,quirky,quirky,,,,,tell me about [date : today],[date : today]
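The `entity_types` column above still holds the whole annotation (`joke_type : good`) rather than just the type. One possible way to separate the two, assuming every annotation follows the `[entity_type : value]` format seen in `answer_annotation`, is a regex with two capture groups. The pattern and function name below are illustrative and not part of `nlu_engine`.

```python
import re

# Group 1 is the entity type, group 2 is the annotated value.
ANNOTATION_PATTERN = re.compile(r'\[\s*(\w+)\s*:\s*([^\]]+?)\s*\]')


def extract_entity_pairs(annotated_utterance):
    """Return a list of (entity_type, value) tuples for one utterance."""
    return ANNOTATION_PATTERN.findall(annotated_utterance)


joke_pairs = extract_entity_pairs('tell me a [joke_type : good] joke')
no_pairs = extract_entity_pairs('check my car is ready')
time_pairs = extract_entity_pairs(
    'set an alarm for [time : two hours] [time : from now]')
```

Applied with `domain_df['answer_annotation'].apply(extract_entity_pairs)`, this would give one list of tuples per row, and the first element of each tuple is the entity type.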


In [None]:
#TODO: get all domains, get overall scores and combine them into a single report
domain_entity_report_df
#TODO: graph this report
#TODO: user picks a domain, then they see this report of the entity breakdown.
# TODO: the user gets a data sheet for the domain arranged by support.

In [None]:
# Refactor method to include domain as a possible input
Analytics.plot_report(domain_entity_report_df)

In [None]:
domains = removed_nlu_data_refined_df['scenario'].unique().tolist()


In [None]:
domains

In [None]:
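def get_entity_reports_for_domains(data_df):
    """Cross validate the entity classifier per domain and collect the summary rows."""
    domains = data_df['scenario'].unique().tolist()
    domain_reports = []
    for domain in domains:
        print(f'Evaluating entity classifier for {domain}')
        domain_df = data_df[data_df['scenario'] == domain]
        domain_entity_report_df = NLUEngine.evaluate_entity_classifier(
            data_df=domain_df, cv=4)
        # Keep the last three rows (presumably the aggregate scores) and copy
        # them so the new column is not set on a slice of the report.
        domain_scores_df = domain_entity_report_df.tail(3).copy()
        domain_scores_df['domain'] = domain
        domain_reports.append(domain_scores_df)

    # DataFrame.append returned a new frame instead of modifying in place
    # (and was removed in pandas 2.0), so concatenate the collected pieces.
    return pd.concat(domain_reports)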

domain_entity_reports_df = get_entity_reports_for_domains(removed_nlu_data_refined_df)


In [None]:
#TODO: Extract the entities into something countable!
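One possible take on this TODO, sketched with the standard library only: pull the entity type out of every `[entity_type : value]` annotation and tally the types with a `Counter`. The function name and the sample utterances are made up for the example; in the notebook the input would be the `answer_annotation` column.

```python
import re
from collections import Counter

# Matches the entity type at the start of an annotation like '[time : now]'.
ENTITY_TYPE_PATTERN = re.compile(r'\[\s*(\w+)\s*:')


def count_entity_types(annotated_utterances):
    """Count how often each entity type occurs across all utterances."""
    counts = Counter()
    for annotated_utterance in annotated_utterances:
        counts.update(ENTITY_TYPE_PATTERN.findall(annotated_utterance))
    return counts


sample_annotations = [
    'tell me a [joke_type : good] joke',
    'tell me about [date : today]',
    'set an alarm for [time : two hours] [time : from now]',
    'tell me a joke',
]
entity_type_counts = count_entity_types(sample_annotations)
```

Sorting the result with `entity_type_counts.most_common()` gives the entity types ordered by how often they appear, which is a rough stand-in for the `support` column of the report.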

In [None]:
crf_model = NLUEngine.train_entity_classifier(removed_nlu_data_refined_df)

In [None]:
model_path = 'models/analytics/entity_tagger.sav'

In [None]:
DataUtils.pickle_model(classifier=crf_model, model_path=model_path)

In [None]:
crf_model = DataUtils.import_pickled_model(model_path)

In [None]:
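# Plain utterance to tag; the CRF may split it into neighboring entities such as:
annotated_utterance = 'set an alarm for [time : two hours] [time : from now]'
utterance = 'set an alarm for two hours from now'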


In [None]:
def combine_neigboring_same_entity_types(utterance, crf_model):
    """
    Tags an utterance with the CRF model, then repeatedly merges neighboring entities of the same type until none are left to merge.
    """
    tagged_utterance = EntityExtractor.tag_utterance(
        utterance,
        crf_model
    )
    split_tagged_utterance = tagged_utterance.split(' ')

    def combine_and_remove_entities(split_tagged_utterance):
        """
        Merges each entity with a neighboring entity of the same type and marks the duplicate tokens for removal.
        NOTE: The list has to keep its length while we iterate over it, so the duplicates are marked instead of removed directly.
        """
        change_counter = 0
        for index, token in enumerate(split_tagged_utterance):
            if '[' in token:
                if len(split_tagged_utterance) > index + 3:
                    if token == split_tagged_utterance[index + 3]:
                        split_tagged_utterance[index + 2] = split_tagged_utterance[index + 2].replace(
                            ']', '') + ' ' + split_tagged_utterance[index + 5]

                        split_tagged_utterance[index + 3] = 'to_remove'
                        split_tagged_utterance[index + 4] = 'to_remove'
                        split_tagged_utterance[index + 5] = 'to_remove'
                        change_counter += 1
        return (split_tagged_utterance, change_counter)

    def remove_entities(split_tagged_utterance):
        try:
            while True:
                split_tagged_utterance.remove('to_remove')
        except ValueError:
            pass
        return split_tagged_utterance

    combined_entities_split_tagged_utterance, change_counter = combine_and_remove_entities(
        split_tagged_utterance)
    removed_entities_split_tagged_utterance = remove_entities(
        combined_entities_split_tagged_utterance)

    while change_counter > 0:
        combined_entities_split_tagged_utterance, change_counter = combine_and_remove_entities(
            removed_entities_split_tagged_utterance)
        removed_entities_split_tagged_utterance = remove_entities(
            combined_entities_split_tagged_utterance)

    return ' '.join(removed_entities_split_tagged_utterance)


In [None]:
combine_neigboring_same_entity_types(utterance, crf_model)
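The merging step can also be tried without a trained model. The sketch below is a different technique from the token-index approach above: it uses a regex with a backreference to merge adjacent annotations of the same type, and it assumes the tagged text uses the `[entity_type : value]` format with single spaces between annotations. The function name is made up for the example.

```python
import re

# Two adjacent annotations with the same entity type; \1 refers back to
# the type captured in the first annotation.
ADJACENT_SAME_TYPE = re.compile(r'\[(\w+) : ([^\]]+)\] \[\1 : ([^\]]+)\]')


def merge_adjacent_entities(tagged_utterance):
    """Merge neighboring same-type annotations until nothing changes."""
    previous = None
    while previous != tagged_utterance:
        previous = tagged_utterance
        tagged_utterance = ADJACENT_SAME_TYPE.sub(
            r'[\1 : \2 \3]', tagged_utterance)
    return tagged_utterance


merged = merge_adjacent_entities(
    'set an alarm for [time : two] [time : hours] [time : from] [time : now]')
untouched = merge_adjacent_entities(
    'play [artist_name : queen] on [app_name : spotify]')
```

The loop is needed because one substitution pass only merges non-overlapping pairs: four neighboring `time` tokens become two annotations after the first pass and a single `[time : two hours from now]` after the second.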

In [None]:
removed_nlu_data_refined_df

In [None]:
removed_nlu_data_refined_df['predicted_tagging'] = removed_nlu_data_refined_df['answer_normalised'].apply(
    lambda x: combine_neigboring_same_entity_types(x, crf_model))

In [None]:
removed_nlu_data_refined_df[removed_nlu_data_refined_df['answer_annotation']
                            != removed_nlu_data_refined_df['predicted_tagging']]
# TODO: have user choose a domain, make report about entities in the domain,...
# TODO: after the report, does the user clean all entities in the domain at once, or by individual entity type?


In [None]:
removed_nlu_data_refined_df.loc[21, 'answer_annotation']


In [None]:
#TODO: It is probably a good idea to drop all of the ones that lack a good support.
#NOTE: But it didn't work to fix the problem.
remove_entities = [
    'music_album',
    'game_type',
    
]
removed_nlu_data_refined__entities_cleaned_df = removed_nlu_data_refined_df[~removed_nlu_data_refined_df['answer_annotation'].str.contains('|'.join(remove_entities))]

In [None]:
Analytics.plot_report(entity_report_df)

In [None]:
#TODO: Remove/replace worst: add in state features like here: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#let-s-check-what-classifier-learned
# Specifically, we want print_state_features()

In [None]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))


In [None]:
from nlu_engine import crf
from collections import Counter

In [None]:
crf_model.state_features_

In [None]:
# Use the trained model (crf_model), not the crf object imported from nlu_engine.
print_state_features(Counter(crf_model.state_features_).most_common(100))
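For readers without a trained model at hand, here is a small sketch of what this ranking does. In sklearn-crfsuite, `state_features_` is a dict that maps `(attribute, label)` pairs to learned weights, and `Counter.most_common()` sorts those pairs by weight. The feature names and weights below are invented for illustration only.

```python
from collections import Counter

# Toy stand-in for crf_model.state_features_ (made-up names and weights).
toy_state_features = {
    ('word.lower():tomorrow', 'date'): 4.2,
    ('word.lower():alarm', '0'): 1.3,
    ('postag:NN', 'event_name'): 0.4,
    ('word.lower():alarm', 'alarm_type'): -0.9,
}

# most_common() returns ((attribute, label), weight) tuples, highest weight first.
ranked_features = Counter(toy_state_features).most_common()
strongest_positive = ranked_features[:2]
strongest_negative = ranked_features[-2:]

for (attribute, label), weight in strongest_positive:
    print("%0.6f %-12s %s" % (weight, label, attribute))
```

Looking at the tail of the ranking is just as useful as the head: strongly negative weights show which features push the model away from a label.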

In [None]:
# TODO: review the most common features, none of them are the word parts (chunks) or POS tags, are these even needed or helpful?
# Can we remove them and speed up the process?
# What other features could be used? Word2vec? Brown clustering?

In [None]:
domain_selection = MacroDataRefinement.list_and_select_domain(nlu_data_df)

As the entity extraction report shows, the entity extraction is not working for alarm_type: precision, recall, and F1 are all 0.0.

In [None]:
#TODO: review all scoring 0, see if they can be completely dropped or what
entity_to_refine = 'alarm_type'
nlu_scenario_df = removed_nlu_data_refined_df[removed_nlu_data_refined_df['answer_annotation'].str.contains(
    entity_to_refine)]

In [None]:
nlu_scenario_df

In [None]:
def remove_entity(df, entity_to_remove):
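    """
    Mark all entries that contain an entity type for removal.
    :param df: pandas dataframe
    :param entity_to_remove: entity type to search for in 'answer_annotation'
    :return: copy of the dataframe with 'remove' set to True for matching entries
    """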
    updated_df = df.copy()
    updated_df.loc[updated_df['answer_annotation'].str.contains(
        entity_to_remove), 'remove'] = True
    return updated_df


In [None]:
updated_df = remove_entity(removed_nlu_data_refined_df, entity_to_refine)

In [None]:
# Check the result in updated_df; remove_entity() works on a copy and leaves the original untouched.
updated_df[updated_df['answer_annotation'].str.contains(
    entity_to_refine)]
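One thing to keep in mind with `str.contains(entity_to_refine)`: it matches substrings, so a short entity name can also hit longer entity names or plain words in the utterance. A hedged sketch of a stricter check, assuming the `[entity_type : value]` annotation format, is shown below. The helper names are made up, and `time_zone` is only an illustrative label.

```python
import re


def entity_type_pattern(entity_type):
    """Regex that matches entity_type only as the type of an annotation."""
    return re.compile(r'\[\s*' + re.escape(entity_type) + r'\s*:')


def has_entity_type(annotation, entity_type):
    return entity_type_pattern(entity_type).search(annotation) is not None


tagged_time = 'wake me at [time : six]'
tagged_time_zone = 'what is the [time_zone : pacific] time'

# A plain substring test finds 'time' in both annotations.
substring_hits = ['time' in tagged_time, 'time' in tagged_time_zone]

# The anchored pattern only matches the real 'time' entity.
exact_hits = [has_entity_type(tagged_time, 'time'),
              has_entity_type(tagged_time_zone, 'time')]
```

The same pattern string can be passed to pandas, for example `df['answer_annotation'].str.contains(entity_type_pattern('time').pattern, regex=True)`.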

## Convert entities to ipysheet and review
TODO: add in description of the types of fixes we can do to the NLU data for entity


In [None]:

# TODO: same as above for intents but with predicted entities: report on them, break them down into a dictionary of dataframes and refine them..

For the example with 'alarm' and the alarm_type:
* We see that the alarm_type entities are really event_name (e.g. wake up, soccer practice), except for ID 5879. We will need to change them to event_name and remove ID 5879.
* The last one (ID 6320) is a mistake. Someone got confused by the prompt and assumed the alarm is a security system. This is out of scope for the alarm domain, as the alarms are ones set on a phone or other device. We will drop this utterance.

Once you are done reviewing, convert it back to a dataframe and check that it looks okay.
Let's change all alarm_type entities to event_name.

In [None]:

reviewed_scenario_df['answer_annotation'] = reviewed_scenario_df['answer_annotation'].str.replace(
    'alarm_type', 'event_name')
reviewed_scenario_df


Okey dokey, now we can merge this with the original dataset and see if it already made a difference (well, of course it did!).

In [None]:
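# Drop the entries marked for removal from the full dataset.
nlu_data_df.drop(
    reviewed_scenario_df[reviewed_scenario_df['remove'] == True].index, inplace=True)

# Keep only the reviewed entries that were not marked for removal.
# NOTE: `~col == True` parses as `(~col) == True`, so compare explicitly instead.
reviewed_scenario_df = reviewed_scenario_df[reviewed_scenario_df['remove'] != True]

# Copy the corrected annotations back; the assignment aligns on the index.
nlu_data_df.loc[nlu_data_df.index.intersection(
    reviewed_scenario_df.index), 'answer_annotation'] = reviewed_scenario_df['answer_annotation']

nlu_data_df[(nlu_data_df['scenario'].str.contains('alarm')) & (nlu_data_df['answer_annotation'].str.contains(
    'event_name'))]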
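The merge relies on pandas aligning rows by index label rather than by position. Here is a toy version of the same three steps (drop the rejected rows, filter the reviewed frame, write the corrected annotations back) on made-up data, so the behavior can be checked in isolation. The frame contents and index values are invented for the example.

```python
import pandas as pd

# Toy stand-ins for nlu_data_df and reviewed_scenario_df.
original_df = pd.DataFrame(
    {'answer_annotation': ['set a [alarm_type : wake up] alarm',
                           'tell me a joke',
                           'is the [alarm_type : house] alarm on']},
    index=[10, 11, 12])
reviewed_df = pd.DataFrame(
    {'answer_annotation': ['set a [event_name : wake up] alarm',
                           'is the [event_name : house] alarm on'],
     'remove': [False, True]},
    index=[10, 12])

# 1. Drop the rows the reviewer marked for removal.
rows_to_drop = reviewed_df.index[reviewed_df['remove'] == True]
merged_df = original_df.drop(index=rows_to_drop)

# 2. Keep only the reviewed rows that survive.
kept_df = reviewed_df[reviewed_df['remove'] != True]

# 3. Write the corrected annotations back; rows are matched by index label.
merged_df.loc[kept_df.index, 'answer_annotation'] = kept_df['answer_annotation']
```

Row 11 was never part of the review, so it stays exactly as it was, which is why the reviewed frame has to keep the index of the original dataset.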


### Benchmark changed data set
TODO: repeat reports for the changed data set for domain and entities and compare


In [None]:

entity_reviewed_report_df = NLUEngine.evaluate_entity_classifier(
    data_df=nlu_data_df)
entity_reviewed_report_df.sort_values(by=['f1-score'])

If you are sure it is okay, you can save it as a CSV file. Make sure to name it correctly (e.g. `alarm_domain_first_review.csv`).

In [None]:
reviewed_scenario_df.to_csv('alarm_domain_first_review.csv')

Load it back up and check to make sure it looks okay. Make sure to give it the right name!


In [None]:
alarm_domain_first_review_df = pd.read_csv(
    'alarm_domain_first_review.csv', index_col=0)
alarm_domain_first_review_df.tail(50)


In [None]:
# TODO: implement the evaluate_classifier in the NLU engine to check f1 score for intents and entities in the domain vs original NLU data of domain!
# Value: benchmark!
#TODO: implement a flow for getting the domains with the lowest f1 scores by intent/domain and entities and cleaning them by the order of the lowest f1 scores
# TODO: concat all reviewed dfs and save to csv
# TODO: add benchmark for whole NLU data set before and after cleaning! (by intents and domains!)
# TODO: review the review marked entries
# TODO: add new column for notes
# TODO: change flow of review for only ones that should be reviewed, not all of the ones that have been changed (track changes by comparing against the original data set)
# TODO: do the changed utterances have to be changed in other fields too, or is it enough to change the tagged utterance field?
# TODO: add visualizations of domains, their intents, keywords in utterances, and entities to top
