In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import pandas as pd
import nltk

from nlu_engine import NLUEngine
from nlu_engine import MacroDataRefinement
from nlu_engine import MacroIntentRefinement

from nlu_engine import DataUtils
from nlu_engine import RenderJSON

from nlu_engine import IntentMatcher, LR
from nlu_engine import EntityExtractor

# Macro NLU Data Refinement

It's a bit like the TV show [Severance](https://www.imdb.com/title/tt11280740/).

![Helly R and Mark S](https://media.npr.org/assets/img/2022/02/15/atv_severance_photo_010103-5f8033cc2b219ba64fe265ce893eae4c90e83896-s1100-c50.jpg "Helly R and Mark S")

*Helly R*: `My job is to scroll through the spreadsheet and look for the numbers that feel scary?`

*Mark S*: `I told you, you’ll understand when you see it, so just be patient.`

![MDR](https://www.imore.com/sites/imore.com/files/styles/large/public/field/image/2022/03/refinement-software-severance-apple-tv.jpg "Severance macrodata refinement")

*Helly R*: `That was scary. The numbers were scary.`

Hopefully the intents and entities that are wrong aren't scary, just a bit frustrating. Let's see if we can find the right ones.

NOTE: We will use Logistic Regression with TF-IDF features to train our intent models and CRFs for entity extraction. Why? They are very fast, and neither method is state of the art. That is a good thing here: their mistakes make it easier to find the problems we need to refine in the dataset than it would be with a proper NLU engine like Snips or something SOTA like BERT. It is very important to note that some of the problems we pick up on might not be actual issues with the data, but might be due to the limitations of the models. Refining the real problems and ignoring the limitations of the models is a good way to improve the models. Then, when the dataset is ready, we can use a more advanced NLU engine and get the best performance possible.

* Macro NLU Data Refinement: Intent
* Macro NLU Data Refinement: Entity


Load the dataset

In [None]:
data = 'data/NLU-Data-Home-Domain-Annotated-All-Cleaned.csv'

nlu_data_df = DataUtils.load_data(
    data
)

The data set can be a bit unruly to look at in its raw form, but it might be worth checking out.

In [None]:
nlu_data_df

## Intent

### Create intent classifier report

Let's do a report by domain classification.

In [None]:
domain_labels = 'scenario'

domain_report_df = NLUEngine.evaluate_intent_classifier(
    data_df_path=nlu_data_df,
    labels_to_predict=domain_labels,
    classifier=LR
)

domain_report_df

And now let's do a report by intent classification.

In [None]:
intent_labels = 'intent'

intent_report_df = NLUEngine.evaluate_intent_classifier(
    data_df_path=nlu_data_df,
    labels_to_predict=intent_labels,
    classifier=LR
)
intent_report_df

### Macro Intent Data Refinement

Now that we know what works and what doesn't, we can start refining the intents.

We don't want all of the columns, so we will drop some to make the data set easier to review.

In [None]:
nlu_scenario_df = nlu_data_df.drop(
    columns=[
        'userid', 'notes', 'answer', 'answerid', 'suggested_entities'
    ])

Pick a domain (scenario) to review

For this example we are going to pick `alarm`, but for actual refinement, pick whatever you like.

The intent classification isn't bad, but the entity extraction for alarm_type is terrible. Perhaps it overlaps with another entity type, like 'event_name'.

We will try to fix this.

In [None]:
domain_selection = MacroDataRefinement.list_and_select_domain(nlu_scenario_df)

The amount of training data for a domain can be a bit much; however, we will give it a quick glance.

In [None]:
nlu_scenario_df = nlu_scenario_df[
    nlu_scenario_df['scenario'] == domain_selection
]
nlu_scenario_df

TODO: add a description of the types of fixes we can make to the NLU data for intents:
* intents that collide with other intents and how to fix them (separate them by TF-IDF terms and use checkboxes in ipysheet to annotate them into the correct intent): this leads to the visualization of the intents in the NLU data with Venn word cloud diagrams
* utterances that are grammatically incorrect or contain incorrect spelling (grammar checker in the future?)
* utterances that are straight up wrong for the intent
* utterances that actually seem to contain multiple intents (this isn't supported by default)

Let's train an intent classifier on the whole data set and collect the incorrectly predicted intents for the domain we want to clean.

Why not split off a test set? Because we want to see how the intent classifier does on data it has already seen. If it still gets an utterance wrong after training on it, then perhaps there is something wrong with the utterance, the tagging, overlapping intents, etc.
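To make the idea concrete, here is a minimal sketch of "train on everything, then look at what is still wrong", written directly against scikit-learn rather than `nlu_engine`. The toy utterances and the `utterance` and `predicted_label` column names are assumptions for illustration only; `NLUEngine` may do this differently internally.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the real data set (hypothetical utterances).
toy_df = pd.DataFrame({
    'utterance': [
        'wake me up at five am',
        'set an alarm for six',
        'remind me about the meeting',
        'add a meeting to my calendar',
        'what alarms do i have',
        'list my alarms',
    ],
    'intent': [
        'alarm_set', 'alarm_set', 'calendar_set',
        'calendar_set', 'alarm_query', 'alarm_query',
    ],
})

# Fit TF-IDF features and a logistic regression on ALL rows (no test split).
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(toy_df['utterance'])
model = LogisticRegression(max_iter=1000)
model.fit(features, toy_df['intent'])

# Predict on the same rows the model was trained on and keep the misses.
toy_df['predicted_label'] = model.predict(features)
still_wrong_df = toy_df[toy_df['intent'] != toy_df['predicted_label']]
print(still_wrong_df)
```

Any row that lands in `still_wrong_df` was misclassified even though the model saw it during training, which makes it a good candidate for a closer look.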

In [None]:
LR_intent_classifier_model, tfidf_vectorizer = NLUEngine.train_intent_classifier(
    data_df_path=nlu_data_df,
    labels_to_predict='intent',
    classifier=LR
)

We will get the intent keyword features and their rankings (coefficients) from the intent classifier.

In [None]:
intent_feature_rank_df = MacroIntentRefinement.intent_keyword_feature_rankings(
    LR_intent_classifier_model, tfidf_vectorizer)

Having all of the incorrectly predicted intents to review for this domain is a good way to see what is going wrong. 

The big question is: Is it because of defects in the data or is it because of the intent classifier? We care more about finding defects in the data, which we can refine, than about defects in the classifier.

In [None]:
incorrect_intent_predictions_df = IntentMatcher.get_incorrect_predicted_labels(
    nlu_scenario_df, LR_intent_classifier_model, tfidf_vectorizer)
incorrect_intent_predictions_df

However, seeing everything that isn't working right can be a bit much. Perhaps we can break it down better in a report!
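One simple way to break the errors down is to count how often each intent is confused with each predicted label. The sketch below uses plain pandas on a hypothetical stand-in for `incorrect_intent_predictions_df`; it assumes the dataframe has `intent` and `predicted_label` columns and is not how `MacroIntentRefinement` builds its report.

```python
import pandas as pd

# Hypothetical stand-in for incorrect_intent_predictions_df.
incorrect_df = pd.DataFrame({
    'intent': ['alarm_set', 'alarm_set', 'alarm_query', 'alarm_set'],
    'predicted_label': ['calendar_set', 'calendar_set', 'alarm_set', 'alarm_query'],
})

# Count each (true intent, wrongly predicted intent) pair, largest first.
confusion_counts = (
    incorrect_df
    .groupby(['intent', 'predicted_label'])
    .size()
    .sort_values(ascending=False)
    .rename('count')
    .reset_index()
)
print(confusion_counts)
```

The pair at the top of the table is usually the best place to start looking for overlapping intents.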

In [None]:
incorrect_predicted_intents_report = MacroIntentRefinement.get_incorrect_predicted_intents_report(
    nlu_scenario_df, incorrect_intent_predictions_df, intent_report_df, intent_feature_rank_df)

Let's take a look at the report. You can use the circle with the plus to expand the items individually in the report or click on the number of items. Might I recommend looking at one intent at a time and expanding the nested items for each of those. There is a lot of information to look at here, but this stuff is super important to understand for the refinement of the data.

Each intent has the following items:
* **f1 score**: the overall score of the intent (we want to improve this number!)
* **total count**: the total number of utterances that have this intent
* **total incorrect count**: the total number of utterances that have this intent but are incorrectly predicted (we want to reduce this number!)
* **top features**: the top ten features (words) that are associated with this intent (these are just the individual words ranked, not combined together!)
* **overlapping features**: the top ten features (words) that are associated with this intent and are also associated with other intents that may make classification based solely on these features difficult
* **correct utterance example**: the intent, the first annotated utterance that is correctly predicted as an example, and a list of the words in the utterance with their coefficient rankings
* **incorrect utterance example**: the intent, the first annotated utterance that is incorrectly predicted as an example, and a list of the words in the utterance with their coefficient rankings
* **incorrect predicted intents and counts**: for this intent, a list of the incorrectly predicted intents and their counts (we want to reduce this!)
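To illustrate what "overlapping features" means, here is a small sketch in plain Python. The `(intent, feature, coefficient)` rows and their values are made up for the example, and the helper names are hypothetical; the real report may compute the overlap differently.

```python
# Hypothetical (intent, feature, coefficient) rows, e.g. taken from a
# feature ranking table. The numbers are made up for illustration.
feature_rows = [
    ('alarm_set', 'alarm', 2.1), ('alarm_set', 'wake', 1.8), ('alarm_set', 'set', 1.2),
    ('calendar_set', 'reminder', 2.0), ('calendar_set', 'set', 1.1), ('calendar_set', 'meeting', 0.9),
    ('alarm_query', 'alarm', 1.5), ('alarm_query', 'list', 1.4), ('alarm_query', 'what', 0.7),
]


def top_features(rows, intent, n=10):
    """Return the n features with the highest coefficients for one intent."""
    ranked = sorted((row for row in rows if row[0] == intent),
                    key=lambda row: row[2], reverse=True)
    return [feature for _, feature, _ in ranked[:n]]


def overlapping_features(rows, intent, n=10):
    """Map every other intent to the top features it shares with `intent`."""
    own = set(top_features(rows, intent, n))
    other_intents = {row[0] for row in rows} - {intent}
    overlaps = {}
    for other in sorted(other_intents):
        shared = own & set(top_features(rows, other, n))
        if shared:
            overlaps[other] = sorted(shared)
    return overlaps


print(overlapping_features(feature_rows, 'alarm_set'))
```

In this toy example `alarm_set` shares `alarm` with `alarm_query` and `set` with `calendar_set`, so neither word alone is enough to tell those intents apart.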


In [None]:
RenderJSON(incorrect_predicted_intents_report)

Save the report (for now make sure to add the correct domain name to the file)

In [None]:
DataUtils.save_json(incorrect_predicted_intents_report, 'data/reports/' +
                    domain_selection + '_incorrect_predicted_intents_report.json')


In [None]:
incorrect_predicted_intents_report = DataUtils.load_json(
    'data/reports/' + domain_selection + '_incorrect_predicted_intents_report.json')


In [None]:
#TODO: add in way to show the improvements when refinement is complete (save original json and new json as one file with two main keys), how best to compare them?

In [None]:
#TODO: experiment with the term frequencies of the incorrect predictions for the intent and the correct predictions for the incorrect predicted labels!

In [None]:
#TODO: make a report similar to the intent keyword report, but for the incorrect predicted intents.
# What keywords are sending these utterances to the wrong intent? (do all utterances for an intent that have that keyword get sent to the wrong intent?!)
# How do you interpret the results? Is it because of an intent overlap, utterances that are incorrectly written or labeled, or is the basic NLU engine breaking down?
# How do you separate the different types of errors?

In [None]:
#TODO: get the features from each intent that fails to match the intent (how?)

#TODO: possible solutions: get top features for each incorrect predicted intents and the intents they are being incorrectly predicted as

In [None]:
incorrect_intent_predictions_df
#1. TODO: get features for intent and predicted intent from intent_feature_rank_df
    #TODO: get list of intents and predicted_label from incorrect_intent_predictions_df
    #TODO: get all features for each intent and predicted_label from intent_feature_rank_df
    #TODO: find where the features overlap for each intent and the incorrectly predicted label
    #TODO: find examples of utterances that have these features for each intent, both correctly and incorrectly predicted 



In [None]:
#TODO: find separation criteria for alarm_set and calendar_set: get the most popular TF-IDF words (and/or coef features) for each intent and assign them as the separation criteria, e.g.
# alarm -> alarm_set
# reminder -> calendar_set
# is alert alarm or reminder?
# wakeup or wake up -> alarm_set
# get up -> alarm_set
# timer -> remove, it's not a timer!

#TODO: this will be later visually represented in a venn diagram with word clouds and forms the basis for refinement besides the sheets
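The separation criteria from the TODO above could look roughly like the sketch below. The keyword lists only mirror the examples in the TODO, the function name `suggest_intent` is hypothetical, and plain substring matching is a deliberately crude assumption.

```python
# Keyword lists taken from the TODO notes above; everything else is hypothetical.
SEPARATION_KEYWORDS = {
    'alarm_set': ['alarm', 'wakeup', 'wake up', 'get up'],
    'calendar_set': ['reminder'],
}
REMOVE_KEYWORDS = ['timer']  # "it's not a timer!"


def suggest_intent(utterance):
    """Suggest an intent from keywords, or 'remove' / 'review' as a fallback."""
    text = utterance.lower()
    if any(keyword in text for keyword in REMOVE_KEYWORDS):
        return 'remove'
    matches = [intent for intent, keywords in SEPARATION_KEYWORDS.items()
               if any(keyword in text for keyword in keywords)]
    # Exactly one matching intent is a clear suggestion; anything else
    # (no match, or several intents matching) goes back to a human.
    if len(matches) == 1:
        return matches[0]
    return 'review'


for example in ['set an alarm for six', 'set a reminder for the dentist',
                'start a timer for ten minutes', 'alert me at noon']:
    print(example, '->', suggest_intent(example))
```

Ambiguous words such as "alert" fall through to `review`, which matches the open question in the TODO about whether an alert is an alarm or a reminder.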

### Human in the for loop.

You have made it this far, now it's your turn to shine, human!

You will provide a refinement for each incorrectly predicted intent. Some of the incorrectly predicted utterances are actually fine the way they are; in that case you may need to review the intent that is falsely being predicted...


Besides correcting the utterances (i.e. spelling), you can also mark an entry with one of the following:
* **review**: the utterance needs to be reviewed again by a human
* **move**: the utterance needs to be moved to another intent (NOTE: if you have a big data set, it might be better to just **remove** the utterance from the data set)
* **remove**: the utterance should be removed from the dataset

You can use your human ability to refine the NLU intent data by answering the following questions:

1. Does the utterance fit the intent? If not -> mark as move, remove, or review

2. Is the grammar or spelling of the utterance wrong, but (1) is fine? -> correct the utterance

3. Is this intent colliding with another intent because the scopes of both intents overlap? -> redefine the scope of the intents (either combine them or separate their functionality better)

4. Is the intent colliding with another intent because certain keywords overlap between intents? -> redefine the keywords to split between intents or merge them together if they are similar
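As a rough sketch of what happens to the marks once a human has set them, the snippet below applies `move`, `remove`, and `review` to a small dataframe with pandas. The `status` and `remove` columns appear later in this notebook; the `utterance` and `move_to` columns and the toy rows are assumptions, and `MacroDataRefinement.move_entry` may work differently.

```python
import pandas as pd

# Hypothetical reviewed sheet after a human has marked the entries.
reviewed_df = pd.DataFrame({
    'utterance': ['remind me to call mom', 'asdf alarm qwerty',
                  'alert me at noon', 'wake me up at five'],
    'intent': ['alarm_set', 'alarm_set', 'alarm_set', 'alarm_set'],
    'status': ['move', 'remove', 'review', None],
    'move_to': ['calendar_set', None, None, None],  # hypothetical column
})

refined_df = reviewed_df.copy()

# move: relabel the entry with the intent it should belong to.
move_mask = refined_df['status'] == 'move'
refined_df.loc[move_mask, 'intent'] = refined_df.loc[move_mask, 'move_to']

# remove: flag the entry so it can be dropped when merging back.
refined_df['remove'] = refined_df['status'] == 'remove'

# review: collect the entries a human should look at again.
to_review_again_df = refined_df[refined_df['status'] == 'review']

print(refined_df[['utterance', 'intent', 'status', 'remove']])
```

Flagging instead of deleting keeps the removed rows around until the final merge, so the decision can still be reversed.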


In [None]:
# TODO: from here it's all just a work in progress. These 4 flows should be implemented in a human for loop pipeline with ipysheets.
# TODO: at the end of each flow, the dataframe will be appended with a column to indicate MDR was successfully completed. This way users can keep track of what they have refined.

In [None]:
intent_refinement_dictionary = MacroIntentRefinement.get_intent_dataframes_to_refine(incorrect_intent_predictions_df)

From the list of intents, pick one to refine. If you are unsure which one to start with, look at the report from `MacroIntentRefinement.get_incorrect_predicted_intents_report` above and start with the intent that has the most incorrect predictions.

In [None]:
intent_to_refine = MacroIntentRefinement.list_and_select_intent(incorrect_intent_predictions_df)

In [None]:
to_review_sheet = MacroDataRefinement.create_sheet(
    intent_refinement_dictionary[intent_to_refine])
to_review_sheet

It's a really good idea to convert the sheet back to a dataframe and save it as a CSV file.

NOTE: make sure to replace the domain and intent in the CSV file name (here `alarm` and `alarm_query`) with the domain and intent you have actually refined, in both `to_csv()` and `read_csv()`.
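To avoid editing the file name by hand, the path could be built from the selected domain and intent. This is only a sketch: the helper name `refinement_csv_path` is hypothetical, and it simply reproduces the naming pattern used in the cells of this notebook.

```python
from pathlib import Path


def refinement_csv_path(domain, intent, stage='reviewed', base_dir='data'):
    """Build e.g. data/reviewed/reviewed_<domain>_<intent>_incorrectly_predicted_df.csv."""
    file_name = f'{stage}_{domain}_{intent}_incorrectly_predicted_df.csv'
    return Path(base_dir) / stage / file_name


print(refinement_csv_path('alarm', 'alarm_query'))
print(refinement_csv_path('alarm', 'alarm_query', stage='refined'))
```

In the notebook the two arguments would come from `domain_selection` and `intent_to_refine`, and the same call could be passed to both `to_csv()` and `pd.read_csv()`.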

In [None]:
reviewed_intent_df = MacroDataRefinement.convert_sheet_to_dataframe(to_review_sheet)

In [None]:
reviewed_intent_df.to_csv(
    'data/reviewed/reviewed_alarm_alarm_query_incorrectly_predicted_df.csv')


In [None]:
reviewed_intent_df = pd.read_csv(
    'data/reviewed/reviewed_alarm_alarm_query_incorrectly_predicted_df.csv', sep=',', index_col=0)


In [None]:
#TODO: move remove_entries_marked_remove to the end of the flow (so that when joining together back to the main dataset, those entries are removed there too!)
#TODO: add in list of intents above this cell so people can see if it's the right intent to pick
#TODO: change method to look up scenario for changed intent to relabel the scenario
refined_intent_df = reviewed_intent_df.apply(
    MacroDataRefinement.move_entry, axis=1)

In [None]:
refined_intent_df.to_csv(
    'data/refined/refined_alarm_alarm_query_incorrectly_predicted_df.csv', sep=',')


In [None]:
refined_intent_df = pd.read_csv(
    'data/refined/refined_alarm_alarm_query_incorrectly_predicted_df.csv', sep=',', index_col=0)


In [None]:
#TODO: this can be removed in the future, I forgot to remove these entries when I did my own refinement so I drop them for now
refined_intent_df = refined_intent_df[~refined_intent_df['status'].str.contains(
    'IRR', na=False)]


We will mark all the refined entries and merge these into the original data set, then save it.
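The merge step can be pictured as an index-aligned overwrite. The sketch below shows that idea in plain pandas on toy data; the `refined` marker column and the `utterance` column are assumptions, and `merge_refined_data_into_dataset` may handle more cases than this.

```python
import pandas as pd

# Toy stand-ins: the full data set and one refined entry with the same index.
dataset_df = pd.DataFrame(
    {'utterance': ['set alarm', 'remind me tomorow', 'list alarms'],
     'intent': ['alarm_set', 'alarm_set', 'alarm_query']},
    index=[10, 11, 12],
)
refined_df = pd.DataFrame(
    {'utterance': ['remind me tomorrow'],
     'intent': ['calendar_set'],
     'refined': ['intent']},  # hypothetical marker column
    index=[11],
)

merged_df = dataset_df.copy()
for column in refined_df.columns:
    # Create columns that only exist in the refined data (e.g. the marker).
    if column not in merged_df.columns:
        merged_df[column] = None
    # Overwrite only the refined rows; pandas aligns on the index.
    merged_df.loc[refined_df.index, column] = refined_df[column]

print(merged_df)
```

This only works if the refined dataframe keeps the original index, which is why the CSV files above are read back with `index_col=0`.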

In [None]:
refined_intent_df = MacroDataRefinement.mark_entries_as_refined(refined_dataframe=refined_intent_df, refined_type='intent')

In [None]:
nlu_data_refined_df = MacroDataRefinement.merge_refined_data_into_dataset(
    nlu_data_df, refined_intent_df)
nlu_data_refined_df

In [None]:
nlu_data_refined_df.to_csv('data/refined/nlu_data_refined_df.csv')

Now repeat all of these steps for each intent in each domain you want to work on and then when you are done, test it below.

In [None]:
removed_nlu_data_refined_df = nlu_data_refined_df[nlu_data_refined_df['remove'] != True]

LR_intent_classifier_model, tfidf_vectorizer = NLUEngine.train_intent_classifier(
    data_df_path=removed_nlu_data_refined_df,
    labels_to_predict='intent',
    classifier=LR
)

refined_incorrect_intent_predictions_df = IntentMatcher.get_incorrect_predicted_labels(
    removed_nlu_data_refined_df[removed_nlu_data_refined_df['scenario'] == 'alarm'], LR_intent_classifier_model, tfidf_vectorizer)
refined_incorrect_intent_predictions_df


In [None]:
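# Evaluate on the refined data set (not the original nlu_data_df),
# otherwise the report cannot reflect the refinement.
improved_intent_report_df = NLUEngine.evaluate_intent_classifier(
    data_df_path=removed_nlu_data_refined_df,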
    labels_to_predict='intent',
    classifier=LR
)
improved_intent_report_df

In [None]:
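# Recompute the feature rankings for the retrained model and build the report
# with the same class and arguments as the first report above.
improved_intent_feature_rank_df = MacroIntentRefinement.intent_keyword_feature_rankings(
    LR_intent_classifier_model, tfidf_vectorizer)
MacroIntentRefinement.get_incorrect_predicted_intents_report(
    removed_nlu_data_refined_df[removed_nlu_data_refined_df['scenario'] == 'alarm'], refined_incorrect_intent_predictions_df, improved_intent_report_df, improved_intent_feature_rank_df)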


Besides some incorrect utterances and intents, we can see that there is an overlap between the intents 'alarm_set' and 'calendar_set'. This is because those two intents are not well defined and will require refinement. We will try to fix this.


In [None]:
#TODO: integrate refined_intent_df into the main dataset and save it as nlu_data_refined_df


#TODO: export nlu_data_refined_df to a csv file and save it as NLU-Data-Home-Domain-Annotated-Refined.csv

In [None]:
#TODO: for every intent in the predicted intent column, get the top 5 tfidf features and their scores
# Like this: https://stackoverflow.com/questions/34232190/scikit-learn-tfidfvectorizer-how-to-get-top-n-terms-with-highest-tf-idf-score
#TODO: Make sure to pass them to the intent refinement process for each intent by putting them in the report!


In [None]:

#TODO: get the counts of the terms from the utterances that are incorrect for a specific domain (should I filter by tfidf scores?)
#TODO: Look up the most popular terms for an intent if they are red or green for that intent

In [None]:
# For every word (feature) in the utterances, we get the coefficients for the intents.
# From the shape, we see it contains the classes and the coefficients.
coefs = LR_intent_classifier_model.coef_
coefs.shape

In [None]:
from nlu_engine import LabelEncoder

In [None]:
# We need to get the encoded classes
classes = LR_intent_classifier_model.classes_
classes

In [None]:
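# We want to get the actual feature names (the words)
# (get_feature_names() matches the pinned sklearn<0.24; newer releases use get_feature_names_out())
feature_names = tfidf_vectorizer.get_feature_names()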

In [None]:
# Let's try an example with TFIDF only, this only tells us overall the TFIDF score for each word, not related to the intent
from nlu_engine import TfidfEncoder

utterance = 'turn off the alarm I set'

response = TfidfEncoder.encode_vectors(
    utterance, tfidf_vectorizer)

for vector in response.nonzero()[1]:
    print(f'word: {feature_names[vector]} - ranking: {response[0, vector]}')


In [None]:
# Let's rip out a list of tuples with the features and their coefficients for each intent
output = []
for classIndex, features in enumerate(coefs):
    for featureIndex, feature in enumerate(features):
        output.append(
            (classes[classIndex], feature_names[featureIndex], feature))
feature_rank_df = pd.DataFrame(output, columns=['class', 'feature', 'coef'])
feature_rank_df


In [None]:
# It is a good idea to convert the classes from the encoded to a normal human form
feature_rank_df['class'] = LabelEncoder.inverse_transform(feature_rank_df['class'])

In [None]:
# Sort the features by the absolute value of their coefficient and color them red or green
feature_rank_df["abs_value"] = feature_rank_df["coef"].abs()
feature_rank_df["colors"] = feature_rank_df["coef"].apply(lambda x: "green" if x > 0 else "red")
feature_rank_df = feature_rank_df.sort_values("abs_value", ascending=False)

In [None]:
# Take a look at an example of the word 'set'
feature_rank_df[(feature_rank_df['feature'] == 'set') & (feature_rank_df['colors'] == 'red')
   ].sort_values('abs_value', ascending=False)


### Entity extraction report

The entity extraction could be greatly improved by improving the features it uses. It would be great if someone would take a look at this. CRF features similar to what Snips uses, such as Brown clustering, would probably be better.
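For readers who have not seen CRF features before, here is a minimal sketch of a per-token feature function in the style commonly used with sklearn-crfsuite. It is not the feature set in `entity_extractor.py`; the feature names are made up, and the optional `cluster_lookup` (a dict from lowercased word to a Brown cluster bit string) is a hypothetical hook for the Brown clustering idea.

```python
def token_features(tokens, position, cluster_lookup=None):
    """Build a feature dict for one token, using its neighbours as context."""
    token = tokens[position]
    features = {
        'bias': 1.0,
        'token.lower': token.lower(),
        'token.isdigit': token.isdigit(),
        'token.istitle': token.istitle(),
        'token.suffix3': token[-3:],
    }
    if cluster_lookup is not None:
        # Hypothetical: use a prefix of the cluster bit string as a feature.
        cluster = cluster_lookup.get(token.lower())
        if cluster is not None:
            features['cluster.prefix4'] = cluster[:4]
    if position > 0:
        features['previous.lower'] = tokens[position - 1].lower()
    else:
        features['BOS'] = True  # beginning of the utterance
    if position < len(tokens) - 1:
        features['next.lower'] = tokens[position + 1].lower()
    else:
        features['EOS'] = True  # end of the utterance
    return features


def utterance_features(tokens, cluster_lookup=None):
    """Build the feature dicts for every token in an utterance."""
    return [token_features(tokens, i, cluster_lookup) for i in range(len(tokens))]


tokens = 'wake me up at five am'.split()
for feature_dict in utterance_features(tokens):
    print(feature_dict)
```

A CRF is trained on one such list of dicts per utterance, paired with one label per token, so adding or removing a feature only means changing this function.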

In [None]:
#TODO: implement brown clustering to improve entity extraction (see entity_extractor.py)

It is important to have the NLTK tokenizer to be able to extract entities.

In [None]:
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

Due to the error described in [this git issue](https://github.com/TeamHG-Memex/sklearn-crfsuite/issues/60), we have to use an older version of scikit-learn (sklearn<0.24); otherwise the latest version would work. Hopefully this gets fixed one day.

In [None]:
entity_report_df = NLUEngine.evaluate_entity_classifier(data_df=nlu_data_df)

In [None]:
entity_report_df.sort_values(by=['f1-score'])

In [None]:
#TODO: Benchmark the state features to find the best and the worst, remove/replace worst: add in state features like here: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#let-s-check-what-classifier-learned
# Specifically, we want print_state_features()

As we have seen from the entity extraction report, the entity extraction is not working for the alarm_type.

In [None]:

#TODO: move this below the intent cleaning flow
# na=False treats rows without an annotation as "no match" instead of raising an error
nlu_scenario_df = nlu_scenario_df[nlu_scenario_df['answer_annotation'].str.contains(
    'alarm_type', na=False)]

## Entity: convert to ipysheet and review
TODO: add a description of the types of fixes we can make to the NLU data for entities


In [None]:

# TODO: same as above for intents but with predicted entities: report on them, break them down into a dictionary of dataframes and refine them..

For the example with 'alarm' and the alarm_type:
* We see that the alarm_type entities are really event_name (e.g. wake up, soccer practice), except for ID 5879. We will need to change them to event_name and remove ID 5879.
* The last one (ID 6320) is a mistake. Someone got confused with the prompt and assumed alarm is a security system. This is out of scope for the alarm domain, as the alarms are ones set on a phone or other device. We will drop this utterance.

Once you are done reviewing, convert the sheet back to a dataframe and check to make sure it looks okay.

Let's change all alarm_type entities to event_name.
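A plain string replace also changes the text `alarm_type` if it happens to appear outside an entity tag. A slightly stricter variant, sketched below, only rewrites the label inside a tag. It assumes the annotations look like `[alarm_type : soccer practice]`; the helper name `relabel_entity` and the example sentence are hypothetical.

```python
import re


def relabel_entity(annotation, old_label, new_label):
    """Rename an entity label, but only where it opens a '[label : value]' tag."""
    pattern = re.compile(r'\[\s*' + re.escape(old_label) + r'\s*:')
    return pattern.sub('[' + new_label + ' :', annotation)


example = 'set an alarm for [alarm_type : soccer practice] at [time : five pm]'
print(relabel_entity(example, 'alarm_type', 'event_name'))
```

With a dataframe this could be applied through `.apply(lambda text: relabel_entity(text, 'alarm_type', 'event_name'))` on the `answer_annotation` column.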

In [None]:

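# NOTE: reviewed_scenario_df is the reviewed sheet converted back to a dataframe
# (e.g. with MacroDataRefinement.convert_sheet_to_dataframe); that step is not shown in this notebook yet.
reviewed_scenario_df['answer_annotation'] = reviewed_scenario_df['answer_annotation'].str.replace(
    'alarm_type', 'event_name')
reviewed_scenario_df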


Okey dokey, now we can merge this with the original data set and see if it has already made a difference (well, of course it did!).

In [None]:
nlu_data_df.drop(
    reviewed_scenario_df[reviewed_scenario_df['remove'] == True].index, inplace=True)

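# Keep only the entries that are not marked for removal
# (`~column == True` parses as `(~column) == True`, which is fragile)
reviewed_scenario_df = reviewed_scenario_df[reviewed_scenario_df['remove'] != True]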

nlu_data_df.loc[nlu_data_df.index.intersection(
    reviewed_scenario_df.index), 'answer_annotation'] = reviewed_scenario_df['answer_annotation']

nlu_data_df[(nlu_data_df['scenario'].str.contains('alarm')) & (nlu_data_df['answer_annotation'].str.contains(
    'event_name'))]


### Benchmark changed data set
TODO: repeat reports for the changed data set for domain and entities and compare
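One possible way to compare the reports is to join the original and the reviewed report on their labels and look at the change in f1 score. The sketch below assumes each report is a dataframe indexed by label with an `f1-score` column; the numbers are placeholders and not results from this data set.

```python
import pandas as pd

# Placeholder reports; the scores are made up for illustration.
original_report_df = pd.DataFrame(
    {'f1-score': [0.10, 0.80, 0.90]},
    index=['alarm_type', 'event_name', 'time'])
reviewed_report_df = pd.DataFrame(
    {'f1-score': [0.85, 0.90]},
    index=['event_name', 'time'])

# An outer join keeps labels that exist in only one of the two reports.
comparison_df = original_report_df[['f1-score']].join(
    reviewed_report_df[['f1-score']],
    how='outer', lsuffix='_original', rsuffix='_reviewed')
comparison_df['f1_change'] = (
    comparison_df['f1-score_reviewed'] - comparison_df['f1-score_original'])

print(comparison_df.sort_values('f1_change'))
```

Labels that were removed during refinement, like `alarm_type` here, show up with a missing reviewed score instead of silently disappearing.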


In [None]:

entity_reviewed_report_df = NLUEngine.evaluate_entity_classifier(
    data_df=nlu_data_df)
entity_reviewed_report_df.sort_values(by=['f1-score'])

If you are sure it is okay, you can save it as a CSV file. Make sure to name it correctly (e.g. `alarm_domain_first_review.csv`).

In [None]:
reviewed_scenario_df.to_csv('alarm_domain_first_review.csv')

Load it back up and check to make sure it looks okay. Make sure to give it the right name!


In [None]:
alarm_domain_first_review_df = pd.read_csv(
    'alarm_domain_first_review.csv', index_col=0)
alarm_domain_first_review_df.tail(50)


In [None]:
# TODO: implement the evaluate_classifier in the NLU engine to check f1 score for intents and entities in the domain vs original NLU data of domain!
# Value: benchmark!
#TODO: implement a flow for getting the domains with the lowest f1 scores by intent/domain and entities and cleaning them by the order of the lowest f1 scores
# TODO: concat all reviewed dfs and save to csv
# TODO: add benchmark for whole NLU data set before and after cleaning! (by intents and domains!)
# TODO: review the review marked entries
# TODO: add new column for notes
# TODO: change flow of review for only ones that should be reviewed, not all of the ones that have been changed (track changes by comparing against the original data set)
# TODO: do the changed utterances have to be changed in other fields too or is it just enough for the tagged utterance field?
# TODO: add visualizations of domains, their intents, keywords in utterances, and entities to top
