# Chat Intents

## Applying labels

**Summary**

This notebook shows how to automatically extract and apply labels to document clusters. See the `chatintents_tutorial.ipynb` notebook for a tutorial on the chatintents package, which makes the methods outlined below easier to use.

In [1]:
import collections
from pathlib import Path

import numpy as np
import pandas as pd
import spacy
from spacy import displacy

pd.set_option("display.max_rows", 600)
pd.set_option("display.max_columns", 500)
pd.set_option("max_colwidth", 400)

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
data_clustered = pd.read_csv('../data/processed/sample_clustered.csv')
data_clustered = data_clustered[['text', 'label_st1']]
data_clustered.sample(10)

Unnamed: 0,text,label_st1
840,Please help! I was mugged and everything stolen. What do I do to stop them from accessing my account?,14
141,Can you explain why my Google Pay Top isn't working?,13
611,I recently made a transfer but I need to cancel it as soon as possible. Please let me know when this happens.,9
906,"According to the app, I got cash from an ATM but I haven't made any transactions.",-1
867,Do you give out Visa or Mastercards?,25
313,I found my card! Can I link it back into the app?,17
101,Can you tell me what I need for identity validation?,11
197,I am trying to revert a transcation I did this morning,9
42,How do I check security settings using the app?,7
546,Please could you give me a refund,3


In [4]:
example_category = data_clustered[data_clustered['label_st1']==31].reset_index(drop=True)
example_category 

Unnamed: 0,text,label_st1
0,"I am overseas in China, can I get a replacement card?",31
1,Where are your cards delivered to?,31
2,Do I get a real card?,31
3,I would like to purchase another card.,31
4,tell me how I can order my card.,31
5,What is the procedure for me to get more cards on my account?,31
6,How to receive the actual card,31
7,Where do we mail the card?,31
8,Can I get a card in the EU?,31
9,Please tell me how I can receive a physical card,31


In [5]:
example_doc = nlp(list(example_category['text'])[12])

print(f'{example_doc}\n')

for token in example_doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_ , token.is_stop)

Where do I go to order my card?

Where where ADV WRB advmod True
do do AUX VBP aux True
I I PRON PRP nsubj True
go go VERB VB ROOT True
to to PART TO aux True
order order VERB VB advcl False
my my PRON PRP$ poss True
card card NOUN NN dobj False
? ? PUNCT . punct False
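
To see which of these tokens would actually feed a label, the sketch below replays the bucketing rule used later in `extract_labels` on the rows printed above. The tuples are copied by hand from that output rather than recomputed with spaCy, and `bucket_for` is a hypothetical helper name used only for illustration.

```python
# Token attributes copied from the printed output above:
# (text, lemma, pos, dep, is_stop)
tokens = [
    ("Where", "where", "ADV", "advmod", True),
    ("do", "do", "AUX", "aux", True),
    ("I", "I", "PRON", "nsubj", True),
    ("go", "go", "VERB", "ROOT", True),
    ("to", "to", "PART", "aux", True),
    ("order", "order", "VERB", "advcl", False),
    ("my", "my", "PRON", "poss", True),
    ("card", "card", "NOUN", "dobj", False),
    ("?", "?", "PUNCT", "punct", False),
]


def bucket_for(pos, dep, is_stop):
    """Hypothetical helper: name the list a token would be appended to, or None."""
    if is_stop:
        return None
    if dep == "ROOT":
        return "verbs"
    if dep == "dobj":
        return "dobjs"
    if pos == "NOUN":
        return "nouns"
    if pos == "ADJ":
        return "adjs"
    return None


# Keep only the tokens that land in one of the four lists
kept = {}
for text, _lemma, pos, dep, is_stop in tokens:
    bucket = bucket_for(pos, dep, is_stop)
    if bucket is not None:
        kept[text] = bucket

print(kept)
```

Under these rules only `card` is retained for this sentence: the root verb `go` is flagged as a stop word, and `order` is neither a root, a direct object, a noun, nor an adjective.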


In [6]:
displacy.render(example_doc, style="dep")

In [7]:
fig = displacy.render(example_doc, style="dep", jupyter=False)
# Use just "dependency_plot.svg" to save the file in the current working directory
output_path = Path("../images/dependency_plot.svg")
output_path.write_text(fig, encoding="utf-8")

6335

## Helper functions

In [8]:
def get_group(df, category_col, category):
    """
    Returns documents of a single category
    
    Arguments:
        df: pandas dataframe of documents
        category_col: str, column name corresponding to categories or clusters
        category: int, cluster number to return
    Returns:
        single_category: pandas dataframe with documents from a single category
    """
    
    single_category = df[df[category_col]==category].reset_index(drop=True)

    return single_category 

In [9]:
def most_common(lst, n_words):
    """
    Get most common words in a list of words
    
    Arguments:
        lst: list, each element is a word
        n_words: int, number of top common words to return
    
    Returns:
        list of (word, count) tuples for the n_words most common words
    """
    counter = collections.Counter(lst)
    return counter.most_common(n_words)
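
As a quick check of the return shape, the sketch below calls `collections.Counter.most_common` directly on a made-up word list (the words are illustrative, not taken from the dataset). It returns `(word, count)` pairs, which is why the labelling code later indexes the result with `[0][0]`.

```python
import collections

# Illustrative word list, not drawn from the clustered data
words = ["card", "account", "card", "cash", "account", "card"]

top_two = collections.Counter(words).most_common(2)
print(top_two)        # list of (word, count) tuples, highest count first
print(top_two[0][0])  # the most common word on its own
```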

In [10]:
def extract_labels(category_docs, print_word_counts=False):
    """
    Extract labels from documents in the same cluster by concatenating
    most common verbs, objects, and nouns

    Arguments:
        category_docs: list of documents, all from the same category or
                       clustering
        print_word_counts: bool, True will print word counts of each type in this category

    Returns:
        label: str, group label derived from concatenating most common
               verb, object, and two most common nouns

    """

    verbs = []
    dobjs = []
    nouns = []
    adjs = []
    
    verb = ''
    dobj = ''
    noun1 = ''
    noun2 = ''

    # for each document, append verbs, dobjs, nouns, and adjectives to
    # running lists for the whole cluster
    for doc_text in category_docs:
        doc = nlp(doc_text)
        for token in doc:
            if not token.is_stop:
                if token.dep_ == 'ROOT':
                    verbs.append(token.text.lower())

                elif token.dep_=='dobj':
                    dobjs.append(token.lemma_.lower())

                elif token.pos_=='NOUN':
                    nouns.append(token.lemma_.lower())
                    
                elif token.pos_=='ADJ':
                    adjs.append(token.lemma_.lower())

    # for printing out for inspection purposes
    if print_word_counts:
        for word_lst in [verbs, dobjs, nouns, adjs]:
            counter=collections.Counter(word_lst)
            print(counter)
    
    # take most common words of each form
    if len(verbs) > 0:
        verb = most_common(verbs, 1)[0][0]
    
    if len(dobjs) > 0:
        dobj = most_common(dobjs, 1)[0][0]
    
    if len(nouns) > 0:
        noun1 = most_common(nouns, 1)[0][0]
    
    if len(set(nouns)) > 1:
        noun2 = most_common(nouns, 2)[1][0]
    
    # concatenate the most common verb-dobj-noun1-noun2 (if they exist)
    label_words = [verb, dobj]
    
    for word in [noun1, noun2]:
        if word not in label_words:
            label_words.append(word)
    
    # drop every empty placeholder (list.remove would only drop the first one)
    label_words = [word for word in label_words if word]
    
    label = '_'.join(label_words)
    
    return label

In [11]:
def apply_and_summarize_labels(df, category_col):
    """
    Assign groups to original documents and provide group counts

    Arguments:
        df: pandas dataframe of original documents of interest to
            cluster
        category_col: str, column name corresponding to categories or clusters

    Returns:
        summary_df: pandas dataframe with model cluster assignment, number
                    of documents in each cluster and derived labels
    """
    
    numerical_labels = df[category_col].unique()
    
    # create dictionary of the numerical category to the generated label
    label_dict = {}
    for label in numerical_labels:
        current_category = list(get_group(df, category_col, label)['text'])
        label_dict[label] = extract_labels(current_category)
        
    # create summary dataframe of numerical labels and counts
    summary_df = (df.groupby(category_col)['text'].count()
                    .reset_index()
                    .rename(columns={'text':'count'})
                    .sort_values('count', ascending=False))
    
    # apply generated labels
    summary_df['label'] = summary_df[category_col].map(label_dict)
    
    return summary_df

In [12]:
def combine_ground_truth(df_clusters, df_ground, key):
    """
    Combines dataframes of documents with extracted and ground truth labels
    
    Arguments:
        df_clusters: pandas dataframe, each row as a document with corresponding extracted label
        df_ground: pandas dataframe, each row as a document with corresponding ground truth label
        key: str, key to merge tables on
        
    Returns:
        df_combined: pandas dataframe, each row as a document with extracted and ground truth labels
    """
    df_combined = pd.merge(df_clusters, df_ground, on=key, how = 'left')
    return df_combined

In [13]:
def get_top_category(df_label, df_summary):
    """
    Returns a dataframe comparing a single model's results to the ground truth
    labels, to evaluate cluster composition and the derived label against the
    name and count of the most common ground truth category

    Arguments:
        df_label: pandas dataframe, each row as a document with extracted and ground truth labels
                  (result of `combine_ground_truth` function)
        df_summary: pandas dataframe with model cluster assignment, number
                    of documents in each cluster and derived labels
                    (result from `apply_and_summarize_labels` function)

    Returns:
        df_result: pandas dataframe with each row containing information on
                   each cluster identified by this model, including count,
                   extracted label, most represented ground truth label name,
                   count and percentage of that group
    """
    df_label_ground = (df_label.groupby('label')
                      .agg(top_ground_category=('category', lambda x:x.value_counts().index[0]), 
                           top_cat_count = ('category', lambda x:x.value_counts().iloc[0]))
                      .reset_index())
    
    df_result = pd.merge(df_summary, df_label_ground, on='label', how='left')
    df_result['perc_top_cat'] = df_result.apply(lambda x: int(round(100*x['top_cat_count']/x['count'])), axis=1)
    
    return df_result

### Manual inspection

In [14]:
example_category = list(get_group(data_clustered, 'label_st1', 46)['text'])
extract_labels(example_category, True)

Counter({'help': 6, 'says': 1, 'withdrew': 1, 'checking': 1, 'understand': 1, 'believe': 1})
Counter({'card': 4, 'withdrawal': 3, 'money': 1, 'account': 1, 'cash': 1, 'reason': 1})
Counter({'account': 8, 'withdrawal': 5, 'cash': 3, 'withdraw': 1, 'card': 1, 'app': 1, 'money': 1, 'withdrawl': 1})
Counter({'duplicate': 1, 'strange': 1, 'unexpected': 1, 'unusual': 1, 'odd': 1})


'help_card_account_withdrawal'
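
The label above can be traced by hand from the printed counters. The sketch below hard-codes the first three counters (copied from the output, not recomputed with spaCy) and repeats the assembly step: most common root, most common direct object, then the two most common nouns that are not already in the label.

```python
from collections import Counter

# Counters copied from the printed output for cluster 46
verbs = Counter({'help': 6, 'says': 1, 'withdrew': 1, 'checking': 1,
                 'understand': 1, 'believe': 1})
dobjs = Counter({'card': 4, 'withdrawal': 3, 'money': 1, 'account': 1,
                 'cash': 1, 'reason': 1})
nouns = Counter({'account': 8, 'withdrawal': 5, 'cash': 3, 'withdraw': 1,
                 'card': 1, 'app': 1, 'money': 1, 'withdrawl': 1})

# Start with the top root and top direct object
label_words = [verbs.most_common(1)[0][0], dobjs.most_common(1)[0][0]]

# Add the two most common nouns unless they are already present
for noun, _count in nouns.most_common(2):
    if noun not in label_words:
        label_words.append(noun)

label = "_".join(label_words)
print(label)
```

Adjectives are counted for inspection only and never enter the label.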

### Without ground truth labels

In [15]:
cluster_summary = apply_and_summarize_labels(data_clustered, 'label_st1')
cluster_summary.head(20)

Unnamed: 0,label_st1,count,label
0,-1,56,add_card_app
27,26,49,use_account_card_auto
14,13,45,pending_money_account
4,3,44,help_refund_statement
15,14,38,help_card_app
32,31,37,like_card_cost
5,4,36,think_rate_exchange
34,33,31,tried_money_card_today
33,32,31,charged_fee_withdrawal_cash
30,29,29,use_currency_app


In [16]:
labeled_clusters = pd.merge(data_clustered, cluster_summary[['label_st1', 'label']], on='label_st1', how = 'left')
labeled_clusters.head()

Unnamed: 0,text,label_st1,label
0,I'm worried my card might be lost in the mail? How long does it usually take to arrive?,30,expect_card_week
1,I got charged a fee that shouldn't be there from my cash,32,charged_fee_withdrawal_cash
2,Do you charge for making a withdrawal? I took some money out of my account earlier and I was charged for this.,32,charged_fee_withdrawal_cash
3,Is there an issue with my account? I don't see a cheque deposit that I made yesterday. Please assist.,53,deposited_cheque_balance_yesterday
4,Are there ways for other people to send me money?,51,sent_money_friend_hour


If we don't have ground truth labels (the primary use case for this approach), the tables above are the final result. Here we do have ground truth labels, so we can investigate how well the model did.

### With ground truth labels

In [17]:
data_ground = pd.read_csv('../data/processed/data_sample.csv')[['text', 'category']]
data_ground.head()

Unnamed: 0,text,category
0,I'm worried my card might be lost in the mail? How long does it usually take to arrive?,card_delivery_estimate
1,I got charged a fee that shouldn't be there from my cash,cash_withdrawal_charge
2,Do you charge for making a withdrawal? I took some money out of my account earlier and I was charged for this.,cash_withdrawal_charge
3,Is there an issue with my account? I don't see a cheque deposit that I made yesterday. Please assist.,balance_not_updated_after_cheque_or_cash_deposit
4,Are there ways for other people to send me money?,receiving_money


In [18]:
labeled_clusters = combine_ground_truth(labeled_clusters, data_ground, 'text')
labeled_clusters.sample(10)

Unnamed: 0,text,label_st1,label,category
572,I've forgotten my passcode. Can I reset?,2,remember_passcode_app,passcode_forgotten
495,How can I investigate a missing refund?,3,help_refund_statement,Refund_not_showing_up
989,How can I cancel this transaction?,9,cancel_transfer_account_tomorrow,cancel_transfer
565,What currencies can I use?,29,use_currency_app,supported_cards_and_currencies
548,How are the exchange rates determined?,4,think_rate_exchange,exchange_rate
416,Why hasn't my in country transfer gone through yet? I confirmed the account info a couple days ago but the payment hasn't been posted yet.,49,waiting_transfer_account,transfer_not_received_by_recipient
259,Why is this payment pending?,12,waiting_payment_card,pending_card_payment
712,I was charged and shouldn't have been charged when using my card!,35,charged_card_fee,card_payment_fee_charged
765,My card expires soon,31,like_card_cost,card_about_to_expire
950,How can I tell the source for my available funds?,19,find_source_fund_money,verify_source_of_funds


The extracted labels (the `label` column) match the ground truth labels (the `category` column) quite well for many of the sample documents.
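
One rough way to make this comparison concrete is to count the words an extracted label shares with the ground truth category name. The sketch below does this for a few rows copied from the sample above; the `shared_words` helper and the word-overlap measure are assumptions for illustration and are not part of this notebook's evaluation.

```python
def shared_words(label, category):
    """Hypothetical helper: lower-cased words common to both underscore-joined names."""
    return set(label.lower().split("_")) & set(category.lower().split("_"))


# (extracted label, ground truth category) pairs copied from the sample above
pairs = [
    ("remember_passcode_app", "passcode_forgotten"),
    ("help_refund_statement", "Refund_not_showing_up"),
    ("cancel_transfer_account_tomorrow", "cancel_transfer"),
    ("like_card_cost", "card_about_to_expire"),
]

for label, category in pairs:
    print(f"{label} vs {category}: {sorted(shared_words(label, category))}")
```

Exact word matching ignores inflections (for example `fund` versus `funds`), so it understates how close two names are.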

In [20]:
labeled_clusters[labeled_clusters['label_st1']==45]

Unnamed: 0,text,label_st1,label,category
54,Why does it say my transfer failed?,45,tried_transfer_account,failed_transfer
95,I wasn't able to do a transfer to an account,45,tried_transfer_account,beneficiary_not_allowed
247,Can you please explain why my transfer failed?,45,tried_transfer_account,failed_transfer
283,I went to do a transfer and it was declined.,45,tried_transfer_account,declined_transfer
316,A transfer to my account was denied.,45,tried_transfer_account,beneficiary_not_allowed
320,I tried to transfer money but it didn't go through,45,tried_transfer_account,failed_transfer
330,How do I contact customer support about my declined transfer?,45,tried_transfer_account,declined_transfer
377,Why was one of my transfers declined?,45,tried_transfer_account,declined_transfer
379,"I normally have no issues making transfers, so why am I suddenly being told it is not possible?",45,tried_transfer_account,beneficiary_not_allowed
472,My transfer is failing when I try to do it.,45,tried_transfer_account,failed_transfer


#### Count and name of most common category of generated labels and clusters

In [21]:
get_top_category(labeled_clusters, cluster_summary)

Unnamed: 0,label_st1,count,label,top_ground_category,top_cat_count,perc_top_cat
0,-1,56,add_card_app,beneficiary_not_allowed,8,14
1,26,49,use_account_card_auto,automatic_top_up,14,29
2,13,45,pending_money_account,pending_top_up,14,31
3,3,44,help_refund_statement,Refund_not_showing_up,26,59
4,14,38,help_card_app,lost_or_stolen_card,10,26
5,31,37,like_card_cost,order_physical_card,15,41
6,4,36,think_rate_exchange,wrong_exchange_rate_for_cash_withdrawal,14,39
7,33,31,tried_money_card_today,declined_cash_withdrawal,14,45
8,32,31,charged_fee_withdrawal_cash,cash_withdrawal_charge,23,74
9,29,29,use_currency_app,fiat_currency_support,10,34


Many of the smaller groups appear to be purer (the top category is at or near 100%) than the larger groups. It therefore makes sense that the extracted labels for the smaller groups tend to be more suitable than some of those for the larger clusters, which contain a more varied mix of ground truth categories.
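
Purity here is simply `top_cat_count` divided by `count`, expressed as a rounded percentage. As a minimal sketch, the snippet below recomputes `perc_top_cat` for a few rows copied from the table above, using the same rounding as `get_top_category`.

```python
# (cluster id, count, top_cat_count) copied from rows of the table above
rows = [(-1, 56, 8), (3, 44, 26), (32, 31, 23), (29, 29, 10)]

# Share of each cluster taken up by its most common ground truth category
purity = {cluster: int(round(100 * top / count)) for cluster, count, top in rows}
print(purity)
```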