# Incorporate Lexical Quality Metrics

## 1. Install required dependencies


You can use `pip` to install all packages required for this tutorial as follows:

```ipython3
!pip install sentence-transformers
!pip install cleanlab
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git
```

In [2]:
# Package installation (hidden on docs.cleanlab.ai).
# If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)
# Package versions we used: scikit-learn==1.2.0 sentence-transformers==2.2.2

dependencies = ["cleanlab", "sentence_transformers"]

# Suppress outputs that may appear if tensorflow happens to be improperly installed:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # disable parallelism to avoid deadlocks with huggingface

if "google.colab" in str(get_ipython()):  # Check if it's running in Google Colab
    %pip install cleanlab==v2.7.0
    cmd = ' '.join([dep for dep in dependencies if dep != "cleanlab"])
    %pip install $cmd
else:
    dependencies_test = [dependency.split('>')[0] if '>' in dependency
                         else dependency.split('<')[0] if '<' in dependency
                         else dependency.split('=')[0] for dependency in dependencies]
    missing_dependencies = []
    for dependency in dependencies_test:
        try:
            __import__(dependency)
        except ImportError:
            missing_dependencies.append(dependency)

    if len(missing_dependencies) > 0:
        print("Missing required dependencies:")
        print(*missing_dependencies, sep=", ")
        print("\nPlease install them before running the rest of this notebook.")



In [3]:
import re
import string
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

from cleanlab.classification import CleanLearning

In [4]:
# This cell is hidden from docs.cleanlab.ai

import random
import numpy as np

pd.set_option("display.max_colwidth", None)

SEED = 123456 # for reproducibility

np.random.seed(SEED)
random.seed(SEED)

## 2. Load and format the text dataset


In [5]:
data = pd.read_csv("https://s.cleanlab.ai/banking-intent-classification.csv")
data.head()

Unnamed: 0,text,label
0,i accidentally made a payment to a wrong account. what should i do?,cancel_transfer
1,"i no longer want to transfer funds, can we cancel that transaction?",cancel_transfer
2,"cancel my transfer, please.",cancel_transfer
3,i want to revert this mornings transaction.,cancel_transfer
4,i just realised i made the wrong payment yesterday. can you please change it to the right account? it's my rent payment and really really needs to be in the right account by tomorrow,cancel_transfer


In [6]:
raw_texts, raw_labels = data["text"].values, data["label"].values

raw_train_texts, raw_test_texts, raw_train_labels, raw_test_labels = train_test_split(raw_texts, raw_labels, test_size=0.1)

In [7]:
num_classes = len(set(raw_train_labels))

print(f"This dataset has {num_classes} classes.")
print(f"Classes: {set(raw_train_labels)}")

This dataset has 10 classes.
Classes: {'apple_pay_or_google_pay', 'card_payment_fee_charged', 'beneficiary_not_allowed', 'change_pin', 'getting_spare_card', 'cancel_transfer', 'visa_or_mastercard', 'supported_cards_and_currencies', 'lost_or_stolen_phone', 'card_about_to_expire'}


Let's print the first example in the train set.

In [8]:
i = 0
print(f"Example Label: {raw_train_labels[i]}")
print(f"Example Text: {raw_train_texts[i]}")

Example Label: getting_spare_card
Example Text: can i have another card in addition to my first one?


The data is stored as two numpy arrays each for the train and test sets:

1. `raw_train_texts` and `raw_test_texts` store the customer service request utterances in text format
2. `raw_train_labels` and `raw_test_labels` store the intent categories (labels) for each example


First, we need to encode the labels: cleanlab's functions require the label for each example to be an integer in 0, 1, …, num_classes - 1. We will use sklearn's `LabelEncoder` to encode our labels.


In [9]:
encoder = LabelEncoder()
encoder.fit(raw_train_labels)

train_labels = encoder.transform(raw_train_labels)
test_labels = encoder.transform(raw_test_labels)
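
To make the encoding concrete, here is a minimal pure-Python sketch of the kind of mapping `LabelEncoder` produces: the unique classes are sorted and each label is replaced by its position in that sorted order. The toy label list and the helper names `fit_classes`, `encode`, and `decode` are illustrative assumptions, not part of sklearn or cleanlab.

```python
# Toy stand-in for raw_train_labels (illustrative values only).
toy_labels = ["change_pin", "cancel_transfer", "change_pin", "visa_or_mastercard"]


def fit_classes(labels):
    """Return the sorted unique classes (the same ordering LabelEncoder uses)."""
    return sorted(set(labels))


def encode(labels, classes):
    """Map each string label to its integer index in `classes`."""
    index_of = {name: i for i, name in enumerate(classes)}
    return [index_of[name] for name in labels]


def decode(indices, classes):
    """Inverse mapping, analogous to `encoder.inverse_transform`."""
    return [classes[i] for i in indices]


classes = fit_classes(toy_labels)
encoded = encode(toy_labels, classes)
print(classes)
print(encoded)
```

Every encoded value lies in 0, 1, …, `len(classes) - 1`, which is the format cleanlab expects, and `decode` recovers the original strings for display.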

Next we convert the text strings into vectors better suited as inputs for our ML model.

We will use numeric representations from a pretrained Transformer model as embeddings of our text. The [Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) library offers simple methods to compute these embeddings for text data. Here, we load the pretrained `electra-small-discriminator` model, and then run our data through the network to extract a vector embedding of each example.

In [10]:
transformer = SentenceTransformer('google/electra-small-discriminator')

train_texts = transformer.encode(raw_train_texts)
test_texts = transformer.encode(raw_test_texts)

No sentence-transformers model found with name google/electra-small-discriminator. Creating a new one with mean pooling.
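
The message above says that, since this checkpoint is not a native sentence-transformers model, the library wraps it and averages the token embeddings into one vector per sentence (mean pooling). A minimal sketch of that averaging, assuming toy 3-dimensional token vectors and an attention mask that marks padding positions with 0. The function name `mean_pool` is hypothetical and this is not the library's own code.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_real += 1
            for d in range(dim):
                totals[d] += vector[d]
    if n_real == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / n_real for t in totals]


# Two real tokens followed by one padding token (made-up values).
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)
```

The padding vector does not influence the result, so sentences of different lengths map to vectors of the same dimension.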


## 3. Define a classification model and use cleanlab to find potential label errors

<a id="section3"></a>

In [11]:
model = LogisticRegression(max_iter=400)

In [12]:
cv_n_folds = 5  # number of cross-validation folds; values like 5 or 10 generally work well

cl = CleanLearning(model, cv_n_folds=cv_n_folds, verbose=True)
cl_lexical = CleanLearning(model, cv_n_folds=cv_n_folds, verbose=True)

In [14]:
label_issues_lexical = cl_lexical.find_label_issues(X=train_texts, labels=train_labels, X_raw=raw_train_texts)

Computing label noise estimates from provided noise matrix ...
Computing out of sample predicted probabilities via 5-fold cross validation. May take a while ...
Computing lexical quality metrics...
Using predicted probabilities to identify label issues ...
Identified 43 examples with label issues.


In [15]:
label_issues_non_lexical = cl.find_label_issues(X=train_texts, labels=train_labels)

Computing out of sample predicted probabilities via 5-fold cross validation. May take a while ...


Using predicted probabilities to identify label issues ...
Identified 39 examples with label issues.
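
For intuition about the "Using predicted probabilities to identify label issues" step, here is a simplified pure-Python sketch of the confident-learning idea. Each class gets a threshold equal to the average predicted probability of that class among the examples labeled with it, and an example is flagged when its most probable above-threshold class differs from its given label. The probabilities below are made up, and this is an illustration under simplifying assumptions, not cleanlab's actual implementation.

```python
def class_thresholds(pred_probs, labels, num_classes):
    """Per-class threshold: mean predicted probability of class k over examples labeled k.

    Assumes every class has at least one labeled example.
    """
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))
    return thresholds


def flag_label_issues(pred_probs, labels):
    """Flag examples whose most probable above-threshold class differs from the given label."""
    num_classes = len(pred_probs[0])
    thresholds = class_thresholds(pred_probs, labels, num_classes)
    flags = []
    for p, y in zip(pred_probs, labels):
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if not confident:
            # No class clears its threshold: leave the example unflagged.
            flags.append(False)
            continue
        best = max(confident, key=lambda k: p[k])
        flags.append(best != y)
    return flags


# Made-up out-of-sample probabilities for a 2-class problem.
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.05, 0.95],  # labeled 0, but the model strongly prefers class 1
]
labels = [0, 0, 1, 1, 0]
flags = flag_label_issues(pred_probs, labels)
print(flags)
```

Only the last example is flagged here, because class 1 clears its threshold while the given label is 0.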


In [17]:
def print_as_df(index):
    return pd.DataFrame(
        {
            "text": raw_train_texts,
            "given_label": raw_train_labels,
            "is_label_issue": label_issues_lexical["is_label_issue"],
            "predicted_label": encoder.inverse_transform(label_issues_lexical["predicted_label"]),
        },
    ).iloc[index]

In [18]:
print_as_df(label_issues_lexical[label_issues_lexical['is_label_issue'] ^ label_issues_non_lexical['is_label_issue']].index)

Unnamed: 0,text,given_label,is_label_issue,predicted_label
16,(A AND NOT B) OR (C AND NOT D) OR (B AND NOT C AND D),change_pin,False,lost_or_stolen_phone
29,i'm on vacation in europe but i desperately need to change my pin. can i do this from abroad?,change_pin,False,lost_or_stolen_phone
66,will you accept my credit card?,supported_cards_and_currencies,True,visa_or_mastercard
84,if i want to can i set up a new pin?,change_pin,True,getting_spare_card
117,what card or currency can i use to pay?,supported_cards_and_currencies,True,visa_or_mastercard
121,Would you rather fight one horse-sized duck or 100 duck-sized horses?,lost_or_stolen_phone,False,change_pin
159,i didn't know i was going to get charged to use my card.,card_payment_fee_charged,True,beneficiary_not_allowed
164,what happens if a merchant doesn't accept you payment?,change_pin,True,beneficiary_not_allowed
235,cancel transaction,cancel_transfer,True,supported_cards_and_currencies
282,I am an outlier :),visa_or_mastercard,False,beneficiary_not_allowed


The table above lists the examples on which the two runs disagree, i.e. the XOR of `is_label_issue` with and without lexical quality metrics. Although the lexical run flags only 4 more label issues in total (43 versus 39), the two runs produce different outcomes for 26 examples. A few patterns can be observed in the examples that follow.
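
The XOR comparison keeps exactly those rows where one run flags an issue and the other does not. A toy illustration with made-up flags (the two lists are assumptions, not the tutorial's data) shows why the net change in the number of flagged examples can be much smaller than the number of rows that flip:

```python
# Made-up flags for five examples from two hypothetical runs.
flags_lexical = [True, False, True, False, True]
flags_plain = [True, True, False, False, True]

# XOR is True only where the two runs disagree.
disagreements = [a ^ b for a, b in zip(flags_lexical, flags_plain)]
disagree_idx = [i for i, d in enumerate(disagreements) if d]

# The net change in the count can be small even when several rows flip.
net_change = sum(flags_lexical) - sum(flags_plain)
print(disagree_idx, net_change)
```

Here two examples change status while the total count of flagged examples stays the same.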


In [20]:
# Examples where the quality metrics helped: these were not flagged as issues before but are flagged now.
print_as_df([296, 164, 235])

Unnamed: 0,text,given_label,is_label_issue,predicted_label
296,"<button onclick=""alert('Bad, example!')"">Beep Beep!</button>",apple_pay_or_google_pay,True,beneficiary_not_allowed
164,what happens if a merchant doesn't accept you payment?,change_pin,True,beneficiary_not_allowed
235,cancel transaction,cancel_transfer,True,supported_cards_and_currencies


In [21]:
# However, text with no lexical errors tends to pass even when it is unrelated to its label. For example:
print_as_df([282, 16, 121])


Unnamed: 0,text,given_label,is_label_issue,predicted_label
282,I am an outlier :),visa_or_mastercard,False,beneficiary_not_allowed
16,(A AND NOT B) OR (C AND NOT D) OR (B AND NOT C AND D),change_pin,False,lost_or_stolen_phone
121,Would you rather fight one horse-sized duck or 100 duck-sized horses?,lost_or_stolen_phone,False,change_pin


From the examples above, the lexical quality metrics overall changed many outcomes to the correct one. One remaining pattern is that text which is perfectly written but has no relation to its label will almost always pass. We could address this by rewarding lexical quality marginally less when the difference is high, perhaps with a sigmoid function.
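
One way to read the sigmoid suggestion, sketched under explicit assumptions: suppose each example has a model-based label quality score and a lexical quality score, both in [0, 1], and the lexical score contributes a bonus to the final score. The bonus could then shrink smoothly as the gap between lexical quality and label quality grows. The function names, the `weight`, `steepness`, and `midpoint` parameters, and the combination rule itself are all hypothetical; the tutorial does not specify how the two scores are combined.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def damped_score(label_quality, lexical_quality, weight=0.2, steepness=10.0, midpoint=0.5):
    """Add a lexical bonus that shrinks when lexical quality far exceeds label quality.

    Hypothetical combination rule; all parameter defaults are illustrative.
    """
    gap = lexical_quality - label_quality
    # Damping is close to 1 for small gaps and close to 0 for large gaps.
    damping = 1.0 - sigmoid(steepness * (gap - midpoint))
    return min(1.0, label_quality + weight * lexical_quality * damping)


# Well-written text that the model also supports keeps most of its bonus.
consistent = damped_score(label_quality=0.7, lexical_quality=0.9)
# Well-written text with a poorly supported label gets almost no bonus.
off_topic = damped_score(label_quality=0.05, lexical_quality=0.95)
print(round(consistent, 3), round(off_topic, 3))
```

Under this sketch, an example such as the "horse-sized duck" sentence would receive little credit for being well written, so its low model-based score would still dominate.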