# Walk-Through

We will focus on the most difficult of our datasets (FDA) because it contains a lot of technical terms (drug names, illnesses, etc.).

# 1 EDA

## 1.1 Loading Data

In [None]:
import pandas as pd
from pathlib import Path

from dataload.dataloading import DataFilesRegistry

dataset_dir = Path(Path.cwd() / "../datasets")
print(f"Datasets dir: {dataset_dir} , exists {dataset_dir.exists()}")

First, we list the available datasets. `DataFilesRegistry` is an object that abstracts away the location of the folder and simplifies loading.

In [None]:
datasets = DataFilesRegistry(dataset_dir)

for idx, item in enumerate(datasets.keys()):
    # add new line each 5 iterations
    if idx % 5 == 0:
        print()
    print(item, end=", ")
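
The internals of `DataFilesRegistry` are not shown in this notebook. As a rough mental model only, a registry of this kind can be sketched as a mapping from file names to paths. The class `ToyFilesRegistry`, the `.txt` extension and the one-paragraph-per-line layout are all assumptions made for illustration, not the actual implementation.

```python
import tempfile
from pathlib import Path


class ToyFilesRegistry:
    """Hypothetical stand-in: maps dataset names (file stems) to text files."""

    def __init__(self, folder: Path):
        # Assumption: every dataset is a single .txt file in one folder.
        self._files = {p.stem: p for p in sorted(Path(folder).glob("*.txt"))}

    def keys(self):
        return list(self._files.keys())

    def load_items(self, name: str):
        # Assumption: one paragraph per non-empty line.
        text = self._files[name].read_text(encoding="utf-8")
        return [line.strip() for line in text.splitlines() if line.strip()]


# Build a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "fda.txt").write_text("First paragraph.\n\nSecond paragraph.\n", encoding="utf-8")
    Path(tmp, "other.txt").write_text("Only one.\n", encoding="utf-8")
    toy_registry = ToyFilesRegistry(Path(tmp))
    toy_keys = toy_registry.keys()
    toy_paragraphs = toy_registry.load_items("fda")

print(toy_keys, toy_paragraphs)
```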

Let's load just one dataset: `fda`. A dataset is organized into one or more paragraphs, where each paragraph can contain multiple sentences.

In [None]:
paragraphs = datasets.load_items("fda")
print(f"Loaded {len(paragraphs)} paragraphs")

se_paragraphs = pd.Series(paragraphs)
pd.DataFrame(se_paragraphs, columns=["paragraph"])

Let's find some named entities in our text.

## 1.2 Finding important Named Entities

In this section, we will attempt to find important terms. This matters because we later want to use a Hugging Face Transformer model for sentence embeddings, and each pre-trained transformer has a fixed vocabulary. We want to make sure that important acronyms and proper nouns are part of the core vocabulary, which suggests the transformer probably has a good sense of their meaning.

### 1.2.1 Finding important words using detection of "Named Entities"
First, we find named entities using NLTK. It uses grammar rules, as well as casing and punctuation, to detect the important named entities.

In [None]:
from analysis import words

words.find_named_enties(se_paragraphs)
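
To give an intuition for the casing cue alone, here is a deliberately simplified sketch based on regular expressions. It is not what `words.find_named_enties` or NLTK do internally (their grammar-based chunking is much richer); the helper `candidate_entities` and the sample paragraphs are made up for illustration.

```python
import re
from collections import Counter


def candidate_entities(paragraphs):
    """Count capitalized or all-caps tokens that do not start a sentence."""
    counts = Counter()
    for paragraph in paragraphs:
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            tokens = re.findall(r"[A-Za-z][A-Za-z0-9-]*", sentence)
            # Skip the first token: it is capitalized for grammatical reasons.
            for token in tokens[1:]:
                if token[0].isupper():
                    counts[token] += 1
    return counts


toy_paragraphs = [
    "The FDA approved a test for COVID-19. Ask the FDA about masks.",
    "Can pets catch COVID-19?",
]
entity_counts = candidate_entities(toy_paragraphs)
print(entity_counts.most_common())
```

A heuristic this crude misses entities that open a sentence and picks up any capitalized word, which is one reason the result is combined with word frequencies afterwards.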

### 1.2.2 Finding important words using word frequencies

`ParagraphTransform` is a tool developed by our team that chains a series of text transformations (cleanups) on paragraphs. Let's run some cleanups before counting words.

Note that `ParagraphTransform` will be used again later as a text pre-processing step: our algorithm requires sentences to be properly demarcated, and several things (bad punctuation, blanks, misplaced upper-case letters, abbreviations) can prevent that.

In [None]:
words.find_most_frequent_words(se_paragraphs)
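
Counting words can be sketched with `collections.Counter`. The stop-word list, the tokenizing regex and the helper name `most_frequent_words` below are assumptions for illustration; the actual `words.find_most_frequent_words` may clean and count differently.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list is much longer).
TOY_STOP_WORDS = {"the", "a", "is", "to", "for", "of", "and", "can", "i"}


def most_frequent_words(paragraphs, top_n=3):
    counts = Counter()
    for paragraph in paragraphs:
        # Lower-case first so that "Vaccine" and "vaccine" are counted together.
        words_in_paragraph = re.findall(r"[a-z][a-z0-9-]*", paragraph.lower())
        counts.update(w for w in words_in_paragraph if w not in TOY_STOP_WORDS)
    return counts.most_common(top_n)


toy_paragraphs = [
    "The vaccine is safe. The vaccine is free.",
    "Can I take the vaccine and a test for COVID-19?",
    "A test for COVID-19 is free.",
]
top_words = most_frequent_words(toy_paragraphs)
print(top_words)
```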


### 1.2.3 Putting things together

As both previous steps are fairly imprecise, we consider the "top entities" to be those that match both criteria: being detected as a named entity by NLTK and being a frequent word.

Let's join the top words and the named entities on the lower-case version of the word.

In [None]:
df_important_words = words.find_top_named_entities(se_paragraphs)
df_important_words
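
The join itself boils down to an inner join on the lower-cased word. Here is a minimal sketch with plain dictionaries; the counts are made up and the `word_count` field is a hypothetical name, so the columns of the real `df_important_words` may differ.

```python
# Hypothetical outputs of the two previous steps (made-up counts).
named_entities = {"FDA": 12, "COVID-19": 30, "John": 1, "Emergency": 4}
frequent_words = {"covid-19": 45, "vaccine": 28, "fda": 15, "test": 9}

# Index the named entities by their lower-case form, then keep those
# that also appear in the frequent-word table (an inner join).
entities_by_lower = {name.lower(): name for name in named_entities}
top_entities = [
    {"entity_name": entities_by_lower[word], "word_count": count}
    for word, count in frequent_words.items()
    if word in entities_by_lower
]
print(top_entities)
```

Entities such as `John` (detected but rare) and words such as `vaccine` (frequent but not a named entity) both drop out, which is the point of requiring both criteria.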

## 1.3 Sentence splitting

As our downstream model works on sentences, it is important that we can split the paragraphs into proper sentences.

### 1.3.1 Prepare data for proper sentence splitting

We again use `ParagraphTransform` to clean up the data, this time to allow NLTK to split the sentences properly. NLTK typically expects a sentence to end with a dot followed by a space and then an upper-case letter, so it gets fooled by abbreviations or acronyms that contain dots, and by sentences that do not start with a space after the previous sentence's dot. We also add cleanups that normalize URLs and emails, as we want the model to understand all of those as just "a link" or "an email". Here are some of the available cleanups:

- `spaces` Replaces series of spaces and tabs with a single space
- `sentences_start` Ensures a sentence end is followed by a space and a capital letter
- `uri` Replaces URLs with the placeholder WEBLINK
- `email` Replaces all emails with the placeholder WEBMAIL
- `common_abbr` Expands common abbreviations (like `e.g.` into `for instance`) so that no dot remains
- `hyphens` Replaces hyphens in hyphenated words with spaces
- `no_stop_words` Removes stop words; it is not clear whether this brings value before using Hugging Face Transformers
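
A few of these cleanups can be sketched with regular expressions. The patterns and function names below are simplified assumptions, not the ones used in our codebase; they only illustrate the kind of rewriting each key performs.

```python
import re


def normalize_spaces(text):
    # Collapse runs of spaces and tabs into a single space.
    return re.sub(r"[ \t]+", " ", text)


def replace_uris(text):
    # Very rough URL pattern, good enough for an illustration.
    return re.sub(r"https?://[^\s,;)]+", "WEBLINK", text)


def replace_emails(text):
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "WEBMAIL", text)


def fix_sentence_starts(text):
    # Insert the missing space after a sentence-ending dot and
    # capitalize the first letter of the next sentence.
    return re.sub(
        r"([a-z]{2,})\.\s*([A-Za-z])",
        lambda m: f"{m.group(1)}. {m.group(2).upper()}",
        text,
    )


raw = "Visit   https://www.fda.gov/covid for details.contact help@fda.gov\ttoday"
cleaned = raw
# URLs and emails go first so their inner dots are not mistaken for sentence ends.
for step in (replace_uris, replace_emails, normalize_spaces, fix_sentence_starts):
    cleaned = step(cleaned)
print(cleaned)
```

The order of the steps matters: running the sentence-start fix before the URL and email replacements would break `www.fda.gov` apart.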

In [None]:
# All available paragraph normalizers are implemented in our codebase (mostly using NLTK and regex)
from pipelines.impl.preprocessing import get_available_normalizers
get_available_normalizers()

In [None]:
from pipelines.impl.preprocessing import make_text_normalizer, split_into_sentences, get_available_normalizers, normalize_sent_min_words
from pipelines.impl.paragraph import ParagraphTransform

activated_cleanups = ['spaces', 'sentences_start', 'uri', 'email', 'common_abbr']
preprocessing_pipe = ParagraphTransform([
    make_text_normalizer(activated_cleanups),
    split_into_sentences,
    normalize_sent_min_words
], unique_sentences=True)  # only unique sentences

In [None]:
df_sentences = preprocessing_pipe.transform(se_paragraphs)


In [None]:
# Compare the number of unique sentences with the number of paragraphs
print(f"Unique Sentences count: {len(df_sentences)}")
print(f"Paragraphs (or sections) count: {len(se_paragraphs)}")

In [None]:
df_sentences.head(5)
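
Conceptually, the pipeline above chains functions, flattens paragraphs into sentences and drops duplicates. The following sketch mimics that flow with plain functions. The splitting rule is far more naive than NLTK's, and the minimum-word filter is only a guess at what `normalize_sent_min_words` does based on its name.

```python
import re


def collapse_spaces(paragraph):
    return re.sub(r"\s+", " ", paragraph).strip()


def split_sentences(paragraph):
    # Naive rule: a sentence ends with ., ! or ? followed by a space and a capital.
    return re.split(r"(?<=[.!?]) (?=[A-Z])", paragraph)


def keep_min_words(sentences, min_words=3):
    return [s for s in sentences if len(s.split()) >= min_words]


def run_pipeline(paragraphs, steps, unique_sentences=True):
    sentences = []
    for item in paragraphs:
        for step in steps:
            item = step(item)  # each step feeds the next one
        sentences.extend(item)
    if unique_sentences:
        # dict.fromkeys drops duplicates while keeping first-seen order.
        sentences = list(dict.fromkeys(sentences))
    return sentences


toy_paragraphs = [
    "Wear a mask   in public. Wash your hands often. OK.",
    "Wash your hands often. Vaccines are free of charge.",
]
toy_sentences = run_pipeline(toy_paragraphs, [collapse_spaces, split_sentences, keep_min_words])
print(toy_sentences)
```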

### 1.3.2 Sentence splitting as a step in our model

We now create an untrained instance of our model (see code in related files).

In [None]:
from pipelines.impl.anomaly_detection import GaussianEmbeddingsAnomalyDetector
from pipelines.filtering import FilterTrainFiles

run_params = {
    "embedder_name": "all-MiniLM-L6-v2",
    "robust_covariance": True,
    "text_normalizer_keys": activated_cleanups, 
    "support_fraction": 0.90,
}
model_datasets = FilterTrainFiles(train_id="fda", validation_id="validation_fda_id", validation_ood="validation_fda_ood")

model = GaussianEmbeddingsAnomalyDetector(run_params=run_params, datasets=model_datasets)

We then use the `sentence_splitter` step, which does both the text normalization and the sentence splitting. We also set the `embedder` variable to point to the Hugging Face Transformer we will use later.

In [None]:
df_sentences = model.sentence_splitter.transform(se_paragraphs)
df_embeddings = model.embedder.fit_transform(df_sentences)
embedder = model.embedder.embedder

In [None]:
df_sentences

## 1.4 Check tokenization

In this section, we check that each important named entity is a single vocabulary token in the Hugging Face Transformer. In such a transformer, the most important words have their own entry in the vocabulary (meaning the transformer understands them well), while less important words are split into sub-tokens (for example, `Embedding` can become the two tokens `embedd` and `##ing`).

Let's now find the words that the Hugging Face tokenizer splits into several sub-tokens.

In [None]:
# find out split words
from analysis import tokens

split_words = tokens.find_vocab_split_words(df_sentences, embedder.tokenizer)     
split_words
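
The `##` prefix is the continuation marker used by WordPiece-style tokenizers. To see why a word ends up split, here is a minimal sketch of greedy longest-match-first splitting against a toy vocabulary. The vocabulary and the helper `wordpiece_tokenize` are assumptions for illustration; the real tokenizer also normalizes text and has a far larger vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (assumption): a real one holds tens of thousands of entries.
toy_vocab = {"vaccine", "mask", "co", "##vid", "embedd", "##ing"}

for toy_word in ["vaccine", "covid", "embedding"]:
    print(toy_word, "->", wordpiece_tokenize(toy_word, toy_vocab))

# Words that need more than one token are the "split words".
split_toy_words = [w for w in ["vaccine", "covid", "mask"]
                   if len(wordpiece_tokenize(w, toy_vocab)) > 1]
```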

There is a small problem: for best results, we want the most important and frequent words, like `COVID` and `COVID-19`, to each be encoded as a single token. Instead, we get `co` and `##vid` as sub-tokens. The same happens for `pandemic`.

In [None]:
important_word_list = df_important_words["entity_name"].values
split_words_list = split_words["word"].values
# Compute intersection
df_import_split_words = set(important_word_list).intersection(set(split_words_list))


In [None]:
# Warn user if there are split words
if len(df_import_split_words) > 0:
    print("WARNING: Split words found")
    print(df_import_split_words)

There is already a function doing all of that in the `analysis` package:

In [None]:
from analysis.tokens import find_important_split_words

find_important_split_words(df_sentences, embedder.tokenizer)

In our application onboarding, we want to issue warnings when this happens and suggest fine-tuning the pre-trained transformer on the body of text, so that the transformer gets a better sense of those words.

# 2 Model Understanding

In this section, we will run the model to look at the embeddings (without fine-tuning the language model) and at calibration.

## 2.1 Looking at Hugging Face Transformer embeddings

In [None]:
embeddings = embedder.encode(df_sentences, show_progress_bar=True)

### 2.1.0 Checking embeddings have a "sense" of language used in FAQ

Despite the warnings above, let's verify that the Hugging Face Transformer embeddings have a reasonable sense of similar sentences.

In [None]:
from sentence_transformers import util


def most_similar(my_sentence, top_n=5):
    emb = embedder.encode(my_sentence)
    cosines = util.cos_sim(emb, embeddings)
    similarities = {s: float(cosines[0][i]) for i, s in enumerate(df_sentences)}
    ordered_similarities = sorted(similarities.items(), key=lambda x: x[1], reverse=True)
    return ordered_similarities[:top_n]


most_similar("Can my dog transmit COVID-19 to me?")

We can see that the transformer probably understands that the word `dog` (which does not appear in the original FAQ) is close to the word `pet`.
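
For reference, the ranking relies on cosine similarity, i.e. the dot product of two vectors divided by the product of their norms. A minimal sketch with made-up 3-dimensional vectors (the sentences and numbers are purely illustrative, not real embeddings):

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_embeddings = {
    "Can my pet get sick?": [0.9, 0.1, 0.0],
    "Where can I buy a mask?": [0.1, 0.9, 0.1],
    "Is the vaccine free?": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend encoding of a question about a dog

# Sort sentences from most to least similar to the query.
ranked = sorted(
    ((cosine_similarity(query_vector, vec), sentence) for sentence, vec in toy_embeddings.items()),
    reverse=True,
)
for similarity, sentence in ranked:
    print(f"{similarity:.3f}  {sentence}")
```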

### 2.1.1 Visualize in Tensorboard

In [None]:
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir=("./embeddings"))
writer.add_embedding(embeddings, df_sentences, tag="My Embeddings")
writer.flush()
writer.close()

In [None]:
# % load_ext tensorboard
# % tensorboard  --logdir ./embeddings


### 2.1.2 Visualize using UMAP with Seaborn

In [None]:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline

import umap
sns.set(style='white', context='notebook', rc={'figure.figsize':(14,10)})
embeddings.shape

We create the dataset and add a few of our own out-of-domain sentences.

In [None]:
# Create dataset for UMAP
test_embeddings = embeddings.copy()
test_sentences = df_sentences.copy()
# Add a column of zeros to represent the label
test_embeddings = np.hstack((test_embeddings, np.zeros((test_embeddings.shape[0], 1), dtype=np.int32)))

def add_new_sents(test_embeddings, test_sentences, new_sentences, label):
    np_new_sents = np.array(new_sentences).reshape(-1, 1)
    np_labels = np.full_like(np_new_sents, label, dtype=np.float64)
    test_embeddings = np.vstack((test_embeddings, np.hstack((embedder.encode(new_sentences), np_labels))))
    test_sentences = pd.concat([test_sentences, pd.Series(new_sentences)])
    test_sentences.index = range(len(test_sentences))
    return test_embeddings, test_sentences

test_embeddings, test_sentences = add_new_sents(test_embeddings, test_sentences, 
                                                ["I love playing soccer ?", "Do you eat cheese very often ?", "Can I play some chess with you?", "I liked a lot this movie we looked at yesterday", "Can I help you with something else?", "Good morning, my name is John"], 1)

test_embeddings.shape

Here we train the model: this fits a Gaussian, with a center and a covariance, on the embedding space. It gives us the center of the distribution, which is nice for visualization.

In [None]:
model.fit(datasets)

In [None]:
location, _ = model.distribution.get_dist_params()
test_embeddings = np.vstack((test_embeddings, np.hstack((location, 2)))) # Add center with label 2 to have another color...
test_sentences = np.append(test_sentences, "DISTRIBUTION CENTER")

Compute UMAP

In [None]:

# Reduce dimensionality with UMAP
reducer = umap.UMAP(n_neighbors=24, metric='cosine', n_epochs=1000)
X = test_embeddings[:, :-1]
# scaled_embeddings = StandardScaler().fit_transform(X)
scaled_embeddings = X
umap_embedding = reducer.fit_transform(scaled_embeddings)
umap_embedding.shape

In [None]:
# Plot the UMAP projection
sns_palette = sns.color_palette()
color_ids = [int(label) for label in test_embeddings[:, -1]]
my_palette = [sns_palette[i] for i in color_ids]
plt.scatter(
    umap_embedding[:, 0],
    umap_embedding[:, 1],
    c=my_palette)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the sentence embeddings', fontsize=24)

Out-of-domain sentences appear in orange and the distribution center in green (the second and third colors of the default seaborn palette).

### 2.1.3 Interactive visualization with Bokeh

Bokeh allows us to see which sentence each point corresponds to. In the visualization below, hover over a point to see the sentence it relates to. We can see that the OOD points are close to each other, but the OOD point that corresponds to the sentence with `cheese` is positioned close to an ID sentence about "food and gastrointestinal and stomach illness".

In [None]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, CategoricalColorMapper
from bokeh.palettes import Spectral10

output_notebook()

In [None]:
embeddings_df = pd.DataFrame(umap_embedding, columns=('x', 'y'))
embeddings_df['ood'] = [str(x) for x in test_embeddings[:, -1].astype(int)]
embeddings_df['sentence'] = list(test_sentences)

datasource = ColumnDataSource(embeddings_df)
color_mapping = CategoricalColorMapper(factors=["0","1", "2"],
                                       palette=("#0000ff", "#ff0000", "#00ff00"))

plot_figure = figure(
    title='UMAP projection of the dataset',
    width=800,
    height=600,
    tools=('pan, wheel_zoom, reset')
)

plot_figure.add_tools(HoverTool(tooltips="""
<div>
    <div>
        <span style='font-size: 10px'>@sentence</span>
    </div>
</div>
"""))

plot_figure.circle(
    'x',
    'y',
    source=datasource,
    color=dict(field='ood', transform=color_mapping),
    line_alpha=0.6,
    fill_alpha=0.6,
    size=4
)
show(plot_figure)

## 2.2 Gaussian fit to the embeddings

There is not much to show here. We already visualized the center of the distribution that was fit on the embedding vector space. It would be nice to visualize the Mahalanobis distance between that center and various sentences; we will do that in the next section.
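
As a rough intuition for what the fit provides, here is a toy 2-D sketch of a Gaussian fit followed by a Mahalanobis distance. It uses the plain sample covariance rather than the robust estimate enabled by `robust_covariance`, and the helper names and points are made up for illustration.

```python
import math


def fit_gaussian_2d(points):
    """Return the mean and the 2x2 sample covariance of 2-D points."""
    n = len(points)
    mean = [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]
    sxx = sum((p[0] - mean[0]) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean[1]) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in points) / (n - 1)
    return mean, [[sxx, sxy], [sxy, syy]]


def mahalanobis_2d(point, mean, cov):
    """sqrt((x - mu)^T Sigma^-1 (x - mu)) with an explicit 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Inverse of [[a, b], [c, d]] is 1/det * [[d, -b], [-c, a]].
    quad = (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det
    return math.sqrt(quad)


# Toy "embeddings": wide spread along x, narrow spread along y.
train_points = [(-2.0, 0.0), (2.0, 0.0), (0.0, -1.0), (0.0, 1.0), (0.0, 0.0)]
mean, cov = fit_gaussian_2d(train_points)

# Both test points are at Euclidean distance 2 from the center.
for point in [(2.0, 0.0), (0.0, 2.0)]:
    print(point, round(mahalanobis_2d(point, mean, cov), 3))
```

The two test points are equally far from the center in Euclidean terms, but the one lying along the narrow direction gets a larger Mahalanobis distance, because the distance is scaled by the spread of the training data.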

## 2.3 Looking at Mahalanobis distances and calibrating

Let's make up some sentences that are ID and OOD and add some random ones from our datasets. The calibrator will attempt to compute the best cutoff on the Mahalanobis distance for a sentence to be considered OOD: if a sentence is further away from the "center" of the Gaussian distribution than this cutoff, it is considered OOD. Below:
- "score" represents the raw Mahalanobis distance between the center and the tested sentence.
- "delta" represents the distance from the cutoff. A delta of zero means the sentence lies exactly on this frontier.

In [None]:
from pipelines.impl.anomaly_detection import OnInvalidSentence

ood_sentences = [
    "My brother got a headache",
    "How can I assist you?",
    "How can I help you?",
    "Welcome to our FDA support center",
    "Goodbye, have a nice day",
    "Anything else I can assist you with?",
    "Sorry to hear that",
    "I love eating pizza",
    "I am not allowed to wear glasses",
    "Welcome to our hotline"
]

id_sentences = [
    "Do I need to wear a mask to protect myself?",
    "Does hydroxychloroquine help to treat COVID-19?",
    "Can cats transmit the illness to humans?",
    "How much vaccines do I need to take?",
    "I have food allergies, can I still take the vaccine?",
    "As a smoker do I have more risks?",
    "Are foreign foods dangerous?",
    "After taking the vaccine will I be able to get children"
]

model.recalibrate(id_sentences, ood_sentences, registry=datasets, on_invalid_sentence=OnInvalidSentence.WARN)
model.calibrator.cutoff_
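
How the calibrator actually computes `cutoff_` is not shown here. A minimal sketch, assuming it simply looks for the threshold that misclassifies the fewest calibration sentences; the function `best_cutoff`, the scores and the sign convention used for delta are illustrative assumptions.

```python
def best_cutoff(id_scores, ood_scores):
    """Pick the threshold that misclassifies the fewest calibration scores.

    A score strictly above the cutoff is treated as OOD.
    Ties between candidates go to the lowest one.
    """
    ordered = sorted(set(id_scores) | set(ood_scores))
    # Candidate cutoffs: midpoints between consecutive distinct scores.
    candidates = [(lo + hi) / 2 for lo, hi in zip(ordered, ordered[1:])]

    def errors(cutoff):
        id_flagged = sum(score > cutoff for score in id_scores)
        ood_missed = sum(score <= cutoff for score in ood_scores)
        return id_flagged + ood_missed

    return min(candidates, key=errors)


# Made-up Mahalanobis scores for a handful of calibration sentences.
toy_id_scores = [10.0, 12.0, 13.0, 18.0]
toy_ood_scores = [16.0, 20.0, 22.0, 25.0]
toy_cutoff = best_cutoff(toy_id_scores, toy_ood_scores)
print("cutoff:", toy_cutoff)

for score in (13.0, 16.0):
    delta = score - toy_cutoff  # assumed sign: positive means beyond the cutoff
    print(score, delta)
```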

### 2.3.1 Calibration as a table

In the onboarding app, we will let users enter some sentences to calibrate the model themselves. We will also provide a slider: users will see live which sentences become ID or OOD as they move it, which lets them adjust precision and recall to their liking.

In [None]:
calibration_sentences = [*id_sentences, *ood_sentences]

In [None]:
raw_id_scores = model.train_pipe.transform(id_sentences)
raw_ood_scores = model.train_pipe.transform(ood_sentences)


df_id = pd.DataFrame({"score": raw_id_scores, "origin": "ID"})
df_ood = pd.DataFrame({"score": raw_ood_scores, "origin": "OOD"})
df = pd.concat([df_id, df_ood])
df["sentence"] = [*id_sentences, *ood_sentences]

# sort by raw score, highest (furthest from the center) first
df = df.sort_values(by="score", ascending=False)
df


In [None]:
pred, sent = model.predict_proba(calibration_sentences)
pd.DataFrame({"sentence": sent, "OOD prob": pred})

In [None]:
model.calibrator.r_id_, model.calibrator.r_ood_, model.calibrator.cutoff_

### 2.3.2 Calibration as a histogram

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("darkgrid")
graph = sns.histplot(data=df, x="score", hue="origin", legend=True, stat="count", palette=["red", "lightblue"], alpha=0.5, kde=True)
graph.axvline(model.calibrator.cutoff_, color="grey", linestyle="--", label="Cutoff")

### 2.3.3 Confusion matrix and scoring

In [None]:
original_cutoff = model.calibrator.cutoff_

In [None]:
from analysis.evaluate import evaluate_model
# model.calibrator.cutoff, model.calibrator.adjusted_cutoff
# In interface allow user to change the tradeoff
adjustment = 0 # add positive amount for more recall, negative for more precision (less false positives)
model.calibrator.cutoff_ = model.calibrator.cutoff_ + adjustment
# You can adjust the cutoff to get a different tradeoff...
scores = evaluate_model(model, id_sentences, ood_sentences)
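
To make the tradeoff concrete without relying on `evaluate_model`, here is a small sketch that counts both kinds of mistakes as the cutoff moves. The scores, the base cutoff and the helper `error_counts` are made up; the only assumption carried over from above is that a score beyond the cutoff means OOD.

```python
def error_counts(id_scores, ood_scores, cutoff):
    """Count both kinds of mistakes for a given cutoff (score > cutoff means OOD)."""
    id_flagged_as_ood = sum(score > cutoff for score in id_scores)
    ood_accepted_as_id = sum(score <= cutoff for score in ood_scores)
    return id_flagged_as_ood, ood_accepted_as_id


# Made-up scores and base cutoff, for illustration only.
toy_id_scores = [10.0, 12.0, 13.0, 18.0]
toy_ood_scores = [16.0, 20.0, 22.0, 25.0]
base_cutoff = 17.0

for toy_adjustment in (-3.0, 0.0, 3.0):
    counts = error_counts(toy_id_scores, toy_ood_scores, base_cutoff + toy_adjustment)
    print(f"adjustment={toy_adjustment:+.1f} -> ID flagged: {counts[0]}, OOD accepted: {counts[1]}")
```

Moving the cutoff trades one kind of error for the other, which is what the slider in the onboarding app is meant to expose.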

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

print(f"F1-SCORE: {scores.f1}")

print(f"False positives ({len(scores.fp_indices)}):")
for i in scores.fp_indices:
    print("    " + id_sentences[i])

print(f"False negatives ({len(scores.fn_indices)}):")
for i in scores.fn_indices:
    print("    " + ood_sentences[i])

cm = confusion_matrix(scores.y_true, scores.y_pred)
ConfusionMatrixDisplay(cm).plot()

# 3 Explainability

In [None]:
!pip install lime

LIME requires a `predict_one` function to explain an example: it takes a list of sentences and returns an array of class probabilities, one row per sentence.

In [None]:
from pipelines.impl.anomaly_detection import GaussianEmbeddingsAnomalyDetector
from pipelines.filtering import FilteredSentence

def predict_one(sentences):
    # print(f"Sents: {sentences[:5]}")
    results = []
    for sentence in sentences:
        scored_sentences = list(model.filter_sentences(sentence))
        if len(scored_sentences) == 0:
            results.append([0.0, 1.0])
        if len(scored_sentences) > 1:
            raise Exception(f"More than one result in {scored_sentences}")
        if len(scored_sentences) == 1:
            ood_score = scored_sentences[0].score / 100
            id_score = 1 - ood_score
            results.append([id_score, ood_score])
    return np.array(results)



predict_one(["How much vaccines do I need to take?"])


In [None]:
def predict_one(sentences):
    embeddings = model.embedder.transform(sentences)
    raw_scores = model.distribution.transform(embeddings)
    ood_probas = model.calibrator.predict_proba(raw_scores)
    # Concatenate iid and ood as a 2D array
    new_var = np.vstack([1 - ood_probas, ood_probas]).T
    return new_var
predict_one(["How much vaccines do I need to take?", "What are you doing"])
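
LIME calls this function on many perturbed copies of a sentence (with words removed) and fits a simple local model on the results. The sketch below is a much cruder stand-in, a leave-one-word-out comparison, and is not LIME's actual algorithm. Both `toy_predict_proba` and `leave_one_word_out` are hypothetical helpers that only illustrate why the prediction function must accept a batch of sentences.

```python
def toy_predict_proba(sentences):
    """Toy stand-in for `predict_one`: returns [P(ID), P(OOD)] per sentence.

    Assumption: a sentence looks in-domain when it mentions a domain word.
    """
    domain_words = {"vaccine", "mask", "covid-19"}
    rows = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        hits = sum(token.strip("?.,!") in domain_words for token in tokens)
        p_id = min(0.1 + 0.4 * hits, 0.9)
        rows.append([p_id, 1.0 - p_id])
    return rows


def leave_one_word_out(sentence, predict_proba):
    """Drop in P(ID) observed when each word is removed from the sentence."""
    tokens = sentence.split()
    variants = [" ".join(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]
    base_p_id = predict_proba([sentence])[0][0]
    # One batched call, mirroring how an explainer queries the model.
    variant_rows = predict_proba(variants)
    # Note: repeated words would overwrite each other in this dict.
    return {tok: round(base_p_id - row[0], 3) for tok, row in zip(tokens, variant_rows)}


word_importance = leave_one_word_out("Is the vaccine free?", toy_predict_proba)
print(word_importance)
```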

In [None]:
from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=['no', 'yes'])

In [None]:
def explain(sentence, num_features=4):
    exp = explainer.explain_instance(
        sentence,
        predict_one,
        num_features=num_features)
    exp.show_in_notebook(text=sentence)
 
explain("Do I need a car jacket and a safety belt?")

In [None]:
explain("How can I help you?")

In [None]:
explain("Goodbye, see you later")

In [None]:
explain("Do I need to wear a mask to protect myself?")

In [None]:
explain("Can my pet be infected also?")

In [None]:
explain("A backwards poet writes inverse.")

In [None]:
explain("How much vaccine do I need to take?")

In [None]:
iid = datasets.load_items(model.datasets.validation_id)

explain(iid[3])

In [None]:
explain(iid[27])

In [None]:
ood = datasets.load_items(model.datasets.validation_ood)
explain(ood[0])

In [None]:
explain(ood[12])

In [None]:
explain(ood[13])

In [None]:
explain("Can I buy a movie?")

In [None]:
explain("Welcome to our FDA support center")

Because a lot of ID sentences look like "How much do I need to take?", "How many..." or "Can I...", some sentences tend to be marked ID although they are really OOD. Removing stop words seems to solve the issue, but this needs confirmation, and it might force us to retrain the Hugging Face transformer in all situations, because our language is not exactly the same with and without stop words.
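
To illustrate the idea, here is a minimal sketch of stop-word removal. The stop-word list and the helper `remove_stop_words` are assumptions for illustration and are not the `no_stop_words` normalizer from our codebase.

```python
# Small illustrative stop-word list (assumption; real lists are far longer).
TOY_STOP_WORDS = {"how", "much", "many", "do", "i", "can", "to", "a", "the", "you", "need"}


def remove_stop_words(sentence):
    # Compare on the lower-cased word without surrounding punctuation.
    kept = [w for w in sentence.split() if w.lower().strip("?.,!") not in TOY_STOP_WORDS]
    return " ".join(kept)


id_example = remove_stop_words("How much vaccine do I need to take?")
ood_example = remove_stop_words("Can I buy a movie?")
print(id_example, "|", ood_example)
```

Once the shared question scaffolding is gone, only the content words remain, which is where ID and OOD sentences actually differ.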