# Goal of the notebook

## Context
Our filter is meant to separate Out-Of-Domain (OOD) sentences from In-Domain (ID) sentences in a conversation between a support agent and a customer.
Here the domain (ID) is the full FAQ scraped from the Europcar website (https://faq.europcar.com/ and its subpages). The domain is "Car rental".

## Goal

Visualize the (semantic) sentence embeddings (generated by a Hugging Face SentenceTransformer) to see if we can spot some patterns or problems.

Visualization is done by generating embeddings in a format suitable for the TensorFlow Embedding Projector: https://projector.tensorflow.org/
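The cells in this notebook go through TensorBoard's `add_embedding`. As an alternative, the hosted projector also accepts a manual upload of two tab-separated files: one with the vectors and one with the labels. A minimal sketch of that export, assuming the embeddings are already available as a NumPy array (the helper name `write_projector_tsv` and the file names are hypothetical, not part of this project):

```python
from pathlib import Path

import numpy as np


def write_projector_tsv(vectors, labels, out_dir):
    """Write vectors.tsv and metadata.tsv for a manual upload to the projector."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    vectors_path = out_dir / "vectors.tsv"
    metadata_path = out_dir / "metadata.tsv"
    # One embedding per line, values separated by tabs, no header row
    np.savetxt(vectors_path, vectors, delimiter="\t", fmt="%.6f")
    # One label per line, in the same order as the vectors.
    # Tabs and newlines inside a sentence would break the row layout, so replace them.
    cleaned = [label.replace("\t", " ").replace("\n", " ") for label in labels]
    metadata_path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return vectors_path, metadata_path
```

With a single metadata column, as here, the metadata file is written without a header row.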


First we load a Hugging Face model (see the other Python files for details). The embedder model is `all-MiniLM-L6-v2`, which is marked as suitable for similarity comparisons with Euclidean distances. (This is important because we want to fit a Gaussian to the embeddings.)
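As a side note on why Euclidean distance is a reasonable choice here: if the embeddings are unit-length (an assumption in this sketch, not something checked in this notebook), squared Euclidean distance and cosine similarity carry the same information. The vectors below are random stand-ins, not real model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for two sentence embeddings (384 dimensions, like all-MiniLM-L6-v2)
a = rng.normal(size=384)
b = rng.normal(size=384)
# Assumption: embeddings are L2-normalized to unit length
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(a @ b)
squared_euclidean = float(np.sum((a - b) ** 2))
# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = bool(np.isclose(squared_euclidean, 2.0 - 2.0 * cosine))
print(identity_holds)
```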

In [2]:
from pathlib import Path
from pipelines.persistence import load_pipeline
from dataload.dataloading import DataFilesRegistry
from pipelines.impl.anomaly_detection import GaussianEmbeddingsAnomalyDetector

CAPSTONE_FOLDER = Path("/Users/jlinho/Desktop/capstone/")
DATAFILES_FOLDER = CAPSTONE_FOLDER / "datasources"
MODELS_FOLDER = CAPSTONE_FOLDER / "models"

data_registry = DataFilesRegistry(DATAFILES_FOLDER)
detector = load_pipeline(MODELS_FOLDER / "20220127_16-50-10", GaussianEmbeddingsAnomalyDetector)


print(f"Name: {detector.name}")
print("Train params used: ")
print(detector.run_params)
print("Files used for training:")
print(detector.files)

Name: GaussianEmbeddingsAnomalyDetector
Train params used: 
{'embedder_name': 'all-MiniLM-L6-v2', 'robust_covariance': True}
Files used for training:
FilterTrainFiles(train_id=['europcar'], train_ood=[], validation_id='validation_id', validation_ood='validation_ood')


Then we generate all embeddings...

In [4]:
import numpy as np
from pipelines.impl.anomaly_detection import _file_sentences
from torch.utils.tensorboard import SummaryWriter

id_sentences = list(_file_sentences("europcar", data_registry))

writer = SummaryWriter(log_dir=(CAPSTONE_FOLDER / "tensorboard_logs"))

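# Filter blank sentences once so embedding rows and metadata rows stay aligned
non_empty_sentences = [s for s in id_sentences if s.strip() != ""]

writer.add_embedding(
    np.vstack([detector.embedding(s) for s in non_empty_sentences]),
    metadata=non_empty_sentences,
    tag=detector.embedder_name,
)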

writer.flush()
writer.close()
print("done")

done


In [5]:
%load_ext tensorboard
%tensorboard  --logdir "/Users/jlinho/Desktop/capstone/tensorboard_logs"

Launching TensorBoard...

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Some visualizations using the UMAP algorithm:

![embeddings1](useless_sentences.png)


We can clearly see 2 clusters:
- the first one containing the "car rental" domain sentences
- another containing politeness sentences like "Please feel free to tell us if you found answer helpful"

If we want to fit the ID sentences cleanly, the politeness sentences need to be eliminated (or their effect mitigated) before fitting a Gaussian to the main cloud of data.


We want to fit a Gaussian **like this**: ![good](/Users/jlinho/MyGit/Capstone/Project/notebooks/good.png)

But not like this ![not good](/Users/jlinho/MyGit/Capstone/Project/notebooks/not_good.png)
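To make the difference between the two fits concrete, here is a small sketch on synthetic 2D data (not the real embeddings; the cluster positions and sizes are made up). It fits one Gaussian on the main cloud only and one on both clusters, then scores a probe point lying in the empty gap between them with the Mahalanobis distance:

```python
import numpy as np


def fit_gaussian(X):
    """Return the sample mean and covariance of the rows of X."""
    return X.mean(axis=0), np.cov(X, rowvar=False)


def mahalanobis(x, mean, cov):
    """Mahalanobis distance of a single point x to the Gaussian (mean, cov)."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))


rng = np.random.default_rng(0)
# Synthetic stand-ins: a large main cloud and a small, tight side cluster
domain = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(900, 2))
politeness = rng.normal(loc=[8.0, 8.0], scale=0.3, size=(100, 2))

clean_mean, clean_cov = fit_gaussian(domain)
mixed_mean, mixed_cov = fit_gaussian(np.vstack([domain, politeness]))

probe = np.array([4.0, 4.0])  # a point in the gap between the two clusters
d_clean = mahalanobis(probe, clean_mean, clean_cov)
d_mixed = mahalanobis(probe, mixed_mean, mixed_cov)
print(f"clean fit: {d_clean:.2f}, mixed fit: {d_mixed:.2f}")
```

The Gaussian fitted on both clusters stretches toward the side cluster, so the probe point looks much closer to the "domain" than it does under the clean fit. This is the failure mode shown in the second picture.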

Also, short sentences like "Yes" and "No" should be eliminated: ![yes](yes.png)
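One simple way to drop such sentences is a minimum word count applied before embedding. This is only a sketch: the helper name, the threshold of 3 words, and the example sentences are all hypothetical and would need tuning against the real FAQ data.

```python
def drop_short_sentences(sentences, min_words=3):
    """Keep only sentences with at least `min_words` whitespace-separated tokens."""
    return [s for s in sentences if len(s.split()) >= min_words]


# Made-up examples, not taken from the dataset
examples = ["Yes", "No", "Can I return the car at a different station?"]
kept = drop_short_sentences(examples)
print(kept)
```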

Several ways to eliminate the noise from the ID sentence embeddings are to be explored:
- Using a robust covariance estimator, which fits on only the most similar 80% (or 90%) of points
- Embedding paragraphs instead of sentences to dilute the effect of those sentences
- Trying to isolate them automatically with some algorithm (as we did visually above)
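The first option can be illustrated with a simplified trimming loop in NumPy, in the spirit of the concentration step used by Minimum Covariance Determinant estimators. This is a sketch on synthetic 2D data, not the estimator used by the detector; in practice scikit-learn's `MinCovDet` with its `support_fraction` parameter would be the more usual choice. The function name and the data are hypothetical.

```python
import numpy as np


def trimmed_gaussian_fit(X, keep_fraction=0.8, n_iter=10):
    """Fit mean/covariance on the `keep_fraction` of points closest to the current fit."""
    n_keep = int(np.ceil(keep_fraction * len(X)))
    subset = X  # start from all points, outliers included
    for _ in range(n_iter):
        mean = subset.mean(axis=0)
        cov = np.cov(subset, rowvar=False)
        diff = X - mean
        # Squared Mahalanobis distance of every point to the current fit
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        # Refit on the closest points only
        subset = X[np.argsort(d2)[:n_keep]]
    return subset.mean(axis=0), np.cov(subset, rowvar=False)


rng = np.random.default_rng(0)
# Synthetic stand-ins: 90% main cloud, 10% tight side cluster
domain = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(900, 2))
noise = rng.normal(loc=[8.0, 8.0], scale=0.3, size=(100, 2))
X = np.vstack([domain, noise])

naive_mean = X.mean(axis=0)
robust_mean, robust_cov = trimmed_gaussian_fit(X, keep_fraction=0.8)
print("naive mean:  ", naive_mean.round(2))
print("trimmed mean:", robust_mean.round(2))
```

The naive mean is pulled toward the side cluster, while the trimmed mean stays near the main cloud. Note that the covariance of a trimmed subset is smaller than that of the full main cloud; full MCD implementations apply a correction for this, which the sketch omits.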