>### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> 
>*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Getting_Started)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Getting_Started) to leverage the power of whylogs and WhyLabs together!*

# Profiling text and text embeddings data

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/experimental/embeddings/Embeddings_Distance_Logging.ipynb)

High dimensional embedding spaces can be difficult to understand because we often rely on our own subjective judgement of clusters in the space. Often, data scientists try to find issues solely by hovering over individual data points and noting trends in which ones feel out of place.

WhyLabs has a number of features that are useful for natural language processing and text. In this notebook, we'll look at both our Unicode tracking features for text and our embeddings features for word embeddings.

## Setup

### Install package extras for whylogs

For convenience, we include helper functions to select reference data points for comparing new embedding vectors against. To follow this notebook in full, install the `embeddings` extra (for helper functions) and `viz` extra (for visualizing drift) when installing whylogs.

In [None]:
%pip install --upgrade whylogs[embeddings,viz]

In [40]:
import os
import pickle as pkl
import glob

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import whylogs as why

from sklearn.cluster import KMeans
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances
from sklearn.model_selection import train_test_split

np.set_printoptions(suppress=True)

### Downloading word lists

We won't train a model for this notebook, but instead look at how our text features work on a stream of data inputs that are organized by topic. Those topics are food, geography, machine learning, and musical instruments.

We'd like to use functionality in whylogs to note data quality differences as well as drift when a new category (sports) is added.

### Downloading GloVe vectors

We'll use the 200-dimensional GloVe word embeddings to encode our words. They can be downloaded with gensim's downloader API. Because the download can take a few minutes, we suggest saving the data locally as well.

In [1]:
import gensim.downloader
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-200')
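
One way to keep a local copy is a small cache helper. The sketch below is an assumption rather than part of the original notebook: `load_or_cache` and the cache file name are hypothetical, and it relies on the loaded object being picklable. A stand-in loader is used so the example runs without downloading anything.

```python
import os
import pickle
import tempfile


def load_or_cache(cache_path, loader):
    """Return the object stored at cache_path, calling loader() only on a cache miss."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    obj = loader()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f)
    return obj


# Stand-in loader for the demonstration. In the notebook, the loader would be
# something like: lambda: gensim.downloader.load("glove-wiki-gigaword-200")
calls = []


def fake_loader():
    calls.append(1)
    return {"river": [0.1, 0.2]}


demo_path = os.path.join(tempfile.mkdtemp(), "vectors.pkl")
first = load_or_cache(demo_path, fake_loader)   # cache miss: calls the loader
second = load_or_cache(demo_path, fake_loader)  # cache hit: reads from disk
```

The second call never reaches the loader, which is the behavior you'd want for a download that takes minutes.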



### Preparing data for our example

Here, we pull our vocabulary from the different word lists and look up the word embedding for each word. Words without a GloVe vector are assigned a zero vector.

In [19]:
num_dimensions = 200

words = []
categories = []
embeddings = []

for list_file in glob.glob("wordlists/*"):
    with open(list_file, "r") as f:
        # One word per line; drop only the trailing newline.
        word_list = [line.rstrip("\n") for line in f]
    # Use the file name without its extension as the category.
    category = os.path.splitext(os.path.basename(list_file))[0]
    words.extend(word_list)
    categories.extend([category] * len(word_list))

for word in words:
    try:
        embeddings.append(glove_vectors.get_vector(word))
    except KeyError:
        # Words missing from the GloVe vocabulary get a zero vector.
        print(f"Couldn't find embedding for word {word}. Using a zero vector.")
        embeddings.append(np.zeros(num_dimensions))
    if len(embeddings[-1]) != num_dimensions:
        print(word)

X_words = np.array(words)
X_embeddings = np.array(embeddings)
y = np.array(categories)

### Splitting into training and production datasets

Instead of training a model, we'll use scikit-learn's `train_test_split` to split our dataset into an original training dataset and the data we'll see on our first day of production.

In [26]:
from sklearn.model_selection import train_test_split

X_words_train, X_words_prod, X_embed_train, X_embed_prod, y_train, y_prod = train_test_split(X_words, X_embeddings, y, test_size=0.1)

In [36]:
df_train = pd.DataFrame({"words": X_words_train, "embeddings": [val for val in X_embed_train], "labels": y_train})
df_prod = pd.DataFrame({"words": X_words_prod, "embeddings": [val for val in X_embed_prod]})

display(df_train)

Unnamed: 0,words,embeddings,labels
0,hypothesis,"[0.3481999933719635, 0.05061199888586998, 0.46...",machinelearning
1,intercept,"[0.4326600134372711, -0.4637100100517273, 0.18...",sports
2,boundary,"[0.05375000089406967, -0.3873499929904938, 0.4...",machinelearning
3,cape,"[-0.08508200198411942, -0.1590999960899353, -0...",geography
4,river,"[-0.33959999680519104, -0.034967999905347824, ...",geography
...,...,...,...
266,wheat,"[0.3941600024700165, 0.20038999617099762, 0.51...",food
267,pass,"[-0.2054699957370758, 0.08040100336074829, -0....",sports
268,cookie,"[-0.2373799979686737, 0.504580020904541, 0.011...",food
269,territory,"[0.2437400072813034, 0.00025101000210270286, 0...",geography


## Profiling with whylogs

As with other advanced features, we can create a `DeclarativeSchema` to tell whylogs to resolve columns of a certain name to the `EmbeddingMetric` that we want to use.

We must pass our references, labels, and preferred distance function (either cosine distance or Euclidean distance) as parameters to `EmbeddingConfig`, then log as normal.

### Unicode and string length metrics for strings

By default, columns of type `str` will have the following metrics, when logged with whylogs:
- Counts
- Types
- Frequent Items/Frequent Strings
- Cardinality

In this example, we'll see how you can track further metrics for string columns. For each string record, we count the number of characters that fall in a given Unicode range, then generate distribution metrics such as `mean`, `stddev`, and quantile values from those counts. We apply the same approach to the overall string length.

For more info on the list of Unicode characters, check this [Wikipedia article](https://en.wikipedia.org/wiki/List_of_Unicode_characters).
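
To make the counting step concrete, here is a minimal pure-Python sketch of the idea. It is not the `UnicodeRangeMetric` implementation: the range names and bounds in `RANGES`, the `count_ranges` helper, and the sample records are all hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical ranges for illustration; bounds are inclusive code points.
RANGES = {
    "basic-latin": (0x0000, 0x007F),
    "extended-latin": (0x0080, 0x024F),
    "emoticons": (0x1F600, 0x1F64F),
}


def count_ranges(text):
    """Count characters per range for one record, plus its overall length."""
    counts = {name: 0 for name in RANGES}
    counts["UNKNOWN"] = 0
    for ch in text:
        for name, (lo, hi) in RANGES.items():
            if lo <= ord(ch) <= hi:
                counts[name] += 1
                break
        else:
            # No range matched this character.
            counts["UNKNOWN"] += 1
    counts["string_length"] = len(text)
    return counts


# "caf\u00e9" ends in an extended-Latin character; "\U0001F600" is an emoticon.
records = ["river", "caf\u00e9", "ok \U0001F600"]
per_record = [count_ranges(r) for r in records]

# Distribution metrics over the per-record counts for a single range.
latin_counts = [c["basic-latin"] for c in per_record]
summary = {
    "mean": statistics.mean(latin_counts),
    "stddev": statistics.stdev(latin_counts),
}
```

Each record yields one count per range, and the distribution of those counts across records is what gets summarized.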

In [41]:
from whylogs.core.schema import ColumnSchema, DatasetSchema
from whylogs.core.metrics.unicode_range import UnicodeRangeMetric
from whylogs.core.resolvers import Resolver
from whylogs.core.datatypes import DataType
from typing import Dict
from whylogs.core.metrics import Metric, MetricConfig

class UnicodeResolver(Resolver):
    def resolve(self, name: str, why_type: DataType, column_schema: ColumnSchema) -> Dict[str, Metric]:
        return {UnicodeRangeMetric.get_namespace(): UnicodeRangeMetric.zero(column_schema.cfg)}

In [43]:
text_results = why.log(df_train.drop("embeddings", axis=1), schema=DatasetSchema(resolvers=UnicodeResolver()))

In [45]:
text_results.profile().view().to_pandas().T

column,labels,words
type,SummaryType.COLUMN,SummaryType.COLUMN
unicode_range/UNKNOWN:cardinality/est,1.0,1.0
unicode_range/UNKNOWN:cardinality/lower_1,1.0,1.0
unicode_range/UNKNOWN:cardinality/upper_1,1.00005,1.00005
unicode_range/UNKNOWN:counts/inf,0,0
...,...,...
unicode_range/string_length:types/boolean,0,0
unicode_range/string_length:types/fractional,0,0
unicode_range/string_length:types/integral,271,271
unicode_range/string_length:types/object,0,0


Our example data is fairly clean, so there isn't much of interest here, but the above would notify us of any emoticons, control characters, numerals, or extended Latin characters that unexpectedly show up in production data.

## Finding references for embeddings

We would like to compare incoming embeddings against up to 30 predefined references. These can be chosen by the user either manually or algorithmically. Here, we use a supervised method for finding references, but because some use cases call for an unsupervised approach, we show one as well.

For use with labeled training data (even if no labels are available at inference):

In [21]:
from whylogs.experimental.preprocess.embeddings.selectors import PCACentroidsSelector

references, labels = PCACentroidsSelector(n_components=20).calculate_references(X_embed_train, y_train)

For use with unlabeled training data:

In [22]:
from whylogs.experimental.preprocess.embeddings.selectors import PCAKMeansSelector

unsup_references, _ = PCAKMeansSelector(n_clusters=8, n_components=20).calculate_references(X_embed_train)
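
To give a feel for the supervised idea, here is a simplified sketch that uses one centroid per label as the reference. It is an illustration only and not the `PCACentroidsSelector` implementation: it skips the PCA step entirely, and `centroid_references` and the toy data are hypothetical.

```python
import numpy as np


def centroid_references(X, y):
    """Return one mean vector per label, along with the sorted unique labels."""
    labels = np.unique(y)
    refs = np.stack([X[y == label].mean(axis=0) for label in labels])
    return refs, labels


# Toy 2-D "embeddings" with two labels.
X_toy = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
y_toy = np.array(["food", "food", "sports", "sports"])

toy_refs, toy_labels = centroid_references(X_toy, y_toy)
```

An unsupervised variant would replace the labels with cluster assignments, for example from k-means, and use the cluster centers as references.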

In [50]:
import whylogs as why
from whylogs.core.resolvers import MetricSpec, ResolverSpec
from whylogs.core.schema import DeclarativeSchema
from whylogs.experimental.core.metrics.embedding_metric import (
    DistanceFunction,
    EmbeddingConfig,
    EmbeddingMetric,
)

config = EmbeddingConfig(
    references=references,
    labels=labels,
    distance_fn=DistanceFunction.euclidean,
)
schema = DeclarativeSchema([ResolverSpec(column_name="embeddings", metrics=[MetricSpec(EmbeddingMetric, config)])])

train_profile = why.log(row={"embeddings": X_embed_train}, schema=schema)

Let's confirm that the contents of our profile measure the distribution of embeddings relative to the references we've provided.

In [51]:
train_profile.profile().view().to_pandas().T

column,embeddings
embedding/closest:counts/inf,0
embedding/closest:counts/n,271
embedding/closest:counts/nan,0
embedding/closest:counts/null,0
embedding/closest:frequent_items/frequent_strings,"[FrequentItem(value='food', est=89, upper=89, ..."
...,...
embedding/sports_distance:types/fractional,271
embedding/sports_distance:types/integral,0
embedding/sports_distance:types/object,0
embedding/sports_distance:types/string,0
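
As a rough mental model for the `closest` and `sports_distance` style entries above, each embedding is compared to every reference, giving one distance per reference plus the label of the nearest one. The NumPy sketch below is an assumption about what those entries summarize, not the `EmbeddingMetric` code; the helper names and toy references are hypothetical.

```python
import numpy as np


def euclidean_to_references(X, refs):
    """Pairwise Euclidean distances, shape (n_samples, n_references)."""
    diff = X[:, None, :] - refs[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def cosine_to_references(X, refs):
    """Pairwise cosine distances (1 - cosine similarity)."""
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    refs_norm = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    return 1.0 - X_norm @ refs_norm.T


# Toy references, one per label.
toy_refs = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_labels = np.array(["food", "sports"])
X_toy = np.array([[0.9, 0.1], [0.2, 0.8], [3.0, 0.0]])

dists = euclidean_to_references(X_toy, toy_refs)  # one column per reference
closest = toy_labels[dists.argmin(axis=1)]        # label of the nearest reference
cos_dists = cosine_to_references(X_toy, toy_refs)
```

Note that the zero vectors substituted for out-of-vocabulary words have no direction, so this cosine sketch would divide by zero for them; Euclidean distance does not have that problem.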


## Measuring embeddings drift in WhyLabs

Both approaches to selecting references can be powerful for measuring drift across new batches of embeddings programmatically, using drift metrics as well as the WhyLabs Observability Platform.

Consider changing our production data such that it represents drift. For example, perhaps the embeddings are purposely or accidentally rounded to the first decimal place. Perhaps we remove one of the categories, say sports, found in production. You can also add your own words to induce drift.

Many similar issues can creep into an ML pipeline and degrade our incoming data.
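
The two transformations mentioned above can be sketched with NumPy as follows. The arrays here are random stand-ins for `X_embed_prod` and `y_prod`, so the names ending in `_toy` are hypothetical and the values carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the production embeddings and their categories.
X_prod_toy = rng.normal(size=(6, 4))
y_prod_toy = np.array(["food", "sports", "geography", "sports", "food", "geography"])

# Drift 1: round every value to the first decimal place.
X_rounded = np.round(X_prod_toy, 1)

# Drift 2: drop every row belonging to one category.
keep = y_prod_toy != "sports"
X_no_sports = X_prod_toy[keep]
```

Either array could then be logged with the same `schema` used for the training data, for example `why.log(row={"embeddings": X_rounded}, schema=schema)`, and the resulting profile compared against the training profile.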

## What's Next?

### Upload profiles to WhyLabs for more drift calculations and monitoring

See [example notebook](https://whylogs.readthedocs.io/en/stable/examples/integrations/writers/Writing_to_WhyLabs.html) for monitoring your profiles continuously with the WhyLabs Observability Platform.

### Exploring other sources of drift

Consider comparing this profile to different transformations and subsets of our word embeddings dataset: randomly selected subsets of the data, normalized values, missing one or more labels, sorted values, and more.

### More example notebooks and documentation

Go to the [examples page](https://whylogs.readthedocs.io/en/stable/examples.html) for the complete list of examples!