>### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> 
>*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=String_Tracking)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=String_Tracking) to leverage the power of whylogs and WhyLabs together!*

# Natural Language Processing Logging


[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/advanced/String_Tracking.ipynb)

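This example shows how to log natural language data with the experimental `NlpLogger` in whylogs. We build a weighted inverted index of a small corpus, fit an SVD approximation of that index while logging a reference profile, and then measure how well documents fit the reference.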

## Installing whylogs

If you haven't already, install whylogs: 

In [1]:
pip install whylogs-1.1.7-py3-none-any.whl


In [None]:
%pip install whylogs

## Creating the Data

We'll install NLTK to get access to its corpora and basic NLP functions.

In [2]:
%pip install nltk



Let's start by building an inverted index (a term-by-document count matrix) of the NLTK inaugural corpus. We'll use NLTK's facilities for tokenization, stemming, and stop-word removal, and then apply log-entropy weighting to the counts.
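For reference, the weighting implemented by the helper functions in the next cell can be written as follows. This is read off that code rather than taken from a separate specification; the symbols $tf_{ij}$ (count of term $i$ in document $j$), $N$ (number of documents), and $a_{ij}$ (weighted entry) are notation introduced here.

$$
gf_i = \sum_{j=1}^{N} tf_{ij}, \qquad p_{ij} = \frac{tf_{ij}}{gf_i}
$$

$$
g_i = 1 + \sum_{j=1}^{N} \frac{p_{ij} \log p_{ij}}{\log N}, \qquad a_{ij} = g_i \, \log(tf_{ij} + 1)
$$

Here $p_{ij} \log p_{ij}$ is taken to be $0$ when $p_{ij} = 0$. A term spread evenly over all documents gets $g_i = 0$, while a term that occurs in only one document gets $g_i = 1$.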

In [3]:
import nltk
import numpy as np
from nltk.corpus import inaugural, stopwords
from nltk.stem import PorterStemmer

from whylogs.core.configs import SummaryConfig
from whylogs.experimental.core.metrics.nlp_metric import (
    NlpLogger,
    SvdMetric,
    SvdMetricConfig,
    UpdatableSvdMetric,
)

# download the corpora used below
nltk.download('stopwords')
nltk.download('inaugural')

# inverted index weighting utility functions


def global_freq(A: np.ndarray) -> np.ndarray:
    """Total count of each term (row) across all documents (columns)."""
    return A.sum(axis=1)


def entropy(A: np.ndarray) -> np.ndarray:
    """Entropy weight per term: 1 + sum_j p_ij * log(p_ij) / log(ndocs)."""
    gf = global_freq(A)
    g = np.ones(A.shape[0])
    logN = np.log(A.shape[1])
    assert logN > 0.0  # requires at least two documents
    for i in range(A.shape[0]):
        assert gf[i] > 0.0  # every term must occur at least once
        for j in range(A.shape[1]):
            p_ij = A[i, j] / gf[i]
            # p * log(p) is taken to be 0 when p == 0
            g[i] += p_ij * np.log(p_ij) / logN if p_ij > 0.0 else 0.0
    return g


def log_entropy(A: np.ndarray) -> None:
    """Apply log-entropy weighting to A in place."""
    g = entropy(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            A[i, j] = g[i] * np.log(A[i, j] + 1.0)


# the NLTK tokenizer produces punctuation as terms, so treat punctuation as stop words
stop_words = set(
    stopwords.words("english")
    + [
        ".",
        ",",
        ":",
        ";",
        '."',
        ',"',
        '"',
        "'",
        " ",
        "?",
        "[",
        "]",
        ".]",
        "' ",
        '" ',
        "? ",
        "-",
        "- ",
        "/",
        '?"',
        "...",
        "",
    ]
)

# build weighted inverted index of inaugural speeches
stemmer = PorterStemmer()

vstopped = {w for w in inaugural.words() if w.casefold() not in stop_words}
vocab = {stemmer.stem(w.casefold()) for w in vstopped}
vocab_size = len(vocab)

# map each stemmed term to a row of the index (vocab_map) and back (rev_map)
vocab_map = {}
rev_map = [""] * vocab_size
for dim, w in enumerate(vocab):
    vocab_map[w] = dim
    rev_map[dim] = w

doc_lengths = []
ndocs = len(inaugural.fileids())
index = np.zeros((vocab_size, ndocs))  # rows are terms, columns are documents
for doc, fid in enumerate(inaugural.fileids()):
    stopped = [t.casefold() for t in inaugural.words(fid) if t.casefold() not in stop_words]
    stemmed = [stemmer.stem(w) for w in stopped]
    doc_lengths.append(len(stemmed))
    for w in stemmed:
        index[vocab_map[w], doc] += 1

# A is our weighted inverted index
A = index.copy()
log_entropy(A)

# We'll need the global frequencies and entropies for weighting new document vectors
gf = global_freq(index)
g = entropy(index)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.
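The explicit loops in the weighting helpers are easy to follow but slow for large vocabularies. As a rough sketch, the same log-entropy weighting can be written with NumPy array operations. The function name `log_entropy_vectorized` and the toy matrix are illustrative assumptions, not part of whylogs or of the cell above, and unlike `log_entropy` this version returns a new array instead of modifying its input.

```python
import numpy as np


def log_entropy_vectorized(counts: np.ndarray) -> np.ndarray:
    """Return a log-entropy weighted copy of a term-by-document count matrix.

    Hypothetical helper: assumes every term (row) occurs at least once
    and that there are at least two documents (columns).
    """
    n_docs = counts.shape[1]
    gf = counts.sum(axis=1, keepdims=True)  # global frequency of each term
    p = counts / gf  # share of each term's occurrences per document

    # p * log(p), taken to be 0 wherever p == 0
    plogp = np.zeros_like(p, dtype=float)
    mask = p > 0
    plogp[mask] = p[mask] * np.log(p[mask])

    g = 1.0 + plogp.sum(axis=1) / np.log(n_docs)  # entropy weight per term
    return g[:, None] * np.log(counts + 1.0)


# toy 3-term x 3-document count matrix (made-up data)
toy = np.array(
    [
        [1.0, 1.0, 1.0],  # evenly spread term
        [3.0, 0.0, 0.0],  # term concentrated in one document
        [2.0, 1.0, 0.0],
    ]
)
weighted = log_entropy_vectorized(toy)
print(weighted)
```

In the toy matrix, the evenly spread term is weighted down to (numerically) zero, while the term concentrated in a single document keeps its full `log(count + 1)` value.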


## Log the reference profile

Let's instantiate an `NlpLogger` to create the reference profile for the corpus. We need to pick the number of singular values (concepts) to keep in the SVD approximation of the inverted index. Since the inaugural corpus is very small, we'll set it to 10. We also need to specify the `UpdatableSvdMetric`, since we want to update the SVD approximation as we process the documents.

Once we've logged the documents, we can send the profile to WhyLabs and save the SVD locally.
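To make the rank-`k` truncation concrete, here is a minimal sketch using plain NumPy on a small random matrix. It illustrates a batch truncated SVD only; it is not how `UpdatableSvdMetric` maintains its approximation internally, and `toy_index` is made-up data standing in for a weighted index.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_index = rng.random((8, 5))  # hypothetical 8-term x 5-document weighted index

k = 2  # number of singular values (concepts) to keep
U, s, Vt = np.linalg.svd(toy_index, full_matrices=False)

# keep only the k largest singular values and their singular vectors
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
approx = U_k @ np.diag(s_k) @ Vt_k

# the Frobenius-norm error of the rank-k approximation equals the
# root of the sum of the squared singular values that were dropped
error = np.linalg.norm(toy_index - approx)
print(f"rank-{k} approximation error: {error:.4f}")
print(f"from dropped singular values: {np.sqrt(np.sum(s[k:] ** 2)):.4f}")
```

A larger `k` lowers the approximation error but makes the saved SVD state bigger, which is why a small value is a reasonable choice for a small corpus.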

In [4]:
num_concepts = 10
old_doc_decay_rate = 1.0
svd_config = SvdMetricConfig(k=num_concepts, decay=old_doc_decay_rate)
nlp_logger = NlpLogger(svd_class=UpdatableSvdMetric, svd_config=svd_config)

for fid in inaugural.fileids():
    stopped = [t.casefold() for t in inaugural.words(fid) if t.casefold() not in stop_words]
    stemmed = [stemmer.stem(w) for w in stopped]

    doc_vec = np.zeros(vocab_size)
    for w in stemmed:
        doc_vec[vocab_map[w]] += 1
    for i in range(vocab_size):
        doc_vec[i] = g[i] * np.log(doc_vec[i] + 1.0)

    nlp_logger.log(stemmed, doc_vec)


# get the reference profile and the SVD state (both are kept in memory in this example)
send_me_to_whylabs = nlp_logger.get_profile()  # small--only has a few standard metrics (no SVD)
nlp_logger._profile.flush()
svd_write_me = nlp_logger.get_svd_state()  # big--contains the SVD approximation & parameters


We can take a look at the resulting profile and at the highest-weighted terms for each SVD concept:

In [6]:
bow_summary = nlp_logger.get_profile().view().get_column("nlp_bag_of_words").get_metric("nlp_bow").to_summary_dict()
for key, value in bow_summary.items():
    print(f"  {key}: {value}")
print()

svd = nlp_logger._svd_metric
concepts = svd.U.value.transpose()  # one row per concept
for i in range(concepts.shape[0]):
    # indices of the 10 terms with the largest weights in concept i
    pos_idx = sorted(range(len(concepts[i])), key=lambda x: concepts[i][x])[-10:]
    # indices of the 5 terms with the most negative weights (not printed by default)
    neg_idx = sorted(range(len(concepts[i])), key=lambda x: -1 * concepts[i][x])[-5:]
    print(", ".join([rev_map[j] for j in pos_idx]))  # + [rev_map[j] for j in neg_idx]))
print()

  doc_length:distribuion/mean: 1121.0847457627117
  doc_length:distribuion/stddev: 638.2022781742044
  doc_length:distribuion/n: 59
  doc_length:distribuion/max: 3833.0
  doc_length:distribuion/min: 62.0
  doc_length:distribuion/q_01: 62.0
  doc_length:distribuion/q_05: 343.0
  doc_length:distribuion/q_10: 524.0
  doc_length:distribuion/q_25: 669.0
  doc_length:distribuion/median: 1030.0
  doc_length:distribuion/q_75: 1370.0
  doc_length:distribuion/q_90: 1940.0
  doc_length:distribuion/q_95: 2280.0
  doc_length:distribuion/q_99: 3833.0
  doc_length:counts/n: 59
  doc_length:counts/null: 0
  doc_length:types/integral: 59
  doc_length:types/fractional: 0
  doc_length:types/boolean: 0
  doc_length:types/string: 0
  doc_length:types/object: 0
  doc_length:cardinality/est: 57.00000792741905
  doc_length:cardinality/upper_1: 57.00285389510009
  doc_length:cardinality/lower_1: 57.0
  doc_length:ints/max: 3833
  doc_length:ints/min: 62
  term_length:distribution/mean: 5.591935776487663
  term

For production logging, we can choose whether or not to continue updating the SVD approximation. In this case, we'll use `SvdMetric` so that the reference SVD won't be updated. We'll load the reference SVD that we saved locally.
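The cell below also records a residual for each document. How whylogs computes `residual` is not shown in this notebook. One common definition, assumed here purely for illustration, is the relative reconstruction error after projecting a document vector onto the subspace spanned by the retained concepts. The function name `relative_residual` and the toy vectors are hypothetical.

```python
import numpy as np


def relative_residual(U_k: np.ndarray, x: np.ndarray) -> float:
    """Relative distance of x from the subspace spanned by the columns of U_k.

    Assumes the columns of U_k are orthonormal. This is an illustrative
    definition and may differ from what whylogs computes.
    """
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return 0.0
    projection = U_k @ (U_k.T @ x)  # component of x inside the subspace
    return float(np.linalg.norm(x - projection) / norm)


# orthonormal basis for the plane spanned by the first two axes of R^3
U_k = np.eye(3)[:, :2]

in_plane = np.array([3.0, 4.0, 0.0])   # lies entirely in the subspace
off_plane = np.array([0.0, 0.0, 2.0])  # orthogonal to the subspace
mixed = np.array([3.0, 0.0, 4.0])      # partly inside, partly outside

for name, vec in [("in_plane", in_plane), ("off_plane", off_plane), ("mixed", mixed)]:
    print(name, relative_residual(U_k, vec))
```

Under this definition, a value near 0 means the document is well described by the reference concepts, and a value near 1 means it is mostly outside them.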

In [7]:
# production tracking, no reference update

prod_logger = NlpLogger(svd_class=SvdMetric, svd_state=svd_write_me)  # use UpdatableSvdMetric to train in production

prod_svd = prod_logger._svd_metric

residuals = []
for fid in inaugural.fileids():
    stopped = [t.casefold() for t in inaugural.words(fid) if t.casefold() not in stop_words]
    stemmed = [stemmer.stem(w) for w in stopped]

    doc_vec = np.zeros(vocab_size)
    for w in stemmed:
        doc_vec[vocab_map[w]] += 1
    for i in range(vocab_size):
        doc_vec[i] = g[i] * np.log(doc_vec[i] + 1.0)

    residuals.append(prod_svd.residual(doc_vec))
    prod_logger.log(stemmed, doc_vec)  # update residual only, not SVD

print(f"\nresiduals: {residuals}\n")

# if we trained with production data
# svd_write_me = prod_logger.get_svd_state()

# send to WhyLabs, no SVD state
send_me = prod_logger.get_profile()

# get stats on doc length, term length, SVD "fit"
view_me = prod_svd.to_summary_dict(SummaryConfig())
print(view_me)



residuals: [0.9374349405545456, 0.9900965672717289, 0.9113568618849668, 0.9229127995333184, 0.9026913759048323, 0.9309452409506167, 0.9600856215484892, 0.7978363639300197, 0.2485423329154347, 0.8388866693394962, 0.9219593968094596, 0.9338229346804906, 0.39892206399674773, 0.05057287344894667, 0.2561104893885517, 0.9119026566621015, 0.7955023103731856, 0.8373906262694235, 0.4977861456818984, 0.9732114162116675, 0.9374763643180631, 0.9393504075942269, 0.8652658479179123, 0.8306574119037258, 0.9100508803781197, 0.09127160761155154, 0.9118942761089684, 0.43885793955621716, 0.8920923233150638, 0.9621954071529196, 0.06675439215580092, 0.9508982030878171, 0.9417248038715189, 0.22717416450227484, 0.7054859897545632, 0.7922191445402398, 0.9109914947994936, 0.9268860122888748, 0.9512663005854589, 0.969489084485672, 0.8903536588967198, 0.9005175921107347, 0.9308675277059155, 0.9378327473580307, 0.9495842047110129, 0.9135955673015228, 0.9240933916958679, 0.9444879200285949, 0.8956638416552153, 0.