>### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> 
>*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=String_Tracking)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=String_Tracking) to leverage the power of whylogs and WhyLabs together!*

# Natural Language Processing Logging


[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/advanced/String_Tracking.ipynb)

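This example shows how to profile text data with whylogs' experimental `NlpLogger`. We build a weighted inverted index from a small corpus, log a reference profile backed by an SVD approximation of that index, and then reuse the reference SVD to track how well new documents fit it.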

## Installing whylogs

If you haven't already, install whylogs: 

In [None]:
%pip install whylogs

## Creating the Data

We'll install NLTK to get access to its corpora and basic NLP functions.

In [3]:
%pip install nltk



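Let's start by building an inverted index of the NLTK inaugural corpus. We'll use NLTK's facilities for tokenization, stemming, and stop-word removal, and then apply log-entropy weighting to the term counts. Log-entropy weighting scales the log of each term count by a per-term weight that is close to 1 for terms concentrated in a few documents and close to 0 for terms spread evenly across all of them.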

In [4]:
import numpy as np
from nltk.corpus import inaugural, stopwords
from nltk.stem import PorterStemmer

from whylogs.core.configs import SummaryConfig
from whylogs.experimental.core.metrics.nlp_metric import (
    NlpLogger,
    SvdMetric,
    SvdMetricConfig,
    UpdatableSvdMetric,
)

import nltk

# fetch the corpora used below
nltk.download('stopwords')
nltk.download('inaugural')

# inverted index weighting utility functions


def global_freq(A: np.ndarray) -> np.ndarray:
    gf = np.zeros(A.shape[0])
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            gf[i] += A[i, j]
    return gf


def entropy(A: np.ndarray) -> np.ndarray:
    gf = global_freq(A)
    g = np.ones(A.shape[0])
    logN = np.log(A.shape[1])
    assert logN > 0.0
    for i in range(A.shape[0]):
        assert gf[i] > 0.0
        for j in range(A.shape[1]):
            p_ij = A[i, j] / gf[i]
            g[i] += p_ij * np.log(p_ij) / logN if p_ij > 0.0 else 0.0
    return g


def log_entropy(A: np.ndarray) -> None:
    g = entropy(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            A[i, j] = g[i] * np.log(A[i, j] + 1.0)


# the NLTK tokenizer emits punctuation as tokens, so add them to the stop list
stop_words = set(
    stopwords.words("english")
    + [
        ".",
        ",",
        ":",
        ";",
        '."',
        ',"',
        '"',
        "'",
        " ",
        "?",
        "[",
        "]",
        ".]",
        "' ",
        '" ',
        "? ",
        "-",
        "- ",
        "/",
        '?"',
        "...",
        "",
    ]
)

# build weighted inverted index of inaugural speeches
stemmer = PorterStemmer()

vstopped = {w for w in inaugural.words() if w.casefold() not in stop_words}
vocab = {stemmer.stem(w.casefold()) for w in vstopped}
vocab_size = len(vocab)

vocab_map = {}
rev_map = [""] * vocab_size
dim = 0
for w in vocab:
    if w not in vocab_map:
        vocab_map[w] = dim
        rev_map[dim] = w
        dim += 1

doc_lengths = []
ndocs = len(inaugural.fileids())
doc = 0
index = np.zeros((vocab_size, ndocs))
for fid in inaugural.fileids():
    stopped = [t.casefold() for t in inaugural.words(fid) if t.casefold() not in stop_words]
    stemmed = [stemmer.stem(w) for w in stopped]
    doc_lengths.append(len(stemmed))
    for w in stemmed:
        index[vocab_map[w], doc] += 1
    doc += 1

# A is our weighted inverted index
A = index.copy()
log_entropy(A)

# We'll need the global frequencies and entropies for weighting new document vectors
gf = global_freq(index)
g = entropy(index)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
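
The loop-based weighting functions in the cell above can also be written with vectorized NumPy operations. The following is a minimal sketch of the same log-entropy scheme on a toy matrix; the names `log_entropy_weights` and `log_entropy_matrix` are hypothetical helpers introduced here, and, like the original `entropy` function, the sketch assumes every term occurs at least once in the corpus.

```python
import numpy as np


def log_entropy_weights(counts: np.ndarray) -> np.ndarray:
    """Per-term entropy weights g_i for a term-by-document count matrix.

    Assumes every row has a positive sum (each term occurs at least once)
    and that there are at least two documents.
    """
    n_docs = counts.shape[1]
    gf = counts.sum(axis=1, keepdims=True)  # global frequency of each term
    p = counts / gf  # p_ij = tf_ij / gf_i
    # treat p * log(p) as 0 where p == 0; the inner where avoids log(0)
    plogp = np.where(p > 0.0, p * np.log(np.where(p > 0.0, p, 1.0)), 0.0)
    return 1.0 + plogp.sum(axis=1) / np.log(n_docs)


def log_entropy_matrix(counts: np.ndarray) -> np.ndarray:
    """Return a weighted copy of the count matrix: g_i * log(tf_ij + 1)."""
    g = log_entropy_weights(counts)
    return g[:, None] * np.log(counts + 1.0)


# toy matrix: 3 terms (rows) by 4 documents (columns)
toy = np.array(
    [
        [1.0, 1.0, 1.0, 1.0],  # spread evenly across documents
        [4.0, 0.0, 0.0, 0.0],  # concentrated in a single document
        [2.0, 1.0, 0.0, 1.0],  # somewhere in between
    ]
)
weights = log_entropy_weights(toy)
weighted = log_entropy_matrix(toy)
print(weights)
```

Unlike `log_entropy` in the notebook, this version returns a new matrix instead of modifying its argument in place, which makes it easier to keep the raw counts around for weighting new documents.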


## Log the reference profile

Let's instantiate an `NlpLogger` to create the reference profile for the corpus. We need to pick the number of singular values (the "concepts") to keep in the truncated SVD approximation of the inverted index. Since the inaugural corpus is very small, we'll keep 10. We also pass `UpdatableSvdMetric` as the SVD class, since we want the SVD approximation to be updated as each document is logged.
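
To build some intuition for what the number of retained concepts controls, here is a small sketch using plain NumPy on a random toy matrix. It is not how whylogs computes or updates its SVD; it only illustrates that keeping more singular values gives a closer low-rank approximation. `A_toy` and `rank_k_approx` are names made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy stand-in for a weighted term-by-document matrix (20 terms, 8 documents)
A_toy = rng.random((20, 8))

# thin SVD: A_toy = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(A_toy, full_matrices=False)


def rank_k_approx(k: int) -> np.ndarray:
    """Best rank-k approximation built from the top k singular triplets."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


# Frobenius-norm reconstruction error for increasing k
errors = [float(np.linalg.norm(A_toy - rank_k_approx(k))) for k in (1, 4, 8)]
print(errors)
```

The error shrinks as `k` grows and vanishes once `k` reaches the rank of the matrix, so a small `k` trades reconstruction accuracy for a more compact summary of the corpus.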

Once we've logged the documents, we can send the profile to WhyLabs and save the SVD state locally.

In [5]:
num_concepts = 10
old_doc_decay_rate = 1.0
svd_config = SvdMetricConfig(k=num_concepts, decay=old_doc_decay_rate)
nlp_logger = NlpLogger(svd_class=UpdatableSvdMetric, svd_config=svd_config)

for fid in inaugural.fileids():
    stopped = [t.casefold() for t in inaugural.words(fid) if t.casefold() not in stop_words]
    stemmed = [stemmer.stem(w) for w in stopped]

    doc_vec = np.zeros(vocab_size)
    for w in stemmed:
        doc_vec[vocab_map[w]] += 1
    for i in range(vocab_size):
        doc_vec[i] = g[i] * np.log(doc_vec[i] + 1.0)

    nlp_logger.log(stemmed, doc_vec)


# save reference profile locally
send_me_to_whylabs = nlp_logger.get_profile()  # small--only has a few standard metrics (no SVD)
nlp_logger._profile.flush()
svd_write_me = nlp_logger.get_svd_state()  # big--contains the SVD approximation & parameters



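We can take a look at the resulting profile. The cell below lists the profile's columns and prints, for each concept, the 10 stems with the largest weights: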

In [13]:
print(nlp_logger.get_profile().profile()._columns.keys())

svd = nlp_logger._svd_metric

concepts = svd.U.value.transpose()
for i in range(concepts.shape[0]):
    # indices of the 10 largest entries in concept i, smallest first
    pos_idx = sorted(range(len(concepts[i])), key=lambda x: concepts[i][x])[-10:]
    print(", ".join(rev_map[j] for j in pos_idx))
print()

dict_keys(['nlp_bag_of_words', 'nlp_lsi'])
reunion, annex, bill, occasion, compromis, array, discrimin, european, texa, tax
hate, exig, mischief, proposit, speedili, exercis, discrimin, minor, texa, union
!, outset, actual, unequ, amid, fortitud, overlook, outrun, weigh, forebod
ballot, unless, journey, job, seced, case, stori, slaveri, fugit, 
deleg, occasion, incident, discrimin, feder, reunion, annex, compromis, levi, texa
preced, invas, railroad, predecessor, board, type, negro, fortif, coast, interst
tribe, florida, contest, occurr, neutral, spain, union, naval, territori, augment
philippin, suitabl, elector, report, employe, tariff, south, type, negro, interst
era, journey, world, econom, , challeng, help, ideal, today, america
legisl, power, execut, revenu, congress, upon, law, union, state, constitut



For production logging, we can choose whether or not to continue updating the SVD approximation. In this case, we'll use `SvdMetric` so that the reference SVD won't be updated. We'll load the reference SVD that we saved locally.
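
The cell below re-logs the inaugural corpus as a stand-in for production data, so every stem is already in `vocab_map`. Real production documents may contain terms the reference vocabulary has never seen, and indexing `vocab_map[w]` directly would then raise a `KeyError`. A minimal sketch of one way to handle this, assuming it is acceptable to skip and count unknown terms, is shown here; `weighted_doc_vector` is a hypothetical helper, not part of whylogs.

```python
from typing import Dict, List, Tuple

import numpy as np


def weighted_doc_vector(
    tokens: List[str], vocab_map: Dict[str, int], g: np.ndarray
) -> Tuple[np.ndarray, int]:
    """Build a log-entropy weighted document vector.

    Tokens missing from vocab_map are skipped; their count is returned
    so that callers can monitor out-of-vocabulary rates.
    """
    counts = np.zeros(len(g))
    unknown = 0
    for t in tokens:
        idx = vocab_map.get(t)
        if idx is None:
            unknown += 1
        else:
            counts[idx] += 1
    return g * np.log(counts + 1.0), unknown


# toy vocabulary and entropy weights, for illustration only
toy_vocab = {"union": 0, "state": 1, "law": 2}
toy_g = np.array([0.5, 0.25, 1.0])
vec, n_unknown = weighted_doc_vector(
    ["union", "law", "union", "blockchain"], toy_vocab, toy_g
)
print(vec, n_unknown)
```

With the notebook's own `vocab_map` and `g`, the returned vector could be passed to `nlp_logger.log(stemmed, doc_vec)` in place of the hand-built `doc_vec`.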

In [None]:
# production tracking, no reference update

prod_logger = NlpLogger(svd_class=SvdMetric, svd_state=svd_write_me)  # use UpdatableSvdMetric to train in production

prod_svd = prod_logger._svd_metric

residuals = []
for fid in inaugural.fileids():
    stopped = [t.casefold() for t in inaugural.words(fid) if t.casefold() not in stop_words]
    stemmed = [stemmer.stem(w) for w in stopped]

    doc_vec = np.zeros(vocab_size)
    for w in stemmed:
        doc_vec[vocab_map[w]] += 1
    for i in range(vocab_size):
        doc_vec[i] = g[i] * np.log(doc_vec[i] + 1.0)

    residuals.append(prod_svd.residual(doc_vec))
    prod_logger.log(stemmed, doc_vec)  # update residual only, not SVD

print(f"\nresiduals: {residuals}\n")

# if we trained with production data
# svd_write_me = prod_logger.get_svd_state()

# send to WhyLabs, no SVD state
send_me = prod_logger.get_profile()

# get stats on doc length, term length, SVD "fit"
view_me = prod_svd.to_summary_dict(SummaryConfig())
print(view_me)
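
The residuals collected above indicate how well each document is explained by the reference concepts. As a rough illustration, one common way to define such a residual is the relative norm of the part of a document vector that lies outside the span of the retained left singular vectors. This is an assumption made for the sketch; whylogs may define or normalize its residual differently, and `relative_residual` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(1)
# toy term-by-document matrix and its top-3 left singular vectors
A_toy = rng.random((30, 12))
U, _, _ = np.linalg.svd(A_toy, full_matrices=False)
U_k = U[:, :3]


def relative_residual(U_k: np.ndarray, x: np.ndarray) -> float:
    """Norm of the component of x orthogonal to span(U_k), relative to ||x||.

    Assumes U_k has orthonormal columns and x is not the zero vector.
    """
    projection = U_k @ (U_k.T @ x)
    return float(np.linalg.norm(x - projection) / np.linalg.norm(x))


in_span = U_k @ np.array([1.0, -2.0, 0.5])  # lies entirely in the concept space
outside = rng.random(30)  # arbitrary vector, mostly outside it
r_in = relative_residual(U_k, in_span)
r_out = relative_residual(U_k, outside)
print(r_in, r_out)
```

Under this definition a value near 0 means the document is well represented by the reference concepts, while a value near 1 means it is mostly unexplained, which is the kind of signal one would watch for drift in production text.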
