# How to evaluate text data using Descriptors

**Disclaimer**. This example uses the Evidently API as available in version 0.6.7 or lower. Please ensure you are using the correct version when running this notebook. For updated and new examples using the latest Evidently versions, visit our documentation. 

Evidently docs: https://docs.evidentlyai.com/

Join our Discord: https://discord.com/invite/xZjKRaNp8b

This tutorial explains:
* how to evaluate text data using Descriptors
* how to use external models to generate additional features for the text data

# Installation

Install Evidently following the instructions for your environment: https://docs.evidentlyai.com/user-guide/install-evidently

In [None]:
try:
    import evidently
except ImportError:
    # Pin to a release that matches the disclaimer above (0.6.7 or lower).
    !pip install evidently==0.6.7

Install transformers to be able to use the external model. Instructions: https://huggingface.co/docs/transformers/installation

In [None]:
!pip install transformers

In [None]:
import pandas as pd
import numpy as np

from sklearn import datasets

Import the required Evidently components that you will use in the tutorial.

In [None]:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.test_suite import TestSuite

from evidently.metrics import DataDriftTable, TextDescriptorsDriftMetric, ColumnDriftMetric
from evidently.metric_preset import TextOverviewPreset
from evidently.descriptors import TextLength, TriggerWordsPresence, OOV, NonLetterCharacterPercentage, SentenceCount, WordCount, Sentiment
from evidently.tests import TestColumnDrift, TestNumberOfOutRangeValues

Import the components from NLTK required to compute some of the metrics.

In [None]:
import nltk
nltk.download('words')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('vader_lexicon')

Import the components from Transformers to be able to use the external model.

In [None]:
from transformers import pipeline

# Prepare the data

Load the e-commerce review demo dataset.

In [None]:
reviews_data = datasets.fetch_openml(name='Womens-E-Commerce-Clothing-Reviews', version=2, as_frame='auto')
reviews = reviews_data.frame

Split into two datasets: reference and current. Let's imagine that "reference" data is the data for some representative past period (e.g., last month) and "current" is the current production data (e.g., this month).

In [None]:
reviews_ref = reviews[reviews.Rating > 3].sample(n=5000, replace=True, ignore_index=True, random_state=42)
reviews_cur = reviews[reviews.Rating < 3].sample(n=5000, replace=True, ignore_index=True, random_state=42)

Add column mapping to help Evidently parse the input data correctly and specify the columns with text.

In [None]:
column_mapping = ColumnMapping(
    numerical_features=['Age', 'Positive_Feedback_Count'],
    categorical_features=['Division_Name', 'Department_Name', 'Class_Name'],
    text_features=['Review_Text', 'Title']
)

Let's look at the data structure!

In [None]:
reviews_ref.head()

# Get the text overview report

Evidently generates a lot of metrics out of the box. For example, you can generate a comparative Report to visualize the characteristics of the review texts in the two datasets. There is a pre-built **Text Overview Preset** that combines different descriptive checks and evaluates data drift.

We also use **Text Descriptors**: standard auto-generated features that describe the text dataset (e.g., text length, share of out-of-vocabulary words).
The example below also defines two additional Descriptors based on trigger words. Each checks whether any word from the list appears in a review.

Check out the Evidently docs on Descriptors for details: https://docs.evidentlyai.com/user-guide/customization/text-descriptors-parameters
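
To build intuition for what a descriptor is, here is a minimal sketch in plain Python. It is not Evidently's implementation: the function names, the tokenization rule, and the way non-letter characters are counted are assumptions made for illustration, and the trigger word check skips any lemmatization.

```python
import re

def text_length(text):
    """Number of characters in the text."""
    return len(text)

def word_count(text):
    """Number of word-like tokens (letters, digits, apostrophes)."""
    return len(re.findall(r"[A-Za-z0-9']+", text))

def non_letter_percentage(text):
    """Percent of characters that are neither letters nor whitespace (assumed rule)."""
    if not text:
        return 0.0
    non_letters = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    return 100.0 * non_letters / len(text)

def trigger_words_present(text, words_list):
    """True if any word from words_list appears as a token in the text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return any(word.lower() in tokens for word in words_list)

# A made-up review used only for this illustration.
review = "Love this dress! Fits well."

features = {
    "text_length": text_length(review),
    "word_count": word_count(review),
    "non_letter_pct": round(non_letter_percentage(review), 2),
    "about_dress": trigger_words_present(review, ["dress", "gown"]),
    "about_blouses": trigger_words_present(review, ["blouse", "shirt"]),
}
print(features)
```

Each function maps one text to one value, so applying it to a whole column yields a new numerical or categorical feature that can be summarized and compared between datasets.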

In [None]:
text_overview_report = Report(metrics=[
    TextOverviewPreset(column_name="Review_Text", descriptors={
        "Review texts - OOV %" : OOV(),
        "Review texts - Non Letter %" : NonLetterCharacterPercentage(),
        "Review texts - Symbol Length" : TextLength(),
        "Review texts - Sentence Count" : SentenceCount(),
        "Review texts - Word Count" : WordCount(),
        "Review texts - Sentiment" : Sentiment(),
        "Reviews about Dress" : TriggerWordsPresence(words_list=['dress', 'gown']),
        "Reviews about Blouses" : TriggerWordsPresence(words_list=['blouse', 'shirt']),
    })
])

text_overview_report.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
text_overview_report

# Customize the report - data drift

Let's say you are only interested in tracking a few relevant text properties and detecting when they change:
* Whether the review is about dresses
* The length of the review in words
* The review sentiment

Let's create a simple custom report to track drift in these properties.

In [None]:
descriptors_report = Report(metrics=[
    ColumnDriftMetric(WordCount().for_column("Review_Text")),
    ColumnDriftMetric(Sentiment().for_column("Review_Text")),
    ColumnDriftMetric(TriggerWordsPresence(words_list=['dress', 'gown']).for_column("Review_Text")),
])

descriptors_report.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
descriptors_report
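
If you want a feel for what a drift check on a numerical descriptor does, here is a minimal sketch that compares two samples of word counts with the two-sample Kolmogorov-Smirnov statistic. This is an illustration only: Evidently selects its own drift method, and the sample values and the `drift_threshold` below are made up for the example.

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for value in set(ref_sorted) | set(cur_sorted):
        ref_cdf = bisect_right(ref_sorted, value) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, value) / len(cur_sorted)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap

# Hypothetical word counts per review in each dataset.
ref_word_counts = [40, 55, 60, 62, 70, 75, 80, 90]
cur_word_counts = [10, 15, 20, 25, 30, 62, 70, 75]

drift_threshold = 0.3  # arbitrary cut-off chosen for this sketch
statistic = ks_statistic(ref_word_counts, cur_word_counts)
print(f"KS statistic: {statistic:.3f}, drift detected: {statistic > drift_threshold}")
```

The idea carries over to any numerical descriptor: compute it for every text in both datasets, then compare the two resulting distributions.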

You can also include raw data in the reports. The visuals might be more informative, but they take longer to compute and load, so use this option with caution.

In [None]:
report = Report(
    metrics=[
        ColumnDriftMetric(Sentiment().for_column("Review_Text")),
    ],
    options={"render": {"raw_data": True}},
)
report.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
report

# How to add a new text property?

To add a new text property, you can use an external open-source model to score your dataset. Then you will work with this property as an additional column.

As an example, we will take a DistilBERT model that classifies text into six emotions. Source: https://huggingface.co/bhadresh-savani/distilbert-base-uncased-emotion

You can consider any other model, for example, for named entity recognition, language detection, toxicity detection, etc.

In [None]:
classifier = pipeline("text-classification", model='bhadresh-savani/distilbert-base-uncased-emotion', top_k=1)
prediction = classifier("I love using evidently! It's easy to use")
print(prediction)

## Score the reviews by emotion

**Note**: this step scores the dataset using an external model and takes some time to execute. If you want to understand the principle without waiting, scroll down to the "Simple example" section below.

In [None]:
reviews_ref['emotion'] = [x[0]['label'] for x in classifier(list(reviews_ref.Review_Text.fillna('')))]
reviews_cur['emotion'] = [x[0]['label'] for x in classifier(list(reviews_cur.Review_Text.fillna('')))]
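
The list comprehension above relies on two things: missing reviews are replaced with empty strings, and each prediction is a list whose first element is a dict with a `label` key. Here is a small sketch of that logic on mocked data, so you can check it without running the model. The `mock_predictions` values and both helper names are hypothetical.

```python
# Hypothetical classifier output for three texts. The nested shape mirrors
# what the scoring code indexes into with x[0]['label'].
mock_predictions = [
    [{"label": "joy", "score": 0.98}],
    [{"label": "anger", "score": 0.87}],
    [{"label": "sadness", "score": 0.65}],
]

def clean_texts(texts):
    """Replace anything that is not a string (None, NaN) with an empty string."""
    return [text if isinstance(text, str) else "" for text in texts]

def top_labels(predictions):
    """Take the label of the first entry returned for each text."""
    return [per_text[0]["label"] for per_text in predictions]

raw_reviews = ["Great fit", None, float("nan")]
print(clean_texts(raw_reviews))
print(top_labels(mock_predictions))
```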

## Update column mapping

Let's take a look at the new dataset, which now contains the "emotion" column.

In [None]:
reviews_cur.head()

Reflect this in the Column Mapping by adding "emotion" to the categorical features.

In [None]:
column_mapping = ColumnMapping(
    numerical_features=['Age', 'Positive_Feedback_Count'],
    categorical_features=['Division_Name', 'Department_Name', 'Class_Name', 'emotion'],
    text_features=['Review_Text', 'Title']
)

## Add "emotion drift" checks

You can now add the drift check for the "emotion" column to the Report.

In [None]:
descriptors_report = Report(metrics=[
    ColumnDriftMetric(WordCount().for_column("Review_Text")),
    ColumnDriftMetric(Sentiment().for_column("Review_Text")),
    ColumnDriftMetric(TriggerWordsPresence(words_list=['dress', 'gown']).for_column("Review_Text")),
    ColumnDriftMetric('emotion'),
])

descriptors_report.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
descriptors_report
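
Since "emotion" is a categorical column, a drift check on it compares the share of each label in the two datasets. Here is a minimal sketch that uses the Jensen-Shannon distance (base 2, so the result lies between 0 and 1). The method is chosen for illustration and is not necessarily what Evidently applies to your data; the label lists are made up.

```python
import math
from collections import Counter

def category_shares(labels, categories):
    """Share of each category in the list of labels."""
    counts = Counter(labels)
    return [counts[category] / len(labels) for category in categories]

def jensen_shannon_distance(reference_labels, current_labels):
    """Distance between two label distributions: 0 = identical, 1 = disjoint."""
    categories = sorted(set(reference_labels) | set(current_labels))
    p = category_shares(reference_labels, categories)
    q = category_shares(current_labels, categories)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a zero share contribute nothing to the divergence.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(divergence, 0.0))

# Hypothetical emotion labels for the two datasets.
ref_emotions = ["joy"] * 7 + ["love"] * 2 + ["sadness"]
cur_emotions = ["joy"] * 2 + ["anger"] * 4 + ["sadness"] * 4

print(f"JS distance: {jensen_shannon_distance(ref_emotions, cur_emotions):.3f}")
```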

# Run pipeline tests

To execute regular checks, you can use Evidently Test Suites.

In [None]:
descriptors_test_suite = TestSuite(tests=[
    TestColumnDrift(column_name = 'emotion'),
    TestColumnDrift(column_name = WordCount().for_column("Review_Text")),
    TestColumnDrift(column_name = Sentiment().for_column("Review_Text")),
    TestColumnDrift(column_name = TriggerWordsPresence(words_list=['dress', 'gown']).for_column("Review_Text")),
])

descriptors_test_suite.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
descriptors_test_suite

## Beyond drift

Detecting statistical distribution drift is one way to monitor changes in a text property. However, sometimes it is more convenient to use other checks, for example, rule-based expectations on min-max values.

Let's say that we want to check that:
* Every review is at least 2 words long. We want the test to fail if at least 1 review has fewer than 2 words, and to see the number of short texts.

In [None]:
descriptors_test_suite = TestSuite(tests=[
    TestNumberOfOutRangeValues(column_name = WordCount().for_column("Review_Text"), left=2, eq=0),
])

descriptors_test_suite.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
descriptors_test_suite
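
The rule behind this test can be sketched in a few lines of plain Python: count the values below the lower bound and expect that count to equal zero. The sketch assumes that a value equal to the bound counts as in range; the helper name and the word counts are hypothetical.

```python
def count_below(values, left):
    """Number of values that fall below the lower bound."""
    return sum(1 for value in values if value < left)

# Hypothetical word counts for five reviews.
word_counts = [12, 1, 35, 0, 8]

short_reviews = count_below(word_counts, left=2)
test_passed = short_reviews == 0  # mirrors the eq=0 condition
print(f"Short reviews: {short_reviews}, test passed: {test_passed}")
```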

# Simple example

To avoid waiting until the model scores the dataset, let's assume that the existing column "Class_Name" is the new descriptor.

In [None]:
simple_descriptors_report = Report(metrics=[
    ColumnDriftMetric(WordCount().for_column("Review_Text")),
    ColumnDriftMetric(Sentiment().for_column("Review_Text")),
    ColumnDriftMetric(TriggerWordsPresence(words_list=['dress', 'gown']).for_column("Review_Text")),
    ColumnDriftMetric('Class_Name'),
])

simple_descriptors_report.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
simple_descriptors_report

In [None]:
descriptors_test_suite = TestSuite(tests=[
    TestColumnDrift(column_name = 'Class_Name'),
    TestColumnDrift(column_name = WordCount().for_column("Review_Text")),
    TestColumnDrift(column_name = Sentiment().for_column("Review_Text")),
    TestColumnDrift(column_name = TriggerWordsPresence(words_list=['dress', 'gown']).for_column("Review_Text")),
])

descriptors_test_suite.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
descriptors_test_suite

# Support Evidently

Enjoyed the tutorial? Star Evidently on GitHub to contribute back! This helps us continue creating free open-source tools for the community. https://github.com/evidentlyai/evidently


