<a href="https://colab.research.google.com/github/datametal/oreilly_live_training_llm_apps/blob/main/Copy_of_DCAI_Principles_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Applying Data-centric AI (DCAI) Principles To Production Systems

We will complete a series of exercises to explore the use of Data-centric AI (DCAI) in practice. Please feel free to return to this notebook to refresh those skills.

# Section 0. Setup and configuration

We will make sure that every attendee can open and run this notebook in Google Colab with all Python dependencies installed.

In [None]:
# Installing needed packages
%pip install langkit[all]
%pip install whylogs[viz,image]

In [None]:
%pip install Pillow==9.0.0

In [None]:
# Collecting files and resources
!curl https://whylabs-public.s3.us-west-2.amazonaws.com/workshops/dcai_workshop_resources.zip -o resources.zip
!unzip resources.zip

In [None]:
# Importing needed packages for all sections
import glob
import pandas as pd
import seaborn as sns
import whylogs as why

import helpers

from sklearn.ensemble import (
    RandomForestRegressor,
    AdaBoostRegressor,
    ExtraTreesRegressor
)
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error
)
from whylogs.viz import NotebookProfileVisualizer
from whylogs.extras.image_metric import log_image

If you haven't gotten any errors, then we're all set! Let's minimize **Section 0** to make the other sections easier to find.

# Section 1. Model-centric vs data-centric AI

We will demonstrate that cleaning and iterating on a dataset with a single simple model can be just as effective as iterating on model architectures, if not more so.

The [Ames housing dataset](http://jse.amstat.org/v19n3/decock/AmesHousing.txt) (De Cock 2011) provides tax assessor information and sale price for properties sold in Ames, Iowa between 2006 and 2010. The dataset comes with great [data documentation](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt) for understanding the features -- statisticians seem to be much more thorough.

In [None]:
ames_data = pd.read_csv("datasets/original_ames.csv")

In [None]:
ames_data.shape

(844, 80)

In [None]:
helpers.peek_at_dataframe(ames_data)

MS SubClass (int64):
  [20, 60, 50], 16 total
MS Zoning (object):
  ['RL', 'RM', 'FV'], 5 total
Lot Frontage (float64):
  [60.0, 80.0, 70.0], 97 total
Lot Area (int64):
  [7200, 9600, 9000], 677 total
Street (object):
  ['Pave', 'Grvl'], 2 total
Alley (object):
  ['Grvl', 'Pave'], 2 total
Lot Shape (object):
  ['Reg', 'IR1', 'IR2'], 4 total
Land Contour (object):
  ['Lvl', 'HLS', 'Bnk'], 4 total
Utilities (object):
  ['AllPub'], 1 total
Lot Config (object):
  ['Inside', 'Corner', 'CulDSac'], 5 total
Land Slope (object):
  ['Gtl', 'Mod', 'Sev'], 3 total
Neighborhood (object):
  ['NAmes', 'CollgCr', 'OldTown'], 27 total
Condition 1 (object):
  ['Norm', 'Feedr', 'Artery'], 9 total
Condition 2 (object):
  ['Norm', 'Feedr', 'PosA'], 5 total
Bldg Type (object):
  ['1Fam', 'TwnhsE', 'Duplex'], 5 total
House Style (object):
  ['1Story', '2Story', '1.5Fin'], 7 total
Overall Qual (int64):
  [5, 6, 7], 10 total
Overall Cond (int64):
  [5, 6, 7], 9 total
Year Built (int64):
  [2006, 2005, 2007], 1

In [None]:
ames_data.head()

Unnamed: 0,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,30,RL,56.0,4130,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,7,2008,WD,Normal,52000
1,60,RL,,16545,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,5,2009,WD,Normal,340000
2,60,RL,,12388,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,8,2009,WD,Normal,249000
3,30,RL,52.0,9022,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,5,2009,WD,Normal,109500
4,70,RL,54.0,9399,Pave,,Reg,Bnk,AllPub,Inside,...,0,,,,0,9,2006,WD,Abnorml,167000


## Model-centric approach
As is typical of model-centric AI, we will train several different ML models on the same dataset.

In [None]:
# Load test dataset
test_ames_data = pd.read_csv("datasets/test_ames.csv")

In [None]:
test_ames_data.head()

Unnamed: 0,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,50,RL,51.0,6191,Pave,,Reg,Lvl,AllPub,Corner,...,0,,,,0,11,2006,WD,Normal,112000
1,20,RL,,10659,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,1,2006,COD,Normal,136500
2,20,RL,39.0,16300,Pave,,IR1,Lvl,AllPub,CulDSac,...,0,,MnPrv,,0,1,2007,WD,Normal,130000
3,20,RL,75.0,10650,Pave,,Reg,Lvl,AllPub,Corner,...,0,,MnPrv,,0,2,2010,WD,Normal,128200
4,160,RM,21.0,1680,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,7,2009,WD,Normal,97000


In [None]:
# References to models
models = {
    "RF": RandomForestRegressor(),
    "AdaBoost": AdaBoostRegressor(),
    "ExtraTrees": ExtraTreesRegressor()
}

In [None]:
# Run model prediction and evaluation for all models
X_train, X_test, y_train, y_test = helpers.prepare_for_model_training(
    train_df=ames_data,
    test_df=test_ames_data,
    target_column="SalePrice",
)

for model_name, model in models.items():
    print(model_name)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print("  MAE:", mean_absolute_error(y_test, y_pred))
    print("  MAPE:", mean_absolute_percentage_error(y_test, y_pred), "\n")

RF
  MAE: 13790.585328083991
  MAPE: 0.07967679651884382 

AdaBoost
  MAE: 18583.88709129945
  MAPE: 0.11452043816943304 

ExtraTrees
  MAE: 14559.632388451442
  MAPE: 0.08156706193755117 
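
To make the two metrics above concrete, here is a minimal plain-Python sketch of what MAE and MAPE measure. The function names and the sale prices are made up for illustration; scikit-learn's own implementations handle more edge cases (for example, actual values of zero).

```python
def mean_absolute_error_py(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error_py(y_true, y_pred):
    """Average absolute error expressed as a fraction of the actual value."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical sale prices, for illustration only
actual = [100_000, 200_000, 400_000]
predicted = [110_000, 190_000, 380_000]

mae = mean_absolute_error_py(actual, predicted)
mape = mean_absolute_percentage_error_py(actual, predicted)
print(f"MAE:  {mae:.2f}")
print(f"MAPE: {mape:.4f}")
```

MAE is in the units of the target (dollars here), while MAPE is a fraction, so the Random Forest MAPE of roughly 0.08 above means predictions were off by about 8% of the sale price on average.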



Great! Let's compare our results with the data-centric approach.

## Data-centric approach
Let's now work with three prepared variations of the Ames housing dataset, in order of increasing quality: **`ames_data`** (loaded above), **`improved_ames_data`**, and **`cleanest_ames_data`**.

We will investigate exactly how this dataset was iterated on and improved later in Section 2.

In [None]:
# Load and reference dataset variations
improved_ames_data = pd.read_csv("datasets/improved_ames.csv")
cleanest_ames_data = pd.read_csv("datasets/best_ames.csv")

datasets = {
    "Ames data": ames_data,
    "Improved data": improved_ames_data,
    "Cleanest data": cleanest_ames_data
}

In [None]:
# Run model prediction and evaluation for all dataset variations
model = RandomForestRegressor()

for dataset_name, dataset in datasets.items():
    print(dataset_name)

    X_train, X_test, y_train, y_test = helpers.prepare_for_model_training(
        train_df=dataset,
        test_df=test_ames_data,
        target_column="SalePrice",
    )

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print("  MAE:", mean_absolute_error(y_test, y_pred))
    print("  MAPE:", mean_absolute_percentage_error(y_test, y_pred), "\n")

Ames data
  MAE: 14273.317427821521
  MAPE: 0.08161069974835317 

Improved data
  MAE: 13612.328110236222
  MAPE: 0.07785682312099437 

Cleanest data
  MAE: 12828.048792650918
  MAPE: 0.07428364942784614 



You will likely see that the improved and cleanest datasets yield better results with the default Random Forest model than any of the models trained on the original data above.

(However, this may not be the case due to the randomized nature of model fitting.)
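
The outputs printed above already show this randomness: the default Random Forest was trained on the original data twice and scored differently each time. As a rough sketch, we can compare that run-to-run gap with the gains from the cleaner datasets. The numbers are copied from the outputs above; the variable names are ours, and we assume `helpers.prepare_for_model_training` itself is deterministic.

```python
# MAE values copied from the outputs printed earlier in this notebook
rf_original_run_1 = 13790.585328083991  # model-centric loop, "RF"
rf_original_run_2 = 14273.317427821521  # data-centric loop, "Ames data"
rf_improved = 13612.328110236222        # "Improved data"
rf_cleanest = 12828.048792650918        # "Cleanest data"

# Same model type and same training data, so this gap reflects run-to-run noise
run_to_run_gap = abs(rf_original_run_2 - rf_original_run_1)

# Gains measured against the run from the same loop
gain_improved = rf_original_run_2 - rf_improved
gain_cleanest = rf_original_run_2 - rf_cleanest

print(f"Run-to-run gap on original data: {run_to_run_gap:,.0f}")
print(f"Gain from improved data:         {gain_improved:,.0f}")
print(f"Gain from cleanest data:         {gain_cleanest:,.0f}")
```

The gain from the improved data is of the same order as the noise, while the gain from the cleanest data is clearly larger. Passing a fixed `random_state` to `RandomForestRegressor`, or averaging over several seeds, would make this comparison more reliable.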

# Section 2. Characterizing datasets and quality

We haven't yet tried to understand the features of the Ames dataset, which is an important thing to do *before* training a model. Let's do so now.

We will demonstrate several data quality issues found in the datasets by reverse engineering them -- exploring the variations of the Ames housing dataset introduced in Section 1.

## Profiling datasets and visualizing distribution drift

We'll start by profiling our datasets using whylogs. Profiles can be passed into a visualizer for informative charts about the data. We'll then use it to conveniently display the distribution drift.

In [None]:
why

In [None]:
# Log dataset profiles for original dataset and variations
ames_profile = why.log(ames_data).profile()
improved_ames_profile = why.log(improved_ames_data).profile()
cleanest_ames_profile = why.log(cleanest_ames_data).profile()

# Create visualizer and generate drift report
viz = NotebookProfileVisualizer()
viz.set_profiles(target_profile_view=improved_ames_profile.view(),
                 reference_profile_view=ames_profile.view())
viz.summary_drift_report()

⚠️ No session found. Call whylogs.init() to initialize a session and authenticate. See https://docs.whylabs.ai/docs/whylabs-whylogs-init for more information.


In [None]:
# Create visualizer and generate drift report
viz = NotebookProfileVisualizer()
viz.set_profiles(target_profile_view=cleanest_ames_profile.view(),
                 reference_profile_view=improved_ames_profile.view())
viz.summary_drift_report()
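
To build some intuition for what a drift score captures, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic in plain Python. This is not how whylogs computes drift (it works from profile sketches rather than raw values, and as we understand it applies statistical tests of this kind to numeric columns); the function name and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in a + b:
        # Fraction of each sample that is less than or equal to x
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical living areas in square feet, for illustration only
reference = [900, 1100, 1200, 1500, 1800, 2100]
similar = [950, 1050, 1250, 1450, 1850, 2050]
shifted = [1900, 2200, 2400, 2600, 3000, 3500]

print("similar:", round(ks_statistic(reference, similar), 3))
print("shifted:", round(ks_statistic(reference, shifted), 3))
```

A small statistic means the two samples look like they come from the same distribution; a value close to 1 means they barely overlap.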

We can also show distribution comparisons for single features. This helps when investigating high-drift features such as `Gr Liv Area`, the above-grade (ground) living area in square feet.

In [None]:
viz.distribution_chart(feature_name="Sale Condition")

In [None]:
viz.distribution_chart(feature_name="MS Zoning")

In [None]:
viz.distribution_chart(feature_name="Sale Condition")

Try to determine what other features may be causing the differences between each variant compared to the original Ames dataset using whylogs and other data tools of your choosing.

# Section 3. Production scale and complexity

Let's explore how we might profile data at production scale in more complex environments, such as streaming and distributed data.

In [None]:
# Load large data profiles
production_profile_views = []
for file_path in glob.glob("production/*.bin"):
    production_profile_views.append(why.read(file_path).view())

Let's look more closely at one. Notice how many rows are recorded in the `counts/n` column!

In [None]:
# Inspecting a profile view
selected_profile_view = production_profile_views[-1]
selected_profile_view.to_pandas()

Now, let's talk about merging whylogs profiles. The mergeability of whylogs profiles is critical for many production use cases where distributed computing and large dataset sizes make it difficult to work with raw data or precalculated metrics.

In whylogs, merging is as easy as calling `merge` on two profile views.

In [None]:
merged_profile_view = production_profile_views[0].merge(production_profile_views[-1])
merged_profile_view.to_pandas()
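
To see why mergeability matters, consider this toy sketch of a mergeable summary. It is not the whylogs implementation; `ToySummary` and `summarize` are hypothetical names, and the values are the sale prices from the `test_ames_data` preview above, split across two imaginary workers.

```python
from dataclasses import dataclass


@dataclass
class ToySummary:
    """A tiny mergeable summary: count, sum, min, and max of a numeric column."""
    n: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def track(self, value):
        self.n += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other):
        """Combine two summaries without revisiting any raw values."""
        return ToySummary(
            n=self.n + other.n,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        return self.total / self.n if self.n else float("nan")


def summarize(values):
    summary = ToySummary()
    for v in values:
        summary.track(v)
    return summary


# Each worker only ever sees its own share of the data
worker_a = [112000, 136500, 130000]
worker_b = [128200, 97000]

merged = summarize(worker_a).merge(summarize(worker_b))
whole = summarize(worker_a + worker_b)
print(merged, "mean:", merged.mean)
print("Same as profiling everything at once:", merged == whole)
```

Because the merged summary matches the summary of the full data, each worker can profile its own partition and ship only the small summary, never the raw rows.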

The alternative to saving many separate profiles and merging them is using a *rolling logger*, which can combine profiles in a streaming fashion.

In [None]:
# Rolling loggers
prod_logger = why.logger(
    mode = "rolling",
    interval = 1,
    when = "H")

# Then, use it like this:
# prod_logger.log()
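
The idea behind a rolling logger can be sketched in a few lines of plain Python. This toy version only counts rows per column in hourly windows and takes explicit timestamps so that it is deterministic; `ToyRollingCounter` is a hypothetical name, and the real whylogs logger keeps full profiles and writes them out on its own schedule.

```python
from collections import Counter
from datetime import datetime


class ToyRollingCounter:
    """Counts logged rows per hourly window and closes a window when time moves on."""

    def __init__(self):
        self.current_window = None
        self.current_counts = Counter()
        self.closed_windows = []  # list of (window_start, counts) pairs

    def log(self, row, timestamp):
        window = timestamp.replace(minute=0, second=0, microsecond=0)
        if self.current_window is not None and window != self.current_window:
            self.flush()
        self.current_window = window
        # Fold the row into the running summary instead of storing it
        self.current_counts.update(row.keys())

    def flush(self):
        if self.current_window is not None:
            self.closed_windows.append(
                (self.current_window, dict(self.current_counts))
            )
        self.current_window = None
        self.current_counts = Counter()


# Hypothetical events, for illustration only
logger = ToyRollingCounter()
logger.log({"SalePrice": 112000}, datetime(2024, 1, 1, 9, 5))
logger.log({"SalePrice": 136500}, datetime(2024, 1, 1, 9, 40))
logger.log({"SalePrice": 130000}, datetime(2024, 1, 1, 10, 2))
logger.flush()

for window_start, counts in logger.closed_windows:
    print(window_start.isoformat(), counts)
```

Each closed window holds only a summary, so memory use stays flat no matter how many rows stream through.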

Unlike simple profiling tools that record only single numbers as telemetry, more advanced tools such as whylogs can store richer data structures.

For example, let's take a look at the **distribution** and **frequent_items** metrics for the `Mo Sold` feature.

In [None]:
merged_profile_view.get_column("Mo Sold").get_metric("frequent_items")

In [None]:
kll_sketch = merged_profile_view.get_column("Mo Sold").get_metric("distribution").kll.value
kll_sketch.get_quantiles([0, 0.1, 0.543, 0.9998])
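
For comparison, here is what these two metrics answer when computed exactly on raw values with the standard library. The sketches stored in a profile give approximate answers to the same questions without keeping the raw data. The month values and the `exact_quantile` helper are hypothetical.

```python
from collections import Counter

# Hypothetical "Mo Sold" values (month of sale), for illustration only
months_sold = [7, 5, 8, 5, 9, 6, 6, 7, 6, 6, 6, 7]

# Frequent items: the most common values and how often they occur
frequent_items = Counter(months_sold).most_common(3)


def exact_quantile(values, q):
    """Nearest-rank quantile on the sorted values, for q between 0 and 1."""
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


# Distribution: minimum, median, and maximum
quantiles = [exact_quantile(months_sold, q) for q in (0, 0.5, 1)]

print("Frequent items:", frequent_items)
print("Quantiles (0, 0.5, 1):", quantiles)
```

The exact approach needs every value in memory and cannot be merged across machines cheaply, which is the gap the sketch-based metrics are designed to fill.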

While our examples so far have used batches of data, whylogs was built from the ground up to handle streaming datasets just as easily.

See our documentation on [rolling loggers](https://github.com/whylabs/whylogs/blob/mainline/python/examples/advanced/Log_Rotation_for_Streaming_Data/Streaming_Data_with_Log_Rotation.ipynb) for streaming use cases as well as examples using [Fugue](https://github.com/whylabs/whylogs/blob/mainline/python/examples/integrations/Fugue_Profiling.ipynb), [Dask](https://github.com/whylabs/whylogs/blob/mainline/python/examples/integrations/Dask_Profiling.ipynb), and [Kafka](https://github.com/whylabs/whylogs/tree/mainline/python/examples/integrations/kafka-example).

# Section 4. Production DCAI principles

We will see how some of these DCAI principles become more nuanced in a production setting with LLMs that we want to actively monitor.

In [None]:
from langkit import llm_metrics

In [None]:
import openai

In [None]:
active_llm_logger = why.logger()

In [None]:
def user_request():
    # Take request
    request = input("\nEnter your desired item to make a recipe "
                    "(or 'quit'): ")
    if request.lower() == "quit":
        raise KeyboardInterrupt()

    # Log request
    active_llm_logger.log({"request": request})

    return request

In [None]:
def prompt_llm(request):
    # Transform prompt
    prompt = (
        "Please give me a short recipe for creating "
        "the following item in up to 6 steps. Each step of the recipe "
        "should be summarized in no more than 200 characters.\n"
        f"Item: {request}"
    )

    # Log prompt
    active_llm_logger.log({"prompt": prompt})

    # Collect response from LLM (ChatCompletion is the pre-1.0 openai package interface)
    response = openai.ChatCompletion.create(
        model = "gpt-3.5-turbo",
        messages = [{
            "role": "system",
            "content": prompt
        }]
    )["choices"][0]["message"]["content"]

    # Log response
    active_llm_logger.log({"response": response})

    return response

In [None]:
def user_reply_success(request,response):
    # Create and print user reply
    reply = f"\nSuccess! Here is the recipe for "\
            f"{request}:\n{response}"
    print(reply)

    #Log reply
    active_llm_logger.log({"reply": reply})

In [None]:
def user_reply_failure(request = "your request"):
    # Create and print user reply
    reply = ("\nUnfortunately, we are not able to provide a recipe for " \
            f"{request} at this time. Please try Recipe Creator 900 " \
            f"in the future.")
    print(reply)

    #Log reply
    active_llm_logger.log({"reply": reply})

In [None]:
class LLMApplicationValidationError(ValueError):
    pass

In [None]:
while True:
    try:
        request = user_request()
        response = prompt_llm(request)
        user_reply_success(request, response)
    except KeyboardInterrupt:
        break
    except LLMApplicationValidationError:
        user_reply_failure(request)
        break

But in production, ML models are often just one part of a larger system. Let's take a data-centric approach to such a system, using the data we log to validate prompts and responses.

In [None]:
from whylogs.core.relations import Predicate
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.validators import ConditionValidator

In [None]:
def raise_error(validator_name, condition_name, value):
    raise LLMApplicationValidationError(
        f"Failed {validator_name} with value {value}."
    )

In [None]:
low_condition = {"<0.3": Condition(Predicate().less_than(0.3))}

In [None]:
toxicity_validator = ConditionValidator(
    name = "Toxic",
    conditions = low_condition,
    actions = [raise_error]
)

In [None]:
refusal_validator = ConditionValidator(
    name = "Refusal",
    conditions = low_condition,
    actions = [raise_error]
)

In [None]:
llm_validators = {
    "prompt.toxicity": [toxicity_validator],
    "response.refusal_similarity": [refusal_validator]
}
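
The validators above pair a named condition with actions to run when a value fails it. The pattern can be sketched in plain Python as follows. `ToyValidator`, `ToyValidationError`, and the scores are hypothetical stand-ins, not the whylogs classes.

```python
class ToyValidationError(ValueError):
    pass


class ToyValidator:
    """Checks each value against named conditions and runs actions on failure."""

    def __init__(self, name, conditions, actions):
        self.name = name
        self.conditions = conditions  # condition name -> function(value) -> bool
        self.actions = actions        # callables taking (validator, condition, value)

    def validate(self, value):
        for condition_name, passes in self.conditions.items():
            if not passes(value):
                for action in self.actions:
                    action(self.name, condition_name, value)


def raise_toy_error(validator_name, condition_name, value):
    raise ToyValidationError(
        f"Failed {validator_name} ({condition_name}) with value {value}."
    )


toxicity_check = ToyValidator(
    name="Toxic",
    conditions={"<0.3": lambda score: score < 0.3},
    actions=[raise_toy_error],
)

# Hypothetical toxicity scores; real scores would come from a metric library
for score in (0.05, 0.82):
    try:
        toxicity_check.validate(score)
        print(f"score={score}: passed")
    except ToyValidationError as err:
        print(f"score={score}: {err}")
```

Raising an exception from the action is what lets the application loop catch the failure and send the user a fallback reply instead of the LLM response.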

In [None]:
from whylogs.experimental.core.udf_schema import udf_schema

active_llm_logger = why.logger(
    mode = "rolling",
    interval = 5,
    when = "M",
    base_name = "active_llm",
    schema = udf_schema(validators = llm_validators)
)

In [None]:
active_llm_logger.log(
    {"response":"I'm sorry, but I can't answer that."}
)

In [None]:
while True:
    try:
        request = user_request()
        response = prompt_llm(request)
        user_reply_success(request, response)
    except KeyboardInterrupt:
        break
    except LLMApplicationValidationError:
        user_reply_failure(request)
        break

# Section 5. Characterizing image data

Characterizing and improving the quality of unstructured data like images and text can look quite different from doing so for tabular data. Unstructured data is often very high-dimensional and complex, which gives human labeling a more important role.

But several efforts go in the opposite direction, using simple rules, metrics, and labeling functions to create higher quality data systems. We'll look at simple metrics that are valuable for image data.

## Viewing and logging example images

For our example, we take just four images of a single home listing from the [Ahmed & Moustafa (2016) dataset](https://github.com/emanhamed/Houses-dataset).

In [None]:
%pip install --upgrade pillow

In [None]:
from PIL import Image

In [None]:
# Re-importing in case we restart runtime (needed for Google Colab)
import helpers
import whylogs as why
from whylogs.viz import NotebookProfileVisualizer
from whylogs.extras.image_metric import log_image

In [None]:
frontal_img = Image.open("images/99_frontal.jpg")
frontal_img

In [None]:
bedroom_img = Image.open("images/99_bedroom.jpg")
bedroom_img

In [None]:
bathroom_img = Image.open("images/99_bathroom.jpg")
bathroom_img

In [None]:
kitchen_img = Image.open("images/99_kitchen.jpg")
kitchen_img

In [None]:
# Logging several related images
image_profile = log_image({
    "frontal": frontal_img,
    "bedroom": bedroom_img,
    "bathroom": bathroom_img,
    "kitchen": kitchen_img
})

## Simple metrics for images

### Brainstorm

What simple metrics could we use to distinguish between these images?

### Done? See whylogs defaults

After brainstorming, run the function to see a list of what whylogs collects by default.

In [None]:
helpers.whylogs_image_metrics_text()
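
As a rough sketch of how simple image metrics of this kind can be computed, the following uses only the standard library on tiny made-up images. The function name, the metric names, and the pixel values are hypothetical, and the exact set of metrics whylogs collects is whatever the cell above lists.

```python
import colorsys
from statistics import mean, pstdev


def simple_image_metrics(pixels, width, height):
    """Summarize an image given as a flat list of (R, G, B) tuples in 0-255."""
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    hues, saturations, brightnesses = zip(*hsv)
    return {
        "width": width,
        "height": height,
        "brightness_mean": mean(brightnesses),
        "brightness_stddev": pstdev(brightnesses),
        "saturation_mean": mean(saturations),
        "hue_mean": mean(hues),  # naive average; hue is really a circular value
    }


# Two hypothetical 2x2 images: a bright, washed-out one and a dark, colorful one
bright_image = [(250, 250, 250), (240, 240, 240), (255, 255, 255), (245, 245, 245)]
dark_image = [(40, 10, 10), (10, 40, 10), (10, 10, 40), (30, 30, 5)]

for name, pixels in {"bright": bright_image, "dark": dark_image}.items():
    metrics = simple_image_metrics(pixels, width=2, height=2)
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

With Pillow, a pixel list like this can be obtained from a real image via `list(img.convert("RGB").getdata())`. Even these few numbers can separate, say, a sunlit exterior from a dim interior shot.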

## Comparing images of different types

Let's compare frontal, bedroom, bathroom, and kitchen image types. I have profiled all images of each type in the full dataset. Any drift we see between two profiles represents a difference between the two categories in the dataset, not just between individual images.



In [None]:
all_frontal_images_profile = why.read("images/all_frontal_images.bin")
all_bedroom_images_profile = why.read("images/all_bedroom_images.bin")
all_bathroom_images_profile = why.read("images/all_bathroom_images.bin")
all_kitchen_images_profile = why.read("images/all_kitchen_images.bin")

In [None]:
viz = NotebookProfileVisualizer()
viz.set_profiles(target_profile_view=all_frontal_images_profile.view(),
                 reference_profile_view=all_kitchen_images_profile.view())
viz.summary_drift_report()

## Deeper dive on image embeddings

See the following Google Colab notebook for an exploration of image analysis on image embeddings for MNIST.

https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/experimental/embeddings/Embeddings_Distance_Logging.ipynb