>### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> 
>*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Segments)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Segments) to leverage the power of whylogs and WhyLabs together!*

# Columnar Segmented Performance Metrics

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/advanced/Segments.ipynb)

## Installing whylogs

If you don't have it installed already, install whylogs:

In [None]:
%pip install 'whylogs[whylabs]' -q

## Getting the Data & Defining the Segments

Let's first download the data we'll be working with.

This dataset contains transaction information for an online grocery store, such as:

- product description
- category
- user rating
- market price
- number of items sold last week

In [None]:
import pandas as pd
import numpy as np
import random

DEMO_LENGTH_IN_DAYS = 8
ANOMALY_RATIO_RANDOM = 0.15
SKIP_PROFILING_SEGMENT_INPUTS = False
USE_CLASSIFICATION = True

url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/Ecommerce/baseline_dataset_base.csv"
df = pd.read_csv(url)[["date","product","category", "rating", "market_price","sales_last_week"]]
df['rating'] = df['rating'].astype(int)

# Here we simulate a crude model: the prediction is usually correct when the actual rating
# is 1-3, but it is a random guess when the actual rating is 4 or 5
def predict_rating(actual_rating):
    prediction = actual_rating
    if actual_rating < 4:
        if random.randint(1,100) > 98:
            prediction = actual_rating + 1
    else:
        prediction = random.randint(1,5)
    return prediction

df['predicted_rating'] = df['rating'].apply(predict_rating)
# This simulates an anomaly (a drop in model performance) in the most recent days of data.
# Batches with days_ago_count < error_starts_days_ago get extra error, so with a value of 3
# the 2 most recent days have degraded model performance.
error_starts_days_ago = 3


# now let's split the data into n batches randomly to mimic different days of data
n=DEMO_LENGTH_IN_DAYS
batches = np.array_split(df.sample(frac=1), n)

# here is the additional error we are injecting to create an anomaly: randomize the predicted
# rating 15% of the time if ANOMALY_RATIO_RANDOM = 0.15
def inject_more_error(rating):
    if random.random() > (1.0-ANOMALY_RATIO_RANDOM):
        return random.randint(1,5)
    return rating

# batches[0] is treated as the most recent day; use `batch` so we don't shadow the full `df`
days_ago_count = 0
for batch in batches:
    days_ago_count += 1
    if days_ago_count < error_starts_days_ago:
        batch['predicted_rating'] = batch['predicted_rating'].apply(inject_more_error)
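
Before uploading anything, it can help to confirm that the simulated predictor behaves as described. The following is a minimal standalone sketch, not part of the original notebook: it repeats the `predict_rating` rule with an explicit seeded random generator (the `rng` argument and the `accuracy_by_rating` helper are assumptions added for repeatability) and measures accuracy per actual rating.

```python
import random


def predict_rating(actual_rating, rng):
    """Same rule as the notebook's predictor, with an explicit RNG (an added assumption)."""
    if actual_rating < 4:
        # Ratings 1-3: off by one roughly 2% of the time, otherwise correct.
        if rng.randint(1, 100) > 98:
            return actual_rating + 1
        return actual_rating
    # Ratings 4-5: a uniform random guess between 1 and 5.
    return rng.randint(1, 5)


def accuracy_by_rating(samples_per_rating=10_000, seed=0):
    """Hypothetical helper: fraction of correct predictions for each actual rating."""
    rng = random.Random(seed)
    result = {}
    for rating in range(1, 6):
        hits = sum(
            predict_rating(rating, rng) == rating for _ in range(samples_per_rating)
        )
        result[rating] = hits / samples_per_rating
    return result


accuracy = accuracy_by_rating()
for rating, acc in accuracy.items():
    print(f"actual rating {rating}: accuracy {acc:.2f}")
```

Under this rule, accuracy should sit near 0.98 for ratings 1 to 3 and near 0.20 for ratings 4 and 5, which is the kind of difference that segmenting on `rating` is meant to surface.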

<a class="anchor" id="single"></a>

### Configure the API and env vars for upload to WhyLabs

In [None]:
# Set the environment variables that whylogs uses when uploading profiles to WhyLabs
import getpass
import os
import whylogs as why


# set your org-id here - should be something like "org-xxxx"
print("Enter your WhyLabs Org ID")
os.environ["WHYLABS_DEFAULT_ORG_ID"] = input()
# the model-id is stored as the default dataset ID
print("Enter the model-id")
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = input()
print("Enter your WhyLabs API Key")
os.environ["WHYLABS_API_KEY"] = getpass.getpass()

print("Using API Key ID: ", os.environ["WHYLABS_API_KEY"][0:10])
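
An empty answer at any of the prompts above only shows up later as a failed upload. As a hedged sketch, a small check like the one below can catch that earlier. `missing_whylabs_settings` is a hypothetical helper, not a whylogs function; it only looks at the three environment variables set in the cell above.

```python
import os

# The three variables set in the cell above.
REQUIRED_WHYLABS_VARS = (
    "WHYLABS_DEFAULT_ORG_ID",
    "WHYLABS_DEFAULT_DATASET_ID",
    "WHYLABS_API_KEY",
)


def missing_whylabs_settings(env=None):
    """Return the names of required variables that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_WHYLABS_VARS if not env.get(name, "").strip()]


missing = missing_whylabs_settings()
if missing:
    print("Missing WhyLabs settings:", ", ".join(missing))
else:
    print("All WhyLabs settings are present.")
```

This checks presence only; it does not verify that the API key is valid for the given org.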

## Segmenting on a Single Column

Let's say you have a few columns that you want to segment on, but you don't want to see the cartesian product of those columns. In this example we will choose the column `category` and then the column `rating` as interesting features to segment on, and calculate performance metrics for the segments of each column separately.
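
To make the difference concrete, here is a small standalone sketch that counts segments both ways on toy data. The rows and the two helper functions are illustrative assumptions, not whylogs APIs: segmenting per column grows with the sum of the distinct values, while combining columns can grow up to their product.

```python
from itertools import product

# Toy rows; the category names and ratings are made up for illustration.
rows = [
    {"category": "Baby care", "rating": 4},
    {"category": "Baby care", "rating": 5},
    {"category": "Beverages", "rating": 4},
    {"category": "Snacks", "rating": 3},
]
columns = ["category", "rating"]


def per_column_segments(rows, columns):
    """One group of segments per column, identified as (column, value) pairs."""
    return {(col, row[col]) for row in rows for col in columns}


def cartesian_segments(rows, columns):
    """Upper bound for combined segmentation: every combination of distinct values."""
    distinct = [sorted({row[col] for row in rows}) for col in columns]
    return set(product(*distinct))


print("per-column segments:", len(per_column_segments(rows, columns)))
print("cartesian upper bound:", len(cartesian_segments(rows, columns)))
```

With three distinct categories and three distinct ratings, the per-column approach yields 3 + 3 = 6 segments, while the cartesian product allows up to 3 × 3 = 9.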

In [None]:
import datetime
from typing import List
import whylogs as why
from whylogs.core.schema import DatasetSchema
from whylogs.core.segmentation_partition import segment_on_column
from whylogs.core.dataset_profile import DatasetProfile
from whylogs.core.model_performance_metrics.model_performance_metrics import ModelPerformanceMetrics

# helper to support going from string segment key to typed group values in pandas
def lookup_typed_key(grouped_pdf, key: str):
    typed_keys = set(grouped_pdf.indices.keys())
    for typed_key in typed_keys:
        if key == str(typed_key):
            return typed_key
    return key

#create a helper method to add performance metrics to segmented pandas dataframes
def add_performance_to_segments(grouped_pdf, prediction_column, target_column, results):
    partition = results.partitions[0]
    segments = results.segments_in_partition(partition)
    for segment in segments:
        # A segment's key is a tuple of column values. Because we segmented on a single column,
        # the first value in the tuple, e.g. 'Baby care' from ('Baby care',), corresponds to a
        # groupby key in pandas, so use it to fetch the matching pandas group.
        typed_key = lookup_typed_key(grouped_pdf, segment.key[0])

        segmented_pdf = grouped_pdf.get_group(typed_key)
        perf = ModelPerformanceMetrics()
        if USE_CLASSIFICATION:
            perf.compute_confusion_matrix(
                predictions=segmented_pdf[prediction_column].to_list(),
                targets=segmented_pdf[target_column].to_list(),
            )
        else:
            perf.compute_regression_metrics(
                predictions=segmented_pdf[prediction_column].to_list(),
                targets=segmented_pdf[target_column].to_list(),
            )

        if SKIP_PROFILING_SEGMENT_INPUTS:
            #create a blank profile for the segment so that we only have the model performance.
            segments[segment] = DatasetProfile()

        segmented_profile = results.profile(segment)
        segmented_profile.add_model_performance_metrics(perf)

# helper method to profile and upload segmented perf metrics for each column
# (non-cartesian product) for the last n days
def upload_columnar_segmented_performance_for_past_n_days(
    batches,
    segment_columns: List[str],
    prediction_column: str,
    target_column: str,
    n: int = 7):
    for column_name in segment_columns:
        print(f"** Profiling segmented performance metrics for column({column_name}) "
              f"using targets ({target_column}) and predictions ({prediction_column})")
        # split the pandas data frame on the same column that segmented profiles used

        for i in range(n):
            df = batches[i]
            grouped_pdf = df.groupby(column_name)
            dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)
            print(f"    About to log data for {i} days ago-> {dt}")
            results = why.log(df, schema=DatasetSchema(segments=segment_on_column(column_name)))
            print(f"    Segmented profile for {column_name} on {dt} has {results.count} segments")
            add_performance_to_segments(grouped_pdf, prediction_column, target_column, results)
            results.set_dataset_timestamp(dt)
            print(f"    Uploading profiles for {column_name} on {dt}, this may take a while...")
            results.writer("whylabs").write()
            print(f"    Done uploading profiles for {column_name} on {dt}")
    print("** Done uploading profiles!")
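
The `lookup_typed_key` helper exists because the segment key is a string while a pandas group key keeps the column's type, so a `rating` segment `"4"` has to be mapped back to the integer `4`. The sketch below shows the same matching rule on plain Python collections, so it runs without pandas. `match_typed_key` is a hypothetical stand-in that takes the keys directly instead of a grouped DataFrame.

```python
def match_typed_key(typed_keys, key: str):
    """Return the element of typed_keys whose string form equals key, else key itself."""
    for typed_key in typed_keys:
        if key == str(typed_key):
            return typed_key
    return key


# Stand-ins for the keys of grouped_pdf.indices after groupby("rating") / groupby("category").
rating_keys = {3, 4, 5}
category_keys = {"Baby care", "Beverages"}

print(repr(match_typed_key(rating_keys, "4")))            # integer key recovered
print(repr(match_typed_key(category_keys, "Baby care")))  # string keys pass through
print(repr(match_typed_key(rating_keys, "9")))            # no match: the string is returned
```

Note the fallback: when nothing matches, the original string is returned, so a later `get_group` call would raise a `KeyError` for that segment rather than silently skipping it.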

In [None]:
# set the column(s) you want to segment on here
# as well as the target and prediction columns
columns = ["category", "rating"]
prediction_column = 'predicted_rating'
target_column = 'rating'

upload_columnar_segmented_performance_for_past_n_days(
    batches, columns, prediction_column, target_column, n)
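
Each batch is uploaded with a dataset timestamp of "now minus `i` days", so batch 0 is today and the extra error injected earlier should appear in the most recent days. The following sketch recomputes that mapping without calling whylogs. `batch_timestamps` and `degraded_batch_indices` are hypothetical helpers, and the constants are copied from the cells above.

```python
import datetime

DEMO_LENGTH_IN_DAYS = 8     # copied from the notebook
error_starts_days_ago = 3   # copied from the notebook


def batch_timestamps(n, now=None):
    """Dataset timestamp for each batch index: batch i is dated i days before `now`."""
    now = now or datetime.datetime.now(tz=datetime.timezone.utc)
    return [now - datetime.timedelta(days=i) for i in range(n)]


def degraded_batch_indices(n, error_starts_days_ago):
    """Batch indices that received extra error (the loop counted batches from 1)."""
    return [i for i in range(n) if i + 1 < error_starts_days_ago]


fixed_now = datetime.datetime(2024, 1, 10, tzinfo=datetime.timezone.utc)  # arbitrary example date
timestamps = batch_timestamps(DEMO_LENGTH_IN_DAYS, now=fixed_now)
degraded = degraded_batch_indices(DEMO_LENGTH_IN_DAYS, error_starts_days_ago)

for i, ts in enumerate(timestamps):
    marker = "degraded" if i in degraded else "baseline"
    print(f"batch {i}: {ts.date()} ({marker})")
```

With the notebook's settings, batches 0 and 1 are the degraded ones, so the performance drop should be visible on the two most recent days in WhyLabs.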