>### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> 
>*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Segments)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Segments) to leverage the power of whylogs and WhyLabs together!*

# Intro to Segmentation

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/advanced/Segments.ipynb)

Sometimes, certain subgroups of data can behave very differently from the overall dataset. When monitoring the health of a dataset, it’s often helpful to have visibility at the sub-group level to better understand how these subgroups are contributing to trends in the overall dataset. whylogs supports data segmentation for this purpose.

Data segmentation is done at the point of profiling a dataset.

Segmentation can be done by a single feature or by multiple features simultaneously. For example, you could have one profile per value of a Gender column ("M" or "F"), or one profile per combination of, say, Gender and City Code. You can also filter the segments to keep only the partitions you are interested in, for example Gender "M" with age above 18.

In this example, we will show you a number of ways you can segment your data, and also how you can write these profiles to different locations.
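To make the three ideas above concrete (single-column segments, multi-column segments, and filtered segments), here is a minimal pure-Python sketch. It does not use whylogs and is not how whylogs implements segmentation; the toy rows, the column names, and the helper `segment_rows` are assumptions made up for illustration.

```python
from collections import defaultdict

# Toy rows standing in for a dataset; the column names are illustrative only.
rows = [
    {"gender": "M", "city_code": "A", "age": 17},
    {"gender": "M", "city_code": "B", "age": 34},
    {"gender": "F", "city_code": "A", "age": 29},
    {"gender": "F", "city_code": "A", "age": 41},
]


def segment_rows(rows, columns, keep=None):
    """Group rows by the tuple of values found in `columns`.

    `keep` is an optional predicate; rows for which it returns False are
    dropped before grouping, which mimics filtering segments.
    """
    segments = defaultdict(list)
    for row in rows:
        if keep is not None and not keep(row):
            continue
        key = tuple(row[col] for col in columns)
        segments[key].append(row)
    return dict(segments)


# One segment per distinct value of a single column (keys are 1-tuples).
by_gender = segment_rows(rows, ["gender"])

# One segment per distinct combination of two columns (keys are 2-tuples).
by_gender_city = segment_rows(rows, ["gender", "city_code"])

# Filtered segmentation: only rows with gender "M" and age above 18.
adult_men = segment_rows(
    rows, ["gender"], keep=lambda r: r["gender"] == "M" and r["age"] > 18
)

print(sorted(by_gender))
print(sorted(by_gender_city))
print({key: len(group) for key, group in adult_men.items()})
```

Note that every segment key is a tuple, even when segmenting on one column. The whylogs segment keys shown later in this example, such as `('Baby Care',)`, have the same shape.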

## Table of Contents

1. Segmenting on a single column
2. Adding performance metrics per segment
3. Sending segmented results with performance metrics to WhyLabs

## Installing whylogs

If you don't have it installed already, install whylogs:

In [2]:
%pip install 'whylogs[whylabs]' -q


Note: you may need to restart the kernel to use updated packages.


## Getting the Data & Defining the Segments

Let's first download the data we'll be working with.

This dataset contains transaction information for an online grocery store, such as:

- product description
- category
- user rating
- market price
- number of items sold last week

In [30]:
import pandas as pd

url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/Ecommerce/baseline_dataset_base.csv"
df = pd.read_csv(url)[["date","product","category", "rating", "market_price","sales_last_week"]]
df['rating'] = df['rating'].astype(int)
# Simulate predictions with a crude model: it is correct for ratings up to 4
# but never predicts 5 (ratings of 5 are a relatively small portion of the data)
df['predicted_rating'] = df['rating'].apply(lambda x: x if x < 4 else 4)


df.head()

| | date | product | category | rating | market_price | sales_last_week | predicted_rating |
|---|---|---|---|---|---|---|---|
| 0 | 2022-08-09 00:00:00+00:00 | Wood - Centre Filled Bar Infused With Dark Mou... | Snacks and Branded Foods | 4 | 350.0 | 1 | 4 |
| 1 | 2022-08-09 00:00:00+00:00 | Toasted Almonds | Gourmet and World Food | 3 | 399.0 | 1 | 3 |
| 2 | 2022-08-09 00:00:00+00:00 | Instant Thai Noodles - Hot & Spicy Tomyum | Gourmet and World Food | 3 | 95.0 | 1 | 3 |
| 3 | 2022-08-09 00:00:00+00:00 | Thokku - Vathakozhambu | Snacks and Branded Foods | 4 | 336.0 | 1 | 4 |
| 4 | 2022-08-09 00:00:00+00:00 | Beetroot Powder | Gourmet and World Food | 3 | 150.0 | 1 | 3 |


<a class="anchor" id="single"></a>

## Segmenting on a Single Column

It looks like the `category` feature is a good one to segment on. Let's see how many categories there are for the complete dataset:

In [31]:
df['category'].value_counts()

Beauty and Hygiene            9793
Gourmet and World Food        6201
Kitchen, Garden and Pets      4493
Snacks and Branded Foods      3826
Cleaning and Household        3446
Foodgrains, Oil and Masala    3059
Beverages                     1034
Bakery, Cakes and Dairy        979
Fruits and Vegetables          749
Baby Care                      707
Eggs, Meat and Fish            456
Name: category, dtype: int64

There are 11 categories.

We might be interested in having access to metrics specific to each category, so let's generate segmented profiles for each category.

In [32]:
from whylogs.core.segmentation_partition import segment_on_column

column_segments = segment_on_column("category")

In [33]:
column_segments

{'category': SegmentationPartition(name='category', mapper=ColumnMapperFunction(col_names=['category'], map=None, field_projector=<whylogs.core.projectors.FieldProjector object at 0x7f5131e3dac0>, id='31aee7544d31ada00c3bb3d94ca2e0595c42a1f21c266da65e132168914ed61fe8b1b8c99aaa1a0c5cf5e2dfbd621aa3f9700bef1f6e85f4de4ca6364f149592'), id='8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8', filter=None)}

`column_segments` is a dictionary that maps each partition name to a `SegmentationPartition`, which holds information such as an id, a mapper, and an optional filter. For the moment, all we're interested in is that we can pass it to our `DatasetSchema` in order to generate segmented profiles while logging:

In [34]:
import whylogs as why
from whylogs.core.schema import DatasetSchema

results = why.log(df, schema=DatasetSchema(segments=column_segments))

Since we had 11 different categories, we can expect the `results` to have 11 segments. Let's make sure that is the case:

In [35]:
print(f"After profiling the result set has: {results.count} segments")

After profiling the result set has: 11 segments


Great.

Results can have multiple partitions, and each partition can have multiple segments. Segments within a partition are non-overlapping. Segments across partitions, however, might overlap.

In this example, we have only one partition with 11 non-overlapping segments. Let's fetch the available segments and visualize the metrics for the first one:

In [36]:
first_segment = results.segments()[0]
segmented_profile = results.profile(first_segment)

print("Profile view for segment {}".format(first_segment.key))
segmented_profile.view().to_pandas()

Profile view for segment ('Baby Care',)


| column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/inf | counts/n | counts/nan | counts/null | distribution/max | distribution/mean | distribution/median | ... | distribution/stddev | frequent_items/frequent_strings | type | types/boolean | types/fractional | types/integral | types/object | types/string | ints/max | ints/min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| category | 1.0 | 1.0 | 1.00005 | 0 | 707 | 0 | 0 | | 0.0 | | ... | 0.0 | [FrequentItem(value='Baby Care', est=707, uppe... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 707 | | |
| date | 8.0 | 8.0 | 8.0004 | 0 | 707 | 0 | 0 | | 0.0 | | ... | 0.0 | [FrequentItem(value='2022-08-15 00:00:00+00:00... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 707 | | |
| market_price | 57.000008 | 57.0 | 57.002854 | 0 | 707 | 0 | 0 | 2799.0 | 621.190948 | 299.0 | ... | 713.745256 | | SummaryType.COLUMN | 0 | 707 | 0 | 0 | 0 | | |
| predicted_rating | 2.0 | 2.0 | 2.0001 | 0 | 707 | 0 | 0 | 4.0 | 3.770863 | 4.0 | ... | 0.420575 | [FrequentItem(value='4', est=545, upper=545, l... | SummaryType.COLUMN | 0 | 0 | 707 | 0 | 0 | 4.0 | 3.0 |
| product | 69.000012 | 69.0 | 69.003457 | 0 | 707 | 0 | 0 | | 0.0 | | ... | 0.0 | [FrequentItem(value='Baby Powder', est=21, upp... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 707 | | |
| rating | 3.0 | 3.0 | 3.00015 | 0 | 707 | 0 | 0 | 5.0 | 3.823197 | 4.0 | ... | 0.500566 | [FrequentItem(value='4', est=508, upper=508, l... | SummaryType.COLUMN | 0 | 0 | 707 | 0 | 0 | 5.0 | 3.0 |
| sales_last_week | 5.0 | 5.0 | 5.00025 | 0 | 707 | 0 | 0 | 6.0 | 1.391796 | 1.0 | ... | 1.003162 | [FrequentItem(value='1', est=557, upper=557, l... | SummaryType.COLUMN | 0 | 0 | 707 | 0 | 0 | 6.0 | 1.0 |


We can see that the first segment is for product transactions of the `Baby Care` category, and we have 707 rows for that particular segment.
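The distinction between partitions and segments can be illustrated without whylogs. The sketch below is only a conceptual model, not the library's internals: the toy rows and the helpers `build_partition` and `segments_are_disjoint` are hypothetical. It shows that segments inside one partition never share rows, while segments from different partitions can.

```python
# Toy rows; "id" stands in for a row index.
rows = [
    {"id": 0, "category": "Baby Care", "rating": 4},
    {"id": 1, "category": "Baby Care", "rating": 5},
    {"id": 2, "category": "Beverages", "rating": 4},
    {"id": 3, "category": "Beverages", "rating": 3},
]


def build_partition(rows, column):
    """Map each distinct value of `column` (as a 1-tuple key) to its row ids."""
    segments = {}
    for row in rows:
        segments.setdefault((row[column],), set()).add(row["id"])
    return segments


def segments_are_disjoint(segments):
    """Return True if no row id appears in more than one segment."""
    seen = set()
    for ids in segments.values():
        if seen & ids:
            return False
        seen |= ids
    return True


# Two partitions over the same rows: one by category, one by rating.
partitions = {
    "category": build_partition(rows, "category"),
    "rating": build_partition(rows, "rating"),
}

for name, segments in partitions.items():
    print(name, "disjoint within partition:", segments_are_disjoint(segments))

# Segments from different partitions can share rows.
overlap = partitions["category"][("Baby Care",)] & partitions["rating"][(4,)]
print("rows in both ('Baby Care',) and (4,):", sorted(overlap))
```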

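The same segment can also be reached through its partition. Let's list the segments in the first partition and check the first one again: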

In [40]:
partition = results.partitions[0]
segments = results.segments_in_partition(partition)

first_segment = next(iter(segments))
print(first_segment)
segmented_profile = results.profile(first_segment)

print("Profile view for segment {}".format(first_segment.key))
segmented_profile.view().to_pandas()

Segment(key=('Baby Care',), parent_id='8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8')
Profile view for segment ('Baby Care',)




## Adding Performance Metrics per Segment

In [48]:
from whylogs.core.model_performance_metrics.model_performance_metrics import (
    ModelPerformanceMetrics
)

# split the pandas data frame on the same column that segmented profiles used
grouped_pdf = df.groupby('category')

prediction_column = 'predicted_rating'
target_column = 'rating'
for segment in segments:
    # A segment's key is a tuple of column values. Since we segmented on a single column,
    # the first value in the tuple, e.g. 'Baby Care' from ('Baby Care',), matches the
    # corresponding pandas groupby key, so we use it to fetch that group.
    key = segment.key[0]
    segmented_pdf = grouped_pdf.get_group(key)
    perf = ModelPerformanceMetrics()
    perf.compute_regression_metrics(
        predictions=segmented_pdf[prediction_column].to_list(),
        targets=segmented_pdf[target_column].to_list(),
    )
    segmented_profile = results.profile(segment)
    segmented_profile.add_model_performance_metrics(perf)
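To give a feel for what per-segment regression metrics capture, the sketch below computes two common ones, mean absolute error (MAE) and root mean squared error (RMSE), by hand for each segment. This is an illustration only: the toy observations are made up, `regression_summary` is a hypothetical helper, and the choice of MAE and RMSE is an assumption, not a statement of which statistics `compute_regression_metrics` tracks internally.

```python
import math
from collections import defaultdict

# Toy (category, rating) pairs; the values are made up for illustration.
observations = [
    ("Baby Care", 3), ("Baby Care", 4), ("Baby Care", 5),
    ("Beverages", 4), ("Beverages", 5), ("Beverages", 5), ("Beverages", 2),
]


def predict(rating):
    """Crude rule from the data-loading cell: correct up to 4, never predicts 5."""
    return rating if rating < 4 else 4


# Collect prediction errors per segment, keyed by a 1-tuple like a segment key.
errors_by_segment = defaultdict(list)
for category, rating in observations:
    errors_by_segment[(category,)].append(predict(rating) - rating)


def regression_summary(errors):
    """Return the row count, MAE and RMSE for a list of prediction errors."""
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return {"count": n, "mae": mae, "rmse": rmse}


summaries = {key: regression_summary(errs) for key, errs in errors_by_segment.items()}
for key, summary in summaries.items():
    print(key, summary)
```

Because the crude model only errs on ratings of 5, segments with a larger share of 5-star ratings show a larger error, which is the kind of difference an overall metric would hide.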


## Sending Results with Segmented Profile Performance to WhyLabs

With the whylogs Writer, you can write your profiles to different locations. If you have a WhyLabs account, you can easily send your segmented profiles to be monitored in your dashboard.

We will show briefly how to do it in this example. If you want more details, please check the [WhyLabs Writer Example](../integrations/writers/Writing_to_WhyLabs.ipynb) (also available [in our documentation](https://whylogs.readthedocs.io/en/latest/examples/integrations/writers/Writing_to_WhyLabs.html)).

Provided you already have the required information and keys, let's first set our environment variables:

In [None]:
import getpass
import os

# set your org-id here - should be something like "org-xxxx"
print("Enter your WhyLabs Org ID") 
os.environ["WHYLABS_DEFAULT_ORG_ID"] = input()

# set your dataset_id (or model_id) here - should be something like "model-xxxx"
print("Enter your WhyLabs Dataset ID")
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = input()


# set your API key here
print("Enter your WhyLabs API key")
os.environ["WHYLABS_API_KEY"] = getpass.getpass()
print("Using API Key ID: ", os.environ["WHYLABS_API_KEY"][0:10])
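Before uploading, it can be handy to confirm that all three variables are actually set. The helper below is a hypothetical convenience, not part of whylogs; it only inspects the variable names used in the cell above and reports which ones are missing or blank.

```python
import os

# The three environment variables set in the cell above.
REQUIRED_VARS = (
    "WHYLABS_DEFAULT_ORG_ID",
    "WHYLABS_DEFAULT_DATASET_ID",
    "WHYLABS_API_KEY",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or blank in `env`.

    `env` defaults to os.environ; any mapping can be passed for testing.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with a plain dict: the dataset id is absent and the API key is blank.
example_env = {"WHYLABS_DEFAULT_ORG_ID": "org-xxxx", "WHYLABS_API_KEY": ""}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no argument checks the real environment; an empty list means all three values are present.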

Then, optionally set the dataset timestamp and call `writer("whylabs")` to upload the profiles:

In [46]:
import datetime
# Convenience method: if you need to upload historical data, you can set the
# timestamp of all segmented profiles at once through the result set.
results.set_dataset_timestamp(datetime.datetime.now(datetime.timezone.utc))
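When backfilling historical data, the timestamp would typically come from the data itself rather than from the current time. As a rough sketch, the hypothetical helper `to_utc_timestamp` below turns a string in the format of this dataset's `date` column into a timezone-aware UTC `datetime`. Treating naive values as UTC is an assumption of this sketch.

```python
from datetime import datetime, timezone


def to_utc_timestamp(raw):
    """Parse an ISO-8601 style string into a timezone-aware UTC datetime."""
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        # Assumption: values without an offset are already expressed in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


# Same format as the `date` column shown in the df.head() output.
batch_timestamp = to_utc_timestamp("2022-08-09 00:00:00+00:00")
print(batch_timestamp.isoformat())
```

A value like `batch_timestamp` could then be passed to `results.set_dataset_timestamp(...)` in place of the current time, assuming the result set was profiled from that day's rows only.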


In [47]:
results.writer("whylabs").write()