# Summarize_target

This tutorial will guide you through the `summarease.summarize_target` module, which includes tools to summarize target variables and visualize class balance, making it easier to evaluate target variable characteristics.

## Getting Started

The `summarease.summarize_target` module provides the following core functionalities:

1. Summarizing Target Variables:
   - Handles both categorical and numerical target variables.
   - Outputs class proportions and threshold-based imbalance flags for categorical targets.
   - Provides basic statistical summaries for numerical targets.

2. Visualizing Target Balance:
   - Generates bar plots to visualize class proportions and expected balanced ranges for categorical targets.
   - Highlights imbalanced classes with different colors for easy identification.

## Necessary Libraries

To use the `summarease.summarize_target` module, ensure the following libraries are installed, then import them:

In [1]:
import pandas as pd
import altair as alt
from summarease.summarize_target import summarize_target_df, summarize_target_balance_plot

## Example Datasets

We'll use the following datasets to demonstrate the module's functionality:

#### Dataset 1: Categorical Target Dataset

In [2]:
categorical_data = pd.DataFrame({
    'target': ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'D']
})

#### Dataset 2: Numerical Target Dataset

In [3]:
numerical_data = pd.DataFrame({
    'target': [1.2, 2.3, 3.1, 4.8, 5.5, 6.7, 8.9, 10.1]
})

## How to Apply the `summarease.summarize_target` Module

1. Summarizing Target Variables with `summarize_target_df`

This function calculates class proportions and checks for imbalance when the target is categorical, and provides basic statistical summaries when the target is numerical. `summarize_target_df` takes four parameters:

- `dataset_name`: The input dataset containing the target variable. It must be a DataFrame.
- `target_variable`: The name of the target column. It must be a string.
- `target_type`: The type of the target variable. It must be a string, either "categorical" or "numerical".
- `threshold`: Used only when `target_type` is "categorical", to identify class imbalance. It must be a float; the default value is 0.2. A warning is raised if you pass a value when `target_type` is "numerical".

The following two examples demonstrate `summarize_target_df` for categorical and numerical targets.
   
Example 1: Summarizing a Categorical Target

In [4]:
# Summarize the categorical target variable
summary_categorical = summarize_target_df(
    dataset_name=categorical_data, 
    target_variable='target', 
    target_type='categorical', 
    threshold=0.2
)
print(summary_categorical)

  class  proportion  imbalanced  threshold
0     A    0.181818        True        0.2
1     B    0.181818        True        0.2
2     C    0.272727       False        0.2
3     D    0.363636        True        0.2
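
The flags in this table can be cross-checked by hand. The sketch below uses only the standard library and assumes one particular rule: a class counts as balanced when its proportion lies within `threshold` times the average proportion (1 / number of classes) on either side of that average. This rule is an assumption chosen because it reproduces the output above; it is not taken from the module's source, and the names `lower`, `upper`, and `flags` are illustrative only.

```python
from collections import Counter

targets = ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'D']
threshold = 0.2

counts = Counter(targets)
n_total = len(targets)
expected = 1 / len(counts)  # average proportion for 4 classes: 0.25

# Assumed rule (not confirmed by the module's source): a relative band
# around the average proportion, here roughly [0.2, 0.3].
lower = expected * (1 - threshold)
upper = expected * (1 + threshold)

flags = {}
for label in sorted(counts):
    proportion = counts[label] / n_total
    # A class is flagged when its proportion falls outside the band.
    flags[label] = not (lower <= proportion <= upper)
    print(f"{label}: proportion={proportion:.6f}, imbalanced={flags[label]}")
```

Under this assumption, classes A and B (about 0.18) fall below the band, D (about 0.36) falls above it, and only C (about 0.27) sits inside it, which agrees with the `imbalanced` column.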


Example 2: Summarizing a Numerical Target

In [5]:
# Summarize the numerical target variable
summary_numerical = summarize_target_df(
    dataset_name=numerical_data, 
    target_variable='target', 
    target_type='numerical'
)
print(summary_numerical)

        count   mean       std  min  25%   50%   75%   max
target    8.0  5.325  3.137219  1.2  2.9  5.15  7.25  10.1
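
The columns of this summary are the same ones produced by pandas' `describe()`; whether the module calls it internally is not stated here. As a sanity check, the values can be reproduced with the standard library alone. The sketch below assumes the sample standard deviation (dividing by n - 1) and linearly interpolated quartiles, which are the conventions that match the printed numbers.

```python
import statistics

values = [1.2, 2.3, 3.1, 4.8, 5.5, 6.7, 8.9, 10.1]

mean = statistics.mean(values)
std = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)

# 'inclusive' interpolates linearly between data points, treating the
# minimum and maximum as the 0th and 100th percentiles.
q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')

print(f"count={len(values)}, mean={mean:.3f}, std={std:.6f}")
print(f"min={min(values)}, 25%={q1:.2f}, 50%={median:.2f}, 75%={q3:.2f}, max={max(values)}")
```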


2. Visualizing Target Balance with `summarize_target_balance_plot`  

This function visualizes the balance of a categorical target variable using an interactive bar plot. `summarize_target_balance_plot` takes one parameter:

- `summary_df`: The input DataFrame, expected to match the output of `summarize_target_df()` when `target_type` is "categorical". It must contain the columns `['class', 'proportion', 'imbalanced', 'threshold']`.

The following example demonstrates the usage of `summarize_target_balance_plot`.

Example: Visualizing Class Balance

In [6]:
# Visualize the balance of the categorical target variable
balance_plot = summarize_target_balance_plot(summary_categorical)

# Display the plot
balance_plot.show()

The error bars show the balanced range, which is set by the threshold (0.2 by default) around the average proportion. Classes whose proportion falls within this range are colored green, and classes outside it are colored red.