# Label Ambiguity 
This notebook provides an overview of how to use and understand the label ambiguity check.

**Structure:**

- [What is Label Ambiguity?](#what_is_label_ambiguity)
- [Load data](#load_data_model)
- [Run the check](#run_check)
- [Define a condition](#define_condition)


In [1]:
from deepchecks.checks.integrity import LabelAmbiguity
from deepchecks.base import Dataset
import pandas as pd

<a id='what_is_label_ambiguity'></a>
## What is Label Ambiguity?

Label Ambiguity is a measure of identical samples with different labels. This can occur either from mislabeled data or from a dataset that lacks the information needed to separate the labels.
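To make the definition concrete, here is a minimal pandas sketch of the idea, not the deepchecks implementation. The toy dataframe and the variable names are made up for illustration. Rows are grouped by their feature values, and a row counts as ambiguous when its group contains more than one distinct label.

```python
import pandas as pd

# Toy data (hypothetical, not the phishing dataset):
# rows 0 and 1 have identical features but different labels.
df = pd.DataFrame({
    "urlLength": [81, 81, 82, 85, 85],
    "numDigits": [6, 6, 2, 0, 0],
    "target":    [1, 0, 1, 0, 0],
})
feature_cols = ["urlLength", "numDigits"]

# For every row, count the distinct labels seen among rows with the same features.
labels_per_group = df.groupby(feature_cols)["target"].transform("nunique")

# A row is ambiguous if identical rows carry more than one label.
ambiguous = df[labels_per_group > 1]
ambiguous_ratio = len(ambiguous) / len(df)

print(ambiguous)
print(f"Ambiguous sample ratio: {ambiguous_ratio:.1%}")
```

Rows 3 and 4 are duplicates too, but they agree on the label, so they are not counted as ambiguous.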

<a id='load_data_model'></a>
## Load data


In [2]:
from deepchecks.datasets.classification.phishing import load_data

phishing_dataframe = load_data(as_train_test=False, data_format='Dataframe')
phishing_dataset = Dataset(phishing_dataframe, label='target', features=['urlLength', 'numDigits', 'numParams', 'num_%20', 'num_@', 'bodyLength', 'numTitles', 'numImages', 'numLinks', 'specialChars'])

Automatically inferred these columns as categorical features: numParams, num_%20, num_@. 



<a id='run_check'></a>

## Run the check

In [3]:
LabelAmbiguity().run(phishing_dataset)

Observed Labels,urlLength,numDigits,numParams,num_%20,num_@,bodyLength,numTitles,numImages,numLinks,specialChars
"(1, 0)",81,6,0,0,0,0,0,0,0,0
"(1, 0)",82,2,0,0,0,0,0,0,0,0
"(0, 1)",85,0,0,0,0,0,0,0,0,0
"(0, 1)",85,20,0,0,0,0,0,0,0,0
"(0, 1)",88,0,0,0,0,0,0,0,0,0


In [4]:
LabelAmbiguity(columns=['urlLength', 'numDigits']).run(phishing_dataset)

Observed Labels,urlLength,numDigits
"(1, 0)",81,0
"(1, 0)",81,6
"(1, 0)",82,2
"(0, 1)",84,2
"(0, 1)",85,0


<a id='define_condition'></a>

## Define a condition

Now we define a condition that enforces a ratio of ambiguous samples of 0. A condition is deepchecks' way of validating model and data quality and letting you know if anything goes wrong.

In [5]:
check = LabelAmbiguity()
check.add_condition_ambiguous_sample_ratio_not_greater_than(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)

Status,Condition,More Info
✖,Ambiguous sample ratio is not greater than 0%,Found ratio of samples with multiple labels above threshold: 0.6%
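The condition fails here because the observed ratio (0.6%) exceeds the allowed maximum (0%). The sketch below imitates that comparison in plain Python. `ratio_condition` is a hypothetical stand-in written for this illustration, not a deepchecks function, and it assumes the threshold is given as a fraction rather than a percentage.

```python
from typing import Tuple


def ratio_condition(observed_ratio: float, max_ratio: float = 0.0) -> Tuple[bool, str]:
    """Hypothetical stand-in for the condition: pass only if observed_ratio <= max_ratio."""
    passed = observed_ratio <= max_ratio
    details = f"observed {observed_ratio:.1%}, allowed {max_ratio:.1%}"
    return passed, details


observed = 0.006  # the 0.6% ratio reported in the output above

strict = ratio_condition(observed, max_ratio=0.0)    # same threshold as the cell above
lenient = ratio_condition(observed, max_ratio=0.01)  # tolerate up to 1% ambiguous samples

print(strict)
print(lenient)
```

Under this assumption, a threshold of 0 rejects any ambiguity at all, while a small positive threshold tolerates a limited amount of label noise.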
