#### How deepchecks detects outliers

Outlier Sample Detection searches for outlier samples (jointly across all features) using the LoOP (Local Outlier Probabilities) algorithm. LoOP detects outliers across multiple variables by comparing the density around a sample with the densities around its nearest neighbors. It assigns each sample a score between 0 and 1 that can be read as the probability that the sample is an outlier (see link for further details).
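To make the density comparison concrete, here is a minimal pure-Python sketch of the LoOP scoring steps as they are usually described. It is not the deepchecks implementation. The helper name `loop_scores`, the toy data, and the choice of `k` are assumptions made for illustration only.

```python
import math


def loop_scores(dist, k, extent=3):
    """Return a LoOP score in [0, 1] for each sample of a square distance matrix.

    Hypothetical helper for illustration; requires k <= len(dist) - 1.
    """
    n = len(dist)
    # Context set: the k nearest neighbors of each sample, excluding the sample itself.
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Probabilistic set distance: extent * root-mean-square distance to the neighbors.
    pdist = [
        extent * math.sqrt(sum(dist[i][j] ** 2 for j in neighbors[i]) / k)
        for i in range(n)
    ]
    # PLOF: a sample's own pdist relative to the mean pdist of its neighbors, minus 1.
    plof = []
    for i in range(n):
        expected = sum(pdist[j] for j in neighbors[i]) / k
        plof.append(pdist[i] / expected - 1 if expected > 0 else 0.0)
    # Normalization factor computed over the whole dataset.
    nplof = extent * math.sqrt(sum(p ** 2 for p in plof) / n)
    if nplof == 0:
        return [0.0] * n
    # Map the normalized PLOF to a probability with the Gaussian error function.
    return [max(0.0, math.erf(p / (nplof * math.sqrt(2)))) for p in plof]


# Toy one-dimensional data: five evenly spaced values and one distant value.
values = [0, 1, 2, 3, 4, 20]
dist = [[abs(a - b) for b in values] for a in values]
scores = loop_scores(dist, k=2)
```

In this toy example the distant value receives the highest score, while samples whose neighborhoods are at least as dense as those of their neighbors score 0.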

LoOP relies on a distance matrix. Our implementation uses the Gower distance, which averages the per-feature distances between two samples. For a numeric feature, the distance is the absolute difference divided by the range of the feature. For a categorical feature, it is an indicator of whether the values differ: 0 if they are the same and 1 otherwise (see link for further details).
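The Gower distance described above can be sketched in a few lines of plain Python. The function name `gower_distance` and the three-feature subset are assumptions for illustration. The sample values come from the first two rows of the phishing dataset used in this example, and the ranges from the minimum and maximum of `urlLength` and `numDigits` in its summary statistics.

```python
def gower_distance(a, b, ranges, categorical):
    """Average per-feature distance between two samples given as dicts.

    Hypothetical helper for illustration, not part of deepchecks.
    """
    total = 0.0
    for name in a:
        if name in categorical:
            # Categorical feature: 0 if the values match, 1 otherwise.
            total += 0.0 if a[name] == b[name] else 1.0
        else:
            # Numeric feature: absolute difference scaled by the feature range.
            feature_range = ranges[name]
            total += abs(a[name] - b[name]) / feature_range if feature_range else 0.0
    return total / len(a)


# Two samples restricted to three features.
x = {"urlLength": 102, "numDigits": 8, "ext": "net"}
y = {"urlLength": 154, "numDigits": 60, "ext": "country"}

# Feature ranges (max - min) for the numeric features.
ranges = {"urlLength": 772 - 37, "numDigits": 454 - 0}

d = gower_distance(x, y, ranges, categorical={"ext"})
```

Because every per-feature term lies between 0 and 1, the averaged distance also lies between 0 and 1, which keeps features with large numeric ranges from dominating the result.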

In [1]:
import pandas as pd
from deepchecks.tabular.datasets.classification.phishing import load_data
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks.integrity.outlier_sample_detection import OutlierSampleDetection

#### Load Dataset

In [2]:
phishing_dataset = load_data(as_train_test=False, data_format='DataFrame')
phishing_dataset

Unnamed: 0,target,month,scrape_date,ext,urlLength,numDigits,numParams,num_%20,num_@,entropy,...,dse,bodyLength,numTitles,numImages,numLinks,specialChars,scriptLength,sbr,bscr,sscr
0,0,1,2019-01-01,net,102,8,0,0,0,-4.384032,...,191,32486,3,5,330,9419,23919,0.736286,0.289940,2.539442
1,0,1,2019-01-01,country,154,60,0,2,0,-3.566515,...,0,16199,0,4,39,2735,794,0.049015,0.168838,0.290311
2,0,1,2019-01-01,net,171,5,11,0,0,-4.608755,...,104,103344,18,9,302,27798,83817,0.811049,0.268985,2.412174
3,0,1,2019-01-01,com,94,10,0,0,0,-4.548921,...,466,34093,11,43,199,9087,19427,0.569824,0.266536,2.137889
4,0,1,2019-01-01,other,95,11,0,0,0,-4.717188,...,928,202,1,0,0,39,0,0.000000,0.193069,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11345,0,1,2020-01-15,country,89,7,0,0,0,-4.254491,...,0,4117,5,0,1,971,1866,0.625302,0.213266,2.932029
11346,0,1,2020-01-15,other,107,13,0,0,0,-4.758879,...,1882,17788,47,58,645,3185,4228,0.291069,0.214348,1.357928
11347,0,1,2020-01-15,com,112,10,0,0,0,-4.723014,...,1011,0,0,0,0,0,0,0.000000,0.000000,0.000000
11348,0,1,2020-01-15,html,111,3,0,0,0,-4.289384,...,265,0,0,0,0,0,0,0.000000,0.000000,0.000000


In [3]:
phishing_dataset.describe()

Unnamed: 0,target,month,urlLength,numDigits,numParams,num_%20,num_@,entropy,has_ip,dsr,dse,bodyLength,numTitles,numImages,numLinks,specialChars,scriptLength,sbr,bscr,sscr
count,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0,11350.0
mean,0.030837,6.28,114.025991,14.304053,0.523172,0.028811,0.005286,-4.480238,0.0,3918.588106,439.518414,49569.19,11.489251,15.10141,97.996828,11413.852423,31586.85,0.455681,0.21183,3.668419
std,0.172884,3.55003,43.531701,19.219679,1.511591,0.325926,0.078358,0.304459,0.0,3593.174607,695.881955,131794.4,31.267348,38.472225,164.578212,32281.711555,109790.9,0.374175,0.103761,19.381514
min,0.0,1.0,37.0,0.0,0.0,0.0,0.0,-5.90286,0.0,0.0,-1277.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,3.0,91.0,4.0,0.0,0.0,0.0,-4.630107,0.0,0.0,0.0,194.0,1.0,0.0,0.0,35.0,0.0,0.0,0.181818,0.0
50%,0.0,6.0,102.0,8.0,0.0,0.0,0.0,-4.48822,0.0,3819.0,182.0,12667.5,1.0,2.0,24.0,2707.0,4647.0,0.494361,0.244586,1.721923
75%,0.0,9.0,120.0,18.0,0.0,0.0,0.0,-4.321382,0.0,7078.0,387.0,39567.0,11.0,13.0,140.0,9434.75,20335.5,0.78889,0.271936,2.747405
max,1.0,12.0,772.0,454.0,12.0,12.0,2.0,-3.165609,0.0,11391.0,3643.0,2043278.0,1042.0,642.0,1235.0,506888.0,1988078.0,1.001634,0.6171,368.420245


#### Outlier Check

In [4]:
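# nearest_neighbors_percent: share of the dataset used as k for the nearest-neighbors search
# extent_parameter: the extent parameter of the LoOP algorithm
check = OutlierSampleDetection(nearest_neighbors_percent=0.01, extent_parameter=3)
check.run(phishing_dataset)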

Received a "pandas.DataFrame" instance, initializing "deepchecks.tabular.Dataset" from it
It is recommended to initialize Dataset with categorical features by doing "Dataset(df, cat_features=categorical_list)". No categorical features were passed, therefore heuristically inferring categorical features in the data.
10 categorical features were inferred: target, month, ext, numParams, num_%20, num_@, has_ip... For full list use dataset.cat_features



#### Define a condition

In [5]:
check = OutlierSampleDetection()
# Condition: at most 0.1% of the samples may have an outlier score above 0.9
check.add_condition_outlier_ratio_not_greater_than(max_outliers_ratio=0.001, outlier_score_threshold=0.9)
check.run(phishing_dataset)

Received a "pandas.DataFrame" instance, initializing "deepchecks.tabular.Dataset" from it
It is recommended to initialize Dataset with categorical features by doing "Dataset(df, cat_features=categorical_list)". No categorical features were passed, therefore heuristically inferring categorical features in the data.
10 categorical features were inferred: target, month, ext, numParams, num_%20, num_@, has_ip... For full list use dataset.cat_features


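The condition defined above can be read as a simple ratio test over the per-sample outlier scores. The sketch below shows that reading in plain Python. The helper `outlier_ratio_condition` is hypothetical and not part of the deepchecks API, and the exact comparison operators (strict or non-strict) are an assumption.

```python
def outlier_ratio_condition(scores, max_outliers_ratio, outlier_score_threshold):
    """Return (passed, ratio) for a list of outlier scores in [0, 1].

    Hypothetical helper for illustration only.
    """
    # Share of samples whose score exceeds the threshold.
    ratio = sum(s > outlier_score_threshold for s in scores) / len(scores)
    # The condition passes when that share does not exceed the allowed ratio.
    return ratio <= max_outliers_ratio, ratio


# Synthetic scores: 999 ordinary samples and one with a high outlier score.
example_scores = [0.1] * 999 + [0.95]
passed, ratio = outlier_ratio_condition(
    example_scores, max_outliers_ratio=0.001, outlier_score_threshold=0.9
)
```

With these synthetic scores exactly one sample in a thousand exceeds the threshold, so the ratio equals the allowed maximum and the condition passes. A second high-scoring sample would make it fail.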