# Data Duplicates 

This notebook provides an overview of using and understanding the data duplicates check.

**Structure:**

- [Why data duplicates?](#what_is_data_duplicates)
- [Load data](#load_data_model)
- [Run the check](#run_check)
- [Define a condition](#define_condition)


In [1]:
from deepchecks.tabular.checks.integrity.data_duplicates import DataDuplicates
from deepchecks.tabular.base import Dataset, Suite
from datetime import datetime
import pandas as pd


<a id='what_is_data_duplicates'></a>
## Why data duplicates?

The `DataDuplicates` check finds multiple instances of identical samples in the Dataset. Duplicate samples increase the weight the model gives to those samples. If the duplicates are intentional (e.g. the result of deliberate oversampling, or a dataset that by its nature contains identical-looking samples), this may be valid. However, if they are a hidden issue that we do not expect to occur, they may indicate a problem in the data pipeline that requires attention.
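
To make "identical samples" concrete, here is a minimal pandas sketch. It runs on a small made-up frame (the values are hypothetical) and only illustrates the idea; it is not the deepchecks implementation.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical, row 1 is unique.
df = pd.DataFrame({
    "urlLength": [102, 154, 102],
    "numDigits": [8, 60, 8],
    "ext": ["net", "country", "net"],
})

# keep=False marks every member of a duplicated group, not only the repeats.
duplicate_mask = df.duplicated(keep=False)
duplicate_rows = df[duplicate_mask]

print(duplicate_rows.index.tolist())  # [0, 2]
```

Grouping the flagged rows by their values gives one "instance" per distinct duplicated sample, which is roughly how the check's output table is organized.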





<a id='load_data_model'></a>
## Load data

In [2]:
from deepchecks.tabular.datasets.classification.phishing import load_data

phishing_dataset = load_data(as_train_test=False, data_format='DataFrame')
phishing_dataset

Unnamed: 0,target,month,scrape_date,ext,urlLength,numDigits,numParams,num_%20,num_@,entropy,...,dse,bodyLength,numTitles,numImages,numLinks,specialChars,scriptLength,sbr,bscr,sscr
0,0,1,2019-01-01,net,102,8,0,0,0,-4.384032,...,191,32486,3,5,330,9419,23919,0.736286,0.289940,2.539442
1,0,1,2019-01-01,country,154,60,0,2,0,-3.566515,...,0,16199,0,4,39,2735,794,0.049015,0.168838,0.290311
2,0,1,2019-01-01,net,171,5,11,0,0,-4.608755,...,104,103344,18,9,302,27798,83817,0.811049,0.268985,2.412174
3,0,1,2019-01-01,com,94,10,0,0,0,-4.548921,...,466,34093,11,43,199,9087,19427,0.569824,0.266536,2.137889
4,0,1,2019-01-01,other,95,11,0,0,0,-4.717188,...,928,202,1,0,0,39,0,0.000000,0.193069,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11345,0,1,2020-01-15,country,89,7,0,0,0,-4.254491,...,0,4117,5,0,1,971,1866,0.625302,0.213266,2.932029
11346,0,1,2020-01-15,other,107,13,0,0,0,-4.758879,...,1882,17788,47,58,645,3185,4228,0.291069,0.214348,1.357928
11347,0,1,2020-01-15,com,112,10,0,0,0,-4.723014,...,1011,0,0,0,0,0,0,0.000000,0.000000,0.000000
11348,0,1,2020-01-15,html,111,3,0,0,0,-4.289384,...,265,0,0,0,0,0,0,0.000000,0.000000,0.000000


<a id='run_check'></a>
## Run the check

In [3]:
from deepchecks.tabular.checks import DataDuplicates
DataDuplicates().run(phishing_dataset)

Instances,Number of Duplicates,target,month,scrape_date,ext,urlLength,numDigits,numParams,num_%20,num_@,entropy,has_ip,hasHttp,hasHttps,urlIsLive,dsr,dse,bodyLength,numTitles,numImages,numLinks,specialChars,scriptLength,sbr,bscr,sscr
"4696, 4719",2,0,6,2019-06-06,other,123,28,4,0,0,-4.91,0,1,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0


### With Check Parameters

The `DataDuplicates` check can also look for duplicates in a specific subset of columns (`columns`), or in all columns except those listed in `ignore_columns`:

In [4]:
DataDuplicates(columns=["entropy", "numParams"]).run(phishing_dataset)

Instances,Number of Duplicates,entropy,numParams
"82, 974, 1557, 2150, 2360, 3528, 6560, 7...",13,-4.31,0
"1641, 1729, 2213, 2234, 4412, 4638, 6328...",8,-4.57,4
"2719, 4634, 6504, 6774, 6783, 7528, 9592...",8,-4.49,8
"929, 2499, 4047, 7989, 8391, 9348, 9932,...",8,-4.25,0
"1020, 1670, 1802, 2984, 6666, 9138, 1092...",7,-4.65,5


In [5]:
DataDuplicates(ignore_columns=["scrape_date"], n_to_show=10).run(phishing_dataset)

Instances,Number of Duplicates,target,month,ext,urlLength,numDigits,numParams,num_%20,num_@,entropy,has_ip,hasHttp,hasHttps,urlIsLive,dsr,dse,bodyLength,numTitles,numImages,numLinks,specialChars,scriptLength,sbr,bscr,sscr
"4696, 4719, 5398",3,0,6,other,123,28,4,0,0,-4.91,0,1,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0
"82, 11342",2,0,1,html,92,2,0,0,0,-4.31,0,1,0,0,0,0,149,1,0,0,25,0,0.0,0.17,0.0
"250, 790",2,0,1,php,107,4,8,0,0,-4.53,0,1,0,0,1381,79,0,1,0,0,0,0,0.0,0.0,0.0
"6, 217",2,0,1,php,107,5,8,0,0,-4.52,0,1,0,0,1381,79,0,1,0,0,0,0,0.0,0.0,0.0
"609, 763",2,0,1,php,113,6,8,0,0,-4.63,0,1,0,0,1381,79,0,1,0,0,0,0,0.0,0.0,0.0
"974, 1557",2,0,2,html,92,2,0,0,0,-4.31,0,1,0,0,0,0,149,1,0,0,25,0,0.0,0.17,0.0
"2150, 2360",2,0,3,html,92,2,0,0,0,-4.31,0,1,0,0,0,0,149,1,0,0,25,0,0.0,0.17,0.0
"2238, 2489",2,0,3,php,108,3,8,0,0,-4.51,0,1,0,0,1381,79,0,1,0,0,0,0,0.0,0.0,0.0
"3192, 3444",2,0,4,other,123,28,4,0,0,-4.92,0,1,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0
"3277, 3498",2,0,4,php,93,31,1,0,0,-4.93,0,1,0,0,0,0,281,0,0,0,74,142,0.51,0.26,1.92

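The effect of restricting or ignoring columns can be sketched with plain pandas. In the outputs above, ignoring `scrape_date` adds row 5398 to the group formed by rows 4696 and 4719, because those rows differ only in that column. The toy frame below is hypothetical and reproduces the same effect on a small scale; the use of `DataFrame.duplicated` is an illustration, not a claim about how deepchecks implements the parameters.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 differ only in scrape_date.
df = pd.DataFrame({
    "scrape_date": ["2019-01-01", "2019-02-01", "2019-03-01"],
    "entropy": [-4.31, -4.31, -4.57],
    "numParams": [0, 0, 4],
})

# Analogue of columns=[...]: compare only the listed columns.
subset_mask = df.duplicated(subset=["entropy", "numParams"], keep=False)

# Analogue of ignore_columns=[...]: compare everything except the listed columns.
ignored = ["scrape_date"]
kept_columns = [c for c in df.columns if c not in ignored]
ignore_mask = df.duplicated(subset=kept_columns, keep=False)

# With all columns, no row is a duplicate because scrape_date differs.
full_mask = df.duplicated(keep=False)

print(subset_mask.tolist(), ignore_mask.tolist(), full_mask.tolist())
```

Comparing fewer columns can only keep or increase the number of duplicates found, which is why the narrower checks above report more instances than the full-row check.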

<a id='define_condition'></a>
## Define a condition


Now, we define a condition that enforces the ratio of duplicates to be 0. A condition is deepchecks' way to validate model and data quality, and it lets you know if anything goes wrong.

In [6]:
check = DataDuplicates()
check.add_condition_ratio_not_greater_than(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)

Status,Condition,More Info
!,Duplicate data ratio is not greater than 0%,Found 0.0088% duplicate data


As the result shows, the condition detected that we have data duplicates in our dataset!
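
For intuition about the reported ratio, the following sketch computes a duplicate ratio by hand and compares it with a threshold. It assumes the ratio is the fraction of rows that repeat an earlier row; that definition, the helper name `duplicate_ratio`, and the toy data are assumptions made for illustration, not taken from deepchecks.

```python
import pandas as pd

def duplicate_ratio(df: pd.DataFrame) -> float:
    """Fraction of rows that repeat an earlier row (assumed definition)."""
    if len(df) == 0:
        return 0.0
    n_unique = len(df.drop_duplicates())
    return 1 - n_unique / len(df)

# Hypothetical toy data: 4 rows, 3 distinct ones.
df = pd.DataFrame({"a": [1, 1, 2, 3], "b": ["x", "x", "y", "z"]})

ratio = duplicate_ratio(df)
max_ratio = 0  # same threshold as the condition above
condition_passed = ratio <= max_ratio

print(f"Found {ratio:.2%} duplicate data, passed: {condition_passed}")
```

With a threshold of 0, a single repeated row is enough for the condition not to pass, which matches the behavior seen in the result table.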