### What is a Dominant Frequency Change?

Dominant Frequency Change is a data integrity check that tests whether the frequency of dominant values has increased significantly between the train data and the test data. A sharp change in the frequency of a dominant value can indicate a problem with the data collection or data processing pipeline (for example, a sharp increase in a common null or constant value) and can cause the model to generalize poorly. The goal of this check is to catch these issues early in the pipeline.

This check compares the frequency of the dominant values of each feature in the test data to the frequency of the same values in the train data. If the ratio of the test frequency to the train frequency is greater than a threshold, the check fails. The threshold is set through the ratio_change_thres parameter of the check.
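The comparison described above can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the helper names `share_of_value` and `frequency_ratio_exceeds` are hypothetical, the default threshold of 1.5 is an assumed value chosen for the example, and flagging a value that never appears in the train data is likewise an assumption.

```python
from collections import Counter


def share_of_value(values, value):
    """Return the fraction of entries in a non-empty `values` equal to `value`."""
    return Counter(values)[value] / len(values)


def frequency_ratio_exceeds(train_values, test_values, value, ratio_change_thres=1.5):
    """Hypothetical helper: True if the test/train share ratio of `value`
    is greater than `ratio_change_thres` (default is an assumed value)."""
    train_share = share_of_value(train_values, value)
    test_share = share_of_value(test_values, value)
    if train_share == 0:
        # Assumption: a value absent from train but present in test counts as a change.
        return test_share > 0
    return test_share / train_share > ratio_change_thres


train_col = [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]  # the value 0 makes up 20% of train
test_col = [0, 0, 0, 0, 0, 0, 1, 2, 3, 4]   # the value 0 makes up 60% of test

flagged = frequency_ratio_exceeds(train_col, test_col, value=0)
print(flagged)  # True: the share of 0 roughly tripled, which is above 1.5
```

Comparing shares rather than raw counts keeps the ratio meaningful when the train and test sets differ in size.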

### The Definition of a Dominant Value

A dominant value is a value that is at least dominance_ratio times as frequent in the data as the next most frequent value. The dominance_ratio is a configurable parameter of the check.
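The definition above can be illustrated with a short standard-library sketch. The function name `find_dominant_value` is hypothetical, the default `dominance_ratio=2` is an assumed value for the example, and two details are assumptions rather than confirmed library behavior: the boundary case is treated as inclusive (`>=`), and a column with a single distinct value is treated as having a dominant value.

```python
from collections import Counter


def find_dominant_value(values, dominance_ratio=2):
    """Hypothetical helper: return the dominant value of `values`, or None.

    A value is treated as dominant when its count is at least
    `dominance_ratio` times the count of the runner-up (inclusive
    boundary is an assumption).
    """
    top_two = Counter(values).most_common(2)
    if not top_two:
        return None  # empty input has no dominant value
    top_value, top_count = top_two[0]
    if len(top_two) == 1:
        # Assumption: a single distinct value is trivially dominant.
        return top_value
    _, runner_up_count = top_two[1]
    if top_count >= dominance_ratio * runner_up_count:
        return top_value
    return None


# 5.1 appears 6 times and the runner-up (20) appears twice: 6 >= 2 * 2
skewed = find_dominant_value([5.1] * 6 + [10, 20, 20])
# 1 and 2 both appear twice, so neither dominates the other
balanced = find_dominant_value([1, 1, 2, 2, 3])
```

Here `skewed` is `5.1` and `balanced` is `None`.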

In [1]:
from deepchecks.tabular.checks.integrity import DominantFrequencyChange
from deepchecks.tabular.datasets.classification import phishing

#### Load Data

In [2]:
train_ds, test_ds = phishing.load_data(data_format='Dataset', as_train_test=True)

#### Inject Dominant Values into the Test Data

In [9]:
test_ds.data.loc[test_ds.data.index % 2 == 0, 'urlLength'] = 5.1
test_ds.data.loc[test_ds.data.index / 3 > 8, 'numDigits'] = 2.7

In [10]:
test_ds.data.head()

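| | target | month | scrape_date | ext | urlLength | numDigits | numParams | num_%20 | num_@ | entropy | ... | dse | bodyLength | numTitles | numImages | numLinks | specialChars | scriptLength | sbr | bscr | sscr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8172 | 0 | 10 | 2019-10-01 | country | 5.1 | 2.7 | 0 | 0 | 0 | -4.187942 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 8173 | 0 | 10 | 2019-10-01 | com | 114.0 | 2.7 | 0 | 0 | 0 | -4.632417 | ... | 0 | 1432003 | 7 | 74 | 124 | 316311 | 924249 | 0.645424 | 0.220887 | 2.921963 |
| 8174 | 0 | 10 | 2019-10-01 | com | 5.1 | 2.7 | 0 | 0 | 0 | -4.489435 | ... | 254 | 23051 | 1 | 12 | 24 | 7256 | 12438 | 0.462568 | 0.26985 | 1.714168 |
| 8175 | 0 | 10 | 2019-10-01 | com | 87.0 | 2.7 | 0 | 0 | 0 | -4.293408 | ... | 1586 | 18432 | 1 | 65 | 157 | 6600 | 19360 | 0.70475 | 0.287369 | 2.452424 |
| 8176 | 0 | 10 | 2019-10-01 | com | 5.1 | 2.7 | 0 | 0 | 0 | -4.301842 | ... | 355 | 0 | 0 | 1 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |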


#### Run The Check

In [11]:
check = DominantFrequencyChange()
# run() takes the train dataset first, then the test dataset
check.run(train_ds, test_ds)
