### What is the purpose of the check?

The check compares the same categorical column in the train and test datasets and looks for variants of similar strings that exist only in test and not in train. Finding these mismatches helps prevent errors when running inference on the test data. For example, the train data may contain the category ‘New York’ while the test data contains ‘new york’. We want to be alerted that the test data contains a new variant of a category from the train data, so that we can address the problem.

### How is String Mismatch Defined?

To recognize a string mismatch, we transform each string to its base form. The base form is the string with only its alphanumeric characters, in lowercase (for example, the base form of “Cat-9?!” is “cat9”). If two strings have the same base form, they are considered to be the same.
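The base-form rule can be illustrated with a few lines of plain Python. This is a minimal sketch of the definition as stated here, not the library's own implementation; the helper name `base_form` is hypothetical.

```python
def base_form(value: str) -> str:
    """Keep only the alphanumeric characters of `value`, in lowercase.

    Hypothetical helper that mirrors the definition in the text;
    deepchecks may implement the normalization differently.
    """
    return "".join(ch for ch in value if ch.isalnum()).lower()


print(base_form("Cat-9?!"))                            # cat9
print(base_form("New York") == base_form("new york"))  # True
```

Note that under this definition whitespace is dropped as well, since it is not alphanumeric.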

In [1]:
import pandas as pd
from deepchecks.tabular.checks import StringMismatchComparison


In [2]:
data = {'col1': ['Deep', 'deep', 'deep!!!', 'earth', 'foo', 'bar', 'foo?']}
compared_data = {'col1': ['Deep', 'deep', '$deeP$', 'earth', 'foo', 'bar', 'foo?', '?deep']}


In [3]:
data

{'col1': ['Deep', 'deep', 'deep!!!', 'earth', 'foo', 'bar', 'foo?']}

In [4]:
compared_data

{'col1': ['Deep', 'deep', '$deeP$', 'earth', 'foo', 'bar', 'foo?', '?deep']}
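To see what the check is expected to flag for these two columns, the comparison can be sketched in plain Python. This is an illustrative approximation under stated assumptions, not the deepchecks implementation: the helpers `base_form`, `variants_by_base_form`, and `new_variants` are hypothetical, and the sketch assumes that a base form absent from train counts as a new category rather than a new variant.

```python
from collections import defaultdict


def base_form(value):
    # Hypothetical helper: alphanumeric characters only, lowercased.
    return "".join(ch for ch in value if ch.isalnum()).lower()


def variants_by_base_form(values):
    # Map each base form to the set of raw strings that reduce to it.
    groups = defaultdict(set)
    for value in values:
        groups[base_form(value)].add(value)
    return groups


def new_variants(train_values, test_values):
    # For every base form present in both datasets, collect the raw
    # strings that appear in test but not in train.
    train_groups = variants_by_base_form(train_values)
    test_groups = variants_by_base_form(test_values)
    found = {}
    for base, test_variants in test_groups.items():
        if base in train_groups:
            unseen = test_variants - train_groups[base]
            if unseen:
                found[base] = sorted(unseen)
    return found


train = ['Deep', 'deep', 'deep!!!', 'earth', 'foo', 'bar', 'foo?']
test = ['Deep', 'deep', '$deeP$', 'earth', 'foo', 'bar', 'foo?', '?deep']

print(new_variants(train, test))  # {'deep': ['$deeP$', '?deep']}
```

Only the base form `deep` gains variants in the compared data; `foo` and `foo?` appear in both columns, so they are not reported.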

In [5]:
check = StringMismatchComparison()
result = check.run(pd.DataFrame(data=data), pd.DataFrame(data=compared_data))
result

Received a "pandas.DataFrame" instance, initializing "deepchecks.tabular.Dataset" from it
It is recommended to initialize Dataset with categorical features by doing "Dataset(df, cat_features=categorical_list)". No categorical features were passed, therefore heuristically inferring categorical features in the data.
0 categorical features were inferred
Received a "pandas.DataFrame" instance, initializing "deepchecks.tabular.Dataset" from it
It is recommended to initialize Dataset with categorical features by doing "Dataset(df, cat_features=categorical_list)". No categorical features were passed, therefore heuristically inferring categorical features in the data.
0 categorical features were inferred



#### Add a condition to the above check

In [6]:
check = StringMismatchComparison().add_condition_no_new_variants()
result = check.run(pd.DataFrame(data=data), pd.DataFrame(data=compared_data))
result.show(show_additional_outputs=False)

Received a "pandas.DataFrame" instance, initializing "deepchecks.tabular.Dataset" from it
It is recommended to initialize Dataset with categorical features by doing "Dataset(df, cat_features=categorical_list)". No categorical features were passed, therefore heuristically inferring categorical features in the data.
0 categorical features were inferred
Received a "pandas.DataFrame" instance, initializing "deepchecks.tabular.Dataset" from it
It is recommended to initialize Dataset with categorical features by doing "Dataset(df, cat_features=categorical_list)". No categorical features were passed, therefore heuristically inferring categorical features in the data.
0 categorical features were inferred

