### What is the purpose of the check?

String Mismatch runs on a single dataset and looks for mismatches within each string column. Finding such mismatches helps identify errors in the data. For example, if your data is aggregated from multiple sources, the same value may appear with slight formatting variations, such as a leading uppercase letter. In that case, the model's ability to learn may be impaired, because it treats values that are supposed to belong to one category as different categories.

#### How is String Mismatch defined?

To recognize string mismatches, we transform each string to its base form: the string with only its alphanumeric characters, in lowercase. For example, the base form of “Cat-9?!” is “cat9”. If two strings have the same base form, they are considered to be the same.
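
The definition above can be illustrated with a minimal sketch in plain Python. This is not the deepchecks implementation; the helper names `base_form` and `group_variants` are hypothetical, and using `str.isalnum` to decide what counts as alphanumeric is an assumption (it also accepts non-ASCII letters and digits, which the library may treat differently).

```python
from collections import defaultdict


def base_form(value: str) -> str:
    """Keep only alphanumeric characters and lowercase them (assumed definition)."""
    return "".join(ch for ch in value if ch.isalnum()).lower()


def group_variants(values):
    """Map each base form to its distinct original spellings.

    Only base forms with more than one spelling are returned,
    since those are the ones that count as mismatches.
    """
    groups = defaultdict(set)
    for value in values:
        groups[base_form(value)].add(value)
    return {base: sorted(variants) for base, variants in groups.items() if len(variants) > 1}


values = ['Deep', 'deep', 'deep!!!', '$deeP$', 'earth', 'foo', 'bar', 'foo?']
variants = group_variants(values)
print(variants)
```

With the sample column used in this example, the strings group under the base forms `deep` and `foo`, while `earth` and `bar` have a single spelling each and are not reported.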

In [1]:
import pandas as pd

from deepchecks.tabular.checks import StringMismatch

# 'Deep', 'deep', 'deep!!!' and '$deeP$' share the base form 'deep'; 'foo' and 'foo?' share 'foo'
data = {'col1': ['Deep', 'deep', 'deep!!!', '$deeP$', 'earth', 'foo', 'bar', 'foo?']}
df = pd.DataFrame(data=data)


In [2]:
df

|   | col1    |
|---|---------|
| 0 | Deep    |
| 1 | deep    |
| 2 | deep!!! |
| 3 | $deeP$  |
| 4 | earth   |
| 5 | foo     |
| 6 | bar     |
| 7 | foo?    |


In [3]:
result = StringMismatch().run(df)

Received a "pandas.DataFrame" instance, initializing "deepchecks.tabular.Dataset" from it
It is recommended to initialize Dataset with categorical features by doing "Dataset(df, cat_features=categorical_list)". No categorical features were passed, therefore heuristically inferring categorical features in the data.
0 categorical features were inferred


In [4]:
result


#### Condition for the above check

In [5]:
check = StringMismatch().add_condition_no_variants()
result = check.run(df)
result.show(show_additional_outputs=False)


Received a "pandas.DataFrame" instance, initializing "deepchecks.tabular.Dataset" from it
It is recommended to initialize Dataset with categorical features by doing "Dataset(df, cat_features=categorical_list)". No categorical features were passed, therefore heuristically inferring categorical features in the data.
0 categorical features were inferred


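
As a rough illustration of what a "no variants" condition expresses, the sketch below marks the condition as failed when any column contains a base form with more than one spelling. It assumes that this is the intended meaning of `add_condition_no_variants`, which its name suggests but the text here does not spell out. The helpers `base_form` and `columns_with_variants`, and the second column `col2`, are hypothetical and not part of deepchecks.

```python
from collections import defaultdict


def base_form(value: str) -> str:
    """Alphanumeric characters only, lowercased (assumed definition)."""
    return "".join(ch for ch in value if ch.isalnum()).lower()


def columns_with_variants(columns):
    """Return, per column, the base forms that have more than one spelling.

    Columns without any variants are left out of the result.
    """
    failing = {}
    for name, values in columns.items():
        groups = defaultdict(set)
        for value in values:
            groups[base_form(value)].add(value)
        variants = {b: sorted(v) for b, v in groups.items() if len(v) > 1}
        if variants:
            failing[name] = variants
    return failing


columns = {
    "col1": ['Deep', 'deep', 'deep!!!', '$deeP$', 'earth', 'foo', 'bar', 'foo?'],
    "col2": ['red', 'green', 'blue'],  # hypothetical clean column for contrast
}
failing = columns_with_variants(columns)
condition_passed = not failing  # passes only if no column has variants
print(condition_passed, list(failing))
```

Under these assumptions the condition does not pass for the example data, because `col1` contains several spellings of `deep` and `foo`.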