# Automatic Suggestion of Constraints

In our experience, a major hurdle in data validation is that someone needs to come up with the actual constraints to apply to the data. This can be very difficult for large, real-world datasets, especially if they are complex and combine information from many different sources. We built constraint suggestion functionality into deequ to help users find reasonable constraints for their data.

Our constraint suggestion first [profiles the data](./data_profiling_example.ipynb) and then applies a set of heuristic rules to suggest constraints. In the following, we give a concrete example of how to have constraints suggested for your data.

```python
from pyspark.sql import SparkSession, Row, DataFrame
import json
import pandas as pd
import sagemaker_pyspark

import pydeequ

# Jars shipped with sagemaker_pyspark, joined into a single classpath string
classpath = ":".join(sagemaker_pyspark.classpath_jars())

# Start a Spark session that pulls in the deequ jar matching this PyDeequ version
spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())
```

### Let's first generate some example data:

```python
df = spark.sparkContext.parallelize([
    Row(productName="thingA", totalNumber="13.0", status="IN_TRANSIT", valuable="true"),
    Row(productName="thingA", totalNumber="5", status="DELAYED", valuable="false"),
    Row(productName="thingB", totalNumber=None, status="DELAYED", valuable=None),
    Row(productName="thingC", totalNumber=None, status="IN_TRANSIT", valuable="false"),
    Row(productName="thingD", totalNumber="1.0", status="DELAYED", valuable="true"),
    Row(productName="thingC", totalNumber="7.0", status="UNKNOWN", valuable=None),
    Row(productName="thingC", totalNumber="20", status="UNKNOWN", valuable=None),
    Row(productName="thingE", totalNumber="20", status="DELAYED", valuable="false"),
    Row(productName="thingA", totalNumber="13.0", status="IN_TRANSIT", valuable="true"),
    Row(productName="thingA", totalNumber="5", status="DELAYED", valuable="false"),
    Row(productName="thingB", totalNumber=None, status="DELAYED", valuable=None),
    Row(productName="thingC", totalNumber=None, status="IN_TRANSIT", valuable="false"),
    Row(productName="thingD", totalNumber="1.0", status="DELAYED", valuable="true"),
    Row(productName="thingC", totalNumber="17.0", status="UNKNOWN", valuable=None),
    Row(productName="thingC", totalNumber="22", status="UNKNOWN", valuable=None),
    Row(productName="thingE", totalNumber="23", status="DELAYED", valuable="false")]).toDF()
```

Now, we ask PyDeequ to compute constraint suggestions for us on the data. It will profile the data and then apply the set of rules specified via `addConstraintRule()` to suggest constraints.

```python
from pydeequ.suggestions import *

suggestionResult = ConstraintSuggestionRunner(spark) \
             .onData(df) \
             .addConstraintRule(DEFAULT()) \
             .run()
```

We can now investigate the constraints that deequ suggested. For each suggested constraint, we get a textual description and the corresponding Python code. Note that constraint suggestion is based on heuristic rules and assumes that the data it is shown is 'static' and correct, which is often not the case in the real world. The suggestions should therefore always be reviewed manually before being applied in real deployments.

```python
for sugg in suggestionResult['constraint_suggestions']:
    print(f"Constraint suggestion for '{sugg['column_name']}': {sugg['description']}")
    print(f"The corresponding Python code is: {sugg['code_for_constraint']}\n")
```

```
Constraint suggestion for 'valuable': 'valuable' has less than 62% missing values
The corresponding Python code is: .hasCompleteness("valuable", lambda x: x >= 0.38, "It should be above 0.38!")

Constraint suggestion for 'valuable': 'valuable' has type Boolean
The corresponding Python code is: .hasDataType("valuable", ConstrainableDataTypes.Boolean)

Constraint suggestion for 'totalNumber': 'totalNumber' has no negative values
The corresponding Python code is: .isNonNegative("totalNumber")

Constraint suggestion for 'totalNumber': 'totalNumber' has less than 47% missing values
The corresponding Python code is: .hasCompleteness("totalNumber", lambda x: x >= 0.53, "It should be above 0.53!")

Constraint suggestion for 'totalNumber': 'totalNumber' has type Fractional
The corresponding Python code is: .hasDataType("totalNumber", ConstrainableDataTypes.Fractional)

Constraint suggestion for 'productName': 'productName' has value range 'thingC', 'thingA', 'thingB', 'thingE', 'thingD'
...
```
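
Since the suggestions are easiest to review one column at a time, it can help to group them by column first. The following is a minimal sketch in plain Python, assuming the result is a dictionary with the same keys used in the loop above (`constraint_suggestions`, `column_name`, `code_for_constraint`). The `sample_result` dictionary and the `group_by_column` helper are hypothetical stand-ins written for this illustration, not part of PyDeequ.

```python
from collections import defaultdict

# Hand-written stand-in for the dictionary returned by run(); it only
# contains the keys used in the loop above and a few of the suggestions.
sample_result = {
    "constraint_suggestions": [
        {"column_name": "valuable",
         "description": "'valuable' has type Boolean",
         "code_for_constraint": '.hasDataType("valuable", ConstrainableDataTypes.Boolean)'},
        {"column_name": "totalNumber",
         "description": "'totalNumber' has no negative values",
         "code_for_constraint": '.isNonNegative("totalNumber")'},
        {"column_name": "totalNumber",
         "description": "'totalNumber' has type Fractional",
         "code_for_constraint": '.hasDataType("totalNumber", ConstrainableDataTypes.Fractional)'},
    ]
}


def group_by_column(result):
    """Map each column name to the list of code snippets suggested for it."""
    grouped = defaultdict(list)
    for sugg in result["constraint_suggestions"]:
        grouped[sugg["column_name"]].append(sugg["code_for_constraint"])
    return dict(grouped)


grouped = group_by_column(sample_result)
for column, snippets in grouped.items():
    print(f"{column}: {len(snippets)} suggestion(s)")
```

The same helper should work on the real `suggestionResult` as long as it has the structure assumed here.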

The first suggestions we get are for the `valuable` column. **PyDeequ** correctly identified that this column is actually a `boolean` column 'disguised' as a string column and therefore suggests a constraint on the `boolean` datatype. Furthermore, it saw that this column contains some missing values and suggests a constraint that checks that the ratio of missing values does not increase in the future.

```
Constraint suggestion for 'valuable': 'valuable' has less than 62% missing values
The corresponding Python code is: .hasCompleteness("valuable", lambda x: x >= 0.38, "It should be above 0.38!")

Constraint suggestion for 'valuable': 'valuable' has type Boolean
The corresponding Python code is: .hasDataType("valuable", ConstrainableDataTypes.Boolean)
```
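
The completeness thresholds in the output (0.38 here, 0.53 for `totalNumber`) are lower than the observed completeness of the example data (10 of 16 and 12 of 16 non-null values). The sketch below shows one calculation that reproduces both numbers: subtract a 95% normal-approximation margin for a binomial proportion from the observed completeness and truncate to two decimals. This formula is an assumption that happens to match the output shown here; this document does not confirm it as the rule's actual implementation, and `completeness_lower_bound` is a hypothetical helper name.

```python
import math


def completeness_lower_bound(non_null: int, total: int, z: float = 1.96) -> float:
    """Lower bound of a normal-approximation interval, truncated to 2 decimals.

    Assumption: z = 1.96 (95% confidence) and rounding down, chosen because
    they reproduce the thresholds in the example output.
    """
    p = non_null / total                         # observed completeness
    margin = z * math.sqrt(p * (1 - p) / total)  # half-width of the interval
    return math.floor((p - margin) * 100) / 100  # truncate, do not round up


# 'valuable': 10 of the 16 example rows are non-null
print(completeness_lower_bound(10, 16))  # 0.38

# 'totalNumber': 12 of the 16 example rows are non-null
print(completeness_lower_bound(12, 16))  # 0.53
```

Under this reading, the margin gives future data some slack: with only 16 rows, a somewhat lower completeness is still consistent with the ratio observed so far.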

Next we look at the `totalNumber` column. PyDeequ identified that this column is actually a numeric column 'disguised' as a string column and therefore suggests a constraint on a fractional datatype (such as `float` or `double`). Furthermore, it saw that this column contains some missing values and suggests a constraint that checks that the ratio of missing values does not increase in the future. Additionally, it suggests that values in this column should never be negative (as it did not see any negative values in the example data), which probably makes sense for this count-like data.

```
Constraint suggestion for 'totalNumber': 'totalNumber' has no negative values
The corresponding Python code is: .isNonNegative("totalNumber")

Constraint suggestion for 'totalNumber': 'totalNumber' has less than 47% missing values
The corresponding Python code is: .hasCompleteness("totalNumber", lambda x: x >= 0.53, "It should be above 0.53!")

Constraint suggestion for 'totalNumber': 'totalNumber' has type Fractional
The corresponding Python code is: .hasDataType("totalNumber", ConstrainableDataTypes.Fractional)
```
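
To illustrate how a string column can be recognized as a 'disguised' boolean or numeric column, here is a small heuristic sketch in plain Python. It classifies each non-null string with a regular expression and then picks the narrowest type that covers all values. The patterns, the type labels, and the `infer_type` function are assumptions made for this illustration and are not deequ's actual implementation.

```python
import re

# Illustrative patterns only; the real profiler may accept other formats.
BOOLEAN = re.compile(r"true|false", re.IGNORECASE)
INTEGRAL = re.compile(r"[+-]?\d+")
FRACTIONAL = re.compile(r"[+-]?\d*\.\d+")


def infer_type(values):
    """Return the narrowest type label that fits every non-null string."""
    seen = set()
    for value in values:
        if value is None:
            continue  # missing values do not influence the inferred type
        if BOOLEAN.fullmatch(value):
            seen.add("Boolean")
        elif INTEGRAL.fullmatch(value):
            seen.add("Integral")
        elif FRACTIONAL.fullmatch(value):
            seen.add("Fractional")
        else:
            return "String"
    if not seen:
        return "Unknown"
    if seen == {"Boolean"}:
        return "Boolean"
    if seen == {"Integral"}:
        return "Integral"
    if seen <= {"Integral", "Fractional"}:
        return "Fractional"  # a mix of "5" and "13.0" needs a fractional type
    return "String"


print(infer_type(["true", "false", None, "true"]))  # Boolean
print(infer_type(["13.0", "5", None, "1.0"]))       # Fractional
print(infer_type(["thingA", "thingB"]))             # String
```

This also shows why `totalNumber` ends up as fractional rather than integral: values such as `"13.0"` and `"5"` appear side by side, so only a fractional type covers them all.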

Finally, we look at the suggestions for the `productName` and `status` columns. Neither of them had a single missing value in the example data, so an `isComplete` constraint is suggested for each. Furthermore, both have only a small set of possible values, so an `isContainedIn` constraint is suggested, which would check that future values are also contained in the set of observed values.

```
Constraint suggestion for 'productName': 'productName' has value range 'thingC', 'thingA', 'thingB', 'thingE', 'thingD'
The corresponding Python code is: .isContainedIn("productName", ["thingC", "thingA", "thingB", "thingE", "thingD"])

Constraint suggestion for 'productName': 'productName' is not null
The corresponding Python code is: .isComplete("productName")

Constraint suggestion for 'status': 'status' has value range 'DELAYED', 'UNKNOWN', 'IN_TRANSIT'
The corresponding Python code is: .isContainedIn("status", ["DELAYED", "UNKNOWN", "IN_TRANSIT"])

Constraint suggestion for 'status': 'status' is not null
The corresponding Python code is: .isComplete("status")
```
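
To make the meaning of these two constraints concrete, the following plain-Python sketch evaluates them against a small batch of future values. It illustrates the semantics only and does not call PyDeequ. The helper names and the `CANCELLED` value are hypothetical, and treating missing values as passing the containment check (so that only `is_complete` reports them) is an assumption.

```python
def is_complete(values):
    """True if no value is missing."""
    return all(value is not None for value in values)


def is_contained_in(values, allowed):
    """True if every non-null value belongs to the allowed set.

    Assumption: missing values are left to the completeness check.
    """
    allowed = set(allowed)
    return all(value is None or value in allowed for value in values)


# Value range suggested for 'status' in the output above
allowed_status = ["DELAYED", "UNKNOWN", "IN_TRANSIT"]

# Hypothetical future batch with one unseen value and one missing value
new_batch = ["DELAYED", "IN_TRANSIT", "CANCELLED", None]

print(is_complete(new_batch))                      # False
print(is_contained_in(new_batch, allowed_status))  # False
```

A failure like the second one does not have to mean the data is wrong. A new status may be legitimate, which is one reason the suggestions should be reviewed before they are enforced.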

Currently, we leave it up to the user to decide whether to apply the suggested constraints, and provide the corresponding Python code for convenience. For larger datasets, it makes sense to evaluate the suggested constraints on a held-out portion of the data to see whether they hold. You can do this by adding an invocation of `.useTrainTestSplitWithTestsetRatio(0.1)` to the `ConstraintSuggestionRunner`. With this configuration, it computes constraint suggestions on 90% of the data and evaluates the suggested constraints on the remaining 10%.
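
The idea behind the held-out evaluation can be sketched in plain Python as follows: split the rows, derive a constraint from the training part, and check it on the test part. This is a conceptual illustration under simplifying assumptions (a shuffle-and-cut split with a fixed seed, and a value-range constraint on `status` only). It is not how PyDeequ or Spark perform the split, and `train_test_split` is a hypothetical helper.

```python
import random

# Status values copied from the 16 rows of example data above
statuses = ["IN_TRANSIT", "DELAYED", "DELAYED", "IN_TRANSIT",
            "DELAYED", "UNKNOWN", "UNKNOWN", "DELAYED"] * 2


def train_test_split(rows, test_ratio, seed):
    """Shuffle a copy of the rows and cut off roughly test_ratio of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(statuses, test_ratio=0.1, seed=42)

# "Suggest" a value-range constraint from the training part only
allowed = set(train)

# Evaluate the suggested constraint on the held-out part
violations = [value for value in test if value not in allowed]
holds = not violations

print(f"{len(train)} training rows, {len(test)} test rows, constraint holds: {holds}")
```

With this example data, every status occurs at least four times, so holding out two rows cannot remove a value from the training part and the constraint always holds. On real data, a constraint that fails on the held-out rows is a sign that the suggestion is too strict.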

Finally, we would also like to note that the constraint suggestion code provides access to the underlying [column profiles](./data_profiling_example.ipynb) that it computed via `suggestionResult.columnProfiles`.