# Data Validation

Data validation means having checks in place to make sure that data arrives in the format and with the specifications we expect. As data pipelines become more interconnected, the chance that a change unintentionally breaks another pipeline also increases. Validations help ensure that upstream changes do not break the integrity of downstream data operations.

Common data validation patterns include checking for NULL values and checking the data frame shape to ensure transformations don’t drop any records. Other frequently used operations are checking for column existence and schema. Data validation helps avoid silent failures, where every step of a data process runs successfully but the results are inaccurate.
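As a rough illustration of these patterns, the sketch below runs a column existence check, a NULL check, and a row count check in plain Pandas. The helper name `basic_checks` and its parameters are hypothetical and only serve this example; a validation library would normally replace hand-written checks like these.

```python
import pandas as pd


def basic_checks(df: pd.DataFrame, required_columns: list, expected_rows: int) -> None:
    """Hypothetical helper: raise ValueError if a basic validation fails."""
    # Column existence: every required column must be present
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # NULL check: required columns must not contain missing values
    null_counts = df[required_columns].isnull().sum()
    if null_counts.any():
        raise ValueError(f"NULL values found:\n{null_counts[null_counts > 0]}")

    # Shape check: a transformation should not have dropped or added records
    if len(df) != expected_rows:
        raise ValueError(f"Expected {expected_rows} rows, got {len(df)}")


sample = pd.DataFrame({"state": ["FL", "CA"], "price": [8, 16]})

# Passes silently because all three checks hold
basic_checks(sample, required_columns=["state", "price"], expected_rows=2)
```

Raising an exception instead of returning a flag means a failed check stops the pipeline, rather than letting inaccurate results flow downstream.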

Using Fugue for data validation provides a few benefits:
* Validation code can be reused for both Pandas and Spark projects
* Familiar Pandas-based libraries can be used on Spark
* A simple interface is available for validation on each partition of data

To illustrate this, we'll use a simple example with the following Pandas DataFrame.


In [1]:
import pandas as pd 

data = pd.DataFrame({'state': ['FL','FL','FL','CA','CA','CA'], 
                     'city': ['Orlando', 'Miami', 'Tampa',
                              'San Francisco', 'Los Angeles', 'San Diego'],
                     'price': [8, 12, 10, 16, 20, 18]})
data.head()

|   | state | city | price |
|---|-------|------|-------|
| 0 | FL | Orlando | 8 |
| 1 | FL | Miami | 12 |
| 2 | FL | Tampa | 10 |
| 3 | CA | San Francisco | 16 |
| 4 | CA | Los Angeles | 20 |



## Pandera for Data Validation

Data validation can be placed at the start of a data pipeline to make sure that transformations run smoothly, and at the end to make sure the output is correct before it gets committed to the database. [Pandera](https://github.com/pandera-dev/pandera) is a good data validation framework because it is lightweight and has an expressive syntax.

For the above DataFrame, we want to guarantee that every value in the `price` column is at least 8 and not more than 20.

In [2]:
import pandera as pa

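# Schema: the price column must hold integers between 8 and 20 (inclusive)
price_check = pa.DataFrameSchema({
    "price": pa.Column(pa.Int, pa.Check.in_range(min_value=8, max_value=20)),
})

def price_validation(data: pd.DataFrame) -> pd.DataFrame:
    # validate() raises an error if any price falls outside the range
    price_check.validate(data)
    return data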

price_validation(data)

|   | state | city | price |
|---|-------|------|-------|
| 0 | FL | Orlando | 8 |
| 1 | FL | Miami | 12 |
| 2 | FL | Tampa | 10 |
| 3 | CA | San Francisco | 16 |
| 4 | CA | Los Angeles | 20 |
| 5 | CA | San Diego | 18 |


In the example above, we use Pandera's `DataFrameSchema` to create a validation schema. The `price_check` schema has a `Check` applied to a `Column` named `price`, which guarantees that the prices are within an acceptable range of values. We don't need to wrap the validation inside a `price_validation` function, but doing so will make bringing the validation to Spark seamless.

We highly suggest checking the [Pandera documentation](https://pandera.readthedocs.io/en/stable/) for more information. If you want to see Pandera in action, change the `min_value` or `max_value` in the code above to trigger an error.

## Using Pandera on Spark

[Pandera](https://github.com/pandera-dev/pandera) is a great library, but unfortunately it is only available for Pandas at the moment. When the data becomes too big to handle in Pandas, users would need to switch to a data validation library with Spark support, which means re-implementing the same logic on Spark. Fugue, as an abstraction layer, allows users to keep the logic they wrote in Pandera and port it to Spark or Dask.

Below is an example of bringing Pandera validations to Spark with minimal code changes.

*Note that pyspark needs to be installed for the code snippet below to run.*

In [3]:
from fugue import FugueWorkflow
from fugue_spark import SparkExecutionEngine

price_check = pa.DataFrameSchema({
    "price": pa.Column(pa.Int, pa.Check.in_range(min_value=8, max_value=20)),
})

# schema: *
def price_validation(data:pd.DataFrame) -> pd.DataFrame:
    price_check.validate(data)
    return data

# Bring the code to spark
with FugueWorkflow(SparkExecutionEngine) as dag:
    df = dag.df(data).transform(price_validation)
    df.show()

SparkDataFrame
state:str|city:str                                                                       |price:long
---------+-------------------------------------------------------------------------------+----------
FL       |Orlando                                                                        |8         
FL       |Miami                                                                          |12        
FL       |Tampa                                                                          |10        
CA       |San Francisco                                                                  |16        
CA       |Los Angeles                                                                    |20        
CA       |San Diego                                                                      |18        
Total count: 6



Very few code changes were needed to bring the code to Spark. Inside the function, we only added the schema hint as a comment. The hint `# schema: *` tells Fugue the output schema of the function, where `*` means the output keeps all of the input columns. If users move away from Fugue, the hint simply stays as a helpful comment.

The `FugueWorkflow` block at the bottom is the only other addition needed to bring the code to Spark. We use the Fugue `transform` method to apply the function. Because we passed `SparkExecutionEngine` to `FugueWorkflow`, this code will now run in Spark.

## Validation by Partition with Fugue

There is one shortcoming of the current data validation frameworks. For the data we have, the price ranges of CA and FL are drastically different. Because the validation is applied per column, we don’t have a way to specify different price ranges for each location. Ideally, we could apply a different check for each group of data. This is what we call **validation by partition**.

This operation becomes trivial to perform with Fugue. In the above example, we want to apply a different validation for the data in FL and the data in CA. On average, the CA data points have a higher price, so we want to create two validation rules depending on the `state`. We do this in the code below.

In [4]:
price_check_FL = pa.DataFrameSchema({
    "price": pa.Column(pa.Int, pa.Check.in_range(min_value=7,max_value=13)),
})

price_check_CA = pa.DataFrameSchema({
    "price": pa.Column(pa.Int, pa.Check.in_range(min_value=15,max_value=21)),
})

price_checks = {'CA': price_check_CA, 'FL': price_check_FL}

# schema: *
def price_validation(df:pd.DataFrame) -> pd.DataFrame:
    location = df['state'].iloc[0]
    check = price_checks[location]
    check.validate(df)
    return df

with FugueWorkflow(SparkExecutionEngine) as dag:
    df = dag.df(data).partition(by=["state"]).transform(price_validation)
    df.show()

SparkDataFrame
state:str|city:str                                                                       |price:long
---------+-------------------------------------------------------------------------------+----------
CA       |San Francisco                                                                  |16        
CA       |Los Angeles                                                                    |20        
CA       |San Diego                                                                      |18        
FL       |Orlando                                                                        |8         
FL       |Miami                                                                          |12        
FL       |Tampa                                                                          |10        
Total count: 6



The code above should already look familiar by now. All we did was create two different Pandera schema objects. After that, we modified `price_validation` to pull the location from the DataFrame and apply the appropriate validation. There are two states in our original DataFrame: CA and FL. However, when the data enters the `price_validation` function, it is already partitioned by state because of the `partition(by=["state"])` method call before `transform()`. This means the function is applied twice: once for FL and once for CA.

Here, we are taking advantage of the `SparkExecutionEngine` by distributing the task across multiple partitions. We partition the data by `state`, and then apply different rules depending on the `state`.

## Conclusion

In this demo, we showed how Fugue allows Pandas-based data validation frameworks to be used in Spark. This is helpful for organizations that find themselves implementing validation rules twice to support both Spark and Pandas. Even though we demoed with Pandera here, the same approach works with other Pandas-based data validation libraries.

Fugue also allows users to perform **validation by partition**, a feature missing from the current data validation frameworks.