# Data Validation with Fugue

Fugue can be applied to data validation. Data validation means having checks in place to make sure that data arrives in the format and with the specifications that we expect. As data pipelines become more interconnected, the chance that a change unintentionally breaks another pipeline also increases. Validations help guarantee that upstream changes do not break the integrity of downstream data operations. Common data validation patterns include checking for NULL values and checking DataFrame shape to ensure transformations don’t drop any records. Other frequently used checks are column existence and schema. Data validation guards against silent failures, where every step runs successfully but the results are inaccurate.
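
To make the patterns above concrete, here is a minimal sketch in plain Pandas. The helper name `basic_checks` and the sample frame are hypothetical and exist only for illustration; in practice a validation library would usually replace hand-written checks like these.

```python
import pandas as pd


def basic_checks(df: pd.DataFrame, required_columns: list, expected_rows: int) -> list:
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper, for illustration only.
    """
    problems = []

    # Column existence: every required column must be present.
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # NULL check: count missing values in the required columns that do exist.
    present = [c for c in required_columns if c in df.columns]
    for column, count in df[present].isna().sum().items():
        if count > 0:
            problems.append(f"{column} has {count} null value(s)")

    # Shape check: a transformation should not silently drop records.
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")

    return problems


# Sample data with one null state and no city column (made up for this sketch).
sample = pd.DataFrame({"state": ["FL", None, "CA"], "price": [8, 12, 16]})
problems = basic_checks(sample, required_columns=["state", "city", "price"], expected_rows=3)
```

Collecting problems in a list rather than raising on the first one is a design choice for this sketch: it reports every issue in a single pass, which tends to be more useful when debugging an upstream change.
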

Data validation can be placed at the start of a data pipeline to make sure that the transformations run smoothly, and it can also be placed at the end to make sure everything is correct before output gets committed to the database. This is where a tool like Pandera can be used. For this post, we’ll make a small Pandas DataFrame to show examples. It has three columns: `state`, `city`, and `price`.

Using Fugue for validation provides the following:
* Ability to use Pandas libraries on Spark
* Validation on each partition of data
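
As a rough illustration of the second point, the sketch below validates each group of rows separately, with different bounds per group. It uses plain Pandas only, with `groupby` standing in for the partitioning that a distributed engine would do. The names `PRICE_BOUNDS` and `validate_partition`, and the bounds themselves, are assumptions made for this example and are not part of Fugue.

```python
import pandas as pd

# Hypothetical per-state price bounds, used only for this illustration.
PRICE_BOUNDS = {"FL": (8, 12), "CA": (16, 20)}


def validate_partition(partition: pd.DataFrame) -> pd.DataFrame:
    """Validate one partition (all rows share one state) and return it unchanged."""
    state = partition["state"].iloc[0]
    low, high = PRICE_BOUNDS[state]
    # Series.between is inclusive on both ends by default.
    bad = partition[~partition["price"].between(low, high)]
    if not bad.empty:
        raise ValueError(f"{state}: {len(bad)} price(s) outside [{low}, {high}]")
    return partition


df = pd.DataFrame({"state": ["FL", "FL", "CA", "CA"], "price": [8, 12, 16, 20]})

# Split by state, validate each piece on its own, then stitch the pieces back together.
validated = pd.concat([validate_partition(part) for _, part in df.groupby("state")])
```

Because the validation function takes a Pandas DataFrame and returns one, the same function can in principle be reused unchanged when the splitting is handled by a framework instead of by `groupby`.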

```python
import pandas as pd

data = pd.DataFrame({'state': ['FL','FL','FL','CA','CA','CA'],
                     'city': ['Orlando', 'Miami', 'Tampa',
                              'San Francisco', 'Los Angeles', 'San Diego'],
                     'price': [8, 12, 10, 16, 20, 18]})
data.head()
```

|   | state | city | price |
|---|-------|------|-------|
| 0 | FL | Orlando | 8 |
| 1 | FL | Miami | 12 |
| 2 | FL | Tampa | 10 |
| 3 | CA | San Francisco | 16 |
| 4 | CA | Los Angeles | 20 |


For the above DataFrame, we want to guarantee that the price is within a certain range: every value in the `price` column should be at least 8 and not more than 20. This is where we can use a data validation framework. We'll use Pandera to apply the check because it is lightweight and has an expressive syntax.

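```python
import pandera as pa

# Require an integer `price` column with every value in the range 8 to 20.
price_check = pa.DataFrameSchema({
    "price": pa.Column(int, pa.Check.in_range(min_value=8, max_value=20)),
})

# The comment directly above the function is a Fugue schema hint:
# the output keeps the same columns as the input.
# schema: *
def price_validation(data: pd.DataFrame) -> pd.DataFrame:
    # Raises a Pandera schema error if any price is out of range.
    price_check.validate(data)
    return data

price_validation(data)
```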
