PoC - Data Quality in Fabric. 

This notebook depends on the Environment _Libraries_, which was created beforehand and attached through the dropdown menu at the top. The Environment makes the required libraries available directly in the session, so they do not have to be installed with _pip install_ every time the notebook runs.

Importing the Great Expectations libraries

In [None]:
import great_expectations as gx
from great_expectations import expectations as gxe

Creating a Context: it provides the default storage location for the metadata of expectations, data sources, validators, and so on.

In [None]:
context = gx.get_context()

print(type(context).__name__)

#context = gx.get_context(mode="file")
#context = gx.get_context(mode="file", project_root_dir="./new_context_folder")

A data source defines how to connect to the data.

In [4]:
datasource_name = "spark_datasource"

#data_source = context.data_sources.add_pandas(name=datasource_name)
data_source = context.data_sources.add_spark(name=datasource_name)

# Once created, get the data source if necessary. 
#data_source = context.data_sources.get(datasource_name)

data_asset_name = "my_dataframe_data_asset"



A Data Asset is a collection of records within a data source, and validation results are grouped by asset. It should group logical content together (for instance, Orders across all integration steps).

In [5]:
data_asset_name = "my_dataframe_data_asset"
data_asset = data_source.add_dataframe_asset(name=data_asset_name)


A batch definition defines how the data should be retrieved. Here the whole DataFrame is used as a single batch.

In [6]:
batch_definition_name = "my_batch_definition"
batch_definition = data_asset.add_batch_definition_whole_dataframe(batch_definition_name)


In [None]:
FactsDf = spark.sql("SELECT * FROM Facts LIMIT 1000")
batch_parameters = {"dataframe": FactsDf}


In [8]:
batch_definition = (
    context.data_sources.get(datasource_name)
    .get_asset(data_asset_name)
    .get_batch_definition(batch_definition_name)
)


Once the data source and the retrieval method are configured, we define the list of expectations used to assess our data.

List of expectations : https://greatexpectations.io/expectations/

Here we test that productId is never null:

In [None]:
productId_notNull = gx.expectations.ExpectColumnValuesToNotBeNull(
    column="productId",
    mostly=1  # ratio (from 0 to 1) of values that must be non-null for the expectation to pass
)

# Get the dataframe as a Batch
batch = batch_definition.get_batch(batch_parameters=batch_parameters)

validation_results = batch.validate(productId_notNull)
print(validation_results)
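
To make the `mostly` parameter concrete, here is a minimal pure-Python sketch of the threshold logic as we understand it: the expectation passes when the ratio of non-null values is at least `mostly`. The helper `not_null_passes` and the sample values are hypothetical; they only illustrate the idea and are not part of Great Expectations.

```python
def not_null_passes(values, mostly=1.0):
    """Return (success, observed_ratio) for a not-null check with a `mostly` threshold."""
    if not values:
        # Assumption for this sketch: an empty column passes vacuously.
        return True, 1.0
    non_null = sum(v is not None for v in values)
    ratio = non_null / len(values)
    return ratio >= mostly, ratio


# Hypothetical sample: 3 of the 4 values are present.
product_ids = ["P1", "P2", None, "P4"]

strict = not_null_passes(product_ids, mostly=1.0)    # every value must be non-null
lenient = not_null_passes(product_ids, mostly=0.75)  # 75% non-null is enough

print(strict)
print(lenient)
```

With `mostly=1` a single null is enough to fail the check, which is the behaviour we want for a key column such as productId.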


The result is good, since success is true. We only tested a single expectation, but we can set up a list of expectations, called a **suite**:

In [None]:
suite_name = "Facts_ExpectationsSuite"
suite = gx.ExpectationSuite(name=suite_name)
suite = context.suites.add(suite)


We then add the expectations to the suite : 

In [None]:
productId_notNull = gx.expectations.ExpectColumnValuesToNotBeNull(
    column="productId",
    mostly=1  # ratio (from 0 to 1) of values that must be non-null for the expectation to pass
)

suite.add_expectation(productId_notNull)  # the suite was created in the previous cell

billing_postal_code_notNull = gx.expectations.ExpectColumnValuesToNotBeNull(
    column="billing_postal_code",
    mostly=1
)

suite.add_expectation(billing_postal_code_notNull)

In [None]:
validation_results = batch.validate(suite)
print(validation_results)
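
When a whole suite is validated, the printed result contains one entry per expectation plus an overall status. The sketch below shows, in plain Python, how we understand that roll-up: the suite succeeds only if every expectation succeeds. The function `summarize` and the sample entries are hypothetical, and the statistics key names mirror what the printed output is assumed to contain.

```python
def summarize(results):
    """Roll per-expectation outcomes up into a suite-level summary."""
    evaluated = len(results)
    successful = sum(1 for r in results if r["success"])
    return {
        # The suite passes only when no expectation failed.
        "success": successful == evaluated,
        "statistics": {
            "evaluated_expectations": evaluated,
            "successful_expectations": successful,
            "unsuccessful_expectations": evaluated - successful,
            "success_percent": 100.0 * successful / evaluated if evaluated else None,
        },
    }


# Hypothetical outcomes for the two not-null expectations of the suite.
sample_results = [
    {"column": "productId", "success": True},
    {"column": "billing_postal_code", "success": False},
]

summary = summarize(sample_results)
print(summary)
```

A single failing expectation is therefore enough to mark the whole suite as failed, even when most checks pass.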

Once the expectations have been tested on a sample, we can assess the full table: we redefine the batch and validate it against the suite:

In [None]:
FactsDf = spark.sql("SELECT productId, billing_postal_code, user_contact_email FROM Facts")
batch_parameters = {"dataframe": FactsDf}

user_contact_email_notEmpty = gx.expectations.ExpectColumnValueLengthsToBeBetween(
    column="user_contact_email",
    min_value=1,
    max_value=256
)

billing_postal_code_notEmpty = gx.expectations.ExpectColumnValueLengthsToBeBetween(
    column="billing_postal_code",
    min_value=1,
    max_value=10
)

suite.add_expectation(user_contact_email_notEmpty)
suite.add_expectation(billing_postal_code_notEmpty)

batch = batch_definition.get_batch(batch_parameters=batch_parameters)

validation_results = batch.validate(suite)
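
The suite checks billing_postal_code both for nulls and for length. The sketch below illustrates why the two checks complement each other, under the assumption that a length check skips null values and only evaluates the values that are present. The helpers `length_outside` and `count_nulls` and the sample data are hypothetical and written in plain Python.

```python
def length_outside(values, min_value, max_value):
    """Return the non-null values whose length is outside [min_value, max_value]."""
    present = [v for v in values if v is not None]  # nulls are skipped, not flagged
    return [v for v in present if not (min_value <= len(v) <= max_value)]


def count_nulls(values):
    """Return how many values are null."""
    return sum(v is None for v in values)


# Hypothetical sample: one valid code, one empty string, one null, one too long.
postal_codes = ["75001", "", None, "12345678901"]

bad_lengths = length_outside(postal_codes, min_value=1, max_value=10)
null_count = count_nulls(postal_codes)

print(bad_lengths)
print(null_count)
```

The length check catches the empty string and the overlong value, while only the not-null check catches the missing value.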

In [18]:
import json
import pandas as pd


In [19]:
jsonResult = json.loads(str(validation_results))
resultList = []


In [20]:
for result in jsonResult['results']:
    success_status = result['success']
    expectation = result['expectation_config']['type']
    dataset_column = result['expectation_config']['kwargs']['column']
    element_count = result['result']['element_count']
    unexpected_percent = result['result']['unexpected_percent']
    
    resultList.append({
        'success_status': success_status,
        'expectation': expectation,
        'dataset_column': dataset_column,
        'element_count': element_count,
        'unexpected_percent': unexpected_percent
    })

df = pd.DataFrame(resultList)
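
The loop above indexes every key directly, which works for the column-level expectations used in this notebook. If a table-level expectation were added to the suite, its kwargs would have no `column` entry and the direct lookup would raise a KeyError. A more defensive variant could look like the sketch below; the function `flatten_result` and the hand-written sample payload are hypothetical and only mimic the structure read by the loop.

```python
def flatten_result(result):
    """Flatten one validation result entry into a flat dict, tolerating missing keys."""
    config = result.get("expectation_config", {})
    kwargs = config.get("kwargs", {})
    details = result.get("result", {})
    return {
        "success_status": result.get("success"),
        "expectation": config.get("type"),
        "dataset_column": kwargs.get("column"),  # None for table-level expectations
        "element_count": details.get("element_count"),
        "unexpected_percent": details.get("unexpected_percent"),
    }


# Hypothetical payload: one column-level entry and one table-level entry.
sample_results = [
    {
        "success": True,
        "expectation_config": {
            "type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "productId", "mostly": 1},
        },
        "result": {"element_count": 1000, "unexpected_percent": 0.0},
    },
    {
        "success": True,
        "expectation_config": {
            "type": "expect_table_row_count_to_be_between",
            "kwargs": {"min_value": 1},
        },
        "result": {"observed_value": 1000},
    },
]

flat_rows = [flatten_result(r) for r in sample_results]
for row in flat_rows:
    print(row)
```

The resulting list of dicts can be passed to `pd.DataFrame` exactly like `resultList`.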


In [None]:
import matplotlib.pyplot as plt

df['success_percent'] = 100 - df['unexpected_percent']
df['expectation_column'] = df['expectation'] + " - " + df['dataset_column']
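# Number of rows that did not meet the expectation (rounded to avoid float truncation)
df['unexpected_count'] = (df["element_count"] * df['unexpected_percent'] / 100).round()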

fig, ax = plt.subplots(figsize=(12, 8))

for index, row in df.iterrows():
    color = 'green' if row['success_status'] else 'red'
    
    ax.barh(row['expectation_column'], row['success_percent'], color=color, label='Success' if row['success_status'] else 'Failure')
    # Draw the unexpected share in a distinct color so it is not confused with the success share
    ax.barh(row['expectation_column'], row['unexpected_percent'], left=row['success_percent'], color='orange', label='Unexpected Percent')
    ax.text(row['success_percent'] + row['unexpected_percent'] / 2, index, f'{int(row["unexpected_count"])}', ha='center', va='center', color='black', fontsize=10)

ax.set_xlabel('Percentage (%)', fontsize=14)
ax.set_ylabel('Expectation - Column', fontsize=14)
ax.set_title('Expectations and Columns', fontsize=16)

#ax.legend(loc='lower center')

plt.tight_layout()
plt.show()


In [None]:
gxDf = spark.createDataFrame(df)
gxDf.write.mode("overwrite").option("overwriteSchema", "True").format("delta").saveAsTable("gx_results")