Common data quality considerations include, but are not limited to:

    Completeness
    Consistency
    Correctness
    Uniqueness

Expectations are assertions about the data. As the name implies, they validate whether the data is what we expect it to be. Great Expectations comes with predefined expectations for common data quality checks, for example:

    expect_column_values_to_not_be_null
    expect_column_values_to_be_unique
    expect_column_values_to_match_regex
    expect_column_median_to_be_between
    expect_table_row_count_to_be_between
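To make the idea of an expectation concrete, here is a minimal plain-Python sketch of what such an assertion does conceptually. The function `expect_values_to_not_be_null` and the dictionary it returns are hypothetical; they only mimic the shape of the results printed later in this notebook and are not how Great Expectations is implemented.

```python
def expect_values_to_not_be_null(values):
    """Hypothetical stand-in: assert that no value in `values` is None."""
    unexpected = [v for v in values if v is None]
    return {
        "success": len(unexpected) == 0,
        "result": {
            "element_count": len(values),
            "unexpected_count": len(unexpected),
        },
    }


# Made-up sample values, for illustration only
jobs = ["management", "technician", None, "blue-collar"]
outcome = expect_values_to_not_be_null(jobs)
print(outcome["success"])  # False, because one value is None
```

The point is that an expectation does not raise on bad data; it returns a structured result that says whether the assertion held and how many values violated it.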

In [3]:
import great_expectations as ge
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset
from pyspark.sql import functions as f, Window
import json
from pyspark.sql import SparkSession


In [4]:
spark = SparkSession.builder.appName('com.spark-dataframe').getOrCreate()


In [12]:
# Load the semicolon-separated CSV, turn the 'unknown' job placeholder into null, and add a row id
df = spark.read.format('csv').option('header', True).option('sep', ';')\
    .load('./Data/bank/bank-full.csv')\
    .withColumn('job', f.when(f.col('job') == 'unknown', f.lit(None)).otherwise(f.col('job')))\
    .withColumn('id', f.monotonically_increasing_id())

In [15]:
df.show(5,truncate=True)

+---+------------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+---+
|age|         job|marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|  y| id|
+---+------------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+---+
| 58|  management|married| tertiary|     no|   2143|    yes|  no|unknown|  5|  may|     261|       1|   -1|       0| unknown| no|  0|
| 44|  technician| single|secondary|     no|     29|    yes|  no|unknown|  5|  may|     151|       1|   -1|       0| unknown| no|  1|
| 33|entrepreneur|married|secondary|     no|      2|    yes| yes|unknown|  5|  may|      76|       1|   -1|       0| unknown| no|  2|
| 47| blue-collar|married|  unknown|     no|   1506|    yes|  no|unknown|  5|  may|      92|       1|   -1|       0| unknown| no|  3|
| 33|        null| single|  unknown|     no|      1|     no|  

SparkDFDataset is a thin wrapper around a PySpark DataFrame that lets us call Great Expectations methods on it directly.

In [17]:
gdf = SparkDFDataset(df)

In [20]:
expected_columns = ['age', 'job', 'marital',
                    'education', 'default', 'balance',
                    'housing', 'loan', 'contact',
                    'day', 'month', 'duration',
                    'campaign', 'pdays', 'previous',
                    'poutcome', 'y','id']

In [21]:
gdf.expect_table_columns_to_match_set(column_set = expected_columns)

{
  "meta": {},
  "result": {
    "observed_value": [
      "age",
      "job",
      "marital",
      "education",
      "default",
      "balance",
      "housing",
      "loan",
      "contact",
      "day",
      "month",
      "duration",
      "campaign",
      "pdays",
      "previous",
      "poutcome",
      "y",
      "id"
    ]
  },
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [22]:
gdf.expect_column_values_to_be_in_set(column = 'marital', value_set = {'single', 'married', 'divorced'})


{
  "meta": {},
  "result": {
    "element_count": 45211,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [23]:
gdf.expect_column_values_to_not_be_null(column = 'job')


{
  "meta": {},
  "result": {
    "element_count": 45211,
    "unexpected_count": 288,
    "unexpected_percent": 0.6370131162770122,
    "unexpected_percent_total": 0.6370131162770122,
    "partial_unexpected_list": [
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null
    ]
  },
  "success": false,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
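The percentages in the result above follow directly from the two counts. The sketch below recomputes `unexpected_percent` and then checks the column against a tolerance. Great Expectations column expectations accept a `mostly` argument for this kind of tolerance; the value 0.99 is an assumption chosen for illustration, and the comparison here only approximates that behavior in plain Python.

```python
element_count = 45211    # rows checked, from the result above
unexpected_count = 288   # null values found in `job`

unexpected_percent = unexpected_count / element_count * 100
print(round(unexpected_percent, 4))  # 0.637

# Hypothetical tolerance: accept the column if at least 99% of values are non-null
mostly = 0.99
passing_fraction = 1 - unexpected_count / element_count
print(passing_fraction >= mostly)  # True
```

With a strict check the expectation fails, as shown above; with a 99% tolerance the same data would be considered acceptable.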

In [24]:
gdf.expect_column_values_to_be_unique('id')


{
  "meta": {},
  "result": {
    "element_count": 45211,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
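With several checks run, it can help to summarize which ones failed before moving on. The sketch below does this in plain Python, using the `success` flags copied from the four results above; the `results` dictionary and its labels are hypothetical and stand in for the objects returned by the `gdf.expect_*` calls.

```python
# Success flags copied from the four results shown above
results = {
    "expect_table_columns_to_match_set": True,
    "expect_column_values_to_be_in_set (marital)": True,
    "expect_column_values_to_not_be_null (job)": False,
    "expect_column_values_to_be_unique (id)": True,
}

failed = [name for name, ok in results.items() if not ok]
print(f"{len(results) - len(failed)} of {len(results)} expectations passed")
for name in failed:
    print("FAILED:", name)
```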

In [25]:
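# Collect every expectation run on gdf so far; keep the failed ones (the null check on 'job') in the suite too
expectation_suite = gdf.get_expectation_suite(discard_failed_expectations=False)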


In [26]:
# save expectation suite
with open('my_expectation_suite.json', 'w') as my_file:
    my_file.write(
        json.dumps(expectation_suite.to_json_dict())
    )
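Because the suite is saved as ordinary JSON, it can be read back and inspected with the standard library. The sketch below writes a small stand-in suite to a temporary directory and loads it again. The keys `expectations`, `expectation_type`, and `kwargs` are an assumption about the file layout, and `sample_suite` is a hypothetical example; check them against the file actually written above.

```python
import json
import os
import tempfile

# Stand-in for the saved suite; the key names are an assumption about the file layout
sample_suite = {
    "expectation_suite_name": "default",
    "expectations": [
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "job"}},
        {"expectation_type": "expect_column_values_to_be_unique",
         "kwargs": {"column": "id"}},
    ],
}

# Write to a temporary directory so the real suite file is not overwritten
path = os.path.join(tempfile.mkdtemp(), "my_expectation_suite.json")
with open(path, "w") as my_file:
    json.dump(sample_suite, my_file)

# Read the suite back and list which expectations it contains
with open(path) as my_file:
    loaded = json.load(my_file)

expectation_types = [e["expectation_type"] for e in loaded["expectations"]]
print(expectation_types)
```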