# Validation notebook

This notebook runs as a Databricks Workflows job defined in `resources/notebook_validation_job.yml`. It checks that the summary table contains valid results.

## DataFrame assert
Compare the results from the test data set against expected values generated with simpler logic. This approach adapts as the source data changes, but it puts more logic into the test.

In [None]:
# assertDataFrameEqual requires PySpark 3.5 or later
from pyspark.testing import assertDataFrameEqual

# Actual counts from the summary table under test
result_counts = spark.sql("""
        SELECT count(distinct pickup_date) dt_count, count(1) rows
        FROM main.datakickstart_dev.trip_summary
        """)

# Expected counts recomputed from the source: one row per (date, zip) group
expected_counts = spark.sql("""
        WITH source_agg AS (
            SELECT cast(tpep_pickup_datetime as date) dt,
                   pickup_zip,
                   1 as row_count
            FROM samples.nyctaxi.trips
            GROUP BY dt, pickup_zip
        )
        SELECT count(distinct dt) dt_count, count(1) rows
        FROM source_agg
        """)

assertDataFrameEqual(result_counts, expected_counts)

In [None]:
result_counts.show()
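
To make the expected-value logic concrete, the sketch below reproduces the same aggregation in plain Python on a handful of made-up trips. The `trips` list and the `expected_counts_from` helper are hypothetical and exist only for illustration. The sketch assumes, as the query above does, that the summary table holds one row per pickup date and pickup zip.

```python
from datetime import datetime

# Hypothetical sample data: (pickup timestamp, pickup zip). Not real trip records.
trips = [
    (datetime(2016, 1, 1, 8, 15), 10001),
    (datetime(2016, 1, 1, 9, 40), 10001),  # same date and zip as the first trip
    (datetime(2016, 1, 1, 17, 5), 10002),
    (datetime(2016, 1, 2, 7, 30), 10001),
]


def expected_counts_from(trips):
    """Return (dt_count, rows) the way the source_agg CTE derives them."""
    # GROUP BY dt, pickup_zip: one entry per distinct (date, zip) pair
    groups = {(ts.date(), zip_code) for ts, zip_code in trips}
    # count(distinct dt): number of distinct dates across those groups
    dt_count = len({dt for dt, _ in groups})
    # count(1): number of grouped rows
    return dt_count, len(groups)


dt_count, rows = expected_counts_from(trips)
print(dt_count, rows)
```

Four trips collapse into three (date, zip) groups spread over two dates, which is the pair of numbers the DataFrame comparison checks against the summary table.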

## Simple assert
Use this option if the counts stay consistent in the test environment, so the expected values can be hardcoded.

In [None]:
from pyspark.sql import Row

result = spark.sql("""
        SELECT count(distinct pickup_date) dt_count, count(1) rows
        FROM main.datakickstart_dev.trip_summary
        """).first()

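# Option 1: assert each field separately
assert result.dt_count == 60
assert result.rows == 3290

# Option 2: compare the whole Row at once
# (Row is a tuple subclass, so equality compares values by position)
expected_counts = Row(dt_count=60, rows=3290)
assert result == expected_counts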
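
A bare `assert` stops at the first mismatch and reports little about it. As one possible variation, the sketch below collects every mismatched field into a single error message. The `check_counts` helper is hypothetical, not part of PySpark, and it takes plain dictionaries. In the notebook, `result.asDict()` would supply the actual values.

```python
def check_counts(actual, expected):
    """Raise one AssertionError listing every field that differs."""
    mismatches = [
        f"{name}: expected {want}, got {actual.get(name)}"
        for name, want in expected.items()
        if actual.get(name) != want
    ]
    if mismatches:
        raise AssertionError("; ".join(mismatches))


# Expected values are the ones hardcoded in the cell above.
expected = {"dt_count": 60, "rows": 3290}

# Stand-in for result.asDict(); a real run would read the summary table.
actual = {"dt_count": 60, "rows": 3290}
check_counts(actual, expected)  # returns silently when all fields match
```

Raising a single error with all differences keeps the job log readable when more than one count drifts at the same time.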

In [None]:
print("No errors detected")