# Purpose

Demonstrate how to make SQL-based features testable and verifiable from a Python client, and show how the same client can run against pre-production and production environments.

# Setup

To reproduce this, you'll need to:

1. Install dependencies: `pip install google-cloud-bigquery google-cloud-storage jinja2 pyyaml`
1. Set up [GCP authentication](https://cloud.google.com/docs/authentication/getting-started)
1. Create a BigQuery table and populate it with the state-level data from [The COVID Tracking Project at The Atlantic](https://covidtracking.com/data/download).
1. Tweak the code in `client.py` to match your own setup.

# Sample data

Here's a sample query using the BigQuery client.

In [1]:
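from google.cloud import bigquery

# With no arguments, the client uses application default credentials.
client = bigquery.Client()

# Most recent rows first; swap in your own project, dataset and table.
QUERY = """
    SELECT date, state, death, hospitalized
      FROM `testable-features-poc.covid.us-states`
     ORDER BY date DESC, state ASC
     LIMIT 5
"""
query_job = client.query(QUERY)  # starts the query job
rows = query_job.result()        # blocks until the job finishes

for row in rows:
    print(row)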

Row((datetime.date(2021, 3, 7), 'AK', 305, 1293), {'date': 0, 'state': 1, 'death': 2, 'hospitalized': 3})
Row((datetime.date(2021, 3, 7), 'AL', 10148, 45976), {'date': 0, 'state': 1, 'death': 2, 'hospitalized': 3})
Row((datetime.date(2021, 3, 7), 'AR', 5319, 14926), {'date': 0, 'state': 1, 'death': 2, 'hospitalized': 3})
Row((datetime.date(2021, 3, 7), 'AS', 0, None), {'date': 0, 'state': 1, 'death': 2, 'hospitalized': 3})
Row((datetime.date(2021, 3, 7), 'AZ', 16328, 57907), {'date': 0, 'state': 1, 'death': 2, 'hospitalized': 3})


# Define & validate feature

We'll define the feature in our notebook, and without publishing it, validate it against production data.

I doubt developers will get direct access to production data from their notebooks, so in practice we'll probably need a service that proxies requests between local environments and the various prod and pre-prod environments.

In [9]:
from client import FeaturesClient, FeatureDefinition
import os

# TODO: parameters and sources can be dictionaries, not lists
feature_def = FeatureDefinition("""
  parameters:
    - foo: bar
  sources:
    - source1: 
        prod: testable-features-poc.covid.us-states
  query: |
    SELECT death
      FROM `{{ source1 }}`
     WHERE state = '{{ state }}'
     ORDER BY date DESC
     LIMIT 1
""")

c = FeaturesClient.load_feature(feature_def)
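
To make the templating step concrete, here is a minimal sketch of how a definition like the one above could be turned into SQL for a given environment. It is not the code in `client.py`: the regex substitution is a standard-library stand-in for Jinja2, the definition is written as an already-parsed dictionary (with `sources` as a dictionary, as the TODO suggests), and the `dev` table name and the `render_query` helper are made up for illustration.

```python
import re

# Hypothetical, already-parsed form of the YAML definition above.
DEFINITION = {
    "sources": {
        "source1": {
            "prod": "testable-features-poc.covid.us-states",
            "dev": "my-dev-project.scratch.us_states_fixture",  # made-up name
        }
    },
    "query": (
        "SELECT death\n"
        "  FROM `{{ source1 }}`\n"
        " WHERE state = '{{ state }}'\n"
        " ORDER BY date DESC\n"
        " LIMIT 1"
    ),
}

# Matches {{ name }} placeholders; a simplified stand-in for Jinja2 syntax.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_query(definition, env, **params):
    """Fill {{ name }} placeholders with env-specific sources and call-time params."""
    # Pick the table registered for the requested environment.
    values = {name: tables[env] for name, tables in definition["sources"].items()}
    values.update(params)
    # A KeyError here means the template references an undefined name.
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), definition["query"])
```

Calling `render_query(DEFINITION, "prod", state="DC")` yields the production query, and passing `"dev"` swaps only the table name. Because the state value is pasted into the SQL text, a real client would probably prefer BigQuery's named query parameters for user-supplied values.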

In [13]:
# connect to the production environment to validate the result
os.environ["ENV"] = "prod"

# this is a non-blocking long-running operation (LRO), like a future/promise...
c.inference(state="DC")    # expect 1030

1030

# Unit test

Now we'll write unit tests, which let us repeatedly validate the feature definition and exercise edge cases.

In [11]:
# Condition the environment via CSV data.
#   Behind the scenes, this loads the CSV data into BigQuery and adds a reference to the new table to the feature definition

data = """
"date","state","death"
"2021-11-27","DC",123
"2021-11-27","VA",456
"""

c.condition_env("dev", data)

In [14]:
# Switch to the local development environment, which uses the data we conditioned
os.environ["ENV"] = "dev"

# pytest goes here...
assert c.inference(state="DC") == 123
assert c.inference(state="quebec") is None  # no matching rows
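
The conditioning idea can also be tried without any cloud access. The sketch below loads the same CSV into an in-memory SQLite table and runs an equivalent "latest death count" query against it. SQLite is only a local stand-in for BigQuery here, and `condition_local`, `latest_deaths` and the `us_states` table name are hypothetical helpers, not part of `client.py`.

```python
import csv
import io
import sqlite3

FIXTURE_CSV = """\
"date","state","death"
"2021-11-27","DC",123
"2021-11-27","VA",456
"""


def condition_local(csv_text, table="us_states"):
    """Load CSV text into an in-memory SQLite table and return the connection."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn = sqlite3.connect(":memory:")
    conn.execute(f'CREATE TABLE "{table}" (date TEXT, state TEXT, death INTEGER)')
    conn.executemany(
        f'INSERT INTO "{table}" (date, state, death) VALUES (?, ?, ?)',
        [(r["date"], r["state"], int(r["death"])) for r in rows],
    )
    return conn


def latest_deaths(conn, state, table="us_states"):
    """Return the most recent death count for a state, or None if no row matches."""
    row = conn.execute(
        f'SELECT death FROM "{table}" WHERE state = ? ORDER BY date DESC LIMIT 1',
        (state,),
    ).fetchone()
    return row[0] if row else None
```

With `conn = condition_local(FIXTURE_CSV)`, `latest_deaths(conn, "DC")` returns the conditioned value and an unknown state returns `None`, mirroring the two assertions above.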

Note that the results differ between the `dev` and `prod` environments. We can switch back to `prod` to resume querying the production database:

In [15]:
os.environ["ENV"] = "prod" 

c.inference(state="DC")    # expect 1030

1030
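
Setting `ENV` by hand, as in the cells above, makes it easy to leave a notebook or test session pointed at the wrong environment. One possible guard is a small context manager that restores the previous value on exit. `use_env` is a hypothetical helper, not something `client.py` provides, and it assumes the client reads `ENV` at call time.

```python
import os
from contextlib import contextmanager


@contextmanager
def use_env(name, var="ENV"):
    """Temporarily set os.environ[var] to name, restoring the old state on exit."""
    missing = object()  # sentinel: distinguishes "unset" from any real value
    previous = os.environ.get(var, missing)
    os.environ[var] = name
    try:
        yield
    finally:
        if previous is missing:
            os.environ.pop(var, None)
        else:
            os.environ[var] = previous


# Example usage with the client from this notebook:
# with use_env("dev"):
#     assert c.inference(state="DC") == 123
```

Inside pytest, the built-in `monkeypatch.setenv("ENV", "dev")` fixture gives the same restore-on-teardown behaviour without a custom helper.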