![](lecture_title.png)


# About Me 🙈

- Data architect at bigabid 👷.
- Data architect consultant 🤓.
- Big passion for python, big data, databases and machine learning 🐍🤖.
- Online at [medium](https://medium.com/@Eyaltra) | [website](https://eyaltrabelsi.github.io/) 🌐

# Today


<br>

- Data-intensive applications in general.
- Testing in general.
- Tests to avoid dumb bugs.
- Tests to check code correctness.
- Tests to check artifacts' correctness.

# Data-intensive applications




<br>

- Complex by nature.
- Consist of configurations, scripts/notebooks, queries, and pipelines.
- Result in datasets, models, and dashboards.
- Break in different ways than regular software.

<i><big>"Testing is a contract between the present and the future."</big></i>📄

# Should we do automatic testing?

<br>

    + Reduces "whoops" moments 🦺.
    + Enables fearless refactoring 🏗️.
    + Documentation of the program expectations 📚.

    - Knowing what can go wrong and how to test requires some experience 👴.
    - They can fail incorrectly 🤡.
    - Writing and maintaining them takes time ⏰.
    - May affect system flexibility 🤸‍♀️.
    - May take a long time to run 🐢.

# What is enough automatic testing?


<br>

- [Testing is a spectrum](http://squarism.com/2018/11/08/testing-is-a-spectrum/) 🌈.
- There's no such thing as no testing (hopefully🙏). 
- Write tests whose estimated value is greater than their estimated cost 🤑. 
- Is test coverage a useful metric? 🗺️
- Incremental by nature 🪜.

- **Pro Tip**💃: Test only what matters.
- **Pro Tip**💃: Use other methodologies when beneficial.
- **Pro Tip**💃: Cheat on your homework.

# We don't want dumb bugs


<br>

| What | Expectations | How | Do I Cover this? |
| --- | --- | --- |--- |
| Code | Has valid syntax | [pre-commit](https://pre-commit.com/) |🟩|
| Code | Has valid syntax and types | [mypy](https://mypy.readthedocs.io/en/stable/getting_started.html) |🟥|
| Pipeline | DAGs are imported correctly | Importing DAGs |🟩|
| Query | Has valid syntax | [sqlfluff](https://github.com/sqlfluff/sqlfluff) |🟥 |
| Query | Has valid syntax and schema | `EXPLAIN` |🟩|
| Configurations | Has valid syntax | [pre-commit](https://pre-commit.com/) |🟩|
| Configurations | Has valid syntax and schema | [cerberus](https://docs.python-cerberus.org/en/stable/) | 🟥|


## pre-commit

<br>

- pre-commit identifies simple issues using static code analysis.
- Static analysis is a form of testing that looks at the code without running it.
- pre-commit has many types of validations (hooks).


``` yaml
# .pre-commit-config.yaml
repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
    -   id: check-ast
    -   id: check-yaml
    -   id: check-json
```

- **Difficulty**: Easy to test 👶.
- **Execution time**: Fast 🐆.

- **Pro Tip**💃: start with the basics and add schema/types if needed.
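
To make "looks at the code without running it" concrete, here is a minimal sketch of roughly the kind of check a syntax hook such as `check-ast` performs, written with Python's standard `ast` module. The helper name `find_syntax_errors` and the sample snippets are assumptions for illustration, not part of pre-commit.

```python
import ast


def find_syntax_errors(sources):
    """Return {name: error message} for every source that fails to parse.

    `sources` maps a (hypothetical) file name to its source code.
    Nothing is executed: ast.parse only builds the syntax tree.
    """
    errors = {}
    for name, code in sources.items():
        try:
            ast.parse(code, filename=name)
        except SyntaxError as exc:
            errors[name] = f"line {exc.lineno}: {exc.msg}"
    return errors


sample_sources = {
    "good.py": "def add(a, b):\n    return a + b\n",
    "bad.py": "def add(a, b)\n    return a + b\n",  # missing colon
}
syntax_errors = find_syntax_errors(sample_sources)
print(syntax_errors)
```

Only `bad.py` ends up in `syntax_errors`; the exact message depends on the Python version.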

## Queries have valid syntax and schema

<br>

- Prepending `EXPLAIN` to a query returns the query's execution plan instead of running it.
- If your query is invalid, the `EXPLAIN` query will fail.
- Available in most SQL dialects.
- Requires database connectivity.



``` python
import glob

from sqlalchemy import create_engine, text


def test_queries_syntax():
    engine = create_engine('<COOL_CONNECTION_STRING>')
    with engine.connect() as con:
        # Check every .sql file under the current directory.
        for query_file in glob.glob('**/*.sql', recursive=True):
            with open(query_file) as f:
                query = f.read()
            # EXPLAIN plans the query without running it; invalid SQL raises here.
            con.execute(text(f"EXPLAIN {query}"))
```

- **Difficulty**: Easy* to test 👶.
- **Execution time**: Fast* 🐆.

- **Pro Tip**💃: much "stronger" than static code analysis for queries.
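
To try the idea without a real database, the sketch below uses the standard-library `sqlite3` module with an in-memory schema. The table, the query names, and the helper `explain_errors` are made up for illustration; your own dialect's `EXPLAIN` output and error types will differ.

```python
import sqlite3


def explain_errors(connection, queries):
    """Run EXPLAIN on each query and collect the ones the database rejects."""
    failures = {}
    for name, query in queries.items():
        try:
            # EXPLAIN compiles the statement against the schema without running it.
            connection.execute(f"EXPLAIN {query}")
        except sqlite3.Error as exc:
            failures[name] = str(exc)
    return failures


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")

# Hypothetical queries: one valid, one with a typo, one using a missing column.
queries = {
    "valid.sql": "SELECT user_id, SUM(amount) FROM events GROUP BY user_id",
    "bad_syntax.sql": "SELEC user_id FROM events",
    "bad_schema.sql": "SELECT missing_column FROM events",
}
explain_failures = explain_errors(connection, queries)
print(sorted(explain_failures))
```

The `bad_schema.sql` case is the one a purely static linter would typically miss, because catching it requires knowing the actual schema.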

## DAGs are imported correctly

<br>

- Parsing and loading all the pipelines.
- Different syntax in different orchestrators.



``` python
from airflow.models import DagBag


def test_no_import_errors():
    dag_bag = DagBag(dag_folder='<DAG_FOLDER>', include_examples=False)
    # import_errors maps each failing file to its error message.
    assert len(dag_bag.import_errors) == 0, f"DAG import failures: {dag_bag.import_errors}"
```

- **Difficulty**: Easy to test 👶.
- **Execution time**: Fast 🐆.

- **Pro Tip**💃: may be annoying with half-baked environments.
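
For orchestrators without a `DagBag` equivalent, the same check can be approximated by importing every pipeline definition file and collecting the failures. This is a rough standard-library sketch, not any orchestrator's API; `collect_import_errors` and the two throwaway pipeline files are hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path


def collect_import_errors(folder):
    """Import every .py file in `folder` and return {file name: error}."""
    errors = {}
    for path in sorted(Path(folder).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        try:
            # Runs the file's top-level code, which is where pipelines get defined.
            spec.loader.exec_module(module)
        except Exception as exc:  # any failure at import time counts
            errors[path.name] = repr(exc)
    return errors


# Demo with two throwaway pipeline files (hypothetical content).
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "good_pipeline.py").write_text("STEPS = ['extract', 'load']\n")
    Path(folder, "broken_pipeline.py").write_text("import not_a_real_module_xyz\n")
    import_errors = collect_import_errors(folder)

print(import_errors)
```

A test would then assert that `import_errors` is empty and print its content on failure.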

# Testing the code behavior

<br>

| What | Expectations | How | Do I Cover this? |
| --- | --- | --- |--- |
| Code | Works as expected on specific input | [pytest](https://docs.pytest.org/en/7.1.x/)/[doctest](https://docs.python.org/3/library/doctest.html) |🟩|
| Code | Is fast/scalable enough | [pytest](https://docs.pytest.org/en/7.1.x/) |🟥|
| Code | Works as expected on potential inputs | [pytest](https://docs.pytest.org/en/7.1.x/) + [hypothesis](https://hypothesis.readthedocs.io/) |🟩|
| Pipeline | Operators work correctly | [pytest](https://docs.pytest.org/en/7.1.x/) + [orchestrator code](https://godatadriven.com/blog/testing-and-debugging-apache-airflow/)  |🟥| 


## Works as expected on specific input

<br>

- Check the "happy path".
- Given some example input(s), the output is correct.
- Counter-examples should show up as incorrect.
- For models, check that a single gradient step on a batch of data decreases the loss.
- For pipelines, test the operator code.



``` python
def test_cool_code_on_specific_input():
    # bag_of_words is the function under test (defined elsewhere in the project).
    text = 'WOW this is an awesome lecture'
    text_bagged = bag_of_words(text)
    assert len(text_bagged) == 6
    assert ' ' not in text_bagged
```

- **Difficulty**: Somewhat easy to test 👦.
- **Execution time**: Tend to be fast 🐇.

- **Pro Tip**💃: small is better.
- **Pro Tip**💃: cool functionality like `@pytest.mark.parametrize`, `@pytest.mark.skipif`, etc.
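
Running one test body over a table of example inputs is what pytest's parametrization gives you. The sketch below shows the same pattern with the standard-library `unittest` and `subTest`, so it runs without extra dependencies. `bag_of_words` here is a hypothetical stand-in (lowercased word counts), since the talk does not show its implementation.

```python
import unittest
from collections import Counter


def bag_of_words(text):
    """Hypothetical stand-in: count lowercased, whitespace-separated words."""
    return Counter(text.lower().split())


class TestBagOfWords(unittest.TestCase):
    # (input text, expected bag) pairs, including a repeated word and edge cases.
    CASES = [
        ("WOW this is an awesome lecture",
         {"wow": 1, "this": 1, "is": 1, "an": 1, "awesome": 1, "lecture": 1}),
        ("data data everywhere", {"data": 2, "everywhere": 1}),
        ("", {}),
        ("   ", {}),
    ]

    def test_specific_inputs(self):
        for text, expected in self.CASES:
            # subTest reports each failing case separately.
            with self.subTest(text=text):
                self.assertEqual(bag_of_words(text), expected)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBagOfWords)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Adding a new example is then a one-line change to `CASES`, which keeps each test small.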

## Works as expected on potential inputs

<br>

- Given a specification of the input, check that the output satisfies general properties.
- Generating inputs using a specific strategy.
- You can create custom strategies and use generators like [faker](https://github.com/joke2k/faker).
- Many cool integrations if schema/type is known.
- Properties you could test:
    - The code does not crash.
    - Contract.
    - Equivalent functions return the same results.
    - Idempotent, commutative, associative etc.
    - Stateful model-based tests.
    - Data invariants.



``` python
from hypothesis import given
from hypothesis import strategies as st


@given(st.text())
def test_cool_code_properties(s):
    text_bagged = bag_of_words(s)
    # Trivially true: in practice this checks that the call does not crash.
    assert len(text_bagged) >= 0
    assert ' ' not in text_bagged
```

- **Difficulty**: Confusing, but somewhat easy to test 👦🤯.
- **Execution time**: Tend to be slow 🐢.

- **Pro Tip**💃: when facing users' input, unit-tests are not enough.
- **Pro Tip**💃: cool functionality like @given.
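
To show what a few of the listed properties look like as assertions, here is a hand-rolled sketch that uses the standard-library `random` module in place of hypothesis strategies. It is a simplification: hypothesis also shrinks failing inputs to minimal examples, which this loop does not. `bag_of_words` and `random_text` are hypothetical stand-ins.

```python
import random
import string
from collections import Counter


def bag_of_words(text):
    """Hypothetical stand-in: count lowercased, whitespace-separated words."""
    return Counter(text.lower().split())


def random_text(rng, max_length=40):
    """Tiny stand-in for a strategy: ASCII letters, digits and whitespace."""
    alphabet = string.ascii_letters + string.digits + " \t\n"
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_length)))


def check_bag_of_words_properties(examples=500, seed=0):
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(examples):
        left, right = random_text(rng), random_text(rng)
        bag = bag_of_words(left)
        # Data invariant: no token contains whitespace.
        assert not any(ch.isspace() for word in bag for ch in word)
        # Contract: counts add up to the number of words in the input.
        assert sum(bag.values()) == len(left.split())
        # Commutative: the order of concatenation does not change the bag.
        assert bag_of_words(left + " " + right) == bag_of_words(right + " " + left)
        # Additive: the bag of a concatenation is the sum of the bags.
        assert bag_of_words(left + " " + right) == bag + bag_of_words(right)
    return examples


checked = check_bag_of_words_properties()
```

None of these assertions names a specific expected output, which is the point: they describe what must hold for any input.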

# Testing the code artifacts

<br>

| What | Expectations |  Do I Cover this? |
| --- | --- | --- |
| Dataset | Generated dataset works as expected |  🟩|
| Model | Generated model works as expected |  🟩|
| Model | Is fast/scalable enough |  🟥|
| Dashboard | Is fast/scalable enough |🟥|
| Dashboard | [Generated dashboard works as expected](https://www.youtube.com/watch?v=-MoWsngubI4&ab_channel=TableauTim)|  🟥|

## Generated dataset works as expected

<br>

- Does the model perform worse than it did on previous datasets? Use reference testing on mocked input.
- Testing properties of the dataset:
    - Has valid schema.
    - Cardinality.
    - Number of missing values.
    - Underlying distributions.
    - Check for leakage.
 

    

``` python
def test_cool_dataset_properties():
    # load_mock_input, run_pipeline and load_mock_data are project-specific helpers.
    df = load_mock_input("<COOL_INPUT_DATA_PATH>")
    run_pipeline(df)
    actual_df = load_mock_data("<COOL_DATA_PATH>")
    assert set(actual_df.columns) == {"X", "Y", "Z"}
    assert actual_df["X"].isna().sum() == 0
```

- **Difficulty**: Confusing 🤯, and tends to be hard to test 👨.
- **Execution time**: Tend to be slow 🐢.

- **Pro Tip**💃: Useful as a data quality mechanism in production.
- **Pro Tip**💃: Use tools like great-expectations.
- **Pro Tip**💃: Treat any shared data as immutable.
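
The dataset properties listed above (schema, cardinality, missing values, leakage) can each be written as a small check. The sketch below uses plain lists of dicts so it runs without pandas. The column names, the 10% missing-value threshold, and both helper functions are assumptions for illustration.

```python
def dataset_problems(rows, expected_columns, key, max_missing_ratio=0.1):
    """Return a list of human-readable problems found in `rows` (list of dicts)."""
    if not rows:
        return ["dataset is empty"]
    problems = []
    # Schema: every row has exactly the expected columns.
    if any(set(row) != set(expected_columns) for row in rows):
        problems.append("unexpected schema")
    # Cardinality: the key column is unique.
    keys = [row.get(key) for row in rows]
    if len(set(keys)) != len(keys):
        problems.append(f"duplicate values in {key!r}")
    # Missing values: each column stays under the allowed ratio.
    for column in expected_columns:
        missing = sum(row.get(column) is None for row in rows)
        if missing / len(rows) > max_missing_ratio:
            problems.append(f"too many missing values in {column!r}")
    return problems


def leaked_keys(train_rows, test_rows, key):
    """Leakage check: keys that appear in both the train and the test split."""
    return {row[key] for row in train_rows} & {row[key] for row in test_rows}


train = [{"id": 1, "age": 34, "label": 0}, {"id": 2, "age": 51, "label": 1}]
test = [{"id": 2, "age": 51, "label": 1}, {"id": 3, "age": None, "label": 0}]

train_problems = dataset_problems(train, ["id", "age", "label"], key="id")
test_problems = dataset_problems(test, ["id", "age", "label"], key="id")
overlap = leaked_keys(train, test, key="id")
```

Here the train split is clean, the test split has too many missing ages, and `id` 2 leaks between the splits. Checks on underlying distributions would need a statistical test and are left out of this sketch.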

## Generated model works as expected


<br>

- Does the model perform worse than it did on previous datasets?
- Does the model work as expected on specific inputs?
- Testing properties of the model:
    - Check the shape of your model output.
    - Check the output ranges.
    - Check the output distribution.
    - Check idempotent mutations.
    - Check directional mutations.



``` python
def test_cool_model_properties():
    df = load_mock_input("<COOL_DATA_PATH>")
    model = load_model("<COOL_MODEL_PATH>")
    predictions = model.predict(df)
    # Output range check, e.g. predicted ages must be plausible.
    assert all(0 <= prediction < 120 for prediction in predictions)
```

- **Difficulty**: Confusing 🤯, but somewhat easy to test 👦.
- **Execution time**: Tend to be slow 🐢.
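
The mutation checks from the list can be sketched as follows. "Idempotent mutations" is read here as invariance tests (changing an irrelevant feature must leave the prediction unchanged), which is an interpretation rather than something the slides spell out. The toy `predict_price` function stands in for a real `model.predict`, and all feature names are hypothetical.

```python
def predict_price(listing):
    """Hypothetical toy model standing in for `model.predict` on one row."""
    return 50_000 + 2_000 * listing["square_meters"] + 10_000 * listing["rooms"]


def check_invariant(predict, example, feature, new_value):
    """Invariance: changing `feature` must not change the prediction."""
    mutated = {**example, feature: new_value}
    return predict(mutated) == predict(example)


def check_directional(predict, example, feature, delta, expect_increase=True):
    """Directional: shifting `feature` by `delta` must move the prediction the expected way."""
    mutated = {**example, feature: example[feature] + delta}
    before, after = predict(example), predict(mutated)
    return after > before if expect_increase else after < before


listing = {"square_meters": 80, "rooms": 3, "title": "Sunny flat"}

title_is_ignored = check_invariant(predict_price, listing, "title", "Cozy flat")
bigger_is_pricier = check_directional(predict_price, listing, "square_meters", 10)
in_range = 0 <= predict_price(listing) < 10_000_000
```

With a trained model, exact equality in the invariance check is usually replaced by a small tolerance.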


![](https://memegenerator.net/img/instances/63439861.jpg)

![](sponsors.png)
