# **Unit-testing a data pipeline**

- Unit tests are code that checks the behavior of other code, and are commonly used in software engineering workflows.
- Unit tests are the foundation of code validation, and data engineers can use them to ensure that each component of a data pipeline works as expected.
- Unit tests can also be written to validate the data a pipeline produces. In a typical data pipeline workflow, unit tests are written and run before end-to-end validation, to validate both the code and the resulting data.

Unit tests:

- Commonly used in software engineering workflows
- Ensure code works as expected
- Help to validate data

# **pytest for unit testing**

- To build and run unit tests with Python, we'll be using a library called pytest. With pytest, unit tests are written as functions. 
- Typically, these function names start with "test", which allows pytest to automatically discover and run the tests within a project.
- In this example, we define a function test_transformed_data. This function asserts that the clean_stock_data object is a pd.DataFrame.
- When the command python -m pytest is executed, this test will be discovered and run.
- If no AssertionErrors are raised, a success message will be output. 
- Let's take a closer look at the isinstance function, as well as the assert keyword.

In [None]:
import pandas as pd
from pipeline import extract, transform, load

# Build a unit test, asserting the type of clean_stock_data
def test_transformed_data():
    raw_stock_data = extract("raw_stock_data.csv")
    clean_stock_data = transform(raw_stock_data)
    assert isinstance(clean_stock_data, pd.DataFrame)

In [None]:
> python -m pytest

test_transformed_data .                               [100%]
============================= 1 passed in 1.17s ==============================
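The pipeline module imported above is not shown here, so the following is a minimal, self-contained sketch of the same test. The transform function is a hypothetical stand-in (it only drops rows with missing values), and the small in-memory DataFrame replaces the CSV file; both are assumptions made for illustration.

```python
import pandas as pd


def transform(raw_stock_data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the pipeline's transform step."""
    # Drop rows with missing values and renumber the index
    return raw_stock_data.dropna().reset_index(drop=True)


def test_transformed_data():
    # Made-up in-memory input used instead of reading "raw_stock_data.csv"
    raw_stock_data = pd.DataFrame(
        {"open": [10.0, None, 12.5], "close": [10.5, 11.0, 12.0]}
    )
    clean_stock_data = transform(raw_stock_data)

    # Same type check as above, plus a check on the number of surviving rows
    assert isinstance(clean_stock_data, pd.DataFrame)
    assert len(clean_stock_data) == 2
```

Saved in a file such as test_pipeline.py (the file name is an assumption), this test should be picked up by python -m pytest without needing any external data.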

# **assert and isinstance**

- To check the object's type, we'll use the isinstance function. 
- isinstance takes two arguments: an object and a data type. 
- If the object is an instance of that data type, the function returns True; otherwise, it returns False.
- Here, "ETL" is assigned to the pipeline_type variable, and isinstance returns True when called, since pipeline_type is a string.
- The assert keyword validates that a boolean expression is indeed True, and raises an AssertionError otherwise. 
- Here, we validate that pipeline_type indeed takes the value "ETL". 
- Since the statement evaluates to True, no error is raised. 
- When writing unit tests, we'll use assert and isinstance together to validate the type that objects take.

In [None]:
pipeline_type = "ETL"

# Check if pipeline_type is an instance of a str
isinstance(pipeline_type, str)
# Output: True

# Assert that pipeline_type has the value "ETL"
assert pipeline_type == "ETL"

# Combine assert and isinstance
assert isinstance(pipeline_type, str)
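When a value may legitimately be one of several types, isinstance also accepts a tuple of types as its second argument. The sketch below uses a hypothetical validate_row_count helper (not part of the pipeline above) to show the pattern.

```python
def validate_row_count(row_count):
    """Hypothetical check: the row count must be a positive number."""
    # isinstance accepts a tuple of types and returns True if any one matches
    assert isinstance(row_count, (int, float))
    assert row_count > 0


validate_row_count(250)   # passes silently
validate_row_count(12.0)  # also passes, since float is in the tuple
```

Passing a string such as "250" to this helper would raise an AssertionError on the first assert, because str is not in the tuple.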

# **AssertionError**

- In this example, "ETL" is again assigned to the pipeline_type variable. 
- This time, we attempt to assert that this object is a float. 
- Since this is False, an AssertionError is raised, as shown here. 
- If this statement were placed within a unit test, the test would fail when run.

In [None]:
pipeline_type = "ETL"
# Create an AssertionError
assert isinstance(pipeline_type, float)

In [None]:
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
AssertionError
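A bare AssertionError like the one above says little about what went wrong. As a small sketch, an assert statement can carry an optional message after a comma, which becomes the text of the error. The try/except here is only for demonstration, so the message can be printed instead of stopping the script.

```python
pipeline_type = "ETL"

try:
    # The expression after the comma becomes the AssertionError's message
    assert isinstance(pipeline_type, float), (
        f"expected float, got {type(pipeline_type).__name__}"
    )
except AssertionError as error:
    message = str(error)

print(message)  # expected float, got str
```

Inside a unit test, the try/except would normally be left out so that the failing assertion is reported by pytest together with the message.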

# **Mocking data pipeline components with fixtures**

- pytest fixtures are functions that allow test data and objects to be shared across multiple tests. 
- They can be used to simplify test setup, and provide a common set of test data for multiple tests. 
- In this example, we create a fixture called clean_data, which returns a cleaned DataFrame. 
- The fixture is then passed to the test_transformed_data function. 
- When run, this unit test will be able to access the cleaned DataFrame created and returned by the clean_data fixture. 
- We'll explore fixtures more in the following exercises.

In [None]:
import pytest

@pytest.fixture()
def clean_data():
    raw_stock_data = extract("raw_stock_data.csv")
    clean_stock_data = transform(raw_stock_data)
    return clean_stock_data

In [None]:
def test_transformed_data(clean_data):
    assert isinstance(clean_data, pd.DataFrame)
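To illustrate how one fixture can serve several tests, here is a self-contained sketch. The build_clean_data helper and its sample values are hypothetical; they stand in for the extract and transform steps so the example runs without a CSV file.

```python
import pandas as pd
import pytest


def build_clean_data() -> pd.DataFrame:
    """Hypothetical stand-in for extract + transform, using made-up values."""
    raw_stock_data = pd.DataFrame(
        {"open": [10.0, None, 12.5], "close": [10.5, 11.0, 12.0]}
    )
    return raw_stock_data.dropna().reset_index(drop=True)


@pytest.fixture()
def clean_data():
    # pytest calls this function and passes its return value to any test
    # that lists "clean_data" as a parameter
    return build_clean_data()


def test_clean_data_type(clean_data):
    assert isinstance(clean_data, pd.DataFrame)


def test_clean_data_has_no_missing_values(clean_data):
    assert clean_data.notna().all().all()
```

Keeping the data-building logic in a plain function is a design choice in this sketch: the fixture stays a thin wrapper, and the same helper can be called directly outside of pytest.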

# **Unit testing DataFrames**

- In addition to testing functions, we can also test data. 
- In this example, we'll test the clean_data DataFrame passed into the test as a fixture. 
- Using the .columns attribute, we assert that there are four columns in this DataFrame.
- We can use other built-in tools, such as the .min method, to assert that all values in the open column are greater than or equal to zero.
- This can be taken one step further by validating the maximum value of this column with the .max method.
- Running unit tests against data helps to confirm that data follows business rules and requirements, and can help to catch data quality issues before a pipeline is shipped to production.

In [None]:
def test_transformed_data(clean_data):
    # Include other assert statements here
    ...

    # Check number of columns
    assert len(clean_data.columns) == 4

    # Check the lower bound of a column
    assert clean_data["open"].min() >= 0
    
    # Check the range of a column by chaining statements with "and"
    assert clean_data["open"].min() >= 0 and clean_data["open"].max() <= 1000
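When several conditions are chained with "and" in one assert, a failure does not reveal which condition was violated. One possible alternative, sketched below, is to write one assert per rule, each with its own message. The check_open_column helper, the sample data, and the 0 to 1000 bounds are illustrative assumptions rather than real business rules.

```python
import pandas as pd


def check_open_column(clean_data: pd.DataFrame) -> None:
    """Validate the "open" column one rule at a time (bounds are illustrative)."""
    assert "open" in clean_data.columns, "missing 'open' column"
    assert clean_data["open"].notna().all(), "'open' contains missing values"
    assert clean_data["open"].min() >= 0, "'open' has a value below 0"
    assert clean_data["open"].max() <= 1000, "'open' has a value above 1000"


# Hypothetical sample data that satisfies every rule
sample = pd.DataFrame({"open": [10.0, 12.5, 999.0], "close": [10.5, 12.0, 998.0]})
check_open_column(sample)
```

In a test, the same helper could be called with the clean_data fixture, so the failure message points directly at the rule that was broken.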