# Best Data Testing Practices for Data Science

Eric J. Ma

MIT Biological Engineering

## Why tests?

- We make assumptions about our code & data. 
- There are cases where those assumptions are violated.
- Therefore, automated testing of those assumptions is important.

## Tests

> A contract between your current self and your future self.
> What you expect to be correct now should remain correct in the future.
> What you expect to be wrong now should still be wrong in the future.
> Unless the requirements have changed!

## For code, what needs to be tested?

- Given some example input(s), the output is correct.
- Counter-examples should show up as incorrect.
- Boundary cases are accounted for using defensive programming.
- All lines of code are subject to at least one test.
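
As a minimal sketch of the points above, the tests below cover one example input, several counter-examples, and the boundary cases of a small function. `parse_percentage` is a hypothetical function invented for illustration; `pytest.raises` and `pytest.mark.parametrize` are standard `pytest` features.

```python
import pytest


def parse_percentage(text: str) -> float:
    """Convert a string such as '45%' into a fraction between 0 and 1.

    Hypothetical example function, written defensively: bad input fails loudly.
    """
    if not text.endswith("%"):
        raise ValueError(f"expected a trailing '%': {text!r}")
    value = float(text[:-1])  # raises ValueError for non-numeric input
    if not 0 <= value <= 100:
        raise ValueError(f"percentage out of range: {value}")
    return value / 100


def test_example_input():
    # Given an example input, the output is correct.
    assert parse_percentage("45%") == 0.45


@pytest.mark.parametrize("bad", ["45", "abc%", "101%", "-1%"])
def test_counter_examples_are_rejected(bad):
    # Counter-examples should show up as incorrect.
    with pytest.raises(ValueError):
        parse_percentage(bad)


def test_boundaries():
    # Boundary cases: both ends of the valid range are accepted.
    assert parse_percentage("0%") == 0.0
    assert parse_percentage("100%") == 1.0
```

Saved in a file named like `test_percentage.py`, these functions would be collected and run by invoking `pytest` in the same directory.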

## For data, what needs to be tested?

- Data types are appropriate. (Types)
- Data has not been tampered with. (Integrity)
- Missing values are accounted for. (Completeness)
- Data schema is complete. (Structure)
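
The four checks above could be sketched with the standard library alone, as below. The CSV contents, column names, and helper functions are all hypothetical. In real use the expected checksum would be recorded once when the data is first received and stored with the tests; it is computed inline here only to keep the sketch self-contained.

```python
import csv
import hashlib
import io

# Hypothetical raw file contents; a real test would read the bytes from disk.
RAW = b"sample_id,temperature_c,site\n1,21.5,A\n2,19.0,B\n3,,A\n"
# Hypothetical reference digest (normally stored, not recomputed).
EXPECTED_SHA256 = hashlib.sha256(RAW).hexdigest()
EXPECTED_COLUMNS = ["sample_id", "temperature_c", "site"]


def checksum(raw: bytes) -> str:
    """Return the SHA-256 hex digest of the raw bytes."""
    return hashlib.sha256(raw).hexdigest()


def load_rows(raw: bytes):
    """Parse CSV bytes into (column names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    return reader.fieldnames, list(reader)


def test_structure():
    columns, _ = load_rows(RAW)
    assert columns == EXPECTED_COLUMNS


def test_types():
    _, rows = load_rows(RAW)
    for row in rows:
        int(row["sample_id"])  # raises ValueError if not an integer
        if row["temperature_c"]:
            float(row["temperature_c"])  # raises ValueError if not numeric


def test_completeness():
    # Missing values are accounted for: exactly one known gap, in sample 3.
    _, rows = load_rows(RAW)
    missing = [row["sample_id"] for row in rows if row["temperature_c"] == ""]
    assert missing == ["3"]


def test_integrity():
    assert checksum(RAW) == EXPECTED_SHA256
    # Any change to the bytes, such as an appended row, changes the digest.
    assert checksum(RAW + b"4,30.0,C\n") != EXPECTED_SHA256
```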

## For statistical analysis & ML, what else needs to be done?

- Check the underlying distributions of real-valued (numeric; integer or float) data.
- Classify data as categorical, ordinal, count, compositional, or continuous.
- Convert categorical/ordinal values represented as strings to numerical representations.
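
A minimal sketch of what such checks might look like follows. The severity scale, the sample values, and `encode_severity` are hypothetical; the idea is only that each kind of data comes with constraints that can be asserted.

```python
import math

# Hypothetical ordinal scale; order matters, so it is written out explicitly.
SEVERITY_ORDER = ["low", "medium", "high"]
SEVERITY_CODES = {label: code for code, label in enumerate(SEVERITY_ORDER)}


def encode_severity(labels):
    """Map ordinal string labels to integer codes, failing on unknown labels."""
    unknown = set(labels) - set(SEVERITY_CODES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [SEVERITY_CODES[label] for label in labels]


def test_encoding_preserves_order():
    assert encode_severity(["low", "high", "medium"]) == [0, 2, 1]


def test_count_data_are_non_negative_integers():
    counts = [0, 3, 12]  # hypothetical count data
    assert all(isinstance(c, int) and c >= 0 for c in counts)


def test_compositional_data_sum_to_one():
    composition = [0.2, 0.3, 0.5]  # hypothetical proportions
    assert all(0 <= p <= 1 for p in composition)
    assert math.isclose(sum(composition), 1.0)
```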

## What to expect

- Simple exercises getting you familiar with how to write tests.
- One collaborative project and discussion at the end, writing tests for functions from an untested project.

## Take-Homes

- You'll get a ton of practice with [`pytest`](https://docs.pytest.org/en/latest/) and assertion statements.
- You'll also get a bit of practice using [`hypothesis`](https://hypothesis.readthedocs.io/en/latest/) to do property-based testing.
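
To give a flavor of property-based testing, here is a small sketch. Instead of hand-picking inputs, `hypothesis` generates many lists of floats and checks that a property holds for all of them. `normalize` is a hypothetical function chosen for illustration, and the bounds on the generated floats are an assumption made to keep the sum finite and positive.

```python
import math

from hypothesis import given, strategies as st


def normalize(values):
    """Scale positive numbers so that they sum to 1 (hypothetical example)."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return [v / total for v in values]


# Strategy: non-empty lists of bounded, strictly positive floats.
positive_lists = st.lists(st.floats(min_value=0.1, max_value=1e6), min_size=1)


@given(positive_lists)
def test_normalize_sums_to_one(values):
    assert math.isclose(sum(normalize(values)), 1.0, rel_tol=1e-9)


@given(positive_lists)
def test_normalize_preserves_length_and_range(values):
    result = normalize(values)
    assert len(result) == len(values)
    assert all(0 <= r <= 1 for r in result)
```

When a property fails, `hypothesis` reports a shrunken, minimal failing input, which often points directly at the violated assumption.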