# Best Testing Practices for Data Science

A short tutorial for data scientists on how to write tests for your code and your data. Before the tutorial, please read through this README, as it contains useful information that will help you prepare.

## How to use this repository

The tutorial notes are written as Jupyter notebooks, and static HTML versions are available under the docs folder. For the non-bonus material, I suggest working through the notes in order. With the exception of the Projects, the bonus material can be tackled in any order. During the tutorial, be sure to have the HTML versions open.

## Pre-Requisite Knowledge

I assume you are the following type of coder:

- You are a data analytics type who knows how to read/write CSV files with pandas and do basic data manipulation (slicing, indexing rows and columns, using the `.apply()` function).
- You are not necessarily a seasoned software developer with experience running tests.
- You are comfortable working in the Terminal environment.
- You have some rudimentary knowledge of numpy, particularly the `array.min()`, `array.max()`, `array.mean()`, `array.std()`, and `numpy.allclose(a1, a2)` function calls.
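
For instance, these numpy calls can express simple sanity checks on a column of data. A minimal sketch (the array values and thresholds below are invented for illustration, not taken from the tutorial data):

```python
import numpy as np

# Hypothetical sensor readings, made up for this example.
temps = np.array([20.1, 21.3, 19.8, 22.0, 20.5])

assert temps.min() >= 15.0           # no reading below a plausible floor
assert temps.max() <= 30.0           # no reading above a plausible ceiling
assert 18.0 < temps.mean() < 25.0    # average in a sensible range
assert temps.std() < 5.0             # readings not wildly dispersed

# allclose compares two float arrays within tolerance.
assert np.allclose(temps / 2 * 2, temps)
```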

To prepare for the tutorial, it will help to know the following pieces of Python syntax:

- the context manager syntax (`with ...:`),
- assertions (`assert condition1 == condition2`),
- file I/O (`with open(...) as f: ...`),
- list/dict/tuple comprehensions (`[a for a in container if condition(a)]`),
- checking types and attributes (`isinstance(obj, type)` or `hasattr(obj, attr)`).
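
If any of these are unfamiliar, the short sketch below exercises all five in a few lines (the file name and contents are arbitrary, invented for this example):

```python
import os
import tempfile

# File I/O with a context manager: `with` closes the file automatically.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(path, "w") as f:
    f.write("1\n2\n3\n")

with open(path) as f:
    # A list comprehension with a filtering condition.
    numbers = [int(line) for line in f if line.strip()]

# Assertions, plus type and attribute checks.
assert numbers == [1, 2, 3]
assert isinstance(numbers, list)
assert hasattr(numbers, "append")
```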

## Feedback

If you've taken a version of this tutorial, please leave feedback here. I use those suggestions to adjust the tutorial content and make it better. The changes are always released publicly on GitHub, so everybody benefits!

## Environment Setup

### conda setup

This installation route should work cross-platform. I recommend using the Anaconda distribution of Python because it is a good way to bootstrap your data science environment.

To get set up, create a conda environment from the provided environment.yml spec file by running the following command in your bash terminal:

```bash
$ bash conda-setup.sh
```

### pip setup

The alternative way is to use a virtualenv environment:

```bash
$ bash venv-setup.sh
$ source datatest/bin/activate
```

Alternatively, you can pip install each of the dependencies listed in the environment.yml file. (The requirements.txt file may be less eagerly maintained than the environment.yml file, given my conda bias.)

### Manual Setup

If you prefer having more control over your installation process, conda or pip install the dependencies listed in the environment.yml file.

### Checks

To check whether the environment is correctly set up, run the checkenv.py script:

```bash
$ python checkenv.py
```

It should print `All packages found; environment checks passed.` to your terminal. Otherwise, conda or pip install the packages it reports as missing (they will show up one by one).

## Authors

### Contributors

Special thanks go to the individuals who have contributed, in ways big and small, to improving this material.

- Renee Chu
- Matt Bachmann: @Bachmann1234
- Hugo Bowne-Anderson: @hugobowne
- Boston Python tutorial attendees:
  - @races1986
  - Thao Nguyen: @ThaoNguyen15
  - @ChrisMuir

## Data Credits