[WIP] New version with cleaner options (#162)
* WIP - New version with cleaner options

* Fix find-replace error (#177)

* Remove unnecessary .gitkeep

* Remove unused tox.ini

* Split reqs into dev/non-dev

* Add basic packages support

* Add tests for testing environment creation and requirements

* Set up CI with Azure Pipelines (#194)

* Change archived asciinema example (#163)

* Change archived asciinema example

* Update README.md

Fix Asciinema powerline error

* Update docs to show updated asciinema example

* Added source and destination to Make data target (#169)

* Fix broken Airflow link (#182)

* Fixed: Typo in Makefile (#184)

Fixed typo in Makefile, section "Set up python interpreter environment": intalled --> installed

* Set up CI with Azure Pipelines

[skip ci]

* Update azure-pipelines.yml for Azure Pipelines

* Use str paths for Windows support

* Handle multiple data providers (#199)

* Add missing env directory bin/activate path

* Remove version from PYTHON_INTERPRETER command

* Search for virtualenvwrapper.sh path if executable not found

* Try chardet for character encoding detection

* Specify python and virtualenv binaries for virtualenvwrapper

* Add shebang to virtualenvwrapper.sh

* Diagnostic

* Try virtualenvwrapper-win

* Set encoding if detected None

* Fixes to Mac and Windows tests on Azure pipelines (#217)

* Temporarily comment out py36

* Update azure-pipelines.yml

* Fix tests on Windows and Mac (#1)

* Temporarily remove py37

* Update virtualenv_harness.sh

* Put py37 back in

* Set encoding to utf-8

* Comment out rmvirtualenv

* Update test_creation.py

* Update virtualenv_harness.sh

* Add --show-capture

* Update azure-pipelines.yml

* Update test_creation.py

* Further updates to virtualenv_harness.sh, Makefile, cookiecutter.json, conda_harness.sh, test_creation.py, and azure-pipelines.yml

Co-authored-by: Eric Jalbert <ericmjalbert@users.noreply.github.com>
Co-authored-by: Jonathan Raviotta <jraviotta@users.noreply.github.com>
Co-authored-by: Wes Roach <wesr000@gmail.com>
Co-authored-by: Christopher Geis <16896724+geisch@users.noreply.github.com>
Co-authored-by: Peter Bull <pjbull@gmail.com>
Co-authored-by: Ian Preston <17241371+ianepreston@users.noreply.github.com>
Co-authored-by: Jay Qi <jayqi@users.noreply.github.com>
Co-authored-by: inchiosa <4316698+inchiosa@users.noreply.github.com>

* More graceful deprecation

* Make tests pass locally

* Test that version matches installed version

* Remove unused imports

* Unremove used import

* Move to GH Actions

* Fix typo

* Test non-windows

* Add netlify configs

* Update suggestion to keep using deprecated cookiecutter template (#231)

* Add mkdocs requirements file to docs directory

* Try setting python version in runtime txt for netlify

* Trigger build

* Python 3.8 netlify

* Python 3.6 netlify

* Do not specify python runtime for netlify

* Use 3.7

This reverts commit 898d7d3.

Co-authored-by: James Myatt <james@jamesmyatt.co.uk>
Co-authored-by: drivendata <info@drivendata.org>
Co-authored-by: Eric Jalbert <ericmjalbert@users.noreply.github.com>
Co-authored-by: Jonathan Raviotta <jraviotta@users.noreply.github.com>
Co-authored-by: Wes Roach <wesr000@gmail.com>
Co-authored-by: Christopher Geis <16896724+geisch@users.noreply.github.com>
Co-authored-by: Ian Preston <17241371+ianepreston@users.noreply.github.com>
Co-authored-by: Jay Qi <jayqi@users.noreply.github.com>
Co-authored-by: inchiosa <4316698+inchiosa@users.noreply.github.com>
Co-authored-by: Robert Gibboni <robert@drivendata.org>
11 people committed Mar 20, 2021
1 parent c077603 commit 1fe968d
Showing 42 changed files with 806 additions and 249 deletions.
36 changes: 36 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,36 @@

name: tests

on:
  push:
    branches: [master]
  pull_request:
  schedule:
    # Run every Sunday
    - cron: "0 0 * * 0"

jobs:
  build:
    name: ${{ matrix.os }}, Python ${{ matrix.python-version }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest]
        python-version: [3.6, 3.7, 3.8]

    steps:
      - uses: actions/checkout@v2

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r dev-requirements.txt --use-feature=2020-resolver
      - name: Run tests
        run: |
          pytest -vvv
7 changes: 6 additions & 1 deletion .gitignore
@@ -6,4 +6,9 @@ docs/site/
# test cache
.cache/*
tests/__pycache__/*
*.pytest_cache/
*.pytest_cache/
*.pyc

# other local dev info
.vscode/
cookiecutter_data_science.egg-info/
6 changes: 4 additions & 2 deletions README.md
Expand Up @@ -73,8 +73,10 @@ The directory structure of your new project looks like this:
│ generated with `pip freeze > requirements.txt`
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
├── {{ cookiecutter.module_name }} <- Source code for use in this project.
│ │
│ ├── __init__.py <- Makes {{ cookiecutter.module_name }} a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
30 changes: 30 additions & 0 deletions ccds.json
@@ -0,0 +1,30 @@
{
    "project_name": "project_name",
    "repo_name": "{{ cookiecutter.project_name.lower().replace(' ', '_') }}",
    "module_name": "{{ cookiecutter.project_name.lower().replace(' ', '_').replace('-', '_') }}",
    "author_name": "Your name (or your organization/company/team)",
    "description": "A short description of the project.",
    "python_version_number": "3.7",
    "dataset_storage": [
        {"none": "none"},
        {"azure": {"container": "container-name"}},
        {"s3": {"bucket": "bucket-name", "aws_profile": "default"}},
        {"gcs": {"bucket": "bucket-name"}}
    ],
    "environment_manager": [
        "virtualenv",
        "conda",
        "pipenv",
        "none"
    ],
    "dependency_file": [
        "requirements.txt",
        "environment.yml",
        "Pipfile"
    ],
    "pydata_packages": [
        "none",
        "basic"
    ],
    "open_source_license": ["MIT", "BSD-3-Clause", "No license file"]
}
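For orientation, here is a minimal sketch of what a sub-optioned choice like `dataset_storage` resolves to once a selection is made. The key names mirror `ccds.json` above, but the resolved values are an assumption from reading the monkey-patched prompt code in `ccds/monkey_patch.py` below, not captured output:

```python
# Hypothetical resolved context (assumed shape, not real ccds output).
# A choice whose options carry sub-items collapses to
# {selected_option: {subkey: value, ...}}; a plain choice collapses to
# the selected string.
resolved = {
    "project_name": "My Project",
    "repo_name": "my_project",
    "module_name": "my_project",
    "dataset_storage": {"s3": {"bucket": "bucket-name", "aws_profile": "default"}},
    "environment_manager": "virtualenv",
    "dependency_file": "requirements.txt",
    "pydata_packages": "none",
    "open_source_license": "MIT",
}
```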
File renamed without changes.
26 changes: 26 additions & 0 deletions ccds/__main__.py
@@ -0,0 +1,26 @@
# Monkey-patch jinja to allow variables to not exist, which happens with sub-options
import jinja2

jinja2.StrictUndefined = jinja2.Undefined


# Monkey-patch cookiecutter to allow sub-items
from cookiecutter import prompt
from ccds.monkey_patch import generate_context_wrapper, prompt_for_config

prompt.prompt_for_config = prompt_for_config


# Monkey-patch context to point to ccds.json
from cookiecutter import generate
from ccds.monkey_patch import generate_context_wrapper

generate.generate_context = generate_context_wrapper

# For use in tests, we need the monkey-patched API main
from cookiecutter import cli
from cookiecutter import main as api_main

main = cli.main


if __name__ == "__main__":
    main()
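Because the patches are applied at import time, tests and scripts can drive the patched API directly. A minimal sketch, assuming it is acceptable to import `ccds.__main__` purely for its side effects (the `__main__` guard keeps `main()` from running on import):

```python
# Minimal sketch: apply the ccds monkey-patches, then call cookiecutter's
# public API. Importing ccds.__main__ this way is assumed usage, not documented.
import ccds.__main__  # noqa: F401  -- imported only for its patching side effects

from cookiecutter.main import cookiecutter

# With the patches in place, the context comes from ccds.json rather than
# cookiecutter.json, and sub-optioned choices are resolved as sketched above.
cookiecutter(
    "https://github.com/drivendata/cookiecutter-data-science",
    no_input=True,  # accept all defaults instead of prompting
)
```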
122 changes: 122 additions & 0 deletions ccds/monkey_patch.py
@@ -0,0 +1,122 @@
from collections import OrderedDict
from pathlib import Path

from cookiecutter.exceptions import UndefinedVariableInTemplate
from cookiecutter.environment import StrictEnvironment
from cookiecutter.generate import generate_context
from cookiecutter.prompt import (
    prompt_choice_for_config,
    read_user_choice,
    read_user_dict,
    read_user_variable,
    render_variable,
)
from jinja2.exceptions import UndefinedError


def _prompt_choice_and_subitems(cookiecutter_dict, env, key, options, no_input):
    result = {}

    # first, get the selection
    rendered_options = [
        render_variable(env, list(raw.keys())[0], cookiecutter_dict) for raw in options
    ]

    if no_input:
        selected = rendered_options[0]
    else:
        selected = read_user_choice(key, rendered_options)

    selected_item = [
        list(c.values())[0] for c in options if list(c.keys())[0] == selected
    ][0]

    result[selected] = {}

    # then, fill in the sub values for that item
    if isinstance(selected_item, dict):
        for subkey, raw in selected_item.items():
            # We are dealing with a regular variable
            val = render_variable(env, raw, cookiecutter_dict)

            if not no_input:
                val = read_user_variable(subkey, val)

            result[selected][subkey] = val
    elif isinstance(selected_item, list):
        val = prompt_choice_for_config(
            cookiecutter_dict, env, selected, selected_item, no_input
        )
        result[selected] = val
    elif isinstance(selected_item, str):
        result[selected] = selected_item

    return result


def prompt_for_config(context, no_input=False):
    """Prompt the user to enter a new config, using context as a source for
    the field names and sample values.

    :param no_input: If True, do not prompt the user at the command line;
        accept the default for every variable instead.
    """
    cookiecutter_dict = OrderedDict([])
    env = StrictEnvironment(context=context)

    # First pass: Handle simple and raw variables, plus choices.
    # These must be done first because the dictionaries' keys and
    # values might refer to them.
    for key, raw in context['cookiecutter'].items():
        if key.startswith('_'):
            cookiecutter_dict[key] = raw
            continue

        try:
            if isinstance(raw, list):
                if isinstance(raw[0], dict):
                    val = _prompt_choice_and_subitems(
                        cookiecutter_dict, env, key, raw, no_input
                    )
                    cookiecutter_dict[key] = val
                else:
                    # We are dealing with a choice variable
                    val = prompt_choice_for_config(
                        cookiecutter_dict, env, key, raw, no_input
                    )
                    cookiecutter_dict[key] = val
            elif not isinstance(raw, dict):
                # We are dealing with a regular variable
                val = render_variable(env, raw, cookiecutter_dict)

                if not no_input:
                    val = read_user_variable(key, val)

                cookiecutter_dict[key] = val
        except UndefinedError as err:
            msg = "Unable to render variable '{}'".format(key)
            raise UndefinedVariableInTemplate(msg, err, context)

    # Second pass; handle the dictionaries.
    for key, raw in context['cookiecutter'].items():
        try:
            if isinstance(raw, dict):
                # We are dealing with a dict variable
                val = render_variable(env, raw, cookiecutter_dict)

                if not no_input:
                    val = read_user_dict(key, val)

                cookiecutter_dict[key] = val
        except UndefinedError as err:
            msg = "Unable to render variable '{}'".format(key)
            raise UndefinedVariableInTemplate(msg, err, context)

    return cookiecutter_dict


def generate_context_wrapper(*args, **kwargs):
    """Hardcoded in cookiecutter, so we override:
    https://github.com/cookiecutter/cookiecutter/blob/2bd62c67ec3e52b8e537d5346fd96ebd82803efe/cookiecutter/main.py#L85
    """
    # replace full path to cookiecutter.json with full path to ccds.json
    kwargs['context_file'] = str(Path(kwargs['context_file']).with_name('ccds.json'))

    parsed_context = generate_context(*args, **kwargs)

    # replace key
    parsed_context['cookiecutter'] = parsed_context['ccds']
    del parsed_context['ccds']
    return parsed_context
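To make `_prompt_choice_and_subitems` concrete, here is what it should return with `no_input=True` for an options list shaped like `dataset_storage`. The setup mirrors `prompt_for_config`, and the expected result is inferred from the code above rather than recorded from a real session:

```python
from cookiecutter.environment import StrictEnvironment

# Options copied from ccds.json; with no_input=True the first entry wins.
options = [
    {"none": "none"},
    {"s3": {"bucket": "bucket-name", "aws_profile": "default"}},
]
context = {"cookiecutter": {"dataset_storage": options}}
env = StrictEnvironment(context=context)

result = _prompt_choice_and_subitems({}, env, "dataset_storage", options, no_input=True)
# Expected: {"none": "none"} -- the first option is selected, and because its
# value is a plain string it is passed through unchanged.
```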
9 changes: 1 addition & 8 deletions cookiecutter.json
@@ -1,10 +1,3 @@
{
    "project_name": "project_name",
    "repo_name": "{{ cookiecutter.project_name.lower().replace(' ', '_') }}",
    "author_name": "Your name (or your organization/company/team)",
    "description": "A short description of the project.",
    "open_source_license": ["MIT", "BSD-3-Clause", "No license file"],
    "s3_bucket": "[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')",
    "aws_profile": "default",
    "python_interpreter": ["python3", "python"]
    "DEPRECATED": "Use of the `cookiecutter` command is deprecated. Please use `ccds` in place of `cookiecutter`. To continue using the deprecated template, use `cookiecutter ... -c v1`."
}
10 changes: 10 additions & 0 deletions dev-requirements.txt
@@ -0,0 +1,10 @@
-r requirements.txt
-e .

chardet
mkdocs
mkdocs-cinder
pipenv
pytest
virtualenvwrapper; sys_platform != 'win32'
virtualenvwrapper-win; sys_platform == 'win32'
18 changes: 9 additions & 9 deletions docs/docs/index.md
@@ -54,7 +54,7 @@ Disagree with a couple of the default folder names? Working on a project that's
## Getting started

With this in mind, we've created a data science cookiecutter template for projects in Python. Your analysis doesn't have to be in Python, but the template does provide some Python boilerplate that you'd want to remove (in the `src` folder for example, and the Sphinx documentation skeleton in `docs`).
With this in mind, we've created a data science cookiecutter template for projects in Python. Your analysis doesn't have to be in Python, but the template does provide some Python boilerplate that you'd want to remove (in the `{{ cookiecutter.module_name }}` folder for example, and the Sphinx documentation skeleton in `docs`).

### Requirements

@@ -103,8 +103,8 @@ cookiecutter https://github.com/drivendata/cookiecutter-data-science
│ generated with `pip freeze > requirements.txt`
├── setup.py <- Make this project pip installable with `pip install -e`
├── src <- Source code for use in this project.
│   ├── __init__.py <- Makes src a Python module
├── {{ cookiecutter.module_name }} <- Source code for use in this project.
│   ├── __init__.py <- Makes {{ cookiecutter.module_name }} a Python module
│ │
│   ├── data <- Scripts to download or generate data
│   │   └── make_dataset.py
@@ -129,7 +129,7 @@ There are some opinions implicit in the project structure that have grown out of

### Data is immutable

Don't ever edit your raw data, especially not manually, and especially not in Excel. Don't overwrite your raw data. Don't save multiple versions of the raw data. Treat the data (and its format) as immutable. The code you write should move the raw data through a pipeline to your final analysis. You shouldn't have to run all of the steps every time you want to make a new figure (see [Analysis is a DAG](#analysis-is-a-dag)), but anyone should be able to reproduce the final products with only the code in `src` and the data in `data/raw`.
Don't ever edit your raw data, especially not manually, and especially not in Excel. Don't overwrite your raw data. Don't save multiple versions of the raw data. Treat the data (and its format) as immutable. The code you write should move the raw data through a pipeline to your final analysis. You shouldn't have to run all of the steps every time you want to make a new figure (see [Analysis is a DAG](#analysis-is-a-dag)), but anyone should be able to reproduce the final products with only the code in `{{ cookiecutter.module_name }}` and the data in `data/raw`.

Also, if data is immutable, it doesn't need source control in the same way that code does. Therefore, ***by default, the data folder is included in the `.gitignore` file.*** If you have a small amount of data that rarely changes, you may want to include the data in the repository. Github currently warns if files are over 50MB and rejects files over 100MB. Some other options for storing/syncing large data include [AWS S3](https://aws.amazon.com/s3/) with a syncing tool (e.g., [`s3cmd`](http://s3tools.org/s3cmd)), [Git Large File Storage](https://git-lfs.github.com/), [Git Annex](https://git-annex.branchable.com/), and [dat](http://dat-data.com/). Currently by default, we ask for an S3 bucket and use [AWS CLI](http://docs.aws.amazon.com/cli/latest/reference/s3/index.html) to sync data in the `data` folder with the server.

@@ -141,18 +141,18 @@ Since notebooks are challenging objects for source control (e.g., diffs of the `

1. Follow a naming convention that shows the owner and the order the analysis was done in. We use the format `<step>-<ghuser>-<description>.ipynb` (e.g., `0.3-bull-visualize-distributions.ipynb`).

2. Refactor the good parts. Don't write code to do the same task in multiple notebooks. If it's a data preprocessing task, put it in the pipeline at `src/data/make_dataset.py` and load data from `data/interim`. If it's useful utility code, refactor it to `src`.
2. Refactor the good parts. Don't write code to do the same task in multiple notebooks. If it's a data preprocessing task, put it in the pipeline at `{{ cookiecutter.module_name }}/data/make_dataset.py` and load data from `data/interim`. If it's useful utility code, refactor it to `{{ cookiecutter.module_name }}`.

Now by default we turn the project into a Python package (see the `setup.py` file). You can import your code and use it in notebooks with a cell like the following:

```
# OPTIONAL: Load the "autoreload" extension so that code can change
%load_ext autoreload
# OPTIONAL: always reload modules so that as you change code in src, it gets loaded
# OPTIONAL: always reload modules so that as you change code in {{ cookiecutter.module_name }}, it gets loaded
%autoreload 2
from src.data import make_dataset
from {{ cookiecutter.module_name }}.data import make_dataset
```

### Analysis is a directed acyclic graph ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph))
@@ -192,10 +192,10 @@ OTHER_VARIABLE=something

#### Use a package to load these variables automatically.

If you look at the stub script in `src/data/make_dataset.py`, it uses a package called [python-dotenv](https://github.com/theskumar/python-dotenv) to load up all the entries in this file as environment variables so they are accessible with `os.environ.get`. Here's an example snippet adapted from the `python-dotenv` documentation:
If you look at the stub script in `{{ cookiecutter.module_name }}/data/make_dataset.py`, it uses a package called [python-dotenv](https://github.com/theskumar/python-dotenv) to load up all the entries in this file as environment variables so they are accessible with `os.environ.get`. Here's an example snippet adapted from the `python-dotenv` documentation:

```python
# src/data/dotenv_example.py
# {{ cookiecutter.module_name }}/data/dotenv_example.py
import os
from dotenv import load_dotenv, find_dotenv

2 changes: 2 additions & 0 deletions docs/requirements.txt
@@ -0,0 +1,2 @@
mkdocs
mkdocs-cinder
1 change: 1 addition & 0 deletions docs/runtime.txt
@@ -0,0 +1 @@
3.7
