You and your hundred smartest colleagues want to collaborate on a feature engineering project. How will you organize your work? You are in the right place to learn. With the Ballet framework, contributors to your project will write self-contained feature engineering source code. Then, Ballet will take care of the rest: submitting proposed features as pull requests to your GitHub repository, carefully validating the proposed features, and combining all of the accepted features into a single feature engineering pipeline.
In this section, we describe how to use the Ballet framework for your project, which we will call myproject.
Before creating the project, the maintainer must have a training dataset used for developing features and details about the prediction problem they are ultimately trying to solve.
Then, install Ballet (see /installation) on your development machine.
To instantiate a project, use the ballet quickstart command (you may want to look ahead to the Automatic repository creation section of this guide to see what options are available for this command, such as automatically creating a GitHub repository for the project):
fragments/maintainer-guide/ballet-quickstart.txt
This command uses cookiecutter to render a project template using information supplied by the project maintainer. The resulting files are then committed to a new git repository. You can skip specifying a scorer for the problem type you did not choose by selecting n/a.
Let's see what files we have created:
fragments/maintainer-guide/tree-project.txt
Importantly, as long as you keep this project structure intact, Ballet can automatically maintain your feature engineering pipeline.
- ballet.yml: a Ballet configuration file, with details about the prediction problem, the training data, and the location of the feature engineering source code.
- .travis.yml: a Travis CI configuration file pre-configured to run a Ballet validation suite.
- src/myproject/api.py: this is where Ballet will look for functionality implemented by your project, including a function to load training/test data or collected features. Stubs for this functionality are already provided by the template, but you can further adapt them.
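To make the role of api.py more concrete, here is a minimal sketch of what a data-loading function might look like. It is not the stub that the template generates: the in-memory _SPLITS table, the returned (rows, targets) pair, and the error handling are all assumptions made for illustration. The split keyword argument follows the way this guide describes load_data being called during validation.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for the project's real data files.
_SPLITS: Dict[str, Tuple[List[dict], List[int]]] = {
    "train": ([{"age": 31}, {"age": 47}], [0, 1]),
    "test": ([{"age": 25}], [1]),
}


def load_data(split: str = "train") -> Tuple[List[dict], List[int]]:
    """Return (rows, targets) for the requested split.

    The return shape is an assumption for this sketch, not Ballet's contract.
    """
    if split not in _SPLITS:
        raise ValueError(
            f"unknown split {split!r}; expected one of {sorted(_SPLITS)}"
        )
    return _SPLITS[split]


rows, targets = load_data(split="train")
print(len(rows), len(targets))
```

In a real project, the body would read from wherever your training data is stored; only the idea of selecting data by split name carries over.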
For local development, you can then install your project. This will make your feature engineering pipeline accessible in interactive settings (Python interpreter, Jupyter notebook) and as a command-line tool.
$ cd ballet-my-project
$ conda create -n myproject -y && conda activate myproject # or your preferred environment tool
(myproject) $ pip install invoke && invoke install
Under the hood, contributors will collaborate using the powerful functionality provided by git and GitHub. In fact, after the quickstart step, you will already have a git-tracked repository and a git remote set up.
fragments/maintainer-guide/git-log.txt
The matching remote repository on GitHub must still be created. The quickstart command can do this automatically if you pass the --create-github-repo flag. Ballet then uses the GitHub API to create a repository under the account of the github_owner that you specified earlier (in this case, jane_developer) and pushes the local repository to GitHub. You must provide a GitHub access token with the appropriate permissions, either by setting the GITHUB_TOKEN environment variable or by passing it to the quickstart command using the --github-token option. See more details on these options here.
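The two ways of supplying a token can be pictured with the following sketch. It is an illustration only: the resolve_github_token helper is hypothetical, and giving the command-line value precedence over the GITHUB_TOKEN environment variable is an assumption, not documented Ballet behavior.

```python
import os
from typing import Mapping, Optional


def resolve_github_token(
    cli_token: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick a token from the command-line value or the environment.

    Assumption: an explicitly passed token wins over GITHUB_TOKEN.
    """
    environ = os.environ if environ is None else environ
    token = cli_token or environ.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError(
            "no GitHub token: set GITHUB_TOKEN or pass --github-token"
        )
    return token


# Placeholder values, not real tokens.
print(resolve_github_token("from-option", {"GITHUB_TOKEN": "from-env"}))
print(resolve_github_token(None, {"GITHUB_TOKEN": "from-env"}))
```

Whichever way you supply it, avoid committing the token to the repository or pasting it into shared shell history.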
Alternatively, you can manually create the repository on GitHub. Do not initialize the repository with any of the sample files that GitHub offers. Once you have created it, push your local copy.
$ git push --all origin
Ballet makes use of the continuous integration service Travis CI to validate code that contributors propose and to perform streaming feature definition selection. You must enable Travis CI for your project on GitHub by following these simple directions. You can skip any steps that have to do with customizing the .travis.yml file, as the quickstart has already done that for you.
Many Ballet projects use bots to assist maintainers.
1. Ballet bot. Install it here. Ballet bot will automatically merge or close PRs based on the CI test result and the project settings configured in the ballet.yml file.
2. Repolockr. Install it here. Repolockr checks every PR to ensure that "protected" files have not been changed. These are the files listed in the Repolockr config file on the master branch. A contributor might accidentally modify a protected file like ballet.yml, which could break the project or the CI pipeline; Repolockr will detect this and fail the PR, which might otherwise pass by accident.
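The kind of check Repolockr performs can be sketched as a comparison between the files a PR touches and a list of protected files. This is not Repolockr's implementation: the modified_protected_files helper is hypothetical, and matching on exact paths (rather than patterns) is an assumption.

```python
from typing import Iterable, List


def modified_protected_files(
    changed: Iterable[str], protected: Iterable[str]
) -> List[str]:
    """Return the protected paths that appear among the changed paths."""
    protected_set = set(protected)
    return sorted(path for path in changed if path in protected_set)


# Hypothetical PR that adds a feature but also edits ballet.yml.
changed = ["src/myproject/features/new_feature.py", "ballet.yml"]
protected = ["ballet.yml", ".travis.yml"]

violations = modified_protected_files(changed, protected)
print("fail" if violations else "pass", violations)
```

A PR that only adds a new feature file would produce an empty list and pass this check.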
Ballet allows you to configure many aspects of your project.
Configuration is stored in the ballet.yml file in the project root. More details about project configuration will be added soon.
Here is an incomplete list of configuration options, identified by their dotted keys from a root config object:
- config.validation.project_structure_validator: fully-qualified name of the class used to validate changes to the project structure
- config.validation.feature_api_validator: fully-qualified name of the class used to validate the feature API of new features
- config.validation.feature_accepter: fully-qualified name of the class used to validate the ML performance of new features
- config.validation.feature_pruner: fully-qualified name of the class used to prune existing features with respect to their ML performance
- config.validation.split: the name of the data split used for validating contributions. It will be passed as a keyword argument to your load_data function, i.e. load_data(split=split). This split should probably appear under the list at config.data.splits.
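To illustrate how a dotted key maps onto nested configuration, the sketch below walks a dictionary that stands in for a parsed ballet.yml. The get_option helper and the example values are hypothetical, and treating config.data.splits as a list of split names is an assumption; this is not how Ballet itself reads its configuration.

```python
from typing import Any, Mapping

# Hypothetical values standing in for a parsed ballet.yml.
config = {
    "validation": {"split": "train"},
    "data": {"splits": ["train", "test"]},
}


def get_option(root: Mapping[str, Any], dotted_key: str) -> Any:
    """Look up a dotted key such as 'config.validation.split'."""
    parts = dotted_key.split(".")
    # The leading 'config' names the root object itself, so skip it.
    if parts and parts[0] == "config":
        parts = parts[1:]
    node: Any = root
    for part in parts:
        node = node[part]  # raises KeyError if the option is missing
    return node


split = get_option(config, "config.validation.split")
known_splits = get_option(config, "config.data.splits")
print(split, split in known_splits)
```

The final line mirrors the advice above: the validation split should be one of the splits the project declares.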
At this point, your feature engineering pipeline contains no features. How will your contributors add more?
Using any of a number of development workflows, contributors write new features and submit them to your project for validation. For more details on the contributor workflow, see /contributor_guide.
The ballet-my-project repository has received a new pull request, which triggers an automatic evaluation.
- The PR is examined by the CI service.
- The ballet validate command is run, which validates the proposed feature contribution using functionality within the ballet.validation package.
- If the feature can be validated successfully, the PR passes, and the proposed feature can be merged into the project.
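The evaluation steps above can be pictured as a sequence of checks that must all succeed before a PR passes. The sketch below is only an illustration: the check names, their order, and the run_checks helper are hypothetical, and the real logic lives in the ballet.validation package.

```python
from typing import Callable, List, Tuple

# A check is a name paired with a zero-argument function returning True on success.
Check = Tuple[str, Callable[[], bool]]


def run_checks(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run each named check in order; return (passed, names of failed checks)."""
    failed = [name for name, check in checks if not check()]
    return (not failed, failed)


# Hypothetical outcome: the feature is well-formed but does not improve the model.
checks = [
    ("project_structure", lambda: True),
    ("feature_api", lambda: True),
    ("feature_acceptance", lambda: False),
]

passed, failed = run_checks(checks)
print(passed, failed)
```

Reporting the names of the failed checks, rather than only a pass/fail result, gives a contributor something concrete to fix.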
Once a feature has been accepted and merged into your project's master branch, it may mean that an older feature has now become "redundant": the new feature is providing all of the information contained in the old feature, and more.
- Each commit to master is examined by the CI service.
- The ballet validate command is run and automatically determines whether the commit is a merge commit that comes from merging an accepted feature.
- If so, then the set of existing features is pruned to remove redundant features.
- Pruned features are automatically deleted from your source repository by an automated service.
If there are updates to the Ballet framework after you have started working on your project, you can access them easily.
First, update the ballet package itself using the usual pip mechanism:
$ pip install --upgrade ballet
Pip will complain that the upgraded version of ballet is incompatible with the version required by the installed project. That is okay, as we will presently update the project itself to work with the new version of ballet.
Next, use the updated version of ballet to incorporate any updates to the "upstream" project template used to create new projects.
$ ballet update-project-template --push
This command will re-render the project template using the saved inputs you have provided in the past and then safely merge it first to your project-template branch and then to your master branch. Finally, given the --push flag, it will push updates to origin/master and origin/project-template. The usage of this command is described in more detail here.