Commit

improved documentation and contribution guide
arthurpaulino committed Apr 18, 2019
1 parent 73f87b7 commit 36f216c
Showing 2 changed files with 31 additions and 21 deletions.
6 changes: 3 additions & 3 deletions CONTRIBUTING.md
@@ -34,9 +34,9 @@ Before you start coding, checkout to a new branch called `issue-<#issue>` (e.g.:

Before committing your changes, remember to increment the package version according
to the [Semantic Versioning][semver] specification, with one difference: there is
-also an UPDATE identifier, which MUST be incremented when the change does not
-directly affect the way that the code works (eg.: updating the documentation or
-editing the `Makefile`).
+also an UPDATE identifier, which MUST be incremented if the change affects only
+docstrings. If the change does not affect ``.py`` files, it's not necessary to
+change the version.

The version can be incremented by calling `make` with one of the following
directives: `major`, `minor`, `patch` or `update`. Feel free to call `$ make help`
46 changes: 28 additions & 18 deletions docs/user_guide.rst
@@ -126,15 +126,23 @@ Ensembling base models

It is possible to combine the predictions of various base models in order to reach
even higher scores. This process is done by computing a straightforward linear
-combination of the base models' predictions. The score of the ensemble is computed
-by comparing the training target and the linear combination of the predictions for
-the training dataset. The predictions for the testing dataset is computed by
-performing the same linear combination on the predictions for the testing dataset
-from the base models.
+combination of the base models' predictions.

-Now, the obvious question is: how to find smart coefficients (or weights) for the
-linear combination? This is where the concept of `ensembling cycles` comes into
-play.
+More precisely, suppose we have a set of base models. For each base model :math:`i`,
+let :math:`tr_i` and :math:`ts_i` be its predictions for the training and testing
+datasets, respectively. The ensemble of the base models is based on a set of
+coefficients :math:`w` (weights), from which we can compute the combined predictions
+:math:`E_{tr}` and :math:`E_{ts}` for the training and testing datasets, respectively,
+according to the formula:

+:math:`(E_{tr}, E_{ts}) = \left(\frac{\sum w_i tr_i}{\sum w_i},
+\frac{\sum w_i ts_i}{\sum w_i}\right)`

+With a smart choice of :math:`w`, the score of :math:`E_{tr}` may be better than
+the score of any :math:`tr_i`.
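
As a minimal sketch of the combination above (illustration only; the function and
variable names below are made up and are not part of the package's API), the
ensemble predictions could be computed like this:

.. code-block:: python

    import numpy as np

    def combine_predictions(weights, train_preds, test_preds):
        """Hypothetical helper: weighted average of the base models' predictions.

        train_preds[i] and test_preds[i] are the arrays tr_i and ts_i.
        """
        w = np.asarray(weights, dtype=float)
        e_tr = sum(w_i * tr_i for w_i, tr_i in zip(w, train_preds)) / w.sum()
        e_ts = sum(w_i * ts_i for w_i, ts_i in zip(w, test_preds)) / w.sum()
        return e_tr, e_ts

The score of :math:`E_{tr}` is then obtained by comparing it with the training
target, just like the score of any individual :math:`tr_i`.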

+Now, the obvious question is: how to find a good :math:`w`? This is where the
+concept of `ensembling cycles` comes into play.

An ensembling cycle is an attempt to generate good weights stochastically, based
on the score of each base model individually. This is done by using `triangular
@@ -143,19 +151,21 @@ distributions <https://en.wikipedia.org/wiki/Triangular_distribution>`_.
The weight of the best base model is drawn from the triangular distribution that
varies from 0 to 1, with mode 1.

-For another base model :math:`i`, the weight is drawn from a triangular
-distribution that varies from 0 to `range`, with mode 0. It means that its weight
-will most likely be close to 0. The upperbound is defined by the `range` variable.
+For every other base model :math:`i` (i.e., any base model other than the best one),
+the weight is drawn from a triangular distribution that varies from 0 to `range`,
+with mode 0. This means that its weight will most likely be close to 0. The upper
+bound is defined by the `range` variable.

-Now, `range` should depend on the relative score of the base model. But preventing
-it from reaching 1 would be too prohibitive. The solution for this is: `range` is
-chosen from a triangular distribution that varies from 0 to 1, with mode `normalized`.
-The variable `normalized` measures the relative quality of the base model.
+The value of `range` should depend on the relative score of the base model, but
+forbidding it from ever reaching 1 would be too restrictive. The solution is to
+choose `range` from a triangular distribution that varies from 0 to 1, with mode
+`normalized`. The variable `normalized` measures the relative quality of the base
+model.

The value of `normalized` is computed by the formula :math:`(s_i-s_\textrm{min})/
-(s_\textrm{max}-s_\textrm{min})`, where :math:`s_i` is the score of the current
-base model and :math:`s_\textrm{min}` and :math:`s_\textrm{max}` are the scores
-of the worst and the best base models, respectively.
+(s_\textrm{max}-s_\textrm{min})`, where :math:`s_i` is the score of the base model
+and :math:`s_\textrm{min}` and :math:`s_\textrm{max}` are the scores of the worst
+and the best base models, respectively.

In the end, bad base models can still influence the ensemble, but their
probabilities of having high weights are relatively low.
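
Putting the pieces above together, a single ensembling cycle might look roughly
like the sketch below (illustration only, with made-up names; it assumes that a
higher score means a better base model and is not the package's actual
implementation):

.. code-block:: python

    import numpy as np

    def draw_weights(scores):
        """Hypothetical sketch of one ensembling cycle: draw one weight per base model."""
        scores = np.asarray(scores, dtype=float)
        s_min, s_max = scores.min(), scores.max()
        best = scores.argmax()  # assumes that a higher score is better
        weights = np.empty_like(scores)
        for i, s in enumerate(scores):
            if i == best:
                # best base model: triangular distribution over [0, 1] with mode 1
                weights[i] = np.random.triangular(0, 1, 1)
            else:
                # `normalized`: relative quality of this base model
                normalized = (s - s_min) / (s_max - s_min) if s_max > s_min else 0.0
                # `range`: upper bound for the weight, drawn from a triangular
                # distribution over [0, 1] with mode `normalized`
                upper = np.random.triangular(0, normalized, 1)
                # the weight itself: triangular over [0, `range`] with mode 0
                weights[i] = np.random.triangular(0, 0, upper) if upper > 0 else 0.0
        return weights

The resulting weights could then be plugged into the combination sketched earlier
to obtain :math:`E_{tr}` and :math:`E_{ts}`.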
