Regressor generalization #101

Merged
merged 9 commits into from
Jun 14, 2016
Conversation

dwysocki
Member

This resolves issue #90, and the regressor-specific portion of #91.

It removes the hand-chosen regressor objects, such as --regressor LassoCV, and replaces them with the ability to specify any regressor. So the above call would need to be more verbose: --regressor sklearn.linear_model.LassoCV, but in exchange for some verbosity, we gain complete generality. This is implemented with plotypus.utils.import_name.

This also removes the --max-iter and --regularization-cv options, which were passed to specific regressors, and instead implements a catch-all --regressor-options KEY VALUE ... KEY VALUE option. This allows the user to provide any option to their regressor, provided VALUE is a simple Python literal (parsable by the standard library function ast.literal_eval) or a string. Value parsing is implemented with plotypus.utils.literal_eval_str, and conversion to a dict is done with plotypus.utils.strlist_to_dict.
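A minimal sketch of how such KEY VALUE parsing might work (the actual plotypus helpers may differ in details):

```python
import ast

def literal_eval_str(value):
    """Parse a string as a Python literal, falling back to the raw string."""
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value

def strlist_to_dict(pairs):
    """Convert a flat [key1, val1, key2, val2, ...] list into a dict."""
    if len(pairs) % 2 != 0:
        raise ValueError("expected an even number of KEY VALUE arguments")
    return {k: literal_eval_str(v) for k, v in zip(pairs[0::2], pairs[1::2])}

# e.g. from: --regressor-options max_iter 1000 fit_intercept False
options = strlist_to_dict(["max_iter", "1000", "fit_intercept", "False"])
```

Here `"1000"` and `"False"` parse as the literals `1000` and `False`, while a value like `"lasso"` would fall back to a plain string.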

Because of these changes, we could no longer assume that fit_intercept=False on the regressors. In fact, it is most likely that fit_intercept=True, as that is the default for every scikit-learn regressor that I know of. For this reason, I have removed the ones column from preprocessing.Fourier.design_matrix, and get_lightcurve now takes the A_0 term from the regressor's intercept_ attribute.
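To illustrate the idea (using a made-up one-harmonic signal, not plotypus code): with no ones column in the design matrix, the mean magnitude A_0 comes out of the regressor's fitted intercept instead of a coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# design matrix with no ones column: just sin/cos of one harmonic
phase = np.linspace(0, 1, 50, endpoint=False)
X = np.column_stack([np.sin(2 * np.pi * phase), np.cos(2 * np.pi * phase)])
y = 10.0 + 0.5 * np.sin(2 * np.pi * phase) + 0.2 * np.cos(2 * np.pi * phase)

reg = LinearRegression()  # fit_intercept=True is the sklearn default
reg.fit(X, y)
A_0 = reg.intercept_      # the A_0 term, recovered from intercept_
coeffs = reg.coef_        # the Fourier coefficients
```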

There is a lot of coupling when it comes to phased/unphased data, so I
had to make a lot of changes in this one commit.

The main point of this commit was to deal with temporal-space data
instead of phase-space data wherever possible. This was especially
important in the `Fourier` class, to prepare for multiperiodic signal
support. Now the period needs to be provided to `Fourier` before
`Fourier.fit` or `Fourier.transform` is called. To allow for setting the
`period` parameter properly when `Fourier` is nested within another
estimator, I had to make `Fourier` a subclass of scikit-learn's
`BaseEstimator`, which we should have done long ago.
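The pattern looks roughly like this (a sketch under assumed parameter names, not the actual plotypus implementation): subclassing `BaseEstimator` gives `Fourier` working `get_params`/`set_params`, which is what lets the `period` be set from outside even when it is nested inside another estimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Fourier(BaseEstimator, TransformerMixin):
    """Sketch of a Fourier feature transformer: BaseEstimator supplies
    get_params/set_params based on the __init__ signature."""

    def __init__(self, degree=3, period=None):
        self.degree = degree
        self.period = period

    def fit(self, X, y=None):
        if self.period is None:
            raise ValueError("period must be set before fitting")
        return self

    def transform(self, X):
        # times -> sin/cos features for each harmonic; no ones column
        t = np.asarray(X, dtype=float).ravel()
        omega = 2 * np.pi / self.period
        k = np.arange(1, self.degree + 1)
        angles = np.outer(t, k * omega)
        return np.hstack([np.sin(angles), np.cos(angles)])

fourier = Fourier(degree=2)
fourier.set_params(period=1.0)   # works because of BaseEstimator
features = fourier.fit([0.0, 0.25, 0.5]).transform([0.0, 0.25, 0.5])
```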

Most of the changes in `get_lightcurve` deal with replacing `phase` with
`time`. There are also some changes which deal with providing the
correct data when dealing with masked arrays. There is also a new piece
of code which passes the period into the `Fourier` object, which
required a little bit of hackery (see the comments before the call to
`predictor.set_params`).
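The `set_params` call relies on scikit-learn's double-underscore routing for nested estimators. A small demonstration (using `PolynomialFeatures` as a stand-in for the Fourier transformer; the pattern is identical because both subclass `BaseEstimator`):

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# PolynomialFeatures stands in for plotypus's Fourier transformer here
predictor = Pipeline([("fourier", PolynomialFeatures()),
                      ("regressor", LinearRegression())])

# "<step name>__<parameter>" routes the value to the nested estimator,
# e.g. this would be fourier__period in the real pipeline
predictor.set_params(fourier__degree=5)
```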

The `rephase` function in `plotypus.periodogram` was also modified, such
that masked values are still rephased.
Instead of multiplying each index `i` by `2 * pi`, and then dividing
every element of the matrix by `period`, I instead define `omega = 2 *
pi / period` (as we usually do), and multiply each index `i` by that.
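A sketch of the rephasing idea (assumed signature, not the exact plotypus function): operating on the underlying `.data` array ensures masked values are rephased too, while the mask itself is preserved.

```python
import numpy as np

def rephase(data, period, col=0):
    """Fold the time column of `data` into phases in [0, 1).
    Writing through .data rephases masked values as well."""
    rephased = np.ma.array(data, copy=True)
    rephased.data[:, col] = (rephased.data[:, col] / period) % 1.0
    return rephased

data = np.ma.array([[0.3, 10.1], [1.8, 10.4]],
                   mask=[[False, False], [True, False]])
folded = rephase(data, period=1.5)
```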
Added a new command line option, `--regressor-options key1 val1 ... keyN valN`, which allows the
user to pass arbitrary options to the regressor. The old regressor options (such as `--max-iter`)
are now ignored and will be removed soon. For now, `fit_intercept False` must be passed to retain
the previous behavior, though this will change soon.
All CLI regressor options are now encompassed by the
`--regressor-options` option. The old options have been removed
completely.

Also, a help string has been added to `--regressor-options`.
Now the Fourier design matrix no longer provides the intercept column,
and relies on the regressor object to fit the intercept separately. This
is already the default behavior of all sklearn regressors (that I'm
aware of), and so this simply saves the user from needing to provide
`fit_intercept=False` every time.
Since it takes a series of key-value pairs, I made the metavar "KEY
VALUE". This makes it look like there are two metavars (desired) when
it's really just one with a space in between.
Instead of having the `--regressor` option take one of a few hand-chosen
values, it now takes the name of _any arbitrary regressor object_, such
as the default, `--regressor sklearn.linear_model.LassoLarsIC`. This
requires a little more user-knowledge, but opens a world of
possibilities.

This is accomplished through the new `plotypus.utils.import_name`
function, which imports an object from its full name and returns it,
without affecting the enclosing namespace.
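A minimal sketch of such a function (the real `plotypus.utils.import_name` may differ):

```python
import importlib

def import_name(name):
    """Import an object (e.g. 'sklearn.linear_model.LassoLarsIC') from its
    fully qualified name and return it, without binding anything in the
    caller's namespace."""
    module_name, _, attr = name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

sqrt = import_name("math.sqrt")
```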
@dwysocki dwysocki merged commit b63d61f into develop Jun 14, 2016
@dwysocki dwysocki deleted the regressor-generalization branch June 14, 2016 22:47