# Principal options for regression tools
The goal of our base regression tools is to provide an easy way to train and test models, automatically including standard covariates and properly handling cross validation issues. In many cases you can set up a regression test with just a few lines of code.

There are four main items of documentation for the base regression tools:

1. The [tutorial notebook](http://10.6.39.8:8888/notebooks/ML%20-%20baseline%20regression%20and%20classification%20tools.ipynb) works through a series of examples of how these tools work, from a simple three-line example to more complex configurations.
2. There is a [notebook](http://10.6.39.8:8888/notebooks/Users/Peter/Sample2.ipynb) which uses these tools to generate our production eye color models.
3. There is [online API documentation](http://jarvis.hli.io/~pgarst/datastack/datastack/).
4. This notebook provides a brief summary of the principal methods and options as a quick reference for those already familiar with the basic ideas.

In [1]:
import datastack.ml.baseregress as baseregress
import datastack.ml.baseclassify as baseclassify

We will cover the main methods exposed by the regression class, in the order you usually call them:

1. Instantiation
2. Adding data
3. Running training
4. Displaying the results

The other documentation discusses other methods and options; the ones discussed here are only the most common.

## Instantiating the regression object
The constructor has one required argument which specifies the targets for the regression. You can use one name, a list of names, or a dictionary mapping short names to full names.

In [None]:
base = baseregress.BaseRegress(targets)
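As a minimal sketch, the three accepted forms of the targets argument might look like the values below. The column names are borrowed from the covariate example elsewhere in this notebook, and the short names 'height' and 'bmi' are made up for illustration; substitute your own variables.

```python
# One target, given as a single full column name
targets_single = 'facepheno.height'

# Several targets, given as a list of full column names
targets_list = ['facepheno.height', 'dynamic.FACE.pheno.v1.bmi']

# Several targets, given as a dictionary mapping short names to full names
# (the short names here are hypothetical)
targets_dict = {
    'height': 'facepheno.height',
    'bmi': 'dynamic.FACE.pheno.v1.bmi',
}

# Any one of these would then be passed to the constructor (requires datastack):
# base = baseregress.BaseRegress(targets_dict)
```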

These are the most important optional parameters:

1. nopreproc=False. Setting this to True turns off all preprocessing. If you do so, make sure your data has no missing values, because handling them is part of preprocessing.
2. use_predicted_baseline_covariates=True. The tool uses age, gender and ethnicity as baseline covariates for predicting your targets. By default the tool uses predicted values for these; setting this parameter to False lets you use actual age (and gender, although that hardly matters) rather than predicted age as a baseline covariate.
3. By default the tool uses the current version of Rosetta for the HG38 namespace for any database access it requires, but you can set these to a specific version or namespace using the rversion and namespace parameters.
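Because nopreproc=True also skips missing-value handling, it can help to check your own data first. The following is a rough sketch using plain pandas; the frame and its values are invented for illustration, and the commented constructor call assumes datastack is available.

```python
import pandas as pd

# Hypothetical data with one missing height value
frame = pd.DataFrame({
    'ds.index.sample_key': ['S1', 'S2', 'S3'],
    'facepheno.height': [170.0, None, 165.0],
})

# Count missing values per column
missing_per_column = frame.isnull().sum()
has_missing = bool(missing_per_column.any())

# Only turn preprocessing off when nothing is missing
nopreproc = not has_missing
print(missing_per_column)

# With datastack installed, the flag would be passed like this:
# base = baseregress.BaseRegress(targets, nopreproc=nopreproc)
```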

## Configuration
The regression object has a number of fields which you can set to configure your test. These are the most important:

1. covariates is a dictionary which lets you specify covariate sets to try against your target variables. Each key is the name of a covariate set and each value is the list of variables in that set.
2. estimatorsToRun is a dictionary, keyed by the same covariate set names, which lets you specify a list of estimators to run instead of the default ridge regression. This also allows you to specify the hyperparameters to search in cross validation.

In [None]:
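from sklearn import linear_model

# Define a covariate set named 'size'
size = ['facepheno.height', 'dynamic.FACE.pheno.v1.bmi']
base.covariates['size'] = size

# Run Lasso for the 'size' set, searching these alpha values in cross validation
params = {'alpha': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]}
est1 = {'est': linear_model.Lasso(), 'params': params}
base.estimatorsToRun['size'] = [est1]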
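How the tool performs its hyperparameter search internally is not covered here. As a rough illustration of what searching a params grid over an estimator means, the sketch below uses scikit-learn's GridSearchCV directly. The synthetic data, cv=5 and the r2 scoring are assumptions chosen for the example, not settings of the regression tools.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV

# Synthetic data: 200 samples, 5 features, two of which are irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

# The same alpha grid as in the configuration example
params = {'alpha': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]}

# Fit Lasso once per alpha and fold, keeping the alpha with the best mean r2
search = GridSearchCV(linear_model.Lasso(), params, cv=5, scoring='r2')
search.fit(X, y)

best_alpha = search.best_params_['alpha']
print(best_alpha, search.best_score_)
```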

## Adding data
Use the addData method to explicitly add data to the object. The tools can pull data out of Rosetta, so you only need to do this when you are using data that is not in Rosetta (for example, calculated values) or when it is more convenient. You can call this method as many times as you want.

The column names in the frame you add will generally be names of variables you use as targets or covariates.

The data frame you add must have a column ds.index.sample_key or ds.index.sample_name so that it can be merged with other data.

This method has no optional parameters.

In [None]:
base.addData(dataframe)
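To see why the key column matters, here is a small pandas sketch of a frame that would be suitable to add, merged with a stand-in for data from another source. The variable name 'my.calculated.ratio', the sample keys and the inner join are all assumptions for illustration; the join the tool actually performs is not specified here.

```python
import pandas as pd

KEY = 'ds.index.sample_key'

# Hypothetical calculated values, one row per sample, with the key column
calculated = pd.DataFrame({
    KEY: ['S1', 'S2', 'S3'],
    'my.calculated.ratio': [0.8, 1.1, 0.95],
})

# Stand-in for data that comes from elsewhere (for example, Rosetta)
other = pd.DataFrame({
    KEY: ['S2', 'S3', 'S4'],
    'facepheno.height': [170.0, 182.5, 165.0],
})

# Without the shared key column there would be nothing to join on
assert KEY in calculated.columns

# Only samples present in both frames survive an inner join
merged = calculated.merge(other, on=KEY, how='inner')
print(merged)

# With datastack installed, the frame would be added like this:
# base.addData(calculated)
```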

## Running training
When your regression tool is fully configured you can run training. There are no required parameters, but some of the optional ones can significantly speed up this very slow operation.

Before you can run, every variable you use as a target or as a covariate must be available, either as a column in Rosetta or as a column in a data frame you added.

In [None]:
base.run()

There are lots of optional parameters, of which the most important are these:

1. with_aggregate_covariates=True. By default the tool runs each covariate set you specify alone and also in combination with the baseline age, gender and ethnicity. If you are not interested in the combination runs, set this to False, or use the run_keys parameter to be even more precise about what you want to see.
2. run_keys=None. This parameter lets you specify a list of exactly the covariate sets you want to see in the output. It can save a lot of time if you are not exploring ethnicity and the other baseline variables.
3. with_bootstrap=True. By default the tool does a bootstrap operation to estimate the significance of the results. If you don't want this result, you can save a lot of time by setting this to False.
4. kfargs=None. This is an argument to pass into the kfolds generator. By far the most common use is to tell it to use all the data, rather than reserving a holdout set, with the usage kfargs={'n_folds_holdout': 0}.
5. fit_params=None. This parameter allows you to pass information into the fit function for the underlying estimators. You may find it useful if you are using custom estimators which require additional information.         
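The time-saving options above can be collected into a dictionary and passed as keyword arguments. This is only a sketch of one plausible combination; the name fast_run_options is made up, and the commented call assumes a configured regression object.

```python
# Options taken from the list above, collected so they can be reused across runs
fast_run_options = {
    'with_aggregate_covariates': False,  # skip runs combined with the baseline covariates
    'with_bootstrap': False,             # skip the bootstrap significance estimate
    'kfargs': {'n_folds_holdout': 0},    # use all the data, with no holdout set
}

print(sorted(fast_run_options))

# With a configured regression object (requires datastack):
# base.run(**fast_run_options)
```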

## Displaying the results
After training runs, the regression tool generates a dataframe in which each row is a covariate set and each column is an item of information about the resulting model, such as r2 or mae for one of the targets. With no arguments, the display method shows this frame, which is usually what you want.

See the tutorial for information on how to get cross validation objects and use them to plot observed vs. predicted data.

In [None]:
base.display()

You can use the optional frame parameter to display a data frame other than the standard results from the run.

The fmtdict parameter lets you set the precision for different columns. The following value, passed in as the fmtdict parameter to display, tells it to use 5 decimal places for each column containing 'MAE':

In [None]:
# Use 5 decimal places for every column whose name contains 'MAE'
fmt = {}
fmt['MAE'] = 5
base.display(fmtdict=fmt)
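To make the substring matching concrete, the sketch below applies a precision dictionary of this shape to a small pandas frame. The helper format_columns and the result column names are hypothetical; this shows one way such formatting could work, not how display is actually implemented.

```python
import pandas as pd

def format_columns(frame, fmtdict):
    """Return a copy of frame with matching columns rendered as fixed-precision strings."""
    out = frame.copy()
    for col in out.columns:
        for key, places in fmtdict.items():
            if key in col:
                # Format every value in the column with the requested number of decimals
                out[col] = out[col].map(lambda v, p=places: f"{v:.{p}f}")
                break
    return out

# Hypothetical results: one row per covariate set, one column per metric
results = pd.DataFrame(
    {'height MAE': [0.0512349, 0.0498761], 'height R2': [0.41, 0.44]},
    index=['size', 'AGE + size'],
)

fmt = {'MAE': 5}
formatted = format_columns(results, fmt)
print(formatted)
```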