Templates for derived variables #98

smmaurer · 2019-02-27T05:04:04Z

This issue is to plan out a set of templates for derived variables.

Overview

Each of these templates would generate an indexed pd.Series() associated with an Orca table: not a local column that's part of the wrapped DataFrame, but a stand-alone column that can be evaluated lazily but still accessed as part of the table.

Open questions:

Is it better to define separate templates, or implement this as a single template with multiple use cases? Leaning toward separate templates, because it's clearer which parameters apply in each case.
Are there cache settings we can assume, or should we provide all the options?
How should we handle creation of new tables?

Some resources:

The variable_generators project: https://github.com/UDST/variable_generators
The column_builder project: https://github.com/urbansim/column_builder
Indicators used in recent models, e.g. LCOG
Pages in Notion

1. `urbansim_templates.data.ColumnFromExpression()`

Creates a column from a string expression, accepting anything that can be passed to df.eval(). This could be math, an existing column to duplicate, or something more complicated. Cannot involve columns from other tables, though.

Params:

column_name (column to be created)
destination_table
expression
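
As a rough illustration of case 1, here is a minimal sketch of what the template would compute, using plain pandas only. The helper name `column_from_expression` and the toy `blocks` table are hypothetical, and Orca registration is left out entirely; this is not the template's actual implementation.

```python
import pandas as pd

# Toy stand-in for a 'blocks' table; the values are made up for illustration.
blocks = pd.DataFrame(
    {"low_income": [10, 25, 0], "population": [100, 50, 40]},
    index=pd.Index([101, 102, 103], name="block_id"),
)


def column_from_expression(df, expression, column_name):
    """Evaluate a string expression against one table and return a named Series.

    Hypothetical helper: it only shows the df.eval() step, not the template.
    """
    series = df.eval(expression)  # same syntax that DataFrame.eval() accepts
    return series.rename(column_name)


pct_low_income = column_from_expression(
    blocks, "low_income * 100 / population", "pct_low_income"
)
```

The result keeps the index of the source table, so it could be attached to that table as a stand-alone column.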

2. `urbansim_templates.data.ColumnFromBroadcast()`

Creates a column by "broadcasting" coarse-grained data to another table, taking advantage of join key relationships. For example: adding the census tract id to the households table, or adding the mean zonal building height to the buildings table.

I think our implementation of this should not require Orca broadcasts, because of the limitations discussed in Issue #78. Instead, we can use overlapping column names as implicit join keys.

Params:

column_name (column to be created)
destination_table
source_table (allow a chain, or not?)
source_column (column name or expression)
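
To make the implicit join key idea concrete, here is a small pandas sketch under one assumption: the source table's index name (`zone_id`) also appears as a column in the destination table. The helper `broadcast_column` and the toy tables are hypothetical, and chained source tables are not handled.

```python
import pandas as pd

# Coarse-grained source table, indexed by the join key.
zones = pd.DataFrame(
    {"mean_building_height": [12.0, 30.5]},
    index=pd.Index([1, 2], name="zone_id"),
)

# Finer-grained destination table; 'zone_id' overlaps with the source index name.
buildings = pd.DataFrame(
    {"zone_id": [1, 2, 2, 3]},
    index=pd.Index([10, 11, 12, 13], name="building_id"),
)


def broadcast_column(destination, source, source_column, column_name):
    """Copy a source column onto the destination table via the shared key.

    Hypothetical helper: the join key is assumed to be the source index name.
    """
    key = source.index.name
    if key not in destination.columns:
        raise KeyError(f"destination table has no column named {key!r}")
    # Series.map looks each key value up in the source column's index;
    # keys with no match in the source become NaN.
    return destination[key].map(source[source_column]).rename(column_name)


zone_height = broadcast_column(
    buildings, zones, "mean_building_height", "zone_mean_height"
)
```

Building 13 refers to a zone that is missing from `zones`, so its value is NaN. How the real template should treat such rows is an open design choice.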

3. `urbansim_templates.data.ColumnFromAggregation()`

Creates a column by grouping and transforming finer-grained data, taking advantage of join key relationships among tables. For example: number of households per tract, mean home price by zone, etc.

Params:

column_name (column to be created)
destination_table
source_table
source_column (column name or expression)
filters (?)
group_by (column name, must appear in source table and destination table)
aggregation (min, max, mean, count, sum, stdev, etc.)
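
A sketch of the aggregation case in plain pandas, assuming for simplicity that `group_by` names a column of the source table and the index of the destination table. The helper `aggregate_column`, the toy tables, and the choice to express `filters` as a `DataFrame.query()` string are all assumptions made for illustration.

```python
import pandas as pd

# Finer-grained source table.
households = pd.DataFrame(
    {"tract_id": [1, 1, 2], "income": [40000, 60000, 30000]},
    index=pd.Index([100, 101, 102], name="household_id"),
)

# Destination table, indexed by the grouping key. Tract 3 has no households.
tracts = pd.DataFrame(index=pd.Index([1, 2, 3], name="tract_id"))


def aggregate_column(destination, source, source_column, group_by,
                     aggregation, column_name, filters=None):
    """Group the source table, aggregate one column, and align to the destination.

    Hypothetical helper: 'filters' is an optional DataFrame.query() string.
    """
    if filters is not None:
        source = source.query(filters)
    grouped = source.groupby(group_by)[source_column].agg(aggregation)
    # Align to the destination index; groups with no source rows become NaN.
    return grouped.reindex(destination.index).rename(column_name)


mean_income = aggregate_column(
    tracts, households, "income", "tract_id", "mean", "mean_income"
)
```

Passing `"count"` or `"sum"` as `aggregation` works the same way. Empty groups come back as NaN here, and a real template might prefer to fill counts with zero.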

4. `urbansim_templates.data.ColumnFromNetwork()`

Creates a column from a Pandana network aggregation. (Params will need to be fine-tuned a bit.)

Params:

column_name (column to be created)
network
radius
aggregation (min, max, mean, count, sum, etc.)
decay


smmaurer · 2019-03-08T20:07:24Z

YAML representations

Good idea from @apdjustino to write specs from the perspective of what the yaml representation of templates would look like as well.

Here's what case 1 from above would look like, more or less.

(ModelManager currently saves parameters in alphabetical order, which is an easy way to make the yaml files play nicely with git diffs. It might be better to customize the ordering for each template, but we haven't implemented that yet.)

modelmanager_version: 0.2.dev3

saved_object:
    autorun: true  # register column when yaml is loaded
    cache: true  # for orca
    cache_scope: iteration  # for orca
    column_name: pct_low_income
    expression: low_income*100/population
    name: pct_low_income  # name of saved object, we can provide good defaults
    table: blocks
    tags:
    - estimation
    - demographics
    template: ColumnFromExpression
    template_version: 0.2.dev2

template, template_version, name, tags, and autorun are standard parameters for every template.

This format is not really optimized for users to create yaml directly; the objective is more for it to be human-readable and editable while storing more metadata than Orca currently does. But if folks have ideas for improving the format, we should definitely explore them!

Multiple objects in a single yaml file

Adding a link to issue #104, where we're discussing what kind of super-structure to create for storing multiple columns and other associated info.

smmaurer · 2019-04-01T21:21:58Z

Implementation using settings objects

I'm part-way through building these templates, and I think it would be a good idea to implement them using the "settings objects" sketched out in issue #54.

Here's what it would look like:

CoreTemplateSettings, for all templates

name, tags, notes, autorun

ExpressionSettings, for the ColumnFromExpression template

table, expression

BroadcastSettings, for the ColumnFromBroadcast template

tables, expression

AggregationSettings, for the ColumnFromAggregation template

tables, expression, aggregation, filters

OutputColumnSettings, used by any template that generates or modifies a column

column_name, table, data_type, missing_values, cache, cache_scope

This way, the signature of a template can be much simpler -- no need to duplicate core/output parameters across many templates:

class ColumnFromBroadcast():
    """
    Parameters
    ----------
    meta : CoreTemplateSettings
    data : BroadcastSettings
    output : OutputColumnSettings

    """

Usage would look like this:

c = ColumnFromBroadcast()
c.data.tables = ['households', 'zones']
c.data.expression = 'residential_vacancy_rate * 100'
c.output.column_name = 'residential_vacancy_pct'

And the yaml file would remain similar, but group settings into three dicts representing the component settings objects.
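
One way to picture the settings objects and the grouped representation is the sketch below, written with standard-library dataclasses. The field names come from the lists above; the default values, the `to_dict()` method, and the dict keys `meta`, `data`, and `output` are assumptions for illustration, not the implemented API.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CoreTemplateSettings:
    name: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    notes: Optional[str] = None
    autorun: bool = True  # assumed default


@dataclass
class BroadcastSettings:
    tables: List[str] = field(default_factory=list)
    expression: Optional[str] = None


@dataclass
class OutputColumnSettings:
    column_name: Optional[str] = None
    table: Optional[str] = None
    data_type: Optional[str] = None
    missing_values: Optional[str] = None
    cache: bool = False  # assumed default
    cache_scope: str = "forever"  # assumed default


class ColumnFromBroadcast:
    """Sketch of a template composed of three settings objects."""

    def __init__(self, meta=None, data=None, output=None):
        self.meta = CoreTemplateSettings() if meta is None else meta
        self.data = BroadcastSettings() if data is None else data
        self.output = OutputColumnSettings() if output is None else output

    def to_dict(self):
        # Hypothetical serialization: one dict per settings object.
        return {
            "template": type(self).__name__,
            "meta": asdict(self.meta),
            "data": asdict(self.data),
            "output": asdict(self.output),
        }


c = ColumnFromBroadcast()
c.data.tables = ["households", "zones"]
c.data.expression = "residential_vacancy_rate * 100"
c.output.column_name = "residential_vacancy_pct"
saved = c.to_dict()
```

Because each settings class is defined once, any template that generates a column can reuse `OutputColumnSettings` without repeating its parameters.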

Pros and cons

This will substantially reduce the amount of boilerplate code that's copied from template to template to support repeated parameters. I think it will also make it easier for users -- for example, every template that creates a new column will accept the same settings and interpret them the same way. More shared code makes it easier to implement and test new templates, too.

The main thing to worry about is that this adds another layer of abstraction to the code, which can make things more confusing and fragile. But I think the advantages outweigh this.

Implementation

I'll create the settings objects in a new PR, first adding them to the ColumnFromExpression template, which is already finished. Then I'll use them to build the rest of the column templates.

smmaurer mentioned this issue Mar 12, 2019

[0.2.dev5] Template for column from expression #105

Merged

4 tasks

smmaurer mentioned this issue Mar 29, 2019

Work in progress: ColumnFromBroadcast and ColumnFromAggregation #107

Draft

5 tasks

smmaurer mentioned this issue Apr 2, 2019

[0.2.dev6] Introducing settings objects #108

Merged

7 tasks
