Templates for derived variables #98

Open
smmaurer opened this issue Feb 27, 2019 · 2 comments

Comments

@smmaurer
Member

smmaurer commented Feb 27, 2019

This issue is to plan out a set of templates for derived variables.

Overview

Each of these templates would generate an indexed pd.Series() associated with an Orca table: not a local column that's part of the wrapped DataFrame, but a stand-alone column that can be evaluated lazily but still accessed as part of the table.

Open questions:

  • Is it better to define separate templates, or implement this as a single template with multiple use cases? I'm leaning toward separate templates, because it's clearer which parameters apply in each case.
  • Are there cache settings we can assume, or should we provide all the options?
  • How should we handle creation of new tables?

Some resources:

1. urbansim_templates.data.ColumnFromExpression()

Creates a column from a string expression, accepting anything that can be passed to df.eval(). This could be math, an existing column to duplicate, or something more complicated. Cannot involve columns from other tables, though.

Params:

  • column_name (column to be created)
  • destination_table
  • expression
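
For illustration, here is a minimal sketch of the core of this template using plain pandas. The helper name `column_from_expression` and the sample `blocks` data are hypothetical, and Orca registration and caching are left out; it only shows how a `df.eval()` expression becomes a named, indexed Series.

```python
import pandas as pd


def column_from_expression(df, expression, column_name):
    """Evaluate a string expression against one table; return a named Series.

    Hypothetical helper for illustration, not the template's actual code.
    """
    series = df.eval(expression)  # anything df.eval() accepts
    return series.rename(column_name)


# Made-up sample data, indexed by block id
blocks = pd.DataFrame(
    {"low_income": [10, 30, 0], "population": [100, 120, 50]},
    index=pd.Index([101, 102, 103], name="block_id"),
)

pct_low_income = column_from_expression(
    blocks, "low_income*100/population", "pct_low_income"
)
```

The result keeps the index of the source table, so it can be attached to that table later without a join.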

2. urbansim_templates.data.ColumnFromBroadcast()

Creates a column by "broadcasting" coarse-grained data to another table, taking advantage of join key relationships. For example: adding the census tract id to the households table, or adding the mean zonal building height to the buildings table.

I think our implementation of this should not require Orca broadcasts, because of the limitations discussed in Issue #78. Instead, we can use overlapping column names as implicit join keys.

Params:

  • column_name (column to be created)
  • destination_table
  • source_table (allow a chain, or not?)
  • source_column (column name or expression)
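
A rough sketch of the implicit-join-key idea in plain pandas, with no Orca broadcasts involved. The helper name `column_from_broadcast` and the sample tables are hypothetical, and it assumes the simplest case: the source table's index name also appears as a column in the destination table.

```python
import pandas as pd


def column_from_broadcast(dest, source, source_column, column_name):
    """Copy coarse-grained values from `source` onto the rows of `dest`.

    Hypothetical helper. Assumes `source` is indexed by the join key and
    that a column with the same name exists in `dest`.
    """
    key = source.index.name
    if key not in dest.columns:
        raise ValueError("no overlapping column to use as a join key: %r" % key)
    # Look up each destination row's key value in the source table
    return dest[key].map(source[source_column]).rename(column_name)


# Made-up sample data
households = pd.DataFrame(
    {"zone_id": [1, 2, 1]},
    index=pd.Index([10, 11, 12], name="household_id"),
)
zones = pd.DataFrame(
    {"residential_vacancy_rate": [0.05, 0.10]},
    index=pd.Index([1, 2], name="zone_id"),
)

vacancy = column_from_broadcast(
    households, zones, "residential_vacancy_rate", "zone_vacancy_rate"
)
```

Supporting a chain of source tables would mean repeating this lookup once per link, which is one reason the "allow a chain, or not?" question matters.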

3. urbansim_templates.data.ColumnFromAggregation()

Creates a column by grouping and transforming finer-grained data, taking advantage of join key relationships among tables. For example: number of households per tract, mean home price by zone, etc.

Params:

  • column_name (column to be created)
  • destination_table
  • source_table
  • source_column (column name or expression)
  • filters (?)
  • group_by (column name, must appear in source table and destination table)
  • aggregation (min, max, mean, count, sum, stdev, etc.)
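
As a sketch of what the aggregation step might look like in plain pandas: the helper name `column_from_aggregation` and the sample tables are hypothetical, filters are omitted, and it assumes the destination table is indexed by the `group_by` column.

```python
import pandas as pd


def column_from_aggregation(dest, source, source_column, group_by,
                            aggregation, column_name):
    """Aggregate finer-grained `source` rows up to the rows of `dest`.

    Hypothetical helper. Assumes `dest` is indexed by `group_by`.
    """
    aggregated = source.groupby(group_by)[source_column].agg(aggregation)
    # Align to the destination index; groups with no source rows become NaN
    return aggregated.reindex(dest.index).rename(column_name)


# Made-up sample data
households = pd.DataFrame(
    {"tract_id": [1, 1, 2], "income": [40, 60, 100]},
    index=pd.Index([10, 11, 12], name="household_id"),
)
tracts = pd.DataFrame(index=pd.Index([1, 2, 3], name="tract_id"))

mean_income = column_from_aggregation(
    tracts, households, "income", "tract_id", "mean", "mean_income"
)
```

Tract 3 has no households here, so its value comes back missing. How to fill cases like that is the kind of thing a missing-values setting would need to decide.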

4. urbansim_templates.data.ColumnFromNetwork()

Creates a column from a Pandana network aggregation. (Params will need to be fine-tuned a bit.)

Params:

  • column_name (column to be created)
  • network
  • radius
  • aggregation (min, max, mean, count, sum, etc.)
  • decay
@smmaurer
Member Author

smmaurer commented Mar 8, 2019

YAML representations

Good idea from @apdjustino to write specs from the perspective of what the yaml representation of templates would look like as well.

Here's what case 1 from above would look like, more or less.

(ModelManager currently saves parameters in alphabetical order, which is an easy way to make the yaml files play nicely with git diffs. It might be better to customize the ordering for each template, but we haven't implemented that yet.)

modelmanager_version: 0.2.dev3

saved_object:
    autorun: true  # register column when yaml is loaded
    cache: true  # for orca
    cache_scope: iteration  # for orca
    column_name: pct_low_income
    expression: low_income*100/population
    name: pct_low_income  # name of saved object, we can provide good defaults
    table: blocks
    tags:
    - estimation
    - demographics
    template: ColumnFromExpression
    template_version: 0.2.dev2

template, template_version, name, tags, and autorun are standard parameters for every template.

This format is not really optimized for users to create yaml directly; the objective is more for it to be human-readable and editable while storing more metadata than Orca currently does. But if folks have ideas for improving the format, we should definitely explore them!

Multiple objects in a single yaml file

Adding a link to issue #104, where we're discussing what kind of super-structure to create for storing multiple columns and other associated info.

@smmaurer
Member Author

smmaurer commented Apr 1, 2019

Implementation using settings objects

I'm part-way through building these templates, and I think it would be a good idea to implement them using the "settings objects" sketched out in issue #54.

Here's what it would look like:

CoreTemplateSettings, for all templates

  • name, tags, notes, autorun

ExpressionSettings, for the ColumnFromExpression template

  • table, expression

BroadcastSettings, for the ColumnFromBroadcast template

  • tables, expression

AggregationSettings, for the ColumnFromAggregation template

  • tables, expression, aggregation, filters

OutputColumnSettings, used by any template that generates or modifies a column

  • column_name, table, data_type, missing_values, cache, cache_scope

This way, the signature of a template can be much simpler -- no need to duplicate core/output parameters across many templates:

class ColumnFromBroadcast():
    """
    Parameters
    ----------
    meta : CoreTemplateSettings
    data : BroadcastSettings
    output : OutputColumnSettings

    """

Usage would look like this:

c = ColumnFromBroadcast()
c.data.tables = ['households', 'zones']
c.data.expression = 'residential_vacancy_rate * 100'
c.output.column_name = 'residential_vacancy_pct'

And the yaml file would remain similar, but group settings into three dicts representing the component settings objects.
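
To make the grouping concrete, here is one way the settings objects and the three-part dict could be sketched with standard-library dataclasses. The field names come from the lists above; the default values, the `to_dict()` method, and the `meta`/`data`/`output` dict keys are assumptions for illustration only.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CoreTemplateSettings:
    name: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    notes: Optional[str] = None
    autorun: bool = True  # placeholder default


@dataclass
class BroadcastSettings:
    tables: List[str] = field(default_factory=list)
    expression: Optional[str] = None


@dataclass
class OutputColumnSettings:
    column_name: Optional[str] = None
    table: Optional[str] = None
    data_type: Optional[str] = None
    missing_values: Optional[str] = None
    cache: bool = False  # placeholder default
    cache_scope: str = "forever"  # placeholder default


class ColumnFromBroadcast:
    """Sketch only: holds three settings objects and nothing else."""

    def __init__(self, meta=None, data=None, output=None):
        self.meta = meta or CoreTemplateSettings()
        self.data = data or BroadcastSettings()
        self.output = output or OutputColumnSettings()

    def to_dict(self):
        # One dict per settings object, ready to hand to a yaml writer
        return {
            "meta": asdict(self.meta),
            "data": asdict(self.data),
            "output": asdict(self.output),
        }


c = ColumnFromBroadcast()
c.data.tables = ['households', 'zones']
c.data.expression = 'residential_vacancy_rate * 100'
c.output.column_name = 'residential_vacancy_pct'
spec = c.to_dict()
```

Because `OutputColumnSettings` is shared, any other column template could reuse it unchanged and only swap out the `data` object.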

Pros and cons

This will substantially reduce the amount of boilerplate code that's copied from template to template to support repeated parameters. I think it will also make it easier for users -- for example, every template that creates a new column will accept the same settings and interpret them the same way. More shared code makes it easier to implement and test new templates, too.

The main thing to worry about is that this does add another layer of abstraction to the code, which can make things more confusing and fragile. But I think the advantages outweigh this.

Implementation

I'll create the settings objects in a new PR, first adding them to the ColumnFromExpression template, which is already finished. Then I'll use them to build the rest of the column templates.
