Load non-timeseries data from file #92

Closed
3 tasks
brynpickering opened this issue Apr 4, 2018 · 5 comments · Fixed by #532

@brynpickering
Member

Problem description

Models with a large spatial dimension can become incredibly cumbersome to write in YAML, particularly for links. It would be good to be able to load that information from a CSV file or a pandas DataFrame.

This partially overlaps with issue #91, which covers loading from a DataFrame.

Steps to introduce functionality

  • Decide on syntax within YAML environment
    • for links, could we just say links: file=link_matrix.csv?
  • Decide on required CSV syntax
    • A matrix or single column?
  • Implement file loading in creation of model._model_run
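To make the "matrix or single column" question concrete, here is a minimal sketch of reading a link matrix into a list of node pairs. It assumes a layout that is not decided anywhere in this issue: nodes on both axes, with any non-empty, non-zero cell marking a link. The function name `links_from_matrix` is hypothetical and not part of Calliope.

```python
import csv
import io


def links_from_matrix(csv_text):
    """Return (from, to) pairs for every non-empty, non-zero matrix cell.

    Assumed layout: first row holds target nodes, first column holds
    source nodes. This layout is an illustration, not a Calliope format.
    """
    table = list(csv.reader(io.StringIO(csv_text)))
    targets = table[0][1:]
    pairs = []
    for line in table[1:]:
        source = line[0]
        for target, cell in zip(targets, line[1:]):
            if cell.strip() not in ("", "0"):
                pairs.append((source, target))
    return pairs


link_matrix = ",AUT,CHE\nHUN,1,0\nAUT,0,1\n"
pairs = links_from_matrix(link_matrix)
```

A single-column (long) layout would instead list one link per row, which is easier to extend with per-link attributes but less compact for dense networks.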
@brynpickering brynpickering self-assigned this Apr 4, 2018
@brynpickering brynpickering removed their assignment Apr 9, 2018
@sjpfenninger sjpfenninger added this to the 0.6.x milestone Jan 8, 2019
@brynpickering brynpickering added this to To do in v0.7.0 May 20, 2021
@brynpickering brynpickering moved this from To do to Highest priority changes in v0.7.0 May 20, 2021
@brynpickering brynpickering moved this from Sharing and visualising data to Refactoring user-facing in v0.7.0 May 20, 2021
@sjpfenninger
Member

sjpfenninger commented Jul 29, 2021

Possible approach:

We introduce a new top-level item called data_sources. This is a list of dicts which specify the CSV files from which we read data. By specifying parameter, rows, and columns, we can define exactly what we want to read from file, and how the file is structured. This allows enough flexibility even when we add new dimensions in the future.

add_dimensions allows the user to give a single value for a dimension which is not in the file itself. This allows us to have a file with costs without the need to have cost class in the columns or the index. It also allows us to replicate the functionality that currently exists when reading in resources from file: use one file that doesn't specify its technology to define the resource for multiple technologies.

data_sources:
    # A simple per-technology, per-node parameter
    - parameter: energy_cap_max
      rows: [nodes]
      columns: [techs]
      file: energy_cap_max.csv

    # A cost that changes through time
    - parameter: cost_om_prod
      file: cost_om_prod.csv
      add_dimensions:
            costs: monetary
      rows: [timesteps]
      columns: [techs, nodes]

    # Resource read from the same file for two separate techs
    # Note: existing way to read a resource from file would be
    # removed, so there is just one way to read in CSV files
    - parameter: resource
      file: resource_pv.csv
      add_dimensions:
            techs: pv_rooftop
      rows: [timesteps]
      columns: [nodes]
    - parameter: resource
      file: resource_pv.csv
      add_dimensions:
            techs: pv_largescale
      rows: [timesteps]
      columns: [nodes]
    - parameter: energy_cap_max
      rows: [nodes]
      columns: [techs]
      file: energy_cap_max.csv
    - parameter: resource
      file: resource_pv.nc
      add_dimensions:
            techs: pv_rooftop
    - parameter: resource
      file: resource_pv.nc
      add_dimensions:
            techs: pv_largescale
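As a rough sketch of how one such entry could be interpreted, the following reads a CSV according to rows, columns, and add_dimensions and flattens it into one record per value. It assumes one header row per entry in columns and one leading column per entry in rows; the function name `read_data_source` and the record layout are hypothetical, not an existing Calliope API.

```python
import csv
import io


def read_data_source(csv_text, rows, columns, add_dimensions=None):
    """Flatten a CSV into a list of {dimension: label, ..., "value": float}.

    Assumes len(columns) header rows and len(rows) index columns.
    Empty cells are skipped.
    """
    add_dimensions = add_dimensions or {}
    table = list(csv.reader(io.StringIO(csv_text)))
    n_header, n_index = len(columns), len(rows)
    header, body = table[:n_header], table[n_header:]
    records = []
    for line in body:
        row_labels = dict(zip(rows, line[:n_index]))
        for j in range(n_index, len(line)):
            if line[j] == "":
                continue
            col_labels = {dim: header[k][j] for k, dim in enumerate(columns)}
            records.append(
                {**add_dimensions, **row_labels, **col_labels,
                 "value": float(line[j])}
            )
    return records


example_csv = ",pv,wind\nregion1,10,20\nregion2,,5\n"
records = read_data_source(
    example_csv, rows=["nodes"], columns=["techs"],
    add_dimensions={"costs": "monetary"},
)
```

Each record carries every dimension explicitly, so dimensions supplied through add_dimensions are indistinguishable from those read from the file.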

Consider NetCDF

You would still need to explicitly list each parameter that you want to read from file? For the first two parameters from the above example:

data_sources:
    - parameter: energy_cap_max
      file: model_data.nc  # has a variable energy_cap_max, with nodes and techs as dims

    - parameter: cost_om_prod
      file: model_data.nc  # also has a variable cost_om_prod, with timesteps, techs, nodes as dims
      add_dimensions:
            costs: monetary  # can still add a dim

Suggestion on how to use the same resource data for two techs:

data_sources:
    - parameter: resource
      file: resource_pv.nc  # has a variable resource, with timesteps and nodes as dims
      add_dimensions:
            techs: [pv_rooftop, pv_openfield]
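One way to read a list-valued add_dimensions entry is as a request to repeat the same data once per listed value. A minimal sketch of that expansion follows; the helper name `expand_add_dimensions` is hypothetical, and treating several list-valued dimensions as a full cross product is an assumption rather than something agreed in this thread.

```python
from itertools import product


def expand_add_dimensions(add_dimensions):
    """Expand scalar-or-list dimension values into one dict per combination."""
    names = list(add_dimensions)
    value_lists = [
        value if isinstance(value, list) else [value]
        for value in add_dimensions.values()
    ]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


expanded = expand_add_dimensions({"techs": ["pv_rooftop", "pv_openfield"]})
```

Each resulting dict could then be passed on as a single-valued add_dimensions, so the list form stays shorthand for repeated entries.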

@brynpickering
Member Author

Based on offline discussions, the above structure seems to make sense. One change would be to make 'parameter' its own dimension, so that parameters can be defined within the CSV or can be added as dimensions, e.g.:

data_sources:
    - rows: [techs]
      columns: [parameter]
      file: tech_params.csv
    - rows: [nodes]
      columns: [techs]
      add_dimensions:
        parameter: energy_cap_max
      file: energy_cap_max.csv
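To illustrate 'parameter' as a dimension in the first entry above, this sketch splits a techs-by-parameter CSV into one mapping per parameter. The function name `split_by_parameter` and the `lifetime` column are made up for the example; only `energy_cap_max` appears in this thread.

```python
import csv
import io
from collections import defaultdict


def split_by_parameter(csv_text):
    """Return {parameter: {row_label: value}} from a CSV whose columns
    are parameter names and whose first column holds the row labels."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    params = defaultdict(dict)
    for line in reader:
        label = line[0]
        for name, cell in zip(header[1:], line[1:]):
            if cell != "":
                params[name][label] = float(cell)
    return dict(params)


# "lifetime" is a hypothetical second parameter for illustration.
tech_params_csv = ",energy_cap_max,lifetime\npv,100,25\nwind,200,\n"
by_parameter = split_by_parameter(tech_params_csv)
```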

@brynpickering
Member Author

The format might need refinement with respect to links and carriers. The way in which we may now define links and techs as lists of dictionaries (#324, #362) might fit well. E.g., for links, we might want to define all of them in a CSV:

            from  to
HUN_to_AUT  HUN   AUT
AUT_to_CHE  AUT   CHE

Which replaces:

links:
    - id: HUN_to_AUT
      from: HUN
      to: AUT
    - id: AUT_to_CHE
      from: AUT
      to: CHE

with:

data_sources:
    # Link definitions (from/to nodes), one row per link
    - rows: [links]
      columns: [parameter]
      file: links.csv

But, how do we handle the following:

links:
    - id: HUN_to_AUT
      from: HUN
      to: AUT
      link_techs:
        - id: ac_transmission
          distance: 10
          energy_cap_max: 20
    - id: AUT_to_CHE
      from: AUT
      to: CHE
      link_techs:
        - id: ac_transmission
          energy_cap_max: 5
        - id: dc_transmission
          energy_cap_max: 10

It seems we would need a separate file to define the nodes connected by links (the above table) and to define the technologies and their capacities within links:

            ac_transmission  ac_transmission  dc_transmission
            distance         energy_cap_max   energy_cap_max
HUN_to_AUT  10               20
AUT_to_CHE                   5                10

then use the data source definition:

data_sources:
    # Per-link, per-link-technology parameters
    - rows: [links]
      columns: [link_techs, parameter]
      file: link_params.csv
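A sketch of how the two files could be recombined into the nested links structure shown earlier, assuming links.csv has from/to columns and link_params.csv has two header rows (link_techs, then parameter). The function name `build_links` is hypothetical, and link_techs is stored here as a dict keyed by tech id rather than a list, purely to keep the example short.

```python
import csv
import io


def build_links(links_csv, link_params_csv):
    """Merge link endpoints and per-link tech parameters into nested dicts."""
    link_rows = list(csv.reader(io.StringIO(links_csv)))
    links = {}
    for link_id, node_from, node_to in link_rows[1:]:
        links[link_id] = {
            "id": link_id, "from": node_from, "to": node_to, "link_techs": {},
        }

    param_rows = list(csv.reader(io.StringIO(link_params_csv)))
    techs, params = param_rows[0], param_rows[1]  # two header rows
    for line in param_rows[2:]:
        link_techs = links[line[0]]["link_techs"]
        for j in range(1, len(line)):
            if line[j] == "":
                continue  # this tech/parameter is not defined for this link
            link_techs.setdefault(techs[j], {})[params[j]] = float(line[j])
    return list(links.values())


links_csv = ",from,to\nHUN_to_AUT,HUN,AUT\nAUT_to_CHE,AUT,CHE\n"
link_params_csv = (
    ",ac_transmission,ac_transmission,dc_transmission\n"
    ",distance,energy_cap_max,energy_cap_max\n"
    "HUN_to_AUT,10,20,\n"
    "AUT_to_CHE,,5,10\n"
)
links = build_links(links_csv, link_params_csv)
```

Empty cells are what allow a link to carry only some of the technologies, which is how AUT_to_CHE ends up with dc_transmission while HUN_to_AUT does not.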

@brynpickering
Member Author

One addition: there should be the ability to ignore rows/columns that define comments/references/units that aren't relevant to model dimensions.
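As one possible shape for this, columns could be dropped by name before the table is interpreted. The sketch below assumes a hypothetical `ignore` set of column names; no such option is defined in this issue, and the same idea would apply to rows.

```python
def drop_columns(table, ignore):
    """Return a copy of a list-of-rows table without the named columns.

    `table[0]` is taken to be the header row; `ignore` is a set of
    header names (e.g. units or references) to leave out.
    """
    header = table[0]
    keep = [j for j, name in enumerate(header) if name not in ignore]
    return [[line[j] for j in keep] for line in table]


raw_table = [
    ["", "pv", "units", "source"],
    ["region1", "10", "kW", "ref A"],
]
cleaned = drop_columns(raw_table, ignore={"units", "source"})
```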

@brynpickering
Member Author

brynpickering commented Apr 20, 2022

And maybe the possibility for a direct SQL connection
