Support multiple data sets for each component #180

AJamesPhillips (Maintainer) · 2021-12-23T11:45:25Z

TODO:

- Split this into parts.
- Cover what happens when we have a duplicate component we want to merge with another component, e.g. if someone adds another component for "gravitational field strength" or "population of city X in year Y".
- Do we want there to be a canonical source of truth?
- Can we pull any of these values easily from wikidata.org? As of writing this, the entry for "gravitational field strength" does not have its value.
- How do changes in one component or equation get propagated through the system? If it's versioned then you can warn there is a newer version. This isn't a GitHub for data, it's a package manager for data, like npm, PyPI or apt.
- Modelica has a standard library that contains constants like g and pi. How might this be leveraged?
- Modelica also has many packages of constants, equations etc.

After #178 we could implement a place for people to deposit and later improve, discuss, use (calculations, graphics, simulations) ~~primarily temporal and spatial~~ data relevant to situations of public interest.

Support datasets
- each dataset is associated with one component (requires "Provide single pages for components, like a wikipedia page" #178), where the component should describe what the data is, and the dataset describes its method of collection or processing relative to other datasets (potential or actual, and in DataCurator or not)
- each dataset should allow different versions of it
- each dataset should have one author?
- each dataset should belong in one knowledge base? (to allow for authorisation / permissions). The alternative is to make every dataset public. However, tying datasets to a knowledge base also allows public content to be curated, so that large but duplicate / unhelpful datasets can be held back from going public until they are ready, which prevents or reduces link rot problems. Later, faster & more transparent methods can be introduced, e.g. an open submission process where pending datasets are public but not marked as "stable".
For each component (requires Provide single pages for components, like a wikipedia page #178), display its different data sets, showing only the last version of each dataset

A lot of the data we may be interested in has a spatial and/or temporal component to it, though some data, like constants, has neither.

We also need a specific spatial data set type that can be used by other data sets.

Potential schema and use

`space` table

Is similar to a component in that it has a title & description but it is always public so that it is accessible across all bases as it is used as a "type" for the data table.
Or could add a "space" type to wcomponent. Would then need to either:

- always load the base that contained these components, or
- load the components on demand.

Also, the foreign key reference at the DB level would not stop someone making a non-space component and then using it as a reference for a space.

I imagine this will be used similarly to the data_set entries, in that there might be two entries for area A, one from the point of view (POV) of country B and a second from the POV of country C. They would have similar titles: "Area A (POV Country B)" and "Area A (POV Country C)". When adding POV functionality, we will likely want to add a many-to-many table between wcomponent and space to match wcomponent political actors to the specific space table entries that form their POV.

| name | type |
| --- | --- |
| id | uuid |
| title | string |
| description | string |
| external_source_id | uuid -> wcomponent.id nullable |
| author_id | uuid |
| created_at | datetime |
| valid_from | datetime nullable |

description provides an opportunity for the author to explain why this space entry is different from other entries describing a similar space.
valid_from describes when this space "came into being" according to the space's "creators", e.g. when a native population was thought to have arrived in a geographical area, when a population "formally" "founded" a new country.

`space_version` table

| name | type |
| --- | --- |
| id | uuid -> space.id |
| version | integer |
| description | string |
| author_id | uuid |
| created_at | datetime |
| border | geometry(LinestringZM,4326) nullable |

id & version have a unique compound key
description provides an opportunity for the author to explain what has changed in this version of the data.
border: perhaps geometry(LinestringZM,4326), see https://kartoza.com/en/blog/adding-elevation-to-a-line-from-a-dem-in-postgis-and-creating-accurate-measures/
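The compound key on id & version could be sketched as below. This is only a minimal sketch using Python's built-in sqlite3 rather than PostGIS, so border is stored as a plain text (WKT) stand-in, uuids are stored as text, and external_source_id is left out for brevity. The helper name add_space_version and all the sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE space (
    id          TEXT PRIMARY KEY,   -- uuid stored as text in this sketch
    title       TEXT NOT NULL,
    description TEXT NOT NULL,
    author_id   TEXT NOT NULL,
    created_at  TEXT NOT NULL,
    valid_from  TEXT                -- nullable
);
CREATE TABLE space_version (
    id          TEXT NOT NULL REFERENCES space(id),
    version     INTEGER NOT NULL,
    description TEXT NOT NULL,
    author_id   TEXT NOT NULL,
    created_at  TEXT NOT NULL,
    border      TEXT,               -- WKT stand-in for geometry(LinestringZM,4326)
    PRIMARY KEY (id, version)       -- id & version form the unique compound key
);
""")


def add_space_version(conn, space_id, version, description, author_id, created_at, border=None):
    """Insert one space_version row; a repeated (id, version) pair is rejected."""
    conn.execute(
        "INSERT INTO space_version VALUES (?, ?, ?, ?, ?, ?)",
        (space_id, version, description, author_id, created_at, border),
    )


# Hypothetical sample rows
conn.execute(
    "INSERT INTO space VALUES (?, ?, ?, ?, ?, ?)",
    ("space-a", "Area A (POV Country B)", "Border as claimed by B", "author-1", "2021-12-23", None),
)
add_space_version(conn, "space-a", 1, "Initial border", "author-1", "2021-12-23")
add_space_version(conn, "space-a", 2, "Corrected northern edge", "author-1", "2022-01-10")
```

Adding a second row with id "space-a" and version 2 would raise sqlite3.IntegrityError, which is the behaviour the compound key is meant to give.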

`space_parent_child` table

When a user adds a "parent" space, this table allows the user to capture the child spaces that the parent space encompasses. We have to capture the space_version.version of both parent and child, otherwise a new version of a child might have changed borders which lie outside its parent.

| name | type |
| --- | --- |
| space_parent_version_id | uuid -> space_version.id |
| space_parent_version | integer -> space_version.version |
| space_child_version_id | uuid -> space_version.id |
| space_child_version | integer -> space_version.version |
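One possible way to pin both parent and child to exact versions is a pair of compound foreign keys. The following is a rough sketch in Python with sqlite3, not the actual DataCurator schema: the cut-down space_version table, the helper names link_child and children_of, and the sample ids are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE space_version (
    id      TEXT NOT NULL,
    version INTEGER NOT NULL,
    PRIMARY KEY (id, version)
);
CREATE TABLE space_parent_child (
    space_parent_version_id TEXT NOT NULL,
    space_parent_version    INTEGER NOT NULL,
    space_child_version_id  TEXT NOT NULL,
    space_child_version     INTEGER NOT NULL,
    FOREIGN KEY (space_parent_version_id, space_parent_version)
        REFERENCES space_version (id, version),
    FOREIGN KEY (space_child_version_id, space_child_version)
        REFERENCES space_version (id, version)
);
-- Hypothetical sample data: one country version, two versions of a state
INSERT INTO space_version VALUES ('country', 1), ('state', 1), ('state', 2);
""")


def link_child(conn, parent_id, parent_version, child_id, child_version):
    """Record that a specific child version lies inside a specific parent version."""
    conn.execute(
        "INSERT INTO space_parent_child VALUES (?, ?, ?, ?)",
        (parent_id, parent_version, child_id, child_version),
    )


def children_of(conn, parent_id, parent_version):
    """Return (child id, child version) pairs for one parent version."""
    return conn.execute(
        """SELECT space_child_version_id, space_child_version
           FROM space_parent_child
           WHERE space_parent_version_id = ? AND space_parent_version = ?
           ORDER BY space_child_version_id, space_child_version""",
        (parent_id, parent_version),
    ).fetchall()


link_child(conn, "country", 1, "state", 1)
```

With this layout, version 2 of "state" is not a child of "country" version 1 until someone explicitly links it, so changed borders are never silently inherited.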

`data_set` table

| name | type |
| --- | --- |
| id | uuid |
| component_id | uuid -> wcomponent.id |
| description | string |
| external_source_id | uuid -> wcomponent.id nullable |
| author_id | uuid |
| created_at | datetime |
| valid_from | datetime nullable |
| (valid_to) | (datetime nullable) |
| (temporal_resolution_s) | number nullable |
| space | uuid -> space.id nullable |
| (spatial_resolution_m) | number nullable |
| (base_id) | (uuid) |

description provides an opportunity for the author to explain why this data set is strong / weak / different from other data sets describing the specific component.
external_source_id can be used like a special label to point to a component describing the external source (e.g. a gov / charity / corporate web page)
todo: add an internal_dependency many to many map from data_set to other data_sets with a: source_data_set_id, derived_data_set_id to allow for data sets derived from other data sets to be updated when the source data sets get updated with a newer version.
valid_from is nullable for non-temporal data like constants, and for when the data set varies over time
valid_from is inclusive (valid_to is exclusive)
except for description, all other fields should be immutable
temporal_resolution_s is in seconds
spatial_resolution_m is in meters
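The inclusive valid_from / exclusive valid_to rule describes a half-open interval. A minimal Python sketch of that check follows. The function name is_valid_at is hypothetical, and treating a null bound as "unbounded" is an assumption, not something the schema above specifies.

```python
from datetime import datetime
from typing import Optional


def is_valid_at(
    moment: datetime,
    valid_from: Optional[datetime],
    valid_to: Optional[datetime],
) -> bool:
    """True if moment falls in the half-open interval [valid_from, valid_to).

    Assumption: a bound of None means that side of the interval is open-ended.
    """
    if valid_from is not None and moment < valid_from:
        return False  # before the inclusive start
    if valid_to is not None and moment >= valid_to:
        return False  # at or after the exclusive end
    return True
```

A half-open interval means two consecutive data sets (e.g. one per calendar year) can share a boundary timestamp without both claiming it.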

`data_set_version` table

| name | type |
| --- | --- |
| id | uuid -> data_set.id |
| version | integer |
| description | string |
| author_id | uuid |
| created_at | datetime |
| (base_id) | (uuid) |

id & version have a unique compound key
description provides an opportunity for the author to explain what has changed in this version of the data.
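Showing only the last version of each dataset could be done with a grouped subquery. This is a sketch in Python with sqlite3 under assumed names: the cut-down data_set_version table, the function latest_versions and the sample rows are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_set_version (
    id          TEXT NOT NULL,      -- uuid -> data_set.id, stored as text here
    version     INTEGER NOT NULL,
    description TEXT NOT NULL,
    PRIMARY KEY (id, version)
);
-- Hypothetical sample rows
INSERT INTO data_set_version VALUES
    ('ds-a', 1, 'Initial upload'),
    ('ds-a', 2, 'Fixed units'),
    ('ds-a', 3, 'Added 2021 values'),
    ('ds-b', 1, 'Initial upload');
""")


def latest_versions(conn):
    """Return (id, version, description) for the highest version of each data set."""
    return conn.execute(
        """SELECT v.id, v.version, v.description
           FROM data_set_version v
           JOIN (
               SELECT id, MAX(version) AS version
               FROM data_set_version
               GROUP BY id
           ) latest ON latest.id = v.id AND latest.version = v.version
           ORDER BY v.id"""
    ).fetchall()
```

The same subquery could be joined to data_set to list, for one component, each of its data sets alongside its latest version.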

`data_set_group` table?

Perhaps we also need a data_set_group to accommodate a large data provider, like a government agency, that can provide data_sets on many hundreds of different components simultaneously and does not want to duplicate a similar description for each data_set entry. Instead they want to see: "This DataSet is provided by abc organisation. See data_set_group xyz for explanation of strengths and weaknesses." Perhaps we can use the wcomponent for this? And add a many-to-many component_id to data_set, to allow data_sets to know what they relate to and allow the parent component (that's used as the label) to know what it has labeled / what it "contains". (To be developed further, in the context of the existing wcomponent.label_ids.)

`data`

A table for normalised data values

| name | type |
| --- | --- |
| id | bigserial |
| data_set_version_id | uuid -> data_set_version.id |
| data_set_version | number -> data_set_version.version |
| valid_from | datetime nullable |
| (valid_to) | (datetime nullable) |
| space | uuid -> space.id nullable |
| value | number nullable |
| value_str | string nullable |

`data_qualifiers` table

| name | type |
| --- | --- |
| data_id | bigint -> data.id |
| component_label_id | uuid -> wcomponent.id |

This schema could result in a large amount of duplication of data on new versions and on alternative data sets. An alternative is that there is a "qualifier set" that the data table references with a data_qualifiers_set_id, and the data_qualifiers table above becomes 2 tables: one for the qualifier set (data_qualifiers_set) and a second for data_qualifier_set_component_label_ids.

`data_qualifiers_set` table

| name | type |
| --- | --- |
| id | serial |
| description | string |

Perhaps description is a string such as "people in @@space_id1 who are @@component_id1, @@component_id2 and @@component_id3", where @@space_id1 references a space with a title of "Nottinghamshire", @@component_id1 references a component with title === "20-30 years old", component_id2.title === "farmer" and component_id3.title === "female".
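To illustrate how such a description with @@ placeholders might be rendered for display, here is a small Python sketch. The function render_description and the titles lookup are hypothetical. It also assumes ids are simple word-character tokens, so real uuids containing hyphens would need a different pattern.

```python
import re


def render_description(template: str, titles: dict) -> str:
    """Replace each @@<id> placeholder with the title of the referenced entry.

    Unknown ids are left untouched so that a missing reference stays visible.
    """
    def lookup(match):
        return titles.get(match.group(1), match.group(0))

    return re.sub(r"@@(\w+)", lookup, template)


# Titles taken from the example in the text; the dict itself is an assumption
titles = {
    "space_id1": "Nottinghamshire",
    "component_id1": "20-30 years old",
    "component_id2": "farmer",
    "component_id3": "female",
}
template = "people in @@space_id1 who are @@component_id1, @@component_id2 and @@component_id3"
rendered = render_description(template, titles)
```

Here rendered would read "people in Nottinghamshire who are 20-30 years old, farmer and female".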

`data_qualifier_set_component_label_ids` table

| name | type |
| --- | --- |
| data_qualifiers_set_id | integer -> data_qualifiers_set.id |
| component_label_id | uuid -> wcomponent.id |

Additional questions to resolve

How are counterfactual data sets captured by this schema? Are they even allowed or encouraged?
How can data sets that might exist in a detached space / time be captured by this schema? e.g. if X happens, then the following values should be observed, where event X's space and time is not known (e.g. a forest fire's impact on the % of homes destroyed in an area).
How are simulated data sets captured by this schema? e.g. when a data set has copyright restrictions but we can derive a simulated dataset with the same standard deviation, or day/week trend, how is this marked as such?
How are averages / aggregates handled? E.g. average solar energy over different area sizes, over different time windows.

Some examples (todo)

Vaccination rates over time by city by age group

e.g. the "Weekly COVID-19 Municipality Vaccination Report" from: https://www.mass.gov/info-details/archive-of-covid-19-vaccination-reports

Can make a component for "Covid-19 Vaccination rates over time by city by age group in Massachusetts".
The spatial and temporal components are easy to capture.
How is the age group captured?

Could make 7 different components:
1. "age groups"
2. "age group 0-19", "age group 20-25" etc, and label these with "age groups"
then could use these component_ids in the data_qualifiers table.
Alternatively:
Could add "0-19" in the value_str, as the "age group" aspect of the data is already captured in the title of the top level component.

When someone comes to this data set they would find the top level component with only one possible data set.
The data set would have the author_id of the Massachusetts public health department. Would have a description linking to the source page.
There would be an indicator saying "30 versions".
You could click on the versions and it would present a list with the dates of the data_set_version.created_at
You could click on an individual version or, from the previous view, click on "latest version", and be taken to that version of the data set for that component.
That version would have "3232 rows" where the data table matches the data_set_version_id and data_set_version.
There would be columns for "space", "time", "value" and "value_str".

Under the first plan for capturing age group it would join each row of data with data_qualifiers containing all the labels that were applied to them. It would get a list of unique data_qualifiers.component_label_id and fetch those component_ids to show them as labels next to each of the data rows.

Under the second plan, the "age group 0-19", "age group 20-25" etc would just be shown in the value_str column.

Later, someone may come to use the data by joining it on age group, e.g. they might have education level, or time spent on social media, by age group for cities in Massachusetts for 2021. The user would then want to ensure that the join was unique, otherwise we don't solve the problem of joining data.

If someone also wants to get the vaccination rate over time for all 20-25 year olds then this is also brittle if it's just a string in the value_str as it relies on the upload always keeping the same format of the data, e.g. not using "Age Group 20-25" then "age_group20 to 25". Instead with the data_qualifiers joined onto data, a WHERE clause for the matching id can be used.

If someone wants to aggregate by age group then again the data_qualifiers can be used. However, unless the rows are all also tagged with the parent component of "age group", either a recursive SQL query would be needed, or a subselect on wcomponent.label_ids where they contain the parent "age group" wcomponent.id.
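Both lookups (filtering by one age group id, and aggregating across everything labelled with the parent) can be sketched with Python's sqlite3. Several things here are assumptions: wcomponent.label_ids is flattened into a hypothetical wcomponent_label table, the ids are short strings rather than uuids, and the values are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wcomponent (id TEXT PRIMARY KEY, title TEXT NOT NULL);
-- Hypothetical flattening of wcomponent.label_ids into one row per label
CREATE TABLE wcomponent_label (component_id TEXT NOT NULL, label_id TEXT NOT NULL);
CREATE TABLE data (id INTEGER PRIMARY KEY, value REAL, value_str TEXT);
CREATE TABLE data_qualifiers (data_id INTEGER NOT NULL, component_label_id TEXT NOT NULL);

INSERT INTO wcomponent VALUES
    ('c-groups', 'age groups'),
    ('c-0-19', 'age group 0-19'),
    ('c-20-25', 'age group 20-25');
INSERT INTO wcomponent_label VALUES ('c-0-19', 'c-groups'), ('c-20-25', 'c-groups');
INSERT INTO data (id, value) VALUES (1, 0.1), (2, 0.4), (3, 0.5);
INSERT INTO data_qualifiers VALUES (1, 'c-0-19'), (2, 'c-20-25'), (3, 'c-20-25');
""")


def values_for_label(conn, label_id):
    """Filter data rows by a qualifier id instead of parsing value_str."""
    rows = conn.execute(
        """SELECT d.value
           FROM data d
           JOIN data_qualifiers q ON q.data_id = d.id
           WHERE q.component_label_id = ?
           ORDER BY d.id""",
        (label_id,),
    ).fetchall()
    return [row[0] for row in rows]


def count_by_child_label(conn, parent_label_id):
    """Count data rows per component that is (transitively) labelled with the parent."""
    return conn.execute(
        """WITH RECURSIVE descendants(id) AS (
               SELECT component_id FROM wcomponent_label WHERE label_id = ?
               UNION
               SELECT wl.component_id
               FROM wcomponent_label wl
               JOIN descendants ds ON wl.label_id = ds.id
           )
           SELECT w.title, COUNT(*)
           FROM data_qualifiers q
           JOIN descendants ds ON ds.id = q.component_label_id
           JOIN wcomponent w ON w.id = q.component_label_id
           GROUP BY w.title
           ORDER BY w.title""",
        (parent_label_id,),
    ).fetchall()
```

Because the filter matches on an id, it keeps working if an uploader changes "Age Group 20-25" to "age_group20 to 25" in the source file.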

Population data for a country

Explore how the following data examples would be stored in each of the different tables, including how they would be disputed, and versioned:

population of country A in 2020 -> single number
population of country A -> 1D number over time
population of country A by state in 2020 -> 1D number over states of country A
population of country A by state -> "2D" number for each state over time
population of country A (by state, age, gender, occupation) -> "5D" number for each state, age, gender & occupation over time

Simpson's paradox data

https://web.archive.org/web/20180201210227/http://vudlab.com/simpsons/

Renewable energy, wind and sun

https://github.com/TheWorldSim/world-sim-data/tree/master/data/solarpv_capacity_summary/data
https://github.com/TheWorldSim/world-sim-data/tree/master/data/wind_turbine_capacity_summary/data
https://github.com/TheWorldSim/world-sim-data/blob/master/data/solarpv_capacity_summary/data/_2017_texas_loss10percent_tracking1_tilt35_azim180_month_average%40core%400.0.10.csv

Estimating cost of undersea power cable & connections

a = Annual average energy demand of recipient area (for year X according to source Y using estimation method Z) in W
p = Annual peak energy demand of recipient area in W
b = Capacity of a specific cable & connection in W
c = Cost of that cable & connection in $, £, € etc (in which year) per cable
d = number of cables needed = a / b (or p / b) cables
e = estimated cost in $ = d cables * c ($ per cable)
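The calculation above could be expressed as a couple of small Python functions. The function names are hypothetical. Rounding the cable count up to a whole number is an added assumption (the text gives d = a / b), and the sample figures are invented, not taken from any data set.

```python
import math


def cables_needed(demand_w: float, cable_capacity_w: float) -> int:
    """d = a / b (or p / b), rounded up because a fraction of a cable cannot be laid."""
    if cable_capacity_w <= 0:
        raise ValueError("cable capacity must be positive")
    return math.ceil(demand_w / cable_capacity_w)


def estimated_cost(demand_w: float, cable_capacity_w: float, cost_per_cable: float) -> float:
    """e = d cables * c (cost per cable), in whatever currency and year c is given in."""
    return cables_needed(demand_w, cable_capacity_w) * cost_per_cable


# Invented example: 2.5 GW demand, 1 GW per cable, 8 million per cable
example_cables = cables_needed(2.5e9, 1e9)
example_cost = estimated_cost(2.5e9, 1e9, 8_000_000)
```

Passing the peak demand p instead of the average a gives the more conservative estimate; each input would ideally come from its own component and data set.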

Ideally allow for an aggregate of undersea cable costs

User might like to compute the average cost of all undersea power cables. So would need to be able to:
- select all components with that label e.g. https://www.slideshare.net/billkarwin/models-for-hierarchical-data
- have another label like "cost" to intersect with. Would then want a schema to keep things consistent, otherwise someone would use a label from component "price" or "capital expenditure" etc...

Data growth: two extremes. At one end the user would create a single component for the name of the undersea cable, and add an attribute like "cost in 2015 USD": 8 million. At the other extreme, they would upload a whole table of undersea cables, with their names, locations, paths, costs, construction start and end times, etc.

Estimate hydro energy storage of region X

g = gravitational field strength in m/s² (= 9.81)
a = average height difference of water between reservoirs in m
b = max volume of water in m³
c = density of water in kg per m³
d = max mass of water in kg = c * b
e = max potential energy in J = g * a * d
f = efficiency of turbines (unitless, J electrical energy / J potential energy)
h = max potential electrical energy in J = e * f
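As a rough Python sketch of the chain of equations above: the function name is hypothetical, and the default density of 1000 kg per m³ is an approximate figure for fresh water supplied here as an assumption, since the list above does not give a value.

```python
G_FIELD_STRENGTH = 9.81  # m/s², the value of g given in the text


def max_electrical_energy_j(
    height_difference_m: float,       # a
    max_volume_m3: float,             # b
    turbine_efficiency: float,        # f, between 0 and 1
    density_kg_per_m3: float = 1000.0,  # c, assumed approximate value for fresh water
    g: float = G_FIELD_STRENGTH,
) -> float:
    """Return h, the max potential electrical energy in joules."""
    max_mass_kg = density_kg_per_m3 * max_volume_m3       # d = c * b
    potential_energy_j = g * height_difference_m * max_mass_kg  # e = g * a * d
    return potential_energy_j * turbine_efficiency        # h = e * f
```

Each argument maps onto one lettered quantity, so in the schema above each could be backed by its own component, with g being a non-temporal, non-spatial constant.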

A couple of examples (need to update these now)

You might want to store "Population of Chelsea, MA, USA", with a description of: "1 year resolution, beginning in 1950"
And someone else like the US census bureau might want to store: "Population of USA by municipality in 2021".

--

There would be two top level components:

"Population of Chelsea, MA, USA"
- The "over time" is implied by expectations and lack of a specific year
"Population of USA by municipality in 2021"

--

For each USA municipality there would be one entry in the spatial_data_set table and a corresponding one in the spatial_data_set_version table with version == 1.

--

There would be a data_set entry with:
component_id == id of component "Population of Chelsea, MA, USA"
description == "Population over time, 1 year resolution, beginning in 1950"
valid_at == null
(temporal_resolution_s) == 31557600
space == (Chelsea, MA, USA).id

There would be another data_set entry with:
component_id == id of component "Population of USA by municipality in 2021"
description == "Population of USA by municipality in 2021"
valid_at == 2021-01-01
space == null

Q) what happens when next year, they want to add the next round of data?
A) This is where the data_set_group comes in. You'd probably make a component for "Population of USA by municipality (over time)" and then use that as a label on all the data_sets where it applied. (to be developed further)

--

There would also be a corresponding data_set_version entry for each of the two data_set entries above.

--

Then there would be entries in data, for the "Population of Chelsea, MA, USA" (over time), each would be like:

data_set_version_id == (value from data_set_version entry).id
data_set_version == (value from data_set_version entry).version
valid_at == some date
(valid_from) == null
(valid_to) == null
space == null // takes value from joining to data_set, alternatively could denormalise but then how to keep up to date?
value == 1234
value_str == null

--

The data entries for "Population of USA by municipality in 2021" would be like:

data_set_version_id == (value from above)
data_set_version == (value from above)
valid_at == null // again could denormalise here or get through a join
valid_from == null
valid_to == null
space == the municipality id
value == 1234
value_str == null
