
Create a new DisplayField Schema #264

Closed
vforgione opened this issue Mar 16, 2018 · 9 comments
vforgione (Member) commented Mar 16, 2018
We should add another schema that holds information about what fields should be displayed on search results.

For most cases, this is similar to the virtual fields and constraints: give the owner a list of the fields attached to the meta. Where this gets tricky is with jsonb fields. For these, we need to inspect unique paths within the JSON and provide them as well.

For example, if the data set has fields id, lat, lon, timestamp and observations, where observations is a JSON field with the following structure:

{
  "sensor1": {
    "temperature": "whatever"
  },
  "sensor2": {
    "temperature": "whatever",
    "humidity": "whatever"
  }
}

we should display:

  • id
  • lat
  • lon
  • timestamp
  • observations.sensor1.temperature
  • observations.sensor2.temperature
  • observations.sensor2.humidity
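The flattening described above is a recursive walk that emits a dot-separated path for every scalar leaf. A minimal sketch of that logic (in Python for brevity; the project itself is Elixir/Postgres, and the function name here is made up for illustration):

```python
def key_paths(value, prefix=""):
    """Recursively collect dot-separated paths to every scalar leaf."""
    paths = []
    for key, val in value.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(val, dict):
            # Descend into nested objects, carrying the path prefix along.
            paths.extend(key_paths(val, path))
        else:
            # Scalar leaf: record the full path.
            paths.append(path)
    return sorted(paths)

row = {
    "id": 1, "lat": 41.8, "lon": -87.6, "timestamp": "2018-03-16T09:19:17",
    "observations": {
        "sensor1": {"temperature": "whatever"},
        "sensor2": {"temperature": "whatever", "humidity": "whatever"},
    },
}
print(key_paths(row))
# → ['id', 'lat', 'lon', 'observations.sensor1.temperature',
#    'observations.sensor2.humidity', 'observations.sensor2.temperature',
#    'timestamp']
```

The same walk works per-row, which matters later in the thread: since rows can have different shapes, paths have to be collected across rows, not derived from a single one.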
@vforgione vforgione added the core label Mar 16, 2018
@vforgione vforgione added this to the Data Set Fix Ups milestone Mar 16, 2018
HeyZoos (Collaborator) commented Mar 16, 2018

When you say display do you mean API results or frontend results? Would this schema be used to return json objects that look like:

{
    "observations.sensor1.temperature": 1,
    "observations.sensor2.temperature": 2,
    "observations.sensor2.humidity": 3
}

vforgione (Member, Author)

Front end results: search specifically. The way graphs are shown right now in the general explorer app is that I grab the first few numeric fields of a data set and graph them, regardless of whether they are significant to the actual data set. It's a very naive approach.

In the case of AoT, or any other data set that has embedded JSON, I use a dummy set of fields and values to draw the graphs.

What I want to be able to do is give data set owners the ability to highlight significant fields in their data sets to be used in displaying search results.

HeyZoos (Collaborator) commented Mar 16, 2018

So a DisplayField should give me the information I need to dive down into a json field and extract a value?

HeyZoos (Collaborator) commented Mar 16, 2018

I think it's a little overkill, especially if it's just for displaying search results. I could see providing those values through a simple query with json_object_keys and the json operators postgres provides.

vforgione (Member, Author) commented Mar 16, 2018

Kind of. I'm thinking of it like this:

fields = DisplayFieldActions.list(for_meta: meta)

query = """
select
  <%= for field <- fields do %>
  avg(<%= field %>),
  <% end %>
  ...
"""

or something less crappy. You get the idea though.

Edit: ideally, we would use an ecto query and the model for the data set from the registry.

vforgione (Member, Author)

The thing is, I have no way of limiting fields to display in search results without some arbitrary cut off. Especially when it comes to jsonb fields, doing a full introspection and then graphing those values not only looks like garbage on the front end, but can potentially be dangerously expensive to the DB and the web server.

This is a simple solution to that.

HeyZoos (Collaborator) commented Mar 16, 2018

Do you mean limiting the number of fields to display? I feel like the cutoff is still arbitrary with the new schema. I also don't see how it's expensive for the database. We assume that the json values all have the same "shape", so it's a matter of grabbing the first row, deriving the structure from that, and using that information in your subsequent queries.

In regards to the snippet you posted, you'd grab the fields value with a postgres query that maps the key structure of the first row. It's the same number of trips to the database but without a new schema that introduces 1:M (Meta to DisplayField) and 1:1 (DisplayField to DataSetField) relationships.

vforgione (Member, Author)

Here are some counterexamples. In the case of Beach Lab Culture Tests, we have the following fields:

Name                        Type
--------------------------  -----------
Beach                       text
Culture Note                text
Culture Reading Mean        float
Culture Sample 1 Reading    float
Culture Sample 1 Timestamp  timestamptz
Culture Sample 2 Reading    float
Culture Sample 2 Timestamp  timestamptz
Culture Sample Interval     integer
Culture Test ID             integer
DNA Reading Mean            float
DNA Sample 1 Reading        float
DNA Sample 2 Reading        float
DNA Sample Timestamp        timestamptz
DNA Test ID                 integer
Latitude                    float
Location                    text
Longitude                   float

These fields include DNA testing data because the source of the data didn't exclude it from the results for whatever reason. If we were to display every numeric field here, the result would 1) be crowded given the number of fields and 2) include fields that have little to no relevance to the core of the data set.

In the case of Array of Things data, not every observation sub-document contains identical keys. It's true that most nodes have a relatively stable set of sensors, but as the builds continue, boards will be added and removed. Additionally, on the back end of AoT they scrub certain malformed data (such as temperatures that read 98 bajillion degrees). If we were to naively grab the first row, we could not guarantee that it is an accurate representation of the data set. Additionally, jsonb_object_keys only gets the top level keys:

 node_id |       array_agg        
---------+------------------------
 034     | {BMP180,HTU21D,TSYS01}
 056     | {BMP180,HTU21D,TSYS01}
 032     | {BMP180,HTU21D,TSYS01}
 048     | {BMP180,HTU21D,TSYS01}
 08D     | {BMP180,HTU21D,TSYS01}
 091     | {BMP180,HTU21D,TSYS01}
 03D     | {HTU21D,TSYS01}
 05D     | {BMP180,HTU21D,TSYS01}
 053     | {BMP180,HTU21D,TSYS01}
 086     | {BMP180,HTU21D,TSYS01}
 06B     | {BMP180,HTU21D,TSYS01}
 037     | {BMP180,HTU21D,TSYS01}
 00D     | {BMP180,HTU21D,TSYS01}
...

Note that this is an aggregate of all the records. For most of these nodes, only two sensors were reporting in for a while; only recently did a third board start reporting in, and there are a dozen or so others still to come.

So, to me, having a predetermined, limited set of fields that we pull from to draw graphs for search results greatly simplifies mitigating these pitfalls.

vforgione (Member, Author) commented Apr 2, 2018

Idea to surface key paths from embedded JSONB fields:

create or replace function jsonb_key_paths(_value jsonb)
  returns text[]
  language sql as
$$
with recursive _struct (key, value) as (

  select
    t.key,
    t.value

  from
    jsonb_each(_value) as t

  union all

  select
    concat(_struct.key, '.', t.key),
    t.value

  from
    _struct,
    jsonb_each(
      case
        when jsonb_typeof(_struct.value) <> 'object'
        then '{}'::jsonb
        else _struct.value
      end
    ) as t

)

select array(
  select distinct key
  from _struct
  where jsonb_typeof(_struct.value) not in ('array', 'object')
  order by key
)
$$;

In the case of AoT Chicago, we have a data set table with an observations field that has non-deterministic depth and breadth data in JSON format. We want to be able to surface those embedded keys so that we can highlight them in graphs. Using a function like this one, we can show the data set owner a list of keys that exist in the data set that they can use to build graphs.

For example, the result of the function is

select node_id, timestamp, observations, jsonb_key_paths(observations) from "ds_no3g4DGcITlkTTYu" limit 10;

 node_id |       timestamp        |                                                              observations                                                              |                              jsonb_key_paths
---------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------
 02F     | 2018-03-16 09:19:17-05 | {"BMP180": {"temperature": -1.65}, "HTU21D": {"humidity": 71.94, "temperature": -0.78}, "TSYS01": {"temperature": -1.1}}               | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:19:40-05 | {"BMP180": {"temperature": -1.6}, "HTU21D": {"humidity": 72.52, "temperature": -0.72}, "TSYS01": {"temperature": -1.04}}               | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:20:03-05 | {"BMP180": {"temperature": -1.7000000000000002}, "HTU21D": {"humidity": 71.33, "temperature": -0.79}, "TSYS01": {"temperature": -1.1}} | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:20:29-05 | {"BMP180": {"temperature": -1.7000000000000002}, "HTU21D": {"humidity": 71.29, "temperature": -0.8}, "TSYS01": {"temperature": -1.15}} | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:20:52-05 | {"BMP180": {"temperature": -1.65}, "HTU21D": {"humidity": 72.51, "temperature": -0.72}, "TSYS01": {"temperature": -1.07}}              | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:21:15-05 | {"BMP180": {"temperature": -1.6}, "HTU21D": {"humidity": 72.33, "temperature": -0.75}, "TSYS01": {"temperature": -1.09}}               | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:21:39-05 | {"BMP180": {"temperature": -1.65}, "HTU21D": {"humidity": 72.3, "temperature": -0.74}, "TSYS01": {"temperature": -1.1}}                | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:22:02-05 | {"BMP180": {"temperature": -1.6}, "HTU21D": {"humidity": 72.8, "temperature": -0.72}, "TSYS01": {"temperature": -1.08}}                | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:22:26-05 | {"BMP180": {"temperature": -1.6}, "HTU21D": {"humidity": 73.07, "temperature": -0.67}, "TSYS01": {"temperature": -1.02}}               | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:22:50-05 | {"BMP180": {"temperature": -1.6}, "HTU21D": {"humidity": 72.77, "temperature": -0.6900000000000001}, "TSYS01": {"temperature": -1.05}} | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
(10 rows)

which we can further aggregate to get a master list of keys for a given data set.
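That aggregation step is just a distinct union over the per-row path arrays (in Postgres, roughly an unnest of jsonb_key_paths results fed into a distinct array_agg). A minimal Python stand-in for the idea, with hypothetical sample values:

```python
# Per-row outputs of jsonb_key_paths (sample values, not real AoT data).
rows = [
    ["BMP180.temperature", "HTU21D.humidity", "HTU21D.temperature", "TSYS01.temperature"],
    ["HTU21D.humidity", "HTU21D.temperature", "TSYS01.temperature"],  # node missing a board
]

# The "master list" is the distinct union of every path seen in any row.
master = sorted({path for row in rows for path in row})
print(master)
# → ['BMP180.temperature', 'HTU21D.humidity', 'HTU21D.temperature',
#    'TSYS01.temperature']
```

Because it unions across all rows, a path reported by even a single node still makes the list, which is exactly what the first-row approach discussed above cannot guarantee.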

Or, a simpler example:

plenario_dev=# SELECT jsonb_key_paths('{"foo":{"bar":"baz","boom":"bang","bleh":{"barf":"yack"}}, "top": "level"}');
           jsonb_key_paths            
--------------------------------------
 {foo.bar,foo.bleh.barf,foo.boom,top}
(1 row)

vforgione pushed a commit that referenced this issue Apr 4, 2018
Adding a new schema so that data set owners and admins can store
metadata about data sets to more easily generate charts/graphs.

- migration for schema
- migration that adds function to recursively name subkeys in jsonb
  fields
- schema file with function to surface fields
- test for function

Updates #264
vforgione pushed a commit that referenced this issue Apr 4, 2018
- added changesets, queries, actions
- wrote tests for actions

Updates #264
vforgione pushed a commit that referenced this issue Apr 4, 2018
give it:
- a chart
- a truncation span ("minutes", "hours", "days", etc)
- (optional) time span
- (optional) bbox

and you get aggregated data for the truncated timestamps

Updates #264
vforgione pushed a commit that referenced this issue Apr 5, 2018
can now get results aggregated to points

Updates #264
vforgione pushed a commit that referenced this issue Apr 5, 2018
the ranges came out uneven so i updated the query and made it work with
the series data rather than min/max of the data in the bucket. it's a
little heavier in run time, but it makes the output much cleaner and
easier to comprehend.

also added tests for limiting to time range and bbox.

updates #264
vforgione pushed a commit that referenced this issue Apr 5, 2018
- added web stuff for chart meta
  - add to meta on show page
- added basic display for bucket data

the bucket data seems to be kind of wonky: it doesn't vary at all and
shows only straight lines. i need to investigate this further.

updates #264
@vforgione vforgione self-assigned this Jun 28, 2018
vforgione pushed a commit that referenced this issue Jul 11, 2018
The way we're currently aggregating data and rendering charts is awful.
It needed to be completely rebuilt. The new schemas, controllers, etc
allow users to create default chart renderings for their data sets that
can then be used for display in search results.

Stuff Completed:

- new `Chart` schema for top level metadata
- new `ChartDataset` schema for aggregate metadata
- controllers, views, templates to CRUD the schemas
- special endpoint for rendering the graph serverside

Stuff To Do:

- use charts in explorer
- provide api access to aggregated data (json instead of rendered html)
- add in functionality for use with leaflet
- add text fields to views

Regarding _add text fields to views_, for some charts we could really
use being able to aggregate on text fields. Leaving them off the views
made sense at first, as we only wanted to use them to filter non-text
data. But now we've come to a point where it makes sense to have it in
there.

It will be beneficial to charting, but it will also allow us to start
adding trigram indexes so we can offer basic text search capabilities as
well -- both in the explorer and via the API.

Updates #264
vforgione pushed a commit that referenced this issue Jul 11, 2018
I added text fields to the data set views to enable aggregation around
text values in charts. I also added a new index for each text field so
that we can perform full text search later. For now, we're doing nothing
with it, but later we can start using it as the indexes will be there
and populated moving forward.

This will require another migration to enable the Postgres Trigram
extension. We will also have to do manual migrations to rebuild the
views:

```elixir
ids =
  MetaActions.list(ready_only: true)
  |> Enum.map(& &1.id)

ids
|> Enum.map(&MetaActions.get/1)
|> Enum.each(&DataSetActions.down!/1)

ids
|> Enum.map(&MetaActions.get/1)
|> Enum.each(&DataSetActions.up!/1)

ids
|> Enum.map(&MetaActions.get/1)
|> Enum.each(&PlenarioEtl.import_data_set_on_demand/1)
```

Updates #264
vforgione pushed a commit that referenced this issue Jul 12, 2018
I needed to add in functionality to limit queries to bboxes and time
ranges. It also made sense to apply aggregate granularity to the data
sets that are being aggregated by date trunc.

The unfortunate thing about Ecto is that it can be really, really dumb.
I mean part of its strength is that it's crazy versatile and you can
throw a lot at it, but the fragments don't like dynamic values, so I had
to build a billion different functions to handle granularity.

Other than that, this is pretty straight forward -- just a lot of code
to cover the cases where dynamic fragments would have been helpful.

Updates #264
vforgione pushed a commit that referenced this issue Jul 12, 2018
So the entire point of this was to render charts that didn't suck and
had a little meaning to them in the search results. We've got that now.

Additions:

- new `with_charts` option for getting metas
- totally wiped and redid the explorer controller
- made small changes here and there to glue it together

To Do:

- make the charts smaller
- clean up my god awful js
- add an api endpoint for the chart aggregates

Updates #264
vforgione pushed a commit that referenced this issue Jul 12, 2018
I added a new JS dependency to throttle the calls and cleaned up the
script. This should be good now.

Updates #264
vforgione pushed a commit that referenced this issue Jul 12, 2018
The results charts are now limited to the bounds given by the search in
the form of the explorer.

Updates #264