How do we handle different data formats? #27

andreagrandi · 2020-03-27T18:51:31Z

Hi everyone,

something we didn't initially discuss is: how are we going to handle different data sources?

This API was born to serve the data from Johns Hopkins CSSE, which is fine, but it's worth noting that they only provide a few useful fields: Confirmed, Deaths and Recovered.

The data source from the Italian "Protezione Civile", which I would really like to support, offers more useful data, for example the number of tests done in each city. Combined with the new positive cases found, this tells you whether the trend is growing or decreasing.

I'm sure that other countries are offering different data and formats too, which could all be useful.

Now, if we wanted to give our users the possibility to query every data source that we support, how should we structure this? I'm thinking about at least one model for each data source, but how should we structure the API endpoints?

e.g.:

Italian Protezione Civile: /api/v1/itapc/national-reports
Johns Hopkins: /api/v1/jh/world-reports
etc... ?

Does anyone have any idea about how we could support this? cc @MatMoore @audreyr @fundor333

lbhdc · 2020-03-28T14:35:32Z

Perhaps adding an abstraction layer will make it easier to add new data sources.

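def john_hopkins_data():
    # Placeholder for the real Johns Hopkins CSSE fetch.
    return []

def another_source():
    # Placeholder for any other data source.
    return []

# Maps a source name to the function that fetches its data.
source_map = {
    "john_hopkins": john_hopkins_data,
    "another_source": another_source,
}

def get_data(source_map, source_name):
    # Look up the getter by name and call it.
    getter = source_map[source_name]
    return getter()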

Doing something like this would make it easy to add new sources without updating your API. A serious downside, though, is that this requires magic strings.
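
One possible way to soften the magic-string downside is to key the map with an Enum, so that the valid source names live in one place and an unknown name fails early with a clear error. This is only a sketch: `DataSource`, the placeholder getters and their return values are assumptions for illustration, not code from the project.

```python
from enum import Enum
from typing import Callable, Dict, List


class DataSource(Enum):
    # Hypothetical source identifiers; the values double as public names.
    JOHN_HOPKINS = "john_hopkins"
    PROTEZIONE_CIVILE = "protezione_civile"


def john_hopkins_data() -> List[dict]:
    # Placeholder: a real getter would fetch and parse the upstream data.
    return [{"source": "john_hopkins"}]


def protezione_civile_data() -> List[dict]:
    # Placeholder, same idea as above.
    return [{"source": "protezione_civile"}]


SOURCE_MAP: Dict[DataSource, Callable[[], List[dict]]] = {
    DataSource.JOHN_HOPKINS: john_hopkins_data,
    DataSource.PROTEZIONE_CIVILE: protezione_civile_data,
}


def get_data(source_name: str) -> List[dict]:
    # DataSource(...) raises ValueError for names that are not registered.
    source = DataSource(source_name)
    return SOURCE_MAP[source]()
```

The string still appears at the API boundary, but it is validated once and everything behind `get_data` works with enum members.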

andreagrandi · 2020-03-28T14:42:49Z

Oh I see! So the end user would only call GET /api/v1/daily-report

and the response would contain something like:

{
    "john_hopkins": {
        .... (data from JH)
    },
    "protezione_civile": {
        .... (data from Italian PC)
    }
}

and it would be up to the user to pick the one they want, right?

We could even let users limit the returned sources, or exclude the ones they don't want.

Do we agree that each data source should have its own models? Cheers
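
The include/exclude idea could look roughly like this. It is a sketch under assumptions: `build_daily_report`, the `include`/`exclude` parameters and the placeholder getters are hypothetical names, and the payloads are dummy values rather than real data.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical getters standing in for the real per-source fetchers.
SOURCES: Dict[str, Callable[[], dict]] = {
    "john_hopkins": lambda: {"placeholder": "data from JH"},
    "protezione_civile": lambda: {"placeholder": "data from Italian PC"},
}


def build_daily_report(
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> Dict[str, dict]:
    """Return one entry per selected source, keyed by source name."""
    selected = set(SOURCES) if include is None else set(include)
    unknown = selected - set(SOURCES)
    if unknown:
        raise KeyError(f"unknown sources: {sorted(unknown)}")
    selected -= set(exclude or ())
    # Iterate over SOURCES so the response keeps a stable order.
    return {name: getter() for name, getter in SOURCES.items() if name in selected}
```

Calling `build_daily_report(exclude=["john_hopkins"])` would return only the `protezione_civile` entry, while calling it with no arguments returns every registered source.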

MatMoore · 2020-03-28T14:45:36Z

I haven't thought about this very much, but seeing as the reports won't be consistent between organisations, maybe we could simplify it to just /itapc or /jh instead of categorising them further with national-reports or world-reports?

Then if you query the root of the api we could return a list of all the reports available.

So something like

GET /api/v1/ ->
[
  {
    "source": "Protezione Civile",
    "reports_url": "/api/v1/itapc/"
  },
  {
    "source": "Johns Hopkins CSSE",
    "reports_url": "/api/v1/jh/"
  }
]

GET /api/v1/jh/ -> what's now /daily-reports

GET /api/v1/itapc/ -> all the data from the Protezione Civile dataset
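
A framework-agnostic sketch of that layout, assuming a plain dictionary as the registry. The `SOURCES` structure, `list_sources` and `get_reports` are hypothetical names, and the report lists are left empty rather than filled with made-up data.

```python
from typing import Dict, List

API_PREFIX = "/api/v1"

# Hypothetical registry: URL slug -> display name and a report loader.
SOURCES: Dict[str, dict] = {
    "itapc": {"source": "Protezione Civile", "load": lambda: []},
    "jh": {"source": "Johns Hopkins CSSE", "load": lambda: []},
}


def list_sources() -> List[dict]:
    """Payload for GET /api/v1/: every source and where to find its reports."""
    return [
        {"source": info["source"], "reports_url": f"{API_PREFIX}/{slug}/"}
        for slug, info in SOURCES.items()
    ]


def get_reports(slug: str) -> list:
    """Payload for GET /api/v1/<slug>/; raises KeyError for unknown slugs."""
    return SOURCES[slug]["load"]()
```

With this shape, adding a source means adding one registry entry, and the root listing picks it up without further changes.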

I think it makes sense to model each source independently and reuse column names from the original dataset, rather than trying to map each report to a common vocabulary, because the exact meaning of each metric will depend on how it's collected/recorded.
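
To make the per-source modelling concrete, here is a sketch with one dataclass per source. The Johns Hopkins fields are the three mentioned at the top of this issue; the Protezione Civile field names (`tamponi`, `nuovi_positivi`) are meant to mirror that dataset's Italian column names, but they are assumptions here and should be checked against the actual data.

```python
from dataclasses import asdict, dataclass


@dataclass
class JHReport:
    # Fields named in this issue for the Johns Hopkins CSSE data.
    confirmed: int
    deaths: int
    recovered: int


@dataclass
class ItaPCReport:
    # Assumed column names, kept in Italian instead of being mapped
    # to a shared vocabulary.
    tamponi: int         # tests performed
    nuovi_positivi: int  # new positive cases


# Dummy numbers, only to show that each model serialises with its own keys.
print(asdict(JHReport(confirmed=3, deaths=1, recovered=2)))
print(asdict(ItaPCReport(tamponi=10, nuovi_positivi=4)))
```

Neither model knows about the other, so a metric that exists in only one dataset never has to be forced into a shared schema.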

lbhdc · 2020-03-28T15:04:32Z

I think having separate endpoints for each data source is a great way to go. It will make the payload smaller for your consumers, since they can be more granular in what they fetch.

andreagrandi · 2020-04-05T14:43:51Z

I'm closing this since we agreed on a solution.

andreagrandi added the "help wanted" and "question" labels Mar 27, 2020

andreagrandi self-assigned this Mar 27, 2020

andreagrandi added this to Backlog in Covid API Mar 28, 2020

andreagrandi mentioned this issue Mar 31, 2020 in: Refactor the code to support additional sources of data #39 (Closed)

andreagrandi closed this as completed Apr 5, 2020
