How do we handle different data formats? #27

Closed
andreagrandi opened this issue Mar 27, 2020 · 5 comments
@andreagrandi
Owner

andreagrandi commented Mar 27, 2020

Hi everyone,

something we didn't initially discuss is: how are we going to handle different data sources?

This API was created to serve the data from Johns Hopkins CSSE, which is fine, but it's worth noting that they only provide a few useful fields: Confirmed, Deaths and Recovered.

Screenshot 2020-03-27 19 43 33

The data source from the Italian "Protezione Civile", which I would really like to support, offers more useful data (for example, the number of tests done in each city; combined with the number of new positive cases found, this tells you whether the trend is growing or decreasing):

Screenshot 2020-03-27 19 43 50

I'm sure that other countries are offering different data and formats too, which could all be useful.

Now, if we want to let our users query every data source that we support, how should we structure this? I'm thinking about at least 1 model for each data source, but how should we structure the API endpoints?

e.g.:

  • Italian Protezione Civile: /api/v1/itapc/national-reports
  • Johns Hopkins: /api/v1/jh/world-reports
  • etc... ?

Does anyone have any idea about how we could support this? cc @MatMoore @audreyr @fundor333

andreagrandi added the "help wanted" (Extra attention is needed) and "question" (Further information is requested) labels Mar 27, 2020
andreagrandi self-assigned this Mar 27, 2020
@lbhdc

lbhdc commented Mar 28, 2020

Perhaps adding an abstraction layer will make it easier to add in new datasources.

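def john_hopkins_data():
  # Placeholder: fetched_data() stands for whatever fetches this source.
  return fetched_data()

def another_source():
  # Placeholder for a second data source.
  return fetched_data()

# Maps each source name to the function that returns its data.
source_map = {
  "john_hopkins": john_hopkins_data,
  "another_source": another_source,
}

def get_data(source_map, source_name):
  # Look up the getter by name and call it.
  getter = source_map[source_name]
  return getter()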

Doing something like this would make it easy to add new sources without changing your API. A serious downside, though, is that it relies on magic strings.
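
One possible way to soften the magic-strings problem is to key the map on an Enum, so an unknown name fails in one well-defined place. This is only a sketch: the `Source` enum, the `protezione_civile_data` getter and the sample rows are assumptions made for illustration, not code from this project.

```python
from enum import Enum
from typing import Callable, Dict, List


class Source(Enum):
    # Hypothetical identifiers; the values are the names a caller would pass in.
    JOHNS_HOPKINS = "john_hopkins"
    PROTEZIONE_CIVILE = "protezione_civile"


def john_hopkins_data() -> List[dict]:
    # Stand-in for the real fetch; returns a placeholder row.
    return [{"Confirmed": 0, "Deaths": 0, "Recovered": 0}]


def protezione_civile_data() -> List[dict]:
    # Stand-in for the real fetch; the field name is illustrative only.
    return [{"tests": 0}]


# Every supported source is registered exactly once, keyed by enum member.
SOURCE_MAP: Dict[Source, Callable[[], List[dict]]] = {
    Source.JOHNS_HOPKINS: john_hopkins_data,
    Source.PROTEZIONE_CIVILE: protezione_civile_data,
}


def get_data(source_name: str) -> List[dict]:
    # Source(...) raises ValueError for an unknown name, so a typo is
    # rejected here rather than surfacing as a KeyError in the lookup.
    source = Source(source_name)
    return SOURCE_MAP[source]()
```

Strings still arrive from the URL or query, but they are validated once at the boundary and the rest of the code works with `Source` members.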

@andreagrandi
Owner Author

Oh I see! So the end user would only call GET /api/v1/daily-report

and the response would contain something like:

{
    "john_hopkins": {
        .... (data from JH)
    },
    "protezione_civile": {
        .... (data from Italian PC)
    }
}

and it would be up to the user to pick the one they want, right?

We could even let users limit the response to specific sources, or exclude the ones they don't want.

Do we agree that each data source should have its own models? Cheers
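
A rough sketch of what that combined response with optional filtering could look like. The names `daily_report`, `include` and `exclude`, the stub getters and their placeholder values are all assumptions for illustration; nothing here is an agreed design.

```python
from typing import Callable, Dict, Iterable, Optional


def john_hopkins_data() -> dict:
    # Stand-in for the real JH fetch; placeholder values only.
    return {"Confirmed": 0, "Deaths": 0, "Recovered": 0}


def protezione_civile_data() -> dict:
    # Stand-in for the real Protezione Civile fetch; illustrative field name.
    return {"tests": 0}


SOURCE_MAP: Dict[str, Callable[[], dict]] = {
    "john_hopkins": john_hopkins_data,
    "protezione_civile": protezione_civile_data,
}


def daily_report(
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> Dict[str, dict]:
    # With no "include" filter, start from every registered source;
    # otherwise keep only the requested names that actually exist.
    if include is None:
        names = list(SOURCE_MAP)
    else:
        names = [name for name in include if name in SOURCE_MAP]
    excluded = set(exclude or ())
    # Only the getters for the selected sources are called.
    return {name: SOURCE_MAP[name]() for name in names if name not in excluded}
```

For example, `daily_report(include=["john_hopkins"])` would return a payload with a single `"john_hopkins"` key, while `daily_report()` would return every source.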

@MatMoore
Collaborator

I haven't thought about this very much, but seeing as the reports won't be consistent between organisations, maybe we could simplify it to just /itapc or /jh instead of categorising them further with national-reports or world-reports?

Then if you query the root of the api we could return a list of all the reports available.

So something like

GET /api/v1/ ->
[
  {
    "source": "Protezione Civile",
    "reports_url": "/api/v1/itapc/"
  },
  {
    "source": "Johns Hopkins CSSE",
    "reports_url": "/api/v1/jh/"
  }
]
GET /api/v1/jh/ -> what's now /daily-reports
GET /api/v1/itapc/ -> all the data from the Protezione Civile dataset

I think it makes sense to model each source independently and reuse column names from the original dataset, rather than trying to map each report to a common vocabulary, because the exact meaning of each metric will depend on how it's collected/recorded.
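
To make the idea concrete, here is a minimal sketch using plain dataclasses, one per source, with field names copied from the source rather than normalised. The JH fields are the three mentioned earlier in this thread; the Protezione Civile column names (`nuovi_positivi`, `tamponi`) are assumptions that should be checked against the actual dataset, and all the numbers are placeholders.

```python
from dataclasses import asdict, dataclass


@dataclass
class JohnsHopkinsReport:
    # Field names mirror the JH columns as-is, capitalisation included.
    Confirmed: int
    Deaths: int
    Recovered: int


@dataclass
class ProtezioneCivileReport:
    # Assumed column names, for illustration only.
    nuovi_positivi: int  # new positive cases
    tamponi: int  # tests performed


# Placeholder values, not real figures.
jh_row = JohnsHopkinsReport(Confirmed=10, Deaths=1, Recovered=2)
pc_row = ProtezioneCivileReport(nuovi_positivi=5, tamponi=100)

# Each model serialises with its own vocabulary; nothing is mapped
# onto a shared schema.
jh_payload = asdict(jh_row)
pc_payload = asdict(pc_row)
```

Keeping the original names means a consumer can look up each metric in the publisher's own documentation without a translation table.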

@lbhdc

lbhdc commented Mar 28, 2020

I think having a separate endpoint for each data source is a great way to go. It will keep payloads smaller for your consumers, since they can fetch only the source they need.
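
A framework-free sketch of that layout, with a root index plus one endpoint per source. The `handle_get` function, the `SOURCES` table and the stub getters are hypothetical names introduced here; a real implementation would sit on top of whatever web framework the project uses.

```python
from typing import Callable, Dict, List, Tuple

API_ROOT = "/api/v1/"


def itapc_reports() -> List[dict]:
    # Stand-in for the Protezione Civile data; illustrative field name.
    return [{"tests": 0}]


def jh_reports() -> List[dict]:
    # Stand-in for the Johns Hopkins data; placeholder values.
    return [{"Confirmed": 0, "Deaths": 0, "Recovered": 0}]


# slug -> (display name, getter)
SOURCES: Dict[str, Tuple[str, Callable[[], List[dict]]]] = {
    "itapc": ("Protezione Civile", itapc_reports),
    "jh": ("Johns Hopkins CSSE", jh_reports),
}


def index() -> List[dict]:
    # The root endpoint lists every source and where to find its reports.
    return [
        {"source": label, "reports_url": f"{API_ROOT}{slug}/"}
        for slug, (label, _getter) in SOURCES.items()
    ]


def handle_get(path: str) -> List[dict]:
    if path == API_ROOT:
        return index()
    # Anything else must be exactly one source slug under the API root.
    slug = path[len(API_ROOT):].strip("/") if path.startswith(API_ROOT) else None
    if slug not in SOURCES:
        raise LookupError(f"no such endpoint: {path}")
    _label, getter = SOURCES[slug]
    # Only the requested source is fetched and returned.
    return getter()
```

Adding a source then means adding one entry to `SOURCES`; the index picks it up automatically.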

@andreagrandi
Owner Author

I'm closing this since we agreed on a solution.
