Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: adds 'json rows' serialisation #56

Merged
merged 5 commits into from Nov 13, 2018
Merged

Feature: adds 'json rows' serialisation #56

merged 5 commits into from Nov 13, 2018

Conversation

lsh-0
Copy link
Contributor

@lsh-0 lsh-0 commented Nov 12, 2018

research-article-report-index is the first to get it

research-article-report-index is the first to get it
@lsh-0
Copy link
Contributor Author

lsh-0 commented Nov 12, 2018

@giorgiosironi , context of PR is:

  • output of observer is being pushed into BigQuery
  • conversion to csv loses type information, so obj-to-csv and then csv-to-json means integers become strings etc
  • pushing this back into Observer spares NiFi some boilerplate wrangling .

There are ways and means to do this in NiFi, I just haven't fully explored them yet and it's something observer could use

@@ -92,7 +92,7 @@ def upcoming_articles():
@report(article_meta(
title='published research article index',
description='The dates and times of publication for all _research_ articles published at eLife. If an article had a POA version, the date and time of the POA version is included.',
serialisations=[CSV],
serialisations=[CSV, JSON],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very easy to add to a particular report

from django.test import Client
from django.core.urlresolvers import reverse

class One(base.BaseCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestJsonResponse|TestJsonRowsIntegration with a test_published_research_article_index method? Supports adding more examples for reproducing bugs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are tests that go through every article and every serialisation of an article.

I was planning on switching to pytest for it's network isolation, but I'll update the names for you

return lu[type(val)](val)

#raise TypeError('Object of type %s with value of %s is not JSON serializable' % (type(obj), repr(obj)))
return "[unserialisable]"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not an explicit error? This string will propagate through reports (potentially in Nifi) as incorrect data

body = []
body.append((utils.safe_json_dumps(row) + "\n") for row in rows)
response = StreamingHttpResponse(itertools.chain.from_iterable(body), content_type="application/json")
response['Content-Disposition'] = 'attachment; filename="%s.rows.json"' % filename
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing strange at the HTTP headers level

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observer isn't following the REST model like the other services, it was never integrated properly. It might even be replaced entirely if BQ or something feeding off of BQ can return RSS feeds. Observer is very nice for querying article data but it just never caught on

@@ -1,5 +1,5 @@
import copy
from . import models, rss, csv, logic
from . import models, rss, csv, logic, json_rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recently discovered there is a de-facto name for this format, JSON Lines, should rename to that. It also has a Python library that is not needed here, but validates that it is somewhat popular as a concept.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah good to know, thanks. I wanted to leave room for a 'proper' json response where the entire body could be read in rather than per-row. The library looks a little pointless but I'll rename the module.

def streaming_response(filename, rows):
body = []
body.append((utils.safe_json_dumps(row) + "\n") for row in rows)
response = StreamingHttpResponse(itertools.chain.from_iterable(body), content_type="application/json")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since it's multiple lines, the whole response is not valid JSON hence content_type needs to be something else. Trying to find a valid value

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wardi/jsonlines#9 says application/x-ndjson which is from a "competing" standard but better than inventing a new value. The Python library does support both "standards".

@lsh-0 lsh-0 merged commit 8a27a86 into develop Nov 13, 2018
@lsh-0 lsh-0 deleted the feat-json-rows branch November 13, 2018 00:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants