Skip to content

bobthemighty/firehose-sipper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

What's this?

A tiny library for reading concatenated json blobs out of S3.

Why on earth would I want that?

Kinesis firehose, by default, writes json to files in S3 with no delimiter. Eg. if you have json blobs of the form {key: value}and firehose writes three of them to the same file, you'll end up with the string {key: value}{key:value}{key:value}.

The default json.load() function in Python treats such files as invalid JSON, not without justification.

How do I use it?

import firehose_sipper

# Read a single file out of S3

for entry in firehose_sipper.sip(bucket=some_bucket, key=some_key):
    # Each entry is a dict, parsed from a json object
    print(entry)
    
# or go nuts and read all objects under a prefix
for entry in firehose_sipper.sip(bucket=some_bucket, prefix=some_prefix):
    print(entry)

The library respects gzip encoding automatically, so you can point it at a bucket and start processing.

How do I install it?

pip install firehose-sipper

I have concatenated json NOT in S3

No problem, friend. The object_stream generator reads concatenated json from arbitrary text-mode file-like objects.

from io import StringIO
from firehose_sipper import object_stream

data = StringIO(3 * json.dumps({"A":123, "B": 234}))

result = list(object_stream(data))

assert len(result) === 3

I need to customise the handling of JSON

The sip function takes an optional JSONDecoder so that you can deserialise custom types, or intercept object creation.

for entry in sip(bucket=..., prefix=..., decoder=my_custom_decoder):
    ...

About

Tiny lib for parsing Kinesis Firehose'd JSON in Python.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages