The problem with the current way of loading the JSON data is that each file is read fully into memory, and then most of the parsed JSON is thrown away: only the track URIs are used in later processing.

I propose streamlining the JSON loading process. The dataset consists of 1000 files of JSON-encoded data, each containing 1000 playlists; one such file takes around 31 MB on disk on average. That may not seem large, but when processing is done the usual way, by reading the full file into memory, memory usage can peak at up to 6 times the file's size on disk (you can verify this with any memory profiler, as was done here).
We need a way to scan these JSON files in chunks, and ijson is a tool that can help us do that. According to the ijson documentation, a prefix can be used to select which elements ijson should build and return.

I wrote a script that uses the ijson library to load only the part of the dataset we are actually going to use, and to return it as a generator.