
add new way to load json data #20

Open
wants to merge 1 commit into
base: main

Conversation


@oraizen oraizen commented Jul 31, 2022

The problem with the current way of loading JSON data is that it reads the data fully into memory, after which most of the scanned JSON is thrown away and only the track URIs are kept for further processing.
I propose to streamline the loading process. The dataset consists of 1000 files of JSON-encoded data, with each file containing 1000 playlists. One such file takes around 31 MB on disk on average. Although that may not seem large, when a file is processed the usual way, by reading it fully into memory, memory usage can peak at up to 6 times the file's size on disk (you can verify this with any memory profiler, as was done here).
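To make the memory overhead concrete, here is a minimal sketch of that kind of measurement using the standard-library `tracemalloc` module. The file layout is a made-up miniature of the playlist data, not the real dataset, so the exact numbers will differ, but it shows how fully materializing a parsed document costs far more than the file's size on disk:

```python
import json
import os
import tempfile
import tracemalloc

# Hypothetical miniature of one dataset slice: many small playlist/track dicts.
payload = {
    "playlists": [
        {"tracks": [{"track_uri": f"spotify:track:{i}"} for i in range(200)]}
        for _ in range(50)
    ]
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(payload, f)
    path = f.name

size_on_disk = os.path.getsize(path)

tracemalloc.start()
with open(path) as f:
    data = json.load(f)  # the whole document is materialized at once
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
os.remove(path)

print(f"file: {size_on_disk} B, peak while parsing: {peak} B")
```

Python object overhead (dict and string headers) is what drives the peak well above the raw file size; the larger the file, the more pronounced the gap.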

We need a way to scan these JSON files in chunks, and ijson is a tool that can help us do that. The ijson documentation says that a prefix can be used to select which elements should be built and returned by ijson.

I wrote a script that uses the ijson library to load only the part of the dataset we are actually going to use and return it as a generator.
