The problem with the current way of loading the JSON data is that each file is read fully into memory, and then most of the parsed JSON is thrown away: only the track URIs are used in later processing.

I propose streamlining the JSON loading process. The dataset consists of 1000 files of JSON-encoded data, each containing 1000 playlists; one such file takes around 31 MB on disk on average. That may not seem large, but when processing is done the usual way, by reading the full file into memory, memory usage can peak at up to 6 times the file's size on disk (you can verify this with any memory profiler, as was done here).
We need a way to scan these JSON files in chunks, and ijson is a tool that can help us do that. According to the ijson documentation, a prefix can be used to select which elements ijson should build and return.

I wrote a script that uses the ijson library to load only the part of the dataset we are actually going to use, and to return it as a generator.