Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster xml load #839

Merged
merged 48 commits into from
Dec 1, 2016
Merged

Faster xml load #839

merged 48 commits into from
Dec 1, 2016

Conversation

rob-c
Copy link
Member

@rob-c rob-c commented Oct 25, 2016

This PR contains:

  1. Some minor test fixes (obviously bad code if the test was run interactively!)

  2. Minor speed improvements (Pieces of code which turn up in benchmarks way too much)

  3. Fix for slow reading in of XML files

  4. consists of adding a new class factory method to GangaObject which makes correct use of the __new__ method as well as fully initializing the created class instance without populating the internal dictionary with the default values from the schema. This is useful in situations where this expensive operation is a waste as the code intends to replace these objects in the next step.
    This factory is not a good solution to initializing a normal class for normal use but for the specific corner cases such as a deep copy or constructing a class for populating from XML this offers a real demonstrable performance boost.

This has the advantage that it doesn't require fixing __init__ to be correct all across the codebase. It works as expected. It doesn't require special types to get it's manipulations done and it doesn't add significant overhead into the GangaObject or MetaClass objects.

…rt rather than waiting for it to be requested to be discovered as users will get confused and likely create a new one if none is given
@rob-c rob-c self-assigned this Oct 25, 2016
@rob-c rob-c changed the base branch from 6.3-features to develop November 18, 2016 19:25
@rob-c
Copy link
Member Author

rob-c commented Nov 24, 2016

@mesmith75 I would like to have this considered for 6.3 (if not 6.4 feels OK).

This addresses a specific problem which LHCb users are still afflicted with.
When a user has a large inputdata set (10k+ files) writing this to disk and reading it in is a very expensive OP.

This PR addresses the XML problem by reducing the need to fully construct objects which don't need to be fully qualified and reduces the number of I/O ops for the writing of the XML to disk by using a String buffer which allows for the data to be written in a single bust after being constructed.

Testing with a dummy Job containing 10k DiracFile in the inputfile field:

Without this PR: 70sec + 37sec = 107sec
With this PR: 54sec + 17sec = 71sec

~30% performance improvement in the basic test and I claim some general speed improvements in casual use using this PR.

Code used for testing:

Construction:

%time j=Job(inputfiles=[DiracFile(str(_)+'.root') for _ in range(10000)])

restart ganga with --no-mon

%time print jobs[-1].backend

@mesmith75
Copy link
Contributor

@rob-c I agree this looks nice to have in for 6.3. Just let me give it a quick try and then we can merge it.

@mesmith75
Copy link
Contributor

Ok this is good. If there are no objections I'll merge this and start the release.

@mesmith75 mesmith75 merged commit f0acce4 into develop Dec 1, 2016
@mesmith75 mesmith75 deleted the fasterXMLLoad branch December 1, 2016 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants