Shallow copy where possible in data processing (15% faster) #85
Conversation
- results = self.match_sentence_lengths(data, copy.deepcopy(results),
+ # FORMER DEEPCOPY, SHALLOW AS ONLY INTERNAL
+ results = self.match_sentence_lengths(data, dict(results),
I'm concerned that `results` is a nested dict, so this shallow copy will still alter the input.
To me, if this is the functionality we wanted, it would have been better to process without a return value, so the user knows the argument is being altered in place.
Not sure if a user ever uses the processor directly, but in that case they would unknowingly end up with an altered input.
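To illustrate the concern with a minimal sketch (the dict shape here is hypothetical, not the actual labeler output): `dict(results)` only copies the top level, so any nested containers are shared with the caller's input, while `copy.deepcopy` fully detaches it.

```python
import copy

# Hypothetical nested results dict, stand-in for the processor's input.
results = {"pred": [1, 2, 3], "meta": {"start": 0}}

shallow = dict(results)          # copies only the top-level keys
shallow["meta"]["start"] = 99    # nested dict is shared...
print(results["meta"]["start"])  # ...so the original input is altered: 99

deep = copy.deepcopy(results)
deep["meta"]["start"] = 0        # fully independent copy
print(results["meta"]["start"])  # original stays 99
```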
Personally, I think if a user is creating their own data processing pipeline, they can deepcopy in their own pipeline; they would likely notice the issue if they were doing so. That said, that's not a use case I think we need to prioritize.
638459d
Consensus seems to be that we could add an `inplace` flag similar to pandas, which in our case would default to True. This would allow a user to prevent mutation of their input by passing False.
Essentially, this gives the user the ability, in a future case, to feed the model results to two processors without a negative effect.
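A minimal sketch of the proposed pandas-style flag (the function name matches the PR, but the signature and body here are simplified placeholders, not the real implementation):

```python
import copy

def match_sentence_lengths(data, results, inplace=True):
    """Sketch: mutate `results` by default, copy only when asked."""
    if not inplace:
        # Only pay the deepcopy cost when the caller opts out of mutation.
        results = copy.deepcopy(results)
    # Placeholder for the actual processing, which mutates `results`.
    results["processed"] = True
    return results

original = {"pred": [1, 2], "processed": False}
out = match_sentence_lengths(None, original, inplace=False)
print(original["processed"], out["processed"])  # False True
```

With `inplace=True` (the proposed default) the same call would mutate `original` directly, keeping the 10-15% speedup for the common path.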
@lettergram said we should not address the `inplace` case; approving as a result despite my concerns.
…ne#85)
* shallow copy in data processing where possible
* added deep copy
* added deep copy
* added back in deepcopy
Reduced runtime by 10-15% across the board on test files.
Documented issue #84
There are a significant number of "deepcopy" calls. Currently, these calls are what is slowing the library, for example the line below:
DataProfiler/dataprofiler/labelers/data_processing.py
Line 1437 in 7c05449
After removing deepcopy, 3 data processing tests fail:
I suspect that's due to issues with either the tests OR a function that must not mutate its input. After removing deepcopy, the function(s) still all worked fine in practice (as far as I could tell).
In either case, a shallow copy likely would work fine here. When tested it did pass all tests:
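A rough sense of why the shallow copy helps (synthetic data, not the actual profiler output): deepcopy recursively rebuilds every nested container, while `dict()` only copies the top-level key table.

```python
import copy
import timeit

# Synthetic stand-in for a processor results dict: 100 keys,
# each holding a nested dict with a 50-element list.
results = {f"sample_{i}": {"pred": list(range(50))} for i in range(100)}

deep_t = timeit.timeit(lambda: copy.deepcopy(results), number=200)
shallow_t = timeit.timeit(lambda: dict(results), number=200)

# Shallow copy skips the recursive traversal entirely.
print(f"deepcopy: {deep_t:.4f}s  shallow: {shallow_t:.4f}s")
```

The exact ratio depends on the data shape; the 10-15% figure in this PR is end-to-end profiling runtime, where the copies are only part of the work.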
This resulted in a 10-15% reduction in profiling runtime (tested on the test file diamonds.csv). Used the following code to cProfile: https://gist.github.com/lettergram/d8f7d9f3d19856d4a0187462445382a0
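For reference, a minimal cProfile setup along these lines produces the tottime-sorted tables shown below (the linked gist is the actual benchmark; the target function here is a placeholder for running the profiler on diamonds.csv):

```python
import cProfile
import pstats

def run_workload():
    # Placeholder for the real target, e.g. profiling diamonds.csv
    # with dataprofiler as in the linked gist.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
run_workload()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("tottime").print_stats(5)  # top 5 entries by total time
```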
Master repo (sort by tottime):
After removing deepcopy on `results`, there was a 15% speedup (fails 3 tests):
Shallow copy implementation (passes all tests):