
Shallow copy where possible in data processing (15% faster) #85

Merged: 6 commits merged into capitalone:main on Mar 22, 2021

Conversation
Conversation

@lettergram (Contributor) commented on Mar 22, 2021

Reduced runtime by 10-15% across the board on test files.

Documented in issue #84.

There are a significant number of "deepcopy" calls, and they are currently what is slowing the library, for example the line below:

results = self.match_sentence_lengths(data, copy.deepcopy(results),

After removing deepcopy, 3 tests fail when testing data processing:

pytest dataprofiler/tests/labelers/test_data_processing.py

I suspect that's due to issues with either the tests OR a function that needs to not mutate its input. After removing deepcopy, the function(s) still all worked fine in practice (as far as I could tell).

In either case, a shallow copy likely works fine here. When tested, it passed all tests:

results = self.match_sentence_lengths(data, dict(results), flatten_separator)

This resulted in a 10-15% reduction in profiling runtime (tested on the test file diamonds.csv).

Used the following code to cProfile: https://gist.github.com/lettergram/d8f7d9f3d19856d4a0187462445382a0
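For reference, a minimal sketch of what such a cProfile harness can look like (the real script is in the gist above; the exact DataProfiler calls and file path here are illustrative assumptions):

    import cProfile
    import pstats

    import dataprofiler as dp

    # Load the sample CSV and profile a full profiling run over it.
    data = dp.Data("diamonds.csv")

    pr = cProfile.Profile()
    pr.enable()
    dp.Profiler(data)  # the library work being measured
    pr.disable()

    # Sort by internal time and print the top 20 entries,
    # matching the "Ordered by: internal time" tables below.
    pstats.Stats(pr).sort_stats("tottime").print_stats(20)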

Master repo (sorted by tottime):

         33022296 function calls (30312047 primitive calls) in 20.389 seconds

   Ordered by: internal time
   List reduced from 9981 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    1.849    0.037    1.851    0.037 {method 'tolist' of 'numpy.ndarray' objects}
2262585/575    1.294    0.000    2.548    0.004 copy.py:132(deepcopy)
    12733    0.517    0.000    0.517    0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
 85312/40    0.496    0.000    2.452    0.061 copy.py:210(_deepcopy_list)
    23976    0.402    0.000    0.557    0.000 numerical_column_stats.py:386(_get_percentile)
  4755217    0.341    0.000    0.342    0.000 {method 'get' of 'dict' objects}
     1100    0.332    0.000    0.332    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
    11392    0.323    0.000    0.323    0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
   250330    0.318    0.000    0.324    0.000 _collections_abc.py:742(__iter__)
  990/330    0.285    0.000    0.285    0.001 version_utils.py:98(swap_class)
     1100    0.270    0.000    0.473    0.000 function_def_to_graph.py:122(function_def_to_graph_def)
    10250    0.270    0.000    0.272    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
    31850    0.237    0.000    1.217    0.000 ops.py:1880(__init__)
     1500    0.233    0.000    0.233    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
    29420    0.225    0.000    0.262    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485219/1480389    0.215    0.000    0.260    0.000 {built-in method builtins.isinstance}
       10    0.212    0.021    2.541    0.254 character_level_cnn_model.py:698(predict)
    29200    0.191    0.000    0.191    0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
     2200    0.182    0.000    0.472    0.000 function_deserialization.py:481(_list_function_deps)
     2389    0.178    0.000    0.178    0.000 {built-in method marshal.loads}

After removing deepcopy on results, there was a 15% speedup (fails 3 tests):

         19387413 function calls (18978038 primitive calls) in 17.416 seconds

   Ordered by: internal time
   List reduced from 9981 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    1.784    0.036    1.786    0.036 {method 'tolist' of 'numpy.ndarray' objects}
    12733    0.512    0.000    0.512    0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
    23976    0.422    0.000    0.571    0.000 numerical_column_stats.py:386(_get_percentile)
     1100    0.321    0.000    0.321    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
    11392    0.310    0.000    0.310    0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
   250330    0.307    0.000    0.314    0.000 _collections_abc.py:742(__iter__)
  990/330    0.283    0.000    0.283    0.001 version_utils.py:98(swap_class)
     1100    0.263    0.000    0.461    0.000 function_def_to_graph.py:122(function_def_to_graph_def)
    10250    0.254    0.000    0.256    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
     1500    0.227    0.000    0.227    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
    31850    0.225    0.000    1.160    0.000 ops.py:1880(__init__)
    29420    0.213    0.000    0.249    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485055/1480225    0.209    0.000    0.251    0.000 {built-in method builtins.isinstance}
       10    0.202    0.020    2.446    0.245 character_level_cnn_model.py:698(predict)
    29200    0.184    0.000    0.184    0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
     2200    0.178    0.000    0.460    0.000 function_deserialization.py:481(_list_function_deps)
     2389    0.152    0.000    0.152    0.000 {built-in method marshal.loads}
     3190    0.144    0.000    0.144    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphCopyFunction}
   756337    0.137    0.000    0.137    0.000 {method 'search' of 're.Pattern' objects}
       80    0.137    0.002    0.359    0.004 {pandas._libs.lib.map_infer_mask}

Shallow copy implementation (passes all tests):

         19389622 function calls (18980246 primitive calls) in 17.991 seconds

   Ordered by: internal time
   List reduced from 9981 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    1.887    0.038    1.889    0.038 {method 'tolist' of 'numpy.ndarray' objects}
    12733    0.522    0.000    0.522    0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
    23976    0.378    0.000    0.530    0.000 numerical_column_stats.py:386(_get_percentile)
     1100    0.338    0.000    0.338    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
    11392    0.326    0.000    0.326    0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
   250330    0.321    0.000    0.328    0.000 _collections_abc.py:742(__iter__)
  990/330    0.287    0.000    0.288    0.001 version_utils.py:98(swap_class)
     1100    0.275    0.000    0.480    0.000 function_def_to_graph.py:122(function_def_to_graph_def)
    10250    0.269    0.000    0.271    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
     1500    0.242    0.000    0.242    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
    31850    0.236    0.000    1.224    0.000 ops.py:1880(__init__)
    29420    0.224    0.000    0.262    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485818/1480988    0.214    0.000    0.257    0.000 {built-in method builtins.isinstance}
     2389    0.201    0.000    0.201    0.000 {built-in method marshal.loads}
       10    0.199    0.020    2.559    0.256 character_level_cnn_model.py:698(predict)
    29200    0.190    0.000    0.190    0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
     2200    0.181    0.000    0.474    0.000 function_deserialization.py:481(_list_function_deps)
     3190    0.150    0.000    0.150    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphCopyFunction}
   756337    0.139    0.000    0.139    0.000 {method 'search' of 're.Pattern' objects}
    37515    0.139    0.000    0.140    0.000 {built-in method tensorflow.python._tf_stack.extract_stack}

@lettergram added the Medium Priority (significant improvement or bug/feature reducing overall performance) and Refactor (code that is being modified to improve the library) labels on Mar 22, 2021
@lettergram added this to "In progress" in v0.4.0 via automation on Mar 22, 2021
@lettergram changed the title from "Shallow copy where possible in data processing" to "Shallow copy where possible in data processing (15% faster)" on Mar 22, 2021
AnhTruong previously approved these changes on Mar 22, 2021
Comment on lines -1437 to +1442:

-    results = self.match_sentence_lengths(data, copy.deepcopy(results),
+    # FORMER DEEPCOPY, SHALLOW AS ONLY INTERNAL
+    results = self.match_sentence_lengths(data, dict(results),
Collaborator:
I'm concerned that results is a nested dict and this will alter the input.

To me, if this is the functionality we wanted, it would have been better to process without a return, so the user knows that it is altering the value.

Not sure if a user ever uses the processor directly, but in this case they would unknowingly end up with an altered input.
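For illustration, a shallow copy only protects the top-level mapping; any nested dicts or lists are still shared with the caller (minimal standalone example, not library code):

    import copy

    results = {"pred": [1, 2], "meta": {"conf": 0.9}}

    shallow = dict(results)
    shallow["pred"] = [3, 4]         # rebinding a top-level key: original untouched
    shallow["meta"]["conf"] = 0.1    # mutating a nested value: leaks into the original

    print(results["pred"])           # [1, 2]  -- protected by the shallow copy
    print(results["meta"]["conf"])   # 0.1     -- altered through the shared nested dict

    deep = copy.deepcopy(results)
    deep["meta"]["conf"] = 0.5       # deepcopy fully isolates nested values
    print(results["meta"]["conf"])   # still 0.1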

@lettergram (Contributor, author) replied:

Personally, I think if a user is creating their own data processing pipeline, they can deepcopy in their pipeline; they would likely find the issue if they were doing that. That being said, it's not a use case I think we need to prioritize.

v0.4.0 automation moved this from In progress to In Progress Mar 22, 2021
@JGSweets (Collaborator) left a comment:

Consensus seems to be that we could add an inplace flag, similar to pandas, which in our case would default to True. This would allow a user to prevent mutation of the input by setting it to False.
Essentially, this gives the user the ability, in a future case, to send the model results to two processors without a negative side effect.
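A minimal sketch of what such a flag could look like (hypothetical class, method, and signature, not the merged implementation):

    import copy

    class ExampleProcessor:
        # Hypothetical sketch of the suggested inplace flag; not the library's API.

        def match_sentence_lengths(self, data, results, flatten_separator):
            # Stand-in for the real method, which mutates `results` internally.
            results["matched"] = True
            return results

        def process(self, data, results, flatten_separator, inplace=True):
            if not inplace:
                # Caller opted out of mutation: fully isolate their input first.
                results = copy.deepcopy(results)
            return self.match_sentence_lengths(data, results, flatten_separator)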

@JGSweets (Collaborator) left a comment:

@lettergram said we should not address the inplace case; approving as a result, despite my concerns.

@grant-eden grant-eden merged commit 6e4a0a4 into capitalone:main Mar 22, 2021
v0.4.0 automation moved this from In Progress to Done Mar 22, 2021
stevensecreti pushed a commit to stevensecreti/DataProfiler that referenced this pull request Jun 15, 2022
…ne#85)

* shallow copy in data processing where possible

* added deep copy

* added deep copy

* added back in deepcopy
Labels: Medium Priority, Refactor
Projects: none open
5 participants