
Shallow copy where possible in data processing (15% faster) #85

Merged: 6 commits merged into capitalone:main on Mar 22, 2021

Conversation
Conversation

@lettergram (Contributor) commented on Mar 22, 2021

Reduced runtime by 10-15% across the board on test files.

Documented in issue #84.

There are a significant number of "deepcopy" calls, and they are currently what is slowing the library, for example the line below:

results = self.match_sentence_lengths(data, copy.deepcopy(results),

After removing deepcopy, 3 tests fail when testing data processing:

pytest dataprofiler/tests/labelers/test_data_processing.py

I suspect that's due to issues with either the tests OR a function that needs to not mutate its input. After removing deepcopy, the function(s) still all worked fine in practice (as far as I could tell).

In either case, a shallow copy likely works fine here. When tested, it passed all tests:

results = self.match_sentence_lengths(data, dict(results), flatten_separator)

This resulted in a 10-15% reduction in profiling runtime (tested on the test file diamonds.csv).

Used the following code to cProfile: https://gist.github.com/lettergram/d8f7d9f3d19856d4a0187462445382a0
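For reference, a minimal sketch of what such a cProfile harness can look like (the real script is in the gist above; the exact DataProfiler calls and file path here are illustrative assumptions):

    import cProfile
    import pstats

    import dataprofiler as dp

    # Load the sample CSV and profile a full profiling run over it.
    data = dp.Data("diamonds.csv")

    pr = cProfile.Profile()
    pr.enable()
    dp.Profiler(data)  # the library work being measured
    pr.disable()

    # Sort by internal time and print the top 20 entries,
    # matching the "Ordered by: internal time" tables below.
    pstats.Stats(pr).sort_stats("tottime").print_stats(20)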

Master repo (sorted by tottime):

         33022296 function calls (30312047 primitive calls) in 20.389 seconds

   Ordered by: internal time
   List reduced from 9981 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    1.849    0.037    1.851    0.037 {method 'tolist' of 'numpy.ndarray' objects}
2262585/575    1.294    0.000    2.548    0.004 copy.py:132(deepcopy)
    12733    0.517    0.000    0.517    0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
 85312/40    0.496    0.000    2.452    0.061 copy.py:210(_deepcopy_list)
    23976    0.402    0.000    0.557    0.000 numerical_column_stats.py:386(_get_percentile)
  4755217    0.341    0.000    0.342    0.000 {method 'get' of 'dict' objects}
     1100    0.332    0.000    0.332    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
    11392    0.323    0.000    0.323    0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
   250330    0.318    0.000    0.324    0.000 _collections_abc.py:742(__iter__)
  990/330    0.285    0.000    0.285    0.001 version_utils.py:98(swap_class)
     1100    0.270    0.000    0.473    0.000 function_def_to_graph.py:122(function_def_to_graph_def)
    10250    0.270    0.000    0.272    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
    31850    0.237    0.000    1.217    0.000 ops.py:1880(__init__)
     1500    0.233    0.000    0.233    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
    29420    0.225    0.000    0.262    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485219/1480389    0.215    0.000    0.260    0.000 {built-in method builtins.isinstance}
       10    0.212    0.021    2.541    0.254 character_level_cnn_model.py:698(predict)
    29200    0.191    0.000    0.191    0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
     2200    0.182    0.000    0.472    0.000 function_deserialization.py:481(_list_function_deps)
     2389    0.178    0.000    0.178    0.000 {built-in method marshal.loads}

After removing deepcopy on results, there was a 15% speedup (fails 3 tests):

         19387413 function calls (18978038 primitive calls) in 17.416 seconds

   Ordered by: internal time
   List reduced from 9981 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    1.784    0.036    1.786    0.036 {method 'tolist' of 'numpy.ndarray' objects}
    12733    0.512    0.000    0.512    0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
    23976    0.422    0.000    0.571    0.000 numerical_column_stats.py:386(_get_percentile)
     1100    0.321    0.000    0.321    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
    11392    0.310    0.000    0.310    0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
   250330    0.307    0.000    0.314    0.000 _collections_abc.py:742(__iter__)
  990/330    0.283    0.000    0.283    0.001 version_utils.py:98(swap_class)
     1100    0.263    0.000    0.461    0.000 function_def_to_graph.py:122(function_def_to_graph_def)
    10250    0.254    0.000    0.256    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
     1500    0.227    0.000    0.227    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
    31850    0.225    0.000    1.160    0.000 ops.py:1880(__init__)
    29420    0.213    0.000    0.249    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485055/1480225    0.209    0.000    0.251    0.000 {built-in method builtins.isinstance}
       10    0.202    0.020    2.446    0.245 character_level_cnn_model.py:698(predict)
    29200    0.184    0.000    0.184    0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
     2200    0.178    0.000    0.460    0.000 function_deserialization.py:481(_list_function_deps)
     2389    0.152    0.000    0.152    0.000 {built-in method marshal.loads}
     3190    0.144    0.000    0.144    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphCopyFunction}
   756337    0.137    0.000    0.137    0.000 {method 'search' of 're.Pattern' objects}
       80    0.137    0.002    0.359    0.004 {pandas._libs.lib.map_infer_mask}

Shallow copy implementation (passes all tests):

         19389622 function calls (18980246 primitive calls) in 17.991 seconds

   Ordered by: internal time
   List reduced from 9981 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    1.887    0.038    1.889    0.038 {method 'tolist' of 'numpy.ndarray' objects}
    12733    0.522    0.000    0.522    0.000 {method 'ParseFromString' of 'google.protobuf.pyext._message.CMessage' objects}
    23976    0.378    0.000    0.530    0.000 numerical_column_stats.py:386(_get_percentile)
     1100    0.338    0.000    0.338    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphImportGraphDefWithResults}
    11392    0.326    0.000    0.326    0.000 {method 'SerializeToString' of 'google.protobuf.pyext._message.CMessage' objects}
   250330    0.321    0.000    0.328    0.000 _collections_abc.py:742(__iter__)
  990/330    0.287    0.000    0.288    0.001 version_utils.py:98(swap_class)
     1100    0.275    0.000    0.480    0.000 function_def_to_graph.py:122(function_def_to_graph_def)
    10250    0.269    0.000    0.271    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_FastPathExecute}
     1500    0.242    0.000    0.242    0.000 {built-in method tensorflow.python._pywrap_tfe.TFE_Py_Execute}
    31850    0.236    0.000    1.224    0.000 ops.py:1880(__init__)
    29420    0.224    0.000    0.262    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_OperationGetAttrValueProto}
1485818/1480988    0.214    0.000    0.257    0.000 {built-in method builtins.isinstance}
     2389    0.201    0.000    0.201    0.000 {built-in method marshal.loads}
       10    0.199    0.020    2.559    0.256 character_level_cnn_model.py:698(predict)
    29200    0.190    0.000    0.190    0.000 {method 'CopyFrom' of 'google.protobuf.pyext._message.CMessage' objects}
     2200    0.181    0.000    0.474    0.000 function_deserialization.py:481(_list_function_deps)
     3190    0.150    0.000    0.150    0.000 {built-in method tensorflow.python._pywrap_tf_session.TF_GraphCopyFunction}
   756337    0.139    0.000    0.139    0.000 {method 'search' of 're.Pattern' objects}
    37515    0.139    0.000    0.140    0.000 {built-in method tensorflow.python._tf_stack.extract_stack}

@lettergram added the Medium Priority (significant improvement or bug/feature reducing overall performance) and Refactor (code that is being modified to improve the library) labels on Mar 22, 2021
@lettergram added this to "In progress" in v0.4.0 via automation on Mar 22, 2021
@lettergram changed the title from "Shallow copy where possible in data processing" to "Shallow copy where possible in data processing (15% faster)" on Mar 22, 2021
AnhTruong previously approved these changes on Mar 22, 2021
Comment on lines -1437 to +1442:

-    results = self.match_sentence_lengths(data, copy.deepcopy(results),
+    # FORMER DEEPCOPY, SHALLOW AS ONLY INTERNAL
+    results = self.match_sentence_lengths(data, dict(results),
Collaborator:
I'm concerned that results is a nested dict and this will alter the input.

To me, if this is the functionality we wanted, it would have been better to process without a return, so the user knows that it is altering the value.

Not sure if a user ever uses the processor directly, but in this case they would unknowingly end up with an altered input.
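For illustration, a shallow copy only protects the top-level mapping; any nested dicts or lists are still shared with the caller (minimal standalone example, not library code):

    import copy

    results = {"pred": [1, 2], "meta": {"conf": 0.9}}

    shallow = dict(results)
    shallow["pred"] = [3, 4]         # rebinding a top-level key: original untouched
    shallow["meta"]["conf"] = 0.1    # mutating a nested value: leaks into the original

    print(results["pred"])           # [1, 2]  -- protected by the shallow copy
    print(results["meta"]["conf"])   # 0.1     -- altered through the shared nested dict

    deep = copy.deepcopy(results)
    deep["meta"]["conf"] = 0.5       # deepcopy fully isolates nested values
    print(results["meta"]["conf"])   # still 0.1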

@lettergram (Contributor, author) replied:

Personally, I think if a user is creating their own data processing pipeline, they can deepcopy in their pipeline; they would likely find the issue if they were doing that. That being said, it's not a use case I think we need to prioritize.

v0.4.0 automation moved this from In progress to In Progress Mar 22, 2021
@JGSweets (Collaborator) left a comment:

Consensus seems to be that we could add an inplace flag, similar to pandas, which in our case would default to True. This would allow a user to prevent mutation of the input by setting it to False.
Essentially, this gives the user the ability, in a future case, to send the model results to two processors without a negative side effect.
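A minimal sketch of what such a flag could look like (hypothetical class, method, and signature, not the merged implementation):

    import copy

    class ExampleProcessor:
        # Hypothetical sketch of the suggested inplace flag; not the library's API.

        def match_sentence_lengths(self, data, results, flatten_separator):
            # Stand-in for the real method, which mutates `results` internally.
            results["matched"] = True
            return results

        def process(self, data, results, flatten_separator, inplace=True):
            if not inplace:
                # Caller opted out of mutation: fully isolate their input first.
                results = copy.deepcopy(results)
            return self.match_sentence_lengths(data, results, flatten_separator)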

@JGSweets (Collaborator) left a comment:

@lettergram said we should not address the inplace case; approving as a result, despite my concerns.

@grant-eden grant-eden merged commit 6e4a0a4 into capitalone:main Mar 22, 2021
v0.4.0 automation moved this from In Progress to Done Mar 22, 2021
stevensecreti pushed a commit to stevensecreti/DataProfiler that referenced this pull request Jun 15, 2022
…ne#85)

* shallow copy in data processing where possible

* added deep copy

* added deep copy

* added back in deepcopy
Labels: Medium Priority, Refactor
Projects: none open
5 participants