Memory testing and data gen scripts #781

ksneab7 · 2023-04-13T19:46:47Z

This PR adds space and time analysis code.

Successful run output:

2023-04-14 14:10:26.971746: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Evaluating sample size: 0
COMPLETE sample size: 0
Profiled in 0.027366161346435547 seconds
Merge in 2.5987625122070312e-05 seconds

Evaluating sample size: 100
INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns... 
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 378.43it/s]
INFO:DataProfiler.profilers.profile_builder: Calculating the statistics... 
  0%|                                                                                                                                                                                                             | 0/4 [00:00<?, ?
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 22.15it/s]
COMPLETE sample size: 100
Profiled in 0.2430250644683838 seconds
Merge in 0.033471107482910156 seconds

Results Saved
INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns... 
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 119.99it/s]
INFO:DataProfiler.profilers.profile_builder: Calculating the statistics... 
  0%|                                                                                                                                                                                                             | 0/4 [00:00<?, ?
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.33it/s]
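
The "Profiled in" and "Merge in" lines in the output above are wall-clock durations per sample size. A minimal sketch of how such timings could be collected is shown here. It is not the PR's implementation: `time_profile_and_merge`, `fake_profile`, and `fake_merge` are hypothetical names, and the stand-in functions replace the real DataProfiler calls so the sketch runs on its own.

```python
import time
from typing import Callable, Dict, List


def time_profile_and_merge(
    sample_sizes: List[int],
    profile_fn: Callable[[int], object],
    merge_fn: Callable[[object, object], object],
) -> List[Dict[str, float]]:
    """Time a profile step and a merge step for each sample size."""
    results = []
    for size in sample_sizes:
        print(f"Evaluating sample size: {size}")

        # Time the profiling step
        start = time.perf_counter()
        profile = profile_fn(size)
        profile_time = time.perf_counter() - start

        # Time merging the profile with itself
        start = time.perf_counter()
        merge_fn(profile, profile)
        merge_time = time.perf_counter() - start

        print(f"COMPLETE sample size: {size}")
        print(f"Profiled in {profile_time} seconds")
        print(f"Merge in {merge_time} seconds")
        results.append(
            {
                "sample_size": size,
                "profile_time": profile_time,
                "merge_time": merge_time,
            }
        )
    return results


# Hypothetical stand-ins so the sketch runs without DataProfiler installed
def fake_profile(size: int) -> list:
    return list(range(size))


def fake_merge(left: list, right: list) -> list:
    return left + right


timings = time_profile_and_merge([0, 100], fake_profile, fake_merge)
```

Passing the profile and merge steps in as callables keeps the timing loop independent of the library being measured.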

JGSweets · 2023-04-13T20:48:05Z

dataprofiler/tests/space_time_analysis/dataset_generation.py

+    return round(rng.random(min_value, max_value, (num_rows,)), sig_figs)
+
+
+def random_string(rng: Generator, categories: List[str]=None, num_rows: int=1,


Change categories: List[str]=None to chars: Optional[str]=None?

JGSweets · 2023-04-13T20:50:20Z

dataprofiler/tests/space_time_analysis/dataset_generation.py

+    import dataprofiler as dp
+
+
+def convert_data_to_df(np_data: np.array, path: str=None) -> pd.DataFrame:


Any parameter that defaults to None needs Optional in its type annotation.
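
To illustrate the typing point, here is a minimal sketch of a None-defaulted parameter annotated with Optional. It is an assumption-laden stand-in, not the PR's function: `random_string_sketch`, the `str_len` parameter, and the use of the standard library's `random.Random` in place of a NumPy Generator are all hypothetical choices made so the example is self-contained.

```python
import random
import string
from typing import List, Optional


def random_string_sketch(
    rng: random.Random,
    chars: Optional[List[str]] = None,  # Optional because the default is None
    num_rows: int = 1,
    str_len: int = 8,  # hypothetical parameter, not taken from the PR
) -> List[str]:
    """Return num_rows random strings drawn from chars."""
    if chars is None:
        # None means "fall back to a default alphabet"
        chars = list(string.ascii_letters)
    return [
        "".join(rng.choice(chars) for _ in range(str_len))
        for _ in range(num_rows)
    ]


samples = random_string_sketch(random.Random(0), num_rows=3)
only_a = random_string_sketch(random.Random(0), chars=["a"], num_rows=2)
```

Without Optional, a strict type checker would flag `chars: List[str] = None` because None is not a List[str].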

JGSweets · 2023-04-13T20:52:48Z

dataprofiler/tests/space_time_analysis/structured_throughput_testing.py

-    data = dp.Data("data/time_structured_profiler.csv")
-
-    # [0] allows model to be initialzied and added to labeler
+def nan_injection(df: pd.DataFrame) -> pd.DataFrame:


Should this live in the dataset generation script instead?

We should pass settings in as parameters, e.g.:
PERCENT_TO_NAN -> percent_to_nan

The function also does not currently take in rng.
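
One way the suggestions above could look is sketched below. This is only an illustration: the body is not taken from the PR, the parameter order is an assumption, and `percent_to_nan` is assumed to be a percentage between 0 and 100 applied per column.

```python
import numpy as np
import pandas as pd
from numpy.random import Generator


def nan_injection(
    rng: Generator, df: pd.DataFrame, percent_to_nan: float = 0.0
) -> pd.DataFrame:
    """Return a copy of df with percent_to_nan percent of each column set to NaN."""
    num_rows = len(df)
    num_nan = int(num_rows * percent_to_nan / 100)
    result = df.copy()
    for col in result.columns:
        # Pick distinct row positions using only the rng that was passed in
        nan_rows = rng.choice(num_rows, size=num_nan, replace=False)
        keep = np.ones(num_rows, dtype=bool)
        keep[nan_rows] = False
        # Series.where keeps values where the mask is True and inserts NaN elsewhere
        result[col] = result[col].where(keep)
    return result


df = pd.DataFrame({"a": range(10), "b": list("abcdefghij")})
out = nan_injection(np.random.default_rng(0), df, percent_to_nan=20)
out_again = nan_injection(np.random.default_rng(0), df, percent_to_nan=20)
```

Because the function draws only from the rng argument and reads no globals, two calls with equally seeded generators produce the same output.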

JGSweets · 2023-04-13T21:06:55Z

dataprofiler/tests/space_time_analysis/structured_throughput_testing.py

+    if TIME_ANALYSIS:
+        dp_time_analysis(sample_sizes, data,
+                         path="structured_profiler_times.json")
+    if SPACE_ANALYSIS:
+        profile = dp_profile_space_analysis(data=data,
+                                            path="profile_space_analysis.bin")
+        dp_merge_space_analysis(profile=profile,
+                                path="merge_space_analysis.bin")


I'm fine with running these separately for now.

I do think it may be possible to run them concurrently, though.
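
A rough sketch of what running the two analyses concurrently might look like, using the standard library's thread pool. The functions `time_analysis` and `space_analysis` are hypothetical stand-ins for the calls in the diff above, not the PR's code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def time_analysis(sample_sizes: List[int]) -> Dict[str, int]:
    # Hypothetical stand-in for the time analysis call
    return {"evaluated": len(sample_sizes)}


def space_analysis(num_rows: int) -> Dict[str, int]:
    # Hypothetical stand-in for the two space analysis calls
    return {"rows": num_rows}


with ThreadPoolExecutor(max_workers=2) as pool:
    # Submit both analyses, then wait for each result
    time_future = pool.submit(time_analysis, [0, 100])
    space_future = pool.submit(space_analysis, 100)
    time_result = time_future.result()
    space_result = space_future.result()
```

Running a memory profiler alongside timing code in the same process may distort both measurements, so separate processes might be a better fit in practice.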

JGSweets · 2023-04-13T21:07:53Z

dataprofiler/tests/space_time_analysis/structured_throughput_testing.py

+    # set seed
+    random.seed(0)
+    np.random.seed(0)
+    dp.set_seed(0)


JGSweets · 2023-04-13T21:11:03Z

requirements-test.txt

@@ -6,3 +6,4 @@ pytest-cov>=2.8.1
 pytest-xdist>=2.1.0
 pytest-forked>=1.3.0
 toolz>=0.10.0
+memray>=1.7.0
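
The new memray dependency is what the space analysis relies on. For readers who want to see the general idea without installing it, the sketch below measures peak Python allocations with the standard library's tracemalloc instead. This is a swapped-in technique, not memray and not the PR's code; `measure_peak_memory` is a hypothetical helper.

```python
import tracemalloc
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def measure_peak_memory(func: Callable[[], T]) -> Tuple[T, int]:
    """Run func and return (result, peak bytes traced while it ran)."""
    tracemalloc.start()
    try:
        result = func()
        # get_traced_memory returns (current, peak) sizes in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


data, peak_bytes = measure_peak_memory(lambda: [0] * 100_000)
```

tracemalloc only sees allocations made through Python's allocator, so it is a coarser tool than a dedicated profiler.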


JGSweets

Some requested changes.

JGSweets · 2023-04-14T17:52:08Z

dataprofiler/tests/space_time_analysis/structured_space_time_analysis.py

+                "percent_to_nan": PERCENT_TO_NAN,
+                "allow_subsampling": ALLOW_SUBSAMPLING,


need to fix

JGSweets · 2023-04-14T17:52:15Z

dataprofiler/tests/space_time_analysis/structured_space_time_analysis.py

+                    "percent_to_nan": PERCENT_TO_NAN,
+                    "allow_subsampling": ALLOW_SUBSAMPLING,


need to fix

ksneab7 requested review from JGSweets, taylorfturner, micdavis and tyfarnan as code owners April 13, 2023 19:46

JGSweets reviewed Apr 13, 2023: dataprofiler/tests/space_time_analysis/dataset_generation.py (outdated)
JGSweets reviewed Apr 13, 2023: dataprofiler/tests/space_time_analysis/dataset_generation.py (outdated)
JGSweets reviewed Apr 13, 2023: dataprofiler/tests/space_time_analysis/dataset_generation.py (outdated)
micdavis reviewed Apr 13, 2023: dataprofiler/tests/space_time_analysis/dataset_generation.py (outdated)
JGSweets reviewed Apr 13, 2023: dataprofiler/tests/space_time_analysis/dataset_generation.py (outdated)
JGSweets reviewed Apr 13, 2023: dataprofiler/tests/space_time_analysis/dataset_generation.py (outdated)
JGSweets reviewed Apr 13, 2023: dataprofiler/tests/space_time_analysis/dataset_generation.py
JGSweets reviewed Apr 13, 2023: dataprofiler/tests/space_time_analysis/structured_throughput_testing.py (outdated)
JGSweets reviewed Apr 13, 2023: dataprofiler/tests/space_time_analysis/structured_throughput_testing.py (outdated)
JGSweets reviewed Apr 13, 2023: dataprofiler/tests/space_time_analysis/structured_throughput_testing.py (outdated)
JGSweets reviewed Apr 13, 2023
JGSweets suggested changes Apr 13, 2023

JGSweets enabled auto-merge (squash) April 14, 2023 14:50

ksneab7 force-pushed the memory_testing_and_data_gen branch 2 times, most recently from 51178ea to 5b34cba, April 14, 2023 15:34

This was referenced Apr 14, 2023

Improving out of generate_dataset_by_class function to include naming convention #782

Open

Running space analysis async rather than synchronous #783

Open

JGSweets reviewed Apr 14, 2023: dataprofiler/tests/space_time_analysis/dataset_generation.py (outdated)
JGSweets reviewed Apr 14, 2023: dataprofiler/tests/space_time_analysis/dataset_generation.py (outdated)
JGSweets reviewed Apr 14, 2023: dataprofiler/tests/space_time_analysis/dataset_generation.py (outdated)
JGSweets reviewed Apr 14, 2023: dataprofiler/tests/space_time_analysis/structured_throughput_testing.py (outdated)
JGSweets reviewed Apr 14, 2023: dataprofiler/tests/space_time_analysis/dataset_generation.py (outdated)
taylorfturner previously approved these changes Apr 14, 2023
JGSweets reviewed Apr 14, 2023: dataprofiler/tests/space_time_analysis/structured_space_time_analysis.py (outdated)
JGSweets reviewed Apr 14, 2023: dataprofiler/tests/space_time_analysis/structured_space_time_analysis.py (outdated)
JGSweets reviewed Apr 14, 2023: dataprofiler/tests/space_time_analysis/structured_space_time_analysis.py (outdated)

ksneab7 added 11 commits April 14, 2023 13:28

generation of dataset column scripts (eb2d5e1)
Created dataset generation for space-time analysis tests. Created space tests in throughput testing script. (3b32aa8)
added comments to new functionality (1c1291b)
fixed for generated text using string generation functions (985d510)
fix for repeat of options in global var (03c667d)
added space test and generation dataset to readme (7b2f610)
added docstrings to generation scripts and finalized code for PR. (3fe92a4)
PR comment fixes for generation of data and function mapping for future data class specifications (b6224a3)
PR comment fixes for generation of data and function mapping for future data class specifications part 2 (f24e1e3)
sturcture of code changed to get rid of globals and added params in place of them. (3b80133)
dataset sampling code rework (3c05811)

ksneab7 dismissed taylorfturner’s stale review via 3c05811 April 14, 2023 17:35

ksneab7 force-pushed the memory_testing_and_data_gen branch from 1f603a5 to 3c05811, April 14, 2023 17:35

JGSweets reviewed Apr 14, 2023

Typing fix (f5a5028)

JGSweets approved these changes Apr 14, 2023

micdavis approved these changes Apr 14, 2023

JGSweets merged commit 7ce6307 into capitalone:main Apr 14, 2023
4 checks passed
