Memory testing and data gen scripts #781

Merged 12 commits into capitalone:main on Apr 14, 2023

Conversation

Contributor

@ksneab7 ksneab7 commented Apr 13, 2023

This PR adds space and time analysis code.

Successful run output:

2023-04-14 14:10:26.971746: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Evaluating sample size: 0
COMPLETE sample size: 0
Profiled in 0.027366161346435547 seconds
Merge in 2.5987625122070312e-05 seconds

Evaluating sample size: 100
INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns... 
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 378.43it/s]
INFO:DataProfiler.profilers.profile_builder: Calculating the statistics... 
  0%|                                                                                                                                                                                                             | 0/4 [00:00<?, ?
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 22.15it/s]
COMPLETE sample size: 100
Profiled in 0.2430250644683838 seconds
Merge in 0.033471107482910156 seconds

Results Saved
INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns... 
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 119.99it/s]
INFO:DataProfiler.profilers.profile_builder: Calculating the statistics... 
  0%|                                                                                                                                                                                                             | 0/4 [00:00<?, ?
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.33it/s]


return round(rng.random(min_value, max_value, (num_rows,)), sig_figs)


def random_string(rng: Generator, categories: List[str]=None, num_rows: int=1,
Collaborator

@JGSweets JGSweets Apr 13, 2023

categories: List[str]=None, -> chars: Optional[str]=None?
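
A minimal sketch of what the suggested signature could look like, assuming `rng` is a NumPy `Generator`. The rest of the original signature is cut off in the diff, so `str_len_min`, `str_len_max`, and the default alphabet below are hypothetical and only there to make the example runnable.

```python
from typing import Optional

import numpy as np
from numpy.random import Generator


def random_string(
    rng: Generator,
    chars: Optional[str] = None,
    num_rows: int = 1,
    str_len_min: int = 1,    # hypothetical parameter, not from the PR
    str_len_max: int = 256,  # hypothetical parameter, not from the PR
) -> np.ndarray:
    """Return `num_rows` random strings built from the characters in `chars`."""
    if chars is None:
        # Assumed default alphabet; the PR's actual default is not shown.
        chars = "abcdefghijklmnopqrstuvwxyz0123456789"
    alphabet = list(chars)
    # integers() excludes the upper bound by default, hence the +1.
    lengths = rng.integers(str_len_min, str_len_max + 1, size=num_rows)
    return np.array(
        ["".join(rng.choice(alphabet, size=int(n))) for n in lengths]
    )
```

Annotating the parameter as `Optional[str]` makes the `None` default explicit to type checkers, which is the point of the comment above.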

import dataprofiler as dp


def convert_data_to_df(np_data: np.array, path: str=None) -> pd.DataFrame:
Collaborator

Any parameter that defaults to None needs the Optional annotation for typing.

data = dp.Data("data/time_structured_profiler.csv")

# [0] allows model to be initialized and added to labeler
def nan_injection(df: pd.DataFrame) -> pd.DataFrame:
Collaborator

should this be in the dataset gen instead?

Collaborator

We should pass these in as arguments instead of reading module-level constants, e.g.:
PERCENT_TO_NAN -> percent_to_nan

Collaborator

currently, not taking in rng.
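
One way the three comments on `nan_injection` could be addressed together is sketched below. This is not the code that was merged: the argument order, the reading of `percent_to_nan` as a 0 to 100 percentage, and the per-cell masking strategy are all assumptions.

```python
import numpy as np
import pandas as pd
from numpy.random import Generator


def nan_injection(
    rng: Generator, df: pd.DataFrame, percent_to_nan: float = 0.0
) -> pd.DataFrame:
    """Return a copy of `df` with roughly `percent_to_nan` percent of cells set to NaN."""
    # Assumption: percent_to_nan is a percentage in [0, 100], not a fraction.
    if not 0.0 <= percent_to_nan <= 100.0:
        raise ValueError("percent_to_nan must be between 0 and 100")
    # Draw one uniform number per cell; cells below the threshold become NaN.
    mask = rng.random(df.shape) < percent_to_nan / 100.0
    # DataFrame.mask replaces values where the condition is True (NaN by default)
    # and returns a new frame, so the input is left untouched.
    return df.mask(mask)
```

Because the randomness comes from the `rng` argument rather than global state, two calls with identically seeded generators inject NaN values at the same positions.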

Comment on lines 219 to 226
if TIME_ANALYSIS:
    dp_time_analysis(sample_sizes, data,
                     path="structured_profiler_times.json")
if SPACE_ANALYSIS:
    profile = dp_profile_space_analysis(data=data,
                                        path="profile_space_analysis.bin")
    dp_merge_space_analysis(profile=profile,
                            path="merge_space_analysis.bin")
Collaborator

I'm fine with doing this separately rn.

I do think there is a possibility of doing it concurrently though.

# set seed
random.seed(0)
np.random.seed(0)
dp.set_seed(0)
Collaborator

good
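
As a possible companion to the global seeding above, a script could also build one explicit NumPy `Generator` from the same seed and pass it to the data generation helpers. The `make_rng` helper and `SEED` constant are hypothetical names used for illustration; the `dataprofiler` call is left as a comment so the sketch runs without that package installed.

```python
import random

import numpy as np

SEED = 0  # same value as in the snippet above


def make_rng(seed: int = SEED) -> np.random.Generator:
    """Seed the global RNGs and return a dedicated NumPy Generator."""
    random.seed(seed)
    np.random.seed(seed)
    # dp.set_seed(seed) would go here when dataprofiler is imported as dp.
    return np.random.default_rng(seed)


# Two generators built from the same seed produce identical draws.
first = make_rng().integers(0, 100, size=5)
second = make_rng().integers(0, 100, size=5)
assert (first == second).all()
```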

@@ -6,3 +6,4 @@ pytest-cov>=2.8.1
pytest-xdist>=2.1.0
pytest-forked>=1.3.0
toolz>=0.10.0
memray>=1.7.0
Collaborator

nice

Collaborator

@JGSweets JGSweets left a comment

Some requested changes.

taylorfturner previously approved these changes Apr 14, 2023
Comment on lines 146 to 147
"percent_to_nan": PERCENT_TO_NAN,
"allow_subsampling": ALLOW_SUBSAMPLING,
Collaborator

need to fix

Comment on lines 162 to 163
"percent_to_nan": PERCENT_TO_NAN,
"allow_subsampling": ALLOW_SUBSAMPLING,
Collaborator

need to fix

@JGSweets JGSweets merged commit 7ce6307 into capitalone:main Apr 14, 2023
4 checks passed
4 participants