Memory testing and data gen scripts #781
    return round(rng.random(min_value, max_value, (num_rows,)), sig_figs)


def random_string(rng: Generator, categories: List[str]=None, num_rows: int=1,
categories: List[str]=None -> chars: Optional[str]=None?
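A minimal sketch of what the suggested `chars` parameter could look like. Only `rng`, `chars`, and `num_rows` come from this thread; the `str_len` parameter, the default alphabet, and the return type are assumptions made for illustration, since the rest of the real signature is not shown here.

```python
import string
from typing import List, Optional

import numpy as np
from numpy.random import Generator


def random_string(
    rng: Generator,
    chars: Optional[str] = None,
    num_rows: int = 1,
    str_len: int = 8,  # hypothetical parameter, not part of the PR
) -> List[str]:
    """Draw num_rows strings of str_len characters sampled from chars."""
    if chars is None:
        # Assumed default alphabet: ASCII letters and digits.
        chars = string.ascii_letters + string.digits
    alphabet = np.array(list(chars))
    # Sample every character position independently, with replacement.
    picks = rng.choice(alphabet, size=(num_rows, str_len))
    return ["".join(row) for row in picks]
```

Passing a single string of allowed characters keeps the call site short, for example `random_string(np.random.default_rng(0), chars="abc", num_rows=3)`.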
import dataprofiler as dp


def convert_data_to_df(np_data: np.array, path: str=None) -> pd.DataFrame:
Any parameter that defaults to None needs the Optional annotation for typing.
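For illustration, here is roughly how the annotation might look on this function. The body is an assumption (the PR's implementation is not shown in this thread): it builds a DataFrame and writes a CSV only when `path` is given. Note that `np.ndarray` is used as the type here, since `np.array` is a function and not a type.

```python
from typing import Optional

import numpy as np
import pandas as pd


def convert_data_to_df(
    np_data: np.ndarray, path: Optional[str] = None
) -> pd.DataFrame:
    """Convert a 2-D array to a DataFrame, optionally saving it as CSV."""
    df = pd.DataFrame(np_data)
    # Assumed behavior: only write to disk when a path is supplied.
    if path is not None:
        df.to_csv(path, index=False)
    return df
```

`Optional[str]` tells type checkers that both a string and `None` are valid, which a bare `str` annotation with a `None` default does not.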
data = dp.Data("data/time_structured_profiler.csv")

# [0] allows model to be initialized and added to labeler
def nan_injection(df: pd.DataFrame) -> pd.DataFrame:
should this be in the dataset gen instead?
We should pass in module-level constants as function parameters, e.g. PERCENT_TO_NAN -> percent_to_nan.
Currently, this function does not take in rng.
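A rough sketch of how the two comments above could be combined: `nan_injection` receives the generator and the percentage as arguments and does not read globals. The argument order, the per-column sampling strategy, and the use of `DataFrame.mask` are assumptions for illustration, not the PR's actual implementation.

```python
import numpy as np
import pandas as pd
from numpy.random import Generator


def nan_injection(
    rng: Generator, df: pd.DataFrame, percent_to_nan: float = 0.0
) -> pd.DataFrame:
    """Return a copy of df with percent_to_nan percent of each column set to NaN."""
    if not 0.0 <= percent_to_nan <= 100.0:
        raise ValueError("percent_to_nan must be between 0 and 100")
    num_rows, num_cols = df.shape
    num_nan = int(num_rows * percent_to_nan / 100)
    # Boolean mask marking the cells to blank out.
    to_nan = np.zeros(df.shape, dtype=bool)
    for col_idx in range(num_cols):
        # Pick distinct rows per column so each column gets exactly num_nan NaNs.
        rows = rng.choice(num_rows, size=num_nan, replace=False)
        to_nan[rows, col_idx] = True
    # mask() returns a new DataFrame and leaves the input untouched.
    return df.mask(to_nan)
```

Because the generator is passed in, the same seed reproduces the same NaN positions, for example `nan_injection(np.random.default_rng(0), df, percent_to_nan=20)`.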
if TIME_ANALYSIS:
    dp_time_analysis(sample_sizes, data,
                     path="structured_profiler_times.json")
if SPACE_ANALYSIS:
    profile = dp_profile_space_analysis(data=data,
                                        path="profile_space_analysis.bin")
    dp_merge_space_analysis(profile=profile,
                            path="merge_space_analysis.bin")
I'm fine with doing this separately rn.
I do think there is a possibility of doing it concurrently though.
# set seed
random.seed(0)
np.random.seed(0)
dp.set_seed(0)
good
@@ -6,3 +6,4 @@ pytest-cov>=2.8.1
pytest-xdist>=2.1.0
pytest-forked>=1.3.0
toolz>=0.10.0
memray>=1.7.0
nice
Some requested changes.
51178ea to 5b34cba
…ce tests in throughput testing script.
…re data class specifications
…re data class specifications part 2
1f603a5 to 3c05811
"percent_to_nan": PERCENT_TO_NAN,
"allow_subsampling": ALLOW_SUBSAMPLING,
need to fix
"percent_to_nan": PERCENT_TO_NAN,
"allow_subsampling": ALLOW_SUBSAMPLING,
need to fix
This PR adds space and time analysis code.
Successful run output: