Approximate selection #107

Merged
merged 16 commits into from
Nov 29, 2023

Conversation

ttt-77 (Collaborator, Author) commented Nov 21, 2023

@ddkang Could you please review this PR?

aidb/utils/constants.py (outdated review thread, resolved)

ttt-77 (Collaborator, Author) commented Nov 22, 2023

[Image: Screenshot 2023-11-21 at 23 12 03]

Sorry, I didn't state this clearly. For example, suppose we have the query 'select blob_id, entity_id from entities00 where type LIKE 'EVENT''. One derived row has type 'EVENT' and another has type 'ORG'. If blob 0 is the cluster representative, should we set the true score of blob 0 to 0.5 or to 1?
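To make the two options in this question concrete, here is a minimal sketch. It assumes the two derived rows ('EVENT' and 'ORG') both belong to blob 0, and it uses exact string equality as a stand-in for the LIKE 'EVENT' predicate. The function names are hypothetical and are not part of aidb.

```python
from typing import Iterable


def fraction_score(row_types: Iterable[str], target: str) -> float:
    """Share of a blob's derived rows whose type equals the target."""
    rows = list(row_types)
    if not rows:
        return 0.0
    return sum(t == target for t in rows) / len(rows)


def any_match_score(row_types: Iterable[str], target: str) -> float:
    """1.0 if at least one derived row matches the target, else 0.0."""
    return 1.0 if any(t == target for t in row_types) else 0.0


# Assumption: blob 0 produced exactly these two derived rows.
blob0_types = ["EVENT", "ORG"]

print(fraction_score(blob0_types, "EVENT"))   # 0.5
print(any_match_score(blob0_types, "EVENT"))  # 1.0
```

The first definition scores the blob by how many of its rows satisfy the predicate; the second treats the blob as a match as soon as any row does. Which one the selection algorithm should use is exactly what the comment above asks.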

ttt-77 (Collaborator, Author) commented Nov 28, 2023

@ddkang Could you please review this PR?

aidb/query/query.py (outdated review thread, resolved)
tests/test_approx_select.py (outdated review thread, resolved)

ttt-77 (Collaborator, Author) commented Nov 28, 2023

Please check the new commit and issues #116 and #117.

aidb/query/query.py (outdated review thread, resolved)
dataset = self.get_sampled_proxy_blob(proxy_score_for_all_blobs)

# This is used for parallel test
seed = (mp.current_process().pid * np.random.randint(100000, size=1)[0]) % (2**32 - 1)
Owner commented:

Can you pass in the seed as an argument to the base engine? I don't think it's good to set it here.
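One way to read this suggestion is sketched below: the caller supplies the seed once at construction time, and the engine keeps its own random generator instead of deriving a seed from the process ID inside the sampling code. The class and method names are hypothetical and do not reflect aidb's actual engine. The sketch uses the standard library's random.Random for self-containment, whereas the PR itself uses NumPy.

```python
import random
from typing import List, Optional


class SamplingEngine:
    """Hypothetical stand-in for the base engine; not aidb's actual class."""

    def __init__(self, seed: Optional[int] = None):
        # A private generator avoids touching global random state.
        # seed=None falls back to OS-provided entropy.
        self._rng = random.Random(seed)

    def sample_blob_ids(self, blob_ids: List[int], k: int) -> List[int]:
        """Draw k distinct blob ids without replacement."""
        return self._rng.sample(blob_ids, k)


blob_ids = list(range(100))

# Two engines built with the same seed draw the same sample,
# which keeps tests reproducible.
engine_a = SamplingEngine(seed=1234)
engine_b = SamplingEngine(seed=1234)
assert engine_a.sample_blob_ids(blob_ids, 5) == engine_b.sample_blob_ids(blob_ids, 5)
```

With this shape, a parallel test can still give each worker a different seed, but that choice is made by the test harness rather than inside the query code.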


logger.info(f'num_samples: {len(additional_samples)}')

additional_satisfied_sampled_results, additional_all_sampled_results = await self.get_inference_results(
Owner commented:

Don't we also want to take the positive records from the pilot sample?

Collaborator (Author) commented:

Yes, we have taken the positive records from the pilot sample. The variable names are ambiguous; I will change them.

R1 = sorted_satisfied_sampled_results.index
R2 = dataset[dataset[PROXY_SCORE] >= tau_modified].index
additional_samples = list(set(R1).union(set(R2)))
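The three lines above take the union of two sets of blob ids: those that tested positive in the pilot sample (R1) and those whose proxy score is at or above the threshold (R2). A minimal, dependency-free sketch of the same idea follows. It uses a plain dict in place of the pandas DataFrame from the PR, and the function and parameter names are illustrative assumptions.

```python
from typing import Dict, Iterable, List


def select_blobs(
    pilot_positive_ids: Iterable[int],
    proxy_scores: Dict[int, float],
    tau: float,
) -> List[int]:
    """Union of pilot-sample positives and blobs with proxy score >= tau."""
    r1 = set(pilot_positive_ids)
    r2 = {blob_id for blob_id, score in proxy_scores.items() if score >= tau}
    # Sorted only to make the output deterministic; the PR uses list(set(...)).
    return sorted(r1 | r2)


# Illustrative values, not taken from the PR.
proxy_scores = {0: 0.9, 1: 0.2, 2: 0.7, 3: 0.4}
selected = select_blobs([1, 2], proxy_scores, tau=0.6)
print(selected)  # [0, 1, 2]
```

Blob 1 is kept despite its low proxy score because it was a confirmed positive in the pilot sample, which is the point raised in the review question above.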

ttt-77 (Collaborator, Author) commented Nov 28, 2023

@ddkang Could you please review the new commit?

ddkang self-requested a review November 29, 2023 02:41
ddkang merged commit 19ceae8 into main Nov 29, 2023
1 check passed
ddkang deleted the approximate_selection branch November 29, 2023 02:42