New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should downsample_* methods be inplace (instead of returning a new copy)? #618

luizirber opened this Issue Jan 10, 2019 · 0 comments


None yet
1 participant
Copy link

luizirber commented Jan 10, 2019

def downsample_n(self, new_num):
if self.num and self.num < new_num:
raise ValueError('new sample n is higher than current sample n')
a = MinHash(new_num, deref(self._this).ksize,
deref(self._this).is_protein, self.track_abundance,
deref(self._this).seed, 0)
if self.track_abundance:
return a
def downsample_max_hash(self, *others):
max_hashes = [ x.max_hash for x in others ]
new_max_hash = min(self.max_hash, *max_hashes)
new_scaled = get_scaled_for_max_hash(new_max_hash)
return self.downsample_scaled(new_scaled)
def downsample_scaled(self, new_num):
if self.num:
raise ValueError('num != 0 - cannot downsample a standard MinHash')
max_hash = self.max_hash
if max_hash is None:
raise ValueError('no max_hash available - cannot downsample')
old_scaled = get_scaled_for_max_hash(self.max_hash)
if old_scaled > new_num:
raise ValueError('new scaled {} is lower than current sample scaled {}'.format(new_num, old_scaled))
new_max_hash = get_max_hash_for_scaled(new_num)
a = MinHash(0, deref(self._this).ksize,
deref(self._this).is_protein, self.track_abundance,
deref(self._this).seed, new_max_hash)
if self.track_abundance:
return a

It's better to make a explicit copy if needed than do a new copy by default.

@ctb comment: " got a little bit wary of modifying minhashes in place a while back after running into some problems with md5sums changing"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment