
Creating test set with hash #71

Closed
SumoBBQ opened this issue Aug 8, 2017 · 13 comments

@SumoBBQ

SumoBBQ commented Aug 8, 2017

Hi @ageron, I really love your book and it is easy for me to understand, but in Chapter 2 you wrote:

For example, you could compute a hash of each
instance’s identifier, keep only the last byte of the hash, and put the
instance in the test set if this value is lower or equal to 51 (~20% of 256).
This ensures that the test set will remain consistent across multiple runs,
even if you refresh the dataset. The new test set will contain 20% of the
new instances, but it will not contain any instance that was previously in
the training set

Could you give me some explanation: what is the hash of an instance's identifier, and how can I apply it to split the dataset into two parts, a training set and a test set? And one more question: are an instance's identifiers longitude, median_house_value, population, ...? Is that correct?

Sorry for my bad English :( Thank you in advance.

@ageron
Owner

ageron commented Aug 17, 2017

Thanks for your kind words and your question.

What we are trying to achieve is to have a stable test set across multiple runs of your program. If the test set changes every time you run the program, then you run the risk of gradually tweaking your program and your model to fit all of the data (including test instances), which is bad because it makes the evaluation of your model on the test set unreliable (the metric will be too optimistic).

Next question: how can you ensure that the test set is stable across multiple runs? Well, there are countless solutions. For example, you could simply randomly pick a set for each instance, just once, then write the sets in separate files. In subsequent runs of your program, you would just use these existing files. End of story. However, this extra preprocessing is not necessary if every instance has a unique ID: suppose the ID is a random integer, then you can simply decide that instances whose IDs end with 00 to 19 belong to the test set, and the rest belong to the training set (this would ensure that 20% of your instances go to the test set, and 80% go to the training set). So for example, the instance with ID=5910406 is in the test set, but the instance with ID=48290483 is not. This solution would work fine. However, database IDs are often sequential, not random, so you need a way to shuffle the IDs to ensure that the instances are randomly distributed across the training set and the test set. This leads us to hash functions.
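
In code, that last-two-digits rule from the previous paragraph might look like the following quick sketch (the helper name in_test_set is just for illustration, assuming integer IDs):

>>> def in_test_set(identifier):
...     return identifier % 100 < 20   # IDs ending in 00 to 19 go to the test set
...
>>> in_test_set(5910406)
True
>>> in_test_set(48290483)
False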

As Wikipedia says, "A hash function is any function that can be used to map data of arbitrary size to data of fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes." The most famous example is the MD5 function:

>>> from hashlib import md5
>>> md5(b"This could be a very long document").digest()
b'\x98\x14\x8d\x9f\xca\xf4\xf69F\x82\xf9kF\xda\x01\x06'
>>> md5(b"This could be a very long document").digest()
b'\x98\x14\x8d\x9f\xca\xf4\xf69F\x82\xf9kF\xda\x01\x06'
>>> md5(b"This may be a very long document").digest()
b'}\xeb\xdb\xdc\r\x81D\xa0\xa1\\\x19\x8c\xccT:-'

As this example shows, the MD5 function is deterministic: if you run it twice on the same data, it outputs the same result twice. However, if you run it on two different documents (even if they differ by only one bit), it outputs very different results, seemingly random. This is exactly what we need to shuffle the instance IDs! What we can do is just run the instance IDs through MD5 (or any other hash function), and this will give us a new, seemingly random ID (but actually stable across multiple runs, since hash functions are deterministic).

Now MD5 outputs a 128 bit number, so if we want to have 20% of the instances in the test set, we can just compute the MD5 of each instance's ID, and if the result is lower than 2^128 * 20%, it goes to the test set, otherwise it goes to the training set. Unfortunately, things get slightly ugly in Python when dealing with MD5, since it actually returns a byte array (with 16 bytes), not a large integer. We could convert the whole byte array to a long 128 bit integer (using the struct module), but instead it is simpler to just pick the last byte of the hash and use it. If we want 20% of our instances to go to the test set, we can just compute the MD5 of the ID, get the last byte out of the result (which will look like a random byte, from 0 to 255), and if that byte is less than or equal to 51 (which is about 20% of 256) then the instance goes to the test set, otherwise it goes to the training set. Phew... It sounds complicated, but in practice it just means using a one-line function to decide which set each instance goes into.
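
That one-line check could look something like this sketch (the signature is just for illustration; np.int64 gives md5 a buffer-compatible input):

from hashlib import md5
import numpy as np

def test_set_check(identifier, test_ratio=0.2, hash=md5):
    # the last byte of the hash is in [0, 255]; keep the instance in the test
    # set if that byte is below 256 * test_ratio (i.e. <= 51 for a 20% split)
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio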

Lastly, what do we do if the dataset has no IDs at all? Well, we can just apply the MD5 function to each instance's input features (assuming they are unique and stable across runs). In the case of the California Housing dataset, you could use the longitude, latitude, median_house_value, population and so on. For example:

>>> doc = "\n".join([str(field)
...                  for field in (longitude, latitude, median_house_value, population)])
...
>>> h = md5(doc.encode("ascii")).digest()
>>> is_test = (h[-1] <= 51)
>>> is_test
True

I hope this helps!

@SumoBBQ
Author

SumoBBQ commented Aug 18, 2017

Thank you for a clear explanation.
Just one thing I do not fully understand. My question is this:
*Can we get exactly 20% of the instances in the dataset if we compute the MD5 hash of each instance's ID and take all the instances whose hash (last byte) is less than or equal to 51 (20% of 256)?*
I mean, it depends on the distribution of the hash function. In my opinion, if we want 20% of the dataset to go to the test set, we must know the distribution of the hash function applied to our dataset and choose the right number (maybe not 51) to split it. Please correct me if I am wrong.

@ageron
Owner

ageron commented Aug 29, 2017

Hi @SumoBBQ,

The MD5 hash function acts like a pseudo-random number generator with a uniform probability distribution (as do most hash functions, in fact). So roughly 52/256 = 20.3125% of the hashes will have a last byte strictly lower than 52 (i.e., less than or equal to 51).
You can get something closer to 20% by taking the hashes whose last byte is strictly lower than 51 (i.e., less than or equal to 50), since 51/256 = 19.921875%. You can get even closer to 20% by taking the whole hash and converting it to a 128 bit number (MD5 hashes have 128 bits), and then checking whether or not that number is smaller than 2^128 * 20%.
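
For example, a sketch of that whole-hash variant (the function name is just for illustration; int.from_bytes turns the 16-byte digest into a 128-bit integer):

from hashlib import md5
import numpy as np

def test_set_check_full_hash(identifier, test_ratio=0.2, hash=md5):
    # interpret all 16 bytes of the digest as one 128-bit integer and compare
    # it to test_ratio * 2**128, instead of looking only at the last byte
    h = int.from_bytes(hash(np.int64(identifier)).digest(), "big")
    return h < test_ratio * 2**128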

Hope this helps.

@SumoBBQ
Author

SumoBBQ commented Aug 30, 2017

Nice! Thank you for your help 👍

SumoBBQ closed this as completed Aug 30, 2017
@zahraegh

zahraegh commented Jan 19, 2018

Hi @ageron ,

Thanks for the nice explanation above! I am a beginner in the hash area, so my questions may sound strange! I apologize in advance! I just want to have a clear picture in my mind.
I was wondering why the input to the hash function has to be converted to int64 (based on the code in your book).
Also, is the output of the hash function always 128 bits? What about the situation where we have a really big dataset and the indexing goes up to a very large number? I guess that in order to keep the one-to-one mapping, the hash output should grow too. Then again, I guess in the end it doesn't matter how many bits the hash output has, because we run the 20% division on the last byte. It could be problematic if the hash output were 128 bits up to a certain index and then jumped to 256 bits or more after that. Is this possible?

Thanks a lot,

@ageron
Owner

ageron commented Jan 21, 2018

Hi @zahraegh ,
Thanks for your questions.

Here's a typical usage of the MD5 hash function:

>>> from hashlib import md5
>>> md5(b"hello")
<md5 HASH object @ 0x101988580>
>>> md5(b"hello").digest()
b']A@*\xbcK*v\xb9q\x9d\x91\x10\x17\xc5\x92'

You give it a byte array, of any length, and when you call digest() it returns a byte array of 128 bits (16 bytes), regardless of the length of the input byte array:

>>> len(md5(b"hello").digest())
16
>>> len(md5(b"hello").digest())*8
128
>>> len(md5(b"hello this is longer").digest())*8
128

The md5() function expects an input that implements the buffer API, such as a byte array, as above. But if you just give it an integer, it does not work, and it tells you why:

>>> md5(123)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object supporting the buffer API required

One workaround is to convert the integer to a string, then encode that string (e.g. using ASCII) to a byte-array. But that adds some unnecessary overhead. Instead, we can just create a NumPy int64 which is nice enough to implement the buffer API:

>>> import numpy as np
>>> md5(np.int64(123))
<md5 HASH object @ 0x101f2ba58>
>>> md5(np.int64(123)).digest()
b'\xf1\x8b\x8d\xbe\xfe\x02\xa0\xef\xce(\x1d\xebU\xa2\t\xcd'

Whether the dataset is big or small, we want to keep about 20% for the test set, and 80% for the training set. We don't need a one-to-one mapping from instance to hash: all we need is a deterministic mapping from each instance in the dataset to either the training set or the test set, so it does not matter if two instances end up with the same hash. We just need the mapping to be deterministic, so that a given instance always gets assigned to the same subset (train or test). And it should also be pseudo-random so that instances get well distributed across the training set and the test set.
If the IDs in the original dataset are not random (for example, say they are just incremental: 1, 2, 3, 4...), and if we use a method that does not randomize sufficiently, then there is a risk that the training set and the test set will not look like each other: for example, the training set may have all the old instances, while the test set may have all the new ones. This would be pretty bad. Fortunately, the MD5 hash function (and most commonly used hash functions, such as SHA-1 and so on) has both of these properties: it is deterministic (all hash functions are, by definition), and it is sufficiently pseudo-random for our use case (MD5 is not sufficiently random for some use cases, such as cryptography, but it is random enough for splitting a dataset).
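
To make that concrete, here is a tiny experiment you could run (just a sketch, not from the book; the printed fractions will vary slightly):

from hashlib import md5
import numpy as np

# Even with purely sequential IDs (0, 1, 2, ...), hashing spreads the test
# instances evenly, so "old" and "new" IDs get split the same way.
ids = np.arange(10000)
in_test = np.array([md5(np.int64(i)).digest()[-1] <= 51 for i in ids])

print(in_test.mean())         # roughly 0.20 overall
print(in_test[:5000].mean())  # roughly 0.20 among the first ("old") IDs
print(in_test[5000:].mean())  # roughly 0.20 among the last ("new") IDs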

Hope this helps.

@duydang2110

duydang2110 commented May 8, 2018

I would like to ask one question. In the function

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

where does id_ come from? Sorry, I'm new to the Pandas library.

@SureshK-T2S

id_ is just a variable name and can be called anything else. The code:

in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))

applies the function test_set_check element-wise across ids, which is a Series containing the IDs of data. I assume test_set_check returns a boolean based on whether or not the instance belongs in the test set. This Series of booleans, in_test_set, is then used to index data and return the training and test sets.

In case you don't understand how the function is applied in the above code, I'd suggest looking up lambda functions. Basically,

lambda x: x+1

is equivalent to:

def func(x):
    return x+1
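
So, inside split_train_test_by_id, the lambda could also be replaced by a named inner function if that reads more clearly. A quick sketch (check_one_id is just an arbitrary name; it assumes the book's test_set_check is defined):

import hashlib

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]

    def check_one_id(id_):
        # exactly what the lambda did: does this ID's hash fall in the test bucket?
        return test_set_check(id_, test_ratio, hash)

    in_test_set = ids.apply(check_one_id)
    return data.loc[~in_test_set], data.loc[in_test_set]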

@duydang2110

Thank you, @skp104.

@Riadh123-star

Thanks for the clear explanation.
Can you tell me why you use the last byte of the hash to determine the percentage of the test set, even though a full 128-bit number is generated for each ID? Does the last byte of the hash have special characteristics?

@ageron
Owner

ageron commented Sep 6, 2019

Hi @Riadh123-star ,
Thanks for your question. We could use any byte; I just chose the last byte out of habit, for no particular reason (the habit comes from the fact that the last part of a number is usually the one that varies the most, but in the case of hashes, all bytes generally vary equally).
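
For example, a quick check (just a sketch) shows that the first byte gives about the same 20% split as the last byte:

from hashlib import md5
import numpy as np

digests = [md5(np.int64(i)).digest() for i in range(10000)]
print(sum(d[0] <= 51 for d in digests) / len(digests))   # roughly 0.20 using the first byte
print(sum(d[-1] <= 51 for d in digests) / len(digests))  # roughly 0.20 using the last byte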
Hope this helps

@Mechazo11

Dear Mr. Ageron,
Good afternoon. I am a grad student at UL Lafayette, writing to you from Lafayette, Louisiana. I am using the third edition of your Hands-On ML book to learn machine learning. Just like my peers in the comment thread above, I am struggling to understand how the split_train_test function from p. 52 (Chapter 2) works. Would you kindly give an easy-to-follow example of this hashing method, and some suggestions on how to construct this function without using lambda functions?
Thanks in advance.

@minertom

minertom commented Jan 5, 2021

[quotes @ageron's full explanation from Aug 17, 2017, above]

I have one simple question. During hashing, only the last byte of the hash is used to test whether the instance in question belongs to the test set. Yes, the whole hash is a unique value (unless a collision happens), but only the last byte (0-255) determines membership in the test set. So, are you saying that because the hashing algorithm provides a "uniform distribution", roughly 20% of the hashes will have a last byte less than or equal to 51 (about 20% of 256)?

Thank You
Tom

BTW, I purchased your book. Love it so far.
