I Have Been Pwned With Bloom Filter
This little hack aims at letting every program that deals with passwords benefit of a lookup in Troy Hunt's pawned passwords list.
Troy Hunt's I Have Been Pwned database : https://haveibeenpwned.com/Passwords
This list is 22 GB large, and this size can be a challenge for some programs. Using a Bloom filter, we can trade some GB of database for a little probability of false lookup.
The database is still about 9-10 GB in a raw binary format. We propose a way to reduce this amount to 1 GB. 1GB can then be stored in RAM for a server that do this kind of lookup frequently.
For an explanation of what a Bloom filter is : https://en.wikipedia.org/wiki/Bloom_filter
Language and dependencies
The project has two phases :
- generation of the Bloom filter : only in Python, only to be done once (like a compilation)
- lookup in the filter. Two small library modules provided, in Python and in C.
Using Idle 3.5 on Windows and executed on Windows Python 3.5.4 and Debian 9 Python 3.5.3. Using few Python dependencies, so it should work at yours with little or no effort. All the project is contained in the same directory. But for runtime, you only need the module and the binary 1GB Bloom filter. C module compiled on Debian 9. Only dependancy is OpenSSL. You must have OpenSSL devel package to compile. No other dependancy (so it should be easy to compile everywere).
Note on files and disk space usage :
- pwned-passwords-ordered-by-hash.txt 22 GB is not included in the present repo, you have to find it using download instructions on IHBP site
- pwned-passwords-ordered-by-hash.bin 10 GB is generated by the generation script, but not really useful, unless I (or you) write a direct binary lookup. for future use. Can be deleted after filter computation.
- pwned-passwords-bf.bin 1 GB is the generated lookup filter that is used at runtime.
Hashes of files
- sha-1("pwned-passwords-ordered-by-hash.7z") = 10c001292d52a04dc0fb58a7fb7dd0b6ea7f7212 (obtained from download page)
- sha256("pwned-passwords-ordered-by-hash.txt") = 0d5ec491d37ed6c98bd2a8142dc7ce3c97531600f75cdb87507edfa547205cfb
- sha256("pwned-passwords-bf.bin") = c6d3848ece93413771e8395eff167590c0bdba4b19945dbc33534d307f1ff73d
You should check that you have a trustworthy source. As you can expect to have the same input file as me, you should obtain the same result for the generated file.
Format of binary files
(10 GB .bin generated file, not used yet) 20-bytes SHA1 one after another. Filesize will be divisible by 20. No header of field separator needed.
Bloom filter design
Read Wiki's article first.
Bloom filter parameters
- We use this online Bloom filter calculator : https://hur.st/bloomfilter/?n=517000000&p=&m=1GiB&k=
- Number of elements to index : 517 000 000
- Filter size : GiB = 1024^3 bytes = 8 * 1024^3 bits
- Number of hashes : 12
- Expected false positive probability : 1/2918
We use a sha512, which provide sufficient bits for the sub-hashes. It is subdivised in 12 blocks (number of hashes) of 33 bits (log2 of filter size) In term of Python programming, the sha512 bytearray is casted to a large integer which is then cut in 12 chunks of 33 bits each.
Warning : Troy Hunt's list is composed of SHA1, and we compute SHA512 of theses SHA1. Don't be confused.
We choose SHA512 because we need 12*33 = 396 bits. Additional available bits are not used (perhaps after a following update of list).
Why not use directly SHA1 as it's in the original list ? Because we need more than 160 bits in our filter design.
Generating the Bloom filter : ihbpwbf-gen.py
Input file : pwned-passwords-ordered-by-hash.txt
Output files :
pwned-passwords-ordered-by-hash.bin : same as the original, but in binary format to spare disk space (not used yet)
pwned-passwords-bf.bin : the filter itself, exactly 2^33 bits
Output of program :
ber@serveur:/home2/ber/ihbpwbf$ python3.5 ihbpwbf-gen.py
Testing 'in filter' item. Number of items=517297
Every item is in filter. The filter seems consistent
Testing 'not in filter' item. Number of items=1000000
False positive aca08ef115b7537cc731f8a538f32bd31586b795 is reported in filter, but is probably not
[...many other false positive...]
False positive a2aaffebd961894cb9d2d4a5b3892359eca074ba is reported in filter, but is probably not
false positive ratio 0.000341
ber@serveur:/home2/ber/ihbpwbf$ ls -lah *.bin
-rw-r--r-- 1 ber ber 1,0G déc. 22 23:59 pwned-passwords-bf.bin
-rw-r--r-- 1 ber ber 9,7G déc. 22 23:58 pwned-passwords-ordered-by-hash.bin
Targeted false positive rate = 1/2918 = 0.000342
Achieved false positive rate = 0.000341
Duration of the computation : I didn't know, I slept at the end. It's probably 2-3 hours on PC from 2012. It will display progression for each 100000 hashes processed
Testing the Bloom filter
As you can see in the output, the program perform some self-test on the resulting filter.
About 1 of 1000 SHA1 records are kept in a testing list. After the filter computation, this list is used to check if every item is really in the filter. A negative result would be abnormal.
not in filter checking
We generate 1000000 random 20 bytes strings. The probability to choose a string that exists in IHBP list is insignificant (1000000 * 500.10^6 / 2^160 = 342.10^-36)
Using theses random string, we can evaluate the false positive rate and compare it to the projected value (see above).
Using the filter in Python : pyihbpwbf.py
This is the main lookup module, that you can import in your programs. Exposed API :
pyihbpwbf.setFilterFileName(filename) Set the binary 1GB filter file path.
pyihbpwbf.loadFilter() Load the 1GB filter. You do not need to call this fuction because it's already done at the first need.
pyihbpwbf.unloadFilter() Release ressources allocated for filter load
pyihbpwbf.checkSHA1(item) Check a SHA1 password (eg : you have to compute SHA1 yourself). 'item' given as a 20-bytes bytearray. Result is False or True. In case 'item' is not 20 bytes long, throw an 'ihbpwbf.invalidHash' exception.
pyihbpwbf.checkPassword(pwd) Check a password (eg : the module will compute SHA1). Result is False or True.
(note : in previous versions, I ommited the 'py' prefix, and I encountered conflict with the .so library in C with the same name.)
Memory management : the filter is loaded using mmap(), to allow for sharing memory by the OS, when multiple instances of the program access it (tested : when using the Python test program and the C test program, they both share the same mapped file. Use cat /proc/xxx/mem and fuser).
Python module demo : ihbpwbf-test-mod.py
Used to test the ihbpwbf.py module. Provide a simple password lookup application.
ber@serveur:/home2/ber/ihbpwbf$ python3.5 ihbpwbf-test-mod.py
Enter a password to test (or nothing to exit program) : p@ssw0rd
Warning : probably compromised !
Enter a password to test (or nothing to exit program) : kldjsklfhsjkfgnejkl
Good : This password is not in the filter
Using the filter in C/C++ : ihbpwbf.c / ihbpwbf.h / ihbpwbf-test.c
The API is the same as the python module. See .h file for precise description. I provide a short Makefile that will compile the library as .so and .a file, but it does not do the installation.
The ihbpwbf-test program provided works as a 'cat' filter. You pipe a password list into stdin, and it will pass every password that is probably in the filter.
Not tested with C++, but it should work. Tell me in case of problem.
Bertrand Maujean https://github.com/bertrand-maujean
This project is licensed under GPLv3. See LICENSE.TXT
Many thanks to Troy Hunt who gave us a unique tool for raising up awareness https://www.troyhunt.com/