# Rockyou password database collision test

In this small project, we'll check whether any two distinct passwords in the well-known RockYou leaked password list produce the same hash value, that is, whether there are any collisions.

In [1]:
%matplotlib inline

import hashlib
import pandas as pd
import os
from tqdm import tqdm

def file_len(fname):
    # Count the lines in a text file (0 for an empty file)
    with open(fname) as f:
        return sum(1 for _ in f)

If the password list is not already on disk, it will be
1. downloaded from a repo on GitHub
2. extracted from the tar archive
3. decoded as UTF-8, dropping any invalid bytes
4. deduplicated, so that each password appears only once

In [2]:
if not os.path.exists('rockyou.txt'):
    if not os.path.exists('rockyou.txt.tar.gz'):
        import urllib.request
        url = 'https://github.com/danielmiessler/SecLists/raw/master/Passwords/Leaked-Databases/rockyou.txt.tar.gz'
        print('Downloading passwords...')
        urllib.request.urlretrieve(url, './rockyou.txt.tar.gz')
        print('Done')

    import tarfile
    print('Decompressing data...')
    with tarfile.open("rockyou.txt.tar.gz") as tar:
        tar.extractall()

    import codecs
    print('Fixing UTF-8 encoding issues before starting...')
    # Decode as UTF-8, dropping invalid bytes; the set removes duplicate lines
    with codecs.open('rockyou.txt', encoding='utf-8', errors='ignore') as f:
        data = set(f)

    print('Overwriting file with UTF-8 compliant content...')
    # Closing the file guarantees the content is flushed to disk
    with open('rockyou.txt', 'w') as target:
        target.write(''.join(data))
    print('Done')

num_passwords = file_len('rockyou.txt')
print(num_passwords, 'passwords to process!')

14344328 passwords to process!


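We'll only test MD5 and SHA-1, since practical collision attacks have been demonstrated against both of these functions. Those attacks rely on specially crafted inputs, though, so a collision among ordinary passwords would still be a surprise.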
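To put a rough number on what to expect, the sketch below estimates the chance of an accidental collision with a birthday-style union bound. It assumes the digests behave like uniformly random values, which is an idealization, and the helper name `collision_probability_bound` is made up for this illustration. The password count is the one printed by the previous cell.

```python
import hashlib

def collision_probability_bound(n_items, digest_bits):
    """Upper bound on the chance that any two of n_items random digests match.

    Union bound over all pairs: there are n*(n-1)/2 pairs, and each pair
    collides with probability 2**-digest_bits (assuming uniform digests).
    """
    pairs = n_items * (n_items - 1) // 2
    return pairs / 2 ** digest_bits

n = 14344328  # number of unique passwords reported above

for name, func in [('MD5', hashlib.md5), ('SHA1', hashlib.sha1)]:
    bits = func().digest_size * 8  # 128 bits for MD5, 160 bits for SHA-1
    print(name, bits, 'bits, bound:', collision_probability_bound(n, bits))
```

Under that assumption both bounds come out many orders of magnitude below one, so a collision found here would point to something other than chance.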

In [3]:
hashes = pd.DataFrame()

# Which algorithms we are going to test
experiments = [('MD5', hashlib.md5), ('SHA1', hashlib.sha1)]

Let's run our experiments.

In [4]:
# For each algorithm...
for algo_name, algo_function in experiments:
    # Skip algorithms whose hashes are already stored in the DataFrame
    if algo_name in hashes.columns:
        continue

    hashes_list = []
    with open("rockyou.txt", "r") as ins:
        for password in tqdm(ins, desc=algo_name, total=num_passwords):
            # Hash the line exactly as read, including any trailing newline
            m = algo_function()
            m.update(password.encode('utf-8'))
            hashes_list.append(m.hexdigest())

    # Store this algorithm's hashes as a new column
    hashes[algo_name] = hashes_list

MD5: 100%|██████████| 14344328/14344328 [00:21<00:00, 667914.05it/s]
SHA1: 100%|██████████| 14344328/14344328 [00:21<00:00, 661789.07it/s]
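Storing every hex digest in a DataFrame is convenient, but the collision test itself can also be done while streaming. As an alternative sketch (not the approach used in this notebook), the function below remembers the first input seen for each raw digest and stops as soon as a different input maps to the same digest. The name `first_collision`, the `digest_bytes` truncation parameter and the toy inputs are all assumptions made for illustration.

```python
import hashlib

def first_collision(items, algo_function, digest_bytes=None):
    """Return the first pair of distinct strings sharing a digest, or None.

    digest_bytes optionally truncates the digest; it exists only so that
    a collision can be demonstrated on toy data.
    """
    seen = {}  # raw digest -> first input that produced it
    for item in items:
        digest = algo_function(item.encode('utf-8')).digest()[:digest_bytes]
        previous = seen.setdefault(digest, item)
        if previous != item:
            return previous, item
    return None

# Toy data standing in for the password file; the repeated entry is a
# duplicate input, not a collision, so nothing is reported.
sample = ['123456', 'password', 'iloveyou', '123456']
print(first_collision(sample, hashlib.md5))

# Truncating MD5 to one byte leaves only 256 possible digests, so 300
# distinct inputs are guaranteed to collide (pigeonhole principle).
print(first_collision((str(i) for i in range(300)), hashlib.md5, digest_bytes=1))
```

Keeping the first input for each digest costs extra memory, but it lets the function tell a genuine collision apart from a repeated password.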


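pandas will help us find non-unique values. In the summary below, the unique row matches the count row for both algorithms. Since the passwords themselves were deduplicated earlier, this means every password has a distinct hash, so there are no collisions. That concludes our project.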

In [5]:
hashes.describe()

| | MD5 | SHA1 |
|---|---|---|
| count | 14344328 | 14344328 |
| unique | 14344328 | 14344328 |
| top | dc335875b411a43e28c7f4310c396c90 | d63f4a576e3915b376ff7e3f8e60145f318074c3 |
| freq | 1 | 1 |
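The summary only reports counts. If the unique count were ever lower than the total, the rows involved could be listed with the `duplicated` method of a pandas Series. The sketch below uses a small made-up DataFrame with a column called `H` holding fake values (not real hash output), so that a duplicate is actually present.

```python
import pandas as pd

# Made-up stand-in for the hashes DataFrame; 'H' holds fake digests
toy = pd.DataFrame({'H': ['aa', 'bb', 'aa', 'cc']})

# True only when every value in the column is distinct
print(toy['H'].is_unique)

# keep=False marks every member of a duplicated group, not just the repeats
colliding_rows = toy[toy['H'].duplicated(keep=False)]
print(colliding_rows)
```

On the real data the same two calls could be applied to the MD5 and SHA1 columns, and the row labels returned would identify which entries to inspect.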
