# Hashing Data in Python
## with the libraries hashlib, secrets, and hmac

Hashing is a way to retain the analytical value of data while masking personally identifiable information (PII).

A hashed value is one that has been replaced with a fixed-length string of characters that reveals nothing obvious about the original value.

In many cases, you will want multiple instances of the same value, e.g., an email address, to have the same hashed value. This allows you to identify unique individuals and match their data without knowing the true values for these variables.

Hashing requires a hashing algorithm and, for stronger protection, a `salt` value. A cryptographic salt is made up of random bits added to the input of a hash function before hashing (in password storage, to each password instance), so that the output, `the hash`, cannot be reproduced from the input value alone. This further masks the PII values.

- Hashing with the same salt will always return the same hash output for the same input.
- Don't use the same salt for every distinct value in the table. Give each distinct value its own salt, and reuse that salt whenever the same value appears again so that matching still works.

Protect the salt: anyone who has it can hash guessed inputs and compare the results against your hashed data to recover the original values.

Hashing is one-way; you typically don't unhash. You store the restricted-use data on a secure server and distribute the hashed data to a less secure server or computer for analysis.
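
The behaviour described above can be checked with a few lines of standard-library Python. This is a minimal sketch, not the method used in the rest of this notebook: the helper name `salted_sha256` and the example salt strings are made up for illustration.

```python
import hashlib

def salted_sha256(value: str, salt: str) -> str:
    """Return the SHA-256 hex digest of salt + value (illustrative helper)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

email = "albert@gmail.com"

# Same input + same salt -> same hash, so records can still be matched.
same_a = salted_sha256(email, "example-salt-1")
same_b = salted_sha256(email, "example-salt-1")
print(same_a == same_b)   # True

# Same input + different salt -> different hash.
other = salted_sha256(email, "example-salt-2")
print(same_a == other)    # False
```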


In [1]:
#==============================================================================#
# ==> Imports
# APPENDIX : HMAC PYTHON EXAMPLE
#==============================================================================#
import hashlib 
import hmac 
import math
import secrets # standard-library module (Python 3.6+) for generating random tokens
import sys

In [2]:
#==============================================================================#
# ==> Generate salt
#==============================================================================#
# declare variables 
token_hex = None 
token_int = None
token_bit_count = -1
token_byte_count = -1 
token_bytes= None 
salt_hash = None
salt = None

# Create random hexadecimal token of default size (32 bytes, 64 hex digits) 
token_hex = secrets.token_hex()

# View the token and its length
print( "token: " + token_hex + "; length = " + str( len( token_hex ) ) )

token: 5d5bb84ec158e38cfa77286640c799ce560660172880f5a679678e29b6ee4dc6; length = 64


In [3]:
# convert the token to an integer 
##
## NOTE - PLF cannot find this reference
## Tyler Springer provides a good overview of this process in AWS using Open Source tools (Springer,
##2016): http://tylerspringer.com/destroying-sensitive-information-stored-in-aws/
##

#token_int = int(token_hex, 1 )

token_int = int(token_hex, 16) # hex is base 16
print( "Default token int: " + str( token_int ) )


Default token int: 42227150045704095646362307474182794411380520596895119484424968588593411804614


In [4]:
# get bit count
token_bit_count = token_int.bit_length()
print( "token bit count = " + str( token_bit_count ) )

# get byte count
token_byte_count = token_bit_count / 8 
token_byte_count = math.ceil( token_byte_count ) 
token_byte_count = int(token_byte_count ) 
print( "token byte count = " + str(token_byte_count ) )

# convert the integer value to bytes
token_bytes = token_int.to_bytes( token_byte_count, byteorder = sys.byteorder ) 
print("token bytes = " + str( token_bytes ) )

# Use hashlib to create a salt value from the token
salt_hash = hashlib.sha256( token_bytes ) 
salt = salt_hash.hexdigest() 
print( "salt: " + str( salt ) + "; type = " + str(type( salt ) ) )

token bit count = 255
token byte count = 32
token bytes = b'\xc6M\xee\xb6)\x8egy\xa6\xf5\x80(\x17`\x06V\xce\x99\xc7@f(w\xfa\x8c\xe3X\xc1N\xb8[]'
salt: b97e5f579a08610a0378ece1a02a2974f5495736953a27cbb9db96869bb31a28; type = <class 'str'>
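
The cells above convert the hex token to an integer, then to bytes, and then hash it. Because `secrets.token_hex()` already returns cryptographically strong random data, a shorter route may be enough in many cases. The sketch below is an alternative, not the approach used in this notebook; `simple_salt` and `raw_bytes` are illustrative names.

```python
import secrets

# 32 random bytes rendered as 64 hex characters; usable directly as a salt string.
simple_salt = secrets.token_hex(32)
print(len(simple_salt))  # 64

# If raw bytes are needed, bytes.fromhex avoids the integer round trip,
# which can drop a leading zero byte when bit_length() is used to size the output.
raw_bytes = bytes.fromhex(simple_salt)
print(len(raw_bytes))    # 32
```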


In [5]:
#==============================================================================# 
# ==> Hash using salt and HMAC
#==============================================================================#
# declare variables 
salt_IN = None
message_IN = None 
sha256_instance = None 
encoded_passphrase = None
passphrase_hash = None
encoded_message = None
hmac_key_hash_instance = None
hmac_key = None

# set salt (from above) and message
salt_IN = salt
message_IN = "donkey"

# ==> derive the HMAC key from the salt
# get hasher 
sha256_instance = hashlib.sha256()

# update it with the encoded salt
encoded_passphrase = salt_IN.encode( "utf-8" ) 
sha256_instance.update(encoded_passphrase )

# get digest 
passphrase_hash = sha256_instance.digest() 

# store as key
hmac_key_hash_instance = sha256_instance 
hmac_key = passphrase_hash

# ==> hash message with HMAC and salt key
# encode 
encoded_message = message_IN.encode( "utf8" )

# make hmac_instance with key, message, and hash function.
hmac_instance = hmac.new( hmac_key, encoded_message, digestmod = hashlib.sha256 )

# perform hash. 
value_OUT = hmac_instance.hexdigest()

In [6]:
value_OUT

'04f1157e0a6af1ed5d7b18647723dd54996810b53475fe365d03f84cd31ef8ac'
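
The cell above derives a key by hashing the salt with SHA-256 and then uses that key in an HMAC over the message. The same computation can be written more compactly. This is a sketch assuming Python 3.7 or later (for `hmac.digest`); `keyed_hash` and the example salt string are hypothetical and not part of the original code.

```python
import hashlib
import hmac

def keyed_hash(message: str, salt: str) -> str:
    """HMAC-SHA256 of message, keyed with SHA-256(salt), as a hex string."""
    key = hashlib.sha256(salt.encode("utf-8")).digest()
    return hmac.digest(key, message.encode("utf-8"), "sha256").hex()

first = keyed_hash("donkey", "example-salt")
second = keyed_hash("donkey", "example-salt")

# compare_digest is a constant-time comparison intended for digests.
print(hmac.compare_digest(first, second))  # True
```

Using `hmac.compare_digest` rather than `==` is a common habit when comparing secret-derived values, although for offline data linkage a plain equality check is usually sufficient.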

### Create a function that will generate a random salt string value.

In [7]:
def get_random_salt_str():
    '''
    Generate salt using secrets library
    '''

    # declare variables 
    token_hex = None 
    token_int = None
    token_bit_count = -1
    token_byte_count = -1 
    token_bytes= None 
    salt_hash = None
    salt = None

    # Create random hexadecimal token of default size (32 bytes, 64 hex digits) 
    token_hex = secrets.token_hex()

    # View the token and its length
    #print( "token: " + token_hex + "; length = " + str( len( token_hex ) ) )
    
    # convert the token to an integer 
    #token_int = int(token_hex, 1 )

    token_int = int(token_hex, 16) # hex is base 16
    #print( "Default token int: " + str( token_int ) )
    
    # get bit count
    token_bit_count = token_int.bit_length()
    #print( "token bit count = " + str( token_bit_count ) )

    # get byte count
    token_byte_count = token_bit_count / 8 
    token_byte_count = math.ceil( token_byte_count ) 
    token_byte_count = int(token_byte_count ) 
    #print( "token byte count = " + str(token_byte_count ) )

    # convert the integer value to bytes
    token_bytes = token_int.to_bytes( token_byte_count, byteorder = sys.byteorder ) 
    #print("token bytes = " + str( token_bytes ) )

    # Use hashlib to create a salt value from the token
    salt_hash = hashlib.sha256( token_bytes ) 
    salt = salt_hash.hexdigest() 
    #print( "salt: " + str( salt ) + "; type = " + str(type( salt ) ) )
    return salt
    

In [8]:
# This function should always return a new random string
get_random_salt_str()

'fddff5268af7b532acba782f07987a8098c644e6b8ca68e21528c3f41513a470'

In [9]:
# Check
for i in range(1,10):
    print(str(i), get_random_salt_str())

1 de9a9591831b5aa78cd8618d6cfe7cdfe3671db434338ea16bb40a0926ab8882
2 0e96a6cf9d4898af71db11ad80925b8d93a644143389907d50218fe9c8e3b67a
3 ba4726bd3c8d993698231dd3e0938090052c83fe62d5e4114600f6241e0dc2a1
4 21c3633b8a995a878b6361afde8c696a59e6355bb5cc963800bf5d7edb057960
5 5d8d5391df630b4b8b65b307ee3d67e621802005ccc7f26298c2c0a7e9b39a96
6 7c508410f31a5aa7878ad3806584b021b3aeaf38f7fb78a0bd1370cacb423f00
7 59cf2096fcf0125df655ab486780a3c05d95727cbea8ba2be77a0d8fbb9a8b6a
8 8d4b01857221d640dd68046149e9a0a1ec3f76f80e2dc6792a08e7e358f261d7
9 25d9e17467b9dd25011cad291861f6eb325e05de711e9bc22a8c0af3a804aed2


In [10]:
# Check - if we run again do we get the same values
for i in range(1,10):
    print(str(i), get_random_salt_str())

1 b6f0fdbba2fa2915a63981d074c3638df443a54c2c1149980d0a7339079de441
2 75290f42017279cedcf3ed5b37c41ebde2f0bb23a22f5d3a20eabe9c93893978
3 d18a09759f0b5113a7cd67e7afd44428f7644df723dbdf30b87197ef3c0a4434
4 b6b66b989078fd62f5ea74ace461323089af8b056b33b425f6c4941a406a5109
5 7a875afdb8c6e27f02b2d74a14e02f1bc780c5e9dbc6bde7bbbb16ce192f1781
6 b9ab63abe2fb1e2126c38ddba2b3c5930ed001bc6093a3fb2f2df29c9d7eccc4
7 d09891320d1780077b1d2f27eb2e6b12b913f5782d9e62d981ad4ca6f375b9dc
8 ebe739c2f23a6d6787dcab9d5a58e00b40a35534a1540b9d0eee7aab9378f94a
9 5bc40cd26d1d8c6ab32637a4144841e041b702f0ddb24a4199218de7b6e62cc0


### Create a function to return a hashed value given a specified salt value

In [11]:
def get_hashed_value(value_to_hash, salt):

    '''Hash using a salt and HMAC'''
    
    # set salt and message
    salt_IN = salt
    message_IN = value_to_hash
    
    # declare variables 
    sha256_instance = None 
    encoded_passphrase = None
    passphrase_hash = None
    encoded_message = None
    hmac_key_hash_instance = None
    hmac_key = None

    ######################################
    # ==> derive the HMAC key from the salt
    ######################################
    # get hasher 
    sha256_instance = hashlib.sha256()

    # update it with the encoded salt
    encoded_passphrase = salt_IN.encode( "utf-8" ) 
    sha256_instance.update(encoded_passphrase )

    # get digest 
    passphrase_hash = sha256_instance.digest() 

    # store as key
    hmac_key_hash_instance = sha256_instance 
    hmac_key = passphrase_hash

    ##########################################
    # ==> hash message with HMAC and salt key
    ###########################################
    # encode 
    encoded_message = message_IN.encode( "utf8" )

    # make hmac_instance with key, message, and hash function.
    hmac_instance = hmac.new( hmac_key, encoded_message, digestmod = hashlib.sha256 )

    # perform hash. 
    value_OUT = hmac_instance.hexdigest()
    
    return value_OUT

In [12]:
# The same value hashed with the same salt will return the same hashed value
# Different values hashed with the same salt will return different hashed values
print("donkey, hashed:", get_hashed_value('donkey', 'salty_salt_string'))
print("donkey, hashed:", get_hashed_value('donkey', 'salty_salt_string'))
print("peanut, hashed:", get_hashed_value('peanut', 'salty_salt_string'))
print("peanut, hashed:", get_hashed_value('peanut', 'salty_salt_string'))

donkey, hashed: 2f826e27bee3b91acb9164a8eb854ccaba8d179ff480b64d4778ac84f7bf2a15
donkey, hashed: 2f826e27bee3b91acb9164a8eb854ccaba8d179ff480b64d4778ac84f7bf2a15
peanut, hashed: c075913a175718c2d3d9a9d2015a2b1aa36a40cec4d47f6d3e92f666737239dc
peanut, hashed: c075913a175718c2d3d9a9d2015a2b1aa36a40cec4d47f6d3e92f666737239dc


In [13]:
# The same value hashed with a different salt will return a different hashed value
print("donkey, hashed:", get_hashed_value('donkey', get_random_salt_str()))
print("donkey, hashed:", get_hashed_value('donkey', get_random_salt_str()))
print("peanut, hashed:", get_hashed_value('peanut', get_random_salt_str()))
print("peanut, hashed:", get_hashed_value('peanut', get_random_salt_str()))

donkey, hashed: 41aa6e06d6f5640ce702a2cc2f7a6279d97a0f40745d6fa08b7fad8509e59a2f
donkey, hashed: 84bb701fd3cfcca12eeb3b8903d39223d015609861347908e19f6568e84ab001
peanut, hashed: 168e41d85f5149887c008ce70aec9bad0c813ac29f3c124afe6a38bb6071727a
peanut, hashed: 9367d38ae27b1f23390e95d690d62390dabc807e225e36113444d51f7b41d147


# Example: Applying Keyed Hashing to a pandas dataframe


In [14]:
import pandas as pd

# Create a test dataframe
data = {'email':['albert@gmail.com', 'bree@hotmail.com', 'titan@live.com', 'epic@yahoo.com','albert@gmail.com'],
        'gender':['male', 'male', 'male', 'male','them'],
        'age':[20, 21, 19, 18, 66],
        'income':[25000, 34000, 31000, 10000, 50000],
        'country':['Germany', 'Germany', 'Germany', 'Sweden','USA'],
        'userid':['101', '666', '3344', '1212','101'],
        } 
df = pd.DataFrame(data)

# Print out the dataframe
print("INPUT DF")
print(df)

INPUT DF
              email gender  age  income  country userid
0  albert@gmail.com   male   20   25000  Germany    101
1  bree@hotmail.com   male   21   34000  Germany    666
2    titan@live.com   male   19   31000  Germany   3344
3    epic@yahoo.com   male   18   10000   Sweden   1212
4  albert@gmail.com   them   66   50000      USA    101


In [15]:
def apply_hashing(the_val, the_colname):
    global key_df
  
    if the_val not in key_df[key_df.hashed_col==the_colname]['hashed_IN'].values:
        # If we haven't already hashed this value, hash it

        # First create a random salt just for this value
        salt = get_random_salt_str()

        # Now do the salted hashing
        hashed_val = get_hashed_value(the_val, salt)

        # save a lookup table of the secret bits
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        new_row = pd.DataFrame([{'hashed_col': the_colname, 'hashed_IN' : the_val, 'hashed_val' : hashed_val, 'salt' : salt}])
        key_df = pd.concat([key_df, new_row], ignore_index = True)

        # return the hashed value
        return hashed_val
    else:
        # else just return the previously hashed value
        print("\n", the_val, 'HASHED ALREADY - CHECK OUTPUT HASHED DF to make sure values are the same')
        
        return key_df[((key_df.hashed_col==the_colname) & (key_df['hashed_IN'] == the_val))]['hashed_val'].squeeze()


##
## IMPLEMENT THE HASHING
##

# Create DF to hold our secret lookup table
# YOU WOULD never want to store this on the same server as the hashed data because that defeats the purpose!
# This lookup table is also not necessary if you have the unhashed data.
key_df = pd.DataFrame(columns = ['hashed_col', 'hashed_IN', 'hashed_val', 'salt']) 

# HASH EMAIL
the_colname = 'email'
df.email = df.email.apply(lambda x: apply_hashing(x, the_colname))

# HASH Userid
the_colname = 'userid'
df.userid = df.userid.apply(lambda x: apply_hashing(x, the_colname))

#=================================
print("\n\nHASHED DF")
print(df)

#=================================
print("\n\nSECRET SAUCE")
key_df


 albert@gmail.com HASHED ALREADY - CHECK OUTPUT HASHED DF to make sure values are the same

 101 HASHED ALREADY - CHECK OUTPUT HASHED DF to make sure values are the same


HASHED DF
                                               email gender  age  income  \
0  d02c4feb89fab9a428c16203489519e00317e4fcd3e85d...   male   20   25000   
1  5fecfef10c899850aeb2bb9c30d2de3a92c88980485c45...   male   21   34000   
2  87af91b179f43ce8f54ccca0c64cc435efb05dbea8f5db...   male   19   31000   
3  9920fd51bf1d8c0f30f73e5dee53ce17c99eeaddd4991d...   male   18   10000   
4  d02c4feb89fab9a428c16203489519e00317e4fcd3e85d...   them   66   50000   

   country                                             userid  
0  Germany  1b7d4dde3ee311d5d1841a5f82b2515af6f7300ff1f472...  
1  Germany  9a2aa3e0049ca6de079a2b25e14bbb9f01cf68c00a8fab...  
2  Germany  c1925ed7961a3061724c2435ca09508e6f140d3aa6150a...  
3   Sweden  52b4dbee53e28d5732b37a029b9a20391352a1ea69fe5d...  
4      USA  1b7d4dde3ee311d5d1841a5f82b2

Unnamed: 0,hashed_col,hashed_IN,hashed_val,salt
0,email,albert@gmail.com,d02c4feb89fab9a428c16203489519e00317e4fcd3e85d...,fb958ec61e72e2090f9aa67174cc10edcb0128b045d407...
1,email,bree@hotmail.com,5fecfef10c899850aeb2bb9c30d2de3a92c88980485c45...,477533867f1ab38888d85f63a15d7ced893d09abc81cf5...
2,email,titan@live.com,87af91b179f43ce8f54ccca0c64cc435efb05dbea8f5db...,81e1ae6e4205466996335560162a39fe377559562a9319...
3,email,epic@yahoo.com,9920fd51bf1d8c0f30f73e5dee53ce17c99eeaddd4991d...,037fc1fc86aeba84fe52b34ad162efe4e83e18cf01dec3...
4,userid,101,1b7d4dde3ee311d5d1841a5f82b2515af6f7300ff1f472...,3e5bf11a61a5d9e33073c13d1500c15c097f66f217e1ee...
5,userid,666,9a2aa3e0049ca6de079a2b25e14bbb9f01cf68c00a8fab...,44e7b27b5671483559489ac1ceee0f722215046efc4eb2...
6,userid,3344,c1925ed7961a3061724c2435ca09508e6f140d3aa6150a...,2686b71a70ec7fea431eccfb556f08333c0890987755ad...
7,userid,1212,52b4dbee53e28d5732b37a029b9a20391352a1ea69fe5d...,aa31357c3570ae2702b38a4a7e92f9746d0782aa122c02...
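
The lookup logic above can also be expressed without pandas. The following sketch keeps one salt and one hash per (column, value) pair in a plain dictionary, so a repeated value such as `albert@gmail.com` always maps to the same hash. It is an illustration under assumptions, not the notebook's implementation: `hash_with_lookup` and `lookup` are hypothetical names, and the HMAC is keyed directly on the salt string rather than on a SHA-256 digest of it.

```python
import hashlib
import hmac
import secrets

# (column name, original value) -> (salt, hashed value); keep this secret.
lookup = {}

def hash_with_lookup(value, colname):
    """Return a stable salted hash for value, creating a new salt on first use."""
    key = (colname, value)
    if key not in lookup:
        salt = secrets.token_hex()  # fresh random salt for this value
        digest = hmac.new(salt.encode("utf-8"), str(value).encode("utf-8"),
                          hashlib.sha256).hexdigest()
        lookup[key] = (salt, digest)
    return lookup[key][1]

emails = ["albert@gmail.com", "bree@hotmail.com", "albert@gmail.com"]
hashed = [hash_with_lookup(e, "email") for e in emails]

print(hashed[0] == hashed[2])  # True: repeated value, same hash
print(hashed[0] == hashed[1])  # False: different values
print(len(lookup))             # 2
```

A dictionary lookup is constant-time per value, whereas filtering a DataFrame on every call grows slower as the lookup table fills up.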


# References

https://auth0.com/blog/adding-salt-to-hashing-a-better-way-to-store-passwords/
    

## Let's do it again, but a bit more simply

In [16]:
# SIMPLE HASH EXAMPLE
# Mostly from: https://korniichuk.medium.com/gdpr-guide-2-7c399b44ba3

#import the libraries we will use
import hashlib
import secrets

In [17]:
# Create a test dataframe
data = {'email':['albert@gmail.com', 'bree@hotmail.com', 'titan@live.com', 'epic@yahoo.com','albert@gmail.com'],
        'gender':['male', 'female', 'male', 'female','male'],
        'age':[20, 21, 19, 18,20],
        'income':[25000, 34000, 31000, 10000,15000],
        'country':['Germany', 'Germany', 'Germany', 'Sweden','Germany']} 
df = pd.DataFrame(data)

# Display our test dataframe
print("INPUT DF")
print(df)

INPUT DF
              email  gender  age  income  country
0  albert@gmail.com    male   20   25000  Germany
1  bree@hotmail.com  female   21   34000  Germany
2    titan@live.com    male   19   31000  Germany
3    epic@yahoo.com  female   18   10000   Sweden
4  albert@gmail.com    male   20   15000  Germany


In [18]:
def get_hashed_value(the_val, the_colname):
    
    '''A function that will hash a given value'''
    
    global key_df  # Assumes we maintain a lookup table of already hashed values
    
    if the_val not in key_df[key_df.hashed_col==the_colname]['hashed_IN'].values:
        # if val not in lookup table of already hashed values
        # then hash it
        
        # First create a random salt just for this value
        salt = secrets.token_hex()
        
        # Now hash the salted value
        sha3 = hashlib.sha3_512()  # Hash algorithm
        data = salt + str(the_val)        # prepend the salt to the value
        sha3.update(data.encode('utf-8')) # Hash the salted value
        hexdigest = sha3.hexdigest()  # hex digest of the salted value

        # save value and hashed value to a lookup table of the secret bits
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        new_row = pd.DataFrame([{'hashed_col': the_colname, 'hashed_IN' : the_val, 'hashed_val' : hexdigest, 'salt' : salt}])
        key_df = pd.concat([key_df, new_row], ignore_index = True)

        # return the hashed value
        return hexdigest

    else:
        # else just return the previously hashed value
        print("\n", the_val, 'HASHED ALREADY - CHECK OUTPUT HASHED DF to make sure values are the same')
        
        return key_df[((key_df.hashed_col==the_colname) & (key_df['hashed_IN'] == the_val))]['hashed_val'].squeeze()


In [19]:
##
## IMPLEMENT THE HASHING
##

# Create DF to hold our secret lookup table.
# We use this table to make sure the same input values have the same hashed value (needed for data linkage)
# YOU WOULD never want to store this on the same server as the hashed data because that defeats the purpose!
# This lookup table is also not necessary if you have the unhashed data.
key_df = pd.DataFrame(columns = ['hashed_col', 'hashed_IN', 'hashed_val', 'salt']) 

# You can hash as many columns as you want
# Make a copy of the input df if you don't want to overwrite the data
df_hashed = df.copy()

# HASH EMAIL
the_colname = 'email'
df_hashed.email = df_hashed.email.apply(lambda x: get_hashed_value(x, the_colname))

# HASH income
the_colname = 'income'
df_hashed.income = df_hashed.income.apply(lambda x: get_hashed_value(x, the_colname))

#=================================
print("\n\nHASHED DF")
print(df_hashed)

#=================================
print("\n\nSECRET SAUCE")
key_df


 albert@gmail.com HASHED ALREADY - CHECK OUTPUT HASHED DF to make sure values are the same


HASHED DF
                                               email  gender  age  \
0  7be40296d784b881b37e9cc02937207eb738848f144d11...    male   20   
1  35c93916fff462b62da378e1bb41be6c4ad83fd26a7b64...  female   21   
2  6890d7258972961b405b0bb9150cf0f72928425a1f6f89...    male   19   
3  2802e3d6d9be3035cde875ce21d6735dd8c231667a0c35...  female   18   
4  7be40296d784b881b37e9cc02937207eb738848f144d11...    male   20   

                                              income  country  
0  661e45b1a7626404a22f32a41b5aaf15d60e3324d3cee5...  Germany  
1  1732d2daf9720f3cc4a84e26c8747c3f91227d2d14a670...  Germany  
2  dad09c5f7f3a10aa9626a466c95bbb44b0cfaaa4fcf41d...  Germany  
3  ad9e645c2b7cbf9740c7479655944f47383c11885b819b...   Sweden  
4  6841e90b7d895d01cd6a14a9c36de2ebd27301960b79f5...  Germany  


SECRET SAUCE


Unnamed: 0,hashed_col,hashed_IN,hashed_val,salt
0,email,albert@gmail.com,7be40296d784b881b37e9cc02937207eb738848f144d11...,6d795ba1835a1acc92ee71499f1777b58f833913935461...
1,email,bree@hotmail.com,35c93916fff462b62da378e1bb41be6c4ad83fd26a7b64...,3de166e62007baa4dbb78fc4397ca610c124ac92a85156...
2,email,titan@live.com,6890d7258972961b405b0bb9150cf0f72928425a1f6f89...,f357780c86d29a13937bc7e1bfa380028755e8f8d8ddb1...
3,email,epic@yahoo.com,2802e3d6d9be3035cde875ce21d6735dd8c231667a0c35...,82219d8b0631f52dcddc94a038da6be9fa8ac257bdd178...
4,income,25000,661e45b1a7626404a22f32a41b5aaf15d60e3324d3cee5...,b425384eb28f35a094106fd985222cc409d4030e6dcd5c...
5,income,34000,1732d2daf9720f3cc4a84e26c8747c3f91227d2d14a670...,6c18c466ec11dd63dd6a65cba16cf2d299cd2b977510db...
6,income,31000,dad09c5f7f3a10aa9626a466c95bbb44b0cfaaa4fcf41d...,8fdf8a04a39a38cacfaa9d92af46eec041171bb1e19d23...
7,income,10000,ad9e645c2b7cbf9740c7479655944f47383c11885b819b...,b622f90350b04f0537b191d70f29670bba02646735bfdf...
8,income,15000,6841e90b7d895d01cd6a14a9c36de2ebd27301960b79f5...,bd65c3ce7b740e5fdfe1244c2c64ccf13ed6bdd07e2377...


## Finishing up

At this point you would check the hashed data (`df_hashed`), save it to file, and then use or share it with a much lower risk of revealing personally identifiable information. Store the input data, the lookup table of salts, and the code you used to hash and salt it on a secure server.

That said, there is always a chance that someone could recover the original values if they learn your hashing method and salts, especially for columns with few possible values. The security methods for working with highly restricted-use data (P3 and above) should therefore always be double-checked by a campus data security professional.
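
To make that risk concrete: when the set of possible inputs is small (ages, rounded incomes, country names) and the salt and method are known, an attacker can hash every candidate and compare. The sketch below assumes a leaked salt and a salt-plus-value SHA3-512 scheme like the one in the simpler example; the salt value and all names are invented for illustration.

```python
import hashlib

def sha3_salted(value, salt):
    """Hash salt + value with SHA3-512, mirroring the simple example's scheme."""
    return hashlib.sha3_512((salt + str(value)).encode("utf-8")).hexdigest()

leaked_salt = "0123abcd"                          # assumed to have leaked
published_hash = sha3_salted(34000, leaked_salt)  # what the analyst receives

# Try every plausible income in steps of 1000 until a hash matches.
recovered = next(
    income for income in range(0, 200001, 1000)
    if sha3_salted(income, leaked_salt) == published_hash
)
print(recovered)  # 34000
```

This is why the salts and the lookup table need the same protection as the original data.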

---
Created by Patty Frontiera, pfrontiera@berkeley.edu
Last updated: 1/31/2022
