# Hash-Anonymize Data with Spark UDF and Python Hashlib SHA-256 #
This example loads some simulated sensitive data, anonymizes it, and saves it in Parquet format partitioned by year and month.

### Stage the Data ###
Source data from this website:  
https://eforexcel.com/wp/downloads-17-sample-csv-files-data-sets-for-testing-credit-card/  
(download the 1,000-record sample set).  

Copy the data into the Docker container:
```
docker cp 1000-CC-Records.csv jupyterlab:/opt/workspace/datain/credit-cards/
```
(Create the `datain/credit-cards` folder first if it doesn't already exist.)

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.\
        builder.\
        appName("hash_anon_udf").\
        master("spark://spark-master:7077").\
        config("spark.executor.memory", "512m").\
        config("spark.eventLog.enabled", "true").\
        config("spark.eventLog.dir", "file:///opt/workspace/events").\
        getOrCreate()  

22/02/23 08:03:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [3]:
# read the data in from a CSV
df = spark.read.option("inferSchema", True).option("header", True)\
    .csv("/opt/workspace/datain/credit-cards/1000-CC-Records.csv")

                                                                                

In [4]:
# rename the columns
old_columns = df.columns
new_columns = ["card_type", "bank", "card_number", "card_holder", "cvv", "issue_date", "expiry_date", "billing_date", "card_pin", "credit_limit"]

def renameCols(df, old_columns, new_columns):
    for old_col,new_col in zip(old_columns,new_columns):
        df = df.withColumnRenamed(old_col,new_col)
    return df

df = renameCols(df, old_columns, new_columns)

In [5]:
df.printSchema()

root
 |-- card_type: string (nullable = true)
 |-- bank: string (nullable = true)
 |-- card_number: long (nullable = true)
 |-- card_holder: string (nullable = true)
 |-- cvv: integer (nullable = true)
 |-- issue_date: string (nullable = true)
 |-- expiry_date: string (nullable = true)
 |-- billing_date: integer (nullable = true)
 |-- card_pin: integer (nullable = true)
 |-- credit_limit: integer (nullable = true)



In [6]:
df.show(5)

+-------------------+----------------+----------------+---------------+----+----------+-----------+------------+--------+------------+
|          card_type|            bank|     card_number|    card_holder| cvv|issue_date|expiry_date|billing_date|card_pin|credit_limit|
+-------------------+----------------+----------------+---------------+----+----------+-----------+------------+--------+------------+
|               Visa|           Chase|4431465245886276|  Frank Q Ortiz| 362|   09/2016|    09/2034|           7|    1247|      103700|
|           Discover|        Discover|6224764404044446|Tony E Martinez|  35|   06/2012|    06/2030|          23|    6190|       92900|
|Japan Credit Bureau|             JCB|3541789329050940|    Ana M Downs| 945|   03/2017|    03/2021|          10|    8550|       71500|
|   American Express|American Express| 371306399244328| Calvin T House|3868|   09/2007|    09/2018|          26|    1777|      190500|
|               Visa|           Chase|4332985341176660|

## Create a New `anon_df` with Anonymized Names and Card Numbers, and Remove the PIN ##

Use Python `hashlib` to tokenize (anonymize) the data. With the same secret and iteration count, a given value always produces the same hash value, so analysis and aggregations can still be performed against the anonymized data.

https://docs.python.org/3/library/hashlib.html

`hashlib.pbkdf2_hmac(hash_name, password, salt, iterations, dklen=None)`


For example:

In [7]:
# Example using pbkdf2_hmac - this could be used to generate a salted password / secret
import hashlib
dk = hashlib.pbkdf2_hmac('sha256', b'password', b'salt', 100000)
dk.hex()

'0394a2ede332c9a13eb82e9b24631604c31df978b4e2f0fbd2c549944f9d79a5'
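
The determinism that makes hashed values usable for joins and aggregations can be checked in plain Python. The sketch below is illustrative only: the helper name `pbkdf2_hex` and both salt strings are assumptions made for this example, not part of the notebook.

```python
import hashlib


def pbkdf2_hex(value: str, salt: str, iterations: int = 1000) -> str:
    """Return the PBKDF2-HMAC-SHA256 digest of `value` as a hex string."""
    return hashlib.pbkdf2_hmac(
        "sha256", value.encode("utf-8"), salt.encode("utf-8"), iterations
    ).hex()


# Hypothetical salts; the name is taken from the sample data shown earlier.
first = pbkdf2_hex("Frank Q Ortiz", "demo-salt")
second = pbkdf2_hex("Frank Q Ortiz", "demo-salt")
other_salt = pbkdf2_hex("Frank Q Ortiz", "another-salt")

print(first == second)      # True: same value and salt give the same token
print(first == other_salt)  # False: a different salt gives a different token
print(len(first))           # 64: a 32-byte SHA-256 digest as hex characters
```

Because equal inputs map to equal tokens, grouping or joining on the hashed column gives the same groupings as the original column would.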

In [8]:
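# Secret passed to pbkdf2_hmac as the salt argument, so these hashes differ from hashes produced elsewhere for the same value.
# The same secret is used for every row (no per-value salt), which keeps equal inputs mapped to equal hashes.
SECRET = "ZZxxyy##1234"
# Choose the iteration count based on the hash algorithm and available computing power.
# As of 2013, at least 100,000 iterations of SHA-256 are suggested; this example uses only 1000, well below that guidance.
ITERATIONS = 1000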
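
The iteration count is part of the hash input in effect: changing it changes every token. A minimal sketch of that point, reusing the demo secret and the first sample card number from this notebook (the variable names `token_1k` and `token_100k` are assumptions for illustration):

```python
import hashlib

SECRET = "ZZxxyy##1234"  # same demo secret as the notebook cell
payload = "4431465245886276".encode("utf-8")
salt = SECRET.encode("utf-8")

# Same value and secret, two different iteration counts.
token_1k = hashlib.pbkdf2_hmac("sha256", payload, salt, 1000).hex()
token_100k = hashlib.pbkdf2_hmac("sha256", payload, salt, 100000).hex()

print(token_1k == token_100k)  # False: the tokens are not comparable
```

Data hashed with one setting cannot be joined against data hashed with another, so both `SECRET` and `ITERATIONS` need to stay fixed across runs that are meant to be combined.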

#### Create a Spark UDF that Calls the Anonymizing Function ####

In [9]:
import hashlib
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# function to be called by the UDF: hashes a string with PBKDF2-HMAC-SHA256
def hash_anonymized(payload):
    # pass nulls through unchanged; None has no .encode() method
    if payload is None:
        return None
    return hashlib.pbkdf2_hmac('sha256', payload.encode('utf-8'), SECRET.encode('utf-8'), ITERATIONS).hex()

# wrap the function as a Spark UDF that returns a string column
spark_udf = udf(hash_anonymized, StringType())

In [10]:
# function to anonymize specified columns in a dataframe
def anonymize_data_frame(df, list_of_columns, replace_column = True):
    for col in list_of_columns:
        if replace_column:
            df = df.withColumn(col, spark_udf(col))
        else:
            df = df.withColumn(col + "_anon", spark_udf(col))

    return df

In [11]:
# https://spark.apache.org/docs/2.1.1/api/python/_modules/pyspark/sql/types.html
from pyspark.sql.types import StringType

# because the card number is not a string, it must be cast to string before it can be hash-tokenized
df = df.withColumn("card_number", df["card_number"].cast(StringType()))

In [12]:
# split the issue_date into Year and Month - could just keep this as year_month, depending on access patterns
from pyspark.sql.functions import split
split_col = split(df["issue_date"], "/")
df = df.withColumn("issue_year", split_col.getItem(1))
df = df.withColumn("issue_month", split_col.getItem(0))
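
The index mapping used above (item 0 is the month, item 1 is the year) can be illustrated in plain Python. This is not Spark code, only a sketch of the same split on a single value; the helper name `split_issue_date` is an assumption for this example.

```python
from typing import Tuple


def split_issue_date(issue_date: str) -> Tuple[str, str]:
    """Split an 'MM/YYYY' string and return (year, month) as strings."""
    month, year = issue_date.split("/")
    return year, month


print(split_issue_date("09/2016"))  # ('2016', '09')
```

Both parts stay strings, so the month keeps its leading zero, matching the `string` type of `issue_year` and `issue_month` in the schema.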

In [13]:
df.printSchema()

root
 |-- card_type: string (nullable = true)
 |-- bank: string (nullable = true)
 |-- card_number: string (nullable = true)
 |-- card_holder: string (nullable = true)
 |-- cvv: integer (nullable = true)
 |-- issue_date: string (nullable = true)
 |-- expiry_date: string (nullable = true)
 |-- billing_date: integer (nullable = true)
 |-- card_pin: integer (nullable = true)
 |-- credit_limit: integer (nullable = true)
 |-- issue_year: string (nullable = true)
 |-- issue_month: string (nullable = true)



#### Call the Anonymize UDF ####

In [14]:
anon_cols = ["card_number","card_holder"]
# create a new DF from the source data with the anon_cols hashed and the sensitive columns dropped
anon_df = anonymize_data_frame(df, anon_cols, replace_column=True).drop("card_pin", "cvv", "issue_date")

In [15]:
# sample of anonymized data frame
anon_df.show(5)

+-------------------+----------------+--------------------+--------------------+-----------+------------+------------+----------+-----------+
|          card_type|            bank|         card_number|         card_holder|expiry_date|billing_date|credit_limit|issue_year|issue_month|
+-------------------+----------------+--------------------+--------------------+-----------+------------+------------+----------+-----------+
|               Visa|           Chase|a07bf636334354c7a...|5b748745f61c945bc...|    09/2034|           7|      103700|      2016|         09|
|           Discover|        Discover|9231ba4f885c17605...|d7775505e407aa08c...|    06/2030|          23|       92900|      2012|         06|
|Japan Credit Bureau|             JCB|1d3b85e871e3fbabb...|0d4ee40547bf2820c...|    03/2021|          10|       71500|      2017|         03|
|   American Express|American Express|783544b31ab243f14...|2392de94a79599647...|    09/2018|          26|      190500|      2007|         09|
|     

                                                                                

In [16]:
# sample of original data
df.show(5)

+-------------------+----------------+----------------+---------------+----+----------+-----------+------------+--------+------------+----------+-----------+
|          card_type|            bank|     card_number|    card_holder| cvv|issue_date|expiry_date|billing_date|card_pin|credit_limit|issue_year|issue_month|
+-------------------+----------------+----------------+---------------+----+----------+-----------+------------+--------+------------+----------+-----------+
|               Visa|           Chase|4431465245886276|  Frank Q Ortiz| 362|   09/2016|    09/2034|           7|    1247|      103700|      2016|         09|
|           Discover|        Discover|6224764404044446|Tony E Martinez|  35|   06/2012|    06/2030|          23|    6190|       92900|      2012|         06|
|Japan Credit Bureau|             JCB|3541789329050940|    Ana M Downs| 945|   03/2017|    03/2021|          10|    8550|       71500|      2017|         03|
|   American Express|American Express| 3713063992443

## Write Data Out ##
Partition the data by issue-date year and month. The write mode can be set to either overwrite or append.

Consider strategies for adding new data without duplicating existing data: https://stackoverflow.com/questions/42317738/how-to-partition-and-write-dataframe-in-spark-without-deleting-partitions-with-n


In [17]:
anon_df.write.format("parquet")\
                .mode("overwrite")\
                .partitionBy('issue_year', 'issue_month')\
                .save("/opt/workspace/dataout/credit-cards/")
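
With `partitionBy`, Spark writes one `column=value` directory level per partition column under the output path. The sketch below only builds the expected directory name for one partition as an illustration; the helper `partition_dir` is hypothetical and does not touch the filesystem or Spark.

```python
from pathlib import PurePosixPath


def partition_dir(base: str, issue_year: str, issue_month: str) -> str:
    """Build the column=value directory path for one year/month partition."""
    path = PurePosixPath(base) / f"issue_year={issue_year}" / f"issue_month={issue_month}"
    return str(path)


print(partition_dir("/opt/workspace/dataout/credit-cards/", "2016", "09"))
# /opt/workspace/dataout/credit-cards/issue_year=2016/issue_month=09
```

Queries that filter on `issue_year` and `issue_month` can then read only the matching directories rather than the whole data set.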


                                                                                

In [18]:
spark.stop()