**How to add a sequence-generated surrogate key as a column in a DataFrame?**

**Topics Covered**

- monotonically_increasing_id
- Using CRC32
- Using MD5
- hash
- Using sha1
- Using sha2
- Using window function row_number()

#### **1) monotonically_increasing_id**

- monotonically_increasing_id generates a sequence number that can serve as a **surrogate key**.
- The monotonically_increasing_id function in Databricks generates **unique identifiers for rows in a DataFrame** as a **column** of monotonically increasing **64-bit integers**. The IDs are not contiguous: the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits, so every partition starts its own range.

**Key Characteristics**:
- **Monotonically Increasing:**
  - The values generated are guaranteed to be monotonically **increasing and unique**, but they are **not guaranteed to be consecutive**.
- **Distributed Processing:**
  - This function is optimized for distributed computing environments, such as Apache Spark, on which Databricks is built. The **IDs are unique across partitions** and can be used to identify rows uniquely.
- **Non-Consecutive IDs:**
  - Since the IDs are generated in a distributed manner, they are **not consecutive**.

**Syntax**

     monotonically_increasing_id()
**Arguments**
- This function takes no arguments.

**Returns**
- BIGINT

In [0]:
from pyspark.sql.functions import col, lit, monotonically_increasing_id, spark_partition_id, concat, concat_ws
from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

**Ex 01: Consecutive numbers**

In [0]:
df = spark.range(10000)
display(df)

id
0
1
2
3
4
5
6
7
8
9


In [0]:
# Create a DataFrame with 1000 records
df_consec = spark.range(1000).withColumn("id", monotonically_increasing_id())\
                             .withColumn("partition_id", spark_partition_id())

# Display the DataFrame
display(df_consec)

id,partition_id
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,0


**Ex 02: Non-consecutive numbers**

In [0]:
# Create a DataFrame with 20 records (the commented-out variant below uses 10)
# df_Non_Consec = spark.range(10).withColumn("id", monotonically_increasing_id())\
#                                .withColumn("partition_id", spark_partition_id())

df_Non_Consec = spark.range(20).withColumn("id", monotonically_increasing_id())\
                               .withColumn("partition_id", spark_partition_id())

# Display the DataFrame
display(df_Non_Consec)

id,partition_id
0,0
1,0
8589934592,1
8589934593,1
8589934594,1
17179869184,2
17179869185,2
25769803776,3
25769803777,3
25769803778,3
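
The jump from 1 to 8589934592 in the output above follows from the bit layout Spark documents for this function: the partition ID sits in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below reproduces that arithmetic in plain Python. The helper names `compose_id` and `split_id` are hypothetical and are not part of PySpark; they only illustrate the layout.

```python
# Hypothetical helpers mirroring the documented bit layout of
# monotonically_increasing_id(); they are not PySpark functions.
ROW_BITS = 33  # the lower 33 bits hold the record number within a partition


def compose_id(partition_id: int, row_in_partition: int) -> int:
    """Build an ID from a partition ID and a record number."""
    return (partition_id << ROW_BITS) | row_in_partition


def split_id(generated_id: int) -> tuple:
    """Recover (partition_id, row_in_partition) from a generated ID."""
    return generated_id >> ROW_BITS, generated_id & ((1 << ROW_BITS) - 1)


print(compose_id(1, 0))       # 8589934592: first ID of partition 1
print(compose_id(2, 1))       # 17179869185: second ID of partition 2
print(split_id(25769803778))  # (3, 2): third row of partition 3
```

Because the partition ID is part of the value, the same data can receive different IDs whenever it is partitioned differently.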


In [0]:
df = spark.read.csv("/FileStore/tables/Emp_Hash-3.csv", header=True, inferSchema=True)
display(df)

EmpNo,Emp_Name,Age,Experience,Department,Sales,Mfg_Date,Quantity,Commodity,Dept_No,Start_Date
100,Smitha,23,2,IT,16700.0,17-12-1980,800,3678.0,20,2022-01-01
101,Anil,26,3,ADMIN,16750.0,20-02-1981,1600,211.0,30,2022-01-02
102,Watson,29,4,ADF,12345.0,22-02-1981,1250,344.0,30,2022-01-03
103,James,32,5,ADB,45678.0,4/2/1981,2975,,20,2022-01-04
104,Mathew,35,6,ADE,23456.0,21-09-1981,1250,12345.0,30,2022-01-05
105,Sree,38,7,SALES,98765.0,5/1/1981,2850,,30,2022-01-06
106,Rajesh,41,8,PROD,49876.0,6/9/1981,2450,,10,2022-01-07
107,Swetha,44,9,DEVELOPER,6577.0,19-04-1987,3000,2456.0,20,2022-01-08
108,Kapil,47,10,ACCOUNTS,,1/11/1981,5000,345.0,10,2022-01-09
109,Tarun,50,11,TRANSPORT,34590.0,9/8/1981,1500,0.0,30,2022-01-10


In [0]:
# Add a surrogate key column ID_KEY using the monotonically_increasing_id() function
df_surr = df.withColumn("ID_KEY", monotonically_increasing_id())
display(df_surr)

EmpNo,Emp_Name,Age,Experience,Department,Sales,Mfg_Date,Quantity,Commodity,Dept_No,Start_Date,ID_KEY
100,Smitha,23,2,IT,16700.0,17-12-1980,800,3678.0,20,2022-01-01,0
101,Anil,26,3,ADMIN,16750.0,20-02-1981,1600,211.0,30,2022-01-02,1
102,Watson,29,4,ADF,12345.0,22-02-1981,1250,344.0,30,2022-01-03,2
103,James,32,5,ADB,45678.0,4/2/1981,2975,,20,2022-01-04,3
104,Mathew,35,6,ADE,23456.0,21-09-1981,1250,12345.0,30,2022-01-05,4
105,Sree,38,7,SALES,98765.0,5/1/1981,2850,,30,2022-01-06,5
106,Rajesh,41,8,PROD,49876.0,6/9/1981,2450,,10,2022-01-07,6
107,Swetha,44,9,DEVELOPER,6577.0,19-04-1987,3000,2456.0,20,2022-01-08,7
108,Kapil,47,10,ACCOUNTS,,1/11/1981,5000,345.0,10,2022-01-09,8
109,Tarun,50,11,TRANSPORT,34590.0,9/8/1981,1500,0.0,30,2022-01-10,9


**Setting a Custom Starting Point**

In [0]:
# Set a custom starting point for the IDs
start_id = 1

# Add ID_KEY using monotonically_increasing_id(), shifted so numbering starts at start_id
df_surr = df.withColumn("ID_KEY", monotonically_increasing_id() + start_id)
display(df_surr)

EmpNo,Emp_Name,Age,Experience,Department,Sales,Mfg_Date,Quantity,Commodity,Dept_No,Start_Date,ID_KEY
100,Smitha,23,2,IT,16700.0,17-12-1980,800,3678.0,20,2022-01-01,1
101,Anil,26,3,ADMIN,16750.0,20-02-1981,1600,211.0,30,2022-01-02,2
102,Watson,29,4,ADF,12345.0,22-02-1981,1250,344.0,30,2022-01-03,3
103,James,32,5,ADB,45678.0,4/2/1981,2975,,20,2022-01-04,4
104,Mathew,35,6,ADE,23456.0,21-09-1981,1250,12345.0,30,2022-01-05,5
105,Sree,38,7,SALES,98765.0,5/1/1981,2850,,30,2022-01-06,6
106,Rajesh,41,8,PROD,49876.0,6/9/1981,2450,,10,2022-01-07,7
107,Swetha,44,9,DEVELOPER,6577.0,19-04-1987,3000,2456.0,20,2022-01-08,8
108,Kapil,47,10,ACCOUNTS,,1/11/1981,5000,345.0,10,2022-01-09,9
109,Tarun,50,11,TRANSPORT,34590.0,9/8/1981,1500,0.0,30,2022-01-10,10


In [0]:
# Set a custom starting point for the IDs
start_id = 1000

# Add ID_KEY using monotonically_increasing_id(), shifted so numbering starts at start_id
df_surr = df.withColumn("ID_KEY", monotonically_increasing_id() + start_id)
display(df_surr)

EmpNo,Emp_Name,Age,Experience,Department,Sales,Mfg_Date,Quantity,Commodity,Dept_No,Start_Date,ID_KEY
100,Smitha,23,2,IT,16700.0,17-12-1980,800,3678.0,20,2022-01-01,1000
101,Anil,26,3,ADMIN,16750.0,20-02-1981,1600,211.0,30,2022-01-02,1001
102,Watson,29,4,ADF,12345.0,22-02-1981,1250,344.0,30,2022-01-03,1002
103,James,32,5,ADB,45678.0,4/2/1981,2975,,20,2022-01-04,1003
104,Mathew,35,6,ADE,23456.0,21-09-1981,1250,12345.0,30,2022-01-05,1004
105,Sree,38,7,SALES,98765.0,5/1/1981,2850,,30,2022-01-06,1005
106,Rajesh,41,8,PROD,49876.0,6/9/1981,2450,,10,2022-01-07,1006
107,Swetha,44,9,DEVELOPER,6577.0,19-04-1987,3000,2456.0,20,2022-01-08,1007
108,Kapil,47,10,ACCOUNTS,,1/11/1981,5000,345.0,10,2022-01-09,1008
109,Tarun,50,11,TRANSPORT,34590.0,9/8/1981,1500,0.0,30,2022-01-10,1009


#### **Using CRC32**

- Calculates the **cyclic redundancy check value (CRC32)** of a **binary column** and returns the value as a **bigint**.
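- CRC32 produces only a **32-bit** value, so collisions (the same key for different inputs) become likely once a table reaches roughly 100k to 200k distinct rows.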
- We should **not use CRC32** for **surrogate key** generation on **large tables**.
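
The 100k / 200k figure can be checked with the birthday approximation. The sketch below assumes hash values are spread uniformly over the output range, which is an idealization; `collision_probability` is a hypothetical helper, not a Spark function.

```python
import math


def collision_probability(n_rows: int, bits: int = 32) -> float:
    """Approximate chance that at least two of n_rows distinct inputs
    share a hash value, assuming uniformly distributed `bits`-bit output."""
    pairs = n_rows * (n_rows - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
    return -math.expm1(-pairs / 2 ** bits)


for n in (10_000, 100_000, 200_000):
    print(n, collision_probability(n))
```

Under this assumption the chance of at least one collision is roughly 69% at 100k rows and above 99% at 200k rows for a 32-bit value. Passing `bits=128` or `bits=256` shows how quickly the risk vanishes for wider digests.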

**Syntax**

     crc32(expr)
**Arguments**
- **expr:** A BINARY expression.

**Returns**
- BIGINT

In [0]:
from pyspark.sql.functions import crc32, col

# Add a CRC32_KEY column using the crc32() function
df_CRC32 = df_surr.withColumn("CRC32_KEY", crc32(col("EMPNO").cast("string")))
display(df_CRC32)

EmpNo,Emp_Name,Age,Experience,Department,Sales,Mfg_Date,Quantity,Commodity,Dept_No,Start_Date,ID_KEY,CRC32_KEY
100,Smitha,23,2,IT,16700.0,17-12-1980,800,3678.0,20,2022-01-01,1000,595022058
101,Anil,26,3,ADMIN,16750.0,20-02-1981,1600,211.0,30,2022-01-02,1001,1416650876
102,Watson,29,4,ADF,12345.0,22-02-1981,1250,344.0,30,2022-01-03,1002,3447271878
103,James,32,5,ADB,45678.0,4/2/1981,2975,,20,2022-01-04,1003,3128820048
104,Mathew,35,6,ADE,23456.0,21-09-1981,1250,12345.0,30,2022-01-05,1004,605721843
105,Sree,38,7,SALES,98765.0,5/1/1981,2850,,30,2022-01-06,1005,1394451557
106,Rajesh,41,8,PROD,49876.0,6/9/1981,2450,,10,2022-01-07,1006,3390371295
107,Swetha,44,9,DEVELOPER,6577.0,19-04-1987,3000,2456.0,20,2022-01-08,1007,3172189513
108,Kapil,47,10,ACCOUNTS,,1/11/1981,5000,345.0,10,2022-01-09,1008,766302424
109,Tarun,50,11,TRANSPORT,34590.0,9/8/1981,1500,0.0,30,2022-01-10,1009,1521215566


In [0]:
from pyspark.sql.functions import concat, col, crc32, row_number
from pyspark.sql.window import Window

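# Concatenate several columns into one string and hash it with crc32()
df_CRC32 = df_CRC32.withColumn("concat", concat(col("Sales"), col("Quantity"), col("Commodity"), col("Experience")))
df_CRC32 = df_CRC32.withColumn("CRC32_id", crc32(col("concat")))
# Number the rows that share a CRC32 value. Rows with identical inputs always share a value,
# so duplicates > 1 flags repeated input data as well as any genuine hash collision.
df_CRC32 = df_CRC32.withColumn("duplicates", row_number().over(Window.partitionBy("CRC32_id").orderBy("CRC32_id")))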
display(df_CRC32)

EmpNo,Emp_Name,Age,Experience,Department,Sales,Mfg_Date,Quantity,Commodity,Dept_No,Start_Date,ID_KEY,CRC32_KEY,concat,CRC32_id,duplicates
132,Sam,38,6,ADE,98765.0,17-12-1981,2060,43680.0,78,2022-02-05,1032,3864289797,987652060436806,317860539,1
121,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1021,1715864318,78234245034563,412961798,1
169,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,65,2022-01-22,1069,217141192,78234245034563,412961798,2
193,Rajesh,36,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1093,1807530521,78234245034563,412961798,3
197,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1097,1825668608,78234245034563,412961798,4
131,Mohit,35,5,SALESMAN,23456.0,20-02-1981,2050,43679.0,65,2022-02-04,1031,2136814527,234562050436795,496027066,1
112,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1012,3563192455,7566563000null4,585254287,1
160,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,67,2022-01-13,1060,1965946732,7566563000null4,585254287,2
184,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1084,3972210427,7566563000null4,585254287,3
135,Mathew,47,9,DEVELOPER,,20-02-1983,6789,43683.0,34,2022-02-08,1035,2016475046,null6789436839,803391266,1


In [0]:
display(df_CRC32.filter("duplicates>1"))

EmpNo,Emp_Name,Age,Experience,Department,Sales,Mfg_Date,Quantity,Commodity,Dept_No,Start_Date,ID_KEY,CRC32_KEY,concat,CRC32_id,duplicates
169,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,65,2022-01-22,1069,217141192,78234245034563,412961798,2
193,Rajesh,36,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1093,1807530521,78234245034563,412961798,3
197,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1097,1825668608,78234245034563,412961798,4
160,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,67,2022-01-13,1060,1965946732,7566563000null4,585254287,2
184,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1084,3972210427,7566563000null4,585254287,3
170,Swadesh,67,9,PROD,789900.0,19-04-1987,3000,234.0,28,2022-01-23,1070,1815529005,78990030002349,1088802028,2
194,Swadesh,29,9,PROD,789900.0,19-04-1987,3000,234.0,20,2022-01-23,1094,4124585914,78990030002349,1088802028,3
198,Swadesh,56,9,PROD,789900.0,19-04-1987,3000,234.0,22,2022-01-23,1098,4235092881,78990030002349,1088802028,4
150,Watson,29,4,ADF,12345.0,22-02-1981,1250,344.0,35,2022-01-03,1050,1577100463,1234512503444,1104184368,2
174,Watson,33,4,ADF,12345.0,22-02-1981,1250,344.0,67,2022-01-03,1074,1801126452,1234512503444,1104184368,3
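
Concatenating columns without a delimiter can also create false duplicates, because different rows may collapse into the same string before hashing. The sketch below shows the effect in plain Python, using `zlib.crc32` as a stand-in for Spark's `crc32`. The helper `crc32_key`, the sample rows, and the `|` separator are assumptions made for illustration.

```python
import zlib


def crc32_key(values, sep=""):
    """CRC32 of the values joined into one string; None becomes an empty string."""
    joined = sep.join("" if v is None else str(v) for v in values)
    return zlib.crc32(joined.encode("utf-8"))


# Two different rows whose delimiter-free concatenation is the same string "1234"
row_a = (12, 34)
row_b = (1, 234)

print(crc32_key(row_a) == crc32_key(row_b))                    # True: false duplicate
print(crc32_key(row_a, sep="|") == crc32_key(row_b, sep="|"))  # False: keys differ
```

In PySpark, `concat_ws` (already imported at the top of this notebook) joins columns with a separator and could replace `concat` when the key is built from several columns.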


#### **Using MD5**
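- md5 returns a **128-bit** digest as a **32-character hex string**, so accidental collisions are far less likely than with CRC32, even across millions of records.
- Identical inputs always produce identical keys, and MD5 is not collision-resistant against deliberately crafted inputs, so prefer sha2 where that matters.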
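
A hash-based key is repeatable: the same input string always yields the same key. The sketch below illustrates this with Python's `hashlib`, casting the value to a string first as the notebook does with `EMPNO`. The helper `md5_key` is hypothetical and only stands in for Spark's `md5`.

```python
import hashlib


def md5_key(value) -> str:
    """MD5 of the value's string form, as a 32-character hex string."""
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()


print(md5_key("abc"))                  # 900150983cd24fb0d6963f7d28e17f72
print(len(md5_key(100)))               # 32 hex characters = 128 bits
print(md5_key(100) == md5_key("100"))  # True: same string form, same key
```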

In [0]:
from pyspark.sql.functions import md5,col

# Add an MD5_KEY column using the md5() function
df_md5  = df_CRC32.withColumn("MD5_KEY", md5(col("EMPNO").cast("string")))
display(df_md5)

EmpNo,Emp_Name,Age,Experience,Department,Sales,Mfg_Date,Quantity,Commodity,Dept_No,Start_Date,ID_KEY,CRC32_KEY,concat,CRC32_id,duplicates,MD5_KEY
132,Sam,38,6,ADE,98765.0,17-12-1981,2060,43680.0,78,2022-02-05,1032,3864289797,987652060436806,317860539,1,65ded5353c5ee48d0b7d48c591b8f430
121,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1021,1715864318,78234245034563,412961798,1,4c56ff4ce4aaf9573aa5dff913df997a
169,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,65,2022-01-22,1069,217141192,78234245034563,412961798,2,3636638817772e42b59d74cff571fbb3
193,Rajesh,36,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1093,1807530521,78234245034563,412961798,3,bd686fd640be98efaae0091fa301e613
197,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1097,1825668608,78234245034563,412961798,4,85d8ce590ad8981ca2c8286f79f59954
131,Mohit,35,5,SALESMAN,23456.0,20-02-1981,2050,43679.0,65,2022-02-04,1031,2136814527,234562050436795,496027066,1,1afa34a7f984eeabdbb0a7d494132ee5
112,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1012,3563192455,7566563000null4,585254287,1,7f6ffaa6bb0b408017b62254211691b5
160,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,67,2022-01-13,1060,1965946732,7566563000null4,585254287,2,b73ce398c39f506af761d2277d853a92
184,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1084,3972210427,7566563000null4,585254287,3,6cdd60ea0045eb7a6ec44c54d29ed402
135,Mathew,47,9,DEVELOPER,,20-02-1983,6789,43683.0,34,2022-02-08,1035,2016475046,null6789436839,803391266,1,7f1de29e6da19d22b51c68001e7e0e54


#### **hash**
- Calculates the **hash code** of the given **columns** and returns the result as an **int column**.
- Used to **mask sensitive information**, e.g. date of birth, Social Security number, IP address. It can also be used to derive a repeatable ID/primary key column, although a 32-bit int leaves no more room for unique values than CRC32 does.

In [0]:
df_hash = df_md5.withColumn("hash", f.hash(col("EmpNo")))
display(df_hash)

EmpNo,Emp_Name,Age,Experience,Department,Sales,Mfg_Date,Quantity,Commodity,Dept_No,Start_Date,ID_KEY,CRC32_KEY,concat,CRC32_id,duplicates,MD5_KEY,hash
132,Sam,38,6,ADE,98765.0,17-12-1981,2060,43680.0,78,2022-02-05,1032,3864289797,987652060436806,317860539,1,65ded5353c5ee48d0b7d48c591b8f430,-1274578763
121,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1021,1715864318,78234245034563,412961798,1,4c56ff4ce4aaf9573aa5dff913df997a,674970737
169,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,65,2022-01-22,1069,217141192,78234245034563,412961798,2,3636638817772e42b59d74cff571fbb3,-1925686346
193,Rajesh,36,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1093,1807530521,78234245034563,412961798,3,bd686fd640be98efaae0091fa301e613,398045411
197,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1097,1825668608,78234245034563,412961798,4,85d8ce590ad8981ca2c8286f79f59954,-631093714
131,Mohit,35,5,SALESMAN,23456.0,20-02-1981,2050,43679.0,65,2022-02-04,1031,2136814527,234562050436795,496027066,1,1afa34a7f984eeabdbb0a7d494132ee5,78282775
112,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1012,3563192455,7566563000null4,585254287,1,7f6ffaa6bb0b408017b62254211691b5,839133079
160,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,67,2022-01-13,1060,1965946732,7566563000null4,585254287,2,b73ce398c39f506af761d2277d853a92,1642837935
184,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1084,3972210427,7566563000null4,585254287,3,6cdd60ea0045eb7a6ec44c54d29ed402,-405575218
135,Mathew,47,9,DEVELOPER,,20-02-1983,6789,43683.0,34,2022-02-08,1035,2016475046,null6789436839,803391266,1,7f1de29e6da19d22b51c68001e7e0e54,-1727984828


#### **Using sha1**

In [0]:
from pyspark.sql.functions import sha1

# Add a SHA1_KEY column: SHA-1 of EmpNo, Sales, and Quantity concatenated into one string
df_sha1 = df_hash.withColumn("SHA1_KEY", sha1(concat(col("EMPNO"), col("Sales"), col("Quantity"))))
display(df_sha1)

EmpNo,Emp_Name,Age,Experience,Department,Sales,Mfg_Date,Quantity,Commodity,Dept_No,Start_Date,ID_KEY,CRC32_KEY,concat,CRC32_id,duplicates,MD5_KEY,hash,SHA1_KEY
132,Sam,38,6,ADE,98765.0,17-12-1981,2060,43680.0,78,2022-02-05,1032,3864289797,987652060436806,317860539,1,65ded5353c5ee48d0b7d48c591b8f430,-1274578763,f37c99a4a73777981b46345118aec24f7a95f744
121,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1021,1715864318,78234245034563,412961798,1,4c56ff4ce4aaf9573aa5dff913df997a,674970737,e02f1b049f55f2fc86cb82110b4792a7a05719f8
169,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,65,2022-01-22,1069,217141192,78234245034563,412961798,2,3636638817772e42b59d74cff571fbb3,-1925686346,5387f97b06715d5b267040b7eec3a1d7340271cb
193,Rajesh,36,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1093,1807530521,78234245034563,412961798,3,bd686fd640be98efaae0091fa301e613,398045411,d7757ac81e251ce32600ef0e31b8b964846c0ff5
197,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1097,1825668608,78234245034563,412961798,4,85d8ce590ad8981ca2c8286f79f59954,-631093714,0651dcd0475721f726022803f5473823112c3698
131,Mohit,35,5,SALESMAN,23456.0,20-02-1981,2050,43679.0,65,2022-02-04,1031,2136814527,234562050436795,496027066,1,1afa34a7f984eeabdbb0a7d494132ee5,78282775,d06c83a8a57426c6061811ae48f7769338d89a22
112,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1012,3563192455,7566563000null4,585254287,1,7f6ffaa6bb0b408017b62254211691b5,839133079,23e2eb0549d133568a7d61247718638bdf765c6e
160,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,67,2022-01-13,1060,1965946732,7566563000null4,585254287,2,b73ce398c39f506af761d2277d853a92,1642837935,a11e15d79b5e792a2e49a3fe7a79806ca6c1fc84
184,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1084,3972210427,7566563000null4,585254287,3,6cdd60ea0045eb7a6ec44c54d29ed402,-405575218,95927c3382244861207b0a3634699814954c295b
135,Mathew,47,9,DEVELOPER,,20-02-1983,6789,43683.0,34,2022-02-08,1035,2016475046,null6789436839,803391266,1,7f1de29e6da19d22b51c68001e7e0e54,-1727984828,a136bde6d29651303ae7a5f0e7a315f8bb237f86


#### **Using sha2**

     df_CRC32.withColumn("SHA2_KEY", sha2(col("EMPNO").cast("string"), 256))  # up to 200 million records
     df_CRC32.withColumn("SHA2_KEY", sha2(col("EMPNO").cast("string"), 512))  # more than 200 million records

In [0]:
from pyspark.sql.functions import sha2

# Add a SHA2_KEY column using the sha2() function with a 256-bit digest
df_sha2 = df_sha1.withColumn("SHA2_KEY", sha2(col("EMPNO").cast("string"),256))
display(df_sha2)

EmpNo,Emp_Name,Age,Experience,Department,Sales,Mfg_Date,Quantity,Commodity,Dept_No,Start_Date,ID_KEY,CRC32_KEY,concat,CRC32_id,duplicates,MD5_KEY,hash,SHA1_KEY,SHA2_KEY
132,Sam,38,6,ADE,98765.0,17-12-1981,2060,43680.0,78,2022-02-05,1032,3864289797,987652060436806,317860539,1,65ded5353c5ee48d0b7d48c591b8f430,-1274578763,f37c99a4a73777981b46345118aec24f7a95f744,dbb1ded63bc70732626c5dfe6c7f50ced3d560e970f30b15335ac290358748f6
121,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1021,1715864318,78234245034563,412961798,1,4c56ff4ce4aaf9573aa5dff913df997a,674970737,e02f1b049f55f2fc86cb82110b4792a7a05719f8,89aa1e580023722db67646e8149eb246c748e180e34a1cf679ab0b41a416d904
169,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,65,2022-01-22,1069,217141192,78234245034563,412961798,2,3636638817772e42b59d74cff571fbb3,-1925686346,5387f97b06715d5b267040b7eec3a1d7340271cb,f57e5cb1f4532c008183057ecc94283801fcb5afe2d1c190e3dfd38c4da08042
193,Rajesh,36,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1093,1807530521,78234245034563,412961798,3,bd686fd640be98efaae0091fa301e613,398045411,d7757ac81e251ce32600ef0e31b8b964846c0ff5,684fe39f03758de6a882ae61fa62312b67e5b1e665928cbf3dc3d8f4f53e3562
197,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1097,1825668608,78234245034563,412961798,4,85d8ce590ad8981ca2c8286f79f59954,-631093714,0651dcd0475721f726022803f5473823112c3698,8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f674e88b22365bd2e2ad
131,Mohit,35,5,SALESMAN,23456.0,20-02-1981,2050,43679.0,65,2022-02-04,1031,2136814527,234562050436795,496027066,1,1afa34a7f984eeabdbb0a7d494132ee5,78282775,d06c83a8a57426c6061811ae48f7769338d89a22,eeca91fd439b6d5e827e8fda7fee35046f2def93508637483f6be8a2df7a4392
112,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1012,3563192455,7566563000null4,585254287,1,7f6ffaa6bb0b408017b62254211691b5,839133079,23e2eb0549d133568a7d61247718638bdf765c6e,b1556dea32e9d0cdbfed038fd7787275775ea40939c146a64e205bcb349ad02f
160,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,67,2022-01-13,1060,1965946732,7566563000null4,585254287,2,b73ce398c39f506af761d2277d853a92,1642837935,a11e15d79b5e792a2e49a3fe7a79806ca6c1fc84,a512db2741cd20693e4b16f19891e72b9ff12cead72761fc5e92d2aaf34740c1
184,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1084,3972210427,7566563000null4,585254287,3,6cdd60ea0045eb7a6ec44c54d29ed402,-405575218,95927c3382244861207b0a3634699814954c295b,52f11620e397f867b7d9f19e48caeb64658356a6b5d17138c00dd9feaf5d7ad6
135,Mathew,47,9,DEVELOPER,,20-02-1983,6789,43683.0,34,2022-02-08,1035,2016475046,null6789436839,803391266,1,7f1de29e6da19d22b51c68001e7e0e54,-1727984828,a136bde6d29651303ae7a5f0e7a315f8bb237f86,13671077b66a29874a2578b5240319092ef2a1043228e433e9b006b5e53e7513
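
The second argument of `sha2` selects the digest size. Spark accepts 224, 256, 384, 512, and 0 (which means 256). The sketch below mirrors that rule with Python's `hashlib`. The helper `sha2_hex` is hypothetical, and raising `ValueError` is this sketch's own choice for signaling an unsupported size.

```python
import hashlib

# Digest constructors for the bit lengths that Spark's sha2 accepts
_SHA2 = {
    224: hashlib.sha224,
    256: hashlib.sha256,
    384: hashlib.sha384,
    512: hashlib.sha512,
}


def sha2_hex(text: str, num_bits: int = 256) -> str:
    """Hex digest of text for a supported SHA-2 bit length (0 means 256)."""
    if num_bits == 0:
        num_bits = 256
    if num_bits not in _SHA2:
        raise ValueError(f"unsupported bit length: {num_bits}")
    return _SHA2[num_bits](text.encode("utf-8")).hexdigest()


print(len(sha2_hex("100", 256)))  # 64 hex characters
print(len(sha2_hex("100", 512)))  # 128 hex characters
```

A longer digest costs more storage per key: 64 characters for 256 bits against 128 characters for 512 bits.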


In [0]:
from pyspark.sql.functions import concat_ws, lit
df_sha2_lit = df_sha2.withColumn("SHA2_KEY_lit", concat_ws('_', lit('Salted'), col('EmpNo')))
df_sha2_lit = df_sha2_lit.withColumn("SHA2_KEY_concat", sha2(col("SHA2_KEY_lit").cast("string"),256))
display(df_sha2_lit)

EmpNo,Emp_Name,Age,Experience,Department,Sales,Mfg_Date,Quantity,Commodity,Dept_No,Start_Date,ID_KEY,CRC32_KEY,concat,CRC32_id,duplicates,MD5_KEY,hash,SHA2_KEY,SHA2_KEY_lit,SHA2_KEY_concat
132,Sam,38,6,ADE,98765.0,17-12-1981,2060,43680.0,78,2022-02-05,1032,3864289797,987652060436806,317860539,1,65ded5353c5ee48d0b7d48c591b8f430,-1274578763,dbb1ded63bc70732626c5dfe6c7f50ced3d560e970f30b15335ac290358748f6,Salted_132,7389564d0a8f71b04d9826bfdc7836fe076c5b7144aa08806e10f185d4e03552
121,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1021,1715864318,78234245034563,412961798,1,4c56ff4ce4aaf9573aa5dff913df997a,674970737,89aa1e580023722db67646e8149eb246c748e180e34a1cf679ab0b41a416d904,Salted_121,8b05ad581698a614ef949a950ece356b5580c7087f9f83ae3e9935c48db38b45
169,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,65,2022-01-22,1069,217141192,78234245034563,412961798,2,3636638817772e42b59d74cff571fbb3,-1925686346,f57e5cb1f4532c008183057ecc94283801fcb5afe2d1c190e3dfd38c4da08042,Salted_169,4d7b760d0f18c86144aed17e5bb94f8d180367e3499253017203d86c678f8d27
193,Rajesh,36,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1093,1807530521,78234245034563,412961798,3,bd686fd640be98efaae0091fa301e613,398045411,684fe39f03758de6a882ae61fa62312b67e5b1e665928cbf3dc3d8f4f53e3562,Salted_193,4ea896caa666b5039365a12438b3b86fee4a929c5b4cd7c34f24efd17dc35bad
197,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1097,1825668608,78234245034563,412961798,4,85d8ce590ad8981ca2c8286f79f59954,-631093714,8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f674e88b22365bd2e2ad,Salted_197,73d7bd07401d8b6863f3f2f4e6fe3cf3c1b079d861e45386717b049872c6813d
131,Mohit,35,5,SALESMAN,23456.0,20-02-1981,2050,43679.0,65,2022-02-04,1031,2136814527,234562050436795,496027066,1,1afa34a7f984eeabdbb0a7d494132ee5,78282775,eeca91fd439b6d5e827e8fda7fee35046f2def93508637483f6be8a2df7a4392,Salted_131,61fd69433afb6c2eedf551e10652e3382a65709e7ae3906508e323b3c454f5b3
112,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1012,3563192455,7566563000null4,585254287,1,7f6ffaa6bb0b408017b62254211691b5,839133079,b1556dea32e9d0cdbfed038fd7787275775ea40939c146a64e205bcb349ad02f,Salted_112,86f5476e2aa5ca33b9a031ac9fab5305e081a49ad4a64b8f0e2619fa8b5a063e
160,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,67,2022-01-13,1060,1965946732,7566563000null4,585254287,2,b73ce398c39f506af761d2277d853a92,1642837935,a512db2741cd20693e4b16f19891e72b9ff12cead72761fc5e92d2aaf34740c1,Salted_160,dea6cb9de2df09144cc21eb1ba9154a3367616048e8e1b9379ba6a958c378ae0
184,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1084,3972210427,7566563000null4,585254287,3,6cdd60ea0045eb7a6ec44c54d29ed402,-405575218,52f11620e397f867b7d9f19e48caeb64658356a6b5d17138c00dd9feaf5d7ad6,Salted_184,4ff16eb0394d0a5318b7cc9db0af84cd5805fc478b08a5a48061aa373eff3e2a
135,Mathew,47,9,DEVELOPER,,20-02-1983,6789,43683.0,34,2022-02-08,1035,2016475046,null6789436839,803391266,1,7f1de29e6da19d22b51c68001e7e0e54,-1727984828,13671077b66a29874a2578b5240319092ef2a1043228e433e9b006b5e53e7513,Salted_135,e5453d335cf31c215e0ce2c3e8ac7af23ea191dc198a82bada1d61970bdf2a36
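
Prefixing the value with a fixed literal before hashing, as in the cell above, changes every resulting key while keeping it repeatable. The sketch below reproduces the idea with `hashlib`. The helper `salted_key` is hypothetical, and the literal `Salted` with the `_` separator is taken from the notebook cell.

```python
import hashlib


def salted_key(emp_no, salt: str = "Salted") -> str:
    """SHA-256 hex digest of '<salt>_<emp_no>'."""
    salted = f"{salt}_{emp_no}"
    return hashlib.sha256(salted.encode("utf-8")).hexdigest()


unsalted = hashlib.sha256(b"132").hexdigest()
print(salted_key(132) != unsalted)         # True: the salt changes the key
print(salted_key(132) == salted_key(132))  # True: still repeatable
```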


In [0]:
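# 200 is not a supported bit length (sha2 accepts 224, 256, 384, 512, or 0 for 256), so this call raises an error
df_sha2_lit_tr = df_sha1.withColumn("SHA2_KEY", sha2(col("EMPNO").cast("string"),200))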
display(df_sha2_lit_tr)

---------------------------------------------------------------------------
IllegalArgumentException                  Traceback (most recent call last)
File <command-2946882821384582>, line 1
----> 1 df_sha2_lit_tr = df_sha1.withColumn("SHA2_KEY", sha2(col("EMPNO").cast("string"),200))
      2 display(df_sha2_lit_tr)

File /databricks/spark/python/pyspark/sql/utils.py:264, in try_remote_functions.<locals>.wrapped(*args, **kwargs)
    262     return getattr(functions, f.__name__)(*args, **kwargs)
    263 else:
--> 264

#### **Using window function row_number()**

In [0]:
from pyspark.sql.functions import sha2,row_number,lit
from pyspark.sql.window import Window

# Add a ROW_NUMBER column with consecutive values 1..N. Partitioning and ordering by a
# constant puts all rows in one window, so the row order is arbitrary.
df_row = df_sha2_lit.withColumn("ROW_NUMBER", row_number().over(Window.partitionBy(lit('')).orderBy(lit(''))))
display(df_row)

EmpNo,Emp_Name,Age,Experience,Department,Sales,Mfg_Date,Quantity,Commodity,Dept_No,Start_Date,ID_KEY,CRC32_KEY,concat,CRC32_id,duplicates,MD5_KEY,hash,SHA2_KEY,SHA2_KEY_lit,SHA2_KEY_concat,ROW_NUMBER
132,Sam,38,6,ADE,98765.0,17-12-1981,2060,43680.0,78,2022-02-05,1032,3864289797,987652060436806,317860539,1,65ded5353c5ee48d0b7d48c591b8f430,-1274578763,dbb1ded63bc70732626c5dfe6c7f50ced3d560e970f30b15335ac290358748f6,Salted_132,7389564d0a8f71b04d9826bfdc7836fe076c5b7144aa08806e10f185d4e03552,1
121,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1021,1715864318,78234245034563,412961798,1,4c56ff4ce4aaf9573aa5dff913df997a,674970737,89aa1e580023722db67646e8149eb246c748e180e34a1cf679ab0b41a416d904,Salted_121,8b05ad581698a614ef949a950ece356b5580c7087f9f83ae3e9935c48db38b45,2
169,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,65,2022-01-22,1069,217141192,78234245034563,412961798,2,3636638817772e42b59d74cff571fbb3,-1925686346,f57e5cb1f4532c008183057ecc94283801fcb5afe2d1c190e3dfd38c4da08042,Salted_169,4d7b760d0f18c86144aed17e5bb94f8d180367e3499253017203d86c678f8d27,3
193,Rajesh,36,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1093,1807530521,78234245034563,412961798,3,bd686fd640be98efaae0091fa301e613,398045411,684fe39f03758de6a882ae61fa62312b67e5b1e665928cbf3dc3d8f4f53e3562,Salted_193,4ea896caa666b5039365a12438b3b86fee4a929c5b4cd7c34f24efd17dc35bad,4
197,Rajesh,26,3,SALES,78234.0,6/9/1981,2450,3456.0,10,2022-01-22,1097,1825668608,78234245034563,412961798,4,85d8ce590ad8981ca2c8286f79f59954,-631093714,8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f674e88b22365bd2e2ad,Salted_197,73d7bd07401d8b6863f3f2f4e6fe3cf3c1b079d861e45386717b049872c6813d,5
131,Mohit,35,5,SALESMAN,23456.0,20-02-1981,2050,43679.0,65,2022-02-04,1031,2136814527,234562050436795,496027066,1,1afa34a7f984eeabdbb0a7d494132ee5,78282775,eeca91fd439b6d5e827e8fda7fee35046f2def93508637483f6be8a2df7a4392,Salted_131,61fd69433afb6c2eedf551e10652e3382a65709e7ae3906508e323b3c454f5b3,6
112,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1012,3563192455,7566563000null4,585254287,1,7f6ffaa6bb0b408017b62254211691b5,839133079,b1556dea32e9d0cdbfed038fd7787275775ea40939c146a64e205bcb349ad02f,Salted_112,86f5476e2aa5ca33b9a031ac9fab5305e081a49ad4a64b8f0e2619fa8b5a063e,7
160,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,67,2022-01-13,1060,1965946732,7566563000null4,585254287,2,b73ce398c39f506af761d2277d853a92,1642837935,a512db2741cd20693e4b16f19891e72b9ff12cead72761fc5e92d2aaf34740c1,Salted_160,dea6cb9de2df09144cc21eb1ba9154a3367616048e8e1b9379ba6a958c378ae0,8
184,Farid,29,4,SCIENTIST,756656.0,12/3/1981,3000,,20,2022-01-13,1084,3972210427,7566563000null4,585254287,3,6cdd60ea0045eb7a6ec44c54d29ed402,-405575218,52f11620e397f867b7d9f19e48caeb64658356a6b5d17138c00dd9feaf5d7ad6,Salted_184,4ff16eb0394d0a5318b7cc9db0af84cd5805fc478b08a5a48061aa373eff3e2a,9
135,Mathew,47,9,DEVELOPER,,20-02-1983,6789,43683.0,34,2022-02-08,1035,2016475046,null6789436839,803391266,1,7f1de29e6da19d22b51c68001e7e0e54,-1727984828,13671077b66a29874a2578b5240319092ef2a1043228e433e9b006b5e53e7513,Salted_135,e5453d335cf31c215e0ce2c3e8ac7af23ea191dc198a82bada1d61970bdf2a36,10
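
Unlike the hash-based keys, `row_number()` yields consecutive integers, and the window's ordering decides which row gets which number. The plain-Python sketch below shows that logic with an explicit sort column, which makes the numbering repeatable. The helper `add_row_numbers`, its `start` parameter, and the sample rows are assumptions made for illustration.

```python
def add_row_numbers(rows, order_by, start=1):
    """Return copies of rows with a consecutive ROW_NUMBER, ordered by one column."""
    ordered = sorted(rows, key=lambda row: row[order_by])
    return [{**row, "ROW_NUMBER": n} for n, row in enumerate(ordered, start=start)]


employees = [
    {"EmpNo": 121, "Emp_Name": "Rajesh"},
    {"EmpNo": 112, "Emp_Name": "Farid"},
    {"EmpNo": 132, "Emp_Name": "Sam"},
]

keyed = add_row_numbers(employees, order_by="EmpNo")
print([(row["EmpNo"], row["ROW_NUMBER"]) for row in keyed])
# [(112, 1), (121, 2), (132, 3)]
```

A `start` value above the highest key already issued would let a later batch continue the sequence without overlapping earlier keys.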
