#### concat_ws

- Concatenates multiple input **string columns** into a **single string column**, placing a specified **separator** between the values.
- The **ws** stands for **with separator**.
- **Skips NULL** values automatically.

**Null Handling:**
- Unlike the basic `concat` function, `concat_ws` **ignores NULL values** in the `input columns`: a NULL input contributes neither a value nor a `separator`, so a single `missing value` does not turn the whole result into NULL.
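
The difference between the two functions can be sketched in plain Python. This is only a model of the semantics described above, not Spark code; `concat_model` and `concat_ws_model` are hypothetical names introduced for illustration, and Python's `None` stands in for NULL.

```python
from typing import Optional


def concat_model(*values: Optional[str]) -> Optional[str]:
    """Stand-in for concat: any NULL (None) input makes the whole result NULL."""
    if any(v is None for v in values):
        return None
    return "".join(values)


def concat_ws_model(sep: str, *values: Optional[str]) -> str:
    """Stand-in for concat_ws: NULL (None) inputs are dropped before joining."""
    return sep.join(v for v in values if v is not None)


row = ("A2", None, "USA")
print(concat_model(*row))          # None
print(concat_ws_model("_", *row))  # A2_USA
```

The second result has a single underscore because the missing value brings no separator of its own.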

##### Syntax

     concat_ws(sep, *cols)

**sep:**
- A string literal for the `separator` (e.g., **",", " - "**).

**\*cols:**
- The `columns` you want to `concatenate`.
- This can be specified as `individual column objects` or dynamically using a `list of column names` with the `asterisk (*) to unpack` them. 
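
What the asterisk does can be shown without Spark. The sketch below uses a hypothetical stand-in, `show_args`, that only has the same `(sep, *cols)` shape as `concat_ws` and returns what it received; it does not concatenate anything.

```python
def show_args(sep, *cols):
    """Hypothetical stand-in with the same (sep, *cols) shape as concat_ws."""
    return sep, cols


key_cols = ["id", "year", "country"]

explicit = show_args("_", "id", "year", "country")  # columns listed one by one
unpacked = show_args("_", *key_cols)                # list unpacked with *

print(explicit == unpacked)  # True

# Without the asterisk the whole list arrives as ONE argument, not three
not_unpacked = show_args("_", key_cols)
print(len(not_unpacked[1]))  # 1
```

Unpacking is therefore just a shorter way to write the explicit call when the column names already live in a list.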

**Topics Covered:**
- `Combining Multiple String Columns`
- `concat_ws() with NULL values & empty strings`
- `Handling empty strings explicitly`
- `Concatenate numeric and string columns`
- `Concatenate array column using concat_ws()`
- `Creating composite key`

In [0]:
from pyspark.sql.functions import concat_ws, col, lit

##### 1) Combining Multiple String Columns

In [0]:
data = [("A1", "2024", "India", "M", 25000),
        ("A2", "2025", "USA", "F", 35000),
        ("A3", "2023", "UK", "M", 33000),
        ("A4", "2022", "Sweden", "F", 34500),
        ("A5", "2021", "Norway", "M", 25988),
        ("A6", "2020", "Germany", "M", 351250),
        ("A7", "2019", "France", "F", 45981),
        ("A8", "2018", "Spain", "M", 19245),
        ("A9", "2017", "Italy", "M", 32765),
        ("A10", "2016", "China", "F", 12398)]

df_01 = spark.createDataFrame(data, ["id", "year", "country", "gender", "salary"])
display(df_01)

id,year,country,gender,salary
A1,2024,India,M,25000
A2,2025,USA,F,35000
A3,2023,UK,M,33000
A4,2022,Sweden,F,34500
A5,2021,Norway,M,25988
A6,2020,Germany,M,351250
A7,2019,France,F,45981
A8,2018,Spain,M,19245
A9,2017,Italy,M,32765
A10,2016,China,F,12398


In [0]:
# Build two new columns from the string columns "id", "year" & "country":
#   "combined_key_space"      -> values joined with a space
#   "combined_key_underscore" -> values joined with an underscore
df_ws_select = df_01.select("id", "year", "country",
                            concat_ws(" ", "id", "year", "country").alias("combined_key_space"),
                            concat_ws("_", "id", "year", "country").alias("combined_key_underscore"))

display(df_ws_select)

id,year,country,combined_key_space,combined_key_underscore
A1,2024,India,A1 2024 India,A1_2024_India
A2,2025,USA,A2 2025 USA,A2_2025_USA
A3,2023,UK,A3 2023 UK,A3_2023_UK
A4,2022,Sweden,A4 2022 Sweden,A4_2022_Sweden
A5,2021,Norway,A5 2021 Norway,A5_2021_Norway
A6,2020,Germany,A6 2020 Germany,A6_2020_Germany
A7,2019,France,A7 2019 France,A7_2019_France
A8,2018,Spain,A8 2018 Spain,A8_2018_Spain
A9,2017,Italy,A9 2017 Italy,A9_2017_Italy
A10,2016,China,A10 2016 China,A10_2016_China


In [0]:
# Same result as above, built with withColumn() instead of select():
#   "combined_key_space"      -> "id", "year" & "country" joined with a space
#   "combined_key_underscore" -> "id", "year" & "country" joined with an underscore
df_ws_with = df_01.withColumn("combined_key_space", concat_ws(" ", "id", "year", "country")) \
                  .withColumn("combined_key_underscore", concat_ws("_", "id", "year", "country")) \
                  .select("id", "year", "country", "combined_key_space", "combined_key_underscore")

display(df_ws_with)

id,year,country,combined_key_space,combined_key_underscore
A1,2024,India,A1 2024 India,A1_2024_India
A2,2025,USA,A2 2025 USA,A2_2025_USA
A3,2023,UK,A3 2023 UK,A3_2023_UK
A4,2022,Sweden,A4 2022 Sweden,A4_2022_Sweden
A5,2021,Norway,A5 2021 Norway,A5_2021_Norway
A6,2020,Germany,A6 2020 Germany,A6_2020_Germany
A7,2019,France,A7 2019 France,A7_2019_France
A8,2018,Spain,A8 2018 Spain,A8_2018_Spain
A9,2017,Italy,A9 2017 Italy,A9_2017_Italy
A10,2016,China,A10 2016 China,A10_2016_China


In [0]:
# Keep all existing columns and add:
#   "combined_key_comma" -> "id", "year" & "country" joined with ", "
#   "new_salary"         -> salary doubled (a column expression needs no lit() wrapper)
df_ws_comma = df_01.select("*", concat_ws(", ", "id", "year", "country").alias("combined_key_comma"),
                                (col("salary") * 2).alias("new_salary"))
                           
display(df_ws_comma)

id,year,country,gender,salary,combined_key_comma,new_salary
A1,2024,India,M,25000,"A1, 2024, India",50000
A2,2025,USA,F,35000,"A2, 2025, USA",70000
A3,2023,UK,M,33000,"A3, 2023, UK",66000
A4,2022,Sweden,F,34500,"A4, 2022, Sweden",69000
A5,2021,Norway,M,25988,"A5, 2021, Norway",51976
A6,2020,Germany,M,351250,"A6, 2020, Germany",702500
A7,2019,France,F,45981,"A7, 2019, France",91962
A8,2018,Spain,M,19245,"A8, 2018, Spain",38490
A9,2017,Italy,M,32765,"A9, 2017, Italy",65530
A10,2016,China,F,12398,"A10, 2016, China",24796


##### 2) concat_ws() with NULL values & empty strings
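- `NULL` values are **skipped** automatically, together with their separator.
- `Empty strings` are **not skipped**: they are joined like any other value, which can leave a leading or doubled separator (e.g. `_2018_Spain`, `A10__China`).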

In [0]:
data = [("A1", "2024", "India"),
        ("A2", None, "USA"),
        ("A3", "2023", None),
        ("A4", "2022", "Sweden"),
        (None, "2021", "Norway"),
        ("A6", "2020", "Germany"),
        ("A7", "2019", None),
        ("", "2018", "Spain"),
        ("A9", None, "Italy"),
        ("A10", "", "China")]

df_null_empty = spark.createDataFrame(data, ["id", "year", "country"])
display(df_null_empty)

id,year,country
A1,2024,India
A2,,USA
A3,2023,
A4,2022,Sweden
,2021,Norway
A6,2020,Germany
A7,2019,
,2018,Spain
A9,,Italy
A10,,China


In [0]:
df_null_with = df_null_empty.withColumn("combined_key_space", concat_ws(" ", "id", "year", "country")) \
                            .withColumn("combined_key_underscore", concat_ws("_", "id", "year", "country"))

display(df_null_with)

id,year,country,combined_key_space,combined_key_underscore
A1,2024,India,A1 2024 India,A1_2024_India
A2,,USA,A2 USA,A2_USA
A3,2023,,A3 2023,A3_2023
A4,2022,Sweden,A4 2022 Sweden,A4_2022_Sweden
,2021,Norway,2021 Norway,2021_Norway
A6,2020,Germany,A6 2020 Germany,A6_2020_Germany
A7,2019,,A7 2019,A7_2019
,2018,Spain,2018 Spain,_2018_Spain
A9,,Italy,A9 Italy,A9_Italy
A10,,China,A10 China,A10__China


##### 3) Handling empty strings explicitly
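- `concat_ws()` skips `NULL` values but `not empty strings`, so empty strings must be converted to `NULL` first (here with `when(...).otherwise(...)`) if they should be skipped as well.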

In [0]:
from pyspark.sql.functions import when

dfNullEmpty = df_null_empty \
    .withColumn("full_name_space", concat_ws(" ", when(col("id") == "", None).otherwise(col("id")),
                                                  when(col("year") == "", None).otherwise(col("year")),
                                                  when(col("country") == "", None).otherwise(col("country")))) \
    .withColumn("full_name_under", concat_ws("_", when(col("id") == "", None).otherwise(col("id")),
                                                  when(col("year") == "", None).otherwise(col("year")),
                                                  when(col("country") == "", None).otherwise(col("country")))) \
    .withColumn("combined_key_underscore", concat_ws("_", "id", "year", "country"))

display(dfNullEmpty)

id,year,country,full_name_space,full_name_under,combined_key_underscore
A1,2024,India,A1 2024 India,A1_2024_India,A1_2024_India
A2,,USA,A2 USA,A2_USA,A2_USA
A3,2023,,A3 2023,A3_2023,A3_2023
A4,2022,Sweden,A4 2022 Sweden,A4_2022_Sweden,A4_2022_Sweden
,2021,Norway,2021 Norway,2021_Norway,2021_Norway
A6,2020,Germany,A6 2020 Germany,A6_2020_Germany,A6_2020_Germany
A7,2019,,A7 2019,A7_2019,A7_2019
,2018,Spain,2018 Spain,2018_Spain,_2018_Spain
A9,,Italy,A9 Italy,A9_Italy,A9_Italy
A10,,China,A10 China,A10_China,A10__China
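
The last row of the table above can be reproduced with a small plain-Python model. It is a sketch of the idea only, not Spark code: `blank_to_none` mirrors the `when(col == "", None).otherwise(col)` pattern, `join_skipping_none` stands in for `concat_ws`, and both names are hypothetical.

```python
from typing import Optional


def blank_to_none(value: Optional[str]) -> Optional[str]:
    """Treat an empty string like NULL (None); leave every other value untouched."""
    return None if value == "" else value


def join_skipping_none(sep: str, *values: Optional[str]) -> str:
    """Stand-in for concat_ws: drops None, but keeps empty strings."""
    return sep.join(v for v in values if v is not None)


row = ("A10", "", "China")

raw = join_skipping_none("_", *row)
cleaned = join_skipping_none("_", *(blank_to_none(v) for v in row))

print(raw)      # A10__China  (empty string kept, separator doubled)
print(cleaned)  # A10_China   (empty string converted to None and skipped)
```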


##### 4) Concatenate numeric and string columns
- `Spark` automatically **casts** `numbers to strings`.

In [0]:
data = [
    (101, "ADLS", 55000),
    (102, "Airflow", 25000),
    (103, "Bolb", 26700),
    (104, "databricks", 19876),
    (105, "dbt", 32786)
]

df_num_str = spark.createDataFrame(data, ["id", "product", "price"])
display(df_num_str)

id,product,price
101,ADLS,55000
102,Airflow,25000
103,Bolb,26700
104,databricks,19876
105,dbt,32786


In [0]:
df_numStr = df_num_str \
    .withColumn("product_info_dash", concat_ws(" - ", col("id"), col("product"), col("price"))) \
    .withColumn("product_info_underscore", concat_ws("_", col("id"), col("product"), col("price"))) \
    .withColumn("product_info_slash", concat_ws(" / ", col("id"), col("product"), col("price")))
display(df_numStr)

id,product,price,product_info_dash,product_info_underscore,product_info_slash
101,ADLS,55000,101 - ADLS - 55000,101_ADLS_55000,101 / ADLS / 55000
102,Airflow,25000,102 - Airflow - 25000,102_Airflow_25000,102 / Airflow / 25000
103,Bolb,26700,103 - Bolb - 26700,103_Bolb_26700,103 / Bolb / 26700
104,databricks,19876,104 - databricks - 19876,104_databricks_19876,104 / databricks / 19876
105,dbt,32786,105 - dbt - 32786,105_dbt_32786,105 / dbt / 32786


##### 5) Concatenate array column using concat_ws()
- Very useful when `flattening arrays`: the elements of an array-of-strings column are joined into a single delimited string.
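
As a rough mental model, array flattening can be sketched in plain Python. `flatten_join` is a hypothetical stand-in for `concat_ws`, lists stand in for array columns, and the assumption that `None` elements inside an array are skipped like top-level NULLs should be verified against your Spark version.

```python
def flatten_join(sep, *args):
    """Stand-in for concat_ws over a mix of strings and lists of strings."""
    parts = []
    for arg in args:
        if arg is None:
            continue  # a NULL argument contributes nothing
        if isinstance(arg, list):
            # array argument: take its elements, skipping None (assumed behaviour)
            parts.extend(x for x in arg if x is not None)
        else:
            parts.append(arg)
    return sep.join(parts)


print(flatten_join(", ", ["Java", "Python", "Spark"]))  # Java, Python, Spark
print(flatten_join(" / ", ["SQL", "Excel"]))            # SQL / Excel
```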

In [0]:
data = [
    (1, ["Java", "Python", "Spark"]),
    (2, ["SQL", "Excel"]),
    (3, ["Scala", "Python", "R"]),
    (4, ["SQL", "PySpark", "Java"]),
    (5, ["Airflow", "Devops", "SparkSQL"])
]

df_arry = spark.createDataFrame(data, ["id", "skills"])
display(df_arry)

id,skills
1,"List(Java, Python, Spark)"
2,"List(SQL, Excel)"
3,"List(Scala, Python, R)"
4,"List(SQL, PySpark, Java)"
5,"List(Airflow, Devops, SparkSQL)"


In [0]:
df_arry_conc = df_arry.withColumn("skills_str", concat_ws(", ", "skills"))
display(df_arry_conc)

id,skills,skills_str
1,"List(Java, Python, Spark)","Java, Python, Spark"
2,"List(SQL, Excel)","SQL, Excel"
3,"List(Scala, Python, R)","Scala, Python, R"
4,"List(SQL, PySpark, Java)","SQL, PySpark, Java"
5,"List(Airflow, Devops, SparkSQL)","Airflow, Devops, SparkSQL"


In [0]:
df_arry_conc_01 = df_arry.withColumn("skills_str", concat_ws("", "skills"))
display(df_arry_conc_01)

id,skills,skills_str
1,"List(Java, Python, Spark)",JavaPythonSpark
2,"List(SQL, Excel)",SQLExcel
3,"List(Scala, Python, R)",ScalaPythonR
4,"List(SQL, PySpark, Java)",SQLPySparkJava
5,"List(Airflow, Devops, SparkSQL)",AirflowDevopsSparkSQL


In [0]:
# df_arry_conc_02 = df_arry.withColumn("skills_str", concat_ws(" ", "skills"))
df_arry_conc_02 = df_arry.withColumn("skills_str", concat_ws(" / ", "skills"))
display(df_arry_conc_02)

id,skills,skills_str
1,"List(Java, Python, Spark)",Java / Python / Spark
2,"List(SQL, Excel)",SQL / Excel
3,"List(Scala, Python, R)",Scala / Python / R
4,"List(SQL, PySpark, Java)",SQL / PySpark / Java
5,"List(Airflow, Devops, SparkSQL)",Airflow / Devops / SparkSQL


##### 6) Creating composite key

In [0]:
df_key = spark.read.csv("/Volumes/azureadb/pyspark/collect/Sales_Collect.csv", header=True, inferSchema=True)
display(df_key.limit(20))

Id,dept_Id,SubDept_Id,Vehicle_Id,Vehicle_Profile_Id,Description,Vehicle_Price_Id,Vehicle_Showroom_Price,Vehicle_Showroom_Delta,Vehicle_Showroom_Payment_Date,Currency,Target_Currency,Average,Increment,Target_Simulation_Id
257,257,1,1,0,Baleno,6,72567.98,5678.01,2023-02-20,INR,INR,2381.657773,0.0,1071
264,264,1,0,0,Engine_Base,90,91768.98,12678.01,2025-06-30,INR,INR,553.8461539,0.0,1063
265,265,1,0,0,Baleno,83,8400.123,1450.01,2023-12-27,INR,INR,-7199.999999,0.0,1065
266,266,1,0,0,Engine_Base,76,77345.665,3456.01,2024-04-30,INR,INR,7200.0,0.0,1063
267,267,1,0,0,Suzuki Swift,96,974567.11,110.01,2025-12-31,INR,INR,1404.878049,0.0,1063
268,268,1,1,0,Suzuki Swift,48,49.0,0.01,2023-03-20,INR,INR,834.1253,0.0,1068
270,270,1,0,0,Wagon R,76,77345.665,3456.01,2024-03-26,INR,INR,7200.0,0.0,1065
271,271,1,0,0,Engine_Base,34,35.0,12340.0123,2023-03-20,INR,INR,1668.2506,0.0,1068
272,272,1,1,0,Creta,29,30.0,12340.0123,2023-03-20,INR,INR,-2383.215143,0.0,1071
277,277,1,0,0,Brezza,73,74567.34567,3456.01,2023-12-28,INR,INR,7440.0,0.0,1065


**Method 01**

In [0]:
df_with_id = df_key.select(
    "*",
    # concat_ws() already returns a string, so no extra cast is needed
    concat_ws(
        "_",
        col("Id"),
        col("dept_Id"),
        col("SubDept_Id"),
        col("Vehicle_Id"),
        col("Vehicle_Profile_Id"),
        col("Vehicle_Price_Id")
    ).alias("surrogate_key")
)

display(df_with_id)

Id,dept_Id,SubDept_Id,Vehicle_Id,Vehicle_Profile_Id,Description,Vehicle_Price_Id,Vehicle_Showroom_Price,Vehicle_Showroom_Delta,Vehicle_Showroom_Payment_Date,Currency,Target_Currency,Average,Increment,Target_Simulation_Id,surrogate_key
257,257,1,1,0,Baleno,6,72567.98,5678.01,2023-02-20,INR,INR,2381.657773,0.0,1071,257_257_1_1_0_6
264,264,1,0,0,Engine_Base,90,91768.98,12678.01,2025-06-30,INR,INR,553.8461539,0.0,1063,264_264_1_0_0_90
265,265,1,0,0,Baleno,83,8400.123,1450.01,2023-12-27,INR,INR,-7199.999999,0.0,1065,265_265_1_0_0_83
266,266,1,0,0,Engine_Base,76,77345.665,3456.01,2024-04-30,INR,INR,7200.0,0.0,1063,266_266_1_0_0_76
267,267,1,0,0,Suzuki Swift,96,974567.11,110.01,2025-12-31,INR,INR,1404.878049,0.0,1063,267_267_1_0_0_96
268,268,1,1,0,Suzuki Swift,48,49.0,0.01,2023-03-20,INR,INR,834.1253,0.0,1068,268_268_1_1_0_48
270,270,1,0,0,Wagon R,76,77345.665,3456.01,2024-03-26,INR,INR,7200.0,0.0,1065,270_270_1_0_0_76
271,271,1,0,0,Engine_Base,34,35.0,12340.0123,2023-03-20,INR,INR,1668.2506,0.0,1068,271_271_1_0_0_34
272,272,1,1,0,Creta,29,30.0,12340.0123,2023-03-20,INR,INR,-2383.215143,0.0,1071,272_272_1_1_0_29
277,277,1,0,0,Brezza,73,74567.34567,3456.01,2023-12-28,INR,INR,7440.0,0.0,1065,277_277_1_0_0_73


**Method 02**

In [0]:
from pyspark.sql import functions as F

In [0]:
primary_cols_mapping = ["Id", "dept_Id", "SubDept_Id", "Vehicle_Id", "Vehicle_Profile_Id", "Vehicle_Price_Id"]

In [0]:
# Select from df_key: df_with_id already contains "surrogate_key",
# so selecting from it would produce a duplicate column
final_df = (
    df_key.select(
        concat_ws("_", *primary_cols_mapping).alias("surrogate_key"),
        *[col(c) for c in df_key.columns]
    )
    .distinct()
)

display(final_df.limit(20))

surrogate_key,Id,dept_Id,SubDept_Id,Vehicle_Id,Vehicle_Profile_Id,Description,Vehicle_Price_Id,Vehicle_Showroom_Price,Vehicle_Showroom_Delta,Vehicle_Showroom_Payment_Date,Currency,Target_Currency,Average,Increment,Target_Simulation_Id
342_342_1_1_0_7,342,342,1,1,0,Grand Vitara,7,56432.60486,1289.0001,2023-05-19,INR,INR,1118619.911,-268694.1996,1071
357_357_1_2_0_52,357,357,1,2,0,Seltos,52,53.0,134560.01,2023-03-20,INR,INR,-99.30063095,0.0,1071
453_453_1_0_0_33,453,453,1,0,0,Suzuki Swift,33,34.0,12340.0123,2023-03-20,INR,INR,-20.33516498,0.0,1068
372_372_1_1_10_10,372,372,1,1,10,Grand Vitara,10,6789900.995,1289.0001,2024-12-20,INR,INR,-11681012.54,22673055.9,1071
448_448_1_1_1_65,448,448,1,1,1,Engine_Base,65,66345.789,45678.01,2023-06-20,INR,INR,-40521.40582,0.0,1068
481_481_1_0_9_82,481,481,1,0,9,Engine_Base,82,83345.1234,1450.01,2024-11-20,INR,INR,-40419.26229,0.0,1068
481_481_1_0_3_10,481,481,1,0,3,Grand Vitara,10,6789900.995,1289.0001,2024-05-20,INR,INR,2246830.56,-1671560.202,1068
489_489_1_1_1_65,489,489,1,1,1,EXAUST_SYSTEM,65,66345.789,45678.01,2023-06-20,INR,INR,1236230.737,0.00029802,1071
466_466_1_0_0_28,466,466,1,0,0,Suzuki Swift,28,29.0,12340.0123,2023-03-20,INR,INR,104.2472325,0.0,1068
310_310_1_0_5_8,310,310,1,0,5,Grand Vitara,8,3894560.999,1289.0001,2023-08-21,INR,INR,-1156473.99,441708.8814,1068
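
One caveat when building keys this way: because NULLs are skipped, two different rows can collapse to the same key. The plain-Python sketch below illustrates the collision and one possible mitigation. It is a model, not Spark code; the function names are hypothetical, and the `"~"` placeholder is an assumption that only works if that value never occurs in the real data. In Spark the same idea could be expressed by wrapping each key column in `coalesce` with a literal placeholder before calling `concat_ws`.

```python
def join_skipping_none(sep, *values):
    """Stand-in for concat_ws: None values are dropped before joining."""
    return sep.join(v for v in values if v is not None)


def join_with_placeholder(sep, *values, placeholder="~"):
    """Keep every key position by substituting a marker for None."""
    return sep.join(placeholder if v is None else v for v in values)


a = ("257", None, "1")   # second key part missing
b = ("257", "1", None)   # third key part missing

# Both rows collapse to "257_1" when None is simply skipped
print(join_skipping_none("_", *a) == join_skipping_none("_", *b))  # True

# With a placeholder each position survives, so the keys differ
print(join_with_placeholder("_", *a))  # 257_~_1
print(join_with_placeholder("_", *b))  # 257_1_~
```

A similar ambiguity can arise when a value itself contains the separator, so the separator is best chosen from characters that do not appear in the key columns.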


##### Summary of use cases

| Use Case        | Example                         |
| --------------- | ------------------------------- |
| Merge columns   | First name + last name          |
| Create IDs      | Combine `id`, `year`, `country` |
| Dynamic lists   | `concat_ws("_", *cols)`         |
| Ignore nulls    | Handled automatically           |
| ETL keys        | `PRODUCT_KEY`, `HEADER_ID`      |

**Key Points to Remember**

- `Separator is mandatory`
- `Skips NULL values, not empty strings`
- Works with `strings`, `numbers` (cast to strings) and `arrays of strings`
- Commonly used for:
  - `Full name creation`
  - `Composite keys`
  - `Address formatting`
  - `Flattening array columns`