**translate()**

- **translate()** is helpful for **replacing or removing individual characters** in a **string** column, making it easy to **clean up or modify** text data.

- You can perform **simple replacements**, **remove unwanted characters**, or apply several **character-for-character substitutions** in a single call.

**Syntax**

     from pyspark.sql.functions import translate
     translate(column, matching_chars, replacement_chars)

- **column:** The column name (or expression) where you want to perform the translation.
- **matching_chars:** A **string of characters** to be **replaced**.
- **replacement_chars:** The **string of characters** that will **replace** the **matching characters**.

  - If **replacement_chars** is shorter than **matching_chars**, the extra characters in **matching_chars** have no counterpart and are **removed** from the result.
  - If **replacement_chars** is longer than **matching_chars**, the extra characters in **replacement_chars** are **ignored**.

- Characters in **matching_chars** are **mapped one-to-one** to characters in **replacement_chars**.
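
The mapping rules above can be modelled in plain Python, which makes them easy to try without a Spark session. This is only a sketch of the semantics described in this section, not Spark's implementation. The helper name `translate_model` is hypothetical, and the sketch assumes `matching_chars` contains no repeated characters.

```python
def translate_model(text, matching_chars, replacement_chars):
    """Pure-Python model of the translate() rules described above (hypothetical helper)."""
    table = {}
    for i, ch in enumerate(matching_chars):
        if i < len(replacement_chars):
            # One-to-one mapping: i-th matching char -> i-th replacement char
            table[ord(ch)] = replacement_chars[i]
        else:
            # No counterpart in replacement_chars: the character is deleted
            table[ord(ch)] = None
    # Extra characters in replacement_chars are never looked up, so they have no effect
    return text.translate(table)


print(translate_model("spark@456", "abc", "xyz"))  # equal lengths
print(translate_model("apple pie", "aei", "xy"))   # replacement shorter: 'i' is deleted
print(translate_model("apple pie", "a", "xyz"))    # replacement longer: 'y' and 'z' unused
```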


In [0]:
import pyspark.sql.functions as F
from pyspark.sql.functions import translate, col

**1) Replace specific characters**
- Replace characters **a, b, c** with **x, y, z**.

In [0]:
data = [("abc@123",), ("spark@456",), ("xyz@789",), ("Chethan@456",), ("Balu@123",)]
df = spark.createDataFrame(data, ["Description"])

# Use translate to replace 'a' with 'x', 'b' with 'y' and 'c' with 'z'
df_abc = df.withColumn("translated_text", translate("Description", "abc", "xyz"))
display(df_abc)

Description,translated_text
abc@123,xyz@123
spark@456,spxrk@456
xyz@789,xyz@789
Chethan@456,Chethxn@456
Balu@123,Bxlu@123


- Let's **replace all occurrences** of **'a' with '#'** and **'e' with '$'** in a **text** column.

In [0]:
# Sample DataFrame
data = [("Harish",), ("Rakesh",), ("Swetha",), ("Radha",), ("Sekhar",)]
columns = ["Names"]

df1 = spark.createDataFrame(data, columns)

# Use translate to replace 'a' with '#' and 'e' with '$'
df_ae = df1.withColumn("translated_text", translate("Names", "ae", "#$"))
display(df_ae)

Names,translated_text
Harish,H#rish
Rakesh,R#k$sh
Swetha,Sw$th#
Radha,R#dh#
Sekhar,S$kh#r


- For instance, replace **a with 1**, **e with 2**, and **p with 3**.

In [0]:
df_aep = df.withColumn("translated_text", translate("Description", "aep", "123"))
display(df_aep)

Description,translated_text
abc@123,1bc@123
spark@456,s31rk@456
xyz@789,xyz@789
Chethan@456,Ch2th1n@456
Balu@123,B1lu@123


In [0]:
df_str_int = df.withColumn("trans_string", translate("Description", "abc", "xyz")) \
               .withColumn("trans_int", translate("Description", "123456789", "ABCDEFGHI"))
display(df_str_int)

Description,trans_string,trans_int
abc@123,xyz@123,abc@ABC
spark@456,spxrk@456,spark@DEF
xyz@789,xyz@789,xyz@GHI
Chethan@456,Chethxn@456,Chethan@DEF
Balu@123,Bxlu@123,Balu@ABC


**2) Remove unwanted characters**
- Remove **digits** by translating them to an **empty** string.

In [0]:
df.withColumn("no_digits", translate("Description", "0123456789", "")).display()

Description,no_digits
abc@123,abc@
spark@456,spark@
xyz@789,xyz@
Chethan@456,Chethan@
Balu@123,Balu@


**3) Removing Specific Characters**
- If you want to **remove** specific characters from the string, you can use **translate()** by **replacing** those characters with an **empty string**.


In [0]:
# Sample DataFrame
data = [(101, "Desk01", "$100,00/-"),
        (102, "Desk02", "#FootBall!"),
        (103, "Desk03", "!Cricket, @Tennis, #Football and $20,000!"),
        (104, "Desk04", "#2025-07-24$T22:55:58Z%"),
        (105, "Desk05", ":Medium;, Small<> & Large!")]
                              
columns = ["S.No", "Location", "Description"]

df_rm = spark.createDataFrame(data, columns)
display(df_rm)

S.No,Location,Description
101,Desk01,"$100,00/-"
102,Desk02,#FootBall!
103,Desk03,"!Cricket, @Tennis, #Football and $20,000!"
104,Desk04,#2025-07-24$T22:55:58Z%
105,Desk05,":Medium;, Small<> & Large!"


In [0]:
characters_to_remove = "$#!%;:<>" # Characters you want to remove
df_cleaned = df_rm.withColumn("clean_Description", translate("Description", characters_to_remove, ""))
display(df_cleaned)

S.No,Location,Description,clean_Description
101,Desk01,"$100,00/-","100,00/-"
102,Desk02,#FootBall!,FootBall
103,Desk03,"!Cricket, @Tennis, #Football and $20,000!","Cricket, @Tennis, Football and 20,000"
104,Desk04,#2025-07-24$T22:55:58Z%,2025-07-24T225558Z
105,Desk05,":Medium;, Small<> & Large!","Medium, Small & Large"


**4) Digit Masking**

In [0]:
# Sample DataFrame with more rows
data = [
    ("My number is 9876543210",),
    ("Card number: 1234-5678-9012",),
    ("Call me at 987-654-3210",),
    ("ID: AB123CD",),
    ("Account: 1234567890",),
    ("Phone: 8008008008",),
    ("Serial No: SN09876AB",),
    ("Emergency contact: 1122334455",),
    ("Alternate number: 9009090909",),
    ("Code: 000111",),
    ("Backup ID: ZX98765YU",),
    ("Booking ref: 321654987",),
    ("Contact number is 700-111-2222",),
    ("OTP sent is 456789",),
    ("My PAN is A1234BCD",),
    ("Mobile: 98765 43210",),
    ("Employee ID: EMP123456",)
]

columns = ["text"]

df_mask = spark.createDataFrame(data, columns)

display(df_mask)


text
My number is 9876543210
Card number: 1234-5678-9012
Call me at 987-654-3210
ID: AB123CD
Account: 1234567890
Phone: 8008008008
Serial No: SN09876AB
Emergency contact: 1122334455
Alternate number: 9009090909
Code: 000111


In [0]:
# Mask digits using translate (0–9 → *)
digits = '0123456789'
mask_char = '*' * len(digits)  # '**********'

df_masked = df_mask.withColumn("masked_text", translate("text", digits, mask_char))

# Display the result
display(df_masked)

text,masked_text
My number is 9876543210,My number is **********
Card number: 1234-5678-9012,Card number: ****-****-****
Call me at 987-654-3210,Call me at ***-***-****
ID: AB123CD,ID: AB***CD
Account: 1234567890,Account: **********
Phone: 8008008008,Phone: **********
Serial No: SN09876AB,Serial No: SN*****AB
Emergency contact: 1122334455,Emergency contact: **********
Alternate number: 9009090909,Alternate number: **********
Code: 000111,Code: ******
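
The mask string is built with `'*' * len(digits)` for a reason: with a single `'*'` as the replacement, only the first character of `digits` would have a counterpart, and the remaining digits would be deleted rather than masked. Below is a minimal plain-Python sketch of that difference. It uses `str.translate` as a stand-in for the Spark function (an assumption, so the example runs without a cluster), and `build_table` is a hypothetical helper.

```python
digits = "0123456789"

def build_table(matching, replacement):
    # Hypothetical helper: characters without a counterpart map to None (deleted)
    return {ord(m): (replacement[i] if i < len(replacement) else None)
            for i, m in enumerate(matching)}

text = "Code: 000111"
full_mask = text.translate(build_table(digits, "*" * len(digits)))
short_mask = text.translate(build_table(digits, "*"))

print(full_mask)   # every digit becomes '*'
print(short_mask)  # only '0' becomes '*'; the '1's are dropped
```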


**5) Stripping punctuation**
- You can easily drop punctuation by translating it to an empty string.

In [0]:
# Extended Data with various punctuation
data2 = [
    ("Hello, world!",),
    ("Spark: fast.",),
    ("Good morning!",),
    ("What's your name?",),
    ("Clean-text, now.",),
    ("Remove (brackets) and [squares].",),
    ("Multiple!!! Exclamations!!!",),
    ("Colon: Semicolon; Period.",),
    ("--Dashes-- and under_scores__",),
    ("Punctuation: gone!",),
    ("End with a period.",),
    ("Numbers 1,234.56 should stay.",),
    ("Quotes 'single' and \"double\"",),
]

# Create DataFrame
dfp = spark.createDataFrame(data2, ["Punctuation"])
display(dfp)

Punctuation
"Hello, world!"
Spark: fast.
Good morning!
What's your name?
"Clean-text, now."
Remove (brackets) and [squares].
Multiple!!! Exclamations!!!
Colon: Semicolon; Period.
--Dashes-- and under_scores__
Punctuation: gone!


In [0]:
# Define punctuation characters to remove
punctuation = ",:!?.;()-[]'\"_"

empty_map = ''  # we want to remove these chars

# Use translate to remove punctuation
df_punct = dfp.withColumn("clean", translate(col("Punctuation"), punctuation, empty_map))

# Show results
display(df_punct)

Punctuation,clean
"Hello, world!",Hello world
Spark: fast.,Spark fast
Good morning!,Good morning
What's your name?,Whats your name
"Clean-text, now.",Cleantext now
Remove (brackets) and [squares].,Remove brackets and squares
Multiple!!! Exclamations!!!,Multiple Exclamations
Colon: Semicolon; Period.,Colon Semicolon Period
--Dashes-- and under_scores__,Dashes and underscores
Punctuation: gone!,Punctuation gone


**6) Case‑folding (manual)**
- Map **uppercase → lowercase (or vice versa)** by providing the full alphabet strings. This is verbose; the built-in **lower()** and **upper()** functions do the same job more directly.

In [0]:
# Sample data for manual case-folding (mixed casing, acronyms, etc.)
data = [
    ("HELLO WORLD",),
    ("This Is a Title Case Sentence",),
    ("PySpark Is COOL",),
    ("email: USER@DOMAIN.COM",),
    ("123ABCxyz",),
    ("UPPER and lower MIXED",),
    ("NASA and ISRO",),
    ("Sentence with Numbers 12345",),
    ("CamelCaseVariableName",),
    ("ALL CAPS TEXT",),
    ("MiXeD CaSe StrInG",),
]

columns = ["Comments"]
df_fold = spark.createDataFrame(data, columns)
display(df_fold)

Comments
HELLO WORLD
This Is a Title Case Sentence
PySpark Is COOL
email: USER@DOMAIN.COM
123ABCxyz
UPPER and lower MIXED
NASA and ISRO
Sentence with Numbers 12345
CamelCaseVariableName
ALL CAPS TEXT


In [0]:
import string

# Manual case-folding: UPPER → lower
upper = string.ascii_uppercase   # "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lower = string.ascii_lowercase   # "abcdefghijklmnopqrstuvwxyz"

df_fold_upr = df_fold.withColumn("lowered_manual", translate(col("Comments"), upper, lower)) \
                     .withColumn("upper_manual", translate(col("Comments"), lower, upper))

# Show result
display(df_fold_upr)

Comments,lowered_manual,upper_manual
HELLO WORLD,hello world,HELLO WORLD
This Is a Title Case Sentence,this is a title case sentence,THIS IS A TITLE CASE SENTENCE
PySpark Is COOL,pyspark is cool,PYSPARK IS COOL
email: USER@DOMAIN.COM,email: user@domain.com,EMAIL: USER@DOMAIN.COM
123ABCxyz,123abcxyz,123ABCXYZ
UPPER and lower MIXED,upper and lower mixed,UPPER AND LOWER MIXED
NASA and ISRO,nasa and isro,NASA AND ISRO
Sentence with Numbers 12345,sentence with numbers 12345,SENTENCE WITH NUMBERS 12345
CamelCaseVariableName,camelcasevariablename,CAMELCASEVARIABLENAME
ALL CAPS TEXT,all caps text,ALL CAPS TEXT
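
One limitation of the manual approach is worth keeping in mind: the alphabet strings cover only the 26 ASCII letters, so accented or other non-ASCII letters pass through unchanged. The plain-Python sketch below illustrates this, with `str.translate` as a stand-in for the Spark function (an assumption, so it runs without a Spark session) and Python's `str.lower()` shown for comparison. The sample string is made up for illustration.

```python
import string

# Same ASCII-only mapping as in the cell above: A-Z -> a-z
ascii_fold = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

sample = "CAF\u00c9 M\u00dcNCHEN"   # hypothetical input: "CAFÉ MÜNCHEN"
manual = sample.translate(ascii_fold)
builtin = sample.lower()

print(manual)    # the accented capitals are left untouched by the ASCII-only table
print(builtin)   # Unicode-aware lowercasing handles them
```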


**7) Swap numbers:**

     1 → 9, 2 → 8, 3 → 7
     4 → 6, 5 → 5, 6 → 4
     7 → 3, 8 → 2, 9 → 1

In [0]:
df.withColumn("swapped_digits", translate("Description", "123456789", "987654321")).display()

Description,swapped_digits
abc@123,abc@987
spark@456,spark@654
xyz@789,xyz@321
Chethan@456,Chethan@654
Balu@123,Balu@987
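
The swap works because every character is looked up once against the original mapping, so `1 → 9` and `9 → 1` do not interfere with each other. Chaining single-character replacements would behave differently, since later steps rewrite the output of earlier ones. Here is a plain-Python sketch of the contrast, using `str.translate` as a stand-in for the Spark function (an assumption, so it runs without a cluster):

```python
src, dst = "123456789", "987654321"
text = "123456789"

# Simultaneous mapping: each character is translated exactly once
simultaneous = text.translate(str.maketrans(src, dst))

# Sequential replacement: each step also sees characters produced by earlier steps
chained = text
for old, new in zip(src, dst):
    chained = chained.replace(old, new)

print(simultaneous)
print(chained)
```

The two results differ, which is why a single `translate()` call is the safer choice for swaps like this one.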


**8) Convert lowercase to uppercase manually (letters a–i only)**

     a → A, b → B, c → C
     d → D, e → E, f → F
     g → G, h → H, i → I

In [0]:
df.withColumn("upper_case", translate("Description", "abcdefghi", "ABCDEFGHI")).display()

Description,upper_case
abc@123,ABC@123
spark@456,spArk@456
xyz@789,xyz@789
Chethan@456,CHEtHAn@456
Balu@123,BAlu@123


**9) Handling Characters of Different Lengths**
- If **matching_chars and replacement_chars** have **different lengths**, PySpark **maps characters one by one** by position. Characters in **matching_chars** that have no counterpart are **removed** from the result; extra characters in **replacement_chars** are ignored.
- Here, the **last character (i)** in **"aei"** has no corresponding replacement, so every **i** is removed from the text.

     translate(col("text"), "aei", "xy")

     'a' → 'x'
     'e' → 'y'
     'i' → ❌ No mapping → gets removed


In [0]:
from pyspark.sql.functions import translate, col

# Sample data containing 'a', 'e', and 'i'
data = [
    ("apple pie",),
    ("eagle",),
    ("ice cream",),
    ("aim",),
    ("fine wine",),
    ("banana",),
    ("juice",),
    ("nice",),
    ("elite",),
]

columns = ["text"]
df_diff = spark.createDataFrame(data, columns)

# Translate with unequal lengths: 'aei' → 'xy'
# 'a' → 'x', 'e' → 'y', 'i' → removed
df_diff_len = df_diff.withColumn("translated", translate(col("text"), "aei", "xy"))

# Show the result
display(df_diff_len)

text,translated
apple pie,xpply py
eagle,yxgly
ice cream,cy cryxm
aim,xm
fine wine,fny wny
banana,bxnxnx
juice,jucy
nice,ncy
elite,ylty


**10) Using translate() with when() to Apply Translation Conditionally**

     # If the "Description" contains "a", then apply character replacements:
          'a' → 'x'
          'e' → 'y'
          'i' → 'z'
     # Otherwise, keep the original "Description".

In [0]:
from pyspark.sql.functions import when, translate, col

# Sample data
data = [
    ("apple pie",),
    ("orange juice",),
    ("Ice cream",),
    ("milk",),
    ("banana",),
    ("tea",),
    ("coffee",),
    ("sugar",),
    ("energy drink",),
    ("juice",)
]

columns = ["Description"]
df_trans = spark.createDataFrame(data, columns)

# Apply condition: if 'a' is in Description, then translate 'aei' → 'xyz'
df_trns_aei = df_trans.withColumn(
    "conditioned_text",
    when(col("Description").contains("a"), translate("Description", "aei", "xyz")).otherwise(col("Description"))
)

# Display result
display(df_trns_aei)

Description,conditioned_text
apple pie,xpply pzy
orange juice,orxngy juzcy
Ice cream,Icy cryxm
milk,milk
banana,bxnxnx
tea,tyx
coffee,coffee
sugar,sugxr
energy drink,energy drink
juice,juice


**11) Replacing characters in PySpark Column**
- Suppose we wanted to make the following character replacements:

      A -> #
      e -> @
      o -> %

In [0]:
# Sample data with names (containing A, e, o)
data = [
    ("Alice",),
    ("George",),
    ("Amol",),
    ("Eon",),
    ("Rakesh",),
    ("Leona",),
    ("Sonia",),
    ("Arjun",),
    ("Deepak",),
    ("Monica",),
]

columns = ["name"]
df_repl = spark.createDataFrame(data, columns)

# Apply custom character translation: A → #, e → @, o → %
df_new_repl = df_repl.withColumn("name_masked", translate(col("name"), "Aeo", "#@%"))

# Show results
display(df_new_repl)

name,name_masked
Alice,#lic@
George,G@%rg@
Amol,#m%l
Eon,E%n
Rakesh,Rak@sh
Leona,L@%na
Sonia,S%nia
Arjun,#rjun
Deepak,D@@pak
Monica,M%nica


**Notes:**
- Only **single-character** mappings are supported.
- It **does not support substring** replacement (use **regexp_replace** for that).
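
The difference between per-character mapping and substring replacement can be seen in a small plain-Python sketch. Here `str.translate` stands in for `translate()`, and `str.replace` stands in for a substring replacement such as `regexp_replace` with a literal pattern. Both stand-ins are assumptions made so the example runs without Spark.

```python
text = "cat act tac"

# Per-character mapping: c -> d, a -> o, t -> g, wherever those characters occur
per_char = text.translate(str.maketrans("cat", "dog"))

# Substring replacement: only the exact sequence "cat" is replaced
substring = text.replace("cat", "dog")

print(per_char)
print(substring)
```

The per-character version rewrites every `c`, `a`, and `t` regardless of order, while the substring version touches only the first word.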