
## 1. String Functions

Spark SQL provides a comprehensive set of functions for manipulating string data, essential for data cleaning, parsing, and feature engineering. These functions are available in `pyspark.sql.functions`.

### Common PySpark String Functions:

| Function                              | Description                                                                 |
| :------------------------------------ | :-------------------------------------------------------------------------- |
| `substring(str, pos, len)`            | Extracts a substring. `pos` is 1-based.                                   |
| `concat(*cols)`                       | Concatenates multiple string columns. Returns null if any input is null.    |
| `concat_ws(sep, *cols)`               | Concatenates multiple string columns, separated by `sep`. Null inputs are skipped. |
| `split(str, pattern)`                 | Splits a string into an array of strings around matches of `pattern`, which is treated as a regular expression. |
| `trim(str)`                           | Removes leading and trailing spaces.                                        |
| `ltrim(str)`                          | Removes leading spaces.                                                     |
| `rtrim(str)`                          | Removes trailing spaces.                                                    |
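| `lpad(str, len, pad)`                 | Pads `str` on the left with `pad` until it reaches `len` characters. If `str` is longer than `len`, it is truncated to `len`. |
| `rpad(str, len, pad)`                 | Pads `str` on the right with `pad` until it reaches `len` characters. If `str` is longer than `len`, it is truncated to `len`. |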
| `lower(str)`                          | Converts a string to lowercase.                                             |
| `upper(str)`                          | Converts a string to uppercase.                                             |
| `length(str)`                         | Returns the character length of a string.                                   |
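
The 1-based `pos` of `substring` and the fixed target length of `lpad`/`rpad` are easy to get wrong, so here is a minimal plain-Python sketch of those semantics. The three functions below are stand-ins written for this note, not PySpark calls; they only model positive `pos` values and are meant for reasoning about expected results before running the real functions on a DataFrame.

```python
def substring(s: str, pos: int, length: int) -> str:
    """Plain-Python model of substring(str, pos, len) for pos >= 1 only."""
    if pos < 1:
        raise ValueError("this sketch only models pos >= 1")
    start = pos - 1  # convert the 1-based position to a 0-based Python index
    return s[start:start + max(length, 0)]


def lpad(s: str, length: int, pad: str) -> str:
    """Model of lpad: pad on the left up to `length`, or truncate a longer string."""
    if length <= 0:
        return ""
    if len(s) >= length or not pad:
        return s[:length]
    fill = (pad * length)[: length - len(s)]  # repeat the pad, then cut to fit
    return fill + s


def rpad(s: str, length: int, pad: str) -> str:
    """Model of rpad: pad on the right up to `length`, or truncate a longer string."""
    if length <= 0:
        return ""
    if len(s) >= length or not pad:
        return s[:length]
    fill = (pad * length)[: length - len(s)]
    return s + fill


email_part = substring("alice.smith@example.com", 7, 10)  # starts at the 7th character
first_three = substring("Alice", 1, 3)                    # pos=1 is the first character
padded_phone = lpad("123-456", 10, "0")
padded_name = rpad("Alice", 10, "*")
truncated = lpad("Charlie", 3, "0")                       # longer than 3, so it is cut

print(email_part)    # smith@exam
print(first_three)   # Ali
print(padded_phone)  # 000123-456
print(padded_name)   # Alice*****
print(truncated)     # Cha
```

The last call shows the case that most often surprises: padding to a length shorter than the input shortens the value instead of leaving it unchanged.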

### Example Code:

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring, concat, concat_ws, split, \
    trim, ltrim, rtrim, lpad, rpad, lower, upper, length, lit

spark = SparkSession.builder.appName("StringFunctions").getOrCreate()

data = [("Alice", "alice.smith@example.com", " NYC ", "123-456"),
        ("Bob", "bob.jones@example.com", "LA", "789-012"),
        ("Charlie", "charlie@gmail.com", "SFO", "345-678")]
columns = ["Name", "Email", "City", "PhonePrefix"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show(truncate=False)

# Substring
print("\nSubstring:")
df.withColumn("EmailDomain_Part", substring(col("Email"), 7, 10)).show() # 10 characters starting at the 7th (1-based), e.g. 'smith@exam' for Alice
df.withColumn("First3CharsName", substring(col("Name"), 1, 3)).show()

# Concat / Concat_WS
print("\nConcat / Concat_WS:")
df.withColumn("FullEmail", concat(col("Name"), lit("<"), col("Email"), lit(">"))).show(truncate=False)
df.withColumn("FormattedCity", concat_ws("-", lit("City"), col("City"))).show()

# Split
print("\nSplit Email by '@':")
df.withColumn("EmailParts", split(col("Email"), "@")).show(truncate=False)
df.withColumn("Domain", split(col("Email"), "@")[1]).show(truncate=False) # Array indices are 0-based, so [1] is the part after '@'

# Trim, Ltrim, Rtrim
print("\nTrim:")
df.withColumn("TrimmedCity", trim(col("City"))).show()
df.withColumn("LTrimmedCity", ltrim(col("City"))).show()
df.withColumn("RTrimmedCity", rtrim(col("City"))).show()

# Lpad, Rpad
print("\nLpad / Rpad:")
df.withColumn("PaddedPhone", lpad(col("PhonePrefix"), 10, "0")).show() # Pad with '0' to length 10
df.withColumn("PaddedName", rpad(col("Name"), 10, "*")).show() # Pad with '*' to length 10

# Lower, Upper
print("\nLower / Upper:")
df.withColumn("LowerName", lower(col("Name"))) \
  .withColumn("UpperCity", upper(col("City"))).show()

# Length
print("\nLength of Name:")
df.withColumn("NameLength", length(col("Name"))).show()

spark.stop()

Original DataFrame:
+-------+-----------------------+-----+-----------+
|Name   |Email                  |City |PhonePrefix|
+-------+-----------------------+-----+-----------+
|Alice  |alice.smith@example.com| NYC |123-456    |
|Bob    |bob.jones@example.com  |LA   |789-012    |
|Charlie|charlie@gmail.com      |SFO  |345-678    |
+-------+-----------------------+-----+-----------+


Substring:
+-------+--------------------+-----+-----------+----------------+
|   Name|               Email| City|PhonePrefix|EmailDomain_Part|
+-------+--------------------+-----+-----------+----------------+
|  Alice|alice.smith@examp...| NYC |    123-456|      smith@exam|
|    Bob|bob.jones@example...|   LA|    789-012|      nes@exampl|
|Charlie|   charlie@gmail.com|  SFO|    345-678|      e@gmail.co|
+-------+--------------------+-----+-----------+----------------+

+-------+--------------------+-----+-----------+---------------+
|   Name|               Email| City|PhonePrefix|First3CharsName|
+-------+-
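
The second argument of `split` is a regular expression, which matters as soon as the delimiter is a regex metacharacter such as `.`. The sketch below uses Python's `re.split` as a local stand-in for the Spark call. That is an assumption that should hold for simple patterns like these; Spark itself evaluates Java regular expressions, so more advanced patterns may behave differently.

```python
import re

email = "alice.smith@example.com"

# "@" has no special meaning in a regex, so it acts as a literal delimiter.
user, domain = re.split("@", email)

# An unescaped "." matches every character, so each character becomes a
# delimiter and only empty strings are left.
unescaped_parts = re.split(".", domain)

# Escaping the dot splits on a literal "." instead.
escaped_parts = re.split(r"\.", domain)

print(user, domain)     # alice.smith example.com
print(unescaped_parts)  # a list of empty strings
print(escaped_parts)    # ['example', 'com']
```

In PySpark the escaped form would be written as `split(col("Email"), r"\.")`.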

---

## 2. Regex Functions

Regular expressions (regex) are powerful tools for complex string pattern matching and manipulation. PySpark provides `regexp_extract` and `regexp_replace` for this; both take patterns in Java regular-expression syntax.

### PySpark Regex Functions:

| Function                                     | Description                                                                      |
| :------------------------------------------- | :------------------------------------------------------------------------------- |
| `regexp_extract(str, pattern, idx)`          | Extracts the part of `str` matched by `pattern`. `idx` specifies which capture group to return (0 for the entire match, 1 for the first group, etc.). Returns an empty string if the pattern does not match. |
| `regexp_replace(str, pattern, replacement)`  | Replaces all substrings in `str` that match the `pattern` with `replacement`.    |
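
To make the role of `idx` and the no-match behaviour concrete, here is a small plain-Python sketch built on the standard `re` module. The two functions are stand-ins written for this note, not the PySpark functions themselves, and they assume patterns simple enough that Python and Java regex agree. One known difference: in a replacement string Java refers to groups as `$1`, while Python uses `\1`, so the sketch sticks to a literal replacement.

```python
import re


def regexp_extract(s: str, pattern: str, idx: int) -> str:
    """Model: group `idx` of the first match, or "" when nothing matches."""
    match = re.search(pattern, s)
    if match is None:
        return ""
    return match.group(idx) or ""  # an unmatched optional group also gives ""


def regexp_replace(s: str, pattern: str, replacement: str) -> str:
    """Model: replace every match of `pattern` with a literal `replacement`."""
    return re.sub(pattern, lambda _m: replacement, s)


order = "Order #123456 - Product X (Qty: 2)"

whole_match = regexp_extract(order, r"#(\d+)", 0)  # idx 0: the entire match
order_id = regexp_extract(order, r"#(\d+)", 1)     # idx 1: first capture group
no_match = regexp_extract("Product Y - REF_789012", r"#(\d+)", 1)
masked = regexp_replace("john.doe@email.com", r"^[^@]+@", "***@")

print(whole_match)     # #123456
print(order_id)        # 123456
print(repr(no_match))  # ''
print(masked)          # ***@email.com
```

Checking a pattern this way before putting it into a `withColumn` call makes it easier to tell a wrong pattern apart from a row that simply has no match, since both show up as an empty string in the result.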

### Example Code:

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract, regexp_replace

spark = SparkSession.builder.appName("RegexFunctions").getOrCreate()

data = [("Alice Smith", "Order #123456 - Product X (Qty: 2)", "john.doe@email.com"),
        ("Bob Johnson", "Product Y - REF_789012", "jane_doe@gmail.com"),
        ("Charlie Brown", "No Order Info", "charlie@org.net")]
columns = ["CustomerName", "OrderDetails", "Email"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show(truncate=False)

# regexp_extract: Extract Order ID (e.g., numbers after '#')
print("\nExtracting Order ID:")
# r"#(\d+)" : "#" literal, then capture group "(\d+)" one or more digits. Index 1 for the capture group.
df.withColumn("OrderID", regexp_extract(col("OrderDetails"), r"#(\d+)", 1)).show(truncate=False)

# regexp_extract: Extract domain from email
print("\nExtracting Email Domain:")
# r"@([a-zA-Z0-9.-]+)" : "@" literal, then capture group with letters, numbers, dot, dash.
df.withColumn("EmailDomain", regexp_extract(col("Email"), r"@([a-zA-Z0-9.-]+)", 1)).show(truncate=False)

# regexp_replace: Mask part of an email (e.g., username with ***)
print("\nMasking Email Username:")
# r"^[^@]+@" : Start of string "^", one or more characters that are NOT "@" "[^@]+", then "@" literal.
df.withColumn("MaskedEmail", regexp_replace(col("Email"), r"^[^@]+@", "***@")).show(truncate=False)

# regexp_replace: Remove characters other than letters, digits, and whitespace from Customer Name
print("\nRemoving non-alphanumeric from Customer Name:")
# r"[^a-zA-Z0-9\s]" : Matches any character that is NOT a letter, digit, or whitespace, so the space between names is kept.
df.withColumn("CleanCustomerName", regexp_replace(col("CustomerName"), r"[^a-zA-Z0-9\s]", "")).show(truncate=False) # Drop \s from the class to remove spaces as well

spark.stop()

Original DataFrame:
+-------------+----------------------------------+------------------+
|CustomerName |OrderDetails                      |Email             |
+-------------+----------------------------------+------------------+
|Alice Smith  |Order #123456 - Product X (Qty: 2)|john.doe@email.com|
|Bob Johnson  |Product Y - REF_789012            |jane_doe@gmail.com|
|Charlie Brown|No Order Info                     |charlie@org.net   |
+-------------+----------------------------------+------------------+


Extracting Order ID:
+-------------+----------------------------------+------------------+-------+
|CustomerName |OrderDetails                      |Email             |OrderID|
+-------------+----------------------------------+------------------+-------+
|Alice Smith  |Order #123456 - Product X (Qty: 2)|john.doe@email.com|123456 |
|Bob Johnson  |Product Y - REF_789012            |jane_doe@gmail.com|       |
|Charlie Brown|No Order Info                     |charlie@org.net   |       |