- Author: Ben Du
- Date: 2020-06-17
- Title: String Functions in Spark
- Slug: spark-dataframe-func-string
- Category: Computer Science
- Tags: programming, Scala, Spark, DataFrame, string, round, Spark SQL, functions

https://spark.apache.org/docs/2.1.1/api/java/index.html?org/apache/spark/sql/functions.html

In [1]:
%%classpath add mvn
org.apache.spark spark-core_2.11 2.1.1
org.apache.spark spark-sql_2.11 2.1.1

In [6]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession
    .builder()
    .master("local")
    .appName("string-sample")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
spark

import spark.implicits._

org.apache.spark.sql.SparkSession$implicits$@6887b103

# Replacement Inside String

Note that `replace` replaces whole elements of a column;
it does NOT replace text inside each string element.
To replace a substring inside a string,
use either `regexp_replace` or `translate`.
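
The distinction can be sketched as follows, assuming that `replace` here refers to `DataFrameNaFunctions.replace` (reached through `df.na` in Scala). The session setup and the names `dates`, `wholeValue`, `noMatch`, and `inside` are illustrative and not part of the original notebook.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_replace

val spark = SparkSession.builder().master("local").appName("replace-sketch").getOrCreate()
import spark.implicits._

val dates = Seq("2017/01/01", "2017/02/01").toDF("date")

// na.replace swaps whole cell values: only cells equal to "2017/01/01" change.
val wholeValue = dates.na.replace("date", Map("2017/01/01" -> "2017-01-01"))

// A substring such as "/" never equals a whole cell here, so nothing changes.
val noMatch = dates.na.replace("date", Map("/" -> "-"))

// regexp_replace rewrites every match inside each string.
val inside = dates.withColumn("date", regexp_replace($"date", "/", "-"))
```

Calling `.show` on each result should make the difference visible: only `inside` has the slashes replaced in every row.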

In [7]:
import org.apache.spark.sql.functions._

val df = Seq(
    ("2017/01/01", 1),
    ("2017/02/01", 2)
).toDF("date", "month")
df.show

+----------+-----+
|      date|month|
+----------+-----+
|2017/01/01|    1|
|2017/02/01|    2|
+----------+-----+




## regexp_replace

In [11]:
df.withColumn("date", regexp_replace($"date", "/", "-")).show

+----------+-----+
|      date|month|
+----------+-----+
|2017-01-01|    1|
|2017-02-01|    2|
+----------+-----+



## translate

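Note that `translate` differs from ordinary substring replacement:
it maps each character of the matching string to the character at the same position in the replacement string.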

In [8]:
df.withColumn("date", translate($"date", "/", "-")).show

+----------+-----+
|      date|month|
+----------+-----+
|2017-01-01|    1|
|2017-02-01|    2|
+----------+-----+
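
With a single-character match, `translate` and `regexp_replace` give the same result, as the two cells above show. A minimal sketch of where they diverge, using a two-character argument; the session setup and the names `dates`, `translated`, and `replaced` are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{regexp_replace, translate}

val spark = SparkSession.builder().master("local").appName("translate-sketch").getOrCreate()
import spark.implicits._

val dates = Seq("2017/01/01").toDF("date")

// translate works character by character: every '/' becomes '-' and every '0' becomes 'o'.
val translated = dates.select(translate($"date", "/0", "-o").as("date"))

// regexp_replace treats "/0" as one pattern and replaces only that two-character sequence.
val replaced = dates.select(regexp_replace($"date", "/0", "-o").as("date"))
```

Here `translated` holds `2o17-o1-o1`, while `replaced` holds `2017-o1-o1`: the `0` in the year is touched only by `translate`.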




## substring

1. `substring(str, pos, len)` uses a 1-based position.

2. `substring` on `null` returns `null`.

In [9]:
import org.apache.spark.sql.functions._

val df = Seq(
    ("2017/01/01", 1),
    ("2017/02/01", 2),
    (null, 3)
).toDF("date", "month")
df.show

+----------+-----+
|      date|month|
+----------+-----+
|2017/01/01|    1|
|2017/02/01|    2|
|      null|    3|
+----------+-----+




In [10]:
df.withColumn("year", substring($"date", 1, 4)).show

+----------+-----+----+
|      date|month|year|
+----------+-----+----+
|2017/01/01|    1|2017|
|2017/02/01|    2|2017|
|      null|    3|null|
+----------+-----+----+




In [11]:
df.withColumn("month", substring($"date", 6, 2)).show

+----------+-----+
|      date|month|
+----------+-----+
|2017/01/01|   01|
|2017/02/01|   02|
|      null| null|
+----------+-----+




In [12]:
// Characters 9-10 hold the day; note that the result overwrites the existing "month" column.
df.withColumn("month", substring($"date", 9, 2)).show

+----------+-----+
|      date|month|
+----------+-----+
|2017/01/01|   01|
|2017/02/01|   01|
|      null| null|
+----------+-----+




## rlike

In [13]:
val df = Seq(
    ("2017/01/01", 1),
    ("2017/02/01", 2),
    ("2018/02/05", 3),
    (null, 4)
).toDF("date", "month")
df.show

+----------+-----+
|      date|month|
+----------+-----+
|2017/01/01|    1|
|2017/02/01|    2|
|2018/02/05|    3|
|      null|    4|
+----------+-----+




In [17]:
df.filter($"date" rlike "\\d{4}/02/\\d{2}").show

+----------+-----+
|      date|month|
+----------+-----+
|2017/02/01|    2|
|2018/02/05|    3|
+----------+-----+
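
The pattern above is not anchored, and `rlike` reports a match when the regular expression is found anywhere in the string. A small sketch of why anchoring can matter; the session setup and the names `dates`, `loose`, and `strict` are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("rlike-sketch").getOrCreate()
import spark.implicits._

val dates = Seq("2017/02/01", "2017/01/02").toDF("date")

// Unanchored: "02" is found somewhere in both strings, so both rows are kept.
val loose = dates.filter($"date" rlike "02")

// Anchored with ^ and $: the whole string must be a February date.
val strict = dates.filter($"date" rlike "^\\d{4}/02/\\d{2}$")
```

Only the row `2017/02/01` survives the `strict` filter.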




## regexp_extract

```
public static Column regexp_extract(Column e, String exp, int groupIdx)
```
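
A minimal Scala sketch of the signature above, splitting a date string into its parts with capture groups. The session setup, the pattern, and the names `dates`, `pattern`, and `parts` are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_extract

val spark = SparkSession.builder().master("local").appName("regexp-extract-sketch").getOrCreate()
import spark.implicits._

val dates = Seq("2017/01/01", "2018/02/05").toDF("date")

// Three capture groups: year, month, day.
val pattern = "(\\d{4})/(\\d{2})/(\\d{2})"

// groupIdx 0 would return the whole match; 1, 2 and 3 select the capture groups.
val parts = dates.select(
    regexp_extract($"date", pattern, 1).as("year"),
    regexp_extract($"date", pattern, 2).as("month"),
    regexp_extract($"date", pattern, 3).as("day")
)
```

The extracted columns are strings, so a `cast` is still needed if numeric values are wanted.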

## length

In [18]:
val df = Seq(
    ("2017", 1),
    ("2017/02", 2),
    ("2018/02/05", 3),
    (null, 4)
).toDF("date", "month")
df.show

+----------+-----+
|      date|month|
+----------+-----+
|      2017|    1|
|   2017/02|    2|
|2018/02/05|    3|
|      null|    4|
+----------+-----+




In [19]:
import org.apache.spark.sql.functions.length

df.select($"date", length($"date")).show

+----------+------------+
|      date|length(date)|
+----------+------------+
|      2017|           4|
|   2017/02|           7|
|2018/02/05|          10|
|      null|        null|
+----------+------------+


