## String Manipulation Functions

We use string manipulation functions quite extensively. Here are some of the important functions which we typically use.

Let us start the Spark context for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our integrated LMS.

In [1]:
val username = System.getProperty("user.name")

username = itv002461


itv002461

In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Predefined Functions").
    master("yarn").
    getOrCreate

username = itv002461
spark = org.apache.spark.sql.SparkSession@4e9fed17


org.apache.spark.sql.SparkSession@4e9fed17

If you prefer to work from the command line, you can launch Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* Case Conversion - `lower`, `upper`, `initcap`
* Getting size of the column value - `length`
* Extracting Data - `substr` and `split`
* Trimming and Padding functions - `trim`, `rtrim`, `ltrim`, `rpad` and `lpad`
* Reversing strings - `reverse`
* Concatenating multiple strings - `concat` and `concat_ws`

### Case Conversion and Length
Let us understand how to convert the case of a string and how to get the length of a string.

* Case Conversion Functions - `lower`, `upper`, `initcap`

In [3]:
%%sql

SELECT lower('hEllo wOrlD') AS lower_result,
    upper('hEllo wOrlD') AS upper_result,
    initcap('hEllo wOrlD') AS initcap_result

+------------+------------+--------------+
|lower_result|upper_result|initcap_result|
+------------+------------+--------------+
| hello world| HELLO WORLD|   Hello World|
+------------+------------+--------------+



* Getting length - `length`

In [4]:
%%sql

SELECT length('hEllo wOrlD') AS result

+------+
|result|
+------+
|    11|
+------+
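
The behaviour shown above can be reproduced outside Spark with a minimal plain-Python sketch. This is only a model of the semantics, not Spark's implementation: the helper name `initcap` is hypothetical, and it assumes that words are separated by single spaces, as in the example string.

```python
# Plain-Python model of lower, upper, initcap and length (not Spark code).

def initcap(text):
    """Lower-case every word, then upper-case its first letter.

    Assumption: words are separated by single spaces.
    """
    return " ".join(word[:1].upper() + word[1:].lower() for word in text.split(" "))


sample = "hEllo wOrlD"

print(sample.lower())    # hello world
print(sample.upper())    # HELLO WORLD
print(initcap(sample))   # Hello World
print(len(sample))       # 11
```

Note that Python's built-in `str.title()` is not used here because it treats every non-letter character as a word boundary, which is a different rule.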



Let us see how to use these functions on a table. We will use the `orders` table, which was loaded as part of the last section.

* `order_status` is stored in upper case. The final query in this section converts it to lower case using `lower` and also returns its length.

In [5]:
%%sql

USE itv002461_retail

++
||
++
++



In [6]:
%%sql 

SHOW tables

+----------------+-----------------+-----------+
|        database|        tableName|isTemporary|
+----------------+-----------------+-----------+
|itv002461_retail|       categories|      false|
|itv002461_retail|        customers|      false|
|itv002461_retail|      departments|      false|
|itv002461_retail|             dual|      false|
|itv002461_retail|      order_items|      false|
|itv002461_retail|order_items_stage|      false|
|itv002461_retail|           orders|      false|
|itv002461_retail|      orders_part|      false|
|itv002461_retail|         products|      false|
+----------------+-----------------+-----------+



In [7]:
%%sql

SELECT * FROM orders LIMIT 10

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|   34565|2014-02-23 00:00:...|             8702|       COMPLETE|
|   34566|2014-02-23 00:00:...|             3066|PENDING_PAYMENT|
|   34567|2014-02-23 00:00:...|             7314|SUSPECTED_FRAUD|
|   34568|2014-02-23 00:00:...|             1271|       COMPLETE|
|   34569|2014-02-23 00:00:...|            11083|       COMPLETE|
|   34570|2014-02-23 00:00:...|             3159|         CLOSED|
|   34571|2014-02-23 00:00:...|             4551|         CLOSED|
|   34572|2014-02-23 00:00:...|             8135|        PENDING|
|   34573|2014-02-23 00:00:...|             7497|PENDING_PAYMENT|
|   34574|2014-02-23 00:00:...|             1868|        ON_HOLD|
+--------+--------------------+-----------------+---------------+



In [8]:
%%sql

SELECT order_id, order_date, order_customer_id,
    lower(order_status) AS order_status,
    length(order_status) AS order_status_length
FROM orders LIMIT 10

+--------+--------------------+-----------------+---------------+-------------------+
|order_id|          order_date|order_customer_id|   order_status|order_status_length|
+--------+--------------------+-----------------+---------------+-------------------+
|       1|2013-07-25 00:00:...|            11599|         closed|                  6|
|       2|2013-07-25 00:00:...|              256|pending_payment|                 15|
|       3|2013-07-25 00:00:...|            12111|       complete|                  8|
|       4|2013-07-25 00:00:...|             8827|         closed|                  6|
|       5|2013-07-25 00:00:...|            11318|       complete|                  8|
|       6|2013-07-25 00:00:...|             7130|       complete|                  8|
|       7|2013-07-25 00:00:...|             4530|       complete|                  8|
|       8|2013-07-25 00:00:...|             2911|     processing|                 10|
|       9|2013-07-25 00:00:...|             5657|pendi

### Extracting Data - substr and split
Let us understand how to extract data from strings using `substr`/`substring` and `split`.

* We can get the syntax and semantics of a function using `DESCRIBE FUNCTION`.
* We can extract the first four characters of a string using `substr` or `substring`.

In [9]:
%%sql

DESCRIBE FUNCTION substr

+--------------------+
|       function_desc|
+--------------------+
|    Function: substr|
|Class: org.apache...|
|Usage: substr(str...|
+--------------------+



In [10]:
%%sql

DESCRIBE FUNCTION substring

+--------------------+
|       function_desc|
+--------------------+
| Function: substring|
|Class: org.apache...|
|Usage: substring(...|
+--------------------+



In [11]:
spark.sql("DESCRIBE FUNCTION substring").show(false)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|function_desc                                                                                                                                                                          |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Function: substring                                                                                                                                                                    |
|Class: org.apache.spark.sql.catalyst.expressions.Substring                                                                                                                             |
|Usage: substring(str, pos[, len]) - Returns the substring of `str` th

In [12]:
%%sql

SELECT substr('2013-07-25 00:00:00.0', 1, 4) AS result

+------+
|result|
+------+
|  2013|
+------+



In [13]:
%%sql

SELECT substr('2013-07-25 00:00:00.0', 6, 2) AS result

+------+
|result|
+------+
|    07|
+------+



In [14]:
%%sql

SELECT substr('2013-07-25 00:00:00.0', 9, 2) AS result

+------+
|result|
+------+
|    25|
+------+



In [15]:
%%sql

SELECT substr('2013-07-25 00:00:00.0', 12) AS result

+----------+
|    result|
+----------+
|00:00:00.0|
+----------+
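
The four queries above rely on the position argument being 1-based and the length being optional. A minimal plain-Python sketch of that rule follows. The helper `substr` here is a hypothetical model, not Spark's code, and it deliberately covers only positions greater than or equal to 1.

```python
# Plain-Python model of substr(str, pos[, len]) for positive positions only.

def substr(text, pos, length=None):
    """Return `length` characters of `text` starting at 1-based position `pos`.

    When `length` is omitted, everything from `pos` to the end is returned.
    Assumption: `pos` >= 1; other positions are out of scope for this sketch.
    """
    if pos < 1:
        raise ValueError("this sketch only models positions >= 1")
    start = pos - 1  # convert the 1-based SQL position to a 0-based index
    end = None if length is None else start + length
    return text[start:end]


ts = "2013-07-25 00:00:00.0"

print(substr(ts, 1, 4))   # 2013
print(substr(ts, 6, 2))   # 07
print(substr(ts, 9, 2))   # 25
print(substr(ts, 12))     # 00:00:00.0
```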



* Let us see how we can extract the date part from `order_date` of `orders`.

In [16]:
%%sql

SELECT * FROM orders LIMIT 10

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|   34565|2014-02-23 00:00:...|             8702|       COMPLETE|
|   34566|2014-02-23 00:00:...|             3066|PENDING_PAYMENT|
|   34567|2014-02-23 00:00:...|             7314|SUSPECTED_FRAUD|
|   34568|2014-02-23 00:00:...|             1271|       COMPLETE|
|   34569|2014-02-23 00:00:...|            11083|       COMPLETE|
|   34570|2014-02-23 00:00:...|             3159|         CLOSED|
|   34571|2014-02-23 00:00:...|             4551|         CLOSED|
|   34572|2014-02-23 00:00:...|             8135|        PENDING|
|   34573|2014-02-23 00:00:...|             7497|PENDING_PAYMENT|
|   34574|2014-02-23 00:00:...|             1868|        ON_HOLD|
+--------+--------------------+-----------------+---------------+



In [17]:
%%sql

SELECT order_id,
  substr(order_date, 1, 10) AS order_date,
  order_customer_id,
  order_status
FROM orders

+--------+----------+-----------------+---------------+
|order_id|order_date|order_customer_id|   order_status|
+--------+----------+-----------------+---------------+
|       1|2013-07-25|            11599|         CLOSED|
|       2|2013-07-25|              256|PENDING_PAYMENT|
|       3|2013-07-25|            12111|       COMPLETE|
|       4|2013-07-25|             8827|         CLOSED|
|       5|2013-07-25|            11318|       COMPLETE|
|       6|2013-07-25|             7130|       COMPLETE|
|       7|2013-07-25|             4530|       COMPLETE|
|       8|2013-07-25|             2911|     PROCESSING|
|       9|2013-07-25|             5657|PENDING_PAYMENT|
|      10|2013-07-25|             5648|PENDING_PAYMENT|
+--------+----------+-----------------+---------------+
only showing top 10 rows



Let us understand how to extract information from a string that contains a delimiter.
* `split` converts a delimited string into an array.

In [18]:
%%sql

SELECT split('2013-07-25', '-') AS result

+--------------+
|        result|
+--------------+
|[2013, 07, 25]|
+--------------+



In [19]:
%%sql

SELECT split('2013-07-25', '-')[1] AS result

+------+
|result|
+------+
|    07|
+------+



* We can use `explode` to convert an array into records, one record per element.

In [20]:
%%sql

SELECT explode(split('2013-07-25', '-')) AS result

+------+
|result|
+------+
|  2013|
|    07|
|    25|
+------+
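
As the outputs above show, array elements are accessed with a 0-based index (`[1]` returns the second element), unlike the 1-based positions used by `substr`. A minimal plain-Python sketch of `split` and `explode` is given below. Both helpers are hypothetical models rather than Spark's implementation, and Python's `re` module is used as a stand-in for the regular expression engine, so patterns with engine-specific syntax may behave differently.

```python
import re

# Plain-Python model of split and explode (not Spark code).

def split(text, pattern):
    """Split `text` wherever the regular expression `pattern` matches."""
    return re.split(pattern, text)


def explode(values):
    """Yield one single-column row per element of the array."""
    for value in values:
        yield (value,)


parts = split("2013-07-25", "-")

print(parts)       # ['2013', '07', '25']
print(parts[1])    # 07  (index 1 is the second element)

for row in explode(parts):
    print(row)     # ('2013',) then ('07',) then ('25',)
```

Because the delimiter is treated as a pattern, characters such as `.` or `|` would need to be escaped before being used as a literal delimiter.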



### Trimming and Padding Functions

Let us understand how to trim or remove leading and/or trailing spaces in a string.

* `ltrim` is used to remove the spaces on the left side of the string.
* `rtrim` is used to remove the spaces on the right side of the string.
* `trim` is used to remove the spaces on both sides of the string.

In [21]:
%%sql

SELECT ltrim('     Hello World') AS result

+-----------+
|     result|
+-----------+
|Hello World|
+-----------+



In [22]:
%%sql

SELECT rtrim('     Hello World       ') AS result

+----------------+
|          result|
+----------------+
|     Hello World|
+----------------+



In [23]:
%%sql

SELECT length(trim('     Hello World       ')) AS result

+------+
|result|
+------+
|    11|
+------+
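
The three trimming functions can be modelled in plain Python as follows. This is a sketch under the assumption that only the space character is stripped, which is why `" "` is passed explicitly instead of relying on Python's default of stripping all whitespace. The helper names mirror the SQL functions but are otherwise hypothetical.

```python
# Plain-Python model of ltrim, rtrim and trim (not Spark code).
# Assumption: only the space character is removed, not tabs or newlines.

def ltrim(text):
    """Remove spaces from the left side of the string."""
    return text.lstrip(" ")


def rtrim(text):
    """Remove spaces from the right side of the string."""
    return text.rstrip(" ")


def trim(text):
    """Remove spaces from both sides of the string."""
    return text.strip(" ")


print(repr(ltrim("     Hello World")))          # 'Hello World'
print(repr(rtrim("     Hello World       ")))   # '     Hello World'
print(len(trim("     Hello World       ")))     # 11
```

Printing with `repr` keeps the remaining leading spaces visible, which the tabular output above cannot show because the column is right-aligned.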



Let us understand how to pad a string with characters up to a fixed length.

* Let us assume that there are 3 fields - year, month and date - which are of type integer.
* If we have to concatenate all 3 fields to create a date, we might have to pad month and date with 0.
* `lpad` is used more often than `rpad`, especially when we build a date from separate columns.

In [24]:
%%sql

SELECT 2013 AS year, 7 AS month, 25 AS myDate

+----+-----+------+
|year|month|myDate|
+----+-----+------+
|2013|    7|    25|
+----+-----+------+



In [25]:
%%sql

SELECT lpad(7, 2, 0) AS result

+------+
|result|
+------+
|    07|
+------+



In [26]:
%%sql

SELECT lpad(10, 2, 0) AS result

+------+
|result|
+------+
|    10|
+------+



In [27]:
%%sql

SELECT lpad(100, 2, 0) AS result

+------+
|result|
+------+
|    10|
+------+
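
The last result is worth a closer look: `lpad(100, 2, 0)` returns `10`, because a value longer than the target length is cut down to that length instead of being returned unchanged. A minimal plain-Python sketch of this padding rule is shown below. The helpers `lpad` and `rpad` are hypothetical models of the SQL functions, not Spark's implementation, and the handling of an empty pad string is an assumption of this sketch.

```python
# Plain-Python model of lpad and rpad (not Spark code).

def lpad(value, length, pad):
    """Left-pad `value` with `pad` up to `length` characters.

    Values longer than `length` are truncated to their first `length` characters.
    """
    text, filler = str(value), str(pad)
    if len(text) >= length or not filler:  # empty pad: assumption, nothing to add
        return text[:length]
    missing = length - len(text)
    return (filler * missing)[:missing] + text


def rpad(value, length, pad):
    """Right-pad `value` with `pad` up to `length` characters, truncating if longer."""
    text, filler = str(value), str(pad)
    if len(text) >= length or not filler:
        return text[:length]
    missing = length - len(text)
    return text + (filler * missing)[:missing]


print(lpad(7, 2, 0))     # 07
print(lpad(10, 2, 0))    # 10
print(lpad(100, 2, 0))   # 10  (truncated, not padded)
print(rpad(7, 2, 0))     # 70
```

When padding month or day values this truncation is harmless, but for wider values it can silently drop digits, so it helps to choose a target length that covers the largest expected value.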



### Reverse and Concatenating multiple strings

Let us understand how to reverse a string as well as how to concatenate multiple strings.
* We can use `reverse` to reverse a string.
* We can concatenate multiple strings using `concat` and `concat_ws`.
* `concat_ws` is typically used when the same separator should appear between all the strings that are being concatenated.

In [28]:
%%sql

SELECT reverse('Hello World') AS result

+-----------+
|     result|
+-----------+
|dlroW olleH|
+-----------+



In [29]:
%%sql

SELECT concat('Hello ', 'World') AS result

+-----------+
|     result|
+-----------+
|Hello World|
+-----------+



In [30]:
%%sql

SELECT concat('Order Status is ', order_status) AS result
FROM orders LIMIT 10

+--------------------+
|              result|
+--------------------+
|Order Status is C...|
|Order Status is P...|
|Order Status is C...|
|Order Status is C...|
|Order Status is C...|
|Order Status is C...|
|Order Status is C...|
|Order Status is P...|
|Order Status is P...|
|Order Status is P...|
+--------------------+



In [31]:
spark.sql("""
    SELECT concat('Order Status is ', order_status) AS result
    FROM orders_part LIMIT 10
""").show(false)

+-------------------------------+
|result                         |
+-------------------------------+
|Order Status is CLOSED         |
|Order Status is PENDING_PAYMENT|
|Order Status is COMPLETE       |
|Order Status is CLOSED         |
|Order Status is COMPLETE       |
|Order Status is COMPLETE       |
|Order Status is COMPLETE       |
|Order Status is PROCESSING     |
|Order Status is PENDING_PAYMENT|
|Order Status is PENDING_PAYMENT|
+-------------------------------+



In [32]:
%%sql

SELECT * FROM (SELECT 2013 AS year, 7 AS month, 25 AS myDate) q

+----+-----+------+
|year|month|myDate|
+----+-----+------+
|2013|    7|    25|
+----+-----+------+



In [33]:
%%sql

SELECT concat(year, '-', lpad(month, 2, 0), '-',
              lpad(myDate, 2, 0)) AS order_date
FROM
    (SELECT 2013 AS year, 7 AS month, 25 AS myDate) q

+----------+
|order_date|
+----------+
|2013-07-25|
+----------+



In [34]:
%%sql

SELECT concat_ws('-', year, lpad(month, 2, 0),
              lpad(myDate, 2, 0)) AS order_date
FROM
    (SELECT 2013 AS year, 7 AS month, 25 AS myDate) q

+----------+
|order_date|
+----------+
|2013-07-25|
+----------+
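
Both queries give the same result here, but the two functions differ in how they treat missing values: `concat` returns NULL as soon as any input is NULL, while `concat_ws` skips NULL inputs. A minimal plain-Python sketch of that difference follows, with `None` standing in for NULL. The helpers are hypothetical models, not Spark's implementation, and the zero padding uses Python format strings on the assumption that month and day never exceed two digits.

```python
# Plain-Python model of concat and concat_ws (not Spark code); None models NULL.

def concat(*values):
    """Join all values; the result is None if any input is None."""
    if any(value is None for value in values):
        return None
    return "".join(str(value) for value in values)


def concat_ws(separator, *values):
    """Join the non-None values with `separator` between them."""
    return separator.join(str(value) for value in values if value is not None)


year, month, my_date = 2013, 7, 25
padded_month = f"{month:02d}"   # assumption: at most two digits
padded_date = f"{my_date:02d}"

print(concat(year, "-", padded_month, "-", padded_date))   # 2013-07-25
print(concat_ws("-", year, padded_month, padded_date))     # 2013-07-25
print(concat("Order Status is ", None))                    # None
print(concat_ws("-", year, None, padded_date))             # 2013-25
```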

