- Title: Column Functions and Operators in Spark
- Slug: pyspark-func-operators
- Date: 2021-03-24 16:58:01
- Category: Computer Science
- Tags: programming, Scala, Spark, DataFrame, column, functions, operators, func, fun
- Author: Ben Du

In [1]:
from typing import List, Tuple
import pandas as pd

In [2]:
from pathlib import Path
import findspark
findspark.init(str(next(Path("/opt").glob("spark-3*"))))
#findspark.init("/opt/spark-2.3.0-bin-hadoop2.7")

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType, StringType, StructType, StructField, ArrayType

spark = SparkSession.builder.appName("PySpark_Str_Func") \
    .enableHiveSupport().getOrCreate()

In [3]:
df = spark.createDataFrame(
    pd.DataFrame(
        data=[([1, 2], "how", 1), ([2, 3], "are", 2), ([3, 4], "you", 3)],
        columns=["col1", "col2", "col3"]
    )
)
df.show()

+------+----+----+
|  col1|col2|col3|
+------+----+----+
|[1, 2]| how|   1|
|[2, 3]| are|   2|
|[3, 4]| you|   3|
+------+----+----+



In [4]:
dir(df.col1)

['__add__',
 '__and__',
 '__bool__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__div__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__invert__',
 '__iter__',
 '__le__',
 '__lt__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdiv__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__rpow__',
 '__rsub__',
 '__rtruediv__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__weakref__',
 '_asc_doc',
 '_asc_nulls_first_doc',
 '_asc_nulls_last_doc',
 '_bitwiseAND_doc',
 '_bitwiseOR_doc',
 '_bitwiseXOR_doc',
 '_contains_doc',
 '_desc_doc',
 '_desc_nulls_first_doc',
 '_desc_nulls_last_doc',
 '_endswith_doc',
 '_eqNullSafe_doc',
 '_isNotNull_doc',
 '_isNull_doc',
 '_j

## [Rounding Functions](http://www.legendu.net/misc/blog/spark-dataframe-func-rounding)

Please refer to 
[Rounding Functions in Spark](http://www.legendu.net/misc/blog/spark-dataframe-func-rounding)
for details.

## [String Functions](http://www.legendu.net/misc/blog/spark-dataframe-func-string)

Please refer to 
[String Functions in Spark](http://www.legendu.net/misc/blog/spark-dataframe-func-string)
for details.

## [Statistical Functions](http://www.legendu.net/misc/blog/spark-stat-functions)

Please refer to
[Statistical Functions in Spark](http://www.legendu.net/misc/blog/spark-stat-functions)
for details.

## [Date Functions in Spark](http://www.legendu.net/misc/blog/spark-dataframe-func-date)

Please refer to 
[Date Functions in Spark](http://www.legendu.net/misc/blog/spark-dataframe-func-date)
for details.

## [Window Functions in Spark](http://www.legendu.net/misc/blog/window-functions-in-spark)

Please refer to 
[Window Functions in Spark](http://www.legendu.net/misc/blog/window-functions-in-spark)
for details.

## [Collection Functions](http://www.legendu.net/misc/blog/spark-dataframe-func-collection)

Please refer to
[Collection Functions](http://www.legendu.net/misc/blog/spark-dataframe-func-collection)
for details.

## Not (`~`) for Column Expressions

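Use `~` to negate a boolean column expression.
The Python `not` keyword does not work here, because a `Column` cannot be converted to a single Python `bool`.
The example below filters a DataFrame (not the one defined above) that has a boolean column `col5`.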

In [13]:
df.filter(~col("col5")).show()

+----+----+----+----+-----+
|col1|col2|col3|col4| col5|
+----+----+----+----+-----+
|   3|   c| foo| 5.0|false|
|   4|   d| bar| 7.0|false|
+----+----+----+----+-----+



## between

In [7]:
df.filter(col("col2").between("hoa", "hox")).show()

+------+----+----+
|  col1|col2|col3|
+------+----+----+
|[1, 2]| how|   1|
+------+----+----+



In [8]:
df.filter(col("col3").between(2, 3)).show()

+------+----+----+
|  col1|col2|col3|
+------+----+----+
|[2, 3]| are|   2|
|[3, 4]| you|   3|
+------+----+----+



## cast

In [12]:
df2 = df.select(
    col("col1"),
    col("col2"),
    col("col3").astype(StringType())
)
df2.show()

+------+----+----+
|  col1|col2|col3|
+------+----+----+
|[1, 2]| how|   1|
|[2, 3]| are|   2|
|[3, 4]| you|   3|
+------+----+----+



In [13]:
df2.schema

StructType(List(StructField(col1,ArrayType(LongType,true),true),StructField(col2,StringType,true),StructField(col3,StringType,true)))

In [15]:
df3 = df2.select(
    col("col1"),
    col("col2"),
    col("col3").cast(IntegerType())
)
df3.show()

+------+----+----+
|  col1|col2|col3|
+------+----+----+
|[1, 2]| how|   1|
|[2, 3]| are|   2|
|[3, 4]| you|   3|
+------+----+----+



In [16]:
df3.schema

StructType(List(StructField(col1,ArrayType(LongType,true),true),StructField(col2,StringType,true),StructField(col3,IntegerType,true)))

## lit

In [4]:
x = lit(1)

In [5]:
type(x)

pyspark.sql.column.Column

## hash

In [7]:
df.withColumn("hash_code", hash("col2")).show()

+------+----+----+-----------+
|  col1|col2|col3|  hash_code|
+------+----+----+-----------+
|[1, 2]| how|   1|-1205091763|
|[2, 3]| are|   2| -422146862|
|[3, 4]| you|   3| -315368575|
+------+----+----+-----------+



## when

A `null` in a `when` condition is treated as `false`.
The examples in this section are written in Scala.

In [1]:
import org.apache.spark.sql.functions._

val df = spark.read.json("../data/people.json")
df.show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



df = [age: bigint, name: string]

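Michael's `age` is `null`, so comparisons such as `age > 20` evaluate to `null` for that row. `when` treats this as `false`, and the row falls through to `otherwise`.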

In [3]:
df.select(when($"age" > 20, 1).otherwise(0).alias("gt20")).show

+----+
|gt20|
+----+
|   0|
|   1|
|   0|
+----+



In [5]:
df.select(when($"age" <= 20, 1).otherwise(0).alias("le20")).show

+----+
|le20|
+----+
|   0|
|   0|
|   1|
+----+



In [6]:
df.select(when($"age".isNull, 0).when($"age" > 20 , 100).otherwise(10).alias("age")).show

+---+
|age|
+---+
|  0|
|100|
| 10|
+---+



In [7]:
df.select(when($"age".isNull, 0).alias("age")).show

+----+
| age|
+----+
|   0|
|null|
|null|
+----+



## References

- [Dataset (Spark Java API)](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html)

- [functions (Spark Java API)](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html)

- [Row (Spark Java API)](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html)