#PySpark Column Class | Operators & Functions


---

**The pyspark.sql.Column class provides functions for working with DataFrame columns: manipulating column values, evaluating Boolean expressions to filter rows, retrieving a value or part of a value from a column, and working with list, map & struct columns.**

**This tutorial covers how to create Column objects, how to access them to perform operations, and the most used PySpark Column functions with examples.**

- The PySpark Column class represents a single column in a DataFrame.
- It provides the functions most often used to manipulate DataFrame columns & rows.
- Some of these Column functions evaluate a Boolean expression that can be used with the filter() transformation to filter the DataFrame rows.
- It provides functions to get a value from a list column by index, a map value by key, and a nested field from a struct column.
- PySpark also provides additional functions in pyspark.sql.functions that take a Column object and return a Column type.


Note: Most of the functions in pyspark.sql.functions return a Column, so it is important to know which operations you can perform on the Column type.

##1. Create Column Class Object


---


**One of the simplest ways to create a Column class object is with the PySpark lit() SQL function, which takes a literal value and returns a Column object.**

In [0]:
from pyspark.sql.functions import lit

colObj = lit('mycolumn')
colObj

Out[1]: Column<'mycolumn'>

***You can also access a Column from a DataFrame in multiple ways.***

In [0]:
data = [('James', 23), ('Ann', 40)]

df = spark.createDataFrame(data=data).toDF('name.fname', 'gender')

df.printSchema()
df.show()

root
 |-- name.fname: string (nullable = true)
 |-- gender: long (nullable = true)

+----------+------+
|name.fname|gender|
+----------+------+
|     James|    23|
|       Ann|    40|
+----------+------+



In [0]:
#Using DataFrame object (df)
df.select(df.gender).show()
df.select(df['gender']).show()

#Accessing column name with dot (with backticks)
df.select(df["`name.fname`"]).show()

+------+
|gender|
+------+
|    23|
|    40|
+------+

+------+
|gender|
+------+
|    23|
|    40|
+------+

+----------+
|name.fname|
+----------+
|     James|
|       Ann|
+----------+



In [0]:
#Using SQL col() function

from pyspark.sql.functions import col

df.select(col('gender')).show()

#Accessing column name with dot (with backticks)
df.select(col('`name.fname`')).show()


+------+
|gender|
+------+
|    23|
|    40|
+------+

+----------+
|name.fname|
+----------+
|     James|
|       Ann|
+----------+



**Accessing struct type columns. Here I have used the PySpark Row class to create a struct type. Alternatively, you can create it by using the PySpark StructType & StructField classes.**

In [0]:
#Create DataFrame with struct using Row class

from pyspark.sql import Row

data = [
    Row(name='James', prop=Row(hair='balck', eye='blue')),
    Row(name='Ann', prop=Row(hair='grey', eye='black'))
]

df = spark.createDataFrame(data=data)
df.printSchema()
df.show()


root
 |-- name: string (nullable = true)
 |-- prop: struct (nullable = true)
 |    |-- hair: string (nullable = true)
 |    |-- eye: string (nullable = true)

+-----+-------------+
| name|         prop|
+-----+-------------+
|James|{balck, blue}|
|  Ann|{grey, black}|
+-----+-------------+



In [0]:
#Access struct column
df.select(df.prop.hair).show()
df.select(df['prop.hair']).show()
df.select(col('prop.hair')).show()


+---------+
|prop.hair|
+---------+
|    balck|
|     grey|
+---------+

+-----+
| hair|
+-----+
|balck|
| grey|
+-----+

+-----+
| hair|
+-----+
|balck|
| grey|
+-----+



In [0]:
# Access all columns from struct
df.select(col('prop.*')).show()

+-----+-----+
| hair|  eye|
+-----+-----+
|balck| blue|
| grey|black|
+-----+-----+



##2. PySpark Column Operators

---


**PySpark column also provides a way to do arithmetic operations on columns using operators.**

In [0]:
data = [(100,2,1), (200,3,4), (300,4,4)]

df = spark.createDataFrame(data=data).toDF('col1', 'col2', 'col3')

#Arithmetic Operations
df.select(df.col1 + df.col2).show()
df.select(df.col1 - df.col2).show()
df.select(df.col1 * df.col2).show()
df.select(df.col1 / df.col2).show()
df.select(df.col1 % df.col2).show()

#Comparison Operations
df.select(df.col1 > df.col2).show()
df.select(df.col1 < df.col2).show()
df.select(df.col1 == df.col2).show()

+-------------+
|(col1 + col2)|
+-------------+
|          102|
|          203|
|          304|
+-------------+

+-------------+
|(col1 - col2)|
+-------------+
|           98|
|          197|
|          296|
+-------------+

+-------------+
|(col1 * col2)|
+-------------+
|          200|
|          600|
|         1200|
+-------------+

+-----------------+
|    (col1 / col2)|
+-----------------+
|             50.0|
|66.66666666666667|
|             75.0|
+-----------------+

+-------------+
|(col1 % col2)|
+-------------+
|            0|
|            2|
|            0|
+-------------+

+-------------+
|(col1 > col2)|
+-------------+
|         true|
|         true|
|         true|
+-------------+

+-------------+
|(col1 < col2)|
+-------------+
|        false|
|        false|
|        false|
+-------------+

+-------------+
|(col1 = col2)|
+-------------+
|        false|
|        false|
|        false|
+-------------+



##3. PySpark Column Functions

---

**Let’s look at some of the most used Column functions. In the table below, related functions are grouped together.**

<table><thead><tr><th>Column Function</th><th>Function Description</th></tr></thead><tbody><tr><td><code>alias</code>(*alias,&nbsp;**kwargs)<br><code>name</code>(*alias,&nbsp;**kwargs)</td><td>Provides an alias for the column or expression.<br><code>name()</code>&nbsp;is an alias of&nbsp;<code>alias()</code>.</td></tr><tr><td><code>asc</code>()<br><code>asc_nulls_first</code>()<br><code>asc_nulls_last</code>()</td><td>Returns a sort expression based on ascending order of the column.<br><code>asc_nulls_first</code>() – null values appear before non-null values.<br><code>asc_nulls_last</code>() – null values appear after non-null values.</td></tr><tr><td><code>astype</code>(dataType)<br><code>cast</code>(dataType)</td><td>Casts the column to another data type.<br><code>astype()</code>&nbsp;is an alias of&nbsp;<code>cast()</code>.</td></tr><tr><td><code>between</code>(lowerBound,&nbsp;upperBound)</td><td>Checks if the column values are between the lower and upper bounds (inclusive). Returns a Boolean expression.</td></tr><tr><td><code>bitwiseAND</code>(other)<br><code>bitwiseOR</code>(other)<br><code>bitwiseXOR</code>(other)</td><td>Compute bitwise AND, OR &amp; XOR of this expression with another expression respectively.</td></tr><tr><td><code>contains</code>(other)</td><td>Checks if the column value contains the given string. Returns a Boolean expression.</td></tr><tr><td><code>desc</code>()<br><code>desc_nulls_first</code>()<br><code>desc_nulls_last</code>()</td><td>Returns a sort expression based on descending order of the column.<br><code>desc_nulls_first</code>() – null values appear before non-null values.<br><code>desc_nulls_last</code>() – null values appear after non-null values.</td></tr><tr><td><code>startswith</code>(other)<br><code>endswith</code>(other)</td><td>Checks if the string starts with the given value. Returns a Boolean expression.<br>Checks if the string ends with the given value. Returns a Boolean expression.</td></tr><tr><td><code>eqNullSafe</code>(other)</td><td>Equality test that is safe for null values.</td></tr><tr><td><code>getField</code>(name)</td><td>Returns a field by name from a StructType column, or a value by key from a MapType column.</td></tr><tr><td><code>getItem</code>(key)</td><td>Returns the element at the given position from an ArrayType column, or the value for the given key from a MapType column.</td></tr><tr><td><code>isNotNull</code>()<br><code>isNull</code>()</td><td>isNotNull() – Returns True if the current expression is NOT null.<br>isNull() – Returns True if the current expression is null.</td></tr><tr><td><code>isin</code>(*cols)</td><td>A Boolean expression that evaluates to true if the value of this expression is contained in the evaluated values of the arguments.</td></tr><tr><td><code>like</code>(other)<br><code>rlike</code>(other)</td><td>Similar to the SQL LIKE expression.<br>Similar to the SQL RLIKE expression (LIKE with regex).</td></tr><tr><td><code>over</code>(window)</td><td>Defines a windowing column; used with window functions.</td></tr><tr><td><code>substr</code>(startPos,&nbsp;length)</td><td>Returns a&nbsp;Column&nbsp;which is a substring of the column.</td></tr><tr><td><code>when</code>(condition,&nbsp;value)<br><code>otherwise</code>(value)</td><td>Similar to SQL CASE WHEN; evaluates a list of conditions and returns one of multiple possible result expressions.</td></tr><tr><td><code>dropFields</code>(*fieldNames)</td><td>Drops fields in&nbsp;StructType&nbsp;by name.</td></tr><tr><td><code>withField</code>(fieldName,&nbsp;col)</td><td>An expression that adds/replaces a field in<code>&nbsp;StructType</code>&nbsp;by name.</td></tr></tbody></table>

##4. PySpark Column Functions Examples

---


**Let’s create a simple DataFrame to work with in the PySpark SQL Column examples. In most of the examples below, I refer to columns through the DataFrame object name (df.).**

In [0]:
data = [
    ('James', 'Bond', '100', None),
    ('Ann', 'Varsa', '200', 'F'),
    ('Tom Cruise', 'XXX', '400', ''),
    ('Tom Brand', None, '400', 'M'),
    ]

columns = ['fname', 'lname', 'id', 'gender']

df = spark.createDataFrame(data=data, schema=columns)
display(df)

fname,lname,id,gender
James,Bond,100,
Ann,Varsa,200,F
Tom Cruise,XXX,400,
Tom Brand,,400,M


##4.1 alias() – Sets the name of a Column

---

**In the example below, df.fname refers to a Column object, and alias() is a Column function that gives it an alternate name. Here, the fname column is renamed to first_name & lname to last_name.**

**In the second example, I have used the PySpark expr() function to concatenate the columns and named the resulting column fullname.**

In [0]:
#alias
from pyspark.sql.functions import expr

df.select(
            df.fname.alias('first_name'),
            df.lname.alias('last_name')
          ).show()

#Another example

df.select(expr("fname ||','|| lname").alias('fullname')).show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|     James|     Bond|
|       Ann|    Varsa|
|Tom Cruise|      XXX|
| Tom Brand|     null|
+----------+---------+

+--------------+
|      fullname|
+--------------+
|    James,Bond|
|     Ann,Varsa|
|Tom Cruise,XXX|
|          null|
+--------------+



##4.2 asc() & desc() – Sort the DataFrame columns by Ascending or Descending order.

In [0]:
#asc, desc to sort ascending and descending order respectively.

df.sort(df.fname.asc()).show()

df.sort(df.fname.desc()).show()

+----------+-----+---+------+
|     fname|lname| id|gender|
+----------+-----+---+------+
|       Ann|Varsa|200|     F|
|     James| Bond|100|  null|
| Tom Brand| null|400|     M|
|Tom Cruise|  XXX|400|      |
+----------+-----+---+------+

+----------+-----+---+------+
|     fname|lname| id|gender|
+----------+-----+---+------+
|Tom Cruise|  XXX|400|      |
| Tom Brand| null|400|     M|
|     James| Bond|100|  null|
|       Ann|Varsa|200|     F|
+----------+-----+---+------+



##4.3 cast() & astype() – Used to convert the data Type.

In [0]:
#cast

df.select(df.fname, df.id.cast('int')).printSchema()

root
 |-- fname: string (nullable = true)
 |-- id: integer (nullable = true)



##4.4 between() – Returns a Boolean expression that is true when the column value is between the lower and upper bounds.

In [0]:
#between

df.filter(df.id.between(100,300)).show()

+-----+-----+---+------+
|fname|lname| id|gender|
+-----+-----+---+------+
|James| Bond|100|  null|
|  Ann|Varsa|200|     F|
+-----+-----+---+------+



##4.5 contains(), startswith() & endswith() – Check whether a DataFrame column value contains, starts with, or ends with the value specified in the function.

In [0]:
#startswith, endswith()

df.filter(df.fname.startswith("T")).show()

df.filter(df.fname.endswith("Cruise")).show()

+----------+-----+---+------+
|     fname|lname| id|gender|
+----------+-----+---+------+
|Tom Cruise|  XXX|400|      |
| Tom Brand| null|400|     M|
+----------+-----+---+------+

+----------+-----+---+------+
|     fname|lname| id|gender|
+----------+-----+---+------+
|Tom Cruise|  XXX|400|      |
+----------+-----+---+------+



##4.8 isNull() & isNotNull() – Check whether the DataFrame column value is NULL or not NULL.

In [0]:
#isNull & isNotNull

df.filter(df.lname.isNull()).show()

df.filter(df.lname.isNotNull()).show()

+---------+-----+---+------+
|    fname|lname| id|gender|
+---------+-----+---+------+
|Tom Brand| null|400|     M|
+---------+-----+---+------+

+----------+-----+---+------+
|     fname|lname| id|gender|
+----------+-----+---+------+
|     James| Bond|100|  null|
|       Ann|Varsa|200|     F|
|Tom Cruise|  XXX|400|      |
+----------+-----+---+------+



##4.9 like() & rlike() – Similar to the SQL LIKE and RLIKE expressions

In [0]:
#like, rlike

df.select(df.fname,df.lname).filter(df.fname.like("%om%")).show()


+----------+-----+
|     fname|lname|
+----------+-----+
|Tom Cruise|  XXX|
| Tom Brand| null|
+----------+-----+



##4.10 substr() – Returns a Column containing a substring of the Column

In [0]:
df.select(df.fname.substr(1,2).alias('substr')).show()

+------+
|substr|
+------+
|    Ja|
|    An|
|    To|
|    To|
+------+



##4.11 when() & otherwise() – Similar to SQL CASE WHEN; evaluates a sequence of conditions and returns the value for the first condition that matches.

In [0]:
#when & otherwise

from pyspark.sql.functions import when

#Use isNull() for the null check: comparing with == None evaluates to null and never matches
df.select(df.fname, df.lname, when(df.gender=="M", "Male")\
            .when(df.gender=="F","Female")\
            .when(df.gender.isNull(), '')\
            .otherwise(df.gender).alias('new_gender')\
         ).show()

+----------+-----+----------+
|     fname|lname|new_gender|
+----------+-----+----------+
|     James| Bond|          |
|       Ann|Varsa|    Female|
|Tom Cruise|  XXX|          |
| Tom Brand| null|      Male|
+----------+-----+----------+



##4.12 isin() – Checks if the value is present in a list.

In [0]:
#isin

demo_list = ['100', '200']

df.select(df.fname, df.lname, df.id).filter(df.id.isin(demo_list)).show()

+-----+-----+---+
|fname|lname| id|
+-----+-----+---+
|James| Bond|100|
|  Ann|Varsa|200|
+-----+-----+---+



##4.13 getField() – Gets the value by key from a MapType column and by child field name from a StructType column

---

**The rest of the functions below operate on list, map & struct data structures, so to demonstrate them I will use another DataFrame with list, map and struct columns.**

In [0]:
#Create DataFrame with struct, array & map

from pyspark.sql.types import StructType, StructField, StringType, ArrayType, MapType

data=[(("James","Bond"),["Java","C#"],{'hair':'black','eye':'brown'}),
      (("Ann","Varsa"),[".NET","Python"],{'hair':'brown','eye':'black'}),
      (("Tom Cruise",""),["Python","Scala"],{'hair':'red','eye':'grey'}),
      (("Tom Brand",None),["Perl","Ruby"],{'hair':'black','eye':'blue'})]

schema = StructType([
    StructField('name',StructType([
        StructField('fname', StringType(), True),
        StructField('lname', StringType(), True),
     ])),
    StructField('languages', ArrayType(StringType()), True),
    StructField('properties', MapType(StringType(),StringType()), True)
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)

root
 |-- name: struct (nullable = true)
 |    |-- fname: string (nullable = true)
 |    |-- lname: string (nullable = true)
 |-- languages: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+-----------------+---------------+-----------------------------+
|name             |languages      |properties                   |
+-----------------+---------------+-----------------------------+
|{James, Bond}    |[Java, C#]     |{eye -> brown, hair -> black}|
|{Ann, Varsa}     |[.NET, Python] |{eye -> black, hair -> brown}|
|{Tom Cruise, }   |[Python, Scala]|{eye -> grey, hair -> red}   |
|{Tom Brand, null}|[Perl, Ruby]   |{eye -> blue, hair -> black} |
+-----------------+---------------+-----------------------------+



In [0]:
#getField from MapType
df.select(df.properties.getField('hair')).show()

#getField from Struct
df.select(df.name.getField('fname')).show()

+----------------+
|properties[hair]|
+----------------+
|           black|
|           brown|
|             red|
|           black|
+----------------+

+----------+
|name.fname|
+----------+
|     James|
|       Ann|
|Tom Cruise|
| Tom Brand|
+----------+



##4.14 getItem() – Gets the value by index from an ArrayType column or by key from a MapType column.

In [0]:
#getItem() used with ArrayType
df.select(df.languages.getItem(1)).show()

#getItem() used with MapType
df.select(df.properties.getItem('hair')).show()

+------------+
|languages[1]|
+------------+
|          C#|
|      Python|
|       Scala|
|        Ruby|
+------------+

+----------------+
|properties[hair]|
+----------------+
|           black|
|           brown|
|             red|
|           black|
+----------------+



##4.15 dropFields() – Drops fields from a StructType column by name.

In [0]:
df.select(df.name.dropFields('fname')).show()

+--------------------------------+
|update_fields(name, dropfield())|
+--------------------------------+
|                          {Bond}|
|                         {Varsa}|
|                              {}|
|                          {null}|
+--------------------------------+



##4.16 withField() – Adds or replaces a field in a StructType column by name.

In [0]:
df.select(df.name.withField('mname', lit('izac'))).printSchema()
df.select(df.name.withField('mname', lit('izac'))).show()

root
 |-- update_fields(name, WithField(izac)): struct (nullable = true)
 |    |-- fname: string (nullable = true)
 |    |-- lname: string (nullable = true)
 |    |-- mname: string (nullable = false)

+------------------------------------+
|update_fields(name, WithField(izac))|
+------------------------------------+
|                 {James, Bond, izac}|
|                  {Ann, Varsa, izac}|
|                {Tom Cruise, , izac}|
|                {Tom Brand, null,...|
+------------------------------------+



##4.17 over() – Used with Window Functions

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window = Window.orderBy('name')

df.withColumn('row_number', row_number().over(window)).show(truncate=False)

+-----------------+---------------+-----------------------------+----------+
|name             |languages      |properties                   |row_number|
+-----------------+---------------+-----------------------------+----------+
|{Ann, Varsa}     |[.NET, Python] |{eye -> black, hair -> brown}|1         |
|{James, Bond}    |[Java, C#]     |{eye -> brown, hair -> black}|2         |
|{Tom Brand, null}|[Perl, Ruby]   |{eye -> blue, hair -> black} |3         |
|{Tom Cruise, }   |[Python, Scala]|{eye -> grey, hair -> red}   |4         |
+-----------------+---------------+-----------------------------+----------+

