#PySpark – Get substring() from a column

---

**In PySpark, the substring() function extracts a substring from a DataFrame string column, given the start position and the length of the substring you want.**

**This tutorial shows, with examples, how to get a substring of a column using substring() from pyspark.sql.functions and substr() from the pyspark.sql.Column type.**

---

##1. Using SQL function substring()


**Using the substring() function of the pyspark.sql.functions module, we can extract a substring (a slice) of a string DataFrame column by providing the start position and the length of the slice.**



###substring(str, pos, len)



---

**Note: The position is 1-based, not 0-based.**


---
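
To make the 1-based indexing concrete, here is a minimal plain-Python sketch. The helper `substring_1_based` is hypothetical (it is not part of PySpark) and only covers the case `pos >= 1`; it simply shows which Python slice corresponds to a given `pos` and `len`.

```python
def substring_1_based(s: str, pos: int, length: int) -> str:
    """Hypothetical helper: mimic substring(str, pos, len) for pos >= 1 only."""
    if pos < 1:
        raise ValueError("this sketch only covers pos >= 1")
    start = pos - 1  # shift the 1-based position to Python's 0-based index
    return s[start:start + max(length, 0)]


date = "20200828"
print(substring_1_based(date, 1, 4))  # 2020
print(substring_1_based(date, 5, 2))  # 08
print(substring_1_based(date, 7, 2))  # 28
```

In other words, for a positive position, `substring(str, pos, len)` selects the same characters as the Python slice `str[pos - 1 : pos - 1 + len]`.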


**Below is an example of PySpark substring() using withColumn().**

In [0]:
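from pyspark.sql import SparkSession
from pyspark.sql.functions import substring, col

# Reuses the notebook's existing session if there is one; otherwise creates a new one
spark = SparkSession.builder.getOrCreate()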

In [0]:
data = [(1,"20200828"),(2,"20180525")]

columns=["id","date"]

df = spark.createDataFrame(data=data, schema=columns)

df = df.withColumn("year", substring('date', 1, 4))\
.withColumn("month", substring('date', 5, 2))\
.withColumn("day", substring('date', 7, 2))

df.printSchema()


df.show(truncate=False)

root
 |-- id: long (nullable = true)
 |-- date: string (nullable = true)
 |-- year: string (nullable = true)
 |-- month: string (nullable = true)
 |-- day: string (nullable = true)

+---+--------+----+-----+---+
|id |date    |year|month|day|
+---+--------+----+-----+---+
|1  |20200828|2020|08   |28 |
|2  |20180525|2018|05   |25 |
+---+--------+----+-----+---+



**In the above example, we created a DataFrame with two columns, id and date, where date is a string in the form year, month, day (yyyyMMdd). substring() is then applied to the date column to return the year, month, and day as separate substrings. The output is shown above.**

##2. Using substring() with select()


**In PySpark, we can also get a substring of a column using select(). The above example can be written as below.**

In [0]:
df.select('date', substring('date', 1, 4).alias("year"),
         substring('date', 5, 2).alias("month"),
         substring('date', 7, 2).alias("day"))\
.show(truncate=False)

+--------+----+-----+---+
|date    |year|month|day|
+--------+----+-----+---+
|20200828|2020|08   |28 |
|20180525|2018|05   |25 |
+--------+----+-----+---+



##3. Using substring() with selectExpr()


**The following example uses selectExpr() to get substrings of the date column as year, month, and day. The code below gives the same output as above.**

In [0]:
df.selectExpr('date', 'substring(date, 1, 4) as year',
             'substring(date, 5, 2) as month',
             'substring(date, 7, 2) as day')\
.show(truncate=False)

+--------+----+-----+---+
|date    |year|month|day|
+--------+----+-----+---+
|20200828|2020|08   |28 |
|20180525|2018|05   |25 |
+--------+----+-----+---+



##4. Using substr() from Column type


**Below is an example of getting a substring using the substr() function of the pyspark.sql.Column type in PySpark.**

In [0]:
df2 = df.withColumn('year', col('date').substr(1,4))\
.withColumn('month', col('date').substr(5,2))\
.withColumn('day', col('date').substr(7,2))

df2.show(truncate=False)

+---+--------+----+-----+---+
|id |date    |year|month|day|
+---+--------+----+-----+---+
|1  |20200828|2020|08   |28 |
|2  |20180525|2018|05   |25 |
+---+--------+----+-----+---+

