#PySpark Convert String to Array Column

---

**PySpark SQL provides the split() function to convert a delimiter-separated String column into an Array column (StringType to ArrayType) on a DataFrame. It splits the string on a delimiter such as a space, comma, or pipe and returns the pieces as an ArrayType column.**



---


**In this article, I will explain how to convert a String column to an Array column using the split() function, both on a DataFrame and in a SQL query.**

##Split() function syntax

**PySpark SQL split() is grouped under Array Functions in the PySpark SQL functions module (pyspark.sql.functions) and has the syntax below.**


###pyspark.sql.functions.split(str, pattern, limit=-1)

---


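**The split() function takes a DataFrame column of type String as its first argument and the delimiter to split on as its second argument. The delimiter is interpreted as a regular expression, so you can also split on a pattern. The optional limit argument caps the number of elements in the resulting array; the default of -1 applies the pattern as many times as possible. The function returns a pyspark.sql.Column of type Array.**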

---


**Before we start with usage, let’s create a DataFrame with a string column whose text is separated by a comma delimiter.**

In [0]:
data = [
    ("James, A, Smith", "2018", "M", 3000),
    ("Michael, Rose, Jones", "2010", "M", 4000),
    ("Robert,K,Williams", "2010", "M", 4000),
    ("Maria,Anne,Jones", "2005", "F", 4000),
    ("Jen,Mary,Brown", "2010", "", -1),
]

columns = ["name", "dob_year", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)

root
 |-- name: string (nullable = true)
 |-- dob_year: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

+--------------------+--------+------+------+
|name                |dob_year|gender|salary|
+--------------------+--------+------+------+
|James, A, Smith     |2018    |M     |3000  |
|Michael, Rose, Jones|2010    |M     |4000  |
|Robert,K,Williams   |2010    |M     |4000  |
|Maria,Anne,Jones    |2005    |F     |4000  |
|Jen,Mary,Brown      |2010    |      |-1    |
+--------------------+--------+------+------+



**The output above shows a name column that holds the first name, middle name, and last name separated by commas. Note that the first two rows have a space after each comma.**


---

##PySpark Convert String to Array Column


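**The PySpark snippet below splits the String column name on the comma delimiter and converts it to an Array. Because select() returns only the columns you list, the original name column is not part of the result; if you add the array with withColumn() instead, use drop() to remove the original column.**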

In [0]:
from pyspark.sql.functions import col, split

In [0]:
# select() keeps only the listed column, so no drop("name") is needed
df2 = df.select(split(col("name"), ",").alias("NameArray"))

df2.printSchema()
df2.show(truncate=False)

root
 |-- NameArray: array (nullable = true)
 |    |-- element: string (containsNull = false)

+------------------------+
|NameArray               |
+------------------------+
|[James,  A,  Smith]     |
|[Michael,  Rose,  Jones]|
|[Robert, K, Williams]   |
|[Maria, Anne, Jones]    |
|[Jen, Mary, Brown]      |
+------------------------+



##Convert String to Array Column using SQL Query


**Since PySpark provides a way to execute raw SQL, let’s write the same example using a Spark SQL expression.**

**To use raw SQL, you first need to register the DataFrame as a temporary view using createOrReplaceTempView(). The view is available for the lifetime of the SparkSession that created it.**

In [0]:
df.createOrReplaceTempView("PERSON")
spark.sql("SELECT SPLIT(name, ',') AS NamedArray FROM PERSON")\
.show(truncate=False)

+------------------------+
|NamedArray              |
+------------------------+
|[James,  A,  Smith]     |
|[Michael,  Rose,  Jones]|
|[Robert, K, Williams]   |
|[Maria, Anne, Jones]    |
|[Jen, Mary, Brown]      |
+------------------------+

