# 0. **Install PySpark**

In [6]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 4.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=6ad6d9d7a4431e2e7faca5c84c9de26d3727a01e98b93d80cf6115f873fad20c
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


# 1. **Importing Libraries and Initializing Spark Session**:


In [7]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

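   - Imports the `pyspark` package and the `SparkSession` class, the entry point for the DataFrame API.
   - Builds a Spark session named 'SparkByExamples.com' that runs locally with a single worker thread (`local[1]`). `getOrCreate()` returns the existing session if one is already running instead of starting a new one.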


# 2. **Defining Schema and Sample Data**:


In [8]:
columns = ["name", "languagesAtSchool", "currentState"]
data = [("James,,Smith", ["Java", "Scala", "C++"], "CA"),
        ("Michael,Rose,", ["Spark", "Java", "C++"], "NJ"),
        ("Robert,,Williams", ["CSharp", "VB"], "NV")]

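- Defines the three column names: `name`, `languagesAtSchool`, and `currentState`. This is a list of names only, not a full schema; Spark infers the column types from the data.
- Defines the sample data as a list of tuples, where each tuple is one row and the second element is a Python list that becomes an array column.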


# 3. **Creating DataFrame**:


In [9]:
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)

root
 |-- name: string (nullable = true)
 |-- languagesAtSchool: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentState: string (nullable = true)

+----------------+------------------+------------+
|name            |languagesAtSchool |currentState|
+----------------+------------------+------------+
|James,,Smith    |[Java, Scala, C++]|CA          |
|Michael,Rose,   |[Spark, Java, C++]|NJ          |
|Robert,,Williams|[CSharp, VB]      |NV          |
+----------------+------------------+------------+



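   - Creates a DataFrame from the sample data, using `columns` as the column names. Spark infers the types: `languagesAtSchool` becomes an array of strings and the other two columns become strings.
   - Prints the schema of the DataFrame.
   - Displays the content of the DataFrame; `truncate=False` prevents long values from being cut off.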


# 4. **Concatenating Array Column**:


In [10]:
from pyspark.sql.functions import col, concat_ws

df2 = df.withColumn("languagesAtSchool", concat_ws(",", col("languagesAtSchool")))
df2.printSchema()
df2.show(truncate=False)

root
 |-- name: string (nullable = true)
 |-- languagesAtSchool: string (nullable = false)
 |-- currentState: string (nullable = true)

+----------------+-----------------+------------+
|name            |languagesAtSchool|currentState|
+----------------+-----------------+------------+
|James,,Smith    |Java,Scala,C++   |CA          |
|Michael,Rose,   |Spark,Java,C++   |NJ          |
|Robert,,Williams|CSharp,VB        |NV          |
+----------------+-----------------+------------+



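   - Imports the `col` and `concat_ws` functions from `pyspark.sql.functions`.
   - Uses `concat_ws` (concatenate with separator) to join the elements of the `languagesAtSchool` array column into a single string, with a comma between elements.
   - Creates a new DataFrame `df2` in which `languagesAtSchool` is replaced by the joined string; `df` itself is unchanged.
   - Prints the schema and displays the content of `df2`. The column type changes from `array` to `string`, and the column is now reported as `nullable = false`.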
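
To make the per-row behavior concrete, here is a plain-Python model of what `concat_ws` does to a single array cell. It is only an illustration: `concat_ws_model` is a hypothetical helper, not a Spark function, and its handling of `None` (skipped elements, and an empty string for a missing array) is an assumption based on the documented behavior that `concat_ws` skips null values.

```python
from typing import Optional, Sequence


def concat_ws_model(sep: str, values: Optional[Sequence[Optional[str]]]) -> str:
    """Hypothetical stand-in for concat_ws applied to one array cell."""
    if values is None:
        # Assumption: a null array contributes nothing, giving an empty string.
        return ""
    # Assumption: null elements are skipped rather than rendered.
    return sep.join(v for v in values if v is not None)


rows = [
    ("James,,Smith", ["Java", "Scala", "C++"], "CA"),
    ("Michael,Rose,", ["Spark", "Java", "C++"], "NJ"),
    ("Robert,,Williams", ["CSharp", "VB"], "NV"),
]

# Apply the model to the array element of each row, leaving the rest untouched.
converted = [(name, concat_ws_model(",", langs), state) for name, langs, state in rows]
for row in converted:
    print(row)
```

The joined strings match the `languagesAtSchool` values shown in the `df2` output above.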


# 5. **Using SQL to Concatenate Array Column**:


In [11]:
df.createOrReplaceTempView("ARRAY_STRING")
spark.sql("SELECT name, concat_ws(',', languagesAtSchool) AS languagesAtSchool, currentState FROM ARRAY_STRING").show(truncate=False)

+----------------+-----------------+------------+
|name            |languagesAtSchool|currentState|
+----------------+-----------------+------------+
|James,,Smith    |Java,Scala,C++   |CA          |
|Michael,Rose,   |Spark,Java,C++   |NJ          |
|Robert,,Williams|CSharp,VB        |NV          |
+----------------+-----------------+------------+



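   - Registers the DataFrame `df` as a temporary view named `ARRAY_STRING`, which makes it queryable from SQL for the lifetime of the session.
   - Runs a SQL query that applies `concat_ws` to the `languagesAtSchool` array column and displays the result, which matches the output of the DataFrame API version.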
