# 0. **Install PySpark**

In [3]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=3f247cb9aa5f73c88d5d05a628ef572cb93f06fe02a7788b44f6b8e5054bf65b
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


# 1. **Importing Libraries and Initializing Spark Session**:

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, ArrayType, StructType, StructField

spark = SparkSession.builder \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

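- Imports `SparkSession` and the type classes (`StringType`, `ArrayType`, `StructType`, `StructField`) used later to define the schema.
- Initializes a Spark session with the application name 'SparkByExamples.com', reusing an existing session if one is already running.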


# 2. **Defining Array Column Type**:


In [5]:
arrayCol = ArrayType(StringType(), False)

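- Defines an array column type whose elements are `StringType`. The second argument, `False`, is `containsNull`, so the array cannot contain `null` elements.
- Note that `arrayCol` is not used in the schema defined in the next step, which declares its array fields as `ArrayType(StringType())` with the default `containsNull=True`.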


# 3. **Defining Sample Data and Schema**:


In [6]:
data = [
    ("James,,Smith", ["Java", "Scala", "C++"], ["Spark", "Java"], "OH", "CA"),
    ("Michael,Rose,", ["Spark", "Java", "C++"], ["Spark", "Java"], "NY", "NJ"),
    ("Robert,,Williams", ["CSharp", "VB"], ["Spark", "Python"], "UT", "NV")
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("languagesAtSchool", ArrayType(StringType()), True),
    StructField("languagesAtWork", ArrayType(StringType()), True),
    StructField("currentState", StringType(), True),
    StructField("previousState", StringType(), True)
])

- Defines sample data as a list of tuples, where each tuple represents a row in the DataFrame.
- Defines a schema with five fields: `name`, `languagesAtSchool`, `languagesAtWork`, `currentState`, and `previousState`.


# 4. **Creating DataFrame**:


In [7]:
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- languagesAtSchool: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- languagesAtWork: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentState: string (nullable = true)
 |-- previousState: string (nullable = true)

+----------------+------------------+---------------+------------+-------------+
|            name| languagesAtSchool|languagesAtWork|currentState|previousState|
+----------------+------------------+---------------+------------+-------------+
|    James,,Smith|[Java, Scala, C++]|  [Spark, Java]|          OH|           CA|
|   Michael,Rose,|[Spark, Java, C++]|  [Spark, Java]|          NY|           NJ|
|Robert,,Williams|      [CSharp, VB]|[Spark, Python]|          UT|           NV|
+----------------+------------------+---------------+------------+-------------+



- Creates a DataFrame from the sample data and schema.
- Prints the schema of the DataFrame.
- Displays the content of the DataFrame.


# 5. **Exploding Array Column**:


In [8]:
from pyspark.sql.functions import explode

df.select(df.name, explode(df.languagesAtSchool)).show()

+----------------+------+
|            name|   col|
+----------------+------+
|    James,,Smith|  Java|
|    James,,Smith| Scala|
|    James,,Smith|   C++|
|   Michael,Rose,| Spark|
|   Michael,Rose,|  Java|
|   Michael,Rose,|   C++|
|Robert,,Williams|CSharp|
|Robert,,Williams|    VB|
+----------------+------+



- Imports the `explode` function from `pyspark.sql.functions`.
- Uses `explode` to turn each element of the `languagesAtSchool` array into its own row, repeating the corresponding `name` on each row.
- Because no alias is given, the exploded column gets the default name `col`.


# 6. **Splitting String Column into Array**:


In [9]:
from pyspark.sql.functions import split

df.select(split(df.name, ",").alias("nameAsArray")).show()

+--------------------+
|         nameAsArray|
+--------------------+
|    [James, , Smith]|
|   [Michael, Rose, ]|
|[Robert, , Williams]|
+--------------------+



- Imports the `split` function from `pyspark.sql.functions`.
- Splits the `name` column on commas into an array of strings and shows the result as a new column, `nameAsArray`.
- Empty parts are kept as empty strings, which is why `James,,Smith` becomes `[James, , Smith]`.


# 7. **Combining Columns into Array**:


In [10]:
from pyspark.sql.functions import array

df.select(df.name, array(df.currentState, df.previousState).alias("States")).show()

+----------------+--------+
|            name|  States|
+----------------+--------+
|    James,,Smith|[OH, CA]|
|   Michael,Rose,|[NY, NJ]|
|Robert,,Williams|[UT, NV]|
+----------------+--------+



- Imports the `array` function from `pyspark.sql.functions`.
- Combines `currentState` and `previousState` columns into a single array column named `States`.


# 8. **Checking for Element in Array**:


In [11]:
from pyspark.sql.functions import array_contains

df.select(df.name, array_contains(df.languagesAtSchool, "Java").alias("array_contains")).show()

+----------------+--------------+
|            name|array_contains|
+----------------+--------------+
|    James,,Smith|          true|
|   Michael,Rose,|          true|
|Robert,,Williams|         false|
+----------------+--------------+



- Imports the `array_contains` function from `pyspark.sql.functions`.
- Checks whether each row's `languagesAtSchool` array contains the element "Java" and returns the result as a boolean column named `array_contains`.