#### Spark certification

Spark is a distributed data processing platform. It is an open-source unified analytics engine for large-scale data processing.

##### Databricks platform

* Magic commands: %python, %scala, %sql, %r, %sh, %md (%sh -> run shell commands on the driver)
* Render HTML: displayHTML
* DBFS = Databricks File System -> virtual file system, allows you to treat cloud object storage as though it were local files and directories on the cluster. You can run file system commands on DBFS using %fs. (example: %fs ls)
* DBUtils: dbutils.fs (%fs), dbutils.notebook (%run), dbutils.widgets (examples: `dbutils.fs.ls("/databricks-datasets")`, `%fs help`)

CREATE TABLE: You can use Databricks SQL commands to create a table:

`%sql
CREATE TABLE IF NOT EXISTS events USING parquet OPTIONS (path "/mnt/training/ecommerce/events/events.parquet");`

* Databricks WIDGETS
Input widgets allow you to add parameters to your notebooks and dashboards. The widget API consists of calls to create various types of input widgets, remove them, and get bound values. There are 4 types: text, dropdown, combobox, multiselect.
Documentation: dbutils.widgets.help()

<img src="https://www.edureka.co/blog/wp-content/uploads/2018/09/Picture6-2.png" alt="spark">

##### Spark overview

Spark is the de facto standard unified analytics engine for big data processing. It's the largest open-source project in data processing.
Fast (faster than Hadoop MapReduce), easy to use, unified.
Core modules: Spark SQL + DataFrames, Spark Streaming, MLlib, and the Spark Core API (all other functionality is built on top of it).

Spark execution: Spark uses clusters to distribute the work across different machines. The secret of Spark's performance is PARALLELISM. Each parallelized action is referred to as a JOB. A job is broken down into STAGES. Each stage consists of TASKS, which are created by the driver and each assigned a partition of data to process.

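A Spark application can run several Spark jobs. By default they run one after another; they run in parallel only when they are submitted from separate threads.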

Each spark application may run as a series of spark jobs (the driver converts your spark application into one or more spark jobs). Spark job is internally represented as a DAG (spark's execution plan) of stages. Each node within a DAG could be a single or multiple spark stages. 

**DAG** (Directed Acyclic Graph) in Apache Spark is a set of vertices and edges, where the vertices represent the RDDs and the edges represent the operations to be applied to the RDDs. (As you know, one characteristic of Spark is that it is lazy. This means that it does not execute any work until it has to deliver a final result. It builds a "list" of tasks that are not executed until an execution command is sent. This list of tasks is known as a directed acyclic graph (DAG), which means that the tasks run in a cascade without ever going back.)
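The lazy behaviour described above can be illustrated with a toy model in plain Python. This is only a sketch, not how Spark is implemented: `LazyPlan`, `double`, and `calls` are hypothetical names invented for the illustration, and the "plan" here is a simple list of steps rather than a real DAG.

```python
from typing import Callable, Iterable, List, Optional


class LazyPlan:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark API)."""

    def __init__(self, data: Iterable[int], steps: Optional[List[Callable]] = None):
        self._data = list(data)
        self._steps = list(steps or [])  # recorded transformations, not yet run

    def map(self, fn: Callable) -> "LazyPlan":
        # Transformation: only record the step and return a new plan.
        return LazyPlan(self._data, self._steps + [lambda rows: [fn(r) for r in rows]])

    def filter(self, pred: Callable) -> "LazyPlan":
        # Transformation: again, nothing is computed here.
        return LazyPlan(self._data, self._steps + [lambda rows: [r for r in rows if pred(r)]])

    def collect(self) -> List[int]:
        # Action: run every recorded step in order and return the result.
        rows = self._data
        for step in self._steps:
            rows = step(rows)
        return rows


calls = []  # records when the mapped function really runs


def double(x: int) -> int:
    calls.append(x)
    return x * 2


plan = LazyPlan(range(5)).map(double).filter(lambda x: x > 2)
ran_before_action = len(calls)  # still 0: building the plan executed nothing
result = plan.collect()         # the action triggers the whole plan
```

Only the call to `collect()` causes `double` to run, which mirrors how Spark transformations stay pending until an action is called.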

TASK: a combination of a block of data and a set of transformations that will run on a single executor. If there is one big partition in our dataset, we will have one task. Each task maps to a single core and works on a single partition of data. As such, an executor with 16 cores can have 16 or more tasks working on 16 or more partitions in parallel, making the execution of Spark’s tasks exceedingly parallel (they DON'T run in sequence). So an executor is not limited to running one task at a time.

STAGE: a group of tasks that can be executed in parallel to compute the same set of operations on potentially multiple machines. A stage is a unit of work that is executed as a set of parallel tasks without a shuffle. The Spark engine starts new stages after operations called shuffles. A shuffle represents a physical repartitioning of the data. This type of repartitioning requires coordinating across executors to move data around.

Each stage is comprised of Spark tasks (a unit of execution).

Components that Spark uses to coordinate work across a cluster of computers:
DRIVER is the machine in which the application runs. It is responsible for 3 main things:
* Maintaining information about the Spark application
* Responding to the user's program
* Analyzing, distributing and scheduling work across the executors

A WORKER node hosts the executor process. It has a fixed number of executors allocated at any point in time.
EXECUTORS: each executor holds a chunk of the data to be processed; this chunk is called a SPARK PARTITION. It is a collection of rows that sits on one physical machine in the cluster. This is completely separate from hard-disk partitions, which have to do with storage space on a hard drive. Executors are responsible for carrying out the work assigned by the driver. Each executor is responsible for 2 things:
* Execute the code assigned
* Report the state of the computation back to the driver

Spark **executors** run on **worker nodes** in the Spark cluster. Each worker node may run one or more executors, depending on the resources available on the worker node. A Spark worker is a node in the Spark cluster where Spark executors run.

Slots are not the same thing as executors. Executors could have multiple slots in them, and tasks are executed on slots.

Worker nodes are fault-tolerant.

Spark parallelizes work on 2 levels. The first is splitting the work across executors; the second is the SLOT. Each executor has a number of slots, and each slot can be assigned a task. (A slot corresponds to a core.)
SLOTS are resources for parallelization within a Spark application.
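As a rough back-of-the-envelope illustration of slots, the sketch below estimates how many sequential "waves" of tasks a stage needs. It assumes one slot per core, one task per partition, and tasks of equal length; `task_waves` is a hypothetical helper, not a Spark function, and real scheduling is more dynamic than this.

```python
import math


def task_waves(num_partitions: int, num_executors: int, cores_per_executor: int) -> int:
    """Number of sequential waves needed if each slot runs one task at a time."""
    total_slots = num_executors * cores_per_executor  # assumption: one slot per core
    return math.ceil(num_partitions / total_slots)


# One executor with 16 cores can process 16 partitions in a single wave.
waves_small = task_waves(num_partitions=16, num_executors=1, cores_per_executor=16)

# 200 partitions on 4 executors x 16 cores (64 slots) need several waves.
waves_large = task_waves(num_partitions=200, num_executors=4, cores_per_executor=16)
```

With 64 slots and 200 partitions, the last wave is only partly filled, which is one reason the partition count is often tuned relative to the number of cores.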

<img src="https://miro.medium.com/max/1400/1*9lw9eMn9oUbLDN0SgbxQEw.png" alt="spark">

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

By default, Spark’s scheduler runs jobs in **FIFO fashion**. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.

Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a **“round robin”** fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.

To enable the fair scheduler, set the **spark.scheduler.mode** property to **FAIR** when configuring a SparkContext, for example with `conf.set("spark.scheduler.mode", "FAIR")` on the `SparkConf` used to create it.

Spark's job scheduler:
- By default, Spark's scheduler runs jobs in a FIFO fashion
- FAIR scheduler assigns tasks between jobs in a "round robin" fashion
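The difference between the two modes can be sketched with a small simulation in plain Python. This is a toy model under strong assumptions (all tasks take exactly one round, no pools or weights), not Spark's scheduler; `simulate` and the job names are hypothetical.

```python
from typing import Dict


def simulate(jobs: Dict[str, int], slots: int, mode: str) -> Dict[str, int]:
    """Return the round in which each job finishes (toy model, equal-length tasks)."""
    remaining = dict(jobs)  # tasks left per job, kept in submission order
    finished: Dict[str, int] = {}
    round_no = 0
    while remaining:
        round_no += 1
        free = slots
        if mode == "FIFO":
            # Earlier jobs take every slot they can use before later jobs get any.
            for name in list(remaining):
                used = min(free, remaining[name])
                remaining[name] -= used
                free -= used
        else:
            # "FAIR": hand out slots one at a time, cycling over the jobs.
            while free > 0 and any(remaining.values()):
                for name in list(remaining):
                    if free > 0 and remaining[name] > 0:
                        remaining[name] -= 1
                        free -= 1
        # Record the jobs that ran their last task in this round.
        for name in [n for n, left in remaining.items() if left == 0]:
            finished[name] = round_no
            del remaining[name]
    return finished


jobs = {"long_job": 12, "short_job": 2}  # hypothetical task counts
fifo = simulate(jobs, slots=4, mode="FIFO")
fair = simulate(jobs, slots=4, mode="FAIR")
```

In this model the short job finishes in round 4 under FIFO but in round 1 under round-robin sharing, while the long job is delayed by only one round.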

In general, there should be one Spark **job** for one action. Actions always return results. Each job breaks down into a series of **stages**, the number of which depends on how many shuffle operations need to take place.

A Spark JOB is internally represented as a DAG of stages. (Each Spark application may run as a series of Spark Jobs.)

Not all Spark operations can happen in a single stage, so they may be divided into multiple stages.

STAGES -> **Stages** in Spark represent groups of tasks that can be executed together to compute the same operation on multiple machines. Spark will try to pack as much work as possible (i.e., as many transformations as possible inside your job) into the same stage, but the engine starts new stages after operations called shuffles. A shuffle represents a physical repartitioning of the data—for example, sorting a DataFrame, or grouping data that was loaded from a file by key. Spark starts a new stage after each shuffle, and keeps track of what order the stages must run in to compute the final result. Stages are created based on what operations can be performed serially or in parallel. They represent groups of tasks that can be executed together. 
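The rule "a new stage starts after each shuffle" can be sketched as follows. This is a simplified model in plain Python: the `SHUFFLE_OPS` set is an illustrative assumption (Spark decides shuffle boundaries from the physical plan, and some operations shuffle only in certain cases), and `split_into_stages` is a hypothetical helper.

```python
from typing import List

# Assumption for this sketch: these operations always cause a shuffle.
SHUFFLE_OPS = {"groupBy", "orderBy", "repartition", "distinct"}


def split_into_stages(operations: List[str]) -> List[List[str]]:
    """Group operations into stages, closing a stage at every shuffle."""
    stages: List[List[str]] = [[]]
    for op in operations:
        stages[-1].append(op)
        if op in SHUFFLE_OPS:
            stages.append([])  # the shuffle ends the current stage
    return [stage for stage in stages if stage]


plan = ["read", "filter", "groupBy", "select", "orderBy", "show"]
stages = split_into_stages(plan)
```

Two shuffles (`groupBy` and `orderBy`) yield three stages here, which matches the idea that the number of stages depends on how many shuffles take place.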

The **spark.sql.shuffle.partitions** default value is 200, which means that when there is a shuffle performed during execution, it outputs 200 shuffle partitions by default. You can change this value, and the number of output partitions will change. (The number of partitions should be set according to the number of cores in your cluster to ensure efficient execution.)

`spark.conf.set("spark.sql.shuffle.partitions",100)`

A good rule of thumb is that the number of partitions should be larger than the number of executors on your cluster, potentially by multiple factors depending on the workload. If you are running code on your local machine, it would behoove you to set this value lower because your local machine is unlikely to be able to execute that number of tasks in parallel.

TASKS -> Stages in Spark consist of **tasks**. Each task corresponds to a combination of blocks of data and a set of transformations that will run on a single executor. If there is one big partition in our dataset, we will have one task. If there are 1,000 little partitions, we will have 1,000 tasks that can be executed in parallel. A task is just a unit of computation applied to a unit of data (the partition). Partitioning your data into a greater number of partitions means that more can be executed in parallel.

<img src="https://www.researchgate.net/publication/332829490/figure/fig1/AS:754304478633984@1556851610122/Job-paths-in-the-Apache-Spark-cluster-12.png" alt="spark">

The **Spark driver** is the node in which the spark application's main method runs to coordinate the spark application. It contains the SparkContext object. It is responsible for scheduling the execution of data by various worker nodes in cluster mode. The spark driver should be as close as possible to worker nodes for optimal performance. 

-> Note: it is NOT true that the driver is horizontally scaled to increase overall processing throughput.

The Spark driver is the controller of the execution of a Spark Application and maintains all of the state of the Spark cluster (the state and tasks of the executors). It must interface with the cluster manager in order to actually get physical resources and launch executors. At the end of the day, this is just a process on a physical machine that is responsible for maintaining the state of the application running on the cluster.

The driver converts your Spark application into one or more Spark jobs. It then transforms each job into a DAG (Spark’s execution plan), where each node within a DAG could be single or multiple Spark stages.

The Spark driver has multiple roles: 

- It communicates with the cluster manager; 
- It requests resources (CPU, memory, etc.) from the cluster manager for Spark’s executors (JVMs); 
- It transforms all the Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark executors. Once the resources are allocated, it communicates directly with the executors.

Once the resources are allocated for the executors, the driver communicates directly with the executors running on the worker nodes.

Every Spark Application creates one driver and one or more executors at run time. Drivers and executors are never shared. Every application will have its own dedicated Spark Driver.

##### Cluster mode

In cluster mode, the Spark driver runs on a worker node inside the cluster. The cluster manager launches the driver process on a worker node inside the cluster, in addition to the executor processes. This means that the cluster manager is responsible for maintaining all Spark application-related processes: it places the driver on a worker node and the executors on other worker nodes.

Worker nodes are machines that host the executors responsible for the execution of tasks.

Cluster mode is probably the most common way of running Spark Applications. In cluster mode, a user submits a pre-compiled JAR, Python script, or R script to a cluster manager. The cluster manager then launches the driver process on a worker node inside the cluster, in addition to the executor processes. This means that the cluster manager is responsible for maintaining all Spark Application–related processes. 

A Spark application running in cluster mode has its driver on one of the worker nodes in the Spark cluster. The Spark driver never does any data processing itself, so there is always at least one executor running on some worker node. (False statements: "there is a single worker node that contains the Spark driver and all the executors"; "the Spark driver alone runs the Spark application".)

**Cluster manager** creates worker nodes and allocates resources to them.
The cluster manager is responsible for maintaining a cluster of machines that will run your Spark Application(s). Somewhat confusingly, a cluster manager will have its own “driver” (sometimes called master) and “worker” abstractions. The core difference is that these are tied to physical machines rather than processes (as they are in Spark). -> A cluster manager may have its own master and worker nodes. When it comes time to actually run a Spark Application, we request resources from the cluster manager to run it.

The cluster manager allocates resources and keeps track of resources across applications. It is an external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN, Kubernetes). These resources are assigned as Spark workers or containers. The cluster manager starts the Spark driver and allocates resources for the cluster of nodes on which your Spark application runs. Currently Spark supports 4 cluster managers: the built-in standalone cluster manager, Apache Hadoop YARN, Apache Mesos, and Kubernetes.

**Executors are Java Virtual Machines (JVMs) running on a worker node.**

-> The driver program translates your application logic into jobs, stages, and tasks. On YARN, the **application master** makes sure enough resources are obtained from the ResourceManager (RM) and checks the status of the tasks running in containers.

Spark executors are the processes that perform the tasks assigned by the Spark driver. Executors have one core responsibility: take the tasks assigned by the driver, run them, and report back their state (success or failure) and results. Each Spark Application has its own separate executor processes.

##### Spark Deploy Modes

Spark's execution/deployment mode determines where the driver and executors are physically located when a Spark application is run.

Spark deployment mode (--deploy-mode) specifies where to run the driver program of your Spark application/job, Spark provides two deployment modes, client and cluster, you could use these to run Java, Scala, and PySpark applications.
* Cluster: In cluster mode, the driver runs on one of the worker nodes, and this node shows as a driver on the Spark Web UI of your application. Cluster mode is used to run production jobs. You can submit your applications in cluster mode using the spark-submit utility. On YARN, cluster mode runs the driver inside the YARN Application Master. (YARN supports both client and cluster modes.) On Kubernetes, cluster mode is the usual choice; client mode has also been supported since Spark 2.4.
* Client: In client mode, the driver runs locally on the machine from which you submit your application with the spark-submit command. Client mode is mainly used for interactive and debugging purposes. Note that in client mode only the driver runs locally; all tasks run on cluster worker nodes. Client mode is nearly the same as cluster mode except that the Spark driver remains on the client machine that submitted the application. This means that the client machine is responsible for maintaining the Spark driver process, and the cluster manager maintains the executor processes.


In Client mode, Spark runs driver in local machine, and in cluster mode, it runs driver on one of the nodes in the cluster.

**Local Mode**, also known as Spark in-process, is the default mode of Spark. It does not require any resource manager. It runs everything on the same machine. Thanks to local mode, we can simply download Spark and run it without having to install any resource manager.
Local mode is a significant departure from the previous two modes: it runs the entire Spark Application on a single machine. It achieves parallelism through threads on that single machine. This is a common way to learn Spark, to test your applications, or experiment iteratively with local development. However, we do not recommend using local mode for running production applications. 

Local mode runs the spark driver and executor in the same JVM on a single computer.

The only difference between client and cluster mode on YARN is where the driver runs. In client mode, the driver runs on the machine where the Spark application was submitted, and the Application Master (AM) runs on one of the cluster nodes. In cluster mode, the driver runs inside the Application Master, which means the Application Master has much more responsibility.

You can submit spark applications in client mode or in cluster mode. The cluster mode starts the driver on the Spark cluster, whereas client mode starts the driver on the client machine.

##### Spark SQL

Spark SQL is a module used for structured data processing with multiple interfaces. We can also interact with Spark sql using the dataframe API. The same query can be expressed with SQL and the DataFrame API. 

Query plans (sql queries, python/scala dataframe API) -> Optimized query plan -> RDDs -> execution

Resilient Distributed Datasets (RDDs) are the low-level representation of datasets processed by a Spark cluster. In early versions of Spark, you had to write code manipulating RDDs directly. In modern versions of Spark you should instead use the higher-level DataFrame APIs, which Spark automatically compiles into low-level RDD operations.

##### DataFrames

DataFrames and Datasets are (distributed) table-like collections with well-defined rows and columns. To Spark, DataFrames and Datasets represent immutable, lazily evaluated plans that specify what operations to apply to data residing at a location to generate some output. When we perform an action on a DataFrame, we instruct Spark to perform the actual transformations and return the result. Tables and views are basically the same thing as DataFrames. We just execute SQL against them instead of DataFrame code.


* DATAFRAMES: a DataFrame is a distributed collection of data grouped into named columns. DataFrames are a data abstraction in Spark that helps us think about data in a familiar, tabular way.
* SCHEMA is what defines the column names and types of a Dataframe.
* Dataframe TRANSFORMATIONS are methods that return a new Dataframe and are lazily evaluated. (example: select, where and orderBy)
* Dataframe ACTIONS are methods that trigger computation (example: count, collect, show, first, foreach). An action is needed to trigger the execution of any dataframe transformations.

Recall that expressing our query using methods in the DataFrame API returns results in a DataFrame. 

To create a new DataFrame or Dataset, you call a transformation. To start computation or convert to native language types, you call an action.

Spark allows you to use the notion of schema-on-read using the infer schema option. However, this approach may have some problems. Schema-on-read is recommended for ad hoc analysis and Spark SQL. Infer schema is slow and it often infers column types incorrectly. So the recommendation is to define your schema manually for production use cases.

In [0]:
# Access a dataframe's schema using the "schema" attribute.

budgetDF.schema

budgetDF.printSchema()

You can use the `add()` method to add new fields to your schema. The field's type should be given as a `pyspark.sql.types.DataType`, such as `StringType()`.

`newSchema = mySchema.add("Department", StringType())`

Most of the time, schema inference reads a date column as a string (not as a date type).

##### DataFrame Action Methods
| Method | Description |
| --- | --- |
| show | Displays the top n rows of DataFrame in a tabular form |
| count | Returns the number of rows in the DataFrame |
| describe, summary | Computes basic statistics for numeric and string columns |
| first, head | Returns the first row |
| collect | Returns an array that contains all rows in this DataFrame |
| take | Returns an array of the first n rows in the DataFrame |

Another common task is to compute summary statistics for a column or set of columns. We can use the **describe** method to achieve exactly this. This will take all numeric columns and calculate the count, mean, standard deviation, min, and max.

In [0]:
salesDF = spark.read.parquet("/mnt/training/ecommerce/sales/sales.parquet")
salesDF.show(5)

+--------+--------------------+---------------------+-------------------+-----------------------+------------+--------------------+
|order_id|               email|transaction_timestamp|total_item_quantity|purchase_revenue_in_usd|unique_items|               items|
+--------+--------------------+---------------------+-------------------+-----------------------+------------+--------------------+
|  257437|kmunoz@powell-dur...|     1592194221828900|                  1|                 1995.0|           1|[{null, M_PREM_K,...|
|  282611|bmurillo@hotmail.com|     1592504237604072|                  1|                  940.5|           1|[{NEWBED10, M_STA...|
|  257448| bradley74@gmail.com|     1592200438030141|                  1|                  945.0|           1|[{null, M_STAN_F,...|
|  257440|jameshardin@campb...|     1592197217716495|                  1|                 1045.0|           1|[{null, M_STAN_Q,...|
|  283949| whardin@hotmail.com|     1592510720760323|                  1|   

In [0]:
salesDF.describe().show()

+-------+-----------------+-------------------+---------------------+-------------------+-----------------------+-------------------+
|summary|         order_id|              email|transaction_timestamp|total_item_quantity|purchase_revenue_in_usd|       unique_items|
+-------+-----------------+-------------------+---------------------+-------------------+-----------------------+-------------------+
|  count|           210370|             210370|               210370|             210370|                 210370|             210370|
|   mean|         362617.5|               null| 1.593088009692872...|  1.146166278461758|     1042.7902657223433| 1.1214098968484099|
| stddev|60728.73240210701|               null| 4.600583796479087...| 0.3930559465510659|      494.5537027838925|0.35208445487528167|
|    min|           257433|aacosta@hotmail.com|     1592181560601984|                  1|                   53.1|                  1|
|    max|           467802|  zzuniga@quinn.com|     1593879294

To extract the value of the column "email" from the first row of the DataFrame salesDF:

`salesDF.first().email`

##### Spark session

First step of any Spark application: create a SPARK SESSION, which is the single entry point to all DataFrame API functionality. (It is created automatically in a Databricks notebook as the variable `spark`, so evaluating `spark` in a notebook cell shows information about the session and gives access to the Spark UI, see below.)

In a standalone Spark application, you can create a SparkSession using one of the high-level APIs in the programming language of your choice. In the Spark shell, the SparkSession is created for you, and you can access it via a global variable called `spark` (the SparkContext is available as `sc`).

spark session methods: 
* sql: returns a dataframe representing the result of the given query
* table: returns the specified table as a dataframe
* read: returns a dataframereader that can be used to read data in as a dataframe
* range: creates a DataFrame with a column containing elements in a range from start to end (exclusive) with a step value and number of partitions
* createDataFrame: creates a dataframe from a list of tuples, primarily used for testing

In Spark 2.0, the SparkSession became a unified conduit to all Spark operations and data. Not only did it subsume previous entry points to the Spark like the SparkContext, SQLContext, HiveContext, SparkConf, and StreamingContext, but it also made working with Spark simpler and easier.
Although in Spark 2.x the SparkSession subsumes all other contexts, you can still access the individual contexts and their respective methods. In this way, the community maintained backward compatibility.

Spark Session allows you to create JVM runtime parameters, define DataFrames and Datasets, read from data sources, access catalog metadata, and issue Spark SQL queries. SparkSession provides a single unified entry point to all of Spark’s functionality.

A **SparkContext** object within the SparkSession represents the connection to the Spark cluster. This class is how you communicate with some of Spark’s lower-level APIs, such as RDDs. Through a SparkContext, you can create RDDs, accumulators, and broadcast variables, and you can run code on the cluster.

`from pyspark.sql import SparkSession`

`spark = (SparkSession
   .builder
   .appName("MySparkApplication")
   .getOrCreate())`
   
   
`spark.scheduler.mode` This configuration sets the scheduling mode between jobs submitted to the same SparkContext.

`spark-submit`-> it lets you send your application code to a cluster and launch it to execute there. Upon submission, the application will run until it exits (completes the task) or encounters an error.
You can do this with all of Spark’s supported cluster managers, including Standalone, Mesos, and YARN. The simplest example is running an application on your local machine.

You can specify Spark configurations directly in your Spark application, or on the command line when submitting the application with spark-submit, using the --conf flag:

`spark-submit --conf spark.sql.shuffle.partitions=5 --conf`


To avoid job failures due to resource starvation or gradual performance degradation, there are a handful of Spark configurations that you can enable or alter. These configurations affect three Spark components: the Spark driver, the executor, and the shuffle service running on the executor.

spark-submit options that are useful for providing dependencies to the Spark application:
* --jars
* --packages 
* --py-files

(not useful: --files)
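To make the shape of such a command concrete, the sketch below assembles a spark-submit command line in plain Python. It only builds the argument list and does not launch anything; `build_spark_submit`, `my_app.py`, and `deps.zip` are hypothetical names chosen for the illustration, while `--deploy-mode`, `--conf`, and `--py-files` are the spark-submit options discussed above.

```python
import shlex
from typing import Dict, List


def build_spark_submit(app: str, deploy_mode: str,
                       conf: Dict[str, str], py_files: List[str]) -> List[str]:
    """Assemble a spark-submit argument list (nothing is executed here)."""
    cmd = ["spark-submit", "--deploy-mode", deploy_mode]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]  # one --conf flag per property
    if py_files:
        cmd += ["--py-files", ",".join(py_files)]  # comma-separated list
    cmd.append(app)  # the application file comes last
    return cmd


cmd = build_spark_submit(
    app="my_app.py",        # hypothetical application script
    deploy_mode="cluster",
    conf={"spark.sql.shuffle.partitions": "5", "spark.scheduler.mode": "FAIR"},
    py_files=["deps.zip"],  # hypothetical dependency archive
)
command_line = shlex.join(cmd)
```

The resulting list could be passed to `subprocess.run` on a machine where spark-submit is installed; that step is left out here so the sketch runs anywhere.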

##### Static versus dynamic resource allocation

With static allocation, resources are fixed when the application is submitted: if more resources are needed as tasks queue up in the driver due to a larger than anticipated workload, Spark cannot accommodate or allocate extra resources.
If instead you use Spark’s **dynamic resource allocation** configuration, the Spark driver can request more or fewer compute resources as the demand of large workloads flows and ebbs. In scenarios where your workloads are dynamic (that is, they vary in their demand for compute capacity), dynamic allocation helps to accommodate sudden peaks.
By default spark.dynamicAllocation.enabled is set to false.

`spark.dynamicAllocation.enabled true`

This feature enables Spark driver to request more or fewer compute resources as the demand of large workloads flows.

**Dynamic Allocation**  is used to scale up and down the number of executors dynamically based on the application's current number of pending tasks in a Spark cluster.

The following configurations are related to enabling dynamic adjustment of the resources for your application based on the workload:

- `spark.dynamicAllocation.shuffleTracking.enabled`
- `spark.dynamicAllocation.enabled`
- `spark.shuffle.service.enabled`
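The idea of scaling executors with the number of pending tasks can be sketched as a simple formula. This is a deliberately simplified model, not Spark's actual allocation policy (which also involves timeouts and gradual ramp-up); `target_executors` is a hypothetical helper, and its bounds play the role of the `spark.dynamicAllocation.minExecutors` and `spark.dynamicAllocation.maxExecutors` settings.

```python
import math


def target_executors(pending_tasks: int, slots_per_executor: int,
                     min_executors: int, max_executors: int) -> int:
    """Executors needed to run all pending tasks at once, clamped to the bounds."""
    needed = math.ceil(pending_tasks / slots_per_executor)
    return max(min_executors, min(max_executors, needed))


idle = target_executors(pending_tasks=0, slots_per_executor=4,
                        min_executors=2, max_executors=20)
busy = target_executors(pending_tasks=30, slots_per_executor=4,
                        min_executors=2, max_executors=20)
peak = target_executors(pending_tasks=1000, slots_per_executor=4,
                        min_executors=2, max_executors=20)
```

An idle application shrinks to the minimum, a moderate backlog gets just enough executors, and a large peak is capped at the maximum.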

In [0]:
spark

You can use the SparkSession method `table` to create a DataFrame from the `products` table. Let's save this in the variable `productsDF`.

`productsDF = spark.table("products")`

We can use `display()` to output the results of a dataframe.

`display(budgetDF)`

##### Convert between DataFrames and SQL

`createOrReplaceTempView` creates a temporary view based on the DataFrame. The lifetime of the temporary view is tied to the SparkSession that was used to create the DataFrame. Spark views are temporary and they disappear after your Spark application terminates.

`budgetDF.createOrReplaceTempView("budget")`

You can drop a Spark view in two ways.
1. Using the DROP VIEW Spark SQL expression: `spark.sql("DROP VIEW IF EXISTS my_view")`
2. Using `spark.catalog.dropTempView()`: `spark.catalog.dropTempView("my_view")`

However, do remember that temp view and global temp view are different. -> Global temp views are accessed via prefix global_temp

`spark.read.table("global_temp.my_global_view")`

##### Data Sources

* CSV: a text file that uses commas or other delimiters to separate values. There's an optional header.
* Apache Parquet: a columnar storage format that provides a compressed and efficient columnar data representation. Unlike CSV, Parquet allows you to load only the columns you need, since the values for a single record are not stored together. The schema is stored in a footer of the file, so you don't need to infer it. Compression means you don't lose space storing missing values.
* Delta Lake: an open-source technology designed to be used with Spark to build robust data lakes.

##### DataFrame reader/writer

DataFrameReader (interface used to load a dataframe from external storage systems)
`spark.read.parquet("path/to/file")`

DataFrameReader is accessible through the SparkSession attribute `read`. This class includes methods to load DataFrames from different external storage systems.

DataFrameWriter: write a dataframe to external storage systems
`(df.write
  .option("compression", "snappy")
  .mode("overwrite")
  .parquet(outPath)
)`

##### DataFrame Reader

Spark's DataFrameReader lets you set a `mode` option that controls how corrupted records are handled. The default mode is **permissive**: when it meets a corrupted record, it puts the malformed string into a field configured by `columnNameOfCorruptRecord` and sets the malformed fields to null.
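The permissive behaviour can be mimicked for JSON lines with the standard library. This is a plain-Python imitation for intuition only, not Spark code: `read_permissive` and the two-field schema are hypothetical, and `_corrupt_record` is used because it is Spark's default name for the corrupt-record column.

```python
import json
from typing import Dict, List, Optional

FIELDS = ["user_id", "email"]        # hypothetical schema for this sketch
CORRUPT_COLUMN = "_corrupt_record"   # Spark's default corrupt-record column name


def read_permissive(lines: List[str]) -> List[Dict[str, Optional[str]]]:
    """Parse JSON lines; keep unparseable lines instead of failing."""
    rows = []
    for line in lines:
        row: Dict[str, Optional[str]] = {name: None for name in FIELDS}
        row[CORRUPT_COLUMN] = None
        try:
            parsed = json.loads(line)
            if not isinstance(parsed, dict):
                raise ValueError("not a JSON object")
            for name in FIELDS:
                row[name] = parsed.get(name)
        except ValueError:
            # Malformed record: fields stay None, the raw text is preserved.
            row[CORRUPT_COLUMN] = line
        rows.append(row)
    return rows


lines = [
    '{"user_id": "u1", "email": "a@example.com"}',
    '{"user_id": "u2", "email": ',  # truncated, so not valid JSON
]
rows = read_permissive(lines)
```

The read does not fail on the bad line; instead the row carries nulls for the schema fields and the original text in the corrupt-record column, where it can be inspected later.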

In [0]:
# Read from CSV with the DataFrameReader's "csv" method and the following options: Tab separator, use first line as header, infer schema

usersCsvPath = "/mnt/training/ecommerce/users/users-500k.csv"

usersDF = (spark
           .read
           .option("sep", "\t")
           .option("header", True)
           .option("inferSchema", True)
           .csv(usersCsvPath)
          )

usersDF.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- user_first_touch_timestamp: long (nullable = true)
 |-- email: string (nullable = true)



In [0]:
df = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .option("sep", ";") \
    .load("data/my_data_file.txt")

In [0]:
# Spark's Python API also allows you to specify the DataFrameReader options as parameters to the "csv" method

usersDF = (spark
           .read
           .csv(usersCsvPath, sep="\t", header=True, inferSchema=True)
          )

usersDF.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- user_first_touch_timestamp: long (nullable = true)
 |-- email: string (nullable = true)



In [0]:
# Manually define the schema by creating a "StructType" with column names and data types

from pyspark.sql.types import LongType, StringType, StructType, StructField

# from pyspark.sql.types import ArrayType, DoubleType, IntegerType, LongType, StringType, StructType, StructField

userDefinedSchema = StructType([
    StructField("user_id", StringType(), True),
    StructField("user_first_touch_timestamp", LongType(), True),
    StructField("email", StringType(), True)
])

# Read from CSV using this user-defined schema instead of inferring the schema

usersDF = (spark
           .read
           .option("sep", "\t")
           .option("header", True)
           .schema(userDefinedSchema)
           .csv(usersCsvPath)
          )

# Alternatively, define the schema using data definition language (DDL) syntax.

DDLSchema = "user_id string, user_first_touch_timestamp long, email string"

usersDF = (spark
           .read
           .option("sep", "\t")
           .option("header", True)
           .schema(DDLSchema)
           .csv(usersCsvPath)
          )

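**DDL statements** (such as `CREATE TABLE`) create Spark database objects whose definitions are stored in a metastore. This is different from the DDL-formatted schema string above, which only describes column names and types and needs no metastore.

To store table definitions in a Hive metastore, you must enable Hive support and include the Hive dependencies: call `enableHiveSupport()` while creating your Spark Session. Without Hive support, Spark uses an in-memory catalog that is lost when the application ends, and Hive-format tables are not available.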

In [0]:
# Read from JSON files 

eventsJsonPath = "/mnt/training/ecommerce/events/events-500k.json"

eventsDF = (spark
            .read
            .option("inferSchema", True)
            .json(eventsJsonPath)
           )

eventsDF.printSchema()

root
 |-- device: string (nullable = true)
 |-- ecommerce: struct (nullable = true)
 |    |-- purchase_revenue_in_usd: double (nullable = true)
 |    |-- total_item_quantity: long (nullable = true)
 |    |-- unique_items: long (nullable = true)
 |-- event_name: string (nullable = true)
 |-- event_previous_timestamp: long (nullable = true)
 |-- event_timestamp: long (nullable = true)
 |-- geo: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- coupon: string (nullable = true)
 |    |    |-- item_id: string (nullable = true)
 |    |    |-- item_name: string (nullable = true)
 |    |    |-- item_revenue_in_usd: double (nullable = true)
 |    |    |-- price_in_usd: double (nullable = true)
 |    |    |-- quantity: long (nullable = true)
 |-- traffic_source: string (nullable = true)
 |-- user_first_touch_timestamp: long (nullable = true)

In [0]:
%scala

// You can use the `StructType` Scala method `toDDL` to have a DDL-formatted string created for you. In a Python notebook, create a Scala cell to create the string to copy and paste.

spark.read.parquet("/mnt/training/ecommerce/events/events.parquet").schema.toDDL

-> The default date format for JSON and CSV data files in Spark is yyyy-MM-dd

To connect to any JDBC-compliant data source such as Oracle, MySQL, or PostgreSQL (example: connect to an Oracle database and read a table into your Spark DataFrame), use `format("jdbc")`.

##### DataFrame Writer

In [0]:
# Parquet method with configuration: Snappy compression, overwrite mode

usersOutputPath = workingDir + "/users.parquet"

(usersDF
 .write
 .option("compression", "snappy")
 .mode("overwrite")
 .parquet(usersOutputPath)
)

Snappy is the default compression codec for Parquet files.

In [0]:
# As with DataFrameReader, Spark's Python API also allows you to specify the DataFrameWriter options as parameters to the "parquet" method

(usersDF
 .write
 .parquet(usersOutputPath, compression="snappy", mode="overwrite")
)

`storesDF.write.parquet(filePath)`

Save modes accepted by `mode()`:

- append: Append contents of this DataFrame to existing data.
- overwrite: Overwrite existing data.
- error (or errorifexists): Throw an exception if data already exists. This is the default.
- ignore: Silently skip the write if data already exists.

There is no `repartition()` operation for DataFrameWriter; use **partitionBy()** instead.

`df1.write.mode("overwrite").option("path", "data/flights_delay.csv").save()` This code saves the DataFrame in Parquet format, even though the path ends in `.csv`.

Parquet is the default format of the DataFrameWriter, so the data is saved as Parquet files. The path option specifies the directory location for the data files, so Spark saves the Parquet data files in the data/flights_delay.csv directory.

`df.write.mode("overwrite").save("data/myTable")`

For saving a DataFrame in parquet format, you must call the parquet() or save() method. There is no method such as path(). The default format is parquet so you can skip setting the format() for saving data in parquet format.

##### Write DataFrame to table

In [0]:
# Write "eventsDF" to a table using the DataFrameWriter method "saveAsTable"

# This creates a global table, unlike the local view created by the DataFrame method "createOrReplaceTempView"

eventsDF.write.mode("overwrite").saveAsTable("events_p")

##### Delta lake 

In almost all cases, the best practice is to use Delta Lake format, especially whenever the data will be referenced from a Databricks workspace. Delta Lake is an open source technology designed to work with Spark to bring reliability to data lakes.

Features:
- ACID transactions
- Scalable metadata handling
- Unified streaming and batch processing
- Time travel (data versioning)
- Schema enforcement and evolution
- Audit history
- Parquet format
- Compatible with Apache Spark API

In [0]:
eventsOutputPath = workingDir + "/delta/events"

(eventsDF
 .write
 .format("delta")
 .mode("overwrite")
 .save(eventsOutputPath)
)

By default Spark is case insensitive; however, you can make Spark case sensitive by setting the configuration:
`-- in SQL`
`SET spark.sql.caseSensitive = true`

##### Columns and expressions 

You can select, manipulate and remove columns from dataframes and these operations are represented using expressions. 
A **COLUMN** is a logical construction that will be computed based on the data in a dataframe using an expression. Columns can only be transformed within the context of a dataframe. 

How to refer to a column:
- df["columnName"]
- df.columnName
- col("columnName")
- col("columnName.field")

Remember to place the column name in quotes when you use `col()` or the bracket syntax.

(In Scala you can also use `$"columnName"`.)

Create columns from an expression: 
- col("a") + col("b")  
- col("a").cast("int") * 100

--> from pyspark.sql.functions import col

An expression is a set of transformations on one or more values in a record in a DataFrame. Think of it like a function that takes as input one or more column names, resolves them, and then potentially applies more expressions to create a single value for each record in the dataset.

In the simplest case, an expression created via the expr function is just a DataFrame column reference: expr("someCol") is equivalent to col("someCol").

If you use col() and want to perform transformations on that column, you must perform those on that column reference. When using an expression, the expr function can actually parse transformations and column references from a string and can subsequently be passed into further transformations. 

Example: `expr("someCol - 5")` is the same transformation as performing `col("someCol") - 5`, or even `expr("someCol") - 5`.

`(((col("someCol") + 5) * 200) - 6) < col("otherCol")`

`expr("(((someCol + 5) * 200) - 6) < otherCol")`

-----

`df.select("name", expr("salary * 0.20 as increment"))` -> Expression to create an alias.

##### Column Operators and Methods
| Method | Description |
| --- | --- |
| \*, + , <, >= | Math and comparison operators |
| ==, != | Equality and inequality tests (Scala operators are `===` and `=!=`) |
| alias | Gives the column an alias |
| cast, astype | Casts the column to a different data type |
| isNull, isNotNull, isNan | Is null, is not null, is NaN |
| asc, desc | Returns a sort expression based on ascending/descending order of the column |

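How to chain two conditions on columns in PySpark -> use `&`. Operators like `&&` or `and` are not valid. The other valid boolean operators are `|` (or) and `~` (not). Wrap each condition in parentheses, because `&` and `|` bind more tightly than comparison operators in Python.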

##### Working with Nulls

Both coalesce() and ifnull() are valid ways to replace null values. coalesce() is available as a DataFrame function as well as a Spark SQL function. ifnull(), however, is only a Spark SQL function in Spark releases before 3.5, so there it works inside expr() but not as a DataFrame function.

`df.withColumn("Year", expr("coalesce(Year, '2021')"))`

`df.withColumn("Year", coalesce(col("Year"), lit("2021")))`

`df.withColumn("Year", expr("ifnull(Year, '2021')"))`

WRONG ->`df.withColumn("Year", ifnull(col("Year"), "2021"))`

**coalesce()** returns the first non-null value from a set of columns.
`from pyspark.sql.functions import coalesce, col
df.select(coalesce(col("Description"), col("CustomerId"))).show()
`

`df.withColumn("Year", coalesce(col("Year"), lit("2021")))` You must use the lit() function to add a literal value for a column.

**ifnull** returns the second value if the first is null, and otherwise returns the first. Alternatively, you could use **nullif**, which returns null if the two values are equal and otherwise returns the first value. **nvl** returns the second value if the first is null, and otherwise returns the first. Finally, **nvl2** returns the second value if the first is not null; otherwise, it returns the last specified value.
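
The semantics of these null-handling functions can be sketched in plain Python, with `None` standing in for SQL null. The helper functions below are illustrative models written for these notes (they only borrow the SQL names); they are not the Spark API.

```python
# Plain-Python models of Spark SQL null-handling functions.
# None stands in for SQL NULL; these helpers are illustrative, not Spark APIs.

def coalesce(*values):
    """Return the first non-null value, or None if every value is null."""
    for value in values:
        if value is not None:
            return value
    return None


def ifnull(first, second):
    """Return second if first is null, otherwise first."""
    return second if first is None else first


# nvl behaves the same way as ifnull.
nvl = ifnull


def nullif(first, second):
    """Return null if the two values are equal, otherwise first."""
    return None if first == second else first


def nvl2(first, second, third):
    """Return second if first is not null, otherwise third."""
    return second if first is not None else third


print(coalesce(None, None, "2021"))         # 2021
print(ifnull(None, "2021"))                 # 2021
print(nullif("2021", "2021"))               # None
print(nullif("2020", "2021"))               # 2020
print(nvl2("2020", "has year", "no year"))  # has year
```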

**isnull()**

The isnull() function takes a single argument and returns true if the expression is null, or false otherwise, so it cannot be used to replace null values. ifnull() is the valid SQL function for that purpose. You do not need lit() inside a SQL expression string: Spark SQL has no lit() function because literals are written directly.

`df.withColumn("Year", expr("ifnull(Year, '2021')"))`

Example: 

`df.withColumn("count2", col("count").cast("long"))`

`df.withColumn("salary", col("salary").cast("double"))` -> remember to pass the type name as a quoted string, such as "double" or "long".

You can also use Spark SQL functions such as INT(), DOUBLE(), DATE(), etc to cast a value. `df.select("name", expr("INT(age)"), expr("DOUBLE(salary)"))`

Like asc(), the Column method asc_nulls_last() does not take any argument. It is applied to a column and returns a sort expression based on ascending order of the column, with null values appearing after non-null values. -> `df.orderBy(col("created_date").asc_nulls_last())`

The same names also exist as functions that take the column name (Scala signatures shown):

* asc_nulls_first(columnName: String): Column - Similar to the asc function, but null values come first, followed by non-null values.
* asc_nulls_last(columnName: String): Column - Similar to the asc function, but non-null values come first, followed by null values.

* desc_nulls_first(columnName: String): Column - Similar to the desc function, but null values come first, followed by non-null values.
* desc_nulls_last(columnName: String): Column - Similar to the desc function, but non-null values come first, followed by null values.

For optimization purposes, it’s sometimes advisable to sort within each partition before another set of transformations. You can use the sortWithinPartitions method to do this:
`spark.read.format("json").load("/data/flight-data/json/*-summary.json")\
.sortWithinPartitions("count")`

We do not have a Spark SQL function for desc_nulls_first() so the following option is incorrect.
`df.orderBy(expr("desc_nulls_first(Year)"))`

The expression `expr("salary desc")` looks correct, but it does not sort in descending order in PySpark (the parser treats `desc` as an alias for the column).
Use expr("salary").desc() instead. It is equivalent to col("salary").desc(), so both of these options are correct.

##### DataFrame Transformation Methods
| Method | Description |
| --- | --- |
| select | Returns a new DataFrame by computing given expression for each element |
| drop | Returns a new DataFrame with a column dropped |
| withColumnRenamed | Returns a new DataFrame with a column renamed |
| withColumn | Returns a new DataFrame by adding a column or replacing the existing column that has the same name |
| filter, where | Filters rows using the given condition |
| sort, orderBy | Returns a new DataFrame sorted by the given expressions |
| dropDuplicates, distinct | Returns a new DataFrame with duplicate rows removed |
| limit | Returns a new DataFrame by taking the first n rows |
| groupBy | Groups the DataFrame using the specified columns, so we can run aggregation on them |

**drop_duplicates(subset=None)** is an alias for dropDuplicates().

dropna(how='any', thresh=None, subset=None)

**thresh** – int, default None. If specified, drop rows that have fewer than thresh non-null values. This overrides the how parameter.

`df.na.drop(thresh=1)` -> Means keep the row if at least one column is not null.
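
The `thresh` rule can be modelled in plain Python to check your understanding. This is a minimal sketch: `keep_row` is a hypothetical helper written for these notes, rows are tuples, and `None` stands in for null.

```python
def keep_row(row, thresh):
    """Model of dropna(thresh=...): keep rows with at least `thresh` non-null values."""
    non_null_count = sum(value is not None for value in row)
    return non_null_count >= thresh


rows = [
    (1, "a", None),      # two non-null values
    (None, None, None),  # zero non-null values
    (None, "b", None),   # one non-null value
]

# thresh=1 drops only the row in which every value is null.
kept = [row for row in rows if keep_row(row, thresh=1)]
print(kept)  # [(1, 'a', None), (None, 'b', None)]
```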

The rarely used **between()** Column method is inclusive on both ends: `col("storeId").between(20, 30)` resolves to ((storeId >= 20) AND (storeId <= 30)) in SQL.

Example:

`df.filter(col("count") < 2).show(2)`
`df.where("count < 2").show(2)`


`df.where(col("InvoiceNo") != 536365)\
.select("InvoiceNo", "Description")\
.show(5, False)`

-> Cleaner way, using SQL expression strings (`=` for equality, `<>` for inequality):
`df.where("InvoiceNo = 536365")\
.show(5, False)`
`df.where("InvoiceNo <> 536365")\
.show(5, False)`

df.where("salary > 5000") -> equivalent to:

df.where(expr("salary > 5000"))

df.filter(col("salary") > 5000)

df.filter(expr("salary > 5000"))

-!- df.filter("salary" > 5000) is WRONG

-----

df.where("salary > 5000 and age > 30") -> equivalent to:

df.filter((df.salary > 5000) & (df.age > 30))

df.filter("salary > 5000").filter("age > 30")

df.filter((col("salary") > 5000) & (col("age") > 30)) -> You must apply col() to convert the name to a column expression. Otherwise Python will try to compare the string "salary" with the number 5000.
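
The failure of the plain-string comparison is ordinary Python behaviour and can be reproduced without Spark. In this short sketch, `compare_string_to_number` is a hypothetical helper name used only for illustration.

```python
def compare_string_to_number():
    """Try the comparison that happens when col() is left out."""
    try:
        return "salary" > 5000
    except TypeError as exc:
        # Python 3 refuses to order a str against an int.
        return f"TypeError: {exc}"


result = compare_string_to_number()
print(result)
```

The expression `"salary" > 5000` is evaluated by Python before `filter()` is ever called, so Spark never sees a condition.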

Use the **selectExpr** method when you’re working with expressions in strings.

`df.select(expr("DEST_COUNTRY_NAME as destination").alias("DEST_COUNTRY_NAME"))\
.show(2)`
`df.selectExpr("DEST_COUNTRY_NAME as newColumnName", "DEST_COUNTRY_NAME").show(2)`

In the next example, we’ll set a Boolean flag for when the origin country is the same as the destination country:
`df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME"))\
.show(2)`


`df.selectExpr("name", "case when (salary < 5000) then salary * 0.20 else 0 end as increment")`

The select() method does not parse SQL expression strings: a plain string is treated as a column name. You must use selectExpr() or wrap your SQL expression strings in expr(). An `if ... else` statement inside the expression string is not valid Spark SQL syntax; the `case when ... then ... else ... end` expression is.

**LIMIT**

`df.limit(100).where("salary > 4000")` This first line of code will limit 100 records and then apply the filter condition.

`df.where("salary > 4000").limit(100)` This second line of code will first apply the filter and then limit the results to 100 records.

The limit and where clauses are applied in the same order as specified in the code. Reordering the limit() clause would not be an optimization; it would be logically incorrect. Hence, Spark does not try to push down the limit clause.
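
Why the order matters can be seen with a plain Python list standing in for the rows. This is only a model with made-up salaries and a limit of 3 instead of 100; list slicing plays the role of `limit()` and a comprehension plays the role of `where()`.

```python
# Hypothetical salary values, in the order the rows would be read.
salaries = [3000, 5000, 4500, 2000, 6000]

# Model of df.limit(3).where("salary > 4000"): take 3 rows, then filter them.
limit_then_filter = [s for s in salaries[:3] if s > 4000]

# Model of df.where("salary > 4000").limit(3): filter all rows, then take 3.
filter_then_limit = [s for s in salaries if s > 4000][:3]

print(limit_then_filter)  # [5000, 4500]
print(filter_then_limit)  # [5000, 4500, 6000]
```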

When we call **groupBy**, we end up with a **RelationalGroupedDataset** (called `GroupedData` in PySpark), which is a fancy name for a DataFrame that has a grouping specified but needs the user to specify an aggregation before it can be queried further. We basically specified that we’re going to be grouping by a key (or set of keys) and that now we’re going to perform an aggregation over each one of those keys.

**rollup()** calculates subtotals and a grand total over an (ordered) combination of groups. It is one of the multi-dimensional aggregate operators, alongside cube().

`df.rollup("Year", "CourseName").agg(expr("sum(Students)").alias("TotalStudents")).orderBy("Year", "CourseName")`

**Random Samples** 
Sometimes, you might just want to sample some random records from your DataFrame. You can do this by using the sample method on a DataFrame, which makes it possible for you to specify
a fraction of rows to extract from a DataFrame and whether you’d like to sample with or without replacement:

`seed = 5
withReplacement = False
fraction = 0.5
df.sample(withReplacement, fraction, seed).count()`

**sample()** 
PySpark sampling (pyspark.sql.DataFrame.sample()) is a mechanism to get random sample records from a dataset. This is helpful when you have a large dataset and want to analyze or test a subset of the data, for example 10% of the original file.

sample(withReplacement, fraction, seed=None)
* withReplacement – Sample with replacement or not (default False). Sometimes you may need a random sample with repeated values; setting this to True allows the same row to appear more than once.
* fraction – Fraction of rows to generate, range [0.0, 1.0]. A fraction between 0 and 1 returns approximately that share of the dataset. For example, 0.1 returns about 10% of the rows, but it is not guaranteed to return exactly 10% of the records.
* seed – Seed for sampling (default a random seed). Used to reproduce the same random sampling. Every time you run sample() without a seed it returns a different set of records. During development and testing, however, you may need to regenerate the same sample every time so that you can compare the results with your previous run. To get a consistent sample, use the same seed value for every run. Change the seed value to get different results.
`print(df.sample(0.1,123).collect())
# Output: 36,37,41,43,56,66,69,75,83
print(df.sample(0.1,123).collect())
# Output: 36,37,41,43,56,66,69,75,83
print(df.sample(0.1,456).collect())
# Output: 19,21,42,48,49,50,75,80
`
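
The role of the seed and the "approximate fraction" behaviour can be imitated with Python's standard `random` module. This is only a model, under the assumption that each row is kept independently with probability `fraction`; `bernoulli_sample` is a hypothetical helper, and it will not reproduce the row sets Spark returns for the same seed.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (model only)."""
    rng = random.Random(seed)  # a fixed seed gives a reproducible sequence
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100))

first = bernoulli_sample(rows, 0.1, 123)
second = bernoulli_sample(rows, 0.1, 123)

# The same seed always selects the same rows.
print(first == second)  # True

# The sample size is close to 10 rows but is not guaranteed to be exactly 10.
print(len(first))
```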

**Random Splits**
Random splits can be helpful when you need to break up your DataFrame into random “splits” of the original DataFrame. In this next example, we’ll split our DataFrame into two different DataFrames by setting the weights by which we will split the DataFrame (these are the arguments to the function). Because this method is designed to be randomized, we will also specify a seed (just replace seed with a number of your choosing in the code block). It’s important to note that if the proportions you specify do not add up to one, they will be normalized so that they do:

`dataFrames = df.randomSplit([0.25, 0.75], seed)
dataFrames[0].count() > dataFrames[1].count() # False
`
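
The normalization of the weights amounts to dividing each weight by their sum. A minimal sketch in plain Python, where `normalize_weights` is a hypothetical helper and not part of the Spark API:

```python
def normalize_weights(weights):
    """Scale the weights so that they add up to one."""
    total = sum(weights)
    return [weight / total for weight in weights]


# Weights that do not sum to one end up as the same proportions.
print(normalize_weights([1, 3]))      # [0.25, 0.75]
print(normalize_weights([0.5, 1.5]))  # [0.25, 0.75]
```

Under this model, `randomSplit([1, 3], seed)` would describe the same proportions as `randomSplit([0.25, 0.75], seed)`.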

**Union() and UnionAll()**

DataFrames are immutable. This means users cannot append to DataFrames because that would be changing them. To append to a DataFrame, you must union the original DataFrame with the new DataFrame. This just concatenates the two DataFrames. To union two DataFrames, you must be sure that they have the same schema and number of columns; otherwise, the union will fail.

* DataFrame union() – combines two DataFrames of the same structure/schema. Columns are matched by position, and if the schemas are not compatible it returns an error.
* DataFrame unionAll() – deprecated since Spark 2.0.0 and replaced with union().

Note: In other SQL dialects, UNION eliminates duplicates while UNION ALL combines two datasets including duplicate records. In the Spark DataFrame API both methods behave the same: union() does not remove duplicates, so it works like UNION ALL in Spark SQL. Use distinct() or dropDuplicates() to remove duplicate rows.

The UNION operation in Spark SQL combines two tables and also removes duplicates. To get the same result with the DataFrame API, use: `df3 = df1.union(df2).distinct()`
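
The difference between the two behaviours can be modelled with plain Python lists of row tuples. This is an illustrative sketch only: `union` and `distinct` here are hypothetical stand-ins for the DataFrame methods, and the sample rows are made up.

```python
def union(rows_a, rows_b):
    """Model of DataFrame.union(): concatenate and keep duplicates."""
    return rows_a + rows_b


def distinct(rows):
    """Model of DataFrame.distinct(): remove duplicate rows.

    Order is kept here for readability; Spark makes no ordering promise.
    """
    return list(dict.fromkeys(rows))


df1 = [("Alice", 1), ("Bob", 2)]
df2 = [("Bob", 2), ("Carol", 3)]

print(len(union(df1, df2)))            # 4, like UNION ALL
print(len(distinct(union(df1, df2))))  # 3, like UNION
```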

**monotonically_increasing_id()** A column that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.

`df.withColumn("ID", monotonically_increasing_id())`
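
The bit layout described above explains why the IDs are unique but not consecutive. The following plain-Python sketch models it; `compose_id` and `decompose_id` are hypothetical helpers written for these notes, not Spark functions.

```python
RECORD_BITS = 33  # lower 33 bits hold the record number within the partition


def compose_id(partition_id, record_number):
    """Put the partition ID in the upper bits and the record number in the lower 33."""
    return (partition_id << RECORD_BITS) | record_number


def decompose_id(generated_id):
    """Split a generated ID back into (partition_id, record_number)."""
    record_mask = (1 << RECORD_BITS) - 1
    return generated_id >> RECORD_BITS, generated_id & record_mask


# Records in partition 0 start at 0; partition 1 starts at 2**33.
print(compose_id(0, 0))          # 0
print(compose_id(0, 1))          # 1
print(compose_id(1, 0))          # 8589934592
print(decompose_id(8589934594))  # (1, 2)
```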

In [0]:
# "selectExpr()" Selects a list of SQL expressions

appleDF = eventsDF.selectExpr("user_id", "device in ('macOS', 'iOS') as apple_user")
appleDF.show(5)

+-----------------+----------+
|          user_id|apple_user|
+-----------------+----------+
|UA000000107379500|      true|
|UA000000107359357|     false|
|UA000000107375547|      true|
|UA000000107370581|      true|
|UA000000107377108|     false|
+-----------------+----------+
only showing top 5 rows



In [0]:
# "withColumn()" Returns a new DataFrame by adding a column or replacing an existing column that has the same name.

from pyspark.sql.functions import col

mobileDF = eventsDF.withColumn("mobile", col("device").isin("iOS", "Android"))
mobileDF.show(5)

+-------+------------------+----------+------------------------+----------------+-------------------+--------------------+--------------+--------------------------+-----------------+------+
| device|         ecommerce|event_name|event_previous_timestamp| event_timestamp|                geo|               items|traffic_source|user_first_touch_timestamp|          user_id|mobile|
+-------+------------------+----------+------------------------+----------------+-------------------+--------------------+--------------+--------------------------+-----------------+------+
|  macOS|{null, null, null}|  warranty|        1593878899217692|1593878946592107|     {Montrose, MI}|                  []|        google|          1593878899217692|UA000000107379500| false|
|Windows|{null, null, null}|     press|        1593876662175340|1593877011756535|  {Northampton, MA}|                  []|        google|          1593876662175340|UA000000107359357| false|
|  macOS|{null, null, null}|  add_item|        159

In [0]:
locationDF = eventsDF.withColumnRenamed("geo", "location")
locationDF.show(5)

+-------+------------------+----------+------------------------+----------------+-------------------+--------------------+--------------+--------------------------+-----------------+
| device|         ecommerce|event_name|event_previous_timestamp| event_timestamp|           location|               items|traffic_source|user_first_touch_timestamp|          user_id|
+-------+------------------+----------+------------------------+----------------+-------------------+--------------------+--------------+--------------------------+-----------------+
|  macOS|{null, null, null}|  warranty|        1593878899217692|1593878946592107|     {Montrose, MI}|                  []|        google|          1593878899217692|UA000000107379500|
|Windows|{null, null, null}|     press|        1593876662175340|1593877011756535|  {Northampton, MA}|                  []|        google|          1593876662175340|UA000000107359357|
|  macOS|{null, null, null}|  add_item|        1593878792892652|1593878815459100|    

In [0]:
# "filter()" Filters rows using the given SQL expression or column based condition.

purchasesDF = eventsDF.filter("ecommerce.total_item_quantity > 0")
purchasesDF.show(5)

+-------+--------------+----------+------------------------+----------------+------------------+--------------------+--------------+--------------------------+-----------------+
| device|     ecommerce|event_name|event_previous_timestamp| event_timestamp|               geo|               items|traffic_source|user_first_touch_timestamp|          user_id|
+-------+--------------+----------+------------------------+----------------+------------------+--------------------+--------------+--------------------------+-----------------+
|  Linux|{1195.0, 1, 1}|  finalize|        1593878893766134|1593878897648871|     {Shawnee, KS}|[{null, M_STAN_K,...|        google|          1593876996316576|UA000000107362263|
|    iOS|{1045.0, 1, 1}|  finalize|        1593878485345763|1593878487460247|     {Detroit, MI}|[{null, M_STAN_Q,...|      facebook|          1593877230282722|UA000000107364432|
|Android| {595.0, 1, 1}|  finalize|        1593877930076602|1593878966392505|{East Chicago, IN}|[{null, M_STAN

In [0]:
# "dropDuplicates()" Returns a new DataFrame with duplicate rows removed, optionally considering only a subset of columns.
# Alias: "distinct"

distinctUsersDF = eventsDF.dropDuplicates(["user_id"])
distinctUsersDF.show(5)


# distinctDF = purchasesDF.select("event_name").distinct()
# distinctDF = purchasesDF.dropDuplicates(["event_name"])

+-------+------------------+-------------+------------------------+----------------+-----------------+--------------------+--------------+--------------------------+-----------------+
| device|         ecommerce|   event_name|event_previous_timestamp| event_timestamp|              geo|               items|traffic_source|user_first_touch_timestamp|          user_id|
+-------+------------------+-------------+------------------------+----------------+-----------------+--------------------+--------------+--------------------------+-----------------+
|    iOS|{null, null, null}|     checkout|        1592547736518007|1592548321455992|  {San Bruno, CA}|[{NEWBED10, M_STA...|         email|          1592196947865522|UA000000102357807|
|Android|{null, null, null}|     add_item|        1592573713168269|1592574347642610|     {Mobile, AL}|[{NEWBED10, M_STA...|         email|          1592198812458125|UA000000102358054|
|  macOS|{null, null, null}|shipping_info|        1592545562314108|1592545941007

In [0]:
eventsDF.distinct()

Out[11]: DataFrame[device: string, ecommerce: struct<purchase_revenue_in_usd:double,total_item_quantity:bigint,unique_items:bigint>, event_name: string, event_previous_timestamp: bigint, event_timestamp: bigint, geo: struct<city:string,state:string>, items: array<struct<coupon:string,item_id:string,item_name:string,item_revenue_in_usd:double,price_in_usd:double,quantity:bigint>>, traffic_source: string, user_first_touch_timestamp: bigint, user_id: string]

**Managed/unmanaged table**.
When you define a table from files on disk, you are defining an unmanaged table.
When you use saveAsTable on a DataFrame, you are instead creating a managed table, for which Spark will keep track of all of the relevant information.

If you drop a managed table, both the data and the table definition will be removed. If you are dropping an unmanaged table, no data will be removed but you will no longer be able to refer to this data by the table name. Spark only manages metadata for unmanaged tables.

Managed tables are stored in the Spark warehouse directory. The default location is determined by the `spark.sql.warehouse.dir` configuration in your Spark Session.

Spark supports creating managed and unmanaged tables using Spark SQL and DataFrame APIs. 

As soon as you add the ‘path’ option in the DataFrameWriter, the table is treated as an external/unmanaged table. The table created below is unmanaged, despite its name:

`df.write.option("path", "/data/").saveAsTable("my_managed_table")`

The default behavior of CREATE TABLE is to create a managed table. However, if you are setting a specific PATH, it becomes an external or unmanaged table. 

`spark.sql("""CREATE TABLE flights_tbl(date STRING, delay INT, 
               distance INT, origin STRING, destination STRING) 
               USING csv OPTIONS (PATH '/tmp/flights/flights_tbl.csv')""")`

To collect table-level statistics, use the expression `ANALYZE TABLE table_name COMPUTE STATISTICS`

You can list the tables and other catalog objects using the `spark.catalog`

`table_list = spark.catalog.listTables()`

In the schema below, PersonalDetails is a struct column nested inside the top-level DataFrame, that is, a child structure inside the root. This approach is known as creating a **DataFrame of DataFrames**.

In [0]:
root
 |-- ID: long (nullable = true)
 |-- PersonalDetails: struct (nullable = false)
 | |-- FName: string (nullable = true)
 | |-- LName: string (nullable = true)
 | |-- DOB: string (nullable = true)
 |-- Department: string (nullable = true)
