# Apache Spark Lakehouse

## What is Apache Spark

Apache Spark is a distributed data processing framework that enables large-scale data analytics by coordinating work across multiple processing nodes in a cluster. In Fabric, this cluster is called a *Spark pool*.

Spark code can be written in Java, Scala, R (SparkR), SQL (Spark SQL), and Python (PySpark).

A Spark pool consists of compute *nodes* that distribute the data processing tasks. There are two kinds of node:
1. A head node that coordinates distributed processes through a **driver** program.
2. Worker nodes that run **executor** processes to perform the actual data processing.

*Example of general node architecture*

![general_spark_node_arch_example](attachment:image.png)



## Spark in Fabric

Each workspace is provided with a *starter pool* for quick setup of Spark jobs. The starter pool can be customized, or new Spark pools can be created, for more complex or specific needs outside of the standard configuration.
* The ability to customize Spark pool settings may be disabled by a Fabric administrator at the organization level.

![spark_node_custom_blade_example](attachment:image-2.png)

Custom settings: 
1. Node Family - the type of virtual machine used for the nodes. Memory optimized nodes typically provide optimal performance for general workloads.
2. Autoscale - whether to provision nodes automatically as needed; if enabled, set the minimum and maximum number of nodes.
3. Dynamic allocation - whether to dynamically allocate executor processes on the worker nodes based on data volumes.

## Spark Runtimes

The Spark ecosystem includes multiple versions of the Spark *runtime*. The runtime determines the version of:
1. Spark
2. Delta Lake
3. Python

Within a runtime, different libraries can be installed for general or specialized tasks. Organizations are free to set up multiple environments based on the runtime and libraries needed for each workload.

The list of available runtimes is [here](https://learn.microsoft.com/en-us/fabric/data-engineering/runtime).

*Examples of Libraries Available in Fabric*

![libraries_fabric_apache_example](attachment:image-3.png)

Options available when setting up environment:
1. Spark runtime
2. Built-in libraries installed on every environment
3. Install specific public libraries from PyPI
4. Install custom libraries through upload of package file
5. Specify Spark Pool for environment usage
6. Specify Spark configuration properties to override default behavior
7. Upload resource files that need to be available in the environment

### Additional Spark Configuration Options

#### Native execution engine

The *native execution engine* in Microsoft Fabric is a vectorized processing engine that runs Spark operations directly on the lakehouse infrastructure. It can significantly improve the performance of queries on large data sets in Parquet or Delta file formats.
* Parquet is a storage format optimized for efficiency but does not inherently support transactions or data versioning.
* Delta uses Parquet as its foundation but adds features like ACID transactions, versioning, and schema enforcement, making it more suitable for dynamic data processing and real-time analytics.

The engine can be enabled for an entire environment in the environment configuration, or for a single notebook, overriding the default setting.

Enabling native execution at the notebook level:

```
%%configure 
{ 
   "conf": {
       "spark.native.enabled": "true", 
       "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager" 
   } 
}
```

#### High concurrency mode

High concurrency mode shares Spark sessions across multiple concurrent users or processes. When enabled for notebooks, multiple users can run code in notebooks that use the same Spark session, while code is isolated so that variables in one notebook do not affect another.
It can also be enabled for Spark jobs, which allows concurrent execution of non-interactive Spark scripts.

High concurrency mode is enabled in the Data Engineering/Science section of the workspace settings interface.
* More details [here](https://learn.microsoft.com/en-us/fabric/data-engineering/high-concurrency-overview)

#### Automatic MLflow logging

MLflow is an open source library used in data science workloads to manage machine learning training and model deployment. With automatic logging, Fabric records model training and management operations in the experiment activity section of the workspace without explicit code. This can be disabled if desired.


#### Spark administration for Fabric capacity

Administrators can manage Spark settings at the Fabric capacity level, which allows them to restrict and override Spark settings at the workspace level.


## Running Spark Code

### Notebooks

Notebooks support multiple languages and make it easy to share code between members of a team. The data can be inspected interactively at each step of the process.

![example_pyspark_notebook](attachment:image-4.png)


PySpark is the default language of notebooks; however, Spark SQL, Scala, and SparkR may also be used.
* The language can be set for the whole notebook, or changed for a single cell using *magic commands*.

*Example of magic command changing to sql*
```
%%sql
SELECT YEAR(OrderDate) AS OrderYear,
       SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue
FROM salesorders
GROUP BY YEAR(OrderDate)
ORDER BY OrderYear;
```

Switching a cell back to PySpark is done with ```%%pyspark```; Scala uses ```%%spark```.
* Magic commands are case sensitive.

#### Visualize Data in Spark Notebooks




### Spark job definitions

Spark jobs are defined to ingest and transform data in an automated process, on-demand or scheduled. 

![spark_job_def_example](attachment:image-5.png)


Jobs are set up within a workspace and may reference a predefined file that contains the script. Storage locations and specific command-line arguments are used to define the context for jobs.
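
As an illustration of how command-line arguments can define the context for a job, the following is a minimal sketch of a job script. The argument names (`--input-path`, `--table-name`, `--write-mode`) and the function names are hypothetical, chosen only for this example; they are not part of Fabric.

```python
import argparse


def parse_job_args(argv):
    """Parse the command-line arguments passed to the job script (names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Hypothetical ingest job")
    parser.add_argument("--input-path", required=True,
                        help="Source folder, e.g. Files/raw/orders")
    parser.add_argument("--table-name", required=True,
                        help="Target Delta table name")
    parser.add_argument("--write-mode", default="overwrite",
                        choices=["overwrite", "append"])
    return parser.parse_args(argv)


def run_job(spark, args):
    """Read Parquet files from the input path and save them as a Delta table."""
    df = spark.read.parquet(args.input_path)
    df.write.format("delta").mode(args.write_mode).saveAsTable(args.table_name)


# In the job script itself, the entry point would look roughly like this:
#   import sys
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   run_job(spark, parse_job_args(sys.argv[1:]))
```

Keeping argument parsing separate from the Spark logic means the same script can be reused for different sources and targets by changing only the arguments in the job definition.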

### Scheduling notebook and job execution

In production or more complex data ingestion solutions, notebooks and Spark jobs are added to a data pipeline for management.

Notebooks and Spark jobs run within the security context of the logged-in user. When run on a schedule, they assume the identity of the user who owns the pipeline.


## Data Sources for Spark

Notebooks are used to connect to data held in various data sources outside the workspace. Most common connections within Azure:
* Azure blob storage
* Azure SQL Database

### Azure Blob

Spark is used to ingest and transform the data before finally loading it into a lakehouse.

*Example of reading Parquet data from Azure Blob Storage in a notebook*

```
# Azure Blob Storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "nyctlc"
blob_relative_path = "yellow"

# blob_sas_token = "add your SAS token here" 
# Construct the path for connection
wasbs_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/{blob_relative_path}'

# WASBS path for connection including SAS token
# wasbs_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/{blob_relative_path}?{blob_sas_token}'

# Read parquet data from Azure Blob Storage path
blob_df = spark.read.parquet(wasbs_path)

# Display the Azure Blob DataFrame
display(blob_df)
```
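
The path construction above can be wrapped in a small helper so the SAS token becomes optional. This is only a sketch: `build_wasbs_path` is a hypothetical name, and appending the token as a query string simply mirrors the commented-out line in the example above.

```python
def build_wasbs_path(account, container, relative_path="", sas_token=None):
    """Build a wasbs:// path, optionally appending a SAS token as a query string."""
    path = (
        f"wasbs://{container}@{account}.blob.core.windows.net/"
        f"{relative_path.lstrip('/')}"
    )
    if sas_token:
        # Tolerate tokens that were copied with a leading '?'
        path = f"{path}?{sas_token.lstrip('?')}"
    return path


wasbs_path = build_wasbs_path("azureopendatastorage", "nyctlc", "yellow")
print(wasbs_path)
# wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow
```

In a notebook, the result would be passed to `spark.read.parquet(wasbs_path)` as before.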


#### Alternate authentication

Alternatives to legacy SAS tokens include **Service Principal** or **OAuth** authentication.

*Example using a Service Principal to read from Azure SQL Database*

```
# Azure SQL Database connection info
server_name = "your_server_name.database.windows.net"
port_number = 1433  # Default port number for SQL Server
database_name = "your_database_name"
table_name = "YourTableName" # Database table

client_id = "YOUR_CLIENT_ID"  # Service principal client ID
client_secret = "YOUR_CLIENT_SECRET"  # Service principal client secret

# Build the Azure SQL Database JDBC URL with Service Principal 
jdbc_url = f"jdbc:sqlserver://{server_name}:{port_number};database={database_name};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;Authentication=ActiveDirectoryServicePrincipal"

# Properties for the JDBC connection
properties = {
    "user": client_id,
    "password": client_secret,
}

# Read table from Azure SQL Database using Entra ID Service Principal
sql_df = spark.read.jdbc(url=jdbc_url, table=table_name, properties=properties)

# Display the Azure SQL dataframe
display(sql_df)
```
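
Hardcoding a client secret in a notebook is easy to leak. One possible variation, sketched below, builds the same JDBC URL with a helper and reads the service principal credentials from environment variables. The function names and the variable names `SQL_SP_CLIENT_ID` and `SQL_SP_CLIENT_SECRET` are assumptions made for this example; a secret store would be the more typical choice in practice.

```python
import os


def build_jdbc_url(server_name, database_name, port_number=1433):
    """Build the Azure SQL Database JDBC URL for service principal authentication."""
    return (
        f"jdbc:sqlserver://{server_name}:{port_number};"
        f"database={database_name};"
        "encrypt=true;trustServerCertificate=false;"
        "hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
        "Authentication=ActiveDirectoryServicePrincipal"
    )


def load_sp_credentials(env=os.environ):
    """Read the client ID and secret from the environment (variable names are hypothetical)."""
    return {
        "user": env["SQL_SP_CLIENT_ID"],
        "password": env["SQL_SP_CLIENT_SECRET"],
    }


jdbc_url = build_jdbc_url("your_server_name.database.windows.net", "your_database_name")
```

The two results plug into the same call as above: `spark.read.jdbc(url=jdbc_url, table=table_name, properties=load_sp_credentials())`.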

### Write Data into Lakehouse

After ingestion and transformation, the same notebook can write the dataframe into a lakehouse using relative paths. 

```
# Write dataframe to Parquet file format
parquet_output_path = "Files/your_folder/your_file_name"

df.write.mode("overwrite").parquet(parquet_output_path)

print(f"DataFrame has been written to Parquet file: {parquet_output_path}")
```

Data ingested through Spark can also be saved in other compatible formats:
* JSON, Delta, CSV, Avro, ORC, etc.

#### Delta Table writes

Delta tables are a key feature of Fabric lakehouses because they support **ACID** transactions.
- Atomicity, consistency, isolation, durability.

*Dataframe saved to Delta table*

```
# Write dataframe to Delta table
delta_table_name = "your_delta_table_name"
df.write.format("delta").mode("overwrite").saveAsTable(delta_table_name)

# Confirm load as Delta table
print(f"DataFrame has been written to Delta table: {delta_table_name}")
```

When writing to Delta files, Fabric enables features by default to avoid the *small file problem*. Because Spark is a parallel-processing framework with data stored on multiple nodes, writes can produce a large number of small files. V-Order and Optimize write mitigate this problem.

**V-Order** enables faster and more efficient reads by applying special sorting, distribution, encoding, and compression to Parquet files at write time.

**Optimize write** improves performance and reliability by reducing the number of files written and increasing their size. It is useful for scenarios where Delta tables have suboptimal or nonstandard file sizes, or where the extra write latency is tolerable.
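
If these features need to be set explicitly for a session, a sketch along the following lines may help. The two property names are the ones Microsoft's documentation has used for V-Order and optimize write; treat them as assumptions to verify against the documentation for the runtime in use, since property names can change between runtime versions. The helper name `apply_write_optimizations` is illustrative.

```python
# Property names are assumptions; confirm them for the Fabric runtime in use.
WRITE_OPTIMIZATIONS = {
    "spark.sql.parquet.vorder.enabled": "true",             # V-Order
    "spark.microsoft.delta.optimizeWrite.enabled": "true",  # Optimize write
}


def apply_write_optimizations(spark, settings=WRITE_OPTIMIZATIONS):
    """Apply each setting to the given Spark session's runtime configuration."""
    for key, value in settings.items():
        spark.conf.set(key, value)
```

In a notebook, where the `spark` session already exists, this would be called as `apply_write_optimizations(spark)` before writing the Delta table.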


## Spark Dataframe

Natively, Spark uses a data structure called a **resilient distributed dataset** (*RDD*). Code can be written against RDDs directly, but dataframes are more commonly used.

Loading data into a dataframe (PySpark) allows schema inference during the load. This is the default behaviour for common data loads.

*Example of a load with an inferred schema*

![pyspark_df_infer_schema_example](attachment:image-6.png)


If a schema needs to be defined, each column is configured with the appropriate data type when the dataframe is loaded. Even when specifying the schema is not required, doing so improves performance over inference.

*Manual schema definition in Spark*

```
# Import only the types used by the schema below
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

productSchema = StructType([
    StructField("ProductID", IntegerType()),
    StructField("ProductName", StringType()),
    StructField("Category", StringType()),
    StructField("ListPrice", FloatType())
    ])

df = spark.read.load('Files/data/product-data.csv',
    format='csv',
    schema=productSchema,
    header=False)
display(df.limit(10))
```
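
As an alternative to building a `StructType`, PySpark readers also accept the schema as a DDL-formatted string. The sketch below builds such a string for the same four columns; the variable names are illustrative.

```python
# Column names and Spark SQL types matching the StructType example
product_columns = [
    ("ProductID", "INT"),
    ("ProductName", "STRING"),
    ("Category", "STRING"),
    ("ListPrice", "FLOAT"),
]

# Join into a DDL string of the form "name TYPE, name TYPE, ..."
product_schema_ddl = ", ".join(f"{name} {dtype}" for name, dtype in product_columns)
print(product_schema_ddl)
# ProductID INT, ProductName STRING, Category STRING, ListPrice FLOAT
```

In a notebook, the string can be passed in place of the `StructType`, for example `spark.read.load('Files/data/product-data.csv', format='csv', schema=product_schema_ddl, header=False)`.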


### Filtering and grouping dataframes


Dataframe functions for selecting data (similar to SQL) are available:

```dataframe.select("columnA","columnB")```

or the shorter bracket syntax: ```dataframe["columnA","columnB"]```.

Selection and filtering can be chained to filter out unwanted rows and return the remaining dataframe:

```
bikes_df = df.select("ProductName", "Category", "ListPrice").where((df["Category"]=="Mountain Bikes") | (df["Category"]=="Road Bikes"))
display(bikes_df)

```

This returns a simple dataframe:

![return_query_filter_spark_dataframe_example](attachment:image-7.png)


Grouping and aggregation are available through ```groupBy```, for example ```dataframe.select("columnA","columnB").groupBy("columnB").count()```.
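
As a hedged sketch of this grouping pattern, the function below counts products per category. It assumes `df` is the products dataframe loaded earlier, with `ProductID` and `Category` columns; the function name is illustrative.

```python
def count_products_by_category(df):
    """Return one row per Category with the number of products in it."""
    return (
        df.select("ProductID", "Category")
          .groupBy("Category")   # one group per distinct category
          .count()               # adds a "count" column per group
          .orderBy("Category")   # sort the result for display
    )
```

In a notebook, the result could be shown with `display(count_products_by_category(df))`.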


## Spark SQL

### Tables and Spark catalog

The **Spark catalog** is a metastore for relational database objects such as views and tables. It allows SQL expressions to be used to manipulate data, which suits those who are more comfortable with SQL than with the dataframe API. The easiest way to make dataframe data available for SQL queries is to create a temporary view.

```df.createOrReplaceTempView("products_view")```

Temporary views are automatically deleted at the end of the session. Tables can also be created; these are persisted in the catalog to define a database for Spark SQL.
* Tables are metadata structures that store their underlying data in the storage location associated with the catalog.

In Microsoft Fabric, data for **managed tables** is stored in the Tables storage location of the data lake, and any tables created using Spark are listed there.

*Creating a table in PySpark*

```
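# Create an empty managed table from a schema (productSchema as defined earlier)
spark.catalog.createTable("sometable_name", schema=productSchema, source="delta")

# saveAsTable is a DataFrameWriter method, so it is called on df.write, not on the catalog
df.write.format("delta").saveAsTable("sometable_from_df")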
```

A table can also be created directly from a dataframe: ```df.write.format("delta").saveAsTable("products")```.
* Spark can save tables in several file formats, but Delta is the preferred format in Fabric.

If a table needs to reference **external** data that is stored in a file location, ```spark.catalog.createExternalTable``` is used. In current PySpark versions this method is deprecated in favor of ```spark.catalog.createTable``` with a ```path``` argument.
* For example, data in the Files location of the data lake
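
A minimal sketch of registering an external table with `spark.catalog.createTable` follows. When a `path` is supplied, PySpark creates an external table over the data at that location. The helper name, table name, and path are hypothetical, and whether a relative `Files/...` path resolves depends on the lakehouse attached to the notebook, so treat that as an assumption.

```python
def register_external_table(spark, table_name, path, source="delta"):
    """Register an external table over data that already exists at `path`."""
    # Passing `path` makes this an external table rather than a managed one
    return spark.catalog.createTable(table_name, path=path, source=source)


# In a notebook (hypothetical names):
#   register_external_table(spark, "products_external", "Files/external/products")
```

Dropping an external table removes only the catalog metadata; the underlying files stay in place.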


### Spark SQL API to Query Data

The Dataframe API is part of the Spark library named Spark SQL, which lets analysts use SQL expressions to query and manipulate data.

*Example of SQL Query using API*

```
bikes_df = spark.sql("SELECT ProductID, ProductName, ListPrice \
                      FROM products \
                      WHERE Category IN ('Mountain Bikes', 'Road Bikes')")
display(bikes_df)
```

### Using SQL code in Notebook

SQL can be run directly in a notebook cell by placing the magic command ```%%sql``` at the beginning of the cell.

*Example of SQL Query in notebook*

```
%%sql

SELECT Category, COUNT(ProductID) AS ProductCount
FROM products
GROUP BY Category
ORDER BY Category
```

