
# Overview

Spark SQL and the available table providers


In Apache Spark, there are several types of table formats and storage options available. Hive and Delta tables are popular options, but they are not the only ones. Here are the main types of tables and storage formats you can use in Spark:

1. Hive Tables
Hive tables are traditional table formats used in Hadoop ecosystems, leveraging the Hive Metastore for schema storage and management. They can use various file formats, such as ORC, Parquet, and Text.

2. Delta Lake Tables
Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, durability) transactions to Apache Spark and big data workloads. Delta tables support efficient data lake operations, including DELETE, UPDATE, and MERGE.

3. Parquet Tables
Parquet is a columnar storage file format optimized for performance and storage efficiency. Parquet tables are commonly used in big data processing.

4. ORC Tables
ORC (Optimized Row Columnar) is another columnar storage file format used in Hadoop ecosystems. It is highly optimized for read and write performance.

5. Avro Tables
Avro is a row-based storage format that provides rich data structures and a compact, fast, binary data serialization format.

6. JSON Tables
JSON is a flexible, human-readable data format. JSON tables are useful for storing semi-structured data.

7. CSV Tables
CSV (Comma-Separated Values) is a simple file format used to store tabular data. CSV tables are straightforward to use but lack the performance optimizations of more advanced formats like Parquet or ORC.

8. JDBC Tables
Spark can connect to relational databases via JDBC (Java Database Connectivity) and treat database tables as Spark DataFrames.

9. Iceberg Tables
Apache Iceberg is a high-performance table format for huge analytic datasets. It brings features like schema evolution, partition evolution, and hidden partitioning.

10. Hudi Tables
Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a data management framework that provides support for atomic upserts and incremental data processing.

## Hive vs Parquet Tables
Comparing Hive and Parquet tables directly is somewhat of an "apples to oranges" comparison because they serve different purposes and operate at different levels of abstraction in the data processing stack. Here's a more detailed explanation to understand why:

### Different Layers of Abstraction

Hive Tables:

- Definition: Hive tables are a logical abstraction that can use various storage formats, including Text, ORC, and Parquet.
- Metadata Management: Hive tables are typically managed by the Hive Metastore, which stores schema and partition information.
- SQL Interface: Hive provides an SQL-like interface (HiveQL) for querying and managing data.
- Usage: Hive tables are part of the broader Hadoop ecosystem, providing a way to query large datasets using SQL.

Parquet Tables:

- Definition: Parquet is a columnar storage file format optimized for analytical queries.
- File Format: Parquet files store data in a columnar format, enabling efficient compression and read performance.
- Usage: Parquet tables refer to datasets stored in Parquet format, which can be queried using various big data processing engines like Spark, Hive, and Impala.

### Key Differences

Scope:

- Hive: Defines a table format and schema within the context of a database or data warehouse. It can use various underlying storage formats, including Parquet.
- Parquet: Specifically defines a storage format that can be used by various table definitions (including Hive tables).

Management:

- Hive Tables: Managed through Hive Metastore, which keeps track of schemas, partitions, and table properties.
- Parquet Tables: The management of Parquet tables depends on the system using them. They can be managed by Hive, Spark, or other engines that support the Parquet format.

Data Format:

- Hive Tables: Can use multiple formats (e.g., Text, ORC, Parquet).
- Parquet Tables: Specifically use the Parquet format.

Performance:

- Hive: Performance depends on the underlying storage format and the engine querying the data. Hive on Parquet can be very performant for read-heavy workloads.
- Parquet: Optimized for read-heavy workloads due to its columnar storage format, but write operations can be slower compared to row-based formats.
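
To make the columnar point concrete, here is a minimal pure-Python sketch. It is a toy model and not the Parquet file format: the helper `to_columns` and the sample rows (a few fields from the drivers data used later in this notebook) are chosen for illustration only. It shows why reading a single field is cheaper when values are grouped by column.

```python
# Toy illustration of row-oriented vs column-oriented layout (not real Parquet).
rows = [
    {"driver_id": 1, "code": "HAM", "nationality": "British"},
    {"driver_id": 3, "code": "ROS", "nationality": "German"},
    {"driver_id": 4, "code": "ALO", "nationality": "Spanish"},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: reading one field still walks every full record.
codes_from_rows = [record["code"] for record in rows]

# Column layout: the same field is already one contiguous list.
codes_from_columns = columns["code"]
```

Grouping values of the same type together is also what makes per-column compression effective, at the cost of extra work when whole rows are written.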
### Comparing "Like to Like"

To compare Hive and Parquet in a "like to like" manner, consider the specific context of Hive tables stored in Parquet format versus other storage formats. For example:

- Hive Table with Parquet Format vs. Hive Table with ORC Format: Both are Hive tables, but one uses Parquet and the other uses ORC. Here, you can compare read/write performance, compression efficiency, and suitability for different workloads.
- Parquet Format in Hive vs. Parquet Format in Spark: Both use the Parquet format, but the comparison would be on how different engines (Hive vs. Spark) handle Parquet data in terms of query performance, ease of use, and feature support.
### Summary

- Hive Tables: Logical abstraction for storing and querying data in a database-like manner. Can use various storage formats, including Parquet.
- Parquet Tables: Specifically refer to data stored in the Parquet format, which can be queried by different engines (including Hive).

While Hive and Parquet can intersect (e.g., Hive tables using Parquet format), they fundamentally operate at different layers. Hive defines how data is structured and queried, while Parquet defines how data is stored. Comparing them directly is therefore like comparing a data warehouse system (Hive) with a file format (Parquet): the two serve different purposes.
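
One way to see the two layers is that the same logical Hive table definition can be paired with different file formats by changing only the `STORED AS` clause. The sketch below builds such statements as plain strings; the helper `hive_ddl` and the table names `drivers_parquet` and `drivers_orc` are hypothetical, and the statements are only printed, not executed.

```python
def hive_ddl(table, columns, stored_as):
    """Build a Hive-style CREATE TABLE statement for the given file format."""
    cols = ",\n".join(f"    {name} {dtype}" for name, dtype in columns)
    return f"CREATE TABLE {table} (\n{cols}\n)\nSTORED AS {stored_as}"


# Same logical schema, two different storage formats.
columns = [("id", "INT"), ("name", "STRING")]
parquet_ddl = hive_ddl("drivers_parquet", columns, "PARQUET")
orc_ddl = hive_ddl("drivers_orc", columns, "ORC")

print(parquet_ddl)
print(orc_ddl)
```

In a notebook, each string could be passed to `spark.sql(...)`, which would give two Hive tables that differ only in how their data files are laid out.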

## Hive vs Delta Tables
When comparing Hive tables to Delta tables in Apache Spark, it's essential to understand their differences in terms of functionality, performance, and use cases. Below is a detailed comparison:

### Hive Tables

Overview:

- Metadata Management: Managed by Hive Metastore.
- Storage Format: Can use various formats like Parquet, ORC, and Text.
- SQL Interface: Supports HiveQL.
- Transactions: Supports ACID transactions with limitations (full ACID requires ORC-backed transactional tables).
- Schema Management: Supports schema evolution but can be cumbersome.
- Partitioning and Bucketing: Native support for partitioning and bucketing.
- Indexing: Older Hive releases supported indexes; they were removed in Hive 3.0 in favor of materialized views and columnar file formats.
- Materialized Views: Supports materialized views for query optimization.
- Integration: Part of the Hadoop ecosystem; integrates well with other Hadoop components.

Pros:

- Mature Ecosystem: Well-established in the Hadoop ecosystem.
- Flexible Storage: Supports multiple storage formats.
- Advanced SQL Features: Supports complex data types, UDFs, and advanced SQL features.
- Partitioning: Robust support for partitioning and bucketing, which can improve query performance.

Cons:

- Performance: Query performance may not be as optimized as newer solutions like Delta Lake.
- Complex Transactions: ACID support is available but can be complex and less performant.
- Schema Evolution: Can be cumbersome and error-prone.

### Delta Tables

Overview:

- Metadata Management: Managed by Delta Lake's own transaction log.
- Storage Format: Uses Parquet as the underlying storage format.
- SQL Interface: Supports SQL with additional Delta Lake-specific commands.
- Transactions: Full ACID transaction support.
- Schema Management: Enforces the table schema on write and supports opt-in schema evolution (for example, the `mergeSchema` write option).
- Partitioning: Supports partitioning, though bucketing is less common.
- Time Travel: Supports querying previous versions of data.
- Performance: Optimized for both reads and writes with features like data skipping and compaction.

Pros:

- ACID Transactions: Full support for ACID transactions, making it suitable for concurrent read/write operations.
- Schema Evolution: Opt-in schema evolution simplifies management of schema changes.
- Time Travel: Ability to query historical data provides powerful auditing and debugging capabilities.
- Performance: Optimized for both read and write operations, with features like data skipping and compaction.

Cons:

- Newer Technology: Less mature than Hive, though rapidly evolving.
- Integration: Primarily focused on the Spark ecosystem, though integrations with other tools are growing.

### Key Differences

ACID Transactions:

- Hive Tables: Supports ACID transactions, but with more complexity and performance overhead.
- Delta Tables: Full ACID support with simpler implementation and better performance.

Schema Evolution:

- Hive Tables: Can be cumbersome and error-prone.
- Delta Tables: Enforces the schema by default and supports opt-in schema evolution, making it easier to manage schema changes.

Performance:

- Hive Tables: Good for batch processing but can be slower for real-time data processing.
- Delta Tables: Optimized for both batch and streaming data processing with features like data skipping and compaction.

Time Travel:

- Hive Tables: No native support for time travel.
- Delta Tables: Native support for time travel, allowing you to query historical data.

Partitioning:

- Hive Tables: Robust support for partitioning and bucketing.
- Delta Tables: Supports partitioning, but bucketing is less commonly used.

Metadata Management:

- Hive Tables: Managed by Hive Metastore.
- Delta Tables: Managed by Delta Lake's transaction log, which simplifies metadata operations.
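
The link between a transaction log and time travel can be sketched with a toy model. This is a deliberate simplification and not the actual Delta Lake log format, which records more action types and metadata. The file names and the `snapshot` helper are hypothetical. Each commit lists data files added and removed, and a table version is obtained by replaying commits up to that version.

```python
# Toy commit log: one entry per table version (simplified, not the real Delta protocol).
log = [
    {"add": ["part-0.parquet"], "remove": []},                  # version 0: initial write
    {"add": ["part-1.parquet"], "remove": []},                  # version 1: append
    {"add": ["part-2.parquet"], "remove": ["part-0.parquet"]},  # version 2: rewrite caused by an UPDATE
]


def snapshot(commits, version):
    """Replay commits 0..version and return the data files that are live."""
    live = set()
    for commit in commits[: version + 1]:
        live -= set(commit["remove"])
        live |= set(commit["add"])
    return sorted(live)


latest = snapshot(log, len(log) - 1)  # current state of the table
as_of_v0 = snapshot(log, 0)           # "time travel" back to version 0
```

Because removed files are only dropped logically, an older version can still be reconstructed for as long as the underlying data files are retained.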

In [0]:
# File location and type
file_location = "/FileStore/tables/part_00000_0dba5518_60c7_424e_968b_27f7fc393894_c000_snappy.parquet"
file_type = "parquet"

# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

driver_id,driver_ref,number,code,name,dob,nationality,ingestion_date
1,hamilton,44.0,HAM,Lewis Hamilton,1985-01-07,British,2024-06-11T01:24:33.00728Z
2,heidfeld,,HEI,Nick Heidfeld,1977-05-10,German,2024-06-11T01:24:33.00728Z
3,rosberg,6.0,ROS,Nico Rosberg,1985-06-27,German,2024-06-11T01:24:33.00728Z
4,alonso,14.0,ALO,Fernando Alonso,1981-07-29,Spanish,2024-06-11T01:24:33.00728Z
5,kovalainen,,KOV,Heikki Kovalainen,1981-10-19,Finnish,2024-06-11T01:24:33.00728Z
6,nakajima,,NAK,Kazuki Nakajima,1985-01-11,Japanese,2024-06-11T01:24:33.00728Z
7,bourdais,,BOU,Sébastien Bourdais,1979-02-28,French,2024-06-11T01:24:33.00728Z
8,raikkonen,7.0,RAI,Kimi Räikkönen,1979-10-17,Finnish,2024-06-11T01:24:33.00728Z
9,kubica,88.0,KUB,Robert Kubica,1984-12-07,Polish,2024-06-11T01:24:33.00728Z
10,glock,,GLO,Timo Glock,1982-03-18,German,2024-06-11T01:24:33.00728Z
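
The read cell above passes CSV options even though `file_type` is `"parquet"`, relying on them being ignored. A small sketch of a stricter alternative is to build the options per file type; the helper name `reader_options` is hypothetical and only the CSV case is modelled.

```python
def reader_options(file_type, infer_schema=False, header=False, sep=","):
    """Return reader options that make sense for the given file type."""
    if file_type.lower() == "csv":
        return {
            "inferSchema": str(infer_schema).lower(),
            "header": str(header).lower(),
            "sep": sep,
        }
    # Self-describing formats such as Parquet carry their own schema.
    return {}


parquet_opts = reader_options("parquet")
csv_opts = reader_options("csv", header=True)

# In the notebook this could be used as follows (not executed here; needs a Spark session):
# df = spark.read.format(file_type).options(**reader_options(file_type)).load(file_location)
```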


In [0]:
%ls

[0m[01;34mazure[0m/  [01;34meventlogs[0m/                   [01;34mlogs[0m/          [01;32mpreload_class.lst[0m*
[01;34mconf[0m/   [01;32mhadoop_accessed_config.lst[0m*  [01;34mmetastore_db[0m/


In [0]:
# Create a view or table

temp_table_name = "drivers"

df.createOrReplaceTempView(temp_table_name)

In [0]:
%sql
describe extended drivers

col_name,data_type,comment
driver_id,int,
driver_ref,string,
number,int,
code,string,
name,string,
dob,date,
nationality,string,
ingestion_date,timestamp,


In [0]:
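# No format is given, so the runtime default applies. The "# Delta Statistics Columns"
# row in the DESCRIBE output below suggests that default is Delta here, so despite
# its name this is not a Hive table.
df.write.saveAsTable("hive_table")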

In [0]:
%sql
describe extended hive_table

col_name,data_type,comment
driver_id,int,
driver_ref,string,
number,int,
code,string,
name,string,
dob,date,
nationality,string,
ingestion_date,timestamp,
,,
# Delta Statistics Columns,,


In [0]:
df.write.format("parquet").saveAsTable("parquet_table")

In [0]:
%sql
describe extended parquet_table

col_name,data_type,comment
driver_id,int,
driver_ref,string,
number,int,
code,string,
name,string,
dob,date,
nationality,string,
ingestion_date,timestamp,
,,
# Detailed Table Information,,


In [0]:
df.write.format("json").saveAsTable("json_table")

In [0]:
%sql
desc extended json_table

col_name,data_type,comment
driver_id,int,
driver_ref,string,
number,int,
code,string,
name,string,
dob,date,
nationality,string,
ingestion_date,timestamp,
,,
# Detailed Table Information,,


In [0]:
df.write.format("hive").saveAsTable("hope_table_hive")

In [0]:
%sql
desc extended hope_table_hive

col_name,data_type,comment
driver_id,int,
driver_ref,string,
number,int,
code,string,
name,string,
dob,date,
nationality,string,
ingestion_date,timestamp,
,,
# Detailed Table Information,,


In [0]:
%sql
show databases

databaseName
default


In [0]:
%sql
show tables in default

database,tableName,isTemporary
default,hive_table,False
default,hope_table_hive,False
default,json_table,False
default,parquet_table,False
,drivers,True


In [0]:
%sql
drop table hive_table

In [0]:
%sql
show tables in default

database,tableName,isTemporary
default,hope_table_hive,False
default,json_table,False
default,parquet_table,False
,drivers,True


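In [0]:
%sql
-- Expected to fail: hope_table_hive is a Hive table, and UPDATE needs a table format with row-level operations such as Delta.
update hope_table_hive set driver_id = 1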

In [0]:
df.write.format("delta").saveAsTable("hope_table_delta")

In [0]:
%sql
show tables in default

database,tableName,isTemporary
default,hope_table_delta,False
default,hope_table_hive,False
default,json_table,False
default,parquet_table,False
,drivers,True


In [0]:
%sql
update hope_table_delta set driver_id = 1

num_affected_rows
853


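In [0]:
%sql
-- Expected to fail: hope_table_hive is a Hive table, not a Delta table.
update hope_table_hive set driver_id = 1

In [0]:
%sql
-- Expected to fail: json_table is stored as plain JSON files.
update json_table set driver_id = 1

In [0]:
%sql
-- Expected to fail: parquet_table is stored as plain Parquet files.
update parquet_table set driver_id = 1

In [0]:
%sql
-- Expected to fail: drivers is a temporary view, not an updatable table.
update drivers set driver_id = 1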
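
Only the Delta update above returned a row count. A sketch for checking a table's provider before attempting an `UPDATE` follows. It reads the `Provider` row that `DESCRIBE EXTENDED` lists in its detailed table information; the helper names and the sample rows are hypothetical, and the rule that only Delta accepts `UPDATE` is an assumption about this workspace.

```python
def table_provider(describe_rows):
    """Return the lowercased value of the 'Provider' row, or None if absent."""
    for col_name, data_type, _comment in describe_rows:
        if col_name == "Provider":
            return data_type.lower()
    return None


def expects_update_support(describe_rows):
    # Assumption: only Delta tables accept UPDATE in this workspace.
    return table_provider(describe_rows) == "delta"


# Hypothetical DESCRIBE EXTENDED rows, shaped as (col_name, data_type, comment).
delta_rows = [("driver_id", "int", ""), ("Provider", "delta", "")]
parquet_rows = [("driver_id", "int", ""), ("Provider", "parquet", "")]

delta_ok = expects_update_support(delta_rows)
parquet_ok = expects_update_support(parquet_rows)

# In the notebook the rows could come from (not executed here):
# rows = [tuple(r) for r in spark.sql("DESCRIBE EXTENDED hope_table_delta").collect()]
```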

In [0]:
%sql
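-- Fails as written: hope_table_hive was not created with PARTITIONED BY, and ADD PARTITION
-- needs a value for each partition column, e.g. (nationality = 'USA').
ALTER TABLE hope_table_hive ADD PARTITION (nationality);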

In [0]:
%sql
CREATE TABLE hive_table1 (
    id INT,
    name STRING
)
PARTITIONED BY (nationality STRING)
STORED AS PARQUET;

In [0]:
%sql
describe extended hive_table1

col_name,data_type,comment
id,int,
name,string,
nationality,string,
# Partition Information,,
# col_name,data_type,comment
nationality,string,
,,
# Detailed Table Information,,
Catalog,spark_catalog,
Database,default,


In [0]:
%sql
CREATE TABLE delta_table1 (
    id INT,
    name STRING
)
USING DELTA
PARTITIONED BY (nationality STRING)


In [0]:
%sql
desc extended delta_table1

col_name,data_type,comment
id,int,
name,string,
nationality,string,
# Partition Information,,
# col_name,data_type,comment
nationality,string,
,,
# Detailed Table Information,,
Catalog,spark_catalog,
Database,default,


In [0]:
%sql
CREATE TABLE prq_table1 (
    id INT,
    name STRING
)
USING PARQUET
PARTITIONED BY (nationality STRING)

In [0]:
%sql
desc extended prq_table1

col_name,data_type,comment
id,int,
name,string,
nationality,string,
# Partition Information,,
# col_name,data_type,comment
nationality,string,
,,
# Detailed Table Information,,
Catalog,spark_catalog,
Database,default,


In [0]:
%sql
ALTER TABLE hive_table1 ADD PARTITION (nationality = 'USA');

In [0]:
%sql
ALTER TABLE prq_table1 ADD PARTITION (nationality = 'USA');
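
For Hive-style partitioned tables, each registered partition corresponds to a `key=value` directory under the table location. The sketch below shows that naming convention only. The helper `partition_path` and the `/warehouse/hive_table1` location are hypothetical, and escaping of special characters in partition values is not handled.

```python
def partition_path(table_location, spec):
    """Join Hive-style key=value directories onto a table location."""
    parts = [f"{key}={value}" for key, value in spec.items()]
    return "/".join([table_location.rstrip("/")] + parts)


# Directory that a partition like (nationality = 'USA') would map to.
path = partition_path("/warehouse/hive_table1", {"nationality": "USA"})
print(path)
```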