In [None]:
spark

In [None]:
sc

# Higher Level APIs 
- Dataframes
- Spark SQL
- Datasets -> Language specific (not available for Python, but available for Scala/Java)

### RDD
- No schema 
- No metadata
- Raw data distributed across different partitions 


![](../img/SparkArchitect.jpeg)

### Table
- Consists of `data` and `metadata` 
- Data is stored at storage layer (in the form of data files)
- Metadata is stored in some metastore which holds the schema 
- When we run `select * from table`, the engine combines the data and the metadata to return the result in tabular form
- For example: `select some_col from table` -> If the column `some_col` is not present in the metastore/table metadata, the query throws an exception without even looking at the data files.


### Spark SQL

- It works in a similar manner 
- Data files (S3/HDFS/etc) + Metastore (some database; admins decide which one, so we don't have to worry about it)

- Metastore on AWS
    - When you create tables in Glue, it stores metadata about those tables (like table names, column names, data types, etc.) in the AWS Glue Data Catalog, which serves as a centralized metastore for your AWS environment. Data Catalog is integrated with Amazon S3, Amazon RDS, Amazon Redshift, and other services.
    - If you’re using Apache Spark on Amazon EMR or other AWS environments (apart from Glue), you might choose to use the Glue Data Catalog as your metastore by configuring your Spark environment accordingly. The advantage of this is a unified metastore across multiple services and applications, all managed by AWS.
    - This metadata is stored in a highly available and durable way, but the exact details of its storage are abstracted away from the user as part of the fully managed nature of AWS services.

### DataFrames and Spark SQL

- DataFrames are nothing but RDD + Metadata (schema/structure)
- `Spark DataFrames` are **not persistent (they are in-memory)**
    - Data - in-memory
    - Metadata - in-memory
    - There is no metastore; the metadata is kept in a temporary catalog. Once the application is closed/stopped, it's gone
    - A DataFrame is only visible to 1 session (the session where we create it)
    - We can simply think of it as an RDD with some structure
    
- `Spark Table` is always **persistent**
    - After closing the session, the data persists (data and metadata)
    - Can be accessed by others across other sessions
                        
- Performance is almost the same whether we use a DataFrame or a Spark Table
- They can be used interchangeably: we can convert a Spark table to a DataFrame and vice versa based on our requirement
- Higher level APIs are more performant than raw RDDs because Spark knows the metadata and can optimize the operations more efficiently

In [None]:
spark

# Dataframe

At its core, we do the following typically:

- Step 1: We load the data/some file and create a Spark DF
- Step 2: Perform some operation
- Step 3: Save/write the transformed data back to some storage (S3/HDFS/etc)

## Loading the data and creating a dataframe 

#### 1. CSV 

In [None]:
data_set = 's3://fcc-spark-example/dataset/diamonds.csv'

df = (spark.read                               # reader API
           .format('csv')                      # format is CSV
           .option('header', 'true')           # consider first line as header 
           .option('inferSchema', 'true')      # infer the schema automatically
           .load(data_set)                     # load the data 
     )

In [None]:
df.show(5)

In [None]:
df.printSchema()

> Relying on `inferSchema` is not preferred

>    - It may not infer the schema correctly, e.g. a `datetime` column might get inferred as `string`
>    - It can lead to performance issues, as Spark has to scan the data in order to infer the schema
    

In [None]:
df = (spark.read                               # reader API
           .format('csv')                      # format is CSV
           .option('header', 'true')           # consider first line as header 
           .option('inferSchema', 'true')      # infer the schema automatically
           .option('samplingRatio', 0.2)       # sample 20% of the rows for schema inference
           .load(data_set)                     # load the data 
     )

In [None]:
df.show(3)

In [None]:
# Another way to read the data 

df = (spark
        .read
        .csv(data_set, header=True, inferSchema=True)
     )

df.show(5)

#### 2. JSON

In [None]:
# field names are embedded within the data (types are still inferred)

data_set = 's3://fcc-spark-example/dataset/diamonds.json'

df = (spark      
          .read   
          .json(data_set)
     )

df.show(5)

In [None]:
df.printSchema()

#### 3. Parquet

In [None]:
# metadata (schema) is embedded within the data

data_set = 's3://fcc-spark-example/dataset/diamonds_parquet' 

df = (spark
      .read
      .parquet(data_set)
      )

df.show(5)

In [None]:
# df.repartition(4).write.format("parquet").mode("overwrite").save("s3://fcc-spark-example/dataset/diamonds_parquet")


In [None]:
df.show(5)

## Performing some transformations 

- Create a DF (Load) --> (READER API) 
- We perform some transformation 
- Store the clean/processed data (WRITER API) --> We will look at this later

### Change the column name

In [None]:
# Change the column name 

df2 = (df.withColumnRenamed('x', 'x_col') 
       .withColumnRenamed('y', 'y_col') 
       .withColumnRenamed('z', 'z_col') )

df2.show(5)

In [None]:
# The original DataFrame is unchanged; withColumnRenamed returned a new one
df.show(5)

### Performing some filter operations 

In [None]:
# Some filter operation 

df_premium = df.where("cut == 'Premium'")
df_premium.show(5)

In [None]:
# where() is an alias for filter()

df_premium = df.filter("cut == 'Premium'")
df_premium.show(5)

### Changing the datatype

In [None]:
# Let's create a dataframe; later on we will change a column's datatype

data_set = 's3://fcc-spark-example/dataset/2023/orders.csv'

df_orders =  (spark.read                               # reader API
                   .format('csv')                      # format is CSV
                   .option('header', 'true')           # consider first line as header 
                   .option('inferSchema', 'true')      # infer the schema automatically
                   .load(data_set)                     # load the data 
             )

df_orders.show(5)

In [None]:
df_orders.printSchema()

In [None]:
# Cast order_customer_id from Integer Type to Long Type, storing the result in a new column
from pyspark.sql import types as T

df_orders2 = df_orders.withColumn('order_customer_id_2', 
                                  df_orders['order_customer_id'].cast(T.LongType())
                                 )

df_orders2.printSchema()

In [None]:
df_orders2.show(5)

#### Dataframe to Spark Table/View

In Spark SQL, a table and a view both allow you to structure and organize your data, but they serve different purposes and are used differently.

#### Creating a View

In SQL, we have two distinct concepts: 

- A `table` is materialized: its data is physically stored.
- A `view` is computed on the fly each time it is queried.

Spark’s temp views are conceptually closer to a view than a table. 

A view in Spark is a logical construct. It's essentially a named SQL query that acts as a virtual table.

Views do not physically store data. Instead, every time you query a view, Spark applies the view's transformation to the underlying data. This can be beneficial when you have complex transformations that you want to reuse, or when you want to simplify queries for end users.

- `createOrReplaceTempView`
- `createTempView`
- `createOrReplaceGlobalTempView`
- `createGlobalTempView`

These methods (e.g. `createOrReplaceTempView`) look at the data frame
referenced by the Python variable on which the method is called and register a
Spark SQL name that refers to the same data frame.

Spark SQL also has tables, which we will see later.
Tables in Spark are similar to tables in a relational database. They are data structures that organize data into rows and columns. Each column has a specific data type, and each row contains a record.


In [None]:
data_set = 's3://fcc-spark-example/dataset/diamonds_parquet' 

df = (spark
      .read
      .parquet(data_set)
      )

df.show(5)

In [None]:
df.createOrReplaceTempView('diamonds') 

# Now we have a temporary view called 'diamonds', visible only to this Spark session

In [None]:
df_premium = df.where("cut == 'Premium'")
df_premium.show(5)

In [None]:
df_premium = spark.sql("""
    SELECT *
    FROM diamonds
    WHERE cut = 'Premium'
""")

In [None]:
df_premium.show(5)

In [None]:
df.createOrReplaceTempView('diamonds') 

In [None]:
df.createOrReplaceTempView('diamonds2') 

In [None]:
# List all the tables 

spark.sql("SHOW tables").show()

In [None]:
# Another way to list all the tables 
spark.catalog.listTables()

# After this, open a new `pyspark` shell and run the same `spark.catalog.listTables()`
# We will see no tables there, as a temp view is local to the session that created it

In [None]:
spark.sql("DESCRIBE diamonds").show()

#### Spark Table to Dataframe 

In [None]:
df = ( spark
          .read
          .table('diamonds')
     )

In [None]:
df.show(5)

#### Clean up 

In [None]:
spark.sql("SHOW tables").show()

In [None]:
spark.catalog.listTables() 

In [None]:
spark.catalog.dropTempView('diamonds')

In [None]:
spark.catalog.listTables()

In [None]:
spark.catalog.dropTempView('diamonds2')

In [None]:
spark.sql("SHOW tables").show()