In [1]:
spark

In [2]:
sc

# Higher Level APIs 
    - Dataframes 
    - Spark SQL 
    - Datasets -> Language specific (not available for Python, but available for Scala/Java) 

### RDD
- No schema 
- No metadata
- Raw data distributed across different partitions 


### Table
- Consists of `data` and `metadata` 
- Data is stored at storage layer (in the form of data files)
- Metadata is stored in some metastore which holds the schema 
- When we run `select * from table`, the engine checks the metadata and reads the data files together to give us the data in tabular form
- For example: `select color from table` -> If the column `color` is not present in the metastore/table metadata, the query throws an exception. In that case it does not even look at the data files. 
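
The metadata-first check described above can be sketched in plain Python. This is a toy model, not how any real engine is implemented; `METASTORE`, `DATA_FILES`, and `select` are hypothetical names made up for illustration.

```python
# Toy model of a table: a metastore entry (the schema) plus rows standing in
# for the data files. All names here are hypothetical.
METASTORE = {"diamonds": ["carat", "cut", "price"]}
DATA_FILES = {"diamonds": [(0.23, "Ideal", 326), (0.21, "Premium", 326)]}


def select(column, table):
    """Return one column, validating it against the metadata first."""
    schema = METASTORE[table]
    if column not in schema:
        # The query fails here, before any data file is touched.
        raise ValueError(f"column '{column}' not found in table '{table}'")
    idx = schema.index(column)
    return [row[idx] for row in DATA_FILES[table]]


print(select("cut", "diamonds"))  # ['Ideal', 'Premium']

error = None
try:
    select("color", "diamonds")   # 'color' is not in the schema
except ValueError as err:
    error = str(err)
print(error)
```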


### Spark SQL

- It works in a similar manner 
- Data files (S3/HDFS/etc.) + Metastore (some database; admins decide which one, so we don't have to worry about it)
- Metastore on AWS 
    - When you create tables in Glue, it stores metadata about those tables (like table names, column names, data types, etc.) in the AWS Glue Data Catalog, which serves as a centralized metastore for your AWS environment. Data Catalog is integrated with Amazon S3, Amazon RDS, Amazon Redshift, and other services.
    - If you're using Apache Spark on Amazon EMR or other non-Glue AWS environments, you might choose to use the Glue Data Catalog as your metastore by configuring your Spark environment accordingly. The advantage of this is a unified metastore across multiple services and applications, all managed by AWS.
    - This metadata is stored in a highly available and durable way, but the exact details of its storage are abstracted away from the user as part of the fully managed nature of AWS services.


### DataFrames and Spark SQL

- Dataframes are nothing but RDD + metadata (schema/structure)
- `Spark Dataframes` are **not persistent (they are in-memory)**
    - Data - in-memory 
    - Metadata - in-memory 
    - There is no metastore; the metadata is kept in a temporary catalog. Once the application is closed/stopped, it is gone
    - A Dataframe is only visible to one session (the session where we create it)
    - We can think of it as an RDD with some structure

- A `Spark Table` registered in a metastore is **persistent**
    - After closing the session, both the data and the metadata persist
    - It can be accessed by others across other sessions

- Performance is almost the same whether we use a Dataframe or a Spark Table
- They can be used interchangeably: we can convert a Spark Table to a Dataframe and vice versa based on our requirement
- Higher level APIs are more performant than raw RDDs, as Spark now knows about the metadata and can optimize the operation more efficiently

In [3]:
spark

# Dataframe

At its core, a typical Dataframe workflow looks like this:

    - Step 1: We load the data/some file and create a Spark DF 
    - Step 2: Perform some operation 
    - Step 3: Save/write the transformed data back to some storage (S3/HDFS/etc)

## Loading the data and creating a dataframe 

#### 1. CSV 

In [4]:
data_set = 's3://fcc-spark-example/dataset/diamonds.csv'

df = (spark.read                               # reader API
           .format('csv')                      # format is CSV
           .option('header', 'true')           # consider first line as header 
           .option('inferSchema', 'true')      # infer the schema automatically
           .load(data_set)                     # load the data 
     )

                                                                                

> Using inferSchema to infer the schema is not preferred

>    - It may not infer the schema correctly, e.g. a datetime column might get inferred as a string
>    - It can lead to performance issues, as Spark has to scan some of the data in order to infer the schema
    

In [5]:
df.show()

+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
| 0.24|Very Good|    J|   VVS2| 62.8| 57.0|  336|3.94|3.96|2.48|
| 0.24|Very Good|    I|   VVS1| 62.3| 57.0|  336|3.95|3.98|2.47|
| 0.26|Very Good|    H|    SI1| 61.9| 55.0|  337|4.07|4.11|2.53|
| 0.22|     Fair|    E|    VS2| 65.1| 61.0|  337|3.87|3.78|2.49|
| 0.23|Very Good|    H|    VS1| 59.4| 61.0|  338| 4.0|4.05|2.39|
|  0.3|     Good|    J|    SI1| 64.0| 55.0|  339|4.25|4.28|2.73|
| 0.23|    Ideal|    J|    VS1| 62.8| 56.0|  340|3.93| 3.9|2.46|
| 0.22|  Premium|    F|  

In [6]:
# Another way to read the data 

df = (spark
        .read
        .csv(data_set, header=True, inferSchema=True)
     )

df.show(5)

+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|carat|    cut|color|clarity|depth|table|price|   x|   y|   z|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 0.23|  Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
| 0.21|Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
| 0.23|   Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
| 0.29|Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
| 0.31|   Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 5 rows



                                                                                

#### 2. JSON

In [7]:
# field names are embedded in the data itself; Spark infers the column types by scanning the JSON records

data_set = 's3://fcc-spark-example/dataset/diamonds.json'

df = (spark
          .read
          .json(data_set)
      )

df.show(5)

+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|carat|clarity|color|    cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 0.23|    SI2|    E|  Ideal| 61.5|  326| 55.0|3.95|3.98|2.43|
| 0.21|    SI1|    E|Premium| 59.8|  326| 61.0|3.89|3.84|2.31|
| 0.23|    VS1|    E|   Good| 56.9|  327| 65.0|4.05|4.07|2.31|
| 0.29|    VS2|    I|Premium| 62.4|  334| 58.0| 4.2|4.23|2.63|
| 0.31|    SI2|    J|   Good| 63.3|  335| 58.0|4.34|4.35|2.75|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 5 rows



In [8]:
df.printSchema()

root
 |-- carat: double (nullable = true)
 |-- clarity: string (nullable = true)
 |-- color: string (nullable = true)
 |-- cut: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- price: long (nullable = true)
 |-- table: double (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)



#### 3. Parquet

In [9]:
# metadata is embedded within the data (Parquet files store their own schema)

data_set = 's3://fcc-spark-example/dataset/diamonds_parquet' 

df = (spark.read
      .parquet(data_set)
      )

df.show(5)

+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|carat|clarity|color|      cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|  1.0|    SI1|    F|  Premium| 60.3| 5292| 58.0|6.47|6.43|3.89|
|  0.4|    SI2|    G|     Good| 63.1|  596| 59.0|4.65| 4.7|2.95|
| 0.54|    SI1|    I|    Ideal| 62.0| 1057| 55.0|5.21|5.25|3.24|
| 1.01|    VS2|    F|Very Good| 59.4| 6288| 61.0|6.48|6.51|3.86|
| 2.03|    SI2|    E|  Premium| 61.5|18477| 59.0|8.24|8.16|5.04|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
only showing top 5 rows



                                                                                

In [10]:
# df.repartition(4).write.format("parquet").mode("overwrite").save("s3://fcc-spark-example/dataset/diamonds_parquet")


In [11]:
df.show(5)

+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|carat|clarity|color|      cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|  1.0|    SI1|    F|  Premium| 60.3| 5292| 58.0|6.47|6.43|3.89|
|  0.4|    SI2|    G|     Good| 63.1|  596| 59.0|4.65| 4.7|2.95|
| 0.54|    SI1|    I|    Ideal| 62.0| 1057| 55.0|5.21|5.25|3.24|
| 1.01|    VS2|    F|Very Good| 59.4| 6288| 61.0|6.48|6.51|3.86|
| 2.03|    SI2|    E|  Premium| 61.5|18477| 59.0|8.24|8.16|5.04|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
only showing top 5 rows



## Performing some transformations 

### Change the column name

In [12]:
# Change the column name 

df2 = df.withColumnRenamed('x', 'x_col') \
       .withColumnRenamed('y', 'y_col') \
       .withColumnRenamed('z', 'z_col') 

df2.show()

+-----+-------+-----+---------+-----+-----+-----+-----+-----+-----+
|carat|clarity|color|      cut|depth|price|table|x_col|y_col|z_col|
+-----+-------+-----+---------+-----+-----+-----+-----+-----+-----+
|  1.0|    SI1|    F|  Premium| 60.3| 5292| 58.0| 6.47| 6.43| 3.89|
|  0.4|    SI2|    G|     Good| 63.1|  596| 59.0| 4.65|  4.7| 2.95|
| 0.54|    SI1|    I|    Ideal| 62.0| 1057| 55.0| 5.21| 5.25| 3.24|
| 1.01|    VS2|    F|Very Good| 59.4| 6288| 61.0| 6.48| 6.51| 3.86|
| 2.03|    SI2|    E|  Premium| 61.5|18477| 59.0| 8.24| 8.16| 5.04|
| 0.32|   VVS1|    I|  Premium| 63.0|  756| 58.0| 4.38| 4.32| 2.74|
|  0.9|    SI2|    E|     Good| 61.3| 3895| 61.0| 6.13| 6.17| 3.77|
| 1.07|    VS2|    I|    Ideal| 60.6| 5167| 59.0| 6.64| 6.62| 4.02|
| 1.51|    SI2|    G|  Premium| 62.4| 7695| 57.0| 7.35| 7.29| 4.57|
| 0.31|    VS2|    F|Very Good| 63.0|  583| 57.0| 4.27| 4.33| 2.71|
| 0.36|   VVS2|    F|    Ideal| 60.4|  853| 58.0| 4.61| 4.66|  2.8|
| 0.55|    SI1|    F|     Good| 57.0| 1410| 62.0

### Performing some filter operations 

In [13]:
# Some filter operation 

df_premium = df.where("cut == 'Premium'")
df_premium.show(5)

+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|carat|clarity|color|    cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|  1.0|    SI1|    F|Premium| 60.3| 5292| 58.0|6.47|6.43|3.89|
| 2.03|    SI2|    E|Premium| 61.5|18477| 59.0|8.24|8.16|5.04|
| 0.32|   VVS1|    I|Premium| 63.0|  756| 58.0|4.38|4.32|2.74|
| 1.51|    SI2|    G|Premium| 62.4| 7695| 57.0|7.35|7.29|4.57|
| 0.71|    VS1|    D|Premium| 62.9| 2860| 57.0|5.66| 5.6|3.54|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 5 rows



In [14]:
# where() is an alias for filter()

df_premium = df.filter("cut == 'Premium'")
df_premium.show(5)

+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|carat|clarity|color|    cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|  1.0|    SI1|    F|Premium| 60.3| 5292| 58.0|6.47|6.43|3.89|
| 2.03|    SI2|    E|Premium| 61.5|18477| 59.0|8.24|8.16|5.04|
| 0.32|   VVS1|    I|Premium| 63.0|  756| 58.0|4.38|4.32|2.74|
| 1.51|    SI2|    G|Premium| 62.4| 7695| 57.0|7.35|7.29|4.57|
| 0.71|    VS1|    D|Premium| 62.9| 2860| 57.0|5.66| 5.6|3.54|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 5 rows



### Changing the datatype

In [15]:
data_set = 's3://fcc-spark-example/dataset/2023/orders.csv'

df_orders =  (spark.read                               # reader API
                   .format('csv')                      # format is CSV
                   .option('header', 'true')           # consider first line as header 
                   .option('inferSchema', 'true')      # infer the schema automatically
                   .load(data_set)                     # load the data 
             )

df_orders.show(5)

                                                                                

+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|            11599|         CLOSED|
|       2|2013-07-25 00:00:00|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|            12111|       COMPLETE|
|       4|2013-07-25 00:00:00|             8827|         CLOSED|
|       5|2013-07-25 00:00:00|            11318|       COMPLETE|
+--------+-------------------+-----------------+---------------+
only showing top 5 rows



In [16]:
df_orders.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: timestamp (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [18]:
# Changing the data type from Integer Type to Long Type
from pyspark.sql import types as T

df_orders2 = df_orders.withColumn('order_customer_id', 
                                  df_orders['order_customer_id'].cast(T.LongType())
                                 )

df_orders2.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: timestamp (nullable = true)
 |-- order_customer_id: long (nullable = true)
 |-- order_status: string (nullable = true)



#### Dataframe to Spark Table

In [19]:
data_set = 's3://fcc-spark-example/dataset/diamonds_parquet' 

df = ( spark
          .read
          .parquet(data_set)
      )

df.show(5)

+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|carat|clarity|color|      cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|  1.0|    SI1|    F|  Premium| 60.3| 5292| 58.0|6.47|6.43|3.89|
|  0.4|    SI2|    G|     Good| 63.1|  596| 59.0|4.65| 4.7|2.95|
| 0.54|    SI1|    I|    Ideal| 62.0| 1057| 55.0|5.21|5.25|3.24|
| 1.01|    VS2|    F|Very Good| 59.4| 6288| 61.0|6.48|6.51|3.86|
| 2.03|    SI2|    E|  Premium| 61.5|18477| 59.0|8.24|8.16|5.04|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
only showing top 5 rows



                                                                                

#### Local Table

In [20]:
df.createOrReplaceTempView('diamonds') 

# Now we have a temporary view called 'diamonds', visible only within this Spark session

In [21]:
df_premium = spark.sql("""
    SELECT *
    FROM diamonds
    WHERE cut = 'Premium'
""")

In [22]:
df_premium.show(5)

+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|carat|clarity|color|    cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|  1.0|    SI1|    F|Premium| 60.3| 5292| 58.0|6.47|6.43|3.89|
| 2.03|    SI2|    E|Premium| 61.5|18477| 59.0|8.24|8.16|5.04|
| 0.32|   VVS1|    I|Premium| 63.0|  756| 58.0|4.38|4.32|2.74|
| 1.51|    SI2|    G|Premium| 62.4| 7695| 57.0|7.35|7.29|4.57|
| 0.71|    VS1|    D|Premium| 62.9| 2860| 57.0|5.66| 5.6|3.54|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 5 rows



In [23]:
spark.catalog.listTables() 
# After this open a new `pyspark` shell and run the same `spark.catalog.listTables()` 
# We will see no tables, as this table was created as a "Local Table"

23/07/10 16:23:33 INFO HiveConf: Found configuration file file:/etc/spark/conf.dist/hive-site.xml
23/07/10 16:23:33 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
23/07/10 16:23:33 INFO AWSGlueClientFactory: Using region from ec2 metadata : us-east-2


[Table(name='diamonds', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

#### Spark Table to Dataframe 

In [24]:
df = ( spark
          .read
          .table('diamonds')
     )

In [25]:
df.show(5)

+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|carat|clarity|color|      cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|  1.0|    SI1|    F|  Premium| 60.3| 5292| 58.0|6.47|6.43|3.89|
|  0.4|    SI2|    G|     Good| 63.1|  596| 59.0|4.65| 4.7|2.95|
| 0.54|    SI1|    I|    Ideal| 62.0| 1057| 55.0|5.21|5.25|3.24|
| 1.01|    VS2|    F|Very Good| 59.4| 6288| 61.0|6.48|6.51|3.86|
| 2.03|    SI2|    E|  Premium| 61.5|18477| 59.0|8.24|8.16|5.04|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
only showing top 5 rows



#### Global Table

In [26]:
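df.createOrReplaceGlobalTempView('diamonds')
# Global temp views live in the reserved `global_temp` database and are shared
# across sessions of the same Spark application; query them as global_temp.diamonds

In [27]:
# Without arguments this lists the current database plus this session's temp views,
# so the entry below is the local temp view created earlier.
# Use spark.catalog.listTables('global_temp') to see the global temp view.
spark.catalog.listTables() 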

[Table(name='diamonds', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

In [28]:
spark.catalog.dropGlobalTempView('diamonds')
spark.catalog.dropTempView('diamonds')

True

In [29]:
spark.catalog.listTables()

[]