In [1]:
spark

In [2]:
sc

# Higher Level APIs
- DataFrames
- Spark SQL
- Datasets -> language specific (available for Scala/Java, not for Python)

### RDD
- Raw data distributed across different partitions
- No schema
- No metadata


### Table
- Consists of `data` and `metadata` 
- Data is stored at storage layer 
- Metadata is stored in some metastore which holds the schema 
- When we run `select * from table`, the engine combines the data with the metadata to return the result in tabular form

### Spark SQL

- Data files (S3/HDFS/etc) + Metastore (some database)


### DataFrames and Spark SQL

- A DataFrame is an RDD plus metadata (schema/structure)
- Not persistent
    - Data - in-memory
    - Metadata - in-memory (there is no metastore; it is stored in a temporary metadata catalog, and once the application is closed/stopped, it is gone)

- A Spark table (one registered in a metastore) is persistent
    - After closing the session, both the data and the metadata persist
    - Can be accessed by other users and from other sessions
    - A DataFrame is visible only to one session (the session in which we create it)
                            
- Performance is almost the same whether we use a DataFrame or a Spark table
- Higher level APIs are more performant than raw RDDs because Spark now knows the schema and structure of the data and can optimize the query plan

In [3]:
spark

# DataFrame

In [7]:
data_set = 's3://fcc-spark-example/dataset/diamonds.csv'

df = (spark.read
           .format('csv')
           .option('header', 'true')
           .option('inferSchema', 'true')
           .load(data_set)
     )

In [12]:
df.show()

+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
| 0.24|Very Good|    J|   VVS2| 62.8| 57.0|  336|3.94|3.96|2.48|
| 0.24|Very Good|    I|   VVS1| 62.3| 57.0|  336|3.95|3.98|2.47|
| 0.26|Very Good|    H|    SI1| 61.9| 55.0|  337|4.07|4.11|2.53|
| 0.22|     Fair|    E|    VS2| 65.1| 61.0|  337|3.87|3.78|2.49|
| 0.23|Very Good|    H|    VS1| 59.4| 61.0|  338| 4.0|4.05|2.39|
|  0.3|     Good|    J|    SI1| 64.0| 55.0|  339|4.25|4.28|2.73|
| 0.23|    Ideal|    J|    VS1| 62.8| 56.0|  340|3.93| 3.9|2.46|
| 0.22|  Premium|    F|  

In [18]:
df = (spark.read
      .csv(data_set, header=True, inferSchema=True)
      )

df.show()

+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
| 0.24|Very Good|    J|   VVS2| 62.8| 57.0|  336|3.94|3.96|2.48|
| 0.24|Very Good|    I|   VVS1| 62.3| 57.0|  336|3.95|3.98|2.47|
| 0.26|Very Good|    H|    SI1| 61.9| 55.0|  337|4.07|4.11|2.53|
| 0.22|     Fair|    E|    VS2| 65.1| 61.0|  337|3.87|3.78|2.49|
| 0.23|Very Good|    H|    VS1| 59.4| 61.0|  338| 4.0|4.05|2.39|
|  0.3|     Good|    J|    SI1| 64.0| 55.0|  339|4.25|4.28|2.73|
| 0.23|    Ideal|    J|    VS1| 62.8| 56.0|  340|3.93| 3.9|2.46|
| 0.22|  Premium|    F|  

                                                                                

In [21]:
# metadata is embedded within the data (each JSON record carries its field names)

data_set = 's3://fcc-spark-example/dataset/diamonds.json'

df = (spark.read
      .json(data_set)
      )

df.show()

+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|carat|clarity|color|      cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
| 0.23|    SI2|    E|    Ideal| 61.5|  326| 55.0|3.95|3.98|2.43|
| 0.21|    SI1|    E|  Premium| 59.8|  326| 61.0|3.89|3.84|2.31|
| 0.23|    VS1|    E|     Good| 56.9|  327| 65.0|4.05|4.07|2.31|
| 0.29|    VS2|    I|  Premium| 62.4|  334| 58.0| 4.2|4.23|2.63|
| 0.31|    SI2|    J|     Good| 63.3|  335| 58.0|4.34|4.35|2.75|
| 0.24|   VVS2|    J|Very Good| 62.8|  336| 57.0|3.94|3.96|2.48|
| 0.24|   VVS1|    I|Very Good| 62.3|  336| 57.0|3.95|3.98|2.47|
| 0.26|    SI1|    H|Very Good| 61.9|  337| 55.0|4.07|4.11|2.53|
| 0.22|    VS2|    E|     Fair| 65.1|  337| 61.0|3.87|3.78|2.49|
| 0.23|    VS1|    H|Very Good| 59.4|  338| 61.0| 4.0|4.05|2.39|
|  0.3|    SI1|    J|     Good| 64.0|  339| 55.0|4.25|4.28|2.73|
| 0.23|    VS1|    J|    Ideal| 62.8|  340| 56.0|3.93| 3.9|2.46|
| 0.22|    SI1|    F|  Pr

In [23]:
# metadata is embedded (Parquet files store their schema alongside the data)

data_set = 's3://fcc-spark-example/dataset/diamonds_parquet'

df = (spark.read
      .parquet(data_set)
      )

df.show()

+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|carat|clarity|color|      cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|  1.0|    SI1|    F|  Premium| 60.3| 5292| 58.0|6.47|6.43|3.89|
|  0.4|    SI2|    G|     Good| 63.1|  596| 59.0|4.65| 4.7|2.95|
| 0.54|    SI1|    I|    Ideal| 62.0| 1057| 55.0|5.21|5.25|3.24|
| 1.01|    VS2|    F|Very Good| 59.4| 6288| 61.0|6.48|6.51|3.86|
| 2.03|    SI2|    E|  Premium| 61.5|18477| 59.0|8.24|8.16|5.04|
| 0.32|   VVS1|    I|  Premium| 63.0|  756| 58.0|4.38|4.32|2.74|
|  0.9|    SI2|    E|     Good| 61.3| 3895| 61.0|6.13|6.17|3.77|
| 1.07|    VS2|    I|    Ideal| 60.6| 5167| 59.0|6.64|6.62|4.02|
| 1.51|    SI2|    G|  Premium| 62.4| 7695| 57.0|7.35|7.29|4.57|
| 0.31|    VS2|    F|Very Good| 63.0|  583| 57.0|4.27|4.33|2.71|
| 0.36|   VVS2|    F|    Ideal| 60.4|  853| 58.0|4.61|4.66| 2.8|
| 0.55|    SI1|    F|     Good| 57.0| 1410| 62.0|5.42|5.44| 3.1|
| 0.31|    VS2|    E|    

In [22]:
# df.repartition(4).write.format("parquet").mode("overwrite").save("s3://fcc-spark-example/dataset/diamonds_parquet")


                                                                                

In [27]:
df.show(10)

+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|carat|clarity|color|      cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|  1.0|    SI1|    F|  Premium| 60.3| 5292| 58.0|6.47|6.43|3.89|
|  0.4|    SI2|    G|     Good| 63.1|  596| 59.0|4.65| 4.7|2.95|
| 0.54|    SI1|    I|    Ideal| 62.0| 1057| 55.0|5.21|5.25|3.24|
| 1.01|    VS2|    F|Very Good| 59.4| 6288| 61.0|6.48|6.51|3.86|
| 2.03|    SI2|    E|  Premium| 61.5|18477| 59.0|8.24|8.16|5.04|
| 0.32|   VVS1|    I|  Premium| 63.0|  756| 58.0|4.38|4.32|2.74|
|  0.9|    SI2|    E|     Good| 61.3| 3895| 61.0|6.13|6.17|3.77|
| 1.07|    VS2|    I|    Ideal| 60.6| 5167| 59.0|6.64|6.62|4.02|
| 1.51|    SI2|    G|  Premium| 62.4| 7695| 57.0|7.35|7.29|4.57|
| 0.31|    VS2|    F|Very Good| 63.0|  583| 57.0|4.27|4.33|2.71|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
only showing top 10 rows



In [30]:
df_premium = df.where("cut == 'Premium'")
df_premium.show(5)

+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|carat|clarity|color|    cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|  1.0|    SI1|    F|Premium| 60.3| 5292| 58.0|6.47|6.43|3.89|
| 2.03|    SI2|    E|Premium| 61.5|18477| 59.0|8.24|8.16|5.04|
| 0.32|   VVS1|    I|Premium| 63.0|  756| 58.0|4.38|4.32|2.74|
| 1.51|    SI2|    G|Premium| 62.4| 7695| 57.0|7.35|7.29|4.57|
| 0.71|    VS1|    D|Premium| 62.9| 2860| 57.0|5.66| 5.6|3.54|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 5 rows



In [31]:
# where() is an alias for filter()

df_premium = df.filter("cut == 'Premium'")
df_premium.show(5)

+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|carat|clarity|color|    cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|  0.9|    SI1|    D|Premium| 61.2| 4304| 60.0|6.19|6.14|3.77|
|  0.3|   VVS1|    G|Premium| 62.4| 1013| 57.0|4.29|4.27|2.67|
| 1.02|    VS2|    G|Premium| 61.5| 6416| 59.0|6.44|6.41|3.95|
|  0.4|    VS2|    F|Premium| 59.3|  842| 59.0|4.79|4.83|2.85|
|  2.0|    SI2|    E|Premium| 58.1|15984| 60.0|8.32|8.25|4.81|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 5 rows



                                                                                

#### DataFrame to Spark Table

In [32]:
df.createOrReplaceTempView('diamonds')

In [None]:
# Now we have a temporary view called 'diamonds', visible only within this Spark session

In [33]:
df_premium = spark.sql("""
    SELECT *
    FROM diamonds
    WHERE cut = 'Premium'
""")

In [34]:
df_premium.show(10)

+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|carat|clarity|color|    cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|  0.9|    SI1|    D|Premium| 61.2| 4304| 60.0|6.19|6.14|3.77|
|  0.3|   VVS1|    G|Premium| 62.4| 1013| 57.0|4.29|4.27|2.67|
| 1.02|    VS2|    G|Premium| 61.5| 6416| 59.0|6.44|6.41|3.95|
|  0.4|    VS2|    F|Premium| 59.3|  842| 59.0|4.79|4.83|2.85|
|  2.0|    SI2|    E|Premium| 58.1|15984| 60.0|8.32|8.25|4.81|
| 0.93|    SI2|    F|Premium| 59.5| 3620| 60.0|6.39|6.36|3.79|
| 1.03|    VS1|    E|Premium| 61.5| 7614| 58.0|6.55|6.43|3.99|
|  0.8|    VS2|    F|Premium| 61.6| 3429| 58.0|5.98|6.03| 3.7|
| 2.02|    VS2|    J|Premium| 62.0|13687| 59.0|8.06|8.08| 5.0|
| 0.31|    SI1|    G|Premium| 62.6|  593| 58.0|4.34|4.29| 2.7|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 10 rows



                                                                                

#### Spark Table to DataFrame

In [35]:
df = spark.read.table('diamonds')

In [36]:
df.show(10)

+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|carat|clarity|color|      cut|depth|price|table|   x|   y|   z|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
|  0.7|    VS2|    F|    Ideal| 60.8| 2942| 56.0|5.78|5.79|3.52|
|  0.9|    SI1|    D|  Premium| 61.2| 4304| 60.0|6.19|6.14|3.77|
| 0.44|    VS2|    E|Very Good| 60.3|  987| 58.0| 4.9|4.92|2.96|
| 1.02|    SI2|    E|Very Good| 58.7| 4286| 63.0|6.61|6.55|3.86|
| 1.01|    SI2|    G|    Ideal| 59.6| 4327| 57.0|6.59|6.54|3.91|
|  2.1|    SI1|    I|    Ideal| 61.6|12168| 57.0|8.24|8.15|5.05|
|  0.3|   VVS1|    G|    Ideal| 62.6|  789| 54.0|4.31|4.32| 2.7|
|  0.3|   VVS1|    G|  Premium| 62.4| 1013| 57.0|4.29|4.27|2.67|
| 0.38|    VS1|    I|    Ideal| 62.3|  703| 53.4|4.65|4.69|2.91|
| 0.39|    VS2|    G|    Ideal| 62.0|  816| 57.0|4.68|4.74|2.92|
+-----+-------+-----+---------+-----+-----+-----+----+----+----+
only showing top 10 rows



                                                                                

In [40]:
df.createOrReplaceGlobalTempView('diamonds')

In [41]:
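# listTables() lists the current database plus this session's temp views;
# the global temp view is registered under the `global_temp` database instead
spark.catalog.listTables()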

                                                                                

[Table(name='diamonds', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

In [47]:
spark.catalog.dropGlobalTempView('diamonds')
spark.catalog.dropTempView('diamonds')

True

In [48]:
spark.catalog.listTables()

[]