# Data Analysis using Spark SQL

This notebook covers the basics of using Spark SQL on Spark DataFrames.

We will treat Spark DataFrames like relational tables and query them with Spark SQL, just as we would query tables in a relational database.


## Part I: Setup and Intro

In [0]:
from pyspark.sql.types import Row
from datetime import datetime
from pyspark.sql import SQLContext
from pyspark.sql.session import SparkSession

In [0]:
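spark = SparkSession.builder.getOrCreate()
# sc is predefined in some notebook environments; defining it here makes the cell portable
sc = spark.sparkContext
sqlContext = SQLContext(sc)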



In [0]:
asset_record = sc.parallelize([Row(asset_id = 'SUI',
                                   asset_name = "Sun Communities",
                                   active = True,
                                   category = ['Stock', 'Real Estate'],
                                   #revenue is in millions
                                   forecast = {"EPS" : -0.037, "Revenue": 610.32}, 
                                   earnings_announcement = datetime(2023, 4, 27)),
                               Row(
                                   asset_id = 'BX',
                                   asset_name = "Blackstone Inc",
                                   active = True,
                                   category = ['Stock', 'Investment Banking'],
                                   #revenue is in millions
                                   forecast = {"EPS" : 0.957, "Revenue": 2490.00}, 
                                   earnings_announcement = datetime(2023, 4, 19))
                                                              
                                  ])

In [0]:
# convert our data to a Spark DataFrame
asset_record_df = asset_record.toDF()
asset_record_df.show()

+--------+---------------+------+--------------------+--------------------+---------------------+
|asset_id|     asset_name|active|            category|            forecast|earnings_announcement|
+--------+---------------+------+--------------------+--------------------+---------------------+
|     SUI|Sun Communities|  true|[Stock, Real Estate]|{EPS -> -0.037, R...|  2023-04-27 00:00:00|
|      BX| Blackstone Inc|  true|[Stock, Investmen...|{EPS -> 0.957, Re...|  2023-04-19 00:00:00|
+--------+---------------+------+--------------------+--------------------+---------------------+



In [0]:
# let's create a temporary view from our DataFrame
asset_record_df.createOrReplaceTempView('records')

In [0]:
all_asset_records_df = sqlContext.sql('SELECT * FROM records')

all_asset_records_df.show()

+--------+---------------+------+--------------------+--------------------+---------------------+
|asset_id|     asset_name|active|            category|            forecast|earnings_announcement|
+--------+---------------+------+--------------------+--------------------+---------------------+
|     SUI|Sun Communities|  true|[Stock, Real Estate]|{EPS -> -0.037, R...|  2023-04-27 00:00:00|
|      BX| Blackstone Inc|  true|[Stock, Investmen...|{EPS -> 0.957, Re...|  2023-04-19 00:00:00|
+--------+---------------+------+--------------------+--------------------+---------------------+



Notice that some columns hold complex data structures: the Python lists became Spark array columns and the dictionaries became map columns. This isn't what we are used to seeing in a relational table, where each entry holds a single value.

Can we select data from these types of columns using SQL? Yes!

In [0]:
sqlContext.sql('SELECT asset_id, asset_name, category[1] FROM records').show()

+--------+---------------+------------------+
|asset_id|     asset_name|       category[1]|
+--------+---------------+------------------+
|     SUI|Sun Communities|       Real Estate|
|      BX| Blackstone Inc|Investment Banking|
+--------+---------------+------------------+



We can also use WHERE clauses just like in SQL. The query below uses "==" to find records whose second category is Real Estate. Spark SQL accepts both "=" and "==" as the equality operator.

In Oracle SQL only "=" is valid.

In [0]:
sqlContext.sql('SELECT asset_id, asset_name \
                FROM records \
                WHERE category[1] == "Real Estate" ').show()

+--------+---------------+
|asset_id|     asset_name|
+--------+---------------+
|     SUI|Sun Communities|
+--------+---------------+



We can also work with datetimes.

Here we run a query to see which earnings announcements fall before April 25, 2023.

In [0]:
sqlContext.sql('SELECT asset_id, asset_name \
                FROM records \
                WHERE earnings_announcement < "2023-04-25" ').show()

+--------+--------------+
|asset_id|    asset_name|
+--------+--------------+
|      BX|Blackstone Inc|
+--------+--------------+



## A note on temporary views

So far we have been working with a temporary view, which is scoped to the current Spark session. Once the session ends, the view is no longer available.

We can share a view across all sessions in the same Spark application by creating a global temporary view. It stays available until the application terminates.

A global temp view is stored in the system-preserved database global_temp, so queries must reference it with the global_temp prefix.

Let's see how this works.

In [0]:
asset_record_df.createGlobalTempView('global_records')

In [0]:
sqlContext.sql('SELECT * FROM global_temp.global_records').show()

+--------+---------------+------+--------------------+--------------------+---------------------+
|asset_id|     asset_name|active|            category|            forecast|earnings_announcement|
+--------+---------------+------+--------------------+--------------------+---------------------+
|     SUI|Sun Communities|  true|[Stock, Real Estate]|{EPS -> -0.037, R...|  2023-04-27 00:00:00|
|      BX| Blackstone Inc|  true|[Stock, Investmen...|{EPS -> 0.957, Re...|  2023-04-19 00:00:00|
+--------+---------------+------+--------------------+--------------------+---------------------+

