# SparkSession

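A SparkSession exposes a catalog (spark.catalog) that lists the databases, tables, and views known to the session.  
The catalog provides methods for extracting different pieces of information.  
spark.catalog.listTables() returns a list of Table objects (name, database, table type, temporary flag) for the current database, including temporary views; the names are read from each object's name attribute.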

**The output of createDataFrame is stored locally, NOT in the SparkSession catalog.**

This means that all Spark DataFrame methods can be used on it, but its data cannot be queried by name from other contexts such as spark.sql.  

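**To query the data through Spark SQL, the Spark DataFrame must first be registered as a temporary view** using one of the Spark DataFrame methods:
- registerTempTable(tablename): deprecated since Spark 2.0 in favor of createOrReplaceTempView
- createTempView(viewname): raises an error if a view with that name already exists
- createOrReplaceTempView(viewname): creates the view, or replaces an existing one with the same name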

In Spark SQL, temporary views are **session-scoped and disappear when the session that created them terminates.**  
To keep a view alive until the Spark application terminates, create a **global temporary view.**  
**Warning:** global temporary views must be referenced through the qualified name global_temp, e.g.
```
SELECT * FROM global_temp.view1
```

```
# Registering the DataFrame as a global temporary view
df.createGlobalTempView("people")

# Global temporary view is tied to a system preserved database global_temp
spark.sql("SELECT * FROM global_temp.people").show()
```

In [4]:
from pyspark.sql import SparkSession
import pandas as pd

In [5]:
data = [
    ['Nick', 26],
    ['Helen', 28],
    ['Mary', 30],
    ['John', 31]
]

df = pd.DataFrame(data, columns=['Name','Age'])

spark = SparkSession.builder.master("local[1]") \
            .appName("app") \
            .getOrCreate()



In [6]:
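sdf = spark.createDataFrame(df)
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
sdf.createOrReplaceTempView('Members')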



In [7]:
r = spark.sql('SELECT * FROM Members WHERE Age > 28')
r.show()

+----+---+
|Name|Age|
+----+---+
|Mary| 30|
|John| 31|
+----+---+



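The recommendation is to work with aggregated data: toPandas() collects every row onto the driver, so aggregate or filter in Spark first and convert only the small result to a Pandas DataFrame. The example table here has only four rows, so it is converted whole.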

In [8]:
pandas_df = sdf.toPandas()

In [9]:
pandas_df.head()

    Name  Age
0   Nick   26
1  Helen   28
2   Mary   30
3   John   31
