https://sparkbyexamples.com/pyspark/pyspark-what-is-sparksession/

# What is SparkSession
SparkSession, introduced in Spark 2.0, is the entry point to underlying PySpark functionality, used to programmatically create PySpark RDDs and DataFrames. Its object, spark, is available by default in the pyspark shell, and it can be created programmatically using SparkSession.builder.

# SparkSession
With Spark 2.0, a new class SparkSession (from pyspark.sql import SparkSession) was introduced. SparkSession is a combined class for the different contexts we had prior to the 2.0 release (SQLContext, HiveContext, etc.). Since 2.0, SparkSession can be used in place of SQLContext, HiveContext, and the other contexts defined prior to 2.0.

As mentioned in the beginning, SparkSession is an entry point to PySpark, and creating a SparkSession instance would be the first statement you write to program with RDDs and DataFrames (the typed Dataset API exists only in Scala and Java). A SparkSession is created using the SparkSession.builder builder pattern.

Though SparkContext was the entry point prior to 2.0, it has not been completely replaced by SparkSession; many features of SparkContext are still available and used in Spark 2.0 and later. You should also know that SparkSession internally creates a SparkConf and a SparkContext from the configuration provided to SparkSession.

Spark Session also includes all the APIs available in different contexts –

- Spark Context
- SQL Context
- Streaming Context
- Hive Context

You can create as many SparkSession objects as you want using either SparkSession.builder or the newSession() method of an existing session.

## SparkSession in PySpark shell
By default, the PySpark shell provides the “spark” object, which is an instance of the SparkSession class. We can use this object directly wherever required in the shell. Start your “pyspark” shell from the $SPARK_HOME/bin folder and enter the statement below.

sqlcontext = spark.sqlContext

As in the PySpark shell, most tools create a default SparkSession object in the environment for us to use, so you don’t have to worry about creating one yourself.

## Create SparkSession
In order to create a SparkSession programmatically (in a .py file) in PySpark, you need to use the builder pattern through the SparkSession.builder attribute, as explained below. The getOrCreate() method returns an already existing SparkSession; if none exists, it creates a new one.

```python
from pyspark.sql import SparkSession

# Create a session (or reuse one that is already running)
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()
print("First SparkContext:")
print("APP Name :" + spark.sparkContext.appName)
print("Master :" + spark.sparkContext.master)

# A session already exists, so getOrCreate() returns it instead of
# building a new one; the underlying SparkContext is unchanged
sparkSession2 = SparkSession.builder \
      .master("local[1]") \
      .appName("SparkByExample-test") \
      .getOrCreate()

print("Second SparkContext:")
print("APP Name :" + sparkSession2.sparkContext.appName)
print("Master :" + sparkSession2.sparkContext.master)
```

- master() – If you are running on a cluster, you need to pass your master name as an argument to master(). Usually, it is either yarn or mesos, depending on your cluster setup.

  Use local[x] when running locally rather than on a cluster. x should be an integer greater than 0; it sets how many worker threads Spark runs, which also determines the default number of partitions when using RDDs and DataFrames. Ideally, x should be the number of CPU cores you have.

- appName() – Sets your application name.
- getOrCreate() – Returns the existing SparkSession if there is one; otherwise creates a new one.

Note: The SparkSession object “spark” is available by default in the PySpark shell.

You can also create a new SparkSession by calling the newSession() method on an existing session.

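```python
from pyspark.sql import SparkSession

# newSession() is an instance method, so start from an existing session
spark = SparkSession.builder.master("local[1]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

sparkSession3 = spark.newSession()

print("Third SparkContext:")
print("APP Name :" + sparkSession3.sparkContext.appName)
print("Master :" + sparkSession3.sparkContext.master)
```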

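This always creates a new SparkSession object. The new session shares the SparkContext of the session it was created from, but has its own SQL configuration, temporary views, and registered UDFs.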

# SparkSession Commonly Used Methods
**version** – Returns the Spark version your application is running on, typically the Spark version your cluster is configured with.

**createDataFrame**() – Creates a DataFrame from a collection or an RDD.

**getActiveSession**() – Returns the active SparkSession for the current thread, if there is one.

**[read](https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.html#read--)** – Returns an instance of the DataFrameReader class, which is used to read records from CSV, Parquet, Avro, and other file formats into a DataFrame. In PySpark it is a property, accessed without parentheses.

**[readStream](https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.html#readStream--)** – Returns an instance of the DataStreamReader class, which is used to read streaming data into a DataFrame. It is also a property in PySpark.

**[sparkContext](https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.html#sparkContext--)** – Returns the underlying SparkContext (a property in PySpark).

**[sql](https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.html#sql-java.lang.String-)**() – Returns a DataFrame after executing the given SQL statement.

**sqlContext** – Returns the SQLContext. This accessor belongs to the Scala and Java API; PySpark's documented SparkSession attributes do not include it, so check your version before relying on it.

**stop**() – Stops the underlying SparkContext.

**table**() – Returns a DataFrame of a table or view.

**udf** – Returns a UDFRegistration object, used to register a PySpark UDF for use in DataFrame and SQL expressions.

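For example, in a notebook running Spark 3.1.2, evaluating spark.version returns '3.1.2', and evaluating spark.sparkContext displays the session's SparkContext.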