# Chapter 2
## The SparkSession

In [1]:
spark  # The SparkSession instance that controls the Spark application

<pyspark.sql.session.SparkSession at 0x15926eafcf8>

In [5]:
# Create a DataFrame with the SparkSession: one column named "number", values 0 to 999
myRange = spark.range(1000).toDF("number")

# Transformation (narrow transformation)
diviBy2 = myRange.where("number % 2 = 0")

# Action (not a transformation): count() triggers execution and returns a result
diviBy2.count()

500
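
As a rough illustration of the cell above, here is a plain-Python sketch (not Spark code) of what happens conceptually: the data is split into partitions, the filter runs on each partition independently, and the count combines the per-partition results. The partition count of 4 and the helper names are assumptions made for illustration; Spark chooses its own partitioning.

```python
# Plain-Python model of a partitioned filter followed by a count.
# NUM_PARTITIONS and the helper names are hypothetical, not Spark APIs.
NUM_PARTITIONS = 4

def make_partitions(n, num_partitions):
    """Split range(n) into contiguous chunks, one per partition."""
    size = -(-n // num_partitions)  # ceiling division
    return [list(range(start, min(start + size, n))) for start in range(0, n, size)]

def filter_partition(rows):
    """Narrow step: each output partition depends on exactly one input partition."""
    return [x for x in rows if x % 2 == 0]

partitions = make_partitions(1000, NUM_PARTITIONS)
filtered = [filter_partition(p) for p in partitions]

# The count needs a result from every partition before it can produce one number.
partial_counts = [len(p) for p in filtered]
total = sum(partial_counts)
print(partial_counts, total)
```

The filter never moves a row between partitions, which is what makes it narrow. Only the final sum has to bring the partial results together.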

## Spark UI

While the application is running locally, the Spark UI is available at http://localhost:4040 (the default port).

## An End-to-End Example

In [12]:
# Read CSV data, taking column names from the first line and inferring column types
flightData2015 = spark\
    .read\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .csv("../pyspark-training/data/The-Definitive-Guide/flight-data/2015-summary.csv")

flightData2015.take(3)  # Returns a list of Row objects

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344)]
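
To show roughly what the `header` and `inferSchema` options ask for, here is a plain-Python sketch using only the standard library. It is not how Spark implements inference (Spark supports more types and has its own rules); the in-memory CSV text is reconstructed from the three rows printed above, and `infer_type` is a hypothetical helper.

```python
import csv
import io

# In-memory stand-in for the CSV file, rebuilt from the rows shown above.
raw = io.StringIO(
    "DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count\n"
    "United States,Romania,15\n"
    "United States,Croatia,1\n"
    "United States,Ireland,344\n"
)

def infer_type(values):
    """Hypothetical helper: 'int' if every value parses as an integer, else 'string'."""
    try:
        for v in values:
            int(v)
        return "int"
    except ValueError:
        return "string"

reader = csv.reader(raw)
header = next(reader)        # like header=true: the first line supplies column names
rows = list(reader)
columns = list(zip(*rows))   # transpose rows into columns
schema = {name: infer_type(col) for name, col in zip(header, columns)}
print(schema)
```

Inference has to look at the data before the types are known, which is why it costs an extra read of the input.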

Call `explain` on any DataFrame to see its lineage, that is, the physical plan Spark will use to execute the query. The top of the plan is the end result and the bottom is the data source.

In [13]:
flightData2015.sort("count").explain()

== Physical Plan ==
*Sort [count#43 ASC], true, 0
+- Exchange rangepartitioning(count#43 ASC, 200)
   +- *Scan csv [DEST_COUNTRY_NAME#41,ORIGIN_COUNTRY_NAME#42,count#43] Format: CSV, InputPaths: file:/C:/Users/byron/Documents/GitHub/pyspark-training/data/The-Definitive-Guide/flight-data/2015..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>
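
The plan above has three steps: scan the CSV, exchange rows by range of `count` (a shuffle into 200 partitions), then sort within each partition. The following plain-Python sketch models the exchange and sort steps. It is only an illustration: the input values beyond 15, 1, and 344 are made up, and the fixed `boundaries` are an assumption, since Spark derives its own range boundaries from the data.

```python
import bisect

# Hypothetical 'count' values spread over three input partitions.
input_partitions = [[15, 1, 344], [62, 1, 588], [40, 15, 7]]

# Assumed range boundaries: bucket 0 holds values < 10,
# bucket 1 holds 10 <= value < 100, bucket 2 holds values >= 100.
boundaries = [10, 100]

# Exchange: route every value to the bucket covering its range (a shuffle).
shuffled = [[] for _ in range(len(boundaries) + 1)]
for partition in input_partitions:
    for value in partition:
        shuffled[bisect.bisect_right(boundaries, value)].append(value)

# Sort: each bucket is sorted independently of the others.
sorted_partitions = [sorted(bucket) for bucket in shuffled]

# Reading the buckets in order yields a globally sorted result.
result = [v for bucket in sorted_partitions for v in bucket]
print(sorted_partitions)
print(result)
```

Because the buckets cover disjoint, ordered ranges, sorting each one locally is enough to sort the whole dataset. That is why the plan places the exchange below the sort.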


### Configure shuffle partitions before a wide transformation

In [14]:
# Reduce the number of shuffle partitions from the default of 200 to 5
spark.conf.set("spark.sql.shuffle.partitions", "5")
flightData2015.sort("count").take(2)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]