# Table of Contents
1. [SparkSession](#SparkSession)
2. [RDDs](#RDDs)
3. [Transformations](#Transformations)


# SparkSession

Before Spark 2.0, you would typically work with several separate objects: SparkConf for configuration, SparkContext for the core Spark API, SQLContext for Spark SQL, and HiveContext for Hive. The SparkSession combines these into a single object. Streaming is only partly covered: Structured Streaming is reached through the session, while the older DStream API still needs its own StreamingContext.

The SparkSession is now the entry point for reading data, working with metadata, configuring the session, and managing the cluster resources.

The SQLContext, HiveContext and StreamingContext still exist under the hood in Spark 2.0 for backward compatibility with legacy Spark code.

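When you run an application with the ```spark-submit``` command, you have to create the Spark session yourself (interactive shells such as ```pyspark``` create one for you). An example of how to do that is given below: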

In [8]:
from pyspark.sql import SparkSession

# Build the session, or reuse one that already exists in this process.
spark = (SparkSession.builder
         .appName("example-spark")
         .config("spark.sql.crossJoin.enabled", "true")
         .getOrCreate())

# The RDD examples below use the SparkContext, which the session exposes.
sc = spark.sparkContext

# Legacy style, no longer required:
# sc = SparkContext(); sqlContext = SQLContext(sc)

# RDDs

RDDs (Resilient Distributed Datasets) operate in parallel. This is the strongest advantage of working in Spark: each transformation is executed in parallel across the partitions of the dataset, which can give an enormous increase in speed.

**The transformations to the dataset are lazy**. This means that any transformation is only executed when an action on a dataset is called. This helps Spark to optimize the execution.

## Creating RDDs

There are two ways to create an RDD in PySpark. You can either
- ```.parallelize(...)``` a collection (a list or an array of some elements),
- or reference a file (or files) located either locally or externally.

In [12]:
raw_data = sc.textFile("/Users/abanihiadmin/Documents/elnino.csv", 4)

In [13]:
raw_data.take(5)

[u'Observation, Year, Month, Day, Date, Latitude, Longitude, Zonal Winds, Meridional Winds, Humidity, Air Temp, Sea Surface Temp',
 u'1,80,3,7,800307,-0.02,-109.46,-6.8,0.7,.,26.14,26.24',
 u'2,80,3,8,800308,-0.02,-109.46,-4.9,1.1,.,25.66,25.97',
 u'3,80,3,9,800309,-0.02,-109.46,-4.5,2.2,.,25.69,25.28',
 u'4,80,3,10,800310,-0.02,-109.46,-3.8,1.9,.,25.57,24.31']

The last parameter in ```sc.textFile(..., n)``` specifies the number of partitions the dataset is divided into.

**TIP: A rule of thumb is to break your dataset into two to four partitions for each CPU core in your cluster.**
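As a quick piece of arithmetic, the rule of thumb can be turned into a range of partition counts. This sketch assumes the unit is CPU cores, which is how Spark's own guidance phrases it. The helper name ```suggested_partition_range``` and the 8-core figure are hypothetical.

```python
def suggested_partition_range(total_cores, low=2, high=4):
    """Return (min, max) partition counts for 2 to 4 partitions per core."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return low * total_cores, high * total_cores

# Hypothetical cluster with 8 cores in total.
min_parts, max_parts = suggested_partition_range(8)
print("{} to {} partitions".format(min_parts, max_parts))
```

A value from that range would then be passed as the last argument of ```sc.textFile(..., n)```.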

Note: **When reading from a text file, each row from the file forms an element of an RDD.**


# Transformations
[back to top](#Table-of-Contents)

## ```.map(...)```

This method is applied to each element of the RDD. For the ```raw_data``` dataset, we can think of it as a transformation of each row.


The map transformation lets us apply a function to every element in our RDD, and Python's lambdas are especially convenient for this.

In this case we want to parse our data file as CSV. We can do this by applying a lambda function that splits each element of the RDD on commas, as follows.

In [20]:
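from pprint import pprint
from time import time

# Transformation only: nothing is read or split yet.
csv_data = raw_data.map(lambda x: x.split(","))

# take() is an action, so the timing covers reading and parsing.
t0 = time()
head_rows = csv_data.take(2)
tt = time() - t0
print("Parse completed in {} seconds".format(round(tt, 3)))
pprint(head_rows)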

Parse completed in 0.04 seconds
[[u'Observation',
  u' Year',
  u' Month',
  u' Day',
  u' Date',
  u' Latitude',
  u' Longitude',
  u' Zonal Winds',
  u' Meridional Winds',
  u' Humidity',
  u' Air Temp',
  u' Sea Surface Temp'],
 [u'1',
  u'80',
  u'3',
  u'7',
  u'800307',
  u'-0.02',
  u'-109.46',
  u'-6.8',
  u'0.7',
  u'.',
  u'26.14',
  u'26.24']]
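Note that the header fields in this output keep their leading spaces (for example ```u' Year'```), because the header row has a space after each comma. One possible way to handle this, sketched here in plain Python, is a small named function in place of the lambda. The name ```parse_line``` is hypothetical, and the two sample strings are copied from the rows shown earlier.

```python
def parse_line(line):
    """Split one comma-separated row and strip whitespace around each field."""
    return [field.strip() for field in line.split(",")]

header = ("Observation, Year, Month, Day, Date, Latitude, Longitude, "
          "Zonal Winds, Meridional Winds, Humidity, Air Temp, Sea Surface Temp")
first_row = "1,80,3,7,800307,-0.02,-109.46,-6.8,0.7,.,26.14,26.24"

parsed_header = parse_line(header)
parsed_row = parse_line(first_row)

print(parsed_header[:4])
print(parsed_row[:4])
```

On the RDD this would be applied as ```raw_data.map(parse_line)```. Spaces inside a field name, such as in "Zonal Winds", are left untouched.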


Again, all the work happens only when we call the first Spark action (```take``` in this case). What if we take a lot of elements instead of just the first few?

In [21]:
t0 = time()
head_rows = csv_data.take(10000)
tt = time() - t0
print("Parse completed in {} seconds".format(round(tt, 3)))

Parse completed in 0.392 seconds
