# 1. Data Frame Transformation #
Example from *Spark: The Definitive Guide: Big Data Processing Made Simple*, by Matei Zaharia and Bill Chambers, Chapter 2.
Create a Spark session using `pyspark.sql`, create a range of numbers in a DataFrame, and then run two operations on it:
+ Filter the even numbers and count them
+ Sort the even numbers and take the first five


In [None]:
from pyspark.sql import SparkSession

# Build (or reuse) a session against the standalone cluster master.
spark = (
    SparkSession.builder
    .appName("pyspark-nb-1")
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "512m")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:///opt/workspace/events")
    .getOrCreate()
)

Generate a DataFrame holding the numbers 0 to 999 in a single column named `number`:

In [None]:
myrange = spark.range(1000).toDF("number")

  
Find all the even numbers. This is an example of a *narrow transformation* (no shuffle): no data has to be moved between partitions, because the **filter** is applied to each partition independently. Spark uses lazy evaluation, so nothing is executed at this point:

In [None]:
divisBy2 = myrange.where("number % 2 = 0")
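
To make the two ideas above concrete, here is a toy model in plain Python (not Spark code, and not how Spark is implemented). It assumes three hypothetical partitions represented as lists, and uses Python's lazy `filter` to show that defining the filter does no work until something consumes the result:

```python
# Toy model only: plain Python lists stand in for Spark partitions.
partitions = [list(range(0, 4)), list(range(4, 8)), list(range(8, 12))]

evaluated = []  # records every number the predicate has actually inspected


def is_even(n):
    evaluated.append(n)
    return n % 2 == 0


# "Transformation": build one lazy filter per partition. Each partition is
# filtered using only its own rows, so nothing moves between partitions.
lazy_filtered = [filter(is_even, part) for part in partitions]
work_done_before_action = len(evaluated)  # the predicate has not run yet

# "Action": consuming the iterators forces the filters to execute.
result = [list(part) for part in lazy_filtered]
work_done_after_action = len(evaluated)

print(work_done_before_action, work_done_after_action, result)
```

The partition layout of the filtered result matches the input layout, which is the defining property of a narrow transformation.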

  
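Count the rows in the result-set "divisBy2". `count()` is an **action** rather than a transformation, but it involves an **aggregation** (reduce): Spark computes a partial count within each partition and then combines the partial counts into one final result. That combining step has to bring values from different partitions together, which is the characteristic of a *wide* dependency.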

In [None]:
divisBy2.count()
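
The two-stage count described above can be sketched in plain Python. This is a toy model, not Spark's implementation: the four partitions and their boundaries are assumptions chosen for illustration, since the real partitioning depends on the cluster.

```python
# Toy model only: four hypothetical partitions covering the numbers 0..999.
partitions = [range(0, 250), range(250, 500), range(500, 750), range(750, 1000)]

# Stage 1: each partition filters and counts its own rows independently.
partial_counts = [sum(1 for n in part if n % 2 == 0) for part in partitions]

# Stage 2: the partial counts are brought together and summed.
total = sum(partial_counts)

print(partial_counts, total)
```

Only the small per-partition counts cross partition boundaries here, not the rows themselves.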

The call above is an **action**, so it triggers execution of the transformations defined in the previous cells. There are three standard kinds of actions:
- view data in the console
- collect data to native objects in the respective language
- write data to an output destination
   
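Next, sort the *divisBy2* DataFrame and take the first 5 numbers from the sorted result. `sort` is a *wide transformation*, because ordering rows globally requires a shuffle between partitions, and `take(5)` is the action that triggers it: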

In [None]:
divisBy2.sort("number").take(5)
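
As a rough illustration of why a global sort needs data to move between partitions, the following plain-Python sketch redistributes values by range before sorting each partition locally. It is a toy model under stated assumptions: the input partitions and the `boundaries` list are hypothetical and hard-coded, and it is not a description of Spark's internals.

```python
# Toy model only: unsorted even numbers spread over three "partitions".
partitions = [[14, 2, 8], [0, 16, 6], [10, 4, 12]]

# Hypothetical range boundaries: values < 6 go to output partition 0,
# values from 6 to 11 go to partition 1, and values >= 12 go to partition 2.
boundaries = [6, 12]


def target(n):
    # Index of the output partition whose range contains n.
    return sum(n >= b for b in boundaries)


# "Shuffle": every input partition may send rows to every output partition.
shuffled = [[] for _ in range(len(boundaries) + 1)]
for part in partitions:
    for n in part:
        shuffled[target(n)].append(n)

# Local sort per partition; reading the partitions in order is globally sorted.
sorted_parts = [sorted(p) for p in shuffled]
first_five = [n for part in sorted_parts for n in part][:5]

print(sorted_parts, first_five)
```

Unlike the filter, each output partition here depends on rows from several input partitions, which is what makes the operation wide.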

In [None]:
spark.stop()