# Learning PySpark 
### Video series

### Packt Publishing

**Author**: Tomasz Drabas
**Date**:   2018-01-30





# Section 4: Spark DataFrames & Transformations

In this section we look at Spark DataFrames: how to create them, how their schema is defined, and the transformations available on them.

## Creating DataFrames
### From RDDs

In [1]:
import pyspark
sc = pyspark.SparkContext()

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("create").getOrCreate()

In [2]:
simple_rdd = sc.parallelize([
      ['2017-02-01','Rachel', 19, 156, 'Sydney']
    , ['2018-01-01','Albert',  3,  45, 'New York']
    , ['2018-03-02','Jack',   61, 190, 'Krakow']
    , ['2017-12-31','Skye',    8,  82, 'Harbin']
])

In [8]:
simple_df = spark.createDataFrame(
    simple_rdd, 
    ['Date','Name', 'Age', 'Weight', 'Location']
)
simple_df.show()

+----------+------+---+------+--------+
|      Date|  Name|Age|Weight|Location|
+----------+------+---+------+--------+
|2017-02-01|Rachel| 19|   156|  Sydney|
|2018-01-01|Albert|  3|    45|New York|
|2018-03-02|  Jack| 61|   190|  Krakow|
|2017-12-31|  Skye|  8|    82|  Harbin|
+----------+------+---+------+--------+



### From JSON string

In [9]:
json_string = [
    '{"Date":"2017-02-01","Name":"Rachel","Age":19,"Weight":156,"Location":"Sydney"}', 
    '{"Date":"2018-01-01","Name":"Albert","Age":3 ,"Weight":45 ,"Location":"New York"}', 
    '{"Date":"2018-03-02","Name":"Jack"  ,"Age":61,"Weight":190,"Location":"Krakow"}', 
    '{"Date":"2017-12-31","Name":"Skye"  ,"Age":8 ,"Weight":82 ,"Location":"Harbin"}'
]

simple_df_json = spark.read.json(sc.parallelize(json_string))
simple_df_json.show()

+---+----------+--------+------+------+
|Age|      Date|Location|  Name|Weight|
+---+----------+--------+------+------+
| 19|2017-02-01|  Sydney|Rachel|   156|
|  3|2018-01-01|New York|Albert|    45|
| 61|2018-03-02|  Krakow|  Jack|   190|
|  8|2017-12-31|  Harbin|  Skye|    82|
+---+----------+--------+------+------+



### Reading data

In [2]:
sample_df = spark.read.csv(
    'sample_data.csv'
    , header=True
)
sample_df.show(4)

+---------+-------+-------+------+-----+--------+------+
|OrderDate| Region|    Rep|  Item|Units|UnitCost| Total|
+---------+-------+-------+------+-----+--------+------+
| 1/6/2016|   East|  Jones|Pencil|   95|    1.99|189.05|
| 3/2/2017|Central| Kivell|Binder|   50|   19.99| 999.5|
| 2/9/2016|Central|Jardine|Pencil|   36|    4.99|179.64|
|2/26/2016|Central|   Gill|   Pen|   27|   19.99|539.73|
+---------+-------+-------+------+-----+--------+------+
only showing top 4 rows



## Spark DataFrame schema
### RDDs reflection

In [4]:
sample_df.printSchema()

root
 |-- OrderDate: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Rep: string (nullable = true)
 |-- Item: string (nullable = true)
 |-- Units: string (nullable = true)
 |-- UnitCost: string (nullable = true)
 |-- Total: string (nullable = true)



### Programmatically specifying schema

In [22]:
import pyspark.sql.types as typ
import datetime as dt

schema = [
      ('Date', typ.DateType())
    , ('Name', typ.StringType())
    , ('Age',  typ.IntegerType())
    , ('Weight', typ.IntegerType())
    , ('Location', typ.StringType())
]

schema = typ.StructType([typ.StructField(e[0], e[1], True) for e in schema])

simple_df_schema = spark.createDataFrame(
      simple_rdd
        .map(lambda row: 
             # DateType holds dates only, so drop the time component after parsing
             [dt.datetime.strptime(row[0], '%Y-%m-%d').date()] + row[1:]
            )
    , schema=schema
)

simple_df_schema.show()

+----------+------+---+------+--------+
|      Date|  Name|Age|Weight|Location|
+----------+------+---+------+--------+
|2017-02-01|Rachel| 19|   156|  Sydney|
|2018-01-01|Albert|  3|    45|New York|
|2018-03-02|  Jack| 61|   190|  Krakow|
|2017-12-31|  Skye|  8|    82|  Harbin|
+----------+------+---+------+--------+



In [16]:
simple_df_schema.printSchema()

root
 |-- Date: date (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Location: string (nullable = true)



### Automatically inferring schema while reading data

In [17]:
sample_df.printSchema()

root
 |-- OrderDate: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Rep: string (nullable = true)
 |-- Item: string (nullable = true)
 |-- Units: string (nullable = true)
 |-- UnitCost: string (nullable = true)
 |-- Total: string (nullable = true)



In [7]:
sample_df_inferred = spark.read.csv(
    'sample_data.csv'
    , header=True
    , inferSchema = True
)

sample_df_inferred.printSchema()
sample_df_inferred.show(5)

root
 |-- OrderDate: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Rep: string (nullable = true)
 |-- Item: string (nullable = true)
 |-- Units: integer (nullable = true)
 |-- UnitCost: double (nullable = true)
 |-- Total: double (nullable = true)

+---------+-------+-------+------+-----+--------+------+
|OrderDate| Region|    Rep|  Item|Units|UnitCost| Total|
+---------+-------+-------+------+-----+--------+------+
| 1/6/2016|   East|  Jones|Pencil|   95|    1.99|189.05|
| 3/2/2017|Central| Kivell|Binder|   50|   19.99| 999.5|
| 2/9/2016|Central|Jardine|Pencil|   36|    4.99|179.64|
|2/26/2016|Central|   Gill|   Pen|   27|   19.99|539.73|
|3/15/2016|   West|Sorvino|Pencil|   56|    2.99|167.44|
+---------+-------+-------+------+-----+--------+------+
only showing top 5 rows



In [59]:
from pyspark.sql.functions import col, unix_timestamp, to_date
#sample_df_inferred['OrderDate']=sample_df_inferred.select(to_date(sample_df_inferred.OrderDate, 'dd-MM-yyyy'))
#sample_df_inferred= sample_df_inferred.withColumn('OrderDate1',to_date(sample_df_inferred.OrderDate))

In [61]:
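# NOTE: the 'dd-MM-yyyy' pattern does not match slash-separated values such as
# '1/6/2016', which is why the new column is null in the output below.
sample_df_inferred = sample_df_inferred.withColumn(
    'date_in_dateFormat',
    to_date(unix_timestamp(col('OrderDate'), 'dd-MM-yyyy').cast('timestamp'))
)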
sample_df_inferred.show(2)

+---------+-------+------+------+-----+--------+------+------------------+
|OrderDate| Region|   Rep|  Item|Units|UnitCost| Total|date_in_dateFormat|
+---------+-------+------+------+-----+--------+------+------------------+
| 1/6/2016|   East| Jones|Pencil|   95|    1.99|189.05|              null|
| 3/2/2017|Central|Kivell|Binder|   50|   19.99| 999.5|              null|
+---------+-------+------+------+-----+--------+------+------------------+
only showing top 2 rows



In [52]:
sample_df_inferred.show(5)

+---------+-------+-------+------+-----+--------+------+----------+
|OrderDate| Region|    Rep|  Item|Units|UnitCost| Total|OrderDate1|
+---------+-------+-------+------+-----+--------+------+----------+
|     null|   East|  Jones|Pencil|   95|    1.99|189.05|      null|
|     null|Central| Kivell|Binder|   50|   19.99| 999.5|      null|
|     null|Central|Jardine|Pencil|   36|    4.99|179.64|      null|
|     null|Central|   Gill|   Pen|   27|   19.99|539.73|      null|
|     null|   West|Sorvino|Pencil|   56|    2.99|167.44|      null|
+---------+-------+-------+------+-----+--------+------+----------+
only showing top 5 rows



In [8]:
import pyspark.sql.functions as f

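# NOTE: in the recorded output below every OrderDate comes back null, so this
# pattern did not parse values such as '1/6/2016'. A pattern along the lines of
# 'M/d/yyyy' is probably a closer match (an assumption to verify on your Spark version).
sample_df_inferred = (
    sample_df_inferred
    .withColumn('OrderDate'
                , f.to_date('OrderDate', 'MM/dd/yy')
               )
)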

sample_df_inferred.show(4)

+---------+-------+-------+------+-----+--------+------+
|OrderDate| Region|    Rep|  Item|Units|UnitCost| Total|
+---------+-------+-------+------+-----+--------+------+
|     null|   East|  Jones|Pencil|   95|    1.99|189.05|
|     null|Central| Kivell|Binder|   50|   19.99| 999.5|
|     null|Central|Jardine|Pencil|   36|    4.99|179.64|
|     null|Central|   Gill|   Pen|   27|   19.99|539.73|
+---------+-------+-------+------+-----+--------+------+
only showing top 4 rows
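
Before settling on a date pattern, it can help to try a few candidates against sample values from the column. The sketch below is a plain-Python analogue only: it uses `datetime.strptime` directives rather than Spark's pattern letters, and `first_matching_format` is a hypothetical helper introduced for illustration.

```python
from datetime import datetime

# Sample values copied from the OrderDate column shown earlier
samples = ['1/6/2016', '3/2/2017', '2/9/2016', '2/26/2016']

# Candidate strptime formats (Python directives, not Spark pattern letters)
candidates = ['%d-%m-%Y', '%m/%d/%y', '%m/%d/%Y']


def first_matching_format(values, formats):
    """Return the first format that parses every value, or None."""
    for fmt in formats:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format failed on at least one value
        return fmt
    return None


matching_format = first_matching_format(samples, candidates)
parsed_dates = [datetime.strptime(v, matching_format).date() for v in samples]
print(matching_format, parsed_dates)
```

Only the slash-separated, four-digit-year format parses all four samples here. Spark uses its own pattern syntax, so the equivalent pattern there has to be checked separately.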



## .agg(...)

In [7]:
sample_df_inferred.agg({'Total': 'avg'}).show()

+------------------+
|        avg(Total)|
+------------------+
|456.46232558139553|
+------------------+



In [15]:
aggregations = [
      ('Total', f.min,    'Total_min')
    , ('Total', f.max,    'Total_max')
    , ('Total', f.avg,    'Total_avg')
    , ('Total', f.stddev, 'Total_stddev')
]

(
    sample_df_inferred
    .agg(*[func(column).alias(alias) for column, func, alias in aggregations])
    .show()
)

+---------+---------+------------------+-----------------+
|Total_min|Total_max|         Total_avg|     Total_stddev|
+---------+---------+------------------+-----------------+
|     9.03|  1879.06|456.46232558139553|447.0221038416717|
+---------+---------+------------------+-----------------+



## .sql(...)

In [18]:
sample_df_inferred.createOrReplaceTempView('sample_df_inferred')

spark.sql('''SELECT 
              MIN(Total)    AS Total_min
            , MAX(Total)    AS Total_max
            , AVG(Total)    AS Total_avg
            , STDDEV(Total) AS Total_std
            FROM sample_df_inferred''').show()

+---------+---------+------------------+-----------------+
|Total_min|Total_max|         Total_avg|        Total_std|
+---------+---------+------------------+-----------------+
|     9.03|  1879.06|456.46232558139553|447.0221038416717|
+---------+---------+------------------+-----------------+



In [29]:
sample_df_inferred.createOrReplaceTempView('sample_df_inferred')

spark.sql('''SELECT
              Rep
            , MIN(Total) AS Total_min
            , MAX(Total) AS Total_max
            , ROUND(AVG(Total),2) AS Total_avg
            , ROUND(STDDEV(Total),2) AS Total_std
            FROM sample_df_inferred
            GROUP BY Rep
            ORDER BY Total_avg DESC''').show()

+--------+---------+---------+---------+---------+
|     Rep|Total_min|Total_max|Total_avg|Total_std|
+--------+---------+---------+---------+---------+
|  Parent|   299.85|  1619.19|   1034.1|    672.2|
|  Kivell|   479.04|   1005.9|   777.36|   266.95|
|Thompson|    63.68|  1139.43|   601.56|   760.67|
| Jardine|    54.89|  1879.06|   562.44|   749.73|
|   Smith|    86.43|   1305.0|   547.14|    661.4|
|  Morgan|   251.72|   686.95|   462.59|   217.93|
|    Gill|     9.03|    719.2|   349.97|   304.93|
| Sorvino|   139.93|    825.0|    320.9|   336.25|
|   Jones|    19.96|   575.36|   295.38|   185.72|
|  Howard|    57.71|   479.04|   268.38|   297.93|
| Andrews|    18.06|   149.25|   109.59|    61.46|
+--------+---------+---------+---------+---------+



In [14]:
(sample_df_inferred.selectExpr(
          'MIN(Total) AS Total_min'
        , 'MAX(Total) AS Total_max'
    )).show()

+---------+---------+
|Total_min|Total_max|
+---------+---------+
|     9.03|  1879.06|
+---------+---------+



## Creating Temporary views

In [20]:
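# createTempView(...) raises an error if a view with this name already exists
#sample_df_inferred.createTempView('sample_df_inferred')

In [21]:
# createOrReplaceTempView(...) registers the view, replacing any existing one
sample_df_inferred.createOrReplaceTempView('sample_df_inferred')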

## Joining two DataFrames

In [9]:
regions = spark.createDataFrame(
    sc.parallelize([
        ('Central', 'Chicago')
        , ('West', 'Seattle')
        , ('East', 'Boston')
    ]),
    ['Region', 'Headquarters']
)
regions.show()

+-------+------------+
| Region|Headquarters|
+-------+------------+
|Central|     Chicago|
|   West|     Seattle|
|   East|      Boston|
+-------+------------+



In [10]:
sample_df_inferred.show(5)

+---------+-------+-------+------+-----+--------+------+
|OrderDate| Region|    Rep|  Item|Units|UnitCost| Total|
+---------+-------+-------+------+-----+--------+------+
|     null|   East|  Jones|Pencil|   95|    1.99|189.05|
|     null|Central| Kivell|Binder|   50|   19.99| 999.5|
|     null|Central|Jardine|Pencil|   36|    4.99|179.64|
|     null|Central|   Gill|   Pen|   27|   19.99|539.73|
|     null|   West|Sorvino|Pencil|   56|    2.99|167.44|
+---------+-------+-------+------+-----+--------+------+
only showing top 5 rows



In [17]:
(
    sample_df_inferred
    .join(regions, on=['Region'], how='left_outer')
    .orderBy('UnitCost')
    .show(10)
)


+-------+---------+--------+------+-----+--------+------+------------+
| Region|OrderDate|     Rep|  Item|Units|UnitCost| Total|Headquarters|
+-------+---------+--------+------+-----+--------+------+------------+
|Central|     null|    Gill|Pencil|   53|    1.29| 68.37|     Chicago|
|Central|     null|    Gill|Pencil|    7|    1.29|  9.03|     Chicago|
|Central|     null|   Smith|Pencil|   67|    1.29| 86.43|     Chicago|
|Central|     null| Andrews|Pencil|   14|    1.29| 18.06|     Chicago|
|Central|     null| Andrews|Pencil|   66|    1.99|131.34|     Chicago|
|   West|     null|Thompson|Pencil|   32|    1.99| 63.68|     Seattle|
|   West|     null| Sorvino|   Pen|   76|    1.99|151.24|     Seattle|
|   East|     null|   Jones|Pencil|   95|    1.99|189.05|      Boston|
|Central|     null| Andrews|Pencil|   75|    1.99|149.25|     Chicago|
|   East|     null|  Howard|Binder|   29|    1.99| 57.71|      Boston|
+-------+---------+--------+------+-----+--------+------+------------+
only showing top 10 rows
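
With `how='left_outer'`, every row of the left DataFrame is kept, and the right-hand columns are filled with null where no matching key exists. A minimal plain-Python sketch of that behavior follows. It assumes each region appears at most once in the lookup, and the `North` row is a hypothetical record added only to show the unmatched case.

```python
# Left side: a few order records (the 'North' row is hypothetical)
orders = [
    {'Region': 'East', 'Rep': 'Jones', 'Total': 189.05},
    {'Region': 'Central', 'Rep': 'Kivell', 'Total': 999.5},
    {'Region': 'North', 'Rep': 'Example', 'Total': 1.0},
]

# Right side: the regions lookup, keyed by the join column
headquarters = {'Central': 'Chicago', 'West': 'Seattle', 'East': 'Boston'}

# Keep every order; unmatched regions get None, mirroring null in Spark
joined = [
    {**order, 'Headquarters': headquarters.get(order['Region'])}
    for order in orders
]

for row in joined:
    print(row)
```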

## Descriptive Statistics

In [19]:
# describe() summarizes numeric and string columns only, so the DateType column (OrderDate) is omitted
sample_df_inferred.describe().show()

+-------+-------+--------+------+------------------+------------------+------------------+
|summary| Region|     Rep|  Item|             Units|          UnitCost|             Total|
+-------+-------+--------+------+------------------+------------------+------------------+
|  count|     43|      43|    43|                43|                43|                43|
|   mean|   null|    null|  null|49.325581395348834|20.308604651162792|456.46232558139553|
| stddev|   null|    null|  null|30.078247899067208| 47.34511769375187| 447.0221038416717|
|    min|Central| Andrews|Binder|                 2|              1.29|              9.03|
|    max|   West|Thompson|Pencil|                96|             275.0|           1879.06|
+-------+-------+--------+------+------------------+------------------+------------------+



**For Numeric Columns Only**

In [21]:
numeric_columns = [
    name
    for name, dtype in sample_df_inferred.dtypes
    if dtype in ('int', 'double')
]

sample_df_inferred.select(numeric_columns).describe().show()

+-------+------------------+------------------+------------------+
|summary|             Units|          UnitCost|             Total|
+-------+------------------+------------------+------------------+
|  count|                43|                43|                43|
|   mean|49.325581395348834|20.308604651162792|456.46232558139553|
| stddev|30.078247899067208| 47.34511769375187| 447.0221038416717|
|    min|                 2|              1.29|              9.03|
|    max|                96|             275.0|           1879.06|
+-------+------------------+------------------+------------------+



In [22]:
sample_df_inferred.agg(*
    [f.mean(f.col(e)).alias('mean_' + e) for e in numeric_columns] +
    [f.stddev(f.col(e)).alias('stddev_' + e) for e in numeric_columns]
).show()

+------------------+------------------+------------------+------------------+-----------------+-----------------+
|        mean_Units|     mean_UnitCost|        mean_Total|      stddev_Units|  stddev_UnitCost|     stddev_Total|
+------------------+------------------+------------------+------------------+-----------------+-----------------+
|49.325581395348834|20.308604651162792|456.46232558139553|30.078247899067208|47.34511769375187|447.0221038416717|
+------------------+------------------+------------------+------------------+-----------------+-----------------+



In [24]:
sample_df_inferred.agg(*[f.kurtosis(f.col('Total')).alias('kurtosis_Total'),
                        f.skewness(f.col('Total')).alias('skewness_Total')]).show()

+-----------------+------------------+
|   kurtosis_Total|    skewness_Total|
+-----------------+------------------+
|1.551174226609806|1.4391370583659786|
+-----------------+------------------+
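
To make the two statistics above more concrete, here is a small plain-Python sketch of skewness and kurtosis computed from central moments. It assumes the population-moment definitions with kurtosis reported as excess kurtosis (minus 3), which is our understanding of what Spark computes but should be treated as an assumption. The function names and the toy data are illustrative only.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third central moment scaled by the variance to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment scaled by the squared variance, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]      # toy data, symmetric around 3
right_skewed = [1.0, 1.0, 1.0, 2.0, 10.0]  # toy data with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

A positive skewness, as reported for `Total`, indicates a longer right tail: a few large orders pull the mean above the bulk of the values.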



## .distinct()

In [26]:
# Returns the unique rows (exact duplicate rows are removed)
sample_df_inferred.distinct().show(4)

+---------+-------+-------+------+-----+--------+------+
|OrderDate| Region|    Rep|  Item|Units|UnitCost| Total|
+---------+-------+-------+------+-----+--------+------+
|     null|   East|  Jones|Pencil|   95|    1.99|189.05|
|     null|Central| Kivell|Binder|   50|   19.99| 999.5|
|     null|Central|Jardine|Pencil|   36|    4.99|179.64|
|     null|Central|   Gill|   Pen|   27|   19.99|539.73|
+---------+-------+-------+------+-----+--------+------+
only showing top 4 rows



In [31]:
sample_df_inferred.select('Region').distinct().show()
sample_df_inferred.select('Rep').distinct().show()

+-------+
| Region|
+-------+
|   East|
|Central|
|   West|
+-------+

+--------+
|     Rep|
+--------+
|   Jones|
|  Kivell|
| Jardine|
|    Gill|
| Sorvino|
| Andrews|
|Thompson|
|  Morgan|
|  Howard|
|  Parent|
|   Smith|
+--------+



In [32]:
# Distinct (Region, Rep) combinations, sorted by Region and then Rep
(
    sample_df_inferred
    .select('Region', 'Rep')
    .distinct()
    .orderBy('Region', 'Rep')
    .show()
)

+-------+--------+
| Region|     Rep|
+-------+--------+
|Central| Andrews|
|Central|    Gill|
|Central| Jardine|
|Central|  Kivell|
|Central|  Morgan|
|Central|   Smith|
|   East|  Howard|
|   East|   Jones|
|   East|  Parent|
|   West| Sorvino|
|   West|Thompson|
+-------+--------+

