### Library Imports

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

Create a `SparkSession`. No need to create `SparkContext` as you automatically get it as part of the `SparkSession`.

In [4]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Exploring Joins") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sc = spark.sparkContext

## Use Case

Let's do some analysis on orders data. This will be a simple table with the following columns:

* `id`: the primary key
* `ordered_at`: the date the order was placed
* `amount`: the total dollars associated with the order


1) We want to set columns to a specific value. 

    e.g., setting a `current_date` column to the current date for data wrangling.

2) We want to do comparisons with a constant value.

    e.g., finding all the orders with `amount` greater than `x`.

In [5]:
from datetime import datetime

In [8]:
orders_df = spark.createDataFrame(
    [
        (1, datetime(2018, 1, 1), 1),
        (2, datetime(2018, 1, 1), 2),
        (3, datetime(2018, 1, 1), 3),
        (4, datetime(2018, 1, 1), 4),
    ], ['id','ordered_at', 'amount']
)

orders_df.toPandas()

Unnamed: 0,id,ordered_at,amount
0,1,2018-01-01,1
1,2,2018-01-01,2
2,3,2018-01-01,3
3,4,2018-01-01,4


## Case 1: Filtering on a Constant Value (`where` clause)

In [14]:
orders_df.where(F.col('amount') > 1).toPandas()

Unnamed: 0,id,ordered_at,amount
0,2,2018-01-01,2
1,3,2018-01-01,3
2,4,2018-01-01,4


## Case 2: Filtering for Constant Values (`isin` clause)

In [15]:
orders_df.where(F.col('amount').isin(1, 2, 3)).toPandas()

Unnamed: 0,id,ordered_at,amount
0,1,2018-01-01,1
1,2,2018-01-01,2
2,3,2018-01-01,3


## Case 3: Set a Column to a Constant Value (`withColumn` clause)

In [18]:
orders_df.withColumn('current_date', datetime.today()).toPandas()

AssertionError: col should be Column

In [21]:
orders_df.withColumn('current_date', F.lit(datetime.today())).toPandas()

Unnamed: 0,id,ordered_at,amount,current_date
0,1,2018-01-01,1,2018-12-11 23:22:34.820724
1,2,2018-01-01,2,2018-12-11 23:22:34.820724
2,3,2018-01-01,3,2018-12-11 23:22:34.820724
3,4,2018-01-01,4,2018-12-11 23:22:34.820724


## TL;DR

**Setting a column to a constant:**  
When we set a column to a constant value, we need to wrap the value in `F.lit()`, because `withColumn` expects a `Column` and we have to provide it with a **Spark literal**.

**Everything else:**  
For comparisons with constants, we can simply pass the value as is to filter conditions.