# Spark Learning Note - Basic Structured Operations

## DataFrame vs Dataset

- DataFrames are "untyped": Spark checks column types against the schema only at runtime, when the code executes
- Datasets are typed: types are checked at compile time, when you declare them
- DataFrames are usually sufficient. Prefer Datasets when strict compile-time type checking is needed.
- Python is a dynamically typed language, so PySpark does not support Datasets (at least in Spark 2)

## Structured API Execution Steps

- Write DataFrame/Dataset/SQL code
- Spark converts the code to a logical plan (it analyzes and **optimizes** the logical plan)
- Spark transforms the logical plan into a physical plan (it **optimizes** the physical plan, which describes how to execute on the cluster)
- Spark executes the physical plan on the cluster

In [27]:
from pyspark.sql import SparkSession
data_example_path = '/home/jgeng/Documents/Git/SparkLearning/data/2015flight.csv'

### Spark Session
The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder:

In [22]:
# build a spark session locally
spark = SparkSession.builder.appName('Spark Example').getOrCreate()
spark

### 1. Read File into DataFrame
A Spark session can read files of different formats.
- Use `.format()` to specify the file format
- `.option()` configures the read, for example whether the first line is a header
- The resulting DataFrame has a `printSchema()` method that prints column names and types

Data types on read:
- A `json` file carries some information about the type of the data (but not its precision).
- Data in a `csv` file is read as strings unless a schema is specified.
- **`option('inferSchema', True)` enables schema inference when reading a csv file**

In [28]:
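# read from csv file, using the first line as the header
# (without the header option Spark would name the columns _c0, _c1, ... and treat the header line as data)
df = spark.read.format('csv').option('header', True).load(data_example_path)

In [48]:
# without inferSchema every csv column is read as a string
df.show(3)
df.printSchema() # count is in StringType!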

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
+-----------------+-------------------+-----+
only showing top 3 rows

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: string (nullable = true)



In [54]:
# use options to read the header and let Spark infer the column types
df = spark.read.format('csv').option('header', True).option('inferSchema', True).load(data_example_path)
df.show(3)
df.printSchema()  # now count is in IntegerType!

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
+-----------------+-------------------+-----+
only showing top 3 rows

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)



In [59]:
# printSchema() is a handy way to check column names and types
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)



### 2. Load File with Manual Schema

When reading data from a file without specifying a schema, Spark creates a default schema on read. **This can cause precision issues if the inferred types do not match the precision of the data in the file. In production, it is usually recommended to set the schema manually.**

`pyspark.sql.types` contains all supported data types.

For a full list, see: https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/ch04.html

A Spark DataFrame schema is defined by a StructType containing a list of StructFields.

StructField:
- name: name of the field/column
- type: data type
- nullable: whether null value is allowed
- metadata: extra information stored about the column (used by Spark's machine learning library)

In [61]:
# a more low level representation
print(df.schema)

StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,IntegerType,true)))


In [76]:
# specify a schema manually - e.g. what if count is long rather than integer
from pyspark.sql.types import StructType, StructField, StringType, LongType

# field name, data type, nullable
field_1 = StructField('DEST_COUNTRY_NAME', StringType(), True)
field_2 = StructField('ORIGIN_COUNTRY_NAME', StringType(), True)
field_3 = StructField('count', LongType(), False)

manualSchema = StructType([field_1, field_2, field_3])

In [77]:
df = spark.read.format('csv').option('header', True).schema(manualSchema).load(data_example_path)

# count is now long, but nullable is still true even though the schema says False:
# the CSV format has no way to express data constraints, so the reader cannot
# assume the input has no nulls and reports every column as nullable
df.printSchema()
df.show(3)

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
+-----------------+-------------------+-----+
only showing top 3 rows



### 3. Column, Row, and Create DataFrame from Scratch 

A `Column` (created with `col`) can be used in expressions passed to methods such as `select()`.

A `Row` is a single data record.

In [106]:
from pyspark.sql.functions import col

# create an unattached column
new_col = col('new_col')
new_col

Column<b'new_col'>

In [108]:
# get a column from df
df['count']

Column<b'count'>

In [115]:
# df.columns is a list of all column names; [0] picks the first one
df.columns[0]

'DEST_COUNTRY_NAME'

In [123]:
from pyspark.sql import Row

# create a new data record
newRow = Row('Hi', None, 2, True)

# access a value from the data record by position
newRow[0]  # values keep the Python type they were created with

'Hi'

In [125]:
# create a DataFrame from Scratch
# field name, data type, nullable
field_1 = StructField('DEST_COUNTRY_NAME', StringType(), True)
field_2 = StructField('ORIGIN_COUNTRY_NAME', StringType(), True)
field_3 = StructField('count', LongType(), False)

manualSchema = StructType([field_1, field_2, field_3])

newRow = Row('NYC', 'MIA', 2)

newDF = spark.createDataFrame([newRow], manualSchema)
newDF.show()

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|              NYC|                MIA|    2|
+-----------------+-------------------+-----+



### 4. Select & SelectExpr & withColumn

`expr` supports flexible SQL expressions on column data:
- use `expr()` inside `select()`
- use `selectExpr()` as a shorthand

To manipulate columns:
- use `df.withColumn()` to add a column
- use `df.withColumnRenamed()` to rename a column


In [127]:
df.show(4)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
+-----------------+-------------------+-----+
only showing top 4 rows



In [129]:
df.select('DEST_COUNTRY_NAME').show(4)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
|    United States|
|            Egypt|
+-----------------+
only showing top 4 rows



In [144]:
from pyspark.sql.functions import expr

# use AS to rename a column
# the returned DataFrame has the new column names
newdf = df.select(expr("DEST_COUNTRY_NAME as destination"), 
          expr('ORIGIN_COUNTRY_NAME as departure'))
newdf.show(4)

+-------------+-------------+
|  destination|    departure|
+-------------+-------------+
|United States|      Romania|
|United States|      Croatia|
|United States|      Ireland|
|        Egypt|United States|
+-------------+-------------+
only showing top 4 rows



In [150]:
# use * to select all columns
# expr can also combine columns to create a new column
newdf.select('*', expr('destination=departure as invalid')).show(4)

+-------------+-------------+-------+
|  destination|    departure|invalid|
+-------------+-------------+-------+
|United States|      Romania|  false|
|United States|      Croatia|  false|
|United States|      Ireland|  false|
|        Egypt|United States|  false|
+-------------+-------------+-------+
only showing top 4 rows



In [154]:
# use selectExpr to do it all in one call
# it takes SQL expressions as strings, including literals such as 1 or True
df.selectExpr('*', 'DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME as invalid', '1 as one', 'True as bool').show(4)

+-----------------+-------------------+-----+-------+---+----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|invalid|one|bool|
+-----------------+-------------------+-----+-------+---+----+
|    United States|            Romania|   15|  false|  1|true|
|    United States|            Croatia|    1|  false|  1|true|
|    United States|            Ireland|  344|  false|  1|true|
|            Egypt|      United States|   15|  false|  1|true|
+-----------------+-------------------+-----+-------+---+----+
only showing top 4 rows



In [161]:
from pyspark.sql.functions import lit

# a more formal way to add a column is withColumn
df.withColumn('one', lit(1)).show(2)  # NOT in place: DataFrames are immutable, withColumn returns a new DataFrame

+-----------------+-------------------+-----+---+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|one|
+-----------------+-------------------+-----+---+
|    United States|            Romania|   15|  1|
|    United States|            Croatia|    1|  1|
+-----------------+-------------------+-----+---+
only showing top 2 rows




In [163]:
# withColumn also supports expr
df.withColumn('invalid', expr('DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME')).show(2)

+-----------------+-------------------+-----+-------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|invalid|
+-----------------+-------------------+-----+-------+
|    United States|            Romania|   15|  false|
|    United States|            Croatia|    1|  false|
+-----------------+-------------------+-----+-------+
only showing top 2 rows



In [167]:
# withColumn can also copy a column under a new name (the original column is kept)
df.withColumn('destination', expr('DEST_COUNTRY_NAME')).show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|  destination|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|United States|
|    United States|            Croatia|    1|United States|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



In [168]:
# a more straightforward way to rename a column is withColumnRenamed
df.withColumnRenamed('DEST_COUNTRY_NAME', 'dest').show(2)

+-------------+-------------------+-----+
|         dest|ORIGIN_COUNTRY_NAME|count|
+-------------+-------------------+-----+
|United States|            Romania|   15|
|United States|            Croatia|    1|
+-------------+-------------------+-----+
only showing top 2 rows

