# Lecture 5 Examples

These are the examples presented in class with a little additional annotation.

In [0]:
import os
from pyspark.sql import SparkSession
# Reuse the active Spark session if one exists; otherwise create one.
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()
dir_path = "/Volumes/workspace/default/files"
file_path = os.path.join(dir_path, "train.csv")
#dbutils.fs.ls(file_path)
# Read the CSV, treating the first line as a header and inferring column types.
spark_df = spark.read.csv(file_path, header=True, inferSchema=True)




`col` and `count` are expression builders. `col` returns an abstract object representing a column that has not yet been associated with any data; expressions are symbolic. As an aside, Spark has two `count` functions, which can be confusing. One is the expression builder `count` from `pyspark.sql.functions`. The other is the DataFrame method `count`, an action that forces execution and returns an integer. The import below brings in the expression builder `count`.

In [0]:
from pyspark.sql.functions import col, count
type(col("Cabin"))


`col("Cabin")` is a symbolic reference to a column named "Cabin"; it is not yet associated with any underlying data.
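To make "symbolic" concrete, here is a toy model in plain Python. It is only a sketch of the idea, not how Spark implements `Column`; the names `ToyColumn`, `toy_col`, `is_null`, and `evaluate` are invented for this illustration.

```python
# Toy model (NOT Spark's implementation) of a symbolic column expression.
class ToyColumn:
    """Holds a description of a computation, not any data."""

    def __init__(self, description, fn):
        self.description = description
        self._fn = fn  # how to compute a value from one row (a dict)

    def is_null(self):
        # Builds a new, larger expression; nothing is evaluated here.
        return ToyColumn(f"({self.description} IS NULL)",
                         lambda row: self._fn(row) is None)

    def evaluate(self, row):
        # Only now is the expression applied to actual data.
        return self._fn(row)

    def __repr__(self):
        return f"ToyColumn<{self.description}>"


def toy_col(name):
    # Hypothetical analogue of `col`: a reference to a column by name.
    return ToyColumn(name, lambda row: row.get(name))


expr = toy_col("Cabin").is_null()    # no data involved yet
print(expr)                           # ToyColumn<(Cabin IS NULL)>

rows = [{"Cabin": "C85"}, {"Cabin": None}]   # made-up sample rows
results = [expr.evaluate(r) for r in rows]
print(results)                        # [False, True]
```

The expression exists and can be printed before any rows are available, which is the same property `col("Cabin")` has in the cell above.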

`spark_df` is a Spark DataFrame. `select` is a transformation. A transformation associates an expression with data and adds it to a logical plan. Calling `explain(extended=True)` prints the logical and physical plans.

In [0]:
df = spark_df.select(col("Cabin").isNull())
df.explain(extended=True)
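The difference between a transformation (which only records a step in a plan) and an action (which runs the plan and returns a result) can be sketched with a toy class in plain Python. This is an assumption-laden illustration, not Spark's planner; `ToyFrame`, its methods, and the plan text format are all hypothetical.

```python
# Toy model (NOT Spark's planner) of lazy transformations and eager actions.
class ToyFrame:
    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)   # recorded (name, function) pairs

    def select(self, name, fn):
        # Transformation: record the step and return a new frame.
        # No rows are processed here.
        return ToyFrame(self.rows, self.steps + ((name, fn),))

    def explain(self):
        # Describe the recorded plan without running it.
        return " -> ".join(["Scan"] + [name for name, _ in self.steps])

    def count(self):
        # Action: run every recorded step, then return a plain integer.
        rows = self.rows
        for _, fn in self.steps:
            rows = [fn(r) for r in rows]
        return len(rows)


evaluated = []   # records each row the expression is actually applied to

def cabin_is_null(row):
    evaluated.append(row)
    return {"is_null": row["Cabin"] is None}


frame = ToyFrame([{"Cabin": "C85"}, {"Cabin": None}])   # made-up rows
lazy = frame.select("Project(Cabin IS NULL)", cabin_is_null)

plan_text = lazy.explain()
touched_before = len(evaluated)   # still 0: select and explain ran nothing
n = lazy.count()                  # the action triggers execution
touched_after = len(evaluated)    # 2: both rows were processed

print(plan_text)
print(touched_before, n, touched_after)
```

In this toy, `select` and `explain` never touch a row; only `count` does, and it hands back an ordinary integer rather than another frame.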

Below we use the action `count` to count the number of rows in the column `Cabin`. The action counts every row, including rows in which `Cabin` is null.

In [0]:
df = spark_df.select(col("Cabin"))
df.count()


Below we create a symbolic expression that flags the rows in which the value of the column `Cabin` is `null`. `when` replaces each `null` with the string literal `Cabin`; because no `otherwise` clause is given, every row that is not null defaults to `null`.

In [0]:
from pyspark.sql.functions import col, count, when
df = spark_df.select(when(col("Cabin").isNull(),"Cabin"))
df.show()

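The expression builder `count` counts the non-null values of its argument. Each row in which `Cabin` was originally null now holds the string `Cabin`, which is not null, and every other row is null. The result is therefore the number of null values in the original `Cabin` column.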
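The logic of counting after `when` can be traced by hand in plain Python. The sketch below only mimics the semantics described above on a made-up list of values; `toy_when` and `toy_count` are hypothetical helpers, and `None` stands in for Spark's null.

```python
# Made-up sample values for the Cabin column; None stands for null.
cabin = ["C85", None, None, "E46", None]


def toy_when(is_match, value):
    # Mimics `when(cond, value)` with no `otherwise`:
    # rows that do not match become None.
    return value if is_match else None


def toy_count(values):
    # Mimics the aggregate `count(expr)`: only non-null values are counted.
    return sum(1 for v in values if v is not None)


# Step 1: flag the null rows with the string "Cabin"; everything else is None.
flagged = [toy_when(c is None, "Cabin") for c in cabin]
print(flagged)      # [None, 'Cabin', 'Cabin', None, 'Cabin']

# Step 2: counting the non-null flags gives the number of original nulls.
null_count = toy_count(flagged)
print(null_count)   # 3
```

The two steps invert which rows are null, so a function that counts non-null values ends up reporting how many nulls the original column had.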

In [0]:
df = spark_df.select(count(when(col("Cabin").isNull(),"Cabin")))
df.show()

In [0]:
# Apply the same null count to every column, naming each result after its column.
df = spark_df.select([count(when(col(c).isNull(), c)).alias(c) for c in spark_df.columns])

In [0]:
df.show()

In [0]:
# Summary statistics (count, mean, stddev, min, max) for the Age column.
spark_df.select("Age").describe().show()