# Chapter 3: Filtering and Data Manipulation

[**3.1 Filtering**](#3.1-Filtering)   
[**3.1.1 Filtering Column**](#3.1.1-Filtering-Column)   
[**3.1.2 Filtering Row**](#3.1.2-Filtering-Row)   
[**3.1.3 Filter single column value**](#3.1.3-Filter-single-column-value)   
[**3.1.4 Filter multiple column with AND operator**](#3.1.4-Filter-multiple-column-with-AND-operator)   
[**3.1.5 Filter multiple column with OR operator**](#3.1.5-Filter-multiple-column-with-OR-operator)   
[**3.1.6 Filter with Boolean expression**](#3.1.6-Filter-with-Boolean-expression)   
[**3.2 PySpark SQL Module**](#3.2-PySpark-SQL-Module)   
[**3.3 Numeric Type Manipulation**](#3.3-Numeric-Type-Manipulation)   
[**3.4 String Type Manipulation**](#3.4-String-Type-Manipulation)   
[**3.5 Date and Timestamp Type Manipulation**](#3.5-Date-and-Timestamp-Type-Manipulation)   
[**3.6 Complex Type Manipulation**](#3.6-Complex-Type-Manipulation)   
[**3.6.1 Arrays Type**](#3.6.1-Arrays-Type)   
[**3.6.2 Maps Type**](#3.6.2-Maps-Type)   
[**3.6.3 Structs Type**](#3.6.3-Structs-Type)   
[**3.7 Handling Nulls**](#3.7-Handling-Nulls)   
[**3.7.1 Dropping Null Values**](#3.7.1-Droping-Null-Values)   
[**3.7.2 Filling Null Values**](#3.7.2-Filling-Null-Values)   
[**3.7.3 Filtering Null Values**](#3.7.3-Filtering-Null-Values)   
[**3.8 User Defined Functions**](#3.8-User-Defined-Functions)   

#### 3.1 Filtering
**Filtering**: Filtering is the process of selecting a subset of data for analysis and reporting. Filters can be applied to both `rows` and `columns`.

#### 3.1.1 Filtering Column
Filtering columns is the process of dropping (removing) columns/attributes from the original DataFrame. Column filtering is used to:
* remove sensitive fields from data.
* remove less important fields.
* remove temporary fields added during data transformation or validation.
* reduce data size for fast processing and optimization.

The `drop()` method drops columns from a DataFrame. It accepts one or more column names as separate arguments; to pass a Python list of names, unpack it with `*`, as in `drop(*column_list)`.

In [3]:
import configparser

# Read mysql database connection string from conf/db_properties.ini

config_filename = '../Chapter_2_Structured_API/Lab_1/conf/db_properties.ini'
db_properties = {}
config = configparser.ConfigParser()
config.read(config_filename)
db_prop = config['mysql']
db_url = db_prop['url']
db_properties['database'] = db_prop['database']
db_properties['schema'] = db_prop['schema']
db_properties['user'] = db_prop['user']
db_properties['password'] = db_prop['password']

In [6]:
from pyspark.sql import SparkSession

# Create Spark Session
spark = SparkSession.builder \
    .master("local") \
    .appName("Chapter 3") \
    .getOrCreate()

# Load current_dept_emp
current_dept_emp = spark.read.jdbc(url = db_url, table = 'current_dept_emp', properties = db_properties)

# Load departments
departments = spark.read.jdbc(url = db_url, table = 'departments', properties = db_properties)

# Load dept_emp
dept_emp = spark.read.jdbc(url = db_url, table = 'dept_emp', properties = db_properties)

# Load dept_emp_latest_date
dept_emp_latest_date = spark.read.jdbc(url = db_url, table = 'dept_emp_latest_date', properties = db_properties)

# Load dept_manager
dept_manager = spark.read.jdbc(url = db_url, table = 'dept_manager', properties = db_properties)

# Load employees
employees = spark.read.jdbc(url = db_url, table = 'employees', properties = db_properties)

# Load highest_salary_employee
highest_salary_employee = spark.read.jdbc(url = db_url, table = 'highest_salary_employee', properties = db_properties)

# Load salaries
salaries = spark.read.jdbc(url = db_url, table = 'salaries', properties = db_properties)

# Load titles
titles = spark.read.jdbc(url = db_url, table = 'titles', properties = db_properties)

In [7]:
employees.printSchema()

root
 |-- emp_no: integer (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: date (nullable = true)



**Dropping Single Column**

To drop a single column, pass its name to the `drop()` method. The example below drops `emp_no` from the employees DataFrame.

In [8]:
# Drop emp_no from employees DF and store new value into emp_tmpDF.

emp_tmpDF = employees.drop("emp_no")
emp_tmpDF.printSchema()
emp_tmpDF.show(10, False)

root
 |-- birth_date: date (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: date (nullable = true)

+----------+----------+---------+------+----------+
|birth_date|first_name|last_name|gender|hire_date |
+----------+----------+---------+------+----------+
|1953-09-02|Georgi    |Facello  |M     |1986-06-26|
|1964-06-02|Bezalel   |Simmel   |F     |1985-11-21|
|1959-12-03|Parto     |Bamford  |M     |1986-08-28|
|1954-05-01|Chirstian |Koblick  |M     |1986-12-01|
|1955-01-21|Kyoichi   |Maliniak |M     |1989-09-12|
|1953-04-20|Anneke    |Preusig  |F     |1989-06-02|
|1957-05-23|Tzvetan   |Zielinski|F     |1989-02-10|
|1958-02-19|Saniya    |Kalloufi |M     |1994-09-15|
|1952-04-19|Sumant    |Peac     |F     |1985-02-18|
|1963-06-01|Duangkaew |Piveteau |F     |1989-08-24|
+----------+----------+---------+------+----------+
only showing top 10 rows



**Dropping Multiple Column**

To drop multiple columns, pass all of their names to the `drop()` method. The example below drops `emp_no`, `birth_date`, `gender` and `hire_date` from the employees DataFrame. The names can be given either as separate string arguments or as a list unpacked with `*`.

In [10]:
# Drop emp_no, birth_date, gender and hire_date from employees DF and store the result in emp_tmpDF.

emp_tmpDF = employees.drop("emp_no", "birth_date", "gender", "hire_date")
emp_tmpDF.printSchema()
emp_tmpDF.show(10, False)

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)

+----------+---------+
|first_name|last_name|
+----------+---------+
|Georgi    |Facello  |
|Bezalel   |Simmel   |
|Parto     |Bamford  |
|Chirstian |Koblick  |
|Kyoichi   |Maliniak |
|Anneke    |Preusig  |
|Tzvetan   |Zielinski|
|Saniya    |Kalloufi |
|Sumant    |Peac     |
|Duangkaew |Piveteau |
+----------+---------+
only showing top 10 rows



In [17]:
# Drop emp_no, birth_date and gender from employees DF by unpacking a list of column names; store the result in emp_tmpDF.

column_list = ['emp_no', 'birth_date', 'gender']
emp_tmpDF = employees.drop(*column_list)
emp_tmpDF.printSchema()
emp_tmpDF.show(10, False)

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- hire_date: date (nullable = true)

+----------+---------+----------+
|first_name|last_name|hire_date |
+----------+---------+----------+
|Georgi    |Facello  |1986-06-26|
|Bezalel   |Simmel   |1985-11-21|
|Parto     |Bamford  |1986-08-28|
|Chirstian |Koblick  |1986-12-01|
|Kyoichi   |Maliniak |1989-09-12|
|Anneke    |Preusig  |1989-06-02|
|Tzvetan   |Zielinski|1989-02-10|
|Saniya    |Kalloufi |1994-09-15|
|Sumant    |Peac     |1985-02-18|
|Duangkaew |Piveteau |1989-08-24|
+----------+---------+----------+
only showing top 10 rows



#### 3.1.2 Filtering Row
Filtering rows (records) is the process of removing unwanted records from a DataFrame. Row filtering is used to:
* eliminate erroneous records.
* eliminate duplicate records.
* eliminate null or empty records.
* select records based on a business rule or logic.
* select records for particular groups of interest.
* select records for a particular period of time.

Records are always filtered on a Boolean `true` or `false` value evaluated from either a single `column` or a `column expression`. A column expression can be built from multiple columns using arithmetic, bitwise, logical and comparison operators, combined with the functions available in `pyspark.sql.functions`. The logical statement built from these expressions evaluates to a Boolean value for each record, and records are kept or discarded based on it.   

The `where()` and `filter()` methods are used to filter records. They accept a `column`, a `column expression` or a SQL expression string that evaluates to a Boolean. Both methods perform the same task (`where()` is an alias for `filter()`). We'll use `where()` throughout this session since it is easier to remember and mirrors the SQL clause.   

Boolean expressions combine conditions using the Boolean operators shown in the table below. We can also use a Boolean column to filter a DataFrame. A Boolean column is constructed by storing the result of a Boolean expression in the DataFrame. An example is shown in section *3.1.6 Filter with Boolean expression*.  

Table 3.1.2 (a) Boolean Operation

| Operator | Description |
| --------- | ----------- |
| `&` | And operation |
| `\|` | Or operation |
| `~` | Not operation |


**Note**: Chained `where()` calls are applied sequentially and behave like a single `and` filter. When combining column comparisons with `&` or `|`, wrap each comparison in parentheses.
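
The parentheses around each comparison matter because Python gives `&` and `|` a higher precedence than `==`. The sketch below uses plain integers instead of Spark columns (the variables `x` and `y` are purely illustrative), so it runs without a Spark session and only demonstrates how Python parses the expression.

```python
# Plain-Python illustration (no Spark needed): & binds tighter than ==.
x, y = 5, 3

# Parenthesized: each comparison is evaluated first, then the results are combined.
with_parens = (x == 5) & (y == 3)

# Unparenthesized: parsed as x == (5 & y) == 3, i.e. 5 == 1 == 3.
without_parens = x == 5 & y == 3

print(with_parens)     # True
print(without_parens)  # False
```

With Spark columns the unparenthesized form is parsed the same way, so it does not express the intended filter; writing `(col("a") == 1) & (col("b") == 2)` avoids the problem.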

#### 3.1.3 Filter single column value

In [19]:
# Filter based on single column value

from pyspark.sql.functions import col

employees.where(col("first_name") == "Georgi")\
         .select("*")\
         .show(10, False)

+------+----------+----------+-----------+------+----------+
|emp_no|birth_date|first_name|last_name  |gender|hire_date |
+------+----------+----------+-----------+------+----------+
|10001 |1953-09-02|Georgi    |Facello    |M     |1986-06-26|
|10909 |1954-11-11|Georgi    |Atchley    |M     |1985-04-21|
|11029 |1962-07-12|Georgi    |Itzfeldt   |M     |1992-12-27|
|11430 |1957-01-23|Georgi    |Klassen    |M     |1996-02-27|
|12157 |1960-03-30|Georgi    |Barinka    |M     |1985-06-04|
|15220 |1957-08-03|Georgi    |Panienski  |F     |1995-07-23|
|15660 |1956-01-13|Georgi    |Hartvigsen |M     |1994-10-13|
|15689 |1962-09-14|Georgi    |Capobianchi|M     |1995-03-11|
|15843 |1958-07-15|Georgi    |Varley     |M     |1987-04-14|
|16672 |1955-04-25|Georgi    |Peris      |M     |1986-03-13|
+------+----------+----------+-----------+------+----------+
only showing top 10 rows



In [21]:
# Filter based on single column value

from pyspark.sql.functions import col

employees.where("first_name == 'Georgi'")\
         .select("*")\
         .show(10, False)

+------+----------+----------+-----------+------+----------+
|emp_no|birth_date|first_name|last_name  |gender|hire_date |
+------+----------+----------+-----------+------+----------+
|10001 |1953-09-02|Georgi    |Facello    |M     |1986-06-26|
|10909 |1954-11-11|Georgi    |Atchley    |M     |1985-04-21|
|11029 |1962-07-12|Georgi    |Itzfeldt   |M     |1992-12-27|
|11430 |1957-01-23|Georgi    |Klassen    |M     |1996-02-27|
|12157 |1960-03-30|Georgi    |Barinka    |M     |1985-06-04|
|15220 |1957-08-03|Georgi    |Panienski  |F     |1995-07-23|
|15660 |1956-01-13|Georgi    |Hartvigsen |M     |1994-10-13|
|15689 |1962-09-14|Georgi    |Capobianchi|M     |1995-03-11|
|15843 |1958-07-15|Georgi    |Varley     |M     |1987-04-14|
|16672 |1955-04-25|Georgi    |Peris      |M     |1986-03-13|
+------+----------+----------+-----------+------+----------+
only showing top 10 rows



#### 3.1.4 Filter multiple column with AND operator

In [22]:
employees.where("first_name == 'Georgi'")\
         .where("last_name == 'Facello'")\
         .select("*")\
         .show(10, False)

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender|hire_date |
+------+----------+----------+---------+------+----------+
|10001 |1953-09-02|Georgi    |Facello  |M     |1986-06-26|
|55649 |1956-01-23|Georgi    |Facello  |M     |1988-05-04|
+------+----------+----------+---------+------+----------+



In [29]:
# Combine two column conditions with &; each comparison is wrapped in parentheses.

employees.where((col("first_name") == 'Georgi') & (col("last_name") == 'Facello'))\
         .select("*")\
         .show(10, False)

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender|hire_date |
+------+----------+----------+---------+------+----------+
|10001 |1953-09-02|Georgi    |Facello  |M     |1986-06-26|
|55649 |1956-01-23|Georgi    |Facello  |M     |1988-05-04|
+------+----------+----------+---------+------+----------+



In [32]:
firstName = col("first_name") == "Georgi"
lastName = col("last_name") == "Facello"

employees.where(firstName & lastName)\
         .select("*")\
         .show(10, False)

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender|hire_date |
+------+----------+----------+---------+------+----------+
|10001 |1953-09-02|Georgi    |Facello  |M     |1986-06-26|
|55649 |1956-01-23|Georgi    |Facello  |M     |1988-05-04|
+------+----------+----------+---------+------+----------+



#### 3.1.5 Filter multiple column with OR operator

In [34]:
employees.where((col("first_name") == 'Georgi') | (col("last_name") == 'Facello'))\
         .select("*")\
         .show(10, False)

+------+----------+----------+----------+------+----------+
|emp_no|birth_date|first_name|last_name |gender|hire_date |
+------+----------+----------+----------+------+----------+
|10001 |1953-09-02|Georgi    |Facello   |M     |1986-06-26|
|10327 |1954-04-01|Roded     |Facello   |M     |1987-09-18|
|10909 |1954-11-11|Georgi    |Atchley   |M     |1985-04-21|
|11029 |1962-07-12|Georgi    |Itzfeldt  |M     |1992-12-27|
|11430 |1957-01-23|Georgi    |Klassen   |M     |1996-02-27|
|12157 |1960-03-30|Georgi    |Barinka   |M     |1985-06-04|
|12751 |1964-07-06|Nahum     |Facello   |M     |1995-01-09|
|15220 |1957-08-03|Georgi    |Panienski |F     |1995-07-23|
|15346 |1959-09-26|Kirk      |Facello   |F     |1991-12-07|
|15660 |1956-01-13|Georgi    |Hartvigsen|M     |1994-10-13|
+------+----------+----------+----------+------+----------+
only showing top 10 rows



In [48]:
from pyspark.sql.functions import year

genderFilter = col("gender") == 'M'
ageFilter = year(employees.birth_date) <= 1969

employees.where(employees.last_name.isin("Facello")).where(genderFilter | ageFilter).show()

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|
| 10327|1954-04-01|     Roded|  Facello|     M|1987-09-18|
| 12751|1964-07-06|     Nahum|  Facello|     M|1995-01-09|
| 15346|1959-09-26|      Kirk|  Facello|     F|1991-12-07|
| 15685|1958-07-12|   Kasturi|  Facello|     M|1992-03-13|
| 18686|1962-02-23| Kwangyoen|  Facello|     F|1985-05-02|
| 19041|1957-05-29|    Billur|  Facello|     F|1992-08-03|
| 21947|1954-06-18|   Taisook|  Facello|     F|1991-07-30|
| 23938|1955-07-11|     Nahum|  Facello|     M|1985-09-15|
| 24774|1956-09-23|       Uno|  Facello|     F|1989-11-09|
| 24806|1959-09-30|  Charmane|  Facello|     F|1989-03-17|
| 25955|1962-10-09| Christoph|  Facello|     M|1989-03-24|
| 27732|1955-06-04|  Girolamo|  Facello|     M|1986-06-30|
| 30320|1953-12-21|  Kristine|  Facello|     F|1990-06-1

#### 3.1.6 Filter with Boolean expression

In [60]:
# Get first_name, last_name, gender, emp_no and birth_date of male employees older than 50 from employees DF

from pyspark.sql.functions import col, months_between, current_date


genderFilter = col("gender") == 'M'
# datediff() returns days, so derive the age in years from months_between() instead.
ageFilter = (months_between(current_date(), col("birth_date")) / 12) > 50

employees.withColumn("male50above", genderFilter & ageFilter)\
         .where("male50above")\
         .select("first_name", "last_name", "gender", "emp_no", "birth_date").show(10)

+----------+-----------+------+------+----------+
|first_name|  last_name|gender|emp_no|birth_date|
+----------+-----------+------+------+----------+
|    Georgi|    Facello|     M| 10001|1953-09-02|
|     Parto|    Bamford|     M| 10003|1959-12-03|
| Chirstian|    Koblick|     M| 10004|1954-05-01|
|   Kyoichi|   Maliniak|     M| 10005|1955-01-21|
|    Saniya|   Kalloufi|     M| 10008|1958-02-19|
|  Patricio|  Bridgland|     M| 10012|1960-10-04|
| Eberhardt|     Terkki|     M| 10013|1963-06-07|
|     Berni|      Genin|     M| 10014|1956-02-12|
|  Guoxiang|  Nooteboom|     M| 10015|1959-08-19|
|  Kazuhito|Cappelletti|     M| 10016|1961-05-02|
+----------+-----------+------+------+----------+
only showing top 10 rows
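
An age threshold expressed in years needs a value measured in years: `datediff()` returns days and `months_between()` returns months, so either result has to be converted before comparing it with 50. The plain-Python sketch below shows the difference in units; the helper names `days_between` and `age_in_years` and the run date are assumptions made for illustration, not Spark functions.

```python
from datetime import date

def days_between(start, end):
    """Number of days from start to end (a day-level difference)."""
    return (end - start).days

def age_in_years(birth, today):
    """Completed years between birth and today."""
    years = today.year - birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years

today = date(2019, 12, 22)     # assumed run date, for illustration only
one_year_old = date(2018, 12, 22)
georgi = date(1953, 9, 2)      # birth_date of emp_no 10001

print(days_between(one_year_old, today) > 50)   # True: 365 days, yet only 1 year old
print(age_in_years(one_year_old, today) > 50)   # False
print(age_in_years(georgi, today))              # 66
```

A filter that compares a day count with 50 therefore accepts almost every record, which is easy to miss when all employees in the data happen to be older than 50.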



**Create more examples for string manipulation data**

#### 3.2 PySpark SQL Module

Keep the image below in mind and bookmark the link. The [Spark Python API Documents](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html) page is a one-stop shop for the `pyspark.sql` module and helps you solve problems faster by making every resource easy to find.  


![PySpark SQL Module Image](spark_sql_module.png)



If you want to dive deeper into the Scala API, use the links below (optional for this course). Copy and paste a link into your browser if it doesn't open.

* **DataSet Functions**: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset    
* **DataFrame and SQL Functions**: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions\$
* **DataFrameStatFunctions**: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions
* **DataFrameNaFunctions**: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions     
* **Column Methods**: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column

#### 3.3 Numeric Type Manipulation

In [69]:
salaries.select("*", ((col("salary") * 0.05) + col("salary")).alias("salary_with_5%_increase")).show(10)

+------+------+----------+----------+-----------------------+
|emp_no|salary| from_date|   to_date|salary_with_5%_increase|
+------+------+----------+----------+-----------------------+
| 10001| 60117|1986-06-26|1987-06-26|               63122.85|
| 10001| 62102|1987-06-26|1988-06-25|                65207.1|
| 10001| 66074|1988-06-25|1989-06-25|                69377.7|
| 10001| 66596|1989-06-25|1990-06-25|                69925.8|
| 10001| 66961|1990-06-25|1991-06-25|               70309.05|
| 10001| 71046|1991-06-25|1992-06-24|                74598.3|
| 10001| 74333|1992-06-24|1993-06-24|               78049.65|
| 10001| 75286|1993-06-24|1994-06-24|                79050.3|
| 10001| 75994|1994-06-24|1995-06-24|                79793.7|
| 10001| 76884|1995-06-24|1996-06-23|                80728.2|
+------+------+----------+----------+-----------------------+
only showing top 10 rows



In [70]:
# Calculate the new salary by adding a 5% increase and a $500 commission, stored in a new field 'bonus_salary', using a SQL expression.

salaries.selectExpr("*", "salary * 0.05 + 500 + salary as bonus_salary").show(10)

+------+------+----------+----------+------------+
|emp_no|salary| from_date|   to_date|bonus_salary|
+------+------+----------+----------+------------+
| 10001| 60117|1986-06-26|1987-06-26|    63622.85|
| 10001| 62102|1987-06-26|1988-06-25|    65707.10|
| 10001| 66074|1988-06-25|1989-06-25|    69877.70|
| 10001| 66596|1989-06-25|1990-06-25|    70425.80|
| 10001| 66961|1990-06-25|1991-06-25|    70809.05|
| 10001| 71046|1991-06-25|1992-06-24|    75098.30|
| 10001| 74333|1992-06-24|1993-06-24|    78549.65|
| 10001| 75286|1993-06-24|1994-06-24|    79550.30|
| 10001| 75994|1994-06-24|1995-06-24|    80293.70|
| 10001| 76884|1995-06-24|1996-06-23|    81228.20|
+------+------+----------+----------+------------+
only showing top 10 rows



In [72]:
# Calculate the same 'bonus_salary' (5% increase plus $500 commission) using a column expression.

bonusSalary = (col("salary") * 0.05) + 500 + col("salary")
salaries.select("*", bonusSalary.alias("bonus_salary")).show(10)

+------+------+----------+----------+------------+
|emp_no|salary| from_date|   to_date|bonus_salary|
+------+------+----------+----------+------------+
| 10001| 60117|1986-06-26|1987-06-26|    63622.85|
| 10001| 62102|1987-06-26|1988-06-25|     65707.1|
| 10001| 66074|1988-06-25|1989-06-25|     69877.7|
| 10001| 66596|1989-06-25|1990-06-25|     70425.8|
| 10001| 66961|1990-06-25|1991-06-25|    70809.05|
| 10001| 71046|1991-06-25|1992-06-24|     75098.3|
| 10001| 74333|1992-06-24|1993-06-24|    78549.65|
| 10001| 75286|1993-06-24|1994-06-24|     79550.3|
| 10001| 75994|1994-06-24|1995-06-24|     80293.7|
| 10001| 76884|1995-06-24|1996-06-23|     81228.2|
+------+------+----------+----------+------------+
only showing top 10 rows



In [75]:
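# Calculate each employee's age in months, rounded to the nearest whole month.
# selectExpr() parses SQL text, so round() and months_between() need no Python imports
# (importing round from pyspark.sql.functions would also shadow Python's built-in round).

employees.selectExpr("round(months_between(current_date(), birth_date)) as employee_age_in_months").show(10)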

+----------------------+
|employee_age_in_months|
+----------------------+
|                 796.0|
|                 667.0|
|                 721.0|
|                 788.0|
|                 779.0|
|                 800.0|
|                 751.0|
|                 742.0|
|                 812.0|
|                 679.0|
+----------------------+
only showing top 10 rows
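
One detail to keep in mind when rounding: Spark SQL's `round()` rounds half up, while its `bround()` and Python's built-in `round()` round half to even. The sketch below reproduces both modes with Python's `decimal` module so it runs without Spark; it is an illustration of the two rounding modes, not Spark's own implementation.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

value = Decimal("2.5")

# HALF_UP is the mode used by Spark SQL round().
half_up = value.quantize(Decimal("1"), rounding=ROUND_HALF_UP)
# HALF_EVEN is the mode used by Spark SQL bround().
half_even = value.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)

print(half_up)      # 3
print(half_even)    # 2
print(round(2.5))   # 2, Python's built-in round() also rounds half to even
```

The two modes only disagree on exact halves, so the difference shows up rarely but can matter when results are reconciled against another system.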



In [86]:
# Calculate the ceiling and floor values of bonus_salary.

from pyspark.sql.functions import ceil, floor

bonus_salary = col("salary") + col("salary") * 0.05
salaries.select("salary",
                bonus_salary.alias("bonus_salary"),
                ceil(bonus_salary).alias("ceil_salary"),
                floor(bonus_salary).alias("floor_salary")).show(10)

+------+------------+-----------+------------+
|salary|bonus_salary|ceil_salary|floor_salary|
+------+------------+-----------+------------+
| 60117|    63122.85|      63123|       63122|
| 62102|     65207.1|      65208|       65207|
| 66074|     69377.7|      69378|       69377|
| 66596|     69925.8|      69926|       69925|
| 66961|    70309.05|      70310|       70309|
| 71046|     74598.3|      74599|       74598|
| 74333|    78049.65|      78050|       78049|
| 75286|     79050.3|      79051|       79050|
| 75994|     79793.7|      79794|       79793|
| 76884|     80728.2|      80729|       80728|
+------+------------+-----------+------------+
only showing top 10 rows



#### 3.4 String Type Manipulation

In [126]:
# Build a 'user_name' column from employees DF: the first two characters of first_name
# followed by the first two characters of last_name, all in lower case.

# Exercise: use the whole last name (its full length) instead of two characters.
# The link below discusses using a column value as a function parameter.
# https://stackoverflow.com/questions/51140470/using-a-column-value-as-a-parameter-to-a-spark-dataframe-function


from pyspark.sql.functions import length, substring, lower, concat, expr

fname_max_length = 2
lname_max_length = length("last_name")  # a Column, not a Python int
# substring() positions are 1-based.
substrFname = substring("first_name", 1, fname_max_length)
# substring() is given plain integers for pos and len here, so the Column lname_max_length
# is not passed to it; the next cell passes the column-based length through expr() instead.
substrLname = substring("last_name", 1, fname_max_length)
concatNameLcase = lower(concat(substrFname, substrLname))
employees.withColumn("user_name", concatNameLcase)\
         .select("first_name", "last_name", "user_name", lname_max_length).show(10)

+----------+---------+---------+-----------------+
|first_name|last_name|user_name|length(last_name)|
+----------+---------+---------+-----------------+
|    Georgi|  Facello|     gefa|                7|
|   Bezalel|   Simmel|     besi|                6|
|     Parto|  Bamford|     paba|                7|
| Chirstian|  Koblick|     chko|                7|
|   Kyoichi| Maliniak|     kyma|                8|
|    Anneke|  Preusig|     anpr|                7|
|   Tzvetan|Zielinski|     tzzi|                9|
|    Saniya| Kalloufi|     saka|                8|
|    Sumant|     Peac|     supe|                4|
| Duangkaew| Piveteau|     dupi|                8|
+----------+---------+---------+-----------------+
only showing top 10 rows



In [132]:
employees.select("first_name", "last_name",\
          expr("lower(concat(substring(first_name, 0, 2), substring(last_name, 0, length(last_name)))) as user_name")\
          ).show(10)

+----------+---------+-----------+
|first_name|last_name|  user_name|
+----------+---------+-----------+
|    Georgi|  Facello|  gefacello|
|   Bezalel|   Simmel|   besimmel|
|     Parto|  Bamford|  pabamford|
| Chirstian|  Koblick|  chkoblick|
|   Kyoichi| Maliniak| kymaliniak|
|    Anneke|  Preusig|  anpreusig|
|   Tzvetan|Zielinski|tzzielinski|
|    Saniya| Kalloufi| sakalloufi|
|    Sumant|     Peac|     supeac|
| Duangkaew| Piveteau| dupiveteau|
+----------+---------+-----------+
only showing top 10 rows



In [199]:
# Replace every digit of birth_date with '0' (keeping the '-' separators) and alias it as emp_dob.
# Remove the year from birth_date and alias the result as birth_mm_dd.
# A full date would match the pattern r"^\d{4}-\d{2}-\d{2}$".
# Learn more about regular expressions (regex):
# https://www.rexegg.com/regex-quickstart.html

from pyspark.sql.functions import regexp_replace

# Raw strings (r'...') keep the backslashes in the regex patterns intact.
employees.select("birth_date",
                 regexp_replace(col("birth_date"), r'\d', '0').alias("emp_dob"),
                 regexp_replace(col("birth_date"), r'^\d{4}-', '').alias("birth_mm_dd")).show(10)

+----------+----------+-----------+
|birth_date|   emp_dob|birth_mm_dd|
+----------+----------+-----------+
|1953-09-02|0000-00-00|      09-02|
|1964-06-02|0000-00-00|      06-02|
|1959-12-03|0000-00-00|      12-03|
|1954-05-01|0000-00-00|      05-01|
|1955-01-21|0000-00-00|      01-21|
|1953-04-20|0000-00-00|      04-20|
|1957-05-23|0000-00-00|      05-23|
|1958-02-19|0000-00-00|      02-19|
|1952-04-19|0000-00-00|      04-19|
|1963-06-01|0000-00-00|      06-01|
+----------+----------+-----------+
only showing top 10 rows



In [220]:
# Show distinct first names that start with 'Sr' (instr() returns 1 when the match begins at the first character)

from pyspark.sql.functions import instr

employees.selectExpr("first_name", "instr(first_name, 'Sr') as isSr")\
            .where("isSr == 1")\
            .distinct()\
            .show(10)

+-----------+----+
| first_name|isSr|
+-----------+----+
|  Sreenivas|   1|
|   Srinidhi|   1|
|Sreekrishna|   1|
+-----------+----+
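
The filter above keeps only rows where `instr()` returns 1 because `instr()` reports the 1-based position of the first match and 0 when there is no match. The plain-Python stand-in below (the helper `instr` here is a hypothetical imitation built on `str.find`, not the Spark function, and it ignores null handling) sketches those semantics.

```python
def instr(text, substr):
    """Plain-Python stand-in for SQL instr(): 1-based index of the first match, 0 if none."""
    return text.find(substr) + 1

print(instr("Sreenivas", "Sr"))        # 1 -> the name starts with 'Sr'
print(instr("Parto", "rt"))            # 3 -> contains 'rt' but does not start with it
print(instr("Georgi", "geo"))          # 0 -> no match, the comparison is case-sensitive
print(instr("Georgi".lower(), "geo"))  # 1 -> lower-casing first makes the match succeed
```

So comparing the position with `== 1` selects values that start with the pattern, while `> 0` selects values that contain it anywhere.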



In [252]:
# Display the position of 'sri', 'geo' and 'par' in each employee's first name.
# Note: instr() is case-sensitive, so these lower-case patterns do not match names
# such as 'Georgi' or 'Parto' and every position below is 0.
# Wrapping the column in lower() would make the search case-insensitive.

from pyspark.sql.functions import instr, expr

nameLike = ["sri", "geo", "par"]
def name_checker(name, nameLike):
    return instr(name, nameLike)

nameContains = [name_checker(employees.first_name, n) for n in nameLike]
nameContains.append(expr("*"))  # also keep all original columns
employees.select(*nameContains).show(10)
employees.select(*nameContains).select("first_name").show(10)
# To filter on a new column, alias it first, e.g. instr(...).alias("has_sri"), then use where("has_sri > 0").

+----------------------+----------------------+----------------------+------+----------+----------+---------+------+----------+
|instr(first_name, sri)|instr(first_name, geo)|instr(first_name, par)|emp_no|birth_date|first_name|last_name|gender| hire_date|
+----------------------+----------------------+----------------------+------+----------+----------+---------+------+----------+
|                     0|                     0|                     0| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|
|                     0|                     0|                     0| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|
|                     0|                     0|                     0| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|
|                     0|                     0|                     0| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|
|                     0|                     0|                     0| 10005|1955-01-21|   Kyoichi| Mali

#### 3.5 Date and Timestamp Type Manipulation

Dates and timestamps play a crucial role in data modeling and are important attributes for tracking information. The date and timestamp format must be specified correctly for any database or programming language. When reading a schema from a file, date fields can be read as strings and converted to the appropriate date type later. Since every application has its own date formatting, treating dates as strings is the safer approach during schema-on-read.   
* Date: stores only the calendar date. The default format is `yyyy-MM-dd`.
* Timestamp: stores date and time. The default format is `yyyy-MM-dd HH:mm:ss`. Spark uses Java date and timestamp formatting underneath, where `MM` means month and `mm` means minutes. Some timestamp functions work at second precision, so values in milliseconds or microseconds may need to be handled as `long`s.

To use your own date format, refer to the [Java SimpleDateFormat API](https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html).

In [267]:
# Create new tmp_emp DF by adding current date and timestamp from employees DF.

from pyspark.sql.functions import current_date, current_timestamp

tmp_emp = employees.withColumn("current_date", current_date())\
         .withColumn("current_timestamp", current_timestamp())\
         .withColumn("current_date_str_type", current_date().cast("string"))
tmp_emp.show(5, False)

+------+----------+----------+---------+------+----------+------------+-----------------------+---------------------+
|emp_no|birth_date|first_name|last_name|gender|hire_date |current_date|current_timestamp      |current_date_str_type|
+------+----------+----------+---------+------+----------+------------+-----------------------+---------------------+
|10001 |1953-09-02|Georgi    |Facello  |M     |1986-06-26|2019-12-22  |2019-12-22 18:52:14.398|2019-12-22           |
|10002 |1964-06-02|Bezalel   |Simmel   |F     |1985-11-21|2019-12-22  |2019-12-22 18:52:14.398|2019-12-22           |
|10003 |1959-12-03|Parto     |Bamford  |M     |1986-08-28|2019-12-22  |2019-12-22 18:52:14.398|2019-12-22           |
|10004 |1954-05-01|Chirstian |Koblick  |M     |1986-12-01|2019-12-22  |2019-12-22 18:52:14.398|2019-12-22           |
|10005 |1955-01-21|Kyoichi   |Maliniak |M     |1989-09-12|2019-12-22  |2019-12-22 18:52:14.398|2019-12-22           |
+------+----------+----------+---------+------+---------

In [268]:
tmp_emp.printSchema()

root
 |-- emp_no: integer (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: date (nullable = true)
 |-- current_date: date (nullable = false)
 |-- current_timestamp: timestamp (nullable = false)
 |-- current_date_str_type: string (nullable = false)



`to_date()` converts a date string to the date type. If the string doesn't match the expected date format, it returns `null`.

In [272]:
# Convert current_date_str_type column that was stored as string type into date type.

from pyspark.sql.functions import to_date

empid_hire_dt = tmp_emp.select("emp_no", to_date("current_date_str_type").alias("str_curr_dt"))            
empid_hire_dt.printSchema()
empid_hire_dt.show(10)

root
 |-- emp_no: integer (nullable = true)
 |-- str_curr_dt: date (nullable = true)

+------+-----------+
|emp_no|str_curr_dt|
+------+-----------+
| 10001| 2019-12-22|
| 10002| 2019-12-22|
| 10003| 2019-12-22|
| 10004| 2019-12-22|
| 10005| 2019-12-22|
| 10006| 2019-12-22|
| 10007| 2019-12-22|
| 10008| 2019-12-22|
| 10009| 2019-12-22|
| 10010| 2019-12-22|
+------+-----------+
only showing top 10 rows



In [274]:
# Convert string type to date type when date format is wrong

from pyspark.sql.functions import to_date, lit

#2019-20-20 is not a valid date so it will return null value.

employees.select(to_date(lit("2019-12-20")), to_date(lit("2019-20-20"))).show(5)

+---------------------+---------------------+
|to_date('2019-12-20')|to_date('2019-20-20')|
+---------------------+---------------------+
|           2019-12-20|                 null|
|           2019-12-20|                 null|
|           2019-12-20|                 null|
|           2019-12-20|                 null|
|           2019-12-20|                 null|
+---------------------+---------------------+
only showing top 5 rows



In [276]:
# Add and subtract 30 days on hire date for all employees.

from pyspark.sql.functions import date_add, date_sub

employees.select("hire_date", date_add(col("hire_date"), 30), date_sub(col("hire_date"), 30)).show(10)

+----------+-----------------------+-----------------------+
| hire_date|date_add(hire_date, 30)|date_sub(hire_date, 30)|
+----------+-----------------------+-----------------------+
|1986-06-26|             1986-07-26|             1986-05-27|
|1985-11-21|             1985-12-21|             1985-10-22|
|1986-08-28|             1986-09-27|             1986-07-29|
|1986-12-01|             1986-12-31|             1986-11-01|
|1989-09-12|             1989-10-12|             1989-08-13|
|1989-06-02|             1989-07-02|             1989-05-03|
|1989-02-10|             1989-03-12|             1989-01-11|
|1994-09-15|             1994-10-15|             1994-08-16|
|1985-02-18|             1985-03-20|             1985-01-19|
|1989-08-24|             1989-09-23|             1989-07-25|
+----------+-----------------------+-----------------------+
only showing top 10 rows



In [281]:
# Calculate the number of days from hire date until today.

from pyspark.sql.functions import current_date, datediff

employees.select("hire_date", datediff(current_date(), "hire_date").alias("today_hire_day")).show(5)

+----------+--------------+
| hire_date|today_hire_day|
+----------+--------------+
|1986-06-26|         12232|
|1985-11-21|         12449|
|1986-08-28|         12169|
|1986-12-01|         12074|
|1989-09-12|         11058|
+----------+--------------+
only showing top 5 rows
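
The day arithmetic behind `date_add()`, `date_sub()` and `datediff()` can be checked in plain Python. This is only a sketch with the standard `datetime` module, not Spark code. The value of `today` is an assumption taken from the `str_curr_dt` column shown earlier (2019-12-22).

```python
from datetime import date, timedelta

hire_date = date(1986, 6, 26)  # first hire_date in the outputs above
today = date(2019, 12, 22)     # assumed run date, taken from str_curr_dt

# date_add / date_sub shift a date by a whole number of days
plus_30 = hire_date + timedelta(days=30)
minus_30 = hire_date - timedelta(days=30)

# datediff(end, start) counts the days from start to end
days_employed = (today - hire_date).days

print(plus_30, minus_30, days_employed)
```

The three printed values should agree with the first row of the `date_add`, `date_sub` and `datediff` outputs above.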



In [287]:
# Calculate the number of months each employee has been employed.

from pyspark.sql.functions import months_between

# Argument order matters: the result is positive when the first argument is the later date.
employees.select(months_between(current_date(), "hire_date").alias("month_employed")).show(5) # today first: positive
employees.select(months_between("hire_date", current_date()).alias("month_employed")).show(5) # today second: negative

+--------------+
|month_employed|
+--------------+
|  401.87096774|
|  409.03225806|
|  399.80645161|
|  396.67741935|
|  363.32258065|
+--------------+
only showing top 5 rows

+--------------+
|month_employed|
+--------------+
| -401.87096774|
| -409.03225806|
| -399.80645161|
| -396.67741935|
| -363.32258065|
+--------------+
only showing top 5 rows
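
The fractional part of `months_between()` comes from treating leftover days as a fraction of a 31-day month. The function below is a plain-Python sketch of that rule for date-only inputs, not Spark's implementation. The run date of 2019-12-22 is an assumption taken from the `str_curr_dt` column shown earlier.

```python
from calendar import monthrange
from datetime import date


def months_between(end: date, start: date) -> float:
    """Approximate Spark's months_between for date-only inputs."""
    whole_months = (end.year - start.year) * 12 + (end.month - start.month)
    end_is_last = end.day == monthrange(end.year, end.month)[1]
    start_is_last = start.day == monthrange(start.year, start.month)[1]
    # Same day of month, or both on the last day of their month: whole number.
    if end.day == start.day or (end_is_last and start_is_last):
        return float(whole_months)
    # Otherwise the leftover days count as a fraction of a 31-day month.
    return round(whole_months + (end.day - start.day) / 31, 8)


today = date(2019, 12, 22)  # assumed run date
print(months_between(today, date(1986, 6, 26)))
print(months_between(date(1986, 6, 26), today))
```

For the first employee this gives 402 whole months minus 4/31 of a month. Swapping the arguments flips the sign, as in the two outputs above.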



A custom date format can be specified with a SimpleDateFormat-style pattern. Refer to the link shown earlier for the available pattern letters.

In [295]:
from pyspark.sql.functions import to_date

ownDateFormat = "yyyy-MM-dd"
employees.select(to_date(lit("2019-12-20"), ownDateFormat).alias("correct_date"), # correct date format
                 to_date(lit("2019-20-20"), ownDateFormat).alias("incorrect_date")  # incorrect date format, month 20 doesn't exist
                ).show(5)

+------------+--------------+
|correct_date|incorrect_date|
+------------+--------------+
|  2019-12-20|          null|
|  2019-12-20|          null|
|  2019-12-20|          null|
|  2019-12-20|          null|
|  2019-12-20|          null|
+------------+--------------+
only showing top 5 rows
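
The "return `null` when the string does not match" behaviour can be mimicked in plain Python. This is a sketch only. `parse_date_or_none` is a hypothetical helper, and `%Y-%m-%d` is the `strptime` counterpart of the `yyyy-MM-dd` pattern used above.

```python
from datetime import date, datetime
from typing import Optional


def parse_date_or_none(text: str, fmt: str = "%Y-%m-%d") -> Optional[date]:
    """Return a date when text matches fmt, otherwise None (like a null)."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        # Month 20 does not exist, so parsing fails and we return None.
        return None


print(parse_date_or_none("2019-12-20"))
print(parse_date_or_none("2019-20-20"))
```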



`to_timestamp()` converts a date stored as a string into the timestamp type. If the string does not match the specified format, it returns `null`.

In [299]:
# Convert hire date to timestamp type

from pyspark.sql.functions import to_timestamp

ownDateFormat = "yyyy-MM-dd"
employees.select("hire_date",
                 to_timestamp(col("hire_date"), ownDateFormat).alias("hire_datetime") # hours, minutes and seconds are added as 00:00:00.
                ).show(5) 

+----------+-------------------+
| hire_date|      hire_datetime|
+----------+-------------------+
|1986-06-26|1986-06-26 00:00:00|
|1985-11-21|1985-11-21 00:00:00|
|1986-08-28|1986-08-28 00:00:00|
|1986-12-01|1986-12-01 00:00:00|
|1989-09-12|1989-09-12 00:00:00|
+----------+-------------------+
only showing top 5 rows



Comparison operators can be used to compare two dates. A string literal can also be used as the date value in a comparison.

In [301]:
# Compare two dates using comparison operators to filter rows.

employees.where(col("hire_date") > current_date()).show(10) # keep rows where hire_date is later than today.
employees.where(col("hire_date") < current_date()).show(10) # keep rows where hire_date is earlier than today.

+------+----------+----------+---------+------+---------+
|emp_no|birth_date|first_name|last_name|gender|hire_date|
+------+----------+----------+---------+------+---------+
+------+----------+----------+---------+------+---------+

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|
| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12|
| 10006|1953-04-20|    Anneke|  Preusig|     F|1989-06-02|
| 10007|1957-05-23|   Tzvetan|Zielinski|     F|1989-02-10|
| 10008|1958-02-19|    Saniya| Kalloufi|     M|1994-09-15|
| 10009|1952-04-19|    Sumant|     Peac|     F|1985-02-18|
| 10010|1963-06-01| Duangkaew| Piveteau|     F|1989-08-24|


In [313]:
# Compare dates against string literals using comparison operators to filter rows.

# display employees hired after 2019-09-15
employees.where(col("hire_date") > "2019-09-15").show(10)

# display employees hired on or after Jan 1st 1990.
employees.where(col("hire_date") >= "1990-01-01").show(3) 

# display employees hired between 1980-01-01 and 1990-01-01
employees.where(col("hire_date") >= "1980-01-01")\
    .where(col("hire_date") <= "1990-01-01")\
    .show(3)

+------+----------+----------+---------+------+---------+
|emp_no|birth_date|first_name|last_name|gender|hire_date|
+------+----------+----------+---------+------+---------+
+------+----------+----------+---------+------+---------+

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10008|1958-02-19|    Saniya| Kalloufi|     M|1994-09-15|
| 10011|1953-11-07|      Mary|    Sluis|     F|1990-01-22|
| 10012|1960-10-04|  Patricio|Bridgland|     M|1992-12-18|
+------+----------+----------+---------+------+----------+
only showing top 3 rows

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|
| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|
+------+----------+----------+---------+------+----------+
only showing top 3 rows

In [328]:
# display employees hired between 1980-01-01 and 1990-01-01 (both dates inclusive) using the between method

employees.where(employees.hire_date.between("1980-01-01", "1990-01-01"))\
    .show(20)

+------+----------+----------+----------+------+----------+
|emp_no|birth_date|first_name| last_name|gender| hire_date|
+------+----------+----------+----------+------+----------+
| 10001|1953-09-02|    Georgi|   Facello|     M|1986-06-26|
| 10002|1964-06-02|   Bezalel|    Simmel|     F|1985-11-21|
| 10003|1959-12-03|     Parto|   Bamford|     M|1986-08-28|
| 10004|1954-05-01| Chirstian|   Koblick|     M|1986-12-01|
| 10005|1955-01-21|   Kyoichi|  Maliniak|     M|1989-09-12|
| 10006|1953-04-20|    Anneke|   Preusig|     F|1989-06-02|
| 10007|1957-05-23|   Tzvetan| Zielinski|     F|1989-02-10|
| 10009|1952-04-19|    Sumant|      Peac|     F|1985-02-18|
| 10010|1963-06-01| Duangkaew|  Piveteau|     F|1989-08-24|
| 10013|1963-06-07| Eberhardt|    Terkki|     M|1985-10-20|
| 10014|1956-02-12|     Berni|     Genin|     M|1987-03-11|
| 10015|1959-08-19|  Guoxiang| Nooteboom|     M|1987-07-02|
| 10018|1954-06-19|  Kazuhide|      Peha|     F|1987-04-03|
| 10021|1960-02-20|     Ramzi|      Erde

#### 3.6 Complex Type Manipulation

Complex types include arrays, maps, and structs. Their rough Python counterparts are a list, a dict, and a record with named fields (such as a named tuple). In Spark, a complex type can be stored as a single column value. For example, a JSON object can be stored in a map so that all of its attributes sit in one column value. The most important skill is retrieving data from complex types, so the examples below cover each type in turn.
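
As a rough orientation, the three complex types can be pictured with plain Python values. This is only an analogy, not Spark code. `EmpName` is a hypothetical name chosen for illustration.

```python
from collections import namedtuple

# Array: an ordered collection of elements of one type
create_date = "1986-06-26".split("-")

# Map: key/value pairs, looked up by key
emp_map = {10001: "Georgi"}

# Struct: a fixed set of named fields (EmpName is a hypothetical name)
EmpName = namedtuple("EmpName", ["first_name", "last_name"])
emp_struct = EmpName("Georgi", "Facello")

print(create_date[0], emp_map[10001], emp_struct.first_name)
```

Arrays are read by position, maps by key, and structs by field name. The Spark examples in the following subsections use the same three access styles.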

#### 3.6.1 Arrays Type  

Step 1: Data preparation: Let's create a new array type column in empDF from the employees DataFrame by splitting hire_date on '-'. Skip this step if an array type column already exists.   
Step 2: We'll apply some functions to get results from the array type column.

In [331]:
# Create new Column with Array type 

from pyspark.sql.functions import split

# create empDF DataFrame with create_date attribute
empDF = employees.select("hire_date", split(col("hire_date"), "-").alias("create_date"))
empDF.show(10)
empDF.printSchema()

+----------+--------------+
| hire_date|   create_date|
+----------+--------------+
|1986-06-26|[1986, 06, 26]|
|1985-11-21|[1985, 11, 21]|
|1986-08-28|[1986, 08, 28]|
|1986-12-01|[1986, 12, 01]|
|1989-09-12|[1989, 09, 12]|
|1989-06-02|[1989, 06, 02]|
|1989-02-10|[1989, 02, 10]|
|1994-09-15|[1994, 09, 15]|
|1985-02-18|[1985, 02, 18]|
|1989-08-24|[1989, 08, 24]|
+----------+--------------+
only showing top 10 rows

root
 |-- hire_date: date (nullable = true)
 |-- create_date: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [332]:
# Get first value from create_date

empDF.selectExpr("create_date[0]").show(10)

+--------------+
|create_date[0]|
+--------------+
|          1986|
|          1985|
|          1986|
|          1986|
|          1989|
|          1989|
|          1989|
|          1994|
|          1985|
|          1989|
+--------------+
only showing top 10 rows



In [333]:
# Find the size of create_date

from pyspark.sql.functions import size

empDF.select(size("create_date")).show(10)

+-----------------+
|size(create_date)|
+-----------------+
|                3|
|                3|
|                3|
|                3|
|                3|
|                3|
|                3|
|                3|
|                3|
|                3|
+-----------------+
only showing top 10 rows



In [342]:
# Check if create_date contains certain value

from pyspark.sql.functions import array_contains

empDF.select("create_date",array_contains("create_date", '2019').alias("is2019"),\
      array_contains("create_date", '1989').alias("is1989"))\
     .show(20)

+--------------+------+------+
|   create_date|is2019|is1989|
+--------------+------+------+
|[1986, 06, 26]| false| false|
|[1985, 11, 21]| false| false|
|[1986, 08, 28]| false| false|
|[1986, 12, 01]| false| false|
|[1989, 09, 12]| false|  true|
|[1989, 06, 02]| false|  true|
|[1989, 02, 10]| false|  true|
|[1994, 09, 15]| false| false|
|[1985, 02, 18]| false| false|
|[1989, 08, 24]| false|  true|
|[1990, 01, 22]| false| false|
|[1992, 12, 18]| false| false|
|[1985, 10, 20]| false| false|
|[1987, 03, 11]| false| false|
|[1987, 07, 02]| false| false|
|[1995, 01, 27]| false| false|
|[1993, 08, 03]| false| false|
|[1987, 04, 03]| false| false|
|[1999, 04, 30]| false| false|
|[1991, 01, 26]| false| false|
+--------------+------+------+
only showing top 20 rows



In [344]:
# Split create_date into one record per element of the create_date array.
# To learn more about the explode function, see the Hive documentation linked below.
# https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView

from pyspark.sql.functions import explode

empDF.select("create_date", explode("create_date").alias("valueExploded")).show(10)

+--------------+-------------+
|   create_date|valueExploded|
+--------------+-------------+
|[1986, 06, 26]|         1986|
|[1986, 06, 26]|           06|
|[1986, 06, 26]|           26|
|[1985, 11, 21]|         1985|
|[1985, 11, 21]|           11|
|[1985, 11, 21]|           21|
|[1986, 08, 28]|         1986|
|[1986, 08, 28]|           08|
|[1986, 08, 28]|           28|
|[1986, 12, 01]|         1986|
+--------------+-------------+
only showing top 10 rows
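
What `explode()` does to each row can be pictured with a nested loop in plain Python. This is a sketch over a list of dicts, not Spark code. The two sample rows are copied from the output above.

```python
rows = [
    {"create_date": ["1986", "06", "26"]},
    {"create_date": ["1985", "11", "21"]},
]

# One output row per array element; the other columns are repeated as-is
exploded = [
    {"create_date": row["create_date"], "valueExploded": value}
    for row in rows
    for value in row["create_date"]
]

for row in exploded:
    print(row)
```

Two input rows with three elements each become six output rows, which is why `show(10)` above reaches only the fourth employee.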



#### 3.6.2 Maps Type

A map is similar to a dictionary in Python.

Step 1: Data preparation: Let's create a new map type column in empDF from the employees DataFrame, with emp_no as the key and first_name as its value. Skip this step if a map type column already exists.   
Step 2: We'll apply some functions to get results from the map type column.

In [347]:
# Create Map Column 

from pyspark.sql.functions import create_map

# create empDF DataFrame with the empMap attribute
empDF = employees.select("emp_no", "first_name", create_map(col("emp_no"), col("first_name")).alias("empMap"))
empDF.show(10)
empDF.printSchema()

+------+----------+--------------------+
|emp_no|first_name|              empMap|
+------+----------+--------------------+
| 10001|    Georgi|   [10001 -> Georgi]|
| 10002|   Bezalel|  [10002 -> Bezalel]|
| 10003|     Parto|    [10003 -> Parto]|
| 10004| Chirstian|[10004 -> Chirstian]|
| 10005|   Kyoichi|  [10005 -> Kyoichi]|
| 10006|    Anneke|   [10006 -> Anneke]|
| 10007|   Tzvetan|  [10007 -> Tzvetan]|
| 10008|    Saniya|   [10008 -> Saniya]|
| 10009|    Sumant|   [10009 -> Sumant]|
| 10010| Duangkaew|[10010 -> Duangkaew]|
+------+----------+--------------------+
only showing top 10 rows

root
 |-- emp_no: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- empMap: map (nullable = false)
 |    |-- key: integer
 |    |-- value: string (valueContainsNull = true)



In [352]:
# Access value from map column through its key.

empDF.selectExpr("emp_no", "empMap", "empMap[10001]").show(10) # the key is an integer, so it is written without quotes; a quoted key can also be used.

# If the key is not present in a record's map, the result is null.

+------+--------------------+-------------+
|emp_no|              empMap|empMap[10001]|
+------+--------------------+-------------+
| 10001|   [10001 -> Georgi]|       Georgi|
| 10002|  [10002 -> Bezalel]|         null|
| 10003|    [10003 -> Parto]|         null|
| 10004|[10004 -> Chirstian]|         null|
| 10005|  [10005 -> Kyoichi]|         null|
| 10006|   [10006 -> Anneke]|         null|
| 10007|  [10007 -> Tzvetan]|         null|
| 10008|   [10008 -> Saniya]|         null|
| 10009|   [10009 -> Sumant]|         null|
| 10010|[10010 -> Duangkaew]|         null|
+------+--------------------+-------------+
only showing top 10 rows



In [356]:
# Explode map column with new key and value column.

empDF.selectExpr("emp_no", "explode(empMap)").show(10)

+------+-----+---------+
|emp_no|  key|    value|
+------+-----+---------+
| 10001|10001|   Georgi|
| 10002|10002|  Bezalel|
| 10003|10003|    Parto|
| 10004|10004|Chirstian|
| 10005|10005|  Kyoichi|
| 10006|10006|   Anneke|
| 10007|10007|  Tzvetan|
| 10008|10008|   Saniya|
| 10009|10009|   Sumant|
| 10010|10010|Duangkaew|
+------+-----+---------+
only showing top 10 rows



#### 3.6.3 Structs Type   

A struct can be thought of as a DataFrame within a DataFrame: a single column that holds several named fields. A struct can be created by wrapping column names in parentheses, much like declaring a tuple in Python, e.g. `struct(column_1, column_2)`.

Step 1: Data preparation: Let's create a new struct type column in empDF from the employees DataFrame, with first_name and last_name as its fields. Skip this step if a struct type column already exists.   
Step 2: We'll apply some functions to get results from the struct type column.

In [359]:
# Create Struct Column 

from pyspark.sql.functions import struct

empDF = employees.select(struct("first_name", "last_name").alias("empStruct"))
empDF.show(10)
empDF.printSchema()

+--------------------+
|           empStruct|
+--------------------+
|   [Georgi, Facello]|
|   [Bezalel, Simmel]|
|    [Parto, Bamford]|
|[Chirstian, Koblick]|
| [Kyoichi, Maliniak]|
|   [Anneke, Preusig]|
|[Tzvetan, Zielinski]|
|  [Saniya, Kalloufi]|
|      [Sumant, Peac]|
|[Duangkaew, Pivet...|
+--------------------+
only showing top 10 rows

root
 |-- empStruct: struct (nullable = false)
 |    |-- first_name: string (nullable = true)
 |    |-- last_name: string (nullable = true)



In [373]:
# Get value from Struct through column name

empDF.select("empStruct",\
             "empStruct.first_name",\
             "empStruct.last_name")\
             .show(10) # show first_name and last_name value from struct type

+--------------------+----------+---------+
|           empStruct|first_name|last_name|
+--------------------+----------+---------+
|   [Georgi, Facello]|    Georgi|  Facello|
|   [Bezalel, Simmel]|   Bezalel|   Simmel|
|    [Parto, Bamford]|     Parto|  Bamford|
|[Chirstian, Koblick]| Chirstian|  Koblick|
| [Kyoichi, Maliniak]|   Kyoichi| Maliniak|
|   [Anneke, Preusig]|    Anneke|  Preusig|
|[Tzvetan, Zielinski]|   Tzvetan|Zielinski|
|  [Saniya, Kalloufi]|    Saniya| Kalloufi|
|      [Sumant, Peac]|    Sumant|     Peac|
|[Duangkaew, Pivet...| Duangkaew| Piveteau|
+--------------------+----------+---------+
only showing top 10 rows



In [374]:
# Get values from Struct through column name 

# It uses getField method

empDF.select("empStruct",\
            col("empStruct").getField("first_name"),\
            col("empStruct").getField("last_name"))\
            .show(10)

+--------------------+--------------------+-------------------+
|           empStruct|empStruct.first_name|empStruct.last_name|
+--------------------+--------------------+-------------------+
|   [Georgi, Facello]|              Georgi|            Facello|
|   [Bezalel, Simmel]|             Bezalel|             Simmel|
|    [Parto, Bamford]|               Parto|            Bamford|
|[Chirstian, Koblick]|           Chirstian|            Koblick|
| [Kyoichi, Maliniak]|             Kyoichi|           Maliniak|
|   [Anneke, Preusig]|              Anneke|            Preusig|
|[Tzvetan, Zielinski]|             Tzvetan|          Zielinski|
|  [Saniya, Kalloufi]|              Saniya|           Kalloufi|
|      [Sumant, Peac]|              Sumant|               Peac|
|[Duangkaew, Pivet...|           Duangkaew|           Piveteau|
+--------------------+--------------------+-------------------+
only showing top 10 rows



In [382]:
# Get all Struct values using '*'

empDF.selectExpr("empStruct",
    "empStruct.*").show(10)

+--------------------+----------+---------+
|           empStruct|first_name|last_name|
+--------------------+----------+---------+
|   [Georgi, Facello]|    Georgi|  Facello|
|   [Bezalel, Simmel]|   Bezalel|   Simmel|
|    [Parto, Bamford]|     Parto|  Bamford|
|[Chirstian, Koblick]| Chirstian|  Koblick|
| [Kyoichi, Maliniak]|   Kyoichi| Maliniak|
|   [Anneke, Preusig]|    Anneke|  Preusig|
|[Tzvetan, Zielinski]|   Tzvetan|Zielinski|
|  [Saniya, Kalloufi]|    Saniya| Kalloufi|
|      [Sumant, Peac]|    Sumant|     Peac|
|[Duangkaew, Pivet...| Duangkaew| Piveteau|
+--------------------+----------+---------+
only showing top 10 rows



#### 3.7 Handling Nulls

Null values need careful handling in any programming language. When data is loaded, a value that does not match the defined schema is typically shown as null.   

Null values in Spark can be:-   
* dropped explicitly
* filled with some value, either globally or column by column.

Several functions can be used for handling null values, such as `coalesce()`, `ifnull()`, `nullif()`, `nvl()`, and `nvl2()`. All of them are available as Spark SQL functions, for example through `expr()`.
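
The semantics of these functions are easy to state in plain Python, with `None` standing in for null. The definitions below are a sketch for intuition only, not the Spark implementations.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    return next((v for v in values if v is not None), None)


def nvl(value, default):
    """Return default when value is None (ifnull behaves the same way)."""
    return default if value is None else value


def nvl2(value, if_not_null, if_null):
    """Return if_not_null when value is not None, otherwise if_null."""
    return if_not_null if value is not None else if_null


def nullif(a, b):
    """Return None when a equals b, otherwise a."""
    return None if a == b else a


print(coalesce(None, None, "No"), nvl(None, "No"), nvl2("Yes", 1, 0), nullif("x", "x"))
```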

In [423]:
# coalesce() returns its first non-null argument, so it can fall back to a second column or a literal value
# when the first column is null.

from pyspark.sql.functions import coalesce, col, current_date, expr, floor, lit, months_between

# Create a new DataFrame tmpDF that contains all the columns from employees plus the calculated columns below.
employees.printSchema()

# assign value
total_month = 12
filter_age = 60

# emp_current_age stores the employee's current age in years
# isabove50 stores a boolean: whether the age is at least filter_age (60, despite the column name)
# age_above_60 stores 'Yes' when isabove50 is true, else null

tmpDF = employees.withColumn("emp_current_age",\
                  floor(months_between(current_date(), "birth_date")/total_month))\
                  .withColumn("isabove50", col("emp_current_age") >= filter_age)\
                  .withColumn("age_above_60", expr("case when isabove50 then 'Yes' else null end"))
tmpDF.show(20)

# coalesce to replace the null values with the value of another column
tmpDF = tmpDF.withColumn("ready_to_retire", coalesce("age_above_60", col("isabove50").cast("string")))
tmpDF.show(10)

# coalesce to replace the null values with the literal 'No'
tmpDF = tmpDF.withColumn("ready_to_retire", coalesce("age_above_60", lit("No")))
tmpDF.show(10)

root
 |-- emp_no: integer (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: date (nullable = true)

+------+----------+----------+-----------+------+----------+---------------+---------+------------+
|emp_no|birth_date|first_name|  last_name|gender| hire_date|emp_current_age|isabove50|age_above_60|
+------+----------+----------+-----------+------+----------+---------------+---------+------------+
| 10001|1953-09-02|    Georgi|    Facello|     M|1986-06-26|             66|     true|         Yes|
| 10002|1964-06-02|   Bezalel|     Simmel|     F|1985-11-21|             55|    false|        null|
| 10003|1959-12-03|     Parto|    Bamford|     M|1986-08-28|             60|     true|         Yes|
| 10004|1954-05-01| Chirstian|    Koblick|     M|1986-12-01|             65|     true|         Yes|
| 10005|1955-01-21|   Kyoichi|   Maliniak|     M|1989-0

#### 3.7.1 Droping Null Values

`drop()` (called as `DataFrame.na.drop()`) removes rows that contain null values. Called without arguments, it drops every row that has at least one null value. Its arguments are:-   
`any`: e.g. `drop("any")`. Drops a row if any of its values is null.        
`all`: e.g. `drop("all")`. Drops a row only if all of its values are null.     
`any` or `all` followed by a list of columns: e.g. `drop("all", subset=["first_name", "last_name"])`. Applies the `any` or `all` rule to the specified columns only.   

It is mostly used to clean the final DataFrame after joining multiple DataFrames, since left, right, and full joins can introduce nulls.
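
The difference between `any`, `all`, and `subset` can be sketched in plain Python over a list of dicts, with `None` standing in for null. `na_drop` is a hypothetical helper written for illustration and is not Spark's implementation. The third sample row is invented to show the `all` case.

```python
def na_drop(rows, how="any", subset=None):
    """Return the rows that na.drop(how, subset=...) would keep."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        nulls = [row[c] is None for c in columns]
        # "any": drop when at least one checked value is null
        # "all": drop only when every checked value is null
        drop = any(nulls) if how == "any" else all(nulls)
        if not drop:
            kept.append(row)
    return kept


rows = [
    {"first_name": "Georgi", "age_above_60": "Yes"},
    {"first_name": "Bezalel", "age_above_60": None},
    {"first_name": None, "age_above_60": None},  # invented all-null row
]

print(len(na_drop(rows)))                              # rows without any null
print(len(na_drop(rows, "all")))                       # rows that are not entirely null
print(len(na_drop(rows, "any", subset=["first_name"])))  # only first_name is checked
```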

In [426]:
# Drop all records that have null values in tmpDF and store the result in emp_above60DF.

emp_above60DF = tmpDF.na.drop()
emp_above60DF.show(10)

+------+----------+----------+---------+------+----------+---------------+---------+------------+---------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_current_age|isabove50|age_above_60|ready_to_retire|
+------+----------+----------+---------+------+----------+---------------+---------+------------+---------------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|             66|     true|         Yes|            Yes|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|             60|     true|         Yes|            Yes|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|             65|     true|         Yes|            Yes|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12|             64|     true|         Yes|            Yes|
| 10006|1953-04-20|    Anneke|  Preusig|     F|1989-06-02|             66|     true|         Yes|            Yes|
| 10007|1957-05-23|   Tzvetan|Zielinski|     F|1989-02-10|             62|     true|    

In [427]:
# Drop all records that have a null value in any column of tmpDF using the 'any' parameter and store the result in empNoNullDF.
# 'any' drops every record that has at least one null value in the DataFrame.
# Reason to use 'any': sometimes several columns have null values that are not useful during analysis.

empNoNullDF = tmpDF.na.drop("any")
empNoNullDF.show(10)

+------+----------+----------+---------+------+----------+---------------+---------+------------+---------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_current_age|isabove50|age_above_60|ready_to_retire|
+------+----------+----------+---------+------+----------+---------------+---------+------------+---------------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|             66|     true|         Yes|            Yes|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|             60|     true|         Yes|            Yes|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|             65|     true|         Yes|            Yes|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12|             64|     true|         Yes|            Yes|
| 10006|1953-04-20|    Anneke|  Preusig|     F|1989-06-02|             66|     true|         Yes|            Yes|
| 10007|1957-05-23|   Tzvetan|Zielinski|     F|1989-02-10|             62|     true|    

In [428]:
# Drop all records in tmpDF whose values are null in every column, using the 'all' parameter.
# For example: because of a data quality issue, an incoming record may be stored entirely as null values,
# so we need to drop those bad records. In such a case, use 'all' in the drop() method to delete records
# that have null values in every field.


# Drop records where every field value is null. None of the records has null values in all fields, so this
# won't drop any records.

empInvalidDF = tmpDF.na.drop("all")
empInvalidDF.show(10)

+------+----------+----------+---------+------+----------+---------------+---------+------------+---------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_current_age|isabove50|age_above_60|ready_to_retire|
+------+----------+----------+---------+------+----------+---------------+---------+------------+---------------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|             66|     true|         Yes|            Yes|
| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|             55|    false|        null|             No|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|             60|     true|         Yes|            Yes|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|             65|     true|         Yes|            Yes|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12|             64|     true|         Yes|            Yes|
| 10006|1953-04-20|    Anneke|  Preusig|     F|1989-06-02|             66|     true|    

In [442]:
# Drop all records in tmpDF that have a null value in the first_name or age_above_60 column, using 'any' with subset.

empAbove60DF = tmpDF.na.drop("any", subset=["first_name", "age_above_60"])
empAbove60DF.show(10)

+------+----------+----------+---------+------+----------+---------------+---------+------------+---------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_current_age|isabove50|age_above_60|ready_to_retire|
+------+----------+----------+---------+------+----------+---------------+---------+------------+---------------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|             66|     true|         Yes|            Yes|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|             60|     true|         Yes|            Yes|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|             65|     true|         Yes|            Yes|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12|             64|     true|         Yes|            Yes|
| 10006|1953-04-20|    Anneke|  Preusig|     F|1989-06-02|             66|     true|         Yes|            Yes|
| 10007|1957-05-23|   Tzvetan|Zielinski|     F|1989-02-10|             62|     true|    

#### 3.7.2 Filling Null Values

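`fill()` (called as `DataFrame.na.fill()`) replaces null values in one or more columns with a user-defined value. The value can be an int, float, string, or boolean, and it is applied only to columns of a matching data type. A `dict` that maps column names to replacement values can be used to fill several columns at once.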
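
The dict form of `fill()` can be sketched in plain Python over a list of dicts, with `None` standing in for null. `na_fill` is a hypothetical helper for illustration. It ignores the data-type matching that Spark performs and only shows the per-column replacement.

```python
def na_fill(rows, replacements):
    """Replace None with replacements[column]; unlisted columns stay as they are."""
    return [
        {
            name: (replacements.get(name) if value is None else value)
            for name, value in row.items()
        }
        for row in rows
    ]


rows = [
    {"first_name": "Georgi", "age_above_60": "Yes"},
    {"first_name": "Bezalel", "age_above_60": None},
]

filled = na_fill(rows, {"first_name": "XXXX", "age_above_60": "Wait till 60"})
for row in filled:
    print(row)
```

Only values that were null are replaced. The `first_name` entry in the dict has no effect here because that column has no nulls.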

In [452]:
# Fill null values in the string columns of tmpDF with "Still Not Above 60".
# Only age_above_60 contains nulls, so its null values are replaced with the literal "Still Not Above 60".

empFillNullDF = tmpDF.na.fill("Still Not Above 60")
empFillNullDF.selectExpr("isabove50", "age_above_60", "ready_to_retire").show(20)

+---------+------------------+---------------+
|isabove50|      age_above_60|ready_to_retire|
+---------+------------------+---------------+
|     true|               Yes|            Yes|
|    false|Still Not Above 60|             No|
|     true|               Yes|            Yes|
|     true|               Yes|            Yes|
|     true|               Yes|            Yes|
|     true|               Yes|            Yes|
|     true|               Yes|            Yes|
|     true|               Yes|            Yes|
|     true|               Yes|            Yes|
|    false|Still Not Above 60|             No|
|     true|               Yes|            Yes|
|    false|Still Not Above 60|             No|
|    false|Still Not Above 60|             No|
|     true|               Yes|            Yes|
|     true|               Yes|            Yes|
|    false|Still Not Above 60|             No|
|     true|               Yes|            Yes|
|     true|               Yes|            Yes|
|     true|  

In [455]:
# Fill nulls in the first_name, birth_date and age_above_60 columns of tmpDF with
# "XXXX", "0000-00-00" and "Wait till 60" respectively, using the
# "null_column_dict" dict as input.

# Check the output for 'age_above_60', since it is the only column that has null values

null_column_dict = {"first_name": "XXXX", "birth_date": "0000-00-00", "age_above_60": "Wait till 60"}
empFillNullWithDictDF = tmpDF.na.fill(null_column_dict)
empFillNullWithDictDF.show(10)

+------+----------+----------+---------+------+----------+---------------+---------+------------+---------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_current_age|isabove50|age_above_60|ready_to_retire|
+------+----------+----------+---------+------+----------+---------------+---------+------------+---------------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|             66|     true|         Yes|            Yes|
| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|             55|    false|Wait till 60|             No|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|             60|     true|         Yes|            Yes|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|             65|     true|         Yes|            Yes|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12|             64|     true|         Yes|            Yes|
| 10006|1953-04-20|    Anneke|  Preusig|     F|1989-06-02|             66|     true|    

#### 3.7.3 Filtering Null Values

Null values can be selected with the `isNull()` method. It returns true when the column value is null, so filtering on it keeps only the records that have null values.   

Non-null values can be selected with the `isNotNull()` method, which does the opposite of `isNull()`: it returns true when the column value is not null, so filtering on it keeps only the records that have non-null values.

In [470]:
from pyspark.sql.functions import col

# show boolean value for isNull() and isNotNull() method
tmpDF.select("emp_current_age", "isabove50", col("age_above_60").isNull(),col("age_above_60").isNotNull()).\
        show(10)

# keep only rows where age_above_60 is null
tmpDF.select("*").where(col("age_above_60").isNull()).show(10)

# keep only rows where age_above_60 is not null
tmpDF.select("*").where(col("age_above_60").isNotNull()).show(10)

+---------------+---------+----------------------+--------------------------+
|emp_current_age|isabove50|(age_above_60 IS NULL)|(age_above_60 IS NOT NULL)|
+---------------+---------+----------------------+--------------------------+
|             66|     true|                 false|                      true|
|             55|    false|                  true|                     false|
|             60|     true|                 false|                      true|
|             65|     true|                 false|                      true|
|             64|     true|                 false|                      true|
|             66|     true|                 false|                      true|
|             62|     true|                 false|                      true|
|             61|     true|                 false|                      true|
|             67|     true|                 false|                      true|
|             56|    false|                  true|              

#### 3.8 User Defined Functions

User Defined Functions (UDFs) are custom functions for manipulating and transforming record values. If Spark does not provide a built-in function that solves the business logic or problem at hand, we need to write our own function, known as a UDF. Spark supports UDFs written in several languages, such as Java, Python, and Scala. Java and Scala UDFs perform better than Python UDFs because they run inside the JVM, whereas a Python UDF requires the data to be serialized and passed to a Python worker process. For this reason, the best practice is to write the UDF in Scala and call it from Python. A UDF can take one or more columns as input parameters and is used like any native function. A function needs to be registered before it can be used. By default, it is registered as a temporary function that is specific to the current SparkSession, but it can also be registered permanently.

We'll create an `increase_ten_percent` UDF in both Python and Scala that adds 10% to the current salary, then register the function and apply it to a DataFrame to calculate a new column.

In [513]:
# Create, register, and call UDF in Python

# Add 10% to current salary 

# Create increase_ten_percent UDF
def increase_ten_percent(amount):
    return float((amount * 0.10) + amount)

# test function 
sal_1  = increase_ten_percent(20)
print(sal_1)  # must return 22.0
sal_2  = increase_ten_percent(10)
print(sal_2)  # must return 11.0


from pyspark.sql.functions import col, udf
# Register increase_ten_percent UDF
# (no return type is given, so Spark defaults to StringType)
increase_ten_percent_udf = udf(increase_ten_percent)

# Call UDF
salaries.select("*", "salary" ,increase_ten_percent_udf(col("salary"))\
         .alias("increase_salary")).show(5)

22.0
11.0
+------+------+----------+----------+------+---------------+
|emp_no|salary| from_date|   to_date|salary|increase_salary|
+------+------+----------+----------+------+---------------+
| 10001| 60117|1986-06-26|1987-06-26| 60117|        66128.7|
| 10001| 62102|1987-06-26|1988-06-25| 62102|        68312.2|
| 10001| 66074|1988-06-25|1989-06-25| 66074|        72681.4|
| 10001| 66596|1989-06-25|1990-06-25| 66596|        73255.6|
| 10001| 66961|1990-06-25|1991-06-25| 66961|        73657.1|
+------+------+----------+----------+------+---------------+
only showing top 5 rows



**Assignment**: Create an `increase_ten_percent` UDF in both Python and Scala with the following features:   
* Add 10% to the existing salary if the employee has worked for more than 5 years.
* The salary field must be an integer or a long.
* The hire date must be a string or a date in 'yyyy-mm-dd' format.
* The function must check both parameters for null values and return 0 if a null is found.
* Register the function and apply it to a DataFrame to calculate a new column, bonus_salary.

In [514]:
# Create, register, and call UDF in Python

# Add 10% to the current salary if the hire date is more than 5 years before the current date.
from datetime import datetime


# Create increase_ten_percent UDF
def increase_ten_percent(amount, from_date):
    hire_threshold = 5
    # return 0 if either parameter is null
    if amount is None or from_date is None:
        return 0.0
    # @todo: check date pattern    
    # check date instance
    if isinstance(from_date, str):    
        from_date = datetime.strptime(from_date, '%Y-%m-%d')    
    # get today
    today = datetime.today().date()
    # get diff year (calendar years only; month and day are ignored)
    diff_year = today.year - from_date.year
    if diff_year > hire_threshold:
        return float((amount * 0.10) + amount)
    else:
        return float(amount)

# test function 
sal_1  = increase_ten_percent(20, "2010-01-01")
print(sal_1)  # must return 22.0
sal_2  = increase_ten_percent(20, "2019-01-01")
print(sal_2)  # returned 20.0 when this cell was run; the result depends on today's date


from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType
# Register increase_ten_percent UDF; the function returns a float, so declare DoubleType
increase_ten_percent_udf = udf(increase_ten_percent, DoubleType())

# Call UDF
salaries.select("*", increase_ten_percent_udf(col("salary"), col("from_date"))\
         .alias("bonus_salary")).show(5)

22.0
20.0
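
The assignment's validation rules can be sketched in plain Python as follows. This is only one possible approach, not the notebook's own solution: the helper names `years_of_service` and `bonus_salary`, and the optional `today` parameter (added so that results can be checked against a fixed date), are assumptions made for illustration. Unlike a plain difference of calendar years, the sketch counts only completed years of service.

```python
from datetime import date, datetime

HIRE_THRESHOLD_YEARS = 5  # same threshold as the assignment


def years_of_service(hired, today=None):
    """Return the number of completed years between `hired` and `today`."""
    today = today or date.today()
    years = today.year - hired.year
    # Subtract one year if this year's anniversary has not been reached yet.
    if (today.month, today.day) < (hired.month, hired.day):
        years -= 1
    return years


def bonus_salary(amount, hired, today=None):
    """Add 10% to `amount` after more than 5 completed years of service.

    Returns 0.0 when either argument is null, as the assignment requires.
    """
    if amount is None or hired is None:
        return 0.0
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int):
        raise TypeError("salary must be an integer")
    if isinstance(hired, str):
        # strptime raises ValueError if the string does not parse as year-month-day.
        hired = datetime.strptime(hired, "%Y-%m-%d").date()
    elif isinstance(hired, datetime):
        hired = hired.date()
    elif not isinstance(hired, date):
        raise TypeError("hire date must be a string or a date")
    if years_of_service(hired, today) > HIRE_THRESHOLD_YEARS:
        return float(amount + amount * 0.10)
    return float(amount)


fixed_today = date(2020, 6, 1)  # fixed date so the results are reproducible
print(bonus_salary(20, "2010-01-01", fixed_today))    # 22.0
print(bonus_salary(20, "2019-01-01", fixed_today))    # 20.0
print(bonus_salary(None, "2010-01-01", fixed_today))  # 0.0
```

To use it in Spark, the function could be wrapped with `udf(bonus_salary, DoubleType())`. Keep in mind that an exception raised inside a UDF fails the Spark job, so returning `None` for invalid input may be preferable to raising on real data.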


In [None]:
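// Create and register the UDF in Scala, then call it from Python

import org.apache.spark.sql.functions.udf

// Add 10% to the current salary (integer arithmetic, so any fraction is truncated)
def increase_ten_percent(amount: Integer): Integer =
    amount.intValue + amount.intValue / 10

// test function
increase_ten_percent(20) // must return 22

// Register increase_ten_percent UDF
val increase_ten_percent_udf = udf(increase_ten_percent(_:Integer):Integer)

# Call UDF in Python not in Scala
# Note: the Scala val above is not visible from a Python session; this call
# assumes a UDF with the same name is also defined in Python.
from pyspark.sql.functions import col
salaries.select(increase_ten_percent_udf(col("salary"))).show(10)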

**Important Note**

A UDF registered as shown above can only be used as a DataFrame function. It cannot be used in a string expression such as `employees.selectExpr("increase_ten_percent_udf(column_name)")`. To use it in string expressions and in SQL statements as well, register it as a Spark SQL function. A function registered this way is also valid in DataFrame expressions.

Register the UDF as a SQL function in Scala:   
`spark.udf.register("increase_ten_percent_udf", increase_ten_percent(_:Integer):Integer)`   

Register the UDF as a SQL function in Python:   
`spark.udf.register("increase_ten_percent_udf", increase_ten_percent)`  

Now we can use it on our DataFrame with `selectExpr`, as shown below:   
`employees.selectExpr("increase_ten_percent_udf(column_name)")`   

Our UDF now works as expected, but the best practice is to specify the function's return type. If the value the function actually returns does not match the declared type, Spark returns `null` instead of raising an error. To return a null deliberately, a Python UDF can return `None` and a Scala UDF can return an `Option`.

Register the UDF as a SQL function in Python with a return type (the Python function returns a float, so the matching type is `DoubleType`):   
`from pyspark.sql.types import DoubleType`   
`spark.udf.register("increase_ten_percent_udf", increase_ten_percent, DoubleType())`   
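
The interaction between a declared return type and `None` can be sketched as follows. This is a minimal illustration, assuming a live `SparkSession` named `spark`; the names `increase_ten_percent_or_none` and `register_salary_udf` are hypothetical and do not appear elsewhere in this chapter. The PySpark import sits inside the helper so that the plain function can be checked without a running session.

```python
def increase_ten_percent_or_none(amount):
    """Return amount plus 10%, or None so that Spark stores a null."""
    if amount is None:
        return None
    return float(amount + amount * 0.10)


def register_salary_udf(spark, name="increase_ten_percent_udf"):
    """Register the function as a SQL function that returns a double."""
    from pyspark.sql.types import DoubleType
    # DoubleType matches the Python float that the function returns.
    return spark.udf.register(name, increase_ten_percent_or_none, DoubleType())


print(increase_ten_percent_or_none(20))    # 22.0
print(increase_ten_percent_or_none(None))  # None
```

With a session available, calling `register_salary_udf(spark)` and then `salaries.selectExpr("salary", "increase_ten_percent_udf(salary) AS increase_salary")` should give a double column that is null wherever `salary` is null.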

Similarly, we can register Hive UDFs and UDAFs through Hive syntax. [Click for more detail](https://blog.cloudera.com/working-with-udfs-in-apache-spark/). 