### Spark Lab One

Date: 2019-12-15   
Author: Analytics Tensor   
Description: This lab introduces data manipulation with Spark DataFrames. We read data from a MySQL database and load it into Spark DataFrames over a JDBC connection. The lab covers:
* Reading
* Projection
* Filtering
* Sorting
* Aggregation
* Writing

#### Reading
In Spark, data can be read from a database in two ways:
* Load method
* JDBC method

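Both methods accomplish the same task. We will use both, but prefer the JDBC method: it lets us keep the connection properties in a config file instead of hard-coding credentials in the notebook, and reuse them across tables.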
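Because both styles take the same settings, it can help to see how one set of connection properties maps onto each call. The sketch below is illustrative only: `jdbc_load_options` is a hypothetical helper name, and the user and password values are placeholders rather than this lab's credentials.

```python
def jdbc_load_options(url, table, properties):
    """Merge url, table and connection properties into one options dict."""
    options = dict(properties)   # copy, so the caller's dict is not modified
    options["url"] = url
    options["dbtable"] = table
    return options


# Placeholder connection properties (assumed values, not the lab's real ones).
props = {"user": "lab_user", "password": "change-me", "driver": "com.mysql.jdbc.Driver"}
opts = jdbc_load_options("jdbc:mysql://localhost:3306/employees", "employees", props)
print(sorted(opts))

# With a SparkSession named `spark`, the two reading styles would then be:
#   spark.read.format("jdbc").options(**opts).load()
#   spark.read.jdbc(url=opts["url"], table=opts["dbtable"], properties=props)
```

The Spark calls are left as comments so the snippet runs without a database; the rest of this section shows both styles against the real tables.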

**Create SparkSession**

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Spark Lab One") \
    .getOrCreate()

**Load Method**

In [2]:
# Loading data using load method.
employees = spark.read \
  .format("jdbc") \
  .option("url", "jdbc:mysql://localhost:3306/employees") \
  .option("driver", "com.mysql.jdbc.Driver") \
  .option("dbtable", "employees") \
  .option("user", "root") \
  .option("password", "Mysql123#") \
  .load()

# Print Schema
employees.printSchema()

root
 |-- emp_no: integer (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: date (nullable = true)



**Display top 20 records**

In [3]:
employees.limit(20).show()

+------+----------+----------+-----------+------+----------+
|emp_no|birth_date|first_name|  last_name|gender| hire_date|
+------+----------+----------+-----------+------+----------+
| 10001|1953-09-02|    Georgi|    Facello|     M|1986-06-26|
| 10002|1964-06-02|   Bezalel|     Simmel|     F|1985-11-21|
| 10003|1959-12-03|     Parto|    Bamford|     M|1986-08-28|
| 10004|1954-05-01| Chirstian|    Koblick|     M|1986-12-01|
| 10005|1955-01-21|   Kyoichi|   Maliniak|     M|1989-09-12|
| 10006|1953-04-20|    Anneke|    Preusig|     F|1989-06-02|
| 10007|1957-05-23|   Tzvetan|  Zielinski|     F|1989-02-10|
| 10008|1958-02-19|    Saniya|   Kalloufi|     M|1994-09-15|
| 10009|1952-04-19|    Sumant|       Peac|     F|1985-02-18|
| 10010|1963-06-01| Duangkaew|   Piveteau|     F|1989-08-24|
| 10011|1953-11-07|      Mary|      Sluis|     F|1990-01-22|
| 10012|1960-10-04|  Patricio|  Bridgland|     M|1992-12-18|
| 10013|1963-06-07| Eberhardt|     Terkki|     M|1985-10-20|
| 10014|1956-02-12|     

**JDBC Method**   
In the JDBC method, we pass the connection properties from a config file. Python's [configparser](https://docs.python.org/3/library/configparser.html) is used to read the config file. When reading a SQL table, the connection properties are passed as a dictionary through the `properties` argument of the `jdbc` method. This keeps the connection properties out of the notebook and lets us reuse the same properties to load different tables.

In [4]:
import configparser

# Read mysql database connection string from conf/db_properties.ini

config_filename = 'conf/db_properties.ini'
db_properties = {}
config = configparser.ConfigParser()
config.read(config_filename)
db_prop = config['mysql']
db_url = db_prop['url']
db_properties['database'] = db_prop['database']
db_properties['schema'] = db_prop['schema']
db_properties['user'] = db_prop['user']
db_properties['password'] = db_prop['password']
db_properties['driver'] = db_prop['driver']
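The config file itself is not shown in this lab. As a rough sketch, assuming it has a single `[mysql]` section with the keys read above, it could look like the sample below. All values are placeholders, and the names `SAMPLE_INI`, `sample_url`, and `sample_properties` are made up for this illustration so they do not overwrite the real variables.

```python
import configparser

# Placeholder values only; the real conf/db_properties.ini is not shown in this lab.
SAMPLE_INI = """
[mysql]
url = jdbc:mysql://localhost:3306/employees
database = employees
schema = employees
user = lab_user
password = change-me
driver = com.mysql.jdbc.Driver
"""

sample_config = configparser.ConfigParser()
sample_config.read_string(SAMPLE_INI)  # the lab calls config.read(path) on a file instead
sample_section = sample_config["mysql"]

sample_url = sample_section["url"]
# Every key except the URL goes into the properties dictionary.
sample_properties = {key: value for key, value in sample_section.items() if key != "url"}
print(sorted(sample_properties))
```

Keeping the file out of version control is what keeps the credentials out of the notebook; the parsing step alone does not protect them.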

In [5]:
# Load Employee table using JDBC method

employees = spark.read.jdbc(url = db_url, table = 'employees', properties = db_properties)
employees.printSchema()

root
 |-- emp_no: integer (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: date (nullable = true)



In [5]:
employees.limit(20).show()

+------+----------+----------+-----------+------+----------+
|emp_no|birth_date|first_name|  last_name|gender| hire_date|
+------+----------+----------+-----------+------+----------+
| 10001|1953-09-02|    Georgi|    Facello|     M|1986-06-26|
| 10002|1964-06-02|   Bezalel|     Simmel|     F|1985-11-21|
| 10003|1959-12-03|     Parto|    Bamford|     M|1986-08-28|
| 10004|1954-05-01| Chirstian|    Koblick|     M|1986-12-01|
| 10005|1955-01-21|   Kyoichi|   Maliniak|     M|1989-09-12|
| 10006|1953-04-20|    Anneke|    Preusig|     F|1989-06-02|
| 10007|1957-05-23|   Tzvetan|  Zielinski|     F|1989-02-10|
| 10008|1958-02-19|    Saniya|   Kalloufi|     M|1994-09-15|
| 10009|1952-04-19|    Sumant|       Peac|     F|1985-02-18|
| 10010|1963-06-01| Duangkaew|   Piveteau|     F|1989-08-24|
| 10011|1953-11-07|      Mary|      Sluis|     F|1990-01-22|
| 10012|1960-10-04|  Patricio|  Bridgland|     M|1992-12-18|
| 10013|1963-06-07| Eberhardt|     Terkki|     M|1985-10-20|
| 10014|1956-02-12|     

**Load all the remaining tables from the employees database**   
List of tables: 
* current_dept_emp
* departments
* dept_emp
* dept_emp_latest_date
* dept_manager
* employees
* highest_salary_employee
* salaries
* titles

In [6]:
# Load current_dept_emp
current_dept_emp = spark.read.jdbc(url = db_url, table = 'current_dept_emp', properties = db_properties)

# Load departments
departments = spark.read.jdbc(url = db_url, table = 'departments', properties = db_properties)

# Load dept_emp
dept_emp = spark.read.jdbc(url = db_url, table = 'dept_emp', properties = db_properties)

# Load dept_emp_latest_date
dept_emp_latest_date = spark.read.jdbc(url = db_url, table = 'dept_emp_latest_date', properties = db_properties)

# Load dept_manager
dept_manager = spark.read.jdbc(url = db_url, table = 'dept_manager', properties = db_properties)

# Load employees
employees = spark.read.jdbc(url = db_url, table = 'employees', properties = db_properties)

# Load highest_salary_employee
highest_salary_employee = spark.read.jdbc(url = db_url, table = 'highest_salary_employee', properties = db_properties)

# Load salaries
salaries = spark.read.jdbc(url = db_url, table = 'salaries', properties = db_properties)

# Load titles
titles = spark.read.jdbc(url = db_url, table = 'titles', properties = db_properties)

#### Load table dynamically
Suppose we have more than 100 tables. Loading each table by hand is impractical, so we load the tables programmatically in a loop.

In [11]:
# Dynamically load every table into its own DataFrame.
# Table names are strings; each one becomes a key in the dictionary.
table_list = ["current_dept_emp", "departments",
              "dept_emp", "dept_emp_latest_date", "dept_manager",
              "employees", "highest_salary_employee", "salaries",
              "titles"]

# Dictionary mapping table name -> Spark DataFrame
employees_db = {}
for table in table_list:
    print("Loading MySQL table {}".format(table))
    employees_db[table] = spark.read.jdbc(url = db_url, table = table, properties = db_properties)
    print("Spark DataFrame created for {}".format(table))

In [7]:
# Iterate over the loaded DataFrames and print each schema.
table_list = [current_dept_emp, departments,
              dept_emp, dept_emp_latest_date, dept_manager,
              employees, highest_salary_employee, salaries,
              titles]
for table in table_list:
    table.printSchema()

root
 |-- emp_no: integer (nullable = true)
 |-- dept_no: string (nullable = true)
 |-- from_date: date (nullable = true)
 |-- to_date: date (nullable = true)

root
 |-- dept_no: string (nullable = true)
 |-- dept_name: string (nullable = true)

root
 |-- emp_no: integer (nullable = true)
 |-- dept_no: string (nullable = true)
 |-- from_date: date (nullable = true)
 |-- to_date: date (nullable = true)

root
 |-- emp_no: integer (nullable = true)
 |-- from_date: date (nullable = true)
 |-- to_date: date (nullable = true)

root
 |-- emp_no: integer (nullable = true)
 |-- dept_no: string (nullable = true)
 |-- from_date: date (nullable = true)
 |-- to_date: date (nullable = true)

root
 |-- emp_no: integer (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: date (nullable = true)

root
 |-- id: long (nullable = true)
 |-- emp_no: long (nullable = t

#### Projection

1. Select top 10 records for all fields/attributes from employees DataFrame.

In [10]:
from pyspark.sql.functions import col

employees.select(col("*")).limit(10).show()

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|
| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12|
| 10006|1953-04-20|    Anneke|  Preusig|     F|1989-06-02|
| 10007|1957-05-23|   Tzvetan|Zielinski|     F|1989-02-10|
| 10008|1958-02-19|    Saniya| Kalloufi|     M|1994-09-15|
| 10009|1952-04-19|    Sumant|     Peac|     F|1985-02-18|
| 10010|1963-06-01| Duangkaew| Piveteau|     F|1989-08-24|
+------+----------+----------+---------+------+----------+



In [11]:
employees.select("*").limit(10).show()

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|
| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12|
| 10006|1953-04-20|    Anneke|  Preusig|     F|1989-06-02|
| 10007|1957-05-23|   Tzvetan|Zielinski|     F|1989-02-10|
| 10008|1958-02-19|    Saniya| Kalloufi|     M|1994-09-15|
| 10009|1952-04-19|    Sumant|     Peac|     F|1985-02-18|
| 10010|1963-06-01| Duangkaew| Piveteau|     F|1989-08-24|
+------+----------+----------+---------+------+----------+



In [47]:
employees.selectExpr("*").limit(10).show()

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|
| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12|
| 10006|1953-04-20|    Anneke|  Preusig|     F|1989-06-02|
| 10007|1957-05-23|   Tzvetan|Zielinski|     F|1989-02-10|
| 10008|1958-02-19|    Saniya| Kalloufi|     M|1994-09-15|
| 10009|1952-04-19|    Sumant|     Peac|     F|1985-02-18|
| 10010|1963-06-01| Duangkaew| Piveteau|     F|1989-08-24|
+------+----------+----------+---------+------+----------+



2. Select top 5 first name, last name and gender from employees DataFrame. 

In [12]:
from pyspark.sql.functions import col

employees.select(col("first_name"), col("last_name"), col("gender")).show(5)

+----------+---------+------+
|first_name|last_name|gender|
+----------+---------+------+
|    Georgi|  Facello|     M|
|   Bezalel|   Simmel|     F|
|     Parto|  Bamford|     M|
| Chirstian|  Koblick|     M|
|   Kyoichi| Maliniak|     M|
+----------+---------+------+
only showing top 5 rows



In [13]:
from pyspark.sql.functions import column

employees.select(column("first_name"), column("last_name"), column("gender")).show(5)

+----------+---------+------+
|first_name|last_name|gender|
+----------+---------+------+
|    Georgi|  Facello|     M|
|   Bezalel|   Simmel|     F|
|     Parto|  Bamford|     M|
| Chirstian|  Koblick|     M|
|   Kyoichi| Maliniak|     M|
+----------+---------+------+
only showing top 5 rows



In [36]:
from pyspark.sql.functions import expr

employees.select(expr("first_name"), expr("last_name"), expr("gender")).show(5)

+----------+---------+------+
|first_name|last_name|gender|
+----------+---------+------+
|    Georgi|  Facello|     M|
|   Bezalel|   Simmel|     F|
|     Parto|  Bamford|     M|
| Chirstian|  Koblick|     M|
|   Kyoichi| Maliniak|     M|
+----------+---------+------+
only showing top 5 rows



In [40]:
employees.select("first_name", "last_name", "gender").show(5)

+----------+---------+------+
|first_name|last_name|gender|
+----------+---------+------+
|    Georgi|  Facello|     M|
|   Bezalel|   Simmel|     F|
|     Parto|  Bamford|     M|
| Chirstian|  Koblick|     M|
|   Kyoichi| Maliniak|     M|
+----------+---------+------+
only showing top 5 rows



In [14]:
employees.selectExpr("first_name as first_name", "last_name as last_name", "gender").show(5)

+----------+---------+------+
|first_name|last_name|gender|
+----------+---------+------+
|    Georgi|  Facello|     M|
|   Bezalel|   Simmel|     F|
|     Parto|  Bamford|     M|
| Chirstian|  Koblick|     M|
|   Kyoichi| Maliniak|     M|
+----------+---------+------+
only showing top 5 rows



In [15]:
employees.select(employees.first_name, employees.last_name, employees.gender).show(5)

+----------+---------+------+
|first_name|last_name|gender|
+----------+---------+------+
|    Georgi|  Facello|     M|
|   Bezalel|   Simmel|     F|
|     Parto|  Bamford|     M|
| Chirstian|  Koblick|     M|
|   Kyoichi| Maliniak|     M|
+----------+---------+------+
only showing top 5 rows



3. Select all the fields plus a new field aliased as "Full Name" from employees DF.

In [57]:
from pyspark.sql.functions import concat_ws

employees.select("*", concat_ws(" ", "first_name","last_name").alias("Full Name")).show(5)

+------+----------+----------+---------+------+----------+-----------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|        Full Name|
+------+----------+----------+---------+------+----------+-----------------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|   Georgi Facello|
| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|   Bezalel Simmel|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|    Parto Bamford|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|Chirstian Koblick|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12| Kyoichi Maliniak|
+------+----------+----------+---------+------+----------+-----------------+
only showing top 5 rows



In [70]:
from pyspark.sql.functions import concat_ws

employees.withColumn("Full Name", concat_ws(" ", "first_name","last_name")).show(5)

+------+----------+----------+---------+------+----------+-----------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|        Full Name|
+------+----------+----------+---------+------+----------+-----------------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|   Georgi Facello|
| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|   Bezalel Simmel|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|    Parto Bamford|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|Chirstian Koblick|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12| Kyoichi Maliniak|
+------+----------+----------+---------+------+----------+-----------------+
only showing top 5 rows



4. Select upper first name and lower last name of all employees.

In [72]:
from pyspark.sql.functions import lower, upper

employees.select(upper("first_name").alias("Upper FirstName"), lower("last_name").alias("Lower LastName")).show(5)

+---------------+--------------+
|Upper FirstName|Lower LastName|
+---------------+--------------+
|         GEORGI|       facello|
|        BEZALEL|        simmel|
|          PARTO|       bamford|
|      CHIRSTIAN|       koblick|
|        KYOICHI|      maliniak|
+---------------+--------------+
only showing top 5 rows



**Dropping Columns**

In [86]:
employees.printSchema()

root
 |-- emp_no: integer (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: date (nullable = true)



5. Drop gender from employees DF.

In [88]:
employees.drop("gender")

DataFrame[emp_no: int, birth_date: date, first_name: string, last_name: string, hire_date: date]

In [89]:
employees.printSchema()

root
 |-- emp_no: integer (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: date (nullable = true)



In [90]:
employees.show(2)

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|
| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|
+------+----------+----------+---------+------+----------+
only showing top 2 rows



A DataFrame is immutable, so `drop` does not remove the column from the original employees DataFrame; it returns a new DataFrame. Assign the result to a new variable, or reassign it to the same variable.

In [5]:
emp = employees.drop("gender")

In [6]:
emp.show(5)

+------+----------+----------+---------+----------+
|emp_no|birth_date|first_name|last_name| hire_date|
+------+----------+----------+---------+----------+
| 10001|1953-09-02|    Georgi|  Facello|1986-06-26|
| 10002|1964-06-02|   Bezalel|   Simmel|1985-11-21|
| 10003|1959-12-03|     Parto|  Bamford|1986-08-28|
| 10004|1954-05-01| Chirstian|  Koblick|1986-12-01|
| 10005|1955-01-21|   Kyoichi| Maliniak|1989-09-12|
+------+----------+----------+---------+----------+
only showing top 5 rows



6. Drop first_name, last_name and gender from employees DF.

In [7]:
emp_1 = employees.drop("first_name", "last_name", "gender")

In [8]:
emp_1.show(5)

+------+----------+----------+
|emp_no|birth_date| hire_date|
+------+----------+----------+
| 10001|1953-09-02|1986-06-26|
| 10002|1964-06-02|1985-11-21|
| 10003|1959-12-03|1986-08-28|
| 10004|1954-05-01|1986-12-01|
| 10005|1955-01-21|1989-09-12|
+------+----------+----------+
only showing top 5 rows



7. Create a new DF named "emp_mod" containing only 10000 employees and a new field emp_ID that holds emp_no cast to string type. Print the schema to validate the output.

In [19]:
from pyspark.sql.functions import col 

# Keep only 10000 employees, then add emp_ID as a string copy of emp_no.
emp_mod = employees.limit(10000).withColumn("emp_ID", col("emp_no").cast("string"))

In [20]:
emp_mod.printSchema()

root
 |-- emp_no: integer (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: date (nullable = true)
 |-- emp_ID: string (nullable = true)



#### Filtering

8. Select employee whose emp_no is 10001 from employees DF.

In [22]:
employees.where(col("emp_no") == 10001).show(5)

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|
+------+----------+----------+---------+------+----------+



9. Select emp_no, first_name, and last_name for the employee whose emp_no is 10001 from employees DF.

In [31]:
employees.select("emp_no", "first_name", "last_name").where(col("emp_no") == 10001).show(5)

+------+----------+---------+
|emp_no|first_name|last_name|
+------+----------+---------+
| 10001|    Georgi|  Facello|
+------+----------+---------+



In [32]:
employees.selectExpr("emp_no", "first_name", "last_name").where(col("emp_no") == 10001).show(5)

+------+----------+---------+
|emp_no|first_name|last_name|
+------+----------+---------+
| 10001|    Georgi|  Facello|
+------+----------+---------+



In [30]:
employees.where(col("emp_no") == 10001).select("emp_no","first_name", "last_name").show(5)

+------+----------+---------+
|emp_no|first_name|last_name|
+------+----------+---------+
| 10001|    Georgi|  Facello|
+------+----------+---------+



10. Select emp_no, first_name, and last_name for the employees whose emp_no is 10001, 10020, 10050, or 10070 from employees DF.

In [37]:
employees.select("emp_no", "first_name", "last_name")\
    .where(col("emp_no").isin(10001,10020,10050, 10070)).show(5)

+------+----------+----------+
|emp_no|first_name| last_name|
+------+----------+----------+
| 10001|    Georgi|   Facello|
| 10020|    Mayuko|   Warwick|
| 10050|   Yinghua|    Dredge|
| 10070|    Reuven|Garigliano|
+------+----------+----------+



In [38]:
employees.select("emp_no", "first_name", "last_name")\
    .where(col("emp_no").isin([10001,10020,10050, 10070])).show(5)

+------+----------+----------+
|emp_no|first_name| last_name|
+------+----------+----------+
| 10001|    Georgi|   Facello|
| 10020|    Mayuko|   Warwick|
| 10050|   Yinghua|    Dredge|
| 10070|    Reuven|Garigliano|
+------+----------+----------+



11. Select all the employees whose first name starts with 'S'.

In [39]:
employees.select("*")\
    .where(col("first_name").like('S%')).show(5)

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10008|1958-02-19|    Saniya| Kalloufi|     M|1994-09-15|
| 10009|1952-04-19|    Sumant|     Peac|     F|1985-02-18|
| 10022|1952-07-08|    Shahaf|   Famili|     M|1995-08-22|
| 10024|1958-09-05|   Suzette|   Pettey|     F|1997-05-19|
| 10053|1954-09-13|    Sanjiv| Zschoche|     F|1986-02-04|
+------+----------+----------+---------+------+----------+
only showing top 5 rows



11. Select all the employees whose first name starts with 'S' and ends with 'a'.

In [16]:
employees.select("*")\
    .where(col("first_name").like('S%a')).show(5)

+------+----------+-----------+---------+------+----------+
|emp_no|birth_date| first_name|last_name|gender| hire_date|
+------+----------+-----------+---------+------+----------+
| 10008|1958-02-19|     Saniya| Kalloufi|     M|1994-09-15|
| 10093|1964-06-11|    Sailaja|  Desikan|     M|1996-11-05|
| 10098|1961-09-23|Sreekrishna|Servieres|     F|1985-05-13|
| 10235|1958-03-27|    Susanta| Roccetti|     F|1995-04-06|
| 10259|1964-11-24|    Susanna|    Vesel|     M|1986-06-25|
+------+----------+-----------+---------+------+----------+
only showing top 5 rows
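The `like` patterns used here follow SQL wildcard rules: `%` stands for any run of characters (including none) and `_` for exactly one character, and the whole value must match. The plain-Python function below is only a rough model of that behavior for experimenting with patterns; `sql_like` is a hypothetical name, it ignores escape characters, and it is not how Spark implements `like`.

```python
import re


def sql_like(value, pattern):
    """Rough model of SQL LIKE: % = any run of characters, _ = exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    # fullmatch: the pattern has to cover the whole value, as in SQL LIKE.
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None


names = ["Saniya", "Sumant", "Sailaja", "saniya"]
matches = [name for name in names if sql_like(name, "S%a")]
print(matches)  # ['Saniya', 'Sailaja']
```

The lowercase `saniya` is rejected because the match is case-sensitive.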



12. Select all the employees whose first name starts with 'S' and ends with 'a', and last name starts with 'K' and ends with 'a', and gender is male ('M').

In [17]:
employees.select("*")\
    .where(col("first_name").like('S%a'))\
    .where(col("last_name").like('K%a'))\
    .where(col("gender") == 'M').show(5)

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10328|1955-06-28| Serenella|Kawashima|     M|1994-01-16|
| 20167|1960-05-10| Stamatina|   Kobara|     M|1985-04-04|
| 22850|1960-05-06|   Shushma|  Kuzuoka|     M|1991-12-01|
| 25313|1955-04-06|    Susuma|    Kroha|     M|1992-11-18|
| 26339|1961-05-20|    Susuma|Kawashima|     M|1995-08-09|
+------+----------+----------+---------+------+----------+
only showing top 5 rows



13. Select all the employees whose first name starts with 'J' or 'K'.

In [55]:
employees.select("*")\
    .where(col("first_name").like('J%') | (col("first_name").like('K%')))\
    .show(10)

+------+----------+----------+-----------+------+----------+
|emp_no|birth_date|first_name|  last_name|gender| hire_date|
+------+----------+----------+-----------+------+----------+
| 10005|1955-01-21|   Kyoichi|   Maliniak|     M|1989-09-12|
| 10016|1961-05-02|  Kazuhito|Cappelletti|     M|1995-01-27|
| 10018|1954-06-19|  Kazuhide|       Peha|     F|1987-04-03|
| 10031|1959-01-27|   Karsten|     Joslin|     M|1991-09-01|
| 10032|1960-08-09|     Jeong|    Reistad|     F|1990-06-20|
| 10066|1952-11-13|      Kwee|   Schusler|     M|1986-02-26|
| 10079|1961-10-05|   Kshitij|       Gils|     F|1986-03-27|
| 10085|1962-11-07|   Kenroku|  Malabarba|     M|1994-04-09|
| 10088|1954-02-25|  Jungsoon|   Syrzycki|     F|1988-09-02|
| 10090|1961-05-30|    Kendra|    Hofting|     M|1986-03-14|
+------+----------+----------+-----------+------+----------+
only showing top 10 rows



14. Select all the employees whose first name starts with 'J' or last name starts with 'K'.

In [56]:
employees.select("*")\
    .where(col("first_name").like('J%') | (col("last_name").like('K%')))\
    .show(10)

+------+----------+----------+-------------+------+----------+
|emp_no|birth_date|first_name|    last_name|gender| hire_date|
+------+----------+----------+-------------+------+----------+
| 10004|1954-05-01| Chirstian|      Koblick|     M|1986-12-01|
| 10008|1958-02-19|    Saniya|     Kalloufi|     M|1994-09-15|
| 10032|1960-08-09|     Jeong|      Reistad|     F|1990-06-20|
| 10084|1960-05-25|     Tuval|     Kalloufi|     M|1995-12-15|
| 10088|1954-02-25|  Jungsoon|     Syrzycki|     F|1988-09-02|
| 10096|1954-09-16|    Jayson|      Mandell|     M|1990-01-14|
| 10113|1963-11-13|    Jaewon|     Syrzycki|     M|1989-12-24|
| 10152|1954-12-01|    Jaques|        Munro|     F|1986-01-27|
| 10160|1953-10-18|  Debatosh|Khasidashvili|     M|1989-01-30|
| 10164|1956-01-19|    Jagoda|    Braunmuhl|     M|1985-11-12|
+------+----------+----------+-------------+------+----------+
only showing top 10 rows



15. Select all the employees whose first name starts with 'J' and last name starts with 'K'.

In [57]:
employees.select("*")\
    .where(col("first_name").like('J%') & (col("last_name").like('K%')))\
    .show(10)

+------+----------+----------+-----------+------+----------+
|emp_no|birth_date|first_name|  last_name|gender| hire_date|
+------+----------+----------+-----------+------+----------+
| 10213|1964-05-24|   Jackson|     Kakkad|     M|1992-11-06|
| 10445|1957-01-10|   Junichi|   Kavanagh|     F|1987-11-04|
| 10657|1958-03-09| Juichirou|Kitsuregawa|     M|1989-12-31|
| 10660|1964-01-03|     Jouko|    Kolinko|     M|1988-08-12|
| 11387|1954-07-18|  Jordanka|   Kalloufi|     M|1997-05-11|
| 12581|1964-03-08|    Jaihie|    Kilgour|     M|1993-03-23|
| 12622|1958-11-24|     Jiafu|     Kobara|     M|1993-11-05|
| 12762|1964-04-27| Jaroslava|    Koblitz|     F|1992-09-25|
| 12814|1952-06-10|    Jaques|    Kohling|     M|1995-04-29|
| 13210|1953-06-19|   Jianhua|    Klassen|     M|1986-09-19|
+------+----------+----------+-----------+------+----------+
only showing top 10 rows



16. Describe the summary of salaries DF.

In [61]:
salaries.describe().show()

+-------+------------------+------------------+
|summary|            emp_no|            salary|
+-------+------------------+------------------+
|  count|           2844047|           2844047|
|   mean|253057.44317657198|63810.744836143705|
| stddev|161844.74133284207|16904.831259968036|
|    min|             10001|             38623|
|    max|            499999|            158220|
+-------+------------------+------------------+



In [62]:
salaries.summary().show()

+-------+------------------+------------------+
|summary|            emp_no|            salary|
+-------+------------------+------------------+
|  count|           2844047|           2844047|
|   mean|253057.44317657198|63810.744836143705|
| stddev|161844.74133284207|16904.831259968036|
|    min|             10001|             38623|
|    25%|             84857|             50510|
|    50%|            249765|             61142|
|    75%|            424894|             74189|
|    max|            499999|            158220|
+-------+------------------+------------------+



17. Print schema of salaries DF.

In [64]:
salaries.printSchema()

root
 |-- emp_no: integer (nullable = true)
 |-- salary: integer (nullable = true)
 |-- from_date: date (nullable = true)
 |-- to_date: date (nullable = true)



#### Sorting

18. Select top 10 records from salaries DF having lowest salary.

In [67]:
salaries.sort("salary").show(10)

+------+------+----------+----------+
|emp_no|salary| from_date|   to_date|
+------+------+----------+----------+
|253406| 38623|2002-02-20|9999-01-01|
| 49239| 38735|1996-09-17|1997-09-17|
|281546| 38786|1996-11-13|1997-06-26|
| 15830| 38812|2001-03-12|2002-03-12|
| 64198| 38836|1989-10-20|1990-10-20|
|475254| 38849|1993-06-04|1994-06-04|
| 50419| 38850|1996-09-22|1997-09-22|
| 34707| 38851|1990-10-03|1991-10-03|
| 49239| 38859|1995-09-18|1996-09-17|
|274049| 38864|1996-09-01|1997-09-01|
+------+------+----------+----------+
only showing top 10 rows



In [68]:
salaries.orderBy("salary").show(10)

+------+------+----------+----------+
|emp_no|salary| from_date|   to_date|
+------+------+----------+----------+
|253406| 38623|2002-02-20|9999-01-01|
| 49239| 38735|1996-09-17|1997-09-17|
|281546| 38786|1996-11-13|1997-06-26|
| 15830| 38812|2001-03-12|2002-03-12|
| 64198| 38836|1989-10-20|1990-10-20|
|475254| 38849|1993-06-04|1994-06-04|
| 50419| 38850|1996-09-22|1997-09-22|
| 34707| 38851|1990-10-03|1991-10-03|
| 49239| 38859|1995-09-18|1996-09-17|
|274049| 38864|1996-09-01|1997-09-01|
+------+------+----------+----------+
only showing top 10 rows



In [73]:
from pyspark.sql.functions import col

salaries.orderBy(col("salary").asc()).show(10)

+------+------+----------+----------+
|emp_no|salary| from_date|   to_date|
+------+------+----------+----------+
|253406| 38623|2002-02-20|9999-01-01|
| 49239| 38735|1996-09-17|1997-09-17|
|281546| 38786|1996-11-13|1997-06-26|
| 15830| 38812|2001-03-12|2002-03-12|
| 64198| 38836|1989-10-20|1990-10-20|
|475254| 38849|1993-06-04|1994-06-04|
| 50419| 38850|1996-09-22|1997-09-22|
| 34707| 38851|1990-10-03|1991-10-03|
| 49239| 38859|1995-09-18|1996-09-17|
|274049| 38864|1996-09-01|1997-09-01|
+------+------+----------+----------+
only showing top 10 rows



In [74]:
from pyspark.sql.functions import asc

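# asc() builds an explicit ascending sort expression;
# expr("salary as asc") would only alias the column as "asc".
salaries.orderBy(asc("salary")).show(10)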

+------+------+----------+----------+
|emp_no|salary| from_date|   to_date|
+------+------+----------+----------+
|253406| 38623|2002-02-20|9999-01-01|
| 49239| 38735|1996-09-17|1997-09-17|
|281546| 38786|1996-11-13|1997-06-26|
| 15830| 38812|2001-03-12|2002-03-12|
| 64198| 38836|1989-10-20|1990-10-20|
|475254| 38849|1993-06-04|1994-06-04|
| 50419| 38850|1996-09-22|1997-09-22|
| 34707| 38851|1990-10-03|1991-10-03|
| 49239| 38859|1995-09-18|1996-09-17|
|274049| 38864|1996-09-01|1997-09-01|
+------+------+----------+----------+
only showing top 10 rows



18. Select top 10 records from salaries DF having highest salary.

In [69]:
from pyspark.sql.functions import col

salaries.orderBy(col("salary").desc()).show(10)

+------+------+----------+----------+
|emp_no|salary| from_date|   to_date|
+------+------+----------+----------+
| 43624|158220|2002-03-22|9999-01-01|
| 43624|157821|2001-03-22|2002-03-22|
|254466|156286|2001-08-04|9999-01-01|
| 47978|155709|2002-07-14|9999-01-01|
|253939|155513|2002-04-11|9999-01-01|
|109334|155377|2000-02-12|2001-02-11|
|109334|155190|2002-02-11|9999-01-01|
|109334|154888|2001-02-11|2002-02-11|
|109334|154885|1999-02-12|2000-02-12|
| 80823|154459|2002-02-22|9999-01-01|
+------+------+----------+----------+
only showing top 10 rows



In [75]:
from pyspark.sql.functions import desc

salaries.orderBy(desc("salary")).show(10)

+------+------+----------+----------+
|emp_no|salary| from_date|   to_date|
+------+------+----------+----------+
| 43624|158220|2002-03-22|9999-01-01|
| 43624|157821|2001-03-22|2002-03-22|
|254466|156286|2001-08-04|9999-01-01|
| 47978|155709|2002-07-14|9999-01-01|
|253939|155513|2002-04-11|9999-01-01|
|109334|155377|2000-02-12|2001-02-11|
|109334|155190|2002-02-11|9999-01-01|
|109334|154888|2001-02-11|2002-02-11|
|109334|154885|1999-02-12|2000-02-12|
| 80823|154459|2002-02-22|9999-01-01|
+------+------+----------+----------+
only showing top 10 rows



19. Select top 10 emp_no and salary from salaries DF sorted by salary in ascending order and emp_no in descending order.

In [79]:
from pyspark.sql.functions import col

salaries.orderBy(col("salary").asc(), col("emp_no").desc()).selectExpr("emp_no", "salary").show(10)

+------+------+
|emp_no|salary|
+------+------+
|253406| 38623|
| 49239| 38735|
|281546| 38786|
| 15830| 38812|
| 64198| 38836|
|475254| 38849|
| 50419| 38850|
| 34707| 38851|
| 49239| 38859|
|274049| 38864|
+------+------+
only showing top 10 rows



20. Select all records from departments sorted by dept_no.

In [None]:
departments.orderBy("dept_no").show(10)

21. Select all records from departments sorted by dept_name in ascending order.

In [83]:
departments.orderBy("dept_name").show(10)

+-------+------------------+
|dept_no|         dept_name|
+-------+------------------+
|   d009|  Customer Service|
|   d005|       Development|
|   d002|           Finance|
|   d003|   Human Resources|
|   d001|         Marketing|
|   d004|        Production|
|   d006|Quality Management|
|   d008|          Research|
|   d007|             Sales|
+-------+------------------+



22. Select all records from departments sorted by dept_name in descending order.

In [84]:
departments.orderBy(desc("dept_name")).show(10)

+-------+------------------+
|dept_no|         dept_name|
+-------+------------------+
|   d007|             Sales|
|   d008|          Research|
|   d006|Quality Management|
|   d004|        Production|
|   d001|         Marketing|
|   d003|   Human Resources|
|   d002|           Finance|
|   d005|       Development|
|   d009|  Customer Service|
+-------+------------------+



**Aggregation**

23. Count total employees from employees DF.

In [5]:
from pyspark.sql.functions import count

employees.select(count("*")).show()

+--------+
|count(1)|
+--------+
|  300024|
+--------+



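**Note**: In Spark, `count("*")` counts every row, including rows that contain null values, whereas naming a column, e.g. `count("first_name")`, counts only the rows where that column is not null. The two counts match here only because `first_name` contains no nulls.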

In [6]:
from pyspark.sql.functions import count

employees.select(count("first_name")).show()

+-----------------+
|count(first_name)|
+-----------------+
|           300024|
+-----------------+



24. Count the distinct first names in employees DF.

In [40]:
from pyspark.sql.functions import countDistinct

employees.select(countDistinct("first_name").alias("Distinct First Name")).show()

+-------------------+
|Distinct First Name|
+-------------------+
|               1275|
+-------------------+



In [38]:
from pyspark.sql.functions import countDistinct

employees.agg(countDistinct("first_name").alias("Distinct First Name")).show()

+-------------------+
|Distinct First Name|
+-------------------+
|               1275|
+-------------------+



On a large dataset, an exact distinct count can be expensive to compute. The `approx_count_distinct()` method returns an approximate distinct count instead. Its parameters are the column and `rsd`, the maximum relative standard deviation allowed for the estimate. The default is 0.05. If `rsd` is less than 0.01, `countDistinct()` is more efficient to use.

25. Count approximate distinct count of emp_no from employees DF.

In [43]:
from pyspark.sql.functions import approx_count_distinct

employees.select(approx_count_distinct("emp_no", 0.1).alias("Approx Count Distinct")).show()

+---------------------+
|Approx Count Distinct|
+---------------------+
|               276091|
+---------------------+
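
To judge how far this estimate is from the truth, it can be compared with the exact count. A minimal sketch in plain Python, assuming emp_no is unique so that the exact distinct count equals the 300024 rows reported by `count("*")` earlier; the helper name `relative_error` is hypothetical:

```python
def relative_error(exact, approx):
    """Return |approx - exact| / exact."""
    return abs(approx - exact) / exact


# Values copied from the outputs above. Treating 300024 as the exact
# distinct count is an assumption (emp_no has no duplicates).
exact_count = 300024
approx_count = 276091
rsd = 0.1  # the value passed to approx_count_distinct above

error = relative_error(exact_count, approx_count)
print(f"relative error: {error:.4f}")
print("within rsd" if error <= rsd else "outside rsd")
```

Because `rsd` describes a standard deviation rather than a hard bound, an individual estimate can occasionally land outside it. A smaller `rsd` tightens the estimate at the cost of more memory.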



26. Find the first and last emp_no from employees DF.

In [41]:
from pyspark.sql.functions import first, last

employees.select(first("emp_no").alias("First Record"), last("emp_no").alias("Last Record")).show()

+------------+-----------+
|First Record|Last Record|
+------------+-----------+
|       10001|     499999|
+------------+-----------+



27. Find minimum and maximum salary from salaries DF.

In [29]:
from pyspark.sql.functions import min, max

salaries.select(min("salary"), max("salary")).show()

+-----------+-----------+
|min(salary)|max(salary)|
+-----------+-----------+
|      38623|     158220|
+-----------+-----------+



28. Find the sum of salary from salaries DF.

In [31]:
from pyspark.sql.functions import sum

salaries.select(sum("salary")).show()

+------------+
| sum(salary)|
+------------+
|181480757419|
+------------+



29. Find the distinct sum of salary from salaries DF.

In [32]:
from pyspark.sql.functions import sumDistinct

salaries.select(sumDistinct("salary")).show()

+--------------------+
|sum(DISTINCT salary)|
+--------------------+
|          7078688488|
+--------------------+



30. Find the average salary from salaries DF.

In [36]:
from pyspark.sql.functions import avg

salaries.select(avg("salary").alias("Average")).show()

+------------------+
|           Average|
+------------------+
|63810.744836143705|
+------------------+



31. Find the mean salary from salaries DF.

In [34]:
from pyspark.sql.functions import mean

salaries.select(mean("salary")).show()

+------------------+
|       avg(salary)|
+------------------+
|63810.744836143705|
+------------------+



32. Find the count, sum and average salary from salaries DF.

In [35]:
from pyspark.sql.functions import avg, count, sum

salaries.select(
    avg("salary").alias("Average"),
    count("salary").alias("Count"),
    sum("salary").alias("Sum"),
).show()

+------------------+-------+------------+
|           Average|  Count|         Sum|
+------------------+-------+------------+
|63810.744836143705|2844047|181480757419|
+------------------+-------+------------+



33. Find the correlation, sample covariance and population covariance between id and emp_no from highest_salary_employee DF. These two columns are not a meaningful pair to correlate; they are used only to demonstrate the functions. For details about correlation and covariance, see https://towardsdatascience.com/let-us-understand-the-correlation-matrix-and-covariance-matrix-d42e6b643c22

In [49]:
from pyspark.sql.functions import corr, covar_pop, covar_samp

highest_salary_employee.select(corr("id", "emp_no"), covar_samp("id", "emp_no"),\
                         covar_pop("id", "emp_no")).show()

+-------------------+----------------------+---------------------+
|   corr(id, emp_no)|covar_samp(id, emp_no)|covar_pop(id, emp_no)|
+-------------------+----------------------+---------------------+
|-0.4156756782643663|   -132.73809523809305|  -126.70454545454336|
+-------------------+----------------------+---------------------+
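
For reference, population covariance divides the summed products of deviations from the means by n, sample covariance divides the same sum by n - 1, and correlation is the covariance divided by the product of the two standard deviations. The plain-Python sketch below spells out these definitions and checks the output above for consistency. It assumes highest_salary_employee has 22 rows with no nulls in id or emp_no (the per-gender counts later in this lab show 11 + 11). The `py_` function names are hypothetical and chosen so they do not shadow the Spark functions imported above.

```python
from math import fsum, sqrt


def py_covar_pop(xs, ys):
    """Population covariance: summed products of deviations divided by n."""
    n = len(xs)
    mean_x, mean_y = fsum(xs) / n, fsum(ys) / n
    return fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def py_covar_samp(xs, ys):
    """Sample covariance: the same sum divided by n - 1."""
    n = len(xs)
    return py_covar_pop(xs, ys) * n / (n - 1)


def py_corr(xs, ys):
    """Pearson correlation: covariance over the product of std deviations."""
    return py_covar_pop(xs, ys) / sqrt(py_covar_pop(xs, xs) * py_covar_pop(ys, ys))


# Consistency check on the Spark output above: covar_samp should equal
# covar_pop * n / (n - 1). n = 22 is an assumption (see the lead-in).
n = 22
spark_covar_pop = -126.70454545454336
spark_covar_samp = -132.73809523809305
consistent = abs(spark_covar_pop * n / (n - 1) - spark_covar_samp) < 1e-9
print(consistent)
```

`fsum` is used instead of the built-in `sum` because earlier cells import `sum` from `pyspark.sql.functions`, which replaces the built-in in the notebook namespace.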



34. Find total count of employee based on gender from employees DF.

In [51]:
employees.groupBy("gender").count().show()

+------+------+
|gender| count|
+------+------+
|     F|120051|
|     M|179973|
+------+------+



35. Find the total sum of salary based on gender from highest_salary_employee DF.

In [63]:
from pyspark.sql.functions import sum

highest_salary_employee.groupBy("gender")\
        .agg(sum("salary")).show()

+------+-----------+
|gender|sum(salary)|
+------+-----------+
|     F|    1092209|
|     M|    1112980|
+------+-----------+



36. Find the total count of salary based on gender from highest_salary_employee DF.

In [64]:
from pyspark.sql.functions import count

highest_salary_employee.groupBy("gender")\
        .agg(count("salary")).show()

+------+-------------+
|gender|count(salary)|
+------+-------------+
|     F|           11|
|     M|           11|
+------+-------------+



In [69]:
from pyspark.sql.functions import expr

highest_salary_employee.groupBy("gender")\
        .agg(expr("count(salary) as CountOfSalary")).show()

+------+-------------+
|gender|CountOfSalary|
+------+-------------+
|     F|           11|
|     M|           11|
+------+-------------+



**Window Functions**

In [None]:
#@todo

**Writing DataFrame into External Sources**

Spark uses `DataFrameWriter` to write a DataFrame to external storage. Once the dataset has been processed and transformed, the result is stored in an external filesystem, a database or a streaming application for reporting, analysis or machine learning consumption. We'll write the final DataFrame to several output formats. [For more information](https://spark.apache.org/docs/latest/sql-data-sources.html).

Let's assume our final DataFrame is empDF.

In [8]:
from pyspark.sql.functions import col, concat_ws

empDF = employees.withColumn("full_name", concat_ws(" ", "first_name","last_name"))
empDF.show(10)

+------+----------+----------+---------+------+----------+------------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|         full_name|
+------+----------+----------+---------+------+----------+------------------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|    Georgi Facello|
| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|    Bezalel Simmel|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|     Parto Bamford|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01| Chirstian Koblick|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12|  Kyoichi Maliniak|
| 10006|1953-04-20|    Anneke|  Preusig|     F|1989-06-02|    Anneke Preusig|
| 10007|1957-05-23|   Tzvetan|Zielinski|     F|1989-02-10| Tzvetan Zielinski|
| 10008|1958-02-19|    Saniya| Kalloufi|     M|1994-09-15|   Saniya Kalloufi|
| 10009|1952-04-19|    Sumant|     Peac|     F|1985-02-18|       Sumant Peac|
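| 10010|1963-06-01| Duangkaew| Piveteau|     F|1989-08-24|Duangkaew Piveteau|
+------+----------+----------+---------+------+----------+------------------+
only showing top 10 rows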

**Writing DataFrame to CSV File**

In [3]:
# Writing DF to CSV File

file_path = '/tmp/employee/spark_csv'

empDF.write.format("csv")\
    .mode("overwrite")\
    .option("path", file_path)\
    .save()

**Writing DataFrame to Avro File**

In [10]:
# Writing DF to Avro File

file_path = '/tmp/employee/spark_avro'

empDF.write.format("avro")\
    .mode("overwrite")\
    .option("path", file_path)\
    .save()

AnalysisException: Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
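
The error above means the external spark-avro module is not on the classpath. One way to add it is to set `spark.jars.packages` before the session is created. This is a minimal sketch, not a verified setup for this lab: the Scala version 2.11 and Spark version 2.4.4 are assumptions and must match the local installation, and both helper names are hypothetical.

```python
def avro_package(scala_version="2.11", spark_version="2.4.4"):
    """Build the Maven coordinate of the spark-avro module.

    The default versions are assumptions; use the ones of your installation.
    """
    return f"org.apache.spark:spark-avro_{scala_version}:{spark_version}"


def build_session_with_avro(app_name="Spark Lab One"):
    """Create a SparkSession that downloads spark-avro at startup."""
    # Imported here so avro_package() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local")
        .appName(app_name)
        .config("spark.jars.packages", avro_package())
        .getOrCreate()
    )


print(avro_package())
```

Because `getOrCreate()` reuses a running session, the package setting is only picked up when the session is started fresh, for example after restarting the notebook kernel.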

**Writing DataFrame to ORC File**

In [11]:
# Writing DF to ORC File

file_path = '/tmp/employee/spark_orc'

empDF.write.format("orc")\
    .mode("overwrite")\
    .option("path", file_path)\
    .save()

**Writing DataFrame to Parquet File**

In [12]:
# Writing DF to Parquet File

file_path = '/tmp/employee/spark_parquet'

empDF.write.format("parquet")\
    .mode("overwrite")\
    .option("path", file_path)\
    .save()

In [13]:
# Reading the Parquet file back into a DataFrame.
# mode() belongs to DataFrameWriter only; DataFrameReader has no such method.

file_path = '/tmp/employee/spark_parquet'

emp = spark.read.format("parquet")\
    .load(file_path)

**Writing DataFrame to JSON File**

In [None]:
# Writing DF to JSON File

file_path = '/tmp/employee/spark_json'

empDF.write.format("json")\
    .mode("overwrite")\
    .option("path", file_path)\
    .save()
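
The write cells above differ only in the format name and the target path, so they can be folded into one loop. A sketch under that assumption: `target_path` and `write_formats` are hypothetical helpers, and avro is left out of the default list because it needs the external spark-avro module.

```python
def target_path(file_format, base_dir="/tmp/employee"):
    """Return the output directory used in this lab for a given format."""
    return f"{base_dir}/spark_{file_format}"


def write_formats(df, formats=("csv", "orc", "parquet", "json"), mode="overwrite"):
    """Write df once per format and return the list of paths written."""
    paths = []
    for file_format in formats:
        path = target_path(file_format)
        # Same DataFrameWriter chain as in the individual cells above.
        df.write.format(file_format).mode(mode).option("path", path).save()
        paths.append(path)
    return paths


print(target_path("csv"))
```

With the DataFrame from this section, the call would be `write_formats(empDF)`.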

**Writing DataFrame to Database**

In [None]:
from datetime import datetime
#from pyspark.sql.functions import cast

# get current date
current_date = datetime.now().strftime('%Y-%m-%d')
table_name = "analytics_tensor.spark_employees_etl"  # + current_date


# Cast birth_date from date to string, keeping the original column name.
empDF_final = empDF.selectExpr("emp_no", "cast(birth_date as string) as birth_date",\
                "full_name")

empDF_final.show(10)

# Write to MySQL using the existing connection settings.
# mode() must be set before jdbc(), because jdbc() performs the write itself.
empDF_final.write.mode('append').jdbc(url = db_url, table = table_name, properties = db_properties)

+------+----------+------------------+
|emp_no|birth_date|         full_name|
+------+----------+------------------+
| 10001|1953-09-02|    Georgi Facello|
| 10002|1964-06-02|    Bezalel Simmel|
| 10003|1959-12-03|     Parto Bamford|
| 10004|1954-05-01| Chirstian Koblick|
| 10005|1955-01-21|  Kyoichi Maliniak|
| 10006|1953-04-20|    Anneke Preusig|
| 10007|1957-05-23| Tzvetan Zielinski|
| 10008|1958-02-19|   Saniya Kalloufi|
| 10009|1952-04-19|       Sumant Peac|
| 10010|1963-06-01|Duangkaew Piveteau|
+------+----------+------------------+
only showing top 10 rows



**Usecase-1**  

Reading: Load the employees table from the MySQL database into Spark.

Requirement: The report must include the following information:   
1. Calculate the employee's current age.
2. Calculate the total number of years worked by the employee.
3. Find the employee's age when they were hired at the company.
4. Show the employee's birth year.
5. Create an abbreviated employee name made of the first 2 characters of the last name followed by the full first name, all in lower case.
6. Reverse the employee number.

Ordering: Sort the data by employee abbreviated name in ascending order.

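The column order, attribute names and types are defined below; the number in parentheses is the matching requirement above:
1. id (type: integer, requirement 6)
2. user (type: string, requirement 5)
3. age (type: integer, requirement 1)
4. birth_year (type: integer, requirement 4)
5. start_age (type: integer, requirement 3)
6. year_worked (type: integer, requirement 2)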
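
One way to express the six requirements is as Spark SQL expressions passed to `selectExpr`. This is a sketch only, not the reference solution. It assumes the input has the columns of the employees table shown earlier, approximates ages as whole years via `months_between(...) / 12`, and counts years worked from hire_date up to today because the employees table has no end date. `REPORT_EXPRS` and `build_report` are hypothetical names.

```python
# Expressions are listed in the required output column order.
REPORT_EXPRS = [
    # Requirement 6: reversed employee number, e.g. 10002 -> 20001.
    "cast(reverse(cast(emp_no as string)) as int) as id",
    # Requirement 5: first 2 characters of last name + first name, lower case.
    "lower(concat(substring(last_name, 1, 2), first_name)) as `user`",
    # Requirement 1: current age in whole years.
    "cast(floor(months_between(current_date(), birth_date) / 12) as int) as age",
    # Requirement 4: birth year.
    "year(birth_date) as birth_year",
    # Requirement 3: age in whole years on the hire date.
    "cast(floor(months_between(hire_date, birth_date) / 12) as int) as start_age",
    # Requirement 2: whole years between the hire date and today.
    "cast(floor(months_between(current_date(), hire_date) / 12) as int) as year_worked",
]


def build_report(df):
    """Project the report columns and sort by the abbreviated name."""
    return df.selectExpr(*REPORT_EXPRS).orderBy("user")


print(len(REPORT_EXPRS))
```

A call such as `build_report(employees).show(10)` would preview the report. Note that reversing a number ending in zero drops the leading zeros after the cast, so 10010 becomes 1001.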

Output 1: The output file must be written in each file type shown below. The directory structure is defined below:   
BASE_DIR = `/opt/spark_processing/data/employee`   
FILE_TYPE = { `csv | parquet | orc | avro | json` }   
DATA = { `employee` }   
CURRENT_DATE = `now()`   
FILE_NAME = `spark_$DATA_$CURRENT_DATE`   
LOCATION = `$BASE_DIR/$FILE_NAME/$DATA/$FILE_NAME`     
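
The naming scheme above can be assembled with a small helper. This is a sketch under two assumptions: CURRENT_DATE is formatted as YYYY-MM-DD, as in the database-write cell earlier, and the first path component after BASE_DIR is taken to be the file type, whereas the LOCATION template as written repeats FILE_NAME there and never uses FILE_TYPE. The function name `output_location` is hypothetical.

```python
from datetime import date

BASE_DIR = "/opt/spark_processing/data/employee"
FILE_TYPES = ("csv", "parquet", "orc", "avro", "json")


def output_location(file_type, data="employee", today=None):
    """Build the output directory for one file type."""
    if file_type not in FILE_TYPES:
        raise ValueError(f"unsupported file type: {file_type}")
    # The date format is an assumption; the spec only says now().
    current_date = (today or date.today()).strftime("%Y-%m-%d")
    file_name = f"spark_{data}_{current_date}"
    # Assumed layout: $BASE_DIR/$FILE_TYPE/$DATA/$FILE_NAME
    return f"{BASE_DIR}/{file_type}/{data}/{file_name}"


print(output_location("csv", today=date(2019, 12, 15)))
```

Passing `today` explicitly keeps the result reproducible; leaving it out uses the current date.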
 
Output 2: MySQL table (database name: analytics_tensor, table name: spark_employees_current_date)