### Project Description
The HR department has asked me, as a data engineer, to build a data pipeline that ingests employee data in CSV format. My job is to analyze the data, apply any required transformations, and make it easy to extract useful insights from the processed data.
I have been asked to use Apache Spark components (PySpark, Spark SQL) to accomplish these tasks.

### Get files ready
- Create a directory for data
- Download the file I will work on

Both steps are done with shell commands.

In [None]:
# -f keeps the cleanup from failing when the directory does not exist yet
!rm -rf data
!mkdir data
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/data/employees.csv
!mv employees.csv data/
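
The same preparation can also be done from Python, which avoids failing on the first run, when `data` does not exist yet. This is only a sketch: the helper names `prepare_data_dir` and `download_file` are my own, and the download call is left commented out so nothing is fetched until you enable it.

```python
import shutil
import urllib.request
from pathlib import Path

# URL taken from the wget command above
EMPLOYEES_URL = (
    "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/"
    "IBM-BD0225EN-SkillsNetwork/data/employees.csv"
)


def prepare_data_dir(path="data"):
    """Remove the directory if it exists, then create it empty."""
    data_dir = Path(path)
    shutil.rmtree(data_dir, ignore_errors=True)  # no error if it is missing
    data_dir.mkdir(parents=True)
    return data_dir


def download_file(url, data_dir):
    """Download url into data_dir and return the local path."""
    target = Path(data_dir) / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, target)
    return target


# Usage (uncomment to run; requires network access):
# data_dir = prepare_data_dir("data")
# download_file(EMPLOYEES_URL, data_dir)
```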

### Setting up the environment

In [None]:
import findspark
from pyspark import SparkContext
from pyspark.sql import SparkSession, functions as F, types as T

In [None]:
findspark.init()

In [None]:
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.appName("Hr department project").getOrCreate()

if 'spark' in locals() and isinstance(spark, SparkSession):
    print("Spark Session is Active")
else:
    print('Spark Session Failed to Start')

In [None]:
spark

In [None]:
# Read the CSV and let Spark infer the schema automatically
employee_df = spark.read.csv('data/employees.csv', header=True, inferSchema=True)
employee_df.show(2)

In [None]:
employee_df.printSchema()

In [None]:
# Define an explicit schema instead of relying on inference
my_schema = T.StructType(
    [
        T.StructField('Emp_No', T.IntegerType(), True),
        T.StructField('Emp_Name', T.StringType(), True),
        T.StructField('Salary', T.IntegerType(), True),
        T.StructField('Age', T.IntegerType(), True),
        T.StructField('Department', T.StringType(), True),
    ]
)

In [None]:
# Create a Spark DataFrame using my_schema
employee_df = spark.read.csv('data/employees.csv', header=True, schema=my_schema)
employee_df.show(2)
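
When an explicit schema is supplied, the column names come from the schema rather than from the file (as far as I understand Spark's default CSV behaviour), so a renamed or reordered header can go unnoticed. A small plain-Python check before the Spark read can catch that. This is a sketch, not part of the original pipeline: `check_header` is a hypothetical helper, the expected names are simply the field names of `my_schema`, and the in-memory sample is illustrative.

```python
import csv
import io

# Field names of my_schema, in order
EXPECTED_COLUMNS = ["Emp_No", "Emp_Name", "Salary", "Age", "Department"]


def check_header(lines, expected=EXPECTED_COLUMNS):
    """Return True if the first CSV row matches the expected column names."""
    header = next(csv.reader(lines), [])
    return [name.strip() for name in header] == list(expected)


# Illustrative in-memory sample; for the real file use:
# with open("data/employees.csv", newline="") as f:
#     print(check_header(f))
sample = io.StringIO("Emp_No,Emp_Name,Salary,Age,Department\n198,Donald,2600,29,IT\n")
print(check_header(sample))  # True
```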

In [17]:
# Display the schema of the employee DataFrame
employee_df.printSchema()

root
 |-- Emp_No: integer (nullable = true)
 |-- Emp_Name: string (nullable = true)
 |-- Salary: integer (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Department: string (nullable = true)



In [None]:
# Create a temporary view so the DataFrame can be queried as a SQL table
# First, make sure no temp view with the same name exists
spark.sql("Drop View If Exists employees")
# Then create it
employee_df.createTempView('employees')

### Run some queries

In [None]:
# Show employees whose age exceeds 30
spark.sql('Select * From employees Where Age > 30').show()

### Calculate the average salary by department

In [21]:
spark.sql("Select Department, Round(Avg(Salary), 2) As Average_Salary From employees Group By Department").show()

+----------+--------------+
|Department|Average_Salary|
+----------+--------------+
|     Sales|       5492.92|
|        HR|        5837.5|
|   Finance|        5730.8|
| Marketing|       6633.33|
|        IT|        7400.0|
+----------+--------------+



### Display only employees of IT department

In [None]:
spark.sql("Select * From employees Where Department='IT'").show()

### Add a new column
- Add a 10% bonus to the salaries in a new column named `SalaryAfterBonus`

In [27]:
employee_df = employee_df.withColumn('SalaryAfterBonus', F.round(F.col('Salary') * 1.1, 2))
employee_df.show(5)

+------+--------+------+---+----------+----------------+
|Emp_No|Emp_Name|Salary|Age|Department|SalaryAfterBonus|
+------+--------+------+---+----------+----------------+
|   198|  Donald|  2600| 29|        IT|          2860.0|
|   199| Douglas|  2600| 34|     Sales|          2860.0|
|   200|Jennifer|  4400| 36| Marketing|          4840.0|
|   201| Michael| 13000| 32|        IT|         14300.0|
|   202|     Pat|  6000| 39|        HR|          6600.0|
+------+--------+------+---+----------+----------------+
only showing top 5 rows
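
The rounding to two decimals matters because multiplying by the floating-point literal `1.1` is not exact. The plain-Python sketch below reproduces the calculation for the salaries in the table and compares it with exact decimal arithmetic. Both function names are made up for illustration, and this is ordinary Python, not Spark code.

```python
from decimal import Decimal


def salary_after_bonus(salary, factor=1.1):
    """Float version, rounded to 2 decimals (mirrors the F.round call above)."""
    return round(salary * factor, 2)


def salary_after_bonus_exact(salary, factor="1.1"):
    """Exact decimal version: the factor is parsed from a string, not a float."""
    return Decimal(salary) * Decimal(factor)


# Salaries copied from the first rows of the table above
for salary in (2600, 4400, 13000, 6000):
    print(salary, salary_after_bonus(salary), salary_after_bonus_exact(salary))
```

If exact cents ever matter downstream, a decimal type is the safer choice; for a display column like this one, rounding the float result is usually enough.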


### Display the maximum salary by age

In [29]:
employee_df.groupBy("Age").agg(
    F.max('Salary').alias('Max_Salary')
).orderBy('Max_Salary',ascending=False).show()

+---+----------+
|Age|Max_Salary|
+---+----------+
| 39|     24000|
| 27|     17000|
| 37|     17000|
| 32|     13000|
| 28|     12008|
| 33|     12008|
| 29|     10000|
| 35|      9000|
| 31|      8200|
| 30|      8000|
| 36|      7900|
| 34|      7800|
| 38|      6000|
| 26|      3600|
+---+----------+



### Self-join (just for practicing joins)

In [30]:
employee_df.join(employee_df, on='Emp_No',how='inner').show()

+------+---------+------+---+----------+----------------+---------+------+---+----------+----------------+
|Emp_No| Emp_Name|Salary|Age|Department|SalaryAfterBonus| Emp_Name|Salary|Age|Department|SalaryAfterBonus|
+------+---------+------+---+----------+----------------+---------+------+---+----------+----------------+
|   198|   Donald|  2600| 29|        IT|          2860.0|   Donald|  2600| 29|        IT|          2860.0|
|   199|  Douglas|  2600| 34|     Sales|          2860.0|  Douglas|  2600| 34|     Sales|          2860.0|
|   200| Jennifer|  4400| 36| Marketing|          4840.0| Jennifer|  4400| 36| Marketing|          4840.0|
|   201|  Michael| 13000| 32|        IT|         14300.0|  Michael| 13000| 32|        IT|         14300.0|
|   202|      Pat|  6000| 39|        HR|          6600.0|      Pat|  6000| 39|        HR|          6600.0|
|   203|    Susan|  6500| 36| Marketing|          7150.0|    Susan|  6500| 36| Marketing|          7150.0|
|   204|  Hermann| 10000| 29|   Finan

### Calculate average employee age

In [33]:
employee_df.agg(F.avg(F.col('Age')).alias("Average_Age")).show()

+-----------+
|Average_Age|
+-----------+
|      33.56|
+-----------+



### Calculate the total salaries of each department

In [34]:
employee_df.groupBy('Department').agg(
    F.sum(F.col("Salary")).alias('Total_Salaries')
).orderBy("Total_Salaries",ascending=False).show()

+----------+--------------+
|Department|Total_Salaries|
+----------+--------------+
|        IT|         74000|
|     Sales|         71408|
| Marketing|         59700|
|   Finance|         57308|
|        HR|         46700|
+----------+--------------+



### Sort by more than one column: one in ascending order, another in descending order

In [38]:
employee_df.sort(['Age', 'Salary'],ascending=[True,False]).show()

+------+---------+------+---+----------+----------------+
|Emp_No| Emp_Name|Salary|Age|Department|SalaryAfterBonus|
+------+---------+------+---+----------+----------------+
|   137|   Renske|  3600| 26| Marketing|          3960.0|
|   101|    Neena| 17000| 27|     Sales|         18700.0|
|   114|      Den| 11000| 27|   Finance|         12100.0|
|   108|    Nancy| 12008| 28|     Sales|         13208.8|
|   130|    Mozhe|  2800| 28| Marketing|          3080.0|
|   126|    Irene|  2700| 28|        HR|          2970.0|
|   204|  Hermann| 10000| 29|   Finance|         11000.0|
|   115|Alexander|  3100| 29|   Finance|          3410.0|
|   134|  Michael|  2900| 29|     Sales|          3190.0|
|   198|   Donald|  2600| 29|        IT|          2860.0|
|   140|   Joshua|  2500| 29|   Finance|          2750.0|
|   136|    Hazel|  2200| 29|        IT|          2420.0|
|   120|  Matthew|  8000| 30|        HR|          8800.0|
|   110|     John|  8200| 31| Marketing|          9020.0|
|   127|    Ja
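
The ordering rule (Age ascending, then Salary descending within the same age) can be checked in plain Python on the first rows of the output above. This is only an illustrative sketch using the standard `sorted` function, not Spark; the rows are copied from the table and deliberately shuffled.

```python
# (Emp_No, Emp_Name, Salary, Age) rows copied from the output above, shuffled
rows = [
    (126, "Irene", 2700, 28),
    (101, "Neena", 17000, 27),
    (130, "Mozhe", 2800, 28),
    (137, "Renske", 3600, 26),
    (108, "Nancy", 12008, 28),
    (114, "Den", 11000, 27),
]

# Age ascending, then Salary descending
# (negating a numeric key reverses its order)
ordered = sorted(rows, key=lambda r: (r[3], -r[2]))
for row in ordered:
    print(row)
```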

### Count the number of employees in each department

In [39]:
employee_df.groupBy('Department').agg(
    F.count('Emp_No').alias("Employees_Number")
).show()

+----------+----------------+
|Department|Employees_Number|
+----------+----------------+
|     Sales|              13|
|        HR|               8|
|   Finance|              10|
| Marketing|               9|
|        IT|              10|
+----------+----------------+
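
The three per-department aggregations computed so far (average salary, total salary, and head count) should agree with each other, since average = total / count. As a quick sanity check, the plain-Python sketch below recomputes the averages from the figures printed in the output tables; the dictionary names are mine.

```python
# Figures copied from the output tables in this notebook
total_salaries = {"IT": 74000, "Sales": 71408, "Marketing": 59700,
                  "Finance": 57308, "HR": 46700}
employee_counts = {"Sales": 13, "HR": 8, "Finance": 10, "Marketing": 9, "IT": 10}
reported_averages = {"Sales": 5492.92, "HR": 5837.5, "Finance": 5730.8,
                     "Marketing": 6633.33, "IT": 7400.0}

for dept, total in total_salaries.items():
    # average = total / count, rounded to 2 decimals like the SQL query
    average = round(total / employee_counts[dept], 2)
    status = "ok" if average == reported_averages[dept] else "MISMATCH"
    print(f"{dept:<10} {average:>8} {status}")

print("Total employees:", sum(employee_counts.values()))
```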



### Display only employees that have the character "o" in their names
- Once with SQL
- Once with the DataFrame API (Python)

In [40]:
spark.sql("Select * From employees Where Emp_Name like '%o%'").show()

+------+-----------+------+---+----------+
|Emp_No|   Emp_Name|Salary|Age|Department|
+------+-----------+------+---+----------+
|   198|     Donald|  2600| 29|        IT|
|   199|    Douglas|  2600| 34|     Sales|
|   110|       John|  8200| 31| Marketing|
|   112|Jose Manuel|  7800| 34|        HR|
|   130|      Mozhe|  2800| 28| Marketing|
|   133|      Jason|  3300| 38|     Sales|
|   139|       John|  2700| 36|     Sales|
|   140|     Joshua|  2500| 29|   Finance|
+------+-----------+------+---+----------+



In [None]:
# like is case-sensitive, which matches the behavior of the SQL query
employee_df.filter(F.col('Emp_Name').like("%o%")).show()

+------+-----------+------+---+----------+----------------+
|Emp_No|   Emp_Name|Salary|Age|Department|SalaryAfterBonus|
+------+-----------+------+---+----------+----------------+
|   198|     Donald|  2600| 29|        IT|          2860.0|
|   199|    Douglas|  2600| 34|     Sales|          2860.0|
|   110|       John|  8200| 31| Marketing|          9020.0|
|   112|Jose Manuel|  7800| 34|        HR|          8580.0|
|   130|      Mozhe|  2800| 28| Marketing|          3080.0|
|   133|      Jason|  3300| 38|     Sales|          3630.0|
|   139|       John|  2700| 36|     Sales|          2970.0|
|   140|     Joshua|  2500| 29|   Finance|          2750.0|
+------+-----------+------+---+----------+----------------+
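
Note that SQL `LIKE` is case-sensitive, so `%o%` matches only a lowercase `o`; a case-insensitive match (`ILIKE` in SQL, `Column.ilike` in recent PySpark versions) would also match an uppercase `O`. The plain-Python sketch below mimics the two behaviours. `sql_like` is an illustrative helper, not a Spark function, it ignores escape characters, and "Oscar" is a made-up name that does not come from the data.

```python
import re


def sql_like(value, pattern, case_insensitive=False):
    """Approximate SQL LIKE: % matches any run of characters, _ exactly one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    return re.fullmatch(regex, value, flags) is not None


names = ["Donald", "Jennifer", "Jose Manuel", "Oscar"]  # "Oscar" is hypothetical
print([n for n in names if sql_like(n, "%o%")])
print([n for n in names if sql_like(n, "%o%", case_insensitive=True)])
```

With the data shown here both variants return the same eight rows, so the difference only shows up once a name contains an uppercase `O`.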

