# Final Project: Data Analysis using Spark

Estimated time needed: **60** minutes

This final project is similar to the Practice Project you completed earlier. You will create a DataFrame by loading data from a CSV file, then apply transformations and actions using Spark SQL. Complete the following tasks:

- Task 1: Generate DataFrame from CSV data.
- Task 2: Define a schema for the data.
- Task 3: Display schema of DataFrame.
- Task 4: Create a temporary view.
- Task 5: Execute an SQL query.
- Task 6: Calculate Average Salary by Department.
- Task 7: Filter and Display IT Department Employees.
- Task 8: Add 10% Bonus to Salaries.
- Task 9: Find Maximum Salary by Age.
- Task 10: Self-Join on Employee Data.
- Task 11: Calculate Average Employee Age.
- Task 12: Calculate Total Salary by Department.
- Task 13: Sort Data by Age and Salary.
- Task 14: Count Employees in Each Department.
- Task 15: Filter Employees with the letter o in the Name.


### Prerequisites 

1. This lab assignment uses Python and Spark (PySpark), so make sure that the following libraries are installed in your lab environment:


In [None]:
# Installing required packages  
!pip install pyspark findspark wget


In [None]:
import findspark
findspark.init()

In [None]:
# PySpark is the Spark API for Python. In this lab, we use PySpark to initialize the SparkContext.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

In [None]:
# Creating a SparkContext object  
sc = SparkContext.getOrCreate()

# Creating a SparkSession  
spark = SparkSession \
    .builder \
    .appName("Python Spark DataFrames basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

2. Download the CSV data.  


In [None]:
# Download the CSV data first into a local `employees.csv` file
import wget
wget.download("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/data/employees.csv")
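
Before loading the file into Spark, it can help to confirm that the download succeeded and that the header row looks as expected. The sketch below is a plain-Python check using only the standard library. The helper name `read_csv_header` is hypothetical, and the demo writes a tiny stand-in file with made-up rows rather than touching the real `employees.csv`; the column names are the ones used by the schema in Task 2.

```python
import csv
import os
import tempfile


def read_csv_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="") as f:
        return next(csv.reader(f))


# Demo on a small stand-in file; the data row is made up for illustration.
sample = "Emp_No,Emp_Name,Salary,Age,Department\n1,Ann,3000,26,IT\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "employees_demo.csv")
    with open(demo_path, "w", newline="") as f:
        f.write(sample)
    header = read_csv_header(demo_path)

print(header)
```

In the lab itself, calling `read_csv_header("employees.csv")` after the download cell would show the real header.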

### Tasks


#### Task 1: Generate a Spark DataFrame from the CSV data

Read data from the provided CSV file, `employees.csv`, and import it into a Spark DataFrame variable named `employees_df`.

 


In [None]:
# Read data from the "employees.csv" CSV file and import it into a DataFrame variable named "employees_df"
employees_df = spark.read.csv("employees.csv", header=True, inferSchema=True)
print("Successfully loaded employees.csv into DataFrame")
print(f"Number of rows: {employees_df.count()}")
print(f"Number of columns: {len(employees_df.columns)}")

#### Task 2: Define a schema for the data

Construct a schema for the input data, then use that schema to read the CSV file into a DataFrame named `employees_df`.  


In [None]:
# Define a Schema for the input data and read the file using the user-defined Schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# First, let's examine the current structure
print("Current DataFrame schema:")
employees_df.printSchema()
print("\nSample data:")
employees_df.show(5)

# Define a schema based on the expected structure
schema = StructType([
    StructField("Emp_No", IntegerType(), True),
    StructField("Emp_Name", StringType(), True),
    StructField("Salary", DoubleType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Department", StringType(), True)
])

# Read the CSV file with the defined schema
employees_df = spark.read.csv("employees.csv", header=True, schema=schema)
print("\nDataFrame loaded with defined schema:")
employees_df.printSchema()

#### Task 3: Display schema of DataFrame

Display the schema of the `employees_df` DataFrame, showing all columns and their respective data types.  


In [None]:
# Display all columns of the DataFrame, along with their respective data types
print("Schema of employees_df DataFrame:")
employees_df.printSchema()

print("\nColumn names and data types:")
for field in employees_df.schema.fields:
    print(f"Column: {field.name}, Data Type: {field.dataType}, Nullable: {field.nullable}")

print("\nSample data:")
employees_df.show(10)

#### Task 4: Create a temporary view

Create a temporary view named `employees` for the `employees_df` DataFrame, enabling Spark SQL queries on the data. 


In [None]:
# Create a temporary view named "employees" for the DataFrame
employees_df.createOrReplaceTempView("employees")
print("Successfully created temporary view 'employees'")

# Verify the view is created by running a simple query
result = spark.sql("SELECT COUNT(*) as total_employees FROM employees")
result.show()
print("Temporary view is working correctly!")

#### Task 5: Execute an SQL query

Compose and execute an SQL query to fetch the records from the `employees` view where the age of employees exceeds 30. Then, display the result of the SQL query, showcasing the filtered records.


In [None]:
# SQL query to fetch only the records from the view where the age exceeds 30
query = "SELECT * FROM employees WHERE Age > 30"
result_df = spark.sql(query)

print("Employees with age > 30:")
result_df.show()

print(f"\nTotal number of employees with age > 30: {result_df.count()}")

#### Task 6: Calculate Average Salary by Department

Compose an SQL query to retrieve the average salary of employees grouped by department. Display the result.


In [None]:
# SQL query to calculate the average salary of employees grouped by department
query = "SELECT Department, AVG(Salary) as Average_Salary FROM employees GROUP BY Department ORDER BY Average_Salary DESC"
avg_salary_by_dept = spark.sql(query)

print("Average salary by department:")
avg_salary_by_dept.show()
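
The queries in Tasks 5 and 6 are ordinary SQL, so their logic can be tried out on a few rows without starting Spark. The sketch below runs them against an in-memory SQLite table from Python's standard library. The sample rows are made up, and the filter query selects only `Emp_Name` and adds an `ORDER BY Emp_No` (not in the lab query) so the output order is deterministic. SQLite and Spark SQL differ in many details, so treat this only as a way to check the query logic.

```python
import sqlite3

# Hypothetical sample rows: (Emp_No, Emp_Name, Salary, Age, Department)
rows = [
    (1, "Ann", 3000.0, 28, "IT"),
    (2, "Bob", 5000.0, 35, "IT"),
    (3, "Cy", 4000.0, 41, "HR"),
    (4, "Dee", 2000.0, 30, "HR"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(Emp_No INTEGER, Emp_Name TEXT, Salary REAL, Age INTEGER, Department TEXT)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", rows)

# Task 5 logic: strictly older than 30, so Dee (exactly 30) is excluded.
older_than_30 = conn.execute(
    "SELECT Emp_Name FROM employees WHERE Age > 30 ORDER BY Emp_No"
).fetchall()

# Task 6 logic: one output row per department, highest average first.
avg_by_dept = conn.execute(
    "SELECT Department, AVG(Salary) AS Average_Salary "
    "FROM employees GROUP BY Department ORDER BY Average_Salary DESC"
).fetchall()
conn.close()

print(older_than_30)
print(avg_by_dept)
```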

#### Task 7: Filter and Display IT Department Employees

Apply a filter on the `employees_df` DataFrame to select records where the department is `'IT'`. Display the filtered DataFrame.


In [None]:
# Apply a filter to select records where the department is 'IT'
it_employees = employees_df.filter(employees_df.Department == 'IT')

print("IT Department employees:")
it_employees.show()

print(f"\nTotal number of IT employees: {it_employees.count()}")

#### Task 8: Add 10% Bonus to Salaries

Perform a transformation to add a new column named `SalaryAfterBonus` to the DataFrame. Calculate the new salary by adding a 10% bonus to each employee's salary.


In [None]:
from pyspark.sql.functions import col

# Add a new column "SalaryAfterBonus" with 10% bonus added to the original salary
employees_with_bonus = employees_df.withColumn("SalaryAfterBonus", col("Salary") * 1.10)

print("Employees with salary after 10% bonus:")
employees_with_bonus.select("Emp_No", "Emp_Name", "Salary", "SalaryAfterBonus").show()

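# Rebind employees_df so that later cells see the new column.
# The "employees" temporary view created in Task 4 still refers to the DataFrame without it.
employees_df = employees_with_bonus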
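
Multiplying a double by `1.10` uses binary floating-point arithmetic, so the `SalaryAfterBonus` column may show values with a long tail of decimal digits. The sketch below reproduces the same arithmetic on plain Python floats and shows how rounding to two decimals tidies the result. The salary values are made up for illustration. In PySpark, `pyspark.sql.functions.round` could play a similar role, although the lab does not require it.

```python
# Hypothetical sample salaries, stored as floats like the Double column.
salaries = [3000.0, 4500.0, 7250.0]

# Same arithmetic as col("Salary") * 1.10, done on Python floats.
with_bonus = [s * 1.10 for s in salaries]

# Rounding to two decimals hides binary floating-point noise.
rounded = [round(v, 2) for v in with_bonus]

print(with_bonus)
print(rounded)
```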

#### Task 9: Find Maximum Salary by Age

Group the data by age and calculate the maximum salary for each age group. Display the result.


In [None]:
from pyspark.sql import functions as F

# Group data by age and calculate the maximum salary for each age group.
# F.max is used so that Python's built-in max() is not shadowed.
max_salary_by_age = employees_df.groupBy("Age").agg(F.max("Salary").alias("Max_Salary")).orderBy("Age")

print("Maximum salary by age:")
max_salary_by_age.show()

#### Task 10: Self-Join on Employee Data

Join the `employees_df` DataFrame with itself based on the `Emp_No` column. Display the result.


In [None]:
# Join the DataFrame with itself based on the "Emp_No" column
from pyspark.sql.functions import col

# Alias each side so that columns can be referenced unambiguously
employees_alias1 = employees_df.alias("emp1")
employees_alias2 = employees_df.alias("emp2")

# Inner self-join on Emp_No: each row pairs with every row sharing its Emp_No,
# so unique employee numbers yield exactly one output row per employee.
# Alias-qualified column names keep the two sides of the condition distinct.
self_joined = employees_alias1.join(employees_alias2,
                                   col("emp1.Emp_No") == col("emp2.Emp_No"),
                                   "inner")

print("Self-joined DataFrame on Emp_No:")
# Select specific columns to make the output more readable
self_joined.select("emp1.Emp_No", "emp1.Emp_Name", "emp1.Salary", 
                  "emp2.Emp_No", "emp2.Emp_Name", "emp2.Salary").show(10)

print(f"\nTotal records after self-join: {self_joined.count()}")
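
The record count printed above depends on whether `Emp_No` is unique. The sketch below is a naive plain-Python inner join on made-up rows, not Spark's actual join algorithm. It shows that unique keys give one output row per employee, while a duplicated key multiplies the matches for that key. The helper name `inner_join_on_key` is hypothetical.

```python
def inner_join_on_key(left, right, key):
    """Naive inner join: pair each left row with every right row sharing the key."""
    return [(a, b) for a in left for b in right if a[key] == b[key]]


# Hypothetical rows with unique employee numbers.
unique = [
    {"Emp_No": 1, "Emp_Name": "Ann"},
    {"Emp_No": 2, "Emp_Name": "Bob"},
    {"Emp_No": 3, "Emp_Name": "Cy"},
]
unique_pairs = inner_join_on_key(unique, unique, "Emp_No")

# Adding a second row with Emp_No 3 makes that key match 2 x 2 = 4 times.
with_dup = unique + [{"Emp_No": 3, "Emp_Name": "Cy (duplicate)"}]
dup_pairs = inner_join_on_key(with_dup, with_dup, "Emp_No")

print(len(unique_pairs), len(dup_pairs))
```

If the self-join count in the lab is larger than the number of rows in `employees_df`, duplicated `Emp_No` values are a likely explanation.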

#### Task 11: Calculate Average Employee Age

Calculate the average age of employees using the built-in aggregation function. Display the result.


In [None]:
# Calculate the average age of employees
from pyspark.sql.functions import avg 

avg_age = employees_df.agg(avg("Age").alias("Average_Age"))

print("Average age of employees:")
avg_age.show()

# Alternative: collect the value and print it
avg_age_value = avg_age.collect()[0]["Average_Age"]
print(f"\nThe average age of employees is: {avg_age_value:.2f} years")

#### Task 12: Calculate Total Salary by Department

Calculate the total salary for each department using the built-in aggregation function. Display the result.


In [None]:
# Calculate the total salary for each department using groupBy and an aggregate function.
# F.sum is used so that Python's built-in sum() is not shadowed.
from pyspark.sql import functions as F

total_salary_by_dept = employees_df.groupBy("Department").agg(F.sum("Salary").alias("Total_Salary")).orderBy("Total_Salary", ascending=False)

print("Total salary by department:")
total_salary_by_dept.show()

#### Task 13: Sort Data by Age and Salary

Apply a transformation to sort the DataFrame by age in ascending order and then by salary in descending order. Display the sorted DataFrame.


In [None]:
# Sort the DataFrame by age in ascending order and then by salary in descending order
sorted_employees = employees_df.orderBy("Age", col("Salary").desc())

print("Employees sorted by age (ascending) and salary (descending):")
sorted_employees.select("Emp_No", "Emp_Name", "Age", "Salary", "Department").show()
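
A two-key sort like this one orders rows by the first key and uses the second key only to break ties. The sketch below shows the same ordering on made-up tuples in plain Python, where negating the salary gives descending order within each age.

```python
# Hypothetical sample rows: (Emp_Name, Age, Salary)
employees = [
    ("Ann", 30, 4000.0),
    ("Bob", 25, 3000.0),
    ("Cy", 30, 5500.0),
    ("Dee", 25, 3500.0),
]

# Age ascending first; within the same age, salary descending.
ordered = sorted(employees, key=lambda e: (e[1], -e[2]))
names = [e[0] for e in ordered]

print(names)
```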

#### Task 14: Count Employees in Each Department

Calculate the number of employees in each department. Display the result.


In [None]:
from pyspark.sql.functions import count

# Calculate the number of employees in each department
employee_count_by_dept = employees_df.groupBy("Department").agg(count("Emp_No").alias("Employee_Count")).orderBy("Employee_Count", ascending=False)

print("Number of employees in each department:")
employee_count_by_dept.show()
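
Counting a specific column, as `count("Emp_No")` does, counts only the rows where that column is not null, whereas counting rows includes every row in the group. The two agree when `Emp_No` has no missing values. The sketch below shows the difference on made-up rows in plain Python; whether the lab data contains any missing `Emp_No` values is not assumed here.

```python
from collections import Counter

# Hypothetical rows; one IT employee has a missing employee number.
rows = [
    {"Emp_No": 1, "Department": "IT"},
    {"Emp_No": 2, "Department": "IT"},
    {"Emp_No": None, "Department": "IT"},
    {"Emp_No": 4, "Department": "HR"},
]

# Row count per department (every row counts).
count_all = Counter(r["Department"] for r in rows)

# Column count per department (rows with a missing Emp_No are skipped).
count_non_null = Counter(
    r["Department"] for r in rows if r["Emp_No"] is not None
)

print(dict(count_all))
print(dict(count_non_null))
```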

#### Task 15: Filter Employees with the letter o in the Name

Apply a filter to select records where the employee's name contains the letter `'o'`. Display the filtered DataFrame.


In [None]:
# Apply a filter to select records where the employee's name contains the letter 'o'
employees_with_o = employees_df.filter(col("Emp_Name").contains("o"))

print("Employees with letter 'o' in their name:")
employees_with_o.select("Emp_No", "Emp_Name", "Salary", "Age", "Department").show()

print(f"\nTotal number of employees with 'o' in their name: {employees_with_o.count()}")
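
The `contains("o")` filter is case-sensitive, so a name whose only "o" is an uppercase "O" is not selected. The sketch below shows the difference on made-up names in plain Python. In PySpark, applying `pyspark.sql.functions.lower` to the column before calling `contains` would be one way to get a case-insensitive match, if that is what you want.

```python
# Hypothetical sample names.
names = ["Oscar", "Tom", "Olivia", "Ann"]

# Case-sensitive match, analogous to col("Emp_Name").contains("o").
case_sensitive = [n for n in names if "o" in n]

# Case-insensitive match: lowercase the name before testing.
case_insensitive = [n for n in names if "o" in n.lower()]

print(case_sensitive)
print(case_insensitive)
```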

# Congratulations! You have completed the project.

Now you know how to create a DataFrame from a CSV data file and perform a variety of DataFrame transformations and actions using Spark SQL.
