# Spark SQL

This notebook demonstrates how to use Spark SQL to perform data analysis using SQL queries on DataFrames.


In [None]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, col, explode, udf
from pyspark.sql.types import BooleanType

In [None]:
# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Spark SQL Exercises") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Created Successfully!")
print(f"Spark Version: {spark.version}")

## Load and familiarize yourself with the dataset

In [None]:
# Create the datasets directory to store our CSV file
!mkdir -p ../datasets

Let's prepare the dataset.
To get the CSV file into our "datasets" directory, you have two options:
1. **Mount the local data/tickets.csv file** from the repository into your notebook environment via the docker-compose setup.
2. **Upload the CSV file manually** in this Jupyter environment.

You can find the CSV file in the repository or download it directly from [this link](https://www.kaggle.com/datasets/tobiasbueck/multilingual-customer-support-tickets?select=dataset-tickets-multi-lang3-4k.csv). If you download it, save it as `tickets.csv`, the file name the cells below expect.


In [None]:
# Check that the CSV file was uploaded successfully.
!ls -lh ../datasets

In [None]:
# Print the first 5 lines of the CSV file (the header counts as one line)
!head -n 5 ../datasets/tickets.csv

## Read CSV with Spark

In [None]:
# Create a DataFrame from the CSV file
ticketsDF = (
    spark.read
        .option("header", True)
        .option("inferSchema", True)
        .option("multiLine", True)
        .option("escape", "\"")                # handle inner quotes
        .csv("../datasets/tickets.csv")
)

# Print schema and first 5 rows
<YOUR CODE HERE>

## Recap - Basic Filtering, Grouping & Aggregation

In [None]:
# Task 1: Show only high-priority tickets
highPriorityDF = <YOUR CODE HERE>

highPriorityDF.show(truncate=False)

In [None]:
# Task 2: Return the number of tickets by type (Incident, Request…), ordered from the most tickets to the fewest
ticketsByTypeDF = <YOUR CODE HERE>

ticketsByTypeDF.show()


In [None]:
# Task 3: Count tickets by language
ticketsByLangDF = <YOUR CODE HERE>

ticketsByLangDF.show()


## SQL Exercises

To query our DataFrame with SQL, we first need to register it as a temporary view. The view name can then be used like a table name in SQL queries.

In [None]:
ticketsDF.createOrReplaceTempView("tickets")

In [None]:
# Show first 5 rows
spark.sql("""
<YOUR SQL QUERY HERE>
""").show()

In [None]:
# Task 4: Count tickets by priority (SQL version)
<YOUR SQL QUERY HERE>


In [None]:
# Task 5: Which ticket subjects contain the keyword “Account” (SQL version)
<YOUR SQL QUERY HERE>

## UDFs (User-Defined Functions)

UDFs (User-Defined Functions) allow you to create custom functions that can be applied to DataFrame columns in Spark SQL. They are useful when you need to perform operations that are not available in the built-in functions provided by Spark.
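
A Python UDF wraps an ordinary Python function, so the logic can be tried out on plain values before it is handed to `udf(...)`. The sketch below uses a hypothetical helper, `mentions_refund`, which is not part of this notebook's tasks. It also illustrates one detail worth keeping in mind: Spark passes a SQL NULL to a Python UDF as `None`, so the function should handle that case explicitly.

```python
def mentions_refund(text):
    """Return True if the text contains the substring 'refund' (case-insensitive).

    Hypothetical helper, for illustration only.
    """
    # NULL column values arrive as None inside a Python UDF
    if text is None:
        return False
    return "refund" in text.lower()


# Plain-Python checks, no Spark session required
print(mentions_refund("Refund request for order 123"))
print(mentions_refund("Login problem"))
print(mentions_refund(None))
```

Once a function like this behaves as expected, it can be wrapped in the same way as in the task cell that follows, for example with `udf(mentions_refund, BooleanType())`.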

In [None]:
# Task 6: Create a UDF to detect whether a ticket is security-related

security_keywords = ["security", "cyber", "breach", "attack", "incident", "risk"]

# Create a function that checks if any of the security keywords (case insensitive) are present in the subject or body of the ticket
def is_security_ticket(subject, body):
    <YOUR CODE HERE>
    return ...

# Wrap the function as a UDF, specifying the return type as BooleanType - True if security-related, False otherwise
isSecurityUDF = udf(is_security_ticket, BooleanType())


In [None]:
# Apply the UDF to our tickets DataFrame
ticketsSecurityDF = ticketsDF.withColumn(
    "is_security_ticket",
    isSecurityUDF(col("subject"), col("body"))
)

# Print the results, showing the subject and whether it's a security ticket
<YOUR CODE HERE>

In [None]:
# Count the security tickets
<YOUR CODE HERE>

In [None]:
# Use the same UDF in a SQL query (it must first be registered by name with spark.udf.register)
<YOUR SQL QUERY HERE>

In [None]:
# Count the security tickets with SQL
<YOUR SQL QUERY HERE>

## Working with arrays

As you might have noticed, we have multiple columns for tags. Let's combine these tag columns into a single array column and perform operations on it.
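
To build intuition for what `array` and `explode` do, here is a plain-Python analogy on a few made-up rows. The rows and the two-column layout are assumptions for illustration, not the real dataset, and the code does not use Spark. Combining columns into an array gives one list per row, with NULLs kept as elements. Exploding then yields one output row per list element.

```python
# Made-up rows with two tag columns; None stands in for NULL
rows = [
    {"id": 1, "tag_1": "Billing", "tag_2": "Refund"},
    {"id": 2, "tag_1": "Login", "tag_2": None},
]

# Analogue of array(tag_1, tag_2): one list per row, None kept as an element
with_tags = [{"id": r["id"], "tags": [r["tag_1"], r["tag_2"]]} for r in rows]

# Analogue of explode("tags"): one output row per list element
exploded = [{"id": r["id"], "tag": t} for r in with_tags for t in r["tags"]]

for row in exploded:
    print(row)
```

Note that the `None` element survives the explode step as a row of its own, which is why the counting step at the end of this section asks you to ignore NULL values.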

In [None]:
# Task 7: Add a new array-type column "tags" to our DataFrame, containing all the tags (tag_1 to tag_8)
tagsDF = ticketsDF.withColumn(<YOUR CODE HERE>)


In [None]:
# Task 8: Explode tags to find most common tags
explodedTagsDF = tagsDF.select(explode("tags").alias("tag"))
explodedTagsDF.show()

In [None]:
# Count how often each tag occurs and show the results from most used to least used, ignoring NULL values
tagCountsDF = (
    explodedTagsDF.<YOUR CODE HERE>
)

tagCountsDF.show()

In [None]:
# Stop the Spark Session
spark.stop()