In [1]:
import pyspark
import pyspark.sql.functions as F

spark = pyspark.sql.SparkSession.builder.getOrCreate()



# Data Wrangling Exercises

## Part 1

### 1

Read the case, department, and source data into their own spark dataframes.

In [2]:
# Read all three .csv files into dataframes

case = spark.read.csv("case.csv", sep=",", header=True, inferSchema=True)
department = spark.read.csv("dept.csv", sep=",", header=True, inferSchema=True)
source = spark.read.csv("source.csv", sep=",", header=True, inferSchema=True)

                                                                                

### 2

Let's see how writing to the local disk works in spark:

- Write the code necessary to store the source data in both csv and json format, store these as sources_csv and sources_json
- Inspect your folder structure. What do you notice?

In [3]:
# Write the source dataframe into json and csv formats

source.write.json('source_json', mode = 'overwrite')
source.write.csv('source_csv', mode = 'overwrite')

Each write creates a directory with the name passed to the write method, not a single file. Inside it, Spark stores one part file per partition of the dataframe, plus an empty _SUCCESS file that marks the write as complete.
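One way to inspect the folder structure from Python is sketched below. The helper name `list_output_files` is hypothetical (it is not part of Spark), and the directory names assume the write calls above were run from the current working directory.

```python
from pathlib import Path


def list_output_files(directory):
    """Return the sorted names of the entries in a Spark output directory.

    Returns an empty list if the directory does not exist.
    """
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir())


# Directory names assumed to match the write calls above
for name in ("source_csv", "source_json"):
    print(name, list_output_files(name))
```

The listing should show the part files and the _SUCCESS marker; on a local filesystem there may also be hidden checksum files alongside them.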

### 3

Inspect the data in your dataframes. Are the data types appropriate? Write the code necessary to cast the values to the appropriate types.

In [4]:
# Let's see the case dataframe

case.printSchema()

root
 |-- case_id: integer (nullable = true)
 |-- case_opened_date: string (nullable = true)
 |-- case_closed_date: string (nullable = true)
 |-- SLA_due_date: string (nullable = true)
 |-- case_late: string (nullable = true)
 |-- num_days_late: double (nullable = true)
 |-- case_closed: string (nullable = true)
 |-- dept_division: string (nullable = true)
 |-- service_request_type: string (nullable = true)
 |-- SLA_days: double (nullable = true)
 |-- case_status: string (nullable = true)
 |-- source_id: string (nullable = true)
 |-- request_address: string (nullable = true)
 |-- council_district: integer (nullable = true)



In [5]:
case.show(1, vertical = True, truncate = False)

-RECORD 0----------------------------------------------------
 case_id              | 1014127332                           
 case_opened_date     | 1/1/18 0:42                          
 case_closed_date     | 1/1/18 12:29                         
 SLA_due_date         | 9/26/20 0:42                         
 case_late            | NO                                   
 num_days_late        | -998.5087616000001                   
 case_closed          | YES                                  
 dept_division        | Field Operations                     
 service_request_type | Stray Animal                         
 SLA_days             | 999.0                                
 case_status          | Closed                               
 source_id            | svcCRMLS                             
 request_address      | 2315  EL PASO ST, San Antonio, 78207 
 council_district     | 5                                    
only showing top 1 row



The case_opened_date, case_closed_date, and SLA_due_date columns were read as strings and should be cast to the timestamp type. The case_late and case_closed columns hold YES/NO strings and can be cast to booleans.

In [6]:
# Spark datetime pattern matching values such as "1/1/18 0:42"
fmt = "M/d/yy H:mm"
case = (
    case.withColumn("case_opened_date", F.to_timestamp("case_opened_date", fmt))
    .withColumn("case_closed_date", F.to_timestamp("case_closed_date", fmt))
    .withColumn("SLA_due_date", F.to_timestamp("SLA_due_date", fmt))
)

case = (
    case.withColumn("case_closed", F.expr('case_closed == "YES"'))
    .withColumn("case_late", F.expr('case_late == "YES"'))
)

In [7]:
# Let's see the department dataframe

department.printSchema()

root
 |-- dept_division: string (nullable = true)
 |-- dept_name: string (nullable = true)
 |-- standardized_dept_name: string (nullable = true)
 |-- dept_subject_to_SLA: string (nullable = true)



In [8]:
department.show(1, vertical = True, truncate = False)

-RECORD 0----------------------------------
 dept_division          | 311 Call Center  
 dept_name              | Customer Service 
 standardized_dept_name | Customer Service 
 dept_subject_to_SLA    | YES              
only showing top 1 row



In [9]:
department.select('dept_subject_to_SLA').distinct().show()

+-------------------+
|dept_subject_to_SLA|
+-------------------+
|                YES|
|                 NO|
+-------------------+



The dept_subject_to_SLA column holds only YES and NO, so it can be cast to a boolean.

In [10]:
department = department.withColumn('dept_subject_to_SLA', F.expr('dept_subject_to_SLA == "YES"'))

In [11]:
# Finally let's see the source dataframe

source.printSchema()

root
 |-- source_id: string (nullable = true)
 |-- source_username: string (nullable = true)



In [12]:
source.show(1, vertical = True, truncate = False)

-RECORD 0---------------------------
 source_id       | 100137           
 source_username | Merlene Blodgett 
only showing top 1 row



The source_id value in this row looks numeric, but the case dataframe holds ids such as svcCRMLS, so I'll leave the column as a string.

## Part 2

### 1

How old is the latest (in terms of days past SLA) currently open issue?

In [13]:
# Let's try this with SQL
case.createOrReplaceTempView('case')

In [30]:
spark.sql('''
SELECT
    num_days_late
FROM case
WHERE case_closed = false
    AND case_late = true
ORDER BY num_days_late DESC
LIMIT 1;
''').show(vertical = True, truncate = False)

-RECORD 0--------------------
 num_days_late | 348.6458333 



                                                                                

How long has the oldest (in terms of days since opened) currently open issue been open?

In [38]:
spark.sql('''
SELECT
    DATEDIFF(NOW(), case_opened_date) AS days_open
FROM case
WHERE case_closed = false
ORDER BY days_open DESC
LIMIT 1;
''').show(vertical = True)

-RECORD 0---------
 days_open | 1965 



### 2

How many Stray Animal cases are there?