# Introduction to DataFrames

**Learning Concepts:**
    
    - Spark Session 
    - DataFrame Reader API 
    
**DataFrame Methods:** 

    - printSchema()
    - withColumnRenamed()
    - withColumn()
    - where()
    - select()
    - distinct()
    - expr()
    - show()

### `create` a DataFrame

In [1]:
dataset_file = 's3://fcc-spark-example/dataset/sf-fire-calls.csv'

In [2]:
# spark is the SparkSession object 

fire_df = spark.read \
            .format('csv') \
            .option('header', 'true') \
            .option('inferSchema', 'true') \
            .load(dataset_file)


In [3]:
# Equivalent shortcut: DataFrameReader.csv()

fire_df = spark.read \
               .csv(dataset_file,
                    header=True,
                    inferSchema=True)

### `show` method

In [4]:
fire_df.show(5)

23/03/06 03:58:57 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+----+-------+---------+-----------+----+----------------+--------+-------------+-------+-------------+---------+--------+--------------------------+----------------------+------------------+--------------------+--------------------+-------------+---------+
|CallNumber|UnitID|IncidentNumber|        CallType|  CallDate| WatchDate|CallFinalDisposition|       AvailableDtTm|             Address|City|Zipcode|Battalion|StationArea| Box|OriginalPriority|Priority|FinalPriority|ALSUnit|CallTypeGroup|NumAlarms|UnitType|UnitSequenceInCallDispatch|FirePreventionDistrict|SupervisorDistrict|        Neighborhood|            Location|        RowID|    Delay|
+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+----+-------+---------+-----------+----+----------------+--------+------------

## DataFrame Methods

- **Actions** : 
    Kick off a Spark `Job` and return a result to the Spark driver
    
    
- **Transformations** : 
    Produce a new, transformed `DataFrame`
    
    
- **Functions/Methods** : 
    Neither Actions nor Transformations
    
    

#### 1) Renaming a `Column` 

In [5]:
renamed_fire_df = fire_df \
                    .withColumnRenamed("CallNumber", "MyCallNumber") \
                    .withColumnRenamed("UnitID", "MyUnitID")

renamed_fire_df.show(5)

+------------+--------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+----+-------+---------+-----------+----+----------------+--------+-------------+-------+-------------+---------+--------+--------------------------+----------------------+------------------+--------------------+--------------------+-------------+---------+
|MyCallNumber|MyUnitID|IncidentNumber|        CallType|  CallDate| WatchDate|CallFinalDisposition|       AvailableDtTm|             Address|City|Zipcode|Battalion|StationArea| Box|OriginalPriority|Priority|FinalPriority|ALSUnit|CallTypeGroup|NumAlarms|UnitType|UnitSequenceInCallDispatch|FirePreventionDistrict|SupervisorDistrict|        Neighborhood|            Location|        RowID|    Delay|
+------------+--------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+----+-------+---------+-----------+----+----------------+--------+

#### 2) Check the `schema` 

In [6]:
# Utility method `printSchema()`

fire_df.printSchema()   

root
 |-- CallNumber: integer (nullable = true)
 |-- UnitID: string (nullable = true)
 |-- IncidentNumber: integer (nullable = true)
 |-- CallType: string (nullable = true)
 |-- CallDate: string (nullable = true)
 |-- WatchDate: string (nullable = true)
 |-- CallFinalDisposition: string (nullable = true)
 |-- AvailableDtTm: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- Battalion: string (nullable = true)
 |-- StationArea: string (nullable = true)
 |-- Box: string (nullable = true)
 |-- OriginalPriority: string (nullable = true)
 |-- Priority: string (nullable = true)
 |-- FinalPriority: integer (nullable = true)
 |-- ALSUnit: boolean (nullable = true)
 |-- CallTypeGroup: string (nullable = true)
 |-- NumAlarms: integer (nullable = true)
 |-- UnitType: string (nullable = true)
 |-- UnitSequenceInCallDispatch: integer (nullable = true)
 |-- FirePreventionDistrict: string (nullable = true)
 

#### 3) Fix the data type for a few of the `columns`

In [7]:
from pyspark.sql import functions as F

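# Note: to_date() keeps only the calendar date, so the time of day in
# AvailableDtTm is dropped; F.to_timestamp() would preserve it.
my_fire_df = fire_df \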
                .withColumn('CallDate', F.to_date('CallDate', 'MM/dd/yyyy')) \
                .withColumn('WatchDate', F.to_date('WatchDate', 'MM/dd/yyyy')) \
                .withColumn('AvailableDtTm', F.to_date('AvailableDtTm', 'MM/dd/yyyy hh:mm:ss a')) \
                .withColumn('Delay', F.round('Delay', 2))
    
my_fire_df.printSchema()

root
 |-- CallNumber: integer (nullable = true)
 |-- UnitID: string (nullable = true)
 |-- IncidentNumber: integer (nullable = true)
 |-- CallType: string (nullable = true)
 |-- CallDate: date (nullable = true)
 |-- WatchDate: date (nullable = true)
 |-- CallFinalDisposition: string (nullable = true)
 |-- AvailableDtTm: date (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- Battalion: string (nullable = true)
 |-- StationArea: string (nullable = true)
 |-- Box: string (nullable = true)
 |-- OriginalPriority: string (nullable = true)
 |-- Priority: string (nullable = true)
 |-- FinalPriority: integer (nullable = true)
 |-- ALSUnit: boolean (nullable = true)
 |-- CallTypeGroup: string (nullable = true)
 |-- NumAlarms: integer (nullable = true)
 |-- UnitType: string (nullable = true)
 |-- UnitSequenceInCallDispatch: integer (nullable = true)
 |-- FirePreventionDistrict: string (nullable = true)
 |-- Su

In [8]:
my_fire_df.show(5)

+----------+------+--------------+----------------+----------+----------+--------------------+-------------+--------------------+----+-------+---------+-----------+----+----------------+--------+-------------+-------+-------------+---------+--------+--------------------------+----------------------+------------------+--------------------+--------------------+-------------+-----+
|CallNumber|UnitID|IncidentNumber|        CallType|  CallDate| WatchDate|CallFinalDisposition|AvailableDtTm|             Address|City|Zipcode|Battalion|StationArea| Box|OriginalPriority|Priority|FinalPriority|ALSUnit|CallTypeGroup|NumAlarms|UnitType|UnitSequenceInCallDispatch|FirePreventionDistrict|SupervisorDistrict|        Neighborhood|            Location|        RowID|Delay|
+----------+------+--------------+----------------+----------+----------+--------------------+-------------+--------------------+----+-------+---------+-----------+----+----------------+--------+-------------+-------+-------------+-----

#### Q1. How many distinct types of calls were made to the Fire Department?


##### SQL Approach 

In [9]:
# Step 1 - Create a temp view (Utility method)
my_fire_df.createOrReplaceTempView("fire_service_calls_view")

# Step 2 - Run the SQL query
q1_sql_df = spark.sql("""
                        select count(distinct CallType) as distinct_call_type_count
                        from fire_service_calls_view
                        where CallType is not null
                      """)

q1_sql_df.show()



+------------------------+
|distinct_call_type_count|
+------------------------+
|                      30|
+------------------------+




##### DataFrame Approach

In [10]:
# These are all transformations 
q1_df = my_fire_df.where("CallType is not null") \
               .select("CallType") \
               .distinct()   

# This is an action 
print(q1_df.count())

30


In [11]:
q1_df1 = my_fire_df.where("CallType is not null")
q1_df2 = q1_df1.select("CallType")
q1_df3 = q1_df2.distinct()

print(q1_df3.count())

30


#### Q2. What were the distinct types of calls made to the Fire Department?


In [12]:
types_of_call = my_fire_df.where("CallType is not null") \
                          .select(F.expr("CallType as Distinct_Call_Type")) \
                          .distinct()

types_of_call.show(truncate=False)

+--------------------------------------------+
|Distinct_Call_Type                          |
+--------------------------------------------+
|Elevator / Escalator Rescue                 |
|Marine Fire                                 |
|Odor (Strange / Unknown)                    |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other                                       |
|Traffic Collision                           |
|Water Rescue                                |
|Structure Fire                              |
|Aircraft Emergency                          |
|Administrative                              |
|HazMat                                      |
|Assist Police                               |
|Train / Rail Incident                       |
|Citizen Assist / Service Call               |
|Alarms                                      |
|Explosion                                   |
|Oil Spill                                   |
|Vehicle Fire                                |
|Outside Fire

#### Q3. Find all calls with a response delay greater than 5 minutes

In [13]:
df_q3 = my_fire_df.where("Delay > 5") \
                  .select("CallNumber", "City", "Delay")

df_q3.show()

+----------+----+-----+
|CallNumber|City|Delay|
+----------+----+-----+
|  20110315|  SF| 5.35|
|  20120147|  SF| 6.25|
|  20130013|  SF|  5.2|
|  20140067|  SF|  5.6|
|  20140177|  SF| 7.25|
|  20150056|  SF|11.92|
|  20150254|  SF| 5.12|
|  20150265|  SF| 8.63|
|  20150265|  SF|95.28|
|  20150380|  SF| 5.45|
|  20150414|  SF|  7.6|
|  20160059|  SF| 6.13|
|  20160064|  SF| 5.18|
|  20170118|  SF| 6.92|
|  20170342|  SF|  5.2|
|  20180129|  SF| 6.35|
|  20180191|  SF| 7.98|
|  20180382|  SF|13.55|
|  20190062|  SF| 5.15|
|  20190097|  SF|13.58|
+----------+----+-----+
only showing top 20 rows



#### Q4. What were the most common call types?

In [14]:
df_q4 = my_fire_df.select("CallType") \
                  .where("CallType is not null") \
                  .groupBy("CallType") \
                  .count() \
                  .orderBy("count", ascending=False) 

df_q4.show(truncate=False)

+-------------------------------+------+
|CallType                       |count |
+-------------------------------+------+
|Medical Incident               |113794|
|Structure Fire                 |23319 |
|Alarms                         |19406 |
|Traffic Collision              |7013  |
|Citizen Assist / Service Call  |2524  |
|Other                          |2166  |
|Outside Fire                   |2094  |
|Vehicle Fire                   |854   |
|Gas Leak (Natural and LP Gases)|764   |
|Water Rescue                   |755   |
|Odor (Strange / Unknown)       |490   |
|Electrical Hazard              |482   |
|Elevator / Escalator Rescue    |453   |
|Smoke Investigation (Outside)  |391   |
|Fuel Spill                     |193   |
|HazMat                         |124   |
|Industrial Accidents           |94    |
|Explosion                      |89    |
|Train / Rail Incident          |57    |
|Aircraft Emergency             |36    |
+-------------------------------+------+
only showing top 20 rows

The `count()` in this query is a Transformation, NOT an Action, because it is called on the grouped data returned by `groupBy()`:

- `DataFrame.count()`      => Action 
- `GroupedData.count()`    => Transformation 

#### Q5. What zip codes accounted for the most common calls?


In [15]:
my_fire_df.select("CallType", "Zipcode") \
          .where("Zipcode is not null") \
          .groupBy("CallType", "Zipcode") \
          .count() \
          .orderBy("count", ascending=False) \
          .show()

+----------------+-------+-----+
|        CallType|Zipcode|count|
+----------------+-------+-----+
|Medical Incident|  94102|16130|
|Medical Incident|  94103|14775|
|Medical Incident|  94110| 9995|
|Medical Incident|  94109| 9479|
|Medical Incident|  94124| 5885|
|Medical Incident|  94112| 5630|
|Medical Incident|  94115| 4785|
|Medical Incident|  94122| 4323|
|Medical Incident|  94107| 4284|
|Medical Incident|  94133| 3977|
|Medical Incident|  94117| 3522|
|Medical Incident|  94134| 3437|
|Medical Incident|  94114| 3225|
|Medical Incident|  94118| 3104|
|Medical Incident|  94121| 2953|
|Medical Incident|  94116| 2738|
|Medical Incident|  94132| 2594|
|  Structure Fire|  94110| 2267|
|Medical Incident|  94105| 2258|
|  Structure Fire|  94102| 2229|
+----------------+-------+-----+
only showing top 20 rows





### Exercise 

#### Q6. What San Francisco neighborhoods are in the zip codes 94114 and 94103?

In [16]:
my_fire_df.select("Neighborhood", "Zipcode") \
          .where("Zipcode == 94114 or Zipcode == 94103") \
          .distinct() \
          .show()

+--------------------+-------+
|        Neighborhood|Zipcode|
+--------------------+-------+
|        Inner Sunset|  94114|
| Castro/Upper Market|  94103|
|     South of Market|  94103|
|             Mission|  94114|
| Castro/Upper Market|  94114|
|          Tenderloin|  94103|
|        Potrero Hill|  94103|
|        Hayes Valley|  94114|
|        Hayes Valley|  94103|
|          Noe Valley|  94114|
|      Haight Ashbury|  94114|
|          Twin Peaks|  94114|
|         Mission Bay|  94103|
|Financial Distric...|  94103|
|             Mission|  94103|
+--------------------+-------+



#### Q7. What were the sum of all calls and the average, min, and max of the response times?


In [17]:
my_fire_df.select(F.expr("sum(NumAlarms) as SUM_CALLS"),
                  F.expr("avg(Delay) as AVG_DELAY"),
                  F.expr("min(Delay) as MIN_DELAY"),
                  F.expr("max(Delay) as MAX_DELAY")
                 ) \
          .show()

+---------+------------------+---------+---------+
|SUM_CALLS|         AVG_DELAY|MIN_DELAY|MAX_DELAY|
+---------+------------------+---------+---------+
|   176170|3.8923648571558935|     0.02|  1844.55|
+---------+------------------+---------+---------+



#### Q8. How many distinct years of data are in the CSV file?

In [18]:
my_fire_df.select(F.expr("year(CallDate) as year_num")) \
          .distinct() \
          .orderBy("year_num") \
          .show()

+--------+
|year_num|
+--------+
|    2000|
|    2001|
|    2002|
|    2003|
|    2004|
|    2005|
|    2006|
|    2007|
|    2008|
|    2009|
|    2010|
|    2011|
|    2012|
|    2013|
|    2014|
|    2015|
|    2016|
|    2017|
|    2018|
+--------+



#### Q9. What week of the year in 2017 had the most fire calls?

In [19]:
my_fire_df.filter("year(CallDate) == 2017") \
          .select(F.expr("weekofyear(CallDate) as week_year")) \
          .groupBy('week_year') \
          .count() \
          .orderBy('count', ascending=False) \
          .show()

+---------+-----+
|week_year|count|
+---------+-----+
|       35|  314|
|        6|  265|
|       43|  260|
|       46|  255|
|        1|  254|
|       50|  254|
|       13|  254|
|       40|  253|
|       49|  252|
|        2|  250|
|        4|  249|
|       24|  249|
|       52|  248|
|       33|  247|
|       32|  245|
|       11|  243|
|       18|  240|
|       45|  240|
|        9|  240|
|        3|  239|
+---------+-----+
only showing top 20 rows



#### Q10. What neighborhoods in San Francisco had the worst response time in 2018?

In [20]:
my_fire_df.filter("year(CallDate) == 2018") \
          .select("Neighborhood", "Delay") \
          .show()

+--------------------+-----+
|        Neighborhood|Delay|
+--------------------+-----+
|    Presidio Heights| 2.88|
|         Mission Bay| 6.33|
|           Chinatown| 2.65|
|Financial Distric...| 3.53|
|          Tenderloin|  1.1|
|Bayview Hunters P...| 4.05|
|      Inner Richmond| 2.57|
|        Inner Sunset|  1.4|
|     Sunset/Parkside| 2.67|
|     South of Market| 1.77|
|    Golden Gate Park| 1.68|
|      Bernal Heights| 3.65|
|          Tenderloin|  4.2|
|         Mission Bay| 6.33|
|         Mission Bay|  6.6|
|      Outer Richmond| 3.48|
|           Excelsior| 0.83|
|         North Beach| 2.55|
|    Western Addition| 2.17|
|        Hayes Valley| 3.13|
+--------------------+-----+
only showing top 20 rows

