# NYC Taxi Trips Example

The NYC taxi trip data used in this workshop is freely available. You can find some interesting background information at https://chriswhong.com/open-data/foil_nyc_taxi/ . We will use this data to perform some analytical tasks. The whole workshop is split into multiple sections, which represent the typical data processing flow in a data-centric project. We will follow these (simplified) steps for building a data lake:

1. Build "Structured Zone" containing all sources
2. Build "Refined Zone" that contains pre-processed data
3. Analyze the data before working on the next steps to find an appropriate approach
4. Build "Integrated Zone" that contains integrated data
5. Use Machine Learning for business questions

## Requirements

The workshop will require the following Python packages:

* PySpark (tested with Spark 2.4)
* Matplotlib
* Pandas
* GeoPandas
* Cartopy
* Contextily

# Part 1 - Build Structured Zone

The first part is about building the structured zone. It will contain a copy of the raw data stored in Hive tables and thereby easily accessible for downstream processing.

In [1]:
taxi_basedir = "s3://dimajix-training/data/nyc-taxi-trips/"
weather_basedir = "s3://dimajix-training/data/weather/"
holidays_basedir = "s3://dimajix-training/data/bank-holidays/"
dwh_basedir = "/user/hadoop/nyc-dwh"
structured_basedir = dwh_basedir + "/structured"
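
Note that `taxi_basedir` already ends with a slash, while later cells append sub directories such as `"/data/"`, which yields a doubled slash in the resulting path. A small helper can keep the path construction uniform. This is only a sketch: `join_path` is a hypothetical name and not part of the workshop code.

```python
def join_path(base, *parts):
    """Join a base directory and sub directories with exactly one slash between them."""
    # Strip slashes at the seams only; the scheme part ("s3://") stays untouched
    segments = [base.rstrip("/")] + [p.strip("/") for p in parts]
    return "/".join(s for s in segments if s)


taxi_basedir = "s3://dimajix-training/data/nyc-taxi-trips/"
dwh_basedir = "/user/hadoop/nyc-dwh"

trip_source = join_path(taxi_basedir, "/data/")
trip_target = join_path(dwh_basedir, "structured", "taxi-trip")
```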

# 0 Create Spark Session

Before we begin, we create a Spark session if none was provided in the notebook.

In [2]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","64G") \
        .getOrCreate()

spark

# 1 Taxi Data

In the first step we read in the raw data. The data is split into two different entities: basic trip information and payment information. We will store the data in a more efficient representation (Parquet) to form the structured zone.

## 1.1 Trip Information

We start with reading in the trip information. It contains the following columns:
* **medallion** - This is some sort of a license for a taxi company. A single medallion is attached to a single cab and may be used by multiple drivers.
* **hack_license** - This is the driver's license
* **vendor_id**
* **rate_code** The final rate code in effect at the end of the trip. 
  * 1=Standard rate
  * 2=JFK
  * 3=Newark
  * 4=Nassau or Westchester
  * 5=Negotiated fare
  * 6=Group ride
* **store_and_fwd_flag** This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server
* **pickup_datetime** This is the time when a passenger was picked up
* **dropoff_datetime** This is the time when the passenger was dropped off again
* **passenger_count** Number of passengers of this trip
* **trip_time_in_secs**
* **trip_distance**
* **pickup_longitude**
* **pickup_latitude**
* **dropoff_longitude**
* **dropoff_latitude**

The primary key uniquely identifying each trip is given by the columns `medallion`, `hack_license`, `vendor_id` and `pickup_datetime`.
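
A primary key like this can be verified by counting how often each key combination occurs; any count above one indicates a duplicate. The following plain-Python sketch illustrates the idea on a few made-up rows (the values are hypothetical and shortened, and `duplicate_keys` is a helper introduced only for this illustration). On the full data set the same check would be a group-by on the four key columns.

```python
from collections import Counter

KEY_COLUMNS = ("medallion", "hack_license", "vendor_id", "pickup_datetime")


def duplicate_keys(rows, key_columns=KEY_COLUMNS):
    """Return all key combinations that occur more than once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical sample rows; the last one repeats the key of the first one
rows = [
    {"medallion": "M1", "hack_license": "H1", "vendor_id": "VTS",
     "pickup_datetime": "2013-05-01 00:04:00", "passenger_count": 1},
    {"medallion": "M1", "hack_license": "H1", "vendor_id": "VTS",
     "pickup_datetime": "2013-05-01 00:10:00", "passenger_count": 2},
    {"medallion": "M2", "hack_license": "H2", "vendor_id": "VTS",
     "pickup_datetime": "2013-05-01 00:04:00", "passenger_count": 1},
    {"medallion": "M1", "hack_license": "H1", "vendor_id": "VTS",
     "pickup_datetime": "2013-05-01 00:04:00", "passenger_count": 3},
]

duplicates = duplicate_keys(rows)
```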

In [5]:
from pyspark.sql.types import *

trip_schema = StructType([
    StructField('medallion', StringType()),
    StructField('hack_license', StringType()),
    StructField('vendor_id', StringType()),
    StructField('rate_code', StringType()),
    StructField('store_and_fwd_flag', StringType()),
    StructField('pickup_datetime', TimestampType()),
    StructField('dropoff_datetime', TimestampType()),
    StructField('passenger_count', IntegerType()),
    StructField('trip_time_in_secs', IntegerType()),
    StructField('trip_distance', DoubleType()),
    StructField('pickup_longitude', DoubleType()),
    StructField('pickup_latitude', DoubleType()),
    StructField('dropoff_longitude', DoubleType()),
    StructField('dropoff_latitude', DoubleType()),
    ])

trip_data = spark.read \
    .option("header", True) \
    .schema(trip_schema) \
    .csv(taxi_basedir + "/data/")

Inspect the first 10 rows by converting them to a Pandas DataFrame.

In [6]:
trip_data.limit(10).toPandas()

Unnamed: 0,medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
0,3B1A31779BCE30367D00C6F7911573C0,AED0496C937E41C4515D64E851F873AB,VTS,1,,2013-05-01 00:04:00,2013-05-01 00:12:00,1,480,1.34,-73.982285,40.772816,-73.986214,40.758743
1,61F54249450649B22FCF456774A2F24F,9D871F2AE5ACF24D04C00484C8ECEF90,VTS,1,,2013-05-01 00:03:00,2013-05-01 00:10:00,5,420,2.6,-73.963013,40.711899,-73.991875,40.721916
2,160CA9331707228AC5BD584FDBF18B3C,18F9F1A9E76B707F7D15FC2B39E0BE33,VTS,1,,2013-05-01 00:04:00,2013-05-01 00:10:00,2,360,1.31,-73.981781,40.724354,-73.973755,40.736893
3,8F1DBE78C521F384A55AD0C77F75545D,AC4F234E82B375187FBAF428E10824D8,VTS,1,,2013-05-01 00:05:00,2013-05-01 00:09:00,1,240,0.82,-73.96402,40.70969,-73.950897,40.710972
4,C901A9DE8D66C4F05813EB48C50F0686,10E1D1418B5B22C82255FFC638547625,VTS,1,,2013-05-01 00:05:00,2013-05-01 00:14:00,1,540,1.65,-73.973915,40.752789,-73.996201,40.755867
5,9353AADDE79A8025BF13B6AC32BA3AE7,77E045C4E502526C9E3789116DA97DFE,VTS,1,,2013-05-01 00:00:00,2013-05-01 00:12:00,5,720,2.41,-74.002357,40.750324,-73.972885,40.756096
6,55A0A2A97F06FEF808382CD385597F84,B162F61E522964BDEAE9277EE96B651B,VTS,1,,2013-05-01 00:01:00,2013-05-01 00:10:00,1,540,2.44,-73.950111,40.771767,-73.977318,40.759239
7,FE1E7CE591DA8AEFA76005982EE399F2,F20370C70B1E67499C48C517315E8DE6,VTS,1,,2013-05-01 00:05:00,2013-05-01 00:12:00,1,420,2.42,-74.009293,40.724731,-73.998672,40.754932
8,8CFE46526C23E259F6BF5664DE46586F,4704E651F9FE2E1228D190CFA8B52240,VTS,1,,2013-05-01 23:57:00,2013-05-02 00:08:00,1,660,2.98,-74.003838,40.738476,-73.976479,40.775471
9,DCCE617529C5DC58F4EF6EB59C746C94,BBD68285796CE1EEC417CA3EA06E365A,VTS,1,,2013-05-01 00:04:00,2013-05-01 00:06:00,1,120,0.76,-73.979088,40.740398,-73.983017,40.731518


### Inspect Schema

Just to be sure, let us inspect the schema. It should match exactly the specified one.

In [7]:
trip_data.printSchema()

root
 |-- medallion: string (nullable = true)
 |-- hack_license: string (nullable = true)
 |-- vendor_id: string (nullable = true)
 |-- rate_code: string (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_time_in_secs: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- pickup_longitude: double (nullable = true)
 |-- pickup_latitude: double (nullable = true)
 |-- dropoff_longitude: double (nullable = true)
 |-- dropoff_latitude: double (nullable = true)



### Write into Structured Zone

Now we store the data as Parquet files in the sub directory `taxi-trip` of the structured zone.

In [8]:
trip_data.write.parquet(structured_basedir + "/taxi-trip")

## 1.2 Fare information

Now we read in the second table, which contains the fare information of the trips. It has the following columns:

* **medallion** - This is some sort of a license for a taxi company
* **hack_license** - This is the driver's license
* **vendor_id**
* **pickup_datetime** This is the time when a passenger was picked up
* **payment_type** A code signifying how the passenger paid for the trip. 
  * CRD = Credit card
  * CSH = Cash
  * ??? = No charge
  * ??? = Dispute
  * ??? = Unknown
  * ??? = Voided trip
* **fare_amount** The time-and-distance fare calculated by the meter
* **surcharge**
* **mta_tax** $0.50 MTA tax that is automatically triggered based on the metered rate in use
* **tip_amount** Tip amount. This field is automatically populated for credit card tips; cash tips are not included
* **tolls_amount** Total amount of all tolls paid in trip
* **total_amount** The total amount charged to passengers. Does not include cash tips.

In [9]:
fare_schema = StructType([
    StructField('medallion', StringType()),
    StructField('hack_license', StringType()),
    StructField('vendor_id', StringType()),
    StructField('pickup_datetime', TimestampType()),
    StructField('payment_type', StringType()),
    StructField('fare_amount', DoubleType()),
    StructField('surcharge', DoubleType()),
    StructField('mta_tax', DoubleType()),
    StructField('tip_amount', DoubleType()),
    StructField('tolls_amount', DoubleType()),
    StructField('total_amount', DoubleType())
    ])

trip_fare = spark.read \
    .option("header", True) \
    .option("ignoreLeadingWhiteSpace", True) \
    .schema(fare_schema) \
    .csv(taxi_basedir + "/fare/")

In [10]:
trip_fare.limit(10).toPandas()

Unnamed: 0,medallion,hack_license,vendor_id,pickup_datetime,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount
0,3B1A31779BCE30367D00C6F7911573C0,AED0496C937E41C4515D64E851F873AB,VTS,2013-05-01 00:04:00,CSH,7.0,0.5,0.5,0.0,0.0,8.0
1,61F54249450649B22FCF456774A2F24F,9D871F2AE5ACF24D04C00484C8ECEF90,VTS,2013-05-01 00:03:00,CRD,9.5,0.5,0.5,2.0,0.0,12.5
2,160CA9331707228AC5BD584FDBF18B3C,18F9F1A9E76B707F7D15FC2B39E0BE33,VTS,2013-05-01 00:04:00,CRD,6.5,0.5,0.5,1.0,0.0,8.5
3,8F1DBE78C521F384A55AD0C77F75545D,AC4F234E82B375187FBAF428E10824D8,VTS,2013-05-01 00:05:00,CSH,5.5,0.5,0.5,0.0,0.0,6.5
4,C901A9DE8D66C4F05813EB48C50F0686,10E1D1418B5B22C82255FFC638547625,VTS,2013-05-01 00:05:00,CRD,8.5,0.5,0.5,1.8,0.0,11.3
5,9353AADDE79A8025BF13B6AC32BA3AE7,77E045C4E502526C9E3789116DA97DFE,VTS,2013-05-01 00:00:00,CRD,11.0,0.5,0.5,2.3,0.0,14.3
6,55A0A2A97F06FEF808382CD385597F84,B162F61E522964BDEAE9277EE96B651B,VTS,2013-05-01 00:01:00,CRD,10.0,0.5,0.5,1.5,0.0,12.5
7,FE1E7CE591DA8AEFA76005982EE399F2,F20370C70B1E67499C48C517315E8DE6,VTS,2013-05-01 00:05:00,CRD,9.0,0.5,0.5,1.0,0.0,11.0
8,8CFE46526C23E259F6BF5664DE46586F,4704E651F9FE2E1228D190CFA8B52240,VTS,2013-05-01 23:57:00,CRD,11.5,0.5,0.5,2.4,0.0,14.9
9,DCCE617529C5DC58F4EF6EB59C746C94,BBD68285796CE1EEC417CA3EA06E365A,VTS,2013-05-01 00:04:00,CSH,4.0,0.5,0.5,0.0,0.0,5.0


### Inspect Schema

Let us inspect the schema of the data, which should exactly match the schema that we originally specified.

In [11]:
trip_fare.printSchema()

root
 |-- medallion: string (nullable = true)
 |-- hack_license: string (nullable = true)
 |-- vendor_id: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- surcharge: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- total_amount: double (nullable = true)



### Store into Structured Zone

Finally, we store the data in the structured zone as Parquet files in the sub directory `taxi-fare`.

In [12]:
trip_fare.write.parquet(structured_basedir + "/taxi-fare")

# 2 Weather Data

In order to improve our analysis, we will relate the taxi trips with weather information. We use the NOAA ISD weather data (https://www.ncdc.noaa.gov/isd), which contains measurements from many stations around the world, some of them dating back to 1901. You can download all data from ftp://ftp.ncdc.noaa.gov/pub/data/noaa . We will only use a small subset of the data which is good enough for our purposes.

## 2.1 Station Master Data

The weather data is split up into two different data sets: the measurements themselves and metadata about the stations. The latter contains valuable information like the geo location of each weather station. This will be useful when trying to find the weather station nearest to each taxi trip.

Among other data, the columns provide the following information:
* **USAF** & **WBAN** - weather station id
* **CTRY** - the country of the weather station
* **STATE** - the state of the weather station
* **LAT** & **LON** - latitude and longitude of the weather station (geo coordinates)
* **BEGIN** & **END** - date range when this weather station was active

In [14]:
weather_stations = spark.read \
    .option("header", True) \
    .csv(weather_basedir + "/isd-history/")

weather_stations.limit(10).toPandas()

Unnamed: 0,USAF,WBAN,STATION NAME,CTRY,STATE,ICAO,LAT,LON,ELEV(M),BEGIN,END
0,7005,99999,CWOS 07005,,,,,,,20120127,20120127
1,7011,99999,CWOS 07011,,,,,,,20111025,20121129
2,7018,99999,WXPOD 7018,,,,0.0,0.0,7018.0,20110309,20130730
3,7025,99999,CWOS 07025,,,,,,,20120127,20120127
4,7026,99999,WXPOD 7026,AF,,,0.0,0.0,7026.0,20120713,20141120
5,7034,99999,CWOS 07034,,,,,,,20121024,20121106
6,7037,99999,CWOS 07037,,,,,,,20111202,20121125
7,7044,99999,CWOS 07044,,,,,,,20120127,20120127
8,7047,99999,CWOS 07047,,,,,,,20120613,20120717
9,7052,99999,CWOS 07052,,,,,,,20121129,20121130


### Store data into Structured Zone

In the next step we want to store the data as Parquet files (which are much more efficient and very well supported by most batch frameworks in the Hadoop and Spark universe). In order to do so, we first need to rename some columns, which contain unsupported characters:
* "STATION NAME" => "STATION_NAME"
* "ELEV(M)" => "ELEVATION"

After the columns have been renamed, the data frame is written into the structured zone into the sub directory `weather-stations` using the `DataFrame.write.parquet` function.

In [15]:
weather_stations \
    .withColumnRenamed("STATION NAME", "STATION_NAME") \
    .withColumnRenamed("ELEV(M)", "ELEVATION") \
    .write.parquet(structured_basedir + "/weather-stations")
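
To find out which columns need renaming before writing Parquet files, the column names can be checked for problematic characters. The sketch below assumes that the rejected characters are space, comma, semicolon, braces, parentheses, newline, tab and the equals sign. This list is an assumption based on the error message of Spark 2.x, so please verify it for your version. `columns_to_rename` is a hypothetical helper, and the column list is copied from the station output above.

```python
# Assumed set of characters that are not accepted in Parquet column names
INVALID_CHARS = set(" ,;{}()\n\t=")


def columns_to_rename(columns):
    """Return the column names containing at least one invalid character."""
    return [name for name in columns if INVALID_CHARS & set(name)]


station_columns = [
    "USAF", "WBAN", "STATION NAME", "CTRY", "STATE", "ICAO",
    "LAT", "LON", "ELEV(M)", "BEGIN", "END",
]

needs_rename = columns_to_rename(station_columns)
```

With a Spark DataFrame, the same helper could be applied to `weather_stations.columns`.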

### Read in data again

Using the `spark.read.parquet` function we read the data back into Spark and display some records.

In [16]:
weather_stations = spark.read.parquet(structured_basedir + "/weather-stations")
weather_stations.limit(10).toPandas()

Unnamed: 0,USAF,WBAN,STATION_NAME,CTRY,STATE,ICAO,LAT,LON,ELEVATION,BEGIN,END
0,7005,99999,CWOS 07005,,,,,,,20120127,20120127
1,7011,99999,CWOS 07011,,,,,,,20111025,20121129
2,7018,99999,WXPOD 7018,,,,0.0,0.0,7018.0,20110309,20130730
3,7025,99999,CWOS 07025,,,,,,,20120127,20120127
4,7026,99999,WXPOD 7026,AF,,,0.0,0.0,7026.0,20120713,20141120
5,7034,99999,CWOS 07034,,,,,,,20121024,20121106
6,7037,99999,CWOS 07037,,,,,,,20111202,20121125
7,7044,99999,CWOS 07044,,,,,,,20120127,20120127
8,7047,99999,CWOS 07047,,,,,,,20120613,20120717
9,7052,99999,CWOS 07052,,,,,,,20121129,20121130


## 2.2 Weather Measurements

Now we will work with the second and more interesting part of the NOAA weather data set: The measurements. These are stored in different subdirectories per year. For us, the year 2013 is good enough, since the taxi trips are all from 2013.

The data is stored in a custom ASCII format with fields at fixed positions, so we use the `spark.read.text` method to read each line as one record.

In [17]:
raw_weather = spark.read.text(weather_basedir + "/2013")
raw_weather.limit(10).toPandas()

Unnamed: 0,value
0,042599999963897201301010000I+32335-086979CRN05...
1,013399999963897201301010005I+32335-086979CRN05...
2,013399999963897201301010010I+32335-086979CRN05...
3,013399999963897201301010015I+32335-086979CRN05...
4,013399999963897201301010020I+32335-086979CRN05...
5,013399999963897201301010025I+32335-086979CRN05...
6,013399999963897201301010030I+32335-086979CRN05...
7,013399999963897201301010035I+32335-086979CRN05...
8,013399999963897201301010040I+32335-086979CRN05...
9,013399999963897201301010045I+32335-086979CRN05...


### Extract precipitation

Now we extract the precipitation from the measurements. This is not trivial, since that information is stored in the variable part of each record. We assume that a record contains precipitation data when the substring `AA1` appears at position 109. `AA1` identifies the type of the subsection; it is followed by the number of hours covered by the measurement (two digits) and the precipitation depth (four digits).

We use some PySpark string functions to extract the data.

In [18]:
raw_weather.select(
        f.substring(raw_weather["value"],106,999),
        f.instr(raw_weather["value"],"AA1").alias("s"),
        f.when(f.instr(raw_weather["value"],"AA1") == 109,f.substring(raw_weather["value"], 109+3, 8)).alias("AAD")
    )\
    .withColumn("precipitation_hours", f.substring(f.col("AAD"), 1, 2).cast("INT")) \
    .withColumn("precipitation_depth", f.substring(f.col("AAD"), 3, 4).cast("FLOAT")) \
    .filter(f.col("precipitation_depth") > 0) \
    .limit(10).toPandas()


Unnamed: 0,"substring(value, 106, 999)",s,AAD,precipitation_hours,precipitation_depth
0,ADDAA101000291AO105000091CF1106510CF2000010CG1...,109,1000291,1,2.0
1,ADDAA101000691AO105000091CF1105810CF2000010CG1...,109,1000691,1,6.0
2,ADDAA101008491AO105001291CF1105510CF2000010CG1...,109,1008491,1,84.0
3,ADDAA101002591AO105000091CF1105410CF2000010CG1...,109,1002591,1,25.0
4,ADDAA101000491AO105000091CF1105410CF2000010CG1...,109,1000491,1,4.0
5,ADDAA101000591AO105000091CF1105310CF2000010CG1...,109,1000591,1,5.0
6,ADDAA101001491AO105000291CF1105210CF2000010CG1...,109,1001491,1,14.0
7,ADDAA101000591AO105000091CF1105010CF2000010CG1...,109,1000591,1,5.0
8,ADDAA101000491AO105000091CF1104910CF2000010CG1...,109,1000491,1,4.0
9,ADDAA124014991CO199-06KA1240M+01541KA2240N+01081,109,24014991,24,149.0
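
The offsets can be checked on a single record in plain Python. The sketch below mirrors the Spark expressions from the cell above; note that Spark's `instr` and `substring` count from 1, while Python slices count from 0. The function name `extract_precipitation` and the filler characters in the test records are hypothetical; only the variable parts are taken from the first and last row of the output above.

```python
def extract_precipitation(record):
    """Return (hours, depth) from the AA1 section, or None if AA1 is not at position 109."""
    # instr() is 1-based and returns 0 if nothing is found;
    # str.find() is 0-based and returns -1, so adding 1 yields the same numbers.
    if record.find("AA1") + 1 != 109:
        return None
    # Equivalent of substring(value, 109 + 3, 8): the 8 characters after "AA1"
    aad = record[111:119]
    hours = int(aad[0:2])    # substring(AAD, 1, 2)
    depth = float(aad[2:6])  # substring(AAD, 3, 4)
    return hours, depth


# Hypothetical records: 105 filler characters followed by a variable part
filler = "x" * 105
hourly = extract_precipitation(filler + "ADDAA101000291AO105000091")
daily = extract_precipitation(
    filler + "ADDAA124014991CO199-06KA1240M+01541KA2240N+01081"
)
missing = extract_precipitation(filler + "ADDAO105000091")
```

Here `hourly` and `daily` correspond to the rows with 1 and 24 precipitation hours in the output above, while `missing` is `None` because the record has no `AA1` section.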


### Extract all relevant measurements

The precipitation was the hardest part. Other measurements like wind speed and air temperature are stored at fixed positions together with some quality flags denoting if a measurement is valid. In the following statement, we extract all relevant measurements. Specifically we extract the following information
* **USAF** & **WBAN** - weather station identifier
* **ts** - timestamp of measurement
* **wind_direction** - wind direction (in degrees)
* **wind_direction_qual** - quality flag of the wind direction
* **wind_speed** - wind speed
* **wind_speed_qual** - quality flag indicating the validity of the wind speed
* **air_temperature** - air temperature in degrees Celsius
* **air_temperature_qual** - quality flag for air temperature
* **precipitation_hours**
* **precipitation_depth**

In [20]:
weather = raw_weather.select(
        f.substring(raw_weather["value"],5,6).alias("usaf"),
        f.substring(raw_weather["value"],11,5).alias("wban"),
        f.to_timestamp(f.substring(raw_weather["value"],16,12), "yyyyMMddHHmm").alias("ts"),
        f.substring(raw_weather["value"],42,5).alias("report_type"),
        f.substring(raw_weather["value"],61,3).alias("wind_direction"),
        f.substring(raw_weather["value"],64,1).alias("wind_direction_qual"),
        f.substring(raw_weather["value"],65,1).alias("wind_observation"),
        (f.substring(raw_weather["value"],66,4).cast("float") / 10.0).alias("wind_speed"),
        f.substring(raw_weather["value"],70,1).alias("wind_speed_qual"),
        (f.substring(raw_weather["value"],88,5).cast("float") / 10.0).alias("air_temperature"),
        f.substring(raw_weather["value"],93,1).alias("air_temperature_qual"),
        f.when(f.instr(raw_weather["value"],"AA1") == 109,f.substring(raw_weather["value"], 109+3, 8)).alias("AAD")
    ) \
    .withColumn("precipitation_hours", f.substring(f.col("AAD"), 1, 2).cast("INT")) \
    .withColumn("precipitation_depth", f.substring(f.col("AAD"), 3, 4).cast("FLOAT")) \
    .withColumn("date", f.to_date(f.col("ts"))) \
    .drop("AAD")
    
weather.limit(10).toPandas()

Unnamed: 0,usaf,wban,ts,report_type,wind_direction,wind_direction_qual,wind_observation,wind_speed,wind_speed_qual,air_temperature,air_temperature_qual,precipitation_hours,precipitation_depth,date
0,999999,63897,2013-01-01 00:00:00,CRN05,124,1,H,0.9,1,10.6,1,1.0,0.0,2013-01-01
1,999999,63897,2013-01-01 00:05:00,CRN05,124,1,H,1.5,1,10.6,1,,,2013-01-01
2,999999,63897,2013-01-01 00:10:00,CRN05,122,1,H,1.7,1,10.4,1,,,2013-01-01
3,999999,63897,2013-01-01 00:15:00,CRN05,120,1,H,1.7,1,11.0,1,,,2013-01-01
4,999999,63897,2013-01-01 00:20:00,CRN05,120,1,H,1.7,1,10.9,1,,,2013-01-01
5,999999,63897,2013-01-01 00:25:00,CRN05,124,1,H,1.7,1,11.2,1,,,2013-01-01
6,999999,63897,2013-01-01 00:30:00,CRN05,121,1,H,2.0,1,11.2,1,,,2013-01-01
7,999999,63897,2013-01-01 00:35:00,CRN05,120,1,H,2.2,1,11.4,1,,,2013-01-01
8,999999,63897,2013-01-01 00:40:00,CRN05,122,1,H,2.5,1,11.5,1,,,2013-01-01
9,999999,63897,2013-01-01 00:45:00,CRN05,127,1,H,2.9,1,11.6,1,,,2013-01-01
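
The fixed positions of the leading fields can be verified without Spark. The following sketch applies the same 1-based offsets to the visible prefix of the first raw record shown earlier (the rest of that record is truncated in the output and therefore not used). The helper `field` is hypothetical and only mimics the semantics of `f.substring`.

```python
from datetime import datetime


def field(record, pos, length):
    """1-based substring, mirroring f.substring(column, pos, length)."""
    return record[pos - 1:pos - 1 + length]


# Visible prefix of the first raw weather record
prefix = "042599999963897201301010000I+32335-086979CRN05"

usaf = field(prefix, 5, 6)
wban = field(prefix, 11, 5)
# The Spark pattern "yyyyMMddHHmm" corresponds to "%Y%m%d%H%M" in Python
ts = datetime.strptime(field(prefix, 16, 12), "%Y%m%d%H%M")
report_type = field(prefix, 42, 5)
```

The resulting values match the first row of the extracted data above.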


### Store into Structured Zone

After successful extraction, we write the result again into the structured zone into the subdirectory `weather/2013`.

In [30]:
weather.write.parquet(structured_basedir + "/weather/2013")

### Read in from Structured Zone

Again we read back the data from the Parquet files.

In [31]:
weather = spark.read.parquet(structured_basedir + "/weather/2013")
weather.limit(10).toPandas()

Unnamed: 0,usaf,wban,ts,report_type,wind_direction,wind_direction_qual,wind_observation,wind_speed,wind_speed_qual,air_temperature,air_temperature_qual,AAD,precipitation_hours,precipitation_depth,date
0,999999,53159,2013-01-01 00:00:00,CRN05,999,9,9,999.9,9,-0.4,1,1000091.0,1.0,0.0,2013-01-01
1,999999,53159,2013-01-01 00:05:00,CRN05,999,9,9,999.9,9,-0.6,1,,,,2013-01-01
2,999999,53159,2013-01-01 00:10:00,CRN05,999,9,9,999.9,9,-0.7,1,,,,2013-01-01
3,999999,53159,2013-01-01 00:15:00,CRN05,999,9,9,999.9,9,-0.6,1,,,,2013-01-01
4,999999,53159,2013-01-01 00:20:00,CRN05,999,9,9,999.9,9,-0.8,1,,,,2013-01-01
5,999999,53159,2013-01-01 00:25:00,CRN05,999,9,9,999.9,9,-0.9,1,,,,2013-01-01
6,999999,53159,2013-01-01 00:30:00,CRN05,999,9,9,999.9,9,-1.1,1,,,,2013-01-01
7,999999,53159,2013-01-01 00:35:00,CRN05,999,9,9,999.9,9,-0.9,1,,,,2013-01-01
8,999999,53159,2013-01-01 00:40:00,CRN05,999,9,9,999.9,9,-0.7,1,,,,2013-01-01
9,999999,53159,2013-01-01 00:45:00,CRN05,999,9,9,999.9,9,-0.6,1,,,,2013-01-01


# 3 Holidays

Another important data source is additional date information, specifically whether a certain date is a bank holiday. While other information like the day of the week can be computed directly from a date, bank holidays require an additional source.
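
As a small illustration of this difference: the day of the week follows from the date alone, while the bank holiday flag requires a lookup. The sketch uses Python's `datetime` module and three entries taken from the holiday data of this section; `describe_date` is a hypothetical helper.

```python
from datetime import date

# Three entries copied from the holiday data
bank_holidays = {
    date(2012, 1, 2): "New Year Day",
    date(2012, 7, 4): "Independence Day",
    date(2012, 12, 25): "Christmas Day",
}


def describe_date(day):
    """Combine computed calendar attributes with the holiday lookup."""
    return {
        "day_of_week": day.isoweekday(),      # 1 = Monday ... 7 = Sunday
        "is_weekend": day.isoweekday() >= 6,  # computed from the date alone
        "bank_holiday": day in bank_holidays, # needs the external source
    }


info = describe_date(date(2012, 7, 4))
```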

We follow again the same approach of reading in the raw data and storing it into the structured zone as Parquet files.

In [21]:
holidays_schema = StructType([
    StructField('id', IntegerType()),
    StructField('date', DateType()),
    StructField('description', StringType()),
    StructField('bank_holiday', BooleanType())
    ])

holidays = spark.read \
    .option("header", False) \
    .schema(holidays_schema) \
    .csv(holidays_basedir)

holidays.limit(10).toPandas()

Unnamed: 0,id,date,description,bank_holiday
0,1,2012-01-02,New Year Day,True
1,2,2012-01-16,Martin Luther King Jr. Day,True
2,3,2012-02-20,Presidents Day (Washingtons Birthday),True
3,4,2012-05-28,Memorial Day,True
4,5,2012-07-04,Independence Day,True
5,6,2012-09-03,Labor Day,True
6,7,2012-10-08,Columbus Day,True
7,8,2012-11-12,Veterans Day,True
8,9,2012-11-22,Thanksgiving Day,True
9,10,2012-12-25,Christmas Day,True


In [22]:
holidays.printSchema()

root
 |-- id: integer (nullable = true)
 |-- date: date (nullable = true)
 |-- description: string (nullable = true)
 |-- bank_holiday: boolean (nullable = true)



### Store into Structured Zone

In [23]:
holidays.write.parquet(structured_basedir + "/holidays")

### Read in from Structured Zone

Again let us check if writing was successful.

In [24]:
holidays = spark.read.parquet(structured_basedir + "/holidays")
holidays.limit(10).toPandas()

Unnamed: 0,id,date,description,bank_holiday
0,1,2012-01-02,New Year Day,True
1,2,2012-01-16,Martin Luther King Jr. Day,True
2,3,2012-02-20,Presidents Day (Washingtons Birthday),True
3,4,2012-05-28,Memorial Day,True
4,5,2012-07-04,Independence Day,True
5,6,2012-09-03,Labor Day,True
6,7,2012-10-08,Columbus Day,True
7,8,2012-11-12,Veterans Day,True
8,9,2012-11-22,Thanksgiving Day,True
9,10,2012-12-25,Christmas Day,True
