# Assignment 1: NYC Taxi Data

ETL processing of the NYC TLC dataset. The process is written in four parts:

1. Extract data from S3
2. Transform datatypes and create new features
3. Clean data - remove trips with questionable data
4. Load data into parquet files

In [1]:
# Import required packages
import boto3
from pyspark.sql import SparkSession
from pyspark.sql.types import BooleanType, DoubleType, IntegerType, StringType, StructType, StructField, TimestampType
import pyspark.sql.functions as F

In [2]:
# Set parameters 
bucket_name = "nyc-tlc" # s3 bucket name with required nyc tlc files
years = ["2017", "2018"]
tlc_colours = ["yellow", "green"]
months = range(1,13)
zone_lookup = "taxi _zone_lookup.csv"
dt_columns = ["pickup_datetime","dropoff_datetime"]
int_columns = ["passenger_count","year"]
num_columns = ["trip_distance","fare_amount","extra","mta_tax","improvement_surcharge","tip_amount","tolls_amount",
               "ehail_fee","total_amount"]
initial_columns = ["VendorID","pickup_datetime","dropoff_datetime","passenger_count","trip_distance","pickup_location_id",
                 "dropoff_location_id","RatecodeID","store_and_fwd_flag","payment_type","fare_amount","extra","mta_tax",
                 "improvement_surcharge","tip_amount","tolls_amount","ehail_fee","total_amount","trip_type","taxi_type",
                 "year","month","pickup_service_zone","pickup_borough","dropoff_service_zone","dropoff_borough"]

In [3]:
# Create a local spark session
spark = SparkSession.builder \
        .appName('nyc-taxi-etl') \
        .getOrCreate()

## Extract NYC Yellow and Green Taxi Cab Data

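The data is described on the [NYC TLC trip record data page](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page); the files themselves are read from the S3 bucket set in `bucket_name`.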
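
The monthly file names use a zero-padded month. As a minimal sketch of that naming, the helper below builds the path for one file without touching Spark. `build_trip_data_path` is a hypothetical name used only for illustration; the path pattern is the one used by the extract function in this notebook.

```python
def build_trip_data_path(bucket, year, colour, month):
    """Build the s3a path for one monthly trip file (illustrative helper)."""
    # int() accepts both 1 and "1"; :02d pads single-digit months to two digits
    return f"s3a://{bucket}/trip data/{colour}_tripdata_{year}-{int(month):02d}.csv"


print(build_trip_data_path("nyc-tlc", "2017", "yellow", 1))
# s3a://nyc-tlc/trip data/yellow_tripdata_2017-01.csv
```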

In [4]:
# Function to extract data from S3 bucket
def extract_data_from_bucket(bucket, year, colour, month):
    if len(str(month)) != 2:
        month = str(month).zfill(2)
    df = spark.read.csv(f"s3a://{bucket}/trip data/{colour}_tripdata_{year}-{month}.csv", header=True)
    return df

In [5]:
# Function to extract lookup data from NYC TLC
def extract_lookup_data_from_bucket(bucket, filename):
    df = spark.read.csv(f"s3a://{bucket}/misc/{filename}", header=True)
    return df

## Transform Data
### Modify data types

* pickup_datetime: string -> timestamp
* dropoff_datetime: string -> timestamp
* passenger_count: string -> integer
* trip_distance: string -> double
* fare_amount: string -> double
* extra: string -> double
* mta_tax: string -> double
* tip_amount: string -> double
* tolls_amount: string -> double
* improvement_surcharge: string -> double
* total_amount: string -> double
* ehail_fee: string -> double

### Rename Columns

* **PULocationID** -> pickup_location_id
* **DOLocationID** -> dropoff_location_id

### Join Datasets

* Join trips to zone lookups

### Create new features

* **taxi_type**: whether the trip was in a green or yellow cab - created in extract
* **trip_duration_seconds**: time, in seconds, between trip start and trip end
* **trip_duration_category**: bins of trip durations; under 5 mins, 5-10 mins, 10-20 mins, 20-30 mins, above 30 mins
* **year**: the year the trip took place in - created in extract
* **month**: the month the trip took place in
* **pickup_hour**: the hour the trip started in
* **from_airport**: whether the trip started in an airport service zone (`EWR` or `Airports`)
* **to_airport**: whether the trip ended in an airport service zone (`EWR` or `Airports`)
* **trip_distance_km**: the trip distance converted from miles to kilometres
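
The duration bins above can also be expressed as a lookup over sorted boundaries, which makes the edge cases easy to check. This is a minimal sketch in plain Python, independent of Spark; `DURATION_EDGES_MIN`, `DURATION_LABELS` and `duration_bin` are illustrative names, and the labels are assumed to match the ones returned by the UDF in this notebook.

```python
from bisect import bisect_right

# Lower edges (in minutes) of every bin after the first
DURATION_EDGES_MIN = [5, 10, 20, 30]
DURATION_LABELS = ["Under 5 mins", "5-10 mins", "10-20 mins", "20-30 mins", "Above 30 mins"]


def duration_bin(seconds):
    """Map a duration in seconds to its bin label; None stays None."""
    if seconds is None:
        return None
    minutes = seconds / 60
    # bisect_right counts how many edges are <= minutes, which is the bin index
    return DURATION_LABELS[bisect_right(DURATION_EDGES_MIN, minutes)]


print(duration_bin(299))   # Under 5 mins
print(duration_bin(300))   # 5-10 mins
print(duration_bin(1800))  # Above 30 mins
```

Each bin includes its lower edge, so a trip of exactly 5 minutes falls in "5-10 mins" and a trip of exactly 30 minutes falls in "Above 30 mins".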

In [6]:
# Function to calculate trip duration category
def get_trip_duration_category(time):
    # A null duration (missing pickup or dropoff time) has no category
    if time is None:
        return None
    minutes = time / 60
    if minutes < 5:
        return "Under 5 mins"
    elif 5 <= minutes < 10:
        return "5-10 mins"
    elif 10 <= minutes < 20:
        return "10-20 mins"
    elif 20 <= minutes < 30:
        return "20-30 mins"
    else:
        return "Above 30 mins"

# Register function as a Spark user defined function 
udf_get_trip_duration_category = F.udf(lambda x: get_trip_duration_category(x), StringType())

In [7]:
# Function to determine if location is an airport
def get_airport_location(location):
    # True when the service zone is one of the airport zones
    return location in ("EWR", "Airports")

# Register function as a Spark user defined function
udf_get_airport_location = F.udf(lambda x: get_airport_location(x), BooleanType())

In [None]:
# Function to calculate kilometres from a value in miles
def get_kilometres_from_miles(miles):
    # Leave null distances as null rather than failing inside the UDF
    if miles is None:
        return None
    km = miles * 1.60934
    return km

# Register function as a Spark user defined function
udf_kilometres_from_miles = F.udf(lambda x: get_kilometres_from_miles(x), DoubleType())

In [31]:
# Transforms specific to yellow taxi files
def transform_yellow_taxi_data(df):
    df = df.withColumn("trip_type",F.lit("1")).\
            withColumn("ehail_fee",F.lit("0")).\
            withColumnRenamed("tpep_pickup_datetime", "pickup_datetime").\
            withColumnRenamed("tpep_dropoff_datetime", "dropoff_datetime")
    return df

In [9]:
# Transforms specific to green taxi files
def transform_green_taxi_data(df):
    df = df.withColumnRenamed("lpep_pickup_datetime", "pickup_datetime").\
            withColumnRenamed("lpep_dropoff_datetime", "dropoff_datetime")
    return df

In [10]:
# Transform field to timestamp data type
def transform_timestamp_columns(df, column):
    if column in df.columns:
        df = df.withColumn(column, F.col(column).astype(TimestampType()))
    return df

In [11]:
def transform_integer_columns(df, column):
    if column in df.columns:
        df = df.withColumn(column, F.col(column).astype(IntegerType()))
    return df

In [12]:
def transform_double_columns(df, column):
    if column in df.columns:
        df = df.withColumn(column, F.col(column).astype(DoubleType()))
    return df

In [13]:
# Transforms for all NYC TLC files
def transform_generic_taxi_data(df, lkp_df, dt_columns, int_columns, num_columns, select_columns):
    # Modify data type for timestamp columns
    for column in dt_columns:
        df = transform_timestamp_columns(df, column)
    
    # Modify data type for integer columns
    for column in int_columns:
        df = transform_integer_columns(df, column)
        
    # Modify data type for numbers/decimals
    for column in num_columns:
        df = transform_double_columns(df, column)
    
    # Rename fields
    df = df.withColumnRenamed("PULocationID","pickup_location_id").\
            withColumnRenamed("DOLocationID","dropoff_location_id")
    
    # Join lookup data into data frame
    df = df.join(lkp_df, df.pickup_location_id == lkp_df.LocationID, how="left").\
            drop("LocationID").\
            drop("Zone").\
            withColumnRenamed("service_zone","pickup_service_zone").\
            withColumnRenamed("Borough","pickup_borough").\
            join(lkp_df, df.dropoff_location_id == lkp_df.LocationID, how="left").\
            drop("LocationID").\
            drop("Zone").\
            withColumnRenamed("service_zone","dropoff_service_zone").\
            withColumnRenamed("Borough","dropoff_borough")
    
    # Add features
    df = df.select(select_columns).\
            withColumn("trip_duration_seconds", F.col("dropoff_datetime").cast("long") - F.col("pickup_datetime").cast("long")).\
            withColumn("trip_duration_category", udf_get_trip_duration_category(F.col("trip_duration_seconds"))).\
            withColumn("pickup_hour", F.hour(F.col("pickup_datetime"))).\
            withColumn("from_airport", udf_get_airport_location(F.col("pickup_service_zone"))).\
            withColumn("to_airport", udf_get_airport_location(F.col("dropoff_service_zone"))).\
            withColumn("trip_distance_km", udf_kilometres_from_miles(F.col("trip_distance")))
    
    return df

In [14]:
# Function to bring all transforms together
def data_processing_transform(df, lkp_df, year, colour, month, dt_columns, int_columns, num_columns, select_columns):
    df = df.withColumn("taxi_type", F.lit(colour)).\
        withColumn("year", F.lit(year)).\
        withColumn("month", F.lit(month))
    if colour == "yellow":
        # Process transform tasks specific to yellow taxis
        df = transform_yellow_taxi_data(df)
    elif colour == "green":
        # Process transform tasks specific to green taxis
        df = transform_green_taxi_data(df)
    else:
        print("Taxi colour not defined")

    # Process generic transformations
    df = transform_generic_taxi_data(df, lkp_df, dt_columns, int_columns, num_columns, select_columns)
    return df

## Data Clean
### Remove records

* **RatecodeID**: trips with a rate code of 7 or above, which includes the 99 code
* **fare_amount**: trips with a fare amount of zero or below
* **trip_duration_seconds**: trips with a duration of zero seconds or less, or of 36,000 seconds (10 hours) or more
* **pickup_datetime**: trips with a pickup outside the year and month of the source file
* **passenger_count**: not removed; a count of zero is changed to one
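
The rules above can be read as a single keep-or-drop test per trip. The sketch below applies them to one record held in a plain dict, which is handy for checking edge cases without a Spark session. `keep_trip` and `fix_passenger_count` are hypothetical helpers, and the 36,000-second cap is taken from the duration filter in the cleaning code of this notebook.

```python
from datetime import datetime


def keep_trip(trip, file_year, file_month, max_duration_seconds=36000):
    """Return True when a trip (a plain dict) survives every removal rule."""
    pickup = trip["pickup_datetime"]
    return (
        trip["RatecodeID"] < 7                                   # drops the 99 rate code
        and trip["fare_amount"] > 0                              # fare must be positive
        and 0 < trip["trip_duration_seconds"] < max_duration_seconds
        and (pickup.year, pickup.month) == (file_year, file_month)
    )


def fix_passenger_count(trip):
    """Return a copy of the trip with a zero passenger count changed to one."""
    fixed = dict(trip)
    if fixed["passenger_count"] == 0:
        fixed["passenger_count"] = 1
    return fixed


example = {
    "RatecodeID": 1,
    "fare_amount": 12.5,
    "trip_duration_seconds": 840,
    "pickup_datetime": datetime(2017, 1, 15, 8, 30),
    "passenger_count": 0,
}
print(keep_trip(example, 2017, 1))                      # True
print(fix_passenger_count(example)["passenger_count"])  # 1
```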

In [15]:
# Function to clean rate code id field
def clean_rate_code_id(df):
    # Keep rate codes below 7, which removes records with a 99 rate code id
    df = df.filter(F.col("RateCodeID") < 7)
    return df

In [16]:
# Function to clean fare amount
def clean_fare_amount(df):
    # Remove records with a fare_amount of zero or below
    df = df.filter(F.col("fare_amount") > 0.0)
    return df

In [33]:
# Function to clean trip duration seconds
def clean_trip_duration_seconds(df):
    # Remove records with a trip duration of 0 seconds or less, or of 36000 seconds (10 hours) or more
    df = df.filter((F.col("trip_duration_seconds") > 0) & (F.col("trip_duration_seconds") < 36000))
    return df
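
When two comparisons are combined with `&`, each comparison needs its own parentheses, because Python evaluates `&` before `<` and `>`. The same precedence applies to Spark column expressions. A minimal sketch with plain integers, where the variable names are illustrative only:

```python
duration = -5  # an invalid, negative trip duration in seconds

# Without parentheses around the second comparison, `&` runs first:
# (duration > 0) & duration  ->  False & -5  ->  0, and 0 < 36000 is True
unparenthesised = (duration > 0) & duration < 36000

# With each comparison parenthesised, both conditions are actually tested
parenthesised = (duration > 0) & (duration < 36000)

print(unparenthesised, parenthesised)  # True False
```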

In [18]:
# Function to clean file period
def clean_trips_outside_file_period(df, dt_field):
    # Remove trips whose date falls outside the year and month of the source file
    df = df.filter((F.col("year") == F.year(F.col(dt_field)))
                   & (F.col("month") == F.month(F.col(dt_field))))
    return df

In [19]:
# Function to clean passenger counts
def clean_passenger_count(df):
    # Make trips with zero passengers equal to the mode for non zero passenger trips, which is 1 based on EDA
    df = df.withColumn("passenger_count", F.when(df["passenger_count"] == 0, 1).\
                       otherwise(df["passenger_count"]))
    return df

In [34]:
# Function to bring all clean processes into one
def data_processing_clean(df):
    # Clean rate code data - remove invalid trips
    # Compare names case-insensitively: the column is selected as "RatecodeID"
    if "ratecodeid" in [c.lower() for c in df.columns]:
        df = clean_rate_code_id(df)
    
    # Clean fare_amount - remove trips with fares zero or below
    if "fare_amount" in df.columns:
        df = clean_fare_amount(df)
        
    # Clean trip_duration_seconds - remove trips of zero seconds, or below
    if "trip_duration_seconds" in df.columns:
        df = clean_trip_duration_seconds(df)
    
    if "passenger_count" in df.columns:
        df = clean_passenger_count(df)
    
    # Remove records outside file month year
    df = clean_trips_outside_file_period(df, "pickup_datetime")
    
    return df

## Write data
Write data to Parquet files for analysis and for loading into an ML model at a later date.

In [21]:
# Function to write data to parquet files
def write_data_to_parquet(df, mode):
    df.write.partitionBy("year","month").parquet("./output", mode=mode)

## Process Data
For each year, month and taxi colour, process the CSV file and load it into Parquet files. Data is partitioned by year and month to speed up processing. The process is expected to run in full each time; it could be made incremental if required.
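
Because the process runs in full each time, only the first file should overwrite the output and every later file should append to it. The sketch below lists the planned writes in plain Python, with no Spark involved. `plan_writes` is a hypothetical helper, and the parameter values are copied from the parameters cell of this notebook.

```python
from itertools import product

years = ["2017", "2018"]
tlc_colours = ["yellow", "green"]
months = range(1, 13)


def plan_writes(years, colours, months):
    """List (year, colour, month, mode) in the order the files are processed."""
    plan = []
    # product() iterates like nested loops: years outermost, months innermost
    for i, (year, colour, month) in enumerate(product(years, colours, months)):
        mode = "overwrite" if i == 0 else "append"
        plan.append((year, colour, month, mode))
    return plan


plan = plan_writes(years, tlc_colours, months)
print(len(plan))  # 48
print(plan[0])    # ('2017', 'yellow', 1, 'overwrite')
print(plan[1])    # ('2017', 'yellow', 2, 'append')
```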

In [32]:
loop_num = 1

# Extract zone lookup data
zone_df = extract_lookup_data_from_bucket(bucket_name, zone_lookup)

# For each applicable year, month and taxi colour process files and load into parquet 
for year in years:
    for tlc_colour in tlc_colours:
        for month in months:
            df_extract = extract_data_from_bucket(bucket_name, year, tlc_colour, month)
            df_transform = data_processing_transform(df_extract,
                                                     zone_df,
                                                     year,
                                                     tlc_colour,
                                                     month,
                                                     dt_columns,
                                                     int_columns,
                                                     num_columns,
                                                     initial_columns)
            df_clean = data_processing_clean(df_transform)
            
            # Now write data to parquet
            if loop_num == 1:
                mode = "overwrite"
            else:
                mode = "append"
                
            write_data_to_parquet(df_clean, mode)
            
            loop_num += 1
            string = "Data file for month: {}, year: {} and taxi colour: {} successfully loaded".format(month, year, tlc_colour)
            print(string)

Data file for month: 1, year: 2017 and taxi colour: yellow successfully loaded
Data file for month: 2, year: 2017 and taxi colour: yellow successfully loaded
Data file for month: 3, year: 2017 and taxi colour: yellow successfully loaded
Data file for month: 4, year: 2017 and taxi colour: yellow successfully loaded
Data file for month: 5, year: 2017 and taxi colour: yellow successfully loaded
Data file for month: 6, year: 2017 and taxi colour: yellow successfully loaded
Data file for month: 7, year: 2017 and taxi colour: yellow successfully loaded
Data file for month: 8, year: 2017 and taxi colour: yellow successfully loaded
Data file for month: 9, year: 2017 and taxi colour: yellow successfully loaded
Data file for month: 10, year: 2017 and taxi colour: yellow successfully loaded
Data file for month: 11, year: 2017 and taxi colour: yellow successfully loaded
Data file for month: 12, year: 2017 and taxi colour: yellow successfully loaded
Data file for month: 1, year: 2017 and taxi colou