# Assignment 1: NYC Taxi Data

## Exploratory Data Analysis of NYC taxi data
NYC taxi trip data can be found [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). The data includes both yellow and green taxi trips, capturing pick-up and drop-off dates and times as well as other trip attributes, including the fare.

In [1]:
# Import required packages
import boto3
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, DoubleType
import pyspark.sql.functions as F

In [2]:
# Create a local spark session
spark = SparkSession.builder \
        .appName('nyc-taxi-eda') \
        .getOrCreate()

In [3]:
# Set parameters 
bucket_name = "nyc-tlc" # s3 bucket name with required nyc tlc files

In [4]:
# Create function to read S3 bucket
def list_bucket_contents(bucket, match=''):
    files = []
    s3_resource = boto3.resource('s3')
    bucket_resource = s3_resource.Bucket(bucket)
    for key in bucket_resource.objects.all():
        if match in key.key:
            files.append(key.key)
    return files

In [5]:
colours = ["yellow","green"]
years = ["2017","2018"]
files = []

for year in years:
    for colour in colours:
        match = colour + "_tripdata_" + year
        files.extend(list_bucket_contents(bucket=bucket_name, match=match))
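
The loop above finds files by substring match. The key read further down in this notebook (`trip data/yellow_tripdata_2018-01.csv`) suggests that monthly files share one naming pattern. As a rough cross-check, the sketch below builds the keys one would expect for every colour, year and month. `expected_keys` is a hypothetical helper, and it assumes that all twelve months exist under the same pattern, which the notebook does not confirm.

```python
from itertools import product


def expected_keys(colours, years, prefix="trip data/"):
    """Build the S3 keys expected for each colour, year and month.

    Assumes the pattern '<prefix><colour>_tripdata_<year>-<MM>.csv',
    inferred from the single key read later in the notebook.
    """
    return [
        f"{prefix}{colour}_tripdata_{year}-{month:02d}.csv"
        for year, colour, month in product(years, colours, range(1, 13))
    ]


keys = expected_keys(["yellow", "green"], ["2017", "2018"])
print(len(keys))  # 2 years x 2 colours x 12 months
print(keys[0])
```

Comparing this list with the `files` list returned from the bucket would show any missing or unexpectedly named months.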

### EDA of Yellow Taxi Trip Data

Yellow taxis are iconic in NYC. They are traditionally found in lower Manhattan, making trips around the congested CBD, and rarely service the suburbs of New York. Yellow cabs are only allowed to pick up "hailed" passengers and use a standard metered rate ([reference](https://www1.nyc.gov/site/tlc/passengers/your-ride.page)).

In [6]:
# Read January 2018 yellow taxi cab data from S3 bucket
yellow_df = spark.read.csv(f"s3a://{bucket_name}/trip data/yellow_tripdata_2018-01.csv", header=True)

In [6]:
# Show first twenty rows of the imported file
yellow_df.show(20)

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+
|       1| 2018-01-01 00:21:05|  2018-01-01 00:24:23|              1|          .50|         1|                 N|          41|          24|           2|        4.5|  0.5|    0.5|         0|           0|                  0.3|         5.8|
|       1| 2018-01-01 00:44:55|  2018-01-01 01:0

In [7]:
# Print schema of data frame to show field data types and nullability
yellow_df.printSchema()

root
 |-- VendorID: string (nullable = true)
 |-- tpep_pickup_datetime: string (nullable = true)
 |-- tpep_dropoff_datetime: string (nullable = true)
 |-- passenger_count: string (nullable = true)
 |-- trip_distance: string (nullable = true)
 |-- RatecodeID: string (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: string (nullable = true)
 |-- DOLocationID: string (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- fare_amount: string (nullable = true)
 |-- extra: string (nullable = true)
 |-- mta_tax: string (nullable = true)
 |-- tip_amount: string (nullable = true)
 |-- tolls_amount: string (nullable = true)
 |-- improvement_surcharge: string (nullable = true)
 |-- total_amount: string (nullable = true)



#### Data Type Changes
All fields were imported as strings. The following data type conversions are required:

* tpep_pickup_datetime: string -> timestamp
* tpep_dropoff_datetime: string -> timestamp
* passenger_count: string -> integer
* trip_distance: string -> double
* store_and_fwd_flag: string -> boolean ?
* fare_amount: string -> double
* extra: string -> double
* mta_tax: string -> double
* tip_amount: string -> double
* tolls_amount: string -> double
* improvement_surcharge: string -> double
* total_amount: string -> double
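
The conversions listed above could be applied in one pass. The following is a minimal sketch rather than the notebook's own method: `apply_type_changes` is a hypothetical helper, and the boolean conversion assumes that `store_and_fwd_flag` holds "Y" or "N", which matches the sample rows shown but is not verified for the whole file.

```python
# Columns grouped by target type, taken from the list of conversions above.
TIMESTAMP_COLUMNS = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
INTEGER_COLUMNS = ["passenger_count"]
DOUBLE_COLUMNS = [
    "trip_distance", "fare_amount", "extra", "mta_tax",
    "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount",
]


def apply_type_changes(df):
    """Return a copy of a PySpark DataFrame with the string columns converted."""
    # Imported here so the column lists can be inspected without Spark installed.
    from pyspark.sql import functions as F

    for name in TIMESTAMP_COLUMNS:
        df = df.withColumn(name, F.to_timestamp(F.col(name), "yyyy-MM-dd HH:mm:ss"))
    for name in INTEGER_COLUMNS:
        df = df.withColumn(name, F.col(name).cast("int"))
    for name in DOUBLE_COLUMNS:
        df = df.withColumn(name, F.col(name).cast("double"))
    # Assumption: "Y" means stored-and-forwarded; other values become False, nulls stay null.
    return df.withColumn("store_and_fwd_flag", F.col("store_and_fwd_flag") == "Y")
```

It would be called as `typed_df = apply_type_changes(yellow_df)`. Values that cannot be parsed become null after the cast, so a null count before and after is a useful check.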

In [8]:
# Show summary statistics of the dataframe
yellow_df.summary().show()

+-------+------------------+--------------------+---------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+-------------------+--------------------+-----------------+-------------------+---------------------+------------------+
|summary|          VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|   passenger_count|     trip_distance|         RatecodeID|store_and_fwd_flag|      PULocationID|      DOLocationID|      payment_type|       fare_amount|              extra|             mta_tax|       tip_amount|       tolls_amount|improvement_surcharge|      total_amount|
+-------+------------------+--------------------+---------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+-------------------+--------------------+-----------------+-------------------+-------

#### Data Issues

* tpep_pickup_datetime: min date 2001-01-05 for 2018-01 file; max date 2018-07-27 for 2018-01 file
* tpep_dropoff_datetime: min date 2001-01-05 for 2018-01 file; max date 2018-07-27 for 2018-01 file
* passenger_count: min = zero
* trip_distance: min = zero
* ratecodeid: max = 99; documentation only expects values of 1 to 6
* fare_amount: min amount is negative
* other amounts: min amounts are negative
* mta_tax: should be $\$$1.00 or $\$$0.50 but has values up to $\$$6.33
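
The out-of-range pickup dates noted above can be isolated by comparing each pickup against the bounds of the month the file is named for. This is a sketch under assumptions: `month_bounds` and `outside_file_month` are hypothetical helpers, and rows whose pickup string cannot be parsed are silently excluded by the filter.

```python
from datetime import datetime


def month_bounds(year, month):
    """Return (start, end) datetimes for a month; end is exclusive."""
    start = datetime(year, month, 1)
    end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
    return start, end


def outside_file_month(df, pickup_col, year, month):
    """Keep only rows of a PySpark DataFrame whose pickup falls outside the month."""
    # Imported here so month_bounds can be used without Spark installed.
    from pyspark.sql import functions as F

    start, end = month_bounds(year, month)
    pickup = F.to_timestamp(F.col(pickup_col), "yyyy-MM-dd HH:mm:ss")
    return df.filter((pickup < F.lit(start)) | (pickup >= F.lit(end)))


print(month_bounds(2018, 1))
```

For the January 2018 yellow file this would be called as `outside_file_month(yellow_df, "tpep_pickup_datetime", 2018, 1).count()`.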

In [7]:
# Count total records in the data frame
yellow_df.count()

8759874

In [None]:
# Check for nulls (count the null values in each column)
yellow_df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in yellow_df.columns]).show()

In [9]:
# Determine if there are any drop offs before pickups
yellow_df.withColumn("pickup_datetime", F.unix_timestamp(F.col("tpep_pickup_datetime"), "yyyy-MM-dd' 'HH:mm:ss").cast("timestamp")).\
    withColumn("dropoff_datetime", F.unix_timestamp(F.col("tpep_dropoff_datetime"), "yyyy-MM-dd' 'HH:mm:ss").cast("timestamp")).\
    withColumn("trip_duration", (F.col("dropoff_datetime").cast("long") - F.col("pickup_datetime").cast("long"))).\
    filter(F.col("trip_duration") < 0).\
    select(["tpep_pickup_datetime","pickup_datetime","tpep_dropoff_datetime","dropoff_datetime","trip_duration"]).\
    show()

+--------------------+-------------------+---------------------+-------------------+-------------+
|tpep_pickup_datetime|    pickup_datetime|tpep_dropoff_datetime|   dropoff_datetime|trip_duration|
+--------------------+-------------------+---------------------+-------------------+-------------+
| 2018-01-01 15:15:13|2018-01-01 15:15:13|  2017-12-28 16:03:38|2017-12-28 16:03:38|      -342695|
| 2018-01-23 13:12:19|2018-01-23 13:12:19|  2018-01-23 00:28:25|2018-01-23 00:28:25|       -45834|
+--------------------+-------------------+---------------------+-------------------+-------------+
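
The same three `withColumn` steps are repeated in several cells of this notebook, so they could be wrapped in one function. The sketch below is one way to do that: `with_trip_duration` and `duration_seconds` are hypothetical names, and `F.to_timestamp` is swapped in for the `unix_timestamp(...).cast("timestamp")` pattern used above. The pure-Python `duration_seconds` is only a reference for spot-checking individual rows; it ignores time zones and daylight saving.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def duration_seconds(pickup, dropoff):
    """Seconds from pickup to dropoff; negative when the dropoff is earlier."""
    delta = datetime.strptime(dropoff, TS_FORMAT) - datetime.strptime(pickup, TS_FORMAT)
    return int(delta.total_seconds())


def with_trip_duration(df, pickup_col, dropoff_col):
    """Add pickup_datetime, dropoff_datetime and trip_duration (seconds) columns."""
    # Imported here so duration_seconds can be used without Spark installed.
    from pyspark.sql import functions as F

    fmt = "yyyy-MM-dd HH:mm:ss"
    return (
        df.withColumn("pickup_datetime", F.to_timestamp(F.col(pickup_col), fmt))
        .withColumn("dropoff_datetime", F.to_timestamp(F.col(dropoff_col), fmt))
        .withColumn(
            "trip_duration",
            F.col("dropoff_datetime").cast("long") - F.col("pickup_datetime").cast("long"),
        )
    )


# Spot-check against the first row of the output above.
print(duration_seconds("2018-01-01 15:15:13", "2017-12-28 16:03:38"))  # -342695
```

With this helper, the check above becomes `with_trip_duration(yellow_df, "tpep_pickup_datetime", "tpep_dropoff_datetime").filter(F.col("trip_duration") < 0)`.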



In [102]:
# Show trips with a duration under 30 seconds
yellow_df.withColumn("pickup_datetime", F.unix_timestamp(F.col("tpep_pickup_datetime"), "yyyy-MM-dd' 'HH:mm:ss").cast("timestamp")).\
    withColumn("dropoff_datetime", F.unix_timestamp(F.col("tpep_dropoff_datetime"), "yyyy-MM-dd' 'HH:mm:ss").cast("timestamp")).\
    withColumn("trip_duration", (F.col("dropoff_datetime").cast("long") - F.col("pickup_datetime").cast("long"))).\
    filter(F.col("trip_duration") < 30).\
    select(["tpep_pickup_datetime","pickup_datetime","tpep_dropoff_datetime","dropoff_datetime","trip_duration","fare_amount"]).\
    show()

+--------------------+-------------------+---------------------+-------------------+-------------+-----------+
|tpep_pickup_datetime|    pickup_datetime|tpep_dropoff_datetime|   dropoff_datetime|trip_duration|fare_amount|
+--------------------+-------------------+---------------------+-------------------+-------------+-----------+
| 2018-01-01 00:38:57|2018-01-01 00:38:57|  2018-01-01 00:39:00|2018-01-01 00:39:00|            3|        2.5|
| 2018-01-01 00:48:59|2018-01-01 00:48:59|  2018-01-01 00:49:00|2018-01-01 00:49:00|            1|        2.5|
| 2018-01-01 00:20:53|2018-01-01 00:20:53|  2018-01-01 00:20:53|2018-01-01 00:20:53|            0|       14.5|
| 2018-01-01 00:43:40|2018-01-01 00:43:40|  2018-01-01 00:43:44|2018-01-01 00:43:44|            4|        2.5|
| 2018-01-01 00:12:01|2018-01-01 00:12:01|  2018-01-01 00:12:04|2018-01-01 00:12:04|            3|         75|
| 2018-01-01 00:02:31|2018-01-01 00:02:31|  2018-01-01 00:02:42|2018-01-01 00:02:42|           11|        2.5|
|

In [16]:
# Look to see how many trips are greater than 10 hours
yellow_df.withColumn("pickup_datetime", F.unix_timestamp(F.col("tpep_pickup_datetime"), "yyyy-MM-dd' 'HH:mm:ss").cast("timestamp")).\
    withColumn("dropoff_datetime", F.unix_timestamp(F.col("tpep_dropoff_datetime"), "yyyy-MM-dd' 'HH:mm:ss").cast("timestamp")).\
    withColumn("trip_duration", (F.col("dropoff_datetime").cast("long") - F.col("pickup_datetime").cast("long"))).\
    filter(F.col("trip_duration") > 36000).\
    select(["tpep_pickup_datetime","pickup_datetime","tpep_dropoff_datetime","dropoff_datetime","trip_duration","fare_amount"]).\
    show()

+--------------------+-------------------+---------------------+-------------------+-------------+-----------+
|tpep_pickup_datetime|    pickup_datetime|tpep_dropoff_datetime|   dropoff_datetime|trip_duration|fare_amount|
+--------------------+-------------------+---------------------+-------------------+-------------+-----------+
| 2018-01-01 00:57:31|2018-01-01 00:57:31|  2018-01-02 00:53:25|2018-01-02 00:53:25|        86154|        2.5|
| 2017-12-31 12:52:15|2017-12-31 12:52:15|  2018-01-01 12:13:05|2018-01-01 12:13:05|        84050|       13.5|
| 2018-01-01 00:23:33|2018-01-01 00:23:33|  2018-01-02 00:14:47|2018-01-02 00:14:47|        85874|       13.5|
| 2018-01-01 00:20:28|2018-01-01 00:20:28|  2018-01-01 23:05:08|2018-01-01 23:05:08|        81880|        9.5|
| 2017-12-31 22:15:21|2017-12-31 22:15:21|  2018-01-01 21:45:23|2018-01-01 21:45:23|        84602|       18.5|
| 2018-01-01 00:57:40|2018-01-01 00:57:40|  2018-01-02 00:52:37|2018-01-02 00:52:37|        86097|         12|
|

In [80]:
# Investigate size of 99 RateCodeID issue
yellow_df.filter(F.col("RateCodeID") > 6).\
    groupBy("RateCodeID").\
    count().\
    show()

+----------+-----+
|RateCodeID|count|
+----------+-----+
|        99|  106|
+----------+-----+



In [86]:
# Check how many trips have a passenger count of less than 1 (i.e. zero)
yellow_df.filter(F.col("passenger_count").astype(IntegerType()) < 1).\
    count()

59269

In [11]:
# Find the most common passenger count among trips with at least one passenger
yellow_df.filter(F.col("passenger_count").astype(IntegerType()) > 0).\
    groupBy("passenger_count").\
    count().\
    orderBy("count", ascending=False).\
    first()[0]

'1'

In [91]:
# Check how many trips have a distance of zero or less and a fare amount of zero or less
yellow_df.filter(F.col("trip_distance") <= 0.0).\
    filter(F.col("fare_amount").astype(DoubleType()) <= 0.0).\
    count()

1986

In [92]:
# Check how many trips have a distance greater than zero but a fare less than or equal to zero
yellow_df.filter(F.col("trip_distance") > 0.0).\
    filter(F.col("fare_amount").astype(DoubleType()) <= 0.0).\
    count()

4512

In [13]:
# Show trips with a distance of zero or less
yellow_df.filter(F.col("trip_distance") <= 0.0).\
    show()

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+
|       1| 2018-01-01 00:20:53|  2018-01-01 00:20:53|              1|          .00|         1|                 N|         161|         264|           2|       14.5|  0.5|    0.5|         0|           0|                  0.3|        15.8|
|       2| 2018-01-01 00:57:31|  2018-01-02 00:5

In [97]:
# Check how many trips have a high fare amount and a low trip distance
yellow_df.filter(F.col("fare_amount").astype(DoubleType()) > 50.0).\
    filter(F.col("trip_distance") < 1.0).\
    count()

18678

#### Further Data Issues

* **Datetimes**: In a small number of cases the drop-off time is before the pickup time (2 in the January 2018 file).
* **Rate code id**: A small number of rides have a rate code of 99 (106 in the January 2018 file).
* **Passenger Counts**: 59,269 trips have a passenger count of 0; this is a driver-entered field.
* **Trip Distance**: A relatively small number of trips have a zero distance; of these, an even smaller number (1,986) also have a fare amount of zero or less.
* **Fare Amount**: A small number of trips have a fare amount of zero or less; most have a trip distance greater than zero (4,512), some do not (1,986).
* **Fare Amount**: 18,678 trips, a small share of the file, have a high fare amount (over 50) and a low trip distance (under 1).
* **Trip Duration**: Some trips run over 10 hours, which seems improbable; some last for days, which is even less plausible.
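
The checks behind the issues listed above can be collected into a single rule set. The sketch below does this in plain Python on one record at a time, which makes the rules easy to read and test before porting them to Spark filters. `flag_trip` and the issue labels are hypothetical; the thresholds (10 hours, fare over 50, distance under 1, rate codes 1 to 6) are the ones used in the cells above.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"
MAX_DURATION_SECONDS = 10 * 60 * 60  # 10 hours, as in the duration check above
HIGH_FARE = 50.0
SHORT_DISTANCE = 1.0
VALID_RATE_CODES = {1, 2, 3, 4, 5, 6}


def flag_trip(trip):
    """Return the list of issue labels that apply to one trip (a dict of strings)."""
    pickup = datetime.strptime(trip["tpep_pickup_datetime"], TS_FORMAT)
    dropoff = datetime.strptime(trip["tpep_dropoff_datetime"], TS_FORMAT)
    duration = (dropoff - pickup).total_seconds()
    fare = float(trip["fare_amount"])
    distance = float(trip["trip_distance"])

    issues = []
    if duration < 0:
        issues.append("dropoff_before_pickup")
    if duration > MAX_DURATION_SECONDS:
        issues.append("duration_over_10_hours")
    if int(trip["RatecodeID"]) not in VALID_RATE_CODES:
        issues.append("unexpected_rate_code")
    if int(trip["passenger_count"]) < 1:
        issues.append("zero_passengers")
    if distance <= 0:
        issues.append("zero_distance")
    if fare <= 0:
        issues.append("non_positive_fare")
    if fare > HIGH_FARE and distance < SHORT_DISTANCE:
        issues.append("high_fare_short_distance")
    return issues


# First zero-distance row shown in the output above.
example = {
    "tpep_pickup_datetime": "2018-01-01 00:20:53",
    "tpep_dropoff_datetime": "2018-01-01 00:20:53",
    "passenger_count": "1",
    "trip_distance": ".00",
    "RatecodeID": "1",
    "fare_amount": "14.5",
}
print(flag_trip(example))  # ['zero_distance']
```

Keeping the rules in one place means that a change to a threshold only has to be made once.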

### EDA of Green Taxi Trip Data

Green taxis were introduced to NYC in 2013 to serve areas outside of Manhattan where yellow taxis traditionally don't go. Green taxis can pick up passengers in northern Manhattan and anywhere in the Bronx, Brooklyn, Staten Island and Queens (excluding the airports). In these areas they can accept street-hail passengers as well as pre-arranged trips, where the price is set by the base or app used to book the service ([reference](https://www1.nyc.gov/site/tlc/passengers/your-ride.page)).

In [18]:
# Read January 2018 green taxi cab data from S3 bucket
green_df = spark.read.csv(f"s3a://{bucket_name}/trip data/green_tripdata_2018-01.csv", header=True)

In [71]:
# Show first twenty rows of the imported file
green_df.show(20)

+--------+--------------------+---------------------+------------------+----------+------------+------------+---------------+-------------+-----------+-----+-------+----------+------------+---------+---------------------+------------+------------+---------+
|VendorID|lpep_pickup_datetime|lpep_dropoff_datetime|store_and_fwd_flag|RatecodeID|PULocationID|DOLocationID|passenger_count|trip_distance|fare_amount|extra|mta_tax|tip_amount|tolls_amount|ehail_fee|improvement_surcharge|total_amount|payment_type|trip_type|
+--------+--------------------+---------------------+------------------+----------+------------+------------+---------------+-------------+-----------+-----+-------+----------+------------+---------+---------------------+------------+------------+---------+
|       2| 2018-01-01 00:18:50|  2018-01-01 00:24:39|                 N|         1|         236|         236|              5|          .70|          6|  0.5|    0.5|         0|           0|     null|                  0.3|     

#### Schema differences

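The green cab CSV file has two columns that the yellow file lacks: _ehail_fee_ and _trip_type_. Because green cabs can accept street hails or pre-booked (aka dispatch) fares, _trip_type_ records which of the two applies. For yellow cabs this field would always be "1". The datetime columns also use the prefix _lpep_ rather than _tpep_.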
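
The difference between the two schemas can be confirmed programmatically. The sketch below hard-codes the column names as printed in the two `show()` outputs, so it runs without Spark; in the notebook, `yellow_df.columns` and `green_df.columns` would supply the same lists. `common_name` is a hypothetical helper that strips the taxi-specific datetime prefix.

```python
YELLOW_COLUMNS = [
    "VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count",
    "trip_distance", "RatecodeID", "store_and_fwd_flag", "PULocationID",
    "DOLocationID", "payment_type", "fare_amount", "extra", "mta_tax",
    "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount",
]
GREEN_COLUMNS = [
    "VendorID", "lpep_pickup_datetime", "lpep_dropoff_datetime", "store_and_fwd_flag",
    "RatecodeID", "PULocationID", "DOLocationID", "passenger_count", "trip_distance",
    "fare_amount", "extra", "mta_tax", "tip_amount", "tolls_amount", "ehail_fee",
    "improvement_surcharge", "total_amount", "payment_type", "trip_type",
]


def common_name(column):
    """Drop the 'tpep_' or 'lpep_' prefix so the datetime columns line up."""
    for prefix in ("tpep_", "lpep_"):
        if column.startswith(prefix):
            return column[len(prefix):]
    return column


yellow = {common_name(c) for c in YELLOW_COLUMNS}
green = {common_name(c) for c in GREEN_COLUMNS}

green_only = sorted(green - yellow)
yellow_only = sorted(yellow - green)
print(green_only)   # ['ehail_fee', 'trip_type']
print(yellow_only)  # []
```

Note that the column order also differs between the two files, so any later union of the data frames should match columns by name rather than by position.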

In [73]:
# Show the columns, data types and nullability
green_df.printSchema()

root
 |-- VendorID: string (nullable = true)
 |-- lpep_pickup_datetime: string (nullable = true)
 |-- lpep_dropoff_datetime: string (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- RatecodeID: string (nullable = true)
 |-- PULocationID: string (nullable = true)
 |-- DOLocationID: string (nullable = true)
 |-- passenger_count: string (nullable = true)
 |-- trip_distance: string (nullable = true)
 |-- fare_amount: string (nullable = true)
 |-- extra: string (nullable = true)
 |-- mta_tax: string (nullable = true)
 |-- tip_amount: string (nullable = true)
 |-- tolls_amount: string (nullable = true)
 |-- ehail_fee: string (nullable = true)
 |-- improvement_surcharge: string (nullable = true)
 |-- total_amount: string (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- trip_type: string (nullable = true)



#### Data Type Changes

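The same conversions apply as for the yellow taxi file, with the datetime columns named lpep_pickup_datetime and lpep_dropoff_datetime.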

In [74]:
# Show summary statistics of the dataframe
green_df.summary().show()

+-------+------------------+--------------------+---------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-------------------+-------------------+------------------+-------------------+---------+---------------------+------------------+------------------+------------------+
|summary|          VendorID|lpep_pickup_datetime|lpep_dropoff_datetime|store_and_fwd_flag|        RatecodeID|      PULocationID|     DOLocationID|   passenger_count|     trip_distance|       fare_amount|              extra|            mta_tax|        tip_amount|       tolls_amount|ehail_fee|improvement_surcharge|      total_amount|      payment_type|         trip_type|
+-------+------------------+--------------------+---------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-------------------+-------------------+-------------

#### Data Issues

* lpep_pickup_datetime: min date 2009-01-01 for 2018-01 file; max date 2018-04-05 for 2018-01 file
* lpep_dropoff_datetime: min date 2009-01-01 for 2018-01 file; max date 2018-04-05 for 2018-01 file
* passenger_count: min = zero
* trip_distance: min = zero
* trip_duration: less than a minute, greater than 10 hours
* ratecodeid: max = 99; documentation only expects values of 1 to 6
* fare_amount: min amount is negative; max $\$$999.99, is this fair?
* other amounts: min amounts are negative
* mta_tax: should be $\$$1.00 or $\$$0.50 but has a negative minimum amount

In [101]:
# Count the number of rows in the data frame
green_df.count()

793529

In [None]:
# Check for nulls (count the null values in each column)
green_df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in green_df.columns]).show()

In [104]:
# Determine if there are any drop offs before pickups
green_df.withColumn("pickup_datetime", F.unix_timestamp(F.col("lpep_pickup_datetime"), "yyyy-MM-dd' 'HH:mm:ss").cast("timestamp")).\
    withColumn("dropoff_datetime", F.unix_timestamp(F.col("lpep_dropoff_datetime"), "yyyy-MM-dd' 'HH:mm:ss").cast("timestamp")).\
    withColumn("trip_duration", (F.col("dropoff_datetime").cast("long") - F.col("pickup_datetime").cast("long"))).\
    filter(F.col("trip_duration") < 0).\
    select(["lpep_pickup_datetime","pickup_datetime","lpep_dropoff_datetime","dropoff_datetime","trip_duration"]).\
    show()

+--------------------+---------------+---------------------+----------------+-------------+
|lpep_pickup_datetime|pickup_datetime|lpep_dropoff_datetime|dropoff_datetime|trip_duration|
+--------------------+---------------+---------------------+----------------+-------------+
+--------------------+---------------+---------------------+----------------+-------------+



In [107]:
# Show trips with a duration under 30 seconds
green_df.withColumn("pickup_datetime", F.unix_timestamp(F.col("lpep_pickup_datetime"), "yyyy-MM-dd' 'HH:mm:ss").cast("timestamp")).\
    withColumn("dropoff_datetime", F.unix_timestamp(F.col("lpep_dropoff_datetime"), "yyyy-MM-dd' 'HH:mm:ss").cast("timestamp")).\
    withColumn("trip_duration", (F.col("dropoff_datetime").cast("long") - F.col("pickup_datetime").cast("long"))).\
    filter(F.col("trip_duration") < 30).\
    select(["lpep_pickup_datetime","pickup_datetime","lpep_dropoff_datetime","dropoff_datetime","trip_duration","fare_amount"]).\
    show()

+--------------------+-------------------+---------------------+-------------------+-------------+-----------+
|lpep_pickup_datetime|    pickup_datetime|lpep_dropoff_datetime|   dropoff_datetime|trip_duration|fare_amount|
+--------------------+-------------------+---------------------+-------------------+-------------+-----------+
| 2018-01-01 00:13:46|2018-01-01 00:13:46|  2018-01-01 00:13:49|2018-01-01 00:13:49|            3|        2.5|
| 2018-01-01 00:23:45|2018-01-01 00:23:45|  2018-01-01 00:23:45|2018-01-01 00:23:45|            0|        4.5|
| 2018-01-01 00:42:50|2018-01-01 00:42:50|  2018-01-01 00:43:06|2018-01-01 00:43:06|           16|        2.5|
| 2018-01-01 00:16:36|2018-01-01 00:16:36|  2018-01-01 00:16:45|2018-01-01 00:16:45|            9|          6|
| 2018-01-01 00:03:08|2018-01-01 00:03:08|  2018-01-01 00:03:19|2018-01-01 00:03:19|           11|         52|
| 2018-01-01 00:07:57|2018-01-01 00:07:57|  2018-01-01 00:08:01|2018-01-01 00:08:01|            4|         18|
|

In [21]:
# Look to see how many trips are greater than 10 hours
green_df.withColumn("pickup_datetime", F.unix_timestamp(F.col("lpep_pickup_datetime"), "yyyy-MM-dd' 'HH:mm:ss").cast("timestamp")).\
    withColumn("dropoff_datetime", F.unix_timestamp(F.col("lpep_dropoff_datetime"), "yyyy-MM-dd' 'HH:mm:ss").cast("timestamp")).\
    withColumn("trip_duration", (F.col("dropoff_datetime").cast("long") - F.col("pickup_datetime").cast("long"))).\
    filter(F.col("trip_duration") > 36000).\
    select(["lpep_pickup_datetime","pickup_datetime","lpep_dropoff_datetime","dropoff_datetime","trip_duration","fare_amount"]).\
    show()

+--------------------+-------------------+---------------------+-------------------+-------------+-----------+
|lpep_pickup_datetime|    pickup_datetime|lpep_dropoff_datetime|   dropoff_datetime|trip_duration|fare_amount|
+--------------------+-------------------+---------------------+-------------------+-------------+-----------+
| 2018-01-01 00:40:10|2018-01-01 00:40:10|  2018-01-02 00:34:41|2018-01-02 00:34:41|        86071|       19.5|
| 2018-01-01 00:14:23|2018-01-01 00:14:23|  2018-01-01 23:44:27|2018-01-01 23:44:27|        84604|        7.5|
| 2018-01-01 00:48:42|2018-01-01 00:48:42|  2018-01-02 00:16:59|2018-01-02 00:16:59|        84497|         24|
| 2018-01-01 00:43:33|2018-01-01 00:43:33|  2018-01-02 00:34:08|2018-01-02 00:34:08|        85835|         16|
| 2018-01-01 00:22:14|2018-01-01 00:22:14|  2018-01-01 23:19:55|2018-01-01 23:19:55|        82661|       12.5|
| 2018-01-01 00:58:22|2018-01-01 00:58:22|  2018-01-02 00:48:27|2018-01-02 00:48:27|        85805|         33|
|

In [111]:
# Investigate size of 99 RateCodeID issue
green_df.filter(F.col("RateCodeID") > 6).\
    groupBy("RateCodeID").\
    count().\
    show()

+----------+-----+
|RateCodeID|count|
+----------+-----+
|        99|    3|
+----------+-----+



In [110]:
# Check how many trips have a passenger count of less than 1 (i.e. zero)
green_df.filter(F.col("passenger_count").astype(IntegerType()) < 1).\
    count()

173

In [112]:
# Check how many trips have a distance of zero or less and a fare amount of zero or less
green_df.filter(F.col("trip_distance") <= 0.0).\
    filter(F.col("fare_amount").astype(DoubleType()) <= 0.0).\
    count()

635

In [113]:
# Check how many trips have a distance greater than zero but a fare less than or equal to zero
green_df.filter(F.col("trip_distance") > 0.0).\
    filter(F.col("fare_amount").astype(DoubleType()) <= 0.0).\
    count()

3049

In [114]:
# Check how many trips have a high fare amount and a low trip distance
green_df.filter(F.col("fare_amount").astype(DoubleType()) > 50.0).\
    filter(F.col("trip_distance") < 1.0).\
    count()

774

#### Further Data Issues

* **Rate code id**: A small number of rides have a rate code of 99 (3 in the January 2018 file).
* **Passenger Counts**: A small number of trips (173) have a passenger count of 0; this is a driver-entered field.
* **Trip Distance**: A relatively small number of trips have a zero distance; of these, 635 also have a fare amount of zero or less.
* **Fare Amount**: A small number of trips have a fare amount of zero or less; most have a trip distance greater than zero (3,049), some do not (635).
* **Fare Amount**: A small number of trips (774) have a high fare amount (over 50) and a low trip distance (under 1).
* **Trip Duration**: Some trips last over 10 hours, which seems improbable.
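
To put "a small number" into proportion, the counts reported by the cells above can be expressed as a share of each file's row count. This is plain arithmetic on the figures printed in this notebook (8,759,874 yellow rows and 793,529 green rows); `share_percent` is a hypothetical helper.

```python
def share_percent(affected, total):
    """Percentage of rows affected by an issue."""
    return 100.0 * affected / total


YELLOW_TOTAL = 8759874
GREEN_TOTAL = 793529

# issue label -> (yellow count, green count), as printed by the cells above
ISSUE_COUNTS = {
    "rate code 99": (106, 3),
    "zero passengers": (59269, 173),
    "distance <= 0 and fare <= 0": (1986, 635),
    "distance > 0 and fare <= 0": (4512, 3049),
    "fare > 50 and distance < 1": (18678, 774),
}

for label, (yellow_count, green_count) in ISSUE_COUNTS.items():
    yellow_share = share_percent(yellow_count, YELLOW_TOTAL)
    green_share = share_percent(green_count, GREEN_TOTAL)
    print(f"{label}: yellow {yellow_share:.3f}%, green {green_share:.3f}%")
```

Every issue affects well under 1% of either file, although the relative rates differ between the two taxi types.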