# Data Understanding

## The Data Source [NYC OpenData Yellow Taxis Trips 2017](https://data.cityofnewyork.us/Transportation/2017-Yellow-Taxi-Trip-Data/biws-g3hs)

[NYC OpenData](https://opendata.cityofnewyork.us) offers a collection of datasets for [Yellow taxi trips in New York City](https://data.cityofnewyork.us/browse?q=yellow%20taxi%20trip&sortBy=relevance) covering the years 2009 through 2018.

## The Dataset

I decided to use [the dataset for 2017](https://data.cityofnewyork.us/Transportation/2017-Yellow-Taxi-Trip-Data/biws-g3hs) for the following reasons:
* My questions concern the year 2018, but the data for 2018 is not yet complete.
* Starting with 2017, a new data format is used: for privacy reasons, the start and end locations of trips are no longer given as GPS coordinates but are coded as one of 263 taxi zones. I assumed that the new format would be easier to evaluate.
* The 2017 data comes as a 10 GB CSV file. Since I was on a tight schedule (several days), I wanted to avoid trouble with large data volumes. At the same time, the data should cover at least one whole year, and 2017 is the closest full year to 2018.

## How to download the dataset

Use [the download script](step_1a_download_raw_data.sh) I prepared:

```bash step_1a_download_raw_data.sh```

It downloads the data in CSV format from https://data.cityofnewyork.us/api/views/biws-g3hs/rows.csv?accessType=DOWNLOAD and compresses it with gzip at the best compression level, saving the result as `nyc-2017-yellow-taxi-trips.cvs.gz`.

```
> bash step_1a_download_raw_data.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9.8G    0  9.8G    0     0  1094k      0 --:--:--  2:36:34 --:--:-- 1213k
```

The download took about 2.5 hours (2:36:34 in the run above). The server appears to limit throughput to roughly 1 MB/s.

The file contains 113,496,874 data lines plus one header line:
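The script itself is not reproduced here. As a rough Python equivalent of the download-and-compress step described above, the following sketch streams a binary source into a gzip file at the highest compression level. The helper name `compress_stream` and the demo file name are hypothetical, and the in-memory bytes only stand in for the real HTTP response so that the example runs without network access.

```python
import gzip
import io
import os
import shutil
import tempfile


def compress_stream(source, target_path, level=9):
    """Copy a binary file-like object into a gzip file (level 9 = best compression)."""
    with gzip.open(target_path, "wb", compresslevel=level) as target:
        shutil.copyfileobj(source, target)


# In-memory bytes stand in for the HTTP response body (assumption for the demo).
# For the real download, `source` could be the object returned by
# urllib.request.urlopen(url), which is also a binary file-like object.
demo_source = io.BytesIO(b"VendorID,trip_distance\n1,2.5\n")
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
compress_stream(demo_source, demo_path)

# Read the compressed file back to check the round trip.
with gzip.open(demo_path, "rt") as fh:
    demo_text = fh.read()
```

Because the data is copied in blocks, the full 10 GB file never has to fit into memory.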

```
> gunzip -c nyc-2017-yellow-taxi-trips.cvs.gz | wc -l
 113496875
```
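The same count can be obtained from Python without unpacking the file to disk. This is a minimal sketch; `count_lines` is a hypothetical helper, and the three-line demo file is made up so the example is self-contained.

```python
import gzip
import os
import tempfile


def count_lines(path):
    """Count the lines of a gzip-compressed text file by streaming through it."""
    with gzip.open(path, "rb") as fh:
        return sum(1 for _ in fh)


# Tiny made-up file: one header line plus two data lines.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_path, "wt") as fh:
    fh.write("header\nrow1\nrow2\n")

line_count = count_lines(demo_path)
data_rows = line_count - 1  # subtract the header line
```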

## Structure of the dataset

NYC OpenData offers [General Info about the Dataset](https://data.cityofnewyork.us/Transportation/2017-Yellow-Taxi-Trip-Data/biws-g3hs) and a [PDF-Data-Dictionary](https://data.cityofnewyork.us/api/views/biws-g3hs/files/eb3ccc47-317f-4b2a-8f49-5a684b0b1ecc?download=true&filename=data_dictionary_trip_records_yellow.pdf).

In [32]:
import pandas as pd

In [35]:
# let's peek only at the first 1000 rows,
# since reading the whole file would take too long
df = pd.read_csv('nyc-2017-yellow-taxi-trips.cvs.gz', chunksize=1000).get_chunk()

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
VendorID                 1000 non-null int64
tpep_pickup_datetime     1000 non-null object
tpep_dropoff_datetime    1000 non-null object
passenger_count          1000 non-null int64
trip_distance            1000 non-null float64
RatecodeID               1000 non-null int64
store_and_fwd_flag       1000 non-null object
PULocationID             1000 non-null int64
DOLocationID             1000 non-null int64
payment_type             1000 non-null int64
fare_amount              1000 non-null float64
extra                    1000 non-null float64
mta_tax                  1000 non-null float64
tip_amount               1000 non-null float64
tolls_amount             1000 non-null float64
improvement_surcharge    1000 non-null float64
total_amount             1000 non-null float64
dtypes: float64(8), int64(6), object(3)
memory usage: 132.9+ KB
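
The `chunksize` argument used for the peek above can also be used to work through a large file piece by piece. The following is only a sketch under assumptions: the CSV text is made up and held in memory, and counting trips per pickup zone is just an illustrative aggregation, not a step taken from this analysis.

```python
import io

import pandas as pd

# Made-up miniature CSV standing in for the real file (assumption).
csv_text = (
    "PULocationID,trip_distance\n"
    "1,1.0\n"
    "2,2.0\n"
    "1,3.0\n"
    "2,4.0\n"
    "1,5.0\n"
)

# Accumulate trip counts per pickup zone, reading two rows at a time.
trips_per_zone = {}
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    for zone_id, n in chunk["PULocationID"].value_counts().items():
        key = int(zone_id)
        trips_per_zone[key] = trips_per_zone.get(key, 0) + int(n)
```

Only one chunk is held in memory at a time, so the approach also works when the full table does not fit into RAM.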


## Interesting Variables

### tpep_pickup_datetime, tpep_dropoff_datetime

These are the start and end times of the taxi trip. We will need them to analyze the trip duration.
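
As a sketch of how a duration could be derived from these two columns: `df.info()` lists them as `object`, so they first have to be parsed into timestamps. The sample rows and the timestamp format string are assumptions for illustration, and `duration_min` is a hypothetical column name.

```python
import pandas as pd

# Two made-up trips; the timestamp format is an assumption.
trips = pd.DataFrame({
    "tpep_pickup_datetime": ["01/01/2017 12:00:00 AM", "01/01/2017 08:15:00 PM"],
    "tpep_dropoff_datetime": ["01/01/2017 12:10:30 AM", "01/01/2017 09:00:00 PM"],
})

# Parse the string columns into proper timestamps.
fmt = "%m/%d/%Y %I:%M:%S %p"
for col in ["tpep_pickup_datetime", "tpep_dropoff_datetime"]:
    trips[col] = pd.to_datetime(trips[col], format=fmt)

# Duration in minutes = dropoff time minus pickup time.
elapsed = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
trips["duration_min"] = elapsed.dt.total_seconds() / 60
```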

### trip_distance

We will need this one to compute the trip velocity. As we will see, there are trips with unbelievable velocities, and we will drop them.
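
A minimal sketch of such a plausibility filter follows. It assumes that `trip_distance` is given in miles and that a duration in minutes is already available as a hypothetical `duration_min` column. The sample values and the threshold of 80 mph are made up for illustration and are not the cut-off used in this analysis.

```python
import pandas as pd

# Made-up trips: two plausible ones, one absurdly fast, one with zero duration.
trips = pd.DataFrame({
    "trip_distance": [2.0, 15.0, 300.0, 1.0],  # assumed to be miles
    "duration_min": [10.0, 30.0, 20.0, 0.0],   # hypothetical helper column
})

# Velocity in miles per hour; a zero duration yields inf.
trips["mph"] = trips["trip_distance"] / (trips["duration_min"] / 60)

MAX_PLAUSIBLE_MPH = 80  # illustrative threshold (assumption)
is_plausible = (trips["duration_min"] > 0) & (trips["mph"] <= MAX_PLAUSIBLE_MPH)
plausible = trips[is_plausible]
```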

### PULocationID, DOLocationID

These are the IDs of the start and end zones of the trip. We will filter for start and end zone.
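
Filtering on both columns could look like the sketch below. The trips and the ID sets are arbitrary placeholders chosen for the demo, not the zones selected in this analysis.

```python
import pandas as pd

# Made-up trips with pickup (PU) and dropoff (DO) zone IDs.
trips = pd.DataFrame({
    "PULocationID": [132, 48, 230, 138],
    "DOLocationID": [230, 132, 138, 48],
})

pickup_zones = {48, 230}    # placeholder IDs (assumption)
dropoff_zones = {132, 138}  # placeholder IDs (assumption)

# Keep only trips that start in one set of zones and end in the other.
mask = trips["PULocationID"].isin(pickup_zones) & trips["DOLocationID"].isin(dropoff_zones)
selected = trips[mask]
```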

## Zone lookup table

The location IDs are documented in a lookup table that is provided by the [NYC Taxi & Limousine Commission](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).

Use [the download script](step_1b_download_taxi_zone_lookup.sh) I prepared:

```bash step_1b_download_taxi_zone_lookup.sh```

It downloads the lookup table in CSV format from https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv and saves it as `nyc-taxi-zone-lookup.csv`.

```
> bash step_1b_download_taxi_zone_lookup.sh
=== nyc taxi to airport - step 1b download taxi zone lookup

downloading from url: https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
writing to file: nyc-taxi-zone-lookup.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12322  100 12322    0     0   2232      0  0:00:05  0:00:05 --:--:--  2574

done
```

In [44]:
zones = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv')

In [45]:
zones.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 4 columns):
LocationID      265 non-null int64
Borough         265 non-null object
Zone            264 non-null object
service_zone    263 non-null object
dtypes: int64(1), object(3)
memory usage: 8.4+ KB


LocationID is the integer value that is used in the taxi trip dataset.
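
To illustrate how the lookup table connects to the trip data, the sketch below attaches zone names to pickup IDs with a left join. Both frames are small made-up excerpts, and `labelled` is a hypothetical name.

```python
import pandas as pd

# Made-up excerpt of the lookup table.
zones = pd.DataFrame({
    "LocationID": [56, 57, 264],
    "Borough": ["Queens", "Queens", "Unknown"],
    "Zone": ["Corona", "Corona", "NV"],
})

# Made-up trips that only carry the integer pickup ID.
trips = pd.DataFrame({"PULocationID": [57, 264, 56]})

# A left join keeps every trip and adds the matching zone information.
labelled = trips.merge(zones, how="left", left_on="PULocationID", right_on="LocationID")
```

A left join is used here so that trips with an ID missing from the lookup table would be kept (with empty zone columns) rather than silently dropped.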

In [50]:
zones.LocationID.unique()

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
       170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 18

The IDs range from 1 to 265.

The last two IDs, 264 and 265, encode missing values. ID 264 nevertheless has the Zone value "NV", which I assume stands for "no value".

In [53]:
zones.tail()

| | LocationID | Borough | Zone | service_zone |
|---|---|---|---|---|
| 260 | 261 | Manhattan | World Trade Center | Yellow Zone |
| 261 | 262 | Manhattan | Yorkville East | Yellow Zone |
| 262 | 263 | Manhattan | Yorkville West | Yellow Zone |
| 263 | 264 | Unknown | NV | NaN |
| 264 | 265 | Unknown | NaN | NaN |


Borough takes one of seven string values: 'EWR', 'Queens', 'Bronx', 'Manhattan', 'Staten Island', 'Brooklyn', 'Unknown'

In [54]:
zones.Borough.unique()

array(['EWR', 'Queens', 'Bronx', 'Manhattan', 'Staten Island', 'Brooklyn',
       'Unknown'], dtype=object)

Zone is the most fine-grained information. There are 262 distinct Zone values.

In [57]:
zones.Zone.unique().shape

(262,)
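
One detail worth keeping in mind when reading this number: `unique()` treats a missing value as a value of its own, whereas `nunique()` skips missing values by default. Since `zones.info()` reports only 264 non-null Zone entries, the 262 above includes the missing value as one of the counted values. The tiny series below is made up to show the difference.

```python
import pandas as pd

# Made-up Zone column: one duplicated name, one "NV" code, one missing value.
zone = pd.Series(["Corona", "Corona", "NV", None])

n_unique_with_missing = len(zone.unique())  # the missing value is counted
n_unique_without_missing = zone.nunique()   # missing values are dropped by default
```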

Why are there 262 distinct Zone values but 265 LocationIDs?

In [113]:
ids_per_zone = zones[['LocationID','Zone']].groupby('Zone').count().iloc[:,0]
ids_per_zone[ids_per_zone != 1]

Zone
Corona                                           2
Governor's Island/Ellis Island/Liberty Island    3
Name: LocationID, dtype: int64

In [122]:
zones[zones.Zone.isin(["Corona", "Governor's Island/Ellis Island/Liberty Island"])]

| | LocationID | Borough | Zone | service_zone |
|---|---|---|---|---|
| 55 | 56 | Queens | Corona | Boro Zone |
| 56 | 57 | Queens | Corona | Boro Zone |
| 102 | 103 | Manhattan | Governor's Island/Ellis Island/Liberty Island | Yellow Zone |
| 103 | 104 | Manhattan | Governor's Island/Ellis Island/Liberty Island | Yellow Zone |
| 104 | 105 | Manhattan | Governor's Island/Ellis Island/Liberty Island | Yellow Zone |


So there are 265 IDs, two of which (264 and 265) encode missing values. `unique()` reports 262 Zone values, but that count includes the missing Zone of ID 265 and the code "NV" of ID 264, which leaves 260 real zone names. One zone has 2 IDs and another has 3 IDs; all other zones have 1 ID. That makes (2-1) + (3-1) = 3 more IDs than real zone names: 260 + 3 = 263 real IDs, which matches the 263 taxi zones, and 263 + 2 = 265 IDs in total.