# 2. Data Understanding

## The Data Source: [NYC OpenData Yellow Taxi Trips 2017](https://data.cityofnewyork.us/Transportation/2017-Yellow-Taxi-Trip-Data/biws-g3hs)

[NYC OpenData](https://opendata.cityofnewyork.us) offers a collection of datasets for [Yellow taxi trips in New York City](https://data.cityofnewyork.us/browse?q=yellow%20taxi%20trip&sortBy=relevance) covering the years 2009 to 2018.

## The Dataset

I decided to use [the dataset for 2017](https://data.cityofnewyork.us/Transportation/2017-Yellow-Taxi-Trip-Data/biws-g3hs) for the following reasons:
* My questions concern the year 2018, but the data for 2018 is not complete.
* Beginning with 2017, a new data format was used: for privacy reasons, the start and end locations of trips are no longer given as GPS coordinates but are coded as one of 263 taxi zones. I assumed that the new format is easier to evaluate.
* The 2017 data is provided as a 10 GB CSV file. Since I had a tight timeline (several days), I wanted to avoid trouble with big data sizes. The data should cover at least one whole year, and 2017 is the closest to 2018.

## How to download the dataset

Use [the download script](step_1_download_raw_data.sh) I prepared:

```bash step_1_download_raw_data.sh```

The script downloads the data in CSV format from this URL: https://data.cityofnewyork.us/api/views/biws-g3hs/rows.csv?accessType=DOWNLOAD, compresses it with gzip at the best compression level, and saves it as `nyc-2017-yellow-taxi-trips.cvs.gz`.

```
>bash step_1_download_raw_data.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9.8G    0  9.8G    0     0  1094k      0 --:--:--  2:36:34 --:--:-- 1213k
```

The download took a little over 2.5 hours (2:36:34 in the run above). The server appears to limit throughput to about 1 MB/s.
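The download script itself is not reproduced here. As a rough Python equivalent of the step described above (stream the CSV export and gzip it at the best level), a minimal sketch could look like the following. The helper names `stream_to_gzip` and `download` are hypothetical and are not taken from the script.

```python
import gzip
import shutil
import urllib.request

# URL and output file name as given in the text above.
URL = "https://data.cityofnewyork.us/api/views/biws-g3hs/rows.csv?accessType=DOWNLOAD"
OUT = "nyc-2017-yellow-taxi-trips.cvs.gz"


def stream_to_gzip(src, dest_path, chunk_size=1024 * 1024):
    """Copy a binary file-like object into a gzip file, one chunk at a time."""
    # compresslevel=9 is gzip's best (and slowest) compression level.
    with gzip.open(dest_path, "wb", compresslevel=9) as dest:
        shutil.copyfileobj(src, dest, chunk_size)


def download(url=URL, dest_path=OUT):
    """Hypothetical helper: stream the CSV export straight into a gzip file."""
    with urllib.request.urlopen(url) as response:
        stream_to_gzip(response, dest_path)
```

Calling `download()` would start the full transfer, so expect it to run for hours. Streaming in chunks keeps memory use small even though the uncompressed file is about 10 GB.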

The downloaded file contains 113,496,874 data lines plus one header line.

```
> gunzip -c nyc-2017-yellow-taxi-trips.cvs.gz | wc -l
 113496875
```

## Structure of the dataset

NYC OpenData offers [General Info about the Dataset](https://data.cityofnewyork.us/Transportation/2017-Yellow-Taxi-Trip-Data/biws-g3hs) and a [PDF-Data-Dictionary](https://data.cityofnewyork.us/api/views/biws-g3hs/files/eb3ccc47-317f-4b2a-8f49-5a684b0b1ecc?download=true&filename=data_dictionary_trip_records_yellow.pdf).

```python
import pandas as pd

# Peek at only the first 1000 rows,
# since reading the whole file would take too long.
df = pd.read_csv('nyc-2017-yellow-taxi-trips.cvs.gz', nrows=1000)

df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
VendorID                 1000 non-null int64
tpep_pickup_datetime     1000 non-null object
tpep_dropoff_datetime    1000 non-null object
passenger_count          1000 non-null int64
trip_distance            1000 non-null float64
RatecodeID               1000 non-null int64
store_and_fwd_flag       1000 non-null object
PULocationID             1000 non-null int64
DOLocationID             1000 non-null int64
payment_type             1000 non-null int64
fare_amount              1000 non-null float64
extra                    1000 non-null float64
mta_tax                  1000 non-null float64
tip_amount               1000 non-null float64
tolls_amount             1000 non-null float64
improvement_surcharge    1000 non-null float64
total_amount             1000 non-null float64
dtypes: float64(8), int64(6), object(3)
memory usage: 132.9+ KB
```


## Interesting Variables

### tpep_pickup_datetime, tpep_dropoff_datetime

These are the start and end times of the taxi trip. We will need them to analyze the trip duration.
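As a minimal sketch of how a trip duration could be derived from these two columns, the example below uses a tiny made-up sample. The output column name `trip_duration_min` and the helper `add_trip_duration` are hypothetical. The timestamp format in the sample is an assumption as well; the real file may need an explicit `format=` argument to `pd.to_datetime`.

```python
import pandas as pd

# Made-up sample rows; the timestamp format is an assumption.
sample = pd.DataFrame({
    "tpep_pickup_datetime": ["2017-01-01 00:00:00", "2017-01-01 08:15:00"],
    "tpep_dropoff_datetime": ["2017-01-01 00:12:30", "2017-01-01 09:00:00"],
})


def add_trip_duration(df):
    """Return a copy of df with the trip duration in minutes added."""
    out = df.copy()
    # The columns are read as strings (dtype object), so parse them first.
    pickup = pd.to_datetime(out["tpep_pickup_datetime"])
    dropoff = pd.to_datetime(out["tpep_dropoff_datetime"])
    out["trip_duration_min"] = (dropoff - pickup).dt.total_seconds() / 60
    return out


trips = add_trip_duration(sample)
print(trips["trip_duration_min"].tolist())  # [12.5, 45.0]
```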

### trip_distance

We will need this one to compute the trip velocity. As we will see, there are trips with unbelievable velocities in the data, and we will drop them.
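A velocity filter of this kind might be sketched as follows, assuming a trip duration in minutes has already been derived from the two timestamp columns. The column names `trip_duration_min` and `speed_mph`, the helper `drop_implausible_speeds`, and the 100 mph cutoff are all hypothetical choices for illustration, not values from the analysis. The data dictionary gives `trip_distance` in miles.

```python
import pandas as pd

# Made-up sample: a normal trip, an implausibly fast one, a zero-duration one.
sample = pd.DataFrame({
    "trip_distance": [2.0, 50.0, 1.0],      # miles
    "trip_duration_min": [10.0, 5.0, 0.0],  # hypothetical derived column
})

MAX_PLAUSIBLE_MPH = 100.0  # hypothetical cutoff, chosen only for this sketch


def drop_implausible_speeds(df, max_mph=MAX_PLAUSIBLE_MPH):
    """Keep trips with a positive duration and an average speed <= max_mph."""
    # Zero or negative durations would give infinite or negative speeds.
    out = df[df["trip_duration_min"] > 0].copy()
    out["speed_mph"] = out["trip_distance"] / (out["trip_duration_min"] / 60)
    return out[out["speed_mph"] <= max_mph]


clean = drop_implausible_speeds(sample)
```

In this sample only the first row survives: the second trip would average 600 mph, and the third has no duration at all.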

### PULocationID, DOLocationID

The IDs of the start and end zones of the trip. We will filter by start and end zone.
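Filtering by zone could look like the sketch below. The zone IDs are arbitrary example values and the helper `filter_by_zones` is hypothetical. Which zones are of interest depends on the questions being asked.

```python
import pandas as pd

# Made-up sample with arbitrary zone IDs.
sample = pd.DataFrame({
    "PULocationID": [132, 161, 132, 48],
    "DOLocationID": [161, 132, 230, 161],
})


def filter_by_zones(df, pickup_zones, dropoff_zones):
    """Keep trips that start in pickup_zones and end in dropoff_zones."""
    mask = (
        df["PULocationID"].isin(pickup_zones)
        & df["DOLocationID"].isin(dropoff_zones)
    )
    return df[mask]


selected = filter_by_zones(sample, pickup_zones=[132], dropoff_zones=[161, 230])
```

Here `selected` keeps the first and third rows, the two trips that start in zone 132 and end in zone 161 or 230.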