# Prepare Data

## Extracting trips to airports

The complete raw CSV file contains more than 110 million rows. We now strip it down by keeping only those taxi trips that end at one of the three airports:
* Newark
* JFK
* LaGuardia


This step is implemented in the module `step_2_extract_trips_to_airport`.

In [1]:
# autoreload re-imports the module after each edit; only needed while debugging it
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import step_2_extract_trips_to_airport as step2

Let's look up the location IDs of the airport zones in the zone lookup table.

In [3]:
zones = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv').dropna()
zones[zones.Zone.str.contains("Airport")]

| | LocationID | Borough | Zone | service_zone |
|---|---|---|---|---|
| 0 | 1 | EWR | Newark Airport | EWR |
| 131 | 132 | Queens | JFK Airport | Airports |
| 137 | 138 | Queens | LaGuardia Airport | Airports |


These values are hard-coded in the `step2` module:

In [4]:
step2.nyc_airports, step2.newark, step2.jfk, step2.laguardia

((1, 132, 138), 1, 132, 138)
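
As a cross-check, the same IDs can be derived from the lookup table instead of being typed in by hand. The following is a minimal sketch, not the module's own code: `zones_demo` is a hypothetical four-row stand-in for the downloaded table, so the example runs without network access.

```python
import pandas as pd

# Hypothetical stand-in for the zone lookup table (same columns, only four rows).
zones_demo = pd.DataFrame({
    "LocationID": [1, 2, 132, 138],
    "Borough": ["EWR", "Queens", "Queens", "Queens"],
    "Zone": ["Newark Airport", "Jamaica Bay", "JFK Airport", "LaGuardia Airport"],
    "service_zone": ["EWR", "Boro Zone", "Airports", "Airports"],
})

# Keep the zones whose name mentions "Airport" and collect their IDs as plain ints.
is_airport = zones_demo.Zone.str.contains("Airport")
airport_ids = tuple(int(i) for i in zones_demo.loc[is_airport, "LocationID"])

print(airport_ids)  # (1, 132, 138)
```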

The main function loads the raw CSV file, keeps only the trips with an airport destination and saves them as a gzipped CSV file.
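
The filtering idea can be sketched as follows. This is an illustration under assumptions, not the actual implementation of `step2`: the function name `extract_trips_to_airports` is hypothetical, and it assumes that the destination is identified by the `DOLocationID` column. A small in-memory CSV stands in for the raw file.

```python
import io

import pandas as pd

NYC_AIRPORTS = (1, 132, 138)  # Newark, JFK, LaGuardia


def extract_trips_to_airports(source, chunksize=10_000_000):
    """Read `source` chunk by chunk and keep only rows whose drop-off zone is an airport.

    Hypothetical helper; assumes the drop-off zone is stored in `DOLocationID`.
    """
    selected = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        selected.append(chunk[chunk["DOLocationID"].isin(NYC_AIRPORTS)])
    return pd.concat(selected, ignore_index=True)


# Tiny in-memory stand-in for the raw file.
raw = io.StringIO(
    "PULocationID,DOLocationID,trip_distance\n"
    "48,132,17.1\n"
    "79,230,1.4\n"
    "161,138,9.8\n"
    "1,48,16.0\n"
)

trips = extract_trips_to_airports(raw, chunksize=2)
print(trips["DOLocationID"].tolist())  # [132, 138]
```

Reading in chunks keeps memory usage bounded, because only the selected rows of each chunk are retained. Note that the last row starts at Newark but does not end at an airport, so it is dropped.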

You may call it from the command line:

```
> python step_2_extract_trips_to_airport.py
```

```
> python step_2_extract_trips_to_airport.py
=== nyc taxi to airport - step 2 extract trips to airport
loading file: nyc-2017-yellow-taxi-trips.cvs.gz
time 18s | 10,000,000 lines read in | 222,622 lines selected
time 36s | 20,000,000 lines read in | 467,448 lines selected
time 53s | 30,000,000 lines read in | 706,344 lines selected
time 71s | 40,000,000 lines read in | 932,478 lines selected
time 90s | 50,000,000 lines read in | 1,166,319 lines selected
time 108s | 60,000,000 lines read in | 1,401,412 lines selected
time 126s | 70,000,000 lines read in | 1,636,481 lines selected
time 145s | 80,000,000 lines read in | 1,860,261 lines selected
time 163s | 90,000,000 lines read in | 2,058,773 lines selected
time 181s | 100,000,000 lines read in | 2,244,486 lines selected
time 199s | 110,000,000 lines read in | 2,451,356 lines selected
time 206s | 113,496,874 lines read in | 2,533,072 lines selected
saving file: nyc-2017-yellow-taxi-trips-to-airport.cvs.gz
done
```

The result file contains 2,533,072 data rows plus one header line, i.e. 2,533,073 lines in total. Its gzipped size is now only about 50 MB.

```
> gunzip -l nyc-2017-yellow-taxi-trips-to-airport.cvs.gz
         compressed        uncompressed  ratio uncompressed_name
           51793187           278755354  81.4% nyc-2017-yellow-taxi-trips-to-airport.cvs
> gunzip -c nyc-2017-yellow-taxi-trips-to-airport.cvs.gz|wc -l
2533073
```
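
If `gunzip` and `wc` are not available, the line count can also be checked from Python. A minimal sketch using only the standard library; `count_lines` is a hypothetical helper, and a small temporary file stands in for the real result file.

```python
import gzip
import os
import tempfile


def count_lines(path):
    """Count the lines of a gzipped text file without loading it into memory."""
    with gzip.open(path, "rt") as f:
        return sum(1 for _ in f)


# Write a small demo file: one header line plus two data rows.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.csv.gz")
with gzip.open(demo_path, "wt") as f:
    f.write("PULocationID,DOLocationID\n48,132\n161,138\n")

n_lines = count_lines(demo_path)
print(n_lines - 1, "data rows")  # 2 data rows
```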

## Clean data

The cleaning step is implemented in the module `step_3_clean_data.py`.

In [13]:
import step_3_clean_data as step3

Let's load the data and look for things to clean up.

In [None]:
# note: the input of step 3 is the output of step 2
#       if you did not execute step 2, this will fail
%time df2 = step3.load_data(input_file=step3.input_file)

loading file: nyc-2017-yellow-taxi-trips-to-airport.cvs.gz
100,000 lines read | time 35s


In [14]:
step3.main?

Signature: step3.main()
Docstring:
Loads the data for taxi trips to airports from step 2, cleans it and saves the result.

If output_file already exists, the function skips.
Remove the output_file manually in that case.

The input_file is loaded in chunks of 100,000 lines.
While loading simple progress info will be displayed.

After the whole file is loaded the function clean_data is applied.
That includes a transformation to efficient datatypes.

At the end the cleaned dataset is saved as a gzipped pickle file,
so that the datatypes are not lost.

Remember: pickle files should only be used for temporary storage, since
the format is not guaranteed to be stable between different lib versions.                   

Keyword Arguments: -

Returns: -
File:      ~/github/nyc-taxi-to-airport/step_3_clean_data.py
Type:      function


In [21]:
step3.load_data?

Signature: step3.load_data(input_file)
Docstring:
Loads the dataframe from input_file.

The file will be loaded with pandas.read_csv with a chunksize of 100_000.
Simple progress info will be displayed during loading.

To speed up, the following transformations are done while loading:
   - only the columns in cols_to_use are loaded
   - data types are mapped as specified in dict data_types
   - the columns specified in dates_to_parse will be parsed

Keyword Arguments:
input_file -- the filepath of the input file to read

Returns: the loaded dataframe
File:      ~/github/nyc-taxi-to-airport/step_3_clean_data.py
Type:      function


In [15]:
step3.input_file

'nyc-2017-yellow-taxi-trips-to-airport.cvs.gz'

In [16]:
step3.output_file

'nyc-2017-yellow-taxi-trips-to-airport.pkl.gz'
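
The docstring's point about pickle keeping the datatypes can be illustrated with a small round trip. This is a sketch with a hypothetical two-row dataframe and temporary files, not the module's own code; it relies on pandas inferring gzip compression from the `.gz` suffix.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical two-row dataframe with a parsed date column and a compact integer column.
df_demo = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2017-01-01 00:10:00", "2017-01-01 00:20:00"]),
    "DOLocationID": np.array([132, 138], dtype=np.int16),
})

tmp_dir = tempfile.mkdtemp()
pkl_path = os.path.join(tmp_dir, "demo.pkl.gz")
csv_path = os.path.join(tmp_dir, "demo.csv.gz")

df_demo.to_pickle(pkl_path)            # gzip compression inferred from the suffix
df_demo.to_csv(csv_path, index=False)  # same data as gzipped CSV for comparison

from_pickle = pd.read_pickle(pkl_path)
from_csv = pd.read_csv(csv_path)

# The pickle round trip keeps int16 and the datetime column; the CSV round trip does not.
dtypes_preserved = bool((from_pickle.dtypes == df_demo.dtypes).all())
print(dtypes_preserved)
print(from_csv.dtypes)
```

After a CSV round trip the types would have to be restored with `dtype` and `parse_dates` on every load, which is why the cleaned data is stored as a pickle here.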

In [23]:
step3.cols_to_use

['Unnamed: 0',
 'tpep_pickup_datetime',
 'tpep_dropoff_datetime',
 'PULocationID',
 'DOLocationID',
 'trip_distance']

In [18]:
step3.data_types

{'PULocationID': numpy.int16, 'DOLocationID': numpy.int16}

In [19]:
step3.dates_to_parse

['tpep_pickup_datetime', 'tpep_dropoff_datetime']
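
Taken together, these three settings map naturally onto the `usecols`, `dtype` and `parse_dates` parameters of `pandas.read_csv`. The sketch below shows how a chunked load along the lines of the docstring might look. It is an assumption about the approach, not a copy of `step3.load_data`: the function name `load_in_chunks` and the sample data (including the extra `passenger_count` column) are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Settings as shown above.
cols_to_use = ["Unnamed: 0", "tpep_pickup_datetime", "tpep_dropoff_datetime",
               "PULocationID", "DOLocationID", "trip_distance"]
data_types = {"PULocationID": np.int16, "DOLocationID": np.int16}
dates_to_parse = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]


def load_in_chunks(source, chunksize=100_000):
    """Hypothetical loader: read `source` in chunks and report simple progress."""
    reader = pd.read_csv(source, usecols=cols_to_use, dtype=data_types,
                         parse_dates=dates_to_parse, chunksize=chunksize)
    chunks = []
    lines_read = 0
    for chunk in reader:
        chunks.append(chunk)
        lines_read += len(chunk)
        print(f"{lines_read:,} lines read")
    return pd.concat(chunks, ignore_index=True)


# Tiny in-memory stand-in for the step 2 output; passenger_count is not in cols_to_use.
sample = io.StringIO(
    "Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,"
    "PULocationID,DOLocationID,trip_distance\n"
    "7,2017-01-01 00:10:00,2017-01-01 00:45:00,1,48,132,17.1\n"
    "9,2017-01-01 00:20:00,2017-01-01 00:50:00,2,161,138,9.8\n"
)

df_sample = load_in_chunks(sample, chunksize=1)
print(df_sample.dtypes)
```

Restricting the columns and using `int16` for the location IDs reduces memory usage, which matters once all chunks are concatenated into a single dataframe.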

You may call it from the command line:
