# Prepare Data

## Extracting trips to airports

The complete raw csv file contains more than 110 million rows. We will now strip it down by selecting only those taxi trips that are heading to any of the three airports:
* Newark
* JFK
* LaGuardia


The module that does this is called `step_2_extract_trips_to_airport`.

In [1]:
# autoreload is only needed while debugging the module; comment these lines out otherwise
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import step_2_extract_trips_to_airport as step2

Let's look up the location IDs of the airport zones in the zone lookup table.

In [3]:
zones = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv').dropna()
zones[zones.Zone.str.contains("Airport")]

| | LocationID | Borough | Zone | service_zone |
|---|---|---|---|---|
| 0 | 1 | EWR | Newark Airport | EWR |
| 131 | 132 | Queens | JFK Airport | Airports |
| 137 | 138 | Queens | LaGuardia Airport | Airports |


These values are hard-coded in the step2 module:

In [4]:
step2.nyc_airports, step2.newark, step2.jfk, step2.laguardia

((1, 132, 138), 1, 132, 138)

The main function loads the raw CSV file, keeps only the trips with an airport destination, and saves them as a gzipped CSV file.
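
The filtering step can be pictured with a minimal sketch. It is not the code of the step2 module: the helper name `extract_trips_to_airports`, the sample rows, and the temporary file name are assumptions made for illustration. Only the column names and the three airport IDs are taken from the outputs shown in this notebook.

```python
import io
import os
import tempfile

import pandas as pd

# Location IDs of Newark, JFK and LaGuardia, as found in the zone lookup table.
AIRPORT_IDS = (1, 132, 138)


def extract_trips_to_airports(source, chunksize=10_000_000):
    """Hypothetical helper: keep only rows whose drop-off zone is an airport."""
    selected = []
    lines_read = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        lines_read += len(chunk)
        selected.append(chunk[chunk["DOLocationID"].isin(AIRPORT_IDS)])
        lines_selected = sum(len(part) for part in selected)
        print(f"{lines_read:,} lines read in | {lines_selected:,} lines selected")
    return pd.concat(selected)


# Tiny stand-in for the raw file (made-up rows, real column names).
raw = io.StringIO(
    "PULocationID,DOLocationID,trip_distance\n"
    "48,132,17.2\n"
    "79,236,4.1\n"
    "161,1,16.9\n"
    "230,138,9.8\n"
    "142,143,0.7\n"
)
trips = extract_trips_to_airports(raw, chunksize=2)

# Saving with the default index writes the original row numbers as an
# unnamed first column, which pandas reads back as 'Unnamed: 0'.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trips-to-airport.csv.gz")
    trips.to_csv(path, compression="gzip")
    reloaded = pd.read_csv(path)

print(reloaded)
```

Reading in chunks keeps memory usage bounded, because only the selected rows of each chunk are kept before the next chunk is read.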

You may call the module from the command line:

```
> python step_2_extract_trips_to_airport.py
```

```
> python step_2_extract_trips_to_airport.py
=== nyc taxi to airport - step 2 extract trips to airport
loading file: nyc-2017-yellow-taxi-trips.cvs.gz
time 18s | 10,000,000 lines read in | 222,622 lines selected
time 36s | 20,000,000 lines read in | 467,448 lines selected
time 53s | 30,000,000 lines read in | 706,344 lines selected
time 71s | 40,000,000 lines read in | 932,478 lines selected
time 90s | 50,000,000 lines read in | 1,166,319 lines selected
time 108s | 60,000,000 lines read in | 1,401,412 lines selected
time 126s | 70,000,000 lines read in | 1,636,481 lines selected
time 145s | 80,000,000 lines read in | 1,860,261 lines selected
time 163s | 90,000,000 lines read in | 2,058,773 lines selected
time 181s | 100,000,000 lines read in | 2,244,486 lines selected
time 199s | 110,000,000 lines read in | 2,451,356 lines selected
time 206s | 113,496,874 lines read in | 2,533,072 lines selected
saving file: nyc-2017-yellow-taxi-trips-to-airport.cvs.gz
done
```

The result file contains 2,533,072 data rows plus one header line, 2,533,073 lines in total. Its gzipped size is now only about 50 MB.

```
> gunzip -l nyc-2017-yellow-taxi-trips-to-airport.cvs.gz
         compressed        uncompressed  ratio uncompressed_name
           51793187           278755354  81.4% nyc-2017-yellow-taxi-trips-to-airport.cvs
> gunzip -c nyc-2017-yellow-taxi-trips-to-airport.cvs.gz|wc -l
2533073
```
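
If `gunzip` is not available, the same check can be done in Python. The following is a small sketch: the helper `count_lines` and the sample file are assumptions for illustration, not part of the project.

```python
import gzip
import os
import tempfile


def count_lines(path):
    """Count the lines of a gzipped text file without unpacking it to disk."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle)


# Made-up stand-in for the result file: one header line plus three data rows.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv.gz")
    with gzip.open(path, "wt") as handle:
        handle.write(",PULocationID,DOLocationID\n0,48,132\n2,161,1\n3,230,138\n")
    total_lines = count_lines(path)

data_rows = total_lines - 1  # subtract the header line
print(total_lines, data_rows)
```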

## Clean data

The module for cleaning is called: `step_3_clean_data.py`

In [13]:
import step_3_clean_data as step3

Let's load the data and look for things to clean up.

In [28]:
# note: the input of step 3 is the output of step 2
#       if you did not execute step 2, this will fail
%time df2 = pd.read_csv(step3.input_file)

CPU times: user 12.9 s, sys: 1.19 s, total: 14.1 s
Wall time: 14.4 s


In [29]:
df2.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2533072 entries, 0 to 2533071
Data columns (total 18 columns):
Unnamed: 0               int64
VendorID                 int64
tpep_pickup_datetime     object
tpep_dropoff_datetime    object
passenger_count          int64
trip_distance            float64
RatecodeID               int64
store_and_fwd_flag       object
PULocationID             int64
DOLocationID             int64
payment_type             int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
dtypes: float64(8), int64(7), object(3)
memory usage: 831.0 MB


This dataframe has a size of 831MB in memory.

We will not keep all columns, only these:

In [30]:
step3.cols_to_use

['Unnamed: 0',
 'tpep_pickup_datetime',
 'tpep_dropoff_datetime',
 'PULocationID',
 'DOLocationID',
 'trip_distance']

In [33]:
df2[step3.cols_to_use].info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2533072 entries, 0 to 2533071
Data columns (total 6 columns):
Unnamed: 0               int64
tpep_pickup_datetime     object
tpep_dropoff_datetime    object
PULocationID             int64
DOLocationID             int64
trip_distance            float64
dtypes: float64(1), int64(3), object(2)
memory usage: 459.0 MB


That reduces the size by about 370 MB (from 831 MB to 459 MB).
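
The effect of dropping columns can be reproduced on a small frame. This is a sketch with made-up values; the helper `deep_megabytes` is hypothetical and simply wraps `DataFrame.memory_usage(deep=True)`.

```python
import pandas as pd

# Made-up data with a few of the real column names.
df = pd.DataFrame({
    "tpep_pickup_datetime": ["2017-01-01 00:10:00"] * 1000,
    "tpep_dropoff_datetime": ["2017-01-01 00:45:00"] * 1000,
    "store_and_fwd_flag": ["N"] * 1000,
    "PULocationID": [48] * 1000,
    "DOLocationID": [132] * 1000,
    "fare_amount": [52.0] * 1000,
    "total_amount": [58.3] * 1000,
})


def deep_megabytes(frame):
    """Memory footprint including the contents of string columns."""
    return frame.memory_usage(deep=True).sum() / 1024**2


subset = df[["tpep_pickup_datetime", "tpep_dropoff_datetime",
             "PULocationID", "DOLocationID"]]
print(f"all columns:    {deep_megabytes(df):.3f} MB")
print(f"subset columns: {deep_megabytes(subset):.3f} MB")

# The figures reported by df2.info() above give the saving for the real data.
saved_mb = 831.0 - 459.0
print(f"saved on the real data: {saved_mb} MB")
```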

The column called 'Unnamed: 0' contains the original row numbers from the raw CSV file. I will keep them because they may be helpful.

We can reduce memory further by using a smaller data type for the location IDs ...

In [36]:
step3.data_types

{'PULocationID': numpy.int16, 'DOLocationID': numpy.int16}

... and by parsing the datetime strings

In [37]:
step3.dates_to_parse

['tpep_pickup_datetime', 'tpep_dropoff_datetime']
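
To show how these three settings could be passed to `pandas.read_csv`, here is a minimal sketch. The sample rows and the chunk size are made up, and the way the step3 module combines the options may differ. `usecols`, `dtype`, `parse_dates`, and `chunksize` are regular `read_csv` parameters.

```python
import io

import numpy as np
import pandas as pd

# Same values as step3.cols_to_use, step3.data_types and step3.dates_to_parse.
cols_to_use = ["Unnamed: 0", "tpep_pickup_datetime", "tpep_dropoff_datetime",
               "PULocationID", "DOLocationID", "trip_distance"]
data_types = {"PULocationID": np.int16, "DOLocationID": np.int16}
dates_to_parse = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]

# Made-up sample rows; VendorID stands in for the columns that get dropped.
sample = io.StringIO(
    "Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,"
    "PULocationID,DOLocationID,trip_distance\n"
    "14,1,2017-01-01 00:10:00,2017-01-01 00:45:00,48,132,17.2\n"
    "57,2,2017-01-01 08:30:00,2017-01-01 09:05:00,161,1,16.9\n"
    "93,1,2017-01-02 17:00:00,2017-01-02 17:40:00,230,138,9.8\n"
)

# Read in chunks and concatenate; each chunk already has the compact types.
chunks = pd.read_csv(sample, usecols=cols_to_use, dtype=data_types,
                     parse_dates=dates_to_parse, chunksize=2)
df = pd.concat(chunks)

print(df.dtypes)
print(np.iinfo(np.int16).max)  # largest value an int16 column can hold
```

An `int16` value needs 2 bytes instead of the 8 bytes of an `int64`, and IDs such as 1, 132, and 138 are far below the `int16` maximum of 32767.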

All of this happens when you call the main function.

In [38]:
step3.main?

Signature: step3.main()
Docstring:
Loads the data for taxi trips to airports from step 2, cleans it and saves the result.

If output_file already exists, the function skips.
Remove the output_file manually in that case.

The input_file is loaded in chunks of 100,000 lines.
While loading simple progress info will be displayed.

After the whole file is loaded the function clean_data is applied.
That includes a transformation to efficient datatypes.

At the end the cleaned dataset is saved as a gzipped pickle file,
so that the datatypes are not lost.

Remember: pickle files should only be used for temporary storage, since
the format is not guaranteed to be stable between different lib versions.                   

Keyword Arguments: -

Returns: -
File:      ~/github/nyc-taxi-to-airport/step_3_clean_data.py
Type:      function


In [39]:
step3.load_data?

Signature: step3.load_data(input_file)
Docstring:
Loads the dataframe from input_file.

The file will be loaded with pandas.read_csv with a chunksize of 100_000.
Simple progress info will be displayed during loading.

To speed up, the following transformations are done while loading:
   - only the columns in cols_to_use are loaded
   - data types are mapped as specified in dict data_types
   - the columns specified in dates_to_parse will be parsed

Keyword Arguments:
input_file -- the filepath of the input file to read

Returns: the loaded dataframe
File:      ~/github/nyc-taxi-to-airport/step_3_clean_data.py
Type:      function


In [49]:
step3.clean_data?

Signature: step3.clean_data(df)
Docstring:
Cleans the passed dataframe.

Actions done while cleaning:
- dropping all rows with missing location ids
- dropping all rows where dropoff time is before pickup time
- consider all location ids that map to the same zone as equivalent and replace them with a single value

Keyword Arguments:
df -- the dataframe to clean

Returns: the cleaned dataframe
File:      ~/github/nyc-taxi-to-airport/step_3_clean_data.py
Type:      function

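The three cleaning actions listed in the `clean_data` docstring can be sketched as follows. This is not the module's code: the function name `clean_data_sketch`, the toy lookup table, the choice of the smallest ID as the single replacement value, and the sample rows are all assumptions for illustration.

```python
import pandas as pd


def clean_data_sketch(df, zone_lookup):
    """Hypothetical re-implementation of the three documented cleaning actions."""
    # 1. drop all rows with missing location ids
    df = df.dropna(subset=["PULocationID", "DOLocationID"])
    # 2. drop all rows where the dropoff time is before the pickup time
    df = df[df["tpep_dropoff_datetime"] >= df["tpep_pickup_datetime"]]
    # 3. ids that share a zone name are replaced by the smallest such id
    canonical = zone_lookup.groupby("Zone")["LocationID"].transform("min")
    id_map = {int(loc): int(canon)
              for loc, canon in zip(zone_lookup["LocationID"], canonical)
              if loc != canon}
    return df.assign(
        PULocationID=df["PULocationID"].astype("int16").replace(id_map),
        DOLocationID=df["DOLocationID"].astype("int16").replace(id_map),
    )


# Toy lookup table in which two ids share one zone name.
zone_lookup = pd.DataFrame({
    "LocationID": [56, 57, 132, 138],
    "Zone": ["Corona", "Corona", "JFK Airport", "LaGuardia Airport"],
})

# Made-up trips: row 1 ends before it starts, row 2 has no pickup id.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2017-01-01 00:10:00", "2017-01-01 01:00:00",
        "2017-01-01 02:00:00", "2017-01-01 03:00:00"]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2017-01-01 00:45:00", "2017-01-01 00:30:00",
        "2017-01-01 02:30:00", "2017-01-01 03:20:00"]),
    "PULocationID": [57, 48, None, 56],
    "DOLocationID": [132, 1, 138, 138],
})

cleaned = clean_data_sketch(trips, zone_lookup)
print(cleaned)
```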

In [40]:
step3.output_file

'nyc-2017-yellow-taxi-trips-to-airport.pkl.gz'
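
The docstring of `main` says the result is pickled so that the data types are not lost. A small round trip illustrates the difference to CSV. The frame and the file names below are made up; `to_pickle`, `read_pickle`, `to_csv`, and `read_csv` are regular pandas calls.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up frame with the compact types produced by the cleaning step.
df = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2017-01-01 00:10:00",
                                            "2017-01-01 08:30:00"]),
    "PULocationID": np.array([48, 161], dtype=np.int16),
    "trip_distance": [17.2, 16.9],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "trips.pkl.gz")
    csv_path = os.path.join(tmp, "trips.csv.gz")
    df.to_pickle(pkl_path, compression="gzip")
    df.to_csv(csv_path, index=False, compression="gzip")
    from_pickle = pd.read_pickle(pkl_path, compression="gzip")
    from_csv = pd.read_csv(csv_path)

print(from_pickle.dtypes)  # identical to df.dtypes
print(from_csv.dtypes)     # int16 became int64, datetimes came back as text
```

After the CSV round trip the types would have to be restored with `dtype` and `parse_dates` on every load, which the pickle file avoids.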

You may call it from the command line:

```
> python step_3_clean_data.py
=== nyc taxi to airport - step 3 clean data
loading file: nyc-2017-yellow-taxi-trips-to-airport.cvs.gz
100,000 lines read | time 20s
200,000 lines read | time 41s
300,000 lines read | time 61s
400,000 lines read | time 82s
500,000 lines read | time 103s
600,000 lines read | time 123s
700,000 lines read | time 144s
800,000 lines read | time 165s
900,000 lines read | time 185s
1,000,000 lines read | time 206s
1,100,000 lines read | time 226s
1,200,000 lines read | time 247s
1,300,000 lines read | time 268s
1,400,000 lines read | time 289s
1,500,000 lines read | time 309s
1,600,000 lines read | time 330s
1,700,000 lines read | time 351s
1,800,000 lines read | time 372s
1,900,000 lines read | time 393s
2,000,000 lines read | time 413s
2,100,000 lines read | time 434s
2,200,000 lines read | time 455s
2,300,000 lines read | time 475s
2,400,000 lines read | time 496s
2,500,000 lines read | time 516s
2,533,072 lines read | time 523s
saving file: nyc-2017-yellow-taxi-trips-to-airport.pkl.gz
done
```

## Transform data

The module for transforming is called: `step_4_transform.py`.

In [50]:
import step_4_transform as step4

In [51]:
step4.input_file

'nyc-2017-yellow-taxi-trips-to-airport.cvs.gz'

In [53]:
step4.output_file

'nyc-2017-yellow-taxi-trips-to-airport-expanded.pkl.gz'

In [55]:
step4.main?

Signature: step4.main()
Docstring:
Loads the cleaned data for taxi trips to airports from step 3, transforms it and saves the result.

If output_file already exists, the function skips.
Remove the output_file manually in that case.

The input_file is loaded unchunked.

After the whole file is loaded the function transform is applied.

At the end the transformed dataset is saved as a gzipped pickle file.

Remember: pickle files should only be used for temporary storage, since
the format is not guaranteed to be stable between different lib versions.                   

Keyword Arguments: -

Returns: -
File:      ~/github/nyc-taxi-to-airport/step_4_transform.py
Type:      function


In [57]:
step4.transform?

Signature: step4.transform(df)
Docstring:
Transforms the passed dataframe.

Actions done while transforming:
- translating location ids to zone name categories
- renaming datetime: get rid of the tpep_ prefix
- add additional variables derived from the dropoff dateime
- add trip duration in minutes and in hours
- add trip velocity

Keyword Arguments:
df -- the dataframe to transform

Returns: the transformed dataframe
File:      ~/github/nyc-taxi-to-airport/step_4_transform.py
Type:      function

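The actions listed in the `transform` docstring can be sketched on a toy frame. The function `transform_sketch`, the sample trips, and the reduced set of derived columns are assumptions for illustration. In particular, computing velocity as distance divided by duration in hours is an assumption about how the module defines it.

```python
import pandas as pd

# Zone names of the three airports, taken from the zone lookup table.
zone_names = {1: "Newark Airport", 132: "JFK Airport", 138: "LaGuardia Airport"}


def transform_sketch(df):
    """Hypothetical re-implementation of some documented transform actions."""
    # get rid of the tpep_ prefix
    df = df.rename(columns={"tpep_pickup_datetime": "pickup_datetime",
                            "tpep_dropoff_datetime": "dropoff_datetime"})
    dropoff = df["dropoff_datetime"].dt
    seconds = (df["dropoff_datetime"] - df["pickup_datetime"]).dt.total_seconds()
    return df.assign(
        # translate location ids to zone name categories
        dropoff_zone=df["DOLocationID"].map(zone_names).astype("category"),
        # variables derived from the dropoff datetime
        dropoff_month=dropoff.month.astype("category"),
        dropoff_weekday=dropoff.weekday.astype("category"),
        dropoff_is_weekend=(dropoff.weekday >= 5).astype("category"),
        dropoff_hour=dropoff.hour.astype("category"),
        # trip duration and velocity (assumed: distance per hour)
        trip_duration_minutes=seconds / 60,
        trip_duration_hours=seconds / 3600,
        trip_duration_velocity=df["trip_distance"] / (seconds / 3600),
    )


# Made-up trips: a Saturday trip to JFK and a Monday trip to Newark.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2017-01-07 10:00:00",
                                            "2017-01-02 08:00:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2017-01-07 10:30:00",
                                             "2017-01-02 09:00:00"]),
    "DOLocationID": [132, 1],
    "trip_distance": [15.0, 20.0],
})

expanded = transform_sketch(trips)
print(expanded.dtypes)
```

Storing the derived variables as `category` keeps the expanded frame small, which matches the 165.2 MB reported for 17 columns in the log of step 4. Trips with a duration of zero would need extra handling in this sketch, because the division then yields infinite or undefined velocities.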

You may call it from the command line:

```
> python step_4_transform.py
loading file: nyc-2017-yellow-taxi-trips-to-airport.pkl.gz
=== nyc taxi to airport - step 4 transform
transform location ids to zones
renaming datetimes
adding additonal datetime variables
adding trip duration in minutes
adding trip duration in hours
adding trip velocity
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2509468 entries, 14 to 113496861
Data columns (total 17 columns):
pickup_datetime           datetime64[ns]
dropoff_datetime          datetime64[ns]
trip_distance             float64
pickup_borough            category
pickup_zone               category
dropoff_zone              category
pickup_service_zone       category
dropoff_month             category
drop_off_week_of_year     category
dropoff_day_of_year       category
dropoff_day_of_month      category
dropoff_weekday           category
dropoff_is_weekend        category
dropoff_hour              category
trip_duration_minutes     float64
trip_duration_hours       float64
trip_duration_velocity    float64
dtypes: category(11), datetime64[ns](2), float64(4)
memory usage: 165.2 MB
saving file: nyc-2017-yellow-taxi-trips-to-airport-expanded.pkl.gz
done
```