### Reminder
### Pre-EDA Summary
#### Matters to address during preprocessing:
1. Dropping unused columns

2. Cleaning missing values if any encountered

3. Converting timestamp columns to the proper type and format

4. Cleaning negative trip durations and trip distances

5. Cleaning negative fares

6. Cleaning the trips with no passengers

7. Encoding the variables that should be categorical
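
The steps above could be sketched in pandas roughly as follows. This is only an illustration on a made-up sample, not the actual `preprocess_data` from the toolkit; the helper name `basic_preprocess` and the sample values are assumptions.

```python
import pandas as pd

# Tiny made-up sample using the yellow-cab column names from this notebook.
raw = pd.DataFrame({
    "VendorID": [1, 2, 1, 2],
    "tpep_pickup_datetime": ["2019-06-01 00:10:00", "2019-06-01 08:00:00",
                             "2019-06-01 09:00:00", "2019-06-01 10:00:00"],
    "tpep_dropoff_datetime": ["2019-06-01 00:25:00", "2019-06-01 07:50:00",
                              "2019-06-01 09:20:00", "2019-06-01 10:30:00"],
    "passenger_count": [1, 2, 0, 1],
    "trip_distance": [2.5, 1.0, 3.0, 4.2],
    "payment_type": [1, 2, 1, 1],
    "fare_amount": [11.0, 6.5, 13.0, -15.0],
})


def basic_preprocess(df, columns_to_keep, cat_columns):
    """Hypothetical stand-in for the cleaning steps listed above."""
    # Steps 1-2: keep only the needed columns and drop rows with missing values.
    out = df[columns_to_keep].dropna().copy()
    # Step 3: parse the timestamp columns.
    for col in ("tpep_pickup_datetime", "tpep_dropoff_datetime"):
        out[col] = pd.to_datetime(out[col])
    duration = out["tpep_dropoff_datetime"] - out["tpep_pickup_datetime"]
    keep = (
        (duration > pd.Timedelta(0))        # step 4 (zero-length trips go too)
        & (out["trip_distance"] >= 0)       # step 4
        & (out["fare_amount"] >= 0)         # step 5
        & (out["passenger_count"] > 0)      # step 6
    )
    out = out[keep].copy()
    # Step 7: mark categorical variables as such.
    for col in cat_columns:
        out[col] = out[col].astype("category")
    return out


columns = [c for c in raw.columns if c != "VendorID"]
clean = basic_preprocess(raw, columns, ["payment_type"])
print(clean)
```

In this sample only the first trip survives: the second ends before it starts, the third has no passengers, and the fourth has a negative fare.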

#### Feature engineering:

1. Calculate borough sizes as the number of zones each borough contains

2. Calculate average drive speed

3. Calculate the drivetime.

4. Create an indicator showing whether it was a night or rush-hour trip.

5. Create the indicator showing if the trip happened during the weekend

6. Create a season indicator (for models trained on many months).

7. Optionally merge the outlying payment types together into the "uncommon" category.
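
A minimal sketch of some of these features is shown below. It is not the toolkit's `engineering_toolkit`; the sample trips, the tiny zone table, and the month-to-season mapping are assumptions made for illustration. The output names `PUSize`, `speed` and `drivetime` match the columns used later in this notebook.

```python
import pandas as pd

trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2019-06-01 23:30:00", "2019-06-03 08:15:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2019-06-01 23:50:00", "2019-06-03 08:45:00"]),
    "trip_distance": [5.0, 3.0],  # miles
    "PULocationID": [1, 2],
})
# Made-up lookup with the same LocationID/Borough columns as the TLC zone file.
zones = pd.DataFrame({"LocationID": [1, 2, 3],
                      "Borough": ["Queens", "Manhattan", "Manhattan"]})

# Drive time in minutes and average speed in miles per hour.
delta = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
trips["drivetime"] = delta.dt.total_seconds() / 60
trips["speed"] = trips["trip_distance"] * 60 / trips["drivetime"]

# Weekend indicator: Monday is 0, so Saturday and Sunday are 5 and 6.
trips["weekend"] = trips["tpep_pickup_datetime"].dt.dayofweek >= 5

# Season from the pickup month (assumed meteorological seasons).
season_names = {0: "winter", 1: "spring", 2: "summer", 3: "autumn"}
trips["season"] = (trips["tpep_pickup_datetime"].dt.month % 12 // 3).map(season_names)

# Borough size = number of zones in the pickup zone's borough.
borough_size = zones.groupby("Borough")["LocationID"].nunique()
zone_to_borough = zones.set_index("LocationID")["Borough"]
trips["PUSize"] = trips["PULocationID"].map(zone_to_borough).map(borough_size)

print(trips[["drivetime", "speed", "weekend", "season", "PUSize"]])
```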

#### Features to keep from the original dataset:
1. PULocationID and DOLocationID
2. tpep_pickup_datetime and tpep_dropoff_datetime
3. passenger_count
4. trip_distance
5. payment_type
6. fare_amount
7. extra
8. tip_amount

#### Handling anomalies
##### Erase rows where:

1. PULocationID or DOLocationID is in {0, 264, 265}

2. Total amount is negative

3. "extra" value is negative

4. "tip_amount" is negative

5. Trip lasts longer than 100 minutes or its duration is less than or equal to 0 minutes.

6. Any value is missing.
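
As a rough sketch, the rules above could be combined into a single boolean mask. The helper name `drop_anomalies` and the sample rows are hypothetical, and the sketch assumes a `total_amount` column is still available when the filter runs.

```python
import pandas as pd


def drop_anomalies(df, max_minutes=100, unknown_zones=(0, 264, 265)):
    """Hypothetical helper: keep only rows that pass every rule listed above."""
    zones = list(unknown_zones)
    minutes = (df["tpep_dropoff_datetime"]
               - df["tpep_pickup_datetime"]).dt.total_seconds() / 60
    ok = (
        ~df["PULocationID"].isin(zones)
        & ~df["DOLocationID"].isin(zones)
        & (df["total_amount"] >= 0)
        & (df["extra"] >= 0)
        & (df["tip_amount"] >= 0)
        & (minutes > 0)
        & (minutes <= max_minutes)
    )
    # Comparisons against NaN are False, so dropna only catches the remaining columns.
    return df[ok].dropna()


# Made-up sample: only the first row passes every rule.
trips = pd.DataFrame({
    "PULocationID": [142, 264, 48, 79],
    "DOLocationID": [236, 100, 68, 107],
    "tpep_pickup_datetime": pd.to_datetime(["2019-06-01 10:00"] * 4),
    "tpep_dropoff_datetime": pd.to_datetime(["2019-06-01 10:20", "2019-06-01 10:15",
                                             "2019-06-01 12:30", "2019-06-01 10:10"]),
    "total_amount": [15.3, 12.0, 60.0, 9.8],
    "extra": [0.0, 0.5, 0.0, 1.0],
    "tip_amount": [2.0, 0.0, 5.0, -1.0],
})
kept = drop_anomalies(trips)
print(kept)
```

The second row is dropped for its pickup zone, the third for lasting 150 minutes, and the fourth for its negative tip.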

In [2]:
import pandas as pd
import numpy as np
import os

os.chdir('src')
from toolkit.etl_toolkit import ingest_data, preprocess_data, engineering_toolkit
from toolkit.analysis_toolkit import inspect_distribution, calculate_drivetime, inspect_yearly, corrplot

#### Preamble - yearly heatmaps

In [None]:
# Collect one summary row per month; each monthly file is ingested only once.
rows = []
for year in ['2018', '2019']:
    for month in [f'{m:02d}' for m in range(1, 13)]:
        monthly = ingest_data(year, month)
        rows.append({'year': year,
                     'month': month,
                     'trips': len(monthly),
                     'unique_trips': monthly.index.nunique()})
aggregated_data = pd.DataFrame(rows)
    

In [None]:
inspect_yearly(aggregated_data, method = 'heatmap')

In [None]:
inspect_yearly(aggregated_data, method = 'clustermap')

Now the data for EDA will be processed using what we have learned from the yellowcab_data_domain_understanding notebook.
Then it will be time to explore feature interactions, just before stepping into the realms of feature engineering.

In [None]:
zone_lookup = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv')

In [None]:
columns_to_keep = ['PULocationID', 
                   'DOLocationID', 
                   'tpep_pickup_datetime', 
                   'tpep_dropoff_datetime', 
                   'passenger_count',
                   'trip_distance', 
                   'payment_type',
                   'fare_amount',
                   'extra',
                   'tip_amount']

cat_columns = ['payment_type']

yellow_06_19 = preprocess_data(ingest_data('2019', '06').head(100000), zone_lookup, columns_to_keep, cat_columns)

In [None]:
corrplot(yellow_06_19, columns = ['passenger_count', 
                                  'trip_distance', 
                                  'fare_amount', 
                                  'extra', 
                                  'tip_amount'])

For now, it seems that the "extra" column is not being used to its full potential. It should be further processed to indicate whether a trip is a rush-hour or a night ride (which should be a more compact approach than extracting this information from the timestamps).
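
One way to do this is sketched below. It assumes, following the TLC data dictionary's description of "extra", that 0.50 marks the overnight surcharge and 1.00 the rush-hour surcharge; both the mapping and the helper name `trip_type_from_extra` are assumptions and may not match how `engineering_toolkit` derives `trip_type`.

```python
import pandas as pd

# Assumed mapping from surcharge amount to trip type (see the note above).
EXTRA_TO_TRIP_TYPE = {0.0: "day", 0.5: "night", 1.0: "rush_hour"}


def trip_type_from_extra(extra):
    """Hypothetical helper: label each trip by its surcharge amount."""
    # Amounts outside the mapping are grouped under "other" rather than guessed.
    return extra.map(EXTRA_TO_TRIP_TYPE).fillna("other")


extra = pd.Series([0.0, 0.5, 1.0, 4.5])
trip_type = trip_type_from_extra(extra)
print(trip_type.tolist())
```

Values outside the three expected amounts would deserve a closer look before modelling, which is why they are kept as a separate label here.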


In [None]:
yellow_06_19 = engineering_toolkit(yellow_06_19, ['borough_size', 'speed', 'trip_type', 'season'], zone_lookup)

In [None]:
yellow_06_19.columns

In [None]:
corrplot(yellow_06_19, columns = ['passenger_count', 
                                  'trip_distance', 
                                  'fare_amount', 
                                  'PUSize', 
                                  'DOSize', 
                                  'tip_amount',
                                  'speed',
                                  'drivetime'])

### Draft modelling ideas:

#### Trained on sister months (01.2018 & 01.2019, 02.2018 & 02.2019 etc.):
1. Predict the trip duration, based on distance to drive, pickup/dropoff borough size, type of trip (night/rush hour/day), season.

#### Trained on the whole dataset (but only on trips with card payments, because cash tips are not included in the data)
2. Predict tip amount based on distance, type of trip (night/rush hour/day), season, passenger count (after erasing the zero-passenger trips), pickup/dropoff borough size and optionally payment type.

#### Notes before modelling:

Model 1) 
columns: scaling: passenger_count, trip_distance, PUSize, DOSize
         one-hot encoded: trip_type
         
Model 2) 
columns: scaling: passenger_count, trip_distance, PUSize, DOSize, PULocationID (Optional), DOLocationID (Optional), speed 
         one-hot encoded: trip_type, season, PUBorough, DOBorough
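
The scaling and one-hot encoding noted above could be sketched with pandas alone, as below. The helper `prepare_features` and the sample values are assumptions; standardisation is used here as one possible scaling, and in practice the mean and standard deviation should come from the training split only.

```python
import pandas as pd


def prepare_features(df, scale_cols, onehot_cols):
    """Hypothetical helper: standardise numeric columns, one-hot encode the rest."""
    numeric = df[scale_cols]
    # Zero mean, unit variance per column (population standard deviation).
    scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)
    encoded = pd.get_dummies(df[onehot_cols], columns=onehot_cols)
    return pd.concat([scaled, encoded], axis=1)


# Made-up sample with the Model 1 columns.
sample = pd.DataFrame({
    "passenger_count": [1, 2, 3],
    "trip_distance": [1.0, 2.0, 6.0],
    "PUSize": [69, 69, 61],
    "DOSize": [69, 61, 55],
    "trip_type": ["day", "night", "rush_hour"],
})
features = prepare_features(
    sample,
    scale_cols=["passenger_count", "trip_distance", "PUSize", "DOSize"],
    onehot_cols=["trip_type"],
)
print(features.columns.tolist())
```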
         
