# EDA #2 & Feature selection

## Abstract

This notebook performs a second round of exploratory data analysis (EDA) on the feature set against the target variable, and selects the features to use for modelling.

___NB: the exploration is run on a random sample of roughly 1/200 of the original data (the `RAND() < 1/200` filter in the query below), since the full feature table is about 8 GB.___

## Import data

In [1]:
%%bigquery data

SELECT
  *
FROM
  `aliz-ml-spec-2022-submission.demo1.Demo1_MLdataset`
WHERE
  RAND() < 1/200

Query complete after 0.01s: 100%|██████████| 2/2 [00:00<00:00, 800.13query/s]                         
Downloading: 100%|██████████| 276619/276619 [00:01<00:00, 147116.41rows/s]
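
As a quick sanity check on the sample size, the row count reported by the download can be scaled back by the sampling probability used in the query. This is only a rough sketch: `RAND() < 1/200` keeps each row independently with probability 1/200, so the result is an estimate of the full table size, not an exact count, and the variable names are illustrative.

```python
# Row count reported by the download progress bar above.
sampled_rows = 276_619
# Sampling probability used in the WHERE clause of the query.
sampling_fraction = 1 / 200

# Each row is kept independently, so the full table size is roughly
# the sampled count divided by the sampling probability.
estimated_total_rows = round(sampled_rows / sampling_fraction)

print(f"Estimated rows in the full table: {estimated_total_rows:,}")
```

Because `RAND()` is not seeded, re-running the query returns a different sample each time, so exact figures later in the notebook may vary slightly between runs.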


In [2]:
import pandas as pd

pd.options.display.max_columns = 50

data.head()

| | trip_id | fare | primary_fare | trip_start_timestamp | TripStartYear | TripStartMonth | TripStartDay | TripStartHour | TripStartMinute | trip_seconds | trip_miles | pickup_census_tract | dropoff_census_tract | rawDistance | rawLongitude | rawLatitude | historical_tripDuration | histOneWeek_tripDuration | histOneMonth_tripDuration | histThreeMonth_tripDuration | historical_tripDistance | histOneWeek_tripDistance | histOneMonth_tripDistance | histThreeMonth_tripDistance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 374e725bbba5d229f26625df419a5b2d7368fd7f | 8.5 | 12.8 | 2019-02-13 09:15:00+00:00 | 2019 | 2 | Wednesday | 9 | 15 | 792 | 1.8 | 17031080100 | 17031839100 | 0.027215 | 0.006088 | 0.026526 | 848.146889 | 901.875 | 900.058824 | 837.355263 | 1.910912 | 2.24625 | 1.964118 | 1.790132 |
| 1 | 7db5be6570468915fddba10a573393b69391c582 | 14.0 | 18.575 | 2017-08-14 07:30:00+00:00 | 2017 | 8 | Monday | 7 | 30 | 846 | 4.2 | 17031081600 | 17031838200 | 0.051035 | 0.046211 | 0.021658 | 897.294118 | NaN | NaN | 920.0 | 3.728824 | NaN | NaN | 3.25 |
| 2 | 856bdddaaaeed0bdc8dfdcf953406f88d7746b08 | 8.5 | 11.398611 | 2016-08-23 16:00:00+00:00 | 2016 | 8 | Tuesday | 16 | 0 | 655 | 1.6 | 17031081500 | 17031280100 | 0.018091 | 0.016594 | 0.007208 | 643.565789 | 734.428571 | 775.6875 | 744.850467 | 1.380099 | 1.342857 | 1.306563 | 1.556075 |
| 3 | 5b4c706fce3b788e03d1b782377a0389ee7fd061 | 6.75 | 9.091667 | 2016-03-17 11:15:00+00:00 | 2016 | 3 | Thursday | 11 | 15 | 420 | 1.3 | 17031081600 | 17031243500 | 0.023668 | 0.02366 | 0.000585 | 426.0 | 360.0 | 360.0 | 426.0 | 1.6 | 1.6 | 1.6 | 1.6 |
| 4 | c7a77c77a72dd77ecd4f39b83e9ad2d7cfb3711c | 5.75 | 7.169444 | 2018-01-16 07:45:00+00:00 | 2018 | 1 | Tuesday | 7 | 45 | 370 | 0.6 | 17031081800 | 17031280100 | 0.009344 | 0.004964 | 0.007916 | 346.817568 | 180.0 | 218.0 | 330.631579 | 0.987264 | 0.6 | 0.716667 | 0.828947 |
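
The preview above suggests that `rawDistance` is the Euclidean norm of `rawLongitude` and `rawLatitude`. The sketch below checks this on the five previewed rows only. It assumes, without confirmation from the dataset's documentation, that the two columns hold coordinate differences (in degrees) between pickup and dropoff.

```python
import numpy as np
import pandas as pd

# The five previewed rows, restricted to the three "raw" columns.
preview = pd.DataFrame(
    {
        "rawDistance": [0.027215, 0.051035, 0.018091, 0.023668, 0.009344],
        "rawLongitude": [0.006088, 0.046211, 0.016594, 0.02366, 0.004964],
        "rawLatitude": [0.026526, 0.021658, 0.007208, 0.000585, 0.007916],
    }
)

# Assumption: rawLongitude / rawLatitude are coordinate differences,
# so their Euclidean norm should reproduce rawDistance.
recomputed = np.hypot(preview["rawLongitude"], preview["rawLatitude"])

# Largest absolute deviation between the recomputed and stored distance.
max_abs_error = (recomputed - preview["rawDistance"]).abs().max()
print(f"Max absolute error: {max_abs_error:.2e}")
```

On these rows the deviation stays within the rounding of the displayed values. The same check could be run on the full `data` frame to confirm the relationship more broadly.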


### Comment

The field `primary_fare` was derived from the original pricing rules issued by the city of Chicago (see https://checkertaxichicago.com/rates-table/).

This fare should be the minimum possible, since additional fees (extra passengers, airport tax, etc.) are not included.

However, large discrepancies are already visible: the fare actually charged can be lower than this computed minimum, as in all five rows previewed above.
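
One way to quantify this discrepancy is to look at the gap between `fare` and `primary_fare`. The sketch below uses only the five previewed rows, so the numbers are illustrative rather than representative. The helper name `summarize_fare_gap` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# `fare` and `primary_fare` for the five previewed rows.
preview = pd.DataFrame(
    {
        "fare": [8.5, 14.0, 8.5, 6.75, 5.75],
        "primary_fare": [12.8, 18.575, 11.398611, 9.091667, 7.169444],
    }
)


def summarize_fare_gap(df: pd.DataFrame) -> pd.Series:
    """Summarize how the charged fare compares with the computed minimum."""
    gap = df["fare"] - df["primary_fare"]
    return pd.Series(
        {
            # Share of trips charged less than the computed minimum fare.
            "share_fare_below_primary": (gap < 0).mean(),
            "median_gap": gap.median(),
            "min_gap": gap.min(),
        }
    )


summary = summarize_fare_gap(preview)
print(summary)
```

Calling `summarize_fare_gap(data)` on the sampled frame would give the same summary for the whole sample.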

## Check the types

In [3]:
data.dtypes

trip_id                                     object
fare                                       float64
primary_fare                               float64
trip_start_timestamp           datetime64[ns, UTC]
TripStartYear                                int64
TripStartMonth                               int64
TripStartDay                                object
TripStartHour                                int64
TripStartMinute                              int64
trip_seconds                                 int64
trip_miles                                 float64
pickup_census_tract                          int64
dropoff_census_tract                         int64
rawDistance                                float64
rawLongitude                               float64
rawLatitude                                float64
historical_tripDuration                    float64
histOneWeek_tripDuration                   float64
histOneMonth_tripDuration                  float64
histThreeMonth_tripDuration    

In [4]:
categorical_variables = [
    'TripStartYear',
    'TripStartMonth',
    'TripStartDay',
    'TripStartHour',
    'TripStartMinute',
    'pickup_census_tract',
    'dropoff_census_tract'
]

date_variables = ['trip_start_timestamp']

numerical_variables = [
    'fare',
    'primary_fare',
    'trip_seconds',
    'trip_miles',
    'rawDistance',
    'rawLongitude',
    'rawLatitude',
    'historical_tripDuration',
    'histOneWeek_tripDuration',
    'histOneMonth_tripDuration',
    'histThreeMonth_tripDuration',
    'historical_tripDistance',
    'histOneWeek_tripDistance',
    'histOneMonth_tripDistance',
    'histThreeMonth_tripDistance'
]

In [5]:
data[categorical_variables] = data[categorical_variables].astype('object')
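
Since the three lists above are maintained by hand, a small check can confirm that every column is assigned to exactly one group. The helper `check_column_groups` below is hypothetical (not part of the original notebook) and is demonstrated on a toy frame that reuses a few of the real column names.

```python
import pandas as pd


def check_column_groups(df, groups, ignored=()):
    """Return (unassigned, duplicated, missing) column names, each sorted."""
    assigned = [col for cols in groups.values() for col in cols]
    # Columns listed in more than one group (or twice in the same group).
    duplicated = sorted({col for col in assigned if assigned.count(col) > 1})
    # Columns listed in a group but absent from the frame.
    missing = sorted(set(assigned) - set(df.columns))
    # Columns present in the frame but not listed or ignored.
    unassigned = sorted(set(df.columns) - set(assigned) - set(ignored))
    return unassigned, duplicated, missing


# Toy frame with a subset of the real column names (no rows needed).
toy = pd.DataFrame(
    columns=["trip_id", "fare", "TripStartDay", "trip_start_timestamp", "trip_miles"]
)
toy_groups = {
    "categorical": ["TripStartDay"],
    "date": ["trip_start_timestamp"],
    "numerical": ["fare"],  # `trip_miles` deliberately left out
}

unassigned, duplicated, missing = check_column_groups(
    toy, toy_groups, ignored=("trip_id",)
)
print(unassigned, duplicated, missing)
```

With the notebook's own lists, the call would pass `categorical_variables`, `date_variables` and `numerical_variables` as the groups and ignore `trip_id`.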

## Sweetviz

In [6]:
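import sweetviz as sv

# Analyze every column and relate it to the target variable `fare`.
my_report = sv.analyze(data, target_feat="fare")
# Save the report as a standalone HTML file, then render it inline.
my_report.show_html("../reports/SWEETVIZ_REPORT.html")
my_report.show_notebook()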

Report ../reports/SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


## Feature selection

It is good news that many of the engineered features correlate highly with `trip_miles`, `trip_seconds` or even `fare`, because:
- `fare` is the target variable to predict, so a high correlation between the target and a feature known at serving time is promising;
- according to the documentation, `trip_miles` & `trip_seconds` are the main factors determining the fare. These variables are not known at serving time, so they cannot be used for prediction; features that are known at serving time and correlate highly with `trip_miles` & `trip_seconds` can therefore be used as substitutes.
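
The correlations discussed above can also be inspected directly with pandas. The sketch below computes Pearson correlations, a simpler stand-in for the associations shown in the Sweetviz report, on the five previewed rows only. With so few rows the values are purely illustrative. The helper name `correlations_with` is hypothetical.

```python
import pandas as pd


def correlations_with(df, targets, candidates, method="pearson"):
    """Correlation of each candidate feature with each target column."""
    corr = df[candidates + targets].corr(method=method)
    return corr.loc[candidates, targets]


# Values copied from the five previewed rows.
preview = pd.DataFrame(
    {
        "fare": [8.5, 14.0, 8.5, 6.75, 5.75],
        "trip_seconds": [792, 846, 655, 420, 370],
        "trip_miles": [1.8, 4.2, 1.6, 1.3, 0.6],
        "historical_tripDuration": [848.146889, 897.294118, 643.565789, 426.0, 346.817568],
        "historical_tripDistance": [1.910912, 3.728824, 1.380099, 1.6, 0.987264],
        "rawDistance": [0.027215, 0.051035, 0.018091, 0.023668, 0.009344],
    }
)

result = correlations_with(
    preview,
    targets=["fare", "trip_miles", "trip_seconds"],
    candidates=["historical_tripDuration", "historical_tripDistance", "rawDistance"],
)
print(result.round(3))
```

On the sampled `data` frame, the same call would give one row per serving-time feature and one column per target or proxy variable.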

Based on these remarks, the following features are chosen for modelling:
- `TripStartYear`, `TripStartMonth`, `TripStartDay`, `TripStartHour` & `TripStartMinute` will be categorical features describing the trip start time;
- `pickup_census_tract` & `dropoff_census_tract` will be categorical features describing the trip start & end locations;
- `historical_tripDuration` & `histOneWeek_tripDuration` are numerical features chosen as substitutes for `trip_seconds` (recall that these features are engineered from trips with the same start/end locations & the same trip time, i.e. day & hour): `historical_tripDuration` is a long-term snapshot of the impact of external factors on the trip duration, while `histOneWeek_tripDuration` may capture the impact of recent external factors;
- `historical_tripDistance` & `histOneWeek_tripDistance` are numerical features chosen as substitutes for `trip_miles` (engineered in the same way): `historical_tripDistance` is a long-term snapshot of the impact of external factors on the trip distance, while `histOneWeek_tripDistance` may capture the impact of recent external factors;
- `rawDistance` is a numerical feature: the Euclidean distance between the trip start & end locations.
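
The selection above can be written down as explicit feature lists, which also makes it easy to check how often each selected feature is missing (the one-week aggregates are empty for one of the previewed rows). This is a minimal sketch: the names `split_features_target` and `missing_share` are hypothetical, and the demo frame contains only the first two previewed rows.

```python
import pandas as pd

TARGET = "fare"
categorical_features = [
    "TripStartYear", "TripStartMonth", "TripStartDay", "TripStartHour",
    "TripStartMinute", "pickup_census_tract", "dropoff_census_tract",
]
numerical_features = [
    "historical_tripDuration", "histOneWeek_tripDuration",
    "historical_tripDistance", "histOneWeek_tripDistance", "rawDistance",
]
selected_features = categorical_features + numerical_features


def split_features_target(df):
    """Return the selected feature matrix X and the target vector y."""
    return df[selected_features].copy(), df[TARGET].copy()


def missing_share(X):
    """Share of missing values per feature, most affected first."""
    return X.isna().mean().sort_values(ascending=False)


# First two previewed rows; the second has no one-week history.
nan = float("nan")
demo = pd.DataFrame(
    {
        "fare": [8.5, 14.0],
        "TripStartYear": [2019, 2017],
        "TripStartMonth": [2, 8],
        "TripStartDay": ["Wednesday", "Monday"],
        "TripStartHour": [9, 7],
        "TripStartMinute": [15, 30],
        "pickup_census_tract": [17031080100, 17031081600],
        "dropoff_census_tract": [17031839100, 17031838200],
        "historical_tripDuration": [848.146889, 897.294118],
        "histOneWeek_tripDuration": [901.875, nan],
        "historical_tripDistance": [1.910912, 3.728824],
        "histOneWeek_tripDistance": [2.24625, nan],
        "rawDistance": [0.027215, 0.051035],
    }
)

X, y = split_features_target(demo)
missing = missing_share(X)
print(missing.head())
```

Running `missing_share` on the sampled `data` frame would show whether the one-week features are missing often enough to need imputation or a fallback to the long-term aggregates.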