![Banner](./img/AI_Special_Program_Banner.jpg)

# Exercise on Feature Engineering
---

In this exercise, you will apply feature engineering methods to improve the performance of an ML model. Feature engineering is more of an art than an exact science and requires some creative work. This exercise is therefore designed to be very open and exploratory. Your task will be to develop and empirically evaluate **your own feature engineering methods**.

## Data preparation and a baseline model
---

In this section, we read in the data and develop an initial baseline model. We use a customized subset (145,572 rows) of the [NYC Taxi Trip Duration](https://www.kaggle.com/c/nyc-taxi-trip-duration/overview) dataset from a Kaggle competition. The task is to predict the duration (`trip_duration`) of a cab ride in New York City as accurately as possible. It is therefore a **regression problem**. The competition was announced with prize money of \\$30,000.

### Reading in the data

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('./data/nyc_taxi.csv')

In [3]:
df.head()

Unnamed: 0,index,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,571578,id2141905,2,2016-03-31 16:04:41,1,-73.971916,40.757042,-73.974663,40.753624,0,113
1,1280332,id0996953,2,2016-04-21 21:54:52,2,-73.961891,40.771061,-73.906311,40.908562,0,2037
2,177838,id1572284,1,2016-03-30 11:26:24,3,-74.010338,40.711674,-73.957047,40.777634,0,1811
3,1433776,id0103694,1,2016-03-06 20:07:45,1,-74.005898,40.740093,-73.992287,40.758511,0,977
4,757662,id2548956,1,2016-04-06 13:45:10,1,-74.011063,40.715599,-74.005035,40.720966,0,342


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145572 entries, 0 to 145571
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   index               145572 non-null  int64  
 1   id                  145572 non-null  object 
 2   vendor_id           145572 non-null  int64  
 3   pickup_datetime     145572 non-null  object 
 4   passenger_count     145572 non-null  int64  
 5   pickup_longitude    145572 non-null  float64
 6   pickup_latitude     145572 non-null  float64
 7   dropoff_longitude   145572 non-null  float64
 8   dropoff_latitude    145572 non-null  float64
 9   store_and_fwd_flag  145572 non-null  int64  
 10  trip_duration       145572 non-null  int64  
dtypes: float64(4), int64(5), object(2)
memory usage: 12.2+ MB


For this exercise, we only work with **10% of the original data** to reduce the computation time of the models and feature engineering procedures.

**Features of the data** (see [here](https://www.kaggle.com/c/nyc-taxi-trip-duration/data)):

* **id** - a unique identifier for each trip
* **vendor_id** - a code indicating the provider associated with the trip record
* **pickup_datetime** - date and time when the meter was engaged
* **passenger_count** - the number of passengers in the vehicle (driver entered value)
* **pickup_longitude** - the longitude where the meter was engaged
* **pickup_latitude** - the latitude where the meter was engaged
* **dropoff_longitude** - the longitude where the meter was disengaged
* **dropoff_latitude** - the latitude where the meter was disengaged
* **store_and_fwd_flag** - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - 1=store and forward; 0=not a store and forward trip

**Target:**

* **trip_duration** - duration of the trip in seconds

### Baseline model

A random forest (RF) model was selected as the baseline. Because we are working with far less data than the original competition set, the default RF model overfits. The parameters `max_features` and `max_depth` were therefore adjusted to reduce overfitting.

In [5]:
MAX_FEATURES = 0.5
MAX_DEPTH = 8

In [6]:
cols_to_train = ['vendor_id', 'passenger_count', 'pickup_longitude','pickup_latitude',
                 'dropoff_longitude','dropoff_latitude','store_and_fwd_flag']

In [7]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

X = df[cols_to_train].copy()
y = df['trip_duration']

# Train and score baseline model
baseline = RandomForestRegressor(random_state=0, n_jobs=-1, max_features=MAX_FEATURES, max_depth=MAX_DEPTH)
baseline_scores = cross_validate(
    baseline, X, y, cv=5, scoring="neg_root_mean_squared_error", return_train_score=True
)

print(f"RMSE: Train-Score: {-1*baseline_scores['train_score'].mean():.2f} - Test Score: {-1*baseline_scores['test_score'].mean():.2f}")

RMSE: Train-Score: 433.47 - Test Score: 440.95


<h1 style="color:blue">Exercise</h1>

---

In this exercise you have only one goal: **improve the performance of the baseline model through feature engineering!** 
You can reuse the pipeline of the baseline model above and add your new features to the `cols_to_train` list. Evaluate your new features in the same way as the baseline model, using the [RMSE](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) score. It is best to measure the effect of each new feature individually against the baseline before finally using them in combination.

The best RMSE test score achieved in the sample solution is *332.83* (with the hyperparameters above). Can you beat this?
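
To compare feature sets on equal terms, it can help to wrap the baseline evaluation in a small function. The following is a minimal sketch: the helper name `evaluate_features` and the synthetic demo data are assumptions for illustration and are not part of the exercise material.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate


def evaluate_features(X, y, max_features=0.5, max_depth=8, n_estimators=100):
    """Return (mean train RMSE, mean test RMSE) from 5-fold cross-validation."""
    model = RandomForestRegressor(
        random_state=0, n_jobs=-1, n_estimators=n_estimators,
        max_features=max_features, max_depth=max_depth,
    )
    scores = cross_validate(
        model, X, y, cv=5,
        scoring="neg_root_mean_squared_error", return_train_score=True,
    )
    return -scores["train_score"].mean(), -scores["test_score"].mean()


# Tiny synthetic stand-in for the taxi data (illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
target = 3 * demo["a"] + rng.normal(scale=0.1, size=200)

# Compare a feature set without and with the informative column "a".
rmse_without_a = evaluate_features(demo[["b"]], target, n_estimators=20)
rmse_with_a = evaluate_features(demo[["a", "b"]], target, n_estimators=20)
print(f"Test RMSE without 'a': {rmse_without_a[1]:.2f} - with 'a': {rmse_with_a[1]:.2f}")
```

With the real data, the same call would take `df[cols_to_train + [new_feature]]` and `df['trip_duration']`.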

**Possible starting points:**

* **Log transformation** of the `trip_duration` target
    * Check whether the distribution of the target is suitable for a log transformation
    * To test the effect of a log transformation, you can use the [TransformedTargetRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.compose.TransformedTargetRegressor.html#sklearn-compose-transformedtargetregressor). Its `inverse_func` parameter transforms the predictions back to the original scale, which keeps the RMSE values comparable.
* Extraction of **date information** from `pickup_datetime`. *Surely the travel time of a cab ride depends on the time of day...*
* Calculation of **distances** from the geocoordinates `pickup_longitude`, `pickup_latitude`, `dropoff_longitude`, `dropoff_latitude`.
    * Note that the distances should be calculated on an ellipsoid ($\hat{=}$ Earth) rather than on a flat plane. You can calculate these [geodesics](https://de.wikipedia.org/wiki/Geod%C3%A4te) with [Geopy](https://geopy.readthedocs.io/en/stable/#module-geopy.distance), for example.
    * Since cabs don't fly (*yet*), you could also use more meaningful distances. *New York City = [Manhattan distance](https://en.wikipedia.org/wiki/Taxicab_geometry)?*
* Consideration of **holidays**. *Different travel times could perhaps be expected here...*
    * This can be implemented with the Pandas [Holiday-Calendar](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#holidays-holiday-calendars), for example
* **Clustering of the geodata**. You could use [MiniBatchKMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html) to cluster the locations based on the geocoordinates. The start and end cluster could be a good predictor of travel time.
* Use of **external weather data**. *Perhaps shorter distances are increasingly traveled by cab in bad weather*
    * You can use the file `./data/weather_data_nyc_centralpark_2016.csv` for this (based on [here](https://www.kaggle.com/mathijs/weather-data-in-new-york-city-2016))
    * A `T` in the data stands for "trace": precipitation or snow was observed, but the amount was too small to measure

The points mentioned are only an initial selection of possible feature engineering approaches. You are welcome to experiment and develop your own ideas! Finally, to achieve the best performance, you can use your predictive features in combination.

In [9]:
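from sklearn.compose import TransformedTargetRegressor

# Fit the baseline on log(trip_duration); predictions are mapped back with exp.
tt = TransformedTargetRegressor(regressor=baseline,
                                func=np.log, inverse_func=np.exp)
tt.fit(X, y)
# Note: score() returns R^2 on the training data here, not a cross-validated RMSE.
tt.score(X, y)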

0.5240099230159764
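
The value above is an R² score on the training data, so it cannot be compared directly with the baseline RMSE. A minimal sketch of how the log-transformed model could be scored with cross-validated RMSE instead is shown below. The synthetic data and the choice of `np.log1p`/`np.expm1` (which also tolerate zero durations) are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic, right-skewed positive target as a stand-in for trip_duration.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 2))
y_demo = np.exp(2 * X_demo[:, 0] + rng.normal(scale=0.1, size=300))

log_model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=20, max_depth=8, random_state=0),
    func=np.log1p,          # train on log(1 + y)
    inverse_func=np.expm1,  # map predictions back to the original scale
)

# Because predictions are inverted before scoring, the RMSE is in the
# original units and comparable with an untransformed model.
log_scores = cross_validate(
    log_model, X_demo, y_demo, cv=5,
    scoring="neg_root_mean_squared_error", return_train_score=True,
)
test_rmse = -log_scores["test_score"].mean()
print(f"RMSE: Test Score: {test_rmse:.2f}")
```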

In [13]:
X['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], format='ISO8601')

X['Year'] = X['pickup_datetime'].dt.year
X['Month'] = X['pickup_datetime'].dt.month
X['Day'] = X['pickup_datetime'].dt.day
X['WeekOfYear'] = X['pickup_datetime'].dt.isocalendar().week
X['DayOfYear'] = X['pickup_datetime'].dt.dayofyear
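
The features above describe the calendar date only. Since the starting points mention the time of day, a sketch of additional time features could look like the following. The column names `Hour`, `Weekday`, and `MinuteOfDay` are hypothetical, and the two timestamps are taken from the `df.head()` output.

```python
import pandas as pd

# Two pickup timestamps from the sample rows shown earlier.
pickups = pd.to_datetime(pd.Series(["2016-03-31 16:04:41", "2016-03-06 20:07:45"]))

time_features = pd.DataFrame({
    "Hour": pickups.dt.hour,                                   # 0..23
    "Weekday": pickups.dt.weekday,                             # Monday=0 ... Sunday=6
    "MinuteOfDay": pickups.dt.hour * 60 + pickups.dt.minute,   # finer-grained time of day
})
print(time_features)
```

Tree-based models can split on `Hour` directly, so no cyclic encoding is strictly needed for a random forest.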

In [14]:
X

Unnamed: 0,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,pickup_datetime,Year,Month,Day,WeekOfYear,DayOfYear
0,2,1,-73.971916,40.757042,-73.974663,40.753624,0,2016-03-31 16:04:41,2016,3,31,13,91
1,2,2,-73.961891,40.771061,-73.906311,40.908562,0,2016-04-21 21:54:52,2016,4,21,16,112
2,1,3,-74.010338,40.711674,-73.957047,40.777634,0,2016-03-30 11:26:24,2016,3,30,13,90
3,1,1,-74.005898,40.740093,-73.992287,40.758511,0,2016-03-06 20:07:45,2016,3,6,9,66
4,1,1,-74.011063,40.715599,-74.005035,40.720966,0,2016-04-06 13:45:10,2016,4,6,14,97
...,...,...,...,...,...,...,...,...,...,...,...,...,...
145567,2,6,-74.008430,40.704441,-73.993698,40.687069,0,2016-03-31 22:15:56,2016,3,31,13,91
145568,2,2,-73.999146,40.728279,-73.991638,40.719582,0,2016-04-09 03:13:10,2016,4,9,14,100
145569,2,3,-73.996437,40.753201,-74.007622,40.741940,0,2016-03-19 14:45:21,2016,3,19,11,79
145570,2,1,-73.973206,40.743877,-73.977600,40.763412,0,2016-01-30 17:11:34,2016,1,30,4,30


In [17]:
from geopy import distance

pickup_coords = X[['pickup_latitude', 'pickup_longitude']].values
dropoff_coords = X[['dropoff_latitude', 'dropoff_longitude']].values

# Compute the geodesic distance (in km) for each trip and store it in the 'distance' column
X['distance'] = np.array([distance.distance(pickup, dropoff).km for pickup, dropoff in zip(pickup_coords, dropoff_coords)])
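
The row-by-row geodesic computation can be slow on large frames. As a rough alternative, the sketch below uses a vectorized haversine (great-circle) distance and a simple Manhattan-style distance built from its north-south and east-west legs. It assumes a spherical Earth with radius 6371 km, and the function names are hypothetical. It also ignores that Manhattan's street grid is rotated relative to true north, so it only approximates real driving distance.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or arrays of degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def manhattan_km(lat1, lon1, lat2, lon2):
    """Sum of a north-south leg and an east-west leg (axis-aligned approximation)."""
    north_south = haversine_km(lat1, lon1, lat2, lon1)
    east_west = haversine_km(lat2, lon1, lat2, lon2)
    return north_south + east_west


# Coordinates of the first two trips from df.head().
pickup_lat = np.array([40.757042, 40.771061])
pickup_lon = np.array([-73.971916, -73.961891])
dropoff_lat = np.array([40.753624, 40.908562])
dropoff_lon = np.array([-73.974663, -73.906311])

great_circle = haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
manhattan = manhattan_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(great_circle, manhattan)
```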

In [25]:
X

Unnamed: 0,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,pickup_datetime,Year,Month,Day,WeekOfYear,DayOfYear,distance,is_holiday
0,2,1,-73.971916,40.757042,-73.974663,40.753624,0,2016-03-31 16:04:41,2016,3,31,13,91,0.444817,False
1,2,2,-73.961891,40.771061,-73.906311,40.908562,0,2016-04-21 21:54:52,2016,4,21,16,112,15.972880,False
2,1,3,-74.010338,40.711674,-73.957047,40.777634,0,2016-03-30 11:26:24,2016,3,30,13,90,8.597141,False
3,1,1,-74.005898,40.740093,-73.992287,40.758511,0,2016-03-06 20:07:45,2016,3,6,9,66,2.346118,False
4,1,1,-74.011063,40.715599,-74.005035,40.720966,0,2016-04-06 13:45:10,2016,4,6,14,97,0.783958,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145567,2,6,-74.008430,40.704441,-73.993698,40.687069,0,2016-03-31 22:15:56,2016,3,31,13,91,2.296103,False
145568,2,2,-73.999146,40.728279,-73.991638,40.719582,0,2016-04-09 03:13:10,2016,4,9,14,100,1.155483,False
145569,2,3,-73.996437,40.753201,-74.007622,40.741940,0,2016-03-19 14:45:21,2016,3,19,11,79,1.567191,False
145570,2,1,-73.973206,40.743877,-73.977600,40.763412,0,2016-01-30 17:11:34,2016,1,30,4,30,2.200866,False


In [34]:
# Flag holidays (here, weekends are treated as holidays)
X['is_holiday'] = (X['pickup_datetime'].dt.weekday >= 5)  # weekday 5 or 6 (Saturday, Sunday) counts as a holiday
X

Unnamed: 0,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,pickup_datetime,Year,Month,Day,WeekOfYear,DayOfYear,distance,is_holiday
0,2,1,-73.971916,40.757042,-73.974663,40.753624,0,2016-03-31 16:04:41,2016,3,31,13,91,0.444817,False
1,2,2,-73.961891,40.771061,-73.906311,40.908562,0,2016-04-21 21:54:52,2016,4,21,16,112,15.972880,False
2,1,3,-74.010338,40.711674,-73.957047,40.777634,0,2016-03-30 11:26:24,2016,3,30,13,90,8.597141,False
3,1,1,-74.005898,40.740093,-73.992287,40.758511,0,2016-03-06 20:07:45,2016,3,6,9,66,2.346118,True
4,1,1,-74.011063,40.715599,-74.005035,40.720966,0,2016-04-06 13:45:10,2016,4,6,14,97,0.783958,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145567,2,6,-74.008430,40.704441,-73.993698,40.687069,0,2016-03-31 22:15:56,2016,3,31,13,91,2.296103,False
145568,2,2,-73.999146,40.728279,-73.991638,40.719582,0,2016-04-09 03:13:10,2016,4,9,14,100,1.155483,True
145569,2,3,-73.996437,40.753201,-74.007622,40.741940,0,2016-03-19 14:45:21,2016,3,19,11,79,1.567191,True
145570,2,1,-73.973206,40.743877,-73.977600,40.763412,0,2016-01-30 17:11:34,2016,1,30,4,30,2.200866,True


In [35]:
# Count the trips flagged as holiday (the column is already boolean)
mask = X['is_holiday']
mask.sum()

41473
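
The flag above marks weekends only. A minimal sketch of how public holidays could be flagged with the pandas holiday calendar mentioned in the starting points follows. Using `USFederalHolidayCalendar` is an assumption (other calendars are possible), and the sample timestamps are chosen for illustration.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

pickups = pd.to_datetime(pd.Series([
    "2016-01-18 09:30:00",  # Martin Luther King Jr. Day
    "2016-03-31 16:04:41",  # ordinary Thursday
    "2016-07-04 12:00:00",  # Independence Day
]))

# All US federal holidays observed in 2016, as a DatetimeIndex of dates.
holidays = USFederalHolidayCalendar().holidays(start="2016-01-01", end="2016-12-31")

# Strip the time of day before comparing against the holiday dates.
is_public_holiday = pickups.dt.normalize().isin(holidays)
print(is_public_holiday.tolist())
```

Such a flag could be combined with the weekend flag, or kept as a separate column so the model can treat the two cases differently.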

In [36]:
from sklearn.cluster import MiniBatchKMeans

cluster_features = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude']
kmeans = MiniBatchKMeans(n_clusters=10, random_state=0, batch_size=6, n_init="auto")
X['geodata'] = kmeans.fit_predict(df[cluster_features])
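
The cell above assigns one cluster per trip from all four coordinates. The starting points also suggest a separate start and end cluster. A sketch of that variant, under assumptions, is shown below: the synthetic coordinates, the cluster count, and the column names `pickup_cluster` and `dropoff_cluster` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

# Synthetic coordinates roughly around New York City (illustration only).
rng = np.random.default_rng(0)
trips = pd.DataFrame({
    "pickup_latitude": rng.normal(40.75, 0.03, 500),
    "pickup_longitude": rng.normal(-73.98, 0.03, 500),
    "dropoff_latitude": rng.normal(40.75, 0.03, 500),
    "dropoff_longitude": rng.normal(-73.98, 0.03, 500),
})

pickup_pts = trips[["pickup_latitude", "pickup_longitude"]].to_numpy()
dropoff_pts = trips[["dropoff_latitude", "dropoff_longitude"]].to_numpy()

# Fit one set of location clusters on all points, then label each end of the trip.
kmeans = MiniBatchKMeans(n_clusters=5, random_state=0, batch_size=256, n_init=3)
kmeans.fit(np.vstack([pickup_pts, dropoff_pts]))

trips["pickup_cluster"] = kmeans.predict(pickup_pts)
trips["dropoff_cluster"] = kmeans.predict(dropoff_pts)
print(trips[["pickup_cluster", "dropoff_cluster"]].head())
```

Sharing one fitted model for both ends means a given cluster ID refers to the same area whether it appears as a start or an end.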

In [37]:
X

Unnamed: 0,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,pickup_datetime,Year,Month,Day,WeekOfYear,DayOfYear,distance,is_holiday,geodata
0,2,1,-73.971916,40.757042,-73.974663,40.753624,0,2016-03-31 16:04:41,2016,3,31,13,91,0.444817,False,0
1,2,2,-73.961891,40.771061,-73.906311,40.908562,0,2016-04-21 21:54:52,2016,4,21,16,112,15.972880,False,4
2,1,3,-74.010338,40.711674,-73.957047,40.777634,0,2016-03-30 11:26:24,2016,3,30,13,90,8.597141,False,7
3,1,1,-74.005898,40.740093,-73.992287,40.758511,0,2016-03-06 20:07:45,2016,3,6,9,66,2.346118,True,7
4,1,1,-74.011063,40.715599,-74.005035,40.720966,0,2016-04-06 13:45:10,2016,4,6,14,97,0.783958,False,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145567,2,6,-74.008430,40.704441,-73.993698,40.687069,0,2016-03-31 22:15:56,2016,3,31,13,91,2.296103,False,1
145568,2,2,-73.999146,40.728279,-73.991638,40.719582,0,2016-04-09 03:13:10,2016,4,9,14,100,1.155483,True,8
145569,2,3,-73.996437,40.753201,-74.007622,40.741940,0,2016-03-19 14:45:21,2016,3,19,11,79,1.567191,True,8
145570,2,1,-73.973206,40.743877,-73.977600,40.763412,0,2016-01-30 17:11:34,2016,1,30,4,30,2.200866,True,0


In [38]:
weather = pd.read_csv('./data/weather_data_nyc_centralpark_2016.csv')
weather

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
0,1-1-2016,42,34,38.0,0.00,0.0,0
1,2-1-2016,40,32,36.0,0.00,0.0,0
2,3-1-2016,45,35,40.0,0.00,0.0,0
3,4-1-2016,36,14,25.0,0.00,0.0,0
4,5-1-2016,29,11,20.0,0.00,0.0,0
...,...,...,...,...,...,...,...
361,27-12-2016,60,40,50.0,0,0,0
362,28-12-2016,40,34,37.0,0,0,0
363,29-12-2016,46,33,39.5,0.39,0,0
364,30-12-2016,40,33,36.5,0.01,T,0


In [39]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   date                 366 non-null    object 
 1   maximum temperature  366 non-null    int64  
 2   minimum temperature  366 non-null    int64  
 3   average temperature  366 non-null    float64
 4   precipitation        366 non-null    object 
 5   snow fall            366 non-null    object 
 6   snow depth           366 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 20.1+ KB


In [45]:
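# Parse the day-month-year dates and derive the same key columns as in X
weather['date'] = pd.to_datetime(weather['date'], format='%d-%m-%Y')
weather['Year'] = weather['date'].dt.year
weather['Month'] = weather['date'].dt.month
weather['Day'] = weather['date'].dt.day

# Join the daily weather onto each trip.
# Caution: as the output below shows, the merged rows come back grouped by date,
# so newX is no longer in the same row order as X (and y).
newX = pd.merge(X, weather, on=['Year', 'Month', 'Day'], how='inner')

newX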

Unnamed: 0,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,pickup_datetime,Year,Month,...,distance,is_holiday,geodata,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
0,2,1,-73.971916,40.757042,-73.974663,40.753624,0,2016-03-31 16:04:41,2016,3,...,0.444817,False,0,2016-03-31,73,49,61.0,0.00,0.0,0
1,2,1,-73.982658,40.750771,-73.974205,40.747406,0,2016-03-31 09:20:25,2016,3,...,0.805777,False,0,2016-03-31,73,49,61.0,0.00,0.0,0
2,2,6,-73.991829,40.735119,-73.973602,40.764629,0,2016-03-31 16:54:37,2016,3,...,3.620617,False,7,2016-03-31,73,49,61.0,0.00,0.0,0
3,1,1,-73.969414,40.761238,-73.970253,40.765282,0,2016-03-31 15:07:56,2016,3,...,0.454594,False,0,2016-03-31,73,49,61.0,0.00,0.0,0
4,2,1,-73.975090,40.750202,-73.985497,40.731541,0,2016-03-31 22:07:09,2016,3,...,2.251040,False,8,2016-03-31,73,49,61.0,0.00,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145567,2,1,-73.973587,40.747940,-73.974709,40.748489,0,2016-03-29 08:18:04,2016,3,...,0.112661,False,0,2016-03-29,53,40,46.5,0.00,0.0,0
145568,2,2,-73.974854,40.765171,-73.790314,40.643856,0,2016-03-29 06:36:10,2016,3,...,20.608409,False,3,2016-03-29,53,40,46.5,0.00,0.0,0
145569,2,1,-74.001740,40.735432,-74.007217,40.726471,0,2016-03-29 11:48:12,2016,3,...,1.097415,False,8,2016-03-29,53,40,46.5,0.00,0.0,0
145570,1,2,-73.999664,40.761372,-73.996078,40.743473,0,2016-03-29 20:34:14,2016,3,...,2.010555,False,8,2016-03-29,53,40,46.5,0.00,0.0,0


In [47]:
newX.drop('date', axis=1, inplace=True)

In [53]:
newX.drop('pickup_datetime', axis=1, inplace=True)

In [54]:
newX

Unnamed: 0,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,Year,Month,Day,...,DayOfYear,distance,is_holiday,geodata,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
0,2,1,-73.971916,40.757042,-73.974663,40.753624,0,2016,3,31,...,91,0.444817,False,0,73,49,61.0,0.00,0.0,0
1,2,1,-73.982658,40.750771,-73.974205,40.747406,0,2016,3,31,...,91,0.805777,False,0,73,49,61.0,0.00,0.0,0
2,2,6,-73.991829,40.735119,-73.973602,40.764629,0,2016,3,31,...,91,3.620617,False,7,73,49,61.0,0.00,0.0,0
3,1,1,-73.969414,40.761238,-73.970253,40.765282,0,2016,3,31,...,91,0.454594,False,0,73,49,61.0,0.00,0.0,0
4,2,1,-73.975090,40.750202,-73.985497,40.731541,0,2016,3,31,...,91,2.251040,False,8,73,49,61.0,0.00,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145567,2,1,-73.973587,40.747940,-73.974709,40.748489,0,2016,3,29,...,89,0.112661,False,0,53,40,46.5,0.00,0.0,0
145568,2,2,-73.974854,40.765171,-73.790314,40.643856,0,2016,3,29,...,89,20.608409,False,3,53,40,46.5,0.00,0.0,0
145569,2,1,-74.001740,40.735432,-74.007217,40.726471,0,2016,3,29,...,89,1.097415,False,8,53,40,46.5,0.00,0.0,0
145570,1,2,-73.999664,40.761372,-73.996078,40.743473,0,2016,3,29,...,89,2.010555,False,8,53,40,46.5,0.00,0.0,0


In [55]:
newX.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145572 entries, 0 to 145571
Data columns (total 21 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   vendor_id            145572 non-null  int64  
 1   passenger_count      145572 non-null  int64  
 2   pickup_longitude     145572 non-null  float64
 3   pickup_latitude      145572 non-null  float64
 4   dropoff_longitude    145572 non-null  float64
 5   dropoff_latitude     145572 non-null  float64
 6   store_and_fwd_flag   145572 non-null  int64  
 7   Year                 145572 non-null  int32  
 8   Month                145572 non-null  int32  
 9   Day                  145572 non-null  int32  
 10  WeekOfYear           145572 non-null  UInt32 
 11  DayOfYear            145572 non-null  int32  
 12  distance             145572 non-null  float64
 13  is_holiday           145572 non-null  bool   
 14  geodata              145572 non-null  int32  
 15  maximum temperatu

In [64]:
# Convert the weather columns to numbers; trace entries ('T') cannot be parsed and become NaN
newX['precipitation'] = pd.to_numeric(newX['precipitation'], errors='coerce')
newX['snow fall'] = pd.to_numeric(newX['snow fall'], errors='coerce')
newX['snow depth'] = pd.to_numeric(newX['snow depth'], errors='coerce')

In [70]:
# Fill the NaNs left by trace ('T') entries with the column mean.
# Assign the result back instead of calling fillna(inplace=True) on a column selection.
for col in ['precipitation', 'snow fall', 'snow depth']:
    newX[col] = newX[col].fillna(newX[col].mean())
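
Filling trace entries with the column mean treats "too little to measure" like an average day. An alternative sketch replaces `T` with a small constant before converting. The value `0.005` and the name `TRACE_VALUE` are assumptions for illustration; using `0` would be just as defensible.

```python
import numpy as np
import pandas as pd

TRACE_VALUE = 0.005  # hypothetical stand-in for a trace amount

# Sample values in the same string format as the weather file.
raw = pd.Series(["0.00", "T", "0.39", "0.01"], name="precipitation")

# Keep the information that a trace was observed, then convert to numbers.
was_trace = raw.eq("T")
cleaned = pd.to_numeric(raw.replace("T", str(TRACE_VALUE)))
print(cleaned.tolist(), was_trace.tolist())
```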

In [71]:
newX.isnull().sum()

vendor_id              0
passenger_count        0
pickup_longitude       0
pickup_latitude        0
dropoff_longitude      0
dropoff_latitude       0
store_and_fwd_flag     0
Year                   0
Month                  0
Day                    0
WeekOfYear             0
DayOfYear              0
distance               0
is_holiday             0
geodata                0
maximum temperature    0
minimum temperature    0
average temperature    0
precipitation          0
snow fall              0
snow depth             0
dtype: int64

In [72]:
baseline = RandomForestRegressor(random_state=0, n_jobs=-1, max_features=MAX_FEATURES, max_depth=MAX_DEPTH)
baseline_scores = cross_validate(
    baseline, newX, y, cv=5, scoring="neg_root_mean_squared_error", return_train_score=True
)

print(f"RMSE: Train-Score: {-1*baseline_scores['train_score'].mean():.2f} - Test Score: {-1*baseline_scores['test_score'].mean():.2f}")

RMSE: Train-Score: 637.47 - Test Score: 642.24
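
This score is worse than the baseline's test RMSE of 440.95. One plausible cause, judging from the `newX` output above, is that the merge returned the rows grouped by date, so `newX` no longer lines up row by row with `y`. The sketch below, on toy data with made-up weather values, shows a left merge that keeps the row order of the left frame. The `validate` argument additionally checks that each date occurs only once in the weather table.

```python
import pandas as pd

# Toy trips in their original order (durations taken from df.head()).
trips = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2016],
    "Month": [3, 4, 3, 3],
    "Day": [31, 21, 30, 6],
    "trip_duration": [113, 2037, 1811, 977],
})

# Toy daily weather table, sorted by date (temperatures are made up).
weather_demo = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2016],
    "Month": [3, 3, 3, 4],
    "Day": [6, 30, 31, 21],
    "average temperature": [40.0, 50.0, 60.0, 55.0],
})

# how="left" keeps every trip and preserves the order of the left frame.
merged = trips.merge(
    weather_demo, on=["Year", "Month", "Day"], how="left", validate="many_to_one"
)
print(merged)
```

With the row order preserved, the merged features can be passed to `cross_validate` together with the original `y`.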
