# Datetime variable transformation

The **DatetimeFeatures()** transformer can extract many datetime features from the datetime variables in a dataframe. Some of these features are numerical, such as month, year, day of the week and week of the year, and some are binary, such as whether a day falls on a weekend or is the last day of its month. All features are cast to integer before being added to the dataframe. <br>
DatetimeFeatures() converts datetime variables whose dtype is originally object or categorical to a datetime format, but it does not work with variables whose original dtype is numerical. <br>
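
To illustrate why numerical columns are a poor fit for datetime parsing, here is a minimal pandas-only sketch. It does not use feature-engine and is not the library's stated rationale; the series names are made up for the example. Dates stored as strings parse unambiguously, whereas pandas reads a bare integer as nanoseconds since the Unix epoch.

```python
import pandas as pd

# Datetimes stored as strings parse as expected.
as_text = pd.Series(["2012-10-02 09:00:00", "2012-10-02 10:00:00"])
parsed_text = pd.to_datetime(as_text)

# The same date written as an integer is read as nanoseconds since
# 1970-01-01, which is almost certainly not what was meant.
as_number = pd.Series([20121002])
parsed_number = pd.to_datetime(as_number)

print(parsed_text.dt.year.tolist())
print(parsed_number.dt.year.tolist())
```

If a numerical column really does encode dates, converting it explicitly before handing the dataframe to the transformer avoids this ambiguity.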
    
For this demonstration, we use the Metro Interstate Traffic Volume Data Set, which is publicly available at https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume

In [38]:
#for starters, we import the relevant modules and the DatetimeFeatures class
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from feature_engine.datetime import DatetimeFeatures

In [39]:
#load and inspect the dataset
data = pd.read_csv('../Metro_Interstate_Traffic_Volume.csv')

data.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
0,,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545
1,,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516
2,,289.58,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 11:00:00,4767
3,,290.13,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 12:00:00,5026
4,,291.14,0.0,0.0,75,Clouds,broken clouds,2012-10-02 13:00:00,4918


In [40]:
data.shape

(48204, 9)

In [41]:
# inspect the column types and check for missing values
pd.DataFrame({"type":data.dtypes, "nan count":data.isna().sum()})

Unnamed: 0,type,nan count
holiday,object,0
temp,float64,0
rain_1h,float64,0
snow_1h,float64,0
clouds_all,int64,0
weather_main,object,0
weather_description,object,0
date_time,object,0
traffic_volume,int64,0


This dataset contains only one datetime variable, _date\_time_. <br>
Let's say we want to extract the _day of the month_ and the _hour_ features from it.
Since _date\_time_ is the only datetime variable in this dataset, we can do either of the following:
- let the transformer search for all datetime variables by initializing it with variables=None (which is the default option anyway)
- specify which variables are going to be processed, which in this case would be setting variables="date_time"
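
To give an intuition for the first option, the sketch below shows one way a search for datetime variables could work using pandas alone. It is an assumption made for illustration, not feature-engine's actual detection logic, and `find_datetime_like_columns` is a hypothetical helper name.

```python
import warnings

import pandas as pd


def find_datetime_like_columns(df):
    """Return the names of columns that are, or can be parsed as, datetimes."""
    found = []
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_datetime64_any_dtype(series):
            # already a datetime column
            found.append(col)
        elif pd.api.types.is_numeric_dtype(series):
            # numerical columns are never treated as datetime
            continue
        else:
            try:
                with warnings.catch_warnings():
                    # silence pandas' format-inference warnings
                    warnings.simplefilter("ignore")
                    pd.to_datetime(series)
                found.append(col)
            except (ValueError, TypeError):
                # the values could not be parsed as datetimes
                pass
    return found


sample = pd.DataFrame({
    "date_time": ["2012-10-02 09:00:00", "2012-10-02 10:00:00"],
    "weather_main": ["Clouds", "Clouds"],
    "temp": [288.28, 289.36],
})
print(find_datetime_like_columns(sample))
```

A check this simple has limits: a column containing only missing values would also parse without error, so a real implementation needs stricter rules.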

In [42]:
dtfs = DatetimeFeatures(
    variables=None,
    features_to_extract=["day_of_the_month", "hour"]
)

# as per scikit-learn and feature-engine convention, we call the fit and transform methods
# to process the data (even though this particular transformer does not learn any parameters)
dtfs.fit(data)

DatetimeFeatures(features_to_extract=['day_of_the_month', 'hour'])

In [20]:
# check which variables have been picked up as datetime during fit
dtfs.variables_

['date_time']

In [43]:
data_transf = dtfs.transform(data)
data_transf.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,traffic_volume,date_time_dotm,date_time_hour
0,,288.28,0.0,0.0,40,Clouds,scattered clouds,5545,2,9
1,,289.36,0.0,0.0,75,Clouds,broken clouds,4516,2,10
2,,289.58,0.0,0.0,90,Clouds,overcast clouds,4767,2,11
3,,290.13,0.0,0.0,90,Clouds,overcast clouds,5026,2,12
4,,291.14,0.0,0.0,75,Clouds,broken clouds,4918,2,13


Notably, the transformer identified that the object-typed _date\_time_ variable could be cast to datetime, and the dataframe acquired the two columns _date\_time\_dotm_ and _date\_time\_hour_, corresponding to the features we requested through the _features\_to\_extract_ argument. <br>
**Note**: the original _date\_time_ column was removed from the dataframe in the process, as per default behaviour. If we want to keep it, we need to initialize the transformer passing drop_original=False.
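
For comparison, a roughly equivalent result can be obtained by hand with pandas. This is only a sketch: `extract_day_and_hour` is a hypothetical helper, not part of feature-engine, and it mimics the column names shown above and the effect of `drop_original`.

```python
import pandas as pd


def extract_day_and_hour(df, column, drop_original=True):
    """Append <column>_dotm and <column>_hour, optionally dropping <column>."""
    parsed = pd.to_datetime(df[column])
    out = df.copy()
    out[f"{column}_dotm"] = parsed.dt.day
    out[f"{column}_hour"] = parsed.dt.hour
    if drop_original:
        out = out.drop(columns=column)
    return out


sample = pd.DataFrame({
    "date_time": ["2012-10-02 09:00:00", "2012-10-02 10:00:00"],
    "traffic_volume": [5545, 4516],
})

dropped = extract_day_and_hour(sample, "date_time")
kept = extract_day_and_hour(sample, "date_time", drop_original=False)
print(dropped.columns.tolist())
print(kept.columns.tolist())
```

The transformer saves us from writing and maintaining this kind of helper for every feature and every datetime column.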

In [44]:
# this time we specify what variable(s) we want the features extracted from
# we also want to keep the original datetime variable(s).
dtfs = DatetimeFeatures(
    variables="date_time",
    features_to_extract=["day_of_the_month", "hour"],
    drop_original=False
)

data_transf = dtfs.fit_transform(data)

In [45]:
data_transf.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume,date_time_dotm,date_time_hour
0,,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545,2,9
1,,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516,2,10
2,,289.58,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 11:00:00,4767,2,11
3,,290.13,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 12:00:00,5026,2,12
4,,291.14,0.0,0.0,75,Clouds,broken clouds,2012-10-02 13:00:00,4918,2,13


There are many more datetime features that DatetimeFeatures() can extract; see the docs for a full list. <br>
The argument _features\_to\_extract_ has a default option as well. Let's quickly see what it does.

In [46]:
dtfs = DatetimeFeatures(features_to_extract=None)

data_transf = dtfs.fit_transform(data)

In [47]:
# only show columns that were extracted from date_time
data_transf.filter(regex="^date_time").head()

Unnamed: 0,date_time_month,date_time_year,date_time_dotw,date_time_dotm,date_time_hour,date_time_minute,date_time_second
0,10,2012,1,2,9,0,0
1,10,2012,1,2,10,0,0
2,10,2012,1,2,11,0,0
3,10,2012,1,2,12,0,0
4,10,2012,1,2,13,0,0


As shown above, DatetimeFeatures() extracts _month_, _year_, _day of the week_, _day of the month_, _hour_, _minute_ and _second_ by default. <br>
**Note**: when a variable only contains date information, all the time features default to 00:00:00; conversely, when a variable only contains time information, the date features default to today's date at the time of calling the transform method.
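
The date-only half of this note can be checked with pandas alone. The sketch below does not use feature-engine and covers only that case; the variable names are chosen for the example.

```python
import pandas as pd

# Strings that carry a date but no time of day.
dates_only = pd.to_datetime(pd.Series(["2012-10-02", "2012-10-03"]))

# Every time component falls back to zero, i.e. midnight.
time_parts = pd.DataFrame({
    "hour": dates_only.dt.hour,
    "minute": dates_only.dt.minute,
    "second": dates_only.dt.second,
})
print(time_parts)
```

Time features extracted from such a variable are therefore constant and carry no information.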

If we really want to extract _all_ of the available features, we can set _features\_to\_extract_ to the special value "all". Beware, though: the feature space can grow significantly, and many of the extracted features are unlikely to be relevant.

In [48]:
dtfs = DatetimeFeatures(features_to_extract="all")

data_transf = dtfs.fit_transform(data)

In [49]:
data_transf.filter(regex="^date_time").head()

Unnamed: 0,date_time_month,date_time_quarter,date_time_semester,date_time_year,date_time_wotm,date_time_woty,date_time_dotw,date_time_dotm,date_time_doty,date_time_weekend,...,date_time_month_end,date_time_quarter_start,date_time_quarter_end,date_time_year_start,date_time_year_end,date_time_leap_year,date_time_days_in_month,date_time_hour,date_time_minute,date_time_second
0,10,4,2,2012,1,40,1,2,276,0,...,0,0,0,0,0,1,31,9,0,0
1,10,4,2,2012,1,40,1,2,276,0,...,0,0,0,0,0,1,31,10,0,0
2,10,4,2,2012,1,40,1,2,276,0,...,0,0,0,0,0,1,31,11,0,0
3,10,4,2,2012,1,40,1,2,276,0,...,0,0,0,0,0,1,31,12,0,0
4,10,4,2,2012,1,40,1,2,276,0,...,0,0,0,0,0,1,31,13,0,0
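
Several of the values in the first row above can be reproduced with standard pandas datetime accessors. The pairing of feature suffixes with accessors below is an assumption made for illustration; it is not taken from feature-engine's source.

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(["2012-10-02 09:00:00"]))

features = pd.DataFrame({
    "quarter": ts.dt.quarter,
    "woty": ts.dt.isocalendar().week,   # ISO week of the year
    "dotw": ts.dt.dayofweek,            # Monday=0 ... Sunday=6
    "doty": ts.dt.dayofyear,
    "weekend": ts.dt.dayofweek >= 5,    # Saturday or Sunday
    "month_end": ts.dt.is_month_end,
    "leap_year": ts.dt.is_leap_year,
    "days_in_month": ts.dt.days_in_month,
}).astype(int)                          # booleans become 0/1

first_row = {name: int(value) for name, value in features.iloc[0].items()}
print(first_row)
```

Casting the boolean flags to integer mirrors the 0/1 encoding seen in the transformer's output.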


Another thing to keep in mind is that many of these features are often quasi-constant, if not constant altogether. This can happen for several reasons, most likely because of the particular time window in which the data was collected. <br>
We can thus combine the DatetimeFeatures() and DropConstantFeatures() transformers from feature_engine in a scikit-learn pipeline to automatically get rid of features we deem irrelevant to our analysis.

In [50]:
from sklearn.pipeline import Pipeline
from feature_engine.selection import DropConstantFeatures

pipe = Pipeline([
    ('datetime_extraction', DatetimeFeatures(
        features_to_extract=["year", "day_of_the_month", "minute", "second"])),
    ('drop_constants', DropConstantFeatures())
])

data_transf = pipe.fit_transform(data)

In [51]:
data_transf.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,traffic_volume,date_time_year,date_time_dotm
0,,288.28,0.0,0.0,40,Clouds,scattered clouds,5545,2012,2
1,,289.36,0.0,0.0,75,Clouds,broken clouds,4516,2012,2
2,,289.58,0.0,0.0,90,Clouds,overcast clouds,4767,2012,2
3,,290.13,0.0,0.0,90,Clouds,overcast clouds,5026,2012,2
4,,291.14,0.0,0.0,75,Clouds,broken clouds,4918,2012,2


Since all data was gathered with only hour precision, the _minute_ and _second_ features we requested were extracted by DatetimeFeatures() but subsequently dropped by DropConstantFeatures(). This way we keep our feature space from becoming cluttered with useless information, even when we are not particularly diligent about the features we ask to extract.
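
The idea behind the second pipeline step can be sketched with pandas alone. The snippet below only removes strictly constant columns from a small made-up dataframe; it is an illustration of the concept, not DropConstantFeatures' implementation.

```python
import pandas as pd

# Made-up extracted features: only the year varies.
extracted = pd.DataFrame({
    "date_time_year": [2012, 2013, 2013],
    "date_time_minute": [0, 0, 0],
    "date_time_second": [0, 0, 0],
})

# A column is constant when it holds a single distinct value.
constant = [
    col for col in extracted.columns
    if extracted[col].nunique(dropna=False) == 1
]
reduced = extracted.drop(columns=constant)

print(constant)
print(reduced.columns.tolist())
```

Wrapping this logic in a pipeline step, as done above, ensures the same columns are dropped when the pipeline is later applied to new data.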