# Approximate Features


Some features are more computationally intensive to calculate than others. In a large dataset, direct features that are aggregations on the prediction entity may not change much from cutoff time to cutoff time. Calculating the aggregation features at fixed times, say once every hour, and using those values for all cutoff times within the hour would save time and perhaps not lose much information. The approximate parameter in calculate_feature_matrix and dfs lets you specify a window size to use when approximating these direct aggregation features. This example shows how to use that parameter.
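As a rough illustration of the idea (a toy sketch in plain pandas, not how Featuretools implements it), the snippet below counts events up to each cutoff time exactly, then approximates by computing the count once at the start of each hourly window and reusing it for every cutoff in that window. The event and cutoff timestamps are made up for the example.

```python
import pandas as pd

# Made-up event times and cutoff times, for illustration only.
events = pd.Series(pd.to_datetime([
    '2016-02-01 00:10', '2016-02-01 00:40',
    '2016-02-01 01:20', '2016-02-01 01:50',
]))
cutoffs = pd.Series(pd.to_datetime([
    '2016-02-01 01:05', '2016-02-01 01:30', '2016-02-01 01:55',
]))


def count_before(t):
    """Toy aggregation: number of events at or before time t."""
    return int((events <= t).sum())


# Exact: run the aggregation once per cutoff time.
exact = cutoffs.apply(count_before)

# Approximate: round each cutoff down to the start of its hourly window,
# run the aggregation once per distinct window, and reuse the result.
window_starts = cutoffs.dt.floor('h')
per_window = {t: count_before(t) for t in window_starts.drop_duplicates()}
approx = window_starts.map(per_window)

print(exact.tolist())   # [2, 3, 4]
print(approx.tolist())  # [2, 2, 2]
```

Here the three cutoff times share one window, so the aggregation runs once instead of three times, at the cost of ignoring events that arrive inside the window.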

In [1]:
import featuretools as ft
import pandas as pd

The flight dataset is about 1.6 GB in size, so you may want to use the nrows parameter to load only part of it if the full dataset is too large.

In [2]:
# es = ft.demo.load_flight()
es = ft.demo.load_flight(nrows=200000)

With the entityset loaded, let's use dfs to create some features to calculate.

In [3]:
es

Entityset: flight_dataset
  Entities:
    flights (shape = [141436, 9])
    trips (shape = [200000, 24])
  Relationships:
    trips.flight_id -> flights.flight_id

In [20]:
trips = es['trips'].df
trips.head(5)

trip_id,trip_id,Unnamed: 0,FL_DATE,CRS_DEP_TIME,DEP_TIME,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,...,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,DISTANCE,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,Unnamed: 27,flight_id
0,0,0,2016-02-01,615,614.0,22.0,636.0,1258.0,5.0,1325,...,250.0,229.0,1892.0,,,,,,,DL_N818DA_1592
1,1,1,2016-02-01,2316,2307.0,15.0,2322.0,635.0,8.0,653,...,277.0,276.0,2182.0,,,,,,,DL_N831DN_1593
2,2,2,2016-02-01,2215,2212.0,23.0,2235.0,2323.0,3.0,2328,...,133.0,134.0,674.0,,,,,,,DL_N905DL_1594
3,3,3,2016-02-01,1644,1639.0,16.0,1655.0,1922.0,8.0,1955,...,251.0,231.0,1547.0,,,,,,,DL_N698DL_1595
4,4,4,2016-02-01,1930,1952.0,51.0,2043.0,2342.0,6.0,2312,...,162.0,176.0,1020.0,22.0,0.0,14.0,0.0,0.0,,DL_N982AT_1596


In [21]:
trips.tail(5)

trip_id,trip_id,Unnamed: 0,FL_DATE,CRS_DEP_TIME,DEP_TIME,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,...,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,DISTANCE,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,Unnamed: 27,flight_id
199995,199995,199995,2016-02-14,1050,1042.0,17.0,1059.0,1325.0,10.0,1350,...,120.0,113.0,667.0,,,,,,,EV_N13997_4220
199996,199996,199996,2016-02-14,657,647.0,20.0,707.0,754.0,8.0,810,...,73.0,75.0,253.0,,,,,,,EV_N14105_4221
199997,199997,199997,2016-02-14,900,855.0,13.0,908.0,1011.0,6.0,1030,...,90.0,82.0,489.0,,,,,,,EV_N19554_4221
199998,199998,199998,2016-02-14,1015,1011.0,11.0,1022.0,1312.0,6.0,1333,...,138.0,127.0,844.0,,,,,,,EV_N16911_4222
199999,199999,199999,2016-02-14,620,611.0,10.0,621.0,734.0,9.0,758,...,98.0,92.0,427.0,,,,,,,EV_N14171_4223


In [5]:
features = ft.dfs(entityset=es, target_entity='trips', features_only=True)

In [None]:
cutoff_time = trips.filter(['trip_id', 'FL_DATE'])

Now we time calculate_feature_matrix using the cutoff times and features.

In [10]:
%%time
feature_matrix = ft.calculate_feature_matrix(features=features, entityset=es,
                                             cutoff_time=cutoff_time,
                                             verbose=True)

calculate_feature_matrix: 100%|██████████| 13/13 [06:53<00:00, 42.40s/it]
CPU times: user 6min 54s, sys: 0 ns, total: 6min 54s
Wall time: 6min 53s


The number of tasks in the calculate_feature_matrix progress bar refers to the number of unique dates at which features are calculated.

In [23]:
feature_matrix.tail(5)

instance_id,CANCELLED,LATE_AIRCRAFT_DELAY,flight_id,MONTH(FL_DATE),NAS_DELAY,TAXI_OUT,YEAR(FL_DATE),ACTUAL_ELAPSED_TIME,WHEELS_ON,Unnamed: 27,...,flights.MEAN(trips.DISTANCE),flights.MIN(trips.WHEELS_ON),flights.MIN(trips.NAS_DELAY),flights.SKEW(trips.Unnamed: 0),flights.COUNT(trips),flights.SUM(trips.CRS_ARR_TIME),flights.SKEW(trips.WHEELS_ON),flights.MAX(trips.SECURITY_DELAY),flights.MEAN(trips.Unnamed: 27),flights.N_UNIQUE(trips.WEEKDAY(FL_DATE))
199995,0.0,,EV_N13997_4220,2,,17.0,2016,113.0,1325.0,,...,667.0,1157.0,,0.0,2,2507,0.0,,,2
199996,0.0,,EV_N14105_4221,2,,20.0,2016,75.0,754.0,,...,253.0,754.0,,0.0,1,810,0.0,,,1
199997,0.0,,EV_N19554_4221,2,,13.0,2016,82.0,1011.0,,...,489.0,1011.0,,0.0,1,1030,0.0,,,1
199998,0.0,,EV_N16911_4222,2,,11.0,2016,127.0,1312.0,,...,844.0,1312.0,,0.0,1,1333,0.0,,,1
199999,0.0,,EV_N14171_4223,2,,10.0,2016,92.0,734.0,,...,427.0,734.0,,0.0,1,758,0.0,,,1


Now we time it again using the approximate parameter, here with a 3-day window.

In [12]:
%%time
feature_matrix_approximated = ft.calculate_feature_matrix(features=features, entityset=es, 
                                                          cutoff_time=cutoff_time,
                                                          approximate=ft.Timedelta(3, 'd'),
                                                          verbose=True)

approximate_features: 100%|██████████| 5/5 [06:33<00:00, 84.97s/it] 
calculate_feature_matrix: 100%|██████████| 13/13 [00:05<00:00,  1.83it/s]
CPU times: user 6min 42s, sys: 0 ns, total: 6min 42s
Wall time: 6min 41s


The number of tasks in the approximate_features bar refers to the number of unique dates at which approximate features are calculated. This is smaller than the number of tasks in the non-approximated calculate_feature_matrix run because multiple dates are grouped together for approximation. Notice that the final features are calculated on the same number of dates as before, but those calculations now finish in about 5 seconds because the approximation step has already handled the direct aggregation features. Overall wall time drops only slightly in this run, from 6min 53s to 6min 41s.
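The drop from 13 tasks to 5 is consistent with simple arithmetic on window sizes. The sketch below assumes 13 consecutive daily cutoff dates and windows aligned to the first date; both are assumptions made for illustration, since the actual dates in the dataset and the way Featuretools aligns its windows may differ.

```python
import pandas as pd

# Hypothetical cutoff dates: 13 consecutive days (an assumption, not read from the data).
dates = pd.date_range('2016-02-01', periods=13, freq='D')


def count_windows(dates, window):
    """Number of distinct windows of length `window` that the dates fall into,
    with windows aligned to the first date (an assumed alignment)."""
    bin_ids = (dates - dates[0]) // window
    return bin_ids.nunique()


n_windows_3d = count_windows(dates, pd.Timedelta(days=3))
n_windows_6d = count_windows(dates, pd.Timedelta(days=6))
print(n_windows_3d, n_windows_6d)  # 5 3
```

Under these assumptions the counts agree with the approximate_features progress bars for the 3-day and 6-day runs in this notebook.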

In [24]:
feature_matrix_approximated.tail(5)

instance_id,CANCELLED,LATE_AIRCRAFT_DELAY,flight_id,MONTH(FL_DATE),NAS_DELAY,TAXI_OUT,YEAR(FL_DATE),ACTUAL_ELAPSED_TIME,WHEELS_ON,Unnamed: 27,...,flights.MEAN(trips.DISTANCE),flights.MIN(trips.WHEELS_ON),flights.MIN(trips.NAS_DELAY),flights.SKEW(trips.Unnamed: 0),flights.COUNT(trips),flights.SUM(trips.CRS_ARR_TIME),flights.SKEW(trips.WHEELS_ON),flights.MAX(trips.SECURITY_DELAY),flights.MEAN(trips.Unnamed: 27),flights.N_UNIQUE(trips.WEEKDAY(FL_DATE))
199995,0.0,,EV_N13997_4220,2,,17.0,2016,113.0,1325.0,,...,667.0,1157.0,,0.0,2,2507,0.0,,,2
199996,0.0,,EV_N14105_4221,2,,20.0,2016,75.0,754.0,,...,253.0,754.0,,0.0,1,810,0.0,,,1
199997,0.0,,EV_N19554_4221,2,,13.0,2016,82.0,1011.0,,...,489.0,1011.0,,0.0,1,1030,0.0,,,1
199998,0.0,,EV_N16911_4222,2,,11.0,2016,127.0,1312.0,,...,844.0,1312.0,,0.0,1,1333,0.0,,,1
199999,0.0,,EV_N14171_4223,2,,10.0,2016,92.0,734.0,,...,427.0,734.0,,0.0,1,758,0.0,,,1


In [14]:
%%time
feature_matrix_approximated_2 = ft.calculate_feature_matrix(features=features, entityset=es,
                                                            cutoff_time=cutoff_time,
                                                            approximate=ft.Timedelta(6, 'd'),
                                                            verbose=True)

approximate_features: 100%|██████████| 3/3 [06:08<00:00, 118.69s/it]
calculate_feature_matrix: 100%|██████████| 13/13 [00:05<00:00,  1.81it/s]
CPU times: user 6min 17s, sys: 608 ms, total: 6min 17s
Wall time: 6min 16s


In [22]:
feature_matrix_approximated_2.tail(5)

instance_id,CANCELLED,LATE_AIRCRAFT_DELAY,flight_id,MONTH(FL_DATE),NAS_DELAY,TAXI_OUT,YEAR(FL_DATE),ACTUAL_ELAPSED_TIME,WHEELS_ON,Unnamed: 27,...,flights.MEAN(trips.DISTANCE),flights.MIN(trips.WHEELS_ON),flights.MIN(trips.NAS_DELAY),flights.SKEW(trips.Unnamed: 0),flights.COUNT(trips),flights.SUM(trips.CRS_ARR_TIME),flights.SKEW(trips.WHEELS_ON),flights.MAX(trips.SECURITY_DELAY),flights.MEAN(trips.Unnamed: 27),flights.N_UNIQUE(trips.WEEKDAY(FL_DATE))
199995,0.0,,EV_N13997_4220,2,,17.0,2016,113.0,1325.0,,...,667.0,1157.0,,0.0,2,2507,0.0,,,2
199996,0.0,,EV_N14105_4221,2,,20.0,2016,75.0,754.0,,...,253.0,754.0,,0.0,1,810,0.0,,,1
199997,0.0,,EV_N19554_4221,2,,13.0,2016,82.0,1011.0,,...,489.0,1011.0,,0.0,1,1030,0.0,,,1
199998,0.0,,EV_N16911_4222,2,,11.0,2016,127.0,1312.0,,...,844.0,1312.0,,0.0,1,1333,0.0,,,1
199999,0.0,,EV_N14171_4223,2,,10.0,2016,92.0,734.0,,...,427.0,734.0,,0.0,1,758,0.0,,,1
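One way to check how much information the approximation loses is to compare an approximated matrix with the exact one, column by column. The helper below is a hypothetical sketch (fraction_changed is not part of Featuretools) and is shown on toy data rather than on the matrices computed above.

```python
import numpy as np
import pandas as pd


def fraction_changed(exact, approx):
    """Per-column fraction of rows whose value differs between two matrices.

    Cells that are NaN in both matrices are treated as equal.
    """
    # Align the approximated matrix to the exact one before comparing.
    approx = approx.reindex(index=exact.index, columns=exact.columns)
    same = (exact == approx) | (exact.isna() & approx.isna())
    return 1.0 - same.mean()


# Toy matrices: one value in column 'a' differs, column 'b' is identical.
toy_exact = pd.DataFrame({'a': [1.0, 2.0, np.nan, 4.0], 'b': [1, 1, 2, 2]})
toy_approx = pd.DataFrame({'a': [1.0, 2.5, np.nan, 4.0], 'b': [1, 1, 2, 2]})

changed = fraction_changed(toy_exact, toy_approx)
print(changed)
```

The same call could be applied to feature_matrix and feature_matrix_approximated, assuming both share the same index and columns; columns with a fraction near zero are the ones the approximation barely affects.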


**Appendix**

Below is a reference for filtering the data to ensure there is at least one data point before or on the approximate date when the features for a trip are approximated.

We merge the 'first_trips_time' field from the flights entity into the trips dataframe. This lets us know, for each trip, the date of the oldest data available for that flight.

In [6]:
flights = es['flights'].df
flights['first_trips_time'].head(5)

flight_id
AA_N002AA_139    2016-02-01
AA_N004AA_1258   2016-02-01
AA_N004AA_1494   2016-02-01
AA_N004AA_182    2016-02-01
AA_N004AA_183    2016-02-01
Name: first_trips_time, dtype: datetime64[ns]

In [7]:
first_trips = trips[['flight_id', 'FL_DATE']].merge(flights[['first_trips_time']], how='left',
                                                    left_on=['flight_id'], right_index=True)
first_trips.head(5)

trip_id,flight_id,FL_DATE,first_trips_time
0,DL_N818DA_1592,2016-02-01,2016-02-01
1,DL_N831DN_1593,2016-02-01,2016-02-01
2,DL_N905DL_1594,2016-02-01,2016-02-01
3,DL_N698DL_1595,2016-02-01,2016-02-01
4,DL_N982AT_1596,2016-02-01,2016-02-01


Next we filter the trips based on the time at which they are approximated: for each trip, the flight it belongs to must have at least one trip that occurs on or before the approximate cutoff time.

In [8]:
trip_by_cutoff_date = first_trips[first_trips['first_trips_time'] <= first_trips['FL_DATE'].apply(ft.bin_cutoff_time, args=(ft.Timedelta(6, 'd'),))].index
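The effect of this filter can be seen on a toy example. The sketch below does not call ft.bin_cutoff_time; it uses a hypothetical stand-in, bin_to_window_start, which rounds a timestamp down to the start of its window relative to an assumed origin. The real function may align windows differently, so the specific dates here are illustrative only.

```python
import pandas as pd


def bin_to_window_start(ts, window, origin):
    """Hypothetical stand-in for cutoff binning: round ts down to the start
    of its window, with windows counted from `origin` (an assumed alignment)."""
    n_windows = (ts - origin) // window
    return origin + n_windows * window


origin = pd.Timestamp('2016-02-01')
window = pd.Timedelta(days=6)

# Toy data: each trip's date and the first trip date of its flight.
toy = pd.DataFrame(
    {
        'FL_DATE': pd.to_datetime(['2016-02-03', '2016-02-08', '2016-02-08']),
        'first_trips_time': pd.to_datetime(['2016-02-01', '2016-02-01', '2016-02-08']),
    },
    index=pd.Index([0, 1, 2], name='trip_id'),
)

# Time at which each trip's features would be approximated.
approx_time = toy['FL_DATE'].apply(bin_to_window_start, args=(window, origin))

# Keep trips whose flight already has data on or before that time.
keep = toy[toy['first_trips_time'] <= approx_time].index
print(list(keep))  # [0, 1]
```

Trip 2 is dropped because its flight's first trip (2016-02-08) falls after the start of its window (2016-02-07), so there would be nothing to aggregate at the approximation time.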

In [9]:
approx_cutoff_time = trips[trips['trip_id'].isin(trip_by_cutoff_date)].filter(['trip_id', 'FL_DATE'])
approx_cutoff_time.rename(columns={'trip_id': 'instance_id', 'FL_DATE': 'time'}, inplace=True)
approx_cutoff_time.head(5)

trip_id,instance_id,time
15202,15202,2016-02-02
15203,15203,2016-02-02
15204,15204,2016-02-02
15205,15205,2016-02-02
15206,15206,2016-02-02
