# Feature Selection

Featuretools lets users remove features that are unlikely to be useful in building an effective machine learning model. Reducing the number of features in the feature matrix can both improve the model's results and reduce the computational cost of prediction.

Featuretools enables users to perform feature selection on the results of Deep Feature Synthesis with three functions:

- `ft.selection.remove_highly_null_features`
- `ft.selection.remove_single_value_features`
- `ft.selection.remove_highly_correlated_features`

Each function takes a calculated feature matrix and, optionally, the list of Feature definitions. If only a feature matrix is passed in, only a feature matrix is returned. If both the matrix and the Feature definitions are passed in, feature selection returns both the matrix and the Feature definitions.
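The return convention described above can be sketched with a small stand-in function. This is only an illustration: `select_columns` is a hypothetical helper, not part of Featuretools, and plain column names stand in for the Feature definitions.

```python
import pandas as pd


def select_columns(feature_matrix, keep, features=None):
    """Hypothetical helper that mimics the return convention only."""
    new_matrix = feature_matrix[keep]
    if features is None:
        # Only a matrix came in, so only a matrix goes out.
        return new_matrix
    # Matrix and definitions came in, so both go out, filtered alike.
    new_features = [f for f in features if f in keep]
    return new_matrix, new_features


toy = pd.DataFrame({"a": [1, 2], "b": [None, None]})

only_matrix = select_columns(toy, keep=["a"])
matrix_and_defs = select_columns(toy, keep=["a"], features=["a", "b"])
```

Unpacking the second call as `new_fm, new_features = ...` matches how the selection functions are used in the rest of this guide.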

We will describe each of these functions in depth, but first we must create an entity set with which we can run `ft.dfs`.

In [1]:
import pandas as pd
import featuretools as ft

from featuretools.selection import (
    remove_highly_correlated_features,
    remove_highly_null_features,
    remove_single_value_features,
)

from featuretools.primitives import NaturalLanguage
from featuretools.demo.flight import load_flight

es = load_flight(nrows=50)
es

Downloading data ...


Entityset: Flight Data
  Entities:
    trip_logs [Rows: 50, Columns: 21]
    flights [Rows: 6, Columns: 9]
    airlines [Rows: 1, Columns: 1]
    airports [Rows: 4, Columns: 3]
  Relationships:
    trip_logs.flight_id -> flights.flight_id
    flights.carrier -> airlines.carrier
    flights.dest -> airports.dest

## Remove Highly Null Features

We might have a dataset with columns that have many null values. Deep Feature Synthesis might build features off of those null columns, creating even more highly null features. In this case, we might want to remove any features whose proportion of null values exceeds a certain threshold. Below, we build a feature matrix with such a case:

In [27]:
es['trip_logs'].df[['trip_log_id', 'date_scheduled']].sort_values('trip_log_id', ascending=False)

Unnamed: 0,trip_log_id,date_scheduled
49,49,2016-09-06
48,48,2016-09-05
47,47,2016-09-04
46,46,2016-09-03
45,45,2016-09-10
44,44,2016-09-09
43,43,2016-09-08
42,42,2016-09-07
41,41,2016-09-06
40,40,2016-09-05


In [31]:
fm, features = ft.dfs(
    entityset=es,
    target_entity="trip_logs",
    cutoff_time=pd.DataFrame({
        'trip_log_id': [30, 1, 2, 3, 4],
        'time': pd.to_datetime(['2016-09-22 00:00:00'] * 5),
    }),
    trans_primitives=[],
    agg_primitives=[],
    max_depth=2,
)
fm

trip_log_id,flight_id,dep_delay,taxi_out,taxi_in,arr_delay,scheduled_elapsed_time,air_time,distance,carrier_delay,weather_delay,national_airspace_delay,security_delay,late_aircraft_delay,canceled,diverted,flights.origin,flights.origin_city,flights.origin_state,flights.dest,flights.distance_group,flights.carrier,flights.flight_num,flights.airports.dest_city,flights.airports.dest_state
30,AA-494:RSW->CLT,,,,,6660000000000.0,,600.0,,,,,,,,RSW,"Fort Myers, FL",FL,CLT,3.0,AA,494.0,"Charlotte, NC",NC
1,AA-494:CLT->PHX,,,,,16620000000000.0,,1773.0,,,,,,,,CLT,"Charlotte, NC",NC,PHX,8.0,AA,494.0,"Phoenix, AZ",AZ
2,AA-494:CLT->PHX,,,,,16620000000000.0,,1773.0,,,,,,,,CLT,"Charlotte, NC",NC,PHX,8.0,AA,494.0,"Phoenix, AZ",AZ
3,AA-494:CLT->PHX,,,,,16620000000000.0,,1773.0,,,,,,,,CLT,"Charlotte, NC",NC,PHX,8.0,AA,494.0,"Phoenix, AZ",AZ
4,,,,,,,,,,,,,,,,,,,,,,,,


Looking at the feature matrix above, we decide to remove the highly null features:

In [3]:
ft.selection.remove_highly_null_features(fm)

trip_log_id,flight_id,scheduled_elapsed_time,distance,flights.origin,flights.origin_city,flights.origin_state,flights.dest,flights.distance_group,flights.carrier,flights.flight_num,flights.airports.dest_city,flights.airports.dest_state
30,AA-494:RSW->CLT,6660000000000.0,600.0,RSW,"Fort Myers, FL",FL,CLT,3.0,AA,494.0,"Charlotte, NC",NC
1,AA-494:CLT->PHX,16620000000000.0,1773.0,CLT,"Charlotte, NC",NC,PHX,8.0,AA,494.0,"Phoenix, AZ",AZ
2,AA-494:CLT->PHX,16620000000000.0,1773.0,CLT,"Charlotte, NC",NC,PHX,8.0,AA,494.0,"Phoenix, AZ",AZ
3,AA-494:CLT->PHX,16620000000000.0,1773.0,CLT,"Charlotte, NC",NC,PHX,8.0,AA,494.0,"Phoenix, AZ",AZ
4,,,,,,,,,,,,


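Notice that calling `remove_highly_null_features` didn't remove every feature that contains a null value. By default, only features whose percentage of null values in the calculated feature matrix reaches 95% are removed. To lower that threshold, we can set the `pct_null_threshold` parameter ourselves. With a threshold of 0.2, every feature in this matrix is removed and only the index remains, because each feature is null for `trip_log_id` 4, which is one of the five rows.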

In [4]:
remove_highly_null_features(fm, pct_null_threshold=.2)

30
1
2
3
4
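The thresholding rule can be sketched in plain pandas. This is a simplified illustration, not the library's code: `drop_highly_null` is a hypothetical name, and it assumes a feature is dropped once its null fraction reaches the threshold, which is what the empty result above suggests.

```python
import numpy as np
import pandas as pd


def drop_highly_null(df, pct_null_threshold=0.95):
    """Keep columns whose fraction of nulls is below the threshold."""
    null_fraction = df.isnull().mean()  # per-column share of null values
    return df.loc[:, null_fraction < pct_null_threshold]


# Toy matrix shaped like the example: one row is entirely null.
toy = pd.DataFrame(
    {
        "distance": [600.0, 1773.0, 1773.0, 1773.0, np.nan],  # 20% null
        "air_time": [np.nan] * 5,  # 100% null
    },
    index=pd.Index([30, 1, 2, 3, 4], name="trip_log_id"),
)

default_result = drop_highly_null(toy)       # drops only air_time
strict_result = drop_highly_null(toy, 0.2)   # drops every column
```

In this toy matrix, the stricter threshold leaves a frame with five index rows and no columns, the same shape of result as the output above.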


## Remove Single Value Features

Another situation we might run into is one where our calculated features don't have any variance. In those cases, we are likely to want to remove the uninteresting features. For that, we use `remove_single_value_features`.

Let's see what happens when we remove the single value features of the feature matrix below.

In [5]:
fm

trip_log_id,flight_id,dep_delay,taxi_out,taxi_in,arr_delay,scheduled_elapsed_time,air_time,distance,carrier_delay,weather_delay,national_airspace_delay,security_delay,late_aircraft_delay,canceled,diverted,flights.origin,flights.origin_city,flights.origin_state,flights.dest,flights.distance_group,flights.carrier,flights.flight_num,flights.airports.dest_city,flights.airports.dest_state
30,AA-494:RSW->CLT,,,,,6660000000000.0,,600.0,,,,,,,,RSW,"Fort Myers, FL",FL,CLT,3.0,AA,494.0,"Charlotte, NC",NC
1,AA-494:CLT->PHX,,,,,16620000000000.0,,1773.0,,,,,,,,CLT,"Charlotte, NC",NC,PHX,8.0,AA,494.0,"Phoenix, AZ",AZ
2,AA-494:CLT->PHX,,,,,16620000000000.0,,1773.0,,,,,,,,CLT,"Charlotte, NC",NC,PHX,8.0,AA,494.0,"Phoenix, AZ",AZ
3,AA-494:CLT->PHX,,,,,16620000000000.0,,1773.0,,,,,,,,CLT,"Charlotte, NC",NC,PHX,8.0,AA,494.0,"Phoenix, AZ",AZ
4,,,,,,,,,,,,,,,,,,,,,,,,


In [6]:
new_fm, new_features = remove_single_value_features(fm, features=features)
new_fm

trip_log_id,flight_id,scheduled_elapsed_time,distance,flights.origin,flights.origin_city,flights.origin_state,flights.dest,flights.distance_group,flights.airports.dest_city,flights.airports.dest_state
30,AA-494:RSW->CLT,6660000000000.0,600.0,RSW,"Fort Myers, FL",FL,CLT,3.0,"Charlotte, NC",NC
1,AA-494:CLT->PHX,16620000000000.0,1773.0,CLT,"Charlotte, NC",NC,PHX,8.0,"Phoenix, AZ",AZ
2,AA-494:CLT->PHX,16620000000000.0,1773.0,CLT,"Charlotte, NC",NC,PHX,8.0,"Phoenix, AZ",AZ
3,AA-494:CLT->PHX,16620000000000.0,1773.0,CLT,"Charlotte, NC",NC,PHX,8.0,"Phoenix, AZ",AZ
4,,,,,,,,,,


The features that were removed are:

In [7]:
set(features) - set(new_features)

{<Feature: air_time>,
 <Feature: arr_delay>,
 <Feature: canceled>,
 <Feature: carrier_delay>,
 <Feature: dep_delay>,
 <Feature: diverted>,
 <Feature: flights.carrier>,
 <Feature: flights.flight_num>,
 <Feature: late_aircraft_delay>,
 <Feature: national_airspace_delay>,
 <Feature: security_delay>,
 <Feature: taxi_in>,
 <Feature: taxi_out>,
 <Feature: weather_delay>}

Used as above, the function ignores null values when counting a feature's unique values. If we'd like to treat `NaN` as a value of its own, we can set `count_nan_as_value` to `True`, and `flights.carrier` and `flights.flight_num` will be back in the matrix.

In [8]:
new_fm, new_features = remove_single_value_features(fm, features=features, count_nan_as_value=True)
new_fm

trip_log_id,flight_id,scheduled_elapsed_time,distance,flights.origin,flights.origin_city,flights.origin_state,flights.dest,flights.distance_group,flights.carrier,flights.flight_num,flights.airports.dest_city,flights.airports.dest_state
30,AA-494:RSW->CLT,6660000000000.0,600.0,RSW,"Fort Myers, FL",FL,CLT,3.0,AA,494.0,"Charlotte, NC",NC
1,AA-494:CLT->PHX,16620000000000.0,1773.0,CLT,"Charlotte, NC",NC,PHX,8.0,AA,494.0,"Phoenix, AZ",AZ
2,AA-494:CLT->PHX,16620000000000.0,1773.0,CLT,"Charlotte, NC",NC,PHX,8.0,AA,494.0,"Phoenix, AZ",AZ
3,AA-494:CLT->PHX,16620000000000.0,1773.0,CLT,"Charlotte, NC",NC,PHX,8.0,AA,494.0,"Phoenix, AZ",AZ
4,,,,,,,,,,,,


The features that were removed are:

In [9]:
set(features) - set(new_features)

{<Feature: air_time>,
 <Feature: arr_delay>,
 <Feature: canceled>,
 <Feature: carrier_delay>,
 <Feature: dep_delay>,
 <Feature: diverted>,
 <Feature: late_aircraft_delay>,
 <Feature: national_airspace_delay>,
 <Feature: security_delay>,
 <Feature: taxi_in>,
 <Feature: taxi_out>,
 <Feature: weather_delay>}
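The effect of `count_nan_as_value` can be reproduced on a toy frame with `DataFrame.nunique`. This is a minimal sketch under the assumption that a feature is kept when it has more than one unique value. `drop_single_value` is a hypothetical name and not the library function.

```python
import numpy as np
import pandas as pd


def drop_single_value(df, count_nan_as_value=False):
    """Keep columns that have more than one unique value."""
    # dropna=True ignores nulls; dropna=False counts NaN as its own value.
    n_unique = df.nunique(dropna=not count_nan_as_value)
    return df.loc[:, n_unique > 1]


toy = pd.DataFrame(
    {
        "flights.carrier": ["AA", "AA", "AA", "AA", None],     # one value plus a null
        "distance": [600.0, 1773.0, 1773.0, 1773.0, np.nan],   # two values plus a null
        "air_time": [np.nan] * 5,                              # nothing but nulls
    }
)

ignoring_nan = drop_single_value(toy)
counting_nan = drop_single_value(toy, count_nan_as_value=True)
```

As in the outputs above, the carrier column only survives when `NaN` counts as a value, while the all-null column is dropped either way.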

## Remove Highly Correlated Features

The last feature selection function we have allows us to remove features that would likely be redundant to the model we're attempting to build by considering the correlation between pairs of calculated features.

When two features are determined to be highly correlated, we remove the more complex of the two. For example, say we have two features: `col` and `-(col)`.

We can see that `-(col)` is just the negation of `col`, and so we can guess those features are going to be highly correlated. `-(col)` has the `Negate` primitive applied to it, so it is more complex than the identity feature `col`. Therefore, if we only want one of `col` and `-(col)`, we should keep the identity feature. For features that don't have an obvious difference in complexity, we discard the feature that comes later in the feature matrix.
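The `col` and `-(col)` example can be tried on a toy frame. The sketch below is a simplification, not the library's implementation: `drop_later_of_correlated_pairs` is a hypothetical helper that uses column order as a stand-in for complexity and drops any numeric column that is highly correlated with a column to its left.

```python
import pandas as pd


def drop_later_of_correlated_pairs(df, threshold=0.95):
    """Drop each numeric column that is highly correlated with an earlier one."""
    numeric = df.select_dtypes("number")
    corr = numeric.corr().abs()  # absolute pairwise Pearson correlation
    names = list(numeric.columns)
    dropped = [
        name
        for i, name in enumerate(names)
        # Compare column i only against the columns that come before it.
        if i > 0 and (corr.iloc[i, :i] >= threshold).any()
    ]
    return df.drop(columns=dropped)


toy = pd.DataFrame(
    {
        "col": [1.0, 2.0, 3.0, 4.0],
        "-(col)": [-1.0, -2.0, -3.0, -4.0],  # perfectly (negatively) correlated
        "other": [4.0, 1.0, 3.0, 2.0],       # only weakly correlated with col
    }
)

result = drop_later_of_correlated_pairs(toy)
```

Here `-(col)` is removed because it comes after `col`, while `other` stays because its correlation with `col` is well below the threshold.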

Let's try this out on our data:

In [10]:
fm, features = ft.dfs(
    entityset=es,
    target_entity="trip_logs",
    trans_primitives=['negate'],
    agg_primitives=[],
    max_depth=3,
)
fm.head()

trip_log_id,flight_id,dep_delay,taxi_out,taxi_in,arr_delay,scheduled_elapsed_time,air_time,distance,carrier_delay,weather_delay,national_airspace_delay,security_delay,late_aircraft_delay,canceled,diverted,-(air_time),-(arr_delay),-(carrier_delay),-(dep_delay),-(distance),-(late_aircraft_delay),-(national_airspace_delay),-(scheduled_elapsed_time),-(security_delay),-(taxi_in),-(taxi_out),-(weather_delay),flights.origin,flights.origin_city,flights.origin_state,flights.dest,flights.distance_group,flights.carrier,flights.flight_num,flights.airports.dest_city,flights.airports.dest_state
30,AA-494:RSW->CLT,-11.0,12.0,10.0,-12.0,6660000000000,88.0,600.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-88.0,12.0,-0.0,11.0,-600.0,-0.0,-0.0,-6660000000000,-0.0,-10.0,-12.0,-0.0,RSW,"Fort Myers, FL",FL,CLT,3,AA,494,"Charlotte, NC",NC
38,AA-495:ATL->PHX,-6.0,28.0,5.0,1.0,15000000000000,224.0,1587.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-224.0,-1.0,-0.0,6.0,-1587.0,-0.0,-0.0,-15000000000000,-0.0,-5.0,-28.0,-0.0,ATL,"Atlanta, GA",GA,PHX,7,AA,495,"Phoenix, AZ",AZ
46,AA-495:CLT->ATL,-2.0,18.0,8.0,-3.0,4620000000000,50.0,226.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-50.0,3.0,-0.0,2.0,-226.0,-0.0,-0.0,-4620000000000,-0.0,-8.0,-18.0,-0.0,CLT,"Charlotte, NC",NC,ATL,1,AA,495,"Atlanta, GA",GA
31,AA-494:RSW->CLT,0.0,11.0,10.0,-3.0,6660000000000,87.0,600.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-87.0,3.0,-0.0,-0.0,-600.0,-0.0,-0.0,-6660000000000,-0.0,-10.0,-11.0,-0.0,RSW,"Fort Myers, FL",FL,CLT,3,AA,494,"Charlotte, NC",NC
39,AA-495:ATL->PHX,-4.0,26.0,3.0,10.0,15000000000000,235.0,1587.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-235.0,-10.0,-0.0,4.0,-1587.0,-0.0,-0.0,-15000000000000,-0.0,-3.0,-26.0,-0.0,ATL,"Atlanta, GA",GA,PHX,7,AA,495,"Phoenix, AZ",AZ


Note that there are some clear correlations here between the features and their negations.

By default, `remove_highly_correlated_features` uses a correlation threshold of 0.95. Applying it here removes the obviously correlated features and leaves just the less complex ones.

In [11]:
new_fm, new_features = remove_highly_correlated_features(fm, features=features)
new_fm.head()

trip_log_id,flight_id,dep_delay,taxi_out,taxi_in,arr_delay,scheduled_elapsed_time,carrier_delay,weather_delay,national_airspace_delay,security_delay,late_aircraft_delay,canceled,diverted,-(security_delay),-(weather_delay),flights.origin,flights.origin_city,flights.origin_state,flights.dest,flights.carrier,flights.flight_num,flights.airports.dest_city,flights.airports.dest_state
30,AA-494:RSW->CLT,-11.0,12.0,10.0,-12.0,6660000000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,RSW,"Fort Myers, FL",FL,CLT,AA,494,"Charlotte, NC",NC
38,AA-495:ATL->PHX,-6.0,28.0,5.0,1.0,15000000000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,ATL,"Atlanta, GA",GA,PHX,AA,495,"Phoenix, AZ",AZ
46,AA-495:CLT->ATL,-2.0,18.0,8.0,-3.0,4620000000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,CLT,"Charlotte, NC",NC,ATL,AA,495,"Atlanta, GA",GA
31,AA-494:RSW->CLT,0.0,11.0,10.0,-3.0,6660000000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,RSW,"Fort Myers, FL",FL,CLT,AA,494,"Charlotte, NC",NC
39,AA-495:ATL->PHX,-4.0,26.0,3.0,10.0,15000000000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,ATL,"Atlanta, GA",GA,PHX,AA,495,"Phoenix, AZ",AZ


The features that were removed are:

In [12]:
set(features) - set(new_features)

{<Feature: -(air_time)>,
 <Feature: -(arr_delay)>,
 <Feature: -(carrier_delay)>,
 <Feature: -(dep_delay)>,
 <Feature: -(distance)>,
 <Feature: -(late_aircraft_delay)>,
 <Feature: -(national_airspace_delay)>,
 <Feature: -(scheduled_elapsed_time)>,
 <Feature: -(taxi_in)>,
 <Feature: -(taxi_out)>,
 <Feature: air_time>,
 <Feature: distance>,
 <Feature: flights.distance_group>}

#### Change the Correlation Threshold

If we'd like to be more restrictive and remove more features, we can lower the correlation threshold with the `pct_corr_threshold` parameter.

In [13]:
new_fm, new_features = remove_highly_correlated_features(fm, features=features, pct_corr_threshold=.9)
new_fm.head()

trip_log_id,flight_id,dep_delay,taxi_out,taxi_in,arr_delay,scheduled_elapsed_time,carrier_delay,weather_delay,security_delay,late_aircraft_delay,canceled,diverted,-(security_delay),-(weather_delay),flights.origin,flights.origin_city,flights.origin_state,flights.dest,flights.carrier,flights.flight_num,flights.airports.dest_city,flights.airports.dest_state
30,AA-494:RSW->CLT,-11.0,12.0,10.0,-12.0,6660000000000,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,RSW,"Fort Myers, FL",FL,CLT,AA,494,"Charlotte, NC",NC
38,AA-495:ATL->PHX,-6.0,28.0,5.0,1.0,15000000000000,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,ATL,"Atlanta, GA",GA,PHX,AA,495,"Phoenix, AZ",AZ
46,AA-495:CLT->ATL,-2.0,18.0,8.0,-3.0,4620000000000,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,CLT,"Charlotte, NC",NC,ATL,AA,495,"Atlanta, GA",GA
31,AA-494:RSW->CLT,0.0,11.0,10.0,-3.0,6660000000000,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,RSW,"Fort Myers, FL",FL,CLT,AA,494,"Charlotte, NC",NC
39,AA-495:ATL->PHX,-4.0,26.0,3.0,10.0,15000000000000,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,ATL,"Atlanta, GA",GA,PHX,AA,495,"Phoenix, AZ",AZ


The features that were removed are:

In [14]:
set(features) - set(new_features)

{<Feature: -(air_time)>,
 <Feature: -(arr_delay)>,
 <Feature: -(carrier_delay)>,
 <Feature: -(dep_delay)>,
 <Feature: -(distance)>,
 <Feature: -(late_aircraft_delay)>,
 <Feature: -(national_airspace_delay)>,
 <Feature: -(scheduled_elapsed_time)>,
 <Feature: -(taxi_in)>,
 <Feature: -(taxi_out)>,
 <Feature: air_time>,
 <Feature: distance>,
 <Feature: flights.distance_group>,
 <Feature: national_airspace_delay>}
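To see why a lower threshold removes more features, consider a pair whose correlation falls between the two thresholds. The sketch below uses made-up data and assumes a pair counts as highly correlated when its absolute correlation is at or above the threshold.

```python
import pandas as pd

# Two made-up features that differ only in the order of their last two values.
x = pd.Series([1, 2, 3, 4, 5, 6])
y = pd.Series([1, 2, 3, 4, 6, 5])

r = abs(x.corr(y))  # Pearson correlation, roughly 0.943

flagged_at_default = bool(r >= 0.95)  # below the default threshold
flagged_at_lower = bool(r >= 0.90)    # above the lowered threshold
```

A pair like this is left alone at the default of 0.95 but is flagged once the threshold drops to 0.9, which is how `national_airspace_delay` could end up in the second list of removed features but not the first.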

#### Check a Subset of Features

If we only want to check a subset of features, we can set `features_to_check` to the list of features whose correlation we'd like to check, and no features outside of that list will be removed.

In [15]:
new_fm, new_features = remove_highly_correlated_features(fm, features=features, features_to_check=['air_time', 'distance', 'flights.distance_group'])
new_fm.head()

trip_log_id,flight_id,dep_delay,taxi_out,taxi_in,arr_delay,scheduled_elapsed_time,air_time,carrier_delay,weather_delay,national_airspace_delay,security_delay,late_aircraft_delay,canceled,diverted,-(air_time),-(arr_delay),-(carrier_delay),-(dep_delay),-(distance),-(late_aircraft_delay),-(national_airspace_delay),-(scheduled_elapsed_time),-(security_delay),-(taxi_in),-(taxi_out),-(weather_delay),flights.origin,flights.origin_city,flights.origin_state,flights.dest,flights.carrier,flights.flight_num,flights.airports.dest_city,flights.airports.dest_state
30,AA-494:RSW->CLT,-11.0,12.0,10.0,-12.0,6660000000000,88.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-88.0,12.0,-0.0,11.0,-600.0,-0.0,-0.0,-6660000000000,-0.0,-10.0,-12.0,-0.0,RSW,"Fort Myers, FL",FL,CLT,AA,494,"Charlotte, NC",NC
38,AA-495:ATL->PHX,-6.0,28.0,5.0,1.0,15000000000000,224.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-224.0,-1.0,-0.0,6.0,-1587.0,-0.0,-0.0,-15000000000000,-0.0,-5.0,-28.0,-0.0,ATL,"Atlanta, GA",GA,PHX,AA,495,"Phoenix, AZ",AZ
46,AA-495:CLT->ATL,-2.0,18.0,8.0,-3.0,4620000000000,50.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-50.0,3.0,-0.0,2.0,-226.0,-0.0,-0.0,-4620000000000,-0.0,-8.0,-18.0,-0.0,CLT,"Charlotte, NC",NC,ATL,AA,495,"Atlanta, GA",GA
31,AA-494:RSW->CLT,0.0,11.0,10.0,-3.0,6660000000000,87.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-87.0,3.0,-0.0,-0.0,-600.0,-0.0,-0.0,-6660000000000,-0.0,-10.0,-11.0,-0.0,RSW,"Fort Myers, FL",FL,CLT,AA,494,"Charlotte, NC",NC
39,AA-495:ATL->PHX,-4.0,26.0,3.0,10.0,15000000000000,235.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-235.0,-10.0,-0.0,4.0,-1587.0,-0.0,-0.0,-15000000000000,-0.0,-3.0,-26.0,-0.0,ATL,"Atlanta, GA",GA,PHX,AA,495,"Phoenix, AZ",AZ


The features that were removed are:

In [16]:
set(features) - set(new_features)

{<Feature: distance>, <Feature: flights.distance_group>}
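The same outcome can be reproduced by hand on a few columns from the first five rows of the feature matrix shown earlier. This is a simplified sketch: `drop_correlated_within` is a hypothetical helper that only compares the listed columns with each other, in the order given, and five rows are far fewer than the full matrix, so the correlations are only indicative.

```python
import pandas as pd

# Selected columns from the first five rows of the feature matrix above.
toy = pd.DataFrame(
    {
        "dep_delay": [-11.0, -6.0, -2.0, 0.0, -4.0],
        "air_time": [88.0, 224.0, 50.0, 87.0, 235.0],
        "distance": [600.0, 1587.0, 226.0, 600.0, 1587.0],
        "-(dep_delay)": [11.0, 6.0, 2.0, -0.0, 4.0],
        "flights.distance_group": [3, 7, 1, 3, 7],
    }
)


def drop_correlated_within(df, to_check, threshold=0.95):
    """Drop listed columns that are highly correlated with an earlier listed column."""
    corr = df[to_check].corr().abs()
    dropped = {
        name
        for i, name in enumerate(to_check)
        if i > 0 and (corr.iloc[i, :i] >= threshold).any()
    }
    return df.drop(columns=sorted(dropped)), dropped


new_toy, dropped = drop_correlated_within(
    toy, ["air_time", "distance", "flights.distance_group"]
)
```

`-(dep_delay)` is perfectly correlated with `dep_delay`, but it stays in the result because neither column is in the list being checked.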

#### Protect Features from Removal

To protect specific features from being removed from the feature matrix, we can pass a list of `features_to_keep`. These features will not be removed, even if they are highly correlated with other features:

In [17]:
new_fm, new_features = remove_highly_correlated_features(fm, features=features, features_to_keep=['air_time', 'distance', 'flights.distance_group'])
new_fm.head()

trip_log_id,flight_id,dep_delay,taxi_out,taxi_in,arr_delay,scheduled_elapsed_time,air_time,distance,carrier_delay,weather_delay,national_airspace_delay,security_delay,late_aircraft_delay,canceled,diverted,-(security_delay),-(weather_delay),flights.origin,flights.origin_city,flights.origin_state,flights.dest,flights.distance_group,flights.carrier,flights.flight_num,flights.airports.dest_city,flights.airports.dest_state
30,AA-494:RSW->CLT,-11.0,12.0,10.0,-12.0,6660000000000,88.0,600.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,RSW,"Fort Myers, FL",FL,CLT,3,AA,494,"Charlotte, NC",NC
38,AA-495:ATL->PHX,-6.0,28.0,5.0,1.0,15000000000000,224.0,1587.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,ATL,"Atlanta, GA",GA,PHX,7,AA,495,"Phoenix, AZ",AZ
46,AA-495:CLT->ATL,-2.0,18.0,8.0,-3.0,4620000000000,50.0,226.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,CLT,"Charlotte, NC",NC,ATL,1,AA,495,"Atlanta, GA",GA
31,AA-494:RSW->CLT,0.0,11.0,10.0,-3.0,6660000000000,87.0,600.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,RSW,"Fort Myers, FL",FL,CLT,3,AA,494,"Charlotte, NC",NC
39,AA-495:ATL->PHX,-4.0,26.0,3.0,10.0,15000000000000,235.0,1587.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,ATL,"Atlanta, GA",GA,PHX,7,AA,495,"Phoenix, AZ",AZ


The features that were removed are:

In [18]:
set(features) - set(new_features)

{<Feature: -(taxi_out)>,
 <Feature: -(scheduled_elapsed_time)>,
 <Feature: -(carrier_delay)>,
 <Feature: -(dep_delay)>,
 <Feature: -(air_time)>,
 <Feature: -(taxi_in)>,
 <Feature: -(late_aircraft_delay)>,
 <Feature: -(arr_delay)>,
 <Feature: -(distance)>,
 <Feature: -(national_airspace_delay)>}
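One way to read this result is that protection acts as a filter on the set of features that would otherwise be removed. The sketch below uses the feature names from the outputs in this guide. `apply_keep_list` is a hypothetical helper, and this is a mental model rather than a description of the library's internals.

```python
def apply_keep_list(flagged, features_to_keep):
    """Return the flagged names that are not protected."""
    return set(flagged) - set(features_to_keep)


# Names removed by the default call earlier in this guide.
flagged = {
    "-(air_time)", "-(arr_delay)", "-(carrier_delay)", "-(dep_delay)",
    "-(distance)", "-(late_aircraft_delay)", "-(national_airspace_delay)",
    "-(scheduled_elapsed_time)", "-(taxi_in)", "-(taxi_out)",
    "air_time", "distance", "flights.distance_group",
}
protected = ["air_time", "distance", "flights.distance_group"]

still_removed = apply_keep_list(flagged, protected)
```

The ten names left in `still_removed` are the negated features listed above, while the three protected features stay in the matrix.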