# Feature Selection - Automatic Methods with Continuous Target

Automated feature selection is often preferable to manual feature selection because each decision to keep or drop a feature is backed by a statistical measure.

This notebook will focus on automated feature selection methods when faced with continuous target variables:

1. Continuous target with continuous features: using Pearson correlation coefficients
2. Continuous target with categorical features: using the ANOVA F-score

### Import Basic Packages & Data

In [None]:
# Data manipulation libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Import data to a pandas dataframe
df_cars = pd.read_csv('indian cars dataset nonulls.csv')
df_cars

# Our dataset contains information about cars that are for sale, and the target variable is the CONTINUOUS starting_price.

In [None]:
# Create variable to separate target from rest of dataframe
target_variable = df_cars['starting_price']
target_variable

### Selecting or Removing Continuous Features Using Pearson Correlation Coefficients

These first methods help us select or remove continuous features, so our first step is to select only numeric features. This should NOT include features that have been One Hot Encoded.

In [None]:
# Define data frame of numeric columns
df_num = df_cars.select_dtypes(include=np.number).drop(['starting_price'], axis = 1)
df_num.head()

When looking at continuous features and continuous target variables, we can use correlation coefficients to see the strength of the linear relationship with the target variable.

In [None]:
# Calculate correlation with the target variable
corr_with_tgt = df_num.corrwith(target_variable).sort_values(ascending = False)
corr_with_tgt

With these correlation coefficients, we can create a visual to easily see the magnitudes.

In [None]:
# Plot a bar plot of the correlation coefficients
sns.barplot(x = corr_with_tgt.values, y = corr_with_tgt.index)

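Based on the bar chart, we can consider removing the features whose correlation with our target variable, `starting_price`, is closest to zero. Note that a strong negative correlation is just as informative as a strong positive one, so it is the magnitude that matters, not the sign.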
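As a rough illustration of ranking by magnitude, the sketch below uses synthetic data rather than the cars dataset. The column names (`feat_pos`, `feat_neg`, `feat_noise`) and the 0.2 cut-off are assumptions made up for this example, not recommended values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names are hypothetical, not from the cars dataset
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "feat_pos": rng.normal(size=n),
    "feat_neg": rng.normal(size=n),
    "feat_noise": rng.normal(size=n),
})
# The target depends positively on feat_pos, negatively on feat_neg, and not on feat_noise
demo_target = 3 * demo["feat_pos"] - 2 * demo["feat_neg"] + rng.normal(scale=0.5, size=n)

# Rank by absolute value so strong negative correlations are not mistaken for weak ones
abs_corr = demo.corrwith(demo_target).abs().sort_values(ascending=False)

# Hypothetical cut-off for this example; tune it for your own data
min_abs_corr = 0.2
weak_features = abs_corr[abs_corr < min_abs_corr].index.tolist()

print(abs_corr)
print("Candidates to drop:", weak_features)
```

Sorting the raw coefficients, as in the cell above, places strongly negative features at the bottom of the list next to nothing in particular, so taking `.abs()` first makes the weakest features easier to spot.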

Additionally, we can look for multicollinearity between our features using the `corr` method and a heatmap. If two features are highly correlated with each other, they carry largely the same information, so we can consider dropping one of them.

As a rule of thumb, an absolute correlation of about 0.8 or higher should make us seriously consider dropping one of the two features involved.

In [None]:
# Plot a heatmap of correlation coefficients
plt.rcParams['figure.figsize']=(10,7)
sns.heatmap(df_num.corr(), annot=True)

Based on the heatmap, we can consider dropping one feature from each highly correlated pair to reduce multicollinearity in our data.
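With many features, reading pairs off a heatmap gets tedious. One possible way to list them programmatically is sketched below on synthetic data; the column names are hypothetical and the 0.8 threshold simply mirrors the rule of thumb mentioned above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; feat_b is deliberately a near-duplicate of feat_a
rng = np.random.default_rng(0)
n = 1000
base = rng.normal(size=n)
demo = pd.DataFrame({
    "feat_a": base,
    "feat_b": base + rng.normal(scale=0.1, size=n),
    "feat_c": rng.normal(size=n),
})

corr_matrix = demo.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is excluded
upper_mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.where(upper_mask)

# Flatten to (feature, feature) pairs and keep those above the threshold
threshold = 0.8
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]

print(high_pairs)
```

Which feature of a flagged pair to drop is still a judgment call; one option is to keep the one with the stronger correlation with the target.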

### Selecting or Removing Categorical Features Using ANOVA F-score

The ANOVA method helps us select or remove **categorical** features, **including** features that have been One Hot Encoded.

In [None]:
# Define data frame of categorical columns
df_cat = df_cars.select_dtypes('object')
df_cat.head()

The ANOVA test is a statistical test that compares the variance between groups with the variance within groups. It first calculates the mean of the continuous target variable for each category in the categorical column. It then tests whether any of these means differ from each other by a statistically significant amount.
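To make the idea concrete, here is a minimal sketch of a one-way ANOVA on synthetic data using `scipy.stats.f_oneway`. This is an illustration of the test itself, not the method used in the rest of this notebook, and the `body_type` and `price` columns are made up for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic stand-in data; column names and group means are hypothetical
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "body_type": np.repeat(["hatchback", "sedan", "suv"], 100),
    "price": np.concatenate([
        rng.normal(loc=5, scale=1, size=100),
        rng.normal(loc=8, scale=1, size=100),
        rng.normal(loc=12, scale=1, size=100),
    ]),
})

# Step 1: mean of the continuous target within each category
group_means = demo.groupby("body_type")["price"].mean()
print(group_means)

# Step 2: one-way ANOVA, passing one array of target values per category
groups = [g["price"].to_numpy() for _, g in demo.groupby("body_type")]
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.1f}, p = {p_value:.3g}")
```

A small p-value suggests that at least one group mean differs from the others. Note that `f_regression` applied to one-hot encoded columns, as in the cells below, scores each category level against all remaining rows rather than running a single test per original categorical column.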

In [None]:
# One-hot encode the categorical variables (in a real-world scenario we would use OneHotEncoder, as per the FE chapter)
df_cat_enc = pd.get_dummies(df_cat)
df_cat_enc

In [None]:
# Import SelectKBest and f_regression from scikit-learn
from sklearn.feature_selection import SelectKBest, f_regression

In [None]:
# Define the x and y datasets
x = df_cat_enc
y = target_variable
num_features = len(df_cat_enc.columns)

# Fit the selector; k=num_features scores every column instead of relying on the default k=10
f_test = SelectKBest(score_func=f_regression, k=num_features).fit(x, y)

# define the f test output results
f_output = pd.DataFrame()
f_output['feature'] = df_cat_enc.columns
f_output['f_score'] = f_test.scores_
f_output['p_value'] = f_test.pvalues_

f_output = f_output.sort_values(by=['p_value'])
# Print the test results
print(f_output)

It is possible to use the `transform` method of `SelectKBest` to return a reduced set of features. For now, however, we will assume that features are removed manually.
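For reference, a minimal sketch of the automated route is shown below on synthetic data. The `dummy_*` column names and the choice of `k=2` are assumptions for illustration only. It uses `get_support` rather than `transform` so that the column names are preserved.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for a one-hot encoded frame; column names are hypothetical
rng = np.random.default_rng(0)
n = 1000
X_demo = pd.DataFrame(
    rng.integers(0, 2, size=(n, 5)),
    columns=[f"dummy_{i}" for i in range(5)],
)
# Only dummy_0 and dummy_1 influence the target in this example
y_demo = 4 * X_demo["dummy_0"] + 2 * X_demo["dummy_1"] + rng.normal(size=n)

# Keep the k highest-scoring columns (k=2 is an arbitrary choice here)
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns
kept_columns = X_demo.columns[selector.get_support()]
X_reduced = X_demo[kept_columns]

print(list(kept_columns))
```

Calling `selector.transform(X_demo)` would give the same reduced data, by default as a NumPy array without column labels.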

In [None]:
# plot the scores
sns.barplot(data = f_output, x = 'f_score', y = 'feature')

We can see from the bar chart above that only a handful of encoded columns have a high F-score, but between them they cover all of our original categorical features.