In [Machine Learning Part 3: How to choose best multiple linear model]({% post_url 2019-02-16-How-to-choose-best-model %}) we explained how to choose the best model by trying all possible combinations of features or by using stepwise feature selection (Exercise 2). This is often impossible or very expensive, for example because the number of features is large, or because the algorithm is complicated and trying every combination takes too long. There is a faster approach that often leads to good results, and it is based on correlation.

## Forward selection using correlation

The idea behind this approach is very simple. First we choose the variables that are most correlated with the target variable, since these are the ones most likely to influence the model. Then we additionally remove variables that are strongly correlated with each other, because they carry largely the same information. This should remove noise from the model.
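
The two steps above can be sketched as a small helper. This is only an illustration under assumptions: the function name `select_by_correlation`, the thresholds 0.3 and 0.95, and the toy data are made up for this sketch and are not the exact procedure followed in the rest of the post.

```python
import numpy as np
import pandas as pd


def select_by_correlation(df, target, min_target_corr=0.3, max_pair_corr=0.95):
    """Hypothetical helper: keep features correlated with the target,
    then drop the weaker member of every strongly correlated pair."""
    corr = df.corr().abs()
    target_corr = corr[target].drop(target)

    # Step 1: keep features whose |correlation with the target| is high
    # enough, ordered from strongest to weakest.
    candidates = target_corr[target_corr >= min_target_corr]
    candidates = candidates.sort_values(ascending=False).index.tolist()

    # Step 2: walk from strongest to weakest and skip any feature that is
    # too correlated with one we have already kept.
    selected = []
    for name in candidates:
        if all(corr.loc[name, kept] < max_pair_corr for kept in selected):
            selected.append(name)
    return selected


# Toy data (made up): b is almost a copy of a, c is pure noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
toy["target"] = a + rng.normal(scale=0.5, size=200)

selected = select_by_correlation(toy, "target")
print(selected)
```

Because `a` and `b` are nearly identical, only one of them should survive the second step, and the noise column `c` should not pass the first.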

Let's see it with an example. First we read the data and add the target column to the rest of the data. Then we calculate the correlation matrix.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score, recall_score, precision_score
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

X = data.data
y = data.target == 0  # in this dataset 0 means malignant, so y is True for malignant tumors
labels = data.feature_names.tolist()

# Put the features and the target into one DataFrame
Xy = pd.DataFrame(X.copy(), columns=labels)
Xy['target'] = y.astype(int)
labels.append('target')

# Compute the correlation matrix
corr_matrix = Xy.corr()
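
By default `DataFrame.corr()` computes the Pearson correlation coefficient. As a minimal sketch on made-up numbers (the column names `x` and `y` are placeholders), the same value can be computed by hand:

```python
import numpy as np
import pandas as pd

# Made-up toy data, only for illustration.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "y": [2.0, 1.0, 4.0, 3.0, 5.0]})

# Pearson correlation: covariance divided by the product of the
# standard deviations (the normalising constants cancel out).
dx = df["x"] - df["x"].mean()
dy = df["y"] - df["y"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
# Here: 8 / sqrt(10 * 10) = 0.8

# The same coefficient as computed by pandas.
r_pandas = df.corr().loc["x", "y"]
print(r_manual, r_pandas)
```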

Let's plot it to get an idea of what we are dealing with. For that we use a very good package called `plotly`, which lets us interact with the plot by hovering over it.

In [7]:
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode(connected=True)

trace = go.Heatmap(z=corr_matrix)
data=[trace]
iplot(data, filename='basic-heatmap')

The last column corresponds to the target. Let's have a closer look at it.

In [65]:
corr_matrix['target']

mean radius                0.730029
mean texture               0.415185
mean perimeter             0.742636
mean area                  0.708984
mean smoothness            0.358560
mean compactness           0.596534
mean concavity             0.696360
mean concave points        0.776614
mean symmetry              0.330499
mean fractal dimension    -0.012838
radius error               0.567134
texture error             -0.008303
perimeter error            0.556141
area error                 0.548236
smoothness error          -0.067016
compactness error          0.292999
concavity error            0.253730
concave points error       0.408042
symmetry error            -0.006522
fractal dimension error    0.077972
worst radius               0.776454
worst texture              0.456903
worst perimeter            0.782914
worst area                 0.733825
worst smoothness           0.421465
worst compactness          0.590998
worst concavity            0.659610
worst concave points       0

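Now let's keep only the features whose correlation with the target has an absolute value of at least 0.3.

In [164]:
# Correlation of every feature with the target (the target itself is dropped)
target_corr = corr_matrix['target'].drop('target')
correlated_with_target = target_corr[target_corr.abs() >= 0.3].index.tolist()
correlated_with_target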

['mean radius',
 'mean texture',
 'mean perimeter',
 'mean area',
 'mean smoothness',
 'mean compactness',
 'mean concavity',
 'mean concave points',
 'mean symmetry',
 'radius error',
 'perimeter error',
 'area error',
 'concave points error',
 'worst radius',
 'worst texture',
 'worst perimeter',
 'worst area',
 'worst smoothness',
 'worst compactness',
 'worst concavity',
 'worst concave points',
 'worst symmetry',
 'worst fractal dimension']
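
A threshold such as 0.3 is a judgment call. One way to see where the cut falls is to sort the features by absolute correlation with the target. The sketch below uses a small hand-copied subset of the values printed above so that it runs on its own; in the notebook the same calls could be applied to the full `target` column of the correlation matrix.

```python
import pandas as pd

# A few values copied from the output above (not the full column).
target_corr = pd.Series({
    "mean radius": 0.730029,
    "mean texture": 0.415185,
    "mean fractal dimension": -0.012838,
    "texture error": -0.008303,
    "smoothness error": -0.067016,
    "compactness error": 0.292999,
    "worst perimeter": 0.782914,
})

# Sort by absolute value so that strong negative correlations
# would rank high as well.
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)

# Apply the same 0.3 cut as in the text.
kept = ranked[ranked >= 0.3].index.tolist()
print(kept)
```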

So let's build a model and see how it performs. But first let's check how a model based on only the first column performs.

In [167]:
X = Xy.drop("target", axis=1)

In [168]:
X_train_dev, X_test, y_train_dev, y_test = train_test_split(X, y, random_state=666, test_size=0.2)
X_train, X_dev, y_train, y_dev = train_test_split(X_train_dev, y_train_dev, random_state=667, test_size=0.25)

In [169]:
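def train_and_test_model(columns):
    """Fit a logistic regression on the given columns and print the test AUC."""
    clf = LogisticRegression(solver='liblinear')
    clf.fit(X_train[columns], y_train)
    # Note: the AUC here is computed from hard class predictions, not probabilities
    y_test_hat = clf.predict(X_test[columns])
    print("AUC: ", roc_auc_score(y_test, y_test_hat))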

train_and_test_model(['mean radius'])

AUC:  0.9092071611253196


In [170]:
train_and_test_model(correlated_with_target)

AUC:  0.948849104859335
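
One caveat about the numbers above: `train_and_test_model` passes hard class predictions to `roc_auc_score`. The AUC is more commonly computed from scores or probabilities, and the two can differ noticeably. A minimal sketch on made-up labels and scores (not taken from the models in this post):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.6, 0.7, 0.4, 0.9, 0.2])

# Hard predictions obtained by thresholding the scores at 0.5.
hard = (scores >= 0.5).astype(int)

# AUC from the raw scores uses the full ranking of the examples.
auc_from_scores = roc_auc_score(y_true, scores)
# AUC from hard labels only sees two distinct score values.
auc_from_labels = roc_auc_score(y_true, hard)

print(auc_from_scores, auc_from_labels)
```

With a fitted classifier, the scores would come from `clf.predict_proba(X_test[columns])[:, 1]` instead of `clf.predict(...)`.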


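Next, let's recompute the correlation matrix for the selected features only and plot it to see which of them are strongly correlated with each other.

In [172]: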
corr_submatrix = Xy[correlated_with_target+["target"]].corr()

trace = go.Heatmap(z=corr_submatrix)
data=[trace]
iplot(data, filename='basic-heatmap')

The heatmap shows that variable 0 (`mean radius`) is strongly correlated with variable 2 (`mean perimeter`), with a correlation above 0.99. Since `mean perimeter` is more correlated with the target (0.743 versus 0.730), let's remove column 0 and retrain the model.

In [173]:
best = correlated_with_target[:]  # copy, so the original list stays intact
best.pop(0)  # drop 'mean radius' (variable 0)

train_and_test_model(best)

AUC:  0.9562020460358057
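
Reading pairs off a heatmap gets tedious when there are many features. As a rough sketch, strongly correlated pairs can also be listed programmatically from the upper triangle of the correlation matrix. The helper name `strongly_correlated_pairs`, the 0.95 threshold, and the toy columns are assumptions of this example.

```python
import numpy as np
import pandas as pd


def strongly_correlated_pairs(df, threshold=0.95):
    """Hypothetical helper: return (feature_a, feature_b, corr) for every
    pair of columns whose absolute correlation exceeds the threshold."""
    corr = df.corr()
    cols = corr.columns
    # k=1 selects the strict upper triangle, so each pair appears once
    # and the diagonal (self-correlation of 1.0) is skipped.
    rows, columns = np.triu_indices(len(cols), k=1)
    pairs = []
    for i, j in zip(rows, columns):
        value = corr.iloc[i, j]
        if abs(value) > threshold:
            pairs.append((cols[i], cols[j], float(value)))
    return pairs


# Toy data (made up): perimeter is an exact linear function of radius.
rng = np.random.default_rng(1)
radius = rng.uniform(5, 25, size=100)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "texture": rng.normal(size=100),
})

pairs = strongly_correlated_pairs(toy)
print(pairs)
```

In the notebook this could be called as `strongly_correlated_pairs(Xy[correlated_with_target])` to find candidates for removal.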


For comparison, here is the same model trained on all features.

In [128]:
train_and_test_model(X.columns.values)

AUC:  0.9562020460358057

