# Feature Selection

Here, I want to apply what I learnt in [Introduction to Statistical Learning](https://www.statlearning.com/) Chapter 6 about Feature Selection. This also answers some of my questions in the Least Squares / Logistic Regression files.

In [None]:
import pandas as pd
import numpy as np
import itertools
import statsmodels.api as sm
from ISLP.models import (ModelSpec as MS, summarize, poly)
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import \
     (cross_validate,
      KFold)

df = pd.read_csv("../train.csv")
print(df.shape)
print(df.columns)
print(df.dtypes)
for col in df.columns:
    print("Missing rows in {0}:".format(col), df[col].shape[0] - df[col].count())
print(df.describe())

y = df['Survived']

First of all, I have to do some preprocessing:

In [38]:
df['Sex'] = df['Sex'].astype('category')
df['SexNr'] = df['Sex'].cat.codes

mean = np.mean(df['Age'])
df['Age'] = df['Age'].fillna(mean)

df['Embarked'] = df['Embarked'].astype('category')

print(df.dtypes)

PassengerId       int64
Survived          int64
Pclass            int64
Name             object
Sex            category
Age             float64
SibSp             int64
Parch             int64
Ticket           object
Fare            float64
Cabin            object
Embarked       category
SexNr              int8
dtype: object


## Best Subset Selection

1. Generate all best models with 0 to p predictors.

In [42]:
predictors = ['Pclass', 'SexNr', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

for p in range(0, len(predictors)+1):
    print('Anzahl an Predictors:', p)
    best_r2 = 0
    best_predictors = ()
    for combination in itertools.combinations(predictors, p):
        X = MS(list(combination)).fit_transform(df)
        results = sm.OLS(y,X, missing='drop').fit()
        #print(summarize(results))
        #print("Rsquared:", results.rsquared)
        if results.rsquared > best_r2:
            best_r2 = results.rsquared
            best_predictors = (combination)
    print('Beste Kombination:', combination, 'mit R^2:', best_r2)

Anzahl an Predictors: 0
Beste Kombination: () mit R^2: 0
Anzahl an Predictors: 1
Beste Kombination: ('Embarked',) mit R^2: 0.2952307228626888
Anzahl an Predictors: 2
Beste Kombination: ('Fare', 'Embarked') mit R^2: 0.36768020891350406
Anzahl an Predictors: 3
Beste Kombination: ('Parch', 'Fare', 'Embarked') mit R^2: 0.38336856306416056
Anzahl an Predictors: 4
Beste Kombination: ('SibSp', 'Parch', 'Fare', 'Embarked') mit R^2: 0.39326065634846097
Anzahl an Predictors: 5
Beste Kombination: ('Age', 'SibSp', 'Parch', 'Fare', 'Embarked') mit R^2: 0.3974572876555986
Anzahl an Predictors: 6
Beste Kombination: ('SexNr', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked') mit R^2: 0.3978260472917734
Anzahl an Predictors: 7
Beste Kombination: ('Pclass', 'SexNr', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked') mit R^2: 0.39834629200884475


This works for now, but 'Embarked' as a categorical value with 3 options is probably not implemented properly, since it only counts as one variable for my code, while it is implemented as two variables in the model. I will ignore that for now and see whether that is talked about later in the Book Chapter or in the Lab.

Another problem with this is that the R^2 is measured on the training set instead of a test set, since I don't have a dedicated test set. Another problem is also that since the model all have different sizes, you actually can't really use R^2 (it will always be lower with more predictors). The solution for that is to use kfold CV as a measure. When having implemented that, I can properly rely on the model error.

In [45]:
# Implementing 10fold CV on one example

cv = KFold(n_splits=10,
           shuffle=True,
           random_state=0) # use same splits for each degree

for id,item in enumerate (range(1,6)):
    print(id,item)

X = np.power.outer(H, np.arange(d+1))

X = MS(list(combination)).fit_transform(df)
results = sm.OLS(y,X, missing='drop').fit()
print(summarize(results))
print("Rsquared:", results.rsquared)

               coef  std err       t  P>|t|
intercept    1.3483    0.074  18.332  0.000
Pclass      -0.1719    0.020  -8.508  0.000
SexNr       -0.5044    0.028 -17.879  0.000
Age         -0.0059    0.001  -5.457  0.000
SibSp       -0.0412    0.013  -3.160  0.002
Parch       -0.0160    0.018  -0.879  0.380
Fare         0.0003    0.000   0.873  0.383
Embarked[Q] -0.0025    0.055  -0.045  0.964
Embarked[S] -0.0663    0.034  -1.931  0.054
Rsquared: 0.39834629200884475
0 1
1 2
2 3
3 4
4 5


## Forward Stepwise Selection

## Backward Stepwise Selection