# Feature Selection in the Data Science Pipeline
## *Using medical diagnosis as an example (breast cancer diagnosis)*

We use the publicly available Breast Cancer Wisconsin (Diagnostic) dataset, which is hosted by the UCI Machine Learning Repository. The code below loads the copy that ships with scikit-learn (`load_breast_cancer`).

Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

![](./data.png)

Our goal is twofold: to find out which features are most helpful for predicting whether a tumor is malignant or benign, and to classify each case as benign or malignant.


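Performance is evaluated in the usual way for a binary classifier: accuracy, precision, recall and F1, all of which derive from the confusion matrix.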

![](conf_matrix.png)
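The four metrics reported throughout this notebook can be written directly in terms of the confusion-matrix counts. The sketch below is only an illustration: the function name `classification_metrics` and the counts are hypothetical and not taken from this project. Note that in scikit-learn's copy of the dataset the label 1 denotes benign and 0 malignant, so with default scorer settings precision and recall refer to the benign class.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the four summary metrics from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only for illustration.
example_metrics = classification_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in example_metrics.items():
    print(f"{name}: {value:.4f}")
```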

In [1]:
# Import libraries
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from cf_matrix import make_confusion_matrix
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.utils import plot_model
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, StratifiedKFold, GridSearchCV, cross_validate
import matplotlib.pylab as pl
import pandas as pd
import numpy as np
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
from sklearn import preprocessing

In [36]:
# Load the breast cancer dataset
import seaborn as sns
from sklearn import preprocessing
(X, y) = load_breast_cancer(return_X_y=True, as_frame=True)
# Overview of the data
X

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 25.380 | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 24.990 | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 23.570 | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 14.910 | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 22.540 | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 25.450 | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 |
| 565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 23.690 | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
| 566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 18.980 | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 |
| 567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 25.740 | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 |


### Baseline: Gradient-Boosted Decision Tree Classifier (XGBoost)

In [3]:
# Baseline performance with gradient-boosted decision trees (XGBoost)
# (all required names were already imported in the first cell)
# Scale the features to [0, 1]; the scaled copy is kept separate
# so that X remains a DataFrame with column names
X_scaled = MinMaxScaler().fit_transform(X)
# Simple binary classifier
model = XGBClassifier()
# Define the cross-validation scheme: 10 stratified folds, repeated 10 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
# Evaluate the model
scores = cross_validate(estimator=model, X=X_scaled, y=y, cv=cv, n_jobs=-1, 
                        scoring=['accuracy', 'roc_auc', 'precision', 'recall', 'f1'])
print('Accuracy: ', scores['test_accuracy'].mean())
print('Precision: ', scores['test_precision'].mean())
print('Recall: ', scores['test_recall'].mean())
print('F1: ', scores['test_f1'].mean(), '\n')

Accuracy:  0.9676785714285713
Precision:  0.9688758480459333
Recall:  0.9809603174603175
F1:  0.974475044983878 
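
The scores above are means over 10 x 10 = 100 train/test evaluations. The sketch below illustrates the idea behind a repeated stratified split: within each repeat, the samples of each class are shuffled and dealt evenly into the folds, so that every fold keeps roughly the class balance of the full dataset (212 malignant, 357 benign). It is a simplified illustration with a hypothetical function name, not scikit-learn's actual implementation of `RepeatedStratifiedKFold`.

```python
import random
from collections import Counter


def repeated_stratified_folds(labels, n_splits, n_repeats, seed):
    """Yield (repeat, fold, test_indices), keeping class proportions per fold."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for cls in sorted(set(labels)):
            members = [i for i, label in enumerate(labels) if label == cls]
            rng.shuffle(members)
            # Deal the shuffled members of this class round-robin into the folds.
            for position, index in enumerate(members):
                folds[position % n_splits].append(index)
        for fold, test_indices in enumerate(folds):
            yield repeat, fold, sorted(test_indices)


# Toy labels with the same class balance as the dataset.
labels = [0] * 212 + [1] * 357
splits = list(repeated_stratified_folds(labels, n_splits=10, n_repeats=10, seed=1))
print(len(splits))  # 100 train/test evaluations
print(Counter(labels[i] for i in splits[0][2]))  # class counts in the first test fold
```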



# Statistics-Based Feature Selection

It is common to use correlation-type statistical measures between the input variables and the output variable as the basis for feature selection.

The choice of measure therefore depends strongly on the data types of the variables involved. Here the inputs are numeric and the output is a binary class label.
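
As a minimal illustration of such a correlation-type measure, the sketch below ranks numeric features by the absolute Pearson correlation with a binary label. The feature names and values are made up for the example, and the helper `pearson` is written out by hand so that the block has no dependencies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical values): "size" tracks the label, "noise" does not.
toy_features = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "noise": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
}
toy_label = [0, 0, 0, 1, 1, 1]

# Score each feature by the strength of its linear association with the label.
relevance = {name: abs(pearson(values, toy_label)) for name, values in toy_features.items()}
ranking = sorted(relevance, key=relevance.get, reverse=True)
print(ranking)  # ['size', 'noise']
```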

## $\chi^2$

For the first approach we compute the chi-squared statistic between each non-negative feature and the class.

This score can be used to select, from the input feature vector, the k features with the highest chi-squared statistic relative to the classes (k = 20 below).

Recall that the chi-squared test measures dependence between stochastic variables. Selecting by this score therefore removes the features that are most likely to be independent of the class and hence irrelevant for classification.
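
The statistic can also be written out by hand. The sketch below follows our understanding of how scikit-learn's `chi2` scores features: the non-negative feature values are treated as counts, and the per-class sums that are observed are compared with the sums that would be expected if the feature were independent of the class. The function name `chi2_scores` and the toy data are assumptions made for illustration.

```python
def chi2_scores(rows, labels):
    """Chi-squared statistic per feature for non-negative feature values."""
    n_samples = len(rows)
    n_features = len(rows[0])
    # Total of each feature over all samples.
    feature_totals = [sum(row[j] for row in rows) for j in range(n_features)]
    scores = [0.0] * n_features
    for cls in sorted(set(labels)):
        class_rows = [row for row, label in zip(rows, labels) if label == cls]
        class_share = len(class_rows) / n_samples
        for j in range(n_features):
            observed = sum(row[j] for row in class_rows)
            # Under independence, the class receives its share of the feature total.
            expected = class_share * feature_totals[j]
            scores[j] += (observed - expected) ** 2 / expected
    return scores


# Toy data: feature 0 differs strongly between the classes, feature 1 does not.
rows = [
    [9.0, 2.0],
    [8.0, 3.0],
    [1.0, 2.0],
    [2.0, 3.0],
]
labels = [1, 1, 0, 0]
print(chi2_scores(rows, labels))
```

Feature 0 receives a score of about 9.8, while feature 1 scores 0 because its per-class sums match the expectation exactly. A feature whose total is zero would need special handling, which this sketch omits.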


In [37]:
# Reload the breast cancer dataset
import seaborn as sns
from sklearn import preprocessing
(X, y) = load_breast_cancer(return_X_y=True, as_frame=True)

In [34]:
# Chi-squared feature selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Determine the best k = 20 features
chi_best = SelectKBest(chi2, k=20).fit(X, y)
# Boolean mask marking the selected columns
mask_features = chi_best.get_support()
# Names of the k best features
new_features = list(X.columns[mask_features])
# Reduce X to the selected columns (the selector is already fitted)
X_new = chi_best.transform(X)
# Overview of the selected features
new_features

['mean radius',
 'mean texture',
 'mean perimeter',
 'mean area',
 'mean compactness',
 'mean concavity',
 'mean concave points',
 'radius error',
 'perimeter error',
 'area error',
 'compactness error',
 'concavity error',
 'worst radius',
 'worst texture',
 'worst perimeter',
 'worst area',
 'worst compactness',
 'worst concavity',
 'worst concave points',
 'worst symmetry']

In [35]:
# Performance of the XGBoost model after chi-squared feature selection
# Simple binary classifier
model = XGBClassifier()
# Define the cross-validation scheme
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
# Evaluate the model
scores = cross_validate(estimator=model, X=X_new, y=y, cv=cv, n_jobs=-1, 
                        scoring=['accuracy', 'roc_auc', 'precision', 'recall', 'f1'])
print('Accuracy: ', scores['test_accuracy'].mean())
print('Precision: ', scores['test_precision'].mean())
print('Recall: ', scores['test_recall'].mean())
print('F1: ', scores['test_f1'].mean(), '\n')

Accuracy:  0.9694235588972431
Precision:  0.9725356750308377
Recall:  0.9795634920634919
F1:  0.9757032902543239 



## Analysis of Variance (ANOVA)

ANOVA stands for "analysis of variance". It is a parametric statistical hypothesis test for determining whether the means of two or more samples of data (often three or more) differ significantly, that is, whether the samples plausibly come from the same distribution.

An F-test belongs to a class of statistical tests that compute a ratio of variances, e.g. the variances of two different samples, or the explained and unexplained variance in a statistical test such as ANOVA. ANOVA relies on such an F-statistic; the resulting test is referred to here as the ANOVA F-test.

The results of this test can be used for feature selection: features with a low F-score appear to be independent of the target variable and can be removed from the dataset.
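
For a single feature, the F-statistic is the variance between the class means divided by the variance within the classes. The sketch below computes it by hand for two toy samples. The function name `anova_f` and the values are hypothetical, and no p-value is computed.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a list of samples (one list per class)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own class mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy values of one feature, split by class.
class_a = [1.0, 2.0, 3.0]
class_b = [5.0, 6.0, 7.0]
print(anova_f([class_a, class_b]))  # 24.0
```

A feature whose class means coincide yields F = 0 and would be among the first to be dropped.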


In [15]:
# Reload the breast cancer dataset (no train/test split is made here;
# the evaluation below uses cross-validation)
import seaborn as sns
from sklearn import preprocessing
(X, y) = load_breast_cancer(return_X_y=True, as_frame=True)
# Overview of the data
X



In [43]:
# ANOVA feature selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# Determine the best k = 20 features
anova_best = SelectKBest(f_classif, k=20).fit(X, y)
# Boolean mask marking the selected columns
mask_features = anova_best.get_support()
# Names of the k best features
new_features = list(X.columns[mask_features])
# Reduce X to the selected columns (the selector is already fitted)
X_new = anova_best.transform(X)
# Overview of the selected features
new_features

['mean radius',
 'mean texture',
 'mean perimeter',
 'mean area',
 'mean compactness',
 'mean concavity',
 'mean concave points',
 'radius error',
 'perimeter error',
 'area error',
 'concave points error',
 'worst radius',
 'worst texture',
 'worst perimeter',
 'worst area',
 'worst smoothness',
 'worst compactness',
 'worst concavity',
 'worst concave points',
 'worst symmetry']

In [42]:
# Performance of the XGBoost model after ANOVA feature selection
# Simple binary classifier
model = XGBClassifier()
# Define the cross-validation scheme
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
# Evaluate the model
scores = cross_validate(estimator=model, X=X_new, y=y, cv=cv, n_jobs=-1, 
                        scoring=['accuracy', 'roc_auc', 'precision', 'recall', 'f1'])
print('Accuracy: ', scores['test_accuracy'].mean())
print('Precision: ', scores['test_precision'].mean())
print('Recall: ', scores['test_recall'].mean())
print('F1: ', scores['test_f1'].mean(), '\n')

Accuracy:  0.9690758145363408
Precision:  0.9701396642044863
Recall:  0.9817857142857143
F1:  0.9755156394262383 



# Minimum-Redundancy-Maximum-Relevance (mRMR)

The goal is to select a feature subset that best characterizes the statistical properties of a target classification variable, subject to the constraint that the selected features are as dissimilar to one another as possible (minimum redundancy) while being as similar as possible to the classification variable (maximum relevance).

There are several variants of mRMR, in which relevance and redundancy are defined via mutual information, correlation, t-test/F-test, distances, etc.
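
The greedy selection behind mRMR can be sketched in a few lines. The version below uses the difference form (relevance minus mean redundancy) and measures both with the absolute Pearson correlation, which is one of the variants mentioned above. It is not the mutual-information implementation of `pymrmr`, and the function names and toy data are assumptions made for illustration.

```python
import math


def abs_pearson(xs, ys):
    """Absolute Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return abs(cov) / math.sqrt(var_x * var_y)


def mrmr_mid(features, target, k):
    """Greedy mRMR: repeatedly pick the feature maximizing relevance - redundancy."""
    relevance = {name: abs_pearson(values, target) for name, values in features.items()}
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            # Mean similarity to the features that were already chosen.
            redundancy = sum(
                abs_pearson(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a_near_copy" is almost identical to "a"; "b" is weaker but independent.
toy_features = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "a_near_copy": [1.1, 2.0, 3.2, 4.1, 5.0, 6.2],
    "b": [2.0, 1.0, 0.0, 3.0, 2.0, 1.0],
}
toy_target = [0, 0, 0, 1, 1, 1]
print(mrmr_mid(toy_features, toy_target, k=2))  # ['a', 'b']
```

A ranking by relevance alone would pick `a` and `a_near_copy`. mRMR prefers `b` as the second feature because it adds information that `a` does not already carry.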


In [28]:
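import pymrmr

# Select 20 features with the MID (mutual information difference) criterion.
# Caution: according to the pymrmr documentation, the first DataFrame column is
# treated as the class variable and the data are expected to be discretized.
# X has no target column here, so 'mean radius' takes that role; to rank
# against the diagnosis, y would have to be inserted as the first column.
rel_feat = pymrmr.mRMR(X, 'MID', 20)
X_new = X[X.columns.intersection(rel_feat)]
rel_feat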

['mean area',
 'worst concave points',
 'mean perimeter',
 'worst radius',
 'worst symmetry',
 'radius error',
 'worst area',
 'worst smoothness',
 'worst perimeter',
 'fractal dimension error',
 'compactness error',
 'mean concave points',
 'area error',
 'mean symmetry',
 'mean smoothness',
 'perimeter error',
 'worst concavity',
 'mean compactness',
 'mean concavity',
 'worst compactness']

In [25]:
# Performance of the XGBoost model after mRMR feature selection
# Simple binary classifier
model = XGBClassifier()
# Define the cross-validation scheme
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
# Evaluate the model
scores = cross_validate(estimator=model, X=X_new, y=y, cv=cv, n_jobs=-1, 
                        scoring=['accuracy', 'roc_auc', 'precision', 'recall', 'f1'])
print('Accuracy: ', scores['test_accuracy'].mean())
print('Precision: ', scores['test_precision'].mean())
print('Recall: ', scores['test_recall'].mean())
print('F1: ', scores['test_f1'].mean(), '\n')

Accuracy:  0.9717167919799498
Precision:  0.9717353498945992
Recall:  0.9843015873015873
F1:  0.9776585423470732 



The complete code and data for this project are available via this GitHub link.

![](download_code.png)