<div id="titlepage">
    <h1 style="line-height: 1.5em; margin-bottom: 80px">
        Feature Selection Using the<br/>
        &raquo;20 Newsgroups&laquo; Dataset
    </h1>
    <table style="font-size: 20px; margin: 0; text-align: left">
        <tr style="background: none">
            <th style="border-style: none; padding-left: 2px">Lecture</th>
            <td style="border-style: none">Advanced Data Science Pipelines</td>
        </tr>
        <tr style="background: none">
            <th style="border-style: none; padding-left: 2px">Lecturer</th>
            <td style="border-style: none">Christoph Gietl</td>
        </tr>
        <tr style="background: none">
            <th style="border-style: none; padding-left: 2px">Date</th>
            <td style="border-style: none">14 June 2021</td>
        </tr>
        <tr style="background: none">
            <th style="border-style: none; padding-left: 2px">Slides</th>
            <td style="border-style: none"><a href="https://christophgietl.github.io/feature-selection">christophgietl.github.io/feature-selection</a></td>
        </tr>
        <tr style="background: none">
            <th style="border-style: none; padding-left: 2px">Code</th>
            <td style="border-style: none"><a href="https://github.com/christophgietl/feature-selection">github.com/christophgietl/feature-selection</a></td>
        </tr>
    </table>
</div>

In [31]:
import matplotlib.pyplot
import numpy
import random
import sklearn.datasets
import sklearn.feature_extraction.text
import sklearn.feature_selection
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.naive_bayes
import sklearn.pipeline
import sklearn.svm

## About this lecture

- Target audience: 5th to 7th semester of the degree program _Data Science & Scientific Computing_
- Module group: compulsory elective (WPF) Data Science
- Prerequisites
    - basic knowledge of Python and scikit-learn
    - Machine Learning (2nd semester)
    - Data Preparation & Visualization (2nd semester)
- Learning objectives
    - getting to know and applying
        - advanced machine-learning methods
        - best practices from the field of _reproducible data science_
    - in professional practice
        - analyzing complex, unstructured and messy datasets, individually and in a team
        - sharing and archiving code and artifacts from every step of the analysis process
- Contents
    - Data preparation
        - feature extraction
            - structured data
            - unstructured data
        - imputation of missing values
    - Advanced supervised-learning methods
        - model regularization in regression and classification
        - ensemble learning
        - multiclass and multilabel problems
        - model calibration
    - Model evaluation
        - choosing suitable metrics
        - visualizing metrics and learning progress
    - Strategies for handling large amounts of data
        - keeping data outside of main memory
        - online learning
        - feature selection
    - Persisting trained models

<h2>The &raquo;20 Newsgroups&laquo; dataset and<br/>the modeling goal</h2>

### Background of the dataset: Usenet

- historical Internet service (alongside the World Wide Web, e-mail and other services)
- divided into numerous discussion forums (so-called newsgroups)
- message format similar to e-mail
    - headers
    - signatures
    - quotes

### Data source and data format

In [32]:
ng20 = sklearn.datasets.fetch_20newsgroups(
    # The dataset has already been divided into the subsets "train" and "test".
    # Load the entire dataset:
    subset="all",
    
    # We will split the dataset ourselves later.
    # Shuffle the dataset thoroughly:
    random_state=42,
    shuffle=True,
    
    # Remove headers, signatures and quotes
    # to make it harder to attribute messages to individuals:
    remove=['headers', 'footers', 'quotes']
)

In [33]:
type(ng20.data), type(ng20.target), type(ng20.target_names)

(list, numpy.ndarray, list)

In [34]:
len(ng20.data), len(ng20.target), len(ng20.target_names)

(18846, 18846, 20)

#### Input data (explanatory variables)

In [35]:
types = set(type(item) for item in ng20.data)
types

{str}

In [36]:
lengths = [len(item) for item in ng20.data]
min(lengths), max(lengths)

(0, 158791)

In [37]:
numpy.mean(lengths).round(1), numpy.std(lengths).round(1)

(1169.7, 3858.6)

#### Output data (target variable)

In [38]:
numpy.unique(ng20.target, return_counts=True)

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19]),
 array([799, 973, 985, 982, 963, 988, 975, 990, 996, 994, 999, 991, 984,
        990, 987, 997, 910, 940, 775, 628]))

In [39]:
print(ng20.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


#### Example message from the dataset (input and output data)

In [40]:
print(ng20.data[4_328][:434])


Not when your talking about cryptography.


Think again.  You won't see me using apple's new signature from the
finder feature.


This analogy fails in its assumption that the government gives two
squirts about credibility.


In addition, Apple's proclaimed purpose in releasing the Macintosh wasn't
survellience.

Quite the opposite:
"On January 24, Apple will introduce.... Macintosh, and you'll see why
1984 won't be, like '1984'"


In [41]:
ng20.target[4_328], ng20.target_names[ng20.target[4_328]]

(11, 'sci.crypt')

### Modeling goal: topical text classifier

- accepts texts
- assigns each of them to one of the 20 topics (corresponding to the 20 newsgroups)

## Data modeling using previously acquired knowledge

### Preprocessing the data

In [42]:
# Split the dataset into training and test data:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    ng20.data,
    ng20.target,
    random_state=42,
    stratify=ng20.target, # same relative class frequencies in training and test data
    test_size=2_000       # needed for meaningful metrics with 20 classes
)

In [43]:
# Class balance of the training data:
numpy.unique(y_train, return_counts=True)

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19]),
 array([714, 870, 880, 878, 861, 883, 872, 885, 890, 889, 893, 886, 880,
        885, 882, 891, 813, 840, 693, 561]))

In [44]:
# Class balance of the test data:
numpy.unique(y_test, return_counts=True)

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19]),
 array([ 85, 103, 105, 104, 102, 105, 103, 105, 106, 105, 106, 105, 104,
        105, 105, 106,  97, 100,  82,  67]))

### Feature extraction and model fitting

In [45]:
# The TF-IDF vectorizer transforms lists of strings (= texts) into matrices:
xtrct = sklearn.feature_extraction.text.TfidfVectorizer(
    stop_words="english" # Words such as 'the' and 'is' are ignored.
)
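
To make the transformation more tangible, here is a minimal sketch on a made-up toy corpus. The texts and the names `toy_texts`, `toy_xtrct` and `toy_matrix` are assumptions for illustration and are not part of the lecture's dataset.

```python
import sklearn.feature_extraction.text

# Hypothetical toy corpus (not taken from the 20 Newsgroups data):
toy_texts = [
    "the rocket is in orbit",
    "the goalie stops the puck",
    "orbit of the rocket",
]

toy_xtrct = sklearn.feature_extraction.text.TfidfVectorizer(stop_words="english")
toy_matrix = toy_xtrct.fit_transform(toy_texts)  # sparse matrix, one row per text

print(sorted(toy_xtrct.vocabulary_))  # learned features (stop words removed)
print(toy_matrix.shape)               # (number of texts, number of features)
```

Each column of the matrix corresponds to one word of the learned vocabulary, which is why the number of features grows with the size of the corpus.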

In [46]:
# Naive Bayes classifiers are well established in text classification:
clssf_nb = sklearn.naive_bayes.ComplementNB()

In [47]:
# Combine feature extractor and classifier into a pipeline:
ppln = sklearn.pipeline.Pipeline([
    ("xtrct", xtrct),
    ("clssf", clssf_nb)
])

In [48]:
# Reusable logic for fitting pipelines to the dataset (X_train, y_train):
def cross_validate_and_fit(estimator, param_grid):
    gscv = sklearn.model_selection.GridSearchCV(
        estimator=estimator,
        n_jobs=3,
        param_grid=param_grid,
        return_train_score=True,
        # The weighted area under the ROC curve (one-vs-rest) is a stable metric
        # for tuning multiclass classifiers:
        scoring="roc_auc_ovr_weighted",
        verbose=9
    )
    gscv.fit(X_train, y_train)
    print()
    
    print(f"Parameters:       {gscv.cv_results_['params']}")
    print(f"Mean train score: {gscv.cv_results_['mean_train_score'].round(3)}")
    print(f"Mean test score:  {gscv.cv_results_['mean_test_score'].round(3)}")
    
    return gscv


tuned_ppln = cross_validate_and_fit(
    estimator=ppln,
    param_grid={'clssf__alpha': [0.01, 0.03, 0.1, 0.3, 1.0]} # regularization of the NB classifier
)

Fitting 5 folds for each of 5 candidates, totalling 25 fits

Parameters:       [{'clssf__alpha': 0.01}, {'clssf__alpha': 0.03}, {'clssf__alpha': 0.1}, {'clssf__alpha': 0.3}, {'clssf__alpha': 1.0}]
Mean train score: [0.998 0.997 0.997 0.995 0.993]
Mean test score:  [0.973 0.975 0.977 0.976 0.974]
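
The scores above are weighted one-vs-rest AUC values. The following sketch, using made-up labels and probabilities (`y_true` and `y_proba` are assumptions for illustration), shows how such a value can be reproduced by hand: one binary AUC per class, averaged with the class frequencies as weights.

```python
import numpy
import sklearn.metrics

# Hypothetical labels and predicted class probabilities for 3 classes:
y_true = numpy.array([0, 0, 1, 1, 2, 2])
y_proba = numpy.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.4, 0.4, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
])

# Metric behind scoring="roc_auc_ovr_weighted":
auc_weighted = sklearn.metrics.roc_auc_score(
    y_true, y_proba, multi_class="ovr", average="weighted"
)

# Same value by hand: one "class c vs. rest" AUC per class,
# averaged with the number of samples per class as weights.
classes, counts = numpy.unique(y_true, return_counts=True)
auc_per_class = [
    sklearn.metrics.roc_auc_score(y_true == c, y_proba[:, c]) for c in classes
]
auc_manual = numpy.average(auc_per_class, weights=counts)

print(auc_weighted, auc_manual)
```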


### Evaluation of the pipeline

In [49]:
# Reusable logic for printing a sorted sample from a list:
def print_sorted_samples(lst):
    rng = numpy.random.default_rng(seed=42)
    samples = rng.choice(lst, replace=False, size=100)
    samples.sort()
    print(samples)

#### Features (generated by the extractor)

In [50]:
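# Note: scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2.
feature_names = tuned_ppln.best_estimator_.named_steps["xtrct"].get_feature_names()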
len(feature_names)

127345

In [51]:
print_sorted_samples(feature_names)

['2024' '3388' '3999' '417' '4tq1jv' '5286' '578' '5e8g4' '5pv'
 '83jx0dadf' '87___________________' '93109' '_cheap_' '_rqpp4db' 'a4gc0'
 'ak' 'anafranil' 'ap8h' 'arromdian' 'bench' 'benes' 'booktitle'
 'circumstantial' 'comparitive' 'cyberspace' 'delia' 'despised' 'dxsx'
 'eighth' 'ellison' 'endlessly' 'endprocedure' 'f0j' 'fjpbkbpu' 'frame'
 'glucoma' 'goofy' 'grad' 'gravis' 'gukasian' 'gyv' 'hallandale'
 'hallucinating' 'hippi' 'homelessness' 'howell' 'hyqe' 'institutional'
 'interplanetary' 'janney' 'k2mv5805t' 'kzm' 'l4y4j2' 'leedom' 'lmrcr1o'
 'mtm' 'multipath' 'myopia' 'nb6c' 'noble' 'oi_w_' 'orthogonal'
 'overwhelming' 'owen' 'p90t' 'pendelum' 'petah' 'petcock' 'ppw' 'q6p1i'
 'quiet' 'qwt' 'redneck' 'regent' 'reincarnated' 'restrains' 'ripem'
 'roelle' 'rq9' 'rythm' 'scatter' 'serials' 'sng' 'spaceward' 'sq9wmgk'
 'squalid' 'stg' 'strategic' 't5k1' 't5m19' 'trp' 'tyrant' 'ug3'
 'uniformed' 'vdu' 'victimized' 'walkie' 'xhm' 'xoutput_info'
 'xtungrabpointer']


#### Coefficients of the classifier

In [52]:
coefficients = tuned_ppln.best_estimator_.named_steps["clssf"].feature_log_prob_
coefficients.shape

(20, 127345)

In [53]:
numpy.prod(coefficients.shape), (coefficients==0).any()

(2546900, False)

### New modeling goal: _compact_ topical text classifier

- accepts texts
- assigns each of them to one of the 20 topics (corresponding to the 20 newsgroups)
- _**Auditability:** the decision process is comprehensible and documented for stakeholders inside and outside the company_

#### Obstacle

Too many features and coefficients
- make documentation harder and
- prevent comprehensibility.

## Introduction to feature selection

### Basic idea of feature selection

- Idea
    - Remove &raquo;useless&laquo; features before fitting the final model.
    - Keep only the &raquo;useful&laquo; features.
- Goal: compact model
- Advantages
    - faster computation of predictions
    - better interpretability
- Disadvantages
    - usually longer training time
    - usually (slightly) worse predictive quality
- Areas of application
    - datasets with many features _(e.g. in bioinformatics)_
    - costly data collection _(e.g. laboratory tests or questionnaires)_
    - need for interpretable models _(e.g. due to regulatory requirements or to improve trust)_
    - deployment on machines with limited computing power _(e.g. in Internet-of-Things settings)_

### Overview of feature-selection methods

#### Filter methods

- Algorithm
    - For each feature, compute its interaction with the target variable.
    - Remove all features whose interaction value lies below a given threshold.
- Difficulty
    - choosing a suitable interaction measure
        - discrete vs. continuous variables
        - linear vs. nonlinear interaction
- Advantages
    - fast computation
    - numerical stability
    - good scalability
- Disadvantages
    - the model is not taken into account
    - features that are important for the model may be removed
- Recommendation
    - use with caution
    - do not remove too many features
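
The two algorithm steps above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the lecture's method: the data, the choice of the absolute Pearson correlation as interaction measure and the threshold of 0.3 are all assumptions.

```python
import numpy

rng = numpy.random.default_rng(seed=42)

# Hypothetical data: 200 samples, 5 features; only features 0 and 3 drive the target.
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 3] + rng.normal(scale=0.1, size=200)

# Step 1: compute one interaction value per feature
# (here: absolute Pearson correlation, which only captures linear interaction).
scores = numpy.array([
    abs(numpy.corrcoef(X_toy[:, j], y_toy)[0, 1]) for j in range(X_toy.shape[1])
])

# Step 2: remove all features whose score lies below the (assumed) threshold.
threshold = 0.3
keep = scores >= threshold
X_filtered = X_toy[:, keep]

print(scores.round(2), numpy.flatnonzero(keep))
```

Note that the model that will later be fitted plays no role here, which is exactly the disadvantage listed above.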

##### Example: filtering continuous data using mutual information

In [54]:
# Toy dataset for illustrating various feature-selection methods:
X, y = sklearn.datasets.load_diabetes(return_X_y=True)
X.shape, y.shape

((442, 10), (442,))

In [55]:
rgr = sklearn.linear_model.LinearRegression()
rgr_cv = sklearn.model_selection.GridSearchCV(rgr, param_grid={}, return_train_score=True)
rgr_cv.fit(X,y)
rgr_cv.cv_results_

{'mean_fit_time': array([0.0010417]),
 'std_fit_time': array([0.00017433]),
 'mean_score_time': array([0.0005115]),
 'std_score_time': array([0.00014351]),
 'params': [{}],
 'split0_test_score': array([0.42955643]),
 'split1_test_score': array([0.52259828]),
 'split2_test_score': array([0.4826784]),
 'split3_test_score': array([0.42650827]),
 'split4_test_score': array([0.55024923]),
 'mean_test_score': array([0.48231812]),
 'std_test_score': array([0.0492662]),
 'rank_test_score': array([1], dtype=int32),
 'split0_train_score': array([0.52428374]),
 'split1_train_score': array([0.51032284]),
 'split2_train_score': array([0.52379623]),
 'split3_train_score': array([0.53137699]),
 'split4_train_score': array([0.50775176]),
 'mean_train_score': array([0.51950631]),
 'std_train_score': array([0.00899606])}

In [56]:
rgr_cv.cv_results_["mean_train_score"].round(3), rgr_cv.cv_results_["mean_test_score"].round(3)

(array([0.52]), array([0.482]))

In [57]:
fltr_rgr = sklearn.pipeline.Pipeline([
    ("slct", sklearn.feature_selection.SelectKBest(k=4, score_func=sklearn.feature_selection.mutual_info_regression)),
    ("rgr", sklearn.linear_model.LinearRegression())
])
fltr_rgr_cv = sklearn.model_selection.GridSearchCV(fltr_rgr, param_grid={}, return_train_score=True)
fltr_rgr_cv.fit(X,y)
fltr_rgr_cv.cv_results_

{'mean_fit_time': array([0.02712026]),
 'std_fit_time': array([0.00286115]),
 'mean_score_time': array([0.00049601]),
 'std_score_time': array([8.6878631e-06]),
 'params': [{}],
 'split0_test_score': array([0.34640813]),
 'split1_test_score': array([0.48557479]),
 'split2_test_score': array([0.49588666]),
 'split3_test_score': array([0.39610646]),
 'split4_test_score': array([0.48743854]),
 'mean_test_score': array([0.44228292]),
 'std_test_score': array([0.06018441]),
 'rank_test_score': array([1], dtype=int32),
 'split0_train_score': array([0.47480387]),
 'split1_train_score': array([0.47640634]),
 'split2_train_score': array([0.46458324]),
 'split3_train_score': array([0.4870041]),
 'split4_train_score': array([0.45537312]),
 'mean_train_score': array([0.47163413]),
 'std_train_score': array([0.01079993])}

In [58]:
fltr_rgr_cv.cv_results_["mean_train_score"].round(3), fltr_rgr_cv.cv_results_["mean_test_score"].round(3)

(array([0.472]), array([0.442]))

In [59]:
bool_support = fltr_rgr_cv.best_estimator_.named_steps["slct"].get_support()
support = [idx for idx in range(len(bool_support)) if bool_support[idx]]
support

[2, 7, 8, 9]

#### Wrapper methods

- Idea:
    - Fit the model on a subset of the features.
    - Assess the model quality.
    - Improve the subset of features iteratively.
- Advantage:
    - Features that are useful only in combination can be retained.
- Disadvantage:
    - expensive computation
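
The iterative idea above can be written out as a greedy loop. The sketch below is a hand-rolled forward selection on synthetic data; the data and the helper `cv_score` are hypothetical and only serve to show why the computation is expensive (one cross-validated fit per candidate feature and iteration).

```python
import numpy
import sklearn.linear_model
import sklearn.model_selection

rng = numpy.random.default_rng(seed=42)

# Hypothetical data: only features 1 and 4 carry signal.
X_toy = rng.normal(size=(200, 6))
y_toy = 2.0 * X_toy[:, 1] - 3.0 * X_toy[:, 4] + rng.normal(scale=0.1, size=200)


def cv_score(feature_idcs):
    """Mean cross-validated R^2 of a linear model restricted to feature_idcs."""
    model = sklearn.linear_model.LinearRegression()
    return sklearn.model_selection.cross_val_score(
        model, X_toy[:, feature_idcs], y_toy, cv=5
    ).mean()


selected = []                            # current subset of features
remaining = list(range(X_toy.shape[1]))  # candidates not yet selected
for _ in range(2):                       # grow the subset to 2 features
    # Fit and assess the model once per candidate feature ...
    best = max(remaining, key=lambda j: cv_score(selected + [j]))
    # ... and keep the candidate that improves the subset the most.
    selected.append(best)
    remaining.remove(best)

print(selected)
```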

##### Example: forward selection

In [60]:
slct_fwd = sklearn.feature_selection.SequentialFeatureSelector(
    sklearn.linear_model.LinearRegression(),
    direction="forward",
    n_features_to_select=4
)
wrpr_rgr = sklearn.pipeline.Pipeline([
    ("slct", slct_fwd),
    ("rgr", sklearn.linear_model.LinearRegression())
])
tuned_wrpr_rgr = sklearn.model_selection.GridSearchCV(wrpr_rgr, param_grid={}, return_train_score=True)
tuned_wrpr_rgr.fit(X,y)
tuned_wrpr_rgr.cv_results_

{'mean_fit_time': array([0.1973598]),
 'std_fit_time': array([0.00594485]),
 'mean_score_time': array([0.00042439]),
 'std_score_time': array([1.85292809e-06]),
 'params': [{}],
 'split0_test_score': array([0.37782954]),
 'split1_test_score': array([0.49784408]),
 'split2_test_score': array([0.46023915]),
 'split3_test_score': array([0.40836977]),
 'split4_test_score': array([0.53003558]),
 'mean_test_score': array([0.45486362]),
 'std_test_score': array([0.05589806]),
 'rank_test_score': array([1], dtype=int32),
 'split0_train_score': array([0.50470573]),
 'split1_train_score': array([0.48662052]),
 'split2_train_score': array([0.49184417]),
 'split3_train_score': array([0.50700358]),
 'split4_train_score': array([0.48095316]),
 'mean_train_score': array([0.49422543]),
 'std_train_score': array([0.01012695])}

In [61]:
tuned_wrpr_rgr.cv_results_["mean_train_score"].round(3), tuned_wrpr_rgr.cv_results_["mean_test_score"].round(3)

(array([0.494]), array([0.455]))

In [62]:
bool_support = tuned_wrpr_rgr.best_estimator_.named_steps["slct"].get_support()
support = [idx for idx in range(len(bool_support)) if bool_support[idx]]
support

[2, 3, 6, 8]

##### Example: backward elimination

In [63]:
slct_bwd = sklearn.feature_selection.SequentialFeatureSelector(
    sklearn.linear_model.LinearRegression(),
    direction="backward",
    n_features_to_select=4
)
wrpr_rgr = sklearn.pipeline.Pipeline([
    ("slct", slct_bwd),
    ("rgr", sklearn.linear_model.LinearRegression())
])
tuned_wrpr_rgr = sklearn.model_selection.GridSearchCV(wrpr_rgr, param_grid={}, return_train_score=True)
tuned_wrpr_rgr.fit(X,y)
tuned_wrpr_rgr.cv_results_

{'mean_fit_time': array([0.26473608]),
 'std_fit_time': array([0.00383217]),
 'mean_score_time': array([0.00042005]),
 'std_score_time': array([1.56268784e-06]),
 'params': [{}],
 'split0_test_score': array([0.42356667]),
 'split1_test_score': array([0.49784408]),
 'split2_test_score': array([0.46023915]),
 'split3_test_score': array([0.40836977]),
 'split4_test_score': array([0.53003558]),
 'mean_test_score': array([0.46401105]),
 'std_test_score': array([0.04527657]),
 'rank_test_score': array([1], dtype=int32),
 'split0_train_score': array([0.49651582]),
 'split1_train_score': array([0.48662052]),
 'split2_train_score': array([0.49184417]),
 'split3_train_score': array([0.50700358]),
 'split4_train_score': array([0.48095316]),
 'mean_train_score': array([0.49258745]),
 'std_train_score': array([0.00888561])}

In [64]:
tuned_wrpr_rgr.cv_results_["mean_train_score"].round(3), tuned_wrpr_rgr.cv_results_["mean_test_score"].round(3)

(array([0.493]), array([0.464]))

In [65]:
bool_support = tuned_wrpr_rgr.best_estimator_.named_steps["slct"].get_support()
support = [idx for idx in range(len(bool_support)) if bool_support[idx]]
support

[2, 3, 4, 8]

#### Embedded methods

- Idea: the model selects features during training.
- Examples:
    - decision trees
    - linear models with L1 regularization
- Advantages:
    - fast computation
    - selection of features that are useful for the model
- Disadvantage:
    - less powerful than wrapper methods
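
As a small illustration of the decision-tree case, the following sketch fits a shallow tree on synthetic data (the data and the depth limit are assumptions). A tree of depth 2 contains at most three splits, so it can use at most three features; all others receive an importance of zero.

```python
import numpy
import sklearn.tree

rng = numpy.random.default_rng(seed=42)

# Hypothetical data: 8 features, the target depends only on the sign of feature 2.
X_toy = rng.normal(size=(300, 8))
y_toy = 5.0 * (X_toy[:, 2] > 0) + rng.normal(scale=0.1, size=300)

tree = sklearn.tree.DecisionTreeRegressor(max_depth=2, random_state=42)
tree.fit(X_toy, y_toy)

# Features that never appear in a split have importance 0,
# i.e. the tree has implicitly deselected them during training:
used = numpy.flatnonzero(tree.feature_importances_ > 0)
print(used)
```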

##### Example: linear regression with L1 regularization (lasso)

In [66]:
tuned_embedded_rgr = sklearn.model_selection.GridSearchCV(
    sklearn.linear_model.Lasso(normalize=True),
    param_grid={"alpha": [0.3]},
    return_train_score=True
)
tuned_embedded_rgr.fit(X,y)
tuned_embedded_rgr.cv_results_

{'mean_fit_time': array([0.00111504]),
 'std_fit_time': array([0.00023035]),
 'mean_score_time': array([0.00039592]),
 'std_score_time': array([5.68494291e-05]),
 'param_alpha': masked_array(data=[0.3],
              mask=[False],
        fill_value='?',
             dtype=object),
 'params': [{'alpha': 0.3}],
 'split0_test_score': array([0.37792307]),
 'split1_test_score': array([0.48731802]),
 'split2_test_score': array([0.48070249]),
 'split3_test_score': array([0.45049619]),
 'split4_test_score': array([0.51318003]),
 'mean_test_score': array([0.46192396]),
 'std_test_score': array([0.04650036]),
 'rank_test_score': array([1], dtype=int32),
 'split0_train_score': array([0.49621581]),
 'split1_train_score': array([0.47664432]),
 'split2_train_score': array([0.48928877]),
 'split3_train_score': array([0.48948488]),
 'split4_train_score': array([0.47353178]),
 'mean_train_score': array([0.48503311]),
 'std_train_score': array([0.00855142])}

In [67]:
tuned_embedded_rgr.cv_results_["mean_train_score"].round(3), tuned_embedded_rgr.cv_results_["mean_test_score"].round(3)

(array([0.485]), array([0.462]))

In [68]:
bool_support = (tuned_embedded_rgr.best_estimator_.coef_!=0)
support = [idx for idx in range(len(bool_support)) if bool_support[idx]]
support

[2, 3, 6, 8]

#### Special case: recursive feature elimination (RFE)

In [69]:
slct_rfe = sklearn.feature_selection.RFE(
    sklearn.linear_model.LinearRegression(),
    n_features_to_select=4
)
rfe_rgr = sklearn.pipeline.Pipeline([
    ("slct", slct_rfe),
    ("rgr", sklearn.linear_model.LinearRegression())
])
tuned_rfe_rgr = sklearn.model_selection.GridSearchCV(rfe_rgr, param_grid={}, return_train_score=True)
tuned_rfe_rgr.fit(X,y)
tuned_rfe_rgr.cv_results_

{'mean_fit_time': array([0.00619001]),
 'std_fit_time': array([0.00086974]),
 'mean_score_time': array([0.00048828]),
 'std_score_time': array([7.62499742e-05]),
 'params': [{}],
 'split0_test_score': array([0.38333569]),
 'split1_test_score': array([0.50553064]),
 'split2_test_score': array([0.49985887]),
 'split3_test_score': array([0.37537075]),
 'split4_test_score': array([0.50505252]),
 'mean_test_score': array([0.4538297]),
 'std_test_score': array([0.06089444]),
 'rank_test_score': array([1], dtype=int32),
 'split0_train_score': array([0.48487474]),
 'split1_train_score': array([0.46480069]),
 'split2_train_score': array([0.46974272]),
 'split3_train_score': array([0.49512594]),
 'split4_train_score': array([0.46820776]),
 'mean_train_score': array([0.47655037]),
 'std_train_score': array([0.01156153])}

In [70]:
tuned_rfe_rgr.cv_results_["mean_train_score"].round(3), tuned_rfe_rgr.cv_results_["mean_test_score"].round(3)

(array([0.477]), array([0.454]))

In [71]:
bool_support = tuned_rfe_rgr.best_estimator_.named_steps["slct"].get_support()
support = [idx for idx in range(len(bool_support)) if bool_support[idx]]
support

[2, 4, 5, 8]
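
What RFE does internally can be sketched by hand: fit the model, drop the feature with the smallest absolute coefficient, and repeat until the desired number of features is left. The synthetic data below is an assumption for illustration; the final comparison uses scikit-learn's `RFE` with its default of removing one feature per iteration.

```python
import numpy
import sklearn.feature_selection
import sklearn.linear_model

rng = numpy.random.default_rng(seed=42)

# Hypothetical data: only features 0 and 5 carry signal.
X_toy = rng.normal(size=(200, 6))
y_toy = 4.0 * X_toy[:, 0] + 2.0 * X_toy[:, 5] + rng.normal(scale=0.1, size=200)

# Hand-rolled elimination loop:
remaining = list(range(X_toy.shape[1]))
while len(remaining) > 2:
    model = sklearn.linear_model.LinearRegression().fit(X_toy[:, remaining], y_toy)
    weakest = int(numpy.argmin(numpy.abs(model.coef_)))  # position within `remaining`
    del remaining[weakest]

# Reference result from scikit-learn:
rfe = sklearn.feature_selection.RFE(
    sklearn.linear_model.LinearRegression(), n_features_to_select=2
).fit(X_toy, y_toy)
rfe_support = numpy.flatnonzero(rfe.get_support()).tolist()

print(remaining, rfe_support)
```

The loop refits the model repeatedly like a wrapper method, but ranks features by the model's own coefficients like an embedded method, which may be one reason to treat RFE as a case of its own.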

<h2>Extending the model pipeline<br/>with feature selection</h2>

## Filter methods

In [None]:
slct_kbest_chi2 = sklearn.feature_selection.SelectKBest(
    score_func=sklearn.feature_selection.chi2
)
ppln_fltr = sklearn.pipeline.Pipeline([
    ("xtrct", xtrct),
    ("slct", slct_kbest_chi2),
    ("clssf", clssf_nb)
])
tuned_ppln_fltr = cross_validate_and_fit(
    estimator=ppln_fltr,
    param_grid={
        "slct__k": [100, 300, 1_000, 3_000, 10_000, 30_000, "all"],
        "clssf__alpha": [0.1]
    }
)
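
The chi-squared score requires non-negative features, which term counts and TF-IDF values satisfy. The following sketch shows the scoring on a made-up count matrix; `X_counts` and `y_topics` are hypothetical and not derived from the newsgroup data.

```python
import numpy
import sklearn.feature_selection

# Hypothetical term counts: rows = documents, columns = terms.
X_counts = numpy.array([
    [3, 0, 1],
    [2, 0, 1],
    [0, 4, 1],
    [0, 3, 1],
])
y_topics = numpy.array([0, 0, 1, 1])

# Term 0 occurs only in topic 0, term 1 only in topic 1,
# term 2 occurs equally often in both topics.
chi2_scores, p_values = sklearn.feature_selection.chi2(X_counts, y_topics)

slct = sklearn.feature_selection.SelectKBest(
    score_func=sklearn.feature_selection.chi2, k=2
).fit(X_counts, y_topics)

print(chi2_scores, slct.get_support())
```

The term that is spread evenly over both topics receives a score of zero and is the one removed by `SelectKBest`.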

## Wrapper methods

- omitted because they are computationally too expensive

## Embedded methods

### Logistic regression with L1 regularization

In [None]:
clssf_lr = sklearn.linear_model.LogisticRegression(
    penalty="l1",
    random_state=42,
    solver="liblinear"
)
ppln_embd_lr = sklearn.pipeline.Pipeline([
    ("xtrct", xtrct),
    ("clssf", clssf_lr)
])
tuned_ppln_embd_lr = cross_validate_and_fit(
    estimator = ppln_embd_lr,
    param_grid={"clssf__C": [0.6]}
)

In [None]:
coefficients = tuned_ppln_embd_lr.best_estimator_.named_steps["clssf"].coef_
coefficients.shape

In [None]:
feature_has_non_zero_coefficient = (coefficients!=0).any(axis=0)
feature_has_non_zero_coefficient.sum()
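
How strongly the L1 penalty thins out the coefficients depends on `C` (smaller `C` means stronger regularization). The sketch below illustrates this on a synthetic binary problem; the dataset generated by `make_classification` and the chosen values of `C` are assumptions and unrelated to the value used above.

```python
import sklearn.datasets
import sklearn.linear_model

# Hypothetical data: 30 features, of which 5 are informative.
X_toy, y_toy = sklearn.datasets.make_classification(
    n_samples=300, n_features=30, n_informative=5, n_redundant=0, random_state=42
)

n_nonzero_by_C = {}
for C in [0.01, 0.1, 1.0]:
    model = sklearn.linear_model.LogisticRegression(
        penalty="l1", solver="liblinear", C=C, random_state=42
    )
    model.fit(X_toy, y_toy)
    # Count the features that keep a non-zero coefficient:
    n_nonzero_by_C[C] = int((model.coef_ != 0).sum())

print(n_nonzero_by_C)
```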

### Logistic regression with L1 regularization followed by naive Bayes

In [None]:
slct_mdl_lr = sklearn.feature_selection.SelectFromModel(
    estimator=clssf_lr
)
ppln_embd_lr_nb = sklearn.pipeline.Pipeline([
    ("xtrct", xtrct),
    ("slct", slct_mdl_lr),
    ("clssf", clssf_nb)
])
tuned_ppln_embd_lr_nb = cross_validate_and_fit(
    estimator=ppln_embd_lr_nb,
    param_grid={
        "slct__estimator__C": [0.6],
        "clssf__alpha": [0.1]
    }
)

In [None]:
tuned_ppln_embd_lr_nb.best_estimator_.named_steps["slct"].get_support().sum()

### RFE-SVC followed by naive Bayes

In [None]:
slct_rfe_svc = sklearn.feature_selection.RFE(
    estimator=sklearn.svm.LinearSVC(random_state=42),
    n_features_to_select=1_000,
    step=10_000
)
ppln_embd_rfe = sklearn.pipeline.Pipeline([
    ("xtrct", xtrct),
    ("slct", slct_rfe_svc),
    ("clssf", clssf_nb)
])
tuned_ppln_embd_rfe = cross_validate_and_fit(
    estimator=ppln_embd_rfe,
    param_grid={
        "slct__estimator__C": [0.03, 0.1, 0.3, 1.0],
        "clssf__alpha": [0.1]
    }
)

## Extract the selected features

In [None]:
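# Note: scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2.
feature_names = tuned_ppln_embd_rfe.best_estimator_.named_steps["xtrct"].get_feature_names()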

In [None]:
selected = tuned_ppln_embd_rfe.best_estimator_.named_steps["slct"].get_support()
numpy.unique(selected, return_counts=True)

In [None]:
selected_features = [
    feature
    for idx, feature in enumerate(feature_names)
    if selected[idx]
]
print_sorted_samples(selected_features)

## Simplify the vectorizer

In [None]:
xtrct_vcb = sklearn.feature_extraction.text.TfidfVectorizer(
    vocabulary=selected_features
)
smpl_ppln = sklearn.pipeline.Pipeline([
    ("xtrct", xtrct_vcb),
    ("clssf", clssf_nb)
])
tuned_smpl_ppln = cross_validate_and_fit(
    estimator=smpl_ppln,
    param_grid={"clssf__alpha": [0.003, 0.01, 0.03, 0.1, 0.3]}
)
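
A vectorizer with a fixed vocabulary ignores every word outside that vocabulary, so the selection step can be dropped from the pipeline. A minimal sketch with a made-up three-word vocabulary (`toy_vocabulary` and the texts are assumptions):

```python
import sklearn.feature_extraction.text

# Hypothetical vocabulary standing in for the selected features:
toy_vocabulary = ["orbit", "puck", "rocket"]
toy_xtrct = sklearn.feature_extraction.text.TfidfVectorizer(vocabulary=toy_vocabulary)

toy_matrix = toy_xtrct.fit_transform([
    "the rocket reached orbit",  # contains two vocabulary words
    "nothing relevant here",     # contains no vocabulary word
])

print(toy_matrix.shape)      # one column per vocabulary entry
print(toy_matrix.toarray())  # the second row consists of zeros only
```

Texts without any vocabulary word are mapped to an all-zero row, which is worth keeping in mind when interpreting the classifier's predictions.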

In [None]:
tuned_smpl_ppln.score(X_test, y_test)

## Plausibility check

In [None]:
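# Note: plot_confusion_matrix was removed in scikit-learn 1.2;
# newer versions offer sklearn.metrics.ConfusionMatrixDisplay.from_estimator instead.
sklearn.metrics.plot_confusion_matrix(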
    tuned_smpl_ppln,
    X_train,
    y_train,
    cmap=matplotlib.pyplot.cm.Greys,
    display_labels=ng20.target_names,
    include_values=False,
    xticks_rotation="vertical"
)

### Explaining the false atheists

In [None]:
y_train_pred = tuned_smpl_ppln.predict(X_train)
y_train_pred_proba = tuned_smpl_ppln.predict_proba(X_train)

In [None]:
false_atheists = numpy.logical_and(y_train>0, y_train_pred==0)
false_atheist_idcs = numpy.where(false_atheists)[0]
false_atheist_idcs.shape

In [None]:
false_atheist_probas=y_train_pred_proba[false_atheist_idcs,:]
false_atheist_probas.round(3)

In [None]:
max_prob_per_sample = false_atheist_probas.max(axis=1)
min_prob_per_sample = false_atheist_probas.min(axis=1)
sample_has_constant_prob = \
    (min_prob_per_sample == max_prob_per_sample)
numpy.unique(sample_has_constant_prob, return_counts=True)

In [None]:
false_atheists_with_const_prob = false_atheist_idcs[sample_has_constant_prob]

In [None]:
print(X_train[false_atheists_with_const_prob[25]])

In [None]:
print(X_train[false_atheists_with_const_prob[527]])

In [None]:
print(X_train[false_atheists_with_const_prob[773]])
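
One plausible explanation for samples with constant class probabilities, stated here as an assumption rather than as a result of the lecture: a text that contains none of the vocabulary words is mapped to an all-zero vector, `ComplementNB` then assigns the same score to every class, and the tie is resolved in favor of the first class (index 0, which is `alt.atheism` in this dataset). The toy pipeline below, with made-up texts and a hypothetical vocabulary, reproduces this behavior.

```python
import numpy
import sklearn.feature_extraction.text
import sklearn.naive_bayes
import sklearn.pipeline

# Hypothetical training data with three topics (0, 1, 2):
toy_texts = ["engine brakes tires", "puck goalie ice", "rocket orbit launch"]
toy_topics = [0, 1, 2]

toy_ppln = sklearn.pipeline.Pipeline([
    ("xtrct", sklearn.feature_extraction.text.TfidfVectorizer(
        vocabulary=["engine", "brakes", "puck", "goalie", "rocket", "orbit"]
    )),
    ("clssf", sklearn.naive_bayes.ComplementNB()),
])
toy_ppln.fit(toy_texts, toy_topics)

# Neither text contains a vocabulary word, so both become all-zero vectors:
unknown_texts = ["", "zzz unknown words"]
probas = toy_ppln.predict_proba(unknown_texts)
preds = toy_ppln.predict(unknown_texts)

print(probas.round(3))  # identical probability for every class
print(preds)            # ties are resolved in favor of the first class
```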

## Summary and outlook

### Further reading

- A. Zheng, A. Casari: _Feature engineering for machine learning: principles and techniques for data scientists._ O'Reilly, 2018.
- I. Guyon, A. Elisseeff: "An introduction to variable and feature selection." _Journal of Machine Learning Research_ 3: 1157-1182, 2003.
- Y. Saeys, I. Inza, P. Larrañaga: "A review of feature selection techniques in bioinformatics." _Bioinformatics_ 23: 2507-2517, 2007.