# Feature Selection/Extraction
In this lab session we will use the scikit-learn library to work with the Titanic dataset. We will apply some feature selection techniques and experiment with dimensionality reduction using:
1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import DecisionBoundaryDisplay

We load the Titanic data through scikit-learn. The `X` data contains information about the passengers on board the Titanic. We will use that data to predict the class `y`, which indicates whether a passenger survived the accident or not.
The names and definitions of the different features can be found here: https://www.kaggle.com/c/titanic/data

In [None]:
# load the data here by using fetch_openml
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

### Question 1
Use the functions `head()`, `info()`, `describe()` from the `pandas` library to explore the different features and assess their type (categorical, numerical) and possible missing values (e.g., NaN).

In [None]:
# example
X.info()
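As an illustration of what these three functions report, here is a minimal sketch on a small hypothetical table (`toy` and its values are made up for this example and are not taken from the Titanic data). It also shows one possible way to separate numerical from non-numerical columns with `select_dtypes()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the passenger data
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 54.0],
    "fare": [7.25, 71.28, 8.05, 51.86],
    "sex": pd.Categorical(["male", "female", "female", "male"]),
})

print(toy.head(2))     # first two rows
toy.info()             # dtype and non-null count of each column
print(toy.describe())  # summary statistics of the numerical columns

# Separate the column names by kind
numerical = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()

# Number of missing values per column
missing = toy.isnull().sum()
print(numerical, categorical)
print(missing)
```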

### Question 2
Now, we are going to drop the `name` and `ticket` features which are not very informative (although the name may contain a noble or religious title). For that we use the function `drop()` 

In [None]:
X.drop(...)
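Note that `drop()` returns a new DataFrame by default and leaves the original untouched, so the result has to be assigned. A minimal sketch on a hypothetical toy table (the column values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy table with two uninformative columns
toy = pd.DataFrame({
    "name": ["A. Smith", "B. Jones", "C. Brown"],
    "ticket": ["T1", "T2", "T3"],
    "age": [22.0, 38.0, 26.0],
})

# drop() does not modify `toy`; keep the returned DataFrame
toy_reduced = toy.drop(columns=["name", "ticket"])

print(list(toy.columns))          # original is unchanged
print(list(toy_reduced.columns))  # only the remaining feature
```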

### Question 3
Check if any of the remaining features have missing values using `isnull()` and remove those with missing value ratios above 25% using the function `drop()` from `pandas`. 

In [None]:
print(X.isnull().mean())
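One possible way to turn the missing-value ratios into a list of columns to remove is sketched below on a hypothetical toy table (names and values are invented). The threshold is strict, so a column with exactly 25% missing values is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: `cabin` is mostly missing, `age` has one gap
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 54.0],
    "cabin": [np.nan, "C85", np.nan, np.nan],
    "fare": [7.25, 71.28, 8.05, 51.86],
})

# Fraction of missing values per column
ratios = toy.isnull().mean()

# Columns whose missing ratio is strictly above 25%
to_drop = ratios[ratios > 0.25].index
toy_kept = toy.drop(columns=to_drop)

print(ratios)
print(list(toy_kept.columns))
```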

### Question 4
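Some of the remaining features still have missing values. Here we decide to drop the affected rows (persons) from the data, using the function `notnull()` to find the indices of the rows to keep. Remember to apply the same selection to `y` so that features and labels stay aligned.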

In [None]:
idx = X.notnull().all(axis=1)
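A boolean mask like `idx` can be used to filter both the features and the labels. The sketch below uses hypothetical toy data (`toy_X`, `toy_y` are invented names) to show that the two stay aligned after filtering.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features and labels
toy_X = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 54.0],
    "fare": [7.25, 71.28, np.nan, 51.86],
})
toy_y = pd.Series(["0", "1", "1", "0"])

# True for rows in which every feature is present
mask = toy_X.notnull().all(axis=1)

# Apply the same mask to features and labels
toy_X_clean = toy_X[mask]
toy_y_clean = toy_y[mask]

print(toy_X_clean)
print(toy_y_clean)
```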

### Question 5
Check which features are categorical and which are numerical. For the `sex` variable, which is binary, we will simply use 0's and 1's. For the `embarked` variable, we want to use a one-hot encoding.

In [None]:
X['sex'] = X['sex'].map( {'female': 1, 'male': 0} ).astype(int)

In [None]:
# Get one-hot encoding of column `embarked`
one_hot = pd.get_dummies(X['embarked'], dtype=int)

# Join one_hot to the existing X

# Drop the column `embarked` in X
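The join and drop steps can be chained. Here is a minimal sketch on a hypothetical toy table (values invented for illustration), assuming the one-hot columns do not clash with existing column names:

```python
import pandas as pd

# Hypothetical toy table with a categorical port-of-embarkation column
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 8.05, 51.86],
    "embarked": ["S", "C", "Q", "S"],
})

# One indicator column per category value
one_hot = pd.get_dummies(toy["embarked"], dtype=int)

# Append the indicator columns, then remove the original column
toy_encoded = toy.join(one_hot).drop(columns="embarked")

print(toy_encoded)
```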


### Question 6
Split `X` and `y` into training and testing sets with 20% of the data in the testing set by using `train_test_split()` from `sklearn`.
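As a reminder of how `train_test_split()` behaves, here is a small sketch on synthetic arrays (`X_demo` and `y_demo` are invented for this example). Fixing `random_state` makes the split reproducible, and `stratify` is an optional choice that keeps the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples, 2 features, balanced binary labels
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# 20% of the samples go to the testing set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
```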

### Question 7
We will now train a classifier on the training dataset. 
For that, we will use a pipeline that preprocesses the data and then passes it to a classifier.

The idea is to try different classifiers:
- `LogisticRegression()`
- `SVC(kernel='linear')`
- `KNeighborsClassifier(n_neighbors=3)`
- `DecisionTreeClassifier()`
- `RandomForestClassifier()`

Which one gives the best results?

In [None]:
model = Pipeline(steps=[
    ('normalizer', StandardScaler()),
    ('classifier', LogisticRegression())
])
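One way to compare the classifiers listed above is to loop over them and reuse the same pipeline structure. The sketch below runs on synthetic data from `make_classification` (an assumption made so that the example is self-contained), so the scores it prints say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned passenger data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

candidates = {
    "logistic regression": LogisticRegression(),
    "linear SVC": SVC(kernel="linear"),
    "3-NN": KNeighborsClassifier(n_neighbors=3),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, clf in candidates.items():
    # Same preprocessing for every classifier
    pipe = Pipeline(steps=[
        ("normalizer", StandardScaler()),
        ("classifier", clf),
    ])
    pipe.fit(X_tr, y_tr)
    scores[name] = pipe.score(X_te, y_te)  # accuracy on the testing set

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

A single train/test split gives a noisy estimate, so small differences between classifiers should be interpreted with care.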

In [None]:
# Fit the training set here

# Predict the class of the testing dataset
y_predict = ...

# Assess the quality of the prediction
print("Number of errors:", np.sum(y_predict != y_test))
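Besides the raw number of errors, `sklearn.metrics` offers `accuracy_score()` and `confusion_matrix()`. A minimal sketch with hypothetical label arrays (invented for this example, using string labels as an assumption about how the classes are stored):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels
y_true_demo = np.array(["0", "1", "1", "0", "1"])
y_pred_demo = np.array(["0", "1", "0", "0", "1"])

n_errors = int(np.sum(y_pred_demo != y_true_demo))
acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

print("Number of errors:", n_errors)
print("Accuracy:", acc)
print(cm)
```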

In [None]:
def plot_decision_boundaries(model, X, ypred, ytrue):
    _, ax = plt.subplots()
    
    # Plot decision boundary of the model
    DecisionBoundaryDisplay.from_estimator(
        model,
        X[:, :2],
        ax=ax,
        cmap='cool',
        response_method="predict",
        plot_method="pcolormesh",
        shading="auto",
    )

    # Plot the testing points
    plt.scatter(
        X[:, 0],
        X[:, 1],
        c=ytrue.astype(float),
        marker='o',
        cmap='jet',
    )
    
    # Plot the misclassified points, colored by their predicted class
    plt.scatter(
        X[ypred != ytrue, 0],
        X[ypred != ytrue, 1],
        c=ypred[ypred != ytrue].astype(float),
        marker='*',
        cmap='jet',
    )
    

### Question 8
Redo question 7, but apply PCA to the preprocessed data before fitting, in order to reduce the dimension of the features. Change the number of components used in PCA and check how that affects the testing accuracy. When the number of components is 2, you can use the function `plot_decision_boundaries()` to plot the decision boundary together with the predicted classes and the real classes.

In [None]:
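# Standardize first: PCA is sensitive to the scale of the features
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2)
pca.fit(scaler.transform(X_train))
Xnew_train = pca.transform(scaler.transform(X_train))
Xnew_test = pca.transform(scaler.transform(X_test))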
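An alternative to transforming the data by hand is to place PCA inside the pipeline, between the scaler and the classifier. The sketch below is one possible arrangement, run on synthetic data from `make_classification` (an assumption for self-containment), and loops over a few values of `n_components`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned passenger data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

results = {}
for k in (1, 2, 4):
    pipe = Pipeline(steps=[
        ("normalizer", StandardScaler()),
        ("pca", PCA(n_components=k)),
        ("classifier", LogisticRegression()),
    ])
    pipe.fit(X_tr, y_tr)
    # Share of the total variance retained by the k components
    retained = pipe.named_steps["pca"].explained_variance_ratio_.sum()
    results[k] = (pipe.score(X_te, y_te), retained)

for k, (acc, retained) in results.items():
    print(f"n_components={k}: accuracy={acc:.3f}, variance retained={retained:.3f}")
```

Because the scaler and PCA are fitted only on the training data inside `fit()`, the testing set does not leak into the preprocessing.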