# Feature Selection/Extraction
In this lab session we will use the scikit-learn library to work with the Titanic dataset. We will apply some feature selection techniques and experiment with dimensionality reduction using:
1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

from sklearn.compose import ColumnTransformer
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

### Question 1
Load the Titanic data with scikit-learn. The data contains information about the passengers on board the Titanic, and we will use it to predict whether a passenger survived the accident. <br>
After loading the dataset, split it into training and testing sets, with 20% of the data in the testing set.

In [None]:
# load the data here by using fetch_openml
np.random.seed(42)

In [None]:
# split the data here by using train_test_split

In [None]:
X_train.head()
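
A minimal sketch of the loading and splitting steps follows. The helper name `load_titanic`, the OpenML arguments `version=1` and `as_frame=True`, and the synthetic stand-in frame are assumptions made for illustration. The split itself is shown on the stand-in so that the snippet runs without a download.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split


def load_titanic():
    """Download the Titanic data from OpenML (needs network access)."""
    # version=1 and as_frame=True are assumptions, not given in the lab text
    X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
    return X, y


# Small synthetic stand-in so the split can be shown without a download
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"age": rng.integers(1, 80, size=50),
                       "fare": rng.uniform(5, 100, size=50)})
y_demo = pd.Series(rng.integers(0, 2, size=50), name="survived")

# test_size=0.2 keeps 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```

In the lab itself, `X, y = load_titanic()` would replace the stand-in frame before the split.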

### Question 2
Check whether any of the features have missing values and remove those with a high missing-value ratio.
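
One possible way to approach this is sketched below on a toy frame. The column values and the 0.5 threshold are hypothetical; the lab text does not say what counts as a "high" ratio, so pick the cut-off after looking at the actual ratios.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for X_train / X_test
X_train = pd.DataFrame({
    "age":   [22.0, np.nan, 26.0, 35.0],
    "cabin": [np.nan, "C85", np.nan, np.nan],
    "fare":  [7.25, 71.28, 7.92, 53.10],
})
X_test = X_train.copy()

# Fraction of missing entries per column, computed on the training set only
missing_ratio = X_train.isna().mean()
print(missing_ratio)

threshold = 0.5  # hypothetical cut-off
cols_to_drop = missing_ratio[missing_ratio > threshold].index.tolist()

# Drop the same columns from both sets so they keep identical features
X_train = X_train.drop(columns=cols_to_drop)
X_test = X_test.drop(columns=cols_to_drop)
```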

### Question 3
Plot the correlation matrix of the numerical data.

In [None]:
# Plot the correlation matrix by using sns.heatmap
X_comb = pd.concat([X_train, y_train.astype(float)], axis=1)
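
As a rough sketch of the computation behind the plot, the snippet below builds the correlation matrix on a made-up frame. Restricting to numeric columns with `select_dtypes` is one option among several, and the toy values are not real Titanic data.

```python
import pandas as pd

# Made-up values standing in for the training data
X_train = pd.DataFrame({
    "sibsp": [1, 0, 3, 1, 0, 4],
    "parch": [0, 0, 2, 1, 0, 2],
    "fare":  [7.3, 8.1, 21.1, 16.7, 7.9, 31.3],
    "sex":   ["male", "female", "male", "female", "male", "female"],
})
y_train = pd.Series([0, 1, 0, 1, 0, 0], name="survived")

# Append the target so its correlation with each feature is visible too
X_comb = pd.concat([X_train, y_train.astype(float)], axis=1)

# Pearson correlation over the numeric columns only ('sex' is skipped)
corr = X_comb.select_dtypes(include="number").corr()
print(corr.round(2))
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would then draw the matrix with the coefficient written in each cell.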

### Question 4
Combine the parch and sibsp features into one feature called family_size since they are highly correlated.

In [None]:
for dataset in [X_train, X_test]:
    # TODO: add dataset['family_size'] from 'sibsp' and 'parch'
    pass

X_train.head()
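
A possible version of this step is sketched below on toy frames. Adding 1 to count the passenger, creating `is_alone` (which appears among the categorical columns in Question 6), and dropping the two original columns are all assumptions rather than requirements stated in the question.

```python
import pandas as pd

# Toy frames standing in for X_train / X_test
X_train = pd.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 0, 2]})
X_test = pd.DataFrame({"sibsp": [0, 1], "parch": [0, 1]})

for dataset in [X_train, X_test]:
    # +1 counts the passenger (an assumption; plain sibsp + parch also works)
    dataset["family_size"] = dataset["sibsp"] + dataset["parch"] + 1
    # Flag passengers travelling without relatives (assumed definition)
    dataset["is_alone"] = (dataset["family_size"] == 1).astype(int)
    # The combined feature replaces the two correlated ones
    dataset.drop(columns=["sibsp", "parch"], inplace=True)

print(X_train)
```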

### Question 5
Extract the title from the 'name' feature so that we can get some useful information from it.
Check all the titles created for males and females, and group those that are redundant or rare.

In [None]:
# Create the title feature here
for dataset in [X_train, X_test]:
    dataset['title'] =  dataset['name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
    dataset.drop(["name"], axis=1, inplace=True)

X_train.head()
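
To see what the double split does, the sketch below applies it to three sample names in the "Surname, Title. Given names" layout. The regex version with `str.extract` is an alternative shown for comparison, not something the lab asks for.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Keep the text between ", " and the first "."
titles = names.str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

# Alternative: capture everything after the comma up to the first period
titles_regex = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(titles.tolist())
```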

In [None]:
# Use this to group the rare titles
X_comb = pd.concat([X_train, X_test])
rare_titles = (X_comb['title'].value_counts() < 10)
rare_titles
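
The boolean Series `rare_titles` can be used as a lookup table. A minimal sketch on made-up titles follows; the `synonyms` mapping and the label `"Rare"` are hypothetical choices.

```python
import pandas as pd

# Made-up title columns standing in for the real ones
X_train = pd.DataFrame({"title": ["Mr"] * 6 + ["Miss"] * 4 + ["Mlle", "Col"]})
X_test = pd.DataFrame({"title": ["Mr"] * 5 + ["Miss"] * 7 + ["Ms", "Dr"]})

X_comb = pd.concat([X_train, X_test])
rare_titles = X_comb["title"].value_counts() < 10

# Hypothetical mapping of spelling variants onto a common title
synonyms = {"Mlle": "Miss", "Ms": "Miss"}

for dataset in [X_train, X_test]:
    dataset["title"] = dataset["title"].replace(synonyms)
    # rare_titles is indexed by title, so .map looks up each row's flag
    is_rare = dataset["title"].map(rare_titles).astype(bool)
    dataset.loc[is_rare, "title"] = "Rare"

print(X_train["title"].value_counts())
```

Merging the variants before applying the rare flag keeps "Mlle" and "Ms" from ending up in the catch-all group.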

In [None]:
# Group other titles here that you think can be grouped
for dataset in [X_train, X_test]:
    # TODO: replace redundant or rare values in dataset['title']
    pass

In [None]:
# Drop the ticket feature here
for dataset in [X_train, X_test]:
    # TODO: drop the 'ticket' column from dataset
    pass

X_train.head()

### Question 6
Check which features are categorical and which are numerical, and define two different transformers to preprocess the data. <br>
For categorical features, replace the missing values with the most frequent value and one-hot encode the categories.<br>
For numerical features, replace the missing values with the mean and standardize the data (zero mean, unit variance).

In [None]:
X_train.dtypes

In [None]:
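# define the categorical preprocessing here by using Pipeline()
# note: 'is_alone' is not created in any cell above; presumably it is derived from family_size in Question 4
cat_cols = ['embarked', 'sex', 'pclass', 'title', 'is_alone']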

In [None]:
# define the numerical preprocessing here by using Pipeline()
num_cols = ['age', 'fare', 'family_size']

In [None]:
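# define the overall preprocessor by using ColumnTransformer()
# TODO: fill in transformers=[(name, transformer, columns), ...] for cat_cols and num_cols
preprocessor = ColumnTransformer(transformers=[])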
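
The three cells above could be filled in along the following lines. This is a sketch on a four-row toy frame with shortened column lists; the step names (`"imputer"`, `"onehot"`, `"scaler"`, `"cat"`, `"num"`) and `handle_unknown="ignore"` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

cat_cols = ["embarked", "sex"]  # shortened lists for the toy frame
num_cols = ["age", "fare"]

X_train = pd.DataFrame({
    "embarked": ["S", "C", np.nan, "S"],
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Categorical: fill gaps with the most frequent value, then one-hot encode
cat_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
# Numerical: fill gaps with the column mean, then standardize
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
# Route each column group to its own transformer
preprocessor = ColumnTransformer(transformers=[
    ("cat", cat_transformer, cat_cols),
    ("num", num_transformer, num_cols),
])

X_prepared = preprocessor.fit_transform(X_train)
print(X_prepared.shape)
```

`handle_unknown="ignore"` keeps the encoder from failing if the test set contains a category that never appeared during training.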

### Question 7
Define a pipeline that preprocesses the data and then feeds it to a random forest classifier.<br>
Fit the model on the training set and evaluate it on the testing set.

In [None]:
model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])

In [None]:
# fit the training set here

In [None]:
# apply the prediction here

In [None]:
# use classification_report() to evaluate the model

In [None]:
# plot the confusion matrix by using ConfusionMatrixDisplay.from_estimator()
# (plot_confusion_matrix() was removed in scikit-learn 1.2)
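
The fit, predict and evaluate cycle might look like the sketch below. It runs on synthetic data in which the label is fully determined by `sex`, so the numbers say nothing about the real Titanic data; the simplified preprocessor and `random_state=0` are assumptions made to keep the snippet short and repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Titanic features (not real data)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "age": rng.uniform(1, 80, size=n),
    "fare": rng.uniform(5, 100, size=n),
})
y = (X["sex"] == "female").astype(int)  # label fully determined by 'sex'

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ("num", StandardScaler(), ["age", "fare"]),
])
model = Pipeline(steps=[("preprocessor", preprocessor),
                        ("classifier", RandomForestClassifier(random_state=0))])

model.fit(X_train, y_train)        # preprocessing is fitted on training data only
y_pred = model.predict(X_test)     # the same fitted preprocessing is reused here
accuracy = model.score(X_test, y_test)

print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(cm)
```

In a notebook, `ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)` from `sklearn.metrics` would draw the same matrix as a plot.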

### Question 8
Redo Question 7, but apply PCA to the preprocessed data before the classifier to reduce the dimension of the features. Change the number of components used in PCA and check how that affects the testing accuracy.

In [None]:
n_components = 2
modelpca = Pipeline(steps=[('preprocessor', preprocessor),
                      ('pca', PCA(n_components=n_components)),
                      ('classifier', RandomForestClassifier())])

In [None]:
# fit the training set here

In [None]:
# apply the prediction here

In [None]:
# use classification_report() to evaluate the model

In [None]:
# plot the confusion matrix by using ConfusionMatrixDisplay.from_estimator()
# (plot_confusion_matrix() was removed in scikit-learn 1.2)
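
One way to run the comparison is to loop over several values of `n_components`, as sketched below. The data comes from `make_classification` rather than the Titanic set, and a plain `StandardScaler` stands in for the full preprocessor; both are simplifications for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data standing in for the preprocessed features
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

scores = {}
for n_components in [1, 2, 5, 10]:
    modelpca = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=n_components)),
        ("classifier", RandomForestClassifier(random_state=0)),
    ])
    modelpca.fit(X_train, y_train)
    scores[n_components] = modelpca.score(X_test, y_test)
    # Share of the total variance kept by the retained components
    explained = modelpca.named_steps["pca"].explained_variance_ratio_.sum()
    print(n_components, round(float(explained), 3), round(scores[n_components], 3))
```

Comparing the explained variance with the test accuracy shows whether the discarded components carried information the classifier needed.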

### Question 9
Redo Question 7, but apply LDA to the preprocessed data before the classifier to reduce the dimension of the features, and check how that affects the testing accuracy.

In [None]:
modellda = Pipeline(steps=[('preprocessor', preprocessor),
                      ('lda', LinearDiscriminantAnalysis(n_components=1)),
                      ('classifier', RandomForestClassifier())])

In [None]:
# fit the training set here

In [None]:
# apply the prediction here

In [None]:
# use classification_report() to evaluate the model

In [None]:
# plot the confusion matrix by using ConfusionMatrixDisplay.from_estimator()
# (plot_confusion_matrix() was removed in scikit-learn 1.2)
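
A sketch of the LDA variant follows, again on synthetic data with a plain `StandardScaler` in place of the full preprocessor (both assumptions). Unlike PCA, LDA uses the labels, and it can produce at most `min(n_features, n_classes - 1)` components, which is why `n_components=1` is the only choice for a two-class target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data standing in for the preprocessed features
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Upper bound on the number of LDA components: 1 for a binary target
max_components = min(X_train.shape[1], len(np.unique(y_train)) - 1)

modellda = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis(n_components=max_components)),
    ("classifier", RandomForestClassifier(random_state=0)),
])
modellda.fit(X_train, y_train)
accuracy = modellda.score(X_test, y_test)

# Slicing the pipeline gives every step before the classifier
X_proj = modellda[:-1].transform(X_test)
print(max_components, X_proj.shape, round(accuracy, 3))
```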