<img src="https://drive.google.com/uc?id=1-cL5eOpEsbuIEkvwW2KnpXC12-PAbamr" style="Width:1000px">

# Feature Selection

🎯 This exercise is a continuation of exercise 3 on day 1 (***01-Data-Preparation/03-Preprocessing-Workflow***). Here, you will perform feature selection to determine which features are the most important.

👇 Run the cell below to load your preprocessed dataset. No need to worry about scaling or missing data: both have already been taken care of for you.

In [None]:
from nbta.utils import download_data
download_data(id='1PV1BDs1dIob8E40wqgNw2IyPuqkZVaA0')

In [None]:
import pandas as pd

data = pd.read_csv("raw_data/clean_dataset.csv")

data.head()

# Collinearity investigation

First, create a new variable called <code>features</code> that contains all of our features, including <code>Depth CSF-A (m)</code>, but excluding any other column that relates to the expedition, well, core, section name or core top.

👇 Plot a heatmap of the Pearson Correlation between the dataset columns.

<details>
    <summary>💡 Hint</summary>
ℹ️ The easiest way to draw heatmaps is to use the Seaborn <code>heatmap()</code> function. <a href='https://seaborn.pydata.org/generated/seaborn.heatmap.html'>Read the documentation</a> and do all of the necessary imports. Don't forget that you will also need to compute a correlation matrix of your features.
</details>

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

features = data[['Depth CSF-A (m)',
       'NGR total counts (cps)', 'Reflectance L*', 'Reflectance a*',
       'Reflectance b*', 'H', 'X', 'R']]


In [None]:
corr = features.corr()
corr

In [None]:
sns.heatmap(corr)

👇 Visualize the correlation between column pairs in a dataframe.

<details>
    <summary>💡 Hint</summary>
ℹ️ You should investigate the Seaborn <a href='https://seaborn.pydata.org/generated/seaborn.pairplot.html'><code>pairplot()</code></a> function.
</details>

In [None]:
sns.pairplot(features);

❓ How many pairs of features have a correlation above 0.9 or below -0.9? Save your answer as an integer under the variable name `correlated_features`.
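
One way to count such pairs without reading them off the heatmap is to look only at the upper triangle of the correlation matrix, so that each pair is counted once and the diagonal is ignored. The sketch below uses a small synthetic dataframe; the names `toy`, `toy_corr` and `strong_pairs` are illustrative and not part of the exercise.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the exercise features (made-up values).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                  # independent noise
})

toy_corr = toy.corr()

# Keep only the strict upper triangle so each pair is counted once
# and the diagonal (always 1.0) is ignored.
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = toy_corr.where(upper).stack()

# Pairs whose absolute correlation exceeds the 0.9 threshold.
strong_pairs = pairs[pairs.abs() > 0.9]
n_strong_pairs = len(strong_pairs)
print(strong_pairs)
```

The same pattern should apply to the `corr` matrix computed earlier, although the count it gives depends on your data.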

In [None]:
correlated_features = 0

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('correlation',
                         correlated_features = correlated_features
)

result.write()
print(result.check())

# Base Modelling

We want to model the natural gamma ray response (<code>NGR total counts (cps)</code>) as a function of the other features. NGR is useful for predicting lithologies because it is highly correlated with the presence of clay minerals and organics in rocks. <br> 
👇 Prepare the feature set `X` and target `y`. Remember that we want to model the `NGR total counts (cps)` with the preprocessed features.

In [None]:
X = features.drop('NGR total counts (cps)', axis=1).copy()
y = features[['NGR total counts (cps)']].copy()

👇 Cross-validate a linear regression model. Save the best score under the variable name `base_model_score`.
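
As a reminder of what `cross_validate` returns, here is a minimal sketch on a synthetic regression problem. The data from `make_regression` and the names `X_toy`, `y_toy` and `fold_scores` are assumptions made for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression problem (not the exercise data).
X_toy, y_toy = make_regression(
    n_samples=200, n_features=4, noise=10.0, random_state=0
)

# cross_validate returns a dict; 'test_score' holds one score per fold.
cv_toy = cross_validate(LinearRegression(), X_toy, y_toy, cv=5, scoring="r2")

fold_scores = cv_toy["test_score"]  # array with 5 R² values
best_fold = fold_scores.max()       # best single fold
mean_score = fold_scores.mean()     # average across folds
print(fold_scores, best_fold, mean_score)
```

The mean across folds is the more common summary of model performance; the maximum is shown here because the exercise asks for the best score.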

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression
import numpy as np

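# 5-fold cross-validation (scikit-learn's default), scored with R²
cv = cross_validate(LinearRegression(), X, y, scoring='r2')

scores = cv['test_score']

# Best fold score; R² can be negative, so do not take the absolute value
base_model_score = np.max(scores)
base_model_score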

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('base_model',
                         score = base_model_score
)

result.write()
print(result.check())

# Feature Permutation

👇 Perform feature permutation, and rank features by order of importance.
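
Conceptually, permutation importance shuffles one column at a time and measures how much the model's score drops: the larger the drop, the more the model relies on that feature. The sketch below implements this by hand on synthetic data with a single shuffle per column. It is a simplified illustration, not scikit-learn's implementation, which repeats the shuffle `n_repeats` times and averages the results. All names (`X_toy`, `y_toy`, `drops`) are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: the target depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
X_toy = rng.normal(size=(300, 3))
y_toy = 3.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=300)

toy_model = LinearRegression().fit(X_toy, y_toy)
baseline = toy_model.score(X_toy, y_toy)  # R² with intact features

drops = []
for col in range(X_toy.shape[1]):
    X_shuffled = X_toy.copy()
    # Shuffling one column breaks its link with the target.
    X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
    drops.append(baseline - toy_model.score(X_shuffled, y_toy))

ranking = np.argsort(drops)[::-1]  # column indices, most important first
print(drops, ranking)
```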

In [None]:
from sklearn.inspection import permutation_importance

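# Fit the model once on the full dataset
model = LinearRegression().fit(X, y)

# Shuffle each feature 10 times and record the resulting drop in score
permutation_score = permutation_importance(model, X, y, n_repeats=10)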

In [None]:
# One row per feature, with the mean drop in score over the repeats
importance_df = pd.DataFrame({
    'feature': X.columns,
    'score decrease': permutation_score.importances_mean,
})

importance_df = importance_df.sort_values(by='score decrease', ascending=False)

importance_df

❓ Which feature is the most important? Save your answer as a `string` under the variable name `best_feature`.

In [None]:
best_feature = 'Reflectance L*'

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('feature_permutation',
                         feature = best_feature
)

result.write()
print(result.check())

# Reduced complexity model

👇 Drop the weak features and cross-validate a new model. You should aim to maintain a score close to the previous one (though it may fall a bit). Save the score under the variable name `simplified_model_score`.
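
What counts as a "weak" feature is a judgment call. One possible approach is to keep only the features whose score decrease exceeds a cut-off. The sketch below uses a made-up importance table and a hypothetical threshold of 0.01; neither comes from the exercise data.

```python
import pandas as pd

# Made-up importance table; the names and values are illustrative only.
toy_importance = pd.DataFrame({
    "feature": ["f1", "f2", "f3", "f4"],
    "score decrease": [0.80, 0.15, 0.004, 0.0001],
})

threshold = 0.01  # hypothetical cut-off, to be tuned per problem

# Keep the rows whose importance exceeds the cut-off.
strong = toy_importance[toy_importance["score decrease"] > threshold]
kept_features = strong["feature"].tolist()
print(kept_features)
```

The resulting list of names could then be used to index the feature dataframe before cross-validating again.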

In [None]:
simplified_X = data[['Reflectance L*','Depth CSF-A (m)','H','X','Reflectance a*']].copy()

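# Default scoring for a regressor is R², matching the base model
cv2 = cross_validate(LinearRegression(), simplified_X, y)

scores = cv2['test_score']

# Best fold score, without abs() so that a negative R² is not masked
simplified_model_score = np.max(scores)
simplified_model_score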

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('reduced_complexity_model',
                         model_score = simplified_model_score
)

result.write()
print(result.check())

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>, and move on to the next one.