## Home assignment 08: Feature importances


Please fill in the missing lines in the code below.

Your goal is to estimate feature importances for Logistic Regression and Gradient Boosting using several methods.

The models should be trained using only the `train` part of the data.


In this task you will work with the [dataset](https://archive.ics.uci.edu/ml/datasets/Statlog+%28Vehicle+Silhouettes%29) describing vehicle silhouettes. The original problem is a multiclass ($k=4$) classification problem, but here we use only the binary subset with the classes `bus` and `opel`. The data is available below.

In [None]:
# If on Colab (or if `car_data.csv` is not available locally), run the following line to download the data

! wget https://raw.githubusercontent.com/girafe-ai/ml-course/23f_basic/homeworks/lab01_ml_pipeline/car_data.csv

In [None]:
import pandas as pd
import numpy as np
import shap

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

from matplotlib import pyplot as plt


dataset = pd.read_csv('car_data.csv', delimiter=',', header=None).values
data = dataset[:, :-1].astype(int)
target = dataset[:, -1]
binary_subset = np.array([x in ['bus', 'opel'] for x in target])
data, target = data[binary_subset], target[binary_subset]

print(data.shape, target.shape)

In [None]:
# do not change the code in the block below
# __________start of block__________
submission_dict = {}
# __________end of block__________

In [None]:
X_train, y_train = data[:350, 1:], target[:350]
X_val, y_val = data[350:, 1:], target[350:]
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)

#### Estimating feature importances using logistic regression coefficients.
Train a basic logistic regression and save its coefficients (weights).

In [None]:
lr_basic = None # YOUR CODE HERE

Check the classification results on the original train data:

In [None]:
print(classification_report(y_train, lr_basic.predict(X_train)))

And on validation:

In [None]:
print(classification_report(y_val, lr_basic.predict(X_val)))

Find the Logistic Regression weights and save them to the variable `lr_basic_coef`:

In [None]:
lr_basic_coef = None # YOUR CODE HERE

It should have as many coefficients as there are features.
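
As a side illustration of why the check below looks at `shape[-1]`, here is a minimal sketch on synthetic data. The names `X_toy`, `y_toy` and `toy_model` are hypothetical and unrelated to the homework variables; the sketch only shows the layout scikit-learn uses for the weights of a fitted binary `LogisticRegression`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 100 objects, 3 features, two string labels
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.where(X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0, "a", "b")

toy_model = LogisticRegression().fit(X_toy, y_toy)

# For a binary problem the weights are stored as a 2D array
# with a single row: (1, n_features)
print(toy_model.coef_.shape)
print(toy_model.intercept_.shape)
```

Because the weights come as a 2D array with one row, the last axis is the one that matches the number of features.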

In [None]:
assert lr_basic_coef.shape[-1] == X_train.shape[1]

In [None]:
# do not change the code in the block below
# __________start of block__________
submission_dict['lr_basic_coef'] = lr_basic_coef
# __________end of block__________

#### Estimating feature importances using logistic regression coefficients on scaled data.
Train a basic logistic regression on scaled data and save its coefficients (weights) as well.

In [None]:
lr_scaled = None # YOUR CODE HERE

Use `StandardScaler` on your data.
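
To make clear what standardization does, here is a minimal NumPy sketch of the same transformation on hypothetical toy arrays (`train_toy`, `val_toy`). It assumes the usual convention that the mean and standard deviation are estimated on the train part only and then reused for validation; `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Hypothetical toy data, not the homework data
train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
val_toy = np.array([[2.0, 30.0]])

# Statistics are computed on the train part only
mean = train_toy.mean(axis=0)
std = train_toy.std(axis=0)  # ddof=0, i.e. population standard deviation

# The same train statistics are applied to both parts
train_scaled = (train_toy - mean) / std
val_scaled = (val_toy - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(val_scaled)
```

After the transformation every train column has zero mean and unit variance, so the magnitudes of linear model weights become comparable across features.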

In [None]:
scaler = None # YOUR CODE HERE
X_train_scaled = None # YOUR CODE HERE
X_val_scaled = None # YOUR CODE HERE

In [None]:
# YOUR CODE HERE

Check the classification results on the scaled train data:

In [None]:
print(classification_report(y_train, lr_scaled.predict(X_train_scaled)))

And on validation:

In [None]:
print(classification_report(y_val, lr_scaled.predict(X_val_scaled)))

Save model coefficients to the variable `lr_scaled_coef`:

In [None]:
lr_scaled_coef = None # YOUR CODE HERE

It should also have as many coefficients as there are features.

In [None]:
assert lr_scaled_coef.shape[-1] == X_train_scaled.shape[1]

Save the index of the most important feature for `lr_scaled` to the variable `lr_scaled_most_important_index`:
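
One common convention for linear models trained on standardized features is to rank features by the absolute value of their weights. The sketch below illustrates that convention on a hypothetical weight array `coef_toy`; whether the grading system uses exactly this criterion is an assumption.

```python
import numpy as np

# Hypothetical weights of a binary linear model, shape (1, n_features)
coef_toy = np.array([[0.3, -2.1, 1.4]])

# Rank by magnitude: a large negative weight matters as much as a large positive one
most_important_toy = int(np.argmax(np.abs(coef_toy[0])))
print(most_important_toy)
```

Here the second feature (index 1) wins because its weight has the largest magnitude, even though it is negative.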

In [None]:
lr_scaled_most_important_index = None # YOUR CODE HERE

In [None]:
# do not change the code in the block below
# __________start of block__________
assert isinstance(int(lr_scaled_most_important_index), int)
submission_dict['lr_scaled_coef'] = lr_scaled_coef
submission_dict['lr_scaled_most_important_index'] = lr_scaled_most_important_index
# __________end of block__________

#### Estimating feature importances for logistic regression using shap
Use the [`shap` library](https://shap.readthedocs.io/en/latest/index.html) to check the importance of the features. Use the [`Linear` explainer](https://shap.readthedocs.io/en/latest/generated/shap.explainers.Linear.html) and the scaled data.

In [None]:
explainer = None # YOUR CODE HERE
shap_values_scaled = explainer(X_train_scaled)

Summary plot:

In [None]:
shap.summary_plot(shap_values_scaled, X_train_scaled)

Finally, write a function that converts shap values into Logistic Regression coefficients. The relation between them is described in the [docs](https://shap.readthedocs.io/en/latest/generated/shap.explainers.Linear.html).

*Note: the main goal of this task is to deepen your understanding of how shap estimates feature importance.*
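
For intuition, the relation from the linked docs can be sketched in the forward direction with plain NumPy. Under the feature-independence assumption, the SHAP value of feature $i$ for a linear model is $w_i (x_i - \mathbb{E}[x_i])$, computed for the raw (log-odds) output. All names below (`w_toy`, `b_toy`, `X_toy`) are hypothetical toy values, and the mean is estimated on the same toy data.

```python
import numpy as np

# Hypothetical linear model: raw output = X @ w + b
w_toy = np.array([2.0, -1.0])
b_toy = 0.5
X_toy = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]])

# Background expectation of every feature
mu = X_toy.mean(axis=0)

# SHAP values of a linear model with independent features:
# weight times the deviation of the feature from its mean
phi = w_toy * (X_toy - mu)

# Additivity: base value plus the SHAP values of a row
# reproduces the raw model output for that row
base_value = mu @ w_toy + b_toy
raw_output = X_toy @ w_toy + b_toy
print(np.allclose(base_value + phi.sum(axis=1), raw_output))
```

The function you write has to go the other way, from the SHAP values and the data back to the weights.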

In [None]:
def get_coef_from_shap_values(shap_values, X_train_scaled):
    # YOUR CODE HERE
    return None

In [None]:
coef_from_shap = get_coef_from_shap_values(shap_values_scaled, X_train_scaled)

If everything is correct, the next assert should pass.

In [None]:
assert np.allclose(coef_from_shap, lr_scaled_coef)

#### Training the GradientBoosting

In [None]:
gb_basic = GradientBoostingClassifier(n_estimators=10)
gb_basic.fit(X_train, y_train)

In [None]:
gb_basic_feature_importances = None # YOUR CODE HERE

In [None]:
gb_scaled = GradientBoostingClassifier(n_estimators=10)
gb_scaled.fit(X_train_scaled, y_train)

In [None]:
gb_scaled_feature_importances = None # YOUR CODE HERE

In [None]:
# do not change the code in the block below
# __________start of block__________
assert np.allclose(gb_basic_feature_importances, gb_scaled_feature_importances, atol=1e-1)
submission_dict['gb_basic_feature_importances'] = gb_basic_feature_importances
# __________end of block__________

**Question:** Why are the feature importances so similar for scaled and unscaled data?

#### Using shap to explain the tree ensemble

In [None]:
explainer = None # YOUR CODE HERE
shap_values = explainer.shap_values(X_train)

In [None]:
shap.summary_plot(shap_values, X_train)

In [None]:
gb_scaled_most_important_index = None # YOUR ANSWER HERE

In [None]:
# do not change the code in the block below
# __________start of block__________
assert isinstance(int(gb_scaled_most_important_index), int)
submission_dict['gb_scaled_most_important_index'] = gb_scaled_most_important_index
# __________end of block__________

In [None]:
# do not change the code in the block below
# __________start of block__________
np.save('submission_dict_hw08.npy', submission_dict, allow_pickle=True)
print('File saved to `submission_dict_hw08.npy`')
# __________end of block__________

Great job! Please submit your solution to the grading system. Note that you need to submit both the `submission_dict_hw08.npy` file and the code of the `get_coef_from_shap_values` function.