## **Importing Libraries**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
sns.set(style='whitegrid')

## **Building Machine Learning Model**

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

In [None]:
df_final = pd.read_csv('delaney_mordred_truncated.csv')

In [None]:
df_final.head()

In [None]:
y = df_final['measured log(solubility:mol/L)']

**Scaling the Dataset**

In [None]:
# Pass the column Index directly; wrapping it in a list would build a MultiIndex
scaled_DF = pd.DataFrame(StandardScaler().fit_transform(df_final.iloc[:, 3:]), columns=df_final.columns[3:])

In [None]:
scaled_DF

**Train-Test Split**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(scaled_DF, y, test_size=0.20, random_state=45)

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
print(f'The r2 score for train set is : {lr.score(X_train, y_train)}')
print(f'The r2 score for test set is : {lr.score(X_test, y_test)}')

In [None]:
rf = RandomForestRegressor(random_state=42)  # fixed seed so scores and importances are reproducible
rf.fit(X_train, y_train)

In [None]:
print(f'The r2 score for train set is : {rf.score(X_train, y_train)}')
print(f'The r2 score for test set is : {rf.score(X_test, y_test)}')

**Scikit-learn's Feature Importance Algorithm**



> In scikit-learn, a fitted Random Forest regressor exposes impurity-based feature importances through the `feature_importances_` attribute. Each score reflects how much a feature reduces the impurity (for regression, the mean squared error) across the splits of the decision trees in the forest. The scores are normalized to sum to 1, and they are computed from the training data alone.
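
To make "reducing the impurity" concrete, the sketch below computes the variance reduction of a single candidate split by hand in plain Python. The toy data and the helper names (`variance`, `impurity_decrease`) are illustrative assumptions, not scikit-learn internals; roughly speaking, the library accumulates sample-weighted decreases like this over every split in every tree and then normalizes the totals.

```python
def variance(values):
    """Mean squared deviation from the mean (the MSE impurity of a node)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(x, y, threshold):
    """Weighted drop in variance when the samples are split at x <= threshold."""
    left = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(y)
    children = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(y) - children


# Toy data (hypothetical): the target follows feature_a and ignores feature_b.
feature_a = [1, 2, 3, 4, 5, 6]
feature_b = [3, 1, 2, 3, 1, 2]
target = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]

gain_a = impurity_decrease(feature_a, target, threshold=3.5)
gain_b = impurity_decrease(feature_b, target, threshold=1.5)
print(f"feature_a gain: {gain_a:.3f}, feature_b gain: {gain_b:.3f}")
```

A split on `feature_a` removes almost all of the variance in the target, while a split on `feature_b` removes almost none, which is the behaviour the importance scores summarize.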



In [None]:
# Impurity-based importance of each descriptor from the fitted random forest
importance = rf.feature_importances_

# Pair each descriptor name with its score and rank from most to least important
DF_imp = pd.DataFrame({
    'Features': df_final.columns[3:],
    'Importance': importance,
})
DF_imp = DF_imp.sort_values('Importance', ascending=False)
DF_imp.to_excel('imp.xlsx', index=False)

# Plot the six highest-ranked descriptors
top_desc_fi = DF_imp[:6]
plt.subplots(figsize=(6, 6))
sns.barplot(data=top_desc_fi, x='Features', y='Importance', palette='Set2')
plt.xticks(rotation=90)
# plt.savefig('fi.png', dpi=300, bbox_inches='tight')  # uncomment to save the figure
plt.show()

**Scikit-learn's Permutation Importance**


> Permutation importance shuffles the values of a single feature and measures the resulting change in the model's score (for a regressor, R² by default in scikit-learn). Shuffling an important feature breaks its relationship with the target, so the score drops noticeably; shuffling an unimportant one changes little. Scikit-learn provides the `permutation_importance` function for this; it repeats the shuffle several times per feature (`n_repeats`, 5 by default) and reports the mean drop in `importances_mean`.
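
The procedure can be sketched from scratch in plain Python. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: `permutation_importance_sketch`, the hand-written `predict` function standing in for a fitted model, and the toy data are all hypothetical.

```python
import random


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in R² after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = r2(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - r2(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical "fitted model": it uses feature 0 and ignores feature 1.
def predict(row):
    return 2.0 * row[0]


X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [2.0 * row[0] for row in X]

scores = permutation_importance_sketch(predict, X, y)
print(scores)
```

Because the model never reads feature 1, shuffling it leaves every prediction unchanged and its importance is exactly zero, whereas shuffling feature 0 lowers the score.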



In [None]:
from sklearn.inspection import permutation_importance

In [None]:
# Shuffle each descriptor on the held-out test set and record the drop in R²
result = permutation_importance(rf, X_test, y_test, random_state=42)

DF_pi = pd.DataFrame({
    'Features': df_final.columns[3:],
    'Importance': result.importances_mean,
})
# Note: the video mistakenly used DF_imp.sort_values('Importance', ascending=False) here;
# the line below is the corrected version.
DF_pi = DF_pi.sort_values('Importance', ascending=False)
# Saved under its own name so the impurity-based table in imp.xlsx is not overwritten
DF_pi.to_excel('perm_imp.xlsx', index=False)

# Plot the six highest-ranked descriptors
top_desc_pi = DF_pi[:6]
sns.barplot(data=top_desc_pi, x='Features', y='Importance', palette='Set2')
plt.xticks(rotation=90)
plt.show()

In [None]:
DF_pi.head()

**SHAP Feature Importance**


> SHAP feature importance is based on Shapley values from cooperative game theory. For a single prediction, it assigns each feature a value that represents its contribution to moving the model output away from a baseline (the average prediction). Each value is the feature's marginal contribution averaged over all possible subsets of the other features. Averaging the absolute SHAP values across instances gives a global importance ranking.
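
For a handful of features, the averaging over subsets can be carried out exactly by brute force. The sketch below is illustrative only: the three-feature `model`, the `instance`, and the all-zero `background` used to stand in for "absent" features are assumptions made for this example. `shap.TreeExplainer` relies on a tree-specific algorithm rather than this enumeration, which grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition of the other features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                with_i = value_fn(set(subset) | {i})
                without_i = value_fn(set(subset))
                phi[i] += weight * (with_i - without_i)
    return phi


# Hypothetical model with an interaction between features 0 and 2.
def model(x):
    return 3.0 * x[0] + 2.0 * x[1] + x[0] * x[2]


instance = [1.0, 2.0, 4.0]
background = [0.0, 0.0, 0.0]  # simplification: absent features take these values


def value_fn(present):
    """Model output when only the features in `present` take the instance's values."""
    x = [instance[j] if j in present else background[j] for j in range(len(instance))]
    return model(x)


phi = shapley_values(value_fn, len(instance))
print(phi)
```

The contributions add up to the difference between the prediction for the instance and the prediction for the background, and the interaction term is shared equally between the two features involved in it.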



In [None]:
%%capture
!pip install shap
import shap

In [None]:
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

In [None]:
shap.summary_plot(shap_values, X_test)

## **Model Evaluation with Reduced Features**

In [None]:
top_desc_fi['Features'][:5]

In [None]:
scaled_DF_5 = scaled_DF[top_desc_fi['Features'][:5].tolist()]  # keep only the five top-ranked descriptors

In [None]:
scaled_DF_5.head()

In [None]:
y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(scaled_DF_5, y, test_size=0.20, random_state=45)

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
print(f'The r2 score for train set is : {lr.score(X_train, y_train)}')
print(f'The r2 score for test set is : {lr.score(X_test, y_test)}')

In [None]:
rf = RandomForestRegressor(random_state=42)  # fixed seed so the reduced-feature scores are reproducible
rf.fit(X_train, y_train)

In [None]:
print(f'The r2 score for train set is : {rf.score(X_train, y_train)}')
print(f'The r2 score for test set is : {rf.score(X_test, y_test)}')