## Making trees work - Exercise
```In this exercise you will experiment with Decision Trees and Random Forests. You will explore their different parameters and plot your results. Whenever an exploration task is marked with (*), you are asked to plot two graphs on the same plot: the training score against the explored parameter and the test score against it.```

```~Ittai Haran```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

```Read the dataset. In this dataset, you are provided over a hundred variables describing attributes of life insurance applicants. The task is to predict the "Response" variable.```

```The dataset can be found at: ```https://drive.google.com/open?id=1t_P64gM1M1_c2n4PvH7AZoELH2CNh6ui

In [None]:
from sklearn.model_selection import train_test_split

df = pd.read_csv('insurance_fixed.csv')
X = df.drop(['Response'], axis = 1)
Y = df['Response']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size = 0.7, test_size = 0.3)

In [None]:
df['Response'].unique()

```We will start with Decision Trees. Use a simple DecisionTreeClassifier with default values to predict on your training set and on your test set. Evaluate the model using the accuracy metric, which you can find in sklearn.```

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [None]:
decision_tree_clf = DecisionTreeClassifier()
decision_tree_clf.fit(X_train,Y_train)
y_pred_test = decision_tree_clf.predict(X_test)
y_pred_train = decision_tree_clf.predict(X_train)
print('TEST ACCURACY : ' + str(accuracy_score(Y_test,y_pred_test)))
print('TRAIN ACCURACY : ' + str(accuracy_score(Y_train,y_pred_train)))

```Unfortunately, the model is overfitting. Now let's try to do better. Try varying the max depth of the tree, for``` $1\leq depth \leq25$ ```(*) (This means you are asked to plot some graphs, remember? :) )```

```Choose the optimal max_depth based on the graph you got.```
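
The (*) convention (train score and test score against one explored parameter) can be sketched as a small reusable helper. This is only an illustration: `explore_param` is a hypothetical name, and the synthetic data from `make_classification` is an assumption standing in for the insurance dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def explore_param(make_model, values, X_tr, y_tr, X_te, y_te):
    """Fit one model per value and collect train/test accuracy (hypothetical helper)."""
    rows = {}
    for value in values:
        model = make_model(value).fit(X_tr, y_tr)
        rows[value] = {
            'train_accuracy': accuracy_score(y_tr, model.predict(X_tr)),
            'test_accuracy': accuracy_score(y_te, model.predict(X_te)),
        }
    # One row per explored value, one column per score.
    return pd.DataFrame.from_dict(rows, orient='index')


# Synthetic stand-in data (an assumption; the exercise uses insurance_fixed.csv).
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=8,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=0)

depth_scores = explore_param(
    lambda depth: DecisionTreeClassifier(max_depth=depth, random_state=0),
    range(1, 11), X_tr, y_tr, X_te, y_te,
)
print(depth_scores)
```

Calling `depth_scores.plot()` would draw both curves on the same axes; the gap between them is what indicates overfitting.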

In [None]:
# from sklearn.model_selection import GridSearchCV

# parameters={'max_depth': range(1,26)}
# clf_tree=DecisionTreeClassifier()

In [None]:
# clf=GridSearchCV(clf_tree,parameters,scoring='accuracy')
# clf.fit(X_train,Y_train)

In [None]:
# scores_df = pd.DataFrame(clf.cv_results_).sort_values(by='rank_test_score')
# scores_df.head()

#scores_df_for_plot = scores_df[['param_max_depth' , 'mean_test_score' , 'std_test_score']]

In [None]:
# import seaborn as sns
# sns.lineplot(data = scores_df_for_plot, x='param_max_depth' , y='mean_test_score')

In [None]:
accuracy_train_test_by_depth = {}
for i in range(1,26):
    clf = DecisionTreeClassifier(max_depth=i).fit(X_train,Y_train)
    y_pred_test = clf.predict(X_test)
    y_pred_train = clf.predict(X_train)
    accuracy_train_test_by_depth[i] = (accuracy_score(Y_test,y_pred_test),accuracy_score(Y_train,y_pred_train))
 

In [None]:
df = pd.DataFrame(accuracy_train_test_by_depth , index=['test_accuracy' , 'train_accuracy']).T

In [None]:
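import seaborn as sns
sns.set(rc={'figure.figsize':(11.7,8.27)})
# Label both curves so the train and test lines can be told apart.
sns.lineplot(data=df, x=df.index, y='test_accuracy', label='TEST')
sns.lineplot(data=df, x=df.index, y='train_accuracy', label='TRAIN')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.title('TRAIN-TEST ACCURACY BY MAX DEPTH PARAMETER')
plt.legend()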

```Choose the best max_depth you found. Now try varying min_samples_leaf. Use the following values:
[1, 10, 100, 300, 700, 1000]. Repeat with max_depth = 20. What can we learn from the graphs? Please answer the question ```$\ \underline{in\ another\ cell}$```.(*)```

In [None]:
df.sort_values(by='test_accuracy', ascending=False)  # the best depth looks like 8: a small gap between test and train, with good accuracy

In [None]:
# train with the best depth found (8)
accuracy_train_test_by_min_samples_leaf = {}
for i in [1, 10, 100, 300,700, 1000]:
    clf = DecisionTreeClassifier(max_depth=8, min_samples_leaf=i).fit(X_train,Y_train)
    y_pred_test = clf.predict(X_test)
    y_pred_train = clf.predict(X_train)
    accuracy_train_test_by_min_samples_leaf[i] = (accuracy_score(Y_test,y_pred_test),accuracy_score(Y_train,y_pred_train))

In [None]:
df_min_samples_leaf = pd.DataFrame(accuracy_train_test_by_min_samples_leaf , index=['test_accuracy' , 'train_accuracy']).T

In [None]:
sns.lineplot(data= df_min_samples_leaf, x=df_min_samples_leaf.index , y='test_accuracy' , legend='full', label='TEST')
sns.lineplot(data= df_min_samples_leaf, x=df_min_samples_leaf.index , y='train_accuracy', legend='full', label='TRAIN')
plt.title('TRAIN-TEST ACCURACY BY MIN SAMPLES LEAF')

In [None]:
# From the graph above we can see that with the 'best' max_depth (8), accuracy drops as min_samples_leaf grows.

In [None]:
# train with max_depth = 20
accuracy_train_test_by_min_samples_leaf = {}
for i in [1, 10, 100, 300,700, 1000]:
    clf = DecisionTreeClassifier(max_depth=20, min_samples_leaf=i).fit(X_train,Y_train)
    y_pred_test = clf.predict(X_test)
    y_pred_train = clf.predict(X_train)
    accuracy_train_test_by_min_samples_leaf[i] = (accuracy_score(Y_test,y_pred_test),accuracy_score(Y_train,y_pred_train))
    
df_min_samples_leaf = pd.DataFrame(accuracy_train_test_by_min_samples_leaf , index=['test_accuracy' , 'train_accuracy']).T

sns.lineplot(data= df_min_samples_leaf, x=df_min_samples_leaf.index , y='test_accuracy' , legend='full' , label='TEST')
sns.lineplot(data= df_min_samples_leaf, x=df_min_samples_leaf.index , y='train_accuracy', legend='full', label='TRAIN')
plt.title('TRAIN-TEST ACCURACY BY MIN SAMPLES LEAF')
plt.legend()

In [None]:
# From the graph above we can see that with max_depth=20 there is heavy overfitting at low values of min_samples_leaf. Train and test accuracy start to converge around min_samples_leaf=100 and above, but again, accuracy drops as min_samples_leaf grows.

```Decision Tree is a very nice algorithm, especially because it is very intuitive and explainable. We can even draw it!
Train a simple Decision Tree with max_depth = 3. Call it basic_tree and run the cell below. Examine the file tree.png you created.```

In [None]:
basic_tree = DecisionTreeClassifier(max_depth=3).fit(X_train,Y_train)

In [None]:
X_train

In [None]:
# from sklearn.tree import export_graphviz
# export_graphviz(basic_tree, out_file = 'tree.dot', filled  = True,
#                 rounded = True, feature_names = X_train.columns)
# !dot -Tpng tree.dot -o tree.png

In [None]:
Y_train_N = sorted(Y_train.unique())
Y_train_N = list(map(str,Y_train_N))
Y_train_N

In [None]:
Y_train.value_counts()

In [None]:
#sns.set_style("darkgrid")
import matplotlib as mpl
from sklearn.tree import plot_tree
#plt.style.use('dark_background')
mpl.rcParams['text.color'] = 'black'

fig, ax = plt.subplots(figsize=(40, 20), facecolor='b')
plot_tree(basic_tree, rotate=True, ax=ax , fontsize=12 , feature_names=X_train.columns , class_names=Y_train_N)
plt.show()

```Look at the tree you got. What, would you say, are the most important features?
As you recall, we talked about feature importance in the lecture notes. Use the attribute feature_importances_ of your tree to get a list of the most important features.```

In [None]:
# the 10 most important features
pd.DataFrame(list(zip(X_train.columns, basic_tree.feature_importances_)), columns=['feature', 'importance']).sort_values('importance', ascending=False).head(10)

```We will now move to Random Forest. Repeat the exploration tasks (max depth and min samples leaf) with a Random Forest of 100 trees. In addition, vary the number of trees between 10 and 400 while keeping max_depth low (*), and vary the max_features parameter between 0.1 and 1 (*). Try explaining the graphs you see ```$\ \underline{in\ a\ different\ cell}$```. Use the flag n_jobs = -1 in your experiments to speed up computation. Make sure you understand where your model is overfitted.```

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
#EXPLORE MAX DEPTH AND MIN SAMPLES LEAF - USING RANDOM FOREST WITH 100 TREES:

accuracy_train_test_by_max_depth = {}
for i in range(1,26):
    clf = RandomForestClassifier(n_estimators=100, max_depth=i , n_jobs=-1).fit(X_train,Y_train)
    y_pred_test = clf.predict(X_test)
    y_pred_train = clf.predict(X_train)
    accuracy_train_test_by_max_depth[i] = (accuracy_score(Y_test,y_pred_test),accuracy_score(Y_train,y_pred_train))
    
df_max_depth = pd.DataFrame(accuracy_train_test_by_max_depth , index=['test_accuracy' , 'train_accuracy']).T

In [None]:
sns.lineplot(data= df_max_depth, x=df_max_depth.index , y='test_accuracy' , legend='full' , label='TEST')
sns.lineplot(data= df_max_depth, x=df_max_depth.index , y='train_accuracy', legend='full', label='TRAIN')
plt.title('TRAIN-TEST ACCURACY BY MAX DEPTH - WITH 100 TREES')
plt.legend()

In [None]:

accuracy_train_test_by_min_samples_leaf = {}
for i in [1, 10, 100, 300,700, 1000]:
    clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=i , n_jobs=-1).fit(X_train,Y_train)
    y_pred_test = clf.predict(X_test)
    y_pred_train = clf.predict(X_train)
    accuracy_train_test_by_min_samples_leaf[i] = (accuracy_score(Y_test,y_pred_test),accuracy_score(Y_train,y_pred_train))
    
df_min_samples_leaf = pd.DataFrame(accuracy_train_test_by_min_samples_leaf , index=['test_accuracy' , 'train_accuracy']).T

In [None]:
sns.lineplot(data= df_min_samples_leaf, x=df_min_samples_leaf.index , y='test_accuracy' , legend='full' , label='TEST')
sns.lineplot(data= df_min_samples_leaf, x=df_min_samples_leaf.index , y='train_accuracy', legend='full', label='TRAIN')
plt.title('TRAIN-TEST ACCURACY BY MIN SAMPLES LEAF - WITH 100 TREES')
plt.legend()

In [None]:
accuracy_train_test_by_num_of_trees = {}
for i in range(10,401):
    clf = RandomForestClassifier(n_estimators=i, max_depth=5 , n_jobs=-1).fit(X_train,Y_train)
    y_pred_test = clf.predict(X_test)
    y_pred_train = clf.predict(X_train)
    accuracy_train_test_by_num_of_trees[i] = (accuracy_score(Y_test,y_pred_test),accuracy_score(Y_train,y_pred_train))
    
df_num_of_trees = pd.DataFrame(accuracy_train_test_by_num_of_trees , index=['test_accuracy' , 'train_accuracy']).T

In [None]:
sns.lineplot(data= df_num_of_trees, x=df_num_of_trees.index , y='test_accuracy' , legend='full' , label='TEST')
sns.lineplot(data= df_num_of_trees, x=df_num_of_trees.index , y='train_accuracy', legend='full', label='TRAIN')
plt.title('TRAIN-TEST ACCURACY BY NUMBER OF TREES WITH LOW DEPTH (5)')
plt.legend()

In [None]:
df_num_of_trees.sort_values(by='test_accuracy', ascending=False)  # I will choose 30 trees for the next question

In [None]:
#max_feature

accuracy_train_test_by_max_feature = {}
for i in np.linspace(0.1, 1.0, 10):  # linspace ends at exactly 1.0, the largest valid fraction
    clf = RandomForestClassifier(n_estimators=30,max_features=i, n_jobs=-1).fit(X_train,Y_train)  # 30 trees, as chosen above
    y_pred_test = clf.predict(X_test)
    y_pred_train = clf.predict(X_train)
    accuracy_train_test_by_max_feature[i] = (accuracy_score(Y_test,y_pred_test),accuracy_score(Y_train,y_pred_train))
    
df_max_feature = pd.DataFrame(accuracy_train_test_by_max_feature , index=['test_accuracy' , 'train_accuracy']).T

In [None]:
df_max_feature

In [None]:
sns.lineplot(data= df_max_feature, x=df_max_feature.index , y='test_accuracy' , legend='full' , label='TEST')
sns.lineplot(data= df_max_feature, x=df_max_feature.index , y='train_accuracy', legend='full', label='TRAIN')
plt.title('TRAIN-TEST ACCURACY BY MAX FEATURE')
plt.legend()

```As you could see, at least one of your graphs turned out to be very noisy. Use K Fold cross validation to evaluate your model more accurately. In K Fold cross validation we split our data into K segments, and for each ```$\ 1\leq i\leq K\ $``` we test our model on the i-th segment while training it on the others.```
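
The K Fold procedure described above can be sketched as an explicit loop. This is a minimal illustration on synthetic data (an assumption; the exercise data is not used here), with K = 5 and forest settings chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption, not the insurance dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=8, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    # Train on the other K-1 segments, test on the held-out i-th segment.
    model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    predictions = model.predict(X_demo[test_idx])
    fold_scores.append(accuracy_score(y_demo[test_idx], predictions))

fold_scores = np.array(fold_scores)
print('per-fold accuracy:', fold_scores)
print('mean: %.3f, std: %.3f' % (fold_scores.mean(), fold_scores.std()))
```

Averaging over the folds is what smooths the noisy single-split curves; sklearn's `cross_val_score` wraps the same loop in one call.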

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
rfc = RandomForestClassifier(n_jobs=-1) 

param_grid = {
    'n_estimators': range(10,401)
}

grid_search_rf = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 2, refit=True,n_jobs=-1, scoring='accuracy' , verbose=5) #using cross validation = 2 because it takes too much time..

re = grid_search_rf.fit(X_train,Y_train)

In [None]:
grid_results_df = pd.DataFrame(re.cv_results_)
grid_results_df.sort_values(by='mean_test_score', ascending=False).head(10)  # the ten best settings first

In [None]:
re.best_params_

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(data=grid_results_df , x='param_n_estimators' ,y='mean_test_score')
#sns.lineplot(data=grid_results_df , x='param_n_estimators' ,y='std_test_score')

In [None]:
test_preds = re.best_estimator_.predict(X_test)
train_preds = re.best_estimator_.predict(X_train)

In [None]:
accuracy_score(Y_test,test_preds),accuracy_score(Y_train,train_preds)

In [None]:
# from sklearn.model_selection import KFold

# kf = KFold(n_splits=3) # Define the split - into 3 folds
# kf.get_n_splits(X_train,Y_train) # returns the number of splitting iterations in the cross-validator
# print(kf) 

# KFold(n_splits=2, random_state=None, shuffle=False)

# for train_index, test_index in kf.split(X_train):
#     #print('TRAIN:', train_index, 'TEST:', test_index)
#     #print(Y_train.iloc[test_index])
    
#     accuracy_train_test_by_num_of_trees = {}
#     for i in range(10,401):
        
#         clf = RandomForestClassifier(n_estimators=i, max_depth=5 , n_jobs=-1).fit(X_train.iloc[train_index] , Y_train.iloc[train_index])
#         y_pred_test = clf.predict(X_train.iloc[test_index])
#         y_pred_train = clf.predict(X_train.iloc[train_index])
#         accuracy_train_test_by_num_of_trees[i] = (accuracy_score(Y_train.iloc[test_index],y_pred_test),accuracy_score(Y_train.iloc[train_index],y_pred_train))
    
#     df_num_of_trees = pd.DataFrame(accuracy_train_test_by_num_of_trees , index=['test_accuracy' , 'train_accuracy']).T
    
#     sns.lineplot(data= df_num_of_trees, x=df_num_of_trees.index , y='test_accuracy' , legend='full' , label='TEST')
#     sns.lineplot(data= df_num_of_trees, x=df_num_of_trees.index , y='train_accuracy', legend='full', label='TRAIN')
#     plt.title('TRAIN-TEST ACCURACY BY NUMBER OF TREE WITH LOW DEPTH(5)')
#     plt.legend()
#     plt.show()

## Extra thinking on feature importance

```We talked about feature importance in the lecture notes. Get the feature importance of each feature using a decision tree and using a random forest. In both cases, use the best hyperparameters you found so far. Discuss the differences between the answers``` $\underline{in\ a\ cell}$.

```We can define a concept of feature importance for linear regression: Suppose you have two features, ```$x_1$ ```and``` $x_2$. ```Suppose that you got a linear regression of the form```

$y = 100\cdot x_1 + 1\cdot x_2$

```What feature is more important? What if we have -100 instead of 100? Generalize this idea to any number of features. Train a linear regression on your data and get the feature importances.```
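
One possible reading of the question, sketched under stated assumptions: take the absolute value of each coefficient as its importance, so that -100 counts the same as 100, and standardize the features first so that coefficients are comparable across different scales. The data below is synthetic and the coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with known coefficients 100, 1 and -50 (illustrative assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 100 * X_demo[:, 0] + 1 * X_demo[:, 1] - 50 * X_demo[:, 2]

# Standardize so that each coefficient measures the effect of one standard deviation.
X_scaled = StandardScaler().fit_transform(X_demo)
reg = LinearRegression().fit(X_scaled, y_demo)

# Importance = magnitude of the coefficient, normalized to sum to 1.
importance = np.abs(reg.coef_)
importance = importance / importance.sum()
ranking = np.argsort(importance)[::-1]  # feature indices, most important first

print('normalized importance:', importance)
print('ranking:', ranking)
```

The normalization to a sum of 1 is a choice made here so the numbers can be placed next to a tree's `feature_importances_`; the two quantities are still defined differently and are not directly equivalent.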

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
#Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples. The higher the value the more important the feature.

## Ensemble methods and stacking
```In this part we will explore the concept of model stacking: that is, training a model, the combining model, on the outputs of several other models. Hence, the stacking method has two steps: first we train our models, and then we train the combining model using the outputs of those models.```

```When stacking models it is very important to train the base models on one segment of the data and the combining model on another segment. Hence, start by splitting the data into 3 segments: a train_1 segment with 35% of the data, a train_2 segment with 35% of the data, and a test segment with the last 30% of the data.```

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size = 0.7, test_size = 0.3)

X_train_1, X_train_2 ,Y_train_1, Y_train_2 = train_test_split(X_train, Y_train, train_size = 0.5, test_size = 0.5)

In [None]:
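n_segment = int(len(X) * 0.35)  # size of each training segment (35% of the rows)
n_segment

In [None]:
# Segments are taken in file order (no shuffling).
t1 = X[:n_segment]
t1_y = Y[:n_segment]

In [None]:
t2 = X[n_segment: n_segment*2]
t2_y = Y[n_segment: n_segment*2]

In [None]:
test = X[n_segment*2 :]
test_y = Y[n_segment*2 :]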

```Our first experiment is as follows: train a random forest of simple decision trees (30 trees, max_depth = 3), using train_1. Use the estimators of the forest to create 30*8=240 features: for each estimator, get the probabilities it assigns to each of the classes. You can get the list of estimators from RandomForestClassifier.estimators_ and obtain the probabilities with model.predict_proba.
Using the new features you got (and only them), train a logistic regression (LogisticRegression).
Compare the accuracy of the first random forest (on the test segment) with the accuracy of the stacked model (again, on the test segment).```

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

In [None]:
clf_rf = RandomForestClassifier(n_estimators=30, max_depth=3).fit(t1,t1_y)

In [None]:
# for index,clf_dicision_tree in enumerate(clf_rf.estimators_):
#     print(index,clf_dicision_tree)


In [None]:
# Build one block of class-probability columns per tree, evaluated on train_2.
tree_frames = []
for index, tree in enumerate(clf_rf.estimators_):
    # The forest's sub-estimators were fitted on plain arrays, so pass .values
    # to avoid feature-name warnings.
    probabilities = tree.predict_proba(t2.values)
    frame = pd.DataFrame(probabilities)
    frame.columns = ['TREE:' + str(index + 1) + '_CLASS:' + str(c + 1) for c in frame.columns]
    tree_frames.append(frame)

# Concatenate all per-tree blocks side by side into the stacked feature matrix.
df = pd.concat(tree_frames, axis=1)

In [None]:
df

In [None]:
t2_y = t2_y.reset_index(drop=True)

In [None]:
# Build the same per-tree probability features for the test segment.
test_frames = []
for index, tree in enumerate(clf_rf.estimators_):
    # The forest's sub-estimators were fitted on plain arrays, so pass .values
    # to avoid feature-name warnings.
    probabilities = tree.predict_proba(test.values)
    frame = pd.DataFrame(probabilities)
    frame.columns = ['TREE:' + str(index + 1) + '_CLASS:' + str(c + 1) for c in frame.columns]
    test_frames.append(frame)

# Column order matches the training feature matrix built above.
df_test = pd.concat(test_frames, axis=1)

In [None]:
test_y = test_y.reset_index(drop=True)

In [None]:
clf_lr = LogisticRegression().fit(df,t2_y)

In [None]:
test_preds = clf_lr.predict(df_test)

In [None]:
accuracy_score(test_y,test_preds)

In [None]:
train_preds= clf_lr.predict(df)

In [None]:
train_preds

In [None]:
accuracy_score(t2_y,train_preds)

In [None]:
Y.value_counts()

```We will conduct a similar experiment: create a set of at least 5 different models, of different kinds - use algorithms we talked about in the course. Stack them to get a better model. Compare the accuracies of the models to the accuracy of your stacked model.```
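
The two-step stacking recipe can be sketched as follows, as one possible layout rather than the expected solution. It uses only four base models (the task asks for at least five), synthetic data, and the hypothetical helper `stack_features`; all of these are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split 35% / 35% / 30% as in the exercise.
X_demo, y_demo = make_classification(n_samples=1500, n_features=20, n_informative=10,
                                     n_classes=3, random_state=0)
X_rest, X_te, y_rest, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
X_tr1, X_tr2, y_tr1, y_tr2 = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Step 1: fit the base models on train_1 only.
base_models = {
    'tree': DecisionTreeClassifier(max_depth=5, random_state=0),
    'knn': KNeighborsClassifier(),
    'logreg': LogisticRegression(max_iter=1000),
    'forest': RandomForestClassifier(n_estimators=50, random_state=0),
}
for model in base_models.values():
    model.fit(X_tr1, y_tr1)


def stack_features(models, X):
    """Concatenate every model's class probabilities (hypothetical helper)."""
    return np.hstack([m.predict_proba(X) for m in models.values()])


# Step 2: fit the combining model on the base models' outputs for train_2.
combiner = LogisticRegression(max_iter=1000)
combiner.fit(stack_features(base_models, X_tr2), y_tr2)

for name, model in base_models.items():
    print(name, accuracy_score(y_te, model.predict(X_te)))
stacked_accuracy = accuracy_score(y_te, combiner.predict(stack_features(base_models, X_te)))
print('stacked', stacked_accuracy)
```

Unlike a voting ensemble, the combining model here learns how much to trust each base model from data the base models have not seen.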

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression, LogisticRegressionCV, LogisticRegression, RidgeClassifier, RidgeClassifierCV

In [None]:
clf1 = DecisionTreeClassifier().fit(X_train_1,Y_train_1)
clf2 = KNeighborsClassifier().fit(X_train_1,Y_train_1)
clf3 = LogisticRegression()
clf3.fit(X_train_1,Y_train_1)
clf4 = RidgeClassifier()
clf4.fit(X_train_1,Y_train_1)
clf5 = RidgeClassifierCV()
clf5.fit(X_train_1,Y_train_1)

In [None]:
# predictions of each classifier separately:

clf1_preds = clf1.predict(X_test)
print(accuracy_score(Y_test,clf1_preds))

clf2_preds = clf2.predict(X_test)
print(accuracy_score(Y_test,clf2_preds))

clf3_preds = clf3.predict(X_test)
print(accuracy_score(Y_test,clf3_preds))

clf4_preds = clf4.predict(X_test)
print(accuracy_score(Y_test,clf4_preds))

clf5_preds = clf5.predict(X_test)
print(accuracy_score(Y_test,clf5_preds))

In [None]:
# Voting Classifier with hard (majority) voting
from sklearn.ensemble import VotingClassifier

In [None]:
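# Note: VotingClassifier.fit clones and refits every base estimator on X_train_2,
# so the models fitted on X_train_1 above are not reused here.
# RidgeClassifier has no predict_proba, which is why hard voting is used.
votingC = VotingClassifier(estimators=[('clf1', clf1),('clf2', clf2),('clf3', clf3),('clf4', clf4),('clf5', clf5)], voting='hard')
votingC = votingC.fit(X_train_2, Y_train_2)
predict_y = votingC.predict(X_test)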

In [None]:
accuracy_score(Y_test,predict_y)

```As we said earlier, it is very important to use two different train segments. What happens if you use the same train segment in both steps of the stacked model? Note that you now use more data to train your models, and also your combining model. Do you get better results? Do it and explain your results ```$\underline{\ in\ a\ cell\ below.}$

In [None]:
# #CHECK
# votingC = VotingClassifier(estimators=[('clf1', clf1),('clf2', clf2),('clf3', clf3),('clf4', clf4),('clf5', clf5)], voting='hard')
# votingC = votingC.fit(X_train_1, Y_train_1)
# predict_y = votingC.predict(X_test)

# accuracy_score(Y_test,predict_y)

In [None]:
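# If we use the same training data for the base learners and for the combining model, the combiner is trained on outputs the base learners produced for data they have already seen, so the stacked model generalizes less well and can overfit.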