<img src="../../../../images/dt.png" style="background:white; display: block; margin-left: auto;margin-right: auto; width:80%"/>

---
<h2>1. Importing the Dataset</h2>

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('../../../../data/clean/Social_Network_Ads.csv')
display(df.head())
x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

,Age,EstimatedSalary,Purchased
0,19,19000,0
1,35,20000,0
2,26,43000,0
3,27,57000,0
4,19,76000,0


---
<h2>2. Splitting the Dataset</h2>

In [2]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
print("train dataset size : {} observations\ntest dataset size : {} observations".format(x_train.shape[0], x_test.shape[0]))

train dataset size : 320 observations
test dataset size : 80 observations
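
The `stratify=y` argument asks `train_test_split` to keep the class ratio of `y` roughly the same in the train and test subsets. A minimal sketch of how one might check this, using synthetic labels (the 256/144 class counts and the `*_demo` names are made up for illustration and are not taken from the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 400 samples, 36% in class 1 (illustrative values only).
y_demo = np.array([0] * 256 + [1] * 144)
x_demo = np.arange(400).reshape(-1, 1)  # placeholder feature column

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, the share of class 1 should be close in all three sets.
print("class 1 share, full :", y_demo.mean())
print("class 1 share, train:", y_tr.mean())
print("class 1 share, test :", y_te.mean())
```

Without `stratify`, the test share can drift further from the full-data share, which matters most when one class is rare.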


---
<h2>3. Feature Scaling</h2>

In [3]:
from sklearn.preprocessing import StandardScaler

stand_x = StandardScaler().fit(x_train)
x_ss = stand_x.transform(x_train)
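
The scaler above is fitted on the training set only, and the same statistics are reused later for the test set. A rough pure-Python sketch of that idea for a single column, assuming the usual z-score formula with the population standard deviation (the helper names and the sample values are hypothetical):

```python
from statistics import fmean, pstdev

def fit_standardizer(train_column):
    """Learn the mean and standard deviation from a training column only."""
    mean = fmean(train_column)
    std = pstdev(train_column)  # population standard deviation (ddof=0)
    if std == 0:
        std = 1.0  # guard added for this sketch: avoid dividing by zero
    return mean, std

def standardize(column, mean, std):
    """Apply z = (value - mean) / std with previously learned statistics."""
    return [(value - mean) / std for value in column]

train_age = [19, 35, 26, 27, 19]  # illustrative training values
test_age = [30, 45]               # hypothetical unseen values

mean, std = fit_standardizer(train_age)
train_scaled = standardize(train_age, mean, std)
test_scaled = standardize(test_age, mean, std)  # reuses the training statistics

print("train mean/std after scaling:", fmean(train_scaled), pstdev(train_scaled))
print("scaled test values:", test_scaled)
```

Fitting on the training data alone keeps information about the test set from leaking into the preprocessing step.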

---
<h2>4. Training the Model with Train Dataset</h2>

In [4]:
from sklearn.tree import DecisionTreeClassifier

'''
> 'criterion': ["gini", "entropy"], optional (default="gini")
    - "gini" selects the Gini impurity
    - Gini impurity is the probability of mislabeling a randomly chosen sample in a node if it were labeled
      according to the node's class distribution; a node is split so that the weighted impurity of its children is lowest
    - "entropy" selects the information gain
    - Information gain uses entropy as the impurity measure and splits a node so that
      the information gain is highest
    - "entropy" might be a little slower to compute because it requires evaluating a logarithm
> 'splitter': ["best", "random"], optional (default="best")
    - "best" evaluates all candidate splits with the criterion before splitting
    - "best" picks the feature and threshold with the largest impurity reduction,
      which remains a reasonable choice even when there are hundreds of features
    - "random" draws a random threshold for each candidate feature, uniformly between that feature's minimum and maximum value
      (reproducible through random_state), and keeps the best of those random splits
    - "random" avoids the computational overhead of searching for the optimal split
    - if the model is overfitting, switching the splitter to "random" and retraining is one thing to try
> 'max_depth': int or None, optional (default=None)
    - The maximum depth the tree is allowed to reach
    - The deeper the tree, the more splits it has and the more detail about the training data it captures
    - The default (None) grows the tree until all leaves are pure or hold fewer than 'min_samples_split' samples,
      which often results in an overfitted tree
    - if the model is overfitting, lowering max_depth is one way to combat it
    - A very low depth is also bad because the model will underfit; experimenting is the best way to find a good value
    - 'max_depth' interacts with 'min_samples_split' and 'min_samples_leaf', since all three limit how far the tree grows
> 'min_samples_split': int or float, optional (default=2)
    - The minimum number of samples an internal node must hold before it may be split further
      (a float is read as a fraction of the training samples)
    - Increasing it constrains the tree, because a node needs more samples before a split is considered
    - Reported good values tend to lie between 2 and 40 (an int below 2 is not accepted)
    - It is used to control over-fitting
    - Values that are too high lead to under-fitting, so adjust it according to how much the model under- or overfits
    - 'min_samples_split' and 'min_samples_leaf' are reported to have the most influence on the performance of the final tree in relative importance analyses
> 'min_samples_leaf': int or float, optional (default=1)
    - The minimum number of samples a leaf (an external node, i.e. a node without children) must hold
    - It is enforced regardless of 'min_samples_split': a split is only made if it leaves at least this many samples in each branch
    - Setting it above 1 also controls over-fitting, because no leaf can then hold a single sample
    - Reported good values tend to lie between 1 and 20
    - A very small value usually means the tree will overfit
    - Increasing it too far may cause underfitting
    - Together with 'min_samples_split', it is reported to have the most influence on the performance of the final tree in relative importance analyses
> 'max_features': int, float or ["auto", "sqrt", "log2"], default=None
    - If "auto", then max_features=sqrt(n_features) (same as "sqrt"; this option is deprecated in recent scikit-learn releases)
    - If "sqrt", then max_features=sqrt(n_features)
    - If "log2", then max_features=log2(n_features)
        - if the computational cost is high or the model overfits heavily, "log2" is worth trying
        - "sqrt" typically considers slightly more features if "log2" turns out too restrictive
        - a custom float (read as a fraction of n_features) can take the number down further
    - If None, then max_features=n_features
    - Limiting 'max_features' is another way to reduce overfitting
'''
dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(x_ss, y_train)

DecisionTreeClassifier(criterion='entropy')
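
The `criterion` notes above can be made concrete with a small calculation. The following is a rough pure-Python sketch of Gini impurity, entropy, and the impurity decrease of one candidate split; the function names and the toy labels are hypothetical, and this is not how scikit-learn implements the criteria internally:

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

# Toy node with 4 samples per class, split into two children of 4 samples each.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 1]
right = [0, 1, 1, 1]

print("Gini decrease   :", impurity_decrease(parent, left, right, gini))
print("Information gain:", impurity_decrease(parent, left, right, entropy))
```

With `criterion='gini'` the split with the largest Gini decrease is preferred, and with `criterion='entropy'` the split with the largest information gain. The two measures often rank candidate splits similarly.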

---
<h2>5. Predicting the Test Dataset and Display Results</h2>

In [5]:
y_pred = dt.predict(stand_x.transform(x_test))

pd.DataFrame(data=np.stack((y_test, y_pred), axis=1),
             index=None, columns=['y actual', 'y prediction'],
             copy=False).head(10)

,y actual,y prediction
0,1,1
1,0,0
2,0,0
3,0,1
4,0,0
5,1,1
6,0,0
7,1,1
8,0,1
9,0,0
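
The prediction cell above scales the test set by hand with the scaler fitted on the training set. As an alternative, scikit-learn's `make_pipeline` can bundle the scaler and the classifier so that the same transformation is applied automatically at prediction time. A minimal sketch on synthetic data (the feature ranges, the labeling rule, and `random_state=0` are assumptions made for this example, not properties of the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the two features used in this notebook.
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
x_demo = np.column_stack((age, salary))
y_demo = ((age > 40) | (salary > 100_000)).astype(int)  # made-up labeling rule

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# fit() learns the scaling from x_tr only; predict() reuses it on x_te.
model = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(criterion='entropy', random_state=0),
)
model.fit(x_tr, y_tr)
y_hat = model.predict(x_te)

print("test accuracy on the synthetic data:", model.score(x_te, y_te))
```

Keeping both steps in one object makes it harder to forget the transform or to fit the scaler on the wrong subset.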


---
<h2>6. Making the Confusion Matrix</h2>

In [6]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)

# Rows are actual classes and columns are predicted classes, so ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print("\nConfusion matrix result shows that:"
      f"\n\t- {tn} correct predictions of class 0 (users who didn't buy the product)"
      f"\n\t- {fp} incorrect predictions of class 1 (predicted to buy the product, but they didn't)"
      f"\n\t- {tp} correct predictions of class 1 (users who bought the product)"
      f"\n\t- {fn} incorrect predictions of class 0 (predicted not to buy the product, but they did)")

[[45  6]
 [ 3 26]]

Confusion matrix result shows that:
	- 45 correct predictions of class 0 (users who didn't buy the product)
	- 6 incorrect predictions of class 1 (predicted to buy the product, but they didn't)
	- 26 correct predictions of class 1 (users who bought the product)
	- 3 incorrect predictions of class 0 (predicted not to buy the product, but they did)
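
The four counts in the matrix above are enough to derive the usual summary metrics. A short sketch using the standard definitions, with the numbers copied from the printed matrix (class 1, "bought the product", is treated as the positive class):

```python
# Counts read off the confusion matrix (rows: actual, columns: predicted).
tn, fp = 45, 6
fn, tp = 3, 26

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted buyers who really bought
recall = tp / (tp + fn)        # share of actual buyers the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy : {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall   : {recall:.4f}")
print(f"f1 score : {f1:.4f}")
```

Here the accuracy works out to 71/80, the precision to 26/32, and the recall to 26/29.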
