# Lab 4

## Decision Tree

For this lab, we are going to implement a decision tree based on the C4.5 algorithm. C4.5 provides several improvements over ID3 though the base structure is very similar. C4.5 removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. 

We will work with a diabetes health indicators dataset.

**Note:** Exercises can be autograded and count towards your lab and assignment score. Problems are graded for participation.

In [25]:
from pathlib import Path
home = str(Path.home()) # all other paths are relative to this path. change to something else if this is not the case on your system

In [26]:
%load_ext autoreload
%autoreload 2

# make sure you run the cell above before running this
import Lab4_helper

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


For developing this lab, we use a diabetes health indicators dataset. A description of the data can be found at https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset.

In [27]:
import pandas as pd
diabetes_df = pd.read_csv(
    "../data/diabetes_indicators.csv"
)
diabetes_df.head()

Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0


We need to do some simple preprocessing before our decision tree can deal with this data.

In [28]:
features = ['Sex','Age','Education','Income','Fruits','Veggies','Smoker', "HighChol", "BMI"]
X = diabetes_df.loc[:,features][:1000]
X = X.dropna()
X

Unnamed: 0,Sex,Age,Education,Income,Fruits,Veggies,Smoker,HighChol,BMI
0,0.0,9.0,4.0,3.0,0.0,1.0,1.0,1.0,40.0
1,0.0,7.0,6.0,1.0,0.0,0.0,1.0,0.0,25.0
2,0.0,9.0,4.0,8.0,1.0,0.0,0.0,1.0,28.0
3,0.0,11.0,3.0,6.0,1.0,1.0,0.0,0.0,27.0
4,0.0,11.0,5.0,4.0,1.0,1.0,0.0,1.0,24.0
...,...,...,...,...,...,...,...,...,...
995,0.0,2.0,6.0,8.0,1.0,0.0,0.0,0.0,31.0
996,0.0,10.0,5.0,8.0,0.0,1.0,0.0,0.0,21.0
997,1.0,7.0,4.0,1.0,0.0,0.0,0.0,1.0,31.0
998,0.0,5.0,4.0,8.0,1.0,1.0,0.0,0.0,37.0


In [29]:
X.dtypes

Sex          float64
Age          float64
Education    float64
Income       float64
Fruits       float64
Veggies      float64
Smoker       float64
HighChol     float64
BMI          float64
dtype: object

We will first implement ID3 before we move on to C4.5. This means we cannot yet handle continuous data such as ``BMI``, so we will bin it into 20 categories. I picked 20 after trying a few different values. At this point, I do not know whether it is a good or a bad choice. This is part of the reason we will switch to C4.5.

In [30]:
X2 = X.copy()
X2['BMI'] = pd.cut(X2['BMI'],bins=20).astype(str) # bin BMI into 20 equal-width intervals
X2['BMI'].value_counts()

(28.9, 31.05]      194
(26.75, 28.9]      147
(24.6, 26.75]      141
(22.45, 24.6]      125
(31.05, 33.2]       88
(20.3, 22.45]       65
(33.2, 35.35]       64
(35.35, 37.5]       41
(37.5, 39.65]       27
(41.8, 43.95]       21
(18.15, 20.3]       20
(39.65, 41.8]       19
(15.957, 18.15]     17
(43.95, 46.1]       17
(46.1, 48.25]        5
(48.25, 50.4]        4
(54.7, 56.85]        2
(56.85, 59.0]        1
(50.4, 52.55]        1
(52.55, 54.7]        1
Name: BMI, dtype: int64

In [31]:
t = diabetes_df.loc[X2.index,'Diabetes_012']
t

0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
      ... 
995    0.0
996    0.0
997    0.0
998    0.0
999    0.0
Name: Diabetes_012, Length: 1000, dtype: float64

#### Exercise 1
Construct a function called ``entropy`` that calculates the entropy of a set (a pandas Series object).

In [32]:
e1 = Lab4_helper.entropy(t)
e2 = Lab4_helper.entropy(X2['Income'])
e1,e2

(0.9013501922245392, 2.8994397663680536)
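As a rough illustration of the quantity being computed, here is a minimal sketch of Shannon entropy over the value frequencies of a Series. The name `entropy_sketch` is hypothetical and this is not the reference solution. It assumes base-2 logarithms, which is consistent with the printed value for ``Income`` staying below 3 bits.

```python
import numpy as np
import pandas as pd

def entropy_sketch(values: pd.Series) -> float:
    """Shannon entropy (base 2) of the value distribution in a Series.

    Hypothetical helper for illustration only.
    """
    # Relative frequency of each distinct value; values with zero count never appear.
    probs = values.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())

coin = pd.Series(["H", "T", "H", "T"])
print(entropy_sketch(coin))  # 1.0: two equally likely values carry one bit
```

A Series with a single distinct value has entropy 0, which is what later makes "all labels agree" a natural base case for the tree.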

#### Exercise 2
Write a function called ``gain`` that calculates the information gain after splitting with a specific variable (Equation 12.2 from Marsland).

In [33]:
g1 = Lab4_helper.gain(t,X2['Sex'])
g2 = Lab4_helper.gain(t,X2['Income'])
g3 = Lab4_helper.gain(t,X2['Age'])
g1,g2,g3

(0.00027367382888343617, 0.02399688322746074, 0.04855942918948519)
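Information gain is the entropy of the target minus the size-weighted entropy of the target within each value of the feature. The sketch below shows one way to express that with a pandas groupby. The names `entropy_sketch` and `gain_sketch` are hypothetical, the argument order (target first, feature second) mirrors the calls above, and the two Series are assumed to share an index.

```python
import numpy as np
import pandas as pd

def entropy_sketch(values: pd.Series) -> float:
    # Base-2 Shannon entropy of the value frequencies (hypothetical helper).
    probs = values.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())

def gain_sketch(target: pd.Series, feature: pd.Series) -> float:
    """Entropy of target minus the weighted entropy after splitting on feature."""
    remainder = 0.0
    # Group the target labels by the feature value found on the same row.
    for _, subset in target.groupby(feature):
        remainder += len(subset) / len(target) * entropy_sketch(subset)
    return entropy_sketch(target) - remainder

labels = pd.Series([0, 0, 1, 1])
perfect = pd.Series(["a", "a", "b", "b"])   # separates the labels completely
useless = pd.Series(["a", "b", "a", "b"])   # tells us nothing about the labels
print(gain_sketch(labels, perfect), gain_sketch(labels, useless))
```

A feature that separates the labels perfectly recovers the full entropy of the target, while a feature that is independent of the labels yields a gain of zero.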

#### Exercise 3
C4.5 actually uses the gain ratio which is defined as the information gain "normalized" (divided) by the entropy before the split. You have written everything you need here. Just put it together.

In [34]:
gr1 = Lab4_helper.gain_ratio(t,X2['Sex'])
gr2 = Lab4_helper.gain_ratio(t,X2['Income'])
gr3 = Lab4_helper.gain_ratio(t,X2['Age'])
gr1,gr2,gr3

(0.0003036265274521183, 0.026623263005287934, 0.053874098667067626)
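The printed values above are consistent with dividing each gain from Exercise 2 by the entropy of ``t`` from Exercise 1. The sketch below follows that definition. Note that Quinlan's C4.5 normalizes by the split information (the entropy of the feature's own value distribution) instead, so treat this as a sketch of the lab's variant. All function names are hypothetical, and the zero-entropy guard is an added assumption to avoid dividing by zero.

```python
import numpy as np
import pandas as pd

def entropy_sketch(values: pd.Series) -> float:
    # Base-2 Shannon entropy of the value frequencies (hypothetical helper).
    probs = values.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())

def gain_sketch(target: pd.Series, feature: pd.Series) -> float:
    # Entropy before the split minus the weighted entropy after it.
    remainder = sum(
        len(subset) / len(target) * entropy_sketch(subset)
        for _, subset in target.groupby(feature)
    )
    return entropy_sketch(target) - remainder

def gain_ratio_sketch(target: pd.Series, feature: pd.Series) -> float:
    """Information gain divided by the entropy of the target before the split."""
    before = entropy_sketch(target)
    if before == 0:
        # Assumption: a pure target cannot be improved, so report zero.
        return 0.0
    return gain_sketch(target, feature) / before

target = pd.Series([0, 0, 0, 1])
feature = pd.Series(["a", "a", "b", "b"])
ratio = gain_ratio_sketch(target, feature)
print(ratio)
```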

#### Exercise 4
Define a function called ``select_split`` that chooses the column to place in the decision tree. This function returns the column name and the gain ratio for this column.

In [35]:
col,gain_ratio = Lab4_helper.select_split(X2,t)
col,gain_ratio

('BMI', 0.06419179429194832)
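Selecting the split amounts to scoring every column and keeping the best one. The sketch below assumes the lab's gain ratio definition (gain divided by the entropy of the target) and breaks ties by column order, which is an arbitrary choice. All function names are hypothetical.

```python
import numpy as np
import pandas as pd

def entropy_sketch(values: pd.Series) -> float:
    # Base-2 Shannon entropy of the value frequencies (hypothetical helper).
    probs = values.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())

def gain_ratio_sketch(target: pd.Series, feature: pd.Series) -> float:
    # Gain divided by the entropy before the split (the lab's variant).
    before = entropy_sketch(target)
    if before == 0:
        return 0.0
    after = sum(
        len(subset) / len(target) * entropy_sketch(subset)
        for _, subset in target.groupby(feature)
    )
    return (before - after) / before

def select_split_sketch(X: pd.DataFrame, target: pd.Series):
    """Return (column name, gain ratio) for the highest-scoring column."""
    ratios = {col: gain_ratio_sketch(target, X[col]) for col in X.columns}
    best = max(ratios, key=ratios.get)  # first column wins on ties
    return best, ratios[best]

toy_X = pd.DataFrame({
    "noise":  ["a", "b", "a", "b"],
    "signal": ["x", "x", "y", "y"],
})
toy_t = pd.Series([0, 0, 1, 1])
best_col, best_ratio = select_split_sketch(toy_X, toy_t)
print(best_col, best_ratio)
```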

#### Exercise 5
Now put it all together and construct a function called ``make_tree`` that returns a tree in the format shown below. This function is recursive. Think carefully about how to debug recursion (i.e., grab yourself a debugger such as https://docs.python.org/3/library/pdb.html). Think carefully about the base cases.
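To make the recursion and its base cases concrete, here is a minimal ID3-style sketch on a toy dataset. It is one possible shape, not the reference solution: every name is hypothetical, it ranks columns by plain information gain (under the lab's gain ratio definition the ranking within a node is the same, since every column is divided by the same entropy), it falls back to the majority label, and it assumes the DataFrame has a unique index.

```python
import numpy as np
import pandas as pd

def entropy_sketch(values):
    # Base-2 Shannon entropy of the value frequencies (hypothetical helper).
    probs = values.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())

def gain_sketch(target, feature):
    # Entropy before the split minus the weighted entropy after it.
    after = sum(len(s) / len(target) * entropy_sketch(s)
                for _, s in target.groupby(feature))
    return entropy_sketch(target) - after

def make_tree_sketch(X, target):
    # Base case 1: every remaining row has the same label.
    if target.nunique() == 1:
        return target.iloc[0]
    # Base case 2: no columns left, so fall back to the majority label.
    if X.shape[1] == 0:
        return target.mode().iloc[0]
    gains = {col: gain_sketch(target, X[col]) for col in X.columns}
    col = max(gains, key=gains.get)
    # Base case 3: no column reduces the entropy (tolerance for float noise).
    if gains[col] <= 1e-12:
        return target.mode().iloc[0]
    # Recursive case: one branch per observed value, with the used column removed.
    branches = {}
    for value, rows in X.groupby(col):
        branches[value] = make_tree_sketch(rows.drop(columns=col),
                                           target.loc[rows.index])
    return {col: branches}

toy_X = pd.DataFrame({
    "outlook": ["sunny", "sunny", "sunny", "rain", "rain", "rain"],
    "windy":   ["no", "yes", "yes", "no", "yes", "yes"],
})
toy_t = pd.Series(["play", "play", "play", "play", "stay", "stay"])
toy_tree = make_tree_sketch(toy_X, toy_t)
print(toy_tree)
```

On this toy data the root splits on `outlook`; the `sunny` branch is already pure and becomes a leaf, while the `rain` branch recurses once more on `windy`.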

In [36]:
tree = Lab4_helper.make_tree(X2,t)
Lab4_helper.print_tree(tree)

{
    "BMI": {
        "(15.957, 18.15]": {
            "Sex": {
                "0.0": 0.0,
                "1.0": {
                    "Age": {
                        "9.0": 0.0,
                        "10.0": 2.0
                    }
                }
            }
        },
        "(18.15, 20.3]": {
            "Education": {
                "2.0": 0.0,
                "3.0": {
                    "Age": {
                        "8.0": 0.0,
                        "9.0": 0.0,
                        "10.0": 2.0,
                        "13.0": 2.0
                    }
                },
                "4.0": 0.0,
                "5.0": 0.0,
                "6.0": 0.0
            }
        },
        "(20.3, 22.45]": {
            "Age": {
                "1.0": 0.0,
                "2.0": 0.0,
                "3.0": 0.0,
                "4.0": 0.0,
                "5.0": 0.0,
                "6.0": 0.0,
                "7.0": 0.0,
                "8.0": 0.0,
              

#### Exercise 6
Create a recursive function called ``generate_rules`` that returns a list of the rules from a tree. A rule has the form:
```python
[('BMI', '(24.6, 26.75]'),
 ('Age', 7.0),
 ('Income', 8.0),
 ('Education', 5.0),
 2.0]
```
A single rule is a list. The last element in the list is the prediction, which is Diabetes_012 = 2.0 in this example. The tuples that precede the last element are the conditions. Put another way, the above rule is equivalent to:
```python
if BMI == '(24.6, 26.75]' and Age == 7.0 and Income == 8.0 and Education == 5.0:
    predicted_value = 2.0
```
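The rule format described above corresponds to one root-to-leaf path per rule. A minimal sketch of that traversal, assuming the nested-dict tree format printed in Exercise 5 and treating any non-dict node as a leaf, could look like this. The name `generate_rules_sketch` and the toy tree are hypothetical.

```python
def generate_rules_sketch(tree, conditions=()):
    """Flatten a nested-dict tree into rules of the form [cond, ..., prediction]."""
    # A leaf is anything that is not a dict: emit the path so far plus the prediction.
    if not isinstance(tree, dict):
        return [list(conditions) + [tree]]
    rules = []
    for column, branches in tree.items():
        for value, subtree in branches.items():
            # Extend the path with this (column, value) test and recurse.
            rules.extend(
                generate_rules_sketch(subtree, conditions + ((column, value),))
            )
    return rules

toy_tree = {"BMI": {"low": 0.0, "high": {"Age": {9.0: 0.0, 10.0: 2.0}}}}
toy_rules = generate_rules_sketch(toy_tree)
for rule in toy_rules:
    print(rule)
```

Passing the accumulated conditions down as an immutable tuple avoids sibling branches accidentally sharing and mutating the same list.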

In [37]:
rules = Lab4_helper.generate_rules(tree)
rules[:5] # the first 5 rules

[[('BMI', '(39.65, 41.8]'), ('Age', 9.0), ('Income', 3.0), 0.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 9.0), ('Income', 2.0), 2.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 9.0), ('Income', 4.0), 1.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 7.0), 0.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 8.0), ('Education', 4.0), 2.0]]

In [38]:
rules

[[('BMI', '(39.65, 41.8]'), ('Age', 9.0), ('Income', 3.0), 0.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 9.0), ('Income', 2.0), 2.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 9.0), ('Income', 4.0), 1.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 7.0), 0.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 8.0), ('Education', 4.0), 2.0],
 [('BMI', '(39.65, 41.8]'),
  ('Age', 8.0),
  ('Education', 5.0),
  ('Income', 1.0),
  2.0],
 [('BMI', '(39.65, 41.8]'),
  ('Age', 8.0),
  ('Education', 5.0),
  ('Income', 7.0),
  0.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 8.0), ('Education', 2.0), 0.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 5.0), 0.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 10.0), 2.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 11.0), ('Income', 8.0), 0.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 11.0), ('Income', 7.0), 2.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 11.0), ('Income', 4.0), 2.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 2.0), 0.0],
 [('BMI', '(39.65, 41.8]'), ('Age', 6.0), 2.0],
 [('BMI', '(24.6, 26.75]'), ('Age', 7.0), ('

#### Exercise 7
Create an improved tree-building function called ``make_tree2``. Like ``make_tree``, it is recursive. It must add support for numeric columns, and it must accept a parameter called ``min_split_count`` that combats overfitting. Minimum split count is an additional base case: only split a node if it holds at least ``min_split_count`` items (i.e., num_elements >= min_split_count). The biggest change is the support for numeric columns (such as ``BMI`` in its original, unbinned format). Please refer to the Marsland textbook for details on handling numeric values. In short, you try all possible locations to divide a numeric variable. For example, if your column has the values:
```
values = [1,3,2,5]
sorted_values = [1,2,3,5]
possible_splits = [<1.5,<2.5,<4]
```
Please make sure you denote your splits like I am doing above and how they are printed below.
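The candidate thresholds in the example are the midpoints between consecutive sorted distinct values. A small sketch of that step, with a label format inferred from the printed tree (for example ``Age<7.50``), is shown here. The function names are hypothetical, and the two-decimal formatting is an assumption based on that output.

```python
import numpy as np

def candidate_thresholds_sketch(values):
    """Midpoints between consecutive sorted distinct values."""
    uniq = np.unique(values)  # np.unique returns the distinct values, sorted
    return ((uniq[:-1] + uniq[1:]) / 2).tolist()

def split_label_sketch(column, threshold):
    # Label format assumed from the printed tree, e.g. "Age<7.50".
    return f"{column}<{threshold:.2f}"

thresholds = candidate_thresholds_sketch([1, 3, 2, 5])
print(thresholds)                                          # [1.5, 2.5, 4.0]
print([split_label_sketch("Age", t) for t in thresholds])  # ['Age<1.50', 'Age<2.50', 'Age<4.00']
```

Each threshold turns the numeric column into a True/False feature, which can then be scored with the same gain ratio as any categorical column.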

In [39]:
tree2 = Lab4_helper.make_tree2(X,t,min_split_count=5)
Lab4_helper.print_tree(tree2)

{
    "Age<7.50": {
        "False": {
            "BMI<31.50": {
                "False": {
                    "Income<2.50": {
                        "False": {
                            "HighChol<0.50": {
                                "False": {
                                    "Veggies<0.50": {
                                        "False": {
                                            "Education<4.50": {
                                                "False": {
                                                    "Smoker<0.50": {
                                                        "False": {
                                                            "Sex<0.50": {
                                                                "False": {
                                                                    "Fruits<0.50": {
                                                                        "False": 2.0,
                                                             

#### Exercise 8
So how are we doing? We can put everything together and evaluate our solutions.

Create a function to make predictions called ``make_prediction``. Then use your Lab3_helper solutions to do some evaluations.
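One way to think about prediction is to scan the rules and return the prediction of the first rule whose conditions all hold for the row, falling back to the default when nothing matches (for example, when the row has a value never seen in training). The sketch below handles only equality conditions such as ``('Age', 9.0)``. How threshold splits like ``Age<7.50`` appear in the rules is not shown in this lab, so they are deliberately not covered. The name `make_prediction_sketch` and the toy rules are hypothetical.

```python
import pandas as pd

def make_prediction_sketch(rules, row, default):
    """Return the prediction of the first rule that matches row, else default."""
    for rule in rules:
        # Everything except the last element is a (column, value) condition.
        *conditions, prediction = rule
        if all(row[column] == value for column, value in conditions):
            return prediction
    return default

toy_rules = [
    [("BMI", "low"), 0.0],
    [("BMI", "high"), ("Age", 9.0), 0.0],
    [("BMI", "high"), ("Age", 10.0), 2.0],
]
seen = pd.Series({"BMI": "high", "Age": 10.0})
unseen = pd.Series({"BMI": "medium", "Age": 10.0})
print(make_prediction_sketch(toy_rules, seen, default=0))    # 2.0
print(make_prediction_sketch(toy_rules, unseen, default=0))  # 0
```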

In [40]:
default = 0
from sklearn.model_selection import train_test_split

X2_train, X2_test, t_train, t_test = train_test_split(X2, t, test_size=0.3, random_state = 0)
X_train, X_test = X.loc[X2_train.index], X.loc[X2_test.index]

tree_id3 = Lab4_helper.make_tree(X2_train,t_train)
rules_id3 = Lab4_helper.generate_rules(tree_id3)
tree_c45 = Lab4_helper.make_tree2(X_train,t_train, min_split_count=20)
rules_c45 = Lab4_helper.generate_rules(tree_c45)

y_id3 = X2_test.apply(lambda x: Lab4_helper.make_prediction(rules_id3,x,default),axis=1)
y_c45 = X_test.apply(lambda x: Lab4_helper.make_prediction(rules_c45,x,default),axis=1)

In [41]:
import Lab3_helper

In [42]:
# Evaluate the id3
cm_id3 = Lab3_helper.confusion_matrix(t_test,y_id3,labels=[0,1,2])
stats_id3 = Lab3_helper.evaluation(cm_id3,positive_class=2)
stats_id3

{'accuracy': 0.7093425605536332,
 'sensitivity/recall': 0.2857142857142857,
 'specificity': 0.8632075471698113,
 'precision': 0.43137254901960786,
 'F1': 0.34375}

In [43]:
# Evaluate the c45
cm_c45 = Lab3_helper.confusion_matrix(t_test,y_c45,labels=[0,1,2])
stats_c45 = Lab3_helper.evaluation(cm_c45,positive_class=2)
stats_c45

{'accuracy': 0.7137931034482758,
 'sensitivity/recall': 0.14285714285714285,
 'specificity': 0.92018779342723,
 'precision': 0.39285714285714285,
 'F1': 0.2095238095238095}

In [44]:
source = pd.DataFrame.from_records([stats_id3,stats_c45])
source['Method'] = ['ID3','C4.5']
source

Unnamed: 0,accuracy,sensitivity/recall,specificity,precision,F1,Method
0,0.709343,0.285714,0.863208,0.431373,0.34375,ID3
1,0.713793,0.142857,0.920188,0.392857,0.209524,C4.5


**Problem 1:** How do the two algorithms compare for this dataset?

Your answer here: https://canvas.calpoly.edu/courses/81417/assignments/545570

**Problem 2:** Is this a robust experiment? How would you make it more robust? i.e., what are the flaws with what we did?

Your answer here: https://canvas.calpoly.edu/courses/81417/assignments/545571

**Problem 3:** Repeat this experiment with min_split_count = 10, 20, 40, 80. How do the results change for C4.5?

Your answer here: https://canvas.calpoly.edu/courses/81417/assignments/545572

In [45]:
# Good job!
# Don't forget to push with ./submit.sh

#### Having trouble with the test cases and the autograder?

You can always load up the answers for the autograder. The autograder runs your code and compares your answer to the expected answer. I manually review your code, so there is no need to hide this from you.

```python
import joblib
answers = joblib.load(f"{home}/csc-466-student/tests/answers_Lab4.joblib")
answers.keys()
```