# Data-Mining Course (EECS 6412)
# Assignment (II): Decision Tree Classifier Implementation in Python





## Objective: Implement a Decision Tree classifier in Python to gain a deeper understanding of its working principles.

**Overall Instructions:**


*   Your task is to implement a Decision Tree classifier in Python.
*   The implementation has been broken down into multiple subfunctions, each with accompanying hints. Your goal is to complete the code for each function.
* You are only allowed to use the **pandas** and **numpy** libraries for this assignment. Some functions from Pandas have been provided for your convenience in the initial section, and you may use them if you feel they are necessary.
* Each part of your solution will be graded separately; however, the sections are interrelated. It is crucial that your code is well-documented with comments explaining each part of your implementation.
* Please be aware that your responses will be thoroughly reviewed to ensure originality. Plagiarized or copied work will result in penalties.


**- Please skip the following descriptions and move directly to the Questions section if you are familiar with reading CSV files with the Pandas library.**



---


## Please write your full name/names and student IDs here:




*   Full Name: Parsa Merat
*   Student ID: 217554197





        

---





## Dataset Description for Car Acceptability Classification:
Your code must be general and should work on any tabular dataset with categorical data types. For this example, you have been provided with two datasets: a training dataset (1400 samples) and a test dataset (327 samples). Please download the datasets from [here](https://drive.google.com/drive/folders/1aka1ySucu1e3PqytQnVdEf63v9LT0E5z?usp=sharing). These samples represent the decisions of car experts regarding the acceptability of cars. The experts have categorized the cars into one of four classes: "acceptable," "unacceptable," "good," or "very good" based on six categorical features.

# Features:

* **'BUYING':** This feature determines the purchase price of the car and is categorized into four classes: 'vhigh' (very high), 'high', 'med' (medium), or 'low'.

* **'MAINTENANCE':** This feature indicates how high the car's maintenance cost is, and it is categorized into four classes: 'vhigh' (very high), 'high', 'med' (medium), or 'low'.

* **'DOORS':** This feature indicates the number of doors each car has: '2', '3', '4', or '5more' (5 or more doors).

* **'PERSONS':** This feature determines the car's capacity in terms of the number of persons it can accommodate and is categorized as '2', '4', or 'more'.

* **'LUG_BOOT':** This feature represents the size of the car's luggage boot (trunk) and is categorized as 'small', 'med' (medium), or 'big'.

* **'SAFETY':** This feature provides an estimate of the car's safety level and is categorized as 'low', 'med' (medium), or 'high'.

* **'CLASS':** This is the target variable. It indicates the acceptance level of the car and is categorized as 'unacc' (unacceptable), 'acc' (acceptable), 'good', or 'vgood' (very good).

**Please note that in this example the "CLASS" attribute is located at the last column of the tabular datasets**




---


## Accessing the Datasets:
To access and read datasets from Google Drive in Google Colab using the Pandas library, you can follow these steps:

1.   Upload CSV Files to Google Drive: First, ensure that you've uploaded the CSV files (train dataset and test dataset) to your Google Drive. You can create a folder for your project and upload the files there.


2.   Mount Google Drive in Google Colab: Mount your Google Drive using the following code:


In [1]:
# # Running locally in Jupyter, so these lines are not needed
# from google.colab import drive
# drive.mount('/content/drive')



3.   Access and Read Data using Pandas: You can access your CSV files in the mounted Google Drive directory. For example, if your CSV files are located in a folder named "data-mining/assignment2/UG/" in your Google Drive, you can read them as follows:




In [2]:
import pandas as pd

# Define the file paths for your CSV files
# train_csv_path = '/content/drive/MyDrive/data-mining/assignment2/UG/data_train_c.csv'
# test_csv_path = '/content/drive/MyDrive/data-mining/assignment2/UG/data_test_c.csv'
train_csv_path = 'data_train_c.csv'
test_csv_path = 'data_test_c.csv'

# Read the data into Pandas DataFrames
train_df = pd.read_csv(train_csv_path)
test_df = pd.read_csv(test_csv_path)



4.   See Some Samples with head() Function:





In [3]:
# See the first 5 samples in the training dataset
print("Samples in the Training Dataset:")
print(train_df.head())

# See the first 5 samples in the test dataset
print("\nSamples in the Test Dataset:")
print(test_df.head())

Samples in the Training Dataset:
  BUYING MAINTENANCE  DOORS PERSONS LUG_BOOT SAFETY  CLASS
0  vhigh         low      2    more      med   high    acc
1  vhigh         low  5more       2      med   high  unacc
2    med       vhigh      2       2      med    low  unacc
3    med         low      2       2      med    low  unacc
4    low         med      2       2      big    low  unacc

Samples in the Test Dataset:
  BUYING MAINTENANCE  DOORS PERSONS LUG_BOOT SAFETY  CLASS
0    med       vhigh      2    more    small   high  unacc
1    low         low      4       4    small    med    acc
2    low         low      4    more    small    low  unacc
3  vhigh       vhigh      4       2      big    med  unacc
4  vhigh         med  5more       4    small   high    acc




5.   Access Feature Names using columns Attribute:






In [4]:
# Get the feature names (column names) of the training dataset
feature_names = train_df.columns
print("Feature Names:")
print(feature_names)


Feature Names:
Index(['BUYING', 'MAINTENANCE', 'DOORS', 'PERSONS', 'LUG_BOOT', 'SAFETY',
       'CLASS'],
      dtype='object')


6. Access Each Column as a Series:

In [5]:
# Access the 'BUYING' column as a Series using square bracket notation
buying_price = train_df['BUYING']
print(buying_price.head())

# Access the 'MAINTENANCE' column:
maintenance_cost = train_df['MAINTENANCE']
print(maintenance_cost.head())

# Access the 'CLASS' column in the training dataset as a Series
labels = train_df['CLASS']
print(labels.head())

0    vhigh
1    vhigh
2      med
3      med
4      low
Name: BUYING, dtype: object
0      low
1      low
2    vhigh
3      low
4      med
Name: MAINTENANCE, dtype: object
0      acc
1    unacc
2    unacc
3    unacc
4    unacc
Name: CLASS, dtype: object


7. Use the value_counts() function to find the number of samples for each distinct value in a particular column:

In [6]:
print("Counts of each distinct value in 'MAINTENANCE':")
print(maintenance_cost.value_counts())

Counts of each distinct value in 'MAINTENANCE':
MAINTENANCE
vhigh    355
high     355
low      350
med      340
Name: count, dtype: int64


---






# Questions
---

## - Part 1: Check Terminal Node Condition:
(Q.1., **5 Marks**): In the first step, we need to check if a node containing a DataFrame is a terminal node or it needs further splitting. Implement a function called "check_if_terminal" to do this task.

Function Requirements:

Input:


*   parent_data: the DataFrame corresponding to a node.

*   threshold: Proportion threshold for the majority class.



Calculate the proportion of samples with the majority class label.

If the proportion ≥ threshold, return "Leaf" as flag.

If the proportion < threshold, return "Internal" as the flag.

In addition to the flag, the function must return the majority class ("acc", "unacc", "good", or "vgood").
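
As a quick illustration of the rule above, here is a minimal sketch on a made-up list of labels (not the assignment data). It uses only the standard library, and the helper name `majority_and_flag` is hypothetical; the graded function below works on a DataFrame instead.

```python
from collections import Counter

def majority_and_flag(labels, threshold):
    """Illustrative helper: return (flag, majority_class) for a plain list of labels."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    proportion = majority_count / len(labels)
    flag = "Leaf" if proportion >= threshold else "Internal"
    return flag, majority_class

toy_labels = ["unacc"] * 8 + ["acc"] * 2   # majority proportion is 0.8
print(majority_and_flag(toy_labels, 0.9))  # ('Internal', 'unacc')
print(majority_and_flag(toy_labels, 0.7))  # ('Leaf', 'unacc')
```

The same node can therefore be "Internal" under a strict threshold and a "Leaf" under a looser one.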

In [7]:

def check_if_terminal(dataframe, threshold):
    # Get all attribute names from the DataFrame
    all_attrs = dataframe.columns
    
    # Select the last attribute as the class attribute
    class_attrs = all_attrs[-1]
    
    # Extract the labels (values of the class attribute)
    labels = dataframe[class_attrs]
    
    #.................................
    # write the rest here:
    counts = labels.value_counts(sort=False)
    majority_class = counts.idxmax() 
    ratio = counts[majority_class] / len(labels.index)
    
    flag = "Internal" if ratio<threshold else "Leaf"
    
    # output flag must be a string (whether "Internal" or "Leaf")
    # majority_class must be a string indicating the majority label of the samples in the node
    #..................................
    return flag, majority_class

In [8]:
# Check your implementation on training dataframe:
flag, majority_class = check_if_terminal(train_df, 0.9)
print("the node type is {}".format(flag))
print("the majority class of the node is {}".format(majority_class))

the node type is Internal
the majority class of the node is unacc



---

## - Part 2: Entropy Function:
(Q.2.,  **10 Marks**): In order to split a node in a decision tree based on the Information Gain criterion, we need to calculate the entropy of the samples. Entropy is a measure of impurity in the data, and it is used to quantify the uncertainty associated with a set of class labels.


**Task:** Write a Python function called "entropy" that takes the CLASS column of the dataframe, denoted as "labels", and returns the entropy as the output.

Function Signature: def entropy(labels: pd.Series) -> float
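
For reference, the standard Shannon entropy of a label set with $k$ distinct classes, where $p_i$ is the fraction of samples belonging to class $i$, can be written as:

$$H(\text{labels}) = -\sum_{i=1}^{k} p_i \log_2 p_i$$

With base-2 logarithms the result is measured in bits. It is 0 for a pure node and reaches its maximum, $\log_2 k$, when all $k$ classes are equally frequent.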

In [9]:
import numpy as np

def entropy(labels):
    # Count the occurrences of each unique label
    value_counts = labels.value_counts()
    #.................................
    # write the rest here:
    
    probs = value_counts.to_numpy()
    probs = probs / len(labels.index)
    logs = np.log2(probs)
    
    entp = np.dot(logs, probs)
    
    #.................................
    #Return the calculated entropy
    return -entp


In [10]:
# Check your implementation on training dataframe:
labels = train_df["CLASS"]
entrp = entropy(labels)
print("entropy of the node is {}".format(entrp))

entropy of the node is 1.1790359988713874
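
As a cross-check that does not depend on pandas or numpy, the same value can be recomputed by hand from the class counts of the training set (the counts are copied from the `value_counts()` output in the last cell of this notebook). This is only a sanity-check sketch, not part of the graded solution.

```python
import math

# Class counts of the training set, copied from the last cell of this notebook
class_counts = {"unacc": 997, "acc": 298, "good": 55, "vgood": 50}

total = sum(class_counts.values())
probs = [count / total for count in class_counts.values()]

# Shannon entropy in bits
h = -sum(p * math.log2(p) for p in probs)
print(total)  # 1400
print(h)      # should agree with the value printed above, up to floating-point rounding
```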




---

## - Part 3: Calculating Information Gain:
(Q.3., **15 Marks**): In this step, you are required to implement a function named "information_gain" that computes the information gain obtained by splitting the samples, whose 'CLASS' column is referenced as 'labels', on a specific attribute column denoted as 'x'. Note that both 'labels' and 'x' are columns of a DataFrame. Please use the function written in "Part 2" for this part.
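
For reference, the usual definition of information gain subtracts the size-weighted entropy of the child subsets from the entropy of the parent. Writing $\text{labels}_{x=v}$ for the labels of the samples whose attribute $x$ takes the value $v$:

$$\mathrm{IG}(x, \text{labels}) = H(\text{labels}) - \sum_{v \in \mathrm{values}(x)} \frac{|\text{labels}_{x=v}|}{|\text{labels}|}\, H(\text{labels}_{x=v})$$

Here $H$ is the entropy from Part 2, and the weights are the fractions of parent samples that fall into each child.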



In [11]:
def information_gain(x, labels):
    #Calculate the entropy of the parent node
    parent_entropy = entropy(labels)
    #.................................
    # write the rest here:
    children_entropy = 0
    
    # Weighted sum of the entropies of the subsets produced by each distinct value of x
    for val in x.unique():
        subset = labels[x == val]
        children_entropy += len(subset.index) * entropy(subset)
    children_entropy /= len(labels.index)

    #Calculate the information gain by subtracting child entropy from parent entropy
    info_gain = parent_entropy - children_entropy
    #.................................
    return info_gain

In [12]:
# Check your implementation for training dataframe on "PERSONS" attribute:
labels = train_df["CLASS"]
x = train_df["PERSONS"]
info_gain = information_gain(x,labels)
print("information gain of the node in splitting over PERSONS attribute is {}".format(info_gain))

information gain of the node in splitting over PERSONS attribute is 0.21326310194104836




---


## - Part 4: Selecting the Best Attribute for Splitting
(Q.4., **10 Marks**): In this part, you are tasked with implementing a function called "select_attribute". This function takes a parent DataFrame, referenced as "parent_data", along with a list of splittable attributes, denoted by "remaining_attrs", as input and returns a string with the name of the attribute that yields the highest information gain after splitting. You may use the function written in "Part 3".


In [13]:
def select_attribute(parent_data, remaining_attrs):
    all_attrs = parent_data.columns
    # Extract the class attribute:
    class_attr = all_attrs[-1]
    
    # Extract the labels (target values) from the parent data
    labels = parent_data[class_attr]
    
    #.................................
    # write the rest here:
    # Loop through "remaining_attrs" attributes and calculate their information gains
    gains = np.zeros(len(remaining_attrs))
    for i,attr in enumerate(remaining_attrs):
        gains[i] = information_gain(parent_data[attr], labels)
    sel_attr = remaining_attrs[gains.argmax()]
    
    # Find the attribute with the highest information gain and return it as sel_attr
    #.................................
    
    return sel_attr


In [14]:
# Check your implementation on training dataframe:
remaining_attrs = list(train_df.columns[:-1])
sel_attr = select_attribute(train_df, remaining_attrs)
print("the best attribute for splitting the node is {}".format(sel_attr))

the best attribute for splitting the node is SAFETY





---
## - Part 5: Splitting the Nodes at Each Tree Level

(Q.5., **20 Marks**):


In this part, you will implement a crucial piece of the decision tree by creating a Python function called data_split. The purpose of this function is to split a parent node's dataframe into child dataframes based on the best attribute, i.e. the one that yields the highest information gain. You may use the helper functions that you have already implemented in previous sections.


**Instructions:**
* Write a function called "data_split" to split all the nodes in level "n" and to generate all the children nodes in level "n+1".

* Perform node splitting in a systematic manner, progressing level by level. This entails creating all nodes at level n+1 by dividing all nodes eligible for splitting at level n. Refer to the example below for clarification:

![Image](https://drive.google.com/uc?export=download&id=1kIOCkYaxUJMEKumBP6RxOLriQQY2Wlqx)
  As depicted in the illustration, at level 1, there is a solitary node designated as "Node_1_1," symbolizing the first node of the first level. Level 1 has been subdivided into three nodes, identified as "Node_2_1," "Node_2_2," and "Node_2_3," signifying the first, second, and third nodes of the second level of splitting. Please adhere to this notation for naming each node.

* Imagine a dictionary named "dataframe_dict," where the "keys" correspond to the node names at a specific splitting level, and the "values" represent the associated dataframes. To illustrate, for level 1, the "dataframe_dict" would consist of a single key, "Node_1_1," with the corresponding value being the primary dataframe:
                dataframe_dict = {"Node_1_1": the main dataframe}
In this example, following the execution of the "data_split" function, the "dataframe_dict" dictionary should be replaced with a dictionary containing three entries, as demonstrated below:
      dataframe_dict = {
                            "Node_2_1": dataframe_2_1,
                            "Node_2_2": dataframe_2_2,
                            "Node_2_3": dataframe_2_3
                        }



* Similarly, consider another dictionary called "remaining_attrs" with "keys" representing the nodes' names, and "values" representing the splittable attributes for each node.  For the first level, the "remaining_attrs" dictionary might be defined as:

      remaining_attrs = {
                            "Node_1_1": ['BUYING', 'MAINTENANCE', 'DOORS', 'PERSONS', 'LUG_BOOT', 'SAFETY']
                        }
but after running the function "data_split", it would be updated to a dictionary with three keys-values as:

      remaining_attrs = {
                         "Node_2_1": ['BUYING', 'MAINTENANCE', 'DOORS', 'LUG_BOOT', 'SAFETY'],
                         "Node_2_2": ['BUYING', 'MAINTENANCE', 'DOORS', 'LUG_BOOT', 'SAFETY'] ,
                         "Node_2_3": ['BUYING', 'MAINTENANCE', 'DOORS', 'LUG_BOOT', 'SAFETY']
                         }

Please note that once we've performed a split on a categorical attribute such as "PERSONS" and generated the children nodes in the subsequent level, we are no longer permitted to split on the same categorical attribute within that branch of the tree. It's important to emphasize that this restriction doesn't apply to numerical attributes.

In the context of this example, this means that the "remaining_attrs" dictionary is updated to a three-element dictionary, where none of the nodes in this specific branch have the "PERSONS" attribute as a splittable option anymore.
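
To make the per-branch restriction concrete, here is a minimal sketch with a shortened, made-up attribute list. It shows one way to hand each child its own copy of the remaining attributes, so that a later split in one branch does not remove an attribute from its siblings or from the parent's list.

```python
parent_attrs = ["BUYING", "PERSONS", "SAFETY"]
split_attr = "PERSONS"

# Attributes still splittable below this node: everything except the one just used
child_attrs = [attr for attr in parent_attrs if attr != split_attr]

# Each child receives an independent copy of the list
remaining_attrs = {name: list(child_attrs) for name in ["Node_2_1", "Node_2_2", "Node_2_3"]}

# A later split of Node_2_1 on SAFETY only affects that branch
remaining_attrs["Node_2_1"].remove("SAFETY")

print(remaining_attrs["Node_2_1"])  # ['BUYING']
print(remaining_attrs["Node_2_2"])  # ['BUYING', 'SAFETY']
print(parent_attrs)                 # ['BUYING', 'PERSONS', 'SAFETY']
```

If the children shared a single list object instead, removing "SAFETY" in one branch would silently remove it from every sibling as well.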

* Consider the "tree_model" as a list containing three dictionaries: "tree_connectivity", "node_labels", and "node_types":

            tree_model = [tree_connectivity, node_labels, node_types]

where "tree_connectivity" is a dictionary representing each node's connection to its parent. The "node_types" and "node_labels" dictionaries contain the type ("Leaf" or "Internal") and the majority class of each node, respectively. Your "data_split" function must take the "tree_model" generated up to level "n" and update it to the model up to level "n+1" after splitting. See the image below for this example:
The tree_model at level 1 is:

<img src="https://drive.google.com/uc?export=download&id=1GSJzh4CNE298LFXQR86LYpEfj4Q883-S"  width=400>


After running the "data_split" function, the tree_model will be updated up to level 2 as follows:

![Image](https://drive.google.com/uc?export=download&id=1Y3sGXHBpQtPVpMh8UnVlO0AYXvHNTGfo)


**Therefore**: You must write the function "data_split", which takes "dataframe_dict", "remaining_attrs", "tree_model", "level", and "threshold" as input and updates "dataframe_dict", "remaining_attrs", and "tree_model" up to level "level+1". The function must also return a boolean flag "stop_train", which must be True if no child node is generated and False otherwise. Here, the input "threshold" is the majority-class threshold for checking whether a node is a "Leaf" node or an "Internal" node.

**To complete the function**:
* Loop through the nodes in "dataframe_dict" in the current "level". For each node, check if it's an "Internal" node and if so, find the best attribute for splitting. Create child nodes and finally update all the variables.





In [15]:

def data_split(dataframe_dict, remaining_attrs, tree_model, level, threshold):
    # Unpack the tree_model list into three separate variables
    [tree_connectivity, node_labels, node_types] = tree_model
    
    # Create an empty dictionary to store new child dataframes
    dataframe_dict_new = {}
    remaining_attrs_new = {}
    
    # Initialize a counter for child nodes
    child_ind = 1
    #.................................
    # write the rest here:
    # Iterate over keys in the dataframe_dict (representing nodes at the current level)

    # Two small changes were made to the suggested design:
    # 1. dataframe_dict and remaining_attrs only hold data for Internal nodes. A Leaf is never
    #    split again, so there is no need to carry its data to the next level
    #    (leaves are still recorded in tree_model).
    # 2. tree_connectivity stores the selected attribute (e.g. DOORS) and the attribute values
    #    (e.g. 2, 4, 5more) in separate fields, instead of a single condition string such as "DOORS==2".
    
    
    for parent_node, df in dataframe_dict.items():
        remaining_list = remaining_attrs[parent_node]
        selected_atr = select_attribute(df, remaining_list)

        # Record the split attribute and prepare the mapping from attribute value to child node
        tree_connectivity[parent_node] = {"split_attribute": selected_atr, "attribute_value": {}}

        # Attributes still available to the children, built as a new list so the caller's list is not mutated
        child_attrs = [attr for attr in remaining_list if attr != selected_atr]

        # Create one child node per distinct value of the selected attribute and update the tree model
        for val in df[selected_atr].unique():
            new_df = df.loc[df[selected_atr] == val]
            node_name = "node_" + str(level + 1) + "_" + str(child_ind)
            child_ind += 1

            flag, majority_class = check_if_terminal(new_df, threshold)
            if not child_attrs:  # no attributes left to split on, so the node must be a leaf
                flag = "Leaf"

            node_labels[node_name] = majority_class
            node_types[node_name] = flag

            tree_connectivity[parent_node]["attribute_value"][str(val)] = node_name

            # Only internal nodes are carried over to the next level
            if flag == "Internal":
                dataframe_dict_new[node_name] = new_df
                remaining_attrs_new[node_name] = child_attrs.copy()
        
    
    
    # Replace the old dictionaries with the new ones
    dataframe_dict = dataframe_dict_new
    remaining_attrs = remaining_attrs_new
    # stop_train is True when no child node was generated at this level
    stop_train = child_ind == 1
    # update the tree_model and return it
    # also return True as stop_train if no child node is generated. Otherwise return False
    #.................................
    # Return the updated tree_model and a flag indicating whether training should stop
    return tree_model, stop_train, dataframe_dict, remaining_attrs


In [16]:
# Now Check your implementation on training dataframe:
# Initializing
threshold = 0.9

tree_connectivity = {}

flag, majority_class = check_if_terminal(train_df, 0.9)

node_types = {"node_1_1": flag}
node_labels = {"node_1_1": majority_class}

# Create an initial tree_model
tree_model = [tree_connectivity, node_labels, node_types]

# Create an initial dataframe_dict
dataframe_dict = {"node_1_1": train_df}

# Create an initial remaining_attrs

independent_attrs = list(train_df.columns[:-1])
remaining_attrs = {"node_1_1": independent_attrs}

# Set level to 1
level = 1


# Update tree model
tree_model, stop_train, dataframe_dict, remaining_attrs = data_split(dataframe_dict, remaining_attrs, tree_model, level, threshold)


[tree_connectivity, node_labels, node_types] = tree_model

print("\n tree connectivity:")
print(tree_connectivity)

print("\n node labels:")
print(node_labels)

print("\n node types:")
print(node_types)

print("\n remaining attributes are:")
print(remaining_attrs)



 tree connectivity:
{'node_1_1': {'split_attribute': 'SAFETY', 'attribute_value': {'high': 'node_2_1', 'low': 'node_2_2', 'med': 'node_2_3'}}}

 node labels:
{'node_1_1': 'unacc', 'node_2_1': 'unacc', 'node_2_2': 'unacc', 'node_2_3': 'unacc'}

 node types:
{'node_1_1': 'Internal', 'node_2_1': 'Internal', 'node_2_2': 'Leaf', 'node_2_3': 'Internal'}

 remaining attributes are:
{'node_2_1': ['BUYING', 'MAINTENANCE', 'DOORS', 'PERSONS', 'LUG_BOOT'], 'node_2_3': ['BUYING', 'MAINTENANCE', 'DOORS', 'PERSONS', 'LUG_BOOT']}




---

## - Part 6: Training the Decision Tree

(Q.6., **10 Marks**): Now, let's create a function called "tree_train" to train the decision tree. This function begins by initializing the tree model and dataframe dictionary using the root node named "node_1_1." It then iteratively updates these structures as it progresses through the tree, continuing until no further child nodes are generated. The process starts at level 1, and with each iteration, the level is incremented. Importantly, make sure to utilize the "data_split" function, which you've previously implemented, to assist in the tree construction. Ultimately, the function must return the fully trained tree model.



In [17]:
def tree_train(training_data, threshold):
    # Initializing
    tree_connectivity = {}
    
    flag, majority_class = check_if_terminal(training_data, threshold)
    
    node_types = {"node_1_1": flag}
    node_labels = {"node_1_1": majority_class}
    
    # Create a tree_model list to store connectivity, node labels, and node types
    tree_model = [tree_connectivity, node_labels, node_types]
    
    # Create a dataframe_dict with the initial training data and associate it with the root node
    dataframe_dict = {"node_1_1": training_data}
    
    # Create a remaining_attrs dictionary with all the independent attributes and associate it with the root node
    indp_attrs = list(training_data.columns[:-1])
    remaining_attrs = {"node_1_1": indp_attrs}
    
    # Initialize the level of the tree to 1
    level = 1
    
    
    # Continue tree construction until a stopping condition is met (use while loop)
    #.................................
    # write the rest here:
    # write a loop function and exit the loop if terminating criterion is met
    stop_train = flag=="Leaf"
    while not stop_train:
        tree_model, stop_train, dataframe_dict, remaining_attrs = data_split(dataframe_dict, remaining_attrs, tree_model, level, threshold)
        level += 1    
    #.................................
    # Return the final tree model
    return tree_model

In [18]:
# Check your implementation on training dataframe:
tree_model = tree_train(train_df, 0.9)
[tree_connectivity, node_labels, node_types] = tree_model

print("\n tree connectivity:")
print(tree_connectivity)

print("\n node labels:")
print(node_labels)

print("\n node types:")
print(node_types)


 tree connectivity:
{'node_1_1': {'split_attribute': 'SAFETY', 'attribute_value': {'high': 'node_2_1', 'low': 'node_2_2', 'med': 'node_2_3'}}, 'node_2_1': {'split_attribute': 'PERSONS', 'attribute_value': {'more': 'node_3_1', '2': 'node_3_2', '4': 'node_3_3'}}, 'node_2_3': {'split_attribute': 'PERSONS', 'attribute_value': {'4': 'node_3_4', 'more': 'node_3_5', '2': 'node_3_6'}}, 'node_3_1': {'split_attribute': 'BUYING', 'attribute_value': {'vhigh': 'node_4_1', 'low': 'node_4_2', 'high': 'node_4_3', 'med': 'node_4_4'}}, 'node_3_3': {'split_attribute': 'BUYING', 'attribute_value': {'high': 'node_4_5', 'low': 'node_4_6', 'vhigh': 'node_4_7', 'med': 'node_4_8'}}, 'node_3_4': {'split_attribute': 'BUYING', 'attribute_value': {'vhigh': 'node_4_9', 'low': 'node_4_10', 'high': 'node_4_11', 'med': 'node_4_12'}}, 'node_3_5': {'split_attribute': 'BUYING', 'attribute_value': {'high': 'node_4_13', 'low': 'node_4_14', 'med': 'node_4_15', 'vhigh': 'node_4_16'}}, 'node_4_1': {'split_attribute': 'MAINTE



---

## - Part 7: Prediction by the Decision Tree
(Q.7., **20 Marks**): Following the completion of decision tree training, the next step is to implement the prediction process through the trained tree structure. To achieve this, we need to create a function named 'tree_prediction.' This function takes two inputs: a test dataframe containing the samples to be predicted and the trained decision tree. It returns the predicted labels generated by the decision tree as a single DataFrame column.
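
The traversal for a single sample can be sketched on a hand-built toy tree. The dictionary layout (`split_attribute` / `attribute_value`) is the one produced by the `data_split` implementation in this notebook; the tree contents, the samples, and the helper name `predict_one` are made up for illustration. As one possible design choice, a value that was not seen at a node during training falls back to the label of the node where the traversal stopped.

```python
def predict_one(sample, tree_connectivity, node_labels, node_types, root="node_1_1"):
    """Illustrative helper: walk from the root to a leaf and return the label found there."""
    node = root
    while node_types[node] != "Leaf":
        split = tree_connectivity[node]
        value = str(sample[split["split_attribute"]])
        if value not in split["attribute_value"]:
            break  # unseen value: stop and use the current node's majority label
        node = split["attribute_value"][value]
    return node_labels[node]

# Hypothetical two-level tree that splits only on SAFETY
toy_connectivity = {
    "node_1_1": {"split_attribute": "SAFETY",
                 "attribute_value": {"low": "node_2_1", "high": "node_2_2"}},
}
toy_node_labels = {"node_1_1": "unacc", "node_2_1": "unacc", "node_2_2": "acc"}
toy_node_types = {"node_1_1": "Internal", "node_2_1": "Leaf", "node_2_2": "Leaf"}

print(predict_one({"SAFETY": "high"}, toy_connectivity, toy_node_labels, toy_node_types))  # acc
print(predict_one({"SAFETY": "med"}, toy_connectivity, toy_node_labels, toy_node_types))   # unacc
```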

In [19]:
def tree_prediction(testing_data, tree_model):

    pred_labels = []
    # Unpack the tree_model list into three separate variables: tree_connectivity, node_labels, and node_types
    [tree_connectivity, node_labels, node_types] = tree_model
    
    # Iterate through each sample in the testing_data
    for i in range(len(testing_data)):
        # Get a sample by position, so the loop also works when the index is not 0..n-1
        sample = testing_data.iloc[i]
        
        # Start at the root node, which is always named "node_1_1"
        current_node = "node_1_1"
        
        #.................................
        # write the rest here:
        # Begin a loop to traverse the decision tree until a leaf node is reached
        current_type = node_types[current_node]

        while current_type != "Leaf":  # keep descending until a leaf is reached
            split_atr = tree_connectivity[current_node]["split_attribute"]
            value = sample[split_atr]
            
            # If this value was never seen at this node during training, stop here and use the current node's label
            if str(value) not in tree_connectivity[current_node]["attribute_value"]:
                break
            
            current_node = tree_connectivity[current_node]["attribute_value"][str(value)]
            current_type = node_types[current_node]
            
        pred_labels.append(node_labels[current_node])
    
    # pred_labels now holds one predicted label per sample, in row order
    #.................................
    # Return the list of predicted labels
    return pred_labels

In [20]:
# Check your implementation on training dataframe:
tree_model = tree_train(train_df, 0.9)
pred_labels = tree_prediction(train_df, tree_model)
print(pred_labels)

['acc', 'unacc', 'unacc', 'unacc', 'unacc', 'vgood', 'unacc', 'unacc', 'unacc', 'unacc', 'acc', 'unacc', 'unacc', 'unacc', 'unacc', 'acc', 'unacc', 'unacc', 'unacc', 'acc', 'vgood', 'acc', 'good', 'good', 'unacc', 'unacc', 'unacc', 'acc', 'unacc', 'acc', 'acc', 'acc', 'acc', 'unacc', 'unacc', 'unacc', 'unacc', 'unacc', 'acc', 'good', 'unacc', 'unacc', 'acc', 'unacc', 'unacc', 'acc', 'unacc', 'acc', 'acc', 'unacc', 'acc', 'unacc', 'unacc', 'unacc', 'acc', 'unacc', 'unacc', 'unacc', 'unacc', 'unacc', 'unacc', 'acc', 'unacc', 'unacc', 'unacc', 'acc', 'unacc', 'unacc', 'unacc', 'unacc', 'acc', 'unacc', 'vgood', 'unacc', 'unacc', 'unacc', 'unacc', 'unacc', 'unacc', 'unacc', 'acc', 'unacc', 'unacc', 'unacc', 'unacc', 'good', 'unacc', 'acc', 'unacc', 'unacc', 'unacc', 'unacc', 'unacc', 'unacc', 'acc', 'acc', 'unacc', 'unacc', 'unacc', 'unacc', 'unacc', 'acc', 'acc', 'unacc', 'acc', 'vgood', 'unacc', 'unacc', 'acc', 'unacc', 'unacc', 'unacc', 'unacc', 'acc', 'unacc', 'vgood', 'unacc', 'unacc',



---

## - Part 8: Evaluating the Model

* (Q.8-a, **5 Marks**) In the final step of this assignment, you'll apply the decision tree learning process. Start by training the decision tree on the training dataset using the 'tree_train' function, setting the terminating threshold to 0.9. Next, employ the 'tree_prediction' function, as previously implemented, to generate predictions for both the training and testing datasets. Following this, your task is to compare these predicted labels with the actual ground-truth labels to compute and report the accuracy rates for both the training and testing datasets.

* (Q.8-b, **5 Marks**) Now, repeat the process with a different terminating threshold, specifically 0.7, and once again calculate and report the accuracy rates for the training and testing datasets. Finally, compare and contrast the results obtained with the two different threshold values (0.9 and 0.7). Provide an analysis and discussion of why one threshold might yield higher accuracy compared to the other.
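
Accuracy here means the fraction of samples whose predicted label equals the ground-truth label. A minimal sketch on made-up label lists (the helper name `accuracy` is hypothetical and is not the function used in the solution cell below):

```python
def accuracy(predicted, actual):
    """Illustrative helper: fraction of positions where the two label sequences agree."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

predicted = ["unacc", "acc", "unacc", "good"]
actual = ["unacc", "acc", "acc", "good"]
print(accuracy(predicted, actual))  # 0.75
```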

In [21]:
#.................................
# write the rest here:
def compare_accuracy(train_df, test_df, thresh):
    # Train a tree with the given threshold and print the train/test accuracy
    print("for threshold: ", thresh)
    tree_model = tree_train(train_df, thresh)
    train_pred_labels = tree_prediction(train_df, tree_model)
    test_pred_labels = tree_prediction(test_df, tree_model)
    
    # The class attribute is the last column of each dataframe
    train_correct = (train_pred_labels == train_df[train_df.columns[-1]]).sum()
    siz = len(train_df.index)
    print (train_correct, "out of", siz, "accuracy = ", train_correct/siz)
    
    test_correct = (test_pred_labels == test_df[test_df.columns[-1]]).sum()
    siz = len(test_df.index)
    print (test_correct, "out of", siz, "accuracy = ", test_correct/siz)
    print()
    
compare_accuracy(train_df, test_df, 0.9)
compare_accuracy(train_df, test_df, 0.7)
compare_accuracy(train_df, test_df, 0.8)

for threshold:  0.9
1398 out of 1400 accuracy =  0.9985714285714286
302 out of 327 accuracy =  0.9235474006116208

for threshold:  0.7
997 out of 1400 accuracy =  0.7121428571428572
213 out of 327 accuracy =  0.6513761467889908

for threshold:  0.8
1394 out of 1400 accuracy =  0.9957142857142857
304 out of 327 accuracy =  0.9296636085626911



With a lower threshold, the tree stops splitting sooner.

Threshold 0.7:
The root node already meets the threshold (997 of the 1400 training samples, about 0.712, are "unacc"), so it is marked as a Leaf and the tree never splits. The model therefore predicts "unacc" for every sample, which gives exactly the majority-class accuracy: 0.712 on the training set and 0.651 on the test set.
The model is underfitted. The leaves are not pure enough, and the nodes need to be split further.

Threshold 0.9:
Training accuracy is slightly higher than with 0.8 (0.9986 vs. 0.9957), while test accuracy is slightly lower (0.9235 vs. 0.9297),
so the model is very slightly overfitted.

Threshold 0.8:
The model performs well on both the training and the test set, and has the best test accuracy of the three.

Since the class distribution is skewed, accuracy may not be the best metric here.

In [22]:
train_df["CLASS"].value_counts()

CLASS
unacc    997
acc      298
good      55
vgood     50
Name: count, dtype: int64
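
These counts also give the accuracy of a trivial classifier that always predicts the most frequent class, which is a useful baseline when reading the accuracies reported above. A minimal sketch using only the counts printed in this cell:

```python
# Class counts of the training set, copied from the output above
class_counts = {"unacc": 997, "acc": 298, "good": 55, "vgood": 50}

majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / sum(class_counts.values())

print(majority_class)     # unacc
print(baseline_accuracy)  # about 0.712, the training accuracy reported for threshold 0.7
```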