# Assignment 1

## Question `2` (Decision Trees)

| | |
|-|-|
| Course | Statistical Methods in AI |
| Release Date | `19.01.2023` |
| Due Date | `29.01.2023` |

This assignment will have you working and experimenting with decision trees. First, you will implement a decision tree classifier that chooses thresholds based on various impurity measures, and report the scores. Then you will experiment with the `scikit-learn` implementation of decision trees and see how its parameters can be tuned for better performance.

The dataset is a simple one: the [banknote authentication dataset](https://archive.ics.uci.edu/ml/datasets/banknote+authentication). It has 5 columns: the first 4 are the features and the last one is the class label. The features are the variance, skewness, kurtosis and entropy of the [wavelet transformed](https://en.wikipedia.org/wiki/Wavelet_transform) image of the banknote. The class label is 1 if the banknote is authentic, and 0 if it is forged. The data is present in `bankAuth.txt`, which contains 1372 samples in total.

### Imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# additional imports if necessary

### Impurity Measures

Decision trees are only as good as the impurity measure used to choose the best split. In this section, you will be required to implement the following impurity measures and use them to build a decision tree classifier.

1. Gini Index
2. Entropy/Log loss
3. Misclassification Error

Write functions that calculate the impurity measures for a given set of labels. The functions should take in a list of labels and return the impurity measure.
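
For a node in which a fraction $p_k$ of the samples belongs to class $k$, the three measures are commonly defined as follows. This is a reference sketch; the symbols $p_k$, $G$, $H$ and $E$ are introduced here for illustration and are not part of the assignment text.

$$
G = \sum_k p_k (1 - p_k) = 1 - \sum_k p_k^2, \qquad
H = -\sum_k p_k \log p_k, \qquad
E = 1 - \max_k p_k
$$

The entropy uses the convention $0 \log 0 = 0$, and the base of the logarithm only rescales the value. For the two-class case, with $p$ the fraction of samples labelled 1, the Gini index reduces to $G = 2p(1 - p)$.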

In [1]:
# your code here

def giniIndex(items):
    zeroCount = 0
    oneCount = 0
    for item in items:
        if item[-1] == 0:
            zeroCount += 1
        else:
            oneCount += 1

    N = float(items.shape[0])
    oneFreq = oneCount/N
    zeroFreq = zeroCount/N

    # Gini index is the sum (not the product) of p_k * (1 - p_k) over the classes
    gini = oneFreq*(1.0 - oneFreq) + zeroFreq*(1.0 - zeroFreq)
    return gini

def entropy(items):
    zeroCount = 0
    oneCount = 0
    for item in items:
        if item[-1] == 0:
            zeroCount += 1
        else:
            oneCount += 1

    N = float(items.shape[0])
    oneFreq = oneCount/N
    zeroFreq = zeroCount/N

    # Skip empty classes so that 0 * log(0) is treated as 0 instead of nan
    S = -sum(f * np.log(f) for f in (oneFreq, zeroFreq) if f > 0)
    return S

def misclassificationError(items):
    zeroCount = 0
    oneCount = 0
    for item in items:
        if item[-1] == 0:
            zeroCount += 1
        else:
            oneCount += 1

    N = float(items.shape[0])
    oneFreq = oneCount/N
    zeroFreq = zeroCount/N

    if oneFreq > zeroFreq:
        msce = 1 - oneFreq
    else:
        msce = 1 - zeroFreq
    return msce
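
The functions above read the label from the last column of each row. As a possible alternative that matches the "list of labels" wording, here is a minimal vectorized sketch that takes the labels directly. The names `class_probs`, `gini`, `entropy_bits` and `misclassification` are hypothetical, the entropy here uses base-2 logarithms (unlike `np.log` above), and a non-empty input is assumed.

```python
import numpy as np

def class_probs(labels):
    """Return the fraction of samples in each class present in `labels`."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    """Gini index: 1 - sum of squared class fractions."""
    p = class_probs(labels)
    return float(1.0 - np.sum(p ** 2))

def entropy_bits(labels):
    """Entropy in bits; only classes that occur are included, so log2 is safe."""
    p = class_probs(labels)
    return float(-np.sum(p * np.log2(p)))

def misclassification(labels):
    """Misclassification error: 1 - fraction of the majority class."""
    p = class_probs(labels)
    return float(1.0 - p.max())

print(gini([0, 0, 1, 1]), entropy_bits([0, 0, 1, 1]), misclassification([0, 0, 1, 1]))
```

A balanced two-class node gives the maximum value for all three measures, and a pure node gives zero.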

### Decision Tree

Fit a decision tree using any one of the above impurity measures with a depth of 3. This means you will have eight leaf nodes and seven internal nodes. Report the threshold value at each internal node, and the impurity measure at each leaf node along with its label. Also report the accuracy of the classifier on the training and test data (instructions for splitting the data are given at the end).
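
One common way to choose a threshold at an internal node is to scan candidate values of a feature and keep the one with the lowest weighted impurity of the two children. The sketch below does this for a single feature using the Gini index. The helper names `gini_of` and `best_threshold` are hypothetical, and the use of midpoints between sorted unique values as candidates is an assumption rather than a requirement of the assignment.

```python
import numpy as np

def gini_of(labels):
    """Gini impurity of a 1-D array of class labels (0.0 for an empty array)."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_threshold(feature, labels):
    """Return (threshold, weighted impurity) for the best split of one feature.

    Candidates are midpoints between consecutive unique feature values.
    Returns (None, inf) when the feature has a single unique value.
    """
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    values = np.unique(feature)  # sorted unique values
    candidates = (values[:-1] + values[1:]) / 2.0
    best = (None, float("inf"))
    for t in candidates:
        left = labels[feature < t]
        right = labels[feature >= t]
        # Weight each child's impurity by its share of the samples
        score = (left.size * gini_of(left) + right.size * gini_of(right)) / labels.size
        if score < best[1]:
            best = (float(t), score)
    return best

print(best_threshold([1, 2, 3, 4], [0, 0, 1, 1]))
```

Repeating this search over all four features and recursing on the two resulting subsets, down to depth 3, gives the seven internal nodes described above.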

In [3]:
# your code here
class Partition():
    def __init__(self, featureNum: int, value: float, isLessThan: bool):
        self.featureNum = featureNum
        self.value = value
        self.isLessThan = isLessThan

        if featureNum >= 4:
            print("Warning: Invalid feature num")

    # items: 2-D NumPy array with one sample per row
    def __call__(self, items):
        # Returns a boolean mask: False -> left node, True -> right node
        result = items[:, self.featureNum] < self.value

        # Invert the mask when the partition tests "greater than or equal"
        if not self.isLessThan:
            result = ~result

        return result

class DecisionTree():
    """
                1
           2         3
        4    5    6     7   
    """
    
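    def __init__(self, trainData, impurity=giniIndex):
        # Build seven distinct placeholder partitions; list multiplication
        # would make every slot refer to the same Partition object.
        self.partitions = [Partition(-1, 0.0, False) for _ in range(7)]
        self.impurity = impurity
        self.trainData = trainData

    def splitAtPartition(self, partitionNum: int, data):
        # Assumption: partitionNum is a 0-based index into self.partitions.
        mask = self.partitions[partitionNum](data)
        # False -> left node, True -> right node
        return data[~mask], data[mask]
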

### `sklearn` Decision Tree Experiments

1. Scikit-learn has two decision tree implementations: [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) and [`DecisionTreeRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html). 

When would you use one over the other? What would you use in the case of the banknote authentication dataset? Explain the changes that need to be made in the dataset to use the other implementation.

2. Fit a decision tree to the training set. Change various parameters and compare them to one another. Mainly try and experiment with the `criterion`, `max_depth` and `min_samples_split` parameters. Report the accuracy on the training and test set for each of the experiments while varying the parameters for comparison purposes.

3. Plot your trees! (optional, for visualization)

```python
from sklearn.tree import plot_tree

def plotTree(tree):
    """
    tree: Tree instance that is the result of fitting a DecisionTreeClassifier
          or a DecisionTreeRegressor.
    """
    plt.figure(figsize=(30,20))
    plot_tree(tree, filled=True, rounded=True,
                  class_names=['forged', 'authentic'],
                  feature_names=['var', 'skew', 'curt', 'ent'])
    plt.show()
    return None
```
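
The parameter comparison in item 2 can be organized as a loop over a small grid. The sketch below is one possible layout, not a prescribed solution: it uses synthetic data from `make_classification` as a stand-in for the banknote arrays, and the grid values are arbitrary choices made for illustration.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); replace with the banknote arrays.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

results = []
for criterion, max_depth, min_split in product(
        ["gini", "entropy"], [2, 4, None], [2, 20]):
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth,
                                 min_samples_split=min_split, random_state=0)
    clf.fit(X_train, y_train)
    # Record train and test accuracy for this parameter combination
    results.append((criterion, max_depth, min_split,
                    clf.score(X_train, y_train), clf.score(X_test, y_test)))

for row in results:
    print("criterion=%s max_depth=%s min_samples_split=%s train=%.3f test=%.3f" % row)
```

Keeping `random_state` fixed makes the runs comparable, so differences in accuracy come from the parameters rather than from the random tie-breaking inside the tree.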

In [4]:
# your code here

### Load Data

The data has been loaded into a pandas DataFrame. Try to get an initial feel for the data by using functions like `describe()` and `info()`, or plot the data to check for any patterns.

Note: To obtain the data from the UCI website, download it with `wget`, shuffle the samples with `shuf`, and add a header row for easier reading via `pandas`. Viewing the data in a DataFrame is not necessary; it can also be loaded directly into NumPy if that is more convenient.

In [4]:
data = pd.read_csv('bankAuth.txt')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1371 entries, 0 to 1370
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   3.6216    1371 non-null   float64
 1   8.6661    1371 non-null   float64
 2   -2.8073   1371 non-null   float64
 3   -0.44699  1371 non-null   float64
 4   0         1371 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 53.7 KB
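
The `info()` output above lists 1371 rows and uses numeric values as column names, which suggests that the first data row was consumed as the header. If the file has no header row, passing `header=None` together with explicit names keeps all rows. The sketch below shows this on an in-memory sample: the first row is taken from the column names in the output above, the other two rows are made up, and the column names are hypothetical.

```python
import io

import pandas as pd

# First row comes from the output above; the remaining rows are made up.
sample = io.StringIO(
    "3.6216,8.6661,-2.8073,-0.44699,0\n"
    "1.0,2.0,3.0,4.0,1\n"
    "-1.5,0.5,2.5,-0.5,0\n"
)

# Hypothetical column names, chosen for readability
columns = ["variance", "skewness", "kurtosis", "entropy", "label"]

# header=None tells pandas that the first line is data, not column names
df = pd.read_csv(sample, header=None, names=columns)
print(df.shape)
```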


In [6]:
# your code here

### Splitting the Data

It is good practice to split the data into training and test sets. The model is fit on the training set only; the test set is held out and used solely to evaluate performance on unseen data, which makes overfitting to the training data visible. You may use the `train_test_split` function from `sklearn.model_selection` to split the data into training and test sets.

It is a good idea to move your data to NumPy arrays now as it will make computing easier.
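
A minimal sketch of the split and the conversion to NumPy is shown below. It uses a small made-up DataFrame in place of the banknote data, and the column names, the 80/20 ratio, the fixed `random_state` and the use of `stratify` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up frame standing in for the banknote data (an assumption).
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f0", "f1", "f2", "f3"])
frame["label"] = rng.integers(0, 2, size=100)

# Separate features and labels, then move to NumPy arrays
X = frame.drop(columns="label").to_numpy()
y = frame["label"].to_numpy()

# stratify=y keeps the class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)
```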

In [7]:
# your code here

### Denouement

Use this place to report all comparisons and wrap up the calls to the functions written above.

In [8]:
# your code here