# Q1 - Decision Tree

## a.) Check out dt.py

## b.) Check out dt.py

## c.) See Below for 1C

## d.) 

For every split, the train method scans each of the *d* features and, for each feature, each of its *n* values to find the optimal split, so a single split costs at least O(*dn*).

Each level of the tree repeats this search over the samples that reach it, and training continues down to the maximum depth. Calling the depth of the entire tree *p*, we get O(*dnp*) for train.
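A minimal sketch of the exhaustive split search described above, to make the *dn* count concrete. It is not the code in dt.py: the names `gini` and `best_split` and the toy data are illustrative assumptions. The counter only tracks the *dn* candidate thresholds; this naive version also spends extra work scoring each candidate, which the analysis above does not count.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_split(X, y):
    """Try every value of every feature as a threshold.

    Returns (feature_index, threshold, weighted_impurity, candidates_tried).
    """
    n, d = X.shape
    best_feature, best_threshold, best_score = None, None, np.inf
    tried = 0
    for j in range(d):            # d features
        for t in X[:, j]:         # n candidate thresholds per feature
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            tried += 1
            if score < best_score:
                best_feature, best_threshold, best_score = j, t, score
    return best_feature, best_threshold, best_score, tried


# Toy data (hypothetical): 4 samples, 2 features
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 6.0]])
y = np.array([0, 0, 1, 1])
feature, threshold, impurity, tried = best_split(X, y)
print(feature, threshold, impurity, tried)
```

With *n* = 4 and *d* = 2 the search evaluates 8 candidates, one per feature value.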

For predict, it is O(*np*): each sample is routed from the root to a leaf, like a search in a binary tree, and a leaf can be as deep as the max depth *p*.
Predict makes one comparison at every node on that path, starting with the root, so each sample costs at most *p* comparisons and *n* samples cost at most *np*.
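The traversal can be sketched as follows. The nested-dict tree layout and the function names are assumptions made for illustration, not the representation used in dt.py. The comparison counter shows that the total stays within *n* times the depth.

```python
# Hypothetical tree of depth 2: internal nodes hold a feature index and a
# threshold, leaves hold a label.
tree = {
    "feature": 0, "threshold": 2.5,
    "left": {"label": 0},
    "right": {
        "feature": 1, "threshold": 6.5,
        "left": {"label": 1},
        "right": {"label": 0},
    },
}


def predict_one(node, x):
    """Walk from the root to a leaf; return (label, comparisons made)."""
    comparisons = 0
    while "label" not in node:
        comparisons += 1
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"], comparisons


def predict(root, samples):
    """Predict every sample; return (labels, total comparisons)."""
    labels, total = [], 0
    for x in samples:
        label, comparisons = predict_one(root, x)
        labels.append(label)
        total += comparisons
    return labels, total


samples = [[1.0, 5.0], [3.0, 6.0], [4.0, 7.0]]
labels, total = predict(tree, samples)
print(labels, total)
```

Here *n* = 3 and *p* = 2, so the bound is 6 comparisons; the first sample reaches a leaf after one comparison, so the actual total is lower.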

Overall, O(*dnp*) + O(*np*) = O(*dnp*) 

## Q1 - c.)

In [None]:
import math
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from tqdm import tqdm
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import dt

In [None]:
# Hyperparameter grids: leaf sizes 10..30 and depths 1..15
minLeafSample = np.arange(10,31)
maxDepth = np.arange(1,16)
X, Y = np.meshgrid(maxDepth, minLeafSample)
# Rows index minLeafSample and columns index maxDepth, matching X and Y
gridShape = (len(minLeafSample), len(maxDepth))
trainAcc_gini = np.zeros(gridShape)
testAcc_gini = np.zeros(gridShape)
trainAcc_entropy = np.zeros(gridShape)
testAcc_entropy = np.zeros(gridShape)

xTrain = pd.read_csv('q4xTrain.csv')
yTrain = pd.read_csv('q4yTrain.csv')
xTest = pd.read_csv('q4xTest.csv')
yTest = pd.read_csv('q4yTest.csv')

def accTest(criterion, maxDepth, minLeafSample):
    # Train one tree with the given settings; returns (train accuracy, test accuracy)
    return dt.dt_train_test(dt.DecisionTree(criterion, maxDepth, minLeafSample), xTrain, yTrain, xTest, yTest)

# Loop bounds come from the grids so the indices can never run past them
for i in tqdm(range(len(minLeafSample))):
    for j in range(len(maxDepth)):
        trainAcc, testAcc = accTest('gini', maxDepth[j], minLeafSample[i])
        trainAcc_gini[i][j] = trainAcc
        testAcc_gini[i][j] = testAcc
        trainAcc, testAcc = accTest('entropy', maxDepth[j], minLeafSample[i])
        trainAcc_entropy[i][j] = trainAcc
        testAcc_entropy[i][j] = testAcc


In [None]:
plt.rcParams.update({'font.size': 40})
plt.rc('xtick', labelsize=15) 
plt.rc('ytick', labelsize=15) 
fig = plt.figure(figsize=(20,15))
ax = fig.add_subplot(projection='3d')
surf = ax.plot_wireframe(X, Y, testAcc_gini, cmap = 'plasma')
#surf = ax.plot_wireframe(X, Y, testAcc_entropy, cmap='inferno')
ax.set_xlabel('Max depth')
ax.set_ylabel('Min leaf samples')
ax.set_zlabel('Accuracy')
ax.text2D(0.1, 0.9, "Test Data Accuracy with Gini", transform=ax.transAxes)
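A wireframe makes the overall shape visible, but the best setting is easier to read off the accuracy grid directly. Below is a small sketch of that lookup, assuming a grid laid out with min leaf samples on the rows and max depth on the columns. The helper name `best_grid_cell` is hypothetical, and the accuracy values are made-up placeholders, not results from dt.py.

```python
import numpy as np


def best_grid_cell(acc, row_values, col_values):
    """Return (row value, column value, accuracy) of the highest grid cell."""
    i, j = np.unravel_index(np.argmax(acc), acc.shape)
    return row_values[i], col_values[j], acc[i, j]


# Placeholder grid: rows are min leaf samples, columns are max depth
min_leaf = np.array([10, 20, 30])
max_depth = np.array([1, 2, 3, 4])
acc = np.array([
    [0.80, 0.85, 0.86, 0.84],
    [0.80, 0.84, 0.88, 0.83],
    [0.79, 0.83, 0.85, 0.82],
])

leaf, depth, best = best_grid_cell(acc, min_leaf, max_depth)
print(leaf, depth, best)
```

The same call would work on an array such as `testAcc_gini` as long as its rows and columns follow the two hyperparameter arrays.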

### Graph Analysis

We can see here that 

# Q2 - Model Assessment

## a.) Check out q2.py

## b.) Check out q2.py

## c.) Check out q2.py

## d.)

Here are three runs of the model assessment, which is the number of runs that fit on my laptop screen:

![Model Assessment](q3.PNG)
```Python
6   True Test  0.954458  0.828526  0.000000
```

### Analysis

### AUC Comparison:

We can see that TrainAUC is higher than ValAUC in every row of every run. TrainAUC appears to have an advantage of roughly 0.1 over ValAUC for any strategy. As the variance analysis below shows, ValAUC also varies much more than TrainAUC.

### Different Model Selection Techniques vs. AUC:

TrainAUC and ValAUC look broadly similar across techniques, so I compared their variances after running my code 50 times:

![MA](q2analysiscode.PNG)
![MA](q2analysiscode2.PNG)
![MA](q2analysiscode3.PNG)
![MA](q3analysiscode4.PNG)

We can see that:

Overall, ValAUC varies much more than TrainAUC, and the TrainAUC variances are very small, which is good.

#### For Holdout:
Holdout ValAUC variance is extremely high.

#### For K-Fold:

K-fold ValAUC variances are decent, especially for 2-fold and 5-fold, which are very close to each other. 10-fold is generally worse, but still not as bad as holdout.

#### For MCCV:

MCCV ValAUC variances are extremely high as well, comparable to holdout. Surprisingly, MCCV-10 does better than MCCV-5 on average over the 50 runs.

#### Overall:

5-fold has low ValAUC variance and good performance: its ValAUC is generally higher than 2-fold's and closer to 0.8. That makes 5-fold the most robust choice and a good compromise.
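The comparison above comes down to the mean and variance of ValAUC per strategy across repeated runs. A minimal sketch of that summary follows. The function name `summarize_runs` is hypothetical, and the numbers are placeholders chosen to show the calculation, not the values from my 50 runs.

```python
import numpy as np


def summarize_runs(val_auc_by_strategy):
    """Map each strategy to (mean, sample variance) of its ValAUC values."""
    return {
        name: (float(np.mean(values)), float(np.var(values, ddof=1)))
        for name, values in val_auc_by_strategy.items()
    }


# Placeholder ValAUC values, one entry per repeated run
runs = {
    "Holdout": [0.70, 0.80, 0.90],
    "5-fold": [0.79, 0.80, 0.81],
}

summary = summarize_runs(runs)
for name, (mean, variance) in summary.items():
    print(f"{name}: mean={mean:.3f} variance={variance:.5f}")
```

`ddof=1` gives the sample variance, which is the usual choice when the runs are treated as a sample of all possible splits.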

### Computation Time:

#### Holdout & True-test:

True Test is always the fastest, with holdout sometimes coming in second.

Holdout sometimes reports 0.0000, which likely means the run finished below the timer's resolution on my laptop.

At other times, holdout takes about as long as the 2-fold strategy.

Neither True Test nor holdout requires multiple AUC calculations, so their speed makes sense.

#### For the k-fold strategies:

It makes sense that the k-fold strategy takes longer than holdout, since it averages over several AUC calculations, one per fold. The more folds there are, the more time the strategy takes. In these 3 examples, the time looks roughly linear in the number of folds.

#### For the MCCV:

MCCV also averages over many AUC calculations. As k or s increases, the time also increases, which makes sense: like the k-fold strategies, it spends more time on each additional split. MCCV-5 usually takes roughly half the time of MCCV-10.
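The timing pattern follows from how many train/validate rounds each strategy runs. Here is a sketch of the two splitters using only NumPy. The function names, the 30% validation fraction, and the sample count are assumptions for illustration and are not taken from q2.py.

```python
import numpy as np


def kfold_splits(n, k, rng):
    """Yield k (train_idx, val_idx) pairs; every sample is validated once."""
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]


def mccv_splits(n, s, val_frac, rng):
    """Yield s independent random (train_idx, val_idx) pairs."""
    n_val = int(round(n * val_frac))
    for _ in range(s):
        idx = rng.permutation(n)
        yield idx[n_val:], idx[:n_val]


rng = np.random.default_rng(0)
kfold = list(kfold_splits(20, 5, rng))
mccv = list(mccv_splits(20, 10, 0.3, rng))

# One model fit and one AUC calculation per split
print(len(kfold), len(mccv))
```

Each yielded pair costs one fit and one AUC calculation, so the work grows linearly with k for k-fold and with s for MCCV.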

# Q3 - Decision Tree Robustness

## a.) 

#### From my code:
```Python
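from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def get_param(classifier, xFeat, y, xTest, yTest):
    # Grid-search the classifier's hyperparameters with 5-fold CV, scored by macro F1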
    if classifier == "knn":
        clf = GridSearchCV(
            KNeighborsClassifier(),
            {'n_neighbors': range(1, 50, 1)},
            cv=5, scoring='f1_macro')
        clf.fit(xFeat, y['label'])
    else:
        clf = GridSearchCV(
            DecisionTreeClassifier(),
            [{'max_depth': range(1, 20),
              'min_samples_leaf': range(1, 30),
              'criterion': ['entropy', 'gini']}],
            cv=5, scoring='f1_macro')
        clf.fit(xFeat, y)

    optimal_parameter = clf.best_params_
    optimal_parameter_string = str(optimal_parameter)

    print("optimal parameter: " + optimal_parameter_string)

    return optimal_parameter
```

For cross validation of both Decision Tree and KNN, I chose 5 folds, since 5 was one of the popular choices in the decision tree PowerPoint and was also, narrowly, the best in my analysis for Q2. According to the PowerPoint 10 is also fine, but I chose not to use it because of my Q2 results.

Note that the 'n_neighbors' range is (1, 50, 1) for KNN,

whereas for Decision Tree the ranges are 'max_depth': range(1, 20) and 'min_samples_leaf': range(1, 30).

The best parameters change as these ranges change. For these ranges and inputs, the results were:

#### Best KNN Parameter:
optimal parameter: {'n_neighbors': 1}

#### Best Decision Tree Parameter:
optimal parameter: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 1}

Within the range searched above, the optimal parameter for KNN is n_neighbors = 1.

Within the ranges searched above, the optimal parameters for the decision tree are criterion entropy, max_depth 5, and min_samples_leaf 1.
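To get a sense of how much work these ranges imply, the grid sizes can be counted directly. This is a small sketch using only the standard library; the helper `n_candidates` is a hypothetical name, and the grids are copied from the `get_param` code above.

```python
from itertools import product

knn_grid = {"n_neighbors": range(1, 50, 1)}
dt_grid = {
    "max_depth": range(1, 20),
    "min_samples_leaf": range(1, 30),
    "criterion": ["entropy", "gini"],
}


def n_candidates(grid):
    """Number of parameter combinations in a grid of value lists."""
    return sum(1 for _ in product(*grid.values()))


cv = 5
knn_fits = n_candidates(knn_grid) * cv  # 49 candidates x 5 folds
dt_fits = n_candidates(dt_grid) * cv    # 19 x 29 x 2 candidates x 5 folds
print(knn_fits, dt_fits)
```

That is 245 cross-validation fits for KNN and 5510 for the decision tree, before the final refit on the full training set that GridSearchCV performs by default.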


## b.) Check out q3.py

## c.) Check out q3.py

## d.)

### Results:
#### Overall for KNN:

knn auc: 0.72345

knn accuracy: 0.85833

#### KNN with 5% removed:

knn auc: 0.72224

knn accuracy: 0.85625

#### KNN with 10% removed:

knn auc: 0.68906

knn accuracy: 0.84375

#### KNN with 20% removed:

knn auc: 0.70037

knn accuracy: 0.85208

#### Overall for Decision Tree:

decision tree auc: 0.88002

decision tree accuracy: 0.88125

#### Decision Tree with 5% removed:

decision tree auc: 0.85679

decision tree accuracy: 0.86458

#### Decision Tree with 10% removed:

decision tree auc: 0.86918

decision tree accuracy: 0.9

#### Decision Tree with 20% removed:

decision tree auc: 0.86536

decision tree accuracy: 0.86667

### Analysis

Overall, Decision Tree AUC is clearly higher than KNN AUC. Their accuracies are closer, though Decision Tree is still slightly better there as well. For both models, accuracy stayed roughly the same as data was lost, moving up or down only a little.

Decision Tree AUC stays above KNN AUC at every level of data loss, and both AUCs are lower with data removed than in the overall run, which makes sense.

KNN had a higher AUC at 20% than at 10%, which may be an outlier, and DT had a higher AUC at 10% than at 5%. Nevertheless, DT AUC is better and stays in a higher range than KNN. Both seem slightly volatile as data is removed, but stay in a consistent range (so far). Since the change from 10% to 20% was larger for KNN than for DT, strictly speaking I can conclude from these data points that KNN is a bit more sensitive.
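One way to put a number on that sensitivity is the spread of AUC across the removal levels. The sketch below uses the AUC values reported in the results above. Treating the "Overall" run as 0% removed is my assumption, and `auc_spread` is a hypothetical helper name.

```python
# AUC by percentage of data removed (0 stands for the "Overall" run)
knn_auc = {0: 0.72345, 5: 0.72224, 10: 0.68906, 20: 0.70037}
dt_auc = {0: 0.88002, 5: 0.85679, 10: 0.86918, 20: 0.86536}


def auc_spread(auc_by_level):
    """Difference between the highest and lowest AUC across levels."""
    values = list(auc_by_level.values())
    return max(values) - min(values)


knn_spread = auc_spread(knn_auc)
dt_spread = auc_spread(dt_auc)
print(f"KNN spread: {knn_spread:.5f}")
print(f"DT spread:  {dt_spread:.5f}")
```

With only four points per model this is a rough indicator rather than a statistical test, but the KNN spread comes out larger than the Decision Tree spread.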

