# Q1 - Decision Tree

## a.) Check out dt.py

## b.) Check out dt.py

## c.) See Below for 1C

## d.) 

For every split, the train method goes through each of the *d* features and, for each feature, through all *n* of its values to find the optimal split, so a single split search already costs at least *dn*.

Each level of the tree adds one more round of splitting, and the nodes on one level together hold at most *n* samples, so the search at each level costs about *dn* in total. Letting *p* be the maximum depth of the tree, we get O(*dnp*). This is for train.
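As a rough illustration of where the *dn* term comes from, here is a minimal sketch of an exhaustive split search. It is not the code in dt.py: the names `gini` and `best_split`, the Gini criterion, and the toy data are all assumptions made for this example. The `tried` counter counts candidate thresholds, which is exactly *d* times *n*. Note that this naive version re-partitions the data for every candidate, so its real running time is higher than *dn*.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Try every feature and every observed value as a threshold.

    Returns (feature_index, threshold, weighted_impurity, candidates_tried).
    """
    n, d = len(rows), len(rows[0])
    best = (None, None, float("inf"))
    tried = 0
    for j in range(d):              # loop over the d features
        for i in range(n):          # loop over the n observed values
            t = rows[i][j]
            left = [y for x, y in zip(rows, labels) if x[j] <= t]
            right = [y for x, y in zip(rows, labels) if x[j] > t]
            tried += 1
            if not left or not right:
                continue            # a split must send samples both ways
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best + (tried,)


# Toy data (hypothetical): 4 samples, 2 features.
rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
feature, threshold, impurity, tried = best_split(rows, labels)
```

With *n* = 4 and *d* = 2, the search evaluates 8 candidates, matching the *dn* count used in the argument above.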

For predict, it is O(*np*). Predicting one sample is like searching a binary tree: starting at the root, predict makes one comparison at every node on the path down to a leaf, which is at most *p* comparisons. Doing this for all *n* samples gives O(*np*).
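The prediction cost can be sketched the same way. The `Node` class and the small hand-built tree below are hypothetical and only stand in for whatever structure dt.py actually uses. The sketch counts one comparison per internal node on the root-to-leaf path, so a single sample never needs more comparisons than the depth of the tree.

```python
class Node:
    """A tree node: internal nodes have a feature and threshold, leaves a label."""

    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.label = label


def predict_one(node, x):
    """Walk from the root to a leaf; return (label, number_of_comparisons)."""
    comparisons = 0
    while node.label is None:
        comparisons += 1
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label, comparisons


def predict(root, rows):
    """Predict every sample; total work is at most len(rows) * depth."""
    return [predict_one(root, x)[0] for x in rows]


# Hypothetical tree of depth 2.
root = Node(
    feature=0, threshold=2.0,
    left=Node(label=0),
    right=Node(feature=1, threshold=1.5, left=Node(label=1), right=Node(label=0)),
)
samples = [[1.0, 0.0], [3.0, 1.0], [3.0, 2.0]]
predictions = predict(root, samples)
```

Here the deepest leaf is reached after two comparisons, so predicting the three samples takes at most 3 × 2 comparisons.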

## Q1 - c.)

```Python
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import q1
```



# Q2 - Model Assessment

## a.) Check out q2.py

## b.) Check out q2.py

## c.) Check out q2.py

## d.)

Here are three runs of the model assessment, which is as many runs as would fit on my laptop screen.

![Model Assessment](q3.PNG)
```Python
6   True Test  0.954458  0.828526  0.000000
```

### Analysis

AUC Comparison:

We can see that TrainAUC is higher than ValAUC in every row of every run, and the gap is consistent. For any strategy, TrainAUC holds an advantage of roughly 0.1 or more over ValAUC.

Different Model Selection Techniques vs. AUC:

TrainAUC and ValAUC look broadly similar across techniques, but comparing their variances across the runs shows some differences:

2-fold TrainAUC has zero variance. 2-fold, 5-fold, and 10-fold have notably low ValAUC variance compared to the other methods, while Holdout and MCCV w/10 have the highest ValAUC variance. Given this, and that 5-fold's ValAUC stays very close to 0.8, I think 5-fold is the most robust.
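The variance comparison above can be reproduced with a few lines of Python. This is only a sketch: the function name `val_auc_variance` is made up for this example, and the numbers in `placeholder_runs` are placeholders chosen to show the mechanics, not the values from the runs in this report.

```python
from statistics import pvariance


def val_auc_variance(val_auc_by_strategy):
    """Population variance of ValAUC across runs, per strategy."""
    return {name: pvariance(values) for name, values in val_auc_by_strategy.items()}


# PLACEHOLDER values only (not results from this report): one ValAUC per run.
placeholder_runs = {
    "Holdout": [0.78, 0.84, 0.80],
    "5-fold": [0.80, 0.81, 0.80],
}

variances = val_auc_variance(placeholder_runs)
most_stable = min(variances, key=variances.get)
```

The strategy with the smallest variance is the one whose ValAUC moves least from run to run, which is the sense of "robust" used here.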

Computation Time:

Holdout & True Test:
True Test is always the fastest, with holdout sometimes coming in second.

Holdout sometimes reports a time of 0.0000, meaning it finished too quickly for the timer to register.

At other times, however, holdout takes about as long as the 2-fold strategy.

Neither True Test nor holdout requires multiple AUC calculations, so their speed makes sense.

For the k-fold strategies:
It makes sense that k-fold takes longer than holdout, since it has to average over many AUC calculations, one per fold. The more folds there are, the more time the strategy takes, and in these three runs the time grows roughly linearly with the number of folds.
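To make the "one model per fold" point concrete, here is a minimal sketch of how k-fold partitions the data. It is not the code in q2.py: `k_fold_indices` and the sample count of 100 are assumptions for this example. Each fold yields one train/validation pair, so the number of fits equals k.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


n_samples = 100  # hypothetical dataset size
fits_per_k = {k: sum(1 for _ in k_fold_indices(n_samples, k)) for k in (2, 5, 10)}
train_size_per_k = {k: len(next(k_fold_indices(n_samples, k))[0]) for k in (2, 5, 10)}
```

Going from 2 to 10 folds means five times as many fits, and each fit also trains on a larger share of the data (50 versus 90 of the 100 samples here), so the time increases with k.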

After running my code 50 times:


We can see that:


For the MCCV:
MCCV also has to average over many AUC calculations, one per random split. As the number of splits s increases, the time also increases, much as it does for the k-fold strategies with more folds. MCCV-5 usually takes roughly half the time of MCCV-10.
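A comparable sketch for Monte Carlo cross-validation is below. Again, this is not the code in q2.py: `mccv_indices`, the 30% validation fraction, the seed, and the sample count are all assumptions for this example. Each of the s iterations draws a fresh random split, so the work scales with s.

```python
import random


def mccv_indices(n_samples, s, val_fraction=0.3, seed=0):
    """Yield s random (train_idx, val_idx) splits."""
    rng = random.Random(seed)
    n_val = int(round(n_samples * val_fraction))
    for _ in range(s):
        idx = list(range(n_samples))
        rng.shuffle(idx)            # a new random permutation for every split
        yield idx[n_val:], idx[:n_val]


splits_5 = list(mccv_indices(100, 5))
splits_10 = list(mccv_indices(100, 10))
```

Unlike k-fold, the training size per split stays the same as s grows, so doubling s from 5 to 10 should roughly double the time, which agrees with the observation above.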

# Q3 - Decision Tree Robustness

## a.) 

#### For: 

`python3 q3.py 5 10`

#### From my code:
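```Python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def get_param(classifier, xFeat, y, xTest, yTest):
    # Grid-search the hyperparameters with 5-fold CV, scored by macro F1.
    # xTest and yTest are not used by the search itself.
    if classifier == "knn":
        clf = GridSearchCV(
            KNeighborsClassifier(),
            {'n_neighbors': range(1, 50, 1)},
            cv=5, scoring='f1_macro')
        clf.fit(xFeat, y['label'])
    else:
        clf = GridSearchCV(
            DecisionTreeClassifier(),
            [{'max_depth': range(1, 20),
              'min_samples_leaf': range(1, 30),
              'criterion': ['entropy', 'gini']}],
            cv=5, scoring='f1_macro')
        clf.fit(xFeat, y)

    # best_params_ holds the combination with the highest mean CV score.
    optimal_parameter = clf.best_params_
    optimal_parameter_string = str(optimal_parameter)

    print("optimal parameter: " + optimal_parameter_string)

    return optimal_parameter
```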

Note that the search covers `'n_neighbors': range(1, 50, 1)` for KNN, whereas for Decision Tree it covers `'max_depth': range(1, 20)` and `'min_samples_leaf': range(1, 30)`, with both `'entropy'` and `'gini'` as the criterion.
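To get a feel for how much work these ranges imply, the sketch below counts the candidate combinations in each grid and multiplies by the 5 cross-validation folds. It uses only `itertools.product` from the standard library, and the helper name `grid_size` is made up for this example. It counts the cross-validation fits only.

```python
from itertools import product

# The same ranges as in get_param.
knn_grid = {"n_neighbors": range(1, 50, 1)}
tree_grid = {
    "max_depth": range(1, 20),
    "min_samples_leaf": range(1, 30),
    "criterion": ["entropy", "gini"],
}


def grid_size(grid):
    """Number of parameter combinations in a grid."""
    return sum(1 for _ in product(*grid.values()))


cv = 5
knn_candidates = grid_size(knn_grid)      # 49 values of n_neighbors
tree_candidates = grid_size(tree_grid)    # 19 * 29 * 2 combinations
knn_fits = knn_candidates * cv
tree_fits = tree_candidates * cv
```

The decision tree search evaluates 1102 combinations against 49 for KNN, so it involves far more cross-validation fits (5510 versus 245).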

#### Best KNN Parameter:
optimal parameter: {'n_neighbors': 1}

#### Best Decision Tree Parameter:
optimal parameter: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 1}

### Analysis

Therefore,

## b.) Check out q3.py

## c.) Check out q3.py

## d.)

### Results:
#### Overall for KNN:

knn auc: 0.72345

knn percent accuracy: 0.85833

#### For 5%:

knn auc: 0.72224

knn percent accuracy: 0.85625

#### For 10%:

knn auc: 0.68906

knn percent accuracy: 0.84375

#### For 20%:

knn auc: 0.70037

knn percent accuracy: 0.85208

#### Overall for Decision Tree:

decision tree auc: 0.88002

decision tree percent accuracy: 0.88125

#### For 5%:

decision tree auc: 0.85679

decision tree percent accuracy: 0.86458

#### For 10%:

decision tree auc: 0.86918

decision tree percent accuracy: 0.9

#### For 20%:

decision tree auc: 0.86536

decision tree percent accuracy: 0.86667
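As an aid for reading these numbers, the sketch below computes how far each AUC moves from the "Overall" figure. The values are copied from the results above. Treating "Overall" as the reference point is an assumption about how the results are meant to be compared, and the variable names are made up for this example.

```python
# AUC values copied from the results above.
overall_auc = {"knn": 0.72345, "decision tree": 0.88002}
auc = {
    "knn": {"5%": 0.72224, "10%": 0.68906, "20%": 0.70037},
    "decision tree": {"5%": 0.85679, "10%": 0.86918, "20%": 0.86536},
}

# Change in AUC relative to the "Overall" figure (negative means a drop).
auc_change = {
    model: {level: round(value - overall_auc[model], 5) for level, value in levels.items()}
    for model, levels in auc.items()
}

# The level with the largest drop for each model.
largest_drop = {model: min(changes, key=changes.get) for model, changes in auc_change.items()}

for model, changes in auc_change.items():
    print(model, changes)
```

On these figures, every AUC is below its "Overall" value; the largest drop is at 10% for KNN and at 5% for the decision tree.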

### Analysis

Therefore,