# Datasets


### Source
Both datasets were taken from the UCI Machine Learning Repository.  The HTRU2 dataset can be found at https://archive.ics.uci.edu/ml/datasets/HTRU2, and the Letter Recognition set can be found at https://archive.ics.uci.edu/ml/datasets/letter+recognition.

### Data Dictionaries

 
### Data Pre-processing
#### Compiling Data Sources

#### Missing Values


#### Categorical Values

### The Problems
#### HTRU2
The HTRU2 dataset posed the problem of classifying pulsar candidates as a positive case (legitimate pulsar) or a negative case (not a legitimate pulsar).

#### Letter Recognition
The Letter Recognition dataset posed the problem of classifying records as one of the 26 capital letters in the English alphabet.

### Why These Datasets?
I chose these datasets because, while both pose classification problems, they differ substantially. The HTRU2 dataset contained a large number of records and a relatively low number of features. In contrast, the Letter Recognition dataset contained many more features. In addition, the HTRU2 problem was binary classification, while the Letter Recognition problem required assigning each record to one of 26 different classes.

# Decision Tree

The first algorithm used to model the two datasets was a decision tree. For each dataset, 70% of the records were selected randomly as a training set, and the remaining 30% were held out as a test set to be used only once the models were finished. One interesting hyperparameter in a decision tree model is the max depth of the tree. As a decision tree is fit to training data, it can be allowed to grow large enough to classify that data perfectly, unless there are anomalies (for example, two records with identical attributes but different classifications). As a result, the tree can become so dependent on the particular training records selected that its ability to model new data points decreases. This is a result of the high variance created by allowing a tree to grow too large. However, if a tree is too small, it will exhibit high bias, as the classification model is restricted to a smaller hypothesis space. In an attempt to balance these two factors, I plotted the model complexity curves below. This was done using 5-fold cross validation: each training set was broken into 5 folds, and each combination of 4 folds was used to train a model, which was then scored against both those 4 training folds and the held-out 5th fold.

![](DT_complexity_curve.png)
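
The 5-fold procedure described above can be sketched in plain Python. This is only an illustration of how the folds are formed, not the code used to produce the curves; the function name `k_fold_indices` and the seed are assumptions made for the example.

```python
import random

def k_fold_indices(n_records, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        val_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, val_idx

# With 20 records and 5 folds, every split trains on 16 records and validates on 4.
for train_idx, val_idx in k_fold_indices(n_records=20, k=5):
    print(len(train_idx), len(val_idx))
```

Each record lands in the held-out fold exactly once, so every record contributes to the validation score without ever being scored by a model that was trained on it.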

## Complexity Curve Analysis
The complexity curves showed both similarities and differences. The HTRU2 dataset had very high accuracy on both the training and the cross validation sets, even with a small max depth value. This was not a surprise: ~91% of records in the dataset are negative examples, so even an algorithm that assigns every record to the negative class would be expected to reach 91% accuracy. This is a high baseline, and the decision tree improved on it even with a small depth limit. The letter recognition dataset showed very low accuracy with small max depth values, and its accuracy increased drastically as the max depth was increased. There are many more target classes, so I would expect a decision tree to need more branches before it can satisfactorily assign records to one of the 26 classes.
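
The majority-class baseline mentioned above is easy to check numerically. The sketch below uses made-up labels in a 91-to-9 ratio purely for illustration; they are not the HTRU2 records, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Illustrative labels only: 91 negatives (0) and 9 positives (1).
labels = [0] * 91 + [1] * 9
baseline = majority_baseline_accuracy(labels)
print(f"baseline accuracy: {baseline:.2f}")  # prints "baseline accuracy: 0.91"
```

Any model on an imbalanced dataset is best judged against this number rather than against 50%.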

Despite the difference in initial behavior, the datasets showed similarities as the max depth increased. Accuracy on the training sets for each approached 1. As the decision trees are given more freedom to grow, they can completely, or almost completely, model the training data. While this seems like a good result, the complexity curves show that there is a cost. Although the bias is reduced, increasing the max depth hyperparameter increases variance. The tree becomes so dependent on the training set that it overfits, and its ability to predict classes in the hold-out set is reduced. This is clearly seen in the HTRU2 complexity curve, where the accuracy against the validation set increases at first but then steadily declines. While the effect is not as obvious in the letter recognition set, it is apparent that past some point increasing the max depth no longer increases the accuracy on the held-out validation set.

## Optimizing Hyperparameters
After analysis of the complexity curves, the next step was to optimize the max depth hyperparameter. This was done using a grid search. This method is given a set of candidate hyperparameter values and runs k-fold cross validation for each one. The results can be used to find the "best" value of the hyperparameter and to train a model using it. After running each of my training sets through this grid search process (using 5-fold validation), I determined that the optimal max depth values for the HTRU2 and Letter Recognition datasets were 4 and 44, respectively. This was consistent with my analysis of the complexity curves.
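
To make the grid search mechanics concrete, here is a minimal, self-contained sketch. It is not the code behind the results above: the tree is a toy stand-in that splits on a single numeric feature, the data is synthetic (a band of positives with roughly 10% of labels flipped), and every name in it (`build_tree`, `cv_accuracy`, `grid_search`) is hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(points, max_depth):
    """Fit a tree on (x, label) pairs with one numeric feature.

    A leaf is a bare label; an internal node is (threshold, left, right).
    """
    labels = [y for _, y in points]
    xs = sorted({x for x, _ in points})
    if max_depth == 0 or len(set(labels)) == 1 or len(xs) == 1:
        return majority(labels)
    best = None
    for threshold in xs[:-1]:  # both sides of every candidate split are non-empty
        left = [p for p in points if p[0] <= threshold]
        right = [p for p in points if p[0] > threshold]
        # Training errors if each side predicts its own majority label.
        errors = sum(len(side) - Counter(y for _, y in side).most_common(1)[0][1]
                     for side in (left, right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return (threshold,
            build_tree(left, max_depth - 1),
            build_tree(right, max_depth - 1))

def predict(tree, x):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def cv_accuracy(points, max_depth, k=5):
    """Mean validation accuracy over k folds for one max_depth value."""
    folds = [points[i::k] for i in range(k)]
    scores = []
    for held_out in range(k):
        train = [p for f, fold in enumerate(folds) if f != held_out for p in fold]
        tree = build_tree(train, max_depth)
        val = folds[held_out]
        scores.append(sum(predict(tree, x) == y for x, y in val) / len(val))
    return sum(scores) / k

def grid_search(points, depths, k=5):
    """Score every candidate depth with k-fold CV and return the best one."""
    scores = {depth: cv_accuracy(points, depth, k) for depth in depths}
    best_depth = max(scores, key=scores.get)  # ties go to the smallest depth
    return best_depth, scores

# Synthetic data, for illustration only.
rng = random.Random(0)
points = []
for _ in range(100):
    x = rng.random()
    label = 1 if 0.3 < x < 0.7 else 0
    if rng.random() < 0.1:  # flip about 10% of labels to simulate noise
        label = 1 - label
    points.append((x, label))

best_depth, scores = grid_search(points, depths=range(1, 7))
for depth, score in scores.items():
    print(f"max_depth={depth}: mean validation accuracy {score:.3f}")
print("selected max_depth:", best_depth)
```

The selection step is just an argmax over mean validation scores. Breaking ties toward the smaller depth is a design choice in this sketch, on the reasoning that the simpler tree is preferable when the scores are equal.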

## Learning Curves
The final step in Decision Tree analysis was to plot the learning curves.  A learning curve shows how the size of the training set may affect the model's accuracy on both the training set and a validation set.


<tr>
    <td> <img src="Letters_DT_LC.png" alt="Drawing" style="width: 450px;"/> </td>
    <td> <img src="HTRU_DT_LC.png" alt="Drawing" style="width: 450px;"/> </td>
</tr>

Both graphs showed an increase in accuracy on the validation set as the size of the training sample increased. This is expected; the more data the model can see and learn from, the better it can adjust to predict future classes. The HTRU2 graph shows a slight decrease in accuracy on the training set as the size of the training set increases. A model can easily be fit to classify a small number of training examples correctly. However, as the training set increases in size, the model can no longer classify every training example correctly (due to the restriction on the depth of the tree). In contrast, the Letter Recognition model did not seem to decrease in accuracy on the training data, even when the training set got very large. I attributed this to the larger max depth parameter. A model allowed to grow to a depth of 44 has a greater ability to model the training data, even as the training set grows large. To test this theory, I plotted a second learning curve for a decision tree, this time setting the max depth to 10. The graph below shows that if the tree is further restricted, its ability to model all of the training data does in fact decrease as the training set grows larger.

![](Letters_DT_LC-2.png)

## Accuracy
The final step was to use my models to predict the classes of the test sets and compare the predicted classes to the actual classes. The HTRU2 model predicted the correct class in 97.7% of the test cases. The Letter Recognition model predicted the correct class for 86.8% of the test cases. While this might indicate that the HTRU2 model was better, the nature of the datasets prevents direct comparison. As mentioned previously, a simple algorithm that predicts the negative class regardless of attribute values would be expected to classify ~91% of HTRU2 test cases correctly. To visualize this, I created a confusion matrix for the HTRU2 model, below:

![](HTRU_DT_conf_mat.png)

While the model classified more than 99% of negative test cases correctly, it was less successful on positive cases, classifying only around 82% of them correctly.
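
The per-class rates read off a confusion matrix can be computed directly from the actual and predicted labels. The sketch below uses a small set of made-up labels, not the HTRU2 test set, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fn, fp, tn) for a binary classification problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

# Illustrative labels only: 5 positives (one missed) and 10 negatives (all correct).
actual    = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
positive_rate = tp / (tp + fn)  # share of actual positives predicted correctly
negative_rate = tn / (tn + fp)  # share of actual negatives predicted correctly
print(f"accuracy={accuracy:.3f} positives={positive_rate:.2f} negatives={negative_rate:.2f}")
```

In this toy case the overall accuracy is about 93% even though one in five positives is missed, which is the same pattern the HTRU2 confusion matrix shows at a larger scale.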

# Artificial Neural Networks


# Boosting

# K-Nearest Neighbors
The final model used was a K-Nearest Neighbors (KNN) model. The KNN model attempts to classify a vector by finding the k "closest" data points and observing the distribution of classes among those points. In this algorithm, k is a hyperparameter that can be changed. If k is very small (for example 1), then the algorithm can model the training data exactly, or almost exactly. This results in a low-bias, high-variance model that overfits the training data and may perform poorly on new, unseen test records. Below are complexity curves showing how accuracy on the training and validation sets is affected by the value of k.

<tr>
    <td> <img src="Letters_KNN_LC.png" alt="Drawing" style="width: 450px;"/> </td>
    <td> <img src="HTRU2_KNN_LC.png" alt="Drawing" style="width: 450px;"/> </td>
</tr>

As expected, both models could perfectly model the training data when k=1. In this case, each training point is matched to its closest point, which is in fact itself. Unless there is noise in the data (for example, two records with identical attributes but different classes), k=1 will perfectly classify the training data. The cross validation accuracy scores showed differences: the letter recognition accuracy appeared to be at its highest when k=1 and then decreased for larger values of k. The HTRU2 curve showed an increase in accuracy as k increased, but at some point the accuracy began to decrease (or at least showed no noticeable increase). Using a grid search with 5-fold cross validation, I determined that the optimal values of k for the HTRU2 and Letter Recognition sets were 8 and 1, respectively.
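
The k=1 behavior on training data can be reproduced with a few lines of plain Python. This is a minimal sketch assuming squared Euclidean distance and a simple majority vote, which may differ from the distance metric and tie-breaking used in the actual analysis; the toy points are invented for the example.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        train,
        key=lambda point: sum((a - b) ** 2 for a, b in zip(point[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def accuracy(train, records, k):
    """Fraction of `records` whose predicted label matches the true label."""
    return sum(knn_predict(train, x, k) == y for x, y in records) / len(records)

# Toy 2-D data, purely illustrative: one "B" point sits inside the "A" cluster.
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
         ((0.15, 0.15), "B"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B"), ((1.1, 0.9), "B")]

print(accuracy(train, train, k=1))  # 1.0: each point's nearest neighbor is itself
print(accuracy(train, train, k=3))  # below 1.0: the stray "B" is outvoted by two "A" neighbors
```

With k=1 the model memorizes every training point, including the stray one. With k=3 that point is smoothed over, which lowers training accuracy but may or may not help on unseen data, depending on whether the point is noise.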

## Learning Curves

## Accuracy
The letter recognition KNN model predicted ~95.5% of test records correctly. The HTRU2 KNN model classified ~97.6% of test records correctly. The confusion matrix again showed a much lower success rate on legitimate pulsars: the model predicted only ~79.8% of positive test cases correctly.

![](HTRU_KNN_conf_mat.png)

# Conclusion
One conclusion that I drew from my analysis was that the letter recognition dataset was far less susceptible to overfitting than the HTRU2 dataset. For the letter recognition set, my decision tree model was optimized with a max depth of 44, and the KNN model was optimized with k=1. Typically, a large max depth and a low k mean that a model has a lot of freedom to conform to a specific set of training examples. This can result in high variance and poor accuracy when predicting new examples. In this case, I did not observe this high variance, and concluded that each capital letter is distinctive enough in terms of the 16 attributes that models do not easily overfit.

