# SVMen - Comp 598 - Machine Learning - Project 2 Report

---
### David Rapoport, Ethan Macdonald, Charlie Bloomfield

---
### 1. Introduction
In this report, we discuss our methods for classifying NPR interviews from their text content into one of four categories: Author Interviews, Movie Interviews, Music Interviews, and Interviews. Given an annotated dataset of 54,000 interviews, we implement three vectorizing techniques: Term Frequency–Inverse Document Frequency (TF-IDF), Bag of Words (BoW), and *n*-grams. We then train several models: a Support Vector Classifier (SVC) using scikit-learn, Naive Bayes (NB) implemented by our team, and *k*-Nearest Neighbor (KNN) implemented by our team. Using standard cross validation techniques, we identify scikit-learn's SVC as the best classifier. We then tune our selection techniques to improve the classification performance of this model.

---
### 2. Running the Code

To run our code, we execute main.py. When the program asks for a configuration file, we may enter a custom configuration file or leave the input blank to use the default configuration file. A configuration file defines which vectorizers, feature selectors, and learners we wish to use.

After we select a configuration file, the program displays a list of possible combinations to choose from. Each combination consists of a vectorizer, a selector, and a learner. To select every combination, we type 'a'. Once we have selected the combinations we wish to use, we enter the path to the data file. To use the default path, we type 'd'.
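To make the idea of a combination concrete, the sketch below enumerates every (vectorizer, selector, learner) triple from a configuration. The dictionary layout, all component names, and the comma-separated index syntax are hypothetical placeholders for illustration. They are not the contents of our actual config files or the exact prompt format of main.py.

```python
from itertools import product

# Hypothetical configuration: every name below is a placeholder for illustration.
CONFIG = {
    "vectorizers": ["tfidf", "bow"],
    "selectors": ["none", "chi2"],
    "learners": ["naive_bayes", "knn", "svc"],
}


def list_combinations(config):
    """Return every (vectorizer, selector, learner) triple in the config."""
    return list(
        product(config["vectorizers"], config["selectors"], config["learners"])
    )


def select_combinations(combinations, choice):
    """Return all combinations for 'a', else those at the given indices.

    Assumption: indices are entered as a comma-separated string such as "0,3".
    """
    if choice.strip() == "a":
        return list(combinations)
    indices = [int(token) for token in choice.split(",")]
    return [combinations[i] for i in indices]


combinations = list_combinations(CONFIG)
for index, combo in enumerate(combinations):
    print(index, combo)
```

With two vectorizers, two selectors, and three learners, this toy configuration yields twelve combinations, which is why selecting a subset can save considerable training time.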

In [None]:
execfile("main.py")

Enter the name of the config file in python import notation.
 File must be in config directory. [Default config.default]


The program then trains a model for each of the selected combinations and outputs the accuracy of each model. At the end, the program returns information about the best model based on accuracy (TODO: validation set accuracy? test set accuracy? cross validation?). This information includes which vectorizer, selector, learner, and parameters were used.

---
### 3. Data Preprocessing
For this project, we used three different vectorizing techniques: TF-IDF, BoW, and *n*-gram. 

TF-IDF is a statistic used to quantify how important a word or phrase is to a document. The TF-IDF value increases with the number of times a given word or phrase appears in a document, but this increase is offset by how many documents in the entire corpus contain that word or phrase. This is mathematically represented by the following equation:

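$$ \mathrm{tfidf}(t,d,D) = \log \frac{N}{|\{d' \in D: t \in d'\}|} \times \left(0.5 + \frac{0.5 \times \mathrm{f}(t, d)}{\max\{\mathrm{f}(t', d):t' \in d\}}\right) $$

Here $N = |D|$ is the number of documents in the corpus $D$, and $\mathrm{f}(t, d)$ is the number of times the term $t$ appears in document $d$.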
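As an illustration of the equation above, the following is a minimal Python 3 sketch that evaluates it directly on tokenized documents. It is not the vectorizer used in our pipeline. The function name `tfidf`, the toy corpus, and the choice to return zero for terms that appear in no document are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf(term, document, corpus):
    """Augmented-frequency TF-IDF of `term` in `document`.

    `document` is a list of tokens and `corpus` is a list of such documents.
    """
    doc_freq = sum(1 for d in corpus if term in d)
    if doc_freq == 0 or not document:
        # Assumption: unseen terms and empty documents score zero,
        # which also avoids dividing by zero.
        return 0.0
    idf = math.log(len(corpus) / doc_freq)
    counts = Counter(document)
    tf = 0.5 + 0.5 * counts[term] / max(counts.values())
    return idf * tf


# Toy corpus for illustration only.
corpus = [
    ["jazz", "album", "jazz"],
    ["film", "director"],
    ["novel", "author", "album"],
]
print(tfidf("jazz", corpus[0], corpus))   # rare term, frequent in this document
print(tfidf("album", corpus[0], corpus))  # appears in two of three documents
```

In this toy corpus, "jazz" scores higher than "album" in the first document because it is both more frequent there and rarer across the corpus.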

BoW, on the other hand, counts how many times each word or phrase appears in a document without offsetting.

For both BoW and TF-IDF we must decide whether to count unigrams, bigrams, trigrams, or some combination thereof. As the length of the *n*-grams we consider grows, variance increases and bias decreases. As such, it is important to select the right range of *n*-gram lengths to include in our vectors. The range of *n*-gram lengths considered was determined independently for each combination of vectorizer, selector, and learner.
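The following sketch shows how a BoW vector over a range of *n*-gram lengths could be built. It is a simplified stand-in rather than our actual vectorizer: the helper names `ngrams` and `bag_of_ngrams` are hypothetical, and whitespace tokenization is an assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens, min_n=1, max_n=2):
    """Count every n-gram with min_n <= n <= max_n (a BoW vector as a Counter)."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Assumption: simple whitespace tokenization of a toy sentence.
tokens = "the band played the song".split()
bag = bag_of_ngrams(tokens, 1, 2)
for gram, count in sorted(bag.items()):
    print(gram, count)
```

Even in this five-word example, adding bigrams doubles the number of distinct features from four to eight, which hints at how quickly the feature space grows with longer *n*-grams.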

TODO: Figures — comparison of BoW vs. TF-IDF in various scenarios. Comparison of unigram, bigram, trigram.

---
### 4. Feature Design and Selection
David


---
### 5. Parameter Selection
In order to select optimal hyperparameters for our models, we employ GridSearchCV from scikit-learn, which exhaustively searches a manually defined grid of hyperparameter values. This search is guided by performance on our validation set.
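The idea behind an exhaustive grid search can be sketched in a few lines of plain Python. This is not scikit-learn's implementation, which additionally averages scores over cross-validation folds. The `grid_search` helper and the `toy_score` function are hypothetical stand-ins for "train with these parameters, then score on held-out data".

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return (best_params, best_score).

    param_grid maps parameter names to lists of candidate values; score_fn takes
    a dict of parameters and returns a validation score (higher is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Toy scoring function (an assumption): peaks at C=1.0 with a linear kernel."""
    penalty = 0.0 if params["kernel"] == "linear" else 0.5
    return -(params["C"] - 1.0) ** 2 - penalty


best_params, best_score = grid_search(
    {"C": [0.1, 1.0, 2.5, 5.0], "kernel": ["linear", "rbf"]}, toy_score
)
print(best_params, best_score)
```

In practice the scoring function is the expensive part, since each call trains a model from scratch.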

TODO: Figures — comparison of various hyperparameters?

---
### 6. Testing and Validation

---
### 7. Algorithm Selection

##### 7.1 Naive Bayes
David


##### 7.2 *k*-Nearest Neighbor
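We implemented *k*-Nearest Neighbor as our standard algorithm. *k*-Nearest Neighbor involves classifying a document by a majority vote among the *k* training documents closest to it in feature space.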
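A minimal sketch of *k*-nearest-neighbor classification is given below: a query point receives the majority label among its *k* closest training points. This is not our team's implementation. The Euclidean distance, the dense tuple vectors, and the toy data are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    neighbors = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy two-dimensional feature vectors, for illustration only.
train_vectors = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_labels = ["music", "music", "movie", "movie", "movie"]

print(knn_predict(train_vectors, train_labels, (0.05, 0.1), k=3))
print(knn_predict(train_vectors, train_labels, (1.0, 0.9), k=3))
```

Because every prediction compares the query against all training points, prediction cost grows linearly with the size of the training set, which matters for a corpus of this size.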

##### 7.3 SVM
Finally, we test scikit-learn's Support Vector Classifier on the NPR interviews. This implementation exposes many model hyperparameters and optimization parameters. We explore two of them: the penalty value for misclassifications and the kernel used by the SVC.

Our initial attempts to run SVC with the default parameters on the entire dataset prove infeasible on our machines due to runtime constraints: the polynomial runtime of SVC training leads to several hours of execution with no result. We therefore start again and assess its classification rate on 1% of the dataset. Using our existing framework for validating on a single 80/20 train/test split, we get relatively impressive initial results with the default parameters. Coupled with the optimum TF-IDF feature set, we see a 58.88% classification rate.
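The evaluation procedure described here, a 1% subsample followed by a single 80/20 train/test split, can be sketched as follows. The helper names, the fixed random seed, and the integer stand-ins for interviews are assumptions for illustration, not our validation framework itself.

```python
import random


def subsample(examples, fraction, seed=0):
    """Return a random subset containing roughly `fraction` of the examples."""
    rng = random.Random(seed)
    size = max(1, round(len(examples) * fraction))
    return rng.sample(examples, size)


def split_train_test(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


examples = list(range(54000))           # stand-in for the 54,000 interviews
small = subsample(examples, 0.01)       # 1% of the data
train, test = split_train_test(small)   # single 80/20 split
print(len(small), len(train), len(test))
```

A single split on such a small sample gives a noisy estimate, so a result obtained this way should be read as indicative rather than precise.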

--show graph

With these initial results, we explore tuning the misclassification penalty and kernel parameters of the SVC. We are within 10% of our best Naive Bayes classification results and hope the tuning might make the difference. To do so, we perform a search over all of scikit-learn's provided kernels ["rbf", "linear", "poly", "sigmoid", "precomputed"] with each of the C values [0.1, 1.0, 2.5, 5.0]. We include every provided kernel for completeness, while the choice of C values was somewhat arbitrary. Testing a wider range of values would be preferable, but the number of combinations grows multiplicatively with each added parameter value and quickly becomes too expensive to execute.
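To illustrate how quickly the search grows, the sketch below counts the model fits implied by the kernel and C values listed above. The `gamma_values` list and the fold count are hypothetical additions used only to show the multiplicative growth. They were not part of our search.

```python
from itertools import product

kernels = ["rbf", "linear", "poly", "sigmoid", "precomputed"]
c_values = [0.1, 1.0, 2.5, 5.0]

# Note: scikit-learn's "precomputed" kernel expects a precomputed kernel
# matrix as input rather than raw feature vectors.
grid = list(product(kernels, c_values))


def total_fits(num_combinations, folds=1):
    """Number of model fits: one per parameter combination per fold."""
    return num_combinations * folds


# Hypothetical extension: a third parameter multiplies the grid again.
gamma_values = [0.001, 0.01, 0.1]  # illustrative values only
bigger_grid = list(product(kernels, c_values, gamma_values))

print(total_fits(len(grid)))                   # kernels x C values
print(total_fits(len(bigger_grid)))            # kernels x C values x gamma values
print(total_fits(len(bigger_grid), folds=5))   # with 5-fold cross-validation
```

Each of these fits carries the full training cost of an SVC, so even a modest grid becomes expensive on a large dataset.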


--show graph of svc combinations

With the insight we gained from running SVC on 1% of the data, we are ready to train on all the data and try to make a submission to see how our training generalizes to test data. We are still running into problems with running SVC on the entire dataset, so we choose to run our code on a remote server with improved processing power, provided by [DigitalOcean](https://www.digitalocean.com/). We install and run our program to train the SVC model overnight, but an entire night of execution on the remote server proves insufficient to train the model, and we terminate its execution. This is somewhat surprising at the time, though reconsidering the polynomial runtime suggests that training time should grow much faster than linearly with the size of the dataset.

We are unable to complete model training on the full dataset. Because the test results we do have for SVC show classification rates lower than those of Naive Bayes, we do not use it in making a submission.

---
### 8. Optimization


---
### 9. Further Discussion


---
### 10. Statement of Faith
We hereby state that all the work presented in this report is that of the authors.

---
### 11. References

TODO: scikit-learn reference

TODO: David's refs from facebook?
