# SVMen - Comp 598 - Machine Learning - Project 2 Report

---
### David Rapoport, Ethan Macdonald, Charlie Bloomfield

---
### 1. Introduction
In this report, we discuss our methods for classifying NPR interviews, based on their text content, into one of four categories: Author Interviews, Movie Interviews, Music Interviews, and Interviews. Given an annotated dataset of 54,000 interviews, we implement three vectorizing techniques: Term Frequency–Inverse Document Frequency (TF-IDF), Bag of Words (BoW), and *n*-grams. We then train several models: a Support Vector Classifier (SVC) from scikit-learn, and Naive Bayes (NB) and *k*-Nearest Neighbor (KNN) classifiers implemented by our team. Using standard cross-validation techniques, we identify scikit-learn's SVC as the best classifier. We then tune our selection techniques to improve the classification performance of this model.

---
### 2. Running the Code

To run our code, we execute main.py. When the program asks for a configuration file, we may enter a custom configuration file or leave the input blank to use the default. A configuration file defines which vectorizers, feature selectors, and learners we wish to use.

After we select a configuration file, the program displays a list of possible combinations to choose from. Each combination consists of a vectorizer, a selector, and a learner. To select every combination, type 'a'. Once we have selected the combinations we wish to use, we enter the path to the data file. To use the default path, type 'd'.

In [None]:
# execfile is Python 2 only; under Python 3 use exec(open("main.py").read())
execfile("main.py")

Enter the name of the config file in python import notation.
 File must be in config directory. [Default config.default]


The program then trains a model for each selected combination and outputs its accuracy. At the end, the program reports the most accurate model (TODO: validation set accuracy? test set accuracy? cross validation?), including which vectorizer, selector, learner, and parameters were used.

---
### 3. Data Preprocessing
The data is originally given as a CSV file with the columns id, text, and label. Before creating feature sets, we load the text and targets and preprocess the text in several normalization steps. First, the text is tokenized using the NLTK wordpunct tokenizer. We then lowercase each token, discard non-alphabetic tokens (parentheses, commas, apostrophes, etc.), and discard tokens shorter than three characters. The reason for this last step is to remove the tail ends of contractions as well as two-letter words such as "be", "is", and "am", which are most likely irrelevant stop words. Next, we use the NLTK WordNetLemmatizer to lemmatize each token. We chose to lemmatize rather than stem because lemmatization is more likely to return a proper English word, which means we could later leverage the meaning of the word (for example with word2vec). The final preprocessing step is to remove stop words. The stop words used are the NLTK stop words with "__EOS__" appended to the list. An example of the tokenizing, filtering, and lemmatizing steps is shown below.

In [7]:
from nltk.tokenize import wordpunct_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(wordpunct_tokenize("Shouldn't can't isn't doesn't!!!"))
tokens = wordpunct_tokenize("Shouldn't this author sing? He isn't Michael Jackson!!!")
print([lemmatizer.lemmatize(word.lower()) for word in tokens if len(word)>2 and word.isalpha()])

['Shouldn', "'", 't', 'can', "'", 't', 'isn', "'", 't', 'doesn', "'", 't', '!!!']
['shouldn', 'this', 'author', 'sing', 'isn', 'michael', 'jackson']
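
The final stop-word step is not part of the cell above. A minimal sketch of it follows, assuming a small hand-written stop list in place of the full NLTK list (so the snippet runs without downloading NLTK data); `STOP_WORDS` and `remove_stop_words` are hypothetical names and not necessarily those used in our code.

```python
# Hypothetical excerpt standing in for the full NLTK English stop-word list;
# "__EOS__" is appended as described in the text.
STOP_WORDS = {"this", "that", "the", "and", "isn", "shouldn", "doesn"}
STOP_WORDS.add("__EOS__")


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token found in the stop-word set, preserving token order."""
    return [token for token in tokens if token not in stop_words]


# Lemmatized tokens from the example above, plus an end-of-sentence marker.
tokens = ["shouldn", "this", "author", "sing", "isn", "michael", "jackson", "__EOS__"]
filtered = remove_stop_words(tokens)
print(filtered)
```

With this stand-in list, only the content words of the example sentence survive.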


---
### 4. Feature Design and Selection
For this project, we used scikit-learn's CountVectorizer and TfidfVectorizer, as well as our home-grown Word2VecVectorizer, for feature design. We used scikit-learn's SelectPercentile for feature selection.

TF-IDF is a statistic used to quantify how important a word or phrase is to a document. The TF-IDF value increases with the number of times a word or phrase appears in a document, but is offset by the number of documents in the corpus that contain it. This is mathematically represented by the following equation:

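$$ \mathrm{tfidf}(t,d,D) = \log \frac{N}{|\{d' \in D: t \in d'\}|} \times \left(0.5 + \frac{0.5 \times \mathrm{f}(t, d)}{\max\{\mathrm{f}(t', d):t' \in d\}}\right) $$

where $N = |D|$ is the number of documents in the corpus $D$ and $\mathrm{f}(t, d)$ is the number of times term $t$ appears in document $d$.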

BoW, on the other hand, counts how many times each word or phrase appears in a document without offsetting.
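
To make the difference concrete, here is a small plain-Python sketch that computes BoW counts and the TF-IDF equation above on a toy corpus. It is an illustration only: the corpus and the function names `bag_of_words` and `tfidf` are made up for this example, and it assumes the document is non-empty and belongs to the corpus. It is not the scikit-learn implementation, whose default weighting may differ from this equation.

```python
import math
from collections import Counter


def bag_of_words(doc):
    """Raw count of each term in one tokenized document."""
    return Counter(doc)


def tfidf(doc, corpus):
    """TF-IDF weight of each term in doc, using augmented term frequency.

    Assumes doc is non-empty and is itself one of the documents in corpus,
    so every term has a document frequency of at least one.
    """
    counts = Counter(doc)
    max_count = max(counts.values())
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        # Number of documents in the corpus that contain the term.
        doc_freq = sum(1 for other in corpus if term in other)
        idf = math.log(float(n_docs) / doc_freq)
        tf = 0.5 + 0.5 * count / max_count
        weights[term] = idf * tf
    return weights


# Toy corpus of three already-preprocessed documents (illustrative only).
corpus = [
    ["author", "book", "novel"],
    ["movie", "film", "author"],
    ["music", "album", "song", "song"],
]

counts = bag_of_words(corpus[2])
weights_music = tfidf(corpus[2], corpus)
weights_author = tfidf(corpus[0], corpus)
print(counts["song"], weights_music["song"], weights_author["author"])
```

In the first document, "author" appears in two of the three documents and so receives a lower weight than "book", even though both occur once; BoW would score them identically.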

For both BoW and TF-IDF, we must decide whether to count unigrams, bigrams, trigrams, or some combination thereof. Longer *n*-grams enlarge the feature space, so variance increases and bias decreases as the length of the *n*-grams we consider grows. As such, it is important to select the right range of *n*-gram lengths to include in our vectors. The range of *n*-gram lengths was determined independently for each combination of vectorizer, selector, and learner.
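
As a rough illustration of what a range of *n*-gram lengths means, the sketch below enumerates the *n*-grams of a token list in plain Python. The helper `extract_ngrams` is hypothetical and written for this example; in scikit-learn the equivalent choice is made through the `ngram_range` parameter of CountVectorizer and TfidfVectorizer.

```python
def extract_ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams with n_min <= n <= n_max as space-joined strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["author", "sing", "michael", "jackson"]
unigrams = extract_ngrams(tokens, 1, 1)
uni_and_bigrams = extract_ngrams(tokens, 1, 2)
trigrams = extract_ngrams(tokens, 3, 3)
print(uni_and_bigrams)
```

Even on four tokens, adding bigrams grows the feature list from four entries to seven, which hints at how quickly the vocabulary expands on a full corpus.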

TODO: Figures — comparison of BoW vs. TF-IDF in various scenarios. Comparison of unigram, bigram, trigram.
David

---
### 5. Algorithm Selection

##### 5.1 Naive Bayes
David


##### 5.2 *k*-Nearest Neighbor
We implemented *k*-Nearest Neighbor (KNN) as our standard algorithm, chosen for its simplicity of implementation.
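
For readers unfamiliar with the method, a minimal KNN classifier can be sketched as follows. This is not our team's implementation: the Euclidean distance, the majority vote, the default `k=3`, and the names `euclidean` and `knn_predict` are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest neighbors."""
    distances = sorted(
        (euclidean(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # Vote ties are not handled specially in this sketch.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-dimensional feature vectors (illustrative only).
train_vectors = [[2, 0], [3, 1], [0, 2], [1, 3], [0, 3]]
train_labels = ["music", "music", "movie", "movie", "movie"]

prediction_a = knn_predict(train_vectors, train_labels, [2, 1], k=3)
prediction_b = knn_predict(train_vectors, train_labels, [0, 2.5], k=3)
print(prediction_a, prediction_b)
```

There is no training step beyond storing the vectors, which is what makes the algorithm simple to implement; the cost is paid at prediction time, when every stored vector is compared against the query.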

##### 5.3 SVM

---
### 6. Optimization

---
### 7. Parameter Selection
To select optimal hyperparameters for our models, we employ GridSearchCV from scikit-learn, which exhaustively searches a manually defined range of hyperparameters. This search is guided by performance on our validation set.
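
The idea of an exhaustive search can be sketched in plain Python as below. This is a conceptual stand-in, not GridSearchCV itself: `grid_search` and `toy_validation_score` are hypothetical, the scoring function is a made-up formula replacing "train the model and score it on held-out data", and the grid over `C` and `kernel` is illustrative rather than the grid we actually searched.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination of values in param_grid and return the best."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Hypothetical score; a real run would train and evaluate a model here."""
    penalty = 0.05 if params["kernel"] == "rbf" else 0.0
    return 1.0 - abs(params["C"] - 10) / 100.0 - penalty


# Illustrative grid: 4 values of C times 2 kernels gives 8 candidates.
param_grid = {"C": [0.1, 1, 10, 100], "kernel": ["linear", "rbf"]}
best_params, best_score = grid_search(param_grid, toy_validation_score)
print(best_params, best_score)
```

Because every combination is evaluated, the cost grows multiplicatively with each added hyperparameter, which is why the ranges are defined by hand.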

TODO: Figures — comparison of various hyperparameters?

---
### 8. Testing and Validation

---
### 9. Further Discussion


We hereby state that all the work presented in this report is that of the authors.

---
### 11. References

TODO: scikit-learn reference

TODO: David's refs from facebook?