## NLP for ML Classification

**Hypothesis**: Part-of-speech (POS) tagging and syntactic dependency parsing provide valuable information for classifying imperative phrases. The thinking is that being able to detect imperative phrases will transfer well to detecting tasks and to-dos.

#### Some Terminology
- [_Imperative mood_](https://en.wikipedia.org/wiki/Imperative_mood) is "used principally for ordering, requesting or advising the listener to do (or not to do) something... also often used for giving instructions as to how to perform a task."
- _Part of speech (POS)_ is a way of categorizing a word based on its syntactic function.
    - The POS tagger from Spacy.io that is used in this notebook differentiates between [*pos_* and *tag_*](https://spacy.io/docs/api/annotation#pos-tagging-english) - *POS (pos_)* refers to "coarse-grained part-of-speech" like `VERB`, `ADJ`, or `PUNCT`; and *POSTAG (tag_)* refers to "fine-grained part-of-speech" like `VB`, `JJ`, or `.`.
- _Syntactic dependency parsing_ is a way of connecting words based on syntactic relationships, [such as](https://spacy.io/docs/api/annotation#dependency-parsing-english) `DOBJ` (direct object), `PREP` (prepositional modifier), or `POBJ` (object of preposition).
    - Check out the dependency parse for the phrase ["Send the report by Kyle by tomorrow"](https://demos.explosion.ai/displacy/?text=Send%20the%20report%20by%20Kyle%20by%20tomorrow&model=en&cpu=1&cph=1) as an example

#### Features
The imperative mood centers around _actions_, and actions are generally represented in English using verbs. So the features are engineered to also center on the VERB:
1. *FeatureName.VERB*: Does the phrase contain VERB(s) of the tag form VB*?
2. *FeatureName.FOLLOWING_POS*: Are the words following the VERB(s) of certain parts of speech?
3. *FeatureName.FOLLOWING_POSTAG*: Are the words following the VERB(s) of certain POS tags?
4. *FeatureName.CHILD_DEP*: Are the VERB(s) parents of certain syntactic dependencies?
5. *FeatureName.PARENT_DEP*: Are the VERB(s) children of certain syntactic dependencies?
6. *FeatureName.CHILD_POS*: Are the words that the VERB(s) parent (their dependency children) of certain parts of speech?
7. *FeatureName.CHILD_POSTAG*: Are the words that the VERB(s) parent (their dependency children) of certain POS tags?
8. *FeatureName.PARENT_POS*: Are the words that the VERB(s) are children of (their dependency heads) of certain parts of speech?
9. *FeatureName.PARENT_POSTAG*: Are the words that the VERB(s) are children of (their dependency heads) of certain POS tags?

Note that features 2-9 all depend on feature 1 being `True`; if it is `False`, phrase vectorization will result in all zeroes.
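
To make the all-zeroes case concrete, here is a minimal sketch of how a per-phrase feature dictionary could be turned into a fixed-length vector. The helper `vectorize` and the tiny vocabulary are illustrative assumptions, not the notebook's code (the notebook relies on NLTK's `SklearnClassifier`, which vectorizes internally).

```python
def vectorize(features, vocabulary):
    """Map a {feature_name: count} dict onto a fixed-order list of counts."""
    return [features.get(name, 0) for name in vocabulary]

# Hypothetical vocabulary: in practice it is learned from the training set.
vocabulary = ["VERB", "FOLLOWING_POS_NOUN", "CHILD_DEP_DOBJ", "PARENT_DEP_ROOT"]

with_verb = {"VERB": 1, "FOLLOWING_POS_NOUN": 1, "CHILD_DEP_DOBJ": 1}
without_verb = {}  # no VB* tag found, so no other feature was recorded either

print(vectorize(with_verb, vocabulary))     # [1, 1, 1, 0]
print(vectorize(without_verb, vocabulary))  # [0, 0, 0, 0]
```

A phrase whose verb is missed by the tagger is therefore indistinguishable from any other verbless phrase.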

## Data and Setup

### Building a recipe corpus

I wrote and ran `epicurious_recipes.py`\* to scrape Epicurious.com for recipe instructions and descriptions. Output is `epicurious-pos.txt` and `epicurious-neg.txt`.

\* _script (very) loosely based off of https://github.com/benosment/hrecipe-parse_

Note that deriving all negative examples in the training set from Epicurious recipe descriptions would result in negative examples that are longer and syntactically more complicated than the positive examples. This is a form of bias.

To (hopefully?) correct for this a bit, I will add the short movie reviews found at https://pythonprogramming.net/static/downloads/short_reviews/ as more negative examples.

This still feels weird because we're selecting negative examples only from specific categories of text (recipe descriptions, short movie reviews) - just because they're readily available.

Ultimately though, this recipe corpus is a **stopgap/proof of concept** for a corpus more relevant to tasks later on, so I won't worry further about this for now.
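
A quick way to check the length bias described above is to compare token counts per class. The sketch below uses made-up example strings and a whitespace split as a stand-in for real tokenization; the function name `length_summary` is illustrative.

```python
from statistics import mean, median

def length_summary(lines):
    """Return (mean, median) of whitespace-token counts for a list of strings."""
    lengths = [len(line.split()) for line in lines if line.strip()]
    return mean(lengths), median(lengths)

# Toy stand-ins for the positive and negative files (not real corpus data).
pos_examples = ["Transfer cutlets to a platter.", "Prepare the dough"]
neg_examples = ["This rich, slow-cooked stew has been a family favorite for years and reheats well."]

print("pos:", length_summary(pos_examples))
print("neg:", length_summary(neg_examples))
```

If the two summaries differ a lot, a classifier may be learning phrase length rather than mood.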

In [1]:
import os
import random

In [2]:
BASE_DIR = os.getcwd()
pos_data_path = BASE_DIR + '/pos.txt'
neg_data_path = BASE_DIR + '/neg.txt'

In [3]:
with open(pos_data_path, 'r', encoding='utf-8') as f:
    pos_data = f.read()
with open(neg_data_path, 'r', encoding='utf-8') as f:
    neg_data = f.read()

In [4]:
pos_data_split = pos_data.split('\n')
neg_data_split = neg_data.split('\n')

num_pos = len(pos_data_split)
num_neg = len(neg_data_split)

# 50/50 split between the number of positive and negative samples
num = min(num_pos, num_neg)

# shuffle samples
random.shuffle(pos_data_split)
random.shuffle(neg_data_split)

In [5]:
lines = []
for l in pos_data_split[:num]:
    lines.append((l, 'pos'))
for l in neg_data_split[:num]:
    lines.append((l, 'neg'))

In [6]:
from enum import Enum, auto
class FeatureName(Enum):
    VERB = auto()
    FOLLOWING_POS = auto()
    FOLLOWING_POSTAG = auto()
    CHILD_DEP = auto()
    PARENT_DEP = auto()
    CHILD_POS = auto()
    CHILD_POSTAG = auto()
    PARENT_POS = auto()
    PARENT_POSTAG = auto()

## [spaCy.io](https://spacy.io/) for NLP
_Because Stanford CoreNLP is hard to install for Python_

Found Spacy through an article on ["Training a Classifier for Relation Extraction from Medical Literature"](https://www.microsoft.com/developerblog/2016/09/13/training-a-classifier-for-relation-extraction-from-medical-literature/) ([GitHub](https://github.com/CatalystCode/corpus-to-graph-ml))

<img src="nltk_library_comparison.png" alt="NLTK library comparison chart https://spacy.io/docs/api/#comparison" style="width: 400px; margin: 0;"/>

In [7]:
#!conda config --add channels conda-forge
#!conda install spacy
#!python -m spacy download en

### Using the Spacy Data Model for NLP

In [8]:
import spacy
nlp = spacy.load('en')

Spacy's sentence segmentation is lacking... https://github.com/explosion/spaCy/issues/235. So each '\n' will start a new Spacy Doc.

In [9]:
def create_spacy_docs(ll):
    dd = [(nlp(l[0]), l[1]) for l in ll]
    # collapse noun phrases into single compounds
    for d in dd:
        for np in d[0].noun_chunks:
            np.merge(np.root.tag_, np.text, np.root.ent_type_)
    return dd

In [10]:
docs = create_spacy_docs(lines)

### NLP output

Tokenization, POS tagging, and dependency parsing happened automatically with the `nlp(line)` calls above! So let's look at the outputs.

https://spacy.io/docs/usage/data-model and https://spacy.io/docs/api/doc will be useful going forward

In [11]:
for doc in docs[:10]:
    print(list(doc[0].sents))

[Serve pasta salad topped with remaining 1/2 cup basil and 1/2 cup mint and a drizzle of oil.]
[Top with onion, then tomatoes.]
[Repeat with the remaining leaves.]
[Add scallions and paprika; season with salt.]
[Decorate with the remaining sour cherries and peanuts.]
[Add the garlic and chile and cook for 2 minutes or until the garlic is soft.]
[Bring to room temperature before serving.]
[Paul, do your homework now]
[Transfer cutlets to a platter.]
[Prepare the dough]


In [12]:
for doc in docs[:10]:
    print(list(doc[0].noun_chunks))

[Serve pasta salad, 1/2 cup basil, 1/2 cup mint, a drizzle, oil]
[onion]
[the remaining leaves]
[scallions, paprika, season, salt]
[the remaining sour cherries, peanuts]
[the garlic, chile, cook, 2 minutes, the garlic]
[temperature]
[Paul, your homework]
[Transfer, a platter]
[the dough]


[Spacy's dependency graph visualization](https://demos.explosion.ai/displacy)

In [13]:
for doc in docs[:5]:
    for token in doc[0]:
        print(token.text, token.dep_, token.lemma_, token.pos_, token.tag_, token.head, list(token.children))

Serve pasta salad ROOT Serve pasta salad NOUN NN Serve pasta salad [topped, .]
topped acl top VERB VBN Serve pasta salad [with]
with prep with ADP IN topped [remaining]
remaining pcomp remain VERB VBG with [1/2 cup basil]
1/2 cup basil dobj 1/2 cup basil NOUN NN remaining [and, 1/2 cup mint]
and cc and CCONJ CC 1/2 cup basil []
1/2 cup mint conj 1/2 cup mint NOUN NN 1/2 cup basil [and, a drizzle]
and cc and CCONJ CC 1/2 cup mint []
a drizzle conj a drizzle NOUN NN 1/2 cup mint [of]
of prep of ADP IN a drizzle [oil]
oil pobj oil NOUN NN of []
. punct . PUNCT . Serve pasta salad []
Top ROOT top ADJ JJ Top [with, tomatoes, .]
with prep with ADP IN Top [onion]
onion pobj onion NOUN NN with [,]
, punct , PUNCT , onion []
then advmod then ADV RB tomatoes []
tomatoes dep tomato NOUN NNS Top [then]
. punct . PUNCT . Top []
Repeat ROOT repeat VERB VB Repeat [with, .]
with prep with ADP IN Repeat [the remaining leaves]
the remaining leaves pobj the remaining leaves NOUN NNS with []
. punct . PUN

### Featurization

In [14]:
import re
from collections import defaultdict

def featurize(d):
    s_features = defaultdict(int)
    for idx, token in enumerate(d):
        #print(token, token.pos_, token.tag_)
        if re.match(r'VB.?', token.tag_) is not None: # note: not using token.pos == VERB because this also includes BES, HVS, MD tags 
            s_features[FeatureName.VERB.name] += 1
            # FOLLOWING_POS
            next_idx = idx + 1
            if next_idx < len(d):
                s_features[f'{FeatureName.FOLLOWING_POS.name}_{d[next_idx].pos_}'] += 1
                s_features[f'{FeatureName.FOLLOWING_POSTAG.name}_{d[next_idx].tag_}'] += 1
            # VERB_HEAD_DEP
            # VERB_HEAD_POS
            '''
            "Because the syntactic relations form a tree, every word has exactly one head.
            You can therefore iterate over the arcs in the tree by iterating over the words in the sentence."
            https://spacy.io/docs/usage/dependency-parse#navigating
            '''
            if (token.head is not token):
                s_features[f'{FeatureName.PARENT_DEP.name}_{token.head.dep_.upper()}'] += 1
                s_features[f'{FeatureName.PARENT_POS.name}_{token.head.pos_}'] += 1
                s_features[f'{FeatureName.PARENT_POSTAG.name}_{token.head.tag_}'] += 1
            # VERB_CHILD_DEP
            # VERB_CHILD_POS
            for child in token.children:
                s_features[f'{FeatureName.CHILD_DEP.name}_{child.dep_.upper()}'] += 1
                s_features[f'{FeatureName.CHILD_POS.name}_{child.pos_}'] += 1
                s_features[f'{FeatureName.CHILD_POSTAG.name}_{child.tag_}'] += 1
    return dict(s_features)
        #print(dict(s_features))
    #print()

#print(featuresets, len(featuresets))

In [15]:
featuresets = [(doc[0], (featurize(doc[0]), doc[1])) for doc in docs]

In [16]:
from statistics import mean, median, mode, stdev
f_lengths = [len(fs[1][0]) for fs in featuresets]

print('Stats on number of features per example:')
print(f'mean: {mean(f_lengths)}')
print(f'stdev: {stdev(f_lengths)}')
print(f'median: {median(f_lengths)}')
print(f'mode: {mode(f_lengths)}')
print(f'max: {max(f_lengths)}')
print(f'min: {min(f_lengths)}')

Stats on number of features per example:
mean: 22.856582388840454
stdev: 14.66419131467603
median: 22.0
mode: 0
max: 75
min: 0


In [17]:
featuresets[:2]

[(Serve pasta salad topped with remaining 1/2 cup basil and 1/2 cup mint and a drizzle of oil.,
  ({'CHILD_DEP_DOBJ': 1,
    'CHILD_DEP_PREP': 1,
    'CHILD_POSTAG_IN': 1,
    'CHILD_POSTAG_NN': 1,
    'CHILD_POS_ADP': 1,
    'CHILD_POS_NOUN': 1,
    'FOLLOWING_POSTAG_IN': 1,
    'FOLLOWING_POSTAG_NN': 1,
    'FOLLOWING_POS_ADP': 1,
    'FOLLOWING_POS_NOUN': 1,
    'PARENT_DEP_PREP': 1,
    'PARENT_DEP_ROOT': 1,
    'PARENT_POSTAG_IN': 1,
    'PARENT_POSTAG_NN': 1,
    'PARENT_POS_ADP': 1,
    'PARENT_POS_NOUN': 1,
    'VERB': 2},
   'pos')),
 (Top with onion, then tomatoes., ({}, 'pos'))]

On one run, the above line printed the following featureset:
`(Gather foil loosely on top and bake for 1 1/2 hours., ({}, 'pos'))`

This is because the Spacy.io POS tagger tagged the phrase as follows, with no VERBs at all, which is incorrect (`Gather` and `bake` are both imperative verbs here):
   `Gather/NNP foil/NN loosely/RB on/IN top/NN and/CC bake/NN for/IN 1 1/2 hours./NNS`

---
Compare to [Stanford CoreNLP POS tagger](http://nlp.stanford.edu:8080/corenlp/process):
   `Gather/VB foil/NN loosely/RB on/IN top/JJ and/CC bake/VB for/IN 1 1/2/CD hours/NNS ./.`

And [Stanford Parser](http://nlp.stanford.edu:8080/parser/index.jsp):
   `Gather/NNP foil/VB loosely/RB on/IN top/NN and/CC bake/VB for/IN 1 1/2/CD hours/NNS ./.`
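
The effect of these tagging differences on the `VERB` feature can be reproduced without running a tagger. The sketch below applies the same `VB.?` pattern used in `featurize` to the tag sequences quoted above; the helper name `count_verb_tags` is an assumption for illustration.

```python
import re

VERB_TAG = re.compile(r'VB.?')

def count_verb_tags(tags):
    """Count tags that start with VB (VB, VBD, VBG, VBN, VBP, VBZ)."""
    return sum(1 for tag in tags if VERB_TAG.match(tag))

# Tag sequences copied from the taggers' outputs quoted above.
spacy_tags = ["NNP", "NN", "RB", "IN", "NN", "CC", "NN", "IN", "NNS"]
corenlp_tags = ["VB", "NN", "RB", "IN", "JJ", "CC", "VB", "IN", "CD", "NNS", "."]
stanford_parser_tags = ["NNP", "VB", "RB", "IN", "NN", "CC", "VB", "IN", "CD", "NNS", "."]

for name, tags in [("spaCy", spacy_tags),
                   ("CoreNLP", corenlp_tags),
                   ("Stanford Parser", stanford_parser_tags)]:
    print(name, count_verb_tags(tags))
```

With zero matches, `featurize` returns an empty dict, which is the `({}, 'pos')` case shown earlier.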

### Classification

In [18]:
random.shuffle(featuresets)

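# NOTE: `num` is the per-class count, so `featuresets` holds 2 * num examples.
# Holding out round(num / 5) therefore reserves about 10% of all examples,
# and `num - split_num` below understates the actual training set size.
split_num = round(num / 5)

print(f'# training samples: {num-split_num}')
print(f'# test samples: {split_num}')

# train and test sets
testing_set = [fs[1] for fs in featuresets[:split_num]]
training_set = [fs[1] for fs in featuresets[split_num:]]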

# training samples: 1835
# test samples: 459
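
For reference, a hold-out split can also be expressed as a fraction of the full, shuffled dataset, which keeps the reported sizes tied to the actual list lengths. This is a minimal sketch; `holdout_split` and its `test_fraction` and `seed` parameters are illustrative names rather than part of the notebook.

```python
import random

def holdout_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and split it into (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy data standing in for (features, label) pairs.
toy = [({"VERB": i % 3}, "pos" if i % 2 else "neg") for i in range(10)]
train, test = holdout_split(toy)
print(f'# training samples: {len(train)}')
print(f'# test samples: {len(test)}')
```

Passing a fixed `seed` makes the split repeatable between runs, which helps when comparing classifiers.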


In [19]:
# decoupling the functionality of nltk.classify.accuracy
def predict(classifier, gold, prob=True):
    if (prob is True):
        predictions = classifier.prob_classify_many([fs for (fs, ll) in gold])
    else:
        predictions = classifier.classify_many([fs for (fs, ll) in gold])
    return list(zip(predictions, [ll for (fs, ll) in gold]))

def accuracy(predicts, prob=True):
    if (prob is True):
        correct = [label == prediction.max() for (prediction, label) in predicts]
    else:
        correct = [label == prediction for (prediction, label) in predicts]
        
    if correct:
        return sum(correct) / len(correct)
    else:
        return 0

Note below the use of `DummyClassifier` to provide a simple sanity check, a baseline of random predictions. `stratified` means it "generates random predictions by respecting the training set class distribution." (http://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators)

> More generally, when the accuracy of a classifier is too close to random, it probably means that something went wrong: features are not helpful, a hyperparameter is not correctly tuned, the classifier is suffering from class imbalance, etc…

If a classifier can beat the `DummyClassifier`, it is at least learning something valuable! How valuable is another question...
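
As a rough check on what the baseline should score: a stratified random guesser picks each class with a probability equal to that class's share of the training data, so its expected accuracy is the sum of the squared class proportions (assuming the test set has the same distribution). The helper below is an illustrative sketch, not part of scikit-learn.

```python
from collections import Counter

def expected_stratified_accuracy(labels):
    """Expected accuracy of random guessing that follows the label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())

balanced = ["pos"] * 50 + ["neg"] * 50
skewed = ["pos"] * 90 + ["neg"] * 10

print(expected_stratified_accuracy(balanced))  # 0.5
print(expected_stratified_accuracy(skewed))
```

With a 50/50 class split, as in this notebook, the baseline should land near 50%.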

In [20]:
from nltk import NaiveBayesClassifier
from nltk.classify.decisiontree import DecisionTreeClassifier
from nltk.classify.scikitlearn import SklearnClassifier

from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

dummy = SklearnClassifier(DummyClassifier(strategy='stratified', random_state=0))
dummy.train(training_set)
dummy_predict = predict(dummy, testing_set)
dummy_accuracy = accuracy(dummy_predict)
print("Dummy classifier accuracy percent:", dummy_accuracy*100)

nb = NaiveBayesClassifier.train(training_set)
nb_predict = predict(nb, testing_set)
nb_accuracy = accuracy(nb_predict)
print("NaiveBayes classifier accuracy percent:", nb_accuracy*100)

multinomial_nb = SklearnClassifier(MultinomialNB())
multinomial_nb.train(training_set)
mnb_predict = predict(multinomial_nb, testing_set)
mnb_accuracy = accuracy(mnb_predict)
print("MultinomialNB classifier accuracy percent:", mnb_accuracy*100)

bernoulli_nb = SklearnClassifier(BernoulliNB())
bernoulli_nb.train(training_set)
bnb_predict = predict(bernoulli_nb, testing_set)
bnb_accuracy = accuracy(bnb_predict)
print("BernoulliNB classifier accuracy percent:", bnb_accuracy*100)

# ??logistic_regression._clf
#   sklearn.svm.LinearSVC : learns SVM models using the same algorithm.
logistic_regression = SklearnClassifier(LogisticRegression())
logistic_regression.train(training_set)
lr_predict = predict(logistic_regression, testing_set)
lr_accuracy = accuracy(lr_predict)
print("LogisticRegression classifier accuracy percent:", lr_accuracy*100)

# ??sgd._clf
#    The 'log' loss gives logistic regression, a probabilistic classifier.
# ??linear_svc._clf
#   can optimize the same cost function as LinearSVC
#   by adjusting the penalty and loss parameters. In addition it requires
#   less memory, allows incremental (online) learning, and implements
#   various loss functions and regularization regimes.
sgd = SklearnClassifier(SGDClassifier(loss='log'))
sgd.train(training_set)
sgd_predict = predict(sgd, testing_set)
sgd_accuracy = accuracy(sgd_predict)
print("SGD classifier accuracy percent:", sgd_accuracy*100)

# using libsvm with kernel 'rbf' (radial basis function)
svc = SklearnClassifier(SVC(probability=True))
svc.train(training_set)
svc_predict = predict(svc, testing_set)
svc_accuracy = accuracy(svc_predict)
print("SVC classifier accuracy percent:", svc_accuracy*100)

# ??linear_svc._clf
#    Similar to SVC with parameter kernel='linear', but implemented in terms of
#    liblinear rather than libsvm, so it has more flexibility in the choice of
#    penalties and loss functions and should scale better to large numbers of
#    samples.
#    Prefer dual=False when n_samples > n_features.
linear_svc = SklearnClassifier(LinearSVC(dual=False))
linear_svc.train(training_set)
linear_svc_predict = predict(linear_svc, testing_set, False)
linear_svc_accuracy = accuracy(linear_svc_predict, False)
print("LinearSVC classifier accuracy percent:", linear_svc_accuracy*100)

# slow
dt = DecisionTreeClassifier.train(training_set)
dt_predict = predict(dt, testing_set, False)
dt_accuracy = accuracy(dt_predict, False)
print("DecisionTree classifier accuracy percent:", dt_accuracy*100)

Dummy classifier accuracy percent: 49.89106753812636
NaiveBayes classifier accuracy percent: 63.39869281045751
MultinomialNB classifier accuracy percent: 77.34204793028321
BernoulliNB classifier accuracy percent: 72.54901960784314
LogisticRegression classifier accuracy percent: 82.13507625272331
SGD classifier accuracy percent: 80.61002178649237
SVC classifier accuracy percent: 81.48148148148148
LinearSVC classifier accuracy percent: 82.13507625272331
DecisionTree classifier accuracy percent: 76.47058823529412


### SGD: Multiple Epochs

The `sgd` classifier improves with more epochs. `??sgd._clf` tells us that the default number of epochs, `n_iter`, is 5, so let's run more. Also note that `shuffle` is `True` by default, so the training data is reshuffled after each epoch.

In [21]:
num_epochs = 1000
sgd = SklearnClassifier(SGDClassifier(loss='log', n_iter=num_epochs))
sgd.train(training_set)
sgd_predict = predict(sgd, testing_set)
sgd_accuracy = accuracy(sgd_predict)
print(f"SGDClassifier classifier accuracy percent (epochs: {num_epochs}):", sgd_accuracy*100)

SGDClassifier classifier accuracy percent (epochs: 1000): 81.91721132897604


Fortunately, 1000 epochs run very quickly! And `SGDClassifier` accuracy has improved with more iterations, from about 80.6% to 81.9% on this split.

_Also note that we can set `warm_start` to `True` if we want to take advantage of online learning and reuse the solution of the previous call._

### Analysis

We're going to scope analysis down to our top-performing classifiers, which consistently perform with >80% accuracy: `LogisticRegression`, `SVC`, `LinearSVC`, and `SGD`.

We'll also include `Dummy` as a baseline.

**TODO**: Omit `LinearSVC`? Since we have `SVC` performing well, and `LinearSVC` does not provide probability estimates. (However, `LinearSVC` is meant to "scale better to large numbers of samples" - i.e., `LinearSVC` is faster, as I've witnessed)

#### Most Informative Features

In [22]:
# https://stackoverflow.com/a/11140887
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:round(n/2)], coefs_with_fns[:-(round(n/2) + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

In [23]:
print('SGD')
show_most_informative_features(sgd._vectorizer, sgd._clf, 15)
print()
print('Logistic Regression')
show_most_informative_features(logistic_regression._vectorizer, logistic_regression._clf, 15)
print()
print('LinearSVC')
show_most_informative_features(linear_svc._vectorizer, linear_svc._clf, 15)

SGD
	-2.9502	FOLLOWING_POSTAG_WRB		2.3073	CHILD_POSTAG_-RRB-
	-2.6512	CHILD_POSTAG_HYPH		1.8851	PARENT_POSTAG_JJS
	-2.6238	CHILD_DEP_AGENT		1.8317	CHILD_DEP_ADVMOD||XCOMP
	-2.3884	CHILD_POSTAG_''		1.7581	PARENT_DEP_ADVMOD||CONJ
	-2.1527	CHILD_DEP_INTJ 		1.7482	CHILD_DEP_DOBJ||XCOMP
	-1.9526	PARENT_DEP_AMOD		1.7001	VERB           
	-1.7630	CHILD_DEP_POSS 		1.5025	PARENT_POS_PROPN
	-1.6010	PARENT_POSTAG_VBZ		1.5025	PARENT_POSTAG_NNP

Logistic Regression
	-2.2737	CHILD_POSTAG_HYPH		1.8990	CHILD_POSTAG_-RRB-
	-2.1417	CHILD_DEP_AGENT		1.3712	VERB           
	-1.7385	FOLLOWING_POSTAG_WRB		1.2428	PARENT_POS_PROPN
	-1.6321	CHILD_POSTAG_''		1.2428	PARENT_POSTAG_NNP
	-1.4852	CHILD_DEP_NSUBJ		1.1854	CHILD_DEP_NPADVMOD
	-1.4673	PARENT_POSTAG_VBZ		1.1349	CHILD_DEP_DOBJ||XCOMP
	-1.4556	CHILD_DEP_INTJ 		1.1208	PARENT_POSTAG_JJS
	-1.4226	PARENT_DEP_AMOD		1.1148	CHILD_POSTAG_-LRB-

LinearSVC
	-1.6650	FOLLOWING_POSTAG_WRB		1.5831	PARENT_DEP_ADVMOD||CONJ
	-1.5288	CHILD_POSTAG_''		1.4427	CHILD_DEP_ADVMOD|

*Note: Because `SVC` is using the nonlinear RBF kernel, we cannot show the most informative features (`coef_ is only available when using a linear kernel`).*

In [32]:
spacy.explain("JJS")

'adjective, superlative'

**Negative coefficients**:
- VERB parents [`AGENT`](http://universaldependencies.org/docs/sv/dep/nmod-agent.html): "used for agents of passive verbs" - interpreting this to mean that _existence of passive verbs (i.e., the opposite of active verbs) means negative correlation with it being imperative_
- VERB followed by a `WRB`: "wh-adverb" (where, when)
- VERB is a child of [`AMOD`](http://universaldependencies.org/en/dep/amod.html): "any adjective or adjectival phrase that serves to modify the meaning" of the verb


**Positive coefficients**:
- VERB parents a `-RRB-`: "right round bracket"
- VERB is a child of `PROPN`: "proper noun"
- VERB is a child of `NNP`: "noun, proper singular"

#### scikit-learn metrics: Confusion matrix, Classification report, F1 score, Log loss

http://scikit-learn.org/stable/modules/model_evaluation.html

In [26]:
from sklearn import metrics

def classification_report(predict, prob=True):
    predictions, labels = zip(*predict)
    if prob is True:
        return metrics.classification_report(labels, [p.max() for p in predictions])
    else:
        return metrics.classification_report(labels, predictions)

def confusion_matrix(predict, prob=True, print_layout=False):
    predictions, labels = zip(*predict)
    if print_layout is True:
        print('Layout\n[[tn   fp]\n [fn   tp]]\n')
    if prob is True:
        return metrics.confusion_matrix(labels, [p.max() for p in predictions])
    else:
        return metrics.confusion_matrix(labels, predictions)

def log_loss(predict):
    predictions, labels = zip(*predict)
    return metrics.log_loss(labels, [p.prob('pos') for p in predictions])

def roc_auc_score(predict):
    predictions, labels = zip(*predict)
    # need to convert labels to binary classification of 0 or 1
    return metrics.roc_auc_score([1 if l == 'pos' else 0 for l in labels], [p.prob('pos') for p in predictions], average='weighted')

In [27]:
print('SGD')
print(classification_report(sgd_predict))
print()
print('Logistic Regression')
print(classification_report(lr_predict))
print()
print('SVC')
print(classification_report(svc_predict))
print()
print('LinearSVC')
print(classification_report(linear_svc_predict, False))

SGD
             precision    recall  f1-score   support

        neg       0.88      0.74      0.80       229
        pos       0.78      0.90      0.83       230

avg / total       0.83      0.82      0.82       459


Logistic Regression
             precision    recall  f1-score   support

        neg       0.88      0.75      0.81       229
        pos       0.78      0.90      0.83       230

avg / total       0.83      0.82      0.82       459


SVC
             precision    recall  f1-score   support

        neg       0.84      0.78      0.81       229
        pos       0.80      0.85      0.82       230

avg / total       0.82      0.81      0.81       459


LinearSVC
             precision    recall  f1-score   support

        neg       0.88      0.74      0.81       229
        pos       0.78      0.90      0.83       230

avg / total       0.83      0.82      0.82       459



In [28]:
print('Layout\n[[tn   fp]\n [fn   tp]]\n')

print('SGD')
print(confusion_matrix(sgd_predict))
print()
print('Logistic Regression')
print(confusion_matrix(lr_predict))
print()
print('SVC')
print(confusion_matrix(svc_predict))
print()
print('LinearSVC')
print(confusion_matrix(linear_svc_predict, False))

Layout
[[tn   fp]
 [fn   tp]]

SGD
[[169  60]
 [ 23 207]]

Logistic Regression
[[171  58]
 [ 24 206]]

SVC
[[179  50]
 [ 35 195]]

LinearSVC
[[170  59]
 [ 23 207]]
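
The classification report figures can be recovered from the confusion matrices by hand. The sketch below does this for the SGD matrix shown above, treating `pos` as the positive class; the helper name `metrics_from_confusion` is illustrative.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 (positive class) from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return accuracy, precision, recall, f1

# SGD confusion matrix from the output above: [[169 60], [23 207]]
accuracy, precision, recall, f1 = metrics_from_confusion(tn=169, fp=60, fn=23, tp=207)
print(f'accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

These values match the `pos` row of the SGD report (0.78, 0.90, 0.83) and the 81.9% accuracy reported for the 1000-epoch run.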


The lower the better for `log_loss`...

In [29]:
print(f'SGD: {log_loss(sgd_predict)}')
print(f'Logistic Regression: {log_loss(lr_predict)}')
print(f'SVC: {log_loss(svc_predict)}')

SGD: 0.3760332841394919
Logistic Regression: 0.3737483528148235
SVC: 0.4115819481442995
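
For intuition, binary log loss is the average negative log-probability assigned to the true label. A minimal pure-Python sketch follows; the function name and toy probabilities are illustrative, and the notebook itself uses `sklearn.metrics.log_loss`.

```python
import math

def binary_log_loss(labels, pos_probs, eps=1e-15):
    """Mean negative log-likelihood; labels are 'pos'/'neg', probs are P('pos')."""
    total = 0.0
    for label, p in zip(labels, pos_probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p if label == 'pos' else 1 - p)
    return total / len(labels)

labels = ['pos', 'neg', 'pos']
print(binary_log_loss(labels, [0.9, 0.1, 0.8]))  # confident and correct: low loss
print(binary_log_loss(labels, [0.5, 0.5, 0.5]))  # uninformative: log(2), about 0.693
```

Unlike accuracy, this metric penalizes predictions that are confident and wrong much more heavily than ones that are hesitant and wrong.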


The higher the better for `roc_auc_score`...

In [30]:
print(f'SGD: {roc_auc_score(sgd_predict)}')
print(f'Logistic Regression: {roc_auc_score(lr_predict)}')
print(f'SVC: {roc_auc_score(svc_predict)}')

SGD: 0.9117903930131004
Logistic Regression: 0.913613062464401
SVC: 0.8834820580975886
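
ROC AUC can be read as the probability that a randomly chosen positive example receives a higher `pos` score than a randomly chosen negative one, with ties counting half. A small sketch of that pairwise definition, using made-up scores and an illustrative function name:

```python
def pairwise_auc(labels, pos_scores):
    """AUC as the fraction of (pos, neg) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, pos_scores) if l == 'pos']
    neg = [s for l, s in zip(labels, pos_scores) if l == 'neg']
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = ['pos', 'pos', 'neg', 'neg']
print(pairwise_auc(labels, [0.9, 0.4, 0.5, 0.1]))  # 0.75: one of four pairs is misordered
```

This quadratic version is only meant for intuition on small inputs.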


*Note: We cannot compute `log_loss` or `roc_auc_score` for `LinearSVC` because it does not provide probability estimates.*

#### Performance on sample tasks

In [31]:
sample_tasks = ["Mow lawn", "Mow the lawn", "Buy new shoes", "Feed the dog", "Send report to Kyle", "Send the report to Kyle", "Peel the potatoes"]
features = [featurize(nlp(task)) for task in sample_tasks]

tasks_dummy = [(l, p.prob('pos')*1.0) for l, p in zip(dummy.classify_many(features), dummy.prob_classify_many(features))]
tasks_logistic = [(l, p.prob('pos')) for l,p in zip(logistic_regression.classify_many(features), logistic_regression.prob_classify_many(features))]
tasks_svc = [(l, p.prob('pos')) for l,p in zip(svc.classify_many(features), svc.prob_classify_many(features))]
tasks_linear_svc = linear_svc.classify_many(features)
tasks_sgd = [(l, p.prob('pos')) for l,p in zip(sgd.classify_many(features), sgd.prob_classify_many(features))]

print(f'Dummy: {tasks_dummy}')
print(f'LogisticRegression: {tasks_logistic}')
print(f'SVC: {tasks_svc}')
print(f'LinearSVC: {tasks_linear_svc}')
print(f'SGD: {tasks_sgd}')

Dummy: [('pos', 1.0), ('pos', 1.0), ('pos', 1.0), ('pos', 1.0), ('neg', 0.0), ('pos', 1.0), ('neg', 0.0)]
LogisticRegression: [('pos', 0.57298972690232164), ('pos', 0.74914827929209371), ('pos', 0.92555644935813564), ('pos', 0.74914827929209371), ('pos', 0.84718903707614501), ('pos', 0.74914827929209371), ('pos', 0.76142796074102892)]
SVC: [('pos', 0.78384043695156047), ('pos', 0.71240000777557533), ('pos', 0.82474873945630034), ('pos', 0.71240000777557533), ('pos', 0.81690641392969443), ('pos', 0.71240000777557533), ('pos', 0.75085791752042796)]
LinearSVC: ['pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos']
SGD: [('pos', 0.57030488957533609), ('pos', 0.77095223826448478), ('pos', 0.94141783856061578), ('pos', 0.77095223826448478), ('pos', 0.87513046688642004), ('pos', 0.77095223826448478), ('pos', 0.77650804938617179)]


_Note: `LinearSVC` is not implemented to provide probability estimates (https://github.com/scikit-learn/scikit-learn/issues/4820)_

**TODO** tabular format would be cool above
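
One possible way to address this TODO with plain string formatting is sketched below. The `results` dict holds a few rounded values copied from the output above for illustration; in the notebook it would be built from `tasks_logistic`, `tasks_svc`, and `tasks_sgd`. The function name `format_table` is an assumption.

```python
def format_table(tasks, results):
    """Render P('pos') per task (rows) and classifier (columns) as aligned text."""
    names = list(results)
    width = max(len(t) for t in tasks)
    header = f"{'task':<{width}}  " + "  ".join(f"{n:>8}" for n in names)
    rows = [
        f"{task:<{width}}  " + "  ".join(f"{results[n][i]:>8.3f}" for n in names)
        for i, task in enumerate(tasks)
    ]
    return "\n".join([header] + rows)

tasks = ["Mow lawn", "Mow the lawn", "Buy new shoes"]
results = {  # rounded from the probabilities printed above
    "LogReg": [0.573, 0.749, 0.926],
    "SVC": [0.784, 0.712, 0.825],
    "SGD": [0.570, 0.771, 0.941],
}
print(format_table(tasks, results))
```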

**Next up**: digging into the results (confusion matrix, most informative features), comparing results to LUIS model, cross validation/grid search

## Next Steps and Improvements

1. Training set may be too specific/not relevant enough (recipe instructions for positive dataset, recipe descriptions+short movie reviews for negative dataset)
2. Throwing features into a blender - need to understand value of each
3. Need to review different classifiers, strengths/weaknesses
4. Phrase vectorizations of all 0s
5. Varying feature vector lengths
6. Voting
7. Reducing dimensionality using most informative feature information
8. Combining verb phrases

# Things abandoned

## NLTK

I needed a library that supports dependency parsing, which NLTK does not... so I thought I'd add the [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) toolkit and [its associated software](https://nlp.stanford.edu/software/) to NLTK. However, there are many conflicting instructions for installing the Java-based project, depending on NLTK version used. By the time I figured this out, the installation had become a time sink. So I abandoned this effort in favor of Spacy.io.

I might return this way if I want to improve results/implement a voter system between the various linguistic and classification methods later.

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

### Tokenization

In [None]:
sentences = [s for text, _label in lines for s in sent_tokenize(text)] # punkt; `lines` holds (text, label) tuples
sentences

In [None]:
tagged_sentences = []
for s in sentences:
    words = word_tokenize(s)
    tagged = nltk.pos_tag(words) # averaged_perceptron_tagger
    tagged_sentences.append(tagged)
print(tagged_sentences)

#### Note: POS accuracy

`Run down to the shop, will you, Peter` is tagged unexpectedly by `nltk.pos_tag`:
> `[('Run', 'NNP'), ('down', 'RB'), ('to', 'TO'), ('the', 'DT'), ('shop', 'NN'), (',', ','), ('will', 'MD'), ('you', 'PRP'), (',', ','), ('Peter', 'NNP')]`

`Run` is tagged as a `NNP (proper noun, singular)`

I expected an output more like what the [Stanford Parser](http://nlp.stanford.edu:8080/parser/) provides:
> `Run/VBG down/RP to/TO the/DT shop/NN ,/, will/MD you/PRP ,/, Peter/NNP`

`Run` is tagged as a `VBG (verb, gerund/present participle)` - still not quite the `VB` I want, but at least it's a `V*`

_MEANWHILE..._

`nltk.pos_tag` did better with:
> `[('Do', 'VB'), ('not', 'RB'), ('clean', 'VB'), ('soot', 'NN'), ('off', 'IN'), ('the', 'DT'), ('window', 'NN')]`

Compared to [Stanford CoreNLP](http://nlp.stanford.edu:8080/corenlp/process) (note that this is different than what [Stanford Parser](http://nlp.stanford.edu:8080/parser/) outputs):
> `(ROOT (S (VP (VB Do) (NP (RB not) (JJ clean) (NN soot)) (PP (IN off) (NP (DT the) (NN window))))))`

Concern: _clean_ as `VB (verb, base form)` vs `JJ (adjective)` 

**IMPROVE** POS taggers should vote: nltk.pos_tag (averaged_perceptron_tagger), Stanford Parser, CoreNLP, etc.

Note what the Spacy POS tagger did with `Run down to the shop, will you, Peter`:

`Run/VB down/RP to/IN the shop/NN ,/, will/MD you/PRP ,/, Peter/NNP`

Here `Run` is the `VB` I expected from POS tagging (compared to the `nltk.pos_tag` result of `NNP`). Also note that Spacy collapses `the shop` into a single unit, which should be helpful during featurization.
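
A minimal sketch of the voting idea, using the three tags reported above for `Run`. Voting on the exact tag gives no majority here, so this version falls back to a coarse class (verb vs. not); the function names and the coarse mapping are assumptions for illustration.

```python
from collections import Counter

def coarse(tag):
    """Collapse Penn Treebank tags into a coarse class (illustrative mapping)."""
    return 'VERB' if tag.startswith('VB') else tag

def vote(tags):
    """Return the most common coarse class and its vote count."""
    return Counter(coarse(t) for t in tags).most_common(1)[0]

# Tags for "Run" from nltk.pos_tag, Stanford Parser, and spaCy, as quoted above.
run_tags = ['NNP', 'VBG', 'VB']
print(vote(run_tags))  # ('VERB', 2)
```

A real voter would also need a tie-breaking rule, for example preferring the tagger that has been most reliable on imperative phrases.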

### Featurization

In [None]:
import re
from collections import defaultdict

featuresets = []
for ts in tagged_sentences:
    s_features = defaultdict(int)
    # make sure the VERB key exists even when no verb is tagged
    s_features[FeatureName.VERB.name] = 0
    for idx, tup in enumerate(ts):
        #print(tup)
        pos = tup[1]
        # FeatureName.VERB
        is_verb = re.match(r'VB.?', pos) is not None
        print(tup, is_verb)
        if is_verb:
            s_features[FeatureName.VERB.name] += 1
            # FOLLOWING_POSTAG (nltk.pos_tag returns fine-grained Penn Treebank tags)
            next_idx = idx + 1
            if next_idx < len(ts):
                s_features[f'{FeatureName.FOLLOWING_POSTAG.name}_{ts[next_idx][1]}'] += 1
            # VERB_MODIFIER
            # VERB_MODIFYING
    featuresets.append(dict(s_features))

print()
print(featuresets)

### [Stanford NLP](https://nlp.stanford.edu/software/)
Setup guide used: https://stackoverflow.com/a/34112695

In [None]:
# Get dependency parser, NER, POS tagger
!wget https://nlp.stanford.edu/software/stanford-parser-full-2017-06-09.zip
!wget https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip
!wget https://nlp.stanford.edu/software/stanford-postagger-full-2017-06-09.zip
!unzip stanford-parser-full-2017-06-09.zip
!unzip stanford-ner-2017-06-09.zip
!unzip stanford-postagger-full-2017-06-09.zip

In [None]:
from nltk.parse.stanford import StanfordParser
from nltk.parse.stanford import StanfordDependencyParser
from nltk.parse.stanford import StanfordNeuralDependencyParser
from nltk.tag.stanford import StanfordPOSTagger, StanfordNERTagger
from nltk.tokenize.stanford import StanfordTokenizer