Notes from Lukas Biewald's [Crowdflower Machine Learning class](https://github.com/lukas/ml-class)

### Feature Extraction

In [None]:
#feature-extraction-1.py
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('tweets.csv')
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']
count_vect = CountVectorizer()
count_vect.fit(text)

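Calling `fit` raises `ValueError: np.nan is an invalid document, expected byte or unicode string` because `text` contains missing values (`NaN`), which are not strings. One workaround is to pass the text as a string in iterable form: `text = [str(df['tweet_text'])]`, but this collapses the whole column into a single document, so it is not useful for per-tweet classification. The accepted solution is to retain only the non-null values, as follows: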


In [None]:
#feature-extraction-2.py
fixed_target = target[pd.notnull(text)]
fixed_text = text[pd.notnull(text)]
count_vect.fit(fixed_text)
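
To make it clearer what `fit` and `transform` produce, here is a minimal sketch on a made-up corpus. The three sentences and the `toy_` names are assumptions for illustration and are not taken from `tweets.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not part of the tweets dataset
toy_docs = ["I love my iphone", "I love my android", "iphone or android"]

toy_vect = CountVectorizer()
toy_vect.fit(toy_docs)

# By default, tokens are lowercased and single-character tokens such as "I" are dropped
toy_vocab = sorted(toy_vect.vocabulary_)
print(toy_vocab)

# transform returns a sparse matrix: one row per document, one column per token
toy_counts = toy_vect.transform(toy_docs)
print(toy_counts.shape)
print(toy_counts.toarray())
```

Each column corresponds to one vocabulary token in alphabetical order, and each cell holds how many times that token occurs in the document.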

### Build your first classifier

[Naive Bayes](https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/) is a classification technique that assumes independence among predictors. For this text classification problem, we will use Multinomial NB and pass in discrete counts.

In [None]:
#classifier.py
from sklearn.naive_bayes import MultinomialNB

# Use the null-filtered text and labels so that rows and labels stay aligned
counts = count_vect.transform(fixed_text)

nb = MultinomialNB()
nb.fit(counts, fixed_target)
nb.predict(count_vect.transform(['i love my iphone']))
#=> ['Positive emotion']
nb.predict(count_vect.transform(['android or iphone?']))
#=> ['No emotion toward brand or product']
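
As a rough illustration of how Multinomial NB turns word counts into class probabilities, the sketch below trains on four made-up sentences. The corpus, the `pos`/`neg` labels, and the `toy_` names are hypothetical and chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, not taken from tweets.csv
toy_docs = ["love my iphone", "love this phone", "hate my iphone", "hate this phone"]
toy_labels = ["pos", "pos", "neg", "neg"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

toy_nb = MultinomialNB()
toy_nb.fit(toy_counts, toy_labels)

# predict_proba returns one column per class, ordered as in classes_
toy_query = toy_vect.transform(["love my phone"])
toy_probs = toy_nb.predict_proba(toy_query)
toy_prediction = toy_nb.predict(toy_query)
print(toy_nb.classes_)
print(toy_probs)
print(toy_prediction)
```

Only the word "love" differs in frequency between the two classes here, so it alone decides which class gets the higher probability.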

### Build another classifier

[Support Vector Machine](http://scikit-learn.org/stable/modules/svm.html) is an efficient classifier that works well for small, clean datasets.

In [None]:
#classifier-svm.py 
from sklearn.svm import SVC
clf = SVC()
clf.fit(counts, fixed_target)
clf.predict(count_vect.transform(['i do not love my iphone']))
#=> ['No emotion toward brand or product']

This prediction seems wrong. The default SVC settings in the scikit-learn version used here are:
```
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
```

Changing the classifier to `clf = SVC(class_weight='balanced')` results in a prediction of `["I can't tell"]` for the example above. This classifier also does not perform well on other inputs tried with this dataset.
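
For context, `class_weight='balanced'` weights each class inversely to its frequency, using `n_samples / (n_classes * count)`. The sketch below computes those weights for a made-up label distribution; the labels and `toy_` names are assumptions, not the real class counts from the tweets.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical, deliberately imbalanced labels (10 samples, 3 classes)
toy_labels = np.array(["none"] * 6 + ["pos"] * 3 + ["neg"] * 1)
toy_classes = np.unique(toy_labels)

# Same weighting rule that SVC(class_weight='balanced') applies internally
toy_weights = compute_class_weight("balanced", classes=toy_classes, y=toy_labels)

for label, weight in zip(toy_classes, toy_weights):
    print(label, round(float(weight), 3))
```

Rare classes receive larger weights, so misclassifying them costs more during training.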

### Evaluating classifier performance

Call `fit` on the training data (the first 6000 tweets), then `predict` on the test data (the remaining 3092 tweets):

In [None]:
#test-algorithm-2.py
# Slice the null-filtered labels so they line up with the rows of counts
nb.fit(counts[0:6000], fixed_target[0:6000])

predictions = nb.predict(counts[6000:9092])
correct_predictions = sum(predictions == fixed_target[6000:9092])
print('Percent correct: ', 100.0 * correct_predictions / 3092)
#=> Percent correct:  66.3971539457
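
The notes split the data by position. As an alternative sketch (not the approach used in the class), scikit-learn's `train_test_split` shuffles before splitting and `accuracy_score` replaces the manual percentage. The random count matrix and labels below are synthetic stand-ins for `counts` and `fixed_target`.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in data: 100 "documents" with 8 token counts each
rng = np.random.default_rng(0)
toy_X = rng.integers(0, 5, size=(100, 8))
toy_y = np.array(["pos", "neg"] * 50)

# Hold out roughly a third of the rows, shuffled with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.34, random_state=0
)

toy_nb = MultinomialNB().fit(X_train, y_train)
toy_pred = toy_nb.predict(X_test)

# The manual percentage and accuracy_score agree
manual_pct = 100.0 * np.sum(toy_pred == y_test) / len(y_test)
helper_pct = 100.0 * accuracy_score(y_test, toy_pred)
print(manual_pct, helper_pct)
```

Because the features here are random, the accuracy itself is meaningless; the point is only the mechanics of splitting and scoring.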

Construct a confusion matrix, ignoring "I can't tell" for simplicity:

In [None]:
#test-algorithm-3.py
from sklearn.metrics import confusion_matrix

label_list = ['Positive emotion', 'No emotion toward brand or product', 'Negative emotion']
confusion_matrix(fixed_target[6000:9092], predictions, labels=label_list)
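
To help read the resulting matrix, here is a small sketch with made-up labels. Rows are the true classes and columns are the predicted classes, both in the order given by `labels`. The shortened label names and the `toy_` variables are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six examples
toy_label_list = ["pos", "none", "neg"]
toy_true = ["pos", "pos", "none", "none", "none", "neg"]
toy_pred = ["pos", "none", "none", "none", "pos", "none"]

toy_cm = confusion_matrix(toy_true, toy_pred, labels=toy_label_list)
print(toy_cm)
# [[1 1 0]
#  [1 2 0]
#  [0 1 0]]
```

The diagonal holds the correct predictions (3 of 6 here), and the empty last column shows that this toy model never predicts `neg`.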

Evaluate model performance using [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html):

In [None]:
#test-algorithm-cross-validation.py
from sklearn.model_selection import cross_val_score
nb = MultinomialNB()

scores = cross_val_score(nb, counts, fixed_target, cv=10)
scores.mean()
#=> 0.648153102333
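
To show what `cross_val_score` does under the hood, the sketch below runs the folds by hand and compares the result. For a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling, which is what the manual loop reproduces. The data is synthetic and the `toy_` names are assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in data: 60 "documents", 8 token counts, 3 classes
rng = np.random.default_rng(0)
toy_X = rng.integers(0, 5, size=(60, 8))
toy_y = np.array(["pos", "neg", "none"] * 20)

toy_nb = MultinomialNB()

# Manual loop: train on four folds, score on the held-out fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(toy_X, toy_y):
    model = clone(toy_nb).fit(toy_X[train_idx], toy_y[train_idx])
    manual_scores.append(model.score(toy_X[test_idx], toy_y[test_idx]))

# The helper produces the same per-fold accuracies
auto_scores = cross_val_score(toy_nb, toy_X, toy_y, cv=5)
print(np.mean(manual_scores), auto_scores.mean())
```

Every row is used for testing exactly once, which makes the averaged score less dependent on one particular split than a single train/test cut.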

Compare this with a baseline model: a [DummyClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html#sklearn.dummy.DummyClassifier) that always predicts the most frequent class (`strategy='most_frequent'`). The cross-validation score of the NB model (about 0.65) is only slightly better than that of the baseline (about 0.59).

In [None]:
#test-algorithm-cross-validation-dummy.py
from sklearn.dummy import DummyClassifier
dc = DummyClassifier(strategy='most_frequent')

scores = cross_val_score(dc, counts, fixed_target, cv=10)
scores.mean()
#=> 0.592609330138
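
The baseline score is simply the share of the most common class, which the following sketch confirms on made-up labels. The label distribution and `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 6 of 10 belong to the majority class
toy_y = np.array(["none"] * 6 + ["pos"] * 3 + ["neg"] * 1)
# Features are ignored by the dummy model, so zeros are enough
toy_X = np.zeros((len(toy_y), 1))

toy_dc = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)
toy_baseline = toy_dc.score(toy_X, toy_y)

print(toy_dc.predict(toy_X[:3]))
print(toy_baseline)
# 0.6
```

Any real model should be judged against this number rather than against zero, since an imbalanced dataset makes the majority-class guess look deceptively accurate.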

### Evaluating other algorithms and hyperparameters