# _Playground: NLP_

### Bag of Words representation

- text analysis = major application field for ML
- raw data (sequence of symbols) cannot be fed directly to algorithms
    - most expect numerical feature vectors with a fixed size
    - **not** raw text documents with variable length
- `scikit-learn` provides ways to extract numerical features from text content, namely:
    - **tokenizing** strings, giving integer ids for each possible token, for instance by using white-space and punctuation as token separators
    - **counting** the occurrences of tokens in each document
    - **normalizing** and down-weighting tokens that occur in the majority of samples/documents
- corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus
- **vectorization** = general process of turning a collection of text documents into numerical feature vectors
- **Bag of Words** representation = tokenization, counting and normalization
    - documents are described by word occurrences while completely ignoring the relative position information of the words in the document
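
The three steps above (tokenizing, assigning integer ids, counting) can be sketched in plain Python. This is a minimal illustration on two made-up sentences, not how `scikit-learn` implements it; the `tokenize` helper and its regular expression are assumptions chosen for the example.

```python
import re
from collections import Counter

# Hypothetical two-document corpus
docs = [
    "A winning animated feature.",
    "The year's most inventive comedy, a winning comedy.",
]

def tokenize(text):
    # Lowercase, then keep runs of letters and apostrophes as tokens
    return re.findall(r"[a-z']+", text.lower())

# Tokenizing: give every distinct token an integer id (its column index)
vocabulary = {}
for doc in docs:
    for token in tokenize(doc):
        vocabulary.setdefault(token, len(vocabulary))

# Counting: one row per document, one column per token
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    row = [0] * len(vocabulary)
    for token, n in counts.items():
        row[vocabulary[token]] = n
    matrix.append(row)

print(vocabulary)
print(matrix)
```

Word order is lost: each row only records how often each token occurs, which is exactly the "bag" in Bag of Words.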

### _Sparsity_

- Most documents use only a very small subset of the words in the corpus
    - resulting matrix will have many feature values that are zeros (typically > 99%)
- Example: a collection of 10,000 short text documents (emails) uses a vocabulary on the order of 100,000 unique words in total
    - each individual document uses only 100 to 1,000 unique words
- Use a sparse representation to be able to store the matrix in memory
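
To make the idea concrete, here is a small sketch that stores a toy count matrix in SciPy's CSR format and measures its sparsity. The matrix values are invented for illustration; a real document-term matrix would be far larger and far sparser.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy count matrix: 4 documents, 6 vocabulary terms
dense = np.array([
    [1, 0, 0, 2, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 3, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
])

# CSR stores only the non-zero entries plus their positions
X_sparse = sparse.csr_matrix(dense)

n_cells = X_sparse.shape[0] * X_sparse.shape[1]
sparsity = 1.0 - X_sparse.nnz / n_cells
print(f"stored values: {X_sparse.nnz} of {n_cells}, sparsity = {sparsity:.2%}")
```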

In [135]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from six.moves import range

# Setup Pandas
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

# Setup Seaborn
sns.set_style("whitegrid")
sns.set_context("poster")

In [136]:
critics = pd.read_csv('./critics.csv')
#let's drop rows with missing quotes
critics = critics[~critics.quote.isnull()]
critics.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
1,Derek Adams,fresh,114709,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559,Toy story
2,Richard Corliss,fresh,114709,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559,Toy story
3,David Ansen,fresh,114709,Newsweek,A winning animated feature that has something ...,2008-08-18,9559,Toy story
4,Leonard Klady,fresh,114709,Variety,The film sports a provocative and appealing st...,2008-06-09,9559,Toy story
5,Jonathan Rosenbaum,fresh,114709,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559,Toy story


In [137]:
df = critics.copy()

# fresh = True, rotten = False
df['fresh'] = df.fresh == 'fresh'

In [138]:
df.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
1,Derek Adams,True,114709,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559,Toy story
2,Richard Corliss,True,114709,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559,Toy story
3,David Ansen,True,114709,Newsweek,A winning animated feature that has something ...,2008-08-18,9559,Toy story
4,Leonard Klady,True,114709,Variety,The film sports a provocative and appealing st...,2008-06-09,9559,Toy story
5,Jonathan Rosenbaum,True,114709,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559,Toy story


### [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)

- converts a collection of text documents to a matrix of token counts
- produces a sparse representation of the counts using `scipy.sparse.csr_matrix`
- number of features equals the vocabulary size found by analyzing the data, provided no a-priori dictionary is supplied and the analyzer does not perform feature selection
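
Before applying it to the reviews, it can help to see what `CountVectorizer` does on a tiny input. The sketch below uses two invented sentences and separate variable names (`toy_docs`, `toy_vectorizer`, `toy_counts`) so it does not interfere with the notebook's own variables.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus
toy_docs = [
    "the film is a winning comedy",
    "the film is not a comedy",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Learned vocabulary; column indices follow alphabetical order of the terms
print(sorted(toy_vectorizer.vocabulary_))
# The result is a SciPy sparse matrix with one row per document
print(sparse.issparse(toy_counts), toy_counts.shape)
print(toy_counts.toarray())
```

With the default `token_pattern`, tokens of a single character such as "a" are dropped, so the vocabulary here has six entries rather than seven.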

In [139]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# vectorize quotes column
X = vectorizer.fit_transform(df['quote'])
# print(X.shape)

# create target column
y = df['fresh'].values.astype(int)

### Term Frequency times Inverse Document Frequency (TF-IDF) & [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)

- creates word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across all documents
- `TfidfVectorizer` will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow encoding of new documents
    - if you already have a fitted `CountVectorizer`, you can use `TfidfTransformer` to calculate inverse document frequencies and start encoding documents
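
The two routes described above should give the same matrix. The following sketch checks this on three invented sentences, with default settings for all three classes; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical three-document corpus; "film" appears in every document
toy_docs = [
    "the film is a winning comedy",
    "the film is dull",
    "an inventive film",
]

# Route 1: counts first, then TF-IDF weighting
toy_count_vect = CountVectorizer()
toy_counts = toy_count_vect.fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(toy_counts)

# Route 2: both steps in a single estimator
toy_tfidf_vect = TfidfVectorizer()
one_step = toy_tfidf_vect.fit_transform(toy_docs)

same_result = np.allclose(two_step.toarray(), one_step.toarray())
print("identical output:", same_result)

# A term found in every document gets a lower IDF than a rare term
film_idf = toy_tfidf_vect.idf_[toy_tfidf_vect.vocabulary_["film"]]
dull_idf = toy_tfidf_vect.idf_[toy_tfidf_vect.vocabulary_["dull"]]
print("idf(film) =", film_idf, " idf(dull) =", dull_idf)
```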

In [140]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X = tfidf_transformer.fit_transform(X)
X.shape

(15561, 22417)

# _Training a classifier_

We now have our TF-IDF features in `X`. Next we train a classifier to predict whether a review is fresh or rotten. Since we're focusing on _naive Bayes_, we start with a [naive Bayes classifier](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes).

More specifically, we'll be using a [multinomial naive Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB), which is suitable for classification with discrete features (e.g. word counts for text classification); according to the scikit-learn documentation, fractional counts such as TF-IDF may also work in practice. The basis for naive Bayes is the assumption of conditional independence between every pair of features given the value of the class variable.
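
Written out, the independence assumption leads to the following decision rule. The notation is introduced here for illustration: $y$ is the class, $x_1, \dots, x_n$ are the feature values, $N_{yi}$ is the total count of feature $i$ in training documents of class $y$, $N_y = \sum_i N_{yi}$, and $\alpha$ is the smoothing parameter (`alpha=1.0` by default in `MultinomialNB`).

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

$$
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}
$$

Here $\hat{\theta}_{yi}$ is the smoothed estimate of the probability of feature $i$ appearing in a sample of class $y$; the smoothing keeps terms unseen in a class from forcing the whole product to zero.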

In [141]:
# split the data set into training and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

In [142]:
from sklearn.naive_bayes import MultinomialNB

# create classifier object
clf = MultinomialNB()

# fit classifier to data
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [143]:
# test classifier on test set
y_pred = clf.predict(X_test)

# test classifier on training set
y_pred_train = clf.predict(X_train)

In [144]:
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

# print the accuracy scores for both the training and test sets
print('The accuracy score for the training set: ', train_accuracy)
print('The accuracy score for the test set: ', test_accuracy)

The accuracy score for the training set:  0.814738646101114
The accuracy score for the test set:  0.7044461578000514


# _Build a Pipeline_

To streamline the process, we can use the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) class to chain the vectorizer, the TF-IDF transformer, and the classifier into a single estimator.
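
As a small warm-up, the sketch below builds the same three-step pipeline on four invented sentences. The corpus, labels, and `toy_` names are assumptions made for illustration. It shows that the pipeline accepts raw strings and that each fitted step stays accessible through `named_steps`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training data: 1 = positive review, 0 = negative review
toy_train = [
    "a winning comedy",
    "inventive and fun",
    "dull and boring",
    "a boring mess",
]
toy_labels = [1, 1, 0, 0]

toy_pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# fit() runs fit_transform on each transformer, then fits the classifier
toy_pipe.fit(toy_train, toy_labels)

# The vocabulary was learned from the training strings only
toy_vocab = toy_pipe.named_steps["vect"].vocabulary_
print(sorted(toy_vocab))

# predict() applies the already-fitted transformers to new raw text
toy_pred = toy_pipe.predict(["a fun comedy", "boring"])
print(toy_pred)
```

Because the vectorizer is fitted inside the pipeline, splitting the raw text first and then calling `fit` on the training part keeps the test documents out of the learned vocabulary and IDF weights.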

In [145]:
X = df.quote
y = df.fresh.values.astype(int)

In [146]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

In [147]:
# multinomial Naive Bayes
from sklearn.pipeline import Pipeline
from sklearn import metrics

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])

# fit classifier
text_clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

In [148]:
# test classifier on test set
y_pred = text_clf.predict(X_test)

# test classifier on training set
y_pred_train = text_clf.predict(X_train)

In [149]:
# get accuracy for training and test set
train_accuracy = text_clf.score(X_train, y_train)
test_accuracy = text_clf.score(X_test, y_test)

# print the accuracy scores for both the training and test sets
print('The accuracy score for the training set: ', train_accuracy)
print('The accuracy score for the test set: ', test_accuracy)

The accuracy score for the training set:  0.824764353041988
The accuracy score for the test set:  0.7026471344127474


In [150]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.27      0.41      1515
           1       0.68      0.98      0.80      2376

    accuracy                           0.70      3891
   macro avg       0.78      0.62      0.61      3891
weighted avg       0.76      0.70      0.65      3891



### _Support Vector Machines_

- Main advantage: effective in high-dimensional spaces
- Main disadvantages: the choice of kernel function and regularization term can be crucial, and SVMs do not directly provide probability estimates (these are calculated using an expensive five-fold cross-validation)
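
To illustrate the two points about kernels and probabilities, here is a sketch on four invented sentences. Choosing `kernel="linear"` is an assumption made for this example (the notebook itself uses `SVC()` with default settings), and whether it would change the scores reported here is untested. Instead of probabilities, the fitted model exposes signed distances to the separating hyperplane through `decision_function`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical training data: 1 = positive review, 0 = negative review
toy_train = [
    "a winning comedy",
    "inventive and fun",
    "dull and boring",
    "a boring mess",
]
toy_labels = [1, 1, 0, 0]

svm_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # kernel and C are the choices the text calls crucial; values here are illustrative
    ("clf", SVC(kernel="linear", C=1.0)),
])
svm_pipe.fit(toy_train, toy_labels)

toy_new = ["a fun comedy", "boring"]
# Signed distance to the decision boundary; positive means class 1
svm_scores = svm_pipe.decision_function(toy_new)
svm_pred = svm_pipe.predict(toy_new)
print(svm_scores, svm_pred)
```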

In [151]:
from sklearn.svm import SVC

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SVC())
])

# fit classifier
text_clf.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                     decision_function_shape='ovr', degree=3,
                 

In [152]:
# test classifier on test set
y_pred = text_clf.predict(X_test)

# test classifier on training set
y_pred_train = text_clf.predict(X_train)

In [153]:
# get accuracy for training and test set
train_accuracy = text_clf.score(X_train, y_train)
test_accuracy = text_clf.score(X_test, y_test)

# print the accuracy scores for both the training and test sets
print('The accuracy score for the training set: ', train_accuracy)
print('The accuracy score for the test set: ', test_accuracy)

The accuracy score for the training set:  0.6089117395029991
The accuracy score for the test set:  0.6106399383191982


In [154]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1515
           1       0.61      1.00      0.76      2376

    accuracy                           0.61      3891
   macro avg       0.31      0.50      0.38      3891
weighted avg       0.37      0.61      0.46      3891





### _Linear Classifiers with SGD training_

- Implements regularized linear models with stochastic gradient descent.
- The gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing learning rate
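
The one-sample-at-a-time update can be sketched in plain NumPy. This is a simplified illustration, not `SGDClassifier`'s actual implementation: the toy data, the hinge loss with an L2 penalty, and the `0.1 / (1 + 0.01 * t)` learning-rate schedule are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two well-separated 2-D clusters labelled -1 and +1
X_toy = np.vstack([
    rng.normal(-2.0, 0.3, size=(50, 2)),
    rng.normal(+2.0, 0.3, size=(50, 2)),
])
y_toy = np.array([-1] * 50 + [1] * 50)

w = np.zeros(2)   # weight vector of the linear model
b = 0.0           # intercept
alpha = 1e-4      # L2 regularization strength
t = 0             # number of updates performed so far

for epoch in range(5):
    # Visit the samples in a fresh random order each epoch
    for i in rng.permutation(len(X_toy)):
        t += 1
        eta = 0.1 / (1.0 + 0.01 * t)          # decreasing learning rate
        margin = y_toy[i] * (X_toy[i] @ w + b)
        grad_w = alpha * w                     # gradient of the L2 penalty
        grad_b = 0.0
        if margin < 1:                         # hinge loss is active for this sample
            grad_w = grad_w - y_toy[i] * X_toy[i]
            grad_b = -y_toy[i]
        w -= eta * grad_w
        b -= eta * grad_b

sgd_accuracy = np.mean(np.sign(X_toy @ w + b) == y_toy)
print("weights:", w, "intercept:", b, "training accuracy:", sgd_accuracy)
```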

In [155]:
from sklearn.linear_model import SGDClassifier

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier())
])

# fit classifier
text_clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=Non...
                 SGDClassifier(alpha=0.0001, average=False, class_weight=None,
                               early_stopping=False, epsilon=0.1, eta0=0.0,
                               fit_intercept=True, l1_ratio=0.15,
                               learning_rate='optimal', loss='hinge',
                               max_iter=1000,

In [156]:
# test classifier on test set
y_pred = text_clf.predict(X_test)

# test classifier on training set
y_pred_train = text_clf.predict(X_train)

In [157]:
# get accuracy for training and test set
train_accuracy = text_clf.score(X_train, y_train)
test_accuracy = text_clf.score(X_test, y_test)

# print the accuracy scores for both the training and test sets
print('The accuracy score for the training set: ', train_accuracy)
print('The accuracy score for the test set: ', test_accuracy)

The accuracy score for the training set:  0.9177377892030848
The accuracy score for the test set:  0.7761500899511694


In [158]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.64      0.69      1515
           1       0.79      0.86      0.82      2376

    accuracy                           0.78      3891
   macro avg       0.77      0.75      0.76      3891
weighted avg       0.77      0.78      0.77      3891



### _Logistic Regression_

In [159]:
from sklearn.linear_model import LogisticRegression

In [160]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])

In [161]:
# fit classifier
text_clf.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
  

In [162]:
# test classifier on test set
y_pred = text_clf.predict(X_test)

# test classifier on training set
y_pred_train = text_clf.predict(X_train)

In [163]:
# get accuracy for training and test set
train_accuracy = text_clf.score(X_train, y_train)
test_accuracy = text_clf.score(X_test, y_test)

# print the accuracy scores for both the training and test sets
print('The accuracy score for the training set: ', train_accuracy)
print('The accuracy score for the test set: ', test_accuracy)

The accuracy score for the training set:  0.8707797772065125
The accuracy score for the test set:  0.7676689796967361


In [164]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.57      0.65      1515
           1       0.76      0.90      0.82      2376

    accuracy                           0.77      3891
   macro avg       0.77      0.73      0.74      3891
weighted avg       0.77      0.77      0.76      3891



### _Decision Tree_

In [165]:
from sklearn.tree import DecisionTreeClassifier

In [166]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', DecisionTreeClassifier())
])

In [167]:
# fit classifier
text_clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=Non...
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 DecisionTreeClassifier(class_weight=None, criterion='gini',
                                        max_depth=None, max_features=None,
                            

In [168]:
# test classifier on test set
y_pred = text_clf.predict(X_test)

# test classifier on training set
y_pred_train = text_clf.predict(X_train)

In [169]:
# get accuracy for training and test set
train_accuracy = text_clf.score(X_train, y_train)
test_accuracy = text_clf.score(X_test, y_test)

# print the accuracy scores for both the training and test sets
print('The accuracy score for the training set: ', train_accuracy)
print('The accuracy score for the test set: ', test_accuracy)

The accuracy score for the training set:  0.9999143101970865
The accuracy score for the test set:  0.6126959650475456


### _Random Forest_

In [170]:
from sklearn.ensemble import RandomForestClassifier

In [171]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier())
])

In [172]:
# fit classifier
text_clf.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=Non...
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                                        min_impurity_decrease=0.0

In [173]:
# test classifier on test set
y_pred = text_clf.predict(X_test)

# test classifier on training set
y_pred_train = text_clf.predict(X_train)

In [174]:
# get accuracy for training and test set
train_accuracy = text_clf.score(X_train, y_train)
test_accuracy = text_clf.score(X_test, y_test)

# print the accuracy scores for both the training and test sets
print('The accuracy score for the training set: ', train_accuracy)
print('The accuracy score for the test set: ', test_accuracy)

The accuracy score for the training set:  0.9916023993144816
The accuracy score for the test set:  0.6861989205859677


In [175]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.63      0.48      0.54      1515
           1       0.71      0.82      0.76      2376

    accuracy                           0.69      3891
   macro avg       0.67      0.65      0.65      3891
weighted avg       0.68      0.69      0.68      3891



### _Gradient Boosting_

In [176]:
from sklearn.ensemble import GradientBoostingClassifier

In [177]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', GradientBoostingClassifier())
])

In [178]:
# fit classifier
text_clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=Non...
                                            learning_rate=0.1, loss='deviance',
                                            max_depth=3, max_features=None,
                                            max_leaf_nodes=None,
                                            min_impurity_decrease=0.0,
                                            

In [179]:
# test classifier on test set
y_pred = text_clf.predict(X_test)

# test classifier on training set
y_pred_train = text_clf.predict(X_train)

In [180]:
# get accuracy for training and test set
train_accuracy = text_clf.score(X_train, y_train)
test_accuracy = text_clf.score(X_test, y_test)

# print the accuracy scores for both the training and test sets
print('The accuracy score for the training set: ', train_accuracy)
print('The accuracy score for the test set: ', test_accuracy)

The accuracy score for the training set:  0.6970008568980292
The accuracy score for the test set:  0.6635826265741455


In [181]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.20      0.32      1515
           1       0.65      0.96      0.78      2376

    accuracy                           0.66      3891
   macro avg       0.70      0.58      0.55      3891
weighted avg       0.69      0.66      0.60      3891

