# Support vector machines and machine learning on documents

The support vector machine (SVM) is widely regarded as one of the best text classification algorithms, although it is also a bit slower than naïve Bayes.

The nonlinear support vector classifier (SVC) is a powerful and widely used memory-based classifier. Like KNN, a nonlinear SVC makes predictions by a weighted average of the labels of similar examples, where similarity is measured by a kernel function. However, only the support vectors, i.e., the examples falling on or inside the margin, can have positive weights and need to be remembered. In practice, an SVC usually remembers far fewer examples than KNN does. Another difference is that SVC is not a lazy learner: the weights are learned eagerly in the training phase.
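To make the "weighted average of labels, measured by a kernel" idea concrete, here is a minimal sketch in plain Python. The support vectors, their weights, and the bias are made-up values for illustration; in a real SVC they are the result of training, and the helper names (`rbf_kernel`, `decision_function`, `predict`) are hypothetical rather than part of scikit-learn.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """Similarity between two points: 1.0 when identical, near 0 when far apart."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Hypothetical remembered examples: (support vector, label in {-1, +1}, weight).
# A trained SVC would learn the weights and the bias; these are invented.
support_vectors = [
    ((0.0, 0.0), -1, 1.0),
    ((1.0, 1.0), +1, 1.0),
]
bias = 0.0

def decision_function(x):
    """Kernel-weighted vote of the support vectors' labels, plus a bias."""
    return sum(w * y * rbf_kernel(sv, x) for sv, y, w in support_vectors) + bias

def predict(x):
    return 1 if decision_function(x) >= 0 else -1

print(predict((0.9, 1.2)))   # query close to the positive support vector
print(predict((0.1, -0.2)))  # query close to the negative support vector
```

Only the two stored points take part in the prediction, which is the sense in which the classifier is memory-based yet needs to remember only its support vectors.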

In [1]:
from sklearn.datasets import load_files
twenty_train = load_files('twenty_newsgroups/20news-bydate-train', encoding='latin1')
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## Extracting features from text files

In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.

### Tokenizing text with scikit-learn

Text preprocessing, tokenizing, and filtering of stopwords are included in a high-level component that builds a dictionary of features and transforms documents into feature vectors.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)

In [3]:
count_vect.vocabulary_.get('for')

56283
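What the vectorizer does can be approximated in a few lines of plain Python. The sketch below is an illustration on a two-document toy corpus, not the library's implementation: `tokenize` and `to_counts` are hypothetical helpers, and the regular expression only mimics the spirit of the default tokenization (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Map each distinct token to a column index, in alphabetical order.
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

def to_counts(text):
    """Turn one document into a row of per-token occurrence counts."""
    row = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocabulary:          # tokens unseen during fitting are ignored
            row[vocabulary[tok]] = n
    return row

matrix = [to_counts(doc) for doc in docs]
print(vocabulary)
print(matrix)
```

Each row has one column per vocabulary entry, which is why the real count matrix above has 130107 columns: one for every distinct token in the training set.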

CountVectorizer supports counts of N-grams of words or consecutive characters. Once fitted, the vectorizer has built a dictionary of feature indices:

In [4]:
ngram_count_vect = CountVectorizer(ngram_range=(1, 5))
XX_train_counts = ngram_count_vect.fit_transform(twenty_train.data)
XX_train_counts.shape

(11314, 8069416)

In [5]:
ngram_count_vect.vocabulary_.get('algorithm for')

627642
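The jump from about 130 thousand to about 8 million features comes from counting every contiguous word sequence of length one to five. A minimal sketch of word n-gram extraction, using a hypothetical helper `word_ngrams` on a single toy sentence, shows how quickly the number of features grows:

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "an algorithm for text classification".split()
unigrams = word_ngrams(tokens, 1, 1)
up_to_five = word_ngrams(tokens, 1, 5)

print(len(unigrams), len(up_to_five))
print("algorithm for" in up_to_five)
```

A five-word sentence already yields 15 n-grams instead of 5 unigrams, and across a corpus most longer n-grams are unique to one document.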

### From occurrences to frequencies

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.
This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.

Both tf and tf–idf can be computed as follows:

### TfidfTransformer
Transforms a count matrix, such as the output of CountVectorizer, into a normalized tf or tf–idf representation.

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(11314, 130107)

In [7]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)
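The arithmetic behind tf and tf–idf can be sketched by hand on a toy count matrix. This is an illustration under stated assumptions, not a reimplementation of TfidfTransformer: the counts are invented, tf is computed as described in the text (count divided by document length), and idf uses a smoothed form, ln((1 + n) / (1 + df)) + 1. The scikit-learn transformer additionally applies L2 row normalization by default, so its numbers will differ.

```python
import math

# Toy count matrix: rows are documents, columns are terms (invented data).
terms = ["the", "svm", "kernel"]
counts = [
    [4, 1, 0],   # document 0
    [2, 0, 0],   # document 1
    [6, 2, 2],   # document 2
]

def term_frequencies(row):
    """Divide each count by the document's total number of term occurrences."""
    total = sum(row)
    return [c / total for c in row]

n_docs = len(counts)
# Document frequency: in how many documents does each term appear?
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]
# Smoothed inverse document frequency: a term found in every document gets 1.0.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tf = [term_frequencies(row) for row in counts]
tfidf = [[t * w for t, w in zip(row, idf)] for row in tf]

for term, d, w in zip(terms, df, idf):
    print(f"{term}: df={d}, idf={w:.3f}")
```

The term that appears in every document ("the") keeps the lowest weight, while the rarest term ("kernel") is scaled up the most, which is the downscaling effect described above.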

### TfidfVectorizer
TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer:

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(twenty_train.data)
X_train_tfidf.shape

(11314, 130107)
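The equivalence can be checked on a small corpus. The sketch below assumes scikit-learn and NumPy are installed and uses three invented sentences in place of the newsgroup data; both routes use default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["the cat sat on the mat", "the dog sat", "a dog chased the cat"]

# Two-step route: raw counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same)
```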

## Building a pipeline

To make the vectorizer => transformer => classifier chain easier to work with, scikit-learn provides a Pipeline class that behaves like a compound classifier:

In [9]:
from sklearn.pipeline import Pipeline

As noted above, SVMs are widely regarded as among the best text classifiers, although they are a bit slower than naïve Bayes. We can change the learner by plugging a different classifier object into our pipeline; here we use LinearSVC:

In [10]:
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

In [11]:
text_clf = Pipeline([('vect', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

In [12]:
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])
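To see what happens inside `fit` and `predict` on such a compound object, here is a toy stand-in written in plain Python. It is a sketch of the chaining idea only: `MiniPipeline`, `LengthFeature`, and `ThresholdClassifier` are hypothetical classes invented for this illustration and are far simpler than the scikit-learn ones.

```python
class MiniPipeline:
    """Toy pipeline: every step but the last transforms the data,
    and the last step is the estimator that is fit and asked to predict."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X, y)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)   # reuse what was learned during fit
        return self.steps[-1][1].predict(X)


class LengthFeature:
    """Hypothetical transformer: represent each document by its word count."""
    def fit_transform(self, X, y=None):
        return self.transform(X)

    def transform(self, X):
        return [len(doc.split()) for doc in X]


class ThresholdClassifier:
    """Hypothetical classifier: label 1 if the feature exceeds the training mean."""
    def fit(self, X, y):
        self.threshold_ = sum(X) / len(X)
        return self

    def predict(self, X):
        return [1 if x > self.threshold_ else 0 for x in X]


pipe = MiniPipeline([("vect", LengthFeature()), ("clf", ThresholdClassifier())])
pipe.fit(["short text", "a much longer piece of text here"], [0, 1])
print(pipe.predict(["tiny", "this one has quite a few more words in it"]))
```

The caller passes raw documents to both `fit` and `predict`; the feature extraction learned on the training data is reapplied to new data automatically, which is the main convenience a pipeline offers.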

### Load test data

In [13]:
twenty_test = load_files('twenty_newsgroups/20news-bydate-test', encoding='latin1')

In [14]:
predicted = text_clf.predict(twenty_test.data)

In [15]:
import numpy as np
np.mean(predicted == twenty_test.target)  

0.85315985130111527

In [16]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.82      0.80      0.81       319
           comp.graphics       0.76      0.80      0.78       389
 comp.os.ms-windows.misc       0.77      0.73      0.75       394
comp.sys.ibm.pc.hardware       0.71      0.76      0.74       392
   comp.sys.mac.hardware       0.84      0.86      0.85       385
          comp.windows.x       0.87      0.76      0.81       395
            misc.forsale       0.83      0.91      0.87       390
               rec.autos       0.92      0.91      0.91       396
         rec.motorcycles       0.95      0.95      0.95       398
      rec.sport.baseball       0.92      0.95      0.93       397
        rec.sport.hockey       0.96      0.98      0.97       399
               sci.crypt       0.93      0.94      0.93       396
         sci.electronics       0.81      0.79      0.80       393
                 sci.med       0.90      0.87      0.88       396
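
The accuracy and the per-class columns of the report can be reproduced by hand. The sketch below uses six invented labels rather than the newsgroup predictions, and `per_class_report` is a hypothetical helper that treats one class at a time as the positive class.

```python
def per_class_report(y_true, y_pred, label):
    """Precision, recall, F1 and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    support = tp + fn  # number of true examples of this class
    return precision, recall, f1, support

# Invented labels for six documents and three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")
for label in sorted(set(y_true)):
    p, r, f, s = per_class_report(y_true, y_pred, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```

Precision answers "of the documents assigned to this class, how many belong there?", recall answers "of the documents that belong to this class, how many were found?", and support is the number of true examples per class.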
         

### Parameter tuning using grid search

We’ve already encountered some parameters such as use_idf in the TfidfTransformer. Classifiers tend to have many parameters as well; e.g., MultinomialNB includes a smoothing parameter alpha and SGDClassifier has a penalty parameter alpha and configurable loss and penalty terms in the objective function (see the module documentation, or use the Python help function, to get a description of these).

Instead of tweaking the parameters of the various components of the chain by hand, it is possible to run an exhaustive search for the best parameters on a grid of possible values. We try the classifier on either words alone or words plus bigrams, with or without idf, and with the regularization parameter C of the linear SVM set to 1.0, 0.1, 0.01, or 0.001:

In [17]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'vect__use_idf': (True, False),
              'clf__C': (1.0, 0.1, 1e-2, 1e-3),
}
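The number of candidate settings in this grid can be checked with the standard library. This sketch simply enumerates the Cartesian product of the value lists; it does not run any training, and the sorted-key ordering is a choice made here for readability.

```python
from itertools import product

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__use_idf': (True, False),
    'clf__C': (1.0, 0.1, 1e-2, 1e-3),
}

# One dict per candidate setting: the Cartesian product of all value lists.
names = sorted(parameters)
combinations = [dict(zip(names, values))
                for values in product(*(parameters[n] for n in names))]

print(len(combinations))
print(combinations[0])
```

Two n-gram ranges times two idf settings times four values of C gives 16 candidates, and each one is refit once per cross-validation fold, which is why the search is costly.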

Obviously, such an exhaustive search can be expensive. If we have multiple CPU cores at our disposal, we can tell the grid searcher to try these sixteen parameter combinations in parallel with the n_jobs parameter. If we give this parameter a value of -1, grid search will detect how many cores are installed and use them all:

In [18]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

The grid search instance behaves like a normal scikit-learn model. Here we fit it on the full training set; searching on a smaller subset of the training data would speed up the computation:

In [19]:
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

The result of calling fit on a GridSearchCV object is a classifier that we can use to predict:

In [20]:
twenty_train.target_names[gs_clf.predict(['God is love'])[0]]

'soc.religion.christian'

In [21]:
gs_clf.best_score_

0.9231041187908785

In [22]:
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

clf__C: 1.0
vect__ngram_range: (1, 2)
vect__use_idf: True


In [23]:
clf = gs_clf.best_estimator_

In [24]:
predicted = clf.predict(twenty_test.data)

In [25]:
import numpy as np
np.mean(predicted == twenty_test.target)  

0.85740839086563991

In [26]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.83      0.79      0.81       319
           comp.graphics       0.74      0.80      0.77       389
 comp.os.ms-windows.misc       0.77      0.77      0.77       394
comp.sys.ibm.pc.hardware       0.73      0.76      0.74       392
   comp.sys.mac.hardware       0.83      0.86      0.85       385
          comp.windows.x       0.87      0.76      0.81       395
            misc.forsale       0.84      0.91      0.87       390
               rec.autos       0.94      0.91      0.92       396
         rec.motorcycles       0.96      0.97      0.96       398
      rec.sport.baseball       0.91      0.94      0.93       397
        rec.sport.hockey       0.95      0.98      0.97       399
               sci.crypt       0.93      0.95      0.94       396
         sci.electronics       0.82      0.78      0.80       393
                 sci.med       0.90      0.86      0.88       396
         

In [28]:
gs_clf.cv_results_

{'mean_fit_time': array([  6.76273831,   6.48441784,  38.51126877,  35.11828041,
          6.5181636 ,   6.07969292,  38.12305562,  28.60024603,
          6.10935195,   6.23029232,  27.42246596,  25.40827171,
          6.06645139,   5.30015779,  28.39650591,  19.55859637]),
 'mean_score_time': array([ 1.83603573,  1.85204299,  6.81356255,  5.23203429,  2.36111132,
         2.08001137,  5.55435038,  5.22367303,  2.1113929 ,  2.24261491,
         5.80143579,  5.00339142,  2.02967366,  1.92598343,  5.24472411,
         3.39409868]),
 'mean_test_score': array([ 0.91894997,  0.8902245 ,  0.92310412,  0.89985858,  0.8959696 ,
         0.82950327,  0.89853279,  0.8432915 ,  0.82791232,  0.64221319,
         0.83515998,  0.65494078,  0.70673502,  0.39208061,  0.71813682,
         0.38403748]),
 'mean_train_score': array([ 0.99929296,  0.99500614,  0.99973487,  0.99907207,  0.98258771,
         0.93167821,  0.9913825 ,  0.96199421,  0.90569246,  0.71398305,
         0.94175374,  0.74602304,  0.

The *cv_results_* attribute is a dictionary of arrays, so it can easily be loaded into pandas as a DataFrame for further inspection.
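A minimal sketch of that inspection, assuming pandas is installed. To keep the example self-contained, it uses a small stand-in dictionary instead of `gs_clf.cv_results_`: the four scores are the first four `mean_test_score` values printed above, while the parameter settings paired with them are an assumption about the candidate ordering, since the truncated output does not show them.

```python
import pandas as pd

# Stand-in for gs_clf.cv_results_ (assumed pairing of scores and settings).
cv_results = {
    'param_clf__C': [1.0, 1.0, 1.0, 1.0],
    'param_vect__ngram_range': [(1, 1), (1, 1), (1, 2), (1, 2)],
    'param_vect__use_idf': [True, False, True, False],
    'mean_test_score': [0.91894997, 0.8902245, 0.92310412, 0.89985858],
}

results = pd.DataFrame(cv_results)

# Best-scoring settings first.
ranked = results.sort_values('mean_test_score', ascending=False)
print(ranked.head())
```

With a fitted search object, the same steps would start from `pd.DataFrame(gs_clf.cv_results_)`.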