#### In the first example below, the char_wb analyzer is used, which creates n-grams only from characters inside word boundaries (padded with a space on each side). The char analyzer, by contrast, creates n-grams that span across words:

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(5, 5))
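ngram_vectorizer.fit_transform(['jumpy fox'])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# returns a NumPy array, so convert it to a list before comparing.
ngram_vectorizer.get_feature_names_out().tolist() == (
    [' fox ', ' jump', 'jumpy', 'umpy '])

# Same text with the plain 'char' analyzer: n-grams may cross the space.
ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5))
ngram_vectorizer.fit_transform(['jumpy fox'])

ngram_vectorizer.get_feature_names_out().tolist() == (
    ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'])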

True
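To make the difference between the two analyzers concrete, here is a minimal pure-Python sketch of how the n-grams could be produced. It is not scikit-learn's implementation: the helper names `char_ngrams` and `char_wb_ngrams` are made up for illustration, and words shorter than `n` after padding are simply skipped here.

```python
def char_ngrams(text, n):
    """All length-n windows over the whole string, spaces included."""
    text = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_wb_ngrams(text, n):
    """Length-n windows taken inside each word, padded with one space per side."""
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        # Windows never leave the padded word, so no n-gram spans two words.
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


print(sorted(set(char_wb_ngrams("jumpy fox", 5))))
print(sorted(set(char_ngrams("jumpy fox", 5))))
```

For 'jumpy fox' the two printed lists should match the feature names compared in the cell above.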

In [2]:
ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(2, 2))
ngram_vectorizer.fit_transform(['concussion with no loss of consciousness'])

<1x5 sparse matrix of type '<class 'numpy.int64'>'
	with 5 stored elements in Compressed Sparse Row format>

In [3]:
ngram_vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['concussion with', 'loss of', 'no loss', 'of consciousness', 'with no']

In [10]:
ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2))
cv_fit=ngram_vectorizer.fit_transform(['concussion with no loss of consciousness'])
ngram_vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['concussion',
 'concussion with',
 'consciousness',
 'loss',
 'loss of',
 'no',
 'no loss',
 'of',
 'of consciousness',
 'with',
 'with no']
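The effect of ngram_range on the word analyzer can be sketched in plain Python. This is an illustrative approximation, not the library's code: `word_ngrams` is a hypothetical helper, and the regular expression is CountVectorizer's default token pattern (tokens of two or more word characters).

```python
import re

# Default CountVectorizer token pattern: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, min_n, max_n):
    """Return every word n-gram with min_n <= n <= max_n, in document order."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


doc = "concussion with no loss of consciousness"
print(sorted(set(word_ngrams(doc, 2, 2))))  # bigrams only
print(sorted(set(word_ngrams(doc, 1, 2))))  # unigrams and bigrams
```

With ngram_range=(2, 2) the six tokens yield five bigrams; with (1, 2) the six unigrams are added, giving the eleven features listed above.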

### Compute a simple word frequency

Option 1: sum over the sparse matrix, then convert the result to an array. This is much faster:

In [11]:
import numpy as np
np.asarray(cv_fit.sum(axis=0))

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int64)

Option 2: convert the sparse matrix to a dense array first, then sum:

In [12]:
print(cv_fit.toarray().sum(axis=0))

[1 1 1 1 1 1 1 1 1 1 1]
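A rough way to see why summing the sparse matrix is cheaper: in CSR storage only the nonzero entries are kept, so a column sum can visit just those values, whereas the dense route first materialises every zero. The sketch below uses plain lists to mimic the three CSR arrays; the helper names and the tiny two-document matrix are invented for illustration and are not SciPy's implementation.

```python
def csr_column_sums(data, indices, n_cols):
    """Column sums that touch only the stored (nonzero) entries."""
    sums = [0] * n_cols
    for value, col in zip(data, indices):
        sums[col] += value
    return sums


def csr_to_dense(data, indices, indptr, n_cols):
    """Expand CSR arrays into a full list-of-rows matrix, zeros included."""
    rows = []
    for r in range(len(indptr) - 1):
        row = [0] * n_cols
        for k in range(indptr[r], indptr[r + 1]):
            row[indices[k]] = data[k]
        rows.append(row)
    return rows


def dense_column_sums(rows):
    """Column sums over a dense matrix: every cell is visited."""
    return [sum(col) for col in zip(*rows)]


# Hypothetical 2 x 5 count matrix: [[1, 0, 2, 0, 0], [0, 0, 1, 0, 3]]
data, indices, indptr, n_cols = [1, 2, 1, 3], [0, 2, 2, 4], [0, 2, 4], 5

print(csr_column_sums(data, indices, n_cols))
print(dense_column_sums(csr_to_dense(data, indices, indptr, n_cols)))
```

Both routes give the same totals. The sparse one does work proportional to the number of stored entries, the dense one proportional to rows times columns.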


#### Note that the dimensionality of the feature matrix does not affect the CPU training time of algorithms which operate on CSR matrices (LinearSVC(dual=True), Perceptron, SGDClassifier, PassiveAggressive), but it does for algorithms that work with CSC matrices (LinearSVC(dual=False), Lasso(), etc.).

### Image Feature Extraction

#### 4.2.4.1. Patch extraction

The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or as a three-dimensional array with color information along the third axis. To rebuild an image from all its patches, use reconstruct_from_patches_2d. For example, let us generate a 4x4 pixel picture with 3 color channels (e.g. in RGB format):

In [10]:
import numpy as np
from sklearn.feature_extraction import image

one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
one_image[:, :, 0]  # R channel of a fake RGB picture



patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
    random_state=0)
patches.shape

patches[:, :, :, 0]



patches = image.extract_patches_2d(one_image, (2, 2))
patches.shape

patches[4, :, :, 0]

array([[15, 18],
       [27, 30]])
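The idea behind patch extraction and reconstruction can be sketched without any library, using only the R channel of the fake image above as a list of lists. The helpers `extract_patches` and `reconstruct` are hypothetical stand-ins, not the scikit-learn functions; the reconstruction averages overlapping pixels, which is the strategy the documentation describes for reconstruct_from_patches_2d.

```python
def extract_patches(img, ph, pw):
    """Slide a ph x pw window over img, left to right, top to bottom."""
    h, w = len(img), len(img[0])
    return [
        [row[j:j + pw] for row in img[i:i + ph]]
        for i in range(h - ph + 1)
        for j in range(w - pw + 1)
    ]


def reconstruct(patches, h, w):
    """Rebuild an h x w image by averaging the overlapping patch pixels."""
    ph, pw = len(patches[0]), len(patches[0][0])
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]
    per_row = w - pw + 1  # number of patch positions per image row
    for idx, patch in enumerate(patches):
        i, j = divmod(idx, per_row)  # top-left corner of this patch
        for di in range(ph):
            for dj in range(pw):
                total[i + di][j + dj] += patch[di][dj]
                count[i + di][j + dj] += 1
    return [[total[r][c] / count[r][c] for c in range(w)] for r in range(h)]


# R channel of np.arange(4 * 4 * 3).reshape((4, 4, 3)): 0, 3, 6, ..., 45
r_channel = [[(r * 4 + c) * 3 for c in range(4)] for r in range(4)]

patches = extract_patches(r_channel, 2, 2)
print(len(patches))   # a 4x4 image has 3 * 3 = 9 patches of size 2x2
print(patches[4])     # the central patch
print(reconstruct(patches, 4, 4) == r_channel)
```

Because every overlapping patch carries the same pixel value, the average reproduces the original image exactly.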