```python
# Install the packages
!pip install scikit-learn > /dev/null 2>&1
```

# Count Vectorization

Before we can train a model on text, we need to turn the text into a numeric matrix in which each column is a feature and each entry quantifies that feature for one document. A feature is usually a token, and for text the most common token is a word. The simplest measure is the word count: how many times each word occurs in a document.
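
To make the idea concrete, here is a minimal sketch of count vectorization in plain Python, using only the standard library. The `tokenize` helper and the variable names are illustrative assumptions, not part of any library. The tokenization rule (lowercase the text, keep runs of two or more word characters) is chosen to resemble the default behaviour of scikit-learn's `CountVectorizer`.

```python
import re
from collections import Counter

corpus = [
    'The quick brown fox jumps over the lazy dog',
    'The five boxing wizards jump quickly',
]

def tokenize(text):
    # Hypothetical helper: lowercase the text, then keep runs of
    # two or more word characters as tokens.
    return re.findall(r"\b\w\w+\b", text.lower())

# Count the tokens of each document separately.
counts = [Counter(tokenize(doc)) for doc in corpus]

# The vocabulary is the sorted set of all tokens seen in the corpus;
# a token's position in this list is its column index in the matrix.
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary token.
# A Counter returns 0 for tokens a document does not contain.
matrix = [[doc_counts[token] for token in vocabulary] for doc_counts in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, so documents of different lengths become vectors of equal size, which is what most models expect as input.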

We can perform count vectorization with the `CountVectorizer` class in the `sklearn.feature_extraction.text` module.


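```python
from sklearn.feature_extraction.text import CountVectorizer

# Each string in the corpus is one document.
corpus = [
    'The quick brown fox jumps over the lazy dog',
    'The five boxing wizards jump quickly',
]

vec = CountVectorizer()
# Learn the vocabulary and build the document-term count matrix (sparse).
X = vec.fit_transform(corpus)
# Tokens in column order.
names = vec.get_feature_names_out()

print(names, X.toarray())
```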



## Output

Here's the output:

```
['boxing' 'brown' 'dog' 'five' 'fox' 'jump' 'jumps' 'lazy' 'over' 'quick'
 'quickly' 'the' 'wizards'] [[0 1 1 0 1 0 1 1 1 1 0 2 0]
 [1 0 0 1 0 1 0 0 0 0 1 1 1]]
```

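This translates to the following matrix, with one row per document and one column per token. `CountVectorizer` lowercases text by default, so `The` and `the` in the first sentence are counted together as 2 occurrences of `the`.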

|            | boxing | brown | dog | five | fox | jump | jumps | lazy | over | quick | quickly | the | wizards |
|------------|:------:|:-----:|:---:|:----:|:---:|:----:|:-----:|:----:|:----:|:-----:|:-------:|:---:|:-------:|
| Document 1 |   0    |   1   |  1  |  0   |  1  |  0   |   1   |  1   |  1   |   1   |    0    |  2  |    0    |
| Document 2 |   1    |   0   |  0  |  1   |  0  |  1   |   0   |  0   |  0   |   0   |    1    |  1  |    1    |
