# Using Dask-ML's CountVectorizer

Dask-ML includes a `CountVectorizer` that's appropriate for parallel / distributed processing of large datasets.

## Loading Data

In this example, we'll work with the 20 newsgroups dataset from scikit-learn.

In [3]:
import sklearn.datasets

news = sklearn.datasets.fetch_20newsgroups()
news['data'][:2]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

This returns a list of documents (strings). Dask-ML's `CountVectorizer` expects a `dask.bag.Bag` of documents. We'll use `dask.delayed` to load the 20 newsgroups in parallel, taking care to load the data on the workers and not place large values (like `news['data']`) in the task graph. See https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask and https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections for more on these concepts.

As we'll see later, Dask-ML's `CountVectorizer` benefits from using the `dask.distributed` scheduler, even on a single machine.

In [7]:
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=1)
client

Client
  Scheduler: tcp://127.0.0.1:54715
  Dashboard: http://127.0.0.1:8787/status
Cluster
  Workers: 4
  Cores: 4
  Memory: 17.18 GB


In [8]:
import dask
import numpy as np
import dask.bag as db
import toolz

@dask.delayed
def load_news(slice_):
    """Load a slice of the 20 newsgroups dataset."""
    return sklearn.datasets.fetch_20newsgroups()['data'][slice_]

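# Split the dataset into `npartitions` equal-sized slices, plus one
# extra slice for the remainder when the length doesn't divide evenly.
npartitions = 10
partition_size = len(news['data']) // npartitions

# Slice boundaries: 0, partition_size, 2 * partition_size, ..., then None
# so that the last slice runs to the end of the dataset.
lengths = np.cumsum([partition_size] * npartitions)
lengths = [0] + list(lengths) + [None]

slices = [slice(a, b) for a, b in
          toolz.sliding_window(2, lengths)]
# Each delayed call loads its slice on a worker; persist keeps the
# loaded partitions in (distributed) memory.
documents = db.from_delayed([load_news(x) for x in slices]).persist()
documents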

dask.bag<bag-from-delayed, npartitions=11>
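
The bag reports 11 partitions rather than 10 because the trailing `None` boundary adds one more slice that picks up the remainder. The following is a dependency-free sketch of the same boundary arithmetic, assuming 11,314 documents (the row count of the final result matrix); `partition_slices` is a hypothetical helper, not part of Dask or toolz.

```python
def partition_slices(n_items, npartitions):
    """Return `npartitions` equal-sized slices plus one remainder slice."""
    size = n_items // npartitions
    # Boundaries 0, size, 2 * size, ..., npartitions * size, then None so
    # the final slice runs to the end of the sequence.
    bounds = [i * size for i in range(npartitions + 1)] + [None]
    # Pair each boundary with its successor, like a sliding window of 2.
    return [slice(a, b) for a, b in zip(bounds, bounds[1:])]


slices = partition_slices(11314, 10)
print(len(slices))   # 11
print(slices[0])     # slice(0, 1131, None)
print(slices[-1])    # slice(11310, None, None)
```

When the number of documents divides evenly by `npartitions`, the last slice is empty, so the extra partition is harmless but carries no data.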

In [9]:
import dask_ml.feature_extraction.text

In [12]:
vectorizer = dask_ml.feature_extraction.text.CountVectorizer()
%time result = vectorizer.fit_transform(documents)

CPU times: user 340 ms, sys: 30.4 ms, total: 371 ms
Wall time: 2.47 s


The call to `fit_transform` did some work to discover the *vocabulary*, a mapping from terms in the documents to positions in the transformed result array.

In [16]:
list(vectorizer.vocabulary_.items())[:5]

[('00', 0), ('000', 1), ('0000', 2), ('00000', 3), ('000000', 4)]
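
To make the idea of a vocabulary concrete, here is a minimal pure-Python sketch of building a term-to-column mapping and using it to count terms. It is an illustration only, not Dask-ML's implementation; the helper names and toy documents are hypothetical, and the tokenizer assumes lowercasing and words of two or more characters, which mirrors scikit-learn's default token pattern.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase, keep runs of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    return TOKEN.findall(doc.lower())


def build_vocabulary(docs):
    """First pass: map each distinct term to a column index (sorted order)."""
    terms = sorted({term for doc in docs for term in tokenize(doc)})
    return {term: i for i, term in enumerate(terms)}


def transform(docs, vocabulary):
    """Second pass: one row of term counts per document."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term, n in Counter(tokenize(doc)).items():
            if term in vocabulary:
                row[vocabulary[term]] = n
        rows.append(row)
    return rows


toy_docs = ["the car is a car", "what car is this"]
toy_vocab = build_vocabulary(toy_docs)
toy_counts = transform(toy_docs, toy_vocab)
print(toy_vocab)   # {'car': 0, 'is': 1, 'the': 2, 'this': 3, 'what': 4}
print(toy_counts)  # [[2, 1, 1, 0, 0], [1, 1, 0, 1, 1]]
```

The single-character token `a` is dropped by the assumed pattern, which is why it has no column.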

The result is a Dask `Array` whose blocks are `scipy.sparse.csr_matrix` objects. We can bring it back to the client with `.compute()`.

In [20]:
local_result = result.compute()
local_result[:5].toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
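
Almost every entry in these rows is zero, which is why a sparse format is used. As a rough illustration of the Compressed Sparse Row layout, the sketch below converts small dense rows into the three CSR arrays (`data`, `indices`, `indptr`). The `to_csr` helper is hypothetical and written in plain Python; it is not how SciPy builds its matrices.

```python
def to_csr(dense_rows):
    """Convert a list of dense rows into CSR (data, indices, indptr) lists."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it lives in
        # indptr[i]:indptr[i + 1] delimits row i within data / indices.
        indptr.append(len(data))
    return data, indices, indptr


data, indices, indptr = to_csr([[2, 1, 1, 0, 0],
                                [1, 1, 0, 1, 1]])
print(data)     # [2, 1, 1, 1, 1, 1, 1]
print(indices)  # [0, 1, 2, 0, 1, 3, 4]
print(indptr)   # [0, 3, 7]
```

Only the non-zero counts are stored, so memory use scales with the number of stored elements rather than with rows times columns.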

Notice that we persisted `documents` earlier. Fitting requires two passes over the data: one to discover the vocabulary and a second to transform. If possible, persist the input documents so that the second pass reads from memory instead of reloading the data. If the dataset is larger than (distributed) memory, then two passes over the source data will be necessary.

## A note on vocabularies

You can also provide a vocabulary ahead of time, which avoids the need for making two passes over the data. Calls like `vectorizer.transform` then return almost immediately, since no vocabulary needs to be discovered and the transformation itself is lazy; the actual work happens when the result is computed. However, vocabularies can become quite large. Consider scattering the vocabulary to the workers ahead of time, as the next cell does with `client.scatter`, to avoid bloating the size of the `CountVectorizer`.

In [43]:
vocabulary = vectorizer.vocabulary_
remote_vocabulary, = client.scatter([vocabulary], broadcast=True)

vectorizer2 = dask_ml.feature_extraction.text.CountVectorizer(
    vocabulary=remote_vocabulary
)

In [44]:
%time result = vectorizer2.transform(documents)

CPU times: user 7.15 ms, sys: 2.45 ms, total: 9.59 ms
Wall time: 8.54 ms
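
With a fixed vocabulary, each partition can be transformed on its own in a single pass, because no partition needs to know which terms appear elsewhere. The sketch below illustrates that property in plain Python. The helper names, the toy vocabulary, and the whitespace tokenizer are all simplifying assumptions, not Dask-ML's code.

```python
from collections import Counter


def count_terms(doc, vocabulary):
    """Count known terms in one document as {column: count}.

    Terms missing from the vocabulary are ignored. Splitting on
    whitespace is a simplification of real tokenization.
    """
    counts = Counter(doc.lower().split())
    return {vocabulary[t]: n for t, n in counts.items() if t in vocabulary}


def transform_partition(partition, vocabulary):
    """Transform one partition without looking at any other partition."""
    return [count_terms(doc, vocabulary) for doc in partition]


fixed_vocabulary = {"car": 0, "clock": 1, "poll": 2}
partitions = [
    ["car car engine"],
    ["clock poll", "final clock call"],
]
results = [transform_partition(p, fixed_vocabulary) for p in partitions]
print(results)  # [[{0: 2}], [{1: 1, 2: 1}, {1: 1}]]
```

Because every partition maps terms to the same columns, the per-partition results can be stacked row-wise into one matrix.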


In [45]:
%time result.compute()

CPU times: user 162 ms, sys: 37.1 ms, total: 199 ms
Wall time: 1.08 s


<11314x130107 sparse matrix of type '<class 'numpy.int64'>'
	with 1787565 stored elements in Compressed Sparse Row format>