# Text Vectorization Example

This example illustrates how Dask-ML can be used to vectorize textual data in parallel.
This example is adapted from https://github.com/scikit-learn/scikit-learn/tree/master/examples/applications/plot_out_of_core_classification.py#L143.

In [None]:
from dask.distributed import Client, progress

client = Client(n_workers=2, threads_per_worker=2, memory_limit='1GB')
client

In [None]:
import os
import re
import tarfile
from glob import glob
from urllib.request import urlretrieve

import scipy.sparse
from sklearn.datasets import get_data_home

import dask
import dask.bag as db


import dask_ml.feature_extraction.text

In [None]:
def fetch_reuters(data_path=None):
    """Fetch documents of the Reuters dataset.
    """

    DOWNLOAD_URL = ('http://archive.ics.uci.edu/ml/machine-learning-databases/'
                    'reuters21578-mld/reuters21578.tar.gz')
    ARCHIVE_FILENAME = 'reuters21578.tar.gz'

    if data_path is None:
        data_path = os.path.join(get_data_home(), "reuters")
    if not os.path.exists(data_path):
        # Download the dataset.
        print("downloading dataset (once and for all) into %s" %
              data_path)
        os.mkdir(data_path)

        def progress(blocknum, bs, size):
            if blocknum % 100 == 0:
                total_sz_mb = '%.2f MB' % (size / 1e6)
                current_sz_mb = '%.2f MB' % ((blocknum * bs) / 1e6)
                # end='' keeps the cursor on the same line so '\r' can overwrite it
                print('\rdownloaded %s / %s' % (current_sz_mb, total_sz_mb),
                      end='')

        archive_path = os.path.join(data_path, ARCHIVE_FILENAME)
        urlretrieve(DOWNLOAD_URL, filename=archive_path,
                    reporthook=progress)
        print('\r')
        print("untarring Reuters dataset...")
        # Use a context manager so the archive is closed after extraction.
        with tarfile.open(archive_path, 'r:gz') as archive:
            archive.extractall(data_path)
        print("done.")
    return data_path


def load_from_filename(file_path):
    with open(file_path, 'rb') as fh:
        txt = fh.read().decode('latin-1')

    # Each match is the text between <BODY> and </BODY>, i.e. one article.
    return re.findall(r'(?<=<BODY>)[^<]+(?=</BODY>)', txt)
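
To see what the pattern in `load_from_filename` extracts, here is a minimal sketch that applies it to a small made-up string. The `sample` text is hypothetical and only imitates the tag layout the pattern expects; it is not taken from the dataset.

```python
import re

# Hypothetical two-article snippet (not real Reuters data) that mimics
# the <BODY>...</BODY> layout the pattern looks for.
sample = (
    "<REUTERS><TEXT><TITLE>FIRST</TITLE>"
    "<BODY>Cocoa prices rose.</BODY></TEXT></REUTERS>\n"
    "<REUTERS><TEXT><TITLE>SECOND</TITLE>"
    "<BODY>Grain exports fell.</BODY></TEXT></REUTERS>\n"
)

# Same pattern as in load_from_filename: the lookbehind and lookahead
# anchor on the tags without including them in the match.
bodies = re.findall(r'(?<=<BODY>)[^<]+(?=</BODY>)', sample)
print(bodies)  # ['Cocoa prices rose.', 'Grain exports fell.']
```

Because the body is matched with `[^<]+`, an article whose body contained a literal `<` character would not be returned by this pattern.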

In [None]:
data_path = fetch_reuters()
files = glob(os.path.join(data_path, 'reut2*'))
files[:5]

`files` is a list of file paths. We can build a Dask Bag that will (lazily) read in the contents of these files.

In [None]:
text = (db.from_sequence(files)
          .map(load_from_filename)
          .flatten())
text
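
The `map` followed by `flatten` pattern above can be tried on in-memory data. This is a minimal sketch: `split_articles` and `blobs` are hypothetical stand-ins for `load_from_filename` and the file list, and the synchronous scheduler is chosen here only to keep the illustration self-contained.

```python
import dask.bag as db


def split_articles(blob):
    """Hypothetical stand-in for load_from_filename: one input, many articles."""
    return blob.split("|")


# Each string plays the role of one file holding several articles.
blobs = ["a1|a2", "b1", "c1|c2|c3"]

# Nothing is read or split yet; this only builds a task graph.
bag = db.from_sequence(blobs).map(split_articles).flatten()

# compute() runs the graph. map() alone would give a bag of lists;
# flatten() turns that into a bag of individual articles.
articles = bag.compute(scheduler="synchronous")
print(articles)  # ['a1', 'a2', 'b1', 'c1', 'c2', 'c3']
```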

Each element of `text` is a single article. Here are the first 100 characters of the first document.

In [None]:
print(text.take(1)[0][:100])

The API is the same as with scikit-learn: you instantiate the estimator and pass the data to `fit` or `fit_transform`.
The only difference is that the data is a `dask.bag.Bag` or `dask.dataframe.Series`.
`fit_transform` returns a lazy Dask Array; the transformation itself happens in parallel once we actually compute the result.

In [None]:
vect = dask_ml.feature_extraction.text.HashingVectorizer()
X = vect.fit_transform(text)
X
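
A hashing vectorizer parallelizes well because it keeps no vocabulary: each token is mapped to a column index by a hash function, so every document can be transformed independently of the others. The following is a simplified pure-Python sketch of that idea, not the library's implementation. `hash_vectorize` and `N_FEATURES` are hypothetical names, CRC32 is swapped in for the MurmurHash3 function scikit-learn uses, and normalization and sign alternation are left out.

```python
import re
import zlib
from collections import Counter

N_FEATURES = 16  # hypothetical, tiny on purpose so collisions are easy to see


def hash_vectorize(doc, n_features=N_FEATURES):
    """Return {column_index: count} for one document (simplified sketch)."""
    tokens = re.findall(r"\w\w+", doc.lower())
    # CRC32 is an assumption made for this sketch; it is deterministic
    # across processes, unlike Python's built-in hash() for strings.
    columns = (zlib.crc32(tok.encode("utf-8")) % n_features for tok in tokens)
    return dict(Counter(columns))


docs = ["Dask scales Python", "Python scales"]

# Transforming documents together or one at a time gives the same rows,
# because no state is learned from the data.
rows_together = [hash_vectorize(d) for d in docs]
row_alone = hash_vectorize(docs[1])
print(rows_together[1] == row_alone)  # True
```

Since nothing is learned from the data, partitions of a bag can be transformed on different workers without any coordination.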

Each block of the Dask Array contains a SciPy sparse matrix.

In [None]:
X.blocks[0].compute()

SciPy sparse matrices don't implement the full ndarray interface, so while you can *store* them in a Dask Array, many operations on a Dask Array composed of SciPy sparse matrices will fail. The [`sparse`](http://sparse.pydata.org/en/latest/) project implements an n-dimensional sparse array that conforms to NumPy's ndarray interface.

In this case, we'll convert each block to a sparse `COO` array and then call `compute`, materializing the result as a single `COO` array. For this dataset, that's only about 24 MB. For large datasets, you would want to continue processing the data as a Dask Array.

Watch the distributed dashboard during this part. Notice that the data loading and transformation happen in parallel.

In [None]:
import sparse

X.map_blocks(sparse.COO.from_scipy_sparse, dtype=X.dtype).compute()