# Text Vectorization Pipeline

This example illustrates how Dask-ML can be used to vectorize large text datasets in parallel.
It is based on [this scikit-learn example](https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#sphx-glr-auto-examples-applications-plot-out-of-core-classification-py).

It has been adapted to:

* Fit the entire model, including text vectorization, as a pipeline.
* Use dask collections like `dask.bag`, `dask.dataframe`, and `dask.array`
  rather than generators to work with larger than memory datasets.

In [1]:
from dask.distributed import Client, progress

client = Client(n_workers=2, threads_per_worker=2, memory_limit='1GB')
client

Client
  Scheduler: tcp://127.0.0.1:64775
  Dashboard: http://127.0.0.1:8787/status
Cluster
  Workers: 2
  Cores: 4
  Memory: 2.00 GB


## Fetch the data

The data are available on the [UCI machine learning repository](https://archive.ics.uci.edu/ml/index.php).
The details of downloading and parsing the data aren't too important.

In [2]:
import os
import re
import tarfile
from glob import glob
from urllib.request import urlretrieve

import scipy.sparse
from sklearn.datasets import get_data_home

import dask
import dask.bag as db


import dask_ml.feature_extraction.text

In [3]:
def fetch_reuters(data_path=None):
    """Fetch documents of the Reuters dataset"""
    DOWNLOAD_URL = ('http://archive.ics.uci.edu/ml/machine-learning-databases/'
                    'reuters21578-mld/reuters21578.tar.gz')
    ARCHIVE_FILENAME = 'reuters21578.tar.gz'

    if data_path is None:
        data_path = os.path.join(get_data_home(), "reuters")
    if not os.path.exists(data_path):
        # Download the dataset.
        print("downloading dataset (once and for all) into %s" %
              data_path)
        os.mkdir(data_path)

        def progress(blocknum, bs, size):
            if blocknum % 100 == 0:
                total_sz_mb = '%.2f MB' % (size / 1e6)
                current_sz_mb = '%.2f MB' % ((blocknum * bs) / 1e6)
                # end='' keeps the progress report on one line, rewritten via '\r'
                print('\rdownloaded %s / %s' % (current_sz_mb, total_sz_mb),
                      end='')

        archive_path = os.path.join(data_path, ARCHIVE_FILENAME)
        urlretrieve(DOWNLOAD_URL, filename=archive_path,
                    reporthook=progress)
        print('\r')
        print("untarring Reuters dataset...")
        # Use a context manager so the archive is closed after extraction.
        with tarfile.open(archive_path, 'r:gz') as tar:
            tar.extractall(data_path)
        print("done.")
    return data_path


def load_from_filename(file_path):
    """
    Load a list of (topic, article) pairs.
    
    Examples
    --------
    >>> pairs = load_from_filename("reut2-004.sgm")
    >>> topics, article = pairs[0]
    >>> topics
    ['interest', 'retail', 'ipi']
    >>> article[:37]
    'U.S. economic data this week could be'
    """
    with open(file_path, 'rb') as fh:
        txt = fh.read().decode('latin-1')        

    *articles, _ = [x.strip() for x in txt.split("</REUTERS>")]
    articles = [x for x in articles
                if '<BODY>' in x
                and '<TOPICS></TOPICS>' not in x]
    articles = '\n'.join(articles)
    topics = re.findall(r'<TOPICS>(?P<topics>.*)</TOPICS>', articles)
    topics = [re.findall('(?<=<D>)[^<]+(?=</D>)', x) for x in topics]
    bodies = re.findall('(?<=<BODY>)[^<]+(?=</BODY>)', articles)

    return list(zip(topics, bodies))
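
To see what the parsing step produces without downloading anything, here is a minimal sketch that applies the same splitting and regular expressions as `load_from_filename` to a tiny hand-written snippet. The `sample` string is a made-up stand-in for the contents of a `.sgm` file, not real Reuters data, and it is simplified to one article per line.

```python
import re

# Hypothetical two-article snippet imitating the Reuters SGML layout.
sample = (
    "<REUTERS><TOPICS><D>earn</D></TOPICS>"
    "<TEXT><BODY>Shr 25 cts vs 36 cts</BODY></TEXT></REUTERS>\n"
    "<REUTERS><TOPICS><D>crude</D><D>nat-gas</D></TOPICS>"
    "<TEXT><BODY>USX Corp said proved reserves rose</BODY></TEXT></REUTERS>\n"
)

# Split into articles; the piece after the last closing tag is discarded.
*articles, _ = [x.strip() for x in sample.split("</REUTERS>")]
# Keep only articles that have a body and at least one topic.
articles = [x for x in articles
            if "<BODY>" in x and "<TOPICS></TOPICS>" not in x]
joined = "\n".join(articles)

# Extract the topic block of each article, then the individual topics.
topic_blocks = re.findall(r"<TOPICS>(?P<topics>.*)</TOPICS>", joined)
topics = [re.findall(r"(?<=<D>)[^<]+(?=</D>)", x) for x in topic_blocks]
bodies = re.findall(r"(?<=<BODY>)[^<]+(?=</BODY>)", joined)

pairs = list(zip(topics, bodies))
for article_topics, body in pairs:
    print(article_topics, "->", body)
```

Each element of `pairs` is a `(topics, body)` tuple, which is the record shape that the rest of the example turns into a two-column dataframe.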

In [4]:
data_path = fetch_reuters()
files = glob(os.path.join(data_path, 'reut2*'))
files[:5]

['/Users/taugspurger/scikit_learn_data/reuters/reut2-004.sgm',
 '/Users/taugspurger/scikit_learn_data/reuters/reut2-010.sgm',
 '/Users/taugspurger/scikit_learn_data/reuters/reut2-011.sgm',
 '/Users/taugspurger/scikit_learn_data/reuters/reut2-005.sgm',
 '/Users/taugspurger/scikit_learn_data/reuters/reut2-013.sgm']

## Load the Data

`files` is a list of file paths. We can build a Dask Bag that will (lazily)
read in the contents of these files. Again, the details of loading aren't
the point of this example, but you may want to glance through `load_from_filename`
to see how each file is parsed into the records that `dask.bag` maps over.

In [5]:
text = (
    db.from_sequence(files)
      .map(load_from_filename)
      .flatten()
      .to_dataframe(columns=['topics', 'text'])
)

text.head()

Unnamed: 0,topics,text
0,"[interest, retail, ipi]",U.S. economic data this week could be\nthe key...
1,[earn],Oper shr loss two cts vs profit three cts\n ...
2,[earn],Shr 25 cts vs 36 cts\n Net 1.4 mln vs 1.4 m...
3,[earn],Shr loss 1.02 dlrs vs 1.01 dlr\n Net loss 1...
4,"[crude, nat-gas, iron-steel]",USX Corp said proved reserves of oil\nand natu...


`text` is a dask dataframe with two columns:

* `topics`: a list of the topics this article is classified with
* `text`: the body of the article
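
The `map` then `flatten` step used to build `text` can be pictured with plain Python iterators. The sketch below is only an analogy: `fake_load` and `paths` are hypothetical stand-ins for `load_from_filename` and `files`, and Dask additionally splits the work into partitions and runs it in parallel, which this single-threaded version does not attempt.

```python
from itertools import chain


def fake_load(path):
    """Hypothetical stand-in for load_from_filename: two records per file."""
    return [
        (["earn"], f"{path}: first article"),
        (["acq"], f"{path}: second article"),
    ]


paths = ["reut2-000.sgm", "reut2-001.sgm"]

# map() is lazy: no file is "loaded" until the records are consumed.
per_file = map(fake_load, paths)

# Flattening turns a sequence of per-file lists into one stream of records.
records = chain.from_iterable(per_file)

# Naming the two fields mirrors to_dataframe(columns=['topics', 'text']).
rows = [{"topics": topics, "text": body} for topics, body in records]
print(len(rows), rows[0])
```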

## Feature Hashing

`dask_ml.feature_extraction.text.HashingVectorizer` provides a similar API to [scikit-learn's implementation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html). In fact, Dask-ML's implementation uses scikit-learn's, applying it to each partition of the input `dask.dataframe.Series` or `dask.bag.Bag`.

Transformation, once we actually compute the result, happens in parallel and returns a dask Array.
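
To make the idea of feature hashing concrete, here is a toy sketch in pure Python. It is not scikit-learn's implementation: scikit-learn uses MurmurHash3 with optional sign alternation and normalization, whereas this sketch assumes `zlib.crc32` as the hash and keeps raw counts. The function name `toy_hash_vectorize` is made up for illustration.

```python
import re
import zlib
from collections import Counter

# Tokens of two or more word characters, like the vectorizer's default pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def toy_hash_vectorize(doc, n_features=2**20):
    """Map a document to a sparse {column index: count} row.

    Each token is hashed to a column, so no vocabulary has to be stored
    or shared between workers. Different tokens may collide in a column.
    """
    tokens = TOKEN_PATTERN.findall(doc.lower())
    return Counter(
        zlib.crc32(token.encode("utf-8")) % n_features for token in tokens
    )


row = toy_hash_vectorize("Shr loss two cts vs profit three cts")
print(sorted(row.items()))
```

Because the mapping from token to column depends only on the token itself, every partition can be transformed independently, which is what makes the transformation easy to parallelize.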

In [6]:
vect = dask_ml.feature_extraction.text.HashingVectorizer()
X = vect.fit_transform(text['text'])
X

Unnamed: 0,Array,Chunk
Bytes,unknown,unknown
Shape,"(nan, 1048576)","(nan, 1048576)"
Count,88 Tasks,22 Chunks
Type,float64,numpy.ndarray


The output array has unknown chunk sizes because dask Series and Bags don't know their own length.
If you need the chunk sizes, as some `dask.array` operations do, call `X.compute_chunk_sizes()` to get them (at the cost of some computation).


Each block in `X` is a `scipy.sparse` matrix.

In [7]:
X.blocks[0].compute()

<505x1048576 sparse matrix of type '<class 'numpy.float64'>'
	with 39630 stored elements in Compressed Sparse Row format>

This is a document-term matrix. Each row is the hashed representation of the original article.

## Classification Pipeline

We can combine the `HashingVectorizer` with `Incremental` and a classifier like `SGDClassifier` to
create a classification pipeline.

We'll predict whether the document is assigned the `acq` topic. Recall that each row in the
Series is a list of topics, so `'acq' in x` applied to each row will give us the indicator.
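
As a quick illustration of the indicator, here is the same expression applied to a few plain Python lists. The `topic_lists` values are made-up examples in the shape of the `topics` column.

```python
# Hypothetical topic lists, shaped like rows of the `topics` column.
topic_lists = [
    ["interest", "retail", "ipi"],
    ["earn"],
    ["acq", "earn"],
]

# Membership in a list checks for an exact element match.
labels = [int("acq" in topics) for topics in topic_lists]
print(labels)

# Unlike a substring test on a string, a longer topic name does not match.
print(int("acq" in ["acquisition"]))
```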

In [8]:
y = text.topics.apply(
    lambda x: int('acq' in x), meta=('topics', 'int')
)
y.to_dask_array(lengths=True)

Unnamed: 0,Array,Chunk
Bytes,83.02 kB,4.57 kB
Shape,"(10377,)","(571,)"
Count,110 Tasks,22 Chunks
Type,int64,numpy.ndarray


In [9]:
import pandas as pd
import dask_ml.metrics
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

Because the input comes from a dask Series, with unknown chunk sizes, we need to specify `assume_equal_chunks=True`. This tells Dask-ML that we know that each partition in `X`
matches a partition in `y`.
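
The reason the partitions must line up can be sketched in plain Python. The idea is that an incremental wrapper feeds the estimator one block at a time through `partial_fit`, pairing the blocks of `X` and `y` by position. Everything below is a simplified illustration under that assumption: `MajorityClassifier` is a made-up toy estimator, and the block lists stand in for dask partitions.

```python
from collections import Counter


class MajorityClassifier:
    """Toy estimator with a partial_fit method (hypothetical, for illustration)."""

    def __init__(self):
        self.counts = Counter()
        self.n_blocks_seen = 0

    def partial_fit(self, X_block, y_block, classes=None):
        # A block of X is only meaningful next to the matching block of y.
        if len(X_block) != len(y_block):
            raise ValueError("X and y blocks must have the same number of rows")
        self.counts.update(y_block)
        self.n_blocks_seen += 1
        return self

    def predict(self, X_block):
        majority = self.counts.most_common(1)[0][0]
        return [majority for _ in X_block]


# Stand-ins for the partitions of X and y, paired by position.
X_blocks = [["doc a", "doc b"], ["doc c"], ["doc d", "doc e", "doc f"]]
y_blocks = [[0, 1], [0], [0, 0, 1]]

model = MajorityClassifier()
for X_block, y_block in zip(X_blocks, y_blocks):
    model.partial_fit(X_block, y_block, classes=[0, 1])

print(model.n_blocks_seen, dict(model.counts))
```

If the blocks of `X` and `y` were split differently, the positional pairing would attach the wrong labels to the wrong rows or fail outright, which is why the equal-chunks assumption has to be stated explicitly when the sizes are unknown.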

In [10]:
sgd = SGDClassifier(max_iter=5, tol=1e-3)
clf = Incremental(sgd, scoring='accuracy', assume_equal_chunks=True)
pipe = make_pipeline(vect, clf)

In [11]:
pipe.fit(text['text'], y, incremental__classes=[0, 1])

Pipeline(memory=None,
         steps=[('hashingvectorizer',
                 HashingVectorizer(alternate_sign=True, analyzer='word',
                                   binary=False, decode_error='strict',
                                   dtype=<class 'numpy.float64'>,
                                   encoding='utf-8', input='content',
                                   lowercase=True, n_features=1048576,
                                   ngram_range=(1, 1), norm='l2',
                                   preprocessor=None, stop_words=None,
                                   strip_accents=None,
                                   token_pattern='(?u)\\b\\w\\w+\\b',
                                   t...
                                                     class_weight=None,
                                                     early_stopping=False,
                                                     epsilon=0.1, eta0=0.0,
                                                     fit_interc

As usual, `Incremental.predict` lazily returns the predictions as a dask Array.

In [12]:
predictions = pipe.predict(text['text'])
predictions

Unnamed: 0,Array,Chunk
Bytes,unknown,unknown
Shape,"(nan,)","(nan,)"
Count,110 Tasks,22 Chunks
Type,int64,numpy.ndarray


We can compute them and the score in parallel with `dask_ml.metrics.accuracy_score`.

In [13]:
dask_ml.metrics.accuracy_score(y, predictions)

0.9774501300954033
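
For reference, accuracy is the fraction of predictions that match the true labels. The sketch below computes it on small in-memory lists; the `accuracy` helper and the label values are made up for illustration and are unrelated to the score reported above.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where prediction and label agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of labels")
    matches = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical labels and predictions: two of the eight disagree.
y_true = [0, 1, 1, 0, 0, 1, 0, 0]
y_pred = [0, 1, 0, 0, 0, 1, 0, 1]
score = accuracy(y_true, y_pred)
print(score)
```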

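This simple combination of a `HashingVectorizer` and an `SGDClassifier` reaches
about 97.7% accuracy on this prediction task. Note that the score is computed on
the same documents the pipeline was fit on, so it is likely optimistic.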