# Working with Text Data

Dask-ML includes several ways to [process text data](https://ml.dask.org/modules/api.html#dask-ml-feature-extraction-text-feature-extraction). Typically these work with the [`Dask DataFrame`](https://docs.dask.org/en/latest/dataframe.html) or [`Bag`](https://docs.dask.org/en/latest/bag.html) collections, which can reference larger-than-memory datasets stored on disk or in distributed memory on a Dask Cluster.

In [None]:
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=1, memory_limit='2GB')
client

In this example, we'll work with the 20 newsgroups dataset from scikit-learn. Each element in the dataset has a bit of metadata and the full text of a post.

In [None]:
import sklearn.datasets

news = sklearn.datasets.fetch_20newsgroups()
print(news.data[0][:500])

`news.data` is a list of documents (strings). We'll load the dataset using `dask.bag.from_sequence`, but in practice you would want to load the data on the workers. See https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask and https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections.

In [None]:
import dask
import numpy as np
import dask.bag as db

documents = db.from_sequence(news['data'], npartitions=10).persist()
documents

## Feature Extraction

Dask-ML's feature extractors turn Bags or DataFrames of raw documents (strings) into Dask Arrays backed by scipy.sparse matrices.

If the limitations of `HashingVectorizer` (no inverse transform, no IDF weighting) are acceptable, then we strongly recommend using it over something like `CountVectorizer`. `HashingVectorizer` is completely stateless and so is much easier (and faster) to use in a distributed setting.
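
To make "stateless" concrete, here is a minimal sketch of the hashing trick in plain Python. It is an illustration only, not Dask-ML's implementation: the helper `hash_tokens`, the small `N_FEATURES`, the whitespace tokenizer, and the use of `zlib.crc32` are all assumptions made for this example (scikit-learn's implementation uses MurmurHash3).

```python
import zlib
from collections import Counter

N_FEATURES = 2 ** 10  # hypothetical size; HashingVectorizer's default is 2 ** 20


def hash_tokens(document, n_features=N_FEATURES):
    """Map a document to {column index: count} without any fitted state."""
    counts = Counter()
    for token in document.lower().split():
        # crc32 is deterministic across processes, unlike Python's built-in hash()
        column = zlib.crc32(token.encode("utf-8")) % n_features
        counts[column] += 1
    return dict(counts)


# Two hypothetical partitions, as they might live on two different workers.
partition_a = ["the cat sat", "the dog barked"]
partition_b = ["the cat barked"]

rows_a = [hash_tokens(doc) for doc in partition_a]
rows_b = [hash_tokens(doc) for doc in partition_b]

# "cat" maps to the same column in both partitions, even though
# neither partition ever saw the other's documents.
cat_column = zlib.crc32(b"cat") % N_FEATURES
```

Because the column for a token depends only on the token itself, each partition can be transformed independently and no information has to be gathered from the whole dataset first.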

Note that because `HashingVectorizer` is stateless, the calls to `fit` and `transform` are nearly instant.

In [None]:
import dask_ml.feature_extraction

hashing_vectorizer = dask_ml.feature_extraction.text.HashingVectorizer()
%time hashing_vectorizer.fit(documents)
%time transformed = hashing_vectorizer.transform(documents)

It's only when you call `.compute()` on the result that the data is loaded and transformed.

In [None]:
%time transformed.compute()

`CountVectorizer` is not stateless unless you provide a `vocabulary` ahead of time. When no vocabulary is provided, `CountVectorizer.fit` or `CountVectorizer.fit_transform` will need to load data to discover the unique set of terms in the documents.

In [None]:
vectorizer = dask_ml.feature_extraction.text.CountVectorizer()
%time result = vectorizer.fit_transform(documents)

Now `.fit_transform` (and `fit` and `transform`) is much more expensive since all the documents must be loaded to determine the `vocabulary`.

The result is again a Dask `Array` backed by `scipy.sparse.csr_matrix` objects. We can bring it back to the client with `.compute()`.

In [None]:
%time result.compute()

Notice that we persisted `documents` earlier. If possible, persisting the input documents is preferable, since it avoids reading the data twice: once to discover the vocabulary and a second time to transform. If the dataset is larger than (distributed) memory, then two passes over the source data will be necessary.
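
The two passes can be sketched in plain Python. This is a simplified illustration, not how Dask-ML schedules the work: the `partitions` data and the `count_terms` helper are hypothetical, and tokenization is a bare `str.split` (scikit-learn's default tokenizer, for instance, drops single-character tokens).

```python
partitions = [
    ["the cat sat", "the dog barked"],  # hypothetical partition 1
    ["a bird sang"],                    # hypothetical partition 2
]

# Pass 1: every partition must be read to discover the full set of terms.
terms = set()
for partition in partitions:
    for document in partition:
        terms.update(document.split())
vocabulary = {term: column for column, term in enumerate(sorted(terms))}


# Pass 2: with the vocabulary known, each document can be counted independently.
def count_terms(document, vocabulary):
    row = [0] * len(vocabulary)
    for term in document.split():
        row[vocabulary[term]] += 1
    return row


matrix = [count_terms(doc, vocabulary) for part in partitions for doc in part]
```

The first loop is the step that cannot be done partition by partition in isolation, which is why keeping the documents in memory between the two passes helps.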

## A note on vocabularies

You can also provide a vocabulary ahead of time, which avoids the need for making two passes over the data. This makes operations like `vectorizer.transform` instantaneous, since no vocabulary needs to be discovered. However, vocabularies can become quite large. Consider scattering the vocabulary to the cluster ahead of time to avoid bloating the size of the `CountVectorizer` object. Dask-ML's `CountVectorizer` works just fine when the `vocabulary` is a pointer to a piece of data on the cluster (a `Future`).

In [None]:
# reuse the vocabulary from the previously fitted estimator.
# In practice this would come from an external source.
vocabulary = vectorizer.vocabulary_
remote_vocabulary, = client.scatter([vocabulary], broadcast=True)

vectorizer2 = dask_ml.feature_extraction.text.CountVectorizer(
    vocabulary=remote_vocabulary
)

`CountVectorizer.transform` doesn't need to do any real work now, so it's fast.

In [None]:
%time result = vectorizer2.transform(documents)

In [None]:
%time result.compute()
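
As a rough sketch of why a fixed vocabulary removes the discovery pass, the plain-Python function below counts terms for a single partition using only the vocabulary it is handed. The `vocabulary` contents and the `transform_partition` helper are hypothetical and stand in for the real estimator.

```python
# Hypothetical vocabulary, e.g. loaded from an external source.
vocabulary = {"cat": 0, "dog": 1, "the": 2}


def transform_partition(partition, vocabulary):
    """Count known terms for one partition; needs no data from other partitions."""
    rows = []
    for document in partition:
        row = [0] * len(vocabulary)
        for term in document.split():
            column = vocabulary.get(term)
            if column is not None:  # terms outside the vocabulary are skipped
                row[column] += 1
        rows.append(row)
    return rows


rows = transform_partition(["the cat saw the dog", "a bird sang"], vocabulary)
```

Terms that are missing from the supplied vocabulary contribute nothing to the output, so the vocabulary should cover the terms you care about.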

See https://scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick for more on the problems with large vocabularies. That page recommends "feature hashing" as a possible solution.

## Feature Hashing

Feature hashing transforms a DataFrame or Bag of inputs (mappings or strings) to a sparse array. It is completely stateless, and so doesn't suffer from the same issues as `CountVectorizer`. See https://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing for more.

In [None]:
hasher = dask_ml.feature_extraction.text.FeatureHasher(input_type="string")
result = hasher.transform(documents)
result

In [None]:
%time result.compute()
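
For mapping inputs, the idea can be sketched as follows. This is a toy version under stated assumptions, not the library's code: `hash_features`, the tiny `n_features=16`, the `zlib.crc32` hash, and the example feature names are all made up for illustration.

```python
import zlib


def hash_features(mapping, n_features=16):
    """Hash a {feature name: value} mapping into a fixed-length row."""
    row = [0.0] * n_features
    for name, value in mapping.items():
        h = zlib.crc32(name.encode("utf-8"))  # unsigned 32-bit integer
        column = h % n_features
        # Use the top bit of the hash as a sign, so that features which
        # collide in the same column tend to cancel rather than accumulate.
        sign = 1.0 if (h >> 31) & 1 == 0 else -1.0
        row[column] += sign * value
    return row


row = hash_features({"word_count": 120.0, "has_url": 1.0})
```

The output width is fixed by `n_features` alone, so no vocabulary has to be stored or shipped to the workers; the cost is that distinct features can share a column.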

## Text Vectorization Pipeline

The rest of this example is adapted from [this scikit-learn example](https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#sphx-glr-auto-examples-applications-plot-out-of-core-classification-py).

The primary differences are that

* We fit the entire model, including text vectorization, as a pipeline.
* We use dask collections like [Dask Bag](https://docs.dask.org/en/latest/bag.html), [Dask Dataframe](https://docs.dask.org/en/latest/dataframe.html), and [Dask Array](https://docs.dask.org/en/latest/array.html)
  rather than generators to work with larger-than-memory datasets.

We'll load the documents and targets directly into a dask DataFrame.
In practice, on a larger-than-memory dataset, you would likely load the
documents from disk or cloud storage using `dask.bag` or `dask.delayed`.

In [None]:
import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(pd.DataFrame({"text": news.data, "target": news.target}),
                    npartitions=25)

df

## Classification Pipeline

We can combine the [HashingVectorizer](https://ml.dask.org/modules/generated/dask_ml.feature_extraction.text.HashingVectorizer.html#dask_ml.feature_extraction.text.HashingVectorizer) with [Incremental](https://ml.dask.org/modules/generated/dask_ml.wrappers.Incremental.html#dask_ml.wrappers.Incremental) and a classifier like scikit-learn's `SGDClassifier` to
create a classification pipeline.

We'll predict whether the topic was in the `comp` category.

In [None]:
news.target_names

In [None]:
import numpy as np

positive = np.arange(len(news.target_names))[['comp' in x for x in news.target_names]]
y = df['target'].isin(positive).astype(int)
y
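
The same labelling logic, written out in plain Python on a small made-up sample, may be easier to follow. The `target_names` list here is only a subset of the real names (so the indices do not match the real dataset) and `targets` is hypothetical.

```python
# A subset of the 20 newsgroups target names, used here for illustration.
target_names = ["alt.atheism", "comp.graphics", "comp.windows.x", "sci.space"]
targets = [0, 1, 3, 2, 1]  # hypothetical integer labels for five posts

# Indices of every category whose name contains "comp".
positive = {i for i, name in enumerate(target_names) if "comp" in name}

# 1 if the post belongs to a "comp" category, 0 otherwise.
y_example = [int(t in positive) for t in targets]
```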

In [None]:
import sklearn.linear_model
import sklearn.pipeline

import dask_ml.metrics
import dask_ml.wrappers

Because the input comes from a dask Series, with unknown chunk sizes, we need to specify `assume_equal_chunks=True`. This tells Dask-ML that we know that each partition in `X`
matches a partition in `y`.

In [None]:
sgd = sklearn.linear_model.SGDClassifier(
    tol=1e-3
)
vect = dask_ml.feature_extraction.text.HashingVectorizer()
clf = dask_ml.wrappers.Incremental(
    sgd, scoring='accuracy', assume_equal_chunks=True
)
pipe = sklearn.pipeline.make_pipeline(vect, clf)

`SGDClassifier.partial_fit` needs to know the full set of classes up front.
Because our `sgd` is wrapped inside an `Incremental`, we need to pass it through
as the `incremental__classes` keyword argument in `fit`.

In [None]:
pipe.fit(df['text'], y,
         incremental__classes=[0, 1]);
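
To see why the classes must be declared up front, consider a small scikit-learn-only sketch in which the first batch happens to contain a single class. The random data and variable names are invented for this example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_first, y_first = rng.normal(size=(20, 5)), np.zeros(20, dtype=int)   # only class 0
X_second, y_second = rng.normal(size=(20, 5)), np.ones(20, dtype=int)  # only class 1

clf = SGDClassifier(tol=1e-3, random_state=0)
# Declaring both classes on the first call lets the model account for class 1
# before it has seen any example of it.
clf.partial_fit(X_first, y_first, classes=[0, 1])
clf.partial_fit(X_second, y_second)

# Without `classes`, the very first call is rejected.
try:
    SGDClassifier(tol=1e-3).partial_fit(X_first, y_first)
    raised = False
except ValueError:
    raised = True
```

When training block by block, any single block may be missing some labels, so the estimator cannot infer the full label set from the data it has seen so far.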

As usual, `Incremental.predict` lazily returns the predictions as a dask Array.

In [None]:
predictions = pipe.predict(df['text'])
predictions

We can compute the predictions and score in parallel with `dask_ml.metrics.accuracy_score`.

In [None]:
dask_ml.metrics.accuracy_score(y, predictions)
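
Accuracy lends itself to parallel computation because it reduces to two sums. The sketch below shows the general idea on hypothetical label chunks; it is not a description of Dask-ML's internals, and `chunk_stats` is a made-up helper.

```python
# Hypothetical (true, predicted) label chunks, one pair per partition.
chunks = [
    ([1, 0, 1], [1, 0, 0]),
    ([0, 0], [0, 0]),
    ([1, 1, 0, 1], [1, 0, 0, 1]),
]


def chunk_stats(y_true, y_pred):
    """Return (number correct, number of samples) for one chunk."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct, len(y_true)


# Each chunk could be scored on a separate worker.
stats = [chunk_stats(t, p) for t, p in chunks]

# Only the small per-chunk summaries need to be combined.
correct = sum(c for c, _ in stats)
total = sum(n for _, n in stats)
accuracy = correct / total
```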

This simple combination of a HashingVectorizer and SGDClassifier is
pretty effective at this prediction task.