[SPARK-12217][ML] Document invalid handling for StringIndexer
Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation.

I wonder if I should also add a snippet to the code example, input welcome.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10257 from BenFradet/SPARK-12217.
BenFradet authored and jkbradley committed Dec 11, 2015
1 parent 1b82203 commit aea676c
Showing 1 changed file with 36 additions and 0 deletions.
36 changes: 36 additions & 0 deletions docs/ml-features.md
@@ -459,6 +459,42 @@ column, we should get the following:
"a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
index `2`.

Additionally, there are two strategies for how `StringIndexer` handles
unseen labels when you have fit a `StringIndexer` on one dataset and then use it
to transform another:

- throw an exception (which is the default)
- skip the row containing the unseen label entirely

**Examples**

Let's go back to our previous example but this time reuse our previously defined
`StringIndexer` on the following dataset:

~~~~
id | category
----|----------
0 | a
1 | b
2 | c
3 | d
~~~~

If you have not set how `StringIndexer` handles unseen labels, or have set it to
"error", an exception will be thrown when the unseen label "d" is encountered.
However, if you have called `setHandleInvalid("skip")`, the following dataset
will be generated:

~~~~
id | category | categoryIndex
----|----------|---------------
0 | a | 0.0
1 | b | 2.0
2 | c | 1.0
~~~~

Notice that the row containing "d" does not appear.
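
The two strategies can be illustrated with a minimal plain-Python sketch. It is
not Spark's implementation and does not use the Spark API: the helper names
`fit_string_indexer` and `transform`, the training labels, and the alphabetical
tie-break are all assumptions made for illustration. The training labels were
chosen only so that "a" is the most frequent, followed by "c" and then "b", as
in the example above.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Map each label to an index, most frequent label first (hypothetical helper)."""
    counts = Counter(labels)
    # Assumption: ties are broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


def transform(rows, label_to_index, handle_invalid="error"):
    """Append an index to each (id, category) row (hypothetical helper).

    handle_invalid="error" raises on an unseen label;
    handle_invalid="skip" drops the row containing it.
    """
    if handle_invalid not in ("error", "skip"):
        raise ValueError("handle_invalid must be 'error' or 'skip'")
    result = []
    for row_id, category in rows:
        if category not in label_to_index:
            if handle_invalid == "skip":
                continue  # drop the whole row, not just the label
            raise KeyError("Unseen label: " + category)
        result.append((row_id, category, label_to_index[category]))
    return result


# Assumed training labels: "a" appears three times, "c" twice, "b" once.
training_labels = ["a", "b", "c", "a", "a", "c"]
label_to_index = fit_string_indexer(training_labels)

# The second dataset from the example, which contains the unseen label "d".
new_rows = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
skipped = transform(new_rows, label_to_index, handle_invalid="skip")
print(skipped)
```

With `handle_invalid="skip"` the row with id `3` is dropped; with the default
`"error"` the same call raises instead of returning a result.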

<div class="codetabs">

<div data-lang="scala" markdown="1">