From 8c293a5c93efc1bb196dcf3ac5b42d0827141caa Mon Sep 17 00:00:00 2001
From: BenFradet
Date: Thu, 10 Dec 2015 22:40:06 +0100
Subject: [PATCH 1/2] added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features doc

---
 docs/ml-features.md | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/docs/ml-features.md b/docs/ml-features.md
index 7ad7c4eb7ea65..3478963056ddb 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -459,6 +459,42 @@ column, we should get the following:
 "a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
 index `2`.
 
+Additionaly, there are two strategies regarding how `StringIndexer` will handle
+unseen labels when you have set up a `StringIndexer` on a dataset which you want
+to reuse on another:
+
+- throw an exception (which is the default)
+- skip the row containing the unseen label entirely
+
+**Examples**
+
+Let's go back to our previous example but this time reuse our previously defined
+`StringIndexer` on the following dataset:
+
+~~~~
+ id | category
+----|----------
+ 0  | a
+ 1  | b
+ 2  | c
+ 3  | d
+~~~~
+
+If you've not set how `StringIndexer` handles unseen labels or set it to
+"error", an exception will be thrown.
+However, if you had called `setHandleInvalid("skip")`, the following dataset
+will be generated:
+
+~~~~
+ id | category | categoryIndex
+----|----------|---------------
+ 0  | a        | 0.0
+ 1  | b        | 2.0
+ 2  | c        | 1.0
+~~~~
+
+Notice that the row containing "d" does not appear.
+
From 0fb5e2b9880477501dc959f503fb10d142350ee9 Mon Sep 17 00:00:00 2001
From: BenFradet
Date: Fri, 11 Dec 2015 23:28:14 +0100
Subject: [PATCH 2/2] addressed comments

---
 docs/ml-features.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/ml-features.md b/docs/ml-features.md
index 3478963056ddb..72eb2de6aeae1 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -460,8 +460,8 @@ column, we should get the following:
 index `2`.
 
 Additionaly, there are two strategies regarding how `StringIndexer` will handle
-unseen labels when you have set up a `StringIndexer` on a dataset which you want
-to reuse on another:
+unseen labels when you have fit a `StringIndexer` on one dataset and then use it
+to transform another:
 
 - throw an exception (which is the default)
 - skip the row containing the unseen label entirely
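
Note on the documented behavior: the two strategies described in the added paragraph can be sketched in plain Python. This is only an illustration of the semantics in the patch, not Spark's implementation and not the Spark API. The helper names `fit_labels` and `transform` are hypothetical, the tie-breaking rule is an assumption, and the training column is an assumption chosen to match the indices quoted in the patch ("a" is `0`, "c" is `1`, "b" is `2`).

```python
from collections import Counter


def fit_labels(values):
    """Build a label -> index map, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; ties broken alphabetically (assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(rows, index, handle_invalid="error"):
    """Append the label index to each (id, category) row.

    handle_invalid="error" raises on an unseen label (the default);
    handle_invalid="skip" drops the row containing the unseen label.
    """
    out = []
    for row_id, category in rows:
        if category not in index:
            if handle_invalid == "skip":
                continue  # drop the whole row
            raise ValueError("Unseen label: %s" % category)
        out.append((row_id, category, index[category]))
    return out


# Assumed training column, consistent with the indices stated in the patch.
index = fit_labels(["a", "b", "c", "a", "a", "c"])

# The second dataset from the patch, which contains the unseen label "d".
new_rows = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
skipped = transform(new_rows, index, handle_invalid="skip")
```

With `handle_invalid="skip"` the row for "d" is dropped, which matches the second table in the patch. With the default `"error"` the same call raises instead.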