From 6259bad966a998c556bff498207007ce750325c4 Mon Sep 17 00:00:00 2001 From: GayathriMurali Date: Fri, 20 May 2016 14:28:55 -0700 Subject: [PATCH 01/11] User guide changes to CountVectorizer, QuantileDiscretizer and HashingTF --- docs/ml-features.md | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index 3db24a3840599..1ea9b5d93f16d 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -53,7 +53,10 @@ collisions, where different raw features may become the same term after hashing. chance of collision, we can increase the target feature dimension, i.e. the number of buckets of the hash table. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the feature dimension, otherwise the features will -not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`. +not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`. +An optional binary toggle parameter controls term frequency counts. When set to true all nonzero frequency counts are +set to 1. This is especially useful for discrete probabilistic models that model binary counts +rather than integer. `CountVectorizer` converts text documents to vectors of term counts. Refer to [CountVectorizer ](ml-features.html#countvectorizer) for more details. @@ -145,9 +148,11 @@ for more details on the API. passed to other algorithms like LDA. During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by - term frequency across the corpus. An optional parameter "minDF" also affects the fitting process + term frequency across the corpus. An optional parameter `minDF` also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be - included in the vocabulary. + included in the vocabulary. Another optional binary toggle parameter controls the output vector. + If set to true all nonzero counts are set to 1. This is especially useful for modelling discrete + probabilistic models that model binary events rather than integer counts **Examples** @@ -1092,14 +1097,11 @@ for more details on the API. ## QuantileDiscretizer `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned -categorical features. -The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts. -The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values. -This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may -find fewer depending on the data sample values. - -Note that the result may be different every time you run it, since the sample strategy behind it is -non-deterministic. +categorical features. The number of bins is set by the `numBuckets` parameter. +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala) +for a detailed description). The precision of the approximation can be controlled with the +`relativeError` parameter. When set to zero, exact quantiles are calculated. +The lower and upper bin bounds will be `-Infinity` and `+Infinity` covering all real values. 
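For illustration, a minimal sketch of these parameters in the Scala API (assuming a `SparkSession` named `spark`; setting `relativeError` to 0 requests exact quantiles, which is more expensive):

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer

val df = spark.createDataFrame(
  Seq((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
).toDF("id", "hour")

val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")     // continuous input column
  .setOutputCol("result")  // binned categorical output column
  .setNumBuckets(3)        // number of bins
  .setRelativeError(0)     // 0 = exact quantiles

discretizer.fit(df).transform(df).show()
```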
**Examples** From e8b920527396cc20cb3bac81fa7757263c7ce55d Mon Sep 17 00:00:00 2001 From: GayathriMurali Date: Fri, 20 May 2016 14:43:11 -0700 Subject: [PATCH 02/11] Review comments --- docs/ml-features.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index 1ea9b5d93f16d..e1aebcd5391ca 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -56,7 +56,7 @@ it is advisable to use a power of two as the feature dimension, otherwise the fe not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`. An optional binary toggle parameter controls term frequency counts. When set to true all nonzero frequency counts are set to 1. This is especially useful for discrete probabilistic models that model binary counts -rather than integer. +rather than integer counts. `CountVectorizer` converts text documents to vectors of term counts. Refer to [CountVectorizer ](ml-features.html#countvectorizer) for more details. @@ -152,7 +152,7 @@ for more details on the API. by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector. If set to true all nonzero counts are set to 1. This is especially useful for modelling discrete - probabilistic models that model binary events rather than integer counts + probabilistic models that model binary events rather than integer counts. **Examples** From d7a03defc51383cf0b90e59cf558adcef031d0de Mon Sep 17 00:00:00 2001 From: GayathriMurali Date: Mon, 23 May 2016 18:37:30 -0700 Subject: [PATCH 03/11] Review comments --- docs/ml-features.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index e1aebcd5391ca..782aa5684d099 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -151,7 +151,7 @@ for more details on the API. term frequency across the corpus. An optional parameter `minDF` also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector. - If set to true all nonzero counts are set to 1. This is especially useful for modelling discrete + If set to true all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary events rather than integer counts. **Examples** @@ -1098,9 +1098,9 @@ for more details on the API. `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned categorical features. The number of bins is set by the `numBuckets` parameter. -The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala) +The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala) for a detailed description). The precision of the approximation can be controlled with the -`relativeError` parameter. When set to zero, exact quantiles are calculated. +`relativeError` parameter. When set to zero, exact quantiles are calculated. Computing exact quantiles is an expensive operation. The lower and upper bin bounds will be `-Infinity` and `+Infinity` covering all real values. 
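The binary toggle described above for `HashingTF` and `CountVectorizer` can be sketched as follows (hypothetical column names and sample data; this assumes the toggle is exposed as a `setBinary` setter in the `ml` API):

```scala
import org.apache.spark.ml.feature.{CountVectorizer, HashingTF, Tokenizer}

val docs = spark.createDataFrame(Seq(
  (0, "a a b c"),
  (1, "b c c d")
)).toDF("id", "sentence")

val words = new Tokenizer().setInputCol("sentence").setOutputCol("words").transform(docs)

// HashingTF: with binary = true, every nonzero term frequency becomes 1.0
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("tf")
  .setNumFeatures(1 << 10)  // a power of two, as recommended above
  .setBinary(true)

// CountVectorizer: the same toggle applied to the fitted vocabulary counts,
// with minDF controlling which terms enter the vocabulary
val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("counts")
  .setVocabSize(10)
  .setMinDF(1)
  .setBinary(true)
  .fit(words)

hashingTF.transform(words).show(false)
cvModel.transform(words).show(false)
```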
**Examples** From 2cb29f81d697febe82ea0117cc40774df58328d2 Mon Sep 17 00:00:00 2001 From: GayathriMurali Date: Tue, 24 May 2016 14:26:53 -0700 Subject: [PATCH 04/11] Review comments --- docs/ml-features.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index 782aa5684d099..2cbaa81d79679 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1100,7 +1100,7 @@ for more details on the API. categorical features. The number of bins is set by the `numBuckets` parameter. The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala) for a detailed description). The precision of the approximation can be controlled with the -`relativeError` parameter. When set to zero, exact quantiles are calculated. Computing exact quantiles is an expensive operation. +`relativeError` parameter. When set to zero, exact quantiles are calculated(**Note:** Computing exact quantiles is an expensive operation). The lower and upper bin bounds will be `-Infinity` and `+Infinity` covering all real values. **Examples** From ce72fe3b64510914beca643d0c2015ec28a9b970 Mon Sep 17 00:00:00 2001 From: GayathriMurali Date: Wed, 25 May 2016 19:16:24 -0700 Subject: [PATCH 05/11] Review comments --- docs/ml-features.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index 2cbaa81d79679..ca8b6d6867096 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1100,7 +1100,7 @@ for more details on the API. categorical features. The number of bins is set by the `numBuckets` parameter. The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala) for a detailed description). The precision of the approximation can be controlled with the -`relativeError` parameter. When set to zero, exact quantiles are calculated(**Note:** Computing exact quantiles is an expensive operation). +`relativeError` parameter. When set to zero, exact quantiles are calculated (**Note:** Computing exact quantiles is an expensive operation). The lower and upper bin bounds will be `-Infinity` and `+Infinity` covering all real values. **Examples** From 4eb0394eee8d21592a832a19de105f88003b703b Mon Sep 17 00:00:00 2001 From: GayathriMurali Date: Thu, 26 May 2016 11:54:39 -0700 Subject: [PATCH 06/11] Review Comments --- docs/ml-features.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index ca8b6d6867096..b3642060eb8b7 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -46,7 +46,7 @@ In MLlib, we separate TF and IDF to make them flexible. `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words. `HashingTF` utilizes the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing). -A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies +A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is [MurmurHash 3](https://en.wikipedia.org/wiki/MurmurHash).Then term frequencies are calculated based on the mapped indices. 
This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the @@ -55,8 +55,7 @@ of the hash table. Since a simple modulo is used to transform the hash function it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`. An optional binary toggle parameter controls term frequency counts. When set to true all nonzero frequency counts are -set to 1. This is especially useful for discrete probabilistic models that model binary counts -rather than integer counts. +set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts. `CountVectorizer` converts text documents to vectors of term counts. Refer to [CountVectorizer ](ml-features.html#countvectorizer) for more details. @@ -151,8 +150,7 @@ for more details on the API. term frequency across the corpus. An optional parameter `minDF` also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector. - If set to true all nonzero counts are set to 1. This is especially useful for discrete - probabilistic models that model binary events rather than integer counts. + If set to true all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts. **Examples** From 83f2b6620a8e95847ce233867320d67c08292bbe Mon Sep 17 00:00:00 2001 From: GayathriMurali Date: Tue, 31 May 2016 16:11:23 -0700 Subject: [PATCH 07/11] Fixing QuantileDiscretizer doc and example --- docs/ml-features.md | 2 +- .../apache/spark/examples/ml/QuantileDiscretizerExample.scala | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index b3642060eb8b7..d593607ae0139 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1120,7 +1120,7 @@ Assume that we have a DataFrame with the columns `id`, `hour`: ~~~ `hour` is a continuous feature with `Double` type. We want to turn the continuous feature into -a categorical one. Given `numBuckets = 3`, we should get the following DataFrame: +a categorical one. 
Given `numBuckets = 3`, and computing exact quantiles (by setting `relativeError = 0`), we should get the following DataFrame: ~~~ id | hour | result diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala index 1a16515594161..316d918654031 100644 --- a/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala +++ b/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala @@ -38,6 +38,7 @@ object QuantileDiscretizerExample { .setInputCol("hour") .setOutputCol("result") .setNumBuckets(3) + .setRelativeError(0) val result = discretizer.fit(df).transform(df) result.show() From 4a06292a22bebe434e7a49bb7befe743b49344d8 Mon Sep 17 00:00:00 2001 From: GayathriMurali Date: Thu, 2 Jun 2016 20:06:43 -0700 Subject: [PATCH 08/11] Including relativeError in all examples with a note --- docs/ml-features.md | 2 +- .../spark/examples/ml/JavaQuantileDiscretizerExample.java | 6 +++++- examples/src/main/python/ml/quantile_discretizer_example.py | 4 +++- .../spark/examples/ml/QuantileDiscretizerExample.scala | 2 ++ 4 files changed, 11 insertions(+), 3 deletions(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index d593607ae0139..15023858968ce 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1098,7 +1098,7 @@ for more details on the API. categorical features. The number of bins is set by the `numBuckets` parameter. The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala) for a detailed description). The precision of the approximation can be controlled with the -`relativeError` parameter. When set to zero, exact quantiles are calculated (**Note:** Computing exact quantiles is an expensive operation). +`relativeError` parameter. When set to zero, exact quantiles are calculated (**Note:** Computing exact quantiles is an expensive operation). The default value of `relativeError` is 0.01. The lower and upper bin bounds will be `-Infinity` and `+Infinity` covering all real values. 
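To see the effect of `relativeError` directly, the linked `approxQuantile` helper can be called on a `DataFrame`; a minimal sketch with the same sample data (assuming a `SparkSession` named `spark`):

```scala
// approxQuantile(column, probabilities, relativeError); 0.0 requests exact quantiles,
// while larger values trade accuracy for a cheaper computation.
val df = spark.createDataFrame(
  Seq((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
).toDF("id", "hour")

val exactSplits  = df.stat.approxQuantile("hour", Array(1.0 / 3, 2.0 / 3), 0.0)
val approxSplits = df.stat.approxQuantile("hour", Array(1.0 / 3, 2.0 / 3), 0.1)
println(s"exact:       ${exactSplits.mkString(", ")}")
println(s"approximate: ${approxSplits.mkString(", ")}")
```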
**Examples** diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java index 16f58a852d8a2..27194f3fc0c6e 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java @@ -58,7 +58,11 @@ public static void main(String[] args) { QuantileDiscretizer discretizer = new QuantileDiscretizer() .setInputCol("hour") .setOutputCol("result") - .setNumBuckets(3); + .setNumBuckets(3) + .setRelativeError(0); + // Note that we compute exact quantiles here by setting `relativeError` to 0 for + // illustrative purposes, however in most cases the default parameter value should suffice + Dataset result = discretizer.fit(df).transform(df); result.show(); diff --git a/examples/src/main/python/ml/quantile_discretizer_example.py b/examples/src/main/python/ml/quantile_discretizer_example.py index 6ae7bb18f8c67..43ca09c198e5c 100644 --- a/examples/src/main/python/ml/quantile_discretizer_example.py +++ b/examples/src/main/python/ml/quantile_discretizer_example.py @@ -30,7 +30,9 @@ data = [(0, 18.0,), (1, 19.0,), (2, 8.0,), (3, 5.0,), (4, 2.2,)] dataFrame = spark.createDataFrame(data, ["id", "hour"]) - discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result") + discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result", relativeError=0) + # Note that we compute exact quantiles here by setting `relativeError` to 0 for + # illustrative purposes, however in most cases the default parameter value should suffice result = discretizer.fit(dataFrame).transform(dataFrame) result.show() diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala index 316d918654031..a89b7afd40324 100644 --- a/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala +++ b/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala @@ -39,6 +39,8 @@ object QuantileDiscretizerExample { .setOutputCol("result") .setNumBuckets(3) .setRelativeError(0) + // Note that we compute exact quantiles here by setting `relativeError` to 0 for + // illustrative purposes, however in most cases the default parameter value should suffice val result = discretizer.fit(df).transform(df) result.show() From 3b40a7f329107070a13381891857d6537c4898e0 Mon Sep 17 00:00:00 2001 From: GayathriMurali Date: Thu, 2 Jun 2016 21:33:29 -0700 Subject: [PATCH 09/11] Fixed python style issue --- examples/src/main/python/ml/quantile_discretizer_example.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/examples/src/main/python/ml/quantile_discretizer_example.py b/examples/src/main/python/ml/quantile_discretizer_example.py index 43ca09c198e5c..ca634ad5a73a7 100644 --- a/examples/src/main/python/ml/quantile_discretizer_example.py +++ b/examples/src/main/python/ml/quantile_discretizer_example.py @@ -30,7 +30,8 @@ data = [(0, 18.0,), (1, 19.0,), (2, 8.0,), (3, 5.0,), (4, 2.2,)] dataFrame = spark.createDataFrame(data, ["id", "hour"]) - discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result", relativeError=0) + discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result", + relativeError=0) # Note that we compute exact quantiles 
here by setting `relativeError` to 0 for # illustrative purposes, however in most cases the default parameter value should suffice From edb9e2b3a9ba9dbe4dea55d287c17650600fac8e Mon Sep 17 00:00:00 2001 From: GayathriMurali Date: Fri, 3 Jun 2016 14:19:34 -0700 Subject: [PATCH 10/11] Review comments --- .../spark/examples/ml/JavaQuantileDiscretizerExample.java | 5 ++--- examples/src/main/python/ml/quantile_discretizer_example.py | 4 ++-- .../spark/examples/ml/QuantileDiscretizerExample.scala | 4 ++-- 3 files changed, 6 insertions(+), 7 deletions(-) diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java index 27194f3fc0c6e..fcca1092ea7cb 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java @@ -55,14 +55,13 @@ public static void main(String[] args) { Dataset df = spark.createDataFrame(data, schema); + // Note that we compute exact quantiles here by setting `relativeError` to 0 for + // illustrative purposes, however in most cases the default parameter value should suffice QuantileDiscretizer discretizer = new QuantileDiscretizer() .setInputCol("hour") .setOutputCol("result") .setNumBuckets(3) .setRelativeError(0); - // Note that we compute exact quantiles here by setting `relativeError` to 0 for - // illustrative purposes, however in most cases the default parameter value should suffice - Dataset result = discretizer.fit(df).transform(df); result.show(); diff --git a/examples/src/main/python/ml/quantile_discretizer_example.py b/examples/src/main/python/ml/quantile_discretizer_example.py index ca634ad5a73a7..fa0cf72f91c64 100644 --- a/examples/src/main/python/ml/quantile_discretizer_example.py +++ b/examples/src/main/python/ml/quantile_discretizer_example.py @@ -30,10 +30,10 @@ data = [(0, 18.0,), (1, 19.0,), (2, 8.0,), (3, 5.0,), (4, 2.2,)] dataFrame = spark.createDataFrame(data, ["id", "hour"]) - discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result", - relativeError=0) # Note that we compute exact quantiles here by setting `relativeError` to 0 for # illustrative purposes, however in most cases the default parameter value should suffice + discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result", + relativeError=0) result = discretizer.fit(dataFrame).transform(dataFrame) result.show() diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala index a89b7afd40324..b9aa276d832b1 100644 --- a/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala +++ b/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala @@ -34,13 +34,13 @@ object QuantileDiscretizerExample { val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)) val df = spark.createDataFrame(data).toDF("id", "hour") + // Note that we compute exact quantiles here by setting `relativeError` to 0 for + // illustrative purposes, however in most cases the default parameter value should suffice val discretizer = new QuantileDiscretizer() .setInputCol("hour") .setOutputCol("result") .setNumBuckets(3) .setRelativeError(0) - // Note that we compute exact quantiles here by setting `relativeError` to 0 for - // illustrative 
purposes, however in most cases the default parameter value should suffice val result = discretizer.fit(df).transform(df) result.show() From 87844efaeee06a9d4cd7c30cbd6d4b947978112c Mon Sep 17 00:00:00 2001 From: GayathriMurali Date: Fri, 10 Jun 2016 15:56:37 -0700 Subject: [PATCH 11/11] Remove default value inclusion --- docs/ml-features.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index 15023858968ce..d593607ae0139 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1098,7 +1098,7 @@ for more details on the API. categorical features. The number of bins is set by the `numBuckets` parameter. The bin ranges are chosen using an approximate algorithm (see the documentation for [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala) for a detailed description). The precision of the approximation can be controlled with the -`relativeError` parameter. When set to zero, exact quantiles are calculated (**Note:** Computing exact quantiles is an expensive operation). The default value of `relativeError` is 0.01. +`relativeError` parameter. When set to zero, exact quantiles are calculated (**Note:** Computing exact quantiles is an expensive operation). The lower and upper bin bounds will be `-Infinity` and `+Infinity` covering all real values. **Examples**