From 2cc977e15bfe86c0944fab9fb3f0609339d580a0 Mon Sep 17 00:00:00 2001
From: Yuhao Yang
Date: Fri, 6 May 2016 11:38:07 -0400
Subject: [PATCH 1/3] copy doc

---
 docs/ml-features.md              | 49 ++++++++++++++++++++++++++------
 docs/mllib-feature-extraction.md |  2 ++
 2 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/docs/ml-features.md b/docs/ml-features.md
index 0b8f2d773c2eb..9ba890fcc6b1a 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -18,27 +18,58 @@ This section covers algorithms for working with features, roughly divided into t
 
 # Feature Extractors
 
-## TF-IDF (HashingTF and IDF)
-
-[Term Frequency-Inverse Document Frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common text pre-processing step. In Spark ML, TF-IDF is separate into two parts: TF (+hashing) and IDF.
+## TF-IDF
+
+[Term frequency-inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)
+is a feature vectorization method widely used in text mining to reflect the importance of a term
+to a document in the corpus. Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.
+Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in document `$d$`, while
+document frequency `$DF(t, D)$` is the number of documents that contain term `$t$`. If we use only
+term frequency to measure importance, it is easy to over-emphasize terms that appear very often but
+carry little information about the document, e.g., "a", "the", and "of". If a term appears very
+often across the corpus, it carries little information specific to any particular document.
+Inverse document frequency is a numerical measure of how much information a term provides:
+`\[
+IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1},
+\]`
+where `$|D|$` is the total number of documents in the corpus. Since the logarithm is used, if a term
+appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid
+dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:
+`\[
+TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D).
+\]`
+There are several variants on the definition of term frequency and document frequency.
+In `spark.mllib`, we separate TF and IDF to make them flexible.
 
 **TF**: Both `HashingTF` and `CountVectorizer` can be used to generate the term frequency vectors.
 
 `HashingTF` is a `Transformer` which takes sets of terms and
 converts those sets into fixed-length feature vectors. In text processing, a "set of terms"
 might be a bag of words.
-The algorithm combines Term Frequency (TF) counts with the
-[hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
+`HashingTF` utilizes the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing).
+A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies
+are calculated based on the mapped indices. This approach avoids the need to compute a global
+term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
+collisions, where different raw features may become the same term after hashing. To reduce the
+chance of collision, we can increase the target feature dimension, i.e., the number of buckets
+of the hash table. The default feature dimension is `$2^{20} = 1,048,576$`.
 
 `CountVectorizer` converts text documents to vectors of term counts.
 Refer to [CountVectorizer ](ml-features.html#countvectorizer) for more details.
 **IDF**: `IDF` is an `Estimator` which is fit on a dataset and produces an `IDFModel`. The
-`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and scales each column.
-Intuitively, it down-weights columns which appear frequently in a corpus.
+`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and
+scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
+
+Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for RDD-based API.
 
-Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency.
+**Note:** `spark.mllib` doesn't provide tools for text segmentation.
+We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and
+[scalanlp/chalk](https://github.com/scalanlp/chalk).
 
-In the following code segment, we start with a set of sentences. We split each sentence into words using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.
+In the following code segment, we start with a set of sentences. We split each sentence into words
+using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into
+a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance
+when using text as features. Our feature vectors could then be passed to a learning algorithm.
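The paragraph above refers to a code segment that, in the rendered documentation, is pulled in from an external example file rather than appearing in this diff. Below is a minimal sketch of what such a segment looks like with the DataFrame-based API. It assumes an active Spark 2.x environment; the application name, toy sentences, column names, and the choice of `1 << 18` features are illustrative, not part of the patch.

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TfIdfSketch").getOrCreate()

// Toy corpus: one sentence per row, with a label column.
val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into a bag of words.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Hash each bag of words into a fixed-length term-frequency vector.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 18) // the feature dimension, i.e. the number of hash buckets

val featurizedData = hashingTF.transform(wordsData)

// Fit IDF on the corpus, then rescale the term-frequency vectors.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show()
```

`HashingTF` and `IDF` here follow the `Transformer`/`Estimator` split described above: hashing is stateless, while IDF must first be fit on the whole corpus to learn document frequencies.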
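The hashing trick itself can be illustrated without Spark. The following is a self-contained sketch of the idea described in the **TF** paragraph of the patch above, not `HashingTF`'s actual implementation: the use of `String.hashCode` as the hash function and the `nonNegativeMod` helper are assumptions made for this example.

```scala
// Sketch of the hashing trick: map terms to bucket indices with a hash
// function, then count occurrences per bucket instead of per term.
object HashingTrickSketch {
  // Map a possibly negative hash code onto the range [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  def main(args: Array[String]): Unit = {
    val numFeatures = 1 << 20 // feature dimension: 2^20 buckets
    val words = Seq("a", "spark", "the", "spark")

    // No global term-to-index map is ever built; the hash function alone
    // decides each term's column. Distinct terms that land in the same
    // bucket are merged, which is exactly the collision the text warns about.
    val termFrequencies = words
      .groupBy(w => nonNegativeMod(w.hashCode, numFeatures))
      .map { case (index, occurrences) => index -> occurrences.size.toDouble }

    println(termFrequencies) // sparse (index -> count) representation
  }
}
```

A larger `numFeatures` spreads terms over more buckets and so makes collisions less likely, at the cost of longer (but still sparse) vectors.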
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 7a97285032655..d81a51960f3f4 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -44,6 +44,8 @@ To reduce the chance of collision, we can increase the target feature dimension,
 the number of buckets
 of the hash table. The default feature dimension is `$2^{20} = 1,048,576$`.
 
+We recommend users to adapt to the DataFrame-based API in [ML user guide on TF-IDF](ml-features.html#tf-idf).
+
 **Note:** `spark.mllib` doesn't provide tools for text segmentation.
 We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and
 [scalanlp/chalk](https://github.com/scalanlp/chalk).

From b151cfb8534a94b5f7f6b3dd8f23b36a0cdc8ecf Mon Sep 17 00:00:00 2001
From: Yuhao Yang
Date: Wed, 11 May 2016 15:34:44 +0800
Subject: [PATCH 2/3] update default feature num

---
 docs/ml-features.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/ml-features.md b/docs/ml-features.md
index 567e5e8244be5..80a0bf9baa72c 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -39,7 +39,7 @@ dividing by zero for terms outside the corpus. The TF-IDF measure is simply the
 TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D).
 \]`
 There are several variants on the definition of term frequency and document frequency.
-In `spark.mllib`, we separate TF and IDF to make them flexible.
+In MLlib, we separate TF and IDF to make them flexible.
 
 **TF**: Both `HashingTF` and `CountVectorizer` can be used to generate the term frequency vectors.
 
@@ -51,7 +51,7 @@ are calculated based on the mapped indices. This approach avoids the need to com
 term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
 collisions, where different raw features may become the same term after hashing. To reduce the
 chance of collision, we can increase the target feature dimension, i.e., the number of buckets
-of the hash table. The default feature dimension is `$2^{20} = 1,048,576$`.
+of the hash table. The default feature dimension is `$2^{18} = 262,144$`.
 
 `CountVectorizer` converts text documents to vectors of term counts.
 Refer to [CountVectorizer ](ml-features.html#countvectorizer) for more details.

From e9dbe02a43ec70c2019608cb87037b2dcc6b4061 Mon Sep 17 00:00:00 2001
From: Yuhao Yang
Date: Tue, 17 May 2016 11:35:03 -0400
Subject: [PATCH 3/3] updates

---
 docs/ml-features.md              | 10 ++++++----
 docs/mllib-feature-extraction.md |  5 +++--
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/docs/ml-features.md b/docs/ml-features.md
index 80a0bf9baa72c..c44ace91f23f6 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -51,7 +51,9 @@ are calculated based on the mapped indices. This approach avoids the need to com
 term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
 collisions, where different raw features may become the same term after hashing. To reduce the
 chance of collision, we can increase the target feature dimension, i.e., the number of buckets
-of the hash table. The default feature dimension is `$2^{18} = 262,144$`.
+of the hash table. Since a simple modulo on the hashed value is used to determine the column index,
+it is advisable to use a power of two as the feature dimension, otherwise the features will
+not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
 
 `CountVectorizer` converts text documents to vectors of term counts.
 Refer to [CountVectorizer ](ml-features.html#countvectorizer) for more details.
@@ -60,12 +62,12 @@ of the hash table. Since a simple modulo on the hashed value is used to determi
 `IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and
 scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
 
-Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for RDD-based API.
-
-**Note:** `spark.mllib` doesn't provide tools for text segmentation.
+**Note:** `spark.ml` doesn't provide tools for text segmentation.
 We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and
 [scalanlp/chalk](https://github.com/scalanlp/chalk).
 
+**Examples**
+
 In the following code segment, we start with a set of sentences. We split each sentence into words
 using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into
 a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index d81a51960f3f4..4c027c84ec90b 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -10,6 +10,9 @@ displayTitle: Feature Extraction and Transformation - spark.mllib
 
 ## TF-IDF
 
+**Note:** We recommend using the DataFrame-based API, which is detailed in the [ML user guide on
+TF-IDF](ml-features.html#tf-idf).
+
 [Term frequency-inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)
 is a feature vectorization method widely used in text mining to reflect the importance of a term
 to a document in the corpus. Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.
@@ -44,8 +47,6 @@ To reduce the chance of collision, we can increase the target feature dimension,
 the number of buckets
 of the hash table. The default feature dimension is `$2^{20} = 1,048,576$`.
 
-We recommend users to adapt to the DataFrame-based API in [ML user guide on TF-IDF](ml-features.html#tf-idf).
-
 **Note:** `spark.mllib` doesn't provide tools for text segmentation.
 We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and
 [scalanlp/chalk](https://github.com/scalanlp/chalk).
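As a quick sanity check on the IDF and TF-IDF formulas defined in these patches, the short sketch below evaluates them by hand for a toy corpus. The corpus size, document frequencies, and term count are made up for illustration; only the formulas come from the text.

```scala
// Hand evaluation of the smoothed IDF and TF-IDF formulas from the docs.
object TfIdfByHand {
  // IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), as defined in the text.
  def idf(numDocs: Long, docFreq: Long): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  def main(args: Array[String]): Unit = {
    val numDocs = 3L // |D|: a corpus of three documents

    // A term that appears in every document carries no information:
    // IDF = log(4 / 4) = 0, thanks to the smoothing terms.
    println(idf(numDocs, 3L)) // 0.0

    // A term that appears in a single document: IDF = log(4 / 2) ~= 0.693.
    val rareIdf = idf(numDocs, 1L)
    println(rareIdf)

    // TFIDF(t, d, D) = TF(t, d) * IDF(t, D); say the term occurs twice in d.
    println(2.0 * rareIdf) // ~= 1.386
  }
}
```

Note how the `+ 1` smoothing terms keep the argument of the logarithm finite even when `DF(t, D) = 0`, i.e., for terms that never appear in the corpus.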