[SPARK-3143][MLLIB] add tf-idf user guide #2061

Closed
wants to merge 2 commits into from

Conversation

mengxr
Contributor

@mengxr mengxr commented Aug 20, 2014

Moved TF-IDF before Word2Vec because the former is more basic. I also added a link for Word2Vec. @atalwalkar

@SparkQA

SparkQA commented Aug 20, 2014

QA tests have started for PR 2061 at commit a5ea4b4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 20, 2014

QA tests have finished for PR 2061 at commit a5ea4b4.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.
Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in document `$d$`.
And document frequency `$DF(t, D)$` is the number of documents that contain term `$t$`.
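
The two definitions quoted above can be illustrated with a minimal sketch in plain Python. This is not the MLlib API and not code from this PR; the helper names `term_frequency` and `document_frequency` and the toy corpus are hypothetical, and documents are assumed to be already tokenized into lists of terms.

```python
from collections import Counter


def term_frequency(term, document):
    """TF(t, d): number of times term t appears in document d (a list of tokens)."""
    return Counter(document)[term]


def document_frequency(term, corpus):
    """DF(t, D): number of documents in corpus D that contain term t."""
    return sum(1 for document in corpus if term in document)


# Hypothetical toy corpus: each document is a list of already-tokenized terms.
corpus = [
    ["spark", "mllib", "spark"],
    ["tf", "idf", "spark"],
    ["word2vec", "mllib"],
]

print(term_frequency("spark", corpus[0]))   # "spark" occurs twice in the first document
print(document_frequency("spark", corpus))  # "spark" occurs in two of the three documents
```

Note that TF counts occurrences within one document, while DF counts documents, so a term repeated many times in a single document still contributes only 1 to DF.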
Contributor

"...$d$. And..." -> "...$d$, while..."

Contributor Author

done

@SparkQA

SparkQA commented Aug 20, 2014

QA tests have started for PR 2061 at commit ca04c70.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 20, 2014

QA tests have finished for PR 2061 at commit ca04c70.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

## Word2Vec

Word2Vec computes distributed vector representation of words. The main advantage of the distributed
[Word2Vec](https://code.google.com/p/word2vec/) computes distributed vector representation of words.
Contributor

What does "distributed" mean in "distributed vector representation"? Does it refer to the fact that the computation is distributed? If so, could we say "...computes vector representation of words in a distributed fashion."

Contributor Author

It is used in the original paper and the term "distributed" is from http://www.indiana.edu/~clcl/BEAGLE/Jones_Mewhort_PR.pdf . I have trouble understanding "distributed vector representation" as well. I think "distributed" means we map a single word to multiple values ....

Contributor Author

This is independent of this PR. Does the current doc look good to you?

Contributor

yes, the TF-IDF stuff LGTM.

@asfgit asfgit closed this in e157187 Aug 21, 2014
asfgit pushed a commit that referenced this pull request Aug 21, 2014
Moved TF-IDF before Word2Vec because the former is more basic. I also added a link for Word2Vec. atalwalkar

Author: Xiangrui Meng <meng@databricks.com>

Closes #2061 from mengxr/tfidf-doc and squashes the following commits:

ca04c70 [Xiangrui Meng] address comments
a5ea4b4 [Xiangrui Meng] add tf-idf user guide

(cherry picked from commit e157187)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@mengxr
Contributor Author

mengxr commented Aug 21, 2014

I've merged this into master and branch-1.1. Thanks @atalwalkar for reviewing!

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014