[ML][SPARK-6529] Add Word2Vec transformer #5596

yinxusen · 2015-04-20T16:07:15Z

See JIRA issue here.

There are some notes:

I add learningRate in sharedParams since it is a common parameter for ML algorithms.
We will not support transform of finding synonyms from a Vector, which will support in further JIRA issues.
Word2Vec is different with other ML models that its training set and transformed set are different. Its training set is an RDD[Iterable[String]] which represents documents, but the transformed set we want is an RDD[String] that represents unique words. So you have to switch your inputCol in these two stages.

SparkQA · 2015-04-20T16:13:47Z

Test build #30592 has started for PR 5596 at commit 618abd0.

oefirouz · 2015-04-20T23:52:18Z

mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala

+  def getSeed: Long = get(seed)
+
+  /**
+   * The minimum count of words that can be kept in training set.


this wording is unclear, perhaps it would just be easier to copy the comments from the implementation?

so for example: "The minimum number of times a token must appear to be included in the word2vec model's vocabulary"

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

Sure, I'll refine it in the next commit. Thx

jkbradley · 2015-04-21T02:35:18Z

A few initial thoughts:

learningRate: This is the same as stepSize. We should probably only use one of the two. I'm OK with either but would vote for stepSize if I had to choose since that's more common in the optimization literature, which is ML is referencing more and more. Ping @mengxr -- opinions?
Very good point about the different input column types. I'm adding a note to the JIRA about it since it's a design issue we should discuss a little.

yinxusen · 2015-04-21T04:11:17Z

@jkbradley Yes, I think stepSize is good enough. Then I think we can break the name consistency between Word2Vec in mllib and ml. Say, learningRate of Word2Vec in mllib package can be substituted with stepSize. There is also another different name, i.e. the numIterations and maxIter.

jkbradley · 2015-04-21T05:31:31Z

@yinxusen True, we'll have to introduce some inconsistencies between .ml and .mllib no matter what. For iterations, I like "max" since it's more precise that "num" for some algorithms.

mengxr · 2015-04-22T06:31:57Z

mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala

+      .setNumSynonyms(1)
+      .transform(wordsDF)
+
+    assert(


move assert into foreach to get more information.

yinxusen · 2015-04-22T14:01:35Z

@mengxr As we talked, I average all vectors of words in a sentence as the output column.

yinxusen · 2015-04-22T14:02:28Z

@jkbradley I add stepSize as a sharedParam in the codegen file.

SparkQA · 2015-04-22T14:05:01Z

Test build #30758 has finished for PR 5596 at commit b54399f.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase
- trait HasStepSize extends Params
This patch does not change any dependencies.

yinxusen · 2015-04-22T14:09:17Z

@mengxr Maybe the generated file sharedParams breaks the max columns in a line rule.

jkbradley · 2015-04-22T16:59:15Z

@yinxusen Run dev/scalastyle
It's not the generated sharedParams since style checking skips that file.

yinxusen · 2015-04-23T09:46:28Z

@jkbradley Thanks! Can't believe that I knew the tool just now.

SparkQA · 2015-04-23T11:11:43Z

Test build #30830 has finished for PR 5596 at commit 566ec20.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase
- trait HasStepSize extends Params
This patch does not change any dependencies.

yinxusen · 2015-04-24T14:21:22Z

@mengxr ready to review

mengxr · 2015-04-24T15:37:26Z

mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala

+  /**
+   * A random seed to random an initial vector.
+   */
+  final val seed = new LongParam(this, "seed", "a random seed to random an initial vector")


If #5626 gets merged first, please update this PR to use shared params.

Adds support for the math functions for DataFrames in PySpark. rxin I love Davies. Author: Burak Yavuz <brkyvz@gmail.com> Closes apache#5750 from brkyvz/python-math-udfs and squashes the following commits: 7c4f563 [Burak Yavuz] removed is_math 3c4adde [Burak Yavuz] cleanup imports d5dca3f [Burak Yavuz] moved math functions to mathfunctions 25e6534 [Burak Yavuz] addressed comments v2.0 d3f7e0f [Burak Yavuz] addressed comments and added tests 7b7d7c4 [Burak Yavuz] remove tests for removed methods 33c2c15 [Burak Yavuz] fixed python style 3ee0c05 [Burak Yavuz] added python functions

This patch adds SQL to the set of excluded libraries when generating a callSite. This makes the callSite mechanism work properly for the data frame API. I also added a small improvement for JDBC queries where we just use the string "Spark JDBC Server Query" instead of trying to give a callsite that doesn't make any sense to the user. Before (DF): ![screen shot 2015-04-28 at 1 29 26 pm](https://cloud.githubusercontent.com/assets/320616/7380170/ef63bfb0-edae-11e4-989c-f88a5ba6bbee.png) After (DF): ![screen shot 2015-04-28 at 1 34 58 pm](https://cloud.githubusercontent.com/assets/320616/7380181/fa7f6d90-edae-11e4-9559-26f163ed63b8.png) After (JDBC): ![screen shot 2015-04-28 at 2 00 10 pm](https://cloud.githubusercontent.com/assets/320616/7380185/02f5b2a4-edaf-11e4-8e5b-99bdc3df66dd.png) Author: Patrick Wendell <patrick@databricks.com> Closes apache#5757 from pwendell/dataframes and squashes the following commits: 0d931a4 [Patrick Wendell] Attempting to fix PySpark tests 85bf740 [Patrick Wendell] [SPARK-7204] Fix callsite for dataframe operations.

…egations This patch adds managed-memory-based aggregation to Spark SQL / DataFrames. Instead of working with Java objects, this new aggregation path uses `sun.misc.Unsafe` to manipulate raw memory. This reduces the memory footprint for aggregations, resulting in fewer spills, OutOfMemoryErrors, and garbage collection pauses. As a result, this allows for higher memory utilization. It can also result in better cache locality since objects will be stored closer together in memory. This feature can be eanbled by setting `spark.sql.unsafe.enabled=true`. For now, this feature is only supported when codegen is enabled and only supports aggregations for which the grouping columns are primitive numeric types or strings and aggregated values are numeric. ### Managing memory with sun.misc.Unsafe This patch supports both on- and off-heap managed memory. - In on-heap mode, memory addresses are identified by the combination of a base Object and an offset within that object. - In off-heap mode, memory is addressed directly with 64-bit long addresses. To support both modes, functions that manipulate memory accept both `baseObject` and `baseOffset` fields. In off-heap mode, we simply pass `null` as `baseObject`. We allocate memory in large chunks, so memory fragmentation and allocation speed are not significant bottlenecks. By default, we use on-heap mode. To enable off-heap mode, set `spark.unsafe.offHeap=true`. To track allocated memory, this patch extends `SparkEnv` with an `ExecutorMemoryManager` and supplies each `TaskContext` with a `TaskMemoryManager`. These classes work together to track allocations and detect memory leaks. ### Compact tuple format This patch introduces `UnsafeRow`, a compact row layout. In this format, each tuple has three parts: a null bit set, fixed length values, and variable-length values: ![image](https://cloud.githubusercontent.com/assets/50748/7328538/2fdb65ce-ea8b-11e4-9743-6c0f02bb7d1f.png) - Rows are always 8-byte word aligned (so their sizes will always be a multiple of 8 bytes) - The bit set is used for null tracking: - Position _i_ is set if and only if field _i_ is null - The bit set is aligned to an 8-byte word boundary. - Every field appears as an 8-byte word in the fixed-length values part: - If a field is null, we zero out the values. - If a field is variable-length, the word stores a relative offset (w.r.t. the base of the tuple) that points to the beginning of the field's data in the variable-length part. - Each variable-length data type can have its own encoding: - For strings, the first word stores the length of the string and is followed by UTF-8 encoded bytes. If necessary, the end of the string is padded with empty bytes in order to ensure word-alignment. For example, a tuple that consists 3 fields of type (int, string, string), with value (null, “data”, “bricks”) would look like this: ![image](https://cloud.githubusercontent.com/assets/50748/7328526/1e21959c-ea8b-11e4-9a28-a4350fe4a7b5.png) This format allows us to compare tuples for equality by directly comparing their raw bytes. This also enables fast hashing of tuples. ### Hash map for performing aggregations This patch introduces `UnsafeFixedWidthAggregationMap`, a hash map for performing aggregations where the aggregation result columns are fixed-with. This map's keys and values are `Row` objects. `UnsafeFixedWidthAggregationMap` is implemented on top of `BytesToBytesMap`, an append-only map which supports byte-array keys and values. `BytesToBytesMap` stores pointers to key and value tuples. For each record with a new key, we copy the key and create the aggregation value buffer for that key and put them in a buffer. The hash table then simply stores pointers to the key and value. For each record with an existing key, we simply run the aggregation function to update the values in place. This map is implemented using open hashing with triangular sequence probing. Each entry stores two words in a long array: the first word stores the address of the key and the second word stores the relative offset from the key tuple to the value tuple, as well as the key's 32-bit hashcode. By storing the full hashcode, we reduce the number of equality checks that need to be performed to handle position collisions ()since the chance of hashcode collision is much lower than position collision). `UnsafeFixedWidthAggregationMap` allows regular Spark SQL `Row` objects to be used when probing the map. Internally, it encodes these rows into `UnsafeRow` format using `UnsafeRowConverter`. This conversion has a small overhead that can be eliminated in the future once we use UnsafeRows in other operators.  [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5725)  Author: Josh Rosen <joshrosen@databricks.com> Closes apache#5725 from JoshRosen/unsafe and squashes the following commits: eeee512 [Josh Rosen] Add converters for Null, Boolean, Byte, and Short columns. 81f34f8 [Josh Rosen] Follow 'place children last' convention for GeneratedAggregate 1bc36cc [Josh Rosen] Refactor UnsafeRowConverter to avoid unnecessary boxing. 017b2dc [Josh Rosen] Remove BytesToBytesMap.finalize() 50e9671 [Josh Rosen] Throw memory leak warning even in case of error; add warning about code duplication 70a39e4 [Josh Rosen] Split MemoryManager into ExecutorMemoryManager and TaskMemoryManager: 6e4b192 [Josh Rosen] Remove an unused method from ByteArrayMethods. de5e001 [Josh Rosen] Fix debug vs. trace in logging message. a19e066 [Josh Rosen] Rename unsafe Java test suites to match Scala test naming convention. 78a5b84 [Josh Rosen] Add logging to MemoryManager ce3c565 [Josh Rosen] More comments, formatting, and code cleanup. 529e571 [Josh Rosen] Measure timeSpentResizing in nanoseconds instead of milliseconds. 3ca84b2 [Josh Rosen] Only zero the used portion of groupingKeyConversionScratchSpace 162caf7 [Josh Rosen] Fix test compilation b45f070 [Josh Rosen] Don't redundantly store the offset from key to value, since we can compute this from the key size. a8e4a3f [Josh Rosen] Introduce MemoryManager interface; add to SparkEnv. 0925847 [Josh Rosen] Disable MiMa checks for new unsafe module cde4132 [Josh Rosen] Add missing pom.xml 9c19fc0 [Josh Rosen] Add configuration options for heap vs. offheap 6ffdaa1 [Josh Rosen] Null handling improvements in UnsafeRow. 31eaabc [Josh Rosen] Lots of TODO and doc cleanup. a95291e [Josh Rosen] Cleanups to string handling code afe8dca [Josh Rosen] Some Javadoc cleanup f3dcbfe [Josh Rosen] More mod replacement 854201a [Josh Rosen] Import and comment cleanup 06e929d [Josh Rosen] More warning cleanup ef6b3d3 [Josh Rosen] Fix a bunch of FindBugs and IntelliJ inspections 29a7575 [Josh Rosen] Remove debug logging 49aed30 [Josh Rosen] More long -> int conversion. b26f1d3 [Josh Rosen] Fix bug in murmur hash implementation. 765243d [Josh Rosen] Enable optional performance metrics for hash map. 23a440a [Josh Rosen] Bump up default hash map size 628f936 [Josh Rosen] Use ints intead of longs for indexing. 92d5a06 [Josh Rosen] Address a number of minor code review comments. 1f4b716 [Josh Rosen] Merge Unsafe code into the regular GeneratedAggregate, guarded by a configuration flag; integrate planner support and re-enable all tests. d85eeff [Josh Rosen] Add basic sanity test for UnsafeFixedWidthAggregationMap bade966 [Josh Rosen] Comment update (bumping to refresh GitHub cache...) b3eaccd [Josh Rosen] Extract aggregation map into its own class. d2bb986 [Josh Rosen] Update to implement new Row methods added upstream 58ac393 [Josh Rosen] Use UNSAFE allocator in GeneratedAggregate (TODO: make this configurable) 7df6008 [Josh Rosen] Optimizations related to zeroing out memory: c1b3813 [Josh Rosen] Fix bug in UnsafeMemoryAllocator.free(): 738fa33 [Josh Rosen] Add feature flag to guard UnsafeGeneratedAggregate c55bf66 [Josh Rosen] Free buffer once iterator has been fully consumed. 62ab054 [Josh Rosen] Optimize for fact that get() is only called on String columns. c7f0b56 [Josh Rosen] Reuse UnsafeRow pointer in UnsafeRowConverter ae39694 [Josh Rosen] Add finalizer as "cleanup method of last resort" c754ae1 [Josh Rosen] Now that the store*() contract has been stregthened, we can remove an extra lookup f764d13 [Josh Rosen] Simplify address + length calculation in Location. 079f1bf [Josh Rosen] Some clarification of the BytesToBytesMap.lookup() / set() contract. 1a483c5 [Josh Rosen] First version that passes some aggregation tests: fc4c3a8 [Josh Rosen] Sketch how the converters will be used in UnsafeGeneratedAggregate 53ba9b7 [Josh Rosen] Start prototyping Java Row -> UnsafeRow converters 1ff814d [Josh Rosen] Add reminder to free memory on iterator completion 8a8f9df [Josh Rosen] Add skeleton for GeneratedAggregate integration. 5d55cef [Josh Rosen] Add skeleton for Row implementation. f03e9c1 [Josh Rosen] Play around with Unsafe implementations of more string methods. ab68e08 [Josh Rosen] Begin merging the UTF8String implementations. 480a74a [Josh Rosen] Initial import of code from Databricks unsafe utils repo.

Obtain HBase security token with Kerberos credentials locally to be sent to executors. Tested on eBay's secure HBase cluster. Similar to obtainTokenForNamenodes and fails gracefully if HBase classes are not included in path. Requires hbase-site.xml to be in the classpath(typically via conf dir) for the zookeeper configuration. Should that go in the docs somewhere? Did not see an HBase section. Author: Dean Chen <deanchen5@gmail.com> Closes apache#5586 from deanchen/master and squashes the following commits: 0c190ef [Dean Chen] [SPARK-6918][YARN] Secure HBase support.

…> ask. The old naming scheme was very confusing between askWithReply and sendWithReply. I also divided RpcEnv.scala into multiple files. Author: Reynold Xin <rxin@databricks.com> Closes apache#5768 from rxin/rpc-rename and squashes the following commits: a84058e [Reynold Xin] [SPARK-7223] Rename RPC askWithReply -> askWithReply, sendWithReply -> ask.

I believe column access via `__getattr__` is bad and shouldn't be implicitly encouraged by the error message when accessing a non-existing attribute on DataFrame. This patch changes the error message from 'no such column' to the more generic 'no such attribute', which is also what Pandas DFs will throw. Author: ksonj <kson@siberie.de> Closes apache#5771 from ksonj/master and squashes the following commits: bcc2220 [ksonj] Better error message on access to non-existing attribute

Author: Wenchen Fan <cloud0fan@outlook.com> Closes apache#5712 from cloud-fan/minor and squashes the following commits: be23064 [Wenchen Fan] fix java doc for DataFrame.agg

mengxr Author: Xusen Yin <yinxusen@gmail.com> Closes apache#5769 from yinxusen/patch-1 and squashes the following commits: 43235f4 [Xusen Yin] Update PearsonCorrelation.scala f7287ee [Xusen Yin] Fix a typo of "threshold"

jkbradley · 2015-04-29T17:50:25Z

@yinxusen The easiest way to fix merge issues is to update your master branch and then use "git rebase master" (and fix whatever conflicts come up)

yinxusen · 2015-04-29T18:17:17Z

Well ... I just rebased it. Can you merge it? @jkbradley If not, I will reopen a new PR and close this.

SparkQA · 2015-04-29T19:54:59Z

Test build #31302 has finished for PR 5596 at commit ee2b37a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase
- trait HasStepSize extends Params
This patch does not change any dependencies.

jkbradley · 2015-04-29T20:08:27Z

It looks like the rebase worked. I'll let @mengxr give the final OK since he's done the review.

mengxr · 2015-04-29T21:40:19Z

@yinxusen Could you close this PR and re-open it? It forces GitHub to update the diff.

mengxr · 2015-04-29T21:57:31Z

Okay, LGTM. I verified the diff on my machine and merged this into master. Thanks!

See JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-6529). There are some notes: 1. I add `learningRate` in sharedParams since it is a common parameter for ML algorithms. 2. We will not support transform of finding synonyms from a `Vector`, which will support in further JIRA issues. 3. Word2Vec is different with other ML models that its training set and transformed set are different. Its training set is an `RDD[Iterable[String]]` which represents documents, but the transformed set we want is an `RDD[String]` that represents unique words. So you have to switch your `inputCol` in these two stages. Author: Xusen Yin <yinxusen@gmail.com> Closes apache#5596 from yinxusen/SPARK-6529 and squashes the following commits: ee2b37a [Xusen Yin] merge with former HEAD 4945462 [Xusen Yin] merge with apache#5626 3bc2cbd [Xusen Yin] change foldLeft to for loop and use blas 5dd4ee7 [Xusen Yin] fix scala style 743e0d5 [Xusen Yin] fix comments and code style 04c48e9 [Xusen Yin] ensure the functionality a190f2c [Xusen Yin] fix code style and refine the transform function of word2vec 02848fa [Xusen Yin] refine comments 34a55c0 [Xusen Yin] fix errors 109d124 [Xusen Yin] add test suite and pass it 04dde06 [Xusen Yin] add shared params c594095 [Xusen Yin] add word2vec transformer 23d77fa [Xusen Yin] merge with apache#5626 e8cfaf7 [Xusen Yin] fix conflict with master 66e7bd3 [Xusen Yin] change foldLeft to for loop and use blas 566ec20 [Xusen Yin] fix scala style b54399f [Xusen Yin] fix comments and code style 1211e86 [Xusen Yin] ensure the functionality 6b97ec8 [Xusen Yin] fix code style and refine the transform function of word2vec 7cde18f [Xusen Yin] rm sharedParams 618abd0 [Xusen Yin] refine comments e29680a [Xusen Yin] fix errors fe3afe9 [Xusen Yin] add test suite and pass it 02767fb [Xusen Yin] add shared params 6a514f1 [Xusen Yin] add word2vec transformer

yinxusen added 5 commits April 19, 2015 17:22

add word2vec transformer

6a514f1

add shared params

02767fb

add test suite and pass it

fe3afe9

fix errors

e29680a

refine comments

618abd0

oefirouz reviewed Apr 20, 2015
View reviewed changes

mengxr reviewed Apr 22, 2015
View reviewed changes

yinxusen added 4 commits April 22, 2015 16:26

rm sharedParams

7cde18f

fix code style and refine the transform function of word2vec

6b97ec8

ensure the functionality

1211e86

fix comments and code style

b54399f

fix scala style

566ec20

mengxr reviewed Apr 24, 2015
View reviewed changes

brkyvz and others added 8 commits April 29, 2015 00:09

[SQL][Minor] fix java doc for DataFrame.agg

81ea42b

Author: Wenchen Fan <cloud0fan@outlook.com> Closes apache#5712 from cloud-fan/minor and squashes the following commits: be23064 [Wenchen Fan] fix java doc for DataFrame.agg

Fix a typo of "threshold"

c0c0ba6

mengxr Author: Xusen Yin <yinxusen@gmail.com> Closes apache#5769 from yinxusen/patch-1 and squashes the following commits: 43235f4 [Xusen Yin] Update PearsonCorrelation.scala f7287ee [Xusen Yin] Fix a typo of "threshold"

yinxusen added 12 commits April 30, 2015 01:53

add word2vec transformer

c594095

add shared params

04dde06

add test suite and pass it

109d124

fix errors

34a55c0

refine comments

02848fa

fix code style and refine the transform function of word2vec

a190f2c

ensure the functionality

04c48e9

fix comments and code style

743e0d5

fix scala style

5dd4ee7

change foldLeft to for loop and use blas

3bc2cbd

merge with apache#5626

4945462

merge with former HEAD

ee2b37a

asfgit closed this in c9d530e Apr 29, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ML][SPARK-6529] Add Word2Vec transformer #5596

[ML][SPARK-6529] Add Word2Vec transformer #5596

yinxusen commented Apr 20, 2015

SparkQA commented Apr 20, 2015

oefirouz Apr 20, 2015

yinxusen Apr 21, 2015

jkbradley commented Apr 21, 2015

yinxusen commented Apr 21, 2015

jkbradley commented Apr 21, 2015

mengxr Apr 22, 2015

yinxusen commented Apr 22, 2015

yinxusen commented Apr 22, 2015

SparkQA commented Apr 22, 2015

yinxusen commented Apr 22, 2015

jkbradley commented Apr 22, 2015

yinxusen commented Apr 23, 2015

SparkQA commented Apr 23, 2015

yinxusen commented Apr 24, 2015

mengxr Apr 24, 2015

yinxusen Apr 24, 2015

jkbradley commented Apr 29, 2015

yinxusen commented Apr 29, 2015

SparkQA commented Apr 29, 2015

jkbradley commented Apr 29, 2015

mengxr commented Apr 29, 2015

mengxr commented Apr 29, 2015

[ML][SPARK-6529] Add Word2Vec transformer #5596

[ML][SPARK-6529] Add Word2Vec transformer #5596

Conversation

yinxusen commented Apr 20, 2015

SparkQA commented Apr 20, 2015

oefirouz Apr 20, 2015

Choose a reason for hiding this comment

yinxusen Apr 21, 2015

Choose a reason for hiding this comment

jkbradley commented Apr 21, 2015

yinxusen commented Apr 21, 2015

jkbradley commented Apr 21, 2015

mengxr Apr 22, 2015

Choose a reason for hiding this comment

yinxusen commented Apr 22, 2015

yinxusen commented Apr 22, 2015

SparkQA commented Apr 22, 2015

yinxusen commented Apr 22, 2015

jkbradley commented Apr 22, 2015

yinxusen commented Apr 23, 2015

SparkQA commented Apr 23, 2015

yinxusen commented Apr 24, 2015

mengxr Apr 24, 2015

Choose a reason for hiding this comment

yinxusen Apr 24, 2015

Choose a reason for hiding this comment

jkbradley commented Apr 29, 2015

yinxusen commented Apr 29, 2015

SparkQA commented Apr 29, 2015

jkbradley commented Apr 29, 2015

mengxr commented Apr 29, 2015

mengxr commented Apr 29, 2015