[SPARK-24666][ML] Fix infinity vectors produced by Word2Vec when numIterations are large

### What changes were proposed in this pull request?

This patch adds normalization to word vectors when fitting a dataset in Word2Vec.

### Why are the changes needed?

Running Word2Vec on some datasets, when numIterations is large, can produce infinity word vectors.

### Does this PR introduce any user-facing change?

Yes. After this patch, Word2Vec won't produce infinity word vectors.

### How was this patch tested?

Manually. This issue is not reproducible on every dataset. The dataset known to reproduce it is too large (925M) to upload.

```scala
import org.apache.spark.ml.feature.Word2Vec
import spark.implicits._ // already in scope in spark-shell; needed for .as[String] and .map

case class Sentences(name: String, words: Array[String])

// Load titles from the TSV file, keep US/English entries, and split each title into words.
val dataset = spark.read
  .option("header", "true").option("sep", "\t")
  .option("quote", "").option("nullValue", "\\N")
  .csv("/tmp/title.akas.tsv")
  .filter("region = 'US' or language = 'en'")
  .select("title")
  .as[String]
  .map(s => Sentences(s, s.split(' ')))
  .persist()

println("Training model...")
val word2Vec = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("vector")
  .setVectorSize(64)
  .setWindowSize(4)
  .setNumPartitions(50)
  .setMinCount(5)
  .setMaxIter(30)
val model = word2Vec.fit(dataset)
model.getVectors.show()
```

Before:

```
Training model...
+-------------+--------------------+
|         word|              vector|
+-------------+--------------------+
|     Unspoken|[-Infinity,-Infin...|
|       Talent|[-Infinity,Infini...|
|    Hourglass|[2.02805806500023...|
|Nickelodeon's|[-4.2918617120906...|
|      Priests|[-1.3570403355926...|
|    Religion:|[-6.7049072282803...|
|           Bu|[5.05591774315586...|
|      Totoro:|[-1.0539840178632...|
|     Trouble,|[-3.5363592836003...|
|       Hatter|[4.90413981352826...|
|          '79|[7.50436471285412...|
|         Vile|[-2.9147142985312...|
|         9/11|[-Infinity,Infini...|
|      Santino|[1.30005911270850...|
|      Motives|[-1.2538958306253...|
|          '13|[-4.5040152427657...|
|       Fierce|[Infinity,Infinit...|
|       Stover|[-2.6326895394029...|
|          'It|[1.66574533864436...|
|        Butts|[Infinity,Infinit...|
+-------------+--------------------+
only showing top 20 rows
```

After:

```
Training model...
+-------------+--------------------+
|         word|              vector|
+-------------+--------------------+
|     Unspoken|[-0.0454501919448...|
|       Talent|[-0.2657704949378...|
|    Hourglass|[-0.1399687677621...|
|Nickelodeon's|[-0.1767119318246...|
|      Priests|[-0.0047509293071...|
|    Religion:|[-0.0411605164408...|
|           Bu|[0.11837736517190...|
|      Totoro:|[0.05258282646536...|
|     Trouble,|[0.09482011198997...|
|       Hatter|[0.06040831282734...|
|          '79|[0.04783720895648...|
|         Vile|[-0.0017210749210...|
|         9/11|[-0.0713915303349...|
|      Santino|[-0.0412711687386...|
|      Motives|[-0.0492418706417...|
|          '13|[-0.0073119504377...|
|       Fierce|[-0.0565455369651...|
|       Stover|[0.06938160210847...|
|          'It|[0.01117012929171...|
|        Butts|[0.05374567210674...|
+-------------+--------------------+
only showing top 20 rows
```

Closes #26722 from viirya/SPARK-24666-2.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
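
The before/after outputs above show only the first 20 rows, so a manual check cannot confirm that every vector is finite. One way to check all of them might look like the sketch below. It is not part of the patch: `NonFiniteCheck`, `hasNonFinite`, and `badWords` are hypothetical names, the sample values are made up, and it works on plain Scala collections rather than on the DataFrame returned by `model.getVectors`.

```scala
// Hypothetical helper, not part of the patch: flags vectors that contain
// NaN or +/-Infinity components.
object NonFiniteCheck {
  // True if any component of the vector is NaN or infinite.
  def hasNonFinite(vector: Array[Double]): Boolean =
    vector.exists(x => x.isNaN || x.isInfinity)

  // Returns the words whose vectors contain at least one non-finite value.
  def badWords(vectors: Map[String, Array[Double]]): Set[String] =
    vectors.collect { case (word, v) if hasNonFinite(v) => word }.toSet

  def main(args: Array[String]): Unit = {
    // Made-up sample values, loosely shaped like the output above.
    val sample = Map(
      "Unspoken"  -> Array(Double.NegativeInfinity, Double.NegativeInfinity),
      "Hourglass" -> Array(2.0, -1.5)
    )
    println(badWords(sample))
  }
}
```

Applied to the collected word vectors of a trained model, an empty result from `badWords` would suggest that no word vector overflowed.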