This repository has been archived by the owner on Sep 20, 2022. It is now read-only.

[HIVEMALL-182][SPARK][WIP] Add an optimizer rule to filter out columns with low variances #139

Status: Open (wants to merge 1 commit into base: master)

maropu commented Mar 29, 2018

What changes were proposed in this pull request?

This PR adds a new optimizer rule, VarianceThreshold, to Spark. When enabled, the rule drops columns whose variance falls below a configured threshold:

scala> spark.read.option("inferSchema", "true").csv("test.csv").write.saveAsTable("t")
scala> sql("SELECT * FROM t").show
+---+--------+---+----+
|_c0|     _c1|_c2| _c3|
+---+--------+---+----+
|  1|   "one"|1.0| 1.0|
|  1|   "two"|1.1| 2.3|
|  1| "three"|0.9| 3.5|
|  1|   "one"|0.9|10.3|
+---+--------+---+----+

// Enable the optimizer rule and run the query again
scala> sql("SET spark.sql.cbo.enabled=true")
scala> sql("SET spark.sql.statistics.histogram.enabled=true")
scala> sql("SET spark.sql.optimizer.featureSelection.enabled=true")
scala> sql("SET spark.sql.optimizer.featureSelection.varianceThreshold=0.10")
scala> sql("SELECT * FROM t").show
+--------+----+
|     _c1| _c3|
+--------+----+
|   "one"| 1.0|
|   "two"| 2.3|
| "three"| 3.5|
|   "one"|10.3|
+--------+----+
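
To illustrate why _c0 and _c2 drop out in the example above, here is a minimal sketch of variance-threshold selection in plain Scala. It is not the code in this PR: the object and method names are hypothetical, it uses population variance, and it assumes a column is kept only when its variance is strictly greater than the threshold. The actual rule presumably works from Spark's column statistics and may differ on each of these points.

```scala
// Hypothetical sketch, not the PR's VarianceThreshold implementation.
object VarianceThresholdSketch {
  // Population variance of a numeric column; 0.0 for an empty column.
  def variance(xs: Seq[Double]): Double =
    if (xs.isEmpty) 0.0
    else {
      val mean = xs.sum / xs.size
      xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    }

  // Names of the columns whose variance exceeds the threshold, in input order.
  // Assumption: the comparison is strict (variance > threshold).
  def selectColumns(columns: Seq[(String, Seq[Double])], threshold: Double): Seq[String] =
    columns.collect { case (name, values) if variance(values) > threshold => name }

  def main(args: Array[String]): Unit = {
    // Numeric columns of table t from the example above.
    val numericColumns = Seq(
      "_c0" -> Seq(1.0, 1.0, 1.0, 1.0),
      "_c2" -> Seq(1.0, 1.1, 0.9, 0.9),
      "_c3" -> Seq(1.0, 2.3, 3.5, 10.3)
    )
    println(selectColumns(numericColumns, threshold = 0.10))
  }
}
```

With a threshold of 0.10, only _c3 passes: _c0 is constant (variance 0) and _c2 has a variance of roughly 0.007. The string column _c1 is outside the scope of this sketch; in the example above it is retained.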

TODO

  • Add docs in gitbook
  • Add more tests
  • Brush up VarianceThreshold code

What type of PR is it?

Feature

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-182

How was this patch tested?

Added tests in FeatureSelectionRuleSuite.

myui commented Apr 1, 2018

@maropu CI failing.

maropu commented Apr 1, 2018

I'll fix later.

myui commented Aug 13, 2018

@maropu is this PR still WIP?

maropu commented Aug 15, 2018

Sorry for the slow progress. I'm checking the feasibility in a separate repo, because there are still some issues to solve: https://github.com/maropu/spark-catalyst-rule-rewiter/tree/master
Please give me a bit more time. Thanks.
