[SPARK-13057][SQL] Add benchmark codes and the performance results for implemented compression schemes for InMemoryRelation #10965

Closed
maropu wants to merge 2 commits

Conversation

maropu
Member

maropu commented Jan 28, 2016

This PR adds benchmark code for in-memory cache compression to make future development and discussion smoother.

@SparkQA

SparkQA commented Jan 28, 2016

Test build #50254 has finished for PR 10965 at commit b3bf70c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50546 has finished for PR 10965 at commit cc58f20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 5, 2016

The in-memory columnar cache is much larger than the same data stored as Parquet on disk, because the compressionSchemes Spark uses are simpler than Parquet's (see the discussion on the spark-user mailing list: https://www.mail-archive.com/user@spark.apache.org/msg45241.html).

Since spark-sql already depends on parquet-column, we could reuse Parquet's more efficient compression algorithms to reduce GC pressure.
Thoughts?

@maropu
Member Author

maropu commented Feb 5, 2016

I tried using DeltaBinaryPackingValuesReader and DeltaBinaryPackingValuesWriter from the parquet-column package (the ParquetIntDelta case below).
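
For readers unfamiliar with the idea behind delta binary packing, here is a minimal Python sketch. It is not Parquet's actual on-disk format and not the code in this PR: the function names `delta_pack` and `delta_unpack` are hypothetical, and the sketch packs the whole input as one block, whereas Parquet's encoding splits values into blocks and miniblocks with their own bit widths. It only illustrates why sorted or slowly changing integers need far fewer bits than 32 per value.

```python
def delta_pack(values):
    """Encode ints as (first, min_delta, bit_width, count, packed_deltas).

    Illustrative only: one block for the whole input, packed into a
    single Python int instead of a byte buffer.
    """
    if not values:
        raise ValueError("need at least one value")
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return (values[0], 0, 0, 0, 0)
    min_delta = min(deltas)
    # Subtracting the minimum makes every stored delta non-negative.
    adjusted = [d - min_delta for d in deltas]
    # Every adjusted delta fits in this many bits (0 if all are equal).
    width = max(adjusted).bit_length()
    packed = 0
    for i, d in enumerate(adjusted):
        packed |= d << (i * width)
    return (values[0], min_delta, width, len(deltas), packed)


def delta_unpack(encoded):
    """Invert delta_pack and return the original list of ints."""
    first, min_delta, width, count, packed = encoded
    mask = (1 << width) - 1
    out = [first]
    for i in range(count):
        d = (packed >> (i * width)) & mask
        out.append(out[-1] + d + min_delta)
    return out


values = [100, 103, 105, 110, 112]
encoded = delta_pack(values)
print("bit width per delta:", encoded[2])
print("round trip ok:", delta_unpack(encoded) == values)
```

In this example the four deltas (3, 2, 5, 2) become (1, 0, 3, 0) after subtracting the minimum, so each one fits in 2 bits instead of a full 32-bit INT.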

Benchmark
Running benchmark: INT Decode(Lower Skew)
  Running case: PassThrough(1.000)
  Running case: RunLengthEncoding(1.005)
  Running case: DictionaryEncoding(0.500)
  Running case: IntDelta(0.250)
  Running case: ParquetIntDelta(0.072)

Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
INT Decode(Lower Skew):            Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
PassThrough(1.000)                       278.04           241.36         1.00 X
RunLengthEncoding(1.005)                 741.29            90.53         0.38 X
DictionaryEncoding(0.500)                954.02            70.34         0.29 X
IntDelta(0.250)                          836.38            80.24         0.33 X
ParquetIntDelta(0.072)                   839.36            79.95         0.33 X

Running benchmark: INT Decode(Higher Skew)
  Running case: PassThrough(1.000)
  Running case: RunLengthEncoding(1.339)
  Running case: DictionaryEncoding(0.501)
  Running case: IntDelta(0.250)
  Running case: ParquetIntDelta(0.161)

Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
INT Decode(Higher Skew):           Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
PassThrough(1.000)                       899.44            74.61         1.00 X
RunLengthEncoding(1.339)                1192.31            56.28         0.75 X
DictionaryEncoding(0.501)                981.44            68.38         0.92 X
IntDelta(0.250)                          874.43            76.75         1.03 X
ParquetIntDelta(0.161)                   937.08            71.61         0.96 X
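
A note on reading the tables above, as a hedged sketch rather than the benchmark's own code: Relative Rate looks like the PassThrough time divided by each case's time, and Avg Rate looks like the number of decoded values divided by the average time. The value count of 64 * 1024 * 1024 used below is an assumption inferred from the reported numbers, not something stated in this thread. The number in parentheses after each scheme name appears to be a compression ratio, but that is also an assumption. The helper names are hypothetical.

```python
# Assumption: each run decodes 64 * 1024 * 1024 values. This count is
# inferred from the reported numbers, not stated in the thread.
NUM_VALUES = 64 * 1024 * 1024


def rate_m_per_s(avg_time_ms):
    """Millions of values decoded per second for a given average time."""
    return NUM_VALUES / (avg_time_ms / 1000.0) / 1e6


def relative_rate(baseline_ms, case_ms):
    """Speed relative to the baseline; below 1.0 means slower."""
    return baseline_ms / case_ms


# Average times (ms) copied from the "Lower Skew" table above.
lower_skew_ms = {
    "PassThrough": 278.04,
    "RunLengthEncoding": 741.29,
    "DictionaryEncoding": 954.02,
    "IntDelta": 836.38,
    "ParquetIntDelta": 839.36,
}

baseline = lower_skew_ms["PassThrough"]
for name, ms in lower_skew_ms.items():
    print(f"{name:20s} {rate_m_per_s(ms):8.2f} M/s "
          f"{relative_rate(baseline, ms):5.2f} X")
```

Under that assumed value count, the computed rates and relative rates agree with both tables to two decimal places when the Higher Skew times are substituted.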

@maropu
Member Author

maropu commented Feb 5, 2016

@rxin Could you comment on this?

@rxin
Contributor

rxin commented Feb 5, 2016

cc @nongli is this useful?

@maropu
Member Author

maropu commented Feb 7, 2016

@nongli ping

@cloud-fan
Contributor

The benchmark infra has been updated, so I think we need to rerun this and update the results.

@nongli
Contributor

nongli commented Feb 8, 2016

The benchmark LGTM and I think this is useful.

@maropu Before you make significant changes to this, can you write up what you plan to do?

@maropu
Member Author

maropu commented Feb 9, 2016

@nongli Okay, I'll share the plan first. Please give me some time to look at the similar code in Parquet.

@SparkQA

SparkQA commented Feb 9, 2016

Test build #50979 has finished for PR 10965 at commit fab3fb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Feb 10, 2016

Thanks - I've merged this into master.
