Hash page row #49

keith-turner · 2016-02-03T22:49:30Z

I ran some tests on EC2 using roughly the same config as the recent long test. The following plot shows loading 80 Common Crawl files; before this, 120 Common Crawl files were loaded using Spark. The dips in performance are caused by files finishing and new ones starting. With the changes in this PR, each file ran as its own task. There were 20 Spark executors loading data. The load across the cluster was much more even than in the past, and the throughput was better.
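As a rough illustration of the "one task per file" scheduling described above, here is a minimal sketch. It is not the Spark code from this PR: a local thread pool stands in for the Spark executors, and `load_file`, `load_all`, and the file names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_EXECUTORS = 20  # matches the executor count mentioned in the comment


def load_file(path: str) -> tuple[str, int]:
    """Hypothetical stand-in for loading one Common Crawl file."""
    # Placeholder work; a real task would parse the file and load its pages.
    pages_loaded = len(path)
    return path, pages_loaded


def load_all(paths: list[str]) -> dict[str, int]:
    """Submit one task per file to a fixed-size pool and collect the results."""
    with ThreadPoolExecutor(max_workers=NUM_EXECUTORS) as pool:
        return dict(pool.map(load_file, paths))


if __name__ == "__main__":
    files = [f"file-{i:03d}.warc" for i in range(80)]  # assumed names
    results = load_all(files)
    print(len(results), "files loaded")
```

With 80 files and 20 workers, tasks finish and start in waves, which is consistent with the dips attributed above to files finishing and new ones starting.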

Below are some of the tablets in the Fluo table containing page data. The file sizes are very even, which was not the case before adding the hash.

5;p:0aw3 file:hdfs://leader1:10000/accumulo/tables/5/t-0000279/A0000dud.rf []    41977575,126316
5;p:0ls6 file:hdfs://leader1:10000/accumulo/tables/5/t-000027r/A0000due.rf []    42176089,126148
5;p:0wo9 file:hdfs://leader1:10000/accumulo/tables/5/t-00001ne/A0000duf.rf []    42035837,125046
5;p:17kc file:hdfs://leader1:10000/accumulo/tables/5/t-0000278/A0000dug.rf []    41237240,125422
5;p:1igf file:hdfs://leader1:10000/accumulo/tables/5/t-000027q/A0000duh.rf []    41937943,125324
5;p:1tci file:hdfs://leader1:10000/accumulo/tables/5/t-00001jq/A0000dui.rf []    41119372,124920
5;p:248l file:hdfs://leader1:10000/accumulo/tables/5/t-0000274/A0000dcd.rf []    41955329,125664
5;p:2f4o file:hdfs://leader1:10000/accumulo/tables/5/t-000027m/A0000dce.rf []    41337171,124974
5;p:2q0r file:hdfs://leader1:10000/accumulo/tables/5/t-00001nc/A0000dcf.rf []    41315244,125138
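To show why a hash prefix evens out the tablets, here is a minimal sketch, not the code from this PR. It assumes rows of the form `p:<hash>:<uri>` with a 4-character base-36 hash, which is inferred from split points such as `p:0aw3` in the listing above. The use of MD5 and the names `hashed_page_row` and `split_points` are hypothetical. The listed splits are consecutive multiples of 14115 in base 36, which is what dividing the 36^4 prefix space into 119 ranges (rounding up) would give; the 119 is a guess from that arithmetic.

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"
HASH_LEN = 4                        # assumed from the 4-character split points
SPACE = len(ALPHABET) ** HASH_LEN   # 36**4 = 1,679,616 possible prefixes


def to_base36(n: int, width: int = HASH_LEN) -> str:
    """Encode n as a fixed-width, zero-padded base-36 string."""
    digits = []
    for _ in range(width):
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def hashed_page_row(uri: str) -> str:
    """Build a row like 'p:<hash>:<uri>' (hypothetical layout and hash)."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % SPACE
    return f"p:{to_base36(bucket)}:{uri}"


def split_points(num_tablets: int) -> list[str]:
    """Return num_tablets - 1 evenly spaced split rows over the hash space."""
    step = -(-SPACE // num_tablets)  # ceiling division
    return [f"p:{to_base36(i * step)}" for i in range(1, num_tablets)]


if __name__ == "__main__":
    print(split_points(119)[:4])
    print(hashed_page_row("http://example.com/"))
```

Because the hash spreads URIs roughly uniformly over the prefix space, each evenly spaced range receives a similar share of pages, which matches the near-identical file sizes in the listing.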

keith-turner · 2016-02-03T22:51:11Z

This PR depends on apache/fluo-recipes#52

mikewalch · 2016-02-05T15:35:57Z

bin/webindex

@@ -149,16 +149,6 @@ load-s3)
    ${COMMAND}
  fi
  ;;
-reindex)


While I think it's OK to remove the reindex command in this PR, it is something we should tackle. It's a problem that all production Fluo applications will need to solve, and we should work through it to determine whether new features or recipes are needed and to show that it is possible. What do you think about adding this as an issue?

mikewalch · 2016-02-10T17:52:14Z

+1


keith-turner added 3 commits February 3, 2016 16:28

added hash to page row prefix to evenly spread work

db31921

fixes astralway#39 create a load task per file

ffd094a

fixes astralway#18 reworked spark code to use POJOs

9d4e43e

keith-turner mentioned this pull request Feb 3, 2016

Fluo code incorrectly computing domain counts #50

Closed

mikewalch reviewed Feb 5, 2016

keith-turner mentioned this pull request Feb 10, 2016

Reintroduce reindex command #51

Open

keith-turner added a commit that referenced this pull request Feb 10, 2016

Merge pull request #49 from keith-turner/hash_page_row

0325b47

Hash page row

keith-turner merged commit 0325b47 into astralway:master Feb 10, 2016
