This repository has been archived by the owner on Apr 17, 2018. It is now read-only.

Hash page row #49

Merged 3 commits on Feb 10, 2016


keith-turner
Member

I ran a test on EC2 using roughly the same config as the recent long test. The following plot shows the loading of 80 Common Crawl files. Before this, 120 Common Crawl files were loaded using Spark. The dips in performance are caused by files finishing and new ones starting. With the changes in this PR, each file ran as a task, and there were 20 Spark executors for loading data. The load across the cluster was much more even than in the past and the throughput was better.

[Figure: hash-run (plot of the load run described above)]

Below are some of the tablets in the Fluo table containing page data. The file sizes are very even, which was not the case before adding the hash.

5;p:0aw3 file:hdfs://leader1:10000/accumulo/tables/5/t-0000279/A0000dud.rf []    41977575,126316
5;p:0ls6 file:hdfs://leader1:10000/accumulo/tables/5/t-000027r/A0000due.rf []    42176089,126148
5;p:0wo9 file:hdfs://leader1:10000/accumulo/tables/5/t-00001ne/A0000duf.rf []    42035837,125046
5;p:17kc file:hdfs://leader1:10000/accumulo/tables/5/t-0000278/A0000dug.rf []    41237240,125422
5;p:1igf file:hdfs://leader1:10000/accumulo/tables/5/t-000027q/A0000duh.rf []    41937943,125324
5;p:1tci file:hdfs://leader1:10000/accumulo/tables/5/t-00001jq/A0000dui.rf []    41119372,124920
5;p:248l file:hdfs://leader1:10000/accumulo/tables/5/t-0000274/A0000dcd.rf []    41955329,125664
5;p:2f4o file:hdfs://leader1:10000/accumulo/tables/5/t-000027m/A0000dce.rf []    41337171,124974
5;p:2q0r file:hdfs://leader1:10000/accumulo/tables/5/t-00001nc/A0000dcf.rf []    41315244,125138
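
To illustrate why a hashed row prefix evens out tablet sizes, here is a minimal Python sketch. It is not the code from this PR: the `prefix:hash:row` layout, the use of `zlib.crc32` as the hash, and the tablet count of 119 are all assumptions made for illustration. The 4-character base-36 width and the `p` prefix are taken from the split points listed above, and 119 was picked only because it reproduces the spacing seen in that listing.

```python
import zlib
from typing import List

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"
HASH_LEN = 4                   # width of the hash seen in the split points above
HASH_SPACE = 36 ** HASH_LEN    # number of distinct 4-character base-36 values


def to_base36(n: int, width: int = HASH_LEN) -> str:
    """Encode n as a zero-padded, fixed-width base-36 string."""
    digits = []
    for _ in range(width):
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def hashed_row(row: str, prefix: str = "p") -> str:
    """Prepend a short hash so that rows spread uniformly over the key space.

    ASSUMPTION: crc32 stands in for whatever hash the real code uses, and
    the 'prefix:hash:row' layout is hypothetical.
    """
    h = zlib.crc32(row.encode("utf-8")) % HASH_SPACE
    return f"{prefix}:{to_base36(h)}:{row}"


def split_points(num_tablets: int, prefix: str = "p") -> List[str]:
    """Evenly spaced split points over the hash space (hypothetical helper)."""
    step = -(-HASH_SPACE // num_tablets)  # ceiling division
    return [f"{prefix}:{to_base36(i * step)}" for i in range(1, num_tablets)]


if __name__ == "__main__":
    # ASSUMPTION: 119 tablets, chosen to mirror the spacing in the listing.
    print(split_points(119)[:4])
    print(hashed_row("com.example/index.html"))
```

Because the hash values are roughly uniform, evenly spaced split points give each tablet a similar share of rows regardless of how skewed the original page URIs are.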

@keith-turner
Member Author

This PR depends on apache/fluo-recipes#52

@@ -149,16 +149,6 @@ load-s3)
${COMMAND}
fi
;;
reindex)
Contributor

While I think it's OK to remove the reindex command in this PR, I think reindexing is something we should tackle. It's a problem that all production Fluo applications will need to solve, and we should work through it to determine whether new features or recipes are needed and to show that it is possible. What do you think about adding this as an issue?

@mikewalch
Contributor

+1

keith-turner added a commit that referenced this pull request Feb 10, 2016
@keith-turner keith-turner merged commit 0325b47 into astralway:master Feb 10, 2016