
Bago

Elad Meidar
Elad Meidar committed Nov 3, 2010
1 parent 3a64b3b commit d489f77c66280d6735ee038e0dc2fcf9fdf0df9c
@@ -1,20 +1,19 @@
---
-title: Scaling a 500 million rows table
+title: Scaling a 500 million rows table - planning
layout: post
---
438 million, 218 thousand and 363 rows.
-Current count of indexes on the table: 0.
+Current count of indexes on the table, on the other hand, is 0.
I imagine you are all asking how long it takes to perform a `select (*)` on it; well, I stopped waiting after about 4 minutes.
-
-This peculiar situation happens in one of our client's projects, the table itself operates as storage for a daemon that listens to some kind of a stream with the current daily amount that goes somewhere around 4 million rows per one single day.
+This peculiar situation happens in one of our client's projects: the table fills up from a daemon that listens to some kind of stream, with the current daily volume running somewhere around 4 million rows per single day. All we are storing is a simple integer and a foreign key (a "sample").
Crazy, I know.
-This table (hereby "samples table") should allow the app to access any subset of query, but mostly based on a `WHERE user_id = xxx` clause, so i can't offload "old" rows away into oblivion (or an archive).
+This table (the "samples table") should allow the app to access any subset of the data, but mostly based on a `WHERE user_id = xxx` clause, so I can't offload "old" rows away into oblivion (or an archive).
After a little research, I decided on the following options:
@@ -30,4 +29,21 @@ What i am planning on doing is to create some kind of sampling and to keep to mo
h4. Use internal MySQL partitioning
+Partitioning seems like a reasonable RDBMS-level solution, but in MySQL it's limited to 1000 partitions only, and they are also not very dynamic (I can't set up an automatic partitioning scheme that will add new partitions as the data grows).
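As a rough illustration only (the real schema is not shown here, so the table and column names below are assumptions), day-based RANGE partitioning in MySQL might look something like the following, and it makes the "not very dynamic" part concrete: new partitions still have to be added by hand or by an external job.

```sql
-- Hypothetical sketch only; "samples", "user_id", "value" and "created_at"
-- are assumed names, not the client's actual schema.
CREATE TABLE samples (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id    INT UNSIGNED    NOT NULL,
  value      INT             NOT NULL,
  created_at DATETIME        NOT NULL,
  PRIMARY KEY (id, created_at)  -- the partitioning column must be part of the primary key
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p20101025 VALUES LESS THAN (TO_DAYS('2010-10-26')),
  PARTITION p20101026 VALUES LESS THAN (TO_DAYS('2010-10-27')),
  PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- Growing the table means periodically splitting the catch-all partition,
-- e.g. from a scheduled job:
ALTER TABLE samples REORGANIZE PARTITION pmax INTO (
  PARTITION p20101027 VALUES LESS THAN (TO_DAYS('2010-10-28')),
  PARTITION pmax      VALUES LESS THAN MAXVALUE
);
```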
+
+
+h4. Current direction
+
+We decided on trying the following flow:
+
+# Keeping the HA data in a NoSQL implementation; in our case this means we keep about 6 to 10 million rows in a NoSQL instance.
+# The most important data (insertions in the last 48 hours) needs to stay at the top resolution, but older data can lose resolution, so we came up with this idea:
+
+We will create a cron task that runs every hour, processes all the samples from the last hour and averages them, storing the result in a statistics table with only the hourly average as the sample value.
+Another task will do the same, rolling up from hours to days, and from days to weeks, which will be our lowest resolution.
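As a rough sketch of what that hourly rollup could look like in SQL (the `samples` and `hourly_samples` table and column names are assumptions, not the actual schema), the aggregation might be something like:

```sql
-- Hypothetical hourly rollup; table and column names are assumed.
-- Averages the last hour's samples per user into a statistics table.
INSERT INTO hourly_samples (user_id, sample_avg, recorded_at)
SELECT user_id,
       AVG(value),
       DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00')
FROM   samples
WHERE  created_at >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
  AND  created_at <  NOW()
GROUP  BY user_id, DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00');
```

A similar statement, reading from the hourly table and grouping by day (and later by week), would cover the hours-to-days and days-to-weeks rollups.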
+
+This method cuts our row counts by tens of millions of rows in places where we can afford a decrease in data resolution.
+This process is still under development, so if anyone has a better idea and cares to enlighten us, please do so.
+
+
@@ -2,7 +2,7 @@
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
- <title>Emphasized Insanity - Scaling a 500 million rows table</title>
+ <title>Emphasized Insanity - Scaling a 500 million rows table - planning</title>
<script type="text/javascript" src="/javascripts/application.js"></script>
<link rel="stylesheet" type="text/css" href="/stylesheets/application.css">
<link rel="alternate" type="application/rss+xml" title="Emphasized Insanity - Elad Meidar" href="http://feeds.feedburner.com/EladOnRails" />
@@ -56,14 +56,14 @@ <h3>BACKTRACE</h3>
<div id="get">
<a id="homepage_link" href="/">Back to Posts List</a>
<h3 id="get-info">GET</h3>
- <h3 class="post_title"><a href="/2010/10/scaling-500-million-rows-26-10-2010">Scaling a 500 million rows table</a></h3>
+ <h3 class="post_title"><a href="/2010/10/scaling-500-million-rows-26-10-2010">Scaling a 500 million rows table - planning</a></h3>
<div class="single_post">
<p>438 million, 218 thousand and 363 rows.</p>
-<p>Current count of indexes on the table: 0.</p>
+<p>Current count of indexes on the table, on the other hand, is 0.</p>
<p>I imagine you are all asking how long it takes to perform a `select (*)` on it; well, I stopped waiting after about 4 minutes.</p>
-<p>This peculiar situation happens in one of our client&#8217;s projects, the table itself operates as storage for a daemon that listens to some kind of a stream with the current daily amount that goes somewhere around 4 million rows per one single day.</p>
+<p>This peculiar situation happens in one of our client&#8217;s projects: the table fills up from a daemon that listens to some kind of stream, with the current daily volume running somewhere around 4 million rows per single day. All we are storing is a simple integer and a foreign key (a &#8220;sample&#8221;).</p>
<p>Crazy, I know.</p>
-<p>This table (hereby &#8220;samples table&#8221;) should allow the app to access any subset of query, but mostly based on a `<span class="caps">WHERE</span> user_id = xxx` clause, so i can&#8217;t offload &#8220;old&#8221; rows away into oblivion (or an archive).</p>
+<p>This table (the &#8220;samples table&#8221;) should allow the app to access any subset of the data, but mostly based on a `<span class="caps">WHERE</span> user_id = xxx` clause, so I can&#8217;t offload &#8220;old&#8221; rows away into oblivion (or an archive).</p>
<p>After a little research, I decided on the following options:</p>
<h4>NoSQL indexed storage (Redis, Mongo or CouchDb)</h4>
<p>The amount of data is huge, so i was initially looking for some information regarding data size limitations on those NoSQLs:</p>
@@ -74,6 +74,17 @@ <h4>NoSQL indexed storage (Redis, Mongo or CouchDb)</h4>
</ul>
<p>What I am planning on doing is to create some kind of sampling and to keep the most recent data in a NoSQL storage engine.</p>
<h4>Use internal MySQL partitioning</h4>
+<p>Partitioning seems like a reasonable <span class="caps">RDBMS</span>-level solution, but in MySQL it&#8217;s limited to 1000 partitions only, and they are also not very dynamic (I can&#8217;t set up an automatic partitioning scheme that will add new partitions as the data grows).</p>
+<h4>Current direction</h4>
+<p>We decided on trying the following flow:</p>
+<ol>
+ <li>Keeping the HA data in a NoSQL implementation; in our case this means we keep about 6 to 10 million rows in a NoSQL instance.</li>
+ <li>The most important data (insertions in the last 48 hours) needs to stay at the top resolution, but older data can lose resolution, so we came up with this idea:</li>
+</ol>
+<p>We will create a cron task that runs every hour, processes all the samples from the last hour and averages them, storing the result in a statistics table with only the hourly average as the sample value.<br />
Another task will do the same, rolling up from hours to days, and from days to weeks, which will be our lowest resolution.</p>
+<p>This method cuts our row counts by tens of millions of rows in places where we can afford a decrease in data resolution.<br />
This process is still under development, so if anyone has a better idea and cares to enlighten us, please do so.</p>
</div>
<div class="clear"></div>
</div> <!-- /GET -->
@@ -70,7 +70,7 @@ <h4><a href="/2010/11/bag-o-links-3-11-2010">Bag O' Links - 3/11/2010</a></h4>
</div>
<div class="post">
- <h4><a href="/2010/10/scaling-500-million-rows-26-10-2010">Scaling a 500 million rows table</a></h4>
+ <h4><a href="/2010/10/scaling-500-million-rows-26-10-2010">Scaling a 500 million rows table - planning</a></h4>
<em>26/10/2010</em>
</div>
@@ -4,7 +4,7 @@
<title>Emphasized Insanity</title>
<link href="http://blog.eizesus.com/feed/atom.xml" rel="self"/>
<link href="http://blog.eizesus.com/"/>
- <updated>2010-11-03T11:58:39+02:00</updated>
+ <updated>2010-11-03T12:10:02+02:00</updated>
<id>http://blog.eizesus.com/</id>
<author>
<name>Elad Meidar</name>
@@ -46,16 +46,16 @@
</entry>
<entry>
- <title>Scaling a 500 million rows table</title>
+ <title>Scaling a 500 million rows table - planning</title>
<link href="http://blog.eizesus.com/2010/10/scaling-500-million-rows-26-10-2010"/>
<updated>2010-10-26T00:00:00+02:00</updated>
<id>http://gitready.com/2010/10/scaling-500-million-rows-26-10-2010</id>
<content type="html">&lt;p&gt;438 million, 218 thousand and 363 rows.&lt;/p&gt;
-&lt;p&gt;Current count of indexes on the table: 0.&lt;/p&gt;
+&lt;p&gt;Current count of indexes on the table, on the other hand, is 0.&lt;/p&gt;
&lt;p&gt;I imagine you are all asking how long it takes to perform a `select (*)` on it; well, I stopped waiting after about 4 minutes.&lt;/p&gt;
-&lt;p&gt;This peculiar situation happens in one of our client&amp;#8217;s projects, the table itself operates as storage for a daemon that listens to some kind of a stream with the current daily amount that goes somewhere around 4 million rows per one single day.&lt;/p&gt;
+&lt;p&gt;This peculiar situation happens in one of our client&amp;#8217;s projects: the table fills up from a daemon that listens to some kind of stream, with the current daily volume running somewhere around 4 million rows per single day. All we are storing is a simple integer and a foreign key (a &amp;#8220;sample&amp;#8221;).&lt;/p&gt;
&lt;p&gt;Crazy, I know.&lt;/p&gt;
-&lt;p&gt;This table (hereby &amp;#8220;samples table&amp;#8221;) should allow the app to access any subset of query, but mostly based on a `&lt;span class=&quot;caps&quot;&gt;WHERE&lt;/span&gt; user_id = xxx` clause, so i can&amp;#8217;t offload &amp;#8220;old&amp;#8221; rows away into oblivion (or an archive).&lt;/p&gt;
+&lt;p&gt;This table (the &amp;#8220;samples table&amp;#8221;) should allow the app to access any subset of the data, but mostly based on a `&lt;span class=&quot;caps&quot;&gt;WHERE&lt;/span&gt; user_id = xxx` clause, so I can&amp;#8217;t offload &amp;#8220;old&amp;#8221; rows away into oblivion (or an archive).&lt;/p&gt;
&lt;p&gt;After a little research, I decided on the following options:&lt;/p&gt;
&lt;h4&gt;NoSQL indexed storage (Redis, Mongo or CouchDb)&lt;/h4&gt;
&lt;p&gt;The amount of data is huge, so i was initially looking for some information regarding data size limitations on those NoSQLs:&lt;/p&gt;
@@ -65,7 +65,18 @@
&lt;li&gt;With CouchDB it&amp;#8217;s a little different; it depends basically on your `_id` column size (number of bits you define for usage).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What I am planning on doing is to create some kind of sampling and to keep the most recent data in a NoSQL storage engine.&lt;/p&gt;
-&lt;h4&gt;Use internal MySQL partitioning&lt;/h4&gt;</content>
+&lt;h4&gt;Use internal MySQL partitioning&lt;/h4&gt;
+&lt;p&gt;Partitioning seems like a reasonable &lt;span class=&quot;caps&quot;&gt;RDBMS&lt;/span&gt;-level solution, but in MySQL it&amp;#8217;s limited to 1000 partitions only, and they are also not very dynamic (I can&amp;#8217;t set up an automatic partitioning scheme that will add new partitions as the data grows).&lt;/p&gt;
+&lt;h4&gt;Current direction&lt;/h4&gt;
+&lt;p&gt;We decided on trying the following flow:&lt;/p&gt;
+&lt;ol&gt;
+ &lt;li&gt;Keeping the HA data in a NoSQL implementation; in our case this means we keep about 6 to 10 million rows in a NoSQL instance.&lt;/li&gt;
+ &lt;li&gt;The most important data (insertions in the last 48 hours) needs to stay at the top resolution, but older data can lose resolution, so we came up with this idea:&lt;/li&gt;
+&lt;/ol&gt;
+&lt;p&gt;We will create a cron task that runs every hour, processes all the samples from the last hour and averages them, storing the result in a statistics table with only the hourly average as the sample value.&lt;br /&gt;
Another task will do the same, rolling up from hours to days, and from days to weeks, which will be our lowest resolution.&lt;/p&gt;
+&lt;p&gt;This method cuts our row counts by tens of millions of rows in places where we can afford a decrease in data resolution.&lt;br /&gt;
This process is still under development, so if anyone has a better idea and cares to enlighten us, please do so.&lt;/p&gt;</content>
</entry>
<entry>
@@ -160,7 +160,7 @@ <h2>Pledgie Donations</h2>
<tr class="alt">
<td class="icon"> <img alt="file" src="images/txt.png"> </td>
<td class="content">
- <a href="/2010/10/scaling-500-million-rows-26-10-2010" id="d8f8d46921aa81abc4c0d27703a8908333ae38c3">Scaling a 500 million rows table</a>
+ <a href="/2010/10/scaling-500-million-rows-26-10-2010" id="d8f8d46921aa81abc4c0d27703a8908333ae38c3">Scaling a 500 million rows table - planning</a>
</td>
<td class="age">
<span class="relatize relatized">26/10/2010</span>
@@ -76,7 +76,7 @@ <h4><a href="/2010/11/bag-o-links-3-11-2010">Bag O' Links - 3/11/2010</a></h4>
</div>
<div class="post_headline">
- <h4><a href="/2010/10/scaling-500-million-rows-26-10-2010">Scaling a 500 million rows table</a></h4>
+ <h4><a href="/2010/10/scaling-500-million-rows-26-10-2010">Scaling a 500 million rows table - planning</a></h4>
<em>26/10/2010</em>
</div>
