
finish background flash ssd part of report
Ming committed Dec 15, 2012
1 parent 8d70b27 commit 354e507
Showing 11 changed files with 326 additions and 272 deletions.
12 changes: 12 additions & 0 deletions doc/references.bib
@@ -255,12 +255,24 @@ @inproceedings{kvworkload_sigmetrics
workload modeling},
}
@MISC{ssdanatomy,
TITLE = "{Anatomy of SSDs}",
NOTE = "\url{http://www.linux-mag.com/id/7590/}",
KEY = "SSD"
}

@MISC{flashcache,
TITLE = "{A Write Back Block Cache for Linux}",
NOTE = "\url{https://github.com/facebook/flashcache/}",
KEY = "Flashcache"
}

@MISC{flashwiki,
TITLE = "{Flash Memory}",
NOTE = "\url{http://en.wikipedia.org/wiki/Flash_memory}",
KEY = "Flash"
}

@MISC{bcache,
TITLE = "{A Linux kernel block layer cache}",
NOTE = "\url{http://bcache.evilpiepirate.org/}",
208 changes: 150 additions & 58 deletions doc/report/bg.tex
@@ -1,78 +1,170 @@
\section{Background}
\label{sec:bg}

MRIS is a key/value store designed for multi-resolution image
workloads using a multi-tier architecture consisting of Flash SSD and
magnetic HDD.

\subsection{Flash SSD}
Flash is a type of non-volatile memory. There are two main types of
flash memory, named after the NAND and NOR logic
gates~\cite{flashwiki}. NAND is the more popular type used in
SSDs~\cite{ssdanatomy}. A NAND flash cell stores data by trapping
electrons between its gates: a charged cell reads as a logical 0,
whereas an uncharged (erased) cell reads as a logical 1. NAND can be
further divided into SLC and MLC by the number of bits that can be
represented in a cell.

NAND Flash has asymmetric read and write performance. Reads are fast,
taking roughly 50$\mu$s for an MLC cell~\cite{ssdanatomy}, while
writes are 10--20 times slower. Writes are further complicated by the
fact that bits cannot simply be overwritten: before a write, a block
has to undergo an erase procedure that is 2--3 orders of magnitude
slower than a read. Moreover, a NAND Flash cell can endure only a
limited number of erase cycles. Therefore, Flash chips are often used
for storage in the form of SSDs, which also contain an internal
controller, processor, and RAM. Algorithms including log-structured
writing, wear-leveling, and garbage collection are implemented inside
the SSD to make Flash writes faster and the device last longer.
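The erase-before-write constraint described above can be illustrated
with a toy model (a hypothetical sketch for exposition, not MRIS or
any real firmware code): programming can only clear bits, so setting
any bit back to 1 forces a whole-block erase, which consumes one of
the cell's limited erase cycles.

```python
class FlashBlock:
    """Toy model of a NAND flash block: a program operation can only
    clear bits (1 -> 0); restoring bits to 1 requires erasing the
    whole block, which wears the cells."""

    def __init__(self, pages=4):
        self.pages = [0xFF] * pages    # erased state: all bits 1
        self.erase_count = 0

    def program(self, page, value):
        # Reject any write that would flip a 0 bit back to 1.
        if value & ~self.pages[page]:
            raise ValueError("must erase block before setting bits to 1")
        self.pages[page] = value

    def erase(self):
        self.pages = [0xFF] * len(self.pages)
        self.erase_count += 1          # endurance is limited

block = FlashBlock()
block.program(0, 0b10101010)           # fine: only clears bits
try:
    block.program(0, 0b11111111)       # needs 0 -> 1 transitions
except ValueError:
    block.erase()                      # whole-block erase first
    block.program(0, 0b11111111)
```

This is why SSD firmware prefers log-structured (out-of-place) writes:
redirecting updates to already-erased pages avoids the slow erase on
the write path and spreads wear across blocks.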

\subsection{Key/Value Store}
We implemented MRIS using LevelDB~\cite{leveldb-web}, an open-source
key/value database engine developed by Google. LevelDB is
log-structured and organizes data into Sorted String Tables
(SSTables). An SSTable, introduced in Bigtable~\cite{chang06osdi}, is
an immutable data structure containing a sequence of key/value pairs
sorted by key, as shown in Figure~\ref{fig:sstable}. Besides the key
and value, there may be optional fields such as a CRC, the compression
type, etc. SSTables are mostly saved as files, and each of them can
contain auxiliary data structures, such as a Bloom filter, to
facilitate key lookup. SSTables have an in-memory counterpart called a
Memtable. The key/value pairs in a Memtable are often kept in data
structures that support efficient insertion and lookup, such as
red-black trees and skiplists.
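The Memtable/SSTable relationship can be sketched in a few lines of
Python (class and method names here are illustrative, not LevelDB's
actual API; LevelDB's Memtable is a skiplist, while a sorted list
stands in for it below):

```python
import bisect

class Memtable:
    """Mutable, sorted in-memory map (stand-in for a skiplist)."""
    def __init__(self):
        self.keys, self.values = [], []

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value          # overwrite in memory
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

class SSTable:
    """Immutable sequence of key/value pairs sorted by key,
    built by flushing a Memtable."""
    def __init__(self, memtable):
        self.items = list(zip(memtable.keys, memtable.values))

    def get(self, key):
        # Sorted order permits binary search.
        i = bisect.bisect_left(self.items, (key,))
        if i < len(self.items) and self.items[i][0] == key:
            return self.items[i][1]
        return None

mt = Memtable()
mt.put("img:2", b"large")
mt.put("img:1", b"small")
table = SSTable(mt)        # flush: pairs come out sorted by key
```

A real SSTable is serialized to a file and may carry the optional
fields mentioned above (CRC, compression type, Bloom filter); the
sketch keeps only the sorted-pairs core.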

LevelDB, like most other key/value engines, uses a Log-Structured
Merge-Tree (LSM-tree)~\cite{lsm} for internal storage. When key/value
pairs are first added, they are inserted into the Memtable. Once the
size of the Memtable grows beyond a certain threshold, the whole
Memtable is flushed out into an SSTable, and a new Memtable is created
for further insertion. When a key/value pair is changed, the new pair
is inserted without modifying the old one. When a key/value pair is
deleted, a deletion marker is inserted by setting a flag, called
\texttt{KeyType}, inside the key. This way, a key/value store can
provide high insertion throughput because data is written out using
sequential I/Os, which perform well on Hard Disk Drives (HDDs).
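The insert/flush/delete flow can be condensed into a minimal sketch
(all names are hypothetical; a plain sentinel object stands in for the
\texttt{KeyType} deletion flag, and a tiny threshold triggers the
flush):

```python
TOMBSTONE = object()   # stand-in for the KeyType deletion marker

class LSMStore:
    """Toy LSM store: writes land in a memtable; when it reaches
    `threshold` entries it is frozen into an immutable sorted table.
    Deletes insert a marker instead of touching old data."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.memtable = {}
        self.tables = []                       # newest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            # Flush: sequential write of the sorted memtable contents.
            self.tables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}                 # fresh memtable

    def delete(self, key):
        self.put(key, TOMBSTONE)               # old pair left in place

    def get(self, key):
        if key in self.memtable:
            v = self.memtable[key]
            return None if v is TOMBSTONE else v
        for table in self.tables:              # reverse chronological
            for k, v in table:
                if k == key:
                    return None if v is TOMBSTONE else v
        return None

store = LSMStore()
store.put("a", 1)
store.put("b", 2)      # memtable full: frozen into a sorted table
store.delete("a")      # tombstone shadows the flushed pair
```

Note that the delete never rewrites the flushed table; the tombstone
is simply found first, which is exactly why compaction (described
below) is needed to reclaim the dead space.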

\begin{figure}[t]
\begin{centering}
\epsfig{file=figures/sstable.eps,width=1.00\linewidth}
\caption{SSTable}
\label{fig:sstable}
\end{centering}
\end{figure}

\begin{figure}[t]
\begin{centering}
\epsfig{file=figures/leveldb-compact.eps,width=1.00\linewidth}
\caption{LevelDB Compaction}
\label{fig:compact}
\end{centering}
\end{figure}

To serve a key lookup, the Memtable is queried first. If the key is
not found in the Memtable, the SSTables are queried in reverse
chronological order. A naive implementation of such a lookup can be
very slow because, in the worst case, the whole database needs to be
read and checked. To make lookups fast, SSTables are organized into
several layers, with the size of each table increasing from the top
layer to the bottom. Background jobs are launched periodically to sort
and merge small SSTables into larger ones in the next layer; this is
called compaction. Deleted pairs are also removed during compaction. A
lookup then iterates over the SSTables layer by layer and returns once
the key is found. Because SSTables are sorted by key, fast lookup
algorithms such as binary search can be used. There is also an index
that records the key range covered by each SSTable, so that it
suffices to check only the SSTables whose key ranges cover the key of
interest. Inside each SSTable, we can have a Bloom filter to filter
out negative key lookups and a secondary index for faster search.
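The layered lookup can be sketched as follows (a simplified model, not
LevelDB code: each layer is a list of SSTables represented as sorted
pair lists, the "index" is just each table's first and last key, and
the Bloom filter is omitted):

```python
import bisect

def lookup(key, layers):
    """Search layers top-down (newest first); within a layer, consult
    each table's key range before binary-searching its sorted pairs."""
    for layer in layers:
        for pairs in layer:
            lo, hi = pairs[0][0], pairs[-1][0]   # key-range "index"
            if not (lo <= key <= hi):
                continue                         # skip this SSTable
            i = bisect.bisect_left(pairs, (key,))
            if i < len(pairs) and pairs[i][0] == key:
                return pairs[i][1]               # newest wins
    return None

layers = [
    [[("k3", "v3-new")]],                        # top layer (newest)
    [[("k1", "v1"), ("k3", "v3-old")],           # bottom layer
     [("k5", "v5")]],
]
```

Because the scan stops at the first hit, the newer "v3-new" shadows
the stale "v3-old" in the lower layer, matching the reverse
chronological search order described above.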

In LevelDB, there are two Memtables: once one is filled, the other is
used for further insertions. The filled one is flushed into an SSTable
in the background. LevelDB's compaction procedure is illustrated in
Figure~\ref{fig:compact}: an SSTable (a) at layer $n$ is merged with
the SSTables at layer $n+1$ whose keys overlap with (a), producing new
SSTables at layer $n+1$.
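The merge step of compaction can be sketched like this (an
illustrative model, not LevelDB's implementation: tables are sorted
pair lists, the upper-layer table wins on duplicate keys, and the
output is left as one merged table instead of being split by size):

```python
import heapq

def compact(table, next_layer):
    """Merge one sorted table into the overlapping tables of the
    next layer; for duplicate keys the upper (newer) table wins."""
    lo, hi = table[0][0], table[-1][0]
    overlapping = [t for t in next_layer
                   if t[0][0] <= hi and t[-1][0] >= lo]
    untouched = [t for t in next_layer if t not in overlapping]
    # Tag pairs so the newer run (tag 0) sorts before older (tag 1)
    # when keys collide during the heap merge.
    runs = [[(k, 0, v) for k, v in table]]
    runs += [[(k, 1, v) for k, v in t] for t in overlapping]
    merged, seen = [], set()
    for k, _, v in heapq.merge(*runs):
        if k not in seen:                  # keep newest, drop stale
            seen.add(k)
            merged.append((k, v))
    return untouched + [merged]

level2 = [[("a", "old"), ("b", "old")], [("x", "old")]]
new_level2 = compact([("b", "new"), ("d", "new")], level2)
```

The table covering only "x" does not overlap the incoming key range
"b"--"d", so it is left untouched, while the overlapping table is
merged with the newer pairs winning; a real compaction would also drop
tombstoned pairs at this point.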

%Background...

%Typical length: 0 pages to 1.0.

%Background and Related Work can be similar.
%Most citations will be in this section.

%1. Describe past work and criticize it, fairly. Use citations
%to JUSTIFY your criticism! Problem: hard to compare to YOUR
%work, b/c you've not yet described your work in enough
%detail. Solution: move this text to Related Work at end of
%paper.

%2. Describe in some detail, background material necessary to
%understand the rest of the paper. Doesn't happen often,
%esp. if you've covered it in Intro.

%Example, submit a paper to a storage conference: reviewers are
%experts in storage. Don't need to tell them about basic disk
%operation. But if your paper, say, is an improvement over an
%already-advanced data structure (eg., COLA), then it'd make
%sense to describe basic COLA algorithms in some detail.

%Important: open the bg section with some "intro" text to tell
%reader what to expect (so experienced readers can skip it).

%If your bg material is too short, can fold it into opening of
%'design' section.


%\textbf{notes about picking a project}

%Put every possible related citation you can! (esp. if conf.
%doesn't count citations towards page size).

%Literature survey:
%- CiteSeer

%- Google Scholar

%- libraries

%1. find a few relates paper

%2. skim papers to find relevance

%3. search for add'l related papers in Biblio.

%4. reverse citation: use srch engines, to find
%newer papers that cite the paper you like.

%5. "stop" when reach transitive closure

%- then go off and read it.

%- think about "how can I improve" and "what was so
%good about that paper".

%- check future work for project ideas.

%- go to talks \& conferences

%Pick an idea:

%- novelty vs. incremental (how big of an increment?)

%- idea vs. practical implications
%(implemented? released? in use as OSS or commercial?)

%- where to submit? good fit and match for quality.

%- look at schedule of conferences: due dates \& result dates.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% For Emacs:
10 changes: 5 additions & 5 deletions doc/report/conclusion.tex
@@ -9,11 +9,11 @@ \section{Conclusions}
of an emulated Facebook image workload by 3.73$\times$.

\paragraph{Future Work}
We plan to further study the model of multi-tier key/value stores,
considering cost, the characteristics of different drives, as well as
more workload-specific features besides the ratio of reads of small
and large objects. More threads will be used in benchmarks to exploit
better parallelism in Flash SSDs.

\paragraph{Acknowledgement}
The author would like to thank Vasily Tarasov and Mike Ferdman for
32 changes: 17 additions & 15 deletions doc/report/eval.tex
@@ -1,13 +1,14 @@
\section{Evaluation} \label{sec:eval}

We have evaluated our system on a 64-bit Dell server with 1GB memory
and a one-core Intel(R) Xeon(TM) 2.80GHz CPU. The OS, an Ubuntu server
8.04 with kernel version 3.2.9, is installed on a Maxtor 7L250S0
3.5-inch SATA HDD. We used another SATA HDD of the same model and one
Flash-based SSD for the MRIS store. The HDD is also a Maxtor 7L250S0
with a capacity of 250GB and a rotational speed of 7200RPM. The SSD is
an Intel SSDSA2CW300G3 2.5-inch with 300GB capacity. The code and
benchmark results are publicly available at
\url{https://github.com/brianchenming/mris}.

\subsection{Measure drives}
\label{sec:drives}
@@ -102,12 +103,13 @@ \subsection{Wikipedia Image Workload}
$(4, 64]$ KB are most popular. They sum to 81.58\% of the total number
of image requests. Moreover, 94.57\% of the requests are for images
smaller than or equal to 128KB. Despite the fact that small images
(smaller than or equal to 128KB) are the absolute majority in terms of
request numbers, the traffic (size$\times$frequency) introduced by
them is just 2.96\% of the total. Although not all the requests make
their way to the storage layer because of memory caches (such as
Memcached~\cite{memcached}), this salient size-tiered property of
requests still makes size-tiered storage a close match for
multi-resolution image workloads.

\subsection{MRIS Write}

@@ -272,8 +274,8 @@ \subsection{MRIS Read}
1000000 \frac{ratio + 1}{t_{SH} * ratio + t_{LH}}
\end{equation}

Using linear regression, we estimated the values of the variables
(also shown in Table~\ref{tbl:variable}) from our benchmark data.
Approximately, the ops/sec of Hybrid can be expressed as in
(\ref{eqn:hybridops}).

