Tachyon - Reliable File Sharing at Memory Speed Across Cluster Frameworks
In-memory data processing has gained tremendous attention recently. People use in-memory computation frameworks to perform fast and interactive queries. However, several problems remain unsolved:
- Slow data sharing in a workflow: Companies build complicated workflows to process data and use distributed file systems, such as HDFS or S3, to pass one job's output to other jobs as input. Even though in-memory computation is fast, writing the output is slow because it is bounded by disk or network bandwidth.
- Duplicated input data for different jobs: Without a data sharing service that operates at memory speed, applications written in frameworks such as Spark or MapReduce must each keep their own in-memory copy of the data, even when they share the same input. This multiplies the amount of memory needed.
- Lost cache when the JVM crashes: In computation frameworks such as Spark, the in-memory storage engine and the execution engine co-exist in the same JVM. If the JVM crashes, all the in-memory data is lost, and the next program has to load the data into memory again, which can be time consuming.
The Tachyon project aims to address the above issues. Tachyon is a fault-tolerant distributed file system that enables reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. It achieves high performance by leveraging lineage (alpha) information and using memory aggressively. Tachyon caches working set files in memory and enables different jobs, queries, and frameworks to access cached files at memory speed. Thus, Tachyon avoids going to disk to load datasets that are frequently read.
In this chapter we first go over the basic operations of Tachyon, and then run a Spark program on top of it. For more information, please visit Tachyon's website or GitHub repository.
All of the system's configuration files are in the tachyon/conf folder. Find them and check how much memory is configured on each worker node.
You can also read through the files and try to understand the other parameters. For more information on configuration, see Tachyon's Configuration Settings webpage.
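One quick way to answer the memory question is to search the configuration folder for memory-related settings. The sketch below assumes it is run from the tachyon folder. The helper name show_memory_settings is made up for illustration, and the search term is only a guess at how the relevant parameter is named.

```shell
# Hypothetical helper: list every line in a configuration folder that
# mentions "memory" (case-insensitive), with file name and line number.
# The default folder, conf, assumes the current directory is tachyon.
show_memory_settings() {
  grep -rin "memory" "${1:-conf}"
}
```

Call it as show_memory_settings, or pass another folder as the first argument if the configuration lives elsewhere.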
Before starting Tachyon for the first time, we need to format the system. This is done with the tachyon script in the tachyon/bin folder. Type the following command first, and read its usage output to learn how to format the Tachyon file system.
$ ./bin/tachyon
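Among the commands the script lists should be one for formatting. As a minimal sketch, assuming that subcommand is named format (confirm this against the usage output of your version) and that the launcher is at ./bin/tachyon; the TACHYON variable and the helper name are illustrative only:

```shell
# TACHYON is an illustrative variable; it defaults to ./bin/tachyon,
# which assumes the current directory is the tachyon folder.
TACHYON="${TACHYON:-./bin/tachyon}"

# Hypothetical helper: format the Tachyon file system before the first start.
# Assumes the launcher accepts a "format" subcommand.
format_tachyon() {
  "$TACHYON" format
}
```

Running format_tachyon once before the first start is enough for this tutorial.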
After formatting the storage, we can start the system using the tachyon/bin/tachyon-start.sh script. The command below prints output similar to the lines that follow it.
$ ./bin/tachyon-start.sh all Mount
Killed 0 processes
Killed 0 processes
localhost: Killed 0 processes
Formatting RamFS: /mnt/ramdisk (1.1gb)
Starting master @ localhost
Starting worker @ hy-ubuntu
In this section, we will go over three approaches to interact with Tachyon:
- Command Line Interface
- Application Programming Interface
- Web User Interface
You can interact with Tachyon using the following command:
$ ./bin/tachyon tfs
Then, it will return a list of options:
Usage: java TFsShell
       [cat <path>]
       [count <path>]
       [ls <path>]
       [lsr <path>]
       [mkdir <path>]
       [rm <path>]
       [tail <path>]
       [touch <path>]
       [mv <src> <dst>]
       [copyFromLocal <src> <remoteDst>]
       [copyToLocal <src> <localDst>]
       [fileinfo <path>]
       [location <path>]
       [report <path>]
       [request <tachyonaddress> <dependencyId>]
Try putting the local file tachyon/LICENSE into the Tachyon file system as /LICENSE.txt using the command line.
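One possible solution uses the copyFromLocal option from the usage list. The sketch assumes the current directory is the tachyon folder, so the local file is simply LICENSE; the TACHYON variable and the helper name are illustrative.

```shell
# TACHYON is an illustrative variable; it defaults to ./bin/tachyon,
# which assumes the current directory is the tachyon folder.
TACHYON="${TACHYON:-./bin/tachyon}"

# Hypothetical helper: copy the local LICENSE file into Tachyon
# as /LICENSE.txt, using copyFromLocal <src> <remoteDst>.
copy_license() {
  "$TACHYON" tfs copyFromLocal LICENSE /LICENSE.txt
}
```

Calling copy_license is equivalent to typing the tfs command by hand.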
You can also use the command line interface to verify this:
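A listing of the root folder should now include the copied file. This sketch uses the ls option from the usage list, under the same assumption that the launcher is at ./bin/tachyon; the helper name is illustrative.

```shell
# TACHYON is an illustrative variable; it defaults to ./bin/tachyon.
TACHYON="${TACHYON:-./bin/tachyon}"

# Hypothetical helper: list the Tachyon root folder, where
# /LICENSE.txt is expected to appear after the copy.
list_root() {
  "$TACHYON" tfs ls /
}
```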
Now, check the content of the file:
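The cat option from the usage list prints a file stored in Tachyon. A minimal sketch, again assuming the launcher is at ./bin/tachyon and using an illustrative helper name:

```shell
# TACHYON is an illustrative variable; it defaults to ./bin/tachyon.
TACHYON="${TACHYON:-./bin/tachyon}"

# Hypothetical helper: print the content of /LICENSE.txt from Tachyon.
show_license() {
  "$TACHYON" tfs cat /LICENSE.txt
}
```

For a long file, the tail option from the same list may be more convenient than cat.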
After using the command line to interact with Tachyon, you can also use its API. We have several sample applications. For example, BasicOperations.java shows how to use the file create, write, and read operations.
We have put these into our script, so you can run this sample program with a single command, which executes BasicOperations.java and also verifies Tachyon's installation.
After using commands and API to interact with Tachyon, let's take a look at its web user interface.
The URI is http://ec2masterhost:19999.
The first page is the cluster's summary. If you click Browse File System, it shows all the files you just created and copied.
You can also click a particular file or folder, e.g. the /LICENSE.txt file, to see detailed information about it.
In this section, we run a Spark program to interact with Tachyon. The first program does a word count on the /LICENSE.txt file. In the /root/spark folder, execute the following command to start the Spark shell.
$ ./bin/spark-shell
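A word count over the file could look like the sketch below, which prints three Scala statements for the Spark shell. It assumes the Tachyon master is reachable at tachyon://ec2masterhost:19998: the host name is taken from the web UI address, and the port is an assumption to check against your configuration. The helper name and the TACHYON_URI variable are illustrative.

```shell
# Assumption: the Tachyon master listens on ec2masterhost:19998
# (host taken from the web UI address; the port is assumed).
TACHYON_URI="${TACHYON_URI:-tachyon://ec2masterhost:19998}"

# Hypothetical helper: print the Scala statements for the word count.
# They read /LICENSE.txt from Tachyon, count each space-separated word,
# and save the counts back to Tachyon under /result.
word_count_script() {
  cat <<EOF
val file = sc.textFile("$TACHYON_URI/LICENSE.txt")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("$TACHYON_URI/result")
EOF
}
```

The three Scala lines can be typed directly at the Spark shell prompt. Alternatively, running word_count_script | ./bin/spark-shell from /root/spark should feed them to a fresh shell.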
The results are stored in the /result folder. You can verify the results through the Web UI or commands.
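To check from the command line, a recursive listing of the output folder shows the files Spark wrote. This sketch uses the lsr option from the tfs usage list and assumes it is run from the tachyon folder rather than /root/spark; the helper name is illustrative.

```shell
# TACHYON is an illustrative variable; it defaults to ./bin/tachyon,
# which assumes the current directory is the tachyon folder.
TACHYON="${TACHYON:-./bin/tachyon}"

# Hypothetical helper: recursively list the word count output folder.
list_result() {
  "$TACHYON" tfs lsr /result
}
```

Any file shown in the listing can then be printed with the cat option.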
Because /LICENSE.txt is in memory, when a new Spark program comes up, it can load the in-memory data directly from Tachyon. In the meantime, we are also working on other features so that Tachyon can further enhance Spark's performance.
This brings us to the end of the Tachyon chapter of the tutorial. We encourage you to continue playing with the code and to check out the project website or GitHub repository for further information.
Bug reports and feature requests are welcome.