Skip to content

Hadoop Compatible File System

Chris Lu edited this page Mar 31, 2019 · 20 revisions

HDFS is optimized for large files. The scalability of the single HDFS namenode is limited by the number of files. It is hard for HDFS to store lots of small files.

SeaweedFS excels on small files and has no issue to store large files. Now it is possible to enable Hadoop jobs to read from and write to SeaweedFS.

Build SeaweedFS Hadoop Client Jar

$cd $GOPATH/src/github.com/chrislusf/seaweedfs/other/java/client
$ mvn install
$cd $GOPATH/src/github.com/chrislusf/seaweedfs/other/java/hdfs
$ mvn package
$ ls -al target/seaweedfs-hadoop-client-*.jar

Or you can download the latest version from MavenCentral

Test SeaweedFS on Hadoop

Suppose you are getting a new Hadoop installation. Here are the minimum steps to get SeaweedFS to run.

You would need to start a weed filer first, build the seaweedfs-hadoop-client-xxx.jar, and do the following:

$ cd ${HADOOP_HOME}
# create etc/hadoop/mapred-site.xml, just to satisfy hdfs dfs. skip this if the file already exists.
$ echo "<configuration></configuration>" > etc/hadoop/mapred-site.xml
$ bin/hdfs dfs -Dfs.defaultFS=seaweedfs://localhost:8888 \
               -Dfs.seaweedfs.impl=seaweed.hdfs.SeaweedFileSystem \
               -libjars ./seaweedfs-hadoop-client-x.x.x.jar \
               -ls /

Both reads and writes are working fine.

Installation for Hadoop

  • Configure Hadoop to use SeaweedFS in etc/hadoop/conf/core-site.xml. core-site.xml resides on each node in the Hadoop cluster. You must add the same properties to each instance of core-site.xml. There are several properties to modify:
    1. fs.seaweedfs.impl: This property defines the Seaweed HCFS implementation classes that are contained in the SeaweedFS HDFS client JAR. It is required.
    2. fs.defaultFS: This property defines the default file system URI to use. It is optional if a path always has prefix seaweedfs://localhost:8888.
<configuration>
    <property>
        <name>fs.seaweedfs.impl</name>
        <value>seaweed.hdfs.SeaweedFileSystem</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>seaweedfs://localhost:8888</value>
    </property>
</configuration>
  • Deploy the SeaweedFS HDFS client jar
# Run the classpath command to get the list of directories in the classpath
$ bin/hadoop classpath

# Copy SeaweedFS HDFS client jar to one of the folders
$ cd ${HADOOP_HOME}
$ cp ./seaweedfs-hadoop-client-x.x.x.jar share/hadoop/common/lib/

Now you can do this:

$ cd ${HADOOP_HOME}
$ bin/hdfs dfs -ls /

# if you did not set fs.defaultFS in etc/hadoop/core-site.xml
# or you want to access a different SeaweedFS filer
$ bin/hdfs dfs -ls seaweedfs://localhost:8888/

Installation for Spark

Follow instructions on spark doc:

installation inheriting from Hadoop cluster configuration

Inheriting from Hadoop cluster configuration should be the easiest way.

To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to a location containing the configuration file core-site.xml, usually /etc/hadoop/conf

installation not inheriting from Hadoop cluster configuration

Copy the seaweedfs-hadoop-client-x.x.x.jar to all executor machines.

Add the following to spark/conf/spark-defaults.conf on every node running Spark

spark.driver.extraClassPath   /path/to/seaweedfs-hadoop-client-x.x.x.jar
spark.executor.extraClassPath /path/to/seaweedfs-hadoop-client-x.x.x.jar

And modify the configuration at runntime:

./bin/spark-submit \ 
  --name "My app" \ 
  --master local[4] \  
  --conf spark.eventLog.enabled=false \ 
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \ 
  --conf spark.hadoop.fs.seaweedfs.impl=seaweed.hdfs.SeaweedFileSystem \ 
  --conf spark.hadoop.fs.defaultFS=seaweedfs://localhost:8888 \ 
  myApp.jar

Installation for HBase

If HBase is used, create a folder and configure the HBase root directory in etc/hbase/conf/hbase-site.xml:

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>seaweedfs://localhost:8888/hbase</value>
    </property>
</configuration>

Supported Operations

    bin/hdfs dfs -appendToFile README.txt /weedfs/weedfs.txt
    bin/hdfs dfs -cat /weedfs/weedfs.txt
    bin/hdfs dfs -rm -r /uber
    bin/hdfs dfs -chown -R chris:chris /weedfs
    bin/hdfs dfs -chmod -R 755 /weedfs
    bin/hdfs dfs -copyFromLocal README.txt /weedfs/README.txt.2
    bin/hdfs dfs -copyToLocal /weedfs/README.txt.2 .
    bin/hdfs dfs -count /weedfs/README.txt.2
    bin/hdfs dfs -cp /weedfs/README.txt.2 /weedfs/README.txt.3
    bin/hdfs dfs -du -h /weedfs
    bin/hdfs dfs -find /weedfs -name "*.txt" -print
    bin/hdfs dfs -get /weedfs/weedfs.txt
    bin/hdfs dfs -getfacl /weedfs
    bin/hdfs dfs -getmerge -nl /weedfs w.txt
    bin/hdfs dfs -ls /
    bin/hdfs dfs -mkdir /tmp
    bin/hdfs dfs -mkdir -p /tmp/x/y
    bin/hdfs dfs -moveFromLocal README.txt.2 /tmp/x/
    bin/hdfs dfs -mv /tmp/x/y/README.txt.2 /tmp/x/y/README.txt.3
    bin/hdfs dfs -mv /tmp/x /tmp/z
    bin/hdfs dfs -put README.txt /tmp/z/y/
    bin/hdfs dfs -rm /tmp/z/y/*
    bin/hdfs dfs -rmdir /tmp/z/y
    bin/hdfs dfs -stat /weedfs
    bin/hdfs dfs -tail /weedfs/weedfs.txt
    bin/hdfs dfs -test -f /weedfs/weedfs.txt
    bin/hdfs dfs -text /weedfs/weedfs.txt
    bin/hdfs dfs -touchz /weedfs/weedfs.txtx

Operations Plan to Support

  getfattr
  setfacl
  setfattr
  truncate
  createSnapshot
  deleteSnapshot
  renameSnapshot
  setrep

Notes

Atomicity

SeaweedFS satisfies the HCFS requirements that the following operations to be atomic, when using MySql/Postgres database transactions.

  1. Creating a file. If the overwrite parameter is false, the check and creation MUST be atomic.
  2. Deleting a file.
  3. Renaming a file.
  4. Renaming a directory.
  5. Creating a single directory with mkdir().

Among these, except file or directory renaming, the following operations are all atomic for any filer store.

  1. Creating a file
  2. Deleting a file
  3. Creating a single directory with mkdir().

No native shared libraries

The SeaweedFS hadoop client is a pure java library. There are no native libraries to install if you already have Hadoop running.

This is different from many other HCFS options. If native shared libraries are needed, these libraries need to be install on all hadoop nodes. This is quite some task.

Shaded Fat Jar

One of the headache with complicated Java systems is the jar runtime dependency problem, which is resolved by Go's build time dependency resolution. For this SeaweedFS hadoop client, the required jars are mostly shaded and packaged as one fat jar, so there are no extra jar files needed.

You can’t perform that action at this time.