ReGrid File Storage

bchavez edited this page Jan 31, 2016 · 64 revisions

What is ReGrid?

ReGrid is a distributed file store built on top of RethinkDB, inspired by GridFS from MongoDB. With ReGrid, a large file (4 GB, for example) can be broken up into chunks and stored on a RethinkDB cluster. Later, the file can be retrieved by streaming its chunks back to the client. The figure below shows ReGrid storing a large video file in chunks across a three-node cluster.

Figure 1: Physical Layout (Note: Please ask permission before using figures in presentations, videos, or other works. Thanks.)
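
As a rough illustration of the chunking described above, the sketch below counts the chunks a file would occupy. ChunkMath is a hypothetical helper, not part of ReGrid, and the 255 KB chunk size used in the note that follows is an assumption borrowed from the GridFS default; ReGrid's actual default may differ.

```csharp
using System;

// Hypothetical helper, not part of ReGrid: illustrates chunk arithmetic only.
public static class ChunkMath
{
    // Number of chunks needed to hold fileSizeBytes, rounding up so that a
    // partial final chunk still counts as one chunk.
    public static long ChunkCount(long fileSizeBytes, int chunkSizeBytes)
    {
        if (fileSizeBytes < 0)
            throw new ArgumentOutOfRangeException(nameof(fileSizeBytes));
        if (chunkSizeBytes <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSizeBytes));

        return (fileSizeBytes + chunkSizeBytes - 1) / chunkSizeBytes;
    }
}
```

Under those assumptions, ChunkMath.ChunkCount(4L * 1024 * 1024 * 1024, 255 * 1024) gives 16,449: a 4 GB file splits into 16,448 full chunks plus one final 64 KB chunk.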

Important Terms

  • Physical refers to the physical topology, location, and layout of data.
  • Logical refers to the logical location of data: a high-level user's view of how files are organized, regardless of the physical layout of data.

Getting Started

Buckets

A Bucket is a logical set of files organized together. File read/download and write/upload operations are performed using a Bucket.

  • A Bucket requires a RethinkDB database.
  • A RethinkDB database can be partitioned into several Buckets.
  • Multiple Buckets in the same RethinkDB database are differentiated by a Bucket's name.
  • The default name for a Bucket is fs.

The figure below illustrates the logical separation of buckets within a single MyFiles database:

Figure 2: Logical Buckets in MyFiles DB

In Figure 2 above, there are three logical Buckets in the MyFiles RethinkDB database. It is important to note that video.mp4 from the fs bucket is not the same file as video.mp4 from the dev bucket. Buckets can be used to organize files in any way developers see fit.

To create a Bucket named dev in the MyFiles database:

var bucket = new Bucket(conn, "MyFiles", bucketName: "dev" );
bucket.Mount(); // required before use...

The dev Bucket must be mounted before use. Mount ensures that the tables and indexes the Bucket needs exist.

Files

When a File is uploaded to a Bucket, a path is specified in the destination Bucket. Multiple uploads to the same path cause the file to be revisioned. Figure 3 below shows the revisions of a file.

Figure 3: File Revisions

Revision Numbers

Positive                         Negative
0: The original stored file.     -1: The most recent revision.
1: The first revision.           -2: The second most recent revision.
2: The second revision.          -3: The third most recent revision.
etc...                           etc...
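
Both numbering schemes address the same ordered list of revisions. The sketch below shows one way to map either kind of revision number to a position in that list, assuming the revisions of a path are ordered oldest first. RevisionMath is a hypothetical helper for illustration and does not reflect how ReGrid resolves revisions internally.

```csharp
using System;

// Hypothetical helper, not part of ReGrid: maps a revision number to a
// zero-based index into a path's revisions, ordered oldest first.
public static class RevisionMath
{
    public static int ToIndex(int revision, int revisionCount)
    {
        // Non-negative numbers count forward from the original (0).
        // Negative numbers count backward from the newest (-1).
        int index = revision >= 0 ? revision : revisionCount + revision;

        if (index < 0 || index >= revisionCount)
            throw new ArgumentOutOfRangeException(nameof(revision));

        return index;
    }
}
```

For a path with three uploads, revisions 0 and -3 both refer to the original file, and revisions 2 and -1 both refer to the newest one.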

Upload

The following code uploads a file to a Bucket:

// Upload a file using byte[]
var fileId = bucket.Upload("/video.mp4", videoBytes);

// Upload a file using an IO stream
Guid uploadId;
using( var fileStream = File.Open("C:\\video.mp4", FileMode.Open) )
using( var uploadStream = bucket.OpenUploadStream("/video.mp4") )
{
    uploadId = uploadStream.FileInfo.Id;
    fileStream.CopyTo(uploadStream);
}

fileId (or uploadId in the stream example) references that specific revision of the file. Bucket also provides other methods that work with IO streams, as well as async variants.

UploadOptions

UploadOptions can be specified to control ChunkSizeBytes, the size of the document chunks stored in RethinkDB. Optionally, arbitrary Metadata can be stored along with the uploaded file.

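// r is RethinkDB.R from the RethinkDb.Driver namespace
var opts = new UploadOptions();

opts.SetMetadata(new
    {
        UserId = "123",
        LastAccess = r.now(),
        Roles = r.array("admin", "office"),
        ContentType = "application/pdf"
    });

// testFile is the destination path; TestBytes.HalfChunk is the byte[] to upload
var id = bucket.Upload(testFile, TestBytes.HalfChunk, opts);

// Read the metadata back from the stored file's info
var fileInfo = bucket.GetFileInfo(id);

// Should() and Dump() come from the test suite this snippet was taken from
fileInfo.Metadata["UserId"].Value<string>().Should().Be("123");
fileInfo.Dump();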

Download

// Download to a byte[]
var bytes = bucket.DownloadAsBytesByName("/video.mp4");

// Download revision 0 (the original) to a file stream on the client
using( var localFileStream = File.Open("C:\\video_original.mp4", FileMode.Create) )
{
    bucket.DownloadToStreamByName("/video.mp4", localFileStream, revision: 0);
}

Use DownloadAsBytes with caution: it returns a byte[], which can hold at most int.MaxValue bytes. For relatively large files, use DownloadToStream, which has no size limit beyond the file size limits of the client's OS.

Seekable Download Streams

ReGrid supports starting downloads at an offset by seeking into part of a large file.

var opts = new DownloadOptions {Seekable = true};

using( var stream = bucket.OpenDownloadStream("/video.mp4", options: opts) )
{
    stream.Seek( 1024 * 1024 * 20, SeekOrigin.Begin);

    //start reading 20MB into the file...
}
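
Once a stream is positioned, reading a byte range from it is ordinary System.IO work. The sketch below is a hypothetical helper, not part of ReGrid, that reads a range from any seekable Stream. It assumes the stream returned by OpenDownloadStream with Seekable = true behaves like a standard seekable Stream.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of ReGrid. Works on any seekable Stream.
public static class StreamRange
{
    // Reads up to count bytes starting at offset. Returns fewer bytes
    // if the stream ends before count bytes are available.
    public static byte[] Read(Stream stream, long offset, int count)
    {
        if (!stream.CanSeek)
            throw new NotSupportedException("Stream is not seekable.");

        stream.Seek(offset, SeekOrigin.Begin);

        var buffer = new byte[count];
        int total = 0;

        // Stream.Read may return fewer bytes than requested, so loop
        // until the buffer is full or the stream is exhausted.
        while (total < count)
        {
            int n = stream.Read(buffer, total, count - total);
            if (n == 0) break;
            total += n;
        }

        if (total < count) Array.Resize(ref buffer, total);
        return buffer;
    }
}
```

A range reader like this is the kind of building block needed to resume an interrupted download or serve part of a large file without reading it from the start.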

rg Command Line Utility

### List files in /foo folder, using localhost as default

rg ls /foo


### List root files in cluster with a well-known IP 192.168.0.4

rg 192.168.0.4 ls /
rg 192.168.0.4:2802 ls /
rg 192.168.0.4 ls /folder

### Get file metadata on /path/video.mp4 and any past revisions
 
rg 192.168.0.4 info /path/video.mp4


### Copy video.mp4 from the local computer and send it
### to the cluster at 192.168.0.4 but wait until there
### is a pooled connection of at least 5 servers to
### increase fan-out write-performance.

rg 192.168.0.4 -pool 5 put ./video.mp4 /video_uploads/video.mp4


### Download /video_uploads/video.mp4 from the cluster at
### 192.168.0.4 to video.mp4 on the local computer, but wait
### until there is a pooled connection of at least 5 servers
### to increase fan-in read-performance.

rg 192.168.0.4 -pool 5 get /video_uploads/video.mp4 ./video.mp4


### Perform a sha256 integrity check on video.mp4

rg 192.168.0.4 fsck /path/video.mp4


### Reclaim disk space by cleaning up orphaned file chunks or soft-deleted files.

rg 192.168.0.4 cleanup
