Load database from .gz archive #57

Closed
SergeC opened this issue Jul 7, 2014 · 9 comments
@SergeC

SergeC commented Jul 7, 2014

I'd like to propose adding support for compressed databases. Use case: I load the Freebase data dump from https://developers.google.com/freebase/data; the file is about 20 GB, and I want to be able to use the database without unpacking it.

@kortschak
Contributor

This is trivially implemented, but the design is important. How do you propose to detect compressed input? Going by file extension is easier but more brittle; checking the magic numbers for bzip2 and gzip is more robust.

@barakmich, your view?

@SergeC
Author

SergeC commented Jul 8, 2014

The easiest way is to detect compressed input by file extension, or to add a "--format" parameter: "./cayley http --dbpath=30kmovies.nt.zip --format=zip".

The reason for doing this is to save disk space. Sometimes it's hard to find 300 GB of space, especially on rented hardware (VPS, cloud services).

@kortschak
Contributor

I'm not keen on adding a flag for this; it allows a greater opportunity for discordance between data and processing. Per extension is easy:

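import (
    "compress/bzip2"
    "compress/gzip"
    "io"
    "os"
    "path/filepath"
)

// extension picks a decompressor based on the file's extension.
func extension(file *os.File) (io.Reader, error) {
    switch filepath.Ext(file.Name()) {
    case ".gz", ".gzip":
        return gzip.NewReader(file)
    case ".bz2":
        return bzip2.NewReader(file), nil
    default:
        return file, nil
    }
}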

All magic number determination needs is a read of the head of the file to get the file type (gzip[:2] == "\x1f\x8b" and bzip2[:3] == "BZh"). This is not difficult:

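import (
    "bytes"
    "compress/bzip2"
    "compress/gzip"
    "io"
)

const (
    gzipMagic  = "\x1f\x8b"
    bzip2Magic = "BZh"
)

type readAtReader interface {
    io.Reader
    io.ReaderAt
}

// magic picks a decompressor by inspecting the first bytes of the input.
func magic(r readAtReader) (io.Reader, error) {
    var buf [3]byte
    // ReadAt reports io.EOF for inputs shorter than 3 bytes; treat those as uncompressed.
    _, err := r.ReadAt(buf[:], 0)
    if err != nil && err != io.EOF {
        return nil, err
    }
    switch {
    case bytes.Equal(buf[:2], []byte(gzipMagic)):
        return gzip.NewReader(r)
    case bytes.Equal(buf[:3], []byte(bzip2Magic)):
        return bzip2.NewReader(r), nil
    default:
        return r, nil
    }
}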

@SergeC
Author

SergeC commented Jul 8, 2014

Sounds very good. Could you please make a pull request? I really need to be able to search the compressed Freebase data as soon as possible. Thanks.

@kortschak
Contributor

I'm waiting on Barak's input. I'd prefer the magic number approach, but I'll defer to his view on this.

@SergeC
Author

SergeC commented Jul 14, 2014

This is how I currently see the process of using this database:

  1. Load the data into LevelDB, MongoDB, etc.
  2. Use the data loaded in step 1 in Cayley.

My questions are:

  1. Is it possible to use the gzipped Freebase data dump in Cayley as a backend without loading the dump into a database? From the examples of using Cayley, it seems that only an unpacked Freebase data dump can be used, and that requires a lot of RAM.
  2. If Q1 is not possible: to speed up the process and reduce RAM usage, is it possible to load into LevelDB, MongoDB, etc. only the data I need from the gzipped Freebase data dump? For example, I need only the music domain.

To make Cayley usable, I currently cut out the data I need with a few greps and then use Cayley to query it.

@kortschak
Contributor

This was fixed by 984ab6f.

@SergeC
Author

SergeC commented Jul 24, 2014

Thanks. I'll test it as soon as the new binaries are released for Mac OS X.

@barakmich
Member

0.3.1 is available (point release, yay)
