Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dense Meta Data using MMap-Based arrays #80

Merged
merged 1 commit into from Aug 12, 2018

Conversation

joka921
Copy link
Member

@joka921 joka921 commented Jul 31, 2018

There is still much to do and to refactor, just wanted a first chance for reviewing.

@@ -5,10 +5,13 @@ add_library(index
VocabularyGenerator.h VocabularyGenerator.cpp
ConstantsIndexCreation.h
ExternalVocabulary.h ExternalVocabulary.cpp
IndexMetaData.h IndexMetaData.cpp
IndexMetaData.h IndexMetaDataImpl.h IndexMetaDataImpl.cpp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indentation looks wrong

StxxlSortFunctors.h
TextMetaData.cpp TextMetaData.h
DocsDB.cpp DocsDB.h
MmapBasedArray.h MmapBasedArrayImpl.h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indentation looks wrong, you might want to set your editor to show spaces and tabs differently

@@ -216,6 +220,11 @@ size_t Index::passTsvFileForVocabulary(const string& tsvFile) {
}
LOG(INFO) << "Pass done.\n";
_vocab.createFromSet(items);
// TODO: completely merge TSV and NTrIPLEs parsing. This is just for making
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO(<username>)

@@ -885,24 +896,19 @@ void Index::createFromOnDiskIndex(const string& onDiskBase,
setOnDiskBase(onDiskBase);
_vocab.readFromFile(_onDiskBase + ".vocabulary",
_onDiskLiterals ? _onDiskBase + ".literals-index" : "");
// this is the read only version, TODO: give better naming or type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO(<username>)

@@ -885,24 +896,19 @@ void Index::createFromOnDiskIndex(const string& onDiskBase,
setOnDiskBase(onDiskBase);
_vocab.readFromFile(_onDiskBase + ".vocabulary",
_onDiskLiterals ? _onDiskBase + ".literals-index" : "");
// this is the read only version, TODO: give better naming or type
_opsMeta.setup(onDiskBase + MMAP_FILE_SUFFIX + ".ops");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't setup() basically open() so could be named open()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, open() would suggest that we open a file, but actually what we are doing is completely depending on our implementation (nothing, reloading a mmapBased array, copying something to a stxxl vector (deprecated) etc.
My purpose was to create a generic function where the different template instatiations can have different initializations/setups passed through by varargs. I chose "setup" because you anyway have to look at the concrete type and which type of setup is required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable

@@ -0,0 +1,190 @@
// Copyright 2018, University of Freiburg,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this belongs in util because it is quite generic. Also it should be namespaced and "Based" is redundant so just name it MmapArray or better yet since it can grow MmapVector

AD_CHECK(idx < _size);
}

// we will never hold less than this capacity
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"never have less capacity" as I guess we can hold 0 items

// create Array of given size fill with default value
// file contents will be overriden if existing!
// allows read and write access
void setupNewFile(size_t size, const T& defaultValue, string filename);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setupNewFile() sounds like open() to me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I agree, I would refactor it using an argument of type std::ios_base::openmode and mimic the read/write semantics used with ifstreams.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to use two "tag"-classes. You will see if you like it better that way.

_bytesize = realSize._bytesize;
// writeMetaData will also adapt the filesize according to
// _capacity and _bytesize
unmap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Linux we should probably use mremap()

// Writes file "externalTextFile" which can be used to directly write external
// Literals
void mergeVocabulary(const std::string& basename, size_t numFiles);
// Writes file "externalTextFile" which can be used to directly write ext
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seem to have deleted the rest of the line by accident

size_t _bytesize;
};

// 2 tags two differentiate different versions of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/two/to/

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple more small comments. Also I think we should have unit tests for the new stuff. That said I'll now give this version a shot with the Clueweb Freebase instance and see how it fares.

EDIT: I forgot this needs an index rebuild. How hard would it be to do automatic conversion? This would also be a nice use of the magic number + version data

T& at(size_t idx) {
throwIfUninitialized();
if (idx >= _size) {
throw std::out_of_range();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing indent

const T& at(size_t idx) const {
throwIfUninitialized();
if (idx >= _size) {
throw std::out_of_range();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing indent

MmapVector& operator=(const MmapVector<T>&) = delete;

// move construction and assignment
// TODO<joka921>: is this noexcept?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, yes

MmapVector(MmapVector<T>&& other);
MmapVector& operator=(MmapVector<T>&& other);

// Construct from anything that has a valid open() method
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doen't really construct from something else. It just constructs and opens in one go

// truncate the underlying file to _bytesize and write meta data (_size,
// _capacity, etc) for this
// array to the end. new size of file will be _bytesize + sizeof(meta data)
// TODO: maybe extend meta data by "magic number" constant to verify if this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it a magic number + vesion

}

// __________________________________________________________
const FullRelationMetaData& getAsserted(Id id) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does "Asserted" mean here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means that there is a AD_CHECK whether there is a valid entry at this Id. That is how the original HashMap based code used it and so I chose to keep this for the other implementations as well.
The closest name from the stl would be "at", would you prefer that?


// _____________________________________________________________
// hidden in implementation namespace
namespace ExtVecWrapperImpl {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really important that ExtVec is external, right? So why not just call it Vec?

uint64_t _typeMultAndNofElements;
};

inline ad_utility::File& operator<<(ad_utility::File& f,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, makes sense

return f;
}

class BlockBasedRelationMetaData {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if you want to give it a shot I think you can only improve upon the current state but I guess it's not strictly necessary for the PR


// 2 tags two differentiate different versions of the
// setup / construction variants of Mmap Vector which only take a filename
class createTag{};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are still classes so should probably start with a capital letter.

@@ -35,19 +41,35 @@ void MmapVector<T>::writeMetaDataToEnd() {
ofs.write((char*)&_size, sizeof(_size));
ofs.write((char*)&_capacity, sizeof(_capacity));
ofs.write((char*)&_bytesize, sizeof(_bytesize));
ofs.write((char*)&MagicNumber, sizeof(MagicNumber));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since magic number is larger than an unit8_t this means the written value will depend on the endianess. But I think in this case this might not be bad since we can't read other-endianess data anyway since all of it is written in the non--endianess-portable write(&data, sizeof(data)); way.

static constexpr float ResizeFactor = 1.5;
static constexpr int MagicNumber = 7601577;
static constexpr int Version = 0;
private:

private:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the private: is redundant

@niklas88
Copy link
Member

niklas88 commented Aug 3, 2018

Do you already have some experience with how much this reduces RAM usage and startup time on one of our larger datasets?


if (verify) {
for (auto it = res._data.cbegin(); it != res._data.cend(); ++it) {
if (hmap._data.getAsserted((*it).first) != (*it).second) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason you use (*it).first instead of it->first? I feel like the second is more readable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My iterator manually "plunges" the index to its pointee, so we get a map-like interface where
*it is a pair of <key, value>. There are some restrictions to the implementation of operator-> which must yield something that is or behaves as a pointer. I can check if I can make this work, but there were some issues.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now implemented a variant of operator-> for you. I am not sure if this is the most elegant or efficient way with the std::reference_wrappers, but during our uses we normally don't use it for queries.

size_t nofBytesDone = sizeof(size_t);
_name.assign(reinterpret_cast<char*>(buf + nofBytesDone), nameLength);
nofBytesDone += nameLength;
// skip nOfRelations, since this information is already stored in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we sure there were previous runs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we first have to run the IndexBuilder to create MetaData which is then written to disk which we then read again when opening the server using createFromByteBuffer.
The local nofRelations variable is needed in the non-persistent specializations of IndexMetaData where we read the FullRelationalMetaData from our byteBuffer. (You can check the difference to the "standard" implementation.
I also added a (hopefully) better comment to this.

_spoFile.read(buf, static_cast<size_t>(metaTo - metaFrom), metaFrom);
_spoMeta.createFromByteBuffer(buf);
delete[] buf;
_spoMeta.readFromFile(&_spoFile);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also check for a denseMetaFile for extremely RAM saving configs right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that is a relatively deep design decision:
Currently:
P** and S** are always "sparse" (hashMapBased on disk), while O** is dense.
(hardcoded using templates).

If we want to have this at runtime (decide during index creation which meta data we want for which format) I think we would have to let the different IndexMetaData inherit from a (virtual) class IndexMetaDataBase. Then we could use a unique_ptr to dispatch at runtime. The last time I mentioned it I thing you were not really in favor of that, but it would make things more flexible at the cost of the pointer indirections.

But I think I would first finish this current PR and then think about this additional flexibility which would change quite some lines in the Index class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yeah I remember, ok I think for now we should keep the current approach.

exit(1);
}
std::string base = in.substr(0, in.size() - 4);
out = base + DENSE_META_SUFFIX + suffix;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suffix seems to be only used when using the converter. I think a converted index should have the exact same data as the same index built freshly in the files named the same (unless both versions have a version field and we can check that in the code).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I read from your Email you would prefer to always write the dense meta data to an extra file. I will change this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(together with my next PR for prefix compression I am preparing a index.configuration file (json) where we can also save information about the index in a more explicit way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joka921 it doesn't really matter if it's one file. It's just that an index upgrade tool like the new MetaDataConvertor should output the same format as would be used when building from scratch with that version.

@niklas88
Copy link
Member

niklas88 commented Aug 6, 2018

Somehow I'm getting a compile error when building this branch locally using

git checkout -b joka921-f.mmapBasedArray master

git pull git://github.com/joka921/QLever.git f.mmapBasedArray

Looks like the linker can't find the specialization's code

CMakeFiles/IndexMetaDataTest.dir/IndexMetaDataTest.cpp.o: In function `IndexMetaDataTest_writeReadTest2Mmap_Test::TestBody()':
IndexMetaDataTest.cpp:(.text+0x4cd0): undefined reference to `ad_utility::File& operator<<     <MetaDataWrapperDense<ad_utility::MmapVector<FullRelationMetaData> > >(ad_utility::File&,         IndexMetaData<MetaDataWrapperDense<ad_utility::MmapVector<FullRelationMetaData> > > const&)'
collect2: error: ld returned 1 exit status
make[2]: *** [test/CMakeFiles/IndexMetaDataTest.dir/build.make:89: test/IndexMetaDataTest] Error 1
make[1]: *** [CMakeFiles/Makefile2:2852: test/CMakeFiles/IndexMetaDataTest.dir/all] Error 2
make: *** [Makefile:163: all] Error 2

Can't find this error on Travis though but it happens on both my local Arch Linux workstation as well as on a Ubuntu 16.04 server.

@joka921
Copy link
Member Author

joka921 commented Aug 6, 2018

@niklas88 Have you tried running make clean before? I have the impression that somewhere cmake does not handle all dependencies right.

@niklas88
Copy link
Member

niklas88 commented Aug 6, 2018

@joka921 yes and I even built with a completely fresh ./build before that on both machines

@niklas88
Copy link
Member

niklas88 commented Aug 6, 2018

@joka921 ok found the problem. Removing inline in front of that function in src/IndexMetaData.cpp fixes the build. I guess the difference was that I was building with optimizations and thus this was actually inlined, removing the copy to link to. On Travis on the other hand things are built without optimization and thus a copy of the method ends up in the object.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm still unsure about these file format changes. I think we should make sure we get this right. I actually liked having the meta data in the same file. However in the previous version if I got this right, the converter would do something different than building the index from scratch, which I think is quite fragile. Namely it would create external meta data files for what was just written at the end in the freshly built case.

This difference is what I didn't like. Always creating external meta data of course fixes this but also seems a bit clumsy. Also I think that just as the MmapVector adds a magic number and version field I think it would be nice to have that in the meta data for "*." files as well. Maybe with a magic number we can then also keep them single files with the meta data at the end?


metaData.appendToFile(&out);
// metaData.appendToFile(&out);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

_psoFile.open(string(_onDiskBase + ".index.pso").c_str(), "r");
_posFile.open(string(_onDiskBase + ".index.pos").c_str(), "r");
auto psoName = string(_onDiskBase + ".index.pso");
_psoFile.open(psoName.c_str(), "r");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's time we add an open(const std::string&) method to ad_utility::File

@@ -25,7 +25,7 @@ int main (int argc, char** argv) {
exit(1);
}
std::string base = in.substr(0, in.size() - 4);
out = base + DENSE_META_SUFFIX + suffix;
out = in + DENSE_META_SUFFIX;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this gives us the dense file with ".ops.meta-dense" while below we get ".meta-mmap.ops".
I think since all of this changes the Index format we really should make sure to get this right. I actually liked having the meta data in the same file, my main concern in the last review was that the Converter should create the same format as a newly built index. Maybe since this is the first time we add a magic number and version we should in fact keep the meta data at the end of the files and only change the converter so it works for sparse meta data as well making sure all the files get a version and magic number.


// already Existing means we do not need to at rmd to _data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/at/add/


// already Existing means we do not need to at rmd to _data
// if bool alreadyExisting is set, we will not add argument rmd
// but assume that is already is contained in _data (for persistent metaData
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"that it is already contained"

@niklas88
Copy link
Member

niklas88 commented Aug 7, 2018

As discussed on the phone, we should also add a SPARQL query that uses one of the OSP/OPS indices to the e2e tests.

The following example query for all predicates connecting two scientists does a full scan of OSP

SELECT DISTINCT ?p WHERE {
    ?x <is-a> <Scientist> .
    ?y <is-a> <Scientist> .
    ?x ?p ?y .
}

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, any idea if the advise(AccessPattern::Sequential) calls make building the vector faster?

@@ -922,9 +922,9 @@ void Index::createFromOnDiskIndex(const string& onDiskBase,
<< std::endl;
if (allPermutations) {
_opsMeta.setup(onDiskBase + ".index.ops" + MMAP_FILE_SUFFIX,
ad_utility::ReuseTag());
ad_utility::ReuseTag(), ad_utility::AccessPattern::Random);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I seeing this right, ReuseTag needs to be a tag to select at compile time but AccessPattern can be an enum because it selects at runtime?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to I can also make ReuseTag an enum. It is basically the same as in an fstream where we also can decide by a runtime argument wheter we want to overwrite an existing file or use/add to its contents. The efficiency difference would not matter and maybe it would be more consistent. Just tell me, what you think.

@@ -224,11 +241,15 @@ class MmapVector {
// invalidates iterators
void unmap();

// advise the kernel to use a certain access pattern
void advise(AccessPattern pattern);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the difference between advice() and setAccessPattern() seems a bit arbitrary

for (size_t i = 0; i < _size; ++i) {
_ptr[i] = defaultValue;
}
advise(_pattern);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, so you readvise when changing the access pattern

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this answer your question? advise is the private implementation detail (call to madvise) and setAccessPattern the global interface, that let's us for example first advise to use sequential for insertion and then switch to random for access etc...
Concerning your question about the effect: According to the manpages the "sequential" mode does "heavy prefetching" and also evicts pages soon after they are written. I did not yet measure it and think this heavily depends on the overall system load and the kernel implementation (there is a reason why it's called "advise")

std::cerr << "error, infile must have suffix of"
".ops, .osp\n";
exit(1);
std::vector<std::string> sparseNames{".pso", ".pos", ".spo", ".sop"};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::array

permutName + ".converted");
}

std::vector<std::string> denseNames{".osp", ".ops"};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::array


// _________________________________________________________
int main (int argc, char** argv) {
if (argc != 2) {
std::cerr << "Usage: ./MetaDataConverterMain <originalFilename>\n";
std::cerr << "Usage: ./MetaDataConverterMain <pathToIndex>\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we call this an "index prefix" elsewhere

@@ -51,6 +51,8 @@ static const int DEFAULT_NOF_VALUE_MANTISSA_DIGITS = 30;
static const int DEFAULT_NOF_DATE_YEAR_DIGITS = 19;

static const std::string MMAP_FILE_SUFFIX = ".meta-mmap";
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove commented

@niklas88
Copy link
Member

I think the commit you squashed this to now needs a better name

@joka921
Copy link
Member Author

joka921 commented Aug 10, 2018

@niklas88 This is now a feature-fixed (I hope) version. I hope we can merge it early next week so other projects can go on.

@niklas88
Copy link
Member

Can you add just another commit to update the README.md with the conversion tool and also the last section about memory use?

@joka921
Copy link
Member Author

joka921 commented Aug 11, 2018

@niklas88 I have added the requested documentation. After you've had a look at it I would rebase it into the actual implementation commit to get a cleaner history.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, really happy how this turned out! Some final comments on a few typos/formatting and then feel free to merge!

@@ -380,6 +380,25 @@ Group by is supported, but aggregate aliases may currently only be used within t
Supported aggregates are `MIN, MAX, AVG, GROUP_CONCAT, SAMPLE, COUNT, SUM`. All of the aggreagates support `DISTINCT`, e.g. `(GROUP_CONCAT(DISTINCT ?a) as ?b)`.
Group concat also supports a custom separator: `(GROUP_CONCAT(?a ; separator=" ; ") as ?concat)`. Xsd types float, decimal and integer are recognized as numbers, other types or unbound variables (e.g. no entries for an optional part) in one of the aggregates that need to interpret the variable (e.g. AVG) lead to either no result or nan. MAX with an unbound variable will always return the unbound variable.

# 6. Converting Old Indices For Current QLever Versions

We have recently updated the way the index meta data (offsets of relations
Copy link
Member

@niklas88 niklas88 Aug 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please format these lines to roughly be of equal length (up to 80 columns). When using vim this can be done with first selecting them <esc><V><j..> and then doing <g><q>. Other than that this looks great.

StxxlSortFunctors.h
TextMetaData.cpp TextMetaData.h
DocsDB.cpp DocsDB.h
FTSAlgorithms.cpp FTSAlgorithms.h)

target_link_libraries(index parser ${STXXL_LIBRARIES})

add_library(metaConverter
MetaDataConverter.cpp MetaDataConverter.h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is wrongly indented

ad_utility::ReuseTag(), ad_utility::AccessPattern::Random);
_ospMeta.setup(onDiskBase + ".index.osp" + MMAP_FILE_SUFFIX,
ad_utility::ReuseTag(), ad_utility::AccessPattern::Random);
// TODO<joka921>: Refactor (upcoming PR): there is so so much code
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking forward to it!


bool hasBlocks() const { return _rmdPairs.hasBlocks(); }
// persistentRMD == true means we do not need to add rmd to _data
// but assume that it is already is contained in _data (for persistent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one "is" too many


uint8_t getCol2LogMultiplicity() const {
return _rmdPairs.getCol2LogMultiplicity();
// Persistent (called MmapBased here) have to be separated from RAM-based
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

word missing

const auto el = *it;
// when we are MmapBased, the _data member is already persistent on disk, so
// we do not write it here
if constexpr (!imd._isMmapBased) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

…tations

- Almost every token in the vocabulary occurs as an object in a triple,
  so the OPS and OSP MetaData are Dense (all possible keys/vocabIds
  occur). This makes using array-like data structures better.

- Implemented a class MmapVector that is able to handle a persistent
  dynamic array that is stored on hard disk and made accessible from RAM
  via mmap

- Implemented a class MmapVectorView that is a readonly version of
  MmapVector

- Templated the IndexMetaDataClass to handle either the "traditional"
  HashMap based IndexMetaData or the new Mmap based ones.

- Implemented a converter to transform previously built indices to the
  new format

- Currently, the OSP and OPS permutations are always mmap based whereas
  the rest is hashMap based. One could also change SOP and SPO to mmap
  based to save even more Ram or implement a dynamic dispatching at
  runtime.

- Updated README.md with documentation on the MetaDataConverter and
  information about the reduced RAM usage for OSP and OPS
@joka921 joka921 merged commit 7b6e89b into ad-freiburg:master Aug 12, 2018
@joka921 joka921 deleted the f.mmapBasedArray branch August 23, 2022 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants