New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dense Meta Data using MMap-Based arrays #80
Conversation
src/index/CMakeLists.txt
Outdated
@@ -5,10 +5,13 @@ add_library(index | |||
VocabularyGenerator.h VocabularyGenerator.cpp | |||
ConstantsIndexCreation.h | |||
ExternalVocabulary.h ExternalVocabulary.cpp | |||
IndexMetaData.h IndexMetaData.cpp | |||
IndexMetaData.h IndexMetaDataImpl.h IndexMetaDataImpl.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This indentation looks wrong
src/index/CMakeLists.txt
Outdated
StxxlSortFunctors.h | ||
TextMetaData.cpp TextMetaData.h | ||
DocsDB.cpp DocsDB.h | ||
MmapBasedArray.h MmapBasedArrayImpl.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This indentation looks wrong, you might want to set your editor to show spaces and tabs differently
src/index/Index.cpp
Outdated
@@ -216,6 +220,11 @@ size_t Index::passTsvFileForVocabulary(const string& tsvFile) { | |||
} | |||
LOG(INFO) << "Pass done.\n"; | |||
_vocab.createFromSet(items); | |||
// TODO: completely merge TSV and NTrIPLEs parsing. This is just for making |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO(<username>)
src/index/Index.cpp
Outdated
@@ -885,24 +896,19 @@ void Index::createFromOnDiskIndex(const string& onDiskBase, | |||
setOnDiskBase(onDiskBase); | |||
_vocab.readFromFile(_onDiskBase + ".vocabulary", | |||
_onDiskLiterals ? _onDiskBase + ".literals-index" : ""); | |||
// this is the read only version, TODO: give better naming or type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO(<username>)
src/index/Index.cpp
Outdated
@@ -885,24 +896,19 @@ void Index::createFromOnDiskIndex(const string& onDiskBase, | |||
setOnDiskBase(onDiskBase); | |||
_vocab.readFromFile(_onDiskBase + ".vocabulary", | |||
_onDiskLiterals ? _onDiskBase + ".literals-index" : ""); | |||
// this is the read only version, TODO: give better naming or type | |||
_opsMeta.setup(onDiskBase + MMAP_FILE_SUFFIX + ".ops"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't setup()
basically open()
so could be named open()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, open() would suggest that we open a file, but actually what we are doing is completely depending on our implementation (nothing, reloading a mmapBased array, copying something to a stxxl vector (deprecated) etc.
My purpose was to create a generic function where the different template instatiations can have different initializations/setups passed through by varargs. I chose "setup" because you anyway have to look at the concrete type and which type of setup is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable
src/index/MmapBasedArray.h
Outdated
@@ -0,0 +1,190 @@ | |||
// Copyright 2018, University of Freiburg, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this belongs in util
because it is quite generic. Also it should be namespaced and "Based" is redundant so just name it MmapArray
or better yet since it can grow MmapVector
src/index/MmapBasedArray.h
Outdated
AD_CHECK(idx < _size); | ||
} | ||
|
||
// we will never hold less than this capacity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"never have less capacity" as I guess we can hold 0 items
src/index/MmapBasedArray.h
Outdated
// create Array of given size fill with default value | ||
// file contents will be overriden if existing! | ||
// allows read and write access | ||
void setupNewFile(size_t size, const T& defaultValue, string filename); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setupNewFile()
sounds like open()
to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case I agree, I would refactor it using an argument of type std::ios_base::openmode and mimic the read/write semantics used with ifstreams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to use two "tag"-classes. You will see if you like it better that way.
src/index/MmapBasedArrayImpl.h
Outdated
_bytesize = realSize._bytesize; | ||
// writeMetaData will also adapt the filesize according to | ||
// _capacity and _bytesize | ||
unmap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Linux we should probably use mremap()
src/index/VocabularyGenerator.h
Outdated
// Writes file "externalTextFile" which can be used to directly write external | ||
// Literals | ||
void mergeVocabulary(const std::string& basename, size_t numFiles); | ||
// Writes file "externalTextFile" which can be used to directly write ext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You seem to have deleted the rest of the line by accident
src/util/MmapVector.h
Outdated
size_t _bytesize; | ||
}; | ||
|
||
// 2 tags two differentiate different versions of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/two/to/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple more small comments. Also I think we should have unit tests for the new stuff. That said I'll now give this version a shot with the Clueweb Freebase instance and see how it fares.
EDIT: I forgot this needs an index rebuild. How hard would it be to do automatic conversion? This would also be a nice use of the magic number + version data
src/util/MmapVector.h
Outdated
T& at(size_t idx) { | ||
throwIfUninitialized(); | ||
if (idx >= _size) { | ||
throw std::out_of_range(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing indent
src/util/MmapVector.h
Outdated
const T& at(size_t idx) const { | ||
throwIfUninitialized(); | ||
if (idx >= _size) { | ||
throw std::out_of_range(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing indent
src/util/MmapVector.h
Outdated
MmapVector& operator=(const MmapVector<T>&) = delete; | ||
|
||
// move construction and assignment | ||
// TODO<joka921>: is this noexcept? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, yes
src/util/MmapVector.h
Outdated
MmapVector(MmapVector<T>&& other); | ||
MmapVector& operator=(MmapVector<T>&& other); | ||
|
||
// Construct from anything that has a valid open() method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doen't really construct from something else. It just constructs and opens in one go
src/util/MmapVector.h
Outdated
// truncate the underlying file to _bytesize and write meta data (_size, | ||
// _capacity, etc) for this | ||
// array to the end. new size of file will be _bytesize + sizeof(meta data) | ||
// TODO: maybe extend meta data by "magic number" constant to verify if this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make it a magic number + vesion
src/index/MetaDataHandler.h
Outdated
} | ||
|
||
// __________________________________________________________ | ||
const FullRelationMetaData& getAsserted(Id id) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does "Asserted" mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It means that there is a AD_CHECK whether there is a valid entry at this Id. That is how the original HashMap based code used it and so I chose to keep this for the other implementations as well.
The closest name from the stl would be "at", would you prefer that?
src/index/MetaDataHandler.h
Outdated
|
||
// _____________________________________________________________ | ||
// hidden in implementation namespace | ||
namespace ExtVecWrapperImpl { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not really important that ExtVec
is external, right? So why not just call it Vec
?
uint64_t _typeMultAndNofElements; | ||
}; | ||
|
||
inline ad_utility::File& operator<<(ad_utility::File& f, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, makes sense
return f; | ||
} | ||
|
||
class BlockBasedRelationMetaData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, if you want to give it a shot I think you can only improve upon the current state but I guess it's not strictly necessary for the PR
src/util/MmapVector.h
Outdated
|
||
// 2 tags two differentiate different versions of the | ||
// setup / construction variants of Mmap Vector which only take a filename | ||
class createTag{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are still classes so should probably start with a capital letter.
src/util/MmapVectorImpl.h
Outdated
@@ -35,19 +41,35 @@ void MmapVector<T>::writeMetaDataToEnd() { | |||
ofs.write((char*)&_size, sizeof(_size)); | |||
ofs.write((char*)&_capacity, sizeof(_capacity)); | |||
ofs.write((char*)&_bytesize, sizeof(_bytesize)); | |||
ofs.write((char*)&MagicNumber, sizeof(MagicNumber)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since magic number is larger than an unit8_t
this means the written value will depend on the endianess. But I think in this case this might not be bad since we can't read other-endianess data anyway since all of it is written in the non--endianess-portable write(&data, sizeof(data));
way.
src/util/MmapVector.h
Outdated
static constexpr float ResizeFactor = 1.5; | ||
static constexpr int MagicNumber = 7601577; | ||
static constexpr int Version = 0; | ||
private: | ||
|
||
private: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the private:
is redundant
Do you already have some experience with how much this reduces RAM usage and startup time on one of our larger datasets? |
src/index/MetaDataConverter.cpp
Outdated
|
||
if (verify) { | ||
for (auto it = res._data.cbegin(); it != res._data.cend(); ++it) { | ||
if (hmap._data.getAsserted((*it).first) != (*it).second) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason you use (*it).first
instead of it->first
? I feel like the second is more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My iterator manually "plunges" the index to its pointee, so we get a map-like interface where
*it is a pair of <key, value>. There are some restrictions to the implementation of operator-> which must yield something that is or behaves as a pointer. I can check if I can make this work, but there were some issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now implemented a variant of operator-> for you. I am not sure if this is the most elegant or efficient way with the std::reference_wrappers, but during our uses we normally don't use it for queries.
src/index/IndexMetaData.cpp
Outdated
size_t nofBytesDone = sizeof(size_t); | ||
_name.assign(reinterpret_cast<char*>(buf + nofBytesDone), nameLength); | ||
nofBytesDone += nameLength; | ||
// skip nOfRelations, since this information is already stored in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we sure there were previous runs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because we first have to run the IndexBuilder to create MetaData which is then written to disk which we then read again when opening the server using createFromByteBuffer.
The local nofRelations variable is needed in the non-persistent specializations of IndexMetaData where we read the FullRelationalMetaData from our byteBuffer. (You can check the difference to the "standard" implementation.
I also added a (hopefully) better comment to this.
_spoFile.read(buf, static_cast<size_t>(metaTo - metaFrom), metaFrom); | ||
_spoMeta.createFromByteBuffer(buf); | ||
delete[] buf; | ||
_spoMeta.readFromFile(&_spoFile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also check for a denseMetaFile
for extremely RAM saving configs right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that is a relatively deep design decision:
Currently:
P** and S** are always "sparse" (hashMapBased on disk), while O** is dense.
(hardcoded using templates).
If we want to have this at runtime (decide during index creation which meta data we want for which format) I think we would have to let the different IndexMetaData inherit from a (virtual) class IndexMetaDataBase. Then we could use a unique_ptr to dispatch at runtime. The last time I mentioned it I thing you were not really in favor of that, but it would make things more flexible at the cost of the pointer indirections.
But I think I would first finish this current PR and then think about this additional flexibility which would change quite some lines in the Index class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh yeah I remember, ok I think for now we should keep the current approach.
src/MetaDataConverterMain.cpp
Outdated
exit(1); | ||
} | ||
std::string base = in.substr(0, in.size() - 4); | ||
out = base + DENSE_META_SUFFIX + suffix; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suffix seems to be only used when using the converter. I think a converted index should have the exact same data as the same index built freshly in the files named the same (unless both versions have a version field and we can check that in the code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I read from your Email you would prefer to always write the dense meta data to an extra file. I will change this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(together with my next PR for prefix compression I am preparing a index.configuration file (json) where we can also save information about the index in a more explicit way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joka921 it doesn't really matter if it's one file. It's just that an index upgrade tool like the new MetaDataConvertor should output the same format as would be used when building from scratch with that version.
Somehow I'm getting a compile error when building this branch locally using
git pull git://github.com/joka921/QLever.git f.mmapBasedArray Looks like the linker can't find the specialization's code
Can't find this error on Travis though but it happens on both my local Arch Linux workstation as well as on a Ubuntu 16.04 server. |
@niklas88 Have you tried running make clean before? I have the impression that somewhere cmake does not handle all dependencies right. |
@joka921 yes and I even built with a completely fresh |
@joka921 ok found the problem. Removing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm still unsure about these file format changes. I think we should make sure we get this right. I actually liked having the meta data in the same file. However in the previous version if I got this right, the converter would do something different than building the index from scratch, which I think is quite fragile. Namely it would create external meta data files for what was just written at the end in the freshly built case.
This difference is what I didn't like. Always creating external meta data of course fixes this but also seems a bit clumsy. Also I think that just as the MmapVector adds a magic number and version field I think it would be nice to have that in the meta data for "*." files as well. Maybe with a magic number we can then also keep them single files with the meta data at the end?
src/index/Index.cpp
Outdated
|
||
metaData.appendToFile(&out); | ||
// metaData.appendToFile(&out); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
src/index/Index.cpp
Outdated
_psoFile.open(string(_onDiskBase + ".index.pso").c_str(), "r"); | ||
_posFile.open(string(_onDiskBase + ".index.pos").c_str(), "r"); | ||
auto psoName = string(_onDiskBase + ".index.pso"); | ||
_psoFile.open(psoName.c_str(), "r"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's time we add an open(const std::string&)
method to ad_utility::File
src/MetaDataConverterMain.cpp
Outdated
@@ -25,7 +25,7 @@ int main (int argc, char** argv) { | |||
exit(1); | |||
} | |||
std::string base = in.substr(0, in.size() - 4); | |||
out = base + DENSE_META_SUFFIX + suffix; | |||
out = in + DENSE_META_SUFFIX; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this gives us the dense file with ".ops.meta-dense" while below we get ".meta-mmap.ops".
I think since all of this changes the Index format we really should make sure to get this right. I actually liked having the meta data in the same file, my main concern in the last review was that the Converter should create the same format as a newly built index. Maybe since this is the first time we add a magic number and version we should in fact keep the meta data at the end of the files and only change the converter so it works for sparse meta data as well making sure all the files get a version and magic number.
src/index/IndexMetaData.h
Outdated
|
||
// already Existing means we do not need to at rmd to _data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/at/add/
src/index/IndexMetaData.h
Outdated
|
||
// already Existing means we do not need to at rmd to _data | ||
// if bool alreadyExisting is set, we will not add argument rmd | ||
// but assume that is already is contained in _data (for persistent metaData |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"that it is already contained"
As discussed on the phone, we should also add a SPARQL query that uses one of the OSP/OPS indices to the e2e tests. The following example query for all predicates connecting two scientists does a full scan of OSP
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, any idea if the advise(AccessPattern::Sequential)
calls make building the vector faster?
src/index/Index.cpp
Outdated
@@ -922,9 +922,9 @@ void Index::createFromOnDiskIndex(const string& onDiskBase, | |||
<< std::endl; | |||
if (allPermutations) { | |||
_opsMeta.setup(onDiskBase + ".index.ops" + MMAP_FILE_SUFFIX, | |||
ad_utility::ReuseTag()); | |||
ad_utility::ReuseTag(), ad_utility::AccessPattern::Random); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I seeing this right, ReuseTag
needs to be a tag to select at compile time but AccessPattern
can be an enum because it selects at runtime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to I can also make ReuseTag an enum. It is basically the same as in an fstream where we also can decide by a runtime argument wheter we want to overwrite an existing file or use/add to its contents. The efficiency difference would not matter and maybe it would be more consistent. Just tell me, what you think.
src/util/MmapVector.h
Outdated
@@ -224,11 +241,15 @@ class MmapVector { | |||
// invalidates iterators | |||
void unmap(); | |||
|
|||
// advise the kernel to use a certain access pattern | |||
void advise(AccessPattern pattern); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the difference between advice()
and setAccessPattern()
seems a bit arbitrary
src/util/MmapVectorImpl.h
Outdated
for (size_t i = 0; i < _size; ++i) { | ||
_ptr[i] = defaultValue; | ||
} | ||
advise(_pattern); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, so you readvise when changing the access pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did this answer your question? advise is the private implementation detail (call to madvise) and setAccessPattern the global interface, that let's us for example first advise to use sequential for insertion and then switch to random for access etc...
Concerning your question about the effect: According to the manpages the "sequential" mode does "heavy prefetching" and also evicts pages soon after they are written. I did not yet measure it and think this heavily depends on the overall system load and the kernel implementation (there is a reason why it's called "advise")
src/MetaDataConverterMain.cpp
Outdated
std::cerr << "error, infile must have suffix of" | ||
".ops, .osp\n"; | ||
exit(1); | ||
std::vector<std::string> sparseNames{".pso", ".pos", ".spo", ".sop"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::array
src/MetaDataConverterMain.cpp
Outdated
permutName + ".converted"); | ||
} | ||
|
||
std::vector<std::string> denseNames{".osp", ".ops"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::array
src/MetaDataConverterMain.cpp
Outdated
|
||
// _________________________________________________________ | ||
int main (int argc, char** argv) { | ||
if (argc != 2) { | ||
std::cerr << "Usage: ./MetaDataConverterMain <originalFilename>\n"; | ||
std::cerr << "Usage: ./MetaDataConverterMain <pathToIndex>\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we call this an "index prefix" elsewhere
src/global/Constants.h
Outdated
@@ -51,6 +51,8 @@ static const int DEFAULT_NOF_VALUE_MANTISSA_DIGITS = 30; | |||
static const int DEFAULT_NOF_DATE_YEAR_DIGITS = 19; | |||
|
|||
static const std::string MMAP_FILE_SUFFIX = ".meta-mmap"; | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove commented
e2b9597
to
7a7ac0a
Compare
I think the commit you squashed this to now needs a better name |
678c9e3
to
1a2398e
Compare
@niklas88 This is now a feature-fixed (I hope) version. I hope we can merge it early next week so other projects can go on. |
Can you add just another commit to update the README.md with the conversion tool and also the last section about memory use? |
c30628b
to
2da241d
Compare
@niklas88 I have added the requested documentation. After you've had a look at it I would rebase it into the actual implementation commit to get a cleaner history. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, really happy how this turned out! Some final comments on a few typos/formatting and then feel free to merge!
@@ -380,6 +380,25 @@ Group by is supported, but aggregate aliases may currently only be used within t | |||
Supported aggregates are `MIN, MAX, AVG, GROUP_CONCAT, SAMPLE, COUNT, SUM`. All of the aggreagates support `DISTINCT`, e.g. `(GROUP_CONCAT(DISTINCT ?a) as ?b)`. | |||
Group concat also supports a custom separator: `(GROUP_CONCAT(?a ; separator=" ; ") as ?concat)`. Xsd types float, decimal and integer are recognized as numbers, other types or unbound variables (e.g. no entries for an optional part) in one of the aggregates that need to interpret the variable (e.g. AVG) lead to either no result or nan. MAX with an unbound variable will always return the unbound variable. | |||
|
|||
# 6. Converting Old Indices For Current QLever Versions | |||
|
|||
We have recently updated the way the index meta data (offsets of relations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please format these lines to roughly be of equal length (up to 80 columns). When using vim this can be done with first selecting them <esc><V><j..>
and then doing <g><q>
. Other than that this looks great.
src/index/CMakeLists.txt
Outdated
StxxlSortFunctors.h | ||
TextMetaData.cpp TextMetaData.h | ||
DocsDB.cpp DocsDB.h | ||
FTSAlgorithms.cpp FTSAlgorithms.h) | ||
|
||
target_link_libraries(index parser ${STXXL_LIBRARIES}) | ||
|
||
add_library(metaConverter | ||
MetaDataConverter.cpp MetaDataConverter.h) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is wrongly indented
ad_utility::ReuseTag(), ad_utility::AccessPattern::Random); | ||
_ospMeta.setup(onDiskBase + ".index.osp" + MMAP_FILE_SUFFIX, | ||
ad_utility::ReuseTag(), ad_utility::AccessPattern::Random); | ||
// TODO<joka921>: Refactor (upcoming PR): there is so so much code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking forward to it!
src/index/IndexMetaData.h
Outdated
|
||
bool hasBlocks() const { return _rmdPairs.hasBlocks(); } | ||
// persistentRMD == true means we do not need to add rmd to _data | ||
// but assume that it is already is contained in _data (for persistent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one "is" too many
src/index/IndexMetaData.h
Outdated
|
||
uint8_t getCol2LogMultiplicity() const { | ||
return _rmdPairs.getCol2LogMultiplicity(); | ||
// Persistent (called MmapBased here) have to be separated from RAM-based |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
word missing
const auto el = *it; | ||
// when we are MmapBased, the _data member is already persistent on disk, so | ||
// we do not write it here | ||
if constexpr (!imd._isMmapBased) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
…tations - Almost every token in the vocabulary occurs as an object in a triple, so the OPS and OSP MetaData are Dense (all possible keys/vocabIds occur). This makes using array-like data structures better. - Implemented a class MmapVector that is able to handle a persistent dynamic array that is stored on hard disk and made accessible from RAM via mmap - Implemented a class MmapVectorView that is a readonly version of MmapVector - Templated the IndexMetaDataClass to handle either the "traditional" HashMap based IndexMetaData or the new Mmap based ones. - Implemented a converter to transform previously built indices to the new format - Currently, the OSP and OPS permutations are always mmap based whereas the rest is hashMap based. One could also change SOP and SPO to mmap based to save even more Ram or implement a dynamic dispatching at runtime. - Updated README.md with documentation on the MetaDataConverter and information about the reduced RAM usage for OSP and OPS
2da241d
to
5194941
Compare
There is still much to do and to refactor, just wanted a first chance for reviewing.