This repository contains the source code for the paper:
Our implementation of the naïve and the deduplication-aware index is based on the Destor open-source storage system: https://github.com/fomy/destor
System requirements: Ubuntu 16.04 and the dependencies installed by the provided script.
MAKE SURE TO USE A CLEAN UBUNTU SERVER 16.04 LTS IMAGE, IF POSSIBLE.
An existing Ubuntu 16.04 machine is also OK. The install_dependencies.sh
script is meant to run on a clean Ubuntu Server installation.
Simply install git and clone the repo. Dependencies are installed later in this README via ./install_dependencies.sh :)
# takes a couple of human minutes :)
sudo apt update
sudo apt install git -y
git clone https://github.com/asaflevi0812/IDEA.git
cd IDEA
The install_dependencies.sh script installs the libraries required for IDEA:
- BerkeleyDB
- Lucene++
- Boost
- OpenSSL
- GLib
# takes about a minute until a single [PRESS ENTER] is required -- sometimes it is not needed.
# after that, about ten minutes of installation, usually less.
chmod +x install_dependencies.sh
./install_dependencies.sh
Run this from the IDEA repository to compile the code:
chmod +x compile.sh
./compile.sh
The destor.config file (discussed later) defines the paths to the directories containing the destor backup, the IDEA HDD index part, and the SSD index part.
By default, they are defined as follows:
- backup directory: working_directories/backup_directory
- HDD part directory: working_directories/index_directory
- SSD part directory: working_directories/chunk_to_file_directory
The following command creates the directories:
mkdir -p working_directories/backup_directory working_directories/index_directory working_directories/chunk_to_file_directory
In our original setup, the corresponding devices are mounted at these directories.
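For example, with a dedicated HDD and SSD, they could be mounted as follows (the device names are hypothetical -- adjust them to your machine):
# hypothetical device names: HDD for the HDD index part, SSD for the SSD index part
sudo mount /dev/sdb1 working_directories/index_directory
sudo mount /dev/nvme0n1p1 working_directories/chunk_to_file_directory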
The following command creates a deduplicated backup from a given dataset path:
destor <dataset_path>
The entire directory will be recursively backed up in the backup directory defined in destor.config (discussed below).
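For example (assuming a dataset extracted under /datasets/linux, a hypothetical path):
destor /datasets/linux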
To create a naive index from the currently stored deduplicated backup, run the following command:
destor -an
The index will be stored in the HDD part index directory defined in destor.config (discussed below).
To create an IDEA index from the currently stored deduplicated backup, run the following command:
destor -aq
The index parts will be stored in the HDD-part index directory and the SSD-part directory, respectively; both are defined in destor.config (discussed below).
To create an inline index, which backs up the data and creates the index concurrently, use the following commands:
destor -n <path to data>
for a naive inline index and
destor -q <path to data>
for an IDEA-indirect inline index.
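For example (hypothetical dataset path):
destor -q /datasets/linux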
Use the command:
destor -m|l <keyword1> <keyword2> <keyword3>
where m is used for the naive index and l for the IDEA index. For example,
destor -l hello world
would look up the keywords "hello" and "world" in the IDEA index.
Use the command:
destor -m|l -f<path_to_dictionary>
to search a list of keywords in an OR manner, i.e., if any of the terms is contained in a file, the file will appear in the results.
For example, use:
destor -m -f"keywords/linux/128/file-med.txt"
to search the file-med dictionary in the naive index.
When destor is initialized, it reads a configuration file with the fixed name destor.config, which contains lines in the following format:
<configuration_variable> <value>
For example:
working-directory "/home/asaf.levi/destor_working_directory"
chunking-type whitespace
reverse-mapping db
The following variables configure parameters for the IDEA and NAIVE indexes:
- working-directory: path to a directory which will contain the deduplicated backups.
- index-directory: path to the HDD-part of an index. It will store different parts of the index according to the index type:
- NAIVE: the entire index
- IDEA: the term-to-file map
- IDEA-indirect: the term-to-chunk map
- reverse-directory: path to the SSD-part of an index. It will store different parts of the index according to the index type:
- IDEA: the file-to-path map
- IDEA-indirect: the chunk-to-file and file-to-path maps
- chunking-type: can hold the values whitespace or whitespace-reversed. The former is our default whitespace chunking, and the latter is a chunking method used for fixed-size blocks, where the next whitespace is searched for before the chunk boundary rather than after. More about these two methods in the paper.
- reverse-mapping: the type of chunk-to-file mapping. May be in-doc for IDEA and db for IDEA-indirect.
- indirect-path-strings: whether to enable file-to-path mapping. Always set to yes in our paper.
- offsets-mode: whether to store offsets. Supported values are none and term-vectors, where term-vectors is the only offsets option used in the paper.
- tf-idf: whether to store tf-idf values. Takes a yes/no value.
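For illustration, a possible destor.config for an IDEA-indirect index, assuming the default directory layout created above, might look like this (the values are an example, not the only valid combination):
working-directory "working_directories/backup_directory"
index-directory "working_directories/index_directory"
reverse-directory "working_directories/chunk_to_file_directory"
chunking-type whitespace
reverse-mapping db
indirect-path-strings yes
offsets-mode none
tf-idf no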
Several configurations were pre-created in the "configs" directory. These are the important values in these configurations:
chunking-type whitespace
reverse-mapping in-doc
offsets-mode none
tf-idf no
chunking-type whitespace
reverse-mapping db
offsets-mode none
tf-idf no
chunking-type whitespace
reverse-mapping db
offsets-mode term-vectors
tf-idf no
chunking-type whitespace
reverse-mapping db
offsets-mode none
tf-idf yes
The results are best reproduced on a suitable clean Ubuntu 16.04 machine with only the required installations. If SSDs and HDDs can be used as described in the paper, the results will be more accurate. NOTE: the figure scripts do not generate graphical figures; they generate CSV files.
[estimated runtime for LNX-198: 2 hours]
First, make sure the system is configured to create offset-less and rank-less indexes, and that the index working directories are empty. You can do that with the following commands:
scripts/clean_index.sh
cp configs/default.config destor.config
The script scripts/create_indexes.sh creates the basic IDEA and Naive index versions, cleaning the OS cache before each creation, and measures the creation time.
The output should indicate the start and end of the naive and deduplicated index building process. During the process, an updating progress bar appears with the number of chunks and files processed. After each index is built, several parameters’ values are printed.
The script creates indexes as in Figures 7+8 from the paper.
The indexes will be created in the working directories mentioned above.
The indexing time and index size for each index are logged by the script in the csv file index.log. Indexing time is in the total_time column and index size is in the complete_index_size column.
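For example, to build the indexes and then pull the two relevant columns out of index.log (assuming index.log is comma-separated with a header row, as described above):
scripts/create_indexes.sh
# print the total_time and complete_index_size columns
awk -F, 'NR==1{for(i=1;i<=NF;i++){if($i=="total_time")t=i;if($i=="complete_index_size")s=i}next}{print $t, $s}' index.log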
[estimated runtime for LNX-198: 2.5-3 hours]
Use the "base indexing" guide with offset.config
or ranking.config
instead of default.config
.
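Concretely, for the offsets variant (the ranking variant is analogous, with ranking.config):
scripts/clean_index.sh
cp configs/offset.config destor.config
scripts/create_indexes.sh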
[estimated runtime for LNX-198: <2 minutes]
The script scripts/figure_10.sh performs lookup of keywords from the file-med dictionary in each index type.
It searches increasing numbers of keywords, like in Figure 10 from the paper, clearing the cache and starting up the index before each lookup.
The lookup time for each index and each number of keywords is logged by the script in the following csv file: lookup.log.
The script creates lookup.log in a CSV format, where the interesting values are index_type, keywords_num and total_lookup_time. index_type is naive for Naive and dedup for IDEA, keywords_num represents the number of keywords, and the total_lookup_time is in seconds. All the results are for the file-med dictionary.
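For example, run the script and then view the log as an aligned table (assuming lookup.log is comma-separated):
scripts/figure_10.sh
# pretty-print the CSV results
column -t -s, lookup.log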
[estimated runtime for LNX-198: <5 minutes]
The script scripts/figure_11.sh performs a lookup of 128 keywords from each dictionary in each index type, like in Figure 11 from the paper, clearing the cache and starting up the index before each lookup.
The lookup time for each index and each number of keywords is logged by the script in the following csv file: lookup.log.
The script runs lookups for the dictionaries one after the other: file-low, file-med, file-high, chunk-low, chunk-med, chunk-high. For each dictionary, it first runs the Naive version and then the IDEA version, so there are 12 result rows in the CSV file.
When the system is configured with offsets enabled, the figure_11.sh script can be used to reproduce Figure 15.
[estimated runtime for LNX-198: <1 minutes]
The script scripts/figure_16.sh performs a lookup of a single keyword from the file-med and then the chunk-med dictionaries in each index type, like in Figures 9+16 from the paper.
The lookup time for each index and each dictionary is logged by the script in the following csv file: lookup.log.
There are 4 single keywords averaged - 1a, 1b, 1c and 1d. For each dictionary they are looked up in that order, first for the Naive version and then for the IDEA version. Therefore the CSV file contains 16 result rows.
Dataset resources: LINUX KERNEL MIRROR, ENWIKI DUMPS and the old Wikipedia mirrors on the INTERNET ARCHIVE.
For more information on how the datasets were built and how to recreate them, check the dataset_details directory.
This error may occur on computers with a locale that Lucene cannot recognize.
It can be solved by running the following command:
export LC_ALL="en_US.UTF-8"
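To make the setting persist across sessions, it can also be appended to the shell startup file (assuming bash):
echo 'export LC_ALL="en_US.UTF-8"' >> ~/.bashrc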