
Commit

Erased text in README
fulmicoton committed Aug 26, 2016
1 parent 59150ad commit b2afe85
Showing 5 changed files with 29 additions and 93 deletions.
62 changes: 2 additions & 60 deletions README.md
@@ -11,64 +11,6 @@ Check out the [doc](http://fulmicoton.com/tantivy/tantivy/index.html)
in minutes.


# How it works
# Contribute

This document explains how tantivy works, and specifically
what kind of data structures are used to index and store the data.

# An inverted index

As you may know, an idea central to search engines is to assign a document id
to each document, and build an inverted index, which is simply
a data structure associating each term (word) with a sorted list of doc ids.

Such an index then makes it possible to compute the union or
the intersection of the documents containing two terms
in `O(1)` memory and `O(n)` time.
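
To make this concrete, here is a minimal sketch (plain Rust, not tantivy's actual
posting iterators) of intersecting two sorted posting lists in a single linear pass,
using constant extra memory besides the output:

```rust
/// Intersects two sorted lists of doc ids.
/// Runs in O(n) time and O(1) extra memory (excluding the output vector).
fn intersect(left: &[u32], right: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < left.len() && j < right.len() {
        if left[i] < right[j] {
            i += 1;
        } else if left[i] > right[j] {
            j += 1;
        } else {
            // The doc id appears in both posting lists.
            out.push(left[i]);
            i += 1;
            j += 1;
        }
    }
    out
}
```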

## Term dictionary

Tantivy's term dictionaries (`.term` files) are stored in
a finite state transducer (courtesy of the excellent
[`fst`](https://github.com/BurntSushi/fst) crate).

For each term, the dictionary stores a
[TermInfo](http://fulmicoton.com/tantivy/tantivy/postings/struct.TermInfo.html),
which contains all of the information required to access the list of ids
of the documents containing the term.

In fact, `fst` can only associate each term with a single `u64`. [`FstMap`](https://github.com/fulmicoton/tantivy/blob/master/src/datastruct/fstmap.rs) is
in charge of building a key-value map on top of it.
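
The idea can be illustrated with the `fst` crate directly. This is only a sketch of
the concept (the API shown is `fst`'s, not tantivy's `FstMap`), where each term maps
to a single `u64` that would locate the corresponding `TermInfo`:

```rust
use fst::MapBuilder;

fn main() {
    // Terms must be inserted in lexicographic order, which a term
    // dictionary naturally satisfies.
    let mut builder = MapBuilder::memory();
    builder.insert("apple", 42).unwrap();
    builder.insert("banana", 7).unwrap();
    builder.insert("cherry", 19).unwrap();
    let map = builder.into_map();

    // Looking up a term yields the u64 it was associated with.
    assert_eq!(map.get("banana"), Some(7));
    assert_eq!(map.get("durian"), None);
}
```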


## Postings

The posting lists (sorted lists of doc ids) are encoded in the `.idx` file.
Optionally, you can specify in your schema that you want term frequencies to be encoded
in the index file (if you do not, the index will behave as if all documents
have a term frequency of 1).
Tf-idf scoring requires the term frequency (the number of times the term appeared
in the field of the document) for each document.
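
As a purely illustrative model (not tantivy's on-disk format), a posting list with
optional term frequencies could be pictured like this:

```rust
/// Illustrative in-memory model of a posting list.
struct PostingList {
    /// Sorted doc ids.
    doc_ids: Vec<u32>,
    /// Term frequencies, parallel to `doc_ids`, only present if the
    /// schema asked for them to be encoded.
    term_freqs: Option<Vec<u32>>,
}

impl PostingList {
    /// Term frequency for the i-th document of the posting list.
    /// Defaults to 1 when frequencies were not encoded.
    fn term_freq(&self, ord: usize) -> u32 {
        match &self.term_freqs {
            Some(freqs) => freqs[ord],
            None => 1,
        }
    }
}
```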


# Segments

A tantivy index is divided into segments.
Each segment is an independent structure.

This has many benefits. For instance, assuming you are
trying to index one billion documents, you could split
your corpus into N pieces, index them on Hadoop, copy all
of the resulting segments into the same directory,
and edit the index's meta.json file to list all of the segments.

This strong division also greatly simplifies multithreaded indexing:
each thread builds its own segment.
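
The following sketch shows the "one segment per thread" idea in plain Rust
(`Segment` here is a stand-in type, not tantivy's): each thread indexes its own
chunk of documents into an independent segment, and the segments are simply
gathered at the end.

```rust
use std::thread;

/// Stand-in for a freshly built segment.
struct Segment {
    num_docs: u32,
}

/// Builds one segment from a chunk of documents (details elided).
fn build_segment(docs: &[String]) -> Segment {
    Segment { num_docs: docs.len() as u32 }
}

/// Indexes each chunk on its own thread; no coordination is needed
/// between threads while indexing.
fn index_in_parallel(chunks: Vec<Vec<String>>) -> Vec<Segment> {
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| thread::spawn(move || build_segment(&chunk)))
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("an indexing thread panicked"))
        .collect()
}
```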


##

# Store

The store
When a document
Send me an email (paul.masurel at gmail.com) if you want to contribute to tantivy.
16 changes: 11 additions & 5 deletions src/core/index.rs
@@ -14,7 +14,7 @@ use std::convert::From;
use num_cpus;
use std::collections::HashSet;
use super::segment::Segment;

use core::SegmentReader;

#[derive(Clone,Debug,RustcDecodable,RustcEncodable)]
pub struct IndexMeta {
@@ -120,10 +120,6 @@ impl Index {
self.writer_with_num_threads(num_cpus::get())
}

pub fn searcher(&self,) -> Result<Searcher> {
Searcher::for_index(self.clone())
}

pub fn from_directory(directory: Box<Directory>, schema: Schema) -> Index {
Index {
metas: Arc::new(RwLock::new(IndexMeta::with_schema(schema.clone()))),
@@ -210,6 +206,16 @@ impl Index {
.atomic_write(&META_FILEPATH, &w[..])
.map_err(From::from)
}

pub fn searcher(&self,) -> Result<Searcher> {
let segment_readers: Vec<SegmentReader> = try!(
self.segments()
.into_iter()
.map(SegmentReader::open)
.collect()
);
Ok(Searcher::from_readers(segment_readers))
}
}


36 changes: 12 additions & 24 deletions src/core/searcher.rs
@@ -1,7 +1,5 @@
use Result;
use core::SegmentReader;
use core::Index;
use core::segment::Segment;
use schema::Document;
use collector::Collector;
use common::TimerTree;
@@ -12,54 +10,44 @@ use schema::Term;

#[derive(Debug)]
pub struct Searcher {
segments: Vec<SegmentReader>,
segment_readers: Vec<SegmentReader>,
}

impl Searcher {

pub fn doc(&self, doc_address: &DocAddress) -> Result<Document> {
// TODO err
let DocAddress(segment_local_id, doc_id) = *doc_address;
let segment_reader = &self.segments[segment_local_id as usize];
let segment_reader = &self.segment_readers[segment_local_id as usize];
segment_reader.doc(doc_id)
}

pub fn num_docs(&self,) -> DocId {
self.segments
self.segment_readers
.iter()
.map(|segment_reader| segment_reader.num_docs())
.fold(0u32, |acc, val| acc + val)
}

pub fn doc_freq(&self, term: &Term) -> u32 {
self.segments
self.segment_readers
.iter()
.map(|segment_reader| segment_reader.doc_freq(term))
.fold(0u32, |acc, val| acc + val)
}

fn add_segment(&mut self, segment: Segment) -> Result<()> {
let segment_reader = try!(SegmentReader::open(segment.clone()));
self.segments.push(segment_reader);
Ok(())

pub fn segment_readers(&self,) -> &Vec<SegmentReader> {
&self.segment_readers
}

fn new() -> Searcher {
Searcher {
segments: Vec::new(),
}
}

pub fn segments(&self,) -> &Vec<SegmentReader> {
&self.segments
pub fn segment_reader(&self, segment_ord: usize) -> &SegmentReader {
&self.segment_readers[segment_ord]
}

pub fn for_index(index: Index) -> Result<Searcher> {
let mut searcher = Searcher::new();
for segment in index.segments() {
try!(searcher.add_segment(segment));
pub fn from_readers(segment_readers: Vec<SegmentReader>) -> Searcher {
Searcher {
segment_readers: segment_readers,
}
Ok(searcher)
}

pub fn search<Q: Query, C: Collector>(&self, query: &Q, collector: &mut C) -> Result<TimerTree> {
4 changes: 2 additions & 2 deletions src/lib.rs
@@ -238,7 +238,7 @@ mod tests {
{

let searcher = index.searcher().unwrap();
let segment_reader: &SegmentReader = searcher.segments().iter().next().unwrap();
let segment_reader: &SegmentReader = searcher.segment_readers().iter().next().unwrap();
let fieldnorms_reader = segment_reader.get_fieldnorms_reader(text_field).unwrap();
assert_eq!(fieldnorms_reader.get(0), 3);
assert_eq!(fieldnorms_reader.get(1), 0);
@@ -264,7 +264,7 @@
}
{
let searcher = index.searcher().unwrap();
let reader = &searcher.segments()[0];
let reader = searcher.segment_reader(0);
let mut postings = reader.read_postings_all_info(&Term::from_field_text(text_field, "af")).unwrap();
assert!(postings.advance());
assert_eq!(postings.doc(), 0);
4 changes: 2 additions & 2 deletions src/query/multi_term_query.rs
@@ -118,7 +118,7 @@ impl Query for MultiTermQuery {
&self,
searcher: &Searcher,
doc_address: &DocAddress) -> Result<Explanation> {
let segment_reader = &searcher.segments()[doc_address.segment_ord() as usize];
let segment_reader = searcher.segment_reader(doc_address.segment_ord() as usize);
let similitude = SimilarityExplainer::from(self.similitude(searcher));
let mut timer_tree = TimerTree::new();
let mut postings = try!(
@@ -147,7 +147,7 @@ impl Query for MultiTermQuery {
let mut timer_tree = TimerTree::new();
{
let mut search_timer = timer_tree.open("search");
for (segment_ord, segment_reader) in searcher.segments().iter().enumerate() {
for (segment_ord, segment_reader) in searcher.segment_readers().iter().enumerate() {
let mut segment_search_timer = search_timer.open("segment_search");
{
let _ = segment_search_timer.open("set_segment");
