SearchIndex API

The SearchIndex class gives the application developer a way to provide data to the backend in a structured format. Developers familiar with Django's Form or Model classes should find the syntax for indexes familiar.

This class is arguably the most important part of integrating Haystack into your application, as it has a large impact on the quality of the search results and how easy it is for users to find what they're looking for. Care and effort should be put into making your indexes the best they can be.

Quick Start

For the impatient:

import datetime
from haystack import indexes
from myapp.models import Note


class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def index_queryset(self):
        "Used when the entire index for model is updated."
        return self.get_model().objects.filter(pub_date__lte=datetime.datetime.now())

Background

Unlike relational databases, most search engines supported by Haystack are primarily document-based. They focus on a single text blob which they tokenize, analyze and index. When searching, this field is usually the primary one that is searched.

Further, the schema used by most engines is the same for all types of data added, unlike a relational database that has a table schema for each chunk of data.

It may be helpful to think of your search index as something closer to a key-value store instead of imagining it in terms of an RDBMS.
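To make the key-value analogy concrete, here is a minimal sketch of a model-like object flattened into the kind of flat document a search backend stores. The Note stand-in class and the flatten_note helper are hypothetical and are not part of Haystack; the id, django_ct and django_id keys only mirror the reserved field names, and their exact format here is an assumption.

```python
import datetime


class Note(object):
    """Hypothetical stand-in for a Django model instance."""

    def __init__(self, pk, title, body, user, pub_date):
        self.pk = pk
        self.title = title
        self.body = body
        self.user = user
        self.pub_date = pub_date


def flatten_note(note):
    """Build a flat document from a note (illustrative only)."""
    return {
        # Bookkeeping keys; the value format is assumed for illustration.
        'id': 'myapp.note.%s' % note.pk,
        'django_ct': 'myapp.note',
        'django_id': str(note.pk),
        # The single text blob the engine tokenizes and searches.
        'text': '%s\n%s' % (note.title, note.body),
        # Extra attributes stored alongside the document.
        'author': note.user,
        'pub_date': note.pub_date.isoformat(),
    }


note = Note(1, 'Hello', 'First post.', 'daniel',
            datetime.datetime(2011, 1, 1, 12, 0))
document = flatten_note(note)
```

Every kind of object ends up as the same flat shape, which is why the index behaves more like a key-value store than a set of tables.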

Why Create Fields?

Despite being primarily document-driven, most search engines also support the ability to associate other relevant data with the indexed document. These attributes can be mapped through the use of fields within Haystack.

Common uses include storing pertinent data information, categorizations of the document, author information and related data. By adding fields for these pieces of data, you provide a means to further narrow/filter search terms. This can be useful from either a UI perspective (a better advanced search form) or from a developer standpoint (section-dependent search, off-loading certain tasks to search, et cetera).

Warning

Haystack reserves the following field names for internal use: id, django_ct, django_id & content. The name & type names used to be reserved but no longer are.

You can override these field names using the HAYSTACK_ID_FIELD, HAYSTACK_DJANGO_CT_FIELD & HAYSTACK_DJANGO_ID_FIELD settings if needed.

Significance Of document=True

Most search engines that were candidates for inclusion in Haystack had a central concept of a document that they indexed. These documents form a corpus within which to primarily search. Because this idea is so central and most of Haystack is designed to have pluggable backends, it is important to ensure that all engines have at least a bare minimum of the data they need to function.

As a result, when creating a SearchIndex, at least one field must be marked with document=True. This signifies to Haystack that whatever is placed in this field while indexing is to be the primary text the search engine indexes. The name of this field can be almost anything, but text is one of the more common names used.

Stored/Indexed Fields

One shortcoming of the use of search is that you rarely have all or the most up-to-date information about an object in the index. As a result, when retrieving search results, you will likely have to access the object in the database to provide better information.

However, this can also hit the database quite heavily (think .get(pk=result.id) per object). If your search is popular, this can lead to a big performance hit. There are two ways to prevent this. The first way is SearchQuerySet.load_all, which tries to group all similar objects and pull them through one query instead of many. This still hits the DB and incurs a performance penalty.

The other option is to leverage stored fields. By default, all fields in Haystack are both indexed (searchable by the engine) and stored (retained by the engine and presented in the results). By using a stored field, you can store commonly used data in such a way that you don't need to hit the database when processing the search result to get more information.

For example, one great way to leverage this is to pre-render an object's search result template DURING indexing. You define an additional field, render a template with it and it follows the main indexed record into the index. Then, when that record is pulled because it matches a query, you can simply display the contents of that field, which avoids the database hit:

Within myapp/search_indexes.py:

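from haystack import indexes
from myapp.models import Note


class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')
    # Define the additional field. It is stored but not searchable.
    rendered = indexes.CharField(use_template=True, indexed=False)

    def get_model(self):
        return Note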

Then, inside a template named search/indexes/myapp/note_rendered.txt:

<h2>{{ object.title }}</h2>

<p>{{ object.content }}</p>

And finally, in search/search.html:

...

{% for result in page.object_list %}
    <div class="search_result">
        {{ result.rendered|safe }}
    </div>
{% endfor %}

Keeping The Index Fresh

There are several approaches to keeping the search index in sync with your database. None is more correct than the others; the right choice depends on the traffic you see, the churn rate of your data and which concerns are important to you (CPU load, how recent the index must be, et cetera).

The conventional method is to use SearchIndex in combination with cron jobs. Running a ./manage.py update_index every couple of hours will keep your data in sync within that timeframe and will handle the updates in a very efficient batch. Additionally, Whoosh (and to a lesser extent Xapian) behave better when using this approach.

Another option is to use RealTimeSearchIndex, which uses Django's signals to immediately update the index any time a model is saved/deleted. This yields a much more current search index at the expense of being fairly inefficient. Solr is the only backend that handles this well under load, and even then, you should make sure you have the server capacity to spare.

A third option is to develop a custom QueueSearchIndex that, much like RealTimeSearchIndex, uses Django's signals to enqueue messages for updates/deletes. You then write a management command to consume these messages in batches, which yields a nice compromise between the previous two options.

Note

Haystack doesn't ship with a QueueSearchIndex, largely because there is such a diversity of lightweight queuing options and because they tend to polarize developers. Queuing is outside of Haystack's goals (provide good, powerful search) and, as such, is left to the developer.

Additionally, the implementation is relatively trivial: you extend the same four methods as RealTimeSearchIndex and add messages to the queue of your choice.
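As a rough sketch of the queue-based approach, the following uses an in-memory deque as a stand-in for a real message queue. The names pending, enqueue_update, enqueue_delete and drain_queue are hypothetical and not part of Haystack; in a real project the enqueue functions would be called from signal handlers and drain_queue from a management command.

```python
from collections import deque

# In-memory stand-in for a real message queue (assumption for illustration).
pending = deque()


def enqueue_update(identifier):
    """Record that an object was saved instead of indexing it immediately."""
    pending.append(('update', identifier))


def enqueue_delete(identifier):
    """Record that an object was deleted instead of removing it immediately."""
    pending.append(('delete', identifier))


def drain_queue(batch_size=100):
    """Consume up to batch_size messages, keeping the last action per object."""
    latest = {}
    for _ in range(min(batch_size, len(pending))):
        action, identifier = pending.popleft()
        latest[identifier] = action
    updates = sorted(i for i, a in latest.items() if a == 'update')
    deletes = sorted(i for i, a in latest.items() if a == 'delete')
    return updates, deletes


enqueue_update('myapp.note.1')
enqueue_update('myapp.note.2')
enqueue_update('myapp.note.1')  # Saved twice, but only indexed once.
enqueue_delete('myapp.note.2')  # Deleted after saving, so only removed.
updates, deletes = drain_queue()
```

Collapsing repeated messages per object is where the batch approach saves work compared to updating the index on every save.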

Advanced Data Preparation

In most cases, using the model_attr parameter on your fields allows you to easily get data from a Django model to the document in your index, as it handles both direct attribute access and callable functions within your model.

Note

The model_attr keyword argument also can look through relations in models. So you can do something like model_attr='author__first_name' to pull just the first name of the author, similar to some lookups used by Django's ORM.
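The lookup behaviour can be pictured with a small sketch. The resolve_attr helper and the User and Note stand-in classes below are hypothetical; this illustrates the idea of following a double-underscore path and calling the result if it is callable, and is not Haystack's actual implementation.

```python
def resolve_attr(obj, lookup):
    """Follow a double-underscore path such as 'author__first_name'."""
    current = obj
    for name in lookup.split('__'):
        current = getattr(current, name)
    # If the path ends at a method, call it to obtain the value.
    if callable(current):
        current = current()
    return current


class User(object):
    """Hypothetical stand-in for a related model."""

    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

    def get_full_name(self):
        return '%s %s' % (self.first_name, self.last_name)


class Note(object):
    """Hypothetical stand-in for the indexed model."""

    def __init__(self, author):
        self.author = author


note = Note(User('Jane', 'Doe'))
first = resolve_attr(note, 'author__first_name')
full = resolve_attr(note, 'author__get_full_name')
```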

However, sometimes, even more control over what gets placed in your index is needed. To facilitate this, SearchIndex objects have a 'preparation' stage that populates data just before it is indexed. You can hook into this phase in several ways.

This should be very familiar to developers who have used Django's forms before as it loosely follows similar concepts, though the emphasis here is less on cleansing data from user input and more on making the data friendly to the search backend.

1. prepare_FOO(self, object)

The most common way to affect a single field's data is to create a prepare_FOO method (where FOO is the name of the field). As a parameter to this method, you will receive the instance that is being indexed.

Note

This method is analogous to Django's Form.clean_FOO methods.

To keep with our existing example, one use case might be altering the name inside the author field to be "firstname lastname <email>". In this case, you might write the following code:

from haystack import indexes
from myapp.models import Note


class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def prepare_author(self, obj):
        # Index the author as "firstname lastname <email>".
        return "%s <%s>" % (obj.user.get_full_name(), obj.user.email)

This method should return a single value (or list/tuple/dict) to populate that field's data upon indexing. Note that this method takes priority over whatever data may come from the field itself.

Just like Form.clean_FOO, the field's prepare runs before the prepare_FOO, allowing you to access self.prepared_data. For example:

from haystack import indexes
from myapp.models import Note


class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def prepare_author(self, obj):
        # Say we want last name first, the hard way.
        author = u''

        if 'author' in self.prepared_data:
            name_bits = self.prepared_data['author'].split()
            # Guard against an empty name, where name_bits[-1] would fail.
            if name_bits:
                author = "%s, %s" % (name_bits[-1], ' '.join(name_bits[:-1]))

        return author

This method works whether or not the field defines model_attr, so if there's no convenient way to access the data you want, this is an excellent way to prepare it:

from haystack import indexes
from myapp.models import Note


class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    categories = indexes.MultiValueField()
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def prepare_categories(self, obj):
        # Since we're using a M2M relationship with a complex lookup,
        # we can prepare the list here.
        return [category.id for category in obj.category_set.active().order_by('-created')]

2. prepare(self, object)

Each SearchIndex gets a prepare method, which handles collecting all the data. This method should return a dictionary that will be the final data used by the search backend.

Overriding this method is useful if you need to collect more than one piece of data or need to incorporate additional data that is not well represented by a single SearchField. An example might look like:

from haystack import indexes
from myapp.models import Note


class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def prepare(self, obj):
        # Named obj rather than object to avoid shadowing the builtin.
        self.prepared_data = super(NoteIndex, self).prepare(obj)

        # Add in tags (assuming there's a M2M relationship to Tag on the model).
        # Note that this would NOT get picked up by the automatic
        # schema tools provided by Haystack.
        self.prepared_data['tags'] = [tag.name for tag in obj.tags.all()]

        return self.prepared_data

If you choose to use this method, be careful to call the super() method before altering the data. Without doing so, you may have an incomplete set of data populating your indexes.

This method has the final say in all data, overriding both what the fields provide as well as any prepare_FOO methods on the class.

Note

This method is roughly analogous to Django's Form.full_clean and Form.clean methods. However, unlike these methods, it is not fired as the result of trying to access self.prepared_data. It requires an explicit call.
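The order of precedence between the first two hooks can be pictured with a toy model. TinyIndex below is a hypothetical miniature written only for this illustration; it is not Haystack's code, and its fields mapping is an assumed simplification of model_attr.

```python
class TinyIndex(object):
    """Hypothetical miniature of the preparation stage."""

    # Maps index field name -> attribute name on the object.
    fields = {'author': 'user', 'title': 'title'}

    def prepare(self, obj):
        self.prepared_data = {}
        for field_name, attr in self.fields.items():
            # Step 1: the field's own value (what model_attr would supply).
            self.prepared_data[field_name] = getattr(obj, attr)
            # Step 2: a prepare_FOO method, if defined, replaces that value.
            hook = getattr(self, 'prepare_%s' % field_name, None)
            if hook is not None:
                self.prepared_data[field_name] = hook(obj)
        return self.prepared_data


class NoteIndex(TinyIndex):
    def prepare_author(self, obj):
        # The field-level value is readable because step 1 already ran.
        return self.prepared_data['author'].upper()

    def prepare(self, obj):
        # Step 3: the index-level prepare has the final say.
        data = super(NoteIndex, self).prepare(obj)
        data['title'] = data['title'].strip()
        data['tags'] = ['draft']
        return data


class Note(object):
    """Hypothetical stand-in for a model instance."""
    user = 'daniel'
    title = '  Hello  '


prepared = NoteIndex().prepare(Note())
```

Each later step sees and may replace what the earlier steps produced, which is why calling super() first matters.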

3. Overriding prepare(self, object) On Individual SearchField Objects

The final way to manipulate your data is to implement a custom SearchField object and write its prepare method to populate/alter the data any way you choose. For instance, a (naive) user-created GeoPointField might look something like:

from haystack import indexes

class GeoPointField(indexes.CharField):
    def __init__(self, **kwargs):
        kwargs['default'] = '0.00-0.00'
        super(GeoPointField, self).__init__(**kwargs)

    def prepare(self, obj):
        return unicode("%s-%s" % (obj.latitude, obj.longitude))

The prepare method simply returns the value to be used for that field. It's entirely possible to include data that's not directly referenced to the object here, depending on your needs.

Note that this is NOT a recommended approach to storing geographic data in a search engine (there is no formal suggestion on this as support is usually non-existent), merely an example of how to extend existing fields.

Note

This method is analogous to Django's Field.clean methods.

Adding New Fields

If you have an existing SearchIndex and you add a new field to it, Haystack will add this new data on any updates it sees after that point. However, this will not populate the existing data you already have.

In order for the data to be picked up, you will need to run ./manage.py rebuild_index. This will cause all backends to rebuild the existing data already present in the quickest and most efficient way.

Note

With the Solr backend, you'll also have to add to the appropriate schema.xml for your configuration before running the rebuild_index.

Search Index

get_model

Should return the Model class (not an instance) that the rest of the SearchIndex should use.

This method is required & you must override it to return the correct class.

index_queryset

Get the default QuerySet to index when doing a full update.

Subclasses can override this method to avoid indexing certain objects.

read_queryset

Get the default QuerySet for read actions.

Subclasses can override this method to work with other managers. Useful when working with default managers that filter some objects.

prepare

Fetches and adds/alters data before indexing.

get_content_field

Returns the field that supplies the primary document to be indexed.

update

Updates the entire index.

If using is provided, it specifies which connection should be used. Default relies on the routers to decide which backend should be used.

update_object

Update the index for a single object. Attached to the class's post-save hook.

If using is provided, it specifies which connection should be used. Default relies on the routers to decide which backend should be used.

remove_object

Remove an object from the index. Attached to the class's post-delete hook.

If using is provided, it specifies which connection should be used. Default relies on the routers to decide which backend should be used.

clear

Clears the entire index.

If using is provided, it specifies which connection should be used. Default relies on the routers to decide which backend should be used.

reindex

Completely clears the index for this model and rebuilds it.

If using is provided, it specifies which connection should be used. Default relies on the routers to decide which backend should be used.

get_updated_field

Get the field name that represents the updated date for the model.

If specified, this is used by the reindex command to filter out results from the QuerySet, enabling you to reindex only recent records. This method should either return None (reindex everything always) or a string of the Model's DateField/DateTimeField name.
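The effect of naming an updated field can be sketched as a simple date filter. The select_recent helper and the Note stand-in are hypothetical and only illustrate the idea; in an actual index you would simply return the field name (for example 'pub_date') from get_updated_field.

```python
import datetime


def select_recent(records, updated_field, now, hours):
    """Keep records whose updated_field falls within the last `hours` hours."""
    if updated_field is None:
        # Mirrors the "reindex everything always" case.
        return list(records)
    cutoff = now - datetime.timedelta(hours=hours)
    return [r for r in records if getattr(r, updated_field) >= cutoff]


class Note(object):
    """Hypothetical stand-in for a model instance."""

    def __init__(self, title, pub_date):
        self.title = title
        self.pub_date = pub_date


now = datetime.datetime(2011, 6, 1, 12, 0)
notes = [
    Note('old', datetime.datetime(2011, 5, 1, 12, 0)),
    Note('new', datetime.datetime(2011, 6, 1, 9, 0)),
]
recent = [n.title for n in select_recent(notes, 'pub_date', now, hours=24)]
everything = [n.title for n in select_recent(notes, None, now, hours=24)]
```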

should_update

Determine if an object should be updated in the index.

It's useful to override this when an object may save frequently and cause excessive reindexing. You should check conditions on the instance and return False if it is not to be indexed.

The kwargs passed along to this method can be the same as the ones passed by Django when a Model is saved/deleted, so it's possible to check if the object has been created or not. See django.db.models.signals.post_save for details on what is passed.

By default, returns True (always reindex).
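A minimal sketch of such an override follows, written as a mixin so it can be shown without a configured Django project. SkipDraftsMixin and the is_draft attribute are assumptions made for this example; the (instance, **kwargs) signature is assumed from the description above.

```python
class SkipDraftsMixin(object):
    """Hypothetical mixin; assumes the model has a boolean is_draft attribute."""

    def should_update(self, instance, **kwargs):
        # Returning False tells the index to leave this object alone.
        if getattr(instance, 'is_draft', False):
            return False
        return True


class Note(object):
    """Hypothetical stand-in for a model instance."""

    def __init__(self, is_draft):
        self.is_draft = is_draft


index = SkipDraftsMixin()
# The created keyword mirrors what Django's post_save signal passes along.
published = index.should_update(Note(is_draft=False), created=True)
draft = index.should_update(Note(is_draft=True), created=False)
```

In a real index the mixin would be listed before the SearchIndex base class so that its should_update takes precedence.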

load_all_queryset

Provides the ability to override how objects get loaded in conjunction with RelatedSearchQuerySet.load_all. This is useful for post-processing the results from the query, enabling things like adding select_related or filtering certain data.

By default, returns all() on the model's default manager.

Example:

from haystack import indexes
from myapp.models import Note


class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def load_all_queryset(self):
        # Pull all objects related to the Note in search results.
        return Note.objects.all().select_related()

When searching, the RelatedSearchQuerySet appends on a call to in_bulk, so be sure that the QuerySet you provide can accommodate this and that the ids passed to in_bulk will map to the model in question.

If you need a specific QuerySet in one place, you can specify this at the RelatedSearchQuerySet level using the load_all_queryset method. See the SearchQuerySet API documentation for usage.

RealTimeSearchIndex

The RealTimeSearchIndex provides all the same functionality as the standard SearchIndex. However, in addition, it connects to the post_save/post_delete signals of the model it's registered with.

This means that anytime a model is saved or deleted, it's automatically and immediately updated in the search index, yielding real-time search.

Warning

Not all backends deal well with the kind of document churn that can result from using the RealTimeSearchIndex. Solr is the only one that handles it gracefully.

Additionally, this will add more overhead in terms of CPU usage, so you should be sure to account for this and should have appropriate monitoring in place.

ModelSearchIndex

The ModelSearchIndex class allows for automatic generation of a SearchIndex based on the fields of the model assigned to it.

With the exception of the automated introspection, it is a SearchIndex class, so all notes above pertaining to SearchIndexes apply. As with the ModelForm class in Django, it employs an inner class called Meta, which names the model to introspect through a model attribute, as in the examples below. By default all of the model's fields are included; a fields list specifies a whitelisted set of fields, and excludes prevents certain fields from appearing in the class.

In addition, it adds a text field that is the document=True field and has use_template=True option set, just like the BasicSearchIndex.

Warning

Usage of this class might result in inferior SearchIndex objects, which can directly affect your search results. Use this to establish basic functionality and move to custom SearchIndex objects for better control.

At this time, it does not handle related fields.
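The whitelist and exclusion rules can be sketched as a plain filter over field names. The select_fields helper below is hypothetical and only illustrates how fields and excludes narrow the set; it does not reproduce how ModelSearchIndex introspects a model.

```python
def select_fields(model_fields, fields=None, excludes=None):
    """Return the field names that survive the whitelist and exclusions."""
    selected = []
    for name in model_fields:
        # A whitelist, when given, admits only the names it lists.
        if fields is not None and name not in fields:
            continue
        # Exclusions drop names that would otherwise be included.
        if excludes is not None and name in excludes:
            continue
        selected.append(name)
    return selected


model_fields = ['title', 'user', 'pub_date']
all_fields = select_fields(model_fields)
limited = select_fields(model_fields, excludes=['user'])
whitelisted = select_fields(model_fields, fields=['user', 'pub_date'])
```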

Quick Start

For the impatient:

import datetime
from haystack import indexes
from myapp.models import Note

# All Fields
class AllNoteIndex(indexes.ModelSearchIndex, indexes.Indexable):
    class Meta:
        model = Note

# Blacklisted Fields
class LimitedNoteIndex(indexes.ModelSearchIndex, indexes.Indexable):
    class Meta:
        model = Note
        excludes = ['user']

# Whitelisted Fields
class NoteIndex(indexes.ModelSearchIndex, indexes.Indexable):
    class Meta:
        model = Note
        fields = ['user', 'pub_date']

    # Note that regular ``SearchIndex`` methods apply.
    def index_queryset(self):
        "Used when the entire index for model is updated."
        return Note.objects.filter(pub_date__lte=datetime.datetime.now())