Terminology

cpliakas edited this page Feb 15, 2013 · 4 revisions

Components

The following components are the parts of a search system that the Search Framework is aware of. Some are defined within the framework itself; others are things the framework knows about but does not interact with directly.

Collection

A collection is the source data being searched. Examples are RSS / Atom feeds, content in a CMS, third party websites, etc. In the Search Framework, a collection is responsible for connecting to the data source to fetch the items that are scheduled for indexing and modeling the structure of the data so that it can be processed and searched based on the type of content it contains.
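The two responsibilities described above, fetching items from the data source and modeling their structure, can be sketched roughly as follows. This is an illustrative Python sketch, not the Search Framework's API: the `RssCollection` class, its `fetch_items` method, and the sample feed are all hypothetical, and the "connection" is simply parsing an in-memory string.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator


class RssCollection:
    """Hypothetical collection backed by an RSS document held in memory."""

    def __init__(self, rss_xml: str) -> None:
        self.rss_xml = rss_xml

    def fetch_items(self) -> Iterator[Dict[str, str]]:
        """Read the source and yield one flat mapping per feed item."""
        root = ET.fromstring(self.rss_xml)
        for item in root.iter("item"):
            # Model each item as element name -> text content.
            yield {child.tag: (child.text or "") for child in item}


# Made-up sample data, used only for this illustration.
SAMPLE_RSS = """<rss><channel>
  <item><title>First post</title><pubDate>Fri, 15 Feb 2013 10:00:00 GMT</pubDate></item>
  <item><title>Second post</title><pubDate>Sat, 16 Feb 2013 09:30:00 GMT</pubDate></item>
</channel></rss>"""

items = list(RssCollection(SAMPLE_RSS).fetch_items())
```

A real collection would presumably fetch the feed over the network and return only the items that are scheduled for indexing; the sketch leaves both out to stay self-contained.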

Index

An index is a copy of the source data stored in some format that is optimized for search. Similar to an index card in a library's card catalog, the index usually contains only a subset of the source data that is useful for searching. The Search Framework library has only indirect interaction with the index via third party client libraries, and it has only a basic awareness of how an index is structured and what data is contained in it.

Schema

The schema is a model of the source data as it is stored in the index. For example, the schema might flag the "pubDate" element in an RSS feed as a "date" and map it to the appropriate field in the index. The schema is also used to determine how the data is processed for indexing.
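As a rough illustration of the "pubDate" example, a schema can be thought of as a mapping from source elements to typed index fields. The sketch below is in Python and is not the Search Framework's API: the `SCHEMA` table, the `published` field name, and the `to_index_document` helper are assumptions made for the example.

```python
from email.utils import parsedate_to_datetime
from typing import Dict

# Hypothetical schema: source element -> (index field name, field type).
SCHEMA = {
    "title": ("title", "string"),
    "pubDate": ("published", "date"),
}


def to_index_document(item: Dict[str, str]) -> Dict[str, str]:
    """Map a source item to index fields, processing values by field type."""
    doc = {}
    for source_name, (field_name, field_type) in SCHEMA.items():
        if source_name not in item:
            continue
        value = item[source_name]
        if field_type == "date":
            # RSS dates follow RFC 822; normalize them to ISO 8601.
            value = parsedate_to_datetime(value).isoformat()
        doc[field_name] = value
    return doc


doc = to_index_document({
    "title": "First post",
    "pubDate": "Fri, 15 Feb 2013 10:00:00 GMT",
    "guid": "1",  # not in the schema, so it is left out of the index document
})
```

Source elements that the schema does not mention are dropped, which matches the idea that the index holds only the subset of the data that is useful for searching.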

Search Engine

A search engine is the technology that builds and maintains the index and performs the searching operations. The Search Framework library doesn't have much visibility into the inner workings of the search engine, nor does it care to have any. The third party client libraries are responsible for doing the heavy lifting regarding the communication with the search engine, although the Search Framework will usually initiate the action.

Endpoint

Search engines often expose one or more endpoints that the client libraries use to connect to them. An endpoint could be a URL or a path on a filesystem. Distributed search engines such as Elasticsearch might have multiple endpoints that correspond to individual servers, but a Solr cluster that connects through a load balancer might define only one endpoint that points to the balancer.
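The three cases mentioned here (several server URLs, a single balancer URL, and a filesystem path) could be modeled along these lines. This is a hedged Python sketch rather than the framework's API; the `Endpoint` class and every hostname and path in it are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical endpoint: either a URL or a path on a filesystem."""

    location: str

    @property
    def is_url(self) -> bool:
        # Treat anything with an http(s) scheme as a URL, the rest as a path.
        return urlparse(self.location).scheme in ("http", "https")


# A distributed engine might expose one endpoint per server ...
elasticsearch_endpoints = [
    Endpoint("http://es1.example.com:9200"),
    Endpoint("http://es2.example.com:9200"),
]
# ... while a load-balanced cluster might expose only the balancer.
solr_endpoints = [Endpoint("http://balancer.example.com/solr")]
# An engine embedded in the application might be reached through a path.
local_index = Endpoint("/var/lib/search/index")
```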

Job Scheduler

The job scheduler is the generic term for the mechanism that tells the Search Framework which items are scheduled for indexing. It could be as simple as the Search Framework consuming an RSS feed and getting the latest items, or it could be a more complex system that has intimate knowledge of the content it is managing as well as which processes are performing indexing operations. Regardless, the Search Framework sees it as the thing it can query to get the items that are scheduled for indexing.
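From the framework's point of view, any scheduler reduces to "something that can be asked for the next items to index". A minimal Python sketch of that idea follows; the `JobScheduler` protocol, the `fetch_scheduled_items` method, and the `LatestFeedItemsScheduler` class are hypothetical names and do not come from the Search Framework.

```python
from typing import Iterable, List, Protocol


class JobScheduler(Protocol):
    """Hypothetical interface: the one thing the framework needs to query."""

    def fetch_scheduled_items(self, limit: int) -> List[str]:
        ...


class LatestFeedItemsScheduler:
    """Simplest case: hand out feed item IDs in batches until none remain."""

    def __init__(self, feed_item_ids: Iterable[str]) -> None:
        self._pending = list(feed_item_ids)

    def fetch_scheduled_items(self, limit: int) -> List[str]:
        # Return up to `limit` pending items and remember the remainder.
        batch, self._pending = self._pending[:limit], self._pending[limit:]
        return batch


scheduler: JobScheduler = LatestFeedItemsScheduler(["a", "b", "c"])
first_batch = scheduler.fetch_scheduled_items(limit=2)
second_batch = scheduler.fetch_scheduled_items(limit=2)
```

A more complex scheduler could track content changes and running indexing processes behind the same single query method.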

Queue

After the items that are due for indexing are fetched from the job scheduler, they are put into a queue for indexing. A queue could be something as simple as a PHP Iterator or a true queue such as RabbitMQ for parallel indexing operations.
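The point of treating both options as "a queue" is that the indexing loop does not need to know which one it is consuming. The Python sketch below illustrates this under stated assumptions: a plain iterator plays the role of the simple case, and the standard library's `queue.Queue` stands in for a real broker such as RabbitMQ, which would require its own client library. All function names are hypothetical.

```python
import queue
from typing import Iterable, Iterator, List


def iterator_queue(items: Iterable[str]) -> Iterator[str]:
    """Simple case: the 'queue' is just an iterator over the fetched items."""
    return iter(items)


def broker_style_queue(items: Iterable[str]) -> Iterator[str]:
    """Stand-in for a true queue: publish everything, then consume it."""
    broker: "queue.Queue[str]" = queue.Queue()
    for item in items:
        broker.put(item)
    while not broker.empty():
        yield broker.get()


def index_all(queued_items: Iterator[str]) -> List[str]:
    """Consume queued items; a real indexer would send each to the engine."""
    indexed = []
    for item in queued_items:
        indexed.append(item)
    return indexed


from_iterator = index_all(iterator_queue(["a", "b", "c"]))
from_broker = index_all(broker_style_queue(["a", "b", "c"]))
```

With a real message broker, several worker processes could consume from the same queue, which is what makes parallel indexing possible.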