# A quick tour of Lucene and Solr

[Lucene](http://lucene.apache.org/) is a widely used information retrieval library in Java.

* implements the TF-IDF model
* simple model for indexing and search
* easy to learn, fast, and scalable
* written by Doug Cutting, who later co-wrote Hadoop
* I've been using it since 2001 ([DSpace](http://dspace.org/), [Canary](http://canarydatabase.org/), [WDL](http://wdl.org/), [Chronam](http://chroniclingamerica.loc.gov), [GW Libs](http://library.gwu.edu/))
* free, open source software

## Lucene basics

* everything is a Document (source format independent)
* Documents have Fields
* Fields can be tokenized, indexed, stored, multivalue (or not)
* text is Tokenized, Analyzed on index and query
* queries are parsed, then Tokenized, Analyzed, and results ranked

See [a simple example of code using the Lucene API](http://lucene.apache.org/core/5_0_0/core/overview-summary.html#overview_description) and also [more extensive Lucene documentation](http://lucene.apache.org/core/5_0_0/index.html).
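
To make the Document / Field / Analyzer vocabulary concrete, here is a toy sketch in plain Python. It is not the Lucene API: the `analyze`, `build_index`, and `search` helpers and the stop list are made up for illustration, and real analyzers do much more. It only shows the idea that text is analyzed the same way at index time and at query time, which is why "The FOUNDATION" finds "Foundation".

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "and"}  # hypothetical stop list


def analyze(text):
    """Tokenize on non-word characters, lowercase, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Build an inverted index mapping (field, term) to a set of document ids."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            for term in analyze(value):
                index[(field, term)].add(doc_id)
    return index


def search(index, field, query):
    """Return ids of documents whose field contains every analyzed query term."""
    postings = [index.get((field, term), set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


# Two tiny "documents", each a dictionary of field name -> text value.
docs = {
    "0553293354": {"name": "Foundation", "author": "Isaac Asimov"},
    "SOLR1000": {"name": "Solr, the Enterprise Search Server"},
}
index = build_index(docs)
hits = search(index, "name", "The FOUNDATION")
print(sorted(hits))
```

Unlike this sketch, Lucene also ranks the matching documents, which is the subject of the next section.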

## Relevance in Lucene

* Lucene uses a modified TF-IDF (cosine similarity) strategy by default
* adds term (field value) level boosting at index time, and query term boosting
* adds coordination factor for multi-term queries
* adds query normalization to make scores comparable across queries (comparison not meaningful otherwise)
* adds field length normalization so shorter fields contribute more
* optional [Term Vector Component](https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component) supports further exploration of query result and corpus similarity
* explained [in detail here](http://lucene.apache.org/core/5_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html)
* can swap this out for other similarity functions


## Solr basics

[Solr](http://lucene.apache.org/solr/) is a search application (server) that makes Lucene even easier to implement and scale.

* Lucene under the hood
* simple "index anything" or schema-based configuration
* simple web API to index and search
* scales to many servers
* very widely used
* free, open source software

# Working with Solr

* index and search using web API - easier code
* define a schema first, or don't
* csv, json, xml results easy to parse in any language
* handles many technical details (caching, faceting, spelling corrections, clustering, hit highlighting)
* the bee's knees, basically

## Searching Solr

It's as easy as sending an HTTP request.  Let's use [python-requests](http://docs.python-requests.org/en/latest/) to handle HTTP.

In [50]:
import requests
req = requests.get("http://localhost:8983/solr/gettingstarted/select?wt=json&q=foundation&indent=true")
print req.content

{
  "responseHeader":{
    "status":0,
    "QTime":18,
    "params":{
      "q":"foundation",
      "indent":"true",
      "wt":"json"}},
  "response":{"numFound":3672,"start":0,"maxScore":0.39567953,"docs":[
      {
        "id":"0553293354",
        "cat":["book"],
        "name":["Foundation"],
        "price":[7.99],
        "inStock":[true],
        "author":["Isaac Asimov"],
        "series_t":["Foundation Novels"],
        "sequence_i":1,
        "genre_s":"scifi",
        "_version_":1494510828527812608},
      {
        "id":"UTF8TEST",
        "name":["Test with some UTF-8 encoded characters"],
        "manu":["Apache Software Foundation"],
        "cat":["software",
          "search"],
        "features":["No accents here",
          "This is an e acute: é",
          "eaiou with circumflexes: êâîôû",
          "eaiou with umlauts: ëäïöü",
          "tag with escaped chars: <nicetag/>",
          "escaped ampersand: Bonnie & Clyde",
          "Outside the BMP:𐌈 codepoint=10

If you haven't seen it before, this is [JSON](http://json.org/), a format that is easy to produce and parse and is widely used on the web.  Many APIs return JSON content, and practically every language you might use supports JSON.  It's built into Python (`import json`), and it's so common that modules like `requests` automatically parse it for you, like this:

In [51]:
resp = req.json()
resp.keys()

[u'responseHeader', u'response']

As you can see, the JSON response is just a big dictionary: the `json()` method parses the response text (what we printed above) and returns an in-memory Python dictionary.

**Don't be fooled** - if you print a Python dictionary it will look a lot like JSON, but that is mostly a happy convenience; the two syntaxes differ (compare `True` in the parsed results with `true` in the raw response).  It is very important to use well-tested JSON libraries to produce and parse JSON, as we do here.  Malformed JSON can turn up by accident or be crafted maliciously, so always use a JSON library instead of something like Python's `eval()`.
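
A small sketch of that difference, using only the standard `json` module. The `doc` dictionary is a cut-down copy of the first hit; the variable names are just for illustration.

```python
import json

doc = {"id": "0553293354", "inStock": [True], "price": [7.99]}

as_json = json.dumps(doc, sort_keys=True)  # real JSON text: note the lowercase true
as_repr = repr(doc)                        # Python literal syntax, which is not JSON

# A JSON library round-trips the data safely.
round_tripped = json.loads(as_json)

# The printed dictionary is rejected by a JSON parser (single quotes, True).
try:
    json.loads(as_repr)
    repr_is_json = True
except ValueError:
    repr_is_json = False

print(as_json)
print(as_repr)
print(repr_is_json)
```

The point of the `try` block is that a proper parser fails loudly on input that only looks like JSON, where `eval()` would happily execute it.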

Anyway, back to the data.  The response header is pretty straightforward:

In [52]:
resp['responseHeader']

{u'QTime': 18,
 u'params': {u'indent': u'true', u'q': u'foundation', u'wt': u'json'},
 u'status': 0}

The more interesting bits are in the query response.

In [53]:
resp['response'].keys()

[u'start', u'maxScore', u'numFound', u'docs']

In [54]:
resp['response']['start']

0

In [55]:
resp['response']['maxScore']

0.39567953

In [56]:
resp['response']['numFound']

3672

Those are the basic result counters.  We're starting with the first page of results, and `start` is a zero-based offset, which explains why it is 0.  We'll look more at scores in a moment.

Let's take a look at the results first.

In [57]:
len(resp['response']['docs'])

10

In [58]:
docs = resp['response']['docs']
docs[0]

{u'_version_': 1494510828527812608,
 u'author': [u'Isaac Asimov'],
 u'cat': [u'book'],
 u'genre_s': u'scifi',
 u'id': u'0553293354',
 u'inStock': [True],
 u'name': [u'Foundation'],
 u'price': [7.99],
 u'sequence_i': 1,
 u'series_t': [u'Foundation Novels']}

What we see in this response is a combination of metadata about the result as a "hit" and the field values that were stored at index time.

In [59]:
docs[1]

{u'_version_': 1494510737274437632,
 u'cat': [u'software', u'search'],
 u'features': [u'No accents here',
  u'This is an e acute: \xe9',
  u'eaiou with circumflexes: \xea\xe2\xee\xf4\xfb',
  u'eaiou with umlauts: \xeb\xe4\xef\xf6\xfc',
  u'tag with escaped chars: <nicetag/>',
  u'escaped ampersand: Bonnie & Clyde',
  u'Outside the BMP:\U00010308 codepoint=10308, a circle with an x inside. UTF8=f0908c88 UTF16=d800 df08'],
 u'id': u'UTF8TEST',
 u'inStock': [True],
 u'manu': [u'Apache Software Foundation'],
 u'name': [u'Test with some UTF-8 encoded characters'],
 u'price': [0.0]}

In a web application context, we can iterate over results and display them like this:

In [60]:
for doc in docs:
    print "%s: %s, $%s" % (doc['id'], doc.get('name', '(no title)'), doc.get('price', '-'))

0553293354: [u'Foundation'], $[7.99]
UTF8TEST: [u'Test with some UTF-8 encoded characters'], $[0.0]
SOLR1000: [u'Solr, the Enterprise Search Server'], $[0.0]
/Volumes/PGCD0803/etext03/vbgle11h/index.html: (no title), $-
/Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-core/org/apache/solr/spelling/suggest/tst/package-use.html: (no title), $-
/Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-dataimporthandler-extras/deprecated-list.html: (no title), $-
/Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-analytics/deprecated-list.html: (no title), $-
/Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-clustering/org/apache/solr/handler/clustering/carrot2/package-use.html: (no title), $-
/Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-core/org/apache/solr/logging/log4j/package-use.html: (no title), $-
/Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-solrj/org/apache/solr/client/solrj/util/package-use.html: (no title), $-


In a web application you'd do a lot more than just this... probably formatting a link to the item, making it look good, and then adding buttons to page through results, display information about how many results there were overall (`numFound`), etc.
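
As a rough sketch of the paging part, Solr's `start` and `rows` parameters select a slice of the result list. The `page_url` and `page_count` helpers below are hypothetical names of my own, and the URL assumes the same local `gettingstarted` collection used throughout; `urlencode` takes care of escaping the query text.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

SELECT_URL = "http://localhost:8983/solr/gettingstarted/select"


def page_url(query, page, rows=10):
    """Build a select URL for a 1-based page number."""
    params = [
        ("q", query),
        ("wt", "json"),
        ("start", (page - 1) * rows),  # zero-based offset of the first hit
        ("rows", rows),                # hits per page
    ]
    return SELECT_URL + "?" + urlencode(params)


def page_count(num_found, rows=10):
    """Number of pages needed to show num_found hits."""
    return (num_found + rows - 1) // rows


print(page_url("foundation", 3))
print(page_count(3672))
```

The resulting URL can be passed straight to `requests.get()`, and `page_count(resp['response']['numFound'])` tells you how many "next" links to offer.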

Let's revisit this query and add relevance rank scores to the results.

In [89]:
# add "fl=*,score" to get all stored fields along with the relevance score
req = requests.get("http://localhost:8983/solr/gettingstarted/select?wt=json&q=foundation&fl=*,score")
resp = req.json()
docs = resp['response']['docs']
docs[0]

{u'_version_': 1494510828527812608,
 u'author': [u'Isaac Asimov'],
 u'cat': [u'book'],
 u'genre_s': u'scifi',
 u'id': u'0553293354',
 u'inStock': [True],
 u'name': [u'Foundation'],
 u'price': [7.99],
 u'score': 0.39567953,
 u'sequence_i': 1,
 u'series_t': [u'Foundation Novels']}

In [62]:
for doc in docs:
    print "%s - %s" % (doc['score'], doc.get('name', '(no title)'))

0.39567953 - [u'Foundation']
0.13989384 - [u'Test with some UTF-8 encoded characters']
0.122407116 - [u'Solr, the Enterprise Search Server']
0.06483684 - (no title)
0.061824925 - (no title)
0.061824925 - (no title)
0.061824925 - (no title)
0.061824925 - (no title)
0.061824925 - (no title)
0.061824925 - (no title)


# Under the hood

Now let's take a closer look at how those relevance scores are generated. Solr offers easy tools for examining a live index and debugging information about results.

In [63]:
# Add parameter debugQuery=true, repeat
req = requests.get("http://localhost:8983/solr/gettingstarted/select?wt=json&q=foundation&fl=*,score&debugQuery=true")
resp = req.json()
resp.keys()

[u'debug', u'responseHeader', u'response']

Let's look at that `debug` element:

In [64]:
debug = resp['debug']
debug.keys()

[u'parsedquery',
 u'track',
 u'explain',
 u'querystring',
 u'rawquerystring',
 u'parsedquery_toString',
 u'QParser',
 u'timing']

First, we can see that the query is parsed to determine how it should be processed.  In the absence of named fields, it assumes a default, here `_text`.

In [65]:
debug['QParser']

u'LuceneQParser'

In [66]:
debug['rawquerystring']

u'foundation'

In [67]:
debug['parsedquery']

u'_text:foundation'

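The `timing` element gives the processing time in milliseconds, broken down by search component.  Note that many components (faceting, highlighting, `mlt`, stats) report 0.0 because this query didn't ask for them.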

In [68]:
debug['timing']

{u'prepare': {u'debug': {u'time': 0.0},
  u'expand': {u'time': 0.0},
  u'facet': {u'time': 0.0},
  u'highlight': {u'time': 0.0},
  u'mlt': {u'time': 0.0},
  u'query': {u'time': 4.0},
  u'stats': {u'time': 0.0},
  u'time': 4.0},
 u'process': {u'debug': {u'time': 7.0},
  u'expand': {u'time': 0.0},
  u'facet': {u'time': 0.0},
  u'highlight': {u'time': 0.0},
  u'mlt': {u'time': 0.0},
  u'query': {u'time': 4.0},
  u'stats': {u'time': 0.0},
  u'time': 12.0},
 u'time': 16.0}

## The good stuff: relevance scoring

Now we get to the good bits.  The `explain` element contains the precise scoring details for each aspect of a parsed query. 

In [69]:
explain = debug['explain']
explain.keys()

[u'0553293354',
 u'/Volumes/PGCD0803/etext03/vbgle11h/index.html',
 u'UTF8TEST',
 u'/Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-core/org/apache/solr/spelling/suggest/tst/package-use.html',
 u'/Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-solrj/org/apache/solr/client/solrj/util/package-use.html',
 u'/Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-core/org/apache/solr/logging/log4j/package-use.html',
 u'/Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-dataimporthandler-extras/deprecated-list.html',
 u'SOLR1000',
 u'/Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-clustering/org/apache/solr/handler/clustering/carrot2/package-use.html',
 u'/Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-analytics/deprecated-list.html']

In [70]:
print explain['UTF8TEST']


0.13989384 = (MATCH) weight(_text:foundation in 1578) [DefaultSimilarity], result of:
  0.13989384 = fieldWeight in 1578, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    1.1191508 = idf(docFreq=1785, maxDocs=2012)
    0.125 = fieldNorm(doc=1578)



In [71]:
print explain['0553293354']


0.39567953 = (MATCH) weight(_text:foundation in 1586) [DefaultSimilarity], result of:
  0.39567953 = fieldWeight in 1586, product of:
    1.4142135 = tf(freq=2.0), with freq of:
      2.0 = termFreq=2.0
    1.1191508 = idf(docFreq=1785, maxDocs=2012)
    0.25 = fieldNorm(doc=1586)
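
The two explanations above can be reproduced by hand. This is a sketch assuming the classic TF-IDF formulas behind `DefaultSimilarity` (tf is the square root of the term frequency, idf is 1 + ln(maxDocs / (docFreq + 1))); the `fieldNorm` values are copied from the output rather than computed, and the function names are my own.

```python
import math


def tf(freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def field_weight(freq, doc_freq, max_docs, field_norm):
    """Single-term score: tf * idf * fieldNorm."""
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Values taken from the explain output for the query "foundation".
utf8test = field_weight(1.0, 1785, 2012, 0.125)    # explain reports 0.13989384
foundation = field_weight(2.0, 1785, 2012, 0.25)   # explain reports 0.39567953

print(idf(1785, 2012))
print(utf8test)
print(foundation)
```

Both documents share the same idf, so the Asimov record wins on two counts: the term occurs twice, and its `fieldNorm` is larger because the indexed text is shorter.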



In [72]:
req = requests.get("http://localhost:8983/solr/gettingstarted/select?wt=json&q=chinese+name:tokenizer&fl=*,score&debugQuery=true")
resp = req.json()
resp['debug']['parsedquery']

u'_text:chinese name:tokenizer'

In [73]:
for doc in resp['response']['docs']:
    print "%s - %s: %s" % (doc['score'], doc['id'], doc.get('name', '(no name)'))

0.032653715 - /Volumes/PGCD0803/etext00/rbddh10.txt: (no name)
0.03125867 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-clustering/org/apache/solr/handler/clustering/carrot2/LuceneCarrot2TokenizerFactory.html: (no name)
0.030319117 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-clustering/org/apache/solr/handler/clustering/carrot2/package-summary.html: (no name)
0.026523702 - /Volumes/PGCD0803/etext00/rlchn10.txt: (no name)
0.025249962 - /Volumes/PGCD0803/etext94/sunzu10.txt: (no name)
0.023208376 - /Volumes/PGCD0803/etext04/hmjnc11h/chap26.html: (no name)
0.019769719 - /Volumes/PGCD0803/etext04/hmjnc11h/chap28.html: (no name)
0.019723328 - /Volumes/PGCD0803/etext05/8igjp10.txt: (no name)
0.019395495 - /Volumes/PGCD0803/etext04/samur10.txt: (no name)
0.017682573 - /Volumes/PGCD0803/etext04/conra10.txt: (no name)


Because this is a slightly more complex query, we can see a lot more going on in the explanations.

In [74]:
explain = resp['debug']['explain']
for id, result in explain.items():
    print "%s%s" % (id, result)

/Volumes/PGCD0803/etext94/sunzu10.txt
0.025249964 = (MATCH) product of:
  0.050499927 = (MATCH) sum of:
    0.050499927 = (MATCH) weight(_text:chinese in 15) [DefaultSimilarity], result of:
      0.050499927 = score(doc=15,freq=116.0), product of:
        0.41048786 = queryWeight, product of:
          3.8988824 = idf(docFreq=116, maxDocs=2124)
          0.10528347 = queryNorm
        0.123024166 = fieldWeight in 15, product of:
          10.770329 = tf(freq=116.0), with freq of:
            116.0 = termFreq=116.0
          3.8988824 = idf(docFreq=116, maxDocs=2124)
          0.0029296875 = fieldNorm(doc=15)
  0.5 = coord(1/2)

/Volumes/PGCD0803/etext05/8igjp10.txt
0.01972333 = (MATCH) product of:
  0.03944666 = (MATCH) sum of:
    0.03944666 = (MATCH) weight(_text:chinese in 64) [DefaultSimilarity], result of:
      0.03944666 = score(doc=64,freq=52.0), product of:
        0.41048786 = queryWeight, product of:
          3.8988824 = idf(docFreq=116, maxDocs=2124)
          0.10528347 =

## A better example

These examples are rather inscrutable; let's look at something more obvious.  In addition to the example "getting started" API documentation indexed during the [Solr Tutorial](http://lucene.apache.org/solr/quickstart.html), I've downloaded and indexed several hundred books from [Project Gutenberg](http://www.gutenberg.org/ebooks/11220).  Let's try the "foundation" search again:

In [75]:
req = requests.get("http://localhost:8983/solr/gettingstarted/select?q=foundation&wt=json&fl=*,score&debugQuery=true")
resp = req.json()
for doc in resp['response']['docs']:
    print "%s - %s: %s" % (doc['score'], doc['id'], doc.get('name', '(no name)'))

0.39567953 - 0553293354: [u'Foundation']
0.13989384 - UTF8TEST: [u'Test with some UTF-8 encoded characters']
0.122407116 - SOLR1000: [u'Solr, the Enterprise Search Server']
0.06483684 - /Volumes/PGCD0803/etext03/vbgle11h/index.html: (no name)
0.061824925 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-core/org/apache/solr/spelling/suggest/tst/package-use.html: (no name)
0.061824925 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-dataimporthandler-extras/deprecated-list.html: (no name)
0.061824925 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-solrj/org/apache/solr/client/solrj/util/package-use.html: (no name)
0.061824925 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-analytics/deprecated-list.html: (no name)
0.061824925 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-clustering/org/apache/solr/handler/clustering/carrot2/package-use.html: (no name)
0.061824925 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-core/org/apache/solr/logging/log4j/package-use.html: (no nam

That first hit is new!  Let's take a look.

In [76]:
explain = resp['debug']['explain']
print explain['0553293354']


0.39567953 = (MATCH) weight(_text:foundation in 1586) [DefaultSimilarity], result of:
  0.39567953 = fieldWeight in 1586, product of:
    1.4142135 = tf(freq=2.0), with freq of:
      2.0 = termFreq=2.0
    1.1191508 = idf(docFreq=1785, maxDocs=2012)
    0.25 = fieldNorm(doc=1586)



In [77]:
print explain['SOLR1000']


0.122407116 = (MATCH) weight(_text:foundation in 1576) [DefaultSimilarity], result of:
  0.122407116 = fieldWeight in 1576, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    1.1191508 = idf(docFreq=1785, maxDocs=2012)
    0.109375 = fieldNorm(doc=1576)



Can you guess what it is?

In [78]:
resp['response']['docs'][0]

{u'_version_': 1494510828527812608,
 u'author': [u'Isaac Asimov'],
 u'cat': [u'book'],
 u'genre_s': u'scifi',
 u'id': u'0553293354',
 u'inStock': [True],
 u'name': [u'Foundation'],
 u'price': [7.99],
 u'score': 0.39567953,
 u'sequence_i': 1,
 u'series_t': [u'Foundation Novels']}

This gives us a great opportunity to see how relevance calculations vary with specific queries. First, we add the bare term "asimov", so our search query is now "foundation asimov".  Watch what happens to the relevance score of the top hit, and the scores of the other hits.

In [79]:
req = requests.get("http://localhost:8983/solr/gettingstarted/select?q=foundation+asimov&wt=json&fl=*,score&debugQuery=true")
resp = req.json()
for doc in resp['response']['docs']:
    print "%s - %s: %s" % (doc['score'], doc['id'], doc.get('name', '(no name)'))

2.0143478 - 0553293354: [u'Foundation']
0.009794351 - UTF8TEST: [u'Test with some UTF-8 encoded characters']
0.008570056 - SOLR1000: [u'Solr, the Enterprise Search Server']
0.004328532 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-core/org/apache/solr/spelling/suggest/tst/package-use.html: (no name)
0.004328532 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-dataimporthandler-extras/deprecated-list.html: (no name)
0.004328532 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-solrj/org/apache/solr/client/solrj/util/package-use.html: (no name)
0.004328532 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-analytics/deprecated-list.html: (no name)
0.004328532 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-clustering/org/apache/solr/handler/clustering/carrot2/package-use.html: (no name)
0.004328532 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-core/org/apache/solr/logging/log4j/package-use.html: (no name)
0.004328532 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-test

In [80]:
explain = resp['debug']['explain']
print explain['0553293354']


2.0143478 = (MATCH) sum of:
  0.05540521 = (MATCH) weight(_text:foundation in 1586) [DefaultSimilarity], result of:
    0.05540521 = score(doc=1586,freq=2.0), product of:
      0.14002547 = queryWeight, product of:
        1.1191508 = idf(docFreq=1785, maxDocs=2012)
        0.12511761 = queryNorm
      0.39567953 = fieldWeight in 1586, product of:
        1.4142135 = tf(freq=2.0), with freq of:
          2.0 = termFreq=2.0
        1.1191508 = idf(docFreq=1785, maxDocs=2012)
        0.25 = fieldNorm(doc=1586)
  1.9589427 = (MATCH) weight(_text:asimov in 1586) [DefaultSimilarity], result of:
    1.9589427 = score(doc=1586,freq=1.0), product of:
      0.99014795 = queryWeight, product of:
        7.9137373 = idf(docFreq=1, maxDocs=2012)
        0.12511761 = queryNorm
      1.9784343 = fieldWeight in 1586, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        7.9137373 = idf(docFreq=1, maxDocs=2012)
        0.25 = fieldNorm(doc=1586)



In [81]:
print explain['SOLR1000']


0.008570056 = (MATCH) product of:
  0.017140113 = (MATCH) sum of:
    0.017140113 = (MATCH) weight(_text:foundation in 1576) [DefaultSimilarity], result of:
      0.017140113 = score(doc=1576,freq=1.0), product of:
        0.14002547 = queryWeight, product of:
          1.1191508 = idf(docFreq=1785, maxDocs=2012)
          0.12511761 = queryNorm
        0.122407116 = fieldWeight in 1576, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.1191508 = idf(docFreq=1785, maxDocs=2012)
          0.109375 = fieldNorm(doc=1576)
  0.5 = coord(1/2)
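
With two query terms, two more factors show up: `queryNorm` and `coord`. The sketch below recomputes both explanations for "foundation asimov", again assuming the classic `DefaultSimilarity` formulas. Here `queryNorm` is taken to be 1 / sqrt(sum of squared idf values of the query terms), and `coord` is the fraction of query terms a document matched. The helper names are mine; the input numbers come from the output above.

```python
import math

MAX_DOCS = 2012  # maxDocs from the explain output


def idf(doc_freq):
    return 1.0 + math.log(MAX_DOCS / (doc_freq + 1.0))


def query_norm(idfs):
    """Normalization factor shared by every term in the query."""
    return 1.0 / math.sqrt(sum(i * i for i in idfs))


def term_score(freq, doc_freq, field_norm, norm):
    """queryWeight * fieldWeight for one matching term."""
    term_idf = idf(doc_freq)
    query_weight = term_idf * norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


# "foundation" occurs in 1785 docs, "asimov" in just 1.
norm = query_norm([idf(1785), idf(1)])  # explain reports 0.12511761

# The Asimov record matches both terms, so coord is 2/2.
top_hit = (2 / 2.0) * (term_score(2.0, 1785, 0.25, norm)
                       + term_score(1.0, 1, 0.25, norm))  # explain reports 2.0143478

# SOLR1000 matches only "foundation", so coord is 1/2.
solr1000 = (1 / 2.0) * term_score(1.0, 1785, 0.109375, norm)  # explain reports 0.008570056

print(norm)
print(top_hit)
print(solr1000)
```

Almost all of the top hit's score comes from "asimov": a term found in a single document has a very high idf, and idf enters the score twice (once in `queryWeight`, once in `fieldWeight`).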



Let's refine it a notch by using a specific field query for the author name.

In [82]:
req = requests.get("http://localhost:8983/solr/gettingstarted/select?q=foundation+author:asimov&wt=json&fl=*,score&debugQuery=true")
resp = req.json()
for doc in resp['response']['docs']:
    print "%s - %s: %s" % (doc['score'], doc['id'], doc.get('name', '(no name)'))

0.02551029 - 0553293354: [u'Foundation']
0.00901925 - UTF8TEST: [u'Test with some UTF-8 encoded characters']
0.007891844 - SOLR1000: [u'Solr, the Enterprise Search Server']
0.004149459 - /Volumes/PGCD0803/etext03/vbgle11h/index.html: (no name)
0.003985983 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-core/org/apache/solr/spelling/suggest/tst/package-use.html: (no name)
0.003985983 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-dataimporthandler-extras/deprecated-list.html: (no name)
0.003985983 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-solrj/org/apache/solr/client/solrj/util/package-use.html: (no name)
0.003985983 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-analytics/deprecated-list.html: (no name)
0.003985983 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-clustering/org/apache/solr/handler/clustering/carrot2/package-use.html: (no name)
0.003985983 - /Users/dchud/Downloads/apps/solr-5.0.0/docs/solr-core/org/apache/solr/logging/log4j/package-use.html: (no na

What happened to the scores?

In [83]:
explain = resp['debug']['explain']
print explain['0553293354']


0.02551029 = (MATCH) product of:
  0.05102058 = (MATCH) sum of:
    0.05102058 = (MATCH) weight(_text:foundation in 1586) [DefaultSimilarity], result of:
      0.05102058 = score(doc=1586,freq=2.0), product of:
        0.1289442 = queryWeight, product of:
          1.1191508 = idf(docFreq=1785, maxDocs=2012)
          0.11521612 = queryNorm
        0.39567953 = fieldWeight in 1586, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.1191508 = idf(docFreq=1785, maxDocs=2012)
          0.25 = fieldNorm(doc=1586)
  0.5 = coord(1/2)



In [84]:
print explain['SOLR1000']


0.007891844 = (MATCH) product of:
  0.015783688 = (MATCH) sum of:
    0.015783688 = (MATCH) weight(_text:foundation in 1576) [DefaultSimilarity], result of:
      0.015783688 = score(doc=1576,freq=1.0), product of:
        0.1289442 = queryWeight, product of:
          1.1191508 = idf(docFreq=1785, maxDocs=2012)
          0.11521612 = queryNorm
        0.122407116 = fieldWeight in 1576, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.1191508 = idf(docFreq=1785, maxDocs=2012)
          0.109375 = fieldNorm(doc=1576)
  0.5 = coord(1/2)
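
One way to read these two explanations: only the `_text:foundation` clause matched, so every hit is halved by `coord(1/2)`, and the `author:asimov` clause contributed nothing. The sketch below back-solves the idf of that clause from the reported `queryNorm`, assuming the same classic `DefaultSimilarity` formulas as before. The result is consistent with a `docFreq` of 0, that is, no document has the bare term "asimov" in its `author` field. A plausible cause, and this is my assumption rather than something the output states, is that `author` is indexed as one untokenized string such as "Isaac Asimov".

```python
import math

MAX_DOCS = 2012  # maxDocs from the explain output


def idf(doc_freq):
    return 1.0 + math.log(MAX_DOCS / (doc_freq + 1.0))


reported_norm = 0.11521612  # queryNorm from the explain output
idf_foundation = idf(1785)

# queryNorm = 1 / sqrt(idf_foundation^2 + idf_author^2), solved for idf_author.
implied_idf = math.sqrt(1.0 / reported_norm ** 2 - idf_foundation ** 2)
idf_if_no_docs = idf(0)  # idf of a term that occurs in no document

# Top hit: coord(1/2) * queryWeight * fieldWeight for "foundation" alone.
top_hit = 0.5 * (idf_foundation * reported_norm) * (math.sqrt(2.0) * idf_foundation * 0.25)

print(implied_idf)
print(idf_if_no_docs)
print(top_hit)  # explain reports 0.02551029
```

If that reading is right, a query on the exact stored value (for example a quoted `author:"Isaac Asimov"`) would be the thing to try next.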



## Faceting

This is only waving at the surface of what Solr can do, not even scratching it.  Just as an example, here's what a result set faceted by author names looks like.

In [85]:
req = requests.get("http://localhost:8983/solr/gettingstarted/select?q=*:*&facet=true&facet.field=author&wt=json&fl=*,score&debugQuery=true")
resp = req.json()
resp['responseHeader']

{u'QTime': 96,
 u'params': {u'debugQuery': u'true',
  u'facet': u'true',
  u'facet.field': u'author',
  u'fl': u'*,score',
  u'q': u'*:*',
  u'wt': u'json'},
 u'status': 0}

In [86]:
resp['facet_counts']

{u'facet_dates': {},
 u'facet_fields': {u'author': [u'George R.R. Martin',
   3,
   u'Lloyd Alexander',
   2,
   u'Rick Riordan',
   2,
   u'A NOVEL IN THREE PARTS BY FYODOR DOSTOEVSKY',
   1,
   u'Forster, E. M.',
   1,
   u'Glen Cook',
   1,
   u'Henry David Thoreau',
   1,
   u'Isaac Asimov',
   1,
   u'Jostein Gaarder',
   1,
   u'Michael McCandless',
   1,
   u'Orson Scott Card',
   1,
   u'Roger Zelazny',
   1,
   u'Rudyard Kipling',
   1,
   u'Sir James George Frazer',
   1,
   u'Steven Brust',
   1,
   u'gilevi',
   1]},
 u'facet_intervals': {},
 u'facet_queries': {},
 u'facet_ranges': {}}

## High degree of usefulness warning

This is exactly how we do the facets/refinement categories on the right side of the screen [after this catalog query](http://findit.library.gwu.edu/search?q=asimov+foundation).  Or if you don't use the library, think of product catalogs at Amazon.  Faceting is what makes it easy to refine searches to limit to TVs of a certain size, or within a particular price range, or from a particular manufacturer.
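
To display facets like that, note that `facet_fields` returns each field as a flat list that alternates value, count, value, count. A minimal sketch for pairing them up follows; `facet_pairs` is a hypothetical helper name, and the sample list is a shortened copy of the author facet above.

```python
def facet_pairs(flat):
    """Turn [value, count, value, count, ...] into [(value, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


# Shortened copy of resp['facet_counts']['facet_fields']['author']
author_facets = [
    "George R.R. Martin", 3,
    "Lloyd Alexander", 2,
    "Rick Riordan", 2,
    "Isaac Asimov", 1,
]

pairs = facet_pairs(author_facets)
lines = ["%s (%d)" % (name, count) for name, count in pairs]
for line in lines:
    print(line)
```

Each pair is what a sidebar link needs: a label and a count.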

In [87]:
resp['debug']['timing']

{u'prepare': {u'debug': {u'time': 0.0},
  u'expand': {u'time': 0.0},
  u'facet': {u'time': 0.0},
  u'highlight': {u'time': 0.0},
  u'mlt': {u'time': 0.0},
  u'query': {u'time': 0.0},
  u'stats': {u'time': 0.0},
  u'time': 0.0},
 u'process': {u'debug': {u'time': 1.0},
  u'expand': {u'time': 0.0},
  u'facet': {u'time': 69.0},
  u'highlight': {u'time': 0.0},
  u'mlt': {u'time': 0.0},
  u'query': {u'time': 17.0},
  u'stats': {u'time': 0.0},
  u'time': 88.0},
 u'time': 88.0}
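
Most entries in that dictionary are zero, so a tiny helper makes it easier to see where the time went. This is just a sketch: `nonzero_components` is a made-up name, and `process` is a copy of the `process` section printed above.

```python
def nonzero_components(phase):
    """Return {component: milliseconds} for components that took any time."""
    return dict(
        (name, info["time"])
        for name, info in phase.items()
        if isinstance(info, dict) and info["time"] > 0
    )


# Copy of resp['debug']['timing']['process'] from the faceted query
process = {
    "debug": {"time": 1.0},
    "expand": {"time": 0.0},
    "facet": {"time": 69.0},
    "highlight": {"time": 0.0},
    "mlt": {"time": 0.0},
    "query": {"time": 17.0},
    "stats": {"time": 0.0},
    "time": 88.0,  # total for the phase, skipped by the helper
}

busy = nonzero_components(process)
for name in sorted(busy, key=busy.get, reverse=True):
    print("%s: %.1f ms" % (name, busy[name]))
```

Here faceting accounts for most of the processing time, with the query itself a distant second.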

88 milliseconds is pretty fast.  Why is this so fast?

That's all for this quick tour.  There is a ton of stuff in Lucene/Solr that makes it a powerful tool for building search apps, even including spatial search.  Basically, if you can represent anything as a document-with-fields-with-values, you can index it with Lucene and serve it up with Solr.