# Building and Loading Text Search in Python Whoosh


## OUTLINE
 1. [Whoosh](#Whoosh_text)
 1. [Task at hand](#task)
 1. [Building our Whoosh Schema](#build_it)
 1. [Loading Data](#load_it)
 1. [Executing Queries, Google-lite...very very lite](#search_me) 



--- 
<a id='Whoosh_text' ></a>

## Whoosh

Whoosh was started as a quick and dirty search server for the online documentation of the Houdini 3D animation software package. 
Side Effects Software generously allowed the code to be open source, in case it might be useful to anyone else who needs a very flexible or pure-Python search engine (or both!).

  * Whoosh is fast, but uses only pure Python, so it will run anywhere Python runs, without requiring a compiler.
  * By default, Whoosh uses the Okapi BM25F ranking function, but like most things the ranking function can be easily customized.
  * Whoosh creates fairly small indexes compared to many other search libraries.
  * All indexed text in Whoosh must be unicode.
  * Whoosh lets you store arbitrary Python objects with indexed documents.
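
To give a feel for what a ranking function such as BM25 computes, here is a minimal single-field sketch in plain Python. It is not Whoosh's implementation (BM25F additionally combines weights across several fields); the function name `bm25_term_score`, the toy numbers, and the parameter values `k1=1.2` and `b=0.75` are illustrative assumptions.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document (simplified BM25)."""
    # Rarer terms (small doc_freq) get a larger inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates, and b penalises documents longer than average.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

# Same term frequency in both documents, but one document is much shorter.
short_doc = bm25_term_score(tf=3, doc_len=100, avg_doc_len=200, n_docs=3, doc_freq=1)
long_doc = bm25_term_score(tf=3, doc_len=400, avg_doc_len=200, n_docs=3, doc_freq=1)
print(short_doc > long_doc)  # the shorter document ranks higher
```

The takeaway is only the shape of the formula: matches in short documents and matches on rare terms count for more.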

### What is Whoosh?

Whoosh is a fast, pure Python search engine library.

The primary design impetus of Whoosh is that it is pure Python. 
You should be able to use Whoosh anywhere you can use Python, no compiler or Java required.

Like one of its ancestors, Lucene, Whoosh is not really a search engine; it’s a programmer library for creating a search engine.

Practically no important behavior of Whoosh is hard-coded. 
Indexing of text, the level of information stored for each term in each field, parsing of search queries, the types of queries allowed, scoring algorithms, etc. are all customizable, replaceable, and extensible.

--- 
<a id='task' ></a>

## Task at hand

For this lab, we are going to walk through the process of creating full text search capability within Python for integration into other analytical processes.

You previously read about the _`book`_ data and saw it used as the corpus for a PostgreSQL full text search.
Now we are going to walk through a similar process to build the search engine in pure Python.
The process takes very little time, and the usefulness of full text search grows with the variety of heterogeneous data that can be integrated with it.

Throughout these steps, try to recognize the similarities to the PostgreSQL process.

--- 
<a id='build_it' ></a>

## Building our Whoosh Schema

Recall, the `book/` folder is composed of a collection of text files, each its own book chapter.

In Whoosh, we structure the retrieval system by defining a storage schema.

```
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
                )
```

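This defines each record as a `(filename, content)` pair: `filename` is an identifier stored with the document so it can be returned with each hit, and `content` is the full text, indexed through a stemming analyzer.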

In [1]:
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
               )

--- 
<a id='load_it' ></a>

## Loading Data

For this lab, we have created a small folder of a few books under the `lab/` folder:

```Bash
[scottgs@metal labs]$ ls book_lite/
acts.txt  numbers.txt  romans.txt
```
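We will create the _Whoosh_ index files in that same folder, then ingest the text files.
Because the index lives alongside the data, the crawler will also encounter Whoosh's own index files and report them as unhandled.

To load the data, a Python script will follow this basic crawling behavior: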

 1. For each file/folder in the specified starting folder:
 1. If it is a folder, recurse into folder and process contents
 1. If it is a file, read contents and load into indexer.

In [2]:
import os, os.path
from whoosh import index

# Note, this clears the existing index in the directory
ix = index.create_in("book_lite", schema)

# Get a writer from the created index
writer = ix.writer()



#### This should look familiar!

Note the subtle changes.
You should now be able to adapt code like the recursive folder parsing from the PostgreSQL lab to a new API, whether for indexing or for other file and text processing.

In [3]:
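def loadFile(writer, fname):
    '''
    Read file contents and add them to the index as one document.
    '''
    # Whoosh requires unicode text, so decode explicitly (UTF-8 is assumed here)
    with open(fname, 'r', encoding='utf-8') as infile:
        content=infile.read()
    writer.add_document(filename=fname, content=content)
    print("Indexed: ", fname)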

def processFolder(writer,folder):
    '''
    Process a folder for files and subfolders
    '''
    print('Processing folder: ',folder)
    for root, dirs, files in os.walk(folder):
        print("root = ", root)
        # Process Files
        for file in files:
            if file.endswith(".txt"):
                filename = os.path.join(root, file)
                print('Processing File:',filename)
                loadFile(writer,filename)
            else:
                print("Unhandled File")
        # os.walk already descends into every subfolder,
        # so no explicit recursive call is needed here.

# Functions defined,  get the party started:
processFolder(writer,"book_lite")
writer.commit() # save changes

Processing folder:  book_lite
root =  book_lite
Processing File: book_lite/acts.txt
Indexed:  book_lite/acts.txt
Processing File: book_lite/numbers.txt
Indexed:  book_lite/numbers.txt
Processing File: book_lite/romans.txt
Indexed:  book_lite/romans.txt
Unhandled File
Unhandled File
Unhandled File
Unhandled File
root =  book_lite/MAIN.tmp
Unhandled File


--- 
<a id='search_me' ></a>

## Executing Queries

Read: 
  http://whoosh.readthedocs.io/en/latest/searching.html

In [4]:
from whoosh.qparser import QueryParser

qp = QueryParser("content", schema=ix.schema)
q = qp.parse(u"abode")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

<Hit {'filename': 'book_lite/acts.txt'}>
<Hit {'filename': 'book_lite/numbers.txt'}>


In [5]:
q = qp.parse(u"judged OR power")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

<Hit {'filename': 'book_lite/romans.txt'}>
<Hit {'filename': 'book_lite/acts.txt'}>
<Hit {'filename': 'book_lite/numbers.txt'}>


In [6]:
q = qp.parse(u"wealth")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

<Hit {'filename': 'book_lite/acts.txt'}>


In [7]:
q = qp.parse(u"mightest")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

<Hit {'filename': 'book_lite/romans.txt'}>
<Hit {'filename': 'book_lite/acts.txt'}>


In [11]:
q = qp.parse(u"weak AND powerless")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)
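
An empty output, as for the query above, simply means that no document contains both terms. The plain-Python sketch below illustrates the idea behind `AND` and `OR` with a toy inverted index and set operations. It is a conceptual illustration with made-up documents, not how Whoosh is implemented, and it does no stemming.

```python
from collections import defaultdict

docs = {
    "a.txt": "the weak shall be made strong",
    "b.txt": "a strong and powerful voice",
    "c.txt": "powerful words",
}

# Inverted index: term -> set of documents containing it
postings = defaultdict(set)
for name, text in docs.items():
    for term in text.split():
        postings[term].add(name)

def search_and(*terms):
    """Documents containing every term (set intersection)."""
    sets = [postings.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(*terms):
    """Documents containing at least one term (set union)."""
    return set().union(*(postings.get(t, set()) for t in terms))

print(sorted(search_and("strong", "powerful")))  # only b.txt has both
print(sorted(search_and("weak", "powerless")))   # "powerless" occurs nowhere: no hits
print(sorted(search_or("weak", "powerful")))     # any document with either term
```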

In [9]:
q = qp.parse(u"strong AND powerful")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

<Hit {'filename': 'book_lite/romans.txt'}>
<Hit {'filename': 'book_lite/numbers.txt'}>
<Hit {'filename': 'book_lite/acts.txt'}>


# SAVE YOUR NOTEBOOK