# Exercise: Building and Loading Text Search in Python Whoosh


## OUTLINE
 1. [Whoosh](#Whoosh_text)
 1. [Task at hand](#task)
 1. [Building our Whoosh Schema](#build_it)
 1. [Loading Data](#load_it)
 1. [Executing Queries, Google-lite...very very lite](#search_me) 



--- 
<a id='Whoosh_text' ></a>

## Whoosh

Whoosh was started as a quick and dirty search server for the online documentation of the Houdini 3D animation software package. 
Side Effects Software generously allowed the code to be open source, in case it might be useful to anyone else who needs a very flexible or pure-Python search engine (or both!).

  * Whoosh is fast, but uses only pure Python, so it will run anywhere Python runs, without requiring a compiler.
  * By default, Whoosh uses the Okapi BM25F ranking function, but like most things the ranking function can be easily customized.
  * Whoosh creates fairly small indexes compared to many other search libraries.
  * All indexed text in Whoosh must be unicode.
  * Whoosh lets you store arbitrary Python objects with indexed documents.

### What is Whoosh?

Whoosh is a fast, pure Python search engine library.

The primary design impetus of Whoosh is that it is pure Python. 
You should be able to use Whoosh anywhere you can use Python, no compiler or Java required.

Like one of its ancestors, Lucene, Whoosh is not really a search engine; it’s a programmer library for creating a search engine.

Practically no important behavior of Whoosh is hard-coded. 
Indexing of text, the level of information stored for each term in each field, parsing of search queries, the types of queries allowed, scoring algorithms, etc. are all customizable, replaceable, and extensible.

--- 
<a id='task' ></a>

## Task at hand

For this lab, we are going to walk through the process of creating full text search capability within Python for integration into other analytical processes.

You previously read about the _`book`_ data and saw it used as the corpus for a PostgreSQL full text search.
Now we will walk through a similar process to build the search engine in pure Python.
The process takes very little time, and the usability of full text search grows with the variety of heterogeneous data that can be integrated with it.

Throughout these steps, try to recognize the similarities to the PostgreSQL process.

--- 
<a id='build_it' ></a>

## Building our Whoosh Schema

Recall, the `book/` folder is composed of a collection of text files, each its own book chapter.

In Whoosh, we structure the retrieval system by defining a storage schema.

From the lab with the text files:
```
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
                )
```

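This tells us that each record has two fields, `filename` and `content`. The `filename` is stored with the index, while `content` is indexed through a stemming analyzer.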

### 1) For this exercise, you should look at a few of these web pages:

  * https://en.wikipedia.org/wiki/Nyctimantis
  * https://en.wikipedia.org/wiki/Osteocephalus
  * https://en.wikipedia.org/wiki/Osteopilus
  
Specifically, inspect the HTML source and the 
```HTML
<table class="infobox biota" ... </table>
```

You need to extend the schema definition to collect the table data when available.

In [1]:
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer()),
                # Extend the schema definition to capture relevant table data
                
               )

--- 
<a id='load_it' ></a>

## Loading Data

For this exercise, we have created a small folder of a few Wikipedia pages under the `en.wikipedia.org/wiki` folder:

```Bash
[scottgs@metal exercises]$ ls en.wikipedia.org/wiki
Acris.html           Ecnomiohyla.html      Myersiohyla.html    Scinax.html
Anotheca.html        Exerodonta.html       Nyctimantis.html    Smilisca.html
Aparasphenodon.html  Hyla.html             Osteocephalus.html  Sphaenorhynchus.html
Aplastodiscus.html   Hylidae.html          Osteopilus.html     Tepuihyla.html
Argenteohyla.html    Hylinae.html          Phyllodytes.html    Tlalocohyla.html
Bokermannohyla.html  Hyloscirtus.html      Phytotriades.html   Trachycephalus.html
Bromeliohyla.html    Hypsiboas.html        Plectrohyla.html    Triprion.html
Charadrahyla.html    Isthmohyla.html       Pseudacris.html     Xenohyla.html
Corythomantis.html   Itapotihyla.html      Pseudis.html
Dendropsophus.html   Lysapsus.html         Ptychohyla.html
Duellmanohyla.html   Megastomatohyla.html  Scarthyla.html

```
You will create the _whoosh_ index files in the folder then ingest the files.

To load the data, a Python script will follow this basic crawling behavior:

 1. For each file/folder in the specified starting folder:
 1. If it is a folder, recurse into folder and process contents
 1. If it is a file, read contents and load into indexer.
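
The crawl described above can be sketched with the standard library alone. `os.walk` performs the descent into subfolders itself, so the first two steps collapse into a single loop. The helper name `find_html_files` and the throwaway folder layout are assumptions for this illustration, and the indexing step is left out.

```python
import os
import tempfile

def find_html_files(start_folder):
    """Return the paths of all .html files under start_folder, sorted."""
    found = []
    # os.walk visits start_folder and every subfolder beneath it.
    for root, _dirs, files in os.walk(start_folder):
        for name in files:
            if name.endswith(".html"):
                found.append(os.path.join(root, name))
    return sorted(found)

# Tiny throwaway tree to exercise the function (paths are made up).
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "sub"))
for rel in ("a.html", "notes.txt", os.path.join("sub", "b.html")):
    with open(os.path.join(demo_root, rel), "w") as fh:
        fh.write("<html></html>")

demo_files = find_html_files(demo_root)
print([os.path.relpath(p, demo_root) for p in demo_files])
```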
 
## Follow the lab for Python IR with whoosh to complete this exercise.

### 2) Create / initialize the Whoosh index and get the `writer` object.

In [2]:
import os, os.path
from whoosh import index

# Step 2 below this comment



### 3) Adapt the helper functions

Note the subtle changes.
Adapt the code below so that it recursively parses the HTML (.html) files and indexes them with the Whoosh API.
Trust no code, verify all code segments.

In [None]:
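from bs4 import BeautifulSoup, Comment


def visible(element):
    # Text inside these tags is never rendered on the page
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # HTML comments are parsed as Comment nodes and are not visible text
    elif isinstance(element, Comment):
        return False
    return True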


def loadFile(writer, fname):
    '''
    Read file contents, load into database.
    '''
    with open(fname, 'r', encoding='utf-8') as infile:
        content = infile.read()
        soup = BeautifulSoup(content, 'html.parser')
        texts = soup.find_all(string=True)
        
        # Process all the visible text
        visible_texts = filter(visible, texts)
        # TODO: Assemble all visible_texts into a content string
        

        # TODO: Process the "<table class="infobox biota" ... </table> data
        
        
        # Write to the index
        print("Indexed: ", fname)

def processFolder(writer,folder):
    '''
    Process a folder for files and subfolders
    '''
    print('Processing folder: ',folder)
    for root, dirs, files in os.walk(folder):
        print("root = ", root)
        # Process Files
        for file in files:
            if file.endswith(".html"):
                filename = os.path.join(root, file)
                print('Processing File:',filename)
                loadFile(writer,filename)
            else:
                print("Unhandled File")
        # os.walk already descends into every subfolder,
        # so no explicit recursive call is needed here.
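
For the infobox TODO, one hedged approach is sketched below using Beautiful Soup. The sample HTML is made up, and the assumed row layout (a label cell followed by a value cell) should be checked against the saved Wikipedia pages before you rely on it. The helper name `extract_infobox` is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the two-column rows of a taxonomy infobox.
sample_html = """
<html><body>
<table class="infobox biota">
  <tr><th colspan="2">Osteopilus</th></tr>
  <tr><td>Kingdom:</td><td>Animalia</td></tr>
  <tr><td>Family:</td><td><a href="/wiki/Hylidae">Hylidae</a></td></tr>
</table>
<p>Body text.</p>
</body></html>
"""

def extract_infobox(html):
    """Return {label: value} for two-cell rows of the first infobox table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    if table is None:          # page has no infobox
        return {}
    data = {}
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 2:
            label = cells[0].get_text(strip=True).rstrip(":")
            data[label] = cells[1].get_text(" ", strip=True)
    return data

infobox = extract_infobox(sample_html)
print(infobox)
```

Returning an empty dictionary for pages without the table lets the caller decide which schema fields to fill in for each document.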



### 4) Parse with our defined functions in place.

In [None]:
# Start processing the folder and commit the work
# ---------------------------------------------------






--- 
<a id='search_me' ></a>

### 5) Executing Queries

Read: 
  http://whoosh.readthedocs.io/en/latest/searching.html
  
Previously, we hard-coded query strings into the code cells.

Now, use the `input()` function to collect a query string from the user. 
Then execute the search.

In [None]:
from whoosh.qparser import QueryParser

# Write your code below this comment:
# --------------------------------------








### 6) Write example queries to ensure you can search the 

```HTML
<table class="infobox biota"
```

In [None]:
# Write your code below this comment:
# --------------------------------------








In [None]:
# Write your code below this comment:
# --------------------------------------








# SAVE YOUR NOTEBOOK WITH ALL EXECUTED CELLS