# Exercise: Building and Loading Text Search in Python Whoosh


## OUTLINE
 1. [Whoosh](#Whoosh_text)
 1. [Task at hand](#task)
 1. [Building our Whoosh Schema](#build_it)
 1. [Loading Data](#load_it)
 1. [Executing Queries, Google-lite...very very lite](#search_me) 



--- 
<a id='Whoosh_text' ></a>

## Whoosh

Whoosh was started as a quick and dirty search server for the online documentation of the Houdini 3D animation software package. 
Side Effects Software generously allowed the code to be open source, in case it might be useful to anyone else who needs a very flexible or pure-Python search engine (or both!).

  * Whoosh is fast, but uses only pure Python, so it will run anywhere Python runs, without requiring a compiler.
  * By default, Whoosh uses the Okapi BM25F ranking function, but like most things the ranking function can be easily customized.
  * Whoosh creates fairly small indexes compared to many other search libraries.
  * All indexed text in Whoosh must be unicode.
  * Whoosh lets you store arbitrary Python objects with indexed documents.
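
To give a feel for what a BM25-style ranking function computes, here is a minimal single-field sketch. It is not Whoosh's implementation (BM25F additionally combines several fields), and the function name, the `k1` and `b` values, and the document counts are assumptions chosen for illustration.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document with a simplified BM25."""
    # Inverse document frequency: terms found in fewer documents weigh more.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates, and long documents are penalized via b.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

# Hypothetical numbers: same term count, but one document is four times longer.
short_doc = bm25_term_score(tf=3, doc_len=100, avg_doc_len=200, n_docs=40, doc_freq=5)
long_doc = bm25_term_score(tf=3, doc_len=400, avg_doc_len=200, n_docs=40, doc_freq=5)
print(short_doc > long_doc)
```

With equal term counts, the shorter document scores higher, which is the length normalization at work.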

### What is Whoosh?

Whoosh is a fast, pure Python search engine library.

The primary design impetus of Whoosh is that it is pure Python. 
You should be able to use Whoosh anywhere you can use Python, no compiler or Java required.

Like one of its ancestors, Lucene, Whoosh is not really a search engine; it is a programmer library for creating a search engine.

Practically no important behavior of Whoosh is hard-coded. 
Indexing of text, the level of information stored for each term in each field, parsing of search queries, the types of queries allowed, scoring algorithms, etc. are all customizable, replaceable, and extensible.

--- 
<a id='task' ></a>

## Task at hand

For this exercise, we are going to walk through the process of creating full text search capability within Python for integration into other analytical processes.

You previously read about the _`book`_ data, and you have seen it used as a corpus for PostgreSQL full text search as well as for Whoosh in Python.

Now, we are going to go through a similar process to build a search engine in pure Python for a different corpus.

The process takes very little time, and the usefulness of full text search grows with the variety of heterogeneous data that can be integrated with it.

--- 
<a id='build_it' ></a>

## Building our Whoosh Schema

Recall, the `book/` folder is composed of a collection of text files, each its own book chapter.

In Whoosh, we structure the retrieval system by defining a storage schema.

From the lab with the text files:
```
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
                )
```

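This tells us that each record in the index has two fields, `(filename, content)`: `filename` is stored so it can be returned with search hits, while `content` is analyzed and indexed for searching but not stored.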
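
To make the role of the analyzer on the `content` field more concrete, here is a rough standard-library sketch of the kind of pipeline a stemming analyzer applies: tokenize, lowercase, drop stop words, reduce words to stems. The stop list and the `crude_stem` rule are hypothetical stand-ins; a real stemmer is considerably more careful.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def crude_stem(token):
    """Strip a few common English suffixes; far cruder than a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, lowercase, drop stop words, then stem each token."""
    tokens = re.findall(r"\w+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("The frogs are climbing branches"))
# → ['frog', 'climb', 'branch']
```

Because the same analysis is applied to documents and to queries, a search for `frog` can match a page that only contains `frogs`.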

For this exercise, we will be using a few Wikipedia pages for our data source.

### 1) For this exercise, you should look at a few of these web pages:

  * https://en.wikipedia.org/wiki/Nyctimantis
  * https://en.wikipedia.org/wiki/Osteocephalus
  * https://en.wikipedia.org/wiki/Osteopilus
  
Specifically, inspect the HTML source and the 
```HTML
<table class="infobox biota" ... </table>
```

You need to extend the schema definition to collect the table data when available.

In [13]:
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer()),
                # Extend the schema definition to capture relevant table data
                kingdom=TEXT(stored=True),
                phylum=TEXT(stored=True),
                class_=TEXT(stored=True),
                order=TEXT(stored=True),
                family=TEXT(stored=True),
                subfamily=TEXT(stored=True),
                genus=TEXT(stored=True)
               )

--- 
<a id='load_it' ></a>

## Loading Data

For this exercise, we have created a small folder of a few Wikipedia pages under the `en.wikipedia.org/wiki` folder:

```Bash
[scottgs@metal exercises]$ ls en.wikipedia.org/wiki
Acris.html           Ecnomiohyla.html      Myersiohyla.html    Scinax.html
Anotheca.html        Exerodonta.html       Nyctimantis.html    Smilisca.html
Aparasphenodon.html  Hyla.html             Osteocephalus.html  Sphaenorhynchus.html
Aplastodiscus.html   Hylidae.html          Osteopilus.html     Tepuihyla.html
Argenteohyla.html    Hylinae.html          Phyllodytes.html    Tlalocohyla.html
Bokermannohyla.html  Hyloscirtus.html      Phytotriades.html   Trachycephalus.html
Bromeliohyla.html    Hypsiboas.html        Plectrohyla.html    Triprion.html
Charadrahyla.html    Isthmohyla.html       Pseudacris.html     Xenohyla.html
Corythomantis.html   Itapotihyla.html      Pseudis.html
Dendropsophus.html   Lysapsus.html         Ptychohyla.html
Duellmanohyla.html   Megastomatohyla.html  Scarthyla.html

```
You will create the _whoosh_ index files in that folder, then ingest the files.

To load the data, a Python script will follow this basic crawling behavior:

 1. For each file/folder in the specified starting folder:
 1. If it is a folder, recurse into folder and process contents
 1. If it is a file, read contents and load into indexer.
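
The crawl described above can be sketched with `os.walk`, which already descends into subfolders on its own. This is a minimal illustration that only collects paths; `find_html_files` and the temporary folder layout are hypothetical names made up for the demo.

```python
import os
import tempfile

def find_html_files(start_folder):
    """Return every .html file at or below start_folder, sorted."""
    found = []
    # os.walk yields one (root, dirs, files) triple per folder, recursively.
    for root, _dirs, files in os.walk(start_folder):
        for name in files:
            if name.endswith(".html"):
                found.append(os.path.join(root, name))
    return sorted(found)

# Build a throwaway folder tree to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.html", "notes.txt", os.path.join("sub", "b.html")):
        with open(os.path.join(tmp, rel), "w", encoding="utf-8") as f:
            f.write("<html></html>")
    relative = [os.path.relpath(p, tmp) for p in find_html_files(tmp)]
    print(relative)
```

Because `os.walk` handles the recursion, the function body never needs to call itself.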
 
## Follow the lab for Python IR with whoosh to complete this exercise.

### 2) Create / initialize the Whoosh index and get the `writer` object.

In [14]:
import os, os.path
from whoosh import index

# Step 2 below this comment
ix = index.create_in("en.wikipedia.org/wiki", schema)
writer = ix.writer()


In [15]:
import bs4
import re
import pprint

pp = pprint.PrettyPrinter(indent=2, depth=1, width=80, compact=True)

def pullBiota(content):
    '''
    Content is the HTML content
    '''
    # Start up a dictionary
    data = {}

    soup = bs4.BeautifulSoup(content, 'html.parser')

    # Find every table whose class list includes "infobox"
    biota_tables = soup.findAll('table', attrs={'class': re.compile(r'\binfobox\b')})

    # Walk the rows of each matching table directly (no need to re-parse it)
    for table in biota_tables:
        for row in table.findAll('tr'):
            # Each row has td or th elements; we only need the td cells
            cells = row.findAll('td')
            # Only the two-column (label, value) rows matter
            if len(cells) == 2:
                label = cells[0].get_text(strip=True).rstrip(':')
                # .string is None when the cell holds more than one child node
                data[label] = cells[1].string
    return data


In [16]:
with open("../exercises/en.wikipedia.org/wiki/Tepuihyla.html", 'r', encoding="utf-8") as infile:
    content=infile.read()
    tableOut = pullBiota(content)
    print(tableOut)

{'Genus': None, 'Class': 'Amphibia', 'Phylum': 'Chordata', 'Order': 'Anura', 'Family': 'Hylidae', 'Kingdom': 'Animalia', 'Subfamily': 'Hylinae'}


### 3) Adapt the helper functions

Note the subtle changes.
Please adapt the code below so that it recursively parses the HTML (.html) files and indexes them with the Whoosh API.
Trust no code, verify all code segments.

In [17]:
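def visible(element):
    # Skip text that lives inside non-rendered elements
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # str() of a bs4 Comment omits the <!-- --> markers, so test the type instead
    elif isinstance(element, bs4.Comment):
        return False
    return True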

def loadFile(writer, fname):
    '''
    Read file contents, load into database.
    '''
    with open(fname, 'r',encoding="utf-8") as infile:
        content=infile.read()
        tableOut=pullBiota(content)
        soup = bs4.BeautifulSoup(content, 'html.parser')
        texts = soup.findAll(text=True)
        
        # Process all the visible text
        visible_texts = filter(visible, texts)
        # TODO: Assemble all visible_texts into a content string
        visible_content = ""
        for i in visible_texts:
            visible_content += " " + i.strip('\n')
        
        # TODO: Process the "<table class="infobox biota" ... </table> data
        
        writer.add_document(filename=fname,
                           content=visible_content,
                           kingdom=tableOut.get('Kingdom'),
                           phylum=tableOut.get('Phylum'),
                           class_=tableOut.get('Class'),
                           order=tableOut.get('Order'),
                           family=tableOut.get('Family'),
                           subfamily=tableOut.get('Subfamily'),
                           genus=tableOut.get('Genus')
                          ) 
        
        # Write to the index
        print("Indexed: ", fname)

def processFolder(writer,folder):
    '''
    Process a folder for files and subfolders
    '''
    print('Processing folder: ',folder)
    for root, dirs, files in os.walk(folder):
        print("root = ", root)
        # Process Files
        for file in files:
            if file.endswith(".html"):
                filename = os.path.join(root, file)
                print('Processing File:',filename)
                loadFile(writer,filename)
            else:
                print("Unhandled File")
        # os.walk already descends into every subfolder, so no explicit
        # recursion is needed (it would index the same files twice)

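In the spirit of "trust no code", one way to cross-check the visible-text filtering is to run an independent extractor over a small snippet and compare the results. The sketch below uses only the standard library's `html.parser` as a stand-in; it is not the BeautifulSoup approach used in the helper functions, and `VisibleTextParser` and `visible_text` are hypothetical names.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text that is not inside script, style, head, or title tags."""
    HIDDEN = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # greater than 0 while inside a hidden element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; HTMLParser routes them to handle_comment.
        if self.hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample = ("<html><head><title>T</title><style>p{}</style></head>"
          "<body><!-- hidden note --><p>Tree frogs</p>"
          "<script>var x;</script></body></html>")
print(visible_text(sample))
# → Tree frogs
```

If the two extractors disagree on a page, that page is a good place to start inspecting the filter.
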
### 4) Parse with our defined functions in place.

In [18]:
# Start processing the folder and commit the work
# ---------------------------------------------------
processFolder(writer,"en.wikipedia.org/wiki")

Processing folder:  en.wikipedia.org/wiki
root =  en.wikipedia.org/wiki
Processing File: en.wikipedia.org/wiki/Acris.html
Indexed:  en.wikipedia.org/wiki/Acris.html
Processing File: en.wikipedia.org/wiki/Anotheca.html
Indexed:  en.wikipedia.org/wiki/Anotheca.html
Processing File: en.wikipedia.org/wiki/Aparasphenodon.html
Indexed:  en.wikipedia.org/wiki/Aparasphenodon.html
Processing File: en.wikipedia.org/wiki/Aplastodiscus.html
Indexed:  en.wikipedia.org/wiki/Aplastodiscus.html
Processing File: en.wikipedia.org/wiki/Argenteohyla.html
Indexed:  en.wikipedia.org/wiki/Argenteohyla.html
Processing File: en.wikipedia.org/wiki/Bokermannohyla.html
Indexed:  en.wikipedia.org/wiki/Bokermannohyla.html
Processing File: en.wikipedia.org/wiki/Bromeliohyla.html
Indexed:  en.wikipedia.org/wiki/Bromeliohyla.html
Processing File: en.wikipedia.org/wiki/Charadrahyla.html
Indexed:  en.wikipedia.org/wiki/Charadrahyla.html
Processing File: en.wikipedia.org/wiki/Corythomantis.html
Indexed:  en.wikipedia.org

In [19]:
writer.commit()

--- 
<a id='search_me' ></a>

### 5) Executing Queries

Read: 
  http://whoosh.readthedocs.io/en/latest/searching.html
  
Previously, we hard-coded query strings into the code cells.

Now, use the `input()` function to collect a query string from the user. 
Then execute the search.
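
When reading a query interactively, it can help to strip whitespace and reject blank input before handing the string to the parser. The helper below is a small sketch of that idea; `ask_for_query`, its `read` parameter, and the retry limit are assumptions made for illustration, with `read` defaulting to the built-in `input()` so the function can be exercised without a keyboard.

```python
def ask_for_query(prompt="What would you like to search for? ", read=input, max_tries=3):
    """Return a non-empty, stripped query string, or None after max_tries blanks."""
    for _ in range(max_tries):
        text = read(prompt).strip()
        if text:
            return text
    return None

# Simulate a user who presses Enter once before typing a query.
answers = iter(["   ", " Tepuihyla "])
query = ask_for_query(read=lambda _prompt: next(answers))
print(repr(query))
# → 'Tepuihyla'
```

In the notebook, calling `ask_for_query()` with no arguments falls back to `input()`.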

In [20]:
from whoosh.qparser import QueryParser

# Write your code below this comment:
# --------------------------------------

qp = QueryParser("content", schema=ix.schema)
q = qp.parse(u"Tepuihyla")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)


<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'subfamily': 'Hylinae', 'filename': 'en.wikipedia.org/wiki/Tepuihyla.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Hylinae.html'}>
<Hit {'order': 'Anura', 'filename': 'en.wikipedia.org/wiki/Hylidae.html', 'phylum': 'Chordata', 'kingdom': 'Animalia', 'class_': 'Amphibia'}>


In [21]:
user_search_string = input("What would you like to search for..?")
qp = QueryParser("content", schema=ix.schema)
q = qp.parse(user_search_string)

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

What would you like to search for..?Chordata
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'subfamily': 'Hylinae', 'filename': 'en.wikipedia.org/wiki/Scarthyla.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Aplastodiscus.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'subfamily': 'Hylinae', 'filename': 'en.wikipedia.org/wiki/Corythomantis.html'}>
<Hit {'order': 'Anura', 'filename': 'en.wikipedia.org/wiki/Hylidae.html', 'phylum': 'Chordata', 'kingdom': 'Animalia', 'class_': 'Amphibia'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'subfamily': 'Hylinae', 'filename': 'en.wikipedia.org/wiki/Lysapsus.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', '

### 6) Write example queries to ensure you can search the index fields drawn from the infobox table

```HTML
<table class="infobox biota"
```

In [22]:
# Write your code below this comment:
# --------------------------------------
qp = QueryParser("order", schema=ix.schema)
q = qp.parse(u"Anura")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)


<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Acris.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Anotheca.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Aparasphenodon.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Aplastodiscus.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Argenteohyla.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'subfamily': 'Hylinae', 'filename': 'en.wikipedia.org/wi

In [23]:
user_search_string = input("What would you like to search for..?")
qp = QueryParser("content", schema=ix.schema)
q = qp.parse(user_search_string)

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

What would you like to search for..?Animalia
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'subfamily': 'Hylinae', 'filename': 'en.wikipedia.org/wiki/Scarthyla.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Aplastodiscus.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'subfamily': 'Hylinae', 'filename': 'en.wikipedia.org/wiki/Corythomantis.html'}>
<Hit {'order': 'Anura', 'filename': 'en.wikipedia.org/wiki/Hylidae.html', 'phylum': 'Chordata', 'kingdom': 'Animalia', 'class_': 'Amphibia'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'subfamily': 'Hylinae', 'filename': 'en.wikipedia.org/wiki/Lysapsus.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', '

In [24]:
# Write your code below this comment:
# --------------------------------------
qp = QueryParser("phylum", schema=ix.schema)
q = qp.parse(u"Chordata")

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Acris.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Anotheca.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Aparasphenodon.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Aplastodiscus.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Argenteohyla.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'subfamily': 'Hylinae', 'filename': 'en.wikipedia.org/wi

In [25]:
user_search_string = input("What would you like to search for..?")
qp = QueryParser("content", schema=ix.schema)
q = qp.parse(user_search_string)

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)

What would you like to search for..?phylum
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'subfamily': 'Hylinae', 'filename': 'en.wikipedia.org/wiki/Scarthyla.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'filename': 'en.wikipedia.org/wiki/Aplastodiscus.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'subfamily': 'Hylinae', 'filename': 'en.wikipedia.org/wiki/Corythomantis.html'}>
<Hit {'order': 'Anura', 'filename': 'en.wikipedia.org/wiki/Hylidae.html', 'phylum': 'Chordata', 'kingdom': 'Animalia', 'class_': 'Amphibia'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'order': 'Anura', 'family': 'Hylidae', 'subfamily': 'Hylinae', 'filename': 'en.wikipedia.org/wiki/Lysapsus.html'}>
<Hit {'kingdom': 'Animalia', 'phylum': 'Chordata', 'class_': 'Amphibia', 'or

# SAVE YOUR NOTEBOOK WITH ALL EXECUTED CELLS