# Exercise: Building and Loading Text Search in Python Whoosh


## OUTLINE
 1. [Whoosh](#Whoosh_text)
 1. [Task at hand](#task)
 1. [Building our Whoosh Schema](#build_it)
 1. [Loading Data](#load_it)
 1. [Executing Queries, Google-lite...very very lite](#search_me) 



--- 
<a id='Whoosh_text' ></a>

## Whoosh

Whoosh was started as a quick and dirty search server for the online documentation of the Houdini 3D animation software package. 
Side Effects Software generously allowed the code to be open source, in case it might be useful to anyone else who needs a very flexible or pure-Python search engine (or both!).

  * Whoosh is fast, but uses only pure Python, so it will run anywhere Python runs, without requiring a compiler.
  * By default, Whoosh uses the Okapi BM25F ranking function, but like most things the ranking function can be easily customized.
  * Whoosh creates fairly small indexes compared to many other search libraries.
  * All indexed text in Whoosh must be unicode.
  * Whoosh lets you store arbitrary Python objects with indexed documents.

### What is Whoosh?

Whoosh is a fast, pure Python search engine library.

The primary design impetus of Whoosh is that it is pure Python. 
You should be able to use Whoosh anywhere you can use Python, no compiler or Java required.

Like one of its ancestors, Lucene, Whoosh is not really a search engine; it is a programmer library for creating a search engine.

Practically no important behavior of Whoosh is hard-coded. 
Indexing of text, the level of information stored for each term in each field, parsing of search queries, the types of queries allowed, scoring algorithms, etc. are all customizable, replaceable, and extensible.

--- 
<a id='task' ></a>

## Task at hand

For this lab, we are going to walk through the process of creating full text search capability within Python for integration into other analytical processes.

You previously read about the _`book`_ data and you have seen the data used as a corpus in a PostgreSQL full text search.
Now, we are going to walk through a similar process to build the search engine in pure Python.
The process takes very little time, and the usefulness of full text search grows with the variety of heterogeneous data that can be integrated with it.

Throughout these steps, try to recognize the similarities to the PostgreSQL process.

--- 
<a id='build_it' ></a>

## Building our Whoosh Schema

Recall that the `book/` folder contains a collection of text files, each holding one book chapter.

In Whoosh, we structure the retrieval system by defining a storage schema.

From the lab with the text files:
```
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
                )
```

This schema defines each record as a `(filename, content)` pair: `filename` is stored with the index, while `content` is indexed for searching but not stored.

### 1) For this exercise, you should look at a few of these web pages:

  * https://en.wikipedia.org/wiki/Nyctimantis
  * https://en.wikipedia.org/wiki/Osteocephalus
  * https://en.wikipedia.org/wiki/Osteopilus
  
Specifically, inspect the HTML source and locate the following element:
```HTML
<table class="infobox biota" ... </table>
```

You need to extend the schema definition to collect the table data when available.

In [1]:
from whoosh.fields import Schema,TEXT,KEYWORD,ID,STORED
from whoosh.analysis import StemmingAnalyzer
schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer(),stored=True),
                kingdom=TEXT(analyzer=StemmingAnalyzer(),stored=True),
                phylum=TEXT(analyzer=StemmingAnalyzer(),stored=True),
                class_=TEXT(analyzer=StemmingAnalyzer(),stored=True),
                order=TEXT(analyzer=StemmingAnalyzer(),stored=True),
                family=TEXT(analyzer=StemmingAnalyzer(),stored=True),
                genus=TEXT(analyzer=StemmingAnalyzer(),stored=True)
               )

--- 
<a id='load_it' ></a>

## Loading Data

For this exercise, we have created a small folder of a few Wikipedia pages under the `en.wikipedia.org/wiki` folder:

```Bash
[scottgs@metal exercises]$ ls en.wikipedia.org/wiki
Acris.html           Ecnomiohyla.html      Myersiohyla.html    Scinax.html
Anotheca.html        Exerodonta.html       Nyctimantis.html    Smilisca.html
Aparasphenodon.html  Hyla.html             Osteocephalus.html  Sphaenorhynchus.html
Aplastodiscus.html   Hylidae.html          Osteopilus.html     Tepuihyla.html
Argenteohyla.html    Hylinae.html          Phyllodytes.html    Tlalocohyla.html
Bokermannohyla.html  Hyloscirtus.html      Phytotriades.html   Trachycephalus.html
Bromeliohyla.html    Hypsiboas.html        Plectrohyla.html    Triprion.html
Charadrahyla.html    Isthmohyla.html       Pseudacris.html     Xenohyla.html
Corythomantis.html   Itapotihyla.html      Pseudis.html
Dendropsophus.html   Lysapsus.html         Ptychohyla.html
Duellmanohyla.html   Megastomatohyla.html  Scarthyla.html

```
You will create the _whoosh_ index files in this folder, then ingest the HTML files.

To load the data, a Python script will follow this basic crawling behavior:

 1. For each file/folder in the specified starting folder:
 1. If it is a folder, recurse into folder and process contents
 1. If it is a file, read contents and load into indexer.
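The crawl described in the steps above can be sketched with `os.walk`, which visits the starting folder and every subfolder by itself. The helper name `find_html_files` and the temporary demo folder are assumptions made for illustration; they are not part of the lab code.

```python
import os
import tempfile


def find_html_files(folder):
    """Yield the path of every .html file under folder, subfolders included."""
    # os.walk yields one (root, dirs, files) triple per directory it visits,
    # so no explicit recursion is needed.
    for root, dirs, files in os.walk(folder):
        for name in sorted(files):
            if name.endswith(".html"):
                yield os.path.join(root, name)


# Build a small throwaway folder tree to try the helper on.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "sub"))
for rel in ("Acris.html", "notes.txt", os.path.join("sub", "Hyla.html")):
    with open(os.path.join(demo_root, rel), "w", encoding="utf-8") as fh:
        fh.write("<html></html>")

found = [os.path.relpath(path, demo_root) for path in find_html_files(demo_root)]
print(found)
```

Each yielded path could then be handed to a loader function that reads the file and adds it to the index.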
 
## Follow the lab for Python IR with whoosh to complete this exercise.

### 2) Create / Initialize the whoosh index and get the `writer` object.

In [2]:
import os, os.path
from whoosh import index

# Create the index in the data folder, using the schema defined above
ix = index.create_in("en.wikipedia.org/wiki",schema)
# Get a writer from the created index
writer = ix.writer()


### 3) Adapt the helper functions

Note the subtle changes.
Adapt the code below so that it recursively parses the HTML (`.html`) files and indexes them with the Whoosh API.
Trust no code, verify all code segments.

In [7]:
import bs4
import re
from sklearn.feature_extraction.text import CountVectorizer
import pprint

In [8]:
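def visible(element):
    '''Return True for text nodes that should be indexed.'''
    # Skip text that lives inside non-rendered tags
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # NOTE: str() of a bs4 Comment node does not include the <!-- --> delimiters,
    # so this pattern does not filter comments (hence "CentralNotice" in the
    # output further down); isinstance(element, bs4.Comment) would catch them.
    elif re.match('<!--.*-->', str(element)):
        return False
    return True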


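def loadFile(writer, fname):
    '''
    Read one HTML file and add it to the index as a document.
    '''
    with open(fname, 'r', encoding='utf-8') as infile:
        content = infile.read()
    soup = bs4.BeautifulSoup(content, 'html.parser')
    texts = soup.find_all(string=True)

    # Keep only the text nodes that pass the visible() filter
    visible_texts = filter(visible, texts)
    # Assemble all visible text into one content string
    # (pieces are joined without a separator, so adjacent words can run together)
    visible_content = ""
    for line in visible_texts:
        visible_content += line.strip("\n")
        print(line)  # debug output: echoes every text node

    # TODO: Process the <table class="infobox biota" ... </table> data
    table = soup.find('table', class_='infobox biota')

    # Write to the index.
    # NOTE: Tag.get() looks up HTML attributes of the <table> tag, not the
    # infobox rows. Only 'class' exists there, so class_ receives
    # ['infobox', 'biota'] and the other lookups return None. The taxonomy
    # rows still have to be parsed out of the table cells.
    writer.add_document(filename=fname, content=visible_content,
                        kingdom=table.get('kingdom'),
                        phylum=table.get('phylum'),
                        class_=table.get('class'),
                        family=table.get('family'),
                        genus=table.get('genus'),
                        )
    print("Indexed: ", fname)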

def processFolder(writer,folder):
    '''
    Process a folder for files and subfolders
    '''
    print('Processing folder: ',folder)
    for root, dirs, files in os.walk(folder):
        print("root = ", root)
        # Process files. os.walk already descends into every subfolder,
        # so no explicit recursion is needed here.
        for file in files:
            if file.endswith(".html"):
                filename = os.path.join(root, file)
                print('Processing File:',filename)
                loadFile(writer,filename)
            else:
                print("Unhandled File")

#processFolder(writer,"en.wikipedia.org/wiki")
#writer.commit() # save changes
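The infobox TODO in `loadFile` needs the row labels and values, which live in the table cells rather than in attributes of the `<table>` tag. The sketch below shows one way to collect them, using only the standard-library `html.parser` so it runs without extra packages. The class name `InfoboxParser`, the `RANKS` tuple, and the HTML fragment are illustrative assumptions; the fragment is a simplified stand-in, and the real Wikipedia markup may differ in detail.

```python
from html.parser import HTMLParser

# Row labels of interest, matching the taxonomy fields in the schema.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus")


class InfoboxParser(HTMLParser):
    """Collect two-cell 'Label: value' rows from <table class="infobox biota">."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # greater than 0 while inside the infobox table
        self.in_cell = False  # True while inside a <td> or <th>
        self.cells = []       # text of the cells in the current row
        self.fields = {}      # e.g. {"kingdom": "Animalia"}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table":
            # Enter on the infobox itself; also track tables nested inside it.
            if self.depth or {"infobox", "biota"} <= set(classes):
                self.depth += 1
        elif self.depth and tag == "tr":
            self.cells = []
            self.in_cell = False
        elif self.depth and tag in ("td", "th"):
            self.in_cell = True
            self.cells.append("")

    def handle_data(self, data):
        if self.depth and self.in_cell:
            self.cells[-1] += data

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and len(self.cells) == 2:
            label = self.cells[0].strip().rstrip(":").lower()
            if label in RANKS:
                # Keep the first value seen and normalize its whitespace.
                self.fields.setdefault(label, " ".join(self.cells[1].split()))
        elif tag == "table":
            self.depth -= 1


# Simplified, made-up fragment in the shape of a taxonomy infobox.
sample = """
<table class="infobox biota">
<tr><th colspan="2">Cricket frogs</th></tr>
<tr><td>Kingdom:</td><td><a href="/wiki/Animal">Animalia</a></td></tr>
<tr><td>Family:</td><td><a href="/wiki/Hylidae">Hylidae</a></td></tr>
<tr><td>Genus:</td><td><i>Acris</i></td></tr>
</table>
"""
parser = InfoboxParser()
parser.feed(sample)
infobox = parser.fields
print(infobox)
```

The resulting dictionary could feed `writer.add_document`, for example `kingdom=infobox.get("kingdom")`, with the `"class"` key mapped to the schema's `class_` field. On the real pages the genus cell also carries the naming authority (as the printed output for `Acris.html` suggests), so some extra trimming may be wanted there.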

### 4) Parse with our defined functions in place.

In [9]:
# Start processing the folder and commit the work
# -----------------------------------------------
import pandas as pd
from bs4 import BeautifulSoup
from html.parser import HTMLParser

processFolder(writer,"en.wikipedia.org/wiki")
writer.commit() # save changes


Processing folder:  en.wikipedia.org/wiki
root =  en.wikipedia.org/wiki
Processing File: en.wikipedia.org/wiki/Acris.html
CentralNotice
Cricket frog
From Wikipedia, the free encyclopedia
(Redirected from
Acris
)
Jump to:
navigation
,
search
For other uses, see
Cricket frog (disambiguation)
.
"Acris" redirects here. For the Romanian village, see
Acriş
.
Cricket frogs
Acris gryllus
Scientific classification
Kingdom:
Animalia
Phylum:
Chordata
Class:
Amphibia
Order:
Anura
Family:
Hylidae
Genus:
Acris
Duméril
&
Bibron
, 1841
Species
Acris blanchardi
Acris crepitans
Acris gryllus
Cricket frogs
, genus
Acris
, are small,
North American
frogs
of the family
Hylidae
. They are more aquatic than other members of the family, and are generally associated with permanent bodies of water with surface vegetation. The
common
and

--- 
<a id='search_me' ></a>

### 5) Executing Queries

Read: 
  http://whoosh.readthedocs.io/en/latest/searching.html
  
Previously, we hard-coded query strings into the code cells.

Now, use the `input()` function to collect a query string from the user.
Then execute the search.

In [12]:
epass = input()

"America"


In [14]:
from whoosh.qparser import QueryParser

# Write your code below this comment:
# --------------------------------------
qp = QueryParser("content", schema=ix.schema)
q = qp.parse(epass)
with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)





<Hit {'filename': 'en.wikipedia.org/wiki/Smilisca.html', 'class_': ['infobox', 'biota'], 'content': '  CentralNotice Mexican burrowing tree frogFrom Wikipedia, the free encyclopedia\xa0\xa0(Redirected from Smilisca)\t\t\t\t\tJump to:\t\t\t\t\tnavigation, \t\t\t\t\tsearchSmiliscaSmilisca phaeotaScientific classificationKingdom:AnimaliaPhylum:ChordataClass:AmphibiaOrder:AnuraFamily:HylidaeSubfamily:HylinaeGenus:SmiliscaCope, 1865SpeciesSee text.The Mexican burrowing tree frog (Smilisca, also known as the cross-banded tree frog) is a genus of frogs in the Hylidae family found in Mexico, southern Texas and Arizona, Central America, and northwestern South America. In a recent revision of the Hylidae family, the two species of the previous genus Pternohyla were included in this genus.[1] Its name is from the Ancient Greek smiliskos (‘little knife’), referring to the pointed frontoparietal processes.[2]Species[edit]Binomial name and authorCommon nameS. baudinii (Duméril and Bibron, 1841)commo

### 6) Write example queries to ensure you can search the infobox data

The infobox data comes from this element:

```HTML
<table class="infobox biota"
```

In [15]:
name=input()

"Nyctimantis rugiceps"


In [16]:
# Write your code below this comment:
# --------------------------------------


qp = QueryParser("content", schema=ix.schema)
q = qp.parse(name)
with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)







<Hit {'filename': 'en.wikipedia.org/wiki/Nyctimantis.html', 'class_': ['infobox', 'biota'], 'content': '  CentralNotice Nyctimantis rugicepsFrom Wikipedia, the free encyclopedia\xa0\xa0(Redirected from Nyctimantis)\t\t\t\t\tJump to:\t\t\t\t\tnavigation, \t\t\t\t\tsearchNyctimantis rugicepsConservation statusLeast Concern\xa0(IUCN 3.1)[1]Scientific classification Kingdom:AnimaliaPhylum:ChordataClass:AmphibiaOrder:AnuraFamily:HylidaeSubfamily:HylinaeGenus:NyctimantisBoulenger, 1882Species:N.\xa0rugicepsBinomial nameNyctimantis rugicepsBoulenger, 1882Nyctimantis rugiceps (common name: brown-eyed treefrog) is a species of frog in the Hylidae family.[2] It is monotypic within the genus Nyctimantis.[3] It is known from the Amazonian Ecuador, Peru, and Colombia, and it is likely to occur also in adjacent Brazil.[2] Its natural habitats are primary and secondary lowland tropical rainforest.[1]Nyctimantis rugiceps breeds in bamboo and tree holes. Females raise the tadpoles on unfertilized eggs.

In [17]:
genus=input()

Osteocephalus


In [18]:
# Write your code below this comment:
# --------------------------------------
qp = QueryParser("content", schema=ix.schema)
q = qp.parse(genus)
with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit)









<Hit {'filename': 'en.wikipedia.org/wiki/Osteocephalus.html', 'class_': ['infobox', 'biota'], 'content': '  CentralNotice Slender-legged tree frogsFrom Wikipedia, the free encyclopedia\xa0\xa0(Redirected from Osteocephalus)\t\t\t\t\tJump to:\t\t\t\t\tnavigation, \t\t\t\t\tsearchSlender-legged tree frogsOsteocephalus taurinusScientific classificationKingdom:AnimaliaPhylum:ChordataClass:AmphibiaOrder:AnuraFamily:HylidaeGenus:OsteocephalusSteindachner, 1862SpeciesSee text.Osteocephalus is a genus of frogs (slender-legged tree frogs)in the Hylidae family found in the Guianas, the Amazon Basin, Venezuela, Colombia, southeastern Brazil, and northeastern Argentina. Males are warty, while females are smooth.Species[edit]Binomial name and authorCommon nameO. buckleyi (Boulenger, 1882)Buckley\'s slender-legged tree frogO. cabrerai (Cochran and Goin, 1970)O. carri (Cochran and Goin, 1970)O. deridens Jungfer, Ron, Seipp, and Almendáriz, 2000O. elkejungingerae (Henle, 1981)Henle\'s slender-legged t

# SAVE YOUR NOTEBOOK WITH ALL EXECUTED CELLS