# Exercise: Building and Loading Text Search in Python Whoosh

--- 
<a id='task' ></a>

## Task at hand


For this exercise, we are going to walk through the process of creating full text search capability within Python for integration into other analytical processes.

You previously worked with the _`book`_ data. In this exercise, we will work with some wiki data. 

--- 
<a id='build_it' ></a>

## Buiding our Whoosh Schema

Recall, the `book/` folder is composed of a collection of text files, each its own book chapter.

In whoosh, we structure the retrieval system by defining a storage schema.

From the lab with the text files:
```
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
                )
```

This tells us we are defining records to have a `(filename, content)`

For this exercise, we will be using a few Wikipedia pages for our data source.

### 1) For this exercise, you should look at a few of these web pages:

  * https://en.wikipedia.org/wiki/Nyctimantis
  * https://en.wikipedia.org/wiki/Osteocephalus
  * https://en.wikipedia.org/wiki/Osteopilus
  
Specifically, inspect the HTML source and the 
```HTML
<table class="infobox biota" ... </table>
```



<img src="../images/table_inspect.png" height=400 width=600 />



**Task: You need to extend the above schema definition to collect this frog table data when available.**

* Content will be the all visible text on the html page
* Table information such as kingdom, phylum, class, order, family, subfamily, genus should be searchable 

In [None]:
# change this to a code cell and run if you have trouble with a locked writer
from whoosh import writing
writer.commit(mergetype=writing.CLEAR)

In [1]:
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(filename=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer()),
                # Extend the schema definition to capture relevant table data
                kingdom=KEYWORD(stored=True, analyzer=StemmingAnalyzer(), scorable=True),
                phylum=KEYWORD(stored=True, analyzer=StemmingAnalyzer(), scorable=True),
                an_class=KEYWORD(stored=True, analyzer=StemmingAnalyzer(), scorable=True),
                an_order=KEYWORD(stored=True, analyzer=StemmingAnalyzer(), scorable=True),
                family=KEYWORD(stored=True, analyzer=StemmingAnalyzer(), scorable=True),
                subfamily=KEYWORD(stored=True, analyzer=StemmingAnalyzer(), scorable=True),
                genus=KEYWORD(stored=True, analyzer=StemmingAnalyzer(), scorable=True)
               )

--- 
<a id='load_it' ></a>

## Loading Data

For this exercise, we have created a small folder of a few Wikipedia pages under the `en.wikipedia.org/wiki` folder in the common datasets folder:


In [2]:
! ls /dsa/data/all_datasets/en.wikipedia.org/wiki

Acris.html	     Hylidae.html	   Plectrohyla.html
Anotheca.html	     Hylinae.html	   Pseudacris.html
Aparasphenodon.html  Hyloscirtus.html	   Pseudis.html
Aplastodiscus.html   Hypsiboas.html	   Ptychohyla.html
Argenteohyla.html    Isthmohyla.html	   Scarthyla.html
Bokermannohyla.html  Itapotihyla.html	   Scinax.html
Bromeliohyla.html    Lysapsus.html	   Smilisca.html
Charadrahyla.html    Megastomatohyla.html  Sphaenorhynchus.html
Corythomantis.html   Myersiohyla.html	   Tepuihyla.html
Dendropsophus.html   Nyctimantis.html	   Tlalocohyla.html
Duellmanohyla.html   Osteocephalus.html    Trachycephalus.html
Ecnomiohyla.html     Osteopilus.html	   Triprion.html
Exerodonta.html      Phyllodytes.html	   Xenohyla.html
Hyla.html	     Phytotriades.html




You will create the _whoosh_ index files in the `modules/module6/exercises/wiki_index` folder then ingest the files.

To load the data, write a python script that follow the basic crawling behavior

 1. For each file/folder in the specified starting folder:
 1. If it is a folder, recurse into folder and process contents
 1. If it is a file, read contents and load into indexer.
 
## Follow the lab for Python IR with whoosh to complete this exercise.

### 2) Create / Initialize the whoosh index and get the `writer` object.

In [3]:
import os, os.path
from whoosh import index
from bs4 import BeautifulSoup
import re

headers = {'User-Agent': 'MyAPP/1.0'}

attrs = {'class': 'search-result'}

# Step 2 below this comment"

os.makedirs("wiki_index", exist_ok=True)
ix = index.create_in("wiki_index", schema)
writer = ix.writer()

### 3) Adapt the helper functions

Note the subtle changes.
Please adapt the code below such as provided recursive parsing of the HTML (.html) files, indexing with the Whoosh API.
Trust no code, verify all code segments.


In [233]:
def visible(element):  # return those html elements that are visible as text 
    if element.parent.name in ['style', 'script', 'document', 'head', 'title']: #html tags
        return False
    elif re.match('<!--.*-->', str(element)): # html comments
        return False
    return True

def pullBiota(soup):
    
    data = {}
    cells=[]
    
    table = soup.find('table', class_='infobox biota') # extract infobox biota table

    rows = table.find_all('tr') # extract rows
    
    for row in rows:
        cells.append(row.find_all('td')) # creates list of lists of cells per row

    cells=[x for x in cells if len(x)==2] # only retains rows with 2 cells each
    
    for i in cells:
        data[i[0].get_text()] = i[1].get_text() # sets key to first cell and value to second cell
    
    return data


def loadFile(writer, fname):
    '''
    Read file contents, load into database.
    '''
    with open(fname, 'r') as infile:
        content=infile.read()   # read html content
        
        soup = BeautifulSoup(content, 'html.parser')
        texts = soup.find_all(text=True)
        
        # Process all the visible text
        visible_texts = filter(visible, texts)
        
        # TODO: Assemble all visible_texts into a content string
        # Hint: Iterate over visible_texts line by line; remove newlines; create a concatenated string
        processed_text = ""
        
        for line in visible_texts:
            line = line.rstrip('\n')
            processed_text+=line
        
        
        # TODO: Process the "<table class="infobox biota" ... </table> data
        infotable = pullBiota(soup)
        
        # Write to the index
        
        writer.add_document(filename=fname,content=processed_text,kingdom=infotable.get('Kingdom:'),
                            phylum=infotable.get('Phylum:'),an_class=infotable.get('Class:'),an_order=infotable.get('Order:'),
                            family=infotable.get('Family:'),subfamily=infotable.get('Subfamily:'),
                            genus=infotable.get('Genus:'))
        
        print("Indexed: ", fname)

def processFolder(writer,folder):
    '''
    Process a folder for files and subfolders
    '''
    print('Processing folder: ',folder)
    for root, dirs, files in os.walk(folder):
        print("root = ", root)
        # Process Files
        for file in files:
            if file.endswith(".html"):
                filename = os.path.join(root, file)
                print('Processing File:',filename)
                loadFile(writer,filename)
            else:
                print("Unhandled File")



### 4) Parse with our defined functions in place.

In [234]:
# Start processing the folder and commit the work
# ---------------------------------------------------

processFolder(writer,"/dsa/data/all_datasets/en.wikipedia.org/wiki")
writer.commit()


Processing folder:  /dsa/data/all_datasets/en.wikipedia.org/wiki
root =  /dsa/data/all_datasets/en.wikipedia.org/wiki
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Argenteohyla.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/wiki/Argenteohyla.html
Processing File: /dsa/data/all_datasets/en.wikipedia.org/wiki/Bokermannohyla.html
Indexed:  /dsa/data/all_datasets/en.wikipedia.org/w

--- 
<a id='search_me' ></a>

### 5) Executing Queries

Read: 
  http://whoosh.readthedocs.io/en/latest/searching.html
  
Previously, we hard-coded query strings into the code cells.

Now, use the `input()` function collect a query string from the user. 
Then execute the search. For this task, focus only on the `content` field. 

In [236]:
from whoosh.qparser import QueryParser

# Write your code below this comment:
# --------------------------------------

qp = QueryParser("content", schema=ix.schema)

query = input("Enter a string: ")

q = qp.parse(query)

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit['filename'], hit.score, hit.rank)



Enter a string: frog
/dsa/data/all_datasets/en.wikipedia.org/wiki/Smilisca.html 1.9813945223168978 0
/dsa/data/all_datasets/en.wikipedia.org/wiki/Sphaenorhynchus.html 1.970012642899165 1
/dsa/data/all_datasets/en.wikipedia.org/wiki/Hylidae.html 1.9345465722677746 2
/dsa/data/all_datasets/en.wikipedia.org/wiki/Pseudis.html 1.8908179954008726 3
/dsa/data/all_datasets/en.wikipedia.org/wiki/Osteopilus.html 1.890020489001432 4
/dsa/data/all_datasets/en.wikipedia.org/wiki/Ptychohyla.html 1.890020489001432 5
/dsa/data/all_datasets/en.wikipedia.org/wiki/Pseudacris.html 1.8432853499981452 6
/dsa/data/all_datasets/en.wikipedia.org/wiki/Ecnomiohyla.html 1.7583032681359634 7
/dsa/data/all_datasets/en.wikipedia.org/wiki/Ecnomiohyla.html 1.7583032681359634 8
/dsa/data/all_datasets/en.wikipedia.org/wiki/Hyla.html 1.747022624013118 9


### 6) Write two example queries to ensure you can search the index 

That is, make sure you can search on the fields you added to the index from the infobox biota table.

```HTML
<table class="infobox biota" ... </table>
```
For this search, we will ignore `content` field and search over the other fields. We can use `MultifieldParser` to specify the fields of our interest. 


In [238]:
# Write your code below this comment:
# --------------------------------------
import whoosh
from whoosh.qparser import MultifieldParser
from whoosh import *

# OMIT CONTENT
qp = MultifieldParser(["kingdom","phylum","an_class","an_order","family","subfamily","genus"], 
                      schema=ix.schema, group=qparser.OrGroup)  

q = qp.parse("Animalia") # searches the hierarchies for "Animalia"

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit['filename'], hit.score, hit.rank)



/dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html 0.9819814944973216 0
/dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html 0.9819814944973216 1
/dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html 0.9819814944973216 2
/dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html 0.9819814944973216 3
/dsa/data/all_datasets/en.wikipedia.org/wiki/Argenteohyla.html 0.9819814944973216 4
/dsa/data/all_datasets/en.wikipedia.org/wiki/Bokermannohyla.html 0.9819814944973216 5
/dsa/data/all_datasets/en.wikipedia.org/wiki/Bromeliohyla.html 0.9819814944973216 6
/dsa/data/all_datasets/en.wikipedia.org/wiki/Charadrahyla.html 0.9819814944973216 7
/dsa/data/all_datasets/en.wikipedia.org/wiki/Corythomantis.html 0.9819814944973216 8
/dsa/data/all_datasets/en.wikipedia.org/wiki/Dendropsophus.html 0.9819814944973216 9


In [240]:
# Write your code below this comment:
# --------------------------------------

# OMIT CONTENT
qp = MultifieldParser(["kingdom","phylum","an_class","an_order","family","subfamily","genus"], 
                      schema=ix.schema, group=qparser.OrGroup)

q = qp.parse("Amphibia") # searches the hierarchies for "Amphibia"

with ix.searcher() as s:
    results = s.search(q)
    for hit in results:
        print(hit['filename'], hit.score, hit.rank)



/dsa/data/all_datasets/en.wikipedia.org/wiki/Acris.html 0.9819814944973216 0
/dsa/data/all_datasets/en.wikipedia.org/wiki/Anotheca.html 0.9819814944973216 1
/dsa/data/all_datasets/en.wikipedia.org/wiki/Aparasphenodon.html 0.9819814944973216 2
/dsa/data/all_datasets/en.wikipedia.org/wiki/Aplastodiscus.html 0.9819814944973216 3
/dsa/data/all_datasets/en.wikipedia.org/wiki/Argenteohyla.html 0.9819814944973216 4
/dsa/data/all_datasets/en.wikipedia.org/wiki/Bokermannohyla.html 0.9819814944973216 5
/dsa/data/all_datasets/en.wikipedia.org/wiki/Bromeliohyla.html 0.9819814944973216 6
/dsa/data/all_datasets/en.wikipedia.org/wiki/Charadrahyla.html 0.9819814944973216 7
/dsa/data/all_datasets/en.wikipedia.org/wiki/Corythomantis.html 0.9819814944973216 8
/dsa/data/all_datasets/en.wikipedia.org/wiki/Dendropsophus.html 0.9819814944973216 9


# SAVE YOUR NOTEBOOK WITH ALL EXECUTED CELLS
# Then, `File > Close and Halt`