# This exercise is to use whoosh to build a search engine 
# over your own DSA Notebooks

Have you ever had the thought:
"I know we did this before! Where did I see that in the course materials?"

This exercise is to build a technical solution to aid in answering that question.

**NOTE:** This is a little more like a practice, but it is the exercise for the week.

## Here are the steps 
### I) Build (conceptual):
  1. Crawl through your home directory and find all notebooks (`.ipynb`)
  2. Extract the visible text from the notebooks
  3. Use Whoosh to Index
  
### II) Query:
  1. Open the index
  1. Query

--- 
## Preliminaries
### Parsing Visible Text from a notebook

In [1]:
import sys
import json

# Test on my self, like all evil scientists
filename = './DSA_Notebook_Search_Engine.ipynb'

# Use the JSON library to get it in a DOM-like structure
# This is similar to using BeautifulSoup on HTML/KML/XML
file_data = json.load(open(filename))

# File data is now a map, recall a JSON format is a combo of dictionaries and lists
cells = file_data.get('cells')

print("Dumping {} Non-output Cells from {}".format(len(cells), filename))

# Count the Cells
cno = 1

# for each cell in the notebook
for c in cells:
    
    #extract and test the cell type
    cell_type = c['cell_type']
    if ('code'==cell_type or 'markdown'==cell_type or 'raw'==cell_type ):
        print("# -------------{}----------------".format(cno))
        
        # run the source into lines, it is actually a list of strings/lines
        source = c['source']
        for l in source:
            print(l.strip('\n'))
        cno += 1


Dumping 30 Non-output Cells from ./DSA_Notebook_Search_Engine.ipynb
# -------------1----------------
# This exercise is to use whoosh to build a search engine 
# over your own DSA Notebooks

Have you ever had the thought:
"I know we did this before! Where did I see that in the course materials?"

This exercise is to build a technical solution to aid in answering that question.

**NOTE:** This is a little more like a practice, but it is the exercise for the week.

## Here are the steps 
### I) Build (conceptual):
  1. Crawl through your home directory and find all notebooks (`.ipynb`)
  2. Extract the visible text from the notebooks
  3. Use Whoosh to Index
  
### II) Query:
  1. Open the index
  1. Query
# -------------2----------------
--- 
## Preliminaries
### Parsing Visible Text from a notebook
# -------------3----------------
import sys
import json

# Test on my self, like all evil scientists
filename = './DSA_Notebook_Search_Engine.ipynb'

# Use the JSON library to get it in a DO

If you have followed the pattern of the way Dr. Scott builds up code... 

## a) Time to create function!

### Your function should return a set of cells, each cell having all 
### lines concatentated into one long `string` of data.

The comment lines, `# TODO ... `, will define steps for you to complete.

In [2]:
def visibleTextFromNB(filename):
    '''
    # TODO Describe this function's purpose and return result
    
    This function pulls all the non-output visible cells from
    a JupyterNotebook and concatenates it all into a block of
    text.
    
    Returns : a list of the cells
    '''
    #####################################
    # TODO: Parse file, pull cells
    #####################################
    file_data = json.load(open(filename))

    cells = file_data.get('cells')
   
   
    #####################################
    # TODO: Append cells into a list of cells
    # HINT: Do not strip the newline, \n
    #####################################

    cell_list = []
    
    if cells == None:
        return cell_list
    
    for c in cells:
        cell_type = c['cell_type']
        if ('code'==cell_type or 'markdown'==cell_type or 'raw'==cell_type ):
            cell_text = ""
            source = c['source']
            for l in source:
                cell_text += l
            cell_list.append(cell_text)     
        
   
    #####################################
    # TODO: Append cells into a list of cells
    # HINT: Do not strip the newline, \n
    #####################################

    # return the list
    return cell_list


#End of function: visibleTextFromNB 

## Test your function:

In [3]:
#################################
#        NO EDIT CELL
#################################

# Use the lab notebook
filename = '../labs/Text_Search_TFIDF.ipynb'
cells = visibleTextFromNB(filename)

# Print the begin and end
print(cells[0])
print(cells[1])
print('...')
print(cells[len(cells)-1])

# Building and Loading Text Search in Python Whoosh using TFIDF


## OUTLINE

 1. [Task at hand](#task)
 1. [Buiding our Whoosh Schema](#build_it)
 1. [Loading Data](#load_it)
 1. [Scoring](#Scoring)
 1. [Executing Queries, Google-lite...very very lite](#TFIFD) 
--- 
<a id='task' ></a>

## Task at hand

For this lab, we are going to revist the IR_with_Python_Whoosh lab in module 5 which walks us through the process of creating full text search capability within Python. In addition to that we are going to incorporate a scoring technique called TFIDF for ranking documents based on TFIDF scores of the terms occuring in the documents. 

We will walk through the process to build the search engine in Python using whoosh. We will compare the serach results with and without TFIDF method.
...
The line_num above is the actual line number in the text file. docnum should be the index number in the whole indexes we have created.


#### <span style="background:yellow">Expected Output</span>

```
# Building and Loading Text Search in Python Whoosh using TFIDF


## OUTLINE

 1. [Task at hand](#task)
 1. [Buiding our Whoosh Schema](#build_it)
 1. [Loading Data](#load_it)
 1. [Scoring](#Scoring)
 1. [Executing Queries, Google-lite...very very lite](#TFIFD) 
--- 
<a id='task' ></a>

## Task at hand

For this lab, we are going to revist the IR_with_Python_Whoosh lab in module 5 which walks us through the process of creating full text search capability within Python. In addition to that we are going to incorporate a scoring technique called TFIDF for ranking documents based on TFIDF scores of the terms occuring in the documents. 

We will walk through the process to build the search engine in Python using whoosh. We will compare the serach results with and without TFIDF method.
...
The line_num above is the actual line number in the text file. docnum should be the index number in the whole indexes we have created.
```

# b) Create a _draft_ function to walk through directory and find notebooks (.ipynb)

Collect up the directory walking code from module 5 examples.

#### Note, the starting folder will be `'~'`, the alias for your home directory

#### Note, do not process files yet, just construct the function and print the notebook path names

In [4]:
import os, os.path

def walkFolder(folder):
    '''
    Process a folder for files and subfolders
    Prints the files and folders that are processed. 
    '''
    #print('Processing folder: ',folder)
    
    #####################################
    # TODO: walk through the filesystem starting at folder
    # HINT: os.walk
    #####################################
    for root,dirs,files in os.walk(folder):
        print("root=",root)
    
    
        
        #####################################
        # TODO: Process Files
        # HINT: skips file.endswith("-checkpoint.ipynb")
        #####################################
    
        for file in files:
            filename = os.path.join(root, file)
            if file.endswith("-checkpoint.ipynb"):
                pass
            elif file.endswith(".ipynb"):
                print('Found Notebook:',filename)
            else:
                print("unhandled file")
   
                
                
            
        #####################################
        # TODO: Recurse into subfolders
        # HINT: Skip these
        #           .config
        #           .cache
        #           .ssh
        #           ... etc.         
        #####################################
            
            
        for d in dirs:
            if d.startswith("."):
                 pass
            else:
                 walkFolder(d)
 
    
    
    

############# END for walkFolder

## Test your function:

In [5]:
#################################
#        NO EDIT CELL
#################################

# Use your top-level home directory
initial_root = os.path.expanduser('~') 
walkFolder(initial_root)


root= /dsa/home/souleymanesaleya
unhandled file
unhandled file
unhandled file
unhandled file
unhandled file
unhandled file
root= /dsa/home/souleymanesaleya/.config
root= /dsa/home/souleymanesaleya/.config/matplotlib
root= /dsa/home/souleymanesaleya/.config/htop
unhandled file
root= /dsa/home/souleymanesaleya/.ipython
root= /dsa/home/souleymanesaleya/.ipython/extensions
root= /dsa/home/souleymanesaleya/.ipython/nbextensions
root= /dsa/home/souleymanesaleya/.ipython/profile_default
unhandled file
unhandled file
root= /dsa/home/souleymanesaleya/.ipython/profile_default/db
root= /dsa/home/souleymanesaleya/.ipython/profile_default/log
root= /dsa/home/souleymanesaleya/.ipython/profile_default/pid
root= /dsa/home/souleymanesaleya/.ipython/profile_default/security
root= /dsa/home/souleymanesaleya/.ipython/profile_default/startup
unhandled file
root= /dsa/home/souleymanesaleya/.local
root= /dsa/home/souleymanesaleya/.local/share
root= /dsa/home/souleymanesaleya/.local/share/jupyter
unhandled fi

#### <span style="background:yellow">Expected Output</span>

   * Expected output similar to

```
Found Notebook: /dsa/home/scottgs/Testing Python.ipynb
Found Notebook: /dsa/home/scottgs/Testing R.ipynb
...
Found Notebook: /dsa/home/scottgs/sp17DMIR/modules/module5/practices/Text_Preprocessing.ipynb
Found Notebook: /dsa/home/scottgs/sp17DMIR/modules/module5/exercises/Exercises_Python_text_search.ipynb
Found Notebook: /dsa/home/scottgs/sp17DMIR/modules/module5/answers/Text_Preprocessing_Answers.ipynb
```

# c) Create the schema for whoosh and initialize 
# the index in `notebooks` folder

Recall, in the work above you pulled each visible cell from the notebook into a list of cells.

So, we can store the data such as our results are:
  * Filename
  * Cell No.   
  
Then, we will also index the cell content.

In [6]:
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer
from whoosh import index


#####################################
# TODO: Create the schema
#####################################
schema = Schema(filename=ID(stored=True),
                line_num=ID(stored=True),
                content=TEXT(analyzer=StemmingAnalyzer())
               )




#####################################
# TODO: Create the index and initialize a `writer`
#####################################
import os, os.path
from whoosh import index

# Note, this clears the existing index in the directory
ix = index.create_in("notebooks", schema)

# Get a writer form the created index in 
writer = ix.writer()






In [7]:
import json

def is_json(myjson):
    try:
        json_object = json.loads(myjson)
    except ValueError:
        return False
    return True

In [8]:
#################################
#        NO EDIT CELL
#################################

#  REPEATED for better review

def visibleTextFromNB(filename):
    '''
    # TODO Describe this function's purpose and return result
    This function pulls all the non-output visible cells from
    a JupyterNotebook and concatenates it all into a block of
    text.
    Returns : a list of the cells
    '''
    #####################################
    # TODO: Parse file, pull cells
    #####################################

    #file_data = json.load(open(filename))

    # File data is now a map, recall a JSON format is a combo of dictionaries and lists
    #cells = file_data.get('cells')
    
    cell_list = []

    if is_json(filename):
        file_data = json.loads(open(filename))
    else: 
        return cell_list
    # File data is now a map,recall a Json format is a combo of dictionaries and lists
   
    cells = file_data.get('cells')

    #####################################
    # TODO: Append cells into a list of cells
    # HINT: Do not strip the newline, \n
    #####################################

#     cell_list = []
    if cells == None:
        return cell_list
    
    # for each cell in the notebook
    for c in cells:

        #extract and test the cell type
        cell_type = c['cell_type']
        if ('code'==cell_type or 'markdown'==cell_type or 'raw'==cell_type ):
            cell_text = ""
            # run the source into lines, it is actually a list of strings/lines
            source = c['source']
            for l in source:
                cell_text += l
            cell_list.append(cell_text)

            
    #####################################
    # TODO: Append cells into a list of cells
    # HINT: Do not strip the newline, \n
    #####################################

    # return the list
    return cell_list

#End of function: visibleTextFromNB 

## d) Write function to load file into the index
  * See [Module 5 Lab](../../module5/labs/IR_with_Python_Whoosh.ipynb#This-should-look-familiar!)

In [9]:
def loadFile(writer, fname):
    '''
    Read file contents, load into database.
    '''
    #####################################
    # TODO: Get cell text from function
    #####################################
    cells=visibleTextFromNB(fname)

    
    #####################################
    # TODO: Iterate through cells, index
    #####################################
    counter = 1;
    
    for c in cells:
        writer.add_document(filename=fname,line_num=str(counter),content=c)
        counter+= 1
    
    
    
    print("Indexed: ", fname)

# END of function

# e) Adapt the folder walking function to invoke file load

### HINT: add the `writer` as a parameter

In [10]:
import os, os.path

#####################################
# TODO: Adapt the parameters
#####################################
def walkFolder(writer, folder):
    '''
    Process a folder for files and subfolders
    Prints the files and folders that are processed.
    '''
        
    #print('Processing folder: ',folder)
    
    #####################################
    # TODO: walk through the filesystem starting at folder
    # HINT: os.walk
    #####################################
    for root,dirs,files in os.walk(folder):
        print("root=",root)
    

    
    
    
    
        #####################################
        # TODO: Process Files
        # HINT: skips file.endswith("-checkpoint.ipynb")
        #####################################
        for file in files:
            filename = os.path.join(root, file)
            if file.endswith("-checkpoint.ipynb"):
                pass
#             elif file.endswith("DSA-8620_souleymanesaleya/modules/module4/labs/regression.ipynb"):
#                 pass
            elif file.endswith(".ipynb"): 
                loadFile(writer,filename)
            else:
                print("unhandled file")
        






        #####################################
        # TODO: Recurse into subfolders
        # HINT: Skip these
        #           .config
        #           .cache
        #           .ssh
        #           ... etc.         
        #####################################
    for d in dirs:
        if (d.startswith(".")):
            pass
        else:
            walkFolder(writer,d)
        
        
        
        
        

############# END for walkFolder

# f) Run the index build 

In [11]:
#################################
#        NO EDIT CELL
#################################

# Use your top-level home directory
initial_root = os.path.expanduser('~') 
walkFolder(writer,initial_root)


# Commit changes
writer.commit() # save changes

root= /dsa/home/souleymanesaleya
unhandled file
unhandled file
unhandled file
unhandled file
unhandled file
unhandled file
root= /dsa/home/souleymanesaleya/.config
root= /dsa/home/souleymanesaleya/.config/matplotlib
root= /dsa/home/souleymanesaleya/.config/htop
unhandled file
root= /dsa/home/souleymanesaleya/.ipython
root= /dsa/home/souleymanesaleya/.ipython/extensions
root= /dsa/home/souleymanesaleya/.ipython/nbextensions
root= /dsa/home/souleymanesaleya/.ipython/profile_default
unhandled file
unhandled file
root= /dsa/home/souleymanesaleya/.ipython/profile_default/db
root= /dsa/home/souleymanesaleya/.ipython/profile_default/log
root= /dsa/home/souleymanesaleya/.ipython/profile_default/pid
root= /dsa/home/souleymanesaleya/.ipython/profile_default/security
root= /dsa/home/souleymanesaleya/.ipython/profile_default/startup
unhandled file
root= /dsa/home/souleymanesaleya/.local
root= /dsa/home/souleymanesaleya/.local/share
root= /dsa/home/souleymanesaleya/.local/share/jupyter
unhandled fi

In [12]:
# DO NOT RUN UNLESS YOU NEED A RESET
#from whoosh import writing
#writer.commit(mergetype=writing.CLEAR)

# g) Execute three queries

## 1

In [13]:
from whoosh.qparser import QueryParser
from whoosh import scoring

# Get input, convert to unicode
qstr = input("Input a query: ")

print("searching for ",qstr)

####################################
# TODO: Build query parser and parse query
####################################

qp = QueryParser("content", schema=ix.schema)
search_str = qp.parse(qstr)




####################################
# TODO: Search the content field
####################################
#with ix.searcher() as s:
    #results = s.search(qstr)
    #for hit in results:
        #print(hit)  

with ix.searcher() as s:
    results = s.search(search_str)
    for hit in results:
        print(hit)




Input a query: linear
searching for  linear


#### <span style="background:yellow">Expected Output</span>

  *  Example output from searching for QueryParser

```
Input a qeury: QueryParser
searching for  QueryParser
content:queryparser
Cell 11 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module5/labs/IR_with_Python_Whoosh.ipynb'
Cell 27 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module5/answers/ParsingWikipediaLifeformPage.ipynb'
Cell 25 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/exercises/DSA_Notebook_Search_Engine.ipynb'
Cell 9 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/labs/Text_Search_TFIDF.ipynb'
Cell 14 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/labs/Text_Search_TFIDF.ipynb'
Cell 17 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/labs/Text_Search_TFIDF.ipynb'
Cell 45 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/labs/Topic_modelling.ipynb'
Cell 7 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/answers/TFIDF_Scoring_Practice.ipynb'
Cell 8 of Notebook '/dsa/home/scottgs/DataMiningAndInfoRetrieval/modules/module6/answers/TFIDF_Scoring_Practice.ipynb'
Cell 11 of Notebook '/dsa/home/scottgs/sp17DMIR/modules/module5/labs/IR_with_Python_Whoosh.ipynb'
```

---
## 2

In [14]:
# Get input, conver to unicode
qstr = input("Input a query: ")

print("searching for ",qstr)

####################################
# TODO: Build query parser and parse query
####################################
qp = QueryParser("content", schema=ix.schema)
qstr = qp.parse("regression")




####################################
# TODO: Search the content field
####################################
with ix.searcher() as s:
    results = s.search(qstr)
    for hit in results:
        print('Cell ',hit["line_num"],'of Notebook:', hit["filename"])











Input a query: regression
searching for  regression


## 3

In [15]:
# Get input, conver to unicode
qstr = input("Input a query: ")

print("searching for ",qstr)

####################################
# TODO: Build query parser and parse query
####################################
qp = QueryParser("content", schema=ix.schema)
qstr = qp.parse("classification")






####################################
# TODO: Search the content field
####################################

with ix.searcher() as s:
    results = s.search(qstr)
    for hit in results:
        print('Cell ',hit["line_num"],'of Notebook:', hit["filename"])





Input a query: classification
searching for  classification


# SAVE YOUR NOTEBOOK