# <center>Build a search engine from scratch</center>
---

In [None]:
# Increase notebook cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

The tutorial linked below builds a search engine with Elasticsearch.
- https://www.analyticbridge.datasciencecentral.com/profiles/blogs/how-to-build-a-search-engine-part-1

Unfortunately, PythonAnywhere does not support Elasticsearch, so we will use the pure-Python search engine library **Whoosh** instead.

### Objective

Attempt to build a search engine that incorporates the following features:
- Fuzzy search
- Subset pattern matching
- Auto-complete suggestions
- Scoring algorithm
- Sorting and filtering the results

### Libraries

In [1]:
from whoosh.index import create_in
from whoosh.fields import *
from whoosh.qparser import QueryParser

### Hello World search engine

In [2]:
! rm -rf indexdir # remove the old indexdir that contains the documents

In [3]:
! mkdir indexdir # create a new indexdir folder to store the documents
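
The two shell commands above rely on `rm` and `mkdir` being available on the host. As a hedged alternative, here is a minimal sketch that does roughly the same thing with only the Python standard library. The helper name `reset_index_dir` is hypothetical and not part of the original notebook.

```python
import os
import shutil


def reset_index_dir(path="indexdir"):
    """Delete `path` if it exists, then recreate it empty (hypothetical helper)."""
    shutil.rmtree(path, ignore_errors=True)  # roughly `rm -rf path`
    os.makedirs(path, exist_ok=True)         # roughly `mkdir path`
    return path


reset_index_dir("indexdir")
```

Calling the helper again wipes any previously indexed documents, so it should only be run when a fresh index is wanted.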

In [4]:
# Setup schema and writer object for creating documents
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT)
ix = create_in("indexdir", schema) # the indexdir folder must already exist (see mkdir above)
writer = ix.writer()

In [5]:
# Create and commit (save to disk) some documents
writer.add_document(title=u"First document", path=u"/a",
                    content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", path=u"/b",
                    content=u"This is the second document and it is even more interesting!")
writer.commit() # save documents to disk

In [10]:
# Parse a query, run it against the index and print the results.
# Here we search the "content" field; each hit shows only the stored fields (title and path).
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("document")
    results = searcher.search(query)
    for r in results:
        print(r)

<Hit {'title': 'First document', 'path': '/a'}>
<Hit {'title': 'Second document', 'path': '/b'}>


NB: Whoosh's default analyzer for `TEXT` fields (`StandardAnalyzer`) removes common stop words from both the indexed text and the query, so searching for the word "this" returns 0 results.

In [14]:
# Now search the "title" field instead.
# Each hit again shows the stored fields (title and path).
with ix.searcher() as searcher:
    query = QueryParser("title", ix.schema).parse("second")
    results = searcher.search(query)
    for r in results:
        print(r)

<Hit {'title': 'Second document', 'path': '/b'}>
