Alright folks, this tutorial walks you through building your own full-text search engine from scratch.

By the end of it, you will understand tokenizers, analyzers, indexing, and search.

## Analyze

First, install the prerequisites:

In [38]:
! pip install pystemmer

Importing some of the libraries we will be using.

**Stemmer:** provides stemming, which strips the suffix from a word to reduce it to its root. For example, “Flying” has the suffix “ing”; removing “ing” leaves the root word “Fly”.

**re:** Python's built-in regular expression module

**string:** Python's built-in module of common string constants, such as `string.punctuation`

In [39]:
import Stemmer
import re
import string

In [40]:
# Split a string on whitespace into a list of strings.
def tokenize(text):
    return text.split()
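
To see what whitespace splitting does and does not handle, here is a small check. The sample sentence is made up for illustration; `tokenize` is repeated so the snippet runs on its own.

```python
def tokenize(text):
    return text.split()

# Runs of whitespace are collapsed, but punctuation stays attached to words.
tokens = tokenize("Hello, world!  It's   a test.")
print(tokens)  # ['Hello,', 'world!', "It's", 'a', 'test.']
```

Because tokens such as `'Hello,'` keep their punctuation, a later filter has to clean them up.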

In [41]:
# Convert every token to lowercase.
def lowercase_filter(tokens):
    return [token.lower() for token in tokens]

In [42]:
# Apply the stemmer to reduce every word to its root.
# The stemmer is created once and reused, rather than rebuilt on every call.
STEMMER = Stemmer.Stemmer('english')

def stem_filter(tokens):
    return STEMMER.stemWords(tokens)
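
To build some intuition for what stemming does, here is a toy suffix stripper. It is only a sketch: the suffix list and the `toy_stem` name are made up for illustration, and this is not the Snowball algorithm that PyStemmer actually uses, which handles far more cases.

```python
# Toy suffix stripper for illustration only; NOT the algorithm PyStemmer uses.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny list

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least three characters remain,
        # so short words like "is" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ["Flying", "jumps", "walked", "dog"]])
# ['Fly', 'jump', 'walk', 'dog']
```

A real stemmer applies many more rules, which is why its output is not always a dictionary word.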

In [43]:
# remove all punctuation
def punctuation_filter(tokens):
    PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))
    return [PUNCTUATION.sub('', token) for token in tokens]
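
A quick check of the punctuation filter on a few made-up tokens. The function is repeated here so the snippet runs on its own, with the pattern compiled once at module level.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))

def punctuation_filter(tokens):
    return [PUNCTUATION.sub('', token) for token in tokens]

cleaned = punctuation_filter(['hello,', '--', "don't", 'e.g.'])
print(cleaned)  # ['hello', '', 'dont', 'eg']
```

Note that a token made up entirely of punctuation becomes an empty string, so empty tokens need to be dropped at some later point.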

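Stop words are very common words that add little to a search. The list below uses the 25 most common words in English according to Wikipedia, plus the word "wikipedia" itself, which the code adds to the list:
https://en.wikipedia.org/wiki/Most_common_words_in_English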

In [44]:
# Remove all stop words.
def stopword_filter(tokens):
    # Entries are lowercase because analyze() lowercases tokens before this filter runs.
    STOPWORDS = set(['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
                     'i', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
                     'do', 'at', 'this', 'but', 'his', 'by', 'from', 'wikipedia'])
    return [token for token in tokens if token not in STOPWORDS]
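
The membership test in the stop word filter is case-sensitive, so a lowercase entry only matches a lowercase token. The sketch below shows the effect with a shortened, hypothetical stop word list and a made-up sentence.

```python
STOPWORDS = {'the', 'i', 'of'}  # hypothetical, shortened list

def stopword_filter(tokens):
    return [token for token in tokens if token not in STOPWORDS]

tokens = ['The', 'history', 'of', 'the', 'web']

# Without lowercasing first, the capitalized 'The' slips through.
unfiltered_case = stopword_filter(tokens)
print(unfiltered_case)  # ['The', 'history', 'web']

# Lowercasing before filtering removes both occurrences.
lowered = stopword_filter([token.lower() for token in tokens])
print(lowered)  # ['history', 'web']
```

This is why lowercasing should run before the stop word filter.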

Now we define an "analyze" function that chains all of the functions above into a single pipeline:

In [45]:
def analyze(text):
    tokens = tokenize(text)
    tokens = lowercase_filter(tokens)
    tokens = punctuation_filter(tokens)
    tokens = stopword_filter(tokens)
    tokens = stem_filter(tokens)

    # Drop tokens left empty by punctuation removal.
    return [token for token in tokens if token]

So let's test it using a common sentence.

In [46]:
analyze("The quick brown fox jumps over the lazy dog")

['quick', 'brown', 'fox', 'jump', 'over', 'lazi', 'dog']

## Index

## Search

## Relevance