# Spark Search Engine
Done by Artur Samigullin

This notebook shows how to index a dataset with the Spark Search Engine library, using a small use case.

## Part I. Indexing Dataset

### Initialize Contexts
To work with Spark Search Engine, first import the pyspark library and initialize a SparkContext and an SQLContext.

In [1]:
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext()

from pyspark.sql import SQLContext
sqlc = SQLContext(sc)

In [2]:
sc.pythonVer

'3.5'

In [3]:
sc.defaultParallelism

4

### Import SearchEngine class
Next, import the SearchEngine class from SparkSearchEngineLib.SearchEngine.

In [4]:
from SparkSearchEngineLib.SearchEngine import SearchEngine

### Initialize instance of SearchEngine class
The SearchEngine constructor takes two parameters: a SparkContext and an SQLContext.

In [5]:
se = SearchEngine(sc,sqlc)

### Index your dataset
We assume that all preprocessing of your files is already done. The method expects a folder of text files, each in the format 'Token0 Token1 ... TokenN'.

In [6]:
se.construct_index('./Dataset/')
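As a rough illustration of the preprocessing assumed above, the following standard-library sketch turns raw text files into the 'Token0 Token1 ... TokenN' format. The helper names `tokens_line` and `prepare_dataset`, the lowercasing, and the `\w+` tokenization rule are assumptions made for this example and are not part of SparkSearchEngineLib. Stemming is left out to keep the sketch dependency-free; if queries are stemmed as in Part II, the documents presumably need the same stemming for tokens to match.

```python
import re
from pathlib import Path


def tokens_line(text):
    """Lowercase the text and join its word tokens with single spaces."""
    return ' '.join(re.findall(r'\w+', text.lower()))


def prepare_dataset(raw_dir, out_dir):
    """Write a tokenized copy of every .txt file in raw_dir to out_dir.

    Hypothetical helper; returns the names of the files it wrote.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(raw_dir).glob('*.txt')):
        dst = out / src.name
        dst.write_text(tokens_line(src.read_text(encoding='utf-8')),
                       encoding='utf-8')
        written.append(dst.name)
    return written
```

The output folder could then be passed to *construct_index( )* in the same way as './Dataset/' above.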

## Part II. Use Search
To search, you need a SearchEngine instance with a constructed index. Run queries with the *search( )* method.  
*search( )* takes one parameter: a preprocessed query string in the format 'Token0 Token1 ... TokenN'.  
It returns the matching links (filenames), each with its number of hits.

In [11]:
import nltk
import string
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer('english')
from nltk import wordpunct_tokenize

import re
punctStr = re.compile(r'|'.join([re.escape(x) for x in string.punctuation]))

In [17]:
def preprocess_query(query_string):
    # Tokenize, stem each token, join with spaces, then strip punctuation.
    stems = [snowball_stemmer.stem(x) for x in wordpunct_tokenize(query_string)]
    return punctStr.sub('', ' '.join(stems))
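To make the punctuation step of preprocess_query easier to inspect, here is a standard-library sketch that applies the same pattern as punctStr to an already tokenized query, with stemming omitted. Because punctuation is removed after the tokens are joined, a token that consists only of punctuation leaves an extra space behind. `strip_punct_compact` is a hypothetical variant that drops such tokens instead; whether the search engine is sensitive to the extra spaces is not stated in this notebook.

```python
import re
import string

# Same construction as punctStr: one alternative per ASCII punctuation character.
punct_re = re.compile('|'.join(re.escape(ch) for ch in string.punctuation))


def strip_punct(tokens):
    """Join tokens with spaces, then delete punctuation (as preprocess_query does)."""
    return punct_re.sub('', ' '.join(tokens))


def strip_punct_compact(tokens):
    """Hypothetical variant: clean each token and drop the ones that become empty."""
    cleaned = (punct_re.sub('', tok) for tok in tokens)
    return ' '.join(tok for tok in cleaned if tok)


tokens = ['Cheshire', '-', 'Cat', '!']
print(repr(strip_punct(tokens)))          # 'Cheshire  Cat ' (double and trailing space)
print(repr(strip_punct_compact(tokens)))  # 'Cheshire Cat'
```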

In [19]:
q_str = preprocess_query('Alice')

In [23]:
find = se.search(q_str)
find.take(10)

['file:/Users/deusesx/Projects/P&MP/Upload/Dataset/carroll-alice.txt has number of hits: 398',
 'file:/Users/deusesx/Projects/P&MP/Upload/Dataset/edgeworth-parents.txt has number of hits: 2',
 'file:/Users/deusesx/Projects/P&MP/Upload/Dataset/chesterton-thursday.txt has number of hits: 1']

In [21]:
q_str2 = preprocess_query('Cheshire Cat')

In [24]:
find_2 = se.search(q_str2)
find_2.take(10)

['file:/Users/deusesx/Projects/P&MP/Upload/Dataset/carroll-alice.txt has number of hits: 57',
 'file:/Users/deusesx/Projects/P&MP/Upload/Dataset/chesterton-ball.txt has number of hits: 8',
 'file:/Users/deusesx/Projects/P&MP/Upload/Dataset/edgeworth-parents.txt has number of hits: 7',
 'file:/Users/deusesx/Projects/P&MP/Upload/Dataset/chesterton-brown.txt has number of hits: 6',
 'file:/Users/deusesx/Projects/P&MP/Upload/Dataset/whitman-leaves.txt has number of hits: 3',
 'file:/Users/deusesx/Projects/P&MP/Upload/Dataset/melville-moby_dick.txt has number of hits: 3',
 'file:/Users/deusesx/Projects/P&MP/Upload/Dataset/bryant-stories.txt has number of hits: 2',
 'file:/Users/deusesx/Projects/P&MP/Upload/Dataset/shakespeare-macbeth.txt has number of hits: 2',
 'file:/Users/deusesx/Projects/P&MP/Upload/Dataset/chesterton-thursday.txt has number of hits: 2',
 'file:/Users/deusesx/Projects/P&MP/Upload/Dataset/austen-persuasion.txt has number of hits: 1']
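The results above are plain strings of the form '<link> has number of hits: <count>'. If the link and the count are needed separately, a small parser along these lines could split them. `parse_hit` is a hypothetical helper and assumes the string format stays exactly as shown in the outputs above.

```python
import re

# Matches the result strings shown above: '<link> has number of hits: <count>'.
HIT_RE = re.compile(r'^(?P<link>.+) has number of hits: (?P<hits>\d+)$')


def parse_hit(line):
    """Split one result string into a (link, hits) tuple."""
    match = HIT_RE.match(line)
    if match is None:
        raise ValueError('unexpected result format: %r' % line)
    return match.group('link'), int(match.group('hits'))


sample = ('file:/Users/deusesx/Projects/P&MP/Upload/Dataset/carroll-alice.txt'
          ' has number of hits: 398')
link, hits = parse_hit(sample)
print(link.rsplit('/', 1)[-1], hits)  # carroll-alice.txt 398
```

For example, `[parse_hit(line) for line in find.take(10)]` would give a list of (link, hits) tuples for the first query.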

## Part III. File Manipulation
You can store your index in *parquet* format with the *save_index( )* method.  
*save_index( )* takes one parameter: a string with the filename.  
Note that if the file already exists, *save_index( )* overwrites it.

In [25]:
se.save_index('index.parquet')
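Since *save_index( )* overwrites an existing file, it can be useful to pick an unused name before saving. The sketch below is one way to do that for paths on the local filesystem. `fresh_index_path` is a hypothetical helper, not part of the library, and it does not cover remote storage such as HDFS.

```python
import os


def fresh_index_path(path):
    """Return path if nothing exists there, else the first unused 'name-N.ext' variant."""
    if not os.path.exists(path):
        return path
    root, ext = os.path.splitext(path)
    n = 1
    # Count upwards until a name is found that is not taken yet.
    while os.path.exists('%s-%d%s' % (root, n, ext)):
        n += 1
    return '%s-%d%s' % (root, n, ext)
```

It would be used as `se.save_index(fresh_index_path('index.parquet'))`, which saves to 'index-1.parquet' when 'index.parquet' is already present.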

In [26]:
se2 = SearchEngine(sc, sqlc)

You can load an index from *parquet* format with the *load_index( )* method.  
*load_index( )* takes one parameter: the filename.

In [27]:
se2.load_index('index.parquet')
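A simple way to check that the loaded index behaves like the original is to run the same query on both engines and compare the results. The helper below is a hypothetical sketch that relies only on *search( )* and `take`, the two calls used earlier in this notebook. It sorts the results before comparing, so the order of tied entries does not matter, but ties at the cut-off of `n` could still produce a false mismatch.

```python
def same_top_hits(engine_a, engine_b, query, n=10):
    """Compare the first n results of two engines for one preprocessed query."""
    hits_a = engine_a.search(query).take(n)
    hits_b = engine_b.search(query).take(n)
    # Sort before comparing so that the order of tied results does not matter.
    return sorted(hits_a) == sorted(hits_b)
```

With the objects defined above, `same_top_hits(se, se2, q_str)` would be expected to return True if the round trip through *parquet* preserved the index.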