# Serial Version and Spark Tuning

In [None]:
# sparktuning.ipynb

######################
#
# Submission by Kendrick Lo (Harvard ID: 70984997) for
# CS 205 - Computing Foundations for Computational Science (Prof. R. Jones)
# 
# This is part of a joint project with Gioia Dominedo that includes a separate
# component for context checking. This notebook outlines algorithms for
# word-level correction, and includes a serial Python algorithm based on a third
# party algorithm (namely SymSpell, see below), as well as a Spark/Python
# algorithm. A number of optimizations/compromises were attempted with varying
# levels of success -- these attempts are also documented in this notebook.
#
######################

### 1. Serial Code Performance (Word-Level Correction)

In [None]:
'''
v 1.0 last revised 22 Nov 2015

This program is a Python version of a spellchecker based on SymSpell, 
a Symmetric Delete spelling correction algorithm developed by Wolf Garbe 
and originally written in C#.

From the original SymSpell documentation:

"The Symmetric Delete spelling correction algorithm reduces the complexity 
 of edit candidate generation and dictionary lookup for a given Damerau-
 Levenshtein distance. It is six orders of magnitude faster and language 
 independent. Opposite to other algorithms only deletes are required, 
 no transposes + replaces + inserts. Transposes + replaces + inserts of the 
 input term are transformed into deletes of the dictionary term.
 Replaces and inserts are expensive and language dependent: 
 e.g. Chinese has 70,000 Unicode Han characters!"

For further information on SymSpell, please consult the original documentation:
  URL: blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/
  Description: blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/

The current version of this program will output all possible suggestions for
corrections up to an edit distance (configurable) of max_edit_distance = 3. 

Future improvements may entail allowing for less verbose options, 
including the output of a single recommended correction. Note also that we
have generally kept to the form of the original program, and have not
introduced any major optimizations or structural changes in this Python port.

To execute program:
1. Ensure "big.txt" is in the current working directory. This is the corpus
   from which the dictionary for the spellchecker will be built.
2a. Check recommended single word corrections by executing get_suggestions("word") 
    in the corresponding marked box below; or
2b. Check single word corrections for a document by executing
    correct_document("<file>") in the corresponding marked box below.

################

Example input/output:

################

get_suggestions("there")

number of possible corrections: 604
  edit distance for deletions: 3
  
[('there', (2972, 0)),
 ('these', (1231, 1)),
 ('where', (977, 1)),
 ('here', (691, 1)),
 ('three', (584, 1)),
 ('thee', (26, 1)),
 ('chere', (9, 1)),
 ('theme', (8, 1)),
 ('the', (80030, 2)), ...

####

correct_document("OCRsample.txt")

Finding misspelled words in your document...
In line 3, taiths: suggested correction is < faith >
In line 11, the word < oonipiittee > was not found (no suggested correction)
In line 13, tj: suggested correction is < to >
In line 13, mnnff: suggested correction is < snuff >
[...]

total words checked: 700
total unknown words: 3
total potential errors found: 94

'''

import re

max_edit_distance = 3 

dictionary = {}
longest_word_length = 0

def get_deletes_list(w):
    '''given a word, derive strings with up to max_edit_distance characters deleted'''
    deletes = []
    queue = [w]
    for d in range(max_edit_distance):
        temp_queue = []
        for word in queue:
            if len(word)>1:
                for c in range(len(word)):  # character index
                    word_minus_c = word[:c] + word[c+1:]
                    if word_minus_c not in deletes:
                        deletes.append(word_minus_c)
                    if word_minus_c not in temp_queue:
                        temp_queue.append(word_minus_c)
        queue = temp_queue
        
    return deletes

def create_dictionary_entry(w):
    '''add word and its derived deletions to dictionary'''
    # check if word is already in dictionary
    # dictionary entries are in the form: (list of suggested corrections, frequency of word in corpus)
    global longest_word_length
    new_real_word_added = False
    if w in dictionary:
        dictionary[w] = (dictionary[w][0], dictionary[w][1] + 1)  # increment count of word in corpus
    else:
        dictionary[w] = ([], 1)  
        longest_word_length = max(longest_word_length, len(w))
        
    if dictionary[w][1]==1:
        # first appearance of word in corpus
        # n.b. word may already be in dictionary as a derived word (deleting character from a real word),
        # but the corpus frequency counter is not incremented in those cases
        new_real_word_added = True
        deletes = get_deletes_list(w)
        for item in deletes:
            if item in dictionary:
                # add (correct) word to delete's suggested correction list if not already there
                if w not in dictionary[item][0]:
                    dictionary[item][0].append(w)
            else:
                dictionary[item] = ([w], 0)  # note frequency of word in corpus is not incremented
        
    return new_real_word_added

def create_dictionary(fname):

    total_word_count = 0
    unique_word_count = 0
    print "Creating dictionary..." 
    
    with open(fname) as file:    
        for line in file:
            words = re.findall('[a-z]+', line.lower())  # split into words on non-alphabetic characters
            for word in words:
                total_word_count += 1
                if create_dictionary_entry(word):
                    unique_word_count += 1
    
    print "total words processed: %i" % total_word_count
    print "total unique words in corpus: %i" % unique_word_count
    print "total items in dictionary (corpus words and deletions): %i" % len(dictionary)
    print "  edit distance for deletions: %i" % max_edit_distance
    print "  length of longest word in corpus: %i" % longest_word_length
        
    return dictionary
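
To make the dictionary layout concrete, here is a minimal sketch (not the cell above) that builds the same kind of mapping for a tiny in-memory corpus. The names `deletes_within` and `build_index` are hypothetical, sets replace the list membership tests, and the code is written so that it also runs under Python 3.

```python
def deletes_within(word, max_dist):
    """Return the set of strings obtained by deleting 1..max_dist characters."""
    found = set()
    frontier = {word}
    for _ in range(max_dist):
        next_frontier = set()
        for w in frontier:
            if len(w) > 1:  # never produce the empty string, as in get_deletes_list
                for i in range(len(w)):
                    next_frontier.add(w[:i] + w[i + 1:])
        found |= next_frontier
        frontier = next_frontier
    return found


def build_index(words, max_dist):
    """Map each term to [suggestions, count]; pure deletes keep a count of 0."""
    index = {}
    for word in words:
        entry = index.setdefault(word, [[], 0])
        entry[1] += 1
        if entry[1] == 1:  # first time this corpus word is seen
            for d in deletes_within(word, max_dist):
                suggestions = index.setdefault(d, [[], 0])[0]
                if word not in suggestions:
                    suggestions.append(word)
    return index


# toy corpus: "the" appears twice and is also a 1-delete of "then"
index = build_index(["the", "then", "the"], max_dist=1)
```

In this toy index, `index["the"]` holds both a corpus count and a suggestion (`"then"`), while `index["th"]` is a pure delete with a count of 0.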

In [None]:
## %%time
## test dictionary creation
## d = create_dictionary("/Users/K-Lo/Desktop/big.txt")

In [None]:
def dameraulevenshtein(seq1, seq2):
    """Calculate the Damerau-Levenshtein distance between sequences.

    This method has not been modified from the original.
    Source: http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/
    
    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]

def get_suggestions(string, silent=False):
    '''return list of suggested corrections for potentially incorrectly spelled word'''
    if (len(string) - longest_word_length) > max_edit_distance:
        if not silent:
            print "no items in dictionary within maximum edit distance"
        return []
    
    suggest_dict = {}
    
    queue = [string]
    q_dictionary = {}  # items other than string that we've checked
    
    while len(queue)>0:
        q_item = queue[0]  # pop
        queue = queue[1:]
        
        # process queue item
        if (q_item in dictionary) and (q_item not in suggest_dict):
            if (dictionary[q_item][1]>0):
            # word is in dictionary, and is a word from the corpus, and not already in suggestion list
            # so add to suggestion dictionary, indexed by the word with value (frequency in corpus, edit distance)
            # note q_items that are not the input string are shorter than input string 
            # since only deletes are added (unless manual dictionary corrections are added)
                assert len(string)>=len(q_item)
                suggest_dict[q_item] = (dictionary[q_item][1], len(string) - len(q_item))
            
            ## the suggested corrections for q_item as stored in dictionary (whether or not
            ## q_item itself is a valid word or merely a delete) can be valid corrections
            for sc_item in dictionary[q_item][0]:
                if (sc_item not in suggest_dict):
                    
                    # compute edit distance
                    # suggested items should always be longer (unless manual corrections are added)
                    assert len(sc_item)>len(q_item)
                    # q_items that are not input should be shorter than original string 
                    # (unless manual corrections added)
                    assert len(q_item)<=len(string)
                    if len(q_item)==len(string):
                        assert q_item==string

                    # item in suggestions list should not be the same as the string itself
                    assert sc_item!=string           
                    # calculate edit distance using, for example, Damerau-Levenshtein distance
                    item_dist = dameraulevenshtein(sc_item, string)
                    
                    if item_dist<=max_edit_distance:
                        assert sc_item in dictionary  # should already be in dictionary if in suggestion list
                        suggest_dict[sc_item] = (dictionary[sc_item][1], item_dist)
        
        # now generate deletes (e.g. a substring of string or of a delete) from the queue item
        # as additional items to check -- add to end of queue
        assert len(string)>=len(q_item)
        if (len(string)-len(q_item))<max_edit_distance and len(q_item)>1:
            for c in range(len(q_item)): # character index        
                word_minus_c = q_item[:c] + q_item[c+1:]
                if word_minus_c not in q_dictionary:
                    queue.append(word_minus_c)
                    q_dictionary[word_minus_c] = None  # arbitrary value, just to identify we checked this
             
    # queue is now empty: convert suggestions in dictionary to list for output
    if not silent:
        print "number of possible corrections: %i" %len(suggest_dict)
        print "  edit distance for deletions: %i" % max_edit_distance
    
    # output option 1
    # sort results by ascending order of edit distance and descending order of frequency
    #     and return list of suggested word corrections only:
    # return sorted(suggest_dict, key = lambda x: (suggest_dict[x][1], -suggest_dict[x][0]))

    # output option 2
    # return list of suggestions with (correction, (frequency in corpus, edit distance)):
    as_list = suggest_dict.items()
    return sorted(as_list, key = lambda (term, (freq, dist)): (dist, -freq))

    '''
    Option 1:
    get_suggestions("file")
    ['file', 'five', 'fire', 'fine', ...]
    
    Option 2:
    get_suggestions("file")
    [('file', (5, 0)),
     ('five', (67, 1)),
     ('fire', (54, 1)),
     ('fine', (17, 1))...]  
    '''

def best_word(s, silent=False):
    try:
        return get_suggestions(s, silent)[0]
    except IndexError:  # no suggestions were returned
        return None
    
def correct_document(fname):
    # correct an entire document
    with open(fname) as file:
        doc_word_count = 0
        corrected_word_count = 0
        unknown_word_count = 0
        print "Finding misspelled words in your document..." 
        
        for i, line in enumerate(file, 1):  # report 1-based line numbers
            doc_words = re.findall('[a-z]+', line.lower())  # split into words on non-alphabetic characters
            for doc_word in doc_words:
                doc_word_count += 1
                suggestion = best_word(doc_word, silent=True)
                if suggestion is None:
                    print "In line %i, the word < %s > was not found (no suggested correction)" % (i, doc_word)
                    unknown_word_count += 1
                elif suggestion[0]!=doc_word:
                    print "In line %i, %s: suggested correction is < %s >" % (i, doc_word, suggestion[0])
                    corrected_word_count += 1
        
    print "-----"
    print "total words checked: %i" % doc_word_count
    print "total unknown words: %i" % unknown_word_count
    print "total potential errors found: %i" % corrected_word_count

    return
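
`get_suggestions` recomputes the true edit distance for every candidate instead of trusting the number of deleted characters. The sketch below illustrates why, under stated assumptions: `osa_distance` is a hypothetical full-matrix helper for the restricted Damerau-Levenshtein (optimal string alignment) distance, written for readability rather than speed, and it is not the `dameraulevenshtein` routine used above.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (adjacent transpositions allowed)."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # swap of two adjacent characters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


# "these" and "there" both reduce to "thee" after one delete each,
# yet they are a single substitution apart
same_delete = osa_distance("these", "there")

# "abcdef" and "defghi" both reduce to "def" after three deletes each,
# so the delete lookup pairs them even though they are far apart
far_apart = osa_distance("abcdef", "defghi")
```

Two words that share a delete can therefore be closer than the delete counts suggest, or farther apart than `max_edit_distance`, which is why the final filter on `item_dist` is needed.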

<div class="alert alert-danger">
  <strong>Run the cell below only once to build the dictionary.</strong>
</div>

In [None]:
d = create_dictionary("/Users/K-Lo/Desktop/big.txt")

<div class="alert alert-success">
  <strong>Enter word to correct below.</strong>
</div>

In [None]:
%%time
get_suggestions("there")

In [None]:
%%time
best_word("there")
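
The final sort in `get_suggestions` uses a tuple-unpacking lambda, which is Python 2 only syntax (tuple parameters were removed in Python 3 by PEP 3113). As a sketch of the same ordering in a form that runs on both versions, the hypothetical helper `rank` below indexes into each item instead. The sample values are taken from the example output in the docstring above.

```python
def rank(suggest_dict):
    """Sort by ascending edit distance, then by descending corpus frequency."""
    # each item is (term, (frequency, edit distance))
    return sorted(suggest_dict.items(),
                  key=lambda item: (item[1][1], -item[1][0]))


sample = {
    "the": (80030, 2),
    "where": (977, 1),
    "there": (2972, 0),
    "these": (1231, 1),
}

ranked = rank(sample)
best = ranked[0][0]  # what best_word would report as the top suggestion
```

Because edit distance is the primary key, a very frequent word such as "the" still ranks below less frequent words that are closer to the input.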

<div class="alert alert-success">
  <strong>Enter file name of document to correct below.</strong>
</div>

In [None]:
%%time
correct_document("/Users/K-Lo/Desktop/OCRsample.txt")

------

### 2. Original SPARK version Performance (SLOW)

<div class="alert alert-danger">
  <strong>To run this program, restart notebook, and start executing the cells of this section starting here.</strong>
</div>

In [None]:
'''
v 1.0 last revised 22 Nov 2015

This program is a Spark (PySpark) version of a spellchecker based on SymSpell, 
a Symmetric Delete spelling correction algorithm developed by Wolf Garbe 
and originally written in C#.

From the original SymSpell documentation:

"The Symmetric Delete spelling correction algorithm reduces the complexity 
 of edit candidate generation and dictionary lookup for a given Damerau-
 Levenshtein distance. It is six orders of magnitude faster and language 
 independent. Opposite to other algorithms only deletes are required, 
 no transposes + replaces + inserts. Transposes + replaces + inserts of the 
 input term are transformed into deletes of the dictionary term.
 Replaces and inserts are expensive and language dependent: 
 e.g. Chinese has 70,000 Unicode Han characters!"

For further information on SymSpell, please consult the original documentation:
  URL: blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/
  Description: blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/

The current version of this program will output all possible suggestions for
corrections up to an edit distance (configurable) of max_edit_distance = 3. 

Future improvements may entail allowing for less verbose options, 
including the output of a single recommended correction. Note also that we
have generally kept to the form of the original program, and have not
introduced any major optimizations or structural changes in this PySpark port.

To execute program:
1. Ensure "big.txt" is in the current working directory. This is the corpus
   from which the dictionary for the spellchecker will be built.
2. Check recommended single word corrections by executing get_suggestions("word") 
    in the corresponding marked box below.

Note: we did not implement entire document checking given speed of program,
      since we are not stopping early after having found a best word with
      minimum edit distance (however, see context-based version).

################

Example input/output:

################

get_suggestions("there")

number of possible corrections: 604
  edit distance for deletions: 3
  
[('there', (2972, 0)),
 ('these', (1231, 1)),
 ('where', (977, 1)),
 ('here', (691, 1)),
 ('three', (584, 1)),
 ('thee', (26, 1)),
 ('chere', (9, 1)),
 ('theme', (8, 1)),
 ('the', (80030, 2)), ...


'''

import findspark
import os
findspark.init('/Users/K-Lo/spark-1.5.0')

import pyspark
conf = (pyspark.SparkConf()
    .setMaster('local')
    .setAppName('pyspark')
    .set("spark.executor.memory", "2g"))
sc = pyspark.SparkContext(conf=conf)

import re

n_partitions = 6  # number of partitions to be used
max_edit_distance = 3


def get_deletes_list(word):
    '''given a word, derive strings with one character deleted'''
    # takes a string as input and returns all 1-deletes in a list
    # allows for duplicates to be created, will deal with duplicates later to minimize shuffling
    if len(word)>1:
        return ([word[:c] + word[c+1:] for c in range(len(word))])
    else:
        return []
    
def copartitioned(RDD1, RDD2):
    '''check if two RDDs are copartitioned'''
    return RDD1.partitioner == RDD2.partitioner

def combine_joined_lists(tup):
    '''takes as input a tuple in the form (a, b) where each of a, b may be None (but not both) or a list
       and returns a concatenated list of unique elements'''
    concat_list = []
    if tup[1] is None:
        concat_list = tup[0]
    elif tup[0] is None:
        concat_list = tup[1]
    else:
        concat_list = tup[0] + tup[1]
        
    return list(set(concat_list))

def parallel_create_dictionary(fname):
    '''Create dictionary using Spark RDDs.'''
    # we generate and count all words for the corpus,
    # then add deletes to the dictionary
    # this is a slightly different approach from the SymSpell algorithm
    # that may be more appropriate for Spark processing
    
    print "Creating dictionary..." 
    
    ############
    #
    # process corpus
    #
    ############
    
    # http://stackoverflow.com/questions/22520932/python-remove-all-non-alphabet-chars-from-string
    regex = re.compile('[^a-z ]')

    # convert file into one long sequence of words
    make_all_lower = sc.textFile(fname).map(lambda line: line.lower())
    replace_nonalphs = make_all_lower.map(lambda line: regex.sub(' ', line))
    all_words = replace_nonalphs.flatMap(lambda line: line.split())

    # create core corpus dictionary (i.e. only words appearing in file, no "deletes") and cache it
    # output RDD of unique_words_with_count: [(word1, count1), (word2, count2), (word3, count3)...]
    count_once = all_words.map(lambda word: (word, 1))
    unique_words_with_count = count_once.reduceByKey(lambda a, b: a + b, numPartitions = n_partitions).cache()
    
    # output stats on core corpus
    print "total words processed: %i" % unique_words_with_count.map(lambda (k, v): v).reduce(lambda a, b: a + b)
    print "total unique words in corpus: %i" % unique_words_with_count.count()
    
    ############
    #
    # generate deletes list
    #
    ############
    
    # generate list of n-deletes from words in a corpus of the form: [(word1, count1), (word2, count2), ...]
    # we will handle possible duplicates after map/reduce:
    #     our thinking is the resulting suggestions lists for each delete will be much smaller than the
    #     list of potential deletes, and it is more efficient to reduce first, then remove duplicates 
    #     from these smaller lists (at each worker node), rather than calling `distinct()` on  
    #     flattened `expand_deletes` which would require a large shuffle

    ##
    ## generate 1-deletes
    ##
     
    assert max_edit_distance>0  
    
    generate_deletes = unique_words_with_count.map(lambda (parent, count): (parent, get_deletes_list(parent)), 
                                                      preservesPartitioning=True)
    expand_deletes = generate_deletes.flatMapValues(lambda x: x)
    
    # swap and combine, resulting RDD after processing 1-deletes has elements:
    # [(delete1, [correct1, correct2...]), (delete2, [correct1, correct2...])...]
    swap = expand_deletes.map(lambda (orig, delete): (delete, [orig]))
    combine = swap.reduceByKey(lambda a, b: a + b, numPartitions = n_partitions)

    # cache "master" deletes RDD, list of (deletes, [unique suggestions]), for use in loop
    deletes = combine.mapValues(lambda sl: list(set(sl))).cache()
    
    ##
    ## generate 2+ deletes
    ##
    
    d_remaining = max_edit_distance - 1  # decreasing counter
    queue = deletes

    while d_remaining>0:

        # generate further deletes -- we parallelize processing of all deletes in this version
        # 'expand_new_deletes' will be of the form [(parent "delete", [new child "deletes"]), ...]
        # n.b. this will filter out elements with no new child deletes
        gen_new_deletes = queue.map(lambda (x, y): (x, get_deletes_list(x)), preservesPartitioning=True)
        expand_new_deletes = gen_new_deletes.flatMapValues(lambda x: x)  

        # associate each new child delete with same corpus word suggestions that applied for parent delete
        # update queue with [(new child delete, [corpus suggestions]) ...] and cache for next iteration
        
        assert copartitioned(queue, expand_new_deletes)   # check partitioning for efficient join
        get_sugglist_from_parent = expand_new_deletes.join(queue)
        new_deletes = get_sugglist_from_parent.map(lambda (p, (c, sl)): (c, sl))
        combine_new = new_deletes.reduceByKey(lambda a, b: a + b, numPartitions = n_partitions)
        queue = combine_new.mapValues(lambda sl: list(set(sl))).cache()

        # update "master" deletes list with new deletes, and cache for next iteration
        
        assert copartitioned(deletes, queue)    # check partitioning for efficient join
        join_delete_lists = deletes.fullOuterJoin(queue)
        deletes = join_delete_lists.mapValues(lambda y: combine_joined_lists(y)).cache()

        d_remaining -= 1
        
    ############
    #
    # merge deletes with unique corpus words to construct main dictionary
    #
    ############

    # dictionary entries are in the form: (list of suggested corrections, frequency of word in corpus)
    # note frequency of word in corpus is not incremented for deletes
    deletes_for_dict = deletes.mapValues(lambda sl: (sl, 0)) 
    unique_words_for_dict = unique_words_with_count.mapValues(lambda count: ([], count))

    assert copartitioned(unique_words_for_dict, deletes_for_dict)  # check partitioning for efficient join
    join_deletes = unique_words_for_dict.fullOuterJoin(deletes_for_dict)
    '''
    entries now in form of (word, ( ([], count), ([suggestions], 0) )) for words in both corpus/deletes
                           (word, ( ([], count), None               )) for (real) words in corpus only
                           (word, ( None       , ([suggestions], 0) )) for (fake) words in deletes only
    '''

    # if entry has deletes and is a real word, take suggestion list from deletes and count from corpus
    dictionary_RDD = join_deletes.mapValues(lambda (xtup, ytup): 
                                                xtup if ytup is None
                                                else ytup if xtup is None
                                                else (ytup[0], xtup[1])).cache()

    print "total items in dictionary (corpus words and deletions): %i" % dictionary_RDD.count()
    print "  edit distance for deletions: %i" % max_edit_distance
    longest_word_length = unique_words_with_count.map(lambda (k, v): len(k)).reduce(max)
    print "  length of longest word in corpus: %i" % longest_word_length
        
    return dictionary_RDD, longest_word_length
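
The delete-generation stages above can be hard to follow through the chain of RDD transformations. The following is a plain-Python stand-in, assuming no Spark installation, that mimics what the `map` / `flatMapValues` / swap / `reduceByKey` sequence computes for 1-deletes, and what one pass of the `while` loop adds. The helper names `one_deletes`, `invert_deletes` and `next_level` are hypothetical, and nothing here is distributed.

```python
from collections import defaultdict


def one_deletes(word):
    """All strings with exactly one character removed (none for 1-letter words)."""
    if len(word) > 1:
        return [word[:i] + word[i + 1:] for i in range(len(word))]
    return []


def invert_deletes(words):
    """Emulate map -> flatMapValues -> swap -> reduceByKey -> set() for 1-deletes."""
    grouped = defaultdict(set)
    for parent in words:                    # one record per unique corpus word
        for delete in one_deletes(parent):  # flatMapValues
            grouped[delete].add(parent)     # swap, reduceByKey and de-duplicate
    return {delete: sorted(parents) for delete, parents in grouped.items()}


def next_level(level):
    """Emulate one loop pass: child deletes inherit their parent's suggestions."""
    grouped = defaultdict(set)
    for delete, parents in level.items():
        for child in one_deletes(delete):
            grouped[child].update(parents)
    return {child: sorted(parents) for child, parents in grouped.items()}


first_level = invert_deletes(["the", "then", "hen"])
second_level = next_level(first_level)
```

Here `first_level["he"]` lists both "hen" and "the", while `second_level["he"]` lists "then", which reaches "he" only after two deletes. The full outer join in the loop is what merges those two views into one suggestion list.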

In [None]:
## %%time
## test dictionary creation
## d, lwl = parallel_create_dictionary("/Users/K-Lo/Desktop/big.txt")
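
The last step of `parallel_create_dictionary` merges corpus words with deletes through a full outer join. As a rough illustration of that merge rule only, the sketch below uses ordinary dicts in place of RDDs. `full_outer_join` and `merge_entry` are hypothetical helpers, not PySpark calls, and the toy inputs are invented for the example.

```python
def full_outer_join(left, right):
    """Pair up values by key; a side that lacks the key contributes None."""
    keys = set(left) | set(right)
    return {k: (left.get(k), right.get(k)) for k in keys}


def merge_entry(corpus_side, deletes_side):
    """Combine the two sides of the join for one key.

    corpus_side is ([], count) or None; deletes_side is (suggestions, 0) or None.
    """
    if deletes_side is None:
        return corpus_side   # real word with no longer word pointing at it
    if corpus_side is None:
        return deletes_side  # pure delete, count stays 0
    # real word that is also a delete: suggestions from deletes, count from corpus
    return (deletes_side[0], corpus_side[1])


corpus = {"the": ([], 2), "then": ([], 1)}
deletes = {"the": (["then"], 0), "th": (["the"], 0)}

joined = full_outer_join(corpus, deletes)
merged = {k: merge_entry(x, y) for k, (x, y) in joined.items()}
```

The three branches of `merge_entry` correspond to the three entry shapes listed in the comment block inside the function above.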

In [None]:
def dameraulevenshtein(seq1, seq2):
    """Calculate the Damerau-Levenshtein distance (an integer) between sequences.

    This code has not been modified from the original.
    Source: http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/
    
    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]

def get_n_deletes_list(w, n):
    '''given a word, derive strings with up to n characters deleted'''
    deletes = []
    queue = [w]
    for d in range(n):
        temp_queue = []
        for word in queue:
            if len(word)>1:
                for c in range(len(word)):  # character index
                    word_minus_c = word[:c] + word[c+1:]
                    if word_minus_c not in deletes:
                        deletes.append(word_minus_c)
                    if word_minus_c not in temp_queue:
                        temp_queue.append(word_minus_c)
        queue = temp_queue
        
    return deletes

def get_suggestions(s, dictRDD, longest_word_length=float('inf'), silent=False):
    '''return list of suggested corrections for potentially incorrectly spelled word.
    
    s: input string
    dictRDD: the main dictionary, which includes deletes
             entries are in the form of: [(word, ([suggested corrections], frequency of word in corpus)), ...]
    longest_word_length: optional length of the longest real word in dictRDD
    silent: if True, suppress verbose output
    '''

    if (len(s) - longest_word_length) > max_edit_distance:
        if not silent:
            print "no items in dictionary within maximum edit distance"
        return []

    ##########
    #
    # initialize suggestions RDD
    # suggestRDD entries: (word, (frequency of word in corpus, edit distance))
    #
    ##########
    
    if not silent:
        print "looking up suggestions based on input word..."
    
    # ensure input RDDs are partitioned
    dictRDD = dictRDD.partitionBy(n_partitions).cache()
    
    # check if input word is in dictionary, and is a word from the corpus (edit distance = 0)
    # if so, add input word itself to suggestRDD
    exact_match = dictRDD.filter(lambda (w, (sl, freq)): w==s).cache()
    suggestRDD = exact_match.mapValues(lambda (sl, freq): (freq, 0)).cache()

    ##########
    #
    # add suggestions for input word
    #
    ##########
    
    # the suggested corrections for the item in dictionary (whether or not
    # the input string s itself is a valid word or merely a delete) can be valid corrections
    sc_items = exact_match.flatMap(lambda (w, (sl, freq)): sl)
    calc_dist = sc_items.map(lambda sc: (sc, len(sc)-len(s))).partitionBy(n_partitions).cache()
    
    assert copartitioned(dictRDD, calc_dist)  # check partitioning for efficient join
    get_freq = dictRDD.join(calc_dist)
    parent_sugg = get_freq.mapValues(lambda ((sl, freq), dist): (freq, dist))
    suggestRDD = suggestRDD.union(parent_sugg).cache()
    assert copartitioned(parent_sugg, suggestRDD)  # check partitioning

    ##########
    #
    # process deletes on the input string
    #
    ##########
     
    assert max_edit_distance>0
    
    list_deletes_of_s = sc.parallelize(get_n_deletes_list(s, max_edit_distance))
    deletes_of_s = list_deletes_of_s.map(lambda k: (k, 0)).partitionBy(n_partitions).cache()
    
    assert copartitioned(dictRDD, deletes_of_s) # check partitioning for efficient join
    check_matches = dictRDD.join(deletes_of_s).cache()
    
    # if delete is a real word in corpus, add it to suggestion list
    del_exact_match = check_matches.filter(lambda (w, ((sl, freq), _)): freq>0)
    del_sugg = del_exact_match.map(lambda (w, ((s1, freq), _)): (w, (freq, len(s)-len(w))),
                                   preservesPartitioning=True)
    suggestRDD = suggestRDD.union(del_sugg).cache()
    
    # the suggested corrections for the item in dictionary (whether or not
    # the delete itself is a valid word or merely a delete) can be valid corrections    
    list_sl = check_matches.mapValues(lambda ((sl, freq), _): sl).flatMapValues(lambda x: x)
    swap_del = list_sl.map(lambda (w, sc): (sc, 0))
    combine_del = swap_del.reduceByKey(lambda a, b: a + b, numPartitions = n_partitions).cache()

    # need to recalculate actual Damerau-Levenshtein distance to be within max_edit_distance for all deletes
    calc_dist = combine_del.map(lambda (w, _): (w, dameraulevenshtein(s, w)),
                                       preservesPartitioning=True)
    filter_by_dist = calc_dist.filter(lambda (w, dist): dist<=max_edit_distance)
    
    # get frequencies from main dictionary and add to suggestions list
    assert copartitioned(dictRDD, filter_by_dist)  # check partitioning for efficient join
    get_freq = dictRDD.join(filter_by_dist)
    del_parent_sugg = get_freq.mapValues(lambda ((sl, freq), dist): (freq, dist))
    
    suggestRDD = suggestRDD.union(del_parent_sugg).distinct().cache()    
    
    if not silent:
        print "number of possible corrections: %i" %suggestRDD.count()
        print "  edit distance for deletions: %i" % max_edit_distance

    ##########
    #
    # sort RDD for output
    #
    ##########
    
    # suggestRDD is in the form: [(word, (freq, editdist)), (word, (freq, editdist)), ...]
    # there does not seem to be a straightforward way to sort by both primary and secondary keys in Spark
    # this is a documented issue: one option is to simply work with a list since there are likely not
    # going to be an extremely large number of recommended suggestions
    
    output = suggestRDD.collect()
    
    # output option 1
    # sort results by ascending order of edit distance and descending order of frequency
    #     and return list of suggested corrections only:
    # return sorted(output, key = lambda x: (suggest_dict[x][1], -suggest_dict[x][0]))

    # output option 2
    # return list of suggestions with (correction, (frequency in corpus, edit distance)):
    # return sorted(output, key = lambda (term, (freq, dist)): (dist, -freq))

    return sorted(output, key = lambda (term, (freq, dist)): (dist, -freq))

def best_word(s, d, l, silent=False):
    a = get_suggestions(s, d, l, silent)
    if len(a)==0:
        return (None, (None, None))
    else: 
        return a[0]
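
The ranking rule used at the end of `get_suggestions` (edit distance ascending, then corpus frequency descending) can be illustrated on a plain list. This is a minimal sketch, not the notebook's own code: `rank_key` is a hypothetical helper name, and the sample tuples are copied from the example output for "there". A named key function is used because tuple-unpacking lambdas such as `lambda (term, (freq, dist)): ...` are valid only in Python 2.

```python
# Sample (word, (freq, dist)) tuples, copied from the example output for "there".
suggestions = [
    ('the', (80030, 2)),
    ('these', (1231, 1)),
    ('there', (2972, 0)),
    ('where', (977, 1)),
]

def rank_key(item):
    """Sort key: edit distance ascending, then corpus frequency descending."""
    word, (freq, dist) = item
    return (dist, -freq)

# Negating the frequency turns the descending secondary order into an
# ascending one, so a single sorted() call handles both keys.
ranked = sorted(suggestions, key=rank_key)

# Same fallback as best_word: a placeholder tuple when there are no suggestions.
best = ranked[0] if ranked else (None, (None, None))
```

With these four tuples, `ranked` starts with the exact match, and "the" comes last despite its high frequency because its edit distance is larger.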

<div class="alert alert-danger">
  <strong>Run the cell below only once to build the dictionary.</strong>
</div>

In [None]:
%%time
d, lwl = parallel_create_dictionary("/Users/K-Lo/Desktop/big.txt")

<div class="alert alert-success">
  <strong>Enter word to correct below.</strong>
</div>

In [None]:
%%time
get_suggestions("there", d, lwl)

In [None]:
%%time
best_word("there", d, lwl)

------

### 3. Optimized SPARK version Performance (FASTER)

<div class="alert alert-danger">
  <strong>To run this program, restart notebook, and start executing the cells of this section starting here.</strong>
</div>

In [1]:
'''
v 2.0 last revised 26 Nov 2015

This program is a Spark (PySpark) version of a spellchecker based on SymSpell, 
a Symmetric Delete spelling correction algorithm developed by Wolf Garbe 
and originally written in C#.

From the original SymSpell documentation:

"The Symmetric Delete spelling correction algorithm reduces the complexity 
 of edit candidate generation and dictionary lookup for a given Damerau-
 Levenshtein distance. It is six orders of magnitude faster and language 
 independent. Opposite to other algorithms only deletes are required, 
 no transposes + replaces + inserts. Transposes + replaces + inserts of the 
 input term are transformed into deletes of the dictionary term.
 Replaces and inserts are expensive and language dependent: 
 e.g. Chinese has 70,000 Unicode Han characters!"

For further information on SymSpell, please consult the original documentation:
  URL: blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/
  Description: blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/

The current version of this program will output all possible suggestions for
corrections up to an edit distance (configurable) of max_edit_distance = 3. 

Future improvements may entail allowing for less verbose options, 
including the output of a single recommended correction. Note also that we
have generally kept to the form of the original program, and have not
introduced any major optimizations or structural changes in this PySpark port.

To execute program:
1. Ensure "big.txt" is in the current working directory. This is the corpus
   from which the dictionary for the spellchecker will be built.
2. Check recommended single word corrections by executing
   get_suggestions("word", d, lwl) in the corresponding marked box below,
   where d and lwl are the values returned by parallel_create_dictionary.

Note: we did not implement entire document checking given the speed of the
      program, since we do not stop early after finding a best word with
      minimum edit distance (however, see context-based version).

################

Example input/output:

################

get_suggestions("there", d, lwl)

number of possible corrections: 604
  edit distance for deletions: 3
  
[('there', (2972, 0)),
 ('these', (1231, 1)),
 ('where', (977, 1)),
 ('here', (691, 1)),
 ('three', (584, 1)),
 ('thee', (26, 1)),
 ('chere', (9, 1)),
 ('theme', (8, 1)),
 ('the', (80030, 2)), ...


'''

import findspark
import os
findspark.init('/Users/K-Lo/spark-1.5.0')

from pyspark import SparkContext
sc = SparkContext()

import re

n_partitions = 6  # number of partitions to be used
max_edit_distance = 3

# helper functions
def get_n_deletes_list(w, n):
    '''given a word, derive list of strings with up to n characters deleted'''
    # since this list is generally of the same magnitude as the number of 
    # characters in a word, it may not make sense to parallelize this
    # so we use python to create the list
    deletes = []
    queue = [w]
    for d in range(n):
        temp_queue = []
        for word in queue:
            if len(word)>1:
                for c in range(len(word)):  # character index
                    word_minus_c = word[:c] + word[c+1:]
                    if word_minus_c not in deletes:
                        deletes.append(word_minus_c)
                    if word_minus_c not in temp_queue:
                        temp_queue.append(word_minus_c)
        queue = temp_queue
        
    return deletes
    
def copartitioned(RDD1, RDD2):
    '''check if two RDDs are copartitioned'''
    return RDD1.partitioner == RDD2.partitioner

def combine_joined_lists(tup):
    '''takes as input a tuple in the form (a, b) where each of a, b may be None (but not both) or a list
       and returns a concatenated list of unique elements'''
    concat_list = []
    if tup[1] is None:
        concat_list = tup[0]
    elif tup[0] is None:
        concat_list = tup[1]
    else:
        concat_list = tup[0] + tup[1]
        
    return list(set(concat_list))

def parallel_create_dictionary(fname):
    '''Create dictionary using Spark RDDs.'''
    # we generate and count all words for the corpus,
    # then add deletes to the dictionary
    # this is a slightly different approach from the SymSpell algorithm
    # that may be more appropriate for Spark processing
    
    print "Creating dictionary..." 
    
    ############
    #
    # process corpus
    #
    ############
    
    # http://stackoverflow.com/questions/22520932/python-remove-all-non-alphabet-chars-from-string
    regex = re.compile('[^a-z ]')

    # convert file into one long sequence of words
    make_all_lower = sc.textFile(fname).map(lambda line: line.lower())
    replace_nonalphs = make_all_lower.map(lambda line: regex.sub(' ', line))
    all_words = replace_nonalphs.flatMap(lambda line: line.split())

    # create core corpus dictionary (i.e. only words appearing in file, no "deletes") and cache it
    # output RDD of unique_words_with_count: [(word1, count1), (word2, count2), (word3, count3)...]
    count_once = all_words.map(lambda word: (word, 1))
    unique_words_with_count = count_once.reduceByKey(lambda a, b: a + b, numPartitions = n_partitions).cache()
    
    # output stats on core corpus
    print "total words processed: %i" % unique_words_with_count.map(lambda (k, v): v).reduce(lambda a, b: a + b)
    print "total unique words in corpus: %i" % unique_words_with_count.count()
    
    ############
    #
    # generate deletes list
    #
    ############
    
    # generate list of n-deletes from words in a corpus of the form: [(word1, count1), (word2, count2), ...]
     
    assert max_edit_distance>0  
    
    generate_deletes = unique_words_with_count.map(lambda (parent, count): 
                                                   (parent, get_n_deletes_list(parent, max_edit_distance)))
    expand_deletes = generate_deletes.flatMapValues(lambda x: x)
    swap = expand_deletes.map(lambda (orig, delete): (delete, ([orig], 0)))
   
    ############
    #
    # combine delete elements with main dictionary
    #
    ############
    
    corpus = unique_words_with_count.mapValues(lambda count: ([], count))
    combine = swap.union(corpus)  # combine deletes with main dictionary; duplicate keys are merged by reduceByKey below
    new_dict = combine.reduceByKey(lambda a, b: (a[0]+b[0], a[1]+b[1])).cache()
    
    print "total items in dictionary (corpus words and deletions): %i" % new_dict.count()
    print "  edit distance for deletions: %i" % max_edit_distance
    longest_word_length = unique_words_with_count.map(lambda (k, v): len(k)).reduce(max)
    print "  length of longest word in corpus: %i" % longest_word_length

    return new_dict, longest_word_length    
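
The symmetric delete idea behind `get_n_deletes_list` can be seen without Spark: two words that are close in edit distance tend to share at least one delete. The sketch below is an illustrative alternative, not the notebook's implementation; `deletes_up_to` is a hypothetical name, and it uses `itertools.combinations` and a set instead of the queue-based loop. Like the notebook's helper, it never deletes a word down to the empty string.

```python
from itertools import combinations

def deletes_up_to(word, n):
    """Return the set of strings formed by deleting 1..n characters from word."""
    results = set()
    for k in range(1, n + 1):
        if k >= len(word):
            break  # stop before the word would shrink to the empty string
        # choose which character positions to keep, preserving their order
        for keep in combinations(range(len(word)), len(word) - k):
            results.add(''.join(word[i] for i in keep))
    return results

# "there" and "three" differ by one transposition; with a single delete each,
# they already meet at the same strings.
shared = deletes_up_to('there', 1) & deletes_up_to('three', 1)
```

Here `shared` contains "thee" and "thre", which is why looking up the deletes of an input word in a dictionary of deletes can surface "three" as a candidate for "there".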

In [None]:
## %%time
## test dictionary creation
## d, lwl = parallel_create_dictionary("/Users/K-Lo/Desktop/big.txt")
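
The merge step in `parallel_create_dictionary` pairs every delete with `([parent], 0)` and every corpus word with `([], count)`, then reduces by key so the lists are concatenated and the counts summed. A rough pure-Python emulation of that data layout, assuming an edit distance of 1 to keep it small, might look like this. The names `single_deletes` and `build_dictionary` are hypothetical and do not appear in the notebook.

```python
from collections import Counter

def single_deletes(word):
    """Strings formed by deleting exactly one character (edit distance 1 only)."""
    if len(word) <= 1:
        return set()
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def build_dictionary(words):
    """Map each term to (parent words it is a delete of, corpus frequency)."""
    counts = Counter(words)
    dictionary = {}
    # corpus words enter with an empty suggestion list and their count
    for word, count in counts.items():
        dictionary[word] = ([], count)
    # each delete adds its parent to the list and leaves the frequency as is,
    # which mirrors reducing ([parent], 0) into the existing entry
    for parent in counts:
        for delete in single_deletes(parent):
            parents, freq = dictionary.get(delete, ([], 0))
            dictionary[delete] = (parents + [parent], freq)
    return dictionary

toy = build_dictionary("the he the hen".split())
```

In `toy`, the entry for "he" carries both a non-zero frequency (it is a real word) and the parents "the" and "hen" (it is also a delete of each), while "th" has frequency 0 because it appears only as a delete.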

In [3]:
def dameraulevenshtein(seq1, seq2):
    """Calculate the Damerau-Levenshtein distance (an integer) between sequences.

    This code has not been modified from the original.
    Source: http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/
    
    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]

def get_suggestions(s, dictRDD, longest_word_length=float('inf'), silent=False):
    '''return list of suggested corrections for potentially incorrectly spelled word.
    
    s: input string
    dictRDD: the main dictionary, which includes deletes
             entries are in the form of: [(word, ([suggested corrections], frequency of word in corpus)), ...]
    longest_word_length: optional length of the longest real word in dictRDD
    silent: if True, suppress progress messages
    '''

    if (len(s) - longest_word_length) > max_edit_distance:
        if not silent:
            print "no items in dictionary within maximum edit distance"
        return []

    ##########
    #
    # initialize suggestions RDD
    # suggestRDD entries: (word, (frequency of word in corpus, edit distance))
    #
    ##########
    
    if not silent:
        print "looking up suggestions based on input word..."
    
    # ensure input RDDs are partitioned
    dictRDD = dictRDD.repartitionAndSortWithinPartitions(n_partitions).cache()
    
    # check if input word is in dictionary, and is a word from the corpus (edit distance = 0)
    # if so, add input word itself to suggestRDD
    exact_match = dictRDD.filter(lambda (w, (sl, freq)): w==s).cache()
    suggestRDD = exact_match.mapValues(lambda (sl, freq): (freq, 0))
    
    ##########
    #
    # add suggestions for input word
    #
    ##########

    # the suggested corrections for the item in dictionary (whether or not
    # the input string s itself is a valid word or merely a delete) can be valid corrections
    # the suggestions list will likely be short: it is only for one word
    # so we choose to collect here and process as a list, rather than parallelizing 
    # a very short list
    sc_items = exact_match.flatMap(lambda (w, (sl, freq)): sl).collect()  
    get_freq = dictRDD.filter(lambda (w, (sl, freq)): w in sc_items)
    parent_sugg = get_freq.map(lambda (w, (sl, freq)): (w, (freq, len(w)-len(s))), 
                                   preservesPartitioning=True)
    suggestRDD = suggestRDD.union(parent_sugg).cache()
    assert copartitioned(parent_sugg, suggestRDD)  # check partitioning

    ##########
    #
    # process deletes of the input string
    #
    ##########
     
    assert max_edit_distance>0
    
    list_deletes_of_s = get_n_deletes_list(s, max_edit_distance)  # this list is also short
    check_matches = dictRDD.filter(lambda (w, (sl, freq)): w in list_deletes_of_s).cache()

    # identify deletes that match a dictionary entry, and add matches to suggestions
    del_exact_match = check_matches.filter(lambda (w, (sl, freq)): freq>0)
    del_sugg = del_exact_match.map(lambda (w, (s1, freq)): (w, (freq, len(s)-len(w))),
                                   preservesPartitioning=True)
    suggestRDD = suggestRDD.union(del_sugg).cache()
    
    ##########
    #
    # now process suggestions lists of deletes
    #
    ##########

    # the suggested corrections for the item in dictionary (whether or not
    # the delete itself is a valid word or merely a delete) can be valid corrections 
    list_sl = check_matches.mapValues(lambda (sl, freq): sl).flatMapValues(lambda x: x)
    swap_del = list_sl.map(lambda (w, sc): (sc, 0))
    combine_del = swap_del.reduceByKey(lambda a, b: a + b, numPartitions = n_partitions).cache()

    # need to recalculate actual Damerau-Levenshtein distance to be within max_edit_distance 
    # for all deletes and check against the threshold value
    calc_dist = combine_del.map(lambda (w, _): (w, dameraulevenshtein(s, w)),
                                       preservesPartitioning=True)
    filter_by_dist = calc_dist.filter(lambda (w, dist): dist<=max_edit_distance)
    
    # MERGE: get frequencies from main dictionary and add to suggestions list
    assert copartitioned(dictRDD, filter_by_dist)  # check partitioning for efficient join
    get_freq = dictRDD.join(filter_by_dist)
    del_parent_sugg = get_freq.mapValues(lambda ((sl, freq), dist): (freq, dist)).cache()
    
    suggestRDD = suggestRDD.union(del_parent_sugg).distinct().cache()    

    ##########
    #
    # output suggestions
    #
    ##########
    
    if not silent:
        print "number of possible corrections: %i" % suggestRDD.count()
        print "  edit distance for deletions: %i" % max_edit_distance

    output = suggestRDD.collect()
    
    # suggestRDD is in the form: [(word, (freq, editdist)), (word, (freq, editdist)), ...]
    # sorting by both a primary and a secondary key did not seem straightforward in Spark
    # (this is a documented issue), so we collect and sort a plain Python list instead;
    # the number of suggested corrections is unlikely to be extremely large
    
    # output option 1
    # sort results by ascending order of edit distance and descending order of frequency
    #     and return list of suggested corrections only:
    # return sorted(output, key = lambda x: (suggest_dict[x][1], -suggest_dict[x][0]))

    # output option 2
    # return list of suggestions with (correction, (frequency in corpus, edit distance)):
    # return sorted(output, key = lambda (term, (freq, dist)): (dist, -freq))

    return sorted(output, key = lambda (term, (freq, dist)): (dist, -freq))


def best_word(s, d, l, silent=False):
    a = get_suggestions(s, d, l, silent)
    if len(a)==0:
        return (None, (None, None))
    else: 
        return a[0]
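
The recalculation step in `get_suggestions` matters because sharing a delete does not guarantee that two words are within `max_edit_distance` of each other: each word may need up to `max_edit_distance` deletes to reach the shared string. The sketch below restates the same distance algorithm with `range` in place of `xrange` so it also runs under Python 3, then applies the threshold filter to a plain list. The names `damerau_levenshtein` and `within_distance` are hypothetical, and the strings "xyzabc" and "abcdef" are made-up inputs chosen only because they share the delete "abc".

```python
def damerau_levenshtein(seq1, seq2):
    """Edit distance with adjacent transpositions, keeping three rows at a time."""
    oneago = None
    thisrow = list(range(1, len(seq2) + 1)) + [0]
    for x in range(len(seq1)):
        # the leftmost column is stored at the end of each row, so index -1 works
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in range(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # transposition of two adjacent characters
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                    and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]

def within_distance(s, candidates, max_dist):
    """Keep only candidates whose true distance to s is at most max_dist."""
    distances = {w: damerau_levenshtein(s, w) for w in candidates}
    return {w: d for w, d in distances.items() if d <= max_dist}

# Both candidates share a delete with "xyzabc" after at most three deletions,
# but only one of them is actually within distance 3.
kept = within_distance('xyzabc', ['abcdef', 'xyzabd'], 3)
```

Only "xyzabd" survives the filter here; "abcdef" is reachable through the shared delete "abc" but is too far from the input once the real distance is computed.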

<div class="alert alert-danger">
  <strong>Run the cell below only once to build the dictionary.</strong>
</div>

In [2]:
%%time
d, lwl = parallel_create_dictionary("/Users/K-Lo/Desktop/big.txt")

Creating dictionary...
total words processed: 1105285
total unique words in corpus: 29157
total items in dictionary (corpus words and deletions): 2151998
  edit distance for deletions: 3
  length of longest word in corpus: 18
CPU times: user 57.2 ms, sys: 14.2 ms, total: 71.4 ms
Wall time: 1min 1s


<div class="alert alert-success">
  <strong>Enter word to correct below.</strong>
</div>

In [4]:
%%time
get_suggestions("there", d, lwl)

looking up suggestions based on input word...
number of possible corrections: 604
  edit distance for deletions: 3
CPU times: user 59.4 ms, sys: 15.1 ms, total: 74.4 ms
Wall time: 1min 9s


[(u'there', (2972, 0)),
 (u'these', (1231, 1)),
 (u'where', (977, 1)),
 (u'here', (691, 1)),
 (u'three', (584, 1)),
 (u'thee', (26, 1)),
 (u'chere', (9, 1)),
 (u'theme', (8, 1)),
 (u'the', (80030, 2)),
 (u'her', (5284, 2)),
 (u'were', (4289, 2)),
 (u'they', (3938, 2)),
 (u'their', (2955, 2)),
 (u'them', (2241, 2)),
 (u'then', (1558, 2)),
 (u'other', (1502, 2)),
 (u'those', (1201, 2)),
 (u'others', (410, 2)),
 (u'third', (239, 2)),
 (u'term', (133, 2)),
 (u'threw', (96, 2)),
 (u'mere', (79, 2)),
 (u'theory', (79, 2)),
 (u'share', (69, 2)),
 (u'hero', (55, 2)),
 (u'tree', (42, 2)),
 (u'hare', (36, 2)),
 (u'thereby', (32, 2)),
 (u'sphere', (31, 2)),
 (u'hers', (30, 2)),
 (u'thereof', (26, 2)),
 (u'cher', (25, 2)),
 (u'tore', (18, 2)),
 (u'herd', (15, 2)),
 (u'theirs', (14, 2)),
 (u'thiers', (13, 2)),
 (u'shore', (11, 2)),
 (u'thence', (10, 2)),
 (u'tete', (9, 2)),
 (u'sheer', (8, 2)),
 (u'adhere', (8, 2)),
 (u'ether', (8, 2)),
 (u'tver', (7, 2)),
 (u'therein', (6, 2)),
 (u'tier', (5, 2)),

In [None]:
%%time
best_word("there", d, lwl)