<div class="alert alert-danger">
  <strong>To run this program, restart the notebook and execute the cells of this section, starting here.</strong> <br><p>
  This version parallelizes spell checking across all the words in a document, using word-level correction. Since Spark does not permit RDD manipulation from within an RDD transformation (i.e. no parallelism within a parallel task), we converted the `get_suggestions` function, which acts on an individual word, to a serial method. This allows us to parallelize across the words in a document. <i>This is a reasonable trade-off when the number of words in a document is much larger than the number of suggestions likely to be found for any given word.</i> <br><p>
  Also note the (modified) `no_RDD_get_suggestions` function still returns an entire list of all possible suggestions to the calling function (e.g. for context checking), even if only the top match is used or required. Future improvements may be made to `no_RDD_get_suggestions` to terminate early once a "top" match (e.g. minimum edit distance) is found; a speedup in that function will in turn lead to a performance improvement of the document checking function as well.
</div>
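
As a hedged illustration of the "top match" point above: when only the best suggestion is needed, it can be picked in a single pass rather than by sorting the whole list. This does not stop the search early; it only avoids the final sort. The sketch is plain Python without Spark, `top_match` is a hypothetical helper that is not part of the notebook, and the sample tuples are a small excerpt of the `(word, (frequency, edit distance))` suggestions reported for "there" later in this section.

```python
def top_match(suggestions):
    """Return the best (word, (freq, dist)) tuple, or None for an empty list.

    Uses the same ordering as the notebook's final sort: smallest edit
    distance first, then highest corpus frequency.
    """
    if not suggestions:
        return None
    return min(suggestions, key=lambda item: (item[1][1], -item[1][0]))


# Excerpt of the suggestions reported for "there" (not the full list).
sample = [
    ("the", (80030, 2)),
    ("these", (1231, 1)),
    ("there", (2972, 0)),
    ("where", (977, 1)),
]

best = top_match(sample)
```

Here `best` is the entry for "there", since it is the only candidate at edit distance 0.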

In [2]:
'''
v 4.0 last revised 27 Nov 2015

This program is a Spark (PySpark) version of a spellchecker based on SymSpell, 
a Symmetric Delete spelling correction algorithm developed by Wolf Garbe 
and originally written in C#.

'''
import re

n_partitions = 6  # number of partitions to be used
max_edit_distance = 3

# helper functions

    
def copartitioned(RDD1, RDD2):
    '''check if two RDDs are copartitioned'''
    return RDD1.partitioner == RDD2.partitioner

def combine_joined_lists(tup):
    '''takes as input a tuple in the form (a, b) where each of a, b may be None (but not both) or a list
       and returns a concatenated list of unique elements'''
    concat_list = []
    if tup[1] is None:
        concat_list = tup[0]
    elif tup[0] is None:
        concat_list = tup[1]
    else:
        concat_list = tup[0] + tup[1]
        
    return list(set(concat_list))

def parallel_create_dictionary(fname):
    '''Create dictionary using Spark RDDs.'''
    # we generate and count all words for the corpus,
    # then add deletes to the dictionary
    # this is a slightly different approach from the SymSpell algorithm
    # that may be more appropriate for Spark processing
    
    print "Creating dictionary..." 
    
    ############
    #
    # process corpus
    #
    ############
    
    # http://stackoverflow.com/questions/22520932/python-remove-all-non-alphabet-chars-from-string
    regex = re.compile('[^a-z ]')

    # convert file into one long sequence of words
    make_all_lower = sc.textFile(fname).map(lambda line: line.lower())
    replace_nonalphs = make_all_lower.map(lambda line: regex.sub(' ', line))
    all_words = replace_nonalphs.flatMap(lambda line: line.split())

    # create core corpus dictionary (i.e. only words appearing in file, no "deletes") and cache it
    # output RDD of unique_words_with_count: [(word1, count1), (word2, count2), (word3, count3)...]
    count_once = all_words.map(lambda word: (word, 1))
    unique_words_with_count = count_once.reduceByKey(lambda a, b: a + b, numPartitions = n_partitions).cache()
    
    # output stats on core corpus
    print "total words processed: %i" % unique_words_with_count.map(lambda (k, v): v).reduce(lambda a, b: a + b)
    print "total unique words in corpus: %i" % unique_words_with_count.count()
    
    ############
    #
    # generate deletes list
    #
    ############
    
    # generate list of n-deletes from words in a corpus of the form: [(word1, count1), (word2, count2), ...]
     
    assert max_edit_distance>0  
    
    generate_deletes = unique_words_with_count.map(lambda (parent, count): 
                                                   (parent, get_n_deletes_list(parent, max_edit_distance)))
    expand_deletes = generate_deletes.flatMapValues(lambda x: x)
    swap = expand_deletes.map(lambda (orig, delete): (delete, ([orig], 0)))
   
    ############
    #
    # combine delete elements with main dictionary
    #
    ############
    
    corpus = unique_words_with_count.mapValues(lambda count: ([], count))
    combine = swap.union(corpus)  # combine deletes with main dictionary (duplicate keys are merged by the reduce below)
    
    ## since the dictionary will only be a lookup table once created, we can
    ## pass on as a Python dictionary rather than RDD by reducing locally and
    ## avoiding an extra shuffle from reduceByKey
    new_dict = combine.reduceByKeyLocally(lambda a, b: (a[0]+b[0], a[1]+b[1]))
    
    print "total items in dictionary (corpus words and deletions): %i" % len(new_dict)
    print "  edit distance for deletions: %i" % max_edit_distance
    longest_word_length = unique_words_with_count.map(lambda (k, v): len(k)).reduce(max)
    print "  length of longest word in corpus: %i" % longest_word_length

    return new_dict, longest_word_length    

def dameraulevenshtein(seq1, seq2):
    """Calculate the Damerau-Levenshtein distance (an integer) between sequences.

    This code has not been modified from the original.
    Source: http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/
    
    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]

def no_RDD_get_suggestions(s, masterdict, longest_word_length=float('inf'), silent=False):
    '''return list of suggested corrections for potentially incorrectly spelled word.
    
    Note: serial (non-RDD) version for Spark document correction.
    
    s: input string
    masterdict: the main dictionary (python dict), which includes deletes
             entries, is in the form of: {word: ([suggested corrections], 
                                                 frequency of word in corpus), ...}
    longest_word_length: optional length of the longest real word in masterdict
    silent: suppress verbose output (when True)
    '''

    if (len(s) - longest_word_length) > max_edit_distance:
        if not silent:
            print "no items in dictionary within maximum edit distance"
        return []

    ##########
    #
    # initialize suggestions list
    # suggestList entries: (word, (frequency of word in corpus, edit distance))
    #
    ##########
    
    if not silent:
        print "looking up suggestions based on input word..."
    
    suggestList = []
    
    # check if input word is in dictionary, and is a word from the corpus (edit distance = 0)
    # if so, add input word itself and its suggestions to suggestList
    
    if s in masterdict:
        init_sugg = []
        # dictionary values are in the form of ([suggestions], freq)
        if masterdict[s][1]>0:  # frequency>0 -> real corpus word
            init_sugg = [(s, (masterdict[s][1], 0))]

        # the suggested corrections for the item in dictionary (whether or not
        # the input string s itself is a valid word or merely a delete) can be 
        # valid corrections  -- essentially we serialize this portion since
        # the list of corrections tends to be very short
        
        add_sugg = [(sugg, (masterdict[sugg][1], len(sugg)-len(s))) 
                        for sugg in masterdict[s][0]]
        
        suggestList = init_sugg + add_sugg
        
    ##########
    #
    # process deletes on the input string 
    #
    ##########
     
    assert max_edit_distance>0
    
    list_deletes_of_s = get_n_deletes_list(s, max_edit_distance)  # this list is short
    
    # check that each delete is in the dictionary and is a real word
    add_sugg_2 = [(sugg, (masterdict[sugg][1], len(s)-len(sugg))) 
                      for sugg in list_deletes_of_s if ((sugg in masterdict) and
                                                        (masterdict[sugg][1]>0))]
    
    suggestList += add_sugg_2
        
    # check each item of suggestion list of all new-found suggestions 
    # the suggested corrections for any item in dictionary (whether or not
    # the delete itself is a valid word or merely a delete) can be valid corrections   
    # expand lists of list
    
    sugg_lists = [masterdict[sugg][0] for sugg in list_deletes_of_s if sugg in masterdict]
    list_sl = [(val, 0) for sublist in sugg_lists for val in sublist]
    combine_del = list(set((list_sl))) 

    # need to recalculate the actual Damerau-Levenshtein distance to be within 
    # max_edit_distance for all deletes; also check that suggestion is a real word
    filter_by_dist = []
    for item in combine_del:
        calc_dist = dameraulevenshtein(s, item[0])
        if (calc_dist<=max_edit_distance) and (item[0] in masterdict):
            filter_by_dist += [(item[0], calc_dist)]
        
    # get frequencies from main dictionary and add new suggestions to suggestions list
    suggestList += [(str(item[0]), (masterdict[item[0]][1], item[1]))
                            for item in filter_by_dist]
    
    output = list(set(suggestList))
    
    if not silent:
        print "number of possible corrections: %i" % len(output)
        print "  edit distance for deletions: %i" % max_edit_distance

    ##########
    #
    # optionally, sort list for output
    #
    ##########
    
    # output option 1
    # sort results by ascending order of edit distance and descending order of frequency
    #     and return list of suggested corrections only:
    # return sorted(output, key = lambda x: (suggest_dict[x][1], -suggest_dict[x][0]))

    # output option 2
    # return list of suggestions with (correction, (frequency in corpus, edit distance)):
    # return sorted(output, key = lambda (term, (freq, dist)): (dist, -freq))

    if len(output)>0:
        return sorted(output, key = lambda (term, (freq, dist)): (dist, -freq))
    else:
        return []
    
def correct_document(fname, d, lwl=float('inf'), printlist=True):
    '''Correct an entire document using word-level correction.
    
    Note: Uses a serial (non-RDD) version of an individual word checker. 
    
    fname: filename
    d: the main dictionary (python dict), which includes deletes
             entries, is in the form of: {word: ([suggested corrections], 
                                                 frequency of word in corpus), ...}
    lwl: optional length of the longest real word in d
    printlist: identify unknown words and words with error (default is True)
    '''
    
    # broadcast lookup dictionary to workers
    bd = sc.broadcast(d)
    
    print "Finding misspelled words in your document..." 
    
    # http://stackoverflow.com/questions/22520932/python-remove-all-non-alphabet-chars-from-string
    regex = re.compile('[^a-z ]')

    # convert file into one long sequence of words with the line index for reference
    make_all_lower = sc.textFile(fname).map(lambda line: line.lower()).zipWithIndex()
    replace_nonalphs = make_all_lower.map(lambda (line, index): (regex.sub(' ', line), index))
    flattened = replace_nonalphs.map(lambda (line, index): 
                                 [(i, index) for i in line.split()]).flatMap(list)
    
    # create RDD with (each word in document, corresponding line index) 
    # key value pairs and cache it
    all_words = flattened.partitionBy(n_partitions).cache()
    
    # check all words in parallel --  stores whole list of suggestions for each word
    get_corrections = all_words.map(lambda (w, index): 
                                    (w, (no_RDD_get_suggestions(w, bd.value, lwl, True), index)),
                                     preservesPartitioning=True).cache()
    
    # UNKNOWN words are words where the suggestion list is empty
    unknown_words = get_corrections.filter(lambda (w, (sl, index)): len(sl)==0)
    if printlist:
        print "    Unknown words (line number, word in text):"
        print unknown_words.map(lambda (w, (sl, index)): (index, str(w))).sortByKey().collect()
    
    # ERROR words are words where the word does not match the first tuple's word (top match)
    error_words = get_corrections.filter(lambda (w, (sl, index)): len(sl)>0 and w!=sl[0][0]) 
    if printlist:
        print "    Words with suggested corrections (line number, word in text, top match):"
        print error_words.map(lambda (w, (sl, index)): 
                                 (index, str(w) + " --> " +
                                         str(sl[0][0]))).sortByKey().collect()
    
    print "-----"
    print "total words checked: %i" % get_corrections.count()
    print "total unknown words: %i" % unknown_words.count()
    print "total potential errors found: %i" % error_words.count()

    return
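
The two filters in `correct_document` amount to a simple per-word rule, which can be sketched in plain Python without Spark. `classify` is a hypothetical helper, not part of the notebook, and the suggestion list shown for "thg" is illustrative only (the notebook reports "the" as its top match but does not print the list).

```python
def classify(word, suggestions):
    """Label a word from its sorted suggestion list.

    'unknown': no suggestions at all
    'error':   the top suggestion differs from the word itself
    'ok':      the word is its own top suggestion
    """
    if len(suggestions) == 0:
        return "unknown"
    if word != suggestions[0][0]:
        return "error"
    return "ok"


checked = [
    ("there", [("there", (2972, 0)), ("these", (1231, 1))]),
    ("thg", [("the", (80030, 1))]),  # illustrative list, not notebook output
    ("zzffttt", []),
]
labels = [(w, classify(w, sl)) for w, sl in checked]
```

Note that under this rule a valid but rare word is flagged as an error whenever a different word sorts ahead of it, which only happens if the word itself is absent from the corpus (a corpus word always has edit distance 0 to itself).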

<div class="alert alert-danger">
  <strong>Run the cell below only once to build the dictionary.</strong>
</div>

In [3]:
%%time
d, lwl = parallel_create_dictionary("testdata/big.txt")

Creating dictionary...
total words processed: 1105285
total unique words in corpus: 29157
total items in dictionary (corpus words and deletions): 2151998
  edit distance for deletions: 3
  length of longest word in corpus: 18
CPU times: user 11.6 s, sys: 1.19 s, total: 12.8 s
Wall time: 47.3 s


<div class="alert alert-success">
  <strong>Enter word to correct below.</strong>
</div>

In [4]:
%%time
no_RDD_get_suggestions("there", d, lwl)

looking up suggestions based on input word...
number of possible corrections: 604
  edit distance for deletions: 3
CPU times: user 60.2 ms, sys: 3.25 ms, total: 63.5 ms
Wall time: 61.7 ms


[('there', (2972, 0)),
 ('these', (1231, 1)),
 ('where', (977, 1)),
 ('here', (691, 1)),
 ('three', (584, 1)),
 ('thee', (26, 1)),
 ('chere', (9, 1)),
 ('theme', (8, 1)),
 ('the', (80030, 2)),
 ('her', (5284, 2)),
 ('were', (4289, 2)),
 ('they', (3938, 2)),
 ('their', (2955, 2)),
 ('them', (2241, 2)),
 ('then', (1558, 2)),
 ('other', (1502, 2)),
 ('those', (1201, 2)),
 ('others', (410, 2)),
 ('third', (239, 2)),
 ('term', (133, 2)),
 ('threw', (96, 2)),
 ('mere', (79, 2)),
 ('theory', (79, 2)),
 ('share', (69, 2)),
 ('hero', (55, 2)),
 ('tree', (42, 2)),
 ('hare', (36, 2)),
 (u'thereby', (32, 2)),
 ('sphere', (31, 2)),
 ('hers', (30, 2)),
 (u'thereof', (26, 2)),
 ('cher', (25, 2)),
 ('tore', (18, 2)),
 ('herd', (15, 2)),
 ('theirs', (14, 2)),
 ('thiers', (13, 2)),
 ('shore', (11, 2)),
 ('thence', (10, 2)),
 ('tete', (9, 2)),
 ('ether', (8, 2)),
 ('adhere', (8, 2)),
 ('sheer', (8, 2)),
 ('tver', (7, 2)),
 (u'therein', (6, 2)),
 ('herb', (5, 2)),
 ('cheer', (5, 2)),
 ('hire', (5, 2)),
 (

In [5]:
%%time
no_RDD_get_suggestions("zzffttt", d, lwl)

looking up suggestions based on input word...
number of possible corrections: 0
  edit distance for deletions: 3
CPU times: user 272 µs, sys: 97 µs, total: 369 µs
Wall time: 290 µs


[]

<div class="alert alert-success">
  <strong>Enter file name of document to correct below.</strong>
</div>

In [6]:
%%time
correct_document("testdata/OCRsample.txt", d, lwl)

Finding misspelled words in your document...
    Unknown words (line number, word in text):
[(11, 'oonipiittee'), (42, 'senbrnrgs'), (82, 'ghmhvestigat')]
    Words with suggested corrections (line number, word in text, top match):
[(3, 'taiths --> faith'), (13, 'gjpt --> get'), (13, 'tj --> to'), (13, 'mnnff --> snuff'), (15, 'bh --> by'), (15, 'uth --> th'), (15, 'unuer --> under'), (15, 'snc --> sac'), (20, 'mthiitt --> thirty'), (21, 'cas --> was'), (22, 'pythian --> scythian'), (26, 'brainin --> brain'), (27, 'jfl --> of'), (28, 'eug --> dug'), (28, 'stice --> stick'), (28, 'blaci --> black'), (28, 'ji --> i'), (28, 'debbs --> debts'), (29, 'nericans --> americans'), (30, 'ergs --> eggs'), (30, 'ainin --> again'), (31, 'trumped --> trumpet'), (32, 'erican --> american'), (33, 'thg --> the'), (33, 'nenance --> penance'), (33, 'unorthodox --> orthodox'), (34, 'rgs --> rags'), (34, 'sln --> son'), (38, 'eu --> e'), (38, 'williaij --> william'), (40, 'fcsf --> ff'), (40, 'ber --> be')

***

<div class="alert alert-info">
  <strong>START RUNNING CODE HERE</strong>
</div>

In [1]:
import math
import re

In [2]:
import findspark
import os
findspark.init()
import pyspark
sc = pyspark.SparkContext()
sc.setLogLevel('ERROR')

***
# Pre-processing

In [3]:
n_partitions = 6  # number of partitions to be used
max_edit_distance = 3

***

In [4]:
def get_n_deletes_list(w, n):
    '''given a word, derive list of strings with up to n characters deleted'''
    # since this list is generally of the same magnitude as the number of 
    # characters in a word, it may not make sense to parallelize this
    # so we use python to create the list
    deletes = []
    queue = [w]
    for d in range(n):
        temp_queue = []
        for word in queue:
            if len(word)>1:
                for c in range(len(word)):  # character index
                    word_minus_c = word[:c] + word[c+1:]
                    if word_minus_c not in deletes:
                        deletes.append(word_minus_c)
                    if word_minus_c not in temp_queue:
                        temp_queue.append(word_minus_c)
        queue = temp_queue
        
    return deletes
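
For a quick check of what this function produces, the following is a set-based sketch of the same idea in plain Python. `deletes_within` is a hypothetical variant, not the notebook's function: it returns an unordered set instead of a list, but is intended to contain the same strings.

```python
def deletes_within(word, n):
    """Return the set of strings obtained by deleting 1..n characters from word."""
    found = set()
    frontier = {word}
    for _ in range(n):
        next_frontier = set()
        for w in frontier:
            if len(w) > 1:  # never delete down to the empty string
                for i in range(len(w)):
                    next_frontier.add(w[:i] + w[i + 1:])
        found |= next_frontier
        frontier = next_frontier
    return found


one_delete = deletes_within("aided", 1)
up_to_three = deletes_within("aided", 3)
```

For "aided", one deletion gives five strings (`ided`, `aded`, `aied`, `aidd`, `aide`), and up to three deletions give 22 strings, which matches the list shown for `aided` in the `generate_deletes` output later in the notebook.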

***

In [5]:
############
#
# load file & initial processing
#
############

In [6]:
fname = "testdata/big.txt"

In [7]:
regex = re.compile('[^a-z ]')

In [8]:
make_all_lower = sc.textFile(fname).map(lambda line: line.lower())

In [9]:
print make_all_lower
print make_all_lower.getNumPartitions()
print make_all_lower.count()
print make_all_lower.take(5)

PythonRDD[2] at RDD at PythonRDD.scala:43
2
128457
[u'the project gutenberg ebook of the adventures of sherlock holmes', u'by sir arthur conan doyle', u'(#15 in our series by sir arthur conan doyle)', u'', u'copyright laws are changing all over the world. be sure to check the']


In [10]:
split_sentence = make_all_lower.flatMap(lambda line: line.split('.')).map(lambda sentence: regex.sub(' ', sentence)) \
            .map(lambda sentence: sentence.split())

In [11]:
print split_sentence
print split_sentence.getNumPartitions()
print split_sentence.count()
print split_sentence.take(5)

PythonRDD[5] at RDD at PythonRDD.scala:43
2
187129
[[u'the', u'project', u'gutenberg', u'ebook', u'of', u'the', u'adventures', u'of', u'sherlock', u'holmes'], [u'by', u'sir', u'arthur', u'conan', u'doyle'], [u'in', u'our', u'series', u'by', u'sir', u'arthur', u'conan', u'doyle'], [], [u'copyright', u'laws', u'are', u'changing', u'all', u'over', u'the', u'world']]


In [12]:
############
#
# generate start probabilities
#
############

In [13]:
start_words = split_sentence.map(lambda sentence: sentence[0] if len(sentence)>0 else None) \
    .filter(lambda word: word!=None)

In [14]:
print start_words
print start_words.getNumPartitions()
print start_words.count()
print start_words.take(5)

PythonRDD[8] at RDD at PythonRDD.scala:43
2
137073
[u'the', u'by', u'in', u'copyright', u'be']


In [15]:
accum_total_start_words = sc.accumulator(0)
count_start_words_once = start_words.map(lambda word: (word, 1))
count_total_start_words = count_start_words_once.foreach(lambda x: accum_total_start_words.add(1))
TOTAL_START_WORDS = float(accum_total_start_words.value)

In [16]:
print count_start_words_once
print count_start_words_once.getNumPartitions()
print count_start_words_once.count()
print count_start_words_once.take(5)

print 'Total start words:', TOTAL_START_WORDS

PythonRDD[12] at RDD at PythonRDD.scala:43
2
137073
[(u'the', 1), (u'by', 1), (u'in', 1), (u'copyright', 1), (u'be', 1)]
Total start words: 137073.0


In [17]:
unique_start_words = count_start_words_once.reduceByKey(lambda a, b: a + b, numPartitions = n_partitions)

In [18]:
print unique_start_words
print unique_start_words.getNumPartitions()
print unique_start_words.count()
print unique_start_words.take(5)

PythonRDD[19] at RDD at PythonRDD.scala:43
6
15297
[(u'aided', 3), (u'suicidal', 1), (u'desirable', 4), (u'all', 562), (u'yellow', 4)]


In [19]:
start_prob_calc = unique_start_words.map(lambda (k,v): (k, math.log(v/TOTAL_START_WORDS)))
DEFAULT_START_PROB = math.log(1/TOTAL_START_WORDS)

In [20]:
print start_prob_calc
print start_prob_calc.getNumPartitions()
print start_prob_calc.count()
print start_prob_calc.take(5)

print 'Default start probability:', DEFAULT_START_PROB

PythonRDD[22] at RDD at PythonRDD.scala:43
6
15297
[(u'aided', -10.729656620945079), (u'suicidal', -11.82826890961319), (u'desirable', -10.441974548493299), (u'all', -5.496767059719498), (u'yellow', -10.441974548493299)]
Default start probability: -11.8282689096


In [21]:
start_prob = start_prob_calc.collectAsMap()
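
The start probabilities above are natural-log relative frequencies. A minimal plain-Python sketch of the same arithmetic, using two counts and the total from the outputs above, is shown below. The lookup helper `start_log_prob` is hypothetical, and using `DEFAULT_START_PROB` as the fallback for words never seen at the start of a sentence is an assumption about how the default is meant to be used.

```python
import math

TOTAL_START_WORDS = 137073.0  # total reported by the notebook
start_counts = {"all": 562, "suicidal": 1}  # two counts from the output above

# log relative frequency of each sentence-initial word
start_prob = {w: math.log(c / TOTAL_START_WORDS) for w, c in start_counts.items()}

# assumed fallback: treat an unseen word as if it had been seen once
DEFAULT_START_PROB = math.log(1 / TOTAL_START_WORDS)


def start_log_prob(word):
    """Look up a start log-probability, falling back to the default."""
    return start_prob.get(word, DEFAULT_START_PROB)
```

With these numbers, a word seen exactly once as a sentence start (such as "suicidal") gets the same log-probability as an unseen word.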

In [22]:
############
#
# generate transition probabilities
#
############

In [23]:
def get_transitions(sentence):
    result = []
    if len(sentence)<2:
        return None
    else:
        for i in range(len(sentence)-1):
            result.append(((sentence[i], sentence[i+1]), 1))
        return result
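
A hedged alternative sketch of the same pairing step, in plain Python: `bigram_counts` is a hypothetical helper that pairs each word with its successor using `zip` and returns an empty list for sentences with fewer than two words. Returning `[]` rather than `None` would, under that assumption, allow it to be passed to `flatMap` directly without a separate `None` filter.

```python
def bigram_counts(sentence):
    """Return [((w1, w2), 1), ...] for each pair of consecutive words."""
    return [((a, b), 1) for a, b in zip(sentence, sentence[1:])]


pairs = bigram_counts(["by", "sir", "arthur", "conan", "doyle"])
```

A five-word sentence yields four transitions, one for each adjacent pair.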

In [24]:
accum_total_other_words = sc.accumulator(0)
other_words = split_sentence.map(lambda sentence: get_transitions(sentence)).filter(lambda x: x!=None). \
                flatMap(lambda x: x)
count_total_other_words = other_words.foreach(lambda x: accum_total_other_words.add(1))
TOTAL_OTHER_WORDS = float(accum_total_other_words.value)

In [25]:
print other_words
print other_words.getNumPartitions()
print other_words.count()
print other_words.take(5)

print 'Total other words', TOTAL_OTHER_WORDS

PythonRDD[26] at RDD at PythonRDD.scala:43
2
968212
[((u'the', u'project'), 1), ((u'project', u'gutenberg'), 1), ((u'gutenberg', u'ebook'), 1), ((u'ebook', u'of'), 1), ((u'of', u'the'), 1)]
Total other words 968212.0


In [26]:
unique_other_words = other_words.reduceByKey(lambda a, b: a + b, numPartitions = n_partitions)

In [27]:
print unique_other_words
print unique_other_words.getNumPartitions()
print unique_other_words.count()
print unique_other_words.take(5)

PythonRDD[33] at RDD at PythonRDD.scala:43
6
319665
[((u'so', u'was'), 5), ((u'mischievous', u'pang'), 1), ((u'gave', u'confused'), 1), ((u'getting', u'stouter'), 1), ((u'long', u'frock'), 1)]


In [28]:
other_words_collapsed = unique_other_words.map(lambda x: (x[0][0], (x[0][1], x[1]))).groupByKey().mapValues(dict)

In [29]:
print other_words_collapsed
print other_words_collapsed.getNumPartitions()
print other_words_collapsed.count()
print other_words_collapsed.take(5)

PythonRDD[40] at RDD at PythonRDD.scala:43
6
27224
[(u'bennigsens', {u'and': 1}), (u'aided', {u'the': 3, u'by': 12, u'augustus': 1}), (u'suicidal', {u'and': 2, u'cut': 1, u'or': 1, u'commented': 1}), (u'linsey', {u'woolseys': 1}), (u'unheeded', {u'to': 1, u'upon': 1})]


In [30]:
def map_transition_prob(x):
    vals = x[1]
    total = float(sum(vals.values()))
    probs = {k: math.log(v/total) for k, v in vals.items()}
    return (x[0], probs)
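
To make the normalization concrete, here is a plain-Python sketch of the same calculation applied to the successor counts of `aided` shown in the output above. `to_log_probs` is a hypothetical stand-alone version of the inner computation, operating on the counts dictionary only.

```python
import math


def to_log_probs(counts):
    """Convert {next_word: count} into {next_word: log(count / total)}."""
    total = float(sum(counts.values()))
    return {nxt: math.log(c / total) for nxt, c in counts.items()}


aided = to_log_probs({"the": 3, "by": 12, "augustus": 1})

# sanity check: exponentiating the log-probabilities should sum to 1
check_sum = sum(math.exp(lp) for lp in aided.values())
```

Because the counts are normalized per preceding word, a word with a single observed successor (such as `linsey`) gets a log-probability of 0.0 for that successor.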

In [31]:
transition_prob_calc = other_words_collapsed.map(lambda x: map_transition_prob(x))
DEFAULT_TRANSITION_PROB = math.log(1/TOTAL_OTHER_WORDS)

In [32]:
print transition_prob_calc
print transition_prob_calc.getNumPartitions()
print transition_prob_calc.count()
print transition_prob_calc.take(5)

print 'Default transition probability:', DEFAULT_TRANSITION_PROB

PythonRDD[43] at RDD at PythonRDD.scala:43
6
27224
[(u'bennigsens', {u'and': 0.0}), (u'aided', {u'the': -1.6739764335716716, u'by': -0.2876820724517809, u'augustus': -2.772588722239781}), (u'suicidal', {u'and': -0.916290731874155, u'cut': -1.6094379124341003, u'or': -1.6094379124341003, u'commented': -1.6094379124341003}), (u'linsey', {u'woolseys': 0.0}), (u'unheeded', {u'to': -0.6931471805599453, u'upon': -0.6931471805599453})]
Default transition probability: -13.7832063505


In [33]:
transition_prob = transition_prob_calc.collectAsMap()

In [34]:
############
#
# generate dictionary
#
############

In [35]:
all_words = make_all_lower.map(lambda line: regex.sub(' ', line)).flatMap(lambda line: line.split())

In [36]:
print all_words
print all_words.getNumPartitions()
print all_words.count()
print all_words.take(5)

PythonRDD[46] at RDD at PythonRDD.scala:43
2
1105285
[u'the', u'project', u'gutenberg', u'ebook', u'of']


In [37]:
count_once = all_words.map(lambda word: (word, 1))

In [38]:
print count_once
print count_once.getNumPartitions()
print count_once.count()
print count_once.take(5)

PythonRDD[49] at RDD at PythonRDD.scala:43
2
1105285
[(u'the', 1), (u'project', 1), (u'gutenberg', 1), (u'ebook', 1), (u'of', 1)]


In [39]:
unique_words_with_count = count_once.reduceByKey(lambda a, b: a + b, numPartitions = n_partitions).cache()

In [40]:
print unique_words_with_count
print unique_words_with_count.getNumPartitions()
print unique_words_with_count.count()
print unique_words_with_count.take(5)

PythonRDD[56] at RDD at PythonRDD.scala:43
6
29157
[(u'aided', 17), (u'bennigsens', 1), (u'suicidal', 5), (u'linsey', 1), (u'worshiped', 1)]


In [41]:
assert max_edit_distance>0 

In [42]:
generate_deletes = unique_words_with_count.map(lambda (parent, count): 
                                                   (parent, get_n_deletes_list(parent, max_edit_distance)))

In [43]:
print generate_deletes
print generate_deletes.getNumPartitions()
print generate_deletes.count()
print generate_deletes.take(5)

PythonRDD[59] at RDD at PythonRDD.scala:43
6
29157
[(u'aided', [u'ided', u'aded', u'aied', u'aidd', u'aide', u'ded', u'ied', u'idd', u'ide', u'aed', u'add', u'ade', u'aid', u'aie', u'ed', u'dd', u'de', u'id', u'ie', u'ad', u'ae', u'ai']), (u'bennigsens', [u'ennigsens', u'bnnigsens', u'benigsens', u'benngsens', u'bennisens', u'bennigens', u'bennigsns', u'bennigses', u'bennigsen', u'nnigsens', u'enigsens', u'enngsens', u'ennisens', u'ennigens', u'ennigsns', u'ennigses', u'ennigsen', u'bnigsens', u'bnngsens', u'bnnisens', u'bnnigens', u'bnnigsns', u'bnnigses', u'bnnigsen', u'beigsens', u'bengsens', u'benisens', u'benigens', u'benigsns', u'benigses', u'benigsen', u'bennsens', u'benngens', u'benngsns', u'benngses', u'benngsen', u'benniens', u'bennisns', u'bennises', u'bennisen', u'bennigns', u'benniges', u'bennigen', u'bennigss', u'bennigsn', u'bennigse', u'nigsens', u'nngsens', u'nnisens', u'nnigens', u'nnigsns', u'nnigses', u'nnigsen', u'eigsens', u'engsens', u'enisens', u'enigens', u'eni

In [44]:
expand_deletes = generate_deletes.flatMapValues(lambda x: x)

In [45]:
print expand_deletes
print expand_deletes.getNumPartitions()
print expand_deletes.count()
print expand_deletes.take(5)

PythonRDD[62] at RDD at PythonRDD.scala:43
6
2863776
[(u'aided', u'ided'), (u'aided', u'aded'), (u'aided', u'aied'), (u'aided', u'aidd'), (u'aided', u'aide')]


In [46]:
swap = expand_deletes.map(lambda (orig, delete): (delete, ([orig], 0)))

In [47]:
print swap
print swap.getNumPartitions()
print swap.count()
print swap.take(5)

PythonRDD[65] at RDD at PythonRDD.scala:43
6
2863776
[(u'ided', ([u'aided'], 0)), (u'aded', ([u'aided'], 0)), (u'aied', ([u'aided'], 0)), (u'aidd', ([u'aided'], 0)), (u'aide', ([u'aided'], 0))]


In [48]:
corpus = unique_words_with_count.mapValues(lambda count: ([], count))

In [49]:
print corpus
print corpus.getNumPartitions()
print corpus.count()
print corpus.take(5)

PythonRDD[68] at RDD at PythonRDD.scala:43
6
29157
[(u'aided', ([], 17)), (u'bennigsens', ([], 1)), (u'suicidal', ([], 5)), (u'linsey', ([], 1)), (u'worshiped', ([], 1))]


In [50]:
combine = swap.union(corpus)  # combine deletes with main dictionary (duplicate keys are merged by the reduce in the next cell)

In [51]:
print combine
print combine.getNumPartitions()
print combine.count()
print combine.take(5)

UnionRDD[71] at union at NativeMethodAccessorImpl.java:-2
12
2892933
[(u'ided', ([u'aided'], 0)), (u'aded', ([u'aided'], 0)), (u'aied', ([u'aided'], 0)), (u'aidd', ([u'aided'], 0)), (u'aide', ([u'aided'], 0))]


In [52]:
new_dict = combine.reduceByKeyLocally(lambda a, b: (a[0]+b[0], a[1]+b[1]))
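
The effect of this reduce can be sketched on a tiny example in plain Python, without Spark. The merge function is the same lambda as above; everything else is an assumption made for illustration: `single_deletes` is a hypothetical helper limited to one deletion, and the two-word corpus uses the notebook's count for "aided" plus a made-up count for "aide".

```python
def merge(a, b):
    """Same merge as the reduce above: concatenate parent lists, add counts."""
    return (a[0] + b[0], a[1] + b[1])


def single_deletes(word):
    """All distinct strings obtained by deleting exactly one character."""
    return sorted(set(word[:i] + word[i + 1:] for i in range(len(word))))


# "aided": 17 is from the notebook output; the count for "aide" is made up
word_counts = [("aided", 17), ("aide", 2)]

records = []
for word, count in word_counts:
    records.append((word, ([], count)))        # corpus entry
    for d in single_deletes(word):
        records.append((d, ([word], 0)))       # delete entry pointing to parent

lookup = {}
for key, value in records:
    lookup[key] = merge(lookup[key], value) if key in lookup else value
```

In this sketch `lookup["aide"]` ends up as `(["aided"], 2)`: the key is both a real word (count 2) and a delete of "aided", which is why each dictionary value carries a list of parents alongside a frequency.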