# Context-Level Correction -  Serial Version and SPARK Implementations

In [1]:
# context_correction.ipynb

######################
#
# Submission by Gioia Dominedo (Harvard ID: 40966234) for
# CS 205 - Computing Foundations for Computational Science
# 
# This is part of a joint project with Kendrick Lo that includes a
# separate component for word-level checking. This notebook outlines
# algorithms for context-level correction, and includes a serial
# Python algorithm adapted from third party algorithms (Symspell and
# Viterbi algorithms), as well as a Spark/Python algorithm. 
#
# The following were also used as references:
# Peter Norvig, How to Write a Spelling Corrector
#	(http://norvig.com/spell-correct.html)
# Peter Norvig, Natural Language Corpus Data: Beautiful Data
#	(http://norvig.com/ngrams/ch14.pdf)
#
# Two main approaches to parallelization were attempted: sentence-
# level and word-level. Both attempts are documented in this notebook.
#
######################

In [2]:
######################
#
# CONTEXT-LEVEL CORRECTION LOGIC - VITERBI ALGORITHM
#
# Each sentence is modeled as a hidden Markov model. Prior
# probabilities (for first words in the sentences) and transition
# probabilities (for all subsequent words) are calculated when
# generating the main dictionary, using the same corpus. Emission
# probabilities are generated on the fly by parameterizing a Poisson 
# distribution with the edit distance. The state space of possible
# corrections is based on the suggested words from the word-level
# correction. Words must (a) be 'real' word (i.e. appear at least
# once in the corpus) used to generate the dictionary in order to be
# considered valid suggestions; this ensures that the state space
# remains manageable by ignoring words that have not been seen
# previously.
#
# All probabilities are stored in log-space to avoid underflow. Pre-
# defined minimum values are used for words that are not present in
# the dictionary and/or probability tables.
#
# More detail is included at each step below.
#
######################

<div class="alert alert-danger">
<strong>To run the serial version, restart notebook, and start executing the cells of this section starting here.</strong>
</div>

### 1. Serial Code Performance

In [8]:
import re
import math
from scipy.stats import poisson

<div class="alert alert-info">
<strong>PRE-PROCESSING CODE</strong>
</div>

In [9]:
######################
#
# v 1.0 last revised 27 Nov 2015
#
# The pre-processing steps have been adapted from the dictionary
# creation of the word-level spellchecker, which in turn was based on
# SymSpell, a Symmetric Delete spelling correction algorithm
# developed by Wolf Garbe and originally written in C#. More detail
# on SymSpell is included in the word-level spell-check documentation.
#
# The main modifications to the word-level spellchecker pre-
# processing stages are to create the additional outputs that are
# required for the context-level checking, and to eliminate redundant
# outputs that are not necessary.
#
# The outputs of the pre-processing stage are:
#
# - dictionary: A dictionary that combines both words present in the
# corpus and other words that are within a given 'delete distance'. 
# The format of the dictionary is:
# {word: ([list of words within the given 'delete distance'], 
# word count in corpus)}
#
# - start_prob: A dictionary with key, value pairs that correspond to
# (word, probability of the word being the first word in a sentence)
#
# - transition_prob: A dictionary of dictionaries that stores the
# probability of a given word following another. The format of the
# dictionary is:
# {previous word: {word1 : P(word1|prevous word), word2 : 
# P(word2|prevous word), ...}}
#
# - default_start_prob: A benchmark probability of a word being at
# the start of a sentence, set to 1 / # of words at the beginning of
# sentences. This ensures that all previously unseen words at the
# beginning of sentences are not corrected unnecessarily.
#
# - default_transition_prob: A benchmark probability of a word being
# seen, given the previous word in the sentence, also set to 1 / # of
# transitions in corpus. This ensures that all previously unseen
# transitions are not corrected unnecessarily.
#
######################

MAX_EDIT_DISTANCE = 3

def get_deletes_list(w):
    '''
    Given a word, derive strings with up to max_edit_distance
    characters deleted.
    '''
    deletes = []
    queue = [w]
    for d in range(MAX_EDIT_DISTANCE):
        temp_queue = []
        for word in queue:
            if len(word)>1:
                for c in range(len(word)):  # character index
                    word_minus_c = word[:c] + word[c+1:]
                    if word_minus_c not in deletes:
                        deletes.append(word_minus_c)
                    if word_minus_c not in temp_queue:
                        temp_queue.append(word_minus_c)
        queue = temp_queue
        
    return deletes

def create_dictionary_entry(w, dictionary):
    '''
    Adds a word and its derived deletions to the dictionary.
    Dictionary entries are of the form:
    (list of suggested corrections, frequency of word in corpus)
    '''

    new_real_word_added = False
    
    # check if word is already in dictionary
    if w in dictionary:
        # increment count of word in corpus
        dictionary[w] = (dictionary[w][0], dictionary[w][1] + 1)
    else:
        # create new entry in dictionary
        dictionary[w] = ([], 1)  
        
    if dictionary[w][1]==1:
        
        # first appearance of word in corpus
        # n.b. word may already be in dictionary as a derived word
        # (deleting character from a real word)
        # but counter of frequency of word in corpus is not
        # incremented in those cases)
        
        new_real_word_added = True
        deletes = get_deletes_list(w)
        
        for item in deletes:
            if item in dictionary:
                # add (correct) word to delete's suggested correction
                # list if not already there
                if item not in dictionary[item][0]:
                    dictionary[item][0].append(w)
            else:
                # note: frequency of word in corpus is not incremented
                dictionary[item] = ([w], 0)  
        
    return new_real_word_added

def create_dictionary(fname):
    '''
    Loads a text file and uses it to create a dictionary and
    to calculate start probabilities and transition probabilities. 
    Please refer to the text above for a full description.
    '''
    
    print 'Creating dictionary...'

    dictionary = dict()
    start_prob = dict()
    transition_prob = dict()
    word_count = 0
    transitions = 0
    
    with open(fname) as file:    
        
        for line in file:
            
            # process each sentence separately
            for sentence in line.split('.'):
                
                # separate by words by non-alphabetical characters
                words = re.findall('[a-z]+', sentence.lower())      
                
                for w, word in enumerate(words):
                    
                    if create_dictionary_entry(word, dictionary):
                        word_count += 1
                        
                    # update probabilities for Hidden Markov Model
                    if w == 0:

                        # probability of a word being at the
                        # beginning of a sentence
                        if word in start_prob:
                            start_prob[word] += 1
                        else:
                            start_prob[word] = 1
                    else:
                        
                        # probability of transitionining from one
                        # word to another
                        # dictionary format:
                        # {previous word: {word1 : P(word1|prevous
                        # word), word2 : P(word2|prevous word)}}
                        
                        # check that prior word is present
                        # - create if not
                        if words[w - 1] not in transition_prob:
                            transition_prob[words[w - 1]] = dict()
                            
                        # check that current word is present
                        # - create if not
                        if word not in transition_prob[words[w - 1]]:
                            transition_prob[words[w - 1]][word] = 0
                            
                        # update value
                        transition_prob[words[w - 1]][word] += 1
                        transitions += 1
                    
    # convert counts to log-probabilities, to avoid underflow in
    # later calculations (note: natural logarithm, not base-10)
    
    total_start_words = float(sum(start_prob.values()))
    default_start_prob = math.log(1/total_start_words)
    start_prob.update( 
        {k: math.log(v/total_start_words)
         for k, v in start_prob.items()})
    
    default_transition_prob = math.log(1./transitions)
    transition_prob.update(
        {k: {k1: math.log(float(v1)/sum(v.values()))
             for k1, v1 in v.items()} 
         for k, v in transition_prob.items()})

    print 'Total unique words in corpus: %i' % word_count
    print 'Total items in dictionary: %i' \
        % len(dictionary)
    print '  Edit distance for deletions: %i' % MAX_EDIT_DISTANCE
    print 'Total unique words at the start of a sentence: %i' \
        % len(start_prob)
    print 'Total unique word transitions: %i' % len(transition_prob)
        
    return dictionary, start_prob, default_start_prob, \
        transition_prob, default_transition_prob

<div class="alert alert-info">
<strong>SPELL-CHECKING CODE</strong>
</div>

In [13]:
######################
#
# v 1.0 last revised 27 Nov 2015
#
# Reads in a text file, breaks it down into individual sentences (by
# splitting on periods), and then carries out context-based spell-
# checking on each sentence in turn. In cases where the 'suggested'
# word does not match the actual word in the text, both the original
# and the suggested sentences are printed.
#
# Probabilistic model:
#
# Each sentence is modeled as a hidden Markov model, where the
# hidden states are the words that the user intended to type, and
# the emissions are the words
# that were actually typed.
#
# For each word in a sentence, we can define:
# - emission probabilities: P(observed word|intended word)
# - prior probabilities (for first words in sentences only):
# P(being the first word in a sentence)
# - transition probabilities (for all subsequent words):
# P(intended word|previous intended word)
#
# Prior and transition probabilities were calculated in the pre-
# processing step, using the same corpus as the dictionary.
# 
# Emission probabilities are calculated on the fly using a Poisson
# distribution, where P(observed word|intended word) = PMF of 
# Poisson(k, l), where k = edit distance between word type and word
# intended, and l=0.01. This approach was taken from the 2015
# lecture notes of AM207 Stochastic Optimization, as was the
# parameter l=0.01. Various parameters between 0 and 1 were tested,
# confirming that 0.01 yields the most accurate word suggestions.
# The shape of the PMF is shown in the cell below.
#
# All probabilities are stored in log-space to avoid underflow. Pre-
# defined minimum values (also defined at the pre-processing stage)
# are used for words that are not present in the dictionary and/or
# probability tables.
#
# The spell-checking itself is carried out using a modified version
# of the Viterbi algorithm, which yields the most likely sequence of
# hidden states, i.e. the most likely sequence of words that the
# user intended to type. The main difference to the 'standard'
# Viterbi algorithm is that the state space (i.e. list of possible
# corrections) is generated (and therefore varies) for each word,
# instead of considering the state space of all possible words in
# the dictionary for every word that is checked. This ensures that
# the problem remains tractable.
#
# The algorithm is best illustrated by way of an example.
# e.g. suppose that we are checking the sentence 'This is ax test.'
# The emissions are 'This is ax test.' and the hidden states are
# 'This is a test.'
#
# As a pre-processing step, we convert everything to lowercase,
# eliminate punctuation, and break the sentence up into a list of
# words: ['this', 'is', 'ax', 'text']
# This list is passed as a parameter to the viterbi function.
#
# The algorithm tackles each word in turn, starting with 'this'.
#
# We first use get_suggestions to obtain a list of all words that
# may have been intended, i.e. all possible hidden states (intended
# words) for the emission (word typed).
# get_suggestions returns 1,004 possible corrections, including:
# - 1 word with an edit distance of 0 ['this']
# - 5 words with an edit distance of 1 ['his', 'thus', 'thin',
# 'tis', 'thins']
# - 109 words with an edit distance of 2 ['the', 'that', 'is',
# 'him', 'they', ...]
# - 889 words with an edit distance of 3 ['to', 'in', 'he', 'was',
# 'it', 'as', ...]
# Note: get_suggestions is capped at an edit distance of 3.
# 
# These 1,004 words represent our state space, i.e. all possible
# words that may have been intended. They each have an emission
# probability = PMF of Poisson(edit distance, 0.01)
# We refer to this below as the list of possible corrections.
#
# For each word in the list of possible corrections, we calculate:
# P(word starting a sentence) * P(observed 'this'|intended word)
# This is a simple application of Bayes' rule: by normalizing the
# probabilities we obtain P(intended word|oberved 'this') for
# each of the 1,004 words.
# These probabilities are referred to as the belief state, and they
# are stored for
# future use. We also store the possible paths, which at this stage
# are only one word long.
# 
# We now move on to the next word. After the first word, all
# subsequent words are
# treated as follows.
#
# The second word in our test sentence is 'is'. Once again, we use
# get_suggestions to obtain a list of all words that may have been
# intended. get_suggestions returns 1,124 possible corrections,
# including:
# - 1 word with an edit distance of 0 ['is']
# - 31 words with an edit distance of 1 ['in', 'it', 'his', 'as',
# 'i', ...]
# - 213 words with an edit distance of 2 ['was', 'him', 'this',
# 'so', 'did', ...]
# - 879 words with an edit distance of 3 ['with', 'she', 'said',
# 'into', ...]
# These 1,124 words represent our state space for the second word.
#
# For each word in the list of possible corrections, we loop through
# all the words in the previous list of possible corrections, and
# calculate:
# belief state(previous word) * P(current word|previous word)
#    * P(typing 'is'|meaning to type current word)
# We store the previous word that maximizes this calculation, as
# well as the probability that it results in.
#
# For example, suppose that we are considering the possibility that
# 'is' was indeed intended to be 'is'. We then calculate: 
# belief state(previous word) * P('is'|previous word) * P('is'|'is')
# for all possible previous words, and discover that the previous
# word 'this' maximizes the above calculation. We therefore store
# 'this is' as the optimal path for the suggested correction 'is'
# (more specifically, path['is'] = 'this is'), and the above
# (normalized) probability as the belief state for 'is' (more
# specifically, belief['is'] = prob('this is')).
#
# If the sentence had been only 2 words long, then at this point we
# would return the path that maximizes the most recent belief state.
# As it is not, we repeat the previous steps for 'ax' and 'test',
# and then return the path that is associated with the most likely
# belief state at the last step.
#
######################


def dameraulevenshtein(seq1, seq2):
    '''
    Calculate the Damerau-Levenshtein distance between sequences.
    Same code as word-level checking.
    '''
    
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1
    # matrix. However, only the current and two previous rows are
    # needed at once, so we only store those.
    
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    
    for x in xrange(len(seq1)):
        
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = \
            oneago, thisrow, [0] * len(seq2) + [x + 1]
        
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
                
    return thisrow[len(seq2) - 1]

def get_suggestions(string, dictionary, longest_word_length=20, 
                    min_count=100, max_sug=10):
    '''
    Return list of suggested corrections for potentially incorrectly
    spelled word.
    Code based on get_suggestions function from word-level checking,
    with the addition of the min_count parameter, which only
    considers words that have occur more than min_count times in the
    (dictionary) corpus.
    '''
    
    if (len(string) - longest_word_length) > MAX_EDIT_DISTANCE:
        # to ensure Viterbi can keep running -- use the word itself
        return [(string, 0)]
    
    suggest_dict = {}
    
    queue = [string]
    q_dictionary = {}  # items other than string that we've checked
    
    while len(queue)>0:
        q_item = queue[0]  # pop
        queue = queue[1:]
        
        # process queue item
        if (q_item in dictionary) and (q_item not in suggest_dict):
            if (dictionary[q_item][1]>0):
            # word is in dictionary, and is a word from the corpus,
            # and not already in suggestion list so add to suggestion
            # dictionary, indexed by the word with value (frequency
            # in corpus, edit distance)
            # note: q_items that are not the input string are shorter
            # than input string since only deletes are added (unless
            # manual dictionary corrections are added)
                assert len(string)>=len(q_item)
                suggest_dict[q_item] = \
                    (dictionary[q_item][1], len(string) - len(q_item))
            
            # the suggested corrections for q_item as stored in
            # dictionary (whether or not q_item itself is a valid
            # word or merely a delete) can be valid corrections
            for sc_item in dictionary[q_item][0]:
                if (sc_item not in suggest_dict):
                    
                    # compute edit distance
                    # suggested items should always be longer (unless
                    # manual corrections are added)
                    assert len(sc_item)>len(q_item)
                    # q_items that are not input should be shorter
                    # than original string 
                    # (unless manual corrections added)
                    assert len(q_item)<=len(string)
                    if len(q_item)==len(string):
                        assert q_item==string
                        item_dist = len(sc_item) - len(q_item)

                    # item in suggestions list should not be the same
                    # as the string itself
                    assert sc_item!=string           
                    # calculate edit distance using Damerau-
                    # Levenshtein distance
                    item_dist = dameraulevenshtein(sc_item, string)
                    
                    if item_dist<=MAX_EDIT_DISTANCE:
                        # should already be in dictionary if in
                        # suggestion list
                        assert sc_item in dictionary  
                        # trim list to contain state space
                        if (dictionary[q_item][1]>0): 
                            suggest_dict[sc_item] = \
                                (dictionary[sc_item][1], item_dist)
        
        # now generate deletes (e.g. a substring of string or of a
        # delete) from the queue item as additional items to check
        # -- add to end of queue
        assert len(string)>=len(q_item)
        if (len(string)-len(q_item))<MAX_EDIT_DISTANCE \
            and len(q_item)>1:
            for c in range(len(q_item)): # character index        
                word_minus_c = q_item[:c] + q_item[c+1:]
                if word_minus_c not in q_dictionary:
                    queue.append(word_minus_c)
                    # arbitrary value to identify we checked this
                    q_dictionary[word_minus_c] = None

    # return list of suggestions: (correction, edit distance)
    
    # only include words that have appeared a minimum number of times
    # make sure that we do not lose the original word
    as_list = [i for i in suggest_dict.items() 
               if (i[1][0]>min_count or i[0]==string)]
    
    # only include the most likely suggestions (based on frequency
    # and edit distance from original word)
    trunc_as_list = sorted(as_list, 
            key = lambda (term, (freq, dist)): (dist, -freq))[:max_sug]
    
    if len(trunc_as_list)==0:
        # to ensure Viterbi can keep running
        # -- use the word itself if no corrections are found
        return [(string, 0)]
        
    else:
        # drop the word frequency - not needed beyond this point
        return [(i[0], i[1][1]) for i in trunc_as_list]

    '''
    Output format:
    get_suggestions('file', dictionary)
    [('file', 0), ('five', 1), ('fire', 1), ('fine', 1), ('will', 2),
    ('time', 2), ('face', 2), ('like', 2), ('life', 2), ('while', 2)]
    '''
    
def get_emission_prob(edit_dist, poisson_lambda=0.01):
    '''
    The emission probability, i.e. P(observed word|intended word)
    is approximated by a Poisson(k, l) distribution, where 
    k=edit distance and l=0.01.
    
    The lambda parameter matches the one used in the AM207
    lecture notes. Various parameters between 0 and 1 were tested
    to confirm that 0.01 yields the most accurate results.
    '''
    
    return math.log(poisson.pmf(edit_dist, poisson_lambda))

######################
# Multiple helper functions are used to avoid KeyErrors when
# attempting to access values that are not present in dictionaries,
# in which case the previously specified default value is returned.
######################

def get_start_prob(word, start_prob, default_start_prob):
    try:
        return start_prob[word]
    except KeyError:
        return default_start_prob
    
def get_transition_prob(cur_word, prev_word, 
                        transition_prob, default_transition_prob):
    try:
        return transition_prob[prev_word][cur_word]
    except KeyError:
        return default_transition_prob

def get_belief(prev_word, prev_belief):
    try:
        return prev_belief[prev_word]
    except KeyError:
        return math.log(math.exp(min(prev_belief.values()))/2.)  
    
def viterbi(words, dictionary, start_prob, default_start_prob, 
            transition_prob, default_transition_prob):
    
    V = [{}]
    path = {}
    path_context = []
    
    # character level correction - used to determine state space
    corrections = get_suggestions(words[0], dictionary)
        
    # Initialize base cases (t == 0)
    for sug_word in corrections:
        
        # compute the value for all possible starting states
        V[0][sug_word[0]] = math.exp(
            get_start_prob(sug_word[0], start_prob, 
                           default_start_prob)
            + get_emission_prob(sug_word[1]))
        
        # remember all the different paths (only one state so far)
        path[sug_word[0]] = [sug_word[0]]
 
    # normalize for numerical stability
    path_temp_sum = sum(V[0].values())
    V[0].update({k: math.log(v/path_temp_sum) 
                 for k, v in V[0].items()})
    
    # keep track of previous state space
    prev_corrections = [i[0] for i in corrections]
    
    if len(words) == 1:
        path_context = [max(V[0], key=lambda i: V[0][i])]
        return path_context

    # run Viterbi for t > 0
    for t in range(1, len(words)):

        V.append({})
        new_path = {}
        
        # character level correction
        corrections = get_suggestions(words[t], dictionary)
 
        for sug_word in corrections:
        
            sug_word_emission_prob = get_emission_prob(sug_word[1])
            
            # compute the values coming from all possible previous
            # states, only keep the maximum
            (prob, word) = max(
                (get_belief(prev_word, V[t-1]) 
                + get_transition_prob(sug_word[0], prev_word, 
                    transition_prob, default_transition_prob)
                + sug_word_emission_prob, prev_word) 
                               for prev_word in prev_corrections)

            # save the maximum value for each state
            V[t][sug_word[0]] = math.exp(prob)
            
            # remember the path we came from to get this maximum value
            new_path[sug_word[0]] = path[word] + [sug_word[0]]
        
        # normalize for numerical stability
        path_temp_sum = sum(V[t].values())
        V[t].update({k: math.log(v/path_temp_sum) 
                     for k, v in V[t].items()})
        
        # keep track of previous state space
        prev_corrections = [i[0] for i in corrections]
 
        # don't need to remember the old paths
        path = new_path
     
    (prob, word) = max((V[t][sug_word[0]], sug_word[0]) 
                       for sug_word in corrections)
    path_context = path[word]
    
    return path_context

def correct_document_context(fname, dictionary, start_prob, default_start_prob,
                             transition_prob, default_transition_prob):
    
    doc_word_count = 0
    corrected_word_count = 0
    
    with open(fname) as file:
        
        for i, line in enumerate(file):
            
            for sentence in line.split('.'):
                
                # separate by words by non-alphabetical characters
                words = re.findall('[a-z]+', sentence.lower())  
                doc_word_count += len(words)
                
                if len(words) > 0:
                
                    suggestion = viterbi(words, dictionary,
                                start_prob, default_start_prob, 
                                transition_prob, default_transition_prob)

                    # display sentences with corrections
                    if words != suggestion:
                        
                        print 'Line %i: %s --> %s' % \
                        (i, ' '.join(words), ' '.join(suggestion))
                        
                        # update count of corrected words
                        corrected_word_count += \
                        sum([words[j]!=suggestion[j] 
                             for j in range(len(words))])
  
    print '-----'
    print 'Total words checked: %i' % doc_word_count
    print 'Total potential errors found: %i' % corrected_word_count

<div class="alert alert-info">
  <strong>SAMPLE OUTPUTS</strong>
</div>

In [14]:
%%time
dictionary, start_prob, default_start_prob, \
transition_prob, default_transition_prob \
= create_dictionary('testdata/big.txt')

Creating dictionary...
Total unique words in corpus: 29157
Total items in dictionary: 2151998
  Edit distance for deletions: 3
Total unique words at the start of a sentence: 15297
Total unique word transitions: 27224
CPU times: user 40.6 s, sys: 738 ms, total: 41.3 s
Wall time: 42.6 s


In [15]:
%%time
correct_document_context('testdata/test.txt', dictionary,
                         start_prob, default_start_prob, 
                         transition_prob, default_transition_prob)

Line 3: this is ax test --> this is a test
Line 4: this is za test --> this is a test
Line 5: thee is a test --> there is a test
Line 6: her tee set --> her to set
-----
Total words checked: 27
Total potential errors found: 4
CPU times: user 936 ms, sys: 11.6 ms, total: 948 ms
Wall time: 951 ms


<div class="alert alert-danger">
<strong>To run the SPARK version, restart notebook, and start executing the cells of this section starting here.</strong>
</div>

In [1]:
import re
import math
from scipy.stats import poisson
import itertools

In [2]:
import findspark
import os
findspark.init()
import pyspark
sc = pyspark.SparkContext()
sc.setLogLevel('ERROR')

***
### 2. Pre-Processing SPARK Code Performance

In [3]:
######################
#
# DOCUMENTATION HERE
#
######################

# number of partitions to be used
n_partitions = 6
MAX_EDIT_DISTANCE = 3

def get_n_deletes_list(w, n):
    '''given a word, derive list of strings with up to n characters deleted'''
    # since this list is generally of the same magnitude as the number of 
    # characters in a word, it may not make sense to parallelize this
    # so we use python to create the list
    deletes = []
    queue = [w]
    for d in range(n):
        temp_queue = []
        for word in queue:
            if len(word)>1:
                for c in range(len(word)):  # character index
                    word_minus_c = word[:c] + word[c+1:]
                    if word_minus_c not in deletes:
                        deletes.append(word_minus_c)
                    if word_minus_c not in temp_queue:
                        temp_queue.append(word_minus_c)
        queue = temp_queue
        
    return deletes

def get_transitions(sentence):
    if len(sentence)<2:
        return None
    else:
        return [((sentence[i], sentence[i+1]), 1) for i in range(len(sentence)-1)]
    
def map_transition_prob(x):
    vals = x[1]
    total = float(sum(vals.values()))
    probs = {k: math.log(v/total) for k, v in vals.items()}
    return (x[0], probs)

def parallel_create_dictionary(fname):
    '''
    Create dictionary, start probabilities and transition
    probabilities using Spark RDDs.
    '''
    # we generate and count all words for the corpus,
    # then add deletes to the dictionary
    # this is a slightly different approach from the SymSpell algorithm
    # that may be more appropriate for Spark processing
    
    print 'Creating dictionary...'
    
    ############
    #
    # load file & initial processing
    #
    ############
    
    # http://stackoverflow.com/questions/22520932/python-remove-all-non-alphabet-chars-from-string
    regex = re.compile('[^a-z ]')

    # convert file into one long sequence of words
    make_all_lower = sc.textFile(fname) \
            .map(lambda line: line.lower()) \
            .filter(lambda x: x!='').cache()
    
    # split into individual sentences and remove other punctuation
    split_sentence = make_all_lower.flatMap(lambda line: line.split('.')) \
            .map(lambda sentence: regex.sub(' ', sentence)) \
            .map(lambda sentence: sentence.split()).cache()
    
    ############
    #
    # generate start probabilities
    #
    ############
    
    # only focus on words at the start of sentences
    start_words = split_sentence.map(lambda sentence: sentence[0] if len(sentence)>0 else None) \
        .filter(lambda word: word!=None)
    
    # add a count to each word
    count_start_words_once = start_words.map(lambda word: (word, 1)).cache()

    # use accumulator to count the number of words at the start of sentences
    accum_total_start_words = sc.accumulator(0)
    count_total_start_words = count_start_words_once.foreach(lambda x: accum_total_start_words.add(1))
    total_start_words = float(accum_total_start_words.value)
    
    # reduce into count of unique words at the start of sentences
    unique_start_words = count_start_words_once.reduceByKey(lambda a, b: a + b)
    
    # convert counts to probabilities
    start_prob_calc = unique_start_words.mapValues(lambda v: math.log(v/total_start_words))
    
    # get default start probabilities (for words not in corpus)
    default_start_prob = math.log(1/total_start_words)
    
    # store start probabilities as a dictionary (will be used as a lookup table)
    start_prob = start_prob_calc.collectAsMap()
    
    ############
    #
    # generate transition probabilities
    #
    ############
    
    # focus on continuous word pairs within the sentence
    # e.g. "this is a test" -> "this is", "is a", "a test"
    # note: as the relevant probability is P(word|previous word)
    # the tuples are ordered as (previous word, word)
    other_words = split_sentence.map(lambda sentence: get_transitions(sentence)).filter(lambda x: x!=None). \
                flatMap(lambda x: x).cache()

    # use accumulator to count the number of transitions
    accum_total_other_words = sc.accumulator(0)
    count_total_other_words = other_words.foreach(lambda x: accum_total_other_words.add(1))
    total_other_words = float(accum_total_other_words.value)
    
    # reduce into count of unique word pairs
    unique_other_words = other_words.reduceByKey(lambda a, b: a + b)
    
    # aggregate by previous word
    # i.e. (previous word, [(word1, word1-previous word count), (word2, word2-previous word count), ...])
    other_words_collapsed = unique_other_words.map(lambda x: (x[0][0], (x[0][1], x[1]))).groupByKey().mapValues(dict)

    # POTENTIAL OPTIMIZATION: FIND AN ALTERNATIVE TO GROUPBYKEY (CREATES ~9.3MB SHUFFLE)
    
    # convert counts to probabilities
    transition_prob_calc = other_words_collapsed.map(lambda x: map_transition_prob(x))
    
    # get default transition probabilities (for word pairs not in corpus)
    default_transition_prob = math.log(1/total_other_words)
    
    # store transition probabilities as dictionary (will be used as lookup table)
    transition_prob = transition_prob_calc.collectAsMap()
    
    ############
    #
    # process corpus for dictionary
    #
    ############
    
    replace_nonalphs = make_all_lower.map(lambda line: regex.sub(' ', line))
    all_words = replace_nonalphs.flatMap(lambda line: line.split())

    # create core corpus dictionary (i.e. only words appearing in file, no "deletes") and cache it
    # output RDD of unique_words_with_count: [(word1, count1), (word2, count2), (word3, count3)...]
    count_once = all_words.map(lambda word: (word, 1))
    unique_words_with_count = count_once.reduceByKey(lambda a, b: a + b).cache()
    
    ############
    #
    # generate deletes list
    #
    ############
    
    # generate list of n-deletes from words in a corpus of the form: [(word1, count1), (word2, count2), ...]
     
    assert MAX_EDIT_DISTANCE > 0  
    
    generate_deletes = unique_words_with_count.map(lambda (parent, count): 
                                                   (parent, get_n_deletes_list(parent, MAX_EDIT_DISTANCE)))
    expand_deletes = generate_deletes.flatMapValues(lambda x: x)
    swap = expand_deletes.map(lambda (orig, delete): (delete, ([orig], 0)))
   
    ############
    #
    # combine delete elements with main dictionary
    #
    ############
    
    corpus = unique_words_with_count.mapValues(lambda count: ([], count))
    combine = swap.union(corpus)  # combine deletes with main dictionary, eliminate duplicates
    
    # since the dictionary will only be a lookup table once created, we can
    # pass on as a Python dictionary rather than RDD by reducing locally and
    # avoiding an extra shuffle from reduceByKey
    dictionary = combine.reduceByKeyLocally(lambda a, b: (a[0]+b[0], a[1]+b[1]))

    words_processed = unique_words_with_count.map(lambda (k, v): v).reduce(lambda a, b: a + b)
    word_count = unique_words_with_count.count()   
    
    # output stats
    print 'Total words processed: %i' % words_processed
    print 'Total unique words in corpus: %i' % word_count 
    print 'Total items in dictionary (corpus words and deletions): %i' % len(dictionary)
    print '  Edit distance for deletions: %i' % MAX_EDIT_DISTANCE
    print 'Total unique words at the start of a sentence: %i' \
        % len(start_prob)
    print 'Total unique word transitions: %i' % len(transition_prob)
    
    return dictionary, start_prob, default_start_prob, transition_prob, default_transition_prob

***
### 3. Spellchecking SPARK Code Performance - Parallelizing Across Sentences

In [4]:
######################
#
# DOCUMENTATION HERE
#
######################

def get_emission_prob(edit_dist, poisson_lambda=0.01):
    '''
    The emission probability, i.e. P(observed word|intended word)
    is approximated by a Poisson(k, l) distribution, where 
    k=edit distance and l=0.01.
    
    The lambda parameter matches the one used in the AM207
    lecture notes. Various parameters between 0 and 1 were tested
    to confirm that 0.01 yields the most accurate results.
    '''
    
    return math.log(poisson.pmf(edit_dist, poisson_lambda))

def get_start_prob(word, start_prob, default_start_prob):
    try:
        return start_prob[word]
    except KeyError:
        return default_start_prob
    
def get_transition_prob(cur_word, prev_word, transition_prob, default_transition_prob):
    try:
        return transition_prob[prev_word][cur_word]
    except KeyError:
        return default_transition_prob
    
def get_belief(prev_word, prev_belief):
    try:
        return prev_belief[prev_word]
    except KeyError:
        return math.log(math.exp(min(prev_belief.values()))/2.)
    
def dameraulevenshtein(seq1, seq2):
    '''
    Calculate the Damerau-Levenshtein distance between sequences.
    Same code as word-level checking.
    '''
    
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1
    # matrix. However, only the current and two previous rows are
    # needed at once, so we only store those.
    
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    
    for x in xrange(len(seq1)):
        
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = \
            oneago, thisrow, [0] * len(seq2) + [x + 1]
        
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # this block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
                
    return thisrow[len(seq2) - 1]

def get_suggestions(string, dictionary, longest_word_length=20, 
                    min_count=100, max_sug=10):
    '''
    Return list of suggested corrections for potentially incorrectly
    spelled word.
    Code based on get_suggestions function from word-level checking,
    with the addition of the min_count parameter, which only
    considers words that have occur more than min_count times in the
    (dictionary) corpus.
    '''
    
    if (len(string) - longest_word_length) > MAX_EDIT_DISTANCE:
        # to ensure Viterbi can keep running -- use the word itself
        return [(string, 0)]
    
    suggest_dict = {}
    
    queue = [string]
    q_dictionary = {}  # items other than string that we've checked
    
    while len(queue)>0:
        q_item = queue[0]  # pop
        queue = queue[1:]
        
        # process queue item
        if (q_item in dictionary) and (q_item not in suggest_dict):
            if (dictionary[q_item][1]>0):
            # word is in dictionary, and is a word from the corpus,
            # and not already in suggestion list so add to suggestion
            # dictionary, indexed by the word with value (frequency
            # in corpus, edit distance)
            # note: q_items that are not the input string are shorter
            # than input string since only deletes are added (unless
            # manual dictionary corrections are added)
                assert len(string)>=len(q_item)
                suggest_dict[q_item] = \
                    (dictionary[q_item][1], len(string) - len(q_item))
            
            # the suggested corrections for q_item as stored in
            # dictionary (whether or not q_item itself is a valid
            # word or merely a delete) can be valid corrections
            for sc_item in dictionary[q_item][0]:
                if (sc_item not in suggest_dict):
                    
                    # compute edit distance
                    # suggested items should always be longer (unless
                    # manual corrections are added)
                    assert len(sc_item)>len(q_item)
                    # q_items that are not input should be shorter
                    # than original string 
                    # (unless manual corrections added)
                    assert len(q_item)<=len(string)
                    if len(q_item)==len(string):
                        assert q_item==string
                        item_dist = len(sc_item) - len(q_item)

                    # item in suggestions list should not be the same
                    # as the string itself
                    assert sc_item!=string           
                    # calculate edit distance using Damerau-
                    # Levenshtein distance
                    item_dist = dameraulevenshtein(sc_item, string)
                    
                    if item_dist<=MAX_EDIT_DISTANCE:
                        # should already be in dictionary if in
                        # suggestion list
                        assert sc_item in dictionary  
                        # trim list to contain state space
                        if (dictionary[q_item][1]>0): 
                            suggest_dict[sc_item] = \
                                (dictionary[sc_item][1], item_dist)
        
        # now generate deletes (e.g. a substring of string or of a
        # delete) from the queue item as additional items to check
        # -- add to end of queue
        assert len(string)>=len(q_item)
        if (len(string)-len(q_item))<MAX_EDIT_DISTANCE \
            and len(q_item)>1:
            for c in range(len(q_item)): # character index        
                word_minus_c = q_item[:c] + q_item[c+1:]
                if word_minus_c not in q_dictionary:
                    queue.append(word_minus_c)
                    # arbitrary value to identify we checked this
                    q_dictionary[word_minus_c] = None

    # return list of suggestions: (correction, edit distance)
    
    # only include words that have appeared a minimum number of times
    # make sure that we do not lose the original word
    as_list = [i for i in suggest_dict.items() 
               if (i[1][0]>min_count or i[0]==string)]
    
    # only include the most likely suggestions (based on frequency
    # and edit distance from original word)
    trunc_as_list = sorted(as_list, 
            key = lambda (term, (freq, dist)): (dist, -freq))[:max_sug]
    
    if len(trunc_as_list)==0:
        # to ensure Viterbi can keep running
        # -- use the word itself if no corrections are found
        return [(string, 0)]
        
    else:
        # drop the word frequency - not needed beyond this point
        return [(i[0], i[1][1]) for i in trunc_as_list]

    '''
    Output format:
    get_suggestions('file', dictionary)
    [('file', 0), ('five', 1), ('fire', 1), ('fine', 1), ('will', 2),
    ('time', 2), ('face', 2), ('like', 2), ('life', 2), ('while', 2)]
    '''
    
def viterbi(words, dictionary, start_prob, default_start_prob, 
            transition_prob, default_transition_prob):
    
    V = [{}]
    path = {}
    path_context = []
    
    # character level correction - used to determine state space
    corrections = get_suggestions(words[0], dictionary)
        
    # Initialize base cases (t == 0)
    for sug_word in corrections:
        
        # compute the value for all possible starting states
        V[0][sug_word[0]] = math.exp(
            get_start_prob(sug_word[0], start_prob, 
                           default_start_prob)
            + get_emission_prob(sug_word[1]))
        
        # remember all the different paths (only one state so far)
        path[sug_word[0]] = [sug_word[0]]
 
    # normalize for numerical stability
    path_temp_sum = sum(V[0].values())
    V[0].update({k: math.log(v/path_temp_sum) 
                 for k, v in V[0].items()})
    
    # keep track of previous state space
    prev_corrections = [i[0] for i in corrections]
    
    if len(words) == 1:
        path_context = [max(V[0], key=lambda i: V[0][i])]
        return path_context

    # run Viterbi for t > 0
    for t in range(1, len(words)):

        V.append({})
        new_path = {}
        
        # character level correction
        corrections = get_suggestions(words[t], dictionary)
 
        for sug_word in corrections:
        
            sug_word_emission_prob = get_emission_prob(sug_word[1])
            
            # compute the values coming from all possible previous
            # states, only keep the maximum
            (prob, word) = max(
                (get_belief(prev_word, V[t-1]) 
                + get_transition_prob(sug_word[0], prev_word, 
                    transition_prob, default_transition_prob)
                + sug_word_emission_prob, prev_word) 
                               for prev_word in prev_corrections)

            # save the maximum value for each state
            V[t][sug_word[0]] = math.exp(prob)
            
            # remember the path we came from to get this maximum value
            new_path[sug_word[0]] = path[word] + [sug_word[0]]
            
        # normalize for numerical stability
        path_temp_sum = sum(V[t].values())
        V[t].update({k: math.log(v/path_temp_sum) 
                     for k, v in V[t].items()})
        
        # keep track of previous state space
        prev_corrections = [i[0] for i in corrections]
 
        # don't need to remember the old paths
        path = new_path
     
    (prob, word) = max((V[t][sug_word[0]], sug_word[0]) 
                       for sug_word in corrections)
    path_context = path[word]
    
    return path_context

def get_count_mismatches(sentences):
    orig_sentence, sug_sentence = sentences
    count_mismatches = len([(orig_sentence[i], sug_sentence[i]) for i in range(len(orig_sentence))
            if orig_sentence[i]!=sug_sentence[i]])
    return count_mismatches, orig_sentence, sug_sentence

def correct_document_context_parallel_sentences(fname, dictionary,
                             start_prob, default_start_prob,
                             transition_prob, default_transition_prob):
    
    ############
    #
    # load file & initial processing
    #
    ############
    
    # broadcast Python dictionaries to workers
    bc_dictionary = sc.broadcast(dictionary)
    bc_start_prob = sc.broadcast(start_prob)
    bc_transition_prob = sc.broadcast(transition_prob)
    
    # convert all text to lowercase and drop empty lines
    make_all_lower = sc.textFile(fname) \
        .map(lambda line: line.lower()) \
        .filter(lambda x: x!='')
    
    regex = re.compile('[^a-z ]')
    
    # split into sentences -> remove special characters -> convert into list of words
    split_sentence = make_all_lower.flatMap(lambda line: line.split('.')) \
            .map(lambda sentence: regex.sub(' ', sentence)) \
            .map(lambda sentence: sentence.split()).cache()
    
    # use accumulator to count the number of words checked
    accum_total_words = sc.accumulator(0)
    split_words = split_sentence.flatMap(lambda x: x).foreach(lambda x: accum_total_words.add(1))
    
    # assign each sentence a unique id
    sentence_id = split_sentence.zipWithIndex().map(lambda (k, v): (v, k)).cache()

    ############
    #
    # spell-checking
    #
    ############

    # apply Viterbi algorithm to each sentence
    sentence_correction = sentence_id.mapValues(lambda v: (v, 
                viterbi(v, bc_dictionary.value, bc_start_prob.value, 
                        default_start_prob, bc_transition_prob.value, default_transition_prob)))
    
    ############
    #
    # output results
    #
    ############
    
    # count the number of errors per sentence, drop any sentences without errors
    sentence_errors = sentence_correction.mapValues(lambda v: (get_count_mismatches(v))). \
            filter(lambda (k, v): v[0]>0).cache()               
    
    # collect all sentences with identified errors
    sentence_errors_list = sentence_errors.collect()
    
    # number of potentially misspelled words
    num_errors = sum([s[1][0] for s in sentence_errors_list])
    
    # print identified errors (eventually output to file)
    for sentence in sentence_errors_list:
        print 'Line %i: %s --> %s' % (sentence[0], ' '.join(sentence[1][1]), ' '.join(sentence[1][2]))
    
    print '-----'
    print 'Total words checked: %i' % accum_total_words.value
    print 'Total potential errors found: %i' % num_errors

<div class="alert alert-info">
  <strong>SAMPLE OUTPUTS</strong>
</div>

In [4]:
%%time
dictionary, start_prob, default_start_prob, transition_prob, default_transition_prob = \
    parallel_create_dictionary('testdata/big.txt')

Creating dictionary...
Total words processed: 1105285
Total unique words in corpus: 29157
Total items in dictionary (corpus words and deletions): 2151998
  Edit distance for deletions: 3
Total unique words at the start of a sentence: 15297
Total unique word transitions: 27224
CPU times: user 12.8 s, sys: 1.45 s, total: 14.3 s
Wall time: 1min 5s


In [6]:
%%time
correct_document_context_parallel_sentences('testdata/test.txt', dictionary,
        start_prob, default_start_prob, transition_prob, default_transition_prob)

Line 3: this is ax test --> this is a test
Line 4: this is za test --> this is a test
Line 5: thee is a test --> there is a test
Line 6: her tee set --> her to set
-----
Total words checked: 27
Total potential errors found: 4
CPU times: user 7.44 s, sys: 527 ms, total: 7.97 s
Wall time: 20.1 s


***
### 4. Spellchecking SPARK Code Performance - Parallelizing Across Possible Word Combinations

In [4]:
######################
#
# DOCUMENTATION HERE
#
######################

def dameraulevenshtein(seq1, seq2):
    '''
    Calculate the Damerau-Levenshtein distance between sequences.
    Same code as word-level checking.
    '''
    
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1
    # matrix. However, only the current and two previous rows are
    # needed at once, so we only store those.
    
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    
    for x in xrange(len(seq1)):
        
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = \
            oneago, thisrow, [0] * len(seq2) + [x + 1]
        
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
                
    return thisrow[len(seq2) - 1]

def get_suggestions(string, dictionary, longest_word_length=20, 
                    min_count=100, max_sug=10):
    '''
    Return list of suggested corrections for potentially incorrectly
    spelled word.
    Code based on get_suggestions function from word-level checking,
    with the addition of the min_count parameter, which only
    considers words that have occur more than min_count times in the
    (dictionary) corpus.
    '''
    
    if (len(string) - longest_word_length) > MAX_EDIT_DISTANCE:
        # to ensure Viterbi can keep running -- use the word itself
        return [(string, 0)]
    
    suggest_dict = {}
    
    queue = [string]
    q_dictionary = {}  # items other than string that we've checked
    
    while len(queue)>0:
        q_item = queue[0]  # pop
        queue = queue[1:]
        
        # process queue item
        if (q_item in dictionary) and (q_item not in suggest_dict):
            if (dictionary[q_item][1]>0):
            # word is in dictionary, and is a word from the corpus,
            # and not already in suggestion list so add to suggestion
            # dictionary, indexed by the word with value (frequency
            # in corpus, edit distance)
            # note: q_items that are not the input string are shorter
            # than input string since only deletes are added (unless
            # manual dictionary corrections are added)
                assert len(string)>=len(q_item)
                suggest_dict[q_item] = \
                    (dictionary[q_item][1], len(string) - len(q_item))
            
            # the suggested corrections for q_item as stored in
            # dictionary (whether or not q_item itself is a valid
            # word or merely a delete) can be valid corrections
            for sc_item in dictionary[q_item][0]:
                if (sc_item not in suggest_dict):
                    
                    # compute edit distance
                    # suggested items should always be longer (unless
                    # manual corrections are added)
                    assert len(sc_item)>len(q_item)
                    # q_items that are not input should be shorter
                    # than original string 
                    # (unless manual corrections added)
                    assert len(q_item)<=len(string)
                    if len(q_item)==len(string):
                        assert q_item==string
                        item_dist = len(sc_item) - len(q_item)

                    # item in suggestions list should not be the same
                    # as the string itself
                    assert sc_item!=string           
                    # calculate edit distance using Damerau-
                    # Levenshtein distance
                    item_dist = dameraulevenshtein(sc_item, string)
                    
                    if item_dist<=MAX_EDIT_DISTANCE:
                        # should already be in dictionary if in
                        # suggestion list
                        assert sc_item in dictionary  
                        # trim list to contain state space
                        if (dictionary[q_item][1]>0): 
                            suggest_dict[sc_item] = \
                                (dictionary[sc_item][1], item_dist)
        
        # now generate deletes (e.g. a substring of string or of a
        # delete) from the queue item as additional items to check
        # -- add to end of queue
        assert len(string)>=len(q_item)
        if (len(string)-len(q_item))<MAX_EDIT_DISTANCE \
            and len(q_item)>1:
            for c in range(len(q_item)): # character index        
                word_minus_c = q_item[:c] + q_item[c+1:]
                if word_minus_c not in q_dictionary:
                    queue.append(word_minus_c)
                    # arbitrary value to identify we checked this
                    q_dictionary[word_minus_c] = None

    # return list of suggestions: (correction, edit distance)
    
    # only include words that have appeared a minimum number of times
    # make sure that we do not lose the original word
    as_list = [i for i in suggest_dict.items() 
               if (i[1][0]>min_count or i[0]==string)]
    
    # only include the most likely suggestions (based on frequency
    # and edit distance from original word)
    trunc_as_list = sorted(as_list, 
            key = lambda (term, (freq, dist)): (dist, -freq))[:max_sug]
    
    if len(trunc_as_list)==0:
        # to ensure Viterbi can keep running
        # -- use the word itself if no corrections are found
        return [(string, 0)]
        
    else:
        # drop the word frequency - not needed beyond this point
        return [(i[0], i[1][1]) for i in trunc_as_list]

    '''
    Output format:
    get_suggestions('file', dictionary)
    [('file', 0), ('five', 1), ('fire', 1), ('fine', 1), ('will', 2),
    ('time', 2), ('face', 2), ('like', 2), ('life', 2), ('while', 2)]
    '''
    
def get_emission_prob(edit_dist, poisson_lambda=0.01):
    '''
    The emission probability, i.e. P(observed word|intended word)
    is approximated by a Poisson(k, l) distribution, where 
    k=edit distance and l=0.01.
    
    The lambda parameter matches the one used in the AM207
    lecture notes. Various parameters between 0 and 1 were tested
    to confirm that 0.01 yields the most accurate results.
    '''
    
    return math.log(poisson.pmf(edit_dist, poisson_lambda))

######################
# Multiple helper functions are used to avoid KeyErrors when
# attempting to access values that are not present in dictionaries,
# in which case the previously specified default value is returned.
######################

def get_start_prob(word, start_prob, default_start_prob):
    try:
        return start_prob[word]
    except KeyError:
        return default_start_prob
    
def get_transition_prob(cur_word, prev_word, transition_prob, default_transition_prob):
    try:
        return transition_prob[prev_word][cur_word]
    except KeyError:
        return default_transition_prob

def get_belief(prev_word, prev_belief):
    try:
        return prev_belief[prev_word]
    except KeyError:
        return math.log(math.exp(min(prev_belief.values()))/2.)  

def map_sentence_words(sentence, tmp_dict):
    return [[word, get_suggestions(word, tmp_dict)] 
            for i, word in enumerate(sentence)]

def split_suggestions(sentence):
    result = []
    for word in sentence:
        result.append([(word[0], s[0], get_emission_prob(s[1])) for s in word[1]])
    return result

def get_word_combos(sug_lists):
    return list(itertools.product(*sug_lists))

def split_combos(combos):
    sent_id, combo_list = combos
    return [[sent_id, c] for c in combo_list]

def get_combo_prob(combo, tmp_sp, d_sp, tmp_tp, d_tp):
    
    # first word in sentence
    # emission prob * start prob
    orig_path = [combo[0][0]]
    sug_path = [combo[0][1]]
    prob = combo[0][2] + get_start_prob(combo[0][1], tmp_sp, d_sp)
    
    # subsequent words
    for i, w in enumerate(combo[1:]):
        orig_path.append(w[0])
        sug_path.append(w[1])
        prob += w[2] + get_transition_prob(w[1], combo[i-1][1], tmp_tp, d_tp)
    
    return orig_path, sug_path, prob

def get_count_mismatches_prob(sentences):
    orig_sentence, sug_sentence, prob = sentences
    count_mismatches = len([(orig_sentence[i], sug_sentence[i]) for i in range(len(orig_sentence))
            if orig_sentence[i]!=sug_sentence[i]])
    return count_mismatches, orig_sentence, sug_sentence

def correct_document_context_parallel_combos(fname, dictionary,
                             start_prob, default_start_prob,
                             transition_prob, default_transition_prob):
    
    ############
    #
    # load file & initial processing
    #
    ############
    
    # broadcast Python dictionaries to workers
    bc_dictionary = sc.broadcast(dictionary)
    bc_start_prob = sc.broadcast(start_prob)
    bc_transition_prob = sc.broadcast(transition_prob)
    
    # convert all text to lowercase and drop empty lines
    make_all_lower = sc.textFile(fname) \
        .map(lambda line: line.lower()) \
        .filter(lambda x: x!='')
    
    regex = re.compile('[^a-z ]')
    
    # split into sentences -> remove special characters -> convert into list of words
    split_sentence = make_all_lower.flatMap(lambda line: line.split('.')) \
            .map(lambda sentence: regex.sub(' ', sentence)) \
            .map(lambda sentence: sentence.split()).cache()
    
    # use accumulator to count the number of words checked
    accum_total_words = sc.accumulator(0)
    split_words = split_sentence.flatMap(lambda x: x).foreach(lambda x: accum_total_words.add(1))
    
    # assign each sentence a unique id
    sentence_id = split_sentence.zipWithIndex().map(lambda (k, v): (v, k)).cache()
    
    ############
    #
    # spell-checking
    #
    ############
    
    # look up possible suggestions for each word in each sentence
    sentence_words = sentence_id.mapValues(lambda v: map_sentence_words(v, bc_dictionary.value))
    
    # look up emission probabilities for each word
    # i.e. P(observed word|intended word)
    sentence_word_sug = sentence_words.mapValues(lambda v: split_suggestions(v))
    
    # generate all possible corrected combinations (using Cartesian product)
    # i.e. a sentence with 4 word, each of which have 5 possible suggestions,
    # will yield 5^4 possible combinations
    sentence_word_combos = sentence_word_sug.mapValues(lambda v: get_word_combos(v))
    
    # flatmap into all possible combinations per sentence
    # format: [sentence id, 
    # [(observed first word, potential first word, P(observed first word|intended first word)]), 
    # (observed second word, potential second word, P(observed second word|intended second word)]), ...]
    sentence_word_combos_split = sentence_word_combos.flatMap(lambda x: split_combos(x))
    
    # calculate the probability of each word combination being the intended one, given what was observed
    # note: the approach does not allow for normalization across iterations, so may yield different results
    sentence_word_combos_prob = sentence_word_combos_split.mapValues(lambda v:  
                                get_combo_prob(v, bc_start_prob.value, default_start_prob, 
                                               bc_transition_prob.value, default_transition_prob))
    
    # identify the word combination with the highest probability for each sentence
    sentence_max_prob = sentence_word_combos_prob.reduceByKey(lambda a,b: a if a[2] > b[2] else b)

    ############
    #
    # output results
    #
    ############
    
    # count the number of errors per sentence, drop any sentences without errors
    sentence_errors = sentence_max_prob.mapValues(lambda v: (get_count_mismatches_prob(v))) \
            .filter(lambda (k, v): v[0]>0).cache()
               
    # collect all sentences with identified errors
    sentence_errors_list = sentence_errors.collect()
    
    # number of potentially misspelled words
    num_errors = sum([s[1][0] for s in sentence_errors_list])
    
    # print identified errors (eventually output to file)
    for sentence in sentence_errors_list:
        print 'Line %i: %s --> %s' % (sentence[0], ' '.join(sentence[1][1]), ' '.join(sentence[1][2]))
    
    print '-----'
    print 'Total words checked: %i' % accum_total_words.value
    print 'Total potential errors found: %i' % num_errors

<div class="alert alert-info">
  <strong>SAMPLE OUTPUTS</strong>
</div>

In [5]:
%%time
dictionary, start_prob, default_start_prob, transition_prob, default_transition_prob = \
    parallel_create_dictionary('testdata/big.txt')

Creating dictionary...
Total words processed: 1105285
Total unique words in corpus: 29157
Total items in dictionary (corpus words and deletions): 2151998
  Edit distance for deletions: 3
Total unique words at the start of a sentence: 15297
Total unique word transitions: 27224
CPU times: user 11.3 s, sys: 1.01 s, total: 12.3 s
Wall time: 54.7 s


In [6]:
%%time
correct_document_context_parallel_combos('testdata/test.txt', dictionary,
        start_prob, default_start_prob, transition_prob, default_transition_prob)

Line 6: her tee set --> her the set
Line 3: this is ax test --> this is as test
Line 4: this is za test --> this is a test
Line 5: thee is a test --> then is a test
-----
Total words checked: 27
Total potential errors found: 4
CPU times: user 9.14 s, sys: 660 ms, total: 9.8 s
Wall time: 29.1 s


***
### 5. Spellchecking SPARK Code Performance - Parallelizing Across Viterbi Algorithm Iterations

In [16]:
######################
#
# DOCUMENTATION HERE
#
######################

def dameraulevenshtein(seq1, seq2):
    '''
    Calculate the Damerau-Levenshtein distance between sequences.
    Same code as word-level checking.
    '''
    
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1
    # matrix. However, only the current and two previous rows are
    # needed at once, so we only store those.
    
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    
    for x in xrange(len(seq1)):
        
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = \
            oneago, thisrow, [0] * len(seq2) + [x + 1]
        
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
                
    return thisrow[len(seq2) - 1]

def get_suggestions(string, dictionary, longest_word_length=20, 
                    min_count=100, max_sug=10):
    '''
    Return list of suggested corrections for potentially incorrectly
    spelled word.
    Code based on get_suggestions function from word-level checking,
    with the addition of the min_count parameter, which only
    considers words that have occur more than min_count times in the
    (dictionary) corpus.
    '''
    
    if (len(string) - longest_word_length) > MAX_EDIT_DISTANCE:
        # to ensure Viterbi can keep running -- use the word itself
        return [(string, 0)]
    
    suggest_dict = {}
    
    queue = [string]
    q_dictionary = {}  # items other than string that we've checked
    
    while len(queue)>0:
        q_item = queue[0]  # pop
        queue = queue[1:]
        
        # process queue item
        if (q_item in dictionary) and (q_item not in suggest_dict):
            if (dictionary[q_item][1]>0):
            # word is in dictionary, and is a word from the corpus,
            # and not already in suggestion list so add to suggestion
            # dictionary, indexed by the word with value (frequency
            # in corpus, edit distance)
            # note: q_items that are not the input string are shorter
            # than input string since only deletes are added (unless
            # manual dictionary corrections are added)
                assert len(string)>=len(q_item)
                suggest_dict[q_item] = \
                    (dictionary[q_item][1], len(string) - len(q_item))
            
            # the suggested corrections for q_item as stored in
            # dictionary (whether or not q_item itself is a valid
            # word or merely a delete) can be valid corrections
            for sc_item in dictionary[q_item][0]:
                if (sc_item not in suggest_dict):
                    
                    # compute edit distance
                    # suggested items should always be longer (unless
                    # manual corrections are added)
                    assert len(sc_item)>len(q_item)
                    # q_items that are not input should be shorter
                    # than original string 
                    # (unless manual corrections added)
                    assert len(q_item)<=len(string)
                    if len(q_item)==len(string):
                        assert q_item==string
                        item_dist = len(sc_item) - len(q_item)

                    # item in suggestions list should not be the same
                    # as the string itself
                    assert sc_item!=string           
                    # calculate edit distance using Damerau-
                    # Levenshtein distance
                    item_dist = dameraulevenshtein(sc_item, string)
                    
                    if item_dist<=MAX_EDIT_DISTANCE:
                        # should already be in dictionary if in
                        # suggestion list
                        assert sc_item in dictionary  
                        # trim list to contain state space
                        if (dictionary[q_item][1]>0): 
                            suggest_dict[sc_item] = \
                                (dictionary[sc_item][1], item_dist)
        
        # now generate deletes (e.g. a substring of string or of a
        # delete) from the queue item as additional items to check
        # -- add to end of queue
        assert len(string)>=len(q_item)
        if (len(string)-len(q_item))<MAX_EDIT_DISTANCE \
            and len(q_item)>1:
            for c in range(len(q_item)): # character index        
                word_minus_c = q_item[:c] + q_item[c+1:]
                if word_minus_c not in q_dictionary:
                    queue.append(word_minus_c)
                    # arbitrary value to identify we checked this
                    q_dictionary[word_minus_c] = None

    # return list of suggestions: (correction, edit distance)
    
    # only include words that have appeared a minimum number of times
    # make sure that we do not lose the original word
    as_list = [i for i in suggest_dict.items() 
               if (i[1][0]>min_count or i[0]==string)]
    
    # only include the most likely suggestions (based on frequency
    # and edit distance from original word)
    trunc_as_list = sorted(as_list, 
            key = lambda (term, (freq, dist)): (dist, -freq))[:max_sug]
    
    if len(trunc_as_list)==0:
        # to ensure Viterbi can keep running
        # -- use the word itself if no corrections are found
        return [(string, 0)]
        
    else:
        # drop the word frequency - not needed beyond this point
        return [(i[0], i[1][1]) for i in trunc_as_list]

    '''
    Output format:
    get_suggestions('file', dictionary)
    [('file', 0), ('five', 1), ('fire', 1), ('fine', 1), ('will', 2),
    ('time', 2), ('face', 2), ('like', 2), ('life', 2), ('while', 2)]
    '''
    
def get_emission_prob(edit_dist, poisson_lambda=0.01):
    '''
    The emission probability, i.e. P(observed word|intended word)
    is approximated by a Poisson(k, l) distribution, where 
    k=edit distance and l=0.01.
    
    The lambda parameter matches the one used in the AM207
    lecture notes. Various parameters between 0 and 1 were tested
    to confirm that 0.01 yields the most accurate results.
    '''
    
    return math.log(poisson.pmf(edit_dist, poisson_lambda))

######################
# Multiple helper functions are used to avoid KeyErrors when
# attempting to access values that are not present in dictionaries,
# in which case the previously specified default value is returned.
######################

def get_start_prob(word, start_prob, default_start_prob):
    try:
        return start_prob[word]
    except KeyError:
        return default_start_prob
    
def get_transition_prob(cur_word, prev_word, transition_prob, default_transition_prob):
    try:
        return transition_prob[prev_word][cur_word]
    except KeyError:
        return default_transition_prob

def get_belief(prev_word, prev_belief):
    try:
        return prev_belief[prev_word]
    except KeyError:
        return math.log(math.exp(min(prev_belief.values()))/2.)  

def get_count_mismatches(sentences):
    orig_sentence, sug_sentence = sentences
    count_mismatches = len([(orig_sentence[i], sug_sentence[i]) for i in range(len(orig_sentence))
            if orig_sentence[i]!=sug_sentence[i]])
    return count_mismatches, orig_sentence, sug_sentence

def get_sentence_word_id(words):
    return [(i, w) for i, w in enumerate(words)]

def split_sentence_words(sentence):
    sent_id, words = sentence
    return [[sent_id, w] for w in words]

def start_word_prob(words, tmp_sp, d_sp):
    orig_word, sug_words = words
    probs = [(w[0], 
              math.exp(get_start_prob(w[0], tmp_sp, d_sp) + get_emission_prob(w[1]))
             ) 
             for w in sug_words]
    sum_probs = sum([p[1] for p in probs])
    probs = [([p[0]], math.log(p[1]/sum_probs)) for p in probs]
    return probs

def split_suggestions(sentence):
    sent_id, (word, word_sug)  = sentence
    return [[sent_id, (word, w)] for w in word_sug]

def subs_word_prob(words, tmp_tp, d_tp):
    
    # unpack values
    sent_id = words[0]
    cur_word = words[1][0][0]
    cur_sug = words[1][0][1][0]
    cur_sug_ed = words[1][0][1][1]
    prev_sug = words[1][1]
    
    # belief + transition probability + emission probability
    (prob, word) = max((p[1]
                 + get_transition_prob(cur_sug, p[0][-1], tmp_tp, d_tp)
                 + get_emission_prob(cur_sug_ed), p[0])
                     for p in prev_sug)
    
    return sent_id, (word + [cur_sug], math.exp(prob))

def normalize(probs):
    sum_probs = sum([p[1] for p in probs])
    return [(p[0], math.log(p[1]/sum_probs)) for p in probs]

def get_max_path(final_paths):
    max_path = max((p[1], p[0]) for p in final_paths)
    return max_path[1]

def correct_document_context_parallel_steps(fname, dictionary,
                             start_prob, default_start_prob,
                             transition_prob, default_transition_prob):
    
    ############
    #
    # load file & initial processing
    #
    ############
    
    # broadcast Python dictionaries to workers
    bc_dictionary = sc.broadcast(dictionary)
    bc_start_prob = sc.broadcast(start_prob)
    bc_transition_prob = sc.broadcast(transition_prob)
    
    # convert all text to lowercase and drop empty lines
    make_all_lower = sc.textFile(fname) \
        .map(lambda line: line.lower()) \
        .filter(lambda x: x!='')
    
    regex = re.compile('[^a-z ]')
    
    # split into sentences -> remove special characters -> convert into list of words
    split_sentence = make_all_lower.flatMap(lambda line: line.split('.')) \
            .map(lambda sentence: regex.sub(' ', sentence)) \
            .map(lambda sentence: sentence.split()).cache()
    
    # use accumulator to count the number of words checked
    accum_total_words = sc.accumulator(0)
    split_words = split_sentence.flatMap(lambda x: x).foreach(lambda x: accum_total_words.add(1))
    
    # assign each sentence a unique id
    sentence_id = split_sentence.zipWithIndex().map(lambda (k, v): (v, k)).cache()
    
    ############
    #
    # spell-checking
    #
    ############

    # number each word in a sentence, and split into individual words
    sentence_word_id = sentence_id.mapValues(lambda v: get_sentence_word_id(v)) \
            .flatMap(lambda x: split_sentence_words(x))
    
    # get suggestions for each word
    sentence_word_suggestions = sentence_word_id.mapValues(lambda v: 
                                            (v[0], v[1], get_suggestions(v[1], bc_dictionary.value))).cache()
    
    # filter for the first words in sentences
    sentence_word_1 = sentence_word_suggestions.filter(lambda (k, v): v[0]==0) \
            .mapValues(lambda v: (v[1], v[2]))
    
    # calculate probability for each suggestion
    # format: (sentence id, [path-probability pairs])
    sentence_path = sentence_word_1.mapValues(lambda v: 
                                              start_word_prob(v, bc_start_prob.value, default_start_prob))
    
    word_num = 1
    
    # filter for the next words in sentences
    sentence_word_next = sentence_word_suggestions.filter(lambda (k,v): v[0]==word_num) \
            .mapValues(lambda v: (v[1], v[2]))
    
    # check that there are more words left
    while not sentence_word_next.isEmpty():

        # split into suggestions
        sentence_word_next_split = sentence_word_next.flatMap(lambda x: split_suggestions(x))
        
        # join on previous path
        # format: (sentence id, ((current word, (current word suggestion, edit distance)), 
        #         [(previous path-probability pairs)]))
        sentence_word_next_path = sentence_word_next_split.join(sentence_path)
        
        # calculate path with max probability
        sentence_word_next_path_prob = sentence_word_next_path.map(lambda x:
                                                subs_word_prob(x, bc_transition_prob.value, default_transition_prob))
        
        # normalize for numerical stability
        sentence_path = sentence_word_next_path_prob.groupByKey().mapValues(lambda v: normalize(v))
        
        word_num += 1
        
        # filter for the next words in sentences
        sentence_word_next = sentence_word_suggestions.filter(lambda (k, v): v[0]==word_num) \
                .mapValues(lambda v: (v[1], v[2]))
        
    # get most likely path (sentence)
    sentence_suggestion = sentence_path.mapValues(lambda v: get_max_path(v))

    # join with original path (sentence)
    sentence_max_prob = sentence_id.join(sentence_suggestion)
        
    ############
    #
    # output results
    #
    ############
    
    # count the number of errors per sentence, drop any sentences without errors
    sentence_errors = sentence_max_prob.mapValues(lambda v: (get_count_mismatches(v))) \
            .filter(lambda (k, v): v[0]>0).cache()
               
    # collect all sentences with identified errors
    sentence_errors_list = sentence_errors.collect()
    
    # number of potentially misspelled words
    num_errors = sum([s[1][0] for s in sentence_errors_list])
    
    # print identified errors (eventually output to file)
    for sentence in sentence_errors_list:
        print 'Line %i: %s --> %s' % (sentence[0], ' '.join(sentence[1][1]), ' '.join(sentence[1][2]))
    
    print '-----'
    print 'Total words checked: %i' % accum_total_words.value
    print 'Total potential errors found: %i' % num_errors

<div class="alert alert-info">
  <strong>SAMPLE OUTPUTS</strong>
</div>

In [5]:
%%time
dictionary, start_prob, default_start_prob, transition_prob, default_transition_prob = \
    parallel_create_dictionary('testdata/big.txt')

Creating dictionary...
Total words processed: 1105285
Total unique words in corpus: 29157
Total items in dictionary (corpus words and deletions): 2151998
  Edit distance for deletions: 3
Total unique words at the start of a sentence: 15297
Total unique word transitions: 27224
CPU times: user 10.1 s, sys: 880 ms, total: 11 s
Wall time: 1min 6s


In [17]:
%%time
correct_document_context_parallel_steps('testdata/test.txt', dictionary,
        start_prob, default_start_prob, transition_prob, default_transition_prob)

Line 3: this is ax test --> this is a test
Line 4: this is za test --> this is a test
Line 5: thee is a test --> there is a test
-----
Total words checked: 27
Total potential errors found: 3
CPU times: user 8.69 s, sys: 1.94 s, total: 10.6 s
Wall time: 45.6 s
