# A machine learning model to understand fancy abbreviations

Recently I bumped into a [question](https://stackoverflow.com/questions/43510778) on Stack Overflow: how to recover phrases from abbreviations, e.g. turn *wtrbtl* into *water bottle*, and *bsktball* into *basketball*. The question had an additional complication: the lack of a comprehensive list of words. That means we need an algorithm able to invent new likely words.

I was intrigued, and started researching which algorithms and math lie behind modern spell-checkers. It turned out that a good spell-checker can be made with an n-gram language model, a model of word distortions, and a greedy beam search algorithm. The whole construction is called a 'noisy channel model' (see the 'Spelling Correction and the Noisy Channel' section in http://web.stanford.edu/~jurafsky/slp3 for more details).

With this knowledge and Python, I wrote a model from scratch. After training on "The Fellowship of the Ring" text, it was able to recognize abbreviations of modern sports terms.

Spell checkers are widely used: from your phone's keyboard to search engines and voice assistants. It's not easy to make a good spell checker, because it has to be really fast and universal (able to correct unseen words) at the same time. That's why there is so much science in spell checkers. This article aims to give an idea of this science, and just to have some fun.

### Math behind a spell checker
In the noisy channel model, each abbreviation is the result of a random distortion of the original phrase.

To recover the original phrase, we need to answer two questions: which original phrases are likely, and which distortions are likely?


By Bayes' theorem,

$$p(phrase|abbreviation) \sim p(phrase)\, p(abbreviation|phrase) = p(phrase) \sum p(distortion|phrase)$$

Here $distortion$ is any change of the original $phrase$ that turns it into the observed $abbreviation$, and the sum runs over all such distortions. The $\sim$ symbol means "proportional to": the left-hand side is a probability distribution over phrases, while the right-hand side is generally not normalized.
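To make the formula concrete, here is a minimal sketch with made-up numbers: the priors and the drop probabilities below are assumptions for illustration only, not values estimated in this article. The helper names (`p_drop`, `p_abbreviation_given_phrase`) are hypothetical as well. The code sums over all ways of deleting characters from a phrase to obtain the abbreviation, multiplies by the prior, and normalizes over the two candidates.

```python
# Hypothetical prior probabilities of two candidate phrases (made-up numbers).
prior = {"battle": 0.006, "bottle": 0.002}

# Hypothetical probabilities that a character is dropped when abbreviating.
P_DROP_VOWEL = 0.75
P_DROP_CONSONANT = 0.25
VOWELS = set("aeiouy")


def p_drop(ch):
    """Probability that character ch is omitted from the abbreviation."""
    return P_DROP_VOWEL if ch in VOWELS else P_DROP_CONSONANT


def p_abbreviation_given_phrase(phrase, abbr):
    """Sum over all ways of deleting characters from phrase to obtain abbr."""
    # dp[j] = probability that the processed part of phrase produced abbr[:j]
    dp = [1.0] + [0.0] * len(abbr)
    for ch in phrase:
        new = [0.0] * (len(abbr) + 1)
        for j in range(len(abbr) + 1):
            new[j] += dp[j] * p_drop(ch)                 # ch is dropped
            if j < len(abbr) and abbr[j] == ch:
                new[j + 1] += dp[j] * (1 - p_drop(ch))   # ch is kept
        dp = new
    return dp[len(abbr)]


# Unnormalized posterior: prior times likelihood, as in the formula above.
scores = {ph: prior[ph] * p_abbreviation_given_phrase(ph, "btl") for ph in prior}
total = sum(scores.values())
posterior = {ph: s / total for ph, s in scores.items()}
print(posterior)
```

Both candidates can produce *btl* in two ways (either of the two t's may be the one that is dropped) and with equal probability, so in this toy setting the prior alone decides the ranking.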

Both original phrase likelihood and distortion likelihood can be estimated with statistical models. I will use the simplest models: [character n-grams](https://en.wikipedia.org/wiki/N-gram). I could use more complex models (e.g. recurrent neural networks), but it doesn't change the principle.

With such models, we can reconstruct probable original phrases letter by letter, using a greedy directed search algorithm.


### N-gram language model

An n-gram model looks at the previous n-1 letters and estimates the probability of the next (n'th) letter conditional on them. For example, the probability of the letter "g" appearing after the sequence "bowlin" would be calculated by a 4-gram model as $p(g|bowlin)=p(g|lin)$, because the model ignores all the characters before the last 3, for the sake of simplicity. Conditional probabilities such as this are determined ("learned") on a training corpus of texts. In my example,
$$p(g|lin)=\frac{\#(ling)}{\#(lin\bullet)}=\frac{\#(ling)}{\#(lina)+\#(linb)+\#(linc)+...}$$
Here $\#(ling)$ is the number of occurrences of "ling" in the training text, and $\#(lin\bullet)$ is the number of all 4-grams in the text starting with "lin".
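The counting formula can be checked by hand on a tiny string. The sketch below is only an illustration: the function name `ngram_mle` and the toy text are my assumptions, and no smoothing is applied yet.

```python
from collections import Counter


def ngram_mle(text, context, letter):
    """Unsmoothed estimate of p(letter | context) from raw n-gram counts."""
    n = len(context) + 1
    # Count every n-gram (substring of length n) in the text.
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    numerator = ngrams[context + letter]
    # All n-grams that start with the context, whatever letter follows.
    denominator = sum(c for gram, c in ngrams.items() if gram.startswith(context))
    return numerator / denominator if denominator else float("nan")


toy_text = "bowling line linen darling link"
print(ngram_mle(toy_text, "lin", "g"))  # 0.4
```

In the toy text "lin" is followed by "g" twice, by "e" twice and by "k" once, hence 2/5.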

To estimate even rare n-grams reliably, I apply two tricks. First, I add a positive number $\delta$ to each counter. This guarantees that I never divide by zero. Second, I use not only n-grams (which can be rare in the text), but also (n-1)-grams (more frequent), and so on, down to 1-grams (unconditional probabilities of letters). But I discount lower-order counters with an $\alpha$ multiplier. Thus, in fact I calculate $p(g|lin)$ as

$$p(g|lin)=\frac{(\#(ling)+\delta) + \alpha (\#(ing)+\delta) + \alpha^2 (\#(ng)+\delta) + \alpha^3 (\#(g)+\delta)}{(\#(lin\bullet)+\delta|V|) + \alpha (\#(in\bullet)+\delta|V|) + \alpha^2 (\#(n\bullet)+\delta|V|) + \alpha^3 (\#(\bullet)+\delta|V|)}$$

where $|V|$ is the number of distinct characters: each of the $|V|$ counters summed in the denominator receives its own $\delta$.
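The same blended estimate can be sketched as a standalone function, without any classes. This is a simplified illustration rather than the implementation used in the rest of the article: the names `smoothed_counts` and `smoothed_proba` are hypothetical, `delta` is the number added to each counter and `alpha` is the discount for each lower order.

```python
from collections import Counter


def smoothed_counts(text, context, alphabet, delta=1.0, alpha=0.001):
    """Blend smoothed counts of every order, from len(context) down to 0."""
    blended = {ch: 0.0 for ch in alphabet}
    weight = 1.0
    for start in range(len(context) + 1):
        ctx = context[start:]  # e.g. 'lin', then 'in', 'n' and finally ''
        n = len(ctx) + 1
        grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        for ch in alphabet:
            blended[ch] += weight * (grams[ctx + ch] + delta)
        weight *= alpha  # each lower order is discounted once more
    return blended


def smoothed_proba(text, context, alphabet, delta=1.0, alpha=0.001):
    """Normalize the blended counts into a distribution over the alphabet."""
    counts = smoothed_counts(text, context, alphabet, delta, alpha)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}


text = " abracadabra "
proba = smoothed_proba(text, "a", sorted(set(text)))
print(max(proba, key=proba.get), round(proba["b"], 4))
```

Because every letter of the alphabet gets its own `delta`, the distribution is well defined even for a context that never occurred in the text.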

For those who prefer implementation to theory, here is the Python code for my n-gram model.

In [5]:
from collections import defaultdict, Counter
import numpy as np
import pandas as pd

class LanguageNgramModel:
    """ 
    The model remembers and predicts which letters follow which.
    Constructor parameters:
        order - number of characters the model remembers, or n-1
        smoothing - the number, added to each counter for stability
        recursive - weight of the model of one order less
    Learned parameters:
        counter_ - storage of n-grams, as dict of counters  
        vocabulary_ - set of characters that the model knows
    """
    def __init__(self, order=1, smoothing=1.0, recursive=0.001):
        self.order = order
        self.smoothing = smoothing
        self.recursive = recursive
    
    def fit(self, corpus):
        """ Estimate frequency of all n-grams in the text
        parameters:
            corpus - a text string 
        """
        self.counter_ = defaultdict(Counter)
        self.vocabulary_ = set()
        for i, token in enumerate(corpus[self.order:]):
            context = corpus[i:(i+self.order)]
            self.counter_[context][token] += 1
            self.vocabulary_.add(token)
        self.vocabulary_ = sorted(list(self.vocabulary_))
        if self.recursive > 0 and self.order > 0:
            self.child_ = LanguageNgramModel(self.order-1, self.smoothing, self.recursive)
            self.child_.fit(corpus)
            
    def get_counts(self, context):
        """ Estimate frequency of all symbols that may follow the context
        Parameters:
            context - text string (only the last self.order chars matter)
        Returns: 
            freq - vector of letter conditional frequencies, as pandas.Series
        """
        if self.order:
            local = context[-self.order:]
        else:
            local = ''
        freq_dict = self.counter_[local]
        # explicit float dtype: an empty Series would otherwise default to object
        freq = pd.Series(index=self.vocabulary_, dtype=float)
        for token in self.vocabulary_:
            freq[token] = freq_dict[token] + self.smoothing
        if self.recursive > 0 and self.order > 0:
            child_freq = self.child_.get_counts(context) * self.recursive
            freq += child_freq
        return freq
    
    def predict_proba(self, context):
        """ Estimate probability of all symbols that may follow the context
        Parameters:
            context - text string (only the last self.order chars matter)
        Returns: 
            proba - vector of letter conditional probabilities, as pandas.Series
        """
        counts = self.get_counts(context)
        return counts / counts.sum()
    
    def single_log_proba(self, context, continuation):
        """ Estimate log probability of the certain continuation of the context
        Parameters:
            context - text string, known beginning of the phrase
            continuation - text string, its hypothetical end
        Returns: 
            result - a float, log of probability
        """
        result = 0.0
        for token in continuation:
            result += np.log(self.predict_proba(context)[token])
            context += token
        return result
    
    def single_proba(self, context, continuation):
        """ Estimate probability of the certain continuation of the context
        Parameters:
            context - text string, known beginning of the phrase
            continuation - text string, its hypothetical end
        Returns: 
            result - a float, probability
        """
        return np.exp(self.single_log_proba(context, continuation))

### The model of abbreviations
We needed the language model to understand which original phrases are probable. We need the model of abbreviations (or "distortions") to understand how the original phrases usually change.

I will assume that the only possible distortion is the exclusion of some characters (including whitespace) from the phrase. The model may be modified to take into account other distortion types, e.g. replacements and permutations of characters.

I don't have a large sample to train a complex model. Therefore, for distortions I will use 1-grams: for each character, the model just remembers the probability of its exclusion from the abbreviation. But I still code it as a general n-gram model, just in case.


In [7]:
class MissingLetterModel:
    """ 
    The model remembers and predicts which letters are usually missed.
    Constructor parameters:
        order - number of characters the model remembers, or n-1
        smoothing_missed - the number added to missed counter
        smoothing_total - the number added to total counter
    Learned parameters:
        missed_counter_ - counter of occurrences of the missed characters
        total_counter_ - counter of occurrences of all characters
    """
    def __init__(self, order=0, smoothing_missed=0.3, smoothing_total=1.0):
        self.order = order
        self.smoothing_missed = smoothing_missed
        self.smoothing_total = smoothing_total
    
    def fit(self, sentence_pairs):
        """ Estimate of missing probability for each symbol
        Parameters:
            sentence_pairs - list of (original phrase, abbreviation)
        In the abbreviation, all missed symbols are replaced with "-"
        """
        self.missed_counter_ = defaultdict(Counter)
        self.total_counter_ = defaultdict(Counter)
        for (original, observed) in sentence_pairs:
            for i, (original_letter, observed_letter) \
                    in enumerate(zip(original[self.order:], observed[self.order:])):
                context = original[i:(i+self.order)]
                if observed_letter == '-':
                    self.missed_counter_[context][original_letter] += 1
                self.total_counter_[context][original_letter] += 1 
    
    def predict_proba(self, context, last_letter):
        """ Estimate of probability of last_letter being missed after context"""
        if self.order:
            local = context[-self.order:]
        else:
            local = ''
        missed_freq = self.missed_counter_[local][last_letter] + self.smoothing_missed
        total_freq = self.total_counter_[local][last_letter] + self.smoothing_total
        return missed_freq / total_freq
    
    def single_log_proba(self, context, continuation, actual=None):
        """ Estimate log probability that after context, 
            continuation is abbreviated to actual.
        If actual is None, it is assumed that nothing is abbreviated.
        """
        if actual is None:
            actual = continuation
        result = 0.0
        for orig_token, act_token in zip(continuation, actual):
            pp = self.predict_proba(context, orig_token)
            if act_token != '-':
                pp = 1 - pp
            result += np.log(pp)
            context += orig_token
        return result
    
    def single_proba(self, context, continuation, actual=None):
        """ Estimate probability that after context, 
            continuation is abbreviated to actual.
        If actual is None, it is assumed that nothing is abbreviated.
        """
        return np.exp(self.single_log_proba(context, continuation, actual))
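The `fit` method above expects abbreviations that are already aligned with the original phrase, with "-" in place of every dropped character. If only raw abbreviations are available, such pairs could be produced by a helper like the one below. It is a hypothetical sketch that is not part of the original experiment: it matches characters greedily from left to right, so when several alignments are possible it simply picks one of them.

```python
def align(original, abbreviation, gap="-"):
    """Mark the characters of original that are missing from abbreviation.

    Greedy left-to-right matching; raises ValueError if the abbreviation
    cannot be obtained from the original by deleting characters only.
    """
    aligned = []
    j = 0  # position of the next abbreviation character to match
    for ch in original:
        if j < len(abbreviation) and ch == abbreviation[j]:
            aligned.append(ch)
            j += 1
        else:
            aligned.append(gap)
    if j != len(abbreviation):
        raise ValueError("abbreviation is not a subsequence of the original")
    return "".join(aligned)


print(align("water bottle", "wtrbtl"))   # w-t-r-b-t-l-
print(align("basketball", "bsktball"))   # b-sk-tball
```

The output has the same length as the original phrase, which is what the `zip` in `fit` relies on.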

### Toy examples

Train the bigram language model on just one example and see the suggested continuations for "bra". "B" is the most probable: it follows "a" more often than any other letter.

In [21]:
lang_model = LanguageNgramModel(1)
lang_model.fit(' abracadabra ')
print(lang_model.predict_proba(' bra'))

     0.181777
a    0.091297
b    0.272529
c    0.181686
d    0.181686
r    0.091025
dtype: float64


Train the distortion model on one (word, distortion) sample. The estimated probability of dropping "a" is higher than that of dropping "b" or "c".

In [22]:
missed_model = MissingLetterModel(0)
missed_model.fit([('abracadabra', 'abr-c-d-br-')]) 

print({letter: missed_model.predict_proba('abr', letter) for letter in 'abc'})

{'a': 0.7166666666666667, 'b': 0.09999999999999999, 'c': 0.15}


Estimated probability of "abra" being abbreviated as "abr-":

In [30]:
print(missed_model.single_proba('', 'abra', 'abr-'))

0.164475


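For a sense of scale, here is the decimal logarithm of $27^{10}$, the number of possible 10-letter phrases over a 27-character alphabet:

In [36]: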
np.log10(27) * 10

14.313637641589875

### Greedy search for the most probable phrase

Having models of language and distortions, we can theoretically estimate the likelihood of any original phrase. But for this we need to loop over *all* the possible (original phrase, distortion) pairs. There are just too many of them: e.g. with a 27-character alphabet there are $27^{10} \approx 2 \cdot 10^{14}$ possible 10-letter phrases. We need a smarter algorithm to avoid this near-infinite looping.

I will exploit the fact that the models are single-character-based, and will construct the phrase letter by letter. I will make a <a href="https://en.wikipedia.org/wiki/Heap_(data_structure)">heap</a> of incomplete candidate phrases, and evaluate the likelihood of each. The best candidate will be extended with multiple possible one-letter continuations, which are added to the heap. To cut the number of options, I will save only the "good enough" candidates. The complete candidates will be set aside, to be returned as the solution in the end. The procedure will be repeated until either the heap or the maximum number of iterations runs out.

The quality of candidates will be evaluated as the log-probability of the abbreviation, given that the original phrase begins with the candidate and ends (because the candidate is incomplete) as the abbreviation itself. To manage the search, I introduce two parameters: "optimism" and "freedom". "Optimism" estimates how the likelihood will improve when the candidate completes. It makes sense to set "optimism" between 0 and 1; the closer it is to 1, the faster the algorithm will try to add new characters. "Freedom" is the allowable loss of quality in comparison to the current best candidate. The higher the "freedom", the more options will be included, and the slower the algorithm will be. If the "freedom" is too low, the heap may deplete before any reasonable phrase is found.
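Stripped of the language and distortion models, the search pattern looks roughly like the sketch below. It is a simplified illustration on a toy problem, not the function used later: the names `best_first`, `expand` and `is_complete` are hypothetical, costs play the role of minus log-probabilities, and there is no "optimism" estimate of the unfinished part, only the "freedom" threshold.

```python
from heapq import heappush, heappop


def best_first(start, expand, is_complete, freedom=1.0, max_attempts=10000):
    """Return all complete states whose cost is within freedom of the best."""
    heap = [(0.0, start)]   # (accumulated cost, partial candidate)
    best = float("inf")     # cost of the best complete candidate so far
    found = {}
    for _ in range(max_attempts):
        if not heap:
            break
        cost, state = heappop(heap)   # the cheapest candidate comes first
        if cost > best + freedom:
            continue                  # became hopeless after best improved
        if is_complete(state):
            found[state] = cost
            best = min(best, cost)
        else:
            for step_cost, new_state in expand(state):
                # keep only the "good enough" continuations
                if cost + step_cost <= best + freedom:
                    heappush(heap, (cost + step_cost, new_state))
    return {s: c for s, c in found.items() if c <= best + freedom}


# Toy problem: build 3-letter strings, where "a" costs 1 and "b" costs 2.
LETTER_COST = {"a": 1.0, "b": 2.0}
result = best_first(
    "",
    expand=lambda s: [(c, s + ch) for ch, c in LETTER_COST.items()],
    is_complete=lambda s: len(s) == 3,
    freedom=1.0,
)
print(result)  # {'aaa': 3.0, 'aab': 4.0, 'aba': 4.0, 'baa': 4.0}
```

With non-negative step costs, the first complete candidate popped from the heap is the cheapest one, and everything more than `freedom` worse is never expanded.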

In [37]:
from heapq import heappush, heappop

def generate_options(prefix_proba, prefix, suffix, 
                     lang_model, missed_model, optimism=0.5, cache=None):
    """ Generate partial options of abbreviation decoding (a helper function)
    Parameters:
        prefix_proba - log probability of decoded part of the abbreviation
        prefix - decoded part of the abbreviation
        suffix - not decoded part of the abbreviation
        lang_model - the language model
        missed_model - the abbreviation probability model
        optimism - coefficient for log likelihood of the word end
        cache - storage of suffix likelihood estimates
    Returns: list of options in the form (likelihood estimate, decoded part, 
        not decoded part, the new letter, the suffix likelihood estimate)
    """
    options = []
    for letter in lang_model.vocabulary_ + ['']:
        if letter:  # here we assume the character was missing
            next_letter = letter
            new_suffix = suffix
            new_prefix = prefix + next_letter
            proba_missing_state = - np.log(missed_model.predict_proba(prefix, letter))
        else:  # here we assume there was no missing character
            next_letter = suffix[0]
            new_suffix = suffix[1:]
            new_prefix = prefix + next_letter
            proba_missing_state = - np.log((1 - missed_model.predict_proba(prefix, next_letter)))
        proba_next_letter = - np.log(lang_model.single_proba(prefix, next_letter))
        if cache:
            proba_suffix = cache[len(new_suffix)] * optimism
        else:
            proba_suffix = - np.log(lang_model.single_proba(new_prefix, new_suffix)) * optimism
        proba = prefix_proba + proba_next_letter + proba_missing_state + proba_suffix
        options.append((proba, new_prefix, new_suffix, letter, proba_suffix))
    return options

print(generate_options(0, ' ', 'brac ', lang_model, missed_model))

[(6.929663174828117, '  ', 'brac ', ' ', 3.7800651217336947), (5.0428796453387541, ' a', 'brac ', 'a', 3.4572571306016755), (8.0948719475345303, ' b', 'brac ', 'b', 3.8466616057719989), (7.6238078617051874, ' c', 'brac ', 'c', 3.7800651217336947), (7.6238078617051874, ' d', 'brac ', 'd', 3.7800651217336947), (8.0948719475345303, ' r', 'brac ', 'r', 3.8466616057719989), (4.8582382617757647, ' b', 'rac ', '', 2.8072524973494524)]


The next function explores the graph of the noisy channel in a best-first manner, until it runs out of attempts or out of sufficiently promising nodes.

In [74]:
def noisy_channel(word, lang_model, missed_model, freedom=3.0, 
                  max_attempts=10000, optimism=0.9, verbose=False):
    """ Suggest phrases, for which word may be the abbreviation 
    parameters:
        word - string, the abbreviation
        lang_model - the language model
        missed_model - the abbreviation probability model
        freedom - possible quality range of log likelihood of the candidates
        max_attempts - maximum number of iterations
        optimism - coefficient for log likelihood of the word end
        verbose - whether to print current candidates in the runtime
    returns: dict of keys - suggested phrases, and values - 
        minus log likelihood of candidates
        The less this value, the more likely the suggestion
    """
    query = word + ' '
    prefix = ' '
    prefix_proba = 0.0
    suffix = query
    full_origin_logprob = -lang_model.single_log_proba(prefix, query)
    no_missing_logprob = -missed_model.single_log_proba(prefix, query)
    best_logprob = full_origin_logprob + no_missing_logprob
    # add an empty prefix to the heap
    heap = [(best_logprob * optimism, prefix, suffix, '', best_logprob * optimism)]
    # add the default candidate (without missing characters) 
    candidates = [(best_logprob, prefix + query, '', None, 0.0)]
    if verbose:
        print('baseline score is', best_logprob)
    # prepare storage of the phrase suffix probabilities
    cache = {}
    for i in range(len(query)+1):
        future_suffix = query[:i]
        cache[len(future_suffix)] = -lang_model.single_log_proba('', future_suffix) # rough approximation
        cache[len(future_suffix)] += -missed_model.single_log_proba('', future_suffix) # at least add missingness
    
    for i in range(max_attempts):
        if not heap:
            break
        next_best = heappop(heap)
        if verbose:
            print(next_best)
        if next_best[2] == '':  # the phrase is fully decoded
            # if the phrase is good enough, add it to the answer
            if next_best[0] <= best_logprob + freedom:
                candidates.append(next_best)
                # update estimate of the best likelihood
                if next_best[0] < best_logprob:
                    best_logprob = next_best[0]
        else:  # the phrase is not fully decoded - generate more options
            prefix_proba = next_best[0] - next_best[4] # all proba estimate minus suffix
            prefix = next_best[1]
            suffix = next_best[2]
            new_options = generate_options(
                prefix_proba, prefix, suffix, lang_model, 
                missed_model, optimism, cache)
            # add only the options potentially no worse than the best + freedom
            for new_option in new_options: 
                if new_option[0] < best_logprob + freedom:
                    heappush(heap, new_option)
    if verbose:
        print('heap size is', len(heap), 'after', i, 'iterations')
    result = {}
    for candidate in candidates:
        if candidate[0] <= best_logprob + freedom:
            result[candidate[1][1:-1]] = candidate[0]
    return result

Let's apply the algorithm to decipher "brc".

In [42]:
result = noisy_channel('brc', lang_model, missed_model, verbose=True, freedom=1)
print(result)

baseline score is 7.68318306228
(6.9148647560475442, ' ', 'brc ', '', 6.9148647560475442)
(6.755450684372974, ' b', 'rc ', '', 4.7044649199466617)
(5.8249119494605051, ' br', 'c ', '', 2.6863637325526679)
(7.088440394887126, ' brc', ' ', '', 1.7075575253192956)
(7.1392598304831516, ' bra', 'c ', 'a', 2.6863637325526679)
(7.6831830622750497, ' brc ', '', '', -0.0)
(8.0284469273601662, ' brac', ' ', '', 1.7075575253192956)
(8.3621576081202385, ' a', 'brc ', 'a', 6.776535093383159)
(7.6954572168460142, ' ab', 'rc ', '', 4.7044649199466617)
(6.7649184819335453, ' abr', 'c ', '', 2.6863637325526679)
(8.0284469273601662, ' abrc', ' ', '', 1.7075575253192956)
(8.0792663629561936, ' abra', 'c ', 'a', 2.6863637325526679)
(8.6231895947480908, ' abrc ', '', '', -0.0)
(8.6231895947480908, ' brac ', '', '', -0.0)
(8.6740629096242063, ' brca', ' ', 'a', 1.7075575253192956)
heap size is 0 after 15 iterations
{'brc': 7.6831830622750497, 'abrc': 8.6231895947480908, 'brac': 8.6231895947480908}


### Testing on hobbits

To really test the algorithm, we need a good language model. I was wondering how well a model could decipher the abbreviations if it had been trained on a deliberately limited corpus: one book on an unusual topic. The first such book that came to hand was "The Lord Of The Rings: The Fellowship of the Ring". Well, let's see how well the hobbit language can help to decipher modern sports terms.

But first we need to train the models. 

In [54]:
import re
# read the text
with open('Fellowship_of_the_Ring.txt', encoding = 'utf-8') as f:
    text = f.read()
# leave only letters and spaces in the text
text2 = re.sub(r'[^a-z ]+', '', text.lower().replace('\n', ' '))
all_letters = ''.join(list(sorted(list(set(text2)))))
print(repr(all_letters)) # ' abcdefghijklmnopqrstuvwxyz'
# Prepare training sample for the abbreviation model 
missing_set =  (
    [(all_letters, '-' * len(all_letters))] * 3 # all chars missing
    + [(all_letters, all_letters)] * 10 # all chars are NOT missing
    + [('aeiouy', '------')] * 30 # only vowels are missing
)
# Train both models
big_lang_m = LanguageNgramModel(order=4, smoothing=0.001, recursive=0.01)
big_lang_m.fit(text2)
big_err_m = MissingLetterModel(order=0, smoothing_missed=0.1)
big_err_m.fit(missing_set)

' abcdefghijklmnopqrstuvwxyz'


For the abbreviation model, I manually created a short corpus, where roughly 25% of consonants and 75% of vowels are abbreviated away.
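These shares can be verified by counting directly over the training sample. The sketch below rebuilds `missing_set` exactly as in the cell above and applies the same smoothing as `MissingLetterModel` (0.1 for missed, 1.0 for total); the helper `rate` is a hypothetical name used only here.

```python
from collections import Counter

all_letters = " abcdefghijklmnopqrstuvwxyz"
missing_set = (
    [(all_letters, "-" * len(all_letters))] * 3   # all chars missing
    + [(all_letters, all_letters)] * 10           # all chars are NOT missing
    + [("aeiouy", "------")] * 30                 # only vowels are missing
)

missed, total = Counter(), Counter()
for original, observed in missing_set:
    for orig_letter, obs_letter in zip(original, observed):
        total[orig_letter] += 1
        if obs_letter == "-":
            missed[orig_letter] += 1


def rate(ch, smoothing_missed=0.1, smoothing_total=1.0):
    """Smoothed share of cases in which ch was abbreviated away."""
    return (missed[ch] + smoothing_missed) / (total[ch] + smoothing_total)


print(round(rate("b"), 3), round(rate("a"), 3))  # 0.221 0.752
```

A consonant is dropped in 3 of its 13 occurrences and a vowel in 33 of 43, which after smoothing gives about 22% and 75%. Note that "y" is counted as a vowel here, and the space behaves like a consonant.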

I chose the 5-gram language model (order 4) after comparing the likelihood of different models on the "test set" (the end of the book). It seems that the quality of character probability prediction grows with the model order. But I didn't go beyond 5-grams, because a higher order means slower training and application of the model.

In [55]:
for i in range(5):
    tmp = LanguageNgramModel(i, 0.001, 0.01)
    tmp.fit(text2[0:-5000])
    print(i, tmp.single_log_proba(' ', text2[-5000:]))

0 -13858.8600648
1 -11608.8867664
2 -9235.21749986
3 -7461.78935696
4 -6597.9544372
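These totals are easier to compare after converting them to a per-character scale. The sketch below simply copies the numbers printed above and turns each log-likelihood into a per-character perplexity, i.e. the exponent of the average negative log-probability; this conversion is my addition and is not computed in the original notebook.

```python
import math

# Held-out log-likelihoods copied from the output above (natural logarithms).
log_likelihood = {
    0: -13858.8600648,
    1: -11608.8867664,
    2: -9235.21749986,
    3: -7461.78935696,
    4: -6597.9544372,
}
n_chars = 5000  # length of the held-out end of the book

perplexity = {
    order: math.exp(-ll / n_chars) for order, ll in log_likelihood.items()
}
for order, ppl in perplexity.items():
    print(order, round(ppl, 2))
```

Perplexity can be read as the effective number of equally likely next characters: it falls from about 16 for the order-0 model to below 4 for order 4.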


After training, I applied the algorithm to different contractions. To begin with, I asked "sm", meaning "Sam". The model recognized him easily, and added other options (although with higher score, and thus less probable).

In [60]:
noisy_channel('sm', big_lang_m, big_err_m)

{'sam': 7.3438449620080997,
 'same': 9.5091694602417469,
 'some': 7.6890573935288824}

"Frodo" is also deciphered from "frd" without problems.

In [61]:
noisy_channel('frd', big_lang_m, big_err_m)

{'frodo': 6.8904938902680888}

And the "ring" too, from "rng".

In [62]:
noisy_channel('rng', big_lang_m, big_err_m)


{'ring': 7.6317120419343913}

Before running "wtrbtl", I tried the first part, "wtr". "Water" is deciphered perfectly.


In [63]:
noisy_channel('wtr', big_lang_m, big_err_m)

{'water': 8.6405279255413898}

With "bottle" the model is less confident. After all, battles appear more frequently than bottles in "The Lord of the Rings".

In [64]:
noisy_channel('btl', big_lang_m, big_err_m)

{'battle': 12.620490427990008,
 'bottle': 13.3327872548629,
 'but all': 14.66815480120338,
 'but ill': 15.387630853411283}

But in some contexts, this is exactly what is needed.

In [68]:
noisy_channel('batlhrse', big_lang_m, big_err_m)

{'battle horse': 25.194823785457018, 'battle horses': 27.40528952535044}

For "wtrbtl" the model proposed multiple options, but "water bottle" is only the second among them.

In [65]:
noisy_channel('wtrbtl', big_lang_m, big_err_m)

{'water battle': 23.76999162985074,
 'water bottle': 23.962598992336815,
 'water but all': 24.445047133561353,
 'water but ill': 25.164523185769259,
 'water but lay': 25.601336188357113,
 'water but lie': 26.668305553728047}

"Basketball", never seen before, was recognized almost correctly, because the word "basket" has occurred in the training text. But I had to extend the width of the search beam (the "freedom" parameter) from 3 to 5 to discover this option.

In [148]:
print(noisy_channel('bsktball', big_lang_m, big_err_m, freedom=5))

{'bsktball': 33.193085889457429, 'basket ball': 33.985227947093364}


The word "ball" has never occurred in the training text, so the model failed to recognize "bowling ball" in "bwlingbl". But it proposed "bewilling Bilbo", "bowling blow", and several other alternatives. The word "bowling" has also never occurred in "The Lord of the Rings", but the model somehow managed to reconstruct it with its general understanding of the English language.

In [150]:
print(noisy_channel('bwlingbl', big_lang_m, big_err_m, freedom=5))

{'bwling blue': 31.318936077746862, 'bwling bilbo': 30.695249686758611, 'bwling ble': 34.490254059547475, 'bwling black': 31.980325659562851, 'bwling blow': 33.15061216480305, 'bewilling blue': 30.937989778499748, 'bewilling bilbo': 30.314303387511497, 'bewilling ble': 34.109307760300361, 'bewilling black': 31.599379360315737, 'bewilling blin': 34.685939493896406, 'bewilling blow': 32.769665865555929, 'bewilling bill': 32.156071732628014, 'bewilling below': 32.195518180732158, 'bwling bill': 32.537018031875135, 'bewilling belia': 32.550377929021479, 'bwling below': 32.576464479979279, 'bwling belia': 32.931324228268608, 'bwling belt': 33.203704016765826, 'bwling bling': 33.393527121566656, 'bwling bell': 34.180762531759534, 'bowling blue': 30.676613106535022, 'bowling bilbo': 30.052926715546771, 'bowling ble': 33.847931088335635, 'bowling black': 31.338002688351011, 'bowling blin': 34.42456282193168, 'bowling blow': 32.508289193591203, 'bowling bill': 31.894695060663285, 'bowling below

### Generation of weird texts.
Finally, just for fun, I used the abbreviation model to shorten the beginning of "The Lord of the Rings". It looks weird.


In [144]:
part = text[10502:11149]
result = ''
for i, letter in enumerate(part):
    if np.random.rand() * 0.5 < big_err_m.single_proba(part[0:i], letter):
        result += letter
print(result)

This bok s largly cncernd wth Hbbts, nd frm its pges  readr ma dscver much f thir charctr nd  littl f thir hstr. Furthr nfrmaton will als b fond n the selction from the Red Bok f Wstmarch that hs already ben publishd, ndr th ttle of The Hobbit. Tht stor was dervd from the arlir chpters of the Red Bok, cmpsed by Blbo hmslf, th first Hobbit t bcome famos n the world at large, nd clld b him There and Bck Again, sinc thy tld f his journey into th East and his return: n dvntr whch latr nvolved all the Hobbits n th grat vnts of that Ag that re hr rlatd.


The language model may also be used to generate a completely new text. It has some Tolkien style, but makes no sense at all.

In [174]:
np.random.seed(20)
text = "Frodo"
for i in range(300):
    proba = big_lang_m.predict_proba(text)
    text += np.random.choice(proba.index, size=1, p=proba)[0]
print(text+'.')

Frodo would me but them but his slipped in he see said pippin silent the names for follow as days are or the hobbits rever any forward spoke ened with and many when idle off they hand we cried plunged they lit a simply attack struggled itself it for in a what it was barrow the will the ears what all grow.


Generation of meaningful texts from scratch is still beyond the reach of data science, for it requires real artificial intelligence, able to understand the complex storyline that took place in Middle-earth.

### Conclusions
Natural language processing is a complex mixture of science, technology, and magic. Even linguists cannot fully understand the laws of human speech. The times when machines really understand texts will not come soon.

Natural language processing is also fun. Armed with a couple of statistical models, you can both recognize and generate non-obvious abbreviations. For those who want to keep playing with the data, there is a [Jupyter notebook](https://github.com/avidale/weirdMath/blob/master/nlp/abbreviation_spellchecker_english.ipynb) with the complete code of the experiment. As for this blog, other experiments will follow. So subscribe! :-)