In [1]:
from collections import Counter,defaultdict
import random, gzip, textwrap

In [19]:
def make_word_trigrams(filename):

    # read the gzipped file and split it into a list of whitespace-separated words
    with gzip.open(filename) as f: words = f.read().decode('utf-8').split()

    trigrams = defaultdict(list)
    
    bigram=tuple(words[:2])
    startwords=[bigram]
    
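    # wrap around to the first two words so that every bigram has a continuation
    for w in words[2:] + words[:2]:
        # keys of the trigram dict are bigram tuples, values are lists of next words
        trigrams[bigram].append(w)
        # (bigram[1], w) opens a sentence when the previous word ends in '.' and bigram[1] is capitalized
        if bigram[0].endswith('.') and bigram[1][0].isupper():
            startwords.append((bigram[1],w))
        bigram=(bigram[1],w)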

    return trigrams,startwords
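
To see the shape of the table this builds, here is a minimal sketch that runs the same loop on a short in-memory word list instead of a gzipped file. The helper name `trigrams_from_words` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
from collections import defaultdict

def trigrams_from_words(words):
    # hypothetical helper: same loop as make_word_trigrams, minus the file reading
    trigrams = defaultdict(list)
    bigram = tuple(words[:2])
    startwords = [bigram]
    for w in words[2:] + words[:2]:   # wrap around so every bigram has a continuation
        trigrams[bigram].append(w)
        if bigram[0].endswith('.') and bigram[1][0].isupper():
            startwords.append((bigram[1], w))
        bigram = (bigram[1], w)
    return trigrams, startwords

# made-up sample text
toy_trigrams, toy_starts = trigrams_from_words("The cat sat. The cat ran. The dog sat.".split())
print(toy_trigrams[('The', 'cat')])   # words seen after "The cat", repeats kept
print(toy_starts)                     # bigrams that can open a sentence
```

Repeats are kept on purpose: a continuation seen twice appears twice in its list, so a uniform draw from the list is already frequency-weighted.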

In [20]:
def random_word_text(ngrams, startgrams, r=None, num_words=100):
         
    # start from the r-th start bigram if r is given, otherwise pick one at random
    if r is not None: current = startgrams[r]
    else: current = random.choice(startgrams)
    random_text = list(current)
    
    # keep going past num_words until the text ends with a period
    while len(random_text)< num_words or not random_text[-1].endswith('.'):
        nxt = random.choice(ngrams[current])
        random_text.append(nxt)
        current = (*current[1:], nxt) #move to next window (current can be (n-1)-gram)
        # avoid long loops if too few periods in training text
        if len(random_text) > 2*num_words: random_text[-1] += '.'
        
    return textwrap.fill(' '.join(random_text))

The source text files used below are available here:

1. <a href="https://courses.cit.cornell.edu/info2950_2018sp/resources/sherlock.txt.gz">sherlock.txt.gz</a>
2. <a href="https://courses.cit.cornell.edu/info2950_2018sp/resources/oz.txt.gz">oz.txt.gz</a>
3. <a href="https://courses.cit.cornell.edu/info2950_2018sp/resources/di.txt.gz">di.txt.gz</a>
4. <a href="https://courses.cit.cornell.edu/info2950_2018sp/resources/shakespeare.txt.gz">shakespeare.txt.gz</a>

In [21]:
trigrams_sh,startwords_sh = make_word_trigrams('sherlock.txt.gz')

In [5]:
#most common bigram keys
sorted([(bi,len(trigrams_sh[bi])) for bi in trigrams_sh],key=lambda x: x[1],reverse=True)[:10]

[(('of', 'the'), 700),
 (('in', 'the'), 479),
 (('to', 'the'), 297),
 (('I', 'have'), 247),
 (('that', 'I'), 245),
 (('at', 'the'), 224),
 (('upon', 'the'), 195),
 (('and', 'I'), 191),
 (('to', 'be'), 189),
 (('and', 'the'), 186)]

In [6]:
Counter(startwords_sh).most_common(10)  #starts of sentences

[(('It', 'was'), 91),
 (('It', 'is'), 88),
 (('I', 'have'), 49),
 (('He', 'was'), 38),
 (('There', 'was'), 31),
 (('There', 'is'), 28),
 (('I', 'am'), 27),
 (('I', 'was'), 27),
 (('I', 'had'), 26),
 (('On', 'the'), 22)]

In [7]:
len(set(trigrams_sh[('of','the')])) #distinct continuations from "of the ..."

548
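
Because each continuation list keeps repeats, `random.choice` over it picks a word in proportion to how often that word followed the bigram. The sketch below makes those implied probabilities explicit with a `Counter`. The `continuations` list is made-up toy data; in the notebook it would be something like `trigrams_sh[('of', 'the')]`.

```python
from collections import Counter

# toy continuation list (made up for illustration)
continuations = ['house', 'room', 'house', 'door', 'house', 'room']

counts = Counter(continuations)
total = len(continuations)
# empirical probability of each next word given the bigram
probs = {w: c / total for w, c in counts.items()}

print(len(counts))              # number of distinct continuations
print(counts.most_common(2))    # the two most frequent continuations
print(probs)
```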

In [8]:
print (random_word_text(trigrams_sh,startwords_sh)) #random sherlock

Above the woods picking flowers. She states that while she was sick
with fear, and the police-court." "I could do without her. In only one
thing to do it again. Then I started from London to this man Horner is
innocent?" "I cannot think." "When you see that your circulation is
more interesting than her little income, she would send him away to
the Southampton Road, a small bedroom, which looked out into smoke
like so many as you may rest assured that she died from some strong
motive for securing the situation." "I am afraid that I had been out
to the celebrated Mr.


In [22]:
trigrams_di,startwords_di=make_word_trigrams('di.txt.gz')
trigrams_oz,startwords_oz=make_word_trigrams('oz.txt.gz')

In [10]:
print (random_word_text(trigrams_oz,startwords_oz)) #random Oz

Others polished the blade until all the beasts caught sight of us, he
called to them that it was quite an important person. By and by
standing upon the pretty gardens and woods far below them. Dorothy
found she was riding quite easily. After the Lion walking with stately
strides at Dorothy's feet, and the ropes got twisted, so that I shall
have a basket to the city and everyone came to the Wicked Witch and
setting them free from bondage." Dorothy listened to this queer bridge
when a sharp growl made them all look up, and he soon made up her mind
to harness him like a cat.


In [23]:
print (random_word_text(trigrams_di,startwords_di))  #random declaration of independence

He is at this time transporting large Armies of foreign Mercenaries to
compleat the works of death, desolation, and tyranny, already begun
with circumstances of our intentions, do, in the Course of human
events it becomes necessary for the sole purpose of fatiguing them
into compliance with his measures. He has abdicated Government here,
by declaring us out of his Protection and waging War against us. He
has refused to pass Laws of Nature and of Right ought to be tried for
pretended offences: For abolishing the forms to which the Laws for the
accommodation of large districts of people, unless those people would
relinquish the right of Representation in the mean time exposed to all
the dangers of invasion from without, and convulsions within.


In [24]:
trigrams_oz_di= dict(list(trigrams_oz.items()) + list(trigrams_di.items()))
for bigram in set(trigrams_oz) & set (trigrams_di):
    trigrams_oz_di[bigram] = trigrams_oz[bigram] + trigrams_di[bigram]
startwords_oz_di = startwords_oz + startwords_di
    
print (random_word_text(trigrams_oz_di,startwords_di))  #mashup of declaration and oz

Prudence, indeed, will dictate that Governments long established
should not marry the pretty milkmaid was much too vexed to make him
work." "Your commands shall be glad to be able to melt me and I was a
little thing, except a coward I shall never get my brains I shall find
the Emerald City," said the Scarecrow, in a wonderful place. It was a
very bad man," said Dorothy. "And I should have set the house was
small, for the tenure of their salaries. He has refused to pass others
to subject us to a big table near by was loaded with delicious fruits
and nuts, pies and cakes, and many cows and sheep and horses and pigs
and chickens, all made of china, were standing about in groups.


In [25]:
trigrams_sk,startwords_sk = make_word_trigrams('shakespeare.txt.gz')  

In [26]:
print (random_word_text(trigrams_sk,startwords_sk)) #random shakespeare

Sir, I lack some part of the Goths and Tamora was queen- To quit the
mines? Have the patricians of you. For my part, knew the stars give
light To their deaf pillows will discharge my bond, Thou but offend'st
thy lungs and split thy heart burst out, I fear the wolf behowls the
moon; or, rather, the Neapolitan prince. PORTIA. Ay, but when?
ORLANDO. Why, how now, Sir Hugh, persuade me from hence, I faint. O
Iras, Charmian! 'Tis no counterfeit. To die is true- but for vacancy,
Had gone to meet you. PETRUCHIO. It cannot fail but by the wrist and
held me glad of your unworthy thinking.


In [15]:
trigrams_sh_sk = dict(list(trigrams_sh.items()) + list(trigrams_sk.items()))
for bigram in set(trigrams_sh) & set (trigrams_sk):
    trigrams_sh_sk[bigram] = trigrams_sh[bigram] + trigrams_sk[bigram]
startwords_sh_sk = startwords_sh + startwords_sk
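
The same merge pattern appears twice above, so it could be factored into a small function. This is only a sketch: `merge_ngrams` is a hypothetical helper that is not in the notebook, and the two input tables are made-up toy data.

```python
from collections import defaultdict

def merge_ngrams(*tables):
    # hypothetical helper: concatenate the continuation lists of shared keys
    merged = defaultdict(list)
    for table in tables:
        for key, continuations in table.items():
            merged[key].extend(continuations)   # extend a fresh list, inputs stay untouched
    return dict(merged)

# toy tables (made up)
a = {('to', 'the'): ['city', 'witch'], ('the', 'city'): ['gates']}
b = {('to', 'the'): ['king'], ('the', 'king'): ['has']}

merged = merge_ngrams(a, b)
print(merged[('to', 'the')])   # continuations from both sources
print(len(merged))             # number of distinct keys
```

Shared keys are where the two texts can hand off to each other, which is what makes the mashups switch sources mid-sentence.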

In [27]:
print (random_word_text(trigrams_sh_sk,startwords_sh))  #mashup of shakespeare and sherlock

Now from this day is not! O night, O long and well-deserved bed; [To
TOUCHSTONE] And you yourself shall keep the hills adjoining to the
block of the cause, But jealous souls will not practise to deceive,
Yet, to avoid him. Which of you some sport with the same care to stay
a man as rare as Phoenix. 'Od's my will! Her love to berhyme her),
Dido a dowdy, Cleopatra a gypsy, Helen and Hero hildings and harlots,
This be not easily controlled when she seem'd to tell his Grace. FIRST
CITIZEN. Give him tending; He brings you to proceed, And justly and
religiously unfold Why the devil, you are vanished.


### Now generalize to word n-grams

In [17]:
def make_word_ngrams(filename, n=3):

    # read the gzipped file and split it into a list of whitespace-separated words
    with gzip.open(filename) as f: words = f.read().decode('utf-8').split()

    ngrams = defaultdict(list)
    startgrams = [tuple(words[:n-1])] #will be list of n-1 grams to start sentences
    
    for i in range(len(words)-n+1):
        # keys of the ngram dict are (n-1)-tuples, values are lists of next words
        ngram = words[i:i+n]
        ngrams[tuple(ngram[:-1])].append(ngram[-1])
        # the (n-1)-gram after a word ending in '.' can start a sentence if capitalized
        if ngram[0].endswith('.') and ngram[1][0].isupper():
            startgrams.append(tuple(ngram[1:]))

    return ngrams,startgrams
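
A minimal sketch of how the key length changes with `n`, using a toy word list in place of a file. The name `ngrams_from_words` and the sample words are assumptions for illustration. Note that this sliding window does not wrap around as the trigram version does, so the final (n-1)-gram of the text may have no entry in the table.

```python
from collections import defaultdict

def ngrams_from_words(words, n=3):
    # hypothetical in-memory variant of the sliding-window loop
    ngrams = defaultdict(list)
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        ngrams[tuple(window[:-1])].append(window[-1])
    return ngrams

words = "a b a b a c".split()   # made-up sample
tables = {n: ngrams_from_words(words, n) for n in (2, 3)}

for n, table in tables.items():
    print(n, dict(table))

# the last bigram of the text never gets a continuation when n=3
print(('a', 'c') in tables[3])
```

With larger `n` each key has fewer continuations, so generated text follows the source more closely.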

In [18]:
ngrams_sk, startgrams_sk = make_word_ngrams('tiny_shakespeare.txt.gz', 3)  

In [28]:
print(random_word_text(ngrams_sk,startgrams_sk,num_words=100))

To a cruel war I sent thee thither: I, that please some, try all, both
joy and sorrow was too strict to make thee think thy swan a crow.
ROMEO: When the tongue's office should be thus bold in war; Those will
I sit me down. To whom God will, there be some other name! What's in a
day. LUCENTIO: Hearest thou, Biondello? BIONDELLO: I pray thee?
MISTRESS OVERDONE: Well; what has he closely mew'd her up, Signior
Baptista, my business was great; and in such matters: as they were The
common muck of the death of Hermione, visited that removed house.


### Now generalize to character n-grams

In [30]:
with gzip.open('tiny_shakespeare.txt.gz') as f: tokens = sorted(set(f.read().decode('utf-8')))
len(tokens)

65

In [33]:
print(''.join(tokens))


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [34]:
def make_char_ngrams(filename, n=3):

    # read the gzipped file as a single string of characters
    with gzip.open(filename) as f: chars = f.read().decode('utf-8')

    ngrams = defaultdict(Counter)
    startgrams = [chars[:n-1]] #will be list of n-1 grams to start sentences
    
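    for i in range(len(chars)-n+1):
        # keys of the ngram dict are (n-1)-character strings, values are Counters of next characters
        ngram = chars[i:i+n]
        ngrams[ngram[:-1]][ngram[-1]] += 1
        # check bounds first, then look for '. ' followed by a capital letter
        if i+n < len(chars) and chars[i:i+2] == '. ' and chars[i+2].isupper():
            startgrams.append(chars[i+2:i+1+n])

    # turn each Counter into a list of counts aligned with the global tokens list
    ngrams = {k:[v[c] for c in tokens] for k,v in ngrams.items()}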
    
    return ngrams,startgrams

In [40]:
def random_char_text(ngrams, startgrams, r=None, num_chars=500):
         
    if r is not None: current = startgrams[r]
    else: current = random.choice(startgrams)
    random_text = current
    
    # keep going past num_chars until the text ends with a period
    while len(random_text)< num_chars or random_text[-1] != '.':
        nxt = random.choices(tokens,ngrams[current])[0]
        random_text += nxt
        current = current[1:] + nxt
        # avoid long loops if too few periods in training text
        if len(random_text) > 2*num_chars: random_text += '.'
        
    return textwrap.fill(random_text)
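
The character model stores one count per token for each context and passes that list to `random.choices` as weights. The sketch below shows the idea on a three-letter toy alphabet; the alphabet, the counts, and the seeded `random.Random(0)` generator are assumptions chosen for illustration.

```python
import random
from collections import Counter

tokens = ['a', 'b', 'c']              # toy alphabet (assumed)
counts = Counter({'a': 3, 'c': 1})    # 'b' never followed this context
# one weight per token, in token order; a Counter returns 0 for unseen characters
weights = [counts[c] for c in tokens]

rng = random.Random(0)                # seeded so the run is repeatable
sample = rng.choices(tokens, weights, k=1000)
tally = Counter(sample)

print(weights)
print(tally['b'])                     # zero-weight tokens are never drawn
print(tally.most_common())
```

Storing aligned count lists costs one slot per token for every context, but it lets each draw reuse the shared `tokens` list directly.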

In [36]:
chngrams_sk,chstartgrams_sk = make_char_ngrams('tiny_shakespeare.txt.gz', n=4)
len(chngrams_sk)

11556

In [53]:
print(random_char_text(chngrams_sk,chstartgrams_sk,r=10))

What of my bled where timoison that this cond can vice: musinst
fathem, With him far not what shonour Stil i' thy bore a let prayer's
cannot say, yet, have the lainst Grow In ricest wringdoms betwentle of
lies my exper. Thou dow: Leonthreasure youre in by oate mightere ple,
for I down briclery you keys hen I be conter, not acceed bel, Musts
and not rattom.  QUEEN MARGARENCENTIO: Which come and to be Rive with
therd: Too stiff theer i' the visedge.  VOLING RICHARD III: At Gent,
to serious reseemen atteral.
