# The unreasonable effectiveness of Character-level Language Models
## (and why RNNs are still cool)

### [Yoav Goldberg](http://www.cs.biu.ac.il/~yogo)

RNNs, LSTMs and Deep Learning are all the rage, and a recent [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy is doing a great job explaining what these models are and how to train them.
It also provides some very impressive results of what they are capable of.  This is a great post, and if you are interested in natural language, machine learning or neural networks you should definitely read it. 

Go read it now, then come back here. 

You're back? Good. Impressive stuff, huh? How could the network learn to imitate the input like that?
Indeed. I was quite impressed as well.

However, it feels to me that most readers of the post are impressed for the wrong reasons.
This is because they are not familiar with **unsmoothed maximum-likelihood character-level language models** and their unreasonable effectiveness at generating rather convincing natural language outputs.

In what follows I will briefly describe these character-level maximum-likelihood language models, which are much less magical than RNNs and LSTMs, and show that they too can produce rather convincing Shakespearean prose. I will also show about 30 lines of Python code that take care of both training the model and generating the output. Compared to this baseline, the RNNs may seem somewhat less impressive. So why was I impressed? I will explain this too, below.

## Unsmoothed Maximum Likelihood Character Level Language Model 

The name is quite long, but the idea is very simple.  We want a model whose job is to guess the next character based on the previous $n$ letters. For example, having seen `ello`, the next character is likely to be either a comma or a space (if we assume it is the end of the word "hello"), or the letter `w` if we believe we are in the middle of the word "mellow". Humans are quite good at this, but of course seeing a larger history makes things easier (if we were to see 5 letters instead of 4, the choice between space and `w` would have been much easier).

We will call $n$, the number of letters of history we base our guess on, the _order_ of the language model.

RNNs and LSTMs can potentially learn an infinite-order language model (they guess the next character based on a "state" which supposedly encodes all the previous history). Here we will restrict ourselves to a fixed-order language model.

So, we are seeing $n$ letters, and need to guess the $(n+1)$th one. We are also given a large-ish amount of text (say, all of Shakespeare's works) that we can use. How would we go about solving this task?

Mathematically, we would like to learn a function $P(c | h)$. Here, $c$ is a character, $h$ is an $n$-letter history, and $P(c|h)$ stands for how likely it is to see $c$ after we've seen $h$.

Perhaps the simplest approach would be to just count and divide (a.k.a. **maximum likelihood estimates**). We will count the number of times each letter $c$ appeared after $h$, and divide by the total number of letters appearing after $h$. The **unsmoothed** part means that if we did not see a given letter following $h$, we will just give it a probability of zero.

And that's all there is to it.
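
In symbols, count-and-divide is $P(c|h) = \frac{\#(h,c)}{\sum_{c'} \#(h,c')}$, where $\#(h,c)$ is the number of times $c$ followed $h$ in the text. As a minimal sketch on a made-up toy corpus (the string, the order of 3 and the helper name `mle` are my own illustrative choices, not part of the real model), it looks like this:

```python
from collections import Counter, defaultdict

text = "hello hello help"   # toy corpus, chosen only for illustration
order = 3                   # history length n

# counts[h][c] = number of times character c followed history h
counts = defaultdict(Counter)
for i in range(len(text) - order):
    history, char = text[i:i + order], text[i + order]
    counts[history][char] += 1

def mle(history):
    """Unsmoothed maximum-likelihood estimate of P(c | history)."""
    total = sum(counts[history].values())
    return {c: n / total for c, n in counts[history].items()}

probs = mle("hel")
print(probs)                  # 'l' followed "hel" twice, 'p' once
print(probs.get("x", 0.0))    # never seen after "hel", so probability zero
```

Here `hel` is followed by `l` two times out of three and by `p` once, so the estimates are 2/3 and 1/3, and every other character gets zero.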


### Training Code
Here is the code for training the model. `fname` is a file to read the characters from. `order` is the history size to consult. Note that we pad the data with leading `~` so that we also learn how to start.


In [27]:
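from collections import Counter, defaultdict

def train_char_lm(fname, order=4):
    with open(fname) as f:
        data = f.read()
    lm = defaultdict(Counter)
    # Pad with `order` leading "~" so the model also learns how a text starts.
    pad = "~" * order
    data = pad + data
    # Count how often each character follows each `order`-length history.
    for i in range(len(data)-order):
        history, char = data[i:i+order], data[i+order]
        lm[history][char]+=1
    def normalize(counter):
        # Turn raw counts into maximum-likelihood probabilities.
        s = float(sum(counter.values()))
        return [(c,cnt/s) for c,cnt in counter.items()]
    outlm = {hist:normalize(chars) for hist, chars in lm.items()}
    return outlm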

Let's train it on Andrej's Shakespeare text:

In [28]:
!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt

URL transformed to HTTPS due to an HSTS policy
--2018-09-20 14:45:22--  https://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4573338 (4.4M) [text/plain]
Saving to: ‘shakespeare_input.txt.3’


2018-09-20 14:45:22 (9.16 MB/s) - ‘shakespeare_input.txt.3’ saved [4573338/4573338]



In [29]:
lm = train_char_lm("shakespeare_input.txt", order=4)

Ok. Now let's do some queries:

In [30]:
lm['ello']

[('r', 0.059625212947189095),
 ('w', 0.817717206132879),
 ('u', 0.03747870528109029),
 (',', 0.027257240204429302),
 (' ', 0.013628620102214651),
 ('.', 0.0068143100511073255),
 ('?', 0.0068143100511073255),
 (':', 0.005110732538330494),
 ('n', 0.0017035775127768314),
 ("'", 0.017035775127768313),
 ('!', 0.0068143100511073255)]

In [31]:
lm['Firs']

[('t', 1.0)]

In [32]:
lm['rst ']

[('C', 0.09550561797752809),
 ('f', 0.011235955056179775),
 ('i', 0.016853932584269662),
 ('t', 0.05377207062600321),
 ('u', 0.0016051364365971107),
 ('S', 0.16292134831460675),
 ('h', 0.019261637239165328),
 ('s', 0.03290529695024077),
 ('R', 0.0008025682182985554),
 ('b', 0.024879614767255216),
 ('c', 0.012841091492776886),
 ('O', 0.018459069020866775),
 ('w', 0.024077046548956663),
 ('a', 0.02247191011235955),
 ('m', 0.02247191011235955),
 ('n', 0.020064205457463884),
 ('I', 0.009630818619582664),
 ('L', 0.10674157303370786),
 ('M', 0.0593900481540931),
 ('l', 0.01043338683788122),
 ('o', 0.030497592295345103),
 ('H', 0.0040128410914927765),
 ('d', 0.015248796147672551),
 ('W', 0.033707865168539325),
 ('K', 0.008025682182985553),
 ('q', 0.0016051364365971107),
 ('G', 0.0898876404494382),
 ('g', 0.011235955056179775),
 ('k', 0.0040128410914927765),
 ('e', 0.0032102728731942215),
 ('y', 0.002407704654895666),
 ('r', 0.0072231139646869984),
 ('p', 0.00882825040128411),
 ('A', 0.0056179

So `ello` is followed by either space, punctuation or `w` (or `r`, `u`, `n`), `Firs` is pretty much deterministic, and the word following `rst ` can start with pretty much every letter.
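
The lists printed above come out in no particular order, which makes them a bit hard to scan. A small helper that sorts a distribution by probability may help when poking at the model. This is only a sketch: `top_k` is a name I made up, and the `ello` list below is a rounded subset of the output shown above.

```python
def top_k(dist, k=3):
    """Return the k most probable (char, prob) pairs from a distribution."""
    return sorted(dist, key=lambda pair: pair[1], reverse=True)[:k]

# Rounded subset of the lm['ello'] output, for illustration only.
ello = [('r', 0.0596), ('w', 0.8177), ('u', 0.0375),
        (',', 0.0273), (' ', 0.0136), ("'", 0.0170)]

best = top_k(ello)
print(best)   # [('w', 0.8177), ('r', 0.0596), ('u', 0.0375)]
```

With the real model you would call it as `top_k(lm['ello'])`.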

### Generating from the model
Generating is also very simple. To generate a letter, we will take the history, look at the last `order` characters, and then sample a random letter based on the corresponding distribution.

In [33]:
from random import random

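def generate_letter(lm, history, order):
    # Only the last `order` characters matter to the model.
    history = history[-order:]
    dist = lm[history]
    # Walk the cumulative distribution until it passes a uniform draw.
    x = random()
    for c, v in dist:
        x = x - v
        if x <= 0:
            return c
    # Guard against floating-point round-off: fall back to the last character.
    return c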
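
The loop above samples by walking the cumulative distribution until it passes a uniform random number. If you prefer to lean on the standard library, `random.choices` does the same kind of weighted draw. The sketch below is an alternative, not the code used for the samples in this post; `sample_char` and the made-up distribution are assumptions for illustration.

```python
import random
from collections import Counter

def sample_char(dist, rng=random):
    """Draw one character from a list of (char, prob) pairs."""
    chars = [c for c, _ in dist]
    weights = [p for _, p in dist]
    # random.choices performs a weighted draw; k=1 returns a one-item list.
    return rng.choices(chars, weights=weights, k=1)[0]

# Made-up distribution, only to show the sampling behaviour.
dist = [("w", 0.8), (",", 0.15), (" ", 0.05)]
rng = random.Random(0)   # fixed seed so the run is repeatable
draws = Counter(sample_char(dist, rng) for _ in range(10000))
print(draws.most_common())
```

Over many draws the counts should roughly follow the probabilities, with `w` by far the most frequent.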

To generate a passage of $k$ characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn.

In [34]:
def generate_text(lm, order, nletters=1000):
    history = "~" * order
    out = []
    for i in range(nletters):
        c = generate_letter(lm, history, order)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)

### Generated Shakespeare from different order models

Let's try to generate text based on different language-model orders. Let's start with something silly:

### order 2

In [35]:
lm = train_char_lm("shakespeare_input.txt", order=2)
print(generate_text(lm, 2))

Fir, lay, wit I'll.

Everis perve mads, hold For day 'come thou losess joinglaut love;
Andere say youl'd ged-wor:
If of an der they, aw he youlive; I, yourstrood win useld fle in thou
Thowand upost, be
FEENRY Brid with,
It fripinsighte prow: he thided
MADY VINDA:
Prinjoy prot ithathy is the to stall welfsay hathe coun I man his calls con:'
'You, God noth, the go,
Whatumne fath hen theas, van.
The beeper handeciortill will the way VING CASTAVISTAFF:

Be to min spen thou, st of te giderew amen, I arthe yourbuty; I cou to prent?

Com o'er mad.
As my thus, ink a was got brich th hionst heake in hone'ere, be,
Leoull to my low plabide fat, andell go for fring and bitherwity a for'strood,
As dour therwit wing-joyse your pellost uppriflume? I come.

POLK:
Nay fuld mippy; feres; the nits foliemenfor sub, wearran a gail?

QUICUS Paw, no hishic whould
scrow ye, an:
Com or lover. I wor dold my comfortand
HESS PAGO:
Whe falf; withy, aturld hou se re
The he this in foollove, a goo.

Fireglive to ote

Not so great... but what if we increase the order to 4?

### order 4

In [36]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print(generate_text(lm, 4))

First Helent citizen:
My fairs.

POSTHUMUS LEONATO:
In shake
my safests at is too rot? What, indirection:
Then humilia,
To give mistance, he world be body, heavy dear.

TALBOT:
Arise; who day in a bristiano dare let the must is again the not fire.

MONTANO:
Lady:
Welcomes of a most be, if noble Barbaron we victor, I cares.

Frence no leave a noble Antony.

BERT:
Indeed.
Kind my child our bound, I head, be habit in him for powers
Detest, cher, let it.

Chain'd bonds them with he mortal, sir.

ISABELLA:
I am so their batten hath me.

PRINCESS:
God of that I may.

BAPTISTA:
A year, thou made back pass fill not purch a hey:
Think the man between likeness of brother, fee-sick
I shows; but is presume of salt be death.
Her her any, such take his gold,
It mind,
To-nightful sleep!
Speak in out, beforeigned not your mine other but thou heart.

First Clown: happily, for my bonesty, this
Make thy for the time to all use me and a heart the devised.
I wonder'd in the got?

CASSANIO:
Writely educats.

In [38]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print(generate_text(lm, 4))

First Lawren me,
Than to does herefore,
Acquail from our many a tallow.

HELENA:
Canst husband to me founds, unto?

Protectors too.

MARK ANTONIO:
Lay him or where no cool and damned love enemy's derive thus who, says but them. Peace,
Who, I wilt the hungry patient all be not the vapoured be yet his tents of other reason
Unto them
Is not so young me mouth: Troilus!
Here is
But ye! There's at service ingration
In signity are no make it once to theres are her your courteer thy to Caesar, assure, bond brother'd hold serve you tell-eyed deeds come lies appeals are wink'd on he power'd in returns.

TRANIO:
Be you go with nod; O wick
That searches
To plague
Of his
unbolts that me
Holdier all,
Ere I see,
And, for this in pains wife's sort
To conquet our lease
your him and senterbury! Flavish for sake merce he is vantague sension my fathere, if mine would be perce and which lightning,
I wench gives Aufidius sometime; but he reeky past.

DUKE OF GAUNT:
Wait up,
Where pathink
Might
They dost gra

This is already quite reasonable, and reads like English. Just a 4-letter history! What if we increase it to 7?

### order 7

In [39]:
lm = train_char_lm("shakespeare_input.txt", order=7)
print(generate_text(lm, 7))

First Citizen:
This wealth,
Methinks in passage to glad your indiscreetly as a call
To every
thing than a pound of Hotspur's name then; 'neighbour, he shall we at thy husband for me:
When he was cast: and your blessed Milford go,
And leave here from honour sets him too.

YORK:
I will hold therefore, hence, awake him a desire it: speak more in my passionate my grief lodged to our flighty protest the flight, awhile; but I love these yellow stockings for the ripe wants their wills; so that watch,
That have been the snake,
For Doll is in the gibbets and these
Which time
Be somewhat suit might be fortunes for twenty valiant I am
Last night at Herne their villanous shamed me, Master Brook, whereon he should humours; throw thine eyes, earls, down into desperately; you are both the earth, becomes to make one more.

DUKE:
But, silence with Mowbray?

JOHN TALBOT:
The breaking it on.
Ay, marry; I'll know that name be Horatio,--or I do fear the sentence of ducats?' Or
Shall more continent:
He hath

### How about 10?

In [40]:
lm = train_char_lm("shakespeare_input.txt", order=10)
print(generate_text(lm, 10))

First Citizen:
Would you with the loss
Of what is done cannot reason, beldams as you are all well.
Write me a prologue; and let the clothes that shall revive:
Upon a wooden one?

MARGARET:
Let me give light
To the understand I think
The duke is very willingly.

EARL OF WORCESTER:
Ay, grief, I fear me, will never tremble: my life
for yours.

IACHIMO:
Should you be.

ROSALIND:
I have made our porter? My master!

GLOUCESTER:
But have I none,
But what is done, spurn her home but this populous
And here she comes towards his design
Moves like a god! the beauty thinks it were a baby still. I love you; and with most austere
sanctimony be the gods,
Let's kill him rather. I'll do something give him a present day he is deliver'd of these days!
A giving hand, thou great commander and our son.

QUEEN ELIZABETH:
Poor heart, advise:
An you be mine, my lord?

HAMLET:
And smelt so? pah!

HORATIO:
As thou lovest me not; for he that buildeth on the whole weapons. Keep them all;
By God, he be not one aliv

### This works pretty well

With an order of 4, we already get quite reasonable results. Increasing the order to 7 (about a word and a half of history) or 10 (about two short words of history) already gets us quite passable Shakespearean text. I'd say it is on par with the examples in Andrej's post. And how simple and un-mystical the model is!

### So why am I impressed with the RNNs after all?

Generating English a character at a time -- not so impressive in my view. The RNN needs to learn the previous $n$ letters, for a rather small $n$, and that's it. 

However, the code-generation example is very impressive. Why? Because of the context awareness. Note that in all of the posted examples, the code is well indented, the braces and brackets are correctly nested, and even the comments start and end correctly. This is not something that can be achieved by simply looking at the previous $n$ letters.

If the examples are not cherry-picked, and the output is generally that nice, then the LSTM did learn something not trivial at all.

Just for the fun of it, let's see what our simple language model does with the Linux kernel code:

In [42]:
!wget http://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt

URL transformed to HTTPS due to an HSTS policy
--2018-09-20 14:47:10--  https://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6206996 (5.9M) [text/plain]
Saving to: ‘linux_input.txt.1’


2018-09-20 14:47:11 (9.67 MB/s) - ‘linux_input.txt.1’ saved [6206996/6206996]



In [43]:
lm = train_char_lm("linux_input.txt", order=10)
print(generate_text(lm, 10))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 5809090: invalid start byte
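
The error comes from `open(fname).read()` decoding the file as UTF-8 and hitting a byte (0xa9) that is not valid there. One possible workaround, sketched below, is to read the file with an explicit single-byte encoding. Whether the kernel file really is Latin-1 is an assumption on my part (0xa9 is the copyright sign in Latin-1, which seems plausible for source headers), and `read_text` is a hypothetical helper, not part of the code above.

```python
import os
import tempfile

def read_text(fname, encoding="latin-1"):
    """Read a text file with an explicit encoding.

    Latin-1 maps every byte to some character, so decoding cannot fail,
    but the result is only right if the file really uses that encoding.
    """
    with open(fname, encoding=encoding) as f:
        return f.read()

# Tiny stand-in file containing the offending byte, for illustration.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"/* \xa9 example */\n")
    path = tmp.name

text = read_text(path)
os.remove(path)
print(text)
```

To use it for training, `train_char_lm` would have to read its data through such a helper instead of the bare `open(fname).read()`.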

In [None]:
lm = train_char_lm("linux_input.txt", order=15)
print(generate_text(lm, 15))

In [None]:
lm = train_char_lm("linux_input.txt", order=20)
print(generate_text(lm, 20))

In [None]:
print(generate_text(lm, 20))

In [None]:
print(generate_text(lm, 20, nletters=5000))

Order 10 is pretty much junk. In order 15 things sort-of make sense, but we jump abruptly between the 
and by order 20 we are doing quite nicely -- but are far from keeping good indentation and brackets. 

How could we? We do not have the memory, and these things are not modeled at all. While we could quite easily enrich our model to also keep track of brackets and indentation (by adding information such as "have I seen ( but not )" to the conditioning history), this requires extra work, non-trivial human reasoning, and will make the model significantly more complex.
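
To make the "have I seen ( but not )" idea concrete, here is a rough sketch of what such an enriched model could look like. The conditioning key becomes a pair of the usual history plus a flag saying whether a parenthesis is currently open. The function name `train_with_paren_flag`, the toy input and the choice to track only round parentheses are all my own assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_with_paren_flag(data, order=4):
    """Count next characters conditioned on (history, inside-parens flag)."""
    lm = defaultdict(Counter)
    padded = "~" * order + data
    depth = 0   # number of "(" seen so far that are not yet closed
    for i in range(len(padded) - order):
        history, char = padded[i:i + order], padded[i + order]
        # The flag describes the text *before* the character being predicted.
        lm[(history, depth > 0)][char] += 1
        if char == "(":
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
    return lm

# Toy input: the history "x" appears once inside parens, once outside.
lm_flag = train_with_paren_flag("(x) x;", order=1)
print(lm_flag[("x", True)])    # inside parens, "x" was followed by ")"
print(lm_flag[("x", False)])   # outside parens, "x" was followed by ";"
```

The same one-letter history now leads to two different distributions depending on the flag, which is exactly the kind of hand-crafted feature that takes human effort to design.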

The LSTM, on the other hand, seems to have just learned it on its own. And that's impressive.

## The End
