# The unreasonable effectiveness of Character-level Language Models
## (and why RNNs are still cool)

### [Yoav Goldberg](http://www.cs.biu.ac.il/~yogo)

RNNs, LSTMs and Deep Learning are all the rage, and a recent [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy is doing a great job explaining what these models are and how to train them.
It also provides some very impressive results of what they are capable of.  This is a great post, and if you are interested in natural language, machine learning or neural networks you should definitely read it. 

Go read it now, then come back here. 

You're back? Good. Impressive stuff, huh? How could the network learn to imitate the input like that?
Indeed. I was quite impressed as well.

However, it feels to me that most readers of the post are impressed for the wrong reasons.
This is because they are not familiar with **unsmoothed maximum-likelihood character-level language models** and their unreasonable effectiveness at generating rather convincing natural language outputs.

In what follows I will briefly describe these character-level maximum-likelihood language models, which are much less magical than RNNs and LSTMs, and show that they too can produce rather convincing Shakespearean prose. I will also show about 30 lines of Python code that take care of both training the model and generating the output. Compared to this baseline, the RNNs may seem somewhat less impressive. So why was I impressed? I will explain this too, below.

## Unsmoothed Maximum Likelihood Character Level Language Model 

The name is quite long, but the idea is very simple.  We want a model whose job is to guess the next character based on the previous $n$ letters. For example, having seen `ello`, the next character is likely to be either a comma or a space (if we assume it is the end of the word "hello"), or the letter `w` if we believe we are in the middle of the word "mellow". Humans are quite good at this, but of course seeing a larger history makes things easier (if we were to see 5 letters instead of 4, the choice between space and `w` would have been much easier).

We will call $n$, the number of letters we base our guess on, the _order_ of the language model.

RNNs and LSTMs can potentially learn an infinite-order language model (they guess the next character based on a "state" which supposedly encodes all the previous history). Here we will restrict ourselves to a fixed-order language model.

So, we are seeing $n$ letters, and need to guess the $(n+1)$th one. We are also given a large-ish amount of text (say, all of Shakespeare's works) that we can use. How would we go about solving this task?

Mathematically, we would like to learn a function $P(c | h)$. Here, $c$ is a character, $h$ is an $n$-letter history, and $P(c|h)$ stands for how likely it is to see $c$ after we've seen $h$.

Perhaps the simplest approach would be to just count and divide (a.k.a. **maximum likelihood estimates**). We will count the number of times each letter $c'$ appeared after $h$, and divide by the total number of letters appearing after $h$. The **unsmoothed** part means that if we did not see a given letter following $h$, we will just give it a probability of zero.
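
Written out, the count-and-divide estimate is the following. The symbol $\#(h, c)$ is notation introduced here for illustration (it is not used elsewhere in this post) and stands for the number of times character $c$ follows history $h$ in the training text:

$$
P(c \mid h) = \frac{\#(h, c)}{\sum_{c'} \#(h, c')}
$$

As a made-up example, if some history $h$ were followed by `w` 8 times and by a space 2 times, we would get $P(\texttt{w} \mid h) = 8/10 = 0.8$, $P(\text{space} \mid h) = 0.2$, and zero for every other character.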

And that's all there is to it.


### Training Code
Here is the code for training the model. `fname` is a file to read the characters from. `order` is the history size to consult. Note that we pad the data with leading `~` so that we also learn how to start.


In [1]:
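from collections import defaultdict, Counter

def train_char_lm(fname, order=4):
    # Read the whole training text into memory.
    with open(fname) as f:
        data = f.read()
    lm = defaultdict(Counter)
    # Pad with leading "~" so the model also learns how a text starts.
    pad = "~" * order
    data = pad + data
    # Count how often each character follows each `order`-character history.
    for i in range(len(data)-order):
        history, char = data[i:i+order], data[i+order]
        lm[history][char]+=1
    def normalize(counter):
        # Convert raw counts into relative frequencies.
        s = float(sum(counter.values()))
        return [(c,cnt/s) for c,cnt in counter.items()]
    outlm = {hist:normalize(chars) for hist, chars in lm.items()}
    return outlm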
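
To see what the resulting table looks like without downloading anything, here is a minimal sketch of the same count-and-divide procedure applied to an in-memory string. It is written for Python 3, and both the helper name `train_char_lm_from_text` and the toy corpus are assumptions made for illustration. Unlike `train_char_lm`, it returns nested dicts rather than lists of pairs, purely to make lookups in the example easier to read.

```python
from collections import Counter, defaultdict


def train_char_lm_from_text(text, order=2):
    """Count-and-divide on an in-memory string (hypothetical helper)."""
    data = "~" * order + text  # same leading padding as train_char_lm
    counts = defaultdict(Counter)
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        counts[history][char] += 1
    # Turn raw counts into relative frequencies. Characters never seen after
    # a history are simply absent, which is what "unsmoothed" means here.
    return {
        history: {c: n / sum(chars.values()) for c, n in chars.items()}
        for history, chars in counts.items()
    }


toy_lm = train_char_lm_from_text("hello help hello", order=2)
print(toy_lm["~~"])  # the padded start is always followed by 'h' in this corpus
print(toy_lm["el"])  # 'l' after two of the three occurrences, 'p' after one
```

The `~~` entry exists only because of the padding; it is what lets generation start from an all-`~` history later on.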

Let's train it on Andrej's Shakespeare text:

In [2]:
!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt

--2017-09-29 11:41:11--  http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt
Resolving cs.stanford.edu... 171.64.64.64
Connecting to cs.stanford.edu|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4573338 (4.4M) [text/plain]
Saving to: ‘shakespeare_input.txt’


2017-09-29 11:41:12 (8.61 MB/s) - ‘shakespeare_input.txt’ saved [4573338/4573338]



In [3]:
lm = train_char_lm("shakespeare_input.txt", order=4)

Ok. Now let's do some queries:

In [4]:
lm['ello']

[('!', 0.0068143100511073255),
 (' ', 0.013628620102214651),
 ("'", 0.017035775127768313),
 (',', 0.027257240204429302),
 ('.', 0.0068143100511073255),
 ('r', 0.059625212947189095),
 ('u', 0.03747870528109029),
 ('w', 0.817717206132879),
 ('n', 0.0017035775127768314),
 (':', 0.005110732538330494),
 ('?', 0.0068143100511073255)]

In [5]:
lm['Firs']

[('t', 1.0)]

In [6]:
lm['rst ']

[("'", 0.0008025682182985554),
 ('A', 0.0056179775280898875),
 ('C', 0.09550561797752809),
 ('B', 0.009630818619582664),
 ('E', 0.0016051364365971107),
 ('D', 0.0032102728731942215),
 ('G', 0.0898876404494382),
 ('F', 0.012038523274478331),
 ('I', 0.009630818619582664),
 ('H', 0.0040128410914927765),
 ('K', 0.008025682182985553),
 ('M', 0.0593900481540931),
 ('L', 0.10674157303370786),
 ('O', 0.018459069020866775),
 ('N', 0.0008025682182985554),
 ('P', 0.014446227929373997),
 ('S', 0.16292134831460675),
 ('R', 0.0008025682182985554),
 ('T', 0.0032102728731942215),
 ('W', 0.033707865168539325),
 ('a', 0.02247191011235955),
 ('c', 0.012841091492776886),
 ('b', 0.024879614767255216),
 ('e', 0.0032102728731942215),
 ('d', 0.015248796147672551),
 ('g', 0.011235955056179775),
 ('f', 0.011235955056179775),
 ('i', 0.016853932584269662),
 ('h', 0.019261637239165328),
 ('k', 0.0040128410914927765),
 ('m', 0.02247191011235955),
 ('l', 0.01043338683788122),
 ('o', 0.030497592295345103),
 ('n', 0.0

So `ello` is followed by either space, punctuation or `w` (or `r`, `u`, `n`), `Firs` is pretty much deterministic, and the word following `rst ` can start with pretty much every letter.
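
The query results above come back in arbitrary order, which makes the dominant continuations hard to spot. A small helper can sort a distribution by probability. This is only a sketch: `top_k` is a hypothetical name, and the `ello` list below is a subset of the entries copied from the output above rather than a fresh query.

```python
def top_k(dist, k=3):
    """Return the k most probable (char, prob) pairs (hypothetical helper)."""
    return sorted(dist, key=lambda pair: pair[1], reverse=True)[:k]


# A few entries copied from the lm['ello'] output above.
ello = [(' ', 0.013628620102214651),
        (',', 0.027257240204429302),
        ('r', 0.059625212947189095),
        ('u', 0.03747870528109029),
        ('w', 0.817717206132879)]
print(top_k(ello, 2))  # 'w' first, then 'r'
```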

### Generating from the model
Generating is also very simple. To generate a letter, we will take the history, look at the last `order` characters, and then sample a random letter based on the corresponding distribution.

In [7]:
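from random import random

def generate_letter(lm, history, order):
    # Only the last `order` characters matter.
    history = history[-order:]
    dist = lm[history]
    # Walk the cumulative distribution until it passes a uniform random draw.
    x = random()
    for c,v in dist:
        x = x - v
        if x <= 0: return c
    # Rounding can leave x marginally above 0; fall back to the last character.
    return c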
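
The loop in `generate_letter` is a hand-rolled weighted draw. For comparison, here is a sketch of the same step using the standard library's `random.choices`, which exists in Python 3.6 and later (the notebook itself predates it). The function name `generate_letter_choices` and the toy table are assumptions for illustration.

```python
import random


def generate_letter_choices(lm, history, order, rng=random):
    """Sample the next character with random.choices (Python 3.6+)."""
    dist = lm[history[-order:]]
    chars = [c for c, _ in dist]
    weights = [p for _, p in dist]
    return rng.choices(chars, weights=weights, k=1)[0]


# Toy table in the same format train_char_lm returns: history -> [(char, prob)].
toy_lm = {"ello": [("w", 0.8), (" ", 0.2)]}
rng = random.Random(0)  # fixed seed so repeated runs give the same draws
draws = [generate_letter_choices(toy_lm, "hello", 4, rng) for _ in range(10000)]
print(draws.count("w") / len(draws))  # should be close to 0.8
```

Passing an explicit `random.Random` instance is a design choice made here so that samples can be reproduced; the original code uses the module-level generator.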

To generate a passage of $k$ characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn.

In [8]:
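def generate_text(lm, order, nletters=1000):
    # Start from the all-padding history, exactly as seen during training.
    history = "~" * order
    out = []
    for i in range(nletters):
        c = generate_letter(lm, history, order)
        # Slide the window: keep the last `order` characters plus the new one.
        history = history[-order:] + c
        out.append(c)
    return "".join(out)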
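
`generate_text` always starts from the padding history. A natural variation is to continue a prompt of your own, and doing so exposes a consequence of the model being unsmoothed: a history that never occurred in training has no distribution at all. The sketch below (Python 3) handles that case by stopping early. The names `train_from_text` and `generate_from_prompt`, and the one-sentence corpus, are hypothetical and chosen only to keep the example self-contained.

```python
import random
from collections import Counter, defaultdict


def train_from_text(text, order):
    """Same count-and-divide training as above, but on an in-memory string."""
    data = "~" * order + text
    counts = defaultdict(Counter)
    for i in range(len(data) - order):
        counts[data[i:i + order]][data[i + order]] += 1
    return {h: [(c, n / sum(cs.values())) for c, n in cs.items()]
            for h, cs in counts.items()}


def generate_from_prompt(lm, order, prompt, nletters=50, rng=random):
    """Continue `prompt`; stop early if the history was never seen in training."""
    history = ("~" * order + prompt)[-order:]
    out = []
    for _ in range(nletters):
        dist = lm.get(history)
        if dist is None:  # unsmoothed model: unseen history has no distribution
            break
        chars, probs = zip(*dist)
        c = rng.choices(chars, weights=probs, k=1)[0]
        out.append(c)
        history = (history + c)[-order:]
    return prompt + "".join(out)


lm = train_from_text("to be or not to be, that is the question", order=3)
print(generate_from_prompt(lm, 3, "to b", nletters=20, rng=random.Random(1)))
print(generate_from_prompt(lm, 3, "xyz"))  # unseen history: returns the prompt unchanged
```

Stopping is just one possible policy; backing off to a shorter history would be another, but that moves away from the plain unsmoothed model described in this post.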

### Generated Shakespeare from different order models

Let's try to generate text based on different language-model orders. Let's start with something silly:

### order 2:

In [9]:
lm = train_char_lm("shakespeare_input.txt", order=2)
print(generate_text(lm, 2))

Fir, th
Theessing to he at wit,
Norm Des then oth mord, beld thromet hin weake riefour any for ferty?
And my meatermse he to makes mand: haterse at you, be in he card natery, bou fand on pure's re ain goot eventle enot chatinight wouslad hime Baptill in-men weeple?

Sirtakess't this sommels but hat sto eiss sh, wharks:
We I dearcesinaked be face samin brood.
Sece soughtly,
Is briewel st you--

KING Romears you be! thir?
I haroke fody the ow withers th thend her, deect apeace.
Ay, beggaved oake whicus,
Thave hold, I dre virs,
fore:
Am 'hoselcomeral o's us ded hin wixt Why,
Salbospee: dou that wit.

And liveread hom is of twit costen's man, welcouts, hathas aninke minkin Edware the not ade, thou hosed
pep is dearr's falt: ink ger bur me
It And olsen em on.

Loolts, to:
I man be of to
beind ther sausir. Whal a Giver, fand cand I weet is med hised gengue it pored ung: 't:
Why, courefe,
To els, the fall we hath thour he andied up man Sper isty'refur Surs the at have ford, I welorry stinand 

Not so great... but what if we increase the order to 4?

### order 4

In [10]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print(generate_text(lm, 4))

First need on't like.

KING HENRY IV:
Say, go to live the him, we'll give me audaciously
to be terready.
Call have soft this made to see to man
And then? how,
From that rest are you hath hast I will commanderstant your gentleman,
Thou would not, Sir John, you are at he powers
to take thee claim on to your climb should of valiancasest it is the friend your pring Lewishes those that their with thus drain,
And, nor the play tongue;
For, but to't: 'tis gold:
We hast more
Harm my lord:
The secretion joy is undo dish chough yoke lord?

PEMBROKE:
So am to used the cannot a king's that I am bounding bloody, groaning face?

STEPHANO:
Caesar, set.

CANTERBURY:
Hark! what herefores trius.

DUCHESS QUICKLY:
Go, call nothing disdain we had regard
Young, from of knight I beauty take your petite,
And, and for my most villant, to that my namest as you calm,
Unlike a shall yea, two?

SUFFOLK:
Because and swore vow'd at the will
whither
oppose
me this by not Gaoler:
My king
To bring, Sir John, we will n

In [11]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print(generate_text(lm, 4))

First Gentleman:
If therefore, were were 'tis; and among terms in fault?

BARDOLPH:
Brave.

Constand,
as ther's; we matten haster:
My lord! On Thy browns not victor?

Fool:
I pray
the me his day?

CARDINAL CAMPEIUS:
Come entreach done to Claudio's,
To friends,
Would we, than with to Timony.

DON ADRIANA:
Why, which thee!

MARK ANTONY:
'Twere
And is no morrows,
And the devil an you refuge been the who
Shrunk!

First-born,
As go youth thee are you;
Your patience
designieur draw, I to your any poor and I have to statesby, made not you are thy done:
Impromio?
The plots heard.

BENVOLIO:
Believe; gives,
Any treason.
The needs much all cock; and am apprehends.

LAFEU:
Your lusty.

DON ADRIANA:
Nay, I'll conscience he shall find wrong answer
With the time comes yielders.

TAMORA:
My lord, and on your requite men!

DEMETRIUS:
Well man like hail, when die ere before is not a bounter,
And what can very for three day.

GADSHILLE:
At you, hearteen must stance! coward.

TIMON:
I tell breatening: 't

This is already quite reasonable, and reads like English. Just a 4-letter history! What if we increase it to 7?

### order 7

In [12]:
lm = train_char_lm("shakespeare_input.txt", order=7)
print(generate_text(lm, 7))

First Citizen:
O royal head in two daughters,
You and me;
Which thou so look'd not and let 'em win the end.

QUEEN MARGARET:
My lord, he fearful?

Messenger:
Gracious meet withal?

HORTENSIO:
I have been his great men;
For the power he comes your pleasure in the achieved as his: it was,'--
the juvenal? why then their brethren.

QUINTUS:
Right, you do me, if these wall;
And our tardy soldiers are wound that enwraps me heart of this is a devil in an ass and our court. Let us revenge his cur the thunders by, I could not Marcius,
I have almost a mile, and doughy youth and speed, we will enfranchise you: but a limb?

FERDINAND:
Thy oath remorse: swear the head of your face to think he is gone,
Having four.
Now I have sent to make
this in my life
Became the parties, and blind, thou ever blush and blood, spirit of his good your swords.

SOMERSET:
No such matter: what thought:
The higher by pure love me, sir, no more should
comedy.

MALVOLIO:
Sir Valentine,
Now mingled than they have dancing a

### How about 10?

In [13]:
lm = train_char_lm("shakespeare_input.txt", order=10)
print(generate_text(lm, 10))

First Citizen:
That can I witness with him, thief!

IAGO:
Yet be contented,
Forswear Bianca and her lord is come.

Nurse:
By my troth,
welcome too.
How dearly they do't! 'Tis her breath.'

LAUNCE:
Ay, that's the
scene that I was sent thither, and he is one of these drones, that robb'd these men have died when Claudio lie,
Who loved his father,
Henry the Seventh succeeded in his rages, and his daughter were legitimate: fine word,--legitimate: fine word,--legitimate construction.

TAMORA:
O cruel, irreligious truth and upright,
Like softest music to her ear.

THURIO:
Nay then, two treys, and if your garments sit upon me;
Sometime she driveth o'er a soldier, that shall Clarence closely mew'd her up,
Because I love him welcome,
While I use further spoken,
That you are.

POSTHUMUS LEONATUS:
Agreed.

MARCUS ANDRONICUS:
I give it you, my lord well, that I have seen our wishes had a womb.
And fertile: let a
beast be lord of such another to Page's wife, with my lady? mistress,
That make ingrate

### This works pretty well

With an order of 4, we already get quite reasonable results. Increasing the order to 7 (~a word and a half of history) or 10 (~two short words of history) already gets us quite passable Shakespearean text. I'd say it is on par with the examples in Andrej's post. And how simple and un-mystical the model is!

### So why am I impressed with the RNNs after all?

Generating English a character at a time -- not so impressive in my view. The RNN needs to learn the previous $n$ letters, for a rather small $n$, and that's it. 

However, the code-generation example is very impressive. Why? Because of the context awareness. Note that in all of the posted examples, the code is well indented, the braces and brackets are correctly nested, and even the comments start and end correctly. This is not something that can be achieved by simply looking at the previous $n$ letters.
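
To make "correctly nested" concrete, here is a minimal sketch of a bracket checker (Python 3; `is_balanced` is a hypothetical helper, not something from Andrej's post or this notebook). It deliberately ignores the fact that brackets inside string literals and comments should not count, so treat it as a rough measure only. The point is that the check needs a stack that can grow without bound, which a fixed window of $n$ characters cannot provide.

```python
def is_balanced(text, pairs=(("(", ")"), ("{", "}"), ("[", "]"))):
    """Return True if all brackets in `text` are correctly nested and closed."""
    closers = {close: open_ for open_, close in pairs}
    openers = set(closers.values())
    stack = []
    for ch in text:
        if ch in openers:
            stack.append(ch)
        elif ch in closers:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


print(is_balanced("if (ret < 0) { return; }"))   # True
print(is_balanced("if (page_to_pfn(page)); }"))  # False: stray closing brace
```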

If the examples are not cherry-picked, and the output is generally that nice, then the LSTM did learn something not trivial at all.

Just for the fun of it, let's see what our simple language model does with the linux-kernel code:

In [14]:
!wget http://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt

--2017-09-29 11:42:10--  http://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt
Resolving cs.stanford.edu... 171.64.64.64
Connecting to cs.stanford.edu|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6206996 (5.9M) [text/plain]
Saving to: ‘linux_input.txt’


2017-09-29 11:42:11 (8.97 MB/s) - ‘linux_input.txt’ saved [6206996/6206996]



In [15]:
lm = train_char_lm("linux_input.txt", order=10)
print(generate_text(lm, 10))

~
 *     waiters.  For requeue_pi code can enforce the read-side critical section,
 * as long as this is needed for procs, tasks) for each trigger associated
 * to not have a bug somewhere outside mems_allowed mask from
 * then on but never vice versa.  Handle both possibility to update the clocksource_delta(csnow, cs->cs_last = csnow;
			continue;
		/*
		 * Print torture_rwlock_read_data *rd)
{
	/* used to incremented once, even if this means that the timekeeping_apply_adjustment, sem) &
						RWSEM_ACTIVE_READ_BIAS;
	struct irq_desc *desc, unsigned long mem_len,
					struct rt_mutex *lock, struct task_struct *task = (struct syscall_metadata *)call->data;
	int busiest_capacity;
	unsigned long flags;
	int res, ret;

	if (page_to_pfn(page));
}

static int
func_set_flag(struct file *filp, int on)
{
	struct lock_class(struct tracer_opt trace_options_init_dentry(tr);
	if (ret < 0)
		return;
		}
		if (opts->release_agent);
	if (likely(list_empty(&trace_bprint_event_file *file,
			  char *bu

In [16]:
lm = train_char_lm("linux_input.txt", order=15)
print(generate_text(lm, 15))


 *     wait->flags &= ~WQ_FLAG_EXCLUSIVE;
	spin_lock_irq(&rq->lock);

	return ret;
}

static int irq_node_proc_show(struct seq_file *s, void *unused)
{
	if (ftrace_dump_on_oops);
	return NOTIFY_DONE;
	}
	return NOTIFY_OK;
	mutex_lock(&show_mutex);
	switch (val) {
	case 0:
		/*
		 * When soft_disable is not set but the SOFT_MODE flag */
		__ftrace_event_enable_disable_cmds(void)
{
	int i, cpu;

	depth = curr->lockdep_depth; i++) {
		hlock = curr->held_locks + i;
		/*
		 * We dont care about collisions. Nodes with
		 * the same object file are loaded.
		 * The initial one takes precedence.
		 */
		if (!chain_head && ret != 2)
			if (!check_prev_add(curr, hlock, next,
						distance, trylock_loop))
				return 0;
		chain_head = 1;
	}
	chain_key = iterate_chain_key(key1, key2) \
	(((key1) << MAX_LOCKDEP_CHAIN_HLOCKS);
#endif

#ifdef CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_CFS_BANDWIDTH
static DEFINE_PER_CPU(struct rcu_dynticks *rdtp = this_cpu_ptr(lg->lock, i);
		arch_spin_unlock(&cfs_b->

In [17]:
lm = train_char_lm("linux_input.txt", order=20)
print(generate_text(lm, 20))

/*
 * linux/kernel/irq/proc.c
 *
 * Copyright 2003-2004 Red Hat, Inc.
 * Copyright (C) 2012 Rafael J. Wysocki <rjw@sisk.pl>
 */

#include <linux/personality.h>
#include <linux/init_task.h>
#include <linux/debug_locks.h>

#include "mutex-debug.h"

/*
 * Must be called with pm_mutex held.  If it is successful, control
 * reappears in the restored target kernel.
 */
static int resume_target_kernel(bool platform_mode)
{
	int error;

	error = memory_bm_find_bit(bm, pfn, &addr, &bit);
	if (!error)
		set_bit(bit, addr);

	return error;
}

static int kill_as_cred_perm(const struct cred *new, const struct cred *cred = current_cred(), *tcred;

	if (current == task)
		return 0;

	tcred = __task_cred(tsk);
		if (!uid_eq(cred->uid, make_kuid(ns, 0)) ||
		    !gid_eq(cred->gid, make_kgid(ns, 0)))
			goto out;
	}

	rec->counter++;
 out:
	local_irq_restore(flags);
			return;
		}
		mask = rnp->grpmask;
		if (rnp->parent == NULL) {
			raw_spin_unlock(&rnp_up->lock); /* irqs still off */
	}
	raw_spin_unl

In [18]:
print(generate_text(lm, 20))

/*
 * linux/kernel/irq/pm.c
 *
 * Copyright (C) 2012 Bojan Smojver <bojan@rexursive.com>
 *
 * This file is subject to the terms and conditions of the GNU General Public License
* along with this program; if not, write to the Free Software
    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
 *
 * Copyright (C) 2012 Rafael J. Wysocki <rjw@sisk.pl>
 */

#include <linux/irq.h>
#include <linux/smp.h>
#include <linux/delay.h>
#include <linux/slab.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/futex.h>
#include <linux/rcupdate.h>
#include <linux/err.h>
#include <linux/console.h>
#include <linux/seq_file.h>
#include <linux/kgdb.h>
#include <linux/syscalls.h>
#include <linux/ptrace.h>
#include <linux/pid_namespace.h>
#include <net/genetlink.h>
#include <linux/stat.h>
#include <linux/module.h>

#define CREATE_TRACE_POINTS
#include "trace_events_filter_test

/* This part must be outside protection */
#include <trace/define_trace.h>
/* audit --

In [19]:
print(generate_text(lm, 20, nletters=5000))

/*
 * linux/kernel/irq/spurious.c
 *
 * Copyright (C) Jay Lan,	<jlan@sgi.com>
 *
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, you can access it online at
 * http://www.gnu.org/licenses/>.
 */

#define pr_fmt(fmt) "Kprobe smoke test: " fmt

#include <linux/ctype.h>
#include <linux/freezer.h>

#include <asm/setup.h>

#include "trace.h"

/* Our two options */
enum {
	TRACE_NOP_OPT_ACCEPT) },
	/* Option that will be accepted by set_flag callback */
	{ TRACER_OPT(test_nop_accept, TRACE_NOP_OPT_ACCEPT = 0x1,
	TRACE_NOP_OPT_REFUSE) },
	{ } /* Always set a last empty entry */
};

static struct trace_kprobe, tp.args) +	\
	(sizeof(struct probe_arg) * (n)))


static nokprobe_inline unsigned 

Order 10 is pretty much junk. In order 15 things sort-of make sense, but we jump abruptly between unrelated fragments,
and by order 20 we are doing quite nicely -- but are far from keeping good indentation and brackets.

How could we? We do not have the memory, and these things are not modeled at all. While we could quite easily enrich our model to also keep track of brackets and indentation (by adding information such as "have I seen ( but not )" to the conditioning history), this requires extra work, non-trivial human reasoning, and will make the model significantly more complex.
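
As a rough illustration of what such an enrichment might look like, the sketch below (Python 3) conditions on a pair: the last `order` characters plus a flag saying whether some `(` is still unclosed. Everything here is an assumption made for the example, including the names `paren_open_flags` and `train_with_paren_flag`, the single hand-picked feature, and the toy corpus. It is not a tested extension of the notebook's model.

```python
from collections import Counter, defaultdict


def paren_open_flags(text):
    """flags[i] is True if, before position i, some '(' is still unclosed."""
    flags, depth = [], 0
    for ch in text:
        flags.append(depth > 0)
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
    return flags


def train_with_paren_flag(text, order=2):
    """Count-and-divide, conditioning on (last `order` chars, open-paren flag)."""
    data = "~" * order + text
    flags = [False] * order + paren_open_flags(text)
    counts = defaultdict(Counter)
    for i in range(len(data) - order):
        key = (data[i:i + order], flags[i + order])
        counts[key][data[i + order]] += 1
    return {key: {c: n / sum(cs.values()) for c, n in cs.items()}
            for key, cs in counts.items()}


lm = train_with_paren_flag("f(a, b); g(a); a, b;", order=1)
print(lm[("a", True)])   # inside parentheses: 'a' is followed by ',' or ')'
print(lm[("a", False)])  # outside parentheses: 'a' is never followed by ')'
```

Even this one flag had to be designed by hand, and it says nothing about braces, indentation, or comments; each of those would need its own feature, which is exactly the extra work and human reasoning mentioned above.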

The LSTM, on the other hand, seems to have just learned it on its own. And that's impressive.