# The unreasonable effectiveness of Character-level Language Models
## (and why RNNs are still cool)

### [Yoav Goldberg](http://www.cs.biu.ac.il/~yogo)

RNNs, LSTMs and Deep Learning are all the rage, and a recent [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy is doing a great job explaining what these models are and how to train them.
It also provides some very impressive results of what they are capable of.  This is a great post, and if you are interested in natural language, machine learning or neural networks you should definitely read it. 

Go read it now, then come back here. 

You're back? Good. Impressive stuff, huh? How could the network learn to imitate the input like that?
Indeed. I was quite impressed as well.

However, it feels to me that most readers of the post are impressed for the wrong reasons.
This is because they are not familiar with **unsmoothed maximum-likelihood character-level language models** and their unreasonable effectiveness at generating rather convincing natural language outputs.

In what follows I will briefly describe these character-level maximum-likelihood language models, which are much less magical than RNNs and LSTMs, and show that they too can produce rather convincing Shakespearean prose. I will also show about 30 lines of Python code that take care of both training the model and generating the output. Compared to this baseline, the RNNs may seem somewhat less impressive. So why was I impressed? I will explain this too, below.

## Unsmoothed Maximum Likelihood Character Level Language Model 

The name is quite long, but the idea is very simple.  We want a model whose job is to guess the next character based on the previous $n$ letters. For example, having seen `ello`, the next character is likely to be either a comma or space (if we assume it is the end of the word "hello"), or the letter `w` if we believe we are in the middle of the word "mellow". Humans are quite good at this, but of course seeing a larger history makes things easier (if we were to see 5 letters instead of 4, the choice between space and `w` would have been much easier).

We will call $n$, the number of letters on which we base our guess, the _order_ of the language model.

RNNs and LSTMs can potentially learn infinite-order language models (they guess the next character based on a "state" which supposedly encodes all the previous history). Here we will restrict ourselves to a fixed-order language model.

So, we are seeing $n$ letters, and need to guess the $(n+1)$th one. We are also given a large-ish amount of text (say, all of Shakespeare's works) that we can use. How would we go about solving this task?

Mathematically, we would like to learn a function $P(c | h)$. Here, $c$ is a character, $h$ is an $n$-letter history, and $P(c|h)$ stands for how likely it is to see $c$ after we've seen $h$.

Perhaps the simplest approach would be to just count and divide (a.k.a. **maximum likelihood estimates**). We will count the number of times each letter $c'$ appeared after $h$, and divide by the total number of letters appearing after $h$. The **unsmoothed** part means that if we did not see a given letter following $h$, we will just give it a probability of zero.
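Written out, the count-and-divide estimate looks as follows. The notation $\mathrm{count}(h, c)$ is introduced here only for illustration; it stands for the number of times the character $c$ followed the history $h$ in the training text:

$$
P(c \mid h) = \frac{\mathrm{count}(h, c)}{\sum_{c'} \mathrm{count}(h, c')}
$$

As a made-up example, if some history $h$ was seen 10 times, followed 7 times by `o` and 3 times by a space, then $P(\texttt{o} \mid h) = 0.7$, $P(\text{space} \mid h) = 0.3$, and every other character gets probability 0.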

And that's all there is to it.


### Training Code
Here is the code for training the model. `fname` is a file to read the characters from. `order` is the history size to consult. Note that we pad the data with leading `~` so that we also learn how to start.


In [21]:
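from collections import Counter, defaultdict

def train_char_lm(fname, order=4):
    with open(fname) as f:
        data = f.read()
    lm = defaultdict(Counter)
    # Pad with leading "~" so the model also learns how a text starts.
    pad = "~" * order
    data = pad + data
    # Count how often each character follows each `order`-letter history.
    for i in range(len(data)-order):
        history, char = data[i:i+order], data[i+order]
        lm[history][char]+=1
    def normalize(counter):
        # Turn raw counts into probabilities (count and divide).
        s = float(sum(counter.values()))
        return [(c,cnt/s) for c,cnt in counter.items()]
    outlm = {hist:normalize(chars) for hist, chars in lm.items()}
    return outlm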
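
To see what the counting does on something small enough to check by hand, here is a minimal sketch that applies the same idea to an in-memory string. The helper name `train_char_lm_from_text` and the toy text are assumptions made for this illustration; unlike `train_char_lm`, it returns a dict of dicts rather than lists of pairs, purely to make lookups easier to read.

```python
from collections import Counter, defaultdict

def train_char_lm_from_text(text, order=2):
    """Count-and-divide on a string instead of a file (hypothetical helper)."""
    counts = defaultdict(Counter)
    data = "~" * order + text  # same "~" padding as in train_char_lm
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        counts[history][char] += 1
    # Normalize each counter into a probability distribution.
    return {
        hist: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for hist, ctr in counts.items()
    }

toy_lm = train_char_lm_from_text("hello help hell", order=2)
print(toy_lm["el"])  # "el" is followed by "l" twice and by "p" once
print(toy_lm["~~"])  # the padded start-of-text history
```

In this toy text, `el` is followed by `l` in "hello" and "hell" and by `p` in "help", so the estimate is 2/3 for `l` and 1/3 for `p`; anything else after `el` gets probability zero.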

Let's train it on Andrej's Shakespeare text:

In [3]:
!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt

--2024-10-02 08:57:06--  http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt [following]
--2024-10-02 08:57:09--  https://www.cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt
Resolving www.cs.stanford.edu (www.cs.stanford.edu)... 23.216.154.115, 23.216.154.123
Connecting to www.cs.stanford.edu (www.cs.stanford.edu)|23.216.154.115|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2024-10-02 08:57:11 ERROR 404: Not Found.



In [22]:
lm = train_char_lm("shakespeare_input.txt", order=4)

Ok. Now let's do some queries:

In [25]:
lm['hell']

[('!', 0.06912442396313365),
 (' ', 0.22119815668202766),
 ("'", 0.018433179723502304),
 (',', 0.20276497695852536),
 ('-', 0.059907834101382486),
 ('.', 0.1336405529953917),
 ('i', 0.03225806451612903),
 ('\n', 0.018433179723502304),
 (':', 0.018433179723502304),
 (';', 0.027649769585253458),
 ('?', 0.03225806451612903),
 ('s', 0.009216589861751152),
 ('o', 0.15668202764976957)]

In [28]:
lm['hirl']

[('w', 0.3333333333333333),
 ('i', 0.26666666666666666),
 (' ', 0.13333333333333333),
 ('e', 0.13333333333333333),
 ('s', 0.13333333333333333)]

In [30]:
lm['rst ']

[('C', 0.09550561797752809),
 ('f', 0.011235955056179775),
 ('i', 0.016853932584269662),
 ('t', 0.05377207062600321),
 ('u', 0.0016051364365971107),
 ('S', 0.16292134831460675),
 ('h', 0.019261637239165328),
 ('s', 0.03290529695024077),
 ('R', 0.0008025682182985554),
 ('b', 0.024879614767255216),
 ('c', 0.012841091492776886),
 ('O', 0.018459069020866775),
 ('w', 0.024077046548956663),
 ('a', 0.02247191011235955),
 ('m', 0.02247191011235955),
 ('n', 0.020064205457463884),
 ('I', 0.009630818619582664),
 ('L', 0.10674157303370786),
 ('M', 0.0593900481540931),
 ('l', 0.01043338683788122),
 ('o', 0.030497592295345103),
 ('H', 0.0040128410914927765),
 ('d', 0.015248796147672551),
 ('W', 0.033707865168539325),
 ('K', 0.008025682182985553),
 ('q', 0.0016051364365971107),
 ('G', 0.0898876404494382),
 ('g', 0.011235955056179775),
 ('k', 0.0040128410914927765),
 ('e', 0.0032102728731942215),
 ('y', 0.002407704654895666),
 ('r', 0.0072231139646869984),
 ('p', 0.00882825040128411),
 ('A', 0.0056179

So `hell` is followed by either space, punctuation or `o` (or `i`, `s`), `hirl` has only a handful of possible continuations, and the word following `rst ` can start with pretty much every letter.
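
The distributions come back in no particular order, which makes the long ones hard to scan. One possible way to look only at the most likely continuations is to sort by probability. The helper `top_continuations` and the toy distribution below are assumptions for illustration (the numbers are made up, not taken from the corpus):

```python
def top_continuations(dist, k=3):
    """Return the k most probable (char, prob) pairs from a list of pairs."""
    return sorted(dist, key=lambda pair: pair[1], reverse=True)[:k]

# A hand-written distribution in the same (char, prob) format the model returns.
toy_dist = [("w", 0.1), (" ", 0.5), (",", 0.3), ("!", 0.1)]
print(top_continuations(toy_dist, k=2))  # [(' ', 0.5), (',', 0.3)]
```

With a trained model this would be called as `top_continuations(lm['rst '], k=5)`.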

### Generating from the model
Generating is also very simple. To generate a letter, we will take the history, look at the last `order` characters, and then sample a random letter based on the corresponding distribution.

In [31]:
from random import random

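def generate_letter(lm, history, order):
    # Only the last `order` characters matter.
    history = history[-order:]
    dist = lm[history]
    # Walk the cumulative distribution until it passes a uniform random draw.
    x = random()
    for c,v in dist:
        x = x - v
        if x <= 0: return c
    # Floating-point rounding can leave x slightly above 0; fall back to the last character.
    return c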
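
The loop in `generate_letter` is a hand-rolled weighted draw. As a sketch of an alternative, the standard library's `random.choices` performs the same kind of draw; the helper name `sample_letter` and the toy distribution are assumptions for this example. Sampling many times is also a quick way to check that the empirical frequencies match the distribution.

```python
import random
from collections import Counter

def sample_letter(dist, rng=random):
    """Draw one character from a list of (char, prob) pairs."""
    chars = [c for c, _ in dist]
    probs = [p for _, p in dist]
    return rng.choices(chars, weights=probs, k=1)[0]

toy_dist = [("a", 0.7), ("b", 0.2), ("c", 0.1)]  # made-up numbers
rng = random.Random(0)  # seeded generator so the run is repeatable
draws = Counter(sample_letter(toy_dist, rng) for _ in range(10000))
print({c: n / 10000 for c, n in draws.items()})  # close to 0.7 / 0.2 / 0.1
```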

To generate a passage of $k$ characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn.

In [46]:
def generate_text(lm, order, nletters=1000):
    history = "~" * order
    out = []
    for i in range(nletters):
        c = generate_letter(lm, history, order)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)
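
One edge case follows from the model being unsmoothed: a history that never appeared in training has no entry at all, so `lm[history]` raises a `KeyError`. This can happen if generation reaches the last `order` characters of the training text and that sequence occurs nowhere else. A minimal sketch of a variant that simply stops in that case is below; the name `generate_text_safe` and the hand-built toy model are assumptions for illustration.

```python
import random

def generate_text_safe(lm, order, nletters=1000, rng=random):
    """Like generate_text, but stop instead of failing on an unseen history."""
    history = "~" * order
    out = []
    for _ in range(nletters):
        dist = lm.get(history)
        if dist is None:  # unseen history: the model has nothing to say
            break
        chars = [c for c, _ in dist]
        probs = [p for _, p in dist]
        c = rng.choices(chars, weights=probs, k=1)[0]
        history = (history + c)[-order:]
        out.append(c)
    return "".join(out)

# Hand-built order-2 model for the text "abc" (hypothetical example).
toy_lm = {"~~": [("a", 1.0)], "~a": [("b", 1.0)], "ab": [("c", 1.0)]}
print(generate_text_safe(toy_lm, order=2))  # abc
```

Here generation stops after `abc` because the history `bc` was never followed by anything in the training text.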

### Generated Shakespeare from different order models

Let's try to generate text based on different language-model orders. Let's start with something silly:

### order 2:

In [41]:
lm = train_char_lm("shakespeare_input.txt", order=2)
print(generate_text(lm, 2))

First id faing.

Knot cius, bad, the? wom me, all worn
fale him ster OF Sixterandeakey Loolearrep youd I of aft letiles wil pow sestrep uplen:
PAGO:
To be crojeshaver'd we froy;
Whap a lights?

Put ditterve ited ch its ing thal!

If exce ablen lion.

SHYLO:
ACUS:
Upostell shour younry e' ank whoss heyet Hery mard clion al force ow me his all your Gall wif mallover hout,
A mess ever, nothave hold: facte to
to frit, I welf;
And theribloveak 'stes anty the you
At har weakfuld me.

PA:
Hast is grand als compar ithair, me shenced athere nin:
As mard th City bare's I' the beirebear you se
That with both I and but mis in don sommors,
Tamed,
Ser:
Oned wourd, eatted
AES:
I whords pay le
Merome fain-gn, sty
We hippre
Splay,
Thent and abell mere'sto hasty withe noteast I andst, mot ine mang a northy your dook whave,
Thaver:
If stions! that se useect will ant the the
For baccusen a
fortur cry for ber?

GO:
Bet, Ha,
Blow utly
beake strivers mortuare ithis for le haten tholt is a come fin, my twit n

Not so great... but what if we increase the order to 4?

### order 4

In [42]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print(generate_text(lm, 4))

First the storal, in her mindly most his o' mistress well well cale legs wither nature, if you of and thouse.

HERMIA:

NORTHUMBERLANDO:
Pray you get the madam?

MARIA:
O villant our arms.

POSTHUMUS LUCIUS:
Welcome, how I not give shut on thee betimes yellow'd, my lord, do o'clock, you, marry eye, man's he is, that can love wrong should gored togetherein thy half the did respect they welcome, King will?

CELIA:
Away cold,
You my rederatian a little very:
Third Willion! death false-book, be ever for aught for neither his bear an every peace of dried 'Inest let that done,
But in where's grace, that I the done,
Yet againstinction off and privy cup, and tractised, met time to makest world
corn,
And whence. First is numberland's
body.

PANDA:
My pray God spur trains?

SHALLOW:
Plead quoth of God, cloved, stay be had a fail'd by tyranny me, to stance, if it of thou can.

MARTIUS ENOBARBUS:
Stand's cur!

GLOUCESTER:
What, and I gallant look upon them down,
Provost:
'Tis be-legged I in fight 

In [43]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print(generate_text(lm, 4))

First Gently,
Then charm me! I seen;
Think the give it privately lord!

PRINCENTIO:
Willing lord? Who shall I lives in
such and call'd too get as seven most degrees; which, infinisters of iron, sir; penitor.

LEON:
Doubt now much
first Out
up the great news at the
soul,
The hath man? In vail'd do I love, no more Mistrength all stillion witness worth
Repent for malice;
Lay heart, dispring; I was younged wit
Who thing but I that I say with him i' the
greatenerals! Ah, where ye to be god-dew in Frence, and welcome Ottomach or well my utmost in cond Gentlemanded you are say knife?

EDGAR:
How now with measure, alack Cassion him the pardon my life;
I am the King of you till well have here play ther's but me block'd of a dog they can you love Silvia, cally disclose this the was now you chid?

HENRY:
It is more long.
Foreknow his spection
I'll honour torn an in his farth a swain
To behalf she deservant:
Such ribs,
And Indied light!

LADY MACBETH:
The pity is dunghill,
For they heavy.

CASCA:


This is already quite reasonable, and reads like English. Just 4 letters of history! What if we increase it to 7?

### order 7

In [47]:
lm = train_char_lm("shakespeare_input.txt", order=7)
print(generate_text(lm, 7))

First Clown:
Look in your with mine eyes
Of dropsies, the odds all you not guilty of her, the prince and
desire yet she was my friends, mine eye would not cast away to-night?

First Senators, &C:
Weapons, such a spousal,
That sorrows I might, the scabbard. What married and things I won in France: 'tis a lion and hark.
Take up the bourn
No travel for green,
By all you be call'd neat.--Still virginal fellow that I gather
The motto, 'In hac spe vivo.'

SIMONIDES:
Ay, Timon.

Messenger:
The south-west: while you shalt bleeding little ease.
And are contagious break with her mother!
I swear the
plain him, Dolabella,
A novices! To
be slow in my reign.

KING PHILIP:
What dost thou should have a sectary astronomers for my brother!

CORNWALL:
How now, my liege, that happy! but move a whip
To meet his valour, in
true wretches to his honourable man in Windsor was but a finger of wet,
The rest.
And that sound, sound and a bloody wars,
And all too cold a night-bird mute,
That not ignorance! Please y

### How about 10?

In [50]:
lm = train_char_lm("shakespeare_input.txt", order=40)
print(generate_text(lm, 40))

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



### This works pretty well

With an order of 4, we already get quite reasonable results. Increasing the order to 7 (~a word and a half of history) or 10 (~two short words of history) already gets us quite passable Shakespearean text. I'd say it is on par with the examples in Andrej's post. And how simple and un-mystical the model is!

### So why am I impressed with the neural networks after all?

Generating English a character at a time -- not so impressive in my view. The NN needs to learn the previous $n$ letters, for a rather small $n$, and that's it. 

However, the code-generation example is very impressive. Why? Because of the context awareness. Note that in all of the posted examples, the code is well indented, the braces and brackets are correctly nested, and even the comments start and end correctly. This is not something that can be achieved by simply looking at the previous $n$ letters. 

If the examples are not cherry-picked, and the output is generally that nice, then the NN did learn something not trivial at all.

Just for the fun of it, let's see what our simple language model does with the linux-kernel code:

In [15]:
!wget http://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt

--2024-10-02 08:57:25--  http://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt [following]
--2024-10-02 08:57:27--  https://www.cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt
Resolving www.cs.stanford.edu (www.cs.stanford.edu)... 23.216.154.123, 23.216.154.115
Connecting to www.cs.stanford.edu (www.cs.stanford.edu)|23.216.154.123|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2024-10-02 08:57:28 ERROR 404: Not Found.



In [51]:
lm = train_char_lm("linux_input.txt", order=10)
print(generate_text(lm, 10))

/*
 * linux/kernel.h>
#include <linux/mount.h>
#include <linux/tracehook.h>
#include <linux/notifier.h>
#include <linux/sched.h>
#include <linux/acct.h>
#include <linux/keyctl.h>
#include <linux/file.h>
#include <linux/kallsyms.h>

#include <linux/syscalls.h>
#include <linux/time.h
		 */
		asm("" : "+rm"(nsec));
		nsec -= NSEC_PER_SEC / HZ) * j;
#elif HZ > USEC_PER_SEC == 1GHz and @from is NSEC_PER_SEC / USER_HZ)) == 0
# if HZ < USER_HZ
	x = div_u64_rem() for another threads");
torture_param(int, stat_interval * clock->mult;
	tk->ntp_error <= 0)) {
		ret = -ENOMEM;
	next_key = array_map_lookup_elem = htab_map_free(struct audit_parent structure has multiple
 * time.
 *
 * Called when prev != next.
 */
static void torture_must_stop())
			break;
		prev_page = head_page->entries[i]);
	}
	buf += sprintf(buf, sizeof(buffer) - len;
			goto out;

			/*
			 * We are sharing ->siglock held, which matches with the sighand lock held.
 */

/* unrunnable is < 0 */
#define KDB_CMD_KGDB) {
		if (new_f

In [17]:
lm = train_char_lm("linux_input.txt", order=15)
print(generate_text(lm, 15))

/*
 * linux/kernel/time/tick-broadcast-hrtimer.c
 * This file emulates a local clock event devices which handle everything in page size chunks ensure
	 * the destination addresses
	 * through very weird things can happen
	 * if the module is unloaded, and then by giving the lock_torture_stats_print();
		torture_shutdown task started");
	do {
		schedule_timeout_interruptible(1);
	}
	smp_mb(); /* matches sched_clock_init() */

	if (!sched_clock_irqtime) {
		irqtime_account_hi_update(void)
{
	ktime_t period;

	write_seqlock(&tsk->vtime_seqlock);
}

void * __weak arch_kexec_kernel_verify_sig(image, image->kernel_buf_len);

	if (ret < 0)
		errors++;

	num_tests++;
	ret = test_kprobe();
	if (ret < 0)
		goto out_balanced;

	if (env->idle != CPU_NOT_IDLE) &&
	    likely(p->policy != SCHED_NORMAL && policy != SCHED_DEADLINE task.
 *
 * Only the static values are converted to jiffies when they are still
	 * active. Clear the pending bitmasks, but must still be cleared in our caller bc CLONE_THRE

In [18]:
lm = train_char_lm("linux_input.txt", order=20)
print(generate_text(lm, 20))

/*
 * linux/kernel/irq/chip.c
 *
 * Copyright 2003-2007 Red Hat Inc., Durham, North Carolina.
 * Copyright 2005 Hewlett-Packard Development Company, L.P.
 * Copyright (C) 2004-2006 Tom Rini <trini@kernel.crashing.org>
 * Copyright (C) 2012 Red Hat, Inc. All Rights Reserved.
 * Copyright (c) 2003 Patrick Mochel
 * Copyright (c) 2009 Wind River Systems, Inc.  All Rights Reserved.
 * Written by David Howells (dhowells@redhat.com)
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License
 *  along with this program; if not, you can access it online at
 * http://www.gnu.org/licenses/gpl-2.0.html.
 *
 * Copyright (c) 2001   David Howells (dhowells@redhat.com).
 * - Derived partially from idea by Andrea Arcangeli <andrea@suse.de>
 * - Derived also from comments by Linus
 */
#include <linux/swap.h>
#include <linux/cpu.h>
#i

In [19]:
print(generate_text(lm, 20))

/*
 * linux/kernel/irq/autoprobe.c
 *
 * Copyright (C) IBM Corporation, 2014
 *
 * Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
 */

#include <linux/module.h>	/* for MODULE_NAME_LEN via KSYM_SYMBOL_LEN */
#include <linux/clocksource.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/mount.h>
#include <linux/slab.h>
#include <linux/kdb.h>
#include <linux/module.h>

#include <asm/uaccess.h>

/*
 * mutex protecting text section modification (dynamic code patching).
 * some users need to sleep (allocating memory...) while they hold this lock.
 *
 * NOT exported to modules - patching kernel text is a really delicate matter.
 */
DEFINE_MUTEX(trace_types_lock);

	return ret;
}

static int update_cpumask(struct cpuset *cs, nodemask_t *new_mems,
		     bool cpus_updated, bool mems_updated)
{
	bool is_empty;

	spin_lock_irq(&callback_lock);
		if (!on_dfl)
			cpumask_copy(top_cpuset.effective_mems = node_states[N_MEMORY].
 * Call this routine anyti

In [20]:
print(generate_text(lm, 20, nletters=5000))

/*
 * linux/kernel/irq/autoprobe.c
 *
 * Copyright (C) 1992, 1998-2006 Linus Torvalds, Ingo Molnar
 * Copyright(C) 2005-2007, Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
 *  Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
 *  Copyright (C) 2004 Pavel Machek <pavel@ucw.cz>
 * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
 *                                       /* 5 bit base 2 exponent, 20 bits mantissa.
 * The leading bit of the mantissa is not stored, but implied for
 * non-zero exponents.
 * Largest encodable value is 50 bits.
 */

#define MANTSIZE2       20                          |
 *      |                                         ----\n");
	return 0;
}

static int trace_selftest_startup_function_graph(struct tracer *trace, struct trace_array *tr)
{
	start_branch_trace(tr);
	return 0;
}

static int tk_debug_show_sleep_time, NULL);
}

static const struct file_operations proc_cgroupstats_operations);
	return 0;
}
device_initcall(audit_watch_in

Order 10 is pretty much junk. In order 15 things sort-of make sense, but we jump abruptly between unrelated fragments,
and by order 20 we are doing quite nicely -- but are far from keeping good indentation and brackets. 

How could we? We do not have the memory, and these things are not modeled at all. While we could quite easily enrich our model to also keep track of brackets and indentation (by adding information such as "have I seen ( but not )" to the conditioning history), this requires extra work, non-trivial human reasoning, and will make the model significantly more complex. 
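
To make the "have I seen ( but not )" idea concrete, here is a minimal sketch that adds a single hand-designed flag to the conditioning history. Everything in it (the name `train_paren_aware_lm`, the choice of a boolean flag, the toy text) is an assumption for illustration, and it covers only parentheses, not indentation or other bracket types.

```python
from collections import Counter, defaultdict

def train_paren_aware_lm(text, order=2):
    """Condition on (history, inside_parens) instead of history alone."""
    counts = defaultdict(Counter)
    data = "~" * order + text
    depth = 0  # number of "(" seen so far that are not yet closed
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        counts[(history, depth > 0)][char] += 1
        # Update the bracket state after consuming the character.
        if char == "(":
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
    return {
        key: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for key, ctr in counts.items()
    }

toy_lm = train_paren_aware_lm("f(a);a;", order=1)
print(toy_lm[("a", True)])   # inside parentheses: {')': 1.0}
print(toy_lm[("a", False)])  # outside parentheses: {';': 1.0}
```

A plain order-1 model would split the probability after `a` evenly between `)` and `;`; the extra flag separates the two cases. The cost is exactly the one described above: someone had to decide that this particular piece of state matters and write the code to track it.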

Neural networks, on the other hand, seem to have just learned it on their own. And that's impressive.

## Names dataset

Let's do a head-to-head comparison between n-gram LMs and a neural network that uses trigram word embeddings as input to predict the next token. 

Notice that the model is very similar to the word2vec continuous bag of words (CBOW) model, except that instead of predicting a masked token from the tokens surrounding it, we predict the next token (as if it were masked) using only the left-to-right context. Both models use n-gram word embeddings, but the neural n-gram LM uses a multi-layer model instead of a single layer like word2vec.

In [53]:
lm = train_char_lm("names.txt", order=3)
print(generate_text(lm, 3))

emmalyiah
arie
miscotter
khia
zel
zaylen
ceren
tuley
riya
elle
kano
malillyanayah
zsofiyya
neh
elistaro
kiyah
kah
batisaary
stroyalayns
filia
leg
helyner
rovian
febena
johaan
enedyn
carshir
abishawn
jentriousseta
maizajana
adharlos
stabell
prinsley
drey
jermana
lisseelin
chanell
kadeisha
sojoledgermiya
lon
jola
zamaleya
adeli
jenia
faylon
vadisyn
javaremari
selessy
pres
badina
raskilen
asir
aadilee
javienne
ausayla
calle
alvantwan
avraj
zaevyanal
mylen
wojcie
maizaelius
samsamyah
shious
maha
weston
ryel
devannoxley
timonne
christon
khalyzah
ellah
kazelle
mishaun
drya
amario
nuh
orianiya
maedore
uzziel
kelbinyelle
kair
ryehinah
han
durelle
uriebe
cha
mayia
aylen
caelynn
kelvin
isa
tafsandier
dani
courtney
natehilderiatthaniya
dalyx
len
hedy
mahelik
artarley
lee
abdulla
paulamis
paul
assian
kateh
melrahirah
lei
ajit
awaz
camatavi
shranzaryiah
irwani
kharmaniel
rubison
niz
yazlyn
hasey
jahli
naya
naclarden
conson
blayzaheem
tymir
candil
lawsyn
jamiyana
calle
jazmino
gracious
everlyn
ellan

## The End
