# The unreasonable effectiveness of Character-level Language Models
## (and why RNNs are still cool)

### [Yoav Goldberg](http://www.cs.biu.ac.il/~yogo)

RNNs, LSTMs and Deep Learning are all the rage, and a recent [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy is doing a great job explaining what these models are and how to train them.
It also provides some very impressive results of what they are capable of.  This is a great post, and if you are interested in natural language, machine learning or neural networks you should definitely read it. 

Go read it now, then come back here. 

You're back? Good. Impressive stuff, huh? How could the network learn to imitate the input like that?
Indeed. I was quite impressed as well.

However, it feels to me that most readers of the post are impressed for the wrong reasons.
This is because they are not familiar with **unsmoothed maximum-likelihood character level language models** and their unreasonable effectiveness at generating rather convincing natural language outputs.

In what follows I will briefly describe these character-level maximum-likelihood language models, which are much less magical than RNNs and LSTMs, and show that they too can produce rather convincing Shakespearean prose. I will also show about 30 lines of Python code that take care of both training the model and generating the output. Compared to this baseline, the RNNs may seem somewhat less impressive. So why was I impressed? I will explain this too, below.

## Unsmoothed Maximum Likelihood Character Level Language Model 

The name is quite long, but the idea is very simple.  We want a model whose job is to guess the next character based on the previous $n$ letters. For example, having seen `ello`, the next character is likely to be either a comma or space (if we assume it is the end of the word "hello"), or the letter `w` if we believe we are in the middle of the word "mellow". Humans are quite good at this, but of course seeing a larger history makes things easier (if we were to see 5 letters instead of 4, the choice between space and `w` would have been much easier).

We will call $n$, the number of letters the guess is based on, the _order_ of the language model.

RNNs and LSTMs can potentially learn an infinite-order language model (they guess the next character based on a "state" which supposedly encodes all the previous history). Here we will restrict ourselves to a fixed-order language model.

So, we are seeing $n$ letters, and need to guess the $(n+1)$th one. We are also given a large-ish amount of text (say, all of Shakespeare's works) that we can use. How would we go about solving this task?

Mathematically, we would like to learn a function $P(c | h)$. Here, $c$ is a character, $h$ is an $n$-letter history, and $P(c|h)$ stands for how likely it is to see $c$ after we've seen $h$.

Perhaps the simplest approach would be to just count and divide (a.k.a. **maximum likelihood estimates**). We will count the number of times each letter $c'$ appeared after $h$, and divide by the total number of letters appearing after $h$. The **unsmoothed** part means that if we did not see a given letter following $h$, we will just give it a probability of zero.
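
Written out, and using $\mathrm{count}(h, c)$ as shorthand (introduced here only for illustration) for the number of times character $c$ followed history $h$ in the training text, the count-and-divide estimate is:

$$
P(c \mid h) = \frac{\mathrm{count}(h, c)}{\sum_{c'} \mathrm{count}(h, c')}
$$

Any $c$ with $\mathrm{count}(h, c) = 0$ therefore gets $P(c \mid h) = 0$, which is exactly the unsmoothed behaviour.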

And that's all there is to it.
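
To make the counting concrete, here is a minimal sketch on a toy string rather than a file. The function name `count_and_divide` and the toy text are made up for this illustration and are not part of the notebook's code.

```python
from collections import Counter, defaultdict

def count_and_divide(text, order):
    """Return {history: {char: probability}} using plain relative frequencies."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        history, char = text[i:i + order], text[i + order]
        counts[history][char] += 1
    return {
        h: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for h, ctr in counts.items()
    }

toy = "hello help hello"
probs = count_and_divide(toy, order=3)
print(probs["hel"])                 # 'l' was seen twice after "hel", 'p' once
print(probs["hel"].get("x", 0.0))   # never seen after "hel": probability zero
```

With such a tiny text the estimates are of course meaningless; the point is only that the whole "model" is a table of relative frequencies.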


### Training Code
Here is the code for training the model. `fname` is a file to read the characters from. `order` is the history size to consult. Note that we pad the data with leading `~` so that we also learn how to start.


In [2]:
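from collections import Counter, defaultdict

def train_char_lm(fname, order=4):
    # Read the whole training text into memory.
    with open(fname) as f:
        data = f.read()
    # lm[history][char] = number of times `char` followed `history`.
    lm = defaultdict(Counter)
    # Pad with leading "~" so the model also learns how a text starts.
    pad = "~" * order
    data = pad + data
    for i in range(len(data)-order):
        history, char = data[i:i+order], data[i+order]
        lm[history][char]+=1
    def normalize(counter):
        # Turn raw counts into relative frequencies (maximum-likelihood estimates).
        s = float(sum(counter.values()))
        return [(c,cnt/s) for c,cnt in counter.items()]
    outlm = {hist:normalize(chars) for hist, chars in lm.items()}
    return outlm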

Let's train it on Andrej's Shakespeare text:

In [3]:
!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt

--2024-10-02 08:43:28--  http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt [following]
--2024-10-02 08:43:29--  https://www.cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt
Resolving www.cs.stanford.edu (www.cs.stanford.edu)... 23.72.36.136, 23.72.36.153
Connecting to www.cs.stanford.edu (www.cs.stanford.edu)|23.72.36.136|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2024-10-02 08:43:32 ERROR 404: Not Found.



In [4]:
lm = train_char_lm("shakespeare_input.txt", order=4)

Ok. Now let's do some queries:

In [5]:
lm['then']

[('?', 0.045532157085941945),
 (' ', 0.5156516789982926),
 ('c', 0.0443938531587934),
 ('.', 0.035287421741605006),
 (',', 0.21457029026750143),
 (';', 0.02902675014228799),
 (':', 0.017643710870802503),
 ('!', 0.00796812749003984),
 ('\n', 0.02390438247011952),
 ("'", 0.006260671599317018),
 ('s', 0.03414911781445646),
 ('o', 0.0005691519635742744),
 ('i', 0.021627774615822423),
 ('-', 0.0005691519635742744),
 ('t', 0.0022766078542970974),
 ('e', 0.0005691519635742744)]

In [6]:
lm['Firs']

[('t', 1.0)]

In [7]:
lm['rst ']

[('C', 0.09550561797752809),
 ('f', 0.011235955056179775),
 ('i', 0.016853932584269662),
 ('t', 0.05377207062600321),
 ('u', 0.0016051364365971107),
 ('S', 0.16292134831460675),
 ('h', 0.019261637239165328),
 ('s', 0.03290529695024077),
 ('R', 0.0008025682182985554),
 ('b', 0.024879614767255216),
 ('c', 0.012841091492776886),
 ('O', 0.018459069020866775),
 ('w', 0.024077046548956663),
 ('a', 0.02247191011235955),
 ('m', 0.02247191011235955),
 ('n', 0.020064205457463884),
 ('I', 0.009630818619582664),
 ('L', 0.10674157303370786),
 ('M', 0.0593900481540931),
 ('l', 0.01043338683788122),
 ('o', 0.030497592295345103),
 ('H', 0.0040128410914927765),
 ('d', 0.015248796147672551),
 ('W', 0.033707865168539325),
 ('K', 0.008025682182985553),
 ('q', 0.0016051364365971107),
 ('G', 0.0898876404494382),
 ('g', 0.011235955056179775),
 ('k', 0.0040128410914927765),
 ('e', 0.0032102728731942215),
 ('y', 0.002407704654895666),
 ('r', 0.0072231139646869984),
 ('p', 0.00882825040128411),
 ('A', 0.0056179

So `then` is followed mostly by a space or punctuation, and occasionally by a letter (`c`, `s`, `i`, `t`, `o` or `e`), `Firs` is pretty much deterministic, and the word following `rst ` can start with pretty much every letter.
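
The lists printed above are not sorted by probability, which makes them hard to scan. A small helper along the following lines can show the most likely continuations first. `top_continuations` is a hypothetical name, and the example distribution is hand-made for illustration rather than taken from the trained model.

```python
def top_continuations(dist, k=3):
    """Return the k most probable (char, probability) pairs, highest first."""
    return sorted(dist, key=lambda pair: pair[1], reverse=True)[:k]

# Same [(char, prob), ...] format as the model's entries; values are illustrative only.
example_dist = [("?", 0.05), (" ", 0.50), (",", 0.25), ("c", 0.20)]
print(top_continuations(example_dist, k=2))
```

With a trained model, the call would look like `top_continuations(lm['then'])`.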

### Generating from the model
Generating is also very simple. To generate a letter, we will take the history, look at the last `order` characters, and then sample a random letter based on the corresponding distribution.

In [8]:
from random import random

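def generate_letter(lm, history, order):
    # Only the last `order` characters matter.
    history = history[-order:]
    dist = lm[history]
    # Walk the distribution, subtracting probabilities until the random draw is used up.
    x = random()
    for c,v in dist:
        x = x - v
        if x <= 0: return c
    # Guard against floating-point round-off leaving x slightly above zero.
    return c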
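
The manual loop above is one way to sample from a discrete distribution. As an alternative sketch, the standard library's `random.choices` accepts weights directly. The name `generate_letter_choices`, the optional `rng` parameter and the toy model below are assumptions made for this illustration.

```python
import random

def generate_letter_choices(lm, history, order, rng=random):
    """Sample the next character with random.choices instead of a manual loop."""
    dist = lm[history[-order:]]
    chars = [c for c, _ in dist]
    weights = [p for _, p in dist]
    return rng.choices(chars, weights=weights, k=1)[0]

# Hand-built toy model: after "ab", 'c' has probability 0.75 and 'd' 0.25.
toy_lm = {"ab": [("c", 0.75), ("d", 0.25)]}
rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
samples = [generate_letter_choices(toy_lm, "xab", 2, rng) for _ in range(1000)]
print(samples.count("c") / len(samples))  # should be close to 0.75
```

Passing an explicit `random.Random` instance is optional; it is only there to make experiments repeatable.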

To generate a passage of $k$ characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn.

In [9]:
def generate_text(lm, order, nletters=1000):
    history = "~" * order
    out = []
    for i in range(nletters):
        c = generate_letter(lm, history, order)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)
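
Generation does not have to start from the `~` padding. The sketch below continues from an arbitrary prompt instead and stops early if it reaches a history the unsmoothed model has never seen. `generate_from_prompt`, `sample` and the tiny deterministic model are hypothetical and only meant to illustrate the loop.

```python
import random

def sample(dist):
    """Draw one character from a [(char, prob), ...] list."""
    chars, weights = zip(*dist)
    return random.choices(chars, weights=weights, k=1)[0]

def generate_from_prompt(lm, order, prompt, nletters=20):
    """Continue `prompt`, left-padding with '~' if it is shorter than `order`."""
    history = ("~" * order + prompt)[-order:]
    out = []
    for _ in range(nletters):
        if history not in lm:   # unseen history: an unsmoothed model has nothing to sample
            break
        c = sample(lm[history])
        out.append(c)
        history = (history + c)[-order:]
    return prompt + "".join(out)

# Tiny hand-built deterministic model, for illustration only.
toy_lm = {"ab": [("c", 1.0)], "bc": [("a", 1.0)], "ca": [("b", 1.0)]}
print(generate_from_prompt(toy_lm, 2, "ab", nletters=7))  # prints "abcabcabc"
```

The early exit matters in practice: a prompt containing a character sequence that never occurred in the training text would otherwise raise a `KeyError`.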

### Generated Shakespeare from different order models

Let's try to generate text based on different language-model orders. Let's start with something silly:

### order 2:

In [10]:
lm = train_char_lm("shakespeare_input.txt", order=2)
print(generate_text(lm, 2))

Fireell be brainge ney and dress off don, wite, I thoulenmestim,
Nows I
her andke hus a lestir. I word ther,
Nown.

Depood's is speaks thein myster.

His mights; horge he shang beak frot oninnot.

Sol fold dround, tandrult he whing to mone all up;
I durse you bot hich thave faid of spint, they from a selithetchan CLAMLEONZABEDFORK As sirese ind.

KINSTEARDIO:
Thaturd:
RO:
Tom ANA:
Ther camet's uphres bare
Uposell, war wil prown wor; vance fe
shand Exponess: livere ithinve itay! will sof he shou, th.

VALSTAFF:
Inch wis undand hes own mock-plefou hours.

QUEEDWARWICHAN:
I'll hem as ore! Attere
corld th dog,
To you he th Rome th ton;
I plest theen make to riandear pat anstirthisell sh a vand thery of onsweath heity the fox of ame haten tonemul of Fainexece.

PORAND:
But snam mad ingelf they
vany parray dumand Troy:
Not,
Eves
WAROKE He my the theand begivanden, nured Pran th.

Obs.

DESDES:
Come tyraventle by wer, am rof of Caesell hat?

Come beck a mand alaze this dar Deaves hisdo thenca

Not so great... but what if we increase the order to 4?

### order 4

In [11]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print(generate_text(lm, 4))

First slightership joy.

Messed.

LUCIANA:
To seem to lend omingly of he with
hath blest king a little self
And, by life.
Provide: pray you the strust a friend
Wouldst to hear
Shall in my chilling madam weeps not sufficer:--and 'tis serve they first,
The days.

BANQUO:
What shake all returnings,
Of your makes friend.

Fool:
Titan prince?

HERMIONE:
Now, with cannot serpent bear for Proclusion
I and let thou with the
neces; and says of thy lords, and son, like a pen,
Upwards England, but than voice, you will prick in he put that wast thou!

LAERTES:
O godlike mine world's case.
Be this day dedicatested to you.

PANTHINO:
Pray hedge,
Tranio.

SIR TOBY BELCH:
Who wait.

QUEEN ELIZABETH:
Presumptuous with glas Hamlet's eart?
O grieved me? fed of his speaker: form or drops of them; let him: so yield
And drown by healthout my dismiss'd,
With passemblanchings that enwraps and my volumes me.

Seconditious poor me:
if proceeding enough I carous Moor; and mans: yet himself in quake your count, e

In [12]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print(generate_text(lm, 4))

First Citizen:
I shalted consent the early and the posts and the leaven hereby this dig herefore theirs.

Paintine,
That's a thou see intoler, and world, I praying
Know are old.

VALENTIO:
Conside more.

EDMUND:
Anton a search's poor presume unto the king my rehend whose are thy with you: I would not afears on all, Sir John and her in grace, I endle conjurer be, it south
A lady home, Pistorious the fault
Doth give men no scarf upon me: I short in speaks, by my sough their shall weary chivalry of Clarence.

SILVIA:
Adsummer himself:
Upon us people thee, lay for hit has shame, sir, gently little it from fly time;
And I know, yet extinct and as the scourt, I advise of your
Can chast he's with thee got weave day, man again: you have you, lord!

AJAX:
The bless, you
hast have great say! vex'd sincent gather,
For to wait upon thou not off thyself of the me! O Proteus heap'd shall watch else their gracious she's damned all
that now it again
returns will, her, as while all
Stand do thyself and

This is already quite reasonable, and reads like English. Just 4 letters of history! What if we increase it to 7?

### order 7

In [13]:
lm = train_char_lm("shakespeare_input.txt", order=7)
print(generate_text(lm, 7))

First Clown:
Nay, they supposed by thy
tabour, and then the circumstance; bleed'st among the letters full of discretion but hear of god art there is he, as he uses of the diadem
On her kind common mother and dull upon this work neither confirm,
I'll lay odds to natural coward is.

First Citizen:
If it be their barks my very
exquisite Sir Thomas,
And the regal dignity
And let us in his timeless dolours spread with this day. What news?

PETRUCHIO:
Pluck a dainty dish.

KING HENRY V:
Even so stored love
Is by a forgot him so.

PARIS:
Fairly answer this time throng; look on her; she would not kin thou art beside the absent time o' day is long;
Else the bee sucks. there is not.
Great Duke of Norfolk be replete with me, Signior Gremio, your mine own life:
The care to return.

ORLANDO:
And so the time we were come ambush on the senses of this protector?

ANDROMACHE:
Do not spare to find him leave,
To hide it.

DROMIO OF EPHESUS:
Didst thou strew.
Come, come, my birth--wherefore, one play the 

### How about 10?

In [14]:
lm = train_char_lm("shakespeare_input.txt", order=10)
print(generate_text(lm, 10))

First Citizen:
Heralds, from off her sight,
As I can of those we have not in abundance,
To feed again, thou slave, thy words revive my heart shall be recorded in this world goes.

GLOUCESTER:
This apoplexy is, as I take it, an ague. Where the people mad?

SIR TOBY BELCH:
Come, Sir Thomas Gargrave, hast thou not, when you find men worthy enough a herdsman: yea, him too,
That makes sport
To the prince's jester: a very dull kindred.

LEONATO:
My heart an ever-burning heart,
Aaron and thy
defence absence. Sweet Bianca,
Take me think on thy Proteus, Love's a mighty rock;
Which being season this in our prisons:
Therefore my good lord; I am the
grave of it.

PRIAM:
Paris, you speak like a descended?

LADY MACBETH:
Is Banquo gone from Troilus than Agamemnon;
and a man of falsehood!
Why did he swears,
To that I can make him fall
His crest that I know.

KING:
The heaven for grace and merciless process by,
How I persuade him that makes her blood;
Not rascal-like, to fall down with that particular

### This works pretty well

With an order of 4, we already get quite reasonable results. Increasing the order to 7 (~word and a half of history) or 10 (~two short words of history) already gets us quite passable Shakespearean text. I'd say it is on par with the examples in Andrej's post. And how simple and un-mystical the model is!
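
One thing worth keeping an eye on when raising the order is how the table behaves: the number of distinct histories grows, and more of them tend to have a single observed continuation (like `Firs` above), which pushes generation toward copying the training text. The following is a rough diagnostic sketch. `history_stats` is a hypothetical helper and it is shown on a toy sentence, not on the Shakespeare data.

```python
from collections import defaultdict

def history_stats(text, order):
    """Return (number of distinct histories, fraction with exactly one continuation)."""
    followers = defaultdict(set)
    data = "~" * order + text          # same padding convention as the training code
    for i in range(len(data) - order):
        followers[data[i:i + order]].add(data[i + order])
    n_hist = len(followers)
    if n_hist == 0:                    # empty text: nothing to measure
        return 0, 0.0
    n_single = sum(1 for chars in followers.values() if len(chars) == 1)
    return n_hist, n_single / n_hist

toy_sentence = "to be or not to be that is the question"
for order in (1, 2, 4):
    print(order, history_stats(toy_sentence, order))
```

Running the same function on the real training file (after reading it into a string) would show how deterministic the model becomes at each order.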

### So why am I impressed with the neural networks after all?

Generating English a character at a time -- not so impressive in my view. The NN needs to learn the previous $n$ letters, for a rather small $n$, and that's it. 

However, the code-generation example is very impressive. Why? Because of the context awareness. Note that in all of the posted examples, the code is well indented, the braces and brackets are correctly nested, and even the comments start and end correctly. This is not something that can be achieved by simply looking at the previous $n$ letters.

If the examples are not cherry-picked, and the output is generally that nice, then the NN did learn something not trivial at all.

Just for the fun of it, let's see what our simple language model does with the linux-kernel code:

In [15]:
!wget http://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt

--2024-10-02 08:43:46--  http://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt [following]
--2024-10-02 08:43:47--  https://www.cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt
Resolving www.cs.stanford.edu (www.cs.stanford.edu)... 23.72.36.153, 23.72.36.136
Connecting to www.cs.stanford.edu (www.cs.stanford.edu)|23.72.36.153|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2024-10-02 08:43:49 ERROR 404: Not Found.



In [16]:
lm = train_char_lm("linux_input.txt", order=10)
print(generate_text(lm, 10))

~~
 *     wait->flags & SD_WAKE_AFFINE) {
			int cpu;

	for_each_console(con);
		}
	}

	freezer->state = PERF_EVENT_IOC_SET_FILTER:
		return 0;
}
EXPORT_SYMBOL(cpu_all_bits, NR_CPUS) = CPU_BITS_ALL;
#else
static int aggr_fault_handle:
			error = -EINVAL;

	mutex_lock(&kexec_mutex);
	e = audit_buffer *ab, u32 secid)
{
	int cur;
	unsigned long flags;
	int restart a system shutdown.
		 */
		print_cpu_stall(&rcu_bh_ctrlblk.rcucblist,
	.curtail	= &rcu_sched_qs() and rcu callback invocation of the per-CPU
	 * rcu_sched_qs_mask variable by this cfs_rq, as
			 * we have sole access to sensitive action, we now want the system call with lock->wait_lock);
		update_tasks_nodemask - change a reference to it, after
 *	initializing a held locks:
	 */
	delta_exec;
		raw_spin_lock(&lock->base, cur);
		wake_up_process(clone_flags & CLONE_IO) {
			gc->reg_writel(gc, mask, ct->regs.disable);
extern asmlinkage __visible void tracing_stats_fops);
	if (very_verbose(class)) {
		pr_err("huh, entered softirq. I

In [17]:
lm = train_char_lm("linux_input.txt", order=15)
print(generate_text(lm, 15))


 *     wait->flags &= ~WQ_FLAG_WOKEN;		condition = true;
 *     smp_mb(); // A				smp_wmb(); // C
 *     if (!wait->flags & WQ_FLAG_WOKEN)	wait->flags |= WQ_FLAG_WOKEN;
 *         schedule();
 *
 * This would cause the waiter on which @task was blocked in @next_lock,
	 * so we can detect whether we acquired the ctx->lock, such as to serialize against the stores.
		 *
		 * In addition, this synchronize_sched for details.
	 *
	 * http://thread.gmane.org/gmane.linux.kernel/1420814
	 */
	if (current->pi_state_cache = NULL;

	return tg;
}

#ifdef CONFIG_NO_HZ_COMMON */

#ifdef CONFIG_PROVE_LOCKING)
/*
 * Forwards and backwards
 * dependency.
 *
 * This could impact scaling on very large systems.  Be reluctant to
 * use notify_on_release(struct inode *inode;

	if (no_uprobe_events() || !valid_vma(vma, true))
		return 0;

	if (strcmp(argv[0], "btc") == 0) {
		unsigned long flags;

	/*
	 * If the arch has a polling bit, we maintain an invariant:
		 *
		 * Our polling bit is clear if we're not

In [18]:
lm = train_char_lm("linux_input.txt", order=20)
print(generate_text(lm, 20))

/*
 * linux/kernel/irq/pm.c
 *
 * Copyright (C) 2008, 2009 Steffen Klassert <steffen.klassert@secunet.com>
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307  USA
 *
 * Copyright (C) 2011 Peter Zijlstra <pzijlstr@redhat.com>
 *
 * Provides a framework for enqueueing and running callbacks from hardirq
 * context. The enqueueing is NMI-safe.
 */

#include <linux/export.h>
#include <linux/sched.h>
#include <linux/kthread.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/export.h>
#include <linux/err.h>
#include <linux/freezer.h>
#include <linux/module.h>
#include <linux/tracepoint.h>

#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>
#include <linux

In [19]:
print(generate_text(lm, 20))

/*
 * linux/kernel/irq/spurious.c
 *
 * Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
 *  Copyright (C) 2013  Linus Torvalds
 */

#include <linux/delay.h>
#include <linux/atomic.h>
#include <linux/uaccess.h>

static int compat_put_timex(struct compat_timex)) ||
			__get_user(txc->shift, &utp->shift) ||
			__put_user(txc->errcnt, &utp->errcnt) ||
			__put_user(txc->shift, &utp->shift) ||
			__put_user(tv->tv_sec, &ctv->tv_sec) ||
			__put_user(txc->time.tv_usec, &utp->time.tv_usec) ||
			__get_user(txc->jitter, &utp->jitter) ||
			__get_user(txc->jitter, &utp->jitter) ||
			__put_user(txc->esterror, &utp->esterror) ||
			__put_user(ts->tv_nsec, &cts->tv_nsec)) ? -EFAULT : 0;
}

static int __init init_syscall_trace(struct ftrace_graph_ent,	graph_ent	)
		__field_desc(	unsigned long,	map,	len	)
		__field_desc(	int, 		map,	map_id	)
		__field_desc(	unsigned long,	ret,		overrun	)
		__field_desc(	int, 		map,	map_id	)
		__field_desc(	unsigned long,	map,	len	)
		__

In [20]:
print(generate_text(lm, 20, nletters=5000))

/*
 * linux/kernel/irq/resend.c
 *
 * Copyright (C) 2008 Thomas Gleixner <tglx@linutronix.de>
 *
 *  Adaptive scheduling granularity, math enhancements by Peter Zijlstra
 *  Copyright (C) 2004 Pavel Machek <pavel@ucw.cz>
 * Copyright (C) 2008, 2009 Steffen Klassert <steffen.klassert@secunet.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Library General Public
 * License version 2, as published by the Free Software Foundation; either version 2 of the License, or
 *  (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License version 2 as
 * p

Order 10 is pretty much junk. In order 15 things sort-of make sense, but we jump abruptly between the 
and by order 20 we are doing quite nicely -- but are far from keeping good indentation and brackets. 

How could we? We do not have the memory, and these things are not modeled at all. While we could quite easily enrich our model to also keep track of brackets and indentation (by adding information such as "have I seen ( but not )" to the conditioning history), this requires extra work and non-trivial human reasoning, and will make the model significantly more complex.
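
As a rough sketch of what such an enrichment could look like, the counting loop can condition on a pair `(history, inside_parens)` instead of on the history alone. This is an illustration under simplifying assumptions, not something the notebook implements: `train_with_paren_flag` is a hypothetical name, only round parentheses are tracked, and parentheses inside strings or comments are not treated specially.

```python
from collections import Counter, defaultdict

def train_with_paren_flag(text, order=4):
    """Count next-character frequencies conditioned on (history, inside_parens)."""
    counts = defaultdict(Counter)
    data = "~" * order + text
    depth = 0  # number of "(" seen so far without a matching ")"
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        # The flag describes the state *before* `char` is emitted.
        counts[(history, depth > 0)][char] += 1
        if char == "(":
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
    return counts

counts = train_with_paren_flag("f(a); g(b); a; b;", order=1)
print(dict(counts[("a", True)]))   # after "a" inside parentheses: {')': 1}
print(dict(counts[("a", False)]))  # after "a" outside parentheses: {';': 1}
```

Even this tiny change shows the cost: every extra feature has to be designed by hand, and each one multiplies the number of contexts that need to be observed in the data.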

Neural networks, on the other hand, seem to have just learned it on their own. And that's impressive.

## Names dataset

Let's do a head-to-head comparison between n-gram LMs and a neural network that uses trigram word embeddings as input to predict the next token.

Notice that the model is very similar to the word2vec continuous bag of words (CBOW) model, except that instead of predicting a masked token from the tokens surrounding it, we predict the next token (as if it were masked) using the left-to-right context. Both models use n-gram word embeddings, but n-gram LMs use a multi-layer model rather than a single layer as in word2vec.

In [21]:
lm = train_char_lm("names.txt", order=3)
print(generate_text(lm, 3))

emmie
olin
hajessa
yohan
kaythm
romie
joma
jari
dey
alynn
savaleah
lena
milley
annaislenasia
keemae
avery
ston
julin
tilanellandellah
yanna
yamira
zen
karley
dry
whitly
korie
jakorelle
melen
kora
elyn
mel
davishayellagrachiann
evel
aiyah
sebankleighaneel
henda
bayla
mykalatisai
elmin
lamilianna
ashio
elin
azellulunay
markus
corban
marcina
norryn
baaz
mari
osh
daya
rylen
dia
nay
alexani
kamouhamdynn
ava
nuri
jamorian
saat
stas
slavius
den
yani
shi
anaya
car
khaliyanny
emere
makhalson
uly
jaxxon
tei
lessa
samarius
lydiella
alaniyah
elai
zaraline
alarra
skirture
kamira
mohsehaja
kilinnlynn
liand
hariah
naline
deda
keathalina
benja
dmarianai
austian
rus
dalyn
arrick
roby
kaiven
davi
torindyley
kimirrollise
olinard
kameseun
haunachiso
milo
sali
yul
shlynn
shikaire
sha
aree
win
kremilanishaydana
kaiseerainnalia
elle
abelle
ada
hene
absalin
aara
zoda
ann
manuol
kaarion
faelektrape
henden
von
phinyona
tiwan
spurge
jenna
kota
ailee
logic
dena
kanys
anaisley
cas
payton
krissa
janando
kamanuella


## The End

In [21]:
from IPython.core.display import HTML

def css_styling():
    styles = open("../css/notebook.css", "r").read()
    return HTML(styles)
css_styling()