In a standard LDA topic model, words are generated from $k$ topic-specific language models, where $k$ is the number of topics, and the choice of topic is made word by word.  Making each topic choice independent of previous topic choices is true to the idea of topic models as providing the $k$ language models that best account for the overall distribution of words in a document set, but it is not the most natural assumption given our usual conception of
what a topic is.  Indeed, it is hard to conceive of a coherent text in which the topic changed with every word.
Here we explore the idea of **topic inertia**.  Once a topic is settled on, the model should make it difficult
to change topics.  As we make our way through a document, topic shifts should be rare.
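
One way to picture topic inertia is to sample a topic sequence in which the current topic is kept unless a rare switch event fires. The following is a minimal sketch under assumptions that are not part of the notebook: the function name `sample_sticky_topics`, the uniform choice of the new topic, and the illustrative value `p_switch=0.001` are all hypothetical.

```python
import random

def sample_sticky_topics(n_words, k, p_switch, seed=0):
    """Return one topic id per word, keeping the current topic unless a switch fires."""
    rng = random.Random(seed)
    topic = rng.randrange(k)  # initial topic, chosen uniformly (assumption)
    topics = []
    for _ in range(n_words):
        # With probability p_switch, move to a *different* topic chosen uniformly
        # (assumption; a real model would learn these transition probabilities).
        if k > 1 and rng.random() < p_switch:
            topic = rng.choice([t for t in range(k) if t != topic])
        topics.append(topic)
    return topics

# Illustrative values only: 10 topics, a 5,000-word document, p_switch = 0.001.
topics = sample_sticky_topics(n_words=5_000, k=10, p_switch=0.001)
n_switches = sum(a != b for a, b in zip(topics, topics[1:]))
print(n_switches)
```

With `p_switch` set to 1/k-like values the sequence behaves much like word-by-word topic choice; with small values it produces long single-topic runs, which is the behavior argued for above.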

We will begin by exploring a very simple idea.  Let's model a topic switch as
an event that is as probable as observing the word at frequency rank $r$ (a hyper-parameter to be chosen by some clever means).
For concreteness, let's choose $r=100$, counting ranks from zero.

We'll load some standard English frequency data and see what that gets us.
The data are taken from [Adam Kilgarriff's BNC frequency page](https://www.kilgarriff.co.uk/bnc-readme.html#raw).
He describes the procedure for creating the data as follows:

>The first 5,000 words of all documents (=files) longer than 5,000 words in the written part of the BNC were taken. There were 2018 of these, so the subcorpus was slightly over 10M words. (I used written-only on the premise that the spoken material would be too different to usefully treat as part of the same population - of course, one might say this about all sorts of subcorpora, but never mind.) A frequency list was produced for each of these (truncated) documents. Then, taking the 8189 word-pos pairs occurring 100 times or more in the sample, a 2018x8189 table giving the frequency of each word in each document was produced. For each word, the mean and variance was calculated. There were two ways to calculate mean and variance: including the zeros (eg always dividing by 2018) or excluding them (dividing by the number of documents the word occurred in). For most purposes, it is the former that is of interest so this is what I present. The "exclusive" figures may readily be reconstructed.

So we set $N$ (the size of the corpus) to 10,000,000, rounding down from the 2018 × 5,000 = 10,090,000 words in the sample.

For background on LDA, see [Explaining LDA](https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation).

In [24]:
# Choose a topic-boundary marker to license switching topics.
# The probability of a topic-boundary marker depends on the length of the document and the number of topics in the model.
# The model should keep the number of topic switches per document low, but not SO low that it can't happen.
# Let's say a topic switch is about as probable as observing a token of the word at rank 101:
# "said" in the printout below, which occurs 7,861 times in the 10M-word corpus.

In [19]:
import pandas as pd

# Whitespace-separated frequency list; the "word" column becomes the index.
df = pd.read_csv("bnc_freq_stub.txt", header=0, sep=r"\s+", index_col="word")

In [21]:
df.head()

| word | pos | freq | docs | mean | var | mn_per_var |
|------|-----|------|------|------|-----|------------|
| the | at0 | 677594 | 2018 | 335.8 | 5003.6 | 14.9 |
| of | prf | 353416 | 2018 | 175.1 | 2860.4 | 16.3 |
| and | cjc | 284466 | 2018 | 141.0 | 1157.6 | 8.2 |
| a | at0 | 215069 | 2018 | 106.6 | 550.5 | 5.2 |
| in | prp | 207531 | 2018 | 102.8 | 846.2 | 8.2 |


In [32]:
r = 100  # zero-based row index, i.e. the 101st most frequent word
df.index[r], df.iloc[r]

('said',
 pos            vvd
 freq          7861
 docs          1094
 mean           3.9
 var           56.8
 mn_per_var    14.6
 Name: said, dtype: object)

In [30]:
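N = 1e7  # corpus size in words (approximate)
r = 100  # zero-based row of the reference word
freq = df.iloc[r]["freq"]
p_top_switch = freq / N  # relative frequency of the reference word
p_top_switch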

0.0007861
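
Written out, the estimate above is a relative frequency. As a worked check, with $f_r$ denoting the corpus frequency of the word at row $r$ (notation introduced here for convenience, not taken from the data page):

$$
p_{\text{switch}} = \frac{f_r}{N} = \frac{7861}{10^{7}} = 7.861 \times 10^{-4}
$$

If we additionally assume that every word position is an independent chance to switch with this probability, the gap between switches is geometrically distributed, so a topic would persist for about $1/p_{\text{switch}} \approx 1272$ words on average.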

In [35]:
# Probability of one or more topic switches in a 1,000-word sequence:
# better than even.
p_no_switch_1000 = (1-p_top_switch)**1_000
p_switch_1000 = 1 - p_no_switch_1000
p_switch_1000

0.5445225814643666
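
The same calculation can be generalized to any document length. The sketch below assumes, as the cell above does, that each word position is an independent chance to switch; the function names `p_at_least_one_switch` and `words_for_even_odds` are illustrative and do not appear elsewhere in the notebook.

```python
import math

# Frequency of "said" divided by the corpus size, as computed above.
P_TOP_SWITCH = 7861 / 1e7

def p_at_least_one_switch(n_words, p_switch=P_TOP_SWITCH):
    """Probability of one or more switches in n_words independent positions."""
    return 1 - (1 - p_switch) ** n_words

def words_for_even_odds(p_switch=P_TOP_SWITCH):
    """Document length at which at least one switch is as likely as not."""
    return math.log(0.5) / math.log(1 - p_switch)

# Expected number of switches in 1,000 words under the same assumption.
expected_switches_per_1000 = 1_000 * P_TOP_SWITCH

for n in (100, 1_000, 5_000):
    print(n, round(p_at_least_one_switch(n), 3))
print(round(words_for_even_odds()))
```

Under this assumption, a switch becomes more likely than not somewhere between 881 and 882 words, and for the 5,000-word documents in the BNC sample the chance of at least one switch is about 98%.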

In [36]:
from nltk.corpus import wordnet as wn