
Zipf's law states that a term's frequency is inversely proportional to its rank in the frequency table. For example, if the most common word makes up 7% of the text, the second most common word makes up about half of that (3.5%), the third about one-third (roughly 2.3%), and so on. Equivalently, the product of rank and term frequency is roughly constant and approximately equal to the frequency of the most common word. Real texts do not follow this relationship exactly, but let us see how well it holds.
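
As a small illustration of the arithmetic above, the sketch below computes the share an ideal Zipf distribution would assign to the first few ranks, starting from the 7% example. The function name `zipf_share` is made up for this illustration and is not part of NLTK.

```python
def zipf_share(top_share, rank):
    """Share of the text predicted by an ideal Zipf law for a given rank."""
    return top_share / rank

# 7% for the most common word, as in the example above
for rank in range(1, 6):
    share = zipf_share(7.0, rank)
    # share * rank recovers the top share, up to floating-point rounding
    print(rank, round(share, 2), round(share * rank, 2))
```

Under an exact Zipf law the last column would be the same for every rank, which is the property the rest of this notebook checks against real counts.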

In [1]:
import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords

Now let us generate the frequency table for the 10 most common words in a sample text. For details on how the table is built, see the term frequency notebook in the repo.

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
stop_words=set(stopwords.words('english'))


In [4]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [5]:
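# Load the tokens of one text from the Gutenberg corpus
words = nltk.corpus.gutenberg.words('bryant-stories.txt')
# Keep alphabetic tokens only, lowercased
words = [word.lower() for word in words if word.isalpha()]
# Drop English stop words (tokens are already lowercase at this point)
words = [word for word in words if word not in stop_words]
fDist = FreqDist(words)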

print(len(words))
print(len(set(words)))




21718
3688


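The frequency table below lists the words in order of decreasing frequency. For each word it shows the count, the count divided by the number of distinct words (`len(fDist)`), and the product of count and rank, which is the constant of proportionality in Zipf's law. Note that dividing by `len(fDist)` does not give the word's share of the text; for that, divide by the total number of tokens (`fDist.N()` or `len(words)`).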

In [6]:
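# Columns: word, count, count / number of distinct words, count * rank
for rank, (word, count) in enumerate(fDist.most_common(10), start=1):
  print(word, count, count/len(fDist), count*rank)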


little 597 0.1618763557483731 597
said 453 0.12283080260303687 906
came 191 0.05178958785249458 573
one 183 0.04962039045553145 732
could 158 0.042841648590021694 790
king 141 0.038232104121475055 846
went 122 0.03308026030368764 854
would 112 0.03036876355748373 896
great 110 0.02982646420824295 990
day 107 0.02901301518438178 1070
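
The third column above divides each count by `len(fDist)`, the number of distinct words. The sketch below uses a made-up token list and `collections.Counter` (which `FreqDist` subclasses) to show how that differs from dividing by the total token count. The token list and variable names are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical toy tokens, not taken from the corpus
tokens = ["king", "little", "king", "day", "little", "king"]
counts = Counter(tokens)

total_tokens = sum(counts.values())  # what FreqDist.N() reports
distinct_words = len(counts)         # what len(fDist) reports

for word, count in counts.most_common():
    # Second-to-last column: share of all tokens; last column: count / distinct words
    print(word, count, count / total_tokens, count / distinct_words)
```

Only the share of all tokens sums to 1 across the vocabulary, so it is the column to use when a proportion of the text is wanted.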


The product is not constant: it ranges from 573 to 1070 across the top 10 words, although for ranks 5 to 8 it stays between roughly 790 and 900. Because Zipf's law does not describe the relationship between term frequency and rank accurately, a refinement known as the Mandelbrot approximation (the Zipf-Mandelbrot law) was developed. It adds a constant, beta, to the rank, so that frequency is taken to be inversely proportional to (rank + beta). The value of beta has been estimated at around 2.7. Let's see if this law holds for the text we have.
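
In symbols, the two relationships compared in this notebook can be written as follows. The notation is introduced here for illustration: $f(r)$ is the frequency of the word at rank $r$, and $C$ and $C'$ are constants.

$$
\text{Zipf: } f(r) \approx \frac{C}{r} \quad\Longleftrightarrow\quad r \cdot f(r) \approx C
$$

$$
\text{Mandelbrot: } f(r) \approx \frac{C'}{r + \beta} \quad\Longleftrightarrow\quad (r + \beta) \cdot f(r) \approx C', \qquad \beta \approx 2.7
$$

The general Zipf-Mandelbrot form also raises the denominator to an exponent; this notebook takes that exponent to be 1.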

In [7]:
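# Columns: word, count, count / number of distinct words, count * (rank + beta) with beta = 2.7
for rank, (word, count) in enumerate(fDist.most_common(10), start=1):
  print(word, count, count/len(fDist), count*(rank+2.7))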

little 597 0.1618763557483731 2208.9
said 453 0.12283080260303687 2129.1
came 191 0.05178958785249458 1088.7
one 183 0.04962039045553145 1226.1000000000001
could 158 0.042841648590021694 1216.6000000000001
king 141 0.038232104121475055 1226.6999999999998
went 122 0.03308026030368764 1183.3999999999999
would 112 0.03036876355748373 1198.3999999999999
great 110 0.02982646420824295 1287.0
day 107 0.02901301518438178 1358.8999999999999


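Now the product is more uniform in the mid-frequency range: for ranks 3 to 10 it stays between roughly 1089 and 1359, whereas the Zipf product for the same ranks ranges from 573 to 1070. The two most common words are the exception, with products above 2100. So for this text, Mandelbrot's approximation seems to fit the mid-range word frequencies better than Zipf's law.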
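
One way to put a number on "uniformity" is the coefficient of variation (standard deviation divided by mean) of the products. The sketch below is an addition to the analysis above, not part of it; it reuses the ten counts printed earlier, and the helper names `rank_products` and `coefficient_of_variation` are hypothetical.

```python
from statistics import mean, pstdev

# Counts of the ten most common words, copied from the output above
counts = [597, 453, 191, 183, 158, 141, 122, 112, 110, 107]

def rank_products(counts, beta=0.0):
    """Return count * (rank + beta) for ranks starting at 1."""
    return [c * (rank + beta) for rank, c in enumerate(counts, start=1)]

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)

zipf = rank_products(counts)                  # beta = 0 gives the Zipf product
mandelbrot = rank_products(counts, beta=2.7)  # beta = 2.7 as used above

# Compare the spread with and without the two most common words
for label, start in (("ranks 3-10", 2), ("ranks 1-10", 0)):
    print(label,
          round(coefficient_of_variation(zipf[start:]), 3),
          round(coefficient_of_variation(mandelbrot[start:]), 3))
```

A smaller value means the products are closer to constant. Comparing ranks 3-10 separately from all ten ranks shows how much the two most common words influence the result.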