# Plotting in Python: Examining Zipf distributions

Python's most widely used plotting library is `matplotlib`. Beyond the material covered in this notebook, there are many [official tutorials](https://matplotlib.org/stable/tutorials/index.html), as well as condensed [cheatsheets and handouts](https://matplotlib.org/cheatsheets/). The cheatsheets and handouts are especially useful in practice because they present a lot of information at a glance. Matplotlib offers a great deal of functionality, and these resources let casual users quickly zero in on what they need.

`seaborn` is an additional library built on top of `matplotlib` that makes some frequently needed higher-level plotting tasks easier; see the [seaborn documentation](http://seaborn.pydata.org/) for details.

In [None]:
import nltk
from nltk.book import text2

In [None]:
fd = nltk.FreqDist(text2)

In [None]:
# Ranks start at 1: a rank of 0 cannot be drawn on a logarithmic axis
x, y = [], []
for rank, (word, freq) in enumerate(fd.most_common(), start=1):
    x.append(rank)
    y.append(freq)

In [None]:
import matplotlib as mpl
from matplotlib import pyplot as plt

Optional Matplotlib configuration: uncomment the line below to make inline plots larger by default by setting a higher DPI.

In [None]:
# mpl.rc("figure", dpi=300)

A similar configuration using `seaborn` might look like this:

In [None]:
# import seaborn as sns

# sns.set(style="darkgrid", rc={"figure.dpi": 300})
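
The settings above apply globally for the rest of the session. If a higher DPI is only wanted for a single figure, one option is Matplotlib's `rc_context` context manager, which restores the previous defaults on exit. The following is a minimal sketch; the value 150 is an arbitrary choice for illustration, not a recommendation.

```python
import matplotlib as mpl
from matplotlib import pyplot as plt

default_dpi = mpl.rcParams["figure.dpi"]

# Assumed value for illustration; any DPI could be used here
with mpl.rc_context({"figure.dpi": 150}):
    fig_large, _ = plt.subplots()  # created while the override is active

fig_default, _ = plt.subplots()    # created after the defaults were restored

dpi_inside = fig_large.get_dpi()
dpi_outside = fig_default.get_dpi()

plt.close("all")
```

Figures created inside the `with` block pick up the temporary DPI, while later figures fall back to whatever the session default was.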

In [None]:
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_ylabel("Frequency")
ax.set_xlabel("Rank")
ax.set_title("Zipf's law in " + text2.name)

In [None]:
fig, ax = plt.subplots()
ax.loglog(x, y)
ax.set_ylabel("Frequency")
ax.set_xlabel("Rank")
ax.set_title("Zipf's law in " + text2.name)
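
On log-log axes, an idealized Zipf distribution, in which frequency is inversely proportional to rank, appears as a straight line with slope -1. One rough way to compare a real curve against that ideal is to fit a line to the logarithms of rank and frequency. The sketch below does this for synthetic counts generated from the ideal formula (the constant 50,000 and the 1,000 ranks are assumptions chosen for illustration); real text will generally deviate from this, especially at the highest and lowest ranks.

```python
import numpy as np

# Synthetic, idealized counts (an assumption for illustration): f(r) = C / r
ranks = np.arange(1, 1001)
freqs = 50_000 / ranks

# Fit a straight line to log(frequency) as a function of log(rank);
# polyfit returns the coefficients from highest degree to lowest
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), deg=1)
```

To try the same fit on the notebook's data, `ranks` and `freqs` could be replaced with `np.array(x)` and `np.array(y)`, provided the ranks start at 1.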

In [None]:
import random

In [None]:
letters = "abcde "

# Draw 100,000 characters uniformly at random; the space acts as a word separator
text = "".join(random.choices(letters, k=100_000))

In [None]:
text[:100]

In [None]:
fd = nltk.FreqDist(nltk.word_tokenize(text))

# Ranks start at 1: a rank of 0 cannot be drawn on a logarithmic axis
x, y = [], []
for rank, (word, freq) in enumerate(fd.most_common(), start=1):
    x.append(rank)
    y.append(freq)

In [None]:
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_ylabel("Frequency")
ax.set_xlabel("Rank")
ax.set_title("Zipf's law in randomly generated text")

In [None]:
fig, ax = plt.subplots()
ax.loglog(x, y)
ax.set_ylabel("Frequency")
ax.set_xlabel("Rank")
ax.set_title("Zipf's law in randomly generated text")

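How can we explain the jagged, step-like appearance of the curve at the beginning? (Compare it with the relatively smooth line that results from subjecting *Sense and Sensibility* to the same analysis.)

It is a consequence of how the text was generated. Each character is drawn uniformly from six symbols, so all words of a given length are equally likely, and each additional letter makes a word six times less likely. The five one-letter words therefore share the top ranks at roughly the same frequency, followed by the 25 two-letter words, then the 125 three-letter words, and so on, with each group forming one step of the curve. Cf. the top of the frequency distribution: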

In [None]:
fd.most_common(10)
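
The expected step structure can also be worked out directly. The sketch below assumes the generation scheme used above (five letters plus a space, all drawn with equal probability) and ignores edge effects at the very start and end of the text. Under those assumptions, a token is one specific word of length n with probability 1 / (5 * 6^n): the first character is one particular letter out of five, each further character is a particular letter with probability 1/6, and the word ends with a space with probability 1/6. The names `word_probability` and `plateaus` are introduced here for illustration only.

```python
from fractions import Fraction

n_letters = 5                          # letters in "abcde"
p_char = Fraction(1, n_letters + 1)    # each of the six symbols, space included


def word_probability(length):
    """Probability that a token is one specific word of the given length."""
    # First letter: one specific letter among the five non-space symbols.
    # Then (length - 1) specific letters and a closing space, each with p_char.
    return Fraction(1, n_letters) * p_char ** length


# For each word length: (length, number of distinct words,
#                        last rank of the step, probability per word)
plateaus = []
last_rank = 0
for length in (1, 2, 3):
    n_words = n_letters ** length
    last_rank += n_words
    plateaus.append((length, n_words, last_rank, word_probability(length)))
```

Under these assumptions the first step spans ranks 1 to 5, the second ranks 6 to 30, and the third ranks 31 to 155, with each step a factor of six lower than the previous one.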