# Unseen Semantic Diversity

In [1]:
from pathlib import Path

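# Map each play title to its tokens (word forms) and their lemmas.
plays = {}

root = Path("../data/shakespeare_txt")
for path in root.iterdir():
    if path.suffix == ".txt":
        # File names look like "<title>_<rest>.txt"; keep the part before "_".
        title = path.name[:path.name.index("_")]
        with path.open() as f:
            # Each line holds one token as "form<TAB>lemma".
            forms, lemmas = zip(*[line.rstrip('\n').split('\t') for line in f])
            plays[title] = {"forms": forms, "lemma": lemmas}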

## Estimating the Vocabulary Size of Shakespeare's Lexicon

In [2]:
import copia.utils

counts = copia.utils.to_abundance(
    [w for play in plays.values() for w in play["forms"]]
)

observed_lexical_diversity = len(counts)

print(f"Observed lexical diversity: {round(observed_lexical_diversity)}")

Observed lexical diversity: 31761


The observed lexical diversity is slightly higher than the number given by Efron and Thisted (1976): 

$$
\sum^{\infty}_{x = 1} n_x = 31534
$$

The difference is negligible (227 word forms, less than 1%). But how many words did Shakespeare actually know? We use the unseen-species estimator Chao1 (Chao 1984) to estimate his true vocabulary size:

In [3]:
import copia.estimators

estimated_lexical_diversity = copia.estimators.chao1(counts)
print(f"Estimated lexical diversity: {round(estimated_lexical_diversity)}")

Estimated lexical diversity: 55285


The Chao1 estimate is close to that of Efron and Thisted (1976), who calculate that Shakespeare's complete vocabulary comprised around 57704 words.
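
To make the estimator less of a black box, here is a minimal sketch of the classic Chao1 formula. It adds to the observed number of types a lower bound on the number of unseen types, computed from the singletons $f_1$ and doubletons $f_2$ as $f_1^2 / (2 f_2)$. The function name `chao1_sketch` and the toy tokens are illustrative assumptions. `copia.estimators.chao1` may differ in details (for example, a sample-size correction), so the numbers need not match exactly.

```python
from collections import Counter


def chao1_sketch(abundances):
    """Classic Chao1 lower bound on richness (illustrative, not copia's code)."""
    positive = [c for c in abundances if c > 0]
    s_obs = len(positive)                      # observed number of types
    f1 = sum(1 for c in positive if c == 1)    # types seen exactly once
    f2 = sum(1 for c in positive if c == 2)    # types seen exactly twice
    if f2 > 0:
        unseen = f1 ** 2 / (2 * f2)
    else:
        # Bias-corrected form, used when there are no doubletons.
        unseen = f1 * (f1 - 1) / 2
    return s_obs + unseen


# Toy example: hypothetical tokens, not taken from the corpus.
tokens = "to be or not to be that is the question the".split()
toy_counts = Counter(tokens).values()
toy_estimate = chao1_sketch(toy_counts)
print(toy_estimate)
```

The more types occur only once relative to those occurring twice, the larger the estimated number of types that were never observed.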

We can do the same for the individual plays. Applied to a single play, the question becomes how large its vocabulary would be if we extended its length to infinity.

In [4]:
import pandas as pd 

estimates = []

for play in plays:
    counts = copia.utils.to_abundance([w for w in plays[play]["forms"]])
    observed_lexical_diversity = len(counts)
    estimated_lexical_diversity = copia.estimators.chao1(counts)
    estimates.append({"play": play, 
                      "obs": observed_lexical_diversity, 
                      "est": estimated_lexical_diversity})
estimates = pd.DataFrame(estimates)
estimates["coverage"] = estimates["obs"] / estimates["est"]
estimates.sort_values("coverage")

| | play | obs | est | coverage |
|---|---|---|---|---|
| 19 | antony-and-cleopatra | 4405 | 10453.165176 | 0.421403 |
| 12 | king-lear | 4672 | 11029.200114 | 0.423603 |
| 5 | macbeth | 3694 | 8579.005524 | 0.430586 |
| 20 | troilus-and-cressida | 4708 | 10719.541123 | 0.439198 |
| 21 | twelfth-night | 3389 | 7694.191321 | 0.440462 |
| 26 | hamlet | 5157 | 11684.907005 | 0.441339 |
| 22 | henry-iv-part-2 | 4483 | 9989.904893 | 0.448753 |
| 6 | alls-well-that-ends-well | 3862 | 8597.776736 | 0.449186 |
| 9 | the-merry-wives-of-windsor | 3590 | 7958.490507 | 0.451091 |
| 36 | measure-for-measure | 3699 | 8178.341008 | 0.452292 |


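The table is sorted by ascending coverage, so the plays at the top are those whose observed vocabulary makes up the smallest share of their estimated vocabulary. It is not clear whether this means anything, but it is interesting that so many of the most famous plays appear near the top of this ranking: their estimated lexical diversity is remarkably high given the number of observed word forms.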
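
As a quick sanity check on how to read the `coverage` column, the arithmetic for a single row can be redone by hand. The values below are copied from the first row of the table above; the variable names are only illustrative.

```python
# Values copied from the first row of the table above.
obs = 4405            # observed word forms in Antony and Cleopatra
est = 10453.165176    # Chao1 estimate for the same play

coverage = obs / est  # share of the estimated vocabulary that was observed
unseen = est - obs    # estimated number of word forms that were never observed

print(f"coverage: {coverage:.4f}, estimated unseen forms: {unseen:.0f}")
```

A coverage of roughly 0.42 thus means that, under the Chao1 estimate, more than half of the word forms the play could have drawn on never appear in it.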

## Estimating the Semantic Diversity of Shakespeare's Lexicon

In [5]:
import numpy as np

data = np.load("../data/shakespeare.embs.forms.npz")
keys, vecs = data["keys"], data["vecs"]

In [12]:
import collections
from sklearn.metrics import pairwise

dists = pairwise.euclidean_distances(vecs)
counts = collections.Counter(w for play in plays.values() for w in play["forms"])
counts = np.array([counts[w] for w in keys])

In [13]:
FAD = copia.estimators._compute_fad(dists, counts)

In [14]:
FAD

{'obs': 225784400,
 'F0+': 128125009,
 'F+0': 128125025,
 'F00': 72489077,
 'FAD': 554523511,
 'CI_lower': 554389793.6411281,
 'CI_upper': 554657282.3637618}
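
In this output the four components add up to the total: `obs` plus `F0+`, `F+0` and `F00` equals `FAD`. Reading `obs` as the contribution of observed word pairs and the other three as estimated contributions involving unseen words is an interpretation based on the key names, not on documentation of the private `_compute_fad` helper. A short check with the values copied from the output above:

```python
# Values copied from the output of the cell above.
fad = {
    "obs": 225784400,
    "F0+": 128125009,
    "F+0": 128125025,
    "F00": 72489077,
    "FAD": 554523511,
}

# Sum of the observed part and the three remaining components.
components = fad["obs"] + fad["F0+"] + fad["F+0"] + fad["F00"]

# Analogue of the lexical coverage: observed share of the estimated total.
semantic_coverage = fad["obs"] / fad["FAD"]

print(components == fad["FAD"])
print(f"semantic coverage: {semantic_coverage:.3f}")
```

The ratio `obs / FAD` is the same coverage measure that is computed per play in the next cell.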

In [17]:
estimates = []

for play in plays:
    counts = collections.Counter(w for w in plays[play]["forms"])
    counts = np.array([counts[w] for w in keys])
    fad = copia.estimators._compute_fad(dists, counts)
    fad["play"] = play
    estimates.append(fad)

estimates = pd.DataFrame(estimates)
estimates["coverage"] = estimates["obs"] / estimates["FAD"]
estimates.sort_values("coverage")

| | obs | F0+ | F+0 | F00 | FAD | CI_lower | CI_upper | play | coverage |
|---|---|---|---|---|---|---|---|---|---|
| 12 | 7904076 | 8312928 | 8312930 | 8744367 | 33274300 | 33215800.0 | 33332930.0 | king-lear | 0.237543 |
| 21 | 5467608 | 5602939 | 5602937 | 5739279 | 22412763 | 22367620.0 | 22458030.0 | twelfth-night | 0.243951 |
| 19 | 6837614 | 6677690 | 6677692 | 6516168 | 26709164 | 26662910.0 | 26755520.0 | antony-and-cleopatra | 0.256003 |
| 9 | 5546402 | 5405187 | 5405184 | 5262179 | 21618951 | 21577870.0 | 21660140.0 | the-merry-wives-of-windsor | 0.256553 |
| 6 | 6025474 | 5757431 | 5757433 | 5494505 | 23034842 | 22992390.0 | 23077400.0 | alls-well-that-ends-well | 0.261581 |
| 20 | 8869442 | 8385267 | 8385275 | 7924672 | 33564656 | 33515500.0 | 33613910.0 | troilus-and-cressida | 0.264249 |
| 5 | 5101202 | 4796390 | 4796390 | 4508320 | 19202302 | 19166450.0 | 19238240.0 | macbeth | 0.265656 |
| 22 | 8233899 | 7597630 | 7597622 | 7008716 | 30437866 | 30391860.0 | 30483960.0 | henry-iv-part-2 | 0.270515 |
| 26 | 10685396 | 9812487 | 9812488 | 9003166 | 39313536 | 39263320.0 | 39363840.0 | hamlet | 0.271799 |
| 36 | 5640040 | 5059964 | 5059961 | 4535726 | 20295691 | 20259560.0 | 20331910.0 | measure-for-measure | 0.277893 |
