# Corpus Lexical Entropy $$H(V_D)$$

**Definition.**  
Let  
$$
d = \{ t_1, \dots, t_n \}
$$  
be a document, and let  
$$
D = \{ d_1, \dots, d_N \}
$$  
be a corpus. The vocabulary  
$$
V_D = \bigcup_{d \in D} d
$$  
is the set of all unique terms in the corpus. For each term $t \in V_D$, define  
$$
p_D(t) := \frac{\bigl|\{\,d \in D : t \in d\}\bigr|}{|D|}\,.
$$  
For a document drawn uniformly at random from $D$, the event that it contains $t$ is a Bernoulli variable with parameter $p_D(t)$, so its Shannon entropy (in bits) is  
$$
H(t) = p_D(t)\,\log_2\!\Bigl(\tfrac{1}{p_D(t)}\Bigr)\;+\;(1 - p_D(t))\,\log_2\!\Bigl(\tfrac{1}{1 - p_D(t)}\Bigr).
$$  
Finally, the **Corpus Lexical Entropy** is  
$$
H(V_D) := \sum_{t \in V_D} H(t).
$$
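
As a worked illustration of the definition, here is a minimal sketch on a toy corpus of four documents given directly as term sets. The names `doc_freq`, `binary_entropy`, `p`, `h`, and `total` are illustrative and chosen only for this example.

```python
import math
from collections import Counter

# Toy corpus: each document is already a set of terms (assumed for illustration).
corpus = [{"a", "b"}, {"c"}, {"d", "e"}, {"b", "c", "e"}]
n_docs = len(corpus)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in corpus for term in doc)

def binary_entropy(prob):
    """Shannon entropy (in bits) of a Bernoulli(prob) variable."""
    if prob <= 0.0 or prob >= 1.0:
        return 0.0
    return -prob * math.log2(prob) - (1 - prob) * math.log2(1 - prob)

# p_D(t) and H(t) for every term in the vocabulary.
p = {term: count / n_docs for term, count in doc_freq.items()}
h = {term: binary_entropy(p[term]) for term in p}

# H(V_D) is the sum of the per-term entropies.
total = sum(h.values())

for term in sorted(p):
    print(term, p[term], round(h[term], 4))
print("H(V_D) =", round(total, 4))
```

Terms that appear in half of the documents (`b`, `c`, `e`) contribute a full bit each, while `a` and `d`, which appear in one document out of four, contribute less.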

**Range.**  
- **Minimum**:  
  $$
  \min H(V_D) = 0
  $$  
  occurs if and only if every $p_D(t)\in\{0,1\}$. Since each $t \in V_D$ appears in at least one document, this means $p_D(t)=1$ for every term, i.e. all documents contain exactly the same terms.  
- **Maximum**:  
  $$
  \max H(V_D) = |V_D|
  $$  
  since for each $t$, $H(t)\le1$, with equality exactly when $p_D(t)=\tfrac12$.
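
Because $p_D(t) = k/N$ for some integer $k$, the value $\tfrac12$ is only reachable when $N$ is even, so for an odd number of documents the bound $|V_D|$ is not attained. The sketch below checks this numerically; `binary_entropy` and `max_term_entropy` are hypothetical helper names introduced for this illustration, not part of the notebook's code.

```python
import math

def binary_entropy(prob):
    """Shannon entropy (in bits) of a Bernoulli(prob) variable."""
    if prob <= 0.0 or prob >= 1.0:
        return 0.0
    return -prob * math.log2(prob) - (1 - prob) * math.log2(1 - prob)

def max_term_entropy(n_docs):
    """Largest H(t) reachable when t appears in k of n_docs documents, 1 <= k <= n_docs."""
    return max(binary_entropy(k / n_docs) for k in range(1, n_docs + 1))

# Per-term maximum for a few corpus sizes.
for n_docs in (2, 3, 4, 5):
    print(n_docs, round(max_term_entropy(n_docs), 4))
```

For a fixed $N$, the largest attainable $H(V_D)$ is therefore `max_term_entropy(N) * len(vocab)`, which equals $|V_D|$ only for even $N$.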

---

# Document Pairwise Diversity $$D_J(D)$$

**Definition.**  
Given the same corpus $D=\{d_1,\dots,d_N\}$, let  
$$
\delta_J(d_i,d_j) = 1 - \frac{\lvert d_i \cap d_j\rvert}{\lvert d_i \cup d_j\rvert}
$$  
be the Jaccard distance between documents. The **average pairwise diversity** is  
$$
D_J(D) = \frac{1}{\binom{N}{2}} \sum_{1\le i<j\le N} \delta_J(d_i,d_j).
$$

**Range.**  
- **Minimum**:  
  $$
  \min D_J(D) = 0
  $$  
  when all documents share exactly the same set of terms ($\delta_J=0$ for every pair).  
- **Maximum**:  
  $$
  \max D_J(D) = 1
  $$  
  when every pair of documents is disjoint ($\delta_J=1$ for every pair).
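
The two extremes and an intermediate case can be checked on small term sets. This is a minimal sketch; returning 0 for two empty documents is an assumed convention, since the formula is undefined when the union is empty.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two term sets (0.0 if both are empty, by convention)."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

identical = jaccard_distance({"a", "b"}, {"a", "b"})      # same term set
disjoint = jaccard_distance({"a", "b"}, {"c"})            # no shared term
partial = jaccard_distance({"a", "b"}, {"b", "c", "e"})   # 1 shared term, 4 in the union

print(identical, disjoint, partial)
```

The partial case gives $1 - \tfrac14 = 0.75$: one shared term (`b`) out of four distinct terms.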


In [None]:
import math
from itertools import combinations
import pandas as pd

def compute_corpus_lexical_entropy(corpus):
    """
    Computes Corpus Lexical Entropy H(V_D) for a given corpus.
    corpus: list of strings (documents)
    """
    docs = [set(doc.lower().split()) for doc in corpus]
    N = len(docs)
    vocab = set().union(*docs)
    
    def entropy(p):
        if p == 0 or p == 1:
            return 0.0
        return p * math.log2(1/p) + (1 - p) * math.log2(1/(1 - p))
    
    H = 0.0
    for term in vocab:
        p_t = sum(1 for d in docs if term in d) / N
        H += entropy(p_t)
    return H

def compute_average_jaccard(corpus):
    """
    Computes average pairwise Jaccard distance D_J(D) for a given corpus.
    corpus: list of strings (documents)
    """
    docs = [set(doc.lower().split()) for doc in corpus]
    pairs = list(combinations(docs, 2))
    
    def jaccard_distance(a, b):
        inter = len(a & b)
        union = len(a | b)
        return 1 - inter / union if union > 0 else 0
    
    distances = [jaccard_distance(d1, d2) for d1, d2 in pairs]
    return sum(distances) / len(distances) if distances else 0



In [6]:
def run_experiment(corpora):
    results = []
    for name, corpus in corpora.items():
        H = compute_corpus_lexical_entropy(corpus)
        DJ = compute_average_jaccard(corpus)
        results.append({
            "Corpus": name,
            "H(V_D)": round(H, 4),
            "D_J(D)": round(DJ, 4)
        })

    df = pd.DataFrame(results).set_index("Corpus")
    return df

In [None]:
V = {"a", "b", "c", "d", "e"}

# Corpora are lists of strings, as the docstrings expect
# (a set would silently drop duplicate documents).

# D1 uses exactly one term per document
D1 = [
    "a",
    "b",
    "c",
    "d",
    "e",
]  # Expected value of the metric: maximal

# D2 has no two documents sharing a term
D2 = [
    "a b",
    "c",
    "d e",
]  # Expected value of the metric: maximal

# D3 adds a new document to D2 without adding new terms
D3 = [
    "a b",
    "c",
    "d e",
    "b c e",  # we expect a lower diversity than D2, because it repeats terms
]

# D4 adds a new document to D2, with a new term
D4 = [
    "a b",
    "c",
    "d e",
    "f",
]

# D5 splits the vocabulary into two parts, reaching the theoretical maximum for H(V_D)
D5 = [
    "a d e",
    "b c",
]

# In D6, each term {a, b, c, d, e} appears in exactly half of the documents.
D6 = [
    "a b",
    "a c e",
    "b d e",
    "c d",
]

run_experiment({"D1": D1, "D2": D2, "D3": D3, "D4": D4, "D5": D5, "D6": D6})

| Corpus | H(V_D) | D_J(D) |
|--------|--------|--------|
| D1     | 3.6096 | 1.0    |
| D2     | 4.5915 | 1.0    |
| D3     | 4.6226 | 0.8611 |
| D4     | 4.8677 | 1.0    |
| D5     | 5.0    | 1.0    |
| D6     | 5.0    | 0.8    |


What do we mean by "diversity"? To me, diversity in this context should be more *document-oriented*: given a corpus $D$ and two documents picked at random from it, the fewer terms they have in common, the higher the diversity.

$H(V_D)$ focuses instead on how terms are distributed across the corpus. That can be fine, but it may not be optimal for our goal.

At this stage, I would rather prefer $D_J(D)$:
- In both D1 and D2, any pair of documents we pick has an empty intersection. Therefore, the diversity should be maximal for both ->  **$D_J(D)$ does this (1.0 for both), $H(V_D)$ does not (3.6096 vs. 4.5915).**
- D1 and D5 should both be maximal, because in each corpus the documents are totally diverse (same argument as for D1 and D2) ->  **$D_J(D)$ does this (1.0 for both), $H(V_D)$ does not (3.6096 vs. 5.0).**
- In D6, the first and last documents are totally diverse, but every other pair of documents has some intersection. This should be reflected by a lower level of diversity ->  **$D_J(D)$ does this (0.8), $H(V_D)$ does not (it stays at its maximum, 5.0).**
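
The claim about D6 can be verified pair by pair. The sketch below recomputes the Jaccard distance for every pair of documents in D6 (written here as a list so that document indices are fixed); `pair_distances`, `disjoint_pairs`, and `average` are names introduced only for this illustration.

```python
from itertools import combinations

# D6 from the experiment, as an ordered list of documents.
D6 = ["a b", "a c e", "b d e", "c d"]
docs = [set(doc.split()) for doc in D6]

# Jaccard distance for every unordered pair of document indices (i, j).
pair_distances = {}
for (i, a), (j, b) in combinations(enumerate(docs), 2):
    pair_distances[(i, j)] = 1 - len(a & b) / len(a | b)

for pair, dist in pair_distances.items():
    print(pair, round(dist, 4))

# Pairs with no term in common, and the resulting D_J(D6).
disjoint_pairs = [pair for pair, dist in pair_distances.items() if dist == 1.0]
average = sum(pair_distances.values()) / len(pair_distances)
print("disjoint pairs:", disjoint_pairs)
print("D_J(D6) =", round(average, 4))
```

Only the pair (0, 3), i.e. the first and last documents, is fully disjoint; the other five pairs share one term each, which pulls the average down to 0.8.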