# Corpus Lexical Entropy $H(V_D)$

**Definition.**  
Let  
$$
d = \{ t_1, \dots, t_n \}
$$  
be a document, and let  
$$
D = \{ d_1, \dots, d_N \}
$$  
be a corpus. The vocabulary  
$$
V_D = \bigcup_{d \in D} d
$$  
is the set of all unique terms in the corpus. For each term $t \in V_D$, define  
$$
p_D(t) := \frac{\bigl|\{\,d \in D : t \in d\}\bigr|}{|D|}\,.
$$  
For each term $t$, the indicator that $t$ occurs in a document drawn uniformly at random from $D$ is a Bernoulli random variable with parameter $p_D(t)$, so its Shannon entropy is  
$$
H(t) = p_D(t)\,\log_2\!\Bigl(\tfrac{1}{p_D(t)}\Bigr)\;+\;(1 - p_D(t))\,\log_2\!\Bigl(\tfrac{1}{1 - p_D(t)}\Bigr).
$$  
Finally, the **Corpus Lexical Entropy** is  
$$
H(V_D) := \sum_{t \in V_D} H(t).
$$
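
As a worked example (the four-document corpus below is chosen purely for illustration), take $D = \{\{a,b\},\{c\},\{d,e\},\{b,c,e\}\}$, so that $N = 4$ and $V_D = \{a,b,c,d,e\}$. The document frequencies are $p_D(a) = p_D(d) = \tfrac14$ and $p_D(b) = p_D(c) = p_D(e) = \tfrac12$, which gives
$$
H(a) = H(d) = \tfrac14 \log_2 4 + \tfrac34 \log_2 \tfrac43 \approx 0.8113, \qquad H(b) = H(c) = H(e) = 1,
$$
$$
H(V_D) \approx 2\,(0.8113) + 3\,(1) = 4.6226.
$$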

**Range.**  
- **Minimum**:  
  $$
  \min H(V_D) = 0
  $$  
  occurs if and only if every $p_D(t)\in\{0,1\}$. Since every $t \in V_D$ appears in at least one document, $p_D(t) > 0$, so this means every term appears in all documents.  
- **Maximum**:  
  $$
  \max H(V_D) = |V_D|
  $$  
  since for each $t$, $H(t)\le1$, with equality exactly when $p_D(t)=\tfrac12$.
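
The upper bound requires $p_D(t)=\tfrac12$, and $p_D(t)$ can only take the values $k/N$, so the bound seems to be reachable only when $N$ is even. The sketch below checks this for two small corpus sizes; `binary_entropy` is a helper name introduced here for illustration, not part of the notebook.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable, taking 0 * log(0) as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# p_D(t) can only be k / N for k = 1..N, so scan the attainable values.
best_per_term = {}
for n_docs in (4, 5):
    best_per_term[n_docs] = max(
        binary_entropy(k / n_docs) for k in range(1, n_docs + 1)
    )
    print(n_docs, round(best_per_term[n_docs], 4))
```

With four documents a term can reach exactly 1 bit (at $k=2$); with five documents the best attainable value per term is about 0.971, so $H(V_D) < |V_D|$ for any corpus of that size.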

---

# Document Pairwise Diversity $D_J(D)$

**Definition.**  
Given the same corpus $D=\{d_1,\dots,d_N\}$, let  
$$
\delta_J(d_i,d_j) = 1 - \frac{\lvert d_i \cap d_j\rvert}{\lvert d_i \cup d_j\rvert}
$$  
be the Jaccard distance between documents. The **average pairwise diversity** is  
$$
D_J(D) = \frac{1}{\binom{N}{2}} \sum_{1\le i<j\le N} \delta_J(d_i,d_j).
$$

**Range.**  
- **Minimum**:  
  $$
  \min D_J(D) = 0
  $$  
  when all documents share exactly the same set of terms ($\delta_J=0$ for every pair).  
- **Maximum**:  
  $$
  \max D_J(D) = 1
  $$  
  when every pair of documents is disjoint ($\delta_J=1$ for every pair).
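
As a worked example (again with an illustrative corpus, not one used elsewhere in this notebook), take $d_1 = \{a,b\}$, $d_2 = \{b,c\}$, $d_3 = \{d\}$. Only $d_1$ and $d_2$ share a term, so
$$
\delta_J(d_1,d_2) = 1 - \tfrac{1}{3} = \tfrac{2}{3}, \qquad \delta_J(d_1,d_3) = \delta_J(d_2,d_3) = 1,
$$
$$
D_J(D) = \frac{1}{\binom{3}{2}} \Bigl(\tfrac{2}{3} + 1 + 1\Bigr) = \tfrac{8}{9} \approx 0.889.
$$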


In [None]:
import math
from itertools import combinations
import pandas as pd

def compute_corpus_lexical_entropy(corpus):
    """
    Computes Corpus Lexical Entropy H(V_D) for a given corpus.
    corpus: list of strings (documents)
    """
    docs = [set(doc.lower().split()) for doc in corpus]
    N = len(docs)
    vocab = set().union(*docs)

    def entropy(p):
        if p == 0 or p == 1:
            return 0.0
        return p * math.log2(1/p) + (1 - p) * math.log2(1/(1 - p))

    H = 0.0
    for term in vocab:
        p_t = sum(1 for d in docs if term in d) / N
        H += entropy(p_t)
    return H

def compute_average_jaccard(corpus):
    """
    Computes average pairwise Jaccard distance D_J(D) for a given corpus.
    corpus: list of strings (documents)
    """
    docs = [set(doc.lower().split()) for doc in corpus]
    pairs = list(combinations(docs, 2))

    def jaccard_distance(a, b):
        inter = len(a & b)
        union = len(a | b)
        return 1 - inter / union if union > 0 else 0

    distances = [jaccard_distance(d1, d2) for d1, d2 in pairs]
    return sum(distances) / len(distances) if distances else 0

In [None]:
def run_experiment(corpora):
    results = []
    for name, corpus in corpora.items():
        H = compute_corpus_lexical_entropy(corpus)
        DJ = compute_average_jaccard(corpus)
        results.append({
            "Corpus": name,
            "H(V_D)": round(H, 4),
            "D_J(D)": round(DJ, 4)
        })

    df = pd.DataFrame(results).set_index("Corpus")
    return df

# Toy examples

In [None]:
V = {"a", "b", "c", "d", "e"}

# D1 just uses one term per document
D1 = {
    "a",
    "b",
    "c",
    "d",
    "e",
} # Expected value of the metric: maximal

# D2 has no two documents using the same terms
D2 = {
    "a b",
    "c",
    "d e",
} # Expected value of the metric: maximal

# D3 adds to D2 a new document, adding no new terms
D3 = {
    "a b",
    "c",
    "d e",
    "b c e", # we expect a lower diversity than D2, because it's repeating terms.
}

# D4 adds to D2 a new document, with a new term
D4 = {
    "a b",
    "c",
    "d e",
    "f",
}

# D5 splits the vocabulary in two parts, reaching the theoretical maximum for H(Vd)
D5 = {
    "a d e",
    "b c"
}


# In D6, each term {a, b, c, d, e} appears in exactly half of the documents.
D6 = {
    "a b",
    "a c e",
    "b d e",
    "c d"
}

run_experiment({'D1' : D1, "D2" : D2, "D3": D3, "D4" : D4, "D5": D5, "D6": D6})

| Corpus | H(V_D) | D_J(D) |
|--------|--------|--------|
| D1     | 3.6096 | 1.0    |
| D2     | 4.5915 | 1.0    |
| D3     | 4.6226 | 0.8611 |
| D4     | 4.8677 | 1.0    |
| D5     | 5.0    | 1.0    |
| D6     | 5.0    | 0.8    |


What do we mean by "diversity"? To me, diversity in this context should be more *document-oriented*: given a corpus $D$ and two documents picked at random, the fewer terms they have in common, the higher the diversity.

$H(V_D)$ focuses instead on how terms are distributed across the corpus. That can be acceptable, but it may not be optimal for our goal.

At this stage, I would prefer $D_J(D)$:
- In both D1 and D2, any pair of documents has an empty intersection. Therefore, the diversity should be maximal for both -> **$D_J(D)$ captures this, $H(V_D)$ does not.**
- D1 and D5 should both be maximal, because their documents are pairwise disjoint (as for D1 and D2) -> **$D_J(D)$ captures this, $H(V_D)$ does not.**
- In D6, the first and last documents are disjoint, but every other pair shares at least one term. This should be reflected by a lower level of diversity -> **$D_J(D)$ captures this, $H(V_D)$ does not.**
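
The claim about D6 can be checked pair by pair. The sketch below lists the Jaccard distance of every pair; it assumes the documents are kept in a list (rather than a set) so that "first" and "last" are well defined, and `jaccard_distance` and `dists` are names introduced here for illustration.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Jaccard distance between two term sets (0.0 if both are empty)."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

# D6 as an ordered list, so document positions are unambiguous.
D6 = ["a b", "a c e", "b d e", "c d"]
docs = [set(d.split()) for d in D6]

# Distance for every unordered pair (i, j) with i < j.
dists = {
    (i, j): jaccard_distance(docs[i], docs[j])
    for i, j in combinations(range(len(docs)), 2)
}
for (i, j), dist in dists.items():
    print(f"d{i + 1} vs d{j + 1}: {dist:.2f}")
print("mean:", round(sum(dists.values()) / len(dists), 4))
```

Only the pair (d1, d4) reaches distance 1; the remaining five pairs each share one term, which pulls the mean down to 0.8.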

# Test on real datasets

In [1]:
! pip install datasets -U -q

In [2]:
# Function to sample and preprocess a corpus
from datasets import load_dataset
import random
def preprocess_corpus(corpus, num_samples, seed=42):
    random.seed(seed)
    sampled_docs = random.sample(corpus, min(num_samples, len(corpus)))
    return [doc.strip() for doc in sampled_docs if isinstance(doc, str) and len(doc.split()) > 0]

common_crawl = load_dataset("cc_news", split="train")
stackexchange_law = load_dataset("ymoslem/Law-StackExchange", split = "train")
pubmed_ul = load_dataset("qiaojin/PubMedQA", "pqa_unlabeled",split = "train")


In [3]:
# Mount Drive
from google.colab import drive
import os
drive.mount('/content/drive', force_remount = True)
project_path= '/content/drive/MyDrive/P10- RAG-GAS/'
os.chdir(project_path)

Mounted at /content/drive


In [4]:
import pandas as pd
data = pd.read_csv('data_eng.csv')

In [6]:
corpora_downsampled = {
    "CC-News": preprocess_corpus(common_crawl['text'],data.shape[0]),
    "PMED-A": preprocess_corpus(pubmed_ul['long_answer'],data.shape[0]),
    "PMED-Q": preprocess_corpus(pubmed_ul['question'],data.shape[0]),
    "CoRe" : data['Summary'].tolist()
}

corpora = {
    "CC-News": common_crawl['text'],
    "PMED-A": pubmed_ul['long_answer'],
    "PMED-Q": pubmed_ul['question'],
    "CoRe" : data['Summary'].tolist()
}

In [7]:
! pip install scipy -U -q

In [8]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
import sys

def compute_H_VD(X_bin):
    N = X_bin.shape[0]
    df = np.asarray(X_bin.sum(axis=0)).flatten()
    p_t = np.clip(df / N, 1e-12, 1 - 1e-12)
    entropy_terms = -p_t * np.log2(p_t) - (1 - p_t) * np.log2(1 - p_t)
    hvd = entropy_terms.sum()
    return hvd

def compute_D_j(X_bin, feature_names, doc_idx1, doc_idx2, out=sys.stdout):
    G = X_bin @ X_bin.T  # binary dot product: G[i, j] = |d_i ∩ d_j|
    diag = G.diagonal()

    # Active terms of the two documents selected for the debug report
    doc1_term_indices = X_bin[doc_idx1].nonzero()[1]
    doc2_term_indices = X_bin[doc_idx2].nonzero()[1]
    terms_doc1 = set(feature_names[doc1_term_indices])
    terms_doc2 = set(feature_names[doc2_term_indices])

    inter = terms_doc1 & terms_doc2
    union = terms_doc1 | terms_doc2

    dot = G[doc_idx1, doc_idx2]
    union_size = diag[doc_idx1] + diag[doc_idx2] - dot
    sim = dot / union_size
    dist = 1 - sim

    print(f"\n[DEBUG] Documents {doc_idx1} and {doc_idx2}", file=out)
    print(f"Active terms doc {doc_idx1} ({len(terms_doc1)}): {terms_doc1}", file=out)
    print(f"Active terms doc {doc_idx2} ({len(terms_doc2)}): {terms_doc2}", file=out)
    print(f"Intersection ({len(inter)}): {inter}", file=out)
    print(f"Union ({len(union)}): {union}", file=out)
    print(f"Dot product (|A ∩ B|): {dot}", file=out)
    print(f"Union (|A ∪ B|): {union_size}", file=out)
    print(f"Jaccard similarity: {sim:.4f}", file=out)
    print(f"Jaccard distance: {dist:.4f}", file=out)

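    # Mean over the pairs stored in the sparse product G.
    # Note: pairs with no shared terms have no stored entry in G,
    # so those pairs (distance 1) are not included in this mean.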
    G = G.tocoo()
    mask = G.row < G.col
    row = G.row[mask]
    col = G.col[mask]
    data = G.data[mask]

    union_all = diag[row] + diag[col] - data
    sim_all = data / union_all
    dist_all = 1 - sim_all
    mean_dist = dist_all.mean()

    return mean_dist

def evaluate_corpora(corpora_dict):
    results = []
    vectorizer = TfidfVectorizer(binary=True, use_idf=False, norm=None, lowercase=True)

    with open("H(Vd)_vs_Dj.txt", "w", encoding="utf-8") as f:
        for name, corpus in corpora_dict.items():
            print("------------------------------------------")
            print(f"\nEvaluating corpus: {name}")
            print("------------------------------------------", file=f)
            print(f"Evaluating corpus: {name}", file=f)

            X = vectorizer.fit_transform(corpus)
            feature_names = vectorizer.get_feature_names_out()
            X_bin = X.astype(bool).astype(int)

            D = compute_D_j(X_bin, feature_names, 0, 1, out=f)
            H = compute_H_VD(X_bin)

            results.append({"Corpus": name, "H(V_D)": H, "D_J(D)": D})

        df = pd.DataFrame(results)
        print("\nSummary:\n")
        print(df)
        print("\nSummary:\n", file=f)
        print(df.to_string(index=False), file=f)

    return df
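
A sparse product such as `X_bin @ X_bin.T` stores entries only for document pairs that share at least one term, so a mean taken over its stored entries leaves out the disjoint pairs, each of which has distance 1. The sketch below shows one way to average over all $\binom{N}{2}$ pairs, as in the definition of $D_J(D)$. It is written in plain Python with an inverted index standing in for the sparse product, and `mean_jaccard_distance_all_pairs` is a hypothetical helper, not a function of this notebook.

```python
from collections import defaultdict
from itertools import combinations

def mean_jaccard_distance_all_pairs(docs):
    """Mean Jaccard distance over all N*(N-1)/2 pairs of term sets.

    Only pairs sharing a term are enumerated (like the stored entries
    of a sparse Gram matrix), but the mean is normalised by the total
    number of pairs, so disjoint pairs count with distance 1.
    """
    n = len(docs)
    n_pairs = n * (n - 1) // 2
    if n_pairs == 0:
        return 0.0

    # Inverted index: term -> indices of the documents containing it.
    postings = defaultdict(list)
    for i, terms in enumerate(docs):
        for t in terms:
            postings[t].append(i)

    # Intersection sizes for pairs (i < j) sharing at least one term.
    inter = defaultdict(int)
    for idxs in postings.values():
        for i, j in combinations(idxs, 2):
            inter[(i, j)] += 1

    # |A ∪ B| = |A| + |B| - |A ∩ B| for each co-occurring pair.
    sim_sum = sum(
        c / (len(docs[i]) + len(docs[j]) - c) for (i, j), c in inter.items()
    )
    # Pairs absent from `inter` have similarity 0, i.e. distance 1.
    return 1.0 - sim_sum / n_pairs

docs = [{"a", "b"}, {"c"}, {"d", "e"}, {"b", "c", "e"}]
print(round(mean_jaccard_distance_all_pairs(docs), 4))
```

For this four-document example the function returns 0.8611, whereas averaging over only the three co-occurring pairs would give 0.7222. On corpora where most pairs share some common word the gap may be small, but that is worth checking rather than assuming.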


In [9]:
# Test
evaluate_corpora({'test': corpora['CoRe']})

------------------------------------------

Evaluating corpus: test

Summary:

  Corpus      H(V_D)    D_J(D)
0   test  136.681478  0.656567




In [10]:
# Downsampled corpora
evaluate_corpora(corpora_downsampled)

------------------------------------------

Evaluating corpus: CC-News
------------------------------------------

Evaluating corpus: PMED-A
------------------------------------------

Evaluating corpus: PMED-Q
------------------------------------------

Evaluating corpus: CoRe

Summary:

    Corpus       H(V_D)    D_J(D)
0  CC-News  1233.695110  0.932686
1   PMED-A   238.091729  0.929374
2   PMED-Q   106.771756  0.939078
3     CoRe   136.681478  0.656567


