# Redundancy in Government Documents

## Abstract

E-government initiatives offer unprecedented transparency into government proceedings. But parsing and gleaning meaning from a large corpus is often intractable for the average citizen. Furthermore, the use of "administrative jargon" creates a barrier to understanding and parsing by the general public. Exploiting redundancy in these texts may facilitate summarization, insights and visualization of these texts.

In what follows, a simplifying assumption is made that measures of similarity can be used to signal redundancy, where redundancy is treated as a stricter form of similarity. For example, "my dog ate my homework" and "my assignment was eaten by a canine" are both similar and redundant. By contrast, "my dog ate my homework" is similar in one sense (every one of its words appears there) to "My homework is to write about what my dog ate", but the two sentences are not redundant.

Given this relationship between similarity and redundancy, a threshold for redundancy may be gleaned by having human experts evaluate similarity measures. In other words, the threshold at which a given similarity measure signals redundancy is a heuristic to be determined.
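As a minimal sketch of this heuristic, the snippet below flags sentence pairs whose similarity score meets a threshold. The token-set Jaccard measure, the function names and the default threshold of 0.8 are all illustrative assumptions, not the measure or threshold used in this project.

```python
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a stand-in for any similarity measure."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def redundant_pairs(
    sentences: List[str],
    similarity: Callable[[str, str], float] = jaccard,
    threshold: float = 0.8,  # hypothetical value; to be set by human experts
) -> List[Tuple[int, int, float]]:
    """Return (i, j, score) for every pair whose score meets the threshold."""
    pairs = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            score = similarity(sentences[i], sentences[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs
```

Note that a surface measure such as this one gives "my dog ate my homework" and "my assignment was eaten by a canine" a low score, so the choice of similarity measure matters as much as the threshold.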

## Data

Two sources of data were used for this project:

1. Minutes of the Federal Open Market Committee (FOMC)
2. Barcelona's Municipal Gazette

For both data sources, the available data was subset by a time window to make computation tractable (years in the case of FOMC minutes and months for Gazette data). Corpora were composed of all documents available in the time window.
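The windowing step can be sketched as follows, assuming each document is available as a hypothetical `(publication_date, text)` pair. The function name and record layout are illustrative and not taken from the project code.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple


def group_by_period(
    documents: Iterable[Tuple[date, str]], by_month: bool = False
) -> Dict[tuple, List[str]]:
    """Group document texts into corpora keyed by (year,) or (year, month)."""
    corpora = defaultdict(list)
    for published, text in documents:
        key = (published.year, published.month) if by_month else (published.year,)
        corpora[key].append(text)
    return dict(corpora)
```

Under these assumptions, FOMC minutes would be grouped with `by_month=False` and Gazette sub-documents with `by_month=True`.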

#### FOMC data

[Source](http://stanford.edu/~rezab/useful/fomc_minutes.html)

Minutes of FOMC meetings are available in plain text from 1967 to 2008. Each year there are about 8 reports released.

The period composing a corpus for FOMC data in this project was a year, so each corpus comprised about 8 documents.

#### Barcelona's Municipal Gazette

[Source](https://w33.bcn.cat/GasetaMunicipal/Inici?lang=EN)

PDFs of the Municipal Gazette are available for the years 2000 to 2016. The number of publications per year varies, but 560 in total are available. These 560 PDF documents were parsed into 13,116 sub-documents.

Sub-documents are extracted from PDF documents in 2 phases: First, raw text is extracted programmatically. Second, a program defined by a native Catalan speaker using heuristics parses out sub-documents composing the main bodies of text from the original.

The PDF documents included long tables of contents and data tables. In the second phase, these non-textual objects are omitted, and the resulting sub-documents comprise the main sections of the original document body, such as "Acords" (i.e. "agreements") and various commissions' reports.

The period composing a corpus for this project was a month-year, selected such that again the size of the corpus was computationally manageable.

## Question

### Is redundancy a tool for summarization?

E-government initiatives offer unprecedented transparency into government proceedings. But often their size and use of "administrative jargon" create a barrier to understanding and parsing by the general public. Exploiting redundancy in these texts may help these initiatives provide insight via summarization and visualization of the texts.

Citizens should have transparency in government, but transparency is meaningless when it manifests only as indigestible amounts of text. How is relevant information stored in these corpora?

This project explores the hypothesis that repetition conveys information about how the body that produced a corpus operates (e.g. named entity relationships) and/or conveys summary information.

The literature treats redundancy and similarity as theoretically valuable sources of information about a corpus, which is the motivation for evaluating redundancy in the current effort. The following excerpts illustrate this view and are included as further inspiration:

> Sentence similarity is considered the basis of many natural language tasks such as information retrieval, question answering and text summarization. ([A Comprehensive Comparative Study of Word and Sentence Similarity Measures](https://www.researchgate.net/publication/294873785_A_Comprehensive_Comparative_Study_of_Word_and_Sentence_Similarity_Measures))


> ...redundancy can be exploited to identify important and accurate information for applications such as summarization and question answering ([Sentence Fusion for Multidocument News Summarization](http://www.mitpressjournals.org/doi/pdf/10.1162/089120105774321091)).


> ...most applications based on Twitter share the goal of providing tweets that are both informative and diverse... to keep a high level of diversity, redundant tweets should be removed from the set of tweets displayed to the user ([Linguistic Redundancy in Twitter](http://www.aclweb.org/anthology/D11-1061.pdf)).


> ...from a computational linguistic point of view, the high redundancy in micro-blogs gives the unprecedented opportunity to study classical tasks ... on very large corpora characterized by an original and emerging linguistic style, pervaded with ungrammatical and colloquial expressions, abbreviations, and new linguistic forms  ([Linguistic Redundancy in Twitter](http://www.aclweb.org/anthology/D11-1061.pdf)).


> O'Shea et al. applied text similarity in Conversational Agents, which are computer programs that interact with humans through natural language dialogue ([Text Similarity using Google Tri-grams](https://web.cs.dal.ca/~eem/cvWeb/pubs/2012-Aminul-CAI.pdf)).

## Current Methods

To start the task of analyzing redundancy, a review of existing methods informed the path forward.

### Methods to Measure Similarity and Redundancy

In the existing literature, measurements of similarity have been separated into **corpus-**, **knowledge-**, and **hybrid-based** methods. Hybrid methods are excluded from the current review.

The practical difference between corpus- and knowledge-based methods is that corpus-based methods depend on word frequencies from a specific corpus. A restriction on corpus-based methods is that they are quite domain-dependent and often do not generalize outside of a given corpus. This could pose a problem for the current effort if redundancy had to be measured across e-government initiatives, but that is not a current requirement, so the limitation is acceptable.

All methods listed below are included given their pertinence to the current problem, with the exception of methods listed in **Of Interest**.

#### Corpus-Based Word Similarity

The bag-of-words (BOW) method is often used as a baseline measurement of similarity between documents. Given a document-term matrix, taking the dot product or the cosine similarity (the dot product normalized by the vectors' lengths) between two rows (i.e. documents) gives a BOW-based similarity score for those two documents. The same method can be applied to the tf-idf version of this matrix.
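A minimal sketch of this baseline, using only the standard library, is shown below. Whitespace tokenization and the `log(N / df)` idf weighting are simplifying assumptions (tf-idf has several common variants), and the function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Mapping


def bow_vector(text: str) -> Counter:
    """Sparse bag-of-words vector: term -> count (whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine_similarity(u: Mapping[str, float], v: Mapping[str, float]) -> float:
    """Dot product of two sparse vectors divided by the product of their norms."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Weight each term count by log(N / document frequency)."""
    bows = [bow_vector(d) for d in docs]
    n_docs = len(bows)
    # Iterating a Counter yields each distinct term once per document.
    doc_freq = Counter(term for bow in bows for term in bow)
    return [
        {t: count * math.log(n_docs / doc_freq[t]) for t, count in bow.items()}
        for bow in bows
    ]
```

With this weighting, a term that appears in every document receives weight zero, so tf-idf scores emphasize terms that distinguish documents within the corpus.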

[Latent Semantic Analysis][Latent Semantic Analysis] (LSA) measures the similarity between words using a matrix of word counts per document (i.e. words x documents, the transpose of the document-term matrix), typically reduced in rank by a singular value decomposition, and computing the cosine similarity between two rows. This within-corpus word similarity measure will enrich the document comparisons made by the sentence-based similarity methods described in what follows.
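The row-wise comparison can be sketched as below. This is only the raw word-by-document cosine step: it omits the singular value decomposition that LSA usually applies before comparing rows, and the function names and whitespace tokenization are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def word_document_matrix(docs: List[str]) -> Dict[str, List[int]]:
    """Map each word to its row of counts, one entry per document."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocabulary = sorted(set().union(*counts))
    return {word: [c[word] for c in counts] for word in vocabulary}


def cosine(u: List[int], v: List[int]) -> float:
    """Cosine similarity between two dense count rows."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_similarity(matrix: Dict[str, List[int]], w1: str, w2: str) -> float:
    """Words that occur in the same documents get a score close to 1."""
    return cosine(matrix[w1], matrix[w2])
```

Two words score highly here only when they occur in the same documents; the rank reduction in full LSA is what lets words with similar but non-identical distributions also score as related.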

#### Knowledge-Based Word Similarity

WordNet bag-of-words (WBOW) is a "knowledge-based" version of Latent Semantic Analysis and is frequently used to enrich measurements of similarity in texts. WordNets are human-generated lexicons and thus do not require the pre-computation or corpus-dependency of LSA. WordNets are popular but may be limited in depth. It will be interesting to see what is available for Catalan.

#### Knowledge-Based Document Similarity

Knowledge-based document similarity measures listed in [Atoum](https://www.researchgate.net/publication/294873785_A_Comprehensive_Comparative_Study_of_Word_and_Sentence_Similarity_Measures) use a knowledge-based measurement of word similarity within a document and some quantification of document structure. Examples include measurements composed of [WBOW plus part-of-speech (POS) tree kernels](http://ieeexplore.ieee.org/xpl/) or [POS tags](http://www.sciencedirect.com/science/article/pii/S0957417410011875?np=y). Others are listed in [Atoum Section 2.2.2](https://www.researchgate.net/publication/294873785_A_Comprehensive_Comparative_Study_of_Word_and_Sentence_Similarity_Measures). These methods demonstrated poor results or were not evaluated in [Atoum](https://www.researchgate.net/publication/294873785_A_Comprehensive_Comparative_Study_of_Word_and_Sentence_Similarity_Measures), and some of the more attractive versions are not available for review. For these reasons, the focus will be on corpus-based methods (and possibly hybrid-based methods later on).

#### Corpus-Based Document Similarity

Corpus-based measures of document similarity rely on string similarity, string edit distance and word order. Common methods include [edit distance](https://en.wikipedia.org/wiki/Edit_distance) and [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm). These methods will be used as baselines for more advanced methods, but may also provide valuable insights.
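Both baselines can be sketched with short dynamic programs. The version below computes Levenshtein edit distance and the best Smith-Waterman local alignment score over any two sequences (characters or tokens). The scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the linear gap penalty are illustrative assumptions, not settings from this project.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def smith_waterman_score(
    a: Sequence, b: Sequence, match: int = 2, mismatch: int = -1, gap: int = -1
) -> int:
    """Score of the best local alignment, using a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            # Flooring at 0 lets an alignment restart anywhere (local alignment).
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

Passing token lists (e.g. `sentence.split()`) rather than raw strings makes both functions operate at the word level, which may suit the comparison of sentences better than character-level alignment.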

In [Atoum](https://www.researchgate.net/publication/294873785_A_Comprehensive_Comparative_Study_of_Word_and_Sentence_Similarity_Measures), the highest-performing method was the [Google Tri-Gram](https://web.cs.dal.ca/~eem/cvWeb/pubs/2012-Aminul-CAI.pdf) approach. This approach calculates a word-similarity metric using tri-grams and then uses it in a subsequent text similarity metric. In essence, this metric evaluates the similarity of words `w_a` and `w_b` by measuring the frequency of tri-gram instances containing `w_a` and `w_b` in positions 1 and 3 of the tri-gram.
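The counting step at the heart of this idea can be sketched as follows. This is a rough illustration only: it counts tri-grams in a local token list rather than in the Google n-gram data, it counts both orderings of the word pair (an assumption), and it does not reproduce the normalization that the cited paper applies to turn counts into a similarity score. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def trigram_counts(tokens: List[str]) -> Counter:
    """Count every run of three consecutive tokens."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


def bracketing_trigram_frequency(counts: Counter, w_a: str, w_b: str) -> int:
    """Total count of tri-grams with w_a and w_b in positions 1 and 3.

    Both orderings (w_a ... w_b and w_b ... w_a) are counted; this is an
    assumption of the sketch.
    """
    wanted = {(w_a, w_b), (w_b, w_a)}
    return sum(
        n for (first, _, third), n in counts.items() if (first, third) in wanted
    )
```

A higher frequency suggests the two words tend to appear in close proximity around a shared middle word, which is the signal the word-similarity metric builds on.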

Tree-based measurements leverage a tree data-structure representation of a document (i.e. sentence). In [Linguistic Redundancy in Twitter](http://www.aclweb.org/anthology/D11-1061.pdf) the most successful formulation was a combination metric using WBOW and the Syntactic First-Order Rule Content Model (FOR). The FOR feature space introduced by [Zanzotto and Moschitti](http://www.aclweb.org/anthology/P06-1051) constructs features as pairs of syntactic tree fragments augmented with variables, which are evaluated for similarity.

[Simfinder](http://www.mitpressjournals.org/doi/pdf/10.1162/089120105774321091) also uses sentence syntax trees to compute sentence similarity, without requiring their complete alignment.